I have tried polars for a couple of weeks and I am one of those weird guys who likes pandas syntax more than SQL. Honestly, this is one of those "I will just wait until it (pandas) gets better" things. Anyone who uses pandas and SQL extensively knows that whatever question you might have, someone already has an answer for you. On the other hand, Polars is new, and the way the Polars community pushes it as the better syntax for wrangling data just doesn't feel right to me. I am not smart enough to keep three different wrangling/query syntaxes in my brain.
I am hopeful about duckdb, mostly because of how friendly the people behind the project are. But honestly they really need to improve their CSV reader. The data type recognition for auto-reading CSVs needs work. The duckdb people know that CSV reading is one reason polars has an edge on them, and I know they are working on it.
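For now I work around it by casting the mis-detected columns by hand after the auto read. Something like this, a rough sketch with a recent duckdb Python package (the file and column names are made up):

import duckdb

# Let read_csv_auto infer what it can, then fix up the columns it gets wrong
df = duckdb.sql("""
    SELECT
        CAST(plant_id   AS INTEGER)   AS plant_id,
        CAST(reading_ts AS TIMESTAMP) AS reading_ts,
        CAST(value      AS DOUBLE)    AS value
    FROM read_csv_auto('readings.csv')
""").df()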
But at the end of all that, I will just wait for Pandas to get faster and better.
Hear hear, I agree. In the most recent versions we have seen speed increases. The fact that polars exists shows that there is a ton of low hanging fruit.
There is also dask to increase parallelism and performance which I’ve used on some massive datasets 200GB+
Isn't the low hanging fruit that polars picked: "how about lazy evaluation to allow the query to be optimised?" ... which is mostly anathema to the design of pandas?
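Roughly, the lazy API looks like this (a sketch only; the file and column names are made up, and some method names have shifted between polars versions):

import polars as pl

# Nothing runs until .collect(), so polars can push the filter down and only
# read the columns the query actually needs.
lazy = (
    pl.scan_csv("plants.csv")
    .filter(pl.col("capacity_mw") > 100)
    .group_by("region")  # older releases call this .groupby
    .agg(pl.col("capacity_mw").sum())
)
print(lazy.explain())  # inspect the optimised plan
df = lazy.collect()    # execute it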
Pandas (the API) is also getting better at big data. I'm an advisor at a company, Ponder, that will take your Pandas code and execute it on "big data".
I do not fully get the speed argument. If the dataset is small'ish, it does not significantly matter (given you only use the vectorized functions, of course). And if it's big data, I do not use pandas but Spark/a cloud data warehouse solution. For this reason I also do not get the use case for duckdb.
Beyond this, a dataframe api/syntax, compared to SQL syntax, is to me way easier to follow and to debug.
The use case for DuckDB is that the vast majority of analytics aren't on "big data", but also aren't "small-ish". If your data fits on a single machine, DuckDB will let you query it at great speed. I disagree that the speed "doesn't significantly matter".
There is work being done on a dataframe API frontend for DuckDB, if you prefer that interface.
The way I see the speed improvements is this: fast processes run faster, moderate-to-long processes (ones that are still awkward and large in Python, but not enough to justify the shift to Spark) now run noticeably faster, and the threshold for “what I need to use Spark for” shifts a long way up the scale. That is, you can now do more with the same amount of compute.
This sounds like it’s spoken from someone who doesn’t understand a lick about sql. Pandas has one of the worst APIs I’ve ever seen. Truly a blight on the data processing landscape.
If you have a dataframe of power plant capacities and a dataframe of power plant capacity reductions how would you figure out the available capacities of all your power plants. In pandas it would be `capacities_df - reductions_df`. How would you do it in polars or sql, it’s not nearly as nice. Pandas has the benefit of allowing you to work with data in a relational or ndarray style. Rather than one or the other. The api does incur some bloat due to that, but that ability is very valuable.
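To make that concrete (toy numbers; pandas aligns on index and column labels automatically):

import pandas as pd

capacities_df = pd.DataFrame(
    {"unit_1": [100.0, 50.0], "unit_2": [80.0, 120.0]},
    index=["plant_a", "plant_b"],
)
reductions_df = pd.DataFrame(
    {"unit_1": [10.0, 0.0], "unit_2": [5.0, 20.0]},
    index=["plant_a", "plant_b"],
)

# Element-wise subtraction, aligned by row and column labels
available_df = capacities_df - reductions_df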
One thing that high-performance polars/arrow enables is to build ETL without data sharding because in most cases, all your data fits into one node. Jupyter and local scripts become sufficient and you avoid a lot of complexity.
We hesitated a lot before mentioning ETL in the OSS project windmill [1], but since we support any Python including polars, pass data between steps through shared folders, and polars is blazingly fast, it is a surprisingly good system for ETL until you actually need Spark/Materialize directly.
Vertical is the new horizontal, because of stuff like Epycs with 96 cores/192 threads, AWS instances with 24TB of RAM, M2 MacBook Pros with 96GB RAM, 12 cores, ludicrous SOC bandwidth.
What used to be 'big data' fits on an instance, you only need shards/clusters for 'ludicrous data'.
I tried to seriously use Polars in a side project and while I do like that the API surface is a little smaller, and cleaner, it's quite verbose. And I ran into a few bugs doing "basic" things that made me think it's not quite ready for prime time yet. I think with a bit more development and a more interactive-friendly API it would be awesome though.
I've noticed a trend lately (though I doubt it's new) of systems/frameworks/whatever coming out promising to unseat the existing offering by being simpler, more opinionated, etc. But it turns out all that complexity was because the underlying problem is actually complex, so as the new simpler solution gains adoption it feature creeps its way back to being the same as the old.
I feel like I'm witnessing the start of that here.
I agree in the general case, but I think here the problem is more that to get these types of packages to air-tight, production-grade status, you need a critical mass of users to basically field-test them and find all the edge cases and bugs. This is what I see polars struggling with right now. The issue I ran into was with the drop_nan function. A pretty common task. I was using it in a weird way, but if polars wants to succeed with its small-but-feature-complete API surface, it needs its verbs to be fully, reliably composable. No bugs allowed.
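For what it's worth, part of what tripped me up is that nulls and NaNs are different things in polars/Arrow. Roughly (a sketch; exact method names may vary by version):

import polars as pl

df = pl.DataFrame({"x": [1.0, float("nan"), None, 4.0]})

df.drop_nulls()  # drops the None row, keeps the NaN row

# turn NaNs into nulls first if you want both gone
df.with_columns(pl.col("x").fill_nan(None)).drop_nulls()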
The big thing pandas has going for it is that it's already been through this field testing. All the bugs have been ironed out by the hundreds of thousands of users. And the compute/memory efficiency is "good enough" for most use-cases, that those who need better efficiency can spend some time to optimize it. Sure the API is confusing and bad, but most people just kind of muddle through using stackoverflow answers from 2017 and it works well enough.
I've stanned for data.table (the R package) on HN before and I'll do it again now. It's a great example of a package that has been thoroughly field tested as well as having a simple, elegant API that allows you to do basically anything. So we know it's possible!
> The big thing pandas has going for it is that it's already been through this field testing. All the bugs have been ironed out by the hundreds of thousands of users.
At this very moment pandas github repo has 1563 open issues labeled with a bug tag [0]. So much for "all the bugs have been ironed out".
The benchmarks [1] on polars' website seem really promising, and I am going to try it in my next Jupyter notebook. But what I really wonder about is parsing time, which is not shown in the benchmarks.
In my previous project, I used pandas' read_table to avoid a for loop when parsing LAMMPS molecular dynamics trajectories. The trajectories are pure text files, a few gigs in size. Compared to a pure Python implementation (~3-5 min), pandas reduced it to 20 secs. I wonder how polars would do.
To be honest, compared to parsing, data manipulation like the benchmark showcases is just not that costly.
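For reference, this is roughly the comparison I have in mind (the file name, column layout and the exact polars keyword arguments are assumptions on my part, and may differ between versions):

import pandas as pd
import polars as pl

cols = ["id", "type", "x", "y", "z"]  # made-up column layout

# pandas: a regex separator copes with runs of whitespace
pd_df = pd.read_table("dump.lammpstrj", sep=r"\s+", names=cols)

# polars: read_csv expects a single-byte separator, so this assumes the dump is
# strictly single-space separated (the argument was called `sep` in older releases)
pl_df = pl.read_csv("dump.lammpstrj", separator=" ", has_header=False, new_columns=cols)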
How many cores you got?
Arrow has one thread serially chunking the data, and uses the rest of the cores to process the chunks in parallel. On large machines (and files, with fast io) the chunking becomes the bottleneck. Is this the same for polars?
I have got 12 cores. We partition the work over all threads. I think we are mostly IO bound as we don't use async, but we do require some pre-work before we can start reading.
You only need to avoid for loops because Python is so slow. The fantastic thing about Rust is that for the most part you can just use a for loop and it’ll be fast by default (there are still tricks to speed things up that you may need in some scenarios)
This is certainly a motivation, but there are other reasons to avoid loops. In many domains, array-at-a-time abstractions offer a higher-level view of the problem. That's why something like `xtensor` exists for a NumPy-like API in C++.
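A trivial illustration of the difference in NumPy:

import numpy as np

temps_c = np.array([21.5, 19.0, 23.2, 18.7])

# Loop version: spells out *how*, element by element
temps_f_loop = np.empty_like(temps_c)
for i in range(len(temps_c)):
    temps_f_loop[i] = temps_c[i] * 9 / 5 + 32

# Array-at-a-time version: states *what* you want, for the whole array
temps_f = temps_c * 9 / 5 + 32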
Author of polars here. That's incorrect. I have put a lot of effort in polars' csv parser for instance and it is one of the fastest csv parsers out there. This has nothing to do with leveraging arrow.
We control all our parsers and read directly into arrow memory. This differs from pandas, which utilizes pyarrow for reading parquet and then finally has to copy the arrow memory over to pandas memory.
Apologies! I got that from some blog post somewhere I believe, not from any personal experience or judgement (which I should have signalled better in my post).
Nonetheless, would reading a parquet file with polars be faster than reading a csv?
Also thanks for polars! A great contribution to data science!
> Apologies! I got that from some blog post somewhere I believe, not from any personal experience or judgement (which I should have signalled better in my post).
No worries. :) Most blogs on the topic I encounter in the wild make incorrect claims, I understand the confusion.
> Nonetheless, would reading a parquet file with polars be faster than reading a csv?
Yes, much faster. Please don't use the csv format for anything of a reasonable size. It is a terrible format to process and very ambiguous.
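If you already have the csv, a one-off conversion is usually enough. A quick sketch (file and column names made up):

import polars as pl

# Pay the csv parsing cost once...
pl.read_csv("events.csv").write_parquet("events.parquet")

# ...then later reads get real types, compression and column selection
df = pl.read_parquet("events.parquet", columns=["user_id", "ts"])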
The ergonomics of grouping and aggregation in R are really much better because libraries can make use of its non-standard evaluation[^0] (which in other cases also makes the language a nightmare to deal with).
Having migrated 1000s of lines of legacy R, all I can say is… yes, but then you have to use R. (:
R is not a replacement for pandas.
R is its own special little painful ecosystem, loved by people who don’t have to maintain the code they write.
You can complain all you like about pandas, but at the end of the day, it’s python. Python tooling works with it. The python ecosystem works with it.
It’s not without faults, but at least you’ll have a community of people to help when things go wrong.
R, not so much.
(Spoken as jaded developer who had to support r on databricks, which is deep in the hell of “well, it’s not really a tier one language” even from their support team)
> when you already have bad r code, it’s a massive pain in the ass.
Do you think it’s because R code tends to be written by statisticians and stats-adjacent domain experts who don’t necessarily know how to write clean code while Python code has at least some input from actual programmers? Or is this really down to the language itself?
? I'm not sure I can say more than I already did, but I'll try to be more specific:
The R community is categorically smaller than the python community. The support on community forums is harder to get, or non-existent (eg. with databricks).
Are you saying you've worked in places where it's easier to find people who are familiar with R to help work on a project than it is to find people who are familiar with python?
That you've found its easier to hire people who are familiar with R than it is to hire people who are familiar with python?
I... all I can say is that has not been my experience.
The places I've worked, of all the developers a small handful of people use R, and a small subset of those are good at it.
I don't hate R. I don't think it's a bad language. I'm saying: It's harder to support, because it's obscure, rarely used by most developers, and the people who use it and know it well are rare and expensive.
As a data engineer, expected to support workflows in production: Don't use obscure crap and expect other people to support it. Not R. Not rust. Not pony.
Using R on databricks, specifically, is a) unsupported^, and b) obscure and c) buggy. Don't do it.
(^ sorry, it's a 'tier 2 language' if you speak to DB representative, which means bugs don't count and new features don't get support)
All I can say, is that my experience has been that supporting python has been less painful; it's a simple known quantity, and its easy to scale up a team to fix projects if you need to.
Thanks for sharing. Seems your issues are more with databricks than R, but certainly R is more obscure.
At least in my experience we’ve never had issues with people learning it on the job and far fewer software issues from eg versioning, dependencies, regression bugs. It just works, there’s rarely even a need for a venv.
I’d never expect it applied as a general purpose language like python though, typical projects are <1k lines of some specific data task, perhaps our use cases are just different
I think the difference is working in teams or with other people's code. Bad python is usually fairly readable, there is a sense of "pythonic" code that the language pushes you towards. R is the complete opposite, there are 50 different ways to do every simple thing, coupled with R users generally not having much sense of good code practices.
Maybe I'm just jaded because I've inherited a 100K+ line R codebase at my job written by a single person with no unit tests and about 3 lines of comments, and it's a completely miserable experience.
It doesn't count `failure` — just the number of rows. But neither does the pandas version: `pd_df.groupby(['date'])['failure'].count()` and `pd_df.groupby(['date']).count()` are the same except the former returns a single `pd.Series` with the count and the latter produces a `pd.DataFrame` where each column has the same count (not super useful).
I believe `count` will only give you the number of non-null rows, so the numbers from the first command could differ by column if there were null values. You can also use the `size` command to get the total number of rows, and that will return a `pd.Series` with or without a column specifier.
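A small example of the difference (toy data):

import pandas as pd

pd_df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "failure": [1.0, None, 0.0],
})

pd_df.groupby(["date"])["failure"].count()  # non-null `failure` values: 1, 1
pd_df.groupby(["date"]).size()              # total rows per date:       2, 1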
Given these examples, I don't really see why anyone would replace Pandas with Polars for existing workloads. Memory's not an issue because they already bought the DIMMs (which have gotten insanely cheap). Personally I don't find the speed of Pandas to be that bad if you use it smartly (e.g., not parsing data more often than you have to).
Working with Actual Big Data (tm) will inevitably hit issues with both aspects in pandas, where polars can theoretically help.
Granted, the real correct solution for manipulating Actual Big Data (tm) is at the data warehouse level itself before it even hits Python but that's a topic for another day.
I've been centering Arrow in my data workflows, which has led to a system that uses Ray for distributed computation, Polars for single node data frame computation on Arrow data sets, and Ray Data for distributed data frames using Arrow. I love the speed and Polars' expressive API, and I love that I can use any other Arrow tool on the data without any serialization overhead.
That said, using Ray places limits on that, since it can distribute data across machines, breaking the Arrow model of in-memory data sharing between tools. I know that Ray Data is starting to use Polars internally, but I wish I could use the Polars API over distributed Ray Data sets so I could write code once and not worry about whether Ray has split it across machines.
I wish the article addressed integration with seaborn and other libraries. I’m sure Polars can do most of what I want in isolation, but I’m not sure how well it integrates with everything else.
This is the weird thing that's happening right now, and maybe it's not a good thing. Any pipelines that have been built from the dataframe onward (datavis libraries being the big ones) have been based on pandas (or rather its DataFrame API) as the de facto standard. Additionally, other libraries or wrappers have sought to replicate or be compatible with the pandas API.
But, the one universal complaint about pandas was always its API! Moving on from it, I'd say, is a substantial part of the motivation behind something like polars. So we're in some unpleasant "worse is better" territory here, and to me, it feels like python really missed the mark by too much. The pandas API is "fine", but really, we could do with better.
One option is to just treat the pandas DataFrame as the translation layer. If you need to dump tabular data from your library into a datavis library, then `plot(my_df.to_pandas())`.
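For example, with seaborn (toy data; `to_pandas()` just hands back a regular pandas DataFrame):

import polars as pl
import seaborn as sns

pl_df = pl.DataFrame({"day": [1, 2, 3], "load_mw": [90.0, 120.0, 105.0]})

# seaborn only needs a pandas frame at the boundary
sns.lineplot(data=pl_df.to_pandas(), x="day", y="load_mw")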
Hopefully that conversion overhead never becomes too much of a headache. If pandas gets full Arrow compatibility (last I checked, they were behind R and the tidyverse on this??), then it shouldn't really be an issue anymore and Wes McKinney can hopefully stop getting complaints about the pandas API!
I like that there is an alternative to Pandas, but unfortunately it uses the same dataframes paradigm. I know I'm in the minority, but I never liked dataframes. They are too "R"-like, don't fit into Python, and try to do too many things at once.
Columns have names, except when they don't. Things (rows? individual cells?) have types which are sometimes enforced and sometimes not. If you don't follow a recipe exactly, your dataframe will be subtly different and might not work. A lot of code mixes mutation in place and creating new dataframes, and so on. And pandas lends itself to a special chained-method syntax which is almost its own DSL in Python.
> Columns have names, except when they don't. Things (rows? individual cells?)
They are not a fit for every problem. Please don't try to use them for that. For normalized tabular data, it is a good fit.
> have types which are sometimes enforced and sometimes not. If you don't follow a recipe exactly, your dataframe will be subtly different and might not work.
Polars is really strict on types. A type is always determined by the input datatype and the operation applied. You can check the schema before you run the query. Polars will produce the same output datatype as defined by its logical plan.
> A lot of code mixes mutation in place and creating new dataframes, and so on
All polars methods are pure, i.e. they don't mutate the original.
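A small example of what that looks like in practice:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})
df2 = df.with_columns((pl.col("a") * 2).alias("a_doubled"))

print(df.columns)  # ['a'] -- the original frame is untouched
print(df2.schema)  # {'a': Int64, 'a_doubled': Int64} -- types known up front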
It's a bit sad that Pandas has become the default API for data manipulation in Python. I think it's less intuitive than any of the other APIs I've worked with; for example R's tidyverse, Julia's DataTable and even Mathematica's approach make more sense than Pandas in my opinion.
I found that my problem with Pandas was that it was quite un-Pythonic.
Almost everything required reading to grasp, plus creating several dummy tasks to figure out; nothing was intuitive, and in the end, while some of the functions helped, I am not entirely sure the overhead was worth it compared to researching alternative solutions.
I'm hoping that PRQL [0] will one day become the universal API for tabular/relational data. We're targeting SQL as the backend in the first iteration given its universality, but there are early plans to support other backends.
You can already use PRQL with Pandas, the tidyverse, shell and pretty much any database. See my presentation [1]. PRQL reads very similarly to dplyr, and in my (biased) opinion, actually a bit better than dplyr because it can do away with some of the punctuation due to being its own language.
For questions see our Discord [2] and if you would like to see PRQL in more places, file an issue on Github [3].
Pandas is just the worst. I mean, it's obviously powerful, but it's still the worst.
It suffers from the same problem as so many other python libraries which were clearly designed by scientists rather than software engineers. The design philosophy seems to be that having 100s of ways to do every single thing is "convenient". In reality, it makes almost every single function in the pandas library waaaaaaay too complicated. Nearly all of them have different "modes", and sometimes choosing one thing over another will absolutely cripple the performance of your application.
I hadn't heard about polars, but I'm currently using pandas in a part of our stack, and since it's not a lot of code yet, I might just be able to replace pandas. Thanks!
So in pandas you might get that warning when you’re trying to do an in place operation like so:
df.loc['2023-01-01': '2023-12-31'] *= 2
This doubles every row in 2023 in your frame.
In polars there are no mutating operations, everything is pure, so you won't see any warnings of that nature, but what you will have to do instead is something like this to solve the same problem:
from datetime import datetime
import polars as pl

# with_columns (plural) takes a list of expressions; since nothing is mutated
# in place, bind the result back to df
df = df.with_columns([
    pl.when(
        pl.col('date').is_between(datetime(2023, 1, 1), datetime(2023, 12, 31))
    )
    .then(pl.col(x) * 2)
    .otherwise(pl.col(x))
    .alias(x)
    for x in df.columns
    if x != 'date'
])
So there are definitely some trade-offs, but mutating operations are sometimes worth learning to use properly in order to avoid that pandas warning (or to know when to ignore it, because unfortunately there can be false positives).