I have tried polars for a couple of weeks and I am one of those weird guys who likes pandas syntax more than SQL. Honestly, this is one of those "I will just wait until it (pandas) gets better" things. Anyone who uses pandas and SQL extensively knows that whatever question you might have, someone already has an answer for you. On the other hand, Polars is new, and the way the Polars community pushes it as the better syntax for wrangling data just doesn't feel right to me. I am not smart enough to keep three different wrangling/query syntaxes in my brain.
I am hopeful about duckdb, mostly because of how friendly the people behind the project are. But honestly they really need to improve their CSV reader. The data type recognition for auto-reading CSVs needs work. The duckdb people know that CSV reading is one reason polars has an edge on them, and I know they are working on it.
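For now I work around it by casting the mis-detected columns by hand after the auto read. Something like this, a rough sketch with a recent duckdb Python package (the file and column names are made up):

import duckdb

# Let read_csv_auto infer what it can, then fix up the columns it gets wrong
df = duckdb.sql("""
    SELECT
        CAST(plant_id   AS INTEGER)   AS plant_id,
        CAST(reading_ts AS TIMESTAMP) AS reading_ts,
        CAST(value      AS DOUBLE)    AS value
    FROM read_csv_auto('readings.csv')
""").df()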
But at the end of all that, I will just wait for Pandas to get faster and better.
Hear hear, I agree. In the most recent versions we have seen speed increases. The fact that polars exists shows that there is a ton of low hanging fruit.
There is also dask to increase parallelism and performance which I’ve used on some massive datasets 200GB+
Isn't the low hanging fruit that polars picked: "how about lazy evaluation to allow the query to be optimised?" ... which is mostly anathema to the design of pandas?
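Roughly, the lazy API looks like this (a sketch only; the file and column names are made up, and some method names have shifted between polars versions):

import polars as pl

# Nothing runs until .collect(), so polars can push the filter down and only
# read the columns the query actually needs.
lazy = (
    pl.scan_csv("plants.csv")
    .filter(pl.col("capacity_mw") > 100)
    .group_by("region")  # older releases call this .groupby
    .agg(pl.col("capacity_mw").sum())
)
print(lazy.explain())  # inspect the optimised plan
df = lazy.collect()    # execute it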
Pandas (the API) is also getting better at big data. I'm an advisor at a company, Ponder, that will take your Pandas code and execute it on "big data".
I do not fully get the speed argument. If the dataset is small'ish, it does not significantly matter (given you only use the vectorized functions, of course). And if it's big data, I do not use pandas but Spark/a cloud data warehouse solution. For this reason I also do not get the use case for duckdb.
Beyond this, a dataframe api/syntax, compared to SQL syntax, is to me way easier to follow and to debug.
The use case for DuckDB is that the vast majority of analytics aren't on "big data", but also aren't "small-ish". If your data fits on a single machine, DuckDB will let you query it at great speed. I disagree that the speed "doesn't significantly matter".
There is work being done on a dataframe API frontend for DuckDB, if you prefer that interface.
The way I see the speed improvements is this: fast processes run faster, moderate-to-long processes (ones that are still awkward and large in Python, but not enough to justify the shift to Spark) now run noticeably faster, and the threshold for “what I need to use Spark for” shifts a long way up the scale. That is, you can now do more with the same amount of compute.
This sounds like it’s spoken from someone who doesn’t understand a lick about sql. Pandas has one of the worst APIs I’ve ever seen. Truly a blight on the data processing landscape.
If you have a dataframe of power plant capacities and a dataframe of power plant capacity reductions how would you figure out the available capacities of all your power plants. In pandas it would be `capacities_df - reductions_df`. How would you do it in polars or sql, it’s not nearly as nice. Pandas has the benefit of allowing you to work with data in a relational or ndarray style. Rather than one or the other. The api does incur some bloat due to that, but that ability is very valuable.
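To make that concrete (toy numbers; pandas aligns on index and column labels automatically):

import pandas as pd

capacities_df = pd.DataFrame(
    {"unit_1": [100.0, 50.0], "unit_2": [80.0, 120.0]},
    index=["plant_a", "plant_b"],
)
reductions_df = pd.DataFrame(
    {"unit_1": [10.0, 0.0], "unit_2": [5.0, 20.0]},
    index=["plant_a", "plant_b"],
)

# Element-wise subtraction, aligned by row and column labels
available_df = capacities_df - reductions_df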
One thing that high-performance polars/arrow enables is to build ETL without data sharding because in most cases, all your data fits into one node. Jupyter and local scripts become sufficient and you avoid a lot of complexity.
We hesitated a lot before mentioning ETL in the OSS project windmill [1], but since we support any Python including polars, pass data between steps through shared folders, and polars is blazingly fast, it is a surprisingly good system for ETL until you actually need Spark/Materialize directly.
Vertical is the new horizontal, because of stuff like Epycs with 96 cores/192 threads, AWS instances with 24TB of RAM, M2 MacBook Pros with 96GB RAM, 12 cores, ludicrous SOC bandwidth.
What used to be 'big data' fits on an instance, you only need shards/clusters for 'ludicrous data'.
I tried to seriously use Polars in a side project and while I do like that the API surface is a little smaller, and cleaner, it's quite verbose. And I ran into a few bugs doing "basic" things that made me think it's not quite ready for prime time yet. I think with a bit more development and a more interactive-friendly API it would be awesome though.
I've noticed a trend lately (though I doubt it's new) of systems/frameworks/whatever coming out promising to unseat the existing offering by being simpler, more opinionated, etc. But it turns out all that complexity was because the underlying problem is actually complex, so as the new simpler solution gains adoption it feature creeps its way back to being the same as the old.
I feel like I'm witnessing the start of that here.
I agree in the general case, but I think here the problem is more that to get these types of packages to air-tight, production-grade status, you need a critical mass of users to basically field-test them and find all the edge cases and bugs. This is what I see polars struggling with right now. The issue I ran into was with the drop_nan function. A pretty common task. I was using it in a weird way, but if polars wants to succeed with its small-but-feature-complete API surface, it needs its verbs to be fully, reliably composable. No bugs allowed.
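For what it's worth, part of what tripped me up is that nulls and NaNs are different things in polars/Arrow. Roughly (a sketch; exact method names may vary by version):

import polars as pl

df = pl.DataFrame({"x": [1.0, float("nan"), None, 4.0]})

df.drop_nulls()  # drops the None row, keeps the NaN row

# turn NaNs into nulls first if you want both gone
df.with_columns(pl.col("x").fill_nan(None)).drop_nulls()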
The big thing pandas has going for it is that it's already been through this field testing. All the bugs have been ironed out by the hundreds of thousands of users. And the compute/memory efficiency is "good enough" for most use-cases, that those who need better efficiency can spend some time to optimize it. Sure the API is confusing and bad, but most people just kind of muddle through using stackoverflow answers from 2017 and it works well enough.
I've stanned for data.table (the R package) on HN before and I'll do it again now. It's a great example of a package that has been thoroughly field tested as well as having a simple, elegant API that allows you to do basically anything. So we know it's possible!
> The big thing pandas has going for it is that it's already been through this field testing. All the bugs have been ironed out by the hundreds of thousands of users.
At this very moment pandas github repo has 1563 open issues labeled with a bug tag [0]. So much for "all the bugs have been ironed out".
The benchmarks [1] on polars' website seem really promising, and I am going to try it in my next Jupyter notebook. But what I really wonder about is parsing time, which is not shown in the benchmarks.
In my previous project, I used pandas' read_table to avoid a for loop when parsing LAMMPS molecular dynamics trajectories. The trajectories are pure text files, a few gigs in size. Compared to a pure Python implementation (~3-5 min), pandas reduced it to 20 secs. I wonder how polars would do.
To be honest, compared to parsing, data manipulation like the benchmark showcases is just not that costly.
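For reference, this is roughly the comparison I have in mind (the file name, column layout and the exact polars keyword arguments are assumptions on my part, and may differ between versions):

import pandas as pd
import polars as pl

cols = ["id", "type", "x", "y", "z"]  # made-up column layout

# pandas: a regex separator copes with runs of whitespace
pd_df = pd.read_table("dump.lammpstrj", sep=r"\s+", names=cols)

# polars: read_csv expects a single-byte separator, so this assumes the dump is
# strictly single-space separated (the argument was called `sep` in older releases)
pl_df = pl.read_csv("dump.lammpstrj", separator=" ", has_header=False, new_columns=cols)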
How many cores you got?
Arrow has one thread serially chunking the data, and uses the rest of the cores to process the chunks in parallel. On large machines (and files, with fast io) the chunking becomes the bottleneck. Is this the same for polars?
I have got 12 cores. We partition the work over all threads. I think we are mostly IO bound as we don't use async, but we do require some pre-work before we can start reading.
You only need to avoid for loops because Python is so slow. The fantastic thing about Rust is that for the most part you can just use a for loop and it’ll be fast by default (there are still tricks to speed things up that you may need in some scenarios)
This is certainly a motivation, but there are other reasons to avoid loops. In many domains, array-at-a-time abstractions offer a higher-level view of the problem. That's why something like `xtensor` exists for a NumPy-like API in C++.
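A trivial illustration of the difference in NumPy:

import numpy as np

temps_c = np.array([21.5, 19.0, 23.2, 18.7])

# Loop version: spells out *how*, element by element
temps_f_loop = np.empty_like(temps_c)
for i in range(len(temps_c)):
    temps_f_loop[i] = temps_c[i] * 9 / 5 + 32

# Array-at-a-time version: states *what* you want, for the whole array
temps_f = temps_c * 9 / 5 + 32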
Author of polars here. That's incorrect. I have put a lot of effort in polars' csv parser for instance and it is one of the fastest csv parsers out there. This has nothing to do with leveraging arrow.
We control all our parsers and read directly into arrow memory. This differs from pandas, which utilizes pyarrow for reading parquet and then finally has to copy the arrow memory over to pandas memory.
Apologies! I got that from some blog post somewhere I believe, not from any personal experience or judgement (which I should have signalled better in my post).
Nonetheless, would reading a parquet file with polars be faster than reading a csv?
Also thanks for polars! A great contribution to data science!
> Apologies! I got that from some blog post somewhere I believe, not from any personal experience or judgement (which I should have signalled better in my post).
No worries. :) Most blogs on the topic I encounter in the wild make incorrect claims, I understand the confusion.
> Nonetheless, would reading a parquet file with polars be faster than reading a csv?
Yes, much faster. Please don't use the csv format for anything of a reasonable size. It is a terrible format to process and very ambiguous.
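If you already have the csv, a one-off conversion is usually enough. A quick sketch (file and column names made up):

import polars as pl

# Pay the csv parsing cost once...
pl.read_csv("events.csv").write_parquet("events.parquet")

# ...then later reads get real types, compression and column selection
df = pl.read_parquet("events.parquet", columns=["user_id", "ts"])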
The ergonomics of grouping and aggregation in R are really much better because libraries can make use of its non-standard evaluation[^0] (which in other cases also makes the language a nightmare to deal with).
Having migrated 1000s of lines of legacy R, all I can say is… yes, but then you have to use R. (:
R is not a replacement for pandas.
R is its own special little painful ecosystem, loved by people who don’t have to maintain the code they write.
You can complain all you like about pandas, but at the end of the day, it’s python. Python tooling works with it. The python ecosystem works with it.
It’s not without faults, but at least you’ll have a community of people to help when things go wrong.
R, not so much.
(Spoken as jaded developer who had to support r on databricks, which is deep in the hell of “well, it’s not really a tier one language” even from their support team)
> when you already have bad r code, it’s a massive pain in the ass.
Do you think it’s because R code tends to be written by statisticians and stats-adjacent domain experts who don’t necessarily know how to write clean code while Python code has at least some input from actual programmers? Or is this really down to the language itself?
? I'm not sure I can say more than I already did, but I'll try to be more specific:
The R community is categorically smaller than the python community. The support on community forums is harder to get, or non-existent (eg. with databricks).
Are you saying you've worked in places where it's easier to find people who are familiar with R to help work on a project than it is to find people who are familiar with python?
That you've found its easier to hire people who are familiar with R than it is to hire people who are familiar with python?
I... all I can say is that has not been my experience.
The places I've worked, of all the developers a small handful of people use R, and a small subset of those are good at it.
I don't hate R. I don't think it's a bad language. I'm saying: It's harder to support, because it's obscure, rarely used by most developers, and the people who use it and know it well are rare and expensive.
As a data engineer, expected to support workflows in production: Don't use obscure crap and expect other people to support it. Not R. Not rust. Not pony.
Using R on databricks, specifically, is a) unsupported^, and b) obscure and c) buggy. Don't do it.
(^ sorry, it's a 'tier 2 language' if you speak to DB representative, which means bugs don't count and new features don't get support)
All I can say, is that my experience has been that supporting python has been less painful; it's a simple known quantity, and its easy to scale up a team to fix projects if you need to.
Thanks for sharing. Seems your issues are more with databricks than R, but certainly R is more obscure.
At least in my experience we’ve never had issues with people learning it on the job and far fewer software issues from eg versioning, dependencies, regression bugs. It just works, there’s rarely even a need for a venv.
I’d never expect it applied as a general purpose language like python though, typical projects are <1k lines of some specific data task, perhaps our use cases are just different
I think the difference is working in teams or with other people's code. Bad python is usually fairly readable, there is a sense of "pythonic" code that the language pushes you towards. R is the complete opposite, there are 50 different ways to do every simple thing, coupled with R users generally not having much sense of good code practices.
Maybe I'm just jaded because I've inherited a 100K+ line R codebase at my job written by a single person with no unit tests and about 3 lines of comments, and it's a completely miserable experience.
It doesn't count `failure` — just the number of rows. But neither does the pandas version: `pd_df.groupby(['date'])['failure'].count()` and `pd_df.groupby(['date']).count()` are the same except the former returns a single `pd.Series` with the count and the latter produces a `pd.DataFrame` where each column has the same count (not super useful).
I believe `count` will only give you the number of non-null rows, so the numbers from the first command could differ by column if there were null values. You can also use the `size` command to get the total number of rows, and that will return a `pd.Series` with or without a column specifier.
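A small example of the difference (toy data):

import pandas as pd

pd_df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "failure": [1.0, None, 0.0],
})

pd_df.groupby(["date"])["failure"].count()  # non-null `failure` values: 1, 1
pd_df.groupby(["date"]).size()              # total rows per date:       2, 1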
Given these examples, I don't really see why anyone would replace Pandas with Polars for existing workloads. Memory's not an issue because they already bought the DIMMs (which have gotten insanely cheap). Personally I don't find the speed of Pandas to be that bad if you use it smartly (e.g., not parsing data more often than you have to).
Working with Actual Big Data (tm) will inevitably hit issues with both aspects in pandas, where polars can theoretically help.
Granted, the real correct solution for manipulating Actual Big Data (tm) is at the data warehouse level itself before it even hits Python but that's a topic for another day.
I've been centering Arrow in my data workflows, which has led to a system that uses Ray for distributed computation, Polars for single node data frame computation on Arrow data sets, and Ray Data for distributed data frames using Arrow. I love the speed and Polars' expressive API, and I love that I can use any other Arrow tool on the data without any serialization overhead.
That said, using Ray places limits on that, since it can distribute data across machines, breaking the Arrow model of in-memory data sharing between tools. I know that Ray Data is starting to use Polars internally, but I wish I could use the Polars API over distributed Ray Data sets so I could write code once and not worry about whether Ray has split it across machines.
I wish the article addressed integration with seaborn and other libraries. I’m sure Polars can do most of what I want in isolation, but I’m not sure how well it integrates with everything else.
This is the weird thing that's happening right now, and maybe it's not a good thing. Any pipelines that have been built from the dataframe onward (datavis libraries being the big ones) have been based on pandas (or rather its DataFrame API) as the de facto standard. Additionally, other libraries or wrappers have sought to replicate or be compatible with the pandas API.
But, the one universal complaint about pandas was always its API! Moving on from it, I'd say, is a substantial part of the motivation behind something like polars. So we're in some unpleasant "worse is better" territory here, and to me, it feels like python really missed the mark by too much. The pandas API is "fine", but really, we could do with better.
One option is to just treat the pandas DataFrame as the translation layer. If you need to dump tabular data from your library into a datavis library, then `plot(my_df.to_pandas())`.
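For example, with seaborn (toy data; `to_pandas()` just hands back a regular pandas DataFrame):

import polars as pl
import seaborn as sns

pl_df = pl.DataFrame({"day": [1, 2, 3], "load_mw": [90.0, 120.0, 105.0]})

# seaborn only needs a pandas frame at the boundary
sns.lineplot(data=pl_df.to_pandas(), x="day", y="load_mw")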
Hopefully that conversion overhead never becomes too much of a headache. If pandas gets full Arrow compatibility (last I checked, they were behind R and the tidyverse on this??), then it shouldn't really be an issue anymore and Wes McKinney can hopefully stop getting complaints about the pandas API!
I like that there is an alternative to Pandas, but unfortunately it uses the same dataframes paradigm. I know I'm in the minority, but I never liked dataframes. They are too "R"-like, don't fit into Python, and try to do too many things at once.
Columns have names, except when they don't. Things (rows? individual cells?) have types which are sometimes enforced and sometimes not. If you don't follow a recipe exactly, your dataframe will be subtly different and might not work. A lot of code mixes mutation in place and creating new dataframes, and so on. And pandas lends itself to a special chained-method syntax which is almost its own DSL in Python.
> Columns have names, except when they don't. Things (rows? individual cells?)
They are not a fit for every problem. Please don't try to use them for that. For normalized tabular data, it is a good fit.
> have types which are sometimes enforced and sometimes not. If you don't follow a recipe exactly, your dataframe will be subtly different and might not work.
Polars is really strict on types. A type is always determined by the input datatype and the operation applied. You can check the schema before you run the query. Polars will produce the same output datatype as defined by its logical plan.
> A lot of code mixes mutation in place and creating new dataframes, and so on
All polars methods are pure, i.e. they don't mutate the original.
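A small example of what that looks like in practice:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})
df2 = df.with_columns((pl.col("a") * 2).alias("a_doubled"))

print(df.columns)  # ['a'] -- the original frame is untouched
print(df2.schema)  # {'a': Int64, 'a_doubled': Int64} -- types known up front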
It's a bit sad that Pandas has become the default API for data manipulation in Python. I think it's less intuitive than any of the other APIs I've worked with; for example R's tidyverse, Julia's DataTable and even Mathematica's approach make more sense than Pandas in my opinion.
I found that my problem with Pandas was that it was quite un-Pythonic.
Almost everything required reading to grasp, plus creating several dummy tasks to figure out; nothing was intuitive, and in the end, while some of the functions helped, I am not entirely sure the overhead was worth it compared to researching alternative solutions.
I'm hoping that PRQL [0] will one day become the universal API for tabular/relational data. We're targeting SQL as the backend in the first iteration given its universality, but there are early plans to support other backends.
You can already use PRQL with Pandas, the tidyverse, shell and pretty much any database. See my presentation [1]. PRQL reads very similarly to dplyr, and in my (biased) opinion, actually a bit better than dplyr because it can do away with some of the punctuation due to being its own language.
For questions see our Discord [2] and if you would like to see PRQL in more places, file an issue on Github [3].
Pandas is just the worst. I mean, it's obviously powerful, but it's still the worst.
It suffers from the same problem as so many other python libraries which were clearly designed by scientists rather than software engineers. The design philosophy seems to be that having 100s of ways to do every single thing is "convenient". In reality, it makes almost every single function in the pandas library waaaaaaay too complicated. Nearly all of them have different "modes", and sometimes choosing one thing over another will absolutely cripple the performance of your application.
I hadn't heard about polars, but I'm currently using pandas in a part of our stack, and since it's not a lot of code yet, I might just be able to replace pandas. Thanks!
So in pandas you might get that warning when you’re trying to do an in place operation like so:
df.loc['2023-01-01': '2023-12-31'] *= 2
This doubles every row in 2023 in your frame.
In polars there are no mutating operations, everything is pure, so you won't see any warnings of that nature, but what you will have to do instead is something like this to solve the same problem:
from datetime import datetime
import polars as pl

# with_columns (plural) takes a list of expressions; since nothing is mutated
# in place, bind the result back to df
df = df.with_columns([
    pl.when(
        pl.col('date').is_between(datetime(2023, 1, 1), datetime(2023, 12, 31))
    )
    .then(pl.col(x) * 2)
    .otherwise(pl.col(x))
    .alias(x)
    for x in df.columns
    if x != 'date'
])
So there are definitely some trade-offs, but mutating operations are sometimes worth learning to use properly in order to avoid that pandas warning (or to know when to ignore it, because unfortunately there can be false positives).