Python + Pandas + Jupyter Notebook/Lab


Python + Jupyter are fine, but pandas reads everything into memory at once, doesn't it? 100 MB is no problem, but bigger files could result in heavy swapping pressure.
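
That said, pandas can also stream a file in chunks instead of loading it whole. A minimal sketch, assuming a hypothetical big.csv with a numeric column "value":

    import pandas as pd

    # chunksize makes read_csv yield DataFrames of n rows at a time,
    # so memory stays bounded regardless of the file size on disk
    total = 0.0
    for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
        total += chunk["value"].sum()  # aggregate per chunk
    print(total)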


I definitely agree that with this amount of data, you should move to a more programmatic way of handling it... pandas or R.

Keep in mind that pandas (and probably R as well?) internally uses optimized structures based on NumPy. So a 10 GB CSV, depending on the content, might end up with a much smaller memory footprint inside pandas.
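
You can also measure and shrink the footprint explicitly. A sketch, assuming a hypothetical data.csv with an integer column "id" and a repetitive string column "label":

    import pandas as pd

    df = pd.read_csv("data.csv")
    print(df.memory_usage(deep=True).sum())  # bytes actually held in RAM

    # downcast integers and turn repetitive strings into categoricals;
    # both often push the in-memory size well below the on-disk CSV size
    df["id"] = pd.to_numeric(df["id"], downcast="integer")
    df["label"] = df["label"].astype("category")
    print(df.memory_usage(deep=True).sum())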

If you have a 10 GB CSV, I think you will be happy working with pandas locally, even on a laptop. If you get to CSV files of tens of GB, a cloud VM with a corresponding amount of memory might serve you well. If you need to handle big-data-scale CSVs (hundreds of GB or even >1 TB), a scalable parallel solution like Spark will be your thing. Before you scale up, however, check whether your task allows you to pre-filter the data and reduce the volume by orders of magnitude (a sketch of this below)... often, thinking the problem through reduces the amount of metal you need to throw at it...
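
As a sketch of that pre-filtering idea, assuming a hypothetical huge.csv where only rows with country == "DE" are relevant:

    import pandas as pd

    # stream the file in chunks, keep only the matching rows, and
    # concatenate the (much smaller) survivors for in-memory analysis
    filtered = pd.concat(
        chunk[chunk["country"] == "DE"]
        for chunk in pd.read_csv("huge.csv", chunksize=1_000_000)
    )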



