To measure performance, the author looked at latency, but most S3 workloads are throughput-oriented. The magic of S3 is that it's cheap because it's built on spinning HDDs, which are individually slow and unreliable, but when you have millions of them you can mask the tail and deliver multiple TB/s of throughput.
It's misleading to look at S3 as a CDN. It's fine for that, but its real strength is backing the world's data lakes and cloud data warehouses. Those workloads have a lot of data that's often cold, but S3 can deliver massive throughput when you need it. R2 can't do that, and as far as I can tell, isn't trying to.
Source: I used to work on S3
Yeah, I'd be interested in the bandwidth as well. Can R2 saturate 10/25/50 gigabit links? Can it do so with single requests, or if not, how many parallel requests does that require?
Cloudflare's paid DDoS protection product being able to soak up insane L3/4 DDoS attacks doesn't answer the question of whether R2 specifically, the Cloudflare product with free egress, is able to saturate a pipe.
Cloudflare has the network to do that, but they charge money to do so with their other offerings, so why would they give that to you for free? R2 is not a CDN.
That's unrelated to the performance of (for instance) the R2 storage layer. All the bandwidth in the world won't help you if you're blocked on storage. It isn't clear whether the overall performance of R2 is capable of saturating user bandwidth, or whether it'll be blocked on something.
S3 can't saturate user bandwidth unless you make many parallel requests. I'd be (pleasantly) surprised if R2 can.
I'm confused, I assumed we were talking about the network layer.
If we are talking about storage, well, SATA can't give you more than ~5 Gbps effective (SATA III is 6 Gb/s on the wire), so I guess the answer is no? But also no one else can do it, unless they're using super exotic HDD tech (hint: they're not, it's actually the opposite).
What a weird thing to argue about, btw; literally everybody is running a network layer on top of storage that lets you have much higher throughput. When one talks about R2/S3 throughput, no one (in my circle, ofc.) would think we are referring to the speed of their HDDs, lmao. But it's nice to see this, it's always amusing to stumble upon people with a wildly different point of view on things.
We're talking about the user-visible behavior. You argued that because Cloudflare's CDN has an obscene amount of bandwidth, R2 will be able to saturate user bandwidth; that doesn't follow, hence my counterpoint that it could be bottlenecked on storage rather than network. The question at hand is what performance R2 offers, and that hasn't been answered.
There are any number of ways they could implement R2 that would allow it to run at full wire speed, but S3 doesn't run at full wire speed by default (unless you make many parallel requests) and I'd be surprised if R2 does.
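To make "many parallel requests" concrete, here's a rough sketch of the standard trick: split one object into ranged GETs and fetch them concurrently. The URL, object size, and tuning numbers are all hypothetical.

    import concurrent.futures
    import requests

    # Hypothetical object; in practice you'd take the size from a HEAD request.
    URL = "https://example-bucket.s3.amazonaws.com/large-object"
    SIZE = 8 * 1024**3        # 8 GiB
    CHUNK = 64 * 1024**2      # 64 MiB per ranged GET
    WORKERS = 16              # raise until the link, not storage, is the bottleneck

    def fetch_range(start: int) -> int:
        end = min(start + CHUNK, SIZE) - 1
        resp = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=120)
        resp.raise_for_status()
        return len(resp.content)

    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        total = sum(pool.map(fetch_range, range(0, SIZE, CHUNK)))
    print(f"downloaded {total / 1024**2:.0f} MiB")

This is essentially what the AWS CLI does under the hood for large objects, which is why a plain single-stream curl understates what S3 can deliver.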
I have some large files stored in R2 and a 50Gbps interface to the world.
curl to Linode's speed test is ~200MB/sec.
curl to R2 is also ~200MB/sec.
That's only ~1.6 Gbps on a 50 Gbps interface, but given that Linode's speed is pretty much the same, I would think the bottleneck is somewhere else. Put the other way: R2 gives you at least ~1.6 Gbps.
No, most people aren't interested in subcomponent performance, just in total performance. A trivial example is that even a 4-way-striped U.2 NVMe disk array exported over Ethernet can deliver a lot more data than 5 Gbps and store mucho TiB.
That comment didn't +1 what you just said. It basically said that we care about the total, usable throughput. Whether some specific components are capable of more doesn't mean anything unless/until that greater throughput is usable by us.
lol I think the only reason you're being downvoted is because the common belief at HN is, "of course marketing is lying and/or doesn't know what they're talking about."
I didn't downvote, but S3 does have a low-latency offering (S3 Express One Zone), which has reasonable latency compared to EFS, IIRC. I'd be shocked if it was as popular as the other, higher-latency S3 tiers though.
I agree - we generally do the opposite of trusting marketing, but sometimes marketing is coincidentally correct.
Cloudflare wants to "protect" the world from the evils of DNS services other than themselves even knowing what geographical region people are in, so they strip all geographical information (the EDNS Client Subnet data), even general, broad location, from DNS lookups. This sometimes has the effect of increasing latency for non-Cloudflare CDNs, since data will sometimes end up being served out of the wrong region.
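For reference, the mechanism being stripped is EDNS Client Subnet (RFC 7871). A quick dnspython sketch, using a hypothetical CDN hostname and a documentation subnet, shows the hint in question:

    import dns.edns
    import dns.message
    import dns.query

    # Hypothetical CDN-hosted name; 203.0.113.0/24 stands in for the client's subnet.
    query = dns.message.make_query("assets.example-cdn.com", "A")
    query.use_edns(options=[dns.edns.ECSOption("203.0.113.0", 24)])

    # A resolver that forwards ECS (e.g. 8.8.8.8) lets the CDN's authoritative
    # server pick a nearby edge; 1.1.1.1 omits the hint, so the answer reflects
    # the resolver's location instead of the client's.
    response = dns.query.udp(query, "8.8.8.8", timeout=5)
    print(response.answer)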
I've wondered since I first heard about this if this is their way to enshittify CDN deliverability in general and make their latency look better in comparison.
You should check out Row Zero (https://rowzero.io). We launched on HN earlier this year. Our CSV handling is the best on the market.
You can import multi-GB CSVs; we auto-infer your format and land your data in a full-featured spreadsheet that supports filter, sort, Ctrl-F, sharing, graphs, the full Excel formula language, native Python, and export to Postgres, Snowflake, and Databricks.
Or skip the spreadsheet and go relational with DuckDB. It's pretty cool to run it directly against a set of CSVs and get performant results in a language most of us already know and use regularly.
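For instance, a minimal DuckDB sketch over a hypothetical directory of CSVs (the path and column names are made up):

    import duckdb

    # Query the CSVs in place; read_csv_auto infers column names and types.
    top = duckdb.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM read_csv_auto('data/*.csv')
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
    """).df()
    print(top)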
I suppose. But as a software developer I've never created an Excel spreadsheet that wasn't first a CSV. I do most of my own work with local data files in jq for JSON or q for CSV, then go from a CSV to an Excel spreadsheet only when it's time to communicate that data with non-programmers.
Their niche is clearly supposed to be in helping developers and data scientists make that same leap, from the tools and formats native to their data pipelines to feature-rich spreadsheets as an export/reporting/analysis format for consumption by people who otherwise don't code. CSV support (especially for huge files) is unusually important there.
I've been working on a better spreadsheet for a while now. https://rowzero.io is a spreadsheet 1000x faster than Excel/Google Sheets. It looks and feels like those products but can open multi-GB data sets, supports Python natively, and can also connect directly to Snowflake/Databricks/Redshift.
We built our spreadsheet (https://rowzero.io) from the ground up to integrate natively with Python. Bolting it on like Microsoft did, or as an add-in like xlwings, just feels second class. To make it first class, we had to solve three hard problems:
1. Sandboxing and dependencies. Python is extremely unsafe to share, so you need to sandbox execution. There's also the environment/package management problem (does the user you're sharing your workbook with have the same version of pandas as you?). We run workbooks in the cloud to solve both of these.
2. The type system. You need a way to natively interop between Excel's type system and Python's much richer type system. The problem with Excel is that there are only two types: numbers and strings. Even dates are just numbers in Excel. Python has rich types like pandas DataFrames, lists, and dictionaries, which Excel can't represent natively. We solved this in a similar way to how TypeScript evolved JavaScript: we support the Excel formula language and all of its types, and also added support for lists, dictionaries, structs, and dataframes.
3. Performance. Our goal was to build a spreadsheet 1000x faster than Excel. Early on we used Python as our formula language but were constantly fighting the GIL and slow interpreter performance. Instead we implemented the spreadsheet engine in Rust as a columnar engine and seamlessly marshal Python types to the spreadsheet type system and back.
It's the hardest systems problem our team's ever worked on. Previously we wrote the S3 file system, so it's not like this was our first rodeo. There's just a ton of details you need to get right to make it feel seamless.
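To give a feel for what the marshalling in point 3 amounts to, here's a toy sketch (illustrative only, not our actual API): a DataFrame round-trips through homogeneous per-column arrays, which is the shape a columnar engine wants.

    import pandas as pd

    def dataframe_to_columns(df: pd.DataFrame) -> dict[str, list]:
        """Flatten a DataFrame into spreadsheet-style homogeneous columns."""
        return {name: df[name].tolist() for name in df.columns}

    def columns_to_dataframe(cols: dict[str, list]) -> pd.DataFrame:
        """Round-trip the columns back into a DataFrame."""
        return pd.DataFrame(cols)

    df = pd.DataFrame({"date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
                       "amount": [10.5, 20.25]})
    assert columns_to_dataframe(dataframe_to_columns(df)).equals(df)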
As the author of said second class add-in, let me just guess that your most popular feature request was adding the "Import from xlsx" functionality...which describes the whole issue: it's always Excel + something, never something instead of Excel.
My apologies, that came off harsher than I intended. I've used xlwings in previous jobs to complete Excel automation tasks, so thank you for building it. xlwings is one of the projects that motivated me to start Row Zero. My main issue with it, and other Excel add-ins, is they break the promise of an .xlsx file as a self-contained virtual machine of code and data. I can no longer just send the .xlsx file - I need the recipient to install (e.g.) Python first. This makes collaboration a nightmare.
I wanted a spreadsheet interface, which my business partners need, but with a way for power users (me) to do more complicated stuff in Python instead of VBA.
To borrow your phrasing, our thesis is that it has to be Excel-compatible spreadsheet + something, not necessarily Excel + something. It's early days for us, but we've seen a couple publicly traded companies switch off Excel to Row Zero to eliminate the security risks that come with Excel's desktop model.
No offense taken, and happy that xlwings was an inspiration for creating Row Zero! I don't really buy the security issues as the reason for switching from Excel to Row Zero, though. Yes, Excel has security issues, but so does the cloud, and at least the issues with Excel can be dealt with: disable VBA macros at a company level, run Excel on airgapped computers, etc. Promising that your cloud won't be hacked or isn't unintentionally leaking information is impossible, no matter how much auditing and certification you're going through.
The relatively recent addition of xlwings Server fixes pretty much all of the issues you encountered at your previous company: users don't need a local installation of Python; the Office admin just pushes an Office.js add-in to them and they're done. No sensitive credentials etc. need to be stored on the end user's computer or in the spreadsheet either, as you can take advantage of SSO and manage user roles in Microsoft Entra ID (which companies are already using anyway).
These are exactly the issues I would have guessed you would run into when using Python in a spreadsheet. Python has really been promoted above its level of competence. It's not suitable for these things at all.
I would say TypeScript is a more obvious choice, or potentially Dart. Maybe even something more obscure like Nim (though I have no experience with that).
I get that you want compatibility with Pandas, Numpy, etc. but you're going to pay for that with endless pain.
I think calling out Durability is a bit of a straw man. Most services get their durability from S3 or some other managed database service. So they're really only making the "do it on a beefy machine argument" for the stateless portion of their service.
I agree with the other points for production services with the caveat that many workloads don't need all of those. Internal workloads or batch data processing use cases often don't need 4 9's of availability and can be done more simply and cheaply on a chonky EC2 instance.
The last point is part of our thesis for https://rowzero.io. You can vertically scale data analysis workloads way further than most people expect.
It looks and feels like Google Sheets and scales to 1B+ row data sets. We natively support Python and Parquet, and connect directly to Postgres, Snowflake, Databricks, Redshift, and S3.
99.9%+ of data sets fit on an SSD, and 99%+ fit in memory. [1]
This is the thesis for https://rowzero.io. We provide a real spreadsheet interface on top of these data sets, which gives you all the richness and interactivity a spreadsheet affords.
In the rare cases you need more than that, you can hire engineers. The rest of the time, a spreadsheet is all you need.
Shameless plug: If you have bigger data sets, check out https://rowzero.io
We scale up to hundreds of millions of rows and have native Python support.
You can define functions in Python and call them as formulas from any spreadsheet cell. We seamlessly marshal Pandas dataframes from Python land to spreadsheet land and back. [1]
We're also hosted and support real time collaboration like Google Sheets. We reimplemented the Excel formula language. We connect directly to Postgres, S3, Snowflake, Redshift, and Databricks. And the first workbook is free.
Shameless plug: If you have bigger data sets, check out rowzero.io.
We implemented something like PySheets initially where the formula language was full Python. But we found the Python interpreter to be the bottleneck during (e.g.) large CSV imports, and the GIL prevented parallelizing evaluation. It was also harder for business users to adopt due to small syntactic differences between Python and the Excel formula language.
So we implemented the spreadsheet engine and formula language in Rust. We have a Python code window that allows you to write arbitrary Python functions. Those functions can be called as formulas from any spreadsheet cell. We seamlessly marshal pandas DataFrames from Python land to spreadsheet land and back. It gives you 90% of the benefits of pure Python without compromising on performance.
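As a hypothetical illustration of the pattern (the registration mechanism here is made up, not our actual API), the code window just holds plain Python functions:

    import pandas as pd

    # Defined in the Python code window; callable from a cell like =CAGR(B2, B10, 8).
    def cagr(start: float, end: float, years: float) -> float:
        """Compound annual growth rate."""
        return (end / start) ** (1 / years) - 1

    # A DataFrame in, a DataFrame out; the engine marshals it to a cell range.
    def top_n(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
        return df.nlargest(n, "revenue")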
Rowzero is a better spreadsheet, while PySheets is a better Jupyter Notebook. Although they converge in certain aspects, their distinct target audiences set them apart. This divergence may create some overlap, but it also leaves ample room for user preference.
PySheets currently runs inside the browser, on top of WebAssembly, and the limitations there are bigger than just Python's slowness. You have only 4 GB of addressable memory, including the interpreter and libraries. Network bandwidth is also a limiting factor for client-side computation.
That said, PySheets can render a 50,000-row Excel sheet in 0.5s and needs about 20s for a full end-to-end recompute run. There are limits to what you can do in the browser without using an external kernel that can run Polars on large datasets. But I think most people will be fine with what PySheets lets them do.
Finally, as the author of PySheets I am honored that a "competitor" sees us as a threat. I am quite impressed by Rowzero myself. Nice work :-)
Kudos on the technical achievement. We considered the thick-client approach you're taking, and one of the reasons we punted was that it was so hard.
One really nice thing about your approach is it minimizes infrastructure cost. That positions you well for embedding use cases, like New York Times visualizations, that we struggle to do economically.
I am feeling pretty okay now, indeed. I played golf today. It was on a par-3 course, so it only tested my short game. However, I scored -1, with almost a hole-in-one. I blame it on the success of PySheets :-)
I've been trying to find a platform for creating dashboards where some data comes from spreadsheets and some comes from databases. Something like a notebook interface crossed with a Grafana interface, while also enabling forms for input, is sorely missing. It can be stitched together, but speed/performance and flexibility (in terms of JS or Python) seem to be lacking atm.
I want to use such a thing to create internal dashboards similar to retool.
Does it need to be live (i.e., when the database or underlying spreadsheet updates, does that need to be reflected in real time on the dashboard), or are you ok with a static display?
Live-updating data is a pain. I've messed around with using JavaScript to force-refresh HTML iframes on a timer, but I was never really satisfied with that. I've heard you can do things with WebSockets, but that's starting to get too complicated for me (I'm not a programmer).
For static stuff, one of the data scientists in my org pointed me to Streamlit (https://streamlit.io/). It's a Python package I found very easy to use. It can easily combine SQL with CSV imports and display them all on one dashboard, and you can use forms, toggle buttons, etc. to control the display.
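A minimal sketch of the kind of thing even I could put together (the file path, connection name, and query are hypothetical; the connection is configured in .streamlit/secrets.toml):

    import pandas as pd
    import streamlit as st

    st.title("Internal dashboard")

    # One dashboard mixing a CSV and a database query, with a toggle for display.
    source = st.radio("Data source", ["CSV metrics", "DB signups"])
    if source == "CSV metrics":
        df = pd.read_csv("metrics.csv")
    else:
        conn = st.connection("mydb", type="sql")
        df = conn.query("SELECT day, signups FROM signups ORDER BY day")

    st.dataframe(df)
    st.line_chart(df.set_index(df.columns[0]))

Run it with "streamlit run dashboard.py" and it serves the page locally.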
You can do that today with PySheets. On the PySheets landing page, you can find a live example. The data comes directly out of a sheet that uses a service to convert metrics into charts. For example, one of the three charts shown on https://pysheets.app/#Traction is directly embedded as an iframe from https://pysheets.app/embed?U=uXNuCGO2JU1E5aL7zcOh&k=C12. If I rerun the sheet that produces the charts, the PySheets landing page updates automatically with the latest data.
You should try http://rowzero.io. We connect directly to DBs and data warehouses, support Python natively, and scale up to hundreds of millions of rows.
Row Zero seems incredible, but both it and PySheets target the wrong users. You are targeting data scientists, while I would target finance people to get traction. So let me tell you why I would use it as a data scientist but not as a finance guy:
1) It runs in the cloud. I would go with something that runs locally (or on-premise), since there is sensitive data involved (with Rust as a backend this should be fine; with Python you need to ship a set of libraries using Docker), or it should be integrated into GCP/AWS/Azure.
2) You need to create a PowerPoint/Word alternative as well, where you can just copy/paste stuff, or you need to make copy/paste into PowerPoint/Word easy.
3) Push hard on big data and DB connections; right now those are the bottlenecks. Also create Python APIs for popular services in finance (Bloomberg, FactSet, CapitalIQ, ...) so that they are available out of the box with a subscription.
4) Do something for the text part, like getting embeddings for similarity, or fuzzy matching in Python (a rough sketch follows this list). The interface can probably also be different for analyzing text (highlighting keywords in green, searching within text, and so on). People in finance often work with PDFs too, and having it all in one platform is nice instead of having two windows as of today.
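For point 4, the fuzzy-matching bit is cheap to sketch (names are made up; rapidfuzz is one library that does this):

    from rapidfuzz import fuzz, process

    # Map free-text names extracted from a PDF onto a clean reference list.
    reference = ["Alphabet Inc.", "Microsoft Corporation", "Berkshire Hathaway"]
    extracted = ["alphabet inc", "Berkshire Hathway", "Microsft Corp"]

    for name in extracted:
        match, score, _ = process.extractOne(name, reference, scorer=fuzz.WRatio)
        print(f"{name!r} -> {match!r} (score {score:.0f})")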
PySheets has been designed to run on-prem and on GCP as well. The beta version you are looking at is just offered as a zero-install experimentation platform. We are actively talking with financial institutions, and both co-founders on the team, https://pysheets.app/#Team, have a long history in Finance, so we are very sensitive to all the (correct) points you make. We will look in more detail at your very helpful suggestions!
Any chance you could expand on how the DAG is implemented in Rust for the execution engine? I'm trying to do something similar (not for spreadsheets but rather for a language: https://docs.yoctoproject.org/bitbake/bitbake-user-manual/bi...). I cannot find any good examples of how to implement something like this in Rust. E.g. should I use a graph library like petgraph, or roll my own?
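For reference, the shape I'm after is roughly what Python's stdlib graphlib gives you (petgraph's toposort would be the Rust analogue; this sketch is language-agnostic, not BitBake's or Row Zero's actual implementation):

    from graphlib import TopologicalSorter

    # Each node maps to the set of nodes it depends on; evaluate in topological
    # order, and on a change, re-evaluate only the dirty node's transitive dependents.
    deps = {"a1": set(), "b1": {"a1"}, "c1": {"a1", "b1"}}
    order = list(TopologicalSorter(deps).static_order())
    print(order)  # ['a1', 'b1', 'c1'] -- dependencies before dependents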
Both solutions seem interesting for different reasons. @breakognize, you said 90% of the benefits; can you or @laffa give an example of the 10% that would prevent me from using your solution?
A major part of it is, in the form of Pyscript-LTK. I keep moving more of PySheets into LTK as I find reusable parts. I truly love open source, but I am also trying to get some revenue for the months of work I spent on developing PySheets.