To measure performance, the author looked at latency, but most S3 workloads are throughput-oriented. The magic of S3 is that it's cheap because it's built on spinning HDDs, which are individually slow and unreliable, but when you have millions of them you can mask the tail and deliver multiple TB/s of throughput.
It's misleading to look at S3 as a CDN. It's fine for that, but its real strength is backing the world's data lakes and cloud data warehouses. Those workloads have a lot of data that's often cold, but S3 can deliver massive throughput when you need it. R2 can't do that, and as far as I can tell, isn't trying to.
Source: I used to work on S3
Yeah, I'd be interested in the bandwidth as well. Can R2 saturate 10/25/50 gigabit links? Can it do so with single requests, or if not, how many parallel requests does that require?
Cloudflare's paid DDoS protection product being able to soak up insane L3/4 DDoS attacks doesn't answer the question of whether R2 specifically, the Cloudflare product with free egress, is able to saturate a pipe.
Cloudflare has the network to do that, but they charge money to do so with their other offerings, so why would they give that to you for free? R2 is not a CDN.
That's unrelated to the performance of (for instance) the R2 storage layer. All the bandwidth in the world won't help you if you're blocked on storage. It isn't clear whether the overall performance of R2 is capable of saturating user bandwidth, or whether it'll be blocked on something.
S3 can't saturate user bandwidth unless you make many parallel requests. I'd be (pleasantly) surprised if R2 can.
I'm confused, I assumed we were talking about the network layer.
If we are talking about storage, well, SATA can't give you more than ~5 Gbps effective (SATA III is 6 Gb/s on the wire), so I guess the answer is no? But also no one else can do it, unless they're using super exotic HDD tech (hint: they're not, it's actually the opposite).
What a weird thing to argue about, btw; literally everybody is running a network layer on top of storage that lets you have much higher throughput. When one talks about R2/S3 throughput, no one (in my circle, ofc.) would think we are referring to the speed of their HDDs, lmao. But it's nice to see this, it's always amusing to stumble upon people with a wildly different point of view on things.
We're talking about the user-visible behavior. You argued that because Cloudflare's CDN has an obscene amount of bandwidth, R2 will be able to saturate user bandwidth; that doesn't follow, hence my counterpoint that it could be bottlenecked on storage rather than network. The question at hand is what performance R2 offers, and that hasn't been answered.
There are any number of ways they could implement R2 that would allow it to run at full wire speed, but S3 doesn't run at full wire speed by default (unless you make many parallel requests) and I'd be surprised if R2 does.
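To make "many parallel requests" concrete, here's a rough sketch of the standard trick: split one object into ranged GETs and fetch them concurrently. The URL, object size, and tuning numbers are all hypothetical.

    import concurrent.futures
    import requests

    # Hypothetical object; in practice you'd take the size from a HEAD request.
    URL = "https://example-bucket.s3.amazonaws.com/large-object"
    SIZE = 8 * 1024**3        # 8 GiB
    CHUNK = 64 * 1024**2      # 64 MiB per ranged GET
    WORKERS = 16              # raise until the link, not storage, is the bottleneck

    def fetch_range(start: int) -> int:
        end = min(start + CHUNK, SIZE) - 1
        resp = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=120)
        resp.raise_for_status()
        return len(resp.content)

    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        total = sum(pool.map(fetch_range, range(0, SIZE, CHUNK)))
    print(f"downloaded {total / 1024**2:.0f} MiB")

This is essentially what the AWS CLI does under the hood for large objects, which is why a plain single-stream curl understates what S3 can deliver.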
I have some large files stored in R2 and a 50Gbps interface to the world.
curl to Linode's speed test is ~200MB/sec.
curl to R2 is also ~200MB/sec.
That's only ~1.6 Gbps on a 50 Gbps interface, but given that Linode's speed is pretty much the same, I would think the bottleneck is somewhere else. Put the other way: R2 gives you at least ~1.6 Gbps.
No, most people aren't interested in subcomponent performance, just in total performance. A trivial example is that even a 4-way-striped U.2 NVMe disk array exported over Ethernet can deliver a lot more data than 5 Gbps and store mucho TiB.
That comment didn't +1 what you just said. It basically said that we care about the total, usable throughput. Whether some specific components are capable of more doesn't mean anything unless/until that greater throughput is usable by us.
lol I think the only reason you're being downvoted is because the common belief at HN is, "of course marketing is lying and/or doesn't know what they're talking about."
I didn't downvote, but S3 does have a low-latency offering (S3 Express One Zone), which has reasonable latency compared to EFS, IIRC. I'd be shocked if it was as popular as the other, higher-latency S3 tiers though.
I agree - we generally do the opposite of trusting marketing, but sometimes marketing is coincidentally correct.
Cloudflare wants to "protect" the world from the evils of DNS services other than themselves even knowing what geographical region people are in, so they strip all geographical information (the EDNS Client Subnet data), even general, broad location, from DNS lookups. This sometimes has the effect of increasing latency for non-Cloudflare CDNs, since data will sometimes end up being served out of the wrong region.
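For reference, the mechanism being stripped is EDNS Client Subnet (RFC 7871). A quick dnspython sketch, using a hypothetical CDN hostname and a documentation subnet, shows the hint in question:

    import dns.edns
    import dns.message
    import dns.query

    # Hypothetical CDN-hosted name; 203.0.113.0/24 stands in for the client's subnet.
    query = dns.message.make_query("assets.example-cdn.com", "A")
    query.use_edns(options=[dns.edns.ECSOption("203.0.113.0", 24)])

    # A resolver that forwards ECS (e.g. 8.8.8.8) lets the CDN's authoritative
    # server pick a nearby edge; 1.1.1.1 omits the hint, so the answer reflects
    # the resolver's location instead of the client's.
    response = dns.query.udp(query, "8.8.8.8", timeout=5)
    print(response.answer)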
I've wondered since I first heard about this if this is their way to enshittify CDN deliverability in general and make their latency look better in comparison.
You should check out Row Zero (https://rowzero.io). We launched on HN earlier this year. Our CSV handling is the best on the market.
You can import multi-GB CSVs; we auto-infer your format and land your data in a full-featured spreadsheet that supports filter, sort, Ctrl-F, sharing, graphs, the full Excel formula language, native Python, and export to Postgres, Snowflake, and Databricks.
Or skip the spreadsheet and go relational with DuckDB. It's pretty cool to run it directly against a set of CSVs and get performant results in a language most of us already know and use regularly.
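For instance, a minimal DuckDB sketch over a hypothetical directory of CSVs (the path and column names are made up):

    import duckdb

    # Query the CSVs in place; read_csv_auto infers column names and types.
    top = duckdb.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM read_csv_auto('data/*.csv')
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
    """).df()
    print(top)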
I suppose. But as a software developer I've never created an Excel spreadsheet that wasn't first a CSV. I do most of my own work with local data files in jq for JSON or q for CSV, then go from a CSV to an Excel spreadsheet only when it's time to communicate that data with non-programmers.
Their niche is clearly supposed to be in helping developers and data scientists make that same leap, from the tools and formats native to their data pipelines to feature-rich spreadsheets as an export/reporting/analysis format for consumption by people who otherwise don't code. CSV support (especially for huge files) is unusually important there.
I've been working on a better spreadsheet for a while now. https://rowzero.io is a spreadsheet 1000x faster than Excel/Google Sheets. It looks and feels like those products but can open multi-GB data sets, supports Python natively, and can also connect directly to Snowflake/Databricks/Redshift.
We built our spreadsheet (https://rowzero.io) from the ground up to integrate natively with Python. Bolting it on like Microsoft did, or as an add-in like xlwings, just feels second class. To make it first class, we had to solve three hard problems:
1. Sandboxing and dependencies. Python is extremely unsafe to share, so you need to sandbox execution. There's also the environment/package management problem (does the user you're sharing your workbook with have the same version of pandas as you?). We run workbooks in the cloud to solve both of these.
2. The type system. You need a way to natively interop between Excel's type system and Python's much richer type system. The problem with Excel is that there are only two types: numbers and strings. Even dates are just numbers in Excel. Python has rich types like pandas DataFrames, lists, and dictionaries, which Excel can't represent natively. We solved this in a similar way to how TypeScript evolved JavaScript: we support the Excel formula language and all of its types, and also added support for lists, dictionaries, structs, and dataframes.
3. Performance. Our goal was to build a spreadsheet 1000x faster than Excel. Early on we used Python as our formula language but were constantly fighting the GIL and slow interpreter performance. Instead we implemented the spreadsheet engine in Rust as a columnar engine and seamlessly marshal Python types to the spreadsheet type system and back.
It's the hardest systems problem our team's ever worked on. Previously we wrote the S3 file system, so it's not like this was our first rodeo. There's just a ton of details you need to get right to make it feel seamless.
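To give a feel for what the marshalling in point 3 amounts to, here's a toy sketch (illustrative only, not our actual API): a DataFrame round-trips through homogeneous per-column arrays, which is the shape a columnar engine wants.

    import pandas as pd

    def dataframe_to_columns(df: pd.DataFrame) -> dict[str, list]:
        """Flatten a DataFrame into spreadsheet-style homogeneous columns."""
        return {name: df[name].tolist() for name in df.columns}

    def columns_to_dataframe(cols: dict[str, list]) -> pd.DataFrame:
        """Round-trip the columns back into a DataFrame."""
        return pd.DataFrame(cols)

    df = pd.DataFrame({"date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
                       "amount": [10.5, 20.25]})
    assert columns_to_dataframe(dataframe_to_columns(df)).equals(df)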
As the author of said second class add-in, let me just guess that your most popular feature request was adding the "Import from xlsx" functionality...which describes the whole issue: it's always Excel + something, never something instead of Excel.
My apologies, that came off harsher than I intended. I've used xlwings in previous jobs to complete Excel automation tasks, so thank you for building it. xlwings is one of the projects that motivated me to start Row Zero. My main issue with it, and other Excel add-ins, is they break the promise of an .xlsx file as a self-contained virtual machine of code and data. I can no longer just send the .xlsx file - I need the recipient to install (e.g.) Python first. This makes collaboration a nightmare.
I wanted a spreadsheet interface, which my business partners need, but with a way for power users (me) to do more complicated stuff in Python instead of VBA.
To borrow your phrasing, our thesis is that it has to be Excel-compatible spreadsheet + something, not necessarily Excel + something. It's early days for us, but we've seen a couple publicly traded companies switch off Excel to Row Zero to eliminate the security risks that come with Excel's desktop model.
No offense taken, and happy that xlwings was an inspiration for creating Row Zero! I don't really buy the security issues as the reason for switching from Excel to Row Zero, though. Yes, Excel has security issues, but so does the cloud, and at least the issues with Excel can be dealt with: disable VBA macros at a company level, run Excel on airgapped computers, etc. Promising that your cloud won't be hacked or isn't unintentionally leaking information is impossible, no matter how much auditing and certification you're going through.
The relatively recent addition of xlwings Server fixes pretty much all of the issues you encountered at your previous company: users don't need a local installation of Python; the Office admin just pushes an Office.js add-in to them and they're done. No sensitive credentials etc. need to be stored on the end user's computer or in the spreadsheet either, as you can take advantage of SSO and manage user roles in Microsoft Entra ID (which companies are already using anyway).
These are exactly the issues I would have guessed you would run into when using Python in a spreadsheet. Python has really been promoted above its level of competence. It's not suitable for these things at all.
I would say TypeScript is a more obvious choice, or potentially Dart. Maybe even something more obscure like Nim (though I have no experience with that).
I get that you want compatibility with Pandas, Numpy, etc. but you're going to pay for that with endless pain.
I think calling out Durability is a bit of a straw man. Most services get their durability from S3 or some other managed database service. So they're really only making the "do it on a beefy machine argument" for the stateless portion of their service.
I agree with the other points for production services with the caveat that many workloads don't need all of those. Internal workloads or batch data processing use cases often don't need 4 9's of availability and can be done more simply and cheaply on a chonky EC2 instance.
The last point is part of our thesis for https://rowzero.io. You can vertically scale data analysis workloads way further than most people expect.
It looks and feels like Google Sheets and scales to 1B+ row data sets. We natively support Python and Parquet, and connect directly to Postgres, Snowflake, Databricks, Redshift, and S3.
99.9%+ of data sets fit on an SSD, and 99%+ fit in memory. [1]
This is the thesis for https://rowzero.io. We provide a real spreadsheet interface on top of these data sets, which gives you all the richness and interactivity a spreadsheet affords.
In the rare cases you need more than that, you can hire engineers. The rest of the time, a spreadsheet is all you need.
Shameless plug: If you have bigger data sets, check out https://rowzero.io
We scale up to hundreds of millions of rows and have native Python support.
You can define functions in Python and call them as formulas from any spreadsheet cell. We seamlessly marshal Pandas dataframes from Python land to spreadsheet land and back. [1]
We're also hosted and support real time collaboration like Google Sheets. We reimplemented the Excel formula language. We connect directly to Postgres, S3, Snowflake, Redshift, and Databricks. And the first workbook is free.
Shameless plug: If you have bigger data sets, check out rowzero.io.
We implemented something like PySheets initially where the formula language was full Python. But we found the Python interpreter to be the bottleneck during (e.g.) large CSV imports, and the GIL prevented parallelizing evaluation. It was also harder for business users to adopt due to small syntactic differences between Python and the Excel formula language.
So we implemented the spreadsheet engine and formula language in Rust. We have a Python code window that allows you to write arbitrary Python functions. Those functions can be called as formulas from any spreadsheet cell. We seamlessly marshal pandas DataFrames from Python land to spreadsheet land and back. It gives you 90% of the benefits of pure Python without compromising on performance.
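As a hypothetical illustration of the pattern (the registration mechanism here is made up, not our actual API), the code window just holds plain Python functions:

    import pandas as pd

    # Defined in the Python code window; callable from a cell like =CAGR(B2, B10, 8).
    def cagr(start: float, end: float, years: float) -> float:
        """Compound annual growth rate."""
        return (end / start) ** (1 / years) - 1

    # A DataFrame in, a DataFrame out; the engine marshals it to a cell range.
    def top_n(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
        return df.nlargest(n, "revenue")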
Rowzero is a better spreadsheet, while PySheets is a better Jupyter Notebook. Although they converge in certain aspects, their distinct target audiences set them apart. This divergence may create some overlap, but it also leaves ample room for user preference.
PySheets currently runs inside the browser, on top of WebAssembly, and the limitations there are bigger than just Python's slowness. You have only 4 GB of addressable memory, including the interpreter and libraries. Network bandwidth is also a limiting factor for client-side computation.
That said, PySheets can render a 50,000-row Excel sheet in 0.5s and needs about 20s for a full end-to-end recompute run. There are limits to what you can do in the browser without using an external kernel that can run Polars on large datasets. But I think most people will be fine with what PySheets lets them do.
Finally, as the author of PySheets I am honored that a "competitor" sees us as a threat. I am quite impressed by Rowzero myself. Nice work :-)
Kudos on the technical achievement. We considered the thick-client approach you're taking, and one of the reasons we punted was that it was so hard.
One really nice thing about your approach is it minimizes infrastructure cost. That positions you well for embedding use cases, like New York Times visualizations, that we struggle to do economically.
I am feeling pretty okay now, indeed. I played golf today. It was on a par-3 course, so it only tested my short game. However, I scored -1, with almost a hole-in-one. I blame it on the success of PySheets :-)
I've been trying to find a platform for creating dashboards where some data comes from spreadsheets and some comes from databases. Something like a notebook interface crossed with a Grafana interface, while also enabling forms for input, is sorely missing. It can be stitched together, but speed/performance and flexibility (in terms of JS or Python) seem to be lacking atm.
I want to use such a thing to create internal dashboards similar to retool.
Does it need to be live (i.e., when the database or underlying spreadsheet updates, does that need to be reflected in real time on the dashboard), or are you ok with a static display?
Live-updating data is a pain. I've messed around with using JavaScript to force-refresh HTML iframes on a timer, but I was never really satisfied with that. I've heard you can do things with WebSockets, but that's starting to get too complicated for me (I'm not a programmer).
For static stuff, one of the data scientists in my org pointed me to Streamlit (https://streamlit.io/). It's a Python package I found very easy to use. It can easily combine SQL with CSV imports and display them all on one dashboard, and you can use forms, toggle buttons, etc. to control the display.
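A minimal sketch of the kind of thing even I could put together (the file path, connection name, and query are hypothetical; the connection is configured in .streamlit/secrets.toml):

    import pandas as pd
    import streamlit as st

    st.title("Internal dashboard")

    # One dashboard mixing a CSV and a database query, with a toggle for display.
    source = st.radio("Data source", ["CSV metrics", "DB signups"])
    if source == "CSV metrics":
        df = pd.read_csv("metrics.csv")
    else:
        conn = st.connection("mydb", type="sql")
        df = conn.query("SELECT day, signups FROM signups ORDER BY day")

    st.dataframe(df)
    st.line_chart(df.set_index(df.columns[0]))

Run it with "streamlit run dashboard.py" and it serves the page locally.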
You can do that today with PySheets. On the PySheets landing page, you can find a live example. The data comes directly out of a sheet that uses a service to convert metrics into charts. For example, one of the three charts shown on https://pysheets.app/#Traction is directly embedded as an iframe from https://pysheets.app/embed?U=uXNuCGO2JU1E5aL7zcOh&k=C12. If I rerun the sheet that produces the charts, the PySheets landing page updates automatically with the latest data.
You should try http://rowzero.io. We connect directly to DBs and data warehouses, support Python natively, and scale up to hundreds of millions of rows.
Row Zero seems incredible, but both it and PySheets target the wrong users. You are targeting data scientists, while I would target finance people to get traction. So let me tell you why I would use it as a data scientist but not as a finance guy:
1) It runs in the cloud. I would go with something that runs locally (or on-premise), since there is sensitive data involved (with Rust as a backend this should be fine; with Python you need to ship a set of libraries using Docker), or it should be integrated into GCP/AWS/Azure.
2) You need to create a PowerPoint/Word alternative as well, where you can just copy/paste stuff, or you need to make copy/paste into PowerPoint/Word easy.
3) Push hard on big data and DB connections; right now those are the bottlenecks. Also create Python APIs for popular services in finance (Bloomberg, FactSet, CapitalIQ, ...) so that they are available out of the box with a subscription.
4) Do something for the text part, like getting embeddings for similarity, or fuzzy matching in Python (a rough sketch follows this list). The interface can probably also be different for analyzing text (highlighting keywords in green, searching within text, and so on). People in finance often work with PDFs too, and having it all in one platform is nice instead of having two windows as of today.
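For point 4, the fuzzy-matching bit is cheap to sketch (names are made up; rapidfuzz is one library that does this):

    from rapidfuzz import fuzz, process

    # Map free-text names extracted from a PDF onto a clean reference list.
    reference = ["Alphabet Inc.", "Microsoft Corporation", "Berkshire Hathaway"]
    extracted = ["alphabet inc", "Berkshire Hathway", "Microsft Corp"]

    for name in extracted:
        match, score, _ = process.extractOne(name, reference, scorer=fuzz.WRatio)
        print(f"{name!r} -> {match!r} (score {score:.0f})")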
PySheets has been designed to run on-prem and on GCP as well. The beta version you are looking at is just offered as a zero-install experimentation platform. We are actively talking with financial institutions, and both co-founders on the team, https://pysheets.app/#Team, have a long history in Finance, so we are very sensitive to all the (correct) points you make. We will look in more detail at your very helpful suggestions!
Any chance you could expand on how the DAG is implemented in Rust for the execution engine? I'm trying to do something similar (not for spreadsheets but rather for a language: https://docs.yoctoproject.org/bitbake/bitbake-user-manual/bi...). I cannot find any good examples of how to implement something like this in Rust. E.g. should I use a graph library like petgraph, or roll my own?
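For reference, the shape I'm after is roughly what Python's stdlib graphlib gives you (petgraph's toposort would be the Rust analogue; this sketch is language-agnostic, not BitBake's or Row Zero's actual implementation):

    from graphlib import TopologicalSorter

    # Each node maps to the set of nodes it depends on; evaluate in topological
    # order, and on a change, re-evaluate only the dirty node's transitive dependents.
    deps = {"a1": set(), "b1": {"a1"}, "c1": {"a1", "b1"}}
    order = list(TopologicalSorter(deps).static_order())
    print(order)  # ['a1', 'b1', 'c1'] -- dependencies before dependents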
Both solutions seem interesting for different reasons. @breakognize, you said 90% of the benefits; can you or @laffa give an example of the 10% that would prevent me from using your solution?
A major part of it is, in the form of Pyscript-LTK. I keep moving more of PySheets into LTK as I find reusable parts. I truly love open source, but I am also trying to get some revenue for the months of work I spent on developing PySheets.