Amazon can apparently get data cheaply out of cold storage with a day or so of w...

the-rc · on March 16, 2022

If you're doing disaster recovery, do you really want to wait a day for your data? I don't understand why you want Google to slow things down. Why can't Amazon get cold data faster? It's fairly simple: all Google storage is online. There is public material on how it all works. Maybe you can ask them to add an artificial delay. There is offline storage, yes, but that's tape, which has bigger issues...

Your pricing model is flawed, because you're not taking into account other factors such as rebalancing. Years ago, I would have loved for retrieval to be that cheap to perform behind the scenes.

Dylan16807 · on March 17, 2022

> If you're doing disaster recovery, do you really want to wait a day for your data?

I'd like to have the option.

The point isn't that I want to wait, it's that I want it to be super low priority to make it cheaper.

For a super low priority job, why should reading be multiple times as expensive as writing?

> Your pricing model is flawed, because you're not taking into account other factors such as rebalancing. Years ago, I would have loved for retrieval to be that cheap to perform behind the scenes.

I don't understand. If rebalancing happens behind the scenes, then that has to get paid for as part of storage.

Which means the cost of 1 single I/O is a smaller fraction of the storage cost.

Which means the profit margin for retrieval is significantly higher than my estimate.

My cost estimate uses the most flattering possible case for the retrieval pricing. Any storage costs I didn't account for, deliberately or accidentally, make my argument stronger.
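
To make the direction of that argument concrete, here is a rough sketch. Every number in it is made up for illustration and is not any provider's real price, and the function names are hypothetical. It assumes the storage fee has to fund some number of full background I/O passes over the data per month (rebalancing and the like), which puts an upper bound on what one I/O can cost.

```python
# Illustrative only: all prices below are hypothetical placeholders.
STORAGE_PRICE = 0.001    # assumed storage fee, $ per GB-month
RETRIEVAL_PRICE = 0.02   # assumed retrieval fee, $ per GB


def io_cost_upper_bound(storage_price: float, background_passes: float) -> float:
    """If the storage fee pays for `background_passes` full I/O passes per
    month, one pass over a GB can cost at most this much."""
    return storage_price / background_passes


def retrieval_margin(retrieval_price: float, io_cost: float) -> float:
    """Fraction of the retrieval fee that is not covered by the I/O cost."""
    return (retrieval_price - io_cost) / retrieval_price


# The more hidden I/O the storage fee already covers, the cheaper a single
# I/O must be, and the larger the implied margin on a paid retrieval.
for passes in (1, 2, 4):
    cost = io_cost_upper_bound(STORAGE_PRICE, passes)
    margin = retrieval_margin(RETRIEVAL_PRICE, cost)
    print(f"{passes} background pass(es): margin >= {margin:.1%}")
```

With these placeholder numbers the implied margin only goes up as more background passes are assumed, which is the point: unaccounted storage-side I/O pushes the estimate in one direction.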

the-rc · on March 18, 2022

> For a super low priority job, why should reading be multiple times as expensive as writing?

Because the particular aggregate mix of storage helps drive down overall costs. Changing that mix affects the entire stack, as well as capacity planning. Ok, make cold storage very cheap to retrieve. What happens now? Everybody will buy that and abuse it for more demanding applications, with quality of service for latency sensitive traffic going down the toilet. So you end up throwing more resources at the problem and/or charging more across the board. Pricing is one of the few factors that users really pay attention to in the real world, not best practices. Unfortunately.

Furthermore, to implement what you want, you can keep a request open for hours, which causes issues all over the stack (where do you keep that state? How does that interact with load balancers?) or you mark the cold object and return temporary failures until it's finally retrievable. That's extra state and extra complexity that doesn't exist right now. Those extra costs would have to be recouped somewhere.
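
A minimal sketch of the second option (mark the object, return temporary failures until it is retrievable) might look like the following. The class name, the fixed thaw delay, and the HTTP-style status codes are all assumptions made for illustration, not how any real service does it. The `ready_at` map is exactly the kind of extra per-object state being described.

```python
from dataclasses import dataclass, field


@dataclass
class ColdStore:
    """Toy model: the first read of a cold object schedules a thaw, and
    reads fail temporarily until the thaw delay has elapsed."""
    thaw_delay_s: float
    objects: dict
    ready_at: dict = field(default_factory=dict)  # extra state per thawing object

    def get(self, key: str, now: float):
        if key not in self.objects:
            return 404, None                      # unknown object
        if key not in self.ready_at:
            # First request marks the object and starts the (simulated) thaw.
            self.ready_at[key] = now + self.thaw_delay_s
        if now < self.ready_at[key]:
            return 503, None                      # temporary failure, retry later
        return 200, self.objects[key]             # thawed, serve the data


store = ColdStore(thaw_delay_s=3600, objects={"backup.tar": b"payload"})
first = store.get("backup.tar", now=0)       # schedules the thaw
later = store.get("backup.tar", now=3600)    # thaw delay has elapsed
```

Even in this toy version the service has to remember which objects are thawing and for how long, and the client has to poll, which is where the extra complexity comes from.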

> I don't understand. If rebalancing happens behind the scenes, then that has to get paid for as part of storage.

Why? Rebalancing doesn't happen in a vacuum. It's linked to the traffic mix. You can't look at just the total bytes used in a cluster and figure out how many HDs, SSDs, CPUs, RAM and NICs you need to serve that data while still meeting your SLOs. Unless it's a write-only (W/O) cluster, you need more signals. Amount and behavior of cold vs hot storage are two of those.

Anyway, cold storage that warms up most likely requires extra rebalancing that wouldn't have happened otherwise. How would you price that? Who would you charge?

Again, your cost estimate for retrieval does not take into account how things actually work. Rebalancing is not purely a storage cost. Yes, your argument is strong, but only if you start from flawed assumptions.

Dylan16807 · on March 18, 2022

> Because the particular aggregate mix of storage helps drive down overall costs. Changing that mix affects the entire stack, as well as capacity planning. Ok, make cold storage very cheap to retrieve. What happens now?

It doesn't have to be very cheap. Let's start with just trying to match the price of writes. That shouldn't really affect the total amount of I/O, and there's no reason reads should be harder on the system than writes.

> Furthermore, to implement what you want, you can keep a request open for hours, which causes issues all over the stack (where do you keep that state? How does that interact with load balancers?) or you mark the cold object and return temporary failures until it's finally retrievable. That's extra state and extra complexity that doesn't exist right now. Those extra costs would have to be recouped somewhere.

I suppose. But the cost of keeping a request open should be much, much less than the current cost of having everything fully accessible in milliseconds.

> Anyway, cold storage that warms up most likely requires extra rebalancing that wouldn't have happened otherwise. How would you price that? Who would you charge?

Reads that cost a significant number of dollars each don't require rebalancing. I'm not suggesting they go so cheap that rebalancing is required. You'd still do only one read, into a completely separate hot storage system, the way it currently works.

immibis · on March 15, 2022

Sir, we live in a capitalist economy. Prices don't mirror costs.

Dylan16807 · on March 15, 2022

My complaint is that the competition is awful here.

Capitalism is supposed to pit companies against each other and drive profit margins below 50%. And it's failing to do that here.

More than that, it's a very cruel pricing system because it lures you in with low numbers, then overcharges to get your data back when you need it. Being antagonistic to your customers does have negative effects in the long run. And I think it's worth pointing out situations like that when people are shopping around.