We Test PCIe 4.0 Storage: The AnandTech 2021 SSD Benchmark Suite (anandtech.com)
184 points by pella on Feb 1, 2021 | 82 comments



A few notable things I didn't realise about modern SSDs.

1. Extremely power efficient. They have come a long way with idle power and power management, even on the desktop, where energy efficiency isn't much of a concern. There used to be a time when SSDs drew 1W+ even when idling.

2. The amount of idle time: these IO traces show that most of the time the SSD isn't doing anything at all.

3. Read latency is now extremely good. We are talking sub-100us average and sub-200us at the 99th percentile, to the point that Optane doesn't provide any meaningful difference in consumer usage. We will find out soon, since Billy mentioned he will be testing the Optane P5800X; I can't wait to see that. And before someone jumps in about Optane's random read/write advantage at QD1: I still don't believe it matters in consumer usage beyond current NAND SSD performance.

I think with the coming PCIe 5.0 drives we have a roadmap that has pretty much "solved" the performance category, if we haven't already. What I want to see are slower but larger-capacity SSDs that are more affordable. But barring some unforeseeable market condition, I don't see anything on the roadmap or in the forecasts that could bring us a 4TB SSD for sub-$200 within the next 4 years. We seem to be entering the end of the S-curve, where improvement (cost reduction) will be slower.


Regarding Optane and "consumer" usage: if you're talking about typical PC usage, like browsing, using MS Office, and playing games, then sure, there's absolutely no perceptible difference between Optane and any other SSD, even a SATA SSD. But that's not what people buy Optane for, so it doesn't make much sense to test such an expensive drive on regular "consumer" PC tasks.

I train deep learning models, so I read random images from disk to feed data loader pipelines, and I also write back stats, logs, and model checkpoints. I frequently use 8 GPUs, each running a separate experiment, with its own data pipeline. This means every GPU is reading 1.2M jpg images in a random manner, every 30 minutes. I haven't actually tested if the disk reads are a bottleneck, but I'm guessing random reads of 20M images per hour would stress any drive, so I'd rather not risk it and get the best money can buy.
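(A quick back-of-the-envelope sketch of that load, in Python. The 8 GPUs and 1.2M images per 30 minutes come from this comment; the ~256KB average image size is mentioned further down the thread, so treat it as an assumption.)

    # Rough sustained random-read estimate for the training setup above
    gpus = 8
    images_per_gpu = 1_200_000      # per 30-minute window
    window_seconds = 30 * 60
    avg_image_bytes = 256 * 1024    # assumed ~256 KB average jpg

    reads_per_second = gpus * images_per_gpu / window_seconds
    bytes_per_second = reads_per_second * avg_image_bytes
    print(f"{reads_per_second:,.0f} reads/s, {bytes_per_second / 1e9:.1f} GB/s")
    # ~5,333 reads/s and ~1.4 GB/s of random reads, before any bursts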


Normal SSDs have more throughput than Optane. Image reads, stats, logs, and checkpoints can all be done asynchronously, so I think a high end flash SSD could very likely perform better for you.


According to benchmarks [1] the older 905p destroys the best "normal" SSD (980 Pro) - it's more than 3 times faster in random reads. As I showed above, in the worst case I'm trying to get up to 2,000MB/s of random reads from my drive. How can you possibly believe a flash SSD can be a better drive for this use case?

[1] https://www.anandtech.com/show/16087/the-samsung-980-pro-pci...


But you're not doing 4K reads, and you're not stuck doing them one at a time.

The next page, working with 128KB chunks, is far more relevant to your use case. Especially this chart: https://images.anandtech.com/doci/16087/sr-s-905p-1500.png

Here's a similar benchmark, to prove that it's nothing unique to sequential reads. 256KB random chunks at queue depth 4 go well over 5 gigabytes per second. https://www.legitreviews.com/wp-content/uploads/2020/09/ATTO...

So for the use case of loading image files, with the ability to grab a bunch in parallel, flash seems like a clear winner.

It will also write large checkpoint data twice as fast.

And the extra microseconds when you're writing to a log shouldn't matter.


Unfortunately, I am stuck doing them one at a time, and every image is located in a different folder. At least that's how the official PyTorch ImageNet dataloader reads images for training.

You do make a good point though: my reads are not 4K, they're closer to 256K on average. But still, they are random reads, and I'm not sure how much faster those get with size.

And I’m pretty sure doing writes in parallel with those random reads makes things worse, but ok, let’s ignore that.


> At least that’s how the official Pytorch Imagenet dataloader is reading images for training.

Well, something to keep in mind if you want to optimize it later.
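(For what it's worth, a minimal sketch of how the stock PyTorch loader can already be parallelized: DataLoader's num_workers spawns worker processes that each open image files independently, so the drive sees several outstanding reads even though each worker reads one file at a time. The path and parameter values here are illustrative, not from your setup.)

    import torchvision.datasets as datasets
    import torchvision.transforms as transforms
    from torch.utils.data import DataLoader

    # Hypothetical ImageNet-style tree of one folder per class
    dataset = datasets.ImageFolder("/data/imagenet/train",
                                   transform=transforms.ToTensor())

    # Each worker process reads files on its own, so the SSD sees an
    # effective queue depth of roughly num_workers rather than 1.
    loader = DataLoader(dataset, batch_size=256, shuffle=True,
                        num_workers=8, pin_memory=True)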

> still, they are random reads. I’m not sure how much faster those get with size.

The second image I linked should demonstrate that pretty well.

> And I’m pretty sure doing writes in parallel with those random reads makes things worse

It looks like both models handle a mixed workload acceptably. Writing 128KB chunks is somewhat slower than reading, and mixing them gives you a speed somewhere in between. http://images.anandtech.com/doci/16087/sm-s-905p-1500.png


I'm not sure the second image directly answers my question though. It appears it shows a transfer of a 256MB file using different block sizes. Is this accurate? In my case, I'm transferring a 256KB file. And we want to know how much faster it is to transfer a 256KB file compared to transferring a 4KB file. Would it be the same as comparing how much faster it is to transfer a 256MB file than transferring a 4MB file on that plot? Because if so, there's no improvement shown. And it's clearly not the same as comparing the transfer of 256MB file using 256KB blocks vs 4KB blocks, because in both cases the block size is much smaller than the file size.


You should mostly ignore the 256MB number, that's just how big the testing arena is on the drive and barely affects the numbers.

The chart is showing how fast it is to load random chunks of data with different sizes. When you load image files, you're effectively loading random chunks of data. So if you want to figure out how fast this drive could load images for you, look at the line that best represents the size of your images.

> Would it be the same as comparing how much faster it is to transfer a 256MB file than transferring a 4MB file on that plot?

No, because the relationship is non-linear. If you want to compare 4KB and 256KB you have to look at the numbers for 4KB and 256KB. 256KB random reads are 15x faster than 4KB random reads.

The point is that when you initially linked the data for 4KB random reads, you were effectively worrying about how fast the drive would be at loading 4KB files. That number doesn't matter. You want to know how fast it is at loading 128KB or 256KB files. So look at a chart that's measuring random reads of the correct size.

If the question you want answered is the one implied by "still, they are random reads. I’m not sure how much faster those get with size.", then the answer is: That is literally the point of the ATTO chart. It shows how much faster random reads get with size. The relationship is a curve, so it's best to just look at the chart for your specific size. (And also keep in mind that this particular benchmark always had 4 IOs pending at once. And with real files, there will be a minor speed impact because they're not aligned as nicely.)


Just to clarify: Intel hasn't actually promised me a P5800X yet. But I think it's pretty likely they'll sample it for review eventually. For the P4800X, they initially offered us remote access to test in one of their servers, and later provided a server with a P4800X to keep as my new enterprise SSD testbed. Since I already have some decently powerful PCIe 4.0 machines, I'm not sure I want them to offer another rackmount server that would be too loud to make good use of in the home office.


Good to see a thorough benchmark and a good writeup. Many "tech" sites now just repeat manufacturers' press releases. Not Anandtech.


I like Notebookcheck's reviews as well. Very thorough and consistent over the years; everything is in their database and it's easy to make comparisons. They not only do objective measurements of brightness, noise, temperature, and a bunch of other things, but also write down how they measured and with what. The only thing missing is confidence intervals.


Yes, Notebookcheck is a fantastic site.

The German press seems to be more cynical about marketing; c't Magazine is great, but no longer available in English.

The Register is also a good, less technical site if you want to cut through the BullSh_t in IT.


I recommend Gamers Nexus as well. It's mainly YouTube content, but they do very in-depth benchmarking and some innovative visualizations, like showing how uniform the cooling surface of a given cooler is, and things like that.


Gamers Nexus is one of the very rare places that doesn't just run the manufacturer's "recommended set of tests", i.e. they don't buy the marketing bullshit. They are run for the users, not for selling advertising.


Absolutely. They are on fire lately, burning bridges with certain manufacturers who are now on my do-not-buy list. I won't name who here, best to leave his lawyers to deal with the flak, but it's definitely great to see reviewers refusing to become sellouts!


Is Gamers Nexus related to Hardware Unboxed, which was briefly banned from reviewing Nvidia cards for reasons similar to what's being discussed here?


No.


The amount of thought that the Gamer's Nexus team puts in their work and benchmarks is incredible.


Oh, thanks AnandTech indeed! Finally a proper test of PCIe 4.0 NVMe drives. I'm very happy with my choice of the 1TB 980 Pro now. I mean, I was always happy with it, but I've now got that validation one tends to seek, showing that it was a solid choice.

It is no surprise though, as my 980 Pro follows a long series of Samsung drive purchases, including a 970 EVO 2TB (in the same system) and others going back further, having chosen Samsung after they performed seriously well in some of the earlier SSD torture tests.

(I once had 4 IBM Deathstars in raid 10...)


It's a pity they didn't test it against the 970 Pro.


I left that one out of this initial batch because it's not really representative of anything on the market these days. But I am also curious to see how it scores, especially because it's one of the few recent consumer SSDs that doesn't rely on SLC caching.


I won't be surprised if it's actually a better drive than the 980 in metrics that matter.


It's not that simple; often the thorough benchmarks will be the ones suggested by the manufacturer, or even designed and written by the manufacturer.

As it happens, this is precisely Anandtech's problem. They have a long legacy of printing whatever toilet paper manufacturers/vendors ship them: https://news.ycombinator.com/item?id=22241800

>Intel tactic was manufacturing facts and positive press stories, something we now call fake news. ...

Intel SSE also got a big push with fake 3D acceleration claims https://www.vogons.org/viewtopic.php?f=46&t=65247&start=20#p...

Intel version: "At the time, I was working for Intel and was involved in the launch of the Pentium 3, aka Katmai.

We _engaged_ a number of games manufacturers to provide demos showcasing not only Screaming Sindy's Extensions, but the arcane and mysterious Katmai New Instructions.

One such outfit was Rage Software, now sadly deceased. Rage provided demos of Incoming and an early prototype of a game called Dispatched, which as far as I know never actually saw the light of day. Dispatched featured a strangely-arousing cat riding a jet powered motorcycle. The first version I saw was running on a 400MHz Katmai and was still in wireframe. It was bloody impressive."

Reality, according to hardware.fr: "Let's start with Dispatched. This is a Rage Software game due out in late '99, whose demo Intel showed at Comdex Fall to highlight the benefits of SSE. Of particular interest, it is possible to enable or disable the use of the SSE instructions at any time.

Nothing to say in terms of speed; it runs considerably faster once SSE is activated, +50% to +100% depending on the scene! But looking closely at the demo, we notice, as you can see in the screenshots, that the _SSE version is less detailed_ than the non-SSE version (see the ground). Could Intel be trying to pull the wool over journalists' eyes?"

The SSE version is less detailed? How convenient! Rage Software's Dispatched never came out. The only outfit, other than Intel, in possession of this software was Anandtech. They used this exclusive press access to pimp Pentium 3 benchmarks, manufacturing fiction like this: https://images.anandtech.com/old/cpu/intel-pentium3/Image94....

>so it was pretty much an Intel commissioned demo piece to showcase P3 during Comdex Fall, and was cheating with details. Two other SSE patched games mentioned on hardware.fr actually ran slower with SSE

Anandtech used an Intel-commissioned piece of fake software to lie to its readers about SSE for a couple of years, during the time Intel was getting beaten by AMD and paying bribes to vendors.


That was 21 years ago.



And what's wrong with that?

There are differences between Reporting and Benchmarking / Reviewing.


>Update: As our readers were quick to pick up on from Intel's full press release, Intel is using faster LPDDR4X on their Ice Lake-U system. This is something that was not disclosed directly by Intel during their pre-Computex presentation.

Anandtech is perfectly happy to print Intel PR without trying to verify it or apply common-sense checks.


For anyone who has a very new motherboard/chipset with proper PCIe 4.0 support and wants to add a lot of NVMe M.2 storage (in addition to the on-motherboard M.2 slots), take a look at this:

https://www.asus.com/ca-en/Motherboard-Accessories/HYPER-M-2...

It costs about $70 USD.

In a single-socket Ryzen Zen 3 workstation you'd typically be able to use one of these: one PCIe 4.0 x16 slot for your video card, one slot for this.


Just curious for my own sake—is there a board out there with two electrically PCIe 4.0 x16 slots? I was under the impression that Ryzen Zen 3 has just 24 PCIe lanes, and usually four are split out for the chipset, so how could you run both your x16 video card and x16 M.2 carrier simultaneously without linking one at a lower rate?


Yes, you're absolutely right. Virtually every motherboard has two "x16" slots, but only the first is actually x16; the one on the far side from the CPU is electrically just x4 and is often mutually exclusive with the first M.2 slot, unless you're building Threadripper/EPYC or Xeon HEDT/server. I didn't realize any of that and am enjoying free salt.


It's not quite that bad. Lots of motherboards connect the primary x16 to two slots, which can either be used as x16+empty or x8+x8. And x8 is quite a lot at 4.0 speeds.


Threadripper has 64 lanes (except 3990x, which has 88), and I think most motherboard for it have 2 PCIe 4.0 x16 slots. EPYC has even more lanes.


The chipset's only fed by 4 lanes, so while it technically could be fully connected to a second x16 slot it would be kind of crazy for a motherboard maker to actually do that since the chipset only supplies 16 lanes total.

Realistically you need to give this card your main slot and your GPU can get an x4 connection and lose a few fps.


The Hyper M.2 requires a motherboard that supports PCIe bifurcation. If yours doesn't, you'll need a card with a built-in switch, like an Ableconn PEXM2.


Fortunately, AMD seems willing to enable bifurcation on all their desktop platforms instead of using Intel's strategy of treating it as a product segmentation thing. Quad M.2 risers work in both my B550 and X570 system.


Artificial "product segmentation" is such nonsense. If you're able to put in the extra effort of disabling features that you just get for free otherwise, and make more profit by doing so that's just a sign that the market you're operating in is severely lacking in competition. It's good to finally see AMD giving Intel a run for their money.


Eh, it's like how non-K Intel processors technically can be overclocked*, in that fewer features can mean you get the same CPU but cheaper.

*: I know there are other reasons for non-K processors (binning might actually mean your non-K CPU can't overclock), but it's largely just a way to move more product and sell at a lower price point to those who can't afford unlocked chips or otherwise wouldn't have overclocked.


If Intel can disable a feature and sell something at a lower price, then they could also sell the same thing at that lowered price without the feature disabled. The only reason they don't is because they exist in a patent encumbered, (very) competition light environment.

(Binning because of imperfect production yields is different though and I don't have a problem with that. It's hard to tell how much of the K/non-K split is legitimate binning, and how much is just a bool in the microcode).


That didn't stop AMD from disabling PCI Express Resizable BAR support, hidden behind the marketing name "Smart Access Memory (TM)", on older "chipsets", despite both the memory and PCIe controllers being built INTO THE CPU...

https://www.amd.com/en/technologies/smart-access-memory

Or that time X470 was going to support PCIe 4.0, but then it was made X570-exclusive.


Resizable BAR support was added in the latest BIOS update of my B450 motherboard (that also added Zen 3 compatibility).


I had no idea this kind of riser was possible. That solves a problem I have at the moment.


I'm concerned about heat if that little fan goes. I know the storage throttles, but I would imagine 90C+ flash memory is not good for its long-term durability.

I'd like to see a proper finned heatsink that benefits from case airflow.


I think the fan and enclosure are a bad idea, personally. You'd achieve better cooling on the M.2 cards by leaving the enclosure off and having a case with sufficient front-to-rear airflow, and possibly gluing your own small aluminum heatsinks onto the M.2 cards (cheap and easy to do) if they don't come with their own.


It depends. Because NAND flash storage relies on electron tunneling, you actually need high temps for the mechanism to work well. Too much cooling and you'd be increasing the wear levels prematurely.

Not sure if this thing is smart enough to know when to kick the fans on and when to keep the temps "high". That is something worth noting though.


There was even technology announced for self-healing NAND flash: drives that could cook themselves to 800C to heal worn-out cells. Never materialized though :(

https://www.extremetech.com/computing/142096-self-healing-se...

https://hardware.slashdot.org/story/12/12/02/2222235/self-he...

https://ieeexplore.ieee.org/document/7524803


https://www.eeweb.com/industrial-temperature-and-nand-flash-...

I found this, which says lower temperatures are better for storage but higher temperatures are better for writes. I'm not sure how accurate that all is, though.


Wait! There is commodity hardware that relies on electron tunneling?!




The enclosure itself is a giant heatsink [1] so I imagine it is fine if you only run 2 NVMe SSDs in this with a bit of case flow.

[1] https://youtu.be/ApXsJkUKDGI?t=341


Not really. Each card would only get 8 lanes, so no more than two NVMe drives would work in the storage card, and only if the board has bifurcation support.

For PCIe 3.0 I can recommend a card with a Broadcom PEX 8724 switch. It works well with 4 drives and is not much more expensive, though the fan is horrible: https://www.aliexpress.com/item/1005001782779648.html


Not sure why this is downvoted. Almost all (if not all?) consumer motherboards will drop the two x16 slots to x8 mode if you insert anything in the second slot, so using a PCIe riser for NVMe SSDs means your GPU will get less bandwidth. This may or may not be noticeable: if you are running PCIe 4.0, x8 will definitely not bottleneck the GPU, but x8 PCIe 3.0 lanes probably will, especially for cards with less VRAM.


Good benchmark suite, but it's missing two big PCIe 4.0 SSD contenders: the WD Black SN850 and the Sabrent Rocket 4 Plus!


Part of my goal with this article was to take a fresh look at drives that had been through the old test suite, and be able to refer to the older results as a sanity check for the new results. (I found at least two bugs in FIO and one in Windows while working on this.) I have tested the SN850 on the new suite, but haven't written up a proper review yet: https://www.anandtech.com/bench/product/2732 I should also be getting an equivalent to the Sabrent Rocket 4 Plus soon.


Love it, thank you — the article already had my upvote! In particular SLC cache-aware tests and the inclusion of latency information (both average and 99th percentile) were both heartening to see. Too much crappy storage benchmarking these days.

Looking forward to seeing what you pick in terms of application benchmarks. I think application benchmarks help to keep us honest in terms of real-world impact of storage upgrades. LTT seeing if people could tell the difference between SATA and NVMe for games was great. I'm interested in database workloads, and the reason I was asking about the SN850 is that in Postgres benchmarks I've seen it has smoked the 980 Pro.


And the Corsair MP600 too.


The original Corsair MP600 is equivalent to the Silicon Power US70 in the article, plus a heatsink: Phison E16 controller with TLC NAND. Corsair's newer MP600 PRO with the Phison E18 controller has been announced but isn't quite available yet.


It's been a while since I bought a motherboard. SATA (6Gb/s?) was a godsend at the time: smaller cables and connectors, and nicer to keep tidy. It's a pet peeve of mine to keep things looking nice, especially with stacked hard drives in the bay. Back then it was pretty clear what you might do with the slots available.

Could someone fill me in on PCIe? What slots are available, and are they mainly used now for graphics and SSDs? SSDs in slots are new to me; I haven't gotten around to that yet.


PCIe slots come either from the CPU or the Chipset.

Typically, there are 16 lanes used per GPU and 4 per M.2 SSD.

With older CPUs, you had to either reduce the lanes assigned to the GPU to connect an SSD to the CPU, or go through the chipset.

With newer CPUs (the newest Intel, or AMD Zen), the CPU provides 20 lanes (16+4) for general use and, in the case of AMD, 4 additional lanes for the chipset.

If you want more lanes connected directly to the CPU, you have to go to HEDT or Server Chips with up to 128 lanes available.

See this article for a typical block diagram of an AM4 system.

https://www.guru3d.com/news-story/amd-ryzen-3000-new-block-d...

Here you can see there are 1x16 / 2x8 lanes for the GPU, plus 4 PCIe/SATA combo ports from the CPU which can be used for connectivity.

More lanes can be provided by the PCH (Chipset).

Another example is in this article by AnandTech, which even lists the slots explicitly in the block diagram: https://www.anandtech.com/show/14657/the-asus-pro-ws-x570ace...
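(To put rough numbers on those lane counts, a small sketch using the usual published per-lane figures after encoding overhead; treat the exact values as approximations.)

    # Approximate usable PCIe bandwidth per lane, per direction, in GB/s
    PER_LANE_GBPS = {3: 0.985, 4: 1.969, 5: 3.938}

    for gen in (3, 4, 5):
        for lanes in (4, 8, 16):
            bw = PER_LANE_GBPS[gen] * lanes
            print(f"PCIe {gen}.0 x{lanes}: ~{bw:.1f} GB/s")
    # e.g. PCIe 4.0 x4 is ~7.9 GB/s, which is why one M.2 SSD gets 4 lanes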


The block diagram helped! Great summary. I did some additional reading, as I'd forgotten all about the PCH, and came across southbridge/northbridge, which sounded familiar. As I understand it (please correct me), summarizing part of what you wrote: 16 lanes for the GPU and 4 lanes (usually to the PCH), which can be thought of as an extension? I'm not sure of the right term, but I guess this provides bus sharing through some mechanism so it does not affect the GPU lanes.


I recently got the Asus Hyper and 4 Sabrent Rocket NVMe disks and put them in RAID 0. The speeds were just stunning, to say the least, at 15GB/s. Getting the drivers to work for the boot drive on Windows 10 was a pain though: AMD's RAIDXpert requires you to manually install the drivers in a specific order while installing Windows for the array to be recognized.


> The speeds were just stunning to say the least at 15GB/s.

That is amazing. That is around DDR4-1866 speeds, and not far from DDR4-2666 (~21 GB/s). At those speeds I would happily work with dataframes sitting on the disk rather than in memory [1, 2]. Did you benchmark RAID 0 with less than four disks?

[1] R: https://github.com/xiaodaigh/disk.frame

[2] Python: https://docs.dask.org/en/latest/dataframe.html
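(In case it helps, a minimal dask sketch of the on-disk dataframe workflow from [2]; the dataset path and column names are made up for illustration.)

    import dask.dataframe as dd

    # Lazily reference a dataset larger than RAM; only the partitions
    # needed for the computation are actually read from the SSD.
    df = dd.read_parquet("/mnt/nvme/events/*.parquet")  # hypothetical path

    # Nothing is read until .compute(); with a multi-GB/s drive the
    # out-of-core scan starts to feel close to an in-memory groupby.
    result = df.groupby("user_id")["amount"].sum().compute()
    print(result.head())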


No. I spent way too much time installing those drivers, as it was not a smooth experience. Will give it a shot! My specs:

3990X, 256GB DDR4-3200, Asus ROG Strix E-Gaming


You're welcome to try, but remember that getting any specific piece of data takes a thousand times longer. Some workloads can handle that, and some can't.


I threw these on a PCIe 2.0 x16 port, and I get 3GB/s, as expected. There isn't anything in between 500MB/s SATA 3 SSDs and multi-gigabyte-per-second, high-capacity NVMe SSDs.

I'm blown away, so what are you all using the even greater throughput for?

The utility is quite slim as is, because of the networking limitations and other I/O. Only certain internal processing and parsing can take advantage of this.


Right now, very few programs take advantage of the speed. Things are just mildly smoother. But soon you should see lots of videogames that can dynamically load massive amounts of data, letting you switch levels in a second or have highly detailed levels that don't need to worry about fitting into memory all at once. (Presumably the latter will be optional for most titles.)

See the first half of https://www.youtube.com/watch?v=fqi09JnJHOo It's partly a comparison against hard drives, but it gets into why you'd want gigabytes per second.


Very surprising performance from Hynix, given their previously low profile.

They are the only PCIe 3.0 contender, yet beat Samsung at almost everything but raw latency and throughput.

They are indeed limited just by the PCIe 3.0 ceiling.


I wish some of these tests tested the performance of durable writes (via FUA writes or cache flush). Some otherwise fast drives are just ridiculously slow for those.
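(For anyone wanting to measure this themselves, a rough Linux sketch: opening with O_DSYNC makes every write durable before it returns, which is the behavior databases rely on. The mount point is hypothetical, and the results will hinge on how the drive handles flushes and whether it has power-loss protection.)

    import os, time

    path = "/mnt/testdrive/syncwrite.bin"   # hypothetical mount point
    buf = os.urandom(4096)
    n = 1000

    # O_DSYNC: each write() returns only once the data is durable
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC)
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, buf)
    os.close(fd)
    elapsed = time.perf_counter() - start
    print(f"~{elapsed / n * 1e6:.0f} us per durable 4 KB write")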


The PCMark 10 Storage tests do include some cache flushes in their traces. I haven't started extracting and graphing those stats, so I don't know if it's enough to say anything meaningful about how a drive would perform on sync-heavy workloads like a database.


Maybe someone can give me a good guesstimate here. With the new direct memory access to GPUs (like RDMA), is there a big advantage to using PCIe 4.0 storage? I saw on an LTT episode that for the most part gamers didn't notice a difference with PCIe 4.0 (though RDMA wasn't out yet), but I'm wondering if there will be a bigger performance boost for more computational workloads.

tl;dr: if I'm buying a new machine for a lot of GPU work, do I get a PCIe 4.0 storage system or 3.0?


For existing software, there is little to no difference.

Both Sony and Microsoft are talking big about the storage subsystems of their new consoles, and how they're going to enable entirely new things for gaming in the future, such as low-latency direct requests to flash from the GPU. This is likely to spill into the PC ecosystem during this console generation, both for games and GPU compute.

Personally, I would not purchase a motherboard that didn't support PCIe 4.0 anymore, but I would not worry about getting the fastest possible drive to plug into it. The idea behind this being that I am almost certainly going to expand/replace the drive before it matters, but I'm quite likely to still be using the motherboard at that time.


So what about computational work, like using PyTorch?


It's good that some tests are done with the drive 80% full.


TLDR: a good PCIe 3.0 drive is often better than most PCIe 4.0 drives.

The Hynix P31 seems to be a good choice overall if you are looking for an NVMe drive (speed, energy efficiency, 5-year warranty).


In my experience, on most SSDs/NVMe drives, the firmware GC destroys performance under a heavy read/write load, to the point of being no better than a mechanical HD of yore. Except Samsung and Intel. But Intel SSDs have bricked on me a lot, so now I just use Samsung NVMe drives exclusively.


This article tests that. Read down to the "Sequential Drive Fill" section.


I would love to see some tests of how SLC caches interact with dm-crypt / mdadm / LVM. There are a lot of old rumors that I believe are outdated by now.


Probably depends on whether you configure your storage layers to pass through TRIM commands.
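(A quick way to check on Linux, as a sketch: each block layer advertises discard support in sysfs, so you can compare the raw NVMe device against the device-mapper node stacked on it. Note that dm-crypt only passes discards through when the mapping is opened with --allow-discards. Device names here are examples.)

    from pathlib import Path

    def supports_discard(dev: str) -> bool:
        # discard_max_bytes > 0 means this layer will pass TRIM down
        p = Path(f"/sys/block/{dev}/queue/discard_max_bytes")
        return int(p.read_text()) > 0

    # Example: raw drive vs. the dm-crypt/LVM mapping on top of it
    for dev in ("nvme0n1", "dm-0"):
        print(dev, supports_discard(dev))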


yes, that would be my guess, too


The version of the explanation I have read is that it's the SLC cache getting saturated (the same deal as the CMR cache region in DM-SMR drives, except that reads/writes can parallelize), and that it won't happen with a large enough cache and fast, parallel writes. So the explanation differs, but the solution sounds right.


Interesting. Can you elaborate a bit on the problem? A perf downgrade to the level of an HDD is crazy.



