Flash is coming to the data center. Contrary to the perception of 18 months ago, this now seems to be accepted as common knowledge. There is still much discussion around what that flash will look like and in what form it will be consumed. I plan to write a series of blogs describing the unique challenges involved in building large flash Memory Arrays and some of the decisions made along the road. A good place to start is with the "let's build it myself" group of folks and the challenges they will encounter.
The very adventuresome amongst you might start with, "I'll just make my own".
Roll your own
I can't tell you how many times I've heard people, whether end users or large system vendors, say they could just roll their own flash memory storage. They wouldn't think of building their own disk drives, DIMMs or motherboards, yet somehow building their own seems like a perfectly reasonable idea when it comes to flash. Why do they think making their own is a good idea? Two top reasons: price and performance. In the old days, flash was expensive and vendors sold it at very steep premiums, so obviously the answer was to make your own, cut out the middleman and save money! Let's be honest: unless your name is Steve (or now Tim), or you happen to be partnered with a major flash supplier, you can't just go out on the spot market and buy flash in volume, with a reliable supply, at a price that ever makes building your own drive make sense. You would also have to learn how to manage the flash: advanced error correction, making the flash survive power failures (something even today's large vendors have trouble with) and handling the highly variable performance of consumer-grade MLC. Oh, you didn't know that when the MLC data sheet says "typical", the metric in question can range from 3x to 1/3 of that "typical" value? Does that complicate things? It "typically" does.
Sure, MLC benchmarks at the advertised speed, for a few minutes. Then the garbage collection (grooming, page reclamation, flash block cleanup…) starts, and the drive falls off the write cliff. All that bus bandwidth promised to you starts being used by the drive to perform garbage collection, which means your performance goes down drastically and your price/performance numbers go up along with your frustration levels. You can make your own flash storage, but you will have to garbage collect once all the empty space has been written to, just like everyone else does. Garbage collection is necessary while the drive is busy so that there are always cleared flash blocks available for the application to write to. When this happens, the bandwidth available to the user decreases because the drive is spending its time doing reads, writes and, more importantly, erases, all of which take time. A flash read nominally takes 90 microseconds, a write can take a millisecond or more, and an erase takes 5 to 20 milliseconds. While the SSD is busy erasing a block, the entire flash die holding that block cannot be read from, so if you want to read a page of flash from a die being erased you may have to wait 10 milliseconds or more for your data. Imagine the number of 90 µs reads that might queue up while those 10 milliseconds count down to zero – think that might generate a latency spike at the application level?
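To put rough numbers on that blocking effect, here is a minimal back-of-the-envelope sketch (my own illustrative model, not any vendor's firmware logic) using the latencies from the text: a read that lands on a die mid-erase has to wait out the remaining erase time before it can be serviced.

```python
# Illustrative model of erase blocking on a flash die.
# Latency figures are the nominal values cited in the text;
# real parts and firmware vary widely.

READ_US = 90        # nominal page read latency
ERASE_US = 10_000   # a 10 ms block erase (erases run 5-20 ms)

def read_latency_us(time_into_erase_us: int, erase_in_progress: bool) -> int:
    """Latency seen by a single read arriving at one die."""
    if erase_in_progress:
        remaining = ERASE_US - time_into_erase_us
        return remaining + READ_US   # wait out the erase, then do the read
    return READ_US

best = read_latency_us(0, False)   # 90 us: die is idle
worst = read_latency_us(0, True)   # 10,090 us: read arrives just as an erase starts
print(f"worst/best latency ratio: {worst / best:.0f}x")
```

A read unlucky enough to arrive just behind an erase sees latency two orders of magnitude worse than the nominal figure, and every read queued behind it inherits that wait.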
Latency spikes are far more important than the loss of throughput due to the write cliff. Most benchmarks just throw as many outstanding IOs at a drive as the specified queue depth of the test allows. For a big enough queue depth of *unrelated* reads, the additional latency doesn't appear to affect the performance. But real applications are not like benchmarks: when an application reads data from storage, it often issues a handful of reads and then has to wait for a response before it can issue the next set. Think about file systems or database search indexes – those data structures are traversed based on the result of any given read, and that read points to the next location to be read. Most interesting applications are like this. If the latency of a single read is very low (say 90 µs), then even with only a single outstanding read an application can deliver 10,000 IOPS; but if the application's reads keep getting stuck behind erases, then in the worst case it might only get 100 IOPS. That is a factor-of-100 loss of performance. Now, will erase blocking always be this bad? No, "typically" it will not be, but sometimes it will, meaning you will have totally unpredictable application performance.
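The arithmetic behind those IOPS figures is simple enough to show directly. This sketch (purely illustrative) models a pointer-chasing workload with one outstanding read, where each read's latency determines when the next read can be issued:

```python
# Pointer-chasing workload, one outstanding read at a time:
# throughput is simply the reciprocal of per-read latency.

READ_US = 90          # unobstructed flash read latency
BLOCKED_US = 10_000   # read stuck behind a 10 ms erase

def iops(latency_us: float) -> float:
    """IOPS achievable with a single outstanding dependent read."""
    return 1_000_000 / latency_us

print(f"unblocked: {iops(READ_US):,.0f} IOPS")    # ~11,000
print(f"blocked:   {iops(BLOCKED_US):,.0f} IOPS") # 100
```

With deep queues of independent reads, erase stalls hide inside the queue; with dependent reads, every stall lands squarely on the application's critical path, which is why the benchmark numbers and the application experience diverge so badly.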
So when people complain that existing SSDs don't deliver on their promised performance, it is usually due to the write cliff and ongoing garbage collection. The performance loss for streaming applications, and for almost all enterprise applications, is the result of the large, unpredictable spikes in read latency inherent in a poor garbage collection strategy (and implementation).
Now, if I were in marketing, after telling you that everyone has to do garbage collection, I would tell you how we at Violin have developed some magical technique to make garbage collection go away, and then you would justifiably ignore the rest of what I have to say. But I take the T in CTO seriously, so I'm not going to say we don't have to do garbage collection; we do. So why don't we have a write cliff? Well, you'll have to read the rest of the future blogs to find out.
That being said, we do have an advanced, patent-pending technology (vRAID) that makes those giant latency spikes (due to erases) completely disappear under *all* circumstances, which some might consider sufficiently advanced to be indistinguishable from magic. Certainly our customers find the resulting "predictable performance" application acceleration they experience using our Memory Arrays to be pretty magical.