The Problem with “Always On” Deduplication

by VIOLIN SYSTEMS on October 1, 2014

What is "Always On" Deduplication?

Deduplication is the process of removing redundant blocks of data. Several all-flash array vendors provide deduplication that can't be turned off. That's the "always on" approach. You might ask, what problem is "always on" trying to solve? That's harder to explain. It's solving an architectural problem for all-flash arrays.
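
Conceptually, block-level deduplication fingerprints each incoming block and stores only the blocks it has never seen before. Here is a minimal sketch of the idea, assuming fixed-size blocks and a content hash as the block identity (the block size and hash choice are illustrative, not any vendor's actual design):

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

class DedupStore:
    """Toy content-addressable store: identical blocks are written once."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block data (the "physical" store)
        self.volume = []    # logical volume: ordered list of fingerprints

    def write(self, data: bytes) -> None:
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            # Only never-before-seen blocks reach the physical media.
            self.blocks.setdefault(fp, block)
            self.volume.append(fp)

    def physical_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())

store = DedupStore()
store.write(b"A" * BLOCK_SIZE * 3)   # three identical blocks arrive...
print(len(store.volume), "logical blocks,",
      len(store.blocks), "physical block(s)")  # 3 logical, 1 physical
```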

Flash storage has grown popular for delivering high performance in a world increasingly intolerant of mechanical disk drive latency, but it brings challenges of its own. One of those challenges is flash cell wear-out, a characteristic of the media that no all-flash array vendor can avoid. A NAND flash cell can sustain only a fixed number of writes before its performance degrades significantly, and at some point the cell becomes unusable. To make flash work as intended, the architecture must extract the longest possible life from each cell without imposing too much cost or performance overhead. That is why every flash controller includes a function to manage the life of flash cells.
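
As an illustration of that kind of management, here is a minimal sketch of one classic controller technique, wear leveling, which spreads writes across cells so that no single cell wears out early (the policy shown is a simplification for illustration, not any vendor's actual firmware):

```python
class WearLeveler:
    """Toy wear-leveling policy: always write to the least-worn block."""

    def __init__(self, num_blocks: int, endurance: int):
        self.erase_counts = [0] * num_blocks
        self.endurance = endurance  # rated write/erase cycles per block

    def next_block(self) -> int:
        # Direct each write to the block with the fewest erases so far,
        # spreading wear evenly instead of burning out a few hot blocks.
        block = min(range(len(self.erase_counts)),
                    key=self.erase_counts.__getitem__)
        self.erase_counts[block] += 1
        return block

    def worst_wear(self) -> float:
        return max(self.erase_counts) / self.endurance

wl = WearLeveler(num_blocks=8, endurance=3000)
for _ in range(10_000):
    wl.next_block()
print(f"max wear after 10k writes: {wl.worst_wear():.1%}")  # evenly spread
```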

Deduplication came of age in the 1990s, originally for backups: writing less data to tape reduced both the storage needed for backups and the time needed to fit within a backup window. The same technology was later applied in a new context, as a way of managing flash wear by reducing write traffic. If you always deduplicate data before writing it, you write less and the flash lasts longer. Hence the "always on" option, which is commonly adopted to manage flash resiliency, not because the applications necessarily benefit from deduplication.
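
The arithmetic behind that claim is simple. A back-of-the-envelope sketch, assuming flash wear scales directly with bytes written:

```python
def lifetime_multiplier(dedup_savings: float) -> float:
    """If dedup removes `dedup_savings` of incoming writes, the media
    absorbs (1 - savings) of the traffic, so its useful life stretches
    by a factor of 1 / (1 - savings)."""
    return 1.0 / (1.0 - dedup_savings)

for savings in (0.0, 0.50, 0.90):
    print(f"{savings:.0%} write reduction -> "
          f"{lifetime_multiplier(savings):.1f}x flash life")
# 0% -> 1.0x, 50% -> 2.0x, 90% -> 10.0x
```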

The "Always On" Deduplication Approach

All-flash array vendors who use solid state disks (SSDs) depend on their SSD vendor to provide flash management in the controller on each SSD. The array's own controller software can then use deduplication to reduce the number of writes, extending the life of the flash in the array. But using deduplication to manage flash resilience has collateral effects. Some applications make very good use of deduplication: virtual desktop infrastructure (VDI) can reduce the amount of storage required by up to 90%, and with it the wear on the flash, and virtual server infrastructure (VSI) can also benefit greatly, saving up to 65% of the storage along with the corresponding flash wear. However, not all applications are a good fit for deduplication.
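
To put those savings figures in perspective, here is a quick sketch of effective capacity using the percentages above (the 10 TB array size is just an illustration):

```python
def effective_capacity(physical_tb: float, savings: float) -> float:
    """Logical data a deduplicating array can hold per unit of flash."""
    return physical_tb / (1.0 - savings)

for workload, savings in (("VDI", 0.90), ("VSI", 0.65)):
    print(f"{workload}: 10 TB physical ~ "
          f"{effective_capacity(10, savings):.0f} TB logical")
# VDI: 10 TB physical ~ 100 TB logical
# VSI: 10 TB physical ~ 29 TB logical
```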

The Problem with the "Always On" Deduplication Approach

Databases are an example of an application that deduplication fits poorly. There is some small space benefit in deduplicating a database, though nowhere near the benefit seen with VDI or VSI. The bigger problem is the way database systems store their data. A relational database such as Oracle has essentially no duplicate data blocks, because each block in a tablespace (the logical container in which tables and indexes are stored) carries a unique key at the start and a checksum incorporating part of that key at the end. As a result, most shops will see little space saving, while paying the price of increased latency as the hardware pointlessly searches for matching blocks.
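
A toy illustration of why such blocks never match (the block layout below is a deliberate simplification for demonstration, not Oracle's actual on-disk format): give every block a unique key in its header and a key-derived check value in its tail, and block fingerprints never collide even when the payload is identical.

```python
import hashlib
import struct

def make_block(block_no: int, payload: bytes) -> bytes:
    # Simplified stand-in for a database block: unique key up front,
    # key-derived check value at the end (not Oracle's real layout).
    header = struct.pack(">Q", block_no)
    tail = hashlib.md5(header + payload).digest()[:4]
    return header + payload + tail

payload = b"identical row data" * 100  # same payload in every block
fingerprints = {hashlib.sha256(make_block(n, payload)).hexdigest()
                for n in range(1000)}
print(len(fingerprints))  # 1000 blocks, 1000 distinct fingerprints:
                          # nothing dedupes despite identical payloads
```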

It's possible to achieve some deduplication by storing copies of the same database on one array (though this is a better use case for space-efficient snapshots), by deduplicating unallocated space (really a use case for thin provisioning), or by accidentally deduplicating critical entities, which is actively undesirable: Oracle deliberately stores multiple copies of key files such as redo logs and control files, and collapsing them defeats that protection. The reality is that deduplication has no place in a tier-one database.

Another workload that doesn't usually make sense for deduplication is encrypted data. Encryption produces, by design, an effectively unique data stream, so deduplication only adds latency. Think of credit card numbers, a workload commonly encrypted, and you'll see why it has low affinity for deduplication: the same number encrypted twice yields two different ciphertexts. To deduplicate encrypted data, the storage system must have access to the unencrypted data so it can identify duplicates, which implies the encryption cannot be performed within the application. Any storage-side processing of encrypted data needs to be architected very carefully to preserve the security of the data.
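
A sketch of the effect, using a deliberately toy stream cipher for illustration (a real system would use a standard cipher such as AES, but the outcome for deduplication is the same): encrypting identical plaintext with a fresh nonce each time yields ciphertexts that never match.

```python
import hashlib
import os

def toy_encrypt(key: bytes, nonce: bytes, block: bytes) -> bytes:
    """Toy stream cipher: keystream = SHA-256(key + nonce + counter).
    Illustrative only; do not use for real encryption."""
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(block):
        keystream += hashlib.sha256(
            key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(x ^ y for x, y in zip(block, keystream))

key = os.urandom(32)
plaintext = b"4111-1111-1111-1111"  # same card number stored many times
ciphertexts = {toy_encrypt(key, os.urandom(16), plaintext)
               for _ in range(1000)}
print(len(ciphertexts))  # 1000: identical plaintexts, zero duplicates
```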

The Violin Solution: A Different Approach

Violin takes a fundamentally different approach to flash resilience: because we control the flash architecture, we don't need to rely on deduplication to manage it. Violin's Flash Fabric Architecture™ (FFA) works directly with the flash die, so resilience can be managed at the array level. This allows us not only to avoid the hot spots that cause early SSD burn-out, but also to gain performance from the parallelism built into the architecture. When writes and flash management (including garbage collection) are handled at the array level, the life of the flash is extended and the latency spikes of individual SSDs going through garbage collection are avoided.
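
Here is a toy model of that latency argument, with made-up timings and a deliberately simplified scheduling assumption (this illustrates the general idea, not Violin's actual algorithm): if each device collects garbage independently, a striped read stalls whenever any device in the stripe is busy collecting; if collection is coordinated at the array level and kept out of the host I/O path, the tail-latency spikes disappear.

```python
import random

DEVICES, GC_PROB, GC_PAUSE_US, READ_US = 8, 0.02, 2000, 100

def stripe_latency(coordinated: bool) -> int:
    # Each host read touches every device in the stripe.
    if coordinated:
        # Array schedules GC so it never sits in the host's I/O path.
        return READ_US
    # Independent SSD-level GC: any busy device stalls the whole stripe.
    stalled = any(random.random() < GC_PROB for _ in range(DEVICES))
    return READ_US + (GC_PAUSE_US if stalled else 0)

for mode in (False, True):
    lat = sorted(stripe_latency(mode) for _ in range(100_000))
    print("coordinated" if mode else "independent",
          "p99.9 latency:", lat[int(len(lat) * 0.999)], "us")
```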

[Figure: deduplication dashboard]

When you place a workload that can benefit from deduplication, such as VDI or VSI (about 14% of most data center workloads), on the array, Violin provides granular control so you can take advantage of the reduced writes. When a workload will see small or no benefit, such as databases or encrypted data, you can turn deduplication off. You decide whether to use deduplication for a given workload. Violin does not implement "always on" deduplication, because it's not always a good idea from an application point of view, and Violin doesn't need it because our FFA provides a better way to preserve flash resilience. In fact, we provide a deduplication dashboard that shows the effective deduplication rate, so you can see what's working and what isn't. If deduplication is paying off for a workload, you might go looking for similar workloads to deduplicate; if it isn't, turn it off and make room for something that will benefit from the technology.
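
As a sketch of the kind of decision the dashboard supports (the workload names, figures, and threshold below are invented for illustration):

```python
def dedup_ratio(logical_gb: float, physical_gb: float) -> float:
    """Effective deduplication rate: logical data per physical byte."""
    return logical_gb / physical_gb

# Hypothetical per-workload figures read off a dedup dashboard.
workloads = {
    "vdi-pool":    (5000, 520),    # logical GB, physical GB after dedup
    "oracle-prod": (1200, 1180),
    "vsi-cluster": (3000, 1100),
}

THRESHOLD = 1.5  # illustrative cutoff: below this, dedup isn't paying off
for name, (logical, physical) in workloads.items():
    ratio = dedup_ratio(logical, physical)
    action = "keep dedup on" if ratio >= THRESHOLD else "turn dedup off"
    print(f"{name}: {ratio:.1f}:1 -> {action}")
```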

Violin provides you the tools to get the most from deduplication technology. You turn it on when there is a benefit. You turn it off when there is none.

Why would you do it any other way?

To see what IDC thinks about inline deduplication, see their white paper on the topic here:

IDC White Paper: Why Inline Data Reduction Is Required for Enterprise Flash Arrays

To see the Violin deduplication solution, check out the Concerto 2200 here:

http://www.violin-memory.com/products/concerto-2200-data-reduction-appliance/