Click here to close now.

Welcome!

Linux Authors: Lori MacVittie, Carmen Gonzalez, Aria Blog, Roger Strukhoff, Elizabeth White

Related Topics: Virtualization, MICROSERVICES, .NET, Linux, Security, SDN Journal

Virtualization: Article

Data Efficiency at Scale

Overcoming limitations in data efficiency features

The initial wave of data efficiency features for primary storage focus on silos of information organized in terms of individual file systems. Deduplication and compression features provided by some vendors are limited by the scalability of those underlying file systems, essentially the file systems have become silos of optimized data. For example, NetApp deduplication can't scale beyond a 100 TB limit, because that's the limit in size of its WAFL file system. But ask anyone who's ever used NetApp deduplication if they've done it on a 100 TB file system, and you're likely to hear "are you crazy?" It's one thing to claim that data efficiency features can scale, quite a different one to actually use them with performance at scale.

Challenges around scalability generally center on two areas: scalability of random IO and memory overhead. Older solutions, like the one from NetApp, face the first challenge while newer flash-based storage systems are struggling with the second. I'll review both here:

The IO Challenge
Primary data-oriented storage devices handle both streaming and random throughput and therefore are sensitive to latency effects. Data efficiency requirements for primary storage must have fast hashing techniques to reduce the impact of latency. Fast hashes are non-cryptographic in nature and so require data comparison when used to do deduplication. It works like this:

  1. When a new chunk of data is read in it is first given a name using the hash algorithm.
  2. The system then checks a deduplication index to see if a chunk with that name has been seen before (note that this can consume disk IO and tremendous amounts of memory if done wrong).
  3. If the name has been seen we need to take extra steps. Because fast hashes are non-cryptographic, it is possible to have a name match while the data content differs. This is known in computer science as a hash-collision. To account for this, the existing copy of the chunk must be read in and compared bit-by-bit to the new. If they match, only a reference to the chunk is created. If not, then the new chunk must be written.

Essentially, this form of deduplication means trading a write of a duplicate chunk for a read. Depending on the design of the underlying block virtualization layer, duplicate chunks may be widely dispersed throughout the system. In that case, the bigger the system gets, the more expensive reads get - so processing of duplicate data becomes slower and slower as the storage system fills - this is why you won't find many 100 TB NetApp file systems with deduplication turned on. Certainly not for primary storage applications, the system would be flooded with random read requests and NetApp's deduplication process can end up taking months, years or even never complete.

A number of techniques have been used to reduce the impact of IO in other products. For example, the Hitachi NAS (HNAS) and Hitachi Unified Storage (HUS) solutions from HDS make use of hardware-acceleration to generate cryptographically secure hashes that do not require a data compare at all - this allows for linear scaling of deduplication performance on volumes up to 256 TB in size. Data is also written out before it is deduplicated to avoid introducing any latency through the hash computation process itself.

Permabit's own Albireo Virtual Data Optimizer (VDO) product, a plug-in module for Linux-based storage solutions, takes a different approach but with a similar result. VDO works inline to provide immediate data reduction. When data is written out, the VDO process intelligently lays it out in a sequential pattern, so that subsequent read compares of duplicates are more likely to be sequential as well. Both solutions do a fine job at solving the problem in real world scenarios, they just take different approaches.

The Memory Challenge
Many of today's flash array vendors are providing deduplication using similar fast hashing techniques to what I outlined above. With flash, the cost of doing random reads for read compares is a non-issue (random seeks on flash are much less expensive than for hard drive environments) so the use of the fast hash alone is enough to minimize latency. These systems (such as EMC's recently launched XtremIO product) are focused on delivering performance and the big challenge to performance at scale is available memory (DRAM). As above, after chunks are read in, they are named using a fast hashing algorithm. After that, the flash system must determine whether or not a chunk has been seen before. To get at this information as quickly as possible, flash-based storage systems have tended to use huge amounts of DRAM to cache chunk names in memory. It's not uncommon to see flash storage systems that allocate 16 GB of working cache per TB of storage. To support a 256 TB storage volume, such a system would require a TBs of DRAM. The increased hard costs in terms of more expensive (denser) DIMMS, as well as the increased cost of the server board required to support this many DIMMs combine to make this an extremely costly and unpopular proposition. Combine this with the fact that DRAM prices are not falling at the same rate as flash prices, and you can see why no vendor today makes a 256TB flash storage array with global deduplication capabilities.

The solution to the memory challenge is coming, in the form of a next generation of flash storage products that utilize Albireo indexing and Albireo VDO. Unlike the flash arrays described above, flash-optimized arrays with VDO takes advantage of advanced caching techniques to operate with 128 MB of working cache per TB of storage and deliver excellent performance. With VDO, a 256 TB system can be delivered with as little as 32 GB of RAM while delivering 1M IOPS performance. The net result is a cost effective and easily deployed data efficiency solution for flash arrays.

Conclusion

Deduplication Scalability by Vendor

As you can see in the table above, forward thinking vendors like HDS have done a good job at overcoming limitations in their data efficiency features and have products on the market today that can scale to meet the requirements of the large enterprise. Many other vendors are lagging behind, because of their inability to address IO and/or memory requirements, a serious downfall since data efficiency is at the core of distinguishing storage solutions, a critical end user requirement, and a ‘must have' component for 2014. Permabit's VDO product overcomes both of these limitations through the use of advanced memory-efficient caching techniques.

More Stories By Louis Imershein

As Senior Director of Product Strategy at Permabit Technology Corporation, Louis Imershein is responsible for product evolution and strategic planning for the Albireo family of products. He has 22 years of technical leadership experience in product management, software development and support. Prior to joining Permabit, Imershein was a Senior Product Marketing Manager for the Sun Microsystems Data Management Group. He has a Bachelor's degree in Biological Science from the University of California, Santa Cruz.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@ThingsExpo Stories
Internet of Things (IoT) will be a hybrid ecosystem of diverse devices and sensors collaborating with operational and enterprise systems to create the next big application. In their session at @ThingsExpo, Bramh Gupta, founder and CEO of robomq.io, and Fred Yatzeck, principal architect leading product development at robomq.io, will discuss how choosing the right middleware and integration strategy from the get-go will enable IoT solution developers to adapt and grow with the industry, while at the same time reduce Time to Market (TTM) by using plug and play capabilities offered by a robust I...
Containers and microservices have become topics of intense interest throughout the cloud developer and enterprise IT communities. Accordingly, attendees at the upcoming 16th Cloud Expo at the Javits Center in New York June 9-11 will find fresh new content in a new track called PaaS | Containers & Microservices Containers are not being considered for the first time by the cloud community, but a current era of re-consideration has pushed them to the top of the cloud agenda. With the launch of Docker's initial release in March of 2013, interest was revved up several notches. Then late last...
The WebRTC Summit 2015 New York, to be held June 9-11, 2015, at the Javits Center in New York, NY, announces that its Call for Papers is open. Topics include all aspects of improving IT delivery by eliminating waste through automated business models leveraging cloud technologies. WebRTC Summit is co-located with 16th International Cloud Expo, @ThingsExpo, Big Data Expo, and DevOps Summit.
SYS-CON Events announced today that Litmus Automation will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Litmus Automation’s vision is to provide a solution for companies that are in a rush to embrace the disruptive Internet of Things technology and leverage it for real business challenges. Litmus Automation simplifies the complexity of connected devices applications with Loop, a secure and scalable cloud platform.
SOA Software has changed its name to Akana. With roots in Web Services and SOA Governance, Akana has established itself as a leader in API Management and is expanding into cloud integration as an alternative to the traditional heavyweight enterprise service bus (ESB). The company recently announced that it achieved more than 90% year-over-year growth. As Akana, the company now addresses the evolution and diversification of SOA, unifying security, management, and DevOps across SOA, APIs, microservices, and more.
Wearable technology was dominant at this year’s International Consumer Electronics Show (CES) , and MWC was no exception to this trend. New versions of favorites, such as the Samsung Gear (three new products were released: the Gear 2, the Gear 2 Neo and the Gear Fit), shared the limelight with new wearables like Pebble Time Steel (the new premium version of the company’s previously released smartwatch) and the LG Watch Urbane. The most dramatic difference at MWC was an emphasis on presenting wearables as fashion accessories and moving away from the original clunky technology associated with t...
SYS-CON Events announced today that robomq.io will exhibit at SYS-CON's @ThingsExpo, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. robomq.io is an interoperable and composable platform that connects any device to any application. It helps systems integrators and the solution providers build new and innovative products and service for industries requiring monitoring or intelligence from devices and sensors.
After making a doctor’s appointment via your mobile device, you receive a calendar invite. The day of your appointment, you get a reminder with the doctor’s location and contact information. As you enter the doctor’s exam room, the medical team is equipped with the latest tablet containing your medical history – he or she makes real time updates to your medical file. At the end of your visit, you receive an electronic prescription to your preferred pharmacy and can schedule your next appointment.
SYS-CON Events announced today that Solgenia will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY, and the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Solgenia is the global market leader in Cloud Collaboration and Cloud Infrastructure software solutions. Designed to “Bridge the Gap” between Personal and Professional Social, Mobile and Cloud user experiences, our solutions help large and medium-sized organizations dr...
While not quite mainstream yet, WebRTC is starting to gain ground with Carriers, Enterprises and Independent Software Vendors (ISV’s) alike. WebRTC makes it easy for developers to add audio and video communications into their applications by using Web browsers as their platform. But like any market, every customer engagement has unique requirements, as well as constraints. And of course, one size does not fit all. In her session at WebRTC Summit, Dr. Natasha Tamaskar, Vice President, Head of Cloud and Mobile Strategy at GENBAND, will explore what is needed to take a real time communications ...
The world's leading Cloud event, Cloud Expo has launched Microservices Journal on the SYS-CON.com portal, featuring over 19,000 original articles, news stories, features, and blog entries. DevOps Journal is focused on this critical enterprise IT topic in the world of cloud computing. Microservices Journal offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. Follow new article posts on Twitter at @MicroservicesE
SYS-CON Events announced today the IoT Bootcamp – Jumpstart Your IoT Strategy, being held June 9–10, 2015, in conjunction with 16th Cloud Expo and Internet of @ThingsExpo at the Javits Center in New York City. This is your chance to jumpstart your IoT strategy. Combined with real-world scenarios and use cases, the IoT Bootcamp is not just based on presentations but includes hands-on demos and walkthroughs. We will introduce you to a variety of Do-It-Yourself IoT platforms including Arduino, Raspberry Pi, BeagleBone, Spark and Intel Edison. You will also get an overview of cloud technologies s...
The list of ‘new paradigm’ technologies that now surrounds us appears to be at an all time high. From cloud computing and Big Data analytics to Bring Your Own Device (BYOD) and the Internet of Things (IoT), today we have to deal with what the industry likes to call ‘paradigm shifts’ at every level of IT. This is disruption; of course, we understand that – change is almost always disruptive.
SYS-CON Events announced today that SafeLogic has been named “Bag Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. SafeLogic provides security products for applications in mobile and server/appliance environments. SafeLogic’s flagship product CryptoComply is a FIPS 140-2 validated cryptographic engine designed to secure data on servers, workstations, appliances, mobile devices, and in the Cloud.
GENBAND has announced that SageNet is leveraging the Nuvia platform to deliver Unified Communications as a Service (UCaaS) to its large base of retail and enterprise customers. Nuvia’s cloud-based solution provides SageNet’s customers with a full suite of business communications and collaboration tools. Two large national SageNet retail customers have recently signed up to deploy the Nuvia platform and the company will continue to sell the service to new and existing customers. Nuvia’s capabilities include HD voice, video, multimedia messaging, mobility, conferencing, Web collaboration, deskt...
SYS-CON Media announced today that @WebRTCSummit Blog, the largest WebRTC resource in the world, has been launched. @WebRTCSummit Blog offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. @WebRTCSummit Blog can be bookmarked ▸ Here @WebRTCSummit conference site can be bookmarked ▸ Here
SYS-CON Events announced today that Cisco, the worldwide leader in IT that transforms how people connect, communicate and collaborate, has been named “Gold Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Cisco makes amazing things happen by connecting the unconnected. Cisco has shaped the future of the Internet by becoming the worldwide leader in transforming how people connect, communicate and collaborate. Cisco and our partners are building the platform for the Internet of Everything by connecting the...
Temasys has announced senior management additions to its team. Joining are David Holloway as Vice President of Commercial and Nadine Yap as Vice President of Product. Over the past 12 months Temasys has doubled in size as it adds new customers and expands the development of its Skylink platform. Skylink leads the charge to move WebRTC, traditionally seen as a desktop, browser based technology, to become a ubiquitous web communications technology on web and mobile, as well as Internet of Things compatible devices.
Docker is an excellent platform for organizations interested in running microservices. It offers portability and consistency between development and production environments, quick provisioning times, and a simple way to isolate services. In his session at DevOps Summit at 16th Cloud Expo, Shannon Williams, co-founder of Rancher Labs, will walk through these and other benefits of using Docker to run microservices, and provide an overview of RancherOS, a minimalist distribution of Linux designed expressly to run Docker. He will also discuss Rancher, an orchestration and service discovery platf...
SYS-CON Events announced today that Vitria Technology, Inc. will exhibit at SYS-CON’s @ThingsExpo, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Vitria will showcase the company’s new IoT Analytics Platform through live demonstrations at booth #330. Vitria’s IoT Analytics Platform, fully integrated and powered by an operational intelligence engine, enables customers to rapidly build and operationalize advanced analytics to deliver timely business outcomes for use cases across the industrial, enterprise, and consumer segments.