Sepaton just completed a survey of large enterprises and found that data growth continues to be the most significant data protection challenge. This growth is increasing annually and is driving a variety of additional challenges, including data center sprawl and a need for more-efficient deduplication. Although we have seen some improvement in two key areas—remote office data protection and disaster recovery protection—there is still significant room for improvement.
This article looks at the top five data protection challenges that result from this rapid data growth and offers practical advice on using new technologies to gain the backup, deduplication, replication and restore performance needed to address these challenges. Not surprisingly, these challenges are interrelated.
Data Centers Reaching Data Growth “Breaking Point”
Data center managers have been dealing with exponential data growth for years. The pace of growth, however, has accelerated dramatically in the last year. Nearly one quarter of respondents reported a 25 percent higher growth rate compared with last year. The speed of growth and the sheer volume of data under protection have reached a critical point such that enterprises cannot continue to back up, deduplicate, replicate or restore without a highly automated and scalable solution.
Enterprise data centers that have disk-based backup systems that do not scale deal with data growth by adding more systems to increase backup performance or capacity. Although this box-by-box approach works for small and medium-size backup environments, it is neither efficient nor effective beyond a certain point. In our experience, this breaking point is reached when enterprises have full daily backups of about 20TB. At that volume, you may need five or more box-by-box backup systems to meet your backup window and capacity requirements—far too many to be managed or maintained efficiently. This proliferation of backup systems adds to another challenge: data center sprawl.
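The arithmetic behind this breaking point can be sketched roughly as follows. The 20TB daily volume comes from the text above; the 8-hour backup window and the 0.5TB-per-hour throughput of a single system are hypothetical figures chosen only for illustration, and the function name is an assumption. The sketch models throughput only, not capacity.

```python
import math

def boxes_needed(daily_full_tb, window_hours, per_box_tb_per_hour):
    """Return how many independent backup systems are needed so that
    the daily full backup finishes inside the backup window."""
    required_rate = daily_full_tb / window_hours  # TB per hour overall
    return math.ceil(required_rate / per_box_tb_per_hour)

# 20TB is from the article; the window and per-box rate are hypothetical.
print(boxes_needed(daily_full_tb=20, window_hours=8, per_box_tb_per_hour=0.5))
```

Under these assumed figures the result is five systems, each of which must be managed separately. A shorter window or a slower system pushes the count higher.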
Costly Data Center “Sprawl”
According to our survey, fifty percent of respondents characterized their environments as having “moderate” or “severe” sprawl, requiring them to routinely add data protection systems to scale performance or capacity. In addition to administration costs, data center “sprawl” adds to power, cooling and data center footprint costs. It also causes an exponential increase in the administration time required to maintain, upgrade and manage all that hardware. Once the added systems are online, they operate independently, without the ability to deduplicate data volumes across systems.
Although managing a handful of systems in this box-by-box strategy may have sufficed in smaller environments, it is not feasible in today’s complex, high-growth, high-volume data center. Today’s IT administrators are responsible for ensuring that petabytes of data are backed up, retained, restored and erased in accordance with challenging service-level agreements (SLAs), regulatory requirements and business continuity metrics (RTO, RPO). Enterprises are finding that the box-by-box backup systems add enormous administration costs and risk of human error. They force administrators to divide backups onto individual systems—each of which has to be managed and maintained individually with its own software upgrades/patches, hardware upgrades and system optimization. They also require a fairly complex load-balancing process to bring them online in a way that shares the load efficiently with existing equipment. Because they cannot deduplicate data that is stored in different systems, the box-by-box systems are inherently less efficient in their capacity reduction than their scalable counterparts. Given the rate of data growth reported in our survey, this inefficiency could add significant unnecessary costs in an enterprise data center.
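The capacity penalty of systems that cannot deduplicate across one another can be illustrated with a small sketch. The segment contents and function names below are hypothetical, and a SHA-256 fingerprint is used only as a convenient way to detect identical segments; it does not represent any particular product.

```python
import hashlib

def fingerprints(segments):
    """Return the set of fingerprints for a collection of data segments."""
    return {hashlib.sha256(s).hexdigest() for s in segments}

def stored_siloed(boxes):
    """Segments stored when each box deduplicates only its own data."""
    return sum(len(fingerprints(segs)) for segs in boxes)

def stored_consolidated(boxes):
    """Segments stored when one system deduplicates across all data."""
    all_fps = set()
    for segs in boxes:
        all_fps |= fingerprints(segs)
    return len(all_fps)

# Two hypothetical boxes that both hold the same OS image and application.
boxes = [
    [b"os-image", b"mail-app", b"finance-db"],
    [b"os-image", b"mail-app", b"hr-db"],
]
print(stored_siloed(boxes), stored_consolidated(boxes))
```

In this toy case the siloed boxes store six segments between them, while a single consolidated system stores four, because the shared segments are kept only once.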
The best solution for enterprises is to move to an automated, centralized data protection solution that allows them to add capacity and performance as needed. These solutions eliminate a wide range of manual administration tasks and provide more-efficient and flexible utilization of capacity and performance. For example, they can automate scheduled backups, replication and secure data erasure. If new capacity or performance is added to these systems, it is automatically integrated and load balanced without operator intervention. These solutions also provide predictive monitoring of hardware systems and automatically notify administrators of possible emerging issues (e.g., fan overheating) before they become real problems.
With all of the data in one place, these systems can track and report on capacity usage, deduplication efficiency and replication performance far more efficiently than the box-by-box method. They also ameliorate the second challenge on the list: data center sprawl.
Enterprises Need More Backup Capacity and Performance
Because enterprise IT managers in today’s backup environments are dealing with larger, faster-growing data volumes than ever before, they face the challenge of backing up and restoring this data fast enough to stay within their shrinking backup windows. The majority of survey respondents rated increasing capacity and performance of the data protection systems as the top IT priorities for the coming months.
One reason companies are not able to meet their backup window is that they are using inline/hash-based deduplication. This type of deduplication, which was designed for smaller backup environments, typically causes a bottleneck in backup and restore performance in enterprise environments.
Inline/hash-based technologies are designed to find matches in data before it is written to disk. If deduplication takes too long, it creates a performance bottleneck that can jeopardize backup windows. These technologies analyze segments of data as it is being backed up and assign a unique identifier called a fingerprint to each segment, storing the fingerprints in an index. The fingerprints of all incoming data are compared with those already in the index. If a fingerprint is already in the index, then the incoming data is deemed to be duplicated and is not backed up. Instead, the system stores a pointer to the original data. If the fingerprint is not in the index, the data is written to the disk.
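The fingerprint-and-index process described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the class name, the tiny fixed 8-byte segment size and the choice of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib

SEGMENT_SIZE = 8  # bytes; unrealistically small, chosen for illustration

class InlineDedupStore:
    """Toy inline, hash-based deduplicating store (hypothetical)."""

    def __init__(self):
        self.index = {}     # fingerprint -> position in self.segments
        self.segments = []  # unique segments actually written to "disk"

    def backup(self, data):
        """Deduplicate data before writing; return a list of pointers."""
        pointers = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint not in self.index:
                # New data: write the segment and record its fingerprint.
                self.index[fingerprint] = len(self.segments)
                self.segments.append(segment)
            # Duplicate or new, the backup keeps only a pointer.
            pointers.append(self.index[fingerprint])
        return pointers

    def restore(self, pointers):
        """Rebuild the original data by following the pointers."""
        return b"".join(self.segments[p] for p in pointers)

store = InlineDedupStore()
first = store.backup(b"AAAAAAAABBBBBBBBAAAAAAAA")
second = store.backup(b"BBBBBBBBCCCCCCCC")
print(first, second, len(store.segments))
```

Every incoming segment requires a fingerprint computation and an index lookup before anything is written, which is where the bottleneck described above arises as the index grows.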
Large enterprises should use a different type of deduplication, content-aware byte differential deduplication, that is designed to deduplicate massive data volumes without slowing performance. These technologies extract metadata from the incoming data stream and use it to identify data that is likely to contain duplicates. They then analyze only this small subset at the byte level, seeking optimal capacity reduction, so they do not slow backup or restore processes. Because there is no index, and because the analysis of suspect duplicates can be done in parallel, these technologies are able to scale processing across many nodes and to scale capacity to store tens of petabytes in a single system with deduplication. As a result, they are a critical tool for enterprises that need to improve the efficiency of their capacity and performance utilization. Because they scale across processing nodes and disk trays, these tools enable enterprises to add performance and capacity as needed to address rapid data growth.
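The general idea of using metadata to find a likely duplicate and then comparing at the byte level can be sketched as follows. The article does not describe the internal algorithm, so this is only an assumed illustration: the metadata key (a client name and path), the class and function names, and the use of Python's `difflib.SequenceMatcher` as the byte-level comparison are all stand-ins, not the actual technique used by any product.

```python
from difflib import SequenceMatcher

def byte_delta(previous, current):
    """Encode current as copy/insert operations against previous."""
    ops = []
    matcher = SequenceMatcher(None, previous, current, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))            # reuse bytes already stored
        elif tag in ("replace", "insert"):
            ops.append(("insert", current[j1:j2]))  # store only the new bytes
        # "delete" needs no operation: those bytes are simply not copied
    return ops

def apply_delta(previous, ops):
    """Rebuild the current version from the previous one plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += previous[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

class ContentAwareStore:
    """Toy store that uses metadata, not a fingerprint index, to find
    the earlier version an incoming object should be compared with."""

    def __init__(self):
        self.latest = {}  # metadata key -> most recent full copy

    def backup(self, key, data):
        previous = self.latest.get(key)
        self.latest[key] = data
        if previous is None:
            return ("full", data)
        return ("delta", byte_delta(previous, data))

store = ContentAwareStore()
key = ("db-server-01", "/data/accounts.tbl")  # hypothetical metadata
monday = b"customer=alice;balance=100;"
tuesday = b"customer=alice;balance=250;"
kind_1, _ = store.backup(key, monday)
kind_2, ops = store.backup(key, tuesday)
print(kind_1, kind_2, apply_delta(monday, ops) == tuesday)
```

Because each object is compared only with its own earlier version, objects with different metadata keys can be processed independently, which is consistent with the parallel scaling described above.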
Inefficient Deduplication of Very Large Databases
More and more of today’s enterprise data is stored in very large databases, which cannot be efficiently deduplicated by hash-based deduplication technology. With data growth rates increasing exponentially, enterprises cannot afford to store and protect massive volumes of duplicate data. Databases often store data in segments smaller than 8KB, a size too small for most inline/hash-based deduplication technologies to process efficiently. In addition, most very large databases are backed up using multistreaming and multiplexing to cut backup times to a minimum. Inline/hash-based technologies cannot deduplicate these multistreamed, multiplexed databases well, leaving massive volumes of duplicate data in place.
Enterprises must ensure their deduplication technology is both performance optimized and able to deduplicate very large databases with high data change rates.
DR and Remote Office Replication a Necessity
Although the Sepaton survey revealed some improvement in data protection for remote offices, a large volume of data still remains insufficiently protected in these locations. Many enterprises continue to copy data on physical tape and ship tapes to off-site storage locations for disaster protection or remote office data retention. This solution is slow, manual and difficult to test.
Enterprises that are using hash-based deduplication to speed their replication find that, unless the deduplication rate is very high, deduplicated data takes just as long to replicate as non-deduplicated data. Replication is even slower when database data or other high-change-rate data is replicated.
Content-aware byte differential technologies solve these replication problems by streaming delta requests to target systems. Data is pulled from the source system only when the delta rules can’t be applied, which minimizes the amount of data transferred while using the full bandwidth of the wire and avoiding costly latency gaps.
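The replication behavior described above, in which the target applies a delta when it can and pulls full data only when it cannot, can be sketched as follows. The class, the method names and the copy/insert delta format are hypothetical and exist only to illustrate the flow; the article does not specify the actual protocol.

```python
def apply_delta(base, delta):
    """Rebuild an object from a base copy plus copy/insert operations."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, start, end = op
            out += base[start:end]  # bytes the target already holds
        else:
            out += op[1]            # new bytes carried in the delta
    return bytes(out)

class ReplicationTarget:
    """Toy remote site that receives streamed delta requests."""

    def __init__(self):
        self.objects = {}      # object id -> bytes held at the target
        self.bytes_pulled = 0  # full-object bytes fetched from the source

    def receive(self, object_id, base_id, delta, pull_from_source):
        base = self.objects.get(base_id)
        if base is not None:
            # The delta rules can be applied locally: no bulk transfer.
            data = apply_delta(base, delta)
        else:
            # The base is missing, so fall back to pulling the full object.
            data = pull_from_source(object_id)
            self.bytes_pulled += len(data)
        self.objects[object_id] = data

source = {"v1": b"hello world", "v2": b"hello there"}
delta_v2 = [("copy", 0, 6), ("insert", b"there")]

warm = ReplicationTarget()
warm.objects["v1"] = source["v1"]  # target already holds the base
warm.receive("v2", "v1", delta_v2, source.__getitem__)

cold = ReplicationTarget()         # target has never seen the base
cold.receive("v2", "v1", delta_v2, source.__getitem__)
print(warm.bytes_pulled, cold.bytes_pulled)
```

In this sketch the target that already holds the base version pulls nothing from the source, while the empty target falls back to fetching the whole object.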
With already massive data volumes growing at an explosive rate, today’s enterprise data centers cannot continue to add “backup boxes.” Enterprise backup environments can dramatically improve their efficiency and cost effectiveness by consolidating backups onto scalable systems and by using deduplication technologies designed for large enterprises. Together, these steps can help combat unabated data growth and rising IT costs.
About the Author
Joe Forgione is senior vice president of product operations and business development at Sepaton, Inc. Forgione brings to this position extensive senior management experience in both early-stage and large high-technology companies. He has been involved in leading many emerging trends in the software industry from systems level on up through middleware and applications. Before coming to Sepaton, he served as CEO of mValent, a data center applications management software company acquired by Oracle in 2009.
Disk drive photo courtesy of Jason Bache