Data warehouse applications have been around since the late 1980s, and their main purpose was simply to take raw data and turn it into a sensible format for business or financial reporting. Over the years, however, the demands on reporting systems have changed dramatically. To name a few:
- We’ve moved toward real-time data processing
- Data footprints have increased
- The number of users has exploded
- Queries are ad hoc, not always planned in advance
- Updates occur over the course of the business day, not just after hours
- Usage is 24x7x365
These demands place massive performance pressure on the supporting IT infrastructure, and traditional spinning-disk storage can no longer keep up. Disks are made of moving parts, which greatly limits the speed at which they can process data and leads to massive performance problems whenever physical I/O is required, especially under heavy load. Batch run times have started to reach unacceptable levels, leading to missed SLAs. Parallel reporting is starting to bring systems to their knees, and the move toward real-time processing is constrained by an inability to process data fast enough.
All-Silicon Array to the Rescue
All-silicon architecture is designed specifically to address the speed, scale, administration, concurrency and TCO issues plaguing modern infrastructures. By allowing every memory address to be equally accessible at the same great speed, flash fixes the problems that disks are creating. With little to no administration or tuning, all data will flow at flash memory speeds regardless of the locality or the number of LUNs, databases or users. Here are the top five reasons to fast-track your data warehouse applications on flash:
1. Reduce data imports from hours to minutes.
As data sets grow in size, imports take longer and longer to run, breaking SLAs or causing overnight jobs to overrun into production hours the next day and thus affecting user experience. Flash storage has three major attributes that can reduce data imports from hours to minutes: microsecond latency, high throughput and massive parallelism. With flash storage, a data warehouse can ingest data from 2x, 3x or 4x more sources in parallel and write data at microsecond latency, shortening batch jobs. For example, data imports have fallen from eight hours to 60 minutes; the best I have seen so far was from one week to one day! This change reduces risk for a business and makes data available earlier.
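To put the figures quoted above in perspective, here is a minimal Python sketch of the arithmetic. The function name `speedup` is illustrative, and the inputs are only the two before-and-after pairs mentioned above, not measurements of any particular system.

```python
def speedup(before_hours, after_hours):
    """Return how many times faster the job runs after the change."""
    if after_hours <= 0:
        raise ValueError("after_hours must be positive")
    return before_hours / after_hours


# The two examples from the text, both expressed in hours.
eight_hours_to_one = speedup(8, 1)
one_week_to_one_day = speedup(7 * 24, 24)

print(f"8 hours -> 60 minutes: {eight_hours_to_one:.0f}x faster")
print(f"1 week -> 1 day: {one_week_to_one_day:.0f}x faster")
```

Both cases work out to a seven- to eightfold reduction in batch time.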
2. Enable real-time data processing.
The biggest constraint in enabling real-time data processing is slow storage. Disks process data in milliseconds, but today’s demand is for near-instantaneous results. Flash storage processes data in microseconds, which is up to 15 times faster than traditional disk storage. This capability allows businesses to move their data warehouses toward real-time processing, where data is imported and transformed instantly, ready for the end users. The power of real-time processing is immense. Imagine being able to process product stock data for thousands of stores instantly, or to manage a production line in real time in order to see how efficiently it is operating and to make adjustments to improve productivity. Flash storage removes the storage constraints that exist today and makes real time a reality for businesses.
3. Data imports are less complex.
To query large amounts of data quickly, a data warehouse must logically sort the random data during the import and transformation stages and then, using a single-threaded process, write it to disk sequentially. This is because disks perform well with sequential workloads but badly with random ones. A single-threaded process uses only one CPU core, so the application cannot do the work in parallel and fully utilize the power of all the CPUs in a server. It’s one of the reasons why data warehouse batch jobs take a long time. Flash loves parallel workloads, and even better, it loves random workloads. Therefore the sequential sorting and writing stages can be removed, dramatically simplifying the data import process, which helps to shorten batch jobs.
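As a rough illustration of that simplification, the Python sketch below writes incoming rows in arrival order, split into chunks that are written concurrently, with no sort and no single sequential writer. Everything here is hypothetical: the function names, the CSV chunk files and the default chunk size are assumptions for the example, not how any particular data warehouse loads data. Threads are used because the work is I/O-bound; a CPU-heavy transformation step would need separate processes to spread across cores.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_chunk(directory, index, rows):
    """Write one chunk of rows to its own file, in arrival order (no sort)."""
    path = os.path.join(directory, f"chunk_{index:04d}.csv")
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(",".join(str(value) for value in row) + "\n")
    return path


def parallel_import(rows, directory, chunk_size=1000, workers=4):
    """Split rows into chunks and write the chunks concurrently."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_chunk, directory, index, chunk)
            for index, chunk in enumerate(chunks)
        ]
        # Collect results in submission order; any write error is raised here.
        return [future.result() for future in futures]


if __name__ == "__main__":
    sample_rows = [(i, i * 2) for i in range(2500)]  # made-up data
    with tempfile.TemporaryDirectory() as target:
        written = parallel_import(sample_rows, target)
        print(f"wrote {len(written)} chunk files")
```

On disk, this pattern would turn into many competing random writes; the point made above is that flash does not penalize it.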
4. Run more reports at the same time with zero degradation in performance.
Flash storage is all about parallelism. The more work you provide an all-flash array, the better it performs. That is a fact. Data is distributed very finely across all the flash chips, which means the array can perform many operations at the same time without dropping performance. Combined with ultralow microsecond latency, flash can enable multiple data warehouse end users to run reports at the same time without any performance degradation. No more running small sets of invoices at a time. With flash you can now generate all the invoices for your customers at the same time and know that the performance of other end users will not drop.
5. Administration and configuration are simpler and become more manageable.
Let’s face it, configuring disk storage for performance is complicated. Short stroking, different RAID levels for different data types, auto-tiering, caching: the list goes on. To get any decent level of performance out of disks, you have to use a number of different techniques, but these techniques are just Band-Aids for a problem that cannot be solved: disks are slow. What if the workload changes and the RAID configuration you chose is no longer suitable? If you’re short stroking, you are using only 20–30 percent of capacity and wasting terabytes of space. When you are working with a multi-terabyte or petabyte data warehouse, these points become important because that is a lot of storage to manage. Flash storage simplifies the administration and manageability of a data warehouse. All flash in an array is fast, so there is no short stroking and no choosing which RAID is best. Scalability is also simpler when you no longer need to use these traditional techniques. This frees businesses to concentrate on future projects and on innovation that strengthens their position in the marketplace.
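The short-stroking cost is easy to work out. The Python sketch below takes the 20–30 percent figure from the text and applies it to a hypothetical 100 TB warehouse; the function name and the warehouse size are assumptions chosen only for illustration.

```python
def raw_capacity_needed(usable_tb, used_fraction):
    """Raw disk capacity (TB) needed when only used_fraction of each disk holds data."""
    if not 0 < used_fraction <= 1:
        raise ValueError("used_fraction must be in (0, 1]")
    return usable_tb / used_fraction


warehouse_tb = 100  # hypothetical warehouse size
for fraction in (0.20, 0.25, 0.30):
    raw = raw_capacity_needed(warehouse_tb, fraction)
    idle = raw - warehouse_tb
    print(f"{fraction:.0%} used: buy {raw:.0f} TB, leave {idle:.0f} TB idle")
```

Under these assumptions, most of the purchased disk capacity sits idle purely to keep seek times down.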
Flash storage can greatly benefit data warehouse environments in both performance and cost reduction. The technology, which provides low latency, high IOPS and high throughput, can enable real-time processing, dramatically speed up data processing and simplify administration, helping businesses exceed their goals.
Leading article image courtesy of Paul L Dineen
About the Author
Ashminder Ubhi has specialized in databases and applications for over 10 years, working with many of the top Fortune 500 companies. He is currently a Pre-Sales Database Specialist for Violin Memory, an all-flash-array provider. Previously, he ran a consultancy firm that provided niche Oracle skills, including Oracle’s Exadata, to businesses. Ashminder regularly speaks at events across Europe and has an active blog.