Moving enterprise data is time consuming and expensive. But when a company faces a migration to a new storage platform or a consolidation of multiple environments, it can avoid moving outdated, abandoned and aged data.
This valueless data can easily account for 30–50 percent of total capacity. Now, new data profiling technology enables data center managers to eliminate data with no business value before a consolidation or migration occurs. Data profiling is the metadata analysis of unstructured user files. It builds an efficient, cost-effective index of data storage (accessed via NFS/CIFS/NDMP protocols) by extracting key metadata: last modified and accessed dates, owner, location, size and even duplicate content can all be located with custom queries.
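As a rough illustration, the kind of metadata such an index collects can be gathered with standard file-system calls. This minimal Python sketch (the `profile_tree` name and record layout are mine, not the product's) walks a directory tree and records the fields discussed above:

```python
import os
from datetime import datetime, timezone

def profile_tree(root):
    """Walk a directory tree and collect key file metadata,
    similar in spirit to what a data-profiling index stores."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanish or are unreadable
            records.append({
                "path": path,
                "size": st.st_size,
                "owner_uid": st.st_uid,  # resolved to a user name via AD/LDAP in practice
                "modified": datetime.fromtimestamp(st.st_mtime, timezone.utc),
                "accessed": datetime.fromtimestamp(st.st_atime, timezone.utc),
            })
    return records
```

In a real deployment the index would be built over NFS/CIFS/NDMP mounts and owners resolved against Active Directory; the sketch only records the numeric UID from `os.stat`.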
In addition, integrating data profiling software with Active Directory/LDAP allows reports and analysis to be summarized by groups and departments as well as by active versus inactive (ex-employees) users.
Data profiling software differs greatly from existing analysis solutions that examine access logs and network metadata, and from storage resource management (SRM) tools that analyze capacity rather than the data itself. Data profiling goes deep into the files, down to a full-text profile if required, and delivers comprehensive access to file information. For managing files, it is the only solution that provides the level of knowledge, as well as the analytical tools and disposition capability, needed to migrate data efficiently.
Data profiling provides flexible filters, queries and dynamic summary reports that offer the knowledge corporate data centers need to make appropriate decisions.
Filters can be applied to specific locations/directories, file types, users and more. Once this focus is defined, dynamic summary reports can provide an analysis and roll-up of the data. For example, to analyze human resources data on a shared server, filters can be applied to the directories HR uses, and summary reports can then be generated by data age.
Using the dynamic capabilities of these reports, the age filter can be narrowed to data that has not been accessed in more than three years, and a new summary report on owners and file types can be generated. This approach yields an analysis of aged HR data, with details on who owns it and what type of data it is.
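The filter-then-summarize workflow just described can be sketched in a few lines. The record layout and the `summarize_stale` helper below are hypothetical stand-ins for the product's query engine, not its actual API:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical profile records, as a metadata index might return them.
records = [
    {"owner": "alice", "ext": ".xlsx", "size": 120,
     "accessed": datetime(2010, 3, 1, tzinfo=timezone.utc)},
    {"owner": "bob",   "ext": ".docx", "size": 300,
     "accessed": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"owner": "alice", "ext": ".docx", "size": 80,
     "accessed": datetime(2009, 7, 15, tzinfo=timezone.utc)},
]

def summarize_stale(records, years=3, now=None):
    """Filter to records not accessed in `years` years, then roll up
    total size by (owner, file type) -- a 'dynamic summary report'."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=365 * years)
    summary = defaultdict(int)
    for r in records:
        if r["accessed"] < cutoff:
            summary[(r["owner"], r["ext"])] += r["size"]
    return dict(summary)
```

Applying the same idea to the HR example, the filter would first restrict `records` to the HR directories, and the roll-up key would be whatever dimension the report needs (age band, owner, file type).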
Analysis is flexible and will allow the user to understand the current state of data as well as how it changes. From finding and managing data that has outlived its business value to finding data that must be preserved in an archive, data profiling delivers the reports and disposition tools needed to get the job done.
Classification for Disposition
Classification helps distinguish data that still has business or legal value from data that has lost all value. It also flags iTunes libraries and other personal files that waste capacity in the data center.
Abandoned: This data includes files owned by ex-employees that have not been accessed in three or more years. Analyzing the content by owner or department shows whether it has value or should be reviewed. For example, if this content was owned by R&D and contained potentially valuable research data, it may require investigation beyond metadata reports. If, however, it was owned by administrative employees, whose content typically has no long-term business value or legal hold requirements, it can be deleted.
Aged: This represents a significant portion of unstructured network data. It is owned by current employees but has not been accessed in more than three years. Aged data does not necessarily lack business value, but if employees are not currently using it, it is likely content that can be purged or easily moved offline for further analysis. On the basis of the department or person who owns it, aged data can be classified for purge or for archive and long-term retention.
Redundant: Employees share files across teams and departments, creating copies that are scattered across the network. Data profiling can generate a unique document signature for each file to determine whether it is an exact copy of content that already exists. Duplicate data often clogs servers; if it is redundant and has not been accessed, it can probably be added to the leave-behind bucket.
Personal: Users store personal content on their desktops and even network file shares. Photos, multimedia files such as iTunes libraries and downloaded movies can take up valuable space. This personal content gets in the way of managing and organizing data with real business value. Identify this personal content, leave it behind and notify the culprit’s manager.
Archive: This case requires finding data that has long-term business value and should be preserved according to policies and regulatory requirements, rather than moved from server to server. For example, all documents and spreadsheets owned by members of the “Project X” design team that were created between the start and end dates of the project can be migrated to an archive to support regulatory hold requirements.
Active: Active data is content that has been created within the last three years. It has the highest likelihood of being accessed again and should be moved over during consolidation efforts.
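The document-signature idea described under "Redundant" is, in essence, content hashing: identical files produce identical signatures regardless of name or location. A minimal sketch, assuming plain local paths and using SHA-256 as the signature (the product's actual signature scheme is not specified here):

```python
import hashlib
from collections import defaultdict

def file_signature(path, chunk_size=1 << 20):
    """Compute a content signature (SHA-256 here); exact copies of a
    file share the same signature regardless of name or location."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Group paths by signature; any group larger than one is redundant."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_signature(p)].append(p)
    return {sig: ps for sig, ps in groups.items() if len(ps) > 1}
```

Hashing in fixed-size chunks keeps memory flat even for multi-gigabyte files, which matters when sweeping a 40 TB server.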
Data Disposition Options
Using the dynamic reports to narrow down the analysis of the data set, it is then easy to manage the disposition of the content. Content can be deleted, migrated or archived with the built-in tools, or exported as a CSV file for processing with existing tools.
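The CSV export mentioned above can be as simple as writing the profile records out with a standard library. The `export_report` helper and field names here are illustrative, not the product's format:

```python
import csv

def export_report(records, out_path):
    """Write profile records to CSV for use with existing tools."""
    fieldnames = ["path", "owner", "size", "accessed"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```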
Deletion of content should be performed in a defensible manner. Once you have refined a subset of content to be purged and have received sign-off from the legal or compliance department, the data can be deleted in a single click. This process is defensible, as a log of this activity—including the person, time and specific files—is stored in a database for future reference.
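A defensible-deletion log of the kind described, recording the person, time and specific files, can be sketched with SQLite. The schema and the `delete_with_audit` helper are hypothetical, not the product's implementation:

```python
import os
import sqlite3
import getpass
from datetime import datetime, timezone

def delete_with_audit(paths, db_path="disposition_log.db"):
    """Delete files and record who deleted what, and when, in a
    database for future reference."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS deletions (
        path TEXT, deleted_by TEXT, deleted_at TEXT)""")
    user = getpass.getuser()
    for path in paths:
        os.remove(path)  # perform only after legal/compliance sign-off
        conn.execute(
            "INSERT INTO deletions VALUES (?, ?, ?)",
            (path, user, datetime.now(timezone.utc).isoformat()),
        )
    conn.commit()
    conn.close()
```

The point of the log is that the purge can be reconstructed and justified later; in practice the sign-off itself (who approved, under what policy) would be recorded alongside each deletion.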
Migration of data can also be managed, including moving content to a more appropriate platform, preserving it in an existing archive or pushing it out to a cloud repository. This strategy allows data tiering on the basis of value and access requirements, and it frees up expensive storage for more-important content.
Streamlined Migration and Consolidation
Organizations are continually migrating and consolidating data centers. Investing effort and expense in moving data that has outlived its business value is a significant waste of resources. A typical 40 TB server can easily contain 22 percent abandoned data, 14 percent aged data that has outlived its business value, 24 percent duplicate content and 6 percent personal files such as vacation photos and movies. Together, this can account for well over 50 percent of capacity wasted.
A migration or consolidation that cuts the data volume in half frees up tremendous resources and reduces expenses, and those savings continue to accrue over time.
Leading article photo courtesy of brewbooks
About the Author
Jim McGann is vice president of information management and archiving company Index Engines. He can be reached at Jim.McGann@IndexEngines.com.