By Robert Fox, VP Emerging Technologies, Liaison Technologies
Most companies have plenty of legacy data coming from many old systems that needs to be integrated into their lines of business. When thinking about legacy data, we tend to think of CSV files, fixed-width data, spreadsheet data, and other forms of flat or semi-flat (pseudo-hierarchical) data. Some people classify traditional EDI (ANSI X12, UN/EDIFACT, ANA TRADACOMS, etc.) as legacy data. I disagree for a very specific reason: EDI has a data dictionary that can be used to validate both the structure and the contents of an EDI document.
Today, XML has become the data format of choice, because it is self-describing and because one can write XML schemas to describe and validate the data. Unlike some EDI standards, XML schemas are portable and not proprietary (i.e., you don’t have to hand your credit card to ASC X12 to get the data dictionaries).
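To make that concrete, validating an instance document against a schema can be as simple as the following sketch in Python, assuming the third-party lxml library is available (the file names here are hypothetical):

```python
# A minimal sketch of validating an XML document against an XML schema,
# using the third-party lxml library. File names are hypothetical.
from lxml import etree

schema = etree.XMLSchema(etree.parse("purchase_order.xsd"))
doc = etree.parse("purchase_order.xml")

if schema.validate(doc):
    print("Document is valid against the schema")
else:
    # Report each structural or content violation the schema caught
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")
```

The point is simply that, once a schema exists, any system with a standard XML toolkit can perform this check; no proprietary tooling is required.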
So if we were to plot our standards in terms of historical eras, I would lay them out as:
Dark Ages – Fixed-width and CSV files
Middle Ages – EDI
Golden Age – XML
When working with various data formats, it’s this Dark Age data that gives companies so much trouble: it is neither self-describing (column headers in CSV/Excel don’t really count here) nor governed by machine-readable validation rules in a data dictionary. Field types, formats, min/max lengths, repeatability, legal values, etc. are all missing. Yet it’s been estimated that over 90% of companies still use these formats. Why is this? We’ll come back to that in a moment.
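To illustrate what is missing, here is a rough sketch of the kind of machine-readable rules a data dictionary would carry for a CSV file; the field names, rules, and file name are invented purely for illustration:

```python
# A rough sketch of the data-dictionary metadata a plain CSV file lacks:
# field formats and legal values captured as machine-readable rules.
# The field names, rules, and file name below are made up for illustration.
import csv
import re

FIELD_RULES = {
    "order_id":  {"pattern": r"^\d{1,10}$"},           # numeric, up to 10 digits
    "ship_date": {"pattern": r"^\d{4}-\d{2}-\d{2}$"},  # ISO-style date
    "status":    {"allowed": {"OPEN", "SHIPPED", "CANCELLED"}},
}

def validate_row(row):
    """Return a list of problems found in one CSV row (a dict from DictReader)."""
    problems = []
    for field, rules in FIELD_RULES.items():
        value = row.get(field, "")
        if "pattern" in rules and not re.match(rules["pattern"], value):
            problems.append(f"{field}: '{value}' does not match the expected format")
        if "allowed" in rules and value not in rules["allowed"]:
            problems.append(f"{field}: '{value}' is not a legal value")
    return problems

with open("orders.csv", newline="") as f:
    for line_no, row in enumerate(csv.DictReader(f), start=2):  # line 1 is the header
        for problem in validate_row(row):
            print(f"Row {line_no}: {problem}")
```

With EDI or XML, rules of this sort travel with the standard; with Dark Age data, someone has to invent and maintain them by hand.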
Recently, I was working on a project where I was trying to auto-generate XML schemas to describe XML instance documents for which no XML schema existed. This reminded me of the Dark Age data case, except it wasn’t: there are tools available that can read an XML instance and spit out an XML schema. One can argue that the resulting schema won’t be complete, since it is based upon an instance of data and not a standard. Still, it is intriguing to be able to analyze data and produce a standard that will describe and validate it.
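As a toy illustration of the idea (not how real schema-generation tools work), one could infer a crude structural outline from a single instance document along these lines; the file name is hypothetical, and the sketch cannot capture anything the instance doesn’t exhibit (optionality, types, cardinality, and so on):

```python
# A highly simplified sketch of inferring structure from one XML instance:
# it only records which child elements were observed under each element name.
# Real schema-generation tools do far more; this is illustration only.
import xml.etree.ElementTree as ET
from collections import defaultdict

def infer_structure(xml_path):
    children_of = defaultdict(set)   # element name -> set of observed child names
    root = ET.parse(xml_path).getroot()
    stack = [root]
    while stack:
        elem = stack.pop()
        for child in elem:
            children_of[elem.tag].add(child.tag)
            stack.append(child)
    return children_of

# Hypothetical usage: print a rough outline of the inferred structure.
for parent, children in infer_structure("invoice.xml").items():
    print(f"{parent}: {sorted(children)}")
```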
We actually do this all the time with Dark Age data when we bring the data into a data translation tool, such as Liaison Contivo or Delta. We generate a “model” or “interface” to add the missing metadata and to superimpose structure on our Dark Age data (a rough sketch of the idea follows below). The key question is: how do we generate these representations quickly, easily, and accurately? And how can the data then be validated outside of these tools? Generally, it can’t, except by the application that will consume it.

That’s when it occurred to me: it’s the rich tools that keep Dark Age data in play. There are a number of reasons why companies still rely on this data today. We know that IT may look at application modernization as a cost (see my last blog post, Modernizing Applications: A Closer Look at IT Darwinism). Companies have major investments in systems that are not easy to replace, or the expertise to manage and replace them has since moved on. And quite frankly, for many people (architects, business analysts, finance accountants, etc.) there is still an elegant simplicity in continuing to work in these formats.
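As promised above, here is a bare-bones sketch of the kind of field layout such a model superimposes on a fixed-width record; the record layout and sample data are invented for illustration and are not tied to any particular tool:

```python
# A bare-bones sketch of a "model" for fixed-width data: a field layout
# with names, offsets, and lengths. The layout and sample record below
# are invented for illustration.
FIXED_WIDTH_LAYOUT = [
    ("customer_id", 0, 8),
    ("name",        8, 30),
    ("balance",     38, 12),
]

def parse_record(line):
    """Slice one fixed-width record into named fields using the layout."""
    return {name: line[start:start + length].strip()
            for name, start, length in FIXED_WIDTH_LAYOUT}

record = parse_record("00012345Acme Industrial Supply Co.    000000199.95")
print(record)
```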
So the burden of overcoming the limitations of the Dark Age data formats is left to third-party tooling that can “make sense” of the data in a meaningful way. That’s just fine with me. Here at Liaison, we love taking data of all shapes and sizes and giving it “meaningful use” so that businesses can extract as much value as possible from their Dark Age data (and any other data).
So what does this mean for the current rush toward Big Data and for the techniques we’ll need to validate it? I look forward to your suggestions.