This article is more than 1 year old
Extract, transform, load? More like extremely tough to load, amirite?
Thankfully, now there are data integration platforms for that
Data integration has been an IT challenge for decades. Long before cloud, even before client server. Back then, though, it was relatively simple – you were working with systems of record, desktop databases and business systems.
Today, it's not so simple. We exchange information about virtually everything from science to retail, from Facebook Likes to regional sales during August.
The volumes of data we're dealing with are significantly higher than anything we've collectively worked with historically.
In addition to the ever-increasing volume of data is the types of data – we're not in structured data land any more, Dorothy, and relational monogamy is a thing of the past with data on images and device parameters mingling with information nestled nicely in relational rows and tables.
ETL is hard
We will – of course – be familiar with Extract, Transform and Load (ETL) from those bygone, pre-cloud years.
ETL is a series of steps used to take data from one source and make it available to another. ETL is the very core of modern data integration: one extracts data from an application or service, transforms that data so that it can be understood by one or more target applications and services, and loads that data into the target application(s) and/or service(s).
ETL can occur in many forms. For example, one could save data from an application as a Comma Separated Values (CSV) file, open that file in a spreadsheet application to change the name of various columns, and then load that data into another application. This is a fairly common form of ETL operation that many organisations perform manually.
ETL doesn't always involve a file as an intermediate stage. Frequently, data is extracted from the source application via an API, and loaded into the destination application via an API. In cases where the data integration to be performed is relatively simple (such as having few or no transformation steps), this sort of integration can be automated easily.
API-driven applications offer an easier form of integration.
When not performing direct API-to-API mind melds, data integration requires an intermediate data repository between the extraction and loading steps. This intermediate repository may be a file, database, middleware solutions, or data warehouse.
Extracting data from a source and storing it in a data warehouse, while not always required, is quite common in cases where multiple data sources are being combined as part of the data transformation process.
In this case, data from different source systems is converted into one consolidated data warehouse format ready for transformation. With multiple data sources gathered in a single, centralised data warehouse it becomes easier to transform the data before loading it into the destination application or service.
Transforming the data may involve some or all of the following tasks:
- Applying business rules: Here data may be transformed in order to meet a business's practical operational requirements. Examples include converting data from imperial to metric, adding a production priority based on a secondary data set's criteria, adding a time stamp, or imprinting a chain of custody to data for future auditing purposes.
- Cleaning: Data often needs to be altered in subtle ways to make it compatible with a destination application. Examples include mapping NULL values to 0, altering the values "Male" to "M" and "Female" to "F", or changing date and time formats.
- Filtering: Transformation can be as simple as selecting only certain columns to exchange between applications or services. Even where filtering is not the sole data transformation taking place, it is a regular component of the larger data transformation process.
- Splitting: Data doesn't always come neatly divided. In many cases, information stored in a single column may need to be split into multiple columns. Examples of this might be splitting a date/time column into separate date and time columns, or breaking up a comma-separated array stored in a column into multiple columns.
- Joining: Multiple data points may need to be joined before loading into a destination application or service. This joining may be the opposite of splitting, in which multiple columns of data from a single source are merged into a single column, or it may involve combining together data from multiple sources. Joining doesn't always involve merging data into a single column; it may involve adding data that wasn't in the original export, such as using GPS data from the source to identify the nearest city and adding that to the data loaded into the destination.
- Transposing: One of the more complicated transformation functions is transposing. Transposing involves disconnecting rows and columns from their original arrangement and altering their relationship to one another. A simple transposition would be to make all columns rows and all rows columns. A more complicated one might involve converting a table of data into key/value pairs or vice versa.
- Applying data validation: Validating data extracted from the source before loading it into the destination is generally a good idea. One might want to, for example, verify that a postal code exists, or that an email address is formatted appropriately. Data validation is sometimes done for utility, but also often done for security reasons. One does not want to attempt to load data into an SQL server, for example, that would cause the SQL server to delete all of its contents.
A to A versus B to B
While data integration is centered on the ETL of data from applications and IT services, these applications and services are not always owned by – and do not always serve – the same organisation. Data integration is often broken into two broad categories: application to application and business to business.
Application-to-application data integration is shorthand for ETL operations on applications and services used by a single organisation. An example would be connecting ERP and CRM systems. Application-to-application integrations are increasingly done at the API level, though some integrations with older solutions may still occur directly at a database level, or even require files to be transferred between applications.
Modern application-to-application integration rarely requires an intermediate database, as "direct" integrations between popular applications and services exist. These direct integrations only cover some use cases, however, and many organisations still pump data into a middleware solution between extraction and loading. This is more common for organisations that also engage in business-to-business data integration, as data from one application or service may be used in multiple ways in these organisations.
Business-to-business integration (also known as external data integration) involves the exchange of data between multiple organisations. One classic example of this is data integration pattern is customers loading data into a logistics company's IT systems to automatically generate waybills.
Business-to-business integration typically differs from application-to-application integration in that integration with external entities is usually done by exchanging external documents. Most organisations are understandably reluctant to open up the APIs of their critical business applications for a variety of reasons.
The exchange of documents that is typical of B to B data integration usually occurs in some sort of loosely coupled format, meaning that each customer's files can be a little different from the next. In the logistics company example, this means that the company's IT practitioners have to cope with hundreds of thousands of ever-so-slightly different documents from its hundreds of thousands of customers.
Those ingesting data from such files typically write scripts to transform the data formats they are receiving. However, this problem gets nasty in a hurry as it scales.
The scale problem
At the core of the problem with external data integration is that each organisation extracting and transmitting data from their systems not only uses different source IT solutions, but they have all have different developers available to them. These developers have varying skill sets, and varying levels of experience.
Even where two organisations are using the same system from which they plan to extract data, one organisation might apply data transformation to that extracted data while another might not.
In the logistics example, I'll assume the company is transforms its data before it sends it by applying filtering and validation. It sends only the data that is absolutely necessary to send, and ensure that the data is formatted such that it should not cause database errors.
A fellow organisation, however, might send what amounts to a raw data dump from their IT system. This might include superfluous Personally Identifiable Information (PII) that will instantly place a burden of responsibility on the destination organisation. In addition, any malformed or malicious data will go through unchanged.
If the recipient of these two companies' data were to load these two datasets without transformation, not only would the two data sets look out of whack but the recipient could also corrupt its database.
External integration in practice
Where does this leave us? Put yourself in the position of that recipient – there's plenty of us.
Organisations that receive data from multiple sources need to be able to quickly and efficiently design parsers to perform ETL. Each customer may need their own parser, and each parser will need a standard set of data transformations applied to it, including a great deal of data verification.
Unfortunately, this isn't easy. Many systems are simply not designed to perform integrations on this sort of scale, nor do they have the appropriate tools to handle complex data transformation. In addition, many organisations assume that if the application they use is new enough to do API integration then it will also handle files, and this is increasingly incorrect.
This leaves IT building complex and intricate custom ETL answers to front end their data warehouse, their in-house middleware – or both. For many, the majority of their IT staff time expenditure will be spent on building data ingestion and transformation pipelines. Each customer will be its own pipeline, and there can be hundreds of thousands feeding into a single destination application.
Data integration platforms
For those of us looking for a slightly more industrial, less artisanal approach, it's worth taking the time to explore the re-emerging class of data integration platforms.
Data integration platforms are basically glorified parsers. They handle the ETL portion of the equation, but also tend to build in most – if not all – of the dummy-proofing by default.
Some data integration platforms won't require more of administrators than mapping data fields on the input data type to data fields on the destination. The platform is aware of the data field requirements of the destination, and it adds in by default all the input validation and malicious input detection you could want.
I wish I'd known about those 15 years ago when I wrote my first middleware systems.
Years of work went into that application, and probably 80 per cent of it was building the fleventy-fleven parsers and the layers upon layers of validation, checking and so forth.
Data integration vendors have been around for decades with new companies emerging recently. They may not meet all your needs, but they are worth looking into before trying to roll your own artisan answer. ®