Extract, transform, load? More like extremely tough to load, amirite?

Thankfully, now there are data integration platforms for that


Data integration has been an IT challenge for decades. Long before cloud, even before client-server. Back then, though, it was relatively simple – you were working with systems of record, desktop databases and business systems.

Today, it's not so simple. We exchange information about virtually everything from science to retail, from Facebook Likes to regional sales during August.

The volumes of data we're dealing with are significantly higher than anything we've collectively worked with historically.

In addition to the ever-increasing volume of data, there's the growing variety of data types – we're not in structured data land any more, Dorothy, and relational monogamy is a thing of the past, with image data and device parameters mingling with information nestled nicely in relational rows and tables.

ETL is hard

We will – of course – be familiar with Extract, Transform and Load (ETL) from those bygone, pre-cloud years.

ETL is a series of steps used to take data from one source and make it available to another. ETL is the very core of modern data integration: one extracts data from an application or service, transforms that data so that it can be understood by one or more target applications and services, and loads that data into the target application(s) and/or service(s).

ETL can occur in many forms. For example, one could save data from an application as a Comma Separated Values (CSV) file, open that file in a spreadsheet application to change the name of various columns, and then load that data into another application. This is a fairly common form of ETL operation that many organisations perform manually.
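As a minimal sketch of that manual round trip, automated – the file names and column mappings below are hypothetical, and Python's standard csv module stands in for the spreadsheet step:

```python
import csv

# Hypothetical column renames: source export header -> header the destination expects
COLUMN_MAP = {"Cust Name": "customer_name", "Postcode": "postal_code"}

def transform_csv(source_path: str, target_path: str) -> None:
    """Read an exported CSV, rename columns, and write a file the target app can import."""
    with open(source_path, newline="") as src, open(target_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        fieldnames = [COLUMN_MAP.get(name, name) for name in reader.fieldnames]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow({COLUMN_MAP.get(k, k): v for k, v in row.items()})

if __name__ == "__main__":
    transform_csv("crm_export.csv", "erp_import.csv")
```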

ETL doesn't always involve a file as an intermediate stage. Frequently, data is extracted from the source application via an API, and loaded into the destination application via an API. In cases where the data integration to be performed is relatively simple (such as having few or no transformation steps), this sort of integration can be automated easily.
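Automating a simple API-to-API integration might look something like the sketch below, which assumes the requests library. The endpoints, field names and tokens are placeholders rather than any particular vendor's API:

```python
import requests

SOURCE_URL = "https://source.example.com/api/v1/contacts"        # placeholder endpoint
TARGET_URL = "https://destination.example.com/api/v1/customers"  # placeholder endpoint

def sync_contacts(source_token: str, target_token: str) -> None:
    # Extract: pull records from the source application's API
    resp = requests.get(
        SOURCE_URL, headers={"Authorization": f"Bearer {source_token}"}, timeout=30
    )
    resp.raise_for_status()

    for contact in resp.json():
        # Transform: trivial field renaming is often all that is needed
        payload = {"name": contact["full_name"], "email": contact["email"]}

        # Load: push each record into the destination application's API
        requests.post(
            TARGET_URL,
            json=payload,
            headers={"Authorization": f"Bearer {target_token}"},
            timeout=30,
        ).raise_for_status()
```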

Enter APIs

API-driven applications offer an easier form of integration.

When not performing direct API-to-API mind melds, data integration requires an intermediate data repository between the extraction and loading steps. This intermediate repository may be a file, a database, a middleware solution, or a data warehouse.

Extracting data from a source and storing it in a data warehouse, while not always required, is quite common in cases where multiple data sources are being combined as part of the data transformation process.

In this case, data from different source systems is converted into one consolidated data warehouse format ready for transformation. With multiple data sources gathered in a single, centralised data warehouse it becomes easier to transform the data before loading it into the destination application or service.
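As a rough sketch of that staging step, records extracted from several sources can be landed in one consolidated table before transformation. SQLite stands in here for whatever warehouse or staging store is actually in use, and the table layout is purely illustrative:

```python
import sqlite3

def stage_records(db_path: str, source_name: str, records: list[dict]) -> None:
    """Land raw extracted records from any source in a single consolidated staging table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging (source TEXT, customer_id TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO staging (source, customer_id, amount) VALUES (?, ?, ?)",
        [(source_name, r["customer_id"], r["amount"]) for r in records],
    )
    conn.commit()
    conn.close()

# Records from different systems end up in one common format, ready for transformation
stage_records("warehouse.db", "erp", [{"customer_id": "C1", "amount": 120.0}])
stage_records("warehouse.db", "webshop", [{"customer_id": "C1", "amount": 35.5}])
```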

Transforming the data may involve some or all of the following tasks (a short sketch of several of them follows this list):

  • Applying business rules: Here data may be transformed in order to meet a business's practical operational requirements. Examples include converting data from imperial to metric, adding a production priority based on a secondary data set's criteria, adding a time stamp, or imprinting a chain of custody to data for future auditing purposes.
  • Cleaning: Data often needs to be altered in subtle ways to make it compatible with a destination application. Examples include mapping NULL values to 0, altering the values "Male" to "M" and "Female" to "F", or changing date and time formats.
  • Filtering: Transformation can be as simple as selecting only certain columns to exchange between applications or services. Even where filtering is not the sole data transformation taking place, it is a regular component of the larger data transformation process.
  • Splitting: Data doesn't always come neatly divided. In many cases, information stored in a single column may need to be split into multiple columns. Examples of this might be splitting a date/time column into separate date and time columns, or breaking up a comma-separated array stored in a column into multiple columns.
  • Joining: Multiple data points may need to be joined before loading into a destination application or service. This joining may be the opposite of splitting, in which multiple columns of data from a single source are merged into a single column, or it may involve combining data from multiple sources. Joining doesn't always involve merging data into a single column; it may involve adding data that wasn't in the original export, such as using GPS data from the source to identify the nearest city and adding that to the data loaded into the destination.
  • Transposing: One of the more complicated transformation functions is transposing. Transposing involves disconnecting rows and columns from their original arrangement and altering their relationship to one another. A simple transposition would be to make all columns rows and all rows columns. A more complicated one might involve converting a table of data into key/value pairs or vice versa.
  • Applying data validation: Validating data extracted from the source before loading it into the destination is generally a good idea. One might want to, for example, verify that a postal code exists, or that an email address is formatted appropriately. Data validation is sometimes done for utility, but often for security reasons as well. One does not want, for example, to load data into an SQL server that would cause the server to delete all of its contents.
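To make a few of those tasks concrete, here is a minimal sketch that applies filtering, cleaning, splitting and validation to a single extracted record. The field names and rules are hypothetical:

```python
import re
from datetime import datetime

def transform_row(row: dict) -> dict:
    """Apply a handful of the transformations described above to one extracted record."""
    # Filtering: keep only the columns the destination cares about
    out = {k: row.get(k) for k in ("name", "gender", "ordered_at", "email", "quantity")}

    # Cleaning: map NULLs to 0 and long labels to the destination's codes
    out["quantity"] = out["quantity"] or 0
    out["gender"] = {"Male": "M", "Female": "F"}.get(out["gender"], out["gender"])

    # Splitting: break a combined date/time column into separate date and time columns
    ordered = datetime.fromisoformat(out.pop("ordered_at"))
    out["order_date"] = ordered.date().isoformat()
    out["order_time"] = ordered.time().isoformat()

    # Validation: a crude email format check before anything reaches the destination
    if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", out["email"] or ""):
        raise ValueError(f"rejecting row with malformed email: {out['email']!r}")
    return out

print(transform_row({"name": "Ada", "gender": "Female",
                     "ordered_at": "2022-05-14T09:30:00",
                     "email": "ada@example.com", "quantity": None}))
```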

A to A versus B to B

While data integration is centered on the ETL of data from applications and IT services, these applications and services are not always owned by – and do not always serve – the same organisation. Data integration is often broken into two broad categories: application to application and business to business.

Application-to-application data integration is shorthand for ETL operations on applications and services used by a single organisation. An example would be connecting ERP and CRM systems. Application-to-application integrations are increasingly done at the API level, though some integrations with older solutions may still occur directly at a database level, or even require files to be transferred between applications.

Modern application-to-application integration rarely requires an intermediate database, as "direct" integrations between popular applications and services exist. These direct integrations only cover some use cases, however, and many organisations still pump data into a middleware solution between extraction and loading. This is more common for organisations that also engage in business-to-business data integration, as data from one application or service may be used in multiple ways in these organisations.

Business-to-business integration (also known as external data integration) involves the exchange of data between multiple organisations. One classic example of this data integration pattern is customers loading data into a logistics company's IT systems to automatically generate waybills.

Business-to-business integration typically differs from application-to-application integration in that integration with external entities is usually done by exchanging external documents. Most organisations are understandably reluctant to open up the APIs of their critical business applications for a variety of reasons.

The exchange of documents that is typical of B to B data integration usually occurs in some sort of loosely coupled format, meaning that each customer's files can be a little different from the next. In the logistics company example, this means that the company's IT practitioners have to cope with hundreds of thousands of ever-so-slightly different documents from its hundreds of thousands of customers.

Those ingesting data from such files typically write scripts to transform the data formats they are receiving. However, this problem gets nasty in a hurry as it scales.

The scale problem

At the core of the problem with external data integration is that each organisation extracting and transmitting data from their systems not only uses different source IT solutions, but they all have different developers available to them. These developers have varying skill sets, and varying levels of experience.

Even where two organisations are using the same system from which they plan to extract data, one organisation might apply data transformation to that extracted data while another might not.

In the logistics example, I'll assume the company transforms its data before sending it, applying filtering and validation. It sends only the data that is absolutely necessary, and ensures that the data is formatted such that it should not cause database errors.

A fellow organisation, however, might send what amounts to a raw data dump from their IT system. This might include superfluous Personally Identifiable Information (PII) that will instantly place a burden of responsibility on the destination organisation. In addition, any malformed or malicious data will go through unchanged.

If the recipient of these two companies' data were to load these two datasets without transformation, not only would the two data sets look out of whack but the recipient could also corrupt its database.

External integration in practice

Where does this leave us? Put yourself in the position of that recipient – there's plenty of us.

Organisations that receive data from multiple sources need to be able to quickly and efficiently design parsers to perform ETL. Each customer may need their own parser, and each parser will need a standard set of data transformations applied to it, including a great deal of data verification.
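One way to keep hundreds of per-customer parsers manageable is to register each one against a shared pipeline of standard checks. The sketch below assumes two hypothetical customer document formats and a deliberately crude set of validations:

```python
from typing import Callable

# Registry of per-customer parsers, each turning that customer's document
# into a common record shape
PARSERS: dict[str, Callable[[dict], dict]] = {}

def parser(customer_id: str):
    """Decorator that registers a parser for one customer's document format."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        PARSERS[customer_id] = fn
        return fn
    return register

@parser("acme")
def parse_acme(doc: dict) -> dict:
    return {"consignee": doc["ShipTo"], "weight_kg": float(doc["WeightKG"])}

@parser("globex")
def parse_globex(doc: dict) -> dict:
    # This customer sends pounds, so convert on the way in
    return {"consignee": doc["destination"], "weight_kg": float(doc["weight_lb"]) * 0.453592}

def ingest(customer_id: str, doc: dict) -> dict:
    record = PARSERS[customer_id](doc)
    # Standard verification applied to every customer, regardless of parser
    if not record["consignee"] or record["weight_kg"] <= 0:
        raise ValueError(f"invalid record from {customer_id}: {record}")
    return record

print(ingest("acme", {"ShipTo": "Birmingham depot", "WeightKG": "12.5"}))
```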

Unfortunately, this isn't easy. Many systems are simply not designed to perform integrations on this sort of scale, nor do they have the appropriate tools to handle complex data transformation. In addition, many organisations assume that if the application they use is new enough to do API integration then it will also handle files, and this is increasingly incorrect.

This leaves IT building complex and intricate custom ETL answers to front-end their data warehouse, their in-house middleware – or both. For many, the majority of IT staff time will be spent building data ingestion and transformation pipelines. Each customer will need its own pipeline, and there can be hundreds of thousands of them feeding into a single destination application.

Data integration platforms

For those of us looking for a slightly more industrial, less artisanal approach, it's worth taking the time to explore the re-emerging class of data integration platforms.

Data integration platforms are basically glorified parsers. They handle the ETL portion of the equation, but also tend to build in most – if not all – of the dummy-proofing by default.

Some data integration platforms require no more of administrators than mapping data fields in the input data to data fields in the destination. The platform is aware of the destination's data field requirements, and it adds in by default all the input validation and malicious input detection you could want.
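That field-mapping style of configuration might boil down to something like the declarative table below, with validation attached to each destination field once rather than hand-written for every integration. The field names and rules are illustrative, not any particular platform's syntax:

```python
import re

# Declarative mapping: source field -> (destination field, validator)
FIELD_MAP = {
    "Customer":  ("customer_name", lambda v: bool(v.strip())),
    "Email":     ("email",         lambda v: re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", v) is not None),
    "Post Code": ("postal_code",   lambda v: len(v.strip()) >= 4),
}

def map_record(source_row: dict) -> dict:
    """Translate one source row into the destination's schema, validating as we go."""
    mapped = {}
    for src_field, (dst_field, is_valid) in FIELD_MAP.items():
        value = source_row.get(src_field, "")
        if not is_valid(value):
            raise ValueError(f"field {src_field!r} failed validation: {value!r}")
        mapped[dst_field] = value.strip()
    return mapped

print(map_record({"Customer": "Initech Ltd", "Email": "ap@initech.example", "Post Code": "B1 1AA"}))
```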

I wish I'd known about those 15 years ago when I wrote my first middleware systems.

Years of work went into that application, and probably 80 per cent of it was building the fleventy-fleven parsers and the layers upon layers of validation, checking and so forth.

Data integration vendors have been around for decades with new companies emerging recently. They may not meet all your needs, but they are worth looking into before trying to roll your own artisan answer. ®

