Modern data infrastructures don’t do ETL

Businesses run 24/7, from the website to the back office to the supply chain and beyond. It wasn't always this way: everything used to run in batches. Until a few years ago, operational systems were paused so that data could be loaded into a data warehouse and reports could be run. Today, reports are expected to show where things stand right now. There is no window for ETL.

Much of today's IT architecture still relies on a hub-and-spoke model: operational systems feed a data warehouse, which in turn feeds other systems. Specialized visualization software builds reports and dashboards on top of the warehouse. But this is changing, and the changes in the business require both databases and system architectures to adapt.

Fewer copies, better databases

The great cloud migration and scaling efforts of the past decade popularized purpose-built databases. In many companies, the website runs on a NoSQL database while critical systems that handle money live on a mainframe or relational database. And that is just the surface: still more specialized databases are deployed for narrower problems. This architecture often requires moving large amounts of data with traditional batch processes, and that operational complexity causes not only latency but outages. The architecture was never designed to scale; it was patched to stop the bleeding.

Databases have evolved. Relational databases can now handle unstructured, document, and JSON data. Many NoSQL databases have gained at least basic transaction support. Meanwhile, distributed SQL databases deliver data integrity, relational modeling, and extreme scalability while remaining compatible with existing SQL databases and tools.

However, this alone is not enough. The line between transactional or operational systems and analytical systems cannot remain a hard border. A database must handle both large numbers of concurrent users and long-running queries, at least most of the time. To this end, transactional/operational databases are adding analytical capabilities in the form of columnar indexes or massively parallel processing (MPP). It is now possible to run analytical queries against some distributed operational databases, such as MariaDB Xpand (distributed SQL) or Couchbase (distributed NoSQL).
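The hybrid pattern described above can be sketched in a few lines. This is a minimal illustration only: sqlite3 stands in for a distributed operational database, and the table and column names are hypothetical. The point is simply that the short transactional writes and the long-running aggregate hit the same database, which is the workload mix that columnar indexes or MPP are meant to make viable.

```python
# Minimal sketch of one database serving both workload types.
# sqlite3 is only a stand-in for a distributed SQL/NoSQL database;
# "orders", its columns, and the sample values are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Transactional workload: many short point writes.
for i, (region, amount) in enumerate([("east", 10.0), ("west", 7.5), ("east", 5.0)]):
    db.execute("INSERT INTO orders VALUES (?, ?, ?)", (i, region, amount))

# Analytical workload: a long-running aggregate on the *same* database.
# In production, this is the query a columnar index or MPP layer accelerates.
total, = db.execute("SELECT SUM(amount) FROM orders").fetchone()
print(total)  # 22.5
```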

Never extract

That's not to say the technology has reached a point where no specialized database is ever needed. No operational database today can perform petabyte-scale analytics, and there are extreme cases where nothing but a time-series or other specialized database will do. The trick to simplifying the architecture, and to enabling real-time analytics, is to avoid extracts.

In many cases, the answer lies in how data enters the system in the first place. Rather than writing data to one database and later pulling it into another, the transaction can be streamed to both. Modern tools such as Apache Kafka or Amazon Kinesis enable this kind of data flow. This approach gets data to both places without delay, though it requires more careful development to guarantee data integrity. By avoiding the push-pull of extracts, transactional and analytical databases can be updated at the same time, enabling real-time analytics even when a specialized database is required.
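The fan-out pattern above can be sketched in-process. This is not real Kafka or Kinesis code; the `Topic` class below is a stand-in for a durable event stream, and the sink names are hypothetical. What it shows is the shape of the flow: a business transaction is published once and delivered to both the operational and the analytical store, so neither side needs a later extract from the other.

```python
# In-process sketch of the dual-write-via-stream pattern.
# In production the "topic" would be an Apache Kafka topic or Amazon
# Kinesis stream, and each sink a separate consumer process.
import json

class Topic:
    """Stand-in for a durable event stream."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self, handler):
        self.subscribers.append(handler)
    def publish(self, event):
        payload = json.dumps(event)           # serialize the event once...
        for handler in self.subscribers:
            handler(json.loads(payload))      # ...deliver it to every sink

# Two independent sinks: the operational store and the analytical store.
operational_rows, analytical_rows = [], []

orders = Topic()
orders.subscribe(operational_rows.append)   # transactional database writer
orders.subscribe(analytical_rows.append)    # analytical database writer

# One business transaction, published once, lands in both stores.
orders.publish({"order_id": 1, "amount": 99.95})
print(operational_rows == analytical_rows)  # True: both sinks saw the event
```

Note that a real implementation would also need the integrity work the article mentions: idempotent consumers, retries, and handling for a sink that falls behind.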

Some analytical databases simply cannot ingest streams. In those cases, more frequent batch loads can serve as a workaround. To do this effectively, however, the source operational database must sustain longer queries, potentially during peak hours, which again calls for a built-in columnar index or MPP.

Old and new databases

Client-server databases were amazing in their day. They evolved to make good use of many processors and controllers and to deliver performance for a wide variety of applications. But client-server databases were designed for employees, workgroups, and internal systems, not the internet. They have become untenable in the modern era of web-scale systems and data ubiquity.

Many applications use many different databases. The advantage is a small blast radius when one goes down; the downside is that something is always broken. Consolidating onto fewer, distributed databases lets IT departments build a more reliable data infrastructure that handles varying amounts of data and traffic with less downtime. It also means less pushing data around when it's time to analyze it.

Support for new business models and real-time operational analytics are just two benefits of a distributed database architecture. Another is that with fewer copies of data, understanding data lineage and ensuring data integrity become easier. Storing more copies of data across different systems creates more opportunities for mismatches. Sometimes a mismatch is simply due to different timing; other times it is a genuine error. By consolidating data into fewer, more capable systems, you have fewer copies and less to verify.

A new real-time architecture

By relying primarily on general-purpose distributed databases that can handle both transactions and analytics, and using streaming for the larger analytical cases, you can deliver the real-time operational analytics that modern businesses need. These databases and tools are readily available in the cloud and on premises, and they are already widely deployed in production.

Change is difficult and takes time. This is not only a technical problem but a personnel and logistics problem. Many applications were deployed with siloed architectures and live outside the development cycle of the rest of the data infrastructure. However, economic pressure, growing competition, and new business models are pushing even the most conservative and entrenched companies toward this change.

Meanwhile, many organizations are using cloud migration as an occasion to refresh their IT architecture. However it happens and whatever the reason, business now runs in real time, and the data architecture must match.

Andrew C. Oliver is Senior Director of Product Marketing at MariaDB.

The New Tech Forum provides a venue to explore and discuss emerging enterprise technologies with unprecedented depth and breadth. The selection is subjective, based on our selection of the technologies that we think are important and most interesting for InfoWorld readers. InfoWorld does not accept marketing materials for publication and reserves the right to edit all contributed content. Send all inquiries to

Copyright © 2023 IDG Communications, Inc.
