16 May 2024

Efficient data processing with Delta Lake in Microsoft Fabric


Last year, Microsoft launched its new analytics platform, Fabric. How does Fabric differ from other analytics platforms such as Azure Synapse? In this blog, we take a closer look at one of the most important building blocks of an analytics platform: data storage.

Nowadays, ever more data is being stored from all kinds of applications, so analysis tools must be able to process ever larger amounts of data. Not only is the amount of data growing, the frequency at which it arrives is increasing too. For more and more processes it is even important that information is available in real time, for example sensor data about the status of a machine. Data processing must therefore not come at the expense of speed or reliability. To process these ever-growing data volumes efficiently, Microsoft has opted for the open-source software Delta Lake in Fabric.

In Delta Lake, tables are stored as Parquet files, a storage format specifically designed for analyzing big data. It has a number of advantages over traditional formats such as CSV. Data is organized by column and split across files and row groups, each carrying metadata with statistics (such as minimum and maximum values per column). Values from the same column are thus stored physically together, which speeds up queries: irrelevant columns and files simply do not need to be loaded. By comparison, CSV stores data row by row in a single file, so for a specific query all data must first be read to arrive at the answer, a time-consuming task with large data sets. In addition, Parquet files are compressed, so less storage space is used. A disadvantage of Parquet files is that they are immutable: files must be rewritten in their entirety before data can be modified.
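To make this concrete, here is a minimal Python sketch using the pyarrow library; the file name and column names are illustrative, not from this blog.

```python
import pyarrow.parquet as pq

# Read only the columns a query needs; the other column chunks
# are never loaded from disk (column pruning).
table = pq.read_table("sales.parquet", columns=["order_id", "amount"])

# The file footer holds per-column statistics (min/max, null count)
# that query engines use to skip entire row groups.
metadata = pq.read_metadata("sales.parquet")
stats = metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)
```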

Delta Lake uses Parquet storage as its foundation but adds a number of benefits. First, like a relational database, Delta Lake lets you update and delete data, so the entire table no longer has to be rewritten before data can be modified; only the affected files are replaced. Delta Lake also maintains a separate transaction log with metadata about the whole table, instead of relying solely on the metadata inside each individual Parquet file. This makes queries even faster, because not every Parquet file has to be opened just to retrieve its metadata.
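As an illustration, a minimal PySpark sketch of an in-place update on a Delta table. It assumes a Fabric notebook (or any Spark session with Delta Lake configured); the table path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # preconfigured in a Fabric notebook

# Update matching rows in place; Delta rewrites only the affected
# Parquet files and records the change in the transaction log.
customers = DeltaTable.forPath(spark, "Tables/customers")
customers.update(
    condition="country = 'NL'",
    set={"region": "'EMEA'"},  # set values are SQL expression strings
)
```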

In addition, so-called ACID transactions (atomicity, consistency, isolation and durability) guarantee that a write either completes fully or not at all, which increases data reliability. Delta Lake also makes it possible to change the structure of tables over time without breaking the ETL process, which is useful when the structure in the source system changes or you want to add a new column. And in contrast to plain Parquet, Delta Lake keeps previous versions of a table, so you can easily go back to an earlier version if there is an error in the data.
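A short sketch of those last two capabilities, schema evolution and version history, under the same assumptions as above; the table path, columns, and version number are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # preconfigured in a Fabric notebook

# Append rows that carry an extra column; mergeSchema evolves the
# table schema instead of rejecting the write.
new_rows = spark.createDataFrame([(1, "NL", "EMEA")], ["id", "country", "region"])
(new_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("Tables/customers"))

# Time travel: read the table as it existed at an earlier version.
v0 = (spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("Tables/customers"))
```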

What does using Delta Lake mean for your organization? More efficient data storage reduces the cost of an analytics platform: less space is needed to store the same data, and compute resources run for a shorter time when processing it. Delta Lake's ability to handle changing table structures also reduces maintenance costs, because the analytics platform needs to be adjusted less often. In short, these technical improvements mean lower costs, less maintenance, and more reliable analyses and dashboards. A powerful combination to help your organization move forward.

Want to know more?

Get in touch with Mart, Data & Analytics Consultant.