delta lake in starburst galaxy (intro & integration)

What is Delta Lake? I guess the project page says it best.

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines that include Trino

https://delta.io/

Yes, many other engines (including Spark and Hive) and APIs (notably Python and Java) are available to utilize this popular modern table format.

Wondering what a “table format” is? Think of Hive, but with more features such as versioning and storing metadata in the data lake alongside the data files. Still wondering? If so, check out the following video from my colleague Evan Smith.

OK, where were we? Oh yeah, learning a bit about the Delta Lake table format. Delta Lake was initially developed by Databricks and became an open source project in 2019. It has evolved since then and now offers some strong key features.

Delta Lake allows classical Data Manipulation Language (DML) statements (INSERT, UPDATE, DELETE, and MERGE) and commits those operations as ACID-compliant transactions. It does this by maintaining a transaction log of changes (also known as the DeltaLog), which in turn enables time-travel querying since every change creates a new version of the table.
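As a quick illustration of the DML side of that (this is a sketch; the table and column names here are hypothetical, not from anything we build later in this post), the statements look like standard Trino SQL:

```sql
-- assuming a hypothetical Delta table named 'crew' with columns (id, name)
UPDATE crew SET name = 'Duncan' WHERE id = 7;
DELETE FROM crew WHERE id = 8;

-- MERGE upserts rows from a staging table into the target
MERGE INTO crew AS t
USING crew_updates AS s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);
```

Each statement is committed to the DeltaLog as its own transaction, which means each one also creates a new version of the table.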

For a deep-dive on the metadata and DeltaLog that lives on the data lake along with the table’s datasets, as well as how this information is leveraged to maintain table versioning, check out the following video.

That is all cool stuff, but let’s get our hands dirty. Delta Lake integrates nicely with Trino via the Delta Lake Connector. Of course, what’s the fastest way to get going with Trino? Since you asked… it is Starburst Galaxy! Not only is it already running in the cloud, but you can get started for free!!

You can then leverage its Great Lakes connectivity to use one of the popular public cloud object stores. For a little help with that, check out my querying aviation data in the cloud (leveraging starburst galaxy) post that sets up an Amazon S3 catalog.

Here I am ready to go now! How about you?

As you might be able to see above (if you have stronger glasses than me!), I ran a couple of statements to create a schema to play in and then set it as my current schema.

-- my catalog is named 'mycloud', so update accordingly
CREATE SCHEMA mycloud.dlblog1;
USE mycloud.dlblog1;

Let’s create a new table. Notice the type property in the WITH clause — Great Lakes connectivity allows you to create and use multiple table formats, and here I’m just telling it to use Delta Lake.

CREATE TABLE dune_characters (
  id integer,
  name varchar(55),
  notes varchar(255)
)
WITH (type = 'delta');

Now, add a few records.

INSERT INTO dune_characters 
  (id, name, notes)
VALUES
  (101, 'Leto', 'Ruler of House Atreides'),
  (102, 'Jessica', 'Concubine of the Duke');

Verify these records are present.
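If you're following along at home, a simple query is all the verification needed (ordering by id just to keep the output predictable):

```sql
SELECT * FROM dune_characters ORDER BY id;
-- at this point, there should be 2 rows: Leto and Jessica
```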

Ooof! I forgot the star of the book/movie!! Let’s get him inserted, too.

INSERT INTO dune_characters 
  (id, name, notes)
VALUES
  (103, 'Paul', 'Son of Leto');

Instead of just querying the table again to verify it now has 3 rows, here is a peek into the S3 bucket where the table’s data is stored. You can see two data files there (one from the first INSERT and a second from the next one).
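If you'd rather not dig through the bucket by hand, you can get a similar picture in SQL. In the Trino Delta Lake connector versions I've used, each table exposes a hidden `$history` table (hedging here — availability can vary by release):

```sql
-- one row per table version, showing what operation produced it
SELECT version, timestamp, operation
  FROM "dune_characters$history"
 ORDER BY version;
```

You should see one row for the CREATE TABLE plus one for each of the two INSERT statements.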

And that _delta_log folder? Yep, it is the mythical DeltaLog we touched upon earlier. Let’s take a peek into it!

Basically, we have 3 different JSON files, one for each table version. Yes, the CREATE TABLE statement created version 000. Here are a few relevant snippets from each of these files.

Table Version   Relevant Snippets
...000.json     "operation":"CREATE TABLE"
...001.json     "operation":"WRITE", \"numRecords\":2
...002.json     "operation":"WRITE", \"numRecords\":1
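Since each of those JSON files represents a version of the table, you can time-travel back to any of them. Assuming a recent enough Trino/Galaxy release that supports the `FOR VERSION AS OF` syntax on Delta tables:

```sql
-- version 1 was committed before Paul was inserted (2 rows)
SELECT count(*) FROM dune_characters FOR VERSION AS OF 1;

-- version 2 includes all three characters
SELECT count(*) FROM dune_characters FOR VERSION AS OF 2;
```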

Sparked (pun definitely intended) your interest? Go back up and watch that hour-long video for more. I promise it can be fun if you're the science-fair kind of person who enjoys seeing the gory details. For the rest of us, this is plenty for now!

In summary, Starburst Galaxy is the fastest way to get going with Trino. It also has great support for the Delta Lake table format. Stay tuned as I have a few more posts in this series I will be publishing soon.

Published by lestermartin

Software development & data engineering trainer/evangelist/consultant currently focused on data lake frameworks including Hive, Spark, Kafka, Flink, NiFi, and Trino/Starburst.
