
· 6 min read
Adrian Brudaru


Why is there a data engineer shortage?

  1. High Demand and Rapid Growth: The increasing reliance on data-driven decision-making and the rise of big data technologies have created a surge in demand for skilled data engineers.
  2. Skill Gap and Specialization: Data engineering requires a unique blend of technical skills, and finding individuals with the right combination of programming, database management, and cloud computing expertise can be challenging.
  3. Competition from Other Data Roles: The allure of data science and other data-related roles has attracted professionals, leading to a talent shortage in the data engineering field.

How big is the data engineer shortage?

💡 "In Europe there are 32K data engineers and 48K open positions to hire one. In the US the ratio is 41K to 79K" Source: Linkedin data analysis blog post

Well that doesn’t look too bad - if only we could all be about 2x as efficient :)

Bridging the gap: How to make your data engineers 2x more efficient?

There are 2 ways to make the data engineers more efficient:

Option 1: Give them more to do, tell them how to do their jobs better!

For some reason, this doesn’t work out great. All the great minds of our generation told us we should be more like them:

  • do more architecture;
  • learn more tech;
  • use this new toy!
  • learn this paradigm.
  • take a step back and consider your career choices.
  • write more tests;
  • test the tests!
  • analyse the tests :[
  • write a paper about the tests...
  • do all that while alerts go off 24/7 and you are the bottleneck for everyone downstream, analysts and business people screaming. (┛ಠ_ಠ)┛彡┻━┻

“I can't do what ten people tell me to do. So I guess I'll remain the same”

  • Otis Redding, Sittin' On The Dock Of The Bay

Option 2: Take away unproductive work

A data engineer has a pretty limited task repertoire - so could we give some of their work to roles we can hire?

Let’s see what a data engineer does, according to GPT:

  • Data curation: Ensuring data quality, integrity, and consistency by performing data profiling, cleaning, transformation, and validation tasks.
  • Collaboration with analysts: Working closely with data analysts to understand their requirements, provide them with clean and structured data, and assist in data exploration and analysis.
  • Collaboration with DWH architects: Collaborating with data warehouse architects to design and optimize data models, schemas, and data pipelines for efficient data storage and retrieval.
  • Collaboration with governance managers: Partnering with governance managers to ensure compliance with data governance policies, standards, and regulations, including data privacy, security, and data lifecycle management.
  • Structuring and loading: Designing and developing data pipelines, ETL processes, and workflows to extract, transform, and load data from various sources into the target data structures.
  • Performance optimization: Identifying and implementing optimizations to enhance data processing and query performance, such as indexing, partitioning, and data caching.
  • Data documentation: Documenting data structures, data lineage, and metadata to facilitate understanding, collaboration, and data governance efforts.
  • Data troubleshooting: Investigating and resolving data-related issues, troubleshooting data anomalies, and providing support to resolve data-related incidents or problems.
  • Data collaboration and sharing: Facilitating data collaboration and sharing across teams, ensuring data accessibility, and promoting data-driven decision-making within the organization.
  • Continuous improvement: Staying updated with emerging technologies, industry trends, and best practices in data engineering, and actively seeking opportunities to improve data processes, quality, and efficiency.

Let’s get a back-of-the-napkin estimate of how much time they spend on those areas.

Here’s an approximation as offered by GPT. Of course, actual numbers depend on the maturity of your team and their unique challenges.

  • Collaboration with others (including data curation): Approximately 40-60% of their working hours. This includes tasks such as collaborating with team members, understanding requirements, data curation activities, participating in meetings, and coordinating data-related activities.
  • Data analysis: Around 10-30% of their working hours. This involves supporting data exploration, providing insights, and assisting analysts in understanding and extracting value from the data.
  • Technical problem-solving (structuring, maintenance, optimization): Roughly 30-50% of their working hours. This includes solving data structuring problems, maintaining existing data structures, optimizing data pipelines, troubleshooting technical issues, and continuously improving processes.

By looking at it this way, solutions become clear:

  • Let someone else do curation. Analysts could talk directly to producers. By removing the middle man, you improve speed and quality of the process too.
  • Automate data structuring: While this is not as time consuming as the collaboration, it’s the second most time consuming process.
  • Let analysts explore structured data at curation time, not before load. This is a minor optimisation, but 10-30% is still very significant towards our goal of reducing workload by 50%.

How much of their time could be saved?

ChatGPT thinks it is reasonable to expect significant time savings, with the following estimates:

  1. Automation of Structuring and Maintenance: By automating the structuring and maintenance of data, data engineers can save 30-50% or more of their time previously spent on these tasks. This includes activities like schema evolution, data transformation, and pipeline optimization, which can be streamlined through automation.
  2. Analysts and Producers Handling Curation: Shifting the responsibility of data curation to analysts and producers can save an additional 10-30% of the data engineer's time. This includes tasks such as data cleaning, data validation, and data quality assurance, which can be effectively performed by individuals closer to the data and its context.

It's important to note that these estimates are approximate and can vary based on the specific circumstances and skill sets within the team.


💡 40-80% of a data engineer’s time could be spared

To achieve that,

  • Automate data structuring.
  • Govern the data without the data engineer.
  • Let analysts explore data as part of curation, instead of asking data engineers to do it.

This looks good enough for solving the talent shortage. Not only that, but doing things this way lets your team focus on what they do best.

A recipe to do it

  1. Use something with schema inference and evolution to load your data.
  2. Notify stakeholders and producers of data changes, so they can curate it.
  3. Don’t explore JSON with data engineers - let analysts explore structured data.
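For step 1, here is a minimal sketch of what that can look like with dlt - the pipeline, dataset, and table names are just illustrative:

import dlt

# a nested record, e.g. as it arrives from an API as JSON
data = [{"id": 1, "name": "Anna", "address": {"city": "Berlin", "zip": "10115"}}]

pipeline = dlt.pipeline(pipeline_name="demo", destination="duckdb", dataset_name="raw_data")

# dlt infers the schema, types the columns, and unpacks the nested fields automatically
load_info = pipeline.run(data, table_name="users")
print(load_info)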

Ready to stop the pain? Read this explainer on how to do schema evolution with dlt. Want to discuss? Join our slack.

· 5 min read
Tong Chen
info

💡Check out the accompanying colab demo: Google Colaboratory demo


Hi there! 👋 In this article, I will show you a demo on how to train ChatGPT with the open-source dlt repository. Here is the article structure, and you can jump directly to the part that interests you. Let's get started!

I. Introduction

II. Walkthrough

III. Result

IV. Summary

I. Introduction

Navigating an open-source repository can be overwhelming because comprehending the intricate labyrinths of code is always a significant problem. As a person who just entered the IT industry, I found an easy way to address this problem with an ELT tool called dlt (data load tool) - the Python library for loading data.

In this article, I would love to share a use case - training GPT with an Open-Source dlt Repository by using the dlt library. In this way, I can write prompts about dlt and get my personalized answers.

II. Walkthrough

The code provided below demonstrates training a chat-oriented GPT model using the dlt-hub repositories (dlt and pipelines). To train the GPT model, we utilized the assistance of two services: Langchain and Deeplake. In order to use these services for our project, you will need to create an account on both platforms and obtain access tokens. The good news is that both services offer cost-effective options. OpenAI provides a $5 credit to test their API, while Deeplake offers a free tier.

The credit for the code goes to Langchain, which has been duly acknowledged at the end.

1. Run the following commands to install the necessary modules on your system.

python -m pip install --upgrade langchain deeplake openai tiktoken
# Create accounts on platform.openai.com and app.activeloop.ai, grab the access
# tokens for both platforms, and enter them when prompted in the next step.

import os
import getpass

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop Token:')
embeddings = OpenAIEmbeddings(disallowed_special=())

2. Create a directory to store the code for training the model. Clone the desired repositories into it.

# making a new directory named dlt-repo
!mkdir dlt-repo
# changing the directory to dlt-repo
%cd dlt-repo
# cloning git repos into the dlt-repo directory
# dlt code base
!git clone https://github.com/dlt-hub/dlt.git
# example pipelines to help you get started
!git clone https://github.com/dlt-hub/pipelines.git
# going back to previous directory
%cd ..

3. Load the files from the directory

import os
from langchain.document_loaders import TextLoader

root_dir = './dlt-repo' # load data from the cloned repos
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            pass


4. Splitting files into chunks

# This code uses CharacterTextSplitter to split documents into smaller chunks based on character count and stores the resulting chunks in the texts variable.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

5. Create a Deeplake dataset

# Set up your Deeplake dataset by replacing the username with your Deeplake account name and choosing a dataset name.
# For example, with username "your_name" and dataset "dlt_gpt", the path becomes "hub://your_name/dlt_gpt".

username = "your_deeplake_username" # replace with your username from app.activeloop.ai
db = DeepLake(dataset_path=f"hub://{username}/dlt_gpt", embedding_function=embeddings, public=True) #dataset would be publicly available
db.add_documents(texts)

# Re-open the same dataset in read-only mode and assign it to db for retrieval
db = DeepLake(dataset_path=f"hub://{username}/dlt_gpt", read_only=True, embedding_function=embeddings)

# Create a retriever
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10

6. Initialize the GPT model

from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name='gpt-3.5-turbo')
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

III. Result

After the walkthrough, we can start to experiment with different questions, and the model will output answers based on the dlt-hub repositories we trained it on.
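For reference, this is how a question is passed to the chain built in step 6 - a minimal example, where the question string is just an illustration:

# chat_history keeps the conversational context between questions
chat_history = []
question = "Why should data teams use dlt?"
result = qa({"question": question, "chat_history": chat_history})
chat_history.append((question, result["answer"]))
print(result["answer"])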

Here, I asked: "Why should data teams use dlt?"


It outputted:

  1. It works seamlessly with Airflow and other workflow managers, making it easy to modify and maintain your code.
  2. You have complete control over your data. You can rename, filter, and modify it however you want before it reaches its destination.

Next, I asked: "Who is dlt for?"


It outputted:

  1. dlt is meant to be accessible to every person on the data team, including data engineers, analysts, data scientists, and other stakeholders involved in data loading. It is designed to reduce knowledge requirements and enable collaborative working between engineers and analysts.

IV. Summary

It worked! We can see how GPT can learn about an open source library by using dlt and utilizing the assistance of Langchain and Deeplake. Moreover, by simply following the steps above, you can customize the GPT model training to your own needs.

Curious? Give the Colab demo💡 a try or share your questions with us, and we'll have ChatGPT address them in our upcoming article.


[ What's more? ]

  • Learn more about [dlt] 👉 here
  • Need help or want to discuss? Join our Slack community ! See you there 😊

· 6 min read
Adrian Brudaru

Schema evolution combines a technical process with a curation process, so let's understand the process, and where the technical automation needs to be combined with human curation.

Whether you are aware or not, you are always getting structured data for usage

Data used is always structured, but usually produced unstructured.

Structuring it implicitly during reading is called "schema on read", while structuring it upfront is called "schema on write".

To fit unstructured data into a structured database, developers have to perform this transition before loading. For data lake users who read unstructured data, their pipelines apply a schema during read - if this schema is violated, the downstream software will produce bad outcomes.

We tried running away from our problems, but it didn't work.

Because structuring data is difficult to deal with, people have tried to not do it. But this created its own issues.

  • Loading json into db without typing or structuring - This anti-pattern was created to shift the structuring of data to the analyst. While this is a good move for curation, the db support for structuring data is minimal and unsafe. In practice, this translates to the analyst spending their time writing lots of untested parsing code and pushing silent bugs to production.
  • Loading unstructured data to lakes - This pattern pushes the curation of data to the analyst. The problem here is similar to the one above. Unstructured data is hard to analyse and curate, and the farther it is from the producer, the harder it is to understand.

So no, one way or another we are using schemas.

If curation is hard, how can we make it easier?

  • Make data easier to discover, analyze, explore. Structuring upfront would do that.
  • Simplify the human process by decentralizing data ownership and curation - the analyst can work directly with the producer to define the dataset produced.

Structuring & curating data are two separate problems. Together they are more than the sum of the parts.

The problem is that curating data is hard.

  • Typing and normalising data are technical processes.
  • Curating data is a business process.

Here's what a pipeline building process looks like:

  1. Speak with the producer to understand what the data is. Chances are the producer does not document it and there will be many cases that need to be validated analytically.
  2. Speak with the analyst or stakeholder to get their requirements. Guess which fields fulfill their requirements.
  3. Combine the 2 pieces of info to filter and structure the data so it can be loaded.
  4. Type the data (for example, convert strings to datetime).
  5. Load the data to warehouse. Analyst can now validate if this was the desired data with the correct assumptions.
  6. Analyst validates with stakeholder that this is the data they wanted. Stakeholder usually wants more.
  7. Possibly adjust the data filtering, normalization.
  8. Repeat entire process for each adjustment.

And when something changes,

  1. The data engineer sees something break.
  2. They ask the producer about it.
  3. They notify the analyst about it.
  4. The analyst notifies the business that data will stop flowing until adjustments.
  5. The analyst discusses with the stakeholder to get any updated requirements.
  6. The analyst offers the requirements to the data engineer.
  7. The data engineer checks with the producer/data how the new data should be loaded.
  8. Data engineer loads the new data.
  9. The analyst can now adjust their scripts, re-run them, and offer data to stakeholder.

Divide et impera! The two problems are technical and communicational, so let's let computers solve tech and let humans solve communication.

Before we start solving, let's understand the problem:

  1. For usage, data needs to be structured.
  2. Because structuring is hard, we try to reduce the amount we do by curating first or deferring to the analyst by loading unstructured data.
  3. Now we are trying to solve two problems at once: structuring and curation, with each role functioning as a bottleneck for the other.

So let's de-couple these two problems and solve them appropriately:

  • The technical issue is that unstructured data needs to be structured.
  • The curation issue relates to communication - so taking the engineer out of the loop would make this easier.

Automate the tech: Structuring, typing, normalizing

The only reason to keep data unstructured was the difficulty of applying structure.

By automating schema inference, evolution, normalization, and typing, we can just load our jsons into structured data stores, and curate it in a separate step.

Alert the communicators: When there is new data, alert the producer and the curator.

To govern how data is produced and used, we need to have a definition of the data that the producer and consumer can both refer to. This has typically been tackled with data contracts - a type of technical test that would notify the producer and consumer of violations.

So how would a data contract work?

  1. Human process:
    1. Humans define a data schema.
    2. Humans write a test to check if data conforms to the schema.
    3. Humans implement notifications for test fails.
  2. Technical process:
    1. Data is extracted.
    2. Data is staged to somewhere where it can be tested.
    3. Data is tested:
      1. If the test fails, we notify the producer and the curator.
      2. If the test succeeds, it gets transformed to the curated form.

So how would we do schema evolution with dlt?

  1. Data is extracted, dlt infers schema and can compare it to the previous schema.
  2. Data is loaded to a structured data lake (staging area).
  3. Destination schema is compared to the new incoming schema.
    1. If there are changes, we notify the producer and curator.
    2. If there are no changes, we carry on with transforming it to the curated form.

So, schema evolution is essentially a simpler way to do a contract on schemas. If you had additional business-logic tests, you would still need to implement them in a custom way.

The implementation recipe

  1. Use dlt. It will automatically infer and version schemas, so you can simply check if there are changes. You can just use the normaliser + loader or build extraction with dlt. If you want to define additional constraints, you can do so in the schema.
  2. Define your slack hook or create your own notification function. Make sure the slack channel contains the data producer and any stakeholders.
  3. Capture the load job info and send it to the hook.
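A minimal sketch of steps 2 and 3, assuming a Slack incoming webhook URL - the helper function and message format here are illustrative, not a built-in dlt API:

import requests
import dlt

SLACK_HOOK = "https://hooks.slack.com/services/..."  # your incoming webhook URL

def notify_slack(message: str) -> None:
    # post a plain-text message to the channel that contains the producer and stakeholders
    requests.post(SLACK_HOOK, json={"text": message})

pipeline = dlt.pipeline(pipeline_name="data_platform", destination="bigquery", dataset_name="raw")
# run your source or resource here; a tiny inline example stands in for real data
load_info = pipeline.run([{"id": 1, "value": "a"}], table_name="events")

# send the load summary, including any schema updates, to the channel
notify_slack(str(load_info))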

· 4 min read
Rahul Joshi

Why we need a simple Google Sheets -> data warehouse pipeline

Spreadsheets are great. They are really simple to use and offer a lot of functionality to query, explore, manipulate, import/export data. Their wide availability and ease of sharing also make them great tools for collaboration. But they have limitations and cannot be used for storage and processing of large-scale complex data. Most organizational data is actually stored in data warehouses and not spreadsheets.

However, because of the easy set up and intuitive workflow, Google Sheets are still used by many people to track and analyze smaller datasets. But even this data often needs to be combined with the rest of the organizational data in the data warehouse for reasons like analytics, reporting etc. This is not a problem when the dataset is small and static and just needs to be exported once to the data warehouse. In most cases, however, the Google Sheets data is not static and is updated regularly, thus creating a need for an ETL pipeline, and thereby complicating an otherwise simple and intuitive workflow.

Since dlt has a Google Sheets pipeline that is very easy to set up and deploy, we decided to write a blog post to demonstrate how some very common use-cases of Google Sheets can be enhanced by inserting this dlt pipeline into the process.

Use-case #1: Google sheets pipeline for measuring marketing campaign ROI

As an example of such a use-case, consider this very common scenario: You're the marketing team of a company that regularly launches social media campaigns. You track some of the information such as campaign costs in Google Sheets, whereas all of the other related data such as views, sign-ups, clicks, conversions, revenue etc. is stored in the marketing data warehouse. To optimize your marketing strategy, you decide to build a dashboard to measure the ROI for the campaigns across different channels. Hence, you would like to have all your data in one place to easily be able to connect your reporting tool to it.

To demonstrate this process, we created some sample data where we stored costs related to some campaigns in a Google Sheet and the rest of the related data in BigQuery.


We then used the dlt google sheets pipeline by following these simple steps to load the Google Sheets data into BigQuery.
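For orientation, a hedged sketch of what running that pipeline looks like - the spreadsheet reference is a placeholder, and the exact arguments of the verified source may differ from its current version:

# dlt init google_sheets bigquery   <- creates the verified source locally
import dlt
from google_sheets import google_spreadsheet  # source function shipped with the verified source

# credentials for Google Sheets and BigQuery live in .dlt/secrets.toml
data = google_spreadsheet("<spreadsheet url or id>")

pipeline = dlt.pipeline(pipeline_name="campaign_costs", destination="bigquery", dataset_name="marketing")
print(pipeline.run(data))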

With the data loaded, we finally connected Metabase to the data warehouse and created a dashboard to understand the ROIs across each platform.

Use-case #2: Evaluating the performance of your ML product using google sheets pipeline

Another use-case for Google Sheets that we've come across frequently is to store annotated training data for building machine learning (ML) products. This process usually involves a human first manually doing the annotation and creating the training set in Google Sheets. Once there is sufficient data, the next step is to train and deploy the ML model. After the ML model is ready and deployed, the final step would be to create a workflow to measure its performance, which, depending on the data and product, might involve combining the manually annotated Google Sheets data with the product usage data that is typically stored in some data warehouse.

A very common example of such a workflow is with customer support platforms that use text classification models to categorize incoming customer support tickets into different issue categories for efficient routing and resolution of the tickets. To illustrate this example, we created a Google Sheet with issues manually annotated with a category. We also included other manually annotated features that might help measure the effectiveness of the platform, such as priority level for the tickets and customer feedback.


We then populated a BigQuery dataset with potential product usage data, such as: the status of the ticket (open or closed), response and resolution times, whether the ticket was escalated, etc.

Then, as before, we loaded the Google Sheets data to the data warehouse using the dlt Google Sheets pipeline by following these steps.

Finally we connected Metabase to it and built a dashboard measuring the performance of the model over the period of a month:


· 7 min read
Adrian Brudaru
info

Google Colaboratory demo

This colab demo was built and shown by our working student Rahul Joshi for the Berlin Data meetup, where he talked about the state of schema evolution in open source.

What is schema evolution?

In the fast-paced world of data, the only constant is change, and it usually comes unannounced.

Schema on read

Schema on read means your data does not have a schema, but your consumer expects one. So when they read, they define the schema, and if the unstructured data does not have the same schema, issues happen.

Schema on write

So, to avoid things breaking at runtime, you would want to define a schema upfront - hence you would structure the data. The problem with structuring data is that it’s a labor-intensive process that makes people take pragmatic shortcuts, such as structuring only some data, which later leads to lots of maintenance.

Schema evolution means that a schema is automatically generated on write for the data, and automatically adjusted for any changes in the data, enabling a robust and clean environment downstream. It’s an automatic data structuring process that is aimed at saving time during creation, maintenance, and recovery.

Why do schema evolution?

One way or another, produced raw unstructured data becomes structured during usage. So, which paradigm should we use around structuring?

Let’s look at the 3 existing paradigms, their complexities, and what a better solution could look like.

The old ways

The data warehouse paradigm: Curating unstructured data upfront

Traditionally, many organizations have adopted a 'curate first' approach to data management, particularly when dealing with unstructured data.

The desired outcome is that by curating the data upfront, we can directly extract value from it later. However, this approach has several pitfalls.

Why curating unstructured data first is a bad idea

  1. It's labor-intensive: Unstructured data is inherently messy and complex. Curating it requires significant manual effort, which is time-consuming and error-prone.
  2. It's difficult to scale: As the volume of unstructured data grows, the task of curating it becomes increasingly overwhelming. It's simply not feasible to keep up with the onslaught of new data. The Data Mesh paradigm, for example, tries to address this.
  3. It delays value extraction: By focusing on upfront curation, organizations often delay the point at which they can start extracting value from their data. Valuable insights are often time-sensitive, and any delay could mean missed opportunities.
  4. It assumes we know what the stakeholders will need: Curating data requires us to make assumptions about what data will be useful and how it should be structured. These assumptions might be wrong, leading to wasted effort or even loss of valuable information.

The data lake paradigm: Schema-on-read with unstructured data

In an attempt to bypass upfront data structuring and curation, some organizations adopt a schema-on-read approach, especially when dealing with data lakes. While this offers flexibility, it comes with its share of issues:

  1. Inconsistency and quality issues: As there is no enforced structure or standard when data is ingested into the data lake, the data can be inconsistent and of varying quality. This could lead to inaccurate analysis and unreliable insights.
  2. Complexity and performance costs: Schema-on-read pushes the cost of data processing to the read stage. Every time someone queries the data, they must parse through the unstructured data and apply the schema. This adds complexity and may impact performance, especially with large datasets.
  3. Data literacy and skill gap: With schema-on-read, each user is responsible for understanding the data structure and using it correctly, which is unreasonable to expect with undocumented unstructured data.
  4. Lack of governance: Without a defined structure, data governance can be a challenge. It's difficult to apply data quality, data privacy, or data lifecycle policies consistently.

The hybrid approach: The lakehouse

  • The data lakehouse uses the data lake as a staging area for creating a warehouse-like structured data store.
  • This does not solve any of the previous issues with the two paradigms, but rather allows users to choose which one they apply on a case-by-case basis.

The new way

The current solution : Structured data lakes

Instead of trying to curate unstructured data upfront, a more effective approach is to structure the data first with some kind of automation. By applying a structured schema to the data, we can more easily manage, query, and analyze the data.

Here's why structuring data before curation is a good idea:

  1. It reduces maintenance: By automating schema creation and maintenance, you remove 80% of pipeline maintenance events.
  2. It simplifies the data: By imposing a structure on the data, we can reduce its complexity, making it easier to understand, manage, and use.
  3. It enables automation: Structured data is more amenable to automated testing and processing, including cleaning, transformation, and analysis. This can significantly reduce the manual effort required to manage the data.
  4. It facilitates value extraction: With structured data, we can more quickly and easily extract valuable insights. We don't need to wait for the entire dataset to be curated before we start using it.
  5. It's more scalable: Reading structured data enables us to only read the parts we care about, making it faster, cheaper, and more scalable.

Therefore, adopting a 'structure first' approach to data management can help organizations more effectively leverage their unstructured data, minimizing the effort, time, and complexity involved in data curation and maximizing the value they can extract from their data.

An example of such a structured lake would be Parquet file data lakes, which are both structured and inclusive of all data. However, the challenge here is creating the structured Parquet files and maintaining the schemas, for which the Delta Lake framework provides some decent solutions but is still far from complete.

The better way

So, what if writing and merging parquet files is not for you? After all, file-based data lakes capture a minority of the data market.

dlt is the first open source Python library to offer schema evolution

dlt enables organizations to impose structure on data as it's loaded into the data lake. This approach, often termed as schema-on-load or schema-on-write, provides the best of both worlds:

  1. Easier maintenance: By notifying the data producer and consumer of loaded data schema changes, they can quickly decide together how to adjust downstream usage, enabling immediate recovery.
  2. Consistency and quality: By applying structure and data typing rules during ingestion, dlt ensures data consistency and quality. This leads to more reliable analysis and insights.
  3. Improved performance: With schema-on-write, the computational cost is handled during ingestion, not when querying the data. This simplifies queries and improves performance.
  4. Ease of use: Structured data is easier to understand and use, lowering the skill barrier for users. They no longer need to understand the intricate details of the data structure.
  5. Data governance: Having a defined schema allows for more effective data governance. Policies for data quality, data privacy, and data lifecycle can be applied consistently and automatically.

By adopting a 'structure first' approach with dlt, organizations can effectively manage unstructured data in common destinations, optimizing for both flexibility and control. It helps them overcome the challenges of schema-on-read, while reaping the benefits of a structured, scalable, and governance-friendly data environment.

To try out schema evolution with dlt, check out our colab demo.

colab demo

Want more?

  • Join our Slack
  • Read our schema evolution blog post
  • Stay tuned for the next article in the series: How to do schema evolution with dlt in the most effective way

· 5 min read
Rahul Joshi
info

TL;DR: Trying to become more user-centric and make data driven decisions? Get started with the SQL source pipeline + BigQuery + Metabase

When you have a web and / or mobile app but no data yet

If you're a startup without a dedicated data team but a sizeable number of users on your website or mobile app, then chances are that you are collecting and storing all your product data in OLTP databases like MySQL, Postgres, etc. As you have grown, you have likely been aiming to become more user-centric, yet you find that no one at your company has information on what your users do or what their experience is like. Stakeholders should be making data-driven decisions, but they are not yet because they are unable to use the existing data to understand user behavior and experience. This is usually the point when folks realize they need a data warehouse.

Why a data warehouse is necessary

OLTP databases are great because they are optimized to handle high-volume real-time transactions and maintain data integrity and consistency. However, they are not very well-suited for advanced analytics and data modelling. If you want to create reports, dashboards, and more that help you understand your users, you are going to want to extract, load, and transform (ELT) into an OLAP database like Google BigQuery, Snowflake, etc. To do this, you will need to create a data pipeline, which can be quite challenging if your company does not have a dedicated data engineering team.

Why a data pipeline is necessary

Production dashboards rely on the availability of consistent, structured data, which necessitates deploying a data pipeline that is idempotent, can manage the schema and handle schema changes, can be deployed to load data incrementally, etc. For most startups, it's not obvious how to create such pipelines. This is why we decided to demonstrate how one can set up such a data pipeline and build analytics dashboards on top of it.

Why a reporting tool is necessary

We chose to build our dashboard in Metabase because it also offers an open source edition. The advantage of reporting tools like Metabase is that they are easy and intuitive to use even for people who can't write SQL, but at the same time they are powerful enough for those who would like to use SQL.

How we set this up

1. Creating a PostgreSQL -> BigQuery pipeline

Our aim was to create a Metabase dashboard to explore data in a transactional database. The data set that we chose was a sample of The Dell DVD Store 2 database, which we put into a Postgres database deployed on a Google Cloud SQL instance. To make this data available to Metabase, we needed to first load all of the data into a BigQuery instance, and for this we needed a data pipeline. We created this pipeline by doing very simple customizations on the existing dlt sql_database pipeline. See the accompanying repo for the steps we followed.
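For orientation, a hedged sketch of the starting point before our customizations - connection details live in .dlt/secrets.toml, and the exact source arguments may differ from the verified source's current version:

# dlt init sql_database bigquery   <- creates the verified source locally
import dlt
from sql_database import sql_database  # loads every table of the source database

pipeline = dlt.pipeline(pipeline_name="dvd_store", destination="bigquery", dataset_name="dvd_store")
print(pipeline.run(sql_database()))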

2. Building a Metabase reporting dashboard

With the database uploaded to BigQuery, we were now ready to build a dashboard. We created a Metabase cloud account and connected it to our BigQuery instance. This made the whole database accessible to Metabase and we were able to analyze the data.

The DVD store database contains data on the products (film DVDs), product categories, existing inventory, customers, orders, order histories etc. For the purpose of the dashboard, we decided to explore the question: How many orders are being placed each month and which films and film categories are the highest selling?

In addition to this, we were also able to set up email alerts to get notified whenever the stock of a DVD was either empty or close to emptying.


3. Deploying the pipeline

With our dashboard ready, all we had to do was deploy our pipeline so that the dashboard could get updated with new data daily. Since the dashboard only uses some of the tables, we needed to modify the pipeline, which was configured to load the entire database, to instead only update the necessary tables. We also wanted to make it possible for the pipeline to load tables incrementally whenever possible.

We first started by selecting the tables that we wanted to update, namely: orders, orderlines, products, categories, inventory. We then decided whether we wanted to update the tables incrementally or with full replace:

  • Tables orders and orderlines contain data on the orders placed. This means that they also contain a date column and hence are loaded incrementally every day.
  • Tables products, categories, and inventory contain information on the existing products. These tables don't contain a date column and are updated whenever there is any change in inventory. Since the values of the existing data in the tables can change, these tables are not updated incrementally, but are instead fully loaded each time the pipeline is run.

In order to specify these conditions and deploy our pipeline in production, we followed these steps.
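To illustrate, a hedged sketch of how that per-table behavior can be expressed with dlt - the column name comes from the DVD Store schema, and the sql_table arguments may differ from the verified source's current version:

import dlt
from sql_database import sql_table

pipeline = dlt.pipeline(pipeline_name="dvd_store", destination="bigquery", dataset_name="dvd_store")

# orders and orderlines have a date column, so only new rows are loaded and appended
orders = sql_table(table="orders", incremental=dlt.sources.incremental("orderdate"))
pipeline.run(orders, write_disposition="append")

# products, categories and inventory have no date column and can change in place,
# so they are fully reloaded on every run
for table in ["products", "categories", "inventory"]:
    pipeline.run(sql_table(table=table), write_disposition="replace")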

· 4 min read
Rahul Joshi
info

TL;DR: We created a Hacker News -> BigQuery dlt pipeline to load all comments related to popular ELT keywords and then used GPT-4 to summarize the comments. We now have a live dashboard that tracks these keywords and an accompanying GitHub repo detailing our process.

Motivation

To figure out how to improve dlt, we are constantly learning about how people approach extracting, loading, and transforming data (i.e. ELT). This means we are often reading posts on Hacker News (HN), a forum where many developers like ourselves hang out and share their perspectives. But finding and reading the latest comments about ELT from their website has proved to be time consuming and difficult, even when using Algolia Hacker News Search to search.

So we decided to set up a dlt pipeline to extract and load comments using keywords (e.g. Airbyte, Fivetran, Matillion, Meltano, Singer, Stitch) from the HN API. This empowered us to then set up a custom dashboard and create one sentence summaries of the comments using GPT-4, which made it much easier and faster to learn about the strengths and weaknesses of these tools. In the rest of this post, we share how we did this for ELT. A GitHub repo accompanies this blog post, so you can clone and deploy it yourself to learn about the perspective of HN users on anything by replacing the keywords.

Creating a dlt pipeline for Hacker News

For the dashboard to have access to the comments, we needed a data pipeline. So we built a dlt pipeline that could load the comments from the Algolia Hacker News Search API into BigQuery. We did this by first writing the logic in Python to request the data from the API and then following this walkthrough to turn it into a dlt pipeline.
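As an illustration, here is a hedged sketch of such a resource - the Algolia HN Search endpoint is real, while the resource name, keyword handling, and pagination are simplified:

import dlt
import requests

@dlt.resource(table_name="hn_comments", write_disposition="append")
def hn_comments(keyword: str, min_timestamp: int):
    # page through comments that mention the keyword, newest first
    page = 0
    while True:
        response = requests.get(
            "https://hn.algolia.com/api/v1/search_by_date",
            params={"query": keyword, "tags": "comment",
                    "numericFilters": f"created_at_i>{min_timestamp}", "page": page},
        )
        data = response.json()
        yield data["hits"]
        page += 1
        if page >= data["nbPages"]:
            break

pipeline = dlt.pipeline(pipeline_name="hn_elt", destination="bigquery", dataset_name="hacker_news")
pipeline.run(hn_comments("Fivetran", min_timestamp=1640995200))  # January 1st, 2022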

With our dlt pipeline ready, we loaded all of the HN comments corresponding to the keywords from January 1st, 2022 onward.

Using GPT-4 to summarize the comments

Now that the comments were loaded, we were ready to use GPT-4 to create a one sentence summary for them. We first used simple heuristics in Python to filter out any irrelevant comments that may have been loaded. Once we were left with only relevant comments, we called the gpt-4 API and prompted it to summarize in one line what the comment was saying about the chosen keywords. If you don't have access to GPT-4 yet, you could also use the gpt-3.5-turbo API.
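A hedged sketch of that call, using the ChatCompletion API available at the time - the prompt wording and the way context is passed are illustrative:

import openai

def summarize_comment(comment_text: str, story_title: str, keyword: str) -> str:
    # the story title (and, in our pipeline, any parent comments) is passed as context
    response = openai.ChatCompletion.create(
        model="gpt-4",  # or "gpt-3.5-turbo" if you don't have GPT-4 access yet
        messages=[
            {"role": "system", "content": "You summarize Hacker News comments in one sentence."},
            {"role": "user", "content": f"Story: {story_title}\nComment: {comment_text}\n"
                                        f"In one sentence, what does this comment say about {keyword}?"},
        ],
    )
    return response["choices"][0]["message"]["content"]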

Since these comments were posted in response to stories or other comments, we fed in the story title and any parent comments as context in the prompt. To avoid hitting rate-limit errors and losing all progress, we ran this for 100 comments at a time, saving the results to a CSV file each time. We then built a Streamlit app to load and display them in a dashboard.


Deploying the pipeline, Google Bigquery, and Streamlit app

With all the comments loaded and the summaries generated in bulk, we were ready to deploy this process and have the dashboard update daily with new comments.

We decided to deploy our streamlit app on a GCP VM. To have our app update daily with new data we did the following:

  1. We first deployed our dlt pipeline using GitHub Actions to allow new comments to be loaded to BigQuery daily
  2. We then wrote a Python script that could pull new comments from BigQuery into the VM and scheduled it to run daily using crontab
  3. This Python script also calls the gpt-4 API to generate summaries only for the new comments
  4. Finally, this Python script updates the CSV file that is being read by the streamlit app to create the dashboard. Check it out here!

Follow the accompanying GitHub repo to create your own Hacker News/GPT-4 dashboard.

· 3 min read
Rahul Joshi
info

TL;DR: As of last week, there is a dlt pipeline that loads data from Google Analytics 4 (GA4). We’ve been excited about GA4 for a while now, so we decided to build some internal dashboards and show you how we did it.

Why GA4?

We set out to build an internal dashboard demo based on data from Google Analytics (GA4). Google announced that they will stop processing hits for Universal Analytics (UA) on July 1st, 2023, so many people are now having to figure out how to set up analytics on top of GA4 instead of UA and struggling to do so. For example, in UA, a session represents the period of time that a user is actively engaged on your site, while in GA4, a session_start event generates a session ID that is associated with all future events during the session. Our hope is that this demo helps you begin this transition!

Initial explorations

We decided to make a dashboard that helps us better understand data attribution for our blog posts (e.g. As DuckDB crosses 1M downloads / month, what do its users do?). Once we got our credentials working, we used the GA4 dlt pipeline to load data into a DuckDB instance on our laptop. This allowed us to figure out what requests we needed to make to get the necessary data to show the impact of each blog post (e.g. across different channels, what was the subsequent engagement with our docs, etc.). We found it helpful to use the GA4 Query Explorer for this.
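For orientation, a hedged sketch of that local load - the source module and function names here follow the general layout of dlt verified sources and may differ from the actual GA4 source, and the report queries are configured separately:

# dlt init google_analytics duckdb   <- creates the verified source locally
import dlt
from google_analytics import google_analytics  # assumed source function name

# credentials and the report queries (dimensions / metrics) are configured in .dlt/
pipeline = dlt.pipeline(pipeline_name="ga4_demo", destination="duckdb", dataset_name="ga4_data")
print(pipeline.run(google_analytics()))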

Internal dashboard


With the data loaded locally, we were able to build the dashboard on our system using Streamlit. You can also do this on your system by simply cloning this repo and following the steps listed here.

After having the pipeline and the dashboard set up just how we liked it, we were now ready to deploy it.

Deploying the data warehouse

We decided to deploy our Streamlit app on a Google Cloud VM instance. This means that instead of storing the data locally, it would need to be in a location that could be accessed by the Streamlit app. Hence we decided to load the data onto a PostgreSQL database in the VM. See here for more details on our process.

Deploying the dlt pipeline with GitHub Actions

Once we had our data warehouse set up, we were ready to deploy the pipeline. We then followed the deploy a pipeline walkthrough to configure and deploy a pipeline that will load the data daily onto our data warehouse.

Deploying the dashboard

We finally deployed our Streamlit app on our Google Cloud VM instance by following these steps.

Enjoy this blog post? Give dlt a ⭐ on GitHub 🤜🤛

· 3 min read
Matthaus Krzykowski

Using DuckDB, dlt, & GitHub to explore DuckDB

tip

TL;DR: We created a Colab notebook for you to learn more about DuckDB (or any open source repository of interest) using DuckDB, dlt, and the GitHub API 🙂

So is DuckDB full of data about ducks?

Nope, you can put whatever data you want into DuckDB ✨

Many data analysts, data scientists, and developers prefer to work with data on their laptops. DuckDB allows them to start quickly and easily. When working only locally becomes infeasible, they can then turn this local “data pond” into a data lake, storing their data on object storage like Amazon S3, and continue to use DuckDB as a query engine on top of the files stored there.

If you want to better understand why folks are excited about DuckDB, check out this blog post.

Perhaps ducks use DuckDB?

Once again, the answer is 'nein'. As far as we can tell, it is usually people who use DuckDB 🦆

To determine this, we loaded emoji reaction data for DuckDB repo using data load tool (dlt) from the GitHub API to a DuckDB instance and explored who has been reacting to issues / PRs in the open source community. This is what we learned…

The three issues / PRs with the most reactions all-time are

  1. SQLAlchemy dialect #305
  2. Add basic support for GeoSpatial type #2836
  3. Support AWS default credential provider chain #4021

The three issues / PRs with the most reactions in 2023 are

  1. Add support for Pivot/Unpivot statements #6387
  2. Add support for a pluggable storage and catalog back-end, and add support for a SQLite back-end storage #6066
  3. Add support for UPSERT (INSERT .. ON CONFLICT DO ..) syntax #5866

Looking at some of the most engaged users (other than the folks who work at DuckDB Labs), all of them seem to be people. Admittedly, we didn’t look at everyone though, so there could be ducks within the flock. You can check yourself by playing with the Colab notebook.
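For reference, a hedged sketch of how such a load can be put together with dlt and the public GitHub REST API - the repo is real, while the resource and table names are illustrative (and a real run would add a GitHub token to avoid rate limits):

import dlt
import requests

@dlt.resource(table_name="issue_reactions", write_disposition="replace")
def duckdb_issue_reactions():
    # each issue / PR object carries a "reactions" summary (total_count, +1, hooray, ...)
    page = 1
    while True:
        issues = requests.get(
            "https://api.github.com/repos/duckdb/duckdb/issues",
            params={"state": "all", "per_page": 100, "page": page},
        ).json()
        if not issues:
            break
        yield issues
        page += 1

pipeline = dlt.pipeline(pipeline_name="github_reactions", destination="duckdb", dataset_name="duckdb_repo")
pipeline.run(duckdb_issue_reactions())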

Maybe it’s called DuckDB because you can use it to create a "data pond" that can grow into a data lake + ducks like water?

Although this is a cool idea, it is still not the reason that it is called DuckDB 🌊

Using functionality offered by DuckDB to export the data loaded to it as Parquet files, you can create a small “data pond” on your local computer. To make it a data lake, you can then add these files to Google Cloud Storage, Amazon S3, etc. And if you want this data lake to always fill with the latest data from the GitHub API, you can deploy the dlt pipeline.
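For example, a minimal sketch of that export with the DuckDB Python client - the database and table names are the placeholders from the sketch above:

import duckdb

con = duckdb.connect("github_reactions.duckdb")
# write a table out as Parquet; the resulting files can later be synced to S3 / GCS
con.execute("COPY duckdb_repo.issue_reactions TO 'issue_reactions.parquet' (FORMAT PARQUET)")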

Check this out in the Colab notebook and let us know if you want some help setting this up.

Just tell me why it is called DuckDB!!!

Okay. It’s called DuckDB because ducks are amazing and @hannes once had a pet duck 🤣

Why "Duck" DB? Source: DuckDB: an Embeddable Analytical RDBMS

Enjoy this blog post? Give data load tool (dlt) a ⭐ on GitHub here 🤜🤛

· 9 min read
Matthaus Krzykowski

Summary

The excitement around DuckDB has steadily increased over the last year. The project has consistently crossed the 1M downloads per month mark over the last three months (Dec ‘22, Jan ‘23, Feb ‘23), a large increase from the ~250k downloads per month in February ‘22.

Like so many others, we are excited about the project, too. Recently, we attended the DuckDB conference and spoke with many members of the community to learn why people are excited about it. We examined issues on GitHub, interviewed some of the top contributors to the project, and even experimented with DuckDB ourselves to determine how we could contribute.

We aimed to identify the most popular reasons why people try out DuckDB with our research. We found five perspectives that people commonly have when trying out DuckDB.

Marcin watching a MotherDuck presentation

dltHub co-founder Marcin watching a MotherDuck presentation at DuckCon in Brussels in February

1) "Normie" users love to drop-In DuckDB in all sorts of places

Last December, folks from the data + ML community organized a great conference, Normconf. Presenters and attendees were invited to share stories about everyday data problems and celebrate solutions. Many of the attendees referred to themselves as "normies" during the conference. We have found many of these folks overlap with those trying out DuckDB.

Normies have a specific way of solving problems that breaks some behavior patterns and expectations of previous generations of software engineers. As Ben Labaschin explains in his presentation Building an HTTPS Model API for Cheap: AWS, Docker, and the Normconf API, “normie software” has the following criteria:

  • It does not require domain knowledge to use. You should need almost no documentation because time is most important and all learning is investment. Before you learn, you need to know if it will pay off in the future.
  • The best tools can be reused in many contexts. You learn them once, and you can apply them everywhere.
  • Tools should work together. You should pick the tool that works with your other tools.

Many at Normconf agreed that DuckDB is also a “normie tool”. It has a Python wrapper and can be dropped into any Python script, notebook, or Streamlit app. It has helpers and integrates with other libraries that are part of typical workflows (e.g. Pandas, Parquet, Arrow, etc). It is a powerful analytical database and brings local SQL execution without credentials and other hassles. You can scan data from and export data to Parquet, CSV, or JSON and query an S3 bucket directly.


2) Local Data Workflows Are Going Mainstream, and DuckDB Is at the Center

Many people struggle to access cloud data warehouses within their organizations. Some of the problems that these users encounter include:

  1. Painful processes to obtain credentials and permissions
  2. A setup that is perceived as "difficult"
  3. Deployment of local composer files
  4. Working on remote machines is often much less pleasant

Instead they often use DuckDB to load data locally. DuckDB enables people to start using data by:

  1. Allowing them to learn SQL and try examples without any setup
  2. Querying GCP or S3 buckets from a local machine
  3. Creating notebooks or data apps with embedded DuckDB that showcase their work

Prototyping and experimenting with production data locally on DuckDB is a popular practice. From what we learned, deployment of DuckDB to production is still quite rare. Companies seldom use local workflows as this depends on someone having their laptop turned on to function. However, many non-engineering personnel use DuckDB to access production data.

3) The community is exploring various ways to use DuckDB's columnar query engine (e.g. analytics)

As many people in the community are exploring how DuckDB could be used, the fact that the DuckDB engine provides a way to quickly query the columnar format seems to be central to multiple use cases.

In data processing and usage, there are two types of transformations:

  • Non-time-critical, usually nightly, "transformation" jobs. These are run programmatically, the output is saved somewhere, and a business person consumes this output on demand a few hours or days later
  • Time-critical "dashboard user access" jobs. A report is created as output from the former job. Now, a user wants to gain insights from it "on demand", so they aggregate this table and wait. This computation is now time-critical, because the user is actively waiting for it

Row-based engines like Postgres are great at the first type of job, which usually involves many joins and row-based operations. However, they are not fast at aggregating data sets, as this requires them to iterate over rows by index (access the row, find its value, sum it).

Column-based engines, such as DuckDB, the Postgres AlloyDB engine, MySQL Percona, Redshift, etc., excel at aggregation jobs, which, for example, count or sum elements in a column.

Here, we have found evidence of two use cases where DuckDB is particularly suited:

  1. Aggregation of event data (e.g. product analytics). A dump of events could be easily scanned and aggregated into "users", "sessions", counted, etc. By using a database architecture like an "activity schema" and a way to tag events, this would make a great "product analytics" embedded warehouse. The MotherDuck team explains some reasons why to use DuckDB for analytics here.
  2. Aggregation of "one big table" architectures that use "one big table" instead of a dimensional model. This is a common design pattern for organizations that do not tackle data architecture, such as small analyst teams.

The amount of data processed for analytics workloads is often smaller than people think, with dashboards typically built from aggregated data. The co-founder of MotherDuck suggests that a past analysis he did of BigQuery queries found that 90% of queries in organisations processed less than 100 MB of data (while the co-founder of Ponder kind of disagrees). Many people load CSV, Parquet, and JSON files with sizes ranging from 50 to 200M rows into DuckDB. This includes not only one-off data loading but also ongoing demo projects. Common advantages of DuckDB we frequently heard about are speed, costs, and the usability advantages mentioned above.

4) With DuckDB, users bring the database engine to their data instead of the other way around

For most of us this behavioural pattern should ring true:

  1. “I have my data where I see fit. For example, in an AWS S3 bucket or on my laptop. My workflows deal with this well. And the workflow is fairly simple and cheap.”
  2. “Now I want to process my data.”
  3. “So I move my data to where a database engine is. I load it somewhere. I load it to Snowflake, BigQuery, Redshift. This is more complicated and costs significantly more.”

We encountered a different related pattern with DuckDB users. DuckDB users often bring the engine to their data:

  1. People drop it into AWS Lambda or Google Cloud Function to process data close to the engine. The few DuckDB production deployments that we have seen were all AWS Lambda centric. You can read a general example here
  2. The product that makes this pattern super easy for AWS is Boiling Data. The product also scales fairly well

The community frequently experiments with making such a setup work. In a recent post from frequent DuckDB issue contributor Mimoune Djouallah, he describes how he used Azure Storage, DuckDB, and an Azure ML Notebook to build a cost-effective data engineering pipeline for smaller data workloads. The pipeline involves reading data from Azure Storage, running complex queries, and saving the results in a bucket. The resulting bucket can be consumed in Synapse Serverless/PowerBI/Notebook, etc. We wonder, just like Mimoune and others do, if vendors will be building more support for smaller data workloads going forward.

It is probably worth mentioning at this stage that DuckDB and even PostgreSQL are not databases in the traditional sense. Rather, they are relational "database" management systems (RDBMS) that manage structured collections of data. While both can be used for querying internally stored data and reading external data like files or federated databases, DuckDB focuses primarily on the latter.

This means that it functions more as an RDBMS without an attached database. Therefore, it is inaccurate to refer to DuckDB as a database. Instead, we should consider the parquet file or data lake as the actual database, with DuckDB serving as a tool for managing and querying the data.

5) Some people are dropping DuckDB into their BI tooling

A few people we met have chosen to have always-up file storage and on-demand DuckDB embedded in a BI tool. A BI tool that supports this use case is Metabase. It’s an interesting concept that might one day lead to a stable open source BI tool SQL cache.

This is different from the past. Classic business intelligence tools use the OLAP concept: the business user creates a pivot table, and the tool composes a SQL query automatically and issues it to the database. This slows things down and adds additional cost. Think about Data Studio doing queries on BigQuery just because you change the granularity.

If part of the data is cached (in-memory or as a temporary Parquet file) and the BI tool has an embedded query engine, it can do the following tricks, for free and in no time:

  • change the granularity of a time dimension between different units (hour, day, week, calendar week, US calendar week, etc.)
  • drill downs, drill throughs, and filters
  • leveraging joins through predefined paths or UI query builders


DHelp

Ask a question

Welcome to "Codex Central", your next-gen help center, driven by OpenAI's GPT-4 model. It's more than just a forum or a FAQ hub – it's a dynamic knowledge base where coders can find AI-assisted solutions to their pressing problems. With GPT-4's powerful comprehension and predictive abilities, Codex Central provides instantaneous issue resolution, insightful debugging, and personalized guidance. Get your code running smoothly with the unparalleled support at Codex Central - coding help reimagined with AI prowess.