Guide to Data Engineering the Open Source Way.

Brian Ndung'u Njenga
5 min read · Oct 31, 2022


Data engineers, and the process of data engineering itself, are the unsung heroes of the modern business environment. So what do they actually do?

According to DataCamp, a data engineer is someone who develops, constructs, tests and maintains architectures such as databases and large-scale processing systems. In short, the data engineer plays a major role in the Extract, Transform and Load (ETL) process, and there are various tools for the job. Two of the most widely used are Meltano and Airbyte. In this write-up we are going to give Meltano a try and see how useful it is as a DataOps tool. To set it up, you can follow the instructions here.

To follow along, you need to set up Docker, Python, Postgres, and WSL if you are running on Windows, and have access to a Redshift cluster (not so open source, sorry).

Extract(E)

The E in ETL stands for extracting data from a source system into a staging repository.

To set up our own ETL, we’ll use this dataset and download all yellow taxi trips for the year 2022. Once that is done, we’ll move the data into a Postgres database.

We’ll also download the taxi_zones lookup data so that we can map each trip to a pickup and drop-off location.

Taxi trip data
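If you want to script the download, something like the following should work. The URLs below are the ones the TLC trip record page linked to at the time of writing, so double-check them against the site before relying on them:

curl -o ./data/yellow_tripdata_2022-01.parquet https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet

curl -o ./data/taxi_zone_lookup.csv https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv

Repeat the first command for each month (2022-01 through 2022-12) so that the whole year lands in ./data/.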

We’ll use the script below to move the data from Parquet into Postgres (the COPY-based load is so fast it’s funny).

import pandas as pd
import psycopg2  # driver SQLAlchemy uses under the hood
from sqlalchemy import create_engine
import io

# Read every Parquet file in ./data/ into a single dataframe
df = pd.read_parquet('./data/')

# Replace host, db_port, db_name, db_user, db_password and db_schema with your own values
engine = create_engine('postgresql+psycopg2://db_user:db_password@host:db_port/db_name',
                       connect_args={'options': '-c search_path={}'.format('db_schema')})

# Drop any old table and create a new empty one with the dataframe's columns
df.head(0).to_sql('yellow_taxi_2022', engine, if_exists='replace', index=False)

# Stream the rows in via COPY, which is far faster than row-by-row inserts
conn = engine.raw_connection()
cur = conn.cursor()
output = io.StringIO()
df.to_csv(output, sep='\t', header=False, index=False)
output.seek(0)
cur.copy_from(output, 'yellow_taxi_2022', null="")  # null values become ''
conn.commit()
Successfully imported all 19 million rows to Postgres

If you’ve already set up Meltano, you can now go ahead and set up our extractor (tap-postgres). We are moving the data from our local Postgres installation to an AWS Redshift cluster.

To set up the extractor, run the following command (shown in its Windows/PowerShell form) and configure the extractor as described in the Meltano docs.

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 add extractor tap-postgres

To show all settings that are needed, run:

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config tap-postgres list
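Once the required settings are listed, you can set them from the same CLI with meltano config. A minimal sketch, assuming the default tap-postgres variant whose core connection settings are host, port, user, password and dbname, and assuming Postgres is running on the Docker host (all values below are placeholders):

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config tap-postgres set host host.docker.internal
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config tap-postgres set port 5432
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config tap-postgres set user db_user
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config tap-postgres set password db_password
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config tap-postgres set dbname db_name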

To test the extractor, run:

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 select tap-postgres --list --all
Get all entities in our tap.
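By default the tap selects everything it can see; if you only want the table we just loaded, you can narrow the selection. A hedged example, assuming tap-postgres names its streams as schema-table, so our table shows up as public-yellow_taxi_2022:

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 select tap-postgres public-yellow_taxi_2022 "*"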

Load(L)

The L in ETL stands for loading our data into a warehouse.

Next, we need to set up our loader (target-redshift):

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 add loader target-redshift

To show all settings that are required to run our loader:

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config target-redshift list
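The loader is configured the same way. A sketch, assuming the default (pipelinewise-style) target-redshift variant, which stages data in S3 before COPYing it into the cluster, so it needs a bucket and AWS credentials on top of the usual connection details (all values are placeholders):

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config target-redshift set host <cluster-endpoint>.redshift.amazonaws.com
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config target-redshift set dbname db_name
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config target-redshift set user db_user
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config target-redshift set password db_password
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config target-redshift set s3_bucket my-staging-bucket

Set aws_access_key_id and aws_secret_access_key the same way. Exact setting names vary by variant, so trust the output of the list command above over this sketch.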

Transform(T)

Finally, the T in ETL stands for transforming our data into a shape that’s ready for analytics (here the transformation happens after the load, dbt-style, so strictly speaking we’re doing ELT).

We are going to use dbt to transform our data. To install it in your Meltano project, run:

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 add transformer dbt-redshift

To view the required settings:

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config dbt-redshift list
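The transformer’s settings largely mirror the warehouse connection (they end up in dbt’s profile). A heavily hedged sketch, assuming the plugin exposes host, user, dbname and schema settings; again, trust the list output above over this:

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config dbt-redshift set host <cluster-endpoint>.redshift.amazonaws.com
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config dbt-redshift set user db_user
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config dbt-redshift set dbname db_name
docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 config dbt-redshift set schema analytics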

After setting it up, we need to join two tables (the yellow taxi trips and the zone lookup data) to get the pickup and drop-off zones for each trip.

Setting up dbt can be challenging, but the documentation and a bit of googling will make things easier.

dbt transformation script.
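The script itself isn’t reproduced here, but at its core it is a join of the trips table against the zone lookup on the pickup and drop-off location IDs. A rough sketch of what such a dbt model could look like, assuming the standard TLC column names (PULocationID, DOLocationID, LocationID, Zone) landed in the warehouse in lowercase and that the lookup table arrived as taxi_zones in the public schema:

-- models/trips_with_zones.sql (hypothetical model name)
select
    t.tpep_pickup_datetime,
    t.tpep_dropoff_datetime,
    t.trip_distance,
    t.total_amount,
    pu.zone as pickup_zone,
    dz.zone as dropoff_zone
from {{ source('public', 'yellow_taxi_2022') }} t
left join {{ source('public', 'taxi_zones') }} pu on t.pulocationid = pu.locationid
left join {{ source('public', 'taxi_zones') }} dz on t.dolocationid = dz.locationid

You would also need a sources.yml declaring the two tables under whatever schema target-redshift loaded them into.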

We can now run our ETL job through Meltano’s UI utility; to start it, run:

docker run -p 80:5000 -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 ui start
Meltano Pipeline Setup through the UI.
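If you’d rather stay on the command line, recent Meltano versions can run the same pipeline end to end with meltano run (the dbt-redshift:run part assumes the transformer exposes its default run command):

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 run tap-postgres target-redshift dbt-redshift:run

The older meltano elt tap-postgres target-redshift invocation should also still work on the version used here.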

Since we are moving 19M records from our local machine to Redshift, this will take a while.

Successfully moved our data to redshift(from our local Postgres DB).
Glimpse of our transformed data.

A production ETL process isn’t complete without an orchestrator (job scheduler), but for this one-off demo we don’t need one.
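For the record, if you did want one, Meltano has a schedule abstraction that an orchestrator plugin such as Airflow can pick up. On Meltano 2.x the command looks roughly like this (hedged; check meltano schedule --help for your version):

docker run -v ${pwd}:/projects -w /projects meltano/meltano:latest-python3.9 schedule add daily-taxi-load --extractor tap-postgres --loader target-redshift --interval @daily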

Once we have successfully moved the data to Redshift, it’s ready for analysis. So, can we analyze our data and build a dashboard using an open-source business intelligence tool? Of course.

BI the Open Source Way.

After the ETL process is over, we need to query the data and draw insights from it (data analysis) so we can present them to other stakeholders.

The tool we’ll use for this is Apache Superset; you can go through the documentation here to set it up for your environment.

After setup, we are going to feed in the required settings to connect to our data warehouse.
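Superset talks to Redshift through SQLAlchemy, so the main things to provide are the driver and the connection URI. A sketch with placeholder values, using the sqlalchemy-redshift dialect that the Superset docs point to:

pip install sqlalchemy-redshift psycopg2-binary   # inside the Superset environment
redshift+psycopg2://db_user:db_password@<cluster-endpoint>:5439/db_name   # SQLAlchemy URI for the database connection form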

Parameters required to connect to Redshift
Creating a dataset.
Superset Dashboard

And there you go: an easy end-to-end DataOps setup powered by powerful open-source tools. There are many tools in this space, and most, if not all, get the job done.

For any queries you can reach me at ndungunjenga96@gmail.com.
