Ingesting the OMDB Dataset
A full DBT project and a comprehensive Dockerized Pipeline
Introduction
In a previous cluster of articles we investigated the IMDB dataset. That dataset was an excellent substrate for tutorials: it let us see the Medallion architecture in its full power, the versatility of the DBT tool and the superpowers of DuckDB. It is a real and huge dataset, and for this reason it required at least 32GB of RAM. This posed a lot of problems in creating a dockerized pipeline for processing it. I also eventually hit a performance wall that prevented me from loading the dataset into AGE or Neo4J. Because of these issues, I had to search thoroughly for a smaller but still real dataset. It was not long until I bumped into this wonderful article, which mentions the OMDB dataset. The authors did something similar, but they did not provide a pipeline or investigate the properties of the dataset. Our purpose in this article is to tackle both issues. This is where our adventure starts.
The manual approach
What the user needs to do
Since the dataset is small, I went ahead and dockerized everything, using the PostgreSQL database, specifically the official image. The plan for the manual part is to ingest the data through a Jupyter notebook into a staging folder (raw_data) and then load it into the PostgreSQL database. The code is straightforward. The data resides at this URL. The first part of the notebook uses the requests library to fetch the data in small chunks and then assemble them into the final files.
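In outline, the download step looks like the following sketch. The base URL and file list below are placeholders (the real values live in the notebook); only the chunked-streaming idea is the point here.

```python
import os
import requests

# Placeholders -- the real base URL and the ~30 file names live in the notebook.
BASE_URL = "https://example.org/omdb/data"
FILES = ["all_movies.csv.bz2", "all_series.csv.bz2"]  # illustrative subset
RAW_DIR = "raw_data"                                   # the staging folder

os.makedirs(RAW_DIR, exist_ok=True)

for name in FILES:
    target = os.path.join(RAW_DIR, name)
    # Stream the response so each file is assembled from small chunks
    # instead of being held in memory in one piece.
    with requests.get(f"{BASE_URL}/{name}", stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(target, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                fh.write(chunk)
```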
Obviously you can use the staged data for other purposes too. We end up with 30 files.
These 30 files are uncompressed, loaded into pandas data frames for some massaging, and then saved directly to the database in suitable tables. If a table already exists, it is dropped and recreated.
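A minimal sketch of that loading step, assuming a local PostgreSQL from the docker setup and hypothetical connection string, file and table names:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical DSN -- adjust to the docker-compose / container settings.
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/omdb")

def load_file(path: str, table: str) -> None:
    # The dumps are compressed; pandas can decompress them transparently.
    df = pd.read_csv(path, compression="infer")
    # Light massaging could go here (renaming columns, trimming whitespace, ...).
    # if_exists="replace" drops and recreates the table, as described above.
    df.to_sql(table, engine, if_exists="replace", index=False)

load_file("raw_data/all_movies.csv.bz2", "all_movies")
```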
Nothing fancy yet.
Description of the dataset
This dataset has some unique pathologies, fortunately far fewer than the IMDB dataset. For this reason we provide a DBT project that the user must run manually against the database to create the Bronze, Intermediate and Silver layers (and test the last one), so as to arrive at a curated (cleansed) set of tables.
By inspection of the PostgreSQL database we need two seeds that act as enumerations: region and language. This is not surprising, since the OMDB dataset is similar to the IMDB dataset. Another similarity has to do with NULLs: these are again denoted by ‘\N’. There is an extra detail concerning the language field. The value ‘xx’ is also mapped to NULL, because in ISO parlance it means ‘Not assigned’. These NULL mappings happen in the Intermediate Layer. Another notable detail is the lack of array fields. This way we can test discrete values directly (regions and languages are still tested through referential integrity), as in the case of silver_cleansed_movie_references.
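The real logic lives in the dbt models and their tests; purely to illustrate the two conventions just mentioned (the ‘\N’/‘xx’ NULL mapping and checking values against a seed), here is a small pandas sketch with hypothetical column and seed names:

```python
import numpy as np
import pandas as pd

# Illustration only: hypothetical frame and seed standing in for the dbt objects.
movies = pd.DataFrame({"language": ["en", r"\N", "xx", "de"]})
language_seed = pd.DataFrame({"code": ["en", "de", "fr"]})

# '\N' is the dataset's NULL marker and 'xx' means 'Not assigned' in ISO parlance,
# so both are normalised to real NULLs in the Intermediate layer.
movies["language"] = movies["language"].replace({r"\N": np.nan, "xx": np.nan})

# Referential-integrity style check: every non-null language must exist in the seed.
unknown = movies.loc[
    movies["language"].notna() & ~movies["language"].isin(language_seed["code"])
]
print(unknown)  # a non-empty result would make the corresponding dbt test fail
```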
From the IMDB dataset analysis we can quickly come up with a stack for analyzing its structure. We need a Bronze layer, which is essentially a table view over the files in the landing layer (the repository/folder of downloaded files), to bring the data into our database, and a casting layer where we give types to our data (bronze and silver_casted in the source repository). In type theory parlance we then need to refine the types, since they do not tell the whole truth. For this reason we need the silver_cleansed layer, which carries the data tests, the cleansing and, in our case, the imputations. Here is the full schematic (the Gold Layer deserves a separate article):

Pathologies and Data testing
There are a number of pathologies in this dataset. It was necessary to dig a lot through the DBT logs to find the failing tests and their reasons. First of all, series, movie series, episodes, seasons and movies (collectively called content) all have a parent. The parent is nullable and acts as a container. Most of the pathologies come from this and do not exist in the IMDB dataset.
Pathology type 1
The first category of pathologies has to do with content that is labelled erroneously; it is mixed up. This means some series are movies, some movie series are episodes, and so on. After a lot of data exploration we make the following observations about how content and its parent are related, which can help us clean the content.
(thanks ozh). In this respect, the solution is to detect the misplaced content and move it to where it belongs. Here is an example:
Look at what is happening here. We keep movies whose parent is null or a movie series, but not those whose parent is a season (those will move to the episodes). We also bring into movies the episodes whose parent is a movie series.
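The actual fix is a SQL model in the silver layer; purely to illustrate the rule, here is a pandas sketch with hypothetical table and column names (a content row with its declared kind and nullable parent id):

```python
import pandas as pd

# Hypothetical shapes; the repository expresses the same rule in dbt SQL.
content = pd.DataFrame(
    {
        "id":        [1, 2, 3, 4, 5],
        "kind":      ["movie", "movie", "movie", "episode", "movieseries"],
        "parent_id": [None, 5, 6, 5, None],
    }
)
movieseries_ids = {5}   # ids known to be movie series
season_ids = {6}        # ids known to be seasons

parent = content["parent_id"]

# Keep movies whose parent is NULL or a movie series; a movie whose parent is a
# season is really a mislabelled episode and is moved out of this relation.
kept_movies = content[
    (content["kind"] == "movie")
    & (parent.isna() | parent.isin(movieseries_ids))
    & ~parent.isin(season_ids)
]

# Episodes whose parent is a movie series are really movies, so bring them in.
promoted = content[(content["kind"] == "episode") & parent.isin(movieseries_ids)]

cleansed_movies = pd.concat([kept_movies, promoted], ignore_index=True)
```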
Pathology type 2
The second category has to do with seasons that reference themselves. This is easy to fix: we detect seasons with self references and replace the self reference with the series they belong to. The next SQL code mirrors our approach exactly.
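As a rough illustration of that rule only (the real code is dbt SQL, and the column names below, including a separate series_id on the season row, are assumptions):

```python
import pandas as pd

# Hypothetical columns: a season row carries both its parent_id and the series it belongs to.
seasons = pd.DataFrame(
    {
        "id":        [10, 11],
        "parent_id": [10, 7],   # season 10 points at itself
        "series_id": [7, 7],
    }
)

# Where a season is its own parent, fall back to the series it belongs to.
self_ref = seasons["parent_id"] == seasons["id"]
seasons.loc[self_ref, "parent_id"] = seasons.loc[self_ref, "series_id"]
```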
Pathology type 3
The third category has to do with episodes having a parent that is either a series or a season, but also a provided series id. Our task is to create a synthetic field that holds the series inferred from the parent, and then to ensure that the provided series matches the inferred one. There is a complication: the parent can be either a series or a season. Here are the inference rules:
- If the parent is null, the inferred series is null
- If the parent is a series, the inferred series is that series
- If the parent is a season, then the inferred series is the parent of that season
Keep in mind that this is an imputation operation, and for this reason it happens over cleansed data. The next SQL code of the model reflects the above reasoning.
The first part of the WITH clause, as described above, just rearranges content to match the rules related to the parent. The imputation happens in the second part.
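The model itself is dbt SQL in the repository; purely to illustrate the inference rule above, here is a pandas sketch with hypothetical table and column names:

```python
import numpy as np
import pandas as pd

# Illustrative data only.
episodes = pd.DataFrame({"id": [1, 2, 3], "parent_id": [None, 100, 200]})
series = pd.DataFrame({"id": [100]})                       # known series ids
seasons = pd.DataFrame({"id": [200], "parent_id": [100]})  # season -> its series

season_to_series = dict(zip(seasons["id"], seasons["parent_id"]))
series_ids = set(series["id"])

def infer_series(parent_id):
    if pd.isna(parent_id):
        return np.nan                        # no parent -> no inferred series
    if parent_id in series_ids:
        return parent_id                     # parent is a series -> that series
    return season_to_series.get(parent_id, np.nan)  # parent is a season -> its series

episodes["inferred_series_id"] = episodes["parent_id"].map(infer_series)
```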
The test for the equality of the provided and imputed series happens in this snippet (part of the data tests in silver_cleansed_all_episodes_model.yml):
Can you see why?
Pathology type 4
This pathology type has to do with the closed world assumption:
Our dataset is a closed world when no id in it that refers to another primary id is dangling, that is, missing from the dataset.
This is usually violated, and we have to enforce it in the silver layer through data cleansing. For this we need some helper tables for content (persons and jobs do not need any helper tables), because our dataset is not in relational form in the strict sense. We saw this with episodes, whose parent can be a season or a series. These helper tables are:
- with_cleansing_all_ids_of_original_movies_and_series (which collects all content ids)
- with_cleansing_all_ids_of_series_and_seasons (which collects all series and season ids)
- with_cleansing_all_ids_of_singleton_series (which collects all series that do not have a season)
All these tables are used in data tests. Here is a closed world enforcement:
Note that the ids are collected from the cleansed tables. The first helper is used for movie aliases, for example. The second is used for episodes. The last one is the most interesting: it tests that, for episodes that have a provided series, that series does not correspond to a season. Here is the full data testing of the episode model:
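The actual enforcement is a set of dbt data tests over the helper tables; as a rough illustration of what a closed world check amounts to (no dangling references), here is a pandas sketch with hypothetical names:

```python
import pandas as pd

# Hypothetical frames standing in for a cleansed table and the helper table
# that collects all content ids.
all_content_ids = pd.DataFrame({"id": [1, 2, 3]})
movie_aliases = pd.DataFrame({"movie_id": [1, 2, 99]})  # 99 is dangling

# Closed world: every movie_id an alias refers to must exist among the content ids.
dangling = movie_aliases[~movie_aliases["movie_id"].isin(all_content_ids["id"])]
print(dangling)  # a non-empty result means the closed world assumption is violated
```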
Apart from that, no other cleansing was necessary. You can find this work here. A lot of the clumsiness contained in the individual silver tables and tests will be resolved in the Gold Layer (To Be Published Soon).
The automatic approach
What the user needs to do
For this one we import the manual approach into the automated (scheduled) project through git submodules; see the README. What does this project do? It follows the manual approach verbatim, using Airflow as an orchestrator and Astronomer as a deployment engine through Docker (please fire it up). In short, it:
- Downloads the OMDB files through Python code into a folder stamped with the current date (the Airflow start date)
- Inserts the dataset into the database and versions it
- Runs the DBT project through cosmos.
The orchestration happens through data-aware scheduling: the Step 1 DAG triggers the Step 2 DAG, and the Step 2 DAG triggers the Step 3 DAG, through suitably defined datasets (a minimal sketch follows the list):
- “s3://dataset-bucket/example1.csv” (DAG1 -> DAG2)
- “s3://dataset-bucket/example2.csv” (DAG2 -> DAG3)
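The real DAG definitions live in the repository; the following is only a minimal sketch of how this dataset chaining looks in Airflow. DAG and task names other than omdb_dataset_fetcher are illustrative, and the schedules are assumptions.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# The two datasets that chain the three DAGs together.
files_fetched = Dataset("s3://dataset-bucket/example1.csv")   # DAG1 -> DAG2
rows_loaded = Dataset("s3://dataset-bucket/example2.csv")     # DAG2 -> DAG3

@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def omdb_dataset_fetcher():
    # Producing task: marking this outlet as updated triggers the loader DAG.
    @task(outlets=[files_fetched])
    def finish_this():
        pass
    finish_this()

@dag(start_date=datetime(2024, 1, 1), schedule=[files_fetched], catchup=False)
def omdb_database_loader():
    # Runs whenever files_fetched is updated; its last task updates rows_loaded,
    # which in turn triggers the DBT runner DAG (scheduled on [rows_loaded]).
    @task(outlets=[rows_loaded])
    def version_this():
        pass
    version_this()

omdb_dataset_fetcher()
omdb_database_loader()
```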
The user needs to start from a clean installation. Then, using the Astronomer tools, we can run the project.
In Podman under Windows you may need two extra commands
c:\tools\astro_1.30.0_windows_amd64.exe config set -g container.binary podman
set BUILDAH_FORMAT=docker
c:\tools\astro_1.30.0_windows_amd64.exe dev start
so as to start the setup.
The user is guided to a webpage where, with the credentials admin/admin, they can access the DAGs.

It is enough to run the first DAG, the one that gets the files into the filesystem. You can identify it because it has a date for its next run. The rest will be triggered automagically. Files go into time-stamped folders:
.\dbt\omdbfiles\{run date}
Feel free to run omdb_dataset_fetcher and inspect the other two DAGs.
Description of the code base
The dataset retriever
First we need to get the files from the URL. The DAG is pretty self-explanatory. We create a timestamp from the run date and propagate it to all the downstream tasks via XCom. The folder is created; if it already exists, it is deleted and recreated. Files are downloaded as in the notebook, and the dataset is marked as done at the very end (the finish_this task).
The files are downloaded one after another so as not to cause traffic issues for the OMDB site.
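A rough sketch of the folder handling and sequential download described above; paths, task names and the file list are illustrative, not the repository's exact code.

```python
import shutil
from pathlib import Path

from airflow.decorators import task
from airflow.operators.python import get_current_context

@task
def make_staging_folder() -> str:
    ds = get_current_context()["ds"]     # the run date, e.g. '2024-01-01'
    folder = Path("dbt") / "omdbfiles" / ds
    if folder.exists():
        shutil.rmtree(folder)            # recreate from scratch for a clean run
    folder.mkdir(parents=True)
    return str(folder)                   # travels to downstream tasks via XCom

@task
def download_all(folder: str) -> None:
    # Deliberately sequential: one file at a time, to avoid hammering the OMDB site.
    for name in ["all_movies.csv.bz2", "all_series.csv.bz2"]:  # illustrative subset
        pass  # same chunked requests download as in the notebook
```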
The database loader
Again we mimic exactly what happens in the notebook. There are two extra highlights here. First, we have a task (version_this) that stores a version of the dataset in the database, since we need to know which dataset version we are working with. Second, our database schema creator operator inherits from PythonVirtualenvOperator, because Airflow uses SQLAlchemy 1.x while we need SQLAlchemy 2.x. This is why we need the following line in our Dockerfile:
RUN python -m pip install apache-airflow[virtualenv]
The ingestion happens in parallel. Here is the DAG
The DAG depends on the first dataset, and the final task triggers the next dataset.
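The repository subclasses PythonVirtualenvOperator; as a minimal sketch of the underlying trick, here is the stock operator with a hypothetical DSN, schema and task id (in the real project this lives inside the loader DAG):

```python
from airflow.operators.python import PythonVirtualenvOperator

def create_schema():
    # Imports go inside the callable: it runs in its own virtualenv,
    # free to use SQLAlchemy 2.x even though Airflow itself pins 1.x.
    import sqlalchemy
    engine = sqlalchemy.create_engine(
        "postgresql+psycopg2://postgres:postgres@postgres:5432/omdb"  # hypothetical DSN
    )
    with engine.begin() as conn:
        conn.execute(sqlalchemy.text("CREATE SCHEMA IF NOT EXISTS landing"))

create_schema_task = PythonVirtualenvOperator(
    task_id="create_schema",
    python_callable=create_schema,
    requirements=["sqlalchemy>=2.0", "psycopg2-binary"],
    system_site_packages=False,
)
```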
The DBT runner
For running the DBT project we selected the cosmos plugin. For security reasons (as it should be in a production system) we configure the project's PostgreSQL data source through Airflow connections. One could also use the dbt project's profiles.yml (not recommended). The magic happens here:
The DAG depends on the second and final dataset, and is configured the way this plugin's examples suggest. Here is the DAG:
Because of random crashes we decided to run all the tests in the very end.
test_behavior=TestBehavior.AFTER_ALL
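To make the pieces concrete, here is a sketch of a cosmos setup along these lines. The paths, connection id, DAG id and profile names are illustrative, not the repository's exact values; only the overall shape (profile mapping from an Airflow connection, dataset-based schedule, tests after all models) mirrors the description above.

```python
from datetime import datetime

from airflow.datasets import Dataset
from cosmos import DbtDag, ExecutionConfig, ProfileConfig, ProjectConfig, RenderConfig
from cosmos.constants import TestBehavior
from cosmos.profiles import PostgresUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="omdb",
    target_name="dev",
    # Credentials come from an Airflow connection instead of profiles.yml.
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="postgres_omdb",
        profile_args={"schema": "public"},
    ),
)

omdb_dbt_dag = DbtDag(
    dag_id="omdb_dbt_runner",
    project_config=ProjectConfig("/usr/local/airflow/dbt/omdb"),  # hypothetical path
    profile_config=profile_config,
    execution_config=ExecutionConfig(),
    # Run every dbt test once, after all models have built, instead of per model.
    render_config=RenderConfig(test_behavior=TestBehavior.AFTER_ALL),
    schedule=[Dataset("s3://dataset-bucket/example2.csv")],
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```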
The graph is the real DBT graph with all its complexities (including Gold Layer, To Be Published Soon).

Conclusion
The above pipeline and DBT project deal with a very real dataset. The dataset is not huge, so anyone can experience a full ingestion pipeline on a laptop with 16GB of RAM. A great number of details are included, and given the space constraints of this article I would rather not expand it threefold. Instead, I would like to let you download, execute and study the code base. The code, as always, is provided here and here. I hope you enjoy it. Cheers!