Lateral Joins in DuckDB
A useful tool for concise code
Background
We will build on the work done in my previous article. For a quick recap: there is a set of files released by IMDB, known as the IMDB non-commercial dataset. It is a dump of a database that describes a subset of the information related to released titles and the people involved in them. In that article I used dbt to create models of this information and to write data tests that check the dataset for integrity and errors. I discovered that in some cases, the crew-related information for example, the dataset had problems, nothing unexpected for a data engineer: people were mentioned for whom no details were available in the corresponding table. In relational parlance, this was a referential integrity issue, so the closed-world assumption of the dataset is violated. The tests uncovered the issue and dbt shined there. The reasonable approach was to keep only the part of the dataset for which referential integrity was enforceable. However, my approach for isolating this subset (the SQL query) was flawed. This is where our story begins.
The problem and its flawed solution
Normally we would use DuckDB, and especially its CLI, to investigate our dataset. Here, however, we will use DBeaver with version 0.10.0 of the JDBC driver that comes pre-installed with the community edition. After we run our dbt pipeline (see the previous article), we head over to DBeaver, create a DuckDB connection and open the database file in the database_files folder. We are looking at the dev.main schema and especially at the name_basics_model view. Let’s see how it is defined:
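Roughly, the view looks like this (a sketch with an abridged column list; the source relation name is hypothetical and the exact model lives in the repository):

```sql
CREATE VIEW name_basics_model AS
SELECT nconst,           -- person identifier, e.g. nm0000001
       primaryName,
       birthYear,
       deathYear,
       knownForTitles    -- a VARCHAR[] of title identifiers (tconsts)
FROM name_basics_raw;    -- hypothetical name for the imported IMDB file
```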
Nothing fancy: an nconst identifier (a primary key, used as a foreign key by the other tables), useful for referring to a specific person; the rest are names and some years. What catches the eye is the knownForTitles array. It is supposed to be a list of identifiers into the other “big” table, namely title_basics_model.sql, whose primary key is tconst. The expectation is that knownForTitles is a possibly empty array containing tconsts. We created a test for the casted model:
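A minimal sketch of such a test, written as a singular dbt data test that returns the offending rows (the file name and model refs are illustrative and follow the previous article’s conventions):

```sql
-- tests/known_for_titles_exist.sql (illustrative file name)
-- Any row returned here is a knownForTitles entry pointing to a missing title
SELECT nconst, tconst
FROM (
    SELECT nconst, unnest(knownForTitles) AS tconst
    FROM {{ ref('name_basics_model_casted') }}
)
WHERE tconst NOT IN (SELECT tconst FROM {{ ref('title_basics_model') }})
```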
The test failed immediately. Why is that? Because some tconsts are non-existent: they appear inside knownForTitles but have no row in title_basics_model. The following snippet reveals them:
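This is essentially the test’s query run directly against the views in DBeaver (a sketch):

```sql
-- tconsts mentioned in knownForTitles that have no row in title_basics_model
SELECT nconst, tconst
FROM (
    SELECT nconst, unnest(knownForTitles) AS tconst
    FROM name_basics_model
)
WHERE tconst NOT IN (SELECT tconst FROM title_basics_model);
```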
If the above script detects problems, we can throw away the NOT and get the tuples that are “legal”. So the next naive solution should make the test happy:
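Something along these lines: keep only the legal (nconst, tconst) pairs and rebuild the lists per name (a sketch):

```sql
SELECT nconst, list(tconst) AS knownForTitles
FROM (
    SELECT nconst, unnest(knownForTitles) AS tconst
    FROM name_basics_model
)
WHERE tconst IN (SELECT tconst FROM title_basics_model)
GROUP BY nconst;
```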
Right? Unfortunately, this is wrong. In the version of the dataset I have (yours may differ), the original dataset of names has 13,245,721 names, while the query above yields only 11,696,555. So reconciliation is not possible. What happened?
A complicated, correct solution
What was the problem with the naive solution? It is very easy to spot by executing the next snippets:
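A couple of toy queries that expose the issue (illustrative values):

```sql
-- Un-nesting an empty list yields zero rows
SELECT unnest([]::VARCHAR[]);

-- Hence a name whose list is empty (or entirely illegal) vanishes after GROUP BY
WITH names(nconst, knownForTitles) AS (
    VALUES ('nm1', ['tt1']), ('nm2', []::VARCHAR[])
)
SELECT nconst, list(tconst) AS knownForTitles
FROM (SELECT nconst, unnest(knownForTitles) AS tconst FROM names)
GROUP BY nconst;
-- only nm1 comes back
```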
Un-nesting an empty list gives nothing. So names with an empty or all-illegal knownForTitles field vanish; they are invisible to the aggregation. What we need for these cases is to additionally have all nconsts present, each with an empty knownForTitles field, and to merge all the lists in the end. Let’s make a first attempt:
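A sketch of that first attempt: union the legal pairs with an empty list per name, then merge the lists with flatten (the actual query in the repository may differ):

```sql
SELECT nconst, flatten(list(knownForTitles)) AS knownForTitles
FROM (
    -- the legal lists, as in the naive solution
    SELECT nconst, list(tconst) AS knownForTitles
    FROM (SELECT nconst, unnest(knownForTitles) AS tconst FROM name_basics_model)
    WHERE tconst IN (SELECT tconst FROM title_basics_model)
    GROUP BY nconst
    UNION ALL
    -- an empty list for every name, so nobody vanishes
    SELECT nconst, []::VARCHAR[] AS knownForTitles
    FROM name_basics_model
)
GROUP BY nconst;
```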
Yes, it is puzzling and highly inefficient. But to simplify things a bit and understand what it is all about, just run this simple snippet:
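The merging trick in isolation (illustrative values):

```sql
-- Merging a real list with an empty one leaves the real list intact...
SELECT flatten(list(l)) AS merged
FROM (VALUES (['tt1', 'tt2']), ([]::VARCHAR[])) AS t(l);
-- merged = [tt1, tt2]

-- ...while an empty list alone stays an empty list instead of disappearing
SELECT flatten(list(l)) AS merged
FROM (VALUES ([]::VARCHAR[])) AS t(l);
-- merged = []
```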

Uncomplicating the correct solution with lateral joins
As we said, the solution is highly inefficient for something that is a simple map operation in functional parlance. Postgres has a construct for this, called a lateral join. Fortunately, DuckDB offers something similar that is very flexible and fits our use case perfectly. Let’s see the solution first:
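A sketch that captures the shape of the solution (the model in the repository also cleanses birthYear and deathYear, omitted here):

```sql
SELECT nb.nconst,
       nb.primaryName,
       nb.birthYear,
       nb.deathYear,
       t2.knownForTitles
FROM name_basics_model AS nb,
LATERAL (
    -- per row of nb: keep the legal tconsts, fall back to an empty list
    SELECT coalesce(list(tconst), []) AS knownForTitles
    FROM (SELECT unnest(nb.knownForTitles) AS tconst)
    WHERE tconst IN (SELECT tconst FROM title_basics_model)
) AS t2(knownForTitles);
```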
It does not deviate at all from the principle of our initial complicated solution (save for the extra birthYear and deathYear cleansing). We still merge an empty list with a possibly non-existing list of acceptable title identifiers. However, we use the “map” nature of the lateral join: we just need to name the join (t2 here) and give it a schema (it has a knownForTitles column); the name corresponds to a definition. The solution is really concise and performant. To convince yourself that it works, use this snippet to play with the main idea:
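A self-contained toy (illustrative values) showing that the lateral join maps each row’s list through a filter without losing rows:

```sql
SELECT t1.l AS original, t2.filtered
FROM (VALUES ([1, 2, 3]), ([]::INTEGER[])) AS t1(l),
LATERAL (
    SELECT coalesce(list(x), []) AS filtered
    FROM (SELECT unnest(t1.l) AS x)
    WHERE x > 1
) AS t2(filtered);
-- [1, 2, 3] maps to [2, 3]; [] maps to [] and the row survives
```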

The good news is that we can go beyond a single join. The title_crew_model has writers and directors; both are lists that map to the name basics table.
It is pretty obvious that we need double cleansing here (two mapping functions). Lateral joins can do this:
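A sketch with two lateral joins, one per list column (model names as mentioned in the article; the exact SQL lives in the repository):

```sql
SELECT tc.tconst,
       t2.directors,
       t3.writers
FROM title_crew_model_casted AS tc,
LATERAL (
    SELECT list_sort(coalesce(list(n), [])) AS directors
    FROM (SELECT unnest(tc.directors) AS n)
    WHERE n IN (SELECT nconst FROM name_basics_model_cleansed)
) AS t2(directors),
LATERAL (
    SELECT list_sort(coalesce(list(n), [])) AS writers
    FROM (SELECT unnest(tc.writers) AS n)
    WHERE n IN (SELECT nconst FROM name_basics_model_cleansed)
) AS t3(writers)
-- the "last line": cleansed data must reference cleansed data
WHERE tc.tconst IN (SELECT tconst FROM title_basics_model_cleansed);
```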
Obviously we need materializations to Parquet/CSV files in order to get acceptable performance; views are extremely slow. The last line is part of the original cleansing approach: we need it to ensure that cleansed data references cleansed data. But what are those list_sort calls doing there?
Unit testing our solutions
dbt Core 1.8+ added the ability to write unit tests for the complicated logic one has to write for cleansing. This is where the list_sort above comes in; it is necessary for writing testable code. The unit test does not need to cover the cleansing of tconst, since that is handled by a data test, but we added it for pedagogical purposes. Here is the test, and then we are going to explain it:
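A sketch of the unit test with illustrative identifiers (the real test in the repository may differ in names and in extra columns):

```yaml
unit_tests:
  - name: test_title_crew_model_cleansed
    model: title_crew_model_cleansed
    given:
      - input: ref('title_crew_model_casted')
        rows:
          - {tconst: tt1, directors: [nm2, nm1, nm9], writers: [nm1]}
          - {tconst: tt9, directors: [nm1], writers: []}
      - input: ref('name_basics_model_cleansed')
        rows:
          - {nconst: nm1}
          - {nconst: nm2}
      - input: ref('title_basics_model_cleansed')
        rows:
          - {tconst: tt1}
    expect:
      rows:
        # nm9 is filtered out, the lists come back sorted, and tt9 disappears
        - {tconst: tt1, directors: [nm1, nm2], writers: [nm1]}
```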
In this test we mock the inputs to the model definition. The mocks concern title_crew_model_casted, name_basics_model_cleansed and title_basics_model_cleansed, which are used for the cleansing. Because of the structure, we do not need to mock every field of the original tables. So we create entries whose writers and directors lists contain names that do and do not belong to the names table. What we test is that the names that do not belong disappear, and that entries with a tconst not present in the cleansed titles table do not appear; this is the “expect:” part. Due to the processing, the cleansed list may not preserve the original ordering. It clearly depends on the way records are processed, with no guarantees; for our tests to be deterministic, we need to sort the entries. The structure of the test may not come as a surprise: it closely resembles how Spock tests are written. It is also way better than the pgTap approach to writing tests; no SQL is involved at all, and this is not a small thing. How do we run this test? Unsurprisingly, running the model tests includes this one too. Read the excellent documentation here in order to learn more. For the cleansed model that triggered our discussion, here is the test (birthYear and deathYear are not included since they are data tested):
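A sketch along the same lines, assuming the cleansed model reads from the casted names model and the cleansed titles model:

```yaml
unit_tests:
  - name: test_name_basics_model_cleansed
    model: name_basics_model_cleansed
    given:
      - input: ref('name_basics_model_casted')
        rows:
          - {nconst: nm1, knownForTitles: [tt9, tt1]}  # tt9 is illegal
          - {nconst: nm2, knownForTitles: []}
      - input: ref('title_basics_model_cleansed')
        rows:
          - {tconst: tt1}
    expect:
      rows:
        - {nconst: nm1, knownForTitles: [tt1]}
        - {nconst: nm2, knownForTitles: []}   # empty lists survive
```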
Conclusion
We went through one of these interesting code shortcuts offered by DuckDB in order to write clean code that cleans data. The LATERAL keyword is optional according to the documentation; we keep it there for documentation purposes. As always, the code is in the repository. As Dr. Werner Krauth says, feel free to download, run and play with the code. In case you find a mistake or an omission, do not hesitate to mention it in the comments section. Cheers!
UPDATE: Code snippets brought to their latest versions and unit tests added.