Adding some MinIO to your standalone Apache Spark cluster

Disaggregated compute and storage for the apprentice Data Engineer

Vasileios Anagnostopoulos
Aug 26, 2022

Background

One of the typical problems a beginner in Apache Spark encounters is getting hold of an Apache Spark cluster and an Apache Hadoop cluster, either to practice the co-location of compute and storage through YARN or to follow Apache Spark examples that interact with an HDFS system. This typically entails connecting your Apache Spark to an already running HDFS system, or running your Apache Spark application as a YARN job in an existing Hadoop cluster. In most cases the approach amounts to one of the following options:

  1. Download distribution for Apache Hadoop and Apache Spark and run them locally.
  2. Use a docker distribution that is all-inclusive and preset, like Hortonworks Sandbox.
  3. Use a more modularized docker distribution like Big Data Europe.
  4. Pay for a preset distribution in various cloud providers.

Of course, (1) does not qualify as clustered if you need more than one instance, and it takes some steps to set up. On the other extreme, (4) may cost you $$$. Solution (3) downloads a lot, and since so many things are preset it does not qualify as educational, especially when maintenance is left to volunteers and the packages can lag behind current versions. Solution (2) is too fat for my taste, and again a lot is preset. However, if you have a fairly capable laptop, you can create your own setup and learn on the way.

One of my main requirements was to have something that can be set up easily. This rules out Hadoop/YARN, since it has too many moving parts and knobs to make it work as a cluster. Making my own docker image was a no-go at that time (though I could have taken an existing one and upgraded it, of course). Using someone else’s distribution was also a no-go, because I have seen many of them quickly become unmaintained or receive updates only sporadically. Coming to Spark, having ruled out YARN and paying $$$ for educational purposes, I see the standalone cluster mode as the best fit for my constraints, since at this point I was not interested in learning Kubernetes; I wanted to focus on Apache Spark. Still, the standalone mode needs some storage to pull data from. While a relational database is a very viable option (and overkill if you do not already use it for something else), I wanted something Hadoop-like.

MinIO to the rescue

One possible solution was to use something like S3, but without having to use Amazon Web Services. LeoFS is an option, but there is a lot of hype about MinIO, good documentation … and docker images. MinIO promises disaggregated compute and storage. The value proposition is that storage and compute should scale independently, which is in accord with cloud computing. Contrary to Hadoop, it is not a file store: in HDFS a file is split and replicated across various nodes, whereas MinIO is an object store, where a file plus its metadata together make an object. After a couple of minutes of thought, it is obvious that HDFS can be emulated easily; after all, it is a game of path names. But you lose the co-location of compute and storage, which is exactly the situation when your Apache Spark cluster and Apache Hadoop cluster live on different servers anyway. In this particular scenario, MinIO has broken the speed barrier against HDFS.

This is actually what I was looking for once I jumped outside the Hadoop ecosystem. The only remaining question was how to use it with Apache Spark. It turns out that I’m not the only one having this question. Fortunately, there are answers like this and this (from Medium):

Unfortunately, these are either outdated or they use Apache Spark from the SDK and not through a standalone cluster. They can also be complicated, and in some cases the answers are scattered across many sites. Sometimes they lead to dead ends.

After a lot of trial and error, posts and blogs, this is the documentation of my setup. The purpose is to be simple, maintainable and oriented towards beginners. Let’s start.

Setup MinIO and first sanity check

We will use the official MinIO docker image, since the Bitnami one is not maintained anymore. I provide a docker-compose.yml with the code of this article. You can start MinIO as

docker compose up minio
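For reference, a minimal sketch of what such a minio service can look like in the compose file (the file in the repository may differ; the credentials, container name and app-tier network match what is used later in this article):

services:
  minio:
    image: minio/minio:latest
    container_name: my-minio-server   # the name other containers use to reach it
    # Serve the S3 API on 9000 and the web console on 9001.
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: theroot
      MINIO_ROOT_PASSWORD: theroot123
    ports:
      - "9000:9000"
      - "9001:9001"
    networks:
      - app-tier

networks:
  app-tier:
    driver: bridge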

MinIO takes some time to start; once it is up, this is what you get when you access the console at

http://127.0.0.1:9001

The minio console

We will use the data from here: a csv file with addresses for demo purposes, which you need to download. Log in, create a bucket named “mybucket” and upload the csv file. This is the end result

Bucket with files
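If you prefer to script this step instead of clicking through the console, here is a sketch using the Python client that we install in the next section (it assumes the theroot/theroot123 credentials and the default API port 9000):

from minio import Minio

# Assumes the credentials from the compose file and MinIO's default API port 9000.
client = Minio("127.0.0.1:9000", access_key="theroot", secret_key="theroot123", secure=False)

# Create the bucket if needed and upload the downloaded csv.
if not client.bucket_exists("mybucket"):
    client.make_bucket("mybucket")
client.fput_object("mybucket", "addresses.csv", "addresses.csv")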

You can also join the docker network and list the bucket’s contents without problems by using the (outdated) Bitnami client

docker run -it --rm --name minio-client \
  --env MINIO_SERVER_HOST="my-minio-server" \
  --env MINIO_SERVER_ACCESS_KEY="theroot" \
  --env MINIO_SERVER_SECRET_KEY="theroot123" \
  --network app-tier --volume $HOME/mcconf:/.mc \
  bitnami/minio-client ls minio/mybucket

The output is shown in the next screenshot

Accessing MinIO programmatically through Python

As a sanity check, we will access our MinIO server from our PC through Python. First, we need to install the Python MinIO support.

pip install minio

Open your favorite editor, IDE or CLI environment and execute

Access minio through Python
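A minimal sketch of such an access script, assuming the theroot/theroot123 credentials and MinIO’s default API port 9000:

from minio import Minio

# Assumes MinIO is reachable on the default API port 9000 with the credentials from the compose file.
client = Minio("127.0.0.1:9000", access_key="theroot", secret_key="theroot123", secure=False)

# List what we uploaded, as a sanity check.
for obj in client.list_objects("mybucket"):
    print(obj.object_name, obj.size)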

You can also join the docker network through Jupyter.

Create a sub-folder in your current folder, e.g. jupyter-workspace. Now you can run

docker run -it --rm --network app-tier  -p 10000:8888 -v "${PWD}"/jupyter-workspace:/home/jovyan/ jupyter/scipy-notebook:latest

connect to

http://localhost:10000

take the token from the console output

and log in with it. Create a Python 3 notebook and execute a modified version of the above code. First, in a cell, run

!pip install minio

In case you have any problems, look here.

In the next cell run the modified code
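The only modification needed is the endpoint: inside the docker network the server is reached by its container name instead of 127.0.0.1. A sketch, assuming the my-minio-server name used earlier:

from minio import Minio

# Inside the app-tier network the server is reachable by its container name (assumed: my-minio-server).
client = Minio("my-minio-server:9000", access_key="theroot", secret_key="theroot123", secure=False)

for obj in client.list_objects("mybucket"):
    print(obj.object_name, obj.size)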

Success!!!

Accessing MinIO through local Hadoop client

Now it is time for our first encounter with the Apache ecosystem. It came to me as a pleasant surprise that Hadoop can “see” an S3 file system as just another distributed file store; it is not tied to HDFS only. There is extensive documentation on this. MinIO is S3 compatible, so we are in business. HDFS is so heavily advertised that it is easy to miss the point that Hadoop is actually both an implementation of HDFS and a generic file system client with pluggable drivers. Let’s get to work!

First, we need to download the Hadoop distribution, since the plan is to access our “distributed” MinIO from Hadoop and apply Hadoop’s commands to it. Download the latest release from here (3.3.4). I keep everything in my Downloads/apache folder, so the next reasonable step is to have

export HADOOP_HOME=$HOME/Downloads/apache/hadoop-3.3.4

export PATH=$HADOOP_HOME/bin:$PATH

The idea now is to naively access it as an S3 file system with the s3a scheme, which calls the appropriate driver.

hadoop fs -ls s3a://mybucket/addresses.csv

Of course, it fails. We have not configured where s3a points!!! Head over to

$HADOOP_HOME/etc/hadoop

back up core-site.xml (e.g. rename it to _core-site.xml) and “steal” some of the contents from here. A more up-to-date list can be found here. We put in the bare minimum necessary (in my repo it is this file):

core-site.xml for minIO access
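A minimal core-site.xml along these lines, assuming the theroot/theroot123 credentials and the API port 9000 (the file in my repo may differ slightly):

<configuration>
  <!-- Where the s3a driver should connect; assumes MinIO's API on localhost:9000. -->
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://127.0.0.1:9000</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>theroot</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>theroot123</value>
  </property>
  <!-- MinIO buckets are addressed by path, not by virtual host. -->
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <!-- The local MinIO endpoint is plain HTTP. -->
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
</configuration>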

Our attempt now fails again.

However, we are now in a better position: it cannot find a library. The real hint is in the overview section. In summary:

export HADOOP_OPTIONAL_TOOLS="hadoop-aws"

This time we succeed, YAY!!!!

So we can use Apache Hadoop to interact with another distributed filesystem. A huge deal. Now our hopes for Spark access get better and better, since Spark uses Hadoop as its file system access layer, much like JDBC is used for databases.

Accessing MinIO through local Spark

We will first make sure that we can access MinIO through Spark running locally. Head over to the downloads section of Apache Spark and download the latest release, without any built-in Hadoop. You could always use the all-inclusive package, but here, for tutorial reasons, we do the bare minimum; after this article you will have the knowledge to handle that situation too. Download it, unzip it and set it up

export HADOOP_HOME=$HOME/Downloads/apache/hadoop-3.3.4
export SPARK_HOME=$HOME/Downloads/apache/spark-3.3.0-bin-without-hadoop
export PATH=$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"

We have a script, spark-access-minio.py, for accessing MinIO through Spark.
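A sketch of what it does (the version in the repository may differ in details): it reads the addresses.csv uploaded earlier and writes it back to MinIO as parquet.

from pyspark.sql import SparkSession

# spark-access-minio.py (sketch): read the demo csv from MinIO and write it back as parquet.
spark = SparkSession.builder.appName("spark-access-minio").getOrCreate()

# Assumes the addresses.csv uploaded earlier has a header row.
df = spark.read.option("header", "true").csv("s3a://mybucket/addresses.csv")
df.show()

df.write.mode("overwrite").parquet("s3a://mybucket/output_addresses.parquet")
spark.stop()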

While the Spark distribution already ships the Python packages we need, I will also install pyspark through pip (Python 3.10.6), for auto-completion reasons and easy setup.

pip3 install pyspark

Let’s give it a try in Thonny.

Oooops!!!! What went wrong? Hadoop can talk to MinIO, and Spark talks to Hadoop. Here, though, Spark cannot talk to Hadoop, since we have a Spark build without Hadoop. The fix is, as usual, in the documentation.

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

We succeed this time.

The file is there (you can verify it through the UI or with hadoop)!!!
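For instance, with the Hadoop client configured earlier:

hadoop fs -ls s3a://mybucket/output_addresses.parquet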

Now we are ready to provide our Apache Spark standalone cluster with some storage.

Running our Spark standalone cluster

Our cluster is based on this repo. There are some differences though.

  1. Latest dependencies.
  2. Spark comes without bundled Hadoop.
  3. I add the latest Hadoop distribution.
  4. S3 is enabled in Hadoop, so that commands can be run from the workers or the master.

I have also added a docker-compose.yml that sets everything up. The other thing I would like to mention is that, since Hadoop comes with an empty core-site.xml, the configuration has to be provided either directly, by linking an external core-site.xml into the Hadoop folder, or through a spark-defaults.conf, which holds the same variables in a non-XML format with a hadoop prefix; I included both, and a sketch of the spark-defaults.conf variant is shown right after the compose command below. Feel free to connect to the container and execute hadoop file system commands. Let’s spin up our full cluster with MinIO and Spark (the master web UI is at 127.0.0.1:8080).

docker compose up
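A sketch of that spark-defaults.conf, assuming the my-minio-server container name and the same credentials as before (the file in the repository may differ):

# spark.hadoop.* entries are forwarded to the Hadoop configuration inside the containers.
spark.hadoop.fs.s3a.endpoint                 http://my-minio-server:9000
spark.hadoop.fs.s3a.access.key               theroot
spark.hadoop.fs.s3a.secret.key               theroot123
spark.hadoop.fs.s3a.path.style.access        true
spark.hadoop.fs.s3a.connection.ssl.enabled   false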

We delete the parquet file

hadoop fs -rmr s3a://mybucket/output_addresses.parquet

and now we will try to re-create it from our Spark cluster!!! In a new terminal, we definitely need our environment variables set before submitting

export HADOOP_HOME=$HOME/Downloads/apache/hadoop-3.3.4
export SPARK_HOME=$HOME/Downloads/apache/spark-3.3.0-bin-without-hadoop
export PATH=$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Now we can submit

spark-submit --master spark://127.0.0.1:7077 spark-access-minio.py

We were too optimistic, sigh!!!!

Crash and burn! It is not really obvious what is happening here. Some months ago I opened this bug. Months passed, and day-to-day work took over. At some point in mid-July I decided to track it down, and I had a very hard time. The Hadoop inside the container resolves the MinIO hostname just fine.

Eventually I added a magic line in my /etc/hosts to resolve the MinIO hostname to my localhost when the request comes from the local Spark client.
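Assuming my-minio-server is the name the MinIO container gets on the docker network, the line looks like:

127.0.0.1   my-minio-server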

Guess what! It worked this time and the file is there, though it is a mystery to me why. I suspect the driver is validating data, as described by others! I am very interested in an explanation.

Conclusion

I decided to deviate from the typical Hadoop/Spark combo because of its complexity, and opted instead for a different solution that is easier and equally cloud native. That path needed a lot of experimentation and studying, but the gains in knowledge were non-trivial. I re-discovered that it can pay off to do things differently. I share my journey to make the life of others easier and as a means of self-documentation. I used an x64 Mac with macOS Monterey, Python 3.10.6 and the latest OpenJDK 11. The code is in this repository. As usual, I am very happy to receive corrections and suggestions for improvement.
