Setting Up a Test Run of Cumulus ETL

Welcome to Cumulus!

This guide will explain how to install and run Cumulus ETL on some example patient data (no PHI). It assumes you are familiar with the command line.

This should help explain some Cumulus ETL concepts and give you confidence that you’ve installed everything correctly before we move on to running Cumulus ETL in production.

For this test run, feel free to use your own personal laptop or desktop rather than your institution’s compute infrastructure (though that would work too).

Let’s open a terminal, navigate to a fresh scratch directory, and begin.

Preparations

Install Cumulus ETL

Cumulus ETL is shipped as a Docker image driven by a Docker Compose file.

  1. Install Docker
  2. Download the Docker Compose file
    • wget https://raw.githubusercontent.com/smart-on-fhir/cumulus-etl/refs/heads/main/compose.yaml

This compose.yaml file is all you’ll need. Any further Docker images needed by Cumulus ETL commands will be downloaded on the fly.
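If you’d rather download the main image up front instead of on first use, a standard Docker Compose pull should do the trick (it fetches the images defined in compose.yaml):

docker compose pull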

Whenever you run Cumulus ETL, this compose.yaml file will either need to be in your current directory, or you’ll have to pass Docker Compose the argument -f /path/to/compose.yaml.
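For example, running from another directory might look like this (the path is illustrative, and we’re assuming the usual --help flag as a harmless command to try):

docker compose -f /path/to/compose.yaml run --rm cumulus-etl --help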

Download Sample Data

Let’s download some sample patient data, pre-generated by Synthea.

  1. Download zip file
    • wget https://github.com/smart-on-fhir/sample-bulk-fhir-datasets/archive/refs/heads/10-patients.zip
  2. Unzip it
    • unzip 10-patients.zip

You’ll now see a folder called ./sample-bulk-fhir-datasets-10-patients holding some fake patient data.
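If you’re curious what bulk FHIR data looks like, take a quick peek inside. You should see a collection of .ndjson files, roughly grouped by FHIR resource type (the exact filenames may differ):

ls ./sample-bulk-fhir-datasets-10-patients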

First Run

Initialize the ETL

Before doing anything else with Cumulus ETL, you’ll need to initialize the output folder.

When this is done, the output folder will have several subfolders holding empty Delta Lakes, ready to receive patient data.

docker compose run --rm \
  --volume `pwd`:/host \
  cumulus-etl init \
  /host/output
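To confirm the initialization worked, list the output folder. You should see one subfolder per FHIR resource type that Cumulus ETL supports (the exact names may vary by version):

ls ./output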

Adding Patient Data

Now let’s add in the fake patient data with a typical ETL run.

Every Cumulus ETL run folds new FHIR data into the Delta Lakes, either adding new rows or updating existing ones.

You’ll need to provide three folder arguments:

  1. A source folder that holds the FHIR data to be processed.
  2. An output folder that holds the Delta Lakes (the same folder from init above).
  3. A PHI/build folder that will hold build and cache artifacts that may contain PHI, like patient IDs. It’s important that every ETL run for a given output folder uses the same PHI folder.

So let’s provide those three arguments and add our FHIR data:

docker compose run --rm \
  --volume `pwd`:/host \
  cumulus-etl etl \
  /host/sample-bulk-fhir-datasets-10-patients \
  /host/output \
  /host/phi

After running this command, you should be able to see more Delta Lake files in ./output/*/ and some build artifacts in ./phi.
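Under the hood, each Delta Lake is a set of Parquet data files plus a _delta_log transaction log, so one quick way to spot-check the run is to look for newly written Parquet files:

find ./output -name '*.parquet' | head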

Congratulations! You’ve run your first Cumulus ETL process. The first of many!

Next Steps

This was just a demonstration of the ETL portion of the process.

Let’s learn how to run the whole Cumulus pipeline by setting up the Cumulus infrastructure in an actual AWS environment.