Setting Up a Test Run of Cumulus ETL
Welcome to Cumulus!
This guide will explain how to install and run Cumulus ETL on some example patient data (no PHI). It assumes you are familiar with the command line.
This should help explain some Cumulus ETL concepts and give you some confidence that you can install everything correctly, before we move on to running Cumulus ETL in production.
For this test run, feel free to use your own personal laptop or desktop rather than your institution’s compute infrastructure (though that would work too).
Let’s open a terminal, navigate to a fresh scratch directory, and begin.
Preparations
Install Cumulus ETL
Cumulus ETL is shipped as a Docker image driven by a Docker Compose file.
- Install Docker (follow the official instructions).
- Download the Docker Compose file:
wget https://raw.githubusercontent.com/smart-on-fhir/cumulus-etl/refs/heads/main/compose.yaml
This compose.yaml file is all you’ll need. Any further Docker images needed by Cumulus ETL commands will be downloaded on the fly.
Whenever you run Cumulus ETL, this compose.yaml file will either need to be in your current directory, or you’ll have to pass Docker Compose the argument -f /path/to/compose.yaml.
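If you keep the compose file somewhere central instead, one way to sanity-check that Docker Compose can find and parse it (the path below is just a placeholder) is:
docker compose -f /path/to/compose.yaml config --quiet
A silent exit means the file is valid; any parse problem will be reported as an error.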
Download Sample Data
Let’s download some sample patient data, pre-generated by Synthea.
- Download the zip file:
wget https://github.com/smart-on-fhir/sample-bulk-fhir-datasets/archive/refs/heads/10-patients.zip
- Unzip it:
unzip 10-patients.zip
You’ll now see a folder called ./sample-bulk-fhir-datasets-10-patients holding some fake patient data.
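If you’re curious what a bulk FHIR export looks like, feel free to peek inside; you should see a set of .ndjson files (newline-delimited FHIR resources), which is the format Cumulus ETL consumes:
ls sample-bulk-fhir-datasets-10-patients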
First Run
Initialize the ETL
Before doing anything else with Cumulus ETL, you’ll need to initialize the output folder.
When this is done, the output folder will have several subfolders holding empty Delta Lakes, ready to receive patient data.
docker compose run --rm \
--volume `pwd`:/host \
cumulus-etl init \
/host/output
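Once that finishes, a quick listing should confirm the initialization worked; the exact subfolder names may vary by Cumulus ETL version:
ls output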
Adding Patient Data
Now let’s add in the fake patient data with a typical ETL run.
Every Cumulus ETL run folds new FHIR data into the Delta Lakes, either adding new rows or updating existing ones.
You’ll need to provide three folder arguments:
- A source folder that holds the FHIR data to be processed.
- An output folder that holds the Delta Lakes (the same folder from init above).
- A PHI/build folder that will hold build and cache artifacts that may contain PHI, like patient IDs. It is important that every ETL run for a given output folder uses the same PHI folder.
So let’s provide those three arguments and add our FHIR data:
docker compose run --rm \
--volume `pwd`:/host \
cumulus-etl etl \
/host/sample-bulk-fhir-datasets-10-patients \
/host/output \
/host/phi
After running this command, you should be able to see more Delta Lake files in ./output/*/ and some build artifacts in ./phi.
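If you want to double-check that data actually landed, Delta Lakes store their rows as Parquet files (alongside a _delta_log folder), so a simple search should turn some up:
find output -name '*.parquet' | head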
Congratulations! You’ve run your first Cumulus ETL process. The first of many!
Next Steps
This was just a demonstration of the ETL portion of the process.
Let’s learn how to run the whole Cumulus pipeline by setting up the Cumulus infrastructure in an actual AWS environment.