How To Manually Review the ETL Output
Your organization may require a manual review of all ETL output before uploading to the AWS cloud.
A two-step ETL process supports this: first write the output to a local folder for review, then upload the reviewed files.
First ETL Step: Generate Human-Readable Files
Follow the normal ETL flow, but:
- Make sure to pass --output-format=ndjson to cumulus-etl
- Use a local output folder (we don’t want this data in the cloud until we’ve reviewed it)
- Remember that Docker will require that local folder to be mapped outside of its container with something like --volume /outside/path:/inside/path
This will drop all ETL results as ndjson files in the target folder.
Manual Review
These ndjson files are human-readable (though not entirely pleasant to read) and can also be processed with standard JSON tools.
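As one way to script part of such a review, the sketch below uses only Python's standard library to walk the local output folder, parse every line of every ndjson file, and report how many records each file holds and which lines fail to parse. The function name summarize_ndjson is hypothetical, and the sketch assumes the output files carry an .ndjson extension; it is a starting point for a review, not a replacement for your organization's process.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_ndjson(folder):
    """Count valid JSON records per file and note lines that do not parse.

    Assumption: output files end in ".ndjson" (one JSON document per line).
    Returns (counts, errors), where counts maps a relative file path to its
    number of valid records and errors lists (relative path, line number).
    """
    root = Path(folder)
    counts = Counter()
    errors = []
    for path in sorted(root.rglob("*.ndjson")):
        rel = path.relative_to(root).as_posix()
        with path.open(encoding="utf8") as f:
            for line_number, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # tolerate blank lines
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    errors.append((rel, line_number))
                else:
                    counts[rel] += 1
    return counts, errors
```

Calling summarize_ndjson on the local output folder (the /outside/path side of the volume mapping) gives a quick inventory to compare against what you expected the ETL run to produce.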
Whatever your organization’s process is, once you are happy with the files, we can upload these files into a Delta Lake in the cloud.
Second ETL Step: Upload Binary Files
We want to convert the local ndjson folders into binary Delta Lake files in your AWS cloud.
Thankfully, there’s a special convert command for that:
docker compose -f $CUMULUS_REPO_PATH/compose.yaml \
run --volume /path/to/output:/output --rm \
cumulus-etl \
convert \
--s3-region=us-east-2 \
/output \
s3://my-cumulus-prefix-99999999999-us-east-2/subdir1/
This will copy over the job logs in the JobConfig file too.
For help on any other flags or options, pass --help.
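If the upload step is driven from a script (for example, so that exactly the folder that was reviewed is the one that gets converted), the convert command can be assembled programmatically. The following is a minimal sketch: build_convert_command and its parameter names are hypothetical, and it uses only the flags and argument order that appear in the convert command in this guide.

```python
import os


def build_convert_command(local_output, s3_target, s3_region, compose_file):
    """Assemble the docker compose argument list for the convert step.

    All names here are illustrative; the flags mirror the command in this guide.
    """
    # Use an absolute host path so Docker treats it as a folder to mount.
    local_output = os.path.abspath(local_output)
    return [
        "docker", "compose", "-f", compose_file,
        "run", "--volume", f"{local_output}:/output", "--rm",
        "cumulus-etl",
        "convert",
        f"--s3-region={s3_region}",
        "/output",   # path of the reviewed folder inside the container
        s3_target,   # destination prefix in your S3 bucket
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting issues and raises an error if the conversion exits with a failure code.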