# Cumulus upload integration testing
This document walks you through two kinds of integration testing for transmitting data to the Cumulus Aggregator.
## Prerequisites
You’ll need:
- A local copy of the Cumulus Library project
- For the quick test: a copy of an example export with Synthea data - the Library has a set of test data you can use for this.
- For the long test: a local copy of the sample bulk FHIR datasets, and an instance of the Cumulus ETL
- Either your own Aggregator instance, or credentials configured by reaching out to BCH, which will generate a site ID using the credential management script.
## Configuring uploads
The Cumulus Library has a script for uploading data in bulk. You can pass values to it via the command line, but we recommend setting up environment variables instead. Specifically:
- `CUMULUS_AGGREGATOR_USER` and `CUMULUS_AGGREGATOR_ID` - these should match the credentials configured in the Aggregator via the credential management script.
- `CUMULUS_AGGREGATOR_URL` - for this testing, this should be set to a non-production environment. The BCH Aggregator uses https://staging.aggregator.smartcumulus.org/upload/ for this, but you can use an endpoint of your choice if you are self-hosting an Aggregator.
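As a minimal sketch, you might export these variables in your shell before running the uploader. The user and ID values below are placeholders; substitute the credentials configured for your site.

```bash
# Placeholder credentials - replace with the values configured for your site
# via the Aggregator's credential management script.
export CUMULUS_AGGREGATOR_USER="example_site"
export CUMULUS_AGGREGATOR_ID="00000000-0000-0000-0000-000000000000"

# For integration testing, point at a non-production Aggregator endpoint.
export CUMULUS_AGGREGATOR_URL="https://staging.aggregator.smartcumulus.org/upload/"
```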
## Quick test: uploading test data
With these environment variables set, the bulk uploader is ready to load data. Perform the following steps inside the `cumulus-library-core` project:
- Copy the test data file `./tests/test_data/count_synthea_patient.parquet` into `./data_export/test_data`
- If desired, perform an upload dry run with `./data_export/bulk_upload.py --preview` - this will show you what the bulk uploader will do without actually sending data
- Run the bulk uploader with `./data_export/bulk_upload.py`
- A user with access to the Aggregator's S3 bucket can verify whether the upload was successful (one way to check is sketched below)
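For example, if you have AWS credentials for the account hosting the Aggregator, a quick spot check is to list the bucket contents. This is a sketch only: the bucket name below is a placeholder, and the actual bucket and key layout depend on how your Aggregator instance is deployed.

```bash
# List recently uploaded objects in the Aggregator's S3 bucket.
# "example-aggregator-site-uploads" is a placeholder bucket name; use the
# bucket configured for your (staging) Aggregator instance.
aws s3 ls s3://example-aggregator-site-uploads/ --recursive | tail -n 20
```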
## Integration test: Processing synthetic data through ETL
If the quick test was successful, you can test your processing pipeline entirely with synthetic data by running through the following steps:
- If you haven’t already, you’ll want to set up the ETL with synthetic data. The setup guide in the Cumulus ETL documentation includes instructions to deploy with a synthetic dataset.
- When it's complete, you should be able to view data in Athena to verify.
- In the Cumulus Library repo, build the Athena tables and export results with `./library/make.py --build --export` (make sure you follow the setup guide in the Cumulus Library documentation and set the appropriate environment variables/AWS credentials)
- When the export completes, you should have folders in `./library/data_export` corresponding to the currently configured exportable studies (at the time of this writing, `core` and `covid`)
- Run the bulk uploader with `./data_export/bulk_upload.py`
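Putting those steps together, a full export-and-upload pass looks roughly like the following, assuming your AWS credentials and the `CUMULUS_AGGREGATOR_*` variables from the previous section are already set.

```bash
# Build the Athena tables and export the study results
# (requires AWS credentials with access to your Athena resources).
./library/make.py --build --export

# Inspect the exported study folders (e.g. core/ and covid/).
ls ./library/data_export

# Upload the exported results to the (staging) Aggregator.
./data_export/bulk_upload.py
```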
If this works, then you've proved out the whole data export flow and should be able to run a production export flow by changing only the `CUMULUS_AGGREGATOR_*` environment variables to point to the production instance. If you're using the BCH Aggregator, you do not need to specify `CUMULUS_AGGREGATOR_URL`, as that URL is the default value in the bulk upload tool.
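As a sketch, switching from the staging test to a production run might look like this. The commented-out production URL is a placeholder for self-hosted Aggregators; BCH users can simply leave the URL variable unset and rely on the tool's default.

```bash
# BCH Aggregator: unset the URL so the bulk uploader falls back to its
# built-in production default.
unset CUMULUS_AGGREGATOR_URL

# Self-hosted Aggregators: set your own production upload endpoint instead.
# export CUMULUS_AGGREGATOR_URL="https://aggregator.example.org/upload/"

./data_export/bulk_upload.py
```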