Bulk FHIR Exports

Cumulus ETL wants data, and lots of it.

It’s happy to ingest data that you’ve gathered elsewhere (as a separate export), but it’s also happy to download the data itself as needed during the ETL (as an on-the-fly export).

Export Options

  1. If you have an existing process to export health data, you can do that bulk export externally, and then just feed the resulting files to Cumulus ETL. (Though note that you will need to provide some export information manually, with the --export-group and --export-timestamp options. See --help for more info.)

  2. Otherwise, we recommend the SMART Fetch tool. Not only will it grab the bulk export, but it will do post-processing to grab resources that don’t normally come in a bulk export (like Medications and PractitionerRoles) and inline linked clinical notes so that you don’t have to download them later. It will also generate a log file that the ETL can parse, so you don’t need to specify the above --export-* options manually.

Either way, feeding that data to the ETL is simple: pass Cumulus ETL the folder that holds the downloaded data as the input path.

Cumulus Assumptions

Cumulus ETL makes some specific assumptions about the data you feed it and the order you feed it in.

This is because Cumulus tracks which resources were exported from which FHIR Groups and when. Only Encounters whose data has been fully imported are made available for SQL queries, which prevents an in-progress ETL workflow from affecting queries against the database. (That is, it prevents an Encounter that hasn’t yet had its Conditions loaded from looking like an Encounter that doesn’t have any Conditions.)

Of course, even in the normal course of events, resources may show up weeks after an Encounter (like lab results). So an Encounter can never be known to be truly complete, but Cumulus ETL makes an effort to keep a consistent view of the world, at least for a given point in time.

Encounters First

Please export Encounters along with or before you export other Encounter-linked resources. (Patients can be exported beforehand, since they don’t depend on Encounters.)

To prevent incomplete Encounters, Cumulus only looks at Encounters whose export timestamp is at or before the export timestamps of linked resources like Conditions. (As a result, there may be extra Conditions that point to not-yet-loaded Encounters. That’s fine; they will also be ignored until their Encounters do get loaded.)
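The timestamp rule above can be sketched in a few lines. This is a hypothetical illustration, not Cumulus ETL’s actual code; the `encounter_is_queryable` helper name is invented for this example.

```python
from datetime import datetime, timezone

def encounter_is_queryable(encounter_export_time, linked_export_times):
    """Hypothetical illustration of the rule above: an Encounter is only
    surfaced for SQL once every linked resource type was exported at or
    after the Encounter's own export timestamp."""
    return all(t >= encounter_export_time for t in linked_export_times)

encounters = datetime(2024, 6, 1, tzinfo=timezone.utc)
conditions_later = [datetime(2024, 6, 2, tzinfo=timezone.utc)]
conditions_earlier = [datetime(2024, 5, 30, tzinfo=timezone.utc)]
encounter_is_queryable(encounters, conditions_later)    # True: safe to query
encounter_is_queryable(encounters, conditions_earlier)  # False: held back
```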

If you do export Encounters last, you may not see any of those Encounters in the core study tables once you run Cumulus Library on the data. (Your Encounter data is safe and sound, just temporarily ignored by the Library until later exports come through.)

No Partial Group Exports

Please don’t slice and dice your Group resources when exporting. Cumulus ETL assumes that when you feed it an input folder of export files, everything in the Group is available (at least for the exported resources). You can export one resource type from the Group at a time, just don’t slice that resource further.

This is because when you run the ETL on, say, Conditions exported from Group Group1234, it will mark Conditions in Group1234 as completely loaded (up to the export timestamp).

Using _since or a date-oriented _typeFilter to grab new data for an export is still fine. The concern is only about an incomplete view of the data at a given point in time.

For example, if you sliced Conditions according to category when exporting (e.g. _typeFilter=Condition?category=problem-list-item), Cumulus will have an incorrect view of the world (thinking it got all Conditions when it only got problem list items).

You can still do this if you are careful! For example, maybe exporting Observations is too slow unless you slice by category. Just make sure that after you export all the Observations separately, you then combine them again into one big Observation folder before running Cumulus ETL.
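That recombination step is just file shuffling. As a sketch, assuming a hypothetical `combine_exports` helper (not part of Cumulus ETL), plain copies are enough:

```python
import pathlib
import shutil

def combine_exports(slice_dirs, combined_dir):
    """Hypothetical helper: copy every NDJSON file from each sliced export
    folder into one combined folder, prefixing file names so slices can't
    overwrite each other."""
    combined = pathlib.Path(combined_dir)
    combined.mkdir(parents=True, exist_ok=True)
    for index, slice_dir in enumerate(slice_dirs):
        for src in sorted(pathlib.Path(slice_dir).glob("*.ndjson")):
            shutil.copy(src, combined / f"{index:03}.{src.name}")

# e.g. combine_exports(["./obs-labs", "./obs-vitals"], "./observations")
```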

Archiving Exports

Exports can take a long time, and it’s often convenient to archive the results for later re-processing, sanity checking, or quality assurance.

We recommend archiving everything in the export folder. Here’s what you can expect to find there:

  • The resource export files themselves (these will look like 1.Patient.ndjson or Patient.000.ndjson or similar)
  • The log.ndjson log file
  • The deleted/ subfolder, if present (this will hold a list of resources that the FHIR server says should be deleted)
  • The error/ subfolder, if present (this will hold a list of errors from the FHIR server as well as warnings and informational messages, despite the name)
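If you don’t already have an archiving step, one compressed tarball of the whole folder covers all of the above. A minimal sketch (the `archive_export` helper name is invented here):

```python
import shutil

def archive_export(export_folder, archive_name):
    """Bundle the whole export folder (resource NDJSON files, log.ndjson,
    and any deleted/ or error/ subfolders) into one compressed tarball."""
    return shutil.make_archive(archive_name, "gztar", root_dir=export_folder)

# e.g. archive_export("./export-folder", "encounter-export-2024-06-01")
```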

Downloading Clinical Notes Ahead of Time

If you are interested in running NLP tasks, the clinical note attachments found inside DiagnosticReport and DocumentReference resources will need to be available.

To make them available to Cumulus ETL, you’ll want to download the attachments ahead of time (and only once) by inlining them.

What’s Inlining?

Inlining is the process of taking an original NDJSON attachment definition like this:

{
  "url": "https://example.com/Binary/document123",
  "contentType": "text/html"
}

Then downloading the referenced URL and stuffing the results back into the NDJSON with some extra metadata, like so:

{
  "url": "https://example.com/Binary/document123",
  "contentType": "text/html; charset=utf8",
  "data": "aGVsbG8gd29ybGQ=",
  "size": 11,
  "hash": "Kq5sNclPz7QV2+lfQIuc6R7oRu0="
}

Now the data is stored locally in your downloaded NDJSON and can be processed independently of the EHR.
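The added fields follow FHIR’s Attachment definition: `data` is the base64-encoded content, `size` is the byte count, and `hash` is a base64-encoded SHA-1 digest of the data. Here’s a sketch of that transformation, matching the example above (this is not SMART Fetch’s actual code, and the helper name is invented):

```python
import base64
import hashlib

def inline_attachment(attachment, raw_bytes):
    """Sketch of inlining: add base64 data, the byte size, and a
    base64-encoded SHA-1 digest, per FHIR's Attachment.hash definition."""
    inlined = dict(attachment)
    inlined["data"] = base64.b64encode(raw_bytes).decode()
    inlined["size"] = len(raw_bytes)
    inlined["hash"] = base64.b64encode(hashlib.sha1(raw_bytes).digest()).decode()
    return inlined

# Pretend these bytes came back from fetching the attachment's `url`:
inline_attachment({"url": "https://example.com/Binary/document123",
                   "contentType": "text/html; charset=utf8"}, b"hello world")
```

Running this on the eleven bytes of “hello world” reproduces the `data`, `size`, and `hash` values shown in the example above.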

How to Inline

SMART Fetch has a special inlining mode. Simply run the following command, pointing at both a source NDJSON folder and your EHR’s FHIR URL.

smart-fetch hydrate --tasks inline ./ndjson-folder --fhir-url FHIR_URL --smart-client-id XXX --smart-key /YYY

This will modify the data in the input folder!

By default, this will inline text, HTML, and XHTML attachments for any DiagnosticReports and DocumentReferences found. But there are options to adjust those defaults. See --help for more information.

Resuming an Interrupted Export

Bulk exports can be brittle. The server can throw the occasional error or time you out. Maybe you lose your internet connection. Who knows.

If that happens, just re-run the SMART Fetch export command with the same folder, and it will resume.

Registering an Export Client

On your server, you need to register a new “backend service” client. You’ll be asked to provide some sort of public/private key; see below for how to generate one. You’ll also either be asked for a client ID, or the server may generate one for you.

Generating a JWK Set

A JWK Set (JWKS) is just a file with some cryptographic keys, usually holding a public and private version of the same key. FHIR servers use it to grant clients access.

You can generate a JWKS using the RS384 algorithm and a random ID by running the command below.

(Make sure you have jose installed first.)

jose jwk gen -s -i "{\"alg\":\"RS384\",\"kid\":\"`uuidgen`\"}" -o private.jwks
jose jwk pub -s -i private.jwks -o public.jwks

After giving public.jwks to your FHIR server, you can pass private.jwks to SMART Fetch with --smart-key (example below).

Generating a PEM key

A PEM key is just a file with a single private cryptographic key. Some FHIR servers may use it to grant clients access.

If your FHIR server uses a PEM key, it will provide instructions on the kind of key it expects and how to generate it. See, for example, Epic’s documentation.

After giving the public key to your FHIR server, you can pass your private.pem file to SMART Fetch with --smart-key (example below).

SMART Arguments

You’ll need to pass two arguments to SMART Fetch:

--smart-client-id=YOUR_CLIENT_ID
--smart-key=/path/to/private.jwks

You can also give --smart-client-id a path to a file with your client ID, if it is too large and unwieldy for the command line.