How Does Cumulus De-identify Patient Data?
First, let’s review the timeline of when PHI gets redacted as patient data flows through the Cumulus pipeline. Then we’ll discuss which specific fields get de-identified and how.
The PHI Lifecycle
There are three main stages of patient data anonymity, as the Cumulus project shepherds that data from the EHR to the Cumulus dashboard:
- Full PHI records
- De-identified records
- Patient counts
Full PHI
This is the raw data from the EHR, usually in the form of a bulk FHIR export. Cumulus ETL saves this data locally to a temporary folder before beginning its work (this folder is deleted after use, even if the process is interrupted).
De-identified Records
Cumulus ETL runs its de-identification routine (see below) and natural language processing (NLP), and then uploads the resulting de-identified records to an output S3 bucket. Note that this is all still inside your own IT infrastructure.
Patient Counts
Each study will have its own SQL queries that run against the de-identified records. (And only the de-identified records, as these queries have no access to the full PHI at this point in the pipeline.)
These SQL queries will all result in a simple count of the target information. That might be how many patients had a particular symptom, how many patients didn’t, how many patients are prescribed a particular medication, etc.
For example, a Covid study might query how many patients have fever symptoms, with a result like “10,000 patients showed fever symptoms on 10/15/2021.” And that count would be sent on to the Cumulus dashboard.
But no specific patient data can be sent to the dashboard. Just the total count of results of a given SQL query.
This is the first time data leaves your institution, and by this time, there is no PHI at all. Just counts.
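To make the "counts only" idea concrete, here is a minimal sketch of a count query. It uses Python's built-in sqlite3 as a stand-in for the real query engine, and the `symptoms` table, its columns, and the sample rows are all hypothetical, not the actual Cumulus schema.

```python
import sqlite3

# In-memory database standing in for the de-identified records (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symptoms (patient_id TEXT, symptom TEXT)")
conn.executemany(
    "INSERT INTO symptoms VALUES (?, ?)",
    [
        ("a1", "fever"),
        ("a1", "fever"),  # same anonymized patient seen twice
        ("b2", "fever"),
        ("c3", "cough"),
    ],
)

# The study query returns one aggregate number, never row-level data.
fever_count = conn.execute(
    "SELECT COUNT(DISTINCT patient_id) FROM symptoms WHERE symptom = 'fever'"
).fetchone()[0]

print(fever_count)
```

Only a value like `fever_count` would be passed along; the rows themselves stay behind.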
De-identification
The full PHI and patient count stages are easy to understand. All the PHI or none of it.
But the piece in between, where de-identified data sits at rest in Amazon S3, is more nuanced. Let’s explore that.
There are two main transformations of PHI inside Cumulus ETL:
- Dropping data: Cumulus ETL holds an allow-list of all FHIR fields that are acceptable. Fields with PHI are not allowed and are dropped on the floor (for example: Patient.name). If not performing NLP, any attachment or clinical note data is also dropped.
- Anonymizing data: It replaces all resource IDs with anonymized IDs, chops birth dates down to just the year, generalizes zip codes, and (optionally) runs philter on a few higher-risk text fields.
Those de-identification steps are performed as Cumulus ETL reads the FHIR data from disk. So as it writes out FHIR to Athena, the data has already been de-identified.
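As a rough illustration of the allow-list approach, the sketch below keeps only listed top-level fields of a resource. The `ALLOWED_FIELDS` contents and the `drop_unlisted_fields` helper are hypothetical; the real allow-list is much larger and also covers nested fields.

```python
# Hypothetical allow-list keyed by resource type (illustrative subset only).
ALLOWED_FIELDS = {
    "Patient": {"resourceType", "id", "gender", "birthDate"},
}


def drop_unlisted_fields(resource: dict) -> dict:
    """Return a copy of the resource with only allow-listed top-level fields."""
    allowed = ALLOWED_FIELDS.get(resource.get("resourceType"), set())
    return {key: value for key, value in resource.items() if key in allowed}


patient = {
    "resourceType": "Patient",
    "id": "abc",
    "name": [{"family": "Doe"}],  # PHI, not on the allow-list
    "gender": "female",
    "birthDate": "1980-04-12",
}
scrubbed = drop_unlisted_fields(patient)
print(scrubbed)
```

An allow-list fails safe: a field nobody anticipated is dropped by default rather than leaked.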
Read more on each of the specific de-identification strategies below.
Dates
Most dates are left alone, as precise timing is useful for studies and carries minimal PHI risk. But anything age related is carefully handled in the usual HIPAA manner:
- Birthdates are redacted down to just the year (no month or day)
- If the birthdate (or other age field) indicates an age over 89, those patients will be grouped together as one cohort (this is done by study code, not the ETL)
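The two rules above could be sketched as follows. Both function names and the "90+" label are assumptions made for illustration; the second function stands in for what study code might do, not for anything in the ETL itself.

```python
def truncate_birth_date(birth_date: str) -> str:
    """Keep only the year of a FHIR date (YYYY, YYYY-MM, or YYYY-MM-DD)."""
    return birth_date[:4]


def age_group(birth_year: int, reference_year: int) -> str:
    """Approximate age from years alone, grouping everyone over 89 together."""
    age = reference_year - birth_year
    return "90+" if age > 89 else str(age)


print(truncate_birth_date("1980-04-12"))
print(age_group(1930, 2021))
```

Because only the year survives, any age computed downstream is approximate to within a year.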
Zip Codes
Zip codes are redacted down to just the first three digits (e.g. 12139 becomes 12100).
Additionally, for certain small-population zip codes where even three digits is too identifying, the zip code is entirely redacted to 00000.
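A minimal sketch of that zip code handling, assuming five-digit (or ZIP+4) string input. The `SMALL_POPULATION_PREFIXES` set holds a single illustrative entry, not the full list Cumulus ETL uses, and `generalize_zip` is a hypothetical helper name.

```python
# Illustrative entry only; the complete list of small-population prefixes
# is not reproduced in this document.
SMALL_POPULATION_PREFIXES = {"036"}


def generalize_zip(zip_code: str) -> str:
    """Keep the first three digits, or blank the code entirely if too identifying."""
    prefix = zip_code.strip()[:3]
    if prefix in SMALL_POPULATION_PREFIXES:
        return "00000"
    return prefix + "00"


print(generalize_zip("12139"))
print(generalize_zip("03601"))
```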
Extensions
Extensions are stripped out unless they are on a list of recognized extensions, to ensure that PHI doesn’t accidentally slip in. The allowed extensions include the standard USCDI patient extensions (birth sex, gender identity, race, and ethnicity) as well as various harmless vendor extensions.
Any unrecognized “Modifier” extension will cause Cumulus ETL to entirely skip the containing resource, since the resource can’t be properly understood.
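The extension rules might look roughly like this in code. The URL set below is a small illustrative subset of US Core extensions, `filter_extensions` is a hypothetical name, and this sketch only inspects top-level extensions, whereas extensions can also appear on nested elements.

```python
from typing import Optional

# Illustrative subset of recognized extension URLs.
ALLOWED_EXTENSION_URLS = {
    "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race",
    "http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity",
    "http://hl7.org/fhir/us/core/StructureDefinition/us-core-birthsex",
}


def filter_extensions(resource: dict) -> Optional[dict]:
    """Strip unrecognized extensions; return None if the resource must be skipped."""
    # An unrecognized modifier extension means the resource cannot be understood.
    for modifier in resource.get("modifierExtension", []):
        if modifier.get("url") not in ALLOWED_EXTENSION_URLS:
            return None

    result = dict(resource)
    if "extension" in result:
        kept = [e for e in result["extension"] if e.get("url") in ALLOWED_EXTENSION_URLS]
        if kept:
            result["extension"] = kept
        else:
            del result["extension"]  # avoid leaving an empty array behind
    return result


RACE_URL = "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race"
patient = {
    "resourceType": "Patient",
    "extension": [
        {"url": RACE_URL},
        {"url": "http://example.com/unknown", "valueString": "x"},
    ],
}
filtered = filter_extensions(patient)
print(filtered)
```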
IDs
Cumulus ETL de-identifies FHIR resource IDs itself.
By IDs, we are only talking about FHIR resource IDs. Other identifiers (like patient identifiers) are always stripped out entirely.
These resource IDs are securely hashed one-way for anonymity, using an HMAC-SHA256 hash with a 256-bit salt.
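As a sketch of this kind of keyed hashing, the snippet below uses Python's standard hmac module. It assumes the salt serves as the HMAC key and that the result is hex-encoded; both are assumptions for illustration, as are the function names.

```python
import hashlib
import hmac
import secrets


def make_salt() -> bytes:
    """Generate a random 256-bit (32-byte) salt."""
    return secrets.token_bytes(32)


def anonymize_id(resource_id: str, salt: bytes) -> str:
    """Hash a resource ID with HMAC-SHA256, using the salt as the key."""
    return hmac.new(salt, resource_id.encode("utf-8"), hashlib.sha256).hexdigest()


salt = make_salt()
hashed = anonymize_id("patient-123", salt)
print(hashed)
```

With the same salt, the same ID always maps to the same hash, so references between resources still line up after anonymization. Without the salt, the original ID cannot practically be recovered from the hash.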
Patients and Encounters
Patient and Encounter resources are anonymized like any other resource, with one difference.
A mapping from the old to the new IDs is kept for debugging purposes. If there is ever a concern about data integrity or oddities are observed in the de-identified results, it is crucial that some mechanism exists to reverse the anonymization so that your institution can investigate.
This mapping is obviously very precious and is treated as sensitive PHI. It’s stored in a special PHI directory (the third argument to Cumulus ETL). And you control where that PHI directory lives (an S3 bucket, a local disk, etc.), so that it can be locked down as tightly as you like. It never leaves your institution’s control.
Any other resource is usually already tied to a patient or encounter. So Cumulus does not bother keeping a mapping for those.
Freeform Text Fields
There are some freeform text fields that Cumulus ETL leaves in. These fields are useful for presenting or computing a phenotype:
- CodeableConcept.text
- Coding.display
Although Cumulus wants to largely preserve these fields, they may contain PHI since they are freeform text fields after all.
If that is likely for your institution, you can have Cumulus ETL run philter over these freeform fields, by passing --philter. This replaces any detected PHI like names, phone numbers, MRNs, social security numbers, etc. with asterisks.
But be warned that it will significantly slow down the ETL process.
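To give a feel for what asterisk masking does to a text field, here is a deliberately simple regex-based stand-in. It is not philter and does not use philter's API; the two patterns, the `mask_phi` name, and the choice to preserve the original length are all assumptions for illustration.

```python
import re

# Toy patterns only; real PHI detection is far more thorough than this.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def mask_phi(text: str) -> str:
    """Replace each detected match with asterisks of the same length."""
    for pattern in (SSN_PATTERN, PHONE_PATTERN):
        text = pattern.sub(lambda match: "*" * len(match.group()), text)
    return text


masked = mask_phi("Call 617-555-0100 re: fever")
print(masked)
```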
NLP on Clinical Notes
If performing NLP, the clinical notes are not stripped during normal de-identification, as they need to be passed to the NLP model.
The resulting detected symptoms and other medical codes are then kept in the de-identified results, along with “span” pointers back to the text, indicating which text indicated the specific symptom. This does not hold actual text, but just numbers pointing to offsets in the text.
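The span idea can be sketched like this. A plain substring search stands in for the NLP model, and the `find_span` helper and the shape of the stored record are hypothetical.

```python
from typing import Optional


def find_span(note: str, term: str) -> Optional[dict]:
    """Return begin/end character offsets of a term, or None if it is absent."""
    begin = note.find(term)
    if begin < 0:
        return None
    return {"begin": begin, "end": begin + len(term)}


note = "Patient reports fever and a dry cough."
span = find_span(note, "fever")

# The stored result carries a label and offsets, but none of the note text.
record = {"symptom": "fever", "span": span}
print(record)
```

The offsets are only meaningful to someone who also holds the original note, which stays with the full PHI.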
Conclusion
And that’s it! In summation, the only data that leaves your institution is raw counts that could not be considered PHI.
But inside your institution, some de-identified resources stick around, as do the more sensitive ID mappings for patients and encounters.
Hopefully you feel a little more at ease about how de-identification is performed, but always feel free to reach out to the Cumulus team for suggestions or questions.