Creating Library Studies

The following guide describes how to use the Cumulus Library to create new aggregations in support of ongoing projects.

Setup

If you are going to be creating new studies, we recommend, but do not require, adding an environment variable, CUMULUS_LIBRARY_PATH, pointing to the folder in which you’ll be working on study development. cumulus-library will look in each subdirectory of that folder for manifest files, so you can run several studies at once.

If you’re not doing this, you can always add the --study-dir path/to/dir argument to any build/export call to tell it where to look for your work.

Creating a new study

If you’re authoring a study, you just need to do two things to get started:

  • Make a new directory inside the directory you’re keeping studies in. The name of this directory is the name you will use to run the study with the cumulus-library CLI command. In this document, we’re calling this directory my_study as an example.
  • Make a new file, manifest.toml. TOML is a config file format - you don’t need to worry too much about the details of this format, as we’ll show you in this document how the library uses these files to run your study. You can copy the following template as an example, which has comments describing what each section does:
# 'study_prefix' should be a string at the start of each table. We'll use this
# to clean up queries, so it should be unique. Name your tables in the following
# format: [study_prefix]__[table_name]. It should probably, but not necessarily,
# be the same name as the folder the study definition is in.
study_prefix = "my_study"

# For most use cases, this should not be required, but if you need to programmatically
# build tables, you can provide a list of files implementing BaseTableBuilder.
# See the core study for examples of this pattern. These run before
# any SQL execution
# [table_builder_config]
# file_names = [
#     "my_table_builder.py",
# ]

# The following section describes all tables that should be generated directly
# from SQL files.
[sql_config]
# 'file_names' defines a list of sql files to execute, in order, in this folder.
# Recommended order: Any ancillary config (like a list of condition codes),
# tables/views selecting subsets of data from FHIR data, tables/views creating
# summary statistics.
file_names = [
    "setup.sql",
    "lab_observations.sql",
    "counts.sql",
    "date_range.sql"
]


# The following section defines parameters related to exporting study data from
# your Athena database
[export_config]
# The following tables will be output to disk when an export is run. In most cases,
# only count tables should be output in this way.
export_list = [
    "my_study__count_influenza_test_month",
]

# For generating counts tables in a more standardized manner, we have a class in the
# main library you can extend that will handle most of the logic of assembling
# queries for you. We use this pattern for generating the core tables, as well as
# for other studies authored inside BCH. These will always be run after any other
# SQL queries have been generated
# [counts_builder_config]
# file_names = [
#    "count.py"
# ]

# For more specialized statistics, we provide a toml-based config entrypoint. The
# details of these configs will vary, depending on which statistical method you're
# invoking. For more details, see the statistics section of the docs for a list of
# supported approaches.
# These will run last, so all the data in your study will exist by the time these
# are invoked.
# [statistics_config]
# file_names = [
#    "psm_config.toml"
# ]

There are other hooks you can use in the manifest for more advanced control over how you generate SQL - these are commented out in the above template, and you can delete them if you don’t need them. See Creating SQL with python for more information.

If you’re familiar with git workflows, we recommend creating a git repo for your study, to help version your study in case of changes.

Writing SQL queries

Most users have a workflow that looks like this:

  • Write queries in the AWS Athena console while you are exploring the data
    • We recommend trying to keep your studies pointed at the core tables. The raw FHIR resource tables contain a lot of nested data that can be tricky to write cross-EHR queries against. For example, an EHR may store Medication information in the medication or the medicationrequest raw resource tables, but the core__medication table hides that complexity and is always available, regardless of the specific EHR approach. You can look at the Core study documentation for details about that study’s contents (there is also a short sketch of this pattern after this list). If you do need some data that is not available in the core tables, make sure you look at the Creating SQL with python guide for help safely extracting datasets from the raw resource tables.
  • Move queries to a file as you finalize them
  • Build your study with the CLI to make sure your queries load correctly.
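
To make the first two steps concrete, here is a hedged sketch of the kind of query you might move out of the Athena console and into a file like setup.sql. It assumes your core build includes a core__observation_lab table; the column names and the LOINC code filter below are illustrative assumptions, so check the Core study documentation for the columns actually available to you:

-- A minimal sketch, not a definitive query: column names and the LOINC
-- code below are assumptions for illustration only.
CREATE TABLE my_study__influenza_obs AS
SELECT
    o.subject_ref,       -- assumed patient reference column
    o.observation_code,  -- assumed coding column
    o.observation_month  -- assumed truncated date column
FROM core__observation_lab AS o
WHERE o.observation_code = '92142-9'; -- example influenza test LOINC code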

Important detail on FHIR arrays: When we flatten a FHIR element that is specified as being potentially an array (like many instances of CodeableConcept, for example), we create a separate table from that field. It can be joined back to the table it was extracted from by the id field present in both tables.

However, in your study design you will need to handle cases where multiple items may exist in these tables. It is common for multiple code systems to be used for a single record.

As an example, the Condition resource has a base-level CodeableConcept that should contain a SNOMED code, but often has only an ICD9/10 code, or an EHR vendor-specific code. We handle this case in two ways:

  • The core__condition_codable_concepts_display table contains one record per resource: we specify a priority order and take the first valid code we find. This is fine for cases where you aren’t concerned about a specific coding system and just want a general idea of what data you have.
  • The core__condition_codable_concepts_all table contains every code for every system found. This is useful when you are specifically looking for data associated with a given clinical coding system, but if you are not careful, you can cause a condition to be counted twice by not specifying a coding system when joining this table with the base condition table.

Your approach to handling this is going to be dictated by the specific clinical context you’re working with. In cases where we don’t specify two table types for an array resource, you should assume that we are following the second pattern and account for that in your queries.
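
As a hedged illustration of that second pattern, the sketch below joins the base condition table to the *_all table while pinning the coding system, so a condition carrying codes from several systems is only counted once. The id join field is described above; the code, system, and display column names are assumptions for illustration, so verify them against the core study documentation:

-- A sketch of joining an *_all codeable concept table back to its base
-- table while restricting to one coding system. Column names other than
-- id are assumptions.
CREATE TABLE my_study__snomed_conditions AS
SELECT
    c.id,
    cca.code,
    cca.display
FROM core__condition AS c
INNER JOIN core__condition_codable_concepts_all AS cca
    ON c.id = cca.id
WHERE cca.system = 'http://snomed.info/sct';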

sqlfluff

We use sqlfluff to help maintain a consistent style across many different SQL query authors. We recommend using sqlfluff as you are developing your queries to ensure your SQL matches the style of other authors, but it is not required. You can copy our sqlfluff config into your study if you’d like to use the same style we are.

There are two commands you can run inside your study’s directory to check formatting:

  • sqlfluff lint will show you any variations from the expected styling rules
  • sqlfluff fix will try to autocorrect your queries to match the expected style

To make your queries readable to other humans, and to let sqlfluff be as helpful as possible, we have a few requirements and some suggestions for query styling.

Hard Requirements

For all of these, Cumulus Library will notify you if you write a query that breaks one of these rules when you build your study.

  • All your table names must start with your study prefix followed by two underscores, like my_study__.
  • Relatedly, __ (two underscores) is a reserved character string. Your table names should have exactly one of these.
  • We have three reserved table prefixes: etl_, nlp_, and lib_. These are fine to use elsewhere in the table name, just not at the beginning. For example, my_study__nlp_counts would cause an error, but my_study__counts_nlp would be fine.

Requirements for accepting PRs

  • Count tables must use the CUBE operator to create powersets of data. See the Trino docs for more information about its syntax. The core study, and other studies produced by the core Cumulus team, provide examples of its usage; there is also a short sketch after this list.
  • For PHI reverse identification protection, exclude rows from count tables if they have a small number of members, e.g. fewer than 10.
  • You may want to select a SQL style guide as a reference. Mozilla provides a SQL style guide, which our sqlfluff config enforces. If you have a different style you’d like to use, you can update the .sqlfluff config to allow this. For example, Gitlab’s data team has a style guide that is more centered around DBT, but is more prescriptive around formatting.
  • Don’t implicitly reference columns from tables. Either use the full table name, or give the table an alias, and use that alias any time you are referencing a column.
  • Don’t use the * wildcard in your final tables. Explicitly list the columns with table name/alias - sqlfluff has a hard time inferring what’s going on otherwise.
  • We are currently asking for all caps for SQL keywords like SELECT and four-space indentation for queries. sqlfluff fix will apply this for you, but it may be easier to find other problems if you adhere to this from the start.
  • Aggregate count tables should have count_ as the first word after the study prefix, and the word count should not be used elsewhere in the name.
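
To tie the counts and suppression requirements together, here is a hedged sketch of a count table. The my_study__influenza_tests source table and its columns are hypothetical, and the counts builder class described in the manifest template can generate this kind of query for you:

-- A sketch only: my_study__influenza_tests and its columns are hypothetical.
-- CUBE produces the powerset of groupings; HAVING drops small cells.
CREATE TABLE my_study__count_influenza_test_month AS
SELECT
    count(DISTINCT i.subject_ref) AS cnt,
    i.test_month,
    i.test_result
FROM my_study__influenza_tests AS i
GROUP BY
    CUBE(i.test_month, i.test_result)
HAVING count(DISTINCT i.subject_ref) >= 10;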

Metadata tables

  • Creating a table called my_study__meta_date with two DATE columns, min_date and max_date, and populating it with the start and end date of your study, will allow other Cumulus tools to detect study date ranges, and also bakes the study date range into your SQL for future reference (a sketch follows this list).
  • Creating a my_study__meta_version table with one column, data_package_version, and giving it an integer value as shown in this snippet:
      CREATE TABLE my_study__meta_version AS
      SELECT 1 AS data_package_version;
    

    allows you to signal versions for use in segregating data downstream, like in the Cumulus Aggregator. Increment it when your counts output changes format, so that third parties know they need to rerun your study from scratch. If this is not set, the version will implicitly be set to zero.
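
Similarly, here is a minimal sketch of the my_study__meta_date table described above; the literal dates are placeholders for your study’s actual range:

CREATE TABLE my_study__meta_date AS
SELECT
    DATE '2020-01-01' AS min_date, -- placeholder study start date
    DATE '2023-12-31' AS max_date; -- placeholder study end date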

Testing studies

If you have a Cumulus database in Athena already, you can easily point at that during study development.

But it may also be faster or easier to work with local files, where you can add edge cases. Cumulus Library has an optional database backend driven by local ndjson just for that!

Set up your ndjson

You can grab fake Synthea data or use actual bulk-export results from your EHR.

Once you have that, run Cumulus ETL on your source ndjson with the --output-format=ndjson flag, pointing it at some local directory. For example:

docker compose run \
  --volume local_dir:/in \
  cumulus-etl \
  /in/ndjson \
  /in/output \
  /in/phi \
  --output-format=ndjson

This will generate a tree of processed & anonymized ndjson (just like the ETL normally makes).

Run your study on the local ndjson

Now you can run Cumulus Library but point it at the output files with the --db-type=duckdb and --load-ndjson-dir=DIR flags. For example:

cumulus-library build \
  --db-type duckdb \
  --load-ndjson-dir local_dir/output \
  --database local_dir/duck.db \
  --target my_study

Adding edge cases

Not only is this faster than talking to Athena, but you can edit the local ndjson to add interesting edge cases that you want your SQL to be able to handle.

We use this feature in the library and our studies for automated unit testing.

Sharing studies

If you want to share your study as an official Cumulus study, please let us know via the discussion forum - we can talk more about what makes sense for your use case.

If you write a paper using the Cumulus library, please cite the project.

Snapshotting/archiving studies

If you need to freeze a study at a specific point in time (like if you’re working on a publication), you can create an archive of that study using the archive command in the Cumulus library CLI. Just be aware that this archive may contain sensitive data, so make sure you store the archive somewhere that complies with your organization’s security policies.