Creating SQL with Python

Before jumping into this doc, take a look at Creating Studies. If you’re just working with the Core study tables related to the US Core FHIR profiles, you may not need any of this. We cover some of the most common cases with workflows, so check what’s available there first before writing your own custom classes.

Why would I even need to think about this?

There are two main reasons to use a builder:

  • You want to leverage the library of templates we provide to quickly create common kinds of SQL statements
  • You are working against the raw FHIR resource tables and need to access nested data in Athena.
    • This is gnarly because while the ETL provides a full SQL schema for your own data, it does not guarantee a schema for data that you don’t have at your site. And if you want your study to run at multiple sites with different EHRs, you need to be careful when accessing deep FHIR fields. For example, your EHR might populate Condition.evidence.code and you can safely write SQL that uses it. But a different site’s EHR may not provide that field at all, and thus that column may not be defined in the SQL table schema at that other site.

    The ETL provides a schema for data elements two levels deep, and tries to handle all cases where a Period, Coding or a CodeableConcept is present, but there may still be some gaps that are missed - Extensions especially can be tricky if they are recursive.

Utilities

There are two main bits of infrastructure we use for programmatic tables: The TableBuilder class, and the collection of template SQL.

If you include a table builder in your study, and you want to see what the query being executed looks like, you can use the generate-sql command in the Cumulus library CLI to write out example queries. They will go into a folder inside your study called reference_sql.

To document your study structure, you can use the generate-md command to create markdown tables you can copy into your study docs. Note that, as of this writing, you’ll need to supply a description for each field by hand. This output will be generated inside your study, in a file named {study name}_generated.md.

Working with TableBuilders

We have a base TableBuilder class that all the above use cases leverage. At a high level, here’s what it provides:

  • A prepare_queries function, which is where you put your custom logic. It should build a list of queries in self.queries. The CLI will pass in a StudyConfig object, containing information about arguments passed to the CLI, and a StudyManifest, which contains the information you provided in the study’s manifest.toml.
  • An execute_queries function, which will run prepare_queries and then apply those queries to the database. You shouldn’t need to touch this function - just be aware this is how your queries actually get run.
  • A write_queries function, which will write your queries from prepare_queries to disk. If you are creating multiple queries in one go, calling comment_queries before write_queries will insert some spacing elements for readability.
  • A display_text string, which is what will be shown with a progress bar when your queries are being executed.
  • A parallel_allowed boolean, which defaults to True. If parallel execution is allowed, your builder will run all the queries created in prepare_queries concurrently, up to the connection pool limit, but only if the study manifest is also configured for parallel execution. If you want to ensure that dependent queries always run in series, your builder should override __init__ as follows:
      def __init__(self, *args, **kwargs):
          super().__init__(*args, **kwargs)
          self.parallel_allowed = False
    

You can either extend this class directly (like builder_*.py files in cumulus_library/studies/core) or create a specific class to add reusable functions for a repeated use case (like in cumulus_library/statistics/counts_builder.py).
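For a sense of what extending the class looks like, here is a minimal, self-contained sketch. BaseTableBuilder below is a simplified stand-in (defined inline so the example runs on its own); in a real study you would import and subclass the library's actual base class, and MedRequestBuilder and its SQL are hypothetical.

```python
# Simplified stand-in for the library's base class, so this sketch runs
# standalone; a real study would subclass the TableBuilder base class
# shipped with cumulus_library instead.
class BaseTableBuilder:
    display_text = "Building tables..."
    parallel_allowed = True

    def __init__(self):
        self.queries: list[str] = []

    def prepare_queries(self, config, manifest):
        raise NotImplementedError


# Hypothetical study builder: it collects SQL strings in self.queries,
# which execute_queries() would then run against the database in order.
class MedRequestBuilder(BaseTableBuilder):
    display_text = "Building medication request tables..."

    def prepare_queries(self, config, manifest):
        # In practice this string would come from a template function
        # rather than being written inline (see the templating section).
        self.queries.append(
            "CREATE TABLE study__med_requests AS "
            "SELECT id, status FROM medicationrequest"
        )
```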

TableBuilder SQL should generally go through a template SQL generator, so that your SQL is validated. If you’re just working on counts, you don’t need to worry about this detail; otherwise, the following section covers our templating mechanism.

If you need to incrementally check steps in the database, or need other state logging tools, you can use functions in cumulus_library/log_utils.py to write to an autogenerated {study}__lib_transactions table to track state.

If you need a user to supply information from the CLI (for example, you’d like them to control which valueset they want to use, as outlined in the Opioid valueset study), you can ask them to provide this as a custom argument, using the format --option key:value. These options will be placed in the StudyConfig that is passed into the builder.
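As a sketch of how a builder might read such an option: the attribute name `options` and the dict shape below are assumptions about how parsed `--option key:value` pairs surface on StudyConfig, so check the library source for the actual field.

```python
from types import SimpleNamespace


def build_valueset_query(config) -> str:
    """Build one query from a user-supplied --option valueset:NAME pair."""
    # Assumption: parsed --option key:value pairs land in a dict on the
    # config object; the attribute name `options` is illustrative only.
    valueset = config.options.get("valueset", "default")
    return (
        "CREATE TABLE study__codes AS "
        f"SELECT * FROM valuesets WHERE name = '{valueset}'"
    )


# Simulating what the CLI might pass in for `--option valueset:opioid`:
query = build_valueset_query(SimpleNamespace(options={"valueset": "opioid"}))
```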

Use cases

Working with template SQL

For validating SQL, we use Jinja templates to generate SQL in a repeatable manner. We don’t expect you to write these templates - instead, the template function library lets you provide arguments to these templates to generate standard types of SQL tables, along with templates targeted at bespoke operations. But you can write study-specific templates if you have a complex use case. The Core study, as an example, has study-specific templates to generate flat tables from nested FHIR tables.

When you’re thinking about a query that you’d need to create, first check the template function library to see if something already exists. Basic creation and inspection queries should be covered, as well as unnestings for some common FHIR objects.
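To illustrate the general idea (this is a toy example, not one of the library’s actual templates), here is a small Jinja template that renders a CREATE TABLE statement from a few arguments:

```python
from jinja2 import Template

# Toy template in the spirit of the library's template functions; the real
# templates ship inside cumulus_library and are invoked via the template
# function library rather than written inline like this.
CTAS_TEMPLATE = Template(
    "CREATE TABLE {{ table_name }} AS\n"
    "SELECT {{ columns | join(', ') }}\n"
    "FROM {{ source_table }}"
)

sql = CTAS_TEMPLATE.render(
    table_name="study__flat_condition",
    columns=["id", "code", "recordeddate"],
    source_table="condition",
)
```

Rendering through validated templates like this keeps quoting and column lists consistent across every table a study creates.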

Querying nested data

Are you trying to access data from deep within raw FHIR tables? I’m so sorry. Here’s an example of how this can get fussy with code systems:

A FHIR coding element may be an array, or it may be a singleton, or it may be a singleton wrapped in an array. It may be fully populated, or partially populated, or completely absent. There may be one code per record, or multiple codes per record, and you may only be interested in a subset of these codes.
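To make those permutations concrete, here they are as Python dicts, along with a small normalizer of the kind your queries ultimately have to emulate in SQL. The specific codes and systems are just illustrative.

```python
# The same logical Coding can arrive in several shapes across sites.
singleton = {"code": {"coding": {"system": "http://snomed.info/sct", "code": "44054006"}}}
wrapped_singleton = {"code": {"coding": [{"system": "http://snomed.info/sct", "code": "44054006"}]}}
array = {"code": {"coding": [
    {"system": "http://snomed.info/sct", "code": "44054006"},
    {"system": "http://hl7.org/fhir/sid/icd-10-cm", "code": "E11.9"},
]}}
partial = {"code": {"coding": [{"code": "44054006"}]}}  # system absent
absent = {}  # no code element at all


def codings(resource: dict) -> list[dict]:
    """Normalize all the shapes above into a flat list of coding dicts."""
    coding = resource.get("code", {}).get("coding", [])
    if isinstance(coding, dict):  # bare singleton: wrap it in a list
        coding = [coding]
    return coding
```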

This means you may have differing schemas in Athena from one site’s data to another (especially if they come from different EHR systems, where implementation details may differ). To handle this, you need to define a standard output representation that accounts for all the permutations you encounter, and conform the data to match it. The encounter and condition builders both jump through hoops to get this data into flat tables for downstream use.

This is a pretty open-ended design problem, but based on our experience, your prepare_queries implementation should attempt the following steps:

  • Check if your table has any data at all
  • If it does, inspect the table schema to see if the field you’re interested in is populated with the schema elements you’re expecting
    • If yes, it’s safe to grab them
    • If no, you will need to manually initialize them to an appropriate null value
  • If you are dealing with deeply nested objects, you may need to repeat the above more than once
  • Write a Jinja template that handles the conditionally present data, and a template function to invoke that template
  • Test this against data samples from as many different EHR vendors as you can
  • Be prepared to need to update this when you hit a condition you didn’t expect
  • Create a distinct table that has an ID for joining back to the original
  • Perform this join as appropriate to create a table with unnested data
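The first few steps above can be sketched as follows. The three helper callables are hypothetical placeholders for the data-check, schema-inspection, and template logic, not real cumulus-library functions:

```python
# Expected schema elements for the nested field we want to unnest.
EXPECTED_FIELDS = ["system", "code", "display"]


def build_unnest_queries(table_has_data, get_column_schema, render_unnest_template):
    """Return the query list a prepare_queries implementation might build.

    All three arguments are hypothetical helpers:
      - table_has_data(table) -> bool
      - get_column_schema(table, column) -> list of schema element names
      - render_unnest_template(table, present_fields) -> SQL string
    """
    queries = []
    if not table_has_data("condition"):
        # No data at all: create an empty table with the expected columns.
        queries.append(render_unnest_template("condition", []))
        return queries
    # Inspect the schema to see which expected fields actually exist;
    # fields missing from the schema get initialized to NULL by the template.
    present = [f for f in EXPECTED_FIELDS if f in get_column_schema("condition", "code")]
    queries.append(render_unnest_template("condition", present))
    return queries


# Example wiring with stub helpers standing in for real schema inspection:
queries = build_unnest_queries(
    lambda table: True,
    lambda table, column: ["system", "code"],
    lambda table, fields: f"unnest {table}: {fields}",
)
```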

You may find it useful to use the --builder [filename] sub-argument of the CLI build command to run just your builder while iterating. The Sample bulk FHIR datasets can provide an additional testbed database beyond whatever you produce in house.

Add this builder to the table_builder_config section of your manifest.toml - this will make the data available for downstream queries.

Good luck! If you think you’re dealing with a pretty common case, you can reach out to us on the discussion forum and we may be able to provide an implementation for you, or provide assistance if you’re dealing with a particular edge case.