Package API reference
The following describes at a high level some of the classes available to help you build studies programmatically. They are all importable directly from the base cumulus_library
package.
This is aimed at clinical study usage, rather than data prep usage, so some of the fussier bits will be glossed over
BaseTableBuilder
self, manifest: study_manifest.StudyManifest | None = None
The BaseTableBuilder
class is what every python based builder inherits from. It is designed to help wrangle the logistics of going from a query to executing that query against a database.
A BaseTableBuilder has two class variables:
queries
: A list to append queries to. display
: Text displayed while the builder is running in the CLI
A StudyManifest
is available if, for some reason, you’d like to access study information at setup. For most clinical use cases, you can safely ignore this.
prepare_queries
self, config: base_utils.StudyConfig, manifest: study_manifest.StudyManifest,*args,**kwargs,
A BaseTableBuilder
has one method you are meant to implement, prepare_queries
, whose job is to create SQL however it makes sense for your use case. These queries can be stored in the class’s queries
variable, which a different method, execute_queries
, will invoke before running those queries against the database. You should not (and indeed, you may get a warning about it if you try) override execute_queries
- just assume that it’s there and doing whatever you need it to do.
So, using BaseTableBuilder
might look as simple as this:
from cumulus_library import BaseTableBuilder, StudyConfig, StudyManifest
class MyBuilder(BaseTableBuilder):
def my_custom_function(table_name:str):
return f"CREATE TABLE study__table AS (SELECT * FROM core__{table_name})"
def prepare_queries(self,
config: StudyConfig,
manifest: StudyManifest,
*args,
**kwargs,
):
self.queries.append("CREATE TABLE study__table AS (SELECT * FROM core__patient)")
for resource in ['encounter', 'condition']:
self.queries.append(self.my_custom_function(resource))
write_queries
self, path: pathlib.Path | None = None
You don’t need to write your queries to disk - this will build them dynamically on the fly, so that they are always up to date with your builder. If you want to see the query for debugging, you can use write_queries
to write files to disk (or, you can use the generate-sql
subcommand of the cumulus-library
CLI):
import pathlib
from cumulus_library import BaseTableBuilder, StudyConfig, StudyManifest
class MyBuilder(BaseTableBuilder):
def prepare_queries(self,
config: StudyConfig,
manifest: StudyManifest,
*args,
**kwargs,
):
self.queries.append("CREATE TABLE study__table AS (SELECT * FROM core__patient);")
self.write_queries(pathlib.Path.cwd() / "output.sql")
comment_queries
self, doc_str:str | None=None
If you want to make query output a bit nicer for human eyes, comment_queries
will insert a divider between each query. if doc_str
is provided, it will be appended to the front of a list of queries (generate-sql
uses this for its disclaimer).
So, a usage like this:
import pathlib
from cumulus_library import BaseTableBuilder, StudyConfig, StudyManifest
class MyBuilder(BaseTableBuilder):
def prepare_queries(self,
config: StudyConfig,
manifest: StudyManifest,
*args,
**kwargs,
):
self.queries = ["SELECT foo FROM TABLE;", "SELECT bar FROM TABLE;"]
self.comment_queries()
self.write_queries(pathlib.Path.cwd() / "output.sql")
Would produce output like this:
SELECT foo FROM TABLE;
-- ###########################################################
SELECT bar FROM TABLE;
StudyConfig
db: databases.DatabaseBackend,schema: str, drop_table: bool = False, force_upload: bool = False, verbose: bool = False, stats_build: bool = False, stats_clean: bool = False, umls_key: str | None = None, options: dict | None = None
The StudyConfig object is passed to every class based off BaseStudyBuilder. It’s a repository of things that the CLI may think a study needs to perform a particular action.
The following fields are relevant to clinical studies. The rest can be ignored, as they are primarily intended for query execution: db: an interface to the database in question. DatabaseBackend
is beyond the scope of this document, but if you need to inspect a database to dynamically change a query, db.cursor()
will return a python PEP-249 compatible database cursor you can pass queries to. schema The name of the database schema being connected to. options This is a dictionary of custom options that can be passed at the CLI. If you need to allow user-determined behavior for your study, this is where they would be accessed.
StudyManifest
study_path: pathlib.Path | None = None, data_path: pathlib.Path | None = None, *, options: dict[str, str] | None = None
A StudyManifest
represents the contents of a manifest.toml
file. It also contains a number of convenience methods for accessing data from the study manifest.
You don’t need to worry about creating one of these objects - you can safely assume that your builder will be passed one of these via the CLI. We’re only including methods that are relevant to clinical studies, rather than data prep/execution.
See Creating studies for more info about what goes into a manifest and why.
get_study_prefix
self -> str | None Returns the prefix defined in the manifest, if present
get_dedicated_schema
self -> str | None Returns the dedicated schema defined in the manifest, if present
CountsBuilder
self, study_prefix: str | None = None, manifest: study_manifest.StudyManifest | None = None
A CountsBuilder
extends BaseStudyBuilder
. Its primary feature is providing a repeatable way to create cube tables for export to a coordinating site. For clinical studies, every table you export should be created with this class.
The CountsBuilder
makes the following assumptions:
- The table you are counting contains the resource refs from the core tables relevant to the particular subject
- Conditions are counted by related
encounter_ref
, since conditions often occur more than once for an encounter to support billing, and we will automatically join the encounter table for purposes of getting this other countable ref. - Document references, encounters, and observations will count their own associated IDs
- All other resources will count patients
- Conditions are counted by related
count_[resource]
self, table_name: str, source_table: str, table_cols: list, where_clauses: list | None = None, min_subject: int | None = None, annotation: counts_templates.CountAnnotation | None = None,) -> str
Every cumulus-supported resource has a similarly structured count generator function. Just select the type that’s appropriate for the resource you’re interested in.
In detail, the expected arguments are as follows:
- table_name: The name of the counts table to create. It must start with study prefix.
- source_table: The table to create counts data from.
- table_cols: The columns from source_table to add to the count table
- where_clauses: An array of where clauses to use for filtering the data
- min_subject: An integer setting the minimum bin size, for masking small sample size sets to help preserve patient anonymity. Note: if you define where_clauses, this is not used, and you should provide an equivalent method of binning patients
- annotation: An external source to use for adding metadata to a counts table. See
CountAnnotation
below for more info.
A count generator returns the function created from the counts template.
Here’s an example of a CountsBuilder
in use:
from cumulus_library import CountsBuilder, StudyConfig, StudyManifest
class MyBuilder(CountsBuilder):
def prepare_queries(self,
config: StudyConfig,
manifest: StudyManifest,
*args,
**kwargs,
):
self.queries.append(
self.count_condition(
'my_study__condition_count',
'core__condition',
['code','recorddate_year']
)
)
This would count instances of condition codes by year.
CountAnnotation
field: str, join_table: str, join_field: str, columns: list[tuple[str, str | None]]
A CountAnnotation object can be supplied to a count[resource]
function to indicate a table that should be joined to it post-counting. The intended purpose of this is to allow metadata about code systems to be joined without including them in the cubing function, which adds to both size and runtime.
In detail, the expected arguments are as follows:
- field The field from the count table to target for the join.
- join_table The table to use as a source of annotation data.
- join_field The field from join_table to join against field
- columns A list of tuples, describing column to join from join_table. The first value in the tuple is the name of the column or a string literal, and the second value, if supplied, is the alias to use for that column.
Using our example from CountsBuilder
above, here’s an example of how we’d use this to annotate data:
from cumulus_library import CountAnnotation CountsBuilder, StudyConfig, StudyManifest
class MyBuilder(CountsBuilder):
def prepare_queries(self,
config: StudyConfig,
manifest: StudyManifest,
*args,
**kwargs,
):
self.queries.append(
self.count_condition(
'my_study__condition_count',
'core__condition',
['code','recorddate_year']
annotation=CountAnnotation(
field='code',
join_table='"umls"."icd10_tree"',
join_column='code',
columns=[
("'icd_10'","system"),
("str", None)
]
)
)
)
This would add the string ‘icd_10’ in a column named system
, and the UMLS display value in a column named str
, as it is in the source.
get_template
filename_stem: str, path: pathlib.Path | None = None, **kwargs: dict
We use Jinja in cases where we want to dynamically build out SQL statements. Jinja itself is beyond the scope of this document, but as a simple example, a template like this:
SELECT * FROM my_study__;
Would allow you to pass a table_name into this template, and thus reuse it in multiple places. get_template
abstracts away all the nuance of jinja environment loading, and just allows you to pass a number of variables into a template for execution.
In detail, the expected arguments are as follows:
- filename_stem The filename of your template, before the first period (since there are a number of different conventions for jinja file extensions)
- path The location of the template; if not provided, we assume it’s in the same location as the file invoking it.
- **kwargs We don’t specify these to allow for customizibility, but any additional keyword arguments will get passed into the template for usage.
So, given the query above in a file named example_select.sql.jinja
, you could use in in a builder in the same directory as follows:
from cumulus_library import BaseTableBuilder, get_templates, StudyConfig, StudyManifest
class MyBuilder(BaseTableBuilder):
def prepare_queries(self,
config: StudyConfig,
manifest: StudyManifest,
*args,
**kwargs,
):
self.queries.append(get_templates('example_query', table_name='condition'))