This is a NodeJS library for working with bulk data in different formats, mostly for converting the data between those formats. It also includes utility functions for reading directories, parsing, and other related tasks.
The library is written in TypeScript and compiled to JavaScript. It is not currently published to NPM, so it should be used via GitHub:

git clone https://github.com/smart-on-fhir/bulk-data-tools.git

Then require what you need from the build/src folder, or import it directly from /src if you are using TypeScript.
In order to simplify conversions between data formats, we handle the data through collection instances. A collection is an abstract representation of the underlying data, regardless of how that data was obtained. The collections have entries and lines iterator methods that iterate over the entries without having to maintain everything in memory. The entries() method yields JSON objects and the lines() method yields format-specific strings.
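To illustrate the idea (this is a simplified sketch, not the library's actual implementation), generator functions let a consumer pull one record at a time instead of materializing the whole data set:

```javascript
// A minimal sketch of the entries()/lines() idea for NDJSON input.
function* lines(ndjsonText) {
    // lines() yields format-specific strings (here: raw NDJSON lines)
    for (const line of ndjsonText.split(/\r?\n/)) {
        if (line.trim()) yield line;
    }
}

function* entries(ndjsonText) {
    // entries() yields one JSON object per line
    for (const line of lines(ndjsonText)) {
        yield JSON.parse(line);
    }
}

const ndjson = '{"id":1}\n{"id":2}\n{"id":3}';
for (const entry of entries(ndjson)) {
    console.log(entry.id); // 1, 2, 3 - one object at a time
}
```

Because the generators are lazy, nothing is parsed until the consumer asks for the next entry, which is what makes this pattern suitable for large inputs.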
NDJSON collections have one entry for each input line. If created from a directory that contains multiple NDJSON files, all of those files will be combined into a single collection.

JSON collections typically have one entry. If created from a directory that contains multiple JSON files, or from an array containing multiple objects, all of those files/objects will be combined as entries of a single collection.

Delimited collections represent delimited (CSV, TSV, etc.) data. These collections have one entry for each input line.
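For example, a toy hand-rolled parser (NOT the library's parser; it assumes simple, unquoted values) shows how delimited input lines map to collection entries:

```javascript
// The first line is the header; every following line becomes one entry
// object, keyed by the header columns.
function parseDelimited(text, delimiter = ",") {
    const [headerLine, ...rows] = text.trim().split(/\r?\n/);
    const header = headerLine.split(delimiter);
    return rows.map(row => {
        const values = row.split(delimiter);
        return Object.fromEntries(header.map((h, i) => [h, values[i]]));
    });
}

const entries = parseDelimited("id,name\n1,John\n2,Jane");
// entries[0] -> { id: "1", name: "John" }
```

Passing "\t" as the delimiter handles TSV the same way, which is why the document treats CSV and TSV as one "delimited" family.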
Working with bulk data implies that we have to deal with lots of files (or with big ones). The code of this library is written in a way that provides a balance between performance and simplicity.
In some cases we assume that the input or output might be big and use iterators to handle the data one entry at a time. Such cases are:
Collection.fromDirectory(...)
In other cases we know that the data is not that big:

Collection.fromString(...), Collection.fromStringArray(...) and Collection.fromArray(...) imply that the string or array argument is already available in memory.
Collection.toString(...), Collection.toStringArray(...) and Collection.toArray(...) imply that the caller requires the result as a whole (in memory).

There are classes to represent collections in each data format: json, ndjson, csv and tsv. To convert the data, follow these simple steps (example from CSV to anything):
// 1. Create a collection for the input data - one of:
const input = DelimitedCollection.fromString(csvString);         // parse a string as CSV
const input = DelimitedCollection.fromStringArray(csvRows);      // parse an array of strings as CSV rows
const input = DelimitedCollection.fromArray(rowObjects);         // load from an array of row objects
const input = DelimitedCollection.fromFile("/path/to/file.csv"); // load from a file
const input = DelimitedCollection.fromDirectory("/path/to/dir"); // load from a directory

// 2. Then "export" it to whatever you need:
const output = input.toString();      // CSV string
const output = input.toStringArray(); // array of CSV row strings
const output = input.toArray();       // array of CSV row objects
const output = input.toJSON();        // JSON string
const output = input.toNDJSON();      // NDJSON string
input.toFile("/path/to/output.csv");  // write to a CSV file
In addition to the collection classes, this library comes with a set of global functions that can be useful for related tasks. Some interesting examples are:
Working with CSV and TSV

Working with files and directories

Some of the directory-reading helpers accept a filter parameter. Only .json and .ndjson files are parsed: one JSON object is yielded for each line of an NDJSON file and one object for each JSON file. Other files are ignored.

Working with JSON objects
const fs = require("fs");
const lib = require("bulk-data-tools/build/src"); // adjust the path to where the repo was cloned

const files = lib.filterFiles("/path/to/dir", /\.json$/i);
for (const file of files) {
    const json = JSON.parse(fs.readFileSync(file, "utf8"));
    json.lastModified = Date.now();
    fs.writeFileSync(file, JSON.stringify(json));
}
// readLine yields one line at a time without loading the whole file
for (const line of lib.readLine("/path/to/big/file")) {
    console.log(line);
}
const fs = require("fs");
const { DelimitedCollection } = require("bulk-data-tools/build/src"); // adjust the path as needed

const input = DelimitedCollection.fromDirectory("/path/to/dir");
const outPath = "/path/to/output.ndjson";

// Note that we DO NOT use toNDJSON() because the result might be big. Instead, we
// iterate over entries() which will handle rows one by one and will not consume
// a lot of memory!
let lineCount = 0;
for (const entry of input.entries()) {
    fs.appendFileSync(
        outPath,
        (++lineCount === 1 ? "" : "\r\n") + JSON.stringify(entry)
    );
}
const fs = require("fs");
const { NDJSONFile } = require("bulk-data-tools/build/src"); // adjust the path as needed

const entries = new NDJSONFile("/path/to/ndjson").entries();
let lineCount = 0;
for (const entry of entries) {
    fs.writeFileSync(
        `/base/path/file-${++lineCount}.json`,
        JSON.stringify(entry)
    );
}
The bulk_data executable can be used in the terminal to convert data between different formats.
Examples:
# Convert CSV file to NDJSON
node bulk_data --input path/to/file.csv --output-type ndjson
# Convert NDJSON file to CSV
node bulk_data --input path/to/file.ndjson --output-type csv
Note that the examples will output their result to the terminal. You can append > filename to the command to write the result to a file.
For the full list of possible conversions see tests/bin.test.ts.
CLI parameters:

--input - Path to input directory or file.
--input-type - The type of input (json, ndjson, csv, tsv or auto). Defaults to auto, which means the input type can be omitted and will be detected based on the file extension of the file passed as --input. If --input is a directory, then --input-type is required and cannot be auto.
--output-type - The type of output (json, ndjson, csv or tsv).
--eol - The line separator to use when the output-type is delimited (csv or tsv). Can be CRLF (\r\n) or LF (\n). Defaults to CRLF.
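As a small illustration of what --eol controls (a sketch independent of the CLI itself), the option only changes the string used to join the output rows:

```javascript
// The separator used to join delimited output rows.
const rows = ["id,name", "1,John", "2,Jane"];

const crlfOutput = rows.join("\r\n"); // --eol CRLF (the default)
const lfOutput   = rows.join("\n");   // --eol LF
```

CRLF is the safer default for CSV consumers like Excel, while LF produces slightly smaller files for Unix tooling.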