Set Up a Pipeline

If your project is configured to use analysis pipelines, automated procedures will ensure proper processing of newly ingested data. Typically a pipeline takes care of a number of steps, as shown in the following diagram:

Pipelines typically start with an ingest process and are triggered automatically. The exact number of additional steps and the processing details in each step depend on the assay type. However, all pipelines have the following in common:

A verification step. This may include verifying that the data has the expected format and doesn't have illegal values. There might also be a scientific quality control (QC) step involved that is specific to the assay.
Sample association. Assay data is associated with the samples it belongs to. This allows a researcher to search for the data at a later stage by looking for metadata on the sample, or on the subject or cohort associated with that sample.
Data preservation. Data sets/files are preserved in their original state as they move through the pipeline. If new data is generated, a new, separate data set/file is saved. In all cases, data is moved to the appropriate location. For instance, if new data is generated by the pipeline, the raw data will be archived.
Compartmentalization. Each pipeline is broken up into a number of substeps. If the science behind a particular analysis approach used in a pipeline changes, it may be possible to rerun only a few substeps instead of the entire pipeline.
Results files produced by the pipelines can be read into an IDE or further interpreted using visualization tools.
The specific pipelines available are described above as part of the section Advanced Search - Pipeline Result Files.

Configuring and Using Project Stores

A project may be configured so that certain data sets uploaded to the associated watch folder are made available to the project store instead of being picked up by automated pipelines. This configuration is intended, for example, for interesting pilot or other experimental data not suited for pipelines. Project stores are project-specific stores that enable analysts to load data directly into the IDE, associate files with samples, add user tags to describe the data, mark files that are no longer necessary for deletion, and upload derived insights for storage in HISE.
If you use a watch folder to ingest files, some of which need to go to analysis pipelines and others to a project folder, the support group can set this up, provided the files can be distinguished at ingest by file type. As an alternative, support can create a dedicated watch folder whose content always goes to the project folder. Please contact immunology-support@alleninstitute.org if you need a watch folder set up or configured to send data to the project store.
To use a project store it is recommended that you ingest files using the following method:

Create a "manifest.csv" file defining all the files you are ingesting, their file types, and the sample reference
Create a single tar file containing the files plus the manifest file
Ingest the tar file into the watch folder

The following is an example of a "manifest.csv" file:
-----------------------------------------------------
accountGuid: <uuid>
projectGuid: <uuid>
file, samples, fileType
myFile.pdf, KT1002; KT3004, report file
myDir/myFile.rds, reference, rds file
----------------------------------------------------
Note that for the manifest.csv file:
The first two lines contain the account and project this data should belong to
The third line is a static line "file, samples, fileType"
The remaining lines define the name of the file, the associated sample, and the file type of the file
Note that if a file is not associated with a sample, the special keyword "reference" can be used.
If the file type that you want to declare does not exist in the project, please contact immunology-support@alleninstitute.org to have it declared.
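The three steps above can be sketched in Python as follows. This is a minimal illustration, not an official HISE tool: the function names are hypothetical, the "<uuid>" values are the placeholders from the example and must be replaced with your own account and project GUIDs, and it assumes that an uncompressed tar with manifest.csv at the top level is acceptable. Please confirm these details with support before relying on them.

```python
import io
import tarfile
from pathlib import Path


def build_manifest(account_guid, project_guid, entries):
    """Return manifest.csv text in the layout of the example above.

    entries: iterable of (file path, list of sample ids, file type).
    An empty sample list is written as the keyword "reference".
    """
    lines = [
        f"accountGuid: {account_guid}",
        f"projectGuid: {project_guid}",
        "file, samples, fileType",  # static third line
    ]
    for path, samples, file_type in entries:
        sample_field = "; ".join(samples) if samples else "reference"
        lines.append(f"{path}, {sample_field}, {file_type}")
    return "\n".join(lines) + "\n"


def build_ingest_tar(tar_path, base_dir, manifest_text, entries):
    """Write one tar file holding manifest.csv plus every listed file.

    Assumption: manifest.csv sits at the top level of the archive.
    """
    with tarfile.open(tar_path, "w") as tar:
        data = manifest_text.encode("utf-8")
        info = tarfile.TarInfo(name="manifest.csv")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
        for path, _samples, _file_type in entries:
            # Keep the relative path used in the manifest as the archive name.
            tar.add(Path(base_dir) / path, arcname=path)


# The two files from the example manifest above.
example_entries = [
    ("myFile.pdf", ["KT1002", "KT3004"], "report file"),
    ("myDir/myFile.rds", [], "rds file"),
]
print(build_manifest("<uuid>", "<uuid>", example_entries))
```

Calling build_ingest_tar("ingest.tar", ".", manifest_text, example_entries) would then produce the single tar file to place in the watch folder, provided the listed files exist under the given base directory.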

CBC Results Setup

Lab results can be submitted to HISE and become part of the metadata of the sample. Results can be found and displayed in the Data Availability Dashboard, and can be used in Advanced Search to select data. Lab results can also be read in the IDE, either embedded along with the data of a result file or separately as file descriptors only.

Lab results are ingested as CSV files. The ingest pipeline checks for missing and illegal values, normalizes the results to adhere to common names and value range standards, and associates the results with the correct samples.

Before you can ingest CBC data in your project, project configuration is needed. There are two available options.

In most cases, the nature of the expected CBC data can be defined straightforwardly. This includes:

What patient and sample should the data be associated with?
What is the name of the column in the CSV file the data is ingested as, and is there a friendly name you'd rather use in HISE?
What is the unit of measurement and, if applicable, the (normal) range of values?

If this is the case, you can work with immunology-support to define the expected data upfront, after which you can start ingesting the actual data.

If the nature of the expected CBC data is complex, extended project setup and configuration are needed. This is the case, for instance, when data-delivering research partners use different diagnostics labs that produce results with different naming conventions and ranges, so that data harmonization is needed.

In both cases, data may be submitted partially in separate ingests, and CBC data is incrementally added. It is also possible to cancel out or overwrite data (see below).

Please contact immunology-support@alleninstitute.org if you need a HISE project to be "CBC results enabled".

Troubleshooting CBC Results

I need to retry a lab results ingest after a previous error. How do I do that?
After you have fixed the problems with the old lab results, you can ingest the updated lab results CSV file.

I want to overwrite previously ingested CBC data. How do I do that?
If you want to submit a payload that includes overwriting pre-existing data, make sure that the file name includes “retry” or “Retry” or “RETRY” or “ReTrY”.
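As a small illustration of the naming rule above, the following Python sketch checks whether a file name carries the marker and adds one if it does not. It rests on two assumptions that the text does not confirm: that the marker is matched regardless of capitalization (the four spellings listed suggest this), and that it may appear anywhere in the file name. Both function names and the "_retry" suffix placement are hypothetical.

```python
from pathlib import PurePath


def is_overwrite_file(file_name: str) -> bool:
    """True if the file name (not its directory) contains "retry".

    Assumption: capitalization does not matter.
    """
    return "retry" in PurePath(file_name).name.lower()


def with_retry_marker(file_name: str) -> str:
    """Return a name that includes the marker, adding "_retry" before
    the extension when it is missing. The placement is an assumption."""
    if is_overwrite_file(file_name):
        return file_name
    p = PurePath(file_name)
    return str(p.with_name(f"{p.stem}_retry{p.suffix}"))


print(with_retry_marker("cbc_results.csv"))
```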

Configuring Surveys/Questionnaires

Survey data can be submitted to HISE and becomes part of the metadata of the sample. Results can be found and displayed in the Data Availability Dashboard, and can be used in Advanced Search to select data. Survey data can also be read in the IDE, either embedded along with the data of a result file or separately as file descriptors only.

Questionnaire data is ingested as CSV files. The ingest pipeline checks for missing and illegal values, aligns the data with the original questionnaire, and associates the results with the correct samples.

Before you can ingest survey data in your project, project configuration is needed. Specifically, the design of your questionnaire or questionnaires needs to be declared in the project so that survey data can be aligned with the (correct) questionnaire.

We support REDCap's data dictionary to define the design of the questionnaire(s), and CSV exports of the questionnaire data to ingest results via watch folders.

Please contact immunology-support@alleninstitute.org if you need a HISE project to be "Survey data enabled".

Configuring Patient History (EMR) data

Patient history (EMR) data can be submitted to HISE and becomes part of the metadata of the subject. A subject's patient history may span multiple hospital visits. Each separate visit can be denoted with a visit date, which can contain either only a year or a year and month. Additionally, each separate visit must contain a "days since first research visit" number, measured in days. For instance, if the first hospital visit was 14 days prior to the first research visit, this number will be -14. If multiple hospital visits are recorded in the same year and month, the "days since first research visit" number serves to distinguish these visits.
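The offset described above is plain date arithmetic. The following Python sketch shows one way to derive it when preparing a CSV file; the function names are hypothetical, the dates are made up for illustration, and the "YYYY-MM" text format for the visit date is an assumption (the project's declared scheme determines the actual format).

```python
from datetime import date


def days_since_first_research_visit(visit: date, first_research_visit: date) -> int:
    """Signed number of days from the first research visit to this visit.

    Negative values mean the visit happened before the first research visit.
    """
    return (visit - first_research_visit).days


def visit_label(visit: date, include_month: bool = True) -> str:
    """Reduce a full date to year, or year and month (format is an assumption)."""
    return visit.strftime("%Y-%m") if include_month else visit.strftime("%Y")


first_research = date(2021, 3, 15)
hospital_visits = [date(2021, 3, 1), date(2021, 3, 29)]

# Both visits share the label for March 2021; the offset tells them apart.
for v in hospital_visits:
    print(visit_label(v), days_since_first_research_visit(v, first_research))
```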

The EMR data can be found and displayed in the Data Availability Dashboard, and can be used in Advanced Search to select data. The metadata can also be read in the IDE.

EMR data is ingested as CSV files. The ingest pipeline checks for missing and illegal values and associates the results with the correct subject and hospital visit.

Before you can ingest data in your project, project configuration is needed. Specifically, the scheme of expected patient history data needs to be declared in the project so that the data can be verified and treated properly.

Please contact immunology-support@alleninstitute.org if you need a HISE project to be "patient history enabled".

Demographics Data

Subject demographics and information about the associated samples are typically provided as part of the manifest delivered to the HISE wet lab. This data is automatically transferred to HISE and available there. However, it is possible to submit some demographics data via a watch folder instead, e.g., as part of survey data.

Please contact immunology-support@alleninstitute.org if you have questions about demographics data submission.

Pipeline Data Curation

Select users will have access to pipeline data curation options.

Pipeline Approvals

Assay pipelines that require approval of the results before they are released to the Research Space can be reviewed here. There are specialized views for flow cytometry and sequencing pipelines.

Duplicate Batch Identification

In cases where the same pipelines are rerun on the same raw data (because a prior run was rejected and/or because an error occurred), any results of older identical runs are automatically hidden.

However, in other cases, two different runs are duplicates only conceptually and actually start with different raw data sets. For instance, the wet lab may decide to create a second raw data set of the same underlying material, e.g., because of an irregularity prior to ingestion in HISE or to resequence data at a new preferred depth. In these cases, HISE cannot automatically recognize these sets as potential duplicates. Instead, select users can use the same approval views to mark certain runs as duplicates, thereby removing the duplicate from the Research Space. This includes removing the output results as well as QC reports and other deliverables that were produced by the pipeline for the Research Space.