Submit and Monitor Pipeline Batches

Where to Submit Batches: Watch Folders

Watch Folders are the Google Buckets we use to upload files into HISE.

Finding a Watch Folder

To find the correct Watch Folder for your batch files:

1. Under the Username drop-down, select "Environment".

2. On the Environment page, select the Account and Project you wish to use.

3. Open the Username drop-down again and select "Watch Folders".

4. On the Watch Folders page, scroll down to view information on each Watch Folder in the selected Project. Find the Watch Folder for the Partner you are working with, and click the link in the "Watch Folder" column.

5. The link directs you to a Google Bucket where you can upload all of the files required to run a pipeline batch.

Submitting Batches

The First Step: Submission Sheets

To run a batch, first upload a Submission Sheet to the Watch Folder.

There are a few important things to consider before uploading the Submission Sheet:

1. Confirm that the file name syntax matches the current naming convention for your batch’s sequencing type. Incorrect syntax can cause errors during the pipeline run, in which case the Submission Sheet and Tar file must be edited and reuploaded.

For a detailed overview of file syntax, see the Pipeline File Syntax documentation.

2. The content of the Submission Sheet must also follow a standardized syntax, particularly the Type column of the Header Sheet.

Watch out for stray whitespace, which can cause errors in the pipeline.
For Submission Sheet examples and the syntax of their content, see the Pipeline File Syntax documentation.
An example of a Submission Sheet is here
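
Stray whitespace is hard to spot by eye, so a quick command-line check before uploading can help. The following is a minimal sketch, not part of HISE: it assumes you have exported the relevant sheet (for example, the Header Sheet) to a CSV file, and the function name find_stray_whitespace and the file name header_sheet.csv are hypothetical. It prints each line in which a cell begins or ends with whitespace.

```shell
# Hypothetical helper (not a HISE tool): list lines of a CSV export in which
# any cell begins or ends with whitespace.
# Usage: find_stray_whitespace header_sheet.csv
find_stray_whitespace() {
    # Strip Windows carriage returns first so they are not reported as
    # trailing whitespace, then flag whitespace at the start or end of a
    # line or directly beside a comma. grep -n prefixes each hit with its
    # line number.
    tr -d '\r' < "$1" | grep -nE '(^|,)[[:space:]]+|[[:space:]]+(,|$)'
}
```

The function exits with status 0 when it finds at least one suspect line and 1 when the file looks clean. Because it splits naively on commas, a quoted cell that legitimately contains a comma followed by a space is also reported, so treat the output as a prompt for manual review.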

Once you have validated your Submission Sheet’s syntax, navigate to the Watch Folder you selected and click "Upload Files", or drag the file onto the page to upload it to HISE.

You can view your uploaded file data on the Submissions page under the "Data Processing" tab.

Resubmitting a Submission Sheet

If you need to resubmit your Submission Sheet for any reason, first contact the Dev team (madeline.ambrose@alleninstitute.org), who will delete the sheet’s data from the database. Once the data has been deleted, you can upload the sheet to the Watch Folder again.

The Final Step: Tar Files

The final step in starting a pipeline batch run is to submit a Tar file.

Like the Submission Sheet, Tar files have a standardized syntax that is necessary for the pipeline to run correctly.

For details, see the Pipeline File Syntax documentation.

If the Tar file is saved in a Google Bucket, you can use gsutil to transfer the Tar to the Watch Folder much more quickly than by uploading from your local machine.

To use the gsutil command, first install gcloud (see the Gcloud Installation Instructions).
After gcloud is installed, run this command in your Terminal:

gsutil cp gs://yourBucketName/YourFileName gs://WatchfolderBucket
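
After the copy finishes, it can be worth confirming that the file actually arrived in the Watch Folder. The following is a minimal sketch, assuming gsutil is installed and authenticated: the helper names dest_url and copy_and_verify are hypothetical, and the bucket and file names are the same placeholders used in the command above. It copies the file with gsutil cp and then lists the copied object with gsutil ls -l, which prints its size and creation time.

```shell
# Hypothetical helpers (not part of HISE or gsutil).

# Print the URL the file will have in the Watch Folder:
# the destination bucket plus the source object's base name.
dest_url() {
    printf '%s/%s\n' "${2%/}" "${1##*/}"
}

# Copy a Tar file between buckets, then list the copy to confirm it exists.
# The listing only runs if the copy succeeded.
copy_and_verify() {
    gsutil cp "$1" "$2" && gsutil ls -l "$(dest_url "$1" "$2")"
}

# Example call with the placeholder names from above:
# copy_and_verify gs://yourBucketName/YourFileName gs://WatchfolderBucket
```

If the listing reports an error or an unexpected size, the transfer did not complete as intended and should be repeated.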

If your Tar file is saved on your local machine, you can upload it to the Watch Folder by clicking "Upload Files" or by dragging the file onto the page.

Monitoring Batches with the Pipeline Dashboard

The Pipeline Dashboard shows the status, input/output files, and any error logs for each process in a pipeline run.

You will receive an email notification when a pipeline run completes or a process fails.

To navigate to the Pipeline Dashboard page, click on "Data Processing" in the navigation bar and select "Pipeline Dashboard".

Here you can filter by BatchID, PanelID, Sequencing Type, Status, Data Streams, and Submission File.

Click "View Pipeline" to see the details of each process in your Batch’s pipeline run.

If a process has failed during your pipeline run, you can retry it by clicking "Retry Pipeline" under the Status column.

When your Batch reaches the quality control step, some processes will be marked "under_review".

To review the quality control report that is now available, go to the Approval page by selecting "Data Processing", then "Sequencing Pipeline Approval".

Filter by Sequencing Type and Batch ID to find your batch.

Click on the button under the Submission ID column, then scroll down to select "Batch Summary Report".

After reviewing the Batch Summary Report, select the wells you would like to approve, and click "Approve <Submission ID>".

Viewing Pipeline Result Files

After a run is complete, the result files can be accessed through Advanced Search.

To find a Batch’s result files, navigate to the Advanced Search page by selecting "Research" in the navigation bar and clicking on "Advanced Search".

On the Advanced Search page, click the "New Query" button in the top-right corner.

A modal will open where you can select the appropriate Project(s), the type of Query output, and the File types.

Information about your Query appears on a new page. You can filter further by selecting values from the row of drop-down menus near the top of the page. The most commonly used filter is batchId, under the "Sample Metadata" menu.

When you are finished filtering the data, click "View Results" in the bottom right corner.

Re-running Batches

If you need to rerun your batch’s pipeline for any reason:

1. Navigate to the Sequencing Pipeline Approval page by selecting "Data Processing" in the navigation bar and "Sequencing Pipeline Approval" in the drop-down menu. Find your batch, and click the yellow "Flag as duplicate" button.

2. Flagging a run as a duplicate can take some time. We recommend waiting at least 10 minutes, to avoid the errors that occur when a batch is rerun before the first run has been flagged as a duplicate.

3. After the wait, rename your Tar file with a new timestamp, then navigate back to your Project’s Watch Folder and upload the renamed file.

Note: You must change the timestamp in your Tar file’s name; otherwise the file will be rejected as a duplicate upload.
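
The renaming step can be scripted. The naming convention itself is defined in the Pipeline File Syntax documentation and is not reproduced here, so the sketch below rests on an assumption made purely for illustration: that the timestamp is the last underscore-separated field before the .tar extension, written as YYYYMMDDhhmmss. The function name new_tar_name and the example file name are hypothetical; adapt the pattern to the real convention before using it.

```shell
# Hypothetical helper: print OLD_NAME with its final underscore-separated
# field (assumed to be the timestamp) replaced by NEW_TIMESTAMP.
# Usage: new_tar_name OLD_NAME NEW_TIMESTAMP
new_tar_name() {
    base="${1%.tar}"                        # drop the .tar extension
    printf '%s_%s.tar\n' "${base%_*}" "$2"  # swap in the new timestamp
}

# Example (hypothetical file name); date +%Y%m%d%H%M%S gives the current time:
# old="EXAMPLE_batch_20240101120000.tar"
# mv "$old" "$(new_tar_name "$old" "$(date +%Y%m%d%H%M%S)")"
```

Generating the timestamp from the current time at rename makes it unlikely that a rerun reuses the name of an earlier upload.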

Troubleshooting Process Failures

If a process fails, you can debug the error as follows:

1. First, look at the process’s logs. The logs can show which step in the process failed and any errors coming from the R/Python code.

These can be found in the Pipeline Dashboard by clicking on the failed process’s panel, which expands the panel and shows the "View Logs" button.

2. If, after viewing the logs, you still cannot troubleshoot the error, or if the logs are blank, please reach out to our Support Dev (madeline.ambrose@alleninstitute.org) for further assistance.