Getting Your AGC Data
The University of Michigan Advanced Genomics Core generates and distributes over 250 TB of data a year, and that number is growing exponentially. We’ve made a lot of changes in the past few years as we’ve grown from data sets in the KB-range (Sanger) to data in the TB-range (Illumina NovaSeq).
There are two important points for our clients to consider as genomic data grows:
- You are going to get a LOT of data.
- You need a place to store it.
It is your responsibility to get your data and keep it safe. The AGC deletes older data at regular intervals, so you need a plan to receive and store all of your data. To help you, we’ve partnered with U of M’s ITS, ARC, and Michigan Medicine’s HITS to get you the storage and support you need when dealing with many terabytes of data. Check out the Where To Keep Your Data section below to get started.
With the sheer volume of data that we generate and move, our old data delivery methods (email, ftp, and external drives) are no longer feasible. We now deliver all data to all of our clients using the University-supported Globus file sharing system.
AGC Data Retention Policy
It is your responsibility to keep your data safe, secure, and backed up.
We allow 2 weeks, from the time we notify you in MiCores that your data is available, for you to download your data. If the AGC pushed your data to your Globus endpoint, we will keep a copy of your data on our servers for 2 weeks while you ensure it is all there. The best way to ensure you have a complete copy of all of your data is to use the supplied md5 file; see our FAQs below for details on how to use a checksum file.
You must download and keep the entire directory that is shared with you. Although you may not need the raw files for your internal processing, they are commonly required for publishing. Best practice is to keep all of the files you get from the AGC on Data Den, which is why beginning in Fall 2021, the AGC prefers to deliver data directly to your Data Den allocation from the UMRCP. If collaborators, bioinformaticians, or others analyze your data for you, do not assume that they will keep a copy of your data forever. You must ensure you have a safe and secure copy of your data. Again, Data Den from UMRCP covers this for you.
After 2 weeks, we archive some of your raw run data, and we keep that archive for 6 months. At your request, we can pull archived data back, although there will be a charge for that service.
We cannot commit to having any of your data after 6 months. Depending on our current rotation policies, we might be able to recover your data after 6 months, but we make no guarantee. If we can recover the raw data, we may need to reprocess it. The minimum charge for this recovery and reprocessing is 5 hours of AGC Tech Time per run ($175 as of Fall 2021). If your run is especially complex to recover, or you require additional processing (e.g., Cell Ranger, Space Ranger, etc.), there will be additional tech time charges assessed.
Where to Keep Your Data
In fall 2021, ARC rolled out the UMRCP program for nearly all researchers at Michigan, across all campuses. The U-M Research Computing Package includes free data storage in Turbo, archival storage in Data Den, and Great Lakes HPC cluster usage time. Since this program is free, and available to nearly all U-M users of the AGC, we strongly recommend its usage, and are building UMRCP into our standard data delivery processes.
Our current best practice is to deliver genomic data directly to our clients’ Data Den allocation. This requires a few one-time configuration steps that your lab must do using Globus after your lab has signed up for the UMRCP. Specifically, you must create an area in your Data Den allocation where the AGC can write your data.
For detailed instructions on setting up your Data Den so you can receive data pushes from the AGC, click here.
If you are an external, non-U-M client, we will make your data available for you to download (pull) from our Globus endpoint. This requires you to have a Globus account and a Globus endpoint that can receive the data as you pull it. We will grant read access to our Globus endpoint to the submitter’s email address, as well as the PI’s email address. As stated in our data retention policy above, you will have 2 weeks to pull your data before we delete our copy. If you need more time to get things set up, contact us at firstname.lastname@example.org and we will work something out.
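If you are comfortable on the command line, the pull can also be scripted with the Globus CLI instead of the web UI. The endpoint UUIDs, paths, and label below are hypothetical placeholders, and the script is a dry-run sketch: it only prints the command it would submit.

```shell
# Dry-run sketch of pulling an AGC data release with the Globus CLI.
# All UUIDs and paths are hypothetical placeholders -- substitute your own
# (endpoint UUIDs are shown in the Globus web app under "Collections").
AGC_ENDPOINT="aaaaaaaa-0000-0000-0000-000000000000"   # AGC's shared endpoint
MY_ENDPOINT="bbbbbbbb-1111-1111-1111-111111111111"    # your receiving endpoint
REQUEST="3092-AG"                                     # your service-request name

# Build the transfer command; --sync-level checksum re-verifies every file.
CMD="globus transfer $AGC_ENDPOINT:/$REQUEST/ $MY_ENDPOINT:/agc-data/$REQUEST/ \
--recursive --sync-level checksum --label agc-$REQUEST"

# Dry run: print the command. After `globus login`, submit it with: eval "$CMD"
echo "$CMD"
```

Using `--sync-level checksum` makes Globus verify every file after copying, and re-running the same transfer later only re-copies files that failed or changed.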
Michigan Medicine “push to Turbo” Clients
Several of our clients in Michigan Medicine were already using this “push” data delivery model, but had it set up to go directly to their Turbo allocation. We’ve revised our recommendation, and now suggest that all of our clients get their data delivered directly to Data Den. By having the AGC send data directly to Data Den, the lab no longer needs to remember to copy their files to Data Den. And with Data Den allocations at 100 TB (versus 10 TB or less for Turbo), having the AGC push directly to Data Den means it’s less likely that you’ll run out of space. And when you do run out of space, buying additional Data Den storage is 87% cheaper than adding Turbo space.
If your lab is already having AGC push to your Turbo, we recommend following the instructions above to create a new endpoint in your Data Den allocation, and then deleting your Turbo Globus endpoint. (Don’t delete the files; just remove the endpoint from Globus). If you need help, we can help you, and so can the nice folks at ARC.
My Data is on Data Den. Now What?
The good news is that your data is safe. The bad news is that you can’t directly work with files on Data Den, which is after all a tape-based long-term archive system. The OK news is that it’s OK – you can’t really do anything with most sequencing files on your PC anyway.
All of our clients will need to do additional processing of the files from the AGC, and all of that additional processing needs to be done on a Linux-based cluster, such as the Great Lakes HPC. So, the first step is to copy the files off of the backup tape (Data Den), and put them on high-performance drives (Turbo) for analysis. ARC offers a few options for moving files from Data Den to Turbo, and we prefer using Globus. You’ll need to work with ARC to make sure that your Turbo space is configured for Globus access, and then just use the Globus UI to move the files from your Data Den area to your Turbo area. Then you can fire up Seurat or other analysis packages on the Great Lakes HPC. Teaching you how to move files and run bioinformatic analysis packages is beyond the scope of this web page, but ARC has a set of videos to help you get started.
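The Data Den-to-Turbo copy can also be done with the Globus CLI. The endpoint UUIDs and paths below are hypothetical placeholders; this dry-run sketch only prints the command it would run.

```shell
# Dry-run sketch of restoring a data set from Data Den to Turbo via the Globus CLI.
# UUIDs and paths are hypothetical placeholders; find yours in the Globus web app.
DATADEN="cccccccc-2222-2222-2222-222222222222"   # your Data Den endpoint
TURBO="dddddddd-3333-3333-3333-333333333333"     # your Turbo endpoint
REQUEST="3092-AG"

CMD="globus transfer $DATADEN:/archive/$REQUEST/ $TURBO:/active/$REQUEST/ \
--recursive --label thaw-$REQUEST"

# Dry run: print the command; submit it for real (after `globus login`) with eval.
# The real submission prints a task ID; `globus task wait <task-id>` blocks until
# the tape recall and copy finish, which can take a while for large data sets.
echo "$CMD"
```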
Some of the files the AGC delivers are useful on a laptop or workstation – generally small HTML files and a few other special metadata files. To view those on your laptop, you’ll need to use Globus Connect Personal to download just those files from Data Den. We don’t recommend downloading all of the files to your laptop; it will fill your hard drive in a hurry. More info to come in this space – when Data Den updates to Globus V5, the “Download” button may get turned on.
Pulling The Data Yourself with Globus
If your lab wants to pull the data themselves, we can do that. We don’t recommend it because it doesn’t get your data to a safe long-term storage location ASAP. Globus requires some investment of time to use, including creating accounts and ensuring you have an “endpoint” to receive the data files from the AGC’s servers. An “endpoint” is simply the Globus term for “file share”. That is, an “endpoint” is a location to store files that Globus can read and write. You may also know of these as “Network Drives”, “Shared Drives”, or “Mount Points”. We have created a short document to help you get started. Globus’s help section is also a good resource, as is U-Mich’s ARC-TS.
- Submitter Pull (this is what external clients need to do)
- This is what you get if you don’t specify a Globus Group or an endpoint that the AGC can write to.
- When your data is ready, we will put your data in our temporary storage area and grant only the submitter and the PI read-only access. You will be notified by a comment in your MiCores submission.
- The PI or submitter must log in to Globus and transfer the data to your Globus endpoint. You have 2 weeks to copy the data off of our endpoint before we remove your data from our file server.
- Group Pull (gives your lab direct control over who can download your data)
- Your lab creates a Globus Group with the identities of all Globus users who will be able to copy your data from the AGC’s temporary file server. Important: make sure the group is visible to “all Globus users”. This doesn’t make your data public, it just makes the name of your group public.
- When filling out your MiCores submission, enter your group name into the “Additional Globus ID or Unique Name” field.
- When your data is ready, we will put your data in our temporary storage area, and grant that group read-only access, and notify you via a comment in your MiCores submission.
- Someone from that group must log in to Globus and transfer the data to your Globus endpoint. Your team will have 2 weeks to copy the data off of our endpoint before we remove your data from our file server.
Storing Genomic Data on your PC
We (AGC, BRCF, ARC, ITS and HITS) do not recommend storing your full genomic datasets on your laptop, pc, lab workstation, or even lab server. Data Den, Turbo, and Locker are the right places for genomic data files at U-M. If you insist on keeping your data elsewhere, you’re on your own.
Pulling small files from Data Den to your PC
You may want to download a few of the smaller run summary files to look at on your PC or laptop. If you want to get those files to your PC, you will need to use Globus Connect Personal, which allows your Windows or Mac computer to create its own Globus endpoint. Globus.org has quite a bit of help available, including a video about using Globus Connect Personal.
Be careful about downloading all of the files for a run, especially .fastq, .bam, and other very large files. The large files aren’t very useful on a PC/laptop since there are no tools available to manipulate them on Windows or Mac OS. Those files belong on an HPC, where they can be properly analyzed and manipulated with bioinformatic toolkits. In the case of data delivered by the AGC, you will only want web (.html) files, loupe browser files (.cloupe), and other sub-gigabyte files. You should select those files individually using the Globus UI to transfer them to your PC. If you blindly download your entire run, you will probably fill your drive with data that you can’t use anyway.
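If you use the Globus CLI rather than the web UI, a batch file is a convenient way to cherry-pick just the small files. The file paths and endpoint UUIDs below are hypothetical placeholders; this sketch builds the list and only prints the transfer command.

```shell
# Sketch: a Globus batch file naming only the small, laptop-friendly files.
# Each line is "source-path destination-path"; all paths here are hypothetical.
cat > small-files.txt <<'EOF'
/3092-AG/Sample_3092-AG-1/outs/web_summary.html web_summary.html
/3092-AG/Sample_3092-AG-1/outs/cloupe.cloupe cloupe.cloupe
EOF

# Placeholder endpoint UUIDs -- substitute your own.
SRC="aaaaaaaa-0000-0000-0000-000000000000"      # AGC or Data Den endpoint
LAPTOP="eeeeeeee-4444-4444-4444-444444444444"   # your Globus Connect Personal endpoint

# Dry run: print the command that hands the list to Globus. Recent globus-cli
# versions accept `--batch FILE`; older ones read the list from stdin instead.
echo globus transfer "$SRC" "$LAPTOP" --batch small-files.txt
```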
Frequently Asked Questions
What data will I receive from Illumina sequencing?
You will receive demultiplexed fastq files (gzip compressed), containing only the reads from clusters passing the Illumina quality filter. By default, we do not trim the sequencing data. Unless you purchase a full flow cell, we are no longer able to deliver fastq files associated with the unknown bin or raw bcl files.
How is my data structured?
- md5 – Contains checksum information which can be used to confirm integrity and completeness of data transfer.
- DemuxStats – demultiplexing metrics and a summary of how your samples performed (this can be opened in Excel). This file is identical to the file we attach to your MiCores submission at data release.
- fastqs – Stored within a directory labeled fastq_[request-id], .fastq files are sorted into individual folders for each sample.
- analysis – If additional analysis is performed, a directory labeled with the type of analysis and the request id will be included. This directory will contain, for example, the results of the 10x Genomics cellranger pipeline, RNAseq pipeline, etc., for each sample.
- README – Text file containing information about how samples were processed in the lab, information about the sequencing run, and details about the data processing that may be useful for publishing.
Example of the delivered file structure:
fastq_[request-id]
└── Sample_123-SR-1
    ├── 123-SR-1_AAAAAAAA-TTTTTTTT_S1_L000_R1_001.fastq.gz
    └── 123-SR-1_AAAAAAAA-TTTTTTTT_S1_L000_R2_001.fastq.gz
How do I know I have a complete copy of the data?
The first step is to use Globus. A Globus transfer will not succeed unless every file has been successfully moved. Internally at the AGC, we always check the “verify file integrity after transfer” option in Globus’s “Transfer & Sync Options” whenever we move files with Globus.
Secondly, data delivered from the AGC includes an “md5” file. This file contains checksums that can be used to confirm that you have a complete copy of every file after the transfer is complete. The checksum file is usually named “<service-request-name>.md5”; an example would be 3092-AG.md5.
Every operating system has a different command to confirm the checksums. On most Linux-based systems, like the Great Lakes HPC, you would use a process like this:
- cd <service-request-name>
- md5sum --quiet --check <service-request-name>.md5
- Pay attention to any output; the --quiet flag means that the md5sum program will only produce output if there is a problem.
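To see the whole round trip end-to-end, you can exercise these commands on throwaway data. The directory and file names below are made up for illustration:

```shell
# Worked example with throwaway data: build a tiny delivery directory,
# generate an md5 checksum file, then verify it the same way you would
# verify an AGC delivery. All names here are made-up placeholders.
mkdir -p 9999-XX/fastq_9999-XX
printf 'ACGTACGT\n' > 9999-XX/fastq_9999-XX/sample_R1.fastq
( cd 9999-XX && md5sum fastq_9999-XX/sample_R1.fastq > 9999-XX.md5 )

# Verify: with --quiet, md5sum prints nothing unless a checksum fails.
( cd 9999-XX && md5sum --quiet --check 9999-XX.md5 ) && echo "all checksums OK"
```

If any file were truncated or corrupted during transfer, the check would instead print a `FAILED` line for that file and exit nonzero.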
Remember that the AGC only makes your data available for 2 weeks, and that after 6 months our copy of your data is destroyed. We strongly encourage all clients to download their data as soon as possible, and use the provided md5 file to verify each download as soon as possible. If there are issues, a second Globus transfer of the entire directory usually clears up the issues. If you still see checksum issues after re-trying the transfer, contact the AGC.
What data will I receive for Single Cell projects?
- Fastqs – We use bcl2fastq to demultiplex all sequencing data. You will notice that each sample will have 4 fastq directories associated with it (one for each of the 4 barcodes in the 10x barcode set for each sample), named Sample_[number]_[letter]. You will find the normal output (demultiplexed fastqs) in the following path: [run_id]/[service_request_id]/Sample_[number]_[letter]
- Cellranger count output – We run cellranger count on all single cell gene expression samples. Inside the top directory of your download is a directory for each sample by name that contains the results from the count step of the cellranger pipeline. The web_summary.html is likely what you want to look at first. The cloupe.cloupe file can be opened in the loupe browser supplied by 10x. You will be able to find this file in the following path: [run_id]/Sample_[name]/outs
- If your samples are not standard single cell RNAseq, we will run the appropriate supported 10x software for your project type
- At this time we cannot support processing of TotalSeq-A projects. You will only receive demultiplexed fastqs.
Please contact us if you would like assistance interpreting the results produced by cellranger. We will do our best to answer any questions, or we can guide you toward assistance and resources provided by 10x.
What will I receive from my QuantSeq project?
For Lexogen’s QuantSeq library prep, we do initial RNAseq processing including trimming, alignment, and feature counting. With your data delivery, you will receive an additional folder containing the outputs of the trimming and alignment steps as well as count matrices and QC metrics for your data. Upon request, we can also provide code for Lexogen’s BlueBee pipeline for further analysis and making comparisons.
What part of my data should I download?
We strongly recommend downloading the entire directory that is shared with you. Although you may not need the raw files for your internal processing, they are commonly required for publishing.
What if I need additional sequencing for my samples?
If you would like additional sequencing of samples previously sequenced, please contact us with the sample numbers and submission ID. Include the number of additional reads you would like for your samples. We will contact you with an estimate of additional costs.
What if I need Bioinformatics Help?
An initial NGS project consultation with the UM Bioinformatics Core is free.