This page is outdated, please also see the man page.
Important changes in version 1.2
The quan mode in version 1.2 is not compatible with the quantifications generated by the previous versions. This due to bug fixes and slight adjustments to the way we quantify. DO NOT MIX QUANTIFICATIONS GENERATED BY EARLIER VERSIONS OF QTLTOOLS WITH QUANTIFICATIONS FROM VERSION 1.2 AS THIS WILL CREATE A BIAS IN YOUR DATASET.
Other changes include:
- An option to calculate TPM values.
- The combination of options used in quantifying a BAM file are hashed and added to the output file names. When merging multiple quantifications make sure to check that this hash is the same across all files, to ensure that there are no biases in the data.
Filtering of reads in quantifications
Each read will be checked for the following filters in following order. If a read fails one of the filters it will not be checked for the latter filters. This means if a read fails multiple filters it will be counted towards the first filter it fails and you can find these counts in the stats files:
- Filter secondary alignments
- Filter unmapped reads
- Filter reads failing QC if --filter-failed-qc
- Filter duplicate alignments if --filter-remove-duplicates
- Filter reads failing --filter-mapping-quality threshold
- Filter reads not properly oriented and not properly paired if --check-proper-pairing
- Filter reads bases on allowed mismatches based on --filter-mismatch and --filter-mismatch-total
- Filter mate pairs where the first mate is not on the forward strand or the second mate is not on the reverse strand or if the mate pairs align to different chromosomes or if the first read starts before the second (these will be counted as not paired)
How to run quan?
The QTLtools mode quant allows quantifying exons and genes and requires 2 files as input:
- A BAM file containing the sequence data for a RNA-seq experiment
- A GTF file containing the exon and gene annotations that can be downloaded from here
To illustrate how this works, we provide the 4 following example files:
- A RNA-seq BAM file on chr22 for sample HG00381: BAM / index
- A GTF file containing the gencode annotation v19 for chr22: GTF
Then, to perform quantification from paired-end RNA-seq with 49 bp reads aligned with GEM, we recommend to use this command:
QTLtools quan --bam HG00381.chr22.bam --gtf gencode.v19.annotation.chr22.gtf.gz --sample HG00381 --out-prefix HG00381 --filter-mapping-quality 150 --filter-mismatch 5 --filter-mismatch-total 5 --rpkm
This command produces 5 output files HG00381.5zDu4oPgLES.exon.count.bed, HG00381.5zDu4oPgLES.exon.rpkm.bed, HG00381.5zDu4oPgLES.gene.count.bed, HG00381.5zDu4oPgLES.gene.rpkm.bed, and HG00381.5zDu4oPgLES.stats. If you open the gene rpkm file, you'll get something like this:
#chr start end gene length strand HG00381
22 16062156 16062157 ENSG00000233866.1 L=586;T=lincRNA;R=chr22:16062157-16063236;N=LA16c-4G1.3 + 0
22 16076171 16076172 ENSG00000229286.1 L=121;T=pseudogene;R=chr22:16076052-16076172;N=LA16c-4G1.4 - 0
22 16084825 16084826 ENSG00000235265.1 L=578;T=pseudogene;R=chr22:16084249-16084826;N=LA16c-4G1.5 - 0
22 16124972 16124973 ENSG00000223875.1 L=685;T=pseudogene;R=chr22:16100517-16124973;N=NBEAP3 - 0
22 16122719 16122720 ENSG00000215270.3 L=1049;T=pseudogene;R=chr22:16122720-16123768;N=LA16c-60H5.7 + 56.6973588
22 16193003 16193004 ENSG00000206195.6 L=5545;T=processed_transcript;R=chr22:16147979-16193004;N=AP000525.9 - 1.05850612
22 16151396 16151397 ENSG00000271672.1 L=622;T=pseudogene;R=chr22:16150776-16151397;N=DUXAP8 - 9.436360824
22 16154765 16154766 ENSG00000272872.1 L=694;T=sense_intronic;R=chr22:16154073-16154766;N=LL22NC03-N14H11.1 - 0
22 16157078 16157079 ENSG00000232775.2 L=1572;T=pseudogene;R=chr22:16157079-16172700;N=AP000525.10 + 0
This file is a BED file with the following 7 columns:
- 1. The chromosome ID
- 2. The TSS position [0-based]
- 3. The TSS position [1-based]
- 4. The gene ID
- 5. Gene info where L is total exonic length when overlapping exons are merged, T is gene type, R is gene boundaries, N is the gene name
- 6. The strand orientation
- 7. The RPKM value for the sample (here HG00381) at the corresponding gene
Now, if you open HG00381.5zDu4oPgLES.exon.rpkm.bed, you'll get:
#chr start end gene length strand HG00381
...
22 17517459 17517460 ENSG00000237438.2_17517460_17518234 ENSG00000237438.2 + 7.874866966
22 17517459 17517460 ENSG00000237438.2_17525763_17525890 ENSG00000237438.2 + 106.9270201
22 17517459 17517460 ENSG00000237438.2_17528214_17528320 ENSG00000237438.2 + 77.96262006
...
22 17565843 17565844 ENSG00000177663.9_17565844_17566119 ENSG00000177663.9 + 3.009379634
22 17565843 17565844 ENSG00000177663.9_17578687_17578833 ENSG00000177663.9 + 166.3142867
22 17565843 17565844 ENSG00000177663.9_17579665_17579777 ENSG00000177663.9 + 21.7314494
...
For exon quantifications, the output is slightly different. The 4th and the 5th columns give this time, the exonID and the gene it belongs to, respectively. The TSS position is the same for all exons within the same gene. These pieces of information are important for QTL mapping in cis.
The HG00381.5zDu4oPgLES.stats file will contain:
sample HG00381
total_reads 610860
total_secondary_alingments 0
total_unmapped 27706
total_duplicate 0
total_failqc 0
total_mapQ_less_than_150 80608
total_notpaired 1404
total_mismatches_greater_than_5_5 139022
total_merged_reads 160772
total_good 353278
total_exonic 335448
total_exonic_multi_counting 369218
total_exonic_multi_counting_after_merge_(used_for_rpkm) 340749.3782233436
good_over_total 0.5783289133352978
exonic_over_total 0.5491405559375307
exonic_over_good 0.9495298320302991
This file is a stats file with the following lines:
- 1. Sample name
- 2. Number of reads in the BAM file
- 3. Number of secondary alingments
- 4. Number of unmapped reads
- 5. Number of duplicate reads
- 6. Number of reads with the failed QC tag
- 7. Number of reads below the mapping quality threshold (here 150)
- 8. Number of reads failing the mismatches per read and mismatches total filters (here 5 and 5)
- 9. Number of reads where the mate pairs were overlapping and thus were merged
- 10. Number of reads that passed all filters
- 11. Number of reads that aligned to exons
- 12. Number of reads that aligned to exons when we count reads that align to multiple exons multiple times
- 13. Number of reads that aligned to exons when we merge overlapping mate pairs
- 14. Number of good reads over the number of total reads
- 15. Number of exonic reads over the number of total reads
- 16. Number of exonic reads over the number of good reads
To quantify all samples for which you have RNA-seq, proceed as follows:
- A. Run QTLtools quan for each BAM file separately (i.e. in indepeneden jobs)
- B. Extract the annotation columns using cut -f1-6 on one of the file
- C. Extract the per sample quantifications using cut -f7 for each file separately
- D. Build the entire quantification matrix using the paste to merge B with all quantifications of C
- E. Filter all poorly quantified exons/genes using awk for example
The expected results will look like this:
#chr start end gene length strand HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
chr1 29369 29370 ENSG00000227232.4 1800 - 3.574873147 2.159108572 2.656253435 2.147152811 3.060300881 0.4603702761
chr1 135894 135895 ENSG00000268903.1 755 - 1.094681369 0.5444517286 0 0 1.015910092 0
chr1 137964 137965 ENSG00000269981.1 284 - 2.494419821 1.184234794 0.7113553138 0.9226247455 3.437314236 2.122065293
chr1 139378 139379 ENSG00000237683.5 2661 - 1.575143469 1.306025695 0.2530689078 0.3692579309 3.092051181 1.358887358
chr1 267252 267253 ENSG00000228463.4 3954 - 0.07465174813 0.09450983011 0 0.1822382211 0.03526978984 0.209576757
chr1 317719 317720 ENSG00000237094.7 6869 + 0 0.005440265952 0 0.009536520153 0 0.03290143452
chr1 564441 564442 ENSG00000225972.1 372 + 1269.561342 1101.687828 1226.812553 1438.322375 1157.266087 2486.809527
chr1 565019 565020 ENSG00000225630.1 1044 + 1324.772668 1573.121065 1388.501978 1507.901121 1498.158208 2563.208926
chr1 566453 566454 ENSG00000237973.1 1543 + 5898.256933 6113.821775 5275.691308 6484.280259 5137.033507 8913.352248
Hereafter the list of options that can be used to tune the quantification step:
- --out-prefix: Output file prefix
- --rpkm: Output RPKM values
- --tpm: Output TPM values
- --gene-types: Gene types to quantify. (Requires gene_type attribute in GTF. It will also use transcript_type if present)
- --xxhash: Rather than using GTF file name to generate unique hash for the options used, use hash of the GTF file
- --no-hash: Don't include a hash signifying the options used in the quantification in the file names
- --filter-mapping-quality [10]: Minimal phred mapping quality for a read to be considered
- --filter-mismatch [OFF]: Maximum mismatches allowed in a read. If between 0 and 1 taken as the fraction of read length (Requires NM attribute)
- --filter-mismatch-total [OFF]: Maximum total mismatches allowed in paired reads. If between 0 and 1 taken as the fraction of combined read length (Requires NM attribute)
- --filter-remove-duplicates: Remove duplicate sequencing reads in the process
- --filter-min-exon: Minimal exon length to consider. Exons smaller than this will not be printed out in the exon quantifications, but will still count towards gene quantifications.
- --check-proper-pairing: If provided only properly paired reads according to the aligner that are in correct orientation will be considered. Otherwise all pairs in correct orientation will be considered
- --check-consistency: If provided checks the consistency of split reads with annotation, rather than pure overlap of one of the blocks of the split read
- --no-merge: If provided overlapping mate pairs will not be merged