This page is outdated, please also see the man page.

Important changes in version 1.2

The quan mode in version 1.2 is not compatible with the quantifications generated by the previous versions. This due to bug fixes and slight adjustments to the way we quantify. DO NOT MIX QUANTIFICATIONS GENERATED BY EARLIER VERSIONS OF QTLTOOLS WITH QUANTIFICATIONS FROM VERSION 1.2 AS THIS WILL CREATE A BIAS IN YOUR DATASET.

Other changes include:

An option to calculate TPM values.
The combination of options used in quantifying a BAM file are hashed and added to the output file names. When merging multiple quantifications make sure to check that this hash is the same across all files, to ensure that there are no biases in the data.

Filtering of reads in quantifications

Each read will be checked for the following filters in following order. If a read fails one of the filters it will not be checked for the latter filters. This means if a read fails multiple filters it will be counted towards the first filter it fails and you can find these counts in the stats files:

Filter secondary alignments
Filter unmapped reads
Filter reads failing QC if --filter-failed-qc
Filter duplicate alignments if --filter-remove-duplicates
Filter reads failing --filter-mapping-quality threshold
Filter reads not properly oriented and not properly paired if --check-proper-pairing
Filter reads bases on allowed mismatches based on --filter-mismatch and --filter-mismatch-total
Filter mate pairs where the first mate is not on the forward strand or the second mate is not on the reverse strand or if the mate pairs align to different chromosomes or if the first read starts before the second (these will be counted as not paired)

How to run quan?

The QTLtools mode quant allows quantifying exons and genes and requires 2 files as input:

A BAM file containing the sequence data for a RNA-seq experiment
A GTF file containing the exon and gene annotations that can be downloaded from here

To illustrate how this works, we provide the 4 following example files:

A RNA-seq BAM file on chr22 for sample HG00381: BAM / index
A GTF file containing the gencode annotation v19 for chr22: GTF

Then, to perform quantification from paired-end RNA-seq with 49 bp reads aligned with GEM, we recommend to use this command:


		QTLtools quan --bam HG00381.chr22.bam --gtf gencode.v19.annotation.chr22.gtf.gz --sample HG00381 --out-prefix HG00381 --filter-mapping-quality 150 --filter-mismatch 5 --filter-mismatch-total 5 --rpkm

This command produces 5 output files HG00381.5zDu4oPgLES.exon.count.bed, HG00381.5zDu4oPgLES.exon.rpkm.bed, HG00381.5zDu4oPgLES.gene.count.bed, HG00381.5zDu4oPgLES.gene.rpkm.bed, and HG00381.5zDu4oPgLES.stats. If you open the gene rpkm file, you'll get something like this:


		#chr	start	end	gene	length	strand	HG00381

		22	16062156	16062157	ENSG00000233866.1	L=586;T=lincRNA;R=chr22:16062157-16063236;N=LA16c-4G1.3	+	0

		22	16076171	16076172	ENSG00000229286.1	L=121;T=pseudogene;R=chr22:16076052-16076172;N=LA16c-4G1.4	-	0

		22	16084825	16084826	ENSG00000235265.1	L=578;T=pseudogene;R=chr22:16084249-16084826;N=LA16c-4G1.5	-	0

		22	16124972	16124973	ENSG00000223875.1	L=685;T=pseudogene;R=chr22:16100517-16124973;N=NBEAP3	-	0

		22	16122719	16122720	ENSG00000215270.3	L=1049;T=pseudogene;R=chr22:16122720-16123768;N=LA16c-60H5.7	+	56.6973588

		22	16193003	16193004	ENSG00000206195.6	L=5545;T=processed_transcript;R=chr22:16147979-16193004;N=AP000525.9	-	1.05850612

		22	16151396	16151397	ENSG00000271672.1	L=622;T=pseudogene;R=chr22:16150776-16151397;N=DUXAP8	-	9.436360824

		22	16154765	16154766	ENSG00000272872.1	L=694;T=sense_intronic;R=chr22:16154073-16154766;N=LL22NC03-N14H11.1	-	0

		22	16157078	16157079	ENSG00000232775.2	L=1572;T=pseudogene;R=chr22:16157079-16172700;N=AP000525.10	+	0

This file is a BED file with the following 7 columns:

1. The chromosome ID
2. The TSS position [0-based]
3. The TSS position [1-based]
4. The gene ID
5. Gene info where L is total exonic length when overlapping exons are merged, T is gene type, R is gene boundaries, N is the gene name
6. The strand orientation
7. The RPKM value for the sample (here HG00381) at the corresponding gene

Now, if you open HG00381.5zDu4oPgLES.exon.rpkm.bed, you'll get:


		#chr	start	end	gene	length	strand	HG00381

		...
		
		22      17517459        17517460        ENSG00000237438.2_17517460_17518234     ENSG00000237438.2       +       7.874866966

		22      17517459        17517460        ENSG00000237438.2_17525763_17525890     ENSG00000237438.2       +       106.9270201

		22      17517459        17517460        ENSG00000237438.2_17528214_17528320     ENSG00000237438.2       +       77.96262006

		...

		22      17565843        17565844        ENSG00000177663.9_17565844_17566119     ENSG00000177663.9       +       3.009379634

		22      17565843        17565844        ENSG00000177663.9_17578687_17578833     ENSG00000177663.9       +       166.3142867

		22      17565843        17565844        ENSG00000177663.9_17579665_17579777     ENSG00000177663.9       +       21.7314494

		...

For exon quantifications, the output is slightly different. The 4th and the 5th columns give this time, the exonID and the gene it belongs to, respectively. The TSS position is the same for all exons within the same gene. These pieces of information are important for QTL mapping in cis.

The HG00381.5zDu4oPgLES.stats file will contain:


		sample	HG00381

		total_reads	610860

		total_secondary_alingments	0

		total_unmapped	27706

		total_duplicate	0

		total_failqc	0

		total_mapQ_less_than_150	80608

		total_notpaired	1404

		total_mismatches_greater_than_5_5	139022

		total_merged_reads	160772

		total_good	353278

		total_exonic	335448

		total_exonic_multi_counting	369218

		total_exonic_multi_counting_after_merge_(used_for_rpkm)	340749.3782233436

		good_over_total	0.5783289133352978

		exonic_over_total	0.5491405559375307

		exonic_over_good	0.9495298320302991

This file is a stats file with the following lines:

1. Sample name
2. Number of reads in the BAM file
3. Number of secondary alingments
4. Number of unmapped reads
5. Number of duplicate reads
6. Number of reads with the failed QC tag
7. Number of reads below the mapping quality threshold (here 150)
8. Number of reads failing the mismatches per read and mismatches total filters (here 5 and 5)
9. Number of reads where the mate pairs were overlapping and thus were merged
10. Number of reads that passed all filters
11. Number of reads that aligned to exons
12. Number of reads that aligned to exons when we count reads that align to multiple exons multiple times
13. Number of reads that aligned to exons when we merge overlapping mate pairs
14. Number of good reads over the number of total reads
15. Number of exonic reads over the number of total reads
16. Number of exonic reads over the number of good reads

To quantify all samples for which you have RNA-seq, proceed as follows:

A. Run QTLtools quan for each BAM file separately (i.e. in indepeneden jobs)
B. Extract the annotation columns using cut -f1-6 on one of the file
C. Extract the per sample quantifications using cut -f7 for each file separately
D. Build the entire quantification matrix using the paste to merge B with all quantifications of C
E. Filter all poorly quantified exons/genes using awk for example

The expected results will look like this:


		#chr    start   end     gene    length  strand  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102

		chr1       29369   29370   ENSG00000227232.4       1800    -       3.574873147     2.159108572     2.656253435     2.147152811     3.060300881     0.4603702761

		chr1       135894  135895  ENSG00000268903.1       755     -       1.094681369     0.5444517286    0       0       1.015910092     0

		chr1       137964  137965  ENSG00000269981.1       284     -       2.494419821     1.184234794     0.7113553138    0.9226247455    3.437314236     2.122065293

		chr1       139378  139379  ENSG00000237683.5       2661    -       1.575143469     1.306025695     0.2530689078    0.3692579309    3.092051181     1.358887358

		chr1       267252  267253  ENSG00000228463.4       3954    -       0.07465174813   0.09450983011   0       0.1822382211    0.03526978984   0.209576757

		chr1       317719  317720  ENSG00000237094.7       6869    +       0       0.005440265952  0       0.009536520153  0       0.03290143452

		chr1       564441  564442  ENSG00000225972.1       372     +       1269.561342     1101.687828     1226.812553     1438.322375     1157.266087     2486.809527

		chr1       565019  565020  ENSG00000225630.1       1044    +       1324.772668     1573.121065     1388.501978     1507.901121     1498.158208     2563.208926

		chr1       566453  566454  ENSG00000237973.1       1543    +       5898.256933     6113.821775     5275.691308     6484.280259     5137.033507     8913.352248

Hereafter the list of options that can be used to tune the quantification step:

--out-prefix: Output file prefix
--rpkm: Output RPKM values
--tpm: Output TPM values
--gene-types: Gene types to quantify. (Requires gene_type attribute in GTF. It will also use transcript_type if present)
--xxhash: Rather than using GTF file name to generate unique hash for the options used, use hash of the GTF file
--no-hash: Don't include a hash signifying the options used in the quantification in the file names
--filter-mapping-quality [10]: Minimal phred mapping quality for a read to be considered
--filter-mismatch [OFF]: Maximum mismatches allowed in a read. If between 0 and 1 taken as the fraction of read length (Requires NM attribute)
--filter-mismatch-total [OFF]: Maximum total mismatches allowed in paired reads. If between 0 and 1 taken as the fraction of combined read length (Requires NM attribute)
--filter-remove-duplicates: Remove duplicate sequencing reads in the process
--filter-min-exon: Minimal exon length to consider. Exons smaller than this will not be printed out in the exon quantifications, but will still count towards gene quantifications.
--check-proper-pairing: If provided only properly paired reads according to the aligner that are in correct orientation will be considered. Otherwise all pairs in correct orientation will be considered
--check-consistency: If provided checks the consistency of split reads with annotation, rather than pure overlap of one of the blocks of the split read
--no-merge: If provided overlapping mate pairs will not be merged

Monday 2nd September 2019