Sequence data [BAM/SAM]

Any sequence data needs to be specified in either BAM or SAM format. To validate the file, use samtools as follows:

samtools view mySequences.bam | head

If you can visualize some of your sequences using this command without any errors, this means that QTLtools will also be able to read the file.

Genotype data [VCF/BCF]

Any genotype data needs to be specified in VCF or BCF format. Below is a minimal VCF example:

##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT UNR1 UNR2 UNR3 UNR4
chr7 123 SNP1 A G 100 PASS . GT:DS 0/0:0.001 0/0:0.000 0/1:0.999 1/1:1.999
chr7 456 SNP2 T C 100 PASS . GT:DS 0/0:0.001 0/0:0.000 0/1:1.100 0/0:0.100
chr7 789 SNP3 A T 100 PASS . GT:DS 1/1:2.000 0/1:1.001 0/0:0.010 0/1:0.890

A precise description of this file format can be found in the VCF specification.

QTLtools needs any VCF/BCF to be indexed. This can be done using tabix for VCFs and bcftools for BCFs.

All QTLtools functionalities can use either the GT or DS (i.e. genotype dosage) fields. The DS field is very useful for encoding imputation uncertainty. Note that the match mode does not work with the DS field and therefore requires the GT field to be specified.

Missing GT entries are encoded with ./. and missing DS entries are encoded with .

To validate your VCF/BCF file, use bcftools as follows:

bcftools view myGenotypes.vcf.gz | less -S

If you can parse your genotypes using this command without any errors, this means that QTLtools will also be able to read the file.

Phenotype data [BED]

Phenotype data are specified using an extended UCSC BED format: a standard BED file with some additional columns. Missing values should be encoded as NA. Below is a general example of 4 molecular phenotypes for 4 samples.

#Chr start end pid gid strand UNR1 UNR2 UNR3 UNR4
chr1 99999 100000 pheno1 pheno1 + -0.50 0.82 -0.71 0.83
chr1 199999 201000 pheno2 pheno2 + 1.18 -2.84 1.34 -1.56
chr1 299999 300000 exon1 gene1 + -1.13 1.18 -0.03 0.11
chr1 299999 300000 exon2 gene1 + -1.18 1.32 -0.36 1.26

This file is TAB delimited. Each line corresponds to a single molecular phenotype. The first 6 columns are:

  1. Chromosome ID [string]
  2. Start genomic position of the phenotype (here the TSS of gene1) [integer, 0-based]
  3. End genomic position of the phenotype (here the TSS of gene1) [integer, 1-based]
  4. Phenotype ID (here the exon IDs) [string].
  5. Phenotype group ID (here the gene IDs, multiple exons belong to the same gene) [string]
  6. Strand orientation [+/-]

Each additional column then gives the quantification for one sample, encoded as a floating-point number. The file should have P lines (plus the header line) and N+6 columns, where P and N are the numbers of phenotypes and samples, respectively.
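As a quick sanity check of the layout described above, the following sketch verifies that the header starts with a hash key, that it has at least 7 columns (6 fixed columns plus one sample), and that every line has as many TAB-delimited columns as the header. The function name check_bed is hypothetical and not part of QTLtools; it only checks column counts, not the content of the columns.

```shell
# check_bed: hypothetical helper, not part of QTLtools.
# Exits non-zero and reports the problem if the header is malformed
# or if any line has a different number of columns than the header.
check_bed() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NR == 1 {
            ncol = NF
            if ($0 !~ /^#/) { print "header does not start with #"; bad = 1 }
            if (NF < 7)     { print "header has fewer than 7 columns"; bad = 1 }
            next
        }
        NF != ncol {
            printf "line %d: %d columns, expected %d\n", NR, NF, ncol
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, check_bed myPhenotypes.bed prints nothing and returns success when the uncompressed file is consistent.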

THIS FILE FORMAT EXTENDS THE FASTQTL FILE FORMAT BY ADDING 2 COLUMNS! THIS MEANS THAT QTLTOOLS CANNOT WORK WITH FASTQTL BED FILES!

To make a quick and dirty conversion, you can use this command:

zcat myFastQTLphenotypes.bed.gz | awk '{ $4=$4" . +"; print $0 }' | tr " " "\t" | bgzip -c > myQTLtoolsPhenotypes.bed.gz

The small example above gives 3 different ways of encoding phenotype data:

  1. pheno1/pheno1 [line1]: the most standard way for encoding a molecular phenotype. It has a unique ID (pheno1) and spans 1bp.
  2. pheno2/pheno2 [line2]: alternative way of specifying a molecular phenotype. It has a unique ID (pheno2) but spans a region of 1kb.
  3. gene1 [line3-4]: these two lines specify a group (gene1) of 2 molecular phenotypes (exon1 and exon2). Importantly, both phenotypes need to share the same coordinate otherwise QTLtools will not be able to determine that they belong to the same group.

Sample IDs are specified in the header line. This line needs to start with a hash key (i.e. #).

This BED file needs to be indexed with tabix as follows:

bgzip myPhenotypes.bed && tabix -p bed myPhenotypes.bed.gz

If this fails, your BED file is probably not sorted. Sort it by chromosome and start position using sort, keeping the header line first.
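One possible way to sort an uncompressed BED file before indexing is sketched below. It assumes the header is the only line starting with a hash key; the function name sort_bed is hypothetical.

```shell
# sort_bed: hypothetical helper, not part of QTLtools.
# Prints the header line(s) first, then the remaining lines sorted by
# chromosome (column 1) and numerically by start position (column 2).
sort_bed() {
    grep '^#' "$1"
    grep -v '^#' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n
}
```

The sorted output can then be redirected to a new file, for instance sort_bed myPhenotypes.bed > mySortedPhenotypes.bed, and that file compressed with bgzip and indexed with tabix.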

Important notes (to be read!)

MAKE SURE THAT THE CHROMOSOME IDS ARE THE SAME ACROSS ALL FILES. A very common mistake is to have chromosomes named 1-22 in the genotype file while they are named chr1-chr22 in the phenotype or sequence data. In such cases, QTLtools will not be able to match records across files.
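A rough way to spot such a naming mismatch is to compare the first column of two files. The sketch below works on uncompressed, TAB-delimited text and skips lines starting with a hash key; the function names chrom_ids and chrom_mismatch are hypothetical and not part of QTLtools.

```shell
# chrom_ids: hypothetical helper. Reads TAB-delimited text on stdin and
# prints the sorted, unique values of column 1, skipping # lines.
chrom_ids() {
    awk -F '\t' '!/^#/ && NF > 0 { print $1 }' | LC_ALL=C sort -u
}

# chrom_mismatch: hypothetical helper. Prints the chromosome IDs that
# appear in only one of the two given files; prints nothing if they agree.
# The body runs in a subshell so its variables do not leak.
chrom_mismatch() (
    a=$(mktemp) || exit 1
    b=$(mktemp) || exit 1
    chrom_ids < "$1" > "$a"
    chrom_ids < "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b" | awk -v f="$1" '{ print "only in " f ": " $0 }'
    LC_ALL=C comm -13 "$a" "$b" | awk -v f="$2" '{ print "only in " f ": " $0 }'
    rm -f "$a" "$b"
)
```

For bgzip-compressed files that are already indexed, tabix -l lists the chromosome names directly, which avoids decompressing the whole file.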

Covariate data [TXT]

The COV file contains the covariate data in simple TXT format. Missing values should be encoded as NA. Below is an example of 4 covariates for 4 samples.

id UNR1 UNR2 UNR3 UNR4
PC1 -0.02 0.14 0.16 -0.02
PC2 0.01 0.11 0.10 0.01
PC3 0.03 0.05 0.08 0.07
BIN 1 0 0 1

Some properties of this file:

  1. The file is TAB delimited
  2. First row gives the sample ID and each additional one corresponds to a single covariate
  3. First column gives the covariate ID and each additional one corresponds to a sample
  4. The file should have C+1 rows and S+1 columns where C and S are the numbers of covariates and samples, respectively.
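Since each row after the first is a covariate and each column after the first is a sample, the dimensions of a covariate file can be checked with a short awk sketch like the one below. The function name cov_dims is hypothetical and not part of QTLtools.

```shell
# cov_dims: hypothetical helper, not part of QTLtools.
# Prints "<number of covariates> <number of samples>" for a TAB-delimited
# covariate file, and exits non-zero if any row has a different number of
# columns than the first row.
cov_dims() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NR == 1 { ncol = NF }
        NF != ncol {
            printf "line %d: %d columns, expected %d\n", NR, NF, ncol > "/dev/stderr"
            bad = 1
        }
        END {
            if (NR > 0) print NR - 1, ncol - 1
            exit bad
        }
    ' "$1"
}
```

The second number should match the number of samples you expect to analyse.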

Both quantitative and qualitative covariates are supported. A covariate is treated as quantitative when only numeric values are provided, and as qualitative when only non-numeric values are provided. In practice, a qualitative covariate with F factors is converted into F-1 binary covariates.
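To illustrate what converting F factors into F-1 binary covariates means, here is a small sketch of standard dummy coding. It is illustrative only: QTLtools does this conversion internally, and the choice of the first label seen as the reference level, the naming of the output rows, and the function name dummy_encode are all assumptions made for this example.

```shell
# dummy_encode: hypothetical helper, illustrative only.
# Reads TAB-delimited covariate rows (ID followed by one label per sample)
# on stdin. For each row it prints F-1 binary rows, where F is the number
# of distinct labels; the first label seen is the reference level and
# gets no row of its own (assumption, not necessarily what QTLtools does).
dummy_encode() {
    awk -F '\t' -v OFS='\t' '
        {
            n = 0
            split("", seen)                      # reset for each input row
            for (i = 2; i <= NF; i++)
                if (!($i in seen)) { seen[$i] = 1; level[++n] = $i }
            for (k = 2; k <= n; k++) {           # skip the reference level
                row = $1 "_" level[k]
                for (i = 2; i <= NF; i++)
                    row = row OFS ($i == level[k] ? 1 : 0)
                print row
            }
        }
    '
}
```

With this sketch, a row BATCH with labels A, B, A, C across 4 samples (F=3) yields two binary rows, one marking the samples labelled B and one marking the samples labelled C.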

Include/Exclude file formats

The various --{include,exclude}-{sites,samples,phenotypes,covariates} options require a simple text file that lists the IDs of the desired type, one ID per line. The include options restrict the analyses to this subset of IDs, whereas the exclude options remove these IDs from the analyses. The IDs for --{include,exclude}-sites refer to the 3rd column in VCF/BCF files, those for --{include,exclude}-covariates refer to the 1st column in COV files, and those for --{include,exclude}-phenotypes refer to the 4th column in BED files, or to the 5th column when the --grp-best option is used.
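As a starting point for a sample list, the sample IDs can be pulled from the header line of an uncompressed phenotype BED file, where they occupy columns 7 onwards. This is a minimal sketch; the function name bed_samples is hypothetical.

```shell
# bed_samples: hypothetical helper, not part of QTLtools.
# Prints the sample IDs from the header line of a TAB-delimited
# phenotype BED file, one ID per line (columns 7 and onwards).
bed_samples() {
    head -n 1 "$1" | cut -f 7- | tr '\t' '\n'
}
```

The output can be saved to a file, trimmed to the samples of interest, and passed to --include-samples or --exclude-samples.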

The --include-positions and --exclude-positions options require a text file which lists the chromosomes and positions (separated by a space) of genotypes to be excluded or included. One position per line.
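A positions file in this layout can be derived from an uncompressed VCF by printing the first two columns of every record, separated by a space. This is only a sketch; the function name vcf_positions is hypothetical.

```shell
# vcf_positions: hypothetical helper, not part of QTLtools.
# Prints "chromosome position" (space-separated, one per line) for every
# record of an uncompressed VCF, skipping header lines that start with #.
vcf_positions() {
    awk -F '\t' '!/^#/ && NF >= 2 { print $1 " " $2 }' "$1"
}
```

The resulting list can be filtered down to the positions of interest before being used with --include-positions or --exclude-positions.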

Monday 2nd September 2019