How to run a nominal pass in cis?

To perform a simple nominal pass on your data set in order to get nominal P-values between your phenotypes and genotypes, you need the following input files:

For simplicitly, we provide the following small example files:

Then, you can run a nominal pass of the data in a particular genomic region using:

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --region chr22:17000000-18000000 --out nominals.txt

The given options are:

The output file nominals.txt looks like this:

ENSG00000237438.2 chr22 17517460 17517460 + 5803 -202619 rs74762450 chr22 17314841 17314841 0.00957345 -0.347492 0
ENSG00000237438.2 chr22 17517460 17517460 + 5803 -165953 rs5748687 chr22 17351507 17351507 0.00227542 -0.225077 0
ENSG00000237438.2 chr22 17517460 17517460 + 5803 -165363 rs4006343 chr22 17352097 17352097 0.0022876 -0.226253 0
ENSG00000237438.2 chr22 17517460 17517460 + 5803 -164231 rs3875996 chr22 17353229 17353229 0.00181073 -0.233649 0
...
ENSG00000099954.14 chr22 17840837 17840837 + 6200 469530 rs5992121 chr22 18310367 18310367 0.00653855 -0.00519037 0
ENSG00000099954.14 chr22 17840837 17840837 + 6200 473554 rs367922 chr22 18314391 18314391 0.00125305 -0.00612444 0
ENSG00000099954.14 chr22 17840837 17840837 + 6200 476586 rs11705197 chr22 18317423 18317423 0.00961104 -0.00498535 0
ENSG00000099954.14 chr22 17840837 17840837 + 6200 754091 rs362027 chr22 18594928 18594928 0.00673601 0.0100518 0
ENSG00000099954.14 chr22 17840837 17840837 + 6200 754094 rs361893 chr22 18594931 18594931 0.00673601 0.0100518 0

The columns are:

Associations between genotype dosages and phenotype quantifications are measured with linear regressions, similar to the R/lm function. This model assumes that phenotypes are normally distributed. To enforce them to match normal distributions N(0, 1), we recommend to use:

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --region chr22:17000000-18000000 --out nominals.txt --normal

To change the size of the cis-window (i.e. the maximal distance spanned by phenotype-variant pairs to be considered for testing) from default value 1e6 bp to 2e6 bp, use:

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --region chr22:17000000-18000000 --out nominals.txt --window 2000000

To exclude samples, variants, phenotypes or covariates from the analysis, you can use one of these options:

  1. To exclude samples: --exclude-samples file.exc
  2. To exclude variants: --exclude-sites file.exc
  3. To exclude phenotypes: --exclude-phenotypes file.exc
  4. To exclude caovariates: --exclude-covariates file.exc

For instance, if you want to ignore these 3 samples when analyzing the example data set (HG00277, HG00278, HG00280), create a text file containing the IDs of the samples to be excluded, called here file.exc:

HG00277
HG00278
HG00280

Then, run:

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --region chr22:17000000-18000000 --out nominals.txt --exclude-sample file.exc

Similarly to these 4 options for data exclusion, you can also specify the set of samples, variants, phenotypes and covariates you wish to include in the analysis using the options: --include-samples, --include-sites, --include-phenotypes and --include-covariates, respectively.


Important notes on parallelization

To facilitate parallelization on compute cluster, we developed an option to run the analysis into chunks of molecular phenotypes. This is quite similar to what is commonly used for genotype imputation, where only small chunks of data are imputed one at a time in seperate jobs. However in practice, it is usually quite complicated to properly split the genome into a given number of chunks with correct coordinates. To facilitate this, we embedded all coordinates into a chunk-based system so that you only need to specify the chunk index you want to run. Then, splitting the genome into chunks, extraction of data, and analysis are automatically performed. For instance, to run analysis on chunk number 12 when splitting the example data set into 20 chunks, just run:

QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --chunk 12 20 --out nominals_12_20.txt

If you want to submit the whole analysis into 20 jobs on a compute cluster, just run:

for j in $(seq 1 20); do
     echo "QTLtools cis --vcf ... --bed ... --cov ... --nominal 0.01 --chunk $j 20 --out nominals_$j\_20.txt" | qsub
done

Here qsub needs to be changed according to the job submission system used (bsub, psub, etc...).

In this simple example, we only split the data into 20 chunks. However, a realistic whole genome analysis would require to split the data in 200, 500 or even 1,000 chunks to fully use the capabilities of a modern compute cluster.
Monday 11th July 2016