This page is outdated, please also see the man page.

How to run a nominal pass in cis?

To perform a simple nominal pass on your data set in order to get nominal P-values between your phenotypes and genotypes, you need the following input files:

An indexed phenotype data matrix in BED format (see the section: Preparing input files)
An indexed genotype data matrix in VCF/BCF format (see the section: Preparing input files)
A covariate data matrix in TXT format (see the section: Preparing input files)

For simplicitly, we provide the following small example files:

The phenotype data matrix for chr22 on 358 samples: BED / index
The genotype data matrix for chr22 on 358 samples: VCF / index
The covariate data matrix on 358 samples: TXT

Then, you can run a nominal pass of the data in a particular genomic region using:


		QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --region chr22:17000000-18000000 --out nominals.txt

The given options are:

--nominal 0.01 to only output phenotype-variant pairs with a nominal p-value below 0.01. If you want the full output, use instead --nominal 1 which is going to create big files
--region chr22:17000000-18000000 to only process phenotypes which locations are within the specified genomic region

The output file nominals.txt looks like this:


		ENSG00000237438.2 chr22 17517460 17517460 + 5803 -202619 rs74762450 chr22 17314841 17314841 0.00957345 -0.347492 0

		ENSG00000237438.2 chr22 17517460 17517460 + 5803 -165953 rs5748687 chr22 17351507 17351507 0.00227542 -0.225077 0

		ENSG00000237438.2 chr22 17517460 17517460 + 5803 -165363 rs4006343 chr22 17352097 17352097 0.0022876 -0.226253 0

		ENSG00000237438.2 chr22 17517460 17517460 + 5803 -164231 rs3875996 chr22 17353229 17353229 0.00181073 -0.233649 0

		...

		ENSG00000099954.14 chr22 17840837 17840837 + 6200 469530 rs5992121 chr22 18310367 18310367 0.00653855 -0.00519037 0

		ENSG00000099954.14 chr22 17840837 17840837 + 6200 473554 rs367922 chr22 18314391 18314391 0.00125305 -0.00612444 0

		ENSG00000099954.14 chr22 17840837 17840837 + 6200 476586 rs11705197 chr22 18317423 18317423 0.00961104 -0.00498535 0

		ENSG00000099954.14 chr22 17840837 17840837 + 6200 754091 rs362027 chr22 18594928 18594928 0.00673601 0.0100518 0

		ENSG00000099954.14 chr22 17840837 17840837 + 6200 754094 rs361893 chr22 18594931 18594931 0.00673601 0.0100518 0

The columns are:

1. The phenotype ID
2. The chromosome ID of the phenotype
3. The start position of the phenotype
4. The end position of the phenotype
5. The strand orientation of the phenotype
6. The total number of variants tested in cis
7. The distance between the phenotype and the tested variant (accounting for strand orientation)
8. The ID of the tested variant
9. The chromosome ID of the variant
10. The start position of the variant
11. The end position of the variant
12. The nominal P-value of association between the variant and the phenotype
13. The corresponding regression slope
14. A binary flag equal to 1 is the variant is the top variant in cis

Associations between genotype dosages and phenotype quantifications are measured with linear regressions, similar to the R/lm function. This model assumes that phenotypes are normally distributed. To enforce them to match normal distributions N(0, 1), we recommend to use:


		QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --region chr22:17000000-18000000 --out nominals.txt --normal

To change the size of the cis-window (i.e. the maximal distance spanned by phenotype-variant pairs to be considered for testing) from default value 1e6 bp to 2e6 bp, use:


		QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --region chr22:17000000-18000000 --out nominals.txt --window 2000000

To exclude samples, variants, phenotypes or covariates from the analysis, you can use one of these options:

To exclude samples: --exclude-samples file.exc
To exclude variants: --exclude-sites file.exc
To exclude phenotypes: --exclude-phenotypes file.exc
To exclude caovariates: --exclude-covariates file.exc

For instance, if you want to ignore these 3 samples when analyzing the example data set (HG00277, HG00278, HG00280), create a text file containing the IDs of the samples to be excluded, called here file.exc:

Then, run:


		QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --region chr22:17000000-18000000 --out nominals.txt --exclude-sample file.exc

Similarly to these 4 options for data exclusion, you can also specify the set of samples, variants, phenotypes and covariates you wish to include in the analysis using the options: --include-samples, --include-sites, --include-phenotypes and --include-covariates, respectively.

Important notes on parallelization

To facilitate parallelization on compute cluster, we developed an option to run the analysis into chunks of molecular phenotypes. This is quite similar to what is commonly used for genotype imputation, where only small chunks of data are imputed one at a time in seperate jobs. However in practice, it is usually quite complicated to properly split the genome into a given number of chunks with correct coordinates. To facilitate this, we embedded all coordinates into a chunk-based system so that you only need to specify the chunk index you want to run. Then, splitting the genome into chunks, extraction of data, and analysis are automatically performed. For instance, to run analysis on chunk number 12 when splitting the example data set into 20 chunks, just run:


		QTLtools cis --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal 0.01 --chunk 12 20 --out nominals_12_20.txt

If you want to submit the whole analysis into 20 jobs on a compute cluster, just run:


		for j in $(seq 1 20); do

		     echo "QTLtools cis --vcf ... --bed ... --cov ... --nominal 0.01 --chunk $j 20 --out nominals_$j\_20.txt" | qsub

		done

Here qsub needs to be changed according to the job submission system used (bsub, psub, etc...).

In this simple example, we only split the data into 20 chunks. However, a realistic whole genome analysis would require to split the data in 200, 500 or even 1,000 chunks to fully use the capabilities of a modern compute cluster.

Monday 11th July 2016