This page is outdated; please also see the man page.

How to run pca?

The pca mode of QTLtools performs a Principal Component Analysis (PCA) on either molecular phenotype quantifications or genotype data. It is typically used to (i) detect outliers in the data, (ii) detect stratification in the data, or (iii) build a covariate matrix before QTL mapping.


Phenotype data

In this case, the analysis requires a single input file: the phenotype quantifications in BED format, passed with --bed.

To illustrate how this works, we use the example file genes.50percent.chr22.bed.gz.

Then, to perform PCA on this file, use:

QTLtools pca --bed genes.50percent.chr22.bed.gz --scale --center --out genes.50percent.chr22

This produces two files: genes.50percent.chr22.pca and genes.50percent.chr22.pca_stats.

If you open genes.50percent.chr22.pca, you'll get something that looks like this:

SampleID HG00096 HG00097 HG00099 HG00100
genes.50percent.chr22_1_1_svd_PC1 9.4607 3.61907 -7.27784 -4.74021
genes.50percent.chr22_1_1_svd_PC2 -2.89575 -6.1823 -0.488397 -3.1485
genes.50percent.chr22_1_1_svd_PC3 -3.24429 -4.06885 -6.76097 -0.231029
genes.50percent.chr22_1_1_svd_PC4 -0.00959694 2.1984 1.58854 -3.01095
genes.50percent.chr22_1_1_svd_PC5 1.50254 -1.01627 2.46998 4.49168
genes.50percent.chr22_1_1_svd_PC6 -1.06957 -0.953119 3.60855 2.53457
genes.50percent.chr22_1_1_svd_PC7 -6.21865 2.10234 -5.26164 3.38665
genes.50percent.chr22_1_1_svd_PC8 2.62994 -0.516754 0.234968 1.15582
genes.50percent.chr22_1_1_svd_PC9 -3.62158 0.246679 -8.15425 -1.29294
...

In this file, the header line gives the sample IDs and the first column gives the PC IDs, starting from the first PC. Each subsequent line gives the coordinates of the samples on the corresponding PC. The second output file, genes.50percent.chr22.pca_stats, looks like this:

sd 7.81917 7.40651 6.66507 5.40993 4.96369 4.56251 4.41553 3.62532 3.34306
prop_var 0.100544 0.0902117 0.0730544 0.0481304 0.0405178 0.0342329 0.0320629 0.0216137 0.0183791
cumm_prop 0.100544 0.190756 0.26381 0.311941 0.352459 0.386692 0.418755 0.440368 0.458747

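The first three lines give, for each successive PC, the standard deviation (sd), the proportion of variance explained (prop_var), and the cumulative proportion of variance explained (cumm_prop).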
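As an illustration of how the sd, prop_var and cumm_prop rows relate, the following Python sketch parses the excerpt shown above, checks that cumm_prop is the running sum of prop_var, and finds how many leading PCs are needed to reach a given cumulative proportion. The helper names (parse_stats, running_total, pcs_needed) and the 0.30 threshold are assumptions made for this example; they are not part of QTLtools.

```python
# Values copied from the genes.50percent.chr22.pca_stats excerpt above.
STATS_TEXT = """\
sd 7.81917 7.40651 6.66507 5.40993 4.96369 4.56251 4.41553 3.62532 3.34306
prop_var 0.100544 0.0902117 0.0730544 0.0481304 0.0405178 0.0342329 0.0320629 0.0216137 0.0183791
cumm_prop 0.100544 0.190756 0.26381 0.311941 0.352459 0.386692 0.418755 0.440368 0.458747
"""


def parse_stats(text):
    """Return {row_label: [float, ...]} for each non-empty line."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            stats[fields[0]] = [float(x) for x in fields[1:]]
    return stats


def running_total(values):
    """Cumulative sum of a list of numbers."""
    total, out = 0.0, []
    for value in values:
        total += value
        out.append(total)
    return out


def pcs_needed(cumulative, threshold):
    """Smallest number of leading PCs whose cumulative proportion reaches
    the threshold, or None if the listed PCs never reach it."""
    for count, proportion in enumerate(cumulative, start=1):
        if proportion >= threshold:
            return count
    return None


stats = parse_stats(STATS_TEXT)

# cumm_prop should match the running sum of prop_var up to rounding.
recomputed = running_total(stats["prop_var"])
max_gap = max(abs(a - b) for a, b in zip(recomputed, stats["cumm_prop"]))
print("largest rounding gap:", max_gap)

# Hypothetical threshold: PCs needed to explain 30% of the variance.
print("PCs needed for 30%:", pcs_needed(stats["cumm_prop"], 0.30))
```

To apply the same check to a real run, the text of the .pca_stats file can be passed to parse_stats in place of the embedded excerpt.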

The options --center and --scale can be used to enforce centering and scaling of the phenotype values prior to the PCA.
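For downstream use, such as outlier inspection or building covariates, it can be convenient to turn the .pca layout described above (one row per PC, one column per sample) into per-sample coordinate lists. The sketch below does this for the first rows of the excerpt shown earlier. The function name read_pca and its n_pcs parameter are assumptions for this example, and the choice of two PCs is arbitrary.

```python
# First rows of the genes.50percent.chr22.pca excerpt shown above.
PCA_TEXT = """\
SampleID HG00096 HG00097 HG00099 HG00100
genes.50percent.chr22_1_1_svd_PC1 9.4607 3.61907 -7.27784 -4.74021
genes.50percent.chr22_1_1_svd_PC2 -2.89575 -6.1823 -0.488397 -3.1485
genes.50percent.chr22_1_1_svd_PC3 -3.24429 -4.06885 -6.76097 -0.231029
"""


def read_pca(text, n_pcs=None):
    """Parse .pca-style text into (pc_ids, {sample_id: [coordinates]}).

    n_pcs limits the result to the first n_pcs rows; None keeps all rows.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    samples = rows[0][1:]           # header: label followed by sample IDs
    pc_rows = rows[1:] if n_pcs is None else rows[1:1 + n_pcs]

    pc_ids = [row[0] for row in pc_rows]
    coords = {sample: [] for sample in samples}
    for row in pc_rows:
        values = [float(x) for x in row[1:]]
        if len(values) != len(samples):
            raise ValueError("row %s does not match the header" % row[0])
        for sample, value in zip(samples, values):
            coords[sample].append(value)
    return pc_ids, coords


pc_ids, coords = read_pca(PCA_TEXT, n_pcs=2)
for sample, values in coords.items():
    print(sample, values)
```

With a real output file, the same function can be called on the file contents, for example read_pca(open("genes.50percent.chr22.pca").read(), n_pcs=2).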


Genotype data

To illustrate how this works, we use the example file genotypes.chr22.vcf.gz.

Then, to perform PCA on this file, use:

QTLtools pca --vcf genotypes.chr22.vcf.gz --scale --center --maf 0.05 --distance 50000 --out genotypes.chr22

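The output files are the same as for phenotype data. The only difference is that we added two options so that the PCA is run on variant sites in approximate linkage equilibrium: --maf 0.05 applies a 5% minor allele frequency (MAF) threshold, and --distance 50000 requires the variant sites used to be at least 50 kb apart.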
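To give an intuition for what the --maf and --distance filters in the command above achieve, here is a minimal Python sketch of frequency filtering followed by greedy distance thinning. It illustrates the general idea only and is not QTLtools' actual selection procedure. The function name thin_variants, the inclusive frequency bound, the greedy left-to-right strategy and the toy positions are all assumptions made for this example.

```python
def thin_variants(variants, min_maf=0.05, min_distance=50000):
    """Keep variants that pass a MAF threshold and are spaced apart.

    variants: list of (position, maf) tuples on a single chromosome.
    Assumption: the bound is treated as inclusive (maf >= min_maf) and
    sites are scanned left to right, keeping a site only if it lies at
    least min_distance bp after the previously kept site.
    """
    kept = []
    last_pos = None
    for pos, maf in sorted(variants):
        if maf < min_maf:
            continue  # too rare
        if last_pos is not None and pos - last_pos < min_distance:
            continue  # too close to the previously kept site
        kept.append((pos, maf))
        last_pos = pos
    return kept


# Hypothetical toy data: (position in bp, minor allele frequency).
toy = [
    (16050000, 0.20),
    (16060000, 0.30),  # only 10 kb after the first site
    (16100000, 0.01),  # below the MAF threshold
    (16110000, 0.12),
    (16200000, 0.45),
]
print(thin_variants(toy))
```

Spacing sites out in this way reduces the chance that neighbouring variants in strong linkage disequilibrium dominate the leading PCs.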

Monday 11th July 2016