General questions
A: We used the expression data pre-processed by the GTEx analysis working group. Covariates are the same as those used in the eQTL analysis. The GTEx processing pipeline is here, and the Docker image is here. A copy of Portal_Analysis_Methods_v6p_08182016.pdf can be downloaded from either GTEx or our predictdb s3 bucket.
PredictDB questions
A: To reduce the likelihood of spurious findings, we now filter models based on statistical significance (FDR < 0.05).
A: Not at the moment. Sex chromosome genes need to be handled differently from those on the autosomes, and they do not fit into our existing analysis pipeline.
A: For the GTEx models, we calculated the covariance of the SNP dosages within our training samples. For the DGN model, we used public genotype data from 1000 Genomes to calculate the covariance of the SNP dosages.
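As an illustration of what "covariance of the SNP dosages" means here, below is a minimal pure-Python sketch on toy data. This is not the actual pipeline code; it simply shows the sample covariance computed across individuals for the SNPs in a model, which is what the covariance files store.

```python
def dosage_covariance(dosages):
    """Sample covariance matrix of SNP dosages.

    `dosages` is a list of individuals, each a list of per-SNP allele
    dosages (0, 1, or 2). Returns a SNPs-by-SNPs covariance matrix.
    """
    n = len(dosages)          # number of individuals
    m = len(dosages[0])       # number of SNPs
    # Per-SNP mean dosage across individuals.
    means = [sum(ind[j] for ind in dosages) / n for j in range(m)]
    cov = [[0.0] * m for _ in range(m)]
    for j in range(m):
        for k in range(m):
            # Unbiased sample covariance between SNP j and SNP k.
            cov[j][k] = sum(
                (ind[j] - means[j]) * (ind[k] - means[k]) for ind in dosages
            ) / (n - 1)
    return cov

# Toy example: two individuals, two perfectly anti-correlated SNPs.
cov = dosage_covariance([[0, 2], [2, 0]])
```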
A: This is an abbreviation for “Tissue-Wide.” We have historically created a Cross-Tissue model, which is a measure of the expression common across all tissues, and Tissue-Specific (TS) models, which are modeled as components orthogonal to the shared cross-tissue component. Details can be found here. While these models are not available at the moment, they are in our pipeline.
A: Yes, they can be found in the extra table in each sqlite db.
A: This can be found in the sample_info table in each sqlite db.
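Both the extra and sample_info tables can be inspected with Python's standard sqlite3 module. The sketch below is a generic helper, not part of our tooling; pass it the path of any downloaded model database (the filename in the usage comment is hypothetical).

```python
import sqlite3

def dump_model_tables(db_path):
    """Return the table names in a PredictDB sqlite model, plus all rows
    of the `extra` table (per-gene summary statistics) when it exists."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    # sqlite_master lists every table in the database file.
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = [row[0] for row in cur.fetchall()]
    rows = []
    if "extra" in tables:
        cur.execute("SELECT * FROM extra")
        rows = cur.fetchall()
    conn.close()
    return tables, rows

# Usage (hypothetical filename):
# tables, extra_rows = dump_model_tables("TW_Whole_Blood_0.5.db")
```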

A: 1KG and HapMap are the labels for the SNP sets used to train the prediction models. As you probably have guessed, 1KG refers to the 1000 Genomes SNP set and HapMap refers to the HapMap SNP set. Careful consideration of the SNPs in your genotype data should help you decide which of the models to use. The 1000 Genomes models make use of many more SNPs, so we believe them to have greater prediction accuracy, and some preliminary investigation with out-of-sample data suggests this is the case. If you are working with older data, though, or would have to impute a significant amount of your genotype data to achieve more complete coverage of the models, we recommend using the HapMap models. For most use cases, the HapMap models are a good starting point.

Click on the following links to download the complete SNP annotation files we used to train the different models. These are tab-delimited text files with chromosome, position, and allele information in addition to the rsids.
HapMap (36MB)
1000 Genomes (150MB)

Alternatively, you can query the sqlite database to get a list of all the SNPs it uses for prediction. See here for a python function to create a file containing all SNPs in a database.
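As a sketch of that idea, the function below writes the unique rsids from a model database to a text file. It assumes the standard weights table with an rsid column, which is where each model stores its prediction SNPs; adjust the query if your database differs.

```python
import sqlite3

def write_model_snps(db_path, out_path):
    """Write the unique rsids used by a PredictDB model to a text file,
    one rsid per line. Returns the number of SNPs written."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    # Every SNP contributing to any gene's prediction appears in `weights`.
    cur.execute("SELECT DISTINCT rsid FROM weights")
    rsids = sorted(row[0] for row in cur.fetchall())
    conn.close()
    with open(out_path, "w") as f:
        for rsid in rsids:
            f.write(rsid + "\n")
    return len(rsids)
```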

A: All scripts for model training are located here: https://github.com/hakyimlab/PredictDBPipeline Note: keep in mind that a large chunk of the scripts deals with splitting up the data and submitting jobs to the HPC cluster at the University of Chicago. The scripts you would probably be most interested in for model training and building a database are GTEx_Tissue_Wide_CV_elasticNet.R and make_sqlite_db.py.

PrediXcan questions
A: In order for the predicted expression file to be correct, the dosage files and the samples file must correspond to the same individuals. The columns of the dosage files are snpid rsid position allele1 allele2 MAF id1 ..... idn, and the samples file lists the information for id1 ... idn row by row. Because the id numbers are not included in the dosage file, it is critically important that the samples file contain the same number of individuals, in the same order, as the dosage files. Otherwise, later association tests will be invalid.
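A simple sanity check is to count genotype columns in a dosage row against rows in the samples file. The helper below is a hedged sketch (the function name and fixed count of six leading info columns are our assumptions, matching the format described above); it cannot verify the ordering, only that the counts agree.

```python
def check_dosage_samples(dosage_line, samples_lines, n_info_cols=6):
    """Check that a dosage row has one genotype column per sample.

    The first `n_info_cols` fields are snpid, rsid, position, allele1,
    allele2, and MAF; the remaining fields are per-individual dosages that
    must line up, in order, with the rows of the samples file.
    """
    n_dosages = len(dosage_line.split()) - n_info_cols
    n_samples = len(samples_lines)
    if n_dosages != n_samples:
        raise ValueError(
            f"{n_dosages} dosage columns but {n_samples} samples; "
            "association results would be invalid"
        )
    return n_samples

# Toy example: one SNP row with dosages for three individuals.
row = "chr1_100 rs1 100 A G 0.2 0 1 2"
samples = ["FAM1 ID1", "FAM2 ID2", "FAM3 ID3"]
```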
MetaXcan questions