General questions
A: We used the expression data pre-processed by the GTEx analysis working group. The covariates are the same as those used in the eQTL analysis. The GTEx processing pipeline is here, and the Docker image is here.
A: These models allow prediction of gene expression or alternative splicing in a GWAS cohort. The predicted levels can then be tested for association with a complex trait, such as susceptibility to a disease. The PrediXcan implementation uses these models with individual-level data. The MetaXcan repository contains an implementation based on GWAS summary statistics (i.e. it doesn't need individual-level data). A multi-tissue association method is also implemented there, for both individual-level and summary-statistics data.

A1: GTEx v8 MASHR-based models are parsimonious and exhibit the greatest power, which makes them the best option. However, older GWAS must be preprocessed before they can be used with these models, as detailed here.

A2: GTEx v8 UTMOST models are a robust but less powerful option. They are based on HapMap SNPs and as such have good overlap with the SNP sets of most GWAS.

A3: If your GWAS population is non-European, you might be interested in population-specific models like the MESA models.

PredictDB questions
A: The prediction model databases only contain models that pass stringent criteria (each family of models has its own criteria, e.g. MASHR models require at least one SNP with a high posterior probability of being an eQTL). Our model training algorithms are complex and conservative. Sometimes a good enough signal can't be captured from a gene's expression profile, even if the gene has an eQTL. Conversely, the algorithms sometimes converge for genes with complex profiles where no eQTL could be found. On other occasions, the algorithm doesn't converge at all.
A: Not at the moment. Sex chromosome genes need to be handled differently than those on the autosomes, and do not fit into our existing analysis pipeline.
A: For the GTEx models, we calculated the covariance of the SNP dosages within our training samples. For the DGN model, we used public genotype data from 1000 Genomes to calculate the covariance of the SNP dosages.
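As a rough illustration of what a SNP dosage covariance looks like, here is a minimal pure-Python sketch. It is not the pipeline's code: the function name `covariance_matrix`, the row-per-SNP input layout, and the use of the sample covariance (denominator n - 1) are all assumptions made for the example.

```python
def covariance_matrix(dosages):
    """Covariance between SNPs, given one row of per-sample dosages per SNP.

    Assumption: sample covariance (n - 1 denominator); the actual pipeline
    may use a different estimator.
    """
    if not dosages:
        return []
    n_samples = len(dosages[0])
    if n_samples < 2 or any(len(row) != n_samples for row in dosages):
        raise ValueError("every SNP needs the same number (>= 2) of dosages")
    # Center each SNP's dosages on its own mean.
    centered = []
    for row in dosages:
        mean = sum(row) / n_samples
        centered.append([x - mean for x in row])
    # Entry [i][j] is the covariance between SNP i and SNP j.
    return [
        [sum(a * b for a, b in zip(ri, rj)) / (n_samples - 1) for rj in centered]
        for ri in centered
    ]
```

In this layout, each inner list holds the dosages (0 to 2) of one SNP across the reference samples, so the result has one row and one column per SNP.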
A: This is an abbreviation for “Tissue-Wide.” We have historically created a Cross-Tissue model, which measures the expression common across all tissues, and Tissue-Specific (TS) models, which are modeled as a component orthogonal to the shared cross-tissue component. Details can be found here. While these models are not available at the moment, they are in our pipeline.
A: Yes, they can be found in the extra table in each sqlite db.
A: This can be found in the sample_info table in each sqlite db.
A: The .db files are simple sqlite files. You can query them programmatically from Python, R, Perl, etc. (using the appropriate libraries). Find example queries in link.
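As a minimal sketch of such a query, the following uses Python's standard `sqlite3` module to list the tables in a .db file and to peek at the first rows of one of them, such as the extra or sample_info tables mentioned above. The function names `list_tables` and `peek` are illustrative, not part of any PredictDB tooling, and no column names are assumed.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in a sqlite file, sorted by name."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [row[0] for row in rows]


def peek(db_path, table, limit=5):
    """Return (column_names, rows) for the first `limit` rows of `table`."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the tables that actually exist before interpolating it.
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
        columns = [description[0] for description in cursor.description]
        return columns, cursor.fetchall()
```

For example, `peek("some_model.db", "extra")` would return the column names and the first five rows of the extra table, where `some_model.db` stands for whichever model file you downloaded.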

A: 1KG and HapMap are the labels for the sets of SNPs that were used to train the prediction models. As you have probably guessed, 1KG refers to the 1000 Genomes SNP set and HapMap refers to the HapMap SNP set. Careful consideration of the SNPs in your genotype data should help you decide which of the models to use. The 1000 Genomes models make use of many more SNPs, so we believe these models have greater prediction accuracy, and some preliminary investigation with out-of-sample data suggests this to be the case. If you are working with older data, though, or you would have to impute a significant amount of your genotype data to achieve more complete coverage of the models, we recommend using the HapMap models. For most use cases, the HapMap models are a good starting point. Bear in mind that v6 models are deprecated.

Click on the following links to download the complete SNP annotation files we used to train the different models. These are tab-delimited text files that list the chromosome, position, and alleles of each SNP in addition to its rsid.
HapMap (36MB)
1000 Genomes (150MB)

Alternatively, you can query the sqlite database to get a list of all the SNPs it uses for prediction. See here for a python function to create a file containing all SNPs in a database.
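For orientation, a function of this kind might look like the sketch below. It is not the linked function. It assumes the SNP identifiers live in a column named `rsid` of a table named `weights`; both names are assumptions, exposed as parameters so they can be changed to match the actual schema of your .db file.

```python
import sqlite3
from contextlib import closing


def write_snp_list(db_path, out_path, table="weights", column="rsid"):
    """Write every distinct SNP id in the database to out_path, one per line.

    The default table and column names are assumptions about the schema.
    Returns the number of SNPs written.
    """
    query = f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY "{column}"'
    with closing(sqlite3.connect(db_path)) as conn:
        snps = [row[0] for row in conn.execute(query)]
    with open(out_path, "w") as out:
        for snp in snps:
            out.write(f"{snp}\n")
    return len(snps)
```

The resulting file can then be compared against the SNPs in your own genotype data to estimate how well a given model set is covered.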

A: To build Elastic Net models, the code is available here: PredictDB_Pipeline_GTEx_v7. An introductory documentation is available here Google Group.

The newer GTEx v8 models were generated with code available here. The documentation is still a work in progress, so don't hesitate to ask for support in the PrediXcan/MetaXcan Google Group.

The older Elastic Net pipeline for v6 models is available here. It is deprecated and exists solely for reference purposes.

PrediXcan questions
A: For the predicted expression file to be correct, the dosage files and the samples file must describe the same individuals. The columns of the dosage files are snpid rsid position allele1 allele2 MAF id1 ..... idn, and the samples file lists the information for id1 ... idn row by row. Because the id numbers are not included in the dosage files, it is critically important that the samples file has the same number of individuals, in the same order, as the dosage files. Otherwise, later association tests will be invalid.
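A quick sanity check along these lines can catch a mismatch in the number of individuals before running any association test. The sketch below assumes whitespace-delimited files with no header rows and one individual per non-empty line of the samples file; the function names are illustrative. Note that it can only compare counts: because the dosage file carries no ids, the ordering cannot be verified from the files alone.

```python
# Leading columns of a dosage line: snpid rsid position allele1 allele2 MAF
N_FIXED_COLUMNS = 6


def count_dosage_individuals(dosage_line):
    """Number of per-individual dosage values on one dosage-file line."""
    fields = dosage_line.split()
    if len(fields) <= N_FIXED_COLUMNS:
        raise ValueError("dosage line has no per-individual columns")
    return len(fields) - N_FIXED_COLUMNS


def check_sample_count(dosage_lines, samples_lines):
    """Raise ValueError if any dosage line disagrees with the samples file.

    Returns the number of individuals when everything is consistent.
    """
    n_samples = sum(1 for line in samples_lines if line.strip())
    for line_number, line in enumerate(dosage_lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        n_dosages = count_dosage_individuals(line)
        if n_dosages != n_samples:
            raise ValueError(
                f"dosage line {line_number} has {n_dosages} individuals, "
                f"but the samples file lists {n_samples}"
            )
    return n_samples
```

Both arguments can be any iterable of text lines, for example open file handles (use `gzip.open(path, "rt")` for compressed dosage files).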
MetaXcan questions