We have received a number of inquiries from users about how to build their
own prediction models. All scripts used to train the models are located in the
GitHub repository here. Note
that many of these scripts deal with splitting up data and submitting
jobs to the University of Chicago's HPC cluster. The model training script
itself is located here.
We hope this proves useful and gives some insight into how the models were built. If you do build a prediction model for your own research, we encourage you to share it with the community following the publication of your main results, or sooner, if you feel comfortable. We can then host your model on predictdb.org while giving you proper acknowledgement.
Today we are updating the tissue-wide prediction models trained from GTEx data which use 1000 Genomes SNPs. There are two main differences:
1. The SNP covariance files we provided previously were built using a reference dataset of 1000 Genomes data. The covariance data is now based on the GTEx training data. This should resolve many issues with MetaXcan not being able to use all of the SNPs in the models.
2. In the previous release of our GTEx 1000 Genomes models, we had mistakenly excluded some well-performing protein-coding gene models before filtering for statistical significance at the level of FDR < 0.05. We have now included these models in the databases. With these genes included, however, some gene-tissue models that had previously just reached the threshold of statistical significance no longer meet the necessary criteria. These models have been dropped accordingly. As they were not predicted to perform well anyway, we hope that few people, if any, will miss them.
If you are using the GTEx 1000 Genomes models in your research, we ask you to rerun your analyses using these updated models.
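For readers wondering how the set of significant models can change when more models enter the filter, here is a minimal sketch. It assumes a Benjamini-Hochberg adjustment, which may not be the exact FDR procedure used for these databases, and the function names and p-values are made up for illustration. It only shows the general property that an FDR-adjusted value depends on the full set of models tested, so borderline models can cross the 0.05 threshold when that set changes.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used here for illustration only; the actual
    pipeline may compute FDR differently.
    """
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def passes_fdr(pvalues, threshold=0.05):
    """Flag each model whose adjusted p-value is below the threshold."""
    return [q < threshold for q in bh_adjust(pvalues)]


# Hypothetical model p-values, before and after two more models are added.
before = [0.001, 0.01, 0.04]
after = before + [0.5, 0.9]

print(passes_fdr(before))  # [True, True, True]
print(passes_fdr(after))   # [True, True, False, False, False]
```

In this toy example the third model passes when three models are tested but not when five are, even though its own p-value is unchanged.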
After numerous requests, we have now made our prediction models using HapMap SNPs available. These are models trained on GTEx's V6p data, but filtered to include HapMap SNPs only. These may prove beneficial to your analysis if you are working with older data or if you are having to impute much of your genotype data in order to achieve more complete coverage with 1000 Genomes. Detailed descriptions of the databases can be found in the README files.
We are happy to announce the release of new and improved Transcriptome Prediction Models, trained on GTEx expression data using 1000 Genomes SNPs. The models (and associated covariance matrices) can be downloaded from PredictDB Data Repository. A more comprehensive explanation of the model changes is available at PrediXcan Wiki.
We will send out an email to the group once these models are made available at PredictDB Data Repository, so keep an eye out in the coming days. Cheers!
Since we released our updated prediction models a little over a week ago, we have received numerous requests for models based on HapMap SNPs only instead of 1000 Genomes. We are happy to provide these models, and they will likely be released in the coming days. If you have needed to impute much of your genotype data to achieve more complete coverage of the prediction model, or if you are working with older datasets in general, it may prove beneficial to use these HapMap models instead.
We have also released a new MetaXcan version that takes advantage of these new models, and has additional output statistics such as the model's prediction performance p-value. Since the new models have a different database schema, updating to the latest MetaXcan version is required. Please note that there is no backward compatibility with the old models.
Thank you all for your involvement with PrediXcan and MetaXcan, and for your help and feedback. We hope that these new models will prove beneficial to your research. Cheers!