A: 1KG and HapMap are the labels for the sets of SNPs that were used to train the prediction models. As you have probably guessed, 1KG refers to the 1000 Genomes SNP set and HapMap refers to the HapMap SNP set. Careful consideration of the SNPs in your genotype data should help you decide which of the models to use. The 1000 Genomes models make use of many more SNPs, so we believe they have greater prediction accuracy, and some preliminary investigation with out-of-sample data suggests this is the case. However, if you are working with older data, or you would have to impute a significant amount of your genotype data to achieve more complete coverage of the models, we recommend using the HapMap models. For most use cases, the HapMap models are a good starting point.
Click on the following links to download the complete SNP annotation files we used to train the different models. These are tab-delimited text files containing the chromosome, position, and alleles of each SNP, in addition to its rsid number.
1000 Genomes (150MB)
Alternatively, you can query the SQLite database to get a list of all the SNPs it uses for prediction. See here for a Python function to create a file containing all SNPs in a database.
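As a rough illustration of the query described above, here is a minimal sketch and not the function linked in the answer. It assumes the database keeps its SNP weights in a table named `weights` with an `rsid` column; check the schema of your own file (for example with `.schema` in the sqlite3 shell) and adjust the names if they differ. The function names and the `model_coverage` helper are hypothetical, added only to show how the SNP list could be compared against your genotype data when choosing between the 1KG and HapMap models.

```python
import sqlite3


def list_model_snps(db_path, table="weights", column="rsid"):
    """Return the sorted, de-duplicated SNP IDs stored in a model database.

    The default table and column names are assumptions about the schema.
    SQL identifiers cannot be bound as parameters, so they are interpolated
    into the query; pass only trusted values.
    """
    query = f"SELECT DISTINCT {column} FROM {table} ORDER BY {column}"
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute(query)]
    finally:
        conn.close()


def write_model_snps(db_path, out_path, **kwargs):
    """Write one SNP ID per line to out_path and return how many were written."""
    snps = list_model_snps(db_path, **kwargs)
    with open(out_path, "w") as handle:
        for rsid in snps:
            handle.write(f"{rsid}\n")
    return len(snps)


def model_coverage(model_snps, genotyped_snps):
    """Fraction of the model's SNPs that are present in your genotype data."""
    model = set(model_snps)
    if not model:
        return 0.0
    return len(model & set(genotyped_snps)) / len(model)
```

For example, `write_model_snps("model.db", "model_snps.txt")` would write the SNP list for a database file named `model.db` (a placeholder path), and `model_coverage(list_model_snps("model.db"), my_rsids)` would give the share of model SNPs found in your own list of rsids.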
A: All scripts for model training are located here: https://github.com/hakyimlab/PredictDBPipeline. Note: keep in mind that a large portion of the scripts deal with splitting up the data and submitting jobs to the HPC cluster at the University of Chicago. The scripts you would probably be most interested in for model training and building a database are GTEx_Tissue_Wide_CV_elasticNet.R and make_sqlite_db.py.