MetaXcan output file formats

Festus

2022-03-08

Categories: FAQ

The MetaXcan software hosts a suite of tools, namely PrediXcan, SPrediXcan, MultiXcan and SMultiXcan. This post describes the output file format of each tool.

PrediXcan

Individual-level data method to compute gene-trait associations. Detailed info

The output is a tab-delimited file with one row per individual and one column per gene, holding the predicted expression of each gene for each individual.

The first two columns contain the FID and IID for every observation.
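
As a minimal sketch of how such a file could be read, the snippet below parses a small table with Python's standard csv module. The sample IDs, gene IDs and values are invented for illustration, and the function name read_predicted_expression is hypothetical; only the layout (tab-delimited, FID and IID first, one column per gene) comes from the description above.

```python
import csv
import io

# Made-up example data in the layout described above:
# tab-delimited, FID and IID first, then one column per gene.
SAMPLE = (
    "FID\tIID\tENSG00000000001\tENSG00000000002\n"
    "fam1\tind1\t0.12\t-0.40\n"
    "fam2\tind2\t-0.05\t0.33\n"
)


def read_predicted_expression(handle):
    """Return (gene_ids, {(FID, IID): [predicted expression per gene]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    gene_ids = header[2:]  # everything after FID and IID is a gene column
    expression = {}
    for row in reader:
        fid, iid = row[0], row[1]
        expression[(fid, iid)] = [float(value) for value in row[2:]]
    return gene_ids, expression


genes, expression = read_predicted_expression(io.StringIO(SAMPLE))
print(genes)
print(expression[("fam1", "ind1")])
```

For a real file, the io.StringIO object would be replaced by an open file handle.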

Association

Gives the association between predicted expression and an outcome (PrediXcanAssociation.py).

Each output has the following columns:

gene: Ensembl ID or intron id
effect: estimated effect size
se: estimated effect size standard error
zscore: predicted association z-score
pvalue: association p-value
n_samples: number of samples used
status: If there was any error in the computation, it is stated here
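
A common next step is to flag genes that pass a multiple-testing threshold. The sketch below applies a Bonferroni correction to the pvalue column. The data are made up, the tab delimiter and the "NA" placeholder for a failed gene are assumptions, and bonferroni_hits is a hypothetical helper rather than part of MetaXcan.

```python
import csv
import io
import math

# Made-up association results; the tab delimiter and the "NA" placeholder
# are assumptions made for this example. The status column is left empty.
SAMPLE = (
    "gene\teffect\tse\tzscore\tpvalue\tn_samples\tstatus\n"
    "ENSG00000000001\t0.50\t0.10\t5.0\t5.7e-07\t1000\t\n"
    "ENSG00000000002\t0.02\t0.10\t0.2\t0.84\t1000\t\n"
    "ENSG00000000003\tNA\tNA\tNA\tNA\t1000\t\n"
)


def bonferroni_hits(handle, alpha=0.05):
    """Return (genes passing alpha / n_tests, threshold used)."""
    tested = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            pvalue = float(row["pvalue"])
        except ValueError:
            continue  # no usable p-value for this gene
        if math.isnan(pvalue):
            continue
        tested.append((row["gene"], pvalue))
    threshold = alpha / len(tested) if tested else 0.0
    hits = [gene for gene, pvalue in tested if pvalue < threshold]
    return hits, threshold


hits, threshold = bonferroni_hits(io.StringIO(SAMPLE))
print(hits, threshold)
```

Only genes with a parseable p-value count towards the number of tests, so the third row above does not tighten the threshold.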

SPrediXcan

Runs association between the gene models and summary statistics.

Each output file is a CSV, with each row containing a gene association at a given trait-tissue combination:

gene: Ensembl ID or intron id
gene_name: HUGO name or intron id
zscore: predicted association z-score
effect_size: estimated effect size
pvalue: association p-value
var_g: estimated variance of predicted expression or splicing, calculated as W' * G * W (where W is the vector of SNP weights in a gene’s model, W' is its transpose, and G is the covariance matrix)
pred_perf_r2: cross-validated R2 of the prediction model (prediction performance)
pred_perf_pval: p-value of the prediction model's cross-validated performance
pred_perf_qval: deprecated, empty field left for compatibility
n_snps_used: number of SNPs in the intersection of GWAS and model
n_snps_in_cov: number of SNPs in the LD compilation
n_snps_in_model: number of SNPs in the model
best_gwas_p: smallest p-value across GWAS SNPs used in this model
largest_weight: largest prediction model weight
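
The var_g formula above, W' * G * W, is a quadratic form. The following sketch evaluates it in plain Python for a made-up two-SNP model; the weights and covariance values are invented, and quadratic_form is a hypothetical helper, not a MetaXcan function.

```python
def quadratic_form(weights, covariance):
    """Compute W' * G * W for a weight vector W and a square matrix G."""
    n = len(weights)
    if len(covariance) != n or any(len(row) != n for row in covariance):
        raise ValueError("covariance must be an n x n matrix matching weights")
    # First G * W ...
    gw = [sum(covariance[i][j] * weights[j] for j in range(n)) for i in range(n)]
    # ... then W' * (G * W), which is a single number.
    return sum(weights[i] * gw[i] for i in range(n))


# Made-up example: two SNPs in a gene's model.
W = [0.5, -0.2]          # SNP weights
G = [[1.0, 0.3],
     [0.3, 2.0]]         # SNP covariance matrix
var_g = quadratic_form(W, G)
print(var_g)
```

Here G * W is [0.44, -0.25], so var_g comes out at about 0.27.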

MultiXcan

Multi-Tissue PrediXcan, takes multiple gene expression files as input.

This script computes a gene-level association between predicted gene expression and a human trait, using multiple studies for each gene jointly. It supports adjusting for covariates. It takes as input predicted expression files as generated by Predict.py.

The results look like:

gene: a gene’s id, as listed in the Tissue Transcriptome model. Ensembl ID for most gene model releases. Can also be an intron’s id for splicing model releases.
pvalue: significance p-value of MultiXcan association
n_models: number of models (tissues) available for this gene
n_samples: number of individuals available to this gene-phenotype combination (i.e. inner join of phenotype and predictions)
p_i_best: best p-value of single-tissue PrediXcan association.
m_i_best: name of best single-tissue PrediXcan association.
p_i_worst: worst p-value of single-tissue PrediXcan association.
m_i_worst: name of worst single-tissue PrediXcan association.
status: If there was any error in the computation, it is stated here
n_used: number of independent components of variation kept among the tissues' predictions. (Synthetic independent tissues)
max_eigen: In the PCA decomposition of predicted expression, the maximum eigenvalue.
min_eigen: In the PCA decomposition of predicted expression, the minimum eigenvalue.
min_eigen_kept: In the PCA decomposition of predicted expression, the minimum eigenvalue kept (i.e. surviving SVD)
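
To illustrate how n_used, max_eigen, min_eigen and min_eigen_kept relate to each other, the sketch below takes a list of eigenvalues and keeps those that are not too small relative to the largest one. The condition-number rule, its default of 30 and the helper name are assumptions made for this example; the post does not state which criterion MultiXcan applies.

```python
def summarize_eigenvalues(eigenvalues, max_condition_number=30.0):
    """Hypothetical helper relating eigenvalues to the four PCA columns.

    Assumption: a component is kept when largest_eigenvalue / eigenvalue
    is below max_condition_number.
    """
    if not eigenvalues:
        raise ValueError("need at least one eigenvalue")
    ordered = sorted(eigenvalues, reverse=True)
    max_eigen = ordered[0]
    if max_eigen <= 0:
        raise ValueError("largest eigenvalue must be positive")
    kept = [e for e in ordered if e > 0 and max_eigen / e < max_condition_number]
    return {
        "n_used": len(kept),           # components that survive
        "max_eigen": max_eigen,        # largest eigenvalue overall
        "min_eigen": ordered[-1],      # smallest eigenvalue overall
        "min_eigen_kept": kept[-1],    # smallest eigenvalue that survives
    }


# Made-up eigenvalues for five tissues.
summary = summarize_eigenvalues([2.5, 1.2, 0.25, 0.04, 0.01])
print(summary)
```

With these made-up values, three components survive: min_eigen is 0.01, while min_eigen_kept is 0.25.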

If you specify --loadings_output, you’ll get a file specifying the loadings of the PC decomposition of predicted expressions for each gene:

gene: Ensembl ID (or intron id) being analyzed
pc: identifier of principal component
tissue: tissue being analyzed
weight: coefficient of loading from tissues to PC
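
One way to read the loadings file is as a set of linear combinations: each PC is a weighted sum of the tissue predictions. The sketch below loads a made-up table and projects one individual's per-tissue predictions onto a PC. The tab delimiter, the pc labels, the tissue names and both helper functions are assumptions for illustration, and any centring or scaling the tool may apply before the decomposition is ignored.

```python
import csv
import io
from collections import defaultdict

# Made-up loadings table; delimiter and identifiers are assumptions.
SAMPLE = (
    "gene\tpc\ttissue\tweight\n"
    "ENSG00000000001\tpc0\tTissue_A\t0.6\n"
    "ENSG00000000001\tpc0\tTissue_B\t0.8\n"
    "ENSG00000000001\tpc1\tTissue_A\t-0.8\n"
    "ENSG00000000001\tpc1\tTissue_B\t0.6\n"
)


def read_loadings(handle):
    """Return nested mapping gene -> pc -> tissue -> weight."""
    loadings = defaultdict(lambda: defaultdict(dict))
    for row in csv.DictReader(handle, delimiter="\t"):
        loadings[row["gene"]][row["pc"]][row["tissue"]] = float(row["weight"])
    return loadings


def project(pc_weights, predicted_by_tissue):
    """Weighted sum of per-tissue predictions for one PC."""
    return sum(w * predicted_by_tissue[t] for t, w in pc_weights.items())


loadings = read_loadings(io.StringIO(SAMPLE))
predicted = {"Tissue_A": 1.0, "Tissue_B": 0.5}  # one individual's predictions
pc0_score = project(loadings["ENSG00000000001"]["pc0"], predicted)
pc1_score = project(loadings["ENSG00000000001"]["pc1"], predicted)
print(pc0_score, pc1_score)
```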

If you specify --coefficient_output, you get a file with effect sizes for the tissues involved in each gene:

param: effect size of the PCA-regularized regression. (i.e. effect sizes of the PC components, converted to tissue-space)
variable: tissue being analyzed
gene: Ensembl ID (or intron id)
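
As a small usage sketch, the coefficient file can be scanned to find, for each gene, the tissue with the largest absolute effect size. The data below are invented, the tab delimiter is an assumption, and strongest_tissue_per_gene is a hypothetical helper.

```python
import csv
import io

# Made-up coefficient table with the three columns listed above.
SAMPLE = (
    "param\tvariable\tgene\n"
    "0.10\tTissue_A\tENSG00000000001\n"
    "-0.35\tTissue_B\tENSG00000000001\n"
    "0.20\tTissue_A\tENSG00000000002\n"
)


def strongest_tissue_per_gene(handle):
    """Return {gene: (tissue, param)} for the largest |param| per gene."""
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        param = float(row["param"])
        current = best.get(row["gene"])
        if current is None or abs(param) > abs(current[1]):
            best[row["gene"]] = (row["variable"], param)
    return best


strongest = strongest_tissue_per_gene(io.StringIO(SAMPLE))
print(strongest)
```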

SMultiXcan

Summary-stats based Multi-Tissue PrediXcan.

The results contain the following columns:

gene: a gene’s id: as listed in the Tissue Transcriptome model.
gene_name: gene name as listed by the Transcriptome Model, typically HUGO for a gene. It can also be an intron’s id.
pvalue: significance p-value of S-MultiXcan association
n: number of “tissues” available for this gene
n_indep: number of independent components of variation kept among the tissues' predictions. (Synthetic independent tissues)
p_i_best: best p-value of single-tissue S-PrediXcan association.
t_i_best: name of best single-tissue S-PrediXcan association.
p_i_worst: worst p-value of single-tissue S-PrediXcan association.
t_i_worst: name of worst single-tissue S-PrediXcan association.
eigen_max: In the SVD decomposition of predicted expression correlation matrix: eigenvalue (variance explained) of the top independent component
eigen_min: In the SVD decomposition of predicted expression correlation matrix: eigenvalue (variance explained) of the last independent component
eigen_min_kept: In the SVD decomposition of predicted expression correlation matrix: eigenvalue (variance explained) of the smallest independent component that was kept.
z_min: minimum z-score among single-tissue S-PrediXcan associations.
z_max: maximum z-score among single-tissue S-PrediXcan associations.
z_mean: mean z-score among single-tissue S-PrediXcan associations.
z_sd: standard deviation of the z-scores among single-tissue S-PrediXcan associations.
tmi: trace of T * T', where T is the correlation of predicted expression levels for different tissues multiplied by its SVD pseudo-inverse. It is an estimate of the number of independent components of variation in predicted expression across tissues (typically close to n_indep)
status: If there was any error in the computation, it is stated here
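
The per-tissue summary columns (p_i_best, t_i_best, p_i_worst, t_i_worst, z_min, z_max, z_mean, z_sd) can be reproduced from single-tissue results. The sketch below does so for made-up values. The tissue names and the helper are hypothetical, and the use of the population standard deviation (statistics.pstdev) is an assumption, since the post does not say which estimator is used.

```python
import statistics


def summarize_tissues(z_by_tissue, p_by_tissue):
    """Hypothetical helper: per-gene summaries of single-tissue results."""
    z = list(z_by_tissue.values())
    t_best = min(p_by_tissue, key=p_by_tissue.get)    # smallest p-value
    t_worst = max(p_by_tissue, key=p_by_tissue.get)   # largest p-value
    return {
        "n": len(z),
        "p_i_best": p_by_tissue[t_best],
        "t_i_best": t_best,
        "p_i_worst": p_by_tissue[t_worst],
        "t_i_worst": t_worst,
        "z_min": min(z),
        "z_max": max(z),
        "z_mean": statistics.mean(z),
        # Assumption: population standard deviation of the z-scores.
        "z_sd": statistics.pstdev(z),
    }


# Made-up single-tissue results for one gene.
z_scores = {"Tissue_A": 2.0, "Tissue_B": -1.0, "Tissue_C": 5.0}
p_values = {"Tissue_A": 0.0455, "Tissue_B": 0.317, "Tissue_C": 5.7e-07}
summary = summarize_tissues(z_scores, p_values)
print(summary)
```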