Package: MetaGSCA 1.0.0

Authors:
Yan Guo
Fei Ye
Hui Yu

1 Introduction

MetaGSCA is an analytical tool to systematically assess the co-expression of a specified gene set by aggregating evidence across studies. A nonparametric approach (GSNCA) that accounts for the gene-gene correlation structure is used to test whether the gene set is differentially co-expressed between two comparative conditions, from which a permutation test p-statistic is computed. A meta-analysis is then performed with one of two options: random-intercept logistic regression model or the inverse variance method, thus yielding a con-clusive result across individual studies. Based on the results from the meta-analysis, a pathway crosstalk network can be delineated to visually reflect represent pathways’ similar profiles of gene set co-expression across the aligned datasets.

In the manuscript, MetaGSCA was demonstrated on over 300 cellular pathways across 11 TCGA datasets. In the present tutorial, we demonstrate the usage of main functions on two pathways across 11 TCGA datasets.

Packages data.table, meta (metafor), grid, sjstats, igraph are required.

The analysis will start by loading package MetaGSCA.

library(MetaGSCA)


2 How to use the MetaGSCA package

This Section illustrates the typical procedure for applying the main functions provided in package MetaGSCA.

2.1 Set Working Directory

Create a temporary directory “tutorial” under the user’s current working directory. All output generated in this tutorial will be saved in “tutorial”.

## set your workdir
if (!dir.exists('tutorial')) dir.create('tutorial')
setwd('tutorial')


2.2 Example Data

Load the example data:
1. lists of genesets (3 pathways)
2. lists of datasets (11 TCGA datasets)
Two aspects of datasets are necessary for running the examples and case studies in this tutorial.

data(meta)
datasetnames <- names(datasets)
genesetnames <- names(genesets)

Note:
1. lists of genesets (3 pathways):

## [1] "Aminosugars metabolism"     "p38 MAPK pathway"           "p38 MAPK signaling pathway"

2. lists of datasets (11 TCGA datasets):

##          BRCA COAD HNSC KIRC KIRP LIHC LUAD LUSC PRAD STAD THCA
## #Genes   3799 3757 3819 3811 3818 3686 3803 3844 3801 3931 3760
## #Samples  226   80   86  144   64  100  116  102  104   64  118



2.3 Usage of Major Functions

2.3.1 Meta-analysis of gene set differential correlation (MetaGSCA function)

MetaGSCA() function performs meta-analysis of gene set differential co-expression analysis for one or multiple genesets. When dealing with only one gene set, users supply one item for each of the two arguments: list.geneset and names.geneset; when dealing with multiple gene sets, users supply multiple items for each of these two arguments. For simplicity, here we demonstrate the usage on two gene sets.

Arguments
* list.geneset: a list of gene sets (one or multiple).
* list.dataset: a list of datasets, first column is gene name.
* list.group: a list of samples/patients subgroup or condition (e.g. (1,1,1,2,2,2)).
* names.geneset: gene set names, corresponding to list.geneset.
* names.dataset: dataset names, corresponding to list.dataset.
* nperm: number of permutations used to estimate the null distribution of the test statistic. If not given, a default value 500 is used.
* nboot: number of bootstraps used to estimate the point and interval estimate. If not given, a default value 200 is used.
* method: meta-analysis method. Must be either GLMM or Inverse.
* effect: statistical model applied in meta-analysis. Must be either random or fixed.

2.3.1.1 Code

testMultiple <- MetaGSCA(list.geneset = genesets[1:2],
                       list.dataset = datasets,
                       list.group = groups,
                       names.geneset = genesetnames[1:2],
                       names.dataset = datasetnames,
                       nperm = 100,
                       nboot = 100,
                       method = 'GLMM',
                       effect = 'random')
## 
## 2021-08-20 11:41:31 Processing dataset: BRCA...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:41:55 Processing dataset: COAD...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:42:15 Processing dataset: HNSC...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:42:34 Processing dataset: KIRC...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:42:55 Processing dataset: KIRP...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:43:14 Processing dataset: LIHC...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:43:34 Processing dataset: LUAD...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:43:54 Processing dataset: LUSC...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:44:14 Processing dataset: PRAD...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:44:33 Processing dataset: STAD...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:44:52 Processing dataset: THCA...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured


2.3.1.2 Parameters

genesetnames[1:2]
## [1] "Aminosugars metabolism" "p38 MAPK pathway"
genesets[1:2]
## $`Aminosugars metabolism`
##  [1] "CMAS"    "GCK"     "GFPT1"   "GFPT2"   "GNE"     "GNPDA1"  "GNPNAT1" "HK1"     "HK2"     "HK3"     "MPI"     "NAGK"    "NANS"    "PGM3"    "RENBP"   "UAP1"   
## 
## $`p38 MAPK pathway`
##  [1] "ATF1"     "DUSP1"    "DUSP10"   "EEF2K"    "EIF4E"    "ELK1"     "GADD45A"  "HSPB1"    "IL1R1"    "MAP2K4"   "MAP2K6"   "MAP3K10"  "MAP3K4"   "MAP3K5"   "MAP3K7"   "MAPK11"   "MAPK12"   "MAPK13"  
## [19] "MAPK14"   "MAPKAPK2" "MAPKAPK3" "MEF2B"    "MEF2C"    "MEF2D"    "MKNK1"    "MKNK2"    "RPS6KA4"  "RPS6KA5"  "SRF"      "TAB1"     "TAB2"     "TRAF6"


2.3.1.3 Result

In the working directory, two output files will be generated for each gene set: a forest plot (.png) and a table file (.csv). After the command execution, four output files will be generated at the gene-set level, including two forest plots (.png) and two table files (.csv). In addition to these gene-set level files, an overall TSV file will be generated, which assembles output statistics information from multiple gene sets.

## [1] "Aminosugars metabolism_GLMM_random_forestPlot.png"                             "Aminosugars metabolism_Individual Dataset GSAR Results with Bootstrapping.csv"
## [3] "p38 MAPK pathway_GLMM_random_forestPlot.png"                                   "p38 MAPK pathway_Individual Dataset GSAR Results with Bootstrapping.csv"


In each gene-set-specific PNG file, we see the forest plot across multiple datasets, with random effects model bootstrap result attached at the bottom.


multiFig1
Forest plot for one gene set across multiple datasets

multiFig2
Forest plot for another gene set across multiple datasets
The two CSV files each include the output data information for one specific gene set across multiple datasets. Here we only show information for the first two datasets.
Aminosugars metabolism
##                     BRCA                            COAD                           
## Original.TS         "3.094016"                      "3.630329"                     
## Original.Pvalue     "0.118812"                      "0.079208"                     
## Boot.TS.Mean        "0.169505"                      "0.081584"                     
## Boot.TS.Bias        "-2.924511"                     "-3.548745"                    
## Boot.TS.Std         "0.195140"                      "0.118236"                     
## Boot.TS.Norm.CI     "95% CI: [-0.212962, 0.551972]" "95% CI: [-0.150155, 0.313323]"
## Boot.TS.PT.CI       "95% CI: [0.009901, 0.723267]"  "95% CI: [0.009901, 0.427228]" 
## Boot.Pvalue.Mean    "0.169505"                      "0.081584"                     
## Boot.Pvalue.Bias    " 0.050693"                     " 0.002376"                    
## Boot.Pvalue.Std     "0.195140"                      "0.118236"                     
## Boot.Pvalue.Norm.CI "95% CI: [-0.212962, 0.551972]" "95% CI: [-0.150155, 0.313323]"
## Boot.Pvalue.PT.CI   "95% CI: [0.009901, 0.723267]"  "95% CI: [0.009901, 0.427228]" 
## Genes.not.found     "-"                             "-"                            
## Genes.removed       "-"                             "-"

p38 MAPK pathway
##                     BRCA                           COAD                          
## Original.TS         "6.719747"                     "9.629239"                    
## Original.Pvalue     "0.019802"                     "0.009901"                    
## Boot.TS.Mean        "0.026436"                     "0.054554"                    
## Boot.TS.Bias        "-6.693311"                    "-9.574684"                   
## Boot.TS.Std         "0.031595"                     "0.085687"                    
## Boot.TS.Norm.CI     "95% CI: [-0.035489, 0.08836]" "95% CI: [-0.11339, 0.222499]"
## Boot.TS.PT.CI       "95% CI: [0.009901, 0.128713]" "95% CI: [0.009901, 0.293317]"
## Boot.Pvalue.Mean    "0.026436"                     "0.054554"                    
## Boot.Pvalue.Bias    " 0.006634"                    " 0.044654"                   
## Boot.Pvalue.Std     "0.031595"                     "0.085687"                    
## Boot.Pvalue.Norm.CI "95% CI: [-0.035489, 0.08836]" "95% CI: [-0.11339, 0.222499]"
## Boot.Pvalue.PT.CI   "95% CI: [0.009901, 0.128713]" "95% CI: [0.009901, 0.293317]"
## Genes.not.found     "-"                            "-"                           
## Genes.removed       "-"                            "-"


At last, the TSV file _Meta Analysis bootstrap result.tsv assembles the statistics results of all gene sets altogether, including information for all individual datasets.

_Meta Analysis bootstrap result.tsv
* Geneset.Name: Gene set (pathway) name.
* Number.of.Genes: Number of genes included in the pathway.
* meta.p: Meta-analysis summarized permutation p-value across multiple datasets.
* bootstrap.p: Median of meta-analysis summarized bootstrap p-values across 11 datasets.
* BRCA: Permutation p-value generated within one individual dataset (BRCA here, but can be other dataset name).

##                    Aminosugars metabolism p38 MAPK pathway
## Number.of.Genes                   16.0000          32.0000
## meta.p.random                      0.0587           0.0656
## bootstrap.p.random                 0.0476           0.0638
## BRCA                               0.1188           0.0198
## COAD                               0.0792           0.0099
## HNSC                               0.2673           0.3663
## KIRC                               0.0099           0.0792
## KIRP                               0.0099           0.2673
## LIHC                               0.1584           0.0099
## LUAD                               0.0495           0.0396
## LUSC                               0.0099           0.0495
## PRAD                               0.0099           0.0297
## STAD                               0.0594           0.0594
## THCA                               0.4455           0.4257

2.3.1.4 Code for single gene set application

Users can try executing MetaGSCA() function on one gene set only. Arguments list.geneset and names.geneset should be prepared in the same way as if in case of multiple gene sets, only that one item is involved. When executing the following command, one overall text file (**_Meta Analysis bootstrap result.tsv**) and two output files (.png and .csv) for one gene set will be generated.

testSingle <- MetaGSCA(list.geneset = genesets[2],
         list.dataset = datasets, 
         list.group = groups,
         names.geneset = genesetnames[2],
         names.dataset = datasetnames,
         nperm = 100,
         nboot = 100,
         method = 'GLMM',
         effect = 'random')



2.3.2 Pathway Crosstalk Analysis (PWcTalk function)

For pathway crosstalk analysis, one call PWcTalk() function in one line, or alternatively, execute several codelines combining PWcTalkpre() and PWcTalkNW() to enable greater modulation of network layout. One must execute MetaGSCA_Multi_Geneset() function on sufficiently many gene sets (say >100) before attempting pathway crosstalk analysis. To enable swift testing of pathway cross talk analysis, we provide one example input as if it was generated from the prior workflow.

Note: If you are connecting from a Windows desktop to a remote Linux server and run these commands in Linux terminal, you will need an X11 display assistance (such as Xming) configured and started before you run pathway crosstalk analysis.

data(input2PWcTalk)
## One code line to execute pathway crosstalk analysis. Simple, but no flexibility for layout tuning. 
PWcTalk(input2PWcTalk,test='binary',
  pTh.dataset=0.01,pTh.pwPair=0.01,pTh.pw=0.01,figname='PWcTalk',
  pdfW=10,pdfH=10,asp=0.7,vbase=15,ebase=2,vlbase=1,power=1/2)
## One code block to execute pathway crosstalk analysis. More code lines, but enabling interactive layout tuning.
preNW <- PWcTalkNWpre(input2PWcTalk,test='binary',
  pTh.dataset=0.01,pTh.pwPair=0.01,pTh.pw=0.01)
g_tkid <- PWcTalkNW(preNW$PW.pair,preNW$PW.p)
##### PAUSE here: adjust the network layout on the pop-out window to reach a satisfaction #####
coords <- tk_coords(g_tkid$tkid)
g_tkid <- PWcTalkNW(preNW$PW.pair,preNW$PW.p,layout=coords,pdfW=14,pdfH=10,figname='PWcTalk',asp=0.5)       
A demonstration pathway crosstalk network

A demonstration pathway crosstalk network


2.3.3 Auxiliary functions

2.3.3.2 Scrutiny of gene-wise standard deviation (check_sd function)

Differential co-expression analysis is heavily dependent on the variance of data. If many genes have static expression across samples in either or both experimental conditions, GSNCA or alike algorithms may fail. The main function MetaGSCA() calls on a function check_sd() to pre-check the variance of genes in the input data matrix. Here we make this function check_sd() accessible for users so that they may use it to check on the variance situation of the input data themselves. Input to the function include two data matrices of same number of columns (genes) for two comparative conditions respectively, one vector for gene names (colnames of the two matrices), and a threshold of minimum standard deviation (defaults to 0.001). The output is a list of two components: one vector of gene names for the variance-qualifying genes, and a concatenated string including all disqualifying genes.

data(meta)
STAD <- datasets[['STAD']]
N <- ncol(STAD)
n1 <- N%/%2
objt1 <- t(STAD[,1:n1])
objt2 <- t(STAD[,(n1+1):N])
genes <- rownames(STAD)
check.res <- check_sd(objt1,objt2,genes,0.1)
head(check.res$genes.kept)
## [1] "A1BG"  "A2M"   "AACS"  "AADAT" "AANAT" "AARS"
# check.res$genes.removed # three genes were removed.
check.res$genes.removed
## [1] "CRH,IL9,LECT2"


3 Citing MetaGSCA

Guo Y, Yu H, …, Ye Fei. MetaGSCA: A tool for meta-analysis of gene set differential coexpression. PLoS computational biology. 2021;17(5):e1008976.