2 How to use the MetaGSCA package

This Section illustrates the typical procedure for applying the main functions provided in package MetaGSCA.

2.1 Set Working Directory

Create a temporary directory “tutorial” under the user’s current working directory. All output generated in this tutorial will be saved in “tutorial”.

## set your workdir
if (!dir.exists('tutorial')) dir.create('tutorial')
setwd('tutorial')

2.2 Example Data

Load the example data:
1. lists of genesets (3 pathways)
2. lists of datasets (11 TCGA datasets)
Two aspects of datasets are necessary for running the examples and case studies in this tutorial.

data(meta)
datasetnames <- names(datasets)
genesetnames <- names(genesets)

Note:
1. lists of genesets (3 pathways):

## [1] "Aminosugars metabolism"     "p38 MAPK pathway"           "p38 MAPK signaling pathway"

2. lists of datasets (11 TCGA datasets):

##          BRCA COAD HNSC KIRC KIRP LIHC LUAD LUSC PRAD STAD THCA
## #Genes   3799 3757 3819 3811 3818 3686 3803 3844 3801 3931 3760
## #Samples  226   80   86  144   64  100  116  102  104   64  118

2.3 Usage of Major Functions

2.3.1 Meta-analysis of gene set differential correlation (MetaGSCA function)

MetaGSCA() function performs meta-analysis of gene set differential co-expression analysis for one or multiple genesets. When dealing with only one gene set, users supply one item for each of the two arguments: list.geneset and names.geneset; when dealing with multiple gene sets, users supply multiple items for each of these two arguments. For simplicity, here we demonstrate the usage on two gene sets.

Arguments
* list.geneset: a list of gene sets (one or multiple).
* list.dataset: a list of datasets, first column is gene name.
* list.group: a list of samples/patients subgroup or condition (e.g. (1,1,1,2,2,2)).
* names.geneset: gene set names, corresponding to list.geneset.
* names.dataset: dataset names, corresponding to list.dataset.
* nperm: number of permutations used to estimate the null distribution of the test statistic. If not given, a default value 500 is used.
* nboot: number of bootstraps used to estimate the point and interval estimate. If not given, a default value 200 is used.
* method: meta-analysis method. Must be either GLMM or Inverse.
* effect: statistical model applied in meta-analysis. Must be either random or fixed.

2.3.1.1 Code

testMultiple <- MetaGSCA(list.geneset = genesets[1:2],
                       list.dataset = datasets,
                       list.group = groups,
                       names.geneset = genesetnames[1:2],
                       names.dataset = datasetnames,
                       nperm = 100,
                       nboot = 100,
                       method = 'GLMM',
                       effect = 'random')

## 
## 2021-08-20 11:41:31 Processing dataset: BRCA...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:41:55 Processing dataset: COAD...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:42:15 Processing dataset: HNSC...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:42:34 Processing dataset: KIRC...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:42:55 Processing dataset: KIRP...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:43:14 Processing dataset: LIHC...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:43:34 Processing dataset: LUAD...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:43:54 Processing dataset: LUSC...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:44:14 Processing dataset: PRAD...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:44:33 Processing dataset: STAD...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured
## 
## 2021-08-20 11:44:52 Processing dataset: THCA...
## 
##   Aminosugars metabolism : Results Captured
##   p38 MAPK pathway : Results Captured

2.3.1.2 Parameters

genesetnames[1:2]

## [1] "Aminosugars metabolism" "p38 MAPK pathway"

genesets[1:2]

## $`Aminosugars metabolism`
##  [1] "CMAS"    "GCK"     "GFPT1"   "GFPT2"   "GNE"     "GNPDA1"  "GNPNAT1" "HK1"     "HK2"     "HK3"     "MPI"     "NAGK"    "NANS"    "PGM3"    "RENBP"   "UAP1"   
## 
## $`p38 MAPK pathway`
##  [1] "ATF1"     "DUSP1"    "DUSP10"   "EEF2K"    "EIF4E"    "ELK1"     "GADD45A"  "HSPB1"    "IL1R1"    "MAP2K4"   "MAP2K6"   "MAP3K10"  "MAP3K4"   "MAP3K5"   "MAP3K7"   "MAPK11"   "MAPK12"   "MAPK13"  
## [19] "MAPK14"   "MAPKAPK2" "MAPKAPK3" "MEF2B"    "MEF2C"    "MEF2D"    "MKNK1"    "MKNK2"    "RPS6KA4"  "RPS6KA5"  "SRF"      "TAB1"     "TAB2"     "TRAF6"

2.3.1.3 Result

In the working directory, two output files will be generated for each gene set: a forest plot (.png) and a table file (.csv). After the command execution, four output files will be generated at the gene-set level, including two forest plots (.png) and two table files (.csv). In addition to these gene-set level files, an overall TSV file will be generated, which assembles output statistics information from multiple gene sets.

## [1] "Aminosugars metabolism_GLMM_random_forestPlot.png"                             "Aminosugars metabolism_Individual Dataset GSAR Results with Bootstrapping.csv"
## [3] "p38 MAPK pathway_GLMM_random_forestPlot.png"                                   "p38 MAPK pathway_Individual Dataset GSAR Results with Bootstrapping.csv"

In each gene-set-specific PNG file, we see the forest plot across multiple datasets, with random effects model bootstrap result attached at the bottom.

Forest plot for one gene set across multiple datasets
multiFig2

Forest plot for another gene set across multiple datasets The two CSV files each include the output data information for one specific gene set across multiple datasets. Here we only show information for the first two datasets.
Aminosugars metabolism

##                     BRCA                            COAD                           
## Original.TS         "3.094016"                      "3.630329"                     
## Original.Pvalue     "0.118812"                      "0.079208"                     
## Boot.TS.Mean        "0.169505"                      "0.081584"                     
## Boot.TS.Bias        "-2.924511"                     "-3.548745"                    
## Boot.TS.Std         "0.195140"                      "0.118236"                     
## Boot.TS.Norm.CI     "95% CI: [-0.212962, 0.551972]" "95% CI: [-0.150155, 0.313323]"
## Boot.TS.PT.CI       "95% CI: [0.009901, 0.723267]"  "95% CI: [0.009901, 0.427228]" 
## Boot.Pvalue.Mean    "0.169505"                      "0.081584"                     
## Boot.Pvalue.Bias    " 0.050693"                     " 0.002376"                    
## Boot.Pvalue.Std     "0.195140"                      "0.118236"                     
## Boot.Pvalue.Norm.CI "95% CI: [-0.212962, 0.551972]" "95% CI: [-0.150155, 0.313323]"
## Boot.Pvalue.PT.CI   "95% CI: [0.009901, 0.723267]"  "95% CI: [0.009901, 0.427228]" 
## Genes.not.found     "-"                             "-"                            
## Genes.removed       "-"                             "-"

p38 MAPK pathway

##                     BRCA                           COAD                          
## Original.TS         "6.719747"                     "9.629239"                    
## Original.Pvalue     "0.019802"                     "0.009901"                    
## Boot.TS.Mean        "0.026436"                     "0.054554"                    
## Boot.TS.Bias        "-6.693311"                    "-9.574684"                   
## Boot.TS.Std         "0.031595"                     "0.085687"                    
## Boot.TS.Norm.CI     "95% CI: [-0.035489, 0.08836]" "95% CI: [-0.11339, 0.222499]"
## Boot.TS.PT.CI       "95% CI: [0.009901, 0.128713]" "95% CI: [0.009901, 0.293317]"
## Boot.Pvalue.Mean    "0.026436"                     "0.054554"                    
## Boot.Pvalue.Bias    " 0.006634"                    " 0.044654"                   
## Boot.Pvalue.Std     "0.031595"                     "0.085687"                    
## Boot.Pvalue.Norm.CI "95% CI: [-0.035489, 0.08836]" "95% CI: [-0.11339, 0.222499]"
## Boot.Pvalue.PT.CI   "95% CI: [0.009901, 0.128713]" "95% CI: [0.009901, 0.293317]"
## Genes.not.found     "-"                            "-"                           
## Genes.removed       "-"                            "-"

At last, the TSV file _Meta Analysis bootstrap result.tsv assembles the statistics results of all gene sets altogether, including information for all individual datasets.

_Meta Analysis bootstrap result.tsv
* Geneset.Name: Gene set (pathway) name.
* Number.of.Genes: Number of genes included in the pathway.
* meta.p: Meta-analysis summarized permutation p-value across multiple datasets.
* bootstrap.p: Median of meta-analysis summarized bootstrap p-values across 11 datasets.
* BRCA: Permutation p-value generated within one individual dataset (BRCA here, but can be other dataset name).

##                    Aminosugars metabolism p38 MAPK pathway
## Number.of.Genes                   16.0000          32.0000
## meta.p.random                      0.0587           0.0656
## bootstrap.p.random                 0.0476           0.0638
## BRCA                               0.1188           0.0198
## COAD                               0.0792           0.0099
## HNSC                               0.2673           0.3663
## KIRC                               0.0099           0.0792
## KIRP                               0.0099           0.2673
## LIHC                               0.1584           0.0099
## LUAD                               0.0495           0.0396
## LUSC                               0.0099           0.0495
## PRAD                               0.0099           0.0297
## STAD                               0.0594           0.0594
## THCA                               0.4455           0.4257

2.3.1.4 Code for single gene set application

Users can try executing MetaGSCA() function on one gene set only. Arguments list.geneset and names.geneset should be prepared in the same way as if in case of multiple gene sets, only that one item is involved. When executing the following command, one overall text file (**_Meta Analysis bootstrap result.tsv**) and two output files (.png and .csv) for one gene set will be generated.

testSingle <- MetaGSCA(list.geneset = genesets[2],
         list.dataset = datasets, 
         list.group = groups,
         names.geneset = genesetnames[2],
         names.dataset = datasetnames,
         nperm = 100,
         nboot = 100,
         method = 'GLMM',
         effect = 'random')

2.3.2 Pathway Crosstalk Analysis (PWcTalk function)

For pathway crosstalk analysis, one call PWcTalk() function in one line, or alternatively, execute several codelines combining PWcTalkpre() and PWcTalkNW() to enable greater modulation of network layout. One must execute MetaGSCA_Multi_Geneset() function on sufficiently many gene sets (say >100) before attempting pathway crosstalk analysis. To enable swift testing of pathway cross talk analysis, we provide one example input as if it was generated from the prior workflow.

Note: If you are connecting from a Windows desktop to a remote Linux server and run these commands in Linux terminal, you will need an X11 display assistance (such as Xming) configured and started before you run pathway crosstalk analysis.

data(input2PWcTalk)
## One code line to execute pathway crosstalk analysis. Simple, but no flexibility for layout tuning. 
PWcTalk(input2PWcTalk,test='binary',
  pTh.dataset=0.01,pTh.pwPair=0.01,pTh.pw=0.01,figname='PWcTalk',
  pdfW=10,pdfH=10,asp=0.7,vbase=15,ebase=2,vlbase=1,power=1/2)
## One code block to execute pathway crosstalk analysis. More code lines, but enabling interactive layout tuning.
preNW <- PWcTalkNWpre(input2PWcTalk,test='binary',
  pTh.dataset=0.01,pTh.pwPair=0.01,pTh.pw=0.01)
g_tkid <- PWcTalkNW(preNW$PW.pair,preNW$PW.p)
##### PAUSE here: adjust the network layout on the pop-out window to reach a satisfaction #####
coords <- tk_coords(g_tkid$tkid)
g_tkid <- PWcTalkNW(preNW$PW.pair,preNW$PW.p,layout=coords,pdfW=14,pdfH=10,figname='PWcTalk',asp=0.5)

A demonstration pathway crosstalk network

2.3.3 Auxiliary functions

2.3.3.1 Gene Sets Net Correlations Analysis (GSNCA related functions)

For users convenience, we provide several functions pertinent to the core differential co-expression analysis algorithm, GSNCA. Function gsnca_stat() calculates the overall correlation distance metric for a gene set between two experimental conditions, and thus it returns a scalar value. Function gsnca_p() implements the sample label permutation and calculates the p-statistic based on a number of permutated trials; this function returns a list of two scalar components, one for the correlation distance statistic and the other for the p-statistic. Function gsnca_gsets() conducts GSNCA (including permutation) for multiple gene sets, and its list output has the external layer for the multiple gene sets and the internal layer of the same format as gsnca_p() output.

data(meta)
BRCA <- datasets[['BRCA']]
N <- ncol(BRCA)
n1 <- N%/%2
objt1 <- t(BRCA[1:100,1:n1])
objt2 <- t(BRCA[1:100,(n1+1):N])
gStat.res <- gsnca_stat(objt1,objt2)
# gsnca_stat() result
gStat.res

## [1] 12.42113

data(meta)
BRCA <- datasets[['BRCA']]
smpCode <- substr(colnames(BRCA),14,15)
grp1 <- which(smpCode=='01')
grp2 <- which(smpCode=='11')
object <- BRCA[1:100,c(grp1,grp2)]
group <- c(rep(1,length(grp1)),rep(2,length(grp1)))
perm.list <- vector('list',500)
for (i in seq_len(500)) {perm.list[[i]] <- sample(ncol(object))}
gsnca_p.res <- gsnca_p(object,group,perm.list)
gsets <- split(rownames(object),rep(1:4,each=25)) 
gsnca_gsets.res <- gsnca_gsets(gsets,object,group,perm.list)
# gsnca_p() result
gsnca_p.res

## $p
## [1] 0.001996008
## 
## $stat
## [1] 28.38517

# gsnca_gsets() result
gsnca_gsets.res[1:2]

## $`1`
## $`1`$p
## [1] 0.001996008
## 
## $`1`$stat
## [1] 6.47211
## 
## 
## $`2`
## $`2`$p
## [1] 0.001996008
## 
## $`2`$stat
## [1] 7.532444

2.3.3.2 Scrutiny of gene-wise standard deviation (check_sd function)

Differential co-expression analysis is heavily dependent on the variance of data. If many genes have static expression across samples in either or both experimental conditions, GSNCA or alike algorithms may fail. The main function MetaGSCA() calls on a function check_sd() to pre-check the variance of genes in the input data matrix. Here we make this function check_sd() accessible for users so that they may use it to check on the variance situation of the input data themselves. Input to the function include two data matrices of same number of columns (genes) for two comparative conditions respectively, one vector for gene names (colnames of the two matrices), and a threshold of minimum standard deviation (defaults to 0.001). The output is a list of two components: one vector of gene names for the variance-qualifying genes, and a concatenated string including all disqualifying genes.

data(meta)
STAD <- datasets[['STAD']]
N <- ncol(STAD)
n1 <- N%/%2
objt1 <- t(STAD[,1:n1])
objt2 <- t(STAD[,(n1+1):N])
genes <- rownames(STAD)
check.res <- check_sd(objt1,objt2,genes,0.1)
head(check.res$genes.kept)

## [1] "A1BG"  "A2M"   "AACS"  "AADAT" "AANAT" "AARS"

# check.res$genes.removed # three genes were removed.
check.res$genes.removed

## [1] "CRH,IL9,LECT2"

Tutorial of the MetaGSCA package