ConfoundR

Introduction

Aims

ConfoundR is an interactive web application developed in R with Shiny. The goals of this app are to enable users to:

Compare the expression of an individual gene in tumour epithelium with the expression of the same gene in stromal/non-epithelial cells
Compare the expression of multiple genes in tumour epithelium with the expression of the same genes in stromal/non-epithelial cells
Compare the expression/enrichment of a gene set/gene signature in tumour epithelium with the expression/enrichment of the same gene set/gene signature in stromal/non-epithelial cells

By enabling these comparisons, the app lets users identify individual genes, or gene sets/gene signatures, that show differential expression/enrichment across tumour cell populations and may therefore be confounded when examined in bulk transcriptomic tumour samples, as a consequence of differing proportions of tumour cell populations.

If you are ready to carry out this analysis, you can skip the introductions below and examine the expression of individual genes using the Expression Boxplots page, or examine the expression of multiple genes using either a heatmap on the Expression Heatmap page or the Gene Set Enrichment Analysis (GSEA) method on the GSEA page. For more information on the datasets used in the ConfoundR app, or for more details on the analysis modules available in the app, see below.

Datasets

Eight datasets, available from Gene Expression Omnibus (GEO), are used in the ConfoundR app:

three from colorectal cancer (CRC) (GSE39396, GSE35602 & GSE31279)
two from breast cancer (GSE81838 & GSE14548), one of which is from triple negative breast cancer (TNBC) (GSE81838)
one from pancreatic ductal adenocarcinoma (PDAC) (GSE164665)
one from ovarian cancer (GSE9899)
one from prostate cancer (GSE97284)

Each of these datasets consists of matched tumour epithelium and stromal/non-epithelial samples that have been obtained from bulk tumour specimens using either laser capture microdissection (LCM) or fluorescence-activated cell sorting (FACS). More detailed descriptions of each dataset and the pre-processing steps applied are given below.

GSE39396

Cohort of tumours from 6 CRC patients. Each tumour was separated by FACS into epithelial cells (EPCAM+), leukocytes (CD45+), endothelial cells (CD31+) and fibroblasts (FAP+). This cohort was profiled on the Affymetrix HT HG-U133+ PM Array Plate and background corrected, quantile normalised, summarised and log2 transformed using the robust multi-array average (RMA) method. The probe IDs were aligned to gene symbols using the microarray annotation file. For genes represented by multiple probes a single measurement for the gene was selected by choosing the probe with highest mean value across the samples.
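The probe-selection rule described above (one probe per gene, chosen by highest mean across samples) can be sketched as follows. This is an illustrative Python sketch, not the app's own R code; the probe IDs, gene symbols and values are hypothetical, and dropping unannotated probes is an assumption.

```python
from statistics import mean

def collapse_probes(expression, probe_to_gene):
    """Keep, for each gene, the probe with the highest mean expression.

    expression:    dict mapping probe ID -> list of per-sample values
    probe_to_gene: dict mapping probe ID -> gene symbol
    Returns a dict mapping gene symbol -> list of per-sample values.
    """
    best = {}  # gene symbol -> (mean expression, probe ID)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # assumption: probes without a gene symbol are dropped
        probe_mean = mean(values)
        if gene not in best or probe_mean > best[gene][0]:
            best[gene] = (probe_mean, probe)
    return {gene: expression[probe] for gene, (_, probe) in best.items()}

# Hypothetical probes: two map to GENE1, one to GENE2, one is unannotated
expression = {
    "probe_a": [5.0, 6.0, 7.0],
    "probe_b": [8.0, 9.0, 10.0],
    "probe_c": [3.0, 3.5, 4.0],
    "probe_d": [1.0, 1.0, 1.0],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}
collapsed = collapse_probes(expression, probe_to_gene)
```

Here GENE1 is represented by probe_b, because its mean across samples is higher than that of probe_a.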

GSE35602

Cohort of 13 matched tumour epithelium and stroma LCM from human CRC tissue (4 normal samples were removed). This cohort was profiled using Agilent-014850 Whole Human Genome Microarray 4x44K G4112F and lowess normalised. The series matrix was downloaded from GEO and the 4 normal samples removed. The probe IDs were aligned to gene symbols using the microarray annotation file. For genes represented by multiple probes a single measurement for the gene was selected by choosing the probe with highest mean value across the samples.

GSE31279

Cohort of 8 matched tumour epithelium and stroma LCM from human CRC tissue and 2 cases with LCM tumour stroma only: GSM775275 and GSM775279. Profiled by Illumina humanRef-8 v2.0 expression beadchip and quantile normalised. The series matrix, which included matched whole tumour and normal for 35 patients, with 10 of those patients having LCM tumour stroma +/- epithelium, was downloaded from GEO. The 18 LCM samples of interest (8 matched tumour epithelium and stroma samples plus the 2 stroma samples with no matched tumour epithelium) were extracted from the matrix. Genes with no expression across all samples were removed. The probe IDs were aligned to gene symbols using the microarray annotation file. For genes represented by multiple probes a single measurement for the gene was selected by choosing the probe with highest mean value across the samples.

GSE81838

Cohort of 10 matched tumour epithelium and stroma LCM from human triple negative breast cancer profiled using the Affymetrix Human Gene 1.0 ST Array. RMA normalisation and log2 transformation performed. Probe IDs were aligned to gene symbol using the microarray annotation file. For genes represented by multiple probes a single measurement for the gene was selected by choosing the probe with highest mean value across the samples.

GSE14548

Cohort of 14 fresh frozen primary breast cancer biopsies separated into the epithelial and stroma compartments using LCM and profiled using the Affymetrix Human X3P Array. In the epithelial compartment, normal and malignant (ductal carcinoma in situ (DCIS) or invasive ductal carcinoma (IDC)) epithelium were captured. In the stroma compartment, normal stroma away from the malignant lesion, DCIS-associated stroma and/or IDC-associated stroma were captured whenever possible. The series matrix was downloaded from GEO and only the 9 matched IDC epithelium and IDC-associated stroma samples were selected. Probe IDs were aligned to gene symbol using the microarray annotation file. For genes represented by multiple probes a single measurement for the gene was selected by choosing the probe with highest mean value across the samples.

GSE164665

Cohort of 19 matched tumour epithelium and stroma LCM samples from human pancreatic ductal adenocarcinoma (PDAC) sequenced using Illumina NextSeq 500. The matrices with the raw gene read counts for the tumour epithelium and stroma samples were downloaded from GEO and merged. Genes with low expression (fewer than 10 counts across all samples, or counts in fewer than three samples) were removed. Normalised counts were calculated using size factors estimated by the DESeq2 function estimateSizeFactors. Variance-stabilised, normalised counts were generated using the vst function from DESeq2 with the blind argument set to FALSE.
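The filtering and normalisation steps above can be illustrated with a small sketch. It is written in plain Python rather than with DESeq2, so it only approximates the idea: the filter reads "counts in fewer than three samples" as "non-zero counts in fewer than three samples" (an assumption), and the size factors follow the median-of-ratios approach on genes with no zero counts. Gene names and counts are hypothetical, and the variance-stabilising transformation is not reproduced.

```python
import math
from statistics import median

def filter_low_counts(counts, min_total=10, min_samples=3):
    """Drop genes with a total count below min_total, or with non-zero
    counts in fewer than min_samples samples (assumed reading of the text)."""
    return {
        gene: row
        for gene, row in counts.items()
        if sum(row) >= min_total and sum(1 for c in row if c > 0) >= min_samples
    }

def size_factors(counts):
    """Median-of-ratios size factors, one per sample."""
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes containing a zero count are skipped
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    return [
        math.exp(median(math.log(counts[gene][j]) - lgm
                        for gene, lgm in log_geo_means.items()))
        for j in range(n_samples)
    ]

# Hypothetical raw counts: the second sample is sequenced twice as deeply
raw = {
    "GENE1": [10, 20, 10],
    "GENE2": [100, 200, 100],
    "GENE3": [5, 10, 5],
    "GENE4": [0, 1, 0],  # removed by the low-expression filter
}
kept = filter_low_counts(raw)
factors = size_factors(kept)
normalised = {gene: [c / f for c, f in zip(row, factors)]
              for gene, row in kept.items()}
```

After dividing by the size factors, the difference in sequencing depth between the samples no longer appears as a difference in expression.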

GSE9899

Cohort of 295 human ovarian cancer samples including five matched tumour epithelium and stroma LCM profiled by Affymetrix Human Genome U133 Plus 2.0 Array. The series matrix was downloaded from GEO and the five matched tumour epithelium and stroma LCM samples were selected. Probe IDs were aligned to gene symbol using the microarray annotation file. For genes represented by multiple probes a single measurement for the gene was selected by choosing the probe with highest mean value across the samples.

GSE97284

Cohort of LCM tissue specimens from 12 low grade (Gleason 3+3) and 13 high grade (Gleason 8 and higher) radical prostatectomy and 5 cystoprostatectomy cases profiled using the Affymetrix Human Gene 1.0 ST Array. For each radical prostatectomy case tumour epithelium, prostatic intraepithelial neoplasia (PIN) epithelium and benign epithelium were captured along with adjacent stroma (tumour-associated stroma, PIN-associated stroma and benign-associated stroma). For cystoprostatectomy cases benign epithelium and adjacent stroma were captured. The series matrix was downloaded from GEO and the 25 matched tumour epithelium and tumour-associated stroma samples were selected. Probe IDs were aligned to gene symbol using the microarray annotation file. For genes represented by multiple probes a single measurement for the gene was selected by choosing the probe with highest mean value across the samples.

Expression Boxplots

The Expression Boxplots module allows users to compare the expression of a single gene between epithelial and stromal/non-epithelial cells in each of the datasets. The user enters a gene symbol and, for each dataset in which the chosen gene is found, the module draws boxplots of its expression in epithelial and non-epithelial cells, allowing a visual comparison of gene expression. In addition, a Mann-Whitney U test is performed comparing the expression of the chosen gene in epithelial and stromal/non-epithelial cells, and the p-value is displayed on the plots. Boxplots are made using the ggplot2 package and the Mann-Whitney U tests are performed via the ggpubr package.
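To make the test on the plots concrete, the following sketch computes a Mann-Whitney U statistic and an exact two-sided p-value by enumerating every way of splitting the pooled values into two groups. It is a plain Python illustration, not the ggpubr implementation; it assumes small groups with no tied values, and the expression values are hypothetical.

```python
from itertools import combinations

def mann_whitney_u(x, y):
    """U statistic for x: number of (x, y) pairs with x > y, ties counting 0.5."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)

def exact_p_value(x, y):
    """Exact two-sided p-value by enumerating all group assignments.

    Only practical for small groups; assumes there are no tied values.
    """
    pooled = list(x) + list(y)
    observed = mann_whitney_u(x, y)
    indices = range(len(pooled))
    null_us = []
    for chosen in combinations(indices, len(x)):
        chosen_set = set(chosen)
        group_x = [pooled[i] for i in chosen]
        group_y = [pooled[i] for i in indices if i not in chosen_set]
        null_us.append(mann_whitney_u(group_x, group_y))
    lower = sum(u <= observed for u in null_us) / len(null_us)
    upper = sum(u >= observed for u in null_us) / len(null_us)
    return min(1.0, 2 * min(lower, upper))

# Hypothetical log2 expression values for one gene
epithelium = [7.1, 7.4, 6.9, 7.8]
stroma = [9.2, 9.9, 8.8, 10.1]
u_stat = mann_whitney_u(epithelium, stroma)
p_value = exact_p_value(epithelium, stroma)
```

With four samples per group and complete separation, the smallest achievable two-sided p-value is 2/70, which is why small datasets cannot reach very low p-values with this test.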

Expression Heatmap

The Expression Heatmap module allows users to visually compare the expression of multiple genes between epithelial and stromal/non-epithelial cells in each of the datasets. The user enters a list of gene symbols (each gene symbol should be on a new line) and, for each dataset, the module draws a heatmap of the expression of those chosen genes that are present in the dataset. For heatmaps the expression values for each gene are converted to Z-scores for plotting, enabling visual comparison of the expression of each gene across samples. For the RNA-seq dataset (GSE164665) the transformed, normalised counts produced by the DESeq2 function vst are used as the expression values and are converted to Z-scores for plotting of the heatmap. In the heatmaps, the samples are split into epithelial and stromal/non-epithelial samples to aid comparison between these populations. Heatmaps are made using the ComplexHeatmap R package.
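The per-gene Z-score conversion can be sketched as below. This is an illustration in plain Python, not the app's code; it assumes the sample standard deviation (n - 1 denominator) is used and that a gene with constant expression is shown as all zeros, neither of which is stated in the text. The gene names and values are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Centre and scale one gene's expression values across samples."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (assumption)
    if spread == 0:
        return [0.0] * len(values)  # assumption: constant genes plotted as 0
    return [(v - centre) / spread for v in values]

# Hypothetical expression matrix: gene symbol -> per-sample values
expression = {
    "GENE1": [2.0, 4.0, 6.0],
    "GENE2": [10.0, 10.0, 10.0],
}
heatmap_values = {gene: z_scores(row) for gene, row in expression.items()}
```

Because each gene is scaled separately, colours in the heatmap show where a sample sits relative to that gene's own mean, not absolute expression levels.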

GSEA

The GSEA module enables users to perform gene set enrichment analysis (GSEA). Users can select existing gene sets from the Hallmark, KEGG (Kyoto Encyclopedia of Genes and Genomes), BioCarta, Reactome and PID (Pathway Interaction Database) gene set collections, using the dropdown menus provided, or use their own custom gene set by entering a list of gene symbols (each gene symbol should be on a new line).

DE analysis methods

To perform GSEA it is necessary to first perform differential expression (DE) analysis. This app implements two different methods to conduct DE analysis, depending on the profiling technology used for the transcriptomic profiling of each dataset. For datasets that were profiled using microarray technology DE analysis was carried out using the limma R package while for datasets that were profiled using RNA-Seq technology DE analysis was carried out using the DESeq2 package. For the LCM experiments DE analysis is conducted comparing stroma to tumour epithelium. For the FACS sorted dataset (GSE39396), by default the DE analysis is performed comparing fibroblasts to tumour epithelial cells. However, the user can change which cell populations they wish to compare, for this FACS sorted dataset, using the “Compare” and “to” selection menus provided within the GSE39396 panel on the GSEA page. Multiple cell populations can be selected as one group and compared to another individual cell population or another group of multiple cell populations (e.g. Fibroblasts & Leukocytes compared to Epithelial cells).

GSEA analysis methods

To perform pre-ranked GSEA, DE analysis is first performed in each dataset as described above, comparing gene expression in stromal/non-epithelial cells with expression in epithelial cells. Following DE analysis, the genes in each dataset are ranked by the t-statistic (limma) or Wald statistic (DESeq2) calculated by the DE analysis package. This ranked gene list is used, in combination with the gene set selected/uploaded by the user, to perform pre-ranked GSEA with the GSEA function from the clusterProfiler R package, using the fgsea method (specifically the fgseaSimple method) with 10000 permutations and a random seed of 123.
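The core of the pre-ranked approach, ranking genes by their DE statistic and walking down the list to accumulate a running sum, can be sketched as follows. This is a simplified Python illustration of the standard weighted running-sum enrichment score (weight of 1 assumed), not the clusterProfiler or fgsea implementation; the gene names and statistics are hypothetical.

```python
def enrichment_score(gene_stats, gene_set, weight=1.0):
    """Running-sum enrichment score for a pre-ranked gene list.

    gene_stats: dict mapping gene symbol -> DE statistic (e.g. a t-statistic)
    gene_set:   set of gene symbols
    """
    # Rank genes from the most positive to the most negative statistic
    ranked = sorted(gene_stats.items(), key=lambda item: item[1], reverse=True)
    in_set = [gene in gene_set for gene, _ in ranked]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == len(ranked):
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit_total = sum(abs(stat) ** weight
                    for (_, stat), hit in zip(ranked, in_set) if hit)
    if hit_total == 0:
        raise ValueError("all genes in the set have a statistic of zero")
    miss_step = 1.0 / (len(ranked) - n_hits)

    running, score = 0.0, 0.0
    for (_, stat), hit in zip(ranked, in_set):
        # Step up (weighted by the statistic) at a set member, down otherwise
        running += abs(stat) ** weight / hit_total if hit else -miss_step
        if abs(running) > abs(score):
            score = running  # keep the largest deviation from zero
    return score

# Hypothetical statistics: positive means higher in stroma than epithelium
stats = {"A": 3.0, "B": 2.0, "C": 1.0, "D": -1.0, "E": -2.0, "F": -3.0}
es_top = enrichment_score(stats, {"A", "B"})
es_bottom = enrichment_score(stats, {"E", "F"})
```

A gene set concentrated at the top of the ranking gives a positive score and one concentrated at the bottom gives a negative score.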

GSEA statistics

GSEA provides three statistics: Enrichment Score (ES), Normalised Enrichment Score (NES) and p-value (p). The meaning of each of these is explained below.

Enrichment Score (ES): For the LCM datasets, a positive ES indicates that the gene set is more enriched in the stroma compared to the tumour epithelium. For the FACS dataset, a positive ES indicates that the gene set is more enriched in the cell populations selected in the “Compare” box relative to the cell populations selected in the “to” box (by default this is fibroblasts relative to epithelial cells but this can be changed by the user). On the other hand, a negative ES in the LCM datasets indicates that the gene set is less enriched in the stroma compared to the tumour epithelium (or in other words the gene set is more enriched in the tumour epithelium compared to the stroma). For the FACS dataset, a negative ES indicates that the gene set is less enriched in the cell populations selected in the “Compare” box relative to the cell populations selected in the “to” box (or in other words more enriched in the cell populations selected in the “to” box relative to the cell populations selected in the “Compare” box).
Normalised Enrichment Score (NES): To calculate the NES, enrichment scores are calculated for 10000 random gene sets, with the same number of genes as the user selected gene set. The ES obtained by the user selected gene set is then divided by the mean of the enrichment scores for the random gene sets to obtain the NES.
p-value (p): Using the enrichment scores of the 10000 random gene sets, the probability of obtaining an ES as extreme or more extreme than the ES obtained for the user selected gene set is calculated and this is the p-value (p) for the gene set. The p-value can then be used in conjunction with the NES to determine if the gene set is significantly positively/negatively enriched.
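The NES and p-value described above can be illustrated with a short sketch that takes an observed ES and the enrichment scores of random gene sets. It is plain Python, not the fgsea code. As an assumption that goes beyond the text, it compares the observed ES only with random-set scores of the same sign, and adds one to the numerator and denominator of the p-value so that it is never exactly zero. The scores are hypothetical, and only six random sets are used instead of 10000.

```python
def nes_and_p_value(observed_es, null_es):
    """Normalised enrichment score and permutation p-value.

    observed_es: ES of the user-selected gene set
    null_es:     ES values of random gene sets of the same size
    """
    # Assumption: only random-set scores with the same sign are used
    same_sign = [es for es in null_es if (es >= 0) == (observed_es >= 0)]
    if not same_sign:
        raise ValueError("no random gene sets with the same sign as the observed ES")
    mean_abs = sum(abs(es) for es in same_sign) / len(same_sign)
    if mean_abs == 0:
        raise ValueError("random gene set scores are all zero")
    nes = observed_es / mean_abs
    # Count random sets at least as extreme as the observed score
    as_extreme = sum(abs(es) >= abs(observed_es) for es in same_sign)
    p_value = (1 + as_extreme) / (1 + len(same_sign))
    return nes, p_value

# Hypothetical observed score and random-set scores
nes, p = nes_and_p_value(0.8, [0.2, 0.4, 0.6, -0.3, -0.5, 0.9])
```

In this toy example one of the four positive random scores is at least as large as the observed ES, so the p-value is 2/5; with 10000 random gene sets the smallest reportable p-value is far lower.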
