Department of Epidemiology & Biostatistics

Splitting Random Forest (SRF): A novel algorithm for determining compact sets of genes that distinguish between cancer subtypes

Abstract

Motivation: To discover highly predictive yet compact gene set classifiers from whole-genome expression data, we developed Splitting Random Forest (SRF), a non-parametric, iterative algorithm that robustly identifies genes distinguishing between molecular subtypes.

 

Results: The optimal SRF 50 gene classifiers for glioblastoma (GB), breast cancer and ovarian cancer subtypes had overall prediction rates comparable to those from published datasets upon validation (80.1%-91.7%). The SRF 50 sets outperformed other methods by identifying compact gene sets sufficient to distinguish between the tested cancer subtypes (10- to 200-fold fewer genes than ANOVA or published gene sets). The SRF 50 and published gene sets overlapped significantly, showing that SRF identifies the relevant subsets of important gene lists. Through Ingenuity Pathway Analysis (IPA), we found that the “hub” genes shared between the SRF 50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for breast cancer; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for ovarian cancer. By reducing the number of genes needed for robust classification, the SRF approach is an effective driver of biomarker discovery research.


R code for SRF


Associated Data Files

Glioblastoma (GB):
GB full dataset
GB validation dataset

Breast Cancer:
breast cancer training dataset
breast cancer testing dataset


Ovarian Cancer:
ovarian cancer training dataset
ovarian cancer testing dataset



Last Updated (Friday, 14 October 2011 01:57)

 