Splitting Random Forest (SRF): A novel algorithm for determining compact sets of genes that distinguish between cancer subtypes
Abstract
Motivation: To discover highly predictive yet compact gene-set classifiers from whole-genome expression data, we developed a non-parametric, iterative algorithm, Splitting Random Forest (SRF), that robustly identifies genes distinguishing between molecular subtypes.
Results: The optimal SRF 50 gene classifiers for glioblastoma (GB), breast and ovarian cancer subtypes had overall prediction rates comparable to those from published datasets upon validation (80.1%-91.7%). The SRF 50 sets outperformed other methods by identifying more compact gene sets for distinguishing between the tested cancer subtypes (10- to 200-fold fewer genes than ANOVA or published gene sets). The SRF 50 and published gene sets overlapped significantly, indicating that SRF identifies the relevant subsets of important gene lists. Through Ingenuity Pathway Analysis (IPA), we found that the “hub” genes shared between the SRF 50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for breast cancer; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for ovarian cancer. By reducing the number of genes needed for robust classification, the SRF approach is an effective driver of biomarker discovery research.
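As an illustration of how an overlap between a compact gene set and a published gene set can be judged significant, the sketch below computes a one-sided hypergeometric tail probability. This is an assumption made for illustration only: the abstract does not state which test was used, and the function name `overlap_p_value` and all numbers in the usage line are hypothetical, not values from the study.

```python
from math import comb


def overlap_p_value(universe_size, set_a_size, set_b_size, overlap):
    """Probability of seeing at least `overlap` shared genes by chance.

    Models drawing `set_b_size` genes at random, without replacement,
    from `universe_size` genes, of which `set_a_size` belong to the
    reference (e.g. published) gene set. Hypothetical helper, not the
    authors' procedure.
    """
    max_overlap = min(set_a_size, set_b_size)
    total = comb(universe_size, set_b_size)
    # Sum the hypergeometric probabilities for k = overlap .. max_overlap.
    tail = sum(
        comb(set_a_size, k) * comb(universe_size - set_a_size, set_b_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 12,000 genes measured, an 840-gene reference set,
# a 50-gene compact set, 20 genes in common.
p = overlap_p_value(12000, 840, 50, 20)
print(f"p-value for the overlap: {p:.3g}")
```

A small p-value here would mean that the shared genes are unlikely to arise from random selection alone. The result depends on the choice of gene universe, which should be the set of genes actually measured on the platform.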
Associated Data Files
Glioblastoma (GB):
GB full dataset
GB validation dataset
Breast Cancer:
breast cancer training dataset
breast cancer testing dataset
Ovarian Cancer:
ovarian cancer training dataset
ovarian cancer testing dataset
Last Updated (Friday, 14 October 2011 01:57)