Sparse PCA and XGboost for predicting tumor type


The Pan-Cancer intiative at The Cancer Genome Atlas (TCGA) Research Network has analyzed numerous types of tumor using different technologies to profile the genomic, transcriptomic and epigenomic landscapes. This body of work represents an approach to study cancers not only based on the organ of origin, but also the broader biomarker profiles [1]. For this post, high-throughput gene expression data are used to build a classifier for 5 tumor types: breast cancer, lung cancer, colorectal cancer, kidney cancer and prostate cancer. The public data was provided and maintained by TCGA, the raw RNA sequence data were normalized to Fragment per kilobase transcript per million mapped reads (FPKM) to adjust for differences in transcript lengths. The normalized data are compiled and hosted by UCI Machine Learning Repository.

Dec 4th, 2017 - 15 minute read -
R, Sparse PCA, XGBoost, Cancer Research

Data preprocessing

As shown in Fig.1, this data set is imbalanced with Colorectal cancer being the most underpresented. Taking this into account for multiclass classification, F1 score will be used to reflect the precision and recall performance within each class. There's also the issue of high dimensionality, where number of features (20,531 columns) far exceeds the number of samples (801 rows). A modified form of Principal Component Analysis (PCA) will be used to reduce the complexity of the dataset.

tumor distribution in data
Fig.1. Count distribution of tumor types in 801 samples. BRCA-breast cancer,COAD-Colorectal cancer, KIRC-Kidney cancer, LUAD- Lung cancer, PRAD- Prostate cancer.

For a detailed explanation on Prinicipal Component Analysis (PCA), I highly recommend this link (it's the best I've seen and the visual is awesome!). Briefly, PCA is a technique commonly used to reduce high dimensional numerical data to lower dimensional space. Eigen decomposition of the feature covariance/correlation matrix results in the eigenvector matrix and its correpsonding eigenvalue matrix. Traditional PCA works by combining all original features with linear weights (aka: loadings) to capture the maximum variance of the data. Loading values can be obtained by eigenvector*square root (eigenvalue).The analysis results in several principal components, which are ranked by the amount of variance it captures (first PC captures the most). Each Principal Component(PC) will have its own eigenvector and eigenvalues that characterizes the variance, and each PC will have its own loadings for transforming the original data. For example, to calculate the first PC score for one sample, let Ci,j be the loadings for the i-th PC and the jth FPKM-normalized gene expression feature:

PC1 Score = gene_measurement{1} x C1,1 + gene_measurement{2} x C1,2 +... gene_measurement{j} x C1,j

When we are dealing with genomic data such as gene expression measurements, typically there's hundreds of thousands to millions of gene or transcript being quantified per sample. Based on existing literature, we know that only a fraction of those genomic features are driving the target of interest. To apply this constraint and to select sub group of features, we can use sparse PCA which results in sparse loadings. Sparsity of the loadings is effectively a form of feature selection, and it offers practical advantages such as easier interpretation and the need for less variable measurements, which can be costly in RNA sequencing. One of the well known sparse PCA methods was published by Zou, Hastie, and Tibshirani in 2005[2]. In this paper, they approached PCA as a regression problem optimized by lasso shrinkage of the coefficients resulting in some loadings of zero values. In this example, I used the nsprcomp package in R, which allows you to specifiy the maximum number of non-zero loadings for each PC (and optional constraint for non-negative loadings). I picked 1000 (5% of the total number of genes), but it's a parameter that can be further optimized. The codes for sparse PCA are shown below (complete R script can be found at the github repo). I tried both traditional PCA ( prcomp() ) and sparse PCA to generate the plots in Fig.2.

### exclude columns where values are all 0s
df<[,-(which(colSums( == 0))]

### the gene expressions values are normalized, no need to set scale to True in nspca()
pca.df<-nsprcomp(df, k=c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000))
df_out <-$x)
df_out$group <- label$Class


#### Use plotly to generatate 3D interactive chart. 
p <- plot_ly(df_out, x = ~PC1, y = ~PC2, z = ~PC3, color = ~group) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2'),
                      zaxis = list(title = 'PC3')))
Fig.2. Top-3D plot of PCA scores using traditional PCA on entire dataset. Bottom-3D plot of PCA scores using sparse PCA on entire dataset.

As shown above, the sparse PCA resulted in clear clustering of the groups along the PC axes, showing that selection of subgroup of gene expressions did not compromise the performance of the PCs.

XGB model of PC scores

After getting the loadings for 1,000 gene expression features, 10 PC scores are computed for each sample. This would serve as the input matrix to the classification model Xtreme Gradient Boosting (XGB). In short, XGB is a regularized boosting algorithm in which the weak classifier learns its parameters based on the performance of the previous iteration, giving more weight to previously misclassified samples (or large error in case of regression) (see fig 3). The regularization in XGB helps to control for overfitting common in Gradient Boosting Machine (GBM) as boosting tries to reduce the bias of its learner.

schematic of xgb
Fig.3. Schematic of gradient boosting trees. The growth of the tree is guided by the gradient/error of the previous iteration. (Source:

cv_folds <- createFolds(target, k=5)
val_split<-lapply(cv_folds, function(ind, dat) dat[ind,], dat=df)
train_split<-lapply(cv_folds, function(ind, dat) dat[-ind,], dat=df)

# placeholder for 5 CV and 5 classes F1 scores
f1_val<-matrix(NA, nrow=5, ncol=5)

for (i in 1:length(cv_folds)){
  ## perform sparse PCA on training set, limit to 1K non-zero loadings for the feature variables
  spca.train<-nsprcomp(train.df, k=c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000))
  ## transformed data in training set
  ## get loadings for validation data frame
  val.spc<-predict(spca.train, val.df)
  #get target info
  #set up for xgb
  dtrain.sparse<-xgb.DMatrix(data = as.matrix(train.spc), label =as.numeric(
  dval.sparse<-xgb.DMatrix(data = as.matrix(val.spc), label = as.numeric(
  watchlist <- list(train=dtrain.sparse, test=dval.sparse)
  xgb.model <- xgb.train(data=dtrain.sparse, max_depth=2, eta=0.0001, nthread = 4,colsample_bytree=0.3, nrounds=150, verbose=F,
                         watchlist=watchlist,num_class=5, early_stopping_rounds=3,objective = "multi:softmax", eval_metric="merror")
colnames(f1_val) <- c("BRCA","COAD", "KIRC",'LUAD','PRAD')
rownames(f1_val)<-c("CV1","CV2", "CV3",'CV4','CV5')

Within each cross validation, sparse PCA is performed on training data only, and subsequent loadings are calculated for the validation data. At the end of each cross validations, the F1 scores of the predictions are obtained (Table 1). F1 score is the weighted average of precision and recall, and it's useful when you are working with an imbalanced dataset. The F1 value is calculated as 2*(precision*recall)/(precision+recall), and it reflects how precise and robust the classifier is.

CV 1 0.98330.96971.00000.96300.9824
CV 2 1.00000.96770.98361.00001.0000
CV 3 0.96000.92860.98240.92860.9615
CV 4 0.98311.00001.00000.96551.0000
CV 5 0.98360.96770.98300.98180.9811

Table 1. F1 scores of the 5 cross validation.

The classifications of Colorectal and Lung cancers have the lowest average (about 0.967) and wider variations (0.025). This is consistent with the plot of the PC scores shown in Fig.2, where the clusters of Colorectal cancer and Lung cancer are in close proximity, making it a bit harder to classify between these two cancers based on the gene expression data. Beyond this set of data related to tumor specimen gene expression, I think the approach described in this post can be extended to liquid biopsy of blood samples for prognosis of disease.


[1].The Cancer Genome Atlas Research Network, Weinstein, J., Collisson, E.A., Mills,G.B., et al. The Cancer Genome Atlas Pan-Cancer analysis project, Nature Genetics 45, 1113–1120 (2013)

[2].Zou, H., Hastie, T., Tibshirani, R., Sparse Principal Component Analysis,Journal of Computational and Graphical Statistics, Volume 15, Number 2, Pages 265–286