Data preprocessing
As shown in Fig. 1, this dataset is imbalanced, with colorectal cancer being the most underrepresented class. Taking this into account for multiclass classification, the F1 score will be used to reflect the precision and recall performance within each class. There is also the issue of high dimensionality: the number of features (20,531 columns) far exceeds the number of samples (801 rows). A modified form of Principal Component Analysis (PCA) will be used to reduce the complexity of the dataset.
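To see the imbalance in numbers, a quick tally of the class labels works; a minimal sketch, using the same labels.csv file that is loaded in the modeling code below:

library(data.table)
label <- fread('labels.csv')
# count samples per cancer type; COAD (colorectal) has the fewest
table(label$Class)
# proportions make the imbalance easier to read
round(prop.table(table(label$Class)), 3)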

For a detailed explanation of Principal Component Analysis (PCA), I highly recommend this link (it's the best I've seen and the visual is awesome!). Briefly, PCA is a technique commonly used to reduce high-dimensional numerical data to a lower-dimensional space. Eigendecomposition of the feature covariance/correlation matrix yields an eigenvector matrix and its corresponding eigenvalues. Traditional PCA works by combining all original features with linear weights (aka loadings) to capture the maximum variance of the data; the loading values can be obtained as eigenvector × √(eigenvalue). The analysis results in several principal components (PCs), ranked by the amount of variance each captures (the first PC captures the most). Each PC has its own eigenvector and eigenvalue characterizing its variance, and its own loadings for transforming the original data. For example, to calculate the first PC score for one sample, let C_{i,j} be the loading for the i-th PC and the j-th FPKM-normalized gene expression feature:
PC1 score = gene_measurement_1 × C_{1,1} + gene_measurement_2 × C_{1,2} + ... + gene_measurement_j × C_{1,j}
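This score computation is just a weighted sum, which is easy to verify with base R's prcomp() on a toy matrix (a minimal sketch; the 5 × 4 matrix is made up purely for illustration, and prcomp()'s rotation matrix plays the role of the coefficients C above):

set.seed(1)
toy <- matrix(rnorm(20), nrow = 5, ncol = 4)            # 5 samples x 4 features
pca_toy <- prcomp(toy, center = TRUE, scale. = FALSE)
centered <- scale(toy, center = TRUE, scale = FALSE)    # PCA operates on centered data
# PC1 score of sample 1 = sum over features of (centered value x coefficient)
manual_pc1 <- sum(centered[1, ] * pca_toy$rotation[, 1])
all.equal(manual_pc1, pca_toy$x[1, 1])                  # TRUE: matches prcomp's score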
When we are dealing with genomic data such as gene expression measurements, there are typically tens of thousands of genes, or hundreds of thousands of transcripts, being quantified per sample. Based on existing literature, we know that only a fraction of those genomic features drive the target of interest. To apply this constraint and select a subgroup of features, we can use sparse PCA, which results in sparse loadings. Sparsity of the loadings is effectively a form of feature selection, and it offers practical advantages such as easier interpretation and fewer variables to measure, which can be costly in RNA sequencing. One of the well-known sparse PCA methods was published by Zou, Hastie, and Tibshirani [2]. In that paper, they approached PCA as a regression problem optimized with lasso shrinkage of the coefficients, which drives some loadings to exactly zero. In this example, I used the nsprcomp package in R, which allows you to specify the maximum number of non-zero loadings for each PC (and, optionally, a non-negativity constraint on the loadings). I picked 1,000 (about 5% of the total number of genes), but it's a parameter that can be further optimized. The code for sparse PCA is shown below (the complete R script can be found in the GitHub repo). I tried both traditional PCA (prcomp()) and sparse PCA to generate the plots in Fig. 2.
library(data.table)
library(plyr)
library(caret)
library(nsprcomp)
library(xgboost)

########################### Load data ###########################
raw.data <- fread('data.csv')
label <- fread('labels.csv')
target <- label$Class
# encode the five cancer types as numeric labels for xgboost
target.num <- mapvalues(target, from = c("BRCA", "COAD", "KIRC", "LUAD", "PRAD"), to = c(0, 1, 2, 3, 4))
raw.data$V1 <- NULL                # drop the sample ID column
raw.data <- data.frame(raw.data)
# remove genes that are zero across all samples
df <- raw.data[, -(which(colSums(raw.data) == 0))]

############ Visualize the PCA plot on the whole dataset ############
# 10 sparse PCs, each restricted to at most 1,000 non-zero loadings
pca <- nsprcomp(df, ncomp = 10, k = rep(1000, 10))
df_out <- as.data.frame(pca$x)
df_out$group <- label$Class
library(ggplot2)
library(plotly)
# interactive 3D scatter of the first three PC scores, colored by cancer type
p <- plot_ly(df_out, x = ~PC1, y = ~PC2, z = ~PC3, color = ~group) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2'),
                      zaxis = list(title = 'PC3')))
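Because the sparse loadings are mostly zeros, the genes each sparse PC actually uses can be read directly off the rotation matrix; a short sketch continuing from the pca object above (the rotation rows correspond to the columns of df):

# genes with non-zero loadings on PC1 (at most 1,000 by the k constraint)
pc1_genes <- colnames(df)[pca$rotation[, 1] != 0]
length(pc1_genes)
# genes used by any of the 10 sparse PCs; a candidate reduced measurement panel
selected <- colnames(df)[rowSums(pca$rotation != 0) > 0]
length(selected)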
As shown above, the sparse PCA resulted in clear clustering of the groups along the PC axes, showing that selecting a subgroup of gene expression features did not compromise the performance of the PCs.
XGB model of PC scores
After obtaining the sparse loadings (at most 1,000 non-zero per PC), 10 PC scores are computed for each sample. These serve as the input matrix to the classification model, eXtreme Gradient Boosting (XGB). In short, XGB is a regularized boosting algorithm in which each weak classifier learns its parameters based on the performance of the previous iteration, focusing on the samples the ensemble previously misclassified (or, in regression, those with large errors) (see Fig. 3). The regularization in XGB helps control the overfitting common in Gradient Boosting Machines (GBM), since boosting tries to reduce the bias of its learners.
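To make the boosting loop concrete, here is a minimal regression sketch of the underlying idea, not XGB itself: a hand-rolled gradient boosting loop with rpart stumps on synthetic data, where each new weak learner is fit to the residuals (the errors left by the previous iteration) and added with a shrinkage factor analogous to xgboost's eta:

library(rpart)
set.seed(42)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)      # noisy synthetic target
dat <- data.frame(x = x, y = y)
pred <- rep(mean(y), 200)               # round 0: constant prediction
eta <- 0.1                              # learning rate (shrinkage)
for (m in 1:100) {
  dat$resid <- y - pred                                          # errors of the current ensemble
  stump <- rpart(resid ~ x, data = dat, maxdepth = 1, cp = 0)    # depth-1 weak learner
  pred <- pred + eta * predict(stump, dat)                       # shrunken additive update
}
mean((y - pred)^2)   # training MSE drops as boosting rounds accumulate

On top of this basic loop, XGB adds second-order gradient information and explicit regularization of the tree structure and leaf weights.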

# create 5 cross-validation folds and the matching train/validation splits
cv_folds <- createFolds(target, k = 5)
val_split <- lapply(cv_folds, function(ind, dat) dat[ind, ], dat = df)
train_split <- lapply(cv_folds, function(ind, dat) dat[-ind, ], dat = df)

# parallel backend for the repeated sparse PCA fits
library(doSNOW)
cluster <- makeCluster(6, "SOCK")
registerDoSNOW(cluster)

# 5 CV folds x 5 classes of F1 scores
f1_val <- matrix(NA, nrow = 5, ncol = 5)
for (i in 1:length(cv_folds)) {
  group <- names(cv_folds)[i]
  # get each fold's data frame using [[]]
  val.df <- val_split[[group]]
  train.df <- train_split[[group]]
  # sparse PCA on the training set only, limited to 1K non-zero loadings per PC (5% of total features)
  spca.train <- nsprcomp(train.df, ncomp = 10, k = rep(1000, 10))
  train.spc <- spca.train$x
  # project the held-out samples onto the training loadings
  val.spc <- predict(spca.train, val.df)
  # get target info for this fold
  valIdx <- cv_folds[[group]]
  train.target <- target.num[-valIdx]
  val.target <- target.num[valIdx]
  # set up for xgb
  dtrain.sparse <- xgb.DMatrix(data = as.matrix(train.spc), label = as.numeric(train.target))
  dval.sparse <- xgb.DMatrix(data = as.matrix(val.spc), label = as.numeric(val.target))
  watchlist <- list(train = dtrain.sparse, test = dval.sparse)
  xgb.model <- xgb.train(data = dtrain.sparse, max_depth = 2, eta = 0.0001, nthread = 2,
                         colsample_bytree = 0.3, nrounds = 150, verbose = F,
                         watchlist = watchlist, num_class = 5, early_stopping_rounds = 3,
                         objective = "multi:softmax", eval_metric = "merror")
  # multi:softmax returns predicted class labels (0-4), not probabilities
  pred <- predict(xgb.model, dval.sparse)
  # per Sebastian Raschka: the F1 score balances precision and recall, which is
  # extremely useful for datasets with a non-uniform distribution of class labels
  g <- confusionMatrix(factor(pred, levels = 0:4), factor(val.target, levels = 0:4))
  f1_val[i, ] <- as.vector(g$byClass[, 7])   # column 7 of byClass is F1
}
stopCluster(cluster)
colnames(f1_val) <- c("BRCA", "COAD", "KIRC", "LUAD", "PRAD")
rownames(f1_val) <- c("CV1", "CV2", "CV3", "CV4", "CV5")
as.table(f1_val)
Within each cross-validation fold, sparse PCA is performed on the training data only, and the resulting loadings are then used to compute PC scores for the validation data. At the end of each fold, the F1 scores of the predictions are obtained (Table 1). The F1 score is the harmonic mean of precision and recall, which makes it useful when working with an imbalanced dataset. It is calculated as 2 × (precision × recall) / (precision + recall), and it reflects how precise and robust the classifier is.
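As a sanity check of that formula against caret's output, the per-class precision (Pos Pred Value) and recall (Sensitivity) can be pulled from the byClass matrix; a minimal sketch, assuming the confusion matrix g from the last fold of the loop above is still in scope:

prec <- g$byClass[, "Pos Pred Value"]   # per-class precision
rec  <- g$byClass[, "Sensitivity"]      # per-class recall
f1_manual <- 2 * (prec * rec) / (prec + rec)
# matches column 7 ("F1") of byClass used inside the CV loop
all.equal(unname(f1_manual), unname(g$byClass[, 7]))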
|      | BRCA   | COAD   | KIRC   | LUAD   | PRAD   |
|------|--------|--------|--------|--------|--------|
| CV 1 | 0.9833 | 0.9697 | 1.0000 | 0.9630 | 0.9824 |
| CV 2 | 1.0000 | 0.9677 | 0.9836 | 1.0000 | 1.0000 |
| CV 3 | 0.9600 | 0.9286 | 0.9824 | 0.9286 | 0.9615 |
| CV 4 | 0.9831 | 1.0000 | 1.0000 | 0.9655 | 1.0000 |
| CV 5 | 0.9836 | 0.9677 | 0.9830 | 0.9818 | 0.9811 |
Table 1. F1 scores for the 5 cross-validation folds.
The classifications of colorectal and lung cancer have the lowest average F1 (about 0.967) and the widest variation (0.025). This is consistent with the plot of the PC scores in Fig. 2, where the colorectal and lung cancer clusters are in close proximity, making these two cancers a bit harder to distinguish based on the gene expression data. Beyond this dataset of tumor specimen gene expression, I think the approach described in this post can be extended to liquid biopsies of blood samples for disease prognosis.
REFERENCES
[1]. The Cancer Genome Atlas Research Network, Weinstein, J., Collisson, E.A., Mills, G.B., et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics 45, 1113–1120 (2013).
[2]. Zou, H., Hastie, T., Tibshirani, R. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics 15(2), 265–286 (2006).