Drug discovery and patient treatment are important areas of study in the fight against cancer in humans. Understanding the mechanisms by which cancer forms and attacks the body will be the basis for finding new drugs that improve outcomes and lessen side effects. To create novel drugs with a new approach, we need to understand cells at the molecular level. We must study the nucleotide variations that give rise to cancer and determine how to classify them effectively for future studies. Fortunately for computational biologists studying cancer, there has been a large effort over the last decade to study and document mutations and to store them in public databases. More recent advances in computing power, combined with machine learning techniques, have made it possible to find patterns in these data and to classify and cluster cancer mutations.
To begin, we used two curated, functionally validated single nucleotide variation (SNV) datasets to train our models. The Benchmarking Dataset (n = 3591 SNVs) and the CanDrA Dataset (n = 1550 SNVs) were used to train both supervised (random forest) and unsupervised (K-means clustering) models for classifying future SNVs. Functional scores for each SNV were taken from dbWGFP (A DataBase and web server of human Whole-Genome single nucleotide variants and their Functional Predictions), which provided more than 50 functional scores; these were narrowed down to 48 scores based on functional predictions and conservation calculations. To reduce the input size to the model, recursive feature elimination was implemented. The mutations were then fed into a random forest classifier, with Bayesian optimization used for hyperparameter tuning. The model uses these scores to predict whether a mutation is a driver or a passenger in the onset of cancer.
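The workflow described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the code used in this study: it assumes scikit-learn, uses synthetic data in place of the 48 dbWGFP scores, treats the labels as binary (driver vs. passenger), and substitutes scikit-learn's RandomizedSearchCV for Bayesian optimization so that the example has no extra dependencies. The number of retained features (20), the search ranges, and the number of clusters (2) are arbitrary placeholders.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for an SNV-by-score matrix with 48 features.
# These are NOT real mutation data; labels 0/1 play the role of
# passenger/driver (an assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=48, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Recursive feature elimination followed by a random forest classifier.
# n_features_to_select=20 is a placeholder, not a value from the study.
pipeline = Pipeline([
    ("rfe", RFE(
        RandomForestClassifier(n_estimators=50, random_state=0),
        n_features_to_select=20,
        step=4,  # drop 4 features per elimination round
    )),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hyperparameter tuning. Random search is used here as a simple
# stand-in for the Bayesian optimization named in the text.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "rf__n_estimators": [50, 100, 200],
        "rf__max_depth": [None, 5, 10],
        "rf__min_samples_leaf": [1, 2, 4],
    },
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# Held-out accuracy and the boolean mask of features kept by RFE.
test_accuracy = search.score(X_test, y_test)
selected = search.best_estimator_.named_steps["rfe"].support_

# Unsupervised counterpart: K-means on the same score matrix.
# n_clusters=2 is an assumption for illustration only.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(f"held-out accuracy: {test_accuracy:.3f}")
print(f"features kept by RFE: {int(selected.sum())} of {X.shape[1]}")
```

Placing feature elimination inside the pipeline means it is refit within each cross-validation fold, so the reported accuracy is not inflated by selecting features on data later used for evaluation.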
At present, we have initial results for the random forest, but we are still refining the results of the clustering algorithm. In the first pass, the random forest assigned mutations to the driver, passenger, or unknown category with approximately 74% accuracy.



Author: Steven Agajanian

Coauthor(s): Oluyemi Odeyemi, MS; Simrath Ratra; Nathaniel Bischoff; Gennady Verkhivker, PhD

Status: Work In Progress