By John Cassidy and Harry Clifford
As technologies become ever more sophisticated, the data they produce becomes richer, and genomic data is no exception.
Whilst the first genome-wide association studies (GWAS) relied on a relatively small number of single nucleotide polymorphisms (SNPs) measured across a large patient population, modern genomics workflows generate incredibly deep datasets (e.g. genomic, transcriptomic, methylome and proteome data) from a limited number of patients.
This means that traditional theoretical and applied statistical techniques, whilst very successfully applied to GWAS, are poorly suited to modern genomics datasets, which contain far more measurements than patients. Moreover, many of the important signals in genomics datasets are incredibly small and masked by technical noise, and thus require far more sophisticated analysis techniques. It is for these reasons that machine learning has been so successfully applied to the datasets generated from genomic sequencing.
Machine learning can also be used to draw clinically useful information from combined genomics datasets, the size of which traditional statistics would struggle with. For example, Rampasek and Goldenberg (2017) developed a variational autoencoder which, by combining datasets (Genomics of Drug Sensitivity in Cancer; Cancer Cell Line Encyclopedia), was capable of predicting drug response in cancer.
CNNs* have recently been shown to lead current algorithms for solving genomic sequence-based problems. Several groups have been able to model protein binding from sequencing data (e.g. with DeepBind), whilst others are developing CNNs (e.g. DeepSEA) to predict the effects of non-coding variants purely from genomic sequences. Additionally, startups such as Verge Genomics are beginning to use such approaches in the rational design of new drug candidates for commonly mutated proteins in cancers.
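To make the idea of a sequence-based CNN more concrete, here is a minimal sketch of its core operation: a DNA sequence is one-hot encoded and a single convolutional filter is slid along it to score each position. This is not how DeepBind or DeepSEA are implemented. The function names, the "TATA" motif and the hand-written filter weights are all assumptions made for illustration; in a real CNN the weights of many such filters are learned from data.

```python
# Illustrative sketch only: one hand-written convolutional filter scanning DNA.
# Real sequence CNNs learn many filters from data; these weights are hypothetical.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element vectors in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(encoded, kernel):
    """Slide the kernel along the sequence (no padding, stride 1)."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        score = sum(
            w * x
            for kernel_row, input_row in zip(kernel, window)
            for w, x in zip(kernel_row, input_row)
        )
        scores.append(score)
    return scores


def relu(values):
    """Keep positive activations and zero out the rest."""
    return [max(0.0, v) for v in values]


# Hypothetical filter rewarding the motif "TATA":
# +1 where the base matches the motif position, -1 otherwise.
MOTIF = "TATA"
KERNEL = [[1.0 if b == m else -1.0 for b in BASES] for m in MOTIF]


def best_match(seq):
    """Return (position, score) of the strongest activation (global max pooling).

    Assumes the sequence is at least as long as the filter.
    """
    activations = relu(conv1d(one_hot(seq), KERNEL))
    position = max(range(len(activations)), key=activations.__getitem__)
    return position, activations[position]


print(best_match("GGCTATAAGC"))
```

The filter responds most strongly where the sequence matches its motif, which is roughly how a trained network can pick out features such as protein binding sites directly from raw sequence.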
What are CNNs?*
The current “best” algorithm for visual pattern recognition is a convolutional neural network (CNN).
CNNs were developed from perceptrons: vector-mapping algorithms that were in turn inspired by associative learning in the brain and the idea of “integrate and fire” neurons.
Much like the first perceptron, the convolutional neural network also bears the hallmarks of a biologically-inspired system: think of a perceptron as a coarse approximation of a single neural pathway and a CNN as a complex, multi-layered network of neural connections inspired by a biological neural network (BNN), as found in the human brain.
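The "integrate and fire" idea behind a perceptron can be sketched in a few lines: weighted inputs are summed (integrate) and the unit outputs 1 only if the total crosses a threshold (fire). The example below is a minimal illustration using the classic perceptron learning rule on a toy AND gate. The function names, learning rate and training data are assumptions chosen for clarity, not taken from any particular library.

```python
# Minimal perceptron sketch; names and the toy data are illustrative assumptions.

def fire(weights, bias, inputs):
    """Integrate the weighted inputs, then fire (1) if the total exceeds zero."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule: nudge the weights by the prediction error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - fire(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Toy training set: the unit should fire only when both inputs are active.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_GATE)
predictions = [fire(weights, bias, inputs) for inputs, _ in AND_GATE]
print(predictions)
```

A CNN stacks many such units into layers and shares their weights across positions of the input, which is what lets it detect the same pattern wherever it occurs.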
John is Co-founder and CEO at Cambridge Cancer Genomics, a precision oncology startup building software solutions for iterative medicine. CCG uses integrated bioinformatic and machine learning pipelines for liquid biopsy analysis, in order to guide doctors on tumour treatment in real time. John is actively involved in the biotech startup community as a Venture Partner at the Pioneer Fund, Director of SiliconBio and a lecturer at Anglia Ruskin University. He holds a Masters in Pharmacology (1st Class) from the University of Glasgow and pursued a PhD in functional genomics at the University of Cambridge. His research career in academia (CRUK) and industry (MedImmune) focused on understanding how tumours evolve and become resistant to treatment. He has published numerous scientific papers, book chapters and literature reviews, and has 300+ citations.
Harry is Co-founder and Chief Technology Officer at Cambridge Cancer Genomics. He leads CCG’s tech team of AI researchers, data scientists and bioinformaticians to conduct innovative R&D and develop the underlying tumour analysis pipelines and cloud architecture of CCG’s smart genomics systems. Harry has a PhD in bioinformatics from Oxford University and experience in postdoctoral roles, including at Cambridge University with Cancer Research UK, and in the biopharma industry, where he worked on developing biomarker-based medical diagnostics.