Part two of your essential glossary of AI’s key terms and phrases:


Machine Learning

The study of algorithms that can infer patterns and rules from data without needing explicit instructions. This definition technically includes deep learning, but in practice the two terms tend to be treated as mutually exclusive: “machine learning” generally refers to older algorithms such as support vector machines, random forests and k-nearest neighbours.


Deep learning is very much in fashion at the moment and there is a tendency to apply it (perhaps somewhat injudiciously) to all analytical problems. However, while deep learning algorithms are remarkably potent computational tools, they are heavily reliant on large volumes of data. Traditional machine learning still has a significant place in data science, particularly with regard to data-poor areas (including many less common medical conditions).



Matrix

A rectangular array of scalars (in lay terms: a grid of numbers) that behave as a unit.



Model

Used interchangeably with the term algorithm in the context of machine learning algorithms, or with the term network in the context of neural networks.



Natural Language Processing (NLP)

The study of developing computer systems that can perform useful functions based on natural spoken or written language (think Amazon Alexa, Google Assistant, etc.). Medical NLP applications are mostly early stage, but the last 18 months have seen some major advances in the field and it’s definitely a space to watch.


Neural Network

A machine learning algorithm inspired by the neuronal architecture of the biological brain. Neural networks form the cornerstone of deep learning.




Overfitting

When a machine learning model learns features that are too specific to its training set and therefore does not generalise well to real-world examples. For example, a weather prediction algorithm trained on three years of data may predict rain on 27th March next year with 100% certainty, based on the fact that it rained on 27th March in each of the last three years. Overfitting is a major issue in machine learning and is often the consequence of small datasets, or datasets that do not contain an adequate spread of data.
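
The weather example above can be sketched in a few lines of Python. This is only an illustration: the function name `fit_by_date` and the tiny `history` list are invented here, and the "model" simply memorises how often it rained on each calendar date.

```python
from collections import defaultdict

def fit_by_date(history):
    """Memorise the fraction of rainy days seen for each calendar date."""
    counts = defaultdict(lambda: [0, 0])  # date -> [rainy days, total days]
    for date, rained in history:
        counts[date][0] += int(rained)
        counts[date][1] += 1
    return {date: rainy / total for date, (rainy, total) in counts.items()}

# Hypothetical three years of observations for two dates.
history = [
    ("27 March", True), ("27 March", True), ("27 March", True),
    ("28 March", False), ("28 March", True), ("28 March", False),
]

model = fit_by_date(history)
print(model["27 March"])  # 1.0, i.e. "certain" rain, learned from only three examples
```

The prediction of 1.0 reflects a quirk of a very small training set rather than a real pattern, which is the essence of overfitting.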




Python

Currently the most popular programming language for machine learning applications.




Rectified linear activation unit (ReLU)


A non-linear activation function whose output equals its input for positive numbers, but which outputs 0 for all negative input numbers. So ReLU(2) = 2, whereas ReLU(-1) = 0.


Intuitively, this may seem too simple to allow for powerful data manipulation. However, it has proven very effective and has replaced the sigmoid function as the activation function of choice for fully-connected hidden layers of most neural networks.
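
As a minimal sketch, the definition above translates almost directly into Python. The function name `relu` is chosen here for illustration only.

```python
def relu(x):
    """Return x unchanged when it is positive, and 0 otherwise."""
    return x if x > 0 else 0

print(relu(2))   # 2
print(relu(-1))  # 0
```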



R

A statistical programming language, probably a close second to Python in popularity for machine learning applications. Note that NHS Digital has chosen R over Python as its official programming language.


Recurrent neural network (RNN)

A type of neural network that ‘remembers’ information from the previous item within a sequence of data in order to inform the interpretation of the current item. RNNs are most commonly used in natural language processing, where previous words in a phrase inform the interpretation of the current word. For example, knowledge of the previous item is essential in interpreting the word complaint in the phrases presenting complaint vs written complaint.
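
The idea of carrying information forward through a sequence can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the hidden state is a single number, the weights `w_x` and `w_h` are fixed by hand rather than learned, and words are stood in for by arbitrary numbers.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8):
    """Combine the current input with the state carried over from the previous item."""
    return math.tanh(w_x * x + w_h * h_prev)

def run_sequence(sequence):
    """Feed a sequence through the recurrent step and return the final state."""
    h = 0.0  # empty "memory" before the first item
    for x in sequence:
        h = rnn_step(x, h)
    return h

# Two sequences that end in the same item but start differently.
state_a = run_sequence([1.0, 3.0])
state_b = run_sequence([2.0, 3.0])
print(state_a != state_b)  # True: the earlier item changes how the last one is processed
```

The final item is identical in both sequences, yet the resulting states differ, which mirrors how the word before complaint changes its interpretation.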


Reinforcement learning (RL)

A relatively nascent field of AI focussed on the concept of training machines to develop behaviour strategies based on distant rewards. RL was pivotal to the celebrated success of AlphaGo and has some very interesting (though currently theoretical) applications in clinical medicine, where sequential decision making based on learned behaviour models is central to the activity of many clinicians.



Scalar

A single, real number (e.g. 725 or 3.142).


Supervised learning

The process of training a machine learning model by tasking it with mapping from input data to pre-assigned labels. For example: feeding a CNN a large collection of chest X-rays and training it to detect pneumonia, where each X-ray in the training set has been labelled with a 0 or 1 (denoting the absence or presence of pneumonia) by an expert radiologist. Most deep learning applications in production today are based on the supervised learning framework.



Tensor

For the purposes of machine learning, a tensor refers to a scalar, vector, matrix or higher-dimensional array whose values will be transformed as part of an algorithm, which is why Google’s deep learning library is called TensorFlow.
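
To make the relationship between scalars, vectors and matrices concrete, here is a small sketch using plain Python lists. The helper `shape` is hypothetical and assumes non-empty, rectangular nested lists; real deep learning libraries provide their own tensor types.

```python
scalar = 3.142                    # a single number
vector = [1.0, 2.0, 3.0]          # a list of 3 numbers (a 3-dimensional vector)
matrix = [[1, 2, 3], [4, 5, 6]]   # a grid of numbers: 2 rows by 3 columns

def shape(value):
    """Return the size of each axis of a (rectangular) nested list."""
    dims = []
    while isinstance(value, list):
        dims.append(len(value))
        value = value[0]
    return tuple(dims)

print(shape(scalar))  # ()
print(shape(vector))  # (3,)
print(shape(matrix))  # (2, 3)
```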


Test set

The dataset used for the final evaluation of a completed machine learning model.



Tokenization

The process of converting words into numerical “tokens” so that they may be used by mathematical algorithms (e.g. deep learning models). In the simplest example, a word’s token is simply its position within the dictionary (list) of words used for a given NLP task.
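
The simplest scheme described above might look like the following sketch. The function names and the example sentence are invented for illustration, and the code assumes every word to be tokenized already appears in the vocabulary (an unknown word would raise a `KeyError`).

```python
def build_vocabulary(words):
    """Assign each distinct word its position in the list of words seen so far."""
    vocabulary = {}
    for word in words:
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)
    return vocabulary

def tokenize(sentence, vocabulary):
    """Replace each word in the sentence with its numerical token."""
    return [vocabulary[word] for word in sentence.lower().split()]

corpus = "the patient reports chest pain the pain is sharp".split()
vocabulary = build_vocabulary(corpus)
tokens = tokenize("the pain is sharp", vocabulary)
print(tokens)  # [0, 4, 5, 6]
```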


Train set

The dataset used during the primary training of a machine learning model (often 80% of the data, where 10% is reserved for the validation set and 10% for the test set).
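
An 80/10/10 split of this kind can be sketched as follows. The function `split_dataset`, its parameter names and the fixed random seed are assumptions made for illustration, not a prescribed method.

```python
import random

def split_dataset(examples, train_fraction=0.8, validation_fraction=0.1, seed=0):
    """Shuffle the examples, then cut them into train, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test

train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))  # 80 10 10
```

Shuffling before splitting helps ensure that each set is a representative sample of the whole dataset.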

Transfer learning

The process by which a deep learning model can re-use knowledge acquired in one domain to improve performance in another domain. For example, a CNN trained to identify real-world objects from photographs (of which there is an abundance online) could be fine-tuned to identify cerebral haemorrhages from CT brains (examples of which may be harder to acquire). The feature abstraction functions learned in the early convolutional layers of the network (e.g. the ability to detect edges and rudimentary geometric shapes) will be common to both tasks, so by re-training only the later layers of the network, one can both expedite training and achieve high performance with less training data.


Unsupervised learning

The process of training a machine learning model where the labels for the data are not provided. Often, unsupervised learning tasks centre around the concept of “clustering”, such as grouping patients from a certain disease population into a prespecified number of phenotypical groups.



Validation set

Usually, the dataset that is used to tune a machine learning model’s hyperparameters after the initial training phase based on the training set. One example of such a hyperparameter is the dichotomisation threshold, needed when the model produces results on a continuous scale but the clinician requires them in binary form, such as 1 for “malignant” or 0 for “benign”.
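
As a rough sketch of tuning a dichotomisation threshold on a validation set, the code below tries a handful of candidate thresholds and keeps the one with the highest accuracy. The scores, labels, candidate values and function names are all hypothetical, and accuracy is only one of several criteria that could be used.

```python
def dichotomise(scores, threshold):
    """Convert continuous scores to 1 (e.g. malignant) or 0 (e.g. benign)."""
    return [1 if score >= threshold else 0 for score in scores]

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def pick_threshold(scores, labels, candidates):
    """Return the candidate threshold with the best accuracy on the given data."""
    return max(candidates, key=lambda t: accuracy(dichotomise(scores, t), labels))

# Hypothetical model outputs and true labels for a small validation set.
validation_scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.90]
validation_labels = [0, 0, 1, 0, 1, 1]

best_threshold = pick_threshold(validation_scores, validation_labels, [0.25, 0.50, 0.75])
print(best_threshold)  # 0.75
```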


Vanilla neural network

A neural network that consists of an input layer, a small number of fully connected hidden layers and an output layer. The term “vanilla” refers to the fact that this is the simplest form of neural network architecture, which does not contain convolutional features, recurrent features, LSTM units, etc.


Vector

A list of scalar numbers that behave as a unit. The length of the list is referred to as the “dimension” of the vector, such that a 3-dimensional vector is a list of 3 numbers. For example, if v is a 3-dimensional vector, then v = (v1, v2, v3).


Weights

The essential differentiable component of a neural network. Weights are scalar values that are adjusted during a network’s training phase to alter its internal mathematical structure, which in turn adjusts the function performed by the network. A “weight matrix” contains all the weights of a given layer within a network.

The black magic of deep learning lies in the fact that a neural network’s weight matrices start out as collections of randomly initialised scalar values (assuming it is not making use of transfer learning) but, by incrementally tweaking these values, the network can transform itself into a cutting-edge data processing tool.
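
The idea of starting from a random value and tweaking it step by step can be shown with a single weight. This sketch assumes gradient descent on mean squared error as the tweaking rule, which the text above does not specify; the data, learning rate and number of steps are likewise chosen purely for illustration.

```python
import random

rng = random.Random(0)
weight = rng.uniform(-1.0, 1.0)  # randomly initialised scalar weight

inputs = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]        # the relationship to be learned: target = 2 * input
learning_rate = 0.05

for _ in range(200):
    # Gradient of the mean squared error with respect to the weight.
    gradient = sum(2 * (weight * x - y) * x for x, y in zip(inputs, targets)) / len(inputs)
    weight -= learning_rate * gradient  # one small tweak in the direction that reduces error

print(round(weight, 3))  # 2.0
```

A real network repeats the same kind of adjustment across many thousands or millions of weights at once.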


Word vectorization

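An interesting idea whereby a computer learns to represent words as multi-dimensional vectors, typically such that words with similar meanings have similar vectors. These vectors can then be used as the input to other algorithms, or compared directly to measure how closely related two words are.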
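
As a toy illustration of comparing word vectors, the sketch below measures cosine similarity between hand-written 3-dimensional vectors. The numbers in `word_vectors` are made up for this example; real word vectors are learned from large bodies of text and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-picked vectors for illustration only.
word_vectors = {
    "doctor": [0.9, 0.8, 0.1],
    "nurse": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

related = cosine_similarity(word_vectors["doctor"], word_vectors["nurse"])
unrelated = cosine_similarity(word_vectors["doctor"], word_vectors["banana"])
print(related > unrelated)  # True
```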