Rachelle Aviv and Zack Dvey-Aharon on retrospective versus prospective validation.

AI research, and more specifically research that utilizes deep learning, has been a hot topic for roughly the past decade. However, for those not already familiar with this field, it may look like a complete morass, or a black box – confusing, opaque, and the kind of thing you throw data at until something works. As a result, it can be hard to judge deep learning research on its merits. This piece will outline some of the issues with current retrospective research, and discuss how prospective validation studies may address them.

Retrospective research refers to the training and, more importantly, validation of deep learning models using data that has already been collected. The vast majority of such training is performed on existing datasets, typically large ones. There is no limit to the number of times the same data can be used to train a model, or the number of models that may result from a given dataset. Once trained, the models’ performance and accuracy need to be validated, usually using a smaller subset of the data that is set aside for this purpose. The validation data is typically called a ‘holdout dataset’ and used exclusively for validating the model. Critically, though the holdout dataset is not used for training the models, it does come from the same base dataset.
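The retrospective setup can be sketched in a few lines of Python. This is a minimal illustration rather than any particular study's pipeline; the function name, split fraction, and toy dataset are our own assumptions.

```python
import random

def train_holdout_split(records, holdout_frac=0.2, seed=0):
    """Split one base dataset into a training set and a holdout set.

    Both parts come from the same collection; the holdout is reserved
    exclusively for validation and is never shown to the model in training.
    """
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

dataset = list(range(1000))        # stand-in for 1,000 labeled examples
train, holdout = train_holdout_split(dataset)
print(len(train), len(holdout))    # 800 for training, 200 held out
assert set(train).isdisjoint(holdout)   # no leakage between the two
```

Note that, exactly as described above, the holdout never overlaps the training set – but both halves still come from the same base collection, with whatever biases it carries.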

In prospective research, as with retrospective research, the training is performed on an existing dataset, but the validation is performed on newly acquired data. In other words, validation is performed using data collected from the real world explicitly for the purpose of validating and measuring the accuracy of the trained model. For instance, if we’re training a neural network to recognize pictures of dogs, retrospective research would have us take a collection of animal pictures and train our model to identify all the dogs in that particular collection, later validating our models using our holdout set of images. However, if we want to validate the models using prospective data, after training the model we might literally take pictures of animals ourselves and feed them to the program, checking how often the software correctly identifies the dogs.

In theory, the first approach sounds better – we can retrain the model until it can pick every single dog image out of our holdout set. In practice, however, feeding real-world data into our models poses some challenges: we, and our models, have never “seen” it before, and it has been collected in a real-world setting without being curated. As a result, there’s no guarantee that the model will achieve the same impressive results in the real world, or on other collections.

Retrospective studies may be laden with further issues as well. One such issue is that the model may learn to use information that shouldn’t matter in a real-world setting. For instance, let’s say you were trying to teach a neural network to diagnose a disease based on clinical photos from multiple kinds of cameras. The better-equipped clinics, with the higher quality cameras, may both produce higher quality photos and see patients in more serious condition. In this case the model may conclude that the higher the resolution of the image, the more severe the patient’s disease. When applied to a real-world setting, that biased model would probably misdiagnose serious cases in clinics that used older cameras.

Another issue is what is called “overfitting”. As the dog example suggests, you can train a model that seems to work perfectly on a certain dataset. The result, however, is a model that “remembers” specific characteristics of the images rather than truly generalizing what a dog looks like. Though it appears to perform optimally on the specific training set in question, when applied in practice the model will probably miss crucial information, or judge based on irrelevant details. For instance, there may be certain biases in the dataset – well-lit photos may correlate with the results we’re looking for, teaching the model that bright lighting predicts positive findings. This brings us back to the importance of how validation is performed: when the validation set is also a subset of the original dataset, the same biases are presumably present, and the model will perform as expected even though it suffers from critical biases. If the validation set has a different source, however, these biases will be exposed and the model will almost certainly perform worse.
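To make the idea concrete, here is a toy Python sketch (with made-up random data – our own construction, not from any study) of a model that “remembers” its training set: perfect on the data it has seen, no better than chance on fresh data.

```python
import random

rng = random.Random(42)

# Toy data: each example is (feature_vector, label). Labels are random,
# so there is no real pattern for a model to generalize.
def make_data(n):
    return [([rng.random() for _ in range(5)], rng.choice([0, 1]))
            for _ in range(n)]

train = make_data(200)
fresh = make_data(200)   # stands in for newly collected, real-world data

# An extreme "overfitted" model: a lookup table keyed on the exact inputs.
memory = {tuple(x): y for x, y in train}
def memorizer(x):
    return memory.get(tuple(x), 0)   # unseen inputs fall back to a guess

train_acc = sum(memorizer(x) == y for x, y in train) / len(train)
fresh_acc = sum(memorizer(x) == y for x, y in fresh) / len(fresh)
print(train_acc, fresh_acc)   # 1.0 on the memorized set, ~0.5 on new data
```

Validating on a holdout drawn from the same collection can hide this failure; validating on data from a different source exposes it immediately.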

Another form of overfitting is called “hyperparameter overfitting”, or hyperfitting. While a model’s parameters are learned during training, its hyperparameters are set by the researchers and tell the model how to learn; the difference between the two won’t be discussed at length in this article. Hyperfitting results from the model being reset and rerun many times, with slight differences in the hyperparameters the researcher controls. For various reasons, there is always a dimension of randomness in model performance. When the model is rerun many times, an improvement in performance may well be the result of this fluctuation rather than of the adjusted hyperparameters. A model that is cherry-picked from among similar models for its higher performance on a single validation set may not replicate that advantage on a different set. Random chance will occasionally produce very good outcomes, but those won’t necessarily translate into real-world results. When this is the case, validation becomes the main gatekeeper of quality and trust.
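The effect is easy to simulate. In the sketch below (the accuracy and noise figures are invented for illustration), a model's true accuracy is fixed, yet cherry-picking the best of fifty noisy reruns reports a score the model cannot sustain on a different set.

```python
import random
import statistics

rng = random.Random(0)
TRUE_ACCURACY = 0.80   # the model's real, underlying accuracy
NOISE = 0.03           # run-to-run randomness in the measured score

def rerun_validation():
    """One retraining run: same true skill, random fluctuation on top."""
    return TRUE_ACCURACY + rng.gauss(0, NOISE)

scores = [rerun_validation() for _ in range(50)]   # 50 hyperparameter tweaks
best = max(scores)
mean = statistics.mean(scores)
print(f"best of 50 runs: {best:.3f}, average run: {mean:.3f}")
# The cherry-picked 'best' run looks better than the model really is;
# the average across runs sits near the true accuracy.
```

Nothing about the model improved across those fifty runs – only the dice rolls differed – yet reporting the single best run would overstate its quality.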

When reading previous research it can be hard to identify which studies have these issues; given that the field is relatively new, professional standards are fewer and not as widespread. Poor research practices may include overfitting, small sample sizes, or loosely substantiated claims – all of which are hard to pinpoint when reading studies that don’t cite crucial information, such as statistical significance, or which haven’t been tested on a different population than the one the model was trained on.

Prospective studies may not solve all of those problems, but they can address most of them. Firstly, on the most intuitive level, there is a great deal of transparency in stating the research objective before the study is carried out. Secondly, both overfitting and hyperfitting are exposed, because the model is tested on real-world data: if the model fits the dataset it was trained on perfectly but nothing else, i.e. is overfitted, this will be immediately apparent in the results. Moreover, while dataset-dependent biases may still exist in the model (as in the example of the different clinics), they presumably won’t threaten the study’s results, simply because the study uses a different data source. In addition, the results of a prospective study depend far less on what the researcher decides to report, or on how closely they adhere to good practice – a prospective study is much harder to perform multiple times on the same population until the desired results are acquired. Lastly, prospective studies are far more beholden to classic study controls, such as blinding, and can therefore provide a much higher degree of trust in the results.

It is worth noting, however, that like anything else, prospective studies have their limitations. One pertinent issue is that the generalized efficacy of the various models reported in studies can be hard to compare, given that their samples are almost always different. A deep learning model that performs well on a given study sample may not do as well on a different one. For instance, a model trained to detect disease in a localized population may not perform as well on patients from other countries. Additionally, while less prone to researcher bias than retrospective studies, the quality of the research still depends heavily on the quality, experience, and integrity of the researcher. The research and the researcher, as always, cannot be completely separated.

That said, many of the benchmarks of good research apply to prospective deep learning validation studies. As with any other research, the larger the sample size and the higher the statistical significance, the better. Narrower confidence intervals are another good way of gauging quality. A further mark of transparency is whether or not the study reports all of its attempts, and not only the successful ones. If it does, the fewer attempts the better, as the researchers are less likely to have stumbled onto good results by sheer luck.

Before performing the prospective component of the research, there are also various steps researchers should take to ensure best practice. One of these, in both retrospective and prospective studies, is what’s called cross-validation, in which the researcher runs their model on multiple sub-samples of their dataset. The results of those runs are averaged, and the average is presented as the final result. Retrospective studies can also validate using multiple datasets, which reduces – though does not eliminate – the chances of over- or hyperfitting.
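A cross-validation loop, in outline, looks something like this. This is a generic Python sketch; `toy_evaluate` is a hypothetical placeholder standing in for actually training a model on the training folds and scoring it on the validation fold.

```python
import random
import statistics

def k_fold_scores(records, k, evaluate, seed=0):
    """Score a model on k rotating sub-samples of one dataset.

    Each fold serves once as the validation slice while the remaining
    folds train; the per-fold scores are later averaged into one result.
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k disjoint sub-samples
    scores = []
    for i, val_fold in enumerate(folds):
        train_part = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(evaluate(train_part, val_fold))
    return scores

# Placeholder scorer for illustration only: real code would fit a model
# on train_part and measure its accuracy on val_fold.
def toy_evaluate(train_part, val_fold):
    return len(val_fold) / (len(train_part) + len(val_fold))

data = list(range(100))
scores = k_fold_scores(data, k=5, evaluate=toy_evaluate)
print(statistics.mean(scores))   # each fold holds 1/5 of the data
```

Averaging over k rotating validation slices, rather than trusting one fixed split, damps the run-to-run luck that hyperfitting exploits.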

In conclusion, though both retrospective and prospective deep learning research include training and validation stages, the characteristics of the validation stages are quite different. While retrospective research uses a subset of the original dataset for validation, prospective research collects data specifically for the purpose of validating the efficacy of the model. There are also multiple practices, including cross-validation, that both prospective and retrospective researchers can adopt to ensure best practice. Finally, as readers of these studies, we can check multiple things to verify the quality of the research in question, statistical significance and transparency being a good starting point. Most importantly, as we’ve discussed, the best test is always real-world application – a prospective study is much more likely to accurately represent and validate the process, data, and usability of the model in question.