Carl Miller, Research Director at the Demos Centre for the Analysis of Social Media, recently ran an experiment. It has become the norm for companies to collect users’ data openly, and some of them also trade the accumulated information with third parties. Miller therefore sent requests to 80 companies, asking each to return the data it held on him. He wanted to know how he exists in data form and whether he could conjure a version of himself from that data.

Even though the General Data Protection Regulation (GDPR) gives individuals the right to be informed about the collection and use of their personal data, the process remains laborious, as Miller recounted to the British Broadcasting Corporation (BBC). In the end, only 20 companies replied, and Miller received 7,000 pages of his personal data. He noted that part of the collected data had been provided voluntarily: name, address, demographics and contact details.

A second part was generated as he used the companies’ services or products. The last part was meta-data, or new data created from models and probabilities. Miller felt the alarming part was that the meta-data were mostly generated by companies he had never heard of, suggesting his data could have been passed to other entities without his knowledge.

Most of the time, data are in the wild 

At least in this experiment, Miller was not convinced that he could be recognized at an individual level through his data. He was more appalled by the opaque and tedious steps he had to take to retrieve his personal data. Some of the data he received were incomprehensible when they arrived. This probably rings a bell in the medical and healthcare community too.

As Dr. Anna Goldenberg, Senior Scientist in the Genetics and Genome Biology Program at SickKids Research Institute and Associate Professor in the Department of Computer Science at the University of Toronto, said at the AIMed Breakfast Briefing – Experience the Future of AI in Radiology (Toronto), there are a lot of data in the ecosystem, but researchers will never be able to see them.

Although there may be exceptions, such as the large amounts of data obtained through an Institutional Review Board (IRB) approved research project or from individual patients who have given consent, there is still a risk of not having the ground-breaking or relevant data needed to develop what researchers have in mind. Dr. Errol Colak, Clinical Lead of the Diagnostic Imaging and Learning Algorithms (DILA) group at The Li Ka Shing Center for Healthcare Analytics Research & Training, shared a similar opinion in the same session.

A centralized body to manage access 

He said that, from the radiology point of view, there are large databases, but most of them are not accessible. Even when they are, data are recorded in many different ways. Hence, someone who wants to retrieve an image taken on a certain day from a particular source will not necessarily be able to do so immediately. Over the years, many have tried to develop their own solutions to overcome this, but most proved to be time-consuming and wasted effort. Data retrieval continues to be a concern, even for professionals.

Dr. April Khademi, Assistant Professor of Biomedical Engineering at Ryerson University and Principal Investigator of the Image Analysis in Medicine Lab (IAMLAB), added that it is perhaps more logical to have a centralized body manage and distribute access to data. However, this has to be done cautiously, because some institutions try to maintain exclusivity over particular data sets in order to keep a competitive edge.

Furthermore, clinical data can be rather messy. Thus, there is a need for validation strategies and some form of accountability to ensure researchers are using accurate data, especially if they are using them to develop new tools that will eventually be used on patients.


Author Bio


Hazel Tang

A science writer with a data background and an interest in current affairs, culture, and arts; a non-med from an (almost) all-med family.