Health Policy May 11, 2020
Social data infrastructure for AI medical research

By Ryan Khurana - Accountable Care Journal

Artificial intelligence (AI) has proven its worth in medical applications of late: around the world it is driving Covid-19 responses, from viral tracking to diagnostics to vaccine research.

These useful AI approaches, powered by deep learning, function by “learning” from large datasets, pulling out complex correlations in order to better achieve their objective. Integral to AI’s usefulness is access to data of sufficient size and diversity.

Globally, we are facing a pandemic in which empowering AI research is essential for an effective response. It is, therefore, vital to develop an appropriate understanding of the rights associated with AI's data needs. There exists in the medical ethics literature a tension between the individual right to autonomy and the social benefits that result from wider access to medical data, a tension that can only be resolved by an elucidation of social rights.

The individual right to autonomy that is currently guaranteed by most liberal medical regimes enables patients to control their medical data, granting them the ability to opt out of sharing it. While autonomy is important in the clinician-patient context, restricting access to patient data for medical research would limit the progress of AI.

This is because there is ample evidence to suggest that certain demographic groups would be more reluctant to share their medical data, biasing the results of AI and harming the generalisability of its findings. For example, heart disease datasets contain a much larger sample of Caucasian patients, so algorithms that detect heart disease perform worse on minority patients. The inequalities that would be exacerbated by allowing individual autonomy to restrict data access are damaging and must be avoided.

The need for AI to access this medical data, however, does not invalidate a patient's rights or make their anxieties unjustified. This is why public investments need to be made both in communicating to patients the rights that ought to apply to their medical data and in building a data infrastructure that distinguishes between individual patient data and the training data used for medical research.

The medical data used for building AI is different from the type of data doctors keep on their patients. It is standardised, losing the particular details a doctor may find essential and the variations that exist between clinics; and it is aggregated, requiring less detail about each individual patient but the same information from a very large number of patients.
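As a rough illustration of that distinction, the sketch below maps a clinic's idiosyncratic chart onto a minimal shared schema. The field names and records are invented for this example, and the mapping deliberately drops the free-text detail a clinician might rely on:

```python
from typing import TypedDict

class TrainingRecord(TypedDict):
    """Standardised, minimal fields kept for model training."""
    age: int
    sex: str
    heart_disease: bool

def standardise(clinic_record: dict) -> TrainingRecord:
    # Map a clinic's idiosyncratic chart onto the shared schema,
    # discarding the free-text notes a clinician might find essential.
    return TrainingRecord(
        age=int(clinic_record["patient_age"]),
        sex=clinic_record["sex"].upper()[0],  # normalise "male"/"M" etc.
        heart_disease=bool(clinic_record.get("dx_heart_disease", False)),
    )

# One clinic's record, with detail that the training schema drops:
clinic_record = {"patient_age": "64", "sex": "male",
                 "dx_heart_disease": True,
                 "notes": "pt reports chest pain on exertion"}
print(standardise(clinic_record))  # the 'notes' field is deliberately lost
```

Repeated over millions of such records, this is what produces a dataset useful for training while being far less descriptive of any one patient than their own chart.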

To accelerate AI research, this kind of data must be collected, cleaned, and administered through public investment. Amidst the current Covid-19 pandemic, initiatives such as CORD-19, a public-private partnership in the United States that makes thousands of research papers on coronaviruses available to AI researchers to generate new insights, highlight the value of public-sector leadership in accelerating medical development.

Since these investments would be in developing a new data infrastructure, they must be guided by an understanding of the social rights that exist over aggregated data. The medical data being used to train AI is inherently social in nature, since insights are drawn from the patterns that hold between individuals' data.

Genome sequencing serves as an illustrative example of how individual data is actually about the population. Since human beings naturally share significant parts of their genome with others, details about an individual can be imputed without their consent, so long as those who share characteristics with them have given theirs. Your medical data is not just about you, but about others like you.

The insights you carry about others create a social obligation to share but do not invalidate your right to privacy. Social privacy, however, requires an intersectional understanding of relationships, distinguishing between what is wholly yours and what belongs to a group.

To protect this, public data infrastructure should adopt a differential privacy standard. Differential privacy enables the release of datasets and statistics that carry information about the population yet remain functionally indistinguishable from what would be released had any single individual removed their data. The standard prevents any individual's data from being reverse-engineered from a population dataset and can be adapted to protect group privacy as well, such as differences between genders or races.
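As a minimal sketch of what that guarantee looks like in practice, consider a differentially private counting query protected with the Laplace mechanism, one standard way of meeting the definition. The function name, the epsilon value, and the toy records below are illustrative assumptions, not part of any real system:

```python
import numpy as np

def dp_count(records, predicate, epsilon=1.0, rng=None):
    # A counting query has sensitivity 1: adding or removing one
    # individual changes the true count by at most 1. Laplace noise
    # with scale 1/epsilon therefore makes the released count nearly
    # indistinguishable with or without any single person's record.
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy example: a private count of heart-disease diagnoses.
records = [{"heart_disease": True}, {"heart_disease": False},
           {"heart_disease": True}]
print(dp_count(records, lambda r: r["heart_disease"], epsilon=0.5))
```

Smaller values of epsilon add more noise and so give a stronger privacy guarantee, at the cost of a less accurate count.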

This standard works by applying noise to datasets, using a transparent algorithm to randomise information over a subset of observations. In effect, this approach sacrifices a subset of the data to randomness, meaning it requires even more data to function. The benefit, however, is that while the process by which data is randomised is known, which individuals were actually treated by this process is unknown, meaning that no individual's record can be trusted to be an honest one. This enables population-level insights to be drawn without revealing information about any individual.
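The description above, in which the randomisation process is public but no individual record can be trusted, matches the classic randomised-response mechanism. The sketch below is one illustrative implementation under that assumption; the names and the truth probability are invented for the example:

```python
import random

def randomised_response(true_answer, p_truth=0.75, rng=random):
    # Each respondent answers honestly with probability p_truth and
    # otherwise gives a uniformly random yes/no. Any single recorded
    # answer is therefore deniable, even though the process is public.
    if rng.random() < p_truth:
        return true_answer
    return rng.random() < 0.5

def estimate_rate(reported, p_truth=0.75):
    # Invert the known noise process to recover the population rate:
    # E[reported] = p_truth * rate + (1 - p_truth) * 0.5
    observed = sum(reported) / len(reported)
    return (observed - (1 - p_truth) * 0.5) / p_truth

# Toy example: 10,000 patients, 30% of whom truly have a condition.
truth = [random.random() < 0.30 for _ in range(10_000)]
reported = [randomised_response(t) for t in truth]
print(round(estimate_rate(reported), 3))  # close to 0.30
```

Because every recorded answer is noisy, the estimate only stabilises once many respondents are pooled, which is the "requires even more data" cost noted above.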

Developing such a public data infrastructure would allow for greater public buy-in to the investments in medical innovation being made. It also enables democratic accountability, allowing the public to voice their concerns over the use of their data and ensuring that medical research is done for their benefit. It would also give researchers the access they need to respond to pressing diagnostic and treatment needs, as we are currently seeing with the rapid spread of Covid-19.

Having such an infrastructure in place would mean that, should future pandemics hit, the medical community would have the data necessary to respond more rapidly. Balancing the need to protect public health with the protection of human rights is impossible without a suitable framework for social rights.

While differential privacy is only one aspect of this social rights approach, it highlights the conversations needed to enable a more responsive medical sector.

