How can data science help to better understand human health?

Innovations in data science are rapidly improving our ability to spot diseases, identify risks, and tailor therapies. But expert knowledge is needed to turn the raw data into insights.

The blizzard of health data, enabled by leaps in the performance of data collection technology, data organization and computational power, will only help us if there is a corresponding advance in analytics, ethics and governance. The University of Birmingham’s interdisciplinary approach, embodied by the cross-campus Centre for Computational Biology (CCB), which links data, knowledge and expertise in the statistical, mathematical, computational and life sciences, is now turning raw numbers into actionable insights.

Big data can rapidly turn into big noise without the proper analytics to make sense of it, says Professor Jean-Baptiste Cazier, Chair of Bioinformatics and Director of the CCB at the University of Birmingham. In healthcare, innovations like artificial intelligence and machine learning are radically improving our ability to spot diseases, identify risks, and tailor therapies. But beyond the modelling of a condition, expert knowledge is also needed to annotate, code and curate data, and in turn to interpret it and convert it into meaningful insight.

Common diseases are rare

Precision medicine is healthcare’s great hope of more targeted disease prediction, prevention, diagnosis and management. While genetic data was at its core, the more we understand the molecular underpinnings of disease, the more we are learning that there is no such thing as a common disease, says Professor Georgios Gkoutos, Chair of Clinical Bioinformatics at the CCB, University of Birmingham, and Associate Director of Health Data Research UK (HDR UK).

Most non-communicable diseases, from cancer to asthma, can be described as ‘rare’ because on an individual level they can vary substantially. Precision medicine tailors healthcare to specific individuals or groups who would benefit from proactive care or whose response to treatment can be predicted through computational biology modelling and other techniques. Genomics, the sequencing of a person’s DNA, is the essential starting point for establishing the genetic and epigenetic bases of health and disease.

[Image: inhaler. Credit: Photofusion Picture Library / Alamy Stock Photo]

Yet while most non-communicable diseases have genetic predispositions, these are challenging to elucidate because the conditions are often related to multiple genes. Cancer, for example, is best thought of as an accumulation of genetic mutations that confers malignancy in a particular cell type. For any single gene, there are a large number of rare variants that increase susceptibility to disease, individually or in combination with others. Thanks to the increasing affordability of DNA sequencing, large-scale studies of genetic variation within populations have identified numerous susceptibility mutations, and these conditions are becoming better understood. Characterizing the genomic profile of tumours, however, remains a challenge because of cancer dynamics and heterogeneity.

The need to understand disease etiology also draws on other types of -omics. Where DNA studies enable the examination of genetic variants of disease, multi-omics looks at the quantitative outputs of genome regulation: transcriptomics (RNA); proteomics (peptide abundance and interactions); epigenomics (reversible DNA modifications across the entire genome); and metabolomics (the outputs of cellular metabolism). This aggregated data can tell us more about how diseases develop in vivo than genomic profiling alone. For example, tiny genomic alterations can be better identified at the level of cellular proteins and can predict the response of people with cancer to radiotherapy and chemotherapy.

Machine learning and explainable models

The challenge with multi-omics is that much of the data is irrelevant to what the researcher is really investigating. By definition, an omics modality measures all facets of a complex system. However, in the study of any given process, for example the response to a particular drug, only a small subset of system components may be mechanistically involved in the biological response. Machine learning can help researchers see the wood for the trees. "When you are making omics measurements, you need a 'sparse model' - the few things that really matter to the process you are looking at, like the mutation drivers in cancer, for instance, or the microbial species active in the metabolism of a specific chemical compound," says Professor Ben Brown, Chair of Environmental Bioinformatics at the CCB, University of Birmingham. "But in 'omics', you are gathering data about a universe of things of which 99% have nothing to do with what you are trying to measure." When you don’t know which components of the system are involved before you do the experiment, omics are often the best answer.
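
To make the idea of a 'sparse model' concrete, here is a minimal sketch using L1-penalised (lasso) regression, one common way of forcing most coefficients to exactly zero. It is an illustration only, not a description of the methods used in Professor Brown's lab: the data are synthetic, and the feature indices, effect sizes, penalty value and function names are all assumptions chosen for the example.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within +/- threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(features, outcome, penalty, n_sweeps=100):
    """Minimise (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent."""
    n, p = len(features), len(features[0])
    weights = [0.0] * p
    residual = list(outcome)  # equals y - Xw while all weights are zero
    col_sq = [sum(row[j] ** 2 for row in features) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero column carries no information
            # Covariance of feature j with the residual, with j's own contribution added back
            rho = sum(features[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * weights[j]
            new_weight = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_weight - weights[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * features[i][j]
                weights[j] = new_weight
    return weights


# Synthetic "omics" table (illustrative only): 200 samples, 50 measured features,
# of which only three hypothetical drivers actually affect the outcome.
rng = random.Random(0)
n_samples, n_features = 200, 50
true_effects = {3: 2.0, 17: -1.5, 42: 1.0}
features = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
outcome = [
    sum(effect * row[j] for j, effect in true_effects.items()) + rng.gauss(0.0, 0.5)
    for row in features
]

weights = lasso_coordinate_descent(features, outcome, penalty=0.25)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("features kept by the sparse model:", selected)
```

Because the penalty sets the weights of uninformative features to exactly zero, the surviving coefficients can be read directly, which is one simple way a model stays inspectable rather than opaque. In real studies the penalty would normally be chosen by cross-validation, and an established library implementation would be preferable to this hand-written loop.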

At the interface of transcriptomics, metagenomics and metabolomics, he says, machine learning enables the researcher to zoom in on what is salient – to identify the subsets of factors driving the phenomena of interest. "This is changing what science is,” he adds. “No longer do we have to come in with a hypothesis and look at a few specific molecules to see if they do what we expect them to do. We’re now looking at the entire universe and asking what are the most interesting patterns in the system.”

While AI is popularly covered in the form of world champion-defeating chess or Go-playing bots, or high-frequency trading algorithms in finance, machine learning in Brown's lab takes a very different form. "These famed AI systems can do fast, clever predictions. But in science it's not the prediction that is important – it’s understanding why prediction is possible. If machine learning identifies a subset of genes associated with a small subset of metabolites that matter, we want access not to a black box but a glass box. We want to open it, look inside and understand what the machine has learned.”

From the outside in: Environmental bioinformatics

Precision medicine is not just about multi-omics and the corresponding micro-level biological data, says Professor Cazier. To understand disease at an individual, rather than population, level also requires understanding the interplay between genetic predisposition, environment and lifestyle. In a diverse society, health data must capture socioeconomic circumstances, environment, geography and ethnicity so medical discoveries can benefit all populations, across the globe, Cazier argues.

Computational research at the University of Birmingham is illuminating how environmental factors interact with health, down to the level of the genome and microbiome. Professor Brown identifies two zones of importance: first, the 'chemosphere' or 'exposome’, the chemicals to which we are exposed, from pesticides in fruit and vegetables to BPA/BPS in plastics; and second, the microbiome, an under-appreciated variable determining how toxicants affect us. Microbial cells in and on our bodies are at least as numerous as our own human cells. "Anything you eat or drink, the metabolic products are mostly not what human cells make, but what the microbiome made," says Professor Brown. "If we want to know how the body deals with toxicants like pesticides, we have to understand the breakdown products generated by gut microbes.”

Professor Brown's lab has found that, by identifying microbes that metabolize toxic substances into non-toxic chemicals, we can alter the microbiome and confer protection for the host. "In a joint University of Birmingham project with Lawrence Berkeley National Laboratory, we’ve identified the gut microbiome for a fruit fly that makes it resilient to the herbicide Atrazine. We now want to use this more broadly to protect honey bees from herbicides and insecticides.” This approach could inform the development of probiotics to make humans more resilient to dietary toxicants. "You could look at which gut microbes are capable of processing that toxicant and then create a favorable condition in the human gut, to have those species be a higher population fraction and reduce the toxicity of the exposure.”

Leading the data science agenda

The University of Birmingham’s work in data science is informed by, and in turn helps, the UK’s National Health Service (NHS). While the NHS holds some of the richest health data in the world, helping researchers learn more about the relationships between the biological, environmental and socioeconomic aspects of disease, the ability to extract useful information from patient records has been limited by the data’s scale and heterogeneity. A typical patient record may contain quantitative clinical data such as pathology results, alongside doctors’ letters, ward notes and clinical imaging results.

The Birmingham-based Queen Elizabeth Hospital, one of the largest in Europe, has maintained entirely digital patient healthcare records since its inception. Its electronic data system, one of the most advanced in the world, holds important clues for advancing medicine, but making sense of them requires proper analytics.

[Image: Queen Elizabeth Hospital. Credit: Tony Hisgett from Birmingham, UK, CC BY 2.0 (https://creativecommons.org/licenses/by/2.0), via Wikimedia Commons]

Professor Cazier describes these datasets as “data rich, but information poor”. Extracting insight requires first collection and collocation, followed by annotation, coding and curation to enable an effective analysis, which in turn requires expert interpretation. Some of this can happen at source, when clinicians apply codes to their entries on a patient’s record, but this is not always possible in the busy clinical arena. Professor Gkoutos is currently developing novel learning health systems that will facilitate an evolving NHS learning paradigm, one that caters for the definition, structuring, harmonization and interoperability of health care data. He believes that such an environment will enable new scientific discoveries arising from the efficient transformation of large multi-dimensional biomedical datasets into actionable information and knowledge, and will facilitate the development of novel cutting-edge technology applications to enhance decision making and improve healthcare.

Common disease: A misnomer?

“Although we understand much about disease pathophysiology, humans are individuals and disease manifestations and trajectories differ substantially”, says Professor Gkoutos. “Some therapies are only applicable to particular individuals and have different degrees of effectiveness, and toxicity. Until now, we have not had the means to understand these divergences.”

Professor Gkoutos thinks we are beginning to find ways of harnessing richer data, such as comparing biological data with clinical outcomes, to stratify information and better treat people as individuals. “The challenge”, he says, “is how to make meaningful associations between these datasets and how to use them to identify subgroups of the population for individual treatments. Working closely with researchers, this is starting to happen within the NHS and is an important stepping stone towards routine individualized medicine.”

Five challenges for data science

Health data science is not without challenges – Professor Gkoutos identifies five. First, there must be strict ethics and governance in data use and storage, which is difficult when data is multi-source. Second, data must be well-annotated and curated. As AI evolves, there is an essential role for curation that ensures the information is both useful and ethically managed. Two further challenges relate to computational power and data storage capacity, although Professor Gkoutos sees these gradually improving. 

The fifth is how to make this data accessible and useful to the wider research community. “We need a national strategy to address how we are going to make data available, how we are going to de-identify [people], and how we can ensure patient safety,” says Professor Gkoutos.

Health data science underpins the University of Birmingham’s activity in cross-sector collaborations such as Health Data Research UK (HDR UK) and the Alan Turing Institute, which aim, respectively, to make better use of clinical data and to develop major research programs focused on AI and data science. Professor Cazier is enthusiastic about opening up the potential for interdisciplinarity. While his work has traditionally involved quantitative data, he is keen to explore ways to triangulate bioinformatics with social science research data, which is predominantly qualitative. This will require creative methods of linkage, but he is hopeful that it will shed light on the multiple and complex contributory factors in how diseases develop and respond to treatment. The University of Birmingham’s strong emphasis on collaboration and interdisciplinarity provides the perfect institutional ecosystem to advance the data science agenda, he says.

Banner image - marcos alvarado / Alamy Stock Photo