Research themes

The spread of academic research themes within the Centre for Health Data Science reflects the varied, multi-modal nature of the data, the different platforms, the decentralised collection and analysis, and, most importantly, the aim of providing more personalised and targeted treatment for individual patients. Hence, our themes range from incorporating background knowledge and semantics in the analysis of multi-modal clinical and omics data, through developing frameworks to examine large patient cohorts automatically, to providing AI-supported solutions across nearly the whole range of clinical, medical and social applications.

Semantic Data Mining and Analysis

Data requires context to be understood, and we develop approaches that realise a reciprocal relationship between background knowledge and foreground data. We use semantic techniques to enhance the performance of information extraction, learn from the extracted information, and feed it back into background knowledge, ultimately forming a knowledge framework that approaches computational understanding of healthcare entities. This knowledge framework forms the basis for explainable and explicable approaches to healthcare research, evolving from and leaving behind the black-box paradigm. We realise this approach in applications that include cohort identification, patient stratification, and pathway analysis.

Ontologies and Standards

The centre leads in the design, implementation, and analysis of semantic standards for healthcare data. These activities provide the basis for a unified approach to data integration and analysis across modalities, geopolitical boundaries, and disciplines, and ultimately facilitate the transformation of data into knowledge, and finally into actionable insight. 

The centre works closely with other partners to develop novel artificial intelligence-based approaches that apply the power of modern machine learning techniques to current health challenges and discover hidden information in, and knowledge from, electronic health records (EHRs). This includes the use of state-of-the-art AI approaches to discover patient subgroups, or phenotypes, in large clinical trial data (Lancet) and the use of different modalities (structured, text-based and time-series information) in disease risk prediction in primary and secondary care. Furthermore, we apply these AI techniques to better understand complex multimorbidity in patients and to develop dynamic risk models that use longitudinal data for patient outcome prediction (OPTIMAL).

Multiomics Multimodal integrative analytics

High-throughput technologies have transformed clinical research and, as a result, many multi-omics data sets (for example transcriptomics, proteomics, metabolomics and single-cell transcriptomics) and multi-modal data sets (for example images and literature) are now available. An integrative strategy that incorporates multi-omics and multi-modal data is needed to bring out the interrelationships of the relevant biomolecules and their functions. In response to the development of high-throughput techniques, we have developed several potential diagnostic tools, workflows, and machine-learning approaches for data integration and interpretation.

We integrate these data sets with the aim of identifying novel therapeutic targets or mechanisms for a particular disease, for example colon cancer. We explore many different types of data set, including stakeholders' experimental data, hospital data generated from trials, and public data sets (for example TCGA and GEO). Examples of our integrative analytics include microbiome and inflammatory markers in an infant cohort (Wood and Acharjee et al., Allergy, 2021) and microbiome, metabolome, and single-cell sequence data in a colon cancer cohort (Bosch and Acharjee et al., 2022; Quraishi and Acharjee et al., J Crohns Colitis, 2020). Our group brings together experts from multiple disciplines, for example biochemists, bioinformaticians and computer scientists, and we promote interdisciplinary research.


Over the last two decades, we have been developing state-of-the-art phenotype-based methods for prioritising causative variants in diseases ranging from rare to complex diseases and cancer. These methods have repeatedly been shown to be highly effective in identifying causative variants in whole-genome or whole-exome sequence data, and have recently been extended to the identification of multiple interacting candidate variants.

We are now developing new methods that can be applied to individuals to provide molecular diagnosis in a personalised manner, encompassing the contribution of medium-rare to common alleles as well as low-frequency and common variants contributing to polygenic complex diseases. Moreover, we develop knowledge-based frameworks for integrating multiple types of ’omics data with clinical data, focusing on both qualitative and quantitative data, and we improve and extend machine learning algorithms that can use these frameworks as background knowledge and combine them with individual-level quantitative data.

Translational Phenomics

We aim to facilitate the systematic and consistent representation, at scale, of clinically meaningful, high-resolution health-related phenotypes from the complex data that arise during routine healthcare, enabling disease/patient stratification and the identification of biomarkers.

We then support the multimodal, high-dimensional integrative analysis of these phenotypes, which encompasses background biomedical knowledge and accounts for an integrative view of the patient genome sequence and other high-dimensional omics data types. We enrich our approach by developing and assessing robust methods for analysing (e.g., patient clustering, outcome prediction), visualising (e.g., clinical dashboards) and using complex multi-dimensional, multimodal and multi-omics datasets to derive novel descriptions of phenotype and to personalise healthcare delivery (e.g., patient/disease stratification).

Ultimately, we aim to transform integrated data environments into a secure, semantics-based infrastructure of interoperable, in-depth, real-time biomedical data that can act as the harmonisation backbone of healthcare ecosystems. We also aim to generate an integrative, multimodal, longitudinal characterisation of the clinical phenome (the system-wide patient journey) for early diagnosis and anomaly detection, and for monitoring disease stage, adherence and treatment response.

Data driven clinical trials

Working together with Birmingham Clinical Trials Unit (BCTU), we are developing processes that will enable us to conduct lower-cost, data-driven randomised controlled trials using routine data from electronic health records. These processes include assessing the need to conduct a trial; defining the eligible population; obtaining estimates of disease incidence for sample size calculation; and trial emulation. The centre is undertaking two NIHR-funded data-driven trials that capitalise on the investments made in electronic health records and use our data extraction software, DExtER, to address unmet need in at-risk groups of patients. One of these, the DARE-2-THINK trial (Chief Investigator Professor Dipak Kotecha), is examining the effect of direct oral anticoagulants on the risk of vascular and neurological outcomes in people with atrial fibrillation and a low to medium risk of stroke. The RADIANT trial, conducted in collaboration with the healthcare technology company Cegedim, is assessing whether text message reminders to women with previous gestational diabetes help increase testing for type 2 diabetes. These trials are helping to revolutionise the way that clinical trials are run in the NHS: they use minimal front-line staff time and require far fewer, or no, additional visits to hospitals or GPs, because much of the follow-up data is routinely collected.
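To illustrate how an incidence estimate can feed a sample size calculation, here is a minimal sketch. It assumes a two-arm trial with equal allocation that compares the proportion of participants experiencing an event, and it uses the standard normal-approximation formula for two independent proportions. The function name and the example figures are hypothetical and are not taken from DARE-2-THINK, RADIANT or DExtER.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Participants needed per arm to detect p_control vs p_treatment.

    Normal-approximation formula for comparing two independent proportions
    with a two-sided test and equal allocation between arms.
    """
    if not (0 < p_control < 1 and 0 < p_treatment < 1):
        raise ValueError("proportions must lie strictly between 0 and 1")
    if p_control == p_treatment:
        raise ValueError("proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical inputs: 10% of the control arm is expected to have the event
# during follow-up (for instance, an estimate taken from routine records),
# and the trial aims to detect a reduction to 5%.
print(sample_size_per_arm(0.10, 0.05))
```

A real trial would normally also allow for loss to follow-up, clustering and interim analyses, none of which this sketch covers.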

Automated Clinical Epidemiology

Automated Clinical Epidemiology Studies (ACES) is one of the most ambitious Health Informatics projects in the UK. The primary aim of this project is to expedite electronic health records research by automating data extraction and data analysis for various epidemiological study designs using computer software. Designed and developed in-house, it is a state-of-the-art framework for routine epidemiology and pharmaco-epidemiology research.

What is DExtER? 

DExtER generates real-world evidence from electronic health records. It can generate evidence for routine epidemiological study designs, such as cohort, case-control, cross-sectional, and incidence and prevalence studies, and for pharmaco-epidemiology study designs, such as new-user and prevalent new-user designs. It can also help generate data for complex time series analysis using machine learning.

It provides an easy-to-use web interface where medical researchers can design various epidemiological studies by providing data extraction specifications, such as study period, exposure for the cohort, baseline characteristics, matching criteria for the unexposed and outcome of interest. DExtER interrogates the electronic health record databases and produces analysable datasets and results. 
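The specification items listed above can be pictured as a simple structured record. The sketch below is purely illustrative: DExtER's actual interface and data model are not described here, so the class name, field names and validation rules are all assumptions, and the example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CohortStudySpec:
    """Hypothetical container for the specification items named in the text."""
    study_start: date
    study_end: date
    exposure: str
    outcome: str
    baseline_characteristics: tuple = ()
    matching_criteria: tuple = ()  # used to select unexposed patients

    def problems(self):
        """Return a list of human-readable problems; empty means usable."""
        found = []
        if self.study_end <= self.study_start:
            found.append("study period must end after it starts")
        if not self.exposure.strip():
            found.append("an exposure definition is required")
        if not self.outcome.strip():
            found.append("an outcome of interest is required")
        if self.exposure.strip() and self.exposure == self.outcome:
            found.append("exposure and outcome must differ")
        return found


# Hypothetical example values, chosen only to show the shape of a request.
spec = CohortStudySpec(
    study_start=date(2010, 1, 1),
    study_end=date(2020, 12, 31),
    exposure="gestational diabetes",
    outcome="type 2 diabetes",
    baseline_characteristics=("age", "body mass index", "smoking status"),
    matching_criteria=("age within 1 year", "same general practice"),
)
print(spec.problems())
```

Checking a specification before any database is queried is one plausible design choice, since it lets mistakes in the study design surface early.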

Recently, we have demonstrated that DExtER is a tool that can be adapted for Informatics Consult, Data Driven Clinical Trials, Audits, and Global Health Informatics.

Regulatory Science

In order to ensure healthcare innovations are delivered safely and effectively, it is important that they are validated, evaluated and legislated rapidly following their development. Regulatory science achieves this through partnerships and collaboration – bringing together people from a range of disciplines to evaluate new treatments and interventions as they are developed and to ensure their quality, safety and efficacy as they are implemented in clinical practice. 

Researchers in the Centre for Health Data Science work alongside partners in the Birmingham Health Partners Centre for Regulatory Science and Innovation (CRSI), bringing together experts in medicine, applied health research, health policy and management, clinical trial design, medical law and patient-reported outcomes research. Together we work to identify emerging challenges in UK regulatory science; generate evidence to advance regulation and develop regulatory standards; and contribute to the design of UK-wide infrastructure and guidance, facilitating safe, cost-effective implementation of innovative interventions for the benefit of the NHS, patients and clinicians. 

Real world evidence

The real-world evidence (RWE) team is a multidisciplinary group of researchers and clinicians who aim to use real-world, routinely collected data to produce high quality research on epidemiology, pharmacoepidemiology, clinical practice and health care utilisation relevant to patients and clinicians. 

We make use of a range of data sources across both primary and secondary care, including Clinical Practice Research Datalink (CPRD), The Health Improvement Network (THIN), Hospital Episode Statistics, and routine healthcare data from a range of collaborating UK hospitals. Our research spans multiple clinical areas – in particular, endocrinology, cardiology, ophthalmology, rheumatology, neurology, gastroenterology, and nephrology – and includes: 

  • investigating disease prognosis;

  • describing trends in disease prevalence, incidence and management;

  • developing and testing algorithms and prediction models for implementation in decision support in primary and secondary care;

  • conducting pharmacoepidemiological studies, particularly exploring the potential for drug repurposing.


Genomics, population genetics and human diversity

We use large-scale genomic sequencing and health data to investigate human ancestry, traits, and diseases. Our group adopts a multidisciplinary approach, combining genetics and bioinformatics with health data science and other disciplines to study human diseases. Our collaborators include physicians and clinicians, geneticists, epidemiologists, statisticians, anthropologists, archaeologists, linguists, and specialists from numerous other disciplines, highlighting the multidisciplinary breadth of our research. We combine all this expertise to explore the human genome from an evolutionary point of view, showing how past events have shaped human genetic diversity, including our ancestry and diseases.

A major effort in our group is to investigate disease variants in populations that remain underrepresented in genetic studies today, such as Africans, Middle Easterners and South Asians. Although all human populations today descend from a common ancestral population and genetic variation among humans is relatively low, small genetic differences exist among populations (for example, arising from genetic drift or adaptation) and can sometimes be medically relevant. Our group studies genetic variation across worldwide populations to understand the origin of these genetic differences and whether they can inform disease aetiology. We aim to decrease the disparity of ethnic representation in genetic studies and improve human health globally.

Clinical Decision Support Systems

The clinical decision support systems team focuses on translating clinical guideline advice into models that can be applied in electronic health record systems. In addition to providing computable models that are clinically useful and true to the guidelines behind them, the ultimate goal of this team is to validate these models and facilitate guideline authoring through evidence derived from sources such as DExtER.

The team supports a web repository of existing models to be tested and shared by clinical practitioners and researchers, along with a toolset that can be used to generate new models.

Artificial Intelligence in Healthcare: from molecule to bedside

Routinely collected clinical records allow investigation of many different diseases simultaneously, including inherently “longitudinal” phenotypes related to multi-morbidity, complications, progression, drug response, and survival. A systems genomics approach coupled with electronic health records (EHRs) offers the tantalising possibility of uncovering the shared aetiology of many apparently different disorders, identifying potential synergies in the prevention and treatment of multiple disorders, and reducing polypharmacy and associated treatment-related complications. Conversely, identifying causal pathways that have directionally opposing effects on different conditions would be valuable in identifying potential safety concerns relevant to de-risking trials of agents that target such pathways.

We develop AI-based precision medicine frameworks that facilitate the transformation of biomedical scientific understanding and its translation into healthcare settings while taking into account fairness, uncertainty and interpretability. This approach aims to contribute to disease prevention and treatment through a complementary alignment of disease classification based on patterns of end-organ manifestation with its molecular-based counterpart. Consistent with “molecule to man” precision medicine strategies, the overarching hypothesis is that causal insights into biology and disease can be revealed by overlaying high-resolution EHRs and other biomolecular traits onto individuals in large-scale cohorts with genomic information.

By leveraging deep biomedical knowledge, we aim to enable integrative cause–effect analyses over routinely collected clinical data in combination with high-dimensional traits derived from the omics space. Ultimately, we aim to develop novel computational infrastructures and associated data science methods that enable in-depth integration of basic biology, biomedicine and population health science across scales and levels of granularity.


Clinical themes

Primary care

The centre undertakes a number of research projects using primary care data and focusing on primary care improvement. We access big data from primary care records from our partners at the MHRA Clinical Practice Research Datalink (CPRD) and Cegedim to conduct population scale analyses on a range of clinical conditions and themes, culminating in over 100 research publications to date.  

These include research on the application of machine learning to the understanding of complex multimorbidity (OPTIMAL Study), describing multimorbidity in pregnancy (MuM-PreDiCT), and assessing the health impacts of long Covid in primary care (TLC Study). There is also an increasing portfolio of research on digital primary care trials, such as the RADIANT trial on improving the management of women with a history of gestational diabetes using big data and text messaging.

The centre has a group of dynamic academic clinicians working in primary care including GPs and GP academic clinical fellows and works closely with the West Midlands Clinical Research Network. The centre’s work also aligns with the University of Birmingham’s Centre for Primary Care Improvement, which includes research on continuity of care, clinical decision making, and digital access. 

Acute care

Acute care services in the UK are currently working at capacity at a time when demand for healthcare is at its greatest. Harnessing the vast amount of routine health data collected from people accessing acute care services, in both hospital and community settings, can lead to better delivery of healthcare services in this setting. Two acute care digital health data hubs, PIONEER (led by Professor Liz Sapey) and INSIGHT (led by Professor Alastair Denniston), are at the forefront of delivering this ambition. The Patient Safety Research Collaboration (led by Professors Alice Turner, Shakila Thangaratinam, and Liz Sapey) is ensuring that health data and digital health tools lead to better clinical decision making in the acute care setting, improved patient safety and a reduced risk of harm.

Maternal health

Maternity care faces many challenges in the UK. Fundamentally, maternity is a normal physiological process that a large proportion of the population experience during their lives, but one which can increase the risk of long-term illness or even death. Multiple safety scandals, such as that covered by the Ockenden report, have been prominent in the news, and the recent Women’s Health Strategy emphasises key priorities around improving complex care for mothers and ensuring that they are involved in research, avoiding historic “male-by-default” approaches to research.

The centre is involved in major programmes of work to meet these challenges. The MuM-PreDiCT programme led by the centre brings together a collaboration of researchers from across the four nations, using primary and secondary care records from millions of patients to investigate the health of pregnant women with multiple long-term conditions. Data from electronic health records are used to describe the health in pregnancy in those with multiple conditions, identify drug effects in pregnancy, and develop prediction models and tools to understand individual risk in pregnancy. 

The centre is also deploying its flagship DExtER platform directly into maternity settings to better support services in improving care and contributing to research. We are developing a pilot project in Shrewsbury and Telford Hospitals NHS Trust to enable the trust to use data from their electronic health records, without removing data from the trust. This will allow clinical services to benefit from data science technologies and methods without risking the privacy and protection of personal data held by hospitals. Our ambition is to support services in functioning as Learning Health Systems, where a health system systematically learns from every patient that walks in through the door. 

The centre’s work also aligns with the maternity theme of the University of Birmingham’s Patient Safety Research Collaboration through the development of patient safety tools such as prediction models and Clinical Decision Support Systems.

Public Health

Birmingham is uniquely placed in the UK as a diverse multi-cultural hub of community activity. Despite decades of investment, the region is plagued by substantial inequalities in healthcare outcomes. There are many areas of health where Birmingham has worse outcomes than other areas in the West Midlands or England, for example:

  • The infant mortality rate in Birmingham is 7.0 per 1,000 live births compared to 3.9 for England and 5.6 for the West Midlands.

  • The mortality rate due to CVD in women under 75 is 57.3 deaths per 100,000 population compared to 43.4 for England and 47.0 for the West Midlands.

  • The smoking-attributable death rate is 274.8 deaths per 100,000 population compared to 250.2 for England and 249.3 for the West Midlands.

Of note, maternity and respiratory outcomes have been identified as core areas in the national CORE20PLUS5 strategy to address health inequalities. These national trends in morbidity and mortality are amplified when examining the data at MSOA or LSOA level in the city.

The centre’s work is aligned to address many of these public health inequalities across the city, across the UK and internationally. We host expertise in the wider determinants of health, such as taking a data-driven public health approach to gambling and violence, and our work aligns with the University’s Centre for Crime, Justice and Policing.


Cancer research is central to our centre's activities, and the diverse expertise of our members, including genetics, biology, social and population science, health care delivery, diagnostics, bioinformatics and artificial intelligence, is applied across multiple cancer-related projects. Our members also participate in a number of cancer-related initiatives, for example the Birmingham Experimental Cancer Medicine Centre (ECMC).

Indicative areas of cancer-related research include:

I. Cancer Health Data Science

Example areas: 

1.  Artificial Intelligence/Machine Learning in Cancer Research

2. Multimodal cancer related data harmonisation, interoperability, integration and analysis

3. Translational cancer phenomics

II. Omics/clinical bioinformatics 

Development of cutting-edge cancer bioinformatics approaches and analysis pipelines, for example:

  1. Microbiome-based cancer diagnostics using AI based approaches
  2. Utilizing deconvolution algorithms to comprehend the immune microenvironment
  3. Identifying diagnostic markers for early risk stratification

III. Cancer Population Genetics/Genomics 

We aim to address the issue of ethnic minority underrepresentation in pharmacogenomic studies, which contributes to global health disparities. Specifically, we focus on assessing the efficacy of cancer treatments in different ethnic groups in the UK, with a particular interest in the genetic ancestry of the population and its association with treatment outcomes. By combining our expertise in genomics and artificial intelligence, we aim to develop more personalized and precise treatment plans that are effective in treating cancer across all populations.