The media are always fascinated by medical ‘breakthrough’ stories: tales of hope that scientists have found cures for our most threatening diseases, and tales of woe that our lifestyles are doing us harm. All too often these stories portray the underlying science as conclusive when, at best, it is speculative. Uncertainty does not grab headlines. Some of our more numerate journalists have forged successful careers dissecting the overblown claims: for example, the Guardian’s Bad Science column and Radio 4’s More or Less expose the lack of substance in these stories, often through detailed examination of the statistical evidence. The Royal Statistical Society champions good statistical reporting, presenting annual awards for Statistical Excellence in Journalism.
Last week, it was the turn of a medical test to hit the front pages, with reports that ‘scientists have developed a blood test that can reliably predict whether a person will develop dementia within the next three years’, based on a letter published in Nature Medicine. If true, such a test would allow the detection of Alzheimer’s at a preclinical stage, at a point where treatments could be used to halt progression. Putting to one side the issue that no effective treatments currently exist, how should we judge whether this biomarker claim is valid or overblown?
Modern biomarker development is a complex story of molecules and computer science. It starts with a group of individuals (some with the disease and some free of it) providing samples of blood, tissue or some other bodily fluid or substance; laboratory techniques are then used to measure hundreds, if not thousands, of different molecules present in the samples. Finally, computationally intensive algorithms create a ‘classifier’: an equation that predicts disease from the measured concentrations of the most discriminating subset of molecules. Classifiers can be created without any understanding of disease mechanisms (the reasons why certain molecules are linked to the presence of disease); they are simply a product of statistical correlations. The scientific explanations can follow later.
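To make that pipeline concrete, here is a minimal Python sketch of one deliberately crude way to build such a classifier. It assumes a simple rule: rank every molecule by a Welch t-statistic comparing cases with controls, keep the top k, add up their signed concentrations, and set the cut-off halfway between the average case and control scores. The function names, the synthetic data and the planted effect (5 of 200 molecules raised in disease) are all hypothetical, and this is not presented as the method used in the study discussed below.

```python
import random
import statistics


def welch_t(cases, controls):
    """Welch two-sample t-statistic for one molecule (cases minus controls)."""
    se = (statistics.variance(cases) / len(cases)
          + statistics.variance(controls) / len(controls)) ** 0.5
    return (statistics.mean(cases) - statistics.mean(controls)) / se


def build_classifier(cases, controls, k):
    """Rank molecules by |t|, keep the top k, and combine them into one score.

    Each sample is a list of molecule concentrations. Returns (predict, selected),
    where predict(sample) is True when the sample is classified as diseased.
    """
    t_stats = [welch_t([s[j] for s in cases], [s[j] for s in controls])
               for j in range(len(cases[0]))]
    selected = sorted(range(len(t_stats)), key=lambda j: abs(t_stats[j]),
                      reverse=True)[:k]

    def score(sample):
        # Molecules higher in cases add to the score; lower ones subtract.
        return sum(sample[j] if t_stats[j] > 0 else -sample[j] for j in selected)

    # Cut-off halfway between the average case score and average control score.
    threshold = (statistics.mean(map(score, cases))
                 + statistics.mean(map(score, controls))) / 2
    return (lambda sample: score(sample) > threshold), selected


# Hypothetical synthetic data: 200 molecules, the first 5 truly raised in disease.
rng = random.Random(1)


def simulate(n, diseased, n_molecules=200, n_shifted=5):
    return [[rng.gauss(1.5 if diseased and j < n_shifted else 0.0, 1.0)
             for j in range(n_molecules)] for _ in range(n)]


predict, selected = build_classifier(simulate(40, True), simulate(40, False), k=5)
new_cases, new_controls = simulate(100, True), simulate(100, False)
accuracy = (sum(map(predict, new_cases))
            + sum(not predict(s) for s in new_controls)) / 200
print(sorted(selected), round(accuracy, 2))
```

Note that nothing in the code knows why the selected molecules matter; they are chosen purely because their concentrations differ between the two groups.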
The research underpinning the Alzheimer’s test was of this nature. A cohort of 525 healthy over-70s was recruited and followed for five years, regularly undergoing cognitive testing and providing blood samples in which thousands of molecules were measured. Participants who suffered cognitive decline were identified, and a classifier was created from measurements of the ten molecules found to be most strongly linked with disease. The classifier was reported as correctly identifying nine out of ten of those who suffered cognitive decline and nine out of ten of those with normal cognitive function, which would be accurate enough to have clinical value.
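The two ‘nine out of ten’ figures are the test’s sensitivity (the share of people who declined whom the test flagged) and its specificity (the share of people with normal cognition whom it cleared). The sketch below computes both and attaches an approximate 95% Wilson score interval to each. The counts are hypothetical, chosen only to give roughly 90% in groups of the sizes used to develop the classifier (18 and 53); they are not the published figures.

```python
from math import sqrt


def sensitivity(tp, fn):
    """Proportion of people with the condition whom the test flags."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of people without the condition whom the test clears."""
    return tn / (tn + fp)


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical counts (not the study's data): about nine in ten correct per group.
tp, fn = 16, 2   # 18 people who declined
tn, fp = 48, 5   # 53 people with normal cognition
sens, spec = sensitivity(tp, fn), specificity(tn, fp)
sens_ci, spec_ci = wilson_interval(tp, tp + fn), wilson_interval(tn, tn + fp)
print(f"sensitivity {sens:.2f}, 95% CI {sens_ci[0]:.2f} to {sens_ci[1]:.2f}")
print(f"specificity {spec:.2f}, 95% CI {spec_ci[0]:.2f} to {spec_ci[1]:.2f}")
```

With these hypothetical counts the sensitivity interval runs from about 0.67 to 0.97, nearly twice as wide as the specificity interval, simply because it rests on 18 people rather than 53.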
But there are reasons to regard these findings as speculative and unlikely to be replicated in clinical care. First, the sample size used in the study was much smaller than it first appears. The classifier was developed by comparing concentrations in only 18 patients who developed cognitive impairment with 53 with normal cognitive function. Small samples are prone to throwing up spurious findings, yet they are sadly common in biomarker development studies because of the high cost of the molecular analyses required. Second, the definition of Alzheimer’s was unlike anything used in clinical care: the researchers created a composite statistical measure from the results of ten different cognitive function tests. Third, the analysis was restricted to extreme groups: participants who showed clear decline and participants with high, stable cognitive function. More than half of the cohort was excluded because they showed a lesser degree of cognitive decline or impairment. While comparing extremes gives a study the maximum chance of detecting associations, it will grossly exaggerate the clinical value of the test.
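Why comparing extremes flatters a test can be seen in a toy simulation. The sketch assumes a hypothetical cohort in which a biomarker only weakly tracks a latent measure of cognitive change, and scores the same rule (‘marker below zero predicts decline’) first on everyone and then only on people whose change is more than one standard deviation from zero. Every number in it is illustrative rather than taken from the study.

```python
import random


def simulate_cohort(n, seed=0):
    """Hypothetical cohort of (cognitive change, biomarker) pairs."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        change = rng.gauss(0.0, 1.0)                  # below 0 means decline
        marker = 0.5 * change + rng.gauss(0.0, 1.0)   # marker only weakly tracks change
        cohort.append((change, marker))
    return cohort


def accuracy(people):
    """Share of people for whom 'marker below 0' matches 'cognition declined'."""
    correct = sum((marker < 0) == (change < 0) for change, marker in people)
    return correct / len(people)


cohort = simulate_cohort(10_000)
# Keep only clear decliners and clearly stable people, dropping the middle.
extremes = [(c, m) for c, m in cohort if abs(c) > 1.0]
print(f"whole cohort: {accuracy(cohort):.2f}")
print(f"extremes only ({len(extremes)} people): {accuracy(extremes):.2f}")
```

Nothing about the biomarker differs between the two calculations, yet in this set-up the extremes-only accuracy comes out at around 0.77 against around 0.65 for the whole cohort: the people the test finds hardest to classify are exactly the ones that were dropped.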
This approach to biomarker discovery is also prone to being misled by spurious associations. Tests need to separate signal from noise: the real relationships from the randomness that exists in our world. A proportion of the reported relationships between the molecules and disease will be caused by this randomness alone. That proportion is higher in smaller samples and increases with the number of relationships tested, particularly where no underlying scientific mechanism has been proposed. There is real concern that most findings in studies of this nature are false. While computational methods to reduce these risks are being developed, every classifier must undergo independent validation in a new, large study before claims of its value should be believed. Sadly, this is rarely done. A validation study was reported in the paper, but with only 30 participants it falls a long way short of providing convincing evidence.
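A short simulation shows how a classifier built from pure noise can look convincing until it is independently validated. It assumes, hypothetically, 2,000 molecules with no relationship at all to disease, measured in groups of 18 cases and 53 controls (the study’s development group sizes). It keeps the ten molecules with the largest Welch t-statistics, combines them into a simple signed-sum score with a midpoint cut-off, and then scores a fresh set of simulated participants. All data are simulated; this illustrates the statistical point, not the study’s own analysis.

```python
import random
import statistics

rng = random.Random(42)
N_MOLECULES, K = 2000, 10     # hypothetical: thousands measured, ten kept
N_CASES, N_CONTROLS = 18, 53  # the study's development group sizes; data are pure noise


def noise_people(n):
    """Simulated participants whose molecule levels are unrelated to disease."""
    return [[rng.gauss(0.0, 1.0) for _ in range(N_MOLECULES)] for _ in range(n)]


def welch_t(a, b):
    """Welch two-sample t-statistic (a minus b)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


cases, controls = noise_people(N_CASES), noise_people(N_CONTROLS)
t_stats = [welch_t([p[j] for p in cases], [p[j] for p in controls])
           for j in range(N_MOLECULES)]
# Molecules passing a nominal two-sided 5% screen (normal approximation) by chance.
n_nominal = sum(abs(t) > 1.96 for t in t_stats)
# Keep the K molecules that look most discriminating.
selected = sorted(range(N_MOLECULES), key=lambda j: abs(t_stats[j]), reverse=True)[:K]


def score(person):
    # Signed sum of the selected molecules, oriented so cases tend to score higher.
    return sum(person[j] if t_stats[j] > 0 else -person[j] for j in selected)


threshold = (statistics.mean(map(score, cases)) + statistics.mean(map(score, controls))) / 2


def balanced_accuracy(cases, controls):
    """Average of sensitivity and specificity at the fixed threshold."""
    sens = sum(score(p) > threshold for p in cases) / len(cases)
    spec = sum(score(p) <= threshold for p in controls) / len(controls)
    return (sens + spec) / 2


train_acc = balanced_accuracy(cases, controls)
# Independent validation: brand-new simulated participants, never used for selection.
valid_acc = balanced_accuracy(noise_people(200), noise_people(200))
print(f"nominally 'significant' molecules: {n_nominal} of {N_MOLECULES}")
print(f"balanced accuracy: development {train_acc:.2f}, validation {valid_acc:.2f}")
```

Because every molecule is noise, roughly one in twenty passes the nominal screen by chance, and the ten ‘best’ ones give a development accuracy well above chance; on the independent validation set the same classifier performs at about chance, with balanced accuracy near 0.5.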
The Alzheimer’s study may have discovered something of immense value, but sadly the evidence provided so far is inadequate to tell.
Professor Jon Deeks
Professor of Biostatistics and Director of the Birmingham Clinical Trials Unit. Professor Deeks runs a test and biomarker evaluation research group at the University.