A lot of people are trying to figure out how to separate the real news from the fake.

Often this involves vetting sources and checking facts, on the assumption that news is more trustworthy when it comes from reliable outlets and reports accurate information. Fake news detection is thereby reduced to evaluating reliability.

But reliability isn't really what fake news is about. Real news sources sometimes get the story wrong and fake news sources sometimes publish the truth.

The distinction between real news and fake news isn't about reliable and unreliable sources, or even about true and false information; it's about honesty and deception. What makes news fake is the intention of its author – whether they are reporting something they believe to be true or something they believe to be false.

But how can you measure the intention of millions of anonymous authors online?

Based on the analysis of large samples of natural language, corpus linguists have shown that the structure of language varies systematically depending on its communicative purpose. When people tell stories, they use more past tense verbs and third person pronouns. When people provide explanations, they use more nouns and prepositions. When people interact, they use more questions and interjections. The grammar of a text reflects its purpose.
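As a concrete illustration, here is a minimal sketch of how such grammatical features might be counted, assuming spaCy and its small English model are installed (`pip install spacy`, then `python -m spacy download en_core_web_sm`). The feature set and per-1,000-token normalisation are illustrative choices, not a published methodology.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def register_features(text: str) -> dict:
    """Count grammatical features associated with communicative purpose,
    normalised per 1,000 tokens. The chosen features follow the findings
    described above: past tense and third person for narration, nouns and
    prepositions for explanation, questions and interjections for interaction."""
    doc = nlp(text)
    n_tokens = max(len(doc), 1)
    counts = {
        "past_tense_verbs": sum(t.tag_ == "VBD" for t in doc),
        "third_person_pronouns": sum(
            t.pos_ == "PRON" and "3" in t.morph.get("Person") for t in doc
        ),
        "nouns": sum(t.pos_ == "NOUN" for t in doc),
        "prepositions": sum(t.pos_ == "ADP" for t in doc),
        "interjections": sum(t.pos_ == "INTJ" for t in doc),
        "questions": sum(s.text.strip().endswith("?") for s in doc.sents),
    }
    return {k: 1000 * v / n_tokens for k, v in counts.items()}

# A narrative passage scores high on past tense and third person pronouns.
print(register_features("She walked home. It was raining. Why did she leave?"))
```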

This is why the language of fake news – its structure as opposed to its content – could be the key to its detection. An author who is trying to inform the public has a very different motivation from an author who is trying to deceive. This difference in communicative purpose will have linguistic consequences. The trick is to figure out what these consequences are – to describe the style of fake news.
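To make the idea concrete, style features like those counted above could be fed into a simple classifier. The sketch below is purely hypothetical – a toy logistic regression over grammatical features using scikit-learn, with made-up numbers – and is not the method under development at Birmingham.

```python
# Hypothetical sketch: classify articles by grammatical style features
# rather than by the words themselves.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy feature dicts of the kind register_features() above would produce;
# labels are 0 for real, 1 for fake. Real training data would be needed.
X = [
    {"past_tense_verbs": 40.0, "nouns": 210.0, "questions": 0.0},
    {"past_tense_verbs": 12.0, "nouns": 150.0, "questions": 8.0},
]
y = [0, 1]

clf = make_pipeline(DictVectorizer(sparse=False), LogisticRegression())
clf.fit(X, y)
print(clf.predict([{"past_tense_verbs": 35.0, "nouns": 200.0, "questions": 1.0}]))
```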

We’re starting to work on this problem at the University of Birmingham, but the first step is to get the data.

Although fake news is everywhere, it's unclear how to sample it, especially when people disagree so vehemently about which news is real and which is fake. Numerous organisations are building databases of fake news, primarily through fact checking, but their main concern isn't producing a balanced corpus for linguistic analysis, and these collections inevitably reflect the political biases of their curators. It's also unclear how to collect a comparable sample of real news.

Sampling decisions like these are key to understanding what makes fake news distinctive. Imagine a study where the fake news consists of sensationalist articles spread through social media platforms, while the real news consists of more serious articles published by traditional outlets. A linguistic comparison of these two datasets would undoubtedly find variation in formality, but it would be wrong to conclude that this is a fundamental difference between honest and deceptive reporting: the variation could simply be an artefact of how the two samples were collected.

Our initial research suggests the solution to these problems is to build corpora that represent not just fake news but the entire online news media landscape. Only by sampling news from across formats, outlets, markets, authors, topics, and viewpoints, and by then subjecting these data to careful linguistic analysis, can we begin to understand how the language of fake news differs from that of real news.
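As a rough sketch of what balanced sampling might look like in practice, the following assumes a pandas DataFrame of candidate articles with metadata columns; the column names and stratum size are hypothetical placeholders.

```python
# Sketch: draw an equal number of articles from every combination of
# format, outlet type, and topic, so that no stratum dominates the corpus.
import pandas as pd

def balanced_sample(articles: pd.DataFrame, n_per_stratum: int = 50,
                    strata=("format", "outlet_type", "topic")) -> pd.DataFrame:
    grouped = articles.groupby(list(strata), group_keys=False)
    # Take at most n_per_stratum articles per stratum, with a fixed seed
    # so the sample is reproducible.
    return grouped.apply(
        lambda g: g.sample(min(len(g), n_per_stratum), random_state=0)
    )
```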