Everybody seems to be talking about Big Data, and the potential for fruitful applications within Social Science is well accepted. This was recognised in a recent successful bid to the ESRC: our Civil Society Data Partnership (CSDP) with NCVO is now in full swing, funded under the Big Data Phase 3 initiative. For full details of the CSDP and the methods it uses, please see http://tsrc-ncvo-csdp.com. However, as the newest member of the TSRC, I would like to take this opportunity to introduce the dedicated readers of this blog to the project.
The Freedom of Information Act 2000 requires all public authorities to adopt and maintain an approved publication scheme. Due to increasing calls for transparency in the public domain (most recently embodied and enacted in the Local Transparency Code 2015), it is now possible to obtain information on all Local Authority expenditure over a given threshold (£250 at the time of writing), on all grants to VCSEs, and on payments made with procurement cards. This is in addition to payments made by clinical commissioning groups (HM Treasury requires all NHS organisations to publish details of expenditure over £25,000) and the huge databases of grants made available by non-profit data providers such as 360Giving and grant makers such as the Big Lottery.
An example of one of these datasets can be seen in Figure 1.
Our task in this new project is to develop computational methods which can not only analyse all of these tens of millions of payments, but also match them with their intended recipients (with a particular interest in Third Sector organisations). This is no easy task for several reasons. The data is not only spread across thousands of different sources and data providers, but it also arrives in many different publication formats. While recommendations are made for the format of the data, these are rarely adhered to, and the data is seldom formatted in a consistent, ‘machine-readable’ way. We must also overcome the abbreviations, spelling mistakes and informal naming conventions used by accounting and administrative staff.
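To give a flavour of the name-matching problem, here is a minimal sketch in Python. It is not the project's actual algorithm: the abbreviation table, the function names `normalise` and `best_match`, the 0.85 similarity cutoff and the example organisation names are all assumptions made for illustration. It uses only the standard library's `difflib` to tolerate small spelling mistakes once names have been cleaned up.

```python
import difflib
import re
from typing import Optional

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {
    "ltd": "limited",
    "assoc": "association",
    "soc": "society",
    "cttee": "committee",
}


def normalise(name: str) -> str:
    """Lower-case a name, strip punctuation and expand known abbreviations."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    words = [ABBREVIATIONS.get(word, word) for word in name.split()]
    return " ".join(words)


def best_match(payee: str, register: list[str], cutoff: float = 0.85) -> Optional[str]:
    """Return the register entry closest to the payee name, or None.

    The cutoff is an illustrative value, not one taken from the project.
    """
    lookup = {normalise(entry): entry for entry in register}
    hits = difflib.get_close_matches(normalise(payee), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


# Invented register entries, for illustration only.
register = ["Anytown Youth Association", "Northfield Food Bank"]
print(best_match("Anytwn Youth Assoc.", register))
print(best_match("Totally Unrelated Plc", register))
```

In this sketch the first call recovers the intended organisation despite the typo and the abbreviation, while the second returns `None` because nothing in the toy register is similar enough. Real payment data would need far more careful handling of short or very common names.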
While the scripts are still very much in development, we can already account for about 85-95% of the payment entries: a promising start! Future posts on the dedicated site will aim to make a large volume of matched data available to all interested parties, while the next step in our work is to analyse the output.
One such analysis considers the distribution of grants according to NCVO’s classification of charities by ICNPO number, as seen in Figure 3.
This is done using algorithms which we intend to make publicly available in due course (for the technically minded, these are written in Python and MATLAB). With this ‘noisy’ real-world data, where the beneficiaries can be difficult to identify, it is important to allow for organisations which do not feature in the Charity Commission or Companies House registers. For this reason, we also scan a large number of other registers to look for things such as Community Amateur Sports Clubs and NHS Trusts.
An overview of how one of the matching algorithms works can be seen in Figure 2.
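The idea of scanning several registers in turn can be sketched as follows. This is a simplified illustration rather than the project's code: the register keys, their contents and the function name `find_register` are invented, and real registers would be loaded from files or APIs and matched far more loosely than the exact lookup used here.

```python
from typing import Optional

# Toy stand-ins for real registers; every entry here is invented.
REGISTERS = {
    "charity_commission": {"anytown hospice", "northfield food bank"},
    "companies_house": {"acme catering limited"},
    "community_amateur_sports_clubs": {"anytown cricket club"},
    "nhs_trusts": {"anytown nhs foundation trust"},
}


def find_register(name: str, registers: dict[str, set[str]] = REGISTERS) -> Optional[str]:
    """Return the first register containing the name, or None.

    Registers are checked in dictionary order, so earlier ones take priority.
    """
    key = " ".join(name.lower().split())  # collapse case and stray whitespace
    for register_name, entries in registers.items():
        if key in entries:
            return register_name
    return None


print(find_register("Anytown  Cricket Club"))
print(find_register("Unknown Organisation"))
```

Checking the registers in a fixed order is only one possible design. It keeps the lookup simple, but it means an organisation that appears in more than one register is attributed to whichever is listed first.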