The Bean Review of government statistics will assess what the public sector data machine needs to make it fit for purpose for the next decade or two. The regulatory framework and outputs are part of that but at the core is a question about sources: how can big data, open data and administrative data help deliver more, and new, accurate statistics in a more timely fashion and for less money? This note tries to unravel what these terms might mean for the Government Statistical Service (GSS). It concludes that there is an imperative to investigate the possibilities and that the Bean Review can ensure that the required development work is supported at the highest level in government.
Professor Sir Charles Bean has made it clear that his preliminary report (due next month) will focus on the first two elements of the terms of reference, namely to assess: “the UK’s future statistics needs in particular relating to the challenges of measuring the modern economy”, and “the effectiveness of the ONS in delivering those statistics, including the extent to which the ONS makes use of relevant data and emerging data-science techniques.”
Talk of big data, open data and administrative sources often attracts a certain amount of scoffing from “statisticians”. They say that the traditional way of doing things is just fine and the “new science” is a passing fad with no prospects. Some say that the charlatan is not even new. Beyond the GSS and the statistics profession though, much in the world of data is changing. Information Age has published its top 50 data influencers in the UK. The selection aimed to “shine a light on those transforming organisations, enhancing decision-making and driving business value through the use of data”. It’s notable that the public sector is not represented (even allowing for their need to sell tickets at £300 apiece for a gala dinner awards ceremony). In essence, there is more data around than ever before and it is being used in many more and varied ways in many walks of analytical life. The new ways need to be taken seriously in the field of government statistics and properly assessed.
In any case, the old-fashioned surveys which have been at the core of statistics for generations have falling response rates and are increasingly expensive. The fact that they are rapidly ceasing to be fit for purpose is an open secret that will sooner or later undermine the old ways. Surveys as conducted are often no longer capable of answering the questions we really want answered as attention focuses increasingly on the extremes, not just a meaningless average, in an ever more diverse society. The ONS does already use other data sources as part of its processes (it required this FoI request to find the data) so there is some basis from which it can progress.
But what are big, open and admin data in the context of government statistics, and what are the implications?
Admin data is a relatively simple concept. Across the public sector, digital records are being created as a consequence of day-to-day activity. In the digital age you contribute to this data mountain at every turn: your passport at the border, your GP’s prescription, the house purchase, a business registration, the number plate recognition camera, a pension or benefit claim, the parking ticket, and the tax return. And so on.
In theory, it has a lot going for it too. The opportunity to create time series data must be huge. Sometimes the data will have enormous richness and be available for small geographic areas. On-going production costs can be very low and the burden on respondents can be minimised. Data can be more easily linked and statistics can be updated more frequently, almost in real time where that makes sense.
Some areas of government say that they are onto this but one suspects they are doing so at a painfully slow rate. So far as I know there is no new National Statistics series that has come about from a digital data set. I used to help compile the public sector spending and borrowing figures that come monthly from the ONS and Treasury, and they have barely changed in timeliness or detail in three decades. In other areas of public sector activity – for example civil service manpower issues such as hours, staff, locations and sickness – there should be figures that show us what’s actually happening especially in an era of austerity-obsession.
The barriers to use of this data take many forms. They might be:
- Tech-related. “This system does not speak with this system” we are told.
- Data quality. Mismatches, inconsistencies and breaks in series mean, in many eyes, that there’s no point. And the thing that is measured will only be a proxy for what we really want to know.
- Desire. Even if an individual or team in a government department can see the opportunity and wants to do something, there is no incentive to change how things are done. The data might not, after all, be accurate or reliable, they think.
- Bureaucratic inertia. The senior bosses will not be prioritising something they are not asked for and in many cases don’t understand – and don’t have a budget line for. There is also more extreme bloody-mindedness. I recall being told by one ONS board member that on principle they would not give me the data I wanted as I’d only go off and make mischief or money.
- Commercial. A government department might have been collecting data from the public (ie from us and about us) at our (taxpayers’) expense and then making (often small amounts of) income from selling it. There are a few large and increasingly unaccountable trading funds that make millions from data selling – and they need to be tackled seriously by the government. But money is also an excuse used by much smaller departments. One suspects there are probably quite a few people in government whose livelihood depends on the business of charging other government departments for the use of data!
- Legal. The system at times seems to be bogged down by legal hurdles that can be used by departmental lawyers with great enthusiasm to stop their data being used for statistical purposes, despite a generally positive legal framework and a hope of more data sharing following the 2007 Statistics Act. The worries are often related to personal disclosure or confidentiality – important issues but not a catch all excuse for inaction.
Some of these barriers are easier to overcome than others. Professor Bean could make a great start by asking for a list of the national data assets. We have various records of the public sector’s physical assets, such as this one of land and buildings and this map. A government that prides itself as being one of the most open in the world cannot refuse that. Once we know what exists we can learn about the barriers to release in each case, and start making more of it available. Freeing the data assets will ultimately deliver much more on-going wealth generation to the country than the one-off sale of government physical assets.
The review could also recommend that the government sweeps through as many of the barriers noted above as possible. It should be driven by the thought that for the public sector to fail to use data that it owns to improve not only its own policy making but everyone’s understanding of our society is only a step short of a crime, especially in an era of public spending cuts.
Big data is usually defined as “a collection of data whose size or complexity is beyond the ability of typical database software tools to process and analyse”. Within the public sector, most admin data is big data and it’s hard to think of many examples of big data that would not also be described as admin data, when the definition of admin data is broad. Some records will be relatively small “big data” but most (think perhaps of the data produced by a hospital) will be complex, and well beyond what an individual could handle with Excel. This is where special skills are required, and Professor Bean must make that clear.
But the public sector could also use private sector big data. The classic example would be EPOS (point of sale) data that we contribute to on a daily basis in a supermarket, and many other retail environments. EPOS means we know exactly what product is bought for what price for, it is said, over 95% of many common purchases. Given that, it is extraordinary that we still base key personal consumption statistics on a very small sample of data collected by people with clipboards on one day a month (for example, the inflation figures and, therefore, so much of the deflation of nominal series) and diaries of household consumption. Beyond prices, there is phone usage, utility energy bills, bus and train tickets and many more. These data sets are measuring real activity not what a sample of people said they were doing.
Big data is already being looked at seriously by central banks, who in so many ways are increasingly ahead of their counterparts in national statistical institutes. Two years ago there was a special issue of the Journal of Banking Regulation dedicated to data which had this article, by the Bank of England, as its introduction. Central banking interest in big data was spurred by new regulation and data following the 2007-08 crisis but also extends to economic data. An article in the Bank’s Quarterly Bulletin of Q1 2015 updated some of the thinking.
Of course the digital records will rarely if ever be exactly the measure that the statisticians or economists would choose but such disadvantages of the “new” data sources need to be assessed. The data is also often privately owned so it is not as if the government has any right to see it as things stand. But it is there, it is being used by businesses to make decisions, and by regulators to regulate. The public sector needs to play the game too both with its data and, where possible, with others’ data too. Often businesses will share information as they know that mixing their own with other data improves the value of it. There are plenty of examples of sharing and we need a public sector that is able to initiate those discussions and deliver the resulting products.
So much for admin data and big data. Open data is more complex to define but might embrace three, partly overlapping, themes in the context of government statistics:
- Open organisationally
- Open statistics
- Open data
The GSS has written guidance on open data in a document dating from 2012. It was a good start but is dated, not very holistic in its approach and, importantly, it looks like not much has been delivered as a result of it. There have been other reports on open data, for example, the one by the Public Administration Select Committee in 2014. So, let’s hope that Bean can build on the emerging sense of opportunity and call for a fresh start from the government statisticians.
To explain more:
Being open organisationally means that the government statisticians would be open about what they are doing. How are statistics actually compiled, how many people work in various areas, how much do the stats cost, how are priorities decided? And it means being clear about strategy. ONS, being independent to a degree under the UK Statistics Authority, can do this more easily than statisticians in a ministerial department. Sadly the ONS has slipped over time and seems to offer less information about what it’s up to and the UKSA board has not seen fit to demand more from the ONS or other departments.
Specific actions that the Bean Review might propose include encouraging UKSA, the ONS and the GSS to:
– publish more guides, metadata, think pieces about data quality/innovations etc. This could be done via a revision to the Code of Practice requiring such data for all National Statistics.
– grasp the open data agenda now the Cabinet Office has (apparently) lost interest (and do it better)
– accept that the data portal, data.gov.uk, is not fit for purpose, say as much and propose change
– be wary of linking up too closely with the Administrative Data Research Network to make “open data” for communities which are not open. Academic research is important but the case for government to focus on satisfying the needs of one small group to the exclusion of non-academics needs to be debated.
– rethink FoI responses. Don’t just respond to FoI requests with data in a table, but consider starting to make that data available on a regular basis. If one person wanted it, there is every chance that many others do too.
– give due consideration to the implications and possibilities of this year’s public sector information (PSI) regulations.
Open statistics means publishing more series from the data held/collected, and making them easier to find, visualise, re-use etc. Publishing more of what is already collected is possibly the most dramatic way to enhance user perceptions. It surely also delivers the greatest bang for the buck. Often millions of pounds will have been spent on a survey or accessing an admin data set and yet only a few dozen pages of (pdf format) time series will be published. The marginal cost of producing hundreds of tables (in Excel?) or a data set that can be interrogated by users will be modest and yet so much more use could be made of the data. It is best to do this and get the credit for it rather than wait till, amid much criticism, it is pointed out that only a small fraction of what could be published is actually published. Specific actions could include:
– publish many more tables/statistics from the large surveys (eg LFS)
– publish statistics and associated material in a better way (ie on a new website)
– create interactive tools that allow series to be interrogated and visualised
Behind the statistics lies data, both individual pieces of (granular) data and what might be called data infrastructure. How much of that could be published? UKSA and the ONS/GSS should consider – and Bean should advocate – a commitment to:
– collate and publish, say, the ODUG version of the National Information Infrastructure
– ensure that future ONS work on address registers is for public use, not just for the ONS
– make huge chunks of anonymised data available for users to use how they like (via SPSS etc)
– be a champion of legislation where required to get access to public sector (and regulators, BoE and others?) admin data for statistical purposes
– keep a tally of new data/datasets so that it can be seen that progress is being made and that good is coming from it.
None of this is easy but at least Bean will force the discussion into the open. The simple point is that much can be investigated. And in time, I am confident, such investigation will lead to there being more data (certainly) that could be produced more cheaply, in a more timely fashion and be of better quality (probably). Although the sceptics and nay-sayers will be right in some respects too, it is vital to let others have time to see what could be done. The developments required do not have to be frightening. Most importantly, Bean must make clear that these considerable possibilities do not need to involve sharing personal or private information. That is desirable and will come but is for another review!