Big data and official statistics

Will government statisticians act to take advantage of this potentially important new source? I hope so, but fear they will not.

Yesterday I went to an event on Big Data and Official Statistics. The event was hosted by the RSS and sponsored/supported by ODI and Data Pop Alliance.

The afternoon was spent dealing with three questions. My thoughts on each are briefly summarised below.

How would you define big data?

Big data means data sets with sizes beyond the ability of commonly used software tools and commonly held skills to capture, curate, manage and process. Big data, and thus the statistics derived from it, is mostly created as a by-product of processes that exist for non-statistical purposes.

What are the opportunities for official statistics from big data?

It seems that all sorts – large corporates, scientists and small start-ups – do big data, and they must do it for a reason. Part of their rationale is speculative, as no one can know where the “big data” fashion will end, but they do it because it is, or is expected to be, enabling, informative and good for decision-making.

The same should be true for official statistics too – we don’t know where it will lead, but there is scope for new, more, and more accurate data, produced in a more timely fashion, and probably more cheaply (eventually). For national statistical institutes (NSIs) to ignore the opportunity could be to sign their own death warrant. Others will encroach on their space, leaving the government statisticians with less and less to do. They will maintain their monopoly over official GDP and some other stats, but failing to engage and lead the big data owners to a place where the data can be a coherent part of the whole (as opposed to unused or used in a detached way) would damage the core of the nation’s evidence base for society.

Response rates are falling and surveys are failing, and many of the groups of particular interest do not respond, or do not do so in great enough numbers or with adequate accuracy, so action is needed to maintain and enhance data quality. The economy and society are no longer those of simple post-war definitions, when a small army of clipboard data collectors and a big ledger delivered the nation’s numbers.

Government big data is not really big in the way that big data in science is, so it’s not an impossible task. The 3Vs – volume, velocity and variety – are relatively easy to deal with in social and economic statistics.

Statisticians can help to set the agenda and contextualise if they get their skates on. Or they will be left to wither. Change will happen without them but they can improve the end product if they engage.

An open mind and some experimentation are needed. Some resulting data will be good and useful, some won’t. And the ultimate use is unknowable. Who would have known, if asked 20 years ago, where the internet would take us?

What are the challenges / pitfalls that ‘big data’ approaches to official statistics face?

Official statisticians and many academics will have to let old methods and old definitions go. It’s time to adopt a nuclear approach rather than stay with bows and arrows.

There’s no gain in inaction while hiding behind EU requirements for dated statistics or overinflated concerns about privacy. Large private companies already handle security, so the state can handle it, and ethics, too.

Most of the issues with big data are really just the same ones that exist with small data – what do the figures really mean, when is a trend a trend, etc?

I suspect that one major shift required in the minds and expectations of users is accepting that big data is better for “nowcasting” and probably worse for long-term histories, as data definitions will change and statisticians will have little control over that. How do statisticians handle that?


