More on big data … data linkage

Here’s a bit more on the potential of big data and administrative data, in particular data linkage, in the work of national statistical agencies. I am prompted by a journal landing on the doorstep in which leading statisticians set out the case for the use of big data. It has given me a renewed sense of optimism that there could be innovative and better statistics to come, even if budgets are under threat and traditional methods are suffering.

The latest edition of Series A of the Journal of the Royal Statistical Society has an editorial entitled “Big data in social research” (by Harvey Goldstein and Natalie Shlomo) and publishes Peter Diggle’s presidential address on data science, which was delivered in June this year. The journal’s arrival follows my blog posts on what the government statistical service might get from big data and admin data, and on the differences between data science and statistics.

Perhaps the key passage of the editorial is about data linkage. It says: “When we have identifiable big data, record linkage can be carried out to enhance existing survey and other sources of data”. It adds that “linkage to administrative data is already an established practice in statistical agencies and is used for enriching statistical data … to carry out small area estimation … or to improve the quality of the data collection process.”

It refers to an excellent paper by the American Association for Public Opinion Research (AAPOR) which sets out the advantages of record linkage with big data. There is no need, it says, to keep real-time footfall or sales data separate from in-store surveys of customer actions, adding that the former should be the frame for the latter. “These (big) data become the primary monitoring tool, and surveys are utilised to conduct deeper probing based on trends, changes in trends or anomalies that are detected in the primary monitoring data.” This makes sense: merging the two to create “blended data” diminishes the weaknesses inherent in each data type.

What would a shift in this direction mean in practice for government data collection in the UK? It would mean, for example, that a respondent filling in a survey for government statistics could be asked to put one or more relevant administrative codes on the form. In the case of an overseas travel survey, this might be a passport number, which would allow the survey responses to be tagged to admin records. The travel survey might ask about the purpose of a trip or how much was spent, and that could be linked to actual dates of travel and destinations retrieved from passport scans at the border. The resulting data, the sum of two parts, would be much richer and more meaningful than either source on its own.
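To make the travel survey example concrete, here is a minimal sketch of that kind of linkage in Python. Everything in it is hypothetical: the field names (`passport_no`, `purpose`, `spend_gbp`, `destination` and so on), the sample rows and the salt are invented for illustration, and it assumes exact matching on a single identifier. Real linkage work in a statistical agency would typically have to cope with typos and missing identifiers, often through probabilistic matching, and would sit under much stricter data-protection controls.

```python
import hashlib


def pseudonymise(identifier, salt):
    """Return a salted SHA-256 digest so the raw identifier is not used as the join key."""
    normalised = identifier.strip().upper()
    return hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()


def link_records(survey_rows, admin_rows, key="passport_no", salt="demo-salt"):
    """Exact-match linkage of survey responses to admin records on one identifier.

    Returns (linked, unlinked): linked rows combine survey answers with admin
    fields and omit the identifier; unlinked rows are survey responses with no
    identifier supplied or no matching admin record.
    """
    # Index the admin records by the pseudonymised identifier.
    admin_index = {}
    for row in admin_rows:
        admin_index.setdefault(pseudonymise(row[key], salt), []).append(row)

    linked, unlinked = [], []
    for row in survey_rows:
        supplied = row.get(key)
        matches = admin_index.get(pseudonymise(supplied, salt), []) if supplied else []
        if not matches:
            unlinked.append(row)
            continue
        answers = {k: v for k, v in row.items() if k != key}
        for match in matches:
            trip = {k: v for k, v in match.items() if k != key}
            linked.append({**answers, **trip})
    return linked, unlinked


# Hypothetical survey responses: one respondent gave a passport number, one did not.
survey = [
    {"passport_no": " ab123456 ", "purpose": "holiday", "spend_gbp": 850},
    {"passport_no": None, "purpose": "business", "spend_gbp": 1200},
]

# Hypothetical border records derived from passport scans.
border = [
    {"passport_no": "AB123456", "departed": "2015-07-03",
     "returned": "2015-07-17", "destination": "ES"},
    {"passport_no": "ZZ999999", "departed": "2015-08-01",
     "returned": "2015-08-05", "destination": "FR"},
]

linked, unlinked = link_records(survey, border)
for record in linked:
    print(record)
print("Responses left unlinked:", len(unlinked))
```

The linked row carries the survey answers (purpose, spend) alongside the admin facts (dates, destination), while the respondent who gave no passport number simply stays as an ordinary survey response.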

An NHS number might be asked for in a health survey, or a National Insurance number in a survey about work. Of course not every respondent would know or give such data (many refuse to answer surveys in any case) but some would. The “small print” on the survey would be barely changed, as the use would be purely for statistical purposes. It’s hard to see much harm coming from, for example, a researcher conducting a survey for TfL asking to swipe the respondent’s Oyster card. In cases where the take-up of linkage was high, the survey could over time be reworked (and possibly shortened) to collect only the useful information that is not held in the admin records. If these techniques are being used to assess the impact of an advertising campaign, they can be used in government data collection too.

Beyond the issue of linkage, the editorial says: “There are many examples of successful applications” of statistical systems using administrative data. Some I identified include:

  • Various projects that were under way in the US Census Bureau (see a presentation by William Bostick from 2013). For example, the use of electronic transactions and administrative data to supplement or improve construction, retail and service statistics; the use of vendor data on new residential properties to aid analysis of new construction and sales; the incorporation of online public records maintained by local jurisdictions and state agencies; and the evaluation of electronic payment processing to fill data gaps such as geographical detail and revenue measures by firm size.
  • How big data is helping with understanding the mourning process and the activities of foreign-born residents. (See a presentation by Guillermina Jasso, from NYU.)
  • How mobile phone records can boost the accuracy, timeliness and detail of existing tourism statistics. (See this EU paper.)
  • Using big data to get improved small area population characteristics (in this Italian work).
  • The UNECE wiki page giving links to other projects.
  • The UN’s global working group on big data, which seems not to include the UK and which held a conference last year.

The same AAPOR paper usefully defines big data sources as social media, personal data (tracking devices), sensor data, transactional data and administrative data. It describes these as “organic” as they are by-products of processes which are not designed to collect data for statistical purposes. Some of these will be much more fruitful hunting grounds for statisticians than others, but let’s hope that the Bean review asks the government to make every effort to get behind the movement to deliver better and cheaper data.

The historical context of this shift was set out very nicely in “Official statistics and Big Data” by three Dutch experts: “Until around the 1980s, data were essentially a scarce commodity with a high price. Before the era of Big Data, information was not readily available but had to be collected for a particular purpose. Official statistical information based on survey data had a unique value: there simply was no alternative. For example, population census data, collected door to door, was immensely valuable to policy-makers, researchers and other users. In the last few decades, data collected by public administrations have become increasingly accessible for statistical purposes, stimulated in part by IT developments. Statistical data collection by means of questionnaires was supplemented and increasingly replaced by administrative data sources. Nowadays, some countries do not conduct extensive population surveys anymore but compile census statistics by combining and analysing data from several administrative sources. NSIs became more integrated in the information architecture of the government. In this way, the burden on persons and businesses to respond to questionnaires was considerably reduced.” This is clearly not describing UK practice now, but could be in ten years. It won’t be easy but the path must be taken.

