GKB: Geodemographics Knowledge Base
 

Thank you to Census and GeoDems Group Commitee member David Harris for providing this GKB Technical Blog

The ONS API - a tool for analysts and researchers?

David Harris 

Introduction

Modern online data APIs offer the promise of automating access to online data sources and eliminating tedious manual downloads, but they are often thought to need data engineering or developer skills beyond those of a typical analyst, researcher or statistician.

This article looks at the ONS (beta) API and shows how an analyst who knows some Python programming can reap benefits from it. An added benefit from this approach is that data held in the API is held in tidy data format, a significant improvement on the quite untidy formats in which ONS data can sometimes be presented.

The main limitation at present, however, is that the ONS API is as yet a beta release that exposes very few datasets. If one’s interest lies outside the few datasets available it does not yet provide a solution, even if it indicates a promising route for future development.

Scenario

The scenario considered here is one in which an analyst might want to regularly download the most recent version of a dataset, to be used in some subsequent analysis or processing that is regularly repeated. In this context the API is of most value when used to download datasets that are frequently updated, since it then avoids frequent manual data downloads. If we can assume the data is not huge, there is little benefit in filtering on download, since that can easily be done in the subsequent processing. The aim here is to download the most recent version of a complete dataset as smoothly as possible.

Getting a list of available datasets

The ONS blog at https://digitalblog.ons.gov.uk/2021/02/15/how-to-access-data-from-the-ons-beta-api/ discusses two ways of customising data. There is an interactive online resource at https://www.ons.gov.uk/help/localstatistics  and the API endpoint at https://api.beta.ons.gov.uk/v1/datasets.

The datasets available at the two links are not the same as each other, however, so a first task is to find a list of datasets available on the API. In a browser, if one clicks on the API endpoint at https://api.beta.ons.gov.uk/v1/datasets then the browser will display what the API returns. The results depend on the browser used - Firefox presents the results in a structured form, while Chrome and Edge display a rather unstructured mess. Either way, however, the result is complex and readable only with difficulty, so a useful first task is to convert this data to an easy-to-read CSV list.

The approach below will use Python for this and other programming tasks

Python uses its requests package to retrieve data in JSON format from an API, and its json package to convert the JSON to a Python dictionary. This can be done as follows:

import requests
import json
dsets_dictionary = json.loads(
requests.get("https://api.beta.ons.gov.uk/v1/datasets").text)


The resulting dictionary still contains unwanted complexity, but it is something that can be interactively explored to gain familiarity with the data the API returns, for example in a Python notebook environment such as Jupyter, in a debugger, or by using a package such as pprint.PrettyPrinter(). The following shows some shortcuts to some of the results of this kind of exploration.

The unique identifers (which will be needed later to download data), titles and descriptions of  available datasets can be converted to a Pandas dataframe and written to a csv file with these additional lines:

import pandas as pd
dsets_dataframe = pd.DataFrame({col:[d[col] for d in dsets_dictionary['items']]
for col in ['id', 'title','description']}).\
sort_values('id')
dsets_dataframe.to_csv('datasetlist.csv',index=False)

At time of writing there are 20 datasets available in the API.

Download Example: Regional GDP

The dataset giving regional annual GDP figures, which has the ID regional-gdp-by-year, is used here as an example.

Two calls to the API are needed to retrieve the URL for the download, using the ID of the required dataset as part of the first of these calls:

dataset_id = "regional-gdp-by-year"
gdpinfo = json.loads(requests.get(
f"https://api.beta.ons.gov.uk/v1/datasets/{dataset_id}").text)
gdpversionurl = gdpinfo['links']['latest_version']['href']

gdpversioninfo = json.loads(requests.get(gdpversionurl).text)
gdpdownloadurl = gdpversioninfo['downloads']['csv']['href']

At time of writing the download URL for this dataset (gdpdownloadurl) – is https://download.beta.ons.gov.uk/downloads/datasets/regional-gdp-by-year/editions/time-series/versions/4.csv, and the data can be manually downloaded by clicking on this link in a browser. Note that this URL points to a specific version of the dataset - when a new version is released this URL is likely to change, so the URL must be retrieved each time a data download is required to be sure of getting the most recent version.

The data can be programmatically loaded to a Pandas dataframe...

gdp_dataframe = pd.read_csv(gdpdownloadurl,
storage_options = {'User-Agent': 'Mozilla/5.0'})

... or alternatively, the following code will copy the data to a local file:

r = requests.get(gdpdownloadurl)
open('gdp_annual_data.csv', 'wb').write(r.content)
Tidy Data Format

A further benefit of using the ONS API is that CSV downloads are delivered in a normalised tidy data format (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html).

The ONS blog linked above describes this as follows:

  • Each observation (or value) is a row in the dataset.
  • Each variable (also known as a dimension) is a column in the dataset.
  • All datasets in the API have a time and a geography dimension plus one or more additional dimensions

The regional GDP data has around 4,500 rows, of which a small sample looks like this:

/blog/DH table snip.JPG

For most purposes this is a huge improvement on the formats in which data can sometimes be supplied. In this format each row contains a single numerical value (the column V4_1). Values of dimensions such as region, SIC and year are explicit, separate and clear, and the format can easily be reshaped (using, for example an Excel Pivot Table or the Python pivot_table() function) into other formats. This contrasts with data supplied in other spreadsheet formats that can have a large number of columns, with each column relating to a combination of dimensions, and from which the separate dimension values can be difficult to identify programmatically. The move towards tidy data format is one which should be welcomed with a hope that ONS will adopt it more widely.

Conclusion

From this brief overview, it seems fair to say that the ONS API appears very promising as a tool for analysts as well as for data engineers, but this promise is still a long way from being fulfilled because of the small range of datasets as yet available. The use of tidy data format is particularly welcome, giving a substantial improvement in ease of use for anyone who has struggled with data such as Retail Sales or Consumer Trends in their current Excel spreadsheet formats.

We can hope that further development of the API is on ONS's roadmap, and that we will see a much wider range of datasets, particularly those that are frequently updated, available using the API in the near future.

0 comments
Gkb_promo

Geodemographics - blogs and resources

Visit the Geodemographics Knowledge Base (GKB) for expert blogs and links to useful sources of geodemographic data and knowledge.

Visit the website A white arrowA black arrow

Get the latest MRS news

Our newsletters cover the latest MRS events, policy updates and research news.