


A Representative Dataset?

NOTE: The analyses and conclusions on this page were produced in 2020. Since then, much has happened in the epidemic, and trends have changed as new data has emerged. Some conclusions on this page, though valid in 2020, may not be valid now. This is a consequence of looking for patterns in live data.

When looking for a non-biased, un-narrativised view of the status of the UK COVID19 epidemic, it can be difficult to pick a robust source of data that is representative of the real situation on the ground and can inform our decisions. Most commonly, the media uses the number of people testing positive in swab tests. This was fine at the start of the epidemic, when most people tested were those having to go to hospital and the variance in testing capacity was relatively small.

However, once testing moved away from hospitalised cases and into the community, and testing capacity began to rise week-on-week, the use of swab test data to gauge the status of the epidemic became fundamentally flawed.

There are a number of reasons for this, the most prominent being that as testing capacity increased, so did the number of people testing positive. In fact, from the start of July, we began to see a clear correlation between the number of tests performed and the number of people testing positive (see discussion on this for more detail).

So we need to look at other datasets to see if they can help us see what is going on in our local community and across the nation as a whole. Hospitalisation data has proven useful as a surrogate measure of the evolution of the epidemic, as well as a measure of the virus’ severity on the population as a whole. The main problems with using hospital data are twofold. Firstly, the natural variance in the datasets makes it difficult to spot small changes in viral prevalence. Secondly, there is a lag between changes in viral prevalence and the observed effects on hospital data. For new admissions, this lag is around 17 days; for hospital occupancy, around 24 days; and for mortality figures, around 33 days.

Thus with hospital and mortality datasets, it is easy to miss subtle changes that follow national policy changes or arise through natural causes. By the time such changes are observable in the data, the situation on the ground has already moved on, potentially in a dramatically different direction, because of the aforementioned lag.
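To make the effect of these lags concrete, here is a minimal Python sketch that back-dates an observed change to the approximate date on which prevalence itself changed. The lag values are those quoted above; the function name `prevalence_change_date` and the example date are hypothetical and purely illustrative.

```python
from datetime import date, timedelta

# Approximate lags in days, as quoted in the text above.
LAG_DAYS = {"admissions": 17, "occupancy": 24, "mortality": 33}


def prevalence_change_date(observed_on, dataset):
    """Estimate when prevalence changed, given the date on which the
    change became visible in a lagging dataset."""
    return observed_on - timedelta(days=LAG_DAYS[dataset])


# Hypothetical observation date, chosen only for illustration.
observed = date(2020, 11, 20)
for name in LAG_DAYS:
    print(name, prevalence_change_date(observed, name))
```

A change first seen in mortality figures therefore reflects the state of the epidemic roughly a month earlier, which is why these datasets are poorly suited to steering fast decisions.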

Using COVID hospital and mortality data to steer critical decision making is like trying to steer a super tanker with a 17 to 33 day delay on the steering wheel. We need a dataset that better represents the situation on the ground as it happens: a dataset that does not have such long delays.

COVID Symptom Study

We are very lucky in the UK: a team of scientists from King’s College London, in partnership with their tech spin-out company ZOE, put in place a COVID symptom tracking app, which over time has become a very powerful tool for predicting how likely someone is to be infected with nCoV19, based on their symptoms. We as a nation are now able to reap the benefits of the dataset and analysis produced by the scientific team at ZOE, to help inform and steer critical decision making.

The graph below shows data from the COVID symptom study app as it is published on their website. This dataset gives an estimation of the prevalence of nCoV19 in the UK four days ago, based on data acquired over a two week period prior to estimation:

Data from the COVID Symptom Study app seems to be a much more representative dataset than the change in the number of positive swab tests. The symptom study app data tracks well with hospital admissions and occupancy, which are representative of subsequent mortality. Furthermore, this dataset is updated daily, and from it we are able to access COVID infection level estimates for our local community and nationwide (see heatmap of UK).

The symptom tracking app creates infection level estimates from over 4 million users, who report daily on their state of wellness and, if unwell, their symptoms. The models used to process this information and make predictions for regional and national infection levels are calibrated using COVID swab test data, making this the most representative contemporary dataset available in the UK at present.

Infection Survey

The ONS publishes weekly predicted virus prevalence data, based on results from an nCoV19 infection survey run nationally in partnership with University of Oxford, University of Manchester, Public Health England and Wellcome Trust. This survey tests a randomly selected cross section of the population for nCoV19 infection, whether symptomatic or not. The results from this survey are then used to predict the current prevalence of nCoV19 infection across the UK.

The predicted prevalence data published by the ONS gives a good measure of the current nCoV19 infection situation, and tracks well with symptom study data. It has the advantage that it detects asymptomatic as well as symptomatic nCoV19 infection, in contrast to the symptom study app, which only detects symptomatic COVID. Infection survey publication usually has a seven day delay, due to the time required to swab and test volunteers, and analyse the results.

This data shows clear, consistent trends over time. The only minor problem with using this dataset is that each weekly publication shows slightly different trends from the one published the week before. This is not a problem if we just want to know the general direction and magnitude of UK nCoV19 infection. However, it becomes a problem when trying to draw conclusions on the timing of changes for comparison with other datasets, such as symptom study data.

The individual data points in these plots are the mean value of a spread of estimates, published with a 95% credible interval. To understand the significance of this, take the peak value for the 20th Nov 2020 dataset: 671,100 on 7th Nov 2020. The 95% credible interval for this date was 640,300 to 702,800, meaning the model places 95% probability on the true value lying within that range. This needs to be taken into consideration when using this dataset.

Infection & Symptom Surveys

Within the COVID symptom study app, on the app website, and in the reports that the symptom study team send daily to the UK government, it is made clear that the published data is based on results and swab tests from the two weeks prior to, and up to four days prior to, the data publication date. This means that there is only a four day delay between the data reported by the symptom study team and the current status of the UK COVID epidemic. This makes the symptom study data the most contemporary dataset examined here thus far.

If we superimpose data from the COVID symptom study app with data from the infection survey for England, timeshifting it by four days to allow for lag, we see the same trends and similar prevalence estimates over time between the two independently derived datasets. See the graph below, which shows the latest symptom study data (including new and old strain variant data), and infection survey data from the latest ONS publication and from the 13th Nov 2020 dataset.
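The superimposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each dataset is held as a mapping from date to estimate, the symptom study series is moved back by four days, and the helper names `shift_back` and `align` are hypothetical. The values are placeholders, not real data.

```python
from datetime import date, timedelta


def shift_back(series, days):
    """Return a copy of a {date: value} mapping with each date moved
    `days` earlier (assumed direction of the lag correction)."""
    return {d - timedelta(days=days): v for d, v in series.items()}


def align(a, b):
    """Pair the values of two {date: value} mappings on shared dates."""
    return {d: (a[d], b[d]) for d in sorted(a.keys() & b.keys())}


# Placeholder values, invented for illustration only.
symptom_study = {date(2020, 11, 11): 500, date(2020, 11, 12): 520}
infection_survey = {date(2020, 11, 7): 510}

shifted = shift_back(symptom_study, 4)
print(align(shifted, infection_survey))
```

Only dates present in both series survive the alignment, which matters here because the infection survey is published weekly while the symptom study data is daily.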

Making Past Comparisons

The team running the symptom tracking app have data from 29th March 2020 to present. However, they changed the predictive model used in early July 2020. As a consequence, the data they currently publish live on their website is from the 12th June 2020 onwards. This makes it difficult to compare the situation now with that of the UK epidemic peak in April 2020. The data from their previous model (29th March to 21st May 2020) is available on their website, but it may not necessarily be directly comparable with current data due to the model change.

This puts us in a tricky position: though we can see from the symptom tracking app data what is happening now and what has happened recently, we can’t directly get a handle on how it compares to what we experienced nationally in April 2020. This makes it difficult to have perspective, and to know whether our actions are proportionate, an under reaction, or an overreaction.

Getting proportionality right with respect to personal restrictions and behaviour modification may not be important for some, as it may have little effect on their lives beyond a little inconvenience. However, for others it is crucial; it can mean the difference between businesses continuing or going bankrupt, healthcare accessibility being available or reduced, the mental health, wellbeing and happiness of individuals being affected deleteriously, and sadly for some, the difference between life and death.

Past Peak?

To help us tell if symptom study app data from before 12th June 2020 is directly comparable to the current dataset, we can use the England Infection Survey dataset, which contains data from 26th April 2020. Using this data, the past symptom study app dataset can be adjusted, with the infection survey dataset serving as a useful comparison and calibration tool.

The graph below shows past and present symptom study app data plotted alongside infection survey data. The infection survey data is limited, as it doesn’t go back as far in time as the symptom study dataset. However, the data available correlates well with the past symptom study dataset:

One potential problem with the data alignment shown above becomes apparent if we compare symptom study estimates with infection survey estimates (from the 13th Nov 2020 data) at the 31st October 2020 peak. Here we can see that the infection survey prediction is around 8% higher than the symptom study prevalence prediction.

A potential reason for this is that the symptom survey only detects symptomatic nCoV19 infection, whereas the infection survey detects both symptomatic and asymptomatic nCoV19 infection. Accounting for this 8% difference when comparing infection survey data with the past symptom study dataset requires reducing the past estimated peak infection level, which stands at 2.13 million people before adjustment.

On the graph below, plotted in red is an estimation of combined symptomatic and asymptomatic nCoV19 infection, based on increasing symptom study data by 8%. If we then fit the infection survey dataset to this, we arrive at a past estimated symptomatic COVID peak prevalence of 1.86 million people.

This creates a more complete dataset that also tracks well with UK hospital admissions (on the date of writing this analysis; see below), allowing a more complete perspective when comparing to the past, whilst giving an accurate contemporary account of the level of nCoV19 infection nationwide.

A consequence of doing this recalibration is that it brings the estimated peak infection level down from 2.13 million to nearer 1.86 million. It is difficult to tell if this is an appropriate way to treat the past data. Furthermore, as discussed previously, the infection survey data plotted above represents the mean of a range of values: the value for 26th April 2020 is 294,900, with 95% credible interval lower and upper limits of 194,500 and 430,100 respectively. Using these values to calibrate the past peak gives a past peak of 1.86 million, with 95% credible interval lower and upper limits of 1.23 and 2.71 million people respectively.
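The limits quoted above are consistent with scaling the calibrated peak by the ratio of each credible interval limit to the infection survey mean. The sketch below reproduces that arithmetic. Proportional scaling is an assumption inferred from the numbers, not a documented method, and the function name `scale_interval` is hypothetical.

```python
def scale_interval(peak, survey_mean, survey_lower, survey_upper):
    """Scale a peak estimate by the relative width of the survey's
    credible interval (assumed proportional calibration)."""
    lower = peak * survey_lower / survey_mean
    upper = peak * survey_upper / survey_mean
    return lower, upper


# Figures taken from the text above (26th April 2020 survey estimate).
lower, upper = scale_interval(1.86e6, 294_900, 194_500, 430_100)
print(f"{lower / 1e6:.2f} to {upper / 1e6:.2f} million")
# prints "1.23 to 2.71 million"
```

The interval is wide because the early infection survey estimate itself was uncertain; the upper limit is more than twice the lower.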

So the caveat with using the pre 12th June 2020 symptom tracking app data in this way is that it may make things look worse, or even better, now than they actually are. The calibrated dataset expresses the current level as a fraction of the April 2020 peak, with 95% credible interval lower and upper limits; the uncalibrated dataset gives a lower fraction, because its April peak is higher.

From this data we can see the first wave of the UK epidemic peak at the start of April 2020, with nearly two million people infected. We can also see clearly the progression of current and future waves as they happen. With this dataset, it is easy to have perspective of how good or bad the epidemic is now, in comparison to the first wave of COVID.

The final, complete symptom tracking app dataset, calibrated and time-shifted by four days, is shown in the graph below:

An Alternative Fit

Here we have an alternative fit for comparing ONS infection survey data with symptom study data, assuming 20% of nCoV19 infections are asymptomatic. This adds hospital admission data to help get a handle on where the infection survey data peak should be placed.
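To see how this assumption differs from the 8% adjustment used earlier, the following Python sketch converts between the two ways of expressing asymptomatic infection. It assumes the 20% figure is a share of all infections, so that total infections equal symptomatic infections divided by 0.8; the source does not state the formula, and both function names are hypothetical.

```python
def total_from_symptomatic(symptomatic, asymptomatic_fraction):
    """Total infections, if `asymptomatic_fraction` of ALL infections
    show no symptoms (assumed interpretation of the 20% figure)."""
    if not 0 <= asymptomatic_fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return symptomatic / (1 - asymptomatic_fraction)


def implied_asymptomatic_fraction(uplift):
    """Share of all infections that is asymptomatic when the total is
    symptomatic * (1 + uplift)."""
    return uplift / (1 + uplift)


# Placeholder symptomatic count, invented for illustration only.
print(total_from_symptomatic(800_000, 0.20))
print(f"{implied_asymptomatic_fraction(0.08):.1%}")
```

Under this reading, an 8% uplift corresponds to roughly 7% of infections being asymptomatic, so the alternative fit assumes a noticeably larger asymptomatic share.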

It is too early to make any definitive conclusions about this dataset alignment yet. As time passes, and more infection survey data is published, it will become more apparent if this fit is more or less valid than that discussed above. Further discussion about this will be published here as the data evolves.

Daily Change

When looking at the UK epidemic graphically, it is easy to see whether it is increasing or decreasing in size. However, from data presented like this, it is difficult to see how quickly the epidemic is increasing or decreasing, and when or if a change will occur.

Knowing this information is very important, as it helps us decide what the situation is going to be like in the future. If the epidemic is growing or shrinking exponentially, then we know that we are going to be in a very different situation very quickly. If it is changing in size linearly, then we know that a different situation is going to arrive much more gradually. If the rate of change is slowing down exponentially, then we know that either the increase or decrease is slowing down and will likely change to its opposite soon (increase switching to decrease, and vice versa).

To get a better feel for this information, a useful measure is the gradient of the graph, and how it changes with time. The graph below shows the daily difference in the number of symptomatic cases in the UK, based on the symptom tracking app data. If the value is positive (red, pointing up) then the epidemic is growing. If the value is negative (green, pointing down), then the epidemic is shrinking.

If the daily difference is increasing in magnitude, then the epidemic is growing/shrinking exponentially. If the daily difference is the same as previous days, then it is growing/shrinking linearly. If the daily difference is decreasing in magnitude, then the growth or shrinkage is slowing down.
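The rules above can be expressed as a short Python sketch. It is a minimal illustration, not the code behind the graph: the function names are hypothetical, the case counts are invented placeholders, and the labels follow the wording of the paragraph above (strictly, a growing daily difference indicates acceleration, of which exponential growth is one case).

```python
def daily_differences(cases):
    """Day-on-day change: element i is cases[i + 1] - cases[i]."""
    return [today - yesterday for yesterday, today in zip(cases, cases[1:])]


def describe_change(prev_diff, diff):
    """Label one day's change using the wording of the text above."""
    if diff == 0:
        return "no change"
    direction = "growing" if diff > 0 else "shrinking"
    if abs(diff) > abs(prev_diff):
        pace = "exponentially"
    elif abs(diff) == abs(prev_diff):
        pace = "linearly"
    else:
        pace = "but slowing down"
    return f"{direction} {pace}"


# Placeholder counts, invented for illustration only (not real data).
cases = [100, 110, 130, 150, 160, 150, 130]
diffs = daily_differences(cases)
for prev_diff, diff in zip(diffs, diffs[1:]):
    print(diff, describe_change(prev_diff, diff))
```

In practice, real daily data is noisy, so an exact equality test for the linear case would rarely trigger; a tolerance or smoothing step would be needed.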

Growth Rate

To further aid our interpretation of the growth and shrinkage of the UK COVID epidemic, the graph below shows a useful visualisation: the ratio of one day's gradient to that of the day before.

On this graph, a value over +1 means the epidemic is growing exponentially. A value of +1 means it is growing linearly, and a value below +1 but above 0 means the increase is slowing down exponentially. A value of -1 means the epidemic is shrinking linearly, and a value below -1 (e.g. -1.5) means it is shrinking exponentially. A value between 0 and -1 means the shrinking is slowing down.
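The exact formula behind this graph is not given here. One definition that reproduces the interpretation above is the ratio of the magnitudes of consecutive daily changes, carrying the sign of the most recent change. The Python sketch below implements that assumed definition; `growth_ratio` is a hypothetical name.

```python
def growth_ratio(prev_diff, diff):
    """Signed ratio of today's daily change to yesterday's.

    Assumed definition: magnitude is |diff| / |prev_diff|, and the sign
    follows today's change (positive = growing, negative = shrinking).
    Returns None when yesterday's change was zero (ratio undefined).
    """
    if prev_diff == 0:
        return None
    ratio = abs(diff) / abs(prev_diff)
    return ratio if diff >= 0 else -ratio


print(growth_ratio(10, 20))    # growth speeding up
print(growth_ratio(20, 10))    # growth slowing down
print(growth_ratio(-10, -15))  # shrinkage speeding up
```

A plain quotient of the two gradients would give +1.5 in the last case, since both are negative, which is why the sign is attached separately in this sketch.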

Averaged Growth Rates

To help us ‘predict’ what the future will be like, it is good to see how the epidemic has grown or shrunk over a longer period of time, which averages out short-term anomalies. The graph below shows the 7 day (top graph) and 14 day (bottom graph) rolling averages for the growth ratio:
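A rolling average of this kind can be sketched as follows. This assumes a trailing window (each value averages the current day and the days before it); whether the graphs use a trailing or centred window is not stated, and `rolling_mean` is a hypothetical helper.

```python
def rolling_mean(values, window):
    """Trailing rolling mean; positions without a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out


# Placeholder growth ratios, invented for illustration only.
ratios = [1, 2, 3, 4, 5, 6, 7, 8]
print(rolling_mean(ratios, 7))
```

A longer window smooths out more of the day-to-day noise, at the cost of reacting more slowly to a genuine change in trend, which is the trade-off between the 7 day and 14 day graphs.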


All data presented in the charts and analysis on this website come exclusively from external sources. The data is downloaded and updated live using APIs as you peruse this site. This datahub is intended as a convenient portal to access the data, and see a variety of different analytical representations. It is not intended as a primary source, and as such we make no warranties or claims to the authenticity or accuracy of the data presented here; you should access directly the source data for critical decision making.