COVID Swab Test Data
NOTE: The analyses and conclusions on this page were produced in 2020. Much has happened in the epidemic since then, and trends have changed as new data has emerged. Some conclusions on this page, though valid in 2020, may not be valid now. This is a consequence of looking for patterns in live data.
When looking for changes in the status of the UK COVID epidemic using PCR swab test positive data, we have two datasets to choose from. The government publishes on its website the number of people testing positive for nCoV19 both by the date the cases are reported and by the date the sample is taken.
The latter dataset gives a more representative picture of COVID infections over time; however, it has peaks and troughs around weekends, and so needs a rolling average to smooth it. The former dataset is slightly less undulating, but it is susceptible to variations in reporting.
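The seven-day rolling average used to smooth the weekend undulation can be sketched as below. This is a minimal illustration, not the government's own smoothing code, and the daily figures are dummy data:

```python
# A minimal sketch of a trailing seven-day rolling average, used to
# smooth the weekend dips in the 'by sample date' series.
def rolling_average(values, window=7):
    """Mean of each trailing `window`-day period; windows are shorter
    at the start of the series, where fewer days are available."""
    return [sum(values[max(0, i - window + 1):i + 1]) /
            len(values[max(0, i - window + 1):i + 1])
            for i in range(len(values))]

# Illustrative dummy daily positive counts (note the weekend dip).
daily_positives = [120, 115, 90, 60, 110, 125, 130, 128, 118, 95]
smoothed = rolling_average(daily_positives)
print([round(x, 1) for x in smoothed])
```

A trailing window is used here because, with live data, the most recent days have no future values to average against.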
Using these datasets as they are presented in time graph form below to infer the status of the COVID epidemic in the community at large is a fundamentally flawed approach. There are a number of reasons for this; see the discussion below.
Data by Date Published
Data by Swab Date
Swab Test Data Analysis
The main problem with using changes in PCR swab test positive data over time, without modification or calibration, is that it ignores external factors which can have a dramatic effect on the changes in the data over time, giving a false picture of the situation on the ground. One such factor is the change in the number of PCR swab tests carried out.
The chart below shows the total number of COVID swab tests performed in the UK on any one day, for pillars 1, 2, and 4 combined. From this, we can see a general increase in tests carried out over time:
Total Tests (Pillars 1, 2, 4)
The latest number of tests carried out is INF times that of the 31st March 2020. This means that when trying to compare the number of test positives now with those at the end of March, we should really divide today's figure by INF to give a representative comparison. This is not ideal, as it assumes the only external factor affecting the data is variability in testing capacity. Furthermore, as we can see from the 'Total Tests' graph above, the daily variability in the number of tests carried out is huge. The maximum daily number of swab tests carried out to date is 2059489, and the latest number of tests carried out is 1925. As test capacity increases, it is quite possible that the range of this variability will also increase. So we need a way to account for this variability when looking at the published data. The best way would be for the government to publish each day's data calibrated to the number of tests that day, giving a relative number of people testing positive.
Percentage Test Positives
Without access to data calibrated in this way, we can create it ourselves in a crude way, by dividing the published number of people testing positive by the number of tests published for each time point. This is not ideal, as we have no way of knowing exactly how each positive result matches up with the number of tests carried out on the day it was counted. But when we do this with both test positive datasets, we see an interesting change in the trend over time:
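The crude calibration described above amounts to a per-day division. A minimal sketch, using illustrative dummy figures rather than real published counts:

```python
# Crude 'percentage positive' calibration: divide each day's positive
# count by that day's published test count.
def percent_positive(positives, tests):
    """Positives as a percentage of tests for each day; None where
    no tests were recorded, to avoid dividing by zero."""
    return [round(100 * p / t, 2) if t else None
            for p, t in zip(positives, tests)]

# Illustrative dummy data: positives and total tests per day.
positives = [400, 420, 380, 450]
tests = [10000, 14000, 9000, 15000]
print(percent_positive(positives, tests))  # [4.0, 3.0, 4.22, 3.0]
```

Note how day two and day four show the same raw positive trend masked by very different test volumes, which is exactly the variance this calibration removes.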
Data by Date Published
Data by Swab Date
The trend we see over time is more consistent with the time trends for hospital and symptom tracking data. Though this helps remove the variance in the swab test dataset caused by changes in the number of tests carried out, it does not remove other factors from the data.
Variance Caused by Pillars
One such factor comes about because different pillars potentially return different percentage positive numbers. However, the different pillars' datasets are combined into one dataset, creating a variance in the swab test dataset that is not easily calibrated for:
In the early days of the UK coronavirus epidemic, the main way to get a swab test for the majority of people was in a hospital. This meant that a lot of the tests carried out were on people with symptoms severe enough that they needed to visit or be admitted to hospital.
The consequence of this was that the people getting tested were highly likely to receive a positive nCoV19 test result. Hospital attendance and hospitalisation were acting indirectly as a highly efficient pre-filtration process, causing a high percentage of people tested to return a positive swab test result (around 30 – 40% of daily tests performed were returning positive).
As test capacity increased and people were able to get a test without having to go to hospital, the effect of this accidental pre-filtration process was lost, causing the percentage of tests returning positive to fall dramatically.
Tests performed in hospitals are published as pillar 1 data, and tests performed in the community at drive-in and walk-in centres are published as pillar 2 data. Tests performed as part of the variety of research studies ongoing nationally are published as pillar 4 data.
Though it is a leap, looking at the data as discussed, we can feel comfortable in the assumption that pillar 1 data will generally have a higher percentage of tests returning positive, as it comes from people very likely to have COVID, and that pillar 2 & 4 data will generally have a lower percentage of tests returning positive, as they lack the pre-filtration inherent in pillar 1.
This means when looking at swab test positive data as a percentage of the number of daily tests performed, we need to account for the variance caused by the change in number of tests carried out in each pillar.
Effects of Pillar Ratio Change
We can see from the graph above how the ratio of pillar 1 to pillar 2&4 changed over time, and the effect this had on the percentage of tests returning positive.
In early April 2020, 90% of swab tests were pillar 1, with only 10% pillar 2&4. As April progressed, the number of pillar 2&4 tests dramatically increased. This occurred as test capacity was increased and testing was made more widely available by the UK government.
Around the end of April 2020, the number of tests carried out more than doubled in a matter of days, causing the test results to become dominated by community test results (pillar 2). This caused a rapid decrease in the percentage of tests returning positive, allowing the variance caused by changing test capacity to become a more pronounced cause of variance in the daily test positive count.
This established a new baseline percentage positive, with pillar 2 data dominating the swab test positive dataset.
As a consequence of this, when looking at this dataset as a percentage of daily tests performed, we can’t legitimately compare data from May 2020 onwards with data from April 2020 and before. Furthermore, the proportion of pillar 1 data has continued to decrease over time, as testing capacity is increased week on week.
The resultant effect of this is that comparing percentage test positive data from one day with another is also flawed, as with comparing uncalibrated data changes over time. However, the variance in the pillar 1:pillar 2&4 ratio is not as large (at the time of writing this analysis) as the variance in the number of daily tests, so the impact on the percentage positive is smaller. This only applies when comparing data after 5th May 2020, when the pillar 2&4 data began to dominate the dataset and the percentage of pillar 2&4 tests in the dataset began to increase in a linear manner.
Adjusting for Pillar Variance
It is possible to adjust the percentage test positive dataset to allow for the variance in the pillar 1:pillar 2&4 ratio. If we apply a scaling factor to the data from 5th May 2020 onwards, based on the percentage of tests in the total dataset which are pillar 2&4, and scale so that it fits with hospital admissions, then we get the graph above.
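The exact scaling used is not spelled out above, but one plausible reading is sketched below: multiply each day's percentage positive by that day's pillar 2&4 share of tests, then rescale the whole series by a constant `fit` factor chosen so it sits alongside hospital admissions. Both the formula and the figures here are assumptions for illustration, not the published adjustment:

```python
# Hedged sketch of a pillar adjustment: weight each day's percentage
# positive by its pillar 2&4 share of total tests, then apply a
# constant `fit` factor (assumed here; in the text it is chosen so the
# series fits hospital admissions).
def pillar_adjust(pct_positive, pillar24_tests, total_tests, fit=1.0):
    """Scale each day's figure by its pillar 2&4 share, then by `fit`."""
    return [fit * pct * (p24 / tot)
            for pct, p24, tot in zip(pct_positive, pillar24_tests, total_tests)]

pct = [3.0, 3.2, 3.5]            # illustrative percentage positives
p24 = [60000, 80000, 95000]      # illustrative pillar 2&4 test counts
tot = [100000, 110000, 120000]   # illustrative total test counts
print([round(x, 2) for x in pillar_adjust(pct, p24, tot)])
```

Under this reading, days more dominated by pillar 2&4 are boosted relative to days with a large pillar 1 share, which matches the behaviour described below.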
In this graph, the light blue is the pillar adjusted data, and the dark blue dashed line is the unadjusted dataset. The two datasets are identical up until 4th May 2020, where the pillar correction has not been applied because it disproportionately lowers this part of the dataset, which is mainly pillar 1 data.
Pillar adjusting the data from 5th May 2020, as one may expect, increases most the data points which are more dominated by pillar 2&4. The upshot of this is that data from later dates (the right hand end of the graph) increase in magnitude relative to data from earlier dates, which are slightly reduced in magnitude.
Adjusting the dataset for pillar, like adjusting for the number of tests carried out, is not ideal, but it gives a dataset with less external variance, and therefore one more representative of the status of the UK epidemic as a whole.
Comparing with Symptom Data
If we overlay the calibrated swab test datasets with data produced by the symptom study, we see a good correlation over time. This gives confidence that this approach is a more valid way to use swab test positive data, than in its uncalibrated form.
Closer examination of the overlaid datasets, which here have been scaled such that the November 2020 peaks are of the same magnitude, shows a good agreement; though ‘sample date’ data seems to track slightly better. This is probably due to the ‘sample date’ dataset not containing the variance caused by reporting date that will be inherent in the ‘date reported’ dataset.
The key difference between the two datasets is apparent in the level of undulation. Swab test data is more “noisy” (more undulating) than symptom tracking data. This makes it difficult to differentiate between natural variance in the swab test dataset and real change in response to a change in the status of the UK epidemic. This means that swab test data is useful for confirming gross, large scale changes in the epidemic over time, but can’t really be used to look for more subtle changes, which are useful for predicting future events.
Conclusions
Having spent a lot of time discussing the best way to represent and use nCoV19 test data, it is a good idea, for perspective, to compare the calibrated and uncalibrated datasets. In the graph below, shown in dark blue, is the ‘by date published’ dataset. In light blue is the same dataset calibrated to remove variance due to the number of tests carried out. The red dotted line shows the same dataset pillar adjusted, as a rolling seven day average.
Uncalibrated and Calibrated Data
If the government were to change the way it publishes test positive data, so that the different pillars are each reported separately, and each reported either calibrated to the number of daily tests performed, or with this data attached, then swab test positive data could be used to help determine the prevalence of nCoV19 in the community at large, as it would account for variance caused by these external factors.
Until the data is published in such a way that removes variance caused by external factors, then using daily swab test positive data to decide the status of the COVID epidemic both nationally and locally, and thus make critical decisions, should be avoided at all costs. It is flawed at best, with the potential to be dangerously misleading.
Further to this, as a larger percentage of nCoV19 test data is generated in a variety of different ways, such as the new high throughput fast tests, and private companies using their own tests for employees, which is then combined and published as one dataset, non-COVID related variance in the swab test dataset will become more pronounced. Daily test positive counts will continue to rise independently of the actual prevalence of nCoV19 in the community, and the dataset will become less and less representative of the current situation with the UK COVID epidemic.
Until the data is published in a more representative way then it is best to use other, more representative datasets to make critical decisions, such as that from the infection survey, symptom tracking data, and new hospital admissions.