Contact Tracing for COVID-19 Using The DBSCAN Clustering Algorithm
Just last week, I was in a Zoom call and the discussion went into Nth-order thinking.
Essentially, it means trying to predict what might happen in the future based on current events by repeatedly asking "so what?"
Here was the thought exercise.
Pretend that it’s March of 2020, you hear about the first COVID-19 case in your country. What would you do? Where would you invest?
Here was the thought process.
- SARS-CoV-2 comes into action. So, what?
- People are scared and the news is going to talk about it. So, what?
- Out of this fear, people start stocking up on food, toilet paper and all supplies possible in case there’s production and supply chain issues. So, what?
- That means the markets for hand sanitizers and masks are going to rise. I’d invest in that.
- People are inside because of a lack of research and the companies say not to come into the office. So, what?
- There’s a rise in all things virtual. Computers, web calling apps — Zoom, Webex, Google Meet, Microsoft Teams, etc.
The list goes on and on. In fact the group on the Zoom call ended up predicting inflation.
However, going back to the investing part, anything that could be changed from physical to virtual was considered a good investment. Web calling apps were one; even healthcare became virtual, and we saw the rise of telemedicine.
However, even before that, we saw companies like Apple and Google get involved to prevent the spread virtually, by using contact tracing.
Getting notified if you were exposed to someone with COVID-19. Scary but needed.
The Basics of Contact Tracing
Contact tracing is a tool used by local authorities to identify people who could’ve been infected by COVID-19 through contact transmission.
In short, when a person is detected as positive, we need to identify other people who might’ve been infected. To do so, authorities will have to follow the activity of patients diagnosed in the last 14 days, known as contact tracing.
This search for contacts is carried out manually or is automated through numerical methods. These automated methods rely on GPS data and machine learning (ML).
On the ML side, we use DBSCAN, a density-based clustering algorithm.
COVID-19 disrupted the lives of many, much like other pandemics. While the mortality rate of this virus is lower than that of comparable viruses, carriers often show minimal symptoms and start spreading the virus before they ever test positive. Hence the need for contact tracing: finding the people who interacted with the carrier.
There are multiple ways to prevent the spread of the virus, one of which is to stop people from infecting other people, and, we can do this through contact tracing.
There are two ways to do this, manually or digitally.
- In manual contact tracing, health authorities interview the infected individual and try to nail down everyone who could’ve been in contact with the carrier for 5–10 minutes or more in the last 2–3 weeks. The environment (indoors or outdoors), the proximity of the two individuals and the duration of contact are then used to compute a risk score for each person who was in contact with the carrier.
But this approach comes with its issues: the manual process is too lengthy, infeasible for millions of infected people, and fairly inaccurate. These limitations led to digital contact tracing, which is based on the networks of contacts linking people together.
- We link these people together to form a cluster and can then identify the infected individuals. There are many different types of ML clustering techniques. In this article, we’re specifically going to focus on Density-Based Clustering.
Density-based clustering is an unsupervised learning technique (the data has little to no labels). Its main premise is that a cluster is a region of high point density, here determined by the number of points with similar latitudes and longitudes (the same place) and nearly identical timestamps. High-density regions are separated from low-density ones, and the stray points in between are treated as noise and outliers.
Regardless of the type of clustering, these apps rely on the BlueTrace protocol. A central server generates a temporary ID for each user, which nearby devices exchange over Bluetooth. These IDs can then be recorded and used for clustering.
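As a toy illustration of that idea, a server could derive short-lived IDs from a secret key and a time window. Everything below (the key, the 15-minute window, the token format) is an illustrative assumption, not the actual BlueTrace specification:

```python
import hashlib
import hmac
import time

SERVER_KEY = b"hypothetical-server-secret"  # assumption: a server-side secret key

def make_temp_id(user_id: str, window_s: int = 900, now=None) -> str:
    """Derive a temporary ID that changes every 15-minute window."""
    epoch = int(now if now is not None else time.time()) // window_s
    msg = f"{user_id}:{epoch}".encode()
    # HMAC means only the server can map a token back to the real identity
    return hmac.new(SERVER_KEY, msg, hashlib.sha256).hexdigest()[:16]

print(make_temp_id("Alice"))  # a 16-hex-character token, stable within one window
```

Devices would broadcast these tokens instead of names, and the server could later resolve which people shared a time and place.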
The density based algorithm which can identify clusters in large databases with the presence of outliers and noise is DBSCAN.
DBSCAN requires only one essential input parameter, the neighborhood radius eps, and is up to 100x more efficient than the CLARANS algorithm (a partitioning-based algorithm) when creating arbitrarily shaped clusters.
In fact, when talking about the correlation to COVID-19, with the best-case scenario of 100% coverage and a testing delay of 0 days, the effective reproduction number (the number of people one carrier can infect) drops from 1.2 to 0.8 when contact tracing is added.
The Data
Now, it’s difficult to get access to real GPS data. However, through a JSON mock data generator, we’re able to create a dataset containing the geographical location data of different individuals, specifically latitude and longitude.
The generated dataset contains:
- 100 total location tags of 10 different people
- 12 UNIX timestamps with a gap of an hour to mimic the real life movement of people.
- Each grouping of the dataset has a unique identifier, the id, represented by the individual’s name to differentiate the records of different people.
This is all formed as a JSON object for each id.
Pre-Processing
Population Density
The first part of pre-processing is understanding the logic behind using location tags through population density. The important thing to remember is that COVID-19 is airborne, which means the closer the distance between people, the higher chances of spreading the virus.
So, the population density is vital to understanding the chance of transmission. For example, if the location tags of the individuals are far apart, or greater than a certain minimum transmission distance, then there’s a low chance of spreading through close contact (propagation of contagion doesn’t take place).
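That distance check can be sketched with the haversine formula, the same great-circle distance the model’s metric later relies on; the coordinates below are made up for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in metres."""
    R = 6371000  # mean Earth radius in metres
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

SIX_FEET_M = 1.8288  # 6 ft in metres

# Two points roughly a metre apart: close contact is possible
print(haversine_m(40.7128, -74.0060, 40.712809, -74.0060) < SIX_FEET_M)  # True
# Two points several hundred metres apart: no contact transmission
print(haversine_m(40.7128, -74.0060, 40.72, -74.0060) < SIX_FEET_M)  # False
```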
Time constraint
The second pre-processing step is the time constraint. The two people should be within 6 feet of each other, but the contact should also take place around the same time; otherwise transmission isn’t in question. So, we need to ensure that the dataset has a balanced number of ids with similar timestamps.
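A minimal sketch of such a time filter, with hypothetical pings and an illustrative 10-minute window:

```python
import pandas as pd

# Hypothetical pings: the same place only matters if the times overlap
df = pd.DataFrame({
    "id": ["Alice", "Bob", "Carol"],
    "timestamp": pd.to_datetime([
        "2020-03-01 10:00",
        "2020-03-01 10:04",
        "2020-03-01 15:00",
    ]),
})

WINDOW = pd.Timedelta(minutes=10)
ref = df.loc[df["id"] == "Alice", "timestamp"].iloc[0]

# Keep only records within the window around Alice's ping
concurrent = df[(df["timestamp"] - ref).abs() <= WINDOW]
print(concurrent["id"].tolist())  # ['Alice', 'Bob'] -- Carol is hours away
```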
Instances Per Record
The number of records each person holds must be more than one so we can understand and display the movement of a carrier between multiple locations. This helps mimic real-life movement.
In this dataset, each id has 10 instances which each have different timestamps and location tags (varying latitude and longitude values).
Building the Model
The basis of any contact tracing model is that different data elements can form groups with similar data elements through clusters.
All clusters share some type of similar characteristic among themselves. This data analysis technique helps identify different patterns and characteristics in data and group them to form new insights.
However, using fewer clusters means less precision, resulting in data loss and loss of finer detail; it amounts to mass simplification and generalization. Making the clusters more specific therefore allows for richer insights and decisions.
Clustering is an unsupervised learning technique in ML which means the data doesn’t need to be labeled (contrary to supervised learning). Overall, the model will identify patterns on its own and cluster them accordingly.
However, clustering must follow specific criteria.
- Homogeneity: all instances and ids in the same cluster are similar to each other.
- Inter-cluster heterogeneity: each cluster varies from other clusters based on its characteristics.
Understanding the Model
We know that this is a density based model approach, however in a mathematical context, the overall density of the data points is the sum of all density functions of every object.
These clusters are formed by the local maxima of the summed density function, known as density attractors. But the influence of the density attractor depends on the cluster.
Again, the algorithm that forms clusters based on the density of a region is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
DBSCAN doesn’t require the number of clusters as a parameter; instead, it infers the number of clusters from the data, and the generated clusters can take different shapes, including spherical and non-linearly separable clusters.
Aside from the DBSCAN model with density based clustering, we also have partitioning based clustering which relies on K-means and CLARANS (Clustering Large Applications based on Randomized Search). As mentioned previously, DBSCAN has almost 100x greater efficiency than CLARANS, however, the base of their mathematical analyses are similar.
In partitioning-based clustering, the clusters are built by algorithms such as K-means.
- The K-means parameters are chosen by the user depending on data characteristics and model specification. The K-means selects a k number of centroids (center of a cluster) and will constantly readjust the position of the centroid.
- The readjusting repeats until there is only a minute change in the positions of the centroids, i.e., until convergence.
- Finally, the distance metric is chosen based on the data, such as the Euclidean distance, Manhattan distance (distance measured along axes at right angles), Hamming distance (the number of positions at which two equal-length strings differ), etc.
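As a quick sketch of the procedure above, here is scikit-learn’s K-means on toy 2-D data (not the GPS dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# k = 2 centroids, repeatedly readjusted until they stop moving
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # each point's assigned cluster
print(km.cluster_centers_)  # the two final centroid positions
```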
Mathematical Analysis for Partitioning-Based Clustering
Partitioning-based clustering with K-means is based on one primary equation: the total cost J = J1 + J2 + … + Jk, where Ji is the cost function within group i, i.e., the sum of distances from each point in group i to that group’s centroid.
The cost function can be based on any proposed distance (Euclidean distance, Manhattan distance, Hamming distance).
As mentioned, the K-means algorithm requires k as a parameter which represents the number of clusters. To determine the optimal value of k, we have the Elbow Method and the Silhouette Method.
The elbow method is testing for every possible value of k and solving for the cost function for each k value to then plot its graph.
Ultimately, the graph begins with a high cost value and slowly flattens. The cluster number (k) is chosen at the point where the curve flattens most sharply, known as the elbow of the graph.
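The steps above can be sketched on synthetic data with three obvious blobs; the cost (scikit-learn calls it inertia) drops sharply up to k = 3 and then flattens:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs of 30 points each
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# The big drops happen before k = 3; after that the curve flattens (the elbow)
print([round(v, 1) for v in inertias])
```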
The silhouette method on the other hand assumes that the data points have already been grouped into k number of clusters and for each datapoint, we identify the following parameters.
- G(i) → the cluster assigned to data point i
- | G(i) | → the number of data points assigned to that cluster
- a(i) → the average distance from point i to all other points in its own cluster (how well the point fits its assigned cluster)
- b(i) → the smallest average distance from point i to the points of any other cluster (i.e., its nearest neighbouring cluster)
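These quantities combine into the silhouette coefficient s(i) = (b(i) - a(i)) / max(a(i), b(i)), averaged over all points; scikit-learn computes the average directly, and the best k is the one that maximizes it. A sketch on the same kind of three-blob data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Average silhouette for each candidate k; the true structure (3 blobs) wins
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```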
More About DBSCAN
Overall, K-means and DBSCAN generate different numbers of clusters. For this use case, DBSCAN produces 29 clusters (cluster-0 to cluster-28).
DBSCAN operates the way it does because of a few key parameters unique to it.
- eps → the maximum distance between two data points for them to be considered “of danger” to each other. In this case, it’s the minimum safe distance determined by health authorities to prevent the transmission of COVID-19.
- min_samples → the minimum number of points required to form a dense region (a cluster).
- metric → the distance metric used to evaluate the similarity between data points. This is necessary to evaluate the distance between locations on Earth.
After using these metrics to create the clusters, tracing the infected person is quite simple.
We run under the assumption that if any person, ‘person_x’, is infected, we will find the cluster with ‘person_x’ through their ‘id’ and append ‘person_x’ with other people in that cluster into an infected list.
Each unique name from the list is cross checked with ‘person_x’ until no more infected people are found.
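That cross-check loop can be sketched as a breadth-first expansion over cluster membership. The data structures and function name below are illustrative, separate from the article’s actual get_infected_names shown later:

```python
def trace_contacts(start_name, clusters_of, members_of):
    """Expand the infected list until no new names appear.

    clusters_of: dict mapping a name to the set of clusters it appears in
    members_of:  dict mapping a cluster label to the set of names in it
    """
    infected = {start_name}
    frontier = [start_name]
    while frontier:
        person = frontier.pop()
        for cluster in clusters_of.get(person, ()):
            for other in members_of.get(cluster, ()):
                if other not in infected:
                    infected.add(other)
                    frontier.append(other)  # cross-check this person too
    return sorted(infected - {start_name})

# Toy data: Alice shares cluster 0 with Bob; Bob shares cluster 1 with Carol
clusters_of = {"Alice": {0}, "Bob": {0, 1}, "Carol": {1}}
members_of = {0: {"Alice", "Bob"}, 1: {"Bob", "Carol"}}
print(trace_contacts("Alice", clusters_of, members_of))  # ['Bob', 'Carol']
```

Note that Carol is flagged even though she never met Alice directly; the cross-checking catches second-order contacts.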
Now, in the end, many clustering methods return similar output; however, DBSCAN was chosen over K-means and other methods because of its ability to learn arbitrary shapes across large datasets.
In this graph, we can see that DBSCAN (blue line hidden behind the light green plot — Scipy K-Means) is one of the fastest working algorithms for the largest number of data points.
Now that it’s determined that DBSCAN is one of the best approaches to move forward with, the algorithm was tested.
Using DBSCAN for COVID-19 Contact Tracing
You can find the dataset here and the full Github repository here.
The first step was to import all of the libraries.
Pandas is used for data manipulation. Seaborn and Matplotlib are used for plotting and data visualization, and we also import the DBSCAN clustering model.
#importing all necessary modules
import numpy as np
import pandas as pd #used for data manipulation
import matplotlib.pyplot as plt #used for data visualization
import seaborn as sns #used for data visualization
%matplotlib inline
import datetime as dt
from sklearn.cluster import DBSCAN #used for machine learning
Then we imported the data and added it to a DataFrame using Pandas. The training dataset is a time series in which each id has a timestamp, latitude and longitude. It’s also vital that the latitudes and longitudes are floats rather than strings; otherwise they won’t be parsed correctly.
#reading the json file and adding it as the DataFrame
df = pd.read_json("train_dataset.json")
We then print all of the major columns and information from the DataFrame. In this case, we can see that the latitude and longitude are in float64 values and each id has its corresponding timestamp and geographical location. Overall, each object has its datetime64 and two float64.
#printing necessary information about the DataFrame
df.info()
To get the statistics of each id, which showcase the central tendency, dispersion and shape of the dataset/cluster, we use df.describe().
#returning the necessary indices/description of the DataFrame
df.describe()
To ensure each object has the correct corresponding information, we return the first 5 rows of the DataFrame. In this case the output showcases 5 different instances for each id with the right timestamp (each 1 hour apart) with the latitude and longitude.
#returning the first 5 rows from the DataFrame
df.head()
df.head() showed the first few records; to list every person, we call the unique function, which returns the array of all possible ids (names) in the dataset. Again, we have 10.
#printing all of the possible/unique names from the id column
df['id'].unique()
Exploratory Data Analysis
This next section of the code enters the Exploratory Data Analysis which helps summarize the data to extract key insights.
We begin by displaying the dataset as a seaborn scatterplot to show the different instances for each id.
#displaying the seaborn scatterplot by showing the ids(names) with their respective latitudes and longitude value across the x and y axis
plt.figure(figsize=(8,6))
sns.scatterplot(x='latitude', y='longitude', data=df, hue='id')
plt.legend(bbox_to_anchor= [1, 0.8])
For a more in-depth EDA, we display the data in different plots. The first is a jointplot. There are many kinds of joint plots, but since kind='kde', we get a kernel density estimate plot.
#same dataset but instead of a scatterplot, it is analyzed with a jointplot
sns.jointplot(x='latitude', y='longitude', data=df, color='red', kind='kde')
Next, rather than a jointplot, we visualize the distribution of latitudes for each id as a boxplot. To read it: the horizontal line in the center of each box is the median for that id. The bottom of the box is the lower quartile, below which 25% of the data falls, and the top is the upper quartile, below which 75% of the data falls.
The lines extending beyond the box are the whiskers. They represent the minimum and maximum values the data can reach. Finally, any external dots represent outliers. In this case, Alice is the only id with outliers.
#visualizing the data in boxplots with id (x-axis) and latitude (y-axis)
sns.boxplot(x= 'id', y= 'latitude', data = df, palette = 'coolwarm')
plt.tight_layout()
This second version represents the boxplot ids in relation to the longitude.
#visualizing the data in boxplots with id (x-axis) and longitude (y-axis)
sns.boxplot(x='id', y='longitude', data=df, palette='coolwarm')
plt.tight_layout()
Now, we reach building the model. In these next blocks of code we bring the model into the program, make clusters through DBSCAN and identify the infected ids based on the clusters.
Defining the Model
We begin with defining the model using the DBSCAN algorithm.
#creating the clustering model, and identifying the infections by filtering through the data in the clusters,
#defining the model using the DBSCAN algorithm.
epsilon = 0.0018288 #distance of 6ft in km
model = DBSCAN(eps = epsilon, min_samples = 2, metric = "haversine").fit(df[['latitude', 'longitude']])
df['cluster'] = model.labels_.tolist()
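One caveat worth flagging as a possible refinement (not part of the original code): scikit-learn’s haversine metric expects coordinates in radians and measures distance on a unit sphere, so a stricter version would convert both the coordinates and eps. The coordinates below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Toy coordinates: two points about half a metre apart, one far away
df = pd.DataFrame({"latitude":  [40.712800, 40.712805, 40.800000],
                   "longitude": [-74.0060, -74.0060, -74.0060]})

EARTH_RADIUS_M = 6371000
eps_rad = 1.8288 / EARTH_RADIUS_M  # 6 ft expressed as an angle on the unit sphere

coords = np.radians(df[["latitude", "longitude"]].to_numpy())
model = DBSCAN(eps=eps_rad, min_samples=2, metric="haversine").fit(coords)
print(model.labels_)  # the two nearby points share a cluster; the third is noise (-1)
```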
We then create a scatter plot, similar to how we did in the EDA phase and can see the clusters. As mentioned earlier, the data created 29 clusters (cluster-0 to cluster-28).
#generating the seaborn scatterplot which created 29 clusters. Cluster-0 to cluster-28 represents data points
#with neighboring nodes and potential infected people.
labels = model.labels_
fig = plt.figure(figsize=(12,10))
sns.scatterplot(data = df, x='latitude', y='longitude', hue = ['cluster-{}'.format(x) for x in labels])
plt.legend(bbox_to_anchor = [1, 1])
However, with so many points, we realize that cluster -1 is the noise cluster, so we re-plot the graph without the noise, allowing for a cleaner scatterplot.
#Plotting the same plot without the noise cluster (cluster -1).
ids = df[(df['cluster'] == -1)].index
df.drop(ids, inplace = True)
labels = model.labels_
fig = plt.figure(figsize=(12,10))
sns.scatterplot(data=df, x='latitude', y='longitude', hue = ['cluster-{}'.format(x) for x in df['cluster']])
plt.legend(bbox_to_anchor = [1, 1])
Testing the Model
To check for people who may be infected, we call the function get_infected_names with a specific id. The model builds a list of that person’s clusters, checks which ids appear in those clusters, and returns the infected names.
#This is testing the actual model. To check for the people who might be potentially infected from the patient,
#call the function get_infected_names and input a name from the dataset as a parameter to check for neighbouring nodes.
def get_infected_names(input_name):
    df = pd.read_json("train_dataset.json")
    epsilon = 0.0018288 #6ft in km
    model = DBSCAN(eps=epsilon, min_samples=2, metric='haversine').fit(df[['latitude', 'longitude']])
    df['cluster'] = model.labels_.tolist()

    #collecting every cluster the input person appears in
    input_name_clusters = []
    for i in range(len(df)):
        if df['id'][i] == input_name:
            if df['cluster'][i] not in input_name_clusters:
                input_name_clusters.append(df['cluster'][i])

    #collecting every other person in those clusters, skipping the noise cluster (-1)
    infected_names = []
    for cluster in input_name_clusters:
        if cluster != -1:
            ids_in_cluster = df.loc[df['cluster'] == cluster, 'id']
            for i in range(len(ids_in_cluster)):
                member_id = ids_in_cluster.iloc[i]
                if (member_id not in infected_names) and (member_id != input_name):
                    infected_names.append(member_id)

    return infected_names
We call on the function and receive the infected names.
get_infected_names('Alice')
get_infected_names('David')
get_infected_names('Carol')
The Cons of Contact Tracing
By using the algorithm, health authorities can identify who is at risk of contracting COVID-19, better allocate treatment resources and prevent further spread. However, the approach comes with its problems.
Using this method could worry many people. In reality, studies suggest that people can have up to 12 close contacts a day; however, the carrier typically only transmits COVID-19 to 2–3 other people over the course of the infection.
However, one of the main problems is data. Collecting geographical locations for each person 24 times a day and then cross-checking them to form clusters is not simple. Plus, people don’t always have their phones on them. Consider a UPS delivery person who leaves their phone in the truck: there is now no record of the interaction with the household they delivered to, because the phone never came within the eps distance (6 ft) of anyone there.
However, it’s important to remember that this is a public health initiative, meaning in the end it will still prevent the spread of the disease to the widespread population.
It all comes back to remote medicine and providing healthcare services and awareness remotely.