GPS trajectory clustering with Python

How to solve the common problems encountered in this field

Steve Cao
Towards Data Science

--

The rapid growth of mobile devices has led to vast amounts of GPS trajectories collected by location-based services, geo-social networks, transport, or ride-sharing apps.

Clustering of GPS trajectories (Trips) (Image by author)

GPS trajectory clustering is being increasingly used in many applications. For example, it can help to identify the frequent routes or trips. The trajectory similarity can be used to find whether the trajectory follows a certain route. It can then be used for tracking transport services, e.g. public buses in a city.

In this article, I will provide a gentle introduction to fast GPS trajectory clustering. Python open source libraries are used in this solution. Typically, there may be several issues encountered during clustering trajectories.

  • GPS trajectories often contain many data points. Clustering can take a long time.
  • Suitable similarity measures are needed to compare different trajectories.
  • Clustering approach and configuration
  • Visualizing trajectories and their clusters

There are a number of interesting articles on this topic [1, 2]. However, not all the above issues are addressed. In this article, I will illustrate a Python solution covering all these areas. The data used here is originally from the Vehicle Energy Dataset [3]. A process of clustering all GPS points was first applied to the original data. I then selected the trajectories for several start-end groups. For simplicity, the data provided in this article are for these groups of trajectories, as shown below:

GPS trajectories before clustering

The codes for this article and a tutorial can be found in my GitHub here.

Trajectory data points reduction

The Ramer-Douglas-Peucker (RDP) algorithm is one of the most popular methods to reduce the number of points in a polyline. The objective of this algorithm is to find a subset of points to represent the whole polyline.

The RDP function returns subset points and points indices

The parameter epsilon is set as 10 (meter) here. After RDP, the number of data points in the trajectories is reduced significantly, compared with the original data points.

traj #    data points original       after RDP
0 266 21
1 276 36
2 239 34
...

This contributes to a drastic reduction in the computation time for the distance matrix.

distance matrix without RDP: 	 61.0960 seconds
distance matrix with RDP: 0.4899 seconds

Trajectory similarity measures and distance matrix

There are various methods to measure the similarity of two trajectories, e.g. Fréchet distance, Area method, as visualized in the following figure. They can be calculated using similaritymeasures package.

Trajectory similarity measures: Fréchet distance and Area

For the examples in this article, the Fréchet distance is used to compute the distance matrix as below. The Area similarity measure is also supported.

DBSCAN clustering

Scikit-learn provides several clustering methods. In this article, I present the DBSCAN with the pre-computed matrix. Parameters are set as below:

  • eps: 1000 (m) for Fréchet distance, 300000 (m²) for Area measure.
  • min_samples:1, it is to make sure all trajectories will be clustered into a cluster.

The codes are as below.

Results and Trajectory visualization

Matplotlib is a comprehensive library for visualizations in Python. It provides flexible solutions for both static and animated visualization.

The clustering results for the six groups are visualized using subplots in Matplotlib. As shown below, most trajectory groups have several clusters. Only the last two groups have one cluster, for the reason that the similarity of these trajectories is high (the Fréchet distances are less than eps, i.e.1000).

Clustered GPS trajectories

Conclusions

As many GPS-enabled devices are used today, GPS trajectory clustering is being increasingly popular. This article walked you through the common problems in this field. The Python tools are then explored to solve these problems. For further readings, you can check these articles on polyline simplification, distance matrix calculation, and web-based spatial visualizations.

References

  1. João Paulo Figueira, Clustering Moving Object Trajectories,
  2. William Yee, A Gentle Introduction to IoT/GPS Trajectory Clustering and Geospatial Clustering
  3. Vehicle Energy Dataset (VED, A Large-scale Dataset for Vehicle Energy Consumption Research

--

--