We created this to help some of our customers in the Fleet Tracking business who have in the order of million vehicles. Goal is showcase possibility in in terms of performance for production usage.
- Minimal dependencies: The existing geo spatial libraries for spark seem out of date - so we made conscious effort to use minimum dependencies as possible (
shapelyis the only real one). - We used
Azure Databrickshere - however you may useopen source Sparkas well. - Streaming source & sink: We used
Azure Eventhubs- you may usekafkaor any other sources/sinks - We are using
Geojsonformat for both points and polygons as a best practice
Data
We need the following data for our solution:
- Points data (locations/latlong) of various entities(vehicles) from a streaming source (e.g. Kafka/Azure event hub)
- Polygon data (maps) representing the Geofences (this generally does not change often)
- Relation/mapping between the points & polygons (e.g. vehicle A may be associated with Geofence X)
Logic
- Read the incoming streaming data (Points data in geojson format from a Streaming source)
- We do a join of Points data with the Geofence data (Polygons). E.g vehicle A may be associated with Geofence X
- We apply geofence logic on the joined dataframe to check if the Point is within the Polygon - and add this new Geofence column (boolean) to the joint dataframe
- We send the result to the Sink (Eventhub/Kafka)
If the above is unclear check the Basics of Geofencing section - code is often more explanatory.
- Roughly we were able to run geofencing for 3.7 million Geojson messages in 2 mins 10 sec (throughput of 28k/sec).
- The cluster size was 32 cores (4 cores * 8 nodes).
- Data was for 1000 vehicles (and one geofence for each).
Test data generator can be customized to generate data according to your scale.
