Introduction to Machine Learning - Hackathon 2022 - Waze Challenge

About the project

Unless you live on the moon, you have probably used Waze for navigation when driving. The app’s ability to plan the best route and react to new events made it popular. But merely reacting to events isn’t enough: the ultimate goal is to predict events and solve congestion problems before they occur.

As part of The Hebrew University of Jerusalem course "Introduction to Machine Learning (67577)", the four of us participated in a hackathon, determined to answer the following questions:

What is the most likely next event given a sequence of Waze events?
What is the distribution of Waze events at a given point in time?

Table of contents

Dataset
Model Tasks
- Next Event Prediction
- Event Distribution Prediction
Getting Started
- Installation
- Run Locally
Dive In
Contributors

Dataset

The dataset holds about 18K real traffic events from the Waze application, collected between 2021-02-07 and 2022-05-24. Each row describes a single event and holds 19 features:

Feature	Description	Type	Example
id	Unique identifier for the event	int	16519
linqmap_type	describing the event family	string	JAM
linqmap_subtype	describing the event in details	string	JAM_STAND_STILL_TRAFFIC
pubDate	the report date	string	05/15/2022 09:31:17
linqmap_reportDescription	description of the event (Hebrew)	string	-
linqmap_street	the street name (Hebrew)	string	תל אביב - יפו
linqmap_nearby	interest points near the event (Hebrew)	string	שתולים
linqmap_roadType	road type code	string	6
linqmap_reportMood	user mood (as assessed by Waze)	string	0
linqmap_reportRating	report rating	int	5
linqmap_expectedBeginDate	event expected beginning	string	-
linqmap_expectedEndDate	event expected ending	string	-
linqmap_magvar	orientation with respect to the north pole	int	244
nComments	comments	string	0
linqmap_reliability	event reliability	int	9
update_date	when the event was last updated	string	1652608382312
x	x coordinate of the event	float	180774.21999999974
y	y coordinate of the event	float	661479.4800000004

*The only features that are guaranteed to be present are id, linqmap_type, x and y.

**The dates are given in POSIX time.
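The `update_date` example in the table (1652608382312) has 13 digits, which suggests milliseconds since the epoch rather than seconds. A minimal sketch of the conversion, assuming that interpretation and UTC (the helper name `posix_ms_to_datetime` is ours, not part of the project):

```python
from datetime import datetime, timezone


def posix_ms_to_datetime(ms: int) -> datetime:
    """Convert a POSIX timestamp in milliseconds to an aware UTC datetime."""
    # Assumption: the value counts milliseconds, so divide by 1000 to get seconds.
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


update_date = posix_ms_to_datetime(1652608382312)
print(update_date.isoformat())
```

The result falls on the same day as the `pubDate` example (05/15/2022), which is consistent with the millisecond reading.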


Model Tasks

As mentioned, in this hackathon we were asked to answer two independent questions. Therefore, this program runs two independent tasks:

Predict Next Event.
Predict Event Distribution.

Next Event Prediction

Given a sequence of 4 consecutive events in Tel Aviv (ordered by time), predict the next event. That is, given a sequence of 4 events $x_1,...,x_4$, predict the following features of the 5th event: (linqmap_type, linqmap_subtype, x coordinate, y coordinate).

Input & Output

The input for this problem is a dataframe with groups of 4 events in Tel Aviv, with the same structure as the training data plus a number indicating which group each event belongs to (the last column).

The output is a dataframe with a single row per group and 4 columns corresponding to the values above.

Evaluation

In this section the evaluation method is a weighted combination of the macro-F1 score for linqmap_type and linqmap_subtype and the squared L2 loss for the location: $(\hat{x} - x)^2 + (\hat{y} - y)^2$.
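The two ingredients of this metric can be sketched in plain Python. This is an illustration only: the function names are ours, and the weights used to combine the terms are not stated here, so the sketch does not combine them.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def squared_l2(x_hat: float, y_hat: float, x: float, y: float) -> float:
    """Squared Euclidean distance between predicted and true location."""
    return (x_hat - x) ** 2 + (y_hat - y) ** 2


true_types = ["JAM", "JAM", "ACCIDENT", "ROAD_CLOSED"]
pred_types = ["JAM", "ACCIDENT", "ACCIDENT", "ROAD_CLOSED"]
type_score = macro_f1(true_types, pred_types)
location_loss = squared_l2(3.0, 4.0, 0.0, 0.0)
```

Macro averaging gives every class the same weight, so rare event types count as much as common ones such as JAM.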

Event Distribution Prediction

Given a time range (start-end), predict the distribution of events across the nation. That is, for each of the following 3 time slots (8:00-10:00, 12:00-14:00, 18:00-20:00), predict the number of events of each type.

Input & Output

The input is one of the dates: 05.06.2022, 07.06.2022 and 09.06.2022.

The output is a 3 by 4 table where each row corresponds to a time slot and each column corresponds to a linqmap_type (ACCIDENT, JAM, ROAD_CLOSED, WEATHERHAZARD).

Evaluation

In this section the grading is computed by the following weighted MSE:

where $\hat{y}_{event, t}$ is the predicted number of events of a given type at time t, and $y_{event, t}$ is the actual number of events of that type at time t.

Getting Started

Our model requires Python 3.7+ to run.

Installation

Clone the repo and enter the project directory:

git clone https://github.com/OrrMatzkin/IML.Hackathon.Waze.git
cd IML.Hackathon.Waze

Install virtualenv and create an isolated Python environment (this step is not mandatory but recommended):
```
pip3 install virtualenv
virtualenv IML.Hackathon.Waze
source IML.Hackathon.Waze/bin/activate
```
The requirements.txt file lists all Python libraries that our program depends on. Install them using:
```
pip3 install -r requirements.txt
```

Run Locally

The program is set to run both tasks. It needs data to train its models, and the Next Event Prediction task also needs take_features sequences of 4 consecutive events. Therefore, the program requires 2 arguments in total to run (we already supplied real time data):

python3 main.py data/waze_data.csv data/waze_take_features.csv

While the program runs, it reports its current stage. After training the models, it runs both tasks and saves the predictions as CSV files, as defined in the Model Tasks section.


Dive In

In the next few sections we will walk you through how our program and models work.

Preprocess

Before even approaching the task, we set aside 20% of the data for a final test, since the data we were given is precious and of course limited.
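A hold-out split of this kind can be sketched as follows. Whether the 20% was chosen at random or by time is not stated, so the shuffled split, the seed, and the function name `holdout_split` are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    rows: Sequence[T], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly reserve a fraction of the rows as a final test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


train_rows, test_rows = holdout_split(list(range(100)))
print(len(train_rows), len(test_rows))
```

Because the events form a time series, a split by date (oldest rows for training, newest for testing) would be a reasonable alternative to shuffling.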

Then we examined the data to figure out which features we have and how the data is represented. We found that we have two types of dates, many features about location, and little data about the reporters themselves.

Before deciding which features to keep and which new ones to create (by computing the correlation between them), we cleaned the data in several ways, including:

Getting rid of duplicates (by id).
Filling in missing data by analyzing samples that are close in location and time.
Converting dates to date format.
Finding correlation between subtypes and location inside and outside of town.
Creating dummy variables for non-numeric features.
And many more...
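Two of the cleaning steps above, removing duplicates by id and creating dummy variables, can be sketched without any dependencies. The helper names and the toy rows are hypothetical; the project's own preprocess.py may do this differently.

```python
from typing import List


def drop_duplicate_ids(rows: List[dict]) -> List[dict]:
    """Keep only the first row seen for each id."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


def one_hot(rows: List[dict], column: str) -> List[dict]:
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


events = [
    {"id": 1, "linqmap_type": "JAM"},
    {"id": 1, "linqmap_type": "JAM"},  # duplicate id, dropped below
    {"id": 2, "linqmap_type": "ACCIDENT"},
]
clean = one_hot(drop_duplicate_ids(events), "linqmap_type")
print(clean)
```

With pandas, the same two steps are usually written with `DataFrame.drop_duplicates(subset="id")` and `pandas.get_dummies`.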

One example of what we managed to learn from the raw data is where most of the events (by type) occur geographically. By plotting the (x, y) locations of the events, we saw that most of the jams are in Tel Aviv (no surprise here).

Next Event

How Is the Data Represented?

For fitting and predicting each quartet (4 samples), we took every sequence of 4 consecutive samples and converted it into one sample. The label was the type, subtype, x coordinate and y coordinate of the event that follows. This way we can build models with samples and labels as we want.
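The quartet construction can be sketched as a sliding window over the time-ordered events. The window overlaps by 3 events here because the text says "every sequence" was used; the function name `make_windows` and the choice to flatten only these four features are assumptions for illustration.

```python
from typing import List, Tuple

FEATURES = ("linqmap_type", "linqmap_subtype", "x", "y")


def make_windows(events: List[dict], window: int = 4) -> List[Tuple[list, tuple]]:
    """Turn time-ordered events into (flattened window, next-event label) pairs."""
    samples = []
    for start in range(len(events) - window):
        context = events[start:start + window]
        target = events[start + window]
        # Concatenate the features of the 4 events into one flat sample.
        flat = [event[name] for event in context for name in FEATURES]
        label = tuple(target[name] for name in FEATURES)
        samples.append((flat, label))
    return samples


toy_events = [
    {"linqmap_type": "JAM", "linqmap_subtype": "JAM_STAND_STILL_TRAFFIC",
     "x": float(i), "y": float(i)}
    for i in range(6)
]
samples = make_windows(toy_events)
print(len(samples), len(samples[0][0]))
```

Six events yield two samples, each with 4 x 4 = 16 values, labeled by the 5th and 6th event respectively.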

How Does the Model Work?

Each sample (a quartet, from now on) goes through two models: one that predicts the type, and a second that predicts the subtype under the assumption that the sample has a specific type. With each quartet represented as a single sample and its label taken from the event that follows, we get the “simple” model. To choose a model, we split the training data into a validation set and a new training set. The new training set was used for fitting each model, and the validation set for checking how good its predictions were.

How Did We Choose the Model?

To find the best version of each family of models we used K-fold cross-validation, keeping the hyperparameters that gave the best predictions on the new training data.
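The K-fold selection loop can be sketched as below. This is a dependency-free illustration, not the project's code: `score_fn` is a hypothetical callback that fits a model with the given hyperparameter on the training indices and returns its validation score.

```python
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds


def select_hyperparameter(
    candidates: Sequence,
    score_fn: Callable[[object, List[int], List[int]], float],
    n_samples: int,
    k: int = 5,
):
    """Return the candidate with the highest mean validation score."""
    folds = k_fold_indices(n_samples, k)

    def mean_score(candidate) -> float:
        return sum(score_fn(candidate, tr, va) for tr, va in folds) / len(folds)

    return max(candidates, key=mean_score)


folds = k_fold_indices(10, 3)
# Toy scoring function (assumption): pretends that a depth of 2 is ideal.
best_depth = select_hyperparameter([1, 2, 3], lambda c, tr, va: -abs(c - 2), 10, k=3)
```

Each sample lands in exactly one validation fold, so every candidate is scored on all of the data without ever being validated on rows it was fitted on.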

The type model got its best result with ExtraTreeClassifier. The 4 subtype models (one per type) also got their best results with ExtraTreeClassifier. The 2 models for the x and y coordinates got their best results with RandomForestRegressor.

How Did We Fit the Model?

Each sample is 4 consecutive events combined into one sample. Each label is the type of the event that follows those 4 events in the dataset. For each type we trained a separate multi-class model whose labels are the subtypes of that type, so each prediction can be any one of that type's subtypes.
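The type-then-subtype arrangement can be sketched as follows. To keep the example self-contained, a trivial majority-class stand-in replaces the real classifiers; the class and function names, and the toy labels other than JAM_STAND_STILL_TRAFFIC, are illustrative assumptions.

```python
from collections import Counter, defaultdict


class MajorityClassifier:
    """Stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def fit_two_stage(X, types, subtypes, make_model=MajorityClassifier):
    """Fit one type model, plus one subtype model per event type."""
    type_model = make_model().fit(X, types)
    grouped = defaultdict(lambda: ([], []))
    for features, event_type, subtype in zip(X, types, subtypes):
        grouped[event_type][0].append(features)
        grouped[event_type][1].append(subtype)
    subtype_models = {t: make_model().fit(xs, ys) for t, (xs, ys) in grouped.items()}
    return type_model, subtype_models


def predict_two_stage(type_model, subtype_models, X):
    """Predict the type first, then ask that type's model for the subtype."""
    predictions = []
    for features, event_type in zip(X, type_model.predict(X)):
        subtype = subtype_models[event_type].predict([features])[0]
        predictions.append((event_type, subtype))
    return predictions


X = [[0], [1], [2], [3], [4]]
types = ["JAM", "JAM", "JAM", "ACCIDENT", "ACCIDENT"]
subtypes = ["JAM_HEAVY_TRAFFIC", "JAM_HEAVY_TRAFFIC", "JAM_STAND_STILL_TRAFFIC",
            "ACCIDENT_MINOR", "ACCIDENT_MINOR"]
type_model, subtype_models = fit_two_stage(X, types, subtypes)
predicted = predict_two_stage(type_model, subtype_models, [[9]])
```

Any object with `fit` and `predict` methods could be passed as `make_model`, so a scikit-learn classifier would slot in the same way. The predicted subtype always belongs to the predicted type, at the cost of inheriting any error made by the type model.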

Results

We reached a macro-F1 score of 0.62 on the training set and 0.28 on the test set (the test set was evaluated only after we finished the work).

Event Distribution

We chose to analyze the data by day of the week and by time slot within each day. We then calculated the average number of events of each type in each time slot, dividing the number of events of that type by the number of days over which the data was gathered.
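This averaging can be sketched as below, producing the 3 by 4 table described in the task. The function name, the toy events, and the way days are counted (only dates of that weekday with at least one event) are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

SLOTS = ((8, 10), (12, 14), (18, 20))
TYPES = ("ACCIDENT", "JAM", "ROAD_CLOSED", "WEATHERHAZARD")


def average_counts(events, weekday):
    """Mean events per type and time slot for one weekday (Monday=0).

    events: iterable of (datetime, linqmap_type) pairs.
    Returns a 3x4 table: rows follow SLOTS, columns follow TYPES.
    """
    counts = Counter()
    days = set()
    for when, event_type in events:
        if when.weekday() != weekday:
            continue
        days.add(when.date())  # assumption: only dates with events are counted
        for row, (start, end) in enumerate(SLOTS):
            if start <= when.hour < end:
                counts[(row, event_type)] += 1
    n_days = max(len(days), 1)
    return [[counts[(row, t)] / n_days for t in TYPES] for row in range(len(SLOTS))]


toy = [
    (datetime(2022, 5, 15, 8, 30), "JAM"),        # Sunday
    (datetime(2022, 5, 15, 9, 10), "JAM"),        # Sunday
    (datetime(2022, 5, 22, 8, 5), "JAM"),         # Sunday
    (datetime(2022, 5, 22, 12, 30), "ACCIDENT"),  # Sunday
    (datetime(2022, 5, 16, 8, 0), "JAM"),         # Monday, ignored below
]
table = average_counts(toy, weekday=6)
```

To predict for a given date, one would look up the table for that date's weekday, for example Sunday for 05.06.2022.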


Contributors

Thank you for reading and showing interest in our hackathon project... Keep in mind, we don't need any contribution to this project whatsoever.


Copyright

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
