OrrMatzkin/IML-Hackathon-Waze

Introduction to Machine Learning - Hackathon 2022 - Waze Challenge

About the project

Unless you live on the moon you probably used Waze for navigation when driving. The app’s ability to plan the best route and react to new events made it popular. But merely reacting to events isn’t enough - the ultimate goal is to predict events and solve congestion problems before they occur.

As part of The Hebrew University of Jerusalem course "Introduction to Machine Learning (67577)", the four of us participated in a hackathon, determined to answer the following questions:

  1. What is the most likely next event given a sequence of Waze events?
  2. What is the distribution of Waze events at a given point in time?

Dataset

The dataset holds about 18K real traffic events from the Waze application, collected between 2021-02-07 and 2022-05-24. Each row describes a single event and holds 19 features:

| Feature | Description | Type | Example |
| --- | --- | --- | --- |
| id | unique identifier for the event | int | 16519 |
| linqmap_type | the event family | string | JAM |
| linqmap_subtype | the event in detail | string | JAM_STAND_STILL_TRAFFIC |
| pubDate | the report date | string | 05/15/2022 09:31:17 |
| linqmap_reportDescription | description of the event (Hebrew) | string | - |
| linqmap_street | the street name (Hebrew) | string | תל אביב - יפו |
| linqmap_nearby | interest points near the event (Hebrew) | string | שתולים |
| linqmap_roadType | road type code | string | 6 |
| linqmap_reportMood | user mood (as assessed by Waze) | string | 0 |
| linqmap_reportRating | report rating | int | 5 |
| linqmap_expectedBeginDate | expected beginning of the event | string | - |
| linqmap_expectedEndDate | expected ending of the event | string | - |
| linqmap_magvar | orientation with respect to the north pole | int | 244 |
| nComments | comments | string | 0 |
| linqmap_reliability | event reliability | int | 9 |
| update_date | when the event was last updated | string | 1652608382312 |
| x | x coordinate of the event | float | 180774.21999999974 |
| y | y coordinate of the event | float | 661479.4800000004 |

*The only features that are guaranteed to be present are id, linqmap_type, x and y.

**The dates are given in POSIX time.
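Judging only by the example values in the table, `pubDate` looks like a month/day/year string and `update_date` looks like a POSIX timestamp in milliseconds (13 digits). The sketch below parses both under those assumptions; the helper names are hypothetical and not taken from the project code, and the time zone of `pubDate` is not stated in the dataset description.

```python
from datetime import datetime, timezone


def parse_pub_date(text: str) -> datetime:
    """Parse a pubDate string such as '05/15/2022 09:31:17' (assumed month/day/year)."""
    return datetime.strptime(text, "%m/%d/%Y %H:%M:%S")


def parse_update_date(millis) -> datetime:
    """Convert an update_date value (assumed milliseconds since the epoch) to UTC."""
    return datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)


# Example values copied from the feature table above.
pub = parse_pub_date("05/15/2022 09:31:17")
upd = parse_update_date("1652608382312")
print(pub.isoformat(), upd.isoformat())
```

Under these assumptions both example values fall on the same day, 15 May 2022.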

(back to top)

Model Tasks

As mentioned, in this hackathon we were asked to answer 2 independent questions. Therefore, this program runs 2 independent tasks:

  1. Predict Next Event.
  2. Predict Event Distribution.

Next Event Prediction

Given a sequence of 4 consecutive events in Tel Aviv (ordered by time), predict the next event. That is, given 4 events $x_1, \dots, x_4$, predict the following features of the 5th event: linqmap_type, linqmap_subtype, x coordinate and y coordinate.

Input & Output

The input for this problem is a dataframe with groups of 4 events in Tel Aviv, with the same structure as the training data plus a number indicating which group each event belongs to (the last column).

The output is a dataframe with a single row per group and 4 columns corresponding to the values above.

Evaluation

In this task the evaluation is a weighted combination of the F1-macro loss for linqmap_type and linqmap_subtype and the l2 loss for the location, $(\hat{x} - x)^2 + (\hat{y} - y)^2$.
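As a rough illustration of these components, the sketch below computes the macro F1 scores with scikit-learn's `f1_score` and the squared l2 location error with NumPy. The function name, the averaging of the location error over samples, and the label `SUBTYPE_B` are assumptions made for the example; the weights of the combination are not given here, so the components are returned separately.

```python
import numpy as np
from sklearn.metrics import f1_score


def next_event_scores(true_type, pred_type, true_sub, pred_sub, true_xy, pred_xy):
    """Return the three evaluation components (hypothetical helper)."""
    diff = np.asarray(pred_xy, dtype=float) - np.asarray(true_xy, dtype=float)
    return {
        "f1_type": f1_score(true_type, pred_type, average="macro"),
        "f1_subtype": f1_score(true_sub, pred_sub, average="macro"),
        # (x_hat - x)^2 + (y_hat - y)^2 per sample, averaged over samples (assumption).
        "l2_location": float(np.mean(np.sum(diff ** 2, axis=1))),
    }


scores = next_event_scores(
    true_type=["JAM", "JAM", "ACCIDENT"],
    pred_type=["JAM", "JAM", "ACCIDENT"],
    # SUBTYPE_B is a made-up label used only for illustration.
    true_sub=["JAM_STAND_STILL_TRAFFIC", "JAM_STAND_STILL_TRAFFIC", "SUBTYPE_B"],
    pred_sub=["JAM_STAND_STILL_TRAFFIC", "SUBTYPE_B", "SUBTYPE_B"],
    true_xy=[[0, 0], [1, 1], [2, 2]],
    pred_xy=[[3, 4], [1, 1], [2, 2]],
)
print(scores)
```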

Event Distribution Prediction

Given a time range (start-end), predict the distribution of events across the nation. That is, for each of the following 3 time slots: 8:00-10:00, 12:00-14:00 and 18:00-20:00, predict the number of events of each type.

Input & Output

The input is one of the dates: 05.06.2022, 07.06.2022 and 09.06.2022.

The output is a 3 by 4 table in which each row corresponds to a time slot and each column to a linqmap_type (ACCIDENT, JAM, ROAD_CLOSED, WEATHERHAZARD).

Evaluation

In this section the grading is computed by the following weighted MSE:

[Formula: weighted MSE]

where $\hat{y}_{event, t}$ is the predicted number of events of some type at time $t$, and $y_{event, t}$ is the actual number of events of that type at time $t$.
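To make the idea of a weighted MSE over the 3 by 4 table concrete, here is a minimal sketch. The per-type weights and the normalization are assumptions chosen for illustration; they are not the weights used for grading.

```python
import numpy as np

EVENT_TYPES = ["ACCIDENT", "JAM", "ROAD_CLOSED", "WEATHERHAZARD"]


def weighted_mse(y_true, y_pred, weights=None):
    """Weighted mean squared error over a (time slots x event types) table.

    `weights` holds one hypothetical weight per event type; uniform weights
    reduce this to the plain MSE.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if weights is None:
        weights = np.ones(y_true.shape[1])
    weights = np.asarray(weights, dtype=float)
    squared_error = (y_pred - y_true) ** 2
    return float(np.sum(squared_error * weights) / (weights.sum() * y_true.shape[0]))


# Rows: 8:00-10:00, 12:00-14:00, 18:00-20:00. Columns follow EVENT_TYPES.
actual = [[2, 10, 1, 3], [1, 8, 0, 2], [3, 12, 1, 4]]
predicted = [[2, 12, 1, 3], [1, 8, 0, 2], [3, 12, 1, 2]]
print(weighted_mse(actual, predicted))
print(weighted_mse(actual, predicted, weights=[1, 2, 1, 1]))
```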

Getting Started

Our model requires Python 3.7+ to run.

Installation

  1. Clone the repo and enter the project directory:

    git clone https://github.com/OrrMatzkin/IML.Hackathon.Waze.git
    cd IML.Hackathon.Waze
  2. Create and activate a virtualenv, an isolated Python environment (this step is optional but recommended):

    pip3 install virtualenv
    virtualenv IML.Hackathon.Waze
    source IML.Hackathon.Waze/bin/activate
  3. The requirements.txt file lists all the Python libraries our program depends on. Install them with:

    pip3 install -r requirements.txt

Run Locally

The program is set to run both tasks. It needs data to train its models, and the Next Event Prediction task also needs the take_features file, which holds sequences of 4 consecutive events. Therefore, the program requires 2 arguments in total (we have already supplied real-time data):

python3 main.py data/waze_data.csv data/waze_take_features.csv

While the program runs, it reports its current stage. After training the models, it runs both tasks and saves the predictions as csv files, as defined in the tasks section.

(back to top)

Dive In

In the next few sections we will walk you through how our program and models work.

Preprocess

Before even approaching the task, we set aside 20% of the data for a last-minute test, as the data we were given is precious and of course limited.
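A hold-out split like this can be sketched with scikit-learn's `train_test_split`. The toy dataframe and the shuffled split with a fixed seed are assumptions for the example; how the split was actually drawn is not described here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real events table (illustrative values only).
events = pd.DataFrame({
    "id": range(10),
    "linqmap_type": ["JAM", "ACCIDENT"] * 5,
})

# Keep 20% aside as a final test set; the seed is arbitrary.
train_df, holdout_df = train_test_split(events, test_size=0.2, random_state=0)
print(len(train_df), len(holdout_df))
```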

Then we examined the data to figure out which features we have and how the data is represented. We found that there are two types of dates, many features about location, and little data about the reporters themselves.

Before deciding which features to keep and which new ones to create (by computing the correlation between them), we cleaned the data in several ways, including:

  • Getting rid of duplicates (by id).
  • Filling in missing data by analyzing similar samples that are close in location and time.
  • Converting dates to a date format.
  • Finding correlations between subtypes and location, inside and outside of town.
  • Creating dummy variables for non-numeric features.
  • And many more...
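A few of the cleaning steps listed above can be sketched with pandas. The function name `clean`, the toy rows, and the choice of `linqmap_type` as the column to one-hot encode are assumptions for illustration, not the project's actual preprocessing code.

```python
import pandas as pd

# Toy rows with a duplicated id (illustrative values only).
raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "linqmap_type": ["JAM", "JAM", "ACCIDENT", "ROAD_CLOSED"],
    "pubDate": [
        "05/15/2022 09:31:17",
        "05/15/2022 09:31:17",
        "05/15/2022 10:02:00",
        "05/16/2022 18:45:30",
    ],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate ids, parse dates, and one-hot encode the event type."""
    out = df.drop_duplicates(subset="id").copy()
    out["pubDate"] = pd.to_datetime(out["pubDate"], format="%m/%d/%Y %H:%M:%S")
    out = pd.get_dummies(out, columns=["linqmap_type"], prefix="type")
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned.columns.tolist())
```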

One example of what we learned from the raw data is where most of the events (by type) occur geographically. By plotting the (x, y) locations of the events, we saw that most of the jams are in Tel Aviv (no surprise here).

[Figure: x_y_events_map, events plotted by (x, y) location]

Next Event

How Is the Data Represented?

For fitting and predicting each quartet (4 samples), we took every sequence of 4 consecutive samples and converted it into one sample. The label was the type, subtype, x coordinate and y coordinate of the event that follows. This way we can build models with the samples and labels we want.
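One way to build such samples is a sliding window over the time-ordered events, flattening each window of 4 rows into a single feature vector and taking the label from the 5th row. The sketch below assumes a stride of 1 and numeric features; `make_quartets` is a hypothetical name.

```python
import numpy as np


def make_quartets(features, labels, window=4):
    """Flatten every run of `window` consecutive rows into one sample.

    The label of each sample is taken from the row right after the window.
    """
    features = np.asarray(features)
    labels = np.asarray(labels)
    X, y = [], []
    for start in range(len(features) - window):
        X.append(features[start:start + window].ravel())
        y.append(labels[start + window])
    return np.array(X), np.array(y)


# Six toy events with two numeric features each, ordered by time.
toy_features = np.arange(12).reshape(6, 2)
toy_labels = ["JAM", "JAM", "ACCIDENT", "JAM", "ROAD_CLOSED", "JAM"]
X, y = make_quartets(toy_features, toy_labels)
print(X.shape, y)
```

With 6 events this yields 2 samples of 8 features each, labeled with the types of the 5th and 6th events.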

How Does the Model Work?

Each sample (a quartet, from now on) goes through two models: one for the type, and a second for the subtype that assumes the sample has a specific type. With each quartet represented as a single sample and the labels taken from the event that follows it, we get the “simple” model. To choose a model, we split the training data into validation data and new training data. The new training data was used to fit each model, and the validation data to check how good its predictions were.

How Did We Choose the Model?

To find the best version of each family of models we used K-fold cross-validation, and picked the hyperparameters that gave the best predictions on the new training data.
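A K-fold hyperparameter search of this kind can be sketched with scikit-learn's `GridSearchCV`. The synthetic data, the `max_depth` grid, the 5 folds and the `f1_macro` scoring are all assumptions for the example; the grids that were actually searched are not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import ExtraTreeClassifier

# Synthetic stand-in for the flattened quartets and their type labels.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

search = GridSearchCV(
    estimator=ExtraTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},  # hypothetical grid
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1_macro",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Repeating the same search for each model family lets the families be compared on equal footing by their best cross-validated score.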

[Figures: random_forest_error, extra_tree_error]

The type model got its best result from ExtraTreeClassifier. The 4 subtype models got their best results from ExtraTreeClassifier. The 2 models for the x and y coordinates got their best results from RandomForestRegressor.

How Did We Fit the Model?

Each sample is 4 events combined into one sample. Each label is the type of the event that immediately follows those 4 events in the dataset. For each type we built a separate multi-class model, whose label can be any of that type's subtypes.
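The type-then-subtype arrangement can be sketched as one classifier for the type plus one subtype classifier per type, each trained only on the samples of its own type. The class `TypeThenSubtype`, the random toy data and the `_A`/`_B` subtype labels are hypothetical; this is an illustration of the idea, not the project's code.

```python
import numpy as np
from sklearn.tree import ExtraTreeClassifier


class TypeThenSubtype:
    """Predict the type first, then the subtype with a per-type model."""

    def __init__(self, random_state=0):
        self.random_state = random_state
        self.type_model = ExtraTreeClassifier(random_state=random_state)
        self.subtype_models = {}

    def fit(self, X, types, subtypes):
        X, types, subtypes = np.asarray(X), np.asarray(types), np.asarray(subtypes)
        self.type_model.fit(X, types)
        for t in np.unique(types):
            mask = types == t
            model = ExtraTreeClassifier(random_state=self.random_state)
            self.subtype_models[t] = model.fit(X[mask], subtypes[mask])
        return self

    def predict(self, X):
        X = np.asarray(X)
        pred_types = self.type_model.predict(X)
        pred_subtypes = np.empty(len(X), dtype=object)
        # Route each sample to the subtype model of its predicted type.
        for t, model in self.subtype_models.items():
            mask = pred_types == t
            if mask.any():
                pred_subtypes[mask] = model.predict(X[mask])
        return pred_types, pred_subtypes


# Toy data: the labels are derived from random features, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
types = np.where(X[:, 0] > 0, "JAM", "ACCIDENT")
subtypes = np.char.add(types, np.where(X[:, 1] > 0, "_A", "_B"))

model = TypeThenSubtype().fit(X, types, subtypes)
pred_types, pred_subtypes = model.predict(X)
print(pred_types[:3], pred_subtypes[:3])
```

A consequence of this design is that a predicted subtype always belongs to the predicted type, at the cost of propagating any mistake made by the type model.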

Results

We got an F1-macro score of 0.62 on the training data and 0.28 on the test data (the test data was checked only after we finished the work).

Event Distribution

We chose to analyze the data by day of the week and by time slot within each day. We then calculated the average number of events in each time slot, computed from the number of events of each type and the number of days the data was gathered from.
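The averaging described above can be sketched with pandas as follows. The column name `time`, the function `average_events`, and counting only days that have at least one recorded event are assumptions made for this example.

```python
import pandas as pd

SLOTS = {"8:00-10:00": (8, 10), "12:00-14:00": (12, 14), "18:00-20:00": (18, 20)}
TYPES = ["ACCIDENT", "JAM", "ROAD_CLOSED", "WEATHERHAZARD"]


def average_events(events: pd.DataFrame, weekday: int) -> pd.DataFrame:
    """Average events per type and time slot for one weekday (Monday=0)."""
    day_events = events[events["time"].dt.dayofweek == weekday]
    # Number of distinct calendar days of this weekday seen in the data.
    n_days = day_events["time"].dt.normalize().nunique()
    table = pd.DataFrame(0.0, index=list(SLOTS), columns=TYPES)
    if n_days == 0:
        return table
    hours = day_events["time"].dt.hour
    for slot, (start, end) in SLOTS.items():
        in_slot = day_events[(hours >= start) & (hours < end)]
        counts = in_slot["linqmap_type"].value_counts()
        table.loc[slot] = counts.reindex(TYPES, fill_value=0) / n_days
    return table


# Toy events on two Sundays and one Monday (illustrative values only).
events = pd.DataFrame({
    "time": pd.to_datetime([
        "2022-05-15 08:30", "2022-05-15 09:10", "2022-05-15 12:15",
        "2022-05-22 08:45", "2022-05-22 18:20", "2022-05-16 08:00",
    ]),
    "linqmap_type": ["JAM", "JAM", "ACCIDENT", "JAM", "ROAD_CLOSED", "JAM"],
})
sunday_table = average_events(events, weekday=6)
print(sunday_table)
```

The result has the same 3 by 4 shape as the required output, one row per time slot and one column per event type.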

(back to top)

Contributors

Thank you for reading and showing interest in our hackathon project... Keep in mind, we don't need any contribution to this project whatsoever.

(back to top)

Copyright

MIT License

Copyright (c) 2022 OrrMatzkin

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
