Create folder and then open a terminal in that folder and run these commands:
git clone https://github.com/ZekiMilovan001/NordeusDE.git
The .env file generally contains sensitive configuration settings for your project, such as database credentials. Here’s how to set up and manage .env file.
-
USER:
- Set this to your PostgreSQL username (the user that has access to the database).
- Example:
USER=postgres.
-
PASSWORD:
- Set this to your PostgreSQL user's password.
- Example:
PASSWORD=password123.
-
HOST:
- Set this to the host where your PostgreSQL database is running.
- If it’s running on your local machine, you can use
localhost. - Example:
HOST=localhost.
-
PORT:
- Set this to the port number that PostgreSQL is listening on (default is 5432).
- Example:
PORT=5432.
-
DATABASE:
- Set this to the name of your PostgreSQL database.
- Example:
DATABASE=nordeus.
Once you have filled out the .env file, it should look like this:
USER=postgres
PASSWORD=password123
HOST=localhost
PORT=5432
DATABASE=nordeus
-
Install PostgreSQL and pgAdmin.
-
Create a Database in pgAdmin:
- Open pgAdmin.
- In the left panel, navigate to:
Servers > PostgreSQL. - Right-click on
PostgreSQL, then select:
Create > Database. - Enter a name for your database.
Note: This is a simple approach to quickly get started with the API. 😊
-
Create python virtual enviroment and install dependencies:
- Run
python -m venv venvin the terminal you have recently opened in the folder. - Run
pip install -r requirements.txt.
- Run
-
Extract Data into Database:
- Run the
process.pyscript to extract data from.jsonlfiles into the database. - Example:
python process.py events_test.jsonl
- Run the
-
Start Uvicorn server:
- Run
uvicorn main:appin the terminal.
- Run
-
Start using API (e.g. Postman)
Endpoint: /player_stats
Method: GET
Description: Retrieve statistics for a specific player.
user_id(required, string ): Represents a unique identifier for a user.date(optional): In the formatYYYY-MM-DD.- If no date is specified, the API calculates all-time stats for the specified user.
- If a date is specified, the API calculates stats for the given date only.
- Country of the user (string): The country of the user.
- Registration datetime (datetime): The date and time the user registered, shown in their local timezone.
- Days since last login (int):
- If no date is specified, calculates the days since the user's last login.
- If a date is provided, calculates the days up to the last login date.
- Number of sessions (int):
- If a date is specified, returns the session count for that day.
- If no date is specified, returns the all-time number of sessions.
- Time spent in game (in seconds) (int):
- If a date is provided, gives the time spent on that day.
- If no date is specified, gives the all-time total time spent in the game.
- Total points on matches (int):
- Points are based on matches that started on the specified date, categorized by the user's role as home or away.
- If no date is specified, provides all-time points, calculated as:
- Win = 3 points
- Draw = 1 point
- Loss = 0 points
- Match time as a percentage of total game time (float):
- Represents the percentage of total game session time spent actively participating in matches.
- If no date is specified, provides the all-time percentage.
{
"user_country": "US",
"local_time": "2024-10-09T13:27:35-04:00",
"time_since_last_login": 34,
"time_spent_playing": 780,
"number_of_sessions": 2,
"home_points": 3,
"away_points": 0,
"match_to_game_ratio": 14.487179487179487
}
Endpoint: /game_stats
Method: GET
Description: Retrieve statistics for a game state.
date(optional): In the formatYYYY-MM-DD.- If no date is specified, the API calculates all-time game statistics.
- If a date is specified, the API calculates statistics for the specified date only.
- Number of daily active users (int):
- If a date is specified, returns the number of users who had at least one session on that day.
- If no date is specified, returns the total number of users with at least one session ever.
- Number of sessions (int): The total number of sessions recorded.
- Average number of sessions (float):
- Calculates the average number of sessions per user for users with at least one session.
- User(s) with the most points overall (list of strings): The user or users who have accumulated the most points in total.
{
"active_users": 3,
"num_sessions": 5,
"average_session_num": 1.6666666666666667,
"users_with_most_points": [
"52d65a1b-8012-934e-001b-19e6ba3cdc0e"
]
}
| Parameter Name | Parameter Type | Parameter Description |
|---|---|---|
event_id |
INT |
Unique identifier representing an event |
event_timestamp |
INT |
Time of event represented as Unix time |
event_type |
STRING |
One of the following: registration, session_ping, match |
event_data |
JSON OBJECT |
JSON object containing all event-specific data (check event data below) |
For our analysis, we are interested in three types of events: registration, session_ping, and match.
The registration event is generated when a user registers. At that time, the user receives a unique identifier to track them across all other events. This event also includes the country and OS of the device used.
| Parameter Name | Parameter Type | Parameter Description |
|---|---|---|
country |
STRING |
Country from which the user registered |
user_id |
STRING |
Unique identifier representing a user |
device_os |
STRING |
OS of the device user registered from. Valid values: iOS, Android, Web |
A session is the continuous period during which a player interacts with a game. Sessions are tracked using session_ping events. This event is sent when a user opens the game and every 60 seconds afterward, as long as the game remains open. If more than 60 seconds pass between two ping events, they are considered different sessions. The first event in a session has the type session_start, and the last event has the type session_end.
| Parameter Name | Parameter Type | Parameter Description |
|---|---|---|
user_id |
STRING |
Unique identifier representing a user |
type |
STRING |
Possible values: session_start, session_end, and "" (empty string) |
This event is triggered when a match starts and finishes. If the match has started, the values for home_goals_scored and away_goals_scored will be NULL. When the event signals that the match has finished, these fields will contain non-NULL values.
| Parameter | Type | Description |
|---|---|---|
match_id |
STRING | Identifier representing one match (same value for both match start and end) |
home_user_id |
STRING | Unique identifier representing the home user |
away_user_id |
STRING | Unique identifier representing the away user |
home_goals_scored |
INTEGER | Number of goals scored by the home team (or NULL if the match just started) |
away_goals_scored |
INTEGER | Number of goals scored by the away team (or NULL if the match just started) |
For simplicity, each country is assigned exactly one timezone.
| Parameter | Type | Description |
|---|---|---|
country |
STRING | Country code |
timezone |
STRING | Timezone for the country. Possible values: "America/New_York", "Asia/Tokyo", "Europe/Berlin", "Europe/Rome" |
- Checked the data integrity as per column definitions given in the task.
- Checked if
goals_scoredfields have all int values. (They do) - Checked if
user_idfield in registration unique. (i.e. Checked if one user can register once with sameuser_id) - Checked if
device_osandcountryfields contain unallowed values. (They did, Mars forcountryand ENIAC fordevice_os) -
Note 1: There was repetition of registration for same
user_idvalue, but also one of them haddevice_osvalue ENIAC, which after discarding it, solves the problem. -
Note 2: Regarding the Mars field, in this solution, I discarded that register row, but also
countryvalue could be changed to random or one of the allowed values forcountry, depending on requirements and is open for discussion. - Droped the
typecolumn in session data, since i wanted that bonus 😎.
- Checked if
- Explored additional information about dataset.
- Checked if
session_pingandmatchtypes of pings can happen at the same time. (i.e. Checked if general playtime can be queried using only session info. It can.)
- Checked if
The data was split into 3 main sub-categories that refer to their respective event_type values, thus 3 initial tables:
- Register data
- Session data
- Match data
All of these tables contain before mentioned fields.
The event_data field was discarded from these tables, but it formed three new tables:
- Register events
- Session events
- Match events
These tables contain event information for their respective events, explained before, with addition of event_id to each of them, which acts as a foreign key to data tables.
Since some of them are pretty simple, such as extracting country of the registered user, I will be skipping some.
- How many days have passed since this user last logged in?
- Idea was to filter data based on
dateparameter (event_timedate-value <=date) and get the most recent one, then calculate delta between that time anddatevalue.
- Idea was to filter data based on
- Number of sessions for that user on that day. If no date is provided give the all time number of sessions.
- Time filter: Idea was to find bounds of
dateparameter. Lower bound is first second of the day, for givendateparameter (or 0 if no values were passed) and upper bound is last second of the day (or last second of current day, if no values were passed). - Session count: Idea was to sort by
event_timestampascendingly for eachuser_idvalue and calculating value of the new flag field as:- 0: if difference between current and last ping is less than 60 seconds, as per task definition. (i.e. if it is not start of new session)
- 1: in other cases, meaning that that row is the start of new session.
- By summing these flags, all session starts are counted and number of sessions is extracted.
- Session duration: : Idea was to find sum of all differences between last and first ping in each session.
- Time filter: Idea was to find bounds of
- Total points won on matches grouped by whether the user was home or away, for specific day or all time, depending if
dateparameter has value. - Time filter: same idea as before. - Home points: filter data whereuser_idis equal tohome_user_idfield and sum points accordingly. - Away points: filter data whereuser_idis equal toaway_user_idfield and sum points accordingly. - Match time as a percentage of total game time. Represents how much of a player's overall game session was spent actively participating in a match, compared to the entire time spent in the game
- Starts and Ends temporary tables: Fitered by match start or end.
- Games table: It has same data as original, with the addition of start and end time for each game.
- Games away and Games home table: Has contents of Games table but filtered whether user was playing at home or away. Wiith addition of
live_gamefield, that is showing how much player was participating in a match since start of it (e.g. 60 -players_latency, whereplayer_latencyis telling us how late player was late for the match start). Similar stands for match ends. - Result by comining these tables and summing
live_gamevalues for that player, we can extract how much time player has spent playing matches actively. Ratio is caluclated using already found time spent value.
- Number of daily active users.
- Counted unique occurencies of
user_idfiltered by before mentioned criteria.
- Counted unique occurencies of
- Number of sessions and average number of sessions for users that have at least one session.
- Used same flag "trick" as in Player stats but for every player, filtered by before mentioned criteria.
- The user/users with the most points overall.
- Found maximum number of points overall, then filtered all players who have that number of points, filtered by before mentioned criteria.
- The general session data could've been stored into one table, containing all fields, since all event types have same fields (excluding
event_datafield), but I've chosen to separate them for the optimization reasons (less data to be queried) and for the sake of simplicity, also, they were not needed at the same time in the API. If ALL data got combined into one big table, a lot of fields would be NULL, sinceevent_datahas different attributes depending onevent_typevalue, thus wasting memory and slowing down the queries. - "Table for each query": If the data was manipulated in such a way, that table gets created for every query in task requirements, the queries would be faster, shorter/simpler (reducing code in API router), but introducing redudancy in data and making loading and cleaning process more complex.
Generally, in the API implementation, I've chosen to distribute more workolad on the database and minimize work with the fetched data in the router, since database has internal optimization tools, which are smarter than I am probably, hence, increasing the speed. I also was thinking about using 2nd approach, but I felt like it was not the goal of this challenge to reduce the SQL ussage to select * from table and do the rest using pandas, although I do like the idea of faster API calls more, but at the and of the day, you gotta pay the toll somewhere, is it at loading or at reading.
