This document contains a data-science oriented description of the project. It orients the data science project towards data and technology. It serves the purposes of
- outlining a solution,
- defining the scope of the project,
- defining measurable success metrics,
- defining interfaces to the production environment,
- gathering information regarding relevant datasets and features to the problem,
- upstream communication and agreement on requisites.
Mark which tasks have been performed.
- Summary: all major specs are defined including success criteria, scope, etc.
- Output specification: the output format, metadata and indexation have been defined and agreed upon.
- Solution architecture: a viable and minimal architecture for the final solution has been provided and agreed upon.
- Limitations and risks: a list with the main limitations and risks associated to the project has been provided.
- Related resources: lists with related datasets, features and past projects have been given.
- Peer-review: you discussed and brainstormed with colleagues the outlined specifications.
- Acceptance: these specifications have been accepted by the Data and Product directors. State names and date.
The table below summarizes the key requirements for the project.
| problem type | target population | entity | N_target | min_coverage | N_labeled | sucess_metrics | updt_freq |
|---|---|---|---|---|---|---|---|
| content based filter for recommendations | business | CNPJ | - | 80% | NA | average precision | as new companies come in dataset |
This project aims at developing a model to recommend leads for a company based on its previous clients.
| Entity | Region | Type | Size | Status | Sector | N_target |
|---|---|---|---|---|---|---|
| CNPJ | Brasil | companies | any | active | - | - |
Variables present in config.SPECIAL_LABELS are used to filter market to make recomendations. Model removes from recomendations possibilities all rows that do not share its special labels with at least one row of user's portfolio for each label.
Initially used special labels:
| Label | Based on |
|---|---|
setor |
Tests & business knowledge (little) |
de_faixa_faturamento_estimado_grupo |
Tests |
predict function offer option to do not use special labels on recomendation.
The model will outputs a list of string that will be a subset of id column of input file.
As the problem is make recomendations for a user with a portfolio, based only on it's portfolio. So, the model should be a content-based filter, validated by metrics that analyze results of recomendations.
The model should consume data from total market's dataset, restricting input ids possiblities by market dataset's ids.
For current version, user must put estaticos_market.csv train file and and estaticos_portfolio<1, 2, 3>.csv test files on data folder.
To generate predictions, put input file on predict folder and run make predict INPUT='<path_to_file>'.
For our recommendation system, we have divided the portfolios datasets into train and test, considered precision and recall as follows:
Then, we have used the Average Precision at k (AP@k) on each portfolio, to valuate and compare our models, where k is the number of recommendations requested, from rank 1 through k.
We have used the Average Precision because it gives us an idea of not only the number of leads from the test subset inside the recommendations, but also the position of those leads.
| Limitation | Likelihood | Loss | Contingency |
|---|---|---|---|
| User's portfolio contains new ids | 100% | not possible make prediction | Update market list |
| User's portfolio with high variance | 50% | predictions could be near to random | - |