This report should contain all relevant information regarding your model. Someone reading this document should be able to easily understand and reproduce your findings.
Mark which tasks have been performed
- Summary: you have included a description, usage, output, accuracy and metadata of your model.
- Pre-processing: you have applied pre-processing to your data and this function is reproducible to new datasets.
- Feature selection: you have performed feature selection while modeling.
- Modeling dataset creation: you have well-defined and reproducible code to generate a modeling dataset that reproduces the behavior of the target dataset. This pipeline is also applicable to generate the deploy dataset.
- Model selection: you have chosen a suitable model according to the project specification.
- Model validation: you have validated your model according to the project specification.
- Model optimization: you have defined functions to optimize hyper-parameters and they are reproducible.
- Peer-review: your code and results have been verified by your colleagues and pre-approved by them.
- Acceptance: this model report has been accepted by the Data Science Manager. State name and date.
The model is a lead recommendation system based on cosine similarity, which generates a pairwise similarity score between companies.
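As an illustrative sketch (not the project's actual code), pairwise cosine similarity between a portfolio and the market can be computed with scikit-learn; the feature matrices below are hypothetical:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature matrices: rows are companies, columns are features.
portfolio = np.array([[1.0, 0.0], [0.8, 0.2]])            # user's portfolio
market = np.array([[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]])   # rest of the market

# Similarity of every market company to every portfolio company.
scores = cosine_similarity(market, portfolio)  # shape (3, 2)

# Rank market companies by their best match against any portfolio company.
ranking = scores.max(axis=1).argsort()[::-1]
print(ranking)
```

Market companies at the top of `ranking` are the strongest leads, since they are closest (in cosine distance) to at least one company the user already has.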
- clone the repository
- download the test dataset from cloud storage
- run `mkdir workspace/data`
- run `mkdir workspace/models`
- configure the train data on `config.py`
- install Docker
- run `make run`
- run `make predict INPUT='<input_filepath>'`
- The output contains all ids present on `estaticos_market.csv`
- The input will be a subset of the ids of `estaticos_market.csv`
| metric | portfolio1 | portfolio2 | portfolio3 |
|---|---|---|---|
| Average precision | .0 | .12 | .12 |
| n of companies | .98 | .8 | .9 |
- Remove columns listed on `config.TO_REMOVE`: most of them represent redundant information in the dataset, as described in `dict.json`
- Remove columns with NaN rate > `config.NAN_THRESH`, initially configured as 0.6 (60%): these columns could not be treated well by any filling strategy
- Fix columns from `config.TO_FIX_OBJ2BOOL`: they hold True/False values but were read as strings
- Impute remaining NaNs as set on `config.NAN_FIXES`: values chosen by manual analysis/testing
- Encode columns on `config.ORDINAL_ENCODE` manually: the ordering is encoded by semantics (e.g., worst value 0, best value N)
- Encode columns from `config.SPECIAL_LABELS` and store them to join back after feature selection: columns which, based on our tests, we hypothesize can be used to filter results
- Scale data: required for clustering strategies
- Use PCA to select enough components to describe 60% of the variance
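The steps above can be sketched with pandas/scikit-learn as follows; the column names, toy data, and the median-imputation choice are illustrative assumptions, not the project's real `config` values:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

NAN_THRESH = 0.6  # illustrative; the project reads this from config.NAN_THRESH

df = pd.DataFrame({
    "mostly_nan": [None, None, None, 1.0],          # NaN rate 0.75 > threshold
    "flag": ["True", "False", "True", "False"],     # booleans read as strings
    "a": [1.0, 2.0, None, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
})

# Drop columns whose NaN rate exceeds the threshold.
nan_rate = df.isna().mean()
df = df.drop(columns=nan_rate[nan_rate > NAN_THRESH].index)

# Fix object columns that actually hold booleans.
df["flag"] = df["flag"].map({"True": True, "False": False})

# Impute remaining NaNs (column median here, as a stand-in for config.NAN_FIXES).
df["a"] = df["a"].fillna(df["a"].median())

# Scale numeric features and keep enough PCA components for 60% of the variance.
X = StandardScaler().fit_transform(df[["a", "b"]])
X_reduced = PCA(n_components=0.6).fit_transform(X)
print(X_reduced.shape)
```

Passing a float to `PCA(n_components=...)` makes scikit-learn keep the smallest number of components whose cumulative explained variance reaches that fraction, matching the 60% rule above.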
We removed some columns (`config.TO_REMOVE`) due to information redundancy and used PCA to reduce the dataset dimension to the number of components that represents 60% of the variance.
We tried using KMeans to improve recommendations, but after some tests we decided to leave it only as an option in the prediction software.
Our recommendations come mainly from the cosine similarity calculated between the user's portfolio and the rest of the market. KMeans clustering to select candidate recommendations was our first guess, but after some tests, and based on an analysis of the features together with our validation metric, we chose to use only filtering methods plus cosine similarity to order the leads as the default.
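To illustrate the optional clustering path (a sketch under assumed data, not the project's implementation), KMeans can pre-filter the market to the clusters containing the portfolio before cosine-similarity ordering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
market = rng.normal(size=(100, 5))  # hypothetical market feature matrix
# Portfolio companies assumed similar to the first market rows (toy setup).
portfolio = market[:3] + rng.normal(scale=0.01, size=(3, 5))

# Cluster the market, then keep only companies sharing a cluster with the portfolio.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(market)
portfolio_clusters = set(km.predict(portfolio))
mask = np.isin(km.labels_, list(portfolio_clusters))
candidates = market[mask]

# Order the remaining candidates by cosine similarity to the portfolio.
scores = cosine_similarity(candidates, portfolio).max(axis=1)
leads = candidates[scores.argsort()[::-1]]
print(leads.shape)
```

This mirrors the trade-off described above: clustering shrinks the candidate pool, while cosine similarity alone (the chosen default) simply ranks the whole market.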
- The train dataset `estaticos_market.csv` was run through the pipeline
- The test files `estaticos_portfolio<1, 2, 3>.csv` were divided into train and test sets (70/30)
- We use average precision (implemented in `squad_3_ad_data_science.validation`) to measure recommendation quality, and we tuned the manual filters to maximize this metric
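The project's metric lives in `squad_3_ad_data_science.validation`; as a hedged stand-in (a standard formulation, not necessarily the project's exact implementation), average precision over a ranked recommendation list can be computed like this:

```python
def average_precision(recommended, relevant):
    """Average precision of a ranked list `recommended` against a set `relevant`."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this cut-off
    return score / max(len(relevant), 1)

# 70/30 split idea: hold out 30% of a portfolio's ids and check whether
# the recommender ranks them highly. Ids below are hypothetical.
held_out = {"c2", "c5"}
ranked = ["c2", "c9", "c5", "c7"]
print(average_precision(ranked, held_out))
```

The metric rewards placing held-out portfolio companies near the top of the ranking, which is exactly what the manual filters were tuned to maximize.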
Through testing, we tried to find features that could represent the preferences expressed by the portfolios, using them to filter results or to cluster the dataset.