Skip to content

NoveliaYuki/ML-root_finder

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Training and Fine Tuning of ML models

1. Preface

A free tree is an undirected graph that represents the connections between the words in a sentence. This project is extremely useful because unsupervised models often fail to determine the direction of these links, resulting in this free tree. Once we know the root of the tree thanks to this model we can easily deduce the direction of these connections.

2. Description

The purpose of this project is to train multiple Machine Learning models to get the best one at finding the root of a sentence given a free tree based in a set of selected features. The features that have been tested and the ones that have been selected are in the following sections with their corresponding analysis in section 4. Analysis of Features.

The data that has been used to train these models has been extracted from here. It is given in multiple languages (to check them go to the section 3.1 List of Languages), one file per language. The features extracted from this data have been analyzed to determine how relevant they are to the model's performance. It is an indispensable step, because the simpler the model the better if the performance is barely affected by the deletion of this feature.

Data has been divided into two parts: the training and the validation portions. The main reason is to avoid overfitting due to bias and variance, and this method is perfect for that. These portions have been modified using the K-Fold Cross-Validation method for k=10, which consists in saving a few portions that are not used for training but to evaluate how good the model is afterwards.

Different models have been tested to compare them and use the best one for this use case. You can find the list of models below in the section 5. Models. For the training of each model, hyperparameters like alfa and others for each own model have been optimized. Before the training of each of the models feature data has been normalized.

The scoring methods used to evaluate the performance of the models are the MSE and the R-squared values, used to quantify the error rate and the accuracy rate of the models respectively.

3. Data used for Training and Feature extraction

The data used for training the different models can be found here. The data received are multiple rooted trees, one line per sentence. A part of this project concerns the transformation from a rooted tree to a free tree.

3.1 List of Languages

Arabic French Indonesian Portuguese Turkish
Chinese Galician Italian Russian
Czech German Japanese Spanish
English Hindi Korean Swedish
Finnish Icelandic Polish Thai

3.2 Features Extracted

Feature Name Brief Description
vertex_degree Number of vertices connected to a vertex.
avg_distance Mean distance from the vertex to all the other vertices of the sentence.
is_center If the vertex belongs to the center of the "graph" of the sentence.
is_centroidal If the vertex belongs to the centroid of the "graph" of the sentence.
phrase_length Length of the sentence to which the vertex belongs.
betweenness_centrality Number of times a vertex happens to be in the shortest path between two other vertices.
closeness_centrality Reciprocal of avg_distance.
eigenvector_centrality Measure to quantify the influence of a vertex, also considers its neighbor's influence.
average_neighbor_degree Mean of the degrees (vertex_degree) of the adjacent vertices of a vertex.
eccentricity Maximum distance from one vertex to another in the graph.
diameter Maximum value for eccentricity.
in_diameter Indicates whether the vertex can be found within the diameter.
radius Minimum value for eccentricity.
weiner_index Sum of distances between all the pairs of vertices inside the graph.

4. Analysis of Features

These plots have been created to determine the influence of each feature. Keep in mind that this mostly applies to Linear Correlation, so there may be other relations that we are not aware of and that we can not see using these plots.

Only the most relevant features are shown in this section, but for more information go to section 9.1 Tables and Plots for Descriptive and Distribution Analysis of the data.

4.1 Descriptive and Distribution Analysis

The table to do the descriptive analysis of these features is in section 9.1. As we can see, the sentences are quite long as the average of the feature phrase length is 22.26 and the maximum is 70. As for the diameter the average is 7.69, meaning that words close to each other in the sentence are normally related. The feature betweenness centrality shows that vertices are not usually part of a shortest path between two other vertices, but when it does happen these vertices have a heavy weight in the decision of what the root of the tree is.

All the features show a persistent left tendency, which can be seen in the following plots and the ones found in section 9.1. It is the main reason why the normalization method that has been used is the robust normalizer.

vertex_degree-boxplot vertex_degree-distribution phrase_length-boxplot phrase_length-distribution betweenness_centrality-boxplot betweenness_centrality-distribution eigenvector_centrality-boxplot eigenvector_centrality-distribution

Finally, the most important plot is the one for the target we are looking for as shown below. As we can see, it also has a left tendency and its distribution is not very homogeneous.

target-boxplot target-distribution

4.2 Linear Correlation of Features

In the following figure we can see that some features have very little linear correlation with others, so these are the first ones that have been deleted to improve the performance of the models.

The most important features at first glance seem to be betweenness centrality, eigenvector centrality and closeness centrality in the respective order. It is pretty accurate, as in the end we have kept two of these three most important features.

correlation_matrix

Thanks to the following plot we can safely say that betweenness centrality has a big linear correlation with the target almost looking like a quadratic function.

dimensionality_reduction

5. Models

These are the following models used for this project:

Random Elastic Net Regression
Normal degree Decision trees
Linear Regression Random Forest
Ridge Regression Gradient Boosting Machine
Lasso Regression GMM

The "models" Random and Normal degree are used as a baseline to determine the quality of the others. Random referes to a model that outputs a random number between 0 and 1. Normal degree refers to a model that normalizes the degree (k) of each vertex calculating (k-1)/(n-1), being n the length of the sentence.

6. K-fold Cross-Validation and Data Normalization

We normalize the data from the features extracted that need normalization to avoid giving more importance to one feature than another solely for its value size. The chosen method is the Robust Scaler due to the fact that all the plots previously presented of the features' properties show a very persistent left tendency as mentioned previously.

7. Evaluation

As it has been mentioned in the description, the evaluation methods are the MSE and R-squared values.

Scoring method Brief description
MSE Mean of all the squared differences between the model predictions and the correct expected results. The lower number for this value the better.
R-squared Mean of all the R-squared values which measures the quality of the regression model. The higher number for this value the better.

8. Results

8.1 Best model

The Gradient Boosting Machine is the one that has proven to be most effective for this case with these four hyperparameters being considered relevant: vertex degree, phrase length, betweenness centrality and eigenvector centrality. Other models were not too far from first place as shown here:

Model MSE R-squared
Gradient Boosting Machine 0.01873 0.72408
Linear Regression 0.02017 0.70292
Ridge Regression 0.02017 0.70292

For all models the best values for their own hyperparameters have been chosen, including the best value for alfa. Despite this, the models Lasso Regression, Elastic Net Regression and GMM fail to perform well on any situation with any combination of features, meaning that they are not fit for this use case.

After deleting all features except one it can be noticed that the most important one by far is the betweenness centrality. Go to table 2 in section 9.2 Tables of Results to see the results for MSE and R-squared with using only the feature betweenness centrality.

8.2 Comparison of Models for different Languages

As mentioned before the data input refers to the rooted trees of sentences in different languages. From table 4 and 5 in section 9.2 Tables of Results we get the following plots respectively:

MSE_results

The first plot shows the MSE values for each language for each of the models. Remember that the lower the number the better, so in this case we can see that the best language these models are at is Japanese, which is interesting, since it is known to be a language quite difficult to learn. We get these results most probably because its syntax is quite simple leading to a much easier time to learn what the root of the sentence is.

R2_results

The second plot shows the R-squared values for each language for each of the models. Remember that the higher the number the better, so in this case we can see that only the Linear Regression, the Ridge Regression, the Decision Tree, the Random Forest and the Gradient Boosting Machine models have a positive value and thus are worth investigating. For this value it seems like once again Japanese is the best language that these models are at, just like with the MSE values. As mentioned before, it is probably due to its simple syntax. Chinese also has a simple syntax, reinforcing the previous hypothesis as Chinese is also one of the languages that these models are best at.

9. Annex

9.1 Tables and plots for Descriptive and Distribution Analysis of the data

Table 1.

Table that shows a data description for each feature.

Measure/Feature vertex_degree avg_distance is_center is_centroidal phrase_length betweenness_centrality closeness_centrality eigenvector_centrality average_neighbor_degree eccentricity
Count 371321 371321 371321 371321 371321 371321 371321 371321 371321 371321
Mean 1.892815 3.691950 0.074765 0.074765 22.260737 0.146178 0.294866 0.185223 3.169671 6.173166
Standard Deviation 1.296955 1.051328 0.263013 0.263013 8.934124 0.221502 0.092301 0.138870 1.324016 1.872693
Min 1.0 1.0 0.0 0.0 3.0 0.0 0.094708 0.000042 1.0 1.0
25% 1.0 2.944444 0.0 0.0 16.0 0.0 0.230769 0.082408 2.0 5.0
50% 1.0 3.60 0.0 0.0 21.0 0.0 0.277778 0.152841 3.0 6.0
75% 3.0 4.333333 0.0 0.0 28.0 0.235507 0.339623 0.241381 4.0 7.0
Max 14.0 10.558824 1.0 1.0 70.0 1.0 1.0 0.707107 14.0 18.0

Plots 1 & 2.

Boxplot and distribution plots for the feature avg_distance.

avg_distance-boxplot avg_distance-distribution

Plots 3 & 4.

Boxplot and distribution plots for the feature is_center.

is_center-boxplot is_center-distribution

Plots 5 & 6.

Boxplot and distribution plots for the feature is_centroidal.

is_centroidal-boxplot is_centroidal-distribution

Plots 7 & 8.

Boxplot and distribution plots for the feature closeness_centrality.

closeness_centrality-boxplot closeness_centrality-distribution

Plots 9 & 10.

Boxplot and distribution plots for the feature average_neighbor_degree.

average_neighbor_degree-boxplot average_neighbor_degree-distribution

Plots 11 & 12.

Boxplot and distribution plots for the feature eccentricity.

eccentricity-boxplot eccentricity-distribution

Plots 13 & 14.

Boxplot and distribution plots for the feature diameter.

diameter-boxplot diameter-distribution

Plots 15 & 16.

Boxplot and distribution plots for the feature in_diameter.

in_diameter-boxplot in_diameter-distribution

Plots 17 & 18.

Boxplot and distribution plots for the feature radius.

radius-boxplot radius-distribution

Plots 19 & 20.

Boxplot and distribution plots for the feature wiener_index.

wiener_index-boxplot wiener_index-distribution

9.2 Tables of Results

Table 1.

Results for using the four features vertex_degree, phrase_length, eigenvector and betweenness centrality with all languages as input data.

Model MSE R-squared
Random 0.285105422370349 -3.20060265254352
Normal degree 5.18634950886863 -75.4134422683945
Linear Regression 0.0201658979678067 0.702917218900158
Ridge Regression 0.0201658979672097 0.702917219011988
Lasso Regression 0.0678806703897772 -0.0000147436483798779
Elastic Net Regression 0.0678806703897772 -0.0000147436483798779
Decision Tree 0.0342242901324194 0.495803626639215
Random Forest 0.0224229087660543 0.669670546659429
Gradient Boosting Machine 0.0190093487013587 0.719960973566609
GMM 0.0859682726791761 -0.266475292626924

Table 2.

Results for using only the feature betweenness centrality with all languages as input data. The model normal degree does not have a value because the feature vertex degree has not been selected.

Model MSE R-squared
Random 0.2843237930512497 -3.1891087881428732
Normal degree - -
Linear Regression 0.0202809792052787 0.7012214301329455
Ridge Regression 0.020280979205731776 0.7012214302394947
Lasso Regression 0.06788067038977724 -1.474364837987796e-05
Elastic Net Regression 0.06788067038977724 -1.474364837987796e-05
Decision Tree 0.0201769848553593 0.7027557091184073
Random Forest 0.020006119966908216 0.7052721931142946
Gradient Boosting Machine 0.01963092111601815 0.7107976564435315
GMM 0.08596827267917613 -0.2664752926269244

Table 3.

Results for using all the previously described features with all languages as input data. The random forest model does not have a value because it did not even finish the first fold within 3 hours.

Model MSE R-squared
Random 0.284882094822205 -3.19733863530342
Normal degree 0.481941878181523 -8.06081392231267
Linear Regression 0.0200146946282594 0.705144898427363
Ridge Regression 0.0200146945281936 0.705144899951571
Lasso Regression 0.0708827675005482 -0.0000147436483798779
Elastic Net Regression 0.0678806703897772 -0.000771276833173439
Decision Tree 0.0346982454858982 0.48878538067082
Random Forest - -
Gradient Boosting Machine 0.0187291471075987 0.724083492005656
GMM 0.0859682726791761 -0.266475292626924

Table 4.

MSE results for using the four features vertex_degree, phrase_length, eigenvector and betweenness centrality for each of the languages.

Language/Model Linear Regression Ridge Regression Lasso Regression Elastic Net Regression Decision Tree Random Forest Gradient Boosting Machine GMM
Arabic 0.023530201625890 0.023530200411524 0.070882767500548 0.070882767500548 0.041892520807356 0.025823790428371 0.024267148578674 0.091989009039923
Chinese 0.017475723657304 0.017475721632455 0.064627450893589 0.064627450893589 0.032692279744135 0.019328221176870 0.018703343028974 0.081042325597481
Czech 0.025574448197278 0.025574445833119 0.078177154985588 0.078177154985588 0.048093078733165 0.029779830043549 0.027950156894012 0.102286726935066
English 0.019061243015400 0.019061243325417 0.067520685886209 0.067520685886209 0.036610787284008 0.021414126819886 0.021076577754712 0.084528933896009
Finnish 0.024517010597128 0.024517011920720 0.083739351279073 0.083739351279073 0.044088388554561 0.028335792757142 0.027814677548983 0.110357044549104
French 0.016501784295316 0.016501782976113 0.059571556814500 0.059571556814500 0.029866842065063 0.018137027910744 0.017609673332394 0.073301612864846
Galician 0.019777566798655 0.019777565410495 0.063624767801684 0.063624767801684 0.035765205450456 0.021304230519486 0.020838523700073 0.079409123138615
German 0.018239866311027 0.018239866232157 0.066861106723244 0.066861106723244 0.033758375540433 0.020233827558412 0.020209476403394 0.083123471884610
Hindi 0.016713159607737 0.016713157034977 0.060505032821676 0.060505032821676 0.033064646527558 0.016713157034977 0.019113629368141 0.074282706005423
Icelandic 0.021027155921022 0.021027152733755 0.073250571025165 0.073250571025165 0.038514529837016 0.023853714467744 0.023705403094476 0.093705268135159
Indonesian 0.021330666147686 0.021330669463480 0.072454646672234 0.072454646672234 0.037335159052457 0.021330669463480 0.021824949736656 0.093466674979132
Italian 0.018545206593930 0.018545205266345 0.062243298973675 0.062243298973675 0.034582007054110 0.020249406546336 0.019569865023368 0.077042968838995
Japanese 0.013471720644726 0.013471720528667 0.052215240250456 0.052215240250456 0.023226176732437 0.013861670142528 0.013762476401944 0.062966194328831
Korean 0.027020784531987 0.027020779520219 0.079481462845313 0.079481462845313 0.050074508490094 0.031548695948378 0.028994179149125 0.110736402898037
Polish 0.025205430469470 0.025205428022444 0.077425257940242 0.077425257940242 0.047293626226512 0.029048296050948 0.027488949420920 0.101913891865650
Portuguese 0.018536753873033 0.018536753677286 0.062928834342612 0.062928834342612 0.034291502520805 0.020814913227607 0.020136315358895 0.078312086849974
Russian 0.023056755870021 0.023056755548817 0.074674921973112 0.074674921973112 0.043210964100487 0.026428132295294 0.024918838796479 0.097923184051334
Spanish 0.019327903967912 0.019327902426048 0.063181759695345 0.063181759695345 0.035260489937938 0.021427567761218 0.020477824698236 0.078690048837189
Swedish 0.020319383955141 0.020319381632665 0.071750172750841 0.071750172750841 0.037923665751459 0.023933493288610 0.023873935849547 0.090423077401137
Thai 0.017851796897786 0.017851796603027 0.061178759436948 0.061178759436948 0.033053661037837 0.019769775458511 0.018596636636208 0.076548764958193
Turkish 0.024909582654976 0.024909579083363 0.079029734724098 0.079029734724098 0.044595721160219 0.029325854753074 0.027701259404730 0.105722681385448
Total 0.020165897967807 0.020165897967210 0.067880670389777 0.067880670389777 0.034224290132419 0.022422908766054 0.019009348701359 0.085968272679176

Table 5.

R-squared results for using the four features vertex_degree, phrase_length, eigenvector and betweenness centrality for each of the languages.

Language/Model Linear Regression Ridge Regression Lasso Regression Elastic Net Regression Decision Tree Random Forest Gradient Boosting Machine GMM
Arabic 0.668231405361397 0.668231553988513 -0.000771276833173 -0.000771276833173 0.408498740198355 0.635472709088694 0.657495249031473 -0.298280240990090
Chinese 0.729450626761948 0.729450792033005 -0.000481425281323 -0.000481425281323 0.492598198354606 0.700138548751708 0.709899243881501 -0.254385102310142
Czech 0.672926694028515 0.672926725959813 -0.000422420328038 -0.000422420328038 0.384733511274243 0.619044261243735 0.642386933184047 -0.308888908665246
English 0.707321592620632 0.707321638492237 -0.000747441490780 -0.000747441490780 0.471973954269712 0.661415598080704 0.667767550034253 -0.318390522188571
Finnish 0.722742078711091 0.722742114127700 -0.000323551055634 -0.000323551055634 0.498224865670929 0.695424901766864 0.704141031411308 -0.230717074076135
French 0.722742078711091 0.722742114127700 -0.000323551055634 -0.000323551055634 0.498224865670929 0.695424901766864 0.704141031411308 -0.230717074076135
Galician 0.689017088376618 0.689017115553612 -0.000955116302744 -0.000955116302744 0.436460683582461 0.664672804756430 0.672054409119377 -0.248586631624224
German 0.726762949392607 0.726763127401463 -0.000426895077754 -0.000426895077754 0.494358106753184 0.696789484663698 0.696820302434507 -0.243779123542593
Hindi 0.723863522701020 0.723863620630227 -0.000354968846125 -0.000354968846125 0.452858865457517 0.723863620630227 0.684210535230182 -0.227921967057732
Icelandic 0.712799159906727 0.712799306429452 -0.000454823234870 -0.000454823234870 0.474302510003958 0.674334796238256 0.676074232566568 -0.279527905051973
Indonesian 0.705532771218490 0.705532819217927 -0.000503091068536 -0.000503091068536 0.484159680284077 0.705532819217927 0.698797999828881 -0.290351155978656
Italian 0.702163239714396 0.702163320635012 -0.000715681561900 -0.000715681561900 0.443455091463391 0.674504473208440 0.685265206124121 -0.238154963649812
Japanese 0.741624222059655 0.741624244176905 -0.000323146729331 -0.000323146729331 0.554939132079185 0.734211453781428 0.736082944031855 -0.206097588914639
Korean 0.659720028186403 0.659720246051809 -0.000366410957912 -0.000366410957912 0.370054953654056 0.602683504555444 0.634749855740814 -0.393773699591188
Polish 0.674140819554199 0.674140911822583 -0.000222166202581 -0.000222166202581 0.388374590240733 0.624146782547441 0.644403494936484 -0.316518764727151
Portuguese 0.705097111166362 0.705097104110269 -0.000435739476198 -0.000435739476198 0.454290258083305 0.668778289588656 0.679813122652100 -0.244782536482362
Russian 0.691565051927168 0.691565088526099 -0.000404248385223 -0.000404248385223 0.420662726909824 0.646082674198140 0.666504251285823 -0.311660111791061
Spanish 0.694111902483234 0.694112072526492 -0.001219560969910 -0.001219560969910 0.440972603028769 0.660161123899398 0.675526734865881 -0.311660111791061
Swedish 0.716661741916394 0.716661885841392 -0.000743277962007 -0.000743277962007 0.470179954007153 0.665842539619768 0.666851601647136 -0.260698882611414
Thai 0.708254057149609 0.708254101559506 -0.000745693805230 -0.000745693805230 0.458682822166127 0.676617632575237 0.695856527406451 -0.251702488646375
Turkish 0.684416825406463 0.684417069460645 -0.000579268668109 -0.000579268668109 0.433369178494998 0.628173484750927 0.648963384818978 -0.338294458657109
Total 0.702917218900158 0.702917219011988 -0.000014743648380 -0.000014743648380 0.495803626639215 0.669670546659429 0.719960973566609 -0.266475292626924

Releases

No releases published

Packages

 
 
 

Contributors