Statistical methods and machine learning have wide application across many industries. There are many uses for these methods, and possibly even more algorithms one could choose from, so several factors must be taken into account when making a decision. The purpose of the method, whether it is to explain or to predict, must be considered. If prediction is the goal, one must then decide whether the problem is one of classification or regression.
Given a dataset, this case study evaluates three statistical methods used for classification, comparing their efficacy and weighing the differences between the models.
Little metadata is included with the dataset. There are roughly 100,000 observations and 130 features, a mix of numeric and categorical variables. The data is said to have come from banking data, and the feature to be predicted is labeled simply `target`. This feature is categorical and assumes the value 0 or 1, denoting True or False. The goal is to tune three statistical models to predict this target variable; the models are tuned using cross-validation and their results compared.
The three models evaluated are gradient boosting with XGBoost, a random forest classifier, and a support vector machine. The dataset is large enough to challenge these classifiers.
The accompanying code for the methods used can be found in the Code/Functions Section of the Appendix.
The first step in processing the data is splitting out the target column and dropping the ID column, which adds no value to the analysis (Appendix 6.2.1).
A brief EDA was conducted on the dataset; summary statistics were computed for the numeric and the categorical data. Some of the categorical features had an excessive number of factor levels and required processing, as this could become a problem after one-hot encoding: left unmodified, they would produce a very large, sparse matrix that may be difficult to work with during modeling.
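To make the concern concrete, here is a small toy illustration (hypothetical data, not from this dataset) of how categorical levels multiply into indicator columns:

```python
# Each categorical level becomes its own indicator column after one-hot
# encoding, so high-cardinality features inflate the matrix quickly.
import pandas as pd

toy = pd.DataFrame({"v_few": ["A", "B", "A", "C"],        # 3 levels
                    "v_many": ["L1", "L2", "L3", "L4"]})  # 4 levels
print(pd.get_dummies(toy).shape)  # (4, 7): 3 + 4 indicator columns
```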
Features with more than 20 categories were identified and plotted (Appendix 6.2.2). The goal of this level reduction was to have roughly 500 features after one-hot encoding. Ultimately, features `v22` and `v56` were identified as the most likely candidates for binning. A function, `bin_df_col()`, was created to facilitate the proper manipulation of these categories (Appendix 6.2.3). Factor levels with fewer than 140 entries were put into an “other” category to reduce the number of features. Once the appropriate features were binned, the data was one-hot encoded and partitioned into training and test sets (Appendix 6.2.4). The final number of features in the dataset was 453.
A custom function, `run_clf_grid()`, was written for hyper-parameter tuning (Appendix 6.2.5), along with a helper function, `run_clf()` (Appendix 6.2.6).
`run_clf_grid()` was used to tune the hyper-parameters of the classification models. The data is passed in along with a dictionary of parameters to tune; the keys of the dictionary are parameter names and the values are lists of parameter values to try. The `itertools.product` function is used to generate all possible combinations of parameters, and the provided classifier is run via the `run_clf()` function with every parameter combination (if no classifier is provided, XGBoost is used). The scores for each parameter combination are returned.
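The grid expansion inside `run_clf_grid()` works as in this minimal sketch, which mirrors the `zip`/`itertools.product` pattern used in the appendix code:

```python
# Expand a dict of candidate values into a list of parameter dicts,
# one per combination, exactly as run_clf_grid() does internally.
import itertools

grid = {"eta": [0.01, 0.1], "max_depth": [2, 4]}
param, param_values = zip(*grid.items())
param_list = [dict(zip(param, combo)) for combo in itertools.product(*param_values)]
print(param_list)
# [{'eta': 0.01, 'max_depth': 2}, {'eta': 0.01, 'max_depth': 4},
#  {'eta': 0.1, 'max_depth': 2}, {'eta': 0.1, 'max_depth': 4}]
```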
The `run_clf()` function is called by `run_clf_grid()`. It performs k-fold cross-validation: the provided hyper-parameters are unpacked and passed to the classifier, the classifier is fit on each training fold, and the log loss is calculated between the predictions on the test fold and the target labels. The fold scores are averaged to give the k-fold log loss, calculated with the following formula, where \(y_i\) is the target label and \(\hat{y}_i\) the predicted probability:
\[LogLoss = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln(\hat{y}_i) + (1-y_i) \ln(1-\hat{y}_i) \right]\]
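As a quick sanity check, the formula can be verified against scikit-learn's `log_loss` on a toy example:

```python
# Manual log loss versus sklearn.metrics.log_loss on toy probabilities.
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])                      # true labels
y_hat = np.array([0.9, 0.2, 0.6, 0.4])          # predicted P(y = 1)
manual = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(np.isclose(manual, log_loss(y, y_hat)))   # True
```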
An additional function, `run_xgb()`, was required for tuning the XGBoost classifier (Appendix 6.2.7). Cross-validation over the tuning parameters is conducted in a very similar fashion, and the results are returned in a dict. The best tuning iteration's prediction log loss and accuracy for XGBoost and the Random Forest were found by passing a tuple of parameters to these custom functions.
XGBoost was run on the data with various parameters and 3 folds using `run_clf_grid()` (Appendix 6.2.8). This algorithm is an optimized gradient boosting machine. Boosting is a technique in which the residuals of the model are iteratively used as the new training targets, so that the model learns from the mistakes of previous iterations.
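The idea can be sketched in a few lines. The toy below (hypothetical data) boosts on raw residuals with squared error for clarity; XGBoost itself fits trees to gradients of the chosen loss, but the learn-from-mistakes loop is the same:

```python
# Toy boosting loop: each round fits a shallow tree to the current residuals
# and adds a shrunken correction to the running prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

eta, pred = 0.1, np.zeros(200)
for _ in range(30):                        # boosting rounds
    residual = y - pred                    # mistakes become the new target
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += eta * tree.predict(X)          # eta damps each correction
```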
Here XGBoost is run with decision trees, and the parameters being tuned are:
Parameter | Effect | Values Used |
---|---|---|
`eta` | Learning rate controlling how aggressively the boosting adjusts the model between iterations | [0.001, 0.01, 0.1] |
`subsample` | The fraction of the training data subsampled each iteration to avoid overfitting | [0.25, 0.5] |
`colsample_bytree` | The fraction of columns considered when constructing each tree | [0.25, 0.5] |
`max_depth` | The maximum depth of the decision tree | [2, 4] |
`boost_rounds` | How many boosting iterations the model will run | [30, 60] |
The next algorithm under consideration is the Random Forest Classifier (Appendix 6.2.9). Random forests are also tree-based; however, the trees are built in parallel. A number of different decision trees are built on subsets of the data, new data is fed to the resulting set of trees, and the trees vote on the classification.
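A toy sketch of the bagging-and-voting idea follows (hypothetical data; the real `RandomForestClassifier` also subsamples features at each split):

```python
# Fit independent shallow trees on bootstrap samples and classify new
# observations by majority vote across the ensemble.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))   # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

votes = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print((votes >= 0.5).astype(int))           # majority vote for five observations
```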
The hyper-parameters being tuned are:
Parameter | Effect | Values Used |
---|---|---|
`n_estimators` | The number of trees | [10, 100] |
`max_depth` | The maximum depth of each tree | [2, 4] |
`max_features` | The number of features considered when looking for a split | [None, 'sqrt'] |
The support vector machine model was ultimately tuned with the assistance of the `GridSearchCV()` function from the `scikit-learn` package, using `LinearSVC()` (Appendix 6.2.10).
The parameters for the linear SVM:
Parameter | Effect | Values Used |
---|---|---|
`C` | Controls the strength of regularization; smaller values impose stronger regularization | [0.1, 1.0, 10, 100] |
`loss` | Specifies the loss function to use | ['squared_hinge'] |
`class_weight` | The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data (per the scikit-learn documentation) | ['balanced', None] |
Additionally, `dual=False` and `max_iter=10000` were passed to the function to make the algorithm run faster.
Tuning attempts were also conducted on a small sample of the data with an `SVC()` using `RandomizedSearchCV()` (Appendix 6.2.11).
The final objective was to observe the effect of scaling on the support vector machine. The dataset was sampled into sets of 1,000, 2,000, 5,000, and 10,000 observations using the same code, changing only the value of `n` (Appendix 6.2.12). These sampled datasets were run through an `SVC()` with default parameters, and the results were saved to a csv for easy retrieval.
 | ID | target | v1 | v2 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | v11 | v12 | v13 | v14 | v15 | v16 | v17 | v18 | v19 | v20 | v21 | v23 | v25 | v26 | v27 | v28 | v29 | v32 | v33 | v34 | v35 | v36 | v37 | v38 | v39 | v40 | v41 | v42 | v43 | v44 | v45 | v46 | v48 | v49 | v50 | v51 | v53 | v54 | v55 | v57 | v58 | v59 | v60 | v61 | v62 | v63 | v64 | v65 | v67 | v68 | v69 | v70 | v72 | v73 | v76 | v77 | v78 | v80 | v81 | v82 | v83 | v84 | v85 | v86 | v87 | v88 | v89 | v90 | v92 | v93 | v94 | v95 | v96 | v97 | v98 | v99 | v100 | v101 | v102 | v103 | v104 | v105 | v106 | v108 | v109 | v111 | v114 | v115 | v116 | v117 | v118 | v119 | v120 | v121 | v122 | v123 | v124 | v126 | v127 | v128 | v129 | v130 | v131
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 |
mean | 114228.93 | 0.7611987 | 1.630685674923 | 7.4644107807752 | 4.1450975035109 | 8.7423591810425 | 2.4364015815439 | 2.4839208102273 | 1.496568585079 | 9.0318585930527 | 1.8830457847125 | 15.4474126838914 | 6.8813044377403 | 3.7983963023606 | 12.0942788858509 | 2.0809106817235 | 4.9232224287135 | 3.8322704127588 | 0.841045485361 | 0.2223004779864 | 17.7735922 | 7.0297402 | 1.09308841985134 | 1.69812879 | 1.8760306041673 | 2.7434540369624 | 5.093328020265 | 8.2064159614183 | 1.622150683167 | 2.1616329797329 | 6.40623558906 | 8.1223868077918 | 13.375597717567 | 0.7414708467043 | 0.09092818 | 1.2371837560457 | 10.465928405775 | 7.1825513447041 | 12.9249650976855 | 2.2165969904458 | 10.7951692627232 | 9.142231088139 | 1.63052543 | 12.5380219089864 | 8.0165466385719 | 1.5042648853011 | 7.1981589983863 | 15.7112990521152 | 1.2538563 | 1.559556203 | 4.077827529765 | 7.7016530932587 | 10.5879446792157 | 1.7142943914356 | 14.5830349435487 | 1.0306943 | 1.68732734 | 6.3437132502133 | 15.8475567 | 9.2872753836876 | 17.564117 | 9.4493351122383 | 12.2699599 | 1.4317667 | 2.4333034251272 | 2.4050555276588 | 7.3073659482608 | 13.334481973067 | 2.2096999319881 | 7.287173817372 | 6.2083564881429 | 2.1738077203141 | 1.6079557040978 | 2.8222527267858 | 1.220184140291 | 10.1802156 | 1.9241838301535 | 1.5184252 | 0.9669125681134 | 0.5823667750715 | 5.4751845316194 | 3.8528830609624 | 0.6657576395036 | 6.4579519592415 | 7.622553947745 | 7.6676237703283 | 1.250720530646 | 12.0916229011791 | 6.8664137058717 | 2.8902887272399 | 5.2967158900748 | 2.6428276253487 | 1.08104522222 | 11.7913595537984 | 2.1526200477383 | 4.1812843773808 | 3.365313669889 | 13.5744509815506866 | 10.5480509331675 | 2.2912175835805 | 8.3038570644314 | 8.3646508021664 | 3.1689697933543 | 1.2912178861444 | 2.7375959775818 | 6.8224391421538 | 3.54993833 | 0.9198119852931 | 1.6726576869395 | 3.2395418343606 | 2.0303732349014 | 0.3101442 | 1.9257634679567 | 1.7393891285777 |
std | 65934.49 | 0.4263529 | 0.8132648607174 | 2.2250359927279 | 0.8626621261702 | 1.5434406499236 | 0.4506138012766 | 0.4427149883004 | 2.1097863790959 | 1.4495416816586 | 1.3934663832624 | 0.5933833698793 | 0.9241467383403 | 0.8831731273539 | 1.4439213329891 | 0.5504519731541 | 1.3446427992241 | 1.4360671215414 | 0.462864337893 | 0.1286811853652 | 0.8674296 | 1.0694018 | 2.98732218535776 | 2.24158271 | 0.4139845828121 | 0.6266564934556 | 2.011310849914 | 0.9654450102285 | 0.423243709016 | 0.7396951573049 | 2.02419539908 | 1.0062805681003 | 1.785729106552 | 0.4065718695202 | 0.58347762 | 1.7710758856348 | 3.1676437185216 | 0.7544255445798 | 0.7487952252983 | 0.4866693177728 | 1.5858589673518 | 1.5505828176218 | 2.19532169 | 1.6499256735956 | 0.6779729929047 | 1.1678898433356 | 1.8730575227974 | 0.6003598481219 | 1.754598 | 0.626683038593 | 0.5092542056744 | 5.1380645873115 | 1.5563992428107 | 0.4037792043052 | 1.5934386536928 | 0.6962441 | 2.24951155 | 1.8974181415614 | 1.4104959 | 0.8437106405939 | 1.719832 | 1.4267009120726 | 1.7543601 | 0.9222675 | 0.5998123098125 | 1.0395551395494 | 0.9433861663182 | 1.3842291866628 | 0.8072647286202 | 1.6856685581687 | 2.788207549548 | 0.79784990138 | 0.706908758694 | 1.0618557422559 | 0.349851975096 | 2.2735704 | 0.7875292870987 | 2.13245315 | 0.1343842960905 | 0.1803984566689 | 1.2320110796948 | 0.6421580106347 | 0.1983492717587 | 0.8415485280468 | 1.444984138095 | 1.7627602611966 | 0.3465493424595 | 5.1734083103548 | 1.7690092157239 | 1.3541209028561 | 0.9229137705948 | 0.6652710797571 | 1.70317177617 | 2.2193463583259 | 0.692217074898 | 2.8139505527565 | 1.117152356997 | 2.6128782936866974 | 1.4274426584522 | 0.5034027060615 | 2.7426922561405 | 1.5035793392475 | 3.1636039920245 | 0.5545506102771 | 1.0186034472285 | 1.3486997532297 | 1.94343091 | 1.5915548723072 | 0.3779128300971 | 1.2212253350593 | 0.8143412614486 | 0.6932616 | 0.9496402470355 | 0.8518204492634 |
min | 3 | 0 | -0.0000009996497 | -0.0000009817614 | -0.0000006475929 | -0.0000005287068 | -0.0000009055091 | -0.0000009468765 | -0.0000007783778 | -0.0000009828757 | -0.0000009875317 | -0.0000001459062 | 0.0000005143224 | -0.0000008464889 | -0.0000009738831 | -0.0000008830427 | -0.0000009978294 | -0.0000009066455 | 0.000000447547 | -0.0000005178987 | 1.5167764 | 0.1061806 | -0.0000009999932 | 0.04104324 | -0.0000009346696 | -0.0000009915986 | -0.000000696088 | -0.0000003040753 | -0.000000955996 | -0.0000009713108 | -0.000000670767 | -0.0000009958327 | -0.0000004906628 | -0.0000009999498 | 0 | -0.0000009999742 | 0.0000001238996 | -0.0000007272275 | -0.0000006206144 | -0.0000009724295 | -0.0000009482212 | -0.0000009202112 | 0.06934879 | -0.0000009924422 | -0.0000006697975 | -0.0000009091393 | -0.0000003616122 | -0.0000009838107 | 0.0130615 | -0.000000984682 | -0.0000007570607 | -0.0000009976841 | -0.0000008803695 | -0.0000009970164 | -0.0000006113428 | 0 | 0.05305528 | -0.0000008770837 | 0.6592957 | -0.0000002413161 | 1.501359 | -0.0000009920262 | 0.4270946 | 0 | -0.0000009838592 | -0.0000006171967 | -0.0000007729896 | -0.0000009902572 | -0.0000009992919 | -0.0000001443765 | -0.0000007251767 | -0.0000009865679 | -0.0000009994245 | -0.0000009995741 | -0.000000981431 | 0.8723955 | -0.0000009990082 | 0.02236518 | -0.0000002343004 | 0.0000000339321 | 0.0000004251919 | -0.0000009687809 | -0.0000005784674 | -0.0000008832504 | -0.000000905952 | -0.0000005605928 | -0.0000009787309 | -0.0000009981311 | -0.0000003942767 | -0.0000007111493 | -0.0000009757743 | -0.0000008683169 | 0.00009364981 | -0.0000005467029 | 0.0000004251643 | -0.0000009996884 | -0.000000999027 | -0.0000000009335327 | -0.0000009853189 | -0.0000009450359 | -0.0000009991992 | -0.0000001695463 | -0.0000009998183 | -0.0000009932534 | -0.0000009820642 | -0.0000009978497 | 0.01913856 | -0.0000009994953 | -0.0000009564174 | -0.0000009223798 | 0.0000008197812 | 0 | -0.0000009901257 | -0.0000009999134 |
25% | 57280 | 1 | 1.34615289971 | 6.57577033681 | 4.0686969218 | 8.39409001481 | 2.3409675593 | 2.37658602471 | 0.265314672374 | 8.81356030231 | 1.05032820659 | 15.3982297448 | 6.32262474904 | 3.46408745982 | 11.2560173031 | 1.90568564088 | 4.70588270329 | 3.37982887932 | 0.719486145362 | 0.192411804374 | 17.7735922 | 6.4187546 | 0.00000006196688 | 0.27447507 | 1.755531054 | 2.56647467162 | 4.74235525242 | 8.11437410392 | 1.49129133927 | 1.83074195405 | 5.0558001341 | 7.89614964569 | 13.0993453186 | 0.59071649453 | 0 | 0.305219112655 | 8.41038973367 | 7.06761850192 | 12.8131706654 | 2.06896530006 | 10.5425569114 | 8.88634100226 | 0.25630139 | 12.1569310021 | 7.91400039216 | 0.658792378256 | 6.83726745823 | 15.6677184587 | 0.2081914 | 1.27659511766 | 3.97664923776 | 4.06135866729 | 10.2167970814 | 1.59960415915 | 14.5830349435487 | 1 | 0.27093532 | 5.85294439423 | 15.8475567 | 9.16798396465 | 17.564117 | 9.27007251269 | 12.0874811 | 1 | 2.23611757389 | 2.06003010584 | 7.19022597493 | 13.1517514492 | 1.95122020344 | 7.20499240376 | 3.59298457736 | 1.81534571891 | 1.31803793054 | 2.44394683635 | 1.10841063474 | 9.5518364 | 1.62351387723 | 0.22454102 | 0.949615010065 | 0.517926838695 | 5.13586626368 | 3.65008498653 | 0.59461319143 | 6.36225348979 | 7.18954304011 | 7.29994016973 | 1.17270867441 | 12.0916229011791 | 6.34055270755 | 2.3290450503 | 4.98815003049 | 2.43201770861 | 0.17354816907 | 11.702030193 | 1.85200656575 | 2.72121863922 | 2.92385753715 | 11.9966658175000003 | 10.2666662017 | 2.1390365024 | 7.60399574885 | 7.86516760377 | 1.1694252228 | 1.05263168344 | 2.28260878443 | 6.51960744691 | 2.57105312 | 0.0847131977909 | 1.57097366948 | 2.76249711473 | 1.68126114914 | 0 | 1.44947705168 | 1.46341418483 |
50% | 114189 | 1 | 1.630685674923 | 7.4644107807752 | 4.1450975035109 | 8.7423591810425 | 2.4364015815439 | 2.4839208102273 | 1.496568585079 | 9.0318585930527 | 1.31291009873 | 15.4474126838914 | 6.6132409216 | 3.7983963023606 | 11.9678254667 | 2.0809106817235 | 4.9232224287135 | 3.8322704127588 | 0.841045485361 | 0.2223004779864 | 17.7735922 | 7.0393655 | 0.33059372542 | 1.69812879 | 1.8760306041673 | 2.7434540369624 | 5.093328020265 | 8.2064159614183 | 1.622150683167 | 2.1616329797329 | 6.53443367796 | 8.1223868077918 | 13.375597717567 | 0.7414708467043 | 0 | 1.2371837560457 | 10.3393377646 | 7.1825513447041 | 12.9249650976855 | 2.2165969904458 | 10.7951692627232 | 9.142231088139 | 1.63052543 | 12.5380219089864 | 8.0165466385719 | 1.21194423179 | 7.1981589983863 | 15.7112990521152 | 1.2538563 | 1.559556203 | 4.077827529765 | 7.7016530932587 | 10.5879446792157 | 1.7142943914356 | 14.5830349435487 | 1 | 1.68732734 | 6.3437132502133 | 15.8475567 | 9.2872753836876 | 17.564117 | 9.4493351122383 | 12.2699599 | 1 | 2.4333034251272 | 2.4050555276588 | 7.3073659482608 | 13.334481973067 | 2.2096999319881 | 7.287173817372 | 6.2083564881429 | 2.1738077203141 | 1.6079557040978 | 2.8222527267858 | 1.220184140291 | 10.1802156 | 1.9241838301535 | 1.5184252 | 0.9669125681134 | 0.5823667750715 | 5.4751845316194 | 3.8528830609624 | 0.6657576395036 | 6.4579519592415 | 7.622553947745 | 7.6676237703283 | 1.250720530646 | 12.0916229011791 | 6.8664137058717 | 2.8902887272399 | 5.2967158900748 | 2.6428276253487 | 1.08104522222 | 11.7913595537984 | 2.1526200477383 | 4.1812843773808 | 3.365313669889 | 14.0388799090000003 | 10.5480509331675 | 2.2912175835805 | 8.3038570644314 | 8.3646508021664 | 3.1689697933543 | 1.2912178861444 | 2.7375959775818 | 6.8224391421538 | 3.54993833 | 0.9198119852931 | 1.6726576869395 | 3.2395418343606 | 2.0303732349014 | 0 | 1.9257634679567 | 1.7393891285777 |
75% | 171206 | 1 | 1.630685674923 | 7.55150068977 | 4.34022894242 | 8.92479757997 | 2.48469939431 | 2.52844500954 | 1.496568585079 | 9.3023251243 | 2.10065719345 | 15.5938957785 | 7.01940368178 | 3.7983963023606 | 12.7157742438 | 2.0809106817235 | 5.14285647543 | 3.8322704127588 | 0.841045485361 | 0.2223004779864 | 18.1546039 | 7.6665218 | 1.09308841985134 | 1.69812879 | 1.89891343478 | 2.77910334858 | 5.33033911741 | 8.47939326721 | 1.622150683167 | 2.1616329797329 | 7.70145144761 | 8.25075880327 | 14.3249233609 | 0.7414708467043 | 0 | 1.2371837560457 | 12.7624628502 | 7.34477205535 | 13.0496460315 | 2.2374883411 | 11.0221009644 | 9.41516321173 | 1.63052543 | 12.6746325614 | 8.13559232975 | 2.00572213229 | 7.41788211852 | 15.8715598393 | 1.2538563 | 1.559556203 | 4.15366204094 | 7.7016530932587 | 10.8395399737 | 1.73501642403 | 15.3129111878 | 1 | 1.68732734 | 6.38440109403 | 16.4708469 | 9.46899440238 | 18.4375 | 9.73384058789 | 12.9166001 | 2 | 2.43664774246 | 2.4050555276588 | 7.55221374391 | 13.5593215242 | 2.24358989361 | 7.8230073696 | 6.2083564881429 | 2.1738077203141 | 1.6079557040978 | 2.8222527267858 | 1.220184140291 | 10.4335862 | 1.9241838301535 | 1.5184252 | 0.990101887728 | 0.5823667750715 | 5.4751845316194 | 3.8528830609624 | 0.6657576395036 | 6.6690010731 | 7.71084428042 | 8.00612317966 | 1.30166574802 | 15.6972116963 | 6.93118555837 | 2.8902887272399 | 5.2967158900748 | 2.6428276253487 | 1.08104522222 | 12.4436323096 | 2.1526200477383 | 4.1812843773808 | 3.365313669889 | 15.372185696099999 | 10.7189543566 | 2.31017036209 | 8.6453665737 | 8.41772155054 | 3.1689697933543 | 1.2912178861444 | 2.7375959775818 | 6.99999915782 | 3.54993833 | 0.9198119852931 | 1.6726576869395 | 3.2395418343606 | 2.0303732349014 | 0 | 1.9257634679567 | 1.7393891285777 |
max | 228713 | 1 | 20.0000006294 | 19.9999999087 | 19.9999997446 | 20.0000003539 | 20.0000005964 | 19.9999998141 | 20.000000997 | 20.0000007502 | 18.5339164478 | 20.0000009233 | 18.7105503906 | 20.0000009059 | 19.9999996125 | 19.9999990205 | 20.0000009395 | 19.9999992246 | 20.0000009159 | 20.0000007723 | 20.000001 | 19.296052 | 20.0000009982 | 20.000001 | 19.9999996744 | 20.00000029 | 19.8481942126 | 19.9999999189 | 17.5609751445 | 20.0000002023 | 20.000000815 | 20.0000000877 | 20.0000002071 | 20.000000355 | 12 | 19.9155262925 | 19.9999991256 | 19.999999255 | 19.999999748 | 20.0000004033 | 19.8316812461 | 20.0000005511 | 20.000001 | 19.9999996424 | 19.9999995929 | 19.9999991454 | 20.0000008581 | 20.0000009228 | 20.000001 | 20.0000001945 | 19.9999997458 | 20.0000009786 | 19.9999994001 | 20.0000004966 | 18.8469601202 | 7 | 20.000001 | 19.9999990118 | 20.000001 | 20.0000009583 | 20.000001 | 20.0000009988 | 19.8163109 | 12 | 19.9999996418 | 20.0000005707 | 15.9735089981 | 20.0000009888 | 19.9999996377 | 20.0000009706 | 20.0000002311 | 20.0000002847 | 20.0000009699 | 20.0000008889 | 17.5609751085 | 19.8427544 | 20.0000000942 | 20.000001 | 6.30577492863 | 8.92384346581 | 19.9999994839 | 19.0163118283 | 9.0705377032 | 19.9999997693 | 20.0000008332 | 19.0587996349 | 19.9999997618 | 20.0000009983 | 20.0000004219 | 20.0000001121 | 18.7752514628 | 20.0000006045 | 20.0000009841 | 20.0000003358 | 20.0000005938 | 20.0000009996 | 20.0000005066 | 19.999999675599998 | 20.0000009715 | 20.0000007973 | 20.0000007954 | 20.0000009555 | 20.0000004599 | 10.3942654912 | 20.0000008035 | 20.0000009324 | 19.68606924 | 20.0000009992 | 15.6316128253 | 19.9999990947 | 20.000000402 | 11 | 19.9999995909 | 20.0000009426 |
The numeric variables were inspected to identify any anomalies. It appears as though the data may have already undergone some type of transformation, as a majority of the features have a maximum value of 20 and a minimum of around 0. No further processing of the numeric variables was conducted.
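The pattern is easy to confirm with a sketch like the following, assuming the prepared `data` frame from Appendix 6.2.1:

```python
# Per-column minima and maxima for the numeric features; most minima sit
# near 0 and most maxima near 20, consistent with a prior transformation.
num_data = data.loc[:, data.dtypes != object]
print(num_data.min().describe())
print(num_data.max().describe())
```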
 | v3 | v22 | v24 | v30 | v31 | v47 | v52 | v56 | v66 | v71 | v74 | v75 | v79 | v91 | v107 | v110 | v112 | v113 | v125
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 |
unique | 3 | 18210 | 5 | 7 | 3 | 10 | 12 | 122 | 3 | 9 | 3 | 4 | 18 | 7 | 7 | 3 | 22 | 36 | 90 |
top | C | AGDF | E | C | A | C | J | BW | A | F | B | D | C | A | E | A | F | G | BM |
freq | 114041 | 2886 | 55177 | 92288 | 91804 | 55425 | 11106 | 18233 | 70353 | 75094 | 113560 | 75087 | 34561 | 27082 | 27082 | 55688 | 22053 | 71556 | 5836 |
The categorical features revealed the sheer number of factors some of them contained; a few identified as having over 20 categories were `v22`, `v56`, and `v125`. These features are plotted below.
Figure 4.1.1 Level Balance for v22 (Note: x-axis labels are turned off because the large number of levels made them unreadable)
Figure 4.1.2 Level Balance for v56
Figure 4.1.3 Level Balance for v125
Columns `v22` (Figure 4.1.1) and `v56` (Figure 4.1.2) have a large number of categories, each with relatively few observations, and were binned.
Column `v125` (Figure 4.1.3) has over 100 categories, but many appear to have at least 1000 samples, so this column was kept unaltered.
Inspection of the value counts for column `v22` shows a dropoff in value counts after the value `HUU`, which has a count of 146.
A similar inspection of column `v56` shows a sharp dropoff after the value `CF`, with 141 value counts. These two features were binned so that all categories with value counts below 140 in columns `v22` and `v56` were denoted “other”.
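A sketch of how the cutoff can be confirmed, assuming the `data` frame from the appendix:

```python
# Levels of v22 below the 140-count cutoff are the ones collapsed to "other".
vc = data['v22'].value_counts()
print(vc[vc >= 140].index[-1], vc[vc >= 140].iloc[-1])  # HUU 146 in this data
print((vc < 140).sum())  # number of levels that will be binned
```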
Figure 4.1.4 Level Balance for v22 after binning
Figure 4.1.5 Level Balance for v56 after binning
Figures 4.1.4 and 4.1.5 show the features `v22` and `v56` after binning. A majority of the observations in the `v22` feature now fall into the “other” category, but the remaining levels still preserve the diversity of the data.
Parameter | Best Tune |
---|---|
boost_round | 60 |
booster | gbtree |
colsample_bytree | 0.5 |
eta | 0.1 |
eval_metric | logloss |
max_depth | 4 |
objective | binary:logistic |
subsample | 0.5 |
The XGBoost model arguably performed the best of all the models (full results in the appendix). The log-loss score for the best out-of-fold prediction was 0.4728, with an accuracy of 78.08 %. The log-loss score was by far the best, and the accuracy was very good, though not far and away better than the other models'. This model also took the shortest amount of time to tune, which is surprising given that more parameter combinations were investigated for this algorithm than for the Random Forest or the Support Vector Machine.
Parameter | Best Tune |
---|---|
max_depth | 4 |
max_features | None |
n_estimators | 10 |
The Random Forest did not perform very well on this classification task; the best parameters from cross-validation are shown above (full results in the appendix). The log loss, 7.6979, was far worse than XGBoost's, but the accuracy was middle of the pack at 77.71 %. The severity of the log loss follows from `run_clf()` scoring hard 0/1 class predictions rather than probabilities, so each misclassification incurs the maximum penalty. This model also took the longest to tune (`SVC()` excluded).
Parameter | Best Tune |
---|---|
C | 1 |
class_weight | None |
loss | squared_hinge |
Compared to the other models, the `LinearSVC()` yielded the least desirable accuracy score at 77.19 %. The `GridSearchCV()` function was used to find the best model, and the runtime wasn't the worst. Note that `LinearSVC()` does not produce probability estimates, which is why no log loss is reported for this model in the summary table.
Early attempts at tuning with `RandomizedSearchCV()` and/or `SVC()` were unsuccessful. `RandomizedSearchCV()` was initially run with `SVC()` on the 1000-sample dataset; even with `n_iter` at 500, the search would run very fast, under a minute. Once the dataset exceeded 5000 samples, the search struggled mightily. Even after the kernel was fixed to linear, the classifier was changed to `LinearSVC()`, `n_iter` was left at the default of 10, and `n_jobs` was set to 1, the search still took many hours and wouldn't complete. The next section displays the runtimes for the SVM, as this algorithm is notorious for long run times.
Figure 4.5.1 Sampled Support Vector Machine runtime results
The SVM runtime results are shown in Figure 4.5.1 above. As seen in this visualization, and as experienced during model building, this classifier does not scale well with data: training time grows much faster than linearly in the number of samples (roughly quadratic to cubic for kernel SVMs).
 | XGBoost | Random Forest | Support Vector Machine |
---|---|---|---|
Log Loss | 0.4728 | 7.6979 | ——— |
Accuracy | 78.08 % | 77.71 % | 77.19 % |
Table 5.0.1 above shows the comprehensive results for this classification task. XGBoost performed the best: it had the shortest time to tune and produced the best metrics on a holdout validation set of data.
This case study illustrated the balances and trade-offs one must navigate when choosing a machine learning algorithm or statistical method. Each of these models may perform comparably well on real data, as the accuracy scores seem similar enough, though this assertion would need validation. An assessment like this is a good starting point, and further model tuning could be conducted.
data = plot_data.copy()
## test the code with a subset of the data
#data = data[0:1000]
target = data['target']
data.drop(['target', 'ID'],inplace=True, axis=1)
The above code will prepare the data for analysis.
### Find cols with over 20 categories
cat_data = data.loc[:, data.dtypes == object]
col_bin_candidates = dict()
for col in cat_data:
    category_count = len(data[col].value_counts())
    if category_count > 20:
        col_bin_candidates[col] = category_count
col_bin_candidates
Finds the columns with more than 20 categories and prints the result.
### Visualize values for cols with over 20 categories
for col in col_bin_candidates:
    counts = data[col].value_counts().to_frame()
    fig = plt.figure()
    if len(counts) < 50:
        plt.title("{c} Bar Chart".format(c=col))
        plt.bar(counts.index, counts[col])
    else:
        plt.title("{c} Histogram".format(c=col))
        plt.hist(counts[col])
### Look for bin point col v22
data['v22'].value_counts()[0:50]
### Look for bin point col v56
data['v56'].value_counts()[0:50]
def bin_df_col(df, col, cutoff):
    vc = df[col].value_counts().to_frame()
    below_cutoff = vc[vc[col] < cutoff].index
    df.loc[(df[col].isin(below_cutoff)), col] = 'Other'
    return df
The above function creates an “other” category based on the value passed as `cutoff`, reducing the number of levels in the feature.
### Bin cols based on observations above
data = py_scr.bin_df_col(data, 'v22', 140)
data = py_scr.bin_df_col(data, 'v56', 140)
data_ohe = pd.get_dummies(data)
X_train, X_test, y_train, y_test = train_test_split(data_ohe, target, test_size=0.33, random_state=42)
This is the final data step, which reduces the number of factors in some columns, one-hot encodes the data, and partitions it into train and test sets.
def run_clf_grid(data, clf_hyper_grid, return_best=False, boost_rounds=None, clf=None):
    clf_scores = []
    param, param_values = zip(*clf_hyper_grid.items())
    param_list = [dict(zip(param, param_value)) for param_value in itertools.product(*param_values)]
    for params in param_list:
        if clf:
            score = run_clf(clf, data, params)
            clf_scores.append(score)
        else:
            for boost_round in boost_rounds:
                score = run_xgb(data, params, boost_round)
                clf_scores.append(score)
    clf_scores.sort(key=lambda x: x['log_loss'])
    if return_best:
        clf_scores = [clf_scores[0]]
    return clf_scores
This code accepts a dict of parameter values, iterates through every combination, and performs cross-validation on the model for each to tune for the best parameters.
def run_clf(a_clf, data, clf_hyper):
    M, L, n_folds = data  # unpack data container
    kf = KFold(n_splits=n_folds)  # establish the cross-validation
    scores = []
    for ids, (train_index, test_index) in enumerate(kf.split(M, L)):
        clf = a_clf(**clf_hyper)  # unpack parameters into clf if they exist
        clf.fit(M.iloc[train_index], L.iloc[train_index])
        pred = clf.predict(M.iloc[test_index])  # hard class labels, not probabilities
        score_log_loss = log_loss(L.iloc[test_index], pred)
        pred[pred < 0.5] = 0
        pred[pred >= 0.5] = 1
        score_acc = accuracy_score(L.iloc[test_index], pred)
        scores.append((score_log_loss, score_acc))
    ret = {
        'clf': str(clf),
        'log_loss': sum([score[0] for score in scores]) / float(len(scores)),
        'accuracy': sum([score[1] for score in scores]) / float(len(scores))
    }
    return ret
This function performs a round of cross-validation on a classifier. One must pass the desired model, the data, and the parameters for that iteration of cross-validation. This function was used for the random forest tuning. Note that it scores `clf.predict()` output, i.e. hard class labels rather than probabilities, which inflates the log loss for misclassified observations.
def run_xgb(data, clf_hyper, boost_round):
    M, L, n_folds = data  # unpack data container
    kf = KFold(n_splits=n_folds)  # establish the cross-validation
    scores = []
    for ids, (train_index, test_index) in enumerate(kf.split(M, L)):
        xgtrain = xgb.DMatrix(M.iloc[train_index].values, L.iloc[train_index].values)
        xgtest = xgb.DMatrix(M.iloc[test_index].values, L.iloc[test_index].values)
        clf = xgb.train(
            clf_hyper,
            xgtrain,
            num_boost_round=boost_round,
            verbose_eval=True,
            maximize=False
        )
        pred = clf.predict(xgtest, ntree_limit=clf.best_iteration)
        score_log_loss = log_loss(L.iloc[test_index], pred)
        pred[pred < 0.5] = 0
        pred[pred >= 0.5] = 1
        score_acc = accuracy_score(L.iloc[test_index], pred)
        scores.append((score_log_loss, score_acc))
    ret = {
        'params': clf_hyper,
        'boost_round': boost_round,
        'log_loss': sum([score[0] for score in scores]) / float(len(scores)),
        'accuracy': sum([score[1] for score in scores]) / float(len(scores))
    }
    return ret
This function will perform one round of cross-validation for the XGBoost classifier. The data, hyper-parameters and the number of boosting rounds are required.
xgboost_hyper = {
    "objective": ["binary:logistic"],
    "booster": ["gbtree"],
    "eval_metric": ["logloss"],
    "eta": [0.001, 0.01, 0.1],
    "subsample": [.25, .5],
    "colsample_bytree": [0.25, 0.5],
    "max_depth": [2, 4]
}
clf_data = (data_ohe, target, 3)
xgb_scores = py_scr.run_clf_grid(clf_data, xgboost_hyper, boost_rounds=[30,60])
The above shows the function call to run the XGBoost tuning.
r_clf = RandomForestClassifier
r_clf_hyper_grid = {
    'n_estimators': [10, 100],
    'max_depth': [2, 4],
    'max_features': [None, 'sqrt']
}
rf_scores = py_scr.run_clf_grid(clf_data, r_clf_hyper_grid, clf=r_clf)
This code is what was used to tune the random forest classifier.
svm_param_dist = {
    'C': [0.1, 1.0, 10, 100],
    'loss': ['squared_hinge'],
    'class_weight': ['balanced', None]
}
lin_svm = LinearSVC(dual=False, max_iter=10000)
svm_clf = GridSearchCV(lin_svm, svm_param_dist, n_jobs=-1)
search = svm_clf.fit(X_train, y_train)
lsvm_y_preds = svm_clf.predict(X_test)
acc_lsvm = accuracy_score(lsvm_y_preds, y_test)
This code was used to tune the `LinearSVC()`.
from scipy.stats import uniform, expon
sample_data = data.sample(n=2000, random_state=2)
sample_target = sample_data['target']
sample_data.drop(['target', 'ID'],inplace=True, axis=1)
sample_data_ohe = pd.get_dummies(sample_data)
X_train_smp, X_test_smp, y_train_smp, y_test_smp = train_test_split(sample_data_ohe, sample_target, test_size=0.33, random_state=42)
svm_param_dist = {
    'C': expon(scale=100),
    'gamma': expon(scale=.1),
    'kernel': ['linear']
}
svm_tune = SVC()
clf_tune = RandomizedSearchCV(svm_tune, svm_param_dist, cv=5, n_iter=10, random_state=0, n_jobs=-1)
This code is what was attempted with `RandomizedSearchCV()`, without success.
n=1000
sample_data = data.sample(n=n, random_state=2)
sample_target = sample_data['target']
sample_data.drop(['target', 'ID'],inplace=True, axis=1)
sample_data = py_scr.bin_df_col(sample_data, 'v22', 140)
sample_data = py_scr.bin_df_col(sample_data, 'v56', 140)
sample_data_ohe = pd.get_dummies(sample_data)
X_train_smp, X_test_smp, y_train_smp, y_test_smp = train_test_split(sample_data_ohe, sample_target, test_size=0.33, random_state=42)
svm = SVC()
_=svm.fit(X_train_smp, y_train_smp)
svm_y_preds = svm.predict(X_test_smp)
acc_svm = accuracy_score(svm_y_preds, y_test_smp)
This code is what was timed during the sampling exercise; the value of `n` was changed according to the desired condition and the code was re-run.
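The timing mechanism itself is not shown above; a minimal sketch, assuming `time.perf_counter()` as the clock, would wrap the fit like this:

```python
import time

start = time.perf_counter()
svm = SVC()
_ = svm.fit(X_train_smp, y_train_smp)  # the step whose runtime is measured
elapsed = time.perf_counter() - start
print(f"n={n}: fit took {elapsed:.1f} s")
```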
## [
## {
## "accuracy": 0.7807576910628843,
## "boost_round": 60,
## "log_loss": 0.4728152353881004,
## "params": {
## "boost_round": 60,
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7799179503328347,
## "boost_round": 60,
## "log_loss": 0.4746925072993398,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7799704341284629,
## "boost_round": 60,
## "log_loss": 0.4760574659643522,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7792706501867549,
## "boost_round": 60,
## "log_loss": 0.477636326600088,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7795943002597948,
## "boost_round": 30,
## "log_loss": 0.48019136665380885,
## "params": {
## "boost_round": 60,
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7794018596758251,
## "boost_round": 60,
## "log_loss": 0.4806111771501589,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7792706501867549,
## "boost_round": 60,
## "log_loss": 0.4810116930080995,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7793056393838401,
## "boost_round": 30,
## "log_loss": 0.48111711586009626,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7774337173397713,
## "boost_round": 60,
## "log_loss": 0.48380703866094527,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7774424646390427,
## "boost_round": 60,
## "log_loss": 0.4841062634560416,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7771887929601736,
## "boost_round": 30,
## "log_loss": 0.48600775300587057,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.777477453836128,
## "boost_round": 30,
## "log_loss": 0.48671780543123405,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7764890090184656,
## "boost_round": 30,
## "log_loss": 0.49079358783366933,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7761566116461541,
## "boost_round": 30,
## "log_loss": 0.4909624186022256,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7724127675580165,
## "boost_round": 30,
## "log_loss": 0.4948742440031848,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7710131996746005,
## "boost_round": 30,
## "log_loss": 0.49521669686653175,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7750806938357783,
## "boost_round": 60,
## "log_loss": 0.5666707977131821,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7757279939818581,
## "boost_round": 60,
## "log_loss": 0.5667872764083817,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7648551009875701,
## "boost_round": 60,
## "log_loss": 0.573243505943783,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7649250793817409,
## "boost_round": 60,
## "log_loss": 0.5734722548385941,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7614436542717437,
## "boost_round": 60,
## "log_loss": 0.5746586614921271,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7620647125200094,
## "boost_round": 60,
## "log_loss": 0.5747820783200127,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.5789357752843304,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.5789745847921156,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.773077562302639,
## "boost_round": 30,
## "log_loss": 0.6156225417446154,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7736811259523622,
## "boost_round": 30,
## "log_loss": 0.6156942938979568,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7633068290165412,
## "boost_round": 30,
## "log_loss": 0.6192451318553448,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7635255114983249,
## "boost_round": 30,
## "log_loss": 0.6193647432998911,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6206327503523822,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6206662014971803,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6226969775009574,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6227177228266377,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7726052081419862,
## "boost_round": 60,
## "log_loss": 0.6739744067404801,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7729900893099256,
## "boost_round": 60,
## "log_loss": 0.6740059350516693,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6750018174868662,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7613824231768441,
## "boost_round": 60,
## "log_loss": 0.6750333793775128,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6752258273426545,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6752363492087136,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6758209735131904,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6758244855568595,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7726139554412574,
## "boost_round": 30,
## "log_loss": 0.6834792467182454,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7724127675580165,
## "boost_round": 30,
## "log_loss": 0.6835009979140817,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7612162244906885,
## "boost_round": 30,
## "log_loss": 0.683945695543021,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7618110408411404,
## "boost_round": 30,
## "log_loss": 0.6839629448266957,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.684114072931942,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6841175453950736,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6843646475196623,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6843670570345989,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## }
## ]
## [
## {
## "accuracy": 0.7771275618652741,
## "clf": "RandomForestClassifier(max_depth=4, max_features=None, n_estimators=10)",
## "log_loss": 7.69789401226916
## },
## {
## "accuracy": 0.7770750780696459,
## "clf": "RandomForestClassifier(max_depth=4, max_features=None)",
## "log_loss": 7.699707088066617
## },
## {
## "accuracy": 0.7651437618635247,
## "clf": "RandomForestClassifier(max_depth=2, max_features=None, n_estimators=10)",
## "log_loss": 8.11183392616853
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=2, max_features='sqrt', n_estimators=10)",
## "log_loss": 8.248094615957752
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=4, max_features='sqrt', n_estimators=10)",
## "log_loss": 8.248094615957752
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=2, max_features=None)",
## "log_loss": 8.248094615957752
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=2, max_features='sqrt')",
## "log_loss": 8.248094615957752
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=4, max_features='sqrt')",
## "log_loss": 8.248094615957752
## }
## ]