Case Study 8

1 Introduction

Statistical methods and machine learning have wide application across many industries. There are many uses for these methods, and possibly even more algorithms one could choose, so several factors must be taken into account when selecting one. The purpose of the method, whether it is to explain or to predict, should be considered first. If prediction is the goal, one must then decide whether the problem is one of classification or regression.

Given a dataset, this case study evaluates three statistical methods for classification, comparing their efficacy and weighing the differences between the models.

2 Background

Little metadata is included with the dataset. It contains 114,321 observations and 131 features that are a mix of numeric and categorical variables (Tables 4.1.1 and 4.1.2). The data is said to come from a banking context, and the feature to be predicted is labeled simply target. This feature is categorical and assumes the value 0 or 1. The goal is to tune three statistical models to predict this target variable. The models will be tuned using cross-validation and their results will be compared.

The three models that will be evaluated are gradient boosting with XGBoost, a random forest classifier, and a support vector machine. The dataset is large enough to challenge these classifiers.

3 Methods

The accompanying code for the methods used can be found in the Code/Functions section of the Appendix.

3.1 Exploratory Data Analysis and Data Processing

The first step in processing the data is splitting out the target column and dropping the ID column, which adds no value to the analysis (Appendix 6.2.1).

A brief EDA was conducted on the dataset; summary statistics were computed for the numeric and categorical data. Some of the categorical features had an excessive number of levels and required processing, as this could become a problem after one-hot encoding. Left unmodified, they would lead to a very large, sparse matrix that may be difficult to work with during modeling.

Features with more than 20 categories were identified and plotted (Appendix 6.2.2). The goal of this level reduction was to have roughly 500 features after one-hot encoding. Ultimately, features v22 and v56 were identified as the most likely candidates for binning. A function, bin_df_col(), was created to facilitate this manipulation (Appendix 6.2.3). Factor levels with fewer than 140 entries were put into an "Other" category to reduce the number of features. Once the appropriate features were binned, the data was one-hot encoded and partitioned into training and test sets (Appendix 6.2.4). The final number of features in the dataset was 453.

3.2 Model Tuning

A custom function, run_clf_grid(), was written for hyperparameter tuning (Appendix 6.2.5), along with a helper function, run_clf() (Appendix 6.2.6).

run_clf_grid() was used to tune the hyperparameters of the models used for classification. The data is passed along with a dictionary of parameters to tune; the keys of the dictionary are the parameter names and the values are lists of parameter values to try. The itertools.product function generates all possible combinations of parameters, and the provided classifier is run with every combination via the run_clf() function (if no classifier is provided, XGBoost is used). The scores for each parameter combination are returned.
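
The full implementation is in Appendix 6.2.5; the sketch below captures the grid-expansion logic just described (the exact signature and the best-first sorting by log loss are assumptions):

import itertools

def run_clf_grid(data, hyper_grid, clf=None, boost_rounds=(30,)):
    """Sketch only; see Appendix 6.2.5 for the actual implementation."""
    keys = list(hyper_grid.keys())
    scores = []
    # itertools.product expands the value lists into every combination
    for combo in itertools.product(*(hyper_grid[k] for k in keys)):
        params = dict(zip(keys, combo))
        if clf is None:
            # XGBoost is the default and also needs a boosting-round count
            for rounds in boost_rounds:
                scores.append(run_xgb(data, params, rounds))
        else:
            scores.append(run_clf(clf, data, params))
    # Assumption: results are returned best-first by k-fold log loss
    return sorted(scores, key=lambda s: s['log_loss'])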

The run_clf() function is called by run_clf_grid(). It performs k-fold cross-validation: it unpacks the provided hyperparameters, passes them to the provided classifier, fits the classifier on each training fold, and computes the log loss between the predictions on the corresponding test fold and the target labels. These fold scores are averaged to give the k-fold log loss. The log loss was calculated with the following formula, where y is the target label:

\[ \mathrm{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\right] \]
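
As a quick illustration of the formula, using scikit-learn's log_loss (which clips predictions away from exactly 0 and 1); the numbers here are made up:

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.7, 0.4]    # predicted P(y=1) for each observation
print(log_loss(y_true, y_prob))  # ~0.4004

Confident, correct predictions contribute little to the loss, while confident, wrong predictions are penalized heavily; this property matters when interpreting the Random Forest log loss in Section 4.3.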

An additional function was required for tuning the XGBoost classifier (Appendix 6.2.7). Cross-validation of the tuning parameters is conducted in a very similar fashion, and the results are returned in a dict. The best tuning iteration's prediction log loss and accuracy for XGBoost and the Random Forest were found by passing a tuple of parameters to these custom functions.

3.2.1 XGBoost

XGBoost is run on the data with various parameters using 3 folds via run_clf_grid() (Appendix 6.2.8). This algorithm is an optimized gradient boosting machine. Boosting is a technique in which the residuals of the model are iteratively used as the new targets, so that the model learns from the mistakes of previous iterations.
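
The residual-fitting idea can be sketched in a few lines; this is a toy regression version for clarity, not XGBoost itself, which generalizes the idea with loss gradients and regularization:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_boost(X, y, n_rounds=30, eta=0.1, max_depth=2):
    pred = np.full(len(y), y.mean())    # start from a constant prediction
    for _ in range(n_rounds):
        residual = y - pred             # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += eta * tree.predict(X)   # eta damps each correction
    return pred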

Here XGBoost is run with a decision tree booster, and the parameters being tuned are:

Parameter | Effect | Values Used
eta | Learning rate: how aggressively the boosting adjusts the model between iterations | [0.001, 0.01, 0.1]
subsample | The fraction of the training data subsampled each iteration to avoid overfitting | [0.25, 0.5]
colsample_bytree | The fraction of columns considered at each level of the tree | [0.25, 0.5]
max_depth | The maximum depth of the decision tree | [2, 4]
boost_rounds | How many boosting iterations the model will run | [30, 60]

3.2.2 Random Forest

The next algorithm under consideration is the Random Forest Classifier (Appendix 6.2.9). Random forests are also tree-based; however, the trees are built in parallel. A number of different decision trees are built on subsets of the data, new data is fed to the resulting set of trees, and the trees vote on the classification.
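
Conceptually, the voting works as in this toy sketch (assumes NumPy arrays and binary 0/1 labels; it is not the tuned model from Appendix 6.2.9):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_forest_vote(X_train, y_train, X_new, n_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    n = len(y_train)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)   # bootstrap sample per tree
        tree = DecisionTreeClassifier(max_features='sqrt')
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_new))
    # majority vote across the trees for each new observation
    return (np.mean(votes, axis=0) >= 0.5).astype(int)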

The hyperparameters being tuned are:

Parameter | Effect | Values Used
n_estimators | The number of trees | [10, 100]
max_depth | The maximum depth of the trees | [2, 4]
max_features | The number of features considered when looking for a split | [None, 'sqrt']

3.2.3 Support Vector Machine

The support vector machine model was ultimately tuned with the assistance of the GridSearchCV() function from the scikit-learn package, using LinearSVC() (Appendix 6.2.10).

The parameters for the linear SVM:

Parameter | Effect | Values Used
C | Controls the strength of regularization; smaller values impose stronger regularization | [0.1, 1.0, 10, 100]
loss | Specifies the loss function to use | ['squared_hinge']
class_weight | The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data (scikit-learn documentation) | ['balanced', None]

Additionally, dual=False and max_iter=10000 were passed to the function to help the algorithm run efficiently and converge.
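
The actual call appears in Appendix 6.2.10; it likely resembled this sketch, where the cv and n_jobs settings are assumptions:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

svm_param_grid = {
    'C': [0.1, 1.0, 10, 100],
    'loss': ['squared_hinge'],
    'class_weight': ['balanced', None],
}
svm_search = GridSearchCV(
    LinearSVC(dual=False, max_iter=10000),
    svm_param_grid,
    cv=3,        # assumed to match the 3 folds used for the other models
    n_jobs=-1,   # assumption: use all cores
)
svm_search.fit(X_train, y_train)
print(svm_search.best_params_)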

Tuning attempts were also conducted on a small sample of the data with an SVC() using RandomizedSearchCV() (Appendix 6.2.11).
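
That code appears in Appendix 6.2.11; a sketch of such an attempt, with a hypothetical search distribution for C, would be:

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

rand_search = RandomizedSearchCV(
    SVC(kernel='linear'),
    {'C': loguniform(1e-2, 1e2)},  # hypothetical distribution
    n_iter=500,
    cv=3,
    random_state=42,
)
rand_search.fit(X_train_smp, y_train_smp)  # the sampled data from Appendix 6.2.12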

The final objective was to observe how the support vector machine scales. The dataset was sampled into sets of 1,000, 2,000, 5,000, and 10,000 observations using the same code, changing only the value of n (Appendix 6.2.12). These sampled datasets were run through an SVC() with default parameters, and the results were saved to a csv for easy retrieval.

4 Results

4.1 Summary Statistics

Numerical Data

Table 4.1.1 Summary Statistics for Numeric Data
[Wide summary table, abridged: count, mean, std, min, 25%, 50%, 75%, and max for ID, target, and the numeric features v1 through v131. Every column has 114,321 observations; most features have a minimum near 0 and a maximum near 20.]

The numeric variables were inspected to identify any anomalies. It appears as though the data may have already undergone some type of transformation, as a majority of the features have a maximum value of 20 and a minimum of around 0. No further processing of the numeric variables was conducted.

Categorical Data

Table 4.1.2 Summary of Categorical Data
Feature | count | unique | top | freq
v3 | 114321 | 3 | C | 114041
v22 | 114321 | 18210 | AGDF | 2886
v24 | 114321 | 5 | E | 55177
v30 | 114321 | 7 | C | 92288
v31 | 114321 | 3 | A | 91804
v47 | 114321 | 10 | C | 55425
v52 | 114321 | 12 | J | 11106
v56 | 114321 | 122 | BW | 18233
v66 | 114321 | 3 | A | 70353
v71 | 114321 | 9 | F | 75094
v74 | 114321 | 3 | B | 113560
v75 | 114321 | 4 | D | 75087
v79 | 114321 | 18 | C | 34561
v91 | 114321 | 7 | A | 27082
v107 | 114321 | 7 | E | 27082
v110 | 114321 | 3 | A | 55688
v112 | 114321 | 22 | F | 22053
v113 | 114321 | 36 | G | 71556
v125 | 114321 | 90 | BM | 5836

The categorical summary reveals the sheer number of levels some features contain; a few of those identified as having more than 20 categories were v22, v56, and v125. These features are plotted below.


Figure 4.1.1 Level Balance for v22 (Note: x-axis labels are suppressed because the large number of levels made them unreadable)


Figure 4.1.2 Level Balance for v56


Figure 4.1.3 Level Balance for v125


Columns v22 (Figure 4.1.1) and v56 (Figure 4.1.2) have a large number of categories with relatively few observations each, and were binned.

Column v125 (Figure 4.1.3) has 90 categories, but many appear to have at least 1,000 samples, so this column was kept unaltered.

Inspection of the value counts for column v22 shows a dropoff after the value HUU, which has a count of 146.

A similar inspection of column v56 shows a sharp dropoff after the value CF, with 141 value counts. These two features were binned: all categories with value counts below 140 in columns v22 and v56 were relabeled "Other".

Figure 4.1.4 Level Balance for v22 after binning


Figure 4.1.5 Level Balance for v56 after binning


Figures 4.1.4 and 4.1.5 show features v22 and v56 after binning. A majority of the observations in v22 now fall in the "Other" category, but diversity remains in the data.

4.2 XGBoost Tuning

Table 4.2.1 XGBoost Best Parameters
Parameter | Best Tune
boost_round | 60
booster | gbtree
colsample_bytree | 0.5
eta | 0.1
eval_metric | logloss
max_depth | 4
objective | binary:logistic
subsample | 0.5

The XGBoost model arguably performed the best of all the models (full results in the appendix). The log-loss score for the best out-of-fold prediction was 0.4728, with an accuracy of 78.08%. The log loss was by far the best, and the accuracy was very good, though not dramatically better than the other models'. This model also took the shortest amount of time to tune, which is surprising given that more parameter combinations were investigated for this algorithm than for the Random Forest or the Support Vector Machine.

4.3 Random Forest Tuning

Table 4.3.1 Random Forest Best Parameters
Parameter | Best Tune
max_depth | 4
max_features | None
n_estimators | 10

The Random Forest did not perform very well on this classification task; the best parameters from cross-validation are shown above (full results in the appendix). The log loss, 7.6979, was far worse than XGBoost's, but the accuracy was middle of the pack at 77.71%. Note that run_clf() computes log loss on hard 0/1 predictions rather than predicted probabilities, which heavily penalizes every misclassification and inflates this score. This model also took the longest to tune (SVC() excluded).

4.4 Linear SVM Tuning

Table 4.4.1 LinearSVC Best Parameters
Parameter | Best Tune
C | 1
class_weight | None
loss | squared_hinge

Compared to the other models, the LinearSVC() yielded the least desirable accuracy score at 77.19%. The GridSearchCV() function was used to find the best model, and the runtime was not the worst of the three.

Early attempts at tuning with RandomizedSearchCV() and/or SVC() were unsuccessful. RandomizedSearchCV() was initially run with SVC() on the 1,000-row sampled dataset; even with n_iter at 500, the search completed very quickly, in under a minute. Once the dataset exceeded 5,000 samples, however, the search struggled mightily. Even after the kernel was fixed to linear, the classifier was changed to LinearSVC(), n_iter was left at its default of 10, and n_jobs was set to 1, it still ran for many hours without completing. The next section examines the runtimes for the SVM, as this algorithm is notorious for long training times.

4.5 Support Vector Machine Runtime

Figure 4.5.1 Sampled Support Vector Machine runtime results

The SVM runtime results are shown in Figure 4.5.1 above. As seen in this visualization, and as experienced during model building, this classifier does not scale well with data: training time grows much faster than linearly in the number of samples.

5 Conclusion

Table 5.0.1 Comprehensive Model Summary
Metric | XGBoost | Random Forest | Support Vector Machine
Log Loss | 0.4728 | 7.6979 | n/a
Accuracy | 78.08 % | 77.71 % | 77.19 %

Table 5.0.1 above shows the comprehensive results for this classification task. XGBoost performed the best: it had the shortest tuning time and produced the best metrics on a holdout validation set.

This case study illustrated the balances and trade-offs one must navigate when choosing a machine learning algorithm or statistical method. Each of these models may perform comparably well on real data, as the accuracy scores are similar, though this assertion would need validation. An assessment like this is a good starting point, and further model tuning could be conducted.

6 Appendix

6.1 Sources

LinearSVC

XGBoost

Log-Loss

6.2 Code/Functions

6.2.1 Data Prep

# plot_data is the raw dataframe loaded earlier in the analysis
data = plot_data.copy()
## test the code with a subset of the data
#data = data[0:1000]
target = data['target']
data.drop(['target', 'ID'],inplace=True, axis=1)

The above code prepares the data for analysis by separating out the target and dropping the ID column.

6.2.2 Categorical EDA Raw

### Find cols with over 20 categories 
cat_data = data.loc[:, data.dtypes == object]

col_bin_candidates = dict()
for col in cat_data:
    category_count = len(data[col].value_counts())
    if category_count > 20:
        col_bin_candidates[col] = category_count
        
col_bin_candidates

This finds the columns with more than 20 categories and prints the result.

import matplotlib.pyplot as plt

### Visualize values for cols with over 20 categories
for col in col_bin_candidates:
    counts = data[col].value_counts().to_frame()

    fig = plt.figure()
    if len(counts) < 50:
        plt.title("{c} Bar Chart".format(c=col))
        plt.bar(counts.index, counts[col])
    else:
        plt.title("{c} Histogram".format(c=col))
        plt.hist(counts[col])

### Look for bin point col v22
data['v22'].value_counts()[0:50]
### Look for bin point col v56
data['v56'].value_counts()[0:50]

6.2.3 Bin Categorical Features

def bin_df_col(df, col, cutoff):
    # Identify the levels whose value counts fall below the cutoff
    vc = df[col].value_counts().to_frame()
    below_cutoff = vc[vc[col] < cutoff].index
    # Collapse those rare levels into a single 'Other' category
    df.loc[df[col].isin(below_cutoff), col] = 'Other'

    return df

The above function bins all levels whose value counts fall below the passed cutoff into an "Other" category, reducing the number of levels in the feature.

6.2.4 Binning/One Hot Encoding

from sklearn.model_selection import train_test_split

### Bin cols based on observations above (py_scr is the project's helper module)
data = py_scr.bin_df_col(data, 'v22', 140)
data = py_scr.bin_df_col(data, 'v56', 140)

data_ohe = pd.get_dummies(data)  # one-hot encode the categorical features

X_train, X_test, y_train, y_test = train_test_split(data_ohe, target, test_size=0.33, random_state=42)

This final data step reduces the number of factors in the binned columns, one-hot encodes the data, and partitions it into train and test sets.

6.2.6 Classifier Cross-Validation

from sklearn.model_selection import KFold
from sklearn.metrics import log_loss, accuracy_score

def run_clf(a_clf, data, clf_hyper):
    M, L, n_folds = data # unpack data container
    kf = KFold(n_splits=n_folds) # Establish the cross validation
    scores = []

    for ids, (train_index, test_index) in enumerate(kf.split(M, L)):
        clf = a_clf(**clf_hyper) # unpack parameters into clf if they exist
        clf.fit(M.iloc[train_index], L.iloc[train_index])

        # predict() returns hard class labels, so the log loss below is
        # computed on 0/1 predictions rather than predicted probabilities
        pred = clf.predict(M.iloc[test_index])
        score_log_loss = log_loss(L.iloc[test_index], pred)
        pred[pred<0.5] = 0 # a no-op for hard labels; kept for parity with run_xgb
        pred[pred>=0.5] = 1
        score_acc = accuracy_score(L.iloc[test_index], pred)
        scores.append((score_log_loss, score_acc))

    ret = {
        'clf': str(clf),
        'log_loss': sum([score[0] for score in scores]) / float(len(scores)),
        'accuracy': sum([score[1] for score in scores]) / float(len(scores))
    }

    return ret

This function performs one round of cross-validation on a classifier. One must pass the desired model, the data, and the parameters for that iteration of cross-validation. This function was used for the random forest tuning.
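
For example, a single random forest configuration could be evaluated with a call like the following (clf_data as defined in Appendix 6.2.8):

from sklearn.ensemble import RandomForestClassifier

clf_data = (data_ohe, target, 3)  # features, labels, number of folds
result = run_clf(RandomForestClassifier, clf_data, {'max_depth': 4, 'n_estimators': 10})
print(result['log_loss'], result['accuracy'])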

6.2.7 XGBoost Cross-Validation

import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss, accuracy_score

def run_xgb(data, clf_hyper, boost_round):
    M, L, n_folds = data # unpack data container
    kf = KFold(n_splits=n_folds) # Establish the cross validation
    scores = []

    for ids, (train_index, test_index) in enumerate(kf.split(M, L)):
        xgtrain = xgb.DMatrix(M.iloc[train_index].values, L.iloc[train_index].values)
        xgtest = xgb.DMatrix(M.iloc[test_index].values, L.iloc[test_index].values)
        
        clf = xgb.train(
            clf_hyper,
            xgtrain,
            num_boost_round=boost_round,
            verbose_eval=True,
            maximize=False
        )
        
        pred = clf.predict(xgtest, ntree_limit=clf.best_iteration)
        score_log_loss = log_loss(L.iloc[test_index], pred)
        pred[pred<0.5] = 0
        pred[pred>=0.5] = 1
        score_acc = accuracy_score(L.iloc[test_index], pred)
        scores.append((score_log_loss, score_acc))

    ret = {
        'params': clf_hyper,
        'boost_round': boost_round,
        'log_loss': sum([score[0] for score in scores]) / float(len(scores)),
        'accuracy': sum([score[1] for score in scores]) / float(len(scores))
    }

    return ret  

This function performs one round of cross-validation for the XGBoost classifier. The data, the hyperparameters, and the number of boosting rounds are required.

6.2.8 XGBoost Tuning

xgboost_hyper = { 
   "objective": ["binary:logistic"],
   "booster": ["gbtree"],
   "eval_metric": ["logloss"],
   "eta": [0.001, 0.01, 0.1], 
   "subsample": [.25, .5],
   "colsample_bytree": [0.25, 0.5],
   "max_depth": [2,4]
}
clf_data = (data_ohe, target, 3)
xgb_scores = py_scr.run_clf_grid(clf_data, xgboost_hyper, boost_rounds=[30,60])

The above shows the function call to run the XGBoost tuning.

6.2.9 Random Forest Tuning

from sklearn.ensemble import RandomForestClassifier

r_clf = RandomForestClassifier
r_clf_hyper_grid = {
    'n_estimators': [10, 100],
    'max_depth': [2, 4],
    'max_features': [None, 'sqrt']
}
rf_scores = py_scr.run_clf_grid(clf_data, r_clf_hyper_grid, clf=r_clf)

This code was used to tune the random forest classifier.

6.2.12 SVM Runtiming

from sklearn.svm import SVC

n = 1000  # changed to 2000, 5000, and 10000 for the other timing runs
sample_data = data.sample(n=n, random_state=2)

sample_target = sample_data['target']

sample_data.drop(['target', 'ID'],inplace=True, axis=1)

sample_data = py_scr.bin_df_col(sample_data, 'v22', 140)
sample_data = py_scr.bin_df_col(sample_data, 'v56', 140)

sample_data_ohe = pd.get_dummies(sample_data)

X_train_smp, X_test_smp, y_train_smp, y_test_smp = train_test_split(sample_data_ohe, sample_target, test_size=0.33, random_state=42)

svm = SVC()

_=svm.fit(X_train_smp, y_train_smp)

svm_y_preds = svm.predict(X_test_smp)

acc_svm = accuracy_score(svm_y_preds, y_test_smp)

This code is what was timed during the sampling exercise; the value of n was changed for each desired condition and the block was rerun.
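
The timing mechanism itself is not shown; a simple wrapper such as this sketch, assuming time.perf_counter was used, would capture the fit time for each n before the rows were written to the csv:

import time

start = time.perf_counter()
_ = svm.fit(X_train_smp, y_train_smp)
fit_seconds = time.perf_counter() - start
# rows of (n, fit_seconds, accuracy) were collected and saved to the csv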

6.2.13 XGBoost CV Results

## [
##    {
##       "accuracy": 0.7807576910628843,
##       "boost_round": 60,
##       "log_loss": 0.4728152353881004,
##       "params": {
##          "boost_round": 60,
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7799179503328347,
##       "boost_round": 60,
##       "log_loss": 0.4746925072993398,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7799704341284629,
##       "boost_round": 60,
##       "log_loss": 0.4760574659643522,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7792706501867549,
##       "boost_round": 60,
##       "log_loss": 0.477636326600088,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7795943002597948,
##       "boost_round": 30,
##       "log_loss": 0.48019136665380885,
##       "params": {
##          "boost_round": 60,
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7794018596758251,
##       "boost_round": 60,
##       "log_loss": 0.4806111771501589,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7792706501867549,
##       "boost_round": 60,
##       "log_loss": 0.4810116930080995,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7793056393838401,
##       "boost_round": 30,
##       "log_loss": 0.48111711586009626,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7774337173397713,
##       "boost_round": 60,
##       "log_loss": 0.48380703866094527,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7774424646390427,
##       "boost_round": 60,
##       "log_loss": 0.4841062634560416,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7771887929601736,
##       "boost_round": 30,
##       "log_loss": 0.48600775300587057,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.777477453836128,
##       "boost_round": 30,
##       "log_loss": 0.48671780543123405,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7764890090184656,
##       "boost_round": 30,
##       "log_loss": 0.49079358783366933,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7761566116461541,
##       "boost_round": 30,
##       "log_loss": 0.4909624186022256,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7724127675580165,
##       "boost_round": 30,
##       "log_loss": 0.4948742440031848,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7710131996746005,
##       "boost_round": 30,
##       "log_loss": 0.49521669686653175,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.1,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7750806938357783,
##       "boost_round": 60,
##       "log_loss": 0.5666707977131821,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7757279939818581,
##       "boost_round": 60,
##       "log_loss": 0.5667872764083817,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7648551009875701,
##       "boost_round": 60,
##       "log_loss": 0.573243505943783,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7649250793817409,
##       "boost_round": 60,
##       "log_loss": 0.5734722548385941,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7614436542717437,
##       "boost_round": 60,
##       "log_loss": 0.5746586614921271,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7620647125200094,
##       "boost_round": 60,
##       "log_loss": 0.5747820783200127,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 60,
##       "log_loss": 0.5789357752843304,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 60,
##       "log_loss": 0.5789745847921156,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.773077562302639,
##       "boost_round": 30,
##       "log_loss": 0.6156225417446154,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7736811259523622,
##       "boost_round": 30,
##       "log_loss": 0.6156942938979568,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7633068290165412,
##       "boost_round": 30,
##       "log_loss": 0.6192451318553448,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7635255114983249,
##       "boost_round": 30,
##       "log_loss": 0.6193647432998911,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 30,
##       "log_loss": 0.6206327503523822,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 30,
##       "log_loss": 0.6206662014971803,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 30,
##       "log_loss": 0.6226969775009574,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 30,
##       "log_loss": 0.6227177228266377,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.01,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7726052081419862,
##       "boost_round": 60,
##       "log_loss": 0.6739744067404801,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7729900893099256,
##       "boost_round": 60,
##       "log_loss": 0.6740059350516693,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 60,
##       "log_loss": 0.6750018174868662,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7613824231768441,
##       "boost_round": 60,
##       "log_loss": 0.6750333793775128,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 60,
##       "log_loss": 0.6752258273426545,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 60,
##       "log_loss": 0.6752363492087136,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 60,
##       "log_loss": 0.6758209735131904,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 60,
##       "log_loss": 0.6758244855568595,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7726139554412574,
##       "boost_round": 30,
##       "log_loss": 0.6834792467182454,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7724127675580165,
##       "boost_round": 30,
##       "log_loss": 0.6835009979140817,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7612162244906885,
##       "boost_round": 30,
##       "log_loss": 0.683945695543021,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7618110408411404,
##       "boost_round": 30,
##       "log_loss": 0.6839629448266957,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 4,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 30,
##       "log_loss": 0.684114072931942,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 30,
##       "log_loss": 0.6841175453950736,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.5,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 30,
##       "log_loss": 0.6843646475196623,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.5
##       }
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "boost_round": 30,
##       "log_loss": 0.6843670570345989,
##       "params": {
##          "booster": "gbtree",
##          "colsample_bytree": 0.25,
##          "eta": 0.001,
##          "eval_metric": "logloss",
##          "max_depth": 2,
##          "objective": "binary:logistic",
##          "subsample": 0.25
##       }
##    }
## ]

6.2.14 Random Forest CV Results

## [
##    {
##       "accuracy": 0.7771275618652741,
##       "clf": "RandomForestClassifier(max_depth=4, max_features=None, n_estimators=10)",
##       "log_loss": 7.69789401226916
##    },
##    {
##       "accuracy": 0.7770750780696459,
##       "clf": "RandomForestClassifier(max_depth=4, max_features=None)",
##       "log_loss": 7.699707088066617
##    },
##    {
##       "accuracy": 0.7651437618635247,
##       "clf": "RandomForestClassifier(max_depth=2, max_features=None, n_estimators=10)",
##       "log_loss": 8.11183392616853
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "clf": "RandomForestClassifier(max_depth=2, max_features='sqrt', n_estimators=10)",
##       "log_loss": 8.248094615957752
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "clf": "RandomForestClassifier(max_depth=4, max_features='sqrt', n_estimators=10)",
##       "log_loss": 8.248094615957752
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "clf": "RandomForestClassifier(max_depth=2, max_features=None)",
##       "log_loss": 8.248094615957752
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "clf": "RandomForestClassifier(max_depth=2, max_features='sqrt')",
##       "log_loss": 8.248094615957752
##    },
##    {
##       "accuracy": 0.7611987298921458,
##       "clf": "RandomForestClassifier(max_depth=4, max_features='sqrt')",
##       "log_loss": 8.248094615957752
##    }
## ]