Statistical methods and machine learning have wide application across many industries. There are many uses for these methods, and possibly even more algorithms one could choose from, so several factors must be taken into account when making a decision. The purpose of the method, whether it is to explain or to predict, must be considered. If prediction is the goal, one must then decide whether the problem is one of classification or regression.
Given a dataset, this case study evaluates three statistical methods used for classification, comparing their efficacy and weighing the differences between the models.
Little metadata is included with the dataset. There are roughly 100,000 observations and 130 features, a mix of numeric and categorical variables. The data is said to have come from banking data, and the feature to be predicted is labeled simply `target`. This feature is categorical and assumes the value 0 or 1, denoting True or False. The goal is to tune three statistical models to predict this target variable; the models are tuned using cross-validation and their results compared.
The three models evaluated are gradient boosting with XGBoost, a random forest classifier, and a support vector machine. The dataset is large enough to challenge these classifiers.
The accompanying code for the methods used can be found in the Code/Functions Section of the Appendix.
The first step in processing the data is splitting out the target column and dropping the ID column, which adds no value to the analysis (Appendix 6.2.1).
A brief EDA was conducted on the dataset; summary statistics were computed for the numeric and the categorical data. Some of the categorical features had an excessive number of factor levels and required processing, as this could become a problem after one-hot encoding: left unmodified, they would produce a very large, sparse matrix that may be difficult to work with during modeling.
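To make the concern concrete, here is a small toy illustration (hypothetical data, not from this dataset) of how categorical levels multiply into indicator columns:

```python
# Each categorical level becomes its own indicator column after one-hot
# encoding, so high-cardinality features inflate the matrix quickly.
import pandas as pd

toy = pd.DataFrame({"v_few": ["A", "B", "A", "C"],        # 3 levels
                    "v_many": ["L1", "L2", "L3", "L4"]})  # 4 levels
print(pd.get_dummies(toy).shape)  # (4, 7): 3 + 4 indicator columns
```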
Features with more than 20 categories were identified and plotted (Appendix 6.2.2). The goal of this level reduction was to have roughly 500 features after one-hot encoding. Ultimately, features `v22` and `v56` were identified as the most likely candidates for binning. A function, `bin_df_col()`, was created to facilitate the proper manipulation of these categories (Appendix 6.2.3). Factor levels with fewer than 140 entries were put into an “other” category to reduce the number of features. Once the appropriate features were binned, the data was one-hot encoded and partitioned into training and test sets (Appendix 6.2.4). The final number of features in the dataset was 453.
A custom function, `run_clf_grid()`, was written for hyper-parameter tuning (Appendix 6.2.5), along with a helper function, `run_clf()` (Appendix 6.2.6).
`run_clf_grid()` was used to tune the hyper-parameters of the classification models. The data is passed in along with a dictionary of parameters to tune; the keys of the dictionary are parameter names and the values are lists of parameter values to try. The `itertools.product` function is used to generate all possible combinations of parameters, and the provided classifier is run via the `run_clf()` function with every parameter combination (if no classifier is provided, XGBoost is used). The scores for each parameter combination are returned.
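The grid expansion inside `run_clf_grid()` works as in this minimal sketch, which mirrors the `zip`/`itertools.product` pattern used in the appendix code:

```python
# Expand a dict of candidate values into a list of parameter dicts,
# one per combination, exactly as run_clf_grid() does internally.
import itertools

grid = {"eta": [0.01, 0.1], "max_depth": [2, 4]}
param, param_values = zip(*grid.items())
param_list = [dict(zip(param, combo)) for combo in itertools.product(*param_values)]
print(param_list)
# [{'eta': 0.01, 'max_depth': 2}, {'eta': 0.01, 'max_depth': 4},
#  {'eta': 0.1, 'max_depth': 2}, {'eta': 0.1, 'max_depth': 4}]
```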
The `run_clf()` function is called by `run_clf_grid()`. It performs k-fold cross-validation: the provided hyper-parameters are unpacked and passed to the classifier, the classifier is fit on each training fold, and the log loss is calculated between the predictions on the test fold and the target labels. The fold scores are averaged to give the k-fold log loss, calculated with the following formula, where \(y_i\) is the target label and \(\hat{y}_i\) the predicted probability:
\[LogLoss = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln(\hat{y}_i) + (1-y_i) \ln(1-\hat{y}_i) \right]\]
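As a quick sanity check, the formula can be verified against scikit-learn's `log_loss` on a toy example:

```python
# Manual log loss versus sklearn.metrics.log_loss on toy probabilities.
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1])                      # true labels
y_hat = np.array([0.9, 0.2, 0.6, 0.4])          # predicted P(y = 1)
manual = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(np.isclose(manual, log_loss(y, y_hat)))   # True
```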
An additional function, `run_xgb()`, was required for tuning the XGBoost classifier (Appendix 6.2.7). Cross-validation over the tuning parameters is conducted in a very similar fashion, and the results are returned in a dict. The best tuning iteration's prediction log loss and accuracy for XGBoost and the Random Forest were found by passing a tuple of parameters to these custom functions.
XGBoost was run on the data with various parameters and 3 folds using `run_clf_grid()` (Appendix 6.2.8). This algorithm is an optimized gradient boosting machine. Boosting is a technique in which the residuals of the model are iteratively used as the new training targets, so that the model learns from the mistakes of previous iterations.
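The idea can be sketched in a few lines. The toy below (hypothetical data) boosts on raw residuals with squared error for clarity; XGBoost itself fits trees to gradients of the chosen loss, but the learn-from-mistakes loop is the same:

```python
# Toy boosting loop: each round fits a shallow tree to the current residuals
# and adds a shrunken correction to the running prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

eta, pred = 0.1, np.zeros(200)
for _ in range(30):                        # boosting rounds
    residual = y - pred                    # mistakes become the new target
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += eta * tree.predict(X)          # eta damps each correction
```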
Here XGBoost is run with decision trees, and the parameters being tuned are:
Parameter | Effect | Values Used |
---|---|---|
`eta` | Learning rate controlling how aggressively the boosting adjusts the model between iterations | [0.001, 0.01, 0.1] |
`subsample` | The fraction of the training data subsampled each iteration to avoid overfitting | [0.25, 0.5] |
`colsample_bytree` | The fraction of columns considered when constructing each tree | [0.25, 0.5] |
`max_depth` | The maximum depth of the decision tree | [2, 4] |
`boost_rounds` | How many boosting iterations the model will run | [30, 60] |
The next algorithm under consideration is the Random Forest Classifier (Appendix 6.2.9). Random forests are also tree-based; however, the trees are built in parallel. A number of different decision trees are built on subsets of the data, new data is fed to the resulting set of trees, and the trees vote on the classification.
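A toy sketch of the bagging-and-voting idea follows (hypothetical data; the real `RandomForestClassifier` also subsamples features at each split):

```python
# Fit independent shallow trees on bootstrap samples and classify new
# observations by majority vote across the ensemble.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))   # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

votes = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print((votes >= 0.5).astype(int))           # majority vote for five observations
```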
The hyper-parameters being tuned are:
Parameter | Effect | Values Used |
---|---|---|
`n_estimators` | The number of trees | [10, 100] |
`max_depth` | The maximum depth of each tree | [2, 4] |
`max_features` | The number of features considered when looking for a split | [None, 'sqrt'] |
The support vector machine model was ultimately tuned with the assistance of the `GridSearchCV()` function from the `scikit-learn` package, using `LinearSVC()` (Appendix 6.2.10).
The parameters for the linear SVM:
Parameter | Effect | Values Used |
---|---|---|
`C` | Controls the strength of regularization; smaller values impose stronger regularization | [0.1, 1.0, 10, 100] |
`loss` | Specifies the loss function to use | ['squared_hinge'] |
`class_weight` | The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data (per the scikit-learn documentation) | ['balanced', None] |
Additionally, `dual=False` and `max_iter=10000` were passed to the function to make the algorithm run faster.
Tuning attempts were also conducted on a small sample of the data with an `SVC()` using `RandomizedSearchCV()` (Appendix 6.2.11).
The final objective was to observe the effect of scaling on the support vector machine. The dataset was sampled into sets of 1,000, 2,000, 5,000, and 10,000 observations using the same code, changing only the value of `n` (Appendix 6.2.12). These sampled datasets were run through an `SVC()` with default parameters, and the results were saved to a csv for easy retrieval.
 | ID | target | v1 | v2 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | v11 | v12 | v13 | v14 | v15 | v16 | v17 | v18 | v19 | v20 | v21 | v23 | v25 | v26 | v27 | v28 | v29 | v32 | v33 | v34 | v35 | v36 | v37 | v38 | v39 | v40 | v41 | v42 | v43 | v44 | v45 | v46 | v48 | v49 | v50 | v51 | v53 | v54 | v55 | v57 | v58 | v59 | v60 | v61 | v62 | v63 | v64 | v65 | v67 | v68 | v69 | v70 | v72 | v73 | v76 | v77 | v78 | v80 | v81 | v82 | v83 | v84 | v85 | v86 | v87 | v88 | v89 | v90 | v92 | v93 | v94 | v95 | v96 | v97 | v98 | v99 | v100 | v101 | v102 | v103 | v104 | v105 | v106 | v108 | v109 | v111 | v114 | v115 | v116 | v117 | v118 | v119 | v120 | v121 | v122 | v123 | v124 | v126 | v127 | v128 | v129 | v130 | v131
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 |
mean | 114228.93 | 0.7611987 | 1.630685674923 | 7.4644107807752 | 4.1450975035109 | 8.7423591810425 | 2.4364015815439 | 2.4839208102273 | 1.496568585079 | 9.0318585930527 | 1.8830457847125 | 15.4474126838914 | 6.8813044377403 | 3.7983963023606 | 12.0942788858509 | 2.0809106817235 | 4.9232224287135 | 3.8322704127588 | 0.841045485361 | 0.2223004779864 | 17.7735922 | 7.0297402 | 1.09308841985134 | 1.69812879 | 1.8760306041673 | 2.7434540369624 | 5.093328020265 | 8.2064159614183 | 1.622150683167 | 2.1616329797329 | 6.40623558906 | 8.1223868077918 | 13.375597717567 | 0.7414708467043 | 0.09092818 | 1.2371837560457 | 10.465928405775 | 7.1825513447041 | 12.9249650976855 | 2.2165969904458 | 10.7951692627232 | 9.142231088139 | 1.63052543 | 12.5380219089864 | 8.0165466385719 | 1.5042648853011 | 7.1981589983863 | 15.7112990521152 | 1.2538563 | 1.559556203 | 4.077827529765 | 7.7016530932587 | 10.5879446792157 | 1.7142943914356 | 14.5830349435487 | 1.0306943 | 1.68732734 | 6.3437132502133 | 15.8475567 | 9.2872753836876 | 17.564117 | 9.4493351122383 | 12.2699599 | 1.4317667 | 2.4333034251272 | 2.4050555276588 | 7.3073659482608 | 13.334481973067 | 2.2096999319881 | 7.287173817372 | 6.2083564881429 | 2.1738077203141 | 1.6079557040978 | 2.8222527267858 | 1.220184140291 | 10.1802156 | 1.9241838301535 | 1.5184252 | 0.9669125681134 | 0.5823667750715 | 5.4751845316194 | 3.8528830609624 | 0.6657576395036 | 6.4579519592415 | 7.622553947745 | 7.6676237703283 | 1.250720530646 | 12.0916229011791 | 6.8664137058717 | 2.8902887272399 | 5.2967158900748 | 2.6428276253487 | 1.08104522222 | 11.7913595537984 | 2.1526200477383 | 4.1812843773808 | 3.365313669889 | 13.5744509815506866 | 10.5480509331675 | 2.2912175835805 | 8.3038570644314 | 8.3646508021664 | 3.1689697933543 | 1.2912178861444 | 2.7375959775818 | 6.8224391421538 | 3.54993833 | 0.9198119852931 | 1.6726576869395 | 3.2395418343606 | 2.0303732349014 | 0.3101442 | 1.9257634679567 | 1.7393891285777 |
std | 65934.49 | 0.4263529 | 0.8132648607174 | 2.2250359927279 | 0.8626621261702 | 1.5434406499236 | 0.4506138012766 | 0.4427149883004 | 2.1097863790959 | 1.4495416816586 | 1.3934663832624 | 0.5933833698793 | 0.9241467383403 | 0.8831731273539 | 1.4439213329891 | 0.5504519731541 | 1.3446427992241 | 1.4360671215414 | 0.462864337893 | 0.1286811853652 | 0.8674296 | 1.0694018 | 2.98732218535776 | 2.24158271 | 0.4139845828121 | 0.6266564934556 | 2.011310849914 | 0.9654450102285 | 0.423243709016 | 0.7396951573049 | 2.02419539908 | 1.0062805681003 | 1.785729106552 | 0.4065718695202 | 0.58347762 | 1.7710758856348 | 3.1676437185216 | 0.7544255445798 | 0.7487952252983 | 0.4866693177728 | 1.5858589673518 | 1.5505828176218 | 2.19532169 | 1.6499256735956 | 0.6779729929047 | 1.1678898433356 | 1.8730575227974 | 0.6003598481219 | 1.754598 | 0.626683038593 | 0.5092542056744 | 5.1380645873115 | 1.5563992428107 | 0.4037792043052 | 1.5934386536928 | 0.6962441 | 2.24951155 | 1.8974181415614 | 1.4104959 | 0.8437106405939 | 1.719832 | 1.4267009120726 | 1.7543601 | 0.9222675 | 0.5998123098125 | 1.0395551395494 | 0.9433861663182 | 1.3842291866628 | 0.8072647286202 | 1.6856685581687 | 2.788207549548 | 0.79784990138 | 0.706908758694 | 1.0618557422559 | 0.349851975096 | 2.2735704 | 0.7875292870987 | 2.13245315 | 0.1343842960905 | 0.1803984566689 | 1.2320110796948 | 0.6421580106347 | 0.1983492717587 | 0.8415485280468 | 1.444984138095 | 1.7627602611966 | 0.3465493424595 | 5.1734083103548 | 1.7690092157239 | 1.3541209028561 | 0.9229137705948 | 0.6652710797571 | 1.70317177617 | 2.2193463583259 | 0.692217074898 | 2.8139505527565 | 1.117152356997 | 2.6128782936866974 | 1.4274426584522 | 0.5034027060615 | 2.7426922561405 | 1.5035793392475 | 3.1636039920245 | 0.5545506102771 | 1.0186034472285 | 1.3486997532297 | 1.94343091 | 1.5915548723072 | 0.3779128300971 | 1.2212253350593 | 0.8143412614486 | 0.6932616 | 0.9496402470355 | 0.8518204492634 |
min | 3 | 0 | -0.0000009996497 | -0.0000009817614 | -0.0000006475929 | -0.0000005287068 | -0.0000009055091 | -0.0000009468765 | -0.0000007783778 | -0.0000009828757 | -0.0000009875317 | -0.0000001459062 | 0.0000005143224 | -0.0000008464889 | -0.0000009738831 | -0.0000008830427 | -0.0000009978294 | -0.0000009066455 | 0.000000447547 | -0.0000005178987 | 1.5167764 | 0.1061806 | -0.0000009999932 | 0.04104324 | -0.0000009346696 | -0.0000009915986 | -0.000000696088 | -0.0000003040753 | -0.000000955996 | -0.0000009713108 | -0.000000670767 | -0.0000009958327 | -0.0000004906628 | -0.0000009999498 | 0 | -0.0000009999742 | 0.0000001238996 | -0.0000007272275 | -0.0000006206144 | -0.0000009724295 | -0.0000009482212 | -0.0000009202112 | 0.06934879 | -0.0000009924422 | -0.0000006697975 | -0.0000009091393 | -0.0000003616122 | -0.0000009838107 | 0.0130615 | -0.000000984682 | -0.0000007570607 | -0.0000009976841 | -0.0000008803695 | -0.0000009970164 | -0.0000006113428 | 0 | 0.05305528 | -0.0000008770837 | 0.6592957 | -0.0000002413161 | 1.501359 | -0.0000009920262 | 0.4270946 | 0 | -0.0000009838592 | -0.0000006171967 | -0.0000007729896 | -0.0000009902572 | -0.0000009992919 | -0.0000001443765 | -0.0000007251767 | -0.0000009865679 | -0.0000009994245 | -0.0000009995741 | -0.000000981431 | 0.8723955 | -0.0000009990082 | 0.02236518 | -0.0000002343004 | 0.0000000339321 | 0.0000004251919 | -0.0000009687809 | -0.0000005784674 | -0.0000008832504 | -0.000000905952 | -0.0000005605928 | -0.0000009787309 | -0.0000009981311 | -0.0000003942767 | -0.0000007111493 | -0.0000009757743 | -0.0000008683169 | 0.00009364981 | -0.0000005467029 | 0.0000004251643 | -0.0000009996884 | -0.000000999027 | -0.0000000009335327 | -0.0000009853189 | -0.0000009450359 | -0.0000009991992 | -0.0000001695463 | -0.0000009998183 | -0.0000009932534 | -0.0000009820642 | -0.0000009978497 | 0.01913856 | -0.0000009994953 | -0.0000009564174 | -0.0000009223798 | 0.0000008197812 | 0 | -0.0000009901257 | -0.0000009999134 |
25% | 57280 | 1 | 1.34615289971 | 6.57577033681 | 4.0686969218 | 8.39409001481 | 2.3409675593 | 2.37658602471 | 0.265314672374 | 8.81356030231 | 1.05032820659 | 15.3982297448 | 6.32262474904 | 3.46408745982 | 11.2560173031 | 1.90568564088 | 4.70588270329 | 3.37982887932 | 0.719486145362 | 0.192411804374 | 17.7735922 | 6.4187546 | 0.00000006196688 | 0.27447507 | 1.755531054 | 2.56647467162 | 4.74235525242 | 8.11437410392 | 1.49129133927 | 1.83074195405 | 5.0558001341 | 7.89614964569 | 13.0993453186 | 0.59071649453 | 0 | 0.305219112655 | 8.41038973367 | 7.06761850192 | 12.8131706654 | 2.06896530006 | 10.5425569114 | 8.88634100226 | 0.25630139 | 12.1569310021 | 7.91400039216 | 0.658792378256 | 6.83726745823 | 15.6677184587 | 0.2081914 | 1.27659511766 | 3.97664923776 | 4.06135866729 | 10.2167970814 | 1.59960415915 | 14.5830349435487 | 1 | 0.27093532 | 5.85294439423 | 15.8475567 | 9.16798396465 | 17.564117 | 9.27007251269 | 12.0874811 | 1 | 2.23611757389 | 2.06003010584 | 7.19022597493 | 13.1517514492 | 1.95122020344 | 7.20499240376 | 3.59298457736 | 1.81534571891 | 1.31803793054 | 2.44394683635 | 1.10841063474 | 9.5518364 | 1.62351387723 | 0.22454102 | 0.949615010065 | 0.517926838695 | 5.13586626368 | 3.65008498653 | 0.59461319143 | 6.36225348979 | 7.18954304011 | 7.29994016973 | 1.17270867441 | 12.0916229011791 | 6.34055270755 | 2.3290450503 | 4.98815003049 | 2.43201770861 | 0.17354816907 | 11.702030193 | 1.85200656575 | 2.72121863922 | 2.92385753715 | 11.9966658175000003 | 10.2666662017 | 2.1390365024 | 7.60399574885 | 7.86516760377 | 1.1694252228 | 1.05263168344 | 2.28260878443 | 6.51960744691 | 2.57105312 | 0.0847131977909 | 1.57097366948 | 2.76249711473 | 1.68126114914 | 0 | 1.44947705168 | 1.46341418483 |
50% | 114189 | 1 | 1.630685674923 | 7.4644107807752 | 4.1450975035109 | 8.7423591810425 | 2.4364015815439 | 2.4839208102273 | 1.496568585079 | 9.0318585930527 | 1.31291009873 | 15.4474126838914 | 6.6132409216 | 3.7983963023606 | 11.9678254667 | 2.0809106817235 | 4.9232224287135 | 3.8322704127588 | 0.841045485361 | 0.2223004779864 | 17.7735922 | 7.0393655 | 0.33059372542 | 1.69812879 | 1.8760306041673 | 2.7434540369624 | 5.093328020265 | 8.2064159614183 | 1.622150683167 | 2.1616329797329 | 6.53443367796 | 8.1223868077918 | 13.375597717567 | 0.7414708467043 | 0 | 1.2371837560457 | 10.3393377646 | 7.1825513447041 | 12.9249650976855 | 2.2165969904458 | 10.7951692627232 | 9.142231088139 | 1.63052543 | 12.5380219089864 | 8.0165466385719 | 1.21194423179 | 7.1981589983863 | 15.7112990521152 | 1.2538563 | 1.559556203 | 4.077827529765 | 7.7016530932587 | 10.5879446792157 | 1.7142943914356 | 14.5830349435487 | 1 | 1.68732734 | 6.3437132502133 | 15.8475567 | 9.2872753836876 | 17.564117 | 9.4493351122383 | 12.2699599 | 1 | 2.4333034251272 | 2.4050555276588 | 7.3073659482608 | 13.334481973067 | 2.2096999319881 | 7.287173817372 | 6.2083564881429 | 2.1738077203141 | 1.6079557040978 | 2.8222527267858 | 1.220184140291 | 10.1802156 | 1.9241838301535 | 1.5184252 | 0.9669125681134 | 0.5823667750715 | 5.4751845316194 | 3.8528830609624 | 0.6657576395036 | 6.4579519592415 | 7.622553947745 | 7.6676237703283 | 1.250720530646 | 12.0916229011791 | 6.8664137058717 | 2.8902887272399 | 5.2967158900748 | 2.6428276253487 | 1.08104522222 | 11.7913595537984 | 2.1526200477383 | 4.1812843773808 | 3.365313669889 | 14.0388799090000003 | 10.5480509331675 | 2.2912175835805 | 8.3038570644314 | 8.3646508021664 | 3.1689697933543 | 1.2912178861444 | 2.7375959775818 | 6.8224391421538 | 3.54993833 | 0.9198119852931 | 1.6726576869395 | 3.2395418343606 | 2.0303732349014 | 0 | 1.9257634679567 | 1.7393891285777 |
75% | 171206 | 1 | 1.630685674923 | 7.55150068977 | 4.34022894242 | 8.92479757997 | 2.48469939431 | 2.52844500954 | 1.496568585079 | 9.3023251243 | 2.10065719345 | 15.5938957785 | 7.01940368178 | 3.7983963023606 | 12.7157742438 | 2.0809106817235 | 5.14285647543 | 3.8322704127588 | 0.841045485361 | 0.2223004779864 | 18.1546039 | 7.6665218 | 1.09308841985134 | 1.69812879 | 1.89891343478 | 2.77910334858 | 5.33033911741 | 8.47939326721 | 1.622150683167 | 2.1616329797329 | 7.70145144761 | 8.25075880327 | 14.3249233609 | 0.7414708467043 | 0 | 1.2371837560457 | 12.7624628502 | 7.34477205535 | 13.0496460315 | 2.2374883411 | 11.0221009644 | 9.41516321173 | 1.63052543 | 12.6746325614 | 8.13559232975 | 2.00572213229 | 7.41788211852 | 15.8715598393 | 1.2538563 | 1.559556203 | 4.15366204094 | 7.7016530932587 | 10.8395399737 | 1.73501642403 | 15.3129111878 | 1 | 1.68732734 | 6.38440109403 | 16.4708469 | 9.46899440238 | 18.4375 | 9.73384058789 | 12.9166001 | 2 | 2.43664774246 | 2.4050555276588 | 7.55221374391 | 13.5593215242 | 2.24358989361 | 7.8230073696 | 6.2083564881429 | 2.1738077203141 | 1.6079557040978 | 2.8222527267858 | 1.220184140291 | 10.4335862 | 1.9241838301535 | 1.5184252 | 0.990101887728 | 0.5823667750715 | 5.4751845316194 | 3.8528830609624 | 0.6657576395036 | 6.6690010731 | 7.71084428042 | 8.00612317966 | 1.30166574802 | 15.6972116963 | 6.93118555837 | 2.8902887272399 | 5.2967158900748 | 2.6428276253487 | 1.08104522222 | 12.4436323096 | 2.1526200477383 | 4.1812843773808 | 3.365313669889 | 15.372185696099999 | 10.7189543566 | 2.31017036209 | 8.6453665737 | 8.41772155054 | 3.1689697933543 | 1.2912178861444 | 2.7375959775818 | 6.99999915782 | 3.54993833 | 0.9198119852931 | 1.6726576869395 | 3.2395418343606 | 2.0303732349014 | 0 | 1.9257634679567 | 1.7393891285777 |
max | 228713 | 1 | 20.0000006294 | 19.9999999087 | 19.9999997446 | 20.0000003539 | 20.0000005964 | 19.9999998141 | 20.000000997 | 20.0000007502 | 18.5339164478 | 20.0000009233 | 18.7105503906 | 20.0000009059 | 19.9999996125 | 19.9999990205 | 20.0000009395 | 19.9999992246 | 20.0000009159 | 20.0000007723 | 20.000001 | 19.296052 | 20.0000009982 | 20.000001 | 19.9999996744 | 20.00000029 | 19.8481942126 | 19.9999999189 | 17.5609751445 | 20.0000002023 | 20.000000815 | 20.0000000877 | 20.0000002071 | 20.000000355 | 12 | 19.9155262925 | 19.9999991256 | 19.999999255 | 19.999999748 | 20.0000004033 | 19.8316812461 | 20.0000005511 | 20.000001 | 19.9999996424 | 19.9999995929 | 19.9999991454 | 20.0000008581 | 20.0000009228 | 20.000001 | 20.0000001945 | 19.9999997458 | 20.0000009786 | 19.9999994001 | 20.0000004966 | 18.8469601202 | 7 | 20.000001 | 19.9999990118 | 20.000001 | 20.0000009583 | 20.000001 | 20.0000009988 | 19.8163109 | 12 | 19.9999996418 | 20.0000005707 | 15.9735089981 | 20.0000009888 | 19.9999996377 | 20.0000009706 | 20.0000002311 | 20.0000002847 | 20.0000009699 | 20.0000008889 | 17.5609751085 | 19.8427544 | 20.0000000942 | 20.000001 | 6.30577492863 | 8.92384346581 | 19.9999994839 | 19.0163118283 | 9.0705377032 | 19.9999997693 | 20.0000008332 | 19.0587996349 | 19.9999997618 | 20.0000009983 | 20.0000004219 | 20.0000001121 | 18.7752514628 | 20.0000006045 | 20.0000009841 | 20.0000003358 | 20.0000005938 | 20.0000009996 | 20.0000005066 | 19.999999675599998 | 20.0000009715 | 20.0000007973 | 20.0000007954 | 20.0000009555 | 20.0000004599 | 10.3942654912 | 20.0000008035 | 20.0000009324 | 19.68606924 | 20.0000009992 | 15.6316128253 | 19.9999990947 | 20.000000402 | 11 | 19.9999995909 | 20.0000009426 |
The numeric variables were inspected to identify any anomalies. It appears as though the data may have already undergone some type of transformation, as a majority of the features have a maximum value of 20 and a minimum of around 0. No further processing of the numeric variables was conducted.
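The pattern is easy to confirm with a sketch like the following, assuming the prepared `data` frame from Appendix 6.2.1:

```python
# Per-column minima and maxima for the numeric features; most minima sit
# near 0 and most maxima near 20, consistent with a prior transformation.
num_data = data.loc[:, data.dtypes != object]
print(num_data.min().describe())
print(num_data.max().describe())
```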
 | v3 | v22 | v24 | v30 | v31 | v47 | v52 | v56 | v66 | v71 | v74 | v75 | v79 | v91 | v107 | v110 | v112 | v113 | v125
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 | 114321 |
unique | 3 | 18210 | 5 | 7 | 3 | 10 | 12 | 122 | 3 | 9 | 3 | 4 | 18 | 7 | 7 | 3 | 22 | 36 | 90 |
top | C | AGDF | E | C | A | C | J | BW | A | F | B | D | C | A | E | A | F | G | BM |
freq | 114041 | 2886 | 55177 | 92288 | 91804 | 55425 | 11106 | 18233 | 70353 | 75094 | 113560 | 75087 | 34561 | 27082 | 27082 | 55688 | 22053 | 71556 | 5836 |
The categorical features revealed the sheer number of factors some of them contained; a few identified as having over 20 categories were `v22`, `v56`, and `v125`. These features are plotted below.
Figure 4.1.1 Level Balance for v22 (Note: x-axis labels are turned off because the large number of levels made them unreadable)
Figure 4.1.2 Level Balance for v56
Figure 4.1.3 Level Balance for v125
Columns `v22` (Figure 4.1.1) and `v56` (Figure 4.1.2) have a large number of categories, each with relatively few observations, and were binned.
Column `v125` (Figure 4.1.3) has over 100 categories, but many appear to have at least 1000 samples, so this column was kept unaltered.
Inspection of the value counts for column `v22` shows a dropoff in value counts after the value `HUU`, which has a count of 146.
A similar inspection of column `v56` shows a sharp dropoff after the value `CF`, with 141 value counts. These two features were binned so that all categories with value counts below 140 in columns `v22` and `v56` were denoted “other”.
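A sketch of how the cutoff can be confirmed, assuming the `data` frame from the appendix:

```python
# Levels of v22 below the 140-count cutoff are the ones collapsed to "other".
vc = data['v22'].value_counts()
print(vc[vc >= 140].index[-1], vc[vc >= 140].iloc[-1])  # HUU 146 in this data
print((vc < 140).sum())  # number of levels that will be binned
```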
Figure 4.1.4 Level Balance for v22 after binning
Figure 4.1.5 Level Balance for v56 after binning
Figures 4.1.4 and 4.1.5 show the features `v22` and `v56` after binning. A majority of the observations in the `v22` feature now fall into the “other” category, but the remaining levels still preserve the diversity of the data.
Parameter | Best Tune |
---|---|
boost_round | 60 |
booster | gbtree |
colsample_bytree | 0.5 |
eta | 0.1 |
eval_metric | logloss |
max_depth | 4 |
objective | binary:logistic |
subsample | 0.5 |
The XGBoost model arguably performed the best of all the models (full results in the appendix). The log-loss score for the best out-of-fold prediction was 0.4728, with an accuracy of 78.08 %. The log-loss score was by far the best, and the accuracy was very good, though not far and away better than the other models'. This model also took the shortest amount of time to tune, which is surprising given that more parameter combinations were investigated for this algorithm than for the Random Forest or the Support Vector Machine.
Parameter | Best Tune |
---|---|
max_depth | 4 |
max_features | None |
n_estimators | 10 |
The Random Forest did not perform very well on this classification task; the best parameters from cross-validation are shown above (full results in the appendix). The log loss, 7.6979, was far worse than XGBoost's, but the accuracy was middle of the pack at 77.71 %. The severity of the log loss follows from `run_clf()` scoring hard 0/1 class predictions rather than probabilities, so each misclassification incurs the maximum penalty. This model also took the longest to tune (`SVC()` excluded).
Parameter | Best Tune |
---|---|
C | 1 |
class_weight | None |
loss | squared_hinge |
Compared to the other models, the `LinearSVC()` yielded the least desirable accuracy score at 77.19 %. The `GridSearchCV()` function was used to find the best model, and the runtime wasn't the worst. Note that `LinearSVC()` does not produce probability estimates, which is why no log loss is reported for this model in the summary table.
Early attempts at tuning with `RandomizedSearchCV()` and/or `SVC()` were unsuccessful. `RandomizedSearchCV()` was initially run with `SVC()` on the 1000-sample dataset; even with `n_iter` at 500, the search would run very fast, under a minute. Once the dataset exceeded 5000 samples, the search struggled mightily. Even after the kernel was fixed to linear, the classifier was changed to `LinearSVC()`, `n_iter` was left at the default of 10, and `n_jobs` was set to 1, the search still took many hours and wouldn't complete. The next section displays the runtimes for the SVM, as this algorithm is notorious for long run times.
Figure 4.5.1 Sampled Support Vector Machine runtime results
The SVM runtime results are shown in Figure 4.5.1 above. As seen in this visualization, and as experienced during model building, this classifier does not scale well with data: training time grows much faster than linearly in the number of samples (roughly quadratic to cubic for kernel SVMs).
 | XGBoost | Random Forest | Support Vector Machine |
---|---|---|---|
Log Loss | 0.4728 | 7.6979 | ——— |
Accuracy | 78.08 % | 77.71 % | 77.19 % |
Table 5.0.1 above shows the comprehensive results for this classification task. XGBoost performed the best: it had the shortest time to tune and produced the best metrics on a holdout validation set of data.
This case study illustrated the balances and trade-offs one must navigate when choosing a machine learning algorithm or statistical method. Each of these models may perform comparably well on real data, as the accuracy scores seem similar enough, though this assertion would need validation. An assessment like this is a good starting point, and further model tuning could be conducted.
data = plot_data.copy()
## test the code with a subset of the data
#data = data[0:1000]
target = data['target']
data.drop(['target', 'ID'],inplace=True, axis=1)
The above code will prepare the data for analysis.
### Find cols with over 20 categories
cat_data = data.loc[:, data.dtypes == object]
col_bin_candidates = dict()
for col in cat_data:
    category_count = len(data[col].value_counts())
    if category_count > 20:
        col_bin_candidates[col] = category_count
col_bin_candidates
Finds the columns with more than 20 categories and prints the result.
### Visualize values for cols with over 20 categories
for col in col_bin_candidates:
    counts = data[col].value_counts().to_frame()
    fig = plt.figure()
    if len(counts) < 50:
        plt.title("{c} Bar Chart".format(c=col))
        plt.bar(counts.index, counts[col])
    else:
        plt.title("{c} Histogram".format(c=col))
        plt.hist(counts[col])
### Look for bin point col v22
data['v22'].value_counts()[0:50]
### Look for bin point col v56
data['v56'].value_counts()[0:50]
def bin_df_col(df, col, cutoff):
    vc = df[col].value_counts().to_frame()
    below_cutoff = vc[vc[col] < cutoff].index
    df.loc[(df[col].isin(below_cutoff)), col] = 'Other'
    return df
The above function creates an “other” category based on the value passed as `cutoff`, reducing the number of levels in the feature.
### Bin cols based on observations above
data = py_scr.bin_df_col(data, 'v22', 140)
data = py_scr.bin_df_col(data, 'v56', 140)
data_ohe = pd.get_dummies(data)
X_train, X_test, y_train, y_test = train_test_split(data_ohe, target, test_size=0.33, random_state=42)
This is the final data step, which reduces the number of factors in some columns, one-hot encodes the data, and partitions it into train and test sets.
def run_clf_grid(data, clf_hyper_grid, return_best=False, boost_rounds=None, clf=None):
    clf_scores = []
    param, param_values = zip(*clf_hyper_grid.items())
    param_list = [dict(zip(param, param_value)) for param_value in itertools.product(*param_values)]
    for params in param_list:
        if clf:
            score = run_clf(clf, data, params)
            clf_scores.append(score)
        else:
            for boost_round in boost_rounds:
                score = run_xgb(data, params, boost_round)
                clf_scores.append(score)
    clf_scores.sort(key=lambda x: x['log_loss'])
    if return_best:
        clf_scores = [clf_scores[0]]
    return clf_scores
This code accepts a dict of parameter values, iterates through every combination, and performs cross-validation on the model for each to tune for the best parameters.
def run_clf(a_clf, data, clf_hyper):
    M, L, n_folds = data  # unpack data container
    kf = KFold(n_splits=n_folds)  # establish the cross-validation
    scores = []
    for ids, (train_index, test_index) in enumerate(kf.split(M, L)):
        clf = a_clf(**clf_hyper)  # unpack parameters into clf if they exist
        clf.fit(M.iloc[train_index], L.iloc[train_index])
        pred = clf.predict(M.iloc[test_index])  # hard class labels, not probabilities
        score_log_loss = log_loss(L.iloc[test_index], pred)
        pred[pred < 0.5] = 0
        pred[pred >= 0.5] = 1
        score_acc = accuracy_score(L.iloc[test_index], pred)
        scores.append((score_log_loss, score_acc))
    ret = {
        'clf': str(clf),
        'log_loss': sum([score[0] for score in scores]) / float(len(scores)),
        'accuracy': sum([score[1] for score in scores]) / float(len(scores))
    }
    return ret
This function performs a round of cross-validation on a classifier. One must pass the desired model, the data, and the parameters for that iteration of cross-validation. This function was used for the random forest tuning. Note that it scores `clf.predict()` output, i.e. hard class labels rather than probabilities, which inflates the log loss for misclassified observations.
def run_xgb(data, clf_hyper, boost_round):
    M, L, n_folds = data  # unpack data container
    kf = KFold(n_splits=n_folds)  # establish the cross-validation
    scores = []
    for ids, (train_index, test_index) in enumerate(kf.split(M, L)):
        xgtrain = xgb.DMatrix(M.iloc[train_index].values, L.iloc[train_index].values)
        xgtest = xgb.DMatrix(M.iloc[test_index].values, L.iloc[test_index].values)
        clf = xgb.train(
            clf_hyper,
            xgtrain,
            num_boost_round=boost_round,
            verbose_eval=True,
            maximize=False
        )
        pred = clf.predict(xgtest, ntree_limit=clf.best_iteration)
        score_log_loss = log_loss(L.iloc[test_index], pred)
        pred[pred < 0.5] = 0
        pred[pred >= 0.5] = 1
        score_acc = accuracy_score(L.iloc[test_index], pred)
        scores.append((score_log_loss, score_acc))
    ret = {
        'params': clf_hyper,
        'boost_round': boost_round,
        'log_loss': sum([score[0] for score in scores]) / float(len(scores)),
        'accuracy': sum([score[1] for score in scores]) / float(len(scores))
    }
    return ret
This function will perform one round of cross-validation for the XGBoost classifier. The data, hyper-parameters and the number of boosting rounds are required.
xgboost_hyper = {
    "objective": ["binary:logistic"],
    "booster": ["gbtree"],
    "eval_metric": ["logloss"],
    "eta": [0.001, 0.01, 0.1],
    "subsample": [.25, .5],
    "colsample_bytree": [0.25, 0.5],
    "max_depth": [2, 4]
}
clf_data = (data_ohe, target, 3)
xgb_scores = py_scr.run_clf_grid(clf_data, xgboost_hyper, boost_rounds=[30,60])
The above shows the function call to run the XGBoost tuning.
r_clf = RandomForestClassifier
r_clf_hyper_grid = {
    'n_estimators': [10, 100],
    'max_depth': [2, 4],
    'max_features': [None, 'sqrt']
}
rf_scores = py_scr.run_clf_grid(clf_data, r_clf_hyper_grid, clf=r_clf)
This code is what was used to tune the random forest classifier.
svm_param_dist = {
    'C': [0.1, 1.0, 10, 100],
    'loss': ['squared_hinge'],
    'class_weight': ['balanced', None]
}
lin_svm = LinearSVC(dual=False, max_iter=10000)
svm_clf = GridSearchCV(lin_svm, svm_param_dist, n_jobs=-1)
search = svm_clf.fit(X_train, y_train)
lsvm_y_preds = svm_clf.predict(X_test)
acc_lsvm = accuracy_score(lsvm_y_preds, y_test)
This code was used to tune the `LinearSVC()`.
from scipy.stats import uniform, expon
sample_data = data.sample(n=2000, random_state=2)
sample_target = sample_data['target']
sample_data.drop(['target', 'ID'],inplace=True, axis=1)
sample_data_ohe = pd.get_dummies(sample_data)
X_train_smp, X_test_smp, y_train_smp, y_test_smp = train_test_split(sample_data_ohe, sample_target, test_size=0.33, random_state=42)
svm_param_dist = {
    'C': expon(scale=100),
    'gamma': expon(scale=.1),
    'kernel': ['linear']
}
svm_tune = SVC()
clf_tune = RandomizedSearchCV(svm_tune, svm_param_dist, cv=5, n_iter=10, random_state=0, n_jobs=-1)
This code is what was attempted with `RandomizedSearchCV()`, without success.
n=1000
sample_data = data.sample(n=n, random_state=2)
sample_target = sample_data['target']
sample_data.drop(['target', 'ID'],inplace=True, axis=1)
sample_data = py_scr.bin_df_col(sample_data, 'v22', 140)
sample_data = py_scr.bin_df_col(sample_data, 'v56', 140)
sample_data_ohe = pd.get_dummies(sample_data)
X_train_smp, X_test_smp, y_train_smp, y_test_smp = train_test_split(sample_data_ohe, sample_target, test_size=0.33, random_state=42)
svm = SVC()
_=svm.fit(X_train_smp, y_train_smp)
svm_y_preds = svm.predict(X_test_smp)
acc_svm = accuracy_score(svm_y_preds, y_test_smp)
This code is what was timed during the sampling exercise; the value of `n` was changed according to the desired condition and the code was re-run.
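The timing mechanism itself is not shown above; a minimal sketch, assuming `time.perf_counter()` as the clock, would wrap the fit like this:

```python
import time

start = time.perf_counter()
svm = SVC()
_ = svm.fit(X_train_smp, y_train_smp)  # the step whose runtime is measured
elapsed = time.perf_counter() - start
print(f"n={n}: fit took {elapsed:.1f} s")
```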
## [
## {
## "accuracy": 0.7807576910628843,
## "boost_round": 60,
## "log_loss": 0.4728152353881004,
## "params": {
## "boost_round": 60,
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7799179503328347,
## "boost_round": 60,
## "log_loss": 0.4746925072993398,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7799704341284629,
## "boost_round": 60,
## "log_loss": 0.4760574659643522,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7792706501867549,
## "boost_round": 60,
## "log_loss": 0.477636326600088,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7795943002597948,
## "boost_round": 30,
## "log_loss": 0.48019136665380885,
## "params": {
## "boost_round": 60,
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7794018596758251,
## "boost_round": 60,
## "log_loss": 0.4806111771501589,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7792706501867549,
## "boost_round": 60,
## "log_loss": 0.4810116930080995,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7793056393838401,
## "boost_round": 30,
## "log_loss": 0.48111711586009626,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7774337173397713,
## "boost_round": 60,
## "log_loss": 0.48380703866094527,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7774424646390427,
## "boost_round": 60,
## "log_loss": 0.4841062634560416,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7771887929601736,
## "boost_round": 30,
## "log_loss": 0.48600775300587057,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.777477453836128,
## "boost_round": 30,
## "log_loss": 0.48671780543123405,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7764890090184656,
## "boost_round": 30,
## "log_loss": 0.49079358783366933,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7761566116461541,
## "boost_round": 30,
## "log_loss": 0.4909624186022256,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7724127675580165,
## "boost_round": 30,
## "log_loss": 0.4948742440031848,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7710131996746005,
## "boost_round": 30,
## "log_loss": 0.49521669686653175,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.1,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7750806938357783,
## "boost_round": 60,
## "log_loss": 0.5666707977131821,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7757279939818581,
## "boost_round": 60,
## "log_loss": 0.5667872764083817,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7648551009875701,
## "boost_round": 60,
## "log_loss": 0.573243505943783,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7649250793817409,
## "boost_round": 60,
## "log_loss": 0.5734722548385941,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7614436542717437,
## "boost_round": 60,
## "log_loss": 0.5746586614921271,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7620647125200094,
## "boost_round": 60,
## "log_loss": 0.5747820783200127,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.5789357752843304,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.5789745847921156,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.773077562302639,
## "boost_round": 30,
## "log_loss": 0.6156225417446154,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7736811259523622,
## "boost_round": 30,
## "log_loss": 0.6156942938979568,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7633068290165412,
## "boost_round": 30,
## "log_loss": 0.6192451318553448,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7635255114983249,
## "boost_round": 30,
## "log_loss": 0.6193647432998911,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6206327503523822,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6206662014971803,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6226969775009574,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6227177228266377,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.01,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7726052081419862,
## "boost_round": 60,
## "log_loss": 0.6739744067404801,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7729900893099256,
## "boost_round": 60,
## "log_loss": 0.6740059350516693,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6750018174868662,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7613824231768441,
## "boost_round": 60,
## "log_loss": 0.6750333793775128,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6752258273426545,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6752363492087136,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6758209735131904,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 60,
## "log_loss": 0.6758244855568595,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7726139554412574,
## "boost_round": 30,
## "log_loss": 0.6834792467182454,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7724127675580165,
## "boost_round": 30,
## "log_loss": 0.6835009979140817,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7612162244906885,
## "boost_round": 30,
## "log_loss": 0.683945695543021,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7618110408411404,
## "boost_round": 30,
## "log_loss": 0.6839629448266957,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 4,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.684114072931942,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6841175453950736,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.5,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6843646475196623,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.5
## }
## },
## {
## "accuracy": 0.7611987298921458,
## "boost_round": 30,
## "log_loss": 0.6843670570345989,
## "params": {
## "booster": "gbtree",
## "colsample_bytree": 0.25,
## "eta": 0.001,
## "eval_metric": "logloss",
## "max_depth": 2,
## "objective": "binary:logistic",
## "subsample": 0.25
## }
## }
## ]
## [
## {
## "accuracy": 0.7771275618652741,
## "clf": "RandomForestClassifier(max_depth=4, max_features=None, n_estimators=10)",
## "log_loss": 7.69789401226916
## },
## {
## "accuracy": 0.7770750780696459,
## "clf": "RandomForestClassifier(max_depth=4, max_features=None)",
## "log_loss": 7.699707088066617
## },
## {
## "accuracy": 0.7651437618635247,
## "clf": "RandomForestClassifier(max_depth=2, max_features=None, n_estimators=10)",
## "log_loss": 8.11183392616853
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=2, max_features='sqrt', n_estimators=10)",
## "log_loss": 8.248094615957752
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=4, max_features='sqrt', n_estimators=10)",
## "log_loss": 8.248094615957752
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=2, max_features=None)",
## "log_loss": 8.248094615957752
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=2, max_features='sqrt')",
## "log_loss": 8.248094615957752
## },
## {
## "accuracy": 0.7611987298921458,
## "clf": "RandomForestClassifier(max_depth=4, max_features='sqrt')",
## "log_loss": 8.248094615957752
## }
## ]