Here is an example of a project that involved replicating a neural network from a journal article. It was a very challenging exercise, one I would love to expand upon to gain more experience in this domain.
Artificial neural networks can be described as an interconnected series of nodes that transform data through a series of weighted calculations and activation functions. Many different types and architectures are available for a myriad of applications. Readily available computing power has allowed neural networks to grow in popularity, and these algorithms have been applied to many corners of research. One such application, in the field of physics, is the subject of this case study.
The purpose of this case study is to emulate a neural network described in the paper titled “Searching for Exotic Particles in High-energy Physics with Deep Learning”; the accompanying dataset can be found here. As the title of the paper implies, the researchers use a neural network to classify high-energy particle collision events.
The exercise is to build a neural network with the same architecture as the network in the paper, with the goal of matching the performance metrics claimed in that research. The network described is a 5-layer network with 300 hidden units in each dense layer, a learning rate of 0.05, and a weight decay of \(1 \times 10^{-5}\).
The dataset provided with this problem comprises 29 total columns (the target plus 28 features) created through Monte Carlo simulations. As described on the UCI site, the first 21 features following the target are low-level kinematic properties measured by a particle detector inside an accelerator, and the last 7 are high-level features derived from the first 21. These features help physicists discriminate between the two classes of events. The goal of the classifier is to use the features to predict the target class, 0 or 1. The performance metric of choice in the paper is AUC, and the models in this case study use the same metric for evaluation.
The accompanying code for the methods used can be found in the Code/Functions section of the Appendix. The neural network in this case study was built using the tensorflow Python package.
There are 11,000,000 observations in this data; the first 10,000,000 rows were used as a training set, leaving the last 1,000,000 rows as a validation set. As in the article, the data was scaled prior to training. The scale_df() function (Appendix 6.2.1) was used to perform this step. Features whose values were all zero or above were scaled with a min-max scaler; the article stated that a similar scaling technique was used for the exclusively positive features. The other attributes in the dataset were scaled to a mean of zero and a standard deviation of one with a standard scaler. The preprocess_data() function (Appendix 6.2.2) was used as a helper to split the data into training/test sets and apply the scaling to the variables.
The model was created with the build_model() function (Appendix 6.2.3); the network was built as true to the journal article as possible. Since there was an expectation for the final performance, several tweaks were made to various models, which will be expanded upon in the results section.
The compile_and_fit(), get_optimizer(), and get_callbacks() functions (Appendix 6.2.4) were used to compile and train the model; the code for these was found at this website. Two different stochastic gradient descent optimizers were tested: the first was the SGD() optimizer in the tensorflow package, and the second was SGDW() from the tensorflow-addons package. The initial learning rate for the final model was 0.05 with a weight decay of \(1 \times 10^{-5}\), as in the paper. The momentum term was held constant at 0.9. Several methods were employed to apply the weight decay factor, which will be elaborated upon in the results section. The batch size was kept at 100 during model training, emulating what was used in the article. The final model trained for 30 epochs, with each epoch taking a little more than 800 s, which was about the average epoch length across the models trained. The neural network minimized the binary cross-entropy of the training set, and the callbacks function monitored the validation loss with a patience of 10. The final model was saved in HDF5 format so it could be recalled. Layer weights were initialized with a random normal distribution: the initial layer with a mean of 0 and a standard deviation of 0.1, the hidden layers with a mean of 0 and a standard deviation of 0.05, and the output layer with the default initializer. This output-layer initialization differs slightly from the article, which used a mean of 0 and a standard deviation of 0.001; the change was made specifically because it seemed to improve performance.
## Model: "sequential_1"
## _________________________________________________________________
## Layer (type) Output Shape Param #
## =================================================================
## dense_5 (Dense) (None, 300) 8700
## _________________________________________________________________
## dense_6 (Dense) (None, 300) 90300
## _________________________________________________________________
## dense_7 (Dense) (None, 300) 90300
## _________________________________________________________________
## dense_8 (Dense) (None, 300) 90300
## _________________________________________________________________
## dense_9 (Dense) (None, 1) 301
## =================================================================
## Total params: 279,901
## Trainable params: 279,901
## Non-trainable params: 0
## _________________________________________________________________
The final model architecture is shown above. As stated, there are 5 total layers, with an input shape of (28,). No modifications were made to the architecture given in the paper; only the training parameters were changed in an attempt to improve the model.
The final model was selected on the basis that it performed best on the AUC and log-loss metrics. Even though it was arguably the best-performing model, it still seemed to underperform relative to expectations, with a validation AUC of 0.7589 and a log-loss of 0.5907 on the withheld test set. Even with these values, it seemed during training like the model could have been overfitting: at the epoch with the best score, the training loss was about 0.48 and the training AUC was around 0.85. Because of this, several attempts were made to improve the model by adjusting its parameters.
The article stated that a weight decay was applied to the learning rate. One hypothesis for the possible overfitting was that this decay factor wasn’t being properly applied. Though the sources here and here describe decaying the learning rate with SGD(decay=1e-5), it is unclear whether this worked. The documentation for SGD() lists no parameter for decay, nor is it listed among the acceptable **kwargs, though SGDW() does list it as acceptable for backward compatibility. When the final model was compiled, there were no errors, even though the weight decay was passed to the function as decay=1e-5. Additionally, the second source above noted that its article had been updated for tensorflow 2.0.
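One way to sanity-check whether the decay argument was at least recorded is to inspect the optimizer's configuration after construction. This is only a sketch, assuming a tensorflow 2.x Keras optimizer that accepts decay as a keyword; it shows whether the value was stored, not whether it actually changed the learning rate during training.

```python
import tensorflow as tf

# Build the optimizer the same way as in Appendix 6.2.4.
opt = tf.keras.optimizers.SGD(learning_rate=0.05, momentum=0.9, decay=1e-5)

# get_config() returns the optimizer's hyperparameters as a dict; if 'decay'
# shows up here with the value that was passed, the argument was accepted.
print(opt.get_config())
```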
Several other attempts were made to apply learning-rate regularization. As previously stated, a different optimizer, SGDW(), was used to apply the weight decay by changing get_optimizer() (Appendix 6.2.6). The weight decay factor was applied directly, not scheduled in steps as proposed by the documentation for this optimizer; it was unclear how many epochs training would run for, so an appropriate number of steps could not have been known ahead of time. This seemed to apply the weight decay too strongly: the model stopped learning very quickly and did not perform well, with the best validation AUC being around 0.70.
Another method used to try to optimize this architecture was applying L2 regularization of 0.00001 to the layers (Appendix 6.2.7) as a form of weight decay, as suggested in this paper and this site. This produced a better balance between the training and validation metrics, but the metrics were not as impressive, with the best AUC for both training and validation being about 0.72.
The last attempt to regularize the learning rate was with tf.keras.optimizers.schedules.ExponentialDecay() (Appendix 6.2.8), included in tensorflow. This did not work very well: the effect again seemed too strong, the model quickly stopped learning, and it made no progress past the first epoch. It was a bit unclear whether the decay rate was supposed to be a proportion or a raw value; it appears to be the former.
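For reference, ExponentialDecay multiplies the initial learning rate by decay_rate raised to the power step / decay_steps, so decay_rate acts as a multiplicative factor rather than an amount subtracted per step. A quick sketch, reusing the values from Appendix 6.2.8, shows how sharply the schedule shrinks the learning rate over a training run of this size:

```python
import tensorflow as tf

# Same values as Appendix 6.2.8.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.05,
    decay_steps=1000000,
    decay_rate=0.00001,
)

# The schedule evaluates to initial_lr * decay_rate ** (step / decay_steps).
# With 10,000,000 training rows and a batch size of 100, one full epoch is
# roughly 100,000 steps.
for step in [0, 100_000, 500_000, 1_000_000]:
    print(step, float(lr_schedule(step)))
```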
Several models were trained with varying minor changes and saved following the instructions on the tensorflow site. During initial testing of this process, there were no problems recalling the files for the write-up. However, somewhere along the way the process stopped working, and close to the end of the case study loading the saved models suddenly raised the error TypeError: __init__() got an unexpected keyword argument 'reduction'. This was a pretty catastrophic blow to the exercise, as the models could not be recalled even though the instructions had been followed, tested, and validated. A user at this popular site encountered a similar problem and suggested that passing compile=False should do the trick, but this did not work for files saved in the default SavedModel folder format described on the tensorflow site, which is the format in which 6 models had been trained and saved. Because of this, the model had to be recompiled and presented in this write-up at a very inconvenient time. A new model was compiled and saved in the HDF5 format, though there was only time to recompile the final model. With the model in HDF5 format, predictions on the dataset, luckily, were able to be made and presented in the write-up. All of the models created during this case study are included in the zip folder.
The final model did not seem to perform very well with the given architecture. There are differences in how the models were trained that certainly contributed, as the score reported in the article is 0.87 for the AUC. The researchers increased the momentum term over the course of training, which accelerates progress toward the minimum. Coupled with the weight decay on the learning rate, this would allow the model to keep learning for longer: decaying the learning rate shrinks the size of the steps toward the global minimum, while linearly increasing the momentum term makes those steps more aggressive. These controls were not applied to the final model in this case study. The researchers also reported training their models for between 200 and 1,000 epochs, and this amount of control granted much better resolution of the minimum. It was unclear whether the weight decay was applied correctly to the final model, or whether it had any effect at all. The batch size for this model was kept at 100 throughout, as in the article, but the final model trained for only 30 epochs, far fewer than in the research. This was not experimented with or adjusted, though there are several ways it could have been. It would have been nice to split the 10,000,000 rows of training data into different sets by changing the number of steps considered to be an epoch; the researchers claimed to have used 2.5 million observations for training, so it might have been possible to train for more epochs by reducing the number of steps per epoch. It is unknown what effect this would have had on the final result.
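One way to approximate this without physically splitting the data would be the steps_per_epoch argument to model.fit(). A hedged sketch, assuming the scaled arrays, batch size of 100, and get_callbacks() from the appendix; the 25,000-step "epoch" length is chosen only for illustration.

```python
import tensorflow as tf

# Build a repeating tf.data pipeline so 25,000 steps (2.5 million rows) can be
# treated as one "epoch" regardless of where the previous epoch left off.
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(buffer_size=100_000)
    .batch(100)
    .repeat()
)

history = model.fit(
    train_ds,
    epochs=120,               # 120 short epochs == 30 full passes over the data
    steps_per_epoch=25_000,   # 25,000 batches of 100 rows per "epoch"
    validation_data=(x_test, y_test),
    callbacks=get_callbacks(),  # early stopping from Appendix 6.2.4
)
```

This gives the early-stopping callback four times as many validation checkpoints for the same total amount of computation.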
Other measures could have been taken in an attempt to improve the performance of the model. One very useful strategy would have been dropout layers. Some were described in the article, though this case study aimed only to emulate its final model, which did not include any such layers. A dropout layer randomly sets a fraction of the node activations to 0 on each training update, which should have the effect of reducing overfitting and improving generalization; a sketch follows below.
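A minimal sketch of what that might look like, reusing the architecture from Appendix 6.2.3 with a hypothetical dropout rate of 0.5 between the hidden layers (the rate is illustrative, not taken from the paper):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dropout_model(rate=0.5):
    # Same 300-unit tanh layers as Appendix 6.2.3, with dropout in between.
    # Each Dropout layer zeroes a random `rate` fraction of activations on
    # every training step (it is a no-op at inference time).
    model = tf.keras.Sequential([
        layers.Dense(300, activation='tanh', input_shape=(28,)),
        layers.Dropout(rate),
        layers.Dense(300, activation='tanh'),
        layers.Dropout(rate),
        layers.Dense(300, activation='tanh'),
        layers.Dropout(rate),
        layers.Dense(300, activation='tanh'),
        layers.Dense(1, activation='sigmoid'),
    ])
    return model
```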
Another technique that could have been employed is a bottleneck layer: a dense layer with a small number of nodes placed between the hidden layers. This compresses the representation and, in doing so, forces the existing features to combine into new ones, somewhat like principal component analysis. This kind of feature creation could yield better performance metrics; see the sketch after this paragraph.
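As an illustration, a hypothetical variant of the Appendix 6.2.3 network with a narrow layer inserted in the middle of the stack (the width of 30 units is an assumption for the example, not a tuned value):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_bottleneck_model(bottleneck_units=30):
    # The narrow layer in the middle forces the 300-dimensional hidden
    # representation through a much smaller space, compressing and
    # recombining the learned features before the later layers expand again.
    model = tf.keras.Sequential([
        layers.Dense(300, activation='tanh', input_shape=(28,)),
        layers.Dense(300, activation='tanh'),
        layers.Dense(bottleneck_units, activation='tanh'),  # bottleneck
        layers.Dense(300, activation='tanh'),
        layers.Dense(1, activation='sigmoid'),
    ])
    return model
```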
The model in this case study fell well short of what was described in the paper when comparing similar metrics. Given the differences in how the models were trained, this is not very surprising. If a similar AUC had been reached in this case study, the log-loss/binary cross-entropy would arguably be the more telling metric of comparison, since it reflects how close the predicted probabilities are to the correct class. Comparing this metric between models may therefore be a better judge of how “correct” the predictions are, with lower scores being better.
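For completeness, a small sketch of how both metrics can be computed on the held-out set from predicted probabilities, using scikit-learn; the variable names follow the appendix functions.

```python
from sklearn.metrics import roc_auc_score, log_loss

# Predicted probabilities for the positive class on the held-out rows.
y_prob = model.predict(x_test).ravel()

# AUC only ranks the predictions; log-loss also penalizes miscalibrated
# probabilities, so it is lower (better) when predictions are both
# well-ordered and close to the true 0/1 labels.
print("AUC:     ", roc_auc_score(y_test, y_prob))
print("Log-loss:", log_loss(y_test, y_prob))
```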
https://arxiv.org/pdf/1402.4735.pdf
https://archive.ics.uci.edu/ml/datasets/HIGGS
https://www.tensorflow.org/tutorials/keras/overfit_and_underfit
https://www.tensorflow.org/addons/api_docs/python/tfa/optimizers/SGDW#attributes
https://www.pyimagesearch.com/2019/07/22/keras-learning-rate-schedules-and-decay/
https://arxiv.org/pdf/1711.05101.pdf
https://www.tensorflow.org/tutorials/keras/save_and_load
https://stackoverflow.com/questions/60530304/loading-custom-model-with-tensorflow-2-1
https://towardsdatascience.com/machine-learning-part-20-dropout-keras-layers-explained-8c9f6dc4c9ab
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max = MinMaxScaler()
std_scale = StandardScaler()

def scale_df(df):
    # columns whose values are all >= 0 get min-max scaling
    pos_columns = df.loc[:, df.ge(0).all()].columns
    # every other column is standardized to mean 0, standard deviation 1
    other_columns = df.loc[:, ~df.ge(0).all()].columns
    # note: the scalers are re-fit on every call to scale_df
    x_pos_scaled = min_max.fit_transform(df[pos_columns])
    x_other_scaled = std_scale.fit_transform(df[other_columns])
    # the returned array places the min-max-scaled columns first
    return_array = np.concatenate((x_pos_scaled, x_other_scaled), axis=1)
    return return_array
def preprocess_data(df, n=1000000):
    # hold out the last n rows as the validation/test set
    y_train, y_test = df['target'].iloc[:-n].to_numpy(), df['target'].iloc[-n:].to_numpy()
    x_train, x_test = scale_df(df.drop(columns='target').iloc[:-n]), scale_df(df.drop(columns='target').iloc[-n:])
    return x_train, x_test, y_train, y_test
import tensorflow as tf
from tensorflow.keras import layers

def build_model():
    primary_initalizer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.1)
    hidden_initalizer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.05)
    # defined to match the paper but left unused; the output layer keeps the default initializer
    output_initalizer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.001)
    model = tf.keras.Sequential([
        layers.Dense(300, activation='tanh', kernel_initializer=primary_initalizer, input_shape=(28,)),
        layers.Dense(300, activation='tanh', kernel_initializer=hidden_initalizer),
        layers.Dense(300, activation='tanh', kernel_initializer=hidden_initalizer),
        layers.Dense(300, activation='tanh', kernel_initializer=hidden_initalizer),
        layers.Dense(1, activation='sigmoid')
    ])
    return model
initial_learning_rate = 0.05
batch_size = 100

def get_optimizer():
    # decay=1e-5 is passed through **kwargs; see the discussion in the results section
    return tf.keras.optimizers.SGD(learning_rate=initial_learning_rate, momentum=0.9, decay=1e-5)

def get_callbacks():
    return tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

def compile_and_fit(model, train_ds, validate_ds, max_epochs):
    # note: from_logits=True is used even though the output layer applies a
    # sigmoid activation, so the loss treats the probabilities as raw logits
    model.compile(optimizer=get_optimizer(),
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=[
                      tf.keras.losses.BinaryCrossentropy(
                          from_logits=True, name='binary_crossentropy'),
                      'accuracy',
                      tf.keras.metrics.AUC()
                  ])
    model.summary()
    history = model.fit(
        train_ds[0],
        train_ds[1],
        batch_size=batch_size,  # 100, matching the batch size described in the write-up
        epochs=max_epochs,
        validation_data=validate_ds,
        callbacks=get_callbacks(),
        verbose=1)
    return history
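The appendix does not show how these functions were invoked; as a rough sketch, the pipeline might be tied together as follows. The CSV loading step, file name, and column names here are assumptions for illustration, not taken from the case study.

```python
import pandas as pd

# Assumed loading step: 29 columns with the target first, as described on the UCI page.
cols = ['target'] + [f'feature_{i}' for i in range(1, 29)]
df = pd.read_csv('HIGGS.csv', header=None, names=cols)

# Scale and split, then build and train the network described above.
x_train, x_test, y_train, y_test = preprocess_data(df, n=1000000)
model = build_model()
history = compile_and_fit(model, (x_train, y_train), (x_test, y_test), max_epochs=30)
```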
import tensorflow_addons as tfa

def get_optimizer():
    return tfa.optimizers.SGDW(learning_rate=initial_learning_rate, weight_decay=1e-5, momentum=0.9)
def build_model():
    primary_initalizer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.1)
    hidden_initalizer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.05)
    output_initalizer = tf.keras.initializers.RandomNormal(mean=0., stddev=0.001)
    # L2 penalty of 1e-5 on each hidden layer's kernel acts as a form of weight decay
    model = tf.keras.Sequential([
        layers.Dense(300, activation='tanh', kernel_initializer=primary_initalizer, input_shape=(28,),
                     kernel_regularizer=tf.keras.regularizers.L2(0.00001)),
        layers.Dense(300, activation='tanh', kernel_initializer=hidden_initalizer,
                     kernel_regularizer=tf.keras.regularizers.L2(0.00001)),
        layers.Dense(300, activation='tanh', kernel_initializer=hidden_initalizer,
                     kernel_regularizer=tf.keras.regularizers.L2(0.00001)),
        layers.Dense(300, activation='tanh', kernel_initializer=hidden_initalizer,
                     kernel_regularizer=tf.keras.regularizers.L2(0.00001)),
        layers.Dense(1, activation='sigmoid')
    ])
    return model
initial_learning_rate = 0.05

# the learning rate at a given step is initial_learning_rate * decay_rate ** (step / decay_steps)
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=1000000,
    decay_rate=0.00001,
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)