A New Machine Learning Approach for Anomaly Detection Using Metadata for Model Training

EasyChair Preprint 829, 6 pages. Date: March 15, 2019

Abstract

We report a new approach to training machine learning (ML) models for binary classification in order to detect anomalies in application log records. Contrary to the common practice of training on the actual values of the log fields, we used the metadata of the log records (the "log schema") to train and test our ML models. Our objective was to use ML models to automatically detect anomalous log records that may carry sensitive or restricted information, and thus to prevent their inadvertent transfer ("leakage") from source to destination environments. In addition to the controls and measures already in place to prevent such data leakage, our ML model approach provides an additional layer of data security that further reduces the possibility of data leaks. Several ML models (decision tree (DT), random forest (RF) and gradient boosted tree (GBT)) were trained using a combination of real (class: "normal") and synthetic (class: "suspicious") metadata for approximately five million log records. The metadata for "normal" records were extracted from the schema of real historical log records that do not contain "sensitive" or "restricted" information. The metadata for likely "suspicious" records were simulated by artificially injecting structural violations that are not observed in the known "normal" log records. The final prediction ("normal" or "suspicious") for each new record was made by a voting classifier over the three models. The three ML models (DT, RF and GBT) in our solution each individually yield high average prediction accuracy (1.0, 0.99 and 1.0, respectively) over multiple experimental runs. Accordingly, the voting classifier consistently yields highly accurate predictions (1.0). Combined, our results suggest that a combination of real and synthetic metadata derived from the log schema, together with a voting classifier, can be successfully applied to build a robust ML solution for anomaly detection in log records.

Keyphrases: decision tree, gradient boosted tree, random forest, anomaly detection, data leakage, machine learning
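The pipeline the abstract describes, reduced to its three ingredients as an added illustration (this is not the authors' code): schema metadata is extracted from each record so that field values never reach the model, "suspicious" training records are synthesized by injecting structural violations absent from the normal data, and a majority vote over three voters decides the label. The three rule-based voters below stand in for the DT, RF and GBT models; the log lines are constructed placeholders.

```python
import json

# Constructed "normal" records (placeholder values, not real logs).
NORMAL_LOGS = [
    '{"ts": "2019-03-15T10:00:01Z", "level": "INFO", "svc": "api", "status": 200, "latency_ms": 12}',
    '{"ts": "2019-03-15T10:00:02Z", "level": "WARN", "svc": "api", "status": 404, "latency_ms": 7}',
    '{"ts": "2019-03-15T10:00:03Z", "level": "INFO", "svc": "db", "status": 200, "latency_ms": 130}',
]

def schema_features(record: dict) -> dict:
    """Metadata only: field name -> (JSON type, length bucket). Values never leave this function."""
    feats = {}
    for name, value in record.items():
        length = len(str(value))
        bucket = "short" if length <= 8 else "medium" if length <= 32 else "long"
        feats[name] = (type(value).__name__, bucket)
    return feats

def learn_schema(records) -> dict:
    """Collect the set of (type, bucket) signatures observed per field in the normal records."""
    seen = {}
    for rec in records:
        for name, sig in schema_features(rec).items():
            seen.setdefault(name, set()).add(sig)
    return seen

def inject_violation(record: dict, kind: str) -> dict:
    """Synthesize a 'suspicious' record via a structural violation not seen in normal data."""
    mutated = dict(record)
    if kind == "extra_field":
        mutated["payload"] = "x" * 40          # unexpected field name
    elif kind == "type_change":
        mutated["status"] = str(mutated["status"])  # int became str
    elif kind == "long_value":
        mutated["svc"] = "x" * 64              # value far longer than any normal one
    return mutated

# Three voters of differing strictness stand in for the DT/RF/GBT models (assumption).
def vote_strict(feats, schema):
    return any(n not in schema or sig not in schema[n] for n, sig in feats.items())

def vote_type_only(feats, schema):
    return any(n not in schema or sig[0] not in {s[0] for s in schema[n]} for n, sig in feats.items())

def vote_bucket_only(feats, schema):
    return any(n not in schema or sig[1] not in {s[1] for s in schema[n]} for n, sig in feats.items())

def classify(record: dict, schema: dict) -> str:
    """Majority vote: a record is 'suspicious' when at least two of three voters flag it."""
    feats = schema_features(record)
    votes = sum(v(feats, schema) for v in (vote_strict, vote_type_only, vote_bucket_only))
    return "suspicious" if votes >= 2 else "normal"
```

Because only type and length-bucket metadata are retained, a training set built this way can be shared with the model-training environment without carrying the field values it was derived from.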