HITB Lab: Practical Machine Learning in InfoSecurity

Abstract

This lab session gives attendees a quick introduction to ML concepts and gets them up and running with the popular machine learning library scikit-learn.
We start by building a basic understanding of how to integrate ML into an email spam identification system. We look at its inner workings and discuss the components involved. Using the training data, we train the system to identify genuine messages, and it learns automatically from these examples. Different classifiers are tuned to squeeze the best performance we can out of this setup.
Once we have an efficient system, we do a deep dive and look at how one can trick the system into failing, again by using ML techniques.
Machine Learning (ML) is the future. Systems we use today use ML extensively, whether it is powering an e-commerce website or fraud detection in banking. However, it takes the average developer and security professional some level of skill and experience to apply machine learning and get useful results. It is a skill that anyone can learn, but we feel that material in this space is greatly lacking.
We give students a gentle introduction to the topic with the classic binary classification problem and introduce classifiers, which are at the core of many of the most common ML systems. We deal with some easy-to-implement classifiers in scikit-learn (linear classifiers, decision trees, etc.) and show visualizations of how they work.
We then dive into training our classifiers with a labelled dataset. Trying different classifiers and verifying their accuracy against held-out test data helps us choose a suitable algorithm for the problem at hand. This lab serves as a quick and practical introduction to the world of machine learning.
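As a rough illustration of what comparing classifiers against test data can look like, here is a minimal sketch. It assumes scikit-learn is installed and uses synthetic data from make_classification as a stand-in for a real labelled dataset; the choice of models, parameters, and split size is ours for illustration and is not taken from the lab material.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (assumption: stands in for real labelled data).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Hold back a quarter of the samples so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Candidate classifiers; the hyperparameters are illustrative, not tuned.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    print(f"{name}: {scores[name]:.3f}")

best = max(scores, key=scores.get)
print("best on this split:", best)
```

A single train/test split can be noisy, so the ranking printed here should be read as a hint rather than a verdict.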
In addition, we guide the student through a simple example of deploying security machine learning systems in production pipelines in a distributed and scalable fashion using Apache Spark. Lastly, we will touch on ways that such systems can be poisoned, misguided, and utterly broken if the architects and implementers are not careful.
Overview of the topics covered in the workshop:

Introduction to machine learning
Hands-on guided exploration of Python machine learning libraries:
Data wrangling using NumPy and Pandas
Scikit-learn’s functions and capabilities
Data visualization using Matplotlib/Seaborn
Walkthrough of the most commonly used machine learning algorithms (with quick hands-on examples/visualizations for select algorithms)
Supervised learning algorithms
Linear/logistic regression
Support Vector Machines
Decision trees/Random forests
Unsupervised learning algorithms
Hierarchical/k-Means clustering
Semi-supervised learning
Lecture on application of machine learning in the security/abuse space
Spam, fraud, malware, phishing, and intrusion detection short examples
Principles behind selecting the best machine learning models for different use-cases
Considerations when using machine learning in an adversarial/malicious environment
Streaming pipelines for machine learning using Apache Spark MLlib (PySpark)
Apache Spark general architecture
Distributed, scalable machine learning deployments with Spark
Guided example of a streaming architecture for network anomaly detection using reinforcement learning on Spark
Evaluating the security of machine learning systems
Techniques and guided example of fuzzing a classifier and regressor to find blind spots in the model
Evaluation of an intelligent learning system architecture that is resilient to model poisoning by an adversary
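To give a feel for the last topic, the sketch below shows one simple interpretation of fuzzing a classifier: add small random noise to known inputs and record every mutation that changes the predicted label. This is our own illustrative assumption of what such an exercise might look like, not the lab's actual method. The helper name fuzz_classifier, the synthetic data, the noise scale, and the trial count are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def fuzz_classifier(model, samples, rng, n_trials=50, scale=0.3):
    """Hypothetical helper: return (original, mutated) pairs whose labels differ."""
    flips = []
    base = model.predict(samples)  # labels for the unmodified inputs
    for _ in range(n_trials):
        # Mutate every sample with small Gaussian noise.
        mutated = samples + rng.normal(0.0, scale, size=samples.shape)
        changed = model.predict(mutated) != base
        flips.extend(zip(samples[changed], mutated[changed]))
    return flips


# Synthetic data and a simple model (assumptions for illustration only).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X, y)

seeds = X[:50]
n_trials = 50
flips = fuzz_classifier(model, seeds, np.random.default_rng(0), n_trials=n_trials)
print(f"{len(flips)} label flips in {n_trials * len(seeds)} mutations")
```

Inputs that flip under very small perturbations sit close to the decision boundary, which makes them natural starting points when looking for blind spots.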

The following key components will be explored in detail while developing the filter:

CountVectorizer – transforming text data and tuning its parameters
SVC() / NB – different algorithms that one could use
Why are pipelines in scikit-learn useful?
DataFrame in Pandas / NumPy arrays
K-Fold – for easy dataset splitting and cross-validation
confusion_matrix – for accuracy testing
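The components listed above might fit together roughly as follows. This is a minimal sketch under stated assumptions: the twelve messages are a made-up toy corpus, Naive Bayes (MultinomialNB) is picked arbitrarily over SVC(), and the CountVectorizer settings are illustrative rather than tuned. NumPy arrays are used instead of a Pandas DataFrame to keep the example short.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus (assumption); 1 = spam, 0 = genuine.
messages = np.array([
    "win a free prize now", "cheap pills online today",
    "claim your free reward", "urgent offer click now",
    "free money waiting for you", "lowest price pills click here",
    "meeting moved to friday", "please review the attached report",
    "lunch tomorrow at noon", "notes from the team meeting",
    "can you send the report", "see you at the office friday",
])
labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Accumulate one 2x2 confusion matrix over all folds.
total = np.zeros((2, 2), dtype=int)
kfold = KFold(n_splits=4, shuffle=True, random_state=0)

for train_idx, test_idx in kfold.split(messages):
    # The pipeline keeps vectorizer and classifier together, so the
    # vocabulary is learned from the training fold only.
    model = make_pipeline(
        CountVectorizer(lowercase=True, ngram_range=(1, 2)),
        MultinomialNB(),
    )
    model.fit(messages[train_idx], labels[train_idx])
    predicted = model.predict(messages[test_idx])
    total += confusion_matrix(labels[test_idx], predicted, labels=[0, 1])

# Rows are true labels, columns are predicted labels (genuine, spam).
print(total)
```

Building a fresh pipeline inside each fold is a deliberate choice: it prevents words seen only in the test fold from leaking into the vocabulary used for training.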

Prerequisite Knowledge:

Basic familiarity with Linux
Python scripting knowledge is a plus, but not essential

Technical Requirements

Latest version of VirtualBox installed
Administrative access on your laptop, with external USB devices allowed
At least 20 GB free hard disk space
At least 4 GB RAM (the more the better)