Building a Lightweight Network Intrusion App


Introduction

At data-driven companies, the findings and outputs of the data science team feed the APIs that power software used by other teams. Perhaps an inventory manager monitors the predicted sell-through rate of all new items added to an e-commerce platform, or an engineer builds a client application that leverages machine learning to recommend music. These are examples of how companies can integrate, leverage, and disseminate machine learning and data science.

A challenge many companies face is the handoff: how can other teams interact with, and build applications on, models developed by data scientists? Taking a page out of the software engineering playbook, complex systems can speak to each other through APIs. I wanted to leverage the DataScience.com Platform to construct a machine learning API to power other applications. I also wanted to leverage Skater, our model interpretation library, to help disconnected teams understand how shared machine learning models work.

I looked at the problem of network intrusion: detecting malicious activity on a network. Once I had built a good model, I wanted to:

  • Set up an API that processed sequences of connections and determined whether an attack was underway and, if so, what type of attack it was.

  • Build a notification system that, in the event of a detected attack, alerted stakeholders with a link to a report explaining why the system believed an attack was underway.

Like this:

[Architecture diagram]


I gave myself 48 hours to complete the project, using the DataScience.com Platform to help me stitch together all the components.

Hours 0-6: The Data

I used the 1999 KDD Cup dataset, which has some known issues. After spending several hours finding, collecting, cleaning, and exploring the data, I began building a model. I do my analysis in Jupyter whenever I can; here, I used a 7.6 GB / 2-core AWS instance and loaded our deep learning dependency collection just in case it came to that (it didn't).
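
If you want to follow along, scikit-learn ships a loader for the same dataset; here is a minimal sketch (my model used a separately cleaned version, so treat this as a convenient stand-in):

from sklearn.datasets import fetch_kddcup99

# the percent10 subset keeps the download and memory footprint manageable
data = fetch_kddcup99(percent10=True)
X_raw, y_raw = data.data, data.target  # labels look like b'normal.', b'smurf.', etc.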

Hours 6-8: Building the Model

There are a couple of challenges with network intrusion detection, namely:

  • Attack patterns are dynamic, and the constant change makes it difficult to detect new attacks. We need models that extrapolate well.

  • Errors are expensive. False positives, or notifications of an attack when there is none, waste time and resources. Meanwhile, false negatives, or the false assurance that there is no breach when there actually is one, carry the cost of a breach, information leak, or lapse in service. All models will make errors, but we can change the probability of each type of error by weighting samples differently: adding weight to samples that correspond to an attack reduces false negatives at the expense of more false positives (see the sketch after this list).
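
As an illustration of that trade-off, here is a minimal sketch with synthetic data standing in for real connections; the 5x weight is arbitrary, and in practice you would tune it against the relative cost of each error type:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for connection features; 1 = attack, 0 = normal
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# upweight attack samples: fewer false negatives, more false positives
weights = np.where(y_train == 1, 5.0, 1.0)
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train, sample_weight=weights)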

Given that the model needed to extrapolate beyond the training set, I used a kernel-based method to identify abnormalities and help detect new types of attacks. While this approach worked, I got even better results by ensembling it with a gradient boosting classifier.
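
The exact kernel method isn't the point here; as one way to build such an ensemble, here is a sketch that soft-votes an RBF SVC (an assumption, standing in for the kernel method) with gradient boosting, reusing X_train and y_train from the snippet above:

from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC

# soft voting averages the two models' predicted probabilities
ensemble = VotingClassifier(
    estimators=[("svc", SVC(kernel="rbf", probability=True)),
                ("gbm", GradientBoostingClassifier())],
    voting="soft",
)
ensemble.fit(X_train, y_train)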

After training, the model is small enough to be serialized, tied to version control, and dumped to S3.
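
Concretely, that step might look like the following; the bucket and key names are hypothetical, and ensemble is the model from the sketch above:

import joblib
import boto3

version = "v1"
path = "intrusion_model_{}.pkl".format(version)
joblib.dump(ensemble, path)  # serialize the trained model

# push to S3; the prediction API later fetches it with download_file
boto3.client("s3").upload_file(path, "my-model-bucket", "models/" + path)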

Hours 22-25: The Prediction API

I needed to build the "intruder detector" API: the component that looks at connections and selectively sends notifications to stakeholders. This script would load the model from S3 and provide a predict function that takes in a connection and returns a prediction as JSON. I used DataScience.com deploy to expose this callable as an API on a container. If we needed to update the model, we could simply change the script's model id and bump the version:

def predict(connection):
    """Score the connection with the local model object, send an alert
    if an intrusion is detected, and return the results as a dictionary.
    """
    row = [connection[feature] for feature in features]
    prediction = model.predict([row])[0]  # scikit-learn expects a 2D array
    if prediction == "Normal Activity":
        return {'message': "Normal Activity"}
    else:
        report_url = get_url_from_row(row)
        msg = "{0} detected. To see why, go to {1}".format(prediction, report_url)
        for number in target_numbers:  # text every stakeholder on the list
            alert(msg, number)
        return {'message': "{} detected, sending alert.".format(class_pretty_names[prediction])}

The prediction API generates a URL that will populate the reporting app with the current connection, and sends this link via the Twilio API.

from urllib.parse import urlencode  # urllib.urlencode in Python 2

def get_url_from_row(row):
    """Return a URL to the reporting application,
    pre-populated with the current connection.
    """
    return 'https://intruder-detector.datascience.com/?' + urlencode(list(zip(features, row)))
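
The alert helper called in predict isn't shown above; here is a minimal sketch using Twilio's Python client, with hypothetical environment variable names for the credentials and sender number:

import os
from twilio.rest import Client

client = Client(os.environ["TWILIO_SID"], os.environ["TWILIO_TOKEN"])

def alert(msg, number):
    """Text msg to the given phone number."""
    client.messages.create(to=number, from_=os.environ["TWILIO_FROM"], body=msg)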


Hours 25-30 and 46-48: The Reporting Application

Next, I needed the reporting application: the report that explains intrusions to stakeholders. The reporting app is a Flask application, deployed via Gunicorn with four workers. It contains a form holding the attributes of an individual connection. Once an intruder is detected, the prediction API sends a message containing a URL that populates the form as needed. This allows our hypothetical IT team to get real-time notifications of intruders, and detailed explanations of the situation.
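
The form-populating piece is simple; here is a sketch, where report.html is a hypothetical template that renders the form fields:

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    # the query string built by get_url_from_row pre-populates the form
    return render_template("report.html", connection=request.args)

An app like this matches the four-worker setup when served with something like gunicorn -w 4 app:app.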

Once the reporting app is populated with a connection, it asks the report API for an explanation. The explanation is HTML, so we can render it directly into an iframe:

def get(funcname, **kwargs):
    """Ask the report API to execute function funcname with
    arguments **kwargs.
    """
    json_body = {"funcname": funcname}
    json_body.update(kwargs)
    cookies = {'datascience-platform': ds_platform_cookie}
    response = requests.post(ds_platform_url, json=json_body, cookies=cookies)
    return response.text

@app.route("/get_explanation", methods=['POST'])
def return_explanation_from_post():
    body = request.json
    attack_type = body['attack_type']
    user = body['user']
    response = get('return_explanation', attack_type=attack_type, user=user)
    return response

With the report HTML served, we can explore which connection attributes are driving the prediction. In one example, the model believed a connection corresponded to a probe intrusion, given its connection flags and zero logins.

Hours 30-34: The Report API

The report API needed to do a few things: generate predictions, give explanations, and provide metadata like feature names. I defined a router function that takes all requests, selects the appropriate function, and passes it the specified arguments:

def router(funcname, *args, **kwargs):
    """Dispatch the request to the DeployFuncs method named funcname."""
    return getattr(DeployFuncs, funcname)(*args, **kwargs)

class DeployFuncs(object):
    @staticmethod
    def return_explanation(attack_type, user):
        # assemble the feature vector in the order the model expects
        row = np.array([user[i] for i in features])
        instance = np.array(list(map(clean, row)))  # coerce raw form values
        label_id = classes.index(attack_type)
        html = explainer.explain_instance(instance, model_obj, labels=(label_id,)).as_html()
        return {'html': html}
    # etc...

This is a sort of pseudo-REST interface. But how do we generate explanations? Using Skater and LIME, we take the current connection, randomly perturb its characteristics, and observe how our model's prediction changes as a result. It's somewhat analogous to a Monte Carlo simulation for estimating unknown quantities, except that instead of accepting or rejecting samples, we weight each perturbed sample by its distance from the original point and fit a simple local model that shows which features drive the prediction. Here's an example where a random forest regressor has learned that there are actually two separate linear relationships, depending on the values of X:

from skater.model import InMemoryModel
from skater.core.local_interpretation.lime.lime_tabular import LimeTabularExplainer
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def data_generating_process(x):
    """If the sum of x is positive, the coefficients are positive;
    otherwise they are negative."""
    coefs = np.array([3.2, 0.5, 1.3])
    if x.sum() > 0:
        return np.random.normal(0, 10) + np.dot(coefs, x)
    else:
        return np.random.normal(0, 10) - np.dot(coefs, x)

# generate some data
X = np.random.normal(0, 10, size=(1000, 3))
y = np.apply_along_axis(data_generating_process, 1, X)

# fit a "black box" model
black_box = RandomForestRegressor()
black_box.fit(X, y)
model = InMemoryModel(black_box.predict)

# pick an example from each regime
example1 = X[np.where(X.sum(axis=1) > 0)][0]
example2 = X[np.where(X.sum(axis=1) < 0)][0]

# explain the two examples through the model
explainer = LimeTabularExplainer(X, mode='regression', discretize_continuous=False)
explainer.explain_instance(example1, model).show_in_notebook()
explainer.explain_instance(example2, model).show_in_notebook()
Running this prints a Skater warning ("No examples provided, cannot infer model type") because InMemoryModel was constructed without example inputs.

Thus, we get locally faithful explanations of why our model identifies certain connections as intrusions. Because our model is deployed as an API, anyone at the company can use Skater's deployed model wrapper to understand its inner workings.
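
For instance, a colleague could point Skater at the live endpoint. Here is a sketch, assuming Skater's DeployedModel wrapper takes the endpoint URI plus input and output formatting functions; the URI path and formatters here are hypothetical:

from skater.model import DeployedModel

def input_formatter(data):
    # shape the request body the way the prediction API expects
    return {"connections": data.tolist()}

def output_formatter(response):
    # pull predictions out of the JSON response
    return response.json()["predictions"]

remote_model = DeployedModel("https://intruder-detector.datascience.com/predict",
                             input_formatter, output_formatter)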

Wrapping Up

We hear from customers all the time that they would love to become more data-driven and to leverage the work of their data science teams throughout their companies. To that end, visibility, model deployments, and a spirit of institutional "modularity" are paramount. I found that the DataScience.com Platform enabled me to put these concepts into practice and quickly build an effective network intrusion detection app.