INNOVARE. Revista de Ciencia y Tecnología. Vol. 12, No. 1, 2023
Available at CAMJOL - Website: www.unitec.edu/innovare/
¹ Corresponding author: jeancasoto@unitec.edu, Universidad Tecnológica Centroamericana, Campus Tegucigalpa, Honduras
Available at: http://dx.doi.org/10.5377/innovare.v12i1.15956
© 2023 Authors. This is an open access article published by UNITEC under the license https://creativecommons.org/licenses/by-nc/4.0/
Original Article
Binary classification of malware by analyzing its behavior in the
network using machine learning
Clasificación binaria de malware mediante el análisis de su comportamiento en la red mediante aprendizaje de
maquina
Jean Carlo Soto¹
Facultad de Ingeniería, Universidad Tecnológica Centroamericana, UNITEC, Tegucigalpa, Honduras
Article history:
Received: 29 October 2022
Revised: 22 March 2023
Accepted: 29 March 2023
Published: 15 April 2023
Keywords
Cybersecurity
Deep learning
Machine learning
Malware
Network
Palabras clave
Aprendizaje de maquina
Aprendizaje profundo
Malware
Red
Seguridad cibernética
ABSTRACT. Introduction. Every day we are exposed to all kinds of cyber threats when we browse the internet,
which compromise the confidentiality, integrity, and availability of our devices. Cyber-attacks have become more
sophisticated, and attackers require less technical knowledge to execute them. An automated and well-defined
process to counter these attacks is urgently needed. The aim of this study was to address this problem. Methods. A model
was developed to analyze the information in Packet Capture (PCAP) files and classify network connections as either
benign or malicious (malware-generated). The software used two methods: traditional machine learning algorithms
and neural networks. The experiments were carried out using the Intrusion Detection Evaluation Dataset
(CICIDS2017), which contains labeled samples of PCAP files, with both raw and standardized data.
The classification results were evaluated using recall, precision, F1-score, and accuracy. Results. Results were
satisfactory for both methods, with F1-score and recall above 95%, indicating a low number of
false negatives. Conclusion. Data standardization had a favorable impact on all metrics and should
be used carefully. Overall, the experiments showed that malicious network traffic can be successfully detected using
automated methods, with the K-Nearest Neighbors (K-NN) classifier achieving an F1-score above 95%.
RESUMEN. Introducción. Cada día estamos expuestos a todo tipo de ciberamenazas cuando navegamos por internet,
comprometiendo la confidencialidad, integridad y disponibilidad de nuestros dispositivos. Los ciberataques se han
convertido más sofisticados y los ciberatacantes requieren menos conocimientos técnicos para ejecutar dichos ataques.
Un proceso automatizado y bien definido para contrarrestar estos ataques se vuelve urgente. El objetivo del estudio fue
resolver este problema. Métodos. Se desarrolló un modelo para analizar la información en los archivos de Captura de
paquetes (PCAP) y clasificar las conexiones de red como benignas o maliciosas (generadas por malware). Este software
utilizó dos métodos: algoritmos tradicionales de aprendizaje de maquina y redes neuronales. Nuestros experimentos se
llevaron a cabo utilizando el conjunto de datos de evaluación de detección de intrusiones (CICIDS2017), que contiene
muestras etiquetadas de archivos PCAP. Se utilizó datos tanto crudos como estandarizados. Los resultados de la
clasificación se evaluaron utilizando métricas de exhaustividad, precisión, puntuación F1 y precisión. Resultados.
Estos fueron satisfactorios para ambos métodos, obteniendo más del 95% en las métricas de puntuación F1 y
exhaustividad, lo que indica un bajo número de falsos negativos. Conclusión. Se encontró que la estandarización de
datos tuvo un impacto favorable en todas las métricas y debe usarse con cuidado. En general, nuestros experimentos
mostraron que el tráfico de red malicioso se puede detectar con éxito utilizando métodos automatizados que alcanzan
más del 95% de la puntuación F1 en el Clasificador del Algoritmo de Vecinos Más Cercanos (K-NN).
1. Introduction
The International Telecommunication Union (ITU)
develops the Global Cybersecurity Index (GCI). It was
first launched in 2015 with 192 ITU member states and
the State of Palestine to help these states identify areas
for improvement and to encourage countries to act by raising
awareness of the state of cybersecurity worldwide. The
GCI consists of 82 questions that evaluate five pillars: legal
measures, technical measures, organizational measures,
capacity development measures, and cooperation
measures.
In the most recent GCI, published in 2020, Honduras ranked
last in the Americas region and
178th out of 182 countries, with a
score of 2.2 out of 100 (International Telecommunication
Union, 2020). This suggests that most people using the
internet are not actually aware of the risks posed by the
ever-increasing amount of malicious software.
Malware is defined as malicious software that is
intentionally placed or inserted into a system to cause harm
(Stallings, 2006), and it has long been known as
one of the most serious threats on the internet. State-of-the-art
software for virus detection and prevention (antivirus)
has been quite successful.
However, antivirus providers face problems due to the
large number of malware variants that are produced
daily. Citing the 2018 McAfee Labs threats report, Ciampa (2021)
notes that the number of new malware instances released
every month exceeds 20 million, and that the total malware in
existence is approaching 900 million instances. In 2019,
four out of every five organizations experienced at least
one successful cyberattack, and over one-third suffered
six or more successful attacks (Cyberedge Group, 2020).
The organizations responsible for dealing with this type
of threat increasingly require better techniques for the
automated classification of malware samples in general.
Malware classification was traditionally
done manually (Tian et al., 2009; Gheorghescu, 2005).
This became inefficient over time because of the large
number of malware samples emerging daily as a result of the
polymorphism, metamorphism, and obfuscation
techniques used in modern malware. As a
consequence, the need
to automate and standardize this process arises.
The automated process should identify malicious network
traffic samples with the smallest possible margin of
error.
One of the most popular approaches to malware
classification is based on content. This approach checks the content
of files and compares it with signatures from a
database, looking for matches with previously identified
malware samples. Some research works (Tian et al., 2009)
concentrate on Malicious Executable Classification
Systems (MECS) that distinguish between benign and
malicious executables. However, this approach cannot
recognize new variants of already known families without
an existing sample of them.
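The limitation described above can be illustrated with a minimal sketch. It assumes, for simplicity, that a signature is the SHA-256 digest of a whole file; real signature engines also use byte patterns and other heuristics. The sample bytes and the names `SIGNATURES` and `is_known_malware` are hypothetical and do not come from the cited works.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of previously identified samples.
KNOWN_SAMPLE = b"MZ\x90\x00 example malicious payload"
SIGNATURES = {hashlib.sha256(KNOWN_SAMPLE).hexdigest()}


def is_known_malware(content: bytes) -> bool:
    """Return True when the digest of the content matches a stored signature."""
    return hashlib.sha256(content).hexdigest() in SIGNATURES


# A variant that differs from the known sample by a single appended byte.
variant = KNOWN_SAMPLE + b"\x00"

print(is_known_malware(KNOWN_SAMPLE))  # True: the exact sample is in the database
print(is_known_malware(variant))       # False: the variant has no stored signature
```

Even a trivial change to the file produces a different digest, which is why obfuscated or polymorphic variants can evade purely content-based matching until a new signature is added.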
Another approach to malware classification is based
on behavior, and it is subdivided into two types: based on the
Central Processing Unit (CPU) and based on data
network traffic. The first analyzes and monitors the
behavior of programs on the computer. The second
analyzes and monitors incoming and outgoing data
packets, connections to hosts, and similar activity. Even though
monitoring and processing system calls can be a
resource-intensive task (Nari & Ghorbani, 2013), most works
using CPU-based classification rely on system calls
to abstract and represent malware behavior. Nari &
Ghorbani (2013) proposed the behavior-based approach
using data network traffic under the assumption that when a
new variant of malware emerges, it will show behavior similar
to that of its predecessor regardless of the obfuscation,
polymorphism, or metamorphism used to create it. Today
there are numerous investigations (Hock & Kortis,
2015; Chockwanich & Visoottiviseth, 2019; Jabez &
Muthukumar Dr., 2015; Yin et al., 2017) that treat the
behavior of malware in data network traffic as an
essential component.
Identifying malware by its network traffic closely
resembles the task of network-based intrusion detection
systems. An intrusion detection/prevention system
(IDS/IPS) is a security tool that can detect malicious
activity and take preventive measures to protect both the
host and the network against potential threats that
would normally pass through a traditional firewall
(Ambati & Vidyarthi, 2013; Kolokotronis & Shiales,
2021). IDS/IPS are divided into two categories: Host
Intrusion Detection/Prevention Systems (HIDS/HIPS) and
Network Intrusion Detection/Prevention Systems
(NIDS/NIPS). HIDS/HIPS are host-based IDS/IPS
that analyze and monitor activity on a
particular machine. NIDS/NIPS detect and prevent
intrusion threats by continuously monitoring data network
traffic, looking for malicious and unauthorized entries that
attempt to compromise the basic security of the data network.
These systems take automatic action to stop the intrusion
by sending alerts to the administrator, dropping or
blocking malicious traffic from the source address, or
terminating the connection (Kolokotronis & Shiales,
2021).
Shipulin (2018) explains the technology behind
NIDS/NIPS systems. These systems work at layer 4 of the OSI
(Open Systems Interconnection) model (Purser, 2004),
that is, with transport layer protocols such as TCP
(Transmission Control Protocol), UDP (User Datagram
Protocol), and others. The goal is to identify malicious
packets in data network traffic that represent attack
attempts. The incoming traffic is divided by
protocol, and it is then decoded, decompressed,
normalized, and compared with a set of signatures.
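The processing steps described above (split by protocol, normalize, compare with signatures) can be sketched roughly as follows. This is an illustrative simplification and not the design of any particular NIDS/NIPS: packets are assumed to be already reduced to (IP protocol number, payload) pairs, normalization is limited to percent-decoding and lowercasing, decompression is omitted, and the signature lists are hypothetical.

```python
from collections import defaultdict
from urllib.parse import unquote_to_bytes

# IP protocol numbers for the two transport protocols named in the text.
PROTOCOLS = {6: "TCP", 17: "UDP"}

# Hypothetical byte-pattern signatures, stored in normalized (lowercase) form.
SIGNATURES = {
    "TCP": [b"/etc/passwd", b"cmd.exe"],
    "UDP": [b"malicious-beacon"],
}


def split_by_protocol(packets):
    """Group payloads by transport protocol name."""
    groups = defaultdict(list)
    for proto_number, payload in packets:
        groups[PROTOCOLS.get(proto_number, "OTHER")].append(payload)
    return groups


def normalize(payload: bytes) -> bytes:
    """Undo percent-encoding and case differences before matching."""
    return unquote_to_bytes(payload).lower()


def inspect(packets):
    """Return (protocol, signature) pairs for every signature found."""
    alerts = []
    for protocol, payloads in split_by_protocol(packets).items():
        for payload in payloads:
            clean = normalize(payload)
            for signature in SIGNATURES.get(protocol, []):
                if signature in clean:
                    alerts.append((protocol, signature))
    return alerts


packets = [
    (6, b"GET /index.html HTTP/1.1"),
    (6, b"GET /%65tc/PASSWD HTTP/1.1"),  # obfuscated with encoding and case
    (17, b"dns query"),
]
alerts = inspect(packets)
print(alerts)  # [('TCP', b'/etc/passwd')]
```

The second request matches only after normalization, which shows why that step precedes the signature comparison.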
This research work is based on the premise that any
new variant of malware behaves similarly to its
predecessor (Tian et al., 2009), together with the fact that
most malware communicates with external hosts (Nari &
Ghorbani, 2013). The proposed model bases its operation
on the behavior-based approach at the data network level
(Nari & Ghorbani, 2013). The model parses files
containing frames and packets captured from the network,
known as Packet Capture (PCAP) files.
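To make the structure of a PCAP file concrete, the following minimal sketch reads the classic PCAP layout: a 24-byte global header followed by packet records, each with a 16-byte record header. It is not the parser used in this work, which is not described here. It assumes a little-endian file with microsecond timestamps, and the function name `read_packets` and the four-byte demo packet are hypothetical.

```python
import io
import struct

# Classic PCAP layout, little-endian:
# global header = magic, version major, version minor, time zone offset,
#                 timestamp accuracy, snapshot length, link-layer type
GLOBAL_HEADER = struct.Struct("<IHHiIII")  # 24 bytes
# record header = timestamp seconds, timestamp microseconds,
#                 captured length, original length
RECORD_HEADER = struct.Struct("<IIII")     # 16 bytes
PCAP_MAGIC = 0xA1B2C3D4


def read_packets(stream):
    """Return (link_type, [(timestamp, raw_bytes), ...]) from a PCAP stream."""
    header = stream.read(GLOBAL_HEADER.size)
    magic, _major, _minor, _zone, _sigfigs, _snaplen, link_type = (
        GLOBAL_HEADER.unpack(header)
    )
    if magic != PCAP_MAGIC:
        raise ValueError("not a little-endian, microsecond-resolution PCAP file")
    packets = []
    while True:
        record = stream.read(RECORD_HEADER.size)
        if len(record) < RECORD_HEADER.size:
            break  # end of file
        ts_sec, ts_usec, incl_len, _orig_len = RECORD_HEADER.unpack(record)
        packets.append((ts_sec + ts_usec / 1e6, stream.read(incl_len)))
    return link_type, packets


# Build a tiny in-memory capture with one 4-byte packet (link type 1 = Ethernet).
capture = (
    GLOBAL_HEADER.pack(PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
    + RECORD_HEADER.pack(1, 500000, 4, 4)
    + b"\xde\xad\xbe\xef"
)
link_type, packets = read_packets(io.BytesIO(capture))
print(link_type, packets)  # 1 [(1.5, b'\xde\xad\xbe\xef')]
```

In practice the raw bytes of each record would then be decoded into frames and packets, from which per-connection features can be derived for classification.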
This model employs two methods for classifying
malware-generated traffic samples: (a) using traditional
machine learning algorithms such as K-Nearest
Neighbors (K-NN) and Support Vector Machines (SVM)