Hindawi
Security and Communication Networks
Volume 2021, Article ID 5567635, 20 pages
https://doi.org/10.1155/2021/5567635

Research Article
Detecting Multielement Algorithmically Generated Domain
Names Based on Adaptive Embedding Model

Luhui Yang,1 Guangjie Liu,2 Weiwei Liu,1 Huiwen Bai,1 Jiangtao Zhai,3 and Yuewei Dai2

1 School of Automation, Nanjing University of Science and Technology, Nanjing, China
2 School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing, China
3 School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China

Correspondence should be addressed to Guangjie Liu; gjieliu@gmail.com

 Received 19 January 2021; Revised 25 April 2021; Accepted 24 May 2021; Published 1 June 2021

Academic Editor: Jesús Díaz-Verdejo

Copyright © 2021 Luhui Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the development of detection algorithms for malicious dynamic domain names, domain generation algorithms have evolved to become more stealthy. The use of multiple elements for generating domains leads to higher detection difficulty. To effectively improve the detection accuracy of algorithmically generated domain names based on multiple elements, a domain name syntax model is proposed, which analyzes the multiple elements in domain names and their syntactic relationships, and an adaptive embedding method is proposed to achieve effective element parsing of domain names. A parallel convolutional model based on a feature selection module, combined with an improved dynamic loss function based on curriculum learning, is proposed, which can achieve effective detection of multielement malicious domain names. A series of experiments are designed, and the proposed model is compared with five previous algorithms. The experimental results indicate that the detection accuracy of the proposed model for multielement malicious domain names is significantly higher than that of the comparison algorithms, and the model also has good adaptability to other types of malicious domain names.

1. Introduction

Advanced Persistent Threat (APT) attacks and botnets have become important threats to network security [1, 2]. To achieve remote control of the controlled hosts, attackers usually use a remote Command and Control (C & C) server. After establishing a communication link with the C & C server, the attacker can collect information about the controlled host and steal sensitive data or use the controlled host to launch further attacks on other hosts or networks. Early botnets would hard-code the domain names of C & C servers in malware, which can be easily detected and blocked by firewalls or Intrusion Detection Systems (IDS). To enable dynamic updating of domain names, the Domain Generation Algorithm (DGA) is widely used in malware communication with C & C servers. Cyber attackers use DGAs to generate and register a large number of pseudorandom domain names and then embed the domain generation code in the malware that infiltrates the hosts. When communicating with the C & C server, the algorithm is used to generate domain names to complete the communication. Because the domain names are usually dynamically generated, the malware's communication domain names are updated over time, thus evading the detection rules of security software. With the development of the mobile Internet, 5G technology, and the increase of Internet of Things (IoT) devices, detecting malicious algorithmically generated domain names is of great importance for network security.

DGAs can be classified into various categories according to their generation principles and generating elements [3]. Early DGAs used English letters, numbers, and some special characters as basic elements to generate domain names by simple random algorithms. Then some botnets started to use words as the basic elements for domain generation, such as Matsnu [4] and Suppobox [5]. Compared with character-based DGAs, word-based DGAs are more difficult

to detect. Later, some researchers designed DGAs that can generate domain names with lower character randomness by using combinations of English letters [6]. In recent years, with the development of detection techniques for malicious domain names, some researchers have designed DGAs that are more resistant to detection, such as a Stealthy Domain Generation Algorithm (SDGA) based on the Hidden Markov Model (HMM) [7] and the use of Generative Adversarial Networks (GANs) to generate dynamic domain names [8]. According to the analysis of legitimate domains, there are elements such as English words, Chinese pinyin, abbreviations, special-meaning numbers, English letters, and special characters in legitimate domain names. When domain names are generated based on these multiple elements, the degree of anomaly in the grammatical composition is lower, and the existing algorithms are not effective for the detection of malicious domain names based on multiple elements.

To effectively improve the detection accuracy of multielement hybrid malicious domain names, a domain name multielement adaptive embedding method based on a domain name syntax model is proposed, which can realize the effective segmentation of the various elements in domain names. Based on the adaptive embedding module, a parallel convolutional model based on a feature selection module is proposed, which, combined with a dynamic Focal Loss function, can achieve effective feature extraction and classification of multielement hybrid malicious domain names. Comparative experimental results on several typical datasets indicate that the proposed model can effectively detect multielement hybrid malicious domain names and also has good adaptability to other types of malicious domain names.

This paper's contributions can be summarized as follows:

(1) We propose a domain name syntax model, which analyzes the constituent elements and syntax rules of domain names. It can be used for the element parsing of domain names.

(2) We propose an adaptive parsing and embedding module, which can obtain more appropriate domain name parsing results based on the maximum entropy probability. It can map the elements into vector representations.

(3) We propose a parallel convolutional model based on a feature selection module to improve the classification accuracy by performing feature attention on multiple convolutional branches.

(4) We propose an improved dynamic loss function based on curriculum learning, which can effectively improve the detection accuracy for hard-to-detect samples.

The remainder of this paper is organized as follows. In the next section, we summarize the related work on existing malicious domain name detection methods. In Section 3, the proposed domain name syntax model is introduced. In Section 4, the proposed detection model is introduced. In Section 5, we first introduce the selected dataset in our research, and then a series of experiments are conducted on the datasets. Finally, in Section 6, we conclude our research and future work.

2. Related Work

When malware communicates with C & C servers, the requested domain name is usually a dynamic domain name generated by a DGA. Detecting malicious dynamic domain names can discover the communication behaviors of malware and C & C nodes. Methods for detecting malicious DNS communication can be divided into three categories based on their principles: feature-based, behavior-based, and deep-learning-based. Among them, feature-based detection algorithms analyze the character features of malicious domain names and legitimate domain names. Behavior-based algorithms analyze the behavioral features of malicious and legitimate DNS communication. Deep-learning-based algorithms use deep-learning models to extract features; they do not rely on manual analysis of features, have better self-learning and self-adaptive capabilities, and can detect unknown malicious domain names. Deep-learning-based algorithms are the most popular methods in recent years, so they are grouped into a separate category. The following is a review of the three types of detection algorithms.

Domain name feature-based detection algorithms are the earliest and most classical malicious domain name detection algorithms; they are implemented by analyzing the character features of domain names. This type of detection algorithm relies only on the analysis of domain name features, without long-time observation or additional information. Yadav et al. [9, 10] detected malicious domain names by extracting Kullback-Leibler Divergence, Jaccard Index, and Edit Distance features; they also used a linear regression classifier to achieve effective detection. However, the algorithm was tested only on a few botnet families, not on a larger DGA dataset. Later, Schiavoni et al. [11] proposed Phoenix, a domain name detection algorithm, which uses a meaningful character ratio and N-gram normality as features. The shortcomings of this algorithm lie in the small number of feature dimensions and the fact that it has been tested only on a limited number of malicious dynamic domain names. Raghuram et al. [12] focused on the readability of domain names; they analyzed the character composition, word composition, word length, and other dimensions of legitimate domain names and established a probability model of legitimate domain names from a large number of samples to achieve a better effect of detecting malicious domain names. Truong et al. [13] analyzed domain name construction rules based on a large number of legitimate and malicious domain names to detect malicious domain names by the length expectation value. Schales et al. [14] performed detection by extracting 17 domain name features combined with weighted confidence and verified the effectiveness of the algorithm in a large-scale network environment. Yang et al. [15] proposed a word semantic analysis method for word-based domain names to detect malicious domain names by the correlation between the words in them.

The drawback of feature-based detection algorithms is that they rely on manual feature analysis capabilities, and the dimensionality of character-level features is limited. After verifying the effectiveness of domain name features, Zago et al. [16] found that only a few features had a significant impact on the detection results. Malware communicates with C & C servers with similar lifecycles and query patterns in its DNS requests. Therefore, some researchers have combined the communication behavior features of DNS with character-level features of the domain name to design detection algorithms. Bilge et al. [17] summarized 15 features of domain name requests based on the analyzed features and designed a malicious domain name detection system called EXPOSURE using a J-48 decision tree as the classifier. The system successfully detected a large number of malicious domain names in real network traffic. Shi et al. [18] extracted domain name features, IP features, TTL features, and WHOIS features and used an Extreme Learning Machine (ELM) to achieve effective classification. Mowbray et al. [19] proposed a malicious domain name detection algorithm based on domain name length distribution, which can achieve effective detection of some specific kinds of malicious domain names. Kwon et al. [20] used the Power Spectral Density (PSD) analysis technique to analyze DNS query time information in large-scale networks to detect malicious DNS requests. Since the generation and registration of DGA domain names are not fully synchronized, a large number of unregistered domain name requests appear in the DNS requests of malware; based on this fact, Yadav et al. [21] and Antonakakis et al. [22] proposed various malicious domain name request detection algorithms based on NXDomain responses caused by unregistered domain names.

In recent years, deep-learning-based DGA domain name detection has become the most popular direction for malicious domain name detection. Woodbridge et al. [23] first used a deep learning model to detect malicious domain names, using a Long Short-Term Memory (LSTM) network to design a detection model. Yu et al. [24] built a model for detecting malicious domain names using CNNs and achieved detection results close to the LSTM model. Subsequently, Yu et al. [25] compared the detection results of seven classical deep-learning-based models and analyzed the detection effects of different CNN and LSTM architectures; the results indicate that both LSTM and CNN structures achieved effective detection results. Tran et al. [26] proposed an LSTM-MI model, which can effectively alleviate the sample imbalance in binary classification and multiclassification problems by using a cost-sensitive loss function. Also, to improve the accuracy of the multiclassification problem, Qiao et al. [27] added a global attention mechanism to the LSTM model, which can significantly improve the multiclassification accuracy of DGA domain names. The global attention mechanism relies on the calculation of the global attention matrix, which is strongly influenced by different training samples; a local attention mechanism can be considered to improve the attention effect by calculating only the weights between different characters in the domain name. Therefore, Yang et al. [28] proposed a detection model combining CNN and LSTM with a local attention mechanism, which can effectively improve the detection accuracy of some hard-to-detect malicious domain names. Xu et al. [29] improved the embedding method of domain names with N-gram characters and used a parallel CNN network for classification.

DGA algorithms have also evolved in recent years to generate stealthier malicious domain names. For example, the malware Matsnu and Suppobox use English words to generate domain names, which can effectively reduce the randomness of the characters in the domain names. On the other hand, Fu et al. [7] proposed a stealthy domain name generation algorithm based on Hidden Markov Models (HMMs) and Probabilistic Context-Free Grammars (PCFGs). The generative model is designed and trained with a large number of legitimate samples and generates domain names that better fit the character distribution of legitimate domain names. Therefore, more effective detection models need to be designed for stealthier malicious domain names.

In summary, compared with domain name feature-based detection algorithms, deep-learning-based detection algorithms do not need to manually analyze features and extract higher-dimensional domain name features through deep-learning architectures. Compared with behavioral feature-based detection algorithms, deep-learning-based detection algorithms do not need to collect the complete DNS interaction. However, the embedding method of current deep-learning-based algorithms is relatively simple, and most of them only use character embedding, which cannot effectively represent domain names with multiple elements; thus it is necessary to analyze the different elements in domain names and design more effective embedding methods.

3. Domain Name Syntax Model

This section defines a domain name syntax model. Firstly, the constituent elements of domain names are analyzed, and the basic constituent units of the domain name syntax tree are defined from multiple dimensions. Then, the multidimensional tree structures of the domain name are analyzed. Finally, the features of legitimate domain names and malicious dynamic domain names are compared based on the syntax model, and the results indicate that the domain name syntax tree can effectively characterize the domain name and can support the design of the detection model of malicious domain names.

3.1. Elements Definition of Domain Name Syntax Model. The basic elements of a domain name are numbers, English letters, and special characters. The numbers are the 10 digits from 0 to 9, and the English letters are usually lowercase letters from the English alphabet. The special characters in legitimate domain names usually only contain "-" and "_" according to the statistical result of the Alexa domain list. There are various combinations of English letters in domain names that can form elements with certain semantic meanings, such as English words, Chinese pinyin, abbreviations, and so

on. Different numbers can also form numeric combinations with special meanings, such as years. To analyze the elements in domain names effectively, the constituent elements in domain names are defined in multiple dimensions.

Firstly, the basic elements of the domain name are defined in equation (1); the basic elements are the 10 numbers, the 26 English letters, and 2 special characters:

  \bar{A} = \{a-z\}, \quad \bar{N} = \{0-9\}, \quad \bar{T} = \{'-', '\_'\},  (1)

where \bar{A} denotes the set of English letters, \bar{N} denotes the set of numbers, and \bar{T} denotes the set of special characters.

There are a large number of English words in legitimate domain names, which can be regarded as one type of element in domain names. It can be defined as

  \bar{W} = \{w_1, w_2, \ldots, w_n\},  (2)

where \bar{W} denotes the set of English words.

The total number of two-character combinations in the domain name is 676, and the top 50 two-character combinations can be taken as the meaningful two-character set. The total number of three-character combinations in the domain name is 17,576, and the top 100 three-character combinations can be taken as the meaningful three-character set. So the meaningful two-character and three-character sets in the domain name are defined as

  \bar{B} = \{b_1, b_2, \ldots, b_{50}\}, \quad \bar{F} = \{f_1, f_2, \ldots, f_{100}\},  (3)

where \bar{B} denotes the set of meaningful two-character combinations and \bar{F} denotes the set of meaningful three-character combinations.

Other elements in the domain name include Chinese pinyin, English abbreviations, Chinese abbreviations, and special-meaning numeric combinations, which can be defined as

  \bar{G} = \{g_1, g_2, \ldots, g_n\}, \quad \bar{H} = \{h_1, h_2, \ldots, h_n\}, \quad \bar{I} = \{i_1, i_2, \ldots, i_n\}, \quad \bar{J} = \{j_1, j_2, \ldots, j_n\},  (4)

where \bar{G} denotes the set of Chinese pinyin, \bar{H} denotes the set of English abbreviations, \bar{I} denotes the set of Chinese abbreviations, and \bar{J} denotes the set of special-meaning numeric combinations.

3.2. Domain Syntax Tree Construction Methods. Tree structures can be constructed in several dimensions for describing the relationship between elements in a domain name or between different domain names. In this paper, we focus on the syntactic composition of the relationships between different elements in a domain name. From the analysis in the previous subsection, it is clear that the same domain name can be divided into various types of elements, and the characters in a domain name can be divided according to single, double, or triple characters, as well as pinyin, word, and number combinations. Take the domain name "baidutop123" as an example. The domain name can be parsed according to single, double, or triple characters; however, such parsings cannot capture the semantic meaning of the characters in the domain name. The most appropriate parsing, the one that satisfies human understanding, should parse the domain into "bai," "du," "top," and "123."

To obtain more appropriate domain name parsing results, a domain name syntax model based on PCFG is proposed. The PCFG is a syntax model based on word types and cooccurrence probabilities that parses sentences into tree-like element parse trees. The PCFG model only considers words, while the elements of domain names are more complex. According to the analysis in the previous subsection, the set of elements in a domain name includes English letters, numbers, special characters, words, two-character combinations, three-character combinations, Chinese pinyin, Chinese abbreviations, English abbreviations, and special-meaning numerical combinations. The combination of different elements can be defined as

  WN → W ∘ N,  (5)

where W denotes a word and N denotes a number. It can be further expressed as

  W → top, \quad N → 1.  (6)

Therefore, the parse tree of "baidutop123" can be obtained based on the domain name PCFG, as shown in Figure 1.

In Figure 1, the symbol D denotes the domain name, the individual branches denote the grammatical composition of the domain name, the symbols in equation (5) denote the nonterminal symbols of the syntax tree, and the symbols in equation (6) denote the terminal symbols.

Since the elements in domain names are not randomly combined, there is a certain combination probability, so the PCFG model enhances the rules by adding a probability to each rule:

  A → β [p],  (7)

where A denotes a nonterminal symbol, β denotes a concatenation of terminal and nonterminal symbols, and p denotes the probability of occurrence of the rule. The domain name syntax model can be defined as shown in equation (8):

  G = (M, Σ, R, D, P),  (8)

where the conditional probability of a nonterminal symbol A deriving a symbol string can be expressed as P(A → β | A).

The words in a domain name can also be parsed into characters or two-character combinations, etc. Therefore, there are different priorities for parsing the characters into different elements. Segmenting the elements in a domain
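The probabilistic rule form of equation (7) can be made concrete with a small sketch. This is an illustrative encoding, not the paper's implementation: the rule set and all probability values below are hypothetical numbers chosen for the "baidutop123" example.

```python
# Sketch: probabilistic rules A -> beta [p] stored per nonterminal, and the
# probability of one derivation obtained by multiplying its rule probabilities.
# All rules and probabilities here are hypothetical illustration values.
rules = {
    # nonterminal: list of (right-hand side, probability)
    "D": [(("G", "G", "W", "N"), 0.10)],   # pinyin + pinyin + word + number
    "G": [(("bai",), 0.05), (("du",), 0.04)],
    "W": [(("top",), 0.02)],
    "N": [(("123",), 0.01)],
}

def derivation_probability(steps):
    """Multiply the probabilities of the rules used in a derivation."""
    p = 1.0
    for lhs, rhs in steps:
        p *= dict(rules[lhs])[rhs]
    return p

# Derivation of "baidutop123": D -> G G W N, then each nonterminal expands
# to its terminal element.
steps = [
    ("D", ("G", "G", "W", "N")),
    ("G", ("bai",)), ("G", ("du",)),
    ("W", ("top",)), ("N", ("123",)),
]
print(derivation_probability(steps))  # 0.10 * 0.05 * 0.04 * 0.02 * 0.01
```

A full parser would search over all derivations; this sketch only scores one given derivation, which is enough to show how rule probabilities compose.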

name according to higher priorities can yield a more accurate domain name parse tree. After adding the domain name parsing priority, the domain name syntax model can be defined as

  G = (M, Σ, R, D, P, L),  (9)

where L indicates the priority of the elements in each domain parsing rule.

[Figure omitted: parse tree of "baidutop123" with root D, branching into pinyin (G), word (W), and number (N) elements.]
Figure 1: Illustration of domain name PCFG syntax parse tree.

For a grammar parse tree T obtained from a domain name D, the average priority of the elements in its rules and the probability of the parse tree can be calculated as

  L(T, D) = \frac{1}{n} \sum_{n \in T} l(r(n)), \quad P(T, D) = \prod_{n \in T} p(r(n)),  (10)

where L(T, D) denotes the average priority of each domain name parsing result, P(T, D) denotes the conditional probability of that parsing result, l(r(n)) denotes the priority of each element in the parsing rule, and p(r(n)) denotes the conditional probability of each parsing rule.

For multiple parse trees of the same domain name, the one with the highest average priority and the highest probability can be selected as

  T(D) = \arg\max_{T \in \tau(D)} L(T, D) \;\&\; \arg\max_{T \in \tau'(D)} P(T, D),  (11)

where τ(D) denotes the set of all parse trees for domain D.

The conditional probability of each rule can be derived from the statistical results of a large number of samples and can be calculated as

  P(A → β | A) = \frac{C(A → β)}{C(A)},  (12)

where C(A → β) denotes the number of occurrences of A → β in the statistical samples and C(A) denotes the number of occurrences of the symbol A in the statistical samples.

While there exist multiple dimensional ways to analyze the relationships of elements in a domain name, this section focuses on constructing the syntax tree of a domain name from the element types and the conditional probabilities of occurrence. By analyzing the relationships between the elements in a domain name, the syntax tree of the elements can be constructed, and the legitimacy of the domain name can be analyzed based on the domain name syntax tree.

3.3. Analysis of Malicious Domain Names. Currently known DGAs can be divided into three categories. The first uses characters as generating elements to randomly generate domain names; most current malware uses these algorithms. The second uses English words to generate domain names, reducing the randomness of the characters in the domain names; a small number of current malware families use these algorithms. The third is the SDGA algorithm based on HMM and other generative models; the HMM models generate domain names with a character distribution closer to that of normal domain names. This section analyzes four typical samples of malware-generated domain names based on random-character DGAs, in which Corebot and Gameover generate malicious dynamic domain names using the 26 English letters and 10 numbers, while Cryptolocker and Kraken generate malicious dynamic domain names using only the 26 English letters. Among the word-based DGAs, Matsnu randomly selects verbs and nouns from a word list and uses "-" to connect them, and Suppobox selects two words from a word list to form a dynamic domain name. The HMM-based SDGA algorithm uses English words or legitimate domain names as training samples to train the HMM model and uses the pretrained model to generate malicious dynamic domain names. Sample examples of the three DGA categories are shown in Table 1.

To analyze the character characteristics of different types of malicious domain names, the character frequency distribution of legitimate domain names is counted, and three types of malicious domain names are selected for statistics. The results of the character frequency distribution are shown in Figure 2. As can be seen from the figure, the character frequency of character-based malicious domain names differs significantly from that of legitimate domain names, while the character frequency distribution of word-based and SDGA domain names is closer to that of legitimate domain names, indicating that word-based malicious domain names and SDGA domain names are less abnormal in character distribution and more concealed.

To further analyze the differences among several malicious domain names, the entropy values of domain name parsing results and the distributions of these entropy values are counted. For the domain name parsing result D = c_1, c_2, \ldots, c_n, the entropy can be calculated as

  H(c_1, c_2, \ldots, c_n) = -\sum_{k=1}^{n} P(c_k \mid c_1 \ldots c_{k-1}) \log_2 P(c_k \mid c_1 \ldots c_{k-1}),  (13)
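Equations (10)-(12) can be sketched together: rule probabilities estimated from counts, and candidate parses ranked first by average element priority and then by probability, as in equation (11). The counts, priorities, and candidate parses below are hypothetical illustration values, not the paper's trained statistics.

```python
from collections import Counter

# Equation (12): P(A -> beta | A) = C(A -> beta) / C(A), from observed rule uses.
# Hypothetical counts standing in for statistics over a large sample set.
rule_uses = Counter({("W", "top"): 80, ("W", "web"): 20,
                     ("G", "bai"): 50, ("G", "du"): 50})
lhs_uses = Counter()
for (lhs, _), c in rule_uses.items():
    lhs_uses[lhs] += c

def rule_prob(lhs, rhs):
    return rule_uses[(lhs, rhs)] / lhs_uses[lhs]

# Equation (10): average priority L and probability P of a parse. Priorities
# l(r(n)) are hypothetical: richer elements (word W, pinyin G) outrank
# number combinations (N) and single letters (A).
PRIORITY = {"W": 3, "G": 3, "N": 2, "A": 1}

def score(parse):
    """parse: list of (element_type, element_string). Returns (L, P)."""
    L = sum(PRIORITY[t] for t, _ in parse) / len(parse)
    P = 1.0
    for t, e in parse:
        P *= rule_prob(t, e)
    return L, P

# Equation (11): among candidate parses, pick the one with the highest average
# priority, breaking ties by probability -- tuple comparison does exactly that.
candidates = [
    [("G", "bai"), ("G", "du"), ("W", "top")],
    [("G", "bai"), ("G", "du"), ("W", "web")],
]
best = max(candidates, key=score)
```

Tuple ordering on `(L, P)` implements the two-stage selection of equation (11) in one `max` call, which is why `score` returns the pair rather than a single combined number.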

where H(c_1, c_2, \ldots, c_n) denotes the entropy of the domain name parsing result and P(c_k | c_1 \ldots c_{k-1}) denotes the conditional probability of occurrence of each element. Considering that domain names have different lengths, the average entropy can be used for comparing domain names of different lengths. The average entropy value is calculated as

  \bar{H}(c_1, c_2, \ldots, c_n) = -\frac{1}{n} \sum_{k=1}^{n} P(c_k \mid c_1 \ldots c_{k-1}) \log_2 P(c_k \mid c_1 \ldots c_{k-1}),  (14)

where n denotes the number of elements in a domain name.

Table 1: DGA categories and examples.

  Character-based DGA
    Corebot:       hkf1tw674ifcp32mjy.ddns.net, 7pg2ir32y87vibg.ddns.net, g4afwf7j58utqla.ddns.net
    Cryptolocker:  qydakyiugnjlo.net, fngfkutwhhvtvpj.ru, uopeklvmntwqg.ru
    Gameover:      1weasokabztv6c6z0wh1o1terk.net, 1ck1rgy1agsja3nrpz35rpgity.net, 8to1i2wiszbdn84rez10t04g7.org
    Kraken:        wrdopsivf.net, iuhqhbmq.net, zoipmnwr.net
  Word-based DGA
    Matsnu:        text-react-teacher.com, passenger-express.com, wall-turn-panic.com
    Suppobox:      partyprobable.net, fightprobable.net, freshwelcome.net
  HMM-based SDGA
    —:             stgaarls.net, dap6anng.net, pore9nn10.net

Based on the Markov assumption, the probability of occurrence of each element in the domain name is only related to the former N − 1 elements. For the 2-gram model, the average entropy of the domain name is calculated as

  \bar{H}(c_1, c_2, \ldots, c_n) = -\frac{1}{n} \sum_{k=1}^{n} P(c_k \mid c_{k-1}) \log_2 P(c_k \mid c_{k-1}),  (15)

For character-based domain names, 10,000 legitimate samples, Corebot samples, and Cryptolocker samples are selected, and their average entropy distribution histograms are calculated as shown in Figure 3. Since the occurrence of each character is more random in Corebot and Cryptolocker, their conditional probability of character occurrence is lower compared with that of legitimate domain names, and thus the average entropy is significantly lower than that of legitimate domain names.

For word-based domains, 10,000 legitimate domain names, 10,000 Matsnu domain names, and 10,000 Suppobox domain names are selected, and the histograms of their mean entropy distribution are shown in Figure 4. The mean entropy of Matsnu domain names is significantly lower than that of legitimate domains, mainly because the words in Matsnu are connected by the character "-," so the average probability is significantly lower. The average entropy of Suppobox domain names, which are composed entirely of words, is closer to that of legitimate domain names; however, the peak of the histogram distribution differs significantly from that of legitimate domain names.

For SDGA domain names, 10,000 legitimate domain names, 10,000 SDGA domain names generated by training with legitimate domain names, and 10,000 SDGA domain names generated by training with English words are selected, and their average entropy distribution histograms are shown in Figure 5. It can be seen that since the SDGA domain names are trained with legitimate domain names or English words, the generated domain names are closer in distribution to legitimate domain names.

The results of parsing and statistical analysis based on the domain name syntax tree show that many types of malicious domain names have a certain degree of difference in entropy distribution from legitimate domain names; however, some word-based and SDGA domain names are closer to legitimate domain names in entropy distribution, indicating that such malicious domain names are more concealed and that the detection method needs to deeply analyze the semantic expression of the elements.

4. Detection Model Based on Adaptive Embedding

Based on the domain name syntax tree model, this section firstly proposes an adaptive domain name element embedding method to map different elements to vectors; then a parallel convolutional model based on a feature selection module is proposed to select features from different convolutional kernel branches to improve the accuracy of feature extraction; and finally, a dynamic loss function based on curriculum learning is proposed to improve the training effect of the model.

4.1. Adaptive Embedding Module. In most previous dynamic domain name detection models, the embedding is based on each character in the domain name, and the relationship between characters is not processed. The literature [29] proposed an N-gram embedding method, which can improve the perception of the association between different characters in the domain name. It is clear from the analysis in Section 3 that there are multiple types of elements in domain names, and N-gram embedding cannot effectively express the different elements in domain names. In this research, we believe that a more accurate segmentation of the different elements in domain names, along with appropriate embedding, can improve the accuracy of the representation of domain names and thus improve the detection effect.

According to the analysis in Section 3, the elements in the domain names are shown in equation (16).

[Figure omitted: bar chart of character frequency (%) for the characters a-z, 0-9, "-", and "_"; series: Legitimate, Cryptolocker, Suppobox, SDGA.]
Figure 2: Frequency statistics of character distribution of legitimate and four malicious domain names.
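The statistic behind Figure 2 is a simple per-character frequency count. A sketch over a hypothetical handful of names (the paper counts over full legitimate and DGA lists):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-_"

def char_frequency_percent(domains, alphabet=ALPHABET):
    """Frequency (%) of each alphabet character over all characters in domains."""
    counts = Counter(c for d in domains for c in d if c in alphabet)
    total = sum(counts.values())
    return {c: 100.0 * counts[c] / total for c in alphabet}

# Placeholder lists; the paper's statistics use 10,000-name sample sets.
legit_sample = ["google", "baidu", "facebook"]
dga_sample = ["qydakyiugnjlo", "fngfkutwhhvtvpj"]
legit_freq = char_frequency_percent(legit_sample)
dga_freq = char_frequency_percent(dga_sample)
```

Plotting `legit_freq` against `dga_freq` over the alphabet reproduces the kind of comparison shown in the figure: random-character DGA output flattens the distribution, while legitimate names concentrate on common letters.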

Figure 3: Histogram of the average entropy distribution of legitimate, Corebot, and Cryptolocker domain names. (a) Legitimate. (b) Corebot. (c) Cryptolocker.
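As an illustration of how such average-entropy histograms can be produced, the sketch below estimates conditional character probabilities from a reference corpus and computes a per-domain average entropy of the form −(1/n) Σ P(ci|ci−1) log2 P(ci|ci−1), the same form used in step 3 of Algorithm 1; the toy corpus, the smoothing floor for unseen bigrams, and the function names are illustrative assumptions, not the paper's code:

```python
import math
from collections import Counter

def bigram_model(corpus):
    """Estimate conditional probabilities P(c2|c1) from a list of domain-name strings."""
    pair_counts, first_counts = Counter(), Counter()
    for name in corpus:
        for c1, c2 in zip(name, name[1:]):
            pair_counts[(c1, c2)] += 1
            first_counts[c1] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

def average_entropy(name, cond_prob, floor=1e-6):
    """Average entropy of a domain name over its character bigrams.

    Unseen bigrams get a tiny floor probability, so highly random strings
    contribute near-zero -P*log2(P) terms and score a low average entropy,
    matching the behavior described for Corebot and Cryptolocker.
    """
    pairs = list(zip(name, name[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for pair in pairs:
        p = cond_prob.get(pair, floor)
        total -= p * math.log2(p)
    return total / len(pairs)

corpus = ["google", "baidu", "amazon", "wikipedia"]  # illustrative legitimate corpus
model = bigram_model(corpus)
print(average_entropy("google", model))  # noticeably larger than for a random string
print(average_entropy("zqxjkv", model))
```

Feeding each sample set through such a function and binning the results would yield histograms of the shape shown in Figures 3-5.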

Figure 4: Histogram of the average entropy distribution of legitimate, Matsnu, and Suppobox domain names. (a) Legitimate. (b) Matsnu. (c) Suppobox.

Figure 5: Histogram of the average entropy distribution of legitimate and SDGA domain names. (a) Legitimate. (b) SDGA (based on domain names). (c) SDGA (based on English words).

    A = {a-z},
    N = {0-9},
    T = {'-', '_'},
    W = {w1, w2, ..., wn},
    B = {b1, b2, ..., b50},
    F = {f1, f2, ..., f100},
    G = {g1, g2, ..., gn},
    H = {h1, h2, ..., hn},
    I = {i1, i2, ..., in},
    J = {j1, j2, ..., jn},
    E = {A, N, T, W, B, F, H, G, I, J},   (16)

where E denotes the set of all elements in the domain names, A denotes English letters, N denotes numbers, T denotes special characters, W denotes English words, B denotes two-character combinations, F denotes three-character combinations, G denotes Chinese pinyin, H denotes English abbreviations, I denotes Chinese abbreviations, and J denotes numeric combinations.

The different elements are not completely unrelated to each other; for example, English words contain characters, two-character combinations, and three-character combinations, and numeric combinations contain numbers. Therefore, to divide the strings of a domain name into different elements, it is necessary to define a priority level and divide them according to the priority level from highest to lowest. The definition of priority is shown in Table 2; the larger the priority value, the lower the priority. Among them, English words, Chinese pinyin, English abbreviations, Chinese abbreviations, and numeric combinations have the highest priority. Two-character combinations and three-character combinations have lower priority. The rest of the elements have the lowest priority, which means the domain name will be split into single-character elements, including English letters, numbers, and special characters, only when it cannot be split into other elements.

A domain name can be parsed into different trees, each of which contains different element types. For the different parse trees, the average priority of each tree is calculated, and the result with the highest priority is taken. For the parse trees with the same priority, by calculating the average entropy of each tree, the result with the highest entropy value is taken as the final domain name parsing result. The algorithm is as follows (Algorithm 1).

The segmentation method used in step 1 is the bidirectional maximum matching algorithm [15]. A unique parsing result Ts′ = {e1, e2, ..., en} can be obtained through the domain name syntax parsing algorithm. For each element, a vector can be obtained using Word2vec [30]. To obtain a more accurate word representation, a pretrained word vector model is used to embed the words. For the other elements in the domain name, the collected domain names are used as a corpus to parse the elements, and the elements are trained with Word2vec to obtain the embedding model; the length of each element vector is 128. Since there are many element types in domain names, different combinations of element types have different impacts on the legitimacy determination. To obtain more accurate domain name element vectors, the element types are also mapped to vectors and added to the element vector. Domain names are divided into 10 major categories: English letters, numbers, special characters, English words, Chinese pinyin, English abbreviations, Chinese abbreviations, numeric combinations, double-character combinations, and three-character combinations. Some element types can be subdivided into more subcategories. English letters can be divided into two categories, vowels and consonants, and English words can be divided into various lexical types according to the classification methods of natural language processing. Each element type in the domain name is mapped to a vector of length 128, which is stitched with the original element vector into a vector of length 256. The vector length 128 is an experimentally obtained value that balances computational efficiency and accuracy and is also used in the model of [28]. The domain name element vector is generated as shown in equation (17).

    V(ek) = [Embedding(ek), Embedding(X(ek))],   (17)

where V(ek) denotes the vector of elements in the domain name and X(ek) denotes the element type.

The domain name element parsing and embedding process is shown in Figure 6. Take the domain name "baidutop1" for example, which contains Chinese pinyin, English words, and numbers; the most appropriate way is to parse it into the elements "bai," "du," "top," and "1" and then map each resolved element and its type to a vector. In this case, the
English word "top" is mapped using a public pretrained embedding model, while "bai," "du," and "1" are obtained by training elements from the collected corpus.

Table 2: Element priority definitions in domain names.

Priority   Element type     Element description
1          W, G, H, I, J    English words, Chinese pinyin, English abbreviations, Chinese abbreviations, and numeric combinations
2          B, F             Two-character combinations and three-character combinations
3          A, N, T          English letters, numbers, and special characters

4.2. Parallel Convolutional Model Based on a Feature Selection Module. After embedding the elements in the domain name, a parallel convolutional model based on the feature selection module is designed to achieve effective classification. The feature selection module is combined with a multibranch parallel convolutional structure to achieve more effective feature extraction. Several previous works have verified the effectiveness of CNN-based models in the domain name detection field. Convolutional layers are mainly used to obtain the intrinsic association of adjacent elements by convolutional computation. Convolutional layers with different sizes of convolutional kernels are used to obtain the association between adjacent elements of different perceived ranges. In the domain name detection problem, the elements are usually mapped to one-dimensional vectors, and then the one-dimensional vectors are subjected to one-dimensional convolution. For two adjacent element vectors V1 = [v11, v12, ..., v1n] and V2 = [v21, v22, ..., v2n], the convolution is calculated as shown in equation (18).

    S(n) = Σ_{m=0}^{N-1} v1m · v2(n-m).   (18)

In convolutional neural networks, convolution parameters and bias terms are usually added to the convolutional computation, and different convolutional kernel sizes are used, so the convolution process can be represented by

    ci = f(w · V + b),  V = [Vi, Vi+1, ..., Vi+h],   (19)

where ci denotes the result of the convolution output, f denotes the nonlinear activation function, w denotes the parameters of the convolution, b denotes the bias term, V denotes the element vectors of a window, and h denotes the convolution window size, i.e., the convolution kernel size.

To extract convolutional features in different perceptual ranges, four convolutional kernels with different sizes are used to convolve the input element vectors, and the convolutional kernel sizes are taken as 2, 3, 5, and 7. In a domain name, the correlation ranges of elements are different, and the effective perceptual fields vary in different domain names. To improve the selection validity of the convolution results for different convolution kernel sizes, a feature selection module for one-dimensional parallel convolution is proposed, which is similar to SKNets [31]. The architecture of the model is shown in Figure 7.

In the model, domain name element vectors are firstly obtained by adaptive embedding. Then four convolutional layers with different kernel sizes are used to convolve the element vectors, respectively, which yields four convolutional feature maps with different perceptual ranges, and the attention weights of the four feature maps are calculated by the feature selection module. After that, the original feature maps are multiplied with the weight results, and finally, the four feature maps are summed to obtain the extracted domain name features. The extracted feature maps are input into three fully connected layers to obtain the binary classification results.

In the feature selection module, the four feature maps are elementwise summed to obtain a feature map that fully mixes the four perceptual results:

    U = U1 + U2 + U3 + U4.   (20)

The global average pooling layer is then used to obtain the global information of each filter channel. For the c-th filter channel, if the number of elements is N, the global average pooling result is calculated as

    Sc = (1/N) Σ_{i=1}^{N} Uc(i).   (21)

To obtain feature weights adaptively, the results of the global average pooling are processed by a fully connected layer:

    Z = δ(W S),   (22)

where δ denotes the ReLU activation function and W denotes the fully connected parameters.

To implement the weight calculation for the different convolutional branch channels, a soft attention model is used and the normalized weight results are obtained using Softmax. The weight calculation for the four branches is

    ac = e^{Ac Z} / (e^{Ac Z} + e^{Bc Z} + e^{Cc Z} + e^{Dc Z}),
    bc = e^{Bc Z} / (e^{Ac Z} + e^{Bc Z} + e^{Cc Z} + e^{Dc Z}),
    cc = e^{Cc Z} / (e^{Ac Z} + e^{Bc Z} + e^{Cc Z} + e^{Dc Z}),
    dc = e^{Dc Z} / (e^{Ac Z} + e^{Bc Z} + e^{Cc Z} + e^{Dc Z}),   (23)
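As an illustration, the feature selection computation of equations (20)-(23), followed by the weighted branch sum, can be sketched in NumPy as follows; the sequence length, reduced dimension, and randomly initialized weight matrices are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, d = 20, 128, 32   # sequence length, channels, reduced FC dimension (illustrative)

def feature_select(U1, U2, U3, U4, W, A, B, Cm, D):
    U = U1 + U2 + U3 + U4                 # eq. (20): elementwise sum of branch maps
    S = U.mean(axis=0)                    # eq. (21): global average pooling, shape (C,)
    Z = np.maximum(W @ S, 0.0)            # eq. (22): fully connected layer + ReLU
    logits = np.stack([A @ Z, B @ Z, Cm @ Z, D @ Z])  # shape (4, C), one row per branch
    logits -= logits.max(axis=0)          # numerical stability for the softmax
    w = np.exp(logits) / np.exp(logits).sum(axis=0)   # eq. (23): softmax over branches
    a, b, c, dd = w                       # per-channel branch weights, each shape (C,)
    return a * U1 + b * U2 + c * U3 + dd * U4         # weighted sum; weights sum to 1

U = [rng.standard_normal((N, C)) for _ in range(4)]   # stand-ins for the four conv outputs
W = rng.standard_normal((d, C))
A, B, Cm, D = (rng.standard_normal((C, d)) for _ in range(4))
V = feature_select(*U, W, A, B, Cm, D)
print(V.shape)   # (20, 128)
```

Because the four branch weights are softmax-normalized per channel, feeding four identical feature maps through the module returns the same map unchanged, which is a quick sanity check on the implementation.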

Figure 6: Domain name element parsing and embedding process.

Figure 7: Parallel convolutional model based on the feature selection module.

where a, b, c, and d denote the attention weight vectors of U1-U4, respectively.

Based on the attention weights, the results of the four convolutional branches can be weighted, and the sum of the weights of the four convolutional branches is 1, to obtain the feature map result V. The calculation of every channel satisfies

    Vc = ac · U1 + bc · U2 + cc · U3 + dc · U4,  ac + bc + cc + dc = 1.   (24)

The feature selection module uses a soft attention mechanism to calculate the weights of the multiple convolutional branches, which achieves adaptive selection of the features extracted by the different convolutional kernels.

4.3. Improved Dynamic Loss Function. To improve the detection effect of the model, the detection accuracy for hard-to-detect samples needs to be improved, for which an improved dynamic loss function is proposed based on the idea of curriculum learning and the Focalloss [32]. It can adaptively adjust the loss of hard-to-detect samples at different stages of training to improve the detection accuracy of hard-to-detect samples and thus improve the overall detection effect of the model.

In the binary classification task, the cross-entropy loss function is usually used for model training:

    L = -y log y′ - (1 - y) log(1 - y′),   (25)

where y′ is the prediction result and y is the ground truth; the ground truth is 0 or 1 in binary classification. When y = 1, the closer y′ is to 1, the smaller the generated loss value, which indicates a lower difficulty of sample detection. When y = 0, the closer y′ is to 0, the smaller the generated loss value. Hard-to-detect samples can be defined as samples that produce large loss values, and easy-to-detect samples can be defined as samples that produce small loss values. Since the number of easy-to-detect samples is much larger than that of hard-to-detect samples, to reduce the weight
of the loss generated by the easy-to-detect samples and increase the loss weight of the hard-to-detect samples, Focalloss introduces a weighting factor, which makes the model pay more attention to the hard-to-detect samples. The Focalloss is

    Lfl = -(1 - y′)^γ log y′,    y = 1,
    Lfl = -(y′)^γ log(1 - y′),   y = 0,   (26)

and the curves of the Focalloss with different γ are shown in Figure 8. It can be seen that as the value of γ increases, the difference between the loss value generated by samples close to the ground truth and samples far from the ground truth increases significantly: the weight of the loss generated by easy-to-detect samples is reduced, and the weight of the loss of hard-to-detect samples is increased.

In the Focalloss function, the larger the value of γ, the higher the weight of the loss of hard-to-detect samples. However, according to the idea of curriculum learning, model training resembles human learning, which proceeds from easy courses to hard ones; following this order can make the model better optimized [33]. Therefore, in the early stage of the training phase, it is not necessary to pay too much attention to hard-to-detect samples; after the model achieves a good detection effect on easy-to-detect samples, the loss weight of hard-to-detect samples is gradually increased, which makes the model pay more attention to them. Based on the above considerations, when using the Focalloss function, the γ value can be dynamically adjusted at different stages of the training phase to achieve dynamic adjustment of the loss weights of hard-to-detect and easy-to-detect samples. For this purpose, the dynamic γ value is defined as

    γ = α · t^β,   (27)

where t denotes the training epoch, and α and β are fixed parameters.

As shown in Figure 9, the value of γ increases as the number of training epochs increases, and the range of variation is affected by α and β. The parameter α should be adjusted according to the convergence of the model. In a model with more difficult convergence and more training epochs, a smaller α should be used, while in a model with fewer training epochs, a larger α should be used. The parameter β should be set according to the training samples: a larger β should be used for models that require a larger differentiation weight between difficult and easy samples; otherwise, a smaller β should be used.

Further, to limit the value of γ, a truncation threshold range [γ1, γ2] for γ can be set:

    γ = 0,         α · t^β < γ1,
    γ = α · t^β,   γ1 ≤ α · t^β ≤ γ2,
    γ = γ2,        α · t^β > γ2.   (28)

An improved Focalloss function is thus proposed that can dynamically adjust the loss weights of hard-to-detect and easy-to-detect samples at different stages of the training phase, which makes the model focus on hard-to-detect samples with different weights at different stages and improves the detection accuracy of hard-to-detect samples while maintaining the detection rate of easy-to-detect samples.

5. Experiments and Analysis

In this section, several experiments are designed to evaluate the model. Firstly, an ablation experiment is designed to verify the effectiveness of the proposed adaptive embedding module, feature selection module, and improved dynamic loss function, and the results denote that the proposed modules are effective in improving the detection accuracy of the model. Then the proposed detection algorithm is compared with several previous algorithms, and the results denote that the proposed algorithm can effectively improve detection accuracy for many types of malicious dynamic domain names.

5.1. Experimental Preparation. In this experiment, a large number of legitimate domain names and multiple types of malicious domain names are collected from public datasets. For the legitimate domain names, we collected the top 1 million domain names of Alexa's daily updated rankings from January 2009 to February 2019, 3159 lists in total, and for each list, we selected the top 200,000 to add to the whitelist; after cleaning and merging, a whitelist sample of over 10 million is obtained. For multiple malicious domain names with different elements, we selected 20 samples of DGAs containing typical elements and generation methods from the UMUDGA [34] dataset, with 10,000 of each DGA family. For SDGAs, 8 types of SDGA domain names generated with 8 parameters are selected, and the total number of samples is 88,000. To further analyze the effectiveness of the proposed model for detecting malicious domain names based on multiple elements, we generate multielement dynamic domain names based on several DGAs and HMM-based SDGAs using elements such as characters, numbers, special characters, two-character combinations, three-character combinations, and words. The domain names generated based on multiple elements and DGAs are named ME-DGA, and the domain names based on multiple elements and SDGAs are named ME-SDGA. The datasets used in the experiments are shown in Table 3.

Based on the above data, several experimental datasets are constructed for testing the detection capability of the proposed algorithm for different types of malicious domains. The detection is binary classification; i.e., a domain name is classified as legitimate or malicious. To evaluate the detection effectiveness of the models, Precision, Recall, Accuracy, and F1-score can be used as evaluation metrics. To calculate these evaluation metrics, the numbers of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) samples need to be calculated. The metrics are defined as follows:

(i) TP: number of correctly classified malicious dynamic domain names
Figure 8: Loss value of the Focalloss function with different values of γ. (a) The curve of the loss value when y = 1 and γ = 0, 1, 2, 5. (b) The curve of the loss value when y = 0 and γ = 0, 1, 2, 5.

Figure 9: Curves of γ values with different parameters (α = 0.01; β = 0.95, 1, 1.05, 1.15).

(ii) TN: number of correctly classified legitimate domain names
(iii) FP: number of wrongly classified legitimate domain names
(iv) FN: number of wrongly classified malicious dynamic domain names

Using these four terms, the Precision, Recall, Accuracy, and F1-score can be defined and calculated as follows:

(i) Precision: the proportion of true malicious samples among all samples classified as malicious. It is calculated as

    Precision = TP / (TP + FP).   (29)

(ii) Recall: the proportion of correctly classified malicious domain samples among all malicious domain samples. It is calculated as

    Recall = TP / (TP + FN).   (30)

(iii) Accuracy: the proportion of correctly classified samples among all samples. It is calculated as

    Accuracy = (TP + TN) / (TP + FP + TN + FN).   (31)

(iv) F1-score: the harmonic mean of Precision and Recall. It is calculated as

    F1-score = 2 × (Precision × Recall) / (Precision + Recall).   (32)

Keras is used as the model training framework in this experiment. The machine used in the experiments has a dual Intel Xeon E5-2630 V4 CPU, 64 GB of memory, a 4 TB hard disk, and the Ubuntu 16.04 operating system.

5.2. Ablation Experiments. To verify the effectiveness of the adaptive embedding module, the feature selection module, and the improved dynamic Focalloss proposed in this paper, an ablation experiment containing the comparison of models with different modules is designed. The baseline model in the experiment uses character embedding for the input domain name and then uses four parallel

Step 1: Domain D is segmented to obtain multiple parse trees: T = {t1, t2, ..., tn}, where each tree contains x elements: tk = {e1, e2, ..., ex}.
Step 2: Calculate the average priority of the elements in each tree and take the result with the smallest average priority value: Ts = argmin(L(tk)), where L(tk) = (1/x) Σ l(ek), and l(ek) is the value of the element priority.
Step 3: The above steps lead to a set of parse trees with the same priority: Ts = {t1, t2, ..., tx}. Calculate the average element entropy value of each parse tree, and take the result with the highest entropy value: Ts′ = argmax(H′(tk)), where H′(tk) = -(1/n) Σ P(ex|ex-1) log2 P(ex|ex-1), and P(ex|ex-1) is the cooccurrence probability of two elements obtained from the statistical samples.
Step 4: After obtaining the result of the maximum entropy Ts′, if there are multiple results with the same entropy value, the average probability of occurrence of the elements in the domain name is further calculated and the maximum result is taken: Ts′ = argmax(P′(tk)), where P′(tk) = (1/n) Σ P(ex), and P(ex) denotes the probability of occurrence of the elements obtained from the statistical samples.

ALGORITHM 1: Domain name element parsing algorithm.
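A minimal sketch of step 1's bidirectional maximum matching, together with the element-plus-type stitching of equation (17) (the pipeline depicted in Figure 6); the toy element dictionary, type labels, and random embedding tables are illustrative stand-ins for the trained Word2vec models:

```python
import numpy as np

def forward_max_match(s, vocab, max_len=7):
    """Greedy longest-match segmentation scanning from the left."""
    out, i = [], 0
    while i < len(s):
        for L in range(min(max_len, len(s) - i), 0, -1):
            if L == 1 or s[i:i + L] in vocab:
                out.append(s[i:i + L])
                i += L
                break
    return out

def backward_max_match(s, vocab, max_len=7):
    """Greedy longest-match segmentation scanning from the right."""
    out, j = [], len(s)
    while j > 0:
        for L in range(min(max_len, j), 0, -1):
            if L == 1 or s[j - L:j] in vocab:
                out.insert(0, s[j - L:j])
                j -= L
                break
    return out

def bidirectional_max_match(s, vocab):
    """Step 1: return the candidate parses; Algorithm 1 then ranks them by
    average element priority and average element entropy."""
    fwd, bwd = forward_max_match(s, vocab), backward_max_match(s, vocab)
    return [fwd] if fwd == bwd else [fwd, bwd]

# Toy element dictionary with type labels (illustrative)
vocab = {"bai": "pinyin", "du": "pinyin", "top": "word"}
parse = bidirectional_max_match("baidutop1", set(vocab))[0]
print(parse)   # ['bai', 'du', 'top', '1']

# Eq. (17): stitch a 128-dim element vector and a 128-dim type vector into 256 dims.
rng = np.random.default_rng(0)
DIM = 128
elem_table = {e: rng.standard_normal(DIM) for e in parse}
type_table = {t: rng.standard_normal(DIM) for t in ("pinyin", "word", "number")}
embedded = np.stack([
    np.concatenate([elem_table[e], type_table[vocab.get(e, "number")]])
    for e in parse
])
print(embedded.shape)   # (4, 256)
```

The resulting (elements × 256) matrix is what the parallel convolutional model of Section 4.2 consumes as input.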

convolutional branches to extract features; the results of the four branches are combined and input into a three-layer fully connected network, and the model is trained using the cross-entropy loss function. The length of the character vector in the model is 128, the number of filters is 128, and the number of nodes in the fully connected layers is 512. 200,000 DGA samples and 800,000 legitimate samples are used in the ablation experiments.

There are seven models in the ablation experiment, namely: the baseline model; model 1, which uses the adaptive embedding module instead of the character embedding; model 2, which adds the feature selection module to the baseline model; model 3, which trains the baseline model using the Focalloss; model 4, which trains the baseline model using the improved dynamic Focalloss; model 5, which applies both the adaptive embedding module and the feature selection module to the baseline model; and model 6, with the adaptive embedding module, feature selection module, and improved dynamic Focalloss. Table 4 shows the detection accuracies and F1-scores of the above models, in which "×" represents that the model does not contain the module and "○" represents that the model contains the module.

The detection results of model 1 show that the average detection accuracy of the model improves from 0.9685 to 0.9732 and the F1-score improves from 0.9684 to 0.9730 after using the adaptive element embedding module; this denotes that the adaptive embedding module can significantly improve the detection effect of the model. The average accuracy of model 2 is improved to 0.9724 and the F1-score to 0.9720, which indicates that the feature selection module can improve the detection effect of the parallel convolutional model. The detection results of model 3 denote that the Focalloss function can improve the model detection effect compared with the cross-entropy loss function. The average detection accuracy of model 4 is 0.9721 and the F1-score is 0.9720, indicating that the improved dynamic Focalloss can further improve the detection effect of the model. Model 5 achieves an average detection accuracy of 0.9758 and an F1-score of 0.9755, indicating that using both the adaptive element embedding module and the feature selection module can effectively improve the detection effect. The average detection accuracy of model 6 reaches 0.9812 and the F1-score reaches 0.9810. This denotes that, after using the adaptive element embedding, the feature selection module, and the improved dynamic Focalloss function, the detection accuracy of the model is significantly improved compared with the baseline model. The ablation experimental results denote that the adaptive domain name element embedding module, the feature selection module, and the improved dynamic Focalloss can effectively improve the detection effect of the model.

5.3. Comparison Experiments on DGA Dataset. In this subsection, the proposed detection model is compared with five previous detection algorithms, namely, the LSTM model [23], the Invincea CNN model [25], the LSTM-MI model [26], the N-gram CNN-based model [29], and the HDNN model [28]. The proposed model uses the model 6 structure from the previous subsection and is named MEAE-CNN. The contrasted dataset consists of the 20 selected DGA samples and legitimate domain names. The comparison detection results are shown in Table 5.

Among the 20 selected DGA domain name samples, most are generated by selecting characters according to a certain random-number seed, and some are generated with less random characters, which are more difficult to detect; for example, the Simda and Virut algorithms strictly limit the length of the generated domain names and generate domain names with shorter lengths and less randomness. Although Pykspa and Symmi generate longer domain names, the vowel and consonant letters are selected according to their cooccurrence relationship, so the generated domain names are more difficult to detect. Matsnu and Suppobox use words to generate domain names with the lowest randomness. Among the compared algorithms, Invincea CNN simply uses a parallel convolutional structure, which cannot effectively extract the association features between characters and therefore has the worst detection effect on DGA domain names, with an average accuracy of 0.9685. In contrast, the detection accuracy of the LSTM model is significantly higher, at 0.9712. After adding a sample equalization strategy to the LSTM model, the accuracy of the LSTM-MI model is slightly improved, increasing to 0.9721, and the recall is improved to 0.9709. N-gram CNN can achieve a more accurate quantitative representation of DGA domain names due to the embedding method of N-gram characters; the accuracy is