Adaptive intelligent learning approach based on visual anti-spam email model for multi-natural language
Journal of Intelligent Systems 2021; 30: 774–792

Research Article

Mazin Abed Mohammed*, Dheyaa Ahmed Ibrahim, and Akbal Omran Salman

https://doi.org/10.1515/jisys-2021-0045
received March 24, 2021; accepted May 26, 2021

Abstract: Spam electronic mails (emails) are harmful and unwanted commercial emails sent to corporate bodies or individuals. Although such mails are often used to advertise services and products, they sometimes contain links to malware or phishing websites through which private information can be stolen. This study shows how an adaptive intelligent learning approach, based on a visual anti-spam model for multi-natural language, can be used to detect abnormal situations effectively. The approach is applied to spam filtering. With adaptive intelligent learning, high performance is achieved alongside a low false detection rate. The approach works in three main phases to ascertain whether an email is legitimate, based on knowledge gathered previously during training. The proposed approach includes two models to identify phishing emails. The first model, a new trainable model based on the Naive Bayes classifier, identifies the language of the email. It is trained on three languages (Arabic, English, and Chinese), and the trained model is used to identify the language and pass the label to the next model. The second model is built using two classes (phishing and normal email for each language) as training data. This second trained model (also a Naive Bayes classifier) is applied to identify phishing emails as the final decision of the proposed approach. The proposed strategy is implemented using the Java environment and the JADE agent platform.
The performance of the AIA learning model was tested on a dataset of 2,000 emails, and the results demonstrated the efficiency of the model in accurately detecting and filtering a wide range of spam emails. The results of our study suggest that the Naive Bayes classifier performed best when tested on the largest database (with an overall accuracy of 98.4%, a false positive rate of 0.08%, and a false negative rate of 2.90%). This indicates that our Naive Bayes classifier should also work well when applied to a real-world database, which is more typical but not the largest.

Keywords: anti-spam detection, machine learning techniques, adaptive intelligent learning, multi-natural language, multi-agent system, Naive Bayes classifier

* Corresponding author: Mazin Abed Mohammed, Information Systems Department, College of Computer Science and Information Technology, University of Anbar, 31001, Anbar, Iraq, e-mail: mazinalshujeary@uoanbar.edu.iq, tel: +964-7801141441
Dheyaa Ahmed Ibrahim: Communications Engineering Techniques Department, Information Technology College, Imam Ja’afar Al-Sadiq University, Baghdad, Iraq, e-mail: Dheyaa.ibrahim@sadiq.edu.iq
Akbal Omran Salman: Department of Control & Automation Techniques Engineering, Electrical Engineering Technical College, Middle Technical University, Baghdad, Iraq, e-mail: akbal.o.salman@mtu.edu.iq

Open Access. © 2021 Mazin Abed Mohammed et al., published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.
1 Introduction

One of the most provoking and harmful extensions to internet technology is spam. The effectiveness of conventional spam-filtering software has been lessened by the growing amount of spam, which has overwhelmed anti-spam defenses. The increase in spam-related problems calls for tools that are more effective and efficient in controlling them [1]. Machine learning (ML) methods have given researchers a good means of tackling spam. ML has been applied successfully to classify spam electronic mails (emails) [2–4]. Presently, the risk of phishing emails can be identified with minimum human involvement, thereby enabling a high level of accuracy and easy control. The different components of spam emails need to be processed before they are passed to the chosen algorithm, after which the algorithm can be applied to email filtering and classification. The algorithm can only be used for this purpose after the email contents are transformed into numeric data; this is the first step in applying it. However, the header and the body of the email can differ and may not reflect the message that the sender is sending to the recipient. Sometimes, the data contained in the header share similarities with the content in the body of the email. Thus, the accuracy of email spam filters can be reduced when only headers are used [5]. The processing applied before classification is referred to as preprocessing, and it involves feature extraction, feature selection, stop-word removal, and stemming [6]. The preprocessing step can be employed to increase the efficiency of the spam filter. It can be used when feature vectors are trained and tested, considering the problems that the filter faces.
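The preprocessing steps named above (stop-word removal, stemming, and conversion of text into numeric data) can be illustrated with a minimal Python sketch. It is not the authors' implementation, which is reported to use Java and JADE. The stop-word list, the suffix-stripping rule, and the function names `crude_stem`, `preprocess`, and `to_vector` are assumptions made only for illustration, and the tokenizer handles English text only; Arabic and Chinese would need language-specific tokenization.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "your", "you"}


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def to_vector(tokens, vocabulary):
    """Turn a token list into term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    tokens = preprocess("Claim your prizes: winning is guaranteed!")
    print(tokens)
    print(to_vector(tokens, ["claim", "prize", "free"]))
```

The count vector is the simplest numeric representation a classifier such as Naive Bayes can consume; feature selection would then amount to choosing which terms enter `vocabulary`.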
Image spam is a different type of spam, whereby the spammer sends the spam as part of an embedded file attachment instead of sending text messages. The spam image may use a format such as GIF, or a similar file format, and usually contains several random words, which is sometimes referred to as word salad. Sometimes, the image may contain a link to a website. To avoid detection by conventional anti-spam technologies, an image spammer may combine such components. Usually, when these images reach the receiver, they are displayed automatically. Unfortunately, some of the available spam filters fail to detect such image spam messages as spam. The spam capture rate declined throughout the messaging security industry due to the rise of more complex image spam. This, in turn, leads to lost productivity and end-user frustration as the number of delivered spam messages increases. Presently, images form a crucial part of the World Wide Web: statistics taken from over four million HTML webpages show that pictures make up 70.1% of web content. This number means that, on average, each HTML webpage contains 18.8 images [7].

Text-based methods are characterized by two main shortcomings [8]. The first is that spammers use a wide range of tactics to confuse text-based anti-spam filtering. The second is the diversity of the information contained in emails. This variety has increased with the continuous growth of the capacity and scale of the internet. Presently, emails are made up not only of text but also of multimedia content. These limitations and challenges have reduced the effectiveness of the existing proposed models (anti-spam filters). One of the main issues with existing data is that emails include not only text but also multiple forms of data.
Similarly, this technique is used by spammers to disguise spam messages, thereby confusing text-based anti-spam filters through HTML-based tactics [9]. A filtering program may find the unprocessed content of spam emails meaningless. However, the concealed messages can be seen by the recipients. Due to the increasing prevalence of visual information in emails, it becomes crucial to use this information to ensure that anti-spam filtering achieves a high level of accuracy. Thus, this study investigates how visual information, especially images, can be employed in anti-spam filtering. Moreover, a novel intelligent anti-spam technique is developed in this study to overcome these shortcomings.

Researchers have discussed spam problems and their effects from three points of view. Several researchers have highlighted the financial, economic, management, marketing, and business implications of spam, whereas others have focused on the effect that spam has on privacy, security, and data protection. Many researchers have studied different anti-spam filter approaches, such as IP blocks and ML. Furthermore, many studies have focused on investigating the effect
of spam on email accounts, society, and email reliability. The proposed approach is implemented within the Java environment using the JADE agent platform. With this application, a wide range of spam can be detected and filtered successfully. In this work, a novel adaptive intelligent learning approach is introduced, based on the visual anti-spam model for multi-natural language, that is capable of addressing these weaknesses and covers the Arabic, English, and Chinese languages. An anti-spam filtering method based on a multi-trainable model is proposed to identify phishing emails. The main contributions of this article can be summarized as follows:
• To scrutinize extant anti-spam technology along with its limitations.
• To compare the various mutual document processing products and identify their strengths and weaknesses.
• To propose and implement a novel mutual document processing model that can address the limitations of the extant mutual document processing model, and to evaluate the method’s performance.
• To construct a novel visual anti-spam model based on the text of the newly proposed mutual document processing model.
• To evaluate the performance of the text-based anti-spam model using both subjective and objective evaluations.
• To introduce a multi-trainable model that can identify the language type and the phishing email by using the Naive Bayes classifier.
• To propose a new effective model that identifies the language by using text features and the Naive Bayes classifier as the first model, with the outcome of this model used as input for the next model.
• To use the training data (phishing and normal email) as a learning stage for the second model and use the trained model to identify the email situation.

The rest of this article is divided into five sections and several subsections.
Section 2 reviews the existing literature on anti-spam classification methods, followed by Section 3, which describes the materials and methods used in the research. Section 4 presents the theoretical background of the approach proposed in this work. In Section 5, the implementation of the method is presented, together with the performance results of the proposed model. Finally, Section 6 concludes the work and highlights directions for future work.

2 Background and related works

2.1 Email

Email is a messaging system through which messages are transmitted electronically across computer networks [8]. The sender opens a message panel, types in the address of the recipient, the subject of the message, and the email content, and then sends the message to the recipient by clicking the send icon. Free email services such as Yahoo, Hotmail, and Gmail can be easily accessed by users. Users can also obtain an email account by registering with an internet service provider (ISP). Such email services are not paid for but can only be used with an internet connection. Another characteristic of emails is the immediacy with which they are received after being sent. Using an efficient mail delivery system, communication can be established between email users at an affordable rate [9]. Email services have emerged as the most commonly used means of communication because they are reliable, user friendly, and readily available. Thus, corporate bodies and individuals rely on this means of communication [10]. As noted earlier, the main elements of an email are the header and the body of the message. The information about the transmission and subject of the mail is found in the header. The elements of the header are as follows:
• From: contains information about the sender, such as the email address.
• To: contains the receiver’s details, such as the email address.
• Date: contains the date when the email was sent to the specific recipient(s).
• Received: contains the intermediary servers’ information and the date when the email message was processed.

Email is a cheap and speedy means of communication that can drastically reduce communication costs. In addition, email is a very efficient communication tool that can be used for marketing and, as such, can be leveraged by business corporations [11]. As it is a commonly used advertising channel, businesses can capitalize upon email. Nevertheless, one of the problems associated with this communication tool is spam, which arises precisely because sending emails is affordable and simple. The description “spam” is given to unsolicited emails sent in bulk to many users for varying reasons, such as phishing and commercial purposes [12].

2.2 The spam phenomenon

There are different definitions of the term “spam,” which is also known as junk mail. The various definitions highlight the difference between legitimate email and spam. Of these, the most commonly used definition refers to spam as “unsolicited bulk email” [13–15]. Spam has been categorized based on the research carried out by Subramaniam et al. [10]. These categories are presented in Table 1.
Table 1: The categories of spam applications

Categories | Descriptions
Health | Spam emails that promote or advertise fake medications
Promotional products | Spam containing counterfeit fashion goods, such as watches, shoes, bags, and clothes
Adult content | Spam that contains pornography or other related content
Finance and marketing | Spam that offers loan packages, stock kiting, and tax solutions
Phishing | Fraud or phishing spam, such as “Spanish Prisoner” and “Nigerian 419,” sent to users with the aim of defrauding them
Malware | Spam sent with the aim of spying on and attacking personal computers
Education | Spam that offers users fake certificates of online education
Political | Spam with political targets, such as elections by online voting

In general, a broad range of goods and services are advertised using spam, and the share of advertisement devoted to a particular class of goods and services changes over time. Most of the time, spam is used by online fraudsters to satisfy their needs. A classic example of spamming activity is phishing, whereby fraudsters search for confidential information, such as credit card details and passwords. Here, the fraudsters imitate official requests from reliable authorities such as banks and service providers [16]. Another kind of harmful spam is a virus; the operation of a mail server is interrupted by fraudsters using a massive spam attack [17]. In summary, spammers send spam to advertise ideas, goods, and services; to steal people’s confidential information in order to defraud them; to deliver harmful software; or to temporarily disrupt a mail server. Based on its content, spam is categorized into various subjects and a wide range of genres, because it simulates different classes of authentic emails, such as memos, order confirmations, and letters [18].
The characteristics of legitimate email traffic differ from those of spam traffic. Spam tends to be sent steadily over time, whereas legitimate email traffic occurs mainly during the daytime [19].
When spammers send spam, they often conceal their identity using a variety of methods. However, their identity is not hidden while the email addresses are being harvested from online materials, such as papers or websites. In essence, harvesting activities can be a way to determine the spammers’ activities [20]. It is important to note that spammers are normally reactive, i.e., any successful anti-spam effort is actively opposed by spammers [1]. Consequently, the efficiency of every new method decreases after it is deployed. The study carried out in ref. [21] analyzed the evolution of spamming techniques, and the results revealed that very efficient filters or other efficient solutions can undermine the effectiveness of spam.

2.3 The spammers’ tricks

The ability of spammers to send spam emails depends on their ability to obtain email addresses; they use special software to harvest email addresses from the internet. With this software, the spammers can systematically gather email addresses from group discussions or websites [22]. Moreover, many email addresses can be bought or rented by spammers from other specialist spammers. Spammers use a variety of tactics to avoid being identified by filters. These tactics are presented and briefly described in Table 2.

Table 2: Tactics that spammers use when sending spam [1]

Tricks | Descriptions
Zombies or Botnets | A large amount of spam, viruses, and malware is sent through personal computers on the internet
Bayesian sneaking and poisoning | The spammer writes a spam message using words that are rarely used in a spam message. In addition, the spam messages do not “poison” the Bayesian filter database
IP address | The spammer acquires and uses a trusted IP address that also has a neutral reputation
Offshore ISPs | The spammers use offshore ISPs with no security measures
Open proxies/open-relay servers | Spam is redirected to vulnerable users through the use of servers
Third-party mail back software | The spammers use emails that are wrongly anchored on trustworthy websites
Falsified header information | False header details are added to the spam message
Obfuscation | Meaningless symbols or HTML tags are used to split words with the aim of masking spam messages
Vertical slicing | Spam messages are written in the vertical direction
HTML manipulation | The HTML format is manipulated with the aim of preventing the spam message from being detected
HTML encoding | Encoding methods, such as Base64, are used to change a binary attachment into plain text characters
JavaScript messages | A JavaScript snippet holds the whole content and is executed when the message is opened
ASCII art | Glyphs of standard letters are used to write spam messages
Image-based | An image is used by spammers to send textual information to users
URL address or redirect URL | Detection is avoided by adding a URL address. In some cases, unimportant portals are used so that the users can be led to the real websites
Encrypted messages | The message is encrypted but is decrypted upon reaching the mailbox

2.4 Spam impact

The first spam was recorded on 3 May 1978, when spam emails were transmitted to almost 400 ARPANET users, who were given an open invitation to the then-forthcoming computer hardware demonstrations [23].
Since then, spam messages have become a day-to-day disturbance for users of email services. Currently, spam accounts for 53–58% of the email volume recorded worldwide. Although this is down from a peak of 88% in 2010 [24], it is still well above the estimated proportion of 10% in 1998 [25]. Figure 1 shows the average spam distribution from 2006 to 2016.

Figure 1: Average of spam from 2006 to 2016 [1,12].

2.5 Current methods for spam detection

Recently, the problems associated with spam messages have grown with the increase in email usage. Spam emails have become a major concern, as many spam messages with offensive content are circulated. This reduces the reliability of email and, in turn, the confidence of email users. Moreover, when such spam messages are received, network bandwidth is wasted, and so is the time users spend differentiating between legitimate messages and spam. However, business owners can benefit from spam marketing, since it allows bulk messages to be sent at an affordable rate, thereby maximizing profits. Several countries have adopted spam as a marketing tool because of economic gains, although it is limited by the fact that many such messages emanate from various countries. Hence, it may be difficult to trace the real senders of the spam messages, and this, in turn, makes it difficult to set laws and regulations associated with the use of spam. Besides legal approaches to dealing with spam, researchers have suggested that the models of operation and protocols should be changed. In the current section, a wide range of available anti-spam techniques are presented. The research efforts made in this area are commendable, and because of the relevance of this subject matter, more research is being conducted to discover newer spam detection methods.
In the work conducted by Nosseir et al. [26], a character-based method was proposed. The method employs a multi-neural network classifier and trains each neural network on a normalized weight derived from the ASCII values of the word characters. Nevertheless, an attacker can camouflage words by using a different spelling or by using visuals to evade any detection attempt. Thus, the detection rate can be lessened. In the work by Aski and Sourati [27], a rule-based method was proposed, and 23 features were identified and selected carefully from a spam dataset that was accumulated personally. Then, a score was assigned to each of the criteria. The accumulated score was compared with a threshold value to determine whether an email is legitimate or spam. They used three ML approaches: the C4.5 Decision Tree classifier, the Multilayer Perceptron, and the Naive Bayesian classifier. However, the database used for the study was insufficient, as it contained only 750 spam messages. Their results did not reflect the performance in terms of memory and time. In a different study by Feldman et al. [28], the use of the term level was proposed for mining text in emails. The authors suggested that the first step of the mining process
should be the preprocessing step, whereby the document collection is preprocessed and the important terms are extracted from the documents. After the preprocessing step, each document is represented as a collection of the annotations and terms that characterize it. With this approach, the frequency at which the terms occur can be obtained. Nevertheless, this approach also has a limitation, which is its inability to handle a huge number of texts. The literature review also shows that some researchers have focused on developing approaches that can be used to identify the accounts used for spamming. One of the approaches proposed for this purpose is the early detection of spamming (ErDOS) system, which is capable of detecting spamming accounts early. In ErDOS, the detection of spamming accounts combines features and content-based detection techniques, based on patterns of inter-account communication [29]. For future work, the authors suggested that the work can be further improved to detect spammers’ accounts in real time. Idris et al. [30] proposed a novel method that combines differential evolution and the negative selection algorithm (NSA). The proposed hybrid model has a special feature that allows differential evolution to be implemented in the random generation phase of NSA. Also, the model maximizes the generated detectors while minimizing the overlapping detectors [31]. However, the authors failed to address other problems such as clickjacking and image spam. This method can be improved by adding a feature that extracts regular expressions from the entire text of the messages rather than only from the contents of the subject header, which is what the approach currently uses.
Clickjacking, also referred to as UI redressing or IFrame overlay, is a kind of attack that involves overlaying a button or a field with harmful links or scripts that are often invisible. This technique is popular among hackers, who bait users into clicking buttons or links, usually by exploiting the color of the link. It is often difficult to distinguish between the background color of the page and the color of the link because they are almost the same. Also, many banking websites and three of the Alexa top ten websites have no precautionary measures in place to combat clickjacking attacks [32]. One of the ways to combat clickjacking is to send a confirmation prompt to users when they click the target element [33]. Users should be required to mark a checkbox as confirmation or to enter a correct CAPTCHA before the intended button can be clicked. The “Phishing Email Detection System,” proposed in ref. [34], is developed based on unsupervised and supervised techniques. This system is equipped with features of reinforcement learning, and it can adapt to environmental changes. This technique is more efficient against Zero-Day phishing attacks, but it is less suitable for spam that is usually used for advertisements. The feature evaluation and reduction algorithm, the most integral part of the system, allows dynamic selection and ranking of the most relevant features from emails based on many environmental criteria. Nevertheless, the features that are chosen are quite unconventional and insufficient. The study conducted by Zhu and Tan presents a feature extraction that is based on local concentration (LC).
They developed their anti-spam framework using this technique [35], inspired by the “Biological Immune System.” Using the LC approach, each area of a message can be converted to a corresponding LC feature such that position-correlated details can be determined from a message. A sliding window with a fixed length can be used to divide the content of the message. Feature engineering can be employed to obtain fresh features from extant ones [36]. Hayat et al. [37] discussed the implementation of spam filtering based on an improved version of the Naive Bayes algorithm. The improved version performs better than other systems developed based on the original version of the Naive Bayes algorithm. The model works by comparing the content of the current set of emails with the previous ones. If a major change is detected, the model makes automatic adjustments wherever necessary. Thus, a significant improvement is seen in the spam detection rate after the changes are made and updating is completed. In this study, the authors achieved an accuracy rate ranging from 8 to 9%, depending on a multinomial Naive Bayes algorithm. Support vector machines (SVMs) have been employed extensively, and their efficiency has been evident in designing anti-spam systems based on ML. SVMs are more advantageous than other algorithms, as they are capable of dealing with feature sets that are highly dimensional with several attributes [38]. Additionally, the “kernel trick,” which is part of SVMs, can be used to change non-linearly separable data into new data that can be
separated linearly [39]. Nonetheless, the study conducted by Alsmadi and Alhami [40] contended that when n-Gram-based clustering classification is used, a lower rate of false positives (FPs) can be obtained.

3 Research methodology

This study follows basic steps of research, such as data gathering and mutual document processing. The three major stages involved in this research are presented in Figure 2.

Figure 2: Phases of the research methodology. (Box labels recoverable from the diagram: Main Research Focuses; Study the existing methods and highlight the limitations; Identify the evaluation methods; Specify the requirements of the proposed method; Data Collection and Preparation; Document Processing Technique; Evaluate the Technique; Design and implement the proposed model; Design the model; Implement the model; Evaluation Stage.)

3.1 Main research focuses

In the preliminary stage, the anti-spam filtering methods were investigated using soft computing techniques based on mutual text processing. The researcher’s understanding of the issues associated with extant anti-spam technology, as mentioned earlier in the problem statement, and the performance assessment of extant anti-spam technology were also investigated. The requirements for developing a novel efficient visual anti-spam technology were also defined in this phase, while the text database was prepared. The preparation of the database involved establishing a connection between the researcher’s email account and Outlook, so that the researcher’s emails could be read from Outlook and saved on the PC in the HTML format. Subsequently, WordPad was used to access the emails so that they could be saved in the text format. If an email whose body was in HTML form (within tags) was identified as spam, the researcher saved it in this way together with its full contents. Two kinds of emails were collected for this study: legitimate emails and spam emails.
During the collection of emails, some challenges were faced by the researcher. The researcher also encountered some limitations, such as the unavailability of text emails with complete contents in HTML on the internet. The emails that were gathered from the internet were sorted
manually, thereby consuming much time. About 2 months were required for this activity. Out of 200 emails that were gathered, 100 were non-spam and the remaining 100 were spam.

3.2 Data collection and preparation

This phase involved the setup required for testing the dataset. In this phase, the messages were examined by linking an email account to Outlook. After this activity, the emails were stored in the HTML format, while WordPad was employed to save the emails as text files. Thus, the body of each email was in the HTML format. The emails were then divided into two groups, i.e., spam and legitimate messages. After the completion of the setup, the emails were used to perform the testing.

3.3 Design and implement the proposed model

To obtain a uniform format that can easily be understood by the trainable model, a transforming stage is proposed. Only in this way can the emails be used as input to the trainable model [1]. The data, i.e., the email contents, are processed to separate the body from the header as two parts. While the general details of the message (address of the recipient, subject of the email, and details of the route) are contained in the header, the body contains the actual contents of the email. All this information must be extracted using preprocessing before the emails can be filtered. The body and the header do not contain the same information and, as such, do not each provide clues about the information that is conveyed to the receivers by the senders. Thus, depending on only one part of the email might result in a low accuracy rate in email spam filtration.

4 Adaptive intelligent learning approach

The adaptive intelligent agent learning model operates on a multi-agent system, in which both the model and the system interact within the anti-spam classification domain [1,41].
These agents work collaboratively to solve problems that they could not solve individually. Using agents increases the system’s flexibility: the system’s functionality can be separated into modules, while interaction between the system and its modules remains possible. In this work, a novel adaptive intelligent learning approach is introduced, based on the visual anti-spam model for multi-natural language, which is capable of addressing these weaknesses and covers the Arabic, English, and Chinese languages. The model is operated by a multi-trainable system that handles different forms of information, including images, in the proposed trainable anti-spam model. The adaptive intelligent agent learning model is made up of different stages, as shown in Figure 3. The processing steps performed by the multi-agent system in the adaptive intelligent learning model are illustrated in Figure 2. Different agents perform different tasks during the spam detection process:

(1) the first agent processes short words;
(2) the second agent extracts the features;
(3) the third agent selects the features;
(4) the fourth agent presents the instances; and
(5) the fifth agent classifies the email.

As mentioned earlier, an email has two main parts: the header and the body. The information in both parts must be extracted in the preprocessing stage, before filtering. The two parts may differ, and neither alone gives a complete indication of the message conveyed in the email. Thus, the accuracy of the spam filter may be reduced if only one part (either the body or the header) is used. To achieve a higher accuracy rate, both parts should be used.
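As an illustration of this preprocessing step, the following minimal sketch separates header fields from the body of a raw message using Python's standard `email` package. It is not the authors' implementation (which was built in Java with JADE); the function name `split_email`, the sample message, and the choice of header fields are assumptions made for the example.

```python
from email import message_from_string
from email.policy import default


def split_email(raw):
    """Split a raw RFC 5322 message into selected header fields and body text."""
    msg = message_from_string(raw, policy=default)
    # Prefer a plain-text body; fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "from": str(msg["From"] or ""),
        "to": str(msg["To"] or ""),
        "subject": str(msg["Subject"] or ""),
        "body": body.strip(),
    }


# Hypothetical sample message, used only for demonstration.
RAW = (
    "From: promo@example.com\n"
    "To: user@example.org\n"
    "Subject: Limited offer\n"
    "\n"
    "Click here to claim your prize.\n"
)

parts = split_email(RAW)
print(parts["subject"])
print(parts["body"])
```

Keeping the header fields and the body in one record lets later stages draw features from both parts, in line with the recommendation above.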
Figure 3: The adaptive intelligent learning model. [Panel labels: Emails in Arabic language; Emails in English language; Emails in Chinese language; Feature Extraction; Naive Bayes classifier; Testing Images; Trained Model; Language type decision; Phishing email; Normal email; Labels; Feature Extraction; Naive Bayes classifier; Trained Model; Train a new model for the phishing email identification; Final decision]

The concept of mutual document processing is used to describe this process. Through this process, the messages in an email are transformed into a uniform format that the learning algorithm can interpret [1]. The concepts associated with the novel mutual document processing are described below.

4.1 Short words form

In the short message service (SMS), short forms of words are used because of the limited message size; most spammers, by contrast, use short forms of words to confuse spam engines. The study in ref. [42] noted that some companies have focused on creating software capable of translating SMS; one such company is Geneva Software Technologies Limited. With the new SMS trends, several websites have been introduced, such as Canada’s transl8it.com, which provides SMS translation services by matching words directly. This translation trend has also encouraged cooperation between companies that provide translation services and service providers; an example is the collaboration between Singapore’s GistXL Pte Ltd. and the Singtel network. To the best of our knowledge, no study has focused on translating short-form words into their full forms to solve this problem. As mentioned above, it is important to carry out extensive research on short-form and stop-word messages. During the document processing stage, ambiguous words are either eliminated by the anti-spam filter or regarded as unknown. For example, the spam filter cannot deduce any meaning from the word “LOL,” although it means “Laughing out loud.” The authors in ref. [43] published over 1,300 abbreviations, some of which are listed in Table 3.

Table 3: List of short-form messages [43]

Short message | Meaning | Short message | Meaning
4 | For | 0.02 | My (or your) two cents worth
9 | Parent is watching | 2 | Meaning “to” in SMS
86 | Over | 19 | Zero hand (online gaming)
88 | Bye-bye | 20 | Meaning “location”
88 | Hugs and kisses | 121 | One-to-one (private chat initiation)
404 | I do not know | 143 | I love you
411 | Information | 1337 | Leet, meaning “elite”
411 | Information | ;S | Gentle warning, like “Hmm? What did you say?”
420 | Let’s get high | ? | Having a question for you
420 | Marijuana | ? | Did not understand your question
459 | Love you | ?4U | Having a question
511 | Too much information (more than 411) | @TEOTD | At the end of the day
555 | Sobbing, crying | ^^ | “Read line” or “message above”
831 | I love you (8 letters, 3 words, 1 meaning) | |
parse the words (tokens), the words are added to a vector space, and the feature space is constructed to enable classification [46]. The body and header of the emails are used as input to the proposed model to extract the desired features.

4.2.1 Reading and tokenization

Tokenization is the process through which a message is broken down into its constituent components [46]. The message is divided into tokens, also known as words. These words come from the mail body, although the header and subject fields can also be considered. The acquired words are added to a vector space so that a feature space can be constructed. Tokenization extracts all possible features from a message, regardless of how relevant they are. Features that have been transformed into tokens are highly vulnerable to content obscuring, and as such, they must be subjected to stemming, dimensionality reduction, and stop word elimination. However, no standard solution exists for tokenizing the character stream. There is also no consensus on how the results of this stage should be defined, owing to the lack of shared knowledge and techniques in this area. Furthermore, little attention has been given to assessing the quality of the results, because assessment methods with standard benchmarks are absent. Another issue in this area is the use of different languages by spammers, as in Arabic spam. In modern-day Arabic messages, English characters are combined with numbers: numbers stand in for Arabic letters that have no English counterpart, as presented in Table 4. In such cases, short word form processing is required (Table 5).
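The tokenization step described above can be sketched as follows. This is a minimal illustration rather than the authors' tokenizer: the regular expression `\w+`, the helper names, and the sample messages are assumptions. Because `\w` matches letters and digits in any script, mixed tokens such as “7alk” survive as single tokens and can be handed to a later short-word-form stage.

```python
import re
from collections import Counter

# One token = one run of letters/digits/underscore (Unicode-aware in Python 3).
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split a message into lowercase word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_vocabulary(messages):
    """Assign each distinct token a column index in the feature space."""
    vocab = {}
    for message in messages:
        for token in tokenize(message):
            vocab.setdefault(token, len(vocab))
    return vocab


def to_count_vector(message, vocab):
    """Represent a message as token counts over the vocabulary."""
    counts = Counter(tokenize(message))
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector


# Hypothetical sample messages.
messages = ["Win a FREE prize now", "Meeting moved to Monday", "free free offer"]
vocab = build_vocabulary(messages)
vector = to_count_vector("free prize, FREE offer", vocab)
print(tokenize("Mar7aba, kef 7alk?"))
print(vector)
```

Every token becomes a feature here, relevant or not, which is why the selection steps that follow are needed to shrink the feature space.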
Table 4: Examples of Arabic characters and their counterparts in modern Arabic chat

Arabic character | ا | ب | ت | ث | ج | ح | خ | د | ذ | ر | ز | س | ش | ص
English character | 2 | b | T | Th | G | 7 | 5 | d | 4 | r | Z | s | sh | 9

Arabic character | ض | ط | ظ | ع | غ | ف | ق | ك | ل | م | ن | ه | و | ي
English character | ‘9 | 6 | ‘7 | 3 | ‘3 | f | 8 | k | L | m | N | h | w | ﻱ

Table 5: Three chat expressions in Arabic, modern Arabic chat, and English

Arabic | Modern Arabic chat | English
كيف حالك؟ | kef 7alk | How are you?
مرحبا | Mar7aba | Hello
شو اخبارك | Sho a5barak | What’s up

Modern Arabic chat has become widely used, and for this reason, many spammers have started employing this technique. To the best of our knowledge, no study has been conducted on this type of spam; in this study, however, the researchers found one email in which this method of spamming was used. This method might also be adopted for other languages, e.g., English.

4.2.2 Regular expressions

In computing, regular expressions are a flexible and concise means of matching strings of text, such as characters, character patterns, or patterns of words. A regular expression is written in a formal language that a regular expression processor can interpret [1]. A parser generator, or a program that examines text and identifies the parts that correspond to a specification, can also be used here. Below are some examples of what regular expressions can match; note that the word “car” might also appear inside “carrageen” or “Career.”

• The expression “car” when it is used as a single word.
• The expression “car” when it is used after the word “red” or “blue.”
• A dollar sign immediately followed by one or more digits, e.g., $20 or $222.56.

4.3 The agent feature selection

The selection agent reduces the dimensionality of the feature vectors [1]. It also measures how frequently a specific term or phrase appears and, according to a fixed threshold, eliminates unimportant and idle words or phrases. This process plays a vital role in improving the classification [44,45]. It also involves the elimination of common morphological phrases that carry similar details and of stop words, such as “the,” “a,” and “an.”

4.3.1 Document frequency

Document frequency is the number of documents, n, in which a feature appears. The weight of each feature is measured from this frequency, and a feature whose frequency is below a fixed threshold is eliminated. Irrelevant features that do not contribute in any way to classification are thus discarded, which improves the classifier’s efficiency [47].

4.3.2 Mutual information

Mutual information is a quantity that measures the mutual dependence of two variables. If a feature is independent of a class, it is eliminated from the vector space.
The predictions made using mutual information are often accurate, and the mutual information model is easy to implement [47].

4.3.3 Stemming

Stemming involves stripping plural endings from nouns (e.g., “boys” to “boy”), suffixes from verbs (e.g., “measuring” to “measure”), or other affixes. The stemming process, first introduced by Porter in 1980, removes inflectional endings and common morphological endings from English words. Words are transformed to their stems or roots by repeatedly applying a set of rules. This method reduces the number of features in the vector space and speeds up learning and categorization for a wide range of classifiers. Nonetheless, stemming can reduce two different words to the same stem [47].
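The document-frequency and mutual-information criteria of Sections 4.3.1 and 4.3.2 can be sketched together as below. This is an illustrative sketch under stated assumptions, not the authors' code: the thresholds `min_df` and `min_mi`, the function names, and the toy corpus are hypothetical, and mutual information is computed between the binary presence of a term and the class label.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def mutual_information(docs, labels, term):
    """Mutual information (in bits) between term presence and class label."""
    n = len(docs)
    joint = Counter((term in tokens, label) for tokens, label in zip(docs, labels))
    presence_totals = Counter()
    label_totals = Counter()
    for (present, label), count in joint.items():
        presence_totals[present] += count
        label_totals[label] += count
    mi = 0.0
    # Only observed (presence, label) pairs contribute; unseen pairs add zero.
    for (present, label), count in joint.items():
        p_xy = count / n
        p_x = presence_totals[present] / n
        p_y = label_totals[label] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def select_features(docs, labels, min_df=2, min_mi=0.01):
    """Keep terms that are frequent enough and not independent of the class."""
    df = document_frequency(docs)
    return sorted(
        term for term, freq in df.items()
        if freq >= min_df and mutual_information(docs, labels, term) > min_mi
    )


# Hypothetical toy corpus: two spam and two legitimate messages.
docs = [["free", "prize", "now"], ["free", "offer"],
        ["meeting", "now"], ["project", "meeting"]]
labels = ["spam", "spam", "legitimate", "legitimate"]
print(select_features(docs, labels))
```

In this toy corpus, “now” passes the document-frequency threshold but appears equally often in both classes, so its mutual information is zero and it is dropped.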
4.3.4 Stop word removal

Stop word removal deletes common words that occur with high frequency but carry less meaning than the keywords. Emails contain many non-informative words, such as prepositions (e.g., “on,” “inside”), articles (“the,” “an,” and “a”), and conjunctions (e.g., “but,” “for”), and such words increase the size of the vector space. A larger vector space complicates categorization. During this process, a list of stop words is produced and compared with the vector space to identify the words to be removed. Stop words make up about 5% of the text in documents [1,47]. Table 6 shows some common stop words.

Table 6: Different expressions of the stop words [47]

Couldn’t | Hers | Nor | That’s | We’re | Yourselves
Between | He | Me | Should | Up | You’d
Below | Having | Let’s | She’s | Until | You
Been | Hasn’t | I’ve | She | To | Won’t
At | Had | It’s | Own | This | Why
Any | Few | Is | Ours | They’d | While
And | Each | I’m | Our | They | Which
Against | Doing | How’s | Only | Then | When
A | Did | Herself | Not | The | We’ve
As | Further | Into | Over | They’ve | Whom
From | Aren’t | Some | Wasn’t | They’re | Who’s
In | Mustn’t | He’s | By | Out | You’ve

Table 6 contains only a few of the stop words that are often used. Research shows that several other stop words are eliminated at this stage. The proposed system is expected to be flexible in terms of adding and removing stop words.

4.3.5 Noise removal (regular expressions)

Noise in an email consists of ambiguous words. Obfuscation is the intentional misspelling of words, insertion of spaces, or embedding of special characters. For example, the word “Viagra” has been obfuscated to “V1agra” or “V|iagra,” and “free” to “fr33.” Spammers use this technique so that spam filters do not identify such terms [1,46,47]. Regular expressions are used in this process to recognize the misspelled terms.
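The two cleaning steps above can be sketched as follows. Neither part reproduces the authors' rules: the short stop-word set, the look-alike table `LOOKALIKES`, and the limit of two filler characters between letters are assumptions chosen so that the examples from the text (“V1agra,” “V|iagra,” “fr33”) are recognized.

```python
import re

# A small illustrative subset of stop words; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "on", "inside", "but", "for"}

# Hypothetical look-alike characters that spammers substitute for letters.
LOOKALIKES = {"a": "a@4", "e": "e3", "i": "i1!", "o": "o0", "s": "s$5"}

# Up to two punctuation or underscore characters may be wedged between letters.
FILLER = r"(?:[^\w\s]|_){0,2}"


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


def obfuscation_pattern(word):
    """Build a regex that matches a keyword and simple obfuscated variants."""
    classes = []
    for char in word.lower():
        candidates = LOOKALIKES.get(char, char)
        classes.append("[" + re.escape(candidates) + "]")
    return re.compile(FILLER.join(classes), re.IGNORECASE)


viagra = obfuscation_pattern("viagra")
free = obfuscation_pattern("free")

print(remove_stop_words(["the", "offer", "is", "on", "a", "car"]))
print(bool(viagra.search("Buy V1agra now")))
print(bool(viagra.search("cheap V|iagra")))
print(bool(free.search("get it fr33")))
```

Whitespace is deliberately excluded from the filler class so that ordinary adjacent words are not glued together into a false match. A pattern like this still matches inside longer words (for example, “free” inside “freedom”), so a practical filter would add boundary checks.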
4.4 The agent feature presentation

This agent presents the features in the format most suitable for ML filtering. Normally, the features obtained from an email are represented as a “bag of words” or vector space model. Lexical features are represented in numeric or binary form. Suppose that a vector space model represents the message as a vector $M = \{M_1, M_2, M_3, \ldots, M_n\}$, where $M_1, \ldots, M_n$ are the attribute values. In the binary representation, $M_i = 0$ if the corresponding word (feature) is not present in the message, and $M_i = 1$ otherwise. In the numeric representation, the attribute value $X_i$ indicates how frequently the feature appears in the email. For instance, if the word “Viagra” appears in the message, the corresponding binary feature is assigned the value 1. The character n-gram model is another feature representation that is often used. With this method, sequences of characters and the term frequency-inverse document frequency (tf-idf) are obtained. An n-gram, also known as a co-occurring set of characters in a word, is an n-character slice of a word. Common n-gram sizes include the bi-gram, tri-gram, and quad-gram. The tf-idf is a statistical measure of the significance of a word to a document in the feature corpus. Term frequency establishes how often a word occurs; the significance of the word to the document is determined from the number of times it occurs in the message. The term frequency is then multiplied by the inverse document frequency (idf), which measures how frequently the word occurs across all messages [48].

4.5 Adaptive intelligent learning classification approach for virtual anti-spam model

Classification in the adaptive intelligent learning classification model uses a Naive Bayes classification algorithm adapted from ref. [1]; this algorithm is employed with three different phases of training and testing. In the training phase, the agent trains the classifier on the sets of feature vectors produced in the previous phase, using K-fold cross-validation data allocations. The agent then runs the classifier to distinguish between legitimate emails and spam emails. The phases of the proposed virtual anti-spam framework are explained in this section. The first phase is the training phase, in which the novel anti-spam model is trained. The training is carried out in two parts: text training and image training. Text training is based on the feature representation, whereas image training is based on the feature vectors. In the image training part, the agents have two sets of feature vectors from the previous phase; one set contains pornographic images, whereas the other contains non-pornographic images.
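The text-training part of this phase can be illustrated with a minimal multinomial Naive Bayes sketch. It is not the classifier adapted from ref. [1]: the class name `NaiveBayesText`, the Laplace smoothing constant `alpha`, and the toy training data are assumptions made for the example. The same class could in principle be trained once on language labels and once per language on spam and legitimate labels, mirroring the two models of the proposed approach.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over token lists with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant (assumed)
        self.class_counts = Counter()           # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for token in tokens:
                if token in self.vocab:  # tokens unseen in training are skipped
                    count = self.word_counts[label][token]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: two spam and two legitimate messages.
train_docs = [["free", "prize", "click"], ["win", "free", "money"],
              ["meeting", "at", "noon"], ["project", "report", "attached"]]
train_labels = ["spam", "spam", "legitimate", "legitimate"]

model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict(["free", "money", "now"]))
print(model.predict(["project", "meeting"]))
```

The per-class token counts play the role of the saved dictionary: once they are stored, classifying a new message only requires summing log probabilities.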
These sets are saved and used for training. The text training part uses two sets from the feature presentation of the previous phase; one set comprises spam messages, whereas the other contains legitimate text. The two sets are trained and saved as the dictionary. The second phase tests the novel visual anti-spam model; in the results, the presentations are classified as either legitimate text or spam.

5 Experimental results and evaluation

The adaptive intelligent agent learning classification model for the virtual anti-spam model is an ML model that identifies spam emails using a multi-agent system. The proposed model was implemented with the JADE agent platform in the Java environment. With this application, spam can be detected and distinguished from legitimate emails by filtering. The identification accuracy in the testing stage is calculated by summing the true positives (TP) and true negatives (TN) over the legitimate and phishing emails and dividing the sum by the total number of emails (TP + FN + FP + TN):

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{1}$$

Here, FN denotes the false negatives, i.e., spam emails that are wrongly classified as legitimate; FP denotes the false positives, i.e., legitimate emails that are wrongly classified as spam; TP denotes the true positives, i.e., spam emails that are correctly identified; and TN denotes the true negatives, i.e., legitimate emails that are correctly identified.

Research evaluation can be objective or subjective; in this study, both are used. One of the biggest challenges in this field of research is the objective and subjective assessment of anti-spam accuracy. The objective assessment determines anti-spam accuracy by calculating statistical indices based on whether the system can successfully categorize emails as either spam or legitimate.
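As a worked illustration of such statistical indices, the sketch below computes accuracy as in Eq. (1), together with precision, recall, and the F-measure in their standard forms, from the four confusion-matrix counts. The function name and the counts in the example are hypothetical and are not taken from the paper's experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F-measure from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        # F-measure: harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts: 100 spam and 100 legitimate test emails.
metrics = classification_metrics(tp=95, tn=99, fp=1, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, a single legitimate email flagged as spam lowers precision only slightly, whereas the five missed spam emails account for most of the gap in recall.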
The subjective evaluation depends on people’s opinions, whereas the objective evaluation is based on facts obtained from statistical calculations. The classification results achieved by the proposed model in the tests performed are presented in Table 7. The runs in the table show the results of the proposed approach, which achieved high accuracy.

Table 7: The ten-fold cross-validation classification results

Run | Emails for the folder | Email sum | FP (%) | FN (%) | Accuracy (%) | Precision (%) | F-measure (%) | Recall (%)
1 | 20 | 100 | 0.02 | 5.15 | 97.1 | 99 | 96.1 | 95.4
2 | 40 | 200 | 0.02 | 4.93 | 97.3 | 99 | 96.4 | 95.8
3 | 80 | 400 | 0.05 | 4.17 | 97.6 | 99 | 96.9 | 96.2
4 | 150 | 800 | 0.05 | 3.89 | 97.8 | 99 | 97.3 | 96.5
5 | 250 | 1,500 | 0.07 | 3.05 | 98.1 | 99 | 97.4 | 96.7
6 | 350 | 2,000 | 0.08 | 2.90 | 98.4 | 98.1 | 97.65 | 96.9

As mentioned earlier, ML techniques involve two main stages: the training stage and the testing stage. The prediction accuracy of a classifier depends entirely on the information acquired during training: if little information is acquired, the prediction accuracy is low, whereas if much information is acquired, it is high. As stated above, we used the ten-fold cross-validation method. The Naive Bayes classifier sometimes creates a random forest, and the information obtained for all the best features is computed using the method explained by Mitchell [49]. The features with the optimal information obtained are chosen and used to build the Naive Bayes classifier. The mode of the votes for the Naive Bayes model is computed and used for email prediction. The information obtained is one of the feature-ranking measures most widely used in text classification problems today. The details of our classifier method are described in the next subsection. Our classifier was tested on datasets of increasing size (as shown in Table 7) to assess the performance of the Naive Bayes classifier on both small and large datasets.
As shown in Table 7, the classifier performed best when tested on the largest dataset (an overall accuracy of 98.4%, an FN rate of 2.90%, and an FP rate of 0.08%). This observation suggests that our classifier algorithm will work well if it is applied to real-world datasets, which are usually larger. Moreover, the Naive Bayes classifier achieved a high predictive accuracy (98.4%) compared with the 97% achieved by Fette et al. [50]. The proposed model can thus identify phishing emails with an accuracy of 98.4%, and the experimental results confirm that it can be used in real time to detect spam emails. Table 8 compares the classification results of the Naive Bayes classifier on the best features with those of a recent study.

Table 8: Classification result of Naive Bayes classifier on the best features with a recent study

Method | FP (%) | FN (%) | Precision (%) | F-measure (%) | Recall (%)
Fette et al. [50] | 0.13 | 3.62 | 98.92 | 97.64 | 96.38
Our approach | 0.08 | 2.90 | 99 | 97.65 | 96.90

The performance of the classifiers is evaluated in terms of information retrieval (accuracy, precision, derived measures, and recall) and decision theory (FNs and FPs). The most relevant metrics for measuring anti-spam performance are spam detection accuracy, spam precision, and recall. Recall is the proportion of spam messages that are correctly classified out of all spam messages, i.e., those recognized as spam plus those wrongly classified as legitimate email. Precision is the proportion of spam messages that are correctly classified out of all images or texts that are labeled as spam. Accuracy is the ratio of correctly classified spam and legitimate emails to the total number of images or texts used for testing [47]. The following equations can be used to calculate these parameters:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

FNs are spam images or messages that have been wrongly classified as legitimate, while FPs are legitimate messages that have been wrongly classified as spam. TPs are spam images or messages that have been correctly predicted as spam, while TNs are legitimate messages or images that have been correctly detected as legitimate.

As stated earlier, subjective evaluation is the second type of evaluation used in this study to determine whether the visual anti-spam model could successfully distinguish between spam and legitimate emails. The main reason for using subjective evaluation is that it gathers the opinion of experts on the performance of the visual anti-spam model. Humans interpret things very differently from the way a machine does because human interpretation is universal and subjective. For instance, a result may score well on an objective quality criterion and still not be good for human interpreters. Additionally, the subjective method is used continuously to measure the quality of application research, particularly in this study, because humans can naturally detect spam with higher accuracy than machines. For example, when a subjective evaluation is performed, viewers attend to the differences between the original message and the images or messages that have been reconstructed.
Here, loss of information caused by misinterpretation by the machine can be observed, and such loss cannot be accepted. Thus, viewers can rate the quality of the visual anti-spam model so that the performance of the system can be determined in terms of its accuracy in distinguishing between spam and non-spam.

6 Conclusion

Spam emails have become a genuine risk to security and the economy worldwide. The increasing number of phishing emails has made the issue more complex, and updating the blacklist has become almost impossible. Thus, in this study, we have presented a content-based spam email detection method that bridges the gap identified in previous studies. This study shows how an adaptive intelligent learning approach, based on the visual anti-spam model for multi-natural language, can be used to detect unusual situations effectively. This approach is used for spam filtering. With adaptive intelligent learning, high performance is achieved, along with a low rate of false detection. The results of our study suggest that the classifier performed best when tested on the largest dataset (an overall accuracy of 98.4%, an FN rate of 2.90%, and an FP rate of 0.08%). This observation also suggests that our classifier algorithm will work well if it is applied to real-world datasets, which are usually larger. For future work, we suggest enhancing this study by combining the approach with nature-inspired methods, such as particle swarm optimization or ant colony optimization, that can dynamically and automatically identify spam emails. This approach can then be used to construct a robust spam email detection system with the highest classification accuracy. The use of this model and approach will improve the predictive accuracy of classifiers that rely on spam email features to identify spam emails effectively.

Conflict of interest: Authors state no conflict of interest.