"Improved Synonym Approach to Linguistic Steganography" Design and Proof-of-Concept Implementation
Aniket M. Nanhe, B.Tech Comp. Sci., College of Engineering, Pune, India. nanhe.aniket@gmail.com
Mayuresh P. Kunjir, B.Tech Comp. Sci., College of Engineering, Pune, India. mayuresh.kunjir@gmail.com
Sumedh V. Sakdeo, B.Tech Comp. Sci., College of Engineering, Pune, India. sumedhsakdeo@gmail.com

Abstract

This paper develops a linguistically robust linguistic steganography approach using synonym replacement, which converts a message into semantically innocuous text. Drawing upon linguistic criteria, this approach uses word replacement, with substitution classes based on traditional word replacement features (syntactic categories and subcategories), as well as features under-exploited in earlier works: semantic criteria, inflectional class, and frequency statistics. The original message is hidden through the use of a cover text which is shared between sender and receiver. This paper also presents a new approach to sharing the cover text and changing it periodically to make the algorithm safe from steganalysis.

1. Introduction

While current encryption techniques are sufficiently advanced to make code-breaking practically impossible, one major drawback of current encryption methods is the ease of identifying an encrypted text: it does not resemble natural text in any way. Steganography attempts to answer this need, acting to conceal the message's existence in order to transmit encrypted messages without arousing suspicion. Steganography is the art and science of concealing a secret message inside a cover object. When the secret message is in digital form, it leaves enormous choice for the cover object. For instance, one could hide digital information inside images, audio, binaries, videos, texts, etc. Steganography can be classified according to the type of cover object used. These cover objects could be images, audio, binaries and, as in our case, natural language text.

The study of this subject in scientific literature dates back to 1983, when Simmons formulated it as "The Prisoners' Problem". It says, "Alice and Bob are in jail and wish to hatch an escape plan; their entire communication passes through the warden Wendy; and if Wendy detects any suspicious messages, she will frustrate their plan by throwing them in solitary confinement. So they must find some way of hiding their secret message in an innocuous looking cover text".

Information hiding has taken one form in image-based steganography, utilizing minimal changes in pixels or watermarking techniques. While text-based messages have also been used within image-based maneuvers, by modifying the white space between letters and by minutely changing the fonts, this has proved less fruitful because text can be retyped and is often altered in the conversion from one program version or platform to another. Proving more productive, as well as resistant to the difficulties surrounding the re-typing of text-based messages, is lexical steganography, which uses linguistic structures to disguise encryption of text messages such that the appearance of the message remains semantically and syntactically innocent.

This paper presents a new approach to linguistic steganography using synonym replacement, a linguistically informed alternative to existing text-based steganography systems. This approach adds extra features such as inflectional class and frequency statistics, thereby producing semantically and syntactically correct text which appears more natural to the human eye.

There is one more area which is under-exploited in earlier works: sharing the cover text and changing it periodically. The cover text is critical, as it is the text which is transformed and sent over the communication channel. Our aim is to make steganalysis difficult by altering the cover text frequently so that no suspicion is aroused. This paper discusses a new approach to achieving this by exploiting the part of the cover text which remains unchanged in the transformation.

This paper reviews past research in Section 2, presents the basic algorithm in Section 3, and describes cover text selection in Section 4.
2. Past Research

Lexical steganography has had three main veins of research: watermarking techniques that manipulate sentences through syntactic transformations, a.k.a. ontological techniques; word replacement systems, both with and without cover texts; and context-free grammars such as NICETEXT. We review the work carried out in each of these techniques.

2.1. Syntactical Steganography

The approaches to syntactical steganography exploit the syntactic structures of a text. They make use of Context Free Grammars (CFG) to build syntactically correct sentences. The famous CFG-based Mimicry algorithm developed by Peter Wayner[1] comes under this category. Another famous algorithm, NICETEXT by Chapman et al.[8], is also based on CFG.

NICETEXT uses the cover text simply as a source of syntactic patterns: by running the cover text through a part-of-speech tagger, NICETEXT obtains a set of "sentence frames," e.g. [(noun) (verb) (prep) (det) (noun)] for 'I sat in the tree.' It also compiles a lexicon of words found in the cover text via part-of-speech tags, with each word in the lexicon associated (arbitrarily) with either of the binary digits 0 or 1. In encryption, the plain text message is converted into a sequence of binary digits. A random sentence frame is chosen and the part-of-speech tags in it are replaced by words in the lexicon according to the sequence of binary digits.

Although NICETEXT produces syntactically correct sentences, it fails on the count of semantics. The output text is almost always a set of ungrammatical and semantically anomalous sentences.

Another factor worth considering is the density of encryption within the cover text. Ideally, the cover text should work to hide the word frequencies and syntactic structure of the hidden plain text message. Steganographic goals encourage sparse encryption, which does not alter a majority of the text by word replacement. NICETEXT encryption is maximally dense: every word within the final encrypted cover text conveys hidden information. Given that each encrypted word is part of the original information-bearing message and common word usage patterns are unavoidable, this is problematic for the original steganographic intent: avoiding detection and producing naturalistic text.

In general, syntactical steganography techniques produce text that is syntactically well-formed without being semantically well-formed. This can be seen from Chomsky's famous sentence "Colorless green ideas sleep furiously".

2.2. Lexical Steganography

In lexical steganography, lexical units of natural language text, such as words, are used to hide secret bits. The most straightforward subliminal channel in natural language is probably the choice of words. A word could be replaced by its synonym, and the choice of word from the list of synonyms would depend upon the secret bits. For example, consider the sentence:

"Pune is a nice little city"

Now, suppose the list of synonyms for nice is {nice, wonderful, great, decent}. Each of the synonyms can be represented by two bits as shown in the table:

    Word        Code
    Nice        00
    Wonderful   01
    Great       10
    Decent      11

    Table 1: Lexical code table

Depending upon the input secret bits, the appropriate synonym for 'nice' will be selected and put in the stego text. So, the possible stego texts could be:

a) Pune is a nice little city.
b) Pune is a wonderful little city.
c) Pune is a decent little city.
d) Pune is a great little city.

The lexical techniques produce better quality text than syntactical techniques. It is hard for statistical attacks to detect the presence of a hidden message.

The replacements are a critical part of these techniques. To give an example, in the above-mentioned synonym replacement approach, some words can have more than one sense. (The noun "bank" has two senses: "a long pile or heap" or "an institution for receiving, lending, and safeguarding money and transacting other financial business".) If we do not use a synonym having the same sense as the original word, the output will look suspicious.

1. Bring those instruments.
2. Bring those tool.

A further impediment to synonym-based word replacement is inflection classes (i.e., legal and illegal word combinations). (2) replaces the plural noun 'instruments' of (1) by its singular synonym noun 'tool', thus making the sentence grammatically incorrect.

2.3. Ontological Technique

Of the techniques considered herein, the ontological one is the most sophisticated approach with respect to modeling semantics. Instead of implicitly leaving semantics intact by replacing only synonymous words while embedding information into an innocuous text, an explicit model for "meaning" is used to evaluate equivalence between texts.

Atallah et al.[4] watermark texts by manipulating and exploiting the syntax (formal word order and grammatical voice) of sentences. Through common generative transformations (clefting (4), adjunct fronting (5), passivization (6), adverbial insertion (7)), the syntax of each sentence is altered:

3. The lion ate the food yesterday. (original sentence)
4. It was the lion that ate the food yesterday.
5. Yesterday, the lion ate the food.
6. The food was eaten by the lion yesterday.
7. Surprisingly, the lion ate the food yesterday.

The ontological techniques nevertheless have some problems. The transformations sometimes affect the semantics of a text. Newer theories of language argue for the interconnectedness of the semantic and syntactic levels, demonstrating that the syntactic pattern is itself inherently meaningful. Furthermore, statistically, various syntactic structures (word orders) are not equal in distribution: different genres of text have wildly different syntactic structures, and replacing such structures freely could create a text which is trivially broken by statistical methods, a security threat to the program.

3. Basic Algorithm

The algorithm replaces all the nouns, adjectives, verbs and adverbs of the cover text by their respective synonyms. A semantically and orthographically correct text is used as cover text to hide messages. We use a word dictionary to obtain synonyms. The input text to be hidden is compressed using the Huffman compression algorithm and a string of bits is generated. The input bits are consumed in the selection of synonyms.

The algorithm works in stages. The various stages of the algorithm are:

3.1. Part-of-speech Tagging

The basic requirement of this algorithm is that a cover text be shared between sender and receiver. Natural language processing is done on the cover text in order to determine the part of speech of each word. This is an essential part of the algorithm, as we are going to replace only common nouns, adjectives, adverbs and verbs in the cover text. A part-of-speech tagger is applied to the cover text and outputs each word followed by its part of speech.

3.2. Input Compression

The input secret text is treated as a binary bit string. These bits are used in the synonym replacement stage to choose the synonym that is to be used in place of a word. Using the standard ASCII representation, we need 7 bits for each character of input, so we need to hide (7 * 'number of characters') bits. We can improve on this number by exploiting characteristics of the English language. Some characters appear more frequently in normal English text than others. If we use fewer bits for such characters, we can reduce the number of input bits to be hidden. To achieve this, we use the Huffman compression algorithm. On average, Huffman coding reduces the size of the input bit string to 33% of the original.

3.3. Synonym Replacement

This stage is the core of the algorithm. The actual task of hiding bits in a cover text is carried out here. At the sender's side, the inputs to this stage are the compressed bit string and the tagged cover text; the receiver needs the encoded text (a.k.a. stego text) and the tagged cover text. This stage makes use of dictionaries to find replacements for a word.

3.3.1. Dictionaries

We use three dictionaries here:

a. WordNet 2.1 English dictionary
WordNet is an open-source English dictionary with broad coverage of English words. We can get all synonyms of a word using this dictionary. A word may have more than one sense in which it can be used. WordNet provides output in terms of 'synsets'. A synset defines a sense of a word. Each synset contains all the synonyms of the given word which can be used in that particular sense.

e.g.: Synonym sets of the word "traveling" are:
1. travel, go, move, locomote
2. travel, journey
3. travel, trip, jaunt
4. travel, journey

WordNet also provides the frequency of occurrence of a word in normal English text. This information is very useful in our algorithm, as we use it to encode the synonyms. Huffman coding is used here again, so that more frequently occurring synonyms get shorter codes and vice versa. This matters for word sense disambiguation: following WordNet's first-sense ordering, a frequency is assigned to each synonym of a word depending upon how often that synonym is used for that particular word in normal English text. Assigning a shorter code to the most frequently used synonym helps maintain the proper word sense.

b. Verb Inflection dictionary

A verb can have many inflected forms, such as the present participle, past tense, past participle and base forms. When we look up synonyms for a verb, WordNet always gives the synonyms in their base form, irrespective of the inflected form of the input verb. If we replace a verb by a synonym in its base form, the output sentence becomes grammatically incorrect. To avoid this situation, we use a separate verb dictionary.

We maintain a list of all verbs (about 16,064 verbs) along with all their inflected forms in a separate file. Before replacing a verb by its synonym, we check whether the inflected forms of the original verb and its synonym match. If they do not match, we select the appropriate inflected form of the synonym verb from the verb dictionary and replace the original by it.

e.g.: Tagged Cover Text:
A/DT group/NN of/IN frogs/NNS were/VBD traveling/VBG through/IN the/DT woods/NNS ,/,

Suppose the synonym "go" is selected from WordNet as a replacement for "traveling". The stego text should contain the present participle form of "go" ("going") to maintain the tense of the sentence. The verb dictionary provides this inflected form of the base form "go" for the actual replacement.

c. Noun Inflection dictionary

A noun can be either in singular or plural form. As is the case with verbs, WordNet always gives the synonyms of a noun in their singular form. If we replace a plural noun by its singular synonym, we get a grammatically incorrect sentence. So we use a separate noun dictionary to avoid this situation.

We maintain a list of nouns (about 89,051 nouns) in their singular as well as plural forms. Before replacing a noun by its synonym, we check whether both are in the same form. If not, we select the appropriate form of the synonym noun from the noun dictionary and replace the original by it.

e.g.: Tagged Cover Text:
A/DT group/NN of/IN frogs/NNS were/VBD traveling/VBG through/IN the/DT woods/NNS ,/,

The noun "frogs" is the plural form of the word "frog". Suppose the synonym "Gaul" is selected by the input bit string; we need to ensure that it is used in its plural form, as "frogs" is plural. So we use this dictionary to obtain the singular and plural forms of the nouns present in the dictionary.

3.4. Synonym Replacement

Figure 1 shows the mechanism that is carried out at the sender end. As can be seen, the tagged text obtained from stage 1 is scanned. Whenever a noun, adjective, verb or adverb is found, its synonyms are obtained from WordNet. All synonyms are put in a frequency table; the frequencies are obtained from WordNet. Huffman coding is done on this frequency table to obtain codes for all synonyms. By using frequencies, we also achieve word sense disambiguation, as more frequently used senses get shorter codes and therefore have a higher probability of being used.

After building the encode table, we use the input bit string to select one of the synonyms from the table. If we are replacing a verb, the inflected forms are checked and the appropriate form of the verb is obtained from the verb dictionary. Similarly, if we are replacing a noun, the singular or plural form is selected from the noun dictionary in accordance with the original noun's form. Otherwise the selected synonym is put in place of the original word. The Appendix shows an example of stego text generated from a cover text by hiding a secret text.
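The selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (huffman_codes, embed, extract) are hypothetical, the frequencies for the synonyms of "nice" are made up rather than taken from WordNet, and the inflection dictionaries are left out. The paper does not say what happens when fewer secret bits remain than the selected code needs, so the sketch assumes zero-padding and stops replacing once the bits are used up.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Map each symbol in {symbol: frequency} to a prefix-free bit string."""
    if len(freqs) == 1:
        return {next(iter(freqs)): ""}  # a lone synonym cannot carry any bits
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, low = heapq.heappop(heap)
        f1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


def embed(slots, bits):
    """Sender side: pick one synonym per slot, consuming secret bits."""
    chosen, pos = [], 0
    for freqs in slots:
        if pos >= len(bits):
            break  # nothing left to hide; remaining words stay untouched
        codes = huffman_codes(freqs)
        # Assumption: zero-pad when fewer bits remain than the longest code.
        window = bits[pos:] + "0" * max(len(c) for c in codes.values())
        # Codes are prefix-free, so exactly one of them matches the window.
        word = next(w for w, c in codes.items() if window.startswith(c))
        chosen.append(word)
        pos += len(codes[word])
    return chosen


def extract(slots, words):
    """Receiver side: rebuild each table and read back the code of each word."""
    return "".join(huffman_codes(f)[w] for f, w in zip(slots, words))


# Made-up frequencies for the synonyms of "nice" from Table 1.
nice = {"nice": 8, "wonderful": 4, "great": 2, "decent": 1}
secret = "101001"
words = embed([nice, nice, nice], secret)
print(words)
print(extract([nice, nice, nice], words))
```

Because sender and receiver build the table from the same frequencies in the same order, they arrive at identical codes. With these frequencies the most frequent synonym, "nice", receives the shortest code. The same huffman_codes helper could in principle serve the input compression of Section 3.2 if given a character frequency table, though the paper does not state how that table is obtained.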
Figure 1: Sender end

Figure 2: Receiver end

The reverse algorithm is carried out at the receiver end. Figure 2 shows the mechanism. The tagged text is scanned for nouns, adjectives, verbs and adverbs. When one is found, its synonyms are obtained from WordNet. As at the sender end, a frequency table and then an encode table are formed using the frequencies of the words. At the same time, the stego text is also scanned. From the stego text, we obtain the synonym selected at the sender's side. Its bits are then looked up in the table and appended to the output string. The output string, when decompressed, produces the original secret text.

4. Cover Text Selection

Steganalysis is the task of detecting the existence of a secret message. This follows from the aim of steganography, which is to conceal the existence of a message rather than to scramble it.

Our approach uses word replacement in the cover text. As only a few words are replaced by their synonyms, the majority of the text remains unchanged. If the same cover text is used again and again, a "Known Stego-Text" attack becomes possible, in which an intruder keeps track of the text being sent on the communication medium. To protect the text from steganalysis, the cover text needs to be changed periodically.

Our approach uses a book, a collection of different chapters. The book should be privately owned by sender and receiver. One of the chapters from the book is selected as cover text. The choice of chapter is made randomly by the sender.

For the reverse transformation at the receiver end, the same chapter must be selected as cover text. To achieve this, we exploit the unchanged part of the cover text. The receiver decides from the stego text which chapter is to be used as cover text. This approach calculates the difference between the individual chapters of the book and the stego text and selects the chapter with the minimum difference.

Initially, each chapter in the book is scanned once to obtain a code for each sentence. Words which are common nouns, adjectives, adverbs or verbs are ignored in the calculation of the code. Values of the other words are calculated using the ASCII values of their characters and the position of the word in that particular sentence. All these values are summed to determine the code for that sentence. Similarly, codes for all sentences are calculated.

e.g.: "This is important for me".
Code = 't' * 1 + 'h' * 1 + 'i' * 1 + 's' * 1 + 'i' * 2 + 's' * 2 + 'f' * 4 + 'o' * 4 + 'r' * 4 + 'm' * 5 + 'e' * 5

"important" is not used in the calculation of the code for this sentence as it is an adjective.

The same code generation algorithm is applied to the stego text. The code of the stego text is compared with the codes of all chapters. Ideally, the codes for all sentences of the stego text should match the codes of the chapter which was used as cover text by the sender. But our algorithm allows compound word replacements ("travel" can be replaced by "move around"). This causes a difference between the codes of the stego text and of the chapter to be used as cover text. This difference, however, is very small compared to the difference for other chapters. So the chapter with the minimum difference is selected as cover text.

5. Conclusion

The field of linguistic steganography is very interesting, as it conceals the very existence of a secret message from an intruder, which is not achievable by cryptography. The synonym replacement approach used for linguistic steganography produces innocuous-looking English text, thereby making detection of the secret message very hard. The famous Stego-Turing Test states that it is very hard for a computer to alter a natural language text in a way that is undetectable to a human. Many approaches to this have been tried in the past, but none has been able to solve the problem. Though our solution does not solve the problem, it produces better quality output than previous approaches.

We also give a new approach to dynamically choosing a cover text from a chunk of text shared by sender and receiver. This allows the user to use a different cover text for hiding a message each time, thus making steganalysis difficult.

The synonym replacement approach to linguistic steganography using inflection classes, frequency statistics and dynamic cover text selection is a new improvement in the field of linguistic steganography and provides a very good, efficient tool for information hiding.

6. References

[1] P. Wayner, "Mimic functions," Cryptologia XVI, pp. 193–214, July 1992.
[2] P. Wayner, "Strong theoretical steganography," Cryptologia XIX, pp. 285–299, July 1995.
[3] P. Wayner, "Disappearing Cryptography - Information Hiding: Steganography & Watermarking", 2nd edition, Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, 2002, pp. 67-126, pp. 303-314.
[4] M. J. Atallah, V. Raskin, M. Crogan, C. Hempelmann, F. Kerschbaum, D. Mohamed, and S. Naik, "Natural language watermarking: Design, analysis, and a proof-of-concept implementation," in Information Hiding: Fourth International Workshop, I. S. Moskowitz, ed., Lecture Notes in Computer Science 2137, pp. 185–199, Springer, April 2001.
[5] M. J. Atallah, V. Raskin, C. F. Hempelmann, M. Karahan, R. Sion, U. Topkara, and K. E. Triezenberg, "Natural language watermarking and tamperproofing," in Information Hiding: Fifth International Workshop, F. A. P. Petitcolas, ed., Lecture Notes in Computer Science 2578, pp. 196–212, Springer, October 2002.
[6] K. Bennett, "Linguistic steganography: Survey, analysis, and robustness concerns for hiding information in text," Tech. Rep. TR 2004-13, Purdue CERIAS, May 2004.
[7] M. T. Chapman, "Hiding the hidden: A software system for concealing ciphertext as innocuous text," Master's thesis, University of Wisconsin-Milwaukee, May 1997.
[8] M. T. Chapman and G. I. Davida, "Hiding the hidden: A software system for concealing ciphertext as innocuous text," in Information and Communications Security: First International Conference, Y. Han, T. Okamoto, and S. Qing, eds., Lecture Notes in Computer Science 1334, Springer, August 1997.
[9] M. T. Chapman, G. I. Davida, and M. Rennhard, "A practical and effective approach to large-scale automated linguistic steganography," in Information Security: Fourth International Conference, G. I. Davida and Y. Frankel, eds., Lecture Notes in Computer Science 2200, p. 156ff, Springer, October 2001.
[10] R. Bergmair, "Towards linguistic steganography: A systematic investigation of approaches, systems, and issues." Final year thesis, April 2004. Handed in in partial fulfillment of the degree requirements for the degree "B.Sc. (Hons.) in Computer Studies" to the University of Derby.
[11] I. A. Bolshakov, "A method of linguistic steganography based on collocationally-verified synonymy," in Information Hiding: 6th International Workshop, J. J. Fridrich, ed., Lecture Notes in Computer Science 3200, pp. 180–191, Springer, May 2004.
[12] K. Winstein, "Lexical steganography through adaptive modulation of the word choice hash," January 1999. Was disseminated during secondary education at the Illinois Mathematics and Science Academy. The paper won the third prize in the 2000 Intel Science Talent Search.
[13] A. J. Tenenbaum, "Linguistic steganography: Passing covert data using text-based mimicry." Final year thesis, April 2002. Submitted in partial fulfillment of the requirements for the degree of "Bachelor of Applied Science" to the University of Toronto.
[14] Vineeta Chand and C. Orhan Orgun, "Exploiting linguistic features in Lexical Steganography", Proceedings of the 39th Hawaii International Conference on System Sciences, 2006.
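As a supplementary illustration of the cover text selection in Section 4, the sentence code and the chapter choice can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' code. The names sentence_code and select_chapter are hypothetical. Each word carries a flag saying whether the tagger considers it replaceable; following the paper's worked example, "is" is treated as not replaceable. The paper does not define the difference measure between a chapter and the stego text, so the sketch assumes the sum of absolute differences between sentence codes, with missing sentences counted as 0.

```python
from itertools import zip_longest


def sentence_code(tagged_words):
    """Sum position * ASCII values over the words that are never replaced.

    tagged_words is a list of (word, replaceable) pairs in sentence order.
    Positions start at 1 and also count the skipped words, as in the
    paper's example where "for" is weighted by 4 and "me" by 5.
    """
    total = 0
    for position, (word, replaceable) in enumerate(tagged_words, start=1):
        if replaceable:
            continue  # nouns, adjectives, adverbs, verbs may change in transit
        total += position * sum(ord(ch) for ch in word.lower())
    return total


def select_chapter(chapter_codes, stego_codes):
    """Return the index of the chapter closest to the stego text.

    Assumption: closeness is the sum of absolute differences between
    sentence codes, aligned by sentence index.
    """
    def difference(codes):
        pairs = zip_longest(codes, stego_codes, fillvalue=0)
        return sum(abs(a - b) for a, b in pairs)

    return min(range(len(chapter_codes)),
               key=lambda i: difference(chapter_codes[i]))


example = [("This", False), ("is", False), ("important", True),
           ("for", False), ("me", False)]
print(sentence_code(example))  # 3238 for the paper's example sentence

# Hypothetical per-sentence codes for a two-chapter book and a stego text.
book = [[100, 200], [3238, 500]]
print(select_chapter(book, [3238, 510]))  # chapter index 1 is closest
```

A small mismatch, such as the one caused by a compound replacement shifting word positions, still leaves the correct chapter with the smallest difference in this toy example.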
APPENDIX

Secret Text: Escape from jail today evening.

Sample Cover Text:
EVER since I have been scrutinizing political events, I have taken a tremendous interest in propagandist activity. I saw that the Socialist-Marxist organizations mastered and applied this instrument with astounding skill. And I soon realized that the correct use of propaganda is a true art which has remained practically unknown to the bourgeois parties. Only the Christian-Social movement, especially in Lueger's time, achieved a certain virtuosity on this instrument, to which it owed many of its successes. But it was not until the War that it became evident what immense results could be obtained by a correct application of propaganda. Here again, unfortunately, all our studying had to be done on the enemy side, for the activity on our side was modest, to say the least. The total miscarriage of the German 'enlightenment' service stared every soldier in the face, and this spurred me to take up the question of propaganda even more deeply than before. There was often more than enough time for thinking, and the enemy offered practical instruction which, to our sorrow, was only too good.

Sample StegoText:
ever since I have been scrutinizing political cases, I have taken a tremendous interest in propagandist action. I experienced that the Socialist-Marxist organizations subdued and practiced this tool with astounding accomplishment. And I shortly recognized that the right use of propaganda is a truthful art which has stayed much unknown to the businessperson parties. merely the Christian-Social move, particularly in Lueger 's time, accomplished a sure virtuosity on this instrument, to which it owed many of its successes. But it was not until the War that it turned evident what immense effects could be found by a right application of propaganda. hither once more, unfortunately, all our considering had to be made on the enemy side, for the action on our side was modest, to say the least. The total stillbirth of the German ` Nirvana ' service starred every soldier in the face, and this spurred me to bring up the inquiry of propaganda even more deeply than ahead. There was frequently more than plenty sentence for mentation, and the opposition offered practical education which, to our regret, was only too good.