A review of "Asymmetric Numeral Systems"
Mart Simisker
28 January 2018

Abstract

In 2009, J. Duda proposed Asymmetric Numeral Systems for lossless compressive encoding, which is supposed to perform at speeds similar to Huffman coding with compression close to Arithmetic coding. This coding can combine compression and encryption into one step and is suited for systems with low computational power.

1 Introduction

With the growing amounts of data, the need for better compression increases. For lossless compression, the two best-known algorithms are Huffman coding, proposed by David A. Huffman in 1952 [9], and Arithmetic coding [13]. While Huffman coding is fast, it does not typically compress as well as Arithmetic coding. Arithmetic coding, however, requires more computational power, which is not always present in low-power embedded systems. In 1948, Shannon [11] introduced the entropy of data, which is the theoretical information content of the data and a lower bound for lossless compression. Entropy coders try to compress the data as close to the entropy value as possible.

As proof of interest in Asymmetric Numeral Systems, a specific methodology based on ranged ANS (further described in Section 6) has been patented [8]. Another example is Facebook's Zstandard compressor, which uses tabled ANS (further described in Section 5.2) [6]. Most of the facts about ANS in this review are based on the papers from 2009 [3], 2013 [4] and 2016 [5].

First, an overview of Huffman coding and Arithmetic coding is given. In Section 4, ANS is introduced. Stream coding, tabled ANS and possible cryptographic applications are also discussed. Section 6 introduces ranged ANS. In Section 7, compression with ranged ANS is compared to Huffman coding and the results are presented.

2 Notation

We use the following notation:
A - the alphabet
ps - the probability of symbol s in the alphabet
C - the encoding function
D - the decoding function
s - a symbol
x - a natural number (the state) into which the symbols are encoded

For range-based ANS, the following additional notation is used:

bs - the beginning of the range of symbol s
ls - the number of occurrences of symbol s

3 General overview

3.1 Overview of Huffman coding

In Huffman coding, input characters are assigned variable-length prefix codes based on their frequencies. The idea is to reduce the length of the overall code by assigning shorter codes to more frequent characters. Huffman coding is performed by first constructing a Huffman tree. Encoding and decoding are then performed by traversing the tree from the root node to a leaf containing the character. The cost of encoding and decoding is the number of steps taken when traversing the tree plus the cost of the logical operations [9].

3.2 Arithmetic coding (Range coding)

In Arithmetic coding, the alphabet is mapped to the range [0, 1) according to the probabilities of the symbols. Encoding a symbol narrows this range; for decoding, only one value from the final range is required [13]. The process of encoding is displayed in Figure 1. Both encoding and decoding contain two multiplication operations, which can be seen from Algorithm 1 and Algorithm 2.

Algorithm 1 Pseudocode for the encoding procedure of Arithmetic coding [13, Figure 2]
Require: symbol, cum_freq
  range = high − low
  high = low + range * cum_freq[symbol − 1]
  low = low + range * cum_freq[symbol]
Figure 1: Representation of the Arithmetic coding process with the interval scaled up at each stage [1].

Algorithm 2 Pseudocode for the decoding procedure of Arithmetic coding [13, Figure 2]
Require: cum_freq, value
  find symbol such that cum_freq[symbol] ≤ (value − low)/(high − low) < cum_freq[symbol − 1]
  range = high − low
  high = low + range * cum_freq[symbol − 1]
  low = low + range * cum_freq[symbol]
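The interval update of Algorithm 1 can be sketched as follows. This is an illustration written for this review, not code from [13]; the two-symbol model and its probabilities are assumptions. As in the pseudocode, cum_freq is stored in decreasing order, so cum[s] is the cumulative probability mass strictly below symbol s.

```python
from fractions import Fraction

# Illustrative two-symbol model (an assumption): P(symbol 1) = 3/5, P(symbol 2) = 2/5.
# cum[0] = 1 > cum[1] > cum[2] = 0, matching the decreasing cum_freq convention.
cum = {0: Fraction(1), 1: Fraction(2, 5), 2: Fraction(0)}

def encode(symbols):
    """Narrow the interval [low, high) once per symbol, as in Algorithm 1."""
    low, high = Fraction(0), Fraction(1)
    for s in symbols:
        rng = high - low
        high = low + rng * cum[s - 1]
        low = low + rng * cum[s]
    return low, high  # any value in [low, high) identifies the message
```

Exact fractions are used so that the interval widths come out exactly as products of the symbol probabilities, with no floating-point rounding.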
4 Asymmetric Numeral Systems

The encoding function takes a state and a symbol and encodes them into a new natural number. The decoding function takes a state and decodes a symbol while producing a new state, from which further symbols can be extracted:

coding: C(x, s) → x′
decoding: D(x′) → (x, s)

In the paper [4], the amount of information a symbol should contain is discussed. If x is seen as the possibility of choosing a value from the range {0, 1, ..., x − 1}, then it contains log2 x bits of information. A symbol s is supposed to contain log2(1/ps) bits of information. Then x′, which is supposed to contain both x and s, should contain log2 x + log2(1/ps) = log2(x/ps) bits of information. Therefore x′ being approximately x/ps allows choosing from a larger interval {0, 1, ..., x′ − 1}.

The range {0, 1, ..., x′ − 1} consists of subsets, each corresponding to a symbol s. There is a function s̄(x) = s used to map the natural numbers to the alphabet. Let xs denote the value in the original range {0, 1, ..., x − 1} for which the corresponding symbol given by the function s̄ is s. In the paper it is mentioned that log2(x/xs) is the number of bits currently used to encode symbol s. To reduce inaccuracy, the approximation xs ≈ x·ps should be as close as possible. To understand the concept more clearly, we will now look at an example.

4.1 Example encoding function for the uniform Asymmetric Binary System

In the case of a nearly uniform distribution with alphabet A = {0, 1} and probabilities p0, p1, an example of coding functions is given in the paper [4, Section 2.2]. As there are only two symbols, the following equations use p = p1. The coding function is:

C(x, s) = ⌈(x + 1)/(1 − p)⌉ − 1   if s = 0
C(x, s) = ⌊x/p⌋                  if s = 1

The decoding function is:

D(x) = (x − ⌈x·p⌉, 0)   if s = 0
D(x) = (⌈x·p⌉, 1)       if s = 1

5 Other applications of ANS

5.1 Quick overview of stream encoding

When coding with ANS, the state grows exponentially.
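The uABS coding functions of Section 4.1 can be sketched as follows. This is a sketch written for this review; p is an illustrative value, and the symbol-recovery rule s = ⌈(x + 1)·p⌉ − ⌈x·p⌉ is taken from [4] and is not restated above.

```python
from fractions import Fraction
from math import ceil, floor

p = Fraction(3, 10)  # illustrative probability of symbol 1 (an assumption)

def C(x, s):
    """Encode symbol s onto state x (uABS coding function, Section 4.1)."""
    if s == 0:
        return ceil((x + 1) / (1 - p)) - 1
    return floor(x / p)

def D(x):
    """Decode one symbol from state x, returning (previous state, symbol)."""
    s = ceil((x + 1) * p) - ceil(x * p)  # symbol-recovery rule from [4]
    if s == 0:
        return x - ceil(x * p), 0
    return ceil(x * p), 1
```

Exact fractions avoid floating-point rounding in the ceiling and floor operations, so D inverts C for every state.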
To keep x from growing to infinity, a range I = {l, ..., bl − 1} is decided upon, where l is some number and b is the base of the numeral system; in the binary case, b = 2. When x is greater than the range, least significant digits are written to the stream; when x reaches the range, the coding function can be used. There are specific requirements: for example, the range must be b-unique, meaning that inserting or removing digits will eventually reach the interval in a unique way [4, Section 3.1].

5.2 Tabled ANS

Compared to rANS, tabled ANS seems to be a somewhat less widespread solution for Asymmetric Numeral Systems. On the other hand, it has been implemented by Facebook in their Zstandard algorithm [6]. In the tabled version of ANS, encoding and decoding tables are constructed during initialization. The process is displayed in Algorithm 3. A key part is finding optimal symbol distributions. At first, every symbol is entered into a priority queue with some value. Then, for every possible value of x in the range from l to bl − 1, the actual value is recalculated and entered into the final encoding or decoding table. The encoding/decoding process itself is done mostly by matching the given values to entries in the encoding/decoding table.

Algorithm 3 "Precise initialization" [4, Section 4.1]
  for s = 0 to n − 1 do
    put (0.5/ps, s) to the priority queue
    xs = ls
  end for
  for x = l to bl − 1 do
    (v, s) = pop the smallest element from the queue
    put (v + 1/ps, s) back to the queue
    D[x] = (s, xs) or C[s, xs] = x
    xs = xs + 1
  end for

5.3 Cryptographic application

The possibility of using ANS for encryption was first mentioned in Duda's initial ANS paper [3] and later expanded upon in [5]. A way to use it as a pseudorandom number generator was also described: the initial state would be given as a random seed, and symbols can then be fed to the system, changing the state. It is mentioned that after a period the system would end up in the same state, but this period would be long and could easily be increased [3, Section 8.1].

In [4], the chaotic behaviour of ANS is described. Compared to arithmetic coding, where the states stay close to each other as succeeding symbols are added, in ANS the new state differs from the previous one in a much more chaotic manner.
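Returning to Section 5.2, the "precise initialization" can be sketched with a priority queue. This reading of Algorithm 3 was written for this review; the function name and the choice b = 2 are assumptions.

```python
import heapq
from fractions import Fraction

def precise_init(ls, b=2):
    """Build a tANS decoding table D[x] = (s, xs) for x in {l, ..., b*l - 1}.

    ls maps each symbol to its number of appearances; l = sum(ls.values()),
    so ps = ls[s] / l. Table slots are assigned by repeatedly popping the
    symbol with the smallest value (0.5 + k) / ps from a priority queue."""
    l = sum(ls.values())
    ps = {s: Fraction(c, l) for s, c in ls.items()}
    heap = [(Fraction(1, 2) / ps[s], s) for s in sorted(ls)]
    heapq.heapify(heap)
    xs = dict(ls)                      # xs starts at ls for every symbol
    D = {}
    for x in range(l, b * l):
        v, s = heapq.heappop(heap)     # symbol with the smallest value
        heapq.heappush(heap, (v + 1 / ps[s], s))
        D[x] = (s, xs[s])
        xs[s] += 1
    return D
```

Each symbol s ends up with exactly (b − 1)·ls table entries, so xs runs from ls up to b·ls − 1, matching the per-symbol state ranges described in [4].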
The three sources of chaos described in the paper are:

asymmetry - due to the different probabilities of the symbols, the state shifts can differ a lot.
ergodicity - defined by the uniform coverage of the area; in the paper it is mentioned that logb(1/ps) is irrational, and even a single symbol should lead to uniform coverage of the area.
diffusion - to avoid direct links between original symbols and symbols in the ciphertext, a small change in one of the two should result in a large (approximately half of the bits) change in the other. In the case of ANS, changing the value of x changes the decipherable symbol s; from the other side, changing the inputs of the encoding function gives a different result x.

In a paper by Duda and Niemiec from 2016 [5], some tests on the security of ANS were carried out. The results suggest that tabled ANS with a large enough key can protect confidentiality at a high level of security. Additional enhancements were discussed, with three main principles suggested:

• using a relatively large number of states and a large alphabet,
• encrypting the final state, which is required for decoding,
• using a completely random initial state.

The sources of chaos were also re-discussed and tested. Further enhancements and future topics were discussed: a more advanced cryptanalysis could be carried out, and the optimum between encryption and compression is also yet to be found [5].

6 Ranged variant of ANS

In the range variant, symbol appearances are placed in ranges. This allows for larger alphabets, close to Range coding (Arithmetic coding). rANS, however, requires one multiplication instead of two.

First, a base-m numeral system is chosen, where m = Σs ls. Thereon, the symbols are mapped to {0, 1, ..., m − 1}. The function s(x) gives the symbol for a value x ∈ {0, 1, ..., m − 1}:

s(x) = min{s : x < l0 + l1 + · · · + ls},

and s̄(x) = s(mod(x, m)). The beginning of symbol s is calculated as bs = l0 + l1 + · · · + l(s−1). It is suggested to keep the ls and bs values in tables to increase the overall speed of the process. For encoding and decoding with rANS, the following functions are given.
C(s, x) = m⌊x/ls⌋ + bs + mod(x, ls)
D(x) = (s, ls⌊x/m⌋ + mod(x, m) − bs), where s = s(mod(x, m))

The algorithm is easily implementable. Implementations in Python and Java can be found on GitHub [12, rANS.py, RANSimpl.java]. Another known implementation can also be found on GitHub [7].
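A minimal sketch of these two functions, written for this review (not one of the cited implementations), taking the tables of range lengths ls and beginnings bs as arguments:

```python
def rans_encode(text, ls, bs, x=0):
    """Fold symbols into the state x with C(s, x) = m*floor(x/ls) + bs + (x mod ls)."""
    m = sum(ls.values())
    for ch in text:
        x = m * (x // ls[ch]) + bs[ch] + x % ls[ch]
    return x

def rans_decode(x, n, ls, bs):
    """Extract n symbols from state x; they come out in reverse encoding order."""
    m = sum(ls.values())
    out = []
    for _ in range(n):
        r = x % m                      # position inside {0, ..., m-1}
        s = next(c for c in ls if bs[c] <= r < bs[c] + ls[c])
        x = ls[s] * (x // m) + r - bs[s]
        out.append(s)
    return ''.join(reversed(out)), x
```

With the table of Table 1 below, encoding the string 'eai' reproduces the worked example of Section 6.1.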
Letter   a  e  i  o  u  !
number   0  1  2  3  4  5
ls       2  3  1  2  1  1
bs       0  2  5  6  8  9

Table 1: Example encoding and decoding table for rANS with a 6-symbol alphabet.

6.1 Example

The following demonstrates encoding and decoding of the 3-letter string 'eai' in a base m = 10 numeral system. The beginnings and lengths of the symbol ranges are given in Table 1.

Encoding 'eai' means encoding the symbols one by one into a value. Let x = 0. The first character 'e' corresponds to symbol 1, with range beginning at 2 and length 3:

C(1, 0) = m⌊x/ls⌋ + bs + mod(x, ls) = 10⌊0/3⌋ + 2 + mod(0, 3) = 0 + 2 + 0 = 2

The second character 'a' corresponds to symbol 0, with range beginning at 0 and length 2. Note that this time x is 2, the output of the previous coding step:

C(0, 2) = 10⌊2/2⌋ + 0 + mod(2, 2) = 10 + 0 + 0 = 10

The third character 'i' corresponds to symbol 2, with range beginning at 5 and length 1:

C(2, 10) = 10⌊10/1⌋ + 5 + mod(10, 1) = 100 + 5 + 0 = 105

Therefore, the string 'eai' has been encoded into 105, which fits into 7 bits and thus into a single byte.

Decoding the 3 letters will give them in the reverse order. Getting the first symbol from 105:

s̄ = s(mod(x, m)) = s(5) = min{s : 5 < l0 + · · · + ls} = 2.

The character corresponding to symbol 2 is 'i', which was the last element in the string. Next, we calculate the new x from D(x) with x = 105, s = 2:

ls⌊x/m⌋ + mod(x, m) − bs = 1·⌊105/10⌋ + mod(105, 10) − 5 = 10 + 5 − 5 = 10

Decoding the second symbol: s̄ = s(0) = 0. Symbol 0 is 'a'. With x = 10, s = 0:

2·⌊10/10⌋ + mod(10, 10) − 0 = 2 + 0 − 0 = 2
Decoding the third symbol: s̄ = s(2) = 1. Symbol 1 is 'e'. With x = 2, s = 1:

3·⌊2/10⌋ + mod(2, 10) − 2 = 0 + 2 − 2 = 0

The symbols in reversed order give the original string 'eai'.

7 Assessment of compression with rANS

First, a Python script was implemented using the coding function provided in the 2013 paper by Duda [4]. For each encoding, the optimal system was constructed by counting the appearances of symbols in the text. The script was then tested on blocks of 100, 150 and 175 characters from "A Child's History of England" by C. Dickens [2]. For comparison, an implementation of Huffman coding was added. The Huffman coder's tree-building algorithm was taken from online sources [10], and the script was further modified to fit the test. For each text, the Huffman coder was run with a tree built from the character probability distribution of the text itself, which gives the best results.

In this implementation, with increasing block size the decoding test failed to recover the original text; when limiting the block size, the decoding would work in some cases. Due to personal familiarity, the next implementation was written in Java, using the BigInteger class for arbitrary-precision calculations. The hope was to increase the possible block sizes when encoding and also to compare the decoding. The results showed that when encoding to a simple integer, the decoding would fail at similar block sizes; when using a BigInteger, the block size could be increased to at least a thousand bytes and the decoding worked perfectly.

During the comparison, the sizes of the compressed texts were compared and the difference was calculated; based on that, the average difference was also calculated. For each test, the following are reported: the text block size, the number of tests carried out, the number of times the compressed text sizes were equal, and the average difference in bytes between the texts compressed with the different methods (size of the Huffman-encoded text minus size of the rANS-encoded text).
We also report the average size of the compressed text for each method divided by the original text size, which we will call the average compression ratio. With block size 16 and bigger, there were no cases where Huffman coding gave a better result than ranged ANS. With a block size of 8, the result with Huffman coding was better in 0.026% of the cases. The results are described in Table 2. The columns in the table indicate the block sizes of the text being encoded and are the same as the input of the method runTests. The compression ratios are also compared in Figure 2, where the x axis is log2 of the block size and the y axis displays the compression ratio percentage. The calculations were made with the Java code. Both implementations can also be accessed on GitHub [12].
Block  Number    Times   Average difference  rANS average        HC average
size   of tests  equal   (in bytes)          compression ratio   compression ratio
8      1273870   939996  0.26                24.0%               27.0%
16     636997    16      2.58                23.0%               39.0%
32     318510    0       6.95                24.0%               46.0%
64     159257    0       16.02               25.0%               50.0%
128    79629     0       34.08               26.0%               53.0%
256    39815     0       70.32               27.0%               54.0%
512    19908     0       143.03              27.0%               55.0%
1024   9954      0       289.05              27.0%               56.0%

Table 2: Results of comparing rANS to Huffman coding (HC) compression of "A Child's History of England".

Figure 2: Comparison of the average compression ratios of rANS and Huffman coding.
From these tests, we can observe that rANS produced better compression results than Huffman coding, and that the difference in efficiency grew with the size of the input text.

8 Summary

Asymmetric Numeral Systems are a new family of entropy coders, with speeds similar to Huffman coding and compression rates similar to Arithmetic coding. To compare the compression ratios of rANS and Huffman coding, an implementation of rANS was written and tests were carried out. The tests looked at the best-case compression scenarios, since the implementation was not optimized for speed. The results are provided in Table 2. Some remaining questions are: what is a good block length to use for compressing textual data, and how much do the compression ratios of ranged ANS and Arithmetic coding differ?

Acknowledgement

I would like to thank Mr. Benson Muite for his guidance and advice throughout the project. The author of this report has received the IT Academy Specialization stipend for the autumn semester of the academic year 2017/2018.

References

[1] https://commons.wikimedia.org/wiki/File:Arithmetic_encoding.svg. Accessed: 2018-01-15.
[2] C. Dickens. A Child's History of England. https://www.gutenberg.org/ebooks/699.txt.utf-8. Project Gutenberg, 1996.
[3] J. Duda. "Asymmetric numeral systems". In: ArXiv e-prints (Feb. 2009). arXiv: 0902.0271 [cs.IT].
[4] J. Duda. "Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding". In: ArXiv e-prints (Nov. 2013). arXiv: 1311.2540 [cs.IT].
[5] J. Duda and M. Niemiec. "Lightweight compression with encryption based on Asymmetric Numeral Systems". In: ArXiv e-prints (Dec. 2016). arXiv: 1612.04662 [cs.IT].
[6] Facebook, Inc. https://github.com/facebook/zstd/releases/. Accessed: 2017-12-09.
[7] F. Giesen. https://github.com/rygorous/ryg_rans. Accessed: 2017-11-28.
[8] D. Greenfield and A. Rrustemi. System and method for compressing data using asymmetric numeral systems with probability distributions. US Patent App. 15/041,228. Aug. 2016. url: https://www.google.com/patents/US20160248440.
[9] D. A. Huffman. "A Method for the Construction of Minimum-Redundancy Codes". In: Proceedings of the IRE 40 (Sept. 1952), pp. 1098-1101. doi: 10.1109/JRPROC.1952.273898.
[10] M. Reid. https://gist.github.com/mreid/fdf6353ec39d050e972b. Accessed: 2017-12-21.
[11] C. E. Shannon. "A Mathematical Theory of Communication". In: Bell System Technical Journal 27.3 (1948), pp. 379-423. doi: 10.1002/j.1538-7305.1948.tb01338.x.
[12] M. Simisker. https://github.com/Martsim/crypto_seminar_2017_fall. Accessed: 2018-01-19.
[13] I. H. Witten, R. M. Neal, and J. G. Cleary. "Arithmetic Coding for Data Compression". In: Communications of the ACM 30.6 (June 1987).