A review of “Asymmetric Numeral Systems”
                                 Mart Simisker
                                28 January 2018

                                   Abstract
        In 2009, J. Duda proposed Asymmetric Numeral Systems (ANS) for
     lossless compression encoding, which are supposed to perform at speeds
     similar to Huffman coding with compression close to Arithmetic coding.
     This coding can combine compression and encryption into one step and is
     suited for systems with low computational power.

1    Introduction
With the growing amounts of data, the need for better compression increases.
For lossless compression, the two best-known algorithms are Huffman coding,
proposed by David A. Huffman in 1952 [9], and Arithmetic coding [13]. While
Huffman coding is fast, it does not typically compress as well as Arithmetic
coding. Arithmetic coding, however, requires more computational power, which
is not always present in low-power embedded systems. In 1948, Shannon [11]
introduced the entropy of data, which is the theoretical information content of
the data and the lower bound for lossless compression. Entropy coders try to
compress the data as close to the entropy value as possible.
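For a source with symbol probabilities ps, the entropy is H = −Σs ps log2(ps)
bits per symbol; for example, a source with probabilities (1/2, 1/4, 1/4) has
H = 1.5 bits per symbol, so no lossless coder can use fewer than 1.5 bits per
symbol on average.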
    As proof of interest in Asymmetric Numeral Systems, a specific methodology
based on ranged ANS (further described in Section 6) has been patented [8].
Another example is Facebook’s compressor, the Zstandard algorithm, which
uses tabled ANS (further described in Section 5.2) [6].
    Most of the facts about ANS are based on the papers from 2009 [3], 2013 [4]
and 2016 [5]. First, an overview of Huffman coding and Arithmetic coding is
given. Section 4 introduces ANS. Stream coding, tabled ANS and possible
cryptographic applications are also discussed. Section 6 introduces ranged
ANS. In Section 7, compression with ranged ANS is compared to Huffman
coding and the results are presented.

2    Notation
We use the following notation:

      A - the alphabet
      ps - the probability of symbol s in the alphabet
      C - the encoding function
      D - the decoding function
      s - a symbol
      x - a natural number, into which the symbols are encoded

    For range-based ANS, the following additional notation is used:

      bs - the beginning of the range of symbol s
      ls - the number of occurrences of symbol s

3     General overview
3.1    Overview of Huffman Coding
In Huffman coding, input characters are assigned variable-length prefix codes
based on their frequencies. The idea is to reduce the length of the overall
code by assigning shorter codes to more frequent characters. Huffman coding is
performed by first constructing a Huffman tree. The encoding and decoding are
then performed by traversing the tree from the root node to the leaf containing
the character. The cost during encoding and decoding is the number of steps
taken when traversing the tree and the cost of the logical operations [9].
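To make the procedure concrete, the following is a minimal Python sketch
(an illustration written for this review, not taken from the cited sources)
that builds the tree with a priority queue and reads the prefix codes off by
traversal:

  import heapq

  def huffman_codes(freqs):
      # Build prefix codes from a dict mapping symbol -> frequency.
      # A tree node is either a symbol (leaf) or a (left, right) pair.
      heap = [(f, i, s) for i, (s, f) in enumerate(freqs.items())]
      heapq.heapify(heap)
      tie = len(heap)  # tie-breaker so tuples never compare tree nodes
      while len(heap) > 1:
          # repeatedly merge the two least frequent subtrees
          f1, _, t1 = heapq.heappop(heap)
          f2, _, t2 = heapq.heappop(heap)
          heapq.heappush(heap, (f1 + f2, tie, (t1, t2)))
          tie += 1
      codes = {}
      def walk(node, prefix):
          if isinstance(node, tuple):      # internal node: recurse
              walk(node[0], prefix + '0')
              walk(node[1], prefix + '1')
          else:                            # leaf: record the code
              codes[node] = prefix or '0'
      walk(heap[0][2], '')
      return codes

  print(huffman_codes({'a': 5, 'b': 2, 'c': 1, 'd': 1}))
  # {'b': '00', 'c': '010', 'd': '011', 'a': '1'}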

3.2    Arithmetic coding (Range coding)
In Arithmetic coding, the alphabet is mapped to the range [0, 1) according to
the probabilities of the symbols. Encoding a symbol narrows the current range.
For decoding, only one value from the final range is required [13]. The process
of encoding is displayed in Figure 1. Both the encoding and the decoding contain
two multiplication operations, which can be seen from Algorithm 1 and
Algorithm 2.

Algorithm 1 Pseudocode for the Encoding Procedure of Arithmetic Coding [13,
Figure 2]
Require: symbol, cum_freq
  range = high − low
  high = low + range ∗ cum_freq[symbol − 1]
  low = low + range ∗ cum_freq[symbol]

Figure 1: Representation of the Arithmetic Coding Process with the interval
scaled up at each stage [1].

Algorithm 2 Pseudocode for the Decoding Procedure of Arithmetic Coding [13,
Figure 2]
Require: cum_freq, value
  find symbol such that cum_freq[symbol] ≤ (value − low)/(high − low) <
  cum_freq[symbol − 1]
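To illustrate the interval arithmetic, here is a small floating-point Python
sketch (a simplification written for this review: it uses per-symbol probability
slices of [0, 1) instead of the cum_freq array of [13], and real coders use
integer renormalization to avoid the precision loss this toy version suffers on
long inputs):

  def ac_encode(text, ranges):
      # ranges maps each symbol to its cumulative slice (lo, hi) of [0, 1)
      low, high = 0.0, 1.0
      for s in text:
          width = high - low
          high = low + width * ranges[s][1]  # narrow [low, high)
          low = low + width * ranges[s][0]   # to the slice of s
      return (low + high) / 2  # any value in [low, high) identifies text

  def ac_decode(value, ranges, n):
      low, high = 0.0, 1.0
      out = []
      for _ in range(n):
          target = (value - low) / (high - low)
          # find the symbol whose slice contains the scaled value
          s = next(t for t, (lo, hi) in ranges.items() if lo <= target < hi)
          out.append(s)
          width = high - low
          high = low + width * ranges[s][1]
          low = low + width * ranges[s][0]
      return ''.join(out)

  r = {'a': (0.0, 0.5), 'b': (0.5, 0.8), 'c': (0.8, 1.0)}
  print(ac_decode(ac_encode('abca', r), r, 4))  # prints 'abca'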
4     Asymmetric Numeral Systems
The encoding function takes a state and a symbol and encodes them into a
new natural number. The decoding function takes a state and decodes a symbol,
while producing a new state, from which further symbols can be extracted:

                             coding        C(x, s) → x′

                           decoding        D(x′) → (x, s).
     In the paper [4], the amount of information a symbol should contain is
discussed. If x is seen as the result of choosing one element from the range
{0, 1, ..., x − 1}, then it contains log2(x) bits of information. A symbol s is
supposed to contain log2(1/ps) bits of information. Then x′, which is supposed
to contain both x and s, should contain log2(x) + log2(1/ps) = log2(x/ps) bits
of information. Therefore x′ being approximately x/ps allows choosing from a
larger interval {0, 1, ..., x′ − 1}.
     The range {0, 1, ..., x′ − 1} consists of subsets, each corresponding to a
symbol s. There is a function s̄(x) = s used to map the natural numbers to the
alphabet. Let xs denote the value in the original range {0, 1, ..., x − 1} for
which the corresponding symbol given by the function s̄ is s.
     In the paper it is mentioned that log2(x/xs) is the number of bits currently
used to encode symbol s. To reduce inaccuracy, the approximation xs ≈ xps
should be as close as possible.
     To understand the concept more clearly, we will now look at an example.

4.1    Example encoding function for uniform Asymmetric
       Binary System
In case of a nearly uniform distribution with alphabet A = {0, 1} and
probabilities p0, p1, an example of coding functions is given in the paper [4,
Section 2.2]. As there are only two symbols, the following equations use p = p1.
     The coding function is:

                     C(x, s) = ⌈(x + 1)/(1 − p)⌉ − 1   if s = 0
                               ⌊x/p⌋                   if s = 1

The decoding function is:

                     D(x) = (x − ⌈xp⌉, s)   if s = 0
                            (⌈xp⌉, s)       if s = 1

where the decoded symbol is determined as s = ⌈(x + 1)p⌉ − ⌈xp⌉ [4, Section 2.2].
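A minimal Python sketch of these two functions (a direct transcription of the
formulas above; note that the floating-point ceil/floor calls can misbehave for
very large states, which real implementations avoid with fixed-point arithmetic):

  from math import ceil, floor

  def uabs_encode(x, s, p):
      # Encode bit s into state x, with p = Pr(s = 1).
      if s == 0:
          return ceil((x + 1) / (1 - p)) - 1
      return floor(x / p)

  def uabs_decode(x, p):
      # Decode one bit from state x; returns (previous state, bit).
      s = ceil((x + 1) * p) - ceil(x * p)  # which subset x belongs to
      if s == 0:
          return x - ceil(x * p), 0
      return ceil(x * p), 1

  # Round-trip check: encode the bits 1, 0, 1 and decode them back.
  p, x = 0.3, 1
  for bit in (1, 0, 1):
      x = uabs_encode(x, bit, p)
  bits = []
  for _ in range(3):
      x, b = uabs_decode(x, p)
      bits.append(b)
  print(list(reversed(bits)))  # prints [1, 0, 1]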

5     Other Applications of ANS
5.1    Quick overview of stream encoding
When coding with ANS, the state grows exponentially. To keep x from growing
to infinity, a range I = {l, ..., bl − 1} is decided upon, where l is some number
and b is the base of the numeral system (in the binary case, b = 2). When x
grows beyond the range, its least significant digits are written to the stream
until it falls back into the range; once x is within the range, the coding
function can be used. There are specific requirements; for example, the range
must be b-unique, meaning that inserting or removing digits will always reach
the interval in a unique way [4, Section 3.1].
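A sketch of the renormalization idea in Python (assuming base b and a range
I = {l, ..., bl − 1}; the per-symbol threshold x_max below is a stand-in that
depends on the concrete coding function, e.g. for rANS with m dividing l one
common choice is x_max = b · ls · (l/m)):

  def renorm_before_encode(x, x_max, b, stream):
      # Shift least significant base-b digits out of the state until
      # encoding the next symbol would keep the state inside I.
      while x >= x_max:
          stream.append(x % b)
          x //= b
      return x

  def renorm_after_decode(x, l, b, stream):
      # Pull digits back in until the state is inside I = {l, ..., b*l - 1}.
      while x < l:
          x = x * b + stream.pop()
      return x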

5.2    Tabled ANS
Compared to rANS, tabled ANS seems to be a somewhat less widespread solution
for Asymmetric Numeral Systems. On the other hand, it has been implemented by
Facebook in their Zstandard algorithm [6]. In the tabled version of ANS, the
encoding and decoding tables are constructed during initialization. The process
is displayed in Algorithm 3. A key part is finding an optimal symbol
distribution. First, every symbol is entered into a priority queue with some
initial weight. Then, for every possible value of x in the range from l to
bl − 1, the symbol with the smallest weight is taken, its weight is increased,
and an entry is made in the final encoding or decoding table. The
encoding/decoding process then consists mostly of matching the given values to
entries in the encoding/decoding table.

Algorithm 3 “Precise initialization” [4, Section 4.1]
  for s = 0 to n − 1 do
      put((0.5/ps , s));
      xs = ls ;
  end for
  for x = l to bl − 1 do
      (v, s) = getmin;
      put((v + 1/ps , s));
      D[x] = (s, xs ) or C[s, xs ] = x;
      xs = xs + 1;
  end for
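A possible Python rendering of this initialization, using a heap as the priority
queue (an illustration written for this review; the probabilities are assumed
to make ls = l · ps an integer, and b = 2):

  import heapq

  def precise_init(probs, l):
      # Build the tANS decoding table D[x] = (s, x_s) for states
      # x in {l, ..., 2*l - 1}, following Algorithm 3.
      heap = [(0.5 / p, s) for s, p in probs.items()]
      heapq.heapify(heap)
      xs = {s: round(l * p) for s, p in probs.items()}  # x_s starts at l_s
      D = {}
      for x in range(l, 2 * l):
          v, s = heapq.heappop(heap)           # symbol with smallest weight
          heapq.heappush(heap, (v + 1 / probs[s], s))
          D[x] = (s, xs[s])
          xs[s] += 1
      return D

  # Example: a 4-state table for a binary source with Pr('1') = 0.25.
  print(precise_init({'0': 0.75, '1': 0.25}, 4))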

5.3    Cryptographic application
The possibility of using ANS for encryption was first mentioned in Duda’s initial
ANS paper [3] and later expanded upon in [5]. A way to use it as a Pseudo
Random Number Generator was also described. For this, the initial state would
be given as a random seed. Then symbols can be fed to the system, changing
the state. It is mentioned that after some period the system would end up in
the same state, but this period would be long and could easily be increased [3,
Section 8.1].
    In [4], the chaotic behavior of ANS is described. Compared to arithmetic
coding, where the codes of succeeding symbols stay close to each other, in ANS
the new state differs from the previous one in a much more chaotic manner.
The three sources of chaos described in the paper are:

      asymmetry - due to the different probabilities of symbols, the shifts can
      differ a lot.

      ergodicity - defined by the uniform coverage of the area; in the paper it
      is mentioned that logb(1/ps) is irrational, and even a single symbol
      should lead to uniform coverage of the area.

      diffusion - to avoid direct links between original symbols and symbols in
      the cipher text, a small change in one of the two should result in a large
      (approximately half of the bits) change in the other. In case of ANS,
      changing the value of x changes the decipherable symbol s. From the other
      side, changing the inputs of the encoding function gives a different
      result x.
In a paper by Duda and Niemiec from 2016 [5], some tests on the security of ANS
were carried out. The results suggest that tabled ANS with a large enough key
can protect confidentiality at a high level of security. Additional enhancements
were discussed. Three main principles were suggested:

    • using a relatively large number of states and a large alphabet.

    • encrypting the final state, which is required for decoding.

    • using a completely random initial state.

The sources of chaos were also re-discussed and tested. Further enhancements
and future topics were discussed. A more advanced cryptanalysis could be
carried out, and the optimum between encryption and compression is also yet to
be found [5].

6     Ranged variant of ANS
In the range variant, symbol appearances are placed in ranges. This allows for
larger alphabets, close to Range coding (Arithmetic coding). rANS, however,
requires one multiplication instead of two.
    First, a base-m numeral system is chosen, where m = Σs ls is the sum of
all the symbol counts. Thereon, the symbols are mapped to {0, 1, ..., m − 1}.
The function s(x) gives the symbol for x ∈ {0, 1, ..., m − 1} and is defined as
s(x) = min{s : x < l0 + l1 + · · · + ls }; for a general state, s̄(x) = s(mod(x, m)).
The beginning of symbol s is calculated as bs = l0 + l1 + · · · + ls−1 .
    It is suggested to keep the ls and bs values in tables to increase the
overall speed of the process.
    For encoding and decoding with rANS, the following functions are given:

                      C(s, x) = m⌊x/ls ⌋ + bs + mod(x, ls )

         D(x) = (s, ls ⌊x/m⌋ + mod(x, m) − bs ),  where s = s(mod(x, m))
   The algorithm is easily implementable. Implementations in Python and
Java can be found on GitHub [12, rANS.py, RANSimpl.java]. Another known
implementation can also be found on GitHub [7].
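As a minimal illustration, the following Python sketch is a direct,
non-streaming transcription of the two formulas above (written for this review,
not taken from the cited repositories):

  def rans_encode(x, s, ls, bs, m):
      # C(s, x): ls and bs are dicts holding the per-symbol tables
      return m * (x // ls[s]) + bs[s] + x % ls[s]

  def rans_decode(x, ls, bs, m):
      # D(x): returns (s, previous state)
      r = x % m
      # s(mod(x, m)): the symbol whose range [bs, bs + ls) contains r
      s = next(t for t in ls if bs[t] <= r < bs[t] + ls[t])
      return s, ls[s] * (x // m) + r - bs[s]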

 Letter                 a   e   i   o   u   !
 Corresponding number   0   1   2   3   4   5
 ls                     2   3   1   2   1   1
 bs                     0   2   5   6   8   9

Table 1: Example encoding and decoding table for rANS with a 6-symbol
alphabet.

6.1    Example
The following demonstrates encoding and decoding of the 3-letter string ‘eai’
in a base m = 10 numeral system. The probabilities, beginnings and lengths of
the ranges are given in Table 1.
   Encoding ‘eai’ means encoding the symbols one by one into a value. Let
x = 0. The first character ‘e’ corresponds to symbol 1 and has a range beginning
at 2 with length 3.

C(1, 0) = m⌊x/ls ⌋ + bs + mod(x, ls ) = 10⌊0/3⌋ + 2 + mod(0, 3) = 0 + 2 + 0 = 2

The second character ‘a’ corresponds to symbol 0 and has a range beginning at 0
with length 2. Note that this time x is 2, the output of the previous coding
step.

             C(0, 2) = 10⌊2/2⌋ + 0 + mod(2, 2) = 10 + 0 + 0 = 10

   The third character ‘i’ corresponds to symbol 2 and has a range beginning
at 5 with length 1.

          C(2, 10) = 10⌊10/1⌋ + 5 + mod(10, 1) = 100 + 5 + 0 = 105

   Therefore, the string ‘eai’ has been encoded into 105, which fits into 7 bits
and would therefore require a single byte.
   Decoding the 3 letters will give them in the reverse order. Getting the first
symbol from 105:

            s = s(mod(x, m)) = s(5) = min{s : 5 < l0 + · · · + ls } = 2.

The character corresponding to symbol 2 is ‘i’, which was the last element in
the string. Next, we calculate the new x from D(x):

x = 105, s = 2 → ls ⌊x/m⌋ + mod(x, m) − bs = 1⌊105/10⌋ + mod(105, 10) − 5 = 10 + 5 − 5 = 10

   Decoding the second symbol:

                                   s = s(0) = 0

Symbol 0 is ‘a’.

          x = 10, s = 0 → 2⌊10/10⌋ + mod(10, 10) − 0 = 2 + 0 − 0 = 2

Decoding the third symbol:

                                   s = s(2) = 1

Symbol 1 is ‘e’.

            x = 2, s = 1 → 3⌊2/10⌋ + mod(2, 10) − 2 = 0 + 2 − 2 = 0

    The symbols in reversed order give the original string ‘eai’.
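Using the sketch from Section 6, this example can be reproduced directly
(assuming the rans_encode and rans_decode functions defined there):

  ls = {'a': 2, 'e': 3, 'i': 1, 'o': 2, 'u': 1, '!': 1}
  bs = {'a': 0, 'e': 2, 'i': 5, 'o': 6, 'u': 8, '!': 9}
  m = 10  # the sum of the ls values

  x = 0
  for ch in 'eai':
      x = rans_encode(x, ch, ls, bs, m)
  print(x)  # prints 105, as in the worked example

  out = []
  for _ in range(3):  # the message length must be known to the decoder
      s, x = rans_decode(x, ls, bs, m)
      out.append(s)
  print(''.join(reversed(out)))  # prints 'eai'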

7     Assessment of compression with rANS
At first, a Python script was implemented using the coding function provided
in the 2013 paper by Duda [4]. For each encoding, the optimal system was
constructed by counting the appearances of symbols in the text. The script
was then tested on blocks of 100, 150 and 175 characters from “A Child’s
History of England” by C. Dickens [2]. For comparison, an implementation of
Huffman coding was added. The Huffman coder’s tree-building algorithm was
taken from online sources [10] and further modified to fit the test. For each
text, the Huffman coder was run with a tree built from the character
probability distribution of that text, which gave the best results.
    In this implementation, with increasing block size the decoding test failed
to reproduce the original text. When limiting the block size, in some cases the
decoding would work. Owing to the author’s familiarity with the language, the
next implementation was written in Java, using the BigInteger class for
arbitrary-precision arithmetic. The hope was to increase the possible block
sizes when encoding and also to compare the decoding. The results showed that
when encoding to a simple integer, the decoding would fail at similar values.
When using a BigInteger, the block size could be increased to at least a
thousand bytes and the decoding worked perfectly.
    During the comparison, the sizes of the compressed texts were compared and
the difference was calculated. Based on that, the average difference was also
calculated.
    For each test, the table lists the text block size, the number of tests
carried out, the number of times the compressed text sizes were equal, and the
average difference in bytes between the texts compressed with the two methods
(size of the Huffman-encoded text minus size of the rANS-encoded text). It also
lists the average size of the compressed text divided by the original text
size, which we will call the average compression ratio. With block size 16 and
bigger, there were no cases where Huffman coding gave a better result than
ranged ANS. With a block size of 8, the result with Huffman coding was better
in 0.026% of the cases. The results are described in Table 2. The columns in
the table indicate the block sizes of the text being encoded and are the same
as the input of the method runTests. The compression ratios are also compared
in Figure 2, where the x axis is log2 of the block size and the y axis displays
the compression ratio percentage. The calculations were made with the Java
code. Both implementations can be accessed on GitHub [12].

Block   Number     Times    Average difference    rANS Average         HC Average
  size   of tests    equal      (in Bytes)        compression ratio   compression ratio
    8    1273870    939996         0.26                24.0%               27.0%
   16     636997      16           2.58                23.0%               39.0%
   32     318510       0           6.95                24.0%               46.0%
   64     159257       0           16.02               25.0%               50.0%
  128     79629        0           34.08               26.0%               53.0%
  256     39815        0           70.32               27.0%               54.0%
  512     19908        0          143.03               27.0%               55.0%
 1024      9954        0          289.05               27.0%               56.0%

Table 2: Results of comparing rANS to Huffman coding (HC) compression of
“A Child’s History of England”.

Figure 2: Comparison of the average compression ratios of rANS and Huffman
coding.

From these tests, we can observe that rANS produced better compression results
than Huffman coding. The difference in efficiency also grew with the size of
the input text.

8      Summary
Asymmetric Numeral Systems are a new family of entropy coders with speeds
similar to Huffman coding and compression rates similar to Arithmetic coding.
To compare the compression ratios of rANS and Huffman coding, an
implementation of rANS was written and some tests were carried out. The tests
looked at best-case compression scenarios, since the implementation was not
optimized for speed. The results are provided in Table 2. Some remaining
questions are: what is a good block length to use for compressing textual data,
and how much do the compression ratios of ranged ANS and Arithmetic coding
differ?

Acknowledgement
I would like to thank Mr. Benson Muite for his guidance and advice throughout
the project. The author of this report has received the IT Academy Specializa-
tion stipend for the autumn semester of academic year 2017/2018.

References
 [1]   https://commons.wikimedia.org/wiki/File:Arithmetic_encoding.svg.
       Accessed: 2018-01-15.
 [2]   C. Dickens. A Child’s History of England. Project Gutenberg, 1996.
       https://www.gutenberg.org/ebooks/699.txt.utf-8.
 [3]   J. Duda. “Asymmetric numeral systems”. In: ArXiv e-prints (Feb. 2009).
       arXiv: 0902.0271 [cs.IT].
 [4]   J. Duda. “Asymmetric numeral systems: entropy coding combining speed
       of Huffman coding with compression rate of arithmetic coding”. In: ArXiv
       e-prints (Nov. 2013). arXiv: 1311.2540 [cs.IT].
 [5]   J. Duda and M. Niemiec. “Lightweight compression with encryption based
       on Asymmetric Numeral Systems”. In: ArXiv e-prints (Dec. 2016). arXiv:
       1612.04662 [cs.IT].
 [6]   Facebook, Inc. https://github.com/facebook/zstd/releases/. Accessed:
       2017-12-09.
 [7]   F. Giesen. https://github.com/rygorous/ryg_rans. Accessed: 2017-11-28.
 [8]   D. Greenfield and A. Rrustemi. System and method for compressing data
       using asymmetric numeral systems with probability distributions. US
       Patent App. 15/041,228. Aug. 2016. URL:
       https://www.google.com/patents/US20160248440.
 [9]   D. A. Huffman. “A Method for the Construction of Minimum-Redundancy
       Codes”. In: Proceedings of the IRE 40 (Sept. 1952), pp. 1098–1101. DOI:
       10.1109/JRPROC.1952.273898.
[10]   M. Reid. https://gist.github.com/mreid/fdf6353ec39d050e972b. Accessed:
       2017-12-21.
[11]   C. E. Shannon. “A Mathematical Theory of Communication”. In: Bell
       System Technical Journal 27.3 (1948), pp. 379–423. ISSN: 1538-7305. DOI:
       10.1002/j.1538-7305.1948.tb01338.x.
[12]   M. Simisker. https://github.com/Martsim/crypto_seminar_2017_fall.
       Accessed: 2018-01-19.
[13]   I. H. Witten, R. M. Neal, and J. G. Cleary. “Arithmetic Coding for Data
       Compression”. In: Communications of the ACM 30.6 (June 1987).
