Cybernetics and Computer Engineering, 2020, 4(202)
Anisimov A.V., DSc (Phys & Math),
corresponding member of National Academy of Sciences of Ukraine, Dean of the Faculty of Computer Science and Cybernetics
Zavadskyi I.O., DSc (Phys & Math),
Associate Professor of the Mathematical Informatics Department
Chudakov T.S., student
Faculty of Computer Science and Cybernetics
of Taras Shevchenko National University of Kyiv,
2d, Glushkov av. 03022, Kyiv, Ukraine
APPLICATION OF MULTI-DELIMITER CODES TO NATURAL LANGUAGE TEXT ARCHIVING
Introduction. The efficiency of modern archivers is approaching to the theoretical limit. Even small compression ratio improvements for some specific data types, by less than 1%, is assumed to be essential when the reasonable time complexity is maintained. This research is actual since a new data encoding method is developed, which gives the possibility to achieve rather more significant improvement of the compression ratio when it comes to English or German texts archiving.
The purpose of the paper is to solve the problem of non-monotonicity of a multi-delimiter code dictionary and investigate the possibility of use the multi-delimiter encoding on the preprocessing stage of natural language texts archiving.
Results. The concept of the reverse multi-delimiter code is introduced. The monotonic encoding as well as the decoding mapping from the set of natural numbers to the set of reverse multi-delimiter code codewords is built. The efficiency of applying the reverse multi-delimiter codes to natural language text compression is investigated together with the method of dictionary optimization. The provided experiments show that the reverse multi-delimiter encoding of English and German texts on the preprocessing stage and applying the proposed dictionary optimization method allows us to improve the marginal compression efficiency of the most powerful archivers in the maximal compression mode by about 1–3%.
Conclusions. The reverse multi-delimiter codes can be considered as an efficient tool when it comes to compression of natural language texts. As a standalone solution, these codes are robust, provide the possibility to fast decode and search the data in a compressed file. As a tool for natural language text preprocessing for subsequent archiving, the reverse multi-delimiter codes together with the method of dictionary optimization allow us to improve the compression rate of the best up-to-date known archivers.
Keywords: compression, archiving, archiver, compression code, multi-delimiter code, reverse multi-delimiter code, dictionary optimization, natural language text.
1. D. Salomon. Variable-Length Codes for Data Compression. London. U.K.: Springer-Verlag, 2007.
2. Burrows M., Wheeler D. J. A Block-Sorting Lossless Data Compression Algorithm. Digital Systems Research Center, Palo Alto, CA, USA. 1994, Research Report 124.
3. B. Ya. Ryabko, Data compression using the book stack transformation. Information transmission problems. 1980, Vol. 16, no. 2, pp. 16-21. (in Russian).
4. Huffman D. A method for the construction of minimum redundancy codes. Proc. IRE. 1952, Vol. 40, pp. 1098-1101.
5. Lakshmanan K.B. On universal codeword sets. IEEE Transactions on Information Theory. 1981, Vol. 27, pp. 659-662.
6. Apostolico A., Fraenkel A. Robust transmission of unbounded strings using Fibonacci representations. IEEE Transactions on Information Theory. 1987, Vol. IT-33, no. 2, pp. 238-245.
7. Anisimov A.V., Zavadskyi I.O. Variable-Length Prefix Codes With Multiple Delimiters IEEE Transactions on Information Theory. 2017, vol. 63, no. 5, pp. 2885-2895.
8. Elias P. Universal codeword sets and representation of the integers. IEEE Transactions on Information Theory. 1975, Vol. 21, no. 2, pp. 194-203.
9. Golomb S.W. Run-length encodings. IEEE Transactions on Information Theory. 1966, Vol. IT-12, no. 3, pp. 399-401.
10. Klein S.T., Ben-Nissan M.K. On the usefulness of Fibonacci compression codes. Computer Journal. 2010, Vol. 53, no. 6, pp. 701-716.
11. Ziv J., Lempel A. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory. 1977, Vol. 23, no. 3, pp. 337-343.
12. Welch T. A Technique for High-Performance Data Compression. IEEE Computer. 1984, Vol. 17, no. 6, pp. 8-19.
13. Witten I.H., Neal R.M., Cleary J.G. Arithmetic Coding for Data Compression. Communications of the ACM. 1987, Vol. 30, no. 6, pp. 520-540.
14. Duda J., Tahboub K., Gadgil N. J., Delp E. J. The use of asymmetric numeral systems as an accurate replacement for Huffman coding. 2015 Picture Coding Symposium (Cairns, QLD, Australia, 31th of May-3rd of Jun, 2015). Cairns, Australia, pp. 65-69.
15. Anisimov A.V., Zavadskyi I.O. Splittable Data Compression Codes. 2019 IEEE International Conference on Advanced Trends in Information Theory (Kyiv, Ukraine, 18-20th of Dec, 2019). Kyiv, Ukraine, pp. 71-74.
16. Zavadskyi I.O. Splittable codes and their applications. Doctor of science thesis. Kyiv, 2020, 335 P.
17. Zavadskyi I.O., Anisimov A.V. A Family of Data Compression Codes with Multiple Delimiters. Prague Stringology Conference (Prague, Czech Republic, 29-31th of Aug, 2016). Prague, Czech Republic, 2016, pp. 71-84.