Issue 4 (190), article 1
DOI:https://doi.org/10.15407/kvt190.04.005
Kibern. vyčisl. teh., 2017, Issue 4 (190), pp.
Grytsenko V.I., Corresponding Member of the NAS of Ukraine,
Director of the International Research and Training
Center for Information Technologies and Systems
of the NASU and MESU
e-mail: vig@irtc.org.ua
Onyshchenko I.M., PhD (Economics),
Senior Researcher of the Department of Economic and Social
Systems and Information Technologies
e-mail: standardscoring@gmail.com
International Research and Training Center for Information
Technologies and Systems of the NASU and MESU,
40, Ave Glushkov, 03680, Kiev, Ukraine
DETERMINING THE INFORMATIVITY OF PARAMETERS IN A PROGNOSTIC MODEL FOR EVALUATING THE PROBABILITY OF PRODUCT SELECTION IN THE CONDITIONS OF “BIG DATA”
Introduction. The rapid growth of collected and stored data caused by the IT boom has given rise to the so-called “Big Data problem”. Most new data are unstructured, which is the main reason why traditional relational data warehouses handle “Big Data” inefficiently. Forecasting and modeling on “Big Data” can also be problematic because of the high volume and velocity of the data. Online learning algorithms can avoid some of these problems in high-load systems.
The purpose of the article is to develop an approach to feature selection and modeling for “Big Data” using an online learning algorithm.
Method. An online learning algorithm for the FTRL (Follow-The-Regularized-Leader) model was used, with L1 and L2 regularization to select only the important features.
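For orientation, the per-coordinate update commonly used for FTRL with L1 and L2 regularization (often called FTRL-Proximal) can be written as follows. This is the standard textbook form under logistic loss, not necessarily the exact formulation of the article; the symbols \alpha, \beta (learning-rate parameters), \lambda_1, \lambda_2 (regularization strengths), and the accumulators z and n are notation assumed here.

```latex
\begin{aligned}
g_{t,i} &= (p_t - y_t)\,x_{t,i}, \\
\sigma_{t,i} &= \frac{\sqrt{n_{t-1,i} + g_{t,i}^{2}} - \sqrt{n_{t-1,i}}}{\alpha}, \\
z_{t,i} &= z_{t-1,i} + g_{t,i} - \sigma_{t,i}\,w_{t,i}, \qquad
n_{t,i} = n_{t-1,i} + g_{t,i}^{2}, \\
w_{t+1,i} &=
\begin{cases}
0, & |z_{t,i}| \le \lambda_1, \\[4pt]
-\dfrac{z_{t,i} - \operatorname{sgn}(z_{t,i})\,\lambda_1}{\dfrac{\beta + \sqrt{n_{t,i}}}{\alpha} + \lambda_2}, & \text{otherwise.}
\end{cases}
\end{aligned}
```

Here p_t is the predicted probability and y_t the observed binary outcome at step t. The case |z_{t,i}| \le \lambda_1 is what produces exact zeros for uninformative features.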
Results. Modeling approaches based on batch and online learning algorithms are described using an online auction system as an example. The online learning algorithm has clear advantages under high load and high data velocity. The mathematical background for modifying the linear discriminator of the FTL (Follow-The-Leader) model by adding regularization is described. L1 and L2 regularization make it possible to select important features in real time. If a feature becomes useless, regularization sets the corresponding coefficient to 0. The feature is not removed from the training process, however, and its coefficient can become nonzero again if the feature regains importance for the model. The full process is implemented as a Python program and can be used in practice.
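The zeroing and restoring of a coefficient described above can be illustrated with a minimal sketch. It assumes the standard FTRL-Proximal closed-form weight; the function name `ftrl_weight`, the accumulators `z` and `n`, and all parameter values are hypothetical and are not taken from the authors' program.

```python
import math


def ftrl_weight(z, n, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
    """Closed-form FTRL-Proximal weight for one feature (assumed standard form).

    z: accumulated (adjusted) gradient for the feature
    n: accumulated squared gradient for the feature
    """
    # L1 threshold: a weak accumulated signal yields an exact zero.
    if abs(z) <= l1:
        return 0.0
    sign = 1.0 if z > 0 else -1.0
    # L2 and the adaptive learning rate shrink the remaining signal.
    return -(z - sign * l1) / ((beta + math.sqrt(n)) / alpha + l2)


# The accumulators are kept even while the weight is zero ...
print(ftrl_weight(z=0.4, n=2.0))    # 0.0
# ... so the weight becomes nonzero again once |z| grows past l1.
print(ftrl_weight(z=-3.0, n=9.0))   # small positive weight
```

Because only the weight is thresholded and the state `z`, `n` keeps being updated, a feature is never dropped from training; it simply contributes nothing until its accumulated evidence exceeds the L1 threshold.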
The results may be applied to modeling and forecasting in projects with a high volume or velocity of data, for example social networks, online auctions, online gaming, recommendation systems and others.
Conclusions. The FTRL model was modified to work as an online learning algorithm that predicts binary outcomes in high-load “Big Data” systems.
Because the number of predictors can be enormous, modeling requires considerable computing resources and time, which complicates the process. This feature selection problem was solved using L1 regularization: a selection procedure was added to the modified online learning FTRL model, and L1 regularization was used to score the importance of predictors in real time.
A program that implements the described mathematical algorithm was developed. The algorithm works efficiently with sparse matrices: it analyzes incoming data and updates weights only for the predictors that are present. Its L1 and L2 regularization can be used for feature selection and to avoid overfitting.
Keywords: information technologies in economics, economic and mathematical modeling, online learning algorithms, regularization, Big Data.
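The sparse, per-predictor update mentioned above can be sketched as follows. This is an illustrative implementation of standard FTRL-Proximal logistic regression, not the authors' program; the class name `FTRLProximal`, the feature names, the parameter values, and the synthetic stream are all assumptions made for the example.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Online logistic regression with FTRL-Proximal updates (illustrative sketch)."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # adjusted gradient sums, one per seen feature
        self.n = defaultdict(float)  # squared gradient sums, one per seen feature

    def weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:        # L1 threshold gives an exact zero
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        step = (self.beta + math.sqrt(self.n.get(i, 0.0))) / self.alpha
        return -(z - sign * self.l1) / (step + self.l2)

    def predict(self, x):
        """x maps feature name -> value and lists only the features present."""
        score = sum(self.weight(i) * v for i, v in x.items())
        score = max(min(score, 35.0), -35.0)  # guard against overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x, y):
        """One online step; touches only the features present in x."""
        w = {i: self.weight(i) for i in x}
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of the log loss for feature i
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g
        return p


# Synthetic stream (hypothetical): "signal" always accompanies a positive
# outcome, while "rare" is observed only once.
model = FTRLProximal()
for step in range(200):
    model.update({"bias": 1.0, "signal": 1.0}, 1)
    model.update({"bias": 1.0}, 0)
    if step == 100:
        model.update({"bias": 1.0, "rare": 1.0}, 1)

print("signal weight:", model.weight("signal"))
print("rare weight:", model.weight("rare"))
```

In this sketch each example is a dictionary of the predictors that are present, so the cost of one update depends on the number of active predictors rather than on the total number of predictors seen so far. The feature observed only once keeps a zero weight under the L1 threshold, yet its state remains stored and would continue to accumulate if the feature appeared again.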
Received 28.09.2017