Conceptualizing Mining of Firm’s Web Log Files

  • Ruangsak Trakunphutthirak, Yen Cheung and Vincent C. S. Lee
Published/Copyright: December 20, 2017

Abstract

In a data-driven society, useful data (Big Data) is often unintentionally ignored because convenient tools are lacking and software is expensive. For example, web log files can be used to identify explicit information about browsing patterns when users access web sites. Some hidden information, however, cannot be derived directly from the log files, and external resources may be needed to discover more knowledge from browsing patterns. The purpose of this study is to investigate the application of web usage mining based on web log files. Its outcome sets directions for further investigation into what implicit information is embedded in log files and how it can be extracted efficiently and effectively. Further work involves incorporating social media data to improve the quality of business decisions.
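As a rough illustration of the kind of explicit browsing information a web log file exposes, the following minimal Python sketch parses a few log entries and rebuilds each visitor's page sequence. It is not the authors' method: the sample entries are invented, the Apache Common Log Format is assumed, visitors are approximated by IP address, and the helper names `parse_log` and `browsing_paths` are hypothetical.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample entries in Apache Common Log Format (not data from the study).
SAMPLE_LOG = """\
10.0.0.1 - - [20/Dec/2017:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024
10.0.0.2 - - [20/Dec/2017:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 1024
10.0.0.1 - - [20/Dec/2017:10:00:30 +0000] "GET /products.html HTTP/1.1" 200 2048
10.0.0.1 - - [20/Dec/2017:10:01:10 +0000] "GET /logo.png HTTP/1.1" 200 512
10.0.0.2 - - [20/Dec/2017:10:02:00 +0000] "GET /contact.html HTTP/1.1" 404 0
"""

# One named group per field we keep: host, timestamp, requested path, status code.
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Assumption: requests for these file types are page assets, not page views.
ASSET_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")


def parse_log(text):
    """Turn raw log text into a list of records, skipping malformed lines."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        records.append({
            "host": match["host"],
            "time": datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z"),
            "path": match["path"],
            "status": int(match["status"]),
        })
    return records


def browsing_paths(records):
    """Group successful page requests by host, in chronological order."""
    paths = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["time"]):
        if rec["status"] == 200 and not rec["path"].endswith(ASSET_SUFFIXES):
            paths[rec["host"]].append(rec["path"])
    return dict(paths)


records = parse_log(SAMPLE_LOG)
paths = browsing_paths(records)
page_hits = Counter(page for seq in paths.values() for page in seq)

for host, seq in paths.items():
    print(host, " -> ".join(seq))
print(page_hits.most_common())
```

The sketch recovers only what the log states outright: who requested which page and when. Implicit information, such as why a visitor followed a given path, is not present in these fields, which is consistent with the abstract's point that external resources may be needed.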


Supported by a Royal Thai Government Scholarship and by resources support from the Faculty of IT, Monash University


Acknowledgements

The first author acknowledges financial support from a Royal Thai Government Scholarship to pursue the Ph.D. research program at the Faculty of IT, Clayton campus, Monash University. All authors gratefully acknowledge the IT resources support provided by the Faculty of Information Technology, Monash University, Australia.

Received: 2017-7-30
Accepted: 2017-9-25
Published Online: 2017-12-20

© 2017 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 20.11.2025 from https://www.degruyterbrill.com/document/doi/10.21078/JSSI-2017-489-22/html