
Utilization of Co-occurrence Pattern Mining with Optimal Fuzzy Classifier for Web Page Personalization

  • Pappu Srinivasa Rao and Devara Vasumathi
Published/Copyright: January 13, 2017

Abstract

Many users rely on metasearch engines, directly or indirectly, to access and gather data from more than one data source. The effectiveness of a metasearch engine is largely determined by the quality of the results it returns in response to user queries. The rank aggregation methods proposed so far exploit a very limited set of parameters, such as the total number of used resources and the rankings they achieved from each individual resource. In this paper, we use a fuzzy-bat classifier to merge results effectively in the score computation module. Initially, we give a query to several search engines, and the top n list from each search engine is chosen for further processing by our technique. We then merge the top n lists based on unique links and compute several parameters: title-based, snippet-based, content-based, address-based, link-based, uniform resource locator (URL)-based, and co-occurrence-based values. These values, together with the user-given ranking of links, are fed to the fuzzy-bat to train the system. The system then ranks and merges the links obtained from the different search engines for the given query.

1 Introduction

The rapid growth of the World Wide Web (Web) poses unprecedented scaling challenges for search engines. In the modern era of high-volume information generation, search engines have proven to be a pivotal technology for data mining and information retrieval. General-purpose search engines have achieved a great deal of success in providing relevant information to the user, and they are an effective tool for retrieving information from this huge information repository. For instance, Google, one of the most popular search engines, not only provides fitting search results for users worldwide by indexing more than two billion Web pages, but its search time rarely exceeds 0.5 s [17]. The ubiquity of the Internet and the Web has led to the emergence of several Web search engines with varying capabilities. These search engines index Web sites, images, Usenet newsgroups, content-based directories, and news sources with the goal of producing search results that are most relevant to user queries. However, only a small number of Web users actually know how to utilize the true power of Web search engines. To address this problem, search engines have started providing access to their services via various interfaces [1].

Web search engines are the most popular tools for finding useful information about a subject of interest. What makes search engines popular is the straightforward and natural way in which people interact with them [19]. Web data mining can be divided into three general categories: Web content mining, Web structure mining, and Web usage mining. Here, we focus on the last of these, which tries to exploit the navigational traces of users in order to extract knowledge about their preferences and behavior. Modeling and predicting a user's navigational behavior on a Web site or Web domain is useful in many Web applications, such as Web caching [5]. Every day, hundreds of millions of transactions flow through networks all over the world, and any information can be transferred from one place to another within a few seconds. Together with the growing need for information, the number of Web pages on the Internet has grown explosively during the past few years, and this increase is expected to become more acute going forward [3, 21]. Actual Web data consist of the Web pages, the Web page structure, the linkage structure between the Web pages, the surfing (navigational) behavior of the users, and user profiles including demographic and registration information about the users [5]. The Web is today the main "all kinds of information" repository and has so far been very successful in disseminating information to humans; it is very easy for humans to navigate through a Web site and retrieve useful information [11, 12]. A Web page has references to fragments, which are stored independently on the server and in caches. In a fragment-based publishing scheme, cacheability and lifetime are specified at fragment granularity rather than at the Web page level. While cacheability properties specify whether a fragment can be cached, its lifetime indicates how long the fragment remains fresh [13]. Metasearch engines can run a user query across multiple component search engines concurrently, retrieve the generated results, and aggregate them. The benefits of metasearch engines over individual search engines are notable [16].

The metasearch engine enhances the search coverage of the Web, providing higher recall; the overlap among the primary search engines is generally small [4]. Personalization has proven to increase user motivation and user satisfaction with the information experience. Within enterprises, there is also growing interest in personalized customer care, as personalization helps users stay longer on the Web site and encourages them to return to the service provider [15]. The metasearch engine also enhances retrieval effectiveness, providing higher precision because of the "chorus effect" [6]. Web metasearching, in contrast to rank aggregation, is an issue presenting its own unique challenges. The results that a metasearch system gathers from its component engines are not similar to votes or any other single-dimensional entities. Apart from the individual ranking assigned by a component engine, a Web result also incorporates a title, a small fragment of text that represents its significance to the submitted query (a textual snippet) [9, 18], and a uniform resource locator (URL). Consequently, the traditional rank aggregation techniques are insufficient for providing a robust ranking mechanism appropriate for metasearch engines because they ignore the semantics accompanying each Web result.

2 Related Works

This section shows a brief review of some of the related works. Akritidis et al. [2] presented the QuadRank technique, which considers additional information regarding the query terms, collected results, and data correlation. They implemented and tested the QuadRank in a real-world metasearch engine. They comprehensively tested QuadRank for both effectiveness and efficiency in the real-world search environment and also used the task from the TREC-2009 conference. They demonstrated that, in most cases, their technique outperformed all component engines.

Ishii et al. [8] proposed a technique to reduce the computation and communication loads for the Page Rank algorithm. They developed a method to systematically aggregate the Web page into groups by using the sparsity inherent in the Web. For each group, they computed an aggregated Page Rank value that can be distributed among the group members. They provided a distributed update scheme for the aggregated Page Rank along with an analysis of its convergence properties. They provided a numerical example to illustrate the level of reduction in computation while keeping the error in rankings small.

Lodhia [10] considered the potential of the Web as a medium for communicating social and environmental issues in the Australian minerals industry. The media richness framework was used to assess the communication potential of the Web. The communication of social and environmental issues on Web sites of three Australian mining companies was analyzed over a period of time. Interviews were also conducted with sustainability and communication managers from these companies. The findings suggested that companies are still learning about Web-based social and environmental communication. There was varying usage of different Web capabilities across the three companies. Managers were willing to utilize the organizational and mass communication capabilities of the Web more than its timeliness and presentation features. Limited consideration was given to the interactive potential of the Web. It was also found that certain social considerations could limit the use of the Web for social and environmental communication.

Event extraction, a specialized stream of information extraction rooted in the 1980s, has greatly gained in popularity due to the advent of big data and developments in the related fields of text mining and natural language processing. However, to date, an overview of this particular field remains elusive. Hogenboom [7] therefore summarized event extraction techniques for textual data, distinguishing between data-driven, knowledge-driven, and hybrid methods, and presented a qualitative evaluation of these. Moreover, the study discussed common decision support applications of event extraction from text corpora, elaborated on the evaluation of event extraction systems, and identified current research issues.

Yang and Hanjalic [20] developed a prototype-based re-ranking framework, which constructs meta re-rankers corresponding to visual prototypes representing the textual query and learns the weights of a linear re-ranking model to combine the results of individual meta re-rankers and produces the re-ranking score of a given image taken from the initial text-based search result. The induced re-ranking model was learned in a query-independent way requiring only a limited labeling effort and being able to scale up to a broad range of queries. The experimental results on the Web Queries dataset demonstrated that the proposed method outperforms all the existing supervised and unsupervised re-ranking methods.

With the development of the Internet, Web services generate a large amount of log information, and how to mine user-preferred browsing paths is an important research area. Current research mainly focuses on mining user-preferred browsing paths; however, it does not delve into the personalization of preferred paths and lacks semantic information. To provide personalized preferred paths that fulfill user needs, Zhou and Yang [22] proposed a novel method to compute the similarities between preferred paths and fields given by experts. First, the similarity of each page on a preferred path to the given field was computed. Second, based on these per-page similarities, the average similarity of all the pages on the preferred path to the given field was computed and used as the similarity of the preferred path to the given field. The experimental results showed that the method was accurate and scalable, and it can be applied to optimize Web sites or design personalized services.

Rizvi and Keole [14] presented a new framework for a semantic-enhanced Web page recommendation, and a suite of enabling techniques that include semantic network models of domain knowledge and Web usage knowledge, querying techniques, and Web page recommendation strategies. It enables the system to automatically discover and construct the domain and Web usage knowledge bases and to generate effective Web page recommendations.

3 Problem Definition

The common problems of the existing research are shown below.

  • The personalization task can, therefore, be viewed as a prediction problem: the system must attempt to predict the user’s level of interest in, or the utility of, specific content categories, pages, or items, and rank these according to their predicted values.

  • A problem with the naive approach algorithm is that it requires repeated search through the rule base.

  • One problem for association rule recommendation systems is that a system cannot give any recommendations when the dataset is sparse, and hence larger item sets often do not meet the minimum support constraint.

  • The problem of Web user clustering (or segmentation) is the use of Web access log files to partition a set of users into clusters such that the users within a cluster are more similar to each other than users from different clusters. The discovered clusters can then help in the on-the-fly transformation of the Web site content. In particular, Web pages can be automatically linked by artificial hyperlinks. The Web log file is a key to match an active user’s access pattern with one or more of the clusters.

These problems motivate the research on Web page personalization with the optimal fuzzy classifier, as shown in Figure 1.

4 Proposed Methodology

Web page recommendation or personalization plays a significant role in intelligent Web systems. Useful knowledge discovery from Web usage data and satisfactory knowledge representation for effective Web page recommendations are crucial and challenging. In this paper, we give a query to different search engines, and the top n list from each search engine is chosen for further processing by our technique. Initially, the relevant features of the input data are extracted. For feature extraction, the co-occurrence pattern mining technique is used, and the resulting patterns are given as input to the final phase. We then apply a classification technique for classifying the top n relevant data of the inputs from the search engines. Here, a hybrid technique is utilized for classification, and an optimization algorithm is used to find the optimal top n relevant data. The hybrid technique is an optimal fuzzy classifier combined with the bat algorithm. The fuzzy classifier is a promising non-linear, non-parametric classification technique that gives good results in the data mining field. The bat algorithm is a heuristic global optimization method that is easy to implement, with few parameters to adjust. By utilizing the hybrid technique, the user gets exactly the data needed, unnecessary data are reduced, and the relevant data are found within a short period of time. Therefore, the optimal relevant data for the top n search are obtained by utilizing the proposed technique. The performance of the proposed technique is compared with that of existing techniques. Our proposed method is implemented on the Java platform.

4.1 Metasearch Engine

Several users use metasearch engines directly or indirectly to access and gather data from more than one data source. The effectiveness of a metasearch engine is majorly determined by the quality of the results it returns in response to user queries. The rank aggregation methods that have been proposed until now exploit a very limited set of parameters, such as the total number of used resources and the rankings they achieved from each individual resource.

4.2 Knowledge Discovery

Knowledge discovery and data mining are matters of considerable importance and necessity. Given the recent growth of the field, it is not surprising that a wide variety of methods are now available to researchers and practitioners, and no one method is superior to the others for all cases. The handbook Data Mining and Knowledge Discovery from Data aims to organize all significant methods developed in the field into a coherent and unified catalog, presents performance evaluation approaches and techniques, and explains the use of the different methods with cases and software tools. In our technique, the query keywords are first given to various search engines, namely Google, Bing, and Yahoo. Then, the top n results from each of the three search engines are selected for rearranging the order of the results retrieved from the search engines. For all the documents, the keywords are extracted from the Web document and stop words are removed from the extracted content.

4.3 Feature Extraction

In the feature extraction phase, we extract the URL, content, stripped URL, rank, and title from each link found during knowledge discovery; in addition, these are combined into a score value given by fuzzy logic. In this phase, the co-occurrence technique is used to extract the relevant URLs, and from those URLs we mine the patterns. The mined URLs are given to the classification phase, where we use the hybrid fuzzy-bat technique. For the title-based, snippet-based, and content-based calculations, we have to preprocess the query given to the different search engines. Initially, we remove the stop words from the query, and we give the remaining words to WordNet to find their synonyms. The query words and their synonyms, together with the unique items from the top n lists of the different search engines, are used to perform the title-based, snippet-based, and content-based calculations.
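As a simple illustration of this preprocessing step, the sketch below removes stop words and expands the remaining query words with synonyms. The stop-word list and the synonym map are hypothetical stand-ins; the paper uses WordNet for the synonym lookup.

```java
import java.util.*;

// Sketch of the query preprocessing step: stop words are removed and each
// remaining query word is expanded with synonyms. A hypothetical in-memory
// map stands in for the WordNet lookup used in the paper.
public class QueryPreprocessor {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "for", "of", "on", "to"));

    // Hypothetical stand-in for a WordNet synonym lookup.
    private static final Map<String, List<String>> SYNONYMS = Map.of(
            "mining", List.of("extraction", "excavation"),
            "data", List.of("information", "records"));

    public static Map<String, List<String>> preprocess(String query) {
        Map<String, List<String>> expanded = new LinkedHashMap<>();
        for (String word : query.toLowerCase().split("\\s+")) {
            if (STOP_WORDS.contains(word)) continue; // drop stop words
            expanded.put(word, SYNONYMS.getOrDefault(word, List.of()));
        }
        return expanded;
    }

    public static void main(String[] args) {
        // Each surviving query word maps to its synonym list.
        System.out.println(preprocess("data mining techniques for the web"));
    }
}
```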

4.3.1 Title-Based Calculation

Each document (link) has a title, and the calculation here is based on that title. The title-based value of each unique link is computed as follows: after separating the query words and finding the meanings of each query word, we compare them with the titles of the unique links separately to find the frequency of the words.

$$T_s(p)=\sum_{i=1}^{a}\left[\left(\frac{T_sD_i}{\max(TD_i)}+\frac{1}{\max(TD_i)}\right)\times w_Q+\sum_{j=1}^{b}\left(\frac{T_sD_iM_j}{\max(TD_iM_j)}+\frac{1}{\max(TD_iM_j)}\right)\times w_M\right].$$

In the above equation, $T_s(p)$ is the title-based value of the $s$th unique link; $T_sD_i$ is the number of occurrences of the $i$th query word in the title $T$ of the $s$th link; $\max(TD_i)$ is the maximum number of occurrences of the $i$th query word over the titles of all unique links; $T_sD_iM_j$ is the number of occurrences of the $j$th meaning of the $i$th query word in the title $T$ of the $s$th link; $\max(TD_iM_j)$ is the maximum number of occurrences of the $j$th meaning of the $i$th query word over the titles of all unique links; $a$ is the total number of query words; $b$ is the total number of meanings of the $i$th query word; $w_Q$ is the weight value of a query word; and $w_M$ is the weight value of a meaning of a query word.
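A minimal sketch of how one term's contribution to $T_s(p)$ could be computed, assuming the per-title frequency counts have already been collected; the weight values $w_Q$ and $w_M$ are illustrative, as the paper does not specify them.

```java
import java.util.*;

// Per-term title score: the per-link frequency is normalized by the maximum
// frequency over all unique links, with the additive 1/max(...) smoothing
// term from the equation above, then scaled by the term's weight.
public class TitleScore {
    // occurrences[s] = frequency of the term in the title of link s
    static double termScore(int[] occurrences, int s, double weight) {
        int max = Arrays.stream(occurrences).max().orElse(0);
        if (max == 0) return 0.0; // term absent from every title
        return ((double) occurrences[s] / max + 1.0 / max) * weight;
    }

    public static void main(String[] args) {
        double wQ = 0.8, wM = 0.2;          // assumed weights
        int[] queryWordFreq = {2, 0, 1};    // term counts per link title
        int[] synonymFreq   = {1, 1, 0};
        int s = 0;                          // score the first unique link
        double ts = termScore(queryWordFreq, s, wQ) + termScore(synonymFreq, s, wM);
        System.out.printf("T_%d(p) = %.3f%n", s, ts);
    }
}
```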

4.3.2 Snippet-Based Calculation

A snippet is a small piece of information about the main information in the link; it is visible under each link obtained from the search engine as a short note. The calculation based on the snippet of each unique link is as follows: we check the query words and their meanings against the snippet of each link to count the number of occurrences in the snippet. Table 2 shows the number of occurrences of the query words and their meanings in the snippet of each link.

$$S_s(p)=\sum_{i=1}^{a}\left(\frac{S_sD_i}{\max(SD_i)}\times w_Q+\sum_{j=1}^{b}\frac{S_sD_iM_j}{\max(SD_iM_j)}\times w_M\right).$$

In the above equation, $S_s(p)$ is the calculated snippet-based value of the $s$th unique link; $S_sD_i$ is the number of occurrences of the $i$th query word $D$ in the snippet $S$ of the $s$th unique link; $\max(SD_i)$ is the maximum number of occurrences of the $i$th query word $D$ over the snippets of all the unique links we obtained; $S_sD_iM_j$ is the number of occurrences of the $j$th meaning of the $i$th query word $D$ in the snippet $S$ of the $s$th unique link; $\max(SD_iM_j)$ is the maximum number of occurrences of the $j$th meaning of the $i$th query word $D$ over the snippets of all the unique links we obtained; $w_Q$ is the weight value of a query word; and $w_M$ is the weight value of a meaning of a query word.

4.3.3 Content-Based Calculation

In the content-based calculation, we compare the contents of each link with the separated query words and their synonyms to count the number of occurrences of the separated query words and their synonyms in the contents of each link. Table 3 shows the number of occurrences of the query words and their synonyms in the contents of each link.

$$C_s(p)=\sum_{i=1}^{a}\left(\frac{C_sD_i}{\max(CD_i)}\times w_Q+\sum_{j=1}^{b}\frac{C_sD_iM_j}{\max(CD_iM_j)}\times w_M\right).$$

In the above equation, $C_s(p)$ is the calculated content-based value of the $s$th unique link; $C_sD_i$ is the number of occurrences of the $i$th query word $D$ in the content of the $s$th unique link; $\max(CD_i)$ is the maximum number of occurrences of the $i$th query word $D$ over the contents of all unique links; $C_sD_iM_j$ is the number of occurrences of the $j$th synonym of the $i$th query word $D$ in the content of the $s$th unique link; and $\max(CD_iM_j)$ is the maximum number of occurrences of the $j$th synonym of the $i$th query word $D$ over the contents of all unique links.
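Because $S_s(p)$ and $C_s(p)$ share the same normalized-frequency form, without the $1/\max$ smoothing term of the title score, a single helper can score any text field of a link. A minimal sketch under that assumption:

```java
import java.util.Arrays;

// One helper serves both the snippet-based and content-based scores: the
// per-link term frequency is normalized by the maximum frequency over all
// unique links and scaled by the term's weight.
public class FieldScore {
    static double fieldScore(int[] occurrences, int s, double weight) {
        int max = Arrays.stream(occurrences).max().orElse(0);
        return max == 0 ? 0.0 : (double) occurrences[s] / max * weight;
    }

    public static void main(String[] args) {
        int[] snippetFreq = {3, 1, 0}; // counts of one query word per snippet
        System.out.printf("S_0(p) term = %.3f%n", fieldScore(snippetFreq, 0, 0.8));
    }
}
```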

4.3.4 URL-Based Calculation

Each link we obtain from the different search engines would come under a specific domain name. An example for the domain name is “Wikipedia.” We calculate the domain value for each unique link using the domain name we found for each link in the different search engines. The equation to calculate the domain value for each unique link is given below:

$$U_s(p)=\log_{10}\left(\frac{2m-1+acc_s}{2m}\right).$$

In the above equation, $U_s(p)$ is the calculated domain value of the $s$th unique link; $m$ is the number of search engines used; and $acc_s$ is the number of unique links with the same domain name. For example, suppose we have 10 unique links, five of which are from the same domain. When checking any one of those five links, the $acc_s$ value for that link is five.
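A one-line implementation of the domain value, as reconstructed above; note that under this reading a domain occurring only once ($acc_s = 1$) scores exactly 0.

```java
// Sketch of the URL/domain value U_s(p) = log10((2m - 1 + acc_s) / (2m)):
// m search engines, acc_s unique links sharing the domain of link s.
public class DomainValue {
    static double domainValue(int m, int accS) {
        return Math.log10((2.0 * m - 1 + accS) / (2.0 * m));
    }

    public static void main(String[] args) {
        // Example from the text: m = 3 engines; 5 links share the same domain.
        System.out.printf("U_s(p) = %.4f%n", domainValue(3, 5));
    }
}
```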

4.3.5 Link-Based Calculation

This calculation is based on the ranking of the link in the different search engines we used; that is, the position at which the link appears in each search engine chosen for our process. The formula to calculate the position value of a link is shown below:

$$R_s(p)=\frac{m\times k-\sum_{l=1}^{m}r_l(p)}{m\times k}.$$

In the above equation, $R_s(p)$ is the position value of the link; $m$ is the number of search engines used; $k$ is the number of links taken for our process from each search engine; and $r_l(p)$ is the rank of the link in the $l$th search engine.
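A minimal sketch of the link-based score; the text does not specify what rank is assigned when a link is missing from an engine's list, so the example assumes the link appears in every list.

```java
import java.util.Arrays;

// Link-based score R_s(p) = (m*k - sum of ranks) / (m*k): a link ranked near
// the top of every engine keeps most of the maximum value m*k.
public class LinkScore {
    static double linkScore(int k, int[] ranksPerEngine) {
        int m = ranksPerEngine.length;
        int sum = Arrays.stream(ranksPerEngine).sum();
        return (double) (m * k - sum) / (m * k);
    }

    public static void main(String[] args) {
        // Top 10 links per engine; the link is ranked 1st, 3rd, and 2nd.
        System.out.printf("R_s(p) = %.3f%n", linkScore(10, new int[]{1, 3, 2}));
    }
}
```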

4.4 Co-occurrence Calculation

The co-occurrence of each link is calculated by comparing each link with every other link and computing the ratio of the number of similar contents in the two compared links to the total number of unique contents in both links. A link is not compared with itself.

$$V=\frac{\text{No. of similar contents in the comparing links}}{\text{Total no. of unique contents in both links}}.$$

After calculating the co-occurrence values, we calculate the total co-occurrence for each unique link using the following equation:

$$\text{Co-O}(P)=\frac{1}{n}\sum_{i=1}^{n}\text{row}(UL_i).$$

In the above equation, $n$ is the total number of unique links obtained, and $\text{row}(UL_i)$ denotes the row-wise values of the co-occurrence matrix, i.e. the pairwise co-occurrence values of the respective unique link.
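A sketch of the co-occurrence computation, interpreting "contents" as the set of terms of each link; this representation is an assumption, as the paper does not fix it.

```java
import java.util.*;

// Pairwise value V is the Jaccard-style ratio |shared terms| / |union of
// terms|; Co-O averages a link's row of the pairwise co-occurrence matrix.
public class CoOccurrence {
    static double pairValue(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a); shared.retainAll(b);
        Set<String> union  = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
    }

    static double coOccurrence(List<Set<String>> contents, int s) {
        double sum = 0.0;
        int n = contents.size();
        for (int i = 0; i < n; i++)
            if (i != s) sum += pairValue(contents.get(s), contents.get(i)); // skip self-comparison
        return sum / n; // 1/n as in the equation above
    }

    public static void main(String[] args) {
        List<Set<String>> contents = List.of(
                Set.of("data", "mining", "web"),
                Set.of("data", "web", "cache"),
                Set.of("fuzzy", "bat"));
        System.out.printf("Co-O(link 0) = %.3f%n", coOccurrence(contents, 0));
    }
}
```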

4.5 Classification with Fuzzy-Bat

4.5.1 Classification Using Fuzzy Logic

The fuzzy rule-based classifier is utilized to generate the optimized fuzzy rules, which are then given to the bat algorithm.

4.5.2 Fuzzy Logic

Fuzzy rule-based classification is a method of generating a mapping from a given input to an output using fuzzy logic. The mapping then gives a basis from which decisions can be made. Membership functions, logical operations, and if-then rules are used in the fuzzy rule-based process. The stages of fuzzy logic are

  1. Fuzzification;

  2. Fuzzy rules generation;

  3. Defuzzification.

4.5.2.1 Fuzzification

During the fuzzification process, the crisp inputs are converted into linguistic (fuzzy) variables. After that, the minimum and maximum values are calculated from the input data. The fuzzification is computed by applying the following equations:

$$M_L=\min+\frac{\max-\min}{3},\tag{1}$$
$$X_L=M_L+\frac{\max-\min}{3},\tag{2}$$

where $M_L$ is the minimum limit value of the feature $M$ and $X_L$ is the maximum limit value of the feature $M$.

Equations (1) and (2) are used to calculate the minimum and maximum limit values, and three conditions based on these limits generate the fuzzy values.
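A minimal sketch of the fuzzification step using Eqs. (1) and (2); the linguistic labels low/medium/high for the three regions are an assumed naming, as the paper does not state them.

```java
// The feature range is split into thirds, giving the two limits M_L and X_L
// that bound the three linguistic regions of a feature value.
public class Fuzzifier {
    static String fuzzify(double value, double min, double max) {
        double third = (max - min) / 3.0;
        double mL = min + third; // Eq. (1)
        double xL = mL + third;  // Eq. (2)
        if (value <= mL) return "low";
        if (value <= xL) return "medium";
        return "high";
    }

    public static void main(String[] args) {
        // Feature observed between 0 and 0.9; the crisp value 0.5 is "medium".
        System.out.println(fuzzify(0.5, 0.0, 0.9));
    }
}
```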

4.5.2.2 Fuzzy Rules Generation

According to the fuzzy values generated for each feature in the fuzzification process, the fuzzy rules are generated. The general form of the rules is given below.

4.5.2.3 General Form of Fuzzy Rule

“IF A THEN B.”

The "IF" part of the fuzzy rule is called the "antecedent," and the "THEN" part is called the "conclusion." The mapping between the antecedent and the conclusion is trained to generate the fuzzy rules. The generated fuzzy rules are then given to the bat algorithm to find the optimal solution.
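A minimal sketch of such a rule over the fuzzified feature values; the feature names and the class label are hypothetical.

```java
import java.util.Map;

// An "IF A THEN B" rule: the antecedent matches linguistic values per feature,
// and the conclusion is the class label handed to the bat algorithm.
public class FuzzyRule {
    final Map<String, String> antecedent; // feature -> linguistic value
    final String conclusion;              // e.g. "relevant"

    FuzzyRule(Map<String, String> antecedent, String conclusion) {
        this.antecedent = antecedent;
        this.conclusion = conclusion;
    }

    // The rule fires when every antecedent condition holds for the link.
    boolean fires(Map<String, String> fuzzifiedLink) {
        return fuzzifiedLink.entrySet().containsAll(antecedent.entrySet());
    }

    public static void main(String[] args) {
        FuzzyRule rule = new FuzzyRule(
                Map.of("titleScore", "high", "coOccurrence", "medium"), "relevant");
        Map<String, String> link = Map.of(
                "titleScore", "high", "coOccurrence", "medium", "domainValue", "low");
        System.out.println(rule.fires(link) ? rule.conclusion : "not relevant");
    }
}
```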

4.5.3 Bat Algorithm for Finding the Optimal Location to Store the Data

The bat algorithm is a metaheuristic technique inspired by the echolocation behavior of microbats. It is effectively employed to find the optimal location. A concise account of the algorithm is given below.

4.5.3.1 Step-by-Step Procedure of Bat Algorithm

Step 1: At the outset, the bat population si (i=1, 2, …, n) is initialized.

Step 2: Thereafter, the pulse frequency (f) and velocity (v) are defined.

Step 3: It is followed by the initialization of the pulse rate (R) and loudness (L).

Step 4: Now, the fitness is evaluated by means of Eq. (3).

$$\text{Fitness}=\text{maximum matched data}.\tag{3}$$

Step 5: Create the new solution by adapting the frequency and updating the velocity with the help of the following relations:

$$f_i=f_{\min}+(f_{\max}-f_{\min})\gamma,\qquad v_i^x=v_i^{x-1}+(s_i^x-s_0)f_i,\qquad s_{\text{new}}=s_{\text{old}}+EL^x,\tag{4}$$

where $i\in\{1, 2, \ldots, N\}$; $N$ denotes the number of bats; $E$ and $\gamma$ are random numbers with $E, \gamma\in[0, 1]$; $s_0$ denotes the current global best location; and $L^x=\langle L_i^x\rangle$ is the average loudness.

Step 6: If the arbitrary number exceeds the pulse rate, go to step 7.

Step 7: Choose the solution from among the best and create a local solution around the best solution by flying arbitrarily.

Step 8: Now, the fitness is evaluated again.

Step 9: If (rand < $L_i$ and $f(s_i) < f(s_n)$), accept the new solution, increasing the pulse rate and reducing the loudness.

Step 10: Find the best location. The flowchart for the bat technique is shown in Figure 1.

Figure 1: Proposed Optimal Fuzzy Classifier for Web Page Personalization.
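The following condensed sketch walks through the steps above on a one-dimensional toy problem; the fitness function is a stand-in for the "maximum matched data" criterion of Eq. (3), and the loudness/pulse-rate update constants are assumed, as the paper does not give them.

```java
import java.util.Random;

// Condensed sketch of the bat algorithm steps above for a 1-D search space.
public class BatAlgorithm {
    public static void main(String[] args) {
        int n = 20, iterations = 100;
        double fMin = 0.0, fMax = 2.0;
        double[] s = new double[n], v = new double[n];
        double[] loudness = new double[n], pulseRate = new double[n];
        Random rnd = new Random(42);

        for (int i = 0; i < n; i++) {            // Steps 1-3: initialize
            s[i] = rnd.nextDouble() * 10;
            loudness[i] = 1.0;
            pulseRate[i] = 0.5;
        }
        double best = s[0];                       // Step 4: evaluate fitness
        for (double x : s) if (fitness(x) > fitness(best)) best = x;

        for (int t = 0; t < iterations; t++) {
            for (int i = 0; i < n; i++) {
                // Step 5: new solution via frequency and velocity update (Eq. 4)
                double fi = fMin + (fMax - fMin) * rnd.nextDouble();
                v[i] = v[i] + (s[i] - best) * fi;
                double candidate = s[i] + v[i];
                // Steps 6-7: random walk around the best if rand > pulse rate
                if (rnd.nextDouble() > pulseRate[i])
                    candidate = best + 0.01 * (rnd.nextDouble() * 2 - 1);
                // Steps 8-9: accept if better and the loudness test passes
                if (rnd.nextDouble() < loudness[i] && fitness(candidate) > fitness(s[i])) {
                    s[i] = candidate;
                    loudness[i] *= 0.9;                                // reduce loudness
                    pulseRate[i] = Math.min(1.0, pulseRate[i] * 1.05); // raise pulse rate
                }
                if (fitness(s[i]) > fitness(best)) best = s[i];        // Step 10: track best
            }
        }
        System.out.printf("best location = %.4f, fitness = %.4f%n", best, fitness(best));
    }

    // Hypothetical fitness; the paper maximizes the amount of matched data.
    static double fitness(double x) { return -(x - 3.0) * (x - 3.0); }
}
```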

5 Results and Discussion

This section presents the results obtained for our proposed technique in comparison with the existing technique. Our technique is implemented in Java (JDK 1.7) on a system with the following configuration: Core 2 Duo processor with a clock speed of 2.3 GHz and 2 GB of RAM, running the Windows 7 operating system.

5.1 Query Description and Our Process

This section explains the queries used for our comparison. We take the top 10 links from each search engine; the search engines used are Google, Bing, and Yahoo. This yields 30 links (top 10 from each search engine), which we merge based on unique links, as sketched below. For example, with three search engines, the given query is searched in all three Web search engines, the top 10 list is chosen from each, and the lists are merged based on unique links; that is, if a link is present in the top 10 list of the first search engine and the same link is present in the top 10 list of the second search engine, it is considered a single link while merging. To compare the response time of the proposed and existing techniques, the experimentation is also done using the top 50 and top 100 links from each search engine.
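A minimal sketch of the merge step, deduplicating by URL while preserving first-seen order; the URLs are illustrative.

```java
import java.util.*;

// The top-k lists of the engines are concatenated and deduplicated by URL,
// so a link returned by several engines is kept only once.
public class LinkMerger {
    static List<String> mergeUnique(List<List<String>> topLists) {
        LinkedHashSet<String> unique = new LinkedHashSet<>();
        for (List<String> list : topLists) unique.addAll(list);
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> google = List.of("a.com/1", "b.com/2", "c.com/3");
        List<String> bing   = List.of("b.com/2", "d.com/4");
        List<String> yahoo  = List.of("a.com/1", "e.com/5");
        // 8 links in, 5 unique links out.
        System.out.println(mergeUnique(List.of(google, bing, yahoo)));
    }
}
```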

5.2 Retrieval Effectiveness Evaluation

The retrieval effectiveness of our technique is compared with that of the existing fuzzy technique. The evaluation is based on 50 users; that is, the queries used for our evaluation are given to 50 users, who take the top 10 lists from each of the search engines used, merge them based on the unique links, and rank the links based on their relationship with the query and their own judgment. Eventually, the ranked lists of the 50 users are converted into a single ranked list to evaluate our technique. Table 1 shows the top list and relevant documents of the query when our technique is applied.

Table 1 is explained as follows: the first column represents the ranking for the query “Business intelligence in data mining technique for optimal results” using our technique; the second column represents the top links; the third column shows which of the documents are relevant to the given query with respect to the conclusion made by the user; the fourth column represents the ranking based on Google; the fifth column represents the ranking based on Bing; the sixth column represents the ranking based on Yahoo; and the last column shows the ranking based on the existing technique. The retrieval effectiveness is then used to find the precision of our technique. Table 2 shows the relevant documents in the top 10 lists for the query “Business intelligence in data mining technique for optimal results.”

Table 2: Relevant Documents in the Top 10 Lists for the Query "Data Mining Techniques."

Engine | R (relevant documents in the top 10)
Our technique | 8
Old technique | 7
Google | 6
Bing | 6
Yahoo | 6

Table 2 is explained as follows: the R column gives the number of documents among the top 10 of the respective technique that are relevant to the query.

5.3 Performance Comparison

This section shows the performance of our technique compared to the existing technique and the individual Web search engines Google, Bing, and Yahoo. The performance is measured by precision, calculated for the queries given by the user as the number of relevant documents retrieved for the query divided by the total number of documents retrieved for the query, as sketched below.
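The precision computation itself is a single ratio:

```java
// Precision as used in the comparison: relevant retrieved / total retrieved.
public class Precision {
    static double precision(int relevantRetrieved, int totalRetrieved) {
        return (double) relevantRetrieved / totalRetrieved;
    }

    public static void main(String[] args) {
        // E.g. 8 relevant documents in a merged top-10 list gives 0.8.
        System.out.println(precision(8, 10));
    }
}
```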

5.3.1 Precision Based on User-Given Queries

The precision using user-given queries is explained in this section. Figure 2 shows the precision comparison for the user-given query when the top 10 links are taken from each search engine.

Figure 2 shows the precision of our technique compared to the existing technique (QuadRank) [2] and the search engines used, for the user-given query "tickets for UEFA champions league final 2010" when the top 10 links are taken. Here, the precision of our technique is higher than that of the other techniques: 76% for our technique, 70% for the existing technique, 60% for Google and Yahoo, and 50% for Bing.

Figure 2: Precision Comparison for the User-Given Query "Data Mining" When Topmost Links Are Taken.

Figure 3 shows the precision of our technique compared to the existing technique and the search engines used. The precision in Figure 3 is calculated for the user-given query "distributed index construction" when the top 10 links are taken. Here, the precision of our technique is higher than that of the other techniques: 76% for our technique, 72% for the existing technique, 70% for Google, 50% for Bing, and 60% for Yahoo.

Figure 3: Precision Comparison for the User-Given Query "Lung Cancer" When Topmost Links Are Taken.

Figure 4 shows the precision comparison of our technique with the existing technique for the user query "lung cancer symptoms" when the top 10 links are considered. Here, the precision of our technique is higher than that of the existing technique. Figure 5 shows the precision comparison for the user-given queries when the top 100 links are taken from each search engine.

Figure 4: Precision Comparison for the User-Given Query "Web Usage Mining" When Topmost Links Are Taken.

In Figure 5, the precision is calculated for each user-given query by considering the top 100 links from each search engine. The comparison shows that the proposed technique achieves better precision than the existing technique for all the user-given queries.

Figure 5: Precision Comparison for the User-Given Queries When Top 100 Links Are Taken.

5.4 Evaluation of Response Time

This section shows the response time of our technique compared to the existing technique for the user-given queries based on the top 50 and top 100 links.

5.4.1 Response Time

Table 3 shows the response time comparison of our technique with the existing fuzzy technique in terms of the top 50 links. After giving the query, the top 50 links from each search engine are taken and merged based on the unique links. Thereafter, the merged lists are ranked using our technique and the existing technique. The values in Table 3 are in milliseconds; the first column shows the queries used for our comparison, n represents the size of the list derived from the fusion of the input rankings, and response time represents the time taken to rank the merged list. Here, we take 50 input rankings to check the response time. For example, the existing technique takes 173 ms for the query "Data mining" with 50 input rankings, whereas our proposed technique takes only 137 ms for the same query. In most cases, the time taken by our technique to rank the merged list is less than that of the existing technique when the top 50 links are taken. Table 4 shows the response time comparison in terms of the top 100 links.

Table 3: Response Time Comparison in Terms of Top 50 Links.

Query | n | Our Technique (ms) | Old Technique (ms)
Data mining | 50 | 137 | 173
Indian economy | 50 | 74 | 78
Lung cancer symptoms | 50 | 191 | 233
Intellectual property | 50 | 102 | 128
Web usage mining | 50 | 116 | 177
Federal funding mental illness | 50 | 131 | 198
Home buying | 50 | 68 | 126
Criteria obtain US | 50 | 59 | 93
Table 4: Response Time Comparison in Terms of Top 100 Links.

Query | n | Our Technique (ms) | QuadRank [2] (ms)
Data mining | 100 | 281 | 349
Indian economy | 100 | 81 | 97
Lung cancer symptoms | 100 | 125 | 168
Intellectual property | 100 | 119 | 182
Web usage mining | 100 | 137 | 193
Federal funding mental illness | 100 | 107 | 137
Home buying | 100 | 86 | 97
Criteria obtain US | 100 | 61 | 78

Table 4 shows the response time comparison of our technique with the existing fuzzy technique in terms of the top 100 links. After giving the query, the top 100 links from each search engine are taken and merged based on the unique links, and the merged lists are then ranked using our technique and the existing technique. Here, also, the time taken to rank the merged list using our proposed technique is less in most cases than that of the existing technique. We take 100 input rankings to compare the proposed and existing methods. For example, for the query "Web usage mining," the existing technique takes 193 ms, whereas our proposed method takes 137 ms. Similarly, for all the queries, our technique takes less time to return a result.

6 Conclusion

In this paper, we proposed a fuzzy-bat-based classification. We queried different search engines, chose the top n lists from them, and merged the lists based on the unique links. Using the merged lists, we performed the title-based, snippet-based, content-based, URL-based, link-based, and co-occurrence calculations. The calculated values, together with the user-ranked list, are given to the fuzzy-bat to rank the list. We compared our technique with the existing fuzzy technique in terms of precision and response time, using different queries given by users and Web track data. In most cases, the response time of our technique based on the top 50 and top 100 links is better than that of the existing fuzzy technique, and the precision of our technique is higher than that of the existing technique for the different queries we gave.

Bibliography

[1] J. H. Abawajy and M. J. Hu, A new Internet meta-search engine and implementation, in: The 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005.

[2] L. Akritidis, D. Katsaros and P. Bozanis, Effective rank aggregation for metasearching, J. Syst. Softw. 84 (2011), 130–143. doi:10.1016/j.jss.2010.09.001.

[3] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke and S. Raghavan, Searching the web, ACM Trans. Internet Technol. 1 (2001), 2–43. doi:10.1145/383034.383035.

[4] J. A. Aslam and M. H. Montague, Metasearch consistency, in: Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), pp. 386–387, 2001.

[5] C. Dimopoulos and C. Makris, A web page usage prediction scheme using sequence indexing and clustering techniques, Data Knowl. Eng. 69 (2010), 371–382. doi:10.1016/j.datak.2009.04.010.

[6] C. Dwork, R. Kumar, M. Naor and D. Sivakumar, Rank aggregation methods for the Web, in: Proceedings of the ACM International Conference on World Wide Web (WWW), pp. 613–622, 2001. doi:10.1145/371920.372165.

[7] F. Hogenboom, A survey of event extraction methods from text for decision support systems, J. Decis. Support Syst. 85 (2016), 12–22. doi:10.1016/j.dss.2016.02.006.

[8] H. Ishii, R. Tempo and E. W. Bai, A web aggregation approach for distributed randomized page rank algorithms, IEEE Trans. Autom. Control 57 (2012), 2703–2717. doi:10.1109/TAC.2012.2190161.

[9] F. Lamberti, A. Sanna and C. Demartini, A relation-based page rank algorithm for semantic web search engines, IEEE Trans. Knowl. Data Eng. 21 (2009). doi:10.1109/TKDE.2008.113.

[10] S. Lodhia, Web based social and environmental communication in the Australian minerals industry: an application of media richness framework, J. Cleaner Prod. 25 (2010), 73–85. doi:10.1016/j.jclepro.2011.11.040.

[11] C. D. Manning, P. Raghavan and H. Schutze, Introduction to Information Retrieval, Cambridge University Press, New York, 2008. doi:10.1017/CBO9780511809071.

[12] N. K. Papadakis, STAVIES: a system for information extraction from unknown web data sources through automatic web wrapper generation using clustering techniques, IEEE Trans. Knowl. Data Eng. 17 (2005), 1638–1652. doi:10.1109/TKDE.2005.203.

[13] L. Ramaswamy, Automatic fragment detection in dynamic web pages and its impact on caching, IEEE Trans. Knowl. Data Eng. 17 (2005), 859–874. doi:10.1109/TKDE.2005.89.

[14] N. T. S. H. Rizvi and R. R. Keole, A preliminary review of web-page recommendation in information retrieval using domain knowledge and web usage mining, Int. J. Adv. Res. Comput. Sci. Manage. Stud. 3 (2015), 156–166.

[15] M. Sah and V. Wade, Automatic metadata mining from multilingual enterprise content, Web Semantics Sci. Serv. Agents World Wide Web 11 (2012), 41–62. doi:10.1016/j.websem.2011.11.001.

[16] A. Spink, B. J. Jansen, C. Blakely and S. Koshman, Overlap among major Web search engines, in: Proceedings of the IEEE International Conference on Information Technology: New Generations (ITNG), pp. 370–374, 2006. doi:10.1109/ITNG.2006.105.

[17] J. Tang, Y. J. Du and K. L. Wang, Design and implement of personalize meta-search engine based on FCA, in: Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19–22 August 2007. doi:10.1109/ICMLC.2007.4370850.

[18] A. Telang, C. Li and S. Chakravarthy, One size does not fit all: toward user- and query-dependent ranking for web databases, IEEE Trans. Knowl. Data Eng. 24 (2012), 1671–1685. doi:10.1109/TKDE.2011.36.

[19] I. Varlamis and S. Stamou, Semantically driven snippet selection for supporting focused web searches, Data Knowl. Eng. 68 (2009), 261–277. doi:10.1016/j.datak.2008.10.002.

[20] L. Yang and A. Hanjalic, Prototype-based image search reranking, IEEE Trans. Multimed. 14 (2012), 871–882. doi:10.1109/TMM.2012.2187778.

[21] B. P. C. Yen and Y. W. Wan, Design and evaluation of improvement method on the web information navigation – a stochastic search approach, J. Decis. Support Syst. 49 (2010), 14–23. doi:10.1016/j.dss.2009.12.004.

[22] Z. Zhou and D. Yang, Personalized recommendation of preferred paths based on web log, J. Softw. 9 (2014), 684–688. doi:10.4304/jsw.9.3.684-688.

Received: 2016-8-25
Published Online: 2017-1-13
Published in Print: 2018-3-28

©2018 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
