This can be exploited to improve website performance and to recommend products or links based on users’ behavior. Visitors entering a site exhibit different behaviors: they might just browse, or the visit might end in a purchase. To understand customer behavior, and thus improve the performance of a web site, established techniques such as mining the web log data should be applied.
A. WEB MINING
The information space known as the Web is a collection of resources (Web resources) residing on the Internet that can be accessed using HTTP and protocols that derive from it.
A resource “can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., “today’s weather report for Los Angeles”), as well as a collection of other resources. Not all resources are network “retrievable”; e.g., human beings, corporations, and bound books in a library can also be considered resources”. The most important concept regarding the Web is of course the resource, which a server makes available to clients anywhere on the Internet; without resources, the whole system would make no sense.
When a resource is accessed by a client at a specific time and space, we speak of a resource manifestation. The general definition of a client is “the role adopted by an application when it is retrieving and/or rendering resources or resource manifestations”, whereas the Web-specific one defines the Web client as a client that issues requests and renders “responses containing Web resource manifestations” [8]. The server, on the other hand, is “the role adopted by an application when it is supplying resources or resource manifestations” to the requesting client.
Web mining covers a wide range of applications that aim at discovering and extracting hidden information [4] from data stored on the Web. Another important purpose of Web mining is to provide a mechanism that makes data access more efficient and adequate. A third purpose is to discover information that can be derived from the activities of users, which are stored in log files, for example for predictive Web caching. Web mining can thus be divided into three classes, based on which part of the Web is mined.
These three categories are (i) Web content mining, (ii) Web structure mining and (iii) Web usage mining [2]. While Web structure and content mining use primary data on the Web, Web usage mining works on secondary data such as web server access logs, proxy server logs, referrer logs, browser logs, error logs, user profiles, registration data, user sessions or transactions, cookies, user queries, and bookmark data. By analyzing these log files [2] and documents, we can uncover interesting usage patterns and information.
B. THE VARIOUS BUSINESS AREAS WHERE WEB MINING HAS HELPED IMPROVE BUSINESS DECISION MAKING
The World Wide Web is one of the most widely used interfaces for accessing remote data and commercial and noncommercial services, and the number of actors involved in these transactions is growing very quickly [4]. Everyone who uses the Web knows how slow the connection to a popular website can be during rush hours, and it is well known that web users tend to leave a site if the wait time for a page to be served exceeds a given value [7].
Therefore, performance and service quality attributes have gained enormous relevance in service design and deployment. This has led to the development of web benchmarking tools, which are widely available on the market. One of the most common criticisms of this approach is that the synthetic workload produced by web stressing tools is far from realistic. Moreover, websites need to be analyzed to discover commercial rules, and user profiles and models must be extracted from log files and monitored data [7].
C. WEB DATA
Web data are those that can be collected and used in the context of Web personalization. These data are classified in four categories according to [SC+O].
Fig. 1: Web Mining Taxonomy
Content data are presented to the end user appropriately structured. They can be simple text, images, or structured data, such as information retrieved from databases. Structure data represent the way content is organized. They can be either data entities used within a Web page, such as HTML or XML tags, or data entities used to put a Web site together, such as hyperlinks connecting one page to another.
Usage data represent a Web site’s usage, such as a visitor’s IP address, time and date of access, complete path (files or directories) accessed, referrer’s address, and other attributes that can be included in a Web access log. User profile data provide information about the users of a Web site. A user profile contains demographic information and preferences. Such information is acquired through registration forms or questionnaires, or can be inferred by analyzing Web usage logs.
Usage mining
Log files [20] are stored on the server side, on the client side and on the proxy servers.
Having more than one place where users’ navigation information is stored makes the mining process more difficult. Truly reliable results can be obtained only with data from all three types of log file, because the server side does not contain records of Web page accesses that were served from caches on the proxy servers or on the client side. The log file on the proxy server provides additional information beyond that on the server. However, the page requests stored on the client side are missing [18], and it is problematic to collect all the information from the client side.
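As an illustration of the usage-data fields listed above (IP address, time of access, requested path, referrer), the following minimal sketch parses one server-side log entry. It assumes the entry follows the Apache/NCSA "combined" log format; the pattern name, the function name and the sample line are hypothetical and are not taken from any of the cited works.

```python
import re
from datetime import datetime

# Assumption: entries follow the Apache/NCSA "combined" log format.
# Other servers, or custom configurations, may use a different layout.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of usage-data fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the timestamp and status code to typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry

# Hypothetical sample entry, for illustration only.
sample = ('192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /products/index.html HTTP/1.1" 200 2326 '
          '"http://example.com/start.html" "Mozilla/5.0"')
entry = parse_log_line(sample)
print(entry["ip"], entry["path"], entry["referrer"])
```

Lines that do not match the assumed layout yield None, so they can be counted or inspected separately rather than silently dropped.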
It is this data that must hold the attention of the Web usage miner. To discover usage patterns from the available data described above, three steps are necessary.
A. Pre-processing
This phase is probably the most complex and thankless step of the overall process. Its main task is to “clean” the raw web log files and insert the processed data into a relational database or a data warehouse, so that data mining techniques can be applied in the second phase of the process.
B. Pattern discovery
The goal of this stage is to find hidden relationships in the data. Typically the first technique applied to the data is statistical analysis. It extracts information such as the most frequently requested pages, the average access time, and the most common error codes. Although this kind of information can be valuable to the system administrator, it is of limited interest from a business perspective. Other methods, such as clustering, Hidden Markov Models or Bayesian Belief Networks, are usually applied to perform classification and to discover dependencies in the data [19].
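The simple statistics mentioned above (most frequently requested pages, most common error codes) can be sketched as follows. The entries are hypothetical, already-parsed (path, status) pairs, and treating every HTTP status of 400 or above as an error is an assumption of this example.

```python
from collections import Counter

# Hypothetical, already-parsed log entries: (requested path, HTTP status code).
entries = [
    ("/index.html", 200),
    ("/products.html", 200),
    ("/index.html", 200),
    ("/missing.html", 404),
    ("/index.html", 200),
    ("/old.html", 404),
    ("/cart", 500),
]

def most_requested_pages(entries, n=3):
    """Return the n most frequently requested paths with their hit counts."""
    return Counter(path for path, _ in entries).most_common(n)

def most_common_errors(entries, n=3):
    """Return the n most frequent error codes (assumed: status >= 400)."""
    return Counter(status for _, status in entries if status >= 400).most_common(n)

top_pages = most_requested_pages(entries, n=1)
top_errors = most_common_errors(entries, n=1)
print(top_pages, top_errors)
```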
C. Pattern analysis
The pattern discovery process can yield a large number of patterns, not all of which are interesting. The pattern analyzer assesses the interestingness of the patterns and retains only the interesting ones; the rest can be ignored. These phases define the Web mining process and also apply in the general case, not only to Web server data [16]. The first phase is the most important of all and, because of the complex nature of the Web architecture, also the most difficult.
Raw data coming from the Web server is unfortunately incomplete, and only a few fields are available for discovering patterns (IP address, time, user agent). Moreover, an IP address does not always identify a single user, and often not even a single client (when the client is behind a proxy). This preparatory step is therefore essential; its aim is to build a data file (the server session file) that is as complete and robust as possible, by gathering information from the different available sources described in the previous section.
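A minimal sketch of this preparatory step is given below. It rests on assumptions that are not part of the text: requests for embedded files (images, style sheets, scripts) are treated as noise, a visitor is approximated by the pair (IP address, user agent), and a gap of more than 30 minutes starts a new session. All names and sample records are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed heuristic, not from the text
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed noise

def clean(records):
    """Drop requests for embedded resources that do not reflect navigation."""
    return [r for r in records if not r["path"].lower().endswith(IGNORED_SUFFIXES)]

def build_sessions(records):
    """Group requests into sessions keyed by (IP, user agent) with a timeout."""
    sessions = []
    last_seen = {}  # (ip, agent) -> (time of last request, current session)
    for r in sorted(records, key=lambda r: r["time"]):
        key = (r["ip"], r["agent"])
        previous = last_seen.get(key)
        if previous is None or r["time"] - previous[0] > SESSION_TIMEOUT:
            session = []  # first request of this visitor, or timeout exceeded
            sessions.append(session)
        else:
            session = previous[1]
        session.append(r["path"])
        last_seen[key] = (r["time"], session)
    return sessions

# Hypothetical records, for illustration only.
t0 = datetime(2023, 10, 10, 13, 0)
records = [
    {"ip": "192.0.2.10", "agent": "A", "time": t0, "path": "/index.html"},
    {"ip": "192.0.2.10", "agent": "A", "time": t0 + timedelta(minutes=1), "path": "/logo.png"},
    {"ip": "192.0.2.10", "agent": "A", "time": t0 + timedelta(minutes=5), "path": "/products.html"},
    {"ip": "192.0.2.10", "agent": "B", "time": t0 + timedelta(minutes=6), "path": "/index.html"},
    {"ip": "192.0.2.10", "agent": "A", "time": t0 + timedelta(hours=2), "path": "/index.html"},
]
sessions = build_sessions(clean(records))
print(sessions)
```

In the sample, the two user agents behind the same IP address are kept apart, which loosely mirrors the proxy problem described above; real logs generally need further heuristics.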
Building the server session file is anything but easy [10]. The pattern discovery phase consists of different techniques derived from various fields, such as statistics, machine learning, data mining and pattern recognition, applied to the Web domain and to the available data.
D. Statistical analysis
This kind of analysis is performed by many tools, some of them freely available, and its aim is to describe the traffic on a Web site: most visited pages, average daily hits, etc.
In association rule mining, the main idea is to consider every URL requested by a user in a visit as an item of basket data and to discover relationships between items that meet a minimum support level; this is the case discussed in this paper from the next section onward. Association rule mining is the task of finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories [11]. This field of data mining was originally developed for Market Basket Analysis, where the idea was to find the items most frequently bought together.
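The basket view of visits and the minimum support level can be illustrated with a small sketch. It is a brute-force count over page pairs only, not one of the published algorithms, and the confidence threshold (the estimated probability of the right-hand page given the left-hand page), the page names and the threshold values are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.3, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) between pairs of pages."""
    n = len(sessions)
    baskets = [set(s) for s in sessions]  # each visit is one basket of URLs
    item_count = Counter(page for b in baskets for page in b)
    pair_count = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of visits containing both pages
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical visits, for illustration only.
visits = [
    ["/home", "/news", "/sport"],
    ["/home", "/news"],
    ["/home", "/sport"],
    ["/news", "/weather"],
    ["/home", "/news", "/weather"],
]
rules = pair_rules(visits)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Counting every pair directly does not scale to large logs; dedicated algorithms prune the candidates instead of enumerating them all.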
In the web mining context, the idea is to find related pages that are accessed in the click stream and to attach a measure of probability to the relationship, for example: “If a person visits the CNN Web site, there is a 60% chance the person will visit the BBC News Web site in the same month”. Several algorithms exist to perform association rule mining: Apriori-like algorithms, Eclat, frequent-pattern tree algorithms, and many others.
A. Sequential patterns
This technique attempts to discover time-ordered sequences of URLs followed by past users, in order to predict future ones (it is widely used for Web advertisement purposes).
Sequence mining is the task of finding temporal patterns over a database of sequences, in this case a database of click streams. It is considered an extension of association mining, which finds only non-temporal patterns. This technique can play a very important role in knowledge discovery in web log data, due to the (temporally) ordered nature of click streams. A typical pattern resulting from this technique is: “If a user visits page X and then page Y, the user will visit page Z with a c% chance”.
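The “X, then Y, then Z with c% chance” pattern can be estimated from click streams with a short sketch. It assumes that the pages of a pattern must appear in order but not necessarily consecutively; the function names and click streams are hypothetical, and no published sequence mining algorithm is implemented here.

```python
def contains_in_order(session, pattern):
    """True if the pages of `pattern` occur in `session` in order (gaps allowed)."""
    remaining = iter(session)
    # `page in remaining` consumes the iterator up to the first match,
    # so each later page must be found after the previous one.
    return all(page in remaining for page in pattern)

def sequence_confidence(sessions, prefix, next_page):
    """Share of sessions containing `prefix` that later also reach `next_page`."""
    with_prefix = [s for s in sessions if contains_in_order(s, prefix)]
    if not with_prefix:
        return 0.0
    with_full = [s for s in with_prefix
                 if contains_in_order(s, list(prefix) + [next_page])]
    return len(with_full) / len(with_prefix)

# Hypothetical click streams, for illustration only.
click_streams = [
    ["X", "Y", "Z"],
    ["X", "A", "Y", "Z"],
    ["X", "Y", "B"],
    ["Y", "X", "Z"],   # Y precedes X, so the prefix X, Y does not match
    ["X", "Y"],
]
c = sequence_confidence(click_streams, ["X", "Y"], "Z")
print(f"X, Y -> Z with {c:.0%} chance")
```

Unlike the association rule case, the fourth click stream is excluded here because the order of the visits matters.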
The algorithms for sequence mining inherit much from association mining algorithms, and many of them are extensions of the latter; the main difference is that sequence mining searches for inter-sequence patterns, whereas association mining searches for intra-sequence patterns.
B. Clustering
Meaningful clusters of URLs can be created by discovering similar characteristics between them according to users’ behavior. In the next sections, I consider only the discovery of association rules from HTTP server data.