In times of rapid internet connections, we all got used to fast responses when it comes to web or mobile applications. Some studies[1] show that the time that users need to wait for a website is crucial and has a direct impact on user experience, and thus on the generated income. Prefetching the data seems to be good way to reduce the latency. Here we present a bunch of methods which may be useful to achieve that.

What is the problem?

Typical web application is built over the HTTP protocol – every request sent by a user is processed by some kind of a webserver, which produces a response and sends it back to the user. Initially, one TCP connection was used per each HTTP connection. It changed when persisent connections[2] were introduced and currently almost all modern browsers use them to avoid establishing and terminating too many short-lasting ones. HTTP pipelining[3] is another great feature that allows to reduce the latency dramatically, but even though it is supported by the majority of HTTP servers, it is rarely enabled in browsers by default, so most of users do not use it. These both enhancments share the same idea – the aim is to reduce the time that a person visiting a website spends on waiting for the requested resources. Why don’t go a step further then? If we could prefetch the resources a particular user would want to use in their next steps, before they even send a request to receive them, then the delays would never occur. It is especially meaningful for mobile devices, which do not always have a good internet connection and having the data already stored can be crucial for user experience. Facebook has put a related issue on their list of top open data problems[4].

User modeling

We could store a copy of all the data gathered by our system on each device that uses it and update whenever possible, seems perfect but it is obviously not a good solution. It is absolutely pointless to store something what users will never access or what is simply not interesting to them. Preparing such data will also lead to efficiency issues and higher network traffic, so there is a need to find a tradeoff between the amount of prefetched data and computational power used to generate it. To make such a process efficient we need to decide what is important for a particular person, or for a whole group of users. That’s where the user behaviour modeling might be useful. If a system is used on a daily basis (e.g. social media or working tools), users tend to perform similar actions. By recognizing these usage patterns we can predict what they would like to do next, and do not even wait for a request to perform this action. There are a couple of methods used to achieve that.

Content based prediction

From the users’ perspective, an application delivers some kind of documents and allows to perform a limited set of actions. These documents may be directly connected to each other or share similar topic. Georgakis and Li[5] describe the algorithm which uses HTML hyperlinks as relations between the documents and their content to identify the subject. Whenever user visits a page, all the outbound links are being collected with the surrounding text. This context of a single link is thought to describe the related document in terms of its subject. The content of the document is then cleaned from unnecessary HTML tags, punctuation marks as well as the common words which are not meaningful, in order to extract only relevant information. The next phase is a stemming which helps to extract the informative context by word suffixes removal. Based on a low-to-medium law, which claims the less frequent the term the more informative it is, the least informative terms are eliminated to reduce the dimensionality. Instead of single stems, the bigram model is used, since collocations can describe more complex concepts than single words only. The bigrams are defined for each link separately.

Data Prefetching in Web Applications

Applying the above procedure helps to describe the documents in terms of their content by using the context of links pointing to them and they are used further to find the importance factors of the bigrams. Each user has its own profile, which is used to store the keywords that they find interesting. Documents are considered to be interesting for the particular user, if they have already visited a document having similar bigrams. The prefetching is done after applying weights to each outbound link on a subpage. These weights are calculated based on this particular link keywords and user preferences. The authors claim the reduction of latency from 25% up to 75%, depending on the number of prefetched documents, but with the cost of increasing the bandwith overhead.

Statistical modeling of interaction

Another way to look at the problem of user behaviour modeling is to analyze the HTTP requests. Majority of web servers collects access logs, which are usually well-structured plain text documents and may be processed further to extract the information about the precedence of sent HTTP requests. A time ordered list of such requests, coming from the same user within closed time range, may be defined as a session. Assuming that a particular user interacts with the application in a similar way over time, we should simply notice repeating patterns in the log. There are many applications of machine learning techniques in data mining to such patterns recognintion.

Statistical modeling of interaction: data Prefetching

Markov chain[6] is a random process with a finite set of states and transitions between them. The probability of next state depends only on the previous state – it is a fulfillment of Markov property. For n-th order Markov chain, the state depends on n previous states. A Hidden Markov Model[7] is a random process, but with unobservable states and the observations, which are state-dependent and visible. Similarly to Markov chains, we need to have the probabilities of state-to-state transitions, but additionaly some probability distribution of observations while being in a particular state. Meera Narvekar and Shaikh Sakina Banu[8] propose an architecture built on top of these two models to predict the next action that user will perform. They achieved ~30% accuracy on raw data taken from the NASA access log[9].

Data Prefetching: modeling of interaction

N-gram[10] is popular statistical model, a n items sequence, usually of some linguistic unit. Despite the fact such an architecture is mainly used in computational linguistic, there are some attempts to adapt it to different problems as well. Zhong Su et al.[11] proposed to use it to model users’ behaviour in case of web application interaction. Once again, they suggest to use web server logs, divided into users’ sessions as a learning dataset. They claim that short-lasting sessions have rather random flow and hardly predictable requests, so making any predictions should be performed only on visits which are long enough. Due to this assumption, they extended standard n-gram model into n-gram+, which operates on sequences of any length greater than n. As an input, it takes whole sequence of user interaction and tries to find the longest subsequence that matches to its knowledge database, the historical observations statistics, and find the most probable action that will be performed after. If given path has been never observed before, the first element of this sequence is removed and the algorithm tries to find next action for one item shorter path. In case of sequences shorter than n, no prediction is performed. It means that short sessions are not even considered in this process. The goal of the research, the authors performed, was to find a good tradeoff between the precision and the applicability of the created model. By applicability they define a fraction of requests which may be predicted by the system. The tests were performed on the same NASA dataset and showed that for traditional n-grams the precision grows with the value of n, but the applicability decreases drastically. Using n-gram+ model leads to good compromise between these two figures. Finally, they achieved ~60% precision for 3-gram+ model, what is slightly lower than for 4-gram alone, but about 10% better than standard 3-gram, with the maintained applicability.

The problem of user behaviour modeling for prefetching is just the tip of the iceberg. There are many related problems, like adaptive interfaces, recommendation engines and content personalization which may directly result in a revenue increase of our business. There is quite a big niche for such systems, but presented architectures seem to be really good starting points for further investigation.

References:

  1. http://blog.smartbear.com/software-quality/5-10-15-seconds-how-long-will-you-wait-for-a-web-page-to-load/
  2. https://en.wikipedia.org/wiki/HTTP_persistent_connection
  3. https://en.wikipedia.org/wiki/HTTP_pipelining
  4. https://research.facebook.com/blog/facebook-s-top-open-data-problems/
  5. A. Georgakis, H. Li, “User behavior modeling and content based speculative web page prefetching”, Data & Knowledge Engineering 59 (2006) 770–788
  6. https://en.wikipedia.org/wiki/Markov_chain
  7. https://en.wikipedia.org/wiki/Hidden_Markov_model
  8. M. Narvekar, S. S. Banu, “Predicting User’s Web Navigation Behavior Using Hybrid Approach”, Procedia Computer Science 45 ( 2015 ) 3 – 12
  9. http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
  10. https://en.wikipedia.org/wiki/N-gram
  11. Z. Su, Q. Yang, Y. Lu, H. Zhang, “A Prediction System for Web Requests using N-gram Sequence Models”, Proceedings of the First International Conference on Web Information Systems Engineering (WISE’00)-Volume 1, p.214, June 19-20, 2000

Software Engineer

I am a big fan of AI and applying machine learning methods in real-life problems, with an experience in web development and databases. Currently, I'm involved in an internal research project at Codete.