An important way telcos can increase revenue is to improve, within the constraints of privacy laws, provision of personalised services for subscribers. To achieve that, they need to be able to build good subscriber preference models. These can take a number of forms, depending on the specific business context and the exact data available.
In this blog post, we will focus on the problem of building a statistical URL classification model, which can be used to score a subscriber's web browsing history to determine their likely interests. Technically, we seek a model
classify : URL → Class (1)
that takes a URL and returns a class. In particular, the model will be realised in the form of a SQL user-defined function that can be executed in a database like PostgreSQL or Greenplum as follows:
    SELECT classify('http://www.google.com');
    SELECT classify('http://19.46.59.15/paypal%20cmd=_login-run.php');
    SELECT url, classify(url) FROM atable;
To profile each subscriber, we can, for instance, classify all the URLs visited by the person over a certain time period and represent the person by a preference vector (p_1, ..., p_n), where n is the number of classes we have and p_i is the number of times (normalised if necessary) the person has visited a URL of class i in the chosen time period. With a sufficiently granular set of classes, preference vectors so constructed can serve as good summaries of subscriber interests. It is easy to query such data to find matches for advertisements and other communications that are meant to be targeted. One can also easily perform clustering analysis on preference vectors to find natural groupings of subscribers.
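As a small illustration of this profiling step, here is a sketch in Python; the class list, the toy classify function and the example history are invented for the example, with the real classify being the model built in this post:

    from collections import Counter

    # Hypothetical class list and a stand-in classifier; the real classify()
    # is the SQL UDF built in this post.
    CLASSES = ["shopping", "finance", "music", "movies", "jobsearch"]

    def classify(url):
        return "shopping" if "shop" in url else "finance"    # toy stand-in

    def preference_vector(urls):
        """Count the classes of the URLs visited and normalise to proportions."""
        counts = Counter(classify(u) for u in urls)
        total = float(sum(counts.values())) or 1.0
        return [counts.get(c, 0) / total for c in CLASSES]

    history = ["http://myshop.example.com/cart", "http://bank.example.com/login"]
    print(preference_vector(history))    # [0.5, 0.5, 0.0, 0.0, 0.0]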
To learn such models from data, we need a set of labelled URLs. There are a number of such data sets available, most of them from commercial vendors. For our purpose, we use a publicly available data set called Shalla’s Blacklists (details at http://www.shallalist.de), which contains a collection of 1.7 million URLs grouped into 74 different classes. The data set is the result of community effort over many years. Table 1 shows a small sample from Shalla’s Blacklists. Useful classes for understanding telco subscribers include Shopping, Drugs, Finance, Gamble, Music, Education, Movies, Automobile, JobSearch, Dating, Hospitals, Alcohol, Spyware, etc.
To transform the Shalla dataset into a form amenable to standard analysis, we use a simple statistical technique to break each URL into its probable constituent tokens. For example, we want to break the URL hungrygowhere.com into the bag of four words {hungry, go, where, com}. This lets us treat the resulting dataset the way we would a standard text classification problem: represent each bag of words by a term-occurrence vector, and then apply learning algorithms that work well with sparse, high-dimensional data.
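To make the term-occurrence representation concrete, here is a small sketch (not the MADlib pipeline itself); the toy documents and vocabulary are invented for the example:

    from collections import Counter

    # Toy tokenised URLs; in the real pipeline these come from the tokeniser below.
    docs = [["hungry", "go", "where", "com"],
            ["cheap", "shopping", "deals", "com"]]

    # Map every term in the corpus to a column index.
    vocab = {term: i for i, term in enumerate(sorted(set(t for d in docs for t in d)))}

    def term_occurrence_vector(tokens):
        """Sparse representation of a bag of words: {column index: count}."""
        counts = Counter(tokens)
        return {vocab[t]: c for t, c in counts.items() if t in vocab}

    print(term_occurrence_vector(["hungry", "go", "go", "com"]))
    # {4: 1, 3: 2, 1: 1}  (hungry=4, go=3, com=1)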
Both the Naive Bayes classifier and linear Support Vector Machines (SVMs) are known to give good results on text classification problems. We will use the linear SVM implementation available in MADlib (http://madlib.net), an open-source library of large-scale in-database machine learning algorithms. To deal with the large number of classes, we adopt the one-vs-all reduction of an n-class classification problem into n binary classification problems, one model per class.
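The one-vs-all reduction itself is simple to sketch. In the snippet below, train_binary is a placeholder standing in for MADlib's linear SVM trainer and is assumed to return a scoring function; the code is illustrative only:

    def one_vs_all_train(X, labels, classes, train_binary):
        """Train one binary model per class: class c versus everything else."""
        models = {}
        for c in classes:
            y = [1 if label == c else -1 for label in labels]
            models[c] = train_binary(X, y)    # placeholder for the MADlib linear SVM call
        return models

    def one_vs_all_predict(models, x):
        """Classify x as the class whose binary model scores it highest."""
        return max(models, key=lambda c: models[c](x))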
The problem of how to optimally tokenise a URL like hungrygowhere.com is a potentially deep question that would likely require a good understanding of natural language processing and optimisation techniques to formulate and solve computationally. In this post, we adopt a simple and intuitive greedy-search algorithm that repeatedly finds all the prefixes of (what remains of) the URL that are valid English words and then picks the most promising word from that list. Among the candidate words, the algorithm picks the one that maximises the score
score(w) := ln(freq(w)) × length(w),
where freq(w) is the number of times the word w occurs in a corpus, and length(w) is the number of characters in w. Intuitively, we prefer long words over short words, and commonly seen words over less commonly seen ones. The natural logarithm is there to temper the word-frequency distribution, which typically follows a power law.
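A minimal sketch of this greedy tokeniser, assuming the dictionary and word-frequency map have already been loaded (the prefix search is done naively here; the trie described below makes it much faster), and assigning unseen words a frequency of 1 as a simplification:

    import math

    def score(w, freq):
        # prefer longer and more frequent words; unseen words get frequency 1 here
        return math.log(freq.get(w, 1)) * len(w)

    def greedy_tokenise(s, word_set, freq):
        """Repeatedly strip off the best-scoring dictionary-word prefix of s."""
        tokens = []
        while s:
            # all prefixes of (what remains of) s that are valid words
            candidates = [s[:i] for i in range(1, len(s) + 1) if s[:i] in word_set]
            if not candidates:
                s = s[1:]              # no word starts here; skip one character
                continue
            best = max(candidates, key=lambda w: score(w, freq))
            tokens.append(best)
            s = s[len(best):]
        return tokens

    # Toy dictionary and frequency counts for the example:
    words = {"hungry", "hung", "go", "where", "com"}
    freqs = {"hungry": 50, "hung": 20, "go": 5000, "where": 3000, "com": 100000}
    print(greedy_tokenise("hungrygowhere", words, freqs))   # ['hungry', 'go', 'where']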
Two data sets are required to implement the above tokenisation algorithm: a dictionary of (English) words, and a corpus of English text from which to count word frequencies. For the former, we use the dictionary /usr/share/dict/words that comes with every Unix/Linux system. For the latter, we sidestep the problem of finding an English corpus by adopting a precompiled word-frequency database in the form of the British National Corpus (BNC) dataset available at
http://www.kilgarriff.co.uk/bnc-readme.html.
The English dictionary is organised into a trie (aka prefix tree), which can be done using the open-source Marisa-Trie implementation available at
https://code.google.com/p/marisa-trie.
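For example, with the marisa-trie Python package the dictionary can be loaded once and queried for all dictionary-word prefixes of a string in a single call (a sketch; assumes Python 3 and the Unix dictionary path mentioned above):

    import marisa_trie

    # Load the system dictionary into a trie (lower-cased, blank lines dropped).
    with open('/usr/share/dict/words') as f:
        words = set(line.strip().lower() for line in f if line.strip())
    trie = marisa_trie.Trie(list(words))

    # All dictionary words that are prefixes of the remaining URL fragment.
    print(trie.prefixes('hungrygowhere'))
    # e.g. ['hung', 'hungry'] -- exact output depends on the dictionary contents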
To allow efficient search for word frequencies, the BNC dataset is organised into a hash map. Finally, the whole algorithm can be wrapped inside a PL/Python user-defined function to allow parallel processing of all the URLs.
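A skeleton of such a wrapper might look as follows. It is a sketch only: it assumes a Python 3 PL/Python installation (plpython3u; older installations use plpythonu), caches the trie and frequency map in PL/Python's per-session SD dictionary, and leaves the BNC loading as a placeholder since the file format is not reproduced here:

    CREATE OR REPLACE FUNCTION tokenise(url text) RETURNS text[] AS $$
        import math
        # Build the trie and frequency map once per session and cache them in SD.
        if 'trie' not in SD:
            import marisa_trie
            with open('/usr/share/dict/words') as f:
                words = set(line.strip().lower() for line in f if line.strip())
            SD['trie'] = marisa_trie.Trie(list(words))
            SD['freq'] = {}    # placeholder: load the BNC word-frequency map here
        trie, freq = SD['trie'], SD['freq']
        tokens, s = [], url
        while s:
            candidates = trie.prefixes(s)      # dictionary words that are prefixes of s
            if not candidates:
                s = s[1:]                      # no word starts here; skip one character
                continue
            best = max(candidates, key=lambda w: math.log(freq.get(w, 1)) * len(w))
            tokens.append(best)
            s = s[len(best):]
        return tokens
    $$ LANGUAGE plpython3u;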
The complete algorithm involves two other preprocessing steps. We first convert each URL to lower case. The URL is then broken into components by splitting on special characters like ‘.’, ‘/’, ‘-’, ‘%20’ and ‘?’. Tokenisation is finally done on the individual components. All these steps can be implemented easily in SQL with standard string functions.
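For illustration, the same preprocessing sketched in Python; the split pattern includes a few extra separators (':', '_', '=', '&') beyond those listed above:

    import re

    def preprocess(url):
        """Lower-case the URL and split it into components on separator characters."""
        parts = re.split(r'%20|[./?\-_=&:]', url.lower())
        return [p for p in parts if p]

    print(preprocess('http://19.46.59.15/paypal%20cmd=_login-run.php'))
    # ['http', '19', '46', '59', '15', 'paypal', 'cmd', 'login', 'run', 'php']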
The main advantages of the above greedy algorithm are that it is fast and works reasonably well on a range of URLs. If there is a need to further improve the tokenisation algorithm, one can adopt a more globally optimal solution by solving the following optimisation problem:
Score(w) := max { score(h) + Score(t) : h : t = w },
where the maximisation is over all splittings {h : t | h : t = w} of the URL w into a dictionary-word prefix h and the rest of the URL t, and Score of the empty string is taken to be 0. A fairly efficient dynamic programming algorithm can be constructed in a straightforward manner to solve the above problem.
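A sketch of that dynamic-programming formulation, memoised over suffixes of the input; it reuses the same toy dictionary, frequency map and score function as the greedy sketch above and is illustrative only:

    import math

    def score(w, freq):
        return math.log(freq.get(w, 1)) * len(w)

    def best_tokenisation(s, word_set, freq, memo=None):
        """Score(s) via the recurrence above; returns (best total score, tokens)."""
        if memo is None:
            memo = {}
        if s == "":
            return 0.0, []
        if s in memo:
            return memo[s]
        best = (float("-inf"), [s])    # fallback when no prefix of s is a word
        for i in range(1, len(s) + 1):
            h, t = s[:i], s[i:]
            if h not in word_set:
                continue
            rest_score, rest_tokens = best_tokenisation(t, word_set, freq, memo)
            total = score(h, freq) + rest_score
            if total > best[0]:
                best = (total, [h] + rest_tokens)
        memo[s] = best
        return best

    words = {"hungry", "hung", "ry", "go", "where", "com"}
    freqs = {"hungry": 50, "hung": 20, "ry": 2, "go": 5000, "where": 3000, "com": 100000}
    print(best_tokenisation("hungrygowhere", words, freqs))
    # (~80.5, ['hungry', 'go', 'where'])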
In Part 2 of this post, we will look at the problem of learning a URL classifier model from the preprocessed Shalla dataset using Support Vector Machines.