Telcos everywhere are working on initiatives to better monetise their data. For many of them, a key challenge in addressing customer requirements is the lack of labelled data. For example, a customer may come along with a request: “Tell me something about the shopping behaviour of housewives in the country”. This seemingly simple question is actually not easy to answer.
To begin with, a typical telco would not know with 100% confidence who among their subscribers are housewives. There are several complexities. The occupation field in the customer table, if that exists, may not be filled. Even if it is filled, it may no longer be accurate, depending on when the data was collected. And then there are cases where a phone number is associated with a housewife but registered under the husband’s name. Et cetera.
The typical response to such a problem is to construct some business rules to capture our intuitive understanding of “housewife-ness”. Here are a few examples:
- Rule 1: A person is likely to be a housewife if she mostly stays at home during standard office hours (9am-6pm), as determined from the coordinates of a mobile device over time.
- Rule 2: A person is likely to be a housewife if she has a web browsing history that includes visits to cooking websites, fashion magazines, and school/education websites for those with children.
- Rule 3: A person is likely to be a housewife if she spends a considerable amount of time talking on the phone to other housewives (we have a recursive concept here).
These are all fairly reasonable rules, but how can we actually use them in a principled way to answer customer queries, and how can we know whether they are any good without some ground truth — labelled data — to calibrate their accuracy?
There is actually a simple solution here: combine all three business rules and classify someone as a housewife only if all three, or a majority, of the rules are satisfied. If we can quantify the accuracy of each rule, and the rules err independently of one another, then the accuracy of the combined rule can be determined easily. Ideally, we want each rule to be more than 50% accurate, that is, better than a coin flip.
To see how this works, suppose Rules (1)-(3) are each at least 60% accurate and they are independent. What is the accuracy of the model that only outputs Yes when a majority of the three rules are satisfied?
The model makes a wrong prediction when
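- all three rules are wrong, which happens with probability at most 0.4 * 0.4 * 0.4 = 0.064; or
- exactly two out of three rules are wrong, which happens with probability at most 3 * 0.6 * 0.4 * 0.4 = 0.288 (the factor of 3 counts the three ways of choosing which rule is the one that is right).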
Adding the two numbers up, we get an error rate of at most 0.352 for the majority-voting model, or at least 64.8% accuracy; not bad given that we started with rules that are assumed to be, on their own, not that accurate. Naturally, the higher the accuracy of the individual rules, the higher the accuracy of the majority-voting model. Having more rules also helps, provided they remain independent and each is better than chance.
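The arithmetic above extends to any odd number of rules. The following is a minimal sketch, assuming every rule has the same accuracy `p` and that the rules err independently; the function name `majority_vote_accuracy` is ours, chosen for illustration only.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n rules is right.

    Assumes (for illustration) that each rule is right with the same
    probability p and that the rules err independently of one another.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of rules to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct rules that forms a majority
    # Sum the binomial probabilities of k_min, ..., n rules being right.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


print(round(majority_vote_accuracy(0.6, 3), 3))  # 0.648, matching the figure above
print(round(majority_vote_accuracy(0.6, 5), 3))  # 0.683 with five such rules
```

With three rules at 60% accuracy this reproduces the 64.8% figure, and going to five equally accurate independent rules lifts it to roughly 68.3%.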
One can also calculate the point probabilities if we have an estimate of the base probabilities. For example, assuming 30% of the subscriber population are housewives, Pr(Housewife | all three rules are satisfied) can be shown to be at least 0.5912 by a straightforward application of Bayes' rule and the independence of the rules.
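The 0.5912 figure can be reproduced with a few lines of code. This is a sketch under one particular reading of "accuracy", which is our assumption: each rule is satisfied with probability 0.6 for a housewife and with probability 0.4 for anyone else, and the rules are independent given the true label. The function name is hypothetical.

```python
def posterior_all_rules_satisfied(prior: float, accuracy: float, n_rules: int = 3) -> float:
    """Pr(Housewife | all n_rules rules are satisfied) via Bayes' rule.

    Assumption (ours): a rule is satisfied with probability `accuracy` for a
    housewife and with probability 1 - accuracy otherwise, independently
    across rules given the true label.
    """
    likelihood_if_housewife = accuracy**n_rules
    likelihood_if_not = (1 - accuracy) ** n_rules
    # Total probability that all rules are satisfied, over both classes.
    evidence = prior * likelihood_if_housewife + (1 - prior) * likelihood_if_not
    return prior * likelihood_if_housewife / evidence


print(round(posterior_all_rules_satisfied(prior=0.3, accuracy=0.6), 4))  # 0.5912
```

Under these assumptions the computation is 0.3 * 0.216 / (0.3 * 0.216 + 0.7 * 0.064) = 0.0648 / 0.1096, so three weak rules agreeing roughly doubles the 30% base rate.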
The problem given here, in its general form, is exactly the problem facing the data-monetisation divisions of many telcos today. The solution proposed here, in its general form, is a way forward. All we need are rules that are (conservatively believed or estimated to be) more right than wrong.