Everyone with a rudimentary understanding of text analytics knows about term-frequency-inverse-document-frequency (TF-IDF) vectors. What is less well known, but deserves to be more widely appreciated and applied, is a closely related trick called Explicit Semantic Analysis (ESA), introduced by E. Gabrilovich and S. Markovitch, for which they were awarded the 2014 IJCAI-JAIR Best Paper Prize (https://www.jair.org/bestpaper.html).
ESA was originally introduced to compute “semantic relatedness” between texts by exploiting a rich human-compiled corpus like Wikipedia. Since its introduction, the technique has been shown to have much wider application and has been used to significantly improve many different text analysis algorithms.
The basic idea behind ESA is to treat each topic in Wikipedia (or some such corpus) as a concept and construct a (normalised) TF-IDF matrix M of dimension |W| × |C|, where W is the set of all words and C is the set of all concepts. Each row of M gives, in weighted form, the concepts associated with one word. The semantic representation esa({w1,…,wn}) of a text consisting of words {w1,…,wn} is simply the centroid (or some other sensible notion of “average”) of the weighted concept vectors associated with those words. In other words, we simply compute the TF-IDF vectors for each article in Wikipedia, then transpose the resulting matrix and normalise it to arrive at something much more interesting.
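The construction can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the three-article corpus, the names ARTICLES, build_esa_matrix and esa, the whitespace tokeniser, and the smoothed IDF formula are all hypothetical choices made for the example, not the authors' actual pipeline.

```python
import math
from collections import Counter, defaultdict

# Hypothetical three-article "corpus"; each key plays the role of a concept.
ARTICLES = {
    "Fruit": "apple banana fruit orchard tree fruit",
    "Computer": "apple computer tablet software hardware computer",
    "Music": "guitar piano song concert song",
}


def build_esa_matrix(articles):
    """Return {word: {concept: weight}}, each word row scaled to unit length."""
    n_docs = len(articles)
    counts = {c: Counter(text.lower().split()) for c, text in articles.items()}
    # Document frequency: in how many articles does each word occur?
    df = Counter(word for tf in counts.values() for word in tf)

    raw = defaultdict(dict)
    for concept, tf in counts.items():
        total = sum(tf.values())
        for word, count in tf.items():
            # TF-IDF of `word` in this article (smoothed IDF, one common variant).
            weight = (count / total) * math.log(1 + n_docs / df[word])
            # Indexing by word first is the "transpose" step.
            raw[word][concept] = weight

    matrix = {}
    for word, row in raw.items():
        norm = math.sqrt(sum(w * w for w in row.values()))
        matrix[word] = {c: w / norm for c, w in row.items()}
    return matrix


def esa(text, matrix):
    """Centroid of the concept vectors of the words in `text` found in the matrix."""
    rows = [matrix[w] for w in text.lower().split() if w in matrix]
    centroid = defaultdict(float)
    for row in rows:
        for concept, weight in row.items():
            centroid[concept] += weight / len(rows)
    return dict(centroid)


M = build_esa_matrix(ARTICLES)
print(M["tablet"])               # concept weights for a single word
print(esa("piano software", M))  # centroid over two word rows
```

Words that never occur in the corpus are simply skipped here; a real system would also need proper tokenisation, stop-word handling and pruning of weak concept weights.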
For example, the ESA representation of a single word like “Apple” will be a concept vector weighted towards both “fruit-ness” and “computer-ness”, because the word on its own is ambiguous. However, a longer sentence like “Apple releases a new tablet” will give us a concept vector that is weighted much more heavily towards “computer-ness” than “fruit-ness”, since the ESA representation of the sentence is effectively the average of the concept vectors for its content words, here “Apple” and “tablet”.
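The arithmetic behind this example is easy to check by hand. In the sketch below the concept weights are made up purely for illustration (they are not taken from any real Wikipedia-derived matrix), and the words “releases”, “a” and “new” are assumed to carry no concept weight and are skipped.

```python
# Made-up concept weights, chosen only to illustrate the averaging step.
CONCEPTS = ("fruit", "computer")
WORD_VECTORS = {
    "apple": (0.5, 0.5),   # ambiguous on its own
    "tablet": (0.0, 1.0),  # assumed to point only at "computer"
}


def centroid(words):
    """Average the concept vectors of the words we have vectors for."""
    known = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    return tuple(sum(column) / len(known) for column in zip(*known))


sentence = "apple releases a new tablet".split()
print(dict(zip(CONCEPTS, centroid(sentence))))
# prints {'fruit': 0.25, 'computer': 0.75}
```

Adding “tablet” pulls the average from an even split to three quarters “computer”, which is the disambiguation effect described above.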
In practice, one can replace the TF-IDF vector constructed from {w1,w2,…,wn} with the esa({w1,w2,…,wn}) vector in most, if not all, text analysis tasks and get better results straightaway. Now, that’s a useful trick to know!
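To see why the swap can help, consider two short texts that share no words. The sketch below compares their cosine similarity under a plain bag-of-words representation (raw term counts are used instead of full TF-IDF to keep it short) and under an ESA-style representation. The word-to-concept weights in M are hypothetical stand-ins for rows of a real ESA matrix, and the function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical word-to-concept weights standing in for rows of the matrix M.
M = {
    "tablet": {"Computer": 1.0},
    "software": {"Computer": 1.0},
    "orchard": {"Fruit": 1.0},
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def bag_of_words(text):
    """Word-count vector (a simplified stand-in for a TF-IDF vector)."""
    return dict(Counter(text.lower().split()))


def esa(text):
    """Centroid of the concept vectors of the words found in M."""
    rows = [M[w] for w in text.lower().split() if w in M]
    vec = {}
    for row in rows:
        for concept, weight in row.items():
            vec[concept] = vec.get(concept, 0.0) + weight / len(rows)
    return vec


a, b = "new tablet", "software update"
print(cosine(bag_of_words(a), bag_of_words(b)))  # no shared words
print(cosine(esa(a), esa(b)))                    # shared concept
```

With no words in common, the word-level vectors are orthogonal, while the concept-level vectors overlap because both texts map to the same concept.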