Vector representations of categories
After thinking for a while about the last point in my list of ML topics to explore…
I finally concluded that I'd better use some graphical model, since there is a large variety of possible algorithms with no guarantee that any of them will actually work.
So, first: one could probably use LDA (Latent Dirichlet Allocation, wiki) to find good representations for categories.
In this case, we should use the bag-of-words model, with categories standing in for words.
Admittedly, the LDA model doesn't seem entirely appropriate here, though I'm sure it would still give reasonable representations.
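For a concrete picture, here is a minimal sketch of that bag-of-words trick with scikit-learn's LatentDirichletAllocation; the tiny event table, the column names, and the number of topics are all invented for illustration, not a fixed recipe:

```python
# A minimal sketch of the LDA idea, assuming scikit-learn is available.
# Each event becomes a "document" whose "words" are tokens like "site_id=42",
# and the LDA topic distributions then serve as dense representations of those tokens.
# The event data and column names below are invented for illustration.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

events = [
    {"site_id": "42", "app_id": "7",  "device_type": "phone"},
    {"site_id": "42", "app_id": "13", "device_type": "tablet"},
    {"site_id": "99", "app_id": "7",  "device_type": "phone"},
]

# Turn every event into a pseudo-document of "category=value" tokens.
docs = [" ".join(f"{k}={v}" for k, v in e.items()) for e in events]

vectorizer = CountVectorizer(token_pattern=r"\S+")  # keep tokens like "site_id=42" intact
X = vectorizer.fit_transform(docs)                  # bag-of-words matrix: events x tokens

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(X)

# lda.components_ has shape (n_topics, n_tokens); each column, normalized,
# can be read as a topic-space representation of one category value.
token_vectors = (lda.components_ / lda.components_.sum(axis=0)).T
print(dict(zip(vectorizer.get_feature_names_out(), token_vectors)))
```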
Another graphical model I came up with is much simpler (and was probably invented long ago by someone else, but I don't know). Its generative process looks like this:
- First, a *topic* of the event is generated: an $n$-dimensional vector $t_{event} \in \mathbb{R}^n$ drawn from a Gaussian distribution.
- For each category, say `site_id`, each value of `site_id` has its own topic $t_{cat\_value}$ of the same dimension $n$. The greater the dot product $(t_{cat\_value}, t_{event})$, the higher the probability that this value of the category is chosen. I should stress that I'm not considering an arbitrary number of categories; I'm currently interested in the case where the number and types of categories are fixed.
- To be more precise, the probability of each value within a category being drawn for an event with topic $t_{event}$ is proportional to:

\[p(\text{cat\_value} \mid \text{event}) \propto p(\text{cat\_value}) \times e^{(t_{cat\_value},\, t_{event})}\]

So, to compute the final probability, one applies the softmax function over all values of the category; a small sketch of this sampling step follows below.
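Here is a minimal NumPy sketch of this generative process; the dimensionality, the uniform prior, and all variable names are my own toy choices, not part of any existing implementation:

```python
# A minimal NumPy sketch of the generative model above (a toy illustration;
# dimensions, prior, and names are invented).
import numpy as np

rng = np.random.default_rng(0)
n = 8                      # dimensionality of the topic space
n_values = 5               # number of values in one category, e.g. site_id

# Topic vectors for each value of the category, and a prior p(cat_value).
t_cat_values = rng.normal(size=(n_values, n))
prior = np.full(n_values, 1.0 / n_values)

def sample_category_value(t_event):
    """Draw one value of the category for an event with topic t_event."""
    # Unnormalized scores: p(cat_value) * exp((t_cat_value, t_event))
    scores = prior * np.exp(t_cat_values @ t_event)
    probs = scores / scores.sum()          # softmax-style normalization
    return rng.choice(n_values, p=probs), probs

# Generate one event: a topic drawn from a Gaussian, then a category value for it.
t_event = rng.normal(size=n)
value, probs = sample_category_value(t_event)
print(value, probs.round(3))
```

Note that the normalization in `sample_category_value` sums over every value of the category, which is exactly where the speed concern below comes from.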
That said, the main problem I expect to run into is extremely slow training when some category has a very large number of values (there are cases where a single category has more than 10,000 distinct values to process).
PS: After writing this, I realized that using decision trees to generate leaf IDs (new categories), combined with running LibFM over these new features, is a very interesting and good idea
(this idea appeared recently in one of the Kaggle challenges).
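A rough sketch of that pipeline, using scikit-learn's gradient boosting as the tree ensemble and stopping right before the LibFM step; the dataset and parameters are invented for illustration:

```python
# A rough sketch of the idea from the PS (my own illustration, not the exact
# pipeline from the Kaggle challenge): a gradient-boosted tree ensemble maps
# each example to the IDs of the leaves it lands in, and those leaf IDs become
# new categorical features that a factorization machine such as LibFM can consume.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0)
gbt.fit(X, y)

# apply() returns, for every example, the index of the leaf reached in each tree:
# shape (n_samples, n_estimators, 1) for binary classification.
leaves = gbt.apply(X).reshape(X.shape[0], -1)

# One-hot encode the leaf indices: each (tree, leaf) pair becomes a binary feature.
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaves)
print(leaf_features.shape)  # (500, total number of distinct leaves across all trees)

# The sparse leaf_features matrix would then be written out in libSVM/libFM format
# and passed to LibFM, which learns pairwise interactions between the new categories.
```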