After mulling for a while over the last point in my list of machine learning topics to think about...

I finally concluded that I'd better use some graphical model, since there is a vast space of possible algorithms with no guarantee that any of them will actually work.

So, first thing: one could probably use LDA (Latent Dirichlet allocation, wiki) to find a good representation for categories. In this case we would use the bag-of-words model, with `categories` standing in for words.
That said, the LDA model doesn't seem quite appropriate here, though I'm sure it would give reasonable representations.
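To make the "categories as words" analogy concrete, here is a minimal sketch using scikit-learn's LDA implementation. The event strings, category names, and topic count are all made up for illustration:

```python
# Hypothetical sketch: treat each event's category values as "words"
# and fit LDA over the resulting bag-of-categories (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each event is a "document" made of its categorical values.
events = [
    "site_42 country_US device_mobile",
    "site_42 country_US device_desktop",
    "site_7 country_DE device_mobile",
    "site_7 country_DE device_desktop",
]

# token_pattern=r"\S+" keeps whole value tokens like "site_42" intact.
counts = CountVectorizer(token_pattern=r"\S+").fit_transform(events)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Rows of lda.components_ give, per topic, a weight for every category value;
# transposed, each category value gets a low-dimensional representation.
value_repr = lda.components_.T  # shape: (n_category_values, n_topics)
```

The learned `value_repr` rows are the "good representations for categories" mentioned above, one vector per distinct category value.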

Another graphical model I came up with is much simpler (and was probably developed a long time ago, but I don't know of it).
Its generative model looks like this:

  1. First, a `topic` for the event is generated. That's an n-dimensional vector $t_{event} \in \mathbb{R}^n$ drawn from a Gaussian distribution.
  2. For each category, say `site_id`, each value of the category has its own topic $t_{cat.value}$ of the same dimension $n$. The greater the dot product $(t_{cat.value}, t_{event})$, the higher the probability that this value of the category will be chosen. I should stress that I'm not considering an arbitrary number of categories; I'm currently interested in the case where the number and types of categories are fixed.
  3. To be more precise, the probability of each value within a category being drawn for an event with topic $t_{event}$ is proportional to $$ p(\text{cat.value} \mid \text{event}) \propto p(\text{cat.value}) \times e^{(t_{cat.value},\, t_{event})} $$ so, to compute the final probability, one uses the softmax function.
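The three steps above can be sketched in a few lines of numpy. All sizes and names here are illustrative, not from the original model:

```python
# Minimal sketch of the generative model described above (numpy only).
import numpy as np

rng = np.random.default_rng(0)
n = 8            # topic dimension (illustrative)
n_values = 5     # number of values in one category, e.g. site_id

# Per-value topic vectors t_{cat.value} and prior p(cat.value).
t_values = rng.normal(size=(n_values, n))
prior = np.full(n_values, 1.0 / n_values)

# Step 1: draw the event topic t_event from a Gaussian.
t_event = rng.normal(size=n)

# Steps 2-3: p(cat.value | event) is a softmax over
# log p(cat.value) + (t_{cat.value}, t_event).
logits = t_values @ t_event + np.log(prior)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sample the observed category value for this event.
value = rng.choice(n_values, p=probs)
```

Subtracting `logits.max()` before exponentiating is the usual numerically stable way to compute the softmax.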
That said, the problem I currently expect to run into is extremely slow training when the number of values in some category is very large (there are cases where the number of values one has to process exceeds 10,000), since the softmax normalization touches every value.
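A standard remedy for this cost, not discussed in the post itself, is negative sampling in the style of word2vec: score the observed value against a handful of sampled "negative" values instead of normalizing over all of them. A hedged sketch, with made-up sizes:

```python
# Illustrative sketch: negative sampling to avoid the full softmax.
import numpy as np

rng = np.random.default_rng(1)
n, n_values, k = 8, 10000, 5   # k negatives per observed value (illustrative)

t_values = rng.normal(scale=0.1, size=(n_values, n))
t_event = rng.normal(size=n)
positive = 123                  # the value actually observed for this event

# Instead of a softmax over all 10000 values, contrast the positive
# with k sampled negatives: O(k) work per update instead of O(n_values).
negatives = rng.choice(n_values, size=k, replace=False)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Logistic loss: push the positive's score up, the negatives' scores down.
loss = -np.log(sigmoid(t_values[positive] @ t_event))
loss -= np.log(sigmoid(-t_values[negatives] @ t_event)).sum()
```

This trades the exact softmax gradient for a cheap stochastic approximation, which is what makes training feasible with tens of thousands of category values.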

P.S. After writing this I realized that using decision trees to generate leaf ids as new 'categories', plus LibFM over these new features, was a very interesting and good idea.
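The P.S. idea can be sketched as follows; this assumes scikit-learn, uses a random forest as the stand-in tree ensemble, and stops at the one-hot features that would be fed to LibFM (LibFM itself is not shown, and all data here is synthetic):

```python
# Sketch: tree leaf indices as new categorical features for a
# factorization machine (assumes scikit-learn; data is synthetic).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=10, max_depth=3, random_state=0
).fit(X, y)

# apply() returns, for each sample, the leaf index it falls into in
# each tree; each tree thus contributes one new "category".
leaves = forest.apply(X)                        # shape: (200, 10)

# One-hot encode leaf ids into the sparse format LibFM-style models expect.
onehot = OneHotEncoder().fit_transform(leaves)
```

Each tree acts as one categorical feature whose values are its leaves, which is exactly the "leaf ids as new categories" trick.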