Computing Kneser-Ney smoothed language models

Kneser-Ney smoothing usually produces the best language model performance. Can we compute such models using MapReduce? This project would work well with ongoing work at Edinburgh on building and serving randomised language models. The randlm distribution already has MapReduce code for producing ngram-count pairs, and this could be extended.
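As a sketch of what the smoothing step involves once ngram counts are available: interpolated Kneser-Ney needs, beyond plain counts, the continuation counts (how many distinct contexts each word follows), which is exactly the kind of aggregate a second MapReduce pass over ngram-count pairs could produce. The function below is an in-memory illustration for bigrams only; the function names and the fixed absolute discount of 0.75 are assumptions for the example, not part of randlm.

```python
from collections import Counter, defaultdict

def bigram_kneser_ney(tokens, discount=0.75):
    """Return an interpolated Kneser-Ney probability function P(w | v)
    for bigrams, built from a token sequence (illustrative sketch)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # context counts: occurrences of each word as a bigram history
    contexts = Counter(tokens[:-1])
    # continuation sets: distinct left contexts each word appears after,
    # and distinct right continuations each context is followed by
    left_contexts = defaultdict(set)
    followers = defaultdict(set)
    for (v, w) in bigrams:
        left_contexts[w].add(v)
        followers[v].add(w)
    total_bigram_types = len(bigrams)

    def prob(v, w):
        # continuation probability: fraction of bigram types ending in w
        p_cont = len(left_contexts[w]) / total_bigram_types
        # interpolation weight: mass freed up by discounting
        lam = discount * len(followers[v]) / contexts[v]
        return max(bigrams[(v, w)] - discount, 0) / contexts[v] + lam * p_cont

    return prob
```

For a given context v, the discounted bigram term and the backed-off continuation term together sum to 1 over the vocabulary, which is a quick sanity check on any distributed implementation of the counts.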


We have billions of words of text available for training such models.