Friday, March 26, 2010

Approximate intersection of two lists

Search engines mantains a word index made up by a dictionary of words and, for each work a list of documents containing the. For instance

impossible -> 343, 5, 63459, 4, ....., 32
mission -> 3449, 558, ...., 49

Suppose the two lists are very long. For instance, the word 'impossible' can be contained in 100*10^6 documents, while the word mission can be contained in 50*10^6 documents.

1) return the documents containing the words 'impossible mission' (documents must contain both the words)
2) what is the complexity?
3) can you accellerate the computation?
4) how to compute the size of the lists intersection?
5) can you estimate this size in a fast way?

No comments:

Post a Comment