Friday, October 5, 2012

Find all the anagrams in a collection of strings

1 comment:

  1. 1) Populate an HashMap with the sorted version of each word in the Collection. I.e. map.put(sort(w)) for each w in dict.

    2) At step 1 associate to the inserted key an array (better a set) storing all the words that sort in the same way. I. e. if (map.contains(sort(w))) map.get(sort(w)).add(w) for each w in dict.

    3) Sort the input word and look it up in the HashMap.

    4) Print the array associated to the key from the HashMap, if the key is present.

    If dict is too big to fit in memory, sort it with an external merge sort. The sort has to compare the sorted version of the words, not the words as is. Then binary search of the sorted version of input word on the sorted file. In order to speed up the things you can build a shorter list of words from the sorted dict. Each word has also a pointer to the original sorted file. This way you "bucketize" the file.
    You can leverage this smaller file for the binary search (perhaps this smaller file can fit into memory). Binary-search this smaller file with the sorted version of the input word, get the pointer to the sorted dict on disk, and scan the file until you encounter the words that sort in the same way. Print that words.

    ReplyDelete