Methods, systems, and apparatus for performing simhash based spell correction are provided. A character string is simhashed to generate a simhashed character string.

Now, instead of matching 9996 shingles to 19996 other shingles, we are comparing 200 shingles to another 200 shingles.
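The text does not say how the shingle sets are cut down to 200 values each. One common way to do it is a MinHash-style signature, so the sketch below assumes that technique. The names `seeded_hash`, `minhash_signature` and `estimate_similarity` are hypothetical, and using seeded MD5 digests as the hash family is an illustrative choice, not something taken from the original.

```python
import hashlib


def seeded_hash(shingle: str, seed: int) -> int:
    """Hash a shingle together with a seed, giving one hash function per seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest(), "big")


def minhash_signature(shingles: set, num_hashes: int = 200) -> list:
    """Keep only the smallest hash value per hash function (assumed MinHash scheme)."""
    if not shingles:
        raise ValueError("cannot build a signature from an empty shingle set")
    return [
        min(seeded_hash(s, seed) for s in shingles)
        for seed in range(num_hashes)
    ]


def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; this approximates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical shingle sets, used only for illustration.
shingles_a = {"the quick brown fox jumps", "quick brown fox jumps over"}
shingles_b = {"quick brown fox jumps over", "brown fox jumps over the"}

sig_a = minhash_signature(shingles_a)
sig_b = minhash_signature(shingles_b)
print(len(sig_a), estimate_similarity(sig_a, sig_b))
```

Under this assumption, every document is represented by exactly 200 numbers no matter how many shingles it contains, so comparing two documents costs 200 equality checks.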

In our case we have 3 common elements: "chair", "rug" and "keyboard". So the Jaccard Coefficient = 3 / (8 - 3) = 0.6, or 60%, that is, the number of shared elements divided by the number of distinct elements across both sets. This tells us that these two documents should be compared for their similarity. This is useful when you have a document, and you want to know which other documents to compare to it for similarity. There is an online calculator that you can play around with to determine the similarity of two sets.

While a document can be thought of as a giant set of words, we don't just break down a document into individual words, place them in a set and calculate the similarity, because that loses the importance of the order of the words.
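The Jaccard calculation above can be reproduced with Python's built-in sets. This is a minimal sketch: the function name `jaccard` is illustrative, and "desk" and "lamp" are made-up elements, chosen only so that the two sets hold 8 elements in total with 3 in common, as in the example.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        # Assumption: treat two empty sets as identical.
        return 1.0
    return len(a & b) / len(a | b)


# "desk" and "lamp" are hypothetical; only the three shared elements
# come from the example in the text.
doc_a = {"chair", "rug", "keyboard", "desk"}
doc_b = {"chair", "rug", "keyboard", "lamp"}

print(jaccard(doc_a, doc_b))  # 3 shared / 5 distinct = 0.6
```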

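So if a document contained a single sentence of "The quick brown fox jumps over the lazy dog", that would be broken down into the following 5-word-long shingles: "The quick brown fox jumps", "quick brown fox jumps over", "brown fox jumps over the", "fox jumps over the lazy" and "jumps over the lazy dog".

So we now have a way to compare two documents for similarity, but it is not an efficient process.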
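The shingling step described above can be sketched in a few lines of Python. The function name `word_shingles` is hypothetical, and splitting on whitespace is a simplifying assumption; a real tokenizer would probably also normalize case and punctuation.

```python
def word_shingles(text: str, size: int = 5) -> list:
    """Return every run of `size` consecutive words, in document order."""
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()  # assumption: words are separated by whitespace
    return [
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
    ]


sentence = "The quick brown fox jumps over the lazy dog"
for shingle in word_shingles(sentence):
    print(shingle)
```

A sentence of 9 words yields 9 - 5 + 1 = 5 shingles. Wrapping the result in `set(...)` gives a set that can be compared with a Jaccard-style measure, but comparing complete shingle sets for every pair of documents is what makes this approach expensive.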