Finding Similar Articles - Algorithms Research

To remove near-duplicate articles from Touchpal Dialer’s feeds, I did some research on related algorithms and implemented a few of them.

This article covers three of them:

SimHash

SimHash is a well-known algorithm that Google has applied to near-duplicate web page detection.
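
To make the idea concrete, here is a minimal sketch of SimHash in Python. It is only an illustration, not the code used for the feeds: the whitespace tokenizer, the truncated MD5 token hash, the uniform token weights, and the names `simhash` and `hamming_distance` are all assumptions made for this example. Each token votes +1 or -1 on every bit position, and the sign of each total becomes one bit of the fingerprint.

```python
import hashlib

BITS = 64  # fingerprint width; an assumption for this sketch


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of whitespace-separated tokens."""
    votes = [0] * BITS
    for token in text.lower().split():
        # Stable 64-bit token hash: first 8 bytes of MD5 (illustrative choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Every token votes +1 on bits it sets and -1 on bits it leaves clear.
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is 1 where the positive votes win.
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two articles would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below some small threshold. The right threshold depends on the data and has to be tuned.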

MinHash

pass (to be continued…)

LSHForest

I first came across this name while browsing the scikit-learn user guide.


Research notes from 2017 on text similarity algorithms for news feed deduplication.