Finding Similar Articles - Algorithms Research
Published:
Because we needed to remove similar articles from Touchpal Dialer’s feeds, I did some research on related algorithms and implemented a few of them.
I will cover three of them in this article:
SimHash
SimHash is a well-known algorithm that Google has used for near-duplicate web page detection.
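To make the idea concrete, here is a minimal sketch of how a SimHash fingerprint could be computed and compared. It is not the code used for the feeds: the word-level tokenization, the term-frequency weights, the 64-bit fingerprint size, and the use of truncated MD5 as the per-token hash are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` as an integer.

    Assumptions for this sketch: tokens are lowercase words, each token is
    weighted by its frequency, and `bits` is a multiple of 8 no larger than 128.
    """
    tokens = re.findall(r"\w+", text.lower())
    weights = Counter(tokens)

    # One signed accumulator per bit position of the fingerprint.
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit votes +weight for position i, a clear bit votes -weight.
            if (token_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight

    # Positions with a positive total become 1 bits in the fingerprint.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two articles would then be treated as similar when the Hamming distance between their fingerprints falls below some threshold. The right threshold depends on the data and has to be tuned.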
MinHash
pass (to be continued…)
LSHForest
The first time I saw this name was when I was browsing the scikit-learn user guide.
Research notes from 2017 on text similarity algorithms for news feed deduplication.
