Finding similar articles
We needed to remove near-duplicate articles from Touchpal Dialer’s feeds, so I researched the relevant algorithms and implemented a few of them.
This article covers three of them:
- the well-known simhash[https://en.wikipedia.org/wiki/SimHash]
- minhash[https://en.wikipedia.org/wiki/MinHash]
- lshforest[http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html]
simhash
Simhash is a well-known algorithm that Google has applied to detect near-duplicate web pages.
minhash
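MinHash estimates the Jaccard similarity between two sets, such as the word shingles of two articles, from short signatures instead of the full sets.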
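A rough sketch of how a minhash signature can be computed is shown here. The function names (`shingles`, `minhash_signature`, `estimate_jaccard`), the word-level shingling, and the use of salted MD5 in place of random permutations are assumptions chosen for illustration. For each of `num_perm` salted hash functions we keep the smallest hash value over the set; the fraction of positions where two signatures agree approximates the Jaccard similarity of the underlying sets.

```python
import hashlib


def shingles(text, k=3):
    # Set of k-word shingles; a text shorter than k words yields one shingle.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _salted_hash(seed, item):
    # One member of a family of hash functions, selected by `seed`.
    data = seed.to_bytes(4, "big") + item.encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(items, num_perm=64):
    """Return the minimum hash per salted hash function; `items` must be non-empty."""
    return [min(_salted_hash(seed, item) for item in items)
            for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    # Fraction of signature positions on which the two sets agree.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox jumped over the lazy dog")
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

A larger `num_perm` gives a more accurate estimate at the cost of longer signatures.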
LSHForest
I first came across this name while browsing the sklearn user guide.