Grouping similar news contents together like in GOOGLE NEWS


This is definitely a not-so-easy-to-solve problem that can be solved by:

  • smart text-parsing functions
  • raw hardware power
  • both of them
  • testing, testing, testing
  • fine-tuning at the end

First of all, I'd group the different news sources into relatively broad categories. You can usually assume a tech news source won't publish news under an economy category. (Or it will, and that's the problem.)

In most cases the news title isn't touched; it stays in its original form, or very close to it. So Category, Title, and Publish Date are a good starting point for grouping news items together.
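A minimal sketch of that first pass, assuming each article is a dict with hypothetical `id`, `category`, `title` and `published` fields (none of these names come from a real feed format). It only catches items whose titles match after light normalization, so treat it as a cheap pre-filter rather than the whole solution:

```python
import re
from collections import defaultdict
from datetime import date

def normalize_title(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())

def group_by_metadata(articles):
    """Bucket article ids that share category, normalized title and publish date."""
    groups = defaultdict(list)
    for a in articles:
        key = (a["category"], normalize_title(a["title"]), a["published"])
        groups[key].append(a["id"])
    return dict(groups)

# Made-up sample data for illustration only.
articles = [
    {"id": 1, "category": "tech", "title": "Acme Launches New Phone!", "published": date(2024, 1, 5)},
    {"id": 2, "category": "tech", "title": "ACME launches new phone", "published": date(2024, 1, 5)},
    {"id": 3, "category": "economy", "title": "Markets Rally", "published": date(2024, 1, 5)},
]
groups = group_by_metadata(articles)
```

Articles 1 and 2 end up in the same bucket because their titles differ only in case and punctuation; a reworded title would need the text comparison discussed further down.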

If you detect problems with the methods above you need some fine-tuning under the hood.

You may need to read the whole article and compare articles (thousands of them) with each other word by word.

  • There are a lot of stopwords that can distort the comparison, so you'll need to ignore these.
  • You may want to define synonyms (J Lo = Jennifer Lopez)
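The two points above could be sketched roughly like this. The stopword set and the synonym table are tiny illustrative stand-ins (assumptions, not a real word list); in practice you'd maintain much larger ones per language:

```python
import re

# Illustrative subset only; a real stopword list is far longer.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to"}

# Hypothetical alias table mapping a short form to its canonical name.
SYNONYMS = {"j lo": "jennifer lopez"}

def normalize(text):
    """Return the article's words with aliases expanded and stopwords dropped."""
    # Lowercase and turn every run of non-alphanumerics into a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    # Replace each alias with its canonical form (whole words only).
    for alias, canonical in SYNONYMS.items():
        text = re.sub(r"\b" + re.escape(alias) + r"\b", canonical, text)
    return [word for word in text.split() if word not in STOPWORDS]

tokens = normalize("J. Lo arrives at the premiere")
```

After this step "J. Lo arrives at the premiere" and "Jennifer Lopez arrives at premiere" produce the same word list, so they compare as equal.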

If the raw texts of two news items are similar (you can define a threshold value), you can then compare the other factors described above again.
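One simple way to put a number on "similar" is Jaccard similarity over the word sets of two texts. This is just one possible measure, and the 0.6 threshold below is an arbitrary assumption you'd have to tune on your own data:

```python
def jaccard(text_a, text_b):
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

THRESHOLD = 0.6  # assumed starting value, needs tuning

def is_candidate_pair(text_a, text_b, threshold=THRESHOLD):
    """True when the texts are similar enough to deserve a closer look."""
    return jaccard(text_a, text_b) >= threshold

a = "acme launches new phone today"
b = "acme launches new phone tomorrow"
c = "markets rally on rate cut"
same_story = is_candidate_pair(a, b)
other_story = is_candidate_pair(a, c)
```

Pairs that pass the threshold can then be double-checked against category and publish date before being merged into one group.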

Some news sources provide good tagging in their RSS feeds; you can use this too, but don't rely on it.

And remember, you'll need a lot of fine-tuning at the start (for about a year), then you'll be fine.


I read somewhere - but I do not have a reference - that Google News uses a variant of MinHash to detect near-duplicate news posts. And a lot of them are almost identical, coming from a press agency with only minor adaptations by the newspapers.

http://en.wikipedia.org/wiki/MinHash

has a reference and the statement that Google News used a variant of LSH and MinHash:

Das, Abhinandan S. et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th international conference on World Wide Web. ACM
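To give an idea of what MinHash does, here is a textbook-style sketch. It is not the variant from the paper above; the 3-word shingles, the 64 hash functions and the SHA-1 based hashing are all my own assumptions for illustration:

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def seeded_hash(seed, shingle):
    """Deterministic 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(text, num_hashes=64, k=3):
    """For each hash function keep the smallest hash over all shingles."""
    shingle_set = shingles(text, k)
    if not shingle_set:
        raise ValueError("cannot build a signature for empty text")
    return [min(seeded_hash(seed, s) for s in shingle_set)
            for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

wire = "the company said on monday it will open a new plant in ohio"
edited = "the company said on monday it will open a new plant in texas"
unrelated = "heavy rain is expected across the northern coast later this week"

sig_wire = minhash_signature(wire)
near_duplicate = estimated_similarity(sig_wire, minhash_signature(edited))
different = estimated_similarity(sig_wire, minhash_signature(unrelated))
```

The point of the signature is that you compare 64 numbers per article instead of the full texts, and LSH can then bucket signatures so you don't have to compare every pair.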


I don't see any question here, but I would start by developing some sort of fingerprint algorithm, using words, names, titles, dates etc. from the articles. Then I would check the similarity of the fingerprints to find identical articles, maybe with some sort of MapReduce job to easily spread the work across different servers in a cluster.
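A toy, single-process sketch of that idea, just to show the shape of the map and reduce steps. The fingerprint here (longer title words plus the publish date) and the `id`, `title`, `date` fields are hypothetical; a real job would run the two phases on a cluster framework and use a richer fingerprint:

```python
from collections import defaultdict
from itertools import combinations

def fingerprint(article):
    """Hypothetical fingerprint: title words longer than 3 letters plus the date."""
    tokens = {w for w in article["title"].lower().split() if len(w) > 3}
    tokens.add(article["date"])
    return tokens

def map_phase(articles):
    """Emit one (token, article id) pair per fingerprint token."""
    for article in articles:
        for token in fingerprint(article):
            yield token, article["id"]

def reduce_phase(pairs):
    """Group ids by token, then count shared tokens for every article pair."""
    by_token = defaultdict(set)
    for token, article_id in pairs:
        by_token[token].add(article_id)
    shared = defaultdict(int)
    for ids in by_token.values():
        for x, y in combinations(sorted(ids), 2):
            shared[(x, y)] += 1
    return dict(shared)

# Made-up sample data for illustration only.
articles = [
    {"id": 1, "title": "Acme launches new phone", "date": "2024-01-05"},
    {"id": 2, "title": "Acme unveils new phone", "date": "2024-01-05"},
    {"id": 3, "title": "Markets rally after rate cut", "date": "2024-01-06"},
]
shared_tokens = reduce_phase(map_phase(articles))
```

Pairs with a high shared-token count are the candidates to inspect more closely; articles that share nothing never get compared at all, which is what keeps the work manageable.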

If you want some inspiration, check out the source code for Google Living Stories: http://code.google.com/p/living-stories/