Grouping similar news content together, as in Google News

This is definitely not an easy problem to solve, but it can be tackled with:

smart text-parsing functions
raw hardware power
both of them
testing, testing, testing
fine-tuning at the end

First of all, I'd assign the different news sources to some relatively broad categories. You can usually assume that a tech news source won't publish news under an economy category. (Or it will, and that's the problem.)

In most cases the news title isn't touched; it stays in its original form for the most part. So category, title, and publish date are a good starting point for grouping news items into one story.
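As a rough illustration of that first pass, here is a minimal sketch that groups articles sharing category, title, and publish date. The article fields (`id`, `category`, `title`, `published`) and the title normalization are assumptions made for the example, not part of any particular feed format.

```python
import re
from collections import defaultdict
from datetime import date

def normalize_title(title):
    """Lowercase the title and drop punctuation so trivial edits don't matter."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))

def group_by_metadata(articles):
    """Group article ids that share category, normalized title, and publish date."""
    groups = defaultdict(list)
    for article in articles:
        key = (
            article["category"],
            normalize_title(article["title"]),
            article["published"],
        )
        groups[key].append(article["id"])
    return list(groups.values())

# Hypothetical sample data.
articles = [
    {"id": 1, "category": "tech", "title": "New Phone Released!", "published": date(2024, 1, 5)},
    {"id": 2, "category": "tech", "title": "New phone released", "published": date(2024, 1, 5)},
    {"id": 3, "category": "economy", "title": "Markets fall", "published": date(2024, 1, 5)},
]
groups = group_by_metadata(articles)
```

Exact matching like this only catches titles that really are untouched; anything reworded needs the text comparison discussed further down.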

If you detect problems with the methods above you need some fine-tuning under the hood.

Maybe you need to read the whole article and compare two (thousands of) articles word-by-word.

There are a lot of stopwords that can distort the comparison, so you'll need to ignore them.
You may also want to define synonyms (J Lo = Jennifer Lopez).
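A minimal sketch of that preprocessing step might look like this. The stopword set and the synonym table are tiny placeholders chosen for the example; a real system would need much larger lists.

```python
import re

# Placeholder lists, assumed for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "at"}
SYNONYMS = {"j lo": "jennifer lopez"}

def normalize_text(text):
    """Lowercase, expand known aliases, tokenize, and drop stopwords."""
    text = text.lower()
    for alias, canonical in SYNONYMS.items():
        # Replace the alias only where it appears as whole words.
        text = re.sub(r"\b" + re.escape(alias) + r"\b", canonical, text)
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]

tokens = normalize_text("J Lo is in the studio")
```

With this, "J Lo is in the studio" and "Jennifer Lopez at the studio" reduce to the same token list.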

If the raw texts of news are similar (you can define a threshold value) you can compare the other factors again (described above).
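One simple way to turn "similar" into a number is Jaccard similarity on the sets of remaining words. This is just one possible measure, and the threshold of 0.6 below is an arbitrary starting value that you would have to tune.

```python
def jaccard(tokens_a, tokens_b):
    """Share of distinct tokens that the two token lists have in common."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)

SIMILARITY_THRESHOLD = 0.6  # assumed value, needs tuning on real data

def is_probably_same_story(tokens_a, tokens_b, threshold=SIMILARITY_THRESHOLD):
    """True if the texts are similar enough to re-check category, title, date."""
    return jaccard(tokens_a, tokens_b) >= threshold

a = ["jennifer", "lopez", "announces", "new", "album"]
b = ["jennifer", "lopez", "announces", "album", "tour"]
c = ["markets", "fall", "after", "rate", "decision"]
```

Here `a` and `b` share 4 of 6 distinct words, so they pass the threshold, while `a` and `c` share none.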

Some news sources provide good tagging in their RSS feeds; you may be able to use this too, but don't rely on it.

And remember, you'll need a lot of fine-tuning at the start (about 1 year); after that you'll be fine.

php rss cluster-analysis feed

I read somewhere - but I do not have a reference - that Google News uses a variant of MinHash to detect near-duplicate news posts. And a lot of them are almost identical, coming from a press agency with only minor adaptations by the newspapers.

The Wikipedia article on MinHash (http://en.wikipedia.org/wiki/MinHash) has a reference and the statement that Google News used a variant of LSH and MinHash:

Das, Abhinandan S. et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th international conference on World Wide Web. ACM
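To give an idea of how MinHash works in general, here is a small textbook-style sketch. It is not the Google News variant; the three-word shingles, the 64 hash functions, and the SHA-1 based hashing are all choices made for this example.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

near_a = "the agency reported that the new phone will launch next week in europe"
near_b = "the agency reported that the new phone will launch next week in asia"
far = "central bank raises interest rates after inflation surprise"

sig_a = minhash_signature(shingles(near_a))
sig_b = minhash_signature(shingles(near_b))
sig_far = minhash_signature(shingles(far))
```

The point of the signatures is that they are short and fixed-size, so near-duplicates can be found without comparing full texts pairwise. The two agency-style texts should score high against each other and the unrelated one close to zero.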


I don't see any question here, but I would start by developing some sort of fingerprint algorithm, using words, names, titles, dates, etc. from the articles. Then I would check the similarity of the fingerprints to find identical articles, maybe with some sort of MapReduce job to easily spread the work across different servers in a cluster.

If you want some inspiration, check out the source code for Google Living Stories: http://code.google.com/p/living-stories/
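The fingerprint idea could be sketched roughly like this, with the map and reduce steps simulated in a single process. The fingerprint definition (longer title words plus the publish date), the field names, and the `min_shared` cutoff are all hypothetical; on a real cluster the two phases would run as distributed jobs.

```python
from collections import defaultdict
from itertools import combinations

def fingerprint(article):
    """Hypothetical fingerprint: distinct longer title words plus the publish date."""
    words = {w for w in article["title"].lower().split() if len(w) > 3}
    return words | {"date:" + article["published"]}

def map_phase(articles):
    """Emit one (feature, article id) pair per fingerprint feature."""
    for article in articles:
        for feature in fingerprint(article):
            yield feature, article["id"]

def reduce_phase(pairs, min_shared=3):
    """Collect ids per feature, then keep id pairs sharing enough features."""
    by_feature = defaultdict(set)
    for feature, article_id in pairs:
        by_feature[feature].add(article_id)
    shared = defaultdict(int)
    for ids in by_feature.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)

# Hypothetical sample data.
articles = [
    {"id": 1, "title": "Senate passes budget bill after long debate", "published": "2024-01-05"},
    {"id": 2, "title": "Budget bill passes Senate after debate", "published": "2024-01-05"},
    {"id": 3, "title": "Local team wins championship final", "published": "2024-01-05"},
]
candidates = reduce_phase(map_phase(articles))
```

Articles 1 and 2 share most of their features and come out as a candidate pair; article 3 shares only the date with the others and is left alone.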
