How do I go about creating an efficient content filter for certain posts?


Do it when the profile is created.

Try reversing the whole process. Rather than checking the content for every word in your lists, check each of the content's words against your lists.

  1. When a post is entered, break its content into words (split on spaces).
  2. Eliminate duplicates, words shorter than the shortest word in the database, words longer than the longest, and words in a 'common words' list that you keep.
  3. Check the remaining words against each table. If some of your tables include phrases with spaces, do a %text% search; otherwise do a straight match (much faster), or even build a hash table if it really is that big a problem. (I would do this as a PHP array and cache the result somehow; no sense reinventing the wheel.)
  4. Create your links with the now dramatically smaller lists.
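
The first three steps can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the term table and the common-words list are all made up, the word splitting is ASCII-only, and the term table is assumed to be a PHP array keyed by lowercase term.

```php
<?php
// Sketch of steps 1-3. All names and data here are hypothetical.

// Steps 1 and 2: split a post into words and discard everything that
// cannot possibly match a term.
function candidate_words(string $content, array $commonWords, int $minLen, int $maxLen): array
{
    // Step 1: lowercase, drop markup, split on anything that is not
    // an ASCII letter, digit or apostrophe.
    $words = preg_split("/[^a-z0-9']+/", strtolower(strip_tags($content)), -1, PREG_SPLIT_NO_EMPTY);

    $common = array_flip($commonWords); // keys allow fast isset() checks
    $seen = [];
    foreach ($words as $word) {
        $len = strlen($word);
        // Step 2: skip words that are too short, too long, or too common.
        if ($len < $minLen || $len > $maxLen || isset($common[$word])) {
            continue;
        }
        $seen[$word] = true; // using the word as a key removes duplicates
    }
    // Numeric words become integer keys, so cast everything back to string.
    return array_map('strval', array_keys($seen));
}

// Step 3: straight match against a term => URL hash table.
function match_terms(array $candidates, array $termIndex): array
{
    $matches = [];
    foreach ($candidates as $word) {
        if (isset($termIndex[$word])) {
            $matches[$word] = $termIndex[$word];
        }
    }
    return $matches;
}

// Made-up term table; in practice it would be loaded from the database.
$termIndex = [
    'loach'   => '/fish/loach',
    'tetra'   => '/fish/tetra',
    'anubias' => '/plants/anubias',
];
$common = ['the', 'and', 'with', 'are', 'in'];

// Shortest and longest term lengths bound which words are worth checking.
$lengths = array_map('strlen', array_keys($termIndex));

$post = 'The loach and the tetra are in the tank with the other loach.';
$candidates = candidate_words($post, $common, min($lengths), max($lengths));
$links = match_terms($candidates, $termIndex);
print_r($links);
```

Thirteen words of input shrink to three candidates here, and only two of those survive the lookup. Phrases with spaces are not handled by this straight match; those would still need the %text% search from step 3.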

You should easily be able to keep this under 1 second, even as you move out to 100,000 words to check against. I've done exactly this, without caching the word lists, for a Bayesian filter before.
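
If the word list does turn out to be worth caching, as step 3 suggests, one possible approach is to write the PHP array to a file and include it on later requests. The sketch below is an assumption about how that might look, not something from the setup described above: `load_term_index`, the cache path and the one-hour lifetime are all hypothetical.

```php
<?php
// Hypothetical file cache for the term table.
function load_term_index(string $cacheFile, callable $buildIndex, int $maxAge = 3600): array
{
    // Reuse the cached array while the file is fresh enough.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
        return include $cacheFile; // the cache file returns the array
    }

    // Otherwise rebuild it (for example with one query over the term
    // tables) and write it back as plain PHP source.
    $index = $buildIndex();
    file_put_contents($cacheFile, '<?php return ' . var_export($index, true) . ';', LOCK_EX);
    return $index;
}

// Demo with a made-up builder that counts how often it runs.
$cacheFile = sys_get_temp_dir() . '/term_index_' . uniqid() . '.php';
$builds = 0;
$build = function () use (&$builds): array {
    $builds++;
    return ['loach' => '/fish/loach', 'tetra' => '/fish/tetra'];
};

$first  = load_term_index($cacheFile, $build); // builds and writes the file
$second = load_term_index($cacheFile, $build); // served from the file
unlink($cacheFile);
```

The second call never touches the builder, so the expensive lookup happens at most once per cache lifetime.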

Even if the matching is greedy and gathers terms that don't really match (for example, "clown" will catch "clown loach"), the resulting list should be only a few to a few dozen words with links. A find and replace with that many words over a chunk of text takes no time at all.
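
A minimal sketch of that find and replace, assuming the matched terms arrive as a lowercase term => URL array (the `link_terms` name and the URLs are invented for illustration). It uses one combined pattern so that longer phrases win over their shorter prefixes and so that inserted URLs are never scanned again.

```php
<?php
// Hypothetical helper: link every matched term in a single pass.
// Assumes the keys of $links are lowercase.
function link_terms(string $text, array $links): string
{
    if ($links === []) {
        return $text;
    }

    // Longest terms first, so "clown loach" is tried before "clown".
    $terms = array_map('strval', array_keys($links));
    usort($terms, fn($a, $b) => strlen($b) <=> strlen($a));

    // Escape each term and join them into one alternation.
    $quoted = array_map(fn($t) => preg_quote($t, '/'), $terms);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    return preg_replace_callback($pattern, function (array $m) use ($links) {
        $url = $links[strtolower($m[1])];
        // Keep the original capitalisation of the matched text.
        return '<a href="' . htmlspecialchars($url) . '">' . $m[1] . '</a>';
    }, $text);
}

$matched = [
    'clown'       => '/fish/clownfish',
    'clown loach' => '/fish/clown-loach',
];
$html = link_terms('A Clown Loach is not a clown.', $matched);
echo $html, "\n";
```

This sketch treats the post as plain text. Content that already contains HTML would need extra care so that terms inside existing tags or attributes are left alone.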

The above doesn't really address your concern over the older profiles. You don't say exactly how many there are, just that there is a lot of text and that it comes to somewhere between 1400 and 3100 items (both kinds put together). You could process this older content in order of popularity, if you have that information, or by date entered, newest first. Regardless, the best way to do this is to write a script that lifts PHP's time limit and batch-runs a load/process/save over all the posts. If each one takes about 1 second (probably much less, but worst case), you are talking about 3100 seconds, which is a little less than an hour.
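
Such a batch script might look something like the sketch below. The three callables are placeholders for your own load, process and save code, and `run_backlog` is a made-up name; the only real requirement is the `set_time_limit(0)` call, which removes PHP's execution time limit for the run.

```php
<?php
// Hypothetical batch runner: load a batch, process each post, save it.
function run_backlog(callable $loadBatch, callable $process, callable $save, int $batchSize = 100): int
{
    set_time_limit(0); // no execution time limit for this run

    $done = 0;
    $offset = 0;
    // Keep loading batches until the loader returns an empty array.
    while (($batch = $loadBatch($offset, $batchSize)) !== []) {
        foreach ($batch as $id => $content) {
            $save($id, $process($content));
            $done++;
        }
        $offset += count($batch);
    }
    return $done;
}

// Demo with in-memory data, ordered newest first. A real loader would
// query the database with the same offset and limit.
$posts = [3 => 'newest post', 2 => 'middle post', 1 => 'oldest post'];
$saved = [];

$count = run_backlog(
    fn(int $offset, int $limit) => array_slice($posts, $offset, $limit, true),
    'strtoupper', // stand-in for the real filtering step
    function (int $id, string $content) use (&$saved) {
        $saved[$id] = $content;
    },
    2
);
echo "Processed $count posts\n";
```

Working in batches keeps memory use flat, and because each post is saved as soon as it is processed, an interrupted run loses very little work.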