Word frequencies from strings in Postgres?

Something like this?

SELECT some_pk,
       regexp_split_to_table(some_column, '\s') as word
FROM some_table

Getting the distinct words is easy then:

SELECT DISTINCT word
FROM (
   SELECT regexp_split_to_table(some_column, '\s') as word
   FROM some_table
) t

or getting the count for each word:

SELECT word, count(*)
FROM (
   SELECT regexp_split_to_table(some_column, '\s') as word
   FROM some_table
) t
GROUP BY word

postgresql text nlp word-frequency

You could also use the PostgreSQL text-searching functionality for this, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');

will yield:

  word   | ndoc | nentry
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)

(PostgreSQL applies language-dependent stemming and stop-word removal, which may or may not be what you want. Both can be disabled by passing the simple text search configuration instead of english as the first argument to to_tsvector.)
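To see what the simple configuration changes, here is a minimal sketch that reruns the sample text from above with 'simple' passed explicitly. The ORDER BY clause is an addition for readability only. With this configuration the word ridiculous should come back unstemmed rather than as ridicul.

```sql
-- Same sample text as the example above, but with the 'simple' configuration:
-- no stemming and no stop-word removal (words are still lowercased).
SELECT word, ndoc, nentry
FROM ts_stat('SELECT to_tsvector(''simple'', ''hello dere hello hello ridiculous'')')
ORDER BY nentry DESC, word;
```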

The nested SELECT statement can be any select statement that yields a tsvector column, so you could substitute a function that applies the to_tsvector function to any number of text fields, and concatenates them into a single tsvector, over any subset of your documents, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'', title) || to_tsvector(''english'', body) FROM my_documents WHERE id < 500') ORDER BY nentry DESC;

This yields a table of total word counts taken from the title and body fields of the documents with an id below 500, sorted by descending number of occurrences. For each word, you also get the number of documents it occurs in (the ndoc column).

See the documentation for more details: http://www.postgresql.org/docs/current/static/textsearch.html


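Be careful with the '\s' pattern. It matches whitespace only when standard_conforming_strings is on (the default since PostgreSQL 9.1). With that setting off, the backslash is consumed as a string escape and the pattern becomes a plain 's', so 'myWordshere' would be split into 'myWord' and 'here'. If that is not what you intend, split on a literal space ' ' or another delimiter between words: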

SELECT word, count(*)
FROM (
   SELECT regexp_split_to_table(some_column, ' ') as word
   FROM some_table
) t
GROUP BY word
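A variant that avoids the escaping question altogether is sketched below. It assumes you want case-insensitive counts and want runs of whitespace treated as a single separator. The E'...' escape-string syntax makes the pattern \s+ behave the same regardless of standard_conforming_strings. The freq alias, the lower() call and the empty-string filter are additions, not part of the queries above.

```sql
-- Count words case-insensitively, splitting on any run of whitespace.
-- E'\\s+' is an escape string literal, so the regex engine always sees \s+.
SELECT word, count(*) AS freq
FROM (
   SELECT regexp_split_to_table(lower(some_column), E'\\s+') AS word
   FROM some_table
) t
WHERE word <> ''          -- leading/trailing whitespace yields empty strings
GROUP BY word
ORDER BY freq DESC, word;
```

Punctuation stays attached to words here (for example "hello," and "hello" are counted separately). If that matters, the ts_stat approach above may be the better fit.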
