How to get all distinct words of a specified minimum length from multiple columns in a MySQL table?

mysql sql regex text

The best solution is not using that structure to store data and normalize your database in compliance with normal forms. But if you want to split strings to words and get them as a table and you can't normalize the database and you can't use the latest version of MYSQL with CTE you could create a simple stored procedure to split strings and store them to a temporary table. For example, the stored procedure might look like:

DELIMITER //CREATE PROCEDURE split_string_to_table (str longtext)BEGIN  DECLARE val TEXT DEFAULT NULL;  DROP TEMPORARY TABLE IF EXISTS temp_values;  CREATE TEMPORARY TABLE temp_values (     `value` varchar(200)    );  iterator:  LOOP      IF LENGTH(TRIM(str)) = 0 OR str IS NULL THEN      LEAVE iterator;    END IF;    SET val = SUBSTRING_INDEX(str, ' ', 1);    INSERT INTO temp_values (`value`) VALUES (TRIM(val));    SET str = INSERT(str, 1, LENGTH(val) + 1, '');  END LOOP;  SELECT DISTINCT(`value`) FROM temp_values WHERE CHAR_LENGTH(`value`) >= 5;END //DELIMITER ;

After it, you can join all strings to one string and store it in a temporary variable and pass its value to the stored procedure:

SELECT CONCAT_WS(' ',                  GROUP_CONCAT(col_a SEPARATOR ' '),                  GROUP_CONCAT(col_b SEPARATOR ' '),                  GROUP_CONCAT(col_c SEPARATOR ' ')       ) INTO @textFROM mytable;CALL split_string_to_table(@text);

Result:

--------------| value      |--------------| apple      |--------------| orange     |--------------| banana     |--------------| fruit      |--------------| green      |--------------| yellow     |--------------| frank      |--------------| james      |--------------| sarah-jane |--------------

You can see the demo of that realization in DBFiddle

mysql sql regex text

Shell script might be efficient...

SELECT CONCAT_WS(' ', col_a, col_b, col_c) INTO OUTFILE 'x' ... to get the columns into a file
tr ' ' "\n" <x -- split into one word per line
awk 'length($1) >= 5' -- minimum size of 5 characters per word
sort -u -- to dedup

There are no stopwords, but sed or awk could deal with that.

 mysql -e "SELECT ... INTO OUTFILE 'x' ..." ... tr ' ' "\n" <x  |  awk 'length($1) >= 5'  |  sort -u

mysql sql regex text

How about exporting the data into a file and then import it into a new table?

select col_a, col_b, col_c    into outfile '/tmp/words.csv'    fields terminated by ' ' escaped by ''     lines  terminated by ' '    from tablename;create table tmp_words(word varchar(50));load data infile '/tmp/words.csv'    into table tmp_words    lines terminated by ' ';select distinct word from tmp_words where char_length(word) >= 5;drop table tmp_words;

Result:

word----------appleorangegreenbananayellowfrankjamesfruitsarah-jane

Since you have a FULLTEXT INDEX, you could also just read the words from the information_schema:

set global innodb_ft_aux_table = 'test/tablename';select WORDfrom information_schema.INNODB_FT_INDEX_TABLEwhere char_length(WORD) >= 5union select WORDfrom information_schema.INNODB_FT_INDEX_CACHEwhere char_length(WORD) >= 5

However - Due to how the FULLTEXT INDEX works, "words" like "sarah-jane" will be splitted. You can see that in the result:

WORD------applebananafrankfruitgreenjamesorangesarah   <-- !yellow

db-fiddle

You will also miss stopwords like "about".

See: InnoDB INFORMATION_SCHEMA FULLTEXT Index Tables

CodeHunter

How to get all distinct words of a specified minimum length from multiple columns in a MySQL table?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last