How to get all distinct words of a specified minimum length from multiple columns in a MySQL table? How to get all distinct words of a specified minimum length from multiple columns in a MySQL table? mysql mysql

How to get all distinct words of a specified minimum length from multiple columns in a MySQL table?


The best solution is not using that structure to store data and normalize your database in compliance with normal forms. But if you want to split strings to words and get them as a table and you can't normalize the database and you can't use the latest version of MYSQL with CTE you could create a simple stored procedure to split strings and store them to a temporary table. For example, the stored procedure might look like:

DELIMITER //CREATE PROCEDURE split_string_to_table (str longtext)BEGIN  DECLARE val TEXT DEFAULT NULL;  DROP TEMPORARY TABLE IF EXISTS temp_values;  CREATE TEMPORARY TABLE temp_values (     `value` varchar(200)    );  iterator:  LOOP      IF LENGTH(TRIM(str)) = 0 OR str IS NULL THEN      LEAVE iterator;    END IF;    SET val = SUBSTRING_INDEX(str, ' ', 1);    INSERT INTO temp_values (`value`) VALUES (TRIM(val));    SET str = INSERT(str, 1, LENGTH(val) + 1, '');  END LOOP;  SELECT DISTINCT(`value`) FROM temp_values WHERE CHAR_LENGTH(`value`) >= 5;END //DELIMITER ;

After it, you can join all strings to one string and store it in a temporary variable and pass its value to the stored procedure:

SELECT CONCAT_WS(' ',                  GROUP_CONCAT(col_a SEPARATOR ' '),                  GROUP_CONCAT(col_b SEPARATOR ' '),                  GROUP_CONCAT(col_c SEPARATOR ' ')       ) INTO @textFROM mytable;CALL split_string_to_table(@text);

Result:

--------------| value      |--------------| apple      |--------------| orange     |--------------| banana     |--------------| fruit      |--------------| green      |--------------| yellow     |--------------| frank      |--------------| james      |--------------| sarah-jane |--------------

You can see the demo of that realization in DBFiddle


Shell script might be efficient...

  1. SELECT CONCAT_WS(' ', col_a, col_b, col_c) INTO OUTFILE 'x' ... to get the columns into a file
  2. tr ' ' "\n" <x -- split into one word per line
  3. awk 'length($1) >= 5' -- minimum size of 5 characters per word
  4. sort -u -- to dedup

There are no stopwords, but sed or awk could deal with that.

 mysql -e "SELECT ... INTO OUTFILE 'x' ..." ... tr ' ' "\n" <x  |  awk 'length($1) >= 5'  |  sort -u


How about exporting the data into a file and then import it into a new table?

select col_a, col_b, col_c    into outfile '/tmp/words.csv'    fields terminated by ' ' escaped by ''     lines  terminated by ' '    from tablename;create table tmp_words(word varchar(50));load data infile '/tmp/words.csv'    into table tmp_words    lines terminated by ' ';select distinct word from tmp_words where char_length(word) >= 5;drop table tmp_words;

Result:

word----------appleorangegreenbananayellowfrankjamesfruitsarah-jane

Since you have a FULLTEXT INDEX, you could also just read the words from the information_schema:

set global innodb_ft_aux_table = 'test/tablename';select WORDfrom information_schema.INNODB_FT_INDEX_TABLEwhere char_length(WORD) >= 5union select WORDfrom information_schema.INNODB_FT_INDEX_CACHEwhere char_length(WORD) >= 5

However - Due to how the FULLTEXT INDEX works, "words" like "sarah-jane" will be splitted. You can see that in the result:

WORD------applebananafrankfruitgreenjamesorangesarah   <-- !yellow

db-fiddle

You will also miss stopwords like "about".

See: InnoDB INFORMATION_SCHEMA FULLTEXT Index Tables