How to get all distinct words of a specified minimum length from multiple columns in a MySQL table?
The best solution is not to store data in that structure at all: normalize your database in compliance with the normal forms. But if you need to split strings into words and return them as a table, and you can neither normalize the database nor use a recent MySQL version with CTE support, you can create a simple stored procedure that splits a string and stores the words in a temporary table. For example, the stored procedure might look like this:
DELIMITER //
CREATE PROCEDURE split_string_to_table (str LONGTEXT)
BEGIN
  DECLARE val TEXT DEFAULT NULL;

  DROP TEMPORARY TABLE IF EXISTS temp_values;
  CREATE TEMPORARY TABLE temp_values (
    `value` VARCHAR(200)
  );

  iterator:
  LOOP
    IF LENGTH(TRIM(str)) = 0 OR str IS NULL THEN
      LEAVE iterator;
    END IF;

    SET val = SUBSTRING_INDEX(str, ' ', 1);
    INSERT INTO temp_values (`value`) VALUES (TRIM(val));
    SET str = INSERT(str, 1, LENGTH(val) + 1, '');
  END LOOP;

  SELECT DISTINCT(`value`) FROM temp_values WHERE CHAR_LENGTH(`value`) >= 5;
END //
DELIMITER ;
After that, you can concatenate all columns into one string, store it in a user variable, and pass its value to the stored procedure:
SELECT CONCAT_WS(' ',
         GROUP_CONCAT(col_a SEPARATOR ' '),
         GROUP_CONCAT(col_b SEPARATOR ' '),
         GROUP_CONCAT(col_c SEPARATOR ' ')
       )
INTO @text
FROM mytable;

CALL split_string_to_table(@text);
Result:
+------------+
| value      |
+------------+
| apple      |
| orange     |
| banana     |
| fruit      |
| green      |
| yellow     |
| frank      |
| james      |
| sarah-jane |
+------------+
You can see a demo of this solution on DBFiddle.
A shell script might be more efficient...
SELECT CONCAT_WS(' ', col_a, col_b, col_c) INTO OUTFILE 'x' ...
                      -- to get the columns into a file
tr ' ' "\n" <x        -- split into one word per line
awk 'length($1) >= 5' -- minimum size of 5 characters per word
sort -u               -- to dedup
Stopwords are not removed, but sed or awk could deal with that.
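For instance, a stopword filter could be spliced into the pipeline with awk. This is only a sketch: the stopword list and the sample words below are made up for illustration.

```shell
# Drop common stopwords before deduplication.
# The stopword list here is illustrative, not MySQL's built-in list.
printf '%s\n' about apple orange would banana |
awk 'BEGIN { split("about would there which", s, " ")
             for (i in s) stop[s[i]] = 1 }
     !($1 in stop)'
# prints: apple, orange, banana (one per line)
```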
mysql -e "SELECT ... INTO OUTFILE 'x' ..."
tr ' ' "\n" <x | awk 'length($1) >= 5' | sort -u
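As a sanity check, the same pipeline can be exercised against a hand-made file in place of the real `INTO OUTFILE` export; the sample words here are invented:

```shell
# Fake the exported file instead of running mysql; sample words are invented.
printf '%s ' apple orange apple fruit green sarah-jane big > x
tr ' ' '\n' <x | awk 'length($1) >= 5' | sort -u
rm x
# prints: apple, fruit, green, orange, sarah-jane (one per line)
```

Note that "big" is dropped (shorter than 5 characters), the duplicate "apple" is collapsed by `sort -u`, and "sarah-jane" survives intact, unlike with the FULLTEXT approach below.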
How about exporting the data into a file and then importing it into a new table?
select col_a, col_b, col_c
into outfile '/tmp/words.csv'
fields terminated by ' ' escaped by ''
lines terminated by ' '
from tablename;

create table tmp_words (word varchar(50));

load data infile '/tmp/words.csv'
into table tmp_words
lines terminated by ' ';

select distinct word from tmp_words where char_length(word) >= 5;

drop table tmp_words;
Result:
word
----------
apple
orange
green
banana
yellow
frank
james
fruit
sarah-jane
Since you have a FULLTEXT INDEX, you could also just read the words from the information_schema:
set global innodb_ft_aux_table = 'test/tablename';

select WORD
from information_schema.INNODB_FT_INDEX_TABLE
where char_length(WORD) >= 5
union
select WORD
from information_schema.INNODB_FT_INDEX_CACHE
where char_length(WORD) >= 5;
However, due to how the FULLTEXT INDEX works, hyphenated "words" like "sarah-jane" will be split. You can see that in the result:
WORD
------
apple
banana
frank
fruit
green
james
orange
sarah   <-- !
yellow
You will also miss stopwords like "about".