Processing Emojis in SQLite Processing Emojis in SQLite sqlite sqlite

Processing Emojis in SQLite


The only way I can see to do this with SQLite directly would be to compile SQLite from the source code so you could add support for regex_replace.

However, you only plan to do it once, and recompiling SQLite might be a bit overkill.

Instead, you could copy your text column into a plain text file, and run the following command:

sed 's/\(.\)/\1\n/g' temp.txt | sed 's/[[:alnum:].-]//g' | sort -r | uniq -c

This would turn the following:

Hello! Are you stuck? 🤔I saw 🐻🐻🐻 in the park!!!!! 🎂🎂🎂🎂🎂🎂 - all lies.Easy as 123! 😎😎😎😎😎😎😎😎😎😎😎

into:

  1 🤔 11 😎  3 🐻  6 🎂  1 ?  7 ! 17 50

Which would hopefully be close enough to get you to your goal. The last two entries are for tabs and spaces.

sed is a linux command, so if you are running windows you may want to get a windows version here: https://github.com/mbuilov/sed-windows


Build emojis table, with contains emojis strings in text fields, then try this:

SELECT emojis.text, sum(length(replace(message.text, emojis.text, emojis.text || '_')) - length( message.text))FROM messageINNER JOIN emojis ON message.text LIKE '%' || emojis.text || '%'GROUP BY emojis.text

You can use this method to count occurrences of any text in messages...


Possibly not the most efficient example but you could use the group_concat function to merge the text into a single string then use a recursive cte to turn that into a single table of chars. You can filter out all the 'normal' acii chars using the hex function greater than 7F( see asii table) you might need to play around with that depending on what your emojis are and whether you have loads of non latin unicode in there. Then it is a simple case of a count if you just want the total or a group by with a count if you want to know what the individual emoji count is.

WITH RECURSIVE theChars (    position    ,charValue    )AS (    SELECT 2        ,substr((                SELECT group_concat(TEXT, '')                FROM message                ), 1, 1)    UNION ALL    SELECT position + 1        ,substr((SELECT group_concat(TEXT, '') FROM message)            ,position            ,1)    FROM theChars LIMIT length((SELECT group_concat(TEXT, '') FROM message))    )SELECT charValue,count(*)FROM theCharsWHERE hex(charValue) > '7F'group by charValue;