How to extract all the emojis from text?


You can use the emoji library. To check whether a single codepoint is an emoji codepoint, test whether it is contained in emoji.UNICODE_EMOJI (emoji.UNICODE_EMOJI['en'] in emoji >= v1.2.0, as used below).

import emoji

def extract_emojis(s):
    return ''.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])


I think it's important to point out that the previous answers won't work with emojis like 👨‍👩‍👦‍👦, because it is a sequence of 4 emojis (joined by invisible zero-width joiner codepoints), and using ... in emoji.UNICODE_EMOJI will return 4 different emojis. The same goes for emojis with a skin color, like 🙅🏽.
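To see why a per-codepoint check splits such emojis, it can help to list the codepoints behind them. A minimal standard-library sketch (the helper name `codepoints` and the two constants are made up for illustration):

```python
import unicodedata

# Family emoji written with escapes so the invisible joiners are explicit:
# MAN + ZWJ + WOMAN + ZWJ + BOY + ZWJ + BOY
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466"
# A gesture emoji followed by a skin-tone modifier codepoint
NO_GESTURE_TONED = "\U0001F645\U0001F3FD"


def codepoints(text):
    """Return (hex codepoint, Unicode name) for every codepoint in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in text]


for cp, name in codepoints(FAMILY):
    print(cp, name)

print(len(FAMILY))            # number of codepoints in the family emoji
print(len(NO_GESTURE_TONED))  # base emoji plus modifier
```

The family emoji is made of 7 codepoints (4 people joined by 3 zero-width joiners), which is why iterating character by character yields 4 separate emojis.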

My solution

Import the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count emojis like 👨‍👩‍👦‍👦 as one.

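import emoji
import regex

def split_count(text):
    emoji_list = []
    # \X matches one extended grapheme cluster (one user-perceived character)
    data = regex.findall(r'\X', text)
    for word in data:
        # keep the whole cluster if any of its codepoints is an emoji
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)
    return emoji_list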

Testing

With more emojis, including ones with skin color:

line = ["🤔 🙈 me así, se 😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"]

counter = split_count(line[0])
print(' '.join(emoji for emoji in counter))

output:

🤔 🙈 😌 💕 👭 👙 👩🏾‍🎓 👨‍👩‍👦‍👦 😊 🙅🏽 🙅🏽

Include flags

If you want to include flags, like 🇵🇰, the Unicode range would be from 🇦 to 🇿, so add:

flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text) 

to the function above, and return emoji_list + flags.

See this answer to "A python regex that matches the regional indicator character class" for more information about the flags.
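Note that the character class above matches each regional indicator symbol separately, so a flag such as 🇵🇰 comes back as two items. If whole flags are wanted, one option is to match the indicators in pairs. A minimal sketch using only the standard re module (the names `FLAG_RE`, `find_flags` and `flag_to_code` are illustrative, and it assumes the indicators always come in well-formed pairs):

```python
import re

# Two consecutive regional indicator symbols (U+1F1E6..U+1F1FF) form one flag.
FLAG_RE = re.compile(r"[\U0001F1E6-\U0001F1FF]{2}")


def find_flags(text):
    """Return every pair of regional indicator symbols found in text."""
    return FLAG_RE.findall(text)


def flag_to_code(flag):
    """Map a flag back to its two-letter region code, e.g. for debugging."""
    return "".join(chr(ord(c) - 0x1F1E6 + ord("A")) for c in flag)


text = "from \U0001F1F5\U0001F1F0 to \U0001F1EF\U0001F1F5"
flags = find_flags(text)
print(flags)
print([flag_to_code(f) for f in flags])
```

The same pattern should also work with regex.findall in place of re.findall, so it could be dropped into the function above.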

For newer emoji versions

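To work with emoji >= v1.2.0 you have to add a language specifier (e.g. en, as in the code above):

emoji.UNICODE_EMOJI['en']

Note that emoji 2.0.0 removed UNICODE_EMOJI altogether; from that version on, the equivalent membership test is against the emoji.EMOJI_DATA dictionary.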


If you don't want to use an external library, you can simply use regular expressions: re.findall() with a suitable regex will find the emojis:

In [74]: import re

In [75]: re.findall(r'[^\w\s,]', a_list[0])
Out[75]: ['🤔', '🙈', '😌', '💕', '👭', '👙']

The regular expression r'[^\w\s,]' is a negated character class that matches any character that is not a word character, whitespace or comma.

As I mentioned in a comment, text generally contains word characters and punctuation, which this approach deals with easily; for other cases you can just add the extra characters to the character class manually. Note that since you can specify a range of characters in a character class, you can even make it shorter and more flexible.
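As an illustration of adding further characters to the class, the sketch below builds the negated class from string.punctuation instead of listing characters by hand. This is one possible variation, not the pattern from the answer, and the names `NON_TEXT_RE` and `find_non_text` are hypothetical:

```python
import re
import string

# Negated class: anything that is not a word character, whitespace,
# or ASCII punctuation. re.escape keeps characters such as ] ^ - \ literal.
NON_TEXT_RE = re.compile(r"[^\w\s" + re.escape(string.punctuation) + "]")


def find_non_text(text):
    """Return the characters left over after excluding ordinary text."""
    return NON_TEXT_RE.findall(text)


print(find_non_text("Hello, world! \U0001F60C (ok?) \U0001F495"))
```

Codepoints such as skin-tone modifiers and zero-width joiners are neither word characters nor punctuation, so they come back as separate items; this approach does not keep multi-codepoint emojis together.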

Another solution is to use, instead of a negated character class that excludes the non-emoji characters, a character class that accepts emojis ([] without ^). Since there are a lot of emojis with different Unicode values, you just need to add their ranges to the character class. If you want to match more emojis, here is a good reference containing all the standard emojis with their respective ranges: http://apps.timwhitlock.info/emoji/tables/unicode
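A minimal sketch of such a positive character class follows. The ranges are a small illustrative selection of Unicode blocks that contain emojis, not a complete list, and the names `EMOJI_RE` and `find_emojis` are made up for this example; extend the class with further ranges as needed:

```python
import re

# Illustrative, non-exhaustive selection of Unicode blocks containing emojis.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "]"
)


def find_emojis(text):
    """Return every single codepoint in text that falls in one of the ranges."""
    return EMOJI_RE.findall(text)


sample = "\U0001F914 \U0001F648 me as\u00ed, se \U0001F60C ds \U0001F495\U0001F46D\U0001F459"
print(find_emojis(sample))
```

Like the negated class, this matches one codepoint at a time, so skin-tone modifiers (which fall inside the first range) and the parts of joined emojis are returned individually.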