How to extract all the emojis from text?

python python-3.x emoji

You can use the emoji library. You can check if a single codepoint is an emoji codepoint by checking if it is contained in emoji.UNICODE_EMOJI.

import emoji

def extract_emojis(s):
    return ''.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])


I think it's important to point out that the previous answers won't work with emojis like 👨‍👩‍👦‍👦, because it consists of 4 emojis joined together, and using ... in emoji.UNICODE_EMOJI will return 4 different emojis. The same goes for emojis with a skin color, like 🙅🏽.
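To see why a per-code-point check splits these emojis, it can help to look at the code points themselves. The sketch below uses only the standard library; `code_points` is a hypothetical helper name, and the two sample strings are written with escapes so the individual code points stay visible.

```python
ZWJ = "\u200d"  # zero-width joiner, glues several emojis into one glyph

def code_points(s):
    """Return the code points of s as 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in s]

# Family: man + ZWJ + woman + ZWJ + boy + ZWJ + boy
family = "\U0001F468\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466"
# Person gesturing "no" followed by a skin tone modifier
no_gesture = "\U0001F645\U0001F3FD"

print(len(family), code_points(family))
print(len(no_gesture), code_points(no_gesture))

# Dropping the joiners leaves four separate emoji code points, which is
# what a filter that looks at one code point at a time ends up with.
parts = [c for c in family if c != ZWJ]
print(len(parts))
```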

My solution

Include the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character) through the \X pattern, so we can count emojis like 👨‍👩‍👦‍👦 as a single emoji.

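import emoji
import regex

def split_count(text):
    emoji_list = []
    # \X matches one grapheme cluster (one user-perceived character)
    data = regex.findall(r'\X', text)
    for word in data:
        # keep the cluster if any of its code points is an emoji
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)
    return emoji_list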

Testing

with more emojis with skin color:

line = ["🤔 🙈 me así, se 😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"]

counter = split_count(line[0])
print(' '.join(emoji for emoji in counter))

output:

🤔 🙈 😌 💕 👭 👙 👩🏾‍🎓 👨‍👩‍👦‍👦 😊 🙅🏽 🙅🏽

Include flags

If you want to include flags, like 🇵🇰, the Unicode range would be from 🇦 to 🇿, so add:

flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text)

to the function above, and return emoji_list + flags.

See this answer to "A python regex that matches the regional indicator character class" for more information about the flags.
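Note that the character class above matches each regional indicator on its own, so a flag such as 🇵🇰 comes back as two separate symbols. Below is a minimal sketch of one way to keep flags whole, assuming a flag is always a pair of regional indicators. It uses the standard `re` module rather than `regex`, and `find_flags` is a hypothetical helper name.

```python
import re

# Regional indicator symbols run from U+1F1E6 (A) to U+1F1FF (Z).
REGIONAL_INDICATOR = "[\U0001F1E6-\U0001F1FF]"

def find_flags(text):
    """Return flags as whole strings (two regional indicators each)."""
    return re.findall(REGIONAL_INDICATOR + "{2}", text)

pk = "\U0001F1F5\U0001F1F0"  # regional indicators P + K
de = "\U0001F1E9\U0001F1EA"  # regional indicators D + E
text = f"from {pk} to {de}"

singles = re.findall(REGIONAL_INDICATOR, text)
print(len(singles))      # each indicator is matched separately
print(find_flags(text))  # each flag is matched as one string
```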

For newer `emoji` versions

To work with emoji >= v1.2.0, you have to add a language specifier (e.g. en, as in the code above):

emoji.UNICODE_EMOJI['en']
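Because the layout of this table has changed between releases, code that must run against several `emoji` versions may want to look it up defensively. The sketch below is an illustration built on assumptions, not the library's recommended approach: it assumes that releases before 1.2.0 expose `UNICODE_EMOJI` as a flat mapping, that 1.2.0 and later key it by language, and that 2.0 and later replace it with `emoji.EMOJI_DATA`. `emoji_keys` is a hypothetical helper, and it takes the module as an argument so the logic can be tried without the library installed.

```python
from types import SimpleNamespace

def emoji_keys(module, lang="en"):
    """Return the set of emoji strings known to an emoji-like module."""
    # Assumed layout for emoji >= 2.0: EMOJI_DATA maps emoji -> metadata.
    data = getattr(module, "EMOJI_DATA", None)
    if data is not None:
        return set(data)
    table = module.UNICODE_EMOJI
    # Assumed layout for emoji >= 1.2.0: one sub-table per language code.
    if lang in table:
        return set(table[lang])
    # Assumed layout for older releases: a flat emoji -> name mapping.
    return set(table)

# Stand-ins for the three assumed layouts, so this runs without the library.
smile = "\U0001F60A"
old = SimpleNamespace(UNICODE_EMOJI={smile: ":smiling_face:"})
keyed = SimpleNamespace(UNICODE_EMOJI={"en": {smile: ":smiling_face:"}})
new = SimpleNamespace(EMOJI_DATA={smile: {"en": ":smiling_face:"}})

for module in (old, keyed, new):
    print(smile in emoji_keys(module))
```

With the real library, the call would be `emoji_keys(emoji)`, and the membership test in the functions above becomes `char in emoji_keys(emoji)`.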


If you don't want to use an external library, a Pythonic alternative is to use regular expressions: re.findall() with a suitable regex finds the emojis:

In [74]: import re

In [75]: re.findall(r'[^\w\s,]', a_list[0])
Out[75]: ['🤔', '🙈', '😌', '💕', '👭', '👙']

The regular expression r'[^\w\s,]' is a negated character class that matches any character that is not a word character, whitespace or comma.

As I mentioned in a comment, text generally contains word characters and punctuation, which this approach handles easily; for other cases you can add the characters to the character class manually. Note that since you can specify a range of characters in a character class, you can make it even shorter and more flexible.
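A short sketch of the negated-class approach with the standard `re` module. The sample strings are made up for illustration and written with escapes; the second one shows why punctuation other than the comma has to be added to the class by hand.

```python
import re

# thinking face, see-no-evil monkey, relieved face, two hearts,
# two women holding hands, bikini
sample = "\U0001F914 \U0001F648 hello, world \U0001F60C ok \U0001F495\U0001F46D\U0001F459"
found = re.findall(r"[^\w\s,]", sample)
print(found)

# Punctuation other than the comma is matched as well...
noisy = "hello! are you ok? \U0001F60A"
with_punctuation = re.findall(r"[^\w\s,]", noisy)
print(with_punctuation)

# ...unless it is added to the negated class.
without_punctuation = re.findall(r"[^\w\s,!?]", noisy)
print(without_punctuation)
```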

Another solution is to use, instead of a negated character class that excludes the non-emoji characters, a character class that accepts emojis ([] without ^). Since there are a lot of emojis with different Unicode values, you just need to add their ranges to the character class. If you want to match more emojis, http://apps.timwhitlock.info/emoji/tables/unicode is a good reference that lists the standard emojis with their respective ranges.
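A minimal sketch of such a positive character class. The ranges below are a small, non-exhaustive selection of Unicode blocks that contain emojis (chosen here for illustration, not taken from the answer), and `EMOJI_CLASS` is a hypothetical name.

```python
import re

# A few Unicode blocks that contain emojis; this list is not exhaustive.
EMOJI_CLASS = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "]"
)

sample = "hello! \U0001F914 \U0001F648, ok? \U0001F495\U0001F46D\U0001F459"
found = EMOJI_CLASS.findall(sample)
print(found)
```

Unlike the negated class, punctuation such as `!` and `?` is skipped without being listed. The trade-off is that these blocks also contain some symbols that are not emojis, and emojis outside the listed ranges are missed.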
