What is the correct way to deal with emoji in Flask / Python?


OK, I know now why that was happening...

> Server side

Although both environments run Python 2.6, the AWS EB Python build was compiled with UCS-4 support, while the local (Mac OS X) Python 2.6 build was compiled with UCS-2. More info about UCS is in the references below.

AWS EB EC2:

    >>> import sys
    >>> print sys.maxunicode
    1114111

Local Python 2.6.8 installation:

    >>> import sys
    >>> print sys.maxunicode
    65535
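
The practical effect of the two builds shows up in string length: a narrow (UCS-2) build stores a code point above U+FFFF as a surrogate pair, so `len()` reports 2 for a single emoji, while a wide (UCS-4) build reports 1. A minimal sketch of that check follows; the helper name `is_wide_build` is hypothetical, not something from the original setup.

```python
import sys


def is_wide_build(maxunicode=None):
    """Return True for a UCS-4 (wide) build, False for a UCS-2 (narrow) one."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode > 0xFFFF


# One code point outside the Basic Multilingual Plane (an emoji).
emoji = u"\U0001F600"

# A wide build counts one code point; a narrow build counts two code units.
expected_len = 1 if is_wide_build() else 2
assert len(emoji) == expected_len
```

Note that Python 3.3 and later no longer have this distinction, so `sys.maxunicode` is always 1114111 there.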

In the end I decided it was better for our project to use Python 2.6 with UCS-4 support, so I had to rebuild my local Python installation (Mac OS X 10.9.4):

Download and install Python 2.6.8 (the same version as the EC2 instance):

    $ curl -O https://www.python.org/ftp/python/2.6.8/Python-2.6.8.tgz
    $ tar xzvf Python-2.6.8.tgz
    $ cd Python-2.6.8
    $ ./configure --disable-framework --disable-toolbox-glue OPT="-fast -arch x86_64 -Wall -Wstrict-prototypes -fno-common -fPIC" --enable-unicode=ucs4 LDFLAGS="-arch x86_64"
    $ make
    $ sudo make install

Create a new virtualenv and install the dependencies:

    $ virtualenv -p /usr/local/bin/python2.6 venv_ayf_eb_26
    $ . venv_ayf_eb_26/bin/activate
    $ pip install -r requirements.txt

> Client Side

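On the client (JavaScript) we need to change the way we loop over the string, because ECMAScript 5 and earlier expose strings as sequences of 16-bit code units (UCS-2), so an emoji outside the Basic Multilingual Plane occupies two positions (a surrogate pair).

So to read the real length of the string in symbols we use: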

    String.prototype.getSymbols = function() {
        var length = this.length;
        var index = -1;
        var output = [];
        var character;
        var charCode;
        while (++index < length) {
            character = this.charAt(index);
            charCode = character.charCodeAt(0);
            if (charCode >= 0xD800 && charCode <= 0xDBFF) {
                // note: this doesn’t account for lone high surrogates
                output.push(character + this.charAt(++index));
            } else {
                output.push(character);
            }
        }
        return output;
    };

    String.prototype.realLength = function() {
        return this.getSymbols().length;
    };

Looping:

    // GET original_text over the REST API
    var text = original_text.getSymbols();
    for (var i = 0; i < text.length; i++) { /* DO SOMETHING with text[i] */ }
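
For comparison, the same surrogate-pair grouping can be sketched in Python. This is only an illustration of the technique, assuming a case where a narrow (UCS-2) build has to be kept on the server; the function name `get_symbols` mirrors the JavaScript helper above and is not part of Flask or the standard library. Unlike the JavaScript version, this sketch only merges a high surrogate when a low surrogate actually follows it.

```python
def get_symbols(text):
    """Split text into symbols, keeping each UTF-16 surrogate pair together."""
    symbols = []
    index = 0
    length = len(text)
    while index < length:
        char = text[index]
        # A high surrogate followed by a low surrogate forms a single symbol.
        if (0xD800 <= ord(char) <= 0xDBFF and index + 1 < length
                and 0xDC00 <= ord(text[index + 1]) <= 0xDFFF):
            symbols.append(char + text[index + 1])
            index += 2
        else:
            # BMP characters and lone surrogates are kept as they are.
            symbols.append(char)
            index += 1
    return symbols


# "a", one emoji written as an explicit surrogate pair, then "b".
sample = u"a\ud83d\ude00b"
assert len(get_symbols(sample)) == 3
```

On a wide build a real emoji is already a single element of the string, so the function simply returns one entry per code point.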

References

  1. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel Spolsky
  2. Unipain: Pragmatic Unicode - Ned Batchelder
  3. Universal Character Set - Wikipedia
  4. ECMAScript - Wikipedia
  5. ECMAScript® Language Specification (5.1) - Ecma International
  6. JavaScript has a Unicode problem - Mathias Bynens
  7. Python, convert 4-byte char to avoid MySQL error “Incorrect string value:” - StackOverflow
  8. How to find out if Python is compiled with UCS-2 or UCS-4? - StackOverflow