Python - Map / Reduce - How do I read a specific JSON field in the DISCO count words example
Your problem is in disco/worker/classic/func.py ... str() will not accept a non-ASCII unicode character, because in Python 2 it implicitly encodes with the ASCII codec:
    >>> str(u'\xb4')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 0: ordinal not in range(128)
    >>>
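To check in advance which values would trigger this error, you can try the ASCII encoding yourself. This is only a sketch: the helper name ascii_encodable is made up for illustration and is not part of Disco. It relies on the explicit .encode('ascii') call, which fails the same way on Python 2 and Python 3.

```python
def ascii_encodable(text):
    """Return True if text can be encoded as ASCII without errors.

    Hypothetical helper for illustration; not part of Disco.
    """
    try:
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        # Same exception class that str() raises on Python 2.
        return False
```

Any value for which this returns False will break str() under Python 2.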
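Since you are only counting words, you could convert your unicode data into plain ASCII strings with the unicodedata module: normalize to NFD so accented letters are split into a base letter plus a combining mark, then encode to ASCII and ignore whatever does not fit...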
    import json
    import unicodedata

    f = open('file.json')
    for line in f:
        r = json.loads(line).get('text')
        s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
        print r
        print s
Output:
    @CataDuarte8 No! avíseme cuando vaya ah salir para yo salir igual!
    @CataDuarte8 No! aviseme cuando vaya ah salir para yo salir igual!
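The same conversion can be wrapped in a small function so it is easy to try on individual strings. This is a sketch under assumptions: the name to_ascii is invented here, and the final .decode('ascii') is added only so the result is a text string on Python 3 as well (on Python 2 the encode step alone already gives a str).

```python
import unicodedata


def to_ascii(text):
    """Strip accents and drop characters that have no ASCII equivalent.

    Hypothetical helper; the name is an assumption for illustration.
    """
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize('NFD', text)
    # 'ignore' silently drops everything outside ASCII,
    # including the combining marks produced above.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii(u'av\xedseme'))
```

Note that characters without a decomposition are dropped entirely rather than replaced, so u'Stra\xdfe' becomes 'Strae'. For word counting that is usually acceptable, but it does change some words.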
Applying this to your problem... rewrite your map() function as...
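    import simplejson
    import unicodedata

    def map(line, params):
        # Pull the tweet text out of the JSON record
        r = simplejson.loads(line).get('text')
        # Strip accents and drop anything that is not ASCII
        s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore')
        for word in s.split():
            yield word, 1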
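If you want to test the map logic locally before submitting a Disco job, something like the following may help. It is a sketch, not Disco's API: map_words is a made-up name, the standard json module stands in for simplejson, the extra .decode('ascii') is there for Python 3, and the guard for records without a 'text' field is an added assumption about your data.

```python
import json
import unicodedata
from collections import Counter


def map_words(line, params):
    """Yield (word, 1) for each word in the 'text' field of a JSON line.

    Hypothetical stand-in for the Disco map function; params is unused.
    """
    text = json.loads(line).get('text')
    if not text:
        # Assumption: records without a 'text' field are skipped.
        return
    ascii_text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
    for word in ascii_text.decode('ascii').split():
        yield word, 1


def count_words(lines):
    """Local reduce step: sum the counts emitted by map_words."""
    counts = Counter()
    for line in lines:
        for word, n in map_words(line, None):
            counts[word] += n
    return counts
```

Running count_words(open('file.json')) on a few lines of your input shows whether the conversion behaves as expected without involving the cluster.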