PyYaml - Dump unicode with special characters ( i.e. accents )


yaml can dump unicode characters if you pass the allow_unicode=True keyword argument to any of the dumpers. If you don't provide a file, under Python 2 you get a UTF-8 encoded string back from the dump() method (i.e. the result of getvalue() on the StringIO() instance that is created to hold the dumped data), and you have to decode that to unicode before appending it to your own unicode string:

    # coding: utf-8

    import codecs
    import ruamel.yaml as yaml

    file_name = r'toto.txt'
    text = u'héhéhé, hûhûhû'

    textDict = {"data": text}

    with open(file_name, 'w') as fp:
        yaml.dump(textDict, stream=fp, allow_unicode=True)

    print('yaml dump dict 1   : ' + open(file_name).read()),

    f = codecs.open(file_name, "w", encoding="utf-8")
    f.write('yaml dump dict 2   : ' + yaml.dump(textDict, allow_unicode=True).decode('utf-8'))
    f.close()
    print(open(file_name).read())

output:

    yaml dump dict 1    : {data: 'héhéhé, hûhûhû'}
    yaml dump dict 2    : {data: 'héhéhé, hûhûhû'}

I tested this with my enhanced version of PyYAML (ruamel.yaml), but this should work the same in PyYAML itself.
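For readers on Python 3, the same two paths (dumping into a file, and dumping to a string when no stream is given) can be sketched roughly as follows. This is only an illustration, assuming PyYAML is installed and imported as yaml; the temporary directory is a stand-in for wherever toto.txt would really live.

```python
import os
import tempfile

import yaml  # PyYAML, assumed to be installed

data = {"data": "héhéhé, hûhûhû"}

# Path 1: dump straight into a text file that was opened as UTF-8.
# The temporary directory is a hypothetical location for this sketch.
path = os.path.join(tempfile.mkdtemp(), "toto.txt")
with open(path, "w", encoding="utf-8") as fp:
    yaml.dump(data, fp, allow_unicode=True)

# Path 2: no stream given, so dump() returns the document as a str.
as_text = yaml.dump(data, allow_unicode=True)

# Passing encoding= as well makes dump() return bytes instead of str.
as_bytes = yaml.dump(data, allow_unicode=True, encoding="utf-8")

# Read the file back with the same encoding it was written with.
with open(path, encoding="utf-8") as fp:
    from_file = fp.read()

print(from_file)
```

Under Python 3 there is no decode step for the string result, since dump() already returns text unless an encoding is requested.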


Update (2020)

Nowadays, PyYAML handles unicode easily under Python 3, but non-ASCII characters are written as-is only if you pass the allow_unicode=True argument (without it they come out as escape sequences):

    import yaml

    d = {'a': 'héhéhé', 'b': 'hühühü'}
    yaml_code = yaml.dump(d, allow_unicode=True, sort_keys=False)
    print(yaml_code)

Will result in:

    a: héhéhé
    b: hühühü
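To see what allow_unicode=True actually changes, here is a small sketch that dumps the same dictionary with and without it. It assumes PyYAML under Python 3; the variable names escaped and literal are only illustrative.

```python
import yaml  # PyYAML, assumed to be installed

d = {'a': 'héhéhé', 'b': 'hühühü'}

# Default: non-ASCII characters are written as escape sequences
# inside double-quoted scalars, so the output is pure ASCII.
escaped = yaml.dump(d)

# allow_unicode=True: the characters are written as-is.
literal = yaml.dump(d, allow_unicode=True)

print(escaped)
print(literal)

# Both forms load back to the same dictionary.
assert yaml.safe_load(escaped) == yaml.safe_load(literal) == d
```

Either output is valid YAML and round-trips to the same data; the argument only affects how readable the dumped text is.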

Note: The sort_keys=False argument should be used as of Python 3.6 to keep the keys of the dictionary in their original order. PyYAML has traditionally sorted keys because Python dictionaries did not have a defined order. Dictionaries have preserved insertion order since Python 3.6 (as a CPython implementation detail) and officially since 3.7, yet PyYAML still sorts keys by default.
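The effect of the key-sorting default can be checked with a dictionary whose keys are deliberately inserted out of alphabetical order. This is a minimal sketch, assuming a PyYAML version that accepts the sort_keys argument (5.1 or later, as far as I know).

```python
import yaml  # PyYAML with sort_keys support assumed

# Keys inserted in reverse alphabetical order on purpose.
d = {'b': 'hühühü', 'a': 'héhéhé'}

# Default behaviour: keys are sorted before dumping.
sorted_out = yaml.dump(d, allow_unicode=True)

# sort_keys=False: keys keep the dictionary's insertion order.
ordered_out = yaml.dump(d, allow_unicode=True, sort_keys=False)

print(sorted_out)
print(ordered_out)
```

The first dump lists a before b, while the second keeps b first, matching the order in which the keys were inserted.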