Does Ruby support unicode and how does it work?


What you heard is outdated and applies (only partially) to Ruby 1.8 or earlier. The latest stable version of Ruby (1.9) supports no fewer than 95 different character encodings (counted on my system just now). This includes pretty much all known Unicode Transformation Formats, including UTF-8.
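If you want to check the count on your own machine, something like the following should work on Ruby 1.9 or later. It is only a sketch: the helper names `encoding_count` and `unicode_encoding_names` are made up for illustration, and the number you get depends on your Ruby version and build, so it may not be 95.

```ruby
# Number of encodings this Ruby build knows about (varies by version and platform).
def encoding_count
  Encoding.list.size
end

# All registered encoding names and aliases that start with "UTF-".
def unicode_encoding_names
  Encoding.name_list.grep(/\AUTF-/i).sort
end

puts "#{encoding_count} encodings available"
puts unicode_encoding_names.join(", ")
```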

The previous stable version of Ruby (1.8) has partial support for UTF-8.

If you use Rails, it takes care of default UTF-8 encoding for you. If all you need is UTF-8 awareness, Rails will work for you whether you run Ruby 1.9 or Ruby 1.8. If you have very specific character encoding requirements, you should aim for Ruby 1.9.

If you're really interested, here is a series of articles describing the encoding issues in Ruby 1.8, how they were worked around, and how they were eventually solved in Ruby 1.9. Rails still includes workarounds for many common flaws in Ruby 1.8.


Adding the following line at the top of my file solved it.

# encoding: utf-8
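To confirm which source encoding Ruby actually picked up for a file, you can inspect `__ENCODING__`. This is a rough sketch, and `source_encoding` is just an illustrative helper name. Note that from Ruby 2.0 onward UTF-8 is already the default source encoding, so the magic comment mainly matters on 1.9.

```ruby
# encoding: utf-8
# The magic comment only takes effect as the first line of the file
# (or the second line, directly after a shebang).

# Returns the encoding Ruby is using for string literals in this file.
def source_encoding
  __ENCODING__
end

puts source_encoding
```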


That's not true. What is true is that Ruby does not support only Unicode; it supports a whole slew of other encodings as well.

This is in contrast to systems such as Java, .NET or Python, which follow the "One Encoding To Rule Them All" model. Ruby has what one of the designers of Ruby's m17n system calls a "CSI" model (Code Set Independent), which means that instead of all strings having one and the same encoding, every string is tagged with its own encoding.
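Here is a small sketch of what "every string is tagged with its own encoding" looks like in practice on Ruby 1.9+. The `describe` helper is hypothetical, written only for this example; `String#encoding`, `String#encode`, `String#force_encoding` and `String#valid_encoding?` are the standard methods.

```ruby
# Summarize a string: its encoding tag, character count, and byte count.
def describe(str)
  [str.encoding.name, str.length, str.bytesize]
end

utf8   = "r\u00E9sum\u00E9"          # \u escapes always yield a UTF-8 string
latin1 = utf8.encode("ISO-8859-1")   # same characters, transcoded to Latin-1

p describe(utf8)    # ["UTF-8", 6, 8]
p describe(latin1)  # ["ISO-8859-1", 6, 6]

# force_encoding only changes the tag, not the bytes, so a wrong tag
# leaves you with a string that is no longer valid in its claimed encoding.
relabeled = latin1.dup.force_encoding("UTF-8")
p relabeled.valid_encoding?  # false
```

Two strings holding the same characters can live side by side with different byte representations, and each one knows which it is.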

This has some significant advantages both for ease of use and performance: if your input and output encodings are the same, you never need to transcode. With the One True Encoding model, you need to transcode twice in the worst case, from the input encoding into the internal encoding and then into the output encoding. That worst case unfortunately happens pretty often, because most of these environments chose an internal encoding that nobody actually uses. In Ruby, you need to transcode at most once.
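The "at most once" point can be sketched like this. `prepare_for_output` is a hypothetical helper, not something Ruby or Rails provides, and the Shift_JIS string merely stands in for data read from some Shift_JIS source.

```ruby
# Hypothetical helper: return text ready to be written in output_encoding.
def prepare_for_output(text, output_encoding)
  # Input and output tags match: hand the string back untouched, no transcoding.
  return text if text.encoding == output_encoding

  # Tags differ: a single conversion at the output boundary.
  text.encode(output_encoding)
end

sjis_input = "\u65E5\u672C\u8A9E".encode(Encoding::Shift_JIS)

same = prepare_for_output(sjis_input, Encoding::Shift_JIS)
puts same.equal?(sjis_input)   # true: the very same object, bytes never touched

converted = prepare_for_output(sjis_input, Encoding::UTF_8)
puts converted.encoding        # UTF-8
```

Under a One True Encoding design, the Shift_JIS-in, Shift_JIS-out case would still pay for a conversion into the internal encoding and another one back out.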

The basic problem with the OTE model is that whatever encoding you choose as the One True Encoding, it will be a completely arbitrary choice, since there simply isn't a single encoding that everybody, or even a majority, uses.

In Java, for example, they chose UCS-2 as the One True Encoding. Then, a couple of years later, it turned out that UCS-2 was actually not enough to encode all characters, so they had to make a backwards-incompatible change to Java, to switch to UTF-16 as the One True Encoding. Except by that time, a significant portion of the world had moved on from UTF-16 to UTF-8. If Java had been invented a couple of years earlier, they would probably have chosen ASCII as the One True Encoding. If it had been invented in another country, it might be Shift-JIS. If it had been invented by another company, it might be EBCDIC. It's really completely arbitrary, and such an important choice shouldn't be.