Convert non-breaking spaces to spaces in Ruby

ruby json unicode utf-8 whitespace

Use /\u00a0/ to match non-breaking spaces. For instance, s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.
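As a minimal sketch of that substitution (the helper name nbsp_to_space and the sample string are made up for illustration, and the source file is assumed to be UTF-8):

```ruby
# Hypothetical helper: replace every non-breaking space (U+00A0)
# with an ordinary space (U+0020).
def nbsp_to_space(str)
  str.gsub(/\u00a0/, ' ')
end

# Illustrative input: the two gaps are non-breaking spaces.
s = "price:\u00a010\u00a0EUR"

puts nbsp_to_space(s)                      # prints "price: 10 EUR"
puts nbsp_to_space(s).include?("\u00a0")   # prints "false"
```

The original string is left unchanged; use gsub! instead if in-place modification is wanted.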

Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
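The difference can be seen in a short sketch. The helper name normalize_whitespace and the sample strings are assumptions for illustration; the strings are assumed to be UTF-8 encoded:

```ruby
# Hypothetical helper: collapse every run of Unicode whitespace into a
# single ASCII space, then trim the ends.
def normalize_whitespace(str)
  str.gsub(/[[:space:]]+/, ' ').strip
end

s = " \u00a0 "   # space, non-breaking space, space

# \s only removes the two ASCII spaces; the NBSP (code point 160) survives.
p s.gsub(/\s+/, '').codepoints            # prints [160]

# [[:space:]] removes all three characters.
p s.gsub(/[[:space:]]+/, '').empty?       # prints true

# Mixed input: NBSP, ASCII space, and EM SPACE (U+2003).
p normalize_whitespace("\u00a0 foo\u00a0\u2003bar \u00a0")   # prints "foo bar"
```

The final strip only works here because gsub has already turned the non-breaking spaces into ASCII spaces; on its own, strip leaves U+00A0 in place.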

See also: Ruby Regexp documentation


If you cannot use \s for Unicode whitespace, that’s a bug in the Ruby regex implementation: according to Annex C, “Compatibility Properties”, of UTS#18 “Unicode Regular Expressions”, \s is absolutely required to match any Unicode whitespace code point.

There is no wiggle-room allowed since the two columns detailing the Standard Recommendation and the POSIX Compatibility are the same for the \s case. You cannot document your way around this: you are out of compliance with The Unicode Standard, in particular, with UTS#18’s RL1.2a, if you do not do this.

If you do not meet RL1.2a, you do not meet the Level 1 requirements, which are the most basic and elementary functionality needed to use regular expressions on Unicode. Without that, you are pretty much lost. This is why standards exist. My recollection is that Ruby also fails to meet several other Level 1 requirements. You may therefore wish to use a programming language that meets at least Level 1 if you actually need to handle Unicode with regular expressions.

Note that you cannot use a Unicode General Category property like \p{Zs} to stand for \p{Whitespace}. That’s because the Whitespace property is a derived property, not a general category. It also includes control characters such as tab, line feed, and carriage return, as well as the line and paragraph separators, not just space separators.
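A small Ruby sketch of that gap, using [[:space:]] as the stand-in for the Whitespace property. The helper names zs? and space? and the sample table are made up for illustration:

```ruby
# Hypothetical helpers: does the string contain a character matched
# by the given class?
def zs?(str)
  !!(str =~ /\p{Zs}/)          # General Category "Space_Separator" only
end

def space?(str)
  !!(str =~ /[[:space:]]/)     # Unicode whitespace, including controls
end

samples = {
  "NBSP (U+00A0)" => "\u00a0",
  "TAB  (U+0009)" => "\t",
  "LF   (U+000A)" => "\n",
}

samples.each do |name, ch|
  puts "#{name}  Zs=#{zs?(ch)}  space=#{space?(ch)}"
end
# prints:
# NBSP (U+00A0)  Zs=true  space=true
# TAB  (U+0009)  Zs=false  space=true
# LF   (U+000A)  Zs=false  space=true
```

Tab and line feed are control characters (category Cc), so \p{Zs} misses them even though both are whitespace.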


Actual functioning IRB code examples that answer the question, with latest Rubies (May 2012)

Ruby 1.9

    require 'rubygems'
    require 'nokogiri'
    RUBY_DESCRIPTION # => "ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]"
    doc = '<html><body> &nbsp; </body></html>'
    page = Nokogiri::HTML(doc)
    s = page.inner_text
    s.each_codepoint {|c| print c, ' ' } #=> 32 160 32
    s.strip.each_codepoint {|c| print c, ' ' } #=> 160
    s.gsub(/\s+/,'').each_codepoint {|c| print c, ' ' } #=> 160
    s.gsub(/\u00A0/,'').strip.empty? #=> true

Ruby 1.8

    require 'rubygems'
    require 'nokogiri'
    RUBY_DESCRIPTION # => "ruby 1.8.7 (2012-02-08 patchlevel 358) [x86_64-linux]"
    doc = '<html><body> &nbsp; </body></html>'
    page = Nokogiri::HTML(doc)
    s = page.inner_text # " \302\240 "
    s.gsub(/\s+/,'') # "\302\240"
    s.gsub(/\302\240/,'').strip.empty? #=> true
