How can I extract a URL with non-English characters from a string?

ruby-on-rails ruby string url uri

Ruby's built-in URI is useful for some things, but it's not the best choice when dealing with international characters or IDNA addresses. For that I recommend using the Addressable gem.
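To see the limitation concretely, here is a minimal sketch that uses only the standard library. The helper name `try_stdlib_parse` and the pre-encoded comparison URL are made up for illustration; the point is just that `URI.parse` raises `URI::InvalidURIError` on a URL containing a space or non-ASCII characters, but accepts the same URL once it is percent-encoded.

```ruby
require 'uri'

# Hypothetical helper: returns the parsed URI, or nil if the stdlib rejects the input.
def try_stdlib_parse(url)
  URI.parse(url)
rescue URI::InvalidURIError
  nil
end

# The raw URL from the question, and a hand-encoded version of the same URL.
raw     = 'http://www.example.com/wp content/uploads/2012/01/München.jpg'
encoded = 'http://www.example.com/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg'

p try_stdlib_parse(raw)            # the raw form is rejected, so this prints nil
p try_stdlib_parse(encoded).path   # the encoded form parses normally
```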

This is some cleaned-up IRB output:

require 'addressable/uri'
url = 'http://www.example.com/wp content/uploads/2012/01/München.jpg'
uri = Addressable::URI.parse(url)

Here's what Ruby knows now:

#<Addressable::URI:0x102c1ca20
    @uri_string = nil,
    @validation_deferred = false,
    attr_accessor :authority = nil,
    attr_accessor :host = "www.example.com",
    attr_accessor :path = "/wp content/uploads/2012/01/München.jpg",
    attr_accessor :scheme = "http",
    attr_reader :hash = nil,
    attr_reader :normalized_host = nil,
    attr_reader :normalized_path = nil,
    attr_reader :normalized_scheme = nil
>

And looking at the path, you can see it as is, or normalized into its percent-encoded form:

1.9.2-p290 :004 > uri.path            # => "/wp content/uploads/2012/01/München.jpg"
1.9.2-p290 :005 > uri.normalized_path # => "/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg"

Addressable really should be selected to replace Ruby's URI considering how the internet is moving to more complex URIs and mixed Unicode characters.

Now, getting at the string is easy too, but depends on how much text you have to look through.

If you have a full HTML document, your best bet is to use Nokogiri to parse the HTML and extract the href attributes from the <a> tags. This is where to start for a single <a>:

require 'nokogiri'
html = '<a href="http://www.example.com/wp content/uploads/2012/01/München.jpg">München</a>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.at('a')['href'] # => "http://www.example.com/wp content/uploads/2012/01/München.jpg"

Parsing using DocumentFragment avoids wrapping the fragment in the usual <html><body> tags. For a full document you'd want to use:

doc = Nokogiri::HTML.parse(html)

Here's the difference between the two:

irb(main):006:0> Nokogiri::HTML::DocumentFragment.parse(html).to_html
=> "<a href=\"http://www.example.com/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg\">München</a>"

versus:

irb(main):007:0> Nokogiri::HTML.parse(html).to_html
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><a href=\"http://www.example.com/wp%20content/uploads/2012/01/M%C3%BCnchen.jpg\">München</a></body></html>\n"

So, use the second for a full HTML document, and for a small, partial chunk, use the first.

To scan an entire document, extracting all the hrefs, use:

hrefs = doc.search('a').map{ |a| a['href'] }

If you only have small strings like you show in your example, you can consider using a simple regex to isolate the needed href:

html[/href="([^"]+)"/, 1]
=> "http://www.example.com/wp content/uploads/2012/01/München.jpg"
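If a short string holds more than one link, the same pattern can be applied with `String#scan` to collect every match. This is only a sketch: the two-link sample string is made up for illustration, and the pattern assumes double-quoted href attributes, so anything messier is better left to Nokogiri.

```ruby
# Sample input (made up for illustration): two links, one with a space and an umlaut.
html = '<a href="http://www.example.com/wp content/uploads/2012/01/München.jpg">München</a> ' \
       '<a href="http://www.example.com/about">About</a>'

# String#scan returns one array of capture groups per match;
# flatten turns that into a flat list of href values.
hrefs = html.scan(/href="([^"]+)"/).flatten

puts hrefs
```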


You have to encode the URL first (this relies on URI.encode, which was removed in Ruby 3.0, so it only works on older versions):

URI.extract(URI.encode('<a href="http://www.example.com/wp_content/uploads/2012/01/München.jpg">München</a>'))


The URI module is probably restricted to 7-bit ASCII characters. Although UTF-8 is the presumed standard for a lot of things, this is never assured, and there's no way to specify the encoding of a URI like you can for a complete HTTP exchange.

One solution is to render non-ASCII characters as their % equivalents. Related Stack Overflow post: Unicode characters in URLs

If you're dealing with data that's already mangled, you may want to call URI.encode on it first to percent-encode it, then match against it again.
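Since URI.encode is no longer available as of Ruby 3.0, here is a rough sketch of the same "encode first, then parse" idea using only the standard library. The helper `percent_encode_unsafe` is hypothetical, not part of any library; it assumes a UTF-8 string and simply percent-encodes every byte of any character outside printable ASCII, leaving existing `%XX` sequences untouched.

```ruby
require 'uri'

# Hypothetical helper: percent-encode spaces, control characters, and
# non-ASCII characters byte by byte; printable ASCII is left as is.
def percent_encode_unsafe(str)
  str.gsub(/[^\x21-\x7E]/) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

html = '<a href="http://www.example.com/wp content/uploads/2012/01/München.jpg">München</a>'

href    = html[/href="([^"]+)"/, 1]     # isolate the raw href first
encoded = percent_encode_unsafe(href)   # then make it ASCII-safe
uri     = URI.parse(encoded)            # now the stdlib parser accepts it

puts encoded
puts uri.path
```

Encoding only the extracted href, rather than the whole HTML string, keeps the markup around the link from being percent-encoded along with it.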
