How can I normalize a URL in Python?

Have a look at the werkzeug.utils module (the function has since moved to werkzeug.urls).

The function you are looking for is called "url_fix" and works like this:

>>> from werkzeug.urls import url_fix
>>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

It's implemented in Werkzeug as follows:

import urllib
import urlparse

def url_fix(s, charset='utf-8'):
    """Sometimes you get an URL by a user that just isn't a real
    URL because it contains unsafe characters like ' ' and so on.  This
    function can fix some of the problems in a similar way browsers
    handle data entered by the user:

    >>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
    'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

    :param charset: The target charset for the URL if the url was
                    given as unicode string.
    """
    if isinstance(s, unicode):
        s = s.encode(charset, 'ignore')
    scheme, netloc, path, qs, anchor = urlparse.urlsplit(s)
    path = urllib.quote(path, '/%')
    qs = urllib.quote_plus(qs, ':&=')
    return urlparse.urlunsplit((scheme, netloc, path, qs, anchor))
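
The implementation above targets Python 2 (urllib, urlparse, unicode). As a rough sketch only, and not Werkzeug's own code, the same idea might look like this on Python 3, assuming the functions now found in urllib.parse; the name url_fix_py3 is made up for illustration:

```python
from urllib.parse import quote, quote_plus, urlsplit, urlunsplit


def url_fix_py3(url, charset="utf-8"):
    """Hypothetical Python 3 sketch: percent-encode the path and query only."""
    scheme, netloc, path, qs, anchor = urlsplit(url)
    # Keep '/' and '%' so separators and existing escapes are left alone.
    path = quote(path, safe="/%", encoding=charset)
    # In the query string, spaces become '+', while ':', '&' and '=' stay as-is.
    qs = quote_plus(qs, safe=":&=", encoding=charset)
    return urlunsplit((scheme, netloc, path, qs, anchor))


fixed = url_fix_py3("http://de.wikipedia.org/wiki/Elf (Begriffsklärung)")
print(fixed)
```

The scheme and host are passed through untouched, which is what keeps the result usable as a URL.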


The real fix for this problem landed in Python 2.7.

The solution used there was:

# percent encode url, fixing lame server errors for e.g, like space
# within url paths.
fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")

For more information see Issue918368: "urllib doesn't correct server returned urls"
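
To see what that safe list does in practice, here is a small sketch that applies the same quote call to a whole URL. It assumes Python 3's urllib.parse.quote rather than the Python 2.7 function quoted above, and the helper name quote_full_url is invented for this example:

```python
from urllib.parse import quote

# Same safe characters as in the snippet above: URL delimiters and '%'
# are left alone, so only characters such as spaces get escaped.
SAFE_CHARS = "%/:=&?~#+!$,;'@()*[]"


def quote_full_url(fullurl):
    """Hypothetical helper: escape unsafe characters without touching delimiters."""
    return quote(fullurl, safe=SAFE_CHARS)


print(quote_full_url("http://www.example.com/foo goo/bar.html"))
```

Because '%' is in the safe list, a URL that already contains escapes such as %20 is not double-encoded.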


Use urllib.quote or urllib.quote_plus.

From the urllib documentation:

quote(string[, safe])
Replace special characters in string using the "%xx" escape. Letters, digits, and the characters "_.-" are never quoted. The optional safe parameter specifies additional characters that should not be quoted -- its default value is '/'.
Example: quote('/~connolly/') yields '/%7econnolly/'.
quote_plus(string[, safe])
Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.
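
A short illustration of the difference between the two functions, assuming Python 3, where they live in urllib.parse (the sample string is arbitrary):

```python
from urllib.parse import quote, quote_plus

segment = "foo goo/bar+baz"

# quote keeps '/' by default and escapes the space as %20.
quoted = quote(segment)

# quote_plus turns spaces into '+', and escapes '/' and literal '+'.
plus_quoted = quote_plus(segment)

# Characters listed in safe are left untouched.
plus_quoted_safe = quote_plus(segment, safe="/+")

print(quoted)
print(plus_quoted)
print(plus_quoted_safe)
```

Note that recent Python 3 versions treat '~' as unreserved, so the '/~connolly/' example from the Python 2 documentation is no longer escaped there.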

EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as @ΤΖΩΤΖΙΟΥ points out:

>>> quoted_url = urllib.quote('http://www.example.com/foo goo/bar.html')
>>> quoted_url
'http%3A//www.example.com/foo%20goo/bar.html'
>>> urllib2.urlopen(quoted_url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python25\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "c:\python25\lib\urllib2.py", line 373, in open
    protocol = req.get_type()
  File "c:\python25\lib\urllib2.py", line 244, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http%3A//www.example.com/foo%20goo/bar.html

@ΤΖΩΤΖΙΟΥ provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the url and only encode the path. This may be more useful for you, although if you're building the URL from a known protocol and host but with a suspect path, you could probably do just as well to avoid urlparse and just quote the suspect part of the URL, concatenating with known safe parts.
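
Both options can be sketched roughly as follows. This is not the function referred to above, just an illustration of the idea, assuming Python 3's urllib.parse; the names quote_path_only and build_url are hypothetical:

```python
from urllib.parse import quote, urlparse, urlunparse


def quote_path_only(url):
    """Parse the URL and percent-encode only its path component."""
    parts = urlparse(url)
    # '%' is kept safe so existing escapes are not encoded twice.
    return urlunparse(parts._replace(path=quote(parts.path, safe="/%")))


def build_url(base, suspect_path):
    """Join a trusted scheme-and-host prefix with a quoted, untrusted path."""
    return base.rstrip("/") + "/" + quote(suspect_path.lstrip("/"), safe="/%")


print(quote_path_only("http://www.example.com/foo goo/bar.html"))
print(build_url("http://www.example.com", "foo goo/bar.html"))
```

The second helper skips parsing entirely, which is only reasonable when the base really is known to be well-formed.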
