How do you extract a URL from a string using Python?

There may be a few ways to do this, but the cleanest is to use a regular expression:

    >>> import re
    >>> myString = "This is a link http://www.google.com"
    >>> print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))
    http://www.google.com

If there can be multiple links, you can use something like the following:

    >>> import re
    >>> myString = "These are the links http://www.google.com  and http://stackoverflow.com/questions/839994/extracting-a-url-in-python"
    >>> print(re.findall(r'(https?://[^\s]+)', myString))
    ['http://www.google.com', 'http://stackoverflow.com/questions/839994/extracting-a-url-in-python']
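One caveat with `[^\s]+` is that it runs up to the next whitespace, so punctuation that directly follows a link (a comma, a closing parenthesis, a full stop) ends up inside the match. A minimal sketch of one way to handle that, assuming it is acceptable to trim such characters from the end of every match; the helper name `extract_urls` and the `TRAILING_PUNCTUATION` set are made up for illustration:

```python
import re

# Same idea as the pattern above: a scheme, then everything up to whitespace.
URL_PATTERN = re.compile(r"https?://[^\s]+")

# Hypothetical choice of characters to trim; adjust it for your own text.
TRAILING_PUNCTUATION = ".,;:!?)\"'"


def extract_urls(text):
    """Return all http(s) URLs found in text, without trailing punctuation."""
    return [match.rstrip(TRAILING_PUNCTUATION) for match in URL_PATTERN.findall(text)]


text = "See http://www.google.com, or (http://stackoverflow.com/questions/839994)."
print(extract_urls(text))
# ['http://www.google.com', 'http://stackoverflow.com/questions/839994']
```

Note that this also shortens URLs that legitimately end in one of those characters, for example a path ending in `)`, so treat it as a heuristic.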

python string url extract

In order to find a web URL in a generic string, you can use a regular expression (regex).

A simple regex for URL matching like the following should fit your case.

    regex = r'('
    # Scheme (HTTP, HTTPS, FTP and SFTP):
    regex += r'(?:(https?|s?ftp):\/\/)?'
    # www:
    regex += r'(?:www\.)?'
    regex += r'('
    # Host and domain (including ccSLD):
    regex += r'(?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)'
    # TLD:
    regex += r'([A-Z]{2,6})'
    # IP Address:
    regex += r'|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
    regex += r')'
    # Port:
    regex += r'(?::(\d{1,5}))?'
    # Query path:
    regex += r'(?:(\/\S+)*)'
    regex += r')'

If you want to be even more precise, in the TLD section, you should ensure that the TLD is a valid TLD (see the entire list of valid TLDs here: https://data.iana.org/TLD/tlds-alpha-by-domain.txt):

    # TLD:
    regex += r'(com|net|org|eu|...)'
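Typing that alternation by hand gets tedious, so it could also be generated from a list. A rough sketch, assuming you already have the TLDs as Python strings; `valid_tlds` here is a tiny made-up sample, not the IANA list, and `tld_group` is a hypothetical name for the piece that would replace the hand-written TLD line:

```python
import re

# Hypothetical, tiny sample; the full list is in the IANA file linked above.
valid_tlds = ["com", "net", "org", "eu", "io"]

# Longest first, so a longer TLD is tried before a shorter one sharing its prefix.
alternatives = sorted(valid_tlds, key=len, reverse=True)

# re.escape() is a precaution in case an entry contains regex metacharacters.
tld_group = r'(' + '|'.join(re.escape(tld) for tld in alternatives) + r')'

print(tld_group)  # (com|net|org|eu|io)
```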

Then you can compile the regex above and use it to find possible matches:

    import re

    string = "This is a link http://www.google.com"
    find_urls_in_string = re.compile(regex, re.IGNORECASE)
    url = find_urls_in_string.search(string)
    if url is not None and url.group(0) is not None:
        print("URL parts: " + str(url.groups()))
        print("URL: " + url.group(0).strip())

For the string "This is a link http://www.google.com", this will output:

    URL parts: ('http://www.google.com', 'http', 'google.com', 'com', None, None)
    URL: http://www.google.com

If you change the input to a more complex URL, for example "This is also a URL https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo but this is not anymore", the output will be:

    URL parts: ('https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo', 'https', 'host.domain.com', 'com', '80', '/path/page.php?query=value&a2=v2#foo')
    URL: https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo
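If you only need the parts of a URL you have already found, the standard library can split it for you, which may be easier than reading them out of regex groups. This is a swapped-in alternative rather than part of the regex above: a small sketch using `urllib.parse.urlsplit` on the same example URL. Note that, unlike the regex groups, it keeps the `www.` prefix in the host and separates path, query and fragment.

```python
from urllib.parse import urlsplit

url = "https://www.host.domain.com:80/path/page.php?query=value&a2=v2#foo"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # www.host.domain.com
print(parts.port)      # 80
print(parts.path)      # /path/page.php
print(parts.query)     # query=value&a2=v2
print(parts.fragment)  # foo
```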

NOTE: If you are looking for multiple URLs in a single string, you can use the same regex with findall() instead of search().
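One thing to be aware of: because this regex contains several capturing groups, `findall()` returns one tuple of groups per match rather than plain strings. A sketch of one way to get just the full URLs, using `finditer()` and `group(0)`; the pattern is the same one as above, only assembled in a single expression, and the sample `text` is made up:

```python
import re

# The pattern from above, assembled in one expression (same pieces, same order).
regex = (
    r'('
    r'(?:(https?|s?ftp):\/\/)?'                     # Scheme
    r'(?:www\.)?'                                   # www
    r'('
    r'(?:(?:[A-Z0-9][A-Z0-9-]{0,61}[A-Z0-9]\.)+)'   # Host and domain
    r'([A-Z]{2,6})'                                 # TLD
    r'|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'      # IP address
    r')'
    r'(?::(\d{1,5}))?'                              # Port
    r'(?:(\/\S+)*)'                                 # Query path
    r')'
)
find_urls_in_string = re.compile(regex, re.IGNORECASE)

text = "Two links: http://www.google.com and https://stackoverflow.com/questions/839994"

# group(0) is the whole match, regardless of how many groups the pattern has.
urls = [match.group(0) for match in find_urls_in_string.finditer(text)]
print(urls)
# ['http://www.google.com', 'https://stackoverflow.com/questions/839994']
```

With `findall()` you would get the same URLs as the first element of each tuple.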


There is another easy way to extract URLs from text. You can use urlextract to do it for you; just install it via pip:

pip install urlextract

and then you can use it like this:

    from urlextract import URLExtract

    extractor = URLExtract()
    urls = extractor.find_urls("Let's have URL stackoverflow.com as an example.")
    print(urls)  # prints: ['stackoverflow.com']

You can find more info on my GitHub page: https://github.com/lipoja/URLExtract

NOTE: It downloads a list of TLDs from iana.org to stay up to date, so if your program does not have internet access, it is not for you.
