Parse URL in shell script


[EDIT 2019] This answer is not meant to be a catch-all, works-for-everything solution. It was intended to provide a simple alternative to the Python-based version, and it ended up having more features than the original.


It answered the basic question in a bash-only way and was then modified multiple times by myself to include a handful of requests from commenters. At this point, however, I think adding even more complexity would make it unmaintainable. I know not everything is straightforward (checking for a valid port, for example, requires comparing hostport and host), but I would rather not add even more complexity.
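For readers who do need the port check mentioned above, here is a minimal sketch of one way to do it. The helper name `valid_port` is hypothetical and not part of the answer; it only assumes a `hostport` string of the form `host` or `host:port`, as produced by the script in the original answer.

```bash
#!/bin/bash
# Hypothetical helper (not part of the original answer).
# Prints the port when hostport carries a purely numeric port, otherwise
# prints nothing and returns a non-zero status.
valid_port() {
    local hostport="$1"
    local host="${hostport%%:*}"        # everything before the first colon

    # No colon at all: hostport and host are identical, so there is no port.
    [ "$hostport" = "$host" ] && return 1

    local port="${hostport#"$host":}"   # strip the leading "host:"
    case "$port" in
        ''|*[!0-9]*) return 1 ;;        # empty, or contains a non-digit
    esac
    echo "$port"
}

# Usage with assumed sample values
for hp in host.net:2222 host1.net host.net:abc; do
    if port="$(valid_port "$hp")"; then
        echo "$hp -> port $port"
    else
        echo "$hp -> no valid port"
    fi
done
```

Comparing `hostport` with `host` first is what keeps a digit in the hostname (as in `host1.net`) from being mistaken for a port.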


[Original answer]

Assuming your URL is passed as first parameter to the script:

    #!/bin/bash
    # extract the protocol
    proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"
    # remove the protocol
    url="$(echo ${1/$proto/})"
    # extract the user (if any)
    user="$(echo $url | grep @ | cut -d@ -f1)"
    # extract the host and port
    hostport="$(echo ${url/$user@/} | cut -d/ -f1)"
    # by request host without port
    host="$(echo $hostport | sed -e 's,:.*,,g')"
    # by request - try to extract the port
    port="$(echo $hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
    # extract the path (if any)
    path="$(echo $url | grep / | cut -d/ -f2-)"

    echo "url: $url"
    echo "  proto: $proto"
    echo "  user: $user"
    echo "  host: $host"
    echo "  port: $port"
    echo "  path: $path"

I must admit this is not the cleanest solution, but it doesn't rely on another scripting language like Perl or Python. (Providing a solution using one of them would produce cleaner results ;) )

Using your example the results are:

    url: user@host.net/some/random/path
      proto: sftp://
      user: user
      host: host.net
      port:
      path: some/random/path

This will also work for URLs without a protocol, username or path. In that case the respective variable will contain an empty string.
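The empty-string behaviour can be tried out with a small sketch like the one below. It wraps the same steps in a function so they can be run against several inputs; the function name `parse_url` and the quoting of the variables are additions for illustration, not part of the original script.

```bash
#!/bin/bash
# Hypothetical wrapper around the steps of the original script.
# Sets the global variables proto, url, user, hostport, host, port and path.
parse_url() {
    # protocol, including the trailing "://" (empty if there is none)
    proto="$(echo "$1" | grep :// | sed -e 's,^\(.*://\).*,\1,g')"
    # the URL with the protocol removed
    url="${1/$proto/}"
    # user (empty if the URL contains no "@")
    user="$(echo "$url" | grep @ | cut -d@ -f1)"
    # host and port, then each on its own
    hostport="$(echo "${url/$user@/}" | cut -d/ -f1)"
    host="$(echo "$hostport" | sed -e 's,:.*,,g')"
    port="$(echo "$hostport" | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
    # path (empty if the URL contains no "/")
    path="$(echo "$url" | grep / | cut -d/ -f2-)"
}

# Assumed sample inputs: a full URL and a bare hostname
for u in 'sftp://user@host.net/some/random/path' 'host.net'; do
    parse_url "$u"
    echo "proto='$proto' user='$user' host='$host' port='$port' path='$path'"
done
```

For the bare `host.net` input only `host` is set; `proto`, `user`, `port` and `path` all come back empty.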

[EDIT]
If your bash version won't cope with the substitutions (${1/$proto/}) try this:

    #!/bin/bash
    # extract the protocol
    proto="$(echo $1 | grep :// | sed -e's,^\(.*://\).*,\1,g')"
    # remove the protocol -- updated
    url=$(echo $1 | sed -e s,$proto,,g)
    # extract the user (if any)
    user="$(echo $url | grep @ | cut -d@ -f1)"
    # extract the host and port -- updated
    hostport=$(echo $url | sed -e s,$user@,,g | cut -d/ -f1)
    # by request host without port
    host="$(echo $hostport | sed -e 's,:.*,,g')"
    # by request - try to extract the port
    port="$(echo $hostport | sed -e 's,^.*:,:,g' -e 's,.*:\([0-9]*\).*,\1,g' -e 's,[^0-9],,g')"
    # extract the path (if any)
    path="$(echo $url | grep / | cut -d/ -f2-)"


The above, refined (added password and port parsing), and working in /bin/sh:

    # extract the protocol
    proto="`echo $DATABASE_URL | grep '://' | sed -e's,^\(.*://\).*,\1,g'`"
    # remove the protocol
    url=`echo $DATABASE_URL | sed -e s,$proto,,g`

    # extract the user and password (if any)
    userpass="`echo $url | grep @ | cut -d@ -f1`"
    pass=`echo $userpass | grep : | cut -d: -f2`
    if [ -n "$pass" ]; then
        user=`echo $userpass | grep : | cut -d: -f1`
    else
        user=$userpass
    fi

    # extract the host -- updated
    hostport=`echo $url | sed -e s,$userpass@,,g | cut -d/ -f1`
    port=`echo $hostport | grep : | cut -d: -f2`
    if [ -n "$port" ]; then
        host=`echo $hostport | grep : | cut -d: -f1`
    else
        host=$hostport
    fi

    # extract the path (if any)
    path="`echo $url | grep / | cut -d/ -f2-`"

Posted b/c I needed it, so I wrote it (based on @Shirkin's answer, obviously), and I figured someone else might appreciate it.


This solution works in principle the same way as Adam Ryczkowski's in this thread, but it has an improved regular expression based on RFC 3986 (with some changes) and fixes some errors (e.g. userinfo can contain the '_' character). It can also understand relative URIs (e.g. to extract a query or fragment).

    #!/bin/bash

    # Following regex is based on https://tools.ietf.org/html/rfc3986#appendix-B with
    # additional sub-expressions to split authority into userinfo, host and port
    #
    readonly URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'
    #                    ↑↑            ↑  ↑↑↑            ↑         ↑ ↑            ↑ ↑        ↑  ↑        ↑ ↑
    #                    |2 scheme     |  ||6 userinfo   7 host    | 9 port       | 11 rpath |  13 query | 15 fragment
    #                    1 scheme:     |  |5 userinfo@             8 :…           10 path    12 ?…       14 #…
    #                                  |  4 authority
    #                                  3 //…

    parse_scheme () {
        [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[2]}"
    }

    parse_authority () {
        [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[4]}"
    }

    parse_user () {
        [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[6]}"
    }

    parse_host () {
        [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[7]}"
    }

    parse_port () {
        [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[9]}"
    }

    parse_path () {
        [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[10]}"
    }

    parse_rpath () {
        [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[11]}"
    }

    parse_query () {
        [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[13]}"
    }

    parse_fragment () {
        [[ "$@" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[15]}"
    }
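A short usage sketch may help to see which capture group ends up where. The regex is repeated unchanged so the snippet runs on its own; the generic helper `uri_part` and both sample URIs are assumptions made for illustration and are not part of the answer.

```bash
#!/bin/bash
# Same regex as in the answer, repeated here so the sketch is self-contained.
URI_REGEX='^(([^:/?#]+):)?(//((([^:/?#]+)@)?([^:/?#]+)(:([0-9]+))?))?(/([^?#]*))(\?([^#]*))?(#(.*))?'

# Hypothetical generic helper: print capture group number $2 of the URI in $1.
uri_part() {
    [[ "$1" =~ $URI_REGEX ]] && echo "${BASH_REMATCH[$2]}"
}

absolute='sftp://user@host.net:2222/some/random/path?x=1#frag'   # assumed sample
relative='/some/path?query=1#top'                                # assumed sample

echo "scheme:   $(uri_part "$absolute" 2)"
echo "user:     $(uri_part "$absolute" 6)"
echo "host:     $(uri_part "$absolute" 7)"
echo "port:     $(uri_part "$absolute" 9)"
echo "path:     $(uri_part "$absolute" 10)"
echo "query:    $(uri_part "$absolute" 13)"
echo "fragment: $(uri_part "$absolute" 15)"

# A relative URI has no scheme or authority, but query and fragment still match.
echo "relative query:    $(uri_part "$relative" 13)"
echo "relative fragment: $(uri_part "$relative" 15)"
```

Note that the userinfo group excludes `:`, so a URI of the form `user:password@host` is not split into userinfo and host by this regex.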