How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file

shell


$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
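
If the goal is also to store those links in a text file, as the question asks, the same sed command can simply be redirected; links.txt is just an illustrative name:

# Extract the href value from each line of file and save one URL per line to links.txt
sed -n 's/.*href="\([^"]*\).*/\1/p' file > links.txt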


You can use grep for this:

grep -Po '(?<=href=")[^"]*' file

It prints everything after href=" up to the next double quote.

With your given input it returns:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values

Note that it is not necessary to write cat drawspace.txt | grep '<a href=".*">'; you can avoid the useless use of cat with grep '<a href=".*">' drawspace.txt.
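
To also store the extracted links in a text file, as the question asks, just redirect the output; drawspace.txt and links.txt here are only illustrative names:

# Save every href value found in drawspace.txt, one per line, into links.txt
grep -Po '(?<=href=")[^"]*' drawspace.txt > links.txt

Keep in mind that -P (Perl-compatible regular expressions, needed for the lookbehind) requires GNU grep; the stock BSD grep on macOS does not support it.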

Another example

$ cat a
hello <a href="httafasdf">asdas</a>
hello <a href="hello">asdas</a>
other things

$ grep -Po '(?<=href=")[^"]*' a
httafasdf
hello


My guess is your PC or Mac will not have the lynx command installed by default (it's available for free on the web), but lynx will let you do things like this:

$ lynx -dump -image_links -listonly /usr/share/xdiagnose/workloads/youtube-reload.html

Output:

References

  1. file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html
  2. http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1

It is then a simple matter to grep for the http: lines. And there may even be lynx options to print just the http: lines (lynx has many, many options).
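
For example, something along these lines should keep only the http/https URLs from the reference list and store them in a file; page.html and links.txt are placeholder names:

# Dump the link list from page.html, keep only the URLs, and save them to links.txt
lynx -dump -listonly page.html | grep -Eo 'https?://.*' > links.txt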