How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file
You can use grep for this:
grep -Po '(?<=href=")[^"]*' file
It prints everything after href=" until the next double quote.
With your given input it returns:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
Note that it is not necessary to write cat drawspace.txt | grep '<a href=".*">'; you can get rid of the useless use of cat with grep '<a href=".*">' drawspace.txt.
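Since the question asks to store the links in a text file, you can simply redirect grep's output. A sketch, assuming the file names drawspace.txt and links.txt (the sample HTML below is a stand-in for your real input, and -P requires GNU grep):

```shell
# Create a small sample HTML file (stand-in for the real drawspace.txt)
cat > drawspace.txt <<'EOF'
<a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Lesson</a>
<a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Lesson</a>
EOF

# Extract every href value and save one URL per line
grep -Po '(?<=href=")[^"]*' drawspace.txt > links.txt

cat links.txt
```

The > links.txt redirection overwrites the file on each run; use >> links.txt if you want to append links from several HTML files.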
Another example:

$ cat a
hello <a href="httafasdf">asdas</a>hello <a href="hello">asdas</a>other things
$ grep -Po '(?<=href=")[^"]*' a
httafasdf
hello
My guess is that your PC or Mac will not have the lynx command installed by default (it is freely available on the web), but lynx lets you do things like this:
$ lynx -dump -image_links -listonly /usr/share/xdiagnose/workloads/youtube-reload.html
Output:

References
- file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html
- http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1
It is then a simple matter to grep for the http: lines. There may even be lynx options to print only the http: lines (lynx has many, many options).
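Filtering the http lines out of lynx's dump is a one-liner. A sketch that simulates the dump with a here-document (the References block mimics the output format shown above, so it runs even without lynx installed; with lynx present you would pipe lynx -dump -listonly page.html into the same grep):

```shell
# Simulated `lynx -dump -listonly` output, in the format shown in the answer
cat > dump.txt <<'EOF'
References

- file://localhost/usr/share/xdiagnose/workloads/youtube-reload.html
- http://www.youtube.com/v/zeNXuC3N5TQ&hl=en&fs=1&autoplay=1
EOF

# Keep only http(s) URLs, dropping the file:// entry and the list markers
grep -Eo 'https?://[^ ]+' dump.txt > links.txt
cat links.txt
```

The -o flag makes grep print only the matching URL rather than the whole line, which strips lynx's leading list markers for free.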