[NTLUG:Discuss] copying web documents

David Stanaway david at stanaway.net
Thu May 18 17:40:44 CDT 2006


Fred wrote:

> To get this exercise to work properly requires a THOROUGH understanding of
> wget and the desire to spend way too much time jacking with it. Maybe someone
> can write a tool (bash, C, whatever) that will do the job. 

Not really; it is just that wget follows the rules, and you want to bend
them. No doubt the PDF is better if you want to print it, but I prefer
HTML docs for searching, reusing the images for other things, etc.

I don't expect wget to supply an easy switch to turn off robots.txt
observance, although it might be nice. I am sure the authors have a
reason; perhaps they got some heavy-handed legal threats (probably
baseless, but not worth the effort to fight). If that's what you want to
do, it's not that hard to change: see my binary change of "robots" to
"nobots" in the executable, which requires no knowledge of C or of how
to compile programs. You can't blame wget for the site being
intentionally difficult to download.
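
For anyone who wants to try it, here is a minimal sketch of that patch.
The wget path and the URL are assumptions for your own system; work on
a copy and keep the original:

    # Patch a copy of the wget binary, never the installed one.
    cp /usr/bin/wget ~/wget-nobots

    # Swap the literal string "robots" for "nobots". The replacement is
    # the same length, so the binary's layout stays intact; the patched
    # wget then asks the server for nobots.txt, which almost never
    # exists, so no robots rules get applied.
    perl -pi -e 's/robots/nobots/g' ~/wget-nobots

    # Recursive fetch with the patched copy (example URL).
    ~/wget-nobots -r -np http://example.com/docs/

(Depending on your build, wget -e robots=off may do the same thing
without any patching, since -e executes a .wgetrc-style command.)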


