Getting wget to work with uncooperative websites (-l not working)

Mar 22nd, 2011 | Category: Linux / FreeBSD

So you have a website that you want to grab, but running something like wget -l 4 domain.com only returns the top-level index.html and nothing else. What’s going on? First, keep in mind that -l only sets the recursion depth; without -r (or -m) wget won’t recurse at all, so make sure one of those is on the command line too.

Well, for starters it’s possible that domain.com has a rule in its robots.txt file that prohibits such behavior, in which case you are going to need to add the option -e robots=off to your wget command.
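Something along these lines should do it, with domain.com just standing in for your target site:

wget -r -l 4 -e robots=off http://domain.com/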

However, this may not do the trick, in which case you will need to make the website believe you are a browser rather than wget. You accomplish this by adding a browser user-agent string with the -U option, e.g. -U “Mozilla/5.0 (compatible; Konqueror/3.2; Linux)”, alongside -e robots=off and -m.
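Put together, the whole thing ends up as a single command along these lines (again, domain.com is just a placeholder):

wget -e robots=off -m -U "Mozilla/5.0 (compatible; Konqueror/3.2; Linux)" http://domain.com/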

Now if you want to get a bit more specific, there is a handy little program called httrack; it should be in your repos, so a little apt-get install httrack will get it installed in a jiffy. Some sites out there have gone to pretty impressive lengths to prevent themselves from being mirrored, and for those httrack seems to be the best bet. But for everything else wget is your answer!
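A minimal httrack run, assuming the defaults and with domain.com once more as a stand-in, looks something like this (-O sets the output directory and the +*.domain.com/* filter keeps the crawl on that domain):

httrack "http://domain.com/" -O ./domain-mirror "+*.domain.com/*"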

Throw both of these bad boys into the mix and your problems should disappear. Happy hunting!
