[NTLUG:Discuss] wget

Preston Hagar prestonh at gmail.com
Thu May 21 16:00:29 CDT 2009


On Thu, May 21, 2009 at 9:23 AM, Tom Tumelty <tomtumelty at gmail.com> wrote:
> I am trying to download a website using wget.
> It always just downloads about 4 files, the image directory, and 4 or 5
> images.
>
> I have tried :
>
> wget -r http://www.decaturjetcenter.com
>
> wget -rc http://www.decaturjetcenter.com
> and get same results either way.
> I don't think the robots.txt file is causing the problem.
> Any idea what I am doing wrong?
> Thanks in advance,
> Tom

If you want a complete site, -m (--mirror, shorthand for -r -N -l inf
--no-remove-listing) is a good way to go.  I took a look at the site in
question, and I think the issue has to do with the www.

If I do

wget -m http://www.decaturjetcenter.com

I get 191,739 bytes in 8 files (probably more or less what you got).

If I do

wget -m http://decaturjetcenter.com

I get 2,372,505 bytes in 74 files.  Looking at the HTML on the home
page, most of the links omit the www, and by default wget's recursive
download stays on the host you started from (it won't span to what it
considers a different host unless you pass -H/--span-hosts), so those
links never get followed when you start from www.decaturjetcenter.com.
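You can check this kind of thing yourself by pulling the page and
listing the link targets.  Here is a rough sketch; the sample HTML
below is made up to stand in for the real page, but the same grep
pipeline works on a page fetched with wget -qO- :

```shell
# Sketch: extract href targets from a page to see which hostname they use.
# (Sample file with hypothetical contents; substitute the output of
#  "wget -qO- http://decaturjetcenter.com/" for the real site.)
cat > /tmp/index.html <<'EOF'
<a href="http://decaturjetcenter.com/about.html">About</a>
<a href="http://decaturjetcenter.com/images/logo.jpg">Logo</a>
EOF

# List each distinct link target once.
grep -Eo 'href="[^"]+"' /tmp/index.html | sort -u
```

If every target shows the bare hostname, a recursive wget started from
the www hostname will stop after the first page or two.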

Anyway, from your other posts I gather you administer this site, so
I'll pass along a tip: if you are going to have the page load both
with and without the www (which is a good idea), have one do a 301
(permanent) redirect to the other.  That way, search engines will
treat the two hostnames as the same site instead of as two sites.  It
looks like you are using IIS, though, and unfortunately I don't know
how to configure that there.
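For what it's worth, on Apache (not IIS, so this won't drop straight
into your setup, but it shows the idea) the canonical-host 301 is a
short mod_rewrite rule in the vhost or an .htaccess file:

```apache
# Redirect the www hostname to the bare domain with a permanent (301)
# redirect, preserving the requested path.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.decaturjetcenter\.com$ [NC]
RewriteRule ^(.*)$ http://decaturjetcenter.com/$1 [R=301,L]
```

IIS has an equivalent mechanism (the URL Rewrite module on newer
versions), but I haven't set it up myself, so check the IIS docs for
the exact syntax.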

A good site for checking your site over and picking up tips (though it
isn't a complete solution) is http://website.grader.com .  Try your
site both with and without the www and see how it comes out.

Hope this helps.

Preston
