Win HTTrack
   

HTTrack is a free (open source) and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to your computer. HTTrack preserves the original site's relative link structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site and resume interrupted downloads. HTTrack is fully configurable and has an integrated help system. WinHTTrack is the Windows 95/98/NT/2K/XP release of HTTrack.

http://www.httrack.com


Arno: I strongly discourage using web mirror tools. See WhyMirroringIsBad.


I downloaded the SLSnapshot in May 2002, but it was a little outdated, so I tried to mirror the website with a program running under MS Windows.

When I first tried to mirror Sensei's Library using "WinHTTrack Website Copier 3.16-2" (http://www.httrack.com), I was blocked.

I reported my problem on the forum at www.httrack.com.

I received this answer:


Begin quotation from http://forum.httrack.com/

> Does HTTrack use the referrer?

Yes

> Is it possible to configure an automatic wait period between requests?

Yes - you can select 1 connection per second, but also limit the number of simultaneous connections to 1 or 2.

> How can I enable HTTrack to mirror this web-site?

You may also limit the bandwidth to something like 8KB/s; the bandwidth limiter in httrack is now very sharp and allows you to limit bandwidth abuse

> The above measures should shield SL from the most offensive scripts. What if you would still like to mirror/download SL? Use a friendly script such as wget which obeys robots.txt.

Therefore, if you leave all httrack options as is (follow robots.txt), and use the bandwidth limiter (1 conn/second, 1 simultaneous connection, +bw limit), this should be okay.

> If you use wget don't forget to specify a wait period between the requests (at least '-w 3'). Yes,

Err, 3 seconds? I'll have to implement a larger delay in httrack (which is limited to 1 second) in the future - but using slower bandwidth limit should be okay (maybe 3 or 4KB)

Also, please cut/paste this filter into the 'Scan rules' options of httrack (Options/Scan rules) :

-*/*?edit=* -*/*?copy=* -*/*?diff=* -*/*?header=* -*/*?info=* -*/*?search=* -*/*?blockme=* -*/*?random=* -*/*?edit=*

as the current (basic) handling of robots.txt does not understand the format of this site (/?foo..) (added on the todo list..)

End quotation from http://forum.httrack.com/
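
To see what the scan rules quoted above are meant to exclude, here is a minimal Python sketch. It assumes that a leading `-` means "exclude", that `*` matches any run of characters, and that every other character (including `?`) is literal. HTTrack's real matcher may behave differently. The host `sl.example` and the function names are hypothetical, chosen only for illustration.

```python
import re

# Exclusion rules from the quoted forum reply (stray spaces removed,
# the repeated edit rule listed once).
SCAN_RULES = [
    "-*/*?edit=*", "-*/*?copy=*", "-*/*?diff=*", "-*/*?header=*",
    "-*/*?info=*", "-*/*?search=*", "-*/*?blockme=*", "-*/*?random=*",
]


def rule_to_regex(pattern):
    """Assumed semantics: '*' is a wildcard, everything else is literal."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def is_excluded(url, rules=SCAN_RULES):
    """Return True if any '-' (exclude) rule matches the whole URL."""
    for rule in rules:
        if rule.startswith("-") and rule_to_regex(rule[1:]).fullmatch(url):
            return True
    return False


if __name__ == "__main__":
    for url in ("http://sl.example/?edit=WinHTTrack",
                "http://sl.example/?WinHTTrack"):
        print(url, "excluded" if is_excluded(url) else "kept")
```

Under these assumptions, the edit, diff, search and similar views of each page are skipped, while the plain page URL is still fetched.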


I changed the settings accordingly in "WinHTTrack Website Copier 3.16-2" (http://www.httrack.com), and this time I managed to mirror the complete website without being blocked again.
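
The throttling recommended in the forum reply (at most one request per second, plus a bandwidth cap of a few KB/s) can be pictured with the following Python sketch. It is not HTTrack's implementation, only an illustration of how the two limits combine. The function name and its parameters are hypothetical.

```python
def required_pause(bytes_received, elapsed_seconds,
                   max_bytes_per_second, min_interval_seconds):
    """Seconds to wait before the next request so both limits are respected.

    bytes_received       -- size of the response just downloaded
    elapsed_seconds      -- time already spent on that request
    max_bytes_per_second -- bandwidth cap (e.g. 4096 for about 4KB/s)
    min_interval_seconds -- minimum spacing between requests (e.g. 1.0)
    """
    # Time the transfer would have taken at exactly the capped rate.
    time_at_cap = bytes_received / max_bytes_per_second
    # The next request may start only once both limits are satisfied.
    earliest_next = max(time_at_cap, min_interval_seconds)
    return max(0.0, earliest_next - elapsed_seconds)


if __name__ == "__main__":
    # An 8KB page fetched in half a second under a 4KB/s cap.
    print(required_pause(8192, 0.5, 4096, 1.0))
```

For small pages the one-second spacing dominates, while for larger pages the bandwidth cap determines the wait.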

The mirroring process took about 6 hours.

frankyd

Arno: so you were the guy who downloaded the FrontPage 80000 times causing some hundreds of MB traffic before realizing it didn't work as intended? Other people would ban your IP range permanently for such behaviour, you know.


@arno: I suppose I did not download the frontpage 80000 times, because the first time I tried to mirror with httrack I was blocked after a few minutes.

I am sorry that you are disappointed by my behavior, and I apologize for causing too much traffic. I will not mirror your website again, and I appreciate that you did not ban my IP range.

I would like to suggest updating the SLSnapshot on a regular basis and announcing the date when each new SLSnapshot becomes available.

frankyd


Stefan: frankyd, if you submit a lot of good go content to the library, we will collectively beg Arno and Morten to have mercy on you. :-)

Arno: Frankyd, maybe I should've checked the server logs more thoroughly. Your mirroring caused only about 50MB of traffic. The frontpage-guy (causing some 600MB traffic in total) was someone else (from Stuttgart). 50MB is not that much. I just don't want to make mirroring SL common practice, as it puts too much burden on the server - after all, mirroring causes as much traffic as a visit from 150 users.

About the snapshot: currently, making the snapshot is not fully automatic and takes some time on my part. I think that a snapshot every 2-3 months is reasonable enough. No?

Dieter: So much for the apologies and the demands. I must have missed the words "thank you" to the SL administrator, for surely if one finds SL worth mirroring, one must be very grateful to the guys who made it possible, no?



This is a copy of the living page "Win HTTrack" at Sensei's Library.
(C) the Authors, published under the OpenContent License V1.0.