Linked Webpages


mike2002
08-28-2004, 08:44 AM
I know that The PC Guide doesn't approve of such things but I tried out one of those 'Mass Downloading' programs, namely 'Surf Offline'.
It works fine on some sites, but not so well on others. The program also saves the pages amid a mass of folders, some of which are six sub-folders deep. When offline, it can take a bit of searching to locate the main 'Index' page to start from.

My query is: is there any other way of downloading multiple connected pages whilst still retaining the 'links' that connect them all together?
As an example, I downloaded a Dell 'PC User Guide' in a zipped file.
When unzipped, it contained:

46 HTML files
24 GIF images
181 JPEG Images
1 Cascading Style Sheet document

Nothing else; no additional files as are found in the 'Surf Offline' program. But when you start at the Index page, the links to all the other pages are retained, including the images just as if you were still online.

How is it done?? :confused:

mike2002
08-28-2004, 08:50 AM
Sorry - I forgot to mention, there are no directory folders in the Dell files.

Paul Komski
08-28-2004, 05:18 PM
You can go three levels deep with IE by adding a page to favorites and choosing "make available offline" without using any specialist software.

No folders or lots of folders; it just depends how the site was designed. Most "spiders" stick to one domain at a time but some websites utilise external links more than others and these would tend to break the spidering.

There are no absolute rules and some "bots" work better than others and have better filters inside them. There is also likely to be less success with sites using a lot of client-side DHTML and with those using active server-side scripting of one sort or another.

So the spiders also need to be able to distinguish between relative and absolute links, and to tell apart absolute links to foreign domains, to the same domain and to sub-domains. They also need to pick up any included files (local or far away), be they pictures, text, style sheets, javascripts and so on. Some sites are as nice and simple as the one in your unzipped folder, but that is perhaps the exception rather than the rule.
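
To make the link distinctions above concrete, here is a rough Python sketch of how a spider might sort the links it finds on a page. It is only an illustration, not how Surf Offline or any other particular program works; the function name classify_link and the example.com addresses are made up for the example, and the sub-domain check is a simple suffix test on the host name.

```python
from urllib.parse import urljoin, urlparse

def classify_link(page_url, href):
    """Return (category, absolute_url) for one href found on page_url.

    classify_link is a hypothetical helper written for this example.
    """
    # Resolve the href against the page it was found on.
    absolute = urljoin(page_url, href)
    parts = urlparse(href)
    page_host = (urlparse(page_url).hostname or "").lower()
    link_host = (urlparse(absolute).hostname or "").lower()

    # No scheme and no host in the href itself means it is a relative link.
    if parts.scheme == "" and parts.netloc == "":
        return "relative", absolute
    if link_host == page_host:
        return "absolute, same domain", absolute
    # Simple suffix test: images.example.com is treated as a sub-domain of example.com.
    if link_host.endswith("." + page_host):
        return "absolute, sub-domain", absolute
    return "absolute, foreign domain", absolute

if __name__ == "__main__":
    page = "http://example.com/guide/index.htm"
    for href in ("chapter2.htm",
                 "http://example.com/pics/logo.gif",
                 "http://images.example.com/photo.jpg",
                 "http://other.org/style.css"):
        print(href, "->", classify_link(page, href))
```

A site whose pages use only relative links, like the Dell guide, keeps working when the files are copied elsewhere because each link is resolved against wherever the current page happens to live.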

pave_spectre
08-29-2004, 01:00 AM
I use a program called wget (http://wget.sunsite.dk/) which is an extremely useful downloading tool also available for Windows.

Its main purpose is to be a simple non-interactive command-line download utility capable of resuming downloads in the same way as programs like GetRight. It also has a basic spidering function that I have used a couple of times and that has proved to be quite reliable, though that would also depend on the complexity of the site being downloaded.

mike2002
08-29-2004, 09:32 AM
Paul wrote: "You can go three levels deep with IE by adding a page to favorites and choosing "make available offline" without using any specialist software."
Strangely enough, I've never tried that function in I.E. - no particular reason!
Other than that, yes, I agree with your other comments. As you say, the simplest sites are the most effective regarding mass downloading.

I've recently noticed that Ebay no longer allow their webpages to be saved, except in HTML format.
Out of curiosity I put an Ebay URL into 'Surf Offline' just to see what it would make of it. All I got was a jumbled mess.

Paul Komski
08-29-2004, 12:47 PM
Any webpage downloaded into your browser will be in html format (unless you are downloading a document such as a .doc, .txt or .pdf file) even if it is not an htm/html file on the web-server (eg asp, cgi, php, ...). If you read the source code it will be in ordinary html, and the only time this can get difficult is when frames are involved - but you just need to extract the url of the frame content and then view that page individually; Opera will do this second part for you.
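
For anyone who wants to pull the frame addresses out of a saved page's source without hunting through it by eye, a small Python sketch along these lines should do it. It uses only the standard library; the names FrameFinder and find_frame_urls and the example.com addresses are invented for the illustration, and it assumes you already have the page source as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class FrameFinder(HTMLParser):
    """Collects the src of every <frame> and <iframe> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frame_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Turn a relative src into a full address that can be opened directly.
                self.frame_urls.append(urljoin(self.base_url, src))

def find_frame_urls(html_source, base_url):
    """Return the absolute URLs of all frames found in html_source."""
    finder = FrameFinder(base_url)
    finder.feed(html_source)
    return finder.frame_urls

if __name__ == "__main__":
    source = ('<frameset cols="20%,80%">'
              '<frame src="menu.htm">'
              '<frame src="http://example.com/body.htm">'
              '</frameset>')
    for url in find_frame_urls(source, "http://example.com/docs/index.htm"):
        print(url)
```

Each address it prints can then be opened on its own in the browser and saved as an ordinary html page.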

So it should always be possible to just copy and paste the source code and save it as html. Some of the bigger problems will be linking up with other scripts, particularly if they contain errors, and with the other page elements, which will be sitting in the Temporary Internet Files (TIF) folder.