Web Extraction

Here's the scenario that led to the development of this program.

  • You have a Palm Connected Organizer (see http://www.palm.com)
  • You use Avantgo (http://www.avantgo.com) to sync web pages to your Palm
  • Some of the pages you sync are dynamically served (river levels at the USGS, for instance), and when their server is too slow to respond, the Avantgo server times out and moves on
  • When you look at these pages on your handheld, they are sometimes there and sometimes unavailable; in short, unreliable

Caveats to this Solution

  • You have a persistent internet connection
  • You have a web server installed
  • You have Perl installed
  • You have a scheduler (Win98 Task Scheduler, anti-virus vendor scheduler, cron, etc.)

Solution

  • With a bit of programming, you have a script that retrieves the pages of interest
  • It copies them to your web server
  • It fixes the links to graphics in those pages
  • It copies the graphics to your server
  • You set Avantgo to sync with your server
  • You automate the process with Task Scheduler (or whichever scheduler you have)
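
The fetch, fix and copy steps above might look something like the following in Perl. This is only a minimal sketch, not the code from WebExtraction.pl. The names local_name, localize_images and mirror_page are made up for illustration. It only handles graphics in quoted <img src="..."> attributes, and it assumes LWP::Simple and the URI module (which LWP itself depends on) are installed.

```perl
#!/usr/bin/perl
# Illustrative sketch only; NOT the code from WebExtraction.pl.
use strict;
use warnings;
use File::Basename qw(basename);

# Local filename for a graphic: drop any query string or fragment,
# then keep only the last path component.
# Simplification: two graphics with the same filename in different
# remote directories would overwrite each other.
sub local_name {
    my ($src) = @_;
    $src =~ s/[?#].*\z//s;
    return basename($src);
}

# Point every quoted <img ... src="..."> at a copy in the same directory
# as the page. Returns the rewritten HTML followed by the original src
# values, so the caller knows which graphics to download.
sub localize_images {
    my ($html) = @_;
    my @sources;
    $html =~ s{(<img\b[^>]*?\bsrc\s*=\s*)(["'])(.*?)\2}{
        my ($prefix, $quote, $src) = ($1, $2, $3);
        push @sources, $src;
        $prefix . $quote . local_name($src) . $quote;
    }gise;
    return ($html, @sources);
}

# Fetch one page, save it and its graphics under $dir.
# Returns 0 (and writes nothing) if the page could not be fetched,
# so the previously saved copy stays in place.
sub mirror_page {
    my ($url, $dir, $file) = @_;
    require LWP::Simple;    # loaded on demand, only when fetching
    require URI;
    my $html = LWP::Simple::get($url);
    return 0 unless defined $html;

    my ($local_html, @sources) = localize_images($html);
    for my $src (@sources) {
        # Resolve relative links against the page's own URL.
        my $absolute = URI->new_abs($src, $url)->as_string;
        LWP::Simple::getstore($absolute, "$dir/" . local_name($src));
    }

    open my $out, '>', "$dir/$file" or die "Cannot write $dir/$file: $!";
    print {$out} $local_html;
    close $out or die "Cannot close $dir/$file: $!";
    return 1;
}

# Usage: perl sketch.pl <url> <directory> <filename>
if (@ARGV == 3) {
    mirror_page(@ARGV) or warn "Could not fetch $ARGV[0]\n";
}
```

Returning early when the fetch fails is a deliberate choice in this sketch: if the slow server times out on one run, the copy from the last good run is still there for Avantgo to sync.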

Interested?

If you're interested in this as a solution to a problem you have, here's how to install it. First, the disclaimer:

You are on your own. If you don't know what you're doing, it is indeed possible to overwrite files and generally mess up your computer.

Make sure that you have Perl installed and functioning. Also, you will need the LWP module.
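
One way to check both at once is a few lines of Perl. This is just a sketch; the helper name module_version is made up here and is not part of WebExtraction.pl. If Perl is working you will get output at all, and the script reports whether LWP can be loaded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the version of a module if it can be loaded, or undef if not.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # LWP::Simple -> LWP/Simple.pm
    return unless eval { require $file; 1 };
    my $version = $module->VERSION;
    return defined($version) ? $version : 'unknown';
}

printf "Perl %vd is running\n", $^V;
for my $module (qw(LWP LWP::Simple)) {
    my $version = module_version($module);
    print defined($version)
        ? "$module $version found\n"
        : "$module is MISSING\n";
}
```

A one-line equivalent from the command prompt is perl -MLWP -e "print LWP->VERSION", which prints a version number if LWP is installed and an error if it is not.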

Download the Perl Script and the sample configuration file

WebExtraction.zip

Containing:
WebExtraction.pl
WebExtraction.txt

Configure the Script

The script can be configured in one of two ways. The first is to embed the configuration into the script. The second is to configure the script from the command line.

Embedded Configuration

The script has three configurable items: the path to the location from which the web pages are served (the script also looks there for the configuration file), the configuration filename, and the name of the default web page.

Simply open the script in Notepad, or something similar, search for #!!! near the top of the program, make the appropriate changes to the quoted strings, and save it.

Note that the last of the three changes involves the DEFAULT filename, the page your web server presents when no filename is given. Servers accept several names for this, "index.htm" being one of the more popular.

Command Line Configuration

In this case, you should still edit the script as shown above, checking the DEFAULT filename and changing it if necessary.

The first command line parameter will be the BASEPATH value, and the second will be the CONFIGFILE name.
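
To show how the two styles of configuration can fit together, here is a minimal sketch. It is not taken from WebExtraction.pl. The variable layout, the resolve_config helper, the example path and the rule that command line values override the embedded ones are all assumptions made for illustration; only the names BASEPATH, CONFIGFILE and DEFAULT come from the description above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

#!!! Embedded configuration (placeholder values, not the shipped ones)
my $BASEPATH   = 'C:/Inetpub/wwwroot/palm';   # where the pages are served from
my $CONFIGFILE = 'WebExtraction.txt';         # looked for inside BASEPATH
my $DEFAULT    = 'index.htm';                 # the web server's default page

# Assumed rule: the first two command line parameters, when present,
# replace the embedded BASEPATH and CONFIGFILE. DEFAULT can only be
# changed by editing the script.
sub resolve_config {
    my @args = @_;
    my $basepath   = defined $args[0] ? $args[0] : $BASEPATH;
    my $configfile = defined $args[1] ? $args[1] : $CONFIGFILE;
    $basepath =~ s{[\\/]+\z}{};               # drop any trailing slash
    return ($basepath, "$basepath/$configfile", $DEFAULT);
}

my ($basepath, $configpath, $default) = resolve_config(@ARGV);
print "Pages are written to: $basepath\n";
print "Configuration file:   $configpath\n";
print "Default page name:    $default\n";
```

Under these assumptions, a scheduled task could run something like perl WebExtraction.pl C:/Inetpub/wwwroot/palm WebExtraction.txt, with the path replaced by wherever your own web server keeps its pages.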

License Agreement

The license agreement is reasonably simple and included in the script itself.

You can use this secure payment system to tell me how you like these programs. If you feel you've acquired a $1 program, then please, send only a single dollar. If you feel it is worth more, then fill in the amount that matches the value you've received.