Here is some code for you to study. It uses the advice
given by chromatic - chalk it up to another boring
Saturday night. :)
Keep in mind that this is not an exact science. There is
a bit of art involved, mainly in finding a way to
extract the data into the fields you wish to store without
writing code that is unmaintainable. My solution
uses a method that I am not particularly fond of, hard-coded
array indexes, but it works for the example you
gave. It will break if the layout of the pages you are
parsing varies from page to page.
Also, I do not bother with any actual database code, since
I do not know what your tables look like.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Parser;
# get the content of the web page
my $content = get("http://www.foobar.com/poopy/data.html")
    or die "could not fetch the page\n";
# instantiate a new parser and let it crunch our data
my @lines;
my $parser = MyParser->new;
$parser->parse($content);
$parser->eof;   # flush any text the parser is still buffering
# now the data is in @lines - there is more than one way
# to do this - if you know that __EVERY__ web page will
# have the __SAME__ layout, you can hardcode your indexes
# for example, at this point @lines looks like this:
#
# 0 - Location:
# 1 - Northern Africa, bordering the Mediterranean Sea, between Egypt and Tunisia
# 2 - Geographic coordinates:
# 3 - 25 00 N, 17 00 E
# 4 - Map references:
# 5 - Africa
# 6 - Area:
# 7 - total:
# 8 - 1,759,540 sq km
# 9 - land:
# 10 - 1,759,540 sq km
# 11 - water:
# 12 - 0 sq km
# 13 - Area - comparative:
# 14 - slightly larger than Alaska
# 15 - Land boundaries:
# 16 - total:
# 17 - 4,383 km
# 18 - border countries:
# 19 - Algeria 982 km, Chad 1,055 km, Egypt 1,150 km, Niger 354 km, Sudan 383 km, Tunisia 459 km
# 20 - Coastline:
# 21 - 1,770 km
#
# so now I can store my variables by accessing the proper index
# in the array: (only the first 5 - you do the rest :)
my ($location, $coords, $refs, $total_area, $land_area);
$location = $lines[1];
$coords = $lines[3];
$refs = $lines[5];
$total_area = $lines[8];
$land_area = $lines[10];
# not pretty - but it works for the example
# insert into database - you will have to implement
# your own subroutine for this
insert_row($location, $coords, $refs, $total_area, $land_area);
####################################################################
# package MyParser - inheritance and event-driven programming
# are the things to study if you want to understand how this works
####################################################################
{
    package MyParser;
    use base qw(HTML::Parser);

    # override the text sub to simply store the
    # plain text of the content in a linear fashion
    sub text {
        my ($self, $origtext) = @_;

        # first, remove any leading and trailing white space
        $origtext =~ s/^\s+//;
        $origtext =~ s/\s+$//;

        # finally, only store the line if it's not empty
        push(@lines, $origtext) if $origtext =~ m/\w+/;
    }
}
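
If the hard-coded indexes bother you as much as they bother me, one
possible alternative is to look values up by their label instead of
by their position. The following is only a rough sketch under some
assumptions: value_after() is a helper name I made up for
illustration, the sample @lines is a trimmed copy of the listing
above, and it assumes each value sits directly after its label. It
has not been tried against the real pages.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data in the same shape as the @lines listing above
# (a trimmed, hypothetical subset for illustration only).
my @lines = (
    'Location:',
    'Northern Africa, bordering the Mediterranean Sea, between Egypt and Tunisia',
    'Geographic coordinates:',
    '25 00 N, 17 00 E',
    'Map references:',
    'Africa',
    'Area:',
    'total:',
    '1,759,540 sq km',
    'land:',
    '1,759,540 sq km',
);

# value_after() is a hypothetical helper. It walks the list of labels
# in @path from left to right, each search starting just after the
# previous match, and returns the element that follows the last label.
# It returns undef if a label is missing or nothing follows it.
sub value_after {
    my ($lines, @path) = @_;
    my $i = -1;
    for my $label (@path) {
        my $found;
        for my $j ($i + 1 .. $#$lines) {
            if ($lines->[$j] eq $label) {
                $found = $j;
                last;
            }
        }
        return unless defined $found;
        $i = $found;
    }
    return $i < $#$lines ? $lines->[$i + 1] : undef;
}

my $location   = value_after(\@lines, 'Location:');
my $total_area = value_after(\@lines, 'Area:', 'total:');
my $land_area  = value_after(\@lines, 'Area:', 'land:');
```

The two-label form ('Area:', 'total:') is there because "total:"
shows up under both "Area:" and "Land boundaries:", so the lookup
needs the parent label to know which one you mean. This still breaks
if the label text itself changes, but it survives a page that adds or
drops a field.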
Jeff
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
F--F--F--F--F--F--F--F--
(the triplet paradiddle)