in reply to So Simple, Yet no tutorial covers it

Here is some code for you to study. It uses the advice given by chromatic - chalk it up to another boring Saturday night. :)

Keep in mind that this is not an exact science. There is a bit of art involved, mainly in finding a way to extract the data into the fields you wish to store without writing unmaintainable code. My solution uses a method that I am not particularly fond of - hard-coded array indexes - but it works for the example you gave. It will break if the layout of the web pages you are parsing changes from page to page.

Also, I do not bother with any actual database code, since I do not know what your tables look like.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Parser;

# get the content of the web page
my $content = get("http://www.foobar.com/poopy/data.html");
die "could not fetch the page\n" unless defined $content;

# instantiate a new parser and let it crunch our data
my @lines;
my $parser = MyParser->new;
$parser->parse($content);
$parser->eof;   # tell the parser there is no more input

# now the data is in @lines - there is more than one way
# to do this - if you know that __EVERY__ web page will
# have the __SAME__ layout, you can hardcode your indexes
# for example, at this point @lines looks like this:
#
#  0 - Location:
#  1 - Northern Africa, bordering the Mediterranean Sea, between Egypt and Tunisia
#  2 - Geographic coordinates:
#  3 - 25 00 N, 17 00 E
#  4 - Map references:
#  5 - Africa
#  6 - Area:
#  7 - total:
#  8 - 1,759,540 sq km
#  9 - land:
# 10 - 1,759,540 sq km
# 11 - water:
# 12 - 0 sq km
# 13 - Area - comparative:
# 14 - slightly larger than Alaska
# 15 - Land boundaries:
# 16 - total:
# 17 - 4,383 km
# 18 - border countries:
# 19 - Algeria 982 km, Chad 1,055 km, Egypt 1,150 km, Niger 354 km, Sudan 383 km, Tunisia 459 km
# 20 - Coastline:
# 21 - 1,770 km
#
# so now I can store my variables by accessing the proper index
# in the array: (only the first 5 - you do the rest :)
my ($location, $coords, $refs, $total_area, $land_area);
$location   = $lines[1];
$coords     = $lines[3];
$refs       = $lines[5];
$total_area = $lines[8];
$land_area  = $lines[10];
# not pretty - but it works for the example

# insert into database - you will have to implement
# your own subroutine for this
insert_row($location, $coords, $refs, $total_area, $land_area);

# placeholder only, so the script runs - replace the body
# with real database code for your own tables
sub insert_row {
    my @values = @_;
    print join(' | ', @values), "\n";
}

####################################################################
# package MyParser - inheritance and event-driven programming
# are the things to study if you want to understand how this works
####################################################################
{
    package MyParser;
    use base qw(HTML::Parser);

    # override the text sub to simply store the
    # plain text of the content in a linear fashion
    sub text {
        my ($self, $origtext) = @_;

        # first, remove any leading and trailing white space
        # there are better ways to do this, but it's late *yawn*
        $origtext =~ s/^\s+//;
        $origtext =~ s/\s+$//;

        # finally, only store the line if it's not empty
        push(@lines, $origtext) if $origtext =~ m/\w+/;
    }
}
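
If the hard-coded indexes bother you as much as they bother me, one possible alternative is to look the values up by their labels instead of by position. The following is only a rough sketch, not something I have run against the real page: lines_to_fields is a made-up helper name, the "Area/total" style of key is my own invention, and it assumes that every label ends in a colon, that top-level labels start with a capital letter, and that sub-labels (total:, land:, water:) start with a lowercase letter. Those assumptions hold for the sample listing of @lines above, but they may not hold for other pages.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# a sample of what @lines holds after parsing (copied from the listing)
my @lines = (
    'Location:',
    'Northern Africa, bordering the Mediterranean Sea, between Egypt and Tunisia',
    'Geographic coordinates:',
    '25 00 N, 17 00 E',
    'Map references:',
    'Africa',
    'Area:',
    'total:',
    '1,759,540 sq km',
    'land:',
    '1,759,540 sq km',
    'water:',
    '0 sq km',
    'Area - comparative:',
    'slightly larger than Alaska',
);

# lines_to_fields is a hypothetical helper, not part of HTML::Parser.
# ASSUMPTIONS: a line ending in ':' is a label; a capitalized label
# starts a new section; a lowercase label is a sub-label of the
# current section and gets the key "Section/sub-label".
sub lines_to_fields {
    my @input = @_;
    my (%out, $section, $label);

    for my $line (@input) {
        if ($line =~ /^(.+):$/) {
            my $name = $1;
            if ($name =~ /^[A-Z]/) {
                # top-level label: remember it as the current section
                $section = $name;
                $label   = $name;
            }
            else {
                # sub-label: prefix it with the section, if we have one
                $label = defined $section ? "$section/$name" : $name;
            }
        }
        elsif (defined $label) {
            # first non-label line after a label is taken as its value
            $out{$label} = $line;
            undef $label;
        }
    }
    return %out;
}

my %field = lines_to_fields(@lines);

print "location:   $field{'Location'}\n";
print "total area: $field{'Area/total'}\n";
print "land area:  $field{'Area/land'}\n";
```

The trade-off is that you swap one kind of fragility for another: the code no longer cares where a field sits in the array, but it now depends on the labels being spelled and capitalized consistently. A section heading such as Area: that has no value of its own simply gets no entry in the hash.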

Jeff

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
F--F--F--F--F--F--F--F--
(the triplet paradiddle)

Re: (jeffa) Re: So Simple, Yet no tutorial covers it
by eggbert (Initiate) on Aug 11, 2002 at 16:55 UTC
    Exactly what I'm looking for. It might be sacrilege to say it here, but I didn't really care how I did this (VBScript, whatever). An initial search of the web led me to www.asptoday.com and an article, 'Parsing Web Pages From ASP Through HTTP' by Randall Kindig, which I'd have had the pleasure of paying $5 for, or of subscribing to their site at $10 a month. Kind of irritating. PerlMonks is an excellent site with lots of freely shared information, and it has saved me a significant amount of time on a couple of things so far.