in reply to So Simple, Yet no tutorial covers it
Keep in mind that this is not an exact science; there is a bit of art involved, mainly in finding a way to extract the data into the fields you wish to store without having to write code that is unmaintainable. My solution uses a method that I am not particularly fond of (hard-coded array indexes), but it works for the example you gave. It will break if the layout of the web pages you are parsing changes from page to page.
Also, I do not bother with any actual database code, since I do not know what your tables look like.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Parser;

# get the content of the web page (get() returns undef on failure)
my $content = get("http://www.foobar.com/poopy/data.html")
    or die "could not fetch the page\n";

# instantiate a new parser and let it crunch our data
# (@lines is a file-scoped lexical, so MyParser::text below can see it)
my @lines;
my $parser = MyParser->new;
$parser->parse($content);
$parser->eof;    # flush any text still buffered by the parser

# now the data is in @lines - there is more than one way
# to do this - if you know that __EVERY__ web page will
# have the __SAME__ layout, you can hardcode your indexes
# for example, at this point @lines looks like this:
#
#  0 - Location:
#  1 - Northern Africa, bordering the Mediterranean Sea, between Egypt and Tunisia
#  2 - Geographic coordinates:
#  3 - 25 00 N, 17 00 E
#  4 - Map references:
#  5 - Africa
#  6 - Area:
#  7 - total:
#  8 - 1,759,540 sq km
#  9 - land:
# 10 - 1,759,540 sq km
# 11 - water:
# 12 - 0 sq km
# 13 - Area - comparative:
# 14 - slightly larger than Alaska
# 15 - Land boundaries:
# 16 - total:
# 17 - 4,383 km
# 18 - border countries:
# 19 - Algeria 982 km, Chad 1,055 km, Egypt 1,150 km, Niger 354 km, Sudan 383 km, Tunisia 459 km
# 20 - Coastline:
# 21 - 1,770 km
#
# so now I can store my variables by accessing the proper index
# in the array: (only the first 5 - you do the rest :)
my ($location, $coords, $refs, $total_area, $land_area);
$location   = $lines[1];
$coords     = $lines[3];
$refs       = $lines[5];
$total_area = $lines[8];
$land_area  = $lines[10];

# not pretty - but it works for the example

# insert into database - you will have to implement
# your own subroutine for this
insert_row($location, $coords, $refs, $total_area, $land_area);

####################################################################
# package MyParser - inheritance and event-driven programming
# are the things to study if you want to understand how this works
####################################################################
{
    package MyParser;
    use base qw(HTML::Parser);

    # override the text sub to simply store the
    # plain text of the content in a linear fashion
    sub text {
        my ($self, $origtext) = @_;

        # first, remove any leading and trailing white space
        # there are better ways to do this, but it's late *yawn*
        $origtext =~ s/^\s+//;
        $origtext =~ s/\s+$//;

        # finally, only store the line if it's not empty
        push(@lines, $origtext) if $origtext =~ m/\w+/;
    }
}
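If the hard-coded indexes bother you as much as they bother me, one possible alternative is to key on the labels instead of the positions. What follows is only a sketch under a couple of assumptions drawn from the sample listing above: every label ends in a colon, and a label that starts with a lowercase letter (total:, land:, water:) belongs to the most recent capitalized label. The sub name lines_to_fields and the "Area/total" style keys are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn the flat list of text chunks into a
# label => value hash. Assumes labels end in ':' and that lowercase
# labels are sub-labels of the last capitalized label seen.
sub lines_to_fields {
    my @chunks = @_;
    my (%field, $section, $label);

    for my $chunk (@chunks) {
        if ($chunk =~ /^(.+?)\s*:$/) {
            my $name = $1;
            if ($name =~ /^[A-Z]/) {
                # top-level label such as "Area" or "Coastline"
                $section = $name;
                $label   = $name;
            }
            else {
                # sub-label such as "total"; prefix it with its section
                $label = defined $section ? "$section/$name" : $name;
            }
        }
        elsif (defined $label) {
            # a value; file it under the pending label
            $field{$label} = $chunk;
            undef $label;
        }
    }
    return %field;
}

# usage with a shortened copy of the sample data
my @sample = (
    'Location:', 'Northern Africa, bordering the Mediterranean Sea',
    'Area:', 'total:', '1,759,540 sq km', 'land:', '1,759,540 sq km',
    'Coastline:', '1,770 km',
);
my %field = lines_to_fields(@sample);
print $field{'Area/land'}, "\n";    # prints "1,759,540 sq km"
print $field{'Coastline'}, "\n";    # prints "1,770 km"
```

This survives a page that adds or drops a field, but it will still break if a value happens to end in a colon or the capitalization convention changes, so check it against a few real pages before trusting it.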
Jeff
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
F--F--F--F--F--F--F--F--
(the triplet paradiddle)
Re: (jeffa) Re: So Simple, Yet no tutorial covers it
by eggbert (Initiate) on Aug 11, 2002 at 16:55 UTC