shu has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's minimum standard of quality and will not be displayed.

Replies are listed 'Best First'.
Re: Need help extracting data from web page
by dominix (Deacon) on Jan 15, 2004 at 10:29 UTC
    We won't do your homework, but we can give you some pointers to get started.
    My pointer: use HTML::TableExtract.

    use strict;
    use warnings;
    use HTML::TableExtract;

    # Give the HTML file on the command line.
    my @content = (<>);
    my $content = join( '', @content );

    for my $count ( 1 .. 8 ) {
        my $te = HTML::TableExtract->new( depth => 1, count => $count );
        $te->parse($content);

        # Shorthand...top level rows() method assumes the first table found
        # in the document if no arguments are supplied.
        foreach my $row ( $te->rows ) {
            # Empty cells come back as undef; print them as empty strings.
            print join( ',', map { defined $_ ? $_ : '' } @$row ), "\n";
        }
    }
    --
    dominix
Re: Need help extracting data from web page
by gjb (Vicar) on Jan 15, 2004 at 14:22 UTC

    In addition to the tips given above, you might also want to familiarize yourself with WWW::Mechanize, which provides a very intuitive way to scrape information from the web.

    To get the hang of it, you can use WWW::Mechanize::Shell, which lets you "browse" pages through WWW::Mechanize's functions using an interactive shell.

    The documentation for both modules should get you started.
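
    As a rough illustration of the WWW::Mechanize approach, here is a minimal sketch that has not been run against the site in question. It assumes the start URL is passed on the command line and that the interesting links contain /cv/ in their href, as the other replies in this thread suggest. The cv_links helper name is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# cv_links is a hypothetical helper: given a WWW::Mechanize object that has
# already fetched a page, return [link text, absolute URL] pairs for every
# link whose href contains /cv/ (an assumption about the site's layout).
sub cv_links {
    my ($mech) = @_;
    return map {
        my $text = $_->text;
        [ defined $text ? $text : '', $_->url_abs->as_string ]
    } $mech->find_all_links( url_regex => qr{/cv/} );
}

# Run only when a start URL is given on the command line.
if (@ARGV) {
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors
    $mech->get( $ARGV[0] );
    printf "%s => %s\n", @$_ for cv_links($mech);
}
```

    Saved under any file name and run with the staff index URL as its only argument, it should print one "name => URL" line per matching link. Each URL could then be fetched with another $mech->get call.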

    Hope this helps, -gjb-

Re: Need help extracting data from web page
by valdez (Monsignor) on Jan 15, 2004 at 12:11 UTC

    Quick and dirty and not complete, but it works :)

    #!/usr/bin/perl

    use LWP::Simple;
    use HTML::TokeParser::Simple;
    use Data::Dumper;

    use strict;
    use warnings;

    my $start = 'http://155.69.224.75:8000/eeepeople/AcadStaff.asp';
    my $file  = './index.html';

    LWP::Simple::mirror($start, $file);

    my $p = HTML::TokeParser::Simple->new($file);

    # State 0: looking for a link to a CV page.
    # State 1: inside such a link, collecting the teacher's name.
    my $state = 0;
    my ($url, $name, @teachers);

    while (my $t = $p->get_token) {
        if ($state == 0) {
            if ($t->is_start_tag('a')) {
                my $attr = $t->return_attr;
                if (exists $attr->{href} and $attr->{href} =~ /\/cv\//) {
                    $url   = $attr->{href};
                    $state = 1;
                }
            }
        }
        elsif ($state == 1) {
            if ($t->is_end_tag('a')) {
                push @teachers, { name => $name, url => $url };
                $name  = '';
                $url   = '';
                $state = 0;
            }
            elsif ($t->is_text) {
                $name .= $t->as_is;
            }
        }
    }

    print Dumper(\@teachers), "\n";

    foreach my $teacher (@teachers) {
        my $filename = lc($teacher->{name});
        $filename =~ s/\s+/_/g;
        $filename .= '.html';

        LWP::Simple::mirror($teacher->{url}, $filename);

        my $p = HTML::TokeParser::Simple->new($filename);

        # State 0: looking for text that mentions publications.
        # State 2: inside the publication list, waiting for an <li>.
        # State 3: inside an <li>, collecting its text.
        $state = 0;
        my ($pub, $res, @publications, @interests);

        while (my $t = $p->get_token) {
            if ($state == 0) {
                if ($t->is_text and $t->as_is =~ /publication/i) {
                    $state = 2;
                }
            }
            elsif ($state == 2) {
                if ($t->is_start_tag('li')) {
                    $state = 3;
                }
                elsif ($t->is_end_tag('ul')) {
                    $state = 0;
                }
            }
            elsif ($state == 3) {
                if ($t->is_text) {
                    $pub .= $t->as_is;
                }
                elsif ($t->is_end_tag('li')) {
                    push @publications, $pub;
                    $pub   = '';
                    $state = 2;
                }
            }
        }

        print $teacher->{name} . " published:\n";
        print Dumper(\@publications), "\n";
    }

    Ciao, Valerio

Re: Need help extracting data from web page
by cees (Curate) on Jan 15, 2004 at 17:07 UTC

    Have a look at the Template::Extract module. It takes a Template Toolkit template snippet and an HTML page, parses the data out of the page, and gives you a big Perl data structure with all of it. It does the reverse of what a normal templating system does: instead of generating HTML, you are pulling data out of a structured HTML document.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Template::Extract;
    use Data::Dumper;
    use HTML::Clean;
    use LWP::UserAgent;

    # Get the page
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get('http://155.69.224.75:8000/eeepeople/AcadStaff.asp');
    die $response->status_line unless $response->is_success;
    my $html = $response->content;

    # Create the extraction template
    # (the template is a single long line, ended by the lone '.')
    my $obj = Template::Extract->new;
    my $template = << '.';
    [% FOREACH record %]<tr bgcolor=[% ... %]><td><a href=[% url %] target="_blank">[% name %]</a></td><td>[% title %]</td><td>[% phonenumber %]</td><td>[% location %]</td><td><a href="mailto:[% email %]">[% username %]</a></td></tr>[% ... %][% END %]
    .

    # strip out any unnecessary whitespace from
    # the html to make parsing easier
    my $h = HTML::Clean->new(\$html);
    $h->strip();

    # extract the data from the html page and
    # dump the resulting data structure to STDOUT
    print Data::Dumper::Dumper( $obj->extract($template, $html) );

    The above code doesn't solve the whole problem, because it only parses the first section of names from the page. But you should be able to extend it to parse all the info (hint: wrap another FOREACH block around the template).

    - Cees