Have a look at the Template::Extract module. It takes a Template Toolkit template snippet and an HTML page, parses the data out of the page, and gives you a Perl data structure containing it. It does the reverse of what a normal templating system does: instead of generating HTML, you are pulling data out of a structured HTML document.

#!/usr/bin/perl
use strict;
use warnings;

use Template::Extract;
use Data::Dumper;
use HTML::Clean;
use LWP::UserAgent;

# Get the page
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://155.69.224.75:8000/eeepeople/AcadStaff.asp');
die $response->status_line unless $response->is_success;
my $html = $response->content;

# Create the extraction template
my $obj = Template::Extract->new;
my $template = << '.';
[% FOREACH record %]<tr bgcolor=[% ... %]><td><a href=[% url %] target="_blank">[% name %]</a></td><td>[% title %]</td><td>[% phonenumber %]</td><td>[% location %]</td><td><a href="mailto:[% email %]">[% username %]</a></td></tr>[% ... %][% END %]
.

# strip out any unnecessary whitespace from
# the html to make parsing easier
my $h = HTML::Clean->new(\$html);
$h->strip();

# extract the data from the html page and
# dump the resulting data structure to STDOUT
print Data::Dumper::Dumper( $obj->extract($template, $html) );

The above code doesn't solve the whole problem, because it only parses the first section of names from the page. But you should be able to extend it to parse all the info (hint: wrap another FOREACH block around the template).
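Once extract returns, the result is a hash reference keyed by the FOREACH variable name (record here), holding a list of hashes with one key per template variable. Here is a minimal sketch of walking that structure. The sample values are made up for illustration, and format_records is a hypothetical helper, not part of Template::Extract.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample shaped like the result of extract() for the
# single-FOREACH template; all values are invented for illustration.
my $data = {
    record => [
        {
            url         => 'staff1.asp',
            name        => 'A. Example',
            title       => 'Professor',
            phonenumber => '555-0100',
            location    => 'Room 1',
            email       => 'a.example@example.com',
            username    => 'a.example',
        },
        {
            url         => 'staff2.asp',
            name        => 'B. Sample',
            title       => 'Lecturer',
            phonenumber => '555-0101',
            location    => 'Room 2',
            email       => 'b.sample@example.com',
            username    => 'b.sample',
        },
    ],
};

# Hypothetical helper: turn each extracted record into one summary line.
sub format_records {
    my ($result) = @_;

    # Guard against a missing result or one without a record list
    return () unless $result && ref( $result->{record} ) eq 'ARRAY';

    return map {
        sprintf '%s (%s) <%s>', $_->{name}, $_->{title}, $_->{email}
    } @{ $result->{record} };
}

print "$_\n" for format_records($data);
```

In the real script you would pass the return value of $obj->extract($template, $html) to the helper in place of the sample data.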

- Cees


In reply to Re: Need help extracting data from web page by cees
in thread Need help extracting data from web page by shu
