Sang has asked for the wisdom of the Perl Monks concerning the following question:

I'm an utter newbie with Perl, so excuse the ugly code at the bottom. I'm trying to automate pulling census data with an approach such as:

  • Submit the search area from a local cgi form
  • Discard everything but the links to the data
  • Modify those links to actually pull the data instead of displaying a data selection form
  • Show the modified links
    The code below gets me halfway there. Currently I'm just saving the results to a file because I haven't yet tackled how to display them in a browser without saving to disk first. The problems I'm having a hard time figuring out are:

    Given this Census area selection page, how could I modify the code below to also display the area names, e.g., "Tulsa, OK (city) STF3A" instead of just "STF3A"?

    How do I make a self-contained CGI script that generates these pages dynamically instead of saving them to disk? I realize that's a big question, but mebbe someone could show a little example? Type an address in a form and it retrieves and displays the page, changing any occurrence of "the" to "tha", etc.?

    use URI::URL;
    use LWP::Simple;
    use HTML::TokeParser;
    use strict;

    my $url = url('http://www.census.gov/cgi-bin/gazetteer');
    $url->query_form( city => "Tulsa", state => "OK" );
    my $document = get( $url );
    my $p = HTML::TokeParser->new(\$document);

    open( OUTPUT, ">output.html" ) || die "Couldn't open 'output.html': $!\n";
    while (my $token = $p->get_tag("a")) {
        my $url = $token->[1]{href};
        $url =~ s/CMD=TABLES/CMD=RET/;
        my $text = $p->get_trimmed_text("/a");
        if ($text eq "STF1A" || $text eq "STF3A") {
            print OUTPUT "<a href=$url/FMT=HTML/T=P1>$text</a><BR>\n";
        }
    }
    close( OUTPUT ) || die "Can't close 'output.html': $!";
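On the second question, a self-contained CGI script could look roughly like the sketch below. It is only an illustration under some assumptions: the form field name "address", the helper names rewrite_page and run_cgi, and the choice of CGI.pm with LWP::Simple are all made up here for the example. The substitution runs over the raw HTML, so it would also change "the" inside tags or scripts; a more careful version would probably rewrite only the text tokens.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive rewrite: change every whole-word "the" to "tha".
# It does not distinguish markup from visible text.
sub rewrite_page {
    my ($html) = @_;
    $html =~ s/\bthe\b/tha/g;
    return $html;
}

# Handle one CGI request: show the form, or fetch and rewrite the page.
sub run_cgi {
    require CGI;            # loaded lazily so rewrite_page() is usable on its own
    require LWP::Simple;

    my $q       = CGI->new;
    my $address = $q->param('address');   # hypothetical form field name

    print $q->header('text/html');

    # No usable address yet: print the form and stop.
    unless (defined $address && $address =~ m{^https?://}i) {
        print qq{<form method="get">Address: <input name="address" size="60">},
              qq{ <input type="submit" value="Fetch"></form>\n};
        return;
    }

    my $page = LWP::Simple::get($address);   # undef if the fetch failed
    if (defined $page) {
        print rewrite_page($page);
    }
    else {
        print "<p>Could not retrieve ", $q->escapeHTML($address), "</p>\n";
    }
}

# Only behave as a CGI script when a web server has set up the environment.
run_cgi() if $ENV{GATEWAY_INTERFACE};
```

Placed in a cgi-bin directory and made executable, this prints the form when called without an address and the rewritten page otherwise. Nothing is written to disk. Fetching arbitrary addresses for visitors is risky on a public server, so something like this is best kept local.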
    Re: Retreive, modify, & display webpage
    by AidanLee (Chaplain) on Jan 03, 2002 at 19:57 UTC
      I'll re-post this here as the other post is marked as the duplicate

      What you'll probably want to do is walk through the bulleted list and, for each bullet:

      1. Pull off the first line of text (the name)
      2. Then get the URL from the link(s?) that come after that bullet but before the next.

      It may be difficult to do with TokeParser since the generated page doesn't close its list-element (<li>) tags, and I don't know what it can or can't handle. If it does not work then, as unwise as it usually is to advocate this, since you're working with a "known format" it would be possible to parse this page with regular expressions:

      my @document = split /\n/, $document;
      my $entry = '';
      foreach ( @document ) {
          m|^<li>(.*?)</strong>| and do { $entry = $1; next };
          m|<a href=(.*?)>(.*?)</a>| and do {
              my $url = $1;
              $url =~ s/CMD=TABLES/CMD=RET/;
              my $text = $2;
              if ($text eq "STF1A" || $text eq "STF3A") {
                  print OUTPUT "<a href=$url/FMT=HTML/T=P1>$entry $text</a><br />\n";
              }
              next;
          };
      }
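The same idea can also be sketched against the whole document instead of line by line, which sidesteps links that wrap across lines. This is a rough sketch, not a tested scraper for the real page: the sample HTML, the shortened URLs, the assumption that each area name sits in the first <strong>...</strong> after its <li>, and the helper name extract_links are all guesses made for illustration.

```perl
use strict;
use warnings;

# Return a list of [area name, link text, rewritten URL] triples.
sub extract_links {
    my ($document) = @_;
    my @found;

    # Each chunk runs from one <li> to the next, so unclosed <li> tags are fine.
    for my $chunk (split /<li>/i, $document) {
        # Assumed layout: the area name is the first <strong>...</strong>.
        my ($entry) = $chunk =~ m{<strong>\s*(.*?)\s*</strong>}is or next;

        # Find only the STF1A/STF3A links, quoted or unquoted href.
        while ($chunk =~ m{<a\s+href="?([^">]+)"?\s*>(STF1A|STF3A)</a>}gis) {
            my ($url, $text) = ($1, $2);   # copy captures before any other regex
            $url =~ s/\s+//g;              # drop line breaks inside the URL
            $url =~ s/CMD=TABLES/CMD=RET/;
            push @found, [ $entry, $text, "$url/FMT=HTML/T=P1" ];
        }
    }
    return @found;
}

# Hypothetical sample, loosely modelled on the snippets quoted in this thread.
my $sample = <<'HTML';
<ul>
<li><strong>Tulsa, OK (city)</strong><br>
Browse Tiger <a href="http://tiger.census.gov/cgi-bin/mapbrowse-tbl?lat=36.12000">Map</a> of area.<br>
Lookup 1990 Census <a href=http://venus.census.gov/cdrom/lookup/CMD=TABLES/DB=C90STF1A/
SEL=40,Tulsa>STF1A</a>
<li><strong>Tulsa County, OK</strong><br>
Lookup 1990 Census <a href=http://venus.census.gov/cdrom/lookup/CMD=TABLES/DB=C90STF3A/SEL=40,143>STF3A</a>
</ul>
HTML

for my $link (extract_links($sample)) {
    my ($entry, $text, $url) = @$link;
    print qq{<a href="$url">$entry $text</a><br>\n};
}
```

Returning the triples rather than printing inside the loop keeps the parsing separate from the output, so the same function could feed a file or a CGI response.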
        Aidan: Thanks for the reply. I'm currently trying the regex approach and ran into a little bump. The pattern for grabbing the link's text description...
        m|<a href=(.*?)>(.*?)</a>|
        ...will grab everything but "STF3A" and "STF1A". Given:
        Browse Tiger <a href="http://tiger.census.gov/cgi-bin/mapbrowse-tbl?lat=36.12000
        &lon=-95.94135&wid=0.75&ht=0.75&mlat=36.12000&mlon=-95.94135&msym=redpin&off=CIT
        IES&mlabel=Tulsa+County,+OK">Map</a> of area.<br>
        $text will hold "Map" but when given:
        Lookup 1990 Census <a href=http://venus.census.gov/cdrom/lookup/CMD=TABLES/DB=C9
        0STF1A/F0=FIPS.STATE/F1=FIPS.COUNTY90/F2=STUB.GEO/LEV=COUNTY90/SEL=40,143,Tulsa+
        County>STF1A</a>
        $text is empty... I've tried tweaking the pattern, but I'm even more of a newbie with regex than I am with Perl. Any suggestions?
          If STF1A and STF3A are the only two strings you'll ever want to match, you might consider changing it to this (with {} delimiters, since a | inside m|...| would end the pattern early):
          m{<a href=(.*?)>(STF1A|STF3A)</a>}
          But that won't necessarily address why it isn't matching. If the URLs you're parsing are broken across multiple lines like that, you'll need to add the 's' modifier so that the .*? will match newlines as well:
          m{<a href=(.*?)>(STF1A|STF3A)</a>}s
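A small check of what the 's' modifier changes, as a sketch only. The shortened URL and the position of the line break are made up for the example, and the pattern is written with {} delimiters. It also suggests that 's' can only help when the pattern is applied to text that still contains the newline, which is not the case after a split on /\n/.

```perl
use strict;
use warnings;

# Hypothetical link whose href is broken across two lines.
my $html = "Lookup 1990 Census <a href=http://venus.census.gov/cdrom/lookup/CMD=TABLES/\n"
         . "DB=C90STF1A>STF1A</a>";

my $re_plain = qr{<a href=(.*?)>(STF1A|STF3A)</a>};    # . stops at a newline
my $re_s     = qr{<a href=(.*?)>(STF1A|STF3A)</a>}s;   # . also matches a newline

print "without s: ", ($html =~ $re_plain ? "match" : "no match"), "\n";
print "with s:    ", ($html =~ $re_s     ? "match" : "no match"), "\n";

# After splitting on newlines, no single line holds the whole link,
# so even the 's' version finds nothing.
my @hits = grep { $_ =~ $re_s } split /\n/, $html;
print "line by line with s: ", scalar(@hits), " matching line(s)\n";
```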
          HTH
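Another possible cause for the empty $text, judging only from the loop as posted: $2 is read after the s/CMD=TABLES/CMD=RET/ has run, and a successful match or substitution resets the capture variables. The Tiger URL contains no CMD=TABLES, so the substitution fails and $2 still holds "Map"; the STF links do contain it, so $2 is gone by the time it is copied. A minimal sketch, with made-up sub names and shortened URLs:

```perl
use strict;
use warnings;

# As in the posted loop: $2 is read only after the substitution has run.
sub text_read_late {
    my ($line) = @_;
    if ($line =~ m{<a href=(.*?)>(.*?)</a>}) {
        my $url = $1;
        $url =~ s/CMD=TABLES/CMD=RET/;   # when this succeeds, $1 and $2 are reset
        return $2;                       # undef whenever the substitution matched
    }
    return;
}

# Safer: copy both captures before running any other regex.
sub text_read_early {
    my ($line) = @_;
    if (my ($url, $text) = $line =~ m{<a href=(.*?)>(.*?)</a>}) {
        $url =~ s/CMD=TABLES/CMD=RET/;
        return $text;
    }
    return;
}

my $map = '<a href="http://tiger.census.gov/cgi-bin/mapbrowse-tbl?lat=36.12000">Map</a>';
my $stf = '<a href=http://venus.census.gov/cdrom/lookup/CMD=TABLES/DB=C90STF1A>STF1A</a>';

for my $sample ($map, $stf) {
    my $late  = text_read_late($sample);
    my $early = text_read_early($sample);
    printf "late=%s early=%s\n", (defined $late ? "'$late'" : 'undef'), "'$early'";
}
# prints:
#   late='Map' early='Map'
#   late=undef early='STF1A'
```

In the original loop the equivalent change would be to copy both captures at once, e.g. my ($url, $text) = ($1, $2); before the substitution.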