pem725 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings PerlMonks,

I have a quick question that hopefully can be resolved by some simple regex directives. I have a simple script that scrapes a web page. I use the usual LWP and HTML::TreeBuilder modules but I cannot seem to get my code correct for matching a portion of an href.

Here is the scenario. I have a group of web pages that contain a table with specific data that I wish to extract and save. The table I want is in the same location on each of those web pages but there are many other tables on the page. So, to identify the relevant table, I figured I would use the unique info contained in an href located in one of the first cells and use the look_down method. The only problem is that only a portion of the href is common across all the pages. For example, the first page I wish to scrape has the following code:

<a href="checkme.php?player=1&amp;year=2001">

and the second page may have:

<a href="checkme.php?player=2&amp;year=1992">

The part that is common across all the pages is:

checkme.php?player=

followed by the player ID (ranging from 1 to about 4500)

&amp;year=

followed by the year (ranging from 1875 to 2006). My code so far goes as follows:

my $p = HTML::TreeBuilder->new_from_content($page);

I know the line above works. I then use the look_down method to find the <a> tag whose href matches that unique cell identifier.

my @tabrows = $p->look_down( _tag => 'a', href => qr{^\Qcheckme.php?player=\E \d+ \Q&amp;year=\E \d+ $}x);

The line above does not work. Can anyone steer me in the right direction with my regex? Perhaps I am going about this problem incorrectly. I don't know but I am stumped and would greatly appreciate some ideas.

Thanks in advance for any help with this question.

Re: partial matching of href in HTML::TreeBuilder
by pem725 (Initiate) on Dec 27, 2006 at 07:03 UTC

    Darn it! I figured it out on my own just after posting my question. The solution I used was to ignore the changing elements and put a wildcard at the end. Specifically, I used:

    my @tabrows = $p->look_down( _tag => 'a', href => qr{^\Qcheckme.php?player=\E .* $}x);

    and that seemed to fix my problem. On to more debugging! Sorry to waste the bandwidth and user time.
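
A likely reason the stricter pattern failed, offered as an assumption since nobody in this thread confirms it: HTML::TreeBuilder decodes entities in attribute values while parsing, so the stored href contains a bare & rather than the &amp; seen in the page source. If that holds, the digits can stay in the pattern as long as it matches the decoded text. Below is a minimal sketch of that idea. The sample $page is made up for illustration, and the look_up call is one possible way to get from the matched link to its enclosing table, not something taken from the original script.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for a scraped page: one decoy table, one target table.
my $page = <<'HTML';
<html><body>
<table><tr><td><a href="other.php?id=7">decoy</a></td></tr></table>
<table><tr><td><a href="checkme.php?player=1&amp;year=2001">stats</a></td>
<td>42</td></tr></table>
</body></html>
HTML

my $p = HTML::TreeBuilder->new_from_content($page);

# Match the decoded attribute value: a bare '&', not '&amp;'.
my $href_re = qr{^\Qcheckme.php?player=\E \d+ \Q&year=\E \d+ $}x;

# Find every <a> element whose href matches the pattern.
my @links = $p->look_down( _tag => 'a', href => $href_re );

for my $link (@links) {
    # look_up climbs from the link to the nearest enclosing <table>.
    my $table = $link->look_up( _tag => 'table' );
    print $link->attr('href'), "\n";
    print $table->as_text, "\n" if $table;
}
```

The wildcard version works for the same reason: it stops comparing before it reaches the ampersand. Keeping \d+ in the pattern is only worth it if other links on the page also start with checkme.php?player= and need to be excluded.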

Re: partial matching of href in HTML::TreeBuilder
by ww (Archbishop) on Dec 27, 2006 at 16:32 UTC
    In effect, you've re-discovered the Teddy Bear Technique:
    Set a Teddy Bear on your monitor.
    When stuck, explain the problem clearly to the Bear.
    The Bear, upon understanding the issue, will often help you solve your problem.

    and, "Welcome" pem725. While the above-paraphrased advice is something less than 100% guaranteed, you'll likely find that your co-religionists here in the Monastery will offer useful info on those problems which resist the TBT.

      ww,

      HA! I love it. I shall use the TBT first from now on. Thanks for the warm welcome. Perl Monks looks like my kind of place.