pem725 has asked for the wisdom of the Perl Monks concerning the following question:
I have a quick question that hopefully can be resolved by some simple regex directives. I have a simple script that scrapes a web page. I use the usual LWP and HTML::TreeBuilder modules but I cannot seem to get my code correct for matching a portion of an href.
Here is the scenario. I have a group of web pages that contain a table with specific data that I wish to extract and save. The table I want is in the same location on each of those web pages but there are many other tables on the page. So, to identify the relevant table, I figured I would use the unique info contained in an href located in one of the first cells and use the look_down method. The only problem is that only a portion of the href is common across all the pages. For example, the first page I wish to scrape has the following code:
<a href="checkme.php?player=1&year=2001">
and the second page may have:
<a href="checkme.php?player=2&year=1992">
The part that is common across all the pages is:
checkme.php?player=
followed by the player ID (ranging from 1 to about 4500)
&year=
followed by the year (ranging from 1875 to 2006). My code so far goes as follows:
my $p = HTML::TreeBuilder->new_from_content($page);
I know the line above works. I then use the look_down method to find the a tag whose href matches that unique identifier.
my @tabrows = $p->look_down( _tag => 'a', href => qr{^\Qcheckme.php?player=\E \d+ \Q&year=\E \d+ $}x);
The line above does not work. Can anyone steer me in the right direction with my regex? Perhaps I am going about this problem incorrectly. I don't know but I am stumped and would greatly appreciate some ideas.
Thanks in advance for any help with this question.
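For reference, here is a minimal sketch of the approach described above, run against a made-up page rather than the real site. The sample HTML, the variable names ($href_re, @links, @found) and the step of climbing from the link to its table with look_up are assumptions for illustration only. On this sample the pattern from the question does match, so if it fails on the real pages, one thing worth checking is the actual href value that TreeBuilder hands back: the ^ and $ anchors reject an absolute URL or any extra parameter.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for one scraped page: two tables, and only the
# second one contains a checkme.php link.
my $page = <<'HTML';
<html><body>
<table><tr><td><a href="other.php?id=7">unrelated</a></td></tr></table>
<table>
  <tr><td><a href="checkme.php?player=1&amp;year=2001">Player 1</a></td><td>42</td></tr>
</table>
</body></html>
HTML

my $p = HTML::TreeBuilder->new_from_content($page);

# Same pattern as in the question, with capture groups added.
# \Q...\E escapes the literal '.', '?' and '='; under /x the bare
# spaces are ignored. The parser decodes &amp; to a plain '&' before
# the attribute value is compared, so the pattern uses '&year='.
my $href_re = qr{^\Qcheckme.php?player=\E (\d+) \Q&year=\E (\d+) $}x;

# look_down accepts a compiled regex as an attribute criterion and,
# in list context, returns every matching element.
my @links = $p->look_down( _tag => 'a', href => $href_re );

my @found;
for my $link (@links) {
    my ( $player, $year ) = $link->attr('href') =~ $href_re;

    # Climb from the link to the nearest enclosing table, then
    # collect that table's rows.
    my $table = $link->look_up( _tag => 'table' );
    my @rows  = $table ? $table->look_down( _tag => 'tr' ) : ();

    push @found, { player => $player, year => $year, rows => scalar @rows };
    print "player=$player year=$year rows=", scalar(@rows), "\n";
}

$p->delete;    # free the parse tree once the data has been copied out
```

Note that look_down returns the a elements themselves, not table rows, which is why the sketch walks up to the table before collecting tr elements. If the tables are nested, look_up stops at the innermost one, which may or may not be the table wanted.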
Re: partial matching of href in HTML::TreeBuilder
by pem725 (Initiate) on Dec 27, 2006 at 07:03 UTC
Re: partial matching of href in HTML::TreeBuilder
by ww (Archbishop) on Dec 27, 2006 at 16:32 UTC
by pem725 (Initiate) on Dec 27, 2006 at 18:17 UTC
by jasonk (Parson) on Dec 28, 2006 at 17:42 UTC