PanchoAguirre has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am new to parsing HTML pages using regular expressions, and I am wondering if someone could point me in the right direction. I am trying to get the word PAL1001 from the page below.

    <font size="-2" face=verdana> <b>Catalog Number:</b>&nbsp; PAL1001 <br>

I have tried the following code unsuccessfully:

    $results =~ /(Catalog Number:)/; if ($1) {print "$1\n";} else {print "nothing\n";};

When I removed the colon, it matched the words "Catalog Number" in an incorrect location.

    $results =~ /(Catalog Number)/; if ($1) {print "$1\n";} else {print "nothing\n";};

Thanks in advance for the help. Sincerely, Pancho

Replies are listed 'Best First'.
Re: parsing hmtl file with regex
by davido (Cardinal) on Sep 30, 2011 at 23:15 UTC

    You want to capture the catalog number, but instead you're matching the anchor text, and never even looking for what comes after it.

    Try this.

    /Catalog Number:\s+(\w+)/

    Update: (The first part of this node was posted from a smartphone, and pecking out markup and other symbols was unpleasant enough that I avoided my usual verbosity, which will now follow):

    That anchors on "Catalog Number:" followed by any amount of whitespace, and then captures all contiguous "word" characters that follow, which would include alpha, numeric, and underscore. $1 would hold the catalog number in a successful match.
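    As a rough sketch of how that pattern behaves: the helper name extract_catalog_number and the first sample string are made up for illustration. The sketch assumes the markup has already been stripped, so that only whitespace sits between the colon and the number. In the raw HTML from the question, "</b>&nbsp;" follows the colon, so the pattern would not match that string as it stands.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the catalog number, or undef when nothing matches.
sub extract_catalog_number {
    my ($text) = @_;
    return $text =~ /Catalog Number:\s+(\w+)/ ? $1 : undef;
}

# Assumed input: page text with the tags already removed.
my $plain = 'Catalog Number:  PAL1001 ';
my $num   = extract_catalog_number($plain);
print defined $num ? "$num\n" : "nothing\n";

# The raw markup from the question: "</b>&nbsp;" follows the colon, so \s+ fails.
my $raw = '<b>Catalog Number:</b>&nbsp; PAL1001 <br>';
$num = extract_catalog_number($raw);
print defined $num ? "$num\n" : "nothing\n";
```

    The first call prints the catalog number and the second prints "nothing", which is one small illustration of how sensitive a regex is to the exact markup around the data.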

    Anyone who mentioned you ought to parse HTML with a proper parsing module is correct though. Regexp solutions are fragile. It's strange that when we take our car to the mechanic we never say, "I want you to fix it using only a 12mm socket wrench." But people think nothing of coming for advice on parsing HTML, and in the same breath suggest that we ought to adapt our solutions to use only regular expressions, avoiding the vast array of other tools, many of which are more suitable for the task.


    Dave

Re: parsing hmtl file with regex
by ww (Archbishop) on Oct 01, 2011 at 01:00 UTC
    But then, don't.

    Don't use regexen to parse and extract from HTML. You'll almost certainly come a cropper.

    Instead, use an appropriate HTML::(module). Search CPAN or ActiveState's PPM repository (depending on which Perl you're using) for the great range of modules (for example, HTML::Extract) which do the job.

    And, please, wrap data (!!!) and code in <c>...</c> tags, as you're advised at the page where you enter your text.

Re: parsing hmtl file with regex
by planetscape (Chancellor) on Oct 01, 2011 at 16:11 UTC
    I am new to parsing html pages using regular expressions. And I am wondering if some could point me into the right direction.

    Yes. Don't.

    HTH,

    planetscape
Re: parsing hmtl file with regex
by JavaFan (Canon) on Sep 30, 2011 at 23:10 UTC
    Considering that you are just printing out Catalog Number, does it really matter which "Catalog Number" you print?

    As for trying to achieve your objective, I'd write:

    print "Gotit! ($1)\n" if /Catalog Number: (PAL1001)/;
Re: parsing hmtl file with regex
by Marshall (Canon) on Oct 01, 2011 at 07:10 UTC
    I try to avoid the use of $1, $2 etc.

    I find it better and easier, as far as the coding goes, to put the left-hand side of the regex match into list context and, for example, assign $catalog_num directly instead of fiddling with $1!

    #!/usr/bin/perl -w
    use strict;

    my @lines = (
        'Catalog Number: PAL1001',
        ' Catalog Number:PAL1001',
        'Catalog Number: Catalog Number: PAL1001',
        'Catalog Number: PAL1001Catalog Number: PAL1001',
        'Cat Number: PAL1001Catalog',
        'Catalog Number: 123PAL100',
    );

    foreach my $line (@lines) {
        # List context: the capture is assigned directly, or undef on no match.
        my ($catalog_num) = $line =~ /^\s*Catalog Number:\s*([A-Za-z]+\d+)/;
        if ($catalog_num) {
            print "$catalog_num\n";
        }
        else {
            print "Bad Line!...$line\n";
        }
    }
    __END__
    PAL1001
    PAL1001
    Bad Line!...Catalog Number: Catalog Number: PAL1001
    PAL1001
    Bad Line!...Cat Number: PAL1001Catalog
    Bad Line!...Catalog Number: 123PAL100
    Update:
    When writing if statements like the one above, you have to consider "truthiness". $catalog_num is false if the value is undefined, an empty string, or zero (numeric 0 or the string "0"). Here a failed match leaves it undefined, and that's a "Bad Line!"

    In the above, a valid catalog_num cannot be a numeric zero and so I can just test "if ($catalog_num)" instead of "if (defined $catalog_num)".
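
    To show where the two tests diverge, here is a small sketch. The "ID:" format, the \d+ pattern, and the name parse_id are invented for illustration and are not part of the code above. With an all-digit capture, "0" is a legitimate value, so the plain truth test would wrongly reject it while the defined test accepts it.

```perl
use strict;
use warnings;

# Hypothetical pattern: an all-digit ID, so "0" is a legitimate capture.
sub parse_id {
    my ($line) = @_;
    my ($id) = $line =~ /^ID:\s*(\d+)/;   # list context: $id is undef on no match
    return $id;
}

for my $line ('ID: 42', 'ID: 0', 'no id here') {
    my $id         = parse_id($line);
    my $by_truth   = $id         ? 'ok' : 'bad';
    my $by_defined = defined $id ? 'ok' : 'bad';
    print "$line => truth test: $by_truth, defined test: $by_defined\n";
}
```

    Only the 'ID: 0' line gets different answers from the two tests, so when the pattern allows such a value, "if (defined ...)" is the safer check.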