Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I posted another regex question a minute ago but this is completely different. The below code works but it's very redundant and I'm sure it can be cut down quite a bit.
$mech->content() =~ m#Lumber:</td><td align="right"><b>(\d+)#i;
my $lumber = $1;
$mech->content() =~ m#Clay:</td><td align="right"><b>(\d+)#i;
my $clay = $1;
$mech->content() =~ m#Iron:</td><td align="right"><b>(\d+)#i;
my $iron = $1;
$mech->content() =~ m#Crop:</td><td align="right"><b>(\d+)#i;
my $crop = $1;
Can some experienced monks show me how THEY would go about this?

Thank you!

Replies are listed 'Best First'.
Re: help shorten this series of regexes
by bobf (Monsignor) on Sep 19, 2007 at 04:06 UTC

    I probably wouldn't try to parse HTML using regular expressions, but since you didn't post an example of it I'll suggest the following:

    my %hash;
    foreach my $type ( qw( Lumber Clay Iron Crop ) ) {
        if ( $mech->content() =~ m#$type:</td><td align="right"><b>(\d+)#i ) {
            $hash{$type} = $1;    # lc( $type ), etc if desired
        }
        else {
            # I dunno - what should happen if the match fails?
        }
    }

Re: help shorten this series of regexes
by GrandFather (Saint) on Sep 19, 2007 at 04:16 UTC

    The bigger picture would help a lot because hand parsing HTML like that is heading for a world of pain.

    That said, and without knowing how the code fragment fits into the bigger picture, something like the following may help:

    use strict;
    use warnings;

    my $str = 'Lumber:</td><td align="right"><b>10;';
    my %hits;

    ++$hits{lc $1} if $str =~ /(lumber|clay|iron|crop):<\/td><td align="right"><b>\d+/i;
    print join ', ', sort keys %hits;

    Prints:

    lumber

    Note in particular the use of a hash instead of a bunch of manifest variables.


    DWIM is Perl's answer to Gödel
Re: help shorten this series of regexes
by throop (Chaplain) on Sep 19, 2007 at 04:46 UTC
    It's only reasonable to be regex parsing HTML code if
    • The text you're looking at was output by some process very free of human intervention, so you're quite sure that it's going to be absolutely regular: no extra spaces, no funny capitalization, no varying between 'right' and "right".
    But you've got that 'i' at the end of the regexes. Which is usually good practice. However, in this context, it tells me that, no, you're not absolutely confident that there won't be any variance in the capitalization. Hmmmm....

    Maybe you should be in the market for a good HTML Parser.

    throop

Re: help shorten this series of regexes
by Anno (Deacon) on Sep 19, 2007 at 09:20 UTC
    There is a module on CPAN named HTML::TableExtract that should be able to help with your task.

    Warnings about using regular expressions for HTML parsing aside, here is one approach that extracts data according to your pattern into a hash (named %price at a guess):

    my $html = <<EOT;
    stuff
    Lumber:</td><td align="right"><b>10
    more stuff
    Clay:</td><td align="right"><b>20
    glug
    Iron:</td><td align="right"><b>30
    blub
    Crop:</td><td align="right"><b>40
    stuff again
    EOT

    my %price;
    $price{ lc $1} = $2 while
        $html =~ m#(\w+):</td><td align="right"><b>(\d+)#ig;

    my ( $lumber, $clay, $iron, $crop) = @price{ qw( lumber clay iron crop)};
    Anno
Re: help shorten this series of regexes
by Gangabass (Vicar) on Sep 19, 2007 at 05:15 UTC

    You don't need two assignment operations:

    my ($lumber) = $mech->content() =~ m#Lumber:</td><td align="right"><b>(\d+)#i;
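
A minimal sketch of why the list-context form behaves well, assuming a hypothetical $html string standing in for $mech->content(): a successful match returns its captures, while a failed match returns the empty list, so the variable is simply left undefined instead of picking up an old $1.

```perl
use strict;
use warnings;

# Hypothetical fragment standing in for $mech->content();
# it contains a Lumber row but no Clay row.
my $html = 'Lumber:</td><td align="right"><b>10';

# List context: captures on success, empty list on failure.
my ($lumber) = $html =~ m#Lumber:</td><td align="right"><b>(\d+)#i;
my ($clay)   = $html =~ m#Clay:</td><td align="right"><b>(\d+)#i;

# The defined-or operator (//) needs Perl 5.10 or later.
print 'lumber=', $lumber // 'undef', ' clay=', $clay // 'undef', "\n";
# prints: lumber=10 clay=undef
```

The parentheses around the variable matter: without them the match runs in scalar context and the variable receives the success flag, not the captured number.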
Re: help shorten this series of regexes
by johngg (Canon) on Sep 19, 2007 at 09:00 UTC
    The below code works ...

    It might not work the way you think. You are doing an assignment willy-nilly regardless of whether the matches succeed or not. For instance, if "Clay" were the only pattern that matched, $lumber would get undef or whatever was in $1 from some previous match, $clay would get the right value, but $iron and $crop would also get the value associated with "Clay", because a failed match leaves $1 unchanged.

    Always test that your matches have been successful.
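
The stale $1 problem can be sketched as follows, assuming a hypothetical page fragment in which only the Clay row is present. The variable names $stale_iron and $iron are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical page fragment: only the Clay row is present.
my $html = 'Clay:</td><td align="right"><b>20';

# Unchecked, as in the original post: a failed match leaves $1 untouched.
$html =~ m#Clay:</td><td align="right"><b>(\d+)#i;
my $clay = $1;                  # 20, from the successful Clay match
$html =~ m#Iron:</td><td align="right"><b>(\d+)#i;
my $stale_iron = $1;            # still 20, left over from the Clay match

# Checked: assign only when the match actually succeeded.
my $iron;
if ( $html =~ m#Iron:</td><td align="right"><b>(\d+)#i ) {
    $iron = $1;
}

print "clay=$clay stale_iron=$stale_iron iron=",
    defined $iron ? $iron : 'undef', "\n";
# prints: clay=20 stale_iron=20 iron=undef
```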

    Cheers,

    JohnGG