help shorten this series of regexes

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: help shorten this series of regexes by bobf (Monsignor) on Sep 19, 2007 at 04:06 UTC
I probably wouldn't try to parse HTML using regular expressions, but since you didn't post an example of it I'll suggest the following: `my %hash; foreach my $type ( qw( Lumber Clay Iron Crop ) ) { if( $mech->content() =~ m#$type:</td><td align="right"><b>(\d+)#i ) { $hash{$type} = $1; # lc( $type ), etc if desired } else { # I dunno - what should happen if the match fails? } }`
Re: help shorten this series of regexes by GrandFather (Saint) on Sep 19, 2007 at 04:16 UTC
The bigger picture would help a lot because hand parsing HTML like that is heading for a world of pain. That said, and without knowing how the code fragment fits into the bigger picture, something like the following may help: `use strict; use warnings; my $str = 'Lumber:</td><td align="right"><b>10;'; my %hits; ++$hits{lc $1} if $str =~ /(lumber|clay|iron|crop):<\/td><td align="right"><b>\d+/i; print join ', ', sort keys %hits;` Prints: `lumber` Note in particular the use of a hash instead of a bunch of manifest variables. DWIM is Perl's answer to Gödel
Re: help shorten this series of regexes by throop (Chaplain) on Sep 19, 2007 at 04:46 UTC
It's only reasonable to be regex parsing HTML code if the text you're looking at was output by some process very free of human intervention, so you're way sure that it's going to be absolutely regular: no extra spaces, no funny capitalization, no varying between 'right' and "right". But you've got that 'i' at the end of the regexes, which is usually good practice. However, in this context, it tells me that, no, you're not absolutely confident that there won't be any variance in the capitalization. Hmmmm.... Maybe you should be in the market for a good HTML parser. throop
Re: help shorten this series of regexes by Anno (Deacon) on Sep 19, 2007 at 09:20 UTC
There is a module on CPAN, named `HTML::TableExtract`, that should be able to help with your task. Warnings about using regular expressions for HTML parsing aside, here is one approach that extracts data according to your pattern into a hash (named `%price` at a guess): `my $html = <<EOT; stuff Lumber:</td><td align="right"><b>10 more stuff Clay:</td><td align="right"><b>20 glug Iron:</td><td align="right"><b>30 blub Crop:</td><td align="right"><b>40 stuff again EOT my %price; $price{ lc $1} = $2 while $html =~ m#(\w+):</td><td align="right"><b>(\d+)#ig; my ( $lumber, $clay, $iron, $crop) = @price{ qw( lumber clay iron crop)};` Anno
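For comparison, here is a rough sketch of the parser route using the CPAN module `HTML::TableExtract`. Since the actual page was never posted, the sample markup, the two-cell row layout and the `%amount` hash name are all assumptions made for illustration; treat this as a starting point rather than a drop-in answer.

```perl
use strict;
use warnings;
use HTML::TableExtract;    # CPAN module, not part of core Perl

# Hypothetical markup: the original page was never posted in this thread.
my $html = <<'EOT';
<table>
<tr><td>Lumber:</td><td align="right"><b>10</b></td></tr>
<tr><td>Clay:</td><td align="right"><b>20</b></td></tr>
<tr><td>Iron:</td><td align="right"><b>30</b></td></tr>
<tr><td>Crop:</td><td align="right"><b>40</b></td></tr>
</table>
EOT

# No headers/depth/count constraints, so every table on the page is considered.
my $te = HTML::TableExtract->new();
$te->parse($html);
$te->eof;

my %amount;    # hypothetical name, keyed by lower-cased resource
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $label, $value ) = @$row;
        next unless defined $label && defined $value;

        # Keep only rows shaped like "Name:" followed by a run of digits.
        next unless $label =~ /^\s*(\w+):\s*$/;
        my $name = lc $1;
        next unless $value =~ /^\s*(\d+)\s*$/;
        $amount{$name} = $1;
    }
}

print "$_ = $amount{$_}\n" for sort keys %amount;
```

The regexes here only look at already-extracted cell text, so attribute quoting, tag capitalization and whitespace inside the markup no longer matter.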
Re: help shorten this series of regexes by Gangabass (Vicar) on Sep 19, 2007 at 05:15 UTC
You don't need two assignment operations: `my ($lumber) = $mech->content() =~ m#Lumber:</td><td align="right"><b>(\d+)#i;`
Re: help shorten this series of regexes by johngg (Canon) on Sep 19, 2007 at 09:00 UTC
"The below code works ..." It might not work the way you think. You are doing an assignment willy-nilly regardless of whether the matches succeed or not. For instance, if "Clay" was the value then `$lumber` would get `undef` or whatever was in `$1` from some previous match, `$clay` would get the right value but `$iron` and `$crop` would get the value associated with "Clay". Always test that your matches have been successful. Cheers, JohnGG
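To make the stale `$1` problem concrete, here is a minimal sketch. The `$content` string is a made-up stand-in for `$mech->content()` in which only the Clay figure is present, and `$iron_checked` is a hypothetical name; the checked variant uses the list-context assignment suggested elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical page content in which only the Clay figure is present.
my $content = 'Clay:</td><td align="right"><b>20';

# Unchecked: a failed match leaves $1 holding the last successful capture.
$content =~ m#Clay:</td><td align="right"><b>(\d+)#i;    # succeeds
my $clay = $1;
$content =~ m#Iron:</td><td align="right"><b>(\d+)#i;    # fails
my $iron = $1;    # stale: still the value captured for Clay

# Checked: in list context a failed match returns an empty list,
# so the variable ends up undef instead of inheriting an old capture.
my ($iron_checked) = $content =~ m#Iron:</td><td align="right"><b>(\d+)#i;

print "clay=$clay iron=$iron\n";
print defined $iron_checked
    ? "iron_checked=$iron_checked\n"
    : "iron_checked is undef\n";
```

This prints `clay=20 iron=20` followed by `iron_checked is undef`, which is why the result of each match should be tested before `$1` is used.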