Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello All, I'm trying to simplify a reg. ex. to extract currencies from HTML files. Here is a snippet of the HTML that wraps the currency and f/x rate:
<TR BGCOLOR="#F0F0DF">
<TD ALIGN=LEFT VALIGN=TOP CLASS="mrktdata1"><FONT FACE="arial,helvetica" SIZE="-1" COLOR="#000000">Falkland Island Pound (*FKP)</FONT></TD>
<TD ALIGN=RIGHT VALIGN=TOP CLASS="mrktdata1"><FONT FACE="arial,helvetica" SIZE="-1" COLOR="#000000">1.4409 </FONT></TD>
</TR></TR>
Here's the expression I'm currently using to parse out the currencies:
while ($content =~ /(\**\s?\w+\.?\s+\S*\.?\s*\w*\.?\/?\s*\w*\.?\/?\s*\w*\.?\/?\s*\(\*?[A-Z]{2,4}\))/g) {
    print RAW "$1\n";
}
Some sample output:
Falkland Island Pound (*FKP)
South African Rand/fin (ZAR)
It works, but as you can see my expression is rather cumbersome. Any ideas on how I can group things, condense it, or otherwise simplify it? I've tried using variations of \S to catch instances of odd things like '&' or '/', but that really slows it down. Is there some way I can group all the repeating sections? I've attempted a couple of grouping schemes and tried putting it into a character class, but without much success. I'm also trying to keep it somewhat generic and flexible so that it still catches name changes, etc.

Finally, is there a nice way I can also read in the rate (1.4409 in the sample) on the same pass through the file? Thanks.

Replies are listed 'Best First'.
Re: simplifying an expression
by Sifmole (Chaplain) on Apr 20, 2001 at 16:17 UTC
    Hmm... I don't have an answer for you, but out of curiosity -- would I be correct if I observed that it seemed like you were trying to parse up Bloomberg.com's Currency pages?

    Or at least they look a lot like the pages I wrote for them. :)


    Okay... I have something. I believe it is somewhat less cumbersome, but I do not have the time to benchmark the difference.

    while ($string =~ s/<TR.+?([^\)>]+\))<.+?(\d+\.\d+)//s) {
        print "$1 -- $2 \n";
    }

    <TR          <-- locates the start of each row.
    .+?          <-- will slurp up lots of chars, but not greedy
    ([^\)>]+\))  <-- will grab the currency name, $1
    <.+?         <-- slurp more, not greedy
    (\d+\.\d+)   <-- will grab the currency value, $2
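The loop above consumes $string as it goes, because each s/// deletes the row it just matched. If the content is needed afterwards, the same pattern can be run with m//g instead. The following is only a sketch under assumptions: make_row is a hypothetical helper that builds rows shaped like the snippet in the question, and the second rate is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: builds one row shaped like the snippet in the question.
sub make_row {
    my ($name, $rate) = @_;
    return qq{<TR BGCOLOR="#F0F0DF"> <TD ALIGN=LEFT VALIGN=TOP CLASS="mrktdata1">}
         . qq{<FONT FACE="arial,helvetica" SIZE="-1" COLOR="#000000">$name</FONT></TD> }
         . qq{<TD ALIGN=RIGHT VALIGN=TOP CLASS="mrktdata1">}
         . qq{<FONT FACE="arial,helvetica" SIZE="-1" COLOR="#000000">$rate </FONT></TD> </TR>};
}

# The second rate is invented sample data, not a real quote.
my $string = join "\n",
    make_row('Falkland Island Pound (*FKP)', '1.4409'),
    make_row('South African Rand/fin (ZAR)', '8.0250');

my @pairs;

# With /g in scalar context, each iteration resumes where the previous
# match ended, so nothing has to be deleted from $string.
while ($string =~ /<TR.+?([^\)>]+\))<.+?(\d+\.\d+)/sg) {
    push @pairs, [$1, $2];
}

print "$_->[0] -- $_->[1]\n" for @pairs;
```

Both the name and the rate come out of a single pass, and $string is left untouched for any later processing.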

    I hope this is some help.

    Edit 2001-04-20 by tye

      Sorry! I have no idea how my posting ended up in the primary area. I beg your pardon.


Re: simplifying an expression
by suaveant (Parson) on Apr 20, 2001 at 17:50 UTC
    You could do...
    /<TR[^>]*>\s*(?:<[^>]*>|\s+)+([^<]+)(?:<[^>]*>|\s+)+([^<]+)/;
    I tested it, it works... here's a breakdown.
    <TR[^>]*>\s* match a TR tag, followed by 0 or more whitespace.
    (?:<[^>]*>|\s+)+ Match as much tagged data or whitespace as possible; basically just keep going till you run out of tags or whitespace. The \s+ is important: I believe \s* would result in a very slow regexp. The ?: makes the group non-capturing, so it doesn't save its data in $1.
    ([^<]+) Matches everything till the next tag starts.
    Then it just repeats the anytag match, and the non-tag text match.
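As a rough illustration of the breakdown above, here is the same pattern written with /x so each piece carries its own comment, and run with /g so every row is picked up in one pass. This is a sketch under assumptions: make_row is a hypothetical helper that builds rows shaped like the snippet in the question, the second rate is made up, and the trailing-whitespace trim is an addition that is not part of the original pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: builds one row shaped like the snippet in the question.
sub make_row {
    my ($name, $rate) = @_;
    return qq{<TR BGCOLOR="#F0F0DF"> <TD ALIGN=LEFT VALIGN=TOP CLASS="mrktdata1">}
         . qq{<FONT FACE="arial,helvetica" SIZE="-1" COLOR="#000000">$name</FONT></TD> }
         . qq{<TD ALIGN=RIGHT VALIGN=TOP CLASS="mrktdata1">}
         . qq{<FONT FACE="arial,helvetica" SIZE="-1" COLOR="#000000">$rate </FONT></TD> </TR>};
}

# The second rate is invented sample data, not a real quote.
my $content = join "\n",
    make_row('Falkland Island Pound (*FKP)', '1.4409'),
    make_row('South African Rand/fin (ZAR)', '8.0250');

my $row_re = qr{
    <TR[^>]*>\s*        # a TR tag, then optional whitespace
    (?:<[^>]*>|\s+)+    # any run of tags and whitespace
    ([^<]+)             # capture 1: text up to the next tag (currency name)
    (?:<[^>]*>|\s+)+    # another run of tags and whitespace
    ([^<]+)             # capture 2: text up to the next tag (rate)
}x;

my @rows;
while ($content =~ /$row_re/g) {
    # Copy the captures first; the s/// below would reset them.
    my ($name, $rate) = ($1, $2);
    $rate =~ s/\s+\z//;    # the rate cell ends with a space in the sample
    push @rows, [$name, $rate];
}

print "$_->[0] => $_->[1]\n" for @rows;
```

Because the second capture takes everything up to the next tag, it includes the space before </FONT> in the sample, which is why the sketch trims it.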

    This will work for this case, but you may need some extra code to make sure you don't pick up data from other TDs... maybe check for mrktdata1 or something else that is unique to these TDs. Currency is in $1, rate is in $2. Enjoy.
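One possible way to do the mrktdata1 check is to collect only the text of cells whose TD tag carries that class, then pair the cells up. This is a sketch, not a tested fix for the real pages. It assumes that the class attribute is always written as CLASS="mrktdata1" and that such cells strictly alternate between name and rate. The "hdr" row and the second rate are invented sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: builds one data row shaped like the snippet in the question.
sub make_row {
    my ($name, $rate) = @_;
    return qq{<TR BGCOLOR="#F0F0DF"> <TD ALIGN=LEFT VALIGN=TOP CLASS="mrktdata1">}
         . qq{<FONT FACE="arial,helvetica" SIZE="-1" COLOR="#000000">$name</FONT></TD> }
         . qq{<TD ALIGN=RIGHT VALIGN=TOP CLASS="mrktdata1">}
         . qq{<FONT FACE="arial,helvetica" SIZE="-1" COLOR="#000000">$rate </FONT></TD> </TR>};
}

# The "hdr" row stands in for unrelated cells that should be skipped.
my $content = join "\n",
    '<TR><TD CLASS="hdr">Ignore me (XXX)</TD> <TD CLASS="hdr">9.99</TD></TR>',
    make_row('Falkland Island Pound (*FKP)', '1.4409'),
    make_row('South African Rand/fin (ZAR)', '8.0250');

# In list context, m//g returns capture 1 of every match: the first run of
# text inside each TD whose tag contains CLASS="mrktdata1".
my @cells = $content =~ m{<TD\b[^>]*\bCLASS="mrktdata1"[^>]*>\s*(?:<[^>]*>\s*)*([^<]+)}gi;

# Assumption: the matching cells alternate name, rate, name, rate, ...
my %rate_for;
while (my ($name, $rate) = splice @cells, 0, 2) {
    s/^\s+|\s+$//g for $name, $rate;    # trim both ends of each value
    $rate_for{$name} = $rate;
}

print "$_ => $rate_for{$_}\n" for sort keys %rate_for;
```

Since [^>]* cannot cross the end of a tag, the class test only ever applies to the TD tag being examined, so cells with any other class are skipped.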
                    - Ant