johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've got a page-scraping script that needs to get a particular column (say the 13th) from a 20-column HTML table. Nothing else distinguishes the columns, and I only want the first row. Is there an easy regex to do this? Thanks.
my $text=<<END;
<TR>
<TD nowrap>4.2</TD>
<TD nowrap>-1.2</TD>
<TD nowrap>3.5</TD>
<TD nowrap>6.2</TD>
<TD nowrap>5</TD>
<TD nowrap>2e-8</TD>
<TD nowrap>1.3</TD>
<TD nowrap>12.0</TD>
<TD nowrap>text</TD>
<TD nowrap>other</TD>
<TD nowrap>-23</TD>
<TD nowrap>2.3</TD>
<TD nowrap>0.2</TD>
<TD nowrap>1.4</TD>
<TD nowrap>4</TD>
<TD nowrap>6</TD>
<TD nowrap>03</TD>
<TD nowrap>2.3</TD>
<TD nowrap>e12</TD>
<TD nowrap>4</TD>
<TR>
END
# want to get the number 0.2 which is in the 13th column

Re: how to match after the n-th occurance?
by japhy (Canon) on Jun 03, 2005 at 06:27 UTC
    While I don't condone regexes on HTML, the general idea is:
    ($nth_chunk) = $text =~ /(?:PREFACE(MATCH)POSTFIX){N}/;
    In your case, that'd be (using m!...! delimiters so the / in </TD> needs no escaping):
    my ($number13) = $text =~ m!(?:<TD nowrap>(.*?)</TD>\n){13}!;
    It works because a capturing group inside a quantifier overwrites its $DIGIT variable on each repetition, so after the match $1 holds the capture from the last repetition (here, the 13th). Here's a simpler example:
    "abcdef" =~ /(.)/;  # $1 is "a"
    "abcdef" =~ /(.)+/; # $1 is "f"

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
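The quantified-group idiom in the reply above can be tried out with a small script. This is only a sketch: the helper name nth_cell and the five-cell sample row are made up for illustration, and \s* is used in place of \n on the assumption that real markup may have other whitespace between cells.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the contents of the
# $n-th "<TD nowrap>...</TD>" cell, or undef if there are fewer than $n.
sub nth_cell {
    my ($html, $n) = @_;
    # The capture group sits inside the {$n} quantifier, so each
    # repetition overwrites it; the last repetition's capture survives.
    my ($cell) = $html =~ m!(?:<TD nowrap>(.*?)</TD>\s*){$n}!;
    return $cell;
}

# Made-up sample row, one cell per line.
my $row = join "\n", map { "<TD nowrap>$_</TD>" } qw(4.2 -1.2 3.5 6.2 5);

my $third = nth_cell($row, 3);
my $sixth = nth_cell($row, 6);   # only five cells, so the match fails

print "third: $third\n";
print "sixth: ", (defined $sixth ? $sixth : "(no match)"), "\n";
```

Because the pattern demands exactly $n consecutive cells, asking for a column that does not exist makes the whole match fail and the helper returns undef rather than some earlier cell.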
Re: how to match after the n-th occurance?
by Zaxo (Archbishop) on Jun 03, 2005 at 06:36 UTC

    You can do a global match for a single instance and index the result:
    my $foo = ($text =~ m!<TD nowrap>([^<]*)</TD>!g)[12];
    The list slice places the global match in list context, so it returns every capture, and [12] picks the 13th.

    After Compline,
    Zaxo
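The global-match-and-index approach in the reply above can be checked the same way. Again a minimal sketch, assuming a made-up five-cell sample row; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Made-up sample row, one cell per line.
my $html = join "\n", map { "<TD nowrap>$_</TD>" } qw(4.2 -1.2 3.5 6.2 5);

# In list context, m//g returns the captures from every match.
my @cells = $html =~ m!<TD nowrap>([^<]*)</TD>!g;

# A list slice picks one element directly; index 2 is the third cell.
my $third = ($html =~ m!<TD nowrap>([^<]*)</TD>!g)[2];

# An index past the end of the list yields undef.
my $tenth = ($html =~ m!<TD nowrap>([^<]*)</TD>!g)[9];

printf "%d cells, third is %s\n", scalar @cells, $third;
```

Compared with the quantified-group version, this one does not require the cells to be adjacent, and keeping @cells around gives access to any column without rerunning the match.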

Re: how to match after the n-th occurance?
by tlm (Prior) on Jun 03, 2005 at 13:00 UTC