knlst8 has asked for the wisdom of the Perl Monks concerning the following question:

I'm relatively new to Perl and could use some help solving a substitution problem. I'm trying to make a change to every cell in the nth column of an HTML table. Those of you familiar with HTML will know that HTML doesn't "think" in columns, so I have to make a change to the nth cell in each row. Try as I might, I can only manage to change the first row. I'm thinking something like this:
#this while loop isn't right, but I don't know what is...
while ($tbltext =~ /<tr.*?>/cg) {
    $counter = 1; #reset the counter
    $tbltext =~ s/(<td.*?\/td>)/
        ++$counter == $nth  #is this the nth cell?
        ? "${1}newtext"     #if yes, add "newtext"
        : $1                #otherwise, leave it
        /ge;
    $tbltext =~ /<\/tr>/cg; #find the end of the row
} #exit the loop
I can't get a 'while' loop to work. I've also tried 'for' and 'foreach' loops, and every combination of /g, \G and /cg that I can think of. It always finds only the first row tag, then counts the cells and substitutes for the nth cell in the first row. Any sort of loop that I put around it only makes the substitution in the first row on every pass through the loop. I'm betting there's a really simple answer that I just haven't stumbled on yet. Can anyone here help?

Re: Continuing after replacing nth occurrence
by GrandFather (Saint) on Mar 10, 2009 at 01:57 UTC

    First off the obligatory advice to those new to Perl: always use strictures (use strict; use warnings;)!

    Secondly, the obligatory advice to those who haven't yet learned the folly of rolling their own HTML parsing code - Don't. Use CPAN HTML modules instead. In this case you could try HTML::TableExtract and HTML::Table:

    use strict;
    use warnings;
    use HTML::Table;
    use HTML::TableExtract;

    my $tbltext = <<'END_TBL';
    <table>
    <tr><td>1</td><!-- deleted! <td>2</td>!--><td>3</td><td>4</td></tr>
    <tr><td>one</td><td>two</td><td>three</td></tr>
    </table>
    END_TBL

    my $tableEx = HTML::TableExtract->new ();

    $tableEx->parse ($tbltext);

    my @rows = $tableEx->rows ();

    for my $row (@rows) {
        next if @$row < 3; # Not interested in short rows
        $row->[2] = 'newText';
    }

    my $tableOut = HTML::Table->new ();

    $tableOut->addRow (@$_) for @rows;
    $tableOut->print ();

    Prints:

    <table>
    <tr><td>1</td><td>3</td><td>newText</td></tr>
    <tr><td>one</td><td>two</td><td>newText</td></tr>
    </table>

    True laziness is hard work
      Thanks! I believe HTML::TableExtract and HTML::Table will be my new best friends. Now I have to wonder why I didn't think to search for such things, when every other module I've ever wanted has already existed. Thanks again for the help!
Re: Continuing after replacing nth occurrence
by almut (Canon) on Mar 10, 2009 at 02:45 UTC

    I agree with GrandFather's advice to not parse HTML with regexes, in general.  That being said, however, it's sometimes helpful to understand why some approach didn't work, as well as what would have been a way around the problem.

    So, the problem with your code is that the s/// operator is not aware of the incremental search position that you're trying to maintain with the /c option of the outer matches. That is, it always starts from the beginning of the string (even if you attempt to save and restore the current position using $pos = pos($tbltext); before, and pos($tbltext) = $pos; after the substitution code)...
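    To see that behaviour in isolation, here is a minimal sketch. The sample string 'aXbXcX' and the variable names are made up for illustration; only core Perl is used. A /g match first moves the search position past the "b", and the following plain s/// still replaces the first "X" in the string rather than the one after that position.

```perl
use strict;
use warnings;

# Made-up sample data, purely for illustration.
my $text = 'aXbXcX';

# A /g match in scalar context records where it stopped in pos($text).
$text =~ /b/g;
my $pos_before = pos($text);    # offset just after the "b"

# A plain s/// does not consult pos(); it scans from offset 0,
# so the FIRST "X" is replaced, not the one following the "b".
$text =~ s/X/-/;

print "pos before s///: $pos_before\n";
print "text after s///: $text\n";
```

    Here the position before the substitution is 3, and the string ends up as a-bXcX: the substitution landed in front of the saved position, which is the same thing that happens to the first table row in the original loop.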

    One way around the problem would be to move the column substitution code into a subroutine, which you then simply call in a normal repeated (/g) substitution:

    my $tbltext = qq(<tr><td>a1</td><td>a2</td><td>a3</td><td>a4</td></tr><tr><td>b1</td><td>b2</td><td>b3</td><td>b4</td></tr><tr><td>c1</td><td>c2</td><td>c3</td><td>c4</td></tr>);

    sub fix_column {
        my $s = shift;
        my $nth = shift;
        my $counter = 0;
        $s =~ s/(<td.*?\/td>)/
            ++$counter == $nth  #is this the nth cell?
            ? "${1}newtext"     #if yes, add "newtext"
            : $1                #otherwise, leave it
            /ge;
        return $s;
    }

    $tbltext =~ s/(<tr.*?>)(.*?)(<\/tr>)/$1.fix_column($2, 3).$3/ge;

    print "$tbltext\n";

    __END__
    <tr><td>a1</td><td>a2</td><td>a3</td>newtext<td>a4</td></tr><tr><td>b1</td><td>b2</td><td>b3</td>newtext<td>b4</td></tr><tr><td>c1</td><td>c2</td><td>c3</td>newtext<td>c4</td></tr>

    BTW, are you sure you want to insert the text in between the cells, not within?
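    If the text were meant to go inside the cell instead, a variation along these lines might do. This is only a sketch: fix_column_inside is a hypothetical name, the two-row sample string is made up, and the regexes are as deliberately simple as the ones above (no attributes containing ">", no nested tables, no newlines inside a row).

```perl
use strict;
use warnings;

# Hypothetical helper: append $extra inside the nth <td> of a single row.
sub fix_column_inside {
    my ($row, $nth, $extra) = @_;
    my $counter = 0;
    $row =~ s{(<td.*?>)(.*?)(</td>)}{
        ++$counter == $nth
            ? "$1$2$extra$3"    # nth cell: add the text before the closing tag
            : "$1$2$3"          # any other cell: put it back unchanged
    }ge;
    return $row;
}

# Made-up sample data: two rows of three cells each.
my $tbltext = '<tr><td>a1</td><td>a2</td><td>a3</td></tr>'
            . '<tr><td>b1</td><td>b2</td><td>b3</td></tr>';

# Copy the captures into lexicals before calling the helper, so the
# helper's own regex cannot interfere with the outer $1, $2 and $3.
$tbltext =~ s{(<tr.*?>)(.*?)(</tr>)}{
    my ($open, $cells, $close) = ($1, $2, $3);
    $open . fix_column_inside($cells, 2, 'newtext') . $close;
}ge;

print "$tbltext\n";
```

    With the second column selected, the cells a2 and b2 become a2newtext and b2newtext, and the counter starts over for each row because the helper is called once per row.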

    (P.S.: I know, the <tr>...</tr> matching regex is not perfect... but I deliberately kept it simple for this demo.)

      Thanks for your reply. You're right that I was wondering why my approach wouldn't work, in addition to needing to find a working approach. I think I understand now. And yes, I did intend to insert text in between cells. Odd, I know. I can't explain it without describing the whole project in detail, so I'm going to leave you wondering about that one. :)