Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a single line of html string like this:
<h3>AAA 28/07/2018</h3><table><tr><td>1351</td><td>990</td><td>783</td><td>523</td></tr></table><h3><table><tr>BBBBB 01/08/2018</h3><td>236</td><td>002</td><td>121</td><td>266</td></tr></table><h3>KK K P 25/07/2018</h3><table><tr><td>200</td><td>3345</td><td>667</td><td>137</td></tr></table>
I want to get the 3-digit numbers just from BBBBB i.e. 236, 002, 121, 266.

Here's the code I am trying:
if ($line =~ /<h3>BBBBB .*?<td>(\d{3})<\/td>/g) { print "$1\n"; }
I am only getting 236, but not the other three values (002, 121, 266).

Please enlighten me. Thank you very much.

Re: Regex help (updated)
by AnomalousMonk (Archbishop) on Aug 01, 2018 at 03:47 UTC

    Note that parsing (X|HT)ML with regexes is fragile; you'd do better to use an XML parser, about which others can advise better than I. However:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use 5.010;
     ;;
     use Data::Dump qw(dd);
     ;;
     my $s =
       '<tr><td>99</td></tr><h3><table><tr>BBBBB 01/08/2018</h3>' .
       '<td>236</td><td>002</td><td>121</td><td>266</td></tr>' .
       '<tr><td>999</td><td>9999</td></tr>'
       ;
     print qq{[[$s]] \n};
     ;;
     my @n = $s =~ m{ (?: BBBBB [^>]+ > | \G) <td> (\d+) </td> }xmsg;
     dd \@n;
    "
    [[<tr><td>99</td></tr><h3><table><tr>BBBBB 01/08/2018</h3><td>236</td><td>002</td><td>121</td><td>266</td></tr><tr><td>999</td><td>9999</td></tr>]]

    [236, "002", 121, 266]

    Update: A few afterthoughts:

    • The  use 5.010; statement enforcing a minimum Perl version of 5.10.0 is not needed. I originally used a regex operator introduced with 5.10, but got rid of it, so you can lose the statement too.
    • It might be wise to add more delimitation before and after the  BBBBB keying substring in the regex:
          m{ (?: <tr> BBBBB \b [^>]+ > | \G) <td> (\d+) </td> }xmsg
      This makes regex parsing of this HTML only slightly less problematic.
    • Both these changes tested under Perl version 5.8.9.
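For readers who want to try the tightened regex as a script rather than a Windows one-liner, here is a minimal sketch. It assumes the same sample string as the one-liner above; the variable names are only illustrative. The `\G` assertion matches at the position where the previous `//g` match on that string ended (or at the start of the string if there was no previous match), which is what lets the second alternative pick up each following cell.

```perl
use strict;
use warnings;

# Sample input copied from the one-liner; the malformed <h3> nesting is kept on purpose.
my $s = '<tr><td>99</td></tr><h3><table><tr>BBBBB 01/08/2018</h3>'
      . '<td>236</td><td>002</td><td>121</td><td>266</td></tr>'
      . '<tr><td>999</td><td>9999</td></tr>';

# First alternative: locate the BBBBB row and skip to the end of the next tag.
# Second alternative: \G, i.e. continue exactly where the previous match ended.
# Either way, the match must then be followed directly by a <td>digits</td> cell.
my @n = $s =~ m{ (?: <tr> BBBBB \b [^>]+ > | \G ) <td> (\d+) </td> }xmsg;

print "$_\n" for @n;
```

One caveat with this approach: because `\G` also matches at offset 0 before any match has happened, a string that begins directly with a `<td>` cell would have that first run of cells captured as well. The sample avoids this only because it starts with `<tr>`.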

    • Give a man a fish:  <%-{-{-{-<

      I never understood regex anchors before seeing this very clear example. Next time I won't need 2 regexes :-) Thank you AnomalousMonk
Re: Regex help
by jahero (Pilgrim) on Aug 01, 2018 at 07:20 UTC

    The input data does not seem to be valid HTML.

    <h3>
      <table><tr>BBBBB 01/08/2018
    </h3> # </H3> is nested inside the table
        <td>236</td><td>002</td><td>121</td>
        <td>266</td></tr>
      </table>

    With that out of the way... This is how you could do it using Mojo::DOM.

    use strict;
    use warnings;

    use Mojo::DOM;
    use Data::Dumper;
    use feature qw/say/;

    my $input = join '', map { $_ } <DATA>;
    my $dom = Mojo::DOM->new($input);

    my @results;
    for my $h3 ($dom->find('h3')->each) {
        # skip everything but the heading we are after
        next unless $h3->all_text =~ /^\s* BBBBB \s+/ix;

        # assume that immediately after the heading we have the table we are interested in
        my $table = $h3->following->first;
        $table->find('td')->each( sub { push @results, $_->all_text} );
    }

    say Dumper \@results;

    __DATA__
    <h3>AAA 28/07/2018</h3>
    <table><tr><td>1351</td><td>990</td><td>783</td><td>523</td></tr></table>
    <h3>BBBBB 01/08/2018</h3>
    <table><tr><td>236</td><td>002</td><td>121</td><td>266</td></tr></table>
    <h3>KK K P 25/07/2018</h3>
    <table><tr><td>200</td><td>3345</td><td>667</td><td>137</td></tr></table>

Re: Regex help
by Anonymous Monk on Aug 01, 2018 at 02:44 UTC
    When one regex won't do try two!
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $line = <DATA>;

    # First capture the row:
    if ($line =~ /<tr>(BBBBB.*?)<\/tr>/) {
        # Then capture the cells:
        my @data = $1 =~ /<td>([^<]+)<\/td>/g;
        print $_ for @data;
    }

    __DATA__
    <h3>AAA 28/07/2018</h3><table><tr><td>1351</td><td>990</td><td>783</td><td>523</td></tr></table><h3><table><tr>BBBBB 01/08/2018</h3><td>236</td><td>002</td><td>121</td><td>266</td></tr></table><h3>KK K P 25/07/2018</h3><table><tr><td>200</td><td>3345</td><td>667</td><td>137</td></tr></table>
    BTW HTML::TableExtract is awesome for such tasks.
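The reason the original `if (... /g)` printed only 236 comes down to context, and the two-regex code works because its second match is in list context. A small sketch of the difference follows, assuming a cut-down row string; `$row`, `@first`, `@all` and `@looped` are names made up for the illustration.

```perl
use strict;
use warnings;

my $row = '<td>236</td><td>002</td><td>121</td><td>266</td>';

# Scalar context (as in an "if"): one match per evaluation, so only the first cell.
my @first;
if ($row =~ /<td>(\d{3})<\/td>/g) {
    push @first, $1;
}

# The scalar //g above left a match position behind; reset it so the
# list-context match below starts from the beginning of the string.
pos($row) = undef;

# List context: the captures from every match are returned at once.
my @all = $row =~ /<td>(\d{3})<\/td>/g;

# Scalar context in a loop: each pass resumes where the last match ended.
my @looped;
while ($row =~ /<td>(\d{3})<\/td>/g) {
    push @looped, $1;
}

print "first:  @first\n";
print "all:    @all\n";
print "looped: @looped\n";
```

Either the list assignment or the `while` loop collects all four cells; which one to use is mostly a matter of whether you need to do work between matches.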
      Works great! Am using your code - didn't know how to capture repeat matches into an array but your code has taught me how to do it.

      A zillion thanks!
Re: Regex help
by Anonymous Monk on Aug 01, 2018 at 14:39 UTC
    Thanks everyone for your code and help. Greatly appreciated. Thank you :)))