Why isn't my global matching regex working?

jdseymour has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Why isn't my global matching regex working? (update: /s modifier) by LanX (Saint) on Apr 14, 2018 at 22:04 UTC
Why do you escape the `<` ? Anyway parsing arbitrary html with regex is normally a waste of time. Please show us some input to help you. update works for me, stripped the useless (but harmless) esacaping `use strict; use warnings; use Data::Dump qw/pp dd/; my $lsdata = do { local $/;<DATA> }; my @table = $lsdata =~ m/(<tr.?<\/tr>)/g; pp @table; __DATA__ <tr>bla1</tr><tr>bla2</tr> <tr>bla3</tr>` [download] `("<tr>bla1</tr>", "<tr>bla2</tr>", "<tr>bla3</tr>")` so better inspect your input again! update Ah wait ... I think you need an /s modifier to make `.` match newlines !!! `use strict; use warnings; use Data::Dump qw/pp dd/; my $lsdata = do { local $/;<DATA> }; my @table = $lsdata =~ m/(<tr.?<\/tr>)/sg; pp @table; __DATA__ <tr>bla1</tr><tr>bla2</tr> <tr>bla3</tr> <tr> bla4 </tr>` [download] Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^2: Why isn't my global matching regex working? (update: /s modifier) by jdseymour (Initiate) on Apr 15, 2018 at 15:27 UTC
Bingo! The /s did the trick. Sorry for not supplying the input. For those interested, here's a quick test. `use strict; my $lsdata; while (<>) { $lsdata .= $_; } # Add the /s option to make this work my @table = $lsdata =~ m/(<tr.*?<\/tr>)/g; print $#table;` [download] And here's some sample data to use: <!-- Generated HTML --><html><head><title>Now Playing</title><meta htt +p-equiv="Content-Type" content="text/html; charset=UTF-8"><h1>Now Pla +ying</h1> <table cellpadding="7" width="100%"> <tr bgcolor="E5E5C5"> <th width="1%"> <th width="1%">Source</th> <th>Description</th> <th width="5%">Date</th> <th width="5%">Size</th> <th width="5%">Links</th></tr> <tr bgcolor="F5F595"> <td> <td align="center" valign="top"><img src="logo-65725.png" alt="BBCAMHD +-W"></td> <td align="left" valign="top"><b>TV Show Title</b><br>TV Show Descript +ion</td> <td align="center" valign="top" nowrap>Sat<br>4/14</td> <td align="center" valign="top" nowrap>1:01:00<br>6.03 GB</td> <td align="center" valign="top" nowrap><a href="download/tv-show-file- +name">Download MPEG-PS</a><br><a href="download/tv-show-file-name</a> +</td></tr> <tr bgcolor="F5F595"> <td align="center" valign="top"><img src="images/folder.png"></td> <td> <td valign="top"><b>Folder Name</b></td> <td align="center" valign="top" nowrap>Fri<br>4/13</td> <td align="center" valign="top">2 items<br></td> <td align="center" valign="top"><a href="TiVoConnect?Command=QueryCont +ainer&Container=%2FNowPlaying%2F17%2F55067240">folder</a></td></t +r> </table>2 items, <a href="index.html?Recurse=Yes">classic</a>.<p><font + size="-2">This feature is not supported. The TiVo license agreement +allows you to transfer content to up to ten devices within your house +hold, but not outside your household. Unauthorized transfers or dist +ribution of copyrighted works outside of your home may constitute a c +opyright infringement. TiVo reserves the right to terminate the TiVo +service accounts of users who transfer or distribute content in viola +tion of this Agreement. </font></body></html> [download] Without the /s option, the test code shows "-1" (no matches). With it, it shows "2" (3 matches - which is correct). Thanks again, all!	[reply] [d/l] [select]
Re: Why isn't my global matching regex working? by Marshall (Canon) on Apr 15, 2018 at 14:10 UTC
You might find this code helpful. I wrote this as a Perl version of a Python lab assignment. If there is no entry for a column, you get undef. HTML::TableExtract might not be the most up to date module for this, however, this worked well. Multiple tables can be in the html page, here there is only one. #!/usr/bin/perl use strict; use warnings; use Data::Dumper; use HTML::TableExtract; my @db; my $raw_html = do { open my $in, '<', 'lab8input.txt' or die "Can't open lab8input.txt + $!"; local $/ = undef; <$in>; }; my $te = new HTML::TableExtract( headers => ['Course', 'Course\s+Name', "Fall", "Winter", "Spring", "Summer"]); + $te->parse($raw_html); # Examine all matching tables foreach my $ts ( $te->table_states ) { # print "Table (", join(',', $ts->coords), "):\n"; # there's only o +ne table! foreach my $row ( $ts->rows ) { @$row = map{$_//=''}@$row; #set undef cols to empty string my @quarter_flags = (@$row)[-4..-1]; my @quarters; push @quarters,'Fall' if shift @quarter_flags; push @quarters,'Winter' if shift @quarter_flags; push @quarters,'Spring' if shift @quarter_flags; push @quarters,'Summer' if shift @quarter_flags; push @db, [@$row[0], @$row[1], join(',',@quarters)]; # print join( ',', @$row ), "\n"; } } print Dumper \@db; __END__ @db looks like this: 3 columns: Course,Name,Quarters offered (if any +listed) [ 'CIS 95B', 'Project Planning and Control - A Practicum', 'Fall,Spring' ], [download]	[reply] [d/l]
Re: Why isn't my global matching regex working? by hippo (Archbishop) on Apr 15, 2018 at 13:07 UTC
As I understand it, this will extract all <tr> blocks from my input. Since you have not provided that input, we can only guess. Looking at How to ask better questions using Test::More and sample data it is simple enough to come up with a demonstration which shows that regex to work for a given set of input. Note that like brother LanX I'm going to get rid of all those backslashes. `use strict; use warnings; use Test::More tests => 4; my $input = '<html><tr>foo</tr><tr>bar</tr><tr>baz</tr></html>'; my @table = $input =~ m#(<tr.*?</tr>)#g; is $#table, 2; is $table[0], '<tr>foo</tr>'; is $table[1], '<tr>bar</tr>'; is $table[2], '<tr>baz</tr>';` [download]	[reply] [d/l]

update

update