jdseymour has asked for the wisdom of the Perl Monks concerning the following question:

I want to extract table entries from an HTML file and I've been banging my head against this code all morning. I'm obviously missing something, so I thought I'd come here...

Here's what I want to do: (The variable $lsdata contains the entire .html file)

my @table = $lsdata =~ m/(\<tr.*?\<\/tr\>)/g;

As I understand it, this will extract all <tr> blocks from my input. However, it always returns a blank array.

If I use this simpler match, it works:

my @table = $lsdata =~ m/(\<tr)/g;

(Of course, with the simpler match, the $1 variable doesn't include the rest of the <tr> block).

Help?

Thanks.

Replies are listed 'Best First'.
Re: Why isn't my global matching regex working? (update: /s modifier)
by LanX (Saint) on Apr 14, 2018 at 22:04 UTC
    Why do you escape the < ?

    Anyway parsing arbitrary html with regex is normally a waste of time.

    Please show us some input to help you.

    update

    works for me, stripped the useless (but harmless) esacaping

    use strict; use warnings; use Data::Dump qw/pp dd/; my $lsdata = do { local $/;<DATA> }; my @table = $lsdata =~ m/(<tr.*?<\/tr>)/g; pp @table; __DATA__ <tr>bla1</tr><tr>bla2</tr> <tr>bla3</tr>

    ("<tr>bla1</tr>", "<tr>bla2</tr>", "<tr>bla3</tr>")

    so better inspect your input again!

    update

    Ah wait ... I think you need an /s modifier to make . match newlines !!!

    use strict; use warnings; use Data::Dump qw/pp dd/; my $lsdata = do { local $/;<DATA> }; my @table = $lsdata =~ m/(<tr.*?<\/tr>)/sg; pp @table; __DATA__ <tr>bla1</tr><tr>bla2</tr> <tr>bla3</tr> <tr> bla4 </tr>

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Wikisyntax for the Monastery

      Bingo! The /s did the trick.

      Sorry for not supplying the input. For those interested, here's a quick test.

      use strict; my $lsdata; while (<>) { $lsdata .= $_; } # Add the /s option to make this work my @table = $lsdata =~ m/(<tr.*?<\/tr>)/g; print $#table;

      And here's some sample data to use:

      <!-- Generated HTML --><html><head><title>Now Playing</title><meta htt +p-equiv="Content-Type" content="text/html; charset=UTF-8"><h1>Now Pla +ying</h1> <table cellpadding="7" width="100%"> <tr bgcolor="E5E5C5"> <th width="1%"> <th width="1%">Source</th> <th>Description</th> <th width="5%">Date</th> <th width="5%">Size</th> <th width="5%">Links</th></tr> <tr bgcolor="F5F595"> <td> <td align="center" valign="top"><img src="logo-65725.png" alt="BBCAMHD +-W"></td> <td align="left" valign="top"><b>TV Show Title</b><br>TV Show Descript +ion</td> <td align="center" valign="top" nowrap>Sat<br>4/14</td> <td align="center" valign="top" nowrap>1:01:00<br>6.03 GB</td> <td align="center" valign="top" nowrap><a href="download/tv-show-file- +name">Download MPEG-PS</a><br><a href="download/tv-show-file-name</a> +</td></tr> <tr bgcolor="F5F595"> <td align="center" valign="top"><img src="images/folder.png"></td> <td> <td valign="top"><b>Folder Name</b></td> <td align="center" valign="top" nowrap>Fri<br>4/13</td> <td align="center" valign="top">2 items<br></td> <td align="center" valign="top"><a href="TiVoConnect?Command=QueryCont +ainer&amp;Container=%2FNowPlaying%2F17%2F55067240">folder</a></td></t +r> </table>2 items, <a href="index.html?Recurse=Yes">classic</a>.<p><font + size="-2">This feature is not supported. The TiVo license agreement +allows you to transfer content to up to ten devices within your house +hold, but not outside your household. Unauthorized transfers or dist +ribution of copyrighted works outside of your home may constitute a c +opyright infringement. TiVo reserves the right to terminate the TiVo +service accounts of users who transfer or distribute content in viola +tion of this Agreement. </font></body></html>

      Without the /s option, the test code shows "-1" (no matches). With it, it shows "2" (3 matches - which is correct).

      Thanks again, all!

Re: Why isn't my global matching regex working?
by Marshall (Canon) on Apr 15, 2018 at 14:10 UTC
    You might find this code helpful. I wrote this as a Perl version of a Python lab assignment. If there is no entry for a column, you get undef. HTML::TableExtract might not be the most up to date module for this, however, this worked well. Multiple tables can be in the html page, here there is only one.
    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; use HTML::TableExtract; my @db; my $raw_html = do { open my $in, '<', 'lab8input.txt' or die "Can't open lab8input.txt + $!"; local $/ = undef; <$in>; }; my $te = new HTML::TableExtract( headers => ['Course', 'Course\s+Name', "Fall", "Winter", "Spring", "Summer"]); + $te->parse($raw_html); # Examine all matching tables foreach my $ts ( $te->table_states ) { # print "Table (", join(',', $ts->coords), "):\n"; # there's only o +ne table! foreach my $row ( $ts->rows ) { @$row = map{$_//=''}@$row; #set undef cols to empty string my @quarter_flags = (@$row)[-4..-1]; my @quarters; push @quarters,'Fall' if shift @quarter_flags; push @quarters,'Winter' if shift @quarter_flags; push @quarters,'Spring' if shift @quarter_flags; push @quarters,'Summer' if shift @quarter_flags; push @db, [@$row[0], @$row[1], join(',',@quarters)]; # print join( ',', @$row ), "\n"; } } print Dumper \@db; __END__ @db looks like this: 3 columns: Course,Name,Quarters offered (if any +listed) [ 'CIS 95B', 'Project Planning and Control - A Practicum', 'Fall,Spring' ],
Re: Why isn't my global matching regex working?
by hippo (Archbishop) on Apr 15, 2018 at 13:07 UTC
    As I understand it, this will extract all <tr> blocks from my input.

    Since you have not provided that input, we can only guess. Looking at How to ask better questions using Test::More and sample data it is simple enough to come up with a demonstration which shows that regex to work for a given set of input. Note that like brother LanX I'm going to get rid of all those backslashes.

    use strict; use warnings; use Test::More tests => 4; my $input = '<html><tr>foo</tr><tr>bar</tr><tr>baz</tr></html>'; my @table = $input =~ m#(<tr.*?</tr>)#g; is $#table, 2; is $table[0], '<tr>foo</tr>'; is $table[1], '<tr>bar</tr>'; is $table[2], '<tr>baz</tr>';