Why do you escape the < ?
Anyway parsing arbitrary html with regex is normally a waste of time.
Please show us some input to help you.
update
works for me, stripped the useless (but harmless) esacaping
use strict;
use warnings;
use Data::Dump qw/pp dd/;
my $lsdata = do { local $/;<DATA> };
my @table = $lsdata =~ m/(<tr.*?<\/tr>)/g;
pp @table;
__DATA__
<tr>bla1</tr><tr>bla2</tr>
<tr>bla3</tr>
("<tr>bla1</tr>", "<tr>bla2</tr>", "<tr>bla3</tr>")
so better inspect your input again!
update
Ah wait ... I think you need an /s modifier to make . match newlines !!!
use strict;
use warnings;
use Data::Dump qw/pp dd/;
my $lsdata = do { local $/;<DATA> };
my @table = $lsdata =~ m/(<tr.*?<\/tr>)/sg;
pp @table;
__DATA__
<tr>bla1</tr><tr>bla2</tr>
<tr>bla3</tr>
<tr>
bla4
</tr>
| [reply] [d/l] [select] |
Bingo! The /s did the trick.
Sorry for not supplying the input. For those interested, here's a quick test.
use strict;
my $lsdata;
while (<>) {
$lsdata .= $_;
}
# Add the /s option to make this work
my @table = $lsdata =~ m/(<tr.*?<\/tr>)/g;
print $#table;
And here's some sample data to use:
<!-- Generated HTML --><html><head><title>Now Playing</title><meta htt
+p-equiv="Content-Type" content="text/html; charset=UTF-8"><h1>Now Pla
+ying</h1>
<table cellpadding="7" width="100%">
<tr bgcolor="E5E5C5">
<th width="1%">
<th width="1%">Source</th>
<th>Description</th>
<th width="5%">Date</th>
<th width="5%">Size</th>
<th width="5%">Links</th></tr>
<tr bgcolor="F5F595">
<td>
<td align="center" valign="top"><img src="logo-65725.png" alt="BBCAMHD
+-W"></td>
<td align="left" valign="top"><b>TV Show Title</b><br>TV Show Descript
+ion</td>
<td align="center" valign="top" nowrap>Sat<br>4/14</td>
<td align="center" valign="top" nowrap>1:01:00<br>6.03 GB</td>
<td align="center" valign="top" nowrap><a href="download/tv-show-file-
+name">Download MPEG-PS</a><br><a href="download/tv-show-file-name</a>
+</td></tr>
<tr bgcolor="F5F595">
<td align="center" valign="top"><img src="images/folder.png"></td>
<td>
<td valign="top"><b>Folder Name</b></td>
<td align="center" valign="top" nowrap>Fri<br>4/13</td>
<td align="center" valign="top">2 items<br></td>
<td align="center" valign="top"><a href="TiVoConnect?Command=QueryCont
+ainer&Container=%2FNowPlaying%2F17%2F55067240">folder</a></td></t
+r>
</table>2 items, <a href="index.html?Recurse=Yes">classic</a>.<p><font
+ size="-2">This feature is not supported. The TiVo license agreement
+allows you to transfer content to up to ten devices within your house
+hold, but not outside your household. Unauthorized transfers or dist
+ribution of copyrighted works outside of your home may constitute a c
+opyright infringement. TiVo reserves the right to terminate the TiVo
+service accounts of users who transfer or distribute content in viola
+tion of this Agreement. </font></body></html>
Without the /s option, the test code shows "-1" (no matches). With it, it shows "2" (3 matches - which is correct).
Thanks again, all!
| [reply] [d/l] [select] |
You might find this code helpful. I wrote this as a Perl version of a Python lab assignment. If there is no entry for a column, you get undef. HTML::TableExtract might not be the most up to date module for this, however, this worked well. Multiple tables can be in the html page, here there is only one.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TableExtract;
my @db;
my $raw_html = do {
open my $in, '<', 'lab8input.txt' or die "Can't open lab8input.txt
+ $!";
local $/ = undef;
<$in>;
};
my $te = new HTML::TableExtract( headers => ['Course',
'Course\s+Name',
"Fall",
"Winter",
"Spring",
"Summer"]);
+
$te->parse($raw_html);
# Examine all matching tables
foreach my $ts ( $te->table_states )
{
# print "Table (", join(',', $ts->coords), "):\n"; # there's only o
+ne table!
foreach my $row ( $ts->rows )
{
@$row = map{$_//=''}@$row; #set undef cols to empty string
my @quarter_flags = (@$row)[-4..-1];
my @quarters;
push @quarters,'Fall' if shift @quarter_flags;
push @quarters,'Winter' if shift @quarter_flags;
push @quarters,'Spring' if shift @quarter_flags;
push @quarters,'Summer' if shift @quarter_flags;
push @db, [@$row[0], @$row[1], join(',',@quarters)];
# print join( ',', @$row ), "\n";
}
}
print Dumper \@db;
__END__
@db looks like this: 3 columns: Course,Name,Quarters offered (if any
+listed)
[
'CIS 95B',
'Project Planning and Control - A Practicum',
'Fall,Spring'
],
| [reply] [d/l] |
As I understand it, this will extract all <tr> blocks from my input.
Since you have not provided that input, we can only guess. Looking at How to ask better questions using Test::More and sample data it is simple enough to come up with a demonstration which shows that regex to work for a given set of input. Note that like brother LanX I'm going to get rid of all those backslashes.
use strict;
use warnings;
use Test::More tests => 4;
my $input = '<html><tr>foo</tr><tr>bar</tr><tr>baz</tr></html>';
my @table = $input =~ m#(<tr.*?</tr>)#g;
is $#table, 2;
is $table[0], '<tr>foo</tr>';
is $table[1], '<tr>bar</tr>';
is $table[2], '<tr>baz</tr>';
| [reply] [d/l] |