Re: help needed with match multiple lines
by pc88mxer (Vicar) on Jun 20, 2008 at 16:35 UTC
Can you provide a sample of the text you want to match?
Also, one suggestion: use the match operator with curly braces, m{}. You won't have to do as much escaping, and your regex will be much more readable:
if (m{class="report".*&id=1">(.*?)</a.*"report"\swidth.*>(.*?)</td>$}si) {
...
}
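To make the escaping point concrete, here is a minimal sketch. The sample string and both patterns are made up for the demonstration and are not from the original script. The two patterns are the same regex; only the delimiter differs.

```perl
use strict;
use warnings;

# Hypothetical sample line, invented for this demonstration.
my $html = '<td class="report"><a href="page?x=2&id=1">12R</a></td>';

# With / as the delimiter, every literal slash has to be escaped.
my ($with_slashes) = $html =~ /&id=1">(.*?)<\/a><\/td>/;

# With m{} as the delimiter, slashes need no escaping.
my ($with_braces) = $html =~ m{&id=1">(.*?)</a></td>};

print "$with_slashes\n";    # 12R
print "$with_braces\n";     # 12R
```

Any paired or unpaired delimiter works with m; braces are just a common choice when the pattern contains slashes.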
Hi,
Thank you for your help. I think the problem is probably with the undef $/, since it will run if I get rid of it and just match for one line. Thanks, Joice
Re: help needed with match multiple lines
by olus (Curate) on Jun 20, 2008 at 17:39 UTC
When you undef $/; you'll read everything at once, so the while is not doing what I think you expect it to do.
I wrote the following example based on your regexp, slightly modified to give the results I believe you expect.
use strict;
use warnings;
use Data::Dumper;
#my @array = <DATA>;
#my $text = join '', @array;
undef $/;
my $text = <DATA>;
my @matches = $text =~ m{class="report".*?&id=1">\s*(.*?)</a.*?"report"\swidth.*?>\s*(.*?)\s*</td>}sgx;
print Dumper(\@matches);
__DATA__
<td class="report">
<a href="blahblah1&id=1">link to blah 1</a>
</td>
<td class="report" width=50>
cell2 1
</td>
<td> some other stuff </td>
<td class="report">
<a href="blahblah2&id=1">link to blah 2</a>
</td>
<td class="report" width=50>
cell2 2
</td>
That outputs:
$VAR1 = [
'link to blah 1',
'cell2 1',
'link to blah 2',
'cell2 2'
];
Now you can iterate through the resulting array in pairs. Hope that helps.
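One possible way to walk the flat list two elements at a time is sketched below. The array is copied from the Dumper output above; the names `@rows`, `$link` and `$cell` are just illustrative.

```perl
use strict;
use warnings;

# The flat list from the match above: link text, cell text, link text, ...
my @matches = ( 'link to blah 1', 'cell2 1', 'link to blah 2', 'cell2 2' );

# Step through the array two elements at a time.
# The condition $i < $#matches skips a trailing unpaired element, if any.
my @rows;
for ( my $i = 0 ; $i < $#matches ; $i += 2 ) {
    my ( $link, $cell ) = @matches[ $i, $i + 1 ];
    push @rows, "$link => $cell";
}

print "$_\n" for @rows;
# link to blah 1 => cell2 1
# link to blah 2 => cell2 2
```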
Hi,
Thank you very much for your help. I followed your code (adding the directory part), but it still keeps running (a problem with my directory loop?). I am not sure what the problem is.
Thanks again!
Joice
Re: help needed with match multiple lines
by cfreak (Chaplain) on Jun 20, 2008 at 17:28 UTC
I think others have answered your initial question. I'd just like to add that, since it looks like you're trying to parse some HTML, it's far easier (and more accurate) to use an HTML parser module instead.
HTML::TokeParser is my favorite for general parsing.
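As a rough sketch of what that could look like: the markup below is an assumed reconstruction (the real page was never posted in full), and `report_cells` is a hypothetical helper name. HTML::TokeParser comes from CPAN and has to be installed separately.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # from CPAN (HTML-Parser distribution); not core Perl

# Assumed sample markup, invented for this sketch.
my $html = <<'END_HTML';
<table>
<tr><td class="report" width="15%"><a href="rm=mode2&id=1">12R</a></td>
<td class="report" width="15%">567</td></tr>
<tr><td class="report" width="15%"><a href="rm=mode2&id=1">14R</a></td>
<td class="report" width="15%">129</td></tr>
</table>
END_HTML

# Collect the text of every <td class="report"> cell, skipping nested tags.
sub report_cells {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new( \$doc )
      or die "cannot create parser\n";
    my @cells;
    while ( my $tag = $parser->get_tag('td') ) {
        my $class = $tag->[1]{class};    # $tag->[1] is the attribute hash
        next unless defined $class && $class eq 'report';
        push @cells, $parser->get_trimmed_text('/td');
    }
    return @cells;
}

print join( ' | ', report_cells($html) ), "\n";
```

The parser does not care about line breaks, attribute order or stray whitespace, which is where the regex approach tends to go wrong.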
Hi,
Thanks for the help. I'll check on it.
Best,
Joice
Re: help needed with match multiple lines
by jethro (Monsignor) on Jun 20, 2008 at 17:15 UTC
You seem to be certain that your regex is not matching multiple lines but is otherwise correct.
And the thing is, your code works for multiple lines. When I substituted your regex with /(class).*(pattern)/si, and planted an appropriate file with 'class' and 'pattern' each on its own line, the regex matched and list.txt had a line in it.
So you don't have a problem with multiline matching, you have a problem with your regex not matching the text you want to match.
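A small sketch of that kind of test, using an in-memory string instead of a file (the string is made up for the demonstration). It shows that the s switch is what lets the match span lines.

```perl
use strict;
use warnings;

# Two keywords on separate lines, as in the test described above.
my $text = "first line with class\nsecond line with pattern\n";

# Without /s, '.' does not match a newline, so the match cannot span lines.
my $without_s = $text =~ /(class).*(pattern)/i ? 'match' : 'no match';

# With /s, '.' matches any character, including a newline.
my $with_s = $text =~ /(class).*(pattern)/si ? 'match' : 'no match';

print "without /s: $without_s\n";    # without /s: no match
print "with /s: $with_s\n";          # with /s: match
```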
Hi,
Thanks much for your reply. Here is part of the text. I am trying to match the name and the number.
class="report" width="15%"><a href="rm=mode2&id=1">12R</a></td>
class="report" width="15%">567</td>
class="report" width="15%"><a href="rm=mode2&id=1">14R</a></td>
class="report" width="15%">129</td>
When I run the code, it seems to be in an endless loop. I'm not sure what the problem is.
Thanks!!!
Joice
One thing wrong with your code is that you use '$' at the end of your regexp. Without the m switch, this only matches at the very end of your string (or just before a final newline there), not at every line ending. The regex switch m lets '$' also match at line endings, i.e. use /.../msi
But that alone won't help you, because you have so many greedy matches in your regex (i.e. .*) that it will match horribly wrong in any nontrivial HTML file. Changing all .* to .*? will make a big difference.
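To see the difference on the sample text posted above (wrapped lines rejoined), here is a sketch with a simplified pattern. The pattern is shortened for the demonstration and is not the exact regex from the original script.

```perl
use strict;
use warnings;

# Sample text from the post above, with the wrapped lines rejoined.
my $text = <<'END_TEXT';
class="report" width="15%"><a href="rm=mode2&id=1">12R</a></td>
class="report" width="15%">567</td>
class="report" width="15%"><a href="rm=mode2&id=1">14R</a></td>
class="report" width="15%">129</td>
END_TEXT

# Greedy: each .* grabs as much as it can, so one match swallows everything.
my @greedy = $text =~ m{&id=1">(.*)</a>.*"report".*>(.*)</td>}sg;

# Non-greedy: each .*? stops at the first place the rest of the pattern fits.
my @lazy = $text =~ m{&id=1">(.*?)</a>.*?"report".*?>(.*?)</td>}sg;

my $greedy_count = @greedy / 2;    # two captures per match
my $lazy_count   = @lazy / 2;

print "greedy matches: $greedy_count\n";        # greedy matches: 1
print "non-greedy matches: $lazy_count\n";      # non-greedy matches: 2
print "non-greedy captures: @lazy\n";           # non-greedy captures: 12R 567 14R 129
```

In the greedy case the first capture runs from the first link all the way to the last </a>, which is why only one (useless) match comes back.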
BUT what happens when there is a line ...</td> SPACE RETURN? Your regex won't match it, because you forgot a \s* before the '$'. Instead it will match a few more lines, until the next </td> without spaces after it.
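The two points about '$' (the m switch and trailing whitespace) can be checked with a short sketch. The three-line string is made up for the demonstration; its second line has a space before the newline.

```perl
use strict;
use warnings;

# Three lines; the second one has a trailing space before the newline.
my $text = "<td>567</td>\n<td>129</td> \n<td>88</td>\n";

# Without /m, '$' anchors only at the end of the string
# (or just before a final newline).
my @end_only = $text =~ m{>(\d+)</td>$}g;

# With /m, '$' also matches before every newline,
# but the trailing space still blocks line 2.
my @per_line = $text =~ m{>(\d+)</td>$}mg;

# Allowing optional whitespace before '$' picks up all three lines.
my @tolerant = $text =~ m{>(\d+)</td>\s*$}mg;

print "no /m: @end_only\n";          # no /m: 88
print "with /m: @per_line\n";        # with /m: 567 88
print "with \\s*: @tolerant\n";      # with \s*: 567 129 88
```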
So you see, getting this right is not trivial. Better to use a module, as cfreak suggested.
By the way, I didn't get any endless loop with your example data. You might put a print statement in your while loop to check whether it is looping, but it shouldn't be. The while loop isn't really necessary since you read the file in one go, so you could replace it with if (defined($_=<IN>)), which eliminates the while but still provides the hidden magic of the while(<>) loop (assigning the input to $_ and testing it with defined).
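Here is a minimal sketch of why the while loop runs only once after undef $/. It reads from an in-memory string rather than a real file, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $data = "line 1\nline 2\nline 3\n";

# Normal line mode: the loop body runs once per line.
open( my $fh, '<', \$data ) or die "in-memory open failed: $!\n";
my $line_mode = 0;
$line_mode++ while <$fh>;
close $fh;

# Slurp mode: with $/ undefined the first read returns the whole input,
# so the same loop body runs only once.
my $slurp_mode = 0;
{
    local $/;    # undef inside this block only
    open( my $fh2, '<', \$data ) or die "in-memory open failed: $!\n";
    $slurp_mode++ while <$fh2>;
    close $fh2;
}

print "line mode iterations: $line_mode\n";      # line mode iterations: 3
print "slurp mode iterations: $slurp_mode\n";    # slurp mode iterations: 1
```

Using local $/ inside a block, rather than a bare undef $/, restores normal line-by-line reading for the rest of the script.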
Re: help needed with match multiple lines
by GrandFather (Saint) on Jun 20, 2008 at 18:45 UTC
Avoid greedy .* matches. Your regex will only find one match because of the .* matches.
However, a much better solution for parsing standard markup is to use the tools designed for the purpose. In this case one of the HTML modules, HTML::TreeBuilder perhaps, would be appropriate. Parsing markup reliably is hard and reinventing wheels generally takes much longer than one might expect.
Perl is environmentally friendly - it saves trees
Re: help needed with match multiple lines
by graff (Chancellor) on Jun 21, 2008 at 00:31 UTC
If the data sample that you provided in this reply above is (part of) the content of a single data file, then you should have multiple lines printed for each file. I would try something like this (although actually, I think it would be worthwhile to try a real HTML parsing module, as recommended by others above):
use strict;
use warnings;
my $dir = "data";    # change this to read a different directory
opendir( my $dh, $dir ) or die "$dir: $!\n";
my @files = grep /\.txt$/, readdir( $dh );
closedir $dh;
warn scalar(@files) . " files to read\n";
open( my $out, ">", "list.txt" ) or die "list.txt: $!\n";
for my $file ( @files ) {
    local $/;    # sets $/ to undef (locally)
    # readdir returns bare file names, so prepend the directory
    open( my $in, "<", "$dir/$file" ) or do { warn "$dir/$file: $!\n"; next };
    $_ = <$in>;  # reads entire file because $/ is undef
    close $in;
    while ( m{<a href=.*?>(\w+)<.*?class="report"[^>]+>(\w+)<}sg ) {
        print $out "$file -- $1 -- $2\n";
    }
}
close $out;
(not tested)