help needed with match multiple lines

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: help needed with match multiple lines by pc88mxer (Vicar) on Jun 20, 2008 at 16:35 UTC
Can you provide a sample of the text you want to match? Also, one suggestion: use the match operator with curly braces, `m{}`. You won't have to do as much escaping, and your regex will be much more readable: `if (m{class="report".*&id=1">(.*?)</a.*"report"\swidth.*>(.*?)</td>$}si) { ... }`
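To illustrate the delimiter suggestion, here is a minimal sketch. The sample line and variable names are made up for the demonstration and are not from the original question; both patterns are meant to be equivalent, only the delimiters differ.

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original question.
my $line = '<td class="report"><a href="x?id=1">12R</a></td>';

# With the default / delimiter, every literal slash must be escaped.
my ($with_slashes) = $line =~ /<a href="[^"]*">(.*?)<\/a><\/td>/;

# With m{} the same pattern needs no backslashes for the slashes.
my ($with_braces) = $line =~ m{<a href="[^"]*">(.*?)</a></td>};

print "$with_slashes $with_braces\n";
```

Any paired delimiter works the same way; braces are just a common choice when the pattern contains slashes.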
Re^2: help needed with match multiple lines by Anonymous Monk on Jun 20, 2008 at 17:14 UTC
Hi, Thank you for your help. I think the problem is probably with the `undef $/`, since it will run if I get rid of it and just match one line at a time. Thanks, Joice
Re: help needed with match multiple lines by olus (Curate) on Jun 20, 2008 at 17:39 UTC
When you `undef $/;` you read the whole file at once, so the `while` is not doing what I think you expect it to do. I wrote the following example based on your regexp, slightly modified to give the results that I believe you expect.

    use strict;
    use warnings;
    use Data::Dumper;

    #my @array = <DATA>;
    #my $text = join '', @array;

    undef $/;            # slurp mode: the next read returns the whole input
    my $text = <DATA>;

    # Non-greedy captures; \s* trims whitespace around each captured value.
    my @matches = $text =~ m{class="report".*?&id=1">\s*(.*?)</a.*?"report"\swidth.*?>\s*(.*?)\s*</td>}sgx;

    print Dumper(\@matches);

    __DATA__
    <td class="report">
    <a href="blahblah1&id=1">link to blah 1</a>
    </td>
    <td class="report" width=50>
    cell2 1
    </td>
    <td>
    some other stuff
    </td>
    <td class="report">
    <a href="blahblah2&id=1">link to blah 2</a>
    </td>
    <td class="report" width=50>
    cell2 2
    </td>

That outputs:

    $VAR1 = [
              'link to blah 1',
              'cell2 1',
              'link to blah 2',
              'cell2 2'
            ];

Now you can iterate through the resulting array two elements at a time. Hope that helps.
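Walking the flat result list in pairs could look like the sketch below. It assumes the list always holds an even number of elements (link text, then cell text); the names `@queue` and `@rows` are illustrative only.

```perl
use strict;
use warnings;

# Flat capture list in the shape shown above: link, cell, link, cell, ...
my @matches = ('link to blah 1', 'cell2 1', 'link to blah 2', 'cell2 2');

my @rows;
my @queue = @matches;    # work on a copy so @matches stays intact

# splice removes and returns the first two elements; the loop ends
# when the queue is empty and splice returns an empty list.
while (my ($link, $cell) = splice(@queue, 0, 2)) {
    push @rows, "$link => $cell";
}

print "$_\n" for @rows;
```

An index loop stepping by two would work as well; `splice` just keeps the bookkeeping out of sight.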
Re^2: help needed with match multiple lines by Anonymous Monk on Jun 20, 2008 at 17:59 UTC
Hi, Thank you very much for your help. I followed your code (adding the directory part), but it still keeps running (a problem with my directory loop?). I am not sure what the problem is. Thanks again! Joice
Re: help needed with match multiple lines by cfreak (Chaplain) on Jun 20, 2008 at 17:28 UTC
I think others have answered your initial question. I'd just like to add that, since it looks like you're trying to parse HTML, it's far easier (and more accurate) to use an HTML parser module instead. HTML::TokeParser is my favorite for general parsing.
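As a rough idea of what the parser approach might look like, here is a sketch using HTML::TokeParser. It assumes the module is installed from CPAN (it is not part of core Perl), and the sample markup, including the `<tr>`/`<td>` wrappers, is an assumption made for the example. The helper name `report_cells` is hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module (HTML-Parser distribution), not core Perl

# Assumed sample markup; the real page layout is not known.
my $html = <<'HTML';
<tr><td class="report" width="15%"><a href="rm=mode2&id=1">12R</a></td>
<td class="report" width="15%">567</td></tr>
<tr><td class="report" width="15%"><a href="rm=mode2&id=1">14R</a></td>
<td class="report" width="15%">129</td></tr>
HTML

# Return the trimmed text of every <td class="report"> cell, in document order.
sub report_cells {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new(\$doc)
        or die "cannot create parser\n";
    my @cells;
    while (my $tag = $parser->get_tag('td')) {
        my $attr = $tag->[1];    # hash ref of the start tag's attributes
        next unless ($attr->{class} // '') eq 'report';
        # Collect the text up to the closing </td>, skipping nested tags.
        push @cells, $parser->get_trimmed_text('/td');
    }
    return @cells;
}

my @cells = report_cells($html);
print join(' ', @cells), "\n";
```

The parser does not care about line breaks, attribute order, or extra whitespace, which is where hand-written regexes tend to break.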
Re^2: help needed with match multiple lines by Anonymous Monk on Jun 20, 2008 at 18:01 UTC
Hi, Thanks for the help. I'll check on it. Best, Joice
Re: help needed with match multiple lines by jethro (Monsignor) on Jun 20, 2008 at 17:15 UTC
You seem to be certain that your regex is not matching multiple lines but is otherwise correct. And the thing is, your code works for multiple lines. When I substituted your regex with `/(class).*(pattern)/si`, and planted an appropriate file with 'class' and 'pattern' each on its own line, the regex matched and list.txt had a line in it. So you don't have a problem with multiline matching, you have a problem with your regex not matching the text you want to match.
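The experiment described above can be reproduced without any files. This is a sketch with a made-up three-line string; it only shows that the `/s` modifier is what lets `.` cross newlines.

```perl
use strict;
use warnings;

# Hypothetical text with 'class' and 'pattern' on separate lines.
my $text = "class\nsome line\npattern\n";

# Without /s, '.' does not match a newline, so the pattern cannot span lines.
my $without_s = ($text =~ /(class).*(pattern)/i)  ? 1 : 0;

# With /s, '.' matches any character including newline.
my $with_s    = ($text =~ /(class).*(pattern)/si) ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";
```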
Re^2: help needed with match multiple lines by Anonymous Monk on Jun 20, 2008 at 17:43 UTC
Hi, Thanks much for your reply. Here is part of the text. I am trying to match the name and the number.

    class="report" width="15%"><a href="rm=mode2&id=1">12R</a></td>
    class="report" width="15%">567</td>
    class="report" width="15%"><a href="rm=mode2&id=1">14R</a></td>
    class="report" width="15%">129</td>

When I run the code, it seems to be in an endless loop. Not sure what the problem is. Thanks!!! Joice
Re^3: help needed with match multiple lines by jethro (Monsignor) on Jun 20, 2008 at 19:10 UTC
One thing wrong with your code is that you use '$' at the end of your regexp. This will only match the absolute end of your string, not a line ending. The regex switch m lets '$' also match line endings, i.e. use `/.../msi`.

But that alone won't help you, because you have so many greedy matches in your regex (i.e. `.*`) that you will match horribly wrong in any nontrivial HTML file. Changing all `.*` to `.*?` will make a big difference.

BUT what happens when there is a line `...<td> SPACE RETURN`? Your regex won't match it because you forgot a `\s*` before the '$'. Instead it will match a few more lines, until the next td without spaces behind it.

So you see, getting this right is not trivial. Better to use a module, as cfreak suggested.

By the way, I didn't get any endless loop with your example data. You might check with a print statement in your while loop whether it is looping, but it shouldn't. The while loop isn't really necessary since you read the file in one take, so you could substitute it with `if (defined($_=<IN>))`, which eliminates the while and provides the hidden magic of the while(<>) loop.
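The effect of the `m` switch and of trailing whitespace can be seen in a small sketch. The three-line string below is invented for the demonstration; the second line deliberately ends in a space before the newline.

```perl
use strict;
use warnings;

# Hypothetical input: the second line has a trailing space.
my $text = "first</td>\nsecond</td> \nthird</td>\n";

# Without /m, '$' matches only at the end of the string
# (or just before a final newline).
my @plain = $text =~ m{(\w+)</td>$}g;

# With /m, '$' also matches before every embedded newline.
my @multi = $text =~ m{(\w+)</td>$}mg;

# Allowing optional whitespace before '$' also catches the line
# that ends in a space.
my @loose = $text =~ m{(\w+)</td>\s*$}mg;

print "plain: @plain\n";
print "multi: @multi\n";
print "loose: @loose\n";
```

Here `@plain` holds only `third`, `@multi` holds `first` and `third`, and `@loose` holds all three names.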
Re: help needed with match multiple lines by GrandFather (Saint) on Jun 20, 2008 at 18:45 UTC
Avoid greedy .* matches. Your regex will only find one match because of the .* matches. However, a much better solution for parsing standard markup is to use the tools designed for the purpose. In this case one of the HTML modules, HTML::TreeBuilder perhaps, would be appropriate. Parsing markup reliably is hard and reinventing wheels generally takes much longer than one might expect.
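Why greedy matching yields a single match can be shown with a short sketch. The one-line input is made up; the only difference between the two patterns is `.*` versus `.*?`.

```perl
use strict;
use warnings;

# Hypothetical input with three cells on one line.
my $html = '<td>12R</td><td>567</td><td>14R</td>';

# Greedy: .* runs to the LAST </td>, so /g finds a single oversized match.
my @greedy = $html =~ m{<td>(.*)</td>}g;

# Non-greedy: .*? stops at the FIRST </td>, so /g finds each cell.
my @lazy = $html =~ m{<td>(.*?)</td>}g;

print scalar(@greedy), " greedy match: $greedy[0]\n";
print scalar(@lazy), " non-greedy matches: @lazy\n";
```

The greedy capture swallows the intermediate tags (`12R</td><td>567</td><td>14R`), while the non-greedy version returns the three cell values separately.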
Re: help needed with match multiple lines by graff (Chancellor) on Jun 21, 2008 at 00:31 UTC
If the data sample that you provided in this reply above is (part of) the content of a single data file, then you should have multiple lines printed for each file. I would try something like this (although actually, I think it would be worthwhile to try a real HTML parsing module, as recommended by others above):

    use strict;

    my $dir = "data";   # use a variable if you need to handle different values
    opendir( DIR, $dir ) or die "$dir: $!\n";
    my @files = grep /\.txt$/, readdir( DIR );
    closedir DIR;

    warn scalar(@files) . " files to read\n";

    open( OUT, ">", "list.txt" ) or die "list.txt: $!\n";

    for my $file ( @files ) {
        local $/;       # sets $/ to undef (locally)
        # readdir returns bare file names, so prepend the directory
        open( IN, "<", "$dir/$file" ) or do { warn "$file: $!\n"; next };
        $_ = <IN>;      # reads entire file because $/ is undef
        close IN;
        # non-greedy matching with /g, so every name/number pair is found
        while ( m{<a href=.*?>(\w+)<.*?class="report"[^>]+>(\w+)<}sg ) {
            print OUT "$file -- $1 -- $2\n";
        }
    }
    close OUT;

(not tested)