Levan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I am having a problem matching some text that spans more than one line.
The text file I have looks something like this:

SomeText
void what(void *kiss,INT8U
miss)
SomeText

Using the brackets as delimiters and INT8U as the target, I use the following code to try to get the two lines between the two SomeText lines, but I don't seem to get them:

if ($Hap =~ m/.*INT8U.*\)/s) { print "$Hap"; }
One thing to note is that I have tried using undef $/, but it keeps giving me the whole text back.
Is there any other way to do that?
Thanks

Re: Trouble matching more than one line
by !1 (Hermit) on Nov 18, 2003 at 04:42 UTC

    First and foremost: do not undef $/ unless this is a script that...

    1. is a throwaway script.
    2. doesn't use any other modules that just might happen to open a file for reading.
    3. you don't and never will work with a developer who is an avid gun collector and also has a low tolerance for weird stuff in scripts he has to maintain.

    If all of those describe the script in question, go ahead. Otherwise, you probably want to localize $/ in as small a scope as possible. Perl Idioms Explained - my $string = do { local $/; <FILEHANDLE> }; by jeffa might help out with a preferred way to do it.
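
    A minimal sketch of that idiom, assuming an in-memory filehandle stands in for the real file (the sample text and the variable names other than $Hap are made up for illustration):

```perl
use strict;
use warnings;

# Sample data standing in for the real file (an assumption for this sketch).
my $data = "SomeText\nvoid what(void *kiss,INT8U\nmiss)\nSomeText\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

# local $/ leaves the input record separator undefined only inside the
# do block, so the single read returns everything; the previous value
# is restored automatically when the block exits.
my $Hap = do { local $/; <$fh> };
close($fh);

# Outside the block, $/ is back to its default of "\n".
my $separator_restored = defined $/ && $/ eq "\n";
print length($Hap), " characters read\n";
```

    Because the change to $/ never leaks out of the do block, other code and modules that read files keep seeing the normal line-at-a-time behaviour.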

    In regards to your regex question, you're merely checking whether or not $Hap contains INT8U followed somewhere later by a right parenthesis. You then happily print $Hap. Of course, I'm curious about a few things. You want to capture the void what(void *kiss,INT8U\nmiss) but it's somewhat difficult to know how to design our regex unless we know a few things: Will the lines always come in pairs? Will the first line always start with void or is there a list of keywords that can be there or can it be just anything? If INT8U isn't on the same lines as the opening parenthesis, should we still match the entire line? Is it legal to have our opening left parenthesis at the beginning of a new line? Can parentheses be somehow embedded within the parentheses? Will we always be looking for INT8U or does it depend upon a variable? Should we stop capturing at the closing right parenthesis or should we capture that entire line?

    Those are the questions you need to recognize and ask yourself prior to attempting to write your regular expression. Once you can do that, recognizing text patterns and deciding upon a regex to describe those patterns becomes much easier, even natural. In short, check out perldoc perlretut for a basic introduction to regular expressions as well as the difference between matching and substituting as well as when and when not to capture. Whenever you think you have a good grip on the basics, read perldoc perlre for everything you'd ever want to know about regular expressions.

    While I somewhat doubt I've helped you with your problem, I hope this at least helps you to become more proficient with regular expressions.

Re: Trouble matching more than one line
by pg (Canon) on Nov 18, 2003 at 04:37 UTC

    The data you are processing looks like a C program. You are right to use the s modifier, which lets . match newlines so the data can be treated as a single line.

    I quickly came up with something, and I hope it helps. (It does not handle things like brackets in quotes or unmatched brackets in comments, but in your real code you have to expect those.)

        use strict;
        use warnings;

        open(my $cprog, '<', 'test1.cpp') or die "Cannot open test1.cpp: $!";
        my $concat_line;
        while (my $line = <$cprog>) {
            if ($line =~ /\(/) {
                # An opening bracket starts a new chunk.
                $concat_line = $line;
            }
            else {
                $concat_line .= $line if ($concat_line);
            }
            # Print and reset once the chunk also holds the closing bracket.
            if ($concat_line && ($concat_line =~ /\(.*\)/s)) {
                print $concat_line;
                $concat_line = undef;
            }
        }
        close($cprog);

    I tried it with

        #include <string.h>
        #include <stdio.h>
        main() {
            char a[80];
            strcpy(a, "abcd");
            strcat(a, "\015");
            printf("%s", a);
        }

    And it gives me:

        main() {
            strcpy(a, "abcd");
            strcat(a, "\015");
            printf("%s", a);

Re: Trouble matching more than one line
by etcshadow (Priest) on Nov 18, 2003 at 03:42 UTC
    Well, there are different ways that you can do this, but they all revolve around one central point: you need to define how much context you are interested in searching.

    If, as you say, you just undef $/, meaning, I assume, that you read an entire file into $Hap, then your regular expression is too broad. You have .* before and .* after the key bit of text... this means that you'll be matching the entire file. (Even if you limited the regular expression, though, you're printing $Hap, not printing just the portion of the string that matched the regexp.)

    In order to handle that properly, you'd want to be clear in your regular expression where you wanted to begin your match... like maybe: /void .*?INT8U.*?\)/s (the s modifier is needed so that . can cross the line break between INT8U and the closing parenthesis). Also, you'd want to capture the match: /(void .*?INT8U.*?\))/s and then reference the captured text: print $1;.
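
    Put together, that might look like the sketch below. It assumes the whole file has already been read into $Hap (the string here is just the sample from the question) and uses the s modifier so that . can match the newline inside the parentheses; $found is a name invented for the example.

```perl
use strict;
use warnings;

# The sample text from the question, as if slurped from the file.
my $Hap = "SomeText\nvoid what(void *kiss,INT8U\nmiss)\nSomeText\n";

# /s lets . match the newline between INT8U and the closing parenthesis.
# The non-greedy .*? stops at the first INT8U after "void " and then at
# the first ")" after that, instead of running to the end of the file.
my $found;
if ($Hap =~ /(void .*?INT8U.*?\))/s) {
    $found = $1;          # only the captured part, not all of $Hap
    print "$found\n";
}
```

    Printing $1 rather than $Hap is what keeps the surrounding SomeText lines out of the output.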

    Anyway, another entirely different way to deal with it is to use the same regular expression, but set $/ to an appropriate value, so that you split your input into the chunks that you are interested in. Maybe, for example, by setting $/ to "\n\n". That would break up your input into paragraphs (which may or may not be what you want). Anyway, I can't really tell more, because your question is a little vague.
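
    A rough sketch of the $/ approach, under the assumption that the chunks of interest are separated by blank lines (the sample in the question has none, so the data below is made up, as are the names $data, $chunk and @hits):

```perl
use strict;
use warnings;

# Hypothetical input in which blank lines separate the chunks.
my $data = "SomeText\n\nvoid what(void *kiss,INT8U\nmiss)\n\nSomeText\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my @hits;
{
    # Each read now returns everything up to and including the next "\n\n".
    local $/ = "\n\n";
    while (my $chunk = <$fh>) {
        # The broad pattern is harmless here because it can only
        # match within the current chunk.
        push @hits, $1 if $chunk =~ /(.*INT8U.*\))/s;
    }
}
close($fh);
print "$_\n" for @hits;
```

    The regular expression stays as loose as the original one; the record separator does the work of limiting how much text it can see.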


Re: Trouble matching more than one line
by forrest (Beadle) on Nov 18, 2003 at 03:55 UTC
    If I understand the question correctly:
    if ($Hap =~ m/SomeText\n(.*?INT8U.*?)\nSomeText/s) { print "$1"; }
Re: Trouble matching more than one line
by ysth (Canon) on Nov 18, 2003 at 05:22 UTC
    It looks like you need to go through some tutorials. I like perldoc perlrequick and perldoc perlretut. Some comments on your code, in no particular order:
  • If you want only the second . to be allowed to match a newline, but not the first, you can set the s flag for just a part of your regex: m/.*INT8U(?s:.*)\)/
  • If you want the match to begin at the beginning of the line that has INT8U, use ^ and the m flag: m/^.*INT8U.../m. Without //m, ^ matches only at the beginning of the string, not on interior newlines.
  • m// only checks if part of $Hap matches, it doesn't alter $Hap. To get the part of $Hap that matched, use $& (only after testing that the match was successful) or assign it from the match: if (($match) = $Hap =~ m/.../). (That will get the whole match as long as your pattern has no capturing parentheses.)
  • It sounds as if you have experimented with reading a line at a time or the whole file at a time via undef $/. A line at a time isn't going to work if you need results from more than one line (as it sounds as if you do). See above for getting just the matched part when reading the whole file. If you need to get multiple matches out of the whole file, use a while loop and the //g flag: while ($Hap =~ m/.../g) { print $& }
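
    Several of the points above can be combined in one loop. This is only a sketch: the second declaration in the sample data is invented, @matches is a made-up name, and a capture group with $1 is used in place of $&.

```perl
use strict;
use warnings;

# Hypothetical file contents with two declarations that mention INT8U.
my $Hap = "SomeText\nvoid what(void *kiss,INT8U\nmiss)\nSomeText\n"
        . "int other(INT8U a,\nINT8U b)\nSomeText\n";

my @matches;
# With /m, ^ matches at the start of any line. The first .* cannot cross
# a newline, so the match begins on the line that contains INT8U, while
# (?s:.*?) may cross newlines up to the nearest closing parenthesis.
# With /g in a while loop, each iteration resumes after the previous match.
while ($Hap =~ m/^(.*INT8U(?s:.*?)\))/mg) {
    push @matches, $1;
}
print "$_\n---\n" for @matches;
```

    The non-greedy quantifier inside (?s:...) is a deliberate choice here: with a greedy one, the first match would run on to the last closing parenthesis in the file.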

    I hope at least some of this helps you along.