coltman has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, this may be a stupid question, but I would really appreciate it if someone could help me with this.
I have a text file, part of which is displayed below.

    Example RENTAL NOTICE LETTER EXHIBIT 10.2 FIRST RESTATED AND AMENDED AUTO LEASE NOTICE THIS LEASE is made as of the 1st day of March, 2001, by and between HANNAH RENTAL, CO. a Maine Corporation having its principal offices in Portland End

What I want is to extract any phrase that contains the word "NOTICE". In this example, both "RENTAL NOTICE LETTER" and "AUTO LEASE NOTICE" are the information that I need.

There can be two ways of locating the phrase "XXX NOTICE YYY":
(1) (boundary or beginning of line) XXX NOTICE YYY (\n);
(2) (more than two spaces, such as "   ") XXX NOTICE YYY (more than two spaces, such as "   ");

I tried the following:

    if ($a =~ /(?:\b| +?)(.+)*(NOTICE)(.+)*(?:\n| +?)/i) {
        print $1." ".$2." ".$3."\n";
    }

But it does not work quite well. Can you make some suggestions?

Replies are listed 'Best First'.
Re: How to parse this information out?
by Anno (Deacon) on Mar 21, 2007 at 21:12 UTC
    I think this is a case where a single big(gish) regex is not the best way to extract the information. A combination of split and grep with simple regexes seems a better choice.

    First split the text on either two or more blanks, or a line feed. That isolates the candidate phrases mixed with other text fragments. Then select three-word-phrases by keeping only those fragments that contain exactly two blanks. Finally, select for the word "NOTICE". Put together:

    print "$_\n" for grep /\bNOTICE\b/, grep tr/ // == 2, split / {2,}|\n/, $text;
    Anno
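The three steps described in this reply can be run one at a time to see what each stage keeps. The following is a minimal sketch rather than Anno's code verbatim: the variable names and the sample text, including its column spacing, are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: the column spacing is an assumption, since the
# whitespace of the original file is not shown in the thread.
my $text = join "\n",
    "          RENTAL NOTICE LETTER          EXHIBIT 10.2",
    "   FIRST RESTATED AND AMENDED   AUTO LEASE NOTICE",
    "THIS LEASE is made as of the 1st day of March, 2001",
    "";

# Step 1: split on runs of two or more blanks, or on a line feed.
my @fragments = split / {2,}|\n/, $text;

# Step 2: keep fragments with exactly two blanks (that is, three words).
my @three_words = grep { tr/ // == 2 } @fragments;

# Step 3: keep fragments that contain the whole word NOTICE.
my @phrases = grep { /\bNOTICE\b/ } @three_words;

print "$_\n" for @phrases;
```

Because step 2 keeps only fragments with exactly two blanks, a two-word phrase such as "LATE NOTICE" would be dropped; relaxing that filter is one way to adapt the sketch if phrases of other lengths are wanted.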
Re: How to parse this information out?
by kyle (Abbot) on Mar 21, 2007 at 18:38 UTC

    This is what I came up with.

    while ( $text =~ m{
                (?:[ ]{2}|\n|\A)   # Two spaces, newline, or string start
                (                  # start of phrase
                    (?:\S+\s)*     # nonspaces + ONE space, repeated
                    NOTICE         # literal NOTICE
                    (?:\s\S+)*     # one space + nonspaces, repeated
                )                  # end of phrase
                (?:[ ]{2}|\n|\z)   # two spaces, newline, or string end
            }xmsg ) {
        print $1, "\n";
    }
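Two details of the pattern above may matter on multi-line input: `\s` also matches a newline, so a phrase can run on into the next line, and the trailing delimiter is consumed, so a phrase on the very next line cannot reuse that newline as its leading delimiter. The sketch below is one possible variation, not kyle's code: it swaps `\s` for a literal blank and turns the trailing group into a lookahead. The sample text and the `@found` array are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample; the spacing and the extra "LATE NOTICE" line
# are assumptions, chosen to put two phrases on adjacent lines.
my $text = "RENTAL NOTICE LETTER  EXHIBIT 10.2\n"
         . "AUTO LEASE NOTICE\n"
         . "LATE NOTICE\n"
         . "End\n";

my @found;
while ( $text =~ m{
            (?:[ ]{2}|\n|\A)     # two spaces, newline, or string start
            (                    # start of phrase
                (?:[^ \n]+[ ])*  # word + ONE blank, repeated
                NOTICE           # literal NOTICE
                (?:[ ][^ \n]+)*  # ONE blank + word, repeated
            )                    # end of phrase
            (?=[ ]{2}|\n|\z)     # peek at the delimiter without consuming it
        }xmsg ) {
    push @found, $1;
}

print "$_\n" for @found;
```

With the lookahead, the newline that ends "AUTO LEASE NOTICE" is still available to start the match for "LATE NOTICE" on the following line.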
Re: How to parse this information out?
by Sixtease (Friar) on Mar 21, 2007 at 21:18 UTC

    Hello coltman, how about this:

    $\ = "\n";                    # always print a newline at the end
    while (<>) {
        @s = split /\s{3,}/;      # split the input into phrases
        print grep /NOTICE/, @s;  # print the matching phrases
    }

    Regards
    ~ Sixtease

      This prints an extra blank line for every input line that does not match. Taking a cue from Anno's solution, you could fix this by inserting a for. You also get an extra newline where there's a matching phrase on a line by itself. Fix with chomp.

      $\="\n";
      while (<>) {
          chomp;
          @s=split /\s{3,}/;            # split the input into phrases
          print for grep /NOTICE/, @s;  # print the matches
      }
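To see both effects side by side, the two loop bodies can be run over the same lines with the output captured in a string. This is only a sketch: the `run_variant` helper, its flags, and the sample lines (including their spacing) are hypothetical, and an in-memory filehandle stands in for standard output.

```perl
use strict;
use warnings;

# Hypothetical input lines; the spacing is an assumption.
my @lines = (
    "   RENTAL NOTICE LETTER   EXHIBIT 10.2\n",
    "THIS LEASE is made as of the 1st day of March, 2001\n",
    "   FIRST RESTATED AND AMENDED   AUTO LEASE NOTICE\n",
);

# Run one variant of the loop body and return everything it printed.
sub run_variant {
    my ($use_for, $use_chomp) = @_;
    my $buffer = '';
    open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";
    local $\ = "\n";                  # appended to every print call
    for my $line (@lines) {
        my $copy = $line;             # work on a copy, leave @lines intact
        chomp $copy if $use_chomp;
        my @s = split /\s{3,}/, $copy;
        if ($use_for) {
            print {$out} $_ for grep { /NOTICE/ } @s;  # one print per match
        }
        else {
            print {$out} grep { /NOTICE/ } @s;         # one print per line
        }
    }
    close $out;
    return $buffer;
}

my $original = run_variant(0, 0);     # no "for", no chomp
my $fixed    = run_variant(1, 1);     # with "for" and chomp
print $fixed;
```

In the first variant, `print` is called once per input line even when `grep` returns nothing, so `$\` alone is emitted as a blank line; and when the match is the last fragment on its line, it still carries the line's own newline in addition to `$\`.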