How to parse this information out?

coltman has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. This may be a stupid question, but I would really appreciate it if someone could help me with this.

I have a text file, part of which is shown below.

Example

RENTAL NOTICE LETTER


 EXHIBIT 10.2    FIRST RESTATED AND AMENDED    AUTO LEASE NOTICE    THIS LEASE is made as of the 1st day of March, 2001, by and between HANNAH RENTAL, CO. a Maine Corporation having its principal offices in Portland


End


What I want is to extract any phrase that contains the word "NOTICE". In this example, both "RENTAL NOTICE LETTER" and "AUTO LEASE NOTICE" are the information that I need.

There are two ways in which the phrase "XXX NOTICE YYY" can be delimited:
(1) (word boundary or beginning of line) XXX NOTICE YYY (\n);
(2) (more than two spaces, such as "   ") XXX NOTICE YYY (more than two spaces, such as "   ");

I tried with the following:

if ($a =~ /(?:\b|  +?)(.+)*(NOTICE)(.+)*(?:\n|  +?)/i)
    {
     print $1." ".$2." ".$3."\n";
    }

But it does not work very well. Can you make some suggestions?


Replies are listed 'Best First'.
Re: How to parse this information out? by Anno (Deacon) on Mar 21, 2007 at 21:12 UTC
I think this is a case where a single big(gish) regex is not the best way to extract the information. A combination of split and grep with simple regexes seems a better choice. First split the text on either two or more blanks or a line feed. That isolates the candidate phrases, mixed with other text fragments. Then select three-word phrases by keeping only those fragments that contain exactly two blanks. Finally, select for the word "NOTICE". Put together: `print "$_\n" for grep /\bNOTICE\b/, grep tr/ // == 2, split / {2,}|\n/, $text;` Anno
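The pipeline described above can be unrolled one stage at a time. This is only a minimal sketch: the `$text` contents are the sample from the question (shortened, and assuming the "+" markers in the post were display wrapping only), and the intermediate array names are hypothetical.

```perl
use strict;
use warnings;

# Sample text from the question, rebuilt line by line (an assumption about
# how the original file was laid out).
my $text = join "\n",
    'RENTAL NOTICE LETTER',
    '',
    ' EXHIBIT 10.2    FIRST RESTATED AND AMENDED    AUTO LEASE NOTICE    THIS LEASE is made as of the 1st day of March, 2001',
    '';

# Stage 1: cut the text into fragments at runs of 2+ blanks or at a line feed.
my @fragments = split / {2,}|\n/, $text;

# Stage 2: keep fragments with exactly two blanks (three-word candidates).
# tr/ // with an empty replacement only counts; it does not modify $_.
my @three_words = grep { tr/ // == 2 } @fragments;

# Stage 3: keep the candidates that contain the word NOTICE.
my @phrases = grep { /\bNOTICE\b/ } @three_words;

print "$_\n" for @phrases;
```

On this sample the fragment " EXHIBIT 10.2" also survives stage 2, because its leading blank counts toward the two, and it is only dropped by the NOTICE filter in stage 3.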
Re: How to parse this information out? by kyle (Abbot) on Mar 21, 2007 at 18:38 UTC
This is what I came up with.

    while ( $text =~ m{
            (?:[ ]{2}|\n|\A)    # two spaces, newline, or string start
            (                   # start of phrase
                (?:\S+\s)*      # nonspaces + ONE space, repeated
                NOTICE          # literal NOTICE
                (?:\s\S+)*      # one space + nonspaces, repeated
            )                   # end of phrase
            (?:[ ]{2}|\n|\z)    # two spaces, newline, or string end
        }xmsg ) {
        print $1, "\n";
    }
Re: How to parse this information out? by Sixtease (Friar) on Mar 21, 2007 at 21:18 UTC
Hello coltman, how about this:

    $\ = "\n";                    # always print a newline at the end
    while (<>) {
        @s = split /\s{3,}/;      # split the input into phrases
        print grep /NOTICE/, @s;  # print the matching phrases
    }

Regards ~ Sixtease
Re^2: How to parse this information out? by kyle (Abbot) on Mar 21, 2007 at 21:28 UTC
This prints an extra blank line for every input line that does not match. Taking a cue from Anno's solution, you could fix this by inserting a `for`. You also get an extra newline where there's a matching phrase on a line by itself. Fix with chomp.

    $\ = "\n";
    while (<>) {
        chomp;
        @s = split /\s{3,}/;          # split the input into phrases
        print for grep /NOTICE/, @s;  # print the matches
    }
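To try the same chomp, split and grep logic without an input file, it could be wrapped in a small function. This is a rough sketch rather than anyone's posted solution: the helper name `notice_phrases` and the `@lines` array are hypothetical, and it keeps the assumption that phrases are separated by three or more whitespace characters.

```perl
use strict;
use warnings;

# Hypothetical helper: return every phrase on one line that mentions NOTICE.
sub notice_phrases {
    my ($line) = @_;
    chomp $line;    # drop the trailing newline so it cannot end up in a phrase
    return grep { /NOTICE/ } split /\s{3,}/, $line;
}

# Sample lines from the question (shortened), each with its newline,
# as they would arrive from a while (<>) loop.
my @lines = (
    "RENTAL NOTICE LETTER\n",
    "\n",
    " EXHIBIT 10.2    FIRST RESTATED AND AMENDED    AUTO LEASE NOTICE    THIS LEASE is made as of the 1st day of March, 2001\n",
);

# Lines without a match contribute an empty list, so nothing is printed for them.
my @found = map { notice_phrases($_) } @lines;
print "$_\n" for @found;
```

Printing an explicit "\n" per phrase instead of setting `$\` keeps the output formatting local to the print statement.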