How to match regex over multiline file

kyaloupe has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a program that takes a raw textfile and prints out each sentence in the file on its own line using a regex. I've been able to write the regex to match the sentences. However, the program reads the file line by line, and since each sentence in the textfile spans multiple lines, the output gives me the regex match but cut off, because the entire sentence isn't all on one line in the original file.

Here's my code so far:

open $fh, $ARGV[0] or die "File $ARGV[0] not found!\n";
while ($line = <$fh>){
    while($line =~ /\s*((((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*\.*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
        print "$1\n";
    }
}

I've tried adding \n* to my regex to make it accept the newline characters in the file, but it doesn't make a difference. I've also tried using m// or /s instead of the /g at the end of my regex, but all that does is give me an infinite loop. I've also tried concatenating the lines together, but I'm very new to Perl and it's just not working.
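For readers following along, here is a minimal sketch of why dropping /g from a while condition loops forever. The sample string and variable names are assumptions, not taken from the program above. With /g in scalar context, each test resumes where the previous match ended; without it, every test starts over at the beginning of the string and succeeds again.

```perl
use strict;
use warnings;

# Sample text standing in for the real input (an assumption for this sketch).
my $text = "One. Two! Three?";

# With /g in scalar context, each test resumes at pos($text),
# so the loop ends once the matches are exhausted.
my @found;
while ($text =~ /(\w+[.!?])/g) {
    push @found, $1;
}

# Without /g every test restarts at the beginning and succeeds again,
# so the loop would never end; a guard stops it here for demonstration.
my $iterations = 0;
while ($text =~ /(\w+[.!?])/) {
    last if ++$iterations >= 5;
}

print "with /g: @found\n";
print "without /g: stopped by the guard after $iterations iterations\n";
```

So /g should stay, and /s (or another way of handling newlines) is something to add alongside it rather than a replacement for it.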



Re: How to match regex over multiline file
by hdb (Monsignor) on Oct 10, 2013 at 14:10 UTC

use strict;
use warnings;
my $text = <<EOT;
I probably do not understand your 
requirement.  Is it not as simple 
as reading the file line by line, 
removing all newlines and adding a 
newline after all full stops, question 
and exclamation marks?  After that 
operation each line is one sentence.
EOT
open my $fn, "<", \$text;
while(<$fn>){
        chomp;
        s/[.!?]\K\s*/\n/g;
        print;
}
close $fn;
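
Note that chomp removes the newline without putting anything in its place, so the snippet above works when each wrapped line ends in a space, as in the sample text. A hedged variant for input without trailing blanks might collapse all whitespace first. The helper name one_sentence_per_line is hypothetical, and the sketch assumes the whole text fits in memory and that ., ! and ? only ever end sentences:

```perl
use strict;
use warnings;

# Hypothetical helper: reflow text so that each sentence sits on its own line.
sub one_sentence_per_line {
    my ($text) = @_;
    $text =~ s/\s+/ /g;            # collapse newlines and runs of blanks into one space
    $text =~ s/^ //;               # drop a leading blank, if any
    $text =~ s/[.!?]\K\s*/\n/g;    # break after each sentence terminator (\K needs Perl 5.10+)
    $text =~ s/\s*\z/\n/;          # end with exactly one newline
    return $text;
}

print one_sentence_per_line("First sentence spans\ntwo lines.  Second\none? Third!\n");
```

Abbreviations such as "e.g." would still be split, so treat this as a starting point only.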


Re: How to match regex over multiline file
by Anonymous Monk on Oct 09, 2013 at 23:56 UTC

See Re: Count Quoted Words, Matching in huge files, and the sliding window technique, and try combining them with "paragraph mode":

 local $/ = ""; ## paragraph mode
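
A minimal sketch of what paragraph mode does, reading from an in-memory string instead of a real file. The sample text and the helper name read_paragraphs are assumptions made for illustration. With $/ set to the empty string, each read returns one whole paragraph, so a sentence that wraps across lines arrives in a single record:

```perl
use strict;
use warnings;

# Sample text standing in for the real file (an assumption for this sketch).
my $text = "First sentence spans\ntwo lines. Second one.\n\n\nNext paragraph\nhere.\n";

# Hypothetical helper: return the paragraphs of a string as a list.
sub read_paragraphs {
    my ($source) = @_;
    open my $fh, '<', \$source or die "Cannot open in-memory file: $!";
    local $/ = "";    # paragraph mode: a record ends at one or more blank lines
    my @paragraphs = <$fh>;
    close $fh;
    return @paragraphs;
}

my @paragraphs = read_paragraphs($text);
for my $i (0 .. $#paragraphs) {
    print "--- paragraph ", $i + 1, " ---\n", $paragraphs[$i];
}
```

The newlines inside each paragraph are still there, so the sentence regex has to allow for them, for example with /s or by turning them into spaces first.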

If you can make any kind of progress with these nodes, I'll help you fill in the blanks


Re^2: How to match regex over multiline file

by kyaloupe (Initiate) on Oct 10, 2013 at 02:28 UTC

Alright, I was able to fix my regex and it's working exactly as I want it to! Thank you! I was wondering if I could ask another question, though.

So now that I have my regex matching over multiple lines, I wanted to take the raw textfile and have the output be the entire paragraph bracketed in paragraph tags and the individual sentences inside with sentence tags. I was able to write the code to do both separately, with the necessary regex, but I need to write it so they're nested within each other.

Here's the code I have so far:

local $/ = "";
open $fh, $ARGV[0] or die "File $ARGV[0] not found!\n";
$scount = 0;
$pcount=0;

while ($line = <$fh>){
#brackets sentences
    while($line =~ /\s*(([A-Z][A-Za-z]*)(((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*(\.{3})*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
        print "<s>$1</s>\n";
        $scount++;
    }

#brackets paragraphs
    if ($line =~ /\s*((((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*\.*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
        print "<p>\n$1\n</p>\n";
        $pcount++;
    }
}
    


print "\n Total Lines: $scount\n";
print "\n Total Paragraphs: $pcount\n";

When I run both sections at the same time, first it will print out each paragraph section with the sentence tags around each sentence, then it prints the same paragraph but with the paragraph tags. How do I fix it?


Re^3: How to match regex over multiline file

by Athanasius (Archbishop) on Oct 10, 2013 at 08:01 UTC

Hello kyaloupe, and welcome to the Monastery!

Since you’re reading the text in paragraph mode, I don’t see why you need any regex to identify paragraphs? Also, unless your data (not shown) is special, I don’t see why you need such a complicated regex to identify sentences? In any case, here is how I would tackle this problem:

#! perl
use strict;
use warnings;

local $/ = '';    # Paragraph mode

my $sentence_count  = 0;
my $paragraph_count = 0;
my @paragraphs;

while (my $paragraph = <DATA>)
{
    my @sentences;

    while ($paragraph =~ m{\s*(.+?(?:\.|\?|!|$))}g)
    {
        push @sentences, "<s>$1</s>";
        ++$sentence_count;
    }

    push @paragraphs, "<p>\n\t" . join("\n\t", @sentences) . "\n</p>\n";
    ++$paragraph_count;
}

print "\nTotal sentences:  $sentence_count\n";
print   "Total paragraphs: $paragraph_count\n";

print for @paragraphs;

__DATA__
The quick brown fox jumped over the unfortunate dog. What a shame!

She sells seashells by the sea shore. Peter Piper picked a peck of pickled peppers. Didn't he? Yes, he did.

This sentence has no termination

Output:

17:55 >perl 741_SoPW.pl

Total sentences:  7
Total paragraphs: 3
<p>
        <s>The quick brown fox jumped over the unfortunate dog.</s>
        <s>What a shame!</s>
</p>
<p>
        <s>She sells seashells by the sea shore.</s>
        <s>Peter Piper picked a peck of pickled peppers.</s>
        <s>Didn't he?</s>
        <s>Yes, he did.</s>
</p>
<p>
        <s>This sentence has no termination</s>
</p>

17:55 >

As you can see, I identify sentences as each paragraph is read in, and then wrap what is found in the appropriate tags. See join. (I’ve added tabs just to make the structure of the markup easier to see when it’s printed out.)
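
One caveat: in the regex above, . does not match a newline because /s is not given, so a sentence that wraps across lines inside a paragraph would lose its earlier lines. A hedged sketch of one way around this is to flatten the whitespace of each paragraph before splitting it. The helper name split_sentences is hypothetical, and the sketch keeps the same simple notion of a sentence (text up to the next ., ? or !, or to the end of the paragraph):

```perl
use strict;
use warnings;

# Hypothetical helper: split one paragraph into sentences,
# even when a sentence wraps across several lines.
sub split_sentences {
    my ($paragraph) = @_;
    (my $flat = $paragraph) =~ s/\s+/ /g;    # newlines and runs of blanks become one space
    $flat =~ s/^\s+|\s+$//g;                 # trim both ends
    my @sentences;
    while ($flat =~ m{\s*(.+?(?:[.?!]|$))}g) {
        push @sentences, $1;
    }
    return @sentences;
}

my @sentences = split_sentences("She sells\nseashells by the\nsea shore. Didn't\nhe? No end\n\n");
print "<s>$_</s>\n" for @sentences;
```

Inside the reading loop, the inner while could then be replaced by a call such as split_sentences($paragraph), with each returned sentence wrapped in its tags as before.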

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^4: How to match regex over multiline file

by kyaloupe (Initiate) on Oct 11, 2013 at 23:27 UTC

Re^5: How to match regex over multiline file

by aaron_baugher (Curate) on Oct 12, 2013 at 12:49 UTC

Re^5: How to match regex over multiline file

by Athanasius (Archbishop) on Oct 13, 2013 at 03:01 UTC

Re^2: How to match regex over multiline file

by kyaloupe (Initiate) on Oct 10, 2013 at 01:57 UTC

Alright, using the paragraph mode definitely did something, it's now giving me entire paragraphs from the textfile as the output, which works for one part of my code, but not quite all of it. I'm going to try editing my regex (maybe it's just wayyyy too broad, which is why it's giving me the whole paragraph) to just match a single sentence.

Thank you!
