kyaloupe has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a program that takes a raw text file and prints each sentence in the file on its own line, using a regex. I've been able to write a regex that matches the sentences. However, the program reads the file line by line, and since each sentence in the text file spans multiple lines, each match gets cut off: the entire sentence isn't on one line in the original file.

Here's my code so far:

open $fh, $ARGV[0] or die "File $ARGV[0] not found!\n";
while ($line = <$fh>){
    while($line =~ /\s*((((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*\.*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
        print "$1\n";
    }
}

I've tried adding \n* to my regex to make it accept the newline characters in the file, but it doesn't make a difference. I've also tried using m// or /s instead of the /g at the end of my regex, but all that does is give me an infinite loop. I've also tried concatenating the lines together, but I'm very new to Perl and it's just not working.

Re: How to match regex over multiline file
by hdb (Monsignor) on Oct 10, 2013 at 14:10 UTC
    use strict;
    use warnings;

    my $text = <<EOT;
    I probably do not understand your requirement. Is it not as
    simple as reading the file line by line, removing all newlines
    and adding a newline after all full stops, question and
    exclamation marks? After that operation each line is one
    sentence.
    EOT

    open my $fn, "<", \$text;
    while(<$fn>){
        s/\n/ /;              # replace the line break with a space so wrapped words stay separated
        s/[.!?]\K\s*/\n/g;    # start a new line after each full stop, question or exclamation mark
        print;
    }
    close $fn;
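    If the text lives in a file rather than a here-document, the same idea can be applied to the whole file at once. What follows is only a minimal sketch, assuming the file fits in memory. The helper name split_sentences is made up for illustration, and the rule "a sentence ends at . ! or ? followed by whitespace" is a simplification that will trip over abbreviations such as "Dr." and over closing quotes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a block of text into
# sentences, treating . ! or ? followed by whitespace as a sentence end.
sub split_sentences {
    my ($text) = @_;
    $text =~ s/\s+/ /g;         # collapse newlines and runs of whitespace into one space
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return grep { length } split /(?<=[.!?])\s+/, $text;
}

# Usage: slurp the file named on the command line, if one was given.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    my $text = do { local $/; <$fh> };    # undef $/ reads the whole file in one go
    close $fh;
    print "$_\n" for split_sentences($text);
}
```

    Because the whole text is one string by the time the regex sees it, it no longer matters where the line breaks fell in the original file.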
Re: How to match regex over multiline file
by Anonymous Monk on Oct 09, 2013 at 23:56 UTC

      Alright, I was able to fix my regex and it's working exactly as I want it to! Thank you! I was wondering if I could ask another question, though.

      So now that I have my regex matching over multiple lines, I wanted to take the raw textfile and have the output be the entire paragraph bracketed in paragraph tags and the individual sentences inside with sentence tags. I was able to write the code to do both separately, with the necessary regex, but I need to write it so they're nested within each other.

      Here's the code I have so far:

      local $/ = "";
      open $fh, $ARGV[0] or die "File $ARGV[0] not found!\n";
      $scount = 0;
      $pcount = 0;
      while ($line = <$fh>){
          #brackets sentences
          while($line =~ /\s*(([A-Z][A-Za-z]*)(((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*(\.{3})*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
              print "<s>$1</s>\n";
              $scount++;
          }
          #brackets paragraphs
          if ($line =~ /\s*((((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*\.*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
              print "<p>\n$1\n</p>\n";
              $pcount++;
          }
      }
      print "\n Total Lines: $scount\n";
      print "\n Total Paragraphs: $pcount\n";

      When I run both sections at the same time, first it will print out each paragraph section with the sentence tags around each sentence, then it prints the same paragraph but with the paragraph tags. How do I fix it?

        Hello kyaloupe, and welcome to the Monastery!

        Since you’re reading the text in paragraph mode, I don’t see why you need any regex to identify paragraphs? Also, unless your data (not shown) is special, I don’t see why you need such a complicated regex to identify sentences? In any case, here is how I would tackle this problem:

        #! perl
        use strict;
        use warnings;

        local $/ = '';    # Paragraph mode

        my $sentence_count  = 0;
        my $paragraph_count = 0;
        my @paragraphs;

        while (my $paragraph = <DATA>)
        {
            my @sentences;

            while ($paragraph =~ m{\s*(.+?(?:\.|\?|!|$))}g)
            {
                push @sentences, "<s>$1</s>";
                ++$sentence_count;
            }

            push @paragraphs, "<p>\n\t" . join("\n\t", @sentences) . "\n</p>\n";
            ++$paragraph_count;
        }

        print "\nTotal sentences: $sentence_count\n";
        print "Total paragraphs: $paragraph_count\n";
        print for @paragraphs;

        __DATA__
        The quick brown fox jumped over the unfortunate dog. What a shame!

        She sells seashells by the sea shore. Peter Piper picked a peck of pickled peppers. Didn't he? Yes, he did.

        This sentence has no termination

        Output:

        17:55 >perl 741_SoPW.pl

        Total sentences: 7
        Total paragraphs: 3
        <p>
            <s>The quick brown fox jumped over the unfortunate dog.</s>
            <s>What a shame!</s>
        </p>
        <p>
            <s>She sells seashells by the sea shore.</s>
            <s>Peter Piper picked a peck of pickled peppers.</s>
            <s>Didn't he?</s>
            <s>Yes, he did.</s>
        </p>
        <p>
            <s>This sentence has no termination</s>
        </p>

        17:55 >

        As you can see, I identify sentences as each paragraph is read in, and then wrap what is found in the appropriate tags. See join. (I’ve added tabs just to make the structure of the markup easier to see when it’s printed out.)
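        One caveat, as far as I can tell: without the /s modifier the dot in the pattern above will not cross a newline, so a sentence that wraps onto a second line inside a paragraph would be only partly captured. A rough sketch of one way around that is to collapse the whitespace in each paragraph before matching. The sub name tag_paragraph and the sample text are made up for illustration, and the sentence rule is still the simple "ends at . ? or !" one.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each sentence of one paragraph in <s> tags and
# the paragraph itself in <p> tags, after joining wrapped lines.
sub tag_paragraph {
    my ($paragraph) = @_;
    $paragraph =~ s/\s+/ /g;         # join wrapped lines into a single line
    $paragraph =~ s/^\s+|\s+$//g;    # trim, so $ sits right after the last word
    my @sentences;
    while ($paragraph =~ m{\s*(.+?(?:[.?!]|$))}g) {
        push @sentences, "<s>$1</s>";
    }
    return "<p>\n\t" . join("\n\t", @sentences) . "\n</p>\n";
}

# Usage with an in-memory filehandle standing in for the real file.
my $input = "Peter Piper picked\na peck. Didn't he?\n\nNo terminator here\n";
open my $fh, '<', \$input or die "Cannot open in-memory handle: $!\n";
{
    local $/ = '';    # paragraph mode: one record per blank-line-separated block
    while (my $para = <$fh>) {
        print tag_paragraph($para);
    }
}
close $fh;
```

        Returning the tagged paragraph as a string, instead of printing inside the loop, keeps the nesting in one place: the sentences are always joined inside their own paragraph tags.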

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Alright, using paragraph mode definitely did something: it's now giving me entire paragraphs from the text file as the output, which works for one part of my code but not quite all of it. I'm going to try editing my regex to match just a single sentence (maybe it's just wayyyy too broad, which is why it's giving me the whole paragraph).

      Thank you!