Hello kyaloupe, and welcome to the Monastery!
Since you’re reading the text in paragraph mode, I don’t see why you need any regex to identify paragraphs. Also, unless your data (not shown) is special, I don’t see why you need such a complicated regex to identify sentences. In any case, here is how I would tackle this problem:
    #! perl
    use strict;
    use warnings;

    local $/ = '';    # Paragraph mode

    my $sentence_count  = 0;
    my $paragraph_count = 0;
    my @paragraphs;

    while (my $paragraph = <DATA>)
    {
        my @sentences;

        while ($paragraph =~ m{\s*(.+?(?:\.|\?|!|$))}g)
        {
            push @sentences, "<s>$1</s>";
            ++$sentence_count;
        }

        push @paragraphs, "<p>\n\t" . join("\n\t", @sentences) . "\n</p>\n";
        ++$paragraph_count;
    }

    print "\nTotal sentences: $sentence_count\n";
    print "Total paragraphs: $paragraph_count\n";
    print for @paragraphs;

    __DATA__
    The quick brown fox jumped over the unfortunate dog. What a shame!

    She sells seashells by the sea shore. Peter Piper picked a peck of pickled peppers.
    Didn't he? Yes, he did.

    This sentence has no termination
Output:
    17:55 >perl 741_SoPW.pl

    Total sentences: 7
    Total paragraphs: 3
    <p>
        <s>The quick brown fox jumped over the unfortunate dog.</s>
        <s>What a shame!</s>
    </p>
    <p>
        <s>She sells seashells by the sea shore.</s>
        <s>Peter Piper picked a peck of pickled peppers.</s>
        <s>Didn't he?</s>
        <s>Yes, he did.</s>
    </p>
    <p>
        <s>This sentence has no termination</s>
    </p>

    17:55 >
As you can see, I identify the sentences as each paragraph is read in, wrap each one in <s> tags, and then use join to assemble them inside the <p> tags. (I’ve added tabs just to make the structure of the markup easier to see when it’s printed out.)
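If it helps to see the sentence-matching step in isolation, here is a minimal sketch. The sub name split_sentences is hypothetical (it is not part of the script above), and I have written the alternation \.|\?|! as the character class [.?!], which should match the same characters. Treat it as an illustration rather than a robust sentence splitter: like the original regex, it will split after abbreviations such as "Dr.".

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: return the
# sentences found in one paragraph. A sentence is the shortest run of
# characters ending in '.', '?' or '!', or running to the end of the
# paragraph when no terminator is present.
sub split_sentences
{
    my ($paragraph) = @_;

    # In list context, m{...}g returns every $1 capture in order.
    my @sentences = $paragraph =~ m{\s*(.+?(?:[.?!]|$))}g;

    return @sentences;
}

my @found = split_sentences("Didn't he? Yes, he did. No termination\n");
print "<s>$_</s>\n" for @found;
```

Because . does not match a newline, a sentence that wraps across two lines within a paragraph will not be captured as a single sentence; adding the /s modifier is one option if your data needs that.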
Hope that helps,
Athanasius <°(((>< contra mundum
Iustus alius egestas vitae, eros Piratica,
In reply to Re^3: How to match regex over multiline file by Athanasius, in thread How to match regex over multiline file by kyaloupe