Hello kyaloupe, and welcome to the Monastery!
Since you’re reading the text in paragraph mode, I don’t see why you need any regex to identify paragraphs. Also, unless your data (not shown) is special, I don’t see why you need such a complicated regex to identify sentences. In any case, here is how I would tackle this problem:
#! perl
use strict;
use warnings;
local $/ = ''; # Paragraph mode
my $sentence_count = 0;
my $paragraph_count = 0;
my @paragraphs;
while (my $paragraph = <DATA>)
{
    my @sentences;
    while ($paragraph =~ m{\s*(.+?(?:\.|\?|!|$))}g)
    {
        push @sentences, "<s>$1</s>";
        ++$sentence_count;
    }
    push @paragraphs, "<p>\n\t" . join("\n\t", @sentences) . "\n</p>\n";
    ++$paragraph_count;
}
print "\nTotal sentences: $sentence_count\n";
print "Total paragraphs: $paragraph_count\n";
print for @paragraphs;
__DATA__
The quick brown fox jumped over the unfortunate dog. What a shame!

She sells seashells by the sea shore. Peter Piper picked a peck of pickled peppers. Didn't he? Yes, he did.

This sentence has no termination
Output:
17:55 >perl 741_SoPW.pl
Total sentences: 7
Total paragraphs: 3
<p>
	<s>The quick brown fox jumped over the unfortunate dog.</s>
	<s>What a shame!</s>
</p>
<p>
	<s>She sells seashells by the sea shore.</s>
	<s>Peter Piper picked a peck of pickled peppers.</s>
	<s>Didn't he?</s>
	<s>Yes, he did.</s>
</p>
<p>
	<s>This sentence has no termination</s>
</p>
17:55 >
As you can see, I identify sentences as each paragraph is read in, and then wrap what is found in the appropriate tags. See join. (I’ve added tabs just to make the structure of the markup easier to see when it’s printed out.)
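To illustrate the paragraph-mode point on its own, here is a minimal sketch that is separate from the script above. The read_paragraphs helper and the sample text are hypothetical, made up for this illustration; the only thing it relies on is that setting $/ to the empty string makes each read return one blank-line-delimited paragraph, so no regex is needed to find paragraph boundaries.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a string into paragraphs
# by reading it through an in-memory filehandle in paragraph mode.
sub read_paragraphs {
    my ($text) = @_;
    local $/ = '';                     # paragraph mode: records end at blank line(s)
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my @paragraphs;
    while (my $paragraph = <$fh>) {
        $paragraph =~ s/\n+\z//;       # drop the trailing newlines kept on each record
        push @paragraphs, $paragraph;
    }
    close $fh;
    return @paragraphs;
}

my @paras = read_paragraphs("First para,\nline two.\n\n\nSecond para.\n");
print scalar(@paras), " paragraphs\n";    # prints: 2 paragraphs
```

Note that several consecutive blank lines still count as a single paragraph break in this mode.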
Hope that helps,
I tried using your code with a couple of changes, primarily substituting in the regex I had for my sentences. (The data that I'm using has a ton of punctuation, such as ... and "" and so on, so the regex you had in your example wasn't what I needed and was cutting out a lot of the data.) However, when I put my regex in, it went back to the same issue I had before: it prints out the first paragraph with the sentence brackets around the sentences, but then prints out the same paragraph again with the paragraph brackets around the whole paragraph...
Just so you can have an idea of what I changed, I've included the code below, with an excerpt of the text file I'm using.
local $/ = "";
open $fh, $ARGV[0] or die "File $ARGV[0] not found!\n";
$scount = 0;
$pcount=0;
@paragraphs;
while ($paragraph = <$fh>){
    @sentences;
    while($paragraph =~ /\s*(([A-Z][A-Za-z]*)(((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*(\.{3})*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
        push @sentences, "<s>$1</s>";
        $scount++;
    }
    push @paragraphs, "<p>\n\t" . join("\n\t", @sentences) . "\n</p>\n";
    $pcount++;
}
print for @paragraphs;
print "\n Total Lines: $scount\n";
print "\n Total Paragraphs: $pcount\n";
Data:
But the truth is that in the short run, markets can occasionally be pushed,
especially when so many decisions to buy or sell are keyed off what everyone
else in the market is doing. Chain reactions are not much harder to start (in
fact, given how quickly price moves get noticed, they may be easier) than they
were 70 years ago.

All that notwithstanding, the interesting thing about the Greenspan
resignation rumor was that it raised an obvious question: Would it really
matter? As Jacob Weisberg just pointed out in "
Ballot Box," Steve Forbes is apparently the only American who doesn't think
Greenspan has done a terrific job as Fed chairman. And most of us would be
happy to have Greenspan stay in office even after his current term expires in
the middle of next year. But it's interesting to note that in the past couple
of months there have been more than a few voices--including those of economists
Greg Mankiw and Robert Barr--suggesting that Greenspan has been more the
beneficiary of good economic fundamentals than the creator of them.

That position may be a bit overstated, particularly since Greenspan has
shown an unusual ability to let his thinking on inflation, productivity, and
the economy's possible growth rate evolve in response to changing data. But the
essential point, that the soundness of this economy does not depend on
Greenspan's presence at the head of the Fed, is right. That might not be the
case if Greenspan's successor were either an inflation dove like William
Greider or a perma-bear like Jim Grant. But whoever would succeed Greenspan
would be nothing of the sort. He or she would be, in a word, Greenspanian,
still concerned about the possibility of an overheating economy but also
convinced that important technological changes have allowed this economy to
grow faster than in the past without sparking inflation.

If anything, in fact, the bond market should have rallied on news that
Greenspan might be stepping down, since he has long since stopped being
paranoid enough for bondholders, who seem perpetually convinced that the United
States is about to become Brazil. There are certainly Fed governors out there
who would be far more likely to raise interest rates aggressively at the first
hint of price pressures than Greenspan.
I don't know about your regex, but since you don't limit @sentences to the scope of your while loop with my, it's a global variable. It doesn't go out of scope at the end of the loop, so each time through the loop it retains the elements it already had, and then you push the next paragraph's sentences onto it. Like this:
for my $l ('a'..'d'){
    @list;              # no "my", so this is the same global @list every time
    push @list, $l;
    print "@list\n";    # prints "a", then "a b", then "a b c", then "a b c d"
}
You should learn to scope variables to a block with my, and use strict to tell you when you forgot to do that. Failing that, you should empty @sentences at the start of each pass through the loop with @sentences = (); But really, learn strict and my.
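For comparison, here is a minimal sketch of the same kind of loop with the array declared with my inside the loop body, under strict. The group_strings helper and its sample input are hypothetical, invented for this illustration; the point is only that a lexical declared inside the loop starts out empty on every pass.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one string per group, built from a list that is
# declared with "my" inside the loop and therefore starts empty on every pass.
sub group_strings {
    my @groups = @_;                   # each element is an array ref of items
    my @out;
    for my $group (@groups) {
        my @items;                     # fresh, empty array on each iteration
        push @items, $_ for @$group;
        push @out, join(' ', @items);
    }
    return @out;
}

print "$_\n" for group_strings([qw(a b)], [qw(c)], [qw(d e)]);
# prints "a b", then "c", then "d e"
```

With strict enabled, leaving out the my would be a compile-time error rather than a silent global.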
Aaron B.
Available for small or large Perl jobs; see my home node.
aaron_baugher++ has solved the problem. But as to the regex, why reinvent the wheel when CPAN has modules to identify English sentences? For example, Lingua::EN::Sentence contains a get_sentences function which seems to do nicely:
#! perl
use strict;
use warnings;
use Lingua::EN::Sentence 'get_sentences';
local $/ = ''; # Paragraph mode
my $sentence_count = 0;
my $paragraph_count = 0;
my @paragraphs;
while (my $paragraph = <DATA>)
{
    my $sentences = get_sentences($paragraph);
    @$sentences = map { '<s>' . $_ . '</s>' } @$sentences;
    push @paragraphs, "<p>\n\t" . join("\n\t", @$sentences) . "\n</p>\n";
    $sentence_count += scalar @$sentences;
    ++$paragraph_count;
}
print "\nTotal sentences: $sentence_count\n";
print "Total paragraphs: $paragraph_count\n";
print for @paragraphs;
__DATA__
But the truth is that in the short run, markets can occasionally be pushed, especially when so many decisions to buy or sell are keyed off what everyone else in the market is doing. Chain reactions are not much harder to start (in fact, given how quickly price moves get noticed, they may be easier) than they were 70 years ago.

This is a sentence containing ... an ellipsis. "Well, OK then," he said, "but let's not get ahead of ourselves." And so to bed...

And in this sentence, dialogue is delimited with single quotes. 'Well, OK then,' he said, 'but let's not get ahead of ourselves.' And the heuristic still works!
Output:
12:59 >perl 741a_SoPW.pl
Total sentences: 8
Total paragraphs: 3
<p>
	<s>But the truth is that in the short run, markets can occasionally be pushed, especially when so many decisions to buy or sell are keyed off what everyone else in the market is doing.</s>
	<s>Chain reactions are not much harder to start (in fact, given how quickly price moves get noticed, they may be easier) than they were 70 years ago.</s>
</p>
<p>
	<s>This is a sentence containing ... an ellipsis.</s>
	<s>"Well, OK then," he said, "but let's not get ahead of ourselves."</s>
	<s>And so to bed...</s>
</p>
<p>
	<s>And in this sentence, dialogue is delimited with single quotes.</s>
	<s>'Well, OK then,' he said, 'but let's not get ahead of ourselves.'</s>
	<s>And the heuristic still works!</s>
</p>
12:59 >
Hope that helps,