in reply to Re: How to match regex over multiline file
in thread How to match regex over multiline file

Alright, I was able to fix my regex and it's working exactly as I want it to! Thank you! I was wondering if I could ask another question, though.

So now that I have my regex matching over multiple lines, I want to take the raw text file and output each paragraph wrapped in paragraph tags, with the individual sentences inside wrapped in sentence tags. I was able to write the code to do each of these separately, with the necessary regex, but I need the sentence tags nested inside the paragraph tags.

Here's the code I have so far:

local $/ = "";
open $fh, $ARGV[0] or die "File $ARGV[0] not found!\n";
$scount = 0;
$pcount=0;
while ($line = <$fh>){
    #brackets sentences
    while($line =~ /\s*(([A-Z][A-Za-z]*)(((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*(\.{3})*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
        print "<s>$1</s>\n";
        $scount++;
    }
    #brackets paragraphs
    if ($line =~ /\s*((((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*\.*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
        print "<p>\n$1\n</p>\n";
        $pcount++;
    }
}
print "\n Total Lines: $scount\n";
print "\n Total Paragraphs: $pcount\n";

When I run both sections at the same time, first it will print out each paragraph section with the sentence tags around each sentence, then it prints the same paragraph but with the paragraph tags. How do I fix it?

Re^3: How to match regex over multiline file
by Athanasius (Archbishop) on Oct 10, 2013 at 08:01 UTC

    Hello kyaloupe, and welcome to the Monastery!

    Since you’re reading the text in paragraph mode, I don’t see why you need any regex to identify paragraphs? Also, unless your data (not shown) is special, I don’t see why you need such a complicated regex to identify sentences? In any case, here is how I would tackle this problem:

    #! perl
    use strict;
    use warnings;

    local $/ = '';                      # Paragraph mode

    my $sentence_count  = 0;
    my $paragraph_count = 0;
    my @paragraphs;

    while (my $paragraph = <DATA>)
    {
        my @sentences;

        while ($paragraph =~ m{\s*(.+?(?:\.|\?|!|$))}g)
        {
            push @sentences, "<s>$1</s>";
            ++$sentence_count;
        }

        push @paragraphs, "<p>\n\t" . join("\n\t", @sentences) . "\n</p>\n";
        ++$paragraph_count;
    }

    print "\nTotal sentences: $sentence_count\n";
    print "Total paragraphs: $paragraph_count\n";
    print for @paragraphs;

    __DATA__
    The quick brown fox jumped over the unfortunate dog. What a shame!

    She sells seashells by the sea shore. Peter Piper picked a peck of pickled peppers. Didn't he? Yes, he did.

    This sentence has no termination

    Output:

    17:55 >perl 741_SoPW.pl

    Total sentences: 7
    Total paragraphs: 3
    <p>
        <s>The quick brown fox jumped over the unfortunate dog.</s>
        <s>What a shame!</s>
    </p>
    <p>
        <s>She sells seashells by the sea shore.</s>
        <s>Peter Piper picked a peck of pickled peppers.</s>
        <s>Didn't he?</s>
        <s>Yes, he did.</s>
    </p>
    <p>
        <s>This sentence has no termination</s>
    </p>

    17:55 >

    As you can see, I identify sentences as each paragraph is read in, and then wrap what is found in the appropriate tags. See join. (I’ve added tabs just to make the structure of the markup easier to see when it’s printed out.)
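On the point about paragraph mode: the sketch below is only an illustration, not code from this thread. It assumes the text is already in a string and uses a made-up helper name, read_paragraphs, to show that with $/ set to the empty string each read returns one whole paragraph, however many blank lines separate them.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): returns the paragraphs
# of $text, read through an in-memory filehandle in paragraph mode.
sub read_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = '';          # paragraph mode: runs of blank lines end a record
    my @paragraphs = <$fh>; # one element per paragraph
    close $fh;
    chomp @paragraphs;      # in paragraph mode, chomp strips all trailing newlines
    return @paragraphs;
}

my @p = read_paragraphs("First line.\nSecond line.\n\n\n\nNext paragraph.\n");
print scalar(@p), " paragraphs\n";   # prints "2 paragraphs"
```

Since the read loop already delivers one paragraph at a time, a second regex for finding paragraph boundaries should not be needed.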

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      I tried using your code with a couple of changes, primarily substituting in the regex I had for my sentences. (The data I'm using has a ton of punctuation, such as ... and "" and so on, so the regex in your example wasn't what I needed and was cutting out a lot of the data.) However, when I put my regex in, it went back to the same issue I had before: it would print out the first paragraph with the sentence brackets around the sentences, but then print out the same paragraph again with the paragraph brackets around the whole paragraph...

      Just so you can have an idea of what I changed, I've included the code below, with an excerpt of the text file I'm using.

      local $/ = "";
      open $fh, $ARGV[0] or die "File $ARGV[0] not found!\n";
      $scount = 0;
      $pcount=0;
      @paragraphs;
      while ($paragraph = <$fh>){
          @sentences;
          while($paragraph =~ /\s*(([A-Z][A-Za-z]*)(((([A-Za-z]|[0-9])*((\'*|\-*)[A-Za-z]*))\s*(\.{3})*\!*\"*\(*\)*\,*\:*\s*)*(([A-Za-z]|[0-9])*))(\.|\?|\!))/g){
              push @sentences, "<s>$1</s>";
              $scount++;
          }
          push @paragraphs, "<p>\n\t" . join("\n\t", @sentences) . "\n</p>\n";
          $pcount++;
      }
      print for @paragraphs;
      print "\n Total Lines: $scount\n";
      print "\n Total Paragraphs: $pcount\n";

      Data:

      But the truth is that in the short run, markets can occasionally be pushed, especially when so many decisions to buy or sell are keyed off what everyone else in the market is doing. Chain reactions are not much harder to start (in fact, given how quickly price moves get noticed, they may be easier) than they were 70 years ago.

      All that notwithstanding, the interesting thing about the Greenspan resignation rumor was that it raised an obvious question: Would it really matter? As Jacob Weisberg just pointed out in " Ballot Box," Steve Forbes is apparently the only American who doesn't think Greenspan has done a terrific job as Fed chairman. And most of us would be happy to have Greenspan stay in office even after his current term expires in the middle of next year. But it's interesting to note that in the past couple of months there have been more than a few voices--including those of economists Greg Mankiw and Robert Barr--suggesting that Greenspan has been more the beneficiary of good economic fundamentals than the creator of them.

      That position may be a bit overstated, particularly since Greenspan has shown an unusual ability to let his thinking on inflation, productivity, and the economy's possible growth rate evolve in response to changing data. But the essential point, that the soundness of this economy does not depend on Greenspan's presence at the head of the Fed, is right. That might not be the case if Greenspan's successor were either an inflation dove like William Greider or a perma-bear like Jim Grant. But whoever would succeed Greenspan would be nothing of the sort. He or she would be, in a word, Greenspanian, still concerned about the possibility of an overheating economy but also convinced that important technological changes have allowed this economy to grow faster than in the past without sparking inflation.

      If anything, in fact, the bond market should have rallied on news that Greenspan might be stepping down, since he has long since stopped being paranoid enough for bondholders, who seem perpetually convinced that the United States is about to become Brazil. There are certainly Fed governors out there who would be far more likely to raise interest rates aggressively at the first hint of price pressures than Greenspan.

        I don't know about your regex, but since you don't limit @sentences to the scope of your while loop with my, it's a global variable. It doesn't go out of scope at the end of each iteration, so each time through the loop it retains the elements it already had, and then you push the next paragraph's sentences onto it. Like this:

        for my $l ('a'..'d'){
            @list;            # no "my", so this is the same global array every time
            push @list, $l;
            print @list;      # a, then a b, then a b c, then a b c d
        }
        # prints qw( a a b a b c a b c d );

        You should learn to scope variables to a block by declaring them with my, and use strict to tell you when you forgot to do that. Failing that, you should empty @sentences at the start of each iteration with @sentences=(); . But really, learn strict and my.
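To make that concrete, here is a minimal sketch of the loop with @sentences declared by my inside it. It is not anyone's actual code from this thread: tag_paragraphs is a made-up name, the text is split from a string rather than read from a file, and the sentence pattern is the simple one from Athanasius's example (which assumes a sentence does not span a line break), so substitute your own regex as needed.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): wraps each paragraph of
# $text in <p> tags and each sentence inside it in <s> tags.
sub tag_paragraphs {
    my ($text) = @_;
    my @paragraphs;

    # Split on blank lines so the sketch works on a plain string.
    for my $paragraph (grep { /\S/ } split /\n\s*\n/, $text) {
        my @sentences;    # lexical: a fresh, empty array for every paragraph

        while ($paragraph =~ m{\s*(.+?(?:\.|\?|!|$))}g) {
            push @sentences, "<s>$1</s>";
        }

        push @paragraphs, "<p>\n\t" . join("\n\t", @sentences) . "\n</p>\n";
    }

    return join '', @paragraphs;
}

print tag_paragraphs("One. Two!\n\nThree?\n");
```

Because @sentences is created anew on each pass, the second paragraph's tags contain only its own sentence. With use strict in place, leaving out the my becomes a compile-time error instead of silently accumulating sentences.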

        Aaron B.
        Available for small or large Perl jobs; see my home node.

        aaron_baugher++ has solved the problem. But as to the regex, why reinvent the wheel when CPAN has modules to identify English sentences? For example, Lingua::EN::Sentence contains a get_sentences function which seems to do nicely:

        #! perl
        use strict;
        use warnings;
        use Lingua::EN::Sentence 'get_sentences';

        local $/ = '';                      # Paragraph mode

        my $sentence_count  = 0;
        my $paragraph_count = 0;
        my @paragraphs;

        while (my $paragraph = <DATA>)
        {
            my $sentences = get_sentences($paragraph);
            @$sentences   = map { '<s>' . $_ . '</s>' } @$sentences;
            push @paragraphs, "<p>\n\t" . join("\n\t", @$sentences) . "\n</p>\n";
            $sentence_count += scalar @$sentences;
            ++$paragraph_count;
        }

        print "\nTotal sentences: $sentence_count\n";
        print "Total paragraphs: $paragraph_count\n";
        print for @paragraphs;

        __DATA__
        But the truth is that in the short run, markets can occasionally be pushed, especially when so many decisions to buy or sell are keyed off what everyone else in the market is doing. Chain reactions are not much harder to start (in fact, given how quickly price moves get noticed, they may be easier) than they were 70 years ago.

        This is a sentence containing ... an ellipsis. "Well, OK then," he said, "but let's not get ahead of ourselves." And so to bed...

        And in this sentence, dialogue is delimited with single quotes. 'Well, OK then,' he said, 'but let's not get ahead of ourselves.' And the heuristic still works!

        Output:

        12:59 >perl 741a_SoPW.pl

        Total sentences: 8
        Total paragraphs: 3
        <p>
            <s>But the truth is that in the short run, markets can occasionally be pushed, especially when so many decisions to buy or sell are keyed off what everyone else in the market is doing.</s>
            <s>Chain reactions are not much harder to start (in fact, given how quickly price moves get noticed, they may be easier) than they were 70 years ago.</s>
        </p>
        <p>
            <s>This is a sentence containing ... an ellipsis.</s>
            <s>"Well, OK then," he said, "but let's not get ahead of ourselves."</s>
            <s>And so to bed...</s>
        </p>
        <p>
            <s>And in this sentence, dialogue is delimited with single quotes.</s>
            <s>'Well, OK then,' he said, 'but let's not get ahead of ourselves.'</s>
            <s>And the heuristic still works!</s>
        </p>

        12:59 >

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,