in reply to Parse a large string

This is a brain-twisting regex problem.
Here's my go at it:
#!/usr/bin/perl -w
use strict;

$/ = undef;
my $data = <DATA>;
my @text = ($data =~ /(Nullam.+(?:augue|libero)\.)/g);
print @text;

#__prints:
#Nullam quis augue.

Update: Incorporating GrandFather's idea, cleaning up each match to account for Nullam and augue or libero being on different lines (replacing the \n's, if any, with spaces), and printing each sentence on its own line; see below. The [^.]+ works well here because a negated character class also matches newlines, so we don't have to worry about using, say, the /s regex modifier to allow "." to match newlines too (by default "." matches anything except a newline).

$/ = undef;
my $data = <DATA>;
my @text = ($data =~ /(Nullam\b[^.]+(?:augue|libero)\.)/g);
@text = map { tr/\n/ /; $_ } @text;
print join("\n", @text), "\n";
#print join(" ", @text);   # alternative to put a space after the period.
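
To see why the negated character class copes with line breaks while "." does not, here is a minimal, self-contained sketch. The sample text is made up for illustration (it is not the original poster's data), and the variable names are my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: the sentence we want wraps across two lines.
my $data = "Lorem ipsum dolor. Nullam quis\naugue. Aliquam lacinia tempus libero.";

# Without /s, "." cannot cross the newline, so this finds nothing.
my @dot_matches = $data =~ /(Nullam\b.+(?:augue|libero)\.)/g;

# [^.] matches any character except a literal period, newlines included.
my @class_matches = $data =~ /(Nullam\b[^.]+(?:augue|libero)\.)/g;

# Replace embedded newlines with spaces, in place.
tr/\n/ / for @class_matches;

print "dot matches:   ", scalar(@dot_matches), "\n";   # dot matches:   0
print "class matches: @class_matches\n";                # class matches: Nullam quis augue.
```

The same cleanup could be written with map as in the code above; the "for" form simply avoids building a second list.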

Replies are listed 'Best First'.
Re^2: Parse a large string
by GrandFather (Saint) on Mar 10, 2009 at 20:40 UTC

    Consider what happens with the following string (disregarding any speeling misadventures errors and grammatical):

    Nullamie a orci. Nullam quis augue. Aliquam lacinia tempus Praugue.

    It is considered good practice to avoid using .* and .+ because they tend to be greedier than you intend. Very often you are better off using a negated character class: [^.]+ would help a lot in this case. The word break anchor \b will also help get the intended behavior.


    True laziness is hard work
      Quite correct! As written, the regex would match Nullamie as well as Nullam, and the greediness would eat the first augue!

      Another way to calm greediness is the ? modifier: .+ is a maximal match and .+? is a minimal match, as in: $data =~ /(Nullam\b.+?(?:augue|libero)\.)/g. That's sometimes a good way to go, and it would work if we didn't have the "." to help us out here. I like your [^.]+ idea, though; it looks great to me! There is more than one way to skin these regex cats!
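
As a rough side-by-side of the variants discussed in this thread, the sketch below runs each one against GrandFather's test string. The variable names are mine, and the fourth pattern (minimal match without \b) is an extra I added only to show what the anchor contributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# GrandFather's test string from earlier in the thread.
my $data = 'Nullamie a orci. Nullam quis augue. Aliquam lacinia tempus Praugue.';

# Original pattern: greedy .+ and no word boundary.
my @greedy = $data =~ /(Nullam.+(?:augue|libero)\.)/g;

# Minimal match with a word boundary after Nullam.
my @lazy = $data =~ /(Nullam\b.+?(?:augue|libero)\.)/g;

# Negated character class with a word boundary.
my @negated = $data =~ /(Nullam\b[^.]+(?:augue|libero)\.)/g;

# Minimal match WITHOUT \b: still starts at "Nullamie" and crosses a period.
my @lazy_no_b = $data =~ /(Nullam.+?(?:augue|libero)\.)/g;

print "greedy:    $_\n" for @greedy;     # the whole string, ending in "Praugue."
print "lazy:      $_\n" for @lazy;       # Nullam quis augue.
print "negated:   $_\n" for @negated;    # Nullam quis augue.
print "lazy_no_b: $_\n" for @lazy_no_b;  # Nullamie a orci. Nullam quis augue.
```

The minimal match stops at the first augue. or libero. it can reach, but nothing stops it from running past a period on the way there. The negated class cannot cross a period at all, which is why it keeps each match inside one sentence.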