Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Afternoon benevolent monks,
I'm trying to put together a regex which will parse out speech from a text, where the speech is inside either single or double quotation marks, for an idea that I have. I've got the regex to find the whole string, but I just want what is inside the quotation marks and not the whole sentence.

use strict;
use warnings;

while (my $text = <DATA>) {
    if ($text =~ /(['"].*?['"])/ ){
        print "Found $text \n";
    }
    else {
        print "Out of luck \n";
    }
}

__DATA__
"Mary and a little lamb", she said.
She thought 'Hang on a tick'
Nobody loves me
"Mary and a large lamb"

I know that there are a couple of CPAN modules which can do this but I thought it might be interesting to try myself first. I did try a lookahead and lookbehinds but it would only match the final string and not where the text is in a sentence. I'd be grateful for any pointers.

Replies are listed 'Best First'.
Re: Trying to find all items in between quotation and speech marks
by kennethk (Abbot) on Jan 23, 2009 at 16:25 UTC

    Put your quotes outside your parentheses, so only the string gets stored in $1. I'd also change from a *? to a +?, since you likely aren't interested in null strings. One case you should consider, which is currently not handled, is "I like Perl," said Mary, "since it gives me more time with my lamb!" - you'll match on ' said Mary, ' in addition to what you want.

    while (my $text = <DATA>) {
        if ($text =~ /['"](.+?)['"]/ ){
            print "Found $1 \n";
        }
        else {
            print "Out of luck \n";
        }
    }

    Update: Details on regular expressions can be found in the documentation in Perl regular expressions, quick reference guide, and tutorial.
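
A rough sketch of the "said Mary" case, assuming the pattern is run with /g to collect every quote on a line. The helper names consuming_quotes and lookaround_quotes are made up for illustration. As far as I can tell, whether the narration gets captured depends on whether the pattern consumes the quote characters: a lookaround variant (like the kind the OP mentions trying) leaves each closing quote free to act as the next opening quote.

```perl
use strict;
use warnings;

# Hypothetical helper: the quote characters are part of the match, so each
# closing quote is consumed and cannot start the next match.
sub consuming_quotes {
    my ($text) = @_;
    return $text =~ /['"](.+?)['"]/g;
}

# Hypothetical helper: lookarounds leave the quote characters unconsumed,
# so a closing quote can double as the opening quote of the next match.
sub lookaround_quotes {
    my ($text) = @_;
    return $text =~ /(?<=['"])(.+?)(?=['"])/g;
}

my $line = q{"I like Perl," said Mary, "since it gives me more time with my lamb!"};

print "consuming:  <$_>\n" for consuming_quotes($line);
print "lookaround: <$_>\n" for lookaround_quotes($line);
```

On this one sample line the consuming version should return the two spoken parts, while the lookaround version should also return ' said Mary, ' between them.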

      Thanks. I haven't written out the edge cases yet but this gets me a little closer to where I want to get.
Re: Trying to find all items in between quotation and speech marks
by AnomalousMonk (Archbishop) on Jan 23, 2009 at 19:29 UTC
    Here's another approach. Note from the last two lines of the  __DATA__ section that it's easy to put together an example that confuses these regexes: natural language parsing is hard!
    use warnings;
    use strict;

    MAIN: {
        my $sq = q{'};
        my $dq = q{"};

        my $sq_body = qr{ [^\\$sq]* (?: \\. [^\\$sq]* )* }xms;
        my $dq_body = qr{ [^\\$dq]* (?: \\. [^\\$dq]* )* }xms;

        my $text = do { local $/; <DATA> };  # slurp all text

        my @quotes =
            grep defined && length,  # ignore empty captures, null strings
            $text =~ m{ $dq ($dq_body) $dq | $sq ($sq_body) $sq }xmsg;

        s{ \n }{ }xmsg for @quotes;  # make multi-line quotes into one line

        print "<$_> \n" for @quotes;
    }

    __DATA__
    "Mary had a little lamb", she said.
    She thought, "I'm sure he said 'Wait a tick' before".
    She wondered, "What happens to an escaped \"?"
    Also, what happens to a "" or '' null quote?
    Nobody loves me
    'Mary had a large and "wooly" lamb'
    Do not divide by '0'.
    Don't divide by '0'.
    She said, 'I'm sure it'll be ok.'

    Output:
    <Mary had a little lamb>
    <I'm sure he said 'Wait a tick' before>
    <What happens to an escaped \"?>
    <Mary had a large and "wooly" lamb>
    <0>
    <t divide by >
    <. She said, >
    <m sure it>
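
For what it's worth, a small sketch of why a quote body would be written as [^\\"]* (?: \\. [^\\"]* )* rather than a lazy .+?: it steps over backslash-escaped characters, so an escaped quote does not end the match. The variable names here are for illustration only, and only double quotes are handled.

```perl
use strict;
use warnings;

# Sample line containing a backslash-escaped double quote.
my $text = q{She wondered, "What happens to an escaped \"?"};

# Lazy match: stops at the first double quote it sees, escaped or not.
my ($naive) = $text =~ m{ " (.+?) " }xms;

# Escape-aware body: runs of ordinary characters, with each backslash
# taking the character after it along, so \" cannot close the quote.
my ($aware) = $text =~ m{ " ( [^\\"]* (?: \\. [^\\"]* )* ) " }xms;

print "naive: <$naive>\n";
print "aware: <$aware>\n";
```

The lazy version should stop right after the backslash, while the escape-aware version should keep going to the real closing quote.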
Re: Trying to find all items in between quotation and speech marks
by Marshall (Canon) on Jan 24, 2009 at 05:54 UTC
    The basic idea below is to use "match global". There will be problems with contractions, so this isn't perfect. Have fun; hope it helps.
    #!/usr/bin/perl -w
    use strict;

    my $text = "";
    while (<DATA>) {
        s/\n/ /;             #not graceful way of \n
        $text = $text.$_;    #not either, but not main point...
    }
    #main point is to use match global (/g)
    #think about /m and /s options also
    #this is just an example of an idea....
    #
    my @quotes = $text =~ m/["'](.*?)["']/g;
    print join("\n",@quotes),"\n";

    __DATA__
    "Mary and a little lamb", she said.
    She thought 'Hang on a tick'.
    this 'is a line
    spanning quote' and course not!
    I really do not know about
    the really stange things,
    "But this is another
    line span quote"
    Nobody loves me
    "Mary and a large lamb"
    this is just nonsense
    the PM said 'something'
    __END__
    OUTPUT:
    Mary and a little lamb
    Hang on a tick
    is a line spanning quote
    But this is another line span quote
    Mary and a large lamb
    something

      Note that the regex  m/["'](.*?)["']/g (and other, similar regexes in this thread) will match unbalanced quotes; e.g., it will match the substring  q{bar} in the string  q{foo 'bar" baz}.

      Although he or she does not say so, what the OPer probably wants is something to match balanced quotes.
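
A minimal sketch of the unbalanced-quote point, using the q{foo 'bar" baz} string from the note above. The alternation pattern is just one assumed way of requiring the same quote character on both ends, not necessarily what the OP needs.

```perl
use strict;
use warnings;

my $mixed = q{foo 'bar" baz};

# Character class on both ends: any quote character may close any other.
my ($loose) = $mixed =~ /["'](.*?)["']/;

# Alternation: each branch opens and closes with the same quote character.
my $strict = $mixed =~ /"[^"]*"|'[^']*'/ ? 'match' : 'no match';

print "loose:  ", (defined $loose ? "<$loose>" : 'no match'), "\n";
print "strict: $strict\n";
```

The loose pattern should capture bar, and the alternation should find nothing to match in this string.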

        Ok, there is also the issue of contractions ("don't", etc.). Another refinement...
        m/(["'])(.+?)\1/;
        If you tag the ["'], then using \1 looks for whichever quote character matched at the beginning. I think some \W is also needed in some fashion, for a don't in the middle of a sentence. There may be a special case when a line begins with no character at all in front of the quote. Not sure what the requirements are when 'foo" is encountered, or other non-standard English constructions.
        m/\W(["'])(.+?)\1\W/;
        It's tricky to think of all cases!
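
One possible way around the start-of-line case, sketched under the assumption that a "real" quote mark never has a word character directly outside it: swap the \W on each side for the zero-width lookarounds (?<!\w) and (?!\w), which are also satisfied at the very start or end of the string. The helper name quoted_spans is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: \1 forces the closing quote to equal the opening
# one; the lookarounds reject quote marks glued to a word character on
# the outside (as in don't), without needing a character to be there.
sub quoted_spans {
    my ($text) = @_;
    my @found;
    while ($text =~ /(?<!\w)(["'])(.+?)\1(?!\w)/g) {
        push @found, $2;
    }
    return @found;
}

my @samples = (
    q{'Hang on a tick' she thought.},
    q{Don't divide by '0'.},
    q{She said, 'I'm sure it'll be ok.'},
    q{foo 'bar" baz},
);

for my $sample (@samples) {
    print "$sample => ", join(' | ', map { "<$_>" } quoted_spans($sample)), "\n";
}
```

This is still only a heuristic: a plural possessive such as the dogs' bowl has an apostrophe followed by a space, so it would presumably still be taken for a closing quote.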
Re: Trying to find all items in between quotation and speech marks
by Anonymous Monk on Jan 23, 2009 at 16:16 UTC
    use regex from those CPAN modules