lobs has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I need help with a regular expression for the Wikipedia module. I am trying to implement a simple question/answer system. I have parsed the question to get the keyword to search Wikipedia with, and I get a page back. All I need is the first sentence, since that is usually where the answer is. The problem is that the complete first sentence is not being returned. For example, for George Washington:
(A lot more info but this is what I am trying to parse)

'George Washington' was the first President of the United States , the Commander-in-Chief of the Continental Army during the American Revolutionary War, and one of the Founding Fathers of the United States. He presided over the convention that drafted the current United States Constitution and during his lifetime was called the "father of his country".<ref name="Grizzard105"></ref> ...

Even with this I am having a lot of trouble extracting just the first sentence. The closest I have gotten is with this code, where $sent is the keyword I am searching for (always the title of the wiki article itself); in this example, $sent is George Washington:
if($doc =~ /'$sent' (is|was) ([\w]+[\s]*[,|;|"|'|\-]?[\s]*)+\./) { $reform = "$sent $1 $2.\n\nFOUND!!"; }
This returned:
$reform = George Washington was States.
This is what I was using earlier, but it gave me the whole article in $reform:
if($doc =~ /'$sent' (is|was) (.*)\./) { $reform = "$sent $1 $2.\n\nFOUND!!"; }
So can someone help me with my regex match, please?

Replies are listed 'Best First'.
Re: Regular expression for Wikipedia Module
by Athanasius (Archbishop) on Apr 20, 2016 at 04:14 UTC

    Hello lobs,

    Where possible, it’s generally better to use an existing module than to re-invent the wheel. In this case, there are modules on CPAN, such as Text::Sentence, that do most of the work for you:

    use strict;
    use warnings;
    use Text::Sentence qw( split_sentences );

    my $sent = 'George Washington';
    my $doc  = do { local $/; <DATA>; };
    my @sentences = split_sentences($doc);

    for (@sentences)
    {
        if (/^'?$sent'?\s+(?:is|was)/)
        {
            print "FOUND:\n$_\n";
            last;
        }
    }

    __DATA__
    The quick brown fox jumped over the unfortunate dog.

    'George Washington' was the first President of the United States, the Commander-in-Chief of the Continental Army during the American Revolutionary War, and one of the Founding Fathers of the United States. He presided over the convention that drafted the current United States Constitution and during his lifetime was called the "father of his country".

    Widely admired for his strong leadership qualities, Washington was unanimously elected president in the first two national elections. He oversaw the creation of a strong, well-financed national government that maintained neutrality in the French Revolutionary Wars, suppressed the Whiskey Rebellion, and won acceptance among Americans of all types.[5] Washington's incumbency established many precedents, still in use today, such as the cabinet system, the inaugural address, and the title Mr. President.[6][7] His retirement from office after two terms established a tradition that lasted until 1940, when Franklin Delano Roosevelt won an unprecedented third term. The 22nd Amendment (1951) now limits the president to two elected terms.

    Output:

    14:11 >perl 1601_SoPW.pl
    FOUND:
    'George Washington' was the first President of the United States, the Commander-in-Chief of the Continental Army during the American Revolutionary War, and one of the Founding Fathers of the United States.
    14:11 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Thanks for the heads up. I saw that already, but my professor probably wouldn't like it if we used it. I can't really ask, since it's after hours and the assignment is due in the morning. I fixed the bug, though, by enclosing ([\w]+[\s]*[,|;|"|'|\-]?[\s]*)+ in parentheses.
Re: Regular expression for Wikipedia Module
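A minimal sketch of why that fix works, using a shortened, made-up version of the sample text: a capture group with a quantifier on it keeps only its last repetition, so $2 ends up holding a single word. Wrapping the whole repetition in an outer capture group keeps the full run. The \Q...\E quoting of $sent and the non-capturing (?: ... ) inner group are additions for illustration and are not part of the original pattern.

```perl
use strict;
use warnings;

my $sent = 'George Washington';

# Shortened sample text (an assumption for this sketch, not the real article).
my $doc = "'George Washington' was the first President of the United States , "
        . "the Commander-in-Chief of the Continental Army. "
        . "He presided over the convention.";

# Quantified capture group: only the LAST repetition is kept in the capture.
my ($verb1, $last_word) =
    $doc =~ /'\Q$sent\E' (is|was) ([\w]+[\s]*[,|;|"|'|\-]?[\s]*)+\./;

# Outer capture group around the repetition: the whole run is kept.
# The inner group is non-capturing so it does not take a capture slot.
my ($verb2, $rest) =
    $doc =~ /'\Q$sent\E' (is|was) ((?:[\w]+[\s]*[,|;|"|'|\-]?[\s]*)+)\./;

print "last repetition only: $sent $verb1 $last_word.\n";
print "whole sentence:       $sent $verb2 $rest.\n";
```

With this input the first pattern captures only the final word before the period, while the second captures everything between the verb and the period.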
by NetWallah (Canon) on Apr 20, 2016 at 04:32 UTC
    This works for me:
    $doc =~/([^\.]+$sent.+?\.)/;
    Gathers:
    'George Washington' was the first President of the United States , the Commander-in-Chief of the Continental Army during the American Revolutionary War, and one of the Founding Fathers of the United States.
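A short sketch of how the pattern above behaves, with two made-up inputs. The non-greedy .+? stops at the first period after $sent, which is what yields a single sentence. The same property means an abbreviation inside the first sentence would cut the match short; the second input is hypothetical and only there to show that limitation.

```perl
use strict;
use warnings;

my $sent = 'George Washington';

# Shortened sample text (an assumption for this sketch).
my $doc = "'George Washington' was the first President of the United States. "
        . "He presided over the convention.";

# [^\.]+ needs at least one non-period character before $sent (here the
# opening quote); .+? then runs only as far as the first period.
my ($first) = $doc =~ /([^\.]+$sent.+?\.)/;
print "$first\n";

# Hypothetical input: the period in "Mr." ends the match early.
my $tricky = "'George Washington' was addressed as Mr. President. "
           . "He served two terms.";
my ($cut) = $tricky =~ /([^\.]+$sent.+?\.)/;
print "$cut\n";
```

If abbreviations matter for the assignment, the pattern would need extra handling for them; this sketch does not attempt that.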

            This is not an optical illusion, it just looks like one.