http://qs1969.pair.com?node_id=11135826


in reply to Match last word in a sentence

My definitions produce a slightly different result. I assume that the "sentence" is the whole string and the "last word" is the last run of word characters (\w). The "rest" is everything before the "last word"; note that the space before the "last word" is included in the "rest". Any non-word characters after the "last word" are discarded.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $list = "This is my list";
    $list =~ /
        ^(.+)    # The 'rest' (Everything before last word)
        \b       # UPDATE (ref [eyepopslikeamosquito] below)
        (\w+)    # Last 'word' (string of contiguous word characters)
        \W*$     # Possible non-word characters at end of string
    /x;
    my $last     = $2;
    my $the_rest = $1;
    print $the_rest;   # "This is my " (trailing space included)
    print $last;       # "list"

RESULT:

    This is my list

UPDATE: Added explicit definition of 'sentence'.

Bill

Re^2: Match last word in a sentence
by eyepopslikeamosquito (Archbishop) on Aug 14, 2021 at 00:06 UTC

    This appears to contain a bug, revealed when you change the print lines as shown below:

        use strict;
        use warnings;

        my $list = "This is my list";
        $list =~ /
            ^(.+)    # The 'rest' (Everything before last word)
            (\w+)    # Last 'word' (string of contiguous word characters)
            \W*$     # Possible non-word characters at end of string
        /x;
        my $last     = $2;
        my $the_rest = $1;
        print "the_rest='$the_rest'\n";
        print "last='$last'\n";

    Running this produces:

        the_rest='This is my lis'
        last='t'

    There are many ways to fix this. Here is one (adding a \b assertion):

        $list =~ /
            ^(.+)    # The 'rest' (Everything before last word)
            \b(\w+)  # Last 'word' (string of contiguous word characters)
            \W*$     # Possible non-word characters at end of string
        /x;

    Alternative fixes welcome.
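
To make the difference concrete, here is a minimal side-by-side sketch of the two patterns from the reply above. The helper `split_last_word` is a hypothetical name introduced only for this illustration. The greedy `.+` first swallows the whole string and then gives back just enough for the rest of the pattern to match; without `\b` that is a single character, while with `\b` the engine has to keep backtracking until `\w+` can start at a word boundary.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return (rest, last),
# or an empty list if the pattern does not match.
sub split_last_word {
    my ($re, $string) = @_;
    return $string =~ $re ? ($1, $2) : ();
}

my $buggy = qr/^(.+)(\w+)\W*$/;      # greedy .+ leaves one character for \w+
my $fixed = qr/^(.+)\b(\w+)\W*$/;    # \b makes \w+ start at a word boundary

my @buggy_parts = split_last_word($buggy, "This is my list");
my @fixed_parts = split_last_word($fixed, "This is my list");

print "buggy: the_rest='$buggy_parts[0]' last='$buggy_parts[1]'\n";
print "fixed: the_rest='$fixed_parts[0]' last='$fixed_parts[1]'\n";
```

With the fixed pattern, the_rest still carries the trailing space, as described in the original post.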

      Is making the first part non-greedy not enough ...

          my $list = "This is my list";
          $list =~ m/^(.+?)  # The 'rest' (Everything before last word)
              (\w+)          # Last 'word' (string of contiguous word characters)
              \W*$           # Possible non-word characters at end of string
          /x;

      ... (would it fail on some other string)?
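
One input worth trying with the non-greedy version is a string that consists of a single word. The sketch below uses a hypothetical helper, `split_non_greedy`, that is not part of the thread. Because `.+?` must consume at least one character, a one-word string gets split after its first letter. For comparison, the `\b` version in the reply above does not match a one-word string at all, since `.+` also needs at least one character there.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): apply the non-greedy pattern
# and return (rest, last), or an empty list if it does not match.
sub split_non_greedy {
    my ($string) = @_;
    return $string =~ /^(.+?)(\w+)\W*$/ ? ($1, $2) : ();
}

my @multi  = split_non_greedy("This is my list");
my @single = split_non_greedy("list");

print "multi:  the_rest='$multi[0]' last='$multi[1]'\n";
print "single: the_rest='$single[0]' last='$single[1]'\n";
```

If one-word strings matter, changing `.+?` to `.*?` would let the 'rest' be empty; that is a suggestion of this sketch, not something proposed in the thread.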

      Much later: for the English language, \w is not inclusive enough (it lacks the hyphen) and at the same time too inclusive (it matches underscores and digits). Short of a proper grammar-based parser, I would rather use a word regex that addresses that ...

      $word_re = qr{ (?: & | -? [a-zA-Z]+ [a-zA-Z-]* ) }x;

      ... though it is still incomplete, as it does not deal with accented characters, periods in a title, or acronyms with spaces and/or periods, among other things.
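
As a rough illustration of how the word regex above changes the result, the sketch below collects every match in a sentence and keeps the final one. The helper `last_word` and the sample sentence are assumptions made for this example; only `$word_re` comes from the post.

```perl
use strict;
use warnings;

# Word pattern quoted from the post above.
my $word_re = qr{ (?: & | -? [a-zA-Z]+ [a-zA-Z-]* ) }x;

# Hypothetical helper (not from the thread): collect every match of the
# given pattern and return the final one (undef if there is none).
sub last_word {
    my ($string, $re) = @_;
    my @words = $string =~ /($re)/g;
    return $words[-1];
}

my $sentence  = "This design is state-of-the-art.";
my $with_w    = last_word($sentence, qr/\w+/);
my $with_word = last_word($sentence, $word_re);

print "\\w+     : '$with_w'\n";      # 'art'
print "word_re : '$with_word'\n";   # 'state-of-the-art'
```

Taking the last element of a global match avoids anchoring the pattern at the end of the string, so trailing punctuation needs no special handling here.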