End of sentence regex excluding " i.e." and " e.g."

jabowery has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: End of sentence regex excluding " i.e." and " e.g." by kennethk (Abbot) on Feb 06, 2017 at 18:00 UTC
First, to be pedantic, e.g. and i.e. should always be followed by a comma, so you are dealing with grammatical errors: http://www.dailywritingtips.com/comma-after-i-e-and-e-g/. There isn't a general solution to this problem because of names (e.g., H.G. Wells) and quoting, but perhaps `/[.!?]\s{1,2}(?=[A-Z0-9])/` will be sufficiently robust for your needs? In general, for a corpus like this, I'd split it into known good, known bad, and grey, and then use test-driven development to build out my filter.

Update: Augmented regex for ! and ?

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
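A minimal sketch of the test-driven approach described above, using the suggested separator with core Perl's `split` and `Test::More`. The sample sentences and the helper name `count_sentences` are made up for illustration; they are not from the original poster's corpus.

```perl
use strict;
use warnings;
use Test::More;

# Separator suggested above: a terminator, one or two whitespace
# characters, then a capital letter or digit (look-ahead, not consumed).
my $sep = qr/[.!?]\s{1,2}(?=[A-Z0-9])/;

# Hypothetical helper: number of pieces split() produces for a text.
sub count_sentences {
    my ($text) = @_;
    my @parts = split $sep, $text;
    return scalar @parts;
}

# Invented "known good" cases: [ text, expected sentence count ].
my @known_good = (
    [ 'It rained. We stayed in.',              2 ],
    [ 'Bring fruit, e.g. apples. Then leave.', 2 ],
    [ 'Really? Yes! It works.',                3 ],
);
is( count_sentences( $_->[0] ), $_->[1], $_->[0] ) for @known_good;

# Invented "grey" case, marked TODO because this pattern is expected
# to split after the initials.
TODO: {
    local $TODO = 'initialed names are split after the last initial';
    is( count_sentences('H.G. Wells wrote novels. I read them.'),
        2, 'initialed name' );
}

done_testing();
```

The TODO case fails on purpose: the pattern splits after `H.G.`, which is the kind of grey-area input that would drive the next refinement of the filter.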
Re^2: End of sentence regex excluding " i.e." and " e.g." by Not_a_Number (Prior) on Feb 06, 2017 at 18:19 UTC
> e.g. and i.e. should always be followed by a comma

Did you actually read the document you linked to? It clearly says:

> Style guides do not agree on whether or not a comma should follow both these abbreviations ... The consensus seems to be in favor of the comma in American usage; against it in British usage.
Re^3: End of sentence regex excluding " i.e." and " e.g." by kennethk (Abbot) on Feb 06, 2017 at 18:27 UTC
Which is why I opened with a pedantic warning. It's true I was assuming he was parsing American English. And that's also why I gave a link, and not just my comment.
Re^2: End of sentence regex excluding " i.e." and " e.g." by choroba (Cardinal) on Feb 07, 2017 at 08:44 UTC
> to be pedantic, e.g. and i.e. should always be followed by a comma

In this very sentence, neither of them is followed by a comma.

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^2: End of sentence regex excluding " i.e." and " e.g." by jabowery (Beadle) on Feb 06, 2017 at 18:05 UTC
> In general, for a corpus like this, I'd split it into known good, known bad, and grey, and then use test-driven development in order to build out my filter.

That's what I'm doing, but I got stuck at the early stage of handling just the cases of " e.g." and " i.e.", and I'm asking how to get unstuck so I can follow your advice, which I already was doing.
Re^3: End of sentence regex excluding " i.e." and " e.g." by kennethk (Abbot) on Feb 06, 2017 at 18:35 UTC
Did the look-ahead for a capital help? I should mention I think you have a typo in your real script (as opposed to what you posted) because the following script behaves well for me:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Stream;

# Use the sentence-boundary regex as the input record separator.
# The look-behinds reject single-capital initials, "e.g" and "i.e".
my ($handler, $stream) = File::Stream->new(
    \*DATA,
    read_length => 1024,
    separator   => qr/(?<!\b[A-Z])(?<!e\.g)(?<!i\.e)[.!?]\s{1,2}(?=[A-Z0-9])/,
);

# Each record read from the stream is one sentence.
while (<$stream>) {
    print "$_\n\n";
}

__DATA__
Perl filehandles are streams, but sometimes they just aren't powerful enough.
This module offers to have streams from filehandles searched with regexes
and allows the global input record separator variable to contain regexes.
Thus, readline() and the <> operator can now return records delimited
by regular expression matches.
There are some very important gripes with applying regular expressions to
(possibly infinite) streams. Please read the CAVEATS section of this
documentation carfully.
Some bunnys are fluffy, e.g. Peter. H.G. Wells was a great author.
Some sports require specialized equipment, e.g. baseball.
```

Debugging is hard without particular examples from your corpus.
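For anyone without File::Stream installed, the same separator can be tried against an in-memory string with plain `split`. This is only a sketch: the helper name `split_on_sep` and the second sample sentence are invented, and `split` discards the terminator it matches, so the pieces lose their closing punctuation except for the last one.

```perl
use strict;
use warnings;

# Same separator as in the script above.
my $sep = qr/(?<!\b[A-Z])(?<!e\.g)(?<!i\.e)[.!?]\s{1,2}(?=[A-Z0-9])/;

# Hypothetical helper: split a string on the separator.
sub split_on_sep {
    my ($text) = @_;
    return split $sep, $text;
}

my @parts = split_on_sep(
    'Some bunnys are fluffy, e.g. Peter. H.G. Wells was a great author. '
  . 'Some sports require specialized equipment, e.g. baseball.'
);
print "$_\n" for @parts;

# A limitation to keep in the "grey" pile: the initial-rejecting
# look-behind also blocks a sentence that ends in a single capital.
my @missed = split_on_sep('We chose plan B. Then we left.');
print scalar(@missed), " piece(s)\n";
```

The first call yields three pieces, with `e.g. Peter` and `H.G. Wells` left intact. The second call yields a single piece, because `B.` looks like an initial to the first look-behind.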
Re: End of sentence regex excluding " i.e." and " e.g." by Not_a_Number (Prior) on Feb 06, 2017 at 18:25 UTC
> Any suggestions?

Yes: use Lingua::EN::Sentence.
Re: End of sentence regex excluding " i.e." and " e.g." by AnomalousMonk (Archbishop) on Feb 06, 2017 at 18:20 UTC
I don't think `/(?<!( e|\.g)|( i|\.e))\.\s/` does what I think you think it does. If you have Perl version 5.10+ regex extensions, try something like (untested):

~~`my ($exclude) = map qr{ (?: \Q$_\E) (*SKIP) (*FAIL) }xms, join q{|}, qw(e.g. i.e. Dr. Mr. Mrs. ... etc.) ; my $delimiter = qr{ $exclude [.?!] \s }xms;`~~

```perl
# Build one alternation of quoted abbreviations; a match is discarded
# via (*SKIP)(*FAIL) so the text it covers can never hold a delimiter.
my ($exclude) =
    map  qr{ (?: $_) (*SKIP) (*FAIL) }xms,
    join q{ | },
    map  qq{\Q$_\E},
    reverse sort
    qw(e.g. i.e. Dr. Mr. Mrs. ... etc.)
    ;

# Skip excluded tokens; otherwise match a terminator plus whitespace.
my $delimiter = qr{ $exclude | [.?!] \s }xms;
```

I have no idea how you could handle something like "H.G. Wells".

Update: I was a bit too quick with my post; see my update above. Also, I think I might see a way to exclude initialed names and similar things:

```perl
my $name           = qr{ [[:upper:]] [[:lower:]]+ }xms;
my $initialed_name = qr{ \b [[:upper:]] [.] (?= \s+ $name) }xms;

# Initialed names join the abbreviations in the exclusion alternation.
my ($exclude) =
    map  qr{ (?: $_) (*SKIP) (*FAIL) }xms,
    join q{ | },
    $initialed_name,
    map  qq{\Q$_\E},
    reverse sort
    qw(e.g. i.e. Dr. Mr. Mrs. ... etc.)
    ;

my $delimiter = qr{ $exclude | [.?!] \s }xms;
```

Obviously, this is just a starting point toward a robust solution.

Update 2: It occurs to me that the above won't handle a name like P.D.Q. Bach, so maybe change `$initialed_name` as follows (still untested):

```perl
# One or more initials, followed (after whitespace) by a capitalized name.
my $initial        = qr{ \b [[:upper:]] [.] \s* }xms;
my $name           = qr{ \b [[:upper:]] [[:lower:]]* }xms;
my $initialed_name = qr{ $initial+ (?= \s+ $name) }xms;
```

Give a man a fish: `<%-{-{-{-<`
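Putting the pieces above together, here is a rough, runnable sketch of the `(*SKIP)(*FAIL)` idea. It assumes the exclusion pattern and the terminator pattern are joined by alternation, drops the literal `...` placeholder from the abbreviation list, and uses a hypothetical helper `split_sentences` that keeps each terminator with its sentence. The sample sentence is invented.

```perl
use strict;
use warnings;

# Illustrative abbreviation list; a real corpus will need more entries.
my @abbrev = qw(e.g. i.e. Dr. Mr. Mrs. etc.);

my $initial        = qr{ \b [[:upper:]] [.] \s* }xms;
my $name           = qr{ \b [[:upper:]] [[:lower:]]* }xms;
my $initialed_name = qr{ $initial+ (?= \s+ $name) }xms;

# Anything matched by the exclusion alternation is skipped over.
my ($exclude) =
    map  qr{ (?: $_) (*SKIP) (*FAIL) }xms,
    join q{ | },
    $initialed_name,
    map  qq{\Q$_\E},
    reverse sort @abbrev;

my $delimiter = qr{ $exclude | [.?!] \s }xms;

# Hypothetical helper: return the sentences of $text, terminators kept.
sub split_sentences {
    my ($text) = @_;
    my @sentences;
    my $start = 0;
    while ( $text =~ m{$delimiter}g ) {
        # $-[0] is the offset of the terminator, $+[0] the offset just
        # past the whitespace that follows it.
        push @sentences, substr $text, $start, $-[0] + 1 - $start;
        $start = $+[0];
    }
    push @sentences, substr $text, $start if $start < length $text;
    return @sentences;
}

print "$_\n" for split_sentences(
    'Dr. Smith met H.G. Wells, i.e. the author. They talked.'
);
```

Because an excluded token is never a boundary, a sentence that genuinely ends in `etc.` or in a single capital letter will be merged with the next one, so cases like that still belong in the grey pile.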