Do you mean like this?
#!/usr/bin/perl
use strict;
use warnings;
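while (<DATA>) {
    chomp;
    # Collect every acronym on the line; /g resumes where the last match ended.
    while ( m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g ) {
        my $acronym = $1;
        print qq("$acronym"\n);    # the quotes make trailing whitespace visible
    }
}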
exit(0);
__DATA__
L F and LF and L.F. and L. F. and not L, F.
some HTML some XML.
or X.H.T.M.L. or X. H. T. M. L. or even X H T M L
but not U and I,
or You and I.
...
Sample output:
"L F "
"LF"
"L.F."
"L. F. "
"HTML"
"XML"
"X.H.T.M.L."
"X. H. T. M. L. "
"X H T M L"
It is probably better to break m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g down bit by bit and explain how it works:
- m/( ... )/g
- This arranges for the expression to match repeatedly, beginning each search where the previous search "left off", with the entire pattern in a capture group. I have used numbered capture groups here because the expression uses a backreference.
- [[:upper:]](\.\s|\.|\s|)
- First, we must find an uppercase letter, then we note what comes after it in a second capture group. The first letter can be followed by a dot and whitespace, a dot only, whitespace only, or an empty string, and we remember whatever we matched here as group 2.
- (?: ... )+
- This non-capturing group defines the repetition. We require at least 1 additional "element" in the acronym, so single uppercase letters are not recognized.
- [[:upper:]](?:\2|$)
- Later "elements" must consist of an uppercase letter followed by the same string (the "\2") that we previously found between the first and second letter, or must run up to the end of the input string.
- m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
- Lastly, backtracking in this expression is very limited, so the engine will either advance or fail the match very quickly: there is no point in the expression where the engine can "stretch" one element and still match the next.
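The breakdown above can also be written straight into the pattern with the /x modifier, so that each piece carries its own comment. This is only a sketch of the same expression: the names $acronym_re and find_acronyms are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as in the script above, spread out with /x.
# ($acronym_re and find_acronyms are hypothetical names for this sketch.)
my $acronym_re = qr{
    (                           # group 1: the whole acronym
        [[:upper:]]             # first uppercase letter
        ( \.\s | \. | \s | )    # group 2: dot+whitespace, dot, whitespace, or nothing
        (?:                     # one later element:
            [[:upper:]]         #   another uppercase letter
            (?: \2 | $ )        #   then the same separator, or end of string
        )+                      # at least one later element is required
    )
}x;

# Return every acronym found in the given string, in order of appearance.
sub find_acronyms {
    my ($text) = @_;
    my @found;
    while ( $text =~ /$acronym_re/g ) {
        push @found, $1;
    }
    return @found;
}

print qq("$_"\n) for find_acronyms('some HTML or X. H. T. M. L. or even X H T M L');
```

Because the separator is fixed by whatever follows the first letter, a string with mixed separators such as 'L.F G' is not taken as one three-letter acronym.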