Do you mean like this?
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    while ( m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g ) {
        my $acronym = $1;
        print qq("$acronym"\n);
    }
}
exit(0);
__DATA__
L F and LF and L.F. and L. F. and not L, F.
some HTML some XML.
or X.H.T.M.L. or X. H. T. M. L. or even X H T M L
but not U and I,
or You and I.
...
Sample output:
"L F "
"LF"
"L.F."
"L. F. "
"HTML"
"XML"
"X.H.T.M.L."
"X. H. T. M. L. "
"X H T M L"
It is probably best to break m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g down piece by piece and explain how each part works:
- m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
- The outer m/( ... )/g: in the scalar context of the while condition, /g makes the expression match repeatedly, beginning each search where the previous one "left off". The entire pattern sits in capture group 1. I have used numbered capture groups here because the expression uses a backreference.
- m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
- The [[:upper:]](\.\s|\.|\s|) part: first we must find an uppercase letter, then we note what comes after it in a second capture group. The first letter can be followed by a dot and whitespace, a dot only, whitespace only, or nothing at all, and whatever matched here is remembered as group 2.
- m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
- The (?: ... )+ part: this non-capturing group defines the repetition. We require at least one additional "element" in the acronym, so single uppercase letters are not recognized.
- m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
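- The [[:upper:]](?:\2|$) part: each later "element" must consist of an uppercase letter followed by the same string (the "\2") that we previously found between the first and second letter, or else must run up to the end of the input string. This is also why the separator after the last letter is included in matches such as "L F " in the sample output.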
- m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
- Lastly, an important detail: backtracking in this expression is very limited, so the engine will either advance or fail the match very quickly. There is no point in the expression where the engine can "stretch" one element and still match the next.
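
If it helps, here is a rough sketch of the same pattern written with the /x modifier, so that each piece described above carries its own comment. The names $ACRONYM and find_acronyms are made up for this illustration and are not part of the script above; the test strings are my own as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as in the script above, spread out with /x.
# ACRONYM and find_acronyms are hypothetical names used only in this sketch.
my $ACRONYM = qr/
    (                        # group 1: the whole acronym
        [[:upper:]]          # first uppercase letter
        (\.\s|\.|\s|)        # group 2: the separator, possibly empty
        (?:                  # one further element:
            [[:upper:]]      #   an uppercase letter
            (?:\2|$)         #   then the same separator, or end of string
        )+                   # at least one further element is required
    )
/x;

# Collect every acronym in a string. The scalar-context match with the g
# modifier resumes after the previous match on each pass through the loop.
sub find_acronyms {
    my ($text) = @_;
    my @found;
    while ( $text =~ /$ACRONYM/g ) {
        push @found, $1;
    }
    return @found;
}

for my $line ( 'L.F. and L. F. here', 'mixed L. F is rejected', 'X H T M L' ) {
    print qq("$_"\n) for find_acronyms($line);
}
```

This should print "L.F.", "L. F. " and "X H T M L", and nothing for the second string: once group 2 has captured a dot plus whitespace after the L, the F would have to be followed by exactly that again, and none of the other separator alternatives leads to an uppercase letter either.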