in reply to Regular expression for finding acronyms

Do you mean like this?

#!/usr/bin/perl

use strict;
use warnings;

while (<DATA>) {
    chomp;
    while ( m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g ) {
        my $acronym = $1;
        print qq("$acronym"\n);
    }
}

exit(0);

__DATA__
L F and LF and L.F. and L. F. and not L, F.
some HTML
some XML.
or X.H.T.M.L. or X. H. T. M. L. or even X H T M L
but not U and I, or You and I.
...

Sample output:

"L F "
"LF"
"L.F."
"L. F. "
"HTML"
"XML"
"X.H.T.M.L."
"X. H. T. M. L. "
"X H T M L"

It is probably better to break m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g down bit by bit and explain how it works:

m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
The outer m/( ... )/g arranges for the expression to match repeatedly, each search beginning where the previous one "left off", with the entire pattern in capture group 1. I have used numbered capture groups here because the expression uses a backreference.
m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
The opening [[:upper:]](\.\s|\.|\s|) first finds an uppercase letter, then records what comes after it in a second capture group. The first letter can be followed by a dot and whitespace, a dot only, whitespace only, or nothing at all; whatever matched here is remembered as group 2.
m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
The non-capturing group (?: ... )+ defines the repetition. We require at least one additional "element" in the acronym, so single uppercase letters are not recognized.
m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
Inside that group, [[:upper:]](?:\2|$) says that each later "element" must consist of an uppercase letter followed either by the same string (the "\2") that we previously found between the first and second letters, or by the end of the input string.
m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
Lastly, an important detail about the expression as a whole: backtracking is very limited, so the engine will either advance or fail the match very quickly. There is no point in the expression where the engine can "stretch" one element and still match the next.
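
For anyone who finds the one-liner hard to read, the same pattern can be laid out with the /x modifier, which lets the regex carry whitespace and comments. The following is only a sketch: the find_acronyms helper and the three sample strings are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same pattern, laid out with the x modifier so that whitespace and
# comments inside the regex are ignored.
my $acronym_re = qr/
    (                           # group 1: the whole acronym
        [[:upper:]]             # first letter
        ( \.\s | \. | \s | )    # group 2: separator, possibly empty
        (?:                     # one further element, repeated by the + below
            [[:upper:]]         # next letter
            (?: \2 | $ )        # the same separator again, or end of string
        )+
    )
/x;

# Hypothetical helper (not in the original script): return every acronym
# found in a string.
sub find_acronyms {
    my ($text) = @_;
    my @found;
    while ( $text =~ /$acronym_re/g ) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample strings.
for my $text ( 'NASA and N.A.S.A. agree', 'U S A', 'X. H.T.' ) {
    my @found = find_acronyms($text);
    print qq("$text" => ), join( ', ', map { qq("$_") } @found ), "\n";
}
```

This should print "NASA", "N.A.S.A." for the first string, "U S A" for the second (the final A is accepted through the $ branch), and only "H.T." for the third: the X is followed by dot-and-space while the H is followed by a bare dot, so the backreference rejects the mixed style and the match restarts at H.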

Re^2: Regular expression for finding acronyms
by mldvx4 (Hermit) on Aug 16, 2019 at 05:05 UTC

    Thanks for the pattern and its explanation. That way is a lot simpler and gets the desired result.

    I had not known that capture groups could be nested like that, nor non-capture markers either.

        Yes, I've been poring over perlre a lot lately. However, it is past time to revisit perlretut.