mldvx4 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to build a pattern that will extract acronyms. I've gotten fairly close, but it leaves off the final period where it should be saved. Notice that it is missing from the output. For example, "X.H.T.M.L" should be "X.H.T.M.L."

I wrote the pattern below in a moment of insight and now that moment has passed. How may I modify it to preserve the trailing periods?

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>) {
    chomp;
    while ( s/(?'foo'
                (
                    (?=([[:upper:]]\.\s){2})[[:upper:]\.\s]{2,}
                  | (?=([[:upper:]]\s){2})[[:upper:]\s]{2,}
                  | (?=([[:upper:]]\.){2})[[:upper:]\.]{2,}
                  | [[:upper:]]
                ){2,}
              )//x ) {
        my $acronym = $+{foo};
        print qq("$acronym"\n);
    }
}

exit(0);

__DATA__
L F and LF and L.F. and L. F. and not L, F.
some HTML some XML.
or X.H.T.M.L. or X. H. T. M. L. or even X H T M L
but not U and I,
or You and I.
...

Or if there is an existing function or module which does that already, I can use that instead.

Replies are listed 'Best First'.
Re: Regular expression for finding acronyms
by jcb (Parson) on Aug 16, 2019 at 04:33 UTC

    Do you mean like this?

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        while ( m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g ) {
            my $acronym = $1;
            print qq("$acronym"\n);
        }
    }

    exit(0);

    __DATA__
    L F and LF and L.F. and L. F. and not L, F.
    some HTML some XML.
    or X.H.T.M.L. or X. H. T. M. L. or even X H T M L
    but not U and I,
    or You and I.
    ...

    Sample output:

    "L F "
    "LF"
    "L.F."
    "L. F. "
    "HTML"
    "XML"
    "X.H.T.M.L."
    "X. H. T. M. L. "
    "X H T M L"

    It is probably better to break m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g down bit by bit and explain how it works:

    m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
    This arranges for the expression to match repeatedly, beginning each search where the previous search "left off", with the entire pattern in a capture group. I have used numbered capture groups here because the expression uses a backreference.
    m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
    First, we must find an uppercase letter, then we note what comes after it in a second capture group. The first letter can be followed by a dot and whitespace, a dot only, a space only, or an empty string, but we remember what we matched here as group 2.
    m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
    This non-capturing group defines the repetition. We require at least one additional "element" in the acronym, so single uppercase letters are not recognized.
    m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
    Later "elements" must consist of an uppercase letter and the same string (the "\2") that we previously found between the first and second letter, or must run up to the end of the input string.
    m/([[:upper:]](\.\s|\.|\s|)(?:[[:upper:]](?:\2|$))+)/g
    Lastly, there's an important detail here: backtracking in this expression is very limited and the engine will either advance or fail the match very quickly. The critical detail is that there is no point in the expression where the engine can "stretch" one element and still match the next.
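The breakdown above can also be written down as a commented pattern. The following is only a minimal sketch, assuming the same expression is spelled out under the /x modifier; the names $ACRONYM_RE and find_acronyms are hypothetical and do not appear in the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical names: $ACRONYM_RE and find_acronyms() are illustrative only.
# Same pattern as above, laid out with /x so each part can carry a comment.
my $ACRONYM_RE = qr{
    (                          # group 1: the whole match
        [[:upper:]]            # first uppercase letter
        ( \.\s | \. | \s | )   # group 2: separator after it, possibly empty
        (?:                    # each further element is ...
            [[:upper:]]        #   another uppercase letter, then
            (?: \2 | $)        #   the same separator, or end of string
        )+                     # ... and at least one is required
    )
}x;

sub find_acronyms {
    my ($text) = @_;
    my @found;
    # Scalar-context //g resumes where the previous match left off.
    # A list-context match would also return group 2 for every hit,
    # so the loop collects only group 1.
    while ( $text =~ /$ACRONYM_RE/g ) {
        push @found, $1;
    }
    return @found;
}

print qq("$_"\n)
    for find_acronyms('or X.H.T.M.L. or X. H. T. M. L. or even X H T M L');
```

As in the sample output, a match keeps whatever separator followed its last letter, so some results carry a trailing space.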

      Thanks for the pattern and its explanation. That way is a lot simpler and gets the desired result.

      I had not known that capture groups could be nested like that, nor non-capture markers either.

Re: Regular expression for finding acronyms
by Marshall (Canon) on Aug 16, 2019 at 02:34 UTC
    You are not close at all to producing acronyms like LF, XML, HTML, XHTML. Your code produces:

    "L F"
    "LF"
    "L.F"
    "L. F"
    "HTML"
    "XML"
    "X.H.T.M.L"
    "X. H. T. M. L"
    "X H T M L"

    What do you think this should say? Your code does not produce acronyms.

    "An acronym is an initial abbreviation that can be pronounced as a word, such as NASA or WASP. This term is also used to refer to a series of initials pronounced individually, such as FBI or TGIF, but the technical term is initialism."

    Update: Perhaps consider this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    $|=1;

    while (my $line = <DATA>) {
        next if $line =~ /^\s*$/;               #skip blank lines
        $line =~ s/\s*$//;                      #delete line endings
        print "INPUT LINE: \'$line\'\n";
        $line =~ s/\s+|\,|\.//g;                #remove spaces, commas, periods
        my @acron = $line =~ m/([A-Z]{2,})/g;   #get sequence of 2 or
                                                #more uppercase chars
        print "Acronym $_\n" for @acron;
    }

    =Prints
    INPUT LINE: 'L F and LF and L.F. and L. F. and not L, F.'
    Acronym LF
    Acronym LF
    Acronym LF
    Acronym LF
    Acronym LF
    INPUT LINE: 'some HTML some XML.'
    Acronym HTML
    Acronym XML
    INPUT LINE: 'or X.H.T.M.L. or X. H. T. M. L. or even X H T M L'
    Acronym XHTML
    Acronym XHTML
    Acronym XHTML
    INPUT LINE: 'but not U and I,'
    INPUT LINE: 'or You and I.'
    INPUT LINE: '...'
    =cut

    __DATA__
    L F and LF and L.F. and L. F. and not L, F.
    some HTML some XML.
    or X.H.T.M.L. or X. H. T. M. L. or even X H T M L
    but not U and I,
    or You and I.
    ...

    Update: Oh, I guess you don't want 'L,' to count...in that case don't delete the comma *see above*
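A minimal sketch of that variant, assuming the only change is that commas stay in the line so that 'L, F.' no longer collapses to LF. The sub name bare_acronyms is hypothetical and is used here only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: bare_acronyms() is an illustrative name only.
sub bare_acronyms {
    my ($line) = @_;
    $line =~ s/[\s.]+//g;                   # remove whitespace and periods, keep commas
    my @acron = $line =~ m/([A-Z]{2,})/g;   # runs of 2 or more uppercase chars
    return @acron;
}

for my $line ( 'L F and LF and L.F. and L. F. and not L, F.',
               'but not U and I,' ) {
    print "INPUT LINE: '$line'\n";
    print "Acronym $_\n" for bare_acronyms($line);
}
```

Because all whitespace is removed before matching, capitals from neighbouring words are merged as well: a phrase such as 'A Big' would be reported as AB.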

    LF, HTML, XML, XHTML are acronyms. Something like X. H. T. M. L. is not.