aeqr has asked for the wisdom of the Perl Monks concerning the following question:

I would like to extract words of three letters from a string, for example:

ABCDEF would give (ABC), (BCD),(CDE), (DEF)

I would like to try to use pattern matching for this, so I have tried this:
sub build_dictionnary{
    my $line="ABCDE";
    my @dic;
    $line=~s/(([A-Z]{1})[A-Z]{2})/push(@dic,$1);/g;
}

So what I am trying to do here is capture the three letters, save them in the dictionary array, and then replace the first letter with nothing (i.e., remove it). This solution doesn't work. Is it possible to do it that way?

Thank you!

Re: substitution in regular expression
by AnomalousMonk (Archbishop) on Apr 23, 2014 at 19:59 UTC

    If you just want to extract overlapping triplets without changing the original string:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = 'ABCDEF';
     ;;
     my @triplets = $s =~ m{ (?= (...)) }xmsg;
     printf qq{'$_' } for @triplets;
    "
    'ABC' 'BCD' 'CDE' 'DEF'

    If you want to simultaneously do substitutions to change the match string so that it ends up as 'DEF' or 'EF', that's trickier (at least, it's tricky to do with a single substitution operation), but I'm assuming substitution is just an artifact of the potential approach you happened to come up with, i.e., it's an XY Problem. Please advise on this point.

    Update: See Re^3: substitution in regular expression for a string-modifying  s/// solution.

      Thanks for the help, it's ok to modify the string. I would like to do it in the way I described if it's possible.

        Maybe (?) something like this is what you want?

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = 'ABCDEF';
         ;;
         my @triplets;
         $s =~ s{ (?= (...)) . }{ push @triplets, $1; ''; }xmsge;
         ;;
         print qq{'$s'};
         printf qq{'$_' } for @triplets;
        "
        'EF'
        'ABC' 'BCD' 'CDE' 'DEF'

        Update: Here's another  s/// solution. I'm not sure I like it so much: I'm always suspicious, perhaps without cause, of code embedded in a regex. In addition, the  @triplets array must be a package global (ideally local-ized) due to a bug in lexical management that wasn't fixed until Perl version 5.16 or 5.18 (I think — I don't have access to these versions and I'm too lazy to check the on-line docs).

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = 'ABCDEF';
         ;;
         local our @triplets;
         $s =~ s{ (?= (...) (?{ push @triplets, $^N })) . }''xmsg;
         ;;
         print qq{'$s'};
         printf qq{'$_' } for @triplets;
        "
        'EF'
        'ABC' 'BCD' 'CDE' 'DEF'

        So how should the string end up, as 'DEF' or as 'EF'?

Re: substitution in regular expression
by Anonymous Monk on Apr 23, 2014 at 19:54 UTC

    You've placed code (push(@dic,$1);) inside the replacement part of the substitution. Without the /e modifier the replacement is treated as an interpolated string, so each match is replaced by the text of that code (with @dic and $1 interpolated) instead of the code being executed, and because of that the string will never get shorter. Even though you could get the code to execute by adding the /e modifier, it still wouldn't do what you want (since the replacement value would be the return value of the push call), and so it's better to just move that code outside the substitution.
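
    A minimal sketch of the two behaviours described above, using the asker's pattern. The sub names replace_without_e and replace_with_e are made up for illustration. Without /e the replacement is an interpolated string; with /e it is executed, but each match is then replaced by the return value of push (the new element count), and the matches do not overlap.

```perl
use strict;
use warnings;

# Without /e: the replacement is an interpolated string, not code.
# @dic is empty, so it interpolates to nothing, and nothing is pushed.
sub replace_without_e {
    my ($line) = @_;
    my @dic;
    $line =~ s/(([A-Z])[A-Z]{2})/push(@dic,$1);/g;
    return ($line, scalar @dic);
}

# With /e: the replacement is executed as Perl code. push returns the
# new number of elements, and that number replaces the matched text.
sub replace_with_e {
    my ($line) = @_;
    my @dic;
    $line =~ s/(([A-Z])[A-Z]{2})/push(@dic,$1);/ge;
    return ($line, @dic);
}

my ($plain, $count) = replace_without_e('ABCDE');
print "$plain ($count pushed)\n";     # push(,ABC);DE (0 pushed)

my ($evaled, @dic) = replace_with_e('ABCDEF');
print "$evaled (@dic)\n";             # 12 (ABC DEF)
```

    Either way the overlapping triplets BCD and CDE are never seen, because s///g resumes after the end of the previous match.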

    Since you're matching three letters with your regular expression, and you want to replace those with the last two of those three letters, it's easier to just write it that way:

    while(length($line)>2){
        $line =~ s/([A-Z]([A-Z]{2}))/$2/;
        push(@dic, $1);
    }

    I'm sure other monks will have (TI)MTOWTDI and more elegant solutions, but the above gets what you want with only a few changes.

      Thanks for the info, I have tried your solution but it doesn't seem to work :/

      Also I would like to know the way to do it without the while loop. That is, editing the string and saving as I have described...

        ... it doesn't seem to work :/

        But what does that mean? In general, replies along the lines of "it doesn't work" are not helpful. How does it "not work"?

        It seems you've edited your node to remove the while loop you originally had. Please don't do that without marking your updates, because it confuses things: now monks won't know which version of your question to answer.

        ... it doesn't seem to work

        In what way? Do you get an error, or are you seeing unexpected results? Because it works for me:

        use Data::Dumper;
        print Dumper([build_dictionnary()]);

        sub build_dictionnary{
            my $line="ABCDEF";
            my @dic;
            while(length($line)>2){
                $line =~ s/([A-Z]([A-Z]{2}))/$2/;
                push(@dic, $1);
            }
            return @dic;
        }
        # Output (whitespace compressed):
        # $VAR1 = [ 'ABC', 'BCD', 'CDE', 'DEF' ];

        I would like to know the way to do it without the while loop.

        Why?

Re: substitution in regular expression
by Laurent_R (Canon) on Apr 23, 2014 at 21:37 UTC
    A regex might not be the best way. And make no mistake about whether you are using a loop or not. When you use the s///g operator (i.e. with the g modifier), you are in effect doing an implicit loop, even if it does not appear to be the case. Likewise, when you use the grep or map function, it may look as if you are not looping over the source list or array, but you are in fact doing an implicit loop (and the explicit loop of a for/foreach solution might often be slightly quicker).
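
    To make the implicit loop visible, here is a small sketch that collects the triplets twice with the same lookahead pattern: once with m//g in list context, and once with an explicit while loop over m//g in scalar context. The sub names are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Implicit loop: m//g in list context returns every capture at once.
sub triplets_list_context {
    my ($s) = @_;
    return $s =~ /(?=(...))/g;
}

# Explicit loop: m//g in scalar context finds one match per iteration
# and remembers where to resume via pos($s).
sub triplets_explicit_loop {
    my ($s) = @_;
    my @triplets;
    while ($s =~ /(?=(...))/g) {
        push @triplets, $1;
    }
    return @triplets;
}

print join(' ', triplets_list_context('ABCDEF')), "\n";    # ABC BCD CDE DEF
print join(' ', triplets_explicit_loop('ABCDEF')), "\n";   # ABC BCD CDE DEF
```

    Both versions walk the string position by position; the first merely hides the iteration inside the operator.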

    All this to introduce the fact that I will propose a rather concise solution with an explicit loop in the following Perl one-liner:

    $ perl -le 'my $s = "ABCDEF"; print substr $s, $_, 3 for 0..length($s)-3;'
    ABC
    BCD
    CDE
    DEF

    I did not check, but it is likely to be faster than any regex on large data input. Check it and tell your teacher about your findings on the various solutions; you might get an A+.
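
    For anyone who wants to run that check, a rough sketch using the core Benchmark module follows. The sub names (triplets_substr, triplets_regex, compare_speed) and the test input are assumptions made for illustration, and no particular result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# substr-based extraction, as in the one-liner above.
sub triplets_substr {
    my ($s) = @_;
    return map { substr $s, $_, 3 } 0 .. length($s) - 3;
}

# Lookahead-based extraction, as in the earlier regex replies.
sub triplets_regex {
    my ($s) = @_;
    return $s =~ /(?=(...))/g;
}

# Run each candidate for at least one CPU second and print a comparison.
sub compare_speed {
    my ($input) = @_;
    cmpthese(-1, {
        substr => sub { my @t = triplets_substr($input) },
        regex  => sub { my @t = triplets_regex($input) },
    });
}
```

    Calling compare_speed('ABCDEF' x 1000) prints a rate table; the numbers will vary with the Perl version and the machine.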

      Thanks for the additional idea and explanations. Good to see you have a sense of humor as well ;)
Re: substitution in regular expression
by trizen (Hermit) on Apr 23, 2014 at 22:15 UTC
    One-line solution: "ABCDEF" =~ /([A-Z]{3})(?{print "$1\n"})(?!)/;
      Thanks for the idea, I'll write it down. Just one thing, could you explain the:
      (?{print "$1\n"})(?!)
      I don't understand the question mark before the print block. Also why the (?!) at the end. I noticed that removing it only prints ABC, but I don't understand why. Thank you

        Short explanation: (?{...}) executes arbitrary Perl code inside a regular expression, and (?!) makes the regex engine fail and backtrack, so it retries the match from the last starting position + 1. When it matches ABC, it prints it, fails, backtracks and tries the next three letters starting from B, giving us BCD. The process repeats until the engine's internal position reaches the end of the string.
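
        The same mechanism can be sketched with the matches collected into an array instead of printed. The names below are hypothetical, and a package variable is used for the array because of the lexical-variable caveat mentioned earlier in the thread.

```perl
use strict;
use warnings;

our @found;    # package variable, safe to touch from inside (?{ ... })

# With (?!): every attempt is forced to fail after the code block runs,
# so the engine retries at each following start position.
sub triplets_backtrack {
    my ($s) = @_;
    local @found = ();
    $s =~ /([A-Z]{3})(?{ push @found, $1 })(?!)/;
    return @found;
}

# Without (?!): the first attempt succeeds and the engine stops there.
sub triplets_first_only {
    my ($s) = @_;
    local @found = ();
    $s =~ /([A-Z]{3})(?{ push @found, $1 })/;
    return @found;
}

print join(' ', triplets_backtrack('ABCDEF')), "\n";    # ABC BCD CDE DEF
print join(' ', triplets_first_only('ABCDEF')), "\n";   # ABC
```

        The overall match in triplets_backtrack always fails; only the side effects of the code block are of interest.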

        I know, I'm really bad at explaining things to humans, but, fortunately, Athanasius explained this better once.

        Please see: Re: RegEx + vs. {1,}