lyapunov has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks, I recently created a bug (which I have corrected). While fixing it I wanted to go down the most Righteous path of Self Learning, and I found multiple answers. The problem is that after looking at them, I realize that I really don't understand the low-level mechanics of how Perl parses and interprets the code. To me, the results were unexpected. I implore you to shine your wisdom on this so I can become a better programmer.

N.B. I started using a different syntax after reading the split section of the Functions Chapter in my most holy bible, Programming Perl 3rd edition, by Larry Wall. I picked up this syntax on page 18 of the same when I first learned Perl and used it ever since until recently.

I was writing a Nagios check to test the timestamps of specific files and make sure log archiving had completed successfully. I have several files per machine, so I was using associative arrays to store the values keyed on machine name and file name, e.g. machine1.messages, machine1.audit and so on. Naturally I wanted to split on the dot so that I could use the two parts as hash keys.

As I mentioned, I found the solution, but when I wrote a program to get insight into the mechanics of how this works, I came across results that were unexpected. At least they were to me.

Here is the code and the output.

#!/usr/bin/perl
$foo="data1.txt";
(my $first, my $second)=split(".",$foo);
printf("Using a double quoted . the answer is first $first, second $second\n");
(my $first, my $second)=split("\.",$foo);
printf("Using an escaped . the answer is first $first, second $second\n");
(my $first, my $second)=split("\\.",$foo);
printf("Using a double quoted, double escaped . the answer is first $first, second $second\n");
(my $first, my $second)=split('.',$foo);
printf("Using a single quoted . the answer is first $first, second $second\n");
(my $first, my $second)=split('\.',$foo);
printf("Using a single quote escaped . the answer is first $first, second $second\n");
(my $first, my $second)=split('\\.',$foo);
printf("Using a single quote double escaped . the answer is first $first, second $second\n");
(my $first, my $second)=split /./, $foo;
printf("Using /./ in split the answer is first $first, second $second\n");
(my $first, my $second)=split /\./, $foo;
printf("Using /\\./ in split the answer is first $first, second $second\n");
(my $first, my $second)=split /\\./, $foo;
printf("Using /\\\\./ in split the answer is first $first, second $second\n");

Here is the output

Using a double quoted . the answer is first , second
Using an escaped . the answer is first , second
Using a double quoted, double escaped . the answer is first data1, second txt
Using a single quoted . the answer is first , second
Using a single quote escaped . the answer is first data1, second txt
Using a single quote double escaped . the answer is first data1, second txt
Using /./ in split the answer is first , second
Using /\./ in split the answer is first data1, second txt
Using /\\./ in split the answer is first data1.txt, second

In short, my questions are:

1) Why doesn't split split on any printable character? In particular it looks like a dot should be passed in the examples one and two of the double quote examples.

2) It makes sense that the double escaped dot works in both the single and double quote versions. But why does the single escaped dot work with single quotes and not with double quotes?

3) What exactly is split splitting on in the last example?

Needless to say, there are some fundamental concepts that are eluding me. It has been a few years since I took compiler construction, but I will do my best to keep up with whatever answers you provide.

Again, thanks so much for your help. I really do want to understand what is happening here.

Re: double quote vs single quote oddities. I need enlightenment
by kennethk (Abbot) on Jul 08, 2010 at 16:31 UTC
    Most of your questions seem to stem from an expectation that split works by finding a string literal you pass it. While this expectation happens to hold for cases like split 'x', 'axbxc';, split actually works on regular expressions. Regular expressions use several metacharacters, one of which is . (it is a wildcard). You may gain some illumination by testing your expressions in code similar to:

    print join "_|_", split /./, $foo;

    To answer more on point for your questions:

    1. As previously stated, . is a regular expression metacharacter. To split on literal periods, use the regular expression /\./ or use the \Q \E combo (e.g. split /\Q.\E/, $foo;) to handle the escaping for you - see Quote and Quote-like Operators.
    2. Double quotes are interpolated by Perl and single quotes are not - again, see Quote and Quote-like Operators. In your case, double quotes process backslash escapes before split ever sees the string. This means "\\." is equivalent to '\.' and "\." is equivalent to '.'. In single quotes, by contrast, a backslash is an ordinary character unless it is followed by another backslash or by a single quote.
    3. In the last example, the regular expression engine is looking for a literal backslash followed by any character.
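
    The three points above can be checked with a minimal sketch like the one below. The helper name fields is hypothetical (it is not from the thread); it prints the pattern string that split actually receives after the quotes have been processed, next to the fields that come back.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split $text on $pattern and
# join the resulting fields with "|" so an empty result is easy to spot.
sub fields {
    my ($pattern, $text) = @_;
    return join '|', split $pattern, $text;
}

my $foo = 'data1.txt';

# Each string below is what split receives after Perl has processed the
# quotes; split then compiles that string as a regular expression.
for my $pattern (".", "\.", "\\.", '.', '\.', '\\.') {
    printf "pattern %-2s => [%s]\n", $pattern, fields($pattern, $foo);
}

# \Q...\E escapes the metacharacter for you.
print join('|', split /\Q.\E/, $foo), "\n";    # data1|txt
```

    The patterns that print as a bare . give an empty field list, and the ones that print as \. give data1|txt, which lines up with the output in the original post.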

    If any of this is unclear, I would be happy to expound further.

      kennethk, thank you for the reply.

      I do realize that . is a metacharacter. I did not know about the interpolation. So what you said makes sense to me up until the single quote double escaped dot, e.g. \\. The split function does split the string into data1 and txt. Shouldn't it work like the last example, where the first variable is the complete filename since the literal \. was not encountered?

      Also, as . is any single printable character, why doesn't split bust the expression apart into NULL and ata1.txt? I did your trick with the join (thanks for that, by the way) but split is apparently not returning anything.

        So what you said makes sense for me up until the single quote doubled escaped

        For the expression split "\\.", 'data.txt', Perl first turns "\\." into the literal string [\.]. The single-quoted '\\.' produces the same string, because \\ is one of the two escapes that single quotes recognize. This string is then fed to the regular expression engine, making both forms equivalent to split /\./, 'data.txt'. The input is split on literal periods and the result is ('data','txt').

        why doesn't split bust the expression apart into NULL and ata1.txt

        You would get your expected result if you included a limit in the call, i.e. split /./, $foo, 2;. Unless an explicit limit is given, split operates on every match. Each of the nine characters in 'data1.txt' matches, so the initial result of the split is a list of ten empty strings. As documented in split,

        empty trailing ones are deleted. (If all fields are empty, they are considered to be trailing.)

        Therefore, it returns an empty list.
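
        A small sketch of the limit behaviour described above. The helper split_report is a hypothetical name introduced only for illustration; a negative limit is used to make split keep the trailing empty fields so they can be counted.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text on /./ with an optional limit and
# return the number of fields plus a "|"-joined view of them.
sub split_report {
    my ($text, $limit) = @_;
    my @fields = defined $limit
        ? split(/./, $text, $limit)
        : split(/./, $text);
    return (scalar @fields, join('|', @fields));
}

my $foo = 'data1.txt';

for my $limit (undef, 2, -1) {
    my ($count, $joined) = split_report($foo, $limit);
    printf "limit %-4s => %2d field(s): [%s]\n",
        defined $limit ? $limit : 'none', $count, $joined;
}
```

        With no limit the count is 0, with a limit of 2 the fields are an empty string and ata1.txt, and with -1 all ten empty strings are kept.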

Re: double quote vs single quote oddities. I need enlightenment
by moritz (Cardinal) on Jul 08, 2010 at 16:33 UTC
    The most important point is that split interprets the first argument as a regex. So split '.', split "." and split /./ are exactly the same.

    The rest of your question is easy to answer if you just print out the string or regex that split sees:

    $ perl -wle 'print "\."'
    .     # so it's the same as /./
    $ perl -wle 'print "\\."'
    \.    # as a regex, matches a literal dot

    In the case of regexes, /\./ matches a literal dot, /\\./ a backslash followed by any character, /\\\./ a backslash followed by a literal dot and so on.
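
    To see the three regexes side by side, here is a rough sketch. The helper matches and its labels are hypothetical, introduced only for this illustration; the test strings are single-quoted so that each backslash in them is a real character.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of the three patterns match $text.
sub matches {
    my ($text) = @_;
    my @hits;
    push @hits, 'literal dot'          if $text =~ /\./;    # one escaped dot
    push @hits, 'backslash + any char' if $text =~ /\\./;   # escaped backslash, then wildcard
    push @hits, 'backslash + dot'      if $text =~ /\\\./;  # escaped backslash, then escaped dot
    return join ', ', @hits;
}

for my $text ('data1.txt', 'a\b', 'a\.b') {
    printf "%-10s matches: %s\n", $text, matches($text);
}
```

    Since data1.txt contains no backslash, /\\./ never matches it, which is why the last split in the original post returned the whole filename as the first field.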

    Update: There's also a special quoting construct for creating regexes, which avoids excessive amounts of backslashes if you want to store a pattern in a variable:

    my $regex = qr{\.}; # matches a literal dot
    my @chunks = split $regex, 'dot.delimited.string';
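
    Building on the qr example, a compiled pattern can also be produced from an arbitrary delimiter string. This is only a sketch: literal_splitter is a hypothetical helper name, and it combines qr with the \Q \E escaping mentioned earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: build a compiled regex that matches $delimiter
# literally, whatever metacharacters it happens to contain.
sub literal_splitter {
    my ($delimiter) = @_;
    return qr/\Q$delimiter\E/;
}

my $dot = literal_splitter('.');
my ($machine, $logfile) = split $dot, 'machine1.messages';
print "$machine / $logfile\n";    # machine1 / messages
```

    The compiled pattern can be reused for every filename without repeating the escaping at each call.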

      Moritz, thank you for the reply and for the qr tip. I just read the "Staying in Control" section of Chapter 5.

      Thanks! That is a big help.

Re: double quote vs single quote oddities. I need enlightenment
by JavaFan (Canon) on Jul 08, 2010 at 21:20 UTC
    If you use [.] to match a literal dot, it doesn't matter what quotes you use. split "[.]", split '[.]' and split /[.]/ are all the same (assuming no overloading).
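
    A quick way to confirm this, as a sketch with a hypothetical helper named bracket_results that runs the same split with each quoting style:

```perl
use strict;
use warnings;

# Hypothetical helper: split $text on [.] written three different ways and
# return each result as a "|"-joined string.
sub bracket_results {
    my ($text) = @_;
    return (
        join('|', split("[.]", $text)),    # double quotes
        join('|', split('[.]', $text)),    # single quotes
        join('|', split(/[.]/, $text)),    # regex literal
    );
}

print "$_\n" for bracket_results('data1.txt');    # data1|txt, three times
```

    Inside a character class the dot is not a wildcard, so no backslash is needed and the quoting rules have nothing to alter.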