Hue-Bond has asked for the wisdom of the Perl Monks concerning the following question:

I've found that glob sometimes needs two backslashes in order to recognize some special characters in file names. I only tested it with []{} so far, but will do further testing. First let's prepare the field:

$ mkdir 'a{bc.d'; touch 'a{bc.d/GOT ITabcde'
$ mkdir 'a {bc.d'; touch 'a {bc.d/GOT ITabcde'
$ ls
a {bc.d  a{bc.d

Now the game begins:

$ cat foo
$a=q|a\ \\\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
$ perl foo
pattern <a\ \\{bc.d/*> globs to <a {bc.d/GOT ITabcde>

As soon as I prepend only one backslash to the opening curly, glob fails to expand the directory:

$ cat foo
$a=q|a\ \{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
$ perl foo
pattern <a\ \{bc.d/*> globs to <a >

On the other hand, it's odd that, if the file name doesn't contain spaces, a single backslash suffices:

$ cat foo
$a=q|a\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
$ perl foo
pattern <a\{bc.d/*> globs to <a{bc.d/GOT ITabcde>

In fact, using three makes glob produce no output:

$ cat foo
$a=q|a\\\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
$ perl foo
pattern <a\\{bc.d/*> globs to <a\>

This happens in all these 5 systems:

I think that the shell must be involved at some point when the pattern has spaces, so the two backslashes become one. The doc of File::Glob says something but I'm not sure it's related ("Due to historical reasons, CORE::glob() will also split its argument on whitespace, treating it as multiple patterns").
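The whitespace splitting quoted from the File::Glob docs can be seen in isolation. This is only a sketch, assuming a Unix-like system and a File::Glob that exports bsd_glob; the scratch directory and the "one two" pattern are made up for the demonstration. Core glob treats a pattern containing a space as two patterns, while bsd_glob takes the whole string as a single pattern, so there a single backslash before the curly should be enough.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Scratch directory with the same layout as in the question (hypothetical path).
my $dir = tempdir(CLEANUP => 1);
mkdir "$dir/a {bc.d" or die "mkdir: $!";
open my $fh, '>', "$dir/a {bc.d/GOT ITabcde" or die "open: $!";
close $fh;

# Core glob splits its argument on whitespace: one string, two patterns.
my @parts = glob("one two");

# bsd_glob does not split, so the string stays a single pattern.
my @whole = bsd_glob("one two");

# No splitting pass here, so one backslash is all that quotes the curly.
my @found = bsd_glob("$dir/a \\{bc.d/*");

print "glob:     <$_>\n" for @parts;
print "bsd_glob: <$_>\n" for @whole;
print "found:    <$_>\n" for @found;
```

If the extra parsing pass is the culprit, calling bsd_glob directly is one way to sidestep it for names with spaces.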

FWIW, perl, version 5.005_02 built for PA-RISC1.1 gives "internal error: glob failed at foo line 1." on HP-UX td192 B.11.11 when it sees more than one backslash in a row. All four tests work as expected (i.e., perform the glob correctly) if I use just one backslash. I'm downloading some old versions of Debian to test 5.6 and 5.004.

Ideas?

Update: Added perl 5.6.1.

--
David Serrano

Re: glob with special characters
by shmem (Chancellor) on Oct 04, 2006 at 06:42 UTC

    No shell is involved. If you strace perl (strace -f -ff on linux), you'll see that no shell is spawned.

    AFAIK perl now uses File::Glob internally (1), and in fact foo.pl as

    $a=q|a\ \\\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
    print "$_\n" for sort keys %INC;

    yields

    pattern <a\ \\{bc.d/*> globs to <a {bc.d/GOT ITabcde>
    Carp.pm
    Exporter.pm
    File/Glob.pm
    Text/ParseWords.pm
    XSLoader.pm
    strict.pm
    vars.pm
    warnings.pm
    warnings/register.pm

    while running

    $a=q|a\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
    print "$_\n" for sort keys %INC;

    results in

    pattern <a\{bc.d/*> globs to <a{bc.d/GOT ITabcde>
    File/Glob.pm
    XSLoader.pm
    strict.pm

    I suspect a subtlety (bug?) in Text::ParseWords or its usage. The behaviour you describe (which I confirm for my systems) looks like a two-pass parsing whenever a whitespace is present - in that case the \{ seems to get "optimized away" (running with -Dcr):

    Guessing start of match, REx "\\(.)" against "\{bc.d/*"...
    Found anchored substr "\" at offset 0...
    Guessed: match at offset 0
    Matching REx "\\(.)" against "\{bc.d/*"
      Setting an EVAL scope, savestack=56
       0 <> <\{bc.d/*>        |  1:  EXACT <\\>
       1 <\> <{bc.d/*>        |  3:  OPEN1
       1 <\> <{bc.d/*>        |  5:  SANY
       2 <\{> <bc.d/*>        |  6:  CLOSE1
       2 <\{> <bc.d/*>        |  8:  END
    Match successful!
    Guessing start of match, REx "\\(.)" against "bc.d/*"...
    Did not find anchored substr "\"...
    Match rejected by optimizer
    Not present...
    Match failed

    No time to track that down right now... smells like eval involved.

    1) From doio.c:

    =head1 IO Functions

    =for apidoc start_glob

    Function called by C<do_readline> to spawn a glob (or do the glob inside
    perl on VMS). This code used to be inline, but now perl uses C<File::Glob>
    this glob starter is only used by miniperl during the build process.
    Moving it away shrinks pp_hot.c; shrinking pp_hot.c helps speed perl up.

    =cut
    */

    PerlIO *
    Perl_start_glob (pTHX_ SV *tmpglob, IO *io)

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      I suspect a subtlety (bug?) in Text::ParseWords or its usage.

      The documentation of this module is quite explicit about how it works:

      The $keep argument is a boolean flag. If true, then the tokens are split on the specified delimiter, but all other characters (quotes, backslashes, etc.) are kept in the tokens. If $keep is false then the &*quotewords() functions remove all quotes and backslashes that are not themselves backslash-escaped or inside of single quotes (i.e., &quotewords() tries to interpret these characters just like the Bourne shell).

      The *quotewords functions all call parse_line, which is the one that performs the real job. File::Glob calls parse_line with a $keep argument of 0:

      if ($pat =~ /\s/) {
          # XXX this is needed for compatibility with the csh
          # implementation in Perl. Need to support a flag
          # to disable this behavior.
          require Text::ParseWords;
          @pat = Text::ParseWords::parse_line('\s+',0,$pat);
      }

      So, knowing this, it comes as no surprise that some backslashes are being eaten. As soon as I replace the 0 with a 1, the behaviour of glob begins to match my expectations. What bothers me is that File::Glob is one of those pieces of software so widely used that it's impossible that this humble little programmer has found a bug in it :^).
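The backslash eating can be watched without glob at all. What follows is just a sketch: split_pattern is a made-up helper name, and it only mirrors the parse_line call quoted above, with the $keep argument exposed so both settings can be compared on the patterns from the original question.

```perl
use strict;
use warnings;
use Text::ParseWords ();

# Hypothetical helper mirroring the call in File::Glob quoted above,
# but with $keep passed in instead of being fixed to 0.
sub split_pattern {
    my ($pat, $keep) = @_;
    return Text::ParseWords::parse_line('\s+', $keep, $pat);
}

# Same q|| literals as in the question: one and two backslashes before the curly.
for my $pat (q|a\ \{bc.d/*|, q|a\ \\\{bc.d/*|) {
    print "pattern  <$pat>\n";
    print "  keep=0 <$_>\n" for split_pattern($pat, 0);
    print "  keep=1 <$_>\n" for split_pattern($pat, 1);
}
```

With $keep false, one level of backslashes is removed before the pieces reach the actual glob: the pattern written with two backslashes comes out with one left to quote the curly, and the pattern written with one comes out with none. With $keep true, the pieces pass through untouched.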

      --
      David Serrano

        Hue-Bond++

        will you file a bug report on this?

        abrazo,
        --shmem

        _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                      /\_¯/(q    /
        ----------------------------  \__(m.====·.(_("always off the crowd"))."·
        ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
        400th post