glob with special characters

Hue-Bond has asked for the wisdom of the Perl Monks concerning the following question:

I've found that glob sometimes needs two backslashes in order to recognize some special characters in file names. I only tested it with []{} so far, but will do further testing. First let's prepare the field:

$ mkdir 'a{bc.d';  touch 'a{bc.d/GOT ITabcde'
$ mkdir 'a {bc.d'; touch 'a {bc.d/GOT ITabcde'
$ ls
a {bc.d     a{bc.d

Now the game begins:

$ cat foo
$a=q|a\ \\\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
$ perl foo
pattern <a\ \\{bc.d/*> globs to <a {bc.d/GOT ITabcde>

As soon as I prepend only one backslash to the opening curly, glob fails to expand the directory:

$ cat foo
$a=q|a\ \{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
$ perl foo
pattern <a\ \{bc.d/*> globs to <a >

On the other hand, it's odd that, if the file name doesn't contain spaces, a single backslash suffices:

$ cat foo
$a=q|a\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
$ perl foo
pattern <a\{bc.d/*> globs to <a{bc.d/GOT ITabcde>

In fact, using three there makes glob fail; it returns just a\ instead of the file:

$ cat foo
$a=q|a\\\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
$ perl foo
pattern <a\\{bc.d/*> globs to <a\>

This happens on all five of these systems:

This is perl, v5.8.4 built for ia64-linux-thread-multi on Debian 3.1
This is perl, v5.8.8 built for i586-linux-thread-multi on SUSE Linux Enterprise Server 10
This is perl, v5.8.5 built for i386-linux-thread-multi on Red Hat Enterprise Linux ES release 4
This is perl, v5.8.8 built for ia64-freebsd on FreeBSD 6.1-RELEASE
This is perl, v5.6.1 built for i386-linux on Debian 3.0

I think that the shell must be involved at some point when the pattern has spaces, so the two backslashes become one. The doc of File::Glob says something but I'm not sure it's related ("Due to historical reasons, CORE::glob() will also split its argument on whitespace, treating it as multiple patterns").
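One way to check whether the whitespace splitting is what matters: File::Glob also exports bsd_glob(), which according to its documentation treats the whole string as a single pattern. Below is a minimal sketch, assuming a scratch directory created with File::Temp (the names $base and @found are only illustrative). It uses no backslash before the space and a single one before the brace:

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Scratch directory holding one entry whose name has a space and a brace
# (illustrative setup, mirroring the 'a {bc.d' directory above).
my $base = tempdir(CLEANUP => 1);
mkdir "$base/a {bc.d" or die "mkdir failed: $!";
open my $fh, '>', "$base/a {bc.d/GOT ITabcde" or die "open failed: $!";
close $fh;

# bsd_glob() takes the string as one pattern, so the space is left alone;
# the single backslash makes the opening brace literal.
my @found = bsd_glob("$base/a \\{bc.d/*");
print "found <$_>\n" for @found;
```

If this finds the file with a single backslash, the extra backslash needed by plain glob would have to come from whatever happens to the pattern before the real globbing starts.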

FWIW, perl, version 5.005_02 built for PA-RISC1.1 on HP-UX td192 B.11.11 gives "internal error: glob failed at foo line 1." when it sees more than one backslash in a row. All four tests work as expected (i.e., perform the glob correctly) if I use just one backslash. I'm downloading some old versions of Debian to test 5.6 and 5.004.

Ideas?

Update: Added perl 5.6.1.

--
David Serrano

Re: glob with special characters by shmem (Chancellor) on Oct 04, 2006 at 06:42 UTC
No shell is involved. If you strace perl (`strace -f -ff` on linux), you'll see that no shell is spawned. AFAIK perl now uses File::Glob internally¹, and in fact `foo.pl` as

$a=q|a\ \\\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
print "$_\n" for sort keys %INC;

yields

pattern <a\ \\{bc.d/*> globs to <a {bc.d/GOT ITabcde>
Carp.pm
Exporter.pm
File/Glob.pm
Text/ParseWords.pm
XSLoader.pm
strict.pm
vars.pm
warnings.pm
warnings/register.pm

while running

$a=q|a\{bc.d/*|;print "pattern <", $a, "> globs to <", (glob $a), ">\n";
print "$_\n" for sort keys %INC;

results in

pattern <a\{bc.d/*> globs to <a{bc.d/GOT ITabcde>
File/Glob.pm
XSLoader.pm
strict.pm

I suspect a subtlety (bug?) in Text::ParseWords or its usage. The behaviour you describe (which I confirm for my systems) looks like a two-pass parsing whenever a whitespace is present - in that case the `\{` seems to get "optimized away" (running with -Dcr):

Guessing start of match, REx "\\(.)" against "\{bc.d/*"...
Found anchored substr "\" at offset 0...
Guessed: match at offset 0
Matching REx "\\(.)" against "\{bc.d/*"
  Setting an EVAL scope, savestack=56
   0 <> <\{bc.d/*>    |  1:  EXACT <\\>
   1 <\> <{bc.d/*>    |  3:  OPEN1
   1 <\> <{bc.d/*>    |  5:  SANY
   2 <\{> <bc.d/*>    |  6:  CLOSE1
   2 <\{> <bc.d/*>    |  8:  END
Match successful!
Guessing start of match, REx "\\(.)" against "bc.d/*"...
Did not find anchored substr "\"...
Match rejected by optimizer
Not present...
Match failed

No time to track that down right now... smells like eval involved.

¹) From doio.c:

=head1 IO Functions

=for apidoc start_glob

Function called by C<do_readline> to spawn a glob (or do the glob inside
perl on VMS). This code used to be inline, but now perl uses C<File::Glob>
this glob starter is only used by miniperl during the build process.
Moving it away shrinks pp_hot.c; shrinking pp_hot.c helps speed perl up.

=cut
*/

PerlIO *
Perl_start_glob (pTHX_ SV *tmpglob, IO *io)

--shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re^2: glob with special characters by Hue-Bond (Priest) on Oct 06, 2006 at 21:38 UTC
"I suspect a subtlety (bug?) in Text::ParseWords or its usage."

The documentation of this module is quite explicit about how it works:

The $keep argument is a boolean flag. If true, then the tokens are split on the specified delimiter, but all other characters (quotes, backslashes, etc.) are kept in the tokens. If $keep is false then the &quotewords() functions remove all quotes and backslashes that are not themselves backslash-escaped or inside of single quotes (i.e., &quotewords() tries to interpret these characters just like the Bourne shell).

The `quotewords` functions all call `parse_line`, which is the one that does the real work. File::Glob calls `parse_line` with a `$keep` argument of `0`:

if ($pat =~ /\s/) {
    # XXX this is needed for compatibility with the csh
    # implementation in Perl.  Need to support a flag
    # to disable this behavior.
    require Text::ParseWords;
    @pat = Text::ParseWords::parse_line('\s+',0,$pat);
}

So, knowing this, it comes as no surprise that some backslashes are being eaten. As soon as I replace the `0` with a `1`, the behaviour of glob begins to match my expectations.

What bothers me is that File::Glob is one of those pieces of software so widely used that it's impossible that this little humble programmer has found a bug in it :^).

--
David Serrano
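To see the effect in isolation, here is a small sketch that calls parse_line directly with both values of $keep on the two whitespace patterns from the original post. The helper name tokens() is made up for the example, and the sketch only shows what the tokenizer hands back, not what glob does with it afterwards:

```perl
use strict;
use warnings;
use Text::ParseWords ();

# Hypothetical helper: split a pattern the way File::Glob does, with a
# selectable $keep flag.
sub tokens {
    my ($keep, $pat) = @_;
    return Text::ParseWords::parse_line('\s+', $keep, $pat);
}

# The two patterns from the original post, written with q|| as there.
my @patterns = (
    q|a\ \\\{bc.d/*|,   # two backslashes before the brace once q|| is done
    q|a\ \{bc.d/*|,     # one backslash before the brace
);

for my $pat (@patterns) {
    print "input  <$pat>\n";
    print "keep=0 <$_>\n" for tokens(0, $pat);   # what File::Glob asks for
    print "keep=1 <$_>\n" for tokens(1, $pat);   # backslashes left in place
}
```

With $keep set to 0, the single-backslash pattern comes back as `a {bc.d/*`, so the brace reaches the globbing code unescaped, while the double-backslash pattern comes back as `a \{bc.d/*` and still carries one backslash in front of it. With $keep set to 1, both come back unchanged.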
Re^3: glob with special characters by shmem (Chancellor) on Oct 06, 2006 at 21:44 UTC
Hue-Bond++

Will you file a bug report on this?

abrazo,
--shmem

400th post
Re^4: glob with special characters by Hue-Bond (Priest) on Oct 07, 2006 at 18:17 UTC