AdamtheKiwi has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I'm writing a little script to run through a (large) list of files and test which of the text files contain ^M characters. In doing this, I've used the -T test to narrow the search down. The code (fragment) looks something like this:

foreach my $folder (@folders) {
    my $folderfullpath = $componentrootpath . "/" . $folder;
    print "Working with folder " . $folderfullpath . "\n" if ($debug);
    opendir(my $folderhandle, $folderfullpath);
    foreach my $element (readdir($folderhandle)) {
        next if ($element =~ /^\./);
        my $quotedfullelementpath = "\"" . $folderfullpath . "/" . $element . "\"";
        print "\n *** Full element path: " . $quotedfullelementpath . "\n" if ($debug);
        next unless (-T $fullelementpath);
        open (FILE, "<" . $quotedfullelementpath) or die "Can't open file $quotedfullelementpath:$!\n";
        while (<FILE>) {
            binmode(FILE);
            my $line1 = unpack("H*", $_);
            if ($line1 =~ /0d/) {
                # File is contaminated;
                print $quotedfullelementpath . " - contaminated\n" if ($debug);
                my ($elementrootname, $elementuuid) = &nameanduuid($fullelementpath);
                push (@contaminateduuids, $elementuuid) unless ($contaminateduuids{$elementuuid});
                $contaminateduuids{$elementuuid} = 1;
                last;
            }
        }
        close (FILE);
    }
}

I introduced the quotes because one of the teams whose code I'm reporting on uses spaces in their filenames. Unfortunately, the quotes seem to nullify the -T test. I tried adding in this piece of debug code:

my $quotedfullelementpath = "\"" . $folderfullpath . "/" . $element . "\"";
my $singlefullelementpath = "'" . $folderfullpath . "/" . $element . "'";
my $fullelementpath = $folderfullpath . "/" . $element;
print "-> " .$quotedfullelementpath . " is ";
print "NOT " unless (-T $quotedfullelementpath);
print "a text file\n";
print "-> " .$singlefullelementpath . " is ";
print "NOT " unless (-T $singlefullelementpath);
print "a text file\n";
print "-> " .$fullelementpath . " is ";
print "NOT " unless (-T $fullelementpath);
print "a text file\n";

and got this output (fragment):

-> "<shortened-path>/ra6535_5.ddl" is NOT a text file
-> '<shortened-path>/ra6535_5.ddl' is NOT a text file
-> <shortened-path>/ra6535_5.ddl is a text file
-> "<shortened-path>/ra6535_6.ddl" is NOT a text file
-> '<shortened-path>/ra6535_6.ddl' is NOT a text file
-> <shortened-path>/ra6535_6.ddl is a text file
-> "<shortened-path>/ra6535_7.ddl" is NOT a text file
-> '<shortened-path>/ra6535_7.ddl' is NOT a text file
-> <shortened-path>/ra6535_7.ddl is a text file
-> "<shortened-path>/ra6536_5.ddl" is NOT a text file
-> '<shortened-path>/ra6536_5.ddl' is NOT a text file
-> <shortened-path>/ra6536_5.ddl is a text file
-> "<shortened-path>/ra6536_6.ddl" is NOT a text file
-> '<shortened-path>/ra6536_6.ddl' is NOT a text file
-> <shortened-path>/ra6536_6.ddl is a text file

Clearly, these are all text files.

What gives? Is there another test I could use?

Thanks for your help - Adam...

Re: File test -T (text) and quoted filenames
by jethro (Monsignor) on Feb 11, 2011 at 12:47 UTC

    Are you sure you need the quoting at all? I did a quick test (on linux) and it seemed to work with spaces:

    perl -e ' $t= "test and more.txt"; print "yes" if -T $t' #prints yes

    It should work the same on Windows, I think.

      Testing W32 files names w/spaces:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $fullelementpath = "f:/_wo/haldane.txt";
      if (-T $fullelementpath) {print "passed T\n";}else{print "failed for $fullelementpath\n";}
      # -----------------------
      my $fullelementpath2 = "f:/_wo/haldane space.txt";
      if (-T $fullelementpath2) {print "passed T\n";}else{print "failed for $fullelementpath2\n";}

      Output:

      C:\>pl_test\887602.pl
      passed T
      passed T

      C:\>
Re: File test -T (text) and quoted filenames
by bart (Canon) on Feb 11, 2011 at 13:04 UTC
    A text editor I used to use a lot had a simple heuristic to check if a file is a binary file: "Does it contain any null bytes (=bytes with character code 0)?". It worked very well in practice; but you'll have to be very careful with non-UTF-8 Unicode files...
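
    A minimal sketch of that null-byte heuristic in Perl, for illustration only. The helper name looks_binary and the 4096-byte sample size are assumptions, not something the editor in question is known to use. Note that UTF-16 text is full of NUL bytes, so this check would report such files as binary:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the first $limit bytes of the file
# contain a NUL byte, 0 if they do not, and undef if the file cannot be read.
sub looks_binary {
    my ($path, $limit) = @_;
    $limit ||= 4096;    # sample size is an arbitrary assumption

    # Read raw bytes so no newline translation hides or alters anything.
    open(my $fh, '<:raw', $path) or return;
    my $read = read($fh, my $buf, $limit);
    close($fh);
    return unless defined $read;

    return index($buf, "\0") >= 0 ? 1 : 0;
}

# Report on every file named on the command line.
for my $path (@ARGV) {
    my $verdict = looks_binary($path);
    print "$path: ",
        !defined $verdict ? "unreadable"
      : $verdict          ? "binary (contains a NUL byte)"
      :                     "probably text",
        "\n";
}
```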

    Besides, AFAIK file tests like -T use the raw file name, spaces and all, as a parameter. So does 3 argument open. A peculiar exception is glob, for which a file pattern has to be quoted if it contains spaces.
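
    To illustrate the point about raw file names, here is a small sketch that creates a scratch file with a space in its name and runs -e and -T against both the raw name and the same name wrapped in literal double quotes. The directory and file name are made up for the demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create a scratch file whose name contains a space.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/name with space.txt";

open(my $out, '>', $path) or die "Can't create $path: $!\n";
print {$out} "just some text\n";
close($out);

# The same name wrapped in literal double quotes, as in the question.
my $quoted = qq{"$path"};

# Run the existence and text tests against both spellings of the name.
for my $name ($path, $quoted) {
    printf "%-8s -e: %-3s  -T: %-3s\n",
        ($name eq $path ? 'raw' : 'quoted'),
        (-e $name ? 'yes' : 'no'),
        (-T $name ? 'yes' : 'no');
}

# Three-argument open takes the raw name as well.
open(my $in, '<', $path) or die "Can't open $path: $!\n";
my $first = <$in>;
close($in);
print "read back: $first";
```

    The raw name should pass both tests. The quoted name should fail both, because the quote characters become part of the name Perl looks up and no such file exists.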

Re: File test -T (text) and quoted filenames
by cdarke (Prior) on Feb 11, 2011 at 13:34 UTC
    Quotes around a filename containing spaces would be required with a UNIX shell variable (except inside [[...]]), but not a perl variable.

    Like others, I have been unable to reproduce this and suspect it is a coincidence. ^M is usually \r which has come from a Windows file. So I tried a 450k Windows file on Linux, i.e. with ^M characters, and it reported the file as text. I suspect that ^M (\r) is considered to be text.

    I don't find -T and -B particularly reliable anyway. From perlfunc:
    If too many strange characters (>30%) are found, it's a -B file; otherwise it's a -T file.

    You might be better off searching for a magic number instead (see the UNIX file(1) command).
Re: File test -T (text) and quoted filenames
by Khen1950fx (Canon) on Feb 11, 2011 at 13:42 UTC
    What is "text"? I like merlyn's definition:

    "If there's not much weird stuff, then it looks like text".

    Binary isn't as much fun: null bytes, unusual control characters, bytes with the high bit set. Perl has to guess at what the file is most of the time, and it's right most of the time; however, sometimes it makes mistakes. If a file can't be read, doesn't exist, or is given an incorrect path, then Perl will say that it's not text and not binary. Maybe that's what is happening here. I would do a simple test on a few files to find out what's wrong. I used a simple if:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $fullelementpath = shift @ARGV;

    if ( -T $fullelementpath) {
        print "This is a text file\n";
    }
    else {
        print "This is not a text file\n";
    }

    if ( -B $fullelementpath ) {
        print "This is a binary file\n";
    }
    else {
        print "This is not a binary file\n";
    }
Re: File test -T (text) and quoted filenames
by jwkrahn (Abbot) on Feb 11, 2011 at 18:30 UTC
    open (FILE, "<" . $quotedfullelementpath) or die "Can't open file $quotedfullelementpath:$!\n";
    while (<FILE>) {
        binmode(FILE);
        my $line1 = unpack("H*", $_);
        if ($line1 =~ /0d/)

    You have the binmode in the wrong place. You have to use it right after the open and before the while loop.

    open (FILE, "<" . $quotedfullelementpath) or die "Can't open file $quotedfullelementpath:$!\n";
    binmode(FILE);
    while (<FILE>) {

    Or you could set binmode from open:

    open (FILE, "<:raw", $quotedfullelementpath) or die "Can't open file $quotedfullelementpath:$!\n";
    while (<FILE>) {

    Because you are using unpack you could be matching one byte that ends in a "0" nybble followed by a byte that starts with a "d" nybble, for example "10d2". You just need to search for the byte that has the value "\x0d":

    if (/\x0d/)
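
    A short sketch of that false positive, using made-up sample data: the first line contains the bytes 0x10 0xD2 but no carriage return, while the second really ends in CR LF. It compares the hex-string match with the direct byte match:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data: the first line has no carriage return but contains
# the bytes 0x10 0xD2; the second line really ends in CR LF.
my @samples = (
    [ 'no CR, bytes 10 d2' => "\x10\xd2\n" ],
    [ 'real CR LF ending'  => "text\r\n"   ],
);

for my $sample (@samples) {
    my ($label, $line) = @$sample;
    my $hex = unpack("H*", $line);    # e.g. "10d20a" for the first sample

    # Compare the hex-string test with the direct byte test.
    printf "%-20s hex=%-12s hex match: %-3s  byte match: %-3s\n",
        $label, $hex,
        ($hex  =~ /0d/   ? 'yes' : 'no'),
        ($line =~ /\x0d/ ? 'yes' : 'no');
}
```

    The hex match fires on both samples because "0d" straddles a byte boundary in "10d20a"; the byte match fires only on the line that actually contains a carriage return.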
Re: File test -T (text) and quoted filenames
by TomDLux (Vicar) on Feb 17, 2011 at 05:02 UTC

    Now that you have an answer to the spaces-in-filenames question, why not make things easier in terms of quoting? You have:

    my $quotedfullelementpath = "\"" . $folderfullpath . "/" . $element . "\"";

    Of all the options available to you, you've settled for the most difficult and least attractive. You could use single quotes around the double-quote characters, or the alternate single-quote operator q. You could use a symbolic constant to represent the quote and the slash. Best of all, you could mash it all into a single string using qq! That assumes there was a real reason to put the double quotes in the string.

    my $quotedfullelementpath = '"' . $folderfullpath . '/' . $element . '"';

    my $quotedfullelementpath = q{"} . $folderfullpath . q{/} . $element . q{"};

    use Readonly;
    Readonly my $DQUOTE => q{"};
    Readonly my $SLASH  => q{/};
    my $quotedfullelementpath = $DQUOTE . $folderfullpath . $SLASH . $element . $DQUOTE;

    my $quotedfullelementpath = qq{"$folderfullpath/$element"};

    Why suffer when there are alternatives?

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.