roperl has asked for the wisdom of the Perl Monks concerning the following question:

Which method is better for testing whether a file is a plain ASCII text file? I'm already checking the MIME types of non-text files with File::Type.
Should I check the text files with Perl's -T like so:
if (-T $file) {
    print "$file is an ascii text file \n";
} else {
    print "Not an ascii text file \n";
}
Or check that it matches application/octet-stream MIME type like so:
use File::Type;

my $ft   = File::Type->new();
my $type = $ft->mime_type($file);
if ( $type eq "application/octet-stream" ) {
    # do this..
} else {
    # do that..
}

Re: Perl -T vs Mime::Types
by afoken (Chancellor) on Sep 19, 2017 at 19:38 UTC

    -T examines only the first block or so of the file (see -X). File::Type just guesses a file type by searching for a few magic numbers, like the file utility. Neither can be reliable.

    If you want to check for a file that contains only ASCII characters, you have to check the entire file. There is no other way.

    I guess you also want to check for a sane file size, perhaps a few hundred kBytes or a few MBytes. On a modern computer, slurping the entire file within that limit is no big problem.

    You may want something like this (untested):

    -f $filename or die "$filename is not a file";
    (-s _ < 100_000) or die "$filename is too large"; # avoid a second stat() syscall by using the special handle "_"
    my $blob=do {
        open my $f,'<:raw',$filename or die "Can't open $filename: $!";
        local $/; # slurp mode
        <$f>;     # slurp
        # leaving the do block auto-closes $f
    };
    # Accept only CR, LF, TAB, and printable characters from 0x20 to 0x7E.
    $blob=~/^[\r\n\t\x20-\x7E]*$/s or die "$filename is not ASCII";

    If you want significantly larger files, you have to read smaller blocks (perhaps 1 MByte each), and check each block for its "ASCIIness". Abort at the first failed block.
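The block-wise approach described above could be sketched roughly as follows (lightly tested only). The helper name is_ascii_file and the optional block-size parameter are made up for illustration; the accepted byte set is the same CR, LF, TAB and 0x20 to 0x7E range used in the slurping version.

```perl
use strict;
use warnings;

# Hypothetical helper (name and parameters are assumptions, not from the thread).
# Returns 1 if every byte of $filename is CR, LF, TAB or printable ASCII
# (0x20-0x7E), and 0 otherwise. Dies if the file cannot be opened or read.
sub is_ascii_file {
    my ($filename, $block_size) = @_;
    $block_size ||= 1024 * 1024;    # assumed default: 1 MByte per read

    open my $fh, '<:raw', $filename
        or die "Can't open $filename: $!";

    while (1) {
        my $got = read($fh, my $block, $block_size);
        defined $got or die "Can't read $filename: $!";
        last if $got == 0;          # end of file reached

        # tr///c counts the bytes outside the allowed set without
        # modifying $block; give up at the first block that has any.
        return 0 if $block =~ tr/\r\n\t\x20-\x7E//c;
    }

    close $fh;
    return 1;
}
```

Because every byte is checked independently, a block boundary can fall anywhere without affecting the result. Note that an empty file counts as ASCII here; whether that is desirable depends on the application.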

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      tr/// may be a bit faster than a m// regex match, so maybe (also untested)
          $blob =~ tr/\r\n\t\x20-\x7E//c or die "$filename is not ASCII";
      (See perlop Quote-Like Operators for tr/// and its /c (complement) modifier.)

      Update: Correction: The logical operator should be "and", because we wish an exception to be thrown if any "non-ASCII" character is found, i.e., if the tr///c count is non-zero:
          $blob =~ tr/\r\n\t\x20-\x7E//c and die "$filename is not ASCII";
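To make the counting behaviour concrete, here is a small sketch. The helper name count_non_ascii is invented for this example. With the /c modifier and an empty replacement list, tr/// only counts the bytes that fall outside the listed set and leaves the string unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes in $blob that are not CR, LF, TAB
# or printable ASCII (0x20-0x7E). The string itself is not modified.
sub count_non_ascii {
    my ($blob) = @_;
    return ($blob =~ tr/\r\n\t\x20-\x7E//c);
}

print count_non_ascii("plain text\r\n"), "\n";          # prints 0
print count_non_ascii("Gr\xC3\xB6\xC3\x9Fe\n"), "\n";   # prints 4 (two 2-byte UTF-8 sequences)
```

A count of zero is false in Perl, which is why the test has to read "count and die" rather than "count or die".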


      Give a man a fish:  <%-{-{-{-<

Re: Perl -T vs Mime::Types
by Athanasius (Archbishop) on Sep 20, 2017 at 07:03 UTC

    Hello roperl,

    CPAN has the module Test::PureASCII, but it’s designed to function in a testing environment:

    use strict;
    use warnings;
    use Test::PureASCII tests => 2;

    for my $file ('ascii.txt', 'umlaut.txt')
    {
        file_is_pure_ascii($file, "Only ASCII in $file");
    }

    Output (with suitable input files in the current directory):

    17:00 >perl 1823_SoPW.pl
    1..2
    ok 1 - Only ASCII in ascii.txt
    not ok 2 - Only ASCII in umlaut.txt
    #   Failed test 'Only ASCII in umlaut.txt'
    #   at 1823_SoPW.pl line 38.
    # non ASCII character sequence 0xc3, 0xb6 at line 5 in umlaut.txt
    # Looks like you failed 1 test of 2.

    17:00 >

    Hope that’s of interest,

    Athanasius <°(((>< contra mundum
    Iustus alius egestas vitae, eros Piratica,

Re: Perl -T vs Mime::Types
by ateague (Monk) on Sep 20, 2017 at 13:27 UTC

    It might be beneficial to take a step back and look at the big picture end goal.

    What is it that you are wanting to do that requires checking to see if a file contains plain ASCII text?

      The program handles input files from various clients. These files can be sent in via sftp or ftp with files encrypted by gpg. Files can also be zipped or compressed with gz. Once the files are either unzipped, decompressed or decrypted I'm expecting a plain ASCII text file. I want to ensure the file is valid before moving it off to another program to handle.
        These files can be sent in via sftp or ftp with files encrypted by gpg. Files can also be zipped or compressed with gz. Once the files are either unzipped, decompressed or decrypted I'm expecting a plain ASCII text file.

        Without knowing all the intricacies and ins and outs of your workflow pipeline, I'd almost say (in my admitted ignorance) that the whole "check for ASCII" test is a bit superfluous.

        Wouldn't the upstream process be responsible for checking if the decryption/decompression was successful, and wouldn't the downstream process be responsible for checking for well-formed data? Is there a particular case you are trying to guard against?

        It seems to me that a plain -T $file should be sufficient to catch a rogue encrypted and/or compressed file that made it past the first process without triggering an error (although what happens if the file is Base64 or otherwise ASCII-armored?)
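The Base64 question can be illustrated with a small sketch. The payload below is a made-up stand-in for a file that was never decoded. Since Base64 output consists only of printable ASCII characters and newlines, it would be expected to pass both -T and a strict byte-level ASCII check.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use MIME::Base64 qw(encode_base64);

# Hypothetical stand-in for an undecoded payload:
# all 256 possible byte values, Base64-encoded.
my $binary  = pack 'C*', 0 .. 255;
my $armored = encode_base64($binary);

# Write the armored data to a temporary file so that -T can inspect it.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} $armored;
close $fh or die "Can't close $path: $!";

my $passes_T      = (-T $path) ? 1 : 0;
my $passes_strict = ($armored =~ tr/\r\n\t\x20-\x7E//c) ? 0 : 1;

print "-T accepts it:                 $passes_T\n";
print "strict ASCII check accepts it: $passes_strict\n";
```

In other words, neither test says anything about whether the content is the data the downstream program expects, only about which bytes it is made of.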

Re: Perl -T vs Mime::Types
by RonW (Parson) on Sep 21, 2017 at 23:37 UTC
    Or check that it matches application/octet-stream MIME type

    Actually, for your purpose, text/plain would be the appropriate type.

    -T is probably good enough to catch any files that aren't yet fully decoded/decompressed/etc.

    Caveats: -T will accept valid UTF-8 encoded data as text. Also, even if the data isn't valid UTF-8, it will still allow some non-ASCII bytes and accept it as ASCII text. (The documentation says up to a third of the examined part of the file.)
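A minimal sketch of the first caveat, assuming a temporary file that is mostly ASCII but contains a UTF-8 encoded "ö" on every line. The sample text is invented for illustration. -T would be expected to accept the file, while a byte-level count finds the non-ASCII bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Mostly ASCII text with one UTF-8 encoded "ö" (bytes 0xC3 0xB6) per line.
my $text = "plain text with one umlaut: \xC3\xB6\n" x 10;

my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} $text;
close $fh or die "Can't close $path: $!";

my $passes_T  = (-T $path) ? 1 : 0;
my $non_ascii = ($text =~ tr/\r\n\t\x20-\x7E//c);   # count of non-ASCII bytes

print "-T accepts it:   $passes_T\n";
print "non-ASCII bytes: $non_ascii\n";
```

So if strictly ASCII-only content is required, -T alone is not enough and a full scan of the file is still needed.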