in reply to Perl -T vs Mime::Types

-T examines only the first block or so of the file (see -X). File::Type just guesses a file type by searching for a few magic numbers, like file(1). Neither can be reliable.

If you want to check for a file that contains only ASCII characters, you have to check the entire file. There is no other way.

I guess you also want to check for a sane file size, perhaps a few hundred kBytes or a few MBytes. On a modern computer, slurping an entire file within that limit is no big problem.

You may want something like this (untested):

    -f $filename or die "$filename is not a file";
    # avoid a second stat() syscall by using the special handle "_"
    (-s _ < 100_000) or die "$filename is too large";
    my $blob=do {
        open my $f,'<:raw',$filename or die "Can't open $filename: $!";
        local $/; # slurp mode
        <$f>; # slurp
        # leaving the do block auto-closes $f
    };
    # Accept only CR, LF, TAB, and printable characters from 0x20 to 0x7E.
    $blob=~/^[\r\n\t\x20-\x7E]*$/s or die "$filename is not ASCII";

If you want significantly larger files, you have to read smaller blocks (perhaps 1 MByte each), and check each block for its "ASCIIness". Abort at the first failed block.
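
A minimal sketch of that block-wise check (equally untested against real data). The helper name is_ascii_file and its optional block-size argument are made up for illustration; the 1 MByte default and the set of accepted characters are taken from the text above.

```perl
use strict;
use warnings;

# is_ascii_file() is a hypothetical helper, not part of any module.
# Returns 1 if every byte is CR, LF, TAB, or in 0x20..0x7E, else 0.
sub is_ascii_file {
    my ($filename, $block_size) = @_;
    $block_size ||= 1024 * 1024;    # assumed default: 1 MByte per block

    open my $fh, '<:raw', $filename or die "Can't open $filename: $!";
    while (1) {
        my $block;
        my $got = read $fh, $block, $block_size;
        defined $got or die "Can't read $filename: $!";
        last if $got == 0;          # end of file reached
        # Abort at the first block that contains any other byte.
        return 0 unless $block =~ /^[\r\n\t\x20-\x7E]*$/s;
    }
    return 1;                       # $fh is closed when it goes out of scope
}
```

It could be used like the slurping version: is_ascii_file($filename) or die "$filename is not ASCII";. Because each block is checked on its own, memory use stays at roughly one block regardless of the file size.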

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^2: Perl -T vs Mime::Types
by AnomalousMonk (Archbishop) on Sep 20, 2017 at 00:43 UTC

    tr/// may be a bit faster than a regex match, so maybe (also untested):
        $blob =~ tr/\r\n\t\x20-\x7E//c or die "$filename is not ASCII";
    (See perlop Quote-Like Operators for tr/// and its /c (complement) modifier.)

    Update: Correction: the logical operator should be "and", because we want an exception to be thrown if any "non-ASCII" character is found, i.e., if the tr///c count is non-zero:
        $blob =~ tr/\r\n\t\x20-\x7E//c and die "$filename is not ASCII";
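
To see why the count matters, here is a small sketch. The helper name count_non_ascii is made up for illustration. With an empty replacement list and no /d modifier, tr///c only counts the characters outside the listed set and leaves the string unchanged.

```perl
use strict;
use warnings;

# count_non_ascii() is a hypothetical helper for illustration only.
sub count_non_ascii {
    my ($blob) = @_;
    # /c complements the set, so this counts everything that is NOT
    # CR, LF, TAB, or in the range 0x20..0x7E.
    return $blob =~ tr/\r\n\t\x20-\x7E//c;
}

print count_non_ascii("plain text\r\n"), "\n";   # prints 0
print count_non_ascii("caf\xE9 \x00"), "\n";     # prints 2 (0xE9 and 0x00)
```

A count of 0 is false, so "and die" fires only when at least one offending character was found.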


    Give a man a fish:  <%-{-{-{-<