Hello monks, I have an interesting dilemma on my hands.

I'm trying to find a way to determine whether some arbitrary blob of data is text or just binary. I used to have an old "is_binary()" method, which just looks for bytes that fall outside the 7-bit ASCII range (0 to 127), but that doesn't work when the string contains Unicode characters, because their UTF-8 encoding uses bytes outside the ASCII range.

sub is_binary {
    my $content = shift;
    if (!defined($content)) {
        return 0;
    }

    # Treat any byte above 127 (outside 7-bit ASCII) as a sign of binary data.
    my @char = unpack("C*", $content);
    foreach my $byte (@char) {
        if ($byte > 127) {
            return 1;
        }
    }
    return 0;
}
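To make the failure concrete, here is a small sketch of the same byte test applied to plain ASCII and to a single katakana character encoded as UTF-8. The helper name has_high_bytes is hypothetical and only stands in for the old check; it is an illustration, not the code used in the project.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for the old check: 1 if any byte is above 127.
sub has_high_bytes {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    return $bytes =~ /[^\x00-\x7F]/ ? 1 : 0;
}

my $ascii = "hello world";
my $kana  = encode("UTF-8", "\x{30D6}");    # U+30D6 (katakana BU) as UTF-8 bytes

# List the bytes of the encoded character in hex.
my $hex = join(" ", map { sprintf("%02X", ord($_)) } split(//, $kana));

printf "ascii flagged as binary: %d\n", has_high_bytes($ascii);
printf "kana  flagged as binary: %d (bytes: %s)\n", has_high_bytes($kana), $hex;
```

The katakana character is perfectly good text, yet all three of its encoded bytes (E3 83 96) are above 127, so the old test reports it as binary.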

Here's a script I'm using for testing, to try to figure out a way to detect whether data is UTF-8 or just random binary:

#!/usr/bin/perl -w

use strict;
use warnings;
use utf8;

use lib "www/siikir/cms/src";
use Siikir::Util;

binmode(STDOUT, ":utf8");

# Valid UTF-8 strings
my @valid = @{ Siikir::Util::utf8_decode([
    "hello world",
    "Hello!\nWorld!",
    "My favorite pokemon is ブラッキー",
    "No, エーフィ is better than ブラッキー!",
    "ミュウツー ミュウツー",
])};

# Create some invalid strings.
my @invalid = (
    scalar(`cat /usr/bin/vim`),
    scalar(`cat /usr/share/pixmaps/xchat.png`),
    # join, not scalar: map in scalar context returns only the element count
    join("", map { chr(hex($_)) } qw/0xFF 0x4C 0x3D 0x10 0x27 0x78 0xED/),
);
chomp(@invalid);

print "Testing valid strings...\n";
foreach my $v (@valid) {
    my $pass = is_binary($v);
    print "Str: $v (pass: $pass)\n";
}

print "Testing invalid strings...\n";
foreach my $i (@invalid) {
    my $pass = is_binary($i);
    print "Pass: $pass\n";
}

sub is_binary {
    my $data = shift;

    # # Valid UTF-8? Fail: gives a pass to everything.
    # use Test::utf8;
    # if (is_valid_string($data)) {
    #     return "true";
    # }
    # return "false";

    # Sane UTF-8? Fail: gives a pass to a PNG image
    # use Test::utf8;
    # if (is_sane_utf8($data)) {
    #     return "true";
    # }
    # return "false";

    # Valid UTF-8?
    if (utf8::is_utf8($data)) {
        return "true";
    }
    return "false";
}

(The utf8_decode function used above can be found in another one of my PerlMonks posts, JSON, UTF-8 and Filehandles.)

The only reliable method I've found so far is to rely on the is_utf8 flag, on the assumption that most valid strings throughout the code have already been properly decoded and so carry the UTF-8 flag.
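One candidate to compare against the flag check is a strict decode of the raw bytes, since the flag describes how Perl stores a string internally rather than whether the original bytes were well-formed. What follows is only a minimal sketch, assuming the blob arrives as undecoded octets. The name looks_like_utf8_text is hypothetical, and treating control characters as a sign of binary data is an assumption of mine, not an established rule.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if $bytes decodes as strict UTF-8 and holds
# no control characters other than tab, LF and CR; returns 0 otherwise.
# Assumes $bytes is a raw octet string, not an already-decoded string.
sub looks_like_utf8_text {
    my ($bytes) = @_;
    return 0 unless defined $bytes;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $bytes unmodified. eval turns the die into an undef result.
    my $text = eval { decode("UTF-8", $bytes, FB_CROAK | LEAVE_SRC) };
    return 0 unless defined $text;

    # Assumption: NUL, other C0 controls and DEL suggest binary data.
    return 0 if $text =~ /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/;
    return 1;
}

my %samples = (
    "plain ascii"   => "hello world",
    "utf-8 kana"    => "\xE3\x83\x96",                  # U+30D6 encoded as UTF-8
    "invalid bytes" => "\xFF\x4C\x3D\x10\x27\x78\xED",  # same bytes as the test script
    "embedded NUL"  => "abc\x00def",
);

foreach my $name (sort keys %samples) {
    printf "%-14s %d\n", $name, looks_like_utf8_text($samples{$name});
}
```

This is a heuristic: a short run of random bytes can still happen to be well-formed UTF-8, so a pass here means "plausibly text", not a guarantee.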

Is there a better way?


In reply to Can't tell if UTF-8... or just binary... by Kirsle
