in reply to Re: How to find Unicode: 0x13 in File
in thread How to find Unicode: 0x13 in File

I may be wrong here, but if I am, I will learn something new :)

Could you not read in a few MB of the file at a time (if it is big enough), unpack each chunk, and then test whether a character matches 0x13?

Something like:
open(my $fh, '<', 'file') or die "$!\n";
binmode($fh);
# Read one byte at a time and test its two-digit hex form.
while (read $fh, my $char, 0x01) {
    my $buf = unpack('H*', $char);
    if ($buf =~ /13/) {
        print "found 0x13\n";
    }
}
Contents of 'file': '.Eg5™eEfx`.' #'.' = 0x13;
I'm not up to par on Unicode, so I could be way off.
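A minimal sketch of the "read a few MB at a time" idea, swapping the hex unpack for a direct search for the raw byte "\x13" with index. The helper name find_byte_offsets and the 4 MB default chunk size are assumptions for illustration, not something from the thread. Because the target is a single byte, a match can never straddle two chunks.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offsets of every occurrence of the
# one-byte string $byte in the file at $path, reading $chunk bytes per read.
sub find_byte_offsets {
    my ($path, $byte, $chunk) = @_;
    $chunk ||= 4 * 1024 * 1024;    # assumed default: 4 MB per read
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!\n";
    my @offsets;
    my $base = 0;                  # file offset where the current chunk starts
    while (my $n = read($fh, my $buf, $chunk)) {
        my $pos = -1;
        # index() compares raw bytes, so there is no hex-digit ambiguity.
        while (($pos = index($buf, $byte, $pos + 1)) >= 0) {
            push @offsets, $base + $pos;
        }
        $base += $n;
    }
    close($fh);
    return @offsets;
}
```

It could be called as find_byte_offsets('file', "\x13"); an empty list means the byte does not occur.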

Re^3: How to find Unicode: 0x13 in File
by choroba (Cardinal) on Nov 18, 2016 at 16:51 UTC
    > 0x01

    Why do you specify the length in hex?

    Also note that if you use a length greater than 1 (which you would want, to speed things up), you can get false positives: read $fh, my $char, 2 reports 0x13 as present in the following file:

    a1

    because

    $ perl -wE 'say unpack "H*", "a1"'
    6131
    ~~

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      "Why is read length hex." - Because that is how I deal with most files I personally work with, so it is more of preference than anything.

      The false positive does not show on my PC unless I use unpack('h*', $_); so I am sure it is a platform/architecture difference between our PCs. Also, from the way I read the OP's question, I thought he was not looking for a literal 0x13 in plaintext; I thought he was looking for 0x13 after unpacking to hex. If I am wrong, I apologize. I was hoping to learn something more than anything else.

      I am not up to par on Unicode because I haven't had to deal with it in any of the material I work on. So if I am way off in left field, I apologize.

        The false positive does not show on my PC unless I use unpack('h*', $_); so I am sure it is a platform/architecture difference between our PCs.

        Not a platform mismatch, I would say, but probably because you're still reading a single character at a time from the file. If more than one character is read, the /13/ ambiguity (false positive) can appear with either the 'H*' or the 'h*' unpack template:

        c:\@Work\Perl>perl -wMstrict -le
        "print 'A: found 0x13!' if unpack('H*', 'a1') =~ /13/;
         print 'B: found 0x13!' if unpack('h*', qq{\x{1f}s}) =~ /13/;
        "
        A: found 0x13!
        B: found 0x13!

        (BTW: The * in both 'H*' and 'h*' implies reading and operating on a string of more than one character.)
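One way to sidestep the ambiguity, sketched here as an assumption rather than anything proposed in the thread: unpack with the grouped template '(H2)*', which yields one two-digit hex string per byte, and compare each string exactly instead of pattern-matching across the whole hex dump. The sub names naive_has_0x13 and has_byte_0x13 are hypothetical.

```perl
use strict;
use warnings;

# The approach discussed above: a regex over the whole hex dump.
# It matches '13' even when the digits come from two adjacent bytes.
sub naive_has_0x13 {
    my ($data) = @_;
    return unpack('H*', $data) =~ /13/ ? 1 : 0;
}

# Byte-aligned check: '(H2)*' returns one hex pair per byte
# (high nybble first), so only a real 0x13 byte can equal '13'.
sub has_byte_0x13 {
    my ($data) = @_;
    return scalar grep { $_ eq '13' } unpack('(H2)*', $data);
}
```

With these, naive_has_0x13('a1') is true (the dump is 6131), while has_byte_0x13('a1') is false because the pairs are 61 and 31.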

        I am not up to par on Unicode because ...

        ... and because you value your sanity.

        Update: Another, perhaps more general, code example:

        c:\@Work\Perl>perl -wMstrict -le
        "use Data::Dump qw(pp);
         ;;
         for my $s ('a1', qq{\x{1f}s}) {
            print q{'H*' found 0x13! in }, pp($s) if unpack('H*', $s) =~ /13/;
            print q{'h*' found 0x13! in }, pp($s) if unpack('h*', $s) =~ /13/;
            }
        "
        'H*' found 0x13! in "a1"
        'h*' found 0x13! in "a1"
        'h*' found 0x13! in "\37s"


        Give a man a fish:  <%-{-{-{-<