Re^4: How to find Unicode: 0x13 in File

Re^5: How to find Unicode: 0x13 in File by AnomalousMonk (Archbishop) on Nov 18, 2016 at 19:57 UTC
"The false positive does not show on my pc unless I use `unpack('h*', $_);` so I am sure it is a platform/architecture scenario between our PC's."

Not a platform mismatch, I would say, but probably because you're still reading a single character at a time from the file. If more than one character is read, the `/13/` false positive can appear with either the `'H*'` or the `'h*'` unpack template:

`c:\@Work\Perl>perl -wMstrict -le "print 'A: found 0x13!' if unpack('H*', 'a1') =~ /13/; print 'B: found 0x13!' if unpack('h*', qq{\x{1f}s}) =~ /13/; "`
`A: found 0x13!`
`B: found 0x13!`

(BTW: The `*` in both `'H*'` and `'h*'` implies reading and operating on a string of more than one character.)

"I am not up to par on Unicode because ..."

... and because you value your sanity.

Update: Another, perhaps more general, code example:

`c:\@Work\Perl>perl -wMstrict -le "use Data::Dump qw(pp); ;; for my $s ('a1', qq{\x{1f}s}) { print q{'H*' found 0x13! in }, pp($s) if unpack('H*', $s) =~ /13/; print q{'h*' found 0x13! in }, pp($s) if unpack('h*', $s) =~ /13/; } "`
`'H*' found 0x13! in "a1"`
`'h*' found 0x13! in "a1"`
`'h*' found 0x13! in "\37s"`

Give a man a fish: `<%-{-{-{-<`
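To see why the unanchored `/13/` matches in these cases, it can help to print the hex strings that `unpack` actually produces for the two sample inputs. This is only a minimal sketch; the helper name `hex_views` is made up for illustration and appears nowhere in the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: return the 'H*' (high nybble first) and
# 'h*' (low nybble first) hex strings for a byte string.
sub hex_views {
    my ($bytes) = @_;
    return (unpack('H*', $bytes), unpack('h*', $bytes));
}

# The same two sample strings used in the one-liners above.
for my $s ('a1', "\x{1f}s") {
    my ($hi_first, $lo_first) = hex_views($s);
    printf "H*: %s  h*: %s\n", $hi_first, $lo_first;
}
```

For `'a1'` the two strings are `6131` and `1613`; for `"\x{1f}s"` they are `1f73` and `f137`. In `6131` and `f137` the `13` straddles two bytes, and in `1613` it is the nybble-swapped `'1'` (0x31), so none of these matches is a real 0x13 character.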
Re^6: How to find Unicode: 0x13 in File by james28909 (Deacon) on Nov 19, 2016 at 15:55 UTC
I was able to reproduce that; however, I have no idea why it does that. The following code seems to work as expected:

`my ($one, $two) = unpack('(h2)*', 'a1');`
`#print "$one | $two\n";`
`print "found 0x13\n" if ($one =~ '13' || $two =~ '13');`

But even doing this, it is splitting the two bytes into individual bytes. Is this behavior of unpack documented? I am using ActivePerl version 5.16.3 as well.
Re^7: How to find Unicode: 0x13 in File by AnomalousMonk (Archbishop) on Nov 19, 2016 at 17:27 UTC
"... I have no idea why it does that. ... splitting the two bytes into individual bytes. Is this behavior of unpack documented?"

The `h` (un)pack template specifier is (and always has been, as far as I can recall) documented to unpack:

`h A hex string (low nybble first).`

The low nybble- versus high nybble-first ordering is the critical difference between, respectively, the `'h'` and `'H'` specifiers. This means that a string containing an ASCII `'1'` character (a byte with the hex value 0x31, which is also the UTF-8 representation of `'1'`) will `unpack` as `'13'` with `h`, and you have your false positive. (It also means that `h` `unpack`s an actual 0x13 character as `'31'`, and you have a false negative!)

I wouldn't do it this way, but if you absolutely have to use a regex to search a hex-unpacked string for the hex representation of the character 0x13, I would first make sure the nybble order of the unpack template specifier I'm using is consistent with the unpacked character pair I'm looking for (e.g., `'H'` with `'13'`) and then make sure the regex only matches on even character offset boundaries:

`c:\@Work\Perl>perl -wMstrict -le "use Data::Dump qw(pp); ;; print 'perl version: ', $]; ;; for my $s ('a1', qq{\x{1f}s}, qq{\x13}, qq{ab\x{13}cd}) { print q{'H*' found 0x13 in }, pp($s) if unpack('H*', $s) =~ m{ \A (?: ..)* 13 }xms; } "`
`perl version: 5.008009`
`'H*' found 0x13 in "\23"`
`'H*' found 0x13 in "ab\23cd"`

(Runs the same on ActiveState 5.8.9 and Strawberry 5.14.4.1.)

Personally, I've always found the behavior of the `H` and `h` (un)pack specifiers counterintuitive and tricksey. Whenever I use them, I always have to go to the pack and perlpacktut documentation and tutorial and stare at the info therein for a while before I can get it right.

Give a man a fish: `<%-{-{-{-<`
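One possible alternative that avoids hex strings altogether is to match the raw 0x13 byte directly. The following is only a sketch under stated assumptions: the data is assumed to be held as bytes (for example, read from a handle opened with `:raw`) and to be single-byte or UTF-8 encoded, where 0x13 can only ever be the character U+0013. The helper name `find_0x13` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offsets of every 0x13
# in a string that is assumed to hold raw bytes.
sub find_0x13 {
    my ($bytes) = @_;
    my @offsets;
    while ($bytes =~ /\x13/g) {
        push @offsets, pos($bytes) - 1;   # pos() points just past the match
    }
    return @offsets;
}

# The same four sample strings as in the one-liner above.
for my $s ('a1', "\x{1f}s", "\x13", "ab\x{13}cd") {
    my @hits = find_0x13($s);
    printf "%s: %s\n", unpack('H*', $s),
        @hits ? "0x13 at offset(s) @hits" : 'no 0x13';
}
```

Because the match is made against the bytes themselves, nybble order and even-offset alignment never come into play. Data in a multi-byte encoding such as UTF-16 would need to be decoded first.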