james28909 has asked for the wisdom of the Perl Monks concerning the following question:

Hello again good fellows :)
I have been trying to read up on regexes, especially when it comes to matching repeating characters. What I am trying to do is find repetitions of 0A-9F that repeat themselves in 2-byte, 4-byte and 8-byte occurrences. I have converted the hexadecimal to a literal string (e.g. 0xaa 0xbb 0xcc 0xdd is $string aabbccdd11223344)
Here is a small example. Say for instance I have 64 bytes and I want to scan it for any repeating instances of hexadecimal characters between 0a-9f:

What's in the $string:
0a0a0a0a0b0b0a0a0c0c0c0c0c0c0c0c
1f1f2b2b2b2b3e3e7b7b7b7b7b7b7b7b
8f8f8f8f8f8f8f8f6c6c4b4b4b4b3f3f
9d9d0f0f0f0f0f0f0f0f3a3a2e2e2e2e
I want it to match any repeating characters; the above is just an example. I will be working with files of a few MB at most, and I think I almost have it, but I am a little confused about how to set up the regex to accomplish this. This is what I have so far:

$string =~/\d{2,4,8}/\/[0-9a-fA-F]/; or $string =~/\w{2,4,8]/\/[0-9a-fA-F]/;

I am not sure how to set up the regular expression. I know the first part tells it to match digits of characters, but the way I have converted it, it will print whatever is at the offset (0x1F2A will print as 1f2a in the console), so I am pretty sure I need to be matching words.
All in all I want to be able to match any character that repeats itself in sets of 2 bytes/4 bytes/8 bytes. I feel that I am close, but still no cigar. Could anyone be so kind as to help me out just a little? :)

Re: matching characters and numbers with regex
by roboticus (Chancellor) on May 31, 2014 at 12:01 UTC

    james28909:

    Changing the data to a character representation of the hex codes actually makes your task harder (as Athanasius indicated previously). If you have the data as a byte string, you can find the repeats like this:

    while ($w =~ /((.)(\2{7}|\2{3}|\2))/sg) { ... do something ... }

    (Since you wanted only repeats of certain lengths, the regex is a little goofy. I had to put the sequences in descending order by length, otherwise it would just give the shortest version of the sequence.)

    I played around with it a little and came up with an example:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $t = pack "H*", '0a0a0a0a0b0b0a0a0c0c0c0c0c0c0c0c'
                     . '1f1f2b2b2b2b3e3e7b7b7b7b7b7b7b7b'
                     . '8f8f8f8f8f8f8f8f6c6c4b4b4b4b3f3f'
                     . '9d9d0f0f0f0f0f0f0f0f3a3a2e2e2e2e';

    repeats($t);
    repeats('abbcccddddeeeeeffffffggggggghhhhhhhhiiiiiiiii');

    sub repeats {
        my $w = shift;
        print "\nBYTES: ", unpack("H*",$w),"\n";
        while ($w =~ /((.)(\2{7}|\2{3}|\2))/sg) {
            my $bytes = $1;
            my $hex = unpack "H*", $bytes;
            $bytes =~ s/[\x00-\x1f\x80-\xff]/_/g;
            print "repeat: $hex ($bytes) pos:", pos($w)-length($bytes), "\n";
        }
    }

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: matching characters and numbers with regex
by johngg (Canon) on May 31, 2014 at 13:05 UTC

    Using a look-ahead combined with pos avoids any need for sums and finds overlapping matches (if that is part of your spec.).

    $ perl -Mstrict -Mwarnings -E '
        my $str = qq{\0} x 32;
        substr $str, 5, 2, qq{\x0b\x9e};
        substr $str, 26, 5, qq{\x3c\x5a\x1e\x6b\x48};
        substr $str, 11, 11, qq{\x0f\x2c\x34\x3c\x5a\x1e\x6b\x48\x0b\x9e\x88};
        say unpack q{H*}, $str;
        for my $quant ( 8, 4, 2 ) {
            say q{};
            say qq{$quant [\\x0a-\\x9f] found at @{ [ pos $str ] }}
                while $str =~ m{(?x) (?= [\x0a-\x9f] {$quant} ) }g;
        }'
    00000000000b9e000000000f2c343c5a1e6b480b9e88000000003c5a1e6b4800

    8 [\x0a-\x9f] found at 11
    8 [\x0a-\x9f] found at 12
    8 [\x0a-\x9f] found at 13
    8 [\x0a-\x9f] found at 14

    4 [\x0a-\x9f] found at 11
    4 [\x0a-\x9f] found at 12
    4 [\x0a-\x9f] found at 13
    4 [\x0a-\x9f] found at 14
    4 [\x0a-\x9f] found at 15
    4 [\x0a-\x9f] found at 16
    4 [\x0a-\x9f] found at 17
    4 [\x0a-\x9f] found at 18
    4 [\x0a-\x9f] found at 26
    4 [\x0a-\x9f] found at 27

    2 [\x0a-\x9f] found at 5
    2 [\x0a-\x9f] found at 11
    2 [\x0a-\x9f] found at 12
    2 [\x0a-\x9f] found at 13
    2 [\x0a-\x9f] found at 14
    2 [\x0a-\x9f] found at 15
    2 [\x0a-\x9f] found at 16
    2 [\x0a-\x9f] found at 17
    2 [\x0a-\x9f] found at 18
    2 [\x0a-\x9f] found at 19
    2 [\x0a-\x9f] found at 20
    2 [\x0a-\x9f] found at 26
    2 [\x0a-\x9f] found at 27
    2 [\x0a-\x9f] found at 28
    2 [\x0a-\x9f] found at 29
    $

    I hope this is helpful.

    Cheers,

    JohnGG

      I think I have found a better approach to this, but it's going to take a lot of code to perform the task, because I have to match 00 - FF in sets of 4, then sets of 8, then sets of 16 characters. Tell me if this will work correctly:
      while ($string){ read $string, $chunk, 4; if ($chunk =~ FFFF); print ("corrupted"); ;

      I would have to make while loops for 0000 through FFFF, for 4 characters, then 8 characters, then 16 characters.
      What I am trying to do is scan the string for any repeating characters, as in "0000", "FFFF", "00000000", "FFFFFFFF", "0000000000000000", "FFFFFFFFFFFFFFFF", and I would have to do that for every hexadecimal character, so it's going to take many, many loops and lines of code. I'm trying to think of a way to simplify this as much as possible.

      Also thank you for the examples.

        No need to use while loops, use backreferences in your pattern. In the following code I make arrays of references to substrings of 2, 4 & 8 characters without converting bytes to string representations. I then test by matching each dereferenced element against the pattern and print an error if I find repeats. I test a clean string first then introduce some repeats and test it again.

        use strict;
        use warnings;
        use 5.014;

        my $str = q{};
        $str .= chr for 0 .. 31;

        say qq{\n}, q{| . . . ^ . . . } x 4;
        say unpack q{H*}, $str;

        for my $len ( 8, 4, 2 ) {
            say qq{\nChecking groups of $len};
            my $quant  = $len - 1;
            my @groups = map { \ substr $str, $_ * $len, $len }
                0 .. ( length( $str ) / $len ) - 1;
            for my $idx ( 0 .. $#groups ) {
                # /s lets . match the newline byte (0x0a) as well
                say qq{Found @{ [ unpack q{H*}, $1 ] } },
                    qq{at offset @{ [ $len * $idx ] }}
                    if ${ $groups[ $idx ] } =~ m{((.)\2{$quant})}s;
            }
        }

        # Introduce some repeats and test again.
        substr $str, 0, 2, qq{\x3e\x3e};
        substr $str, 16, 8, qq{\xac} x 8;
        substr $str, 4, 4, qq{\x7f} x 4;
        substr $str, 26, 4, qq{\x45} x 4;

        say qq{\n}, q{| . . . ^ . . . } x 4;
        say unpack q{H*}, $str;

        for my $len ( 8, 4, 2 ) {
            say qq{\nChecking groups of $len};
            my $quant  = $len - 1;
            my @groups = map { \ substr $str, $_ * $len, $len }
                0 .. ( length( $str ) / $len ) - 1;
            for my $idx ( 0 .. $#groups ) {
                say qq{Found @{ [ unpack q{H*}, $1 ] } },
                    qq{at offset @{ [ $len * $idx ] }}
                    if ${ $groups[ $idx ] } =~ m{((.)\2{$quant})}s;
            }
        }

        The output.


        | . . . ^ . . . | . . . ^ . . . | . . . ^ . . . | . . . ^ . . . 
        000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f

        Checking groups of 8

        Checking groups of 4

        Checking groups of 2

        | . . . ^ . . . | . . . ^ . . . | . . . ^ . . . | . . . ^ . . . 
        3e3e02037f7f7f7f08090a0b0c0d0e0facacacacacacacac1819454545451e1f

        Checking groups of 8
        Found acacacacacacacac at offset 16

        Checking groups of 4
        Found 7f7f7f7f at offset 4
        Found acacacac at offset 16
        Found acacacac at offset 20

        Checking groups of 2
        Found 3e3e at offset 0
        Found 7f7f at offset 4
        Found 7f7f at offset 6
        Found acac at offset 16
        Found acac at offset 18
        Found acac at offset 20
        Found acac at offset 22
        Found 4545 at offset 26
        Found 4545 at offset 28

        I hope this helps you along.

        Cheers,

        JohnGG

        Or better yet, I could read each byte without converting it to a string, and see if the next 2, 4 and 8 bytes match it. If they do, then it's a corrupt file.
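        A minimal sketch of that byte-level idea, assuming the whole file fits in memory (the question mentions a few MB at most). The helper names slurp_bytes and has_byte_run are made up for illustration and do not come from the thread. Unlike the grouped check above, this looks for a run starting at any offset, not only at aligned ones.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes (no text decoding).
sub slurp_bytes {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                      # slurp mode: read everything at once
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Hypothetical helper: true if any single byte occurs $n times in a row.
sub has_byte_run {
    my ($bytes, $n) = @_;
    my $extra = $n - 1;            # one captured byte plus $extra repeats
    # /s is needed so that . also matches the newline byte 0x0a
    return $bytes =~ /(.)\1{$extra}/s ? 1 : 0;
}

# In-memory sample instead of a real file: 44 28 fb ab 0a 0a 0a 0a
my $sample = pack 'H*', '4428fbab0a0a0a0a';
for my $n (8, 4, 2) {
    print "run of $n identical bytes found\n" if has_byte_run($sample, $n);
}
```

        For a real file the sample line would become something like my $sample = slurp_bytes($path), with $path supplied by the caller.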
Re: matching characters and numbers with regex
by james28909 (Deacon) on May 31, 2014 at 04:22 UTC
    Actually, I think I need to be using /\w{4,8,16} :
    $string =~/\w{4,8,16}/\/[0-9a-fA-F]/;
    because instead of searching in hexadecimal bytes, it's searching characters in a string. In hexadecimal 0a is 1 character, but in a string 0a is 2 characters, correct?
      $string =~/\w{4,8,16}/\/[0-9a-fA-F]/;

      There are two errors here:

      1. The quantifier syntax X{y,z} means: at least y and no more than z occurrences of X. You want to say: either exactly 4 occurrences, or exactly 8 occurrences, or exactly 16 occurrences; but you can’t do that with this quantifier. See “Quantifiers” in perlre#Regular-Expressions.
      2. The construct /.../\/.../ is a syntax error: the regex ends at the second /.
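      One way to say "exactly 4, or exactly 8, or exactly 16" is an alternation of fixed counts, longest first, since the alternatives are tried left to right. This is only a sketch of the quantifier point; the sub name leading_hex_run is hypothetical, and it matches any hex digits rather than repeated ones.

```perl
use strict;
use warnings;

# Hypothetical helper: length (16, 8 or 4) of the block of hex digits
# matched at the very start of $s, or 0 if fewer than 4 are present.
sub leading_hex_run {
    my ($s) = @_;
    return $s =~ /\A((?:[0-9a-fA-F]{16}|[0-9a-fA-F]{8}|[0-9a-fA-F]{4}))/
        ? length $1
        : 0;
}

# Ten hex digits are available, so the 16 alternative fails and 8 matches.
print leading_hex_run('0a0a0b0b0c'), "\n";
```

      Putting the shortest alternative first would make the regex stop at 4 every time, which is the same ordering issue roboticus mentions in his reply.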

      Now for the bigger picture.

      You can probably do what you want with regexes, but it quickly becomes complicated. Here is some code I came up with to identify repeated 4-character sequences:

      #! perl
      use strict;
      use warnings;
      use List::MoreUtils 'uniq';

      my $string = '0a0a0a0a0b0b0a0a0c0c0c0c0c0c0c0c'
                 . '1f1f2b2b2b2b3e3e7b7b7b7b7b7b7b7b'
                 . '8f8f8f8f8f8f8f8f6c6c4b4b4b4b3f3f'
                 . '9d9d0f0f0f0f0f0f0f0f3a3a2e2e2e2e';

      my @seqs = $string =~ /(([0-9a-fA-F]{2})\2)/g;
      @seqs = uniq grep { length == 4 } @seqs;

      for my $seq (@seqs) {
          my $matches = () = $string =~ /$seq/g;
          printf "%s: %d\n", $seq, $matches;
      }

      Output:

      17:30 >perl 914_SoPW.pl
      0a0a: 3
      0b0b: 1
      0c0c: 4
      1f1f: 1
      2b2b: 2
      3e3e: 1
      7b7b: 4
      8f8f: 4
      6c6c: 1
      4b4b: 2
      3f3f: 1
      9d9d: 1
      0f0f: 4
      3a3a: 1
      2e2e: 2
      17:30 >

      What concerns me here is the alignment problem: you presumably do not want to match a non-aligned sequence like the following:

      0a0axxx0a0a0yyyy
      ^^^^   ^^^^

      See, for example, the discussion of the \G anchor in the “Global matching” section of perlretut#Using-regular-expressions-in-Perl.
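      As a rough illustration of how \G can keep matches aligned, the sketch below walks a hex string in whole 4-character steps and reports the offsets where the two bytes of a step are equal. The sub name aligned_repeats is hypothetical, and the 4-character step is an assumption about the alignment that is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: offsets of 4-character groups whose two bytes match,
# looking only at offsets 0, 4, 8, ...
sub aligned_repeats {
    my ($hex) = @_;
    my @found;
    # \G pins each match to where the previous one ended, so the scan
    # advances in whole 4-character steps and never matches off-boundary.
    while ($hex =~ /\G([0-9a-fA-F]{2})([0-9a-fA-F]{2})/g) {
        push @found, pos($hex) - 4 if lc($1) eq lc($2);
    }
    return @found;
}

# '0a0a' also occurs at offset 5 here, but that one is not aligned.
my @offsets = aligned_repeats('0a0af0a0a0bb');
print "aligned repeats at: @offsets\n";
```

      The scan stops at the first position where four hex digits are not available, so trailing or non-hex data would need separate handling.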

      I’m not sure that regexes are the best tool for this job. I would look at converting your string into an array of integers, then building a hash of integer sequences (of the desired lengths) mapped to their number of occurrences in the original string.
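      A minimal sketch of that array-and-hash idea, under the assumption that the sequences of interest are aligned, non-overlapping windows of a fixed number of bytes. The sub name count_aligned_sequences and the comma-joined hash keys are illustrative choices, not something specified in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split a hex string into byte values, cut them into
# aligned windows of $nbytes, and count how often each window occurs.
sub count_aligned_sequences {
    my ($hex, $nbytes) = @_;
    my @bytes = map { hex($_) } $hex =~ /([0-9a-fA-F]{2})/g;
    my %count;
    for (my $i = 0; $i + $nbytes <= @bytes; $i += $nbytes) {
        # Key is the window's byte values joined with commas, e.g. "10,10"
        my $key = join ',', @bytes[ $i .. $i + $nbytes - 1 ];
        $count{$key}++;
    }
    return \%count;
}

my $counts = count_aligned_sequences('0a0a0b0b0a0a', 2);
for my $key (sort keys %$counts) {
    print "$key: $counts->{$key}\n";
}
```

      A window made of one repeated byte shows up as a key whose values are all equal, so checking for repeats becomes a test on the keys rather than on the string.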

      Hope that helps,

      Update (June 1): Corrected alignment example.

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        I want it to match any 4 repeating characters, scanning from the beginning of the string. It will be checking a file for corruption. If the file is supposed to be

        "4428FBABCBED062405E56F853AAE238C4428FBABCBED062405E56F853AAE238CCC9AA594B5B35063A28224E2FE347EE349E9FFEDB897E32725F42C0D9FA2400D56C78EC7E711F47AA032CB76E11996D4"

        Then I want to make sure it doesn't have any repeating characters that are 4, 8, or 16 characters long. So if the above string was:

        "0A0AFBABCBED062405E56F853AAE238C4428FBABCBED062405E56F853AAE238CCC9AA594B5B35063A28224E2FE347EE349E9FFEDB897E32725F42C0D9FA2400D56C78EC7E711F47AA032CB76E11996D4"

        The difference between these two strings is the repeating characters 0A0A at the beginning of the second string. If it finds repeating characters, it will terminate the program and not continue, because it's checking for corruption.
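        A possible sketch of that check on the hex string itself, assuming a "repeat" means one byte (two hex characters) occurring 2, 4 or 8 times in a row, starting on a byte boundary. The sub name first_repeat is hypothetical, and the sample strings are just the first 16 characters of the two strings above.

```perl
use strict;
use warnings;

# Hypothetical helper: return (run, offset) for the first byte-aligned run
# of 8, 4 or 2 identical bytes in a hex string, or an empty list if none.
sub first_repeat {
    my ($hex) = @_;
    for my $nbytes (8, 4, 2) {
        my $extra = $nbytes - 1;
        # (?:..)*? skips whole bytes only, so the run starts on a byte boundary
        if ($hex =~ /\A(?:[0-9a-f]{2})*?(([0-9a-f]{2})\2{$extra})/i) {
            return ($1, $-[1]);    # $-[1] is the start offset of group 1
        }
    }
    return;
}

for my $candidate ('4428FBABCBED0624', '0A0AFBABCBED0624') {
    my ($run, $offset) = first_repeat($candidate);
    if (defined $run) {
        print "$candidate: corrupt, $run at offset $offset\n";
    }
    else {
        print "$candidate: ok\n";
    }
}
```

        To stop the program on the first hit, the print in the corrupt branch could be replaced with die.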