in reply to matching characters and numbers with regex

Using a look-ahead combined with pos avoids any need for sums and also finds overlapping matches (if that is part of your spec).

$ perl -Mstrict -Mwarnings -E '
    my $str = qq{\0} x 32;
    substr $str,  5,  2, qq{\x0b\x9e};
    substr $str, 26,  5, qq{\x3c\x5a\x1e\x6b\x48};
    substr $str, 11, 11, qq{\x0f\x2c\x34\x3c\x5a\x1e\x6b\x48\x0b\x9e\x88};
    say unpack q{H*}, $str;
    for my $quant ( 8, 4, 2 )
    {
        say q{};
        say qq{$quant [\\x0a-\\x9f] found at @{ [ pos $str ] }}
            while $str =~ m{(?x) (?= [\x0a-\x9f] {$quant} ) }g;
    }'
00000000000b9e000000000f2c343c5a1e6b480b9e88000000003c5a1e6b4800

8 [\x0a-\x9f] found at 11
8 [\x0a-\x9f] found at 12
8 [\x0a-\x9f] found at 13
8 [\x0a-\x9f] found at 14

4 [\x0a-\x9f] found at 11
4 [\x0a-\x9f] found at 12
4 [\x0a-\x9f] found at 13
4 [\x0a-\x9f] found at 14
4 [\x0a-\x9f] found at 15
4 [\x0a-\x9f] found at 16
4 [\x0a-\x9f] found at 17
4 [\x0a-\x9f] found at 18
4 [\x0a-\x9f] found at 26
4 [\x0a-\x9f] found at 27

2 [\x0a-\x9f] found at 5
2 [\x0a-\x9f] found at 11
2 [\x0a-\x9f] found at 12
2 [\x0a-\x9f] found at 13
2 [\x0a-\x9f] found at 14
2 [\x0a-\x9f] found at 15
2 [\x0a-\x9f] found at 16
2 [\x0a-\x9f] found at 17
2 [\x0a-\x9f] found at 18
2 [\x0a-\x9f] found at 19
2 [\x0a-\x9f] found at 20
2 [\x0a-\x9f] found at 26
2 [\x0a-\x9f] found at 27
2 [\x0a-\x9f] found at 28
2 [\x0a-\x9f] found at 29
$
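
If you would rather collect the offsets than print them, the same look-ahead and pos trick can be wrapped in a sub. This is only a sketch: the name run_offsets and the short test string are made up for illustration, and the byte range is the one used in the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return every offset
# at which $quant consecutive bytes in the range \x0a-\x9f begin,
# overlapping positions included.
sub run_offsets {
    my ( $str, $quant ) = @_;
    my @offsets;

    # The look-ahead consumes nothing, so each match is zero-length and
    # pos() reports where the run starts; /g then moves on one position.
    push @offsets, pos $str
        while $str =~ m{(?=[\x0a-\x9f]{$quant})}g;

    return @offsets;
}

my $str = qq{\0} x 8;
substr $str, 2, 4, qq{\x3c\x5a\x1e\x6b};
print join( q{ }, run_offsets( $str, 2 ) ), qq{\n};    # prints "2 3 4"
```

Because the sub works on its own copy of the string, pos starts out undefined on every call, so repeated calls do not interfere with each other.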

I hope this is helpful.

Cheers,

JohnGG

Re^2: matching characters and numbers with regex
by james28909 (Deacon) on May 31, 2014 at 22:27 UTC
    I think I have found a better approach to this, but it's going to take a lot of code to perform the task, because I have to match 00 - FF in sets of 4, then sets of 8, then sets of 16 characters. Tell me if this will work correctly:
    open my $fh, '<', \$string or die $!;    # in-memory filehandle
    while ( read $fh, my $chunk, 4 ) {
        print "corrupted\n" if $chunk =~ /FFFF/;
    }

    I would have to make while loops for 0000 through FFFF, first for 4 characters, then 8 characters, then 16 characters.
    What I am trying to do is scan the string for any repeating characters, as in "0000", "FFFF", "00000000", "FFFFFFFF", "0000000000000000", "FFFFFFFFFFFFFFFF", and I would have to do that for every hexadecimal character, so it's going to take many, many loops and lines of code. I'm trying to think of a way to simplify this as much as possible.

    Also thank you for the examples.

      No need to use while loops; use backreferences in your pattern instead. In the following code I make arrays of references to substrings of 2, 4 & 8 characters without converting bytes to string representations. I then test by matching each dereferenced element against the pattern and print an error if I find repeats. I test a clean string first, then introduce some repeats and test it again.

      use strict;
      use warnings;
      use 5.014;

      my $str = q{};
      $str .= chr for 0 .. 31;

      say qq{\n}, q{| . . . ^ . . . } x 4;
      say unpack q{H*}, $str;
      for my $len ( 8, 4, 2 )
      {
          say qq{\nChecking groups of $len};
          my $quant = $len - 1;
          my @groups =
              map { \ substr $str, $_ * $len, $len }
              0 .. ( length( $str ) / $len ) - 1;
          for my $idx ( 0 .. $#groups )
          {
              say
                  qq{Found @{ [ unpack q{H*}, $1 ] } },
                  qq{at offset @{ [ $len * $idx ] }}
                  if ${ $groups[ $idx ] } =~ m{((.)\2{$quant})};
          }
      }

      substr $str,  0, 2, qq{\x3e\x3e};
      substr $str, 16, 8, qq{\xac} x 8;
      substr $str,  4, 4, qq{\x7f} x 4;
      substr $str, 26, 4, qq{\x45} x 4;

      say qq{\n}, q{| . . . ^ . . . } x 4;
      say unpack q{H*}, $str;
      for my $len ( 8, 4, 2 )
      {
          say qq{\nChecking groups of $len};
          my $quant = $len - 1;
          my @groups =
              map { \ substr $str, $_ * $len, $len }
              0 .. ( length( $str ) / $len ) - 1;
          for my $idx ( 0 .. $#groups )
          {
              say
                  qq{Found @{ [ unpack q{H*}, $1 ] } },
                  qq{at offset @{ [ $len * $idx ] }}
                  if ${ $groups[ $idx ] } =~ m{((.)\2{$quant})};
          }
      }

      The output.


      | . . . ^ . . . | . . . ^ . . . | . . . ^ . . . | . . . ^ . . .
      000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f

      Checking groups of 8

      Checking groups of 4

      Checking groups of 2

      | . . . ^ . . . | . . . ^ . . . | . . . ^ . . . | . . . ^ . . .
      3e3e02037f7f7f7f08090a0b0c0d0e0facacacacacacacac1819454545451e1f

      Checking groups of 8
      Found acacacacacacacac at offset 16

      Checking groups of 4
      Found 7f7f7f7f at offset 4
      Found acacacac at offset 16
      Found acacacac at offset 20

      Checking groups of 2
      Found 3e3e at offset 0
      Found 7f7f at offset 4
      Found 7f7f at offset 6
      Found acac at offset 16
      Found acac at offset 18
      Found acac at offset 20
      Found acac at offset 22
      Found 4545 at offset 26
      Found 4545 at offset 28
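
The code above only looks at groups aligned on multiples of the group length. If runs that straddle a group boundary also matter to you (an assumption on my part, it may not be part of your spec), the same backreference can be run once over the whole string. The sub name repeated_runs below is made up for this sketch. It also uses the /s modifier, because without it . will not match a \x0a byte and a run of those would be missed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the code above): list every maximal run
# of at least $min identical bytes as [ offset, length, byte in hex ].
sub repeated_runs {
    my ( $str, $min ) = @_;
    my $more = $min - 1;
    my @runs;

    # (.) captures one byte, \2{$more,} demands at least $more further
    # copies of it. $-[1] is the offset at which capture group 1 starts.
    while ( $str =~ m{((.)\2{$more,})}sg ) {
        push @runs, [ $-[1], length( $1 ), unpack( q{H2}, $2 ) ];
    }
    return @runs;
}

my $str = q{};
$str .= chr for 0 .. 31;
substr $str,  4, 4, qq{\x7f} x 4;
substr $str, 16, 8, qq{\xac} x 8;

printf qq{%d x %s at offset %d\n}, @{ $_ }[ 1, 2, 0 ]
    for repeated_runs( $str, 4 );
```

Each run is reported once, at its full length, rather than once per group size.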

      I hope this helps you along.

      Cheers,

      JohnGG

      Or better yet, I could read each byte without converting it to a string and see if the next 2, 4 and 8 bytes match it. If they do, then it's a corrupt file.
        Never mind, I just read and understood the comments lol. Thanks again, y'all :)
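
For completeness, the byte-by-byte idea mentioned above can be done without any regex at all. This is a rough sketch only: is_repeat_at is a made-up name, the sample data is invented, and treating any run as "corrupt" is taken from the description in this thread rather than from any file format spec.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the $len bytes starting at $offset are
# all the same byte. Works on raw bytes, no hex conversion needed.
sub is_repeat_at {
    my ( $str, $offset, $len ) = @_;
    return 0 if $offset + $len > length( $str );

    my $first = substr $str, $offset, 1;
    return substr( $str, $offset, $len ) eq $first x $len ? 1 : 0;
}

my $str = qq{\x01\x02\xff\xff\xff\xff\x03\x04};
for my $len ( 2, 4, 8 ) {
    for my $offset ( 0 .. length( $str ) - $len ) {
        print qq{corrupt: $len repeated bytes at offset $offset\n}
            if is_repeat_at( $str, $offset, $len );
    }
}
```

This tests every offset for every length, so on large files the regex approaches earlier in the thread are likely to be quicker.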