matching characters and numbers with regex

james28909 has asked for the wisdom of the Perl Monks concerning the following question:

Re: matching characters and numbers with regex
by roboticus (Chancellor) on May 31, 2014 at 12:01 UTC

Changing the data to a character representation of the hex codes actually makes your task harder (as Athanasius indicated previously). If you have the data as a byte string, you can find the repeats like this:

while ($w =~ /((.)(\2{7}|\2{3}|\2))/sg) {
   ... do something ...
}

Since you wanted only repeats of certain lengths, the regex is a little goofy. I had to put the alternatives in descending order by length; otherwise it would just give the shortest version of the sequence.
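
To see why the order matters, here is a minimal sketch (the helper name first_run and the sample string are made up for illustration). Perl tries the alternatives left to right and keeps the first one that lets the match succeed, so listing the shortest alternative first stops the match early:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first run captured by the given pattern.
sub first_run {
    my ($re, $data) = @_;
    return $data =~ $re ? $1 : undef;
}

# A run of eight identical characters between two other characters.
my $data = 'x' . ('c' x 8) . 'y';

# Longest alternative first: the whole run of 8 is captured.
my $descending = qr/((.)(?:\2{7}|\2{3}|\2))/s;
# Shortest alternative first: the match stops after 2 characters.
my $ascending  = qr/((.)(?:\2|\2{3}|\2{7}))/s;

print length( first_run($descending, $data) ), "\n";    # 8
print length( first_run($ascending,  $data) ), "\n";    # 2
```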

I played around with it a little and came up with an example:

#!/usr/bin/perl
use strict;
use warnings;

my $t = pack "H*",
        '0a0a0a0a0b0b0a0a0c0c0c0c0c0c0c0c'
      . '1f1f2b2b2b2b3e3e7b7b7b7b7b7b7b7b'
      . '8f8f8f8f8f8f8f8f6c6c4b4b4b4b3f3f'
      . '9d9d0f0f0f0f0f0f0f0f3a3a2e2e2e2e';

repeats($t);
repeats('abbcccddddeeeeeffffffggggggghhhhhhhhiiiiiiiii');

sub repeats {
    my $w = shift;
    print "\nBYTES: ", unpack("H*",$w),"\n";
    while ($w =~ /((.)(\2{7}|\2{3}|\2))/sg) {
        my $bytes = $1;
        my $hex = unpack "H*", $bytes;
        $bytes =~ s/[\x00-\x1f\x80-\xff]/_/g;
        print "repeat: $hex ($bytes) pos:", pos($w)-length($bytes), "\n";
    }
}

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re: matching characters and numbers with regex
by johngg (Canon) on May 31, 2014 at 13:05 UTC

Using a look-ahead combined with pos avoids any need for sums and finds overlapping matches (if that is part of your spec.).

$ perl -Mstrict -Mwarnings -E '
my $str = qq{\0} x 32;
substr $str,  5,  2, qq{\x0b\x9e};
substr $str, 26,  5, qq{\x3c\x5a\x1e\x6b\x48};
substr $str, 11, 11, qq{\x0f\x2c\x34\x3c\x5a\x1e\x6b\x48\x0b\x9e\x88};
say unpack q{H*}, $str;

for my $quant ( 8, 4, 2 )
{
    say q{};
    say qq{$quant [\\x0a-\\x9f] found at @{ [ pos $str ] }}
       while $str =~ m{(?x) (?= [\x0a-\x9f] {$quant} ) }g;
}'
00000000000b9e000000000f2c343c5a1e6b480b9e88000000003c5a1e6b4800

8 [\x0a-\x9f] found at 11
8 [\x0a-\x9f] found at 12
8 [\x0a-\x9f] found at 13
8 [\x0a-\x9f] found at 14

4 [\x0a-\x9f] found at 11
4 [\x0a-\x9f] found at 12
4 [\x0a-\x9f] found at 13
4 [\x0a-\x9f] found at 14
4 [\x0a-\x9f] found at 15
4 [\x0a-\x9f] found at 16
4 [\x0a-\x9f] found at 17
4 [\x0a-\x9f] found at 18
4 [\x0a-\x9f] found at 26
4 [\x0a-\x9f] found at 27

2 [\x0a-\x9f] found at 5
2 [\x0a-\x9f] found at 11
2 [\x0a-\x9f] found at 12
2 [\x0a-\x9f] found at 13
2 [\x0a-\x9f] found at 14
2 [\x0a-\x9f] found at 15
2 [\x0a-\x9f] found at 16
2 [\x0a-\x9f] found at 17
2 [\x0a-\x9f] found at 18
2 [\x0a-\x9f] found at 19
2 [\x0a-\x9f] found at 20
2 [\x0a-\x9f] found at 26
2 [\x0a-\x9f] found at 27
2 [\x0a-\x9f] found at 28
2 [\x0a-\x9f] found at 29
$

I hope this is helpful.
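
A minimal sketch of why the look-ahead matters, using a made-up helper named offsets and a toy string rather than the data above. A pattern that consumes its match resumes after the matched bytes, while a zero-width look-ahead advances one position at a time, so overlapping runs are all reported:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the start offset of every match of $re in $str.
sub offsets {
    my ($re, $str) = @_;
    my @found;
    while ( $str =~ /$re/g ) {
        # $-[0] is the start offset of the whole match
        push @found, $-[0];
    }
    return @found;
}

# Two NUL bytes, five bytes inside the range, two NUL bytes.
my $str = "\0\0" . "\x0b\x9e\x0f\x2c\x34" . "\0\0";

my @consuming = offsets( qr/[\x0a-\x9f]{2}/,     $str );
my @lookahead = offsets( qr/(?=[\x0a-\x9f]{2})/, $str );

print "consuming:  @consuming\n";    # consuming:  2 4
print "look-ahead: @lookahead\n";    # look-ahead: 2 3 4 5
```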

Cheers,

JohnGG

Re^2: matching characters and numbers with regex

by james28909 (Deacon) on May 31, 2014 at 22:27 UTC

while ( read $string, my $chunk, 4 ) {    # $string is used as the filehandle here
    print "corrupted\n" if $chunk =~ /FFFF/;
}

Re^3: matching characters and numbers with regex

by johngg (Canon) on Jun 01, 2014 at 00:42 UTC

No need to use while loops; use backreferences in your pattern. In the following code I make arrays of references to substrings of 2, 4 and 8 characters without converting the bytes to string representations. I then test each dereferenced element against the pattern and print a message if I find repeats. I test a clean string first, then introduce some repeats and test it again.

use strict;
use warnings;

use 5.014;

my $str = q{};
$str   .= chr for 0 .. 31;
say qq{\n}, q{| . . . ^ . . . } x 4; 
say unpack q{H*}, $str;

for my $len ( 8, 4, 2 )
{
    say qq{\nChecking groups of $len};
    my $quant  = $len - 1;
    my @groups =
       map { \ substr $str, $_ * $len, $len }
       0 .. ( length( $str ) / $len ) - 1;

    for my $idx ( 0 .. $#groups )
    {
        say
           qq{Found @{ [ unpack q{H*}, $1 ] } },
           qq{at offset @{ [ $len * $idx ] }}
           if ${ $groups[ $idx ] } =~ m{((.)\2{$quant})}s;    # /s: "." must also match byte 0x0a
    }
}

substr $str,  0, 2, qq{\x3e\x3e};
substr $str, 16, 8, qq{\xac} x 8;
substr $str,  4, 4, qq{\x7f} x 4;
substr $str, 26, 4, qq{\x45} x 4;
say qq{\n}, q{| . . . ^ . . . } x 4; 
say unpack q{H*}, $str;

for my $len ( 8, 4, 2 )
{
    say qq{\nChecking groups of $len};
    my $quant  = $len - 1;
    my @groups =
       map { \ substr $str, $_ * $len, $len }
       0 .. ( length( $str ) / $len ) - 1;

    for my $idx ( 0 .. $#groups )
    {
        say
           qq{Found @{ [ unpack q{H*}, $1 ] } },
           qq{at offset @{ [ $len * $idx ] }}
           if ${ $groups[ $idx ] } =~ m{((.)\2{$quant})}s;    # /s: "." must also match byte 0x0a
    }
}

The output.


| . . . ^ . . . | . . . ^ . . . | . . . ^ . . . | . . . ^ . . . 
000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f

Checking groups of 8

Checking groups of 4

Checking groups of 2

| . . . ^ . . . | . . . ^ . . . | . . . ^ . . . | . . . ^ . . . 
3e3e02037f7f7f7f08090a0b0c0d0e0facacacacacacacac1819454545451e1f

Checking groups of 8
Found acacacacacacacac at offset 16

Checking groups of 4
Found 7f7f7f7f at offset 4
Found acacacac at offset 16
Found acacacac at offset 20

Checking groups of 2
Found 3e3e at offset 0
Found 7f7f at offset 4
Found 7f7f at offset 6
Found acac at offset 16
Found acac at offset 18
Found acac at offset 20
Found acac at offset 22
Found 4545 at offset 26
Found 4545 at offset 28

I hope this helps you along.
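
Since the same loop runs twice above, it could be folded into a subroutine. The following is only a sketch of that idea, not the code used to produce the output above: the name repeated_groups is hypothetical, and it swaps the substring references for unpack with an "(aN)*" template to cut the string into aligned chunks.

```perl
use strict;
use warnings;

# Hypothetical helper: split $str into aligned chunks of $len bytes and
# return the offsets of the chunks that consist of one repeated byte.
sub repeated_groups {
    my ($str, $len) = @_;
    my @chunks = unpack "(a$len)*", $str;
    my $quant  = $len - 1;
    return map  { $_ * $len }
           grep { $chunks[$_] =~ m{\A(.)\1{$quant}\z}s }
           0 .. $#chunks;
}

# Same test data as above: 32 distinct bytes, then four injected repeats.
my $str = join q{}, map { chr } 0 .. 31;
substr $str,  0, 2, qq{\x3e\x3e};
substr $str, 16, 8, qq{\xac} x 8;
substr $str,  4, 4, qq{\x7f} x 4;
substr $str, 26, 4, qq{\x45} x 4;

for my $len ( 8, 4, 2 ) {
    my @hits = repeated_groups( $str, $len );
    print "groups of $len: @hits\n";
}
```

The offsets it prints should agree with the "Found ... at offset" lines in the output above.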

Cheers,

JohnGG

Re^3: matching characters and numbers with regex

by james28909 (Deacon) on May 31, 2014 at 22:55 UTC

Or better yet, I could read each byte without converting it to a string and check whether the next 2, 4 or 8 bytes match it. If they do, then it's a corrupt file.
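
A rough sketch of that byte-by-byte idea without any regex. The helper name byte_runs and the sample data are assumptions for illustration, and whether a run really indicates corruption depends on the file format:

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, length] pairs for every maximal run
# of identical bytes that is at least $min bytes long.
sub byte_runs {
    my ($data, $min) = @_;
    my @bytes = unpack 'C*', $data;    # one integer per byte
    my @runs;
    my $start = 0;
    for my $i ( 1 .. @bytes ) {
        # A run ends at the end of the data or when the byte value changes.
        next if $i < @bytes && $bytes[$i] == $bytes[$start];
        push @runs, [ $start, $i - $start ] if $i - $start >= $min;
        $start = $i;
    }
    return @runs;
}

my $data = pack 'H*', '0a0a0a0a0b0b0a0a0c0c0c0c0c0c0c0c';
printf "offset %d: %d identical bytes\n", @$_ for byte_runs( $data, 2 );
```

For this sample it should report runs of 4, 2, 2 and 8 bytes at offsets 0, 4, 6 and 8.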

Re^4: matching characters and numbers with regex

by james28909 (Deacon) on May 31, 2014 at 23:42 UTC

Re^5: matching characters and numbers with regex

by james28909 (Deacon) on Jun 01, 2014 at 02:25 UTC

Re: matching characters and numbers with regex
by james28909 (Deacon) on May 31, 2014 at 04:22 UTC

$string =~/\w{4,8,16}/\/[0-9a-fA-F]/;

Re^2: matching characters and numbers with regex

by Athanasius (Archbishop) on May 31, 2014 at 07:45 UTC

$string =~/\w{4,8,16}/\/[0-9a-fA-F]/;

There are two errors here:

1. The quantifier syntax X{y,z} means: at least y and no more than z occurrences of X. You want to say: either exactly 4 occurrences, or exactly 8 occurrences, or exactly 16 occurrences; but you can’t do that with this quantifier. See “Quantifiers” in perlre#Regular-Expressions.
2. The construct /.../\/.../ is a syntax error: the regex ends at the second /.
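
One way to say "exactly 4, or exactly 8, or exactly 16" is an alternation of fixed-count quantifiers. A minimal sketch, assuming the data is a string of hex digits and the whole string should be one such group (the name $exact and the sample values are illustrative only):

```perl
use strict;
use warnings;

# Exactly 16, 8 or 4 hex digits and nothing else (longest alternative first).
my $exact = qr/\A(?:[0-9a-fA-F]{16}|[0-9a-fA-F]{8}|[0-9a-fA-F]{4})\z/;

for my $candidate ( '0a0a', '0a0a0', '0a0a0b0b', '0a0a0b0b0c0c0d0d', 'wxyz' ) {
    printf "%-16s %s\n", $candidate, $candidate =~ $exact ? 'ok' : 'no match';
}
```

A range such as {4,16} would also accept 5, 6 or 12 digits, which is why the three lengths are spelled out separately.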

Now for the bigger picture.

You can probably do what you want with regexes, but it quickly becomes complicated. Here is some code I came up with to identify repeated 4-character sequences:

#! perl
use strict;
use warnings;
use List::MoreUtils 'uniq';

my $string = '0a0a0a0a0b0b0a0a0c0c0c0c0c0c0c0c' .
             '1f1f2b2b2b2b3e3e7b7b7b7b7b7b7b7b' .
             '8f8f8f8f8f8f8f8f6c6c4b4b4b4b3f3f' .
             '9d9d0f0f0f0f0f0f0f0f3a3a2e2e2e2e';

my @seqs = $string =~ /(([0-9a-fA-F]{2})\2)/g;
   @seqs = uniq grep { length == 4 } @seqs;

for my $seq (@seqs)
{
    my $matches = () = $string =~ /$seq/g;
    printf "%s: %d\n", $seq, $matches;
}

Output:

17:30 >perl 914_SoPW.pl
0a0a: 3
0b0b: 1
0c0c: 4
1f1f: 1
2b2b: 2
3e3e: 1
7b7b: 4
8f8f: 4
6c6c: 1
4b4b: 2
3f3f: 1
9d9d: 1
0f0f: 4
3a3a: 1
2e2e: 2

17:30 >

What concerns me here is the alignment problem: you presumably do not want to match a non-aligned sequence like the following:

0a0axxx0a0a0yyyy
^^^^   ^^^^

See, for example, the discussion of the \G anchor in the “Global matching” section of perlretut#Using-regular-expressions-in-Perl.
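
A small sketch of how \G can keep the scan aligned, assuming the hex string is made of 4-digit units. The helper name aligned_pairs and the sample string are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: walk a hex string in aligned 4-digit units and return
# the offsets of the units whose two bytes are identical (e.g. '0a0a').
sub aligned_pairs {
    my ($hex) = @_;
    my @offsets;
    # \G pins each attempt to where the previous match ended, so the scan
    # can never slide onto a different alignment.
    while ( $hex =~ /\G([0-9a-fA-F]{2})([0-9a-fA-F]{2})/g ) {
        push @offsets, $-[0] if lc($1) eq lc($2);
    }
    return @offsets;
}

my $hex = '0a0a' . 'b0a0' . 'a0bb' . '0c0c';

# For contrast: without \G the regex is free to match across unit boundaries.
my @unanchored;
push @unanchored, $-[0] while $hex =~ /([0-9a-f]{2})\1/g;

print "aligned:    @{[ aligned_pairs($hex) ]}\n";    # aligned:    0 12
print "unanchored: @unanchored\n";                   # unanchored: 0 5 12
```

The unanchored scan reports a spurious '0a0a' at offset 5, which straddles two units.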

I’m not sure that regexes are the best tool for this job. I would look at converting your string into an array of integers, then building a hash of integer sequences (of the desired lengths) mapped to their number of occurrences in the original string.
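
A minimal sketch of that approach, assuming aligned, non-overlapping sequences are what is wanted. The helper name sequence_counts and the comma-joined key format are choices made here for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: count aligned sequences of $len bytes in a hex string.
# Keys are the byte values joined with commas, values are occurrence counts.
sub sequence_counts {
    my ($hex, $len) = @_;
    # Convert each pair of hex digits into an integer.
    my @ints = map { hex $_ } $hex =~ /([0-9a-fA-F]{2})/g;
    my %count;
    for ( my $i = 0; $i + $len <= @ints; $i += $len ) {
        $count{ join ',', @ints[ $i .. $i + $len - 1 ] }++;
    }
    return %count;
}

my %count = sequence_counts( '0a0a0a0a0b0b0a0a0c0c0c0c0c0c0c0c', 2 );
printf "%-8s %d\n", $_, $count{$_} for sort keys %count;
```

For the first line of the sample data this should give 3 for "10,10", 1 for "11,11" and 4 for "12,12", matching the 0a0a, 0b0b and 0c0c counts within that line.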

Hope that helps,

Update (June 1): Corrected alignment example.

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re^3: matching characters and numbers with regex

by james28909 (Deacon) on May 31, 2014 at 22:52 UTC

"4428FBABCBED062405E56F853AAE238C4428FBABCBED062405E56F853AAE238CCC9AA594B5B35063A28224E2FE347EE349E9FFEDB897E32725F42C0D9FA2400D56C78EC7E711F47AA032CB76E11996D4"

"0A0AFBABCBED062405E56F853AAE238C4428FBABCBED062405E56F853AAE238CCC9AA594B5B35063A28224E2FE347EE349E9FFEDB897E32725F42C0D9FA2400D56C78EC7E711F47AA032CB76E11996D4"

Re^4: matching characters and numbers with regex

by Athanasius (Archbishop) on Jun 01, 2014 at 03:26 UTC