Re: matching characters and numbers with regex


Re^2: matching characters and numbers with regex
by Athanasius (Archbishop) on May 31, 2014 at 07:45 UTC

$string =~/\w{4,8,16}/\/[0-9a-fA-F]/;

There are two errors here:

The quantifier syntax X{y,z} means: at least y and no more than z occurrences of X. It takes at most two numbers. You want to say: either exactly 4 occurrences, or exactly 8 occurrences, or exactly 16 occurrences; but you can’t do that with this quantifier. See “Quantifiers” in perlre#Regular-Expressions.
The construct /.../\/.../ is a syntax error: the regex ends at the second /, so whatever follows it is no longer part of the pattern.
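
One way to say "exactly 4, or exactly 8, or exactly 16" is to alternate between fixed-count quantifiers. The following is only a minimal sketch: it assumes the whole string must consist of hex digits, and the sub name has_allowed_length is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $s consists of exactly 4, 8, or 16 hex digits.
sub has_allowed_length {
    my ($s) = @_;
    return $s =~ /\A(?:[0-9a-fA-F]{16}|[0-9a-fA-F]{8}|[0-9a-fA-F]{4})\z/ ? 1 : 0;
}

for my $s (qw(0a0a 0a0a0 0a0a0b0b 0a0a0b0b0c0c0d0d)) {
    printf "%-16s %s\n", $s, has_allowed_length($s) ? 'ok' : 'rejected';
}
```

The \A and \z anchors matter here: without them, a 5-character string would still match because it contains 4 matching characters.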

Now for the bigger picture.

You can probably do what you want with regexes, but it quickly becomes complicated. Here is some code I came up with to identify repeated 4-character sequences:

#! perl
use strict;
use warnings;
use List::MoreUtils 'uniq';

my $string = '0a0a0a0a0b0b0a0a0c0c0c0c0c0c0c0c' .
             '1f1f2b2b2b2b3e3e7b7b7b7b7b7b7b7b' .
             '8f8f8f8f8f8f8f8f6c6c4b4b4b4b3f3f' .
             '9d9d0f0f0f0f0f0f0f0f3a3a2e2e2e2e';

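# In list context, /g returns both capture groups for every match:
# the 4-character repeat ($1) and its 2-character unit ($2).
my @seqs = $string =~ /(([0-9a-fA-F]{2})\2)/g;
# Keep only the 4-character captures, without duplicates.
   @seqs = uniq grep { length == 4 } @seqs;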

for my $seq (@seqs)
{
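    # Assigning to () puts the match in list context; the scalar
    # assignment then yields the number of (non-overlapping) matches.
    my $matches = () = $string =~ /$seq/g;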
    printf "%s: %d\n", $seq, $matches;
}

Output:

17:30 >perl 914_SoPW.pl
0a0a: 3
0b0b: 1
0c0c: 4
1f1f: 1
2b2b: 2
3e3e: 1
7b7b: 4
8f8f: 4
6c6c: 1
4b4b: 2
3f3f: 1
9d9d: 1
0f0f: 4
3a3a: 1
2e2e: 2

17:30 >

What concerns me here is the alignment problem: you presumably do not want to match a non-aligned sequence like the following:

0a0axxx0a0a0yyyy
^^^^   ^^^^

See, for example, the discussion of the \G anchor in the “Global matching” section of perlretut#Using-regular-expressions-in-Perl.
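
To illustrate the \G idea, here is a rough sketch that walks the string in 4-character chunks and only counts chunks of the form XYXY. It rests on assumptions that are mine, not the OP's: that "aligned" means offsets 0, 4, 8, ... and that scanning should simply stop at the first chunk that is not four hex digits. The name count_aligned_repeats is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count 4-character chunks of the form XYXY, looking
# only at offsets 0, 4, 8, ... from the start of the string.
sub count_aligned_repeats {
    my ($string) = @_;
    my %count;
    # \G anchors each attempt at the position where the previous match
    # ended, so the regex engine cannot slide forward to an unaligned offset.
    while ($string =~ /\G([0-9a-fA-F]{4})/g) {
        my $chunk = $1;
        $count{$chunk}++ if $chunk =~ /\A(..)\1\z/;
    }
    return \%count;
}

my $aligned = count_aligned_repeats('0a0a0a0a0b0b0a0a1f1f');
printf "%s: %d\n", $_, $aligned->{$_} for sort keys %$aligned;
```

With a string such as 10a0a234, an unanchored pattern would find 0a0a at offset 1, whereas this version sees only the chunks 10a0 and a234 and reports nothing.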

I’m not sure that regexes are the best tool for this job. I would look at converting your string into an array of integers, then building a hash of integer sequences (of the desired lengths) mapped to their number of occurrences in the original string.
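
A rough sketch of that approach follows. It makes several assumptions: the input is an even-length string of hex digits, sequences are taken as non-overlapping runs starting at multiples of the run length, and the hash key is the byte values joined with commas. The name count_sequences is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: split a hex string into byte values, then count how
# often each aligned, non-overlapping run of $len bytes occurs.
sub count_sequences {
    my ($hex, $len) = @_;
    die "expected an even number of hex digits\n"
        unless $hex =~ /\A(?:[0-9a-fA-F]{2})+\z/;

    # Two hex digits per byte: '0a' becomes 10, '0b' becomes 11, and so on.
    my @bytes = map { hex($_) } unpack '(A2)*', $hex;

    my %count;
    for (my $i = 0; $i + $len <= @bytes; $i += $len) {
        my $key = join ',', @bytes[$i .. $i + $len - 1];
        $count{$key}++;
    }
    return \%count;
}

my $by_bytes = count_sequences('0a0a0a0a0b0b0a0a', 2);
printf "%-8s %d\n", $_, $by_bytes->{$_} for sort keys %$by_bytes;
```

Calling it with a length of 2, 4, or 8 bytes corresponds to the 4-, 8-, and 16-character cases. Filtering the keys down to "repeated" sequences is left out here.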

Hope that helps,

Update (June 1): Corrected alignment example.

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^3: matching characters and numbers with regex

by james28909 (Deacon) on May 31, 2014 at 22:52 UTC

"4428FBABCBED062405E56F853AAE238C4428FBABCBED062405E56F853AAE238CCC9AA594B5B35063A28224E2FE347EE349E9FFEDB897E32725F42C0D9FA2400D56C78EC7E711F47AA032CB76E11996D4"

"0A0AFBABCBED062405E56F853AAE238C4428FBABCBED062405E56F853AAE238CCC9AA594B5B35063A28224E2FE347EE349E9FFEDB897E32725F42C0D9FA2400D56C78EC7E711F47AA032CB76E11996D4"


Re^4: matching characters and numbers with regex

by Athanasius (Archbishop) on Jun 01, 2014 at 03:26 UTC

james28909,

Well, this is a completely different spec from the one given previously (as I understood it, anyway)! If this is really all you need, it’s as simple as:

#! perl
use strict;
use warnings;

while (<DATA>)
{
    if    (/^([0-9a-fA-F]{2})\1/)
    {
        print "Found  4 repeating characters: $1$1\n";
    }
    elsif (/^([0-9a-fA-F]{4})\1/)
    {
        print "Found  8 repeating characters: $1$1\n";
    }
    elsif (/^([0-9a-fA-F]{8})\1/)
    {
        print "Found 16 repeating characters: $1$1\n";
    }
    else
    {
        print "Found  0 repeating characters\n";
    }
}

__DATA__
1234FBABCBED062405E56F853AAE238C4428FBABCBED0624
0A0AFBABCBED062405E56F853AAE238C4428FBABCBED0624
0A1B0A1BCBED062405E56F853AAE238C4428FBABCBED0624
0A1B2C3D0A1B2C3DCBED062405E56F853AAE238C4428FBAB
01230A0AFBABCBED062405E56F853AAE238C4428FBABCBED

Output:

13:12 >perl 914_SoPW.pl
Found  0 repeating characters
Found  4 repeating characters: 0A0A
Found  8 repeating characters: 0A1B0A1B
Found 16 repeating characters: 0A1B2C3D0A1B2C3D
Found  0 repeating characters

13:12 >

(Note that the final string tested here contains the repeated characters 0A0A, but these are not at the beginning of the string.)

Two obvious questions:

Why shouldn’t a legitimate (i.e., non-corrupt) file begin with repeated characters?
If a file is “corrupted,” will this always manifest as repeated characters at the start of the file? If not, how will you test for other forms of file corruption?

I’ve got a sneaking suspicion that this thread is dealing with an XY Problem. If the answers don’t solve your real problem, you will need to explain the nature of the files and the process(es) by which the corruption may occur.

Update: More compact version:

while (my $string = <DATA>)
{
    for my $chars (2, 4, 8)
    {
        printf "Found %2d repeating characters: %s\n", $chars * 2, $1 . $1
            if $string =~ /^([0-9a-fA-F]{$chars})\1/;
    }
}

(In the actual script, the printf would be replaced by a die statement.)
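
For what it's worth, a die-based variant might look something like the sketch below. The sub name assert_no_leading_repeat and the eval wrapper are my own illustration, not part of the actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: die if $string starts with a 2-, 4-, or 8-character
# hex sequence that is immediately repeated; return true otherwise.
sub assert_no_leading_repeat {
    my ($string) = @_;
    for my $chars (2, 4, 8) {
        # The condition runs first, so $1 is set when the message is built.
        die sprintf("Found %d repeating characters: %s\n", $chars * 2, $1 . $1)
            if $string =~ /^([0-9a-fA-F]{$chars})\1/;
    }
    return 1;
}

for my $line ('1234FBABCBED0624', '0A1B0A1BCBED0624') {
    my $ok = eval { assert_no_leading_repeat($line) };
    print $ok ? "ok: $line\n" : "rejected: $@";
}
```

Because die leaves the loop at the first hit, a string that matches more than one length is reported once, for the shortest length.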

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,
