in reply to matching characters and numbers with regex

Actually, I think I need to be using /\w{4,8,16}:
$string =~/\w{4,8,16}/\/[0-9a-fA-F]/;
because instead of searching hexadecimal bytes, it's searching characters in a string. In hexadecimal, 0a is 1 character, but in a string 0a is 2 characters, correct?

Re^2: matching characters and numbers with regex
by Athanasius (Archbishop) on May 31, 2014 at 07:45 UTC
    $string =~/\w{4,8,16}/\/[0-9a-fA-F]/;

    There are two errors here:

    1. The quantifier syntax X{y,z} means: at least y and no more than z occurrences of X. You want to say: either exactly 4 occurrences, or exactly 8 occurrences, or exactly 16 occurrences; but you can’t do that with this quantifier. See “Quantifiers” in perlre#Regular-Expressions.
    2. The construct /.../\/.../ is a syntax error: the regex ends at the second /.
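One way to express "exactly 4, or exactly 8, or exactly 16" is an alternation of fixed-count quantifiers. The following is only a sketch: the helper name `is_hex_run` is made up for illustration, and anchoring with `\A` and `\z` assumes the whole string should have one of those lengths.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $s is exactly 4, 8, or 16 hex digits.
sub is_hex_run {
    my ($s) = @_;
    return $s =~ /\A(?:[0-9a-fA-F]{4}|[0-9a-fA-F]{8}|[0-9a-fA-F]{16})\z/ ? 1 : 0;
}

for my $candidate (qw(0a0a 0a0a0 0a0a0b0b 0123456789abcdef zzzz)) {
    printf "%-16s %s\n", $candidate, is_hex_run($candidate) ? 'ok' : 'no';
}
```

Each alternative carries its own exact count, so lengths such as 5 or 12 are rejected rather than accepted as "somewhere between 4 and 16".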

    Now for the bigger picture.

    You can probably do what you want with regexes, but it quickly becomes complicated. Here is some code I came up with to identify repeated 4-character sequences:

    #! perl
    use strict;
    use warnings;
    use List::MoreUtils 'uniq';

    my $string = '0a0a0a0a0b0b0a0a0c0c0c0c0c0c0c0c' .
                 '1f1f2b2b2b2b3e3e7b7b7b7b7b7b7b7b' .
                 '8f8f8f8f8f8f8f8f6c6c4b4b4b4b3f3f' .
                 '9d9d0f0f0f0f0f0f0f0f3a3a2e2e2e2e';

    # Each match returns both capture groups (4 chars and 2 chars);
    # keep only the unique 4-character ones.
    my @seqs = $string =~ /(([0-9a-fA-F]{2})\2)/g;
       @seqs = uniq grep { length == 4 } @seqs;

    for my $seq (@seqs)
    {
        # Count the non-overlapping occurrences of this sequence.
        my $matches = () = $string =~ /$seq/g;
        printf "%s: %d\n", $seq, $matches;
    }

    Output:

    17:30 >perl 914_SoPW.pl
    0a0a: 3
    0b0b: 1
    0c0c: 4
    1f1f: 1
    2b2b: 2
    3e3e: 1
    7b7b: 4
    8f8f: 4
    6c6c: 1
    4b4b: 2
    3f3f: 1
    9d9d: 1
    0f0f: 4
    3a3a: 1
    2e2e: 2

    17:30 >

    What concerns me here is the alignment problem: you presumably do not want to match a non-aligned sequence like the following:

    0a0axxx0a0a0yyyy
    ^^^^   ^^^^

    See, for example, the discussion of the \G anchor in the “Global matching” section of perlretut#Using-regular-expressions-in-Perl.
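As a rough illustration of the `\G` idea, the sketch below walks the string in aligned 4-character steps and keeps only the chunks made of a repeated 2-digit pair. The name `aligned_repeats` and the sample string are invented for this example, and a 4-character alignment is an assumption about what "aligned" should mean here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the aligned 4-character chunks that
# consist of one 2-digit hex pair repeated twice.
sub aligned_repeats {
    my ($string) = @_;
    my @found;
    pos($string) = 0;
    # \G pins each attempt to where the previous match ended, so the
    # scan advances in 4-character steps and cannot drift out of alignment.
    while ($string =~ /\G(.{4})/gcs) {
        my $chunk = $1;
        push @found, $chunk if $chunk =~ /\A([0-9a-fA-F]{2})\1\z/;
    }
    return @found;
}

# The second 0a0a starts at offset 7, so it straddles two chunks
# and is not reported.
print join(' ', aligned_repeats('0a0a1230a0a45678')), "\n";
```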

    I’m not sure that regexes are the best tool for this job. I would look at converting your string into an array of integers, then building a hash of integer sequences (of the desired lengths) mapped to their number of occurrences in the original string.
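A minimal sketch of that approach might look like the following. It assumes the input is clean hex with an even number of digits; `count_sequences` is a hypothetical name, and the windows it counts overlap, so the totals differ from the non-overlapping regex counts above.

```perl
use strict;
use warnings;

# Hypothetical helper: count every run of $len consecutive byte values.
sub count_sequences {
    my ($hex, $len) = @_;
    # Split into 2-character pieces; each becomes an integer in 0..255.
    my @bytes = map { hex($_) } unpack '(A2)*', $hex;
    my %count;
    for my $i (0 .. @bytes - $len) {
        # The key is the window of byte values, e.g. "10,10".
        my $key = join ',', @bytes[$i .. $i + $len - 1];
        $count{$key}++;
    }
    return \%count;
}

my $counts = count_sequences('0a0a0a0a0b0b0a0a', 2);
for my $key (sort keys %$counts) {
    printf "%s: %d\n", $key, $counts->{$key};
}
```

Because the string is split into bytes before anything is counted, a sequence can never start in the middle of a byte, which sidesteps the alignment problem.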

    Hope that helps,

    Update (June 1): Corrected alignment example.

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      I want it to match any 4 characters that repeat, starting from the beginning of the string. It will be checking a file for corruption. If the file is supposed to be

      "4428FBABCBED062405E56F853AAE238C4428FBABCBED062405E56F853AAE238CCC9AA594B5B35063A28224E2FE347EE349E9FFEDB897E32725F42C0D9FA2400D56C78EC7E711F47AA032CB76E11996D4"

      Then I want to make sure it doesn't have any repeating characters that are 4, 8, or 16 characters long. So if the above string was:

      "0A0AFBABCBED062405E56F853AAE238C4428FBABCBED062405E56F853AAE238CCC9AA594B5B35063A28224E2FE347EE349E9FFEDB897E32725F42C0D9FA2400D56C78EC7E711F47AA032CB76E11996D4"

      The difference between these two strings is the repeating characters 0A0A at the beginning of the second one. If it finds repeating characters, it will terminate the program and not continue, because it's checking for corruption.

        james28909,

        Well, this is a completely different spec from the one given previously (as I understood it, anyway)! If this is really all you need, it’s as simple as:

        #! perl
        use strict;
        use warnings;

        while (<DATA>)
        {
            if    (/^([0-9a-fA-F]{2})\1/)
            {
                print "Found 4 repeating characters: $1$1\n";
            }
            elsif (/^([0-9a-fA-F]{4})\1/)
            {
                print "Found 8 repeating characters: $1$1\n";
            }
            elsif (/^([0-9a-fA-F]{8})\1/)
            {
                print "Found 16 repeating characters: $1$1\n";
            }
            else
            {
                print "Found 0 repeating characters\n";
            }
        }

        __DATA__
        1234FBABCBED062405E56F853AAE238C4428FBABCBED0624
        0A0AFBABCBED062405E56F853AAE238C4428FBABCBED0624
        0A1B0A1BCBED062405E56F853AAE238C4428FBABCBED0624
        0A1B2C3D0A1B2C3DCBED062405E56F853AAE238C4428FBAB
        01230A0AFBABCBED062405E56F853AAE238C4428FBABCBED

        Output:

        13:12 >perl 914_SoPW.pl
        Found 0 repeating characters
        Found 4 repeating characters: 0A0A
        Found 8 repeating characters: 0A1B0A1B
        Found 16 repeating characters: 0A1B2C3D0A1B2C3D
        Found 0 repeating characters

        13:12 >

        (Note that the final string tested here contains the repeated characters 0A0A, but these are not at the beginning of the string.)

        Two obvious questions:

        1. Why shouldn’t a legitimate (i.e., non-corrupt) file begin with repeated characters?
        2. If a file is “corrupted,” will this always manifest as repeated characters at the start of the file? If not, how will you test for other forms of file corruption?

        I’ve got a sneaking suspicion that this thread is dealing with an XY Problem. If the answers don’t solve your real problem, you will need to explain the nature of the files and the process(es) by which the corruption may occur.

        Update: More compact version:

        while (my $string = <DATA>)
        {
            for my $chars (2, 4, 8)
            {
                printf "Found %2d repeating characters: %s\n", $chars * 2, $1 . $1
                    if $string =~ /^([0-9a-fA-F]{$chars})\1/;
            }
        }

        (In the actual script, the printf would be replaced by a die statement.)
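For illustration only, the `die` variant could be wrapped in a subroutine along these lines. The name `check_prefix` is hypothetical, and the message wording is simply carried over from the `printf` above.

```perl
use strict;
use warnings;

# Hypothetical helper: die if $string starts with a repeated run of
# 2, 4, or 8 hex digits (4, 8, or 16 characters in total).
sub check_prefix {
    my ($string) = @_;
    for my $chars (2, 4, 8) {
        die sprintf("Found %d repeating characters: %s\n", $chars * 2, $1 . $1)
            if $string =~ /^([0-9a-fA-F]{$chars})\1/;
    }
    return 1;    # nothing suspicious at the start of the string
}

# A clean prefix passes; a repeated prefix aborts inside the eval.
check_prefix('1234FBABCBED062405E56F853AAE238C');
eval { check_prefix('0A0AFBABCBED062405E56F853AAE238C') };
print "Rejected: $@" if $@;
```

A caller that should stop on the first bad file would simply omit the `eval` and let the `die` propagate.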

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,