Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of strings which contain a series of numbers delimited by a '-', like the following example.

0-0-0-23-34-2345-345-21-0-0-0-256-78-0-0-0-0-0-0-0-56-45-3-34-...

What I would like to find (with perl regular expressions?) is every number subseries in the string with a specific structure. For example, I could want to find any series of consecutive non-zero numbers followed by four or more zeros

.......-2-3-4-5-6-0-0-0-0-0-0-........

Is there an easy way to do these kinds of searches?

Replies are listed 'Best First'.
Re: Perl pattern finding
by kennethk (Abbot) on Jul 06, 2011 at 20:06 UTC

      :-) Haha, nope I am not a college student ditching homework. More like a frustrated grad student. So here is what I am trying.

      $string=~/((-\d{1,5}){4,20}(-0){4,20})/ print pos($1),"\n";

      In my head, this should find a pattern of up to four to twenty -(number)'s followed by four to twenty -0's. The print pos($1) should return the position in the string of the start of the match.

        One issue you are encountering is that you are using pos incorrectly. pos should be called on the variable you matched against, $string in your example. You also probably want to use a m//g (see Modifiers) wrapped in a while loop. Perhaps something like:

        #!/usr/bin/perl -w use strict; my $string = '0-0-0-23-34-2345-345-21-0-0-0-256-78-0-0-0-0-0-0-0-56-45 +-3-34-0-2-3-4-5-6-0-0-0-0-0-0'; while ($string=~/((-\d{1,5}){4,20}(-0){4,20})/g) { print pos($string),"\n"; }

        This will return the end positions where your regular expression matched. I'm pretty sure this result does not meet your actual spec.

        If I were going to write that regex, however, it would look more like:

        #!/usr/bin/perl -w use strict; my $string = '0-0-0-23-34-2345-345-21-0-0-0-256-78-0-0-0-0-0-0-0-56-45 +-3-34-0-2-3-4-5-6-0-0-0-0-0-0'; for my $match ($string =~/(?<!\d)(?:[^0]\d*-)+(?:0-){3,}0/g) { print "$match\n"; }

        where the regular expression matches as follows:

        NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- (?<! look behind to see if there is not: ---------------------------------------------------------------------- \d digits (0-9) ---------------------------------------------------------------------- ) end of look-behind ---------------------------------------------------------------------- (?: group, but do not capture (1 or more times (matching the most amount possible)): ---------------------------------------------------------------------- [1-9]+ any character of: '1' to '9' (1 or more times (matching the most amount possible)) ---------------------------------------------------------------------- - '-' ---------------------------------------------------------------------- )+ end of grouping ---------------------------------------------------------------------- (?: group, but do not capture (at least 3 times (matching the most amount possible)): ---------------------------------------------------------------------- 0- '0-' ---------------------------------------------------------------------- ){3,} end of grouping ---------------------------------------------------------------------- 0 '0' ---------------------------------------------------------------------- ) end of grouping ----------------------------------------------------------------------

        I've also used the magic that a Global matching modified regular expression in list context returns the list of all matches. Note that my negative lookbehind (see Looking ahead and looking behind) means that it will match at the start of the string, not just in the middle.

        Update:Changed [1-9]+ to [^0]\d* since we need "doesn't start with 0" not "no zeroes".

Re: Perl pattern finding
by AnomalousMonk (Archbishop) on Jul 06, 2011 at 23:13 UTC

    Here's my take. These assumptions are all easily changed in the code:

    • A 'non-zero' number may have leading zero(s) (i.e., is numerically != 0), e.g., '01', '0023';
    • A 'zero' may have multiple 0 digits (again, numerically == 0), e.g., '00', '000';
    • A single non-zero number is 'consecutive';
    • 'Consecutive' means consecutive increasing, with differences == 1.

    >perl -wMstrict -le "my $s = '0-0-0-0-23-34-2345-345-21-0-0-0-256-78-' . '0-0-0-0-0-0-0-56-45-3-34-0-' . '2-3-4-5-6-0-0-0-0-0-0-3-2-1-0-0-0-0-0-' . '21-22-0-0-0-0-1-0-0-0-0-' . '24-25-0-0-0-33-34-0-00-000-0000-' . '01-02-0-0-0-0-3-4-01-01-01-01' ; ;; my $nz = qr{ (?<! \d) \d* [1-9] \d* (?! \d) }xms; my $z = qr{ (?<! \d) 0+ (?! \d) }xms; ;; my $nzs_4zs = qr{ $nz (?: - $nz)* (?: - $z){4} }xms; ;; print qq{'$_'} for grep { consecutive_nzs($_, $nz) } $s =~ m{$nzs_4zs}xmsg; ;; sub consecutive_nzs { my ($s, $nz_rx) = @_; my @nzs = $s =~ m{ $nz_rx }xmsg; $nzs[$_]+1 == $nzs[$_+1] || return for 0 .. $#nzs - 1; return 1; } " '2-3-4-5-6-0-0-0-0' '21-22-0-0-0-0' '1-0-0-0-0' '33-34-0-00-000-0000' '01-02-0-0-0-0'
Re: Perl pattern finding
by ikegami (Patriarch) on Jul 06, 2011 at 20:20 UTC
    my @matches = "-$str-" =~ /(?:-[^0-]+)*(?:-0){4,}(?=-)/g;

    Useful test cases:

    • Pattern is at the beginning of the string: 1-0-0-0-0-2-3-4 (That's why I prepended a "-" to the string.)
    • Pattern is at the end of the string: 0-1-0-0-0-0 (That's why I appended a "-" to the string.)
    • Two patterns in a row: 1-2-3-0-0-0-0-4-5-6-0-0-0-0 (That's why I used /(?=-)/ instead of /-/ at the end.)
    • No leading non-zeroes: 0-0-0-0-1-2-3

    Note: My code doesn't find overlapping matches. It's not an issue with the pattern you describe, but it might be for other patterns.

Re: Perl pattern finding
by locked_user sundialsvc4 (Abbot) on Jul 07, 2011 at 02:40 UTC

    Generally speaking, this sort of problem is often handled by using the features that allow you to match an item in a string and then remain in the position where the match occurred, so that subsequent regexes pick up where the others left off in the same string.   So, this helps by allowing you to break down the problem into two smaller, but well-defined patterns: (a) a series of consecutive non-zero numbers; and (b) a series of four-or-more zeroes.   Your basic approach, then, now uses if statements and other standard procedural-programming constructs in conjunction with these two regexes.

    (pseudo-code) ... while (the first pattern matches) { $p = the ending position if (second pattern matches) { if (the first pattern matches again) { success! $p = the ending position now ... } } resume at $p }

    Finally, if you find yourself doing a lot of this sort of thing, it is possible that you are starting to wander into the lands where Parse::RecDescent might be of service.