regex, pos, \G, and substr

ff has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks
My users create strings which contain text like `update 8923 mark complete`. I'd like to allow them to create strings like `update 8435 and 9323 mark complete` and convert those into multiple strings that look like the old pattern, i.e. `update 8435 mark complete update 9323 mark complete`. The following snippet does just what I want, but the Camel Book, in describing the `\G` assertion, says:

Whenever you start thinking in terms of the pos function, it's tempting to start carving your string up with substr, but this is rarely the right thing to do.

So, should I consider doing something else? Thanks.

#!/usr/bin/perl -w
use strict;

my $data_stg
    = 'junk text update 8923 mark complete update 8324 mark '
    . 'complete more junk update 5438 and 5843 and 1522 mark '
    . 'complete update 8435 and 9323 mark complete true junk';


pos( $data_stg ) = 0;

my %mult_updates;
my $pass = 1;
while ( $data_stg =~ /(update \d+( and \d+)+ mark complete)/ig ) {
    my $mult_update_pos = pos( $data_stg );
print "$pass pos: '$mult_update_pos'\n";

    my $mult_update = $1;
print "$pass orig_mult_update: '$mult_update'\n";

    my $mult_update_length = length $mult_update;
print "$pass length: '$mult_update_length'\n";

    $mult_update =~ s/and (\d+)/mark complete update $1/gi;
print "$pass new_mult_update: '$mult_update'\n";

    $mult_updates{ $mult_update_pos - $mult_update_length }
        = [ ($mult_update_length, $mult_update) ];
}
continue {
    $pass++;
}

# Work backwards from the end of the string, doing substr
# on positions which have been identified as having code to
# replace.  Let the key define a starting position and the
# key's value contain an array ref describing the length
# of the target and the desired replacement text.
foreach ( sort {$b <=> $a} keys %mult_updates ) {
    substr(
        $data_stg,
        $_,
        $mult_updates{$_}->[0],
        $mult_updates{$_}->[1]
    );
}

print "\n$data_stg\n";
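
For comparison, one way to avoid the `pos` arithmetic and the later `substr` patching is to scan the string once with `\G` and `/gc`, building a new string as you go. The following is only a minimal sketch, not code from this thread: the sub name `expand_with_g` is hypothetical, and it assumes that only all-digit lists joined by ` and ` should be expanded, with everything else copied through unchanged.

```perl
use strict;
use warnings;

# Scan left to right. At each position, either expand a multi-number
# request or copy a run of ordinary text to the output. The input
# string itself is never modified.
sub expand_with_g {
    my ($text) = @_;
    my $out = '';
    pos($text) = 0;
    while ( pos($text) < length($text) ) {
        if ( $text =~ /\Gupdate (\d+(?: and \d+)+) mark complete/gc ) {
            my $numbers = $1;    # copy before running another regex
            $out .= join ' ',
                map { "update $_ mark complete" } split / and /, $numbers;
        }
        elsif ( $text =~ /\G(.[^u]*)/gcs ) {
            # No request starts here: copy one character plus
            # everything up to the next 'u', where a request could begin.
            $out .= $1;
        }
    }
    return $out;
}

print expand_with_g(
    'junk update 8435 and 9323 mark complete true junk'
), "\n";
```

The `/c` modifier keeps `pos` where it was when the first pattern fails, so the second pattern can pick up at the same spot. Whether this reads better than a single `s///ge` is a matter of taste.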

Re: regex, pos, \G, and substr by BrowserUk (Patriarch) on Jun 03, 2007 at 02:29 UTC
This seems somewhat simpler, though you might want to strengthen the regex to validate the input more.

#! perl -slw
use strict;

my $data_stg
    = 'junk text update 8923 mark complete update 8324 mark '
    . 'complete more junk update 5438 and 5843 and 1522 mark '
    . 'complete update 8435 and 9323 mark complete true junk';

$data_stg =~ s[update (.+?) mark complete]{
    join ' ', map{ "update $_ mark complete" } split '\s+and\s+', $1
}ge;

print $data_stg;

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
Re^2: regex, pos, \G, and substr by ff (Hermit) on Jun 03, 2007 at 03:02 UTC
I think it's perceptive to split the guts of the phrases on `[ ]and[ ]`, but it's really important in my case that the leftovers are only digits. While I could throw `grep { /^\d+$/ }` in front of the `split`, I'd lose sight of any non-digit stuff that was (mistakenly) there once the replacement side of the `(s)ubstitute` operator had run. In other words, I'd rather leave everything alone if there's anything "non-digit" besides the `and` splitters in there. BTW, I like the single-quotes for delimiting the split regex.
Re^3: regex, pos, \G, and substr by BrowserUk (Patriarch) on Jun 03, 2007 at 03:34 UTC
That's what I meant by strengthening the regex. Note that the non-conformant additional third line is left untouched:

#! perl -slw
use strict;

my $data_stg
    = 'junk text update 8923 mark complete update 8324 mark '
    . 'complete more junk update 5438 and 5843 and 1522 mark '
    . 'complete more junk update junk and 5843 and 1522 mark '
    . 'complete update 8435 and 9323 mark complete true junk';

$data_stg =~ s[update ((?:\d+|\s|and)+) mark complete]{
    join ' ', map{ "update $_ mark complete" } split '\s+and\s+', $1
}ge;

print $data_stg;

__END__
## Output wrapped to match input for easier verification.
junk text update 8923 mark complete update 8324 mark
complete more junk update 5438 mark complete update 5843 mark complete update 1522 mark
complete more junk update junk and 5843 and 1522 mark
complete update 8435 mark complete update 9323 mark complete true junk

Alternatively, verify that the split values are numeric, produce a warning and put the original back if not:

#! perl -slw
use strict;

my $data_stg
    = 'junk text update 8923 mark complete update 8324 mark '
    . 'complete more junk update 5438 and 5843 and 1522 mark '
    . 'complete more junk update junk and 5843 and 1522 mark '
    . 'complete update 8435 and 9323 mark complete true junk';

$data_stg =~ s[(update (.+?) mark complete)]{
    my @numbers = split '\s+and\s+', $2;
    if( grep{ !/^\d+$/ } @numbers ) {
        warn "Malformed request: '$1'\n";
        $1;
    }
    else{
        join ' ', map{ "update $_ mark complete" } @numbers;
    }
}ge;

print $data_stg;

__END__
## Output wrapped to match input for easier verification.
Malformed request: 'update junk and 5843 and 1522 mark complete'
junk text update 8923 mark complete update 8324 mark
complete more junk update 5438 mark complete update 5843 mark complete update 1522 mark
complete more junk update junk and 5843 and 1522 mark
complete update 8435 mark complete update 9323 mark complete true junk

"BTW, I like the single-quotes for delimiting the split regex."

Most don't. They consider it a bad habit of mine.
Re^3: regex, pos, \G, and substr by ysth (Canon) on Jun 04, 2007 at 01:02 UTC
"I'd rather leave everything alone if there's anything 'non-digit' besides the `and` splitters in there."

Then leave that part the same as in your original looping regex: `s[update (\d+(?: and \d+)+) mark complete]{...}ge;`
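
Put together, the suggestion might look like the sketch below. The replacement body is an assumption: it borrows the `join`/`map`/`split` expression from BrowserUk's reply to stand in for the elided part, and the sub name `expand_updates` is made up for illustration. Because the pattern only accepts digit groups separated by ` and `, a malformed request never matches and is left alone.

```perl
use strict;
use warnings;

# Expand "update N and M ... mark complete" into one request per number.
# Requests containing anything other than digits and " and " do not
# match the pattern, so they pass through unchanged.
sub expand_updates {
    my ($text) = @_;
    $text =~ s{update (\d+(?: and \d+)+) mark complete}{
        my $ids = $1;
        join ' ', map { "update $_ mark complete" } split / and /, $ids;
    }ge;
    return $text;
}

my $data_stg
    = 'junk text update 8923 mark complete '
    . 'update junk and 5843 and 1522 mark complete '
    . 'update 8435 and 9323 mark complete true junk';

print expand_updates($data_stg), "\n";
```

Only the last request in the sample string is rewritten. The single-number request and the one containing `junk` come out as they went in.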
Re: regex, pos, \G, and substr by moritz (Cardinal) on Jun 03, 2007 at 09:20 UTC
If you want to be ultra lazy and your data is not read by other programs that you have no control over, you might use a common serialization format like YAML, XML or JSON. Then you could read and write it with the appropriate CPAN modules and be pretty sure that it works as expected.

Perl 6 in German
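
As a rough illustration of the serialization idea, here is a sketch using `JSON::PP`, which has shipped with Perl since 5.14. The record layout (an `action` string plus a list of `ids`) and the sub names are assumptions made up for this example. Nothing in the thread says the users' data could actually be stored this way.

```perl
use strict;
use warnings;
use JSON::PP;    # pure-Perl JSON module, in core since Perl 5.14

# canonical() sorts hash keys so the encoded text is stable.
my $codec = JSON::PP->new->utf8->canonical;

# Serialize a list of request records (hypothetical layout).
sub requests_to_json {
    my (@requests) = @_;
    return $codec->encode( \@requests );
}

# Read the records back and emit one old-style command per ID.
sub json_to_commands {
    my ($json) = @_;
    my $requests = $codec->decode($json);
    return map {
        my $req = $_;
        map { "update $_ $req->{action}" } @{ $req->{ids} };
    } @$requests;
}

my $json = requests_to_json(
    { action => 'mark complete', ids => [ 5438, 5843, 1522 ] },
    { action => 'mark complete', ids => [ 8435, 9323 ] },
);
print "$json\n";
print "$_\n" for json_to_commands($json);
```

With structured records there is no free text to parse, so the "only digits" check becomes a matter of validating the `ids` list rather than tightening a regex.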
Re^2: regex, pos, \G, and substr by Anonymous Monk on Jun 03, 2007 at 14:58 UTC
Hey. XSLT lets you write whole computer programs in XML, so maybe Perl6 should be written in YAML or JSON. It would do away with all that complicated syntax and the need to use horrible, nasty, complicated things like regexes. We could just load up a CPAN module and Perl6 would be ready by next weekend. And we could be sure it worked properly.
Re^3: regex, pos, \G, and substr by pKai (Priest) on Jun 03, 2007 at 22:02 UTC
The point of this reply is completely obscure to me. I have been doing a lot of XSLT this last year at $work, and the ability to use regular expressions in XSLT/XPath is much sought after. XSLT also isn't the last word in high-level programming, IMHO, though it certainly has its niche where it might be considered useful, e.g. to avoid a "media break".