Re: Proper use of split

Assuming for the moment that this is not a well-known format like JSON, for which a more complete solution exists, I can think of two general approaches to this problem. One is to split the string on commas into a list. The other is to use “global matching” (the /g and /c modifiers), as described in the section “Using regular expressions in Perl” in perldoc perlretut. Of the two, I prefer the second, especially if the data is consistently numeric.
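To make the first approach concrete, here is a minimal sketch of splitting on commas. The sample record and the name parse_with_split are assumptions made up for illustration, since the actual input is not shown in this post; treat it as a starting point rather than a tested solution for the real data.

```perl
use strict;
use warnings;

# Hypothetical sample record; the real input format is not shown in this thread.
my $record = '"temp":70.00,"tmode":2,"fmode":0';

# Approach 1: split on commas, then split each field on its first colon.
sub parse_with_split {
    my ($text) = @_;
    my %value_for;
    for my $field (split /,/, $text) {
        my ($key, $value) = split /:/, $field, 2;
        next unless defined $value;   # skip fields that contain no colon
        $key =~ tr/"//d;              # strip the surrounding double quotes
        $value_for{$key} = $value;
    }
    return \%value_for;
}

my $parsed = parse_with_split($record);
print "$_ = $parsed->{$_}\n" for sort keys %$parsed;
```

Note that this breaks as soon as a value can itself contain a comma or a colon, which is one reason to reach for a real parser when the format turns out to be JSON after all.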

“Global matching” lets you apply a regex more than one time to the same string, so that you can take a “winnowing the wheat from the chaff” approach by using a regular expression that corresponds to the “wheat.” The position of the matching string is established by the pos() function, which has one very important “gotcha”: that the start-position corresponding to “from the start of the string” is undef, not zero. (Uh huh... “ouch! it bit me!”)

As an extemporaneous example, a pattern such as \"([a-z_]+)\"\:([0-9.]+) could be applied and it would return the matched substrings as $1 and $2 ... I repeat, extemporaneous example ... and it would return $1='temp', $2='70.00' the first time, $1='tmode', $2='2' the second time, and so on (if I actually got it right). It would skip over anything that did not match in search of the next thing that did. This can be a useful technique, although as with everything else having to do with regular expressions it demands rigorous testing. (Beware that if the regular expression does not encompass all of the actual data, any data which doesn’t match will simply be skipped! For example, I had to edit this post to include an underscore character ...)
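A rough sketch of that technique, using the same extemporaneous pattern in a while loop with m//g in scalar context. The sample record and the name parse_with_global_match are hypothetical; only the pattern comes from the paragraph above.

```perl
use strict;
use warnings;

# Hypothetical sample record; the real input is not shown in this thread.
my $record = '"temp":70.00,"tmode":2,"t_heat":68.5';

# Approach 2: apply the same pattern repeatedly with m//g in scalar context.
# Each successful match resumes where the previous one left off.
sub parse_with_global_match {
    my ($text) = @_;
    my @pairs;
    while ($text =~ m{ " ([a-z_]+) " : ([0-9.]+) }xmsg) {
        push @pairs, [ $1, $2 ];      # key in $1, numeric value in $2
    }
    return @pairs;
}

for my $pair (parse_with_global_match($record)) {
    print "$pair->[0] = $pair->[1]\n";
}
```

The silent-skip warning is easy to reproduce: feed it a key with an uppercase letter, such as "Temp", and that pair simply disappears from the result with no error.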

Replies are listed 'Best First'.

Re^2: Proper use of split
by AnomalousMonk (Archbishop) on Jun 02, 2012 at 17:59 UTC

The position of the matching string is established by the pos() function...

But pos sez (emphasis added):

Returns the offset of where the last "m//g" search left off for
the variable in question [...]. Note that 0 is a valid match offset. "undef"
indicates that the search position is reset (usually due to
match failure, but can also be because no match has yet been run
on the scalar).

IOW, pos controls the point at which m//g matching resumes following a previous m//g match on a given string. If there was no previous m//g match (either because such a match was not attempted or because it failed), the point at which to resume m//g matching has no meaning and is literally undef.
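As a small illustration of that reading of the documentation, the sketch below records pos() before any m//g match, after a successful one, and after failed ones with and without the /c modifier. The string 'aaXbb' and the helper show_pos are made up for this example.

```perl
use strict;
use warnings;

# Render pos() for display, since it may be undef.
sub show_pos {
    my ($string_ref) = @_;
    my $p = pos($$string_ref);
    return defined $p ? $p : 'undef';
}

my $text = 'aaXbb';
my @seen;

push @seen, show_pos(\$text);   # no m//g has run on $text yet

$text =~ m{ a+ }xmsg;           # succeeds, matching 'aa'
push @seen, show_pos(\$text);

$text =~ m{ Z }xmsgc;           # fails, but /c keeps the position
push @seen, show_pos(\$text);

$text =~ m{ Z }xmsg;            # fails without /c, so the position is reset
push @seen, show_pos(\$text);

print join(', ', @seen), "\n";  # prints: undef, 2, 2, undef
```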

Update: Assuming that the start-position corresponding to “from the start of the string” refers to the \A assertion, consider the following (note that print_pos() undefines pos($_) on each call):

>perl -wMstrict -le
"$_ = 'abcdef';
 print_pos('initial');
 ;;
 m{ \A }xms;
 print_pos('\A');
 ;;
 m{ \A }xmsg;
 print_pos('\A/g');
 ;;
 m{ \A }xmsg;
 m{ \A }xmsg;
 print_pos('\A/g repeated');
 ;;;;
 sub print_pos {
   printf qq{%14s: pos = %s \n},
          $_[0], defined(pos) ? pos() : 'undef'
          ;
   pos = undef;
   }
"
       initial: pos = undef
            \A: pos = undef
          \A/g: pos = 0
 \A/g repeated: pos = undef

pos($_) is undefined after the initialization of the $_ scalar as a string. Following the first m{ \A }xms; statement (non-m//g match), pos($_) is undefined because no m//g has yet been done.

Following the single m{ \A }xmsg; global match statement, pos($_) is 0 because this is the character position after the \A absolute-beginning-of-the-string assertion. (Remember that \A is a zero-width assertion and so can be comfortable in the narrow confine between the start of the string and its first character!) This is the position from which a subsequent m//g would begin matching.

Following the repeated m{ \A }xmsg; m{ \A }xmsg; global match statements, pos($_) is undefined because the second global match failed: it could not find a position at which the \A assertion was true when searching from character position 0 to the end of the string.

Ok, you're so smart, so go explain these results:

>perl -wMstrict -le
"$_ = 'abcdef';
 print_pos('initial');
 ;;
 m{ \b }xmsg;
 print_pos('single \b/g');
 ;;
 m{ \b }xmsg;
 m{ \b }xmsg;
 print_pos('double \b/g');
 ;;
 m{ \b }xmsg;
 m{ \b }xmsg;
 m{ \b }xmsg;
 print_pos('triple \b/g');
 ;;;;
 sub print_pos {
   printf qq{%14s: pos = %s \n},
          $_[0], defined(pos) ? pos() : 'undef'
          ;
   pos = undef;
   }
"
       initial: pos = undef
   single \b/g: pos = 0
   double \b/g: pos = 6
   triple \b/g: pos = undef

(In particular, if the string 'abcdef' has six characters and therefore character positions 0 .. 5 inclusive, what does it mean that a m{ \b }xmsg; statement finds a 'match' at a pos of six?)
