Re: regex anchoring issue

Welcome to the monastery.

Firstly, your data description seems a little ambiguous: you say "end character" then describe <SOH> (5 chars), ^A (2 chars) and Ctrl-A (1 char). If by <SOH> you mean the ASCII control character, that is the same character as Ctrl-A (i.e. the character with the ASCII value of 1).
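To make the distinction concrete, here is a minimal sketch (the variable name `@spellings` is only for illustration) showing that `chr(1)`, `"\cA"` and `"\x01"` are all the same one-character string, whereas the literal text `'<SOH>'` is five characters long:

```perl
use strict;
use warnings;

# Three spellings of the same one-character string (ASCII value 1).
my @spellings = ( chr(1), "\cA", "\x01" );

for my $s (@spellings) {
    printf "length=%d ord=%d\n", length $s, ord $s;
}

# The literal text '<SOH>' is a different, five-character string.
printf "length=%d\n", length '<SOH>';
```

Each of the first three lines prints `length=1 ord=1`; the last prints `length=5`.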

The main problem in your regexp is the use of a bracketed character class (i.e. [...]), which matches a single character from the listed set rather than a sequence of characters - see Character Classes and other Special Escapes under perlre - Regular Expressions for details. You also don't need the 'g' modifier in either the match (m/.../) or the split function.
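As a rough illustration of the character class point (the sample strings and variable names below are made up for this sketch, and are not taken from your code), compare a bracketed class with an alternation:

```perl
use strict;
use warnings;

my $text = 'HOSTS<SOH>';

# Character class: matches ONE character out of '<', 'S', 'O', 'H', '>'.
# Here it matches the very first 'H', nowhere near the terminator.
my ($class_hit) = $text =~ /([<SOH>])/;

# Alternation: matches the whole five-character sequence '<SOH>'
# (or the text '^A', or the control character Ctrl-A).
my ($alt_hit) = $text =~ /(<SOH>|\^A|\cA)/;

print "class matched: '$class_hit'\n";
print "alternation matched: '$alt_hit'\n";

# split already finds every separator, so no 'g' modifier is needed.
my @parts = split /<SOH>|\^A|\cA/, 'a<SOH>b^Ac';
print "parts: @parts\n";
```

The class matches `H`, the alternation matches `<SOH>`, and the split yields `a b c`.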

The following script does what I think you want (in terms of identifying the line endings). If not, please provide some sample data with expected output to remove the ambiguity I mentioned at the start.

#!/usr/bin/env perl

use 5.010;
use strict;
use warnings;

my $soh_string = 'soh_string<SOH>';
my $caret_a_string = 'caret_a_string^A';
my $ctrl_a_string = 'ctrl_a_string' . chr(1);

my $test_string = join('',
    $soh_string, $caret_a_string, $ctrl_a_string,
    $caret_a_string, $ctrl_a_string, $soh_string,
    $ctrl_a_string, $soh_string, $caret_a_string
);

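# Match any of the three candidate terminators: the literal text '<SOH>',
# the two-character text '^A', or the control character Ctrl-A (chr(1)).
my $string_re = qr{(?><SOH>|\^A|\cA)};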

say for split $string_re => $test_string;

Output:

$ pm_soh_split.pl
soh_string
caret_a_string
ctrl_a_string
caret_a_string
ctrl_a_string
soh_string
ctrl_a_string
soh_string
caret_a_string

-- Ken


Re^2: regex anchoring issue by BillKSmith (Monsignor) on Feb 15, 2013 at 14:01 UTC
Refer to charnames for a neat way to code the value of your $soh_string: `use charnames qw(:full); $soh_string = "\N{SOH}";`

Bill
Re^3: regex anchoring issue by kcott (Archbishop) on Feb 16, 2013 at 06:35 UTC
Thanks, Bill. I had considered that but decided not to use it due to the ambiguity I noted in my opening paragraph. Had penguin-attack wanted the single ASCII character `SOH`, instead of the string '`<SOH>`', that was covered by `Ctrl-A` (also noted).

[Side issue (struggling not to appear grossly pedantic): the `charnames` pragma has been distributed with Perl since at least v5.8.8 - the perldoc link (charnames) would provide the most recent documentation.]

-- Ken
Re^2: regex anchoring issue by smls (Friar) on Feb 15, 2013 at 11:23 UTC
Why do you place the regex in a `(?>...)` non-backtracking group?
Re^3: regex anchoring issue by kcott (Archbishop) on Feb 16, 2013 at 05:58 UTC
Wrapping regexp alternations in `(?>...)` is something I do by default. While there may be rare cases where this is problematic, I haven't encountered any: it doesn't hurt and, indeed, often helps.

This usage is based on a "Perl Best Practices" guideline: Backtracking (page 269). It's summarised on page 271 as:

... rewrite any instance of: `X | Y` as: `(?> X | Y )`

While I'm not a slave to all "Perl Best Practices" guidelines, this is one I have found to be useful.

Update: `s/have encountered/haven't encountered/`

-- Ken
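
For readers wondering what the "rare cases" might look like, here is a small sketch (the input and patterns are contrived for illustration, not taken from the thread) where one alternative is a prefix of another. An ordinary group can backtrack to the second alternative; the atomic group cannot:

```perl
use strict;
use warnings;

my $input = 'abc';

# Ordinary group: 'a' is tried first, 'c' then fails against 'b',
# so the engine backtracks into the group and tries 'ab' instead.
my $plain = $input =~ /^(?:a|ab)c$/ ? 1 : 0;

# Atomic group: once 'a' has matched, the group is never re-entered,
# so the 'ab' alternative is not tried and the match fails.
my $atomic = $input =~ /^(?>a|ab)c$/ ? 1 : 0;

print "plain: $plain, atomic: $atomic\n";
```

This prints `plain: 1, atomic: 0`. When no alternative is a prefix of another, as with `<SOH>|\^A|\cA`, the two forms match the same text.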