spx2 has asked for the wisdom of the Perl Monks concerning the following question:

just read almost all of perlre
i am trying to get the following regex to work
with split so that it separates $text
into particles, where the separators for this particular
text are:
,
|
:
>
][
_|_

so i wrote this, which doesn't work as it's supposed to:

my $text="jojo,has|some:big>balls][nuts,sometimes_|_he,scratches";
my $pattern= "(,|\||:|>|][|_\|_)";
my @splitted= split /\Q$pattern/,$text;
foreach $particle (@splitted) {
    print $particle."\n";
}

also i have one more question:
if i put in $pattern="has" the text gets split
at the word "has", but if i do $pattern="(has|big)" i
expect it to be split at the words "has" and "big", and
it doesn't work at all.
thank you

Re: regex trouble
by trwww (Priest) on May 30, 2007 at 07:28 UTC

    This is a pretty straightforward thing to do. I'd break up the construction of $pattern into steps:

    [trwww@www misc]$ cat 618115.pl
    use warnings;
    use strict;

    my @delimiters = (
      ',',
      '|',
      ':',
      '>',
      '][',
      '_|_',
    );

    my $pattern = join '|', map quotemeta, @delimiters;

    my $text = "jojo,has|some:big>balls][nuts,sometimes_|_he,scratches";

    foreach my $particle (split /$pattern/, $text) {
      print $particle."\n";
    }

    That gives the following output:

    [trwww@www misc]$ perl 618115.pl
    jojo
    has
    some
    big
    balls
    nuts
    sometimes
    he
    scratches
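One thing to watch with a joined alternation like this: if one delimiter is a prefix of another, the order of the alternatives matters. That does not happen with the delimiters in this thread, so the list below is purely hypothetical. A minimal sketch of sorting longest-first before joining:

```perl
use strict;
use warnings;

# Hypothetical delimiters, not from the thread: one is a prefix of the other.
my @delimiters = ('|', '||');

# Joined in list order, the shorter alternative is tried (and matches) first,
# so "||" is seen as two separators with an empty field between them.
my $naive = join '|', map { quotemeta($_) } @delimiters;
my @naive_parts = split /$naive/, 'a||b';

# Sorting longest-first lets "||" be tried before "|".
my $ordered = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @delimiters;
my @ordered_parts = split /$ordered/, 'a||b';

print scalar(@naive_parts), " vs ", scalar(@ordered_parts), "\n";
```

The first split yields three fields (the middle one empty), the second yields two.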

    Hope this helps,

    trwww

      I just discovered quotemeta as a result of the recent functional functions node, so I'm pleased to see it in use here.

      I have a question about your solution, though. When I print "$pattern\n", I get
      \,|\||\:|\>|\]\[|_\|_
      To me, it's odd that this works correctly with the comma, colon, and chevron escaped. I'll go look at perlre, but why doesn't it matter that these are escaped?

      ~dewey
        In a regex, "\," is exactly equivalent to "," -- and likewise for colon and angle brackets. Those characters do not have any "magical" force in the regex syntax when used without escapes (in contrast to period, asterisk, square brackets and so on), nor do they have any special meaning when preceded by backslash (in contrast to "n", "t", "b", "d" and so on).

        Meanwhile, quotemeta is a more-or-less general-purpose function -- according to the manual, it 'Returns the value of EXPR with all non-"word" characters backslashed. (That is, all characters not matching "/[A-Za-z_0-9]/" will be preceded by a backslash in the returned string, regardless of any locale settings.)'

        (updated to fix display of square brackets in last paragraph)
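A small sketch to check this for yourself (the sample strings are made up for illustration): it prints what quotemeta produces and confirms that the escaped comma and colon still match the plain characters, while the escaped pipe is a literal pipe rather than alternation.

```perl
use strict;
use warnings;

# quotemeta backslashes every character outside [A-Za-z_0-9].
my $quoted = quotemeta('a,b:c>d_e');
print "$quoted\n";

# An escaped comma or colon still matches the plain character.
my $comma_ok = ( 'x,y' =~ /x\,y/ ) ? 1 : 0;
my $colon_ok = ( 'x:y' =~ /x\:y/ ) ? 1 : 0;

# An unescaped "|" would be alternation; the escaped form is a literal pipe.
my @by_literal_pipe = split /\|/, 'a|b';

print "comma=$comma_ok colon=$colon_ok pieces=@by_literal_pipe\n";
```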

Re: regex trouble
by GrandFather (Saint) on May 30, 2007 at 07:08 UTC

    Your description is such a mess due to bad HTML that I'm not sure what you really want to do, but most likely your problem is that \Q quotes meta characters so that the contents of $pattern are treated as a string to match. Omit \Q and in this case your life may be happier.
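This is also what is behind the second question about "has" and "big". A minimal sketch on a shortened version of the text, using a paren-free pattern chosen for illustration:

```perl
use strict;
use warnings;

my $text    = "jojo,has|some:big>balls";
my $pattern = "has|big";

# With \Q the "|" is escaped, so the regex looks for the literal
# string "has|big", which never occurs in $text.
my @quoted = split /\Q$pattern/, $text;
print scalar(@quoted), "\n";

# Without \Q the "|" is alternation and both words act as separators.
my @parts = split /$pattern/, $text;
print join(' / ', @parts), "\n";
```

The first split returns the text as a single unsplit element; the second returns three pieces. A plain word like "has" contains no metacharacters, which is why \Q made no difference in that case.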

    You also need to quote all the meta characters in $pattern, and in a double quoted string you need to double each backslash: "(,|\\||:|>|\\]\\[|_\\|_)" so that a single backslash is left over to be parsed by the regex engine. The two changed lines then become:

    my $pattern= "(,|\\||:|>|\\]\\[|_\\|_)";
    my @splitted= split /$pattern/,$text;
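    One side effect worth knowing about: when the pattern given to split contains capturing parentheses, split also returns the captured separators in its result list. A short sketch on a cut-down string (chosen here for illustration) comparing a capturing group with a non-capturing (?:...) group:

```perl
use strict;
use warnings;

my $text = "jojo,has|some][nuts";

# Capturing group: split returns the separators as well as the fields.
my @with_seps = split /(,|\||\]\[)/, $text;
print join(' ', @with_seps), "\n";

# Non-capturing group: only the fields come back.
my @fields = split /(?:,|\||\]\[)/, $text;
print join(' ', @fields), "\n";
```

    If only the particles are wanted, either drop the outer parentheses or use (?:...).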

    DWIM is Perl's answer to Gödel

      The two changed lines then become:

      my $pattern= "(,|\\||:|>|\\]\\[|_\\|_)";
      my @splitted= split /$pattern/,$text;

      I know you're trying to stay as close as possible to the OP's code, but as an additional recommendation: for maximum clarity in such a situation one should really follow a cleaner approach like trwww's++ or, if using a single hardcoded regex, take advantage of the /x modifier and throw in suitable whitespace.

        At some point adding whitespace

        leads

        to

        reduced

        clarity

        and

        slower

        comprehension

        .

        Two things would clean those particular lines up in my view - adding /x as you suggest, and using a character set for the single character delimiters:

        my $pattern= "([,|:>] | \\]\\[ | _\\|_)";
        my @splitted= split /$pattern/x, $text;
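        For comparison, a sketch of the same idea written with qr//, which is not what the posts above use: it avoids the doubled backslashes of a double-quoted string and leaves room for comments under /x. The variable name $separator is just for illustration, and no capturing group is used, so only the particles are returned.

```perl
use strict;
use warnings;

my $text = "jojo,has|some:big>balls][nuts,sometimes_|_he,scratches";

# qr// compiles the pattern and needs only single backslashes.
# Under /x, whitespace and comments inside the pattern are ignored.
my $separator = qr{
    _\|_        # underscore, pipe, underscore
  | \]\[        # closing bracket followed by opening bracket
  | [,|:>]      # any one of the single-character separators
}x;

my @parts = split $separator, $text;
print "$_\n" for @parts;
```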

        DWIM is Perl's answer to Gödel
Re: regex trouble
by dewey (Pilgrim) on May 30, 2007 at 07:16 UTC
    The main problem I can see is that you need to escape ] and [ in regexen (they are used for character classes). This worked for me:
    my $text="jojo,has|some:big>balls][nuts,sometimes_|_he,scratches";
    my @splitted= split /,|\||:|>|\]\[|_\|_/, $text;
    foreach $particle (@splitted) {
        print $particle."\n";
    }
    PS: ...or just see GrandFather's post above :)

    ~dewey