MajingaZ has asked for the wisdom of the Perl Monks concerning the following question:

This is my first post to SoPW, though I've had solid success using the CB.

I work with lots of text files and have had problems because the Perl modules (e.g. Text::CSV_XS) do not allow embedded, non-escaped text qualifiers or field separators. So this is an attempt to retrieve the text qualifier and field separator from a text file containing a header.
$foo = qq{^snafu^|^foobar^\n};
$foo =~ m/\A(\W)  # \A instead of ^ and match first non-word
          [^\1]+  # Match everything that isn't in \1
          \1(\W)  # Match non-word following the 2nd \1
         /xms;
$text_qual = $1;
$field_sep = $2;
The regular expression sets the second capture to the newline instead of the pipe, as it should be.
It appears as if [^\1]+ is being interpreted as .+ for some reason. Obviously, in this case I could use:
$foo =~ m/\A(\W)  # \A instead of ^ and match first non-word
          \w+\1   # Match any word char until the next \1
          (\W)    # Match following non-word
         /xms;
But this only works for this limited case where the first field contains no symbols. Any suggestions would be greatly appreciated.

MajingaZ
~Sic volere parcas~
So spin the Fates

Re: Regular Expresssion TroubleShoot Help plz
by hv (Prior) on Mar 29, 2006 at 02:20 UTC

    The problem is precisely that \1 in a character class is not a backreference: it refers to the ASCII character chr(1), an abbreviation of the octal escape sequence \001.
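A minimal sketch that checks this behaviour directly. The variable names are made up for illustration; the only thing it relies on is the rule stated above, that \1 inside a character class means chr(1).

```perl
use strict;
use warnings;

my $caret = '^';
my $soh   = chr(1);    # the character that \1 denotes inside [...]

# Inside a character class, \1 is the octal escape for chr(1),
# so [^\1] means "any single character except chr(1)".
my $caret_matches = $caret =~ /\A[^\1]\z/ ? 1 : 0;
my $soh_matches   = $soh   =~ /\A[^\1]\z/ ? 1 : 0;

print "caret matches [^\\1]:  $caret_matches\n";    # prints 1
print "chr(1) matches [^\\1]: $soh_matches\n";      # prints 0
```

Since the caret is not excluded, [^\1]+ happily runs past every ^ in the header, which is why the original pattern behaved like .+ here.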

    You can achieve what you want with a slightly more complicated approach using negative lookahead:

    m{
        \A               # anchor to start
        (\W)             # open text
        (?: (?!\1) . )*  # anything that isn't the closer
        \1               # close text
        (\W)             # separator
    }xs

    That works for the general case, when you simply want to match a bunch of stuff not containing a given substring. In this case though, you want to match "up to the first occurrence" of that substring, so it's much simpler - you just need a minimal match:

    m{
        \A      # anchor to start
        (\W)    # open text
        .*?     # anything contained, up to the ...
        \1      # ... close text
        (\W)    # separator
    }xs

    (I've taken the liberty of replacing your '+' with '*', on the assumption that you want to allow empty fields.)
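For anyone who wants to try both patterns, here is a small sketch that runs them against the sample header from the question. The variable names ($header, $q1, $s1 and so on) are only for illustration.

```perl
use strict;
use warnings;

my $header = qq{^snafu^|^foobar^\n};

# Negative lookahead: take one character at a time, as long as the
# opening qualifier does not start at the current position.
my ( $q1, $s1 ) = $header =~ m{
    \A
    (\W)             # open text
    (?: (?!\1) . )*  # anything that isn't the closer
    \1               # close text
    (\W)             # separator
}xs;

# Minimal match: stop at the first occurrence of the qualifier.
my ( $q2, $s2 ) = $header =~ m{
    \A
    (\W)    # open text
    .*?     # anything contained, up to the ...
    \1      # ... close text
    (\W)    # separator
}xs;

print "lookahead: [$q1] [$s1]\n";    # prints: lookahead: [^] [|]
print "minimal:   [$q2] [$s2]\n";    # prints: minimal:   [^] [|]
```

Both forms pick up ^ as the text qualifier and | as the field separator for this input.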

    Hugo

      $foo = qq{^snafu^|^foobar^\n};
      $foo =~ m/\A(\W)  # \A instead of ^ and match first non-word
                .+?     # +? Minimal match everything that isn't in \1
                \1(\W)  # Match non-word following the 2nd \1
               /xms;
      $TEXT_QUAL = $1;
      $FIELD_SEP = $2;


      Having now found the proper way to acquire the delimiters, the following question is how to use these newfound delimiters.
      Instead of creating one large regex, I'd prefer to store them in scalars, which is the core of this particular problem.
      $foo =~ /\G$TEXT_QUAL(.*?)$TEXT_QUAL[$FIELD_SEP\n]/xmsgc;
      Fails to work since the qualifiers are metacharacters used in regular expressions.


      $foo =~ /\G\$TEXT_QUAL(.*?)\$TEXT_QUAL[\$FIELD_SEP\n]/xmsgc;
      Fails to work as \$ is a literal $ followed by the name.


      $foo =~ /\G\\$TEXT_QUAL(.*?)\\$TEXT_QUAL[\\$FIELD_SEP\n]/xmsgc;
      Also fails to work, as \\ is a literal \. The only way I've found is:


      $LIT_TEXT_QUAL = qq{\\$TEXT_QUAL};
      $LIT_FIELD_SEP = qq{\\$FIELD_SEP};
      $foo =~ /\G$LIT_TEXT_QUAL(.*?)$LIT_TEXT_QUAL[$LIT_FIELD_SEP\n]/xmsgc;


      I do have reasons for using all those flags, which will become clear as this thread continues. For now I'm asking smaller, discrete questions to reduce the amount of new information my brain has to process.

      Basically this post is looking for a way to use any variable in a regex that may or may not contain metacharacters. Edit: OK, yeah, missed the boat on this one. The answer is just the quotemeta function, from the CB. Thanx guys!
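A rough sketch of the quotemeta approach, assuming the delimiters were already detected as ^ and |. The sample line and the short names $q, $s and @fields are invented for the example. It uses (?:...|\n) instead of a character class straight after the variable, because `$var[` inside a pattern can be read as an array subscript.

```perl
use strict;
use warnings;

my $TEXT_QUAL = '^';
my $FIELD_SEP = '|';
my $line      = qq{^snafu^|^foo.bar^\n};

# quotemeta backslash-escapes every non-word character, so the
# result is safe to interpolate into a pattern as a literal.
my $q = quotemeta $TEXT_QUAL;    # \^
my $s = quotemeta $FIELD_SEP;    # \|

# Walk the line field by field; /g with \G continues where the
# previous match ended, and /c keeps pos() when the match fails.
my @fields;
while ( $line =~ /\G$q(.*?)$q(?:$s|\n)/gc ) {
    push @fields, $1;
}

print join( ',', @fields ), "\n";    # prints: snafu,foo.bar
```

The same escaping can be written inline as \Q$TEXT_QUAL\E, which applies quotemeta to the interpolated value without the extra scalars.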
Re: Regular Expresssion TroubleShoot Help plz
by Crackers2 (Parson) on Mar 29, 2006 at 00:42 UTC

    I think the problem is simply that the [^\1]+ is greedy. Changing it to [^\1]+? seems to work:

    $foo = qq{^snafu^|^foobar^\n};
    $foo =~ m/\A(\W)   # \A instead of ^ and match first non-word
              [^\1]+?  # Match everything that isn't in \1
              \1(\W)   # Match non-word following the 2nd \1
             /xms;
    $text_qual = $1;
    $field_sep = $2;
    print "[$text_qual]\n";
    print "[$field_sep]\n";
    Output:
    [^]
    [|]

    Update: Nope I think I'm at least partially wrong too. Greedy or not, if \1 is ^, then [^\1]+ should stop at the first ^.

    Changing [^\1]+ to the hardcoded [^^]+ shows that greediness doesn't matter. So it does appear to have something to do with \1 inside character classes.

    Update 2: I think .*? instead of [^\1]+ might work, but it contains the dreaded dot-star. I assume the real solution uses some form of lookahead, but I've never been good with those.

Re: Regular Expresssion TroubleShoot Help plz
by SamCG (Hermit) on Mar 29, 2006 at 00:38 UTC
    update: No -- Crackers2 is correct. Sorry. First time I've wanted to downvote my own post.

    Well, it seems your character class isn't working the way you expect. I usually find the print statement to be an excellent debugger. I modified your code a bit: first, I didn't see a particular need for the \A in this short a sample. I'm also used to looking at regexes without whitespace, and I'm not sure why you used both the /s and /m modifiers (aren't they contradictory?) {update: never mind that last. I found this in the docs: "both s and m modifiers (//sm): Treat string as a single long line, but detect multiple lines. '.' matches any character, even "\n". ^ and $, however, are able to match at the start or end of any line within the string."}
    $foo = qq{^snafu^|^foobar^\n};
    $foo =~ m/(\W)([^\1]+)\1(\W)/;
    $text_qual = $1;
    $field_sep = $3;
    print "text: $text_qual\n";
    print "field: $field_sep\n";
    print $2;
    print "\n2nd try\n";
    $foo2 = qq{^snafu1|^foobar^\n};
    $foo2 =~ m/(\W)([^\1]+)\1(\W)/;
    $text_qual = $1;
    $field_sep = $3;
    print "text: $text_qual\n";
    print "field: $field_sep\n";
    print $2;
    yields
    H:\script>perl majingz.pl
    text: ^
    field:

    snafu^|^foobar
    2nd try
    text: ^
    field:

    snafu1|^foobar
    Telling me your [^\1]+ class sucked up everything from s to r, and then the \1 kicked in for the fourth ^. So, your backreference isn't working inside a character class. This isn't quite so surprising (to me, anyway) since a character class doesn't follow many standard regex rules (a period inside a character class, for example, is just a period, escaped or not). I don't see any hard documentation on the failure of backreferences in character classes, but it makes sense to me.

    What is somewhat surprising to me is that in the second try (where I assumed the "1" would be part of the character class), the "1" doesn't trigger the class. I'm guessing this is because the "\" is the escape character. I do note that if I double escape (i.e., [^\\1]), I get the expected result of that class matching on the "1".
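A quick sketch of what the double-escaped class does, assuming the usual reading that \\ inside a class is one literal backslash, so [^\\1] excludes exactly two characters: the backslash and the digit 1. The labels and the %excluded hash are just for illustration.

```perl
use strict;
use warnings;

# Each case is a label plus the single character to test.
my @cases = (
    [ 'digit 1',   '1' ],
    [ 'backslash', '\\' ],
    [ 'caret',     '^' ],
    [ 'chr(1)',    chr(1) ],
);

my %excluded;
for my $case (@cases) {
    my ( $label, $char ) = @$case;

    # A character is excluded when the negated class refuses it.
    $excluded{$label} = $char =~ /\A[^\\1]\z/ ? 0 : 1;
    print "$label excluded: $excluded{$label}\n";
}
```

The digit and the backslash come out as excluded; the caret and chr(1) do not. So this class stops at a literal "1", which is the effect noted above, but it still knows nothing about whatever the first capture group matched.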

    I wish I could tell you how to resolve your situation, but I think it's a difficult one: parsing csv's is not an easy task. That's one reason there's a module.