xiaoyafeng has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I'm having trouble with a regex. I want to extract the data between the 3rd comma and the 4th comma. I tried the code below:

while (<DATA>) {
    /^.*?,.*?,.*?,(.*?),/;
    print "$1\n";
}
__DATA__
6,0,3,"8.1",1
7,578,,"8.2,r1",1
8,0,5,"8.2,r3",1
18,0,13,"6.2",1
19,D610,,"6.3,r1",1
20,2f78,15,"6.3,r2",1

But it fails, because some strings like "6.3,r1" include a comma! Someone told me to use (?key). Is that right?

Thanks a lot!

UPDATE: corrected a stupid mistake in the code, as pointed out in bart's reply.

Replies are listed 'Best First'.
Re: Nth match.
by ikegami (Patriarch) on Dec 14, 2006 at 09:27 UTC
Re: Nth match.
by bart (Canon) on Dec 14, 2006 at 12:03 UTC
    Instead of .*? you need to use something more complex, something that can handle quoted strings. Something like:
    my $term = qr/"[^"\\]*(?>\\.[^"\\]*)*"|[^,"]*/;
    (That handles backslash-escaped content, too. Change it to
    my $term = qr/"[^"]*(?>""[^"]*)*"|[^,"]*/;
    if you escape quotes by doubling them instead.)

    And then you can do

    /^$term,$term,$term,($term)/o;
    which seems to work well for me... BTW it's "while", not "While".
    my $term = qr/"[^"\\]*(?>\\.[^"\\]*)*"|[^,"]*/;
    while (<DATA>) {
        /^$term,$term,$term,($term)/o and print "$1\n";
    }
    __DATA__
    6,0,3,"8.1",1
    7,578,,"8.2,r1",1
    8,0,5,"8.2,r3",1
    18,0,13,"6.2",1
    19,D610,,"6.3,r1",1
    20,2f78,15,"6.3,r2",1
    Result:
    "8.1"
    "8.2,r1"
    "8.2,r3"
    "6.2"
    "6.3,r1"
    "6.3,r2"
    
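The doubled-quote variant of $term is given above without a run, so here is a minimal sketch of how it might be exercised. The sample lines and the helpers fourth_field and unquote are made up for illustration; they are not part of the original post or data.

```perl
use strict;
use warnings;

# The variant for fields that escape a quote by doubling it ("").
my $term = qr/"[^"]*(?>""[^"]*)*"|[^,"]*/;

# Hypothetical helper: return the 4th field (still quoted), or undef
# if the line does not have the expected shape.
sub fourth_field {
    my ($line) = @_;
    return $line =~ /^$term,$term,$term,($term)/ ? $1 : undef;
}

# Hypothetical helper: strip the outer quotes and collapse "" to ".
# Unquoted fields are returned unchanged.
sub unquote {
    my ($field) = @_;
    return $field unless $field =~ s/^"(.*)"$/$1/s;
    $field =~ s/""/"/g;
    return $field;
}

# Made-up sample lines, not taken from the original data.
for my $line ('1,2,3,"say ""hi"", ok",5', 'a,"b,c",,plain,e') {
    my $raw = fourth_field($line);
    print unquote($raw), "\n" if defined $raw;
}
```

Unquoting is kept as a separate step so the matching regex stays the same as in the backslash version; only $term changes.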
Re: Nth match.
by shmem (Chancellor) on Dec 14, 2006 at 09:34 UTC
    Addendum to ikegami's Re: Nth match:

    ...and don't forget to look into the code of those modules to find out how it's done :-)

      Thanks for your advice! :-)
Re: Nth match.
by jonadab (Parson) on Dec 14, 2006 at 11:34 UTC
    Someone tell me to use (?key). Is it right?

    Assuming that the quotes cannot be nested, it is possible to write a regex for this, yes. However, it's going to be a fairly complicated regex, and if you aren't comfortable using the more advanced regex features, then you're better off using a module. That is especially true if quoting can use either single or double quotes and embed the other, or if there are other such things you might not think of when you write the regex but could run into later. A well-tested module off the CPAN will already handle such things, if it's a common CSV format you're parsing.

    So use the module.
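As a rough sketch of the module route: the thread as preserved here does not say which module was meant, so the choice of Text::ParseWords (bundled with Perl) is an assumption, and a dedicated CSV parser from the CPAN such as Text::CSV may well fit real CSV better. The helper name nth_field is hypothetical.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: return the Nth (1-based) comma-separated field,
# treating commas inside quotes as part of the field.
sub nth_field {
    my ($line, $n) = @_;
    chomp $line;
    # A true second argument keeps the surrounding quotes, which matches
    # the output of the regex solutions; pass 0 to strip them instead.
    my @fields = parse_line(',', 1, $line);
    return $fields[$n - 1];
}

# The sample lines from the question.
my @lines = (
    '6,0,3,"8.1",1',
    '7,578,,"8.2,r1",1',
    '8,0,5,"8.2,r3",1',
    '18,0,13,"6.2",1',
    '19,D610,,"6.3,r1",1',
    '20,2f78,15,"6.3,r2",1',
);

for my $line (@lines) {
    my $field = nth_field($line, 4);
    print "$field\n" if defined $field;
}
```

Note that parse_line follows shell-like quoting rules (single quotes, double quotes, and backslash escapes), which is not identical to every CSV dialect, so check it against your actual data.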

    For purely educational purposes, the regex could look something like this untested monstrosity:

    /^(?:(?:["][^"]*["]|['][^']*[']|[^,]*),){3}(["][^"]*["]|['][^']*[']|[^,]*),/

    Hey, at least there are no lookbehind assertions. Nonetheless, when you think about trying to maintain code with stuff like that in it, you will understand why we say, "use a module from the CPAN". No need to maintain what somebody else is already maintaining.
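Since that regex is described as untested, here is a minimal sketch that tries an equivalent pattern against lines from the question. It is rewritten with qr//x only to make it easier to read; the name fourth_field_jonadab and the split into a $field sub-pattern are illustrative choices, not part of the original reply.

```perl
use strict;
use warnings;

# One field: double-quoted, single-quoted, or bare (no commas).
my $field = qr/ " [^"]* " | ' [^']* ' | [^,]* /x;

# Hypothetical helper: skip three "field," groups, then capture the
# fourth field. As in the original pattern, a comma must follow the
# captured field, so a line with exactly four fields does not match.
sub fourth_field_jonadab {
    my ($line) = @_;
    return $line =~ /^(?:$field,){3}($field),/ ? $1 : undef;
}

# A few of the sample lines from the question.
for my $line ('6,0,3,"8.1",1', '7,578,,"8.2,r1",1', '20,2f78,15,"6.3,r2",1') {
    my $got = fourth_field_jonadab($line);
    print defined $got ? "$got\n" : "no match\n";
}
```

Even in this form the pattern does not handle escaped quotes inside a quoted field, which is the kind of case the module advice is meant to cover.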


    Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. You can just call me "Mister Sanity". Why, I've got so much sanity it's driving me crazy.