Re: regular expression (search and destroy)

This can get rather complicated to parse. Data like this has a way of throwing a wrench into your parsing methods, and I haven't found a good module that covers all the subtleties of quoted, delimited data. As an example, if your data looks the way you describe:

121212, "Simpson, Bart", Springfield

this is a trivial matter to parse, but what if your data looks like:

121212,"2" tape, white", springfield

If you will never encounter quotes embedded within your fields, it is less of a problem. If you are dead set against using one of the fine CPAN modules, or even Text::Balanced (a core module) as previously suggested, you could do something like this:

Untested pseudo-code

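RECORD:
while (my $line = <DATA>) {
    my $quoting = 0;    # true while we are inside protecting quotes
    # read data 1 byte at a time
    for (my $i = 0; $i < length($line); $i++) {
        my $byte = substr($line, $i, 1);
        if ($byte eq "\"") {
            # peek at the next byte to decide what kind of quote this is
            my $next_byte = substr($line, $i + 1, 1);
            if ($quoting && ($next_byte eq "," || $next_byte eq "\n" || $next_byte eq "")) {
                $quoting = 0;    # a quote followed by a delimiter closes the field
            } else {
                $quoting = 1;    # opening quote, or a quote embedded in the field
            }
            print $byte;
        } elsif ($byte eq "\n") {
            # end of record
            print $byte;
            next RECORD;
        } elsif ($byte eq ",") {
            # if $quoting is set here, this comma is data, not a delimiter
            print $byte;
        } else {
            print $byte;
        }
    }
}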

The idea is to read a record, then walk through it 1 byte at a time, tracking whether each delimiter falls inside a set of protecting quotes.
It gets more difficult if your data is more complex than the examples above.
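
For what it's worth, here is a slightly fuller sketch of the same byte-walking idea that collects fields instead of just printing bytes. The helper name split_record is made up for illustration, and it rests on one big assumption: a quote only closes a field when it is immediately followed by a comma or the end of the record. Data that breaks that assumption (say, an embedded quote that happens to sit right before a comma) will still be split in the wrong place.

```perl
use strict;
use warnings;

# split_record is a hypothetical helper, not part of any module.
# Assumption: a closing quote is always followed by "," or end of record.
sub split_record {
    my ($line) = @_;
    chomp $line;
    my @fields;
    my $field   = '';
    my $quoting = 0;                 # true while inside protecting quotes
    my $len     = length $line;

    for (my $i = 0; $i < $len; $i++) {
        my $byte = substr($line, $i, 1);
        if ($byte eq '"') {
            my $next = $i + 1 < $len ? substr($line, $i + 1, 1) : '';
            if (!$quoting && $field =~ /^\s*$/) {
                $quoting = 1;        # opening quote at the start of a field
                $field   = '';       # drop whitespace seen before the quote
            } elsif ($quoting && ($next eq ',' || $next eq '')) {
                $quoting = 0;        # closing quote
            } else {
                $field .= $byte;     # embedded quote, keep it as data
            }
        } elsif ($byte eq ',' && !$quoting) {
            push @fields, $field;    # unprotected comma ends the field
            $field = '';
        } else {
            $field .= $byte;         # ordinary byte, or a protected comma
        }
    }
    push @fields, $field;

    # Trim stray whitespace around each field (this also trims inside quotes).
    s/^\s+|\s+$//g for @fields;
    return @fields;
}

print join('|', split_record('121212, "Simpson, Bart", Springfield')), "\n";
print join('|', split_record('121212,"2" tape, white", springfield')), "\n";
```

Both sample records from above come back as three fields with this heuristic, but that is about as far as I'd trust it.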

One other thing: the above method is not very fast, so if you have tons of data (hundreds of megabytes, gigabytes, or terabytes) you may have to wait a while.

In the end, you're probably best off using a module.
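
As one illustration of the module route (my pick, not one named earlier in this thread), Text::ParseWords ships with Perl and its quotewords function handles the well-formed case in a single call. The wrapper name parse_well_formed is made up for this sketch. Note that it is only a sketch for balanced quotes: a record like the "2" tape example, with an odd number of quotes, is not something I'd expect it to parse sensibly.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# parse_well_formed is a hypothetical wrapper for illustration.
# Assumption: quotes in the record are balanced.
sub parse_well_formed {
    my ($line) = @_;
    chomp $line;
    # First argument is a delimiter regex (comma plus surrounding whitespace);
    # the 0 tells quotewords to strip the protecting quotes.
    return quotewords('\s*,\s*', 0, $line);
}

my @fields = parse_well_formed('121212, "Simpson, Bart", Springfield');
print join('|', @fields), "\n";
```

The comma inside "Simpson, Bart" stays part of the field because it sits inside the quotes.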
