in reply to regular expression (search and destroy)
121212, "Simpson, Bart", Springfield
This is a trivial matter to parse, but what if your data looks like:
121212,"2" tape, white", springfield
If you will never encounter quotes embedded within your fields, then it is less of a problem. If you are dead set against using one of the fine CPAN modules, or even Text::Balanced (a core module) as previously suggested, you could do something like this:
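    use strict;
    use warnings;

    while (my $line = <DATA>) {
        chomp $line;
        my ($quoting, $field, @fields) = (0, '');

        # read data 1 byte at a time
        for (my $i = 0; $i < length $line; $i++) {
            my $byte      = substr($line, $i, 1);
            my $next_byte = substr($line, $i + 1, 1);   # '' at end of record

            if ($byte eq '"') {
                if (!$quoting && $field =~ /^\s*$/) {
                    # quote at the start of a field: opening quote
                    ($quoting, $field) = (1, '');
                }
                elsif ($quoting && ($next_byte eq ',' || $next_byte eq '')) {
                    # quote followed by a delimiter or end of record: closing quote
                    $quoting = 0;
                }
                else {
                    # any other quote is treated as embedded data
                    $field .= $byte;
                }
            }
            elsif ($byte eq ',' && !$quoting) {
                # unprotected delimiter: the field is complete
                push @fields, $field;
                $field = '';
            }
            else {
                $field .= $byte;
            }
        }
        push @fields, $field;
        print join('|', @fields), "\n";
    }

    __DATA__
    121212, "Simpson, Bart", Springfield
    121212,"2" tape, white", springfield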
The idea is to read a record then walk through the record 1 byte at a time trying to determine if a delimiter is inside a set of protecting quotes.
It gets more difficult if your data is as messy as the examples above, or worse.
One other thing: the above method is not very fast, so if you have tons of data (hundreds of megabytes, gigabytes, or terabytes) you may have to wait a while.
In the end, you're probably best off using a module.
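As a rough sketch of the module route, here is one way it might look with Text::ParseWords, which ships with Perl. This is my choice for illustration and not necessarily the module anyone in the thread had in mind; the helper name split_record is made up for this example.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);   # core module, no CPAN install needed

# split_record is a hypothetical helper name, not from the original post.
# It splits one comma-delimited record, honouring quoted fields.
sub split_record {
    my ($line) = @_;
    chomp $line;
    # The delimiter is a regex: a comma plus any surrounding whitespace.
    # The second argument (0) asks for the quotes to be stripped.
    return quotewords('\s*,\s*', 0, $line);
}

my @fields = split_record(qq{121212, "Simpson, Bart", Springfield\n});
print join('|', @fields), "\n";
```

For the well-formed record this should print 121212|Simpson, Bart|Springfield. Two caveats: quotewords also treats single quotes as quoting characters, so a name like O'Brien will trip it up, and a record with unbalanced quotes such as the second example should come back as an empty list rather than a guess, so check for that before trusting the result.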