This can get rather complicated to parse. Quoted, delimited data has enough quirks to throw a wrench into simple parsing methods, and I haven't found a good module that covers all the subtleties. Just as an example, if your data looks like you describe:

121212, "Simpson, Bart", Springfield

this is a trivial matter to parse, but what if your data looks like:

121212,"2" tape, white", springfield
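To make the difference concrete, here is a small sketch that runs one naive "quoted or unquoted" pattern over both sample records. The helper name naive_fields and the pattern itself are my own illustration, not something from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: split a record with a naive pattern.
sub naive_fields {
    my ($line) = @_;
    # A field is either "..." with no quotes inside, or a run of non-commas.
    return $line =~ /\s*("[^"]*"|[^,]+)/g;
}

for my $line ('121212, "Simpson, Bart", Springfield',
              '121212,"2" tape, white", springfield') {
    my @fields = naive_fields($line);
    printf "%d fields: %s\n", scalar @fields, join ' | ', @fields;
}
```

The first record should come back as three fields (the quoted one still wearing its quotes). The second should fall apart into five, because the pattern treats the quote after the 2 as the end of the field.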

If you will never encounter quotes embedded within your fields, then it is less of a problem. If you are dead set against using one of the fine CPAN modules, or even Text::Balanced (a core module) as previously suggested, you could do something like this:

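Untested code, so treat it as a sketch. It expects the records after __DATA__ and prints each record's fields joined with "|":

while (my $line = <DATA>) {
    chomp $line;
    my @fields;
    my $field   = '';
    my $quoting = 0;    # true while inside protecting quotes

    # walk the record one byte at a time
    for (my $i = 0; $i < length $line; $i++) {
        my $byte      = substr($line, $i, 1);
        my $next_byte = substr($line, $i + 1, 1);    # '' at end of record

        if ($byte eq '"') {
            if (!$quoting && $field =~ /^\s*$/) {
                $quoting = 1;        # opening quote at the start of a field
                $field   = '';
            }
            elsif ($quoting && ($next_byte eq ',' || $next_byte eq '')) {
                $quoting = 0;        # closing quote: a delimiter or the end follows
            }
            else {
                $field .= $byte;     # quote embedded in the data
            }
        }
        elsif ($byte eq ',' && !$quoting) {
            push @fields, $field;    # delimiter outside quotes ends the field
            $field = '';
        }
        else {
            $field .= $byte;         # ordinary byte, or a protected comma
        }
    }
    push @fields, $field;
    print join('|', @fields), "\n";
}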

The idea is to read a record, then walk through it one byte at a time, trying to determine whether each delimiter is inside a set of protecting quotes.
It gets more difficult if your data is more complex than the examples above.

One other thing: the above method is not very fast, so if you have tons of data (hundreds of megabytes, gigabytes, or terabytes) you may have to wait a while.

In the end, you're probably best off using a module.
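As a minimal sketch of the module route, here is the same kind of record handled with Text::ParseWords. That module is my own pick because it ships with Perl; nothing above names it, and a dedicated CSV module from CPAN may well suit your data better. The helper name split_record is hypothetical.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: split one record on commas, honouring double quotes.
sub split_record {
    my ($line) = @_;
    chomp $line;
    # The delimiter is a regex; 0 means "strip the protecting quotes".
    return parse_line('\s*,\s*', 0, $line);
}

my @fields = split_record(qq{121212, "Simpson, Bart", Springfield\n});
print join(' | ', @fields), "\n";

# A record with a stray embedded quote is not something it can repair.
my @bad = split_record('121212,"2" tape, white", springfield');
print @bad ? "parsed\n" : "could not parse record\n";
```

Two caveats, as far as I can tell from the module: parse_line also treats single quotes and backslashes as special, so an apostrophe in the data can trip it up, and for unbalanced quotes like the second record it appears to return an empty list rather than guess.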


In reply to Re: regular expression (search and destroy) by sweetblood
in thread regular expression (search and destroy) by data67
