
I am trying to use a plain perl regex s/// to fix up the formatting of fields in a CSV file, so that the real parser will no longer choke on it. The fields, separated by semicolons, are formatted like this:

Text fields are between double quotes. An internal double quote is doubled.
Numeric fields are unquoted and use a comma as a decimal separator.
Empty fields contain only a question mark, unquoted.

What I'm trying to do is to leave the quoted fields alone, replace the comma in numeric fields with ".", and drop the unquoted question mark.

The basis of what I've been using looks like this. I've added extensive regex comments describing what it does:

    s( ("[^"]*")     # a quoted field, or standalone part of a field
      | (?<![^;])    # start of line or preceded by semicolon = start of field
        ( [\-\d,]+   # characters most likely forming a number
          | ([?]) )  # or a "?" 
        (?![^;])     # end of line or followed by semicolon  = end of field
    )                # end of regex, start of substitution
    {
        $1 or        # replace quoted string by itself = skip
        $3 ? ''      # a bare unquoted '?', delete
        : do { (my $number = $2)   # must be a number
              =~ tr/,/./;       # replace ',' with '.'
            $number }           # return value
    }xge;
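To make the behaviour concrete, here is a small self-contained run of that substitution on a made-up record. The helper name `fix_line` and the sample data are assumptions for illustration only. The pattern is the one above, except that the replacement uses explicit `defined` tests instead of the bare `$1 or`, which should select the same branches.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): applies the
# substitution to one chomped line and returns the fixed copy.
sub fix_line {
    my ($line) = @_;
    $line =~ s( ("[^"]*")      # a quoted field, or standalone part of one
              | (?<![^;])      # start of string or preceded by semicolon
                ( [\-\d,]+     # characters most likely forming a number
                  | ([?]) )    # or a bare "?"
                (?![^;])       # end of string or followed by semicolon
    )
    {
        defined($1) ? $1       # quoted text: keep as is
      : defined($3) ? ''       # bare "?": delete
      : do { ( my $number = $2 ) =~ tr/,/./; $number }   # "," becomes "."
    }xge;
    return $line;
}

# Made-up sample: a quoted field containing doubled quotes, ";" and "1,5",
# followed by a number, an empty field, a negative number and a short text.
print fix_line('"a ""quoted"" text; 1,5";3,14;?;-2,5;"x"'), "\n";
```

For this input the quoted field comes through untouched (including the `1,5` inside it), while the unquoted fields become `3.14`, an empty field, and `-2.5`.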

Now the part that I'm having trouble with: I'm trying to add support for multiline records, i.e. records containing newlines within quoted strings, but without reading in the whole data file at once. I can detect whether a quoted string is still open by making the closing quote optional and checking for its presence. The problem is: how do you continue parsing the same open string on the next line, until you find the first semicolon?
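One possible way around the problem, sketched here under assumptions, is not to continue the match across lines at all but to join physical lines until the quotes balance. This swaps the optional-closing-quote test for a simpler parity check: because an internal double quote is always doubled, a record with an odd number of `"` characters must still be inside a quoted field. The function name `read_record` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the next complete logical record from the
# filehandle, appending further physical lines while a quoted field is open.
# Only one record is held in memory at a time, never the whole file.
sub read_record {
    my ($fh) = @_;
    my $record = <$fh>;
    return unless defined $record;
    # tr/"// counts the double quotes without changing the string;
    # an odd count means a quoted field has not been closed yet.
    while ( ( $record =~ tr/"// ) % 2 ) {
        my $more = <$fh>;
        last unless defined $more;    # unterminated quote at end of input
        $record .= $more;
    }
    return $record;
}

# Demo on an in-memory filehandle with made-up data: the first record
# spans two physical lines, the second fits on one.
my $data = qq{"first\nsecond";1,5;?\n"x";2,5\n};
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
while ( defined( my $record = read_record($fh) ) ) {
    print "record: $record";
}
close $fh;
```

Each record returned this way could then be chomped and run through the single-line substitution, since `[^"]*` also matches embedded newlines.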

My idea was that, if the previous line was closed, the pattern should work as above, but if we were in a quoted field at the end, it should behave like:

    m( ( (?:^|") [^"]* ("?) ) | (?<![^;]) ( [\-\d,]+ | ([?]) ) (?![^;]) )x

instead. Now how do you do that? I've tried experimenting with (?{CODE}), a feature still marked as "highly experimental" after more than 5 years, but I don't quite get it and I couldn't get it to work properly. Because of its "experimental nature" (it may be here to stay, but that doesn't mean it has been properly debugged), I'd like to avoid it anyway.

I've also thought about using /"/g to skip any leading remainders of a quoted string, but s///g simply ignores \G.
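For comparison, here is a rough sketch of how the same rules might be expressed without s///g and without (?{CODE}): a scanning loop built on `m/\G.../gc`, with the "still inside a quoted field" flag kept in a closure between lines. This is an assumption-laden illustration, not a tested solution. The name `make_line_fixer` is hypothetical, lines are expected to keep their trailing newline, and the lookahead is widened to `(?![^;\n])` for that reason.

```perl
use strict;
use warnings;

# Hypothetical factory: returns a closure that fixes one physical line at a
# time and remembers whether the previous line ended inside a quoted field.
sub make_line_fixer {
    my $in_quote = 0;
    return sub {
        my ($line) = @_;
        my $out = '';
        pos($line) = 0;
        if ( $in_quote && $line =~ /\G([^"]*("?))/gc ) {
            # Remainder of the quoted text left open by the previous line.
            $out .= $1;
            $in_quote = 0 if length $2;
        }
        while ( pos($line) < length $line ) {
            if ( $line =~ /\G("[^"]*("?))/gc ) {
                $out .= $1;                        # quoted text, copied as is
                $in_quote = length($2) ? 0 : 1;    # open if no closing quote
            }
            elsif ( $line =~ /\G(?<![^;])([\-\d,]+)(?![^;\n])/gc ) {
                ( my $number = $1 ) =~ tr/,/./;    # "," becomes "."
                $out .= $number;
            }
            elsif ( $line =~ /\G(?<![^;])[?](?![^;\n])/gc ) {
                # A bare "?" marks an empty field: emit nothing.
            }
            else {
                $line =~ /\G(.)/gcs;               # anything else, one character
                $out .= $1;
            }
        }
        return $out;
    };
}

# Made-up input: one record split over two physical lines, then a second one.
my $fix = make_line_fixer();
for my $line ( qq{"first\n}, qq{second ""x"" 1,5";1,5;?\n}, qq{?;-2,5;"y"\n} ) {
    print $fix->($line);
}
```

The `/c` modifier keeps `pos()` in place when a branch fails, so each alternative is tried at the same offset, and the final branch always consumes one character so the loop moves forward.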

So... What would you do?

In reply to Conditional continued matching with regexes by bart
