Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Delete unmatched quotes from a delimited file?

by Buckaroo Buddha (Scribe)
on Jul 24, 2001 at 19:48 UTC ( [id://99376]=perlquestion: print w/replies, xml ) Need Help??

Buckaroo Buddha has asked for the wisdom of the Perl Monks concerning the following question:

I've got a data set that looks somewhat like this:
|value 1|value two|"|Value 3 was "NULL"|2001/06/06|

It is a very large file and It is being imported daily into an MS-SQL server by a SAS script

unfortunately, in this case SAS can't handle quotes very well. So I must clean the file before SAS gets its ruddy little paws on it.

I had hoped to do a quick and sexy inline thing to the effect of:

perl -pi.bak -e "s/\|\"\|/\|\|/g" filename

sweet ... unfortunately there are too many special characters for an inline to work
then I moved on to
open OUTFILE, ">>@ARGV[0].dat"; while (<>) { $_ =~ s/\|\"\|/\|\|/g; print OUTFILE $_; } close OUTFILE;

sweet, it works! no problem

but now i've found that unmatched quotes still happen at other places in the data. I must instead come up with a way of saying:

"if you find a | before you find a second quote you must substitute it for a blank space"

now one way would be to:

while (<>) { split /\|/; my $output = "|"; foreach my $value (@_) { if ($value =~ m/"{1}/) { $value =~ s/"//g; } $output =. $value . "|"; } print $output; }
the two questions i have are:
1) is there a better way to do this? I figure regex's are so powerful that this could probably be done in one line. if I went out and bought that O'reilly book on mastering regular expressions do you think i'd be able to write that one myself?

2) is there a way to match any ODD number of quotes as opposed to just one quote?

Replies are listed 'Best First'.
Re: Delete unmatched quotes from a delimited file?
by bwana147 (Pilgrim) on Jul 24, 2001 at 20:02 UTC

    I guess you can count the number of quotes and add one if it's missing. This assumes you can't quote pipes. Something like this might do:

    while ( <> ) { my @F = split /\|/; foreach ( @F ) { $_ .= '"' if tr/"// & 1; } print join '|', @F; }

    --bwana147

Re: Delete unmatched quotes from a delimited file?
by wine (Scribe) on Jul 24, 2001 at 20:27 UTC
    bwana147 has the right solution. I only want to add that your code has some strange things in it.

    while (<>) { split /\|/; # Use of implicit split to @_ is dep +recated my $output = "|"; foreach my $value (@_) { if ($value =~ m/"{1}/) { # equivalent to /"/, so does not do +any counting $value =~ s/"//g; } $output =. $value . "|"; # should be .= } print $output; }

    Couldn't help to mention it. ;) Good luck.

(tye)Re: Delete unmatched quotes from a delimited file?
by tye (Sage) on Jul 25, 2001 at 01:30 UTC

    You can rely on some properties of regular expression matching to make this much simpler than what I've seen so far (unless I'm missing something, of course).

    s#"([^|"]*)([|"]|$)# ( $2 eq '"' ? '"' : ' ' ) . $1 . $2 #ge;
    Look for a " followed by whatever until you hit " or | or end of line. If you hit ", then you found a matched pair so leave the " in. If you don't, then replace the " with a space. The "g" modifier on the regex then starts where the last match left off. This means you don't need to worry about starting a match at the second " of a matched pair.

            - tye (but my friends call me "Tye")
negative lookahead to the rescue! (boo)
by boo_radley (Parson) on Jul 24, 2001 at 20:47 UTC
    $_='|value 1|value two|"|Value 3 was "NULL"|2001/06/06|'; s/(\|.*?)(")(?!")(.*?\|)/$1$3/; print;
    there are 4 sets of parens in the regex, I'll try to break them down :
    the first one looks for a pipe, then any number of characters (but not being greedy about it)
    the second grabs the lone quote the third is a negative lookahead. This is where is golden nugget of regular expression goodness lies!
    the negative lookahead makes sure that there's no quote following the one from the second backreference. Also, since it's a zero-width assertion, it doesn't create a backreference of its own.
    Finally, the last set of parens describes "the rest of the string" up to the ending pipe-delimiter.

    So,
    "look for a pipe, and then any characters up to a quote, make sure it's not followed by another quote, and then the rest of the string, up to a pipe"
    Now the one caveat for this re is that it will misbehave on "", but my reg-fu is not strong enough to determine the handler for that contingency.

(MeowChow) Re: Delete unmatched quotes from a delimited file?
by MeowChow (Vicar) on Jul 24, 2001 at 21:58 UTC
    A one-liner regex is not the most maintainable or efficient solution to this problem. But since you asked, here's a regex answer:
      
    s/((?:^|\|)(?:[^"|]*"[^"|]*")*[^"|]*)"([^"|]*)(?=$|\|)/$1$2/g;
    The readability of that could be improved upon, however:
      
    my $T = qr/[^"|]*/; s/((?:^|\|)(?:$T"$T")*$T)"($T)(?=$|\|)/$1$2/go;
    Now here's how I would have really done it:
      
    join '|', map { s/"([^"]*)$/$1/ if y/"// & 1; $_ } split /\|/;
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://99376]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (3)
As of 2024-03-28 17:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found