Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am having a regexp problem that combines a couple of common regexp problems with a twist. My regexp skills are not up to snuff, so I am somewhat stuck.

I am working with a CSV file, and ultimately:
- I need to parse it and change the delimiter
- keep the commas that were within the quoted strings
- remove the double quotes signifying strings

For example:

1,2,3,4,"foo, bar, and bob", "example",5,6,"plip,plop"

this would end up as:
1~2~3~4~foo, bar, and bob~example~5~6~plip,plop

Handling the embedded commas is what's stopping me...

I was trying to handle it separately from all the other goals and was using:
$_ =~ s/,\s*(\".*?),(.*?\")/$1-=O=-$2/g;
This piece of code skips past multiple commas, which get matched in $2.
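A sketch of one pure-regexp alternative (assuming, on my part, that every line has balanced double quotes): split only on commas that are followed by an even number of remaining quotes, i.e. commas outside any quoted field, then strip the quotes afterwards.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = q{1,2,3,4,"foo, bar, and bob", "example",5,6,"plip,plop"};

# Split only on commas followed by an even number of remaining quotes,
# i.e. commas that sit outside any quoted field.
my @fields = split /,(?=(?:[^"]*"[^"]*")*[^"]*$)/, $line;

for (@fields) {
    s/^\s*"//;    # strip leading whitespace and opening quote
    s/"\s*$//;    # strip closing quote and trailing whitespace
}

print join('~', @fields), "\n";
# 1~2~3~4~foo, bar, and bob~example~5~6~plip,plop
```

Note that the lookahead rescans the rest of the line at every comma, so on huge files this can easily be slower than a real CSV parser.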

The files I have to manipulate are -=huge=-, so efficiency is important. Unfortunately, the more I goof with this, the less efficient it becomes :P I know there are modules out there that can help with parsing, but I wasn't sure whether they would give me the speed boost that I need.

Thank you all for listening to my rambling.
Below is the script in case what I said still doesn't make sense.

use strict;

my $data_file   = $ARGV[0];  # input file
my $old_delim   = $ARGV[1];  # old delimiter
my $output_file = $ARGV[2];  # output file
my $new_delim   = $ARGV[3];  # new delimiter
my $count = 0;

print "\n$0 Started.....\n\n";
print "  Input file name:  $data_file\n";
print "  File delimiter    $old_delim\n";
print "  Output file name: $output_file\n";
print "  File delimiter    $new_delim\n";

# open the file, or kill the script and return the error
open(INFILE, "$data_file") || die "INPUT ERROR: $!\n";
# if the file is empty, kill the script and print an error message
die "INPUT ERROR: input file is empty\n" if (-s $data_file < 1);
open(OUTFILE, ">$output_file") || die "OUTPUT ERROR: $!\n";

while (<INFILE>) {
    $_ =~ s/,\s*(\".*?), (.*?)\"/$1-=O=-$2/g;
    $_ =~ tr/\\\"//d;
    my @words   = split($old_delim, $_);
    my $newline = join($new_delim, @words);
    $newline =~ s/-=O=-/,/g;
    print OUTFILE $newline;
    $count++;
}

print "line count: $count\n";
print("\n\nCompleted.\n\n");

Re: regexp problems
by diotalevi (Canon) on Nov 27, 2002 at 04:11 UTC

    You are solving this problem the wrong way. I normally use Text::CSV_XS for doing CSV reading/writing. It's dirt easy. In fact, it is so easy that I've just copied in some sample source for you to look at. If you want to go faster, you can start feeding it a filehandle to read from, among other tricks. Those are up to you, and you'll have to read the documentation on that.

    #!/usr/bin/perl
    use Text::CSV_XS;
    use strict;
    use warnings;

    $| = 1;
    my $c = Text::CSV_XS->new;
    while (my $line = <>) {
        $c->parse($line);
        my @fields = $c->fields;
        if (1 < @fields) {
            $line = join("\t", @fields) . "\n";
            $line =~ s/\\//g;
            print STDOUT $line;
        } else {
            print STDERR $line;
        }
    }
      Thank you all for your help :) I greatly appreciate it.

      The input file is around 500 megs. I was given the wrong specs and didn't know I had to maintain the old delimiter characters within a string.

      The very fact that it's nested just kills the script. I chose Text::ParseWords over Text::CSV because it doesn't require the people who would use the script to grab the module.

      Surprisingly, &parse_line() in Text::ParseWords does -exactly- what I need... it can strip out double quotes and backslashes, and it maintains the commas within the string fields. This means all I really have to do is call &parse_line() and join the fields back together on the new delimiter.
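The parse_line() approach described here can be sketched in a few lines (the sample input is taken from the question; parse_line's second argument of 0 tells it to strip the quotes):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords;   # core module, so nothing extra to install

# parse_line(delimiter, keep, line): with keep = 0 the surrounding
# quotes are stripped and backslash escapes are processed, while
# commas inside quoted fields are preserved.
my $line = q{1,2,3,4,"foo, bar, and bob","example",5,6,"plip,plop"};
my @fields = parse_line(',', 0, $line);

print join('~', @fields), "\n";
# 1~2~3~4~foo, bar, and bob~example~5~6~plip,plop
```

Being pure Perl, parse_line() does all of its regexp work per field, which is where the slowness on a 500-meg file comes from.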

      The code is extremely clean and easy to implement, but it's just too slow.

      As it stands, the script takes a little over an hour to run. If I didn't have to worry about nesting the script would run in only a couple of minutes.

      The reason I came here is that I have seen some of you guys do some sick, sick, deranged golfing. It's never pretty, but it usually hauls ass :)

      I have learned a couple of tricks here for speeding stuff up over the years, but I still don't hold a candle to most of the pro golfers. I was hoping someone had an idea for boosting the speed.

        In this case you *do* use Text::CSV_XS for the speed. The thing is, the core routines are coded in C and are supposed to be fast; that's what the '_XS' part of the name implies. So for your case you ought to go get the module, since it's a speed issue. I didn't reply with that information last time just for kicks. I normally process multi-gigabyte files through this, and it's definitely a help to use the fast module over other things.
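The filehandle trick hinted at earlier can be sketched like this: getline() lets the C parser read and split in one step, avoiding a separate parse() call per line (the in-memory filehandle here just stands in for the real input file):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# Sample input standing in for the (huge) real file.
my $input = <<'CSV';
1,2,3,4,"foo, bar, and bob","example",5,6,"plip,plop"
CSV
open my $fh, '<', \$input or die "open: $!";

my $csv = Text::CSV_XS->new({ binary => 1 });
my @out;
# getline() reads a record and returns its fields as an array ref,
# with quotes stripped and embedded commas preserved.
while (my $row = $csv->getline($fh)) {
    push @out, join('~', @$row);
}
close $fh;

print "$_\n" for @out;
# 1~2~3~4~foo, bar, and bob~example~5~6~plip,plop
```

For the real script, open the 500-meg file normally and pass that filehandle to getline() in the same loop.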

Re: regexp problems
by pg (Canon) on Nov 27, 2002 at 04:24 UTC
    For the regexp part: I posted a similar question a while ago, asking fellow monks how to split a string on spaces, but not on those within quotes. I learned a lot from the replies others kindly offered; please check them all out from my original post.