Re: Optimizing a script
by qq (Hermit) on Apr 19, 2004 at 12:20 UTC
$/ = qq{"\$"\t""\t""\n};
You've got too much here. It should probably be qq{\$\t\t\n}, or something similar. You've used qq{}, which acts like double quotes and so interpolates, but then wrapped the pieces inside in "" as well, so the interior quotes are left in the separator. Try printing $/ to see what you've actually got. Presumably you are not getting the DEBUG messages, which means it's likely just trying to grab the whole file. Your previous post, incidentally, didn't mention the tabs in the record separator.
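A minimal sketch of that check, using the qq{\$\t\t\n} suggested above (whether your real records end in exactly $<TAB><TAB><NEWLINE> is an assumption you'll need to verify against the data):
$/ = qq{\$\t\t\n};             # no stray double quotes inside the braces

# make the invisible characters visible to see what was actually set:
( my $shown = $/ ) =~ s/\t/<TAB>/g;
$shown =~ s/\n/<NL>/g;
print "separator: $shown\n";   # prints: separator: $<TAB><TAB><NL>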
You should take, as your sample file, the first three entries from your main file. If it works on those, it should work on the rest (or die with an error).
Finally, build your script up slowly. Comment out everything in the while loop. Add print $_;. If that's right, add the my $file ... line back in. Print and check. Repeat. (Poor man's testing!)
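A minimal sketch of that incremental approach, assuming the export_gesamt.asc filename from later in this thread and the separator suggested above:
#!/usr/bin/perl -w
use strict;

$/ = qq{\$\t\t\n};              # adjust to match your real data
open A, '<export_gesamt.asc'
    or die "Failed to open: $!\n";

while ( <A> ) {
    print $_;                                     # step 1: do records split correctly?
    # my $file = ( split /\n/, $_ )[1] or next;   # step 2: uncomment, print $file, check
    # ( $file ) = $file =~ m/"(.*?)"/ or next;    # step 3: and so on, one step at a time
}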
qq
Re: Optimizing a script
by matija (Priest) on Apr 19, 2004 at 12:17 UTC
I think your problem is in the line $/ = qq{"\$"\t""\t""\n};
I looked at the other thread, and what you're using here does not match the data you described there. I have a feeling that your script is trying to find a match for a record separator that doesn't exist, and therefore tries to read the whole file into memory, getting hopelessly bogged down in swapping as a consequence.
Your record separator isn't "$"<TAB>""<TAB>""(newline), is it? I think you need to get rid of all those " characters, for starters...
Hi Matija, everyone, thanks for your help.
I changed the record separator to qq{"\$"\t""\t""};, which solved my first problem. Now the script won't give the files the names I want; it keeps overwriting .csv. I have changed the script a bit to
my $i = 1;
while ( <A> ) {
    my $file = ( split /\n/, $_ )[1] or next;
    ( $file ) = $file =~ m/"(.*?)"/ or next;
    $file = $i;
    open B, "> $file.csv"
        or warn( "Cannot open '$file': $!" ), next;
    print "[DEBUG] '$file': open ok\n";
    chomp $_;
    print B $_;
    close B or warn( "Cannot close '$file': $!" ), next;
    print "[DEBUG] '$file': close ok\n";
    $i++;
}
Which gives the files the names 1, 2, 3, etc. But I would like the filenames to be the name of the company. How can I achieve this? Thanks,
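(A minimal sketch of the same loop without the renumbering: the $file = $i; line is what replaces the company name captured on the line before it, so dropping it leaves that name as the filename. This assumes the quoted field on each record's second line really is the company name.)
while ( <A> ) {
    my $file = ( split /\n/, $_ )[1] or next;   # second line of the record
    ( $file ) = $file =~ m/"(.*?)"/ or next;    # quoted company name
    # no "$file = $i;" here, so the captured name is kept
    open B, "> $file.csv"
        or warn( "Cannot open '$file': $!" ), next;
    chomp $_;
    print B $_;
    close B or warn( "Cannot close '$file': $!" ), next;
}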
Jan
Re: Optimizing a script
by aquarium (Curate) on Apr 19, 2004 at 12:42 UTC
Why are you now delimiting (or trying to split) records on a
$<TAB><TAB><NEWLINE>
I'm not even sure that that's what it's trying to split on, as you've got quotes in the wrong places there: "...$/ = qq{"\$"\t""\t""\n};..."
To cut to the chase: the program was trying to slurp the whole file, since it couldn't find the record delimiter, and it was trying real hard to do so.
Here's a simple solution... please provide the proper sequence of characters for the regex to look for to find the end of each record, if it's not the original "$" (a dollar sign on a line by itself):
my $data;
while ( my $line = <> ) {
    chomp $line;
    if ( $line =~ /^\$$/ ) {    # start of line, a literal dollar sign, then
                                # the end of the line (not the newline
                                # character; the \$ has to be at the end of
                                # $line, which has been chomped)
        close OUTFILE;
        undef $data;
        next;
    }
    if ( !$data ) {
        open( OUTFILE, ">$line" ) or die "Cannot open '$line': $!";
        $data = 1;              # mark the file as open; the first line of
                                # each section is used as its filename
    }
    else {
        print OUTFILE "$line\n";
    }
}
This is nice and simple (procedural-style coding), which you should find easier to understand. Run the script as "perl script.pl <your_data_file".
btw... processing the file this way, one line at a time, is very friendly to your computer's memory: it will only ever hold a single line from your input file in memory. So it doesn't matter whether the input file has 50,000 lines in a particular section you want to separate out, or 5 lines. Doing the "tricky" thing of changing your record delimiter would eat up lots of memory if a data section ran to 50,000 lines. The problem would be worse if you had a malformed record delimiter, e.g. "$\t\n", which is harder to debug, as the program would just fail rather than producing output that's merely not quite what you want.
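A minimal sketch of the contrast, assuming the export_gesamt.asc filename and the record separator used elsewhere in this thread:
#!/usr/bin/perl -w
use strict;

# line at a time: only one line is ever held in memory
open A, '<export_gesamt.asc' or die "Failed to open: $!\n";
my $lines = 0;
$lines++ while <A>;
close A;

# record at a time: each read holds a whole record in memory,
# even if that record runs to 50,000 lines
$/ = qq{"\$"\t""\t""};
open A, '<export_gesamt.asc' or die "Failed to open: $!\n";
my $records = 0;
$records++ while <A>;
close A;

print "$lines lines, $records records\n";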
for( $above_post ) {s/lines/records/g; s/single line/single record/g}
# </pedantic>
Update: Never mind me, I misread and thought this was in reply to the original post, which was processing by records. However, if you do the record processing, it's more state you have to keep up with yourself rather than letting perl handle it for you. Not to mention that reading by lines isn't necessarily going to protect you from malformed input any better (for example, someone sends you a multi-meg file with Mac \cM line endings...).
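A minimal sketch of that failure mode, assuming a classic-Mac file where \cM ("\015") is the only line ending:
#!/usr/bin/perl -w
use strict;

# with the default $/ of "\n", <> would return the whole multi-meg file
# as one "line"; setting $/ to "\015" restores line-by-line reading
local $/ = "\015";
while ( my $line = <> ) {
    chomp $line;               # now strips the trailing \cM
    print "got: $line\n";
}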
Re: Optimizing a script
by ysth (Canon) on Apr 19, 2004 at 15:47 UTC
Obviously your problem is not just performance, but the first two lines of your loop could be combined (untested):
my ($file) = /\n.*?"(.*?)".*?\n/ or next;
(If your data allows, remove the final .*?\n)
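A minimal sketch of that combined match in the context of the loop from earlier in the thread (untested, as noted above):
while ( <A> ) {
    # one match replaces the split-then-match pair:
    my ($file) = /\n.*?"(.*?)".*?\n/ or next;
    open B, "> $file.csv"
        or warn( "Cannot open '$file': $!" ), next;
    chomp $_;
    print B $_;
    close B or warn( "Cannot close '$file': $!" ), next;
}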
Re: Optimizing a script
by thor (Priest) on Apr 19, 2004 at 12:26 UTC
Does it even print out the [DEBUG] messages? If not, then I'd suspect that there's something wrong (i.e. it's not processing your file at all). As far as perl is concerned, there is no difference between a small file and a large one when you're reading a chunk at a time...
Re: Optimizing a script
by Anonymous Monk on Apr 20, 2004 at 22:57 UTC
You may want to consider the following simple solution:
#!/usr/bin/perl -w
use strict;

open A, '<export_gesamt.asc'
    or die "Failed to open: $!\n";   # "or", not "||": "||" binds to the
                                     # filename string, so die would never run
my $line = '$';                      # seed so the first record opens a file
do {
    if ( $line =~ /^\$\s*$/ ) {      # "=~", not "eq"; \s* lets the bare '$' seed match
        chomp( my $file = <A> );     # the next line names the file
        open( OUT, ">$file" )        # note the comma after the handle
            or warn "Failed to create file '$file': $!\n";
        print "Writing $file\n";
    }
    else {
        print OUT $line;
    }
} while ( $line = <A> );