Re: Reading Multiple lines
by moritz (Cardinal) on Oct 15, 2008 at 09:14 UTC
The problem is that you're reading line by line rather than record by record, and you have no reliable way of detecting when a record continues on the next line.
I think Text::CSV has a solution for that, though you might need to tweak some of its options.
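A minimal sketch of the Text::CSV approach moritz suggests (assuming the CPAN module is installed; the inline data and handle are illustrative): with binary => 1 set, getline() keeps reading physical lines until it has a complete record, so embedded newlines are handled for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 allows fields to contain embedded newlines; getline()
# then returns whole records, not physical lines.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

my $input = qq{"data","data","data","data"\n"data","data with\nnew line","data","data"\n};
open my $fh, '<', \$input or die $!;   # in-memory handle for the demo

while (my $row = $csv->getline($fh)) {
    print scalar(@$row), " fields\n";  # both records parse to 4 fields
}
close $fh;
```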
When I open the file, the pattern is correct.
e.g. the input is:
"data","data","data","data"
"data","data with
new with new Line" "data"
"data","data","data","data"
"data","data","data","data"
"data","data with new with new Line" "data"
"data","data","data","data"
I'm not sure I need to use that module, as at this point in the process I don't process the fields; I only process each line.
Oh, I should have read your code more thoroughly.
"data","data","data","data"
"data","data with new with new Line" "data"
It would be more consistent to put a comma between the last two fields, not just a blank.
If you do that, you can just feed the lines to Text::CSV. Or reinvent the wheel by crafting a clever regex, but that has been discussed here many times (try Super Search for CSV or "comma separated"), so I won't rehash it.
I think that Text::CSV has a solution for that, you might need to tweak some of the options though.
I personally believe that, if nothing else then for completeness, one should mention Text::xSV as well. In particular, before posting I checked its docs, which say:
People usually naively solve this with split. A next step up is to read a line and parse it. Unfortunately this choice of interface (which is made by Text::CSV on CPAN) makes it difficult to handle returns embedded in a field. (Earlier versions of this document claimed impossible. That is false. But the calling code has to supply the logic to add lines until you have a valid row. To the extent that you don't do this consistently, your code will be buggy.) Therefore it is good for the parsing logic to have access to the whole file.
This module solves the problem by creating a CSV object with access to the filehandle; if in parsing it notices that a new line is needed, it can read at will.
The added emphasis is mine: what is claimed there means that the module should solve the OP's problem. Apologies to both you and the OP for replying so late...
Re: Reading Multiple lines
by tweetiepooh (Hermit) on Oct 15, 2008 at 09:56 UTC
This does work, but may not be suitable for large amounts of data. I hope it helps.
#!/usr/local/bin/perl -w
use strict;

$/ = "FDJK";   # set the input record separator to nonsense
               # so the whole file is read in one go
my $data = <DATA>;
print "$data\n\n";

$data =~ s/([^"])\n/$1 /g;   # join any line that doesn't end in a quote
$data =~ s/(\w) "/$1"/;      # drop the stray blank before a closing quote
print $data;
__END__
"data","data","data", "data"
"data","data
with new line
" "data"
"data","data,"data","data"
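The same slurp can be written with the idiomatic local $/ = undef, so no magic sentinel string is needed and the change to $/ cannot leak into other code. A minimal self-contained sketch (the inline data is illustrative):

```perl
use strict;
use warnings;

my $raw = qq{"data","data","data","data"\n"data","data with\nnew line","data"\n};

my $joined = do {
    open my $fh, '<', \$raw or die $!;   # in-memory handle for the demo
    local $/;                            # undef: slurp everything at once
    my $data = <$fh>;
    $data =~ s/([^"])\n/$1 /g;           # join lines not ending in a quote
    $data;
};

print $joined;   # the record broken across two lines is now one line
```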
Thanks for your reply.
I was thinking of doing something similar but, as you said, it is not advisable with large amounts of data.
Re: Reading Multiple lines
by gone2015 (Deacon) on Oct 15, 2008 at 14:28 UTC
As presented, your code won't run:
syntax error at mick2020.pl line 17, near "}"
Execution of mick2020.pl aborted due to compilation errors.
which isn't a good start.
As far as I can see, the problem is that before the if you've chomped the input line, but in the print after the if you don't restore the line ending. (Though you do at the end, inside the if.)
This appears to do what you seem to want:
while (<DATA>) {
    # keep appending input lines until the record ends in a closing quote
    while (($_ !~ /"\s*$/) && !eof) {
        s/[\r\n]+$//;          # strip the line ending before joining
        $_ .= ' ' . <DATA>;
    }
    print $_;
}
__DATA__
"data", "data" ,"data", "data","data", "data"
"data", "data" ,"data", "data
with new line
", "and some more!
","data", "data"
"data", "data" ,"data", "data","data", "data"
"tada", "tada
Note that the test for whether the line ends in " allows for any kind of trailing whitespace on the line, not just the line ending. Also, it only removes [\r\n] at the end of the lines it concatenates. Your code was stripping [\r\n] everywhere except on lines that ended with ", which may or may not have been deliberate.
Thanks for your input, but unfortunately it has not worked.
I have appended the code from the second part below.
I think the problem is in the next phase, though I can't understand why only the files with the carriage returns come through this preprocessing phase correctly. It processes a line once and closes the filehandle, but only for the lines that were not handled in the if statement.
It could be a bug in my code (more than likely :) ) or a bug in perl.
You've lost me.
I assume that in the code in 717191 the PROCESSEDFILE is the file produced by stitching the CSV stuff together. I pushed the bits of code together, as best I could, to get it to run. The $end variable was undefined and the $nameF variable was missing its my, but apart from that it runs.
First time through the while (<PROCESSEDFILE>) loop: $num will be set to 1, so if ($file{$key}{num} > 1) will fail, so no file will be opened. I cannot figure out what this is trying to do, but I suspect it's trying not to open a file in the '(column not present)' case -- and making a mess of it.
Incidentally, the first time it gets a '(column not present)' case it will pass the unless ($file{$key}) test and promptly set my $nameF = $c[$field], although $c[$field] is already known to be undefined. Not sure I see the point of that, either.
Finally, under all conditions -- including '(column not present)' and when it has failed to open a file -- it gets to the print {$file{$key}{name}} @c; line. What is this intended to do? Under strict it gives me
Can't use string ("/somewhere/data.END") as a symbol ref while "strict refs" in use ...
but for all I know it does something wonderful in non-strict. I note, however, that you set $file{$key}{fh}, which looks like a dead ringer for somewhere to output to?
Between you and me, this looks like a bit of a train wreck. I suggest putting the odd print statement here and there so that you can tell what's going on at each stage in the process... that may show you where things are and are not working as you expect.
BTW: I recommend Markup in the Monastery
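To illustrate the point about printing through the stored handle rather than the name string: a minimal sketch of splitting records into per-key files using a plain hash of lexical filehandles (no FileCache; the inline records, key extraction, and file names are illustrative). The print goes through the handle, the analogue of $file{$key}{fh} above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $outdir = tempdir(CLEANUP => 1);   # throwaway directory for the demo

my @records = (
    qq{"a",1,2\n},
    qq{"b",3,4\n},
    qq{"a",5,6\n},
);

my %fh_for;   # key => open filehandle, opened on first use
for my $line (@records) {
    my ($key) = $line =~ /^"([^"]*)"/;   # first quoted field is the key
    unless ($fh_for{$key}) {
        open $fh_for{$key}, '>', "$outdir/$key.csv"
            or die "Cannot open '$outdir/$key.csv': $!";
    }
    print { $fh_for{$key} } $line;       # print via the HANDLE, not the name
}
close $_ for values %fh_for;
```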
Re: Reading Multiple lines
by jwkrahn (Abbot) on Oct 15, 2008 at 10:47 UTC
@ARGV == 2 or die "usage: $0 input_file output_file\n";
open OUTFILE, '>', "$ARGV[1]temp.csv" or die "Cannot open '$ARGV[1]temp.csv' $!";
open INFILE, '<', $ARGV[0] or die "Cannot open '$ARGV[0]' $!";
while ( <INFILE> ) {
    chomp;
    # $\ (the output record separator) is appended after every print:
    # a newline when the record is complete, a space when it continues
    $\ = /"$/ ? "\n" : ' ';
    print OUTFILE $_;
}
close INFILE;
close OUTFILE;
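For readers unfamiliar with the trick above: $\ is the output record separator, appended automatically after every print. A self-contained sketch of the same join-on-incomplete-line idea, with the data and output held in memory for the demo:

```perl
use strict;
use warnings;

my $input  = qq{"data","data\nacross lines"\n"data","data"\n};
my $output = '';

open my $in,  '<', \$input  or die $!;   # in-memory handles for the demo
open my $out, '>', \$output or die $!;

while (<$in>) {
    chomp;
    # end the record with a newline only when the line closes with a
    # quote; otherwise glue it to the next line with a space
    local $\ = /"$/ ? "\n" : ' ';
    print $out $_;
}
close $in;
close $out;

print $output;   # the embedded newline has become a space
```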
Hi,
I have tried your suggestion and it does not solve the problem.
I have compared the format of this output with the format of a file that has no carriage returns, and the formats are identical.
use FileCache maxOpen => 50;
# config:
my $field = 0;
my $sep = ",";
$, = $sep;
$\ = $/;
my %file;
my $fnum = 1;
my $outDir = $ARGV[1];
unless (-d $outDir) {
    die "There is no such directory.";
}
open PROCESSEDFILE, $ARGV[0] or die $!;
while (<PROCESSEDFILE>)
{
    chomp;
    my @c = split(/$sep/,$_);
    my( $key, $num ) = defined $c[$field]
        ? ( $c[$field], $fnum++ )
        : ( '(column not present)', 0 );
    unless ( $file{$key} )
    {
        $nameF = $c[$field];
        $nameF =~ s/"//g;
        $file{$key}{num} = $num;
        $file{$key}{name} = $ARGV[1].$nameF.$end;
        if (($file{$key}{num}) > 1) {
            -f $file{$key}{name} and die
                "Sorry, '$file{$key}{name}' exists; won't clobber.";
            $file{$key}{fh} = cacheout $file{$key}{name} or die
                "Error opening '$file{$key}{name}' for write - $!";
        }
    }
    print {$file{$key}{name}} @c;
}
This is the code for the next part of the processing.
Its output is okay for files that were not preprocessed.
When I use a preprocessed file as input, it handles the records that had the carriage returns okay, but it doesn't process the records that were formatted properly in the original file.
I am thinking that there is some character I insert when processing the records that have \r\n and leave out when writing the records that don't, but I can't spot it.
Re: Reading Multiple lines
by mick2020 (Novice) on Oct 15, 2008 at 09:15 UTC
I tried the following code:
open (MYFILE, ">", $ARGV[1]."temp.csv") or die $!;
open (FILEHANDLE, "<", $ARGV[0]) or die $!;
while (<FILEHANDLE>)
{
    chomp;
    if ($_ !~ /"$/) {
        $_ =~ s/[\n\r]//g;
        while ($_ !~ /"$/) {
            $test = <FILEHANDLE>;
            $test =~ s/[\n\r]//g;
            $_ = $_." ".$test;
        }
        $_ = $_."\n";
    }
    print MYFILE $_;
}
close (FILEHANDLE);
close (MYFILE);
exit;
Still have the same problem
Re: Reading Multiple lines
by mick2020 (Novice) on Oct 15, 2008 at 13:04 UTC
It is closing the filehandle for the records that have not been processed.
Any ideas why this is so?