Re: Reading Multiple lines
by moritz (Cardinal) on Oct 15, 2008 at 09:14 UTC
The problem is that you're reading line by line rather than record by record, and you have no reliable way of detecting when a record continues on the next line.
I think Text::CSV has a solution for that, though you might need to tweak some of its options.
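A minimal sketch of the Text::CSV approach moritz suggests (assuming the CPAN module is installed; the inline data and handle are illustrative): with binary => 1 set, getline() keeps reading physical lines until it has a complete record, so embedded newlines are handled for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 allows fields to contain embedded newlines; getline()
# then returns whole records, not physical lines.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

my $input = qq{"data","data","data","data"\n"data","data with\nnew line","data","data"\n};
open my $fh, '<', \$input or die $!;   # in-memory handle for the demo

while (my $row = $csv->getline($fh)) {
    print scalar(@$row), " fields\n";  # both records parse to 4 fields
}
close $fh;
```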
When I open the file, the pattern is correct.
e.g. the input is:
"data","data","data","data"
"data","data with
new with new Line" "data"
"data","data","data","data"
"data","data","data","data"
"data","data with new with new Line" "data"
"data","data","data","data"
I'm not sure I need to use that module, as at this point in the process I don't process the fields; I only process each line.
Oh, I should have read your code more thoroughly.
"data","data","data","data"
"data","data with new with new Line" "data"
It would be more consistent to put a comma between the last two fields, not just a blank.
If you do that, you can just feed the lines to Text::CSV. Or reinvent the wheel by crafting a clever regex, but that has been discussed here many times (try Super Search for CSV or "comma separated"), so I won't rehash it.
I think that Text::CSV has a solution for that, you might need to tweak some of the options though.
I personally believe that, if nothing else then for completeness, one should mention Text::xSV as well. In particular, before posting I checked its docs, which say:
People usually naively solve this with split. A next step up is to read a line and parse it. Unfortunately this choice of interface (which is made by Text::CSV on CPAN) makes it difficult to handle returns embedded in a field. (Earlier versions of this document claimed impossible. That is false. But the calling code has to supply the logic to add lines until you have a valid row. To the extent that you don't do this consistently, your code will be buggy.) Therefore it is good for the parsing logic to have access to the whole file.
This module solves the problem by creating a CSV object with access to the filehandle; if in parsing it notices that a new line is needed, it can read at will.
The added emphasis is mine: what is claimed there means that the module should solve the OP's problem. Apologies to both you and the OP for replying so late...
Re: Reading Multiple lines
by tweetiepooh (Hermit) on Oct 15, 2008 at 09:56 UTC
This does work, but may not be suitable for large amounts of data. I hope it helps.
#!/usr/local/bin/perl -w
use strict;

$/ = "FDJK";   # set the input record separator to nonsense
               # so the whole file is read in one go
my $data = <DATA>;
print "$data\n\n";

$data =~ s/([^"])\n/$1 /g;   # join any line that doesn't end in a quote
$data =~ s/(\w) "/$1"/;      # drop the stray blank before a closing quote
print $data;
__END__
"data","data","data", "data"
"data","data
with new line
" "data"
"data","data,"data","data"
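The same slurp can be written with the idiomatic local $/ = undef, so no magic sentinel string is needed and the change to $/ cannot leak into other code. A minimal self-contained sketch (the inline data is illustrative):

```perl
use strict;
use warnings;

my $raw = qq{"data","data","data","data"\n"data","data with\nnew line","data"\n};

my $joined = do {
    open my $fh, '<', \$raw or die $!;   # in-memory handle for the demo
    local $/;                            # undef: slurp everything at once
    my $data = <$fh>;
    $data =~ s/([^"])\n/$1 /g;           # join lines not ending in a quote
    $data;
};

print $joined;   # the record broken across two lines is now one line
```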
Thanks for your reply.
I was thinking of doing something similar but, as you said, it is not advisable with large amounts of data.
Re: Reading Multiple lines
by gone2015 (Deacon) on Oct 15, 2008 at 14:28 UTC
As presented, your code won't run:
syntax error at mick2020.pl line 17, near "}"
Execution of mick2020.pl aborted due to compilation errors.
which isn't a good start.
As far as I can see, the problem is that before the if you've chomped the input line, but in the print after the if you don't restore the line ending. (Though you do at the end, inside the if.)
This appears to do what you seem to want:
while (<DATA>) {
    # keep appending input lines until the record ends in a closing quote
    while (($_ !~ /"\s*$/) && !eof) {
        s/[\r\n]+$//;          # strip the line ending before joining
        $_ .= ' ' . <DATA>;
    }
    print $_;
}
__DATA__
"data", "data" ,"data", "data","data", "data"
"data", "data" ,"data", "data
with new line
", "and some more!
","data", "data"
"data", "data" ,"data", "data","data", "data"
"tada", "tada
Note that the test for whether the line ends in " allows for any kind of trailing whitespace on the line, not just the line ending. Also, it only removes [\r\n] at the end of the lines it concatenates. Your code was stripping [\r\n] everywhere except on lines that ended with ", which may or may not have been deliberate.
Thanks for your input, but unfortunately it has not worked.
I have appended the code from the second part below.
I think the problem is in the next phase, though I can't understand why only the files with the carriage returns come through this preprocessing phase correctly. It processes a line once and closes the filehandle, but only for the lines that were not handled in the if statement.
It could be a bug in my code (more than likely :) ) or a bug in perl.
You've lost me.
I assume that in the code in 717191 the PROCESSEDFILE is the file produced by stitching the CSV stuff together. I pushed the bits of code together, as best I could, to get it to run. The $end variable was undefined and the $nameF variable was missing its my, but apart from that it runs.
First time through the while (<PROCESSEDFILE>) loop: $num will be set to 1, so if ($file{$key}{num} > 1) will fail, so no file will be opened. I cannot figure out what this is trying to do, but I suspect it's trying not to open a file in the '(column not present)' case -- and making a mess of it.
Incidentally, the first time it gets a '(column not present)' case it will pass the unless ($file{$key}) test and promptly set my $nameF = $c[$field], although $c[$field] is already known to be undefined. Not sure I see the point of that, either.
Finally, under all conditions -- including '(column not present)' and when it has failed to open a file -- it gets to the print {$file{$key}{name}} @c; line. What is this intended to do? Under strict it gives me
Can't use string ("/somewhere/data.END") as a symbol ref while "strict refs" in use ...
but for all I know it does something wonderful in non-strict. I note, however, that you set $file{$key}{fh}, which looks like a dead ringer for somewhere to output to?
Between you and me, this looks like a bit of a train wreck. I suggest putting the odd print statement here and there so that you can tell what's going on at each stage in the process... that may show you where things are and are not working as you expect.
BTW: I recommend Markup in the Monastery
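To illustrate the point about printing through the stored handle rather than the name string: a minimal sketch of splitting records into per-key files using a plain hash of lexical filehandles (no FileCache; the inline records, key extraction, and file names are illustrative). The print goes through the handle, the analogue of $file{$key}{fh} above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $outdir = tempdir(CLEANUP => 1);   # throwaway directory for the demo

my @records = (
    qq{"a",1,2\n},
    qq{"b",3,4\n},
    qq{"a",5,6\n},
);

my %fh_for;   # key => open filehandle, opened on first use
for my $line (@records) {
    my ($key) = $line =~ /^"([^"]*)"/;   # first quoted field is the key
    unless ($fh_for{$key}) {
        open $fh_for{$key}, '>', "$outdir/$key.csv"
            or die "Cannot open '$outdir/$key.csv': $!";
    }
    print { $fh_for{$key} } $line;       # print via the HANDLE, not the name
}
close $_ for values %fh_for;
```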
Re: Reading Multiple lines
by jwkrahn (Abbot) on Oct 15, 2008 at 10:47 UTC
@ARGV == 2 or die "usage: $0 input_file output_file\n";
open OUTFILE, '>', "$ARGV[1]temp.csv" or die "Cannot open '$ARGV[1]temp.csv' $!";
open INFILE, '<', $ARGV[0] or die "Cannot open '$ARGV[0]' $!";
while ( <INFILE> ) {
    chomp;
    # $\ (the output record separator) is appended after every print:
    # a newline when the record is complete, a space when it continues
    $\ = /"$/ ? "\n" : ' ';
    print OUTFILE $_;
}
close INFILE;
close OUTFILE;
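For readers unfamiliar with the trick above: $\ is the output record separator, appended automatically after every print. A self-contained sketch of the same join-on-incomplete-line idea, with the data and output held in memory for the demo:

```perl
use strict;
use warnings;

my $input  = qq{"data","data\nacross lines"\n"data","data"\n};
my $output = '';

open my $in,  '<', \$input  or die $!;   # in-memory handles for the demo
open my $out, '>', \$output or die $!;

while (<$in>) {
    chomp;
    # end the record with a newline only when the line closes with a
    # quote; otherwise glue it to the next line with a space
    local $\ = /"$/ ? "\n" : ' ';
    print $out $_;
}
close $in;
close $out;

print $output;   # the embedded newline has become a space
```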
Hi,
I have tried your suggestion and it does not solve the problem.
I have compared the format of this output with the format of a file that has no carriage returns, and the formats are identical.
use FileCache maxOpen => 50;
# config:
my $field = 0;
my $sep = ",";
$, = $sep;
$\ = $/;
my %file;
my $fnum = 1;
my $outDir = $ARGV[1];
unless (-d $outDir) {
    die "There is no such directory.";
}
open PROCESSEDFILE, $ARGV[0] or die $!;
while (<PROCESSEDFILE>)
{
    chomp;
    my @c = split(/$sep/,$_);
    my( $key, $num ) = defined $c[$field]
        ? ( $c[$field], $fnum++ )
        : ( '(column not present)', 0 );
    unless ( $file{$key} )
    {
        $nameF = $c[$field];
        $nameF =~ s/"//g;
        $file{$key}{num} = $num;
        $file{$key}{name} = $ARGV[1].$nameF.$end;
        if (($file{$key}{num}) > 1) {
            -f $file{$key}{name} and die
                "Sorry, '$file{$key}{name}' exists; won't clobber.";
            $file{$key}{fh} = cacheout $file{$key}{name} or die
                "Error opening '$file{$key}{name}' for write - $!";
        }
    }
    print {$file{$key}{name}} @c;
}
This is the code for the next part of the processing.
Its output is okay for files that were not preprocessed.
When I use a preprocessed file as input, it handles the records that had the carriage returns okay, but it doesn't process the records that were formatted properly in the original file.
I am thinking that there is some character I insert when processing the records that have \r\n and leave out when writing the records that don't, but I can't spot it.
Re: Reading Multiple lines
by mick2020 (Novice) on Oct 15, 2008 at 09:15 UTC
I tried the following code:
open (MYFILE, ">", $ARGV[1]."temp.csv") or die $!;
open (FILEHANDLE, "<", $ARGV[0]) or die $!;
while (<FILEHANDLE>)
{
    chomp;
    if ($_ !~ /"$/) {
        $_ =~ s/[\n\r]//g;
        while ($_ !~ /"$/) {
            $test = <FILEHANDLE>;
            $test =~ s/[\n\r]//g;
            $_ = $_." ".$test;
        }
        $_ = $_."\n";
    }
    print MYFILE $_;
}
close (FILEHANDLE);
close (MYFILE);
exit;
Still have the same problem
Re: Reading Multiple lines
by mick2020 (Novice) on Oct 15, 2008 at 13:04 UTC
It is closing the filehandle for the records that have not been processed.
Any ideas why this is so?