Re: Parsing a text file
by graff (Chancellor) on Jan 14, 2009 at 01:56 UTC
|
That bit about having run-on lines with ^M delimiters would make me want to try something like this (not tested):
#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 filename.csv\n" unless ( @ARGV and -f $ARGV[0] );

for my $csvname ( @ARGV ) {
    my $records = read_csv( $csvname );
    if ( ref( $records ) ne 'ARRAY' ) {
        warn "Unable to pull records from file $csvname\n";
        next;
    }
    elsif ( @$records == 0 ) {
        warn "No csv data found in file $csvname\n";
        next;
    }
    do_something( $records );
}

# Slurp the whole file, then split it into records on any run of CR/LF.
sub read_csv
{
    my $filename = shift;
    open( my $in, "<", $filename ) or do {
        warn "open failed on $filename: $!\n";
        return;
    };
    local $/;    # no record separator, so <$in> returns the whole file
    my $alldata = <$in>;
    close $in;
    # drop comment lines and blank lines
    my @records = grep !/^#|^\s*$/, split( /[\r\n]+/, $alldata );
    return \@records;
}

sub do_something
{
    # because just being able to read is seldom enough...
}
I suppose if your files are really huge (hundreds of MB), the slurping and splitting might be impractical. But these days, anything up to 100 MB or so should fit comfortably in memory.
(Updated to fix grammar in the opening sentence. I'd also suggest that "read_csv" should really be called something else, like "read_file_data" -- there's nothing particularly "csv-ish" about that sub.)
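For files too big to slurp, one possible alternative is to read fixed-size blocks and carry the unfinished tail over to the next block. This is only a sketch, not tested against real data: the name each_record, the callback interface and the 64 KB block size are all assumptions of mine, not anything from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: hand each record to $callback without slurping.
# Records end at any run of CR/LF; comment and blank lines are skipped.
sub each_record {
    my ( $fh, $callback ) = @_;
    my $pending = '';
    local $/ = \65536;    # read fixed-size 64 KB blocks (arbitrary size)
    while ( defined( my $block = <$fh> ) ) {
        $pending .= $block;
        # limit -1 keeps the trailing field, which may be incomplete
        my @parts = split /[\r\n]+/, $pending, -1;
        $pending = pop(@parts) // '';
        for my $rec (@parts) {
            next if $rec =~ /^#|^\s*$/;
            $callback->($rec);
        }
    }
    # whatever is left after the last block is the final record
    $callback->($pending) if $pending !~ /^#|^\s*$/;
    return;
}
```

Usage would be something like each_record( $fh, sub { print "$_[0]\n" } ) on an already opened filehandle.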
|
|
graff, be careful about \r and \n. I changed my habit of writing \n when I'm not sure where my script will be run. The reason is this section of perldoc perlipc:
Internet Line Terminators
The Internet line terminator is "\015\012". Under ASCII variants of
Unix, that could usually be written as "\r\n", but under other systems,
"\r\n" might at times be "\015\015\012", "\012\012\015", or something
completely different. The standards specify writing "\015\012" to be
conformant (be strict in what you provide), but they also recommend
accepting a lone "\012" on input (but be lenient in what you require).
We haven't always been very good about that in the code in this
manpage, but unless you're on a Mac, you'll probably be ok.
In this case, where the file is edited by several people on different platforms, it might be a good idea to use a combination of \012 and \015 instead.
s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
|
|
That's part wrong, part outdated.
- In the situation where "\r\n" ends up being "\015\015\012" (when using :crlf), so will "\015\012".
- "\012\012\015" should read "\012\015", and that only occurs on MacPerl (Perl for Macs earlier than OS X).
On all current operating systems, "\015" is interchangeable with "\r" and "\012" is interchangeable with "\n".
(Well, not on EBCDIC systems. But it's not clear to me how you'd want the program to behave there in this case.)
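If you do want to spell the terminators out, here is a minimal sketch that treats CRLF, a lone CR and a lone LF each as exactly one line ending. It assumes an ASCII-based platform (not EBCDIC), and the sample data is made up.

```perl
use strict;
use warnings;

# Made-up sample with three different line endings (ASCII platform assumed).
my $data = "first\015\012second\015third\012fourth";

# CRLF, lone CR and lone LF each count as one terminator.
my @lines = split /\015\012?|\012/, $data;

print join( '|', @lines ), "\n";    # prints first|second|third|fourth
```

Unlike /[\r\n]+/, this pattern does not collapse runs of terminators, so an empty line in the middle of the data comes back as an empty string rather than vanishing.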
|
|
/[\r\n]+/
|
|
Graff... damn, you're good.
Thank you, after reading your post, a lightbulb went on in my head. My script is now working and launching video and radio channels like a bat out of hell. I would post the entire product but it is proprietary. Ya know how that goes.. ;-)
Re: Parsing a text file
by gone2015 (Deacon) on Jan 14, 2009 at 00:10 UTC
|
On my machine it told me it found data on 1 line, which is what I would expect...
- Neither elsif (/^M/) nor split /^M/ is doing what you expect. If you want control-M you want \cM, not ^M -- though I'd use \x0D, but that's a matter of taste.
- Especially because something odd is apparently happening, I'd recommend an or die "failed to open $csv: $!" after the open.
- You don't really need the for loop; you could simply push @chunks, split(/\cM/, $_). Note that split discards trailing empty fields, so if a line ends "\cM\cM\cM" you won't end up with three blank lines -- which may or may not be what you want.
- You pass the filename $csv to the read_csv subroutine but don't use it, which doesn't look right.
But I cannot explain why you seem to get 0 lines... I don't suppose it's possible that you have set $/ to undef?
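To illustrate the first point, a small sketch with made-up data. It assumes the ^M in the pattern is typed as two characters, a caret followed by a capital M, which Perl reads as "an M at the start of the string".

```perl
use strict;
use warnings;

# Made-up sample: three chunks joined by carriage returns.
my $line = "one\x0Dtwo\x0Dthree";

my @wrong = split /^M/,  $line;    # caret + M: never matches here
my @right = split /\cM/, $line;    # control-M, same as \x0D or \015

print scalar(@wrong), "\n";        # prints 1
print scalar(@right), "\n";        # prints 3
```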
|
|
Thank you, oshalla, for your time. Here are my responses to your bullet points.
1. I have made this change even though I had this part working merely because you are right and ^M's don't always behave the way one would expect.
2. Also added this change even though I know it was not failing to open the csv file as it was printing the contents. Suffice it to say that it is a good coding practice that I usually use. I just whipped up an example script for perlmonks.org to show my problem.
3. Must have the for loop as the ^M's are delineating lines of data. Throwing away anything after a ^M is throwing away data I need.
4. You are right, in this particular example I am not using the passed variable name. It's just a habit.
|
|
I realised as I woke up that the ^M in your code could be actual control-M characters, and not the literal caret followed by M that I had taken them for. Now that that's clear, I'd still use an explicit escape sequence, e.g. \x0D (or the divinely retro \015).
I fear I didn't get the point across re the for loop... push takes a LIST of things to push onto the ARRAY. So in push @chunks, split(/\cM/, $_) the entire list returned by split is pushed onto @chunks all in one go.
The caveat about trailing separators and split can be seen in
print map("'$_' ", split(/:/, 'a:b:c:::') ), "\n" ; # 'a' 'b' 'c'
print map("'$_' ", split(/:/, 'a:b:c:::', -1)), "\n" ; # 'a' 'b' 'c' '' '' ''
see split. I fear I confused the issue by addressing two points in one paragraph, for which I will now do penance and hope for forgiveness from the gods of clear English.
I note that the problem is now fixed. Since the code as posted worked on my machine, I'm particularly curious as to what the problem was.
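Putting the two points together, here is a sketch of the push form with and without the -1 limit. The sample lines and the array names are made up for illustration.

```perl
use strict;
use warnings;

# Made-up input: the first line ends in two control-M characters.
my @lines = ( "a\cMb\cM\cM", "c" );

my ( @default, @keep_all );
for (@lines) {
    push @default,  split( /\cM/, $_ );        # trailing empty fields dropped
    push @keep_all, split( /\cM/, $_, -1 );    # trailing empty fields kept
}

print scalar(@default),  "\n";    # prints 3
print scalar(@keep_all), "\n";    # prints 5
```

Each push adds the whole list returned by split in one go, so no inner loop over the pieces is needed.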
Re: Parsing a text file
by toolic (Bishop) on Jan 14, 2009 at 01:25 UTC
|
|
|
Nope, don't want to print anything actually. I put a $verbose variable in there to turn on messages, but for the most part we want this to run silent (until it breaks).
As for preprocessing to get rid of the ^M's, that is a thought, but I hate to call external (possibly missing) programs to do something that Perl can and should handle natively.
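For the record, the kind of native clean-up meant here could look something like this. It is a sketch only: normalize_eol is a made-up name, and it assumes the data is already in a scalar and that every CR in it is a line ending.

```perl
use strict;
use warnings;

# Hypothetical helper: turn CRLF and lone CR into LF, no external tools.
sub normalize_eol {
    my ($text) = @_;
    $text =~ s/\015\012?/\012/g;
    return $text;
}

print normalize_eol("one\015\012two\015three\012");
```

After that, an ordinary split on "\n" (or a line-by-line read of the cleaned-up data) sees one record per line.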
|
|
Re: Parsing a text file
by Plankton (Vicar) on Jan 13, 2009 at 23:47 UTC
|
|
|
That would be great if indeed I had control over the csv file. But I don't. End L-users are writing/updating the file, and sometimes I get empty lines or lines joined by ^M. In fact, for all intents and purposes, you can ignore the fact that I am trying to parse comma separated values, because the true problem lies in defining the individual lines themselves. The csv stuff I already have is golden.