in reply to Parsing a text file

That bit about having run-on lines with ^M delimiters would make me want to try something like this (not tested):
#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 filename.csv\n" unless ( @ARGV and -f $ARGV[0] );

for my $csvname ( @ARGV ) {
    my $records = read_csv( $csvname );
    if ( ref( $records ) ne 'ARRAY' ) {
        warn "Unable to pull records from file $csvname\n";
        next;
    }
    elsif ( @$records == 0 ) {
        warn "No csv data found in file $csvname\n";
        next;
    }
    do_something( $records );
}

sub read_csv {
    my $filename = shift;
    open( my $in, "<", $filename ) or do {
        warn "open failed on $filename: $!\n";
        return;
    };
    local $/;    # slurp mode: read the whole file in one go
    my $alldata = <$in>;
    close $in;
    # split on any run of CR/LF, then drop comment lines and blank lines
    my @records = grep !/^#|^\s*$/, split( /[\r\n]+/, $alldata );
    return \@records;
}

sub do_something {
    # because just being able to read is seldom enough...
}
I suppose if your files are really huge (hundreds of MB), the slurping and splitting might be impractical. But these days, anything up to 100 MB or so should fit comfortably in memory.
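
If slurping ever does become impractical, something along these lines might do instead (a rough sketch, not from the code above: each_record, its callback argument, and the 64 KB default chunk size are all made up for illustration). It reads fixed-size chunks, splits on the same /[\r\n]+/ pattern, and carries any unfinished record over to the next chunk:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per record, reading the file
# in fixed-size chunks instead of slurping it all at once.
sub each_record {
    my ( $filename, $callback, $chunk_size ) = @_;
    $chunk_size ||= 64 * 1024;    # assumed default, tune to taste

    open( my $in, '<', $filename ) or do {
        warn "open failed on $filename: $!\n";
        return;
    };
    binmode $in;                  # keep \r and \n bytes exactly as stored

    my $tail  = '';               # unfinished record carried between chunks
    my $count = 0;
    while ( read( $in, my $chunk, $chunk_size ) ) {
        # limit -1 keeps a trailing empty field, so a chunk that ends
        # on a delimiter leaves an empty tail
        my @pieces = split /[\r\n]+/, $tail . $chunk, -1;
        $tail = pop @pieces;      # last piece may be incomplete
        for my $rec ( grep !/^#|^\s*$/, @pieces ) {
            $callback->( $rec );
            $count++;
        }
    }
    close $in;

    # whatever is left after the last chunk is the final record
    if ( $tail !~ /^#|^\s*$/ ) {
        $callback->( $tail );
        $count++;
    }
    return $count;
}

# Example use: print every record of the file named on the command line.
each_record( $ARGV[0], sub { print "$_[0]\n" } ) if @ARGV;
```

The trade-off is that you handle records one at a time through the callback rather than getting back one big array, which is the whole point when the array would not fit in memory.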

(Updated to fix grammar in the opening sentence. I'd also suggest that "read_csv" should really be called something else, like "read_file_data" -- there's nothing particularly "csv-ish" about that sub.)

Re^2: Parsing a text file
by Skeeve (Parson) on Jan 14, 2009 at 05:05 UTC

    graff, be careful about \r and \n. I've dropped the habit of writing \n when I'm not sure where my script will be run. The reason is this passage from perldoc perlipc, section "Internet Line Terminators":

    Internet Line Terminators

    The Internet line terminator is "\015\012". Under ASCII variants of Unix, that could usually be written as "\r\n", but under other systems, "\r\n" might at times be "\015\015\012", "\012\012\015", or something completely different. The standards specify writing "\015\012" to be conformant (be strict in what you provide), but they also recommend accepting a lone "\012" on input (but be lenient in what you require). We haven't always been very good about that in the code in this manpage, but unless you're on a Mac, you'll probably be ok.

    In this case, where the file is edited by several people on different platforms, it might be a good idea to use a combination of \012 and \015.


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

      That's part wrong, part outdated.

      • In the situation where "\r\n" ends up being "\015\015\012" (when using :crlf), so will "\015\012".
      • "\012\012\015" should read "\012\015", and that only occurs on MacPerl (Perl for Macs earlier than OS X).

      On all current operating systems, "\015" is interchangeable with "\r" and "\012" is interchangeable with "\n".

      (Well, not on EBCDIC systems. But it's not clear to me how you'd want the program to behave there in this case.)

      But if you look a bit more closely at graff's code, you'll see that he already uses a combination:

      /[\r\n]+/
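
      For what it's worth, here is a quick check of how that pattern behaves on mixed line endings. The sample string is made up; the split and grep are the ones from graff's read_csv:

      ```perl
      use strict;
      use warnings;

      # Made-up sample: Unix (\n), DOS (\r\n) and old-Mac (\r) endings in
      # one string, plus a blank line and a comment line.
      my $alldata = "alpha\nbeta\r\ngamma\rdelta\r\n\r\n# a comment\nepsilon\n";

      # Same expression as in read_csv: split on any run of CR/LF, then
      # drop comment lines and whitespace-only lines.
      my @records = grep !/^#|^\s*$/, split /[\r\n]+/, $alldata;

      print scalar(@records), " records: @records\n";
      # prints "5 records: alpha beta gamma delta epsilon"
      ```

      One side effect worth knowing: because of the +, a run of several terminators counts as a single separator, so empty lines disappear at the split stage, before the grep ever sees them.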

        [\r\n] does appear to finesse the problem nicely.

        Unfortunately, when this last came up, I looked at all the relevant documentation I could find, but I did not see any guarantee that "\r" will be "\x0A" if "\n" is "\x0D" (or vice versa) in not-EBCDIC land. Or even that "\r" and "\n" are in general guaranteed to be duals of each other.

        As brother ikegami says, you know and I know that these days, with the exception of EBCDIC systems, "\r\n" is exactly "\x0D\x0A". If an authoritative position were taken that as of (say) 5.8.0:

        • "\r\n" eq "\x0D\x0A" except for EBCDIC.

        • any system using line endings other than "\n" will support, and will by default use, a PerlIO layer that maps those line endings to/from "\n"

        then we could consign worrying about this piece of magic to the bin. I don't know what the position is with MacPerl, but perlmacos suggests that the above could be back-dated to 5.8.0 including MacPerl.

        FWIW, socket handling can (of course) be simplified by applying binmode $sock, ':crlf', which is nice. Nevertheless, chomp is a snare and a delusion if you think it's handling Internet CRLF line endings (unless you're futzing about with $/ at the same time). Wouldn't it be nice to have a chompnl equivalent to s/\x0D?\x0A$// ? And, perhaps, chomps equivalent to s/\s+$// ?
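
        Just to make the wish concrete, here is a sketch of what those two might look like as plain subs. Neither chompnl nor chomps exists in core Perl; the names come from the paragraph above, and the in-place modification via $_[0] and the returned count are my own assumptions, loosely modelled on chomp:

        ```perl
        use strict;
        use warnings;

        # Hypothetical: strip one trailing LF or CRLF from the argument, in
        # place. Returns the number of characters removed.
        sub chompnl {
            return $_[0] =~ s/(\x0D?\x0A)$// ? length($1) : 0;
        }

        # Hypothetical: strip all trailing whitespace from the argument, in
        # place. Returns the number of characters removed.
        sub chomps {
            return $_[0] =~ s/(\s+)$// ? length($1) : 0;
        }

        my $line = "HELO example.org\x0D\x0A";

        my $with_chomp = $line;
        chomp $with_chomp;           # with $/ eq "\x0A" the \x0D stays behind

        my $with_chompnl = $line;
        chompnl( $with_chompnl );    # removes the \x0D\x0A pair

        printf "chomp left %d chars, chompnl left %d chars\n",
            length($with_chomp), length($with_chompnl);
        ```

        The comment on chomp assumes the default $/ on a platform where "\n" is "\x0A".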

        BTW, I note that \R is defined in perlreref as (?>\v|\x0D\x0A). Shouldn't that be (?>\x0D\x0A|\v) ? And I wonder what the EBCDIC folk make of this !
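
        The order matters because the group is atomic: once an alternative has matched, the regex engine will not go back and try the next one. A small test of both orderings against a CRLF pair, assuming Perl 5.10 or later for \v and \R (the variable names are only for illustration):

        ```perl
        use 5.010;
        use strict;
        use warnings;

        my $crlf = "\x0D\x0A";

        # \v first: the group grabs the lone \x0D and commits to it, so the
        # anchored match cannot then consume the \x0A.
        my $v_first = qr/\A(?>\v|\x0D\x0A)\z/;

        # CRLF first: the pair is taken as one unit.
        my $crlf_first = qr/\A(?>\x0D\x0A|\v)\z/;

        printf "v first:    %s\n", $crlf =~ $v_first    ? 'match' : 'no match';
        printf "CRLF first: %s\n", $crlf =~ $crlf_first ? 'match' : 'no match';
        printf "real \\R:    %s\n", $crlf =~ /\A\R\z/    ? 'match' : 'no match';
        ```

        If the built-in \R matches the pair as a unit here, that suggests the implementation uses the CRLF-first ordering and only the wording in perlreref is off.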

Re^2: Parsing a text file
by calmthestorm (Acolyte) on Jan 14, 2009 at 02:16 UTC
    Graff.... damn, you're good.

    Thank you, after reading your post, a lightbulb went on in my head. My script is now working and launching video and radio channels like a bat out of hell. I would post the entire product but it is proprietary. Ya know how that goes.. ;-)