Parsing a text file

calmthestorm has asked for the wisdom of the Perl Monks concerning the following question:

Hail noble monks of perl,

I am in need of another pair of eyes to look over the mess I am dealing with. At one point I had all of this working but for some reason it no longer does...

I am trying to parse a text file that is created and updated by multiple users on different operating systems and applications. The file is plain-text comma-separated values and could contain comments, blank lines and the nasty old DOS ^M characters. What I need to do is break this file into separate lines while ignoring comments and blank lines. Simple enough, right? I thought so too, until this piece of code started lumping the whole thing into one line regardless.

The CSV file contents:
# This is a comment followed by a blank line

# this comment is followed by a run-on line with ^M's
humpty dumpty sat on a rock^Mhumpty dumpty fell^MPoor humpty dumpty is busted all to hell



The script in question:
#!/usr/bin/perl
use strict;

# define the csv datafile name (and path if need be) 
my $csv = 'testing/testfile.csv';

# Process the CSV file
&read_csv($csv);

# All is well that exits zero...
exit;

sub read_csv { 
    my @chunks;
    open(CONF, $csv);
    while(<CONF>){
        chomp;  
        if (/^#/){ 
            # Must be a comment, skip it
            next;   
        } elsif (/^\s*$/) {
            # Only contains whitespace, skip it
            next;   
        } elsif (/^M/){    
            # Contains dos/mac control characters 
            my @lines = split /^M/, $_;    
            for ( my $i = 0 ; $i <= $#lines ; $i++ ){  
                push(@chunks, $lines[$i]);    
            }    
        } else {
            # assumed to be a normal data line
            push(@chunks, $_);
        }   
    print "Found data for ", scalar(@chunks), " lines in $csv\n\n";
    }   
    close(CONF);
}

After all is said and done, the script should print out one line with the number of actual data lines found... I get 0 because it is finding the first comment and lumping the entire file into that one line. I know I must have done something stupid but for the life of me, I don't see it.

Please save me from insanity...


Re: Parsing a text file by graff (Chancellor) on Jan 14, 2009 at 01:56 UTC
That bit about having run-on lines with ^M delimiters would make me want to try something like this (not tested):

    #!/usr/bin/perl
    use strict;

    die "Usage: $0 filename.csv\n" unless ( @ARGV and -f $ARGV[0] );

    for my $csvname ( @ARGV ) {
        my $records = read_csv( $csvname );
        if ( ref( $records ) ne 'ARRAY' ) {
            warn "Unable to pull records from file $csvname\n";
            next;
        }
        elsif ( @$records == 0 ) {
            warn "No csv data found in file $csvname\n";
            next;
        }
        do_something( $records );
    }

    sub read_csv {
        my $filename = shift;
        open( IN, "<", $filename ) or do {
            warn "open failed on $filename: $!\n";
            return;
        };
        local $/;    # slurp mode: read the whole file in one go
        my $alldata = <IN>;
        close( IN );
        # split on any run of CR/LF, then drop comments and blank lines
        my @records = grep !/^#|^\s*$/, split( /[\r\n]+/, $alldata );
        return \@records;
    }

    sub do_something {
        # because just being able to read is seldom enough...
    }

I suppose if your files are really huge (hundreds of MB), the slurping and splitting might be impractical. But these days, anything up to 100 MB or so should fit comfortably.

(Updated to fix grammar in the opening sentence. I'd also suggest that "read_csv" should really be called something else, like "read_file_data" -- there's nothing particularly "csv-ish" about that sub.)
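As a rough illustration of the slurp-and-split idea, the sketch below runs it against an in-memory copy of the sample data from the question. The helper name `split_records` and the sample string are assumptions made for this demo, with "\r" standing in for the ^M characters; it is not code from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a blob of text into records on any run of
# CR and/or LF, then drop comment lines and whitespace-only lines.
sub split_records {
    my ($alldata) = @_;
    return grep { !/^#/ && /\S/ } split /[\r\n]+/, $alldata;
}

# Sample data modelled on the file in the question; "\r" stands in for ^M.
my $sample = "# This is a comment followed by a blank line\n"
           . "\n"
           . "# this comment is followed by a run-on line with ^M's\n"
           . "humpty dumpty sat on a rock\rhumpty dumpty fell\r"
           . "Poor humpty dumpty is busted all to hell\n";

my @records = split_records($sample);
print scalar(@records), " records\n";
print "  $_\n" for @records;
```

Because the split pattern treats CR, LF and CRLF alike, the same helper should cope with files saved on any of the platforms mentioned in the question.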
Re^2: Parsing a text file by Skeeve (Parson) on Jan 14, 2009 at 05:05 UTC
graff, be careful about \r and \n. I changed my habit of writing \n when I'm not sure where my script will be run. The reason is this section from perldoc perlipc, "Internet Line Terminators":

    The Internet line terminator is "\015\012". Under ASCII variants of Unix, that could usually be written as "\r\n", but under other systems, "\r\n" might at times be "\015\015\012", "\012\012\015", or something completely different. The standards specify writing "\015\012" to be conformant (be strict in what you provide), but they also recommend accepting a lone "\012" on input (but be lenient in what you require). We haven't always been very good about that in the code in this manpage, but unless you're on a Mac, you'll probably be ok.

In this case, where the file is edited by several people on different platforms, it might be a good idea to use a combination of \012 and \015.
Re^3: Parsing a text file by ikegami (Patriarch) on Jan 14, 2009 at 05:20 UTC
That's part wrong, part outdated. In the situation where `"\r\n"` ends up being `"\015\015\012"` (when using `:crlf`), so will `"\015\012"`. `"\012\012\015"` should read `"\012\015"`, and that only occurs on MacPerl (Perl for Macs earlier than OS X). On all current operating systems, `"\015"` is interchangeable with `"\r"` and `"\012"` is interchangeable with `"\n"`. (Well, not on EBCDIC systems. But it's not clear to me how you'd want the program to behave there in this case.)
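A quick way to check the interchangeability claim on a given machine is to print the character codes. This is only a sanity-check sketch; the comparisons are expected to hold on ASCII-based platforms and not on EBCDIC ones.

```perl
use strict;
use warnings;

# Show which character codes this perl uses for \r and \n.
printf "\\r is chr(%d), \\n is chr(%d)\n", ord("\r"), ord("\n");

# On ASCII-based systems both of these lines are printed.
print "CR: \\r and \\015 are the same character\n" if "\r" eq "\015";
print "LF: \\n and \\012 are the same character\n" if "\n" eq "\012";
```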
Re^3: Parsing a text file by Gangabass (Vicar) on Jan 14, 2009 at 05:40 UTC
But if you look a bit more closely at graff's code, you'll see that he already uses a combination: `/[\r\n]+/`
Re^4: Parsing a text file by gone2015 (Deacon) on Jan 14, 2009 at 10:41 UTC
Re^5: Parsing a text file by graff (Chancellor) on Jan 14, 2009 at 13:20 UTC
Some notes below your chosen depth have not been shown here
Re^2: Parsing a text file by calmthestorm (Acolyte) on Jan 14, 2009 at 02:16 UTC
Graff.... damn, you're good. Thank you, after reading your post, a lightbulb went on in my head. My script is now working and launching video and radio channels like a bat out of hell. I would post the entire product but it is proprietary. Ya know how that goes.. ;-)
Re: Parsing a text file by gone2015 (Deacon) on Jan 14, 2009 at 00:10 UTC
On my machine it told me it found data on 1 line, which is what I would expect...
* Neither `elsif (/^M/)` nor `split /^M/` is doing what you expect. If you want Ctrl-M you want `\cM`, not `^M` -- though I'd use `\x0D`, but that's a matter of taste.
* Especially because something odd is apparently happening, I'd recommend an `or die "failed to open $csv: $!"` after the `open`.
* You don't really need the `for` loop; you could simply `push @chunks, split(/\cM/, $_)`. Note that `split` throws away trailing separators, so if a line ends "\cM\cM\cM" you won't end up with three blank lines -- which may or may not be what you want.
* You pass the filename `$csv` to the `read_csv` subroutine, but don't use it, which doesn't look right.
But I cannot explain why you seem to get 0 lines... I don't suppose it's possible that you have set `$/` to undef?
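Taken together, the suggestions above might look something like the following sketch. The sub name `read_chunks`, the filehandle argument and the in-memory demo data are assumptions made so the example runs on its own; a real script would open the CSV path instead. The comment and blank-line filtering is applied to each piece after the split, which is an addition beyond the plain `push` suggested above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the original loop with the suggestions applied.
# Takes an open filehandle so it can be tried on any input source.
sub read_chunks {
    my ($fh) = @_;
    my @chunks;
    while ( my $line = <$fh> ) {
        chomp $line;
        # A line may hold several records joined by carriage returns
        # (\x0D); push accepts the whole filtered list in one call.
        push @chunks, grep { !/^#/ && /\S/ } split /\x0D/, $line;
    }
    return \@chunks;
}

# Demo on an in-memory file (hypothetical data, not the real CSV).
my $data = "# comment\n\none,1\x0Dtwo,2\x0Dthree,3\n";
open( my $fh, '<', \$data ) or die "failed to open in-memory data: $!";
my $chunks = read_chunks($fh);
close $fh;

# Report once, after the loop has finished.
print "Found data for ", scalar(@$chunks), " lines\n";
```

Since `split` drops trailing empty fields, a stray carriage return left at the end of a line after `chomp` does not produce an extra record.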
Re^2: Parsing a text file by calmthestorm (Acolyte) on Jan 14, 2009 at 00:53 UTC
Thank you oshalla for your time... Here are my responses to your bullet points.
1. I have made this change even though I had this part working, merely because you are right and ^M's don't always behave the way one would expect.
2. Also added this change even though I know it was not failing to open the csv file, as it was printing the contents. Suffice it to say that it is a good coding practice that I usually use. I just whipped up an example script for perlmonks.org to show my problem.
3. Must have the for loop as the ^M's are delineating lines of data. Throwing away anything after a ^M is throwing away data I need.
4. You are right, in this particular example I am not using the passed variable name. It's just a habit.
Re^3: Parsing a text file by gone2015 (Deacon) on Jan 14, 2009 at 08:56 UTC
I realised as I woke up that it could be that the `^M` in your code could be actual `^M` characters and not the `^M` that I had mistaken them for. Now that that's clear, I'd still use an explicit escape sequence e.g. `\x0D` (or the divinely retro `\015`).

I fear I didn't get the point across re the `for` loop... `push` takes a `LIST` of things to push onto the `ARRAY`. So in `push @chunks, split(/\cM/, $_)` the entire list returned by `split` is pushed onto `@chunks` all in one go.

The caveat about trailing separators and `split` can be seen in:

    print map("'$_' ", split(/:/, 'a:b:c:::')    ), "\n" ;   # 'a' 'b' 'c'
    print map("'$_' ", split(/:/, 'a:b:c:::', -1)), "\n" ;   # 'a' 'b' 'c' '' '' ''

See split. I fear I confused the issue by addressing two points in one paragraph, for which I will now do penance and hope for forgiveness from the gods of clear English.

I note that the problem is now fixed. Since the code as posted worked on my machine, I'm particularly curious as to what the problem was.
Re: Parsing a text file by toolic (Bishop) on Jan 14, 2009 at 01:25 UTC
Not that it will solve your problem, but I think you want your print outside of your "while" loop. Have you tried pre-processing your csv file using the dos2unix utility to clean up the nasty ^M's?
Re^2: Parsing a text file by calmthestorm (Acolyte) on Jan 14, 2009 at 01:52 UTC
Nope, don't want to print anything actually. I put a $verbose variable in there to turn on messages but for the most part, we want this to run silent (until it breaks). As for preprocessing to get rid of the ^M's, that is a thought, but I hate to call external (possibly missing) programs to do something that Perl can/should handle natively.
Re^3: Parsing a text file by juster (Friar) on Jan 14, 2009 at 05:43 UTC
You can preprocess it easily with perl: `shell$ perl -pe 'tr/\r/\n/' infile > outfile`
Re: Parsing a text file by Plankton (Vicar) on Jan 13, 2009 at 23:47 UTC
I bet there are some good Perl modules on CPAN. http://search.cpan.org/search?query=CSV&mode=all DBD::CSV looks interesting.
Re^2: Parsing a text file by calmthestorm (Acolyte) on Jan 13, 2009 at 23:58 UTC
That would be great if indeed I had control over the csv file. But I don't. End L-users are writing/updating the file and sometimes I get empty lines or lines joined by ^M. In fact, for all intents and purposes, you can ignore the fact that I am trying to parse comma separated values, because the true problem lies in defining the individual lines themselves. The csv stuff I already have is golden.