Re: Help with parsing through a comma delimited file
by danger (Priest) on Mar 14, 2001 at 08:03 UTC
Well, that doesn't look comma delimited (or comma separated) to
me, but here you go: (see also: Schwartzian Transform):
#!/usr/bin/perl -w
use strict;
my @lines = <DATA>;
my @sorted = map {$_->[0]}
sort {$a->[1]<=>$b->[1]}
map {[$_,/^(\d+)/]} @lines[2..$#lines];
print @sorted;
__DATA__
COUNT Type Error Message
------------------------------
3 pro bad message #1
99 dis bad message #2
209 pro bad message #3
44 dis bad message #4
19 dis Bad message #5
Thanks for your help. I just extracted a few lines from the file, but in actuality it has no header and, trust me, it is comma delimited. If you get a minute, could you explain what your lines do? I am sure they work, but as I am new to Perl an explanation would be very helpful. Txs!!!
#!/usr/bin/perl -w
use strict;
my @lines = <DATA>; # Read this file starting with the line after __DATA__
# Schwartzian Transform:
my @sorted = map {$_->[0]}
sort {$a->[1]<=>$b->[1]}
map {[$_,/^(\d+)/]} @lines[2..$#lines];
# That is where all the work is done.
# Read it from right to left, bottom up:
# So first he uses an array slice to get the data
# minus the first two header rows. This is passed to
# map(), which reads the first number in each row
# and maps it to the row itself; that number is the COUNT.
# This is put into an array ref
# map() returns an array of those array refs, which is passed to sort()
# sort sorts the array refs based on the second element, COUNT
# and returns an array, which is passed to another map()
# which extracts the first element out of the array ref
# and creates the array you had in the first place, only now
# it has been sorted.
print @sorted;
# The next line actually ends the script, and acts as a marker for <DATA>
__DATA__
COUNT Type Error Message
------------------------------
3 pro bad message #1
99 dis bad message #2
209 pro bad message #3
44 dis bad message #4
19 dis Bad message #5
I suspect that uses plenty of Perl-isms, more than most people want to learn in one sitting, but you did ask how it works, and I would hate for an earnest question like that to go unanswered. I would be happy to answer any other questions you have about that code, but first you must read the docs for map, sort, perlman:perllol and, of course, Efficient sorting using the Schwartzian Transform.
Re: Help with parsing through a comma delimited file
by jlawrenc (Scribe) on Mar 14, 2001 at 08:43 UTC
Here is another approach that might be easier to understand. (The map function might be kinda heavy for newbies).
My approach will be to open the file, ignore the junk. Then
for each line in the file I am going to read it, bust it
apart by your criteria. In this case you've said that it
is comma separated values. Now I am going to assume that
you don't have field values with embedded commas to keep
the code straightforward.
Once the data is busted I'm gonna save it to memory in
a array of hashes structure. Each array entry will account
for a line of data.
Then you want your data sorted by a field value. I am going
to assume that you're sorting by count. To accomplish this
we can't simply say "sort @data". We have to actually use
the $data[$x]{'Count'} value for sort purposes. This is
done by creating an inline sort function.
This sort/print section will expose how perl stores
arrays of hashes. Each array entry is actually a reference
to an anonymous hash. So the sort function is actually
sorting hash references. That means when perl wants to
compare two entries in the array to decide who is bigger
it is comparing two hash references. ($a and $b). The
inline sort function dereferences the "Count" entry of
the two hash references to actually regurgitate a number
which is a much more meaningful value for sorting.
Finally the value in the foreach is going to be a sorted
list of hash references that are from the @data array.
To actually use that data we just dereference the value as
I have shown below - voila data.
open(FIN, "datafile.dat") or die "Cannot open datafile: $!\n"; # "or", not "||": with || the die can never fire
my $line=<FIN>; $line=<FIN>; # Toss the first two lines
my @data;
my $index=0;
while ($line=<FIN>) {
chomp $line; # a bare chop would act on $_, not $line
my @field=split /,/, $line; # Assuming that data is not
# quoted or escaped - that's more work
$data[$index]{'Count'}=$field[0];
$data[$index]{'Type'}=$field[1];
$data[$index]{'Message'}=$field[2];
$index++;
}
foreach my $data (sort {$a->{'Count'} <=> $b->{'Count'}} @data) {
print "$data->{'Count'} is of type $data->{'Type'} - $data->{'Message'}\n";
}
Finally, if you actually meant to say that you have a
data file of positional data - just change the @field=split
line to parse your data accordingly.
I hope this is helpful and maybe even a bit educational.
Jay
Re: Help with parsing through a comma delimited file
by Desdinova (Friar) on Mar 14, 2001 at 11:26 UTC
I did something like this once using just arrays. I don't know if this is better or worse than hashes. I used multidimensional arrays, so you end up referring to the data by $data[$row][$column]. I also made it a sub that returns the array, which makes it easier to use in other scripts.
#!/usr/local/bin/perl -w
use strict;
my @sorted=sort_file('data.txt');
my $cnt=0;
my $size=@sorted;
while ($cnt < $size) {
print "$sorted[$cnt][0] is of type $sorted[$cnt][1] - $sorted[$cnt][2]\n";
$cnt++;
}
exit();
sub sort_file{
my $filename = shift;
open(DATA, $filename) or die "Unable to open file $filename: $!\n";
my @record;
my $row =0;
while (<DATA>) {
chomp;
#create Array one element per field
my @line=split /,/;
# move values into multi-dimensional array
$record[$row][0]=$line[0];
$record[$row][1]=$line[1];
$record[$row][2]=$line[2];
$row++;
}
return (sort { $a->[0] <=> $b->[0] } @record);
}
PS--Any comments are welcome as I now am getting to know enough to *really* shoot myself in the foot :>
In addition to my comments to jlawrenc I would like to
point out that in Perl use of explicit indexing is usually
unnecessary, and by avoiding it you can seriously reduce
the potential for error. Instead in this case you can use
push. The section just to read the file would then reduce
down to:
while (<DATA>) {
chomp;
push @record, [split(/,/, $_, -1)];
}
which is considerably shorter, faster, and reduces the
chance of error.
Also I think that reading the file and sorting it are
two different things. You are likely to want to read the
file into data for lots of reasons. You are likely to
later discover the need to sort the file in lots of ways.
Why not have two functions?
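A minimal sketch of that split, with invented file and function names: read_file() only parses, sort_rows() only sorts, so the same data can be re-sorted different ways without re-reading the file.

```perl
#!/usr/bin/perl -w
use strict;

# Reading is one job: parse every line into an array ref of fields.
sub read_file {
    my $file = shift;
    open(my $fh, $file) or die "Cannot read '$file': $!\n";
    my @rows;
    while (<$fh>) {
        chomp;
        push @rows, [ split /,/, $_, -1 ];
    }
    return @rows;
}

# Sorting is another: numeric sort on whichever column you name.
sub sort_rows {
    my $col = shift;
    return sort { $a->[$col] <=> $b->[$col] } @_;
}

# Demo with an invented sample file:
open(my $out, '>', 'sample.csv') or die "Cannot write: $!\n";
print $out "99,dis,bad message #2\n3,pro,bad message #1\n209,pro,bad message #3\n";
close($out);

my @sorted = sort_rows(0, read_file('sample.csv'));
print join(',', @$_), "\n" for @sorted;   # rows in COUNT order: 3, 99, 209
```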
Of course whenever I see a CSV format with the field names
not in the first row, I tend to get upset. And I really
prefer hashes. Therefore the above snippet of code would
set off a bunch of danger signs for me. Certainly any
data format that I have any say in will include the columns
in the definition of the format, and code that handles it
will be expected to handle columns moving around.
In this simple case a function to read the format could
look like this:
use strict;
use Carp;
# Time passes...
sub read_csv {
my $file = shift;
local *CSV;
open (CSV, $file) or confess("Cannot read '$file': $!");
my $header = <CSV>;
chomp($header);
my @fields = split /,/, $header;
# You could do an error check for repeated field names...
my @data;
while (<CSV>) {
chomp;
my $row;
@$row{@fields} = split(/,/, $_, -1);
push @data, $row;
}
return @data;
}
I keep meaning to clean up and then post a more robust
version of this that handles quoting, fields with embedded
commas and returns, can be used either for slurping (like
this) or for a stream-oriented file...
Your suggestion is quite feasible. In fact it is not much different from mine.
By using arrays you get a more frugal memory usage and faster performance.
You do trade away some readability, though, when it comes time to use the data. Personally,
for maintainability I'll use a hash any day when I can give data
a meaningful label rather than a cryptic number. I might
even change the section around @line=split /,/ a bit to
something along the lines of this:
[ defined earlier ]
@fieldname = qw(Count Type Message);
[ in the loop ]
my @line=split /,/;
foreach (my $i=0; $i<=$#line; $i++) {
$record[$row]{$fieldname[$i]}=$line[$i];
}
Again, from the perspective of maintainability I'd make two
constructive comments for your suggestion. The first is
the use of $_. I think this variable is confusing for
newbies. Really when you think about it, a default variable
is rather unique to Perl. As well, I feel it can be an
accident waiting to happen. Sure, use them, especially if
performance and crisp clean code is your objective. But,
if you're not in a crunch I always do the following:
while ( $line=<DATA> ) {
}
Then it is really obvious that you have read a _line_ of
DATA from your ol' data file.
While we're talking about lines, here is my second constructive
comment. You have chosen the variable @line to represent the
splitting a line of data. A subtle notion, but I feel it
is much clearer as to the meaning of data if you think about
how it is to be accessed. After the split you are not
accessing lines of data, you are accessing fields from a
record or line. Calling it "@field" means that you're going
to be referring to "$field[0]", "$field[1]", etc. which comes
from a $line of <DATA>.
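In code, the rename amounts to this (the sample record is invented here for illustration):

```perl
#!/usr/bin/perl -w
use strict;

# One $line of input yields @field, so $field[0], $field[1], ...
# read naturally as "fields of the record".
my $line = "3,pro,bad message #1";   # invented sample record
my @field = split /,/, $line;
print "Count=$field[0] Type=$field[1] Message=$field[2]\n";
```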
Not to say that there is anything _wrong_ with your code,
but these would aid in long term maintainability. As I have seen, code does
have a nasty habit of lasting longer than expected. Worst of all
the quick 'n dirties often last the longest.
J
Update: fixed comparison on for loop - should be <=$#line, not <$#line. My mistake.
Your update underscores a point I like to make. Whenever
you can, avoid C style loops in favour of Perl's foreach
construction. Thus you would write:
foreach my $i (0..$#line) {
$record[$row]{$fieldname[$i]}=$line[$i];
}
The reason is that you will make fewer fencepost errors
that way. (One of the most common bugs in C, but generally
automatically avoidable in Perl.)
Beyond that I would recommend using a -1 for the third
argument to split, and given that you have tabular data
I would lose the loop entirely for a slice:
my %rec;
@rec{@fields} = split(/,/, $_, -1);
$record[$row] = \%rec;
(I would actually clean it up even more, more on that in
another post.)
Re: Help with parsing through a comma delimited file
by Anonymous Monk on Mar 14, 2001 at 09:56 UTC
open( IN, "myinput.csv" ) or die "$!\n";
my %data;
# %data is a hash table. The key is COUNT . Type . Error Message,
# and the value is a ref to [Type, Error Message]. Note that there
# isn't a very good hash key in this data, as COUNT alone probably
# isn't guaranteed to be unique ...
while (<IN>) {
#split on the comma
chomp;
my ( $count, $type, $errMsg ) = split( /,/ );
print "collision!?!\n" if exists $data{$count . $type . $errMsg};
$data{$count . $type . $errMsg} = [$type, $errMsg]; # ref to list
}
my @sortedKeys = sort keys %data;
print "COUNT\tType\tError Message\n";
foreach my $key ( @sortedKeys ) {
print "$key\t$data{$key}[0]\t$data{$key}[1]\n";
}
I think I did it without use of libraries ... :)
Sozin
Re: Help with parsing through a comma delimited file
by Anonymous Monk on Mar 14, 2001 at 18:44 UTC
Thanks all for your help. I will try these out today. You are lifesavers!!!
Re: Help with parsing through a comma delimited file
by greenFox (Vicar) on Mar 14, 2001 at 15:46 UTC
Since no-one else has mentioned it, and since your data might really be CSV and not whitespace delimited as your sample shows, you may want to look at
Text::ParseWords (this is a standard module). Use it like:
@line = quotewords(',',0,$_);
Slurp and sort left to your imagination... of course the best data structure will depend on what else you plan on doing with the data.
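A sketch of that slurp-and-sort, assuming the COUNT-first layout from earlier in the thread; the file name and contents are invented for the demo. quotewords() is what lets quoted fields carry embedded commas.

```perl
#!/usr/bin/perl -w
use strict;
use Text::ParseWords;

# Invented sample file, including one quoted field with a comma:
open(my $out, '>', 'myinput.csv') or die "Cannot write: $!\n";
print $out qq{99,dis,"bad, quoted message #2"\n3,pro,bad message #1\n};
close($out);

open(my $in, 'myinput.csv') or die "Cannot open: $!\n";
my @rows;
while (<$in>) {
    chomp;
    # delimiter ',', keep=0 (strip the quotes from quoted fields)
    push @rows, [ quotewords(',', 0, $_) ];
}
close($in);

# Numeric sort on COUNT, the first field:
for my $row (sort { $a->[0] <=> $b->[0] } @rows) {
    print join("\t", @$row), "\n";
}
```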