Re: Help with parsing through a comma delimited file
by danger (Priest) on Mar 14, 2001 at 08:03 UTC
Well, that doesn't look comma delimited (or comma separated) to
me, but here you go: (see also: Schwartzian Transform):
#!/usr/bin/perl -w
use strict;
my @lines = <DATA>;
my @sorted = map {$_->[0]}
sort {$a->[1]<=>$b->[1]}
map {[$_,/^(\d+)/]} @lines[2..$#lines];
print @sorted;
__DATA__
COUNT Type Error Message
------------------------------
3 pro bad message #1
99 dis bad message #2
209 pro bad message #3
44 dis bad message #4
19 dis Bad message #5
Thanks for your help. I just extracted a few lines from the file, but in actuality it has no header and, trust me, it is comma delimited. If you get a minute, could you explain what your lines do? I am sure they work, but as I am new to Perl an explanation would be very helpful. Txs!!!
#!/usr/bin/perl -w
use strict;
my @lines = <DATA>; # Read this file starting with the line after __DATA__
# Schwartzian Transform:
my @sorted = map {$_->[0]}
sort {$a->[1]<=>$b->[1]}
map {[$_,/^(\d+)/]} @lines[2..$#lines];
# That is where all the work is done.
# Read it from right to left, bottom up:
# So first he uses an array slice to get the data
# minus the first two header rows. This is passed to
# map(), which reads the first number in each row
# and maps it to the row itself; that number is the COUNT.
# This is put into an array ref
# map() returns an array of those array refs, which is passed to sort()
# sort sorts the array refs based on the second element, COUNT
# and returns an array, which is passed to another map()
# which extracts the first element out of the array ref
# and creates the array you had in the first place, only now
# it has been sorted.
print @sorted;
# The next line actually ends the script, and acts as a marker for <DATA>
__DATA__
COUNT Type Error Message
------------------------------
3 pro bad message #1
99 dis bad message #2
209 pro bad message #3
44 dis bad message #4
19 dis Bad message #5
I suspect that uses plenty of Perl-isms, more than most people want to learn in one sitting, but you did ask how it works, and I would hate for an earnest question like that to go unanswered. I would be happy to answer any other questions you have about that code, but first you must read the docs for map, sort, perlman:perllol and, of course, Efficient sorting using the Schwartzian Transform.
Re: Help with parsing through a comma delimited file
by jlawrenc (Scribe) on Mar 14, 2001 at 08:43 UTC
Here is another approach that might be easier to understand. (The map function might be kinda heavy for newbies).
My approach will be to open the file, ignore the junk. Then
for each line in the file I am going to read it, bust it
apart by your criteria. In this case you've said that it
is comma separated values. Now I am going to assume that
you don't have field values with embedded commas to keep
the code straightforward.
Once the data is busted I'm gonna save it to memory in
a array of hashes structure. Each array entry will account
for a line of data.
Then you want your data sorted by a field value. I am going
to assume that you're sorting by count. To accomplish this
we can't simply say "sort @data". We have to actually use
the $data[$x]{'Count'} value for sort purposes. This is
done by creating an inline sort function.
This sort/print section will expose how perl stores
arrays of hashes. Each array entry is actually a reference
to an anonymous hash. So the sort function is actually
sorting hash references. That means when perl wants to
compare two entries in the array to decide who is bigger
it is comparing two hash references. ($a and $b). The
inline sort function dereferences the "Count" entry of
the two hash references to actually regurgitate a number
which is a much more meaningful value for sorting.
Finally the value in the foreach is going to be a sorted
list of hash references that are from the @data array.
To actually use that data we just dereference the value as
I have shown below - voila data.
open(FIN, "datafile.dat") or die "Cannot open datafile: $!\n"; # "or", not "||": with || the die can never fire
my $line=<FIN>; $line=<FIN>; # Toss the first two lines
my @data;
my $index=0;
while ($line=<FIN>) {
chomp $line; # a bare chop would act on $_, not $line
my @field=split /,/, $line; # Assuming that data is not
# quoted or escaped - that's more work
$data[$index]{'Count'}=$field[0];
$data[$index]{'Type'}=$field[1];
$data[$index]{'Message'}=$field[2];
$index++;
}
foreach my $data (sort {$a->{'Count'} <=> $b->{'Count'}} @data) {
print "$data->{'Count'} is of type $data->{'Type'} - $data->{'Message'}\n";
}
Finally, if you actually meant to say that you have a
data file of positional data - just change the @field=split
line to parse your data accordingly.
I hope this is helpful and maybe even a bit educational.
Jay
Re: Help with parsing through a comma delimited file
by Desdinova (Friar) on Mar 14, 2001 at 11:26 UTC
I did something like this once using just arrays. I don't know if this is better or worse than hashes. I used multidimensional arrays, so you end up referring to the data by $data[$row][$column]. I also made it a sub that returns the array, which makes it easier to use in other scripts.
#!/usr/local/bin/perl -w
use strict;
my @sorted=sort_file('data.txt');
my $cnt=0;
my $size=@sorted;
while ($cnt < $size) {
print "$sorted[$cnt][0] is of type $sorted[$cnt][1] - $sorted[$cnt][2]\n";
$cnt++;
}
exit();
sub sort_file{
my $filename = shift;
open(DATA, $filename) or die "Unable to open file $filename: $!\n";
my @record;
my $row =0;
while (<DATA>) {
chomp;
#create Array one element per field
my @line=split /,/;
# move values into multi-dimensional array
$record[$row][0]=$line[0];
$record[$row][1]=$line[1];
$record[$row][2]=$line[2];
$row++;
}
return (sort { $a->[0] <=> $b->[0] } @record);
}
PS--Any comments are welcome as I now am getting to know enough to *really* shoot myself in the foot :>
In addition to my comments to jlawrenc I would like to
point out that in Perl use of explicit indexing is usually
unnecessary, and by avoiding it you can seriously reduce
the potential for error. Instead in this case you can use
push. The section just to read the file would then reduce
down to:
while (<DATA>) {
chomp;
push @record, [split(/,/, $_, -1)];
}
which is considerably shorter, faster, and reduces the
chance of error.
Also I think that reading the file and sorting it are
two different things. You are likely to want to read the
file into data for lots of reasons. You are likely to
later discover the need to sort the file in lots of ways.
Why not have two functions?
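A minimal sketch of that split, with invented file and function names: read_file() only parses, sort_rows() only sorts, so the same data can be re-sorted different ways without re-reading the file.

```perl
#!/usr/bin/perl -w
use strict;

# Reading is one job: parse every line into an array ref of fields.
sub read_file {
    my $file = shift;
    open(my $fh, $file) or die "Cannot read '$file': $!\n";
    my @rows;
    while (<$fh>) {
        chomp;
        push @rows, [ split /,/, $_, -1 ];
    }
    return @rows;
}

# Sorting is another: numeric sort on whichever column you name.
sub sort_rows {
    my $col = shift;
    return sort { $a->[$col] <=> $b->[$col] } @_;
}

# Demo with an invented sample file:
open(my $out, '>', 'sample.csv') or die "Cannot write: $!\n";
print $out "99,dis,bad message #2\n3,pro,bad message #1\n209,pro,bad message #3\n";
close($out);

my @sorted = sort_rows(0, read_file('sample.csv'));
print join(',', @$_), "\n" for @sorted;   # rows in COUNT order: 3, 99, 209
```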
Of course whenever I see a CSV format with the field names
not in the first row, I tend to get upset. And I really
prefer hashes. Therefore the above snippet of code would
set off a bunch of danger signs for me. Certainly any
data format that I have any say in will include the columns
in the definition of the format, and code that handles it
will be expected to handle columns moving around.
In this simple case a function to read the format could
look like this:
use strict;
use Carp;
# Time passes...
sub read_csv {
my $file = shift;
local *CSV;
open (CSV, $file) or confess("Cannot read '$file': $!");
my $header = <CSV>;
chomp($header);
my @fields = split /,/, $header;
# You could do an error check for repeated field names...
my @data;
while (<CSV>) {
chomp;
my $row;
@$row{@fields} = split(/,/, $_, -1);
push @data, $row;
}
return @data;
}
I keep meaning to clean up and then post a more robust
version of this that handles quoting, fields with embedded
commas and returns, can be used either for slurping (like
this) or for a stream-oriented file...
Your suggestion is quite feasible. In fact it is not much different from mine.
By using arrays you get a more frugal memory usage and faster performance.
You do trade away some readability, though, when it comes time to use the data. Personally,
for maintainability I'll use a hash any day when I can give data
a meaningful label rather than a cryptic number. I might
even change the section around @line=split /,/ a bit to
something along the lines of this:
[ defined earlier ]
@fieldname = qw(Count Type Message);
[ in the loop ]
my @line=split /,/;
foreach (my $i=0; $i<=$#line; $i++) {
$record[$row]{$fieldname[$i]}=$line[$i];
}
Again, from the perspective of maintainability I'd make two
constructive comments for your suggestion. The first is
the use of $_. I think this variable is confusing for
newbies. Really when you think about it, a default variable
is rather unique to Perl. As well, I feel it can be an
accident waiting to happen. Sure, use them, especially if
performance and crisp clean code is your objective. But,
if you're not in a crunch I always do the following:
while ( $line=<DATA> ) {
}
Then it is really obvious that you have read a _line_ of
DATA from your ol' data file.
While we're talking about lines, here is my second constructive
comment. You have chosen the variable @line to represent the
splitting a line of data. A subtle notion, but I feel it
is much clearer as to the meaning of data if you think about
how it is to be accessed. After the split you are not
accessing lines of data, you are accessing fields from a
record or line. Calling it "@field" means that you're going
to be referring to "$field[0]", "$field[1]", etc. which comes
from a $line of <DATA>.
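In code, the rename amounts to this (the sample record is invented here for illustration):

```perl
#!/usr/bin/perl -w
use strict;

# One $line of input yields @field, so $field[0], $field[1], ...
# read naturally as "fields of the record".
my $line = "3,pro,bad message #1";   # invented sample record
my @field = split /,/, $line;
print "Count=$field[0] Type=$field[1] Message=$field[2]\n";
```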
Not to say that there is anything _wrong_ with your code,
but these would aid in long term maintainability. As I have seen, code does
have a nasty habit of lasting longer than expected. Worst of all
the quick 'n dirties often last the longest.
J
Update: fixed comparison on for loop - should be <=$#line, not <$#line. My mistake.
Your update underscores a point I like to make. Whenever
you can, avoid C style loops in favour of Perl's foreach
construction. Thus you would write:
foreach my $i (0..$#line) {
$record[$row]{$fieldname[$i]}=$line[$i];
}
The reason is that you will make fewer fencepost errors
that way. (One of the most common bugs in C, but generally
automatically avoidable in Perl.)
Beyond that I would recommend using a -1 for the third
argument to split, and given that you have tabular data
I would lose the loop entirely for a slice:
my %rec;
@rec{@fields} = split(/,/, $_, -1);
$record[$row] = \%rec;
(I would actually clean it up even more, more on that in
another post.)
Re: Help with parsing through a comma delimited file
by Anonymous Monk on Mar 14, 2001 at 09:56 UTC
open( IN, "myinput.csv" ) or die "$!\n";
my %data;
# %data is a hash table. The key is COUNT . Type . Error Message,
# and the value is a ref to [Type, Error Message]. Note that there
# isn't a very good hash key in this data, as COUNT alone probably
# isn't guaranteed to be unique ...
while (<IN>) {
#split on the comma
chomp;
my ( $count, $type, $errMsg ) = split( /,/ );
print "collision!?!\n" if exists $data{$count . $type . $errMsg};
$data{$count . $type . $errMsg} = [$type, $errMsg]; # ref to list
}
my @sortedKeys = sort keys %data;
print "COUNT\tType\tError Message\n";
foreach my $key ( @sortedKeys ) {
print "$key\t$data{$key}[0]\t$data{$key}[1]\n";
}
I think I did it without use of libraries ... :)
Sozin
Re: Help with parsing through a comma delimited file
by Anonymous Monk on Mar 14, 2001 at 18:44 UTC
Thanks all for your help. I will try these out today. You are lifesavers!!!
Re: Help with parsing through a comma delimited file
by greenFox (Vicar) on Mar 14, 2001 at 15:46 UTC
Since no-one else has mentioned it, and since your data might really be CSV and not whitespace delimited as your sample shows, you may want to look at
Text::ParseWords (this is a standard module). Use it like:
@line = quotewords(',',0,$_);
Slurp and sort left to your imagination... of course the best data structure will depend on what else you plan on doing with the data.
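A sketch of that slurp-and-sort, assuming the COUNT-first layout from earlier in the thread; the file name and contents are invented for the demo. quotewords() is what lets quoted fields carry embedded commas.

```perl
#!/usr/bin/perl -w
use strict;
use Text::ParseWords;

# Invented sample file, including one quoted field with a comma:
open(my $out, '>', 'myinput.csv') or die "Cannot write: $!\n";
print $out qq{99,dis,"bad, quoted message #2"\n3,pro,bad message #1\n};
close($out);

open(my $in, 'myinput.csv') or die "Cannot open: $!\n";
my @rows;
while (<$in>) {
    chomp;
    # delimiter ',', keep=0 (strip the quotes from quoted fields)
    push @rows, [ quotewords(',', 0, $_) ];
}
close($in);

# Numeric sort on COUNT, the first field:
for my $row (sort { $a->[0] <=> $b->[0] } @rows) {
    print join("\t", @$row), "\n";
}
```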