BBQ has asked for the wisdom of the Perl Monks concerning the following question:
A while ago I submitted to the snippets section a, well, snippet, called Open Flat File. The idea is to provide an x/y mechanism for navigating through the data of pseudo-CSV files. Last night, one of my co-workers had a few questions about hashes, flat files, chomp, and miscellaneous debris, so I decided to pull the snippet back up. While I was trying to explain what it did, and how one should always use strict and -w, it struck me: dog, I'm so lame!
So here is my second revision of Open Flat File, with a few minor changes allowing it to run under -w and use strict. Although it works, I feel very uncomfortable looking at it the way it is, and would very much appreciate any ideas that other monks may have on it. Here we go:
sub OpenFF {
    my %hash;
    my ($file,$dlmt) = @_;
    $dlmt = "\t" if ! $_[1];
    open(READ,$file) or return(0);
    my @file = <READ>;
    chomp(@file);
    close(READ);
    my @headers = split($dlmt,shift(@file));
    foreach my $line (@file) {
        my $x = 0;
        my @columns = split($dlmt,$line);
        while ($x < scalar @headers) {
            $hash{$headers[$x++]}{$columns[0]} = $columns[$x];
        }
    }
    return(%hash);
}
That said, please consider this sample file, foo.txt:
-------
ID NAME1 NAME2 AGE
1 donald duck 50
2 mickey mouse 48
3 peter pan 62
4 madre theresa 108
5 banana split 2
And this sample script:
#!/usr/bin/perl -w
use strict;
my %foo = OpenFF('foo.txt')
    or die("Couldn't parse foo.txt, $!\n");
print "The keys are:\n";
print join("\t",keys %foo)."\n";
print "Row 3 is:\n";
foreach (qw'ID NAME1 NAME2 AGE') {
    print "$foo{$_}{'3'}\t";
}
print "Column 'NAME2' is:\n";
foreach (1..5) {
    print "$foo{'NAME2'}{$_}\n";
}
...and the output would be:
The keys are:
NAME1 NAME2 ID AGE
Row 3 is:
3 peter pan 62
Column 'NAME2' is:
duck
mouse
pan
theresa
split
It is my opinion that this could be vastly explored and improved, but I have reached a point where I can't go any further without suggestions from my fellow monks.
- What could make this better?
- An OO approach?
- Fewer loops?
- /dev/null?
I thank you all in advance.
#!/home/bbq/bin/perl
# Trust no1!
A few style suggestions
by tilly (Archbishop) on Aug 04, 2000 at 19:44 UTC
- Localize *READ before opening a file on it.
- Do your own error processing like it says to in perlstyle. (eg ...or die "Cannot read $file: $!")
- Use the third -1 argument to split as documented in "perldoc -f split".
- I have been bitten moving between Linux and Windows by DOS endings. Instead of chomp I like to s/\r?\n$//. YMMV.
- Should two rows have the same first entry you are silently overwriting data. This can be a Bad Thing. Either put in an error check or use an array instead of a hash for the rows.
- If you are incrementing a variable, use a for() loop instead of while(). It may work the same, but the person maintaining it finds the looping construct more obvious.
- Switch the order in the hash. You said elsewhere you think about it one way. My experience says that that decision will come back to bite you. The only way that you will find yourself wanting to access that data structure is row by row. If that is not a good enough reason for you, then let me tell you that if you reverse it then you can later choose to extract out references to each row and access them directly. The double hash lookup is slower by a factor of 2 and forces you to write more code everywhere.
- Document how this function works. A short comment helps immensely.
- The idea behind this code will never support the full CSV spec or anything close to it. Document that limitation.
OK, let me put my wallet where my mouth is and show you a hasty rewrite that takes all of that into account:
=pod

=item open_flatfile

Takes the name of a pseudo-CSV flatfile and an optional delimiter as args.
(The delimiter can be a regular expression created with qr//.)  It opens
the file, uses the first line as a header, and returns the data as an
array of hashes.  This will not handle CSV files with escaped fields.

=cut
sub open_flatfile {
    my $file = shift;
    my $delim = shift || "\t";
    local *FH;
    open (FH, "<$file") or die "Cannot read $file: $!";
    my @contents = <FH>;
    close (FH);
    s/\r?\n$// foreach @contents;
    my @header = split ($delim, (shift @contents), -1);
    # Create an anonymous sub to do the work
    my $extract_row = sub {
        my @cols = split($delim, shift(@_), -1);
        my %row = map {($header[$_], $cols[$_])} 0..$#header;
        return \%row;
    };
    return map {$extract_row->($_)} @contents;
}
Cheers,
Ben
WRT s/\r?\n$// ...
It may seem like a minor point, but I'd rather not depend on the values of \r and \n any more than I have to. And I really don't like using $ in patterns when I'm actually thinking about newlines (what with it matching before a trailing newline, and all). My version of the above, if I'm in full-blown portability mode:
s/[\xD\xA]+\z//;
(Macs use single CRs as line endings....)
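A quick sanity check of that pattern (my own snippet, not Chip's) against the three common line endings:

```perl
#!/usr/bin/perl -w
use strict;

# Strip Unix (LF), DOS (CRLF), and old-Mac (CR) endings with one pattern.
for my $line ("unix\n", "dos\r\n", "mac\r") {
    (my $clean = $line) =~ s/[\xD\xA]+\z//;
    print "<$clean>\n";
}
# Prints <unix>, <dos>, <mac> -- no stray \r or \n left behind.
```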
-- Chip Salzenberg, Free-Floating Agent of Chaos
Use the third -1 argument to split as documented in "perldoc -f split".
Very well noted! Just so that lazy people know what we are talking about, here's what perldoc has to say about the third argument to split (aka LIMIT):
If LIMIT is specified and positive, splits into no more than that many fields (though it may split into fewer). If LIMIT is unspecified or zero, trailing null fields are stripped (which potential users of pop would do well to remember). If LIMIT is negative, it is treated as if an arbitrarily large LIMIT had been specified.
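To make the trailing-null-field behaviour concrete, here's a tiny standalone demo (my own, not from the node):

```perl
#!/usr/bin/perl -w
use strict;

# With no LIMIT, trailing empty fields vanish; with -1 they survive.
my $line = "1\tdonald\t\t\t";
my @default = split /\t/, $line;        # trailing empties stripped
my @keep    = split /\t/, $line, -1;    # trailing empties kept
print scalar(@default), "\n";   # 2
print scalar(@keep), "\n";      # 5
```

For a flat file where the last columns may legitimately be empty, that difference is exactly why tilly recommends the -1.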
Switch the order in the hash. You said elsewhere you think about it one way. My experience says that that decision will come back to bite you. (...)
Now I understand what btrott was saying. I am looking at the table from a user's perspective as opposed to a programmer's perspective. I can see this is good advice now, and not just a matter of preference!
The idea behind this code will never support the full CSV spec or anything close to it. Document that limitation.
Hmm... You probably didn't read the description of Open Flat File. I haven't used CSVs for quite some time now. I just pulled this one up as an example for peer-teaching. No formal documentation required, since it will be tossed into a tar pit as soon as the brainstorm is over. :)
Nevertheless, all of your recommendations make perfect sense. Thanks a bunch for the contribution.
#!/home/bbq/bin/perl
# Trust no1!
Re: Another flatfile thingy
by btrott (Parson) on Aug 04, 2000 at 10:15 UTC
A couple of things:
That all said, here's my rewrite:
sub OpenFF {
    my($file, $dlmt) = @_;
    $dlmt = "\t" unless defined $dlmt;
    my $hash = {};
    open READ, $file or return;
    chomp(my $head = <READ>);
    my @headers = split $dlmt, $head;
    while (<READ>) {
        chomp;
        my @columns = split $dlmt;
        @{$hash->{$columns[0]}}{@headers} = @columns;
    }
    close READ;
    return $hash;
}
- W/r/t your data structure: personally, I always prefer a hash of records, with each of those records being a hash reference of columns and values. Whereas yours is the other way round. Does that make sense? In other words, I'd prefer
Yeah, it makes sense; it's just a matter of $hash{x}{y} vs. $hash{y}{x}. As you said, personal preference. I think of tables in a very visual way, so I guess my feeling is for x first.
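For anyone following along, a minimal sketch (data made up by me) of the two orientations side by side:

```perl
#!/usr/bin/perl -w
use strict;

# Column-first, as in the original OpenFF:
my %by_col = ( NAME2 => { 3 => 'pan' }, AGE => { 3 => 62 } );
# Row-first, as btrott suggests:
my %by_row = ( 3 => { NAME2 => 'pan', AGE => 62 } );

print $by_col{NAME2}{3}, "\n";   # pan
print $by_row{3}{NAME2}, "\n";   # pan

# Row-first lets you pull out one record as a single reference:
my $rec = $by_row{3};
print "$rec->{NAME2} is $rec->{AGE}\n";   # pan is 62
```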
- Returning a reference would be less memory-intensive.
Forgive me for my dense question, but how do you return a reference to a hash if the hash is private to the subroutine? Can that be done?
- When you return on an error, you don't have to explicitly return 0.
Oh, that's just an office habit we have. We are always very explicit about everything; we even avoid using $_ when possible, e.g. saying foreach $v instead of just foreach. But thanks for the note anyway!
@{$hash->{$columns[0]}}{@headers} = @columns;
Now that has to be one of the coolest things I've seen here lately... Congrats, dude. You hit the nail on the head, it seems. Thanks for your views.
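A stripped-down illustration of that hash-slice assignment, with made-up data standing in for one parsed row:

```perl
#!/usr/bin/perl -w
use strict;

my @headers = qw(ID NAME1 NAME2 AGE);
my @columns = (3, 'peter', 'pan', 62);
my $hash = {};

# Assign all four columns at once via a hash slice on the inner hash:
# keys come from @headers, values from @columns, paired positionally.
@{ $hash->{ $columns[0] } }{ @headers } = @columns;

print $hash->{3}{NAME2}, "\n";   # pan
print $hash->{3}{AGE},   "\n";   # 62
```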
#!/home/bbq/bin/perl
# Trust no1!
> Forgive me for my dense question, but how do you return
> a reference of a hash if the hash is private to the
> subroutine? Can that be done?
Don't think of the hash as "private" to the subroutine; yes, it's declared lexically within the sub, but that's not all that matters. Think of the hash as a piece of data *somewhere* that you've referenced lexically within the sub.
Perl's garbage collection is based on reference counting, so each reference to such a piece of data increments the reference count. Once you've declared my $hash = {}, the reference count of the underlying structure is incremented to one.
Normally, you'd just have that one reference to the data, so once the lexical has gone out of scope--at the end of the subroutine--the data would be destroyed. But not here, because you've returned a reference to the data and assigned it somewhere else. Even though leaving the sub decrements the reference count, then, you re-increment it by assigning the return value of the sub. So the reference count is still one, and the data is still there, and valid. The data won't be destroyed until all references to it have gone away.
BTW: this is, as far as I know, correct, but someone who knows more about Perl's internals is free to correct anything wrong. :)
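A minimal sketch of the point (sub and field names are mine, purely for illustration):

```perl
#!/usr/bin/perl -w
use strict;

# The anon hash outlives the sub because a reference to it escapes.
sub make_record {
    my $hash = {};                 # refcount of the anon hash: 1
    $hash->{greeting} = 'hello';
    return $hash;                  # the reference is copied out
}                                  # lexical $hash goes away...

my $rec = make_record();           # ...but $rec keeps the count at 1
print $rec->{greeting}, "\n";      # hello
```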
Re: Another flatfile thingy
by eak (Monk) on Aug 04, 2000 at 10:02 UTC
I am parsing the flatfile into an AoH. This makes it very easy to get at a row, then a column. Please note the cool map that creates the reference to a hash.
foo.txt
--------------------------
ID NAME1 NAME2 AGE
1 donald duck 50
2 mickey mouse 48
3 peter pan 62
4 madre theresa 108
5 banana split 2
#!/usr/bin/perl -w
use strict;
use IO::File;
my $fh = new IO::File;
$fh->open("< foo.txt") or die "can't open foo.txt $!";
my @file = map { [split /\s+/] } <$fh>;
my $keys = shift @file;
print "The keys are:\n";
print join("\t",@$keys)."\n";
my @aoh = map { my %hash; @hash{@$keys} = @$_; \%hash } @file;
print "Row 3 is \n";
print join("\t", @{$aoh[2]}{@$keys})."\n";
print "Column 'NAME2' is:\n";
foreach my $row (@aoh) {
    print $row->{NAME2}."\n";
}
Re: Another flatfile thingy
by perlmonkey (Hermit) on Aug 04, 2000 at 11:46 UTC
Well here is another rather sick way to do it if you want to keep the funky data structure you got. Note: I am not saying this is better or faster, just ugly and sort of fun.
sub OpenFF {
    my ($file,$dlmt) = @_;
    $dlmt = $_[1] || "\t";
    my(%hash,%row);
    open(READ,$file) or return;
    my @headers = split /$dlmt/, <READ>;
    s/^\s+|\s+$//g for @headers;
    chomp(@row{@headers} = split /$dlmt/, <READ>)
        and %hash = map { $_=>{ $hash{$_} ? %{$hash{$_}} : (), $row{$headers[0]}=>$row{$_}} } @headers
        while !eof(READ);
    return %hash;
}
Hmm, did merlyn happen to mention a better construct than the s/// for ...? It is something we put into production. I guess an alternative would be map { s/^\s+|\s+$//g } @headers, but I doubt it would be significantly better. I guess I will have to benchmark.
Thanks!
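For the benchmarking, a rough sketch using the standard Benchmark module (the header strings are made up, and timings will of course vary by machine):

```perl
#!/usr/bin/perl -w
use strict;
use Benchmark qw(cmpthese);

# Compare the postfix-for s/// against map in void context,
# both stripping leading/trailing whitespace from header fields.
my @template = ("  ID  ", " NAME1 ", " NAME2\t", "\tAGE  ");

cmpthese(-1, {
    for_loop => sub {
        my @h = @template;
        s/^\s+|\s+$//g for @h;
    },
    map_void => sub {
        my @h = @template;
        map { s/^\s+|\s+$//g } @h;
    },
});
```

Both aliase $_ to the array elements, so both modify @h in place; cmpthese(-1, ...) runs each sub for about one CPU second and prints a comparison table.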
Re: Another flatfile thingy
by tye (Sage) on Aug 04, 2000 at 19:18 UTC
I'd put the records in an array of hashes ($data) then have a hash of hashes of arrays ($index) that returns record numbers (nah, make that references to the record hashes). Then let the user select which field(s) they want the index computed for. Then the user could do things like:
$record= $index{$keyName}{$keyValue}[0];
$fieldValue= $record->{fieldName};
$fieldValue= $data[$recordNumber]{$fieldName};
for example:
( $data, $index )= OpenFF( "flatfile.txt", qw(ID AGE) );
print "People ages 55 are:\n";
for( @{ $index{AGE}{55} } ) {
print "\t",$_->{NAME},"\n";
}
The main reason I thought of this is because I thought your routine would be nice to use even when the first field isn't an ID number that is unique.
And, yes, I'd make the interface OO so you could have a method for setting the delimiter and for adding things to the index after the fact. Then if you have multiple files that use the same delimiter and ID fields, you could reuse the configuration of one for loading another via my $other= $one->new("other.txt");. Then have the object also be a reference to a tied hash where $obj->{}->[$N] gets a ref to record number $N and $obj->{keyFieldName}{fieldValue}[$I] gets a ref to the $I'th record having a matching key value. Also, each record should keep its record number in the field named "".
require File::Fields;
my $staff= File::Fields->new();
$staff->Delimiter('\s+');
$staff->IndexOn( "ID" );
$staff->Open( "main.txt" );
my $students= $staff->new( "student.txt" );
my $TAs= $staff->new( "ta.txt" );
my $classes= File::Fields->new( "classes.txt",
    { Delimiter=>'\s*,\s*', IndexOn=>[qw(COURSE TEACH ROOM)] } );
my $enroll= File::Fields->new( "enroll.txt",
    { IndexOn=>[qw(STUDENT COURSE)] } );
$staff->AddIndex( qw(PHONE OFFICE) );
$TAs->AddIndex( "ADVISOR" );
$staff->AddField("TOTALSTUDENTS",0);
for my $teach ( @{ $staff->{} } ) {
    for my $class ( @{ $classes->{TEACH}{$teach} } ) {
        $teach->{TOTALSTUDENTS} +=
            @{ $enroll->{COURSE}{$class} };
    }
}
$staff->Write( "staff2.txt" );
Making it OO is pretty easy and once you do that, adding all of the features I mentioned is pretty easy and can be done incrementally. Making an object that is also a tied reference to a hash is confusing at first, so let me know if you'd like help there.
This probably sounds like a lot of work but I don't think that it would be. :)