PerlMonks
Another flatfile thingy

by BBQ (Deacon)
on Aug 04, 2000 at 08:56 UTC ( #26138=perlquestion )

BBQ has asked for the wisdom of the Perl Monks concerning the following question:

A while ago I submitted to the snippets section a, well, snippet, called Open Flat File. The idea is to provide an x/y mechanism for navigating through the data of pseudo-CSV files. Last night, one of my co-workers had a few questions about hashes, flat files, chomp, and miscellaneous debris, so I decided to pull the snippet back up. While I was trying to explain what it did, and how one should always use strict and -w, it struck me: dog, I'm so lame!

So, here is my second revision of Open Flat File, with a few minor changes allowing it to run under -w and use strict. Although it works, I feel very uncomfortable looking at it the way it is, and would very much appreciate any ideas that other monks may have on it. Here we go:
    sub OpenFF {
        my %hash;
        my ($file,$dlmt) = @_;
        $dlmt = "\t" if ! $_[1];
        open(READ,$file) or return(0);
        my @file = <READ>;
        chomp(@file);
        close(READ);
        my @headers = split($dlmt,shift(@file));
        foreach my $line (@file) {
            my $x = 0;
            my @columns = split($dlmt,$line);
            while ($x < scalar @headers) {
                $hash{$headers[$x++]}{$columns[0]} = $columns[$x];
            }
        }
        return(%hash);
    }
That said, please consider this sample file:
    foo.txt
    -------
    ID	NAME1	NAME2	AGE
    1	donald	duck	50
    2	mickey	mouse	48
    3	peter	pan	62
    4	madre	theresa	108
    5	banana	split	2
And this sample script:
    #!/usr/bin/perl -w
    use strict;

    my %foo = OpenFF('foo.txt') or die("Couldn't parse foo.txt, $!\n");
    print "The keys are:\n";
    print join("\t",keys %foo)."\n";
    print "Row 3 is:\n";
    foreach (qw'ID NAME1 NAME2 AGE') {
        print "$foo{$_}{'3'}\t";
    }
    print "Column 'NAME2' is:\n";
    foreach (1..5) {
        print "$foo{'NAME2'}{$_}\n";
    }
...and the output would be:
    The keys are:
    NAME1	NAME2	ID	AGE
    Row 3 is:
    3	peter	pan	62
    Column 'NAME2' is:
    duck
    mouse
    pan
    theresa
    split
It is my opinion that this could be vastly explored and improved, but I have reached a point where I can't go any further without suggestions from my fellow monks.
  • What could make this better?
  • OO approach?
  • Less loops?
  • /dev/null?
I thank you all in advance.

#!/home/bbq/bin/perl
# Trust no1!

Replies are listed 'Best First'.
A few style suggestions
by tilly (Archbishop) on Aug 04, 2000 at 19:44 UTC
    1. Localize *READ before opening a file on it.
    2. Do your own error processing like it says to in perlstyle. (eg ...or die "Cannot read $file: $!")
    3. Use the third -1 argument to split as documented in "perldoc -f split".
    4. I have been bitten moving between Linux and Windows by DOS endings. Instead of chomp I like to s/\r?\n$//. YMMV.
    5. Should two rows have the same first entry you are silently overwriting data. This can be a Bad Thing. Either put in an error check or use an array instead of a hash for the rows.
    6. If you are incrementing a variable, use a for() loop instead of while(). It may work the same, but the person maintaining it finds the looping construct more obvious.
    7. Switch the order in the hash. You said elsewhere you think about it one way. My experience says that that decision will come back to bite you. The only way that you will find yourself wanting to access that data structure is row by row. If that is not a good enough reason for you, then let me tell you that if you reverse it then you can later choose to extract out references to each row and access them directly. The double hash lookup is slower by a factor of 2 and forces you to write more code everywhere.
    8. Document how this function works. A short comment helps immensely.
    9. The idea behind this code will never support the full CSV spec or anything close to it. Document that limitation.
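A minimal sketch of point 7, with a made-up table: once the rows are the outer keys, one lookup yields a row reference that can be reused, instead of paying the double hash lookup for every field.

```perl
# Rows keyed first: a single lookup fetches the whole record
my %table = (
    3 => { ID => 3, NAME1 => 'peter', NAME2 => 'pan', AGE => 62 },
);
my $row = $table{3};                    # grab the row reference once...
print "$row->{NAME1} $row->{AGE}\n";    # ...then field access is direct
```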

    OK, let me put my wallet where my mouth is and show you a hasty rewrite that takes all of that into account:

    =pod

    =item open_flatfile

    Takes the name of a pseudo-CSV flatfile and an optional delimiter as args.
    (The delimiter can be a regular expression created with qr//.)  It opens
    the file, uses the first line as a header, and returns the data as an
    array of hashes.

    This will not handle CSV files with escaped fields.

    =cut

    sub open_flatfile {
        my $file = shift;
        my $delim = shift || "\t";
        local *FH;
        open (FH, "<$file") or die "Cannot read $file: $!";
        my @contents = <FH>;
        close (FH);
        s/\r?\n$// foreach @contents;
        my @header = split ($delim, (shift @contents), -1);
        # Create an anonymous sub to do the work
        my $extract_row = sub {
            my @cols = split($delim, shift(@_), -1);
            my %row = map {($header[$_], $cols[$_])} 0..$#header;
            return \%row;
        };
        return map {$extract_row->($_)} @contents;
    }
    Cheers,
    Ben
      WRT s/\r?\n$// ...

      It may seem like a minor point, but I'd rather not depend on the values of \r and \n any more than I have to. And I really don't like using $ in patterns when I'm actually thinking about newlines (what with it matching before a trailing newline, and all). My version of the above, if I'm in full-blown portability mode:

      s/[\xD\xA]+\z//;
      (Macs use single CRs as line endings....)

          -- Chip Salzenberg, Free-Floating Agent of Chaos

You are right of course. In my defence, I am rarely in full-blown portability mode. But I am frequently faced with writing code that will run under both Windows and Linux, on essentially the same files, shared under SAMBA. Back when I used chomp I got seriously bitten and thought I had lost a ton of data somewhere. (I just wasn't seeing it because of the \r.)

        I believe in anticipating future problems, but there is a limit to how much I worry about in my own narrow set of problems. :-)
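For illustration, a tiny sketch of the failure mode being described: chomp strips only the "\n" of a CRLF pair, while the pattern above handles CR, LF, or CRLF endings.

```perl
my $dos_line = "mickey\tmouse\r\n";   # a DOS-format line read on Unix
my $chomped = $dos_line;
chomp $chomped;                       # removes only the "\n"; the "\r" survives
my $clean = $dos_line;
$clean =~ s/[\xD\xA]+\z//;            # strips CR, LF, or CRLF endings
# $chomped still ends in a stray "\r"; $clean does not
```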

      Use the third -1 argument to split as documented in "perldoc -f split".

Very well noted!! Just so that lazy people know what we are talking about, here's what perldoc has to say about the third argument of split (aka LIMIT):

      If LIMIT is specified and positive, splits into no more than that many fields (though it may split into fewer). If LIMIT is unspecified or zero, trailing null fields are stripped (which potential users of pop would do well to remember). If LIMIT is negative, it is treated as if an arbitrarily large LIMIT had been specified.
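A quick illustration of that behavior on a tab-delimited line with a trailing empty field:

```perl
my $line = "x\ty\t";                  # note the trailing delimiter
my @default = split /\t/, $line;      # trailing null field is stripped
my @keep    = split /\t/, $line, -1;  # -1 keeps it
print scalar(@default), " vs ", scalar(@keep), " fields\n";  # 2 vs 3
```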

      Switch the order in the hash. You said elsewhere you think about it one way. My experience says that that decision will come back to bite you. (...)

Now I understand what btrott was saying. I was looking at the table from a user's perspective as opposed to a programmer's perspective. I can see this is good advice now, and not just a matter of preference!

      The idea behind this code will never support the full CSV spec or anything close to it. Document that limitation.

Hmm... You probably didn't read the description of Open Flat File. I haven't used CSVs for quite some time now; I just pulled this one up as an example for peer-teaching. No formal documentation is required, since it will be tossed into a tar pit as soon as the brainstorm is over. :)

      Nevertheless, all of your recommendations make perfect sense. Thanks a bunch for the contribution.

      #!/home/bbq/bin/perl
      # Trust no1!
Re: Another flatfile thingy
by btrott (Parson) on Aug 04, 2000 at 10:15 UTC
    A couple of things:
    • W/r/t your data structure: personally, I always prefer a hash of records, with each of those records being a hash reference of columns and values. Whereas yours is the other way round. Does that make sense? In other words, I'd prefer
      $foo{3}{NAME1}
      instead of
      $foo{NAME1}{3}
      But that's just me.
    • Don't read the entire file into an array; just loop over each line and process it as necessary.
    • Returning a reference would be less memory-intensive.
    • A hash slice would be nice for setting the values. But that would really only work with the data structure I prefer (that which I outlined above).
    • When you return on an error, you don't have to explicitly return 0.
    That all said, here's my rewrite:
    sub OpenFF {
        my($file, $dlmt) = @_;
        $dlmt = "\t" unless defined $dlmt;
        my $hash = {};
        open READ, $file or return;
        chomp(my $head = <READ>);
        my @headers = split $dlmt, $head;
        while (<READ>) {
            chomp;
            my @columns = split $dlmt;
            @{$hash->{$columns[0]}}{@headers} = @columns;
        }
        close READ;
        return $hash;
    }
      • W/r/t your data structure: personally, I always prefer a hash of records, with each of those records being a hash reference of columns and values. Whereas yours is the other way round. Does that make sense? In other words, I'd prefer

Yeah, it makes sense; it's just a matter of $hash{x}{y} vs. $hash{y}{x}. As you said, personal preference. I think of tables in a very visual way, so I guess my feeling is for x first.

      • Returning a reference would be less memory-intensive.

        Forgive me for my dense question, but how do you return a reference of a hash if the hash is private to the subroutine? Can that be done?

      • When you return on an error, you don't have to explicitly return 0.

Oh, that's just an office habit we have. We are always very explicit about everything; we even avoid using $_ when possible, e.g. saying foreach $v instead of just foreach. But thanks for the note anyway!

    @{$hash->{$columns[0]}}{@headers} = @columns;

Now that has to be one of the coolest things I've seen here lately... Congrats, dude. You hit the nail on the head, it seems. Thanks for your views.

      #!/home/bbq/bin/perl
      # Trust no1!
        You wrote:
> Forgive me for my dense question, but how do you return
> a reference of a hash if the hash is private to the
> subroutine? Can that be done?
        Don't think of the hash as "private" to the subroutine; yes, it's declared lexically within the sub, but that's not all that matters. Think of the hash as a piece of data *somewhere* that you've referenced lexically within the sub.

        Perl's garbage collection is based on reference counting; so each reference to such a piece of data increments the reference count. Once you've declared

        my $hash = {}
        the reference count of the underlying structure is incremented to one.

        Normally, you'd just have that one reference to the data, so once the lexical has gone out of scope--at the end of the subroutine--the data would be destroyed. But not here, because you've returned a reference to the data and assigned it somewhere else. Even though leaving the sub decrements the reference count, then, you re-increment it by assigning the return value of the sub. So the reference count is still one, and the data is still there, and valid.

        The data won't be destroyed until all references to it have gone away.
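The explanation above in a few lines (hypothetical sub name):

```perl
sub make_record {
    my $hash = { name => 'peter' };  # anon hash created; refcount is 1
    return $hash;                    # the lexical goes out of scope here...
}
my $rec = make_record();             # ...but this assignment holds a
print $rec->{name}, "\n";            # reference, so the data survives
```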

        BTW: this is, as far as I know, correct, but someone who knows more about Perl's internals is free to correct anything wrong. :)

Re: Another flatfile thingy
by eak (Monk) on Aug 04, 2000 at 10:02 UTC
I am parsing the flatfile into an AoH. This makes it very easy to get at a row, then a column. Please note the cool map that creates the reference to a hash.
    foo.txt
    --------------------------
    ID    NAME1    NAME2    AGE
    1    donald    duck    50
    2    mickey    mouse    48
    3    peter    pan    62
    4    madre    theresa    108
    5    banana    split    2 
    
    #!/usr/bin/perl -w
    use strict;
    use IO::File;

    my $fh = new IO::File;
    $fh->open("< foo.txt") or die "can't open foo.txt $!";
    my @file = map { [split /\s+/] } <$fh>;
    my $keys = shift @file;
    print "The keys are:\n";
    print join("\t",@$keys )."\n";
    my @aoh = map { my %hash; @hash{@$keys} = @$_; \%hash } @file;
    print "Row 3 is \n";
    print join("\t", @{$aoh[2]}{@$keys} )."\n";
    print "Column 'NAME2' is:\n";
    foreach my $row (@aoh) {
        print $row->{NAME2}."\n";
    }
Re: Another flatfile thingy
by perlmonkey (Hermit) on Aug 04, 2000 at 11:46 UTC
Well, here is another rather sick way to do it if you want to keep the funky data structure you've got. Note: I am not saying this is better or faster, just ugly and sort of fun.
    sub OpenFF {
        my ($file,$dlmt) = @_;
        $dlmt = $_[1] || "\t";
        my(%hash,%row);
        open(READ,$file) or return;
        my @headers = split /$dlmt/, <READ>;
        s/^\s+|\s+$//g for @headers;
        chomp(@row{@headers} = split /$dlmt/, <READ>)
            and %hash = map {
                $_ => { $hash{$_} ? %{$hash{$_}} : (), $row{$headers[0]} => $row{$_} }
            } @headers
            while !eof(READ);
        return %hash;
    }
Huh, like $_, you REALLY like statement modifiers, don't you? :) Anyway, yeah, that looks like lots of fun. I would never ship it out as production code, but still very fun! My favorite line from the above is:
      s/^\s+|\s+$//g for @headers;
      Waaaaaay cool! :)

      Updated:
On a side note, I got a message from merlyn in the chatterbox pointing out that s/// for has a great performance penalty despite its cool look. Then again, I believe that perlmonkey's idea was to make something funky and not to worry too much about performance issues. Hey, we're all just tossing ideas around here, right? Still, merlyn has a very good point that should be noted.

      #!/home/bbq/bin/perl
      # Trust no1!
Hmm, did merlyn happen to mention a better construct? The s/// for is something we put into production. I guess an alternative would be map { s/^\s+|\s+$//g } @headers, but I doubt it would be significantly better. I guess I will have to benchmark.

        Thanks!
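A hedged sketch of how such a benchmark might look with the standard Benchmark module (sample data and iteration count made up; both constructs modify the array in place through $_ aliasing):

```perl
use Benchmark qw(cmpthese);

my @sample = ("  ID ", " NAME1  ", " AGE ");
cmpthese(100_000, {
    for_mod => sub { my @h = @sample; s/^\s+|\s+$//g for @h },
    map_mod => sub { my @h = @sample; map { s/^\s+|\s+$//g } @h },
});
```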
Re: Another flatfile thingy
by tye (Sage) on Aug 04, 2000 at 19:18 UTC

    I'd put the records in an array of hashes ($data) then have a hash of hashes of arrays ($index) that returns record numbers (nah, make that references to the record hashes). Then let the user select which field(s) they want the index computed for. Then the user could do things like:

    $record = $index{$keyName}{$keyValue}[0];
    $fieldValue = $record->{fieldName};
    $fieldValue = $data[$recordNumber]{$fieldName};
    for example:
    ( $data, $index ) = OpenFF( "flatfile.txt", qw(ID AGE) );
    print "People ages 55 are:\n";
    for ( @{ $index{AGE}{55} } ) {
        print "\t", $_->{NAME}, "\n";
    }

    The main reason I thought of this is because I thought your routine would be nice to use even when the first field isn't an ID number that is unique.
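A minimal sketch of that indexing idea, with made-up records: non-unique key values simply accumulate references in an array instead of silently overwriting each other.

```perl
my @data = (
    { ID => 1, NAME => 'donald', AGE => 50 },
    { ID => 2, NAME => 'mickey', AGE => 50 },
    { ID => 3, NAME => 'peter',  AGE => 62 },
);
my %index;
for my $rec (@data) {
    push @{ $index{AGE}{ $rec->{AGE} } }, $rec;  # one ref per record, per key value
}
print $_->{NAME}, "\n" for @{ $index{AGE}{50} }; # both 50-year-olds
```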

    And, yes, I'd make the interface OO so you could have a method for setting the delimiter and for adding things to the index after the fact. Then if you have multiple files that use the same delimiter and ID fields, you could reuse the configuration of one for loading another via my $other= $one->new("other.txt");. Then have the object also be a reference to a tied hash where $obj->{}->[$N] gets a ref to record number $N and $obj->{keyFieldName}{fieldValue}[$I] gets a ref to the $I'th record having a matching key value. Also, each record should keep its record number in the field named "".

    require File::Fields;

    my $staff = File::Fields->new();
    $staff->Delimiter('\s+');
    $staff->IndexOn( "ID" );
    $staff->Open( "main.txt" );
    my $students = $staff->new( "student.txt" );
    my $TAs = $staff->new( "ta.txt" );
    my $classes = File::Fields->new( "classes.txt",
        { Delimiter => '\s*,\s*', IndexOn => [qw(COURSE TEACH ROOM)] } );
    my $enroll = File::Fields->new( "enroll.txt",
        { IndexOn => [qw(STUDENT COURSE)] } );
    $staff->AddIndex( qw(PHONE OFFICE) );
    $TAs->AddIndex( "ADVISOR" );
    $staff->AddField( "TOTALSTUDENTS", 0 );

    for my $teach ( @{ $staff->{} } ) {
        for my $class ( @{ $classes->{TEACH}{$teach} } ) {
            $teach->{TOTALSTUDENTS} += @{ $enroll->{COURSE}{$class} };
        }
    }
    $staff->Write( "staff2.txt" );

    Making it OO is pretty easy and once you do that, adding all of the features I mentioned is pretty easy and can be done incrementally. Making an object that is also a tied reference to a hash is confusing at first, so let me know if you'd like help there.

    This probably sounds like a lot of work but I don't think that it would be. :)
