PerlMonks
Another flatfile thingy

by BBQ (Deacon)
on Aug 04, 2000 at 08:56 UTC ( #26138=perlquestion )

BBQ has asked for the wisdom of the Perl Monks concerning the following question:

A while ago I submitted to the snippets section a, well, snippet, called Open Flat File. The idea is to provide an x/y mechanism for navigating through the data of pseudo-CSV files. Last night, one of my co-workers had a few questions about hashes, flat files, chomp, and miscellaneous debris, so I decided to pull the snippet back up. While I was trying to explain what it did, and how one should always use strict and -w, it struck me: dog, I'm so lame!

So, here is my second revision of Open Flat File, with a few minor changes allowing it to run under -w and use strict. Although it works, I feel very uncomfortable looking at it the way it is, and would very much appreciate any ideas that other monks may have on it. Here we go:
    sub OpenFF {
        my %hash;
        my ($file,$dlmt) = @_;
        $dlmt = "\t" if ! $_[1];
        open(READ,$file) or return(0);
        my @file = <READ>;
        chomp(@file);
        close(READ);
        my @headers = split($dlmt,shift(@file));
        foreach my $line (@file) {
            my $x = 0;
            my @columns = split($dlmt,$line);
            while ($x < scalar @headers) {
                $hash{$headers[$x++]}{$columns[0]} = $columns[$x];
            }
        }
        return(%hash);
    }
That said, please consider this sample file:
    foo.txt
    -------
    ID	NAME1	NAME2	AGE
    1	donald	duck	50
    2	mickey	mouse	48
    3	peter	pan	62
    4	madre	theresa	108
    5	banana	split	2
And this sample script:
    #!/usr/bin/perl -w
    use strict;

    my %foo = OpenFF('foo.txt') or die("Couldn't parse foo.txt, $!\n");
    print "The keys are:\n";
    print join("\t",keys %foo)."\n";
    print "Row 3 is:\n";
    foreach (qw'ID NAME1 NAME2 AGE') {
        print "$foo{$_}{'3'}\t";
    }
    print "Column 'NAME2' is:\n";
    foreach (1..5) {
        print "$foo{'NAME2'}{$_}\n";
    }
...and the output would be:
    The keys are:
    NAME1	NAME2	ID	AGE
    Row 3 is:
    3	peter	pan	62
    Column 'NAME2' is:
    duck
    mouse
    pan
    theresa
    split
It is my opinion that this could be vastly explored and improved, but I have reached a point where I can't go any further without suggestions from my fellow monks.
  • What could make this better?
  • OO approach?
  • Less loops?
  • /dev/null?
I thank you all in advance.

#!/home/bbq/bin/perl
# Trust no1!

Replies are listed 'Best First'.
A few style suggestions
by tilly (Archbishop) on Aug 04, 2000 at 19:44 UTC
    1. Localize *READ before opening a file on it.
    2. Do your own error processing like it says to in perlstyle. (eg ...or die "Cannot read $file: $!")
    3. Use the third -1 argument to split as documented in "perldoc -f split".
    4. I have been bitten moving between Linux and Windows by DOS endings. Instead of chomp I like to s/\r?\n$//. YMMV.
    5. Should two rows have the same first entry you are silently overwriting data. This can be a Bad Thing. Either put in an error check or use an array instead of a hash for the rows.
    6. If you are incrementing a variable, use a for() loop instead of while(). It may work the same, but the person maintaining it finds the looping construct more obvious.
    7. Switch the order in the hash. You said elsewhere you think about it one way. My experience says that that decision will come back to bite you. The only way that you will find yourself wanting to access that data structure is row by row. If that is not a good enough reason for you, then let me tell you that if you reverse it then you can later choose to extract out references to each row and access them directly. The double hash lookup is slower by a factor of 2 and forces you to write more code everywhere.
    8. Document how this function works. A short comment helps immensely.
    9. The idea behind this code will never support the full CSV spec or anything close to it. Document that limitation.
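A minimal sketch of point 7, with a made-up table: once the rows are the outer keys, one lookup yields a row reference that can be reused, instead of paying the double hash lookup for every field.

```perl
# Rows keyed first: a single lookup fetches the whole record
my %table = (
    3 => { ID => 3, NAME1 => 'peter', NAME2 => 'pan', AGE => 62 },
);
my $row = $table{3};                    # grab the row reference once...
print "$row->{NAME1} $row->{AGE}\n";    # ...then field access is direct
```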

    OK, let me put my wallet where my mouth is and show you a hasty rewrite that takes all of that into account:

    =pod

    =item open_flatfile

    Takes the name of a pseudo-CSV flatfile and an optional delimiter as args.
    (The delimiter can be a regular expression created with qr//.)  It opens
    the file, uses the first line as a header, and returns the data as an
    array of hashes.

    This will not handle CSV files with escaped fields.

    =cut

    sub open_flatfile {
        my $file = shift;
        my $delim = shift || "\t";
        local *FH;
        open (FH, "<$file") or die "Cannot read $file: $!";
        my @contents = <FH>;
        close (FH);
        s/\r?\n$// foreach @contents;
        my @header = split ($delim, (shift @contents), -1);
        # Create an anonymous sub to do the work
        my $extract_row = sub {
            my @cols = split($delim, shift(@_), -1);
            my %row = map {($header[$_], $cols[$_])} 0..$#header;
            return \%row;
        };
        return map {$extract_row->($_)} @contents;
    }
    Cheers,
    Ben
      WRT s/\r?\n$// ...

      It may seem like a minor point, but I'd rather not depend on the values of \r and \n any more than I have to. And I really don't like using $ in patterns when I'm actually thinking about newlines (what with it matching before a trailing newline, and all). My version of the above, if I'm in full-blown portability mode:

      s/[\xD\xA]+\z//;
      (Macs use single CRs as line endings....)

          -- Chip Salzenberg, Free-Floating Agent of Chaos

You are right of course. In my defence, I am rarely in full-blown portability mode. But I am frequently faced with writing code that will run under both Windows and Linux, on essentially the same files, shared under SAMBA. Back when I used chomp I got seriously bitten and thought I had lost a ton of data somewhere. (I just wasn't seeing it because of the \r.)

        I believe in anticipating future problems, but there is a limit to how much I worry about in my own narrow set of problems. :-)
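For illustration, a tiny sketch of the failure mode being described: chomp strips only the "\n" of a CRLF pair, while the pattern above handles CR, LF, or CRLF endings.

```perl
my $dos_line = "mickey\tmouse\r\n";   # a DOS-format line read on Unix
my $chomped = $dos_line;
chomp $chomped;                       # removes only the "\n"; the "\r" survives
my $clean = $dos_line;
$clean =~ s/[\xD\xA]+\z//;            # strips CR, LF, or CRLF endings
# $chomped still ends in a stray "\r"; $clean does not
```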

      Use the third -1 argument to split as documented in "perldoc -f split".

Very well noted!! Just so that lazy people know what we are talking about, here's what perldoc has to say about the third argument of split (aka LIMIT):

      If LIMIT is specified and positive, splits into no more than that many fields (though it may split into fewer). If LIMIT is unspecified or zero, trailing null fields are stripped (which potential users of pop would do well to remember). If LIMIT is negative, it is treated as if an arbitrarily large LIMIT had been specified.
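A quick illustration of that behavior on a tab-delimited line with a trailing empty field:

```perl
my $line = "x\ty\t";                  # note the trailing delimiter
my @default = split /\t/, $line;      # trailing null field is stripped
my @keep    = split /\t/, $line, -1;  # -1 keeps it
print scalar(@default), " vs ", scalar(@keep), " fields\n";  # 2 vs 3
```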

      Switch the order in the hash. You said elsewhere you think about it one way. My experience says that that decision will come back to bite you. (...)

Now I understand what btrott was saying. I was looking at the table from a user's perspective as opposed to a programmer's perspective. I can see this is good advice now, and not just a matter of preference!

      The idea behind this code will never support the full CSV spec or anything close to it. Document that limitation.

Hmm... You probably didn't read the description of Open Flat File. I haven't used CSVs for quite some time now; I just pulled this one up as an example for peer-teaching. No formal documentation is required, since it will be tossed into a tar pit as soon as the brainstorm is over. :)

      Nevertheless, all of your recommendations make perfect sense. Thanks a bunch for the contribution.

      #!/home/bbq/bin/perl
      # Trust no1!
Re: Another flatfile thingy
by btrott (Parson) on Aug 04, 2000 at 10:15 UTC
    A couple of things:
    • W/r/t your data structure: personally, I always prefer a hash of records, with each of those records being a hash reference of columns and values. Whereas yours is the other way round. Does that make sense? In other words, I'd prefer
      $foo{3}{NAME1}
      instead of
      $foo{NAME1}{3}
      But that's just me.
    • Don't read the entire file into an array; just loop over each line and process it as necessary.
    • Returning a reference would be less memory-intensive.
    • A hash slice would be nice for setting the values. But that would really only work with the data structure I prefer (that which I outlined above).
    • When you return on an error, you don't have to explicitly return 0.
    That all said, here's my rewrite:
    sub OpenFF {
        my($file, $dlmt) = @_;
        $dlmt = "\t" unless defined $dlmt;
        my $hash = {};
        open READ, $file or return;
        chomp(my $head = <READ>);
        my @headers = split $dlmt, $head;
        while (<READ>) {
            chomp;
            my @columns = split $dlmt;
            @{$hash->{$columns[0]}}{@headers} = @columns;
        }
        close READ;
        return $hash;
    }
      • W/r/t your data structure: personally, I always prefer a hash of records, with each of those records being a hash reference of columns and values. Whereas yours is the other way round. Does that make sense? In other words, I'd prefer

Yeah, it makes sense; it's just a matter of $hash{x}{y} vs. $hash{y}{x}. As you said, personal preference. I think of tables in a very visual way, so I guess my feeling is for x first.

      • Returning a reference would be less memory-intensive.

        Forgive me for my dense question, but how do you return a reference of a hash if the hash is private to the subroutine? Can that be done?

      • When you return on an error, you don't have to explicitly return 0.

Oh, that's just an office habit we have. We are always very explicit about everything; we even avoid using $_ when possible, e.g. saying foreach $v instead of just foreach. But thanks for the note anyway!

    @{$hash->{$columns[0]}}{@headers} = @columns;

Now that has to be one of the coolest things I've seen here lately... Congrats, dude. You hit the nail on the head, it seems. Thanks for your views.

      #!/home/bbq/bin/perl
      # Trust no1!
        You wrote:
> Forgive me for my dense question, but how do you return
> a reference of a hash if the hash is private to the
> subroutine? Can that be done?
        Don't think of the hash as "private" to the subroutine; yes, it's declared lexically within the sub, but that's not all that matters. Think of the hash as a piece of data *somewhere* that you've referenced lexically within the sub.

        Perl's garbage collection is based on reference counting; so each reference to such a piece of data increments the reference count. Once you've declared

        my $hash = {}
        the reference count of the underlying structure is incremented to one.

        Normally, you'd just have that one reference to the data, so once the lexical has gone out of scope--at the end of the subroutine--the data would be destroyed. But not here, because you've returned a reference to the data and assigned it somewhere else. Even though leaving the sub decrements the reference count, then, you re-increment it by assigning the return value of the sub. So the reference count is still one, and the data is still there, and valid.

        The data won't be destroyed until all references to it have gone away.
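The explanation above in a few lines (hypothetical sub name):

```perl
sub make_record {
    my $hash = { name => 'peter' };  # anon hash created; refcount is 1
    return $hash;                    # the lexical goes out of scope here...
}
my $rec = make_record();             # ...but this assignment holds a
print $rec->{name}, "\n";            # reference, so the data survives
```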

        BTW: this is, as far as I know, correct, but someone who knows more about Perl's internals is free to correct anything wrong. :)

Re: Another flatfile thingy
by eak (Monk) on Aug 04, 2000 at 10:02 UTC
I am parsing the flatfile into an AoH. This makes it very easy to get at a row, then a column. Please note the cool map that creates the reference to a hash.
    foo.txt
    --------------------------
    ID    NAME1    NAME2    AGE
    1    donald    duck    50
    2    mickey    mouse    48
    3    peter    pan    62
    4    madre    theresa    108
    5    banana    split    2 
    
    #!/usr/bin/perl -w
    use strict;
    use IO::File;

    my $fh = new IO::File;
    $fh->open("< foo.txt") or die "can't open foo.txt $!";
    my @file = map { [split /\s+/] } <$fh>;
    my $keys = shift @file;
    print "The keys are:\n";
    print join("\t",@$keys )."\n";
    my @aoh = map { my %hash; @hash{@$keys} = @$_; \%hash } @file;
    print "Row 3 is \n";
    print join("\t", @{$aoh[2]}{@$keys} )."\n";
    print "Column 'NAME2' is:\n";
    foreach my $row (@aoh) {
        print $row->{NAME2}."\n";
    }
Re: Another flatfile thingy
by perlmonkey (Hermit) on Aug 04, 2000 at 11:46 UTC
Well, here is another rather sick way to do it if you want to keep the funky data structure you've got. Note: I am not saying this is better or faster, just ugly and sort of fun.
    sub OpenFF {
        my ($file,$dlmt) = @_;
        $dlmt = $_[1] || "\t";
        my(%hash,%row);
        open(READ,$file) or return;
        my @headers = split /$dlmt/, <READ>;
        s/^\s+|\s+$//g for @headers;
        chomp(@row{@headers} = split /$dlmt/, <READ>)
            and %hash = map {
                $_ => { $hash{$_} ? %{$hash{$_}} : (), $row{$headers[0]} => $row{$_} }
            } @headers
            while !eof(READ);
        return %hash;
    }
Huh, like $_, you REALLY like statement modifiers, don't you? :) Anyway, yeah, that looks like lots of fun. I would never ship it out as production code, but still very fun! My favorite line from the above is:
      s/^\s+|\s+$//g for @headers;
      Waaaaaay cool! :)

      Updated:
On a side note, I got a message from merlyn in the chatterbox pointing out that s/// for has a great performance penalty despite its cool look. Then again, I believe that perlmonkey's idea was to make something funky and not to worry too much about performance issues. Hey, we're all just tossing ideas around here, right? Still, merlyn has a very good point that should be noted.

      #!/home/bbq/bin/perl
      # Trust no1!
Hmm, did merlyn happen to mention a better construct? The s/// for is something we put into production. I guess an alternative would be map { s/^\s+|\s+$//g } @headers, but I doubt it would be significantly better. I guess I will have to benchmark.

        Thanks!
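A hedged sketch of how such a benchmark might look with the standard Benchmark module (sample data and iteration count made up; both constructs modify the array in place through $_ aliasing):

```perl
use Benchmark qw(cmpthese);

my @sample = ("  ID ", " NAME1  ", " AGE ");
cmpthese(100_000, {
    for_mod => sub { my @h = @sample; s/^\s+|\s+$//g for @h },
    map_mod => sub { my @h = @sample; map { s/^\s+|\s+$//g } @h },
});
```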
Re: Another flatfile thingy
by tye (Sage) on Aug 04, 2000 at 19:18 UTC

    I'd put the records in an array of hashes ($data) then have a hash of hashes of arrays ($index) that returns record numbers (nah, make that references to the record hashes). Then let the user select which field(s) they want the index computed for. Then the user could do things like:

    $record = $index{$keyName}{$keyValue}[0];
    $fieldValue = $record->{fieldName};
    $fieldValue = $data[$recordNumber]{$fieldName};
    for example:
    ( $data, $index ) = OpenFF( "flatfile.txt", qw(ID AGE) );
    print "People ages 55 are:\n";
    for ( @{ $index{AGE}{55} } ) {
        print "\t", $_->{NAME}, "\n";
    }

    The main reason I thought of this is because I thought your routine would be nice to use even when the first field isn't an ID number that is unique.
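A minimal sketch of that indexing idea, with made-up records: non-unique key values simply accumulate references in an array instead of silently overwriting each other.

```perl
my @data = (
    { ID => 1, NAME => 'donald', AGE => 50 },
    { ID => 2, NAME => 'mickey', AGE => 50 },
    { ID => 3, NAME => 'peter',  AGE => 62 },
);
my %index;
for my $rec (@data) {
    push @{ $index{AGE}{ $rec->{AGE} } }, $rec;  # one ref per record, per key value
}
print $_->{NAME}, "\n" for @{ $index{AGE}{50} }; # both 50-year-olds
```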

    And, yes, I'd make the interface OO so you could have a method for setting the delimiter and for adding things to the index after the fact. Then if you have multiple files that use the same delimiter and ID fields, you could reuse the configuration of one for loading another via my $other= $one->new("other.txt");. Then have the object also be a reference to a tied hash where $obj->{}->[$N] gets a ref to record number $N and $obj->{keyFieldName}{fieldValue}[$I] gets a ref to the $I'th record having a matching key value. Also, each record should keep its record number in the field named "".

    require File::Fields;

    my $staff = File::Fields->new();
    $staff->Delimiter('\s+');
    $staff->IndexOn( "ID" );
    $staff->Open( "main.txt" );
    my $students = $staff->new( "student.txt" );
    my $TAs = $staff->new( "ta.txt" );
    my $classes = File::Fields->new( "classes.txt",
        { Delimiter => '\s*,\s*', IndexOn => [qw(COURSE TEACH ROOM)] } );
    my $enroll = File::Fields->new( "enroll.txt",
        { IndexOn => [qw(STUDENT COURSE)] } );
    $staff->AddIndex( qw(PHONE OFFICE) );
    $TAs->AddIndex( "ADVISOR" );
    $staff->AddField( "TOTALSTUDENTS", 0 );

    for my $teach ( @{ $staff->{} } ) {
        for my $class ( @{ $classes->{TEACH}{$teach} } ) {
            $teach->{TOTALSTUDENTS} += @{ $enroll->{COURSE}{$class} };
        }
    }
    $staff->Write( "staff2.txt" );

    Making it OO is pretty easy and once you do that, adding all of the features I mentioned is pretty easy and can be done incrementally. Making an object that is also a tied reference to a hash is confusing at first, so let me know if you'd like help there.

    This probably sounds like a lot of work but I don't think that it would be. :)
