in reply to Help with parsing through a comma delimted file

I did something like this once using just arrays. I don't know if this is better or worse than hashes. I used multidimensional arrays, so you end up referring to the data as $data[row][column]. I also made it a sub that returns the array, which makes it easier to use in other scripts.
#!/usr/local/bin/perl -w
use strict;

my @sorted = sort_file('data.txt');
my $cnt  = 0;
my $size = @sorted;
while ($cnt < $size) {
    print "$sorted[$cnt][0] is of type $sorted[$cnt][1] - $sorted[$cnt][2]\n";
    $cnt++;
}
exit();

sub sort_file {
    my $filename = shift;
    open DATA, $filename or die "Unable to open file $filename: $!\n";
    my @record;
    my $row = 0;
    while (<DATA>) {
        chomp;
        # create an array, one element per field
        my @line = split /,/;
        # move the values into the multi-dimensional array
        $record[$row][0] = $line[0];
        $record[$row][1] = $line[1];
        $record[$row][2] = $line[2];
        $row++;
    }
    return (sort { $a->[0] <=> $b->[0] } @record);
}
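
For what it's worth, with a small data.txt along these lines (made-up sample rows, just to show the shape the script expects):

    3,error,disk full
    1,info,backup started
    2,warning,low disk space

the script prints each record's three fields, sorted numerically on the first field.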

PS--Any comments are welcome, as I am now getting to know enough to *really* shoot myself in the foot :>

Re (tilly) 2: Help with parsing through a comma delimted file
by tilly (Archbishop) on Mar 15, 2001 at 16:46 UTC
    In addition to my comments to jlawrenc, I would like to point out that in Perl explicit indexing is usually unnecessary, and by avoiding it you can seriously reduce the potential for error. Instead, in this case, you can use push. The section that just reads the file would then reduce to:
    while (<DATA>) {
        chomp;
        push @record, [split(/,/, $_, -1)];
    }
    which is considerably shorter, faster, and reduces the chance of error.

    Also, I think that reading the file and sorting it are two different things. You are likely to want to read the file into a data structure for lots of reasons, and you are likely to later discover the need to sort that data in lots of ways. Why not have two functions?
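
    A minimal sketch of that split, reusing the push-based loop above (the sub names are only illustrative):

    sub read_file {
        my $filename = shift;
        open DATA, $filename or die "Unable to open file $filename: $!\n";
        my @record;
        while (<DATA>) {
            chomp;
            push @record, [split(/,/, $_, -1)];
        }
        close DATA;
        return @record;
    }

    sub sort_by_first_field {
        # numeric sort on the first column, as in the original sort_file()
        return sort { $a->[0] <=> $b->[0] } @_;
    }

    my @sorted = sort_by_first_field(read_file('data.txt'));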

    Of course whenever I see a CSV format with the field names not in the first row, I tend to get upset. And I really prefer hashes. Therefore the above snippet of code would set off a bunch of danger signs for me. Certainly any data format that I have any say in will include the columns in the definition of the format, and code that handles it will be expected to handle columns moving around. In this simple case a function to read the format could look like this:

    use strict;
    use Carp;

    # Time passes...

    sub read_csv {
        my $file = shift;
        local *CSV;
        open (CSV, $file) or confess("Cannot read '$file': $!");
        my $header = <CSV>;
        chomp($header);
        my @fields = split /,/, $header;
        # You could do an error check for repeated field names...
        my @data;
        while (<CSV>) {
            chomp;
            my $row;
            @$row{@fields} = split(/,/, $_, -1);
            push @data, $row;
        }
        return @data;
    }
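
    Calling it then looks something like this (the header names here are only an assumption for illustration):

    my @rows = read_csv('data.txt');
    # assuming the file's first line is a header such as:  Count,Type,Message
    foreach my $row (@rows) {
        print "$row->{Count} is of type $row->{Type} - $row->{Message}\n";
    }
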
    I keep meaning to clean up and then post a more robust version of this that handles quoting and fields with embedded commas and returns, and that can be used either for slurping (like this) or for stream-oriented processing...
Re: Re: Help with parsing through a comma delimted file
by jlawrenc (Scribe) on Mar 14, 2001 at 20:29 UTC
    Your suggestion is quite feasible. In fact, it is not much different from mine.

    By using arrays you get more frugal memory usage and faster performance. You do trade off readability when it comes time to use the data. Personally, for maintainability I'll use a hash any day when I can give data a meaningful label rather than a cryptic number. I might even change the section around @line=split /,/ a bit, to something along the lines of this:

    [ defined earlier ]
    my @fieldname = qw(Count Type Message);

    [ in the loop ]
    my @line = split /,/;
    foreach (my $i = 0; $i <= $#line; $i++) {
        $record[$row]{$fieldname[$i]} = $line[$i];
    }
    Again, from the perspective of maintainability I'd make two constructive comments on your suggestion. The first is the use of $_. I think this variable is confusing for newbies. Really, when you think about it, a default variable is rather unique to Perl. As well, I feel it can be an accident waiting to happen. Sure, use it, especially if performance and crisp, clean code are your objectives. But if you're not in a crunch, I always do the following:
    while (my $line = <DATA>) {
        ...
    }
    Then it is really obvious that you have read a _line_ of DATA from your ol' data file.

    While we're talking about lines, here is my second constructive comment. You have chosen the variable @line to hold the result of splitting a line of data. A subtle notion, but I feel the meaning of the data is much clearer if you name it for how it is going to be accessed. After the split you are not accessing lines of data, you are accessing fields from a record or line. Calling it "@field" means that you will be referring to "$field[0]", "$field[1]", etc., which come from a $line of <DATA>.
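
    Put together with the snippet above, that renaming might read something like this (just a sketch, keeping the same loop style as before):

    while (my $line = <DATA>) {
        chomp $line;
        my @field = split /,/, $line;
        foreach (my $i = 0; $i <= $#field; $i++) {
            $record[$row]{$fieldname[$i]} = $field[$i];
        }
        $row++;
    }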

    Not to say that there is anything _wrong_ with your code, but these changes would aid long-term maintainability. As I have seen, code has a nasty habit of lasting longer than expected. Worst of all, the quick 'n dirties often last the longest.

    J

    Update: fixed comparison on for loop - should be <=$#line, not <$#line. My mistake.

      Your update underscores a point I like to make: whenever you can, avoid C-style loops in favour of Perl's foreach construction. Thus you would write:
      foreach my $i (0..$#line) {
          $record[$row]{$fieldname[$i]} = $line[$i];
      }
      The reason is that you will make fewer fencepost errors that way. (They are one of the most common bugs in C, but generally automatically avoidable in Perl.)

      Beyond that I would recommend using a -1 for the third argument to split (so that trailing empty fields are not silently dropped), and given that you have tabular data I would lose the loop entirely in favour of a hash slice:

      my %rec;
      @rec{@fields} = split(/,/, $_, -1);
      $record[$row] = \%rec;
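      (Here @fields plays the same role as the @fieldname array above; think of something like my @fields = qw(Count Type Message); declared once, outside the loop.)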
      (I would actually clean it up even more, more on that in another post.)