Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi dear monks,

I have hundreds of big CSV files (~3 GB) and I need to read the 3rd field of the 2nd line of each file.
Here is my naive approach for achieving this; I would like to know if it's correct and how I can speed it up.

Thanks in advance,
# -- the type of the file can be found by checking if the value
# -- of the 3rd field of the 2nd line is a number
sub check_field {
    my ($file_name) = @_;
    open my $fh, "<$file_name"
        or die "*** ERROR opening '$file_name': $!";
    my @fields;
    while (<$fh>) {
        if ($. == 2) {
            chomp;
            @fields = split /;/;
            die '*** ERROR number of fields !=4' unless @fields == 4;
            last;
        }
    }
    close($fh);
    return $fields[2] =~ /^\d+$/;
}

Replies are listed 'Best First'.
Re: advice for reading data from a file
by Limbic~Region (Chancellor) on Jan 18, 2004 at 18:46 UTC
    Anonymous Monk,
    Your approach seems fine if you only need to read the second line of each file, but you don't say whether you will ever need to process the entire file. In that case I would suggest using Text::CSV_XS, which has the added bonus of properly handling embedded delimiters if you run into that problem.
    #!/usr/bin/perl -w
    use strict;
    use Text::CSV_XS;

    my @files = qw(foo bar blah asdf);
    for my $file ( @files ) {
        if ( File_Type( $file ) ) {
            print "Do something with $file\n";
        }
    }

    sub File_Type {
        my $file = shift;
        open (INPUT , '<' , $file) or die "Unable to open $file for reading : $!";
        my $csv = Text::CSV_XS->new( {'sep_char' => ';'} );
        while ( <INPUT> ) {
            next if $. != 2;
            chomp;
            if ( $csv->parse($_) ) {
                my @field = $csv->fields;
                die 'Incorrect number of fields' if @field != 4;
                return $field[2] =~ /^\d+$/ ? 1 : 0;
            }
            else {
                print "Unable to parse: ", $csv->error_input, "\n";
                return 0;
            }
        }
    }
    I left most of your code intact as you probably have it that way for a reason.

    Cheers - L~R

      Thanks for your answer, Limbic~Region,

      but here I just need to process the 2nd line, so I didn't want to fire up Text::CSV_XS just for that :)

Re: advice for reading data from a file
by Aragorn (Curate) on Jan 18, 2004 at 18:16 UTC
    Seems perfectly reasonable to me. If this routine works for the files you have to process, it is correct. Maybe a warn instead of the dies in the routine can be used to tag the "corrupt" files so that the program doesn't bail out if only 1 or 2 files of the hundreds are faulty. But this may not be appropriate for your purpose.

    Arjen

      I agree with Aragorn about getting rid of the die. I would not even bother with the warn unless I wanted to watch it. I think it would be better to send your errors to a log file, along with the file name and any other stats, so you can continue to process the correct files; that could include logging the files we can't open. Also, you are performing a regex on the return value, which may be undef. I would think you should do the regex before you return, and log the file too if the field doesn't meet your criteria, so you can be sure you have a valid return.
        Also, you are performing a regex on the return value, which may be undef. I would think you should do the regex before you return, and log the file too if the field doesn't meet your criteria

        Excuse me, but I don't understand: here I'm not returning the value of the field but the return value of the regexp, which I think can only be 0 or 1, but I may be wrong.

        Do you mean that my sub can return 'undef' in some cases?
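One way to settle this question is to look at what a failed match actually hands back. The sketch below is only an illustration; the helper name is_number is made up for the example and simply returns the raw match result, as the original sub does. In scalar context a failed match should give a defined but empty string rather than undef, while in list context it gives an empty list.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the raw result of the match,
# the same way the original check_field does.
sub is_number {
    my ($value) = @_;
    return $value =~ /^\d+$/;
}

my $hit  = is_number('42');     # scalar context, match succeeds
my $miss = is_number('abc');    # scalar context, match fails
my @list = is_number('abc');    # list context, match fails

print "hit is ", ($hit ? "true" : "false"), "\n";
print "miss is ", (defined $miss ? "defined" : "undef"),
      " with length ", length($miss), "\n";
print "list context gave ", scalar(@list), " element(s)\n";
```

The list-context case is the one that can bite: if the sub is called inside a list (for example while building a hash), a failed match contributes nothing at all, which is one argument for returning an explicit 1 or 0.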

      thanks for your answer,

      Actually, the code seems to work fine on the files.

      Concerning the 'die' in the sub: I need it, because if one file is faulty the whole process needs to be stopped.

      In fact, I first check the types of all the files inside an eval {}, and the sub dies in case of an error so I can catch it.

      But I didn't mention that in my post, so thanks anyway :)
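The eval {} arrangement described above might look roughly like the following. This is a sketch under assumptions: check_line is a hypothetical stand-in for check_field that works on a string instead of a file, so the example runs without any real CSV files.

```perl
use strict;
use warnings;

# Hypothetical stand-in for check_field: takes the second line
# directly and dies when it does not have exactly four fields.
sub check_line {
    my ($line) = @_;
    my @fields = split /;/, $line;
    die "*** ERROR number of fields != 4\n" unless @fields == 4;
    return $fields[2] =~ /^\d+$/ ? 1 : 0;
}

# Validate everything up front; the first die aborts the eval block.
my @lines = ('a;b;12;d', 'a;b;c');
my $ok = eval {
    check_line($_) for @lines;
    1;    # reached only if nothing died
};
my $error = $ok ? '' : $@;

print "validation failed: $error" unless $ok;
```

Having the eval block end in a true value and testing that, rather than testing $@ alone, is a common defensive idiom.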

Re: advice for reading data from a file
by pg (Canon) on Jan 18, 2004 at 18:43 UTC

    There is not much room left for improvement.

    But if I were doing this, I would probably not use $. and would just count lines myself, which is not a big deal.

    Personally I think (100% personal) that using $. reduces maintainability. If one day you (or someone else) decide to modify your code for whatever reason, and a second file gets involved in your while loop, your program can easily break, as there is only one $. across all files, and its value is only valid for the last filehandle accessed.
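Counting lines by hand could be sketched like this. The sub name second_line and the in-memory filehandle are assumptions made for the example, so that it runs without a real file; the point is only that the counter belongs to the loop and does not depend on which filehandle was read last.

```perl
use strict;
use warnings;

# Returns the second line of an already opened filehandle,
# or nothing if the input has fewer than two lines.
sub second_line {
    my ($fh) = @_;
    my $count = 0;                  # our own counter instead of $.
    while (my $line = <$fh>) {
        next unless ++$count == 2;
        chomp $line;
        return $line;
    }
    return;
}

# An in-memory filehandle stands in for one of the big CSV files.
my $data = "header\nx;y;42;z\nrest of the file\n";
open my $in, '<', \$data or die "cannot open in-memory data: $!";
my $line = second_line($in);
close $in;

print "second line: $line\n";
```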

      Thanks for pointing that out, pg,

      I think you're right and I will get rid of $.

Re: advice for reading data from a file
by Roger (Parson) on Jan 19, 2004 at 00:09 UTC
    Adding to other monks' comments, I can see two problems with your code:

    1) while (<$fh>) { ...
    This will break if the first line of the file is 0, the second line will not be read.

    2) return $fields[2] =~ /.../;
    What if the array @fields is empty? You will get warnings (assuming you have 'use warnings' in your code; do you?)

    So I would suggest adding more error checking to the code to make it more robust.
    sub check_field {
        my $file_name = shift;
        open my $fh, "<$file_name"
            or die "*** ERROR opening '$file_name': $!";
        my @fields = ();
        # defined() needs the explicit assignment to $_; a bare
        # defined(<$fh>) would read a line without storing it
        while (defined ($_ = <$fh>)) {
            if ($. == 2) {
                chomp;
                @fields = split /;/;
                return 0 unless $#fields == 3;
                last;
            }
        }
        return 0 if $#fields < 0;
        return $fields[2] =~ /^\d+$/;
    }

      Hi Roger,

      I'm using 'warnings', but thanks for the 'defined' that I missed :}

      But could you please explain why you are checking $#fields again in the line:

      return 0 if $#fields < 0;

      as it's already done in the loop (== 3)?

        That guards against the case where your file has fewer than 2 lines, since @fields only gets populated by the second line of the file.

      1) while (<$fh>) { ...

      This will break if the first line of the file is 0, the second line will not be read.

      Hmm. Funny, it doesn't seem to behave that way for me, and I wouldn't expect it to. The magical while(<>) statement (with or without an explicit file handle) is actually shorthand for while( defined( $_ = <> ))

      Try it out with a file that has just "0\n" as the first line and anything after that on other lines -- I've tried it a number of ways, and the only way I could get it to stop at the first line was:

      while ( <> > 0) ...
      which is admittedly the sort of thing that very few people would do inadvertently.
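A small experiment along these lines, as a sketch: the in-memory filehandle is an assumption that stands in for a real test file. Its last line is a bare "0" with no trailing newline, which is the one value a plain truth test would reject, so it exercises the implicit defined() as well.

```perl
use strict;
use warnings;

# First line is "0\n"; last line is "0" without a newline.
my $data = "0\nsecond line\n0";
open my $fh, '<', \$data or die "cannot open in-memory data: $!";

my $lines_read = 0;
while (<$fh>) {        # behaves like: while (defined($_ = <$fh>))
    $lines_read++;
}
close $fh;

print "lines read: $lines_read\n";
```

All three lines should be counted, including both of the "0" lines.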
Re: advice for reading data from a file
by davido (Cardinal) on Jan 19, 2004 at 04:25 UTC
    I am going to weigh in here a little late.

    It seems to me that all the bother of setting up a while loop is unnecessary if all you're doing is skipping the first line of the file, reading the second, and exiting. I might write such a sub like this:

    sub check_field {
        open my $fh, "<", shift or die "Bleah!\n$!";
        <$fh>;    # Skip the unwanted line.
        my @fields = split /;/, <$fh>;
        close $fh;
        die "Ick!\n" unless @fields == 4;
        return( ($fields[2] =~ /^\d+$/) ? 1 : 0 );
    }

    It's not really a matter of golf; I just find that, if we are only reading the first two lines of a file, calling <$fh> twice is somehow preferable to breaking out of a while loop after the second line.

    Also, I think your goal is to return true if $fields[2] matches only digits. I've used the ternary operator to ensure that undef never gets returned.


    Dave