lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello
I am processing a lot of files in a directory. I tested the parsing
and it works fine.
open(my $out, ">C:/Documents and Settings/mydir/Desktop/ltvec_data.txt");
my $first_line = <DATA>;
chomp $first_line;
while(<DATA>){
    chomp;
    my @fields = split /\t/;
    my $well_position = $fields[0];
    my $sample = $fields[1];
    $sample =~ /(\d+)(.*)/;
    my $barcode = $fields[2];
    my $block_id = $fields[3];
    my $name = substr($block_id,11);
    print $out "$well_position\t$1\t$2\t$barcode\t$name\n";
}
close $out;
__DATA__
Position Sample Barcode
A01 12032 B1 NUBI825632 XXXXX Glyc 01/07/2009
B01 12032 L1 NUBI825757 XXXXX Glyc 01/07/2009
C01 12032 L2 NUBI825872 XXXXX Glyc 01/07/2009
D01 12032 L3 NUBI825997 XXXXX Glyc 01/07/2009
E01 12166 B1 NUBI826118 XXXXX Glyc 01/07/2009

This works fine. I have to perform the same parsing on 20+ files in a dir.
I revised the code to process all the files.
my $_dir = 'C:/Documents and Settings/mydir/Desktop/current/Test_Files';
open(my $out, ">C:/Documents and Settings/mydir/Desktop/current/Test_Files/data.txt");
opendir my $dh, "$dir";
# read files in dir
#my $first_line = <IN>;
#chomp $first_line;
#my $i=1;
while(my $f = readdir($dh)){
    chomp;
    my @fields = split /\t/;
    my $well_position = $fields[0];
    my $sample = $fields[1];
    $sample =~ /(\d+)(.*)/;
    my $barcode = $fields[2];
    my $block_id = $fields[3];
    my $name = substr($block_id,11);
    my $outfile = "$out$name";
    print $outfile "$well_position\t$1\t$2\t$barcode\t$name\n";
}
closedir($dh);

I am getting several errors that are basically telling me that I am using uninitialized variables ($sample, $name, and $_ in scalar chomp and split).
Any clues?

Replies are listed 'Best First'.
Re: processing a lot of files
by scorpio17 (Canon) on Jul 28, 2009 at 19:19 UTC

    I think you want something like this (untested):

    my $dir = 'C:/Documents and Settings/mydir/Desktop/current/Test_Files';
    open my $out, '>', "$dir/data.txt" or die "can't open out file: $!";
    opendir my $dh, $dir or die "can't opendir $dir : $!";
    while(my $f = readdir($dh)) {
        next if ($f eq 'data.txt');      # don't read our own output file
        open my $fh, '<', "$dir/$f" or die "can't open file $f : $!";
        my $first_line = <$fh>;          # read and discard the header line
        while (my $line = <$fh>) {
            chomp $line;
            my ($well,$sample,$barcode,$block_id) = split(/\t/, $line);
            my $name = substr($block_id, 11);
            $sample =~ /(\d+)(.*)/;
            print $out "$well\t$1\t$2\t$barcode\t$name\n";
        }
        close $fh;
    }
    closedir $dh;
    close $out;

    Notes:

    • readdir lets you loop over all the files in a directory, but you still need to open and read each file.
    • It's not good to write your output into the same place all your input files are - your script will try to read it too!
    • Be careful using the special variable $_ - this is the "default" output of many operations, so it's easy for one to clobber another. I like to store lines read from a file into a variable, just to be safe.
    • The special variables $1 and $2 (regex matches) are also easy to clobber (what if you add another regex to your script sometime in the future?) So either store them in other variables right after the regex, or else use them immediately.

      You pointed me in the right direction! The script works, even
      though I get that annoying "use of uninitialized value $1 and $2
      in concatenation (.) or string" warning at my print $out statement line.
      Should I be concerned about that?
      my $dir = 'C:/Documents and Settings/mydir/Desktop/current/Test_Files';
      # directory to search
      opendir my $dh, "$dir";
      my $i=1;
      while(my $f = readdir($dh)) {
          next if -d "$dir/$f";
          open(my $in, "$dir/$f");
          open(my $out, ">C:/Documents and Settings/mydir/Desktop/current/Test_Files/outfiles/data$i");
          my $firstline = <$in>;
          chomp $firstline;
          while(my $line = <$in>){
              chomp $line;
              my ($well_position,$sample,$barcode,$block_id) = split(/\t/, $line);
              my $name = substr($block_id, 11);
              $sample =~ /(\d+)(.*)|(\D\d)/;
              print $out "$well_position\t$1\t$2\t$barcode\t$name\n";
          }
          $i++;
          close($in);
          close($out);
      }
      closedir($dh);
      Thanks!
      LomSpace
        I get that annoying "use of uninitialized value $1 and $2 in concatenation (.) or string" warning at my print $out statement line. Should I be concerned about that?
        If you are asking this question, that means there is something that you do not understand about your code. Yes, you should be concerned, and yes, you should try to determine the root cause of the warning.

        As an aside, you should always check the results of each open and opendir:

        opendir my $dh, $dir or die "Can not open directory $dir: $!";

        I'll bet you have lines in your data files that don't match the regex, so the $1 and $2 values are undefined, and then you try to use them in the print statement. One solution is to simply clean up the input directory before running the script (i.e., make sure it contains no files other than the ones you want the script to process). The other possibility is that your data has junk in it - maybe blank lines or comment lines? If so, you simply need to check for those and skip them as needed.

        while(my $line = <$in>) {
            chomp $line;
            $line =~ s/^\s+//;           # strip leading whitespace
            next unless $line;           # skip blank lines
            next if ($line =~ /^#/);     # skip comment line
            ...
        }

        Another thing you can do is this:

        my ($x, $y) = $sample =~ /(\d+)(.*)/;
        $x = '?' unless $x;
        $y = '?' unless $y;

        This saves the regex matches into variables, so you don't have to use the special vars $1 and $2 anymore, and you can test them, give them default values, etc.
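
A minimal sketch of the same idea, wrapped in a helper so the caller can tell a failed match from an empty one. The name parse_sample is made up for illustration, and the sample value "12032 B1" is an assumption based on the data shown in the question. Treat it as a starting point, not a drop-in fix.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): split a sample
# field into its leading digits and whatever follows them.
# Returns an empty list when the field is undefined or has no digits,
# so $1 and $2 are never used after a failed match.
sub parse_sample {
    my ($sample) = @_;
    return unless defined $sample;

    # A list assignment in boolean context is true only if the match
    # succeeded and returned its two captures.
    if ( my ($id, $rest) = $sample =~ /(\d+)(.*)/ ) {
        return ($id, $rest);
    }
    return;
}

# Usage: skip the line (with a warning) instead of printing undef.
my @parts = parse_sample("12032 B1");
if (@parts) {
    print join("\t", @parts), "\n";
}
else {
    warn "sample field did not match\n";
}
```

Unlike the "unless $x" test above, this does not replace a legitimate value of 0 with a default.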

Re: processing a lot of files
by toolic (Bishop) on Jul 28, 2009 at 19:41 UTC
    In addition to the suggestions made by others, keep in mind that readdir will also return sub-directory names, as well as file names. You may need to filter out directories using -X:
    while(my $f = readdir($dh)){
        next if -d "$dir/$f";
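
A variation on the same filter, as a rough sketch: keep only entries that pass the -f (plain file) test, which also drops '.' and '..'. The helper name plain_files is hypothetical, and the directory passed in the usage lines is just the current directory.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration): return the sorted
# names of the plain files directly inside $dir. Sub-directories and
# the '.' and '..' entries fail the -f test and are left out.
sub plain_files {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    # readdir returns bare names, so prepend $dir before testing
    my @files = grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;
    my @sorted = sort @files;
    return @sorted;
}

# Usage: list the plain files in the current directory.
print "$_\n" for plain_files('.');
```

Note that the file test needs the directory prefix; testing the bare name would look in the current working directory instead.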
      That is a good look Toolic!
      Thanks!
      LomSpace
Re: processing a lot of files
by SuicideJunkie (Vicar) on Jul 28, 2009 at 17:16 UTC
    If something isn't what you expect, print the values and use them to trace along. Try printing the values where you expect them to be set, and print the source you expect them to be set from.
    For example, you set:
    $sample = $fields[1];
    but you have no check to see how many fields you actually found. Sample is undef? Then $fields[1] was undef, which implies in turn that there was no \t in your $_.
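
A rough sketch of that kind of check, assuming each data line should have at least four tab-separated fields, as in the original script. The helper name split_fields is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one tab-separated line and return its
# fields. When fewer than $expected fields are found, warn with the
# offending text and return an empty list so the caller can skip it.
sub split_fields {
    my ($line, $expected) = @_;
    chomp $line;
    my @fields = split /\t/, $line;
    if (@fields < $expected) {
        warn "Expected $expected fields, found "
           . scalar(@fields) . " in: '$line'\n";
        return;
    }
    return @fields;
}

# Usage: a file name, as returned by readdir, has no tabs in it,
# so this prints a warning instead of silently producing undef values.
my @fields = split_fields("data.txt", 4);
print "got fields: @fields\n" if @fields;
```

Printing the offending text in the warning shows right away whether you are splitting a line of data or something else, such as a file name.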
      This is not clear, particularly '$_'. I can process using open, but I run into
      problems with opendir and readdir. I want to change the format of the files. Still stuck.
        Try the following on the command line:
        perl -e 'opendir DIR, "./";@directory =readdir(DIR);for $entry (@directory){print "$entry\n";}'
        In the while loop of your failing example, $f is just a string containing the name of a file. You still have to open, process and close that file inside the loop.
Re: processing a lot of files
by Utilitarian (Vicar) on Jul 28, 2009 at 17:13 UTC
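    What values do $f and $_ have in these snippets?
    The warnings are accurate: you have not allowed for the way your loop now operates. while(my $f = readdir($dh)) puts each directory entry's name in $f and never sets $_, yet chomp and split still read $_.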