bioinformatics has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I'm working on yet another program, and have run into a little problem. The program is designed to pull data out of a series of text files and print it out in a series of columns. Thus far, my program gets all the required data, but is unable to output it correctly. The sub get_signal gives me a hash of arrays containing my data; I need to assign each of these arrays a different name, and then know those names so I can print them. Ideally, I want this automated enough that no matter how many files are input into the program (and therefore no matter how many arrays I end up with), the program will assign each one a unique variable name and then be capable of printing them out. My code:
#! usr/local/bin/perl
use Cwd;
print STDOUT "Please enter the name and location of the directory to parse\:\n";
$directory=<STDIN>;
chomp $directory;
open (OUTPUTFILE,">junk.txt");
opendir (DIR, "$directory") or die "Failed to open directory: $!";
@filename=readdir(DIR);
@trash=splice(@filename, 0,2);
@genius=@filename;
sub get_signal {
    while (@filename) {
        $file=shift @filename;
        @final_data='';
        use Cwd 'chdir';
        chdir "./data";
        open (FILE, "$file") or die;
        @data=<FILE>;
        $spliced_data=splice(@data, 1, 14);
        foreach (@data) {
            ($a, $b, $c, $d, $e, $f)=split(/\t/);
            push(@final_data, "$d\n");
        }
        %hash={"$file"=>@final_data};   # this hash assignment doesn't work
        close (FILE);
    }
    @values=values(%hash);
    return @values;
}
sub get_targets {
    $target=shift @genius;
    use Cwd 'chdir';
    chdir "./data";
    open (FL, "$target") or die;
    @info=<FL>;
    $excess=splice(@info, 1, 14);
    foreach (@info) {
        ($z, $x, $w, $y, $u, $v)=split(/\t/);
        push(@targets, "$z\n");
    }
    close (FL);
    return @targets;
}
@column=get_targets;
@next_columns=get_signal;
for ($i=0;$i<=scalar(@next_columns);$i++) {
    @$i=@next_columns[$i];   # my attempt at assigning a unique variable, which doesn't work
}
print @next_columns;
print OUTPUTFILE "@final_data";
close OUTPUTFILE;
exit;
Thank you all for your time and thoughts!!!
bioinformatics

Replies are listed 'Best First'.
Re: Unique Variable names...
by dragonchild (Archbishop) on Jul 29, 2003 at 18:48 UTC
    A few notes:
    1. Use strict, use my, and pass your data into your functions. The way you're doing it now, with all globals, is a recipe for a major headache.
    2. %hash={"$file"=>@final_data};   # this hash assignment doesn't work
       close (FILE);
       }
       @values=values(%hash);
      Yeah, no kidding that's not going to work. You create %hash every iteration through @filename. Try something like:
      $hash{$file} = \@final_data;
      close (FILE);
      }
      @values=values(%hash);
    If I understand your code correctly, you're attempting to read all the files in a directory and grab all the values in the 4th column of each file, as defined by a tab delimiter. That's get_signal().

    I'm not sure what you're doing with get_targets(), so I'll ignore it for now.

    I would implement the subset of your script that doesn't deal with get_targets() as such:

    #!/usr/local/bin/perl

    # Why do you need this?!?
    #use Cwd qw(chdir);

    use IO::Dir;
    use IO::File;

    print "Please enter the name and location of the directory to parse:\n";
    chomp(my $directory = <STDIN>);

    my $dh = IO::Dir->new($directory)
        || die "Cannot open directory '$directory': $!\n";

    my @filenames;
    push @filenames, $_ for map { "$directory/$_" } grep !/^\.\.?/, $dh->read;
    $dh->close;

    my %file_data;
    foreach my $filename (@filenames) {
        my @final_data;

        # Why do you need to do this?!?
        #chdir "./data";

        my $fh = IO::File->new($filename)
            || die "Cannot open file '$filename': $!\n";

        my $i = 0;
        while (<$fh>) {
            next while $i++ <= 14;
            push @{$file_data{$filename}}, (split /\t/)[3];
        }
        $fh->close;
    }

    # Now, at this point, you have a hash called %file_data
    # which is keyed by filename. Each filename points to an
    # array reference containing the values in the 4th column,
    # starting at the 15th line. What do you want to do with it?
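    For example, if all you then wanted was one column per file, an untested sketch (assuming every file yields the same number of data rows) might look like:

    my @names = sort keys %file_data;
    print join("\t", @names), "\n";                 # header row: the filenames
    my $rows  = @{ $file_data{ $names[0] } };       # row count, taken from the first file
    for my $row ( 0 .. $rows - 1 ) {
        print join("\t", map { $file_data{$_}[$row] } @names), "\n";
    }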

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      Minor style/correctness whinge, but I believe $hash{$file} = \@final_data; should be $hash{$file}=[@final_data];. I suppose it wouldn't matter if you strictly used my @foo; every time the loop went around, but I think this way would be cleaner, with less chance to bug up someplace.
        Actually, it does matter, and it can matter a lot. $x{$y} = \@z; takes a reference to a data structure that already exists. $x{$y} = [@z]; creates a new anonymous array and copies the existing data structure into it. That can be an expensive operation. Taking a reference always takes the same amount of time, regardless of how many elements there are in @z.

        I suppose it wouldn't matter if you strictly used my @foo; every time the loop went around, but I think this way would be cleaner, with less chance to bug up someplace.

        Always using my @foo; every time the loop went around is both cleaner and less bug-prone. I am having the language handle my memory management for me. The language will always do it right - I might not. The rule of thumb is that if you're doing $x{$y} = [@z]; and you don't have a compelling reason why, you probably are doing something that is bug-prone and should rewrite it.
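        To make the distinction concrete, a small untested sketch (read_column() is just a hypothetical stand-in for whatever fills the array):

        my %hash;
        for my $file (@filenames) {
            my @final_data = read_column($file);    # fresh lexical array each pass (read_column is hypothetical)
            $hash{$file} = \@final_data;             # store a reference -- no copying, constant time
            # $hash{$file} = [@final_data];          # would build a new anonymous array and copy every element
        }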

        ------
        We are the carpenters and bricklayers of the Information Age.

        Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

        Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      The reason I used the Cwd module is that I needed to change the current directory for the program to function. Hence, when I use your program, I have the same issue: it is unable to open the files until the working directory is changed within the program.
      Bioinformatics
        bioinformatics,
        Why do you need to use Cwd when chdir will work just fine?

        Why do you put the chdir inside the foreach loop? Wouldn't changing the directory once, before the loop, be sufficient?

        As a final note - dragonchild's code provides the directory and file name in the file list, so the chdir really is not required.
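        Roughly (untested; @files stands in for your list of plain file names and $directory is whatever the user typed in):

        # Option 1: chdir once, before the loop
        chdir $directory or die "Cannot chdir to '$directory': $!";
        for my $file (@files) {
            open my $fh, '<', $file or die "Cannot open '$file': $!";
            # ... read from $fh ...
        }

        # Option 2: skip chdir entirely and open each file by its full path
        for my $file (@files) {
            open my $fh, '<', "$directory/$file" or die "Cannot open '$directory/$file': $!";
            # ... read from $fh ...
        }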

        Cheers - L~R

      Thank you for your suggestions. I have incorporated a number of them into my program, as well as made it more streamlined. However, my problem still remains. I need to print the data from @signal into consecutive columns, as shown in my post at the end of the thread. The only way I know how to make this manageable is to take the data pushed into @signal, grep it, and then shove it into 5 separate arrays (there are currently 5 input files). This is all well and good, except that I need to somehow make this program capable of handling a different number of files each time it is run. Do I need to manually assign, say, 30 arrays in which to put the data, placing a limit on the program? I'm sure there are better ways to do this, but I don't know how.
      NOTE: please be patient with me, as I'm only a beginning programmer, perl being my first language. Having only been working with it for a month now, I suppose I could be doing worse...:-)
      My latest code:
      #! usr/local/bin/perl -w
      use Cwd;
      use IO::Dir;
      use IO::File;
      print STDOUT "Please enter the name and location of the directory to parse\:\n";
      chomp (my $directory=<STDIN>);
      open (OUTPUTFILE,">junk.txt");
      my $dh = IO::Dir->new($directory) || die "Cannot open directory '$directory': $!\n";
      my @filenames;
      push @filenames, $_ for map { "$directory/$_" } grep !/^\.\.?/, $dh->read;
      $dh->close;
      @output=get_signal;
      for (@output){
          @{$signal[0-4]};
          @{$rprobe};
      }
      print OUTPUTFILE "@signal";
      close OUTPUTFILE;
      exit;
      sub get_signal {
          while (@filename) {
              $file=shift @filename;
              use Cwd 'chdir';
              chdir "./data";
              open (FILE, "$file") or die;
              @data=<FILE>;
              my $i=0;
              foreach (@data) {
                  next while $i++ <=14;
                  push @signal, (split(/\t/))[3];
              }
              my $g=0;
              foreach (@data) {
                  next while $g++ <=14;
                  @probe=(split(/\t/))[0];
              }
              $hash{$file}=\@signal;
              close (FILE);
          }
          @values=values(%hash);
          $rprobe=\@probe;
          return @values;
          return $rprobe;
      }
      Bioinformatics
Re: Unique Variable names...
by CountZero (Bishop) on Jul 29, 2003 at 19:15 UTC

    Without already being able to give you a solution, I have the following comments:

    1. Just a style argument: why do you put the sub definitions in the middle of your code? It makes the structure a lot harder to read.
    2. Splicing the first two items off your array of filenames/directories is a nice trick if you can be sure that the first two items are always the dot and dot-dot entries. That is not guaranteed and may not be portable across all OSes (see the small sketch after this list for a name-based filter instead).
    3. Your program assumes (as is your good right) a very specific directory and file structure (the top level only holds directories, and each such directory contains a "data" file), which makes it difficult to test your script if one doesn't have the same structure.
    4. get_targets and get_signal seem to go through the same "data" file, just extracting different items (the first and the fourth, respectively) and saving the rest in variables which are never used (if you did use warnings you would have received some warnings in this respect). The same goes for the variables $scratch, $excess and $spliced_data, which are essentially just garbage bins in your script.
    5. Rather than using global variables, you could pass an argument list to your subs. If you did that, you would really see that you are using the same arguments in both subroutines. Now you use @genius and @filename, which are just copies of each other, but that is not readily apparent.
    6. What you are trying to do with %hash={"$file"=>@final_data} beats me. Could you explain it?
    7. Why do you return the value of @targets to @column? You never use the @column array anywhere.
    8. Why did you think @$i=@next_columns[$i] would work? Can you explain your reasoning behind it?
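    Regarding point 2, a small untested sketch of filtering readdir by name instead of by position:

    opendir (DIR, $directory) or die "Failed to open directory: $!";
    my @filename = grep { !/^\.\.?$/ } readdir(DIR);   # drop '.' and '..' wherever they appear
    closedir (DIR);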
    About the "unique variable name" thing: why would you need that? I'm not convinced that it is necessary for your purpose. May I suggest that you give us an example of your inputs and your expected output? That would make it a lot easier to help you.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      My input data file would look something like:
      AFFX-BioB-5_at   20   20   200.2   P   0.001
      AFFX-BioB-M_at   20   20   400.4   P   0.002
      AFFX-BioB-3_at   20   20   200.5   P   0.003
      I want the 4th column, with the signal data. Actually, that other subroutine gets only the first column from the first file; I didn't know how to do that any other way, since everything else was in a loop. This way, I have everything set up to look like this (if it worked, anyway):
      AFFX-BioB-5_at   200.0   300.0   400.0
      AFFX-BioB-M_at   200.0   300.0   400.0
      AFFX-BioB-3_at   200.0   300.0   400.0
      Thanks for your suggestions!!
      Bioinformatics

        OK, now we're talking.

        Assuming you want a result which lists first the AFFX-BioB-xxx_at identifier, followed by all the signal data connected to that identifier, I suggest:

        • you drop the get_targets sub and all references to it.
        • Then you change your sub get_signal to:
          while (@filename) {
              my $i;
              $file=shift @filename;
              use Cwd 'chdir';
              chdir "./data";
              open (FILE, "$file") or die;
              while (<FILE>) {
                  next while $i++ <= 14;
                  (my $id, undef, undef, my $signal, undef, undef)=split(/\t/);
                  push @{$outputdata{$id}}, $signal;
              }
              close (FILE);
          }
          After having run this sub over all your files you will find in %outputdata a nicely ordered (per identifier) structure of your signal-data.
        • "Printing this datastructure goes as follows:
          for $id (keys %outputdata) {
              print "$id:\t", (join("\t", @{$outputdata{$id}})), "\n";
          }
          Of course you can print it to a filehandle. This is a format which is suitable for importing into a database or a spreadsheet.
        The "magic" of using references to anonymous arrays may perhaps be a bit too deep for someone who is just starting to program, but if you read Chapters 8 and 9 of the Camel book a few times and study the examples given, much will become clearer.

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Unique Variable names...
by LameNerd (Hermit) on Jul 29, 2003 at 18:31 UTC
    You could get the "names" like so ...
    sub get_signal {
        ...
        @values=values(%hash);
        @names=keys(%hash);
        return (\@values, \@names);
    }
    ...
    my ( $arrRef_next_columns, $arrRef_names ) = get_signal;
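    And then, assuming each value is an array reference (as in dragonchild's $hash{$file} = \@final_data; fix), you could walk the two returned structures in parallel (untested):

    my ( $arrRef_next_columns, $arrRef_names ) = get_signal();
    for my $i ( 0 .. $#{$arrRef_names} ) {
        print "$arrRef_names->[$i]:\t@{ $arrRef_next_columns->[$i] }\n";
    }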