ishazyi has asked for the wisdom of the Perl Monks concerning the following question:

I am supposed to be writing code to rearrange some data in some files I have. The rearranging code has been written for me (by a co-researcher). My problem now is that I have to finish the script so that it loops through all 50 of my files. I'm slowly learning, but I am still VERY new and have no idea what I am doing most of the time.

Below is the code as it is right now, to be used one file at a time. $InputFile2 is the file whose name changes (testprefix#, where the number ranges from 162 to 217, with 50 files in total). Any help would be greatly appreciated.

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

my $InputFile1 = "edited_archaea_master_list.txt";
open FILE1, $InputFile1 or die "Can't read source file: $InputFile1\n";

my $InputFile2 = "testprefix172_single_line_nostrains_aligned_gaps_test.meg";
open FILE2, $InputFile2 or die "Can't read second source file: $InputFile2\n";

my $OutputFile3 = "rearrangement_out_test.txt";
open FILE3, ">$OutputFile3" or die "Can't open output file: $OutputFile3\n";

print FILE3 "Input #1: $InputFile1\n";
print FILE3 "Input #2: $InputFile2\n\n";

my @species = <FILE1>; #take file 1 and make into an array
my @align = <FILE2>; #samsies
print "array @align\n";

my $limit = @align; #Number of lines read in the array
my $species; #used in the foreach loop through @species array ...
my $i;
my $j;
my $FoundFlag; #variable used to store names not found and print out at the end
my @notFound; #used to remember species that are not found in order to print at the end ...

#loop through master list ...
foreach $species(@species)
{
    chomp($species); #deletes the enter character at the end the species, so rather than looking for "species\n", it can just find "species"
    #print "$species\n";
    $FoundFlag = 0;
    #Scan second file for a match ...
    for ($i=0; $i<$limit; $i++) #$i is the line number, set to start at zero, $limit is the total number of lines in file, $i++ continues to every consecutive line
    {
        chomp($align[$i]);
        if ($align[$i] =~ /$species/) #if the line in align array matches the line in the species file, then print, and will print to output file
        {
            #print "Found ....\n";
            print $align[$i];
            $j = $i; #$j equals to the species found in the mega file, do then prints the align of $j plus all the lines that follow, until the next # sign.
            do
            {
                print FILE3 "$align[$j]";
                $j = $j + 1;
            } while (($j < $limit) && !($align[$j] =~ /^#/)); #align of j will continue to print until the $limit, which is the end of the file, AND until the next pound sign.
            $FoundFlag = 1; #if the species is found, then foundflag equals one, meaning its true.
            last; #quit FOR loop ... and look for the next species
        }
    }
    if ($FoundFlag == 0)
    {
        print "I did not find $species\n";
        #print FILE3 "I did not find $species\n";
        push @notFound, $species;
    }
    #print "Next one ...\n";
}

#Write not found species ...
foreach my $item(@notFound)
{
    print FILE3 "I did not find $item\n";
}
close FILE3;
print "Done!\n";
exit;

I apologize for all of the comments, it is the only way I know what the code is doing at this point in my learning. Again, I would greatly appreciate any help.

Re: I'm new and unwise and need help
by Anonymous Monk on Feb 15, 2015 at 21:22 UTC

    If you're new to Perl, some good things to read would be perlintro and Getting Started with Perl.

    There are several cleanups that could be done on the script, for example using lexical filehandles (that means open my $file1, ... instead of open FILE1, ...), or waiting to define several of the variables until they're actually needed (for example for (my $i=0;...)). But regarding your question:
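    As a rough illustration of those two cleanups, here is a minimal sketch. The helper name read_lines is made up for this example and is not part of the original script; it simply shows a lexical filehandle with three-argument open, and a loop counter declared where it is used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): read every line of a
# file into a list, with the trailing newlines removed.
sub read_lines {
    my ($path) = @_;
    # Lexical filehandle ($fh) with three-argument open; $! holds the OS error.
    open my $fh, '<', $path or die "Can't read $path: $!\n";
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

# Example use; only runs when a file name is given on the command line.
if (@ARGV) {
    my @lines = read_lines($ARGV[0]);
    # $i is declared in the loop header, so it only exists inside the loop.
    for (my $i = 0; $i < @lines; $i++) {
        print "$i: $lines[$i]\n";
    }
}
```

    Because $fh is an ordinary variable, it cannot collide with a handle of the same name elsewhere in the program, which matters once the code is called many times from a loop.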

    The two ways to do this with the smallest amount of change to your existing code would be, instead of my $InputFile1 = ..., to 1. provide the three file names via the command line and then write a quick shell script to call the script many times, or 2. if you wanted to do it all in Perl, wrap the entire code in a subroutine and call it in a loop.

    1. First approach, see @ARGV on accessing the command line parameters:

    die "Usage: $0 InputFile1 InputFile2 OutputFile\n" unless @ARGV==3;
    my ($InputFile1, $InputFile2, $OutputFile3) = @ARGV;
    # ... rest of the code here

    Then, your script could be called from the command line or a shell script via:

    perl script.pl edited_archaea_master_list.txt testprefix172_single_line_nostrains_aligned_gaps_test.meg rearrangement_out_test.txt

    2. Second approach: Wrap all the code you've got now (except the exit, of course) into a sub process_files { ... } (see perlsub for all the details on subroutines), and get the input/output files as arguments:

    sub process_files {
        my ($InputFile1, $InputFile2, $OutputFile3) = @_;
        # ... rest of your code here
    }

    And then the sub can be called like so:

    my @filesets = (
        { input1 => "edited_archaea_master_list.txt",
          input2 => "testprefix172_single_line_nostrains_aligned_gaps_test.meg",
          output => "rearrangement_out_test.txt" },
        # ... more file sets here
    );
    for my $fileset (@filesets) {
        process_files( $fileset->{input1}, $fileset->{input2}, $fileset->{output} );
    }

    The data structure used for @filesets is called an "array of hashes". It is explained (including examples of how to generate them) in perldsc, and also in perlreftut. You can also build @filesets dynamically by listing the files in a folder, for example using Path::Class. Also, if you use this approach, it'd be good to change the filehandles to lexical ones, as noted at the beginning.
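    One possible way to build @filesets dynamically is sketched below. It uses Perl's built-in glob rather than Path::Class, purely to avoid a non-core module. The helper name filesets_from_dir, the "testprefix*.meg" wildcard, and the output name "rearrangement_out_<number>.txt" are all assumptions made for this example; adjust them to your real naming scheme.

```perl
use strict;
use warnings;

# Hypothetical helper: return one hash per matching .meg file found in $dir.
# Assumes $dir contains no whitespace, since glob() splits patterns on spaces.
sub filesets_from_dir {
    my ($dir, $master) = @_;
    my @filesets;
    # glob() expands the wildcard; sort gives a stable, repeatable order.
    for my $input2 (sort glob("$dir/testprefix*.meg")) {
        # Pull the number out of the file name to label the output file.
        my ($n) = $input2 =~ /testprefix(\d+)/ or next;
        push @filesets, {
            input1 => $master,
            input2 => $input2,
            output => "$dir/rearrangement_out_$n.txt",  # assumed output name
        };
    }
    return @filesets;
}

# List what would be processed in the current directory.
my @filesets = filesets_from_dir(".", "edited_archaea_master_list.txt");
print "$_->{input2} -> $_->{output}\n" for @filesets;
```

    Giving each input its own output name avoids every run overwriting the same rearrangement_out_test.txt.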

    You hint in your description that one of the input files will stay the same. In that case, the script could be optimized to read that file just once and use that data many times, instead of re-reading the same file every time. As I said, the above approaches are geared towards the least amount of change to your existing code.
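    A minimal sketch of that optimization, assuming the master list is the file that stays the same: read it once into an array, then hand a reference to that array to the sub for each input file. The helper read_species, the changed signature of process_files, and the "$input2.out" output name are illustrative assumptions, not the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: read the master list once and return the names.
sub read_species {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't read $path: $!\n";
    chomp(my @species = <$fh>);
    close $fh;
    return @species;
}

# Hypothetical variant of process_files: it receives the already-loaded
# list as an array reference instead of the master file's name.
sub process_files {
    my ($species, $InputFile2, $OutputFile3) = @_;
    for my $name (@$species) {
        # ... matching code from the original script goes here
    }
    return;
}

# Usage: perl script.pl master_list.txt file1.meg file2.meg ...
if (@ARGV >= 2) {
    my ($master, @inputs) = @ARGV;
    my @species = read_species($master);    # read once ...
    for my $input2 (@inputs) {
        # ... and reuse for every input file (output name is an assumption)
        process_files(\@species, $input2, "$input2.out");
    }
}
```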

Re: I'm new and unwise and need help
by smknjoe (Beadle) on Feb 16, 2015 at 17:24 UTC
    I'd do option #2 as anonymous monk suggested. I'd then write a for loop to loop through your 50 files, calling the subroutine each time. I'd create the filehandles on-the-fly, incrementing the number in the filehandle each time from 162 to 217.

      Welcome to the Monastery, smknjoe!

      Just to point out a minor technicality: You probably mean filenames, which isn't the same thing as a filehandle.

        Ooops, I meant filenames, not filehandles.
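    The loop over file names described above could be sketched roughly as follows. It assumes every input file follows the pattern of the one name given in the question, and it invents the output name "rearrangement_out_<number>.txt". The process_files here is only a stand-in that prints what it would do. Since 162 to 217 covers 56 numbers but there are only 50 files, numbers without a file are skipped.

```perl
use strict;
use warnings;

# Stand-in for the real sub, which would hold the original script's code.
sub process_files {
    my ($InputFile1, $InputFile2, $OutputFile3) = @_;
    print "Would process $InputFile2 -> $OutputFile3\n";
    return;
}

# Hypothetical helper: the input file name for a given number
# (pattern assumed from the single example in the question).
sub input_name {
    my ($n) = @_;
    return "testprefix${n}_single_line_nostrains_aligned_gaps_test.meg";
}

for my $n (162 .. 217) {
    my $input2 = input_name($n);
    next unless -e $input2;    # skip numbers that have no file
    process_files("edited_archaea_master_list.txt", $input2,
                  "rearrangement_out_$n.txt");    # assumed output name
}
```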