BioNrd has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I am trying to do something that Perl was made for. However, it is giving me the runaround.

I have 2 text files (represented in the code by RESULT and DATA). I would like to match the name of the file in the two separate text files (i.e. D:/Program Files/Eclipse/workspace... with its match in the other file), while at the same time manipulating the data contained within RESULT to contain info from DATA.

To put it another way: RESULT has the data I need, but lacks identifying numbers. DATA has the identifying numbers but lacks the data. I want a file that has the data I need and the ID numbers.

I have started to get the idea below, but can't seem to make it the rest of the way. I need to somehow match @my_data to $woot_woot, and then reprint everything to look something like this (all spaces are tabs):

For the first match:
6
4 0.03609 0
3 0.0416887 0.0305891 0
2 0.0281343 0.0229377 0.0346726 0
5 0.0512866 0.0432893 0.0442696 0.0455792 0

Here is the code I have started; it knows where it wants to go, but just can't get there.

open (RESULT, "FSTdist.sum") || die "Can't open FSTdist.sum: $!\n";
# RESULT contains:
# Sorry, I don't know how to make data happen twice in a script.
# I tabbed over once from the #; the tab does not exist in the text file.
#
# -------------------------------------------------------------------
# Arlequin batch run summary file"
# file_name
# "D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" 0
# 0.03609 0
# 0.0416887 0.0305891 0
# 0.0281343 0.0229377 0.0346726 0
# 0.0512866 0.0432893 0.0442696 0.0455792 0
# p-Values matrix : No of permutations : 110
#
# 0
# 0 0
# 0 0 0
# 0 0 0 0
#
# "D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" 0
# 0.0100529 0
# 0.00927136 0.00858208 0
# 0.00938612 0.0144204 0.0112178 0
# 0.00852316 0.00662521 0.00921253 0.0148837 0
# 0.00552453 0.00510669 0.0145038 0.0130359 0.0141625 0
# p-Values matrix : No of permutations : 110
#
# 0.234234
# 0.324324 0.567568
# 0.36036 0.027027 0.216216
# 0.432432 0.792793 0.396396 0.00900901
# 0.900901 0.936937 0.027027 0.0900901 0.045045
#
# "D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" 0
# 0.03609 0
# 0.0416887 0.0305891 0
# 0.0281343 0.0229377 0.0346726 0
# 0.0512866 0.0432893 0.0442696 0.0455792 0
# p-Values matrix : No of permutations : 110
#
# 0
# 0 0
# 0 0 0
# 0 0 0 0
#
# "D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" 0
# 0.0100529 0
# 0.00927136 0.00858208 0
# 0.00938612 0.0144204 0.0112178 0
# 0.00852316 0.00662521 0.00921253 0.0148837 0
# 0.00552453 0.00510669 0.0145038 0.0130359 0.0141625 0
# p-Values matrix : No of permutations : 110
#
# 0.297297
# 0.423423 0.414414
# 0.378378 0.027027 0.234234
# 0.459459 0.756757 0.45045 0.027027
# 0.882883 0.918919 0.036036 0.0990991 0.027027

while (<RESULT>){ @my_data=<RESULT>; print "@my_data\n" }

my $HEADERS = 2;
my @prev_cells;
while (<DATA>) { # Input data is really from a different file.
    chomp;
    my @cells = split /\"/, $_;

    # Remove redundant headers.
    my @display_cells = @cells;
    if (@prev_cells) {
        for (@display_cells[ 0 .. $HEADERS-1 ]) {
            if ($_ ne shift @prev_cells and $_ =~ m/D:/) {
                $woot_woot = $_;
            }
            $_ = '';
        }
        print "$woot_woot\n";
        print "$display_cells[3]\n";
        # I am thinking I need a hash with $woot_woot as the key, and all values
        # of $display_cells[3] as the value of the key.
        # But I can't figure how to do that.
    }
    @prev_cells = @cells;
}

__DATA__
-------------------------------------------------------------------
Arlequin batch run summary file"
Gene diversity indices:
file_name pop_name nb_gene_copies nb_haplotypes [orig_nb_haplotypes] num_loci _nb_usable_loci _nb_pol_sites gene_diversity
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "6" 60 58 10 10 10 0.998870
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "4" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "3" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "2" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "5" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "6" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "4" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "1" 60 59 10 10 10 0.999435
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "3" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "2" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "5" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "6" 60 58 10 10 10 0.998870
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "4" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "3" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "2" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_final_0.res/sample_20080422103123_1_final_0.arp" "5" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "6" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "4" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "1" 60 59 10 10 10 0.999435
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "3" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "2" 60 60 10 10 10 1.000000
"D:/Program Files/Eclipse/workspace/Converter_PC/recoded_pop_gen/20080503203813/sample_20080422103123_1_starting_0.res/sample_20080422103123_1_starting_0.arp" "5" 60 60 10 10 10 1.000000
__END__
Thanks for any/all help.

UPDATE: I wanted to share my final solution. Here the gen_div.sum file is the __DATA__ from above, and the pairdist.sum file is the file content that is commented out above. Thanks for the advice on processing files, both for plotting out my course and for the mock-up code. I took it and ran with both.

my %data;

open(A, '<', 'gen_div.sum');
while (<A>) {
    chomp;
    my @cells = split(/\"/, $_);
    if ($cells[1] =~ m/:\//) {
        push @{$data{$cells[1]}}, $cells[3];
    }
}
close(A);

open(B, '<', 'pairdist.sum');
while (<B>) {
    chomp;
    my @cells = split(/\"/, $_);
    if ($cells[1] =~ m/:\//) {
        for my $key_finder (keys %data) {
            if ($key_finder eq $cells[1]) {
                $unlocked      = $key_finder;
                @holder_inside = @{$data{$unlocked}};
                @holder_sig    = @holder_inside;
                my $pop = shift @holder_inside;
                shift @holder_sig;
                print "$key_finder FINDER\n";
                print "$pop\t0\n";
            }
        }
    }
    else {
        if (exists $holder_inside[0]) {
            my $pop_value = shift @holder_inside;
            print "$pop_value\t@cells\n";
        }
        else {
            @sig_check = split(/\t/, $cells[0]);
            if ($sig_check[0] =~ /[0-9]/ and not m/p-Value/) { #/[0-9]+\..*\.output/
                my $pop_value = shift @holder_sig;
                print "$pop_value\t";
                foreach my $item (@sig_check) {
                    if ($item =~ m/\d/ and $item <= 0.05) {
                        print "+\t";
                    }
                    elsif ($item =~ m/\d/ and $item >= 0.05) {
                        print "-\t";
                    }
                }
                print "\n";
            }
        }
    }
}
---- Even a blind squirrel finds a nut sometimes.

Replies are listed 'Best First'.
Re: Combine files, while parsing info.
by pc88mxer (Vicar) on May 04, 2008 at 07:27 UTC
    First of all, this looks like it's going to be a problem:
    while (<RESULT>){ @my_data=<RESULT>; print "@my_data\n" }
    What you probably want is just:
    @my_data = <RESULT>;
    otherwise you will be discarding the first line of the RESULT file. If you really do want to discard the first line, I would perhaps use:
    <RESULT>;             # reads a line from RESULT
    @my_data = <RESULT>;  # reads rest into @my_data

    Secondly, I wouldn't use my @cells = split /\"/, $_; to parse the DATA file. It looks like the data is either tab or whitespace delimited with some fields being enclosed in double quotes. I would see if Text::CSV can parse the data since it already has support for de-quoting fields. Otherwise, if no whitespace occurs within double quotes, something like this may work:

    while (<DATA>) {
        chomp;
        my @cells = split(' ', $_);
        for (@cells) { s/^"(.*)"$/$1/ }
        ...
    }
    Finally, the general approach I would use to combine the files is to read one of the files into a hash; the key of the hash will be the field or combination of fields that is common to both files. Then when you process the other file, you can look up the corresponding data in the hash based on the values you parsed for the common fields.

    I can't figure out exactly which fields you need to join on, so here's a contrived example which demonstrates the approach:
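    A minimal, self-contained sketch of that hash-join approach follows. The paths and ID values here are made up (hypothetical), and an in-memory filehandle stands in for the DATA file:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the DATA file: path <TAB> id, one pair per line.
my $data_text = "a.arp\t6\na.arp\t4\nb.arp\t5\n";

# Pass 1: build a hash keyed on the shared field (the file path),
# collecting every ID seen for that path.
my %id_for;
open my $dfh, '<', \$data_text or die "can't open in-memory handle: $!";
while (my $line = <$dfh>) {
    chomp $line;
    my ($path, $id) = split /\t/, $line;
    push @{ $id_for{$path} }, $id;
}
close $dfh;

# Pass 2: walk the RESULT-style records and look up their IDs.
my @result = ( [ 'a.arp', '0.03609 0' ], [ 'b.arp', '0.0100529 0' ] );
my @joined;
for my $rec (@result) {
    my ($path, $values) = @$rec;
    my $ids = $id_for{$path} || [];   # no match -> empty list
    push @joined, "@$ids\t$values";
}
print "$_\n" for @joined;
```

    The key point is that the lookup in pass 2 is a single hash access rather than a scan of the whole DATA file for every RESULT line.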

Re: Combine files, while parsing info.
by Erez (Priest) on May 04, 2008 at 07:03 UTC

    The main question here is what criteria you wish to use when binding RESULT and DATA. From what I understood it's not line-to-line, but a paragraph in RESULT for each line in DATA, based on the filepath matches. For that purpose, there are, (ahem), more than one way to do it.

    One way would be to create an @array out of DATA (not RESULT) and use that data as hash keys, or grep for it for each line in RESULT.
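    A minimal sketch of that first idea, with made-up (hypothetical) stand-in lines for both files: pull the quoted paths out of DATA into a hash, then grep RESULT against it.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-ins for the two files.
my @data_lines   = ( '"x.arp" "6"', '"y.arp" "4"' );
my @result_lines = ( '"x.arp" 0', '0.03609 0', '"y.arp" 0' );

# Pull each quoted path out of DATA and use it as a hash key.
my %seen = map { /"([^"]+\.arp)"/ ? ($1 => 1) : () } @data_lines;

# grep RESULT for the lines whose path also appears in DATA.
my @matched = grep {
    my ($path) = /"([^"]+\.arp)"/;
    defined $path && $seen{$path};
} @result_lines;

print "$_\n" for @matched;
```

    Lines that carry no path (the matrix rows) simply fall through the grep.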
    Another way would be to split RESULT according to the filepath:

    undef $/;   # slurps entire file
    while (<RESULT>){
        @my_data = split m{(?:"D:/Program Files/)};   # splits according to the filepath without removing it
    }

    Then iterate over DATA and match the two, substituting the "D:/Program Files...arp" part with the appropriate line from DATA. In cases such as this, writing down exactly what you need to get, and what you want from each file, will give you a "blueprint" of the program you'll end up writing.
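    That substitution step might be sketched like this, with a hypothetical one-entry lookup table standing in for DATA and a shortened fake path:

```perl
use strict;
use warnings;

# Hypothetical lookup: bare filename -> the DATA line that should replace the path.
my %data_line = ( 'x.arp' => "x.arp\t6" );

# One RESULT paragraph (shortened, fake path).
my $paragraph = qq{"D:/fake/x.arp" 0\n0.03609 0\n};

# Swap the quoted path for the corresponding DATA line; if no entry
# exists, $& keeps the original quoted path untouched.
$paragraph =~ s{"[^"]*?([^"/]+\.arp)"}{$data_line{$1} // $&}e;

print $paragraph;
```

    Running the substitution once per paragraph keeps the matrix rows below the path line intact.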

    Stop saying 'script'. Stop saying 'line-noise'.
    We have nothing to lose but our metaphors.