
matching information from two files and printing off the results -help needed

by Angharad (Pilgrim)
on May 07, 2009 at 20:32 UTC (#762687=perlquestion)

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have two files that look like this

file 1

    1gtiA 7
    1jpeA 4
    1jpeA 6
    1jpeA 7

file 2

    0.333 # VF
    0.267 # TE
    0.200 # YD
    0.267 # QG
    0.000 # G-
    0.000 # C-
    0.000 # A-
    0.000 # D-
    0.000 # A-
    0.000 # --
    0.000 # C-
    0.000 # Y-
    0.000 # P-
    0.200 # PD
    0.067 # EL
    1.000 # TT

I need to map the information from the first file onto the second by changing the relevant position in the second file to a lowercase letter. So the result should be

    0.333 # VF
    0.267 # TE
    0.200 # YD
    0.267 # Qg
    0.000 # G-
    0.000 # C-
    0.000 # a-
    0.000 # D-
    0.000 # A-
    0.000 # --
    0.000 # C-
    0.000 # Y-
    0.000 # P-
    0.200 # PD
    0.067 # El
    1.000 # Tt

As you can see, one has to disregard the "-" when doing the mapping. The information for the two items of interest in file 2 is presented vertically. There could be many more such items, but I'll only include two for the example.

    score hash_character item1 item2
    0.333 #              V     F
    etc

Item 1 corresponds to the item called 1gtiA in file 1, and item 2 to 1jpeA. I've got some code, but it's not working at all, so any help is much appreciated. Apologies for the number of posts recently. I'm just trying to complete a project and it's not going too smoothly! As you can see, it's very much a work in progress!

    #! /usr/local/bin/perl -w
    use FileHandle;
    use strict;

    my $scorecons_file = shift;
    #my $alignment_file = shift;
    my $csa_file = shift;
    my $column_count;
    my $res_count1 = 0;
    my $res_count2 = 0;

    warn "# Reading CSA data";
    my $hCSAData = getCSAData($csa_file);
    warn "# Got CSA data: ".scalar (keys %$hCSAData);

    my $fh_score = new FileHandle($scorecons_file, "r") || die "Cannot open seq file: $scorecons_file ($!)";

    while(my $line = $fh_score->getline)
    {
        $column_count++;
        chomp $line;
        my @field = split /\s+/, $line;
        #print "$field[0] $field[2]\n"; # test print
        my $score = $field[0];
        my $sequence = $field[2];
        #print "$sequence\n";
        my @sequence_field = split //, $sequence;
        if("$sequence_field[0]" ne "-")
        {
            $res_count1++;
            #print "$sequence_field[0] ";
            # print "$res_count1 $column_count\n";
            # if(my $hCSA = $hCSAData->{$column_count}->{$res_count1})
            # {
            #print "yes";
            # }
        }
        # if("$sequence_field[1]" ne "-")
        #{
        #$res_count2++;
        #print "$sequence_field[1] \n";
        # }
        #
    }

    ########################################

    sub getCSAData
    {
        my ($fIn) = @_;
        my $fh = new FileHandle($fIn) or die "";
        my $res;
        my $code;
        my $count = 1;
        my $key = 0;
        my $protein = "";
        my $code = 0;
        my $hData = {};
        while (my $line = $fh->getline)
        {
            my @cols = split /\s+/, $line;
            #$key = "$cols[0]" . "$cols[1]";
            #print "$cols[0] $cols[1]\n";
            $key = $cols[0];
            if("$key" ne "$protein")
            {
                $code++;
            }
            $protein = $key;
            #print "$code $cols[1]\n";
            my $hEntry = {
                'code' => $code,
                'res' => $cols[1],
            };
            my ($code, $res) = sort ($hEntry->{code}, $hEntry->{res});
            $hData->{$code}->{$res} = $hEntry;
            #$hData->{$res}->{$code} = $hEntry;
            print "$code $res\n";
        }
        return $hData;
    }

Replies are listed 'Best First'.
Re: matching information from two files and printing off the results -help needed
by graff (Chancellor) on May 08, 2009 at 01:53 UTC
    I'm guessing that the code you posted is doing a lot more work than it needs to do for the problem at hand -- or rather, you've written a lot more code than you needed to. It's a bit too messy for me to go into in detail (senseless use of multiple blank lines, random indentation, etc), so let me start over...

        #!/usr/bin/env perl
        use strict;
        use warnings;

        my $Usage = "$0 file1 file2 >\n";
        die $Usage unless ( @ARGV==2 and -f $ARGV[0] );

        my ( $file1, $file2 ) = @ARGV;
        my %edit;
        open( IN, "<", $file1 ) or die "open failed on $file1: $!\n";
        while (<IN>) {
            my ( $type, $line ) = split;
            my $offset = ( $type eq '1gtiA' ) ? 0 : 1;
            $edit{$line}[$offset]++;
        }
        close IN;

        my @item_line = ( 0, 0 );
        open( TBL, "<", $file2 ) or die "open failed on $file2: $!\n";
        while (<TBL>) {
            my @edit_field = ( /(\S)(\S)\s*$/ );
            my $changed = 0;
            for my $f ( 0, 1 ) {
                if ( $edit_field[$f] ne '-' ) {
                    $item_line[$f]++;
                    if ( exists( $edit{$item_line[$f]}[$f] )) {
                        $edit_field[$f] = lc $edit_field[$f];
                        $changed++;
                    }
                }
            }
            s/\S\S(\s*)$/join('',@edit_field,$1)/e if $changed;
            print;
        }

    (updated to include "or die ..." on the open() calls, as per normal practice.)

    I put your two sample data files into "f1" and "f2", saved the script shown here as "" and ran it like this in a bash shell: f1 f2 > f3
    and the contents of f3 matched what you posted as the desired output.
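
    The script above hard-codes the name '1gtiA' and exactly two item columns, while the question says there could be many more items. A minimal sketch of the same idea for any number of columns follows. It rests on an assumption the thread does not confirm: that the Nth distinct name appearing in file 1 corresponds to the Nth item column in file 2. The helper name mark_residues and the in-memory sample data are hypothetical, used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): lowercase the residues listed in
# file 1 inside the residue columns of file 2.
#   $pairs: array ref of "item_name residue_number" lines (file 1 format)
#   $rows:  array ref of chomped "score # RESIDUES" lines (file 2 format)
# ASSUMPTION: the Nth distinct name seen in file 1 is the Nth item column.
sub mark_residues {
    my ( $pairs, $rows ) = @_;
    my ( %column_of, %wanted );
    for my $pair (@$pairs) {
        my ( $name, $residue ) = split ' ', $pair;
        next unless defined $residue;
        if ( !exists $column_of{$name} ) {
            my $next_column = keys %column_of;    # 0, 1, 2, ...
            $column_of{$name} = $next_column;
        }
        $wanted{ $column_of{$name} }{$residue} = 1;
    }

    my @seen;    # per-column count of non-gap residues seen so far
    my @out;
    for my $row (@$rows) {
        # The residue string is the last whitespace-separated field.
        my ( $prefix, $residues ) = $row =~ /^(.*\s)(\S+)\s*$/;
        if ( !defined $residues ) { push @out, $row; next; }
        my @chars = split //, $residues;
        for my $i ( 0 .. $#chars ) {
            next if $chars[$i] eq '-';    # gaps do not count as residues
            my $n = ++$seen[$i];
            $chars[$i] = lc $chars[$i] if $wanted{$i}{$n};
        }
        push @out, $prefix . join( '', @chars );
    }
    return \@out;
}

# Sample data copied from the question.
my @file1 = ( '1gtiA 7', '1jpeA 4', '1jpeA 6', '1jpeA 7' );
my @file2 = (
    '0.333 # VF', '0.267 # TE', '0.200 # YD', '0.267 # QG',
    '0.000 # G-', '0.000 # C-', '0.000 # A-', '0.000 # D-',
    '0.000 # A-', '0.000 # --', '0.000 # C-', '0.000 # Y-',
    '0.000 # P-', '0.200 # PD', '0.067 # EL', '1.000 # TT',
);
print "$_\n" for @{ mark_residues( \@file1, \@file2 ) };
```

    With the sample data from the question, this prints the desired output shown there. If the column order in file 2 does not follow the order of names in file 1, the name-to-column mapping would have to be supplied some other way.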

    Update to add some commentary on your code: Sorry about dissing your indentation -- when I put your code into emacs, it was fine -- alas, you have to remember, when posting code here, that mixing tabs and spaces for indentation creates an unattractive appearance inside our beloved code tags; convert tabs to 8-space sequences before posting.
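
    One way to do that conversion is the core Text::Tabs module. The sketch below is only an illustration: the helper name detab is hypothetical, and expand() pads each tab out to the next tab stop, which for leading indentation gives the same result as replacing each tab with 8 spaces.

```perl
use strict;
use warnings;
use Text::Tabs;    # core module; exports expand(), unexpand() and $tabstop

$tabstop = 8;      # tab stops every 8 columns (also the module default)

# Hypothetical helper: return copies of the given lines with tabs expanded.
sub detab {
    my (@lines) = @_;
    return expand(@lines);
}

# Tab-indented sample code, printed with spaces instead of tabs.
print detab( "if (\$x) {\n", "\tprint \"yes\";\n", "}\n" );
```

    As a filter on a whole file, the same call fits in a one-liner such as perl -MText::Tabs -ne 'print expand($_)' script.pl, where script.pl stands for whatever file is about to be posted.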

    Apart from that, after removing all the unnecessary (commented out and blank) lines, it took a while to figure out which file name should be given first on the command line (i.e. what do the names "scorecons" and "csa" have to do with the file contents, and which is which anyway?) The code I posted would be improved if the $Usage message were rephrased to make this clear -- in my version, what you posted as "file1" should be the first file named on the command line. (But I think that's clear from the code itself, whereas it's a lot harder to tell in the code you posted -- all the more reason to make sure you provide a "$Usage" synopsis that is easy to get.)
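
    As a rough illustration of such a synopsis, here is one possible wording. The argument names residue_list, alignment_table and edited_table are made up for this sketch and do not come from the thread, and the helper names usage and check_args are likewise hypothetical.

```perl
use strict;
use warnings;

# Hypothetical wording: say which file goes first and what each one contains.
sub usage {
    my ($prog) = @_;
    return "Usage: $prog residue_list alignment_table > edited_table\n"
         . "  residue_list     lines of 'item_name residue_number' (file 1 in the question)\n"
         . "  alignment_table  lines of 'score # residues' (file 2 in the question)\n";
}

# True only when exactly two arguments are given and both are plain files.
sub check_args {
    my (@args) = @_;
    return ( @args == 2 && -f $args[0] && -f $args[1] ) ? 1 : 0;
}

# A real script would use: die usage($0) unless check_args(@ARGV);
# print is used here so the sketch can be run without arguments.
print usage($0) unless check_args(@ARGV);
```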

    You seem to be doing stuff that has no bearing at all on the task as you described it. On top of that, I didn't see anything in your code that would actually print out an edited version of file2 (which was supposed to be the task, right?). A "work in progress", indeed...
