comparing an ID from one file to the records in the second file

ag88 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I want some help. I have two files. One file contains the IDs (some alphanumeric text) as below. The file is named "BreastCnAPmiRNAsID.txt".

hsa-miR-4700-5p
hsa-miR-300
hsa-miR-381
hsa-miR-4803

I want to read this file line by line, check whether the same ID is present in the second file, and extract the related information, which is also on a single line, separated by spaces. The second file, named "tarbaseData.txt", looks like this:

ENSG00000005175    RPAP3    hsa-miR-3199    Homo sapiens    293S    Kidney    NA    HITS-CLIP    POSITIVE    DIRECT    DOWN    treatment:emetine
ENSG00000005175    RPAP3    hsa-miR-342-3p    Homo sapiens    HELA    Cervix    Cancer/Malignant    HITS-CLIP    POSITIVE    DIRECT    DOWN    Hela cells were treated with control shRNA.
ENSG00000005175    RPAP3    hsa-miR-381-3p    Homo sapiens    HS5    Bone Marrow    Normal/Primary    HITS-CLIP    POSITIVE    DIRECT    DOWN    NA
ENSG00000005187    ACSM3    hsa-miR-196a-5p    Homo sapiens    EF3DAGO2    NA    Normal/Primary    PAR-CLIP    POSITIVE    DIRECT    DOWN    NA

Each new line in the second file also starts with an ID, which looks something like ENS..... What I actually want is for the program to take each ID from the first file (BreastCnAPmiRNAsID.txt) and, whenever it finds the same ID in the second file, copy the complete line and write it to another file. For the time being I am printing the result to the terminal. My code, which is not working properly, is as follows.

#!/usr/bin/perl

open(FILEID, "BreastCnAPmiRNAsID.txt") || die "cannot open file";
{
open(FILECOMPARE, "tarbaseData.txt") || die "cannot open file";
{
   while(<FILEID>)
   {
    chomp;
    $rnaid = $_;
    while(<FILECOMPARE>)
    {
    chomp;
    print "$rnaid\n";
    if (/$rnaid/)
    {
    print "$_\n";
    }
    }
    close(FILECOMPARE);
    }
close(FILEID);
}
}

If I replace "/$rnaid/" in the second while loop with a specific ID, it searches the second file and gives the output. But I am not able to compare both files correctly. Any kind of help will be appreciated. As I am new to programming, any simple, understandable approach would be highly appreciated.


Replies are listed 'Best First'.
Re: comparing an ID from one file to the records in the second file by poj (Abbot) on Dec 01, 2017 at 18:32 UTC
Store the IDs as hash keys and use exists to match records in the data file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %ID = ();
my $fileID = 'BreastCnAPmiRNAsID.txt';
open FILEID, '<', $fileID or die "cannot open $fileID: $!";
while (<FILEID>){
  chomp;
  $ID{$_}=1 if $_;
}
close FILEID;
#print Dumper \%ID;

my $fileCompare = 'tarbaseData.txt';
open FILECOMPARE, '<', $fileCompare or die "cannot open $fileCompare: $!";
while(<FILECOMPARE>){
  chomp;
  my @col = split "\t",$_;
  print "$col[2]\n";      # debug: show the ID extracted from this record
  if (exists $ID{$col[2]}){
    print "$_\n";         # the line was chomped, so add the newline back
  }
}
close FILECOMPARE;
```

poj
Re^2: comparing an ID from one file to the records in the second file by ag88 (Novice) on Dec 02, 2017 at 11:27 UTC
Thank you so much for the help, poj. Much appreciated. I understand the logic of this type of task, i.e., using a hash and storing the IDs as keys. But one problem remains: the code is not matching the hash key (ID) against $col[2] and is not printing the related line :(. I commented out the (print "$col[2]\n") and it prints nothing. I am trying to figure this out in the meantime.
Re^3: comparing an ID from one file to the records in the second file by poj (Abbot) on Dec 02, 2017 at 11:45 UTC
Check that your ID file does not have hidden spaces, tabs etc. Ensure 'clean' data by adding a regex: `while (<FILEID>){ chomp; s/[\s]//g; $ID{$_}=1 if $_; }`
poj
Re^4: comparing an ID from one file to the records in the second file by ag88 (Novice) on Dec 02, 2017 at 12:33 UTC
Re: comparing an ID from one file to the records in the second file by Laurent_R (Canon) on Dec 01, 2017 at 18:43 UTC
Hi ag88,

The typical way to solve this type of problem is to first read the second file and store each line in a hash, using the ID as the key and the full line as the value. Then close that file. Then you read the first file and, for each line, look up the hash to see if you find the ID. If you do, print out the hash value to the output file.

Here, however, since you only want to keep the matching items, it would be slightly simpler to do it the other way around: read the first file and store the IDs in a hash (as hash keys; the hash value can be anything, for example the number 1). Close that file once this is done. Then open the second file, read it line by line, extract the ID from the line, and print that line to the output file if the ID is found in the hash. Something like this:

```perl
use strict;
use warnings;

my %ids;
open my $IDS, "<", "BreastCnAPmiRNAsID.txt" or die "cannot open file BreastCnAPmiRNAsID.txt $!";
while (<$IDS>) {
    chomp;
    $ids{$_} = 1;     # populating the hash
}
close $IDS;
open my $FILECOMPARE, "<", "tarbaseData.txt" or die "cannot open tarbaseData.txt $!";
while (my $line = <$FILECOMPARE>) {
    my $id = (split /\s+/, $line)[2];   # extract the ID (third field)
    print $line if exists $ids{$id};    # print line if hash lookup is successful
}
close $FILECOMPARE;
```

This prints the result to the standard output. You'll have to open a third file in write mode and print to it if you want the result in another file.

Update: poj typed faster than me (or started to type earlier); our solutions are quite similar.

Update: Fixed missing quotes in the name of the first file. Thanks to 1nickt for pointing out this typo.
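The last remark above (opening a third file in write mode) could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the sub name `extract_matches` and the output file name `matches.txt` are made up, and the miRNA ID is assumed to be the third whitespace-separated field, as in the sample data. The ID lines are stripped of all whitespace, which also covers stray carriage returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copy every line of $data_file
# whose third whitespace-separated field is listed in $id_file to $out_file.
# Returns the number of lines written.
sub extract_matches {
    my ($id_file, $data_file, $out_file) = @_;

    my %ids;
    open my $IDS, '<', $id_file or die "cannot open $id_file: $!";
    while (my $id = <$IDS>) {
        $id =~ s/\s+//g;               # drop newline, CR and stray blanks
        $ids{$id} = 1 if length $id;   # skip empty lines
    }
    close $IDS;

    open my $IN,  '<', $data_file or die "cannot open $data_file: $!";
    open my $OUT, '>', $out_file  or die "cannot open $out_file: $!";
    my $count = 0;
    while (my $line = <$IN>) {
        my $id = (split /\s+/, $line)[2];   # third field holds the miRNA ID
        next unless defined $id && exists $ids{$id};
        print {$OUT} $line;                 # line still carries its newline
        $count++;
    }
    close $IN;
    close $OUT or die "cannot close $out_file: $!";
    return $count;
}

# Runs only when three file names are given on the command line.
if (@ARGV == 3) {
    my $n = extract_matches(@ARGV);
    print "$n matching line(s) written to $ARGV[2]\n";
}
```

Saved under some name such as `extract.pl` (also just an example), it would be called as `perl extract.pl BreastCnAPmiRNAsID.txt tarbaseData.txt matches.txt`.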
Re^2: comparing an ID from one file to the records in the second file by ag88 (Novice) on Dec 02, 2017 at 11:37 UTC
Thank you so much for the help. I really appreciate it. But I am unable to print the line with the matched ID via (print $line if exists $ids{$id};). I am trying to figure it out :(
Re^3: comparing an ID from one file to the records in the second file by Laurent_R (Canon) on Dec 02, 2017 at 11:45 UTC
Hi ag88, please check the content of your hash. There may be invisible characters in the lines of your first file (such as an extra space or a carriage return). The best approach might be to inspect it with something like the `Data::Dumper` module (which is core, so it should be on your machine).
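One way to follow this advice is sketched below. It sets `$Data::Dumper::Useqq`, which makes the dump show control characters as escapes. The two input strings are made-up stand-ins for lines read from an ID file with Windows (CRLF) line endings and a stray trailing blank; whether the poster's file really looks like this is an assumption.

```perl
use strict;
use warnings;
use Data::Dumper;

# Print strings double-quoted with escapes, so "\r", "\t" and similar
# characters become visible in the dump.
$Data::Dumper::Useqq    = 1;
$Data::Dumper::Sortkeys = 1;

# Made-up sample lines, not the poster's actual file content.
my @lines = ("hsa-miR-300\r\n", "hsa-miR-381 \n");

my %ids;
for my $line (@lines) {
    chomp $line;          # removes "\n" only; "\r" and blanks survive
    $ids{$line} = 1;
}

print Dumper(\%ids);      # keys are shown with the trailing \r or blank
print exists $ids{'hsa-miR-300'} ? "found\n" : "not found\n";
```

With keys like these, a lookup for the clean ID fails even though the file looks correct in an editor, which would be consistent with the symptoms described in the thread.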
Re: comparing an ID from one file to the records in the second file by jwkrahn (Abbot) on Dec 01, 2017 at 19:00 UTC
If you have the `grep` program installed then you could do: `grep -f BreastCnAPmiRNAsID.txt tarbaseData.txt > newfile.txt`
Re^2: comparing an ID from one file to the records in the second file by ag88 (Novice) on Dec 02, 2017 at 11:41 UTC
I tried this before posting the question. The newfile.txt is empty. Somehow the comparison is not done. :(
Re^3: comparing an ID from one file to the records in the second file by hippo (Archbishop) on Dec 02, 2017 at 15:28 UTC
You must be doing something wrong since it works fine for me:

```
$ ls
BreastCnAPmiRNAsID.txt  tarbaseData.txt
$ cat BreastCnAPmiRNAsID.txt
hsa-miR-4700-5p
hsa-miR-300
hsa-miR-381
hsa-miR-4803
$ cat tarbaseData.txt
ENSG00000005175    RPAP3    hsa-miR-3199    Homo sapiens    293S    Kidney    NA    HITS-CLIP    POSITIVE    DIRECT    DOWN    treatment:emetine
ENSG00000005175    RPAP3    hsa-miR-342-3p    Homo sapiens    HELA    Cervix    Cancer/Malignant    HITS-CLIP    POSITIVE    DIRECT    DOWN    Hela cells were treated with control shRNA.
ENSG00000005175    RPAP3    hsa-miR-381-3p    Homo sapiens    HS5    Bone Marrow    Normal/Primary    HITS-CLIP    POSITIVE    DIRECT    DOWN    NA
ENSG00000005187    ACSM3    hsa-miR-196a-5p    Homo sapiens    EF3DAGO2    NA    Normal/Primary    PAR-CLIP    POSITIVE    DIRECT    DOWN    NA
$ grep -f BreastCnAPmiRNAsID.txt tarbaseData.txt > output.txt
$ cat output.txt
ENSG00000005175    RPAP3    hsa-miR-381-3p    Homo sapiens    HS5    Bone Marrow    Normal/Primary    HITS-CLIP    POSITIVE    DIRECT    DOWN    NA
$
```
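Note that in this output the ID `hsa-miR-381` selected the `hsa-miR-381-3p` record: `grep -f` treats each ID as a pattern that may match anywhere in the line, while the hash-based replies compare the third field exactly. Which behaviour is wanted depends on the task. The difference can be sketched in Perl as follows, using the sample IDs and a shortened, tab-separated version of the sample record (the tab separator is an assumption):

```perl
use strict;
use warnings;

my @ids  = qw(hsa-miR-4700-5p hsa-miR-300 hsa-miR-381 hsa-miR-4803);
my $line = "ENSG00000005175\tRPAP3\thsa-miR-381-3p\tHomo sapiens\tHS5";

# Pattern match, roughly what grep -f does: the ID may occur anywhere.
my @substring_hits = grep { $line =~ /\Q$_\E/ } @ids;

# Exact match on the third field, as in the hash-based replies.
my %ids       = map { $_ => 1 } @ids;
my $field     = (split /\t/, $line)[2];
my $exact_hit = exists $ids{$field} ? 1 : 0;

print "substring hits: @substring_hits\n";   # prints "substring hits: hsa-miR-381"
print "exact hit: $exact_hit\n";             # prints "exact hit: 0"
```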
Re^4: comparing an ID from one file to the records in the second file by ag88 (Novice) on Dec 02, 2017 at 16:47 UTC
Re: comparing an ID from one file to the records in the second file by 1nickt (Canon) on Dec 01, 2017 at 18:15 UTC
Hi, welcome. It's a FAQ; there are tons of threads about it in this monastery. Have you searched? Do you know what your code does, or did you copy it from somewhere without really understanding it? As a beginner, have you worked through perlintro yet? You'll also need perlrequick if you are doing text matching.

You must always `use strict;` and `use warnings;` at the top of your code. For example, `warnings` would have told you that you were trying to read from a closed filehandle. (If you close the comparison filehandle the first time through the loop, the rest of the lines in the ID file never have a chance to match.)

Don't copy this. (edit: because it won't work, as Laurent_R points out below. I was trying to show some errors in your code (see above), but as others have noted, your overall approach is wrong to begin with for your task.) ~~Try to spot the differences. Ask if you have any questions:~~

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $file_id = './BreastCnAPmiRNAsID.txt';
open( my $FILEID, '<', $file_id ) or die "Died: cannot open $file_id: $!";

my $file_comp = './tarbaseData.txt';
open( my $FILECOMPARE, '<', $file_comp ) or die "Died: cannot open $file_comp: $!";

while ( my $id = <$FILEID> ) {
    chomp $id;
    while ( my $comp = <$FILECOMPARE> ) {
        chomp $comp;
        print "$id\n";
        if ( $comp =~ /$id/ ) {
            print "\t$comp\n";
        }
        else {
            print "\tno match\n";
        }
    }
}

close $FILEID;
close $FILECOMPARE;
__END__
```

(untested)

Hope this helps!

The way forward always starts with a minimal test.
Re^2: comparing an ID from one file to the records in the second file by Laurent_R (Canon) on Dec 01, 2017 at 18:52 UTC
Hi 1nickt, unless I missed something, I think that this isn't gonna work. When you read the first line of FILEID, you read the whole FILECOMPARE filehandle, and you won't have any data left to read from the second file for the next lines of the first file. Besides, even if you fixed it to get back to the beginning of the second file, the solution would be quite inefficient, because it would be reading the second file again and again for each input line of the first file. It would also print scores of "no match" to the output, even when there is actually a match.
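The filehandle exhaustion described here can be reproduced in a small, self-contained sketch. It uses in-memory filehandles with made-up sample content instead of the real files, and a hypothetical sub `nested_matches` that counts matches with the nested-loop approach, optionally rewinding the inner handle with `seek` before each pass.

```perl
use strict;
use warnings;

# In-memory stand-ins for the two files (made-up sample content).
my $id_text   = "hsa-miR-300\nhsa-miR-381-3p\n";
my $data_text = "ENSG1 RPAP3 hsa-miR-381-3p\nENSG2 ACSM3 hsa-miR-300\n";

# Count matches with the nested-loop approach; rewind the inner handle
# before each pass only if $rewind is true.
sub nested_matches {
    my ($rewind) = @_;
    open my $IDS,  '<', \$id_text   or die "cannot open ID data: $!";
    open my $RECS, '<', \$data_text or die "cannot open record data: $!";
    my $matches = 0;
    while (my $id = <$IDS>) {
        chomp $id;
        seek $RECS, 0, 0 if $rewind;    # go back to the start of the records
        while (my $rec = <$RECS>) {
            $matches++ if $rec =~ /\Q$id\E/;
        }
    }
    close $IDS;
    close $RECS;
    return $matches;
}

print "without rewind: ", nested_matches(0), "\n";   # prints 1
print "with rewind:    ", nested_matches(1), "\n";   # prints 2
```

Without the rewind, only the first ID is ever compared against the records. With it, both matches are found, but the records are re-read once per ID, which is the inefficiency that the hash-based solutions avoid.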
Re^3: comparing an ID from one file to the records in the second file by 1nickt (Canon) on Dec 01, 2017 at 18:58 UTC
Quite likely. I didn't test it, and said "don't copy this" ... I was just correcting some errors I saw in the OP. As you well know, this is not the right solution to begin with. I'll update my node, thanks. The way forward always starts with a minimal test.
Re^4: comparing an ID from one file to the records in the second file by Laurent_R (Canon) on Dec 01, 2017 at 19:55 UTC
Re^2: comparing an ID from one file to the records in the second file by ag88 (Novice) on Dec 02, 2017 at 11:45 UTC
Thank you for pointing out the mistakes in the code. I am working on it. No, I did not copy the code from somewhere else. I did it myself. That's perhaps the reason it is not working :P