ag88 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I would like some help. I have two files. The first file, named "BreastCnAPmiRNAsID.txt", contains the IDs (some alphanumeric text), one per line, as below:

hsa-miR-4700-5p
hsa-miR-300
hsa-miR-381
hsa-miR-4803

I want to read this file line by line, check whether the same ID is present in the second file, and extract the related information, which is also on a single row, separated by spaces. The second file, named "tarbaseData.txt", looks like this:

ENSG00000005175 RPAP3 hsa-miR-3199 Homo sapiens 293S Kidney NA HITS-CLIP POSITIVE DIRECT DOWN treatment:emetine
ENSG00000005175 RPAP3 hsa-miR-342-3p Homo sapiens HELA Cervix Cancer/Malignant HITS-CLIP POSITIVE DIRECT DOWN Hela cells were treated with control shRNA.
ENSG00000005175 RPAP3 hsa-miR-381-3p Homo sapiens HS5 Bone Marrow Normal/Primary HITS-CLIP POSITIVE DIRECT DOWN NA
ENSG00000005187 ACSM3 hsa-miR-196a-5p Homo sapiens EF3DAGO2 NA Normal/Primary PAR-CLIP POSITIVE DIRECT DOWN NA

Each line in the second file also starts with an ID, which looks something like ENS..... What I actually want is for the program to take each ID from the first file (BreastCnAPmiRNAsID.txt) and, whenever it finds the same ID in the second file, copy the complete line and write it to another file. For the time being I am printing the result to the terminal. My code, which is as follows, is not working properly.

#!/usr/bin/perl
open(FILEID, "BreastCnAPmiRNAsID.txt") || die "cannot open file";
{
    open(FILECOMPARE, "tarbaseData.txt") || die "cannot open file";
    {
        while(<FILEID>)
        {
            chomp;
            $rnaid = $_;
            while(<FILECOMPARE>)
            {
                chomp;
                print "$rnaid\n";
                if (/$rnaid/)
                {
                    print "$_\n";
                }
            }
            close(FILECOMPARE);
        }
        close(FILEID);
    }
}

If I replace "/$rnaid/" in the second while loop with a specific ID, it searches the second file and gives the output. But I am not able to compare both files correctly. Any kind of help will be appreciated. As I am new to programming, any simple, understandable approach would be highly appreciated.

Replies are listed 'Best First'.
Re: comparing an ID fom one file to the records in the second file
by poj (Abbot) on Dec 01, 2017 at 18:32 UTC

    Store the IDs as hash keys and use exists to match records in the data file.

    #!/usr/bin/perl
    use strict;
    use Data::Dumper;

    # Load the IDs from the first file as hash keys
    my %ID = ();
    my $fileID = 'BreastCnAPmiRNAsID.txt';
    open FILEID, '<', $fileID
      or die "cannot open $fileID";
    while (<FILEID>){
      chomp;
      $ID{$_}=1 if $_;
    }
    close FILEID;
    #print Dumper \%ID;

    # Scan the data file and print the records whose third column is a known ID
    my $fileCompare = 'tarbaseData.txt';
    open FILECOMPARE, '<', $fileCompare
      or die "cannot open $fileCompare";
    while(<FILECOMPARE>){
      chomp;
      my @col = split "\t",$_;
      print "$col[2]\n";    # debug output: the ID extracted from this line
      if (exists $ID{$col[2]}){
        print "$_\n";       # restore the newline removed by chomp
      }
    }
    close FILECOMPARE;
    poj

      Thank you so much for the help, poj. Much appreciated. I now understand the logic of this type of task, i.e., using a hash and storing the IDs as keys. But one problem remains: the code is not matching the hash key (ID) against $col[2] and is not printing the related line :(. I commented out the (print "$col[2]\n") and now it prints nothing. I am trying to figure this out in the meantime.

        Check that your ID file does not have hidden spaces, tabs etc. Ensure 'clean' data by adding a regex

        while (<FILEID>){
          chomp;
          s/[\s]//g;
          $ID{$_}=1 if $_;
        }
        poj
Re: comparing an ID fom one file to the records in the second file
by Laurent_R (Canon) on Dec 01, 2017 at 18:43 UTC
    Hi ag88,

    The typical way to solve this type of problem is to first read the second file and store each line in a hash, using the ID as the key and the full line as the value. Then close that file. Then you read the first file and, for each line, look up the hash to see if you find the ID. If you do, print out the hash value to the output file.

    Here, however, since you only want to keep the matching items, it would be slightly simpler to do it the other way around: read the first file and store the IDs in a hash (as hash keys; the hash value can be anything, for example the number 1). Close that file once this is done. Then open the second file, read it line by line, extract the ID from the line, and print that line to the output file if the ID is found in the hash.

    Something like this:

    use strict;
    use warnings;

    my %ids;
    open my $IDS, "<", "BreastCnAPmiRNAsID.txt" or die "cannot open file BreastCnAPmiRNAsID.txt $!";
    while (<$IDS>) {
        chomp;
        $ids{$_} = 1;    # populating the hash
    }
    close $IDS;

    open my $FILECOMPARE, "<", "tarbaseData.txt" or die "cannot open tarbaseData.txt $!";
    while (my $line = <$FILECOMPARE>) {
        my $id = (split /\s+/, $line)[2];    # extract the ID (third field)
        next unless defined $id;             # skip lines with fewer than three fields
        print $line if exists $ids{$id};     # print line if hash lookup is successful
    }
    close $FILECOMPARE;
    This prints the result to the standard output. You'll have to open a third file in write mode and print to it if you want the result in another file.
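
    A minimal sketch of that last step, not something posted in the thread: the helper name write_matches and the output file name matches.txt are hypothetical. It takes a hash of IDs like the one built above and writes the matching lines to a file instead of standard output.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: copy every line of $data_file whose third
    # whitespace-separated field is a key of %$ids into $out_file.
    # Returns the number of lines written.
    sub write_matches {
        my ($ids, $data_file, $out_file) = @_;

        open my $in,  '<', $data_file or die "cannot open $data_file: $!";
        open my $out, '>', $out_file  or die "cannot open $out_file: $!";

        my $count = 0;
        while (my $line = <$in>) {
            my $id = (split ' ', $line)[2];    # third field, if there is one
            next unless defined $id && exists $ids->{$id};
            print {$out} $line;                # $line still ends with its newline
            $count++;
        }

        close $in;
        close $out or die "cannot close $out_file: $!";
        return $count;
    }
    ```

    It could be called as write_matches(\%ids, 'tarbaseData.txt', 'matches.txt'). Opening the output with '>' overwrites an existing file; '>>' would append instead.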

    Update: poj typed faster than me (or started to type earlier), our solutions are quite similar.

    Update: Fixed missing quotes in the name of the first file. Thanks to 1nickt for pointing out this typo.

      Thank you so much for the help. I really appreciate it. But I am unable to print the line with the matching ID via (print $line if exists $ids{$id};). I am trying to figure it out :(

        Hi ag88,

        Please check the content of your hash. There may be invisible characters in the lines of your first file (like an extra space, a carriage return, etc.). The best option might be to use something like the Data::Dumper module (which is core, so it should be on your machine).
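
        As a rough illustration of that suggestion (a sketch, not something posted in the thread): setting $Data::Dumper::Useqq makes the dump show control characters as escapes, so a leftover carriage return becomes visible in the hash keys. The in-memory sample with Windows line endings is an assumption made only to keep the example self-contained, and the behaviour assumes a Unix-like system where chomp removes just the newline.

        ```perl
        use strict;
        use warnings;
        use Data::Dumper;

        # Useqq prints strings in double quotes with escapes, so a stray
        # carriage return shows up as \r instead of being invisible.
        $Data::Dumper::Useqq    = 1;
        $Data::Dumper::Sortkeys = 1;

        # Assumed sample: an ID file saved with CRLF line endings. An in-memory
        # filehandle stands in for BreastCnAPmiRNAsID.txt here.
        my $raw = "hsa-miR-300\r\nhsa-miR-381\r\n";
        open my $fh, '<', \$raw or die "cannot open in-memory file: $!";

        my %ids;
        while (my $line = <$fh>) {
            chomp $line;        # removes "\n" only; the "\r" stays in the key
            $ids{$line} = 1;
        }
        close $fh;

        my $dump = Dumper(\%ids);
        print $dump;
        ```

        If the keys in the dump end in \r, a lookup with a clean ID from the data file will not find them, which would explain an empty result.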

Re: comparing an ID fom one file to the records in the second file
by jwkrahn (Abbot) on Dec 01, 2017 at 19:00 UTC

    If you have the grep program installed then you could do:

    grep -f BreastCnAPmiRNAsID.txt tarbaseData.txt > newfile.txt

      I tried this before posting the question. The newfile.txt is empty. Somehow the comparison is not done. :(

        You must be doing something wrong since it works fine for me:

        $ ls
        BreastCnAPmiRNAsID.txt  tarbaseData.txt
        $ cat BreastCnAPmiRNAsID.txt
        hsa-miR-4700-5p
        hsa-miR-300
        hsa-miR-381
        hsa-miR-4803
        $ cat tarbaseData.txt
        ENSG00000005175 RPAP3 hsa-miR-3199 Homo sapiens 293S Kidney NA HITS-CLIP POSITIVE DIRECT DOWN treatment:emetine
        ENSG00000005175 RPAP3 hsa-miR-342-3p Homo sapiens HELA Cervix Cancer/Malignant HITS-CLIP POSITIVE DIRECT DOWN Hela cells were treated with control shRNA.
        ENSG00000005175 RPAP3 hsa-miR-381-3p Homo sapiens HS5 Bone Marrow Normal/Primary HITS-CLIP POSITIVE DIRECT DOWN NA
        ENSG00000005187 ACSM3 hsa-miR-196a-5p Homo sapiens EF3DAGO2 NA Normal/Primary PAR-CLIP POSITIVE DIRECT DOWN NA
        $ grep -f BreastCnAPmiRNAsID.txt tarbaseData.txt > output.txt
        $ cat output.txt
        ENSG00000005175 RPAP3 hsa-miR-381-3p Homo sapiens HS5 Bone Marrow Normal/Primary HITS-CLIP POSITIVE DIRECT DOWN NA
        $
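
        One thing the session above shows is that grep matched hsa-miR-381 as a substring of hsa-miR-381-3p, whereas an exact hash lookup on the third field would not count that as a match. If substring matching is what is wanted, a Perl sketch could look like the following. The helper name substring_matches is hypothetical, and the data lines are abbreviated from the sample above.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: return the lines whose third field contains
        # any of the given IDs as a literal substring.
        sub substring_matches {
            my ($ids, $lines) = @_;

            # quotemeta escapes regex metacharacters so IDs are matched literally.
            my $alternation = join '|', map { quotemeta($_) } @$ids;
            return () unless length $alternation;    # an empty pattern would match every line
            my $re = qr/(?:$alternation)/;

            return grep {
                my $field = (split ' ', $_)[2];
                defined $field && $field =~ $re;
            } @$lines;
        }

        my @ids  = qw(hsa-miR-4700-5p hsa-miR-300 hsa-miR-381 hsa-miR-4803);
        my @data = (
            "ENSG00000005175 RPAP3 hsa-miR-3199 Homo sapiens\n",
            "ENSG00000005175 RPAP3 hsa-miR-342-3p Homo sapiens\n",
            "ENSG00000005175 RPAP3 hsa-miR-381-3p Homo sapiens\n",
            "ENSG00000005187 ACSM3 hsa-miR-196a-5p Homo sapiens\n",
        );

        print substring_matches(\@ids, \@data);
        ```

        Substring matching can also produce unwanted hits (a shorter ID that happens to be the start of a longer, unrelated one), so whether exact or substring matching is right depends on how the IDs in the two files are meant to correspond.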
Re: comparing an ID fom one file to the records in the second file
by 1nickt (Canon) on Dec 01, 2017 at 18:15 UTC

    Hi, welcome. It's a FAQ, there are tons of threads about it in this monastery. Have you searched? Do you know what your code does, or did you copy it from somewhere without really understanding it? As a beginner, have you worked through perlintro yet? You'll also need perlrequick if you are doing text matching.

    You must always use strict; and use warnings; at the top of your code.

    For example, warnings would have told you that you were trying to read from a closed filehandle.

    (If you close the comparison filehandle first time through the loop the rest of the lines in the id file never have a chance to match.)
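
    A small sketch of that warning, using an in-memory filehandle so that it is self-contained (the variable names are illustrative only). Reading from a handle after close returns undef, and with warnings enabled Perl reports it:

    ```perl
    use strict;
    use warnings;

    # Collect warnings in an array so they can be inspected afterwards.
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    my $text = "first line\nsecond line\n";
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";

    my $first = <$fh>;    # reads "first line\n"
    close $fh;
    my $next = <$fh>;     # handle is closed: returns undef and emits a warning

    print "read after close returned undef\n" unless defined $next;
    print "warning: $_" for @warnings;
    ```

    Without use warnings the second read fails silently, which is why the loop in the original code simply stops producing output after the first ID.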

    Don't copy this. (edit: because it won't work, as Laurent_R points out below. I was trying to show some errors in your code, (see above), but as others have noted your overall approach is wrong to begin with for your task.) Try to spot the differences. Ask if you have any questions:

    #!/usr/bin/perl
    use strict; use warnings;
    
    my $file_id = './BreastCnAPmiRNAsID.txt';
    open( my $FILEID, '<', $file_id ) or die "Died: cannot open $file_id: $!";
    
    my $file_comp = './tarbaseData.txt';
    open( my $FILECOMPARE, '<', $file_comp ) or die "Died: cannot open $file_comp: $!";
    
    while ( my $id = <$FILEID> ) {
        chomp $id;
    
        while ( my $comp = <$FILECOMPARE> ) {
            chomp $comp;
            print "$id\n";
            if ( $comp =~ /$id/ ) {
                print "\t$comp\n";
            } else {
                print "\tno match\n";
            }
        }
    }
    
    close $FILEID;
    close $FILECOMPARE;
    __END__
    
    (untested)

    Hope this helps!


    The way forward always starts with a minimal test.
      hi 1nickt,

      Unless I missed something, I think that this isn't gonna work. When you read the first line of FILEID, you read the whole FILECOMPARE filehandle, and you won't have any data left to read from the second file for the next lines of the first file. Besides, even if you fixed it to get back to the beginning of the second file, the solution would be quite inefficient, because it would be reading the second file again and again for each input line of the first file. It would also print scores of "no match" to the output, even when there is actually a match.

        Quite likely, I didn't test it, and said "don't copy this" ... was just correcting some errors I saw in the OP. As you know well this is not the right solution to begin with. I'll update my node, thanks.


      Thank you for pointing out the mistakes in the code. I am working on it. No, I did not copy the code from somewhere else. I did it myself. That's perhaps the reason it is not working :P
