Re: comparing an ID from one file to the records in the second file
by poj (Abbot) on Dec 01, 2017 at 18:32 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %ID = ();
my $fileID = 'BreastCnAPmiRNAsID.txt';
open FILEID, '<', $fileID
    or die "cannot open $fileID: $!";
while (<FILEID>){
    chomp;
    $ID{$_}=1 if $_;
}
close FILEID;
#print Dumper \%ID;
my $fileCompare = 'tarbaseData.txt';
open FILECOMPARE, '<', $fileCompare
    or die "cannot open $fileCompare: $!";
while(<FILECOMPARE>){
    chomp;
    my @col = split "\t",$_;
    print "$col[2]\n"; # debug: show the ID being looked up
    if (exists $ID{$col[2]}){
        print "$_\n"; # chomp removed the newline, so add it back
    }
}
close FILECOMPARE;
poj | [reply] [d/l] |
|
Thank you so much for the help, poj. Much appreciated. I understand the logic of how to do this type of task, i.e., using the hash and storing the IDs as keys. But one problem remains: the code is not matching the hash key (ID) against $col[2] and is not printing the related line :(. I commented out the (print "$col[2]") line and it prints nothing. I am trying to figure this out in the meantime.
| [reply] |
|
while (<FILEID>){
chomp;
s/[\s]//g;
$ID{$_}=1 if $_;
}
poj | [reply] [d/l] |
|
Re: comparing an ID from one file to the records in the second file
by Laurent_R (Canon) on Dec 01, 2017 at 18:43 UTC
|
Hi ag88,
The typical way to solve this type of problem is to first read the second file and store each line in a hash, using the ID as the key and the full line as the value. Then close that file. Then you read the first file and, for each line, look up the hash to see if you find the ID. If you do, print out the hash value to the output file.
Here, however, since you only want to keep the matching items, it would be slightly simpler to do it the other way around: read the first file and store the ID in a hash (as hash keys, the hash value can be anything, for example number 1). Close that file once this is done. Then open the second file, read it line by line, extract the ID from the line, and print that line to the output file if the ID is found in the hash.
Something like this:
use strict;
use warnings;
my %ids;
open my $IDS, "<", "BreastCnAPmiRNAsID.txt" or die "cannot open file BreastCnAPmiRNAsID.txt $!";
while (<$IDS>) {
    chomp;
    $ids{$_} = 1; # populating the hash
}
close $IDS;
open my $FILECOMPARE, "<", "tarbaseData.txt" or die "cannot open tarbaseData.txt $!";
while (my $line = <$FILECOMPARE>) {
    my $id = (split /\s+/, $line)[2]; # extract the ID (third field)
    print $line if exists $ids{$id}; # print line if hash lookup is successful
}
close $FILECOMPARE;
This prints the result to the standard output. You'll have to open a third file in write mode and print to it if you want the result in another file.
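For the "third file" variant, here is a minimal sketch rather than a definitive version. The sub name filter_by_ids and its three arguments are hypothetical (they are not from the code above), and it assumes the ID is still the third whitespace-separated field. It also strips whitespace from each ID as it is read, which is an addition to the original logic:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copy to $out_file every line of
# $data_file whose third whitespace-separated field is a key of the ID hash.
sub filter_by_ids {
    my ($id_file, $data_file, $out_file) = @_;

    my %ids;
    open my $IDS, '<', $id_file or die "cannot open $id_file: $!";
    while (my $id = <$IDS>) {
        $id =~ s/\s+//g;              # drop newline, CR and stray spaces
        $ids{$id} = 1 if length $id;  # skip empty lines
    }
    close $IDS;

    open my $IN,  '<', $data_file or die "cannot open $data_file: $!";
    open my $OUT, '>', $out_file  or die "cannot open $out_file: $!";
    my $matches = 0;
    while (my $line = <$IN>) {
        my $id = (split /\s+/, $line)[2];   # third field
        next unless defined $id && exists $ids{$id};
        print {$OUT} $line;                 # line still has its newline
        $matches++;
    }
    close $IN;
    close $OUT or die "cannot close $out_file: $!";
    return $matches;                        # number of lines written
}
```

You would call it as filter_by_ids('BreastCnAPmiRNAsID.txt', 'tarbaseData.txt', 'matches.txt'), where 'matches.txt' is just an example name for the output file.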
Update: poj typed faster than me (or started to type earlier), our solutions are quite similar.
Update: Fixed missing quotes in the name of the first file. Thanks to 1nickt for pointing out this typo.
| [reply] [d/l] |
|
Hi ag88,
Please check the content of your hash. There may be invisible characters in the lines of your first file (an extra space, a carriage return, etc.). The best approach might be to dump the hash with something like the Data::Dumper module (which is core, so it should be on your machine).
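A minimal sketch of such a check follows. The two keys are made-up samples standing in for lines read from your ID file (one with a trailing carriage return, one with a trailing space). $Data::Dumper::Useqq and $Data::Dumper::Sortkeys are documented Data::Dumper settings:

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq    = 1;  # print control characters as escapes such as \r
$Data::Dumper::Sortkeys = 1;  # stable key order, easier to compare runs

# Sample data (an assumption, not taken from the real files)
my %ID = ("hsa-miR-381\r" => 1, "hsa-miR-300 " => 1);

# Keys are printed inside double quotes, so a trailing space or \r is visible
print Dumper(\%ID);
```

If a key shows up with a trailing \r or a space before the closing quote, the lookup with exists will fail for the clean ID from the second file.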
| [reply] [d/l] |
Re: comparing an ID from one file to the records in the second file
by jwkrahn (Abbot) on Dec 01, 2017 at 19:00 UTC
|
grep -f BreastCnAPmiRNAsID.txt tarbaseData.txt > newfile.txt
| [reply] [d/l] [select] |
|
$ ls
BreastCnAPmiRNAsID.txt tarbaseData.txt
$ cat BreastCnAPmiRNAsID.txt
hsa-miR-4700-5p
hsa-miR-300
hsa-miR-381
hsa-miR-4803
$ cat tarbaseData.txt
ENSG00000005175 RPAP3 hsa-miR-3199 Homo sapiens 293S Kidney NA HITS-CLIP POSITIVE DIRECT DOWN treatment:emetine
ENSG00000005175 RPAP3 hsa-miR-342-3p Homo sapiens HELA Cervix Cancer/Malignant HITS-CLIP POSITIVE DIRECT DOWN Hela cells were treated with control shRNA.
ENSG00000005175 RPAP3 hsa-miR-381-3p Homo sapiens HS5 Bone Marrow Normal/Primary HITS-CLIP POSITIVE DIRECT DOWN NA
ENSG00000005187 ACSM3 hsa-miR-196a-5p Homo sapiens EF3DAGO2 NA Normal/Primary PAR-CLIP POSITIVE DIRECT DOWN NA
$ grep -f BreastCnAPmiRNAsID.txt tarbaseData.txt > output.txt
$ cat output.txt
ENSG00000005175 RPAP3 hsa-miR-381-3p Homo sapiens HS5 Bone Marrow Normal/Primary HITS-CLIP POSITIVE DIRECT DOWN NA
$
| [reply] [d/l] |
|
Re: comparing an ID from one file to the records in the second file
by 1nickt (Canon) on Dec 01, 2017 at 18:15 UTC
|
Hi, welcome. It's a FAQ, there are tons of threads about it in this monastery. Have you searched? Do you know what your code does, or did you copy it from somewhere without really understanding it? As a beginner, have you worked through perlintro yet? You'll also need perlrequick if you are doing text matching.
You must always use strict; and use warnings; at the top of your code.
For example, warnings would have told you that you were trying to read from a closed filehandle.
(If you close the comparison filehandle first time through the loop the rest of the lines in the id file never have a chance to match.)
Don't copy this. (edit: because it won't work, as Laurent_R points out below. I was trying to show some errors in your code, (see above), but as others have noted your overall approach is wrong to begin with for your task.) Try to spot the differences. Ask if you have any questions:
#!/usr/bin/perl
use strict; use warnings;
my $file_id = './BreastCnAPmiRNAsID.txt';
open( my $FILEID, '<', $file_id ) or die "Died: cannot open $file_id: $!";
my $file_comp = './tarbaseData.txt';
open( my $FILECOMPARE, '<', $file_comp ) or die "Died: cannot open $file_comp: $!";
while ( my $id = <$FILEID> ) {
    chomp $id;
    # note: this inner loop reads the whole comparison file on the first pass
    while ( my $comp = <$FILECOMPARE> ) {
        chomp $comp;
        print "$id\n";
        if ( $comp =~ /\Q$id\E/ ) {
            print "\t$comp\n";
        } else {
            print "\tno match\n";
        }
    }
}
close $FILEID;
close $FILECOMPARE;
__END__
(untested)
Hope this helps!
The way forward always starts with a minimal test.
| [reply] [d/l] [select] |
|
hi 1nickt,
unless I missed something, I think that this isn't gonna work. When you read the first line of FILEID, you read the whole FILECOMPARE filehandle, and you won't have any data left to read from the second file for the next lines of the first file. Besides, even if you fixed it to get back to the beginning of the second file, the solution would be quite inefficient, because it would be reading the second file again and again for each input line of the first file. It would also print scores of "no match" to the output, even when there is actually a match.
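To illustrate the point about the exhausted filehandle, here is a small sketch. It uses an in-memory file (two made-up lines) as a stand-in for the second file, so it runs without any real data:

```perl
use strict;
use warnings;

# In-memory stand-in for the second file (sample content, an assumption)
my $data = "line 1\nline 2\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my @first_pass  = <$fh>;   # reads every line up to end of file
my @second_pass = <$fh>;   # handle is at end of file, nothing is returned

seek $fh, 0, 0 or die "cannot seek: $!";   # rewind to the start
my @third_pass  = <$fh>;   # the lines are available again
close $fh;

printf "%d %d %d\n", scalar @first_pass, scalar @second_pass, scalar @third_pass;
```

The seek call makes the nested loop work again, but it still rereads the whole second file once per ID, which is why the hash lookup is the better approach.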
| [reply] |
|
Quite likely, I didn't test it, and said "don't copy this" ... was just correcting some errors I saw in the OP. As you know well this is not the right solution to begin with. I'll update my node, thanks.
| [reply] |
|