Greetings fellow monks. Being a beginner this is my first post seeking your wisdom.
I am trying to edit a file in one format to another one. I have a list of values which are to be replaced with the values and their description, such as
gi|315134697|dbj|AP012030.1|=gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
gi|238859724|gb|CP001396.1|=gi|238859724|gb|CP001396.1| Escherichia coli BW2952, complete g...
gi|194400059|gb|EU855241.1|=gi|194400059|gb|EU855241.1| Shigella flexneri strain FBD047 23S...
gi|194400053|gb|EU855235.1|=gi|194400053|gb|EU855235.1| Shigella dysenteriae strain FBD056 ...
gi|169887498|gb|CP000948.1|=gi|169887498|gb|CP000948.1| Escherichia coli str. K12 substr. D...
gi|85674274|dbj|AP009048.1|=gi|85674274|dbj|AP009048.1| Escherichia coli str. K12 substr. W...
gi|48994873|gb|U00096.2|=gi|48994873|gb|U00096.2| Escherichia coli str. K-12 substr. MG1...
gi|81239530|gb|CP000034.1|=gi|81239530|gb|CP000034.1| Shigella dysenteriae Sd197, complete...
gi|5801828|gb|AF053967.1|AF053967=gi|5801828|gb|AF053967.1|AF053967 Escherichia coli strain ECOR ...
gi|5801827|gb|AF053966.1|AF053966=gi|5801827|gb|AF053966.1|AF053966 Escherichia coli rrlD operon,...
gi|406775301|gb|CP003297.1|=gi|406775301|gb|CP003297.1| Escherichia coli O104:H4 str. 2009E...
gi|383403426|gb|CP002967.1|=gi|383403426|gb|CP002967.1| Escherichia coli W, complete genome
I need to replace the value preceding the '=' sign by the value succeeding it. So I made a hash of it using the split function.
my %hash; for($i=0;$i<=$#arr0;$i++) { @arr1 = split(/\=/,$arr0[$i]); #print $#arr1; $hash{$arr1[0]} = $arr1[1]; }
Then i wanted to use this hash as reference and replace every instance of the occurrence of the hash-key by the hash value.
The file where I want to do the replacement looks like this
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 100.00 280 0 0 1 280 3402569 3402290 4e-140 506
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 227880 228159 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 2704973 2704694 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4018745 4019024 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4149866 4150145 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4191268 4191547 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 98.93 280 3 0 1 280 3924929 3925208 9e-136 491
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 100.00 280 0 0 1 280 459101 459380 4e-140 506
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 1156698 1156977 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 3643499 3643220 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4302307 4302028 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4343709 4343430 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4474830 4474551 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 98.93 280 3 0 1 280 4568646 4568367 9e-136 491
So i attempted to write the final code, as follows -
#!/usr/bin/perl -w ($a,$b,$c) = @ARGV; if (scalar @ARGV!=3) { print "Program Name [hash file] [outfmt file] [output file] \n +"; exit; } open FILE1, "$a" or die $!; @arr0 = <FILE1>; chomp(@arr0); close(FILE1); open FILE2, "$b" or die $!; @arr2 = <FILE2>; chomp(@arr2); close(FILE2); my %hash; for($i=0;$i<=$#arr0;$i++) { @arr1 = split(/\=/,$arr0[$i]); $hash{$arr1[0]} = $arr1[1]; } open(OUT, ">>$c"); for($j=0;$j<=$#arr2;$j++) { @arr3=split(/\t/,$arr2[$j]); foreach $k (keys %hash) { if ($arr3[1] eq $k) { $arr3[1] = $hash{$k}; } } print OUT "$arr3[0]\t$arr3[1]\t$arr3[2]\t$arr3[3]\t$arr3[4]\t$arr3[5]\ +t$arr3[6]\t$arr3[7]\t$arr3[8]\t$arr3[9]\t$arr3[10]\t$arr3[11]\n"; } close(OUT);
This works fine for a small file, but my files are more than 2 million lines each. I want to increase the speed of my program. Can you please share your wisdom on how to make it faster for larger files?
Regards
In reply to Replacement of data in a column of a file using Hashes created from another file by rohitmonk
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |