rohitmonk has asked for the wisdom of the Perl Monks concerning the following question:
Greetings, fellow monks. I am a beginner, and this is my first post seeking your wisdom.
I am trying to convert a file from one format to another. I have a list of values, each of which is to be replaced by the value followed by its description, such as:
gi|315134697|dbj|AP012030.1|=gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
gi|238859724|gb|CP001396.1|=gi|238859724|gb|CP001396.1| Escherichia coli BW2952, complete g...
gi|194400059|gb|EU855241.1|=gi|194400059|gb|EU855241.1| Shigella flexneri strain FBD047 23S...
gi|194400053|gb|EU855235.1|=gi|194400053|gb|EU855235.1| Shigella dysenteriae strain FBD056 ...
gi|169887498|gb|CP000948.1|=gi|169887498|gb|CP000948.1| Escherichia coli str. K12 substr. D...
gi|85674274|dbj|AP009048.1|=gi|85674274|dbj|AP009048.1| Escherichia coli str. K12 substr. W...
gi|48994873|gb|U00096.2|=gi|48994873|gb|U00096.2| Escherichia coli str. K-12 substr. MG1...
gi|81239530|gb|CP000034.1|=gi|81239530|gb|CP000034.1| Shigella dysenteriae Sd197, complete...
gi|5801828|gb|AF053967.1|AF053967=gi|5801828|gb|AF053967.1|AF053967 Escherichia coli strain ECOR ...
gi|5801827|gb|AF053966.1|AF053966=gi|5801827|gb|AF053966.1|AF053966 Escherichia coli rrlD operon,...
gi|406775301|gb|CP003297.1|=gi|406775301|gb|CP003297.1| Escherichia coli O104:H4 str. 2009E...
gi|383403426|gb|CP002967.1|=gi|383403426|gb|CP002967.1| Escherichia coli W, complete genome
I need to replace the value before the '=' sign with the value after it, so I built a hash from the list using the split function:
    my %hash;
    for($i=0;$i<=$#arr0;$i++)
    {
        @arr1 = split(/\=/,$arr0[$i]);
        #print $#arr1;
        $hash{$arr1[0]} = $arr1[1];
    }
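For comparison, here is a minimal sketch of the same hash-building step under `use strict`. The name `build_lookup` is hypothetical, and the limit of 2 passed to `split` is an assumption that a description might itself contain an '=' sign; the sample data above does not show such a case.

```perl
use strict;
use warnings;

# Hypothetical helper: build an id => "id description" lookup
# from lines of the form "key=value".
sub build_lookup {
    my @lines = @_;
    my %lookup;
    for my $line (@lines) {
        chomp(my $copy = $line);
        # A limit of 2 keeps any later '=' inside the description intact.
        my ($id, $desc) = split /=/, $copy, 2;
        next unless defined $desc;    # skip lines without an '='
        $lookup{$id} = $desc;
    }
    return \%lookup;
}

my $lookup = build_lookup(
    "gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome\n",
);
print $lookup->{'gi|260447279|gb|CP001637.1|'}, "\n";
```

Returning a hash reference avoids copying the whole hash when it is handed to other subroutines.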
Then I wanted to use this hash as a lookup table and replace every occurrence of a hash key with the corresponding hash value.
The file where I want to do the replacement looks like this:
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 100.00 280 0 0 1 280 3402569 3402290 4e-140 506
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 227880 228159 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 2704973 2704694 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4018745 4019024 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4149866 4150145 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4191268 4191547 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 98.93 280 3 0 1 280 3924929 3925208 9e-136 491
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 100.00 280 0 0 1 280 459101 459380 4e-140 506
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 1156698 1156977 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 3643499 3643220 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4302307 4302028 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4343709 4343430 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4474830 4474551 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 98.93 280 3 0 1 280 4568646 4568367 9e-136 491
So I attempted to write the final code, as follows:
    #!/usr/bin/perl -w
    ($a,$b,$c) = @ARGV;
    if (scalar @ARGV!=3)
    {
        print "Program Name [hash file] [outfmt file] [output file] \n";
        exit;
    }
    open FILE1, "$a" or die $!;
    @arr0 = <FILE1>;
    chomp(@arr0);
    close(FILE1);
    open FILE2, "$b" or die $!;
    @arr2 = <FILE2>;
    chomp(@arr2);
    close(FILE2);
    my %hash;
    for($i=0;$i<=$#arr0;$i++)
    {
        @arr1 = split(/\=/,$arr0[$i]);
        $hash{$arr1[0]} = $arr1[1];
    }
    open(OUT, ">>$c");
    for($j=0;$j<=$#arr2;$j++)
    {
        @arr3=split(/\t/,$arr2[$j]);
        foreach $k (keys %hash)
        {
            if ($arr3[1] eq $k)
            {
                $arr3[1] = $hash{$k};
            }
        }
        print OUT "$arr3[0]\t$arr3[1]\t$arr3[2]\t$arr3[3]\t$arr3[4]\t$arr3[5]\t$arr3[6]\t$arr3[7]\t$arr3[8]\t$arr3[9]\t$arr3[10]\t$arr3[11]\n";
    }
    close(OUT);
This works fine for a small file, but my files are more than 2 million lines each. I want to increase the speed of my program. Can you please share your wisdom on how to make it faster for larger files?
Regards
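
One possible direction, offered as a sketch and not a definitive answer: the inner `foreach` over `keys %hash` compares every key against every line, whereas a hash can be asked for a key directly, so a single lookup per line should be enough. Reading the large file line by line also avoids holding 2 million lines in an array. The names `replace_id` and `translate_file` are hypothetical. The sketch assumes tab-separated input as in the code above, writes every field instead of exactly twelve, and opens the output with '>' instead of '>>'.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite column 2 of one tab-separated line
# using the lookup hash (passed by reference).
sub replace_id {
    my ($line, $lookup) = @_;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    # One hash lookup per line, instead of a loop over every key.
    if (defined $fields[1] && exists $lookup->{ $fields[1] }) {
        $fields[1] = $lookup->{ $fields[1] };
    }
    return join "\t", @fields;
}

# Hypothetical driver: stream the big file one line at a time.
sub translate_file {
    my ($in_path, $out_path, $lookup) = @_;
    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        chomp $line;
        print {$out} replace_id($line, $lookup), "\n";
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
}

# Small in-memory demonstration with made-up identifiers.
my %lookup = ('gi|1|gb|X.1|' => 'gi|1|gb|X.1| Example organism');
print replace_id("query\tgi|1|gb|X.1|\t100.00", \%lookup), "\n";
```

With the lookup hash built from the first file, a call such as `translate_file($b, $c, \%hash)` would replace both the array slurp and the nested loops. Lines whose second column has no entry in the hash pass through unchanged.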