Replacement of data in a column of a file using Hashes created from another file

rohitmonk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, fellow monks. I am a beginner, and this is my first post seeking your wisdom.

I am trying to convert a file from one format to another. I have a list of identifiers, each of which is to be replaced with the identifier plus its description, such as:

gi|315134697|dbj|AP012030.1|=gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...

gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome

gi|238859724|gb|CP001396.1|=gi|238859724|gb|CP001396.1| Escherichia coli BW2952, complete g...

gi|194400059|gb|EU855241.1|=gi|194400059|gb|EU855241.1| Shigella flexneri strain FBD047 23S...

gi|194400053|gb|EU855235.1|=gi|194400053|gb|EU855235.1| Shigella dysenteriae strain FBD056 ...

gi|169887498|gb|CP000948.1|=gi|169887498|gb|CP000948.1| Escherichia coli str. K12 substr. D...

gi|85674274|dbj|AP009048.1|=gi|85674274|dbj|AP009048.1| Escherichia coli str. K12 substr. W...

gi|48994873|gb|U00096.2|=gi|48994873|gb|U00096.2| Escherichia coli str. K-12 substr. MG1...

gi|81239530|gb|CP000034.1|=gi|81239530|gb|CP000034.1| Shigella dysenteriae Sd197, complete...

gi|5801828|gb|AF053967.1|AF053967=gi|5801828|gb|AF053967.1|AF053967 Escherichia coli strain ECOR ...

gi|5801827|gb|AF053966.1|AF053966=gi|5801827|gb|AF053966.1|AF053966 Escherichia coli rrlD operon,...

gi|406775301|gb|CP003297.1|=gi|406775301|gb|CP003297.1| Escherichia coli O104:H4 str. 2009E...

gi|383403426|gb|CP002967.1|=gi|383403426|gb|CP002967.1| Escherichia coli W, complete genome

I need to replace the value before the '=' sign with the value after it, so I built a hash from the list using the split function.

my %hash;
for($i=0;$i<=$#arr0;$i++)
{
 @arr1 = split(/\=/,$arr0[$i]);
#print $#arr1;
   $hash{$arr1[0]} = $arr1[1];
}
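As a side note on the snippet above, here is a minimal sketch of the same hash-building step written as a reusable sub. The name `build_map` and the two sample lines passed to it are illustrative assumptions, not part of the original program. The only behavioural difference is the limit of 2 given to `split`, which would keep a description intact if it ever contained a '=' of its own.

```perl
use strict;
use warnings;

# Build a lookup table from lines of the form "key=value".
# build_map is a hypothetical helper name used only for this sketch.
sub build_map {
    my @lines = @_;
    my %map;
    for my $line (@lines) {
        chomp $line;                        # drop the trailing newline
        next unless length $line;           # skip blank lines
        # Limit of 2: split on the first '=' only.
        my ($key, $value) = split /=/, $line, 2;
        next unless defined $value;         # skip lines without '='
        $map{$key} = $value;
    }
    return \%map;
}

my $map = build_map(
    "gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome\n",
    "gi|383403426|gb|CP002967.1|=gi|383403426|gb|CP002967.1| Escherichia coli W, complete genome\n",
);
print "$_ => $map->{$_}\n" for sort keys %$map;
```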

Then I wanted to use this hash as a reference and replace every occurrence of a hash key with the corresponding hash value.

The file where I want to do the replacement looks like this:

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 100.00 280 0 0 1 280 3402569 3402290 4e-140 506

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 227880 228159 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 2704973 2704694 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4018745 4019024 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4149866 4150145 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4191268 4191547 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 98.93 280 3 0 1 280 3924929 3925208 9e-136 491

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 100.00 280 0 0 1 280 459101 459380 4e-140 506

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 1156698 1156977 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 3643499 3643220 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4302307 4302028 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4343709 4343430 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4474830 4474551 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 98.93 280 3 0 1 280 4568646 4568367 9e-136 491

So I attempted to write the final code, as follows:



#!/usr/bin/perl -w

($a,$b,$c) = @ARGV;

if (scalar @ARGV!=3)
{
        print "Program Name [hash file] [outfmt file] [output file] \n";
        exit;
}

open FILE1, "$a" or die $!;
@arr0 = <FILE1>;
chomp(@arr0);
close(FILE1);


open FILE2, "$b" or die $!;
@arr2 = <FILE2>;
chomp(@arr2);
close(FILE2);


my %hash;
for($i=0;$i<=$#arr0;$i++)
{
 @arr1 = split(/\=/,$arr0[$i]);

   $hash{$arr1[0]} = $arr1[1];
}


open(OUT, ">>$c");
for($j=0;$j<=$#arr2;$j++)
{
   @arr3=split(/\t/,$arr2[$j]);

foreach $k (keys %hash)
{
       if ($arr3[1] eq $k)
    {
    $arr3[1] = $hash{$k};
    }
}
print OUT "$arr3[0]\t$arr3[1]\t$arr3[2]\t$arr3[3]\t$arr3[4]\t$arr3[5]\t$arr3[6]\t$arr3[7]\t$arr3[8]\t$arr3[9]\t$arr3[10]\t$arr3[11]\n";

}

close(OUT);

This works fine for a small file, but my files are more than 2 million lines each. I want to increase the speed of my program. Can you please share your wisdom on how to make it faster for larger files?

Regards


Re: Replacement of data in a column of a file using Hashes created from another file by choroba (Cardinal) on Oct 30, 2012 at 09:22 UTC
The reason it is slow is the nested loops: for each line, you loop over all the keys. The following code generates a regular expression that matches all the keys, so it saves you one loop:

#!/usr/bin/perl
use warnings;
use strict;

open my $EQ, '<', '1.txt' or die "1: $!";
my %subst;
while (<$EQ>) {
    chomp; # <- updated
    my ($search, $replace) = split /=/;
    $subst{$search} = $replace;
}

my $regex = join '|', map quotemeta, keys %subst;

open my $LST, '<', '2.txt' or die "2: $!";
while (<$LST>) {
    s/($regex)/$subst{$1}/;
    print;
}

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
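Since the identifier fills the whole second column, another way to drop the inner loop is a direct hash lookup on that column. The following is only a sketch under that assumption, not the code posted above: the sub names `replace_id` and `translate` are made up for illustration, the tab separator is taken from the original program's `split(/\t/, ...)`, and the sample row is built from data quoted in the question. It reads and writes one line at a time through filehandles, so it also produces an output file without loading 2 million lines into arrays.

```perl
use strict;
use warnings;

# Swap column 2 of one tab-separated line for its mapped value, if any.
sub replace_id {
    my ($line, $map) = @_;
    my @fields = split /\t/, $line, -1;      # -1 keeps trailing empty fields
    if (defined $fields[1] && exists $map->{ $fields[1] }) {
        $fields[1] = $map->{ $fields[1] };   # one hash lookup, no inner loop
    }
    return join "\t", @fields;
}

# Stream lines from $in to $out, so neither file is held in memory.
sub translate {
    my ($in, $out, $map) = @_;
    while (my $line = <$in>) {
        chomp $line;
        print {$out} replace_id($line, $map), "\n";
    }
    return;
}

# Demo on in-memory filehandles; real code would open the three files
# named on the command line instead.
my %map = (
    'gi|260447279|gb|CP001637.1|' =>
        'gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome',
);
my $data = "10_25_res.txt:Locus3034v1rpkm4.98\tgi|260447279|gb|CP001637.1|\t100.00\t280\n";
open my $in,  '<', \$data      or die "in: $!";
open my $out, '>', \my $result or die "out: $!";
translate($in, $out, \%map);
close $out;
print $result;
```

A lookup like this does a constant amount of work per line regardless of how many keys the hash holds, whereas the original inner `foreach` compares every key against every line. Lines whose second column has no entry in the hash pass through unchanged.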
Re^2: Replacement of data in a column of a file using Hashes created from another file by rohitmonk (Initiate) on Oct 30, 2012 at 09:43 UTC
Thank you for the reply. But I need to get an output file, and I am pretty new to this syntax. Also, the output that gets printed does not have any replacements in it when I run it. Where is it searching for the hash key and replacing it with the value?
Re^3: Replacement of data in a column of a file using Hashes created from another file by space_monk (Chaplain) on Oct 30, 2012 at 10:44 UTC
Not sure what you mean:

Input:

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 100.00 280 0 0 1 280 3402569 3402290 4e-140 506
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 227880 228159 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 2704973 2704694 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4018745 4019024 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4149866 4150145 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4191268 4191547 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 98.93 280 3 0 1 280 3924929 3925208 9e-136 491
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 100.00 280 0 0 1 280 459101 459380 4e-140 506
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 1156698 1156977 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 3643499 3643220 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4302307 4302028 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4343709 4343430 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4474830 4474551 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 98.93 280 3 0 1 280 4568646 4568367 9e-136 491

Output:

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
100.00 280 0 0 1 280 3402569 3402290 4e-140 506
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
99.64 280 1 0 1 280 227880 228159 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
99.64 280 1 0 1 280 2704973 2704694 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
99.64 280 1 0 1 280 4018745 4019024 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
99.64 280 1 0 1 280 4149866 4150145 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
99.64 280 1 0 1 280 4191268 4191547 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...
98.93 280 3 0 1 280 3924929 3925208 9e-136 491
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
100.00 280 0 0 1 280 459101 459380 4e-140 506
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
99.64 280 1 0 1 280 1156698 1156977 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
99.64 280 1 0 1 280 3643499 3643220 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
99.64 280 1 0 1 280 4302307 4302028 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
99.64 280 1 0 1 280 4343709 4343430 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
99.64 280 1 0 1 280 4474830 4474551 2e-138 500
10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome
98.93 280 3 0 1 280 4568646 4568367 9e-136 491

The replacement is done in the `s/($regex)/$subst{$1}/` statement, and you can direct the output to a file by: `./program.pl > output_file.txt`

There does appear to be a stray carriage return in the output - the code may need a `chomp` somewhere...
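On the stray line break: `chomp` with the default `$/` removes only a trailing "\n". If the input files were ever saved with Windows line endings (an assumption; the thread does not say), a "\r" would survive `chomp` and end up inside the replacement text. A minimal sketch of a stricter trim follows; `strip_eol` is a hypothetical helper name and the sample line is made up.

```perl
use strict;
use warnings;

# Remove a trailing LF or CRLF; chomp alone would leave the "\r" behind.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

my $raw = "gi|48994873|gb|U00096.2|=some description\r\n";   # made-up CRLF line

my $chomped = $raw;
chomp $chomped;                      # with the default $/, removes only "\n"

printf "chomp ends in CR: %s\n",     ($chomped        =~ /\r\z/ ? 'yes' : 'no');
printf "strip_eol ends in CR: %s\n", (strip_eol($raw) =~ /\r\z/ ? 'yes' : 'no');
```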
Re^4: Replacement of data in a column of a file using Hashes created from another file by choroba (Cardinal) on Oct 30, 2012 at 11:23 UTC
Re^4: Replacement of data in a column of a file using Hashes created from another file by perl_walker (Novice) on Oct 30, 2012 at 14:49 UTC
Re^5: Replacement of data in a column of a file using Hashes created from another file by Anonymous Monk on Oct 30, 2012 at 23:27 UTC
Re^5: Replacement of data in a column of a file using Hashes created from another file by rohitmonk (Initiate) on Oct 31, 2012 at 05:49 UTC
Re^4: Replacement of data in a column of a file using Hashes created from another file by rohitmonk (Initiate) on Oct 31, 2012 at 05:44 UTC
Re^4: Replacement of data in a column of a file using Hashes created from another file by rohitmonk (Initiate) on Oct 30, 2012 at 12:47 UTC
Re^2: Replacement of data in a column of a file using Hashes created from another file by rohitmonk (Initiate) on Oct 31, 2012 at 05:46 UTC
Thank you for the code, it was dope. Cool skills you've got; I wish to learn more. Cheers...