
Greetings, fellow monks. I am a beginner, and this is my first post seeking your wisdom.

I am trying to convert a file from one format to another. I have a list of values, each of which is to be replaced by the value together with its description, such as:

gi|315134697|dbj|AP012030.1|=gi|315134697|dbj|AP012030.1| Escherichia coli DH1 (ME8569) DNA,...

gi|260447279|gb|CP001637.1|=gi|260447279|gb|CP001637.1| Escherichia coli DH1, complete genome

gi|238859724|gb|CP001396.1|=gi|238859724|gb|CP001396.1| Escherichia coli BW2952, complete g...

gi|194400059|gb|EU855241.1|=gi|194400059|gb|EU855241.1| Shigella flexneri strain FBD047 23S...

gi|194400053|gb|EU855235.1|=gi|194400053|gb|EU855235.1| Shigella dysenteriae strain FBD056 ...

gi|169887498|gb|CP000948.1|=gi|169887498|gb|CP000948.1| Escherichia coli str. K12 substr. D...

gi|85674274|dbj|AP009048.1|=gi|85674274|dbj|AP009048.1| Escherichia coli str. K12 substr. W...

gi|48994873|gb|U00096.2|=gi|48994873|gb|U00096.2| Escherichia coli str. K-12 substr. MG1...

gi|81239530|gb|CP000034.1|=gi|81239530|gb|CP000034.1| Shigella dysenteriae Sd197, complete...

gi|5801828|gb|AF053967.1|AF053967=gi|5801828|gb|AF053967.1|AF053967 Escherichia coli strain ECOR ...

gi|5801827|gb|AF053966.1|AF053966=gi|5801827|gb|AF053966.1|AF053966 Escherichia coli rrlD operon,...

gi|406775301|gb|CP003297.1|=gi|406775301|gb|CP003297.1| Escherichia coli O104:H4 str. 2009E...

gi|383403426|gb|CP002967.1|=gi|383403426|gb|CP002967.1| Escherichia coli W, complete genome

I need to replace the value preceding the '=' sign with the value following it. So I built a hash from this list using the split function, with the part before the '=' as the key and the part after it as the value:

my %hash;
for($i=0;$i<=$#arr0;$i++)
{
 @arr1 = split(/\=/,$arr0[$i]);
#print $#arr1;
   $hash{$arr1[0]} = $arr1[1];
}

Then I wanted to use this hash as a lookup table and replace every occurrence of a hash key with the corresponding hash value.

The file where I want to do the replacement looks like this:

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 100.00 280 0 0 1 280 3402569 3402290 4e-140 506

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 227880 228159 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 2704973 2704694 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4018745 4019024 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4149866 4150145 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 99.64 280 1 0 1 280 4191268 4191547 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|315134697|dbj|AP012030.1| 98.93 280 3 0 1 280 3924929 3925208 9e-136 491

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 100.00 280 0 0 1 280 459101 459380 4e-140 506

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 1156698 1156977 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 3643499 3643220 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4302307 4302028 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4343709 4343430 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 99.64 280 1 0 1 280 4474830 4474551 2e-138 500

10_25_res.txt:Locus3034v1rpkm4.98 gi|260447279|gb|CP001637.1| 98.93 280 3 0 1 280 4568646 4568367 9e-136 491

So I attempted to write the final code, as follows:



#!/usr/bin/perl -w

($a,$b,$c) = @ARGV;

if (scalar @ARGV!=3)
{
        print "Program Name [hash file] [outfmt file] [output file] \n";
        exit;
}

open FILE1, "$a" or die $!;
@arr0 = <FILE1>;
chomp(@arr0);
close(FILE1);


open FILE2, "$b" or die $!;
@arr2 = <FILE2>;
chomp(@arr2);
close(FILE2);


my %hash;
for($i=0;$i<=$#arr0;$i++)
{
 @arr1 = split(/\=/,$arr0[$i]);

   $hash{$arr1[0]} = $arr1[1];
}


open(OUT, ">>$c");
for($j=0;$j<=$#arr2;$j++)
{
   @arr3=split(/\t/,$arr2[$j]);

foreach $k (keys %hash)
{
       if ($arr3[1] eq $k)
    {
    $arr3[1] = $hash{$k};
    }
}
print OUT "$arr3[0]\t$arr3[1]\t$arr3[2]\t$arr3[3]\t$arr3[4]\t$arr3[5]\t$arr3[6]\t$arr3[7]\t$arr3[8]\t$arr3[9]\t$arr3[10]\t$arr3[11]\n";

}

close(OUT);

This works fine for a small file, but my files are more than 2 million lines each. I want to increase the speed of my program. Can you please share your wisdom on how to make it faster for larger files?
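One direction that might help, offered as a minimal sketch and not a definitive fix: the inner `foreach` over `keys %hash` compares every key against every line, although a hash can be asked for a key directly. The sketch below assumes the second file really is tab-separated with the identifier in the second column, as the `split(/\t/, ...)` above suggests. It also reads that file one line at a time instead of slurping it into an array. The sub names `load_map` and `replace_column` are hypothetical, chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the lookup table from lines of the form "key=key description".
# Splitting with a limit of 2 keeps any later '=' inside the description.
sub load_map {
    my ($fh) = @_;
    my %map;
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split /=/, $line, 2;
        next unless defined $value;    # skip blank or malformed lines
        $map{$key} = $value;
    }
    return \%map;
}

# Stream the tab-separated file one line at a time and replace column 2
# with a single hash lookup instead of a loop over all keys.
sub replace_column {
    my ($in, $out, $map) = @_;
    while (my $line = <$in>) {
        chomp $line;
        my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
        $fields[1] = $map->{ $fields[1] }
            if defined $fields[1] && exists $map->{ $fields[1] };
        print {$out} join("\t", @fields), "\n";
    }
    return;
}

# Hypothetical driver: only runs when three file names are given.
if (@ARGV == 3) {
    my ($map_file, $hits_file, $out_file) = @ARGV;
    open my $map_fh, '<',  $map_file  or die "Cannot open $map_file: $!";
    open my $in_fh,  '<',  $hits_file or die "Cannot open $hits_file: $!";
    open my $out_fh, '>>', $out_file  or die "Cannot open $out_file: $!";    # append, as in the code above
    replace_column($in_fh, $out_fh, load_map($map_fh));
    close $out_fh or die "Cannot close $out_file: $!";
}
```

With this shape, the work per line is one `split`, one lookup and one `join`, so the run time should grow roughly with the number of lines, not with lines times keys. Lines whose second column is not in the hash are written out unchanged, as in the code above.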

Regards

In reply to Replacement of data in a column of a file using Hashes created from another file by rohitmonk
