Hi, I have two files. File 1 is a FASTA file with a header line beginning with ">" and DNA sequence on the subsequent lines until the next ">" header. File 2 contains line after line formatted the same as the header line in file 1 except with a tab and score value added. There are only a fraction of the header lines in file2 as there are in file 1. File1 contains roughly 900,000 header lines and the corresponding DNA sequence for each. FIle2 contains roughly 60,000 header lines. I want a script to find matching header lines from file 2 and print the header along with the corresponding DNA sequence for that header from file1. I wrote a script that is successful in doing what I want, but it is far too slow to be used. After ~1 hr, it had returned only ~700 correct entries out of ~60,000 total. A small example of file1 and file 2 are shown below.

file1 >Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent +=AT1G11900.len361.i1.1_111 GTATGAATTCCAAAAATCCAGAACCGTTTTCGTGATTCATGTTATGCTCTCGTTGTTGTTTTCTGATTGT +TACTGCTCAGCGAGTTTCTTCTATCAATGTTTGATTCGATGAAGATGCGAAATTTCGAACCATTGCTGT +TCTTCTGAGTTTGATCGTTTTTTAGTTTCGGGGCTTTCACGCTTCAGCTAGTGTTTGATTACGAAGTTT +TCTGATTAAATGTGTGAGTTTTTTTGTAGTCATCTCGAAAACTGAGAAATCCATTTTTATAGATTTACA +TTGTTCATAGTTATATGTGGAAGTTGATGATTGATGGTGATTCTGCAAATTGATGATTTGGTTTTCTGT +TTATTGGCATTGCAG >Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent +=AT1G76660.len490.i1.2_394 GTTTCTGTTTAATTCTCTCTATTTTCGTTTGATTTCGACTTCTTGAGCTTTTACTTCTCTCTGTCTCAGT +TCTAACTTCTTCAGATTTTAAAGCTTTCGTTTTTTTGGCAAGTTGTTTTTTTTTCCTACTTAGATCTGA +CTACTCCGACTCTGTTCACACTAATGTTCGTTAGGGTTTATGTTGAATCTCTCCTTTGATCATTATGTT +ATTGTAAAAATCCCAGCTTTATGCTAAATCGAGCTAGTGATTCTTGAGAATTGAACAAAAAAGTTTTAC +ATTTTTCTGAATTGCCATTCAATTAGAAGAAGAAAAAATTCAACCTTTTACTGGTTATGATCTAGATTC +GATGCGTGTAAGCTATAAGATCACCATTTCGTGCTTTAGATCCATAATCATTGATTCACTATATGGCAA +TTATCTTCTTGCTTCACAGATCTCTTTTACACTTACATGTCAAGTGTCTGAGTGTGTGTGTGTGTCCTT +TTGCAG >Chr1.ID=PAC:19657550;Name=AT1G53750.1;pacid=19657550;longest=1;Parent +=AT1G53750.len344.i1.9_229 GTAAGCCTTCCCCTTTTAGAACCCTAAGTTTTATTGGGGTTTTCGATTTTTACTCTTCTGATTCATCGGA +GAATTCGGATCTACACTAGATTTTAGTTACTCGAATGTGAGGGTTTCGTCTCTTTGCAAACCAATTTGA +TGTTTCCTCCTGAGCTAGATATGTTCTTGATGAGCTTGATTTTTCTACTTGGTTCAGTTTTTTTTGCTA +ATTACTACTTATATGAGTGAATCTGCCTCACTGTTTGAATTTATTCCAAGTGGAAATATTCATAGTCAT +GCTTTGTTGTATCTGTTATCTCTCCATATGTTGTGTTGCTGACCTTGTTAAATCTCATTCTCTGCAG >Chr1.ID=PAC:19650963;Name=AT1G47740.1;pacid=19650963;longest=1;Parent +=AT1G47740.len1091.i1.4_277 GTAAGTTAATCTTCTCTTCTGAAAATTGAATTTGGTGTATCAATTCTTACATTATCTTGAAGATTCATCT +CTGAATTTCTCAAATTTATGGGGGTTTTTTGTTTGTCGGAATTGCCGGAGAAATTGGAAAAAACGAGAT +CTTTGAGTAAAGGGTTTGTTTAATCTTTAGTCTTTATTGCTTTCCTTAAGCTAATTTTGGCAGATCTGG +AACATAAACCCTAGAACAAGACCAAATCAGTGTCTCCTTACTCTTAGGGATTTTAGTCTCTGTGATACC +TTAAATGTGTTTATAAATTGACTGTGCTTAATGGGTCACATTTGATTTGCAATAAAAGTTTCACAATTC +TTCCATTTTCAGTAATGTAAGCTCCAGTTTTCAAGATTTACTATTTTGGGGAACTAGTTAGGTTTGTAG +GTTTATTAGATCTTAGAAACACTAATGTAATGCTGTTTGTTTGGTACTCTTTAATATTCACTATTCATC +TTAATGTGGAATCAAATTTGCTTTTTTTGGTCAGACTCAATTAGTTGGAATGAGTTGTTGAACCATTGA +CTTGTCTCCTAAGCTCTTACTTAATCTATCTGTATCATCTTCTCCTGTCTATTTCTCTTTATTTAATAC +ATAACTCGTTCATGATTGCATCTGACATGGTAGCAACTCTTTGTGAACAAGACTTGATTTTATAACACT +GTAACATGACCACACTTTTTCCTTTTGATCTCTGTATATTTGGGGAAGTAAGGATTTGGTTTAACAATG +ATACAAAATCACAGTTTAGATGACTCAGTCTTGTCTTTTATCATTAAGATGGAAAAAATGAGACCAATC +TTGTCTTGCTTTATTCTAAGATGCATGCTCTATTATATTTTATGCTTTTGTATATATAACTGATTGAAG +CCGGCCAATGGAGATTGGTGCTGACTTTTTAAATGAGCTGTTGTTGTATTTCAGCCAACTAGCATAAGA +TAAAATAAAATAGGAAAATTTTTCATTTCAGTCTCTTAAGTTCAATGAAAACATAATGTGGCACAGAGG +TCTTTGCTTTTGTACCTTTCAGAATCTTTTATTGATTGATGATGTTTACATACAG >Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent +=AT1G31420.len415.i1.13_217 GTTCTTATAATTCTTACTAATTCTGTTTGCTTTTAAAGCTATAACCTTTGATATTGTTGGAAAGAGTGGG +GTTTTGGGTCTTTTGCTGAATCGTTTTTTGGATTTGTTATATTGTTCGAATCTTCAGTTGTTATTGTGT +TATAGGATTGGATGGTATCTGGGAATTTCGTAATCTATGTTAAGCTAGAGCTGTTTTTGAGCTCTTTTG +TTGATGATGATTTTGAGATTGTTGGCCGAATTTAGCTCTCGTTTTCTGATTTTAGCAATTGGAAAGTGT +GTATTGGTTCTTGTGAGGCAATTTCACTGTTTTGAGTACTCAAAATGTAGATGAGAGCATGCATAAGTT +GTGTGGAGACTGAGCTTAATGTGTAGTGTAATTGACAATTAGTTTTGTGGGCTTTCCTTTGTTTTTCAG
file2 >Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent +=AT1G11900.len361.i1.1_111 100 >Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent +=AT1G76660.len490.i1.2_394 34 >Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent +=AT1G31420.len415.i1.13_217 76
Here is the main script that I wrote:
#!/usr/bin/perl use strict; use warnings; use diagnostics; use FAlite; die "usage: $0 <fasta> <list>\n" unless @ARGV == 2; open(FASTA, $ARGV[0]) or die; my $fasta = new FAlite(\*FASTA); while (my $entry = $fasta->nextEntry) { my $gseq = $entry->seq; my ($chrom) = $entry->def =~ /(^\S+)/; my @a1 = $chrom; my %hash; $hash{$_}++ for @a1; my $header; open(LIST, $ARGV[1]) or die; while (<LIST>) { next if /^#/; my ($header, $score) = split; my @a2 = $header; for my $item (@a2) { if ($hash{$item}) {print "$chrom\n$gseq\n"}; } } } close FASTA; close LIST; __END__
Here is the perl module that the above code uses:
package FAlite; use strict; sub new { my ($class, $fh) = @_; if (ref $fh !~ /GLOB/) {die ref $fh, "\n", "FAlite ERROR: expect a GLOB reference\n"} my $this = bless {}; $this->{FH} = $fh; while(<$fh>) {last if $_ =~ /\S/} # not supposed to have blanks, b +ut... my $firstline = $_; if (not defined $firstline) {warn "FAlite: Empty\n"; return $this} if ($firstline !~ /^>/) {warn "FAlite: Not FASTA formatted\n"; ret +urn $this} $this->{LASTLINE} = $firstline; chomp $this->{LASTLINE}; return $this; } sub nextEntry { my ($this) = @_; return 0 if not defined $this->{LASTLINE}; my $fh = $this->{FH}; my $def = $this->{LASTLINE}; my @seq; my $lines_read = 0; while(<$fh>) { $lines_read++; if ($_ =~ /^>/) { $this->{LASTLINE} = $_; chomp $this->{LASTLINE}; last; } push @seq, $_; } return 0 if $lines_read == 0; chomp @seq; my $entry = FAlite::Entry::new($def, \@seq); return $entry; } package FAlite::Entry; use overload '""' => 'all'; sub new { my ($def, $seqarry) = @_; my $this = bless {}; $this->{DEF} = $def; $this->{SEQ} = join("", @$seqarry); $this->{SEQ} =~ s/\s//g; # just in case more spaces return $this; } sub def {shift->{DEF}} sub seq {shift->{SEQ}} sub all {my $e = shift; return $e->{DEF}."\n".$e->{SEQ}."\n"} 1;
While my script is working, I feel that there should be a more efficient way to search my array against my hash and print the associated header and DNA sequence. Thanks for any help, Derrick


In reply to Searching array against hash by drhicks

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.