drhicks has asked for the wisdom of the Perl Monks concerning the following question:
Hi, I have two files. File 1 is a FASTA file with a header line beginning with ">" and DNA sequence on the subsequent lines until the next ">" header. File 2 contains line after line formatted the same as the header line in file 1 except with a tab and score value added. There are only a fraction of the header lines in file2 as there are in file 1. File1 contains roughly 900,000 header lines and the corresponding DNA sequence for each. FIle2 contains roughly 60,000 header lines. I want a script to find matching header lines from file 2 and print the header along with the corresponding DNA sequence for that header from file1. I wrote a script that is successful in doing what I want, but it is far too slow to be used. After ~1 hr, it had returned only ~700 correct entries out of ~60,000 total. A small example of file1 and file 2 are shown below.
file1 >Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent +=AT1G11900.len361.i1.1_111 GTATGAATTCCAAAAATCCAGAACCGTTTTCGTGATTCATGTTATGCTCTCGTTGTTGTTTTCTGATTGT +TACTGCTCAGCGAGTTTCTTCTATCAATGTTTGATTCGATGAAGATGCGAAATTTCGAACCATTGCTGT +TCTTCTGAGTTTGATCGTTTTTTAGTTTCGGGGCTTTCACGCTTCAGCTAGTGTTTGATTACGAAGTTT +TCTGATTAAATGTGTGAGTTTTTTTGTAGTCATCTCGAAAACTGAGAAATCCATTTTTATAGATTTACA +TTGTTCATAGTTATATGTGGAAGTTGATGATTGATGGTGATTCTGCAAATTGATGATTTGGTTTTCTGT +TTATTGGCATTGCAG >Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent +=AT1G76660.len490.i1.2_394 GTTTCTGTTTAATTCTCTCTATTTTCGTTTGATTTCGACTTCTTGAGCTTTTACTTCTCTCTGTCTCAGT +TCTAACTTCTTCAGATTTTAAAGCTTTCGTTTTTTTGGCAAGTTGTTTTTTTTTCCTACTTAGATCTGA +CTACTCCGACTCTGTTCACACTAATGTTCGTTAGGGTTTATGTTGAATCTCTCCTTTGATCATTATGTT +ATTGTAAAAATCCCAGCTTTATGCTAAATCGAGCTAGTGATTCTTGAGAATTGAACAAAAAAGTTTTAC +ATTTTTCTGAATTGCCATTCAATTAGAAGAAGAAAAAATTCAACCTTTTACTGGTTATGATCTAGATTC +GATGCGTGTAAGCTATAAGATCACCATTTCGTGCTTTAGATCCATAATCATTGATTCACTATATGGCAA +TTATCTTCTTGCTTCACAGATCTCTTTTACACTTACATGTCAAGTGTCTGAGTGTGTGTGTGTGTCCTT +TTGCAG >Chr1.ID=PAC:19657550;Name=AT1G53750.1;pacid=19657550;longest=1;Parent +=AT1G53750.len344.i1.9_229 GTAAGCCTTCCCCTTTTAGAACCCTAAGTTTTATTGGGGTTTTCGATTTTTACTCTTCTGATTCATCGGA +GAATTCGGATCTACACTAGATTTTAGTTACTCGAATGTGAGGGTTTCGTCTCTTTGCAAACCAATTTGA +TGTTTCCTCCTGAGCTAGATATGTTCTTGATGAGCTTGATTTTTCTACTTGGTTCAGTTTTTTTTGCTA +ATTACTACTTATATGAGTGAATCTGCCTCACTGTTTGAATTTATTCCAAGTGGAAATATTCATAGTCAT +GCTTTGTTGTATCTGTTATCTCTCCATATGTTGTGTTGCTGACCTTGTTAAATCTCATTCTCTGCAG >Chr1.ID=PAC:19650963;Name=AT1G47740.1;pacid=19650963;longest=1;Parent +=AT1G47740.len1091.i1.4_277 GTAAGTTAATCTTCTCTTCTGAAAATTGAATTTGGTGTATCAATTCTTACATTATCTTGAAGATTCATCT +CTGAATTTCTCAAATTTATGGGGGTTTTTTGTTTGTCGGAATTGCCGGAGAAATTGGAAAAAACGAGAT +CTTTGAGTAAAGGGTTTGTTTAATCTTTAGTCTTTATTGCTTTCCTTAAGCTAATTTTGGCAGATCTGG +AACATAAACCCTAGAACAAGACCAAATCAGTGTCTCCTTACTCTTAGGGATTTTAGTCTCTGTGATACC +TTAAATGTGTTTATAAATTGACTGTGCTTAATGGGTCACATTTGATTTGCAATAAAAGTTTCACAATTC +TTCCATTTTCAGTAATGTAAGCTCCAGTTTTCAAGATTTACTATTTTGGGGAACTAGTTAGGTTTGTAG +GTTTATTAGATCTTAGAAACACTAATGTAATGCTGTTTGTTTGGTACTCTTTAATATTCACTATTCATC +TTAATGTGGAATCAAATTTGCTTTTTTTGGTCAGACTCAATTAGTTGGAATGAGTTGTTGAACCATTGA +CTTGTCTCCTAAGCTCTTACTTAATCTATCTGTATCATCTTCTCCTGTCTATTTCTCTTTATTTAATAC +ATAACTCGTTCATGATTGCATCTGACATGGTAGCAACTCTTTGTGAACAAGACTTGATTTTATAACACT +GTAACATGACCACACTTTTTCCTTTTGATCTCTGTATATTTGGGGAAGTAAGGATTTGGTTTAACAATG +ATACAAAATCACAGTTTAGATGACTCAGTCTTGTCTTTTATCATTAAGATGGAAAAAATGAGACCAATC +TTGTCTTGCTTTATTCTAAGATGCATGCTCTATTATATTTTATGCTTTTGTATATATAACTGATTGAAG +CCGGCCAATGGAGATTGGTGCTGACTTTTTAAATGAGCTGTTGTTGTATTTCAGCCAACTAGCATAAGA +TAAAATAAAATAGGAAAATTTTTCATTTCAGTCTCTTAAGTTCAATGAAAACATAATGTGGCACAGAGG +TCTTTGCTTTTGTACCTTTCAGAATCTTTTATTGATTGATGATGTTTACATACAG >Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent +=AT1G31420.len415.i1.13_217 GTTCTTATAATTCTTACTAATTCTGTTTGCTTTTAAAGCTATAACCTTTGATATTGTTGGAAAGAGTGGG +GTTTTGGGTCTTTTGCTGAATCGTTTTTTGGATTTGTTATATTGTTCGAATCTTCAGTTGTTATTGTGT +TATAGGATTGGATGGTATCTGGGAATTTCGTAATCTATGTTAAGCTAGAGCTGTTTTTGAGCTCTTTTG +TTGATGATGATTTTGAGATTGTTGGCCGAATTTAGCTCTCGTTTTCTGATTTTAGCAATTGGAAAGTGT +GTATTGGTTCTTGTGAGGCAATTTCACTGTTTTGAGTACTCAAAATGTAGATGAGAGCATGCATAAGTT +GTGTGGAGACTGAGCTTAATGTGTAGTGTAATTGACAATTAGTTTTGTGGGCTTTCCTTTGTTTTTCAG
Here is the main script that I wrote:file2 >Chr1.ID=PAC:19650373;Name=AT1G11900.1;pacid=19650373;longest=1;Parent +=AT1G11900.len361.i1.1_111 100 >Chr1.ID=PAC:19657430;Name=AT1G76660.1;pacid=19657430;longest=1;Parent +=AT1G76660.len490.i1.2_394 34 >Chr1.ID=PAC:19652608;Name=AT1G31420.1;pacid=19652608;longest=1;Parent +=AT1G31420.len415.i1.13_217 76
Here is the perl module that the above code uses:#!/usr/bin/perl use strict; use warnings; use diagnostics; use FAlite; die "usage: $0 <fasta> <list>\n" unless @ARGV == 2; open(FASTA, $ARGV[0]) or die; my $fasta = new FAlite(\*FASTA); while (my $entry = $fasta->nextEntry) { my $gseq = $entry->seq; my ($chrom) = $entry->def =~ /(^\S+)/; my @a1 = $chrom; my %hash; $hash{$_}++ for @a1; my $header; open(LIST, $ARGV[1]) or die; while (<LIST>) { next if /^#/; my ($header, $score) = split; my @a2 = $header; for my $item (@a2) { if ($hash{$item}) {print "$chrom\n$gseq\n"}; } } } close FASTA; close LIST; __END__
While my script is working, I feel that there should be a more efficient way to search my array against my hash and print the associated header and DNA sequence. Thanks for any help, Derrickpackage FAlite; use strict; sub new { my ($class, $fh) = @_; if (ref $fh !~ /GLOB/) {die ref $fh, "\n", "FAlite ERROR: expect a GLOB reference\n"} my $this = bless {}; $this->{FH} = $fh; while(<$fh>) {last if $_ =~ /\S/} # not supposed to have blanks, b +ut... my $firstline = $_; if (not defined $firstline) {warn "FAlite: Empty\n"; return $this} if ($firstline !~ /^>/) {warn "FAlite: Not FASTA formatted\n"; ret +urn $this} $this->{LASTLINE} = $firstline; chomp $this->{LASTLINE}; return $this; } sub nextEntry { my ($this) = @_; return 0 if not defined $this->{LASTLINE}; my $fh = $this->{FH}; my $def = $this->{LASTLINE}; my @seq; my $lines_read = 0; while(<$fh>) { $lines_read++; if ($_ =~ /^>/) { $this->{LASTLINE} = $_; chomp $this->{LASTLINE}; last; } push @seq, $_; } return 0 if $lines_read == 0; chomp @seq; my $entry = FAlite::Entry::new($def, \@seq); return $entry; } package FAlite::Entry; use overload '""' => 'all'; sub new { my ($def, $seqarry) = @_; my $this = bless {}; $this->{DEF} = $def; $this->{SEQ} = join("", @$seqarry); $this->{SEQ} =~ s/\s//g; # just in case more spaces return $this; } sub def {shift->{DEF}} sub seq {shift->{SEQ}} sub all {my $e = shift; return $e->{DEF}."\n".$e->{SEQ}."\n"} 1;
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Searching array against hash
by BrowserUk (Patriarch) on Aug 21, 2013 at 21:37 UTC | |
by drhicks (Novice) on Aug 21, 2013 at 21:54 UTC | |
by davido (Cardinal) on Aug 21, 2013 at 22:10 UTC | |
|
Re: Searching array against hash
by abualiga (Scribe) on Aug 22, 2013 at 01:07 UTC | |
by BrowserUk (Patriarch) on Aug 22, 2013 at 01:55 UTC | |
by abualiga (Scribe) on Aug 22, 2013 at 02:41 UTC | |
by BrowserUk (Patriarch) on Aug 22, 2013 at 03:23 UTC | |
by bioinformatics (Friar) on Aug 22, 2013 at 02:25 UTC | |
by BrowserUk (Patriarch) on Aug 22, 2013 at 03:18 UTC | |
by bioinformatics (Friar) on Aug 22, 2013 at 03:31 UTC | |
|