biomonk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I need your help searching across two files. In the first file (result), the first word of each line is a geneset and the remaining words on that line are the genes in that geneset. (Think of a drawer holding a number of files: the geneset is the drawer and the genes are the files in it. I hope that makes my point clear.) Each line may contain a few hundred words. I need to look up each gene (every word except the first) in the second file (map) and print the matching lines, which hold the information about those genes.


Just have a look at my files
My First file:
#RAW data
chr1q21 na S100A3 S100A6 HRNR DRD5P2 .......
HSA04910_INSULIN_SIGNALING_PATHWAY na XRCC5 HRAS ....
V$YY1_02 na B3GALT6 DZIP1 RAB1B SART3 FLJ20309 ..
MORF_EIF3S2 na HCCS XRCC3 LDHB LDHA OXA1L RPL14 ...
module_486 na CYP3A7 C14orf179 JAG2 INTS1 RBM6 ..
CATABOLIC_PROCESS na PGD HNRPD USE1 RNF217 RNASEH1
#second word can be eliminated
My second file
#Map data
XRCC5 SNP_A-1966881 1
EFNA1 SNP_A-1877994 9
HRNR SNP_A-1919060 2
XRCC5 SNP_A-1966884 1
XRCC5 SNP_A-1966882 1
HRNR SNP_A-1829030 1
My output file should look something like this:
chr1q21
HRNR SNP_A-1829030 1
HRNR SNP_A-1919060 2
EFNA1 SNP_A-1877994 9
HSA04910_INSULIN_SIGNALING_PATHWAY
XRCC5 SNP_A-1966884 1
XRCC5 SNP_A-1966882 1
XRCC5 SNP_A-1966881 1
.......

I tried storing each line in an array and extracting the genes from there (the searching still has to be done), but I would like to know whether there is a much simpler way to do this.

THANKS IN ADVANCE. Have a look at my code:
#Actually is a subroutine which is a part of my other program:
sub parseGeneEntry {
    ##purpose of this function is to return everything from the second tab onwards (these are the genes)
    $genesList = $_[0];
    #print $genesList."\n";
    #print "STARTING PARSING \n";
    @genes;
    @genes = split(/\t/,$genesList);
    shift(@genes); ##removes first entry of array
    #print $#genes." "; ##for debugging only
    shift(@genes);
    #print $#genes." ";
    #@genes;
    $toReturn = "";
    $counter = 0;
    foreach $element(@genes){
        if ($counter == 0){
            $toReturn = $toReturn.$element;
            $counter++;
        }
        else{
            $toReturn = $toReturn."\t".$element;
        }
    }
    #$toReturn = $toReturn."\n";
    #print length($toReturn)." ".$toReturn."\n\n";
    return($toReturn);
}

Replies are listed 'Best First'.
Re: Searching each word of a file
by linuxer (Curate) on Jul 13, 2008 at 21:00 UTC

    Hi, your code looks like you're not using use strict in your script. I, personally, would be afraid to let anyone play around with gene information without being strict.

    update:
    other Tips:

    * use shift for getting the subroutine's arguments.

    * splice(@array, 0, 2) should do the same as shifting the array two times.

    * use an array for what you want to return; push new elements to it; return the joined array.

    sub parseGeneEntry {
        my $genesList = shift;
        my @genes = split /\t/, $genesList;

        # should be identical with shifting @genes two times;
        splice( @genes, 0, 2 );

        my @return;
        foreach my $element ( @genes ){
            push @return, $element;
        }
        return join "\t", @return;
    }

    PS: untested

    edit: changed doc links; added 'my' to $genesList;

    edit2:
    OMG, it can be shortened further (in case you don't want to do anything else in that sub):

    sub parseGeneEntry {
        my $genesList = shift;
        my @genes = split /\t/, $genesList;

        # should be identical with shifting @genes two times;
        splice( @genes, 0, 2 );

        return join "\t", @genes;
    }
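
To see what the shortened sub returns, here is a small usage sketch. The sample line is made up from gene names in the question, and it assumes the fields really are tab-separated, as the original sub does.

```perl
use strict;
use warnings;

sub parseGeneEntry {
    my $genesList = shift;
    my @genes = split /\t/, $genesList;
    splice( @genes, 0, 2 );    # drop the geneset name and the second column
    return join "\t", @genes;
}

# Hypothetical sample line; fields are assumed to be tab-separated.
my $line = join "\t", qw(chr1q21 na S100A3 S100A6 HRNR);

# Prints the three gene names, joined by tabs.
print parseGeneEntry($line), "\n";
```

Note that split /\t/ does not strip a trailing newline, so a line read from a file would probably need a chomp before it is passed in.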
Re: Searching each word of a file
by GrandFather (Saint) on Jul 13, 2008 at 22:55 UTC

    Again it is a matter of building a map from one file then looking up the map while parsing the second file. Consider:

    use strict;
    use warnings;

    my $rawData = <<'RAW';
    chr1q21 na S100A3 S100A6 HRNR DRD5P2 EFNA1
    HSA04910_INSULIN_SIGNALING_PATHWAY na XRCC5 HRAS
    V$YY1_02 na B3GALT6 DZIP1 RAB1B SART3 FLJ20309
    MORF_EIF3S2 na HCCS XRCC3 LDHB LDHA OXA1L RPL14
    module_486 na CYP3A7 C14orf179 JAG2 INTS1 RBM6
    CATABOLIC_PROCESS na PGD HNRPD USE1 RNF217 RNASEH1
    RAW

    my $mapData = <<'MAP';
    XRCC5 SNP_A-1966881 1
    EFNA1 SNP_A-1877994 9
    HRNR SNP_A-1919060 2
    XRCC5 SNP_A-1966884 1
    XRCC5 SNP_A-1966882 1
    HRNR SNP_A-1829030 1
    MAP

    my %geneMap;

    open my $mapIn, '<', \$mapData or die "Failed to open map data: $!";
    while (<$mapIn>) {
        chomp;
        my ($gene, @data) = split;
        next unless exists $data[1] || exists $data[2]; # Skip if unexpected data format
        $geneMap{$gene}{$data[0]} = $data[1];
    }
    close $mapIn;

    open my $rawIn, '<', \$rawData or die "Failed to open raw data: $!";
    while (<$rawIn>) {
        chomp;
        my ($geneset, $ignore, @genes) = split;
        next unless @genes; # Skip empty or badly formed line

        print "$geneset\n";
        for my $gene (@genes) {
            next unless exists $geneMap{$gene};
            print "\t$gene\t$_\t$geneMap{$gene}{$_}\n"
                for sort keys %{$geneMap{$gene}};
        }
    }
    close $rawIn;

    Prints:

    chr1q21
            HRNR SNP_A-1829030 1
            HRNR SNP_A-1919060 2
            EFNA1 SNP_A-1877994 9
    HSA04910_INSULIN_SIGNALING_PATHWAY
            XRCC5 SNP_A-1966881 1
            XRCC5 SNP_A-1966882 1
            XRCC5 SNP_A-1966884 1
    V$YY1_02
    MORF_EIF3S2
    module_486
    CATABOLIC_PROCESS
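
The same map-then-lookup idea can be pointed at files on disk instead of in-memory strings. What follows is only a sketch under some assumptions: the sub names load_gene_map and geneset_report are invented for illustration, the two file names are taken from the command line (map file first), and lines starting with # are treated as comments, which the question does not confirm.

```perl
use strict;
use warnings;

# Build gene => { SNP id => value } from an open filehandle of map lines.
sub load_gene_map {
    my ($fh) = @_;
    my %gene_map;
    while ( my $line = <$fh> ) {
        next if $line =~ /^\s*#/;      # assumption: '#' starts a comment line
        my ( $gene, $snp, $value ) = split ' ', $line;
        next unless defined $value;    # skip blank or short lines
        $gene_map{$gene}{$snp} = $value;
    }
    return \%gene_map;
}

# Return the report lines for an open filehandle of geneset lines.
sub geneset_report {
    my ( $fh, $gene_map ) = @_;
    my @out;
    while ( my $line = <$fh> ) {
        next if $line =~ /^\s*#/;      # assumption: '#' starts a comment line
        my ( $geneset, undef, @genes ) = split ' ', $line;
        next unless @genes;            # skip empty or badly formed lines
        push @out, $geneset;
        for my $gene (@genes) {
            my $snps = $gene_map->{$gene} or next;
            push @out, map { "\t$gene\t$_\t$snps->{$_}" } sort keys %$snps;
        }
    }
    return @out;
}

# Only runs when two file names are given: the map file, then the raw file.
if ( @ARGV == 2 ) {
    my ( $map_file, $raw_file ) = @ARGV;
    open my $map_in, '<', $map_file or die "Cannot open $map_file: $!";
    my $gene_map = load_gene_map($map_in);
    close $map_in;
    open my $raw_in, '<', $raw_file or die "Cannot open $raw_file: $!";
    print "$_\n" for geneset_report( $raw_in, $gene_map );
    close $raw_in;
}
```

Because both subs take a filehandle rather than a file name, they should work equally well with a real file or with an in-memory string opened as in the code above.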

    Perl is environmentally friendly - it saves trees

      Thanks a lot, you really made my day. It's awesome code. I'm very grateful to you.

Re: Searching each word of a file
by moritz (Cardinal) on Jul 13, 2008 at 21:26 UTC
    @genes;
    @genes = split(/\t/,$genesList);

    Mentioning a variable before using it doesn't do anything useful - it's just cargo cult programming. use strict and declare your variables with my instead:

    my @genes = split m/\t/, $genesList;
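
As a rough illustration of what use strict buys here, the sketch below compiles two one-line snippets at run time and reports whether each one is accepted. The helper name compile_error is made up for this example.

```perl
use strict;
use warnings;

# Compile and run a snippet; return the error text, or '' if there was none.
sub compile_error {
    my ($code) = @_;
    eval $code;
    return $@ // '';
}

# Same statement, first without and then with a 'my' declaration.
my $undeclared = compile_error('use strict; @genes = split /\t/, "a\tb"; 1;');
my $declared   = compile_error('use strict; my @genes = split /\t/, "a\tb"; 1;');

# The undeclared version is rejected with an error that names @genes.
print "without my: $undeclared";
print "with my:    ", ( $declared eq '' ? "compiles cleanly\n" : $declared );
```

Under strict, the undeclared @genes is refused at compile time rather than silently becoming a package global, so a misspelled variable name is reported instead of quietly creating a new, empty variable.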