Searching each word of a file

biomonk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I need your help searching across two files. In the first file (result), the first word of each line is a geneset and the remaining words on that line are the genes in that geneset. (Compare it to a drawer holding a number of files: the geneset is the drawer and the genes are the files in it. I hope you get my point.) Each line may contain a few hundred words. I need to search for each gene (every word except the first) in the second file (map) and print the matching lines, which hold the information about that gene.

Just have a look at my files
My First file:

 
#RAW data
chr1q21    na    S100A3    S100A6    HRNR    DRD5P2    .......    
HSA04910_INSULIN_SIGNALING_PATHWAY    na    XRCC5 HRAS    ....
V$YY1_02    na    B3GALT6    DZIP1    RAB1B    SART3    FLJ20309 ..
MORF_EIF3S2    na    HCCS    XRCC3    LDHB    LDHA    OXA1L    RPL14    ...
module_486    na    CYP3A7    C14orf179    JAG2    INTS1    RBM6    ..
CATABOLIC_PROCESS    na    PGD    HNRPD    USE1    RNF217    RNASEH1
#second word can be eliminated

My second file

#Map data
XRCC5    SNP_A-1966881    1
EFNA1    SNP_A-1877994    9
HRNR    SNP_A-1919060    2
XRCC5    SNP_A-1966884    1
XRCC5    SNP_A-1966882    1
HRNR    SNP_A-1829030    1

My output file should look some thing like this:

chr1q21 
      HRNR    SNP_A-1829030    1
      HRNR    SNP_A-1919060    2
      EFNA1    SNP_A-1877994    9
HSA04910_INSULIN_SIGNALING_PATHWAY
      XRCC5    SNP_A-1966884    1
      XRCC5    SNP_A-1966882    1
      XRCC5    SNP_A-1966881    1
.......

I tried doing this by storing each line in an array and extracting the genes from there (the searching still has to be done), but what I want to know is: is there a much simpler way to do this?

THANKS IN ADVANCE. Have a look at my code:

#Actually this is a subroutine that is part of my other program:
sub parseGeneEntry {  ##purpose of this function is to return everything from the second tab onwards (these are the genes)
    $genesList = $_[0];
    #print $genesList."\n";
    #print "STARTING PARSING \n";
    @genes;
    @genes = split(/\t/,$genesList);
    shift(@genes);              ##removes first entry of array
    #print $#genes." ";         ##for debugging only
    shift(@genes);
    #print $#genes." ";
    #@genes;
    $toReturn = "";
    $counter = 0;
    foreach $element(@genes){
        if ($counter == 0){
            $toReturn = $toReturn.$element;
            $counter++;
        }
        else{
            $toReturn = $toReturn."\t".$element;
        }
    }
    #$toReturn = $toReturn."\n";

    #print length($toReturn)." ".$toReturn."\n\n";
    return($toReturn);
}


Replies are listed 'Best First'.
Re: Searching each word of a file by linuxer (Curate) on Jul 13, 2008 at 21:00 UTC
Hi,

your code looks like you're not using `use strict` in your script. I, personally, would be afraid to let anyone play around with gene information without being strict.

update: other tips:

* use `shift` for getting the subroutine's arguments.
* `splice(@array, 0, 2)` should do the same as shifting the array two times.
* use an array for what you want to return; push new elements to it; return the `join`ed array.

sub parseGeneEntry {
    my $genesList = shift;
    my @genes = split /\t/, $genesList;

    # should be identical with shifting @genes two times;
    splice( @genes, 0, 2 );

    my @return;
    foreach my $element ( @genes ){
        push @return, $element;
    }
    return join "\t", @return;
}

PS: untested

edit: changed doc links; added 'my' to $genesList;

edit2: OMG, it can be further shortened (in case you don't want to do anything further in that sub):

sub parseGeneEntry {
    my $genesList = shift;
    my @genes = split /\t/, $genesList;

    # should be identical with shifting @genes two times;
    splice( @genes, 0, 2 );

    return join "\t", @genes;
}
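Since the code above is marked untested, here is a minimal sketch that checks the claim that `splice(@array, 0, 2)` matches two `shift` calls. The sample line is made up from the gene names in the question, and the third variant (discarding fields with `undef` in a list assignment) is an extra alternative that was not part of the reply.

```perl
use strict;
use warnings;

# Hypothetical sample line: geneset, placeholder column, then the genes.
my $line = join "\t", qw(chr1q21 na S100A3 S100A6 HRNR);

# Variant 1: drop the first two fields with two shift calls.
my @by_shift = split /\t/, $line;
shift @by_shift;
shift @by_shift;

# Variant 2: drop the first two fields with a single splice call.
my @by_splice = split /\t/, $line;
splice @by_splice, 0, 2;

# Variant 3 (not from the thread): discard the first two fields
# directly in the list assignment.
my (undef, undef, @by_assign) = split /\t/, $line;

# All three variants print the same tab-separated gene list.
print join("\t", @by_shift),  "\n";
print join("\t", @by_splice), "\n";
print join("\t", @by_assign), "\n";
```

All three arrays end up holding only the gene names, so which form to use is mostly a matter of taste.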
Re: Searching each word of a file by GrandFather (Saint) on Jul 13, 2008 at 22:55 UTC
Again it is a matter of building a map from one file then looking up the map while parsing the second file. Consider:

use strict;
use warnings;

my $rawData = <<'RAW';
chr1q21 na S100A3 S100A6 HRNR DRD5P2 EFNA1
HSA04910_INSULIN_SIGNALING_PATHWAY na XRCC5 HRAS
V$YY1_02 na B3GALT6 DZIP1 RAB1B SART3 FLJ20309
MORF_EIF3S2 na HCCS XRCC3 LDHB LDHA OXA1L RPL14
module_486 na CYP3A7 C14orf179 JAG2 INTS1 RBM6
CATABOLIC_PROCESS na PGD HNRPD USE1 RNF217 RNASEH1
RAW

my $mapData = <<'MAP';
XRCC5 SNP_A-1966881 1
EFNA1 SNP_A-1877994 9
HRNR SNP_A-1919060 2
XRCC5 SNP_A-1966884 1
XRCC5 SNP_A-1966882 1
HRNR SNP_A-1829030 1
MAP

my %geneMap;

open my $mapIn, '<', \$mapData or die "Failed to open map data: $!";
while (<$mapIn>) {
    chomp;
    my ($gene, @data) = split;
    next unless exists $data[1] || exists $data[2]; # Skip if unexpected data format
    $geneMap{$gene}{$data[0]} = $data[1];
}
close $mapIn;

open my $rawIn, '<', \$rawData or die "Failed to open raw data: $!";
while (<$rawIn>) {
    chomp;
    my ($geneset, $ignore, @genes) = split;
    next unless @genes; # Skip empty or badly formed line

    print "$geneset\n";
    for my $gene (@genes) {
        next unless exists $geneMap{$gene};
        print "\t$gene\t$_\t$geneMap{$gene}{$_}\n"
            for sort keys %{$geneMap{$gene}};
    }
}
close $rawIn;

Prints:

chr1q21
    HRNR    SNP_A-1829030    1
    HRNR    SNP_A-1919060    2
    EFNA1    SNP_A-1877994    9
HSA04910_INSULIN_SIGNALING_PATHWAY
    XRCC5    SNP_A-1966881    1
    XRCC5    SNP_A-1966882    1
    XRCC5    SNP_A-1966884    1
V$YY1_02
MORF_EIF3S2
module_486
CATABOLIC_PROCESS

Perl is environmentally friendly - it saves trees
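The example above reads from in-memory strings. As a rough sketch of how the same two-pass idea might be applied to files on disk, the version below splits the work into two subroutines that accept any open filehandle. The subroutine names `load_gene_map` and `report_genesets`, and the command-line usage, are assumptions for illustration and do not come from the thread. Unlike the code above, this sketch keeps each map line whole and in file order rather than sorting by SNP id.

```perl
use strict;
use warnings;

# Pass 1: read the map file and remember every line under its gene name.
sub load_gene_map {
    my ($fh) = @_;
    my %lines_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($gene) = split ' ', $line;
        next unless defined $gene;          # skip blank lines
        push @{ $lines_for{$gene} }, $line;
    }
    return \%lines_for;
}

# Pass 2: for each geneset line, list the stored map lines of its genes.
sub report_genesets {
    my ($fh, $lines_for) = @_;
    my $report = '';
    while (my $line = <$fh>) {
        chomp $line;
        # First field is the geneset, second field is ignored.
        my ($geneset, undef, @genes) = split ' ', $line;
        next unless @genes;                 # skip empty or malformed lines
        $report .= "$geneset\n";
        for my $gene (@genes) {
            $report .= "\t$_\n" for @{ $lines_for->{$gene} || [] };
        }
    }
    return $report;
}

# Hypothetical usage: perl genesets.pl map.txt result.txt
if (@ARGV == 2) {
    my ($map_file, $raw_file) = @ARGV;
    open my $map_fh, '<', $map_file or die "Cannot open $map_file: $!";
    my $lines_for = load_gene_map($map_fh);
    close $map_fh;

    open my $raw_fh, '<', $raw_file or die "Cannot open $raw_file: $!";
    print report_genesets($raw_fh, $lines_for);
    close $raw_fh;
}
```

Only the map file is held in memory; the geneset file is streamed line by line, so lines with a few hundred genes should not be a problem.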
Re^2: Searching each word of a file by biomonk (Acolyte) on Jul 14, 2008 at 17:30 UTC
Thanks a lot, you really made my day. It's awesome code. I'm very grateful to you.
Re: Searching each word of a file by moritz (Cardinal) on Jul 13, 2008 at 21:26 UTC
@genes;
@genes = split(/\t/,$genesList);

Mentioning a variable before using it doesn't do anything useful - it's just cargo cult programming. `use strict` and declare your variables with `my` instead:

my @genes = split m/\t/, $genesList;
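To illustrate why `use strict` is worth the extra `my`, here is a small sketch showing strict rejecting a misspelled variable at compile time. The typo `@gens` and the sample data are made up for the demonstration, and the string `eval` is used only so the failure can be printed instead of aborting the script.

```perl
use strict;
use warnings;

# Declared with my, as suggested above.
my $genesList = "chr1q21\tna\tS100A3\tHRNR";
my @genes = split m/\t/, $genesList;
print scalar(@genes), " fields\n";

# Compile a snippet with a misspelled variable (@gens instead of @genes).
# Under strict the undeclared name is a compile-time error, so eval
# returns undef and $@ holds the message.
my $ok = eval q{
    use strict;
    my @genes = ('S100A3', 'HRNR');
    my $count = scalar @gens;
    1;
};
print "strict rejected the typo: $@" unless $ok;
```

Without strict, the misspelled `@gens` would silently be treated as a new, empty global array and the mistake would only show up as wrong results later.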