biomonk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I come from a biology background and started using Perl recently. I want to add more information to the output generated by a program, and to do that I would like to use the raw data that produced the output. I think this can be done by parsing and searching files. The problem is that the raw data (input) files are large, and searching them with regular expressions takes too long, so I need to find another way to do it. I thought this would be a great place to ask for help.


Here is a description of my files:

Output file that has to be parsed (all files are tab-delimited):

Geneset=GO0035091 Size=77 ES=0.525 NES=2.913 NominalP=0.000
Geneset=GO0030163 Size=54 ES=0.463 NES=2.248 NominalP=0.013
Geneset=GO0007067 Size=44 ES=0.484 NES=1.975 NominalP=0.018
......

Input or Raw data files include:

A GMT file (which lists all genes associated with each Geneset; large file)

GO0046800 GO0046800 CD209 CD209L CD209L1 CD209_HUMAN CLC4M
GO0032104 GO0032104 CART CARTPT CART_HUMAN GHRL GHRL_HUMAN
......... .......... ..... ...... .......... .... ..........
......... .......... ..... ...... .......... .... ..........
GO0035091 GO0035091 41_HUMAN 5NTD_HUMAN 9804 A-388D4.1 .....

A SNP/Marker to Gene Map file (which has information about the SNPs related to genes, plus a score; huge file)

rs10904494 NP_817124 17881
rs7906287 NP_817124 39800
rs4881551 41_HUMAN 21567
rs5416721 5NTD_HUMAN 0
.................... .............. ....

A CHI2 file (which has the SNPs/markers from a study, each with a score)

Marker CHI2
rs3749375 11.7268615355335
rs10499549 10.4656064706897
rs5416721 9.85374546064131

I need to produce a new output file that looks like this:

Geneset    Genes       SNP/Marker  ES    NES      NominalP
GO0035091  41_HUMAN    rs4881551   ....  .......  ....
           5NTD_HUMAN  rs5416721   ....  .......  ....
           ....................  ....................  ..................

I think this can be done by first getting each Geneset and its corresponding genes, which I have already done (I need your help with the rest). Next, the markers from the CHI2 file are used to search the SNP/Marker to Gene Map file (the CHI2 file contains the markers/SNPs of interest), and the matching lines are stored in a file. Then that file is searched for the genes from each Geneset to get their SNPs, which are printed to a new file along with the old data. I need your guidance to do this, so please help me out. Because of my limited programming skill, I need a little explanation rather than just code, so that I can understand it and use or modify it in the future.

Here is my code
open(OUTPUT, "<C:\\Documents and Settings\\shra1\\Desktop\\prj\\schneider_breast_copy_num_pathway_enrichment1.txt");
@output = <OUTPUT>;
close(OUTPUT);

open(GENESETS, "<C:\\Documents and Settings\\shra1\\Desktop\\prj\\human.gmt");
@geneSets = <GENESETS>;
close(GENESETS);

@NewgeneSet;
@genesInSet;
$i=0;
while($i < 10){
    @outputLineSplit = split(/\t/,$output[$i]);
    #print "$outputLineSplit[0] \n";
    $setName = $outputLineSplit[0];
    $equalLoc = index($setName, "=");
    $setName = substr($setName,$equalLoc+1,length($setName));
    #print "$setName\n";
    @genesInSet[$i]= $setName;
    $i++;
}

foreach $genesInSet(@genesInSet){
    print "$genesInSet\n";
    foreach $geneSets(@geneSets){
        if($geneSets=~m/$genesInSet/i){
            #print "$geneSets\n";
        }
    }
}

Replies are listed 'Best First'.
Re: Searching and Parsing Biological data
by pc88mxer (Vicar) on Jul 01, 2008 at 17:00 UTC
    This is a common operation which Perl is well suited for. Whenever you want to read one file and look up associated information in another file, you will use a hash to "index" the second file so you can find the information quickly. The basic recipe is:
    for each line of the second file:
        parse it and locate the key
        store the line in a hash indexed by its key
    endfor
    for each line of the first file:
        parse it and locate its common key with the second file
        look up the associated line using the hash
        do something with both lines
    endfor
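A minimal Perl sketch of this recipe might look like the following. It assumes tab-delimited lines whose first field is the shared key; the sub names `index_by_key` and `join_on_key` are hypothetical, and in-memory strings (built from a few sample rows in the question) stand in for the real files.

```perl
use strict;
use warnings;

# Build a hash that maps the first tab-delimited field of each line
# to the whole line. This is the "index" of the second file.
sub index_by_key {
    my ($fh) = @_;
    my %line_for;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;      # skip blank lines
        my ($key) = split /\t/, $line;
        $line_for{$key} = $line;       # a later duplicate key overwrites an earlier one
    }
    return \%line_for;
}

# Walk the first file and pair each line with its match from the index.
sub join_on_key {
    my ($fh, $line_for) = @_;
    my @joined;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        my ($key) = split /\t/, $line;
        next unless exists $line_for->{$key};   # no partner in the second file
        push @joined, "$line\t$line_for->{$key}";
    }
    return @joined;
}

# Sample data only; real code would open the files on disk instead.
my $second = "rs4881551\t41_HUMAN\t21567\nrs5416721\t5NTD_HUMAN\t0\n";
my $first  = "rs3749375\t11.7268615355335\nrs5416721\t9.85374546064131\n";

open my $second_fh, '<', \$second or die "Cannot open in-memory data: $!";
open my $first_fh,  '<', \$first  or die "Cannot open in-memory data: $!";

my $index = index_by_key($second_fh);
print "$_\n" for join_on_key($first_fh, $index);
```

Each file is read only once, and every lookup is a single hash access, which is why this tends to be much faster than running a regexp over the whole second file for every line of the first.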
    Examples of how to do each of these steps can be found in these nodes:
    Re: Combine files, while parsing info. (see the example in the <readmore> section)
    Re: compare data between two files using Perl

    Your problem is complicated by the fact that you have several files you want to "join" together, but the basic approach won't change. This is similar to a "relational join" operation between tables in database parlance.
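As a rough illustration of chaining several such joins, here is a sketch for the files described in the question. It rests on assumptions about the layouts: the CHI2 file is marker, score with a header line; the map file is marker, gene, score; the GMT file is set name, description, then member genes. All sub names are hypothetical, and the sample strings stand in for the real files.

```perl
use strict;
use warnings;

# Step 1: collect the markers of interest from the CHI2 file (header skipped).
sub read_markers {
    my ($fh) = @_;
    my %wanted;
    while (my $line = <$fh>) {
        chomp $line;
        my ($marker) = split /\t/, $line;
        next if !defined $marker || $marker eq 'Marker';
        $wanted{$marker} = 1;
    }
    return \%wanted;
}

# Step 2: scan the marker-to-gene map once, keep only wanted markers,
# and index them by gene. A gene may have several markers, so store a list.
sub index_markers_by_gene {
    my ($fh, $wanted) = @_;
    my %markers_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($marker, $gene) = split /\t/, $line;
        next unless defined $gene && $wanted->{$marker};
        push @{ $markers_for{$gene} }, $marker;
    }
    return \%markers_for;
}

# Step 3: index the GMT file by gene set name (assumed column layout:
# set name, description, then the member genes).
sub index_genes_by_set {
    my ($fh) = @_;
    my %genes_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($set, $description, @genes) = split /\t/, $line;
        next unless defined $set;
        $genes_for{$set} = \@genes;
    }
    return \%genes_for;
}

# Step 4: for one gene set, list every (gene, marker) pair found in the indexes.
sub rows_for_set {
    my ($set, $genes_for, $markers_for) = @_;
    my @rows;
    for my $gene (@{ $genes_for->{$set} || [] }) {
        for my $marker (@{ $markers_for->{$gene} || [] }) {
            push @rows, join "\t", $set, $gene, $marker;
        }
    }
    return @rows;
}

# Sample data only; real code would open the three files on disk.
my $chi2 = "Marker\tCHI2\nrs5416721\t9.85374546064131\n";
my $map  = "rs4881551\t41_HUMAN\t21567\nrs5416721\t5NTD_HUMAN\t0\n";
my $gmt  = "GO0035091\tGO0035091\t41_HUMAN\t5NTD_HUMAN\n";

open my $chi2_fh, '<', \$chi2 or die "Cannot open in-memory data: $!";
open my $map_fh,  '<', \$map  or die "Cannot open in-memory data: $!";
open my $gmt_fh,  '<', \$gmt  or die "Cannot open in-memory data: $!";

my $wanted      = read_markers($chi2_fh);
my $markers_for = index_markers_by_gene($map_fh, $wanted);
my $genes_for   = index_genes_by_set($gmt_fh);

print "$_\n" for rows_for_set('GO0035091', $genes_for, $markers_for);
```

With this sample data only the 5NTD_HUMAN row is printed, because rs4881551 does not appear in the sample CHI2 data. Filtering the huge map file against the CHI2 markers while reading it keeps the in-memory index small, and the ES, NES and NominalP columns from the original output could be appended to each row in the same way.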