Hello all, I come from a biology background and started using Perl recently. I want to add more information to the output generated by a program, and to do that I would like to use the raw data that produced that output. I think this can be done by parsing and searching files. The problem is that the raw input files are large, and searching them with regular expressions takes too long, so I need to find another way to do it. I thought this would be a great place to ask for help.


Here is a description of my files.

Output file, which has to be parsed (all files are tab-delimited):

Geneset=GO0035091  Size=77  ES=0.525  NES=2.913  NominalP=0.000
Geneset=GO0030163  Size=54  ES=0.463  NES=2.248  NominalP=0.013
Geneset=GO0007067  Size=44  ES=0.484  NES=1.975  NominalP=0.018
......

The input (raw data) files include:

A GMT file (lists all genes associated with each gene set; large file):

GO0046800  GO0046800  CD209  CD209L  CD209L1  CD209_HUMAN  CLC4M
GO0032104  GO0032104  CART  CARTPT  CART_HUMAN  GHRL  GHRL_HUMAN
.........  ..........  .....  ......  ..........  ....  ..  ........
.........  ..........  .....  ......  ..........  ....  ..  ........
GO0035091  GO0035091  41_HUMAN  5NTD_HUMAN  9804  A-388D4.1  .....

A SNP/Marker to Gene Map file (maps SNPs to genes, with a score for each pair; huge file):

rs10904494  NP_817124   17881
rs7906287   NP_817124   39800
rs4881551   41_HUMAN    21567
rs5416721   5NTD_HUMAN  0
....................  ..............  ....

A CHI2 file (lists the SNPs/markers from a study, with a score for each):

Marker      CHI2
rs3749375   11.7268615355335
rs10499549  10.4656064706897
rs5416721   9.85374546064131

I need to produce a new output file that looks like this:

Geneset    Genes       SNP/Marker  ES    NES      NominalP
GO0035091  41_HUMAN    rs4881551   ....  .......  ....
           5NTD_HUMAN  rs5416721   ....  .......  ....
....................  ....................  ..................

The first step is to get each gene set and its corresponding genes, which I have already done. For the rest I need your help. The next step is to use the markers from the CHI2 file to search the SNP/Marker to Gene Map file (the CHI2 file contains the markers/SNPs we are interested in) and store the matching lines in a file. Then that file has to be searched for the genes from each gene set in order to get their SNPs, which are printed to a new file along with the old data. Please help me with this. Because my programming skills are limited, I would appreciate a little explanation rather than just code, so that I can understand it and use or modify it in the future.
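One way the two search steps above might be done without regular expressions is with hashes, so that each lookup is a single key access instead of a scan over a whole file. What follows is only a minimal sketch under assumptions: the sub names `load_markers` and `genes_to_snps` are hypothetical, the files are assumed to be tab-delimited with the columns in the order shown above, and the sample rows are inlined as in-memory filehandles so the example runs on its own.

```perl
use strict;
use warnings;

# Read CHI2 data and return a hash ref of marker => CHI2 score.
# Assumes a header line "Marker<TAB>CHI2" followed by one marker per line.
sub load_markers {
    my ($fh) = @_;
    my %chi2;
    while (my $line = <$fh>) {
        chomp $line;
        my ($marker, $score) = split /\t/, $line;
        next if !defined $score || $marker eq 'Marker';   # skip blank and header lines
        $chi2{$marker} = $score;
    }
    return \%chi2;
}

# Read the SNP-to-gene map one line at a time, keeping only markers that
# appear in the CHI2 hash. Returns a hash ref of gene => [markers].
sub genes_to_snps {
    my ($fh, $chi2) = @_;
    my %snps_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($marker, $gene) = split /\t/, $line;
        next unless defined $gene && exists $chi2->{$marker};
        push @{ $snps_for{$gene} }, $marker;
    }
    return \%snps_for;
}

# Sample rows copied from the file descriptions above (stand-ins for real files).
my $chi2_text = join "\n",
    "Marker\tCHI2",
    "rs3749375\t11.7268615355335",
    "rs10499549\t10.4656064706897",
    "rs5416721\t9.85374546064131";
my $map_text = join "\n",
    "rs10904494\tNP_817124\t17881",
    "rs7906287\tNP_817124\t39800",
    "rs4881551\t41_HUMAN\t21567",
    "rs5416721\t5NTD_HUMAN\t0";

open my $chi2_fh, '<', \$chi2_text or die "Cannot open CHI2 data: $!";
my $chi2 = load_markers($chi2_fh);
close $chi2_fh;

open my $map_fh, '<', \$map_text or die "Cannot open map data: $!";
my $snps_for = genes_to_snps($map_fh, $chi2);
close $map_fh;

# Look up the genes of one gene set; genes without a CHI2 marker are skipped.
for my $gene ('41_HUMAN', '5NTD_HUMAN') {
    my $snps = $snps_for->{$gene} or next;
    print join("\t", $gene, @$snps), "\n";
}
```

With real data, the two in-memory filehandles would be replaced by `open` calls on the actual files. The huge map file is read line by line and never held in memory as a whole; only the lines whose marker occurs in the CHI2 file are kept.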

Here is my code so far:

use strict;
use warnings;

my $dir = 'C:\\Documents and Settings\\shra1\\Desktop\\prj';

# Read the enrichment output and the GMT gene set file into memory.
open(my $output_fh, '<', "$dir\\schneider_breast_copy_num_pathway_enrichment1.txt")
    or die "Cannot open output file: $!";
my @output = <$output_fh>;
close($output_fh);

open(my $gmt_fh, '<', "$dir\\human.gmt") or die "Cannot open GMT file: $!";
my @geneSets = <$gmt_fh>;
close($gmt_fh);

# Take the gene set name (the text after "Geneset=") from the first 10 lines.
my @genesInSet;
for my $i (0 .. 9) {
    last unless defined $output[$i];
    my @outputLineSplit = split(/\t/, $output[$i]);
    my $setName = $outputLineSplit[0];
    $setName = substr($setName, index($setName, "=") + 1);
    #print "$setName\n";
    push @genesInSet, $setName;
}

# For each gene set name, scan every GMT line for a match (this is the slow part).
foreach my $setName (@genesInSet) {
    print "$setName\n";
    foreach my $gmtLine (@geneSets) {
        if ($gmtLine =~ m/\Q$setName\E/i) {
            #print "$gmtLine\n";
        }
    }
}
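Most of the time in the code above probably goes into the inner loop, which runs a regex over every GMT line for every gene set name. A possible alternative, sketched here under assumptions, is to read the GMT file once into a hash keyed by gene set name and then look each name up directly. The sub names `parse_result_line` and `index_gmt` are hypothetical, each GMT line is assumed to hold a name, a description, and then the genes, all tab-separated, and the sample rows are inlined and shortened to what is shown above.

```perl
use strict;
use warnings;

# Turn "Geneset=GO0035091<TAB>Size=77<TAB>..." into a hash ref of field => value.
sub parse_result_line {
    my ($line) = @_;
    chomp $line;
    my %field;
    for my $pair (split /\t/, $line) {
        my ($key, $value) = split /=/, $pair, 2;
        $field{$key} = $value if defined $value;
    }
    return \%field;
}

# Read the GMT data once and return a hash ref of gene set name => [genes].
# The second column (the description) is read and then ignored.
sub index_gmt {
    my ($fh) = @_;
    my %genes_in;
    while (my $line = <$fh>) {
        chomp $line;
        my ($name, $description, @genes) = split /\t/, $line;
        next unless defined $name && length $name;
        $genes_in{$name} = \@genes;
    }
    return \%genes_in;
}

# Sample rows copied from the file descriptions above (stand-ins for real files).
my $gmt_text = join "\n",
    "GO0046800\tGO0046800\tCD209\tCD209L\tCD209L1\tCD209_HUMAN\tCLC4M",
    "GO0035091\tGO0035091\t41_HUMAN\t5NTD_HUMAN\t9804\tA-388D4.1";
my $result_line = "Geneset=GO0035091\tSize=77\tES=0.525\tNES=2.913\tNominalP=0.000\n";

open my $gmt_fh, '<', \$gmt_text or die "Cannot open GMT data: $!";
my $genes_in = index_gmt($gmt_fh);
close $gmt_fh;

my $result = parse_result_line($result_line);

# One hash lookup replaces the scan over all GMT lines.
my $genes = $genes_in->{ $result->{Geneset} } || [];
print join("\t", $result->{Geneset}, $_, @{$result}{qw(ES NES NominalP)}), "\n"
    for @$genes;
```

Each printed row carries the gene set name, one gene, and the ES, NES and NominalP values from the output file; the SNP column would come from a gene-to-SNP lookup built from the map and CHI2 files.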

In reply to Searching and Parsing Biological data by biomonk
