Hello, folks,

I'm working on a project in which I'm trying to understand more about the genotyping data that was given to us (.gen files). I wrote a Perl script to extract the lines I'm interested in from that data and saved them in a comma-delimited .txt file (CSV format). The lines in the .txt file have this format:

physical_location, Allele1, Allele2

34787638,A,C

34787800,A,G

The big question is:

Do the locations given in the .gen files really correspond to the alleles (1, 2) listed in my .txt file, or not? To check that, I need to look up each exact physical_location (one per line) in a genome database (NCBI, Ensembl, etc.) and retrieve the alleles (1, 2) that fall at that exact location on a specific chromosome.

Doing that manually in a genome browser is time-consuming, and I believe it is a common task, so there should be a BioPerl module that retrieves the alleles at a specific location on a given chromosome.

Any ideas whether there is a BioPerl module that can do that, or how else to approach this problem?

EDIT (more info):

The genotyping data was given to us as .gen files, each of which holds the genotype info for ONE chromosome, so I have a total of 22 .gen files (sizes from 700 MB up to 4 GB). Each line in a .gen file has the format:

(Chromosome, MarkerID, Ph_location, Allele1, Allele2, ... some other irrelevant info)

Then there is another small .txt file with the list of genes I'm interested in (50 genes), one gene per line, in the following format: (GeneName, Chromosome, StartPosition, EndPosition).

The Perl script I wrote extracts the lines I'm interested in from a .gen file, i.e. the lines whose Ph_location falls in the range from StartPosition to EndPosition of a gene in the gene list. I'm currently experimenting with the .gen file for chr1, for which the gene list happens to contain 10 genes. The script saves the matching lines in a comma-delimited .txt file (CSV format), in the format described at the very beginning of this post (location, allele1, allele2).
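A minimal sketch of what such an extraction step could look like (this is not the actual script). It assumes the .gen fields are whitespace-separated, the gene list is comma- or whitespace-separated, and chromosome names are written the same way in both files; the sub names `read_gene_ranges` and `extract_in_ranges` are made up for illustration.

```perl
use strict;
use warnings;

# Collect [start, end] ranges for one chromosome from the gene list.
# Expected line format: GeneName, Chromosome, StartPosition, EndPosition
sub read_gene_ranges {
    my ($fh, $chrom) = @_;
    my @ranges;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;                 # trim both ends
        next unless length $line;
        my ($gene, $chr, $start, $end) = split /[,\s]+/, $line;
        # Skip header or malformed lines (non-numeric positions).
        next unless defined $end && $start =~ /^\d+$/ && $end =~ /^\d+$/;
        push @ranges, [$start, $end] if $chr eq $chrom;
    }
    return \@ranges;
}

# Stream a .gen file line by line (the files are large) and print
# "position,allele1,allele2" for every position inside any range.
# Returns the number of lines written.
sub extract_in_ranges {
    my ($in, $out, $ranges) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        # Only the first five fields matter; the limit keeps split cheap.
        my ($chr, $marker, $pos, $a1, $a2) = split ' ', $line, 6;
        next unless defined $a2 && $pos =~ /^\d+$/;
        next unless grep { $pos >= $_->[0] && $pos <= $_->[1] } @$ranges;
        print {$out} "$pos,$a1,$a2\n";
        $count++;
    }
    return $count;
}
```

Both subs take already-opened filehandles, so the same code works on the real files and on small in-memory test strings.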

Now I have a .txt file with the lines I'm interested in, again in location, allele1, allele2 format. What I'm looking for is a BioPerl module that retrieves, from any genome database (NCBI, Ensembl, etc.), the alleles (1, 2) that correspond to each location in that .txt file and reports them, along with their location, in another text file. I would end up with two .txt files: one with the lines extracted from the genotype data, from which I pull the locations to search with, and another with each location and the corresponding alleles found by the search. Our goal is to validate the alleles given to us by retrieving them from NCBI (or wherever) by location, and eventually to work out a different approach to classifying the data moving forward.
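Once both files exist, the validation itself could be sketched roughly as below. This assumes both files use the location,allele1,allele2 layout and that allele order does not matter (A,C is treated the same as C,A). It does not handle alleles reported on the opposite strand, which would need complementing first. The sub names are hypothetical.

```perl
use strict;
use warnings;

# Read "position,allele1,allele2" lines into a hash:
# position => alleles, upper-cased and sorted, joined with "/".
sub read_alleles {
    my ($fh) = @_;
    my %by_pos;
    while (my $line = <$fh>) {
        $line =~ s/\s+//g;                       # drop spaces and line endings
        next unless length $line;
        my ($pos, @alleles) = split /,/, $line;
        next unless $pos =~ /^\d+$/ && @alleles >= 2;   # skips a header line too
        $by_pos{$pos} = join '/', sort map { uc } @alleles;
    }
    return \%by_pos;
}

# Label every given position as 'match', 'mismatch', or 'missing'
# (missing = the position has no entry in the reference file).
sub compare_alleles {
    my ($given, $reference) = @_;
    my %status;
    for my $pos (keys %$given) {
        $status{$pos} = !exists $reference->{$pos}            ? 'missing'
                      : $given->{$pos} eq $reference->{$pos}  ? 'match'
                      :                                         'mismatch';
    }
    return \%status;
}
```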

I apologize for the lengthy post; I'm trying to make it clearer to get the best out of this discussion. I hope it is clearer now. Thanks in advance! :)

FINAL OUTPUT

Two .txt files. The first holds the lines I'm interested in, in the format (physical_location, Allele1, Allele2). File 1 looks like this (this is chromosome 1, in case anyone wants to run it against NCBI or any other genome database):

34787638,A,C

34788686,A,G

34789549,C,T

34789695,C,G

34789808,C,T

347890859,C,G

Then there is another .txt file, file2.txt, that has the same first column (the locations). The other two columns (Allele1, Allele2) need to be retrieved from a genome database (NCBI, for example) by a BioPerl module that goes through each line in file 1, extracts the first column, queries the database, and fetches the corresponding alleles. This module should accept two arguments: the location, and the number of the chromosome from which to fetch the alleles. That's our goal, so we can double-check whether the alleles given in file 1 actually fall at the given locations.
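To make the two-argument interface concrete, here is a rough sketch of such a lookup. Note that it is not a BioPerl module: it queries the Ensembl REST API (the `overlap/region` endpoint with `feature=variation`) using the core modules HTTP::Tiny and JSON::PP. It assumes human data and that the positions are on the same genome assembly as the chosen server (rest.ensembl.org serves GRCh38, grch37.rest.ensembl.org serves GRCh37). The sub names are made up, and HTTPS with HTTP::Tiny additionally needs IO::Socket::SSL installed.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(decode_json);

# Build the REST URL for variants overlapping a single base.
# Assumption: the default server matches the assembly of the .gen data.
sub variation_url {
    my ($chrom, $pos, $server) = @_;
    $server //= 'https://rest.ensembl.org';
    return "$server/overlap/region/human/$chrom:$pos-$pos?feature=variation";
}

# From the JSON response text, keep only variants that start and end
# exactly at $pos and return their alleles as "A,C"-style strings.
sub alleles_at {
    my ($json_text, $pos) = @_;
    my $variants = decode_json($json_text);
    return [
        map  { join ',', @{ $_->{alleles} } }
        grep { ref $_->{alleles} eq 'ARRAY'
               && $_->{start} == $pos && $_->{end} == $pos }
        @$variants
    ];
}

# The two-argument lookup: chromosome and location in, allele strings out.
sub fetch_alleles {
    my ($chrom, $pos, $server) = @_;
    my $res = HTTP::Tiny->new->get(
        variation_url($chrom, $pos, $server),
        { headers => { 'Content-type' => 'application/json' } },
    );
    die "Request failed: $res->{status} $res->{reason}\n" unless $res->{success};
    return alleles_at($res->{content}, $pos);
}
```

A call such as `fetch_alleles(1, 34787638)` would return a reference to a list of allele strings, which may be empty if no known variant sits at that base. For thousands of positions it would be kinder to the server to pause between requests or to query whole gene ranges at once.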


In reply to bioperl module to extract specific nucleotides given chromosome and exact location of that nucleotide by xxArwaxx
