ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:
I have been trying to take a list of human accession numbers for genes mutated in cancer tumor samples and check via HomoloGene whether they have homologues in Drosophila Melanogaster, because I have a collaborator with a genetic screen assay set up in this species. The batch submission for Entrez doesn't seem to be working for me. Someone suggested that BioPerl might have a module that could do something like this, and that it would be easier than dealing with Entrez batch queries. Does anyone have any idea which module it would be, or how to use it for this? I have not been able to find one. Thank you for your time.
Re: Homologene BioPerl
by erix (Prior) on Dec 02, 2011 at 18:30 UTC
There is a BioPerl module that knows how to talk to NCBI's E-Utilities: see http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook (it mentions HomoloGene; I suppose it works, but I haven't tried it). You can also use the E-Utilities directly. Both approaches have a slight learning curve.
A third approach is to download HomoloGene into a local database. The NCBI E-Utilities work well, but when working with HomoloGene I find it handier (and faster) to have all the data locally, using the file provided by NCBI in:
The file 'homologene.data' there, when stored in a database, looks like this (just showing 10 random rows):
What you want is to look up your human gene or accession (human: tax_id=9606), take its group_id, and see whether there is a Drosophila melanogaster (fly: tax_id=7227) record with the same group_id.
If you have basic database skills, here is a way to load that file into a PostgreSQL database:
The records that have the same group id are homologs.
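For readers without database experience, the same lookup can be sketched in plain Perl. This is a minimal sketch under stated assumptions, not code from the thread: it assumes 'homologene.data' is tab-delimited with the group id in column 1, the taxonomy id in column 2, the gene id in column 3, the gene symbol in column 4 and the protein accession in column 6, and that accessions in the file may carry a version suffix (such as ".1") that your own list lacks. The sub names index_homologene and fly_homologs are made up for illustration.

```perl
use strict;
use warnings;

# Taxonomy ids quoted in the thread.
my $HUMAN_TAX = '9606';
my $FLY_TAX   = '7227';

# Read homologene.data from an open filehandle and build two lookups:
#   %group_of : human accession (version stripped) -> group id
#   %fly_in   : group id -> list of fly records in that group
# ASSUMPTION: tab-delimited columns are
#   group id, tax id, gene id, gene symbol, protein gi, protein accession.
sub index_homologene {
    my ($fh) = @_;
    my (%group_of, %fly_in);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;                      # tolerate DOS line endings
        my ($group, $tax, $gene_id, $symbol, $gi, $acc) = split /\t/, $line;
        next unless defined $acc;                  # skip short or blank lines
        (my $bare = $acc) =~ s/\.\d+\z//;          # NP_123456.2 -> NP_123456
        $group_of{$bare} = $group if $tax eq $HUMAN_TAX;
        push @{ $fly_in{$group} },
            { gene_id => $gene_id, symbol => $symbol, accession => $acc }
            if $tax eq $FLY_TAX;
    }
    return (\%group_of, \%fly_in);
}

# Return the fly records that share a group id with the given human
# accession; an empty list means "not in the file" or "no fly member".
sub fly_homologs {
    my ($accession, $group_of, $fly_in) = @_;
    (my $bare = $accession) =~ s/\.\d+\z//;
    my $group = $group_of->{$bare};
    return unless defined $group;
    return @{ $fly_in->{$group} || [] };
}

# Command-line use: perl script.pl homologene.data ACCESSION [ACCESSION ...]
if (@ARGV) {
    my ($data_file, @accessions) = @ARGV;
    open my $fh, '<', $data_file or die "Cannot open $data_file: $!\n";
    my ($group_of, $fly_in) = index_homologene($fh);
    close $fh;
    for my $acc (@accessions) {
        my @hits = fly_homologs($acc, $group_of, $fly_in);
        if (@hits) {
            print join("\t", $acc, @{$_}{qw(symbol gene_id accession)}), "\n"
                for @hits;
        }
        else {
            print "$acc\tno fly homolog found\n";
        }
    }
}
```

Indexing the whole file once and then looking up each accession keeps the run time roughly linear in the file size, which matters when the accession list is long. Stripping the version suffix is a convenience; drop those two substitutions if you need exact version matches.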
With that group id, you can easily construct links into specific NCBI homologene pages too:
hth
P.S. Re zoological nomenclature: in the binomial name Drosophila melanogaster, 'melanogaster' is the specific epithet and must *always* be lower case; only the genus name is capitalised.
by ZWcarp (Beadle) on Dec 02, 2011 at 19:53 UTC
I can't thank you enough for your help. I tried to do what you suggested by downloading the database. I got that far, and I now have the homologene.data file.
I must say, though, that my "database" skills are nonexistent. I am, however, good at parsing and basic Unix/Perl matching. So from what you are saying, the group id will be the same for each gene and its homologs (even if they are named differently), and I just need to find every line that has the group id of each human accession number, and then see whether any of those lines carries the Drosophila tax ID? Thanks again, you have helped me tremendously! Here's my code so far for this:
by erix (Prior) on Dec 02, 2011 at 20:36 UTC
See also the NCBI explanatory files in:
Especially the README file, which says:
So yes, you search for your human accession in column 6, then look at the value in column 1 (the HomoloGene group id), and then check whether there is a row that has both taxonomy_id=7227 (= D. melanogaster) *AND* that HomoloGene group id.
Btw (if you want more data), the 'Gene ID' can be handy too, as it gives you access to the whole of Entrez and lets you construct URLs into the main gene page, etc. More data 'addressable' via 'gene id' is in the files in:
(esp. gene_info and gene2accession)
(btw, I do /not/ see any HomoloGene records for your NP_001124398, so maybe your BioPerl script does work after all, if you give it a human accession with known data in HomoloGene)
Re: Homologene BioPerl
by Marshall (Canon) on Dec 02, 2011 at 17:23 UTC