Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks,

First let me say I am new to perl, I am using it for bioinformatics/genetic analysis.

What I want my script to do:
I have an IDs file in tab format, it looks like this (with info between slashes representing columns)

/Unwated/ ID required/ Unwanted1/ Unwanted2/...Unwanted6/ ID_alias/ ID_alias1/ ... ID_alias36/

I also have a gene names file, for which I want to return the official ID. The gene name may be in any one of the alias columns.

Gene names file looks like this

/Info/ Gene name/ Info1/ Info 2... Should be simple right?

What my script does:
It returns an empty output file.

I will post it here in the hope one of you learned people will be able to spot the mistakes. I am using dummy files to get it working (see below for examples).

Please excuse extensive comments in my script, like I said I am new to this.

Thank you in advance,
-Walking before I can run...


Script
################################################ # Declare an outfile to print to my $outfile = "HUGO_dummyResults.txt"; # Open the outfile using a file handle open( OUT, "> $outfile" ) or die "cannot create the output file"; ################################################# # Open file of list of neurotransmission genes where ENST has not been + found # FILENOTES::: File created in access using a query against approved H +UGO name and gene name. #Column 3 of file [2] is gene name col 4 [3] is the pathway gene is as +sociated with open (DUMMY_GENEFILE, 'DummyGenes.txt') or die "cannot open file conta +ining genes"; ################################################# # 2- Open HUGO tabbed file # FILENOTES:::: Approved gene name is in col 2 [1] open (DUMMYHUGO,'DummyHugo.txt') or die "cannot open file containing H +UGO IDs"; ################################################# #Operations ################################################# #make array genes #@genes = DUMMY_GENEFILE; #No longer done here, see below #make array HUGO @hugo = DUMMYHUGO; #for each line in genefile, try to match gene name [2] to one column o +f the columns [5]-[8] in the HUGO ID file. #check col 6, if found print, if not found, check next column. If neve +r found, print "not found". foreach (<DUMMY_GENEFILE>) #Changed from (<DUMMY_GENEFILE>) { #make array genes @genes = DUMMY_GENEFILE; for ($i = 4; $i < @hugo; $i++) { if ($genes[2] eq $hugo[$i]) #If found first print result { print OUT "$genes[0]\t$genes[1]\t$genes[2]\tgenes[ +3]\t$hugo[1]\n"; } # HUGO ID not found, print print OUT "$genes[0]\t$genes[1]\t$genes[2]\tgenes[3]\tNo H +UGO ID\n"; } } close (DUMMYHUGO); close (DUMMY_GENEFILE); close (OUT); exit;
________________
Here are some dummy files to help demonstrate.

DummyHugo.txt

HGCNID:1 SKJ Info1 Info2 Info3 Sandra San Katey Jones
HGCNID:2 DJL Info1 Info2 Info3 Dave David James London
HGCNID:3 PKKJ INfo1 INfo2 INfo3 Paul Kevin Kean June
HGCNID:4 KJRJ INfo1 Info2 INfo3 Katie Joanna Rachel Jolie


DummyGenes.txt

ID1 Id2 Katie Path
ID1a Id2a Dave Path
ID1b Id2b Kean Path
ID1c Id2c Paul Path
ID1d Id2d Sandra Path
____________________________

Replies are listed 'Best First'.
Re: (Failing) script to return an official ID
by toolic (Bishop) on Apr 11, 2008 at 13:08 UTC
    If you had used the strictures:
    use warnings; use strict;

    you would have gotten these error messages:

    Bareword "DUMMYHUGO" not allowed while "strict subs" in use ... Bareword "DUMMY_GENEFILE" not allowed while "strict subs" in use...

    This would have lead you to suspect some kind of problem with these lines of code:

    @hugo = DUMMYHUGO; @genes = DUMMY_GENEFILE;

    It is not really clear what you are trying to do, but if you are trying to match names in the genes file to any name on any line of the hugo file, the following code might be a step in the right direction. Please note that this is not an attempt at a complete solution:

    #!/usr/bin/env perl use warnings; use strict; my @hugos; open my $DUMMYHUGO, '<', 'DummyHugo.txt' or die "cannot open file cont +aining HUGO IDs: $!\n"; while (<$DUMMYHUGO>) { my @hugo = split; push @hugos, [@hugo]; } close $DUMMYHUGO; my $outfile = 'HUGO_dummyResults.txt'; open my $OUT, '>', $outfile or die "cannot create +the output file: $!\n"; open my $DUMMY_GENEFILE, '<', 'DummyGenes.txt' or die "cannot open fi +le containing genes: $!\n"; while (<$DUMMY_GENEFILE>) { my @genes = split; for my $href (@hugos) { my @hugo = @{$href}; for (my $i = 5; $i < 9; $i++) { if ($genes[2] eq $hugo[$i]) { print $OUT "$genes[0]\t$genes[1]\t$genes[2]\t$genes[3] +\t$hugo[1]\n"; } } } } close $DUMMY_GENEFILE; close $OUT; exit;

    The output file will then contain:

    ID1 Id2 Katie Path KJRJ ID1a Id2a Dave Path DJL ID1b Id2b Kean Path PKKJ ID1c Id2c Paul Path PKKJ ID1d Id2d Sandra Path SKJ

    This code reads the hugo file into an array of arrays, @hugos, then reads the genes file one line at a time and checks if the gene name matches any name on any line in the hugo file.

    I hope this helps.

      Thank you for your quick response, that is exactly the output I would like to see.
      I will have a play and see what happens.
      Dear toolic,
      This script works beautifully on my dummy data.

      When I run it using my real files I get an error message that repeats itself line after line until I stop it.

      Use of uninitialized value in string eq at HUGOID_extract.pl line 50, <$GENEFILE> line 1.

        The code I posted, as I mentioned, was not a complete solution. The code assumed that every line in the genes file would have at least 3 columns and that every line in the hugo file would have at least 9 columns, since this is what your dummy input sample files had.

        If your actual files have fewer columns, then you might get those warnings.

        If your actual files have blank lines, then you might get those warnings.

        It is impossible for me to know the structure of your input files without seeing more (small) examples. My guess is that you now need to check the format of your input. For example, you could check how many columns are in each line of the genes file by checking how many elements are in the array:

        my $cols = scalar @genes;

        Are you sure the code is looping infinitely? I could believe that the code would take a long time to run if your input files are really big (1 million lines, many columns per line).

Re: (Failing) script to return an official ID
by apl (Monsignor) on Apr 11, 2008 at 13:13 UTC
    Revised: Please ignore. toolic had previously provided a complete, working solution

    I modified your code to the following:

    #!/usr/bin/perl use strict; use warnings; my $outfile = "HUGO_dummyResults.txt"; open( OUT, "> $outfile" ) or die "cannot create the output file"; open (DUMMY_GENEFILE, 'DummyGenes.txt') or die "cannot open file conta +ining genes"; open (DUMMYHUGO,'DummyHugo.txt') or die "cannot open file containing H +UGO IDs"; ################################################# #Operations ################################################# #make array HUGO my @hugo = <DUMMYHUGO>; foreach (<DUMMY_GENEFILE>) #Changed from (<DUMMY_GENEFILE>) { #make array genes my @genes = split ( /\b/, $_ ); foreach my $i ( 4 .. 7 ) { if ($genes[2] eq $hugo[$i]) { print OUT "$genes[0]\t$genes[1]\t$genes[2]\t$genes[3]\t$hugo[1 +]\n"; } print OUT "$genes[0]\t$genes[1]\t$genes[2]\t$genes[3]\tNo HUGO I +D\n"; } } close (DUMMYHUGO); close (DUMMY_GENEFILE); close (OUT); exit;

    This results in:

    ID1 Id2 No HUGO ID ID1 Id2 No HUGO ID ID1 Id2 No HUGO ID ID1 Id2 No HUGO ID ID1a Id2a No HUGO ID ID1a Id2a No HUGO ID ID1a Id2a No HUGO ID ID1a Id2a No HUGO ID ID1b Id2b No HUGO ID ID1b Id2b No HUGO ID ID1b Id2b No HUGO ID ID1b Id2b No HUGO ID ID1c Id2c No HUGO ID ID1c Id2c No HUGO ID ID1c Id2c No HUGO ID ID1c Id2c No HUGO ID ID1d Id2d No HUGO ID ID1d Id2d No HUGO ID ID1d Id2d No HUGO ID ID1d Id2d No HUGO ID

    Could you tell me what you expected to see, so I can refine this further?

    Modifictions:

    • Define @hugo, @genes and $i.
    • References to DUMMYHUGO and DUMMY_GENEFILE should be surrounded by < and >.
    • genes[3] changed to $genes[3]
    • my $i defined tightened up slightly
Re: (Failing) script to return an official ID
by starbolin (Hermit) on Apr 11, 2008 at 16:35 UTC

    No need to apologize; I think you code is very well formated. Kudos on a well stated and complete question also. Why don't you sign in so you can collect XP and gain fame and fortune? Well the fortune part is probably a stretch but seriously, Get a login if you don't have one and sign in, the site could use your participation.


    s//----->\t/;$~="JAPH";s//\r<$~~/;{s|~$~-|-~$~|||s |-$~~|$~~-|||s,<$~~,<~$~,,s,~$~>,$~~>,, $|=1,select$,,$,,$,,1e-1;print;redo}
      Ooh, aren't you all terribly nice. I was terrified to post here, let alone register! I think I will because over the course of my PhD I will have to become much more adept at perl and I'm sure I will be baying with frustration on more than one occasion. Thanks again.