in reply to (Failing) script to return an official ID

If you had used the strictures:
use warnings; use strict;

you would have gotten these error messages:

Bareword "DUMMYHUGO" not allowed while "strict subs" in use ...
Bareword "DUMMY_GENEFILE" not allowed while "strict subs" in use...

This would have led you to suspect a problem with these lines of code:

@hugo = DUMMYHUGO;
@genes = DUMMY_GENEFILE;

It is not really clear what you are trying to do, but if you are trying to match names in the genes file to any name on any line of the hugo file, the following code might be a step in the right direction. Please note that this is not an attempt at a complete solution:

#!/usr/bin/env perl
use warnings;
use strict;

my @hugos;
open my $DUMMYHUGO, '<', 'DummyHugo.txt' or die "cannot open file containing HUGO IDs: $!\n";
while (<$DUMMYHUGO>) {
    my @hugo = split;
    push @hugos, [@hugo];
}
close $DUMMYHUGO;

my $outfile = 'HUGO_dummyResults.txt';
open my $OUT, '>', $outfile or die "cannot create the output file: $!\n";

open my $DUMMY_GENEFILE, '<', 'DummyGenes.txt' or die "cannot open file containing genes: $!\n";
while (<$DUMMY_GENEFILE>) {
    my @genes = split;
    for my $href (@hugos) {
        my @hugo = @{$href};
        for (my $i = 5; $i < 9; $i++) {
            if ($genes[2] eq $hugo[$i]) {
                print $OUT "$genes[0]\t$genes[1]\t$genes[2]\t$genes[3]\t$hugo[1]\n";
            }
        }
    }
}
close $DUMMY_GENEFILE;
close $OUT;
exit;

The output file will then contain:

ID1 Id2 Katie Path KJRJ
ID1a Id2a Dave Path DJL
ID1b Id2b Kean Path PKKJ
ID1c Id2c Paul Path PKKJ
ID1d Id2d Sandra Path SKJ

This code reads the hugo file into an array of arrays, @hugos, then reads the genes file one line at a time and checks if the gene name matches any name on any line in the hugo file.
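
The array-of-arrays lookup described above can be tried in isolation, without any files. The following is only a sketch: the rows in @hugo_lines and @gene_lines are made-up stand-ins for the real input, and the column positions (official ID in column 1, names in columns 5 to 8, gene name in column 2) are taken from the posted code rather than from any knowledge of your actual files. Unlike the posted loop, it reports a hugo row once even if the name appears in several of its columns.

```perl
#!/usr/bin/env perl
use warnings;
use strict;

# Made-up sample rows standing in for the hugo file and the genes file.
my @hugo_lines = (
    "1 KJRJ x x x Katie Kate Katy K",
    "2 DJL x x x David Dave Davy D",
);
my @gene_lines = (
    "ID1 Id2 Katie Path",
    "ID1a Id2a Dave Path",
    "ID1z Id2z Nobody Path",
);

# Each element of @hugos is a reference to an array holding one row's columns.
my @hugos = map { [split] } @hugo_lines;

my @results;
for my $line (@gene_lines) {
    my @genes = split ' ', $line;
    for my $aref (@hugos) {
        # Compare the gene name against columns 5..8 of this hugo row.
        if (grep { $_ eq $genes[2] } @{$aref}[5 .. 8]) {
            push @results, join "\t", @genes[0 .. 3], $aref->[1];
        }
    }
}
print "$_\n" for @results;
```

With this sample data, only the Katie and Dave rows find a match; the Nobody row produces no output.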

I hope this helps.

Re^2: (Failing) script to return an official ID
by Anonymous Monk on Apr 11, 2008 at 14:03 UTC
    Thank you for your quick response, that is exactly the output I would like to see.
    I will have a play and see what happens.
Re^2: (Failing) script to return an official ID
by Anonymous Monk on Apr 12, 2008 at 16:27 UTC
    Dear toolic,
    This script works beautifully on my dummy data.

    When I run it using my real files I get an error message that repeats itself line after line until I stop it.

    Use of uninitialized value in string eq at HUGOID_extract.pl line 50, <$GENEFILE> line 1.

      The code I posted, as I mentioned, was not a complete solution. The code assumed that every line in the genes file would have at least 3 columns and that every line in the hugo file would have at least 9 columns, since this is what your dummy input sample files had.

      If your actual files have fewer columns, then you might get those warnings.

      If your actual files have blank lines, then you might get those warnings.

      It is impossible for me to know the structure of your input files without seeing more (small) examples. My guess is that you now need to check the format of your input. For example, you could check how many columns are in each line of the genes file by checking how many elements are in the array:

      my $cols = scalar @genes;
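
      A column-count check like that can also be used to skip rows that are too short, which should silence the warning. This is a rough sketch only: official_ids_for is a hypothetical helper, the sample rows are invented, and I am assuming that the names to compare start at column 5 and run to the end of the line, which may not match your real files.

      ```perl
      use warnings;
      use strict;

      # Hypothetical helper: return the official ID (column 1) of every hugo
      # row in which $name appears anywhere from column $first to the end.
      sub official_ids_for {
          my ($name, $hugos, $first) = @_;
          my @ids;
          return @ids unless defined $name;
          for my $row (@{$hugos}) {
              next if $#{$row} < $first;    # blank or short row: nothing to compare
              if (grep { $_ eq $name } @{$row}[$first .. $#{$row}]) {
                  push @ids, $row->[1];
              }
          }
          return @ids;
      }

      # Invented rows: no names, one name, a blank line, and three names.
      my @hugos = map { [split] } (
          "1 KJRJ a b c",
          "2 DJL a b c Dave",
          "",
          "3 PKKJ a b c Kean Paul Keano",
      );

      # Only look up the gene name if the genes line has at least 3 columns.
      my @genes = split ' ', "ID1c Id2c Paul Path";
      my @ids = @genes >= 3 ? official_ids_for($genes[2], \@hugos, 5) : ();
      print "@ids\n";
      ```

      Slicing up to $#{$row} instead of a fixed column 8 means a row is compared only against the columns it actually has, so no undefined element ever reaches eq.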

      Are you sure the code is looping infinitely? I could believe that the code would take a long time to run if your input files are really big (1 million lines, many columns per line).

        Yes, my dummy files were very simplified. Some lines may not have any aliases at all; the number ranges from 0 to 35. That must be the problem.

        No, I'm not sure the code loops infinitely; I stopped it after many, many lines!

        Thanks again.