Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I have an array of sequences, each with a unique id. I'm trying to extract those sequences based on ids in a second array. The problem is that I am extracting them in the order they appear in the sequence array, not the order in the second array (which is what I want). How can I alter my code to extract the sequences based on the order they appear in array2? Thanks!
Sequence file (@array1) looks like this:

    >gi|13470331|ref|NP_101896.1| hypothetical protein
    MFWVTKKALMPFLMLPAGIIFVSAVGYAINWLFSTLFQFQPPLVEGPAGPVTVLIFTITMLLAYDISYYL
    >gi|13470319|ref|NP_101897.1| hypothetical protein
    MGAYCQAHPACKVTDRTVIGRRDAAMNAPFVLAIPRTRTFEVVTSAARLAEIAPAWTALWQRAGGLVFQH

    my @array2 = qq(13470319 13470331 15460001 13490216);
    foreach my $line (@array1) {
        if ($line =~ /^gi\|(\d+)/) {
            for (my $i=0;$i<@array2; $i++) {
                if ($array2[$i] == $1) {
                    print "$line ";
                }
            }
        }
    }

Replies are listed 'Best First'.
Re: retrieving in the correct order
by halley (Prior) on Dec 16, 2004 at 19:38 UTC
    See my response in an old thread: Re: sort an array according to another array

    Also, you're initializing @array2 with qq() which makes a string, not a list of values. You want either qw( list of words ) or ('word', 'word', 'word') or (value, value, value) without the qq. If you think of your identifiers as words, then you should use lt/eq/gt instead of </==/> when comparing them, too.
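    A minimal sketch of the difference, using the ids from the original post. The variable names @one_string, @ids, $wanted and @hits are made up for illustration:

```perl
use strict;
use warnings;

# qq() is a double-quoted string, so this array gets exactly one element.
my @one_string = qq(13470319 13470331 15460001 13490216);

# qw() splits on whitespace, giving one element per id.
my @ids = qw(13470319 13470331 15460001 13490216);

printf "qq gives %d element(s), qw gives %d\n",
    scalar @one_string, scalar @ids;

# Treating the ids as words: compare with eq rather than ==.
my $wanted = '13470331';
my @hits   = grep { $_ eq $wanted } @ids;
print "matched: @hits\n";
```

    With the qq() version the inner loop only ever sees one long string, so no single id can be compared against it as intended.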

    --
    [ e d @ h a l l e y . c c ]

Re: retrieving in the correct order
by VSarkiss (Monsignor) on Dec 16, 2004 at 19:27 UTC
      Hi VSarkiss, Thanks for your solution but I can't get it to work! Maybe it's because my first array is quite big (~1000 sequences). But wouldn't this just slow it down? Thanks

        Well, some detail on what went wrong would help....

        When I tried it against the sample data in your original post, I noticed two things:

        1. You're testing for gi| at the beginning of the line, but your @array1 values start with >gi|. I had to remove the >; you'll have to either fix the $key = line to match your data, or fix the data to match your test...
        2. You're populating a single element in array2. If you want each number to be an element of the array, you need to use qw(...), not qq(...).
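        For the first point, one option (an assumption about the data, not the only possible fix) is to make the leading > optional in the pattern, so the same test accepts both forms. The names @headers and @found are illustrative only:

```perl
use strict;
use warnings;

# Two header lines from the thread, one with and one without the FASTA '>'.
my @headers = (
    '>gi|13470331|ref|NP_101896.1| hypothetical protein',
    'gi|13470319|ref|NP_101897.1| hypothetical protein',
);

my @found;
for my $line (@headers) {
    # '>?' makes the leading '>' optional; $1 captures the numeric id.
    if ($line =~ /^>?gi\|(\d+)/) {
        push @found, $1;
    }
}
print "ids: @found\n";
```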

        If these are both copy-and-paste artifacts, please provide more detail on what the error is.

        hey, check my code below, i tested it with a file with 3000 lines in it. time cat gen.txt | perl -w gen.pl says:
        real 0m0.139s user 0m0.109s sys 0m0.000s

        if you're still looking for an answer...

        --
        to ask a question is a moment of shame
        to remain ignorant is a lifelong shame
Re: retrieving in the correct order
by Animator (Hermit) on Dec 16, 2004 at 19:14 UTC

    A possible way is to build a hash of the first array, where the key is the id of the element.

    If that's done you can easily use a hash slice to get an array with the values in the order of the second array.
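    A rough sketch of that idea. It assumes each element of @array1 holds one whole record (header line plus sequence) and uses shortened sample sequences; %seq_for and @ordered are names made up here:

```perl
use strict;
use warnings;

# Assumed layout: each element is ">gi|ID|..." header plus its sequence.
# The sequences are truncated here to keep the example short.
my @array1 = (
    ">gi|13470331|ref|NP_101896.1| hypothetical protein\nMFWVTKKALMPF",
    ">gi|13470319|ref|NP_101897.1| hypothetical protein\nMGAYCQAHPACK",
);
my @array2 = qw(13470319 13470331 15460001 13490216);

# Build the lookup hash: id => full record.
my %seq_for;
for my $record (@array1) {
    my ($id) = $record =~ /^>?gi\|(\d+)/ or next;   # skip records with no id
    $seq_for{$id} = $record;
}

# Hash slice: values come back in @array2 order; ids that are not in the
# hash come back as undef, so filter those out before printing.
my @ordered = grep { defined } @seq_for{@array2};
print "$_\n" for @ordered;
```

    Building the hash is one pass over @array1 and the slice is one pass over @array2, so there is no nested loop over both arrays.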

Re: retrieving in the correct order
by nedals (Deacon) on Dec 16, 2004 at 19:52 UTC
    # If the files are not too large...
    # Read in the sequence file putting the data into a hash
    use strict;
    my %hash;
    while (<DATA>) {
        chomp $_;
        my ($id,$protein) = /^gi\|(.+?)\|.+\|(.+)$/;  ## Save what you need
        $hash{$id} = $protein;
    }

    # Now use the second file to print out the hash
    my @array2 = qw(13470319 13470331 15460001 13490216);
    map { print "$hash{$_}\n"; } @array2;

    __DATA__
    gi|13490216|ref|NP_101899.1|protein for 216
    gi|13470331|ref|NP_101896.1|protein for 331
    gi|15460001|ref|NP_101898.1|protein for 001
    gi|13470319|ref|NP_101897.1|protein for 319
      A hash slice, as mentioned in my first post would be faster (I guess)...

      something like: print join("\n", @hash{@array2});

      Another thing (just noticed it now) why would you be using map?? map returns an array (filled with x times 1 (return value of print)), which you aren't using at all...

      What should be used is for/foreach (or a hash slice ofc).
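      To illustrate the point about map, here is a small sketch with a made-up two-entry %hash (the keys and values are hypothetical):

```perl
use strict;
use warnings;

my %hash   = (a => 'first', b => 'second');
my @array2 = qw(b a);

# map collects the return value of each print call (1 on success),
# so @returned ends up as a list of 1s that nobody needs.
my @returned = map { print "$hash{$_}\n"; } @array2;

# foreach does the same printing without building a throwaway list.
foreach my $id (@array2) {
    print "$hash{$id}\n";
}
```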

        The map method was already sitting in my 'test' template. The foreach method is a better option, but I liked your hash slice method even better. ++
Re: retrieving in the correct order
by insaniac (Friar) on Dec 16, 2004 at 20:54 UTC
    or, if the first array is really a text file, say on a UNIX/LINUX system, you could cat the first file and read it line by line.. no?
    the scanning perl program gen.pl:
    ---------------------------------
    #!/usr/bin/perl
    use strict;
    my @array = qw(13470319 13470331 15460001 13490216);
    my @array2;
    while(my $line = <> ) {
        foreach my $id (0..$#array) {
            $array2[$id]=$line if $line =~ m/^gi\|($array[$id])\|/;
        }
    }
    print "order: ", join (" ", @array), "\n";
    map {print} @array2;
    -------------------------------
    the text file:
    gi|13470331|ref|NP_101896.1| hypothetical protein
    MFWVTKKALMPFLMLPAGIIFVSAVGYAINWLFSTLFQFQPPLVEGPAGPVTVLIFTITMLLAYDISYYL
    gi|13470319|ref|NP_101897.1| hypothetical protein
    MGAYCQAHPACKVTDRTVIGRRDAAMNAPFVLAIPRTRTFEVVTSAARLAEIAPAWTALWQRAGGLVFQH
    -------------------------------
    the execution:
    # cat gen.txt | perl gen.pl
    order: 13470319 13470331 15460001 13490216
    gi|13470319|ref|NP_101897.1| hypothetical protein
    gi|13470331|ref|NP_101896.1| hypothetical protein

    this just looked like a quick and keep it simple job to me..

    UPDATE: updated the code to display correct order..

    --
    to ask a question is a moment of shame
    to remain ignorant is a lifelong shame

      I think you are missing the point...

      If I understand the poster correctly, then he (or she) wants to output them in the order they appear in the second array, so in this case first line/element 13470319, after that line/element 13470331, and so on.

        aah.. oops.. indeed i understood wrongly.

        update: and wisdom came to me :-D check the above code... simple, line by line and in the correct order. this time i tested the code ;-)

        --
        to ask a question is a moment of shame
        to remain ignorant is a lifelong shame