Hi, I have hit a bit of a snag with my Perl code. I am trying to take a random, unique sample from a file containing more than 100,000 sequences, but with the way the data is laid out I can't figure out how to write code that does it.

What I have so far:

#!/usr/bin/perl -w
# Use your own path
use strict;    # always use strict

my $file = "josh_vamps_seqs_ALL.fa";
open (FILE, $file) or die "$0: $!";    # printing errors is a Good Thing (tm)
my @file = <FILE>;                     # Store all the lines in an array
close FILE;

open OUT, ">rarefaction.fa" or die "$0: $!";    # Caution: will overwrite any existing files

my $i;
for ($i = 0; $i < 4266; $i++) {    # Start sampling loop
    # rand gives you a random number between 0 and the number of lines
    # remaining in the array; note that this number is by necessity dynamic,
    # it will drop 1000, 999, 998, etc.
    # (rand(@file), not rand($#file), so the last line can be picked too)
    my $rand = int(rand(@file));
    my $sample = splice @file, $rand, 1;    # Cut out the $rand-th line, length 1 means just one line
    print OUT $sample;
}    # end of for loop
close OUT;

What it gives me:

CTTTTCTTCGGACTACTTACAAGGTGTTGCATGGTCGTC
>FW4WBAJ01DVAX5.ICM_PML_Bv6.PML_43_2003_06_09
CGAGTCAACGCGCAGAACCTTACCAACACTTGACATGTTCGTCGCGACTCTAAGAGATTA
TCTCTATGCGCAACGCGAAAACCTTACCTGGCCTTGACATGCATCTCTAAGCGTGTGAAA
>FMS0R7002J2YH1.ICM_CAM_Bv6.CAM_0011_2000_03_26
TGGTGCCTTCGGGAACGCAGTGACAGGTGATGCATGG
AAACCCTCAGAGACTTCGGTTAATGACATGTTTACAGGTGATGCATGGCCGTCG
>E6SXMJY02I00IR.ICM_BMO_Bv6.BMO_0005_2007_09_22
TTCGGTTCGGCCGGACGAAACACAGGTGT
TAGTGCGACGCGAAGAACCTTACCAGGGCTTAAATGTAGTGGGACAGGTCTAGAGATAGA
GGGTGCCCTTCGGGGAATCTAGTGAGAGGTGTTGCATGGCCGTCG
GTGAGCAACGCGCAGAACCTTACCAACCCTTGACATCCTGTGCTACTACCAGAGATGGTA
TACATCTACGCGAAGAACCTTATCTACACTTGACATACAGAGAACTTACCAGAGATGGTT
TGGTGCCTTCGGGAATCTAGTGACAGGTGATGCATGGCTGTCG
CACACCAACGCGAAAAACCTTACCAACACTTGACATGTTCGTCGCGACTCTAAGAGATTA
TTCGGTTCGGCCGGACGAAACACAGGTGTTGCATGGCTGTC

What I actually want:

FW4WBAJ01DVAX5.ICM_PML_Bv6.PML_43_2003_06_09
FMS0R7002J2YH1.ICM_CAM_Bv6.CAM_0011_2000_03_26
E6SXMJY02I00IR.ICM_BMO_Bv6.BMO_0005_2007_09_22

What the input file looks like:

>FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACCTTACCAGGTCCTGACTTCCTGACTATGGTTATTAGAAATAA
TTTCCTTCAGTTCGGCTGGGTCAGTGACAGGTGATGCATGGCCGTC
>FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTAACCGATGAACCTTACCTACACTTGACATGCAGAGAACTTTCCAGAGATGGAT
TGGTGCCTTCGGGAACTCTGACACAGGTGATGCATCGCCGTC
>FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTACGCGAAGAACCTTACCTACACTTGACATACAGAGAACTTACCAGAGATGGTT
TGGTGCCTTCGGGAACTCTGATACAGGTGATGCATGGCTGTC
>FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACCTTACCAACCCTTGACATCCAGAGAATTTTCTAGAGATAGAT
TTGTGCCTTCGGGAACTCTGTGACAGGTGATGCATGGCTGTC

I don't know whether there is a way for Perl's random function to recognize a specific piece of the input, say the ">", and then print the line that follows. Any pointers would be greatly appreciated.
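One possible approach, sketched here rather than offered as the definitive answer: rand itself cannot look at the input, but the header lines can be filtered out first and the sampling done on those alone. The sketch below assumes every record starts with a ">" header line, that the ID is everything up to the first whitespace, and that the IDs in the file are already distinct. The sub names read_fasta_ids and sample_ids are made up for illustration, and the demo reads an in-memory copy of the four headers shown above (sequences shortened) instead of the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Collect the ID from every FASTA header line (a line starting with ">").
# Sequence lines are skipped, so they can never end up in the sample.
sub read_fasta_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        chomp $line;
        push @ids, $1 if $line =~ /^>(\S+)/;
    }
    return @ids;
}

# Return $n IDs picked at random without replacement
# (or all of them, in random order, if fewer than $n exist).
sub sample_ids {
    my ($n, @ids) = @_;
    my @shuffled = shuffle @ids;
    $n = @shuffled if $n > @shuffled;
    return @shuffled[0 .. $n - 1];
}

# Demo data: the four headers from the post, sequences shortened.
my $fasta = <<'END';
>FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACC
>FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTAACCGATGAACC
>FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTACGCGAAGAACC
>FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACC
END

# Open the string as a read-only in-memory filehandle.
open my $in, '<', \$fasta or die "$0: $!";
my @ids = read_fasta_ids($in);
close $in;

# Print two randomly chosen, distinct IDs, one per line.
print "$_\n" for sample_ids(2, @ids);
```

For the real data, the in-memory handle would be replaced by opening josh_vamps_seqs_ALL.fa, and the 2 by 4266. Shuffling the whole ID list and taking a slice is used here instead of the repeated splice because it avoids shrinking the array inside the loop; both give a sample without replacement.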


In reply to Get random unique lines from file by radnorr
