LostWeekender has asked for the wisdom of the Perl Monks concerning the following question:

This is my very first time using Perl and I'm wondering if someone could show me how to perform the following task (which will form part of a larger pipeline I'm building). I'm hoping that seeing how to do this in Perl will give me a foothold/starting point into the language and allow me to build the pipeline from there. The task in question:

I have a text file containing approximately 15,000 records. Each record looks like this:

TF Unknown
TF Name Unknown
Gene ENSG00000113916
Motif ENSG00000113916___1|2x3
Family C2H2 ZF
Species Homo_sapiens
Pos A C G T
1 0.538498 0.157305 0.157633 0.146564
2 0.072844 0.008771 0.877166 0.0412175
3 0.959269 0.013107 0.015961 0.0116621
4 0.852439 0.023883 0.016813 0.106864
5 0.57332 0.068801 0.181385 0.176494
6 0.139513 0.074798 0.737607 0.0480813
7 0.735484 0.091299 0.09091 0.0823067
8 0.79932 0.027041 0.137306 0.0363319
9 0.16103 0.12536 0.109938 0.603672
10 0.622356 0.06782 0.115463 0.194361

For the rows explicitly numbered 1 to 10, I need to find the highest of the four (<1.0) values in each row and output the character heading that column (a DNA base). For example, row 1 in the above matrix gives A. I ultimately need to produce a list containing two columns: the first with the "motif name" from row four of the record, and the second with the string of 10 characters from the matrix analysis, e.g.

ENSG00000113916___1|2x3 AGAAAGAATA

Thank you, any help is sincerely appreciated!

Replies are listed 'Best First'.
Re: First foray into Perl
by ww (Archbishop) on Mar 24, 2014 at 16:14 UTC

    Just how much do you know now? If this is the first time you've exposed yourself to Perl, you'd probably better start with the usual list of recommendations: read Learning Perl, the introductory Tutorials on this site, etc., because otherwise you're just asking us to hand you a solution, rather than (as is the ethic here) helping you learn.

    OTOH, if you've done some homework on Perl and have become stuck on some particular aspect of your problem, tell us what you've done so far and what your sticking point is (hint, hint: post code, error messages and -- if relevant -- a narrative explanation of where the output fails to satisfy you).


    Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
    1. code
    2. verbatim error and/or warning messages
    3. a coherent explanation of what "doesn't work" actually means.
Re: First foray into Perl
by dorko (Prior) on Mar 24, 2014 at 18:01 UTC
    Hello,

    I might be wrong, but I get the feeling this is more of a biology thing for you than it is a Perl thing, so I'm helping out more than I usually do. Please make sure you understand the code below (and its limitations) before you use it.

    Good luck.

    use strict;
    use warnings;

    # Read the first 7 lines of metadata.
    # Assuming there are always 7 lines of metadata.
    foreach (1..7) {
        # Read a line of data.
        my $header_data = <DATA>;
        # Remove the end of line character.
        chomp $header_data;
        # Split the string into 2 parts, using white space as a separator.
        my ($label, $string) = split /\s+/, $header_data, 2;
        # Only pay attention to the "Motif" line.
        next if ($label ne 'Motif');
        print "$string ";
    }

    # Process the next 10 lines of data.
    # Assuming there are always 10 lines of data.
    foreach (1..10) {
        # Declare a variable to hold the data in the file.
        my %base_pairs;
        # Read a line of data.
        my $line = <DATA>;
        # Remove the end of line character.
        chomp $line;
        # Split the string into 5 parts, using whitespace as a separator.
        # Assuming the Position is always in the same order in the file.
        (undef, $base_pairs{A}, $base_pairs{C}, $base_pairs{G}, $base_pairs{T}) = split /\s+/, $line, 5;
        my @letters = keys %base_pairs;
        # Start with one letter as the current maximum.
        # (Hash key order is not fixed, so this is an arbitrary letter.)
        my $max = pop @letters;
        # Compare each remaining value to the maximum.
        foreach my $letter (@letters) {
            # What if two (or more) values are equal???
            if ($base_pairs{$max} < $base_pairs{$letter}) {
                # The current value was greater than the maximum, so make it the new maximum.
                $max = $letter;
            }
        }
        # Print the letter representing the maximum value.
        print $max;
    }
    # Print an end of line character.
    print "\n";

    __DATA__
    TF Unknown
    TF Name Unknown
    Gene ENSG00000113916
    Motif ENSG00000113916___1|2x3
    Family C2H2 ZF
    Species Homo_sapiens
    Pos A C G T
    1 0.538498 0.157305 0.157633 0.146564
    2 0.072844 0.008771 0.877166 0.0412175
    3 0.959269 0.013107 0.015961 0.0116621
    4 0.852439 0.023883 0.016813 0.106864
    5 0.57332 0.068801 0.181385 0.176494
    6 0.139513 0.074798 0.737607 0.0480813
    7 0.735484 0.091299 0.09091 0.0823067
    8 0.79932 0.027041 0.137306 0.0363319
    9 0.16103 0.12536 0.109938 0.603672
    10 0.622356 0.06782 0.115463 0.194361

    Output:
    ENSG00000113916___1|2x3 AGAAAGAATA
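
    On the "what if two (or more) values are equal" question in the comments: because the letters come out of a hash, a tie could resolve to a different base from run to run. A minimal sketch of one way to make that deterministic, assuming the columns are always in A, C, G, T order and that the earliest column should win a tie (consensus_base is a made-up helper name, not part of the code above):

```perl
use strict;
use warnings;

# Hypothetical helper: given the four frequencies in A, C, G, T order,
# return the base with the highest value. Ties go to the earliest base.
sub consensus_base {
    my @freqs = @_;
    my @bases = qw(A C G T);    # assumed fixed column order
    my $best  = 0;              # index of the current maximum
    for my $i (1 .. $#freqs) {
        # Strict ">" keeps the earlier base when two values are equal.
        $best = $i if $freqs[$i] > $freqs[$best];
    }
    return $bases[$best];
}

print consensus_base(0.16103, 0.12536, 0.109938, 0.603672), "\n";  # row 9 of the sample: T
print consensus_base(0.25, 0.25, 0.25, 0.25), "\n";                # all tied: A
```

    Whether "first column wins" is the right rule for your biology is a separate question; some people would rather emit an ambiguity code such as N in that case.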

    Cheers,

    Brent

    -- Yeah, I'm a Delt.

      Thank you very much Brent!

      As you suspected, I am a biologist and I thought I'd see what Perl has to offer. Thanks for such a comprehensive and detailed piece of code; I really appreciate your time and effort. Out of interest, could the "foreach (1 .. 10)" loop be modified to accommodate variation in the number of data lines? Something like:

      foreach (integer),else exit loop

      Cheers!

        What comes on the line after the last (highest) digit, 10 in your sample? A blank line maybe? Please post more than 1 record showing if anything indicates you've reached the last pos for the record.
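
        Until that is known, here is only a sketch of the "loop while the line starts with an integer" idea. It assumes that a matrix row always begins with an integer followed by whitespace, that anything else (or end of input) ends the matrix, and that the columns are in A, C, G, T order; consensus_from_lines is a made-up name for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: consume matrix rows from the front of @$lines until
# a line does not start with an integer, and return the consensus string.
sub consensus_from_lines {
    my ($lines) = @_;
    my @bases     = qw(A C G T);    # assumes the "Pos A C G T" column order
    my $consensus = '';
    while (@$lines && $lines->[0] =~ /^\s*\d+\s/) {
        # Drop the position number and keep the frequencies.
        my (undef, @freqs) = split ' ', shift @$lines;
        my $best = 0;
        for my $i (1 .. $#freqs) {
            $best = $i if $freqs[$i] > $freqs[$best];
        }
        $consensus .= $bases[$best];
    }
    return $consensus;
}

my @lines = (
    "1 0.538498 0.157305 0.157633 0.146564",
    "2 0.072844 0.008771 0.877166 0.0412175",
    "3 0.959269 0.013107 0.015961 0.0116621",
    "TF Unknown",    # first line of the next record stops the loop
);
print consensus_from_lines(\@lines), "\n";    # prints AGA
```

        The line that stopped the loop is left in @lines, so the caller can go on to parse the next record's header from it.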
Re: First foray into Perl
by GotToBTru (Prior) on Mar 24, 2014 at 16:05 UTC

    A more useful title for your post would help!

    Do you have any code you have put together so far? Or perhaps even just pseudocode?

Re: First foray into Perl
by AnomalousMonk (Archbishop) on Mar 25, 2014 at 21:38 UTC

    Here's one possible approach to handling your data. Some notes and caveats:

    • This code takes the approach of gulping records and parsing them. A better approach might be to read the data line-by-line and use something like Text::CSV_XS to split out the fields and process fields in lines of interest. Anyway, that's not what I did...
    • This code requires Perl version 5.10 or better.
    • Questions as to the example data provided by you remain. Please note the test data I'm using. (Update: Any chance of changing the data format to use, say, a single blank line as a delimiter between records? Such clarity of format can often result in avoidance of headaches.)
    • The base position (?) sub-record
      Pos     A       C       G       T
      is present in each record. This suggests that the base ordering may vary from record to record. I think the code below will handle such variation, but I have not tested this.
    • The number of base frequency (?) sub-records
      1       0.664794        0.13099 0.0810125       0.123203
      is always ten per record. I think the code can handle any number of these sub-records, but again, I have not tested this.
    • The code below does some data validation, but not as much as I'm really comfortable with. You will have to judge how reliable your data is.