
LOL... Thank you for the harsh rebuke :-). As for the warnings issue: I failed to mention the warning, and I hadn't figured out what it was referring to until I checked Programming Perl. So yes, I do use warnings, thank you very much. No offense, but I couldn't get your program to work as stated. It gave me the same problem I am asking about in this node: $target_name doesn't change, so only one key is present in the hash. I need the keys to keep changing along with $target_name. Since that does not happen in either program, the question still stands....
Bioinformatics

Re: Re: Re: Why is it matching??
by BrowserUk (Patriarch) on Sep 11, 2003 at 22:48 UTC

    The rebuke wasn't intended to come across as harsh. Sorry that it did.

    Moving right along. Could you explain a little more of what you mean by "...but I couldn't get your program to work as stated..."? I just downloaded the code again and it produced the output I listed, which shows that two keys were created. The first with three probes found:

    1415671 : GGAACAGGAATGTCGCAACATCGTA, ACATCGTATGGATTGCTGAGTGCAT, GGCTGATCACATCCAAAAAGTCATG

    And the second with 10:

    1415670 : GAGGAAACGTTCACCCTGTCTACTA, GTTCACCCTGTCTACTATCAAGACA, TACTATCAAGACACTCGAAGAGGCT, CTGTGGGCAATATTGTGAAGTTCCT, GAATGCATCCTTGTGAGAGGTCAGA, GAGAGGTCAGACAAAGTGCCAGAAA, AAAACAAGAACACCCACACGCTGCT, ACACGCTGCTGCTAGCTGGAGTATT, TATCTTGTCCAACACTACGTCGAAG, TTGTCACCATGCCTGCAAGGAGAGA

    This is as expected from the sample data you provided on that original post, although I've manually wrapped it to prevent it getting confused by the autowrapper.

    If you are getting different output when you run my original code, then could you post the output you get please and I'll try to work out what could be different.

    The way $target_name gets updated in the original is like this:

    do {
        # extract the target name
        $target_name = $1 if m[( \d{7} ) _at: \d{3} : \d{3} ]x;

        while( m[$target_name] ) {
            # process the record containing the current target name
            my $probe = <DATA>;    # Read the probe
            chomp $probe;

            # save it in an HoA keyed by the target name
            push @{ $probes{ $target_name } }, $probe;

            # get the next line;
            last unless defined( $_ = <DATA> );
        }
    } until eof DATA;    # till done

    $target_name is set at the top of the outer do..until loop.

    The code enters the inner while loop, reads the next line, extracts the probe, and pushes it onto the HoA.

    It then gets to the last unless defined... line, where it reads another line. So long as it hasn't reached the eof, then it loops back to the top and tests the while condition again. If it matches, the loop repeats, another probe is read and pushed.

    If it doesn't, then it falls out of the while loop and the until eof DATA condition is tested. If it's not at the eof, then it loops back to the top of the do...until loop and the new $target_name is extracted from the last line it read (which failed to match the while condition) and the cycle repeats.
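
    A minimal, self-contained sketch of the same loop, for tracing the steps above by hand. The sample lines are hypothetical (the real file format is only assumed to be a header line such as 1415671_at:123:456 followed by one probe line), and collect_probes is an illustrative name, not part of the original program. It reads from an in-memory filehandle instead of DATA so that it can be run on its own.

```perl
use strict;
use warnings;

# Hypothetical sample data: each header line is followed by one probe line.
my $sample = <<'END';
1415671_at:123:456
GGAACAGGAATGTCGCAACATCGTA
1415671_at:234:567
ACATCGTATGGATTGCTGAGTGCAT
1415670_at:345:678
GAGGAAACGTTCACCCTGTCTACTA
END

# Illustrative helper: runs the do/while logic over a string and returns the HoA.
# Assumes the first line, and every line following a probe, is a matching header.
sub collect_probes {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    my ( %probes, $target_name );
    local $_ = <$fh>;    # prime $_ with the first header line
    return \%probes unless defined $_;

    do {
        # A new header was read: pick up the new target name.
        $target_name = $1 if m[( \d{7} ) _at: \d{3} : \d{3} ]x;

        while ( m[$target_name] ) {
            my $probe = <$fh>;    # the probe follows its header
            chomp $probe;
            push @{ $probes{$target_name} }, $probe;

            # Read the next header; stop at end of input.
            last unless defined( $_ = <$fh> );
        }
    } until eof $fh;

    return \%probes;
}

my $probes = collect_probes($sample);
for my $name ( sort keys %$probes ) {
    print "$name : ", join( ', ', @{ $probes->{$name} } ), "\n";
}
```

    With this sample the hash ends up with two keys, because the third header fails the while test and sends control back to the top of the do block, where the new name is extracted.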

    Hopefully, that explains how it works and will allow you to modify it to your needs.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      How would one add a newline to the print statement? The main problem I'm running into is that using quotation marks of any type throws the entire print statement off, causing it to print the word join rather than actually join the sequences, etc. As well, it seems I would almost have to rewrite it to fit the newline in; when I do, it does at least come out on separate lines. The data is coming out now with the original print statement, but all on one line; I need it on separate lines...:-(...

      #my rewritten statement, for better or worse
      for (keys %probes) {
          print "$target_name, ':' join', ', @{$probes{$_}}, \n";
      }

      _output_
      244901, ':' join', ', TTGCTGCTATTCTATCTATTTGTGC GACTTTCAAAGTGACTCTCGACGGG GAGCCTCCAGGCTATTCAGGAAGAA GAAGAATCGCAGCAATTCCCCAATC GAAGTAGTTCCTCCGGAATCCAATG TCAGCTTGCGAATTTGTGGCACCGT TTACCAATGGCACGCTGTGCGCCTA GCAAGCTTTGTTATGCCGAAACCTA AACACTTACAAATGCCACTTCTTCC GTCGCATCCGTTTTCAGGACGATCT AGCAATTTGCCTACTCTTGTATCTC,
      244902, ':' join', ', GTATTCGGGGAATCCTCCTTAATAG ATATTCCTATTATGTCAATGCCAAT AGCTGTGAATTCGAACTTTTTGGTA GGTATTTTCCGTTTCTTCGGATGAT GATGGGTCAAGTATTTGCTTCATTG TGCTTCATTGGTTCCAACGGTGGCA CCAACGGTGGCAGCTGCGGAATCCG GCGGAATCCGCTATTGGGTTAGCCA TAGCCATTTTCGTTATAACTTTCCG TAACTTTCCGAGTCCGAGGTACTAT GAGTCCGAGGTACTATTGCTGTAGA,
      Bioinformatics

        Close. You need to note the differences between "s and 's a little more carefully:)

        I think this does what you want. Sometimes it helps to spread the individual clauses of a statement across separate lines. It helps you keep track of what is going on. I've also added parens to the join function. They often aren't necessary, but they help to clarify which bits are arguments to join and which bits (the return from join included) are arguments to print.

        for (keys %probes) {
            # $_ is the current key; $target_name would only ever hold the last name read
            print "$_ :\n",
                join( "\n", @{$probes{$_}} ),
                "\n";
        }
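
        A small sketch of the quoting difference, assuming a hash shaped like the HoA in this thread. The data and the name format_record are hypothetical. Inside double quotes, variables and \n are interpolated but function names such as join are just text, which is why join has to sit outside the string as its own argument.

```perl
use strict;
use warnings;

# Hypothetical data shaped like the hash of arrays from the thread.
my %probes = (
    1415671 => [ 'GGAACAGGAATGTCGCAACATCGTA', 'ACATCGTATGGATTGCTGAGTGCAT' ],
);

# Illustrative helper: one record, name on the first line, one probe per line after it.
sub format_record {
    my ( $name, $seqs ) = @_;
    return "$name :\n" . join( "\n", @$seqs ) . "\n";
}

my $single = 'a\nb';    # single quotes: backslash and n stay as two characters
my $double = "a\nb";    # double quotes: \n becomes one newline character

# The word join inside a string is never called; it is printed as is.
my $literal = "join', ', @{ $probes{1415671} }";

print format_record( $_, $probes{$_} ) for sort keys %probes;
```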

        HTH.



      I've run into a small problem with the code (yes, that means it works:-)). In a few data files, I have two different digit lengths in the file. Ex: 10001_at, and 123456_at. I've tried things like adding another if statement, another loop, even placing the \d{6} and \d{5} parameters in different programs. When I use \d{6} parameter, it is able to function correctly and gather the subsequent sequences. When I use \d{5} however, it can't find anything at all except the control sequences, which I know won't pattern match and don't care. SOOO...I know the \d{5} is working because it recognizes sequences that don't match, but won't recognize the 10001_at target name or 6 digit target name either. Any ideas?
      NOTE: the first half of the file has the 6-digit target names, while the second half has the 5-digit ones. Is it possible that the program stops partway through the file because it can't immediately find a matching pattern? I thought that the $1 would cause it to look for the first matching pattern, no matter where it is in the file....

      Bioinformatics

        I'm not sure I've fully appreciated all that you've said in this post, it's quite difficult to visualise without real examples of the lines in front of me, but I think that all you need to do is be a little more flexible in what you allow the regex to match.

        $target_name = $1 if m[( \d{5,6} ) _at: \d{3} : \d{3} ]x;

        The \d{5,6} will allow that part of the regex to match a sequence of either 5 or 6 digits followed by _at:. Will this do what you need?

        In general, it's usually good practice to only tighten the regex as far as you need to prevent unwanted matches. You might for instance get away with using

        $target_name = $1 if m[( \d+ ) _at: \d+ : \d+ ]x;

        which would allow for anything from

        1_at:1:1

        to

        12345678901234567890_at:1234567890:1234567890

        and all stations in between. (With the /x modifier the spaces inside the regex are only there for readability; they do not match spaces in the data.) Without being able to see a fully representative sample of your data, it's difficult to know just how tight you need to make the regex to avoid false matches, but hopefully this will allow you to experiment and make that determination for yourself.
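
        A quick way to experiment with how tight the regex needs to be is to run the candidate patterns against a few header strings. The sketch below assumes headers of the form NNNNN_at:NNN:NNN; the sample strings and the name target_name_of are hypothetical. One thing it shows is that, because the pattern is not anchored, \d{5,6} can also match the last six digits of a longer number.

```perl
use strict;
use warnings;

# Illustrative helper: return the captured target name, or undef if the line is rejected.
sub target_name_of {
    my ( $line, $re ) = @_;
    return $line =~ $re ? $1 : undef;
}

my $seven    = qr[( \d{7} )   _at: \d{3} : \d{3} ]x;    # the original pattern
my $five_six = qr[( \d{5,6} ) _at: \d{3} : \d{3} ]x;    # 5 or 6 digits
my $loose    = qr[( \d+ )     _at: \d+   : \d+   ]x;    # any number of digits

for my $line ( '10001_at:123:456', '123456_at:123:456', '1415671_at:123:456' ) {
    for my $pair ( [ seven => $seven ], [ five_six => $five_six ], [ loose => $loose ] ) {
        my ( $label, $re ) = @$pair;
        my $name = target_name_of( $line, $re );
        printf "%-20s %-9s %s\n", $line, $label, defined $name ? $name : '(no match)';
    }
}
```

        If the truncated match on the 7-digit name is unwanted, anchoring the digits (for example with \b in front of the capture) is one option to try.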

        If you find that you are still missing some lines, try adding a print statement or two to display each line that was read and those that were rejected. Then post the lines that were falsely rejected along with the regex you are using, and it will be easier for us to help you refine the regex to your needs.
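
        One way to sketch that kind of diagnostic, assuming the same hypothetical header format as above. The name partition_lines and the sample lines are illustrative only; the idea is simply to report every line together with whether the regex accepted it.

```perl
use strict;
use warnings;

# Illustrative helper: split lines into those the regex accepts and those it rejects.
sub partition_lines {
    my ( $re, @lines ) = @_;
    my ( @accepted, @rejected );
    for my $line (@lines) {
        if ( $line =~ $re ) { push @accepted, $line }
        else                { push @rejected, $line }
    }
    return ( \@accepted, \@rejected );
}

my $re    = qr[( \d{5,6} ) _at: \d{3} : \d{3} ]x;
my @lines = ( '10001_at:123:456', '123456_at:234:567', 'AFFX-control_at:1:2' );

my ( $ok, $bad ) = partition_lines( $re, @lines );

# Diagnostics go to STDERR so they do not mix with the real output.
print STDERR "accepted: $_\n" for @$ok;
print STDERR "REJECTED: $_\n" for @$bad;
```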

