in reply to Re: Re: Re: Why is it matching??
in thread Why is it matching??

I've run into a small problem with the code (yes, that means it works :-)). A few data files contain target names of two different digit lengths, e.g. 10001_at and 123456_at. I've tried adding another if statement, adding another loop, and even putting the \d{6} and \d{5} patterns in separate programs. With \d{6}, it works correctly and gathers the subsequent sequences. With \d{5}, however, it finds nothing except the control sequences, which I know won't match the pattern and which I don't care about. SOOO... I know the \d{5} is working because it recognizes sequences that don't match, but it won't recognize the 10001_at target name or the 6-digit target names either. Any ideas?
NOTE: the first half of the file has the 6-digit names, while the second half has the 5-digit names. Is it possible that the program stops partway through the file because it can't immediately find a matching pattern? I thought that $1 would cause it to look for the first matching pattern, no matter where it is in the file....

Bioinformatics

Re: Re: Re: Re: Re: Why is it matching??
by BrowserUk (Patriarch) on Sep 16, 2003 at 20:23 UTC

    I'm not sure I've fully understood everything you've said in this post; it's quite difficult to visualise without real examples of the lines in front of me. But I think all you need to do is be a little more flexible in what you allow the regex to match.

    $target_name = $1 if m[( \d{5,6} ) _at: \d{3} : \d{3} ]x;

    The \d{5,6} will allow that part of the regex to match a sequence of either 5 or 6 digits followed by _at:. Will this do what you need?
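
    As a quick check of the \d{5,6} form, here is a minimal sketch run against a few made-up lines. The line layout (a name, then _at:, then two 3-digit fields) is only an assumption taken from the regex above, and @lines and @targets are illustrative names, not anything from your program.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the real file layout is an assumption.
my @lines = (
    'probe 123456_at:123:456 ACGT',
    'probe 10001_at:789:012 TTGA',
    'control AFFX-BioB-5_at:111:222 GGCC',
);

my @targets;
for (@lines) {
    # \d{5,6} accepts a run of five or six digits directly before "_at:".
    # Under /x the spaces inside the pattern are ignored.
    push @targets, $1 if m[( \d{5,6} ) _at: \d{3} : \d{3} ]x;
}

print "$_\n" for @targets;
```

    Both the 6-digit and the 5-digit names are captured, and the control line is skipped because it has no run of five digits before _at:.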

    In general, it's usually good practice to only tighten the regex as far as you need to prevent unwanted matches. You might for instance get away with using

    $target_name = $1 if m[( \d+ ) _at: \d+ : \d+ ]x;

    which would allow for  ... 1_at:1:1

    to ... 12345678901234567890_at:1234567890:1234567890

    And all stations in between. Without being able to see a fully representative sample of your data, it's difficult to know just how tight you need to make the regex to avoid false matches, but hopefully this will allow you to experiment and make that determination for yourself.
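
    To experiment with how tight the regex needs to be, something along these lines puts the two patterns side by side. The sample strings are invented purely to show the difference, and %strict and %loose are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical inputs, chosen only to show where the two patterns differ.
my @samples = ('1_at:1:1', '10001_at:123:456', '1234567_at:12:3456');

my (%strict, %loose);
for my $line (@samples) {
    # Tight form: 5 or 6 digits, then two 3-digit fields.
    $strict{$line} = $1 if $line =~ m[( \d{5,6} ) _at: \d{3} : \d{3} ]x;
    # Loose form: any number of digits in each position.
    $loose{$line}  = $1 if $line =~ m[( \d+ ) _at: \d+ : \d+ ]x;
}

for my $line (@samples) {
    printf "%-20s strict=%-8s loose=%s\n", $line,
        $strict{$line} // '-', $loose{$line} // '-';
}
```

    Only the middle sample satisfies the tight pattern, while the loose one accepts all three. Whether that extra latitude is safe depends on what else appears in your files.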

    If you find that you are still missing some lines, try adding a print statement or two to display each line that was read and those that were rejected. Then post the lines that were falsely rejected, along with the regex you are using, and it will be easier for us to help you refine the regex to your needs.
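
    A rough sketch of that kind of diagnostic follows. The in-memory $data string is a hypothetical stand-in for your file, so swap in your own filehandle; @matched and @rejected are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the data file; replace with the real filehandle.
my $data = "123456_at:123:456\nAFFX-control_at:1:2\n10001_at:789:012\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my (@matched, @rejected);
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ m[( \d{5,6} ) _at: \d{3} : \d{3} ]x) {
        push @matched, $1;
    }
    else {
        # Record and report the line number ($.) and text of each rejected line.
        push @rejected, $line;
        warn "REJECTED line $.: $line\n";
    }
}
close $fh;

print "matched: @matched\n";
```

    Because the while loop reads every line until end of file, a line that fails to match is simply reported and the scan carries on. If your program appears to stop partway through, the cause is more likely in the loop logic than in the regex itself.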


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.