Nicpetbio23! has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse this file with a regex so that it returns ONE close relative of each gene of interest, together with the gene of interest (it does not matter which relative). The sequence that comes first is the close relative. The gene of interest is in parentheses. I want to exclude lines that are a repeat of the same sequence.
Metan1_4283(Metac1_3189)
Metac1_3189(Metac1_3189)
MagorCD156_00067621(MagorUS71_00075311)
MagorUS71_00075311(MagorUS71_00075311)
Phaca1_206503(Phchr2_2932727)
Phchr2_2932727(Phchr2_2932727)
Thiar1_121068(Thiar1_121068)
Thite2_36710(Thiar1_121068)
Chagl1_10753(Thiar1_121068)
Spoth2_41043(Thiar1_121068)
Myche1_767736(Thiar1_121068)
Thiar1_767720(Thiar1_767720)
Spoth2_47778(Thiar1_767720)
Myche1_701010(Thiar1_767720)
Sorma1_06886(Thiar1_767720)
Neudi1_90877(Thiar1_767720)
Neutemata1_106297(Thiar1_767720)
NeutematA2_74860(Thiar1_767720)
Neucrtrp31_8476(Thiar1_767720)
Neucr2_8476(Thiar1_767720)
Thian1_368754(Thiar1_767720)
Thiap1_682895(Thiar1_767720)
Thihy1_418088(Thiar1_767720)
Podan2_4187(Thiar1_767720)
Micmi1_478558(Micmi1_478558)
Micmi1_311120(Micmi1_478558)
Micmi1_478558(Micmi1_311120)
Micmi1_311120(Micmi1_311120)
For example, I want to end up with something like this:
Metan1_4283 : Metac1_3189
MagorCD156_00067621 : MagorUS71_00075311
Phaca1_206503 : Phchr2_2932727
Thite2_36710 : Thiar1_121068
Spoth2_47778 : Thiar1_767720
Micmi1_311120 : Micmi1_478558
Micmi1_478558 : Micmi1_311120

Re: Perl regex
by 1nickt (Canon) on Jul 12, 2017 at 14:37 UTC

    Hi, this seems to be a trivial task in Perl (update: although your spec is not clear, really: you say "the ONE" but there may be more than one?).

    From your previous question:
    I am new at this and don't know where to start.

    Do you want to learn Perl, or just pass your bio class?

    This is the place to learn Perl.

    See perlintro, perlretut.

    If you just have other people do your work for you, you won't learn anything. Scientists aren't like that.

    Update: deleted example solution as I am not sure I understand the spec after follow-up post from OP...


    The way forward always starts with a minimal test.
      It's just ONE; the "the" was a typo.

        $ perl -nE '/([^\(]+)\((.*)\)/ and $1 ne $2 and ++$seen{$2} < 2 and say "$1 : $2"' raw_data.txt
        Metan1_4283 : Metac1_3189
        MagorCD156_00067621 : MagorUS71_00075311
        Phaca1_206503 : Phchr2_2932727
        Thite2_36710 : Thiar1_121068
        Spoth2_47778 : Thiar1_767720
        Micmi1_311120 : Micmi1_478558
        Micmi1_478558 : Micmi1_311120
        (Don't use it until you can explain it! Questions welcome.)
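For readers who want to pick the one-liner apart, here is a rough, expanded sketch of the same logic as a small script. The sub name first_relatives and the inline sample lines are made up for illustration; the regex and the order of the checks follow the one-liner above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper (name invented for this sketch): returns
# "relative : gene" strings, keeping only the first non-identical
# relative seen for each gene of interest.
sub first_relatives {
    my @lines = @_;
    my (%seen, @pairs);
    for my $line (@lines) {
        # $1 = everything before the first "(", $2 = text inside the parentheses
        next unless $line =~ /([^(]+)\((.*)\)/;
        my ($relative, $gene) = ($1, $2);
        next if $relative eq $gene;   # skip self-pairs such as X(X)
        next if $seen{$gene}++;       # a relative was already kept for this gene
        push @pairs, "$relative : $gene";
    }
    return @pairs;
}

# Small inline sample taken from the data in the question.
say for first_relatives(
    'Metan1_4283(Metac1_3189)',
    'Metac1_3189(Metac1_3189)',
    'Thite2_36710(Thiar1_121068)',
    'Chagl1_10753(Thiar1_121068)',
);
```

Because the %seen counter is only touched after the self-pair check, a line like Thiar1_121068(Thiar1_121068) does not use up the single slot for that gene, which matches the behaviour of the one-liner.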

Re: Perl regex
by hippo (Archbishop) on Jul 12, 2017 at 14:19 UTC

    You will need to:

    1. Create a full, detailed specification of the problem
    2. Design an algorithm to match the spec
    3. Code up the algorithm
    4. Test the code

    If the tests fail you then need to ascertain whether your algorithm is flawed or just the implementation of it and then revisit steps 2 or 3 iteratively until all the tests pass.

    Since your statement of the problem is woolly you probably need to start right at point 1. After that someone here might be able to help you with the other parts but only you know the initial problem to be solved.

      1. Create a full, detailed specification of the problem: There is a ton of superfluous information in this file.
      a. The file has multiple close relatives for a gene of interest. I only need one close relative in the output file.
      b. Some lines have a repeat of the same sequence, for example: Metac1_3189(Metac1_3189). I want to exclude these from the output file.
      c. The file has parentheses, which I want to exclude from the output file.
      d. Close relative and gene of interest are not separated into two distinct columns. I want to include these in the output file.

        Great, so now you have a full, detailed spec (or so we think). On to step 2 -> create an algorithm to match the spec.
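One way to pin the spec down before designing anything is to write points b to d as tests. The sketch below uses the core Test::More module; the sub parse_pair and its stricter anchored regex are assumptions made for this example, not something posted in the thread. Point a (one relative per gene) concerns the whole file rather than a single line, so it is not covered here.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper (invented for this sketch): split "relative(gene)"
# into its two fields. Returns an empty list for malformed lines and for
# self-pairs such as X(X).
sub parse_pair {
    my ($line) = @_;
    my ($relative, $gene) = $line =~ /^\s*([^()\s]+)\(([^()\s]+)\)\s*$/
        or return;
    return if $relative eq $gene;
    return ($relative, $gene);
}

# Spec points c and d: parentheses removed, two separate fields.
is_deeply [ parse_pair('Metan1_4283(Metac1_3189)') ],
          [ 'Metan1_4283', 'Metac1_3189' ],
          'parentheses stripped, two separate fields returned';

# Spec point b: a repeat of the same sequence is excluded.
is_deeply [ parse_pair('Metac1_3189(Metac1_3189)') ], [],
          'self-pair is excluded';

# Not in the spec, but worth deciding: what happens to junk lines?
is_deeply [ parse_pair('no parentheses here') ], [],
          'malformed line is ignored';

done_testing;
```

If a test fails, that tells you whether to revisit the algorithm or just the implementation, which is the loop described above.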

Re: Perl regex
by choroba (Cardinal) on Jul 12, 2017 at 15:09 UTC
    The following script outputs the expected output in about 2% of runs, because for multiple possibilities, it selects a random one (the first key in a hash). Is it what you want?
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    my %relative;
    while (<>) {
        my ($close_relative, $of_interest) = /(.*)\((.*)\)/ or next;
        next if $close_relative eq $of_interest;
        undef $relative{$of_interest}{$close_relative};
    }

    for my $gene (sort keys %relative) {
        say join ' : ', (keys %{ $relative{$gene} })[0], $gene;
    }

    It stores the pairs in the %relative hash of hashes in the following way:

    Micmi1_478558 => { Micmi1_311120 => undef },
    Thiar1_121068 => { Thite2_36710  => undef,
                       Chagl1_10753  => undef,
                       Spoth2_41043  => undef,
                       Myche1_767736 => undef,
                     }
    ...
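If a reproducible result matters more than which relative gets picked, one possible variation is to sort the inner keys before taking the first one. This is only a sketch: the sub name pick_relatives is invented here, and "alphabetically first" is an assumed tie-break that the original question does not ask for (it also differs from "first seen in the file").

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper (invented for this sketch): build the same
# gene => { relative => undef } structure, then pick the alphabetically
# first relative for each gene so every run gives the same answer.
sub pick_relatives {
    my @lines = @_;
    my %relative;
    for (@lines) {
        my ($close_relative, $of_interest) = /(.*)\((.*)\)/ or next;
        next if $close_relative eq $of_interest;   # skip self-pairs
        undef $relative{$of_interest}{$close_relative};
    }
    my @out;
    for my $gene (sort keys %relative) {
        # Sorting removes the dependence on Perl's hash ordering.
        my ($first) = sort keys %{ $relative{$gene} };
        push @out, "$first : $gene";
    }
    return @out;
}

say for pick_relatives(
    'Thite2_36710(Thiar1_121068)',
    'Chagl1_10753(Thiar1_121068)',
    'Micmi1_311120(Micmi1_478558)',
);
```

With this sample, Chagl1_10753 is chosen for Thiar1_121068 on every run, because it sorts before Thite2_36710.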

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,