Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I have a file of the following format:
> gi|11SB_CUCMA Train|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGS EVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAPKLIFV AQGFGIRGIAIPGCAETYQT SSSSSSSSSSSSSSSSSSSSS.................... ........................... ................. .......... > gi|1A43_HUMAN Train|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG RGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQ +R MYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKWETAHEAE SSSSSSSSSSSSSSSSSSSSSSSS.............................................. +........................................... ............. ..........................................

which I need to bring into the format of:
> gi|11SB_CUCMA Train|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAP +KLIFVAQGFGIRGIAIPGCAETYQT SSSSSSSSSSSSSSSSSSSSS................................................. +......................... > gi|1A43_HUMAN Train|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRME +PRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIA +LNEDLRSWTAADMAAQITQRKWETAHEAE SSSSSSSSSSSSSSSSSSSSSSSS.............................................. +..................................................................... +.............................

which, in essence, means that one should keep 3 lines for each "entry", the line that starts with ">", then the lines that do not contain any "." and finally the lines that contain "."
I have tried this, but without result...
$/="> "; while(<>) { $seq=''; $top=''; if($_=~/^gi\|(.*?) \.*?|.*/m) { $id=$1; #print $id."\n"; } if($_!~/^gi/mg ) { print $_; $seq.=$_; } print $seq."\n"; } $/="\n";

Can you point me to some direction please?

Replies are listed 'Best First'.
Re: Help in joining these lines
by toolic (Bishop) on Sep 15, 2016 at 13:12 UTC
    One way to get your desired output:
    use warnings; use strict; my $line1; my $line2; while (<DATA>) { if (/^>/) { if ($line1) { print "$line1\n$line2\n"; $line1 = ''; $line2 = ''; } print; } elsif (!/\./) { chomp; $line1 .= $_; } else { chomp; $line2 .= $_; } } print "$line1\n$line2\n" if $line1; __DATA__ > gi|11SB_CUCMA Train|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGS EVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAPKLIFV AQGFGIRGIAIPGCAETYQT SSSSSSSSSSSSSSSSSSSSS.................... ........................... ................. .......... > gi|1A43_HUMAN Train|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG RGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQ +R MYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKWETAHEAE SSSSSSSSSSSSSSSSSSSSSSSS.............................................. +........................................... ............. ..........................................
Re: Help in joining these lines
by kcott (Archbishop) on Sep 15, 2016 at 21:29 UTC

    This looks like FASTA format. I'd say you're on the right track by changing $/; however, it's better to localise the change in an anonymous block and let Perl return it to its previous value (rather than hard-coding your guess of what it was). I typically use: local $/ = "\n>".

    If you have Perl v5.14 or higher, you can code it like this:

    #!/usr/bin/env perl use 5.014; use strict; use warnings; { local $/ = "\n>"; while (<DATA>) { chomp; my ($top, $seq) = split /\n/, $_, 2; print '>' unless $. == 1; say $top; say $seq =~ y/\n//dr; } } __DATA__ > gi|11SB_CUCMA Train|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGS EVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAPKLIFV AQGFGIRGIAIPGCAETYQT SSSSSSSSSSSSSSSSSSSSS.................... ........................... ................. .......... > gi|1A43_HUMAN Train|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG RGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQ +R MYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKWETAHEAE SSSSSSSSSSSSSSSSSSSSSSSS.............................................. +........................................... ............. ..........................................

    Output:

    > gi|11SB_CUCMA Train|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAP +KLIFVAQGFGIRGIAIPGCAETYQTSSSSSSSSSSSSSSSSSSSSS....................... +................................................... > gi|1A43_HUMAN Train|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRME +PRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIA +LNEDLRSWTAADMAAQITQRKWETAHEAESSSSSSSSSSSSSSSSSSSSSSSS................ +..................................................................... +...........................................................

    If you only have v5.10 or v5.12, the 'r' modifier for transliteration is unavailable (see Non-destructive substitution in perl5140delta: Regular Expressions) and you'll need an extra line of code.

    #!/usr/bin/env perl use 5.010; use strict; use warnings; { local $/ = "\n>"; while (<DATA>) { chomp; my ($top, $seq) = split /\n/, $_, 2; print '>' unless $. == 1; say $top; $seq =~ y/\n//d; say $seq; } }

    If you're working with older Perl versions, 'say' is unavailable (see perl5100delta: say()) and you'll need to hard-code line endings.

    #!/usr/bin/env perl use strict; use warnings; { local $/ = "\n>"; while (<DATA>) { chomp; my ($top, $seq) = split /\n/, $_, 2; print '>' unless $. == 1; print "$top\n"; $seq =~ y/\n//d; print "$seq\n"; } }

    All of these versions of the code produce the same output (given the same __DATA__).

    — Ken

Re: Help in joining these lines
by Anonymous Monk on Sep 15, 2016 at 14:24 UTC
    #!/usr/bin/perl # http://perlmonks.org/?node_id=1171851 use strict; use warnings; $_ = join '', <DATA>; 1 while s/ ^\w*\K\n(?=\w*\n) | ^.*\..*\K\n(?=.*\.) //mx; print; __DATA__ > gi|11SB_CUCMA Train|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGS EVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAPKLIFV AQGFGIRGIAIPGCAETYQT SSSSSSSSSSSSSSSSSSSSS.................... ........................... ................. .......... > gi|1A43_HUMAN Train|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG RGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQ +R MYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKWETAHEAE SSSSSSSSSSSSSSSSSSSSSSSS.............................................. +........................................... ............. ..........................................
      Thanks to both of you who helped me!
      Is it possible to explain the pattern match? I am really interested in learning this technique but I am afraid I can't really understand the expression that gods wrote here..

        Although use re 'debug'; is available, YAPE::Regex::Explain is a fair amount more detailed in its output when explaining a regex:

        use warnings; use strict; use YAPE::Regex::Explain; my $re = 's/ ^\w*\K\n(?=\w*\n) | ^.*\..*\K\n(?=.*\.) //mx'; print YAPE::Regex::Explain->new($re)->explain;

        Output:

        The regular expression: (?-imsx:s/ ^\w*\K\n(?=\w*\n) | ^.*\..*\K\n(?=.*\.) //mx) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- s/ 's/ ' ---------------------------------------------------------------------- ^ the beginning of the string ---------------------------------------------------------------------- \w* word characters (a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \K 'K' ---------------------------------------------------------------------- \n '\n' (newline) ---------------------------------------------------------------------- (?= look ahead to see if there is: ---------------------------------------------------------------------- \w* word characters (a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \n '\n' (newline) ---------------------------------------------------------------------- ) end of look-ahead ---------------------------------------------------------------------- ' ' ---------------------------------------------------------------------- | OR ---------------------------------------------------------------------- ' ' ---------------------------------------------------------------------- ^ the beginning of the string ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \. '.' ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \K 'K' ---------------------------------------------------------------------- \n '\n' (newline) ---------------------------------------------------------------------- (?= look ahead to see if there is: ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \. '.' ---------------------------------------------------------------------- ) end of look-ahead ---------------------------------------------------------------------- //mx ' //mx' ---------------------------------------------------------------------- ) end of grouping ----------------------------------------------------------------------

        For the items that still may not be clear, see perlre.

        The '\K' and '(?=...)' parts are explained in Lookaround Assertions under "perlre: Extended Patterns".

        You can use Regexp::Debugger to see what's happening (step-by-step) as a regex is processed. I find this to be a very useful tool.

        — Ken

Re: Help in joining these lines
by BillKSmith (Monsignor) on Sep 15, 2016 at 20:32 UTC
    This approach follows your statement of the problem.
    use strict; use warnings; use List::MoreUtils qw(first_index); my $sample = \do{my $raw = "> gi|11SB_CUCMA Train|1 21\n" ."MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGS\n" ."EVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAPKLIFV\n" ."AQGFGIRGIAIPGCAETYQT\n" ."SSSSSSSSSSSSSSSSSSSSS....................\n" ."...........................\n" .".................\n" ."..........\n" ."> gi|1A43_HUMAN Train|1 24\n" ."MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG\n" ."RGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEG" ."PEYWSQTDRANLGTLRGYYNQSEDGSHTIQR\n" ."MYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRS" ."WTAADMAAQITQRKWETAHEAE\n" ."SSSSSSSSSSSSSSSSSSSSSSSS..........." ."...................................\n" ."...........................................\n" .".............\n" ."..........................................\n" }; open my $SAMPLE, '<', $sample; while (my $block = do{local $/ = "\n>"; <$SAMPLE>}) { $block =~ s/\n>\z//; $block =~ s/\A([^>])(.*?)$/>$1$2/ms; my @lines = split /\n/, $block; my $i = first_index {$_ =~ /\./} @lines; $" = q(); print $lines[0], "\n" ."@lines[1..$i-1]\n" ."@lines[$i..$#lines]\n"; }
    Bill

      Instead of going through hoops with do and various variables you could just:

      open my $fIn, '<', \<<INSTR; > gi|11SB_CUCMA Train|1 21 ... .......................................... INSTR

      which saves a bunch of quoting things and messing around generally.

      Premature optimization is the root of all job security

        You do have a point. Of course it would be even easier to use <DATA>. I felt that the quoting offered three advantages.

        The biggest advantage is that it removes the ambiguity related to white space at the end of each line.

        All the data remains on the screen at once.

        The format of the open statement is exactly the same as for reading a disk file.

        Bill