Help in joining these lines

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Help in joining these lines by toolic (Bishop) on Sep 15, 2016 at 13:12 UTC
One way to get your desired output: use warnings; use strict; my $line1; my $line2; while (<DATA>) { if (/^>/) { if ($line1) { print "$line1\n$line2\n"; $line1 = ''; $line2 = ''; } print; } elsif (!/\./) { chomp; $line1 .= $_; } else { chomp; $line2 .= $_; } } print "$line1\n$line2\n" if $line1; __DATA__ > gi\|11SB_CUCMA Train\|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGS EVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAPKLIFV AQGFGIRGIAIPGCAETYQT SSSSSSSSSSSSSSSSSSSSS.................... ........................... ................. .......... > gi\|1A43_HUMAN Train\|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG RGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQ +R MYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKWETAHEAE SSSSSSSSSSSSSSSSSSSSSSSS.............................................. +........................................... ............. .......................................... [download]	[reply] [d/l]
Re: Help in joining these lines by kcott (Archbishop) on Sep 15, 2016 at 21:29 UTC
This looks like FASTA format. I'd say you're on the right track by changing `$/`; however, it's better to localise the change in an anonymous block and let Perl return it to its previous value (rather than hard-coding your guess of what it was). I typically use: `local $/ = "\n>"`. If you have Perl v5.14 or higher, you can code it like this: #!/usr/bin/env perl use 5.014; use strict; use warnings; { local $/ = "\n>"; while (<DATA>) { chomp; my ($top, $seq) = split /\n/, $_, 2; print '>' unless $. == 1; say $top; say $seq =~ y/\n//dr; } } __DATA__ > gi\|11SB_CUCMA Train\|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGS EVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAPKLIFV AQGFGIRGIAIPGCAETYQT SSSSSSSSSSSSSSSSSSSSS.................... ........................... ................. .......... > gi\|1A43_HUMAN Train\|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG RGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQ +R MYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKWETAHEAE SSSSSSSSSSSSSSSSSSSSSSSS.............................................. +........................................... ............. .......................................... [download] Output: > gi\|11SB_CUCMA Train\|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAP +KLIFVAQGFGIRGIAIPGCAETYQTSSSSSSSSSSSSSSSSSSSSS....................... +................................................... > gi\|1A43_HUMAN Train\|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRME +PRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIA +LNEDLRSWTAADMAAQITQRKWETAHEAESSSSSSSSSSSSSSSSSSSSSSSS................ +..................................................................... +........................................................... [download] If you only have v5.10 or v5.12, the '`r`' modifier for transliteration is unavailable (see Non-destructive substitution in perl5140delta: Regular Expressions) and you'll need an extra line of code. `#!/usr/bin/env perl use 5.010; use strict; use warnings; { local $/ = "\n>"; while (<DATA>) { chomp; my ($top, $seq) = split /\n/, $_, 2; print '>' unless $. == 1; say $top; $seq =~ y/\n//d; say $seq; } }` [download] If you're working with older Perl versions, '`say`' is unavailable (see perl5100delta: say()) and you'll need to hard-code line endings. `#!/usr/bin/env perl use strict; use warnings; { local $/ = "\n>"; while (<DATA>) { chomp; my ($top, $seq) = split /\n/, $_, 2; print '>' unless $. == 1; print "$top\n"; $seq =~ y/\n//d; print "$seq\n"; } }` [download] All of these versions of the code produce the same output (given the same `__DATA__`). — Ken	[reply] [d/l] [select]
Re: Help in joining these lines by Anonymous Monk on Sep 15, 2016 at 14:24 UTC
#!/usr/bin/perl # http://perlmonks.org/?node_id=1171851 use strict; use warnings; $_ = join '', <DATA>; 1 while s/ ^\w\K\n(?=\w\n) \| ^.\..\K\n(?=.*\.) //mx; print; __DATA__ > gi\|11SB_CUCMA Train\|1 21 MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGS EVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAPKLIFV AQGFGIRGIAIPGCAETYQT SSSSSSSSSSSSSSSSSSSSS.................... ........................... ................. .......... > gi\|1A43_HUMAN Train\|1 24 MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG RGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWSQTDRANLGTLRGYYNQSEDGSHTIQ +R MYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRKWETAHEAE SSSSSSSSSSSSSSSSSSSSSSSS.............................................. +........................................... ............. .......................................... [download]	[reply] [d/l]
Re^2: Help in joining these lines by Anonymous Monk on Sep 15, 2016 at 15:48 UTC
Thanks to both of you who helped me! Is it possible to explain the pattern match? I am really interested in learning this technique but I am afraid I can't really understand the expression that gods wrote here..	[reply]
Re^3: Help in joining these lines by stevieb (Canon) on Sep 15, 2016 at 16:02 UTC
Although `use re 'debug';` is available, YAPE::Regex::Explain is a fair amount more detailed in its output when explaining a regex: `use warnings; use strict; use YAPE::Regex::Explain; my $re = 's/ ^\w\K\n(?=\w\n) \| ^.\..\K\n(?=.\.) //mx'; print YAPE::Regex::Explain->new($re)->explain;` [download] Output: The regular expression: (?-imsx:s/ ^\w\K\n(?=\w\n) \| ^.\..\K\n(?=.\.) //mx) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- s/ 's/ ' ---------------------------------------------------------------------- ^ the beginning of the string ---------------------------------------------------------------------- \w* word characters (a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \K 'K' ---------------------------------------------------------------------- \n '\n' (newline) ---------------------------------------------------------------------- (?= look ahead to see if there is: ---------------------------------------------------------------------- \w* word characters (a-z, A-Z, 0-9, _) (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \n '\n' (newline) ---------------------------------------------------------------------- ) end of look-ahead ---------------------------------------------------------------------- ' ' ---------------------------------------------------------------------- \| OR ---------------------------------------------------------------------- ' ' ---------------------------------------------------------------------- ^ the beginning of the string ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \. '.' ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \K 'K' ---------------------------------------------------------------------- \n '\n' (newline) ---------------------------------------------------------------------- (?= look ahead to see if there is: ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \. '.' ---------------------------------------------------------------------- ) end of look-ahead ---------------------------------------------------------------------- //mx ' //mx' ---------------------------------------------------------------------- ) end of grouping ---------------------------------------------------------------------- [download] For the items that still may not be clear, see perlre.	[reply] [d/l] [select]
Re^4: Help in joining these lines by AnomalousMonk (Archbishop) on Sep 15, 2016 at 17:01 UTC
Re^4: Help in joining these lines by kcott (Archbishop) on Sep 16, 2016 at 11:57 UTC
Re^4: Help in joining these lines by Anonymous Monk on Sep 15, 2016 at 16:13 UTC
Re^3: Help in joining these lines by kcott (Archbishop) on Sep 15, 2016 at 21:44 UTC
The '`\K`' and '`(?=...)`' parts are explained in Lookaround Assertions under "perlre: Extended Patterns". You can use Regexp::Debugger to see what's happening (step-by-step) as a regex is processed. I find this to be a very useful tool. — Ken	[reply] [d/l] [select]
Re: Help in joining these lines by BillKSmith (Monsignor) on Sep 15, 2016 at 20:32 UTC
This approach follows your statement of the problem. use strict; use warnings; use List::MoreUtils qw(first_index); my $sample = \do{my $raw = "> gi\|11SB_CUCMA Train\|1 21\n" ."MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGS\n" ."EVWQQHRYQSPRACRLENLRAQDPVRLLLPGFSNAPKLIFV\n" ."AQGFGIRGIAIPGCAETYQT\n" ."SSSSSSSSSSSSSSSSSSSSS....................\n" ."...........................\n" .".................\n" ."..........\n" ."> gi\|1A43_HUMAN Train\|1 24\n" ."MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPG\n" ."RGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEG" ."PEYWSQTDRANLGTLRGYYNQSEDGSHTIQR\n" ."MYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRS" ."WTAADMAAQITQRKWETAHEAE\n" ."SSSSSSSSSSSSSSSSSSSSSSSS..........." ."...................................\n" ."...........................................\n" .".............\n" ."..........................................\n" }; open my $SAMPLE, '<', $sample; while (my $block = do{local $/ = "\n>"; <$SAMPLE>}) { $block =~ s/\n>\z//; $block =~ s/\A([^>])(.*?)$/>$1$2/ms; my @lines = split /\n/, $block; my $i = first_index {$_ =~ /\./} @lines; $" = q(); print $lines[0], "\n" ."@lines[1..$i-1]\n" ."@lines[$i..$#lines]\n"; } [download] Bill	[reply] [d/l]
Re^2: Help in joining these lines by GrandFather (Saint) on Sep 16, 2016 at 02:45 UTC
Instead of going through hoops with do and various variables you could just: `open my $fIn, '<', \<<INSTR; > gi\|11SB_CUCMA Train\|1 21 ... .......................................... INSTR` [download] which saves a bunch of quoting things and messing around generally. Premature optimization is the root of all job security	[reply] [d/l]
Re^3: Help in joining these lines by BillKSmith (Monsignor) on Sep 16, 2016 at 18:21 UTC
You do have a point. Of course it would be even easier to use <DATA>. I felt that the quoting offered three advantages. The biggest advantage is that it removes the ambiguity related to white space at the end of each line. All the data remains on the screen at once. The format of the open statement is exactly the same as for reading a disk file. Bill	[reply]
Re^4: Help in joining these lines by GrandFather (Saint) on Sep 17, 2016 at 00:01 UTC