I have a file in a format called FRG, and I would like to extract certain information from it. My current script parses some of the information, but not accurately. It treats every "acc:" as the sequence ID, but there is another "acc:" that represents the library name rather than the sequence ID. I would like to capture only the "acc:" that follows the "{FRG" opening bracket, plus the sequence lines under "seq:", and nothing else from the record. At the moment my script grabs everything from "typ:" up to the next "acc:". How can I match only the sequence, from "seq:" up to the "." that precedes "qlt:"?

{BAT
bna:(Batch name)
crt:12012334370
acc:110767247557
com:
Generated By LibraryInterface process
Start: 1147321117200
StartDate: 20060511 00:16:47
End: 1149189374000
EndDate: 20060601 15:16:14
Index: 1 of 1
Filename: Library_name.frg
.
}
{DST
act:A
acc:1052000138220
mea:5000.0
std:1000.0
}
{FRG
act:A
acc:1101077781160
typ:R
src:
.
etm:0
seq:
acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
tatctgacga
.
qlt:
6666666?::866::<>H\EEGE@B?AAJMRXXXXX\XSE?BB??MPROOPOSZHIKIIO____aaRQQM
MPOPJOMOROQHKMPNM____Y___]_\]VVV\\V_VVXXX\Y______]Y__Y__V\VVVV______\V
\X____MMV\
X\Y___TXX_]___\V\\\\______\\V]_\\\\\\S_\\V\]\X\\\]\\\S_________]V__\SS
SZZS\\\]]RRQQSSGSRMHHGHHF\\PSSSSS\\\__]]\]\\\MSRQQNRRTZSS]\]VOPHCEJOOL
\_\V\VSVHGGCMMP@>9977@EJAXVV\JSIOCAB97@AECHC>A@??99=<ABL>A98?EC>99@>A>
<<<>ACC=J@C@@GRTKCCHE?>CFR]]\DD<8>>:7:@B;:B8888<<@EHE>97B77<<@GCLG<>99
:;9:88<776
.
clr:17,855
}
{FRG
act:A
acc:1101077781161
typ:R
src:
.
etm:0
seq:
gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
taatccaaattgtttaccctagacaatcaaactgccagtattaaac
.
qlt:
7966666666766:877:E>G==AAAEEEOOMMHQGEEDG@ACBIILJLKNRRRNNDDNNQMJXR]VIGI
YY]_YaaMMRSS\VVVSPTWPRTV_Y__V
\Xaaa\]_V\XaaPL]]_\__TVRXVW]XXXXWXV__TXPLLXW___VTTXXVaaaaaa]XPTV_XXXXR
XVT_Y_______Y__aY_\TRRMSRRV\_]]]S\]V___V\X____aVXXXV\]aa_]XXXWPPPWWPLL
aPX]]]___XXXPTLLXX__]]aaaaaaa]\Y__T\XRa]]a\VV\RVVXTT]RXXXVMSHMOW]XXXXX
Xaaa]]XPLLLXXTT]H____V_______Y_____]aXTLPWT___XTPPWTXWXVMMVROT_]]T_TTT
TV_TTTTTHICJHKNTRPRHHK\VTXXXRRRVMOIISRRWXXPXRRXXX]]]V\\V\VSR\V\VRRRH\X
XXR]XLLTX]]]]]XXMNLNNR]_\SS]IIHEEEKGEHJ]]EEGGCABA>>A>>;::;=?@><7779B97
<;A:=<<79<7:<<>:8<8866677<99866999<888=9=;>AA:
.
clr:20,824
}

My code:

#!/usr/bin/perl
#FRG to FASTA

$/ = "acc:";
$| = 1;

while(<>){ #FRG file as input
   chomp;
   my ($titleline, $sequence) = split(/\n/,$_,2);
   next unless ($sequence && $titleline);
   my ($id) = $titleline =~ /^(\S+)/;
   $sequence =~ s/\n//g; 
   print ">$titleline\n$sequence\n";
}

The ideal output should look like this:

>1101077781160
acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
tatctgacga
>1101077781161
gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
taatccaaattgtttaccctagacaatcaaactgccagtattaaac
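One possible way to get that output, offered as a minimal sketch rather than a definitive answer: read the file one `{...}` record at a time by setting `$/` to `"}\n"`, skip every record that does not start with `{FRG`, and then pull out the first `acc:` line and the lines between `seq:` and the next line containing only `.`. The sub name `frg_to_fasta` and the shortened sample data are made up for illustration. The sketch assumes Unix line endings, that each record closes with `}` at the end of a line, and that no quality line ends in `}`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FRG text from a filehandle and return a list of
# [id, sequence] pairs, one per {FRG record.
sub frg_to_fasta {
    my ($fh) = @_;
    local $/ = "}\n";    # assumption: every record ends with "}" + newline
    my @entries;
    while ( my $record = <$fh> ) {
        # Only {FRG records carry a sequence; skip {BAT, {DST and the rest.
        next unless $record =~ /^\{FRG\b/m;
        # First acc: line inside this {FRG record is the sequence ID.
        my ($id) = $record =~ /^acc:(\S+)/m or next;
        # Everything after "seq:" up to the first line that is just ".".
        my ($seq) = $record =~ /^seq:\n(.*?)^\.$/ms or next;
        $seq =~ s/\s+\z//;    # drop the trailing newline (chomp would use $/)
        push @entries, [ $id, $seq ];
    }
    return @entries;
}

# Made-up sample in the same layout as the data above, sequences shortened.
my $sample = <<'END';
{BAT
acc:110767247557
}
{DST
act:A
acc:1052000138220
}
{FRG
act:A
acc:1101077781160
typ:R
src:
.
etm:0
seq:
acaaggctgg
tatctgacga
.
qlt:
6666666?::
:;9:88<776
.
clr:17,855
}
END

open my $in, '<', \$sample or die "cannot open in-memory sample: $!";
print ">$_->[0]\n$_->[1]\n" for frg_to_fasta($in);
```

To run it on a real file, open the file instead of the in-memory sample and pass that filehandle to the sub. The sequence keeps the line wrapping it has in the FRG file, as in the ideal output; adding `$seq =~ tr/\n//d;` before the `push` would join it into a single line instead.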

In reply to Vertical Regex by joomanji
