I have a file in FRG format and would like to extract certain information from it. My current script parses some of that information, but not accurately: it treats every "acc:" as the sequence ID, whereas there is another "acc:" that represents the library name instead of the sequence ID. I would like to capture only the "acc:" that follows the "{FRG" bracket, plus the sequence under "seq:", and nothing else in the record. At the moment the script grabs everything from "typ:" up to the next "acc:". How can I match only the sequence between "seq:" and the "." that precedes "qlt:"?
{BAT bna:(Batch name) crt:12012334370 acc:110767247557 com: Generated By LibraryInterface process Start: 1147321117200 StartDate: 20060511 00:16:47 End: 1149189374000 EndDate: 20060601 15:16:14 Index: 1 of 1 Filename: Library_name.frg . } {DST act:A acc:1052000138220 mea:5000.0 std:1000.0 } {FRG act:A acc:1101077781160 typ:R src: . etm:0 seq: acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat tatctgacga . qlt: 6666666?::866::<>H\EEGE@B?AAJMRXXXXX\XSE?BB??MPROOPOSZHIKIIO____aaRQQM MPOPJOMOROQHKMPNM____Y___]_\]VVV\\V_VVXXX\Y______]Y__Y__V\VVVV______\V \X____MMV\ X\Y___TXX_]___\V\\\\______\\V]_\\\\\\S_\\V\]\X\\\]\\\S_________]V__\SS SZZS\\\]]RRQQSSGSRMHHGHHF\\PSSSSS\\\__]]\]\\\MSRQQNRRTZSS]\]VOPHCEJOOL \_\V\VSVHGGCMMP@>9977@EJAXVV\JSIOCAB97@AECHC>A@??99=<ABL>A98?EC>99@>A> <<<>ACC=J@C@@GRTKCCHE?>CFR]]\DD<8>>:7:@B;:B8888<<@EHE>97B77<<@GCLG<>99 :;9:88<776 . clr:17,855 } {FRG act:A acc:1101077781161 typ:R src: . etm:0 seq: gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc taatccaaattgtttaccctagacaatcaaactgccagtattaaac . 
qlt: 7966666666766:877:E>G==AAAEEEOOMMHQGEEDG@ACBIILJLKNRRRNNDDNNQMJXR]VIGI YY]_YaaMMRSS\VVVSPTWPRTV_Y__V \Xaaa\]_V\XaaPL]]_\__TVRXVW]XXXXWXV__TXPLLXW___VTTXXVaaaaaa]XPTV_XXXXR XVT_Y_______Y__aY_\TRRMSRRV\_]]]S\]V___V\X____aVXXXV\]aa_]XXXWPPPWWPLL aPX]]]___XXXPTLLXX__]]aaaaaaa]\Y__T\XRa]]a\VV\RVVXTT]RXXXVMSHMOW]XXXXX Xaaa]]XPLLLXXTT]H____V_______Y_____]aXTLPWT___XTPPWTXWXVMMVROT_]]T_TTT TV_TTTTTHICJHKNTRPRHHK\VTXXXRRRVMOIISRRWXXPXRRXXX]]]V\\V\VSR\V\VRRRH\X XXR]XLLTX]]]]]XXMNLNNR]_\SS]IIHEEEKGEHJ]]EEGGCABA>>A>>;::;=?@><7779B97 <;A:=<<79<7:<<>:8<8866677<99866999<888=9=;>AA: . clr:20,824 }
My code:
#!/usr/bin/perl
# FRG to FASTA
$/ = "acc:";
$| = 1;
while (<>) {    # FRG file as input
    chomp;
    my ($titleline, $sequence) = split(/\n/, $_, 2);
    next unless ($sequence && $titleline);
    my ($id) = $titleline =~ /^(\S+)/;
    $sequence =~ s/\n//g;
    print ">$titleline\n$sequence\n";
}
The ideal output should look like this:
>1101077781160
acaaggctggagtatttttttgtttagtaatttatttaattcagtttttatattttcataaactttttta
ggatcaccagggccattacttaaaaaaaaaccatcaaaatttctattaattatatcctcagcattaaaat
tgatctttagagagaaacttacctttgaaaatatattttttgttataaattaaatatccgttttgataat
taagtttagttttattatctaatacgggcatattaaatcatgtgtattagtatattatatcaaaggaaat
tcaaatgagtttggcaaaaaaatttctgatgacgttaaagtgctttaaaaggcggagatagaaaaacttt
catagcaaggtatgtctattctgagttaaaaattttctattaaagaaatctagagagagacgtgcttaat
tatctgacga
>1101077781161
gcgtgacgtttgagcagaagaattatttattaatttctgaggattttaagtctttaaaacaaaacgtttc
attcaaatttcaaaatcttgattataaagaagcgatggcactaatggctgagattggcaatattaatata
caactggctcagcagtattaaaaattgaactagatgcgttggaatctcaagggttaggaagaacagtttc
taatccaaattgtttaccctagacaatcaaactgccagtattaaac
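For reference, here is a minimal sketch of one way the output above might be produced. It is not a tested solution for every FRG file. It assumes that each {FRG record lists "acc:" before "seq:", that the accession is all digits, and that the sequence ends at a "." followed by "qlt:". The sub name frg_to_fasta and the script name frg2fasta.pl are made up for illustration. Instead of changing $/, it reads the whole file and lets one regex anchor on "{FRG", so the "acc:" fields of the {BAT and {DST records are never considered.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: return FASTA text for every {FRG ...} record in $text.
# Assumptions: "acc:" comes before "seq:" inside each FRG record, and the
# sequence is terminated by a "." followed by "qlt:".
sub frg_to_fasta {
    my ($text) = @_;
    my $fasta = '';

    # /s lets "." cross newlines; /g walks through all FRG records.
    while ($text =~ /\{FRG\b.*?\bacc:(\d+).*?\bseq:\s*(.*?)\s*\.\s*qlt:/sg) {
        my ($id, $seq) = ($1, $2);
        $seq =~ s/\s+/\n/g;    # keep the line wrapping; use s/\s+//g to join
        $fasta .= ">$id\n$seq\n";
    }
    return $fasta;
}

# Hypothetical usage: perl frg2fasta.pl Library_name.frg > out.fasta
if (@ARGV) {
    local $/;                  # slurp mode: each read returns a whole file
    while (my $text = <>) {
        print frg_to_fasta($text);
    }
}
```

Slurping the file keeps the regex simple, at the cost of holding the whole file in memory. For very large FRG files, setting $/ to "}\n" and testing one record at a time might be a better fit, assuming each record closes with "}" on its own line.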

In reply to Vertical Regex by joomanji
