gogoglou has asked for the wisdom of the Perl Monks concerning the following question:
Dear Perl monks, I am fairly new to regular expressions and I got stuck while trying to parse a txt file. The file is as follows:
========== block1352806 (miRBase hsa-mir-655) [miRNAknown_cloningHIGH_randfoldOK] ========== #randfold block1352806: 0.002 randfold chimpZ43p2_block1352806: 0.001 #forester chimpZ43p2_block1352806: 1 #identity: 100.00 block1352806 + ATAATACATGGTTAACCTCTTT #1308609;1;libSPL:4|130860 +8;1;libBB07981:1|1308607;1;libBB071024:13,libFAT:2,libBB0740:1,libBB0 +91237:4,libLIVER:7,libBB07981:8,libOVA:38,libBB081093:1,libART:2,libB +B091085:1,libLYMPHG:1,libVEIN:4,libBB04316:2,libADRG:79,libSPL:17,lib +BB04331:1,libCART:4,libBB091661:3,libSKIN:38,libFALLOP:2,libMirbase_h +sa-miR-655:1,libDCOL:1,libKDNCORT:3,libBB091392:17,libSTOM:2,libBB082 +901:3,libMUSCL:4,libBRCH:1,libJEJ:1,libGBLADD:16,libBB091313:2,libPNC +:1 31 284 1 block1352806 + TAATACATGGTTAACCTCTTT #4802530;1;libBB071024:9|4 +802529;1;libBB04316:3,libBB091392:3 3 15 1 block1352806 + TAATACATGGTTAACCTCTTTT #4802531;1;libHRTVALVE:2 +1 2 1 block1352806 + ATAATACATGGTTAACCTCTTTT #1308611;1;libADRG:6|1308 +612;1;libADRG:1|1308613;1;libSTOM:1|1308614;1;libMUSCL:1|1308610;1;li +bBB071024:2,libOVA:2,libBB04316:4,libADRG:10,libBB04331:1,libCART:8,l +ibBB091661:5,libBB091392:1,libGBLADD:2,libPNC:2 12 46 1 block1352806 + ATAATACATGGTTAACCTCT #1308601;1;libBB04316:2,libA +DRG:2,libSPL:1,libGBLADD:2 4 7 1 block1352806 + ATAATACATGGTTAACCTC #1308599;1;libADRG:1|1308600; +1;libBB091661:1 2 2 1 block1352806 + AATACATGGTTAACCTCTTT #406652;1;libOVA:3,libBB04 +322:6 2 9 1 block1352806 + ATAATACATGGTTAACCTCTT #1308603;1;libOVA:3,libJEJ: +2|1308605;1;libOVA:2,libCART:10|1308606;1;libADRG:2|1308602;1;libBB07 +1024:1,libBB07981:1,libOVA:5,libADRG:10,libSPL:2,libBB091392:1 8 39 1 block1352806 + AATAATACATGGTTAACCTCT #404499;1;libBB091392:3 1 3 +1 block1352806 TTCGTTTCAGAACTATGCAAGGATATTTGAGGAGAGGTTATCCGTGTTAT +GTTCGCTTCATTCATCATGAATAATACATGGTTAACCTCTTTTTGAATATCAGACTCTGCCTCGGA chimpZ43p2 ---------------------GATATTTGAGGAGAGGTTATCCGTGTTAT +GTTCGCTTCATTCATCATGAATAATACATGGTTAACCTCTTTTTGAATATC--------------- ***************************** 
+*************************************************** block1352806 ...............>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> +>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>............ +.. #ENST00000362159 ENSG00000207646 hsa-mir-655 block1352806 ((((...((((..........((((((..(((((((((((.((((((..( +(((((............))))))..)))))).)))))))))))..))))))....))))...)))) 0. +590 -37.94 # chimpZ43p2 ((((((..(((((((((((.((((((..( +(((((............))))))..)))))).)))))))))))..)))))) 0. +999 -33.40 # #block1352806 chromosome:14:101515877:101515992:1 Same_strand|Ex +onic_non-coding|ENST00000362159|ENSG00000207646 ## ENSG00000207646|mi +RNA|hsa-mir-655|hsa-mir-655 [Source:miRBase;Acc:MI0003677] ## {MIR: h +sa-mir-655} #chimpZ43p2 chromosome:14:101481114:101481193:1 Same_strand|Exon +ic_non-coding|ENSPTRT00000061396|ENSPTRG00000032316|miRNA| ========== block1594019 (miRBase hsa-mir-184) [miRNAknown_cloningHIGH_randfoldOK] ========== #randfold block1594019: 0.003 randfold ratZ4p3_block1594019: 0.001 ran +dfold zebrafish36p3_block1594019: 0.002 randfold chimpZ12p3_block1594 +019: 0.003 randfold mouseZ3p3_block1594019: 0.001 randfold macaqueZ5p +3_block1594019: 0.001 randfold chickenZ1p3_block1594019: 0.001 #forester ratZ4p3_block1594019: 0.876592 forester zebrafish36p3_block1 +594019: 0.675141 forester chimpZ12p3_block1594019: 0.95772 forester m +ouseZ3p3_block1594019: 0.892995 forester macaqueZ5p3_block1594019: 0. 
+924482 forester chickenZ1p3_block1594019: 0.806755 #identity: 88.98 block1594019 + GGACGGAGAACTGATAAGGG #3945151;1;libBB04316:4,libBB091661:6, +libBB091313:7|3945152;1;libBB091313:1 3 18 1 block1594019 + TGGACGGAGAACTGATAAGG #5548924;1;libBB0740:10,libBB081093:4,l +ibBB04316:4,libBB091661:2,libBB091313:9|5548923;1;libBRESTMUSC:7,libB +B0740:9,libBB081093:4,libART:7,libVEIN:1,libBB04316:2,libHRTVALVE:5,l +ibBB091661:2,libLUNGPAR1:3,libLUNGPAR2:3,libBB091392:3,libBB091313:31 +|5548922;1;libBB0740:8,libCERVIX:15,libBB081093:8,libART:6,libKIDNEYM +:28,libFALLOP:6,libBREST:6,libBB08429:2,libSTOM:2,libMUSCL:2,libBB091 +313:15|5548928;1;libBB0740:1,libBB081093:1|5548990;1;libBB081093:1|55 +48992;1;libBB081093:4,libMUSCL:1|5548993;1;libBB04316:5|5548991;1;lib +BB0740:3,libBB091237:5,libBB081093:1,libBREST:5|5548995;1;libBB091313 +:1|5548994;1;libBB05704:1|5548996;1;libBRESTMUSC:1|5548921;1;libFAT:1 +,libBRESTMUSC:3,libBB0740:8,libBB091237:1,libCERVIX:25,libBB05704:11, +libBB081093:2,libART:4,libKIDNEYM:2,libLYMPHG:1,libBB04316:9,libACOL: +2,libSPL:1,libCART:4,libBB091661:1,libLUNGPAR1:2,libFALLOP:1,libBREST +:7,libKDNCORT:4,libGBLADD:3,libBB091313:12 28 338 1 block1594019 + GACGGAGAACTGATAAGGGTAG #3144863;1;libFALLOP:1 1 1 1 block1594019 + TGGACGGAGAACTGATAA #5548913;1;libBB0740:1,libBB081093:3 2 4 +1 block1594019 + GACGGAGAACTGATAAGGGTA #3144864;1;libBB0740:1 1 1 1 block1594019 + TTGGACGGAGAACTGATAAGGGT #5924355;1;libBB081093:1 1 1 1 block1594019 + GGACGGAGAACTGATAAGG #3945150;1;libBB091313:2 1 2 1 block1594019 + TTGGACGGAGAACTGATAAGGG #5924354;1;libBB0740:1 1 1 1 block1594019 + CGGAGAACTGATAAGGGTA #2423838;1;libBB081093:1 1 1 1 block1594019 + GGACGGAGAACTGATAAGGGT #3945153;1;libBB0740:1,libBB081093:2, +libART:4,libBB04316:3,libSPL:2,libBB091313:1 6 13 1 block1594019 + GACGGAGAACTGATAAGGG #3144860;2;libBB081093:2,libBB091392:3 +|3144859;2;libBB091661:6 3 11 2 block1594019 + TGGACGGAGAACTGATAAGGGTAG #5548962;1;libBB04316:11 1 11 1 block1594019 + GGACGGAGAACTGATAAGGGTA 
#3945154;1;libBB04316:2,libBB091313: +5|3945156;1;libBB091313:4 2 11 1 block1594019 + TGGACGGAGAACTGATAAG #5548916;1;libBB081093:1,libBB091313:1|5 +548919;1;libBB0740:2|5548918;1;libBB04316:1|5548915;1;libFAT:1,libBB0 +7981:1,libBB081093:1,libART:1,libCART:7,libBB091313:2|5548920;1;libSP +L:1|5548998;1;libBB04316:1|5548999;1;libKDNCORT:1|5548997;1;libBRESTM +USC:1|5548914;1;libBRESTMUSC:2,libBB0740:1,libBB05704:1,libKIDNEYM:28 +,libBB04316:3,libSTOM:5 14 62 1 block1594019 + ACGGAGAACTGATAAGGGT #677339;1;libBB04316:4 1 4 1 block1594019 + TTGGACGGAGAACTGATAAGGGTA #5924356;1;libBB04316:1 1 1 1 block1594019 + GGACGGAGAACTGATAAGGGTAG #3945155;1;libBB04316:4 1 4 1 block1594019 + TGGACGGAGAACTGATAAGGG #5548932;1;libFAT:1,libBB0740:4,libBB0 +4316:5,libBB091313:1|5548933;1;libBB04316:4,libBB091313:2|5548934;1;l +ibBB04316:5|5548931;1;libBB0740:4,libBB091237:2,libOVA:1,libBB05704:2 +,libBB081093:23,libVEIN:2,libBB04316:16,libBB072910:1,libCART:12,libB +B091661:3,libLUNGPAR1:1,libSKIN:18,libFALLOP:1,libKDNCORT:5,libBB0913 +92:2,libGBLADD:3,libBB091313:21|5548936;1;libBB04316:2|5548938;1;libB +B081093:1,libBB091313:2|5548937;1;libBB0740:1,libBB081093:1,libVEIN:1 +,libDCOL:2|5548939;1;libBB091313:1|5548930;1;libBRESTMUSC:21,libBB074 +0:56,libBB091237:6,libBB07981:2,libCERVIX:28,libOVA:1,libBB05704:17,l +ibBB081093:100,libART:11,libKIDNEYM:25,libVEIN:5,libBB04316:42,libADR +G:1,libHRTVALVE:1,libSPL:1,libBB091661:11,libLUNGPAR1:3,libFALLOP:5,l +ibLUNGPAR2:6,libBB08429:5,libKDNCORT:24,libBB091392:5,libSTOM:3,libBB +082901:2,libBB091313:35,libUTERIS:1|5548941;1;libCERVIX:1|5548942;1;l +ibFALLOP:1|5548943;1;libKIDNEYM:1|5548944;1;libBRESTMUSC:1|5548947;1; +libBB091392:2|5548946;1;libFAT:1,libBB04322:2|5548948;1;libBB0740:1,l +ibBB081093:2|5548945;1;libVEIN:2,libBB04316:2|5548949;1;libBB081093:1 +,libBB04316:3|5548951;1;libCERVIX:2,libBB04316:1,libFALLOP:1|5548952; +1;libBB0740:1,libART:1|5548950;1;libCERVIX:1,libBB081093:2,libBB07291 
+0:1,libHRTVALVE:1|5548954;1;libCERVIX:1|5548953;1;libKDNCORT:1|554895 +5;1;libSPL:1|5548956;1;libBB4110:1|5548929;1;libBRESTMUSC:12,libBB074 +0:8,libBB091237:3,libBB04322:1,libBB05704:1,libBB081093:16,libART:3,l +ibKIDNEYM:31,libBB04316:33,libACOL:2,libBB04331:4,libFALLOP:2,libBRES +T:4,libKDNCORT:3,libGBLADD:2,libBB091313:5 37 732 1 block1594019 + TGGACGGAGAACTGATAAGGGTA #5548960;1;libBB04316:8|5548959;1;li +bBB0740:1,libBB05704:11,libBB081093:5,libBB04316:51,libBB091661:2,lib +MUSCL:1,libBB091313:17,libUTERIS:1|5548961;1;libBB04316:1,libBB091661 +:2|5548963;1;libKIDNEYM:1|5548965;1;libBB04316:2|5548966;1;libBB08109 +3:2,libBB091661:1|5548964;1;libBB0740:18,libBB091237:6,libBB05704:32, +libBB081093:40,libART:5,libKIDNEYM:10,libLYMPHG:1,libVEIN:2,libBB0431 +6:173,libBB072910:2,libACOL:1,libBB091661:15,libFALLOP:1,libDCOL:2,li +bBB091392:4,libBB082901:4,libMUSCL:7,libJEJ:1,libBB091313:16|5548967; +1;libBB081093:1|5548969;1;libBB04316:1|5548970;1;libBB04316:1|5548971 +;1;libCERVIX:1|5548968;1;libBB091313:1|5548958;1;libFAT:1,libBB0740:2 +5,libCERVIX:28,libBB05704:9,libBB081093:31,libART:10,libLYMPHG:1,libH +EART:3,libBB04316:77,libADRG:2,libBB091661:9,libFALLOP:1,libKDNCORT:6 +,libBB091392:3,libBB082901:2,libMUSCL:1,libBB091313:19 25 679 1 block1594019 + GACGGAGAACTGATAAGGGT #3144865;1;libKIDNEYM:14|3144862;1;li +bBB04316:2 2 16 1 block1594019 + TGGACGGAGAACTGATAAGGGT #5548972;1;libCERVIX:14,libBB05704:4, +libBB081093:4,libSPL:1,libBB091313:2|5548974;1;libCERVIX:1|5548973;1; +libBB081093:1|5548975;1;libBB071024:1|5548978;1;libCERVIX:10,libBB043 +16:1|5548977;1;libBB05704:7,libBB04316:4,libBB091313:1|5548979;1;libB +B04316:7|5548980;1;libBB04316:1|5548981;1;libCERVIX:1,libHEART:1,libF +ALLOP:1|5548982;1;libFALLOP:1|5548983;1;libBB081093:2|5548984;1;libBB +081093:2|5548985;1;libBB04316:1|5548976;1;libFAT:1,libBRESTMUSC:6,lib +BB0740:17,libBB04322:3,libBB05704:36,libBB081093:39,libART:2,libLYMPH +G:3,libBB04316:21,libSPL:4,libBB04331:4,libBB091661:7,libLUNGPAR1:2,l 
+ibLUNGPAR2:2,libBB08429:2,libDCOL:4,libKDNCORT:9,libMUSCL:4,libBRCH:4 +,libJEJ:2,libBB091313:13|5548987;1;libBB081093:2|5548988;1;libBB05704 +:1|5548986;1;libBB0740:4,libBB091237:2,libBB05704:5,libBB081093:3,lib +ART:3,libBB04316:15,libSPL:3,libFALLOP:1,libBB08429:1,libBB091313:3|5 +548989;1;libBB071024:1|5548957;1;libFAT:1,libBRESTMUSC:16,libBB0740:6 +4,libBB091237:12,libCERVIX:80,libMirbase_hsa-miR-184:1,libOVA:1,libBB +04322:8,libBB05704:97,libBB081093:142,libBB4110:2,libART:14,libKIDNEY +M:38,libHEART:6,libVEIN:10,libBB04316:495,libBB04217:2,libBB072910:1, +libACOL:5,libHRTVALVE:1,libOESO:5,libBB072257:3,libSPL:18,libBB04331: +3,libCART:14,libBB091661:14,libLUNGPAR1:13,libFALLOP:25,libLUNGPAR2:5 +,libBREST:11,libBB08429:20,libDCOL:1,libKDNCORT:22,libBB091392:18,lib +STOM:4,libBB082901:3,libMUSCL:25,libGBLADD:3,libBB091313:94,libUTERIS +:1 43 1594 1 block1594019 --CCGGCCAGTCACGTCCCCTTATCACTTTTCCAGCCCAGCTTTGTGACT +-GTAAGTGTTGGACGGAGAACTGATAAGGGTAGGTGATTGACACTCACAGCCTCCGGA ratZ4p3 ----------------TCCCTTATCAGTTTTCCAG-CCAGCTTTGTGACT +-GTAAATGTTGGACGGAGAACTGATAAGGGT--------------------------- zebrafish36p3 -----GTCGAACACGTCTCCTTATCACTTTTCCAGCCCAGCTATCCATTT +AGTATTCGTTGGACGGAGAACTGATAAGGGCATGTGCCCGATACTCCCTG-------- chimpZ12p3 TGCCGGCCAGTCACGTCCCCTTATCACTTTTCCAGCCCAGCTTTGTGACT +-GTAAGTGTTGGACGGAGAACTGATAAGGGTAGGTGATTG------------------ mouseZ3p3 ----------------TTCCTTATCACTTTTCCAG-CCAGCTTTGTGACT +-CTAAGTGTTGGACGGAGAACTGATAAGGGT--------------------------- macaqueZ5p3 ----------------CCCCTTATCACTTTTCCAGCCCAGCTTTATGACT +-GTAAGTGTTGGACGGAGAACTGATAAGGGT--------------------------- chickenZ1p3 ----------------CCCCTTATCACTTTTCCAGCCCAGCTTCTTCGCT +-CTGACTGTTGGACGGAGAACTGATAAGGGT--------------------------- ******** ******** ****** * + * *********************** block1594019 ..... ....>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>...................... 
#ENST +00000384962 ENSG00000207695 hsa-mir-184 block1594019 ((((.(((((((...(((((((((.((((((...(((((((((..... + .)))).)))))..)))))).)))))))))...))))))).............)))). 0.700 -38. +34 # ratZ4p3 .((((((((((((((((.. (((((((((..... + .)))).)))))..)))))))))))))))). 1.000 -34. +10 # zebrafish36p3 ((((..(((((.(((((((((.((((((...(((((......... +.......)))))..)))))).))))))))).)))))..))))........ 0.969 -32. +99 # chimpZ12p3 .......(((((((...(((((((((.((((((...(((((((((..... + .)))).)))))..)))))).)))))))))...))))))) 1.000 -34. +40 # mouseZ3p3 .(((((((((.((((((.. (((((((((..... + .)))).)))))..)))))).))))))))). 1.000 -25. +50 # macaqueZ5p3 .(((((((((.((((((...((((((((((.... + ))))).)))))..)))))).))))))))). 1.000 -28. +00 # chickenZ1p3 .(((((((((.((((((...(((((.((...... + ..))..)))))..)))))).))))))))). 1.000 -27. +10 # #block1594019 chromosome:15:79502126:79502230:1 Same_strand|Exon +ic_non-coding|ENST00000384962|ENSG00000207695 ## ENSG00000207695|miRN +A|hsa-mir-184|hsa-mir-184 [Source:miRBase;Acc:MI0000481] ## {MIR: hsa +-mir-184} #ratZ4p3 chromosome:8:94692535:94692679:-1 Same_strand|Boundary_ +non-coding|ENSRNOT00000053645|ENSRNOG00000035522|miRNA|rno-mir-184 [S +ource:miRBase;Acc:MI0000929] ## {MIR: rno-mir-184} #zebrafish36p3 chromosome:18:26309060:26309154:1 Same_strand|Bou +ndary_non-coding|ENSDART00000119009|ENSDARG00000081763|miRNA| #chimpZ12p3 chromosome:15:77150842:77150930:1 Same_strand|Bounda +ry_non-coding|ENSPTRT00000054446|ENSPTRG00000027776|miRNA|ptr-mir-184 + [Source:miRBase;Acc:MI0002602] ## {MIR: ptr-mir-184} #mouseZ3p3 chromosome:9:89697054:89697198:-1 Same_strand|Boundar +y_non-coding|ENSMUST00000083662|ENSMUSG00000065596|miRNA|mmu-mir-184 + [Source:miRBase;Acc:MI0000226] ## {MIR: mmu-mir-184} #macaqueZ5p3 chromosome:7:58339687:58339831:1 Same_strand|Bounda +ry_non-coding|ENSMMUT00000036552|ENSMMUG00000026851|miRNA|mml-mir-184 + [Source:miRBase;Acc:MI0007648] ## {MIR: mml-mir-184} #chickenZ1p3 chromosome:10:22146221:22146365:1 Same_strand|Bound 
+ary_non-coding|ENSGALT00000028966|ENSGALG00000018259|miRNA|gga-mir-18 +4 [Source:miRBase;Acc:MI0001227] ## {MIR: gga-mir-184}
Until now I wanted to get the data from the txt file that comes after the *********** line at the bottom of each block. I achieved that with the following script:

    #!/usr/bin/perl
    use DBI;
    use Data::Dumper;
    use strict;
    use warnings;
    use CGI;

    my $filename  = $ARGV[0];
    my $tablename = $ARGV[1];
    my $outfile   = $ARGV[2];

    open(INPUT, $filename);
    my $line;
    my $i;
    my $sth;

    while ($line = <INPUT>) {
        # Get the block identifier
        my $blockID;
        $line = <INPUT>;
        chomp($line);
        $blockID = $line;
        $i++;
        #print "\n"."New block: $blockID"."\n";

        # Skip the next four lines
        <INPUT>; <INPUT>; <INPUT>; <INPUT>;

        open (MYOUTFILE, ">>./tmp/$outfile");
        while ($line = <INPUT>) {
            chomp($line);
            if (! $line) {
                last;
            }
            else {
                my $tmp = index($line, '#');
                if ($tmp == -1) # alignment lines
                {
                    if (index($line, '*') == -1) {
                        #print "$line\n";
                    }
                }
                if ($tmp == 0) # genomic positions
                {
                    #print "$line\n";
                    my @fields = split('\t', $line);
                    my $tmp = $fields[1];
                    @fields = split(':', $tmp);
                    my $chrom = $fields[1];
                    my $start = $fields[2];
                    my $stop = $fields[3];
                    #print "$i\t\$blockID\n";
                    #print "$i\t$line\t$chrom\t$start\t$stop\t$blockID\n";
                    print MYOUTFILE "$i\t$start\t$stop\t$chrom\t$blockID\n";
                }
            }
        }
    }
    close (MYOUTFILE);
    close (INPUT);

Now I am trying to modify it in order to get the lines above (the ones starting with block). I tried changing the index function in my script (to return true instead of false, so that it would read those new lines), but that didn't seem to work. I am pretty inexperienced with regular expressions, and although I read your tutorial, I can't get this to work. Does anyone have any idea how I could get those lines? I know that there is more than one way to do it, but I am really stuck. Thanks in advance for your help.
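One possible way to pick out the lines starting with block is an anchored regular expression rather than index. The sketch below is only an illustration under some assumptions: the fields are taken to be whitespace-separated (the script above splits on tabs), and a read line is taken to be a block ID, then a strand sign, then a nucleotide sequence. The sub name parse_read_line and the sample lines (shortened from the data above) are hypothetical. Requiring the strand sign matters because the alignment and structure lines also begin with the block ID.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (block ID, strand, sequence, rest of line)
# for a read line, or an empty list for any other kind of line.
# Assumes whitespace-separated fields: block ID, '+' or '-', sequence, rest.
sub parse_read_line {
    my ($line) = @_;
    return unless defined $line;
    if ($line =~ /^(block\d+)\s+([+-])\s+([ACGTUN]+)\s+(.*\S)\s*$/) {
        return ($1, $2, $3, $4);
    }
    return;
}

# Sample lines shortened from the file shown in the question
# (tab separators are an assumption).
my @sample = (
    "block1352806\t+\tTAATACATGGTTAACCTCTTTT\t#4802531;1;libHRTVALVE:2\t1\t2\t1",
    "#identity: 100.00",                      # comment line, skipped
    "block1352806\tTTCGTTTCAGAACTATGCAAGG",   # alignment line, no strand, skipped
);

for my $line (@sample) {
    # The list assignment yields 0 in boolean context when nothing matched.
    my ($id, $strand, $seq, $rest) = parse_read_line($line) or next;
    print join("\t", $id, $strand, $seq), "\n";
}
```

In the original script, the same check could go inside the inner while loop, next to the existing index tests. If index is preferred, note that index($line, 'block') returns 0 when the line starts with block and -1 when block does not occur at all, so the test would be == 0.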
Replies are listed 'Best First'.

Re: regular expression questions (from someone without experience)
by tospo (Hermit) on Sep 22, 2010 at 08:30 UTC

Re: regular expression questions (from someone without experience)
by moritz (Cardinal) on Sep 22, 2010 at 08:34 UTC

Re: regular expression questions (from someone without experience)
by locked_user sundialsvc4 (Abbot) on Sep 22, 2010 at 11:13 UTC

Re: regular expression questions (from someone without experience)
by biohisham (Priest) on Sep 22, 2010 at 14:54 UTC

Re: regular expression questions (from someone without experience)
by thargas (Deacon) on Sep 22, 2010 at 13:20 UTC
by BioLion (Curate) on Sep 22, 2010 at 15:24 UTC

Re: regular expression questions (from someone without experience)
by girarde (Hermit) on Sep 22, 2010 at 14:47 UTC