
regular expression questions (from someone without experience)

by gogoglou (Beadle)
on Sep 22, 2010 at 08:04 UTC ( [id://861225]=perlquestion )

gogoglou has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl monks, I am fairly new to regular expressions and I got stuck while trying to parse a txt file. The file is as follows:

========== block1352806 (miRBase hsa-mir-655) [miRNAknown_cloningHIGH_randfoldOK] ========== #randfold block1352806: 0.002 randfold chimpZ43p2_block1352806: 0.001 #forester chimpZ43p2_block1352806: 1 #identity: 100.00 block1352806 + ATAATACATGGTTAACCTCTTT #1308609;1;libSPL:4|130860 +8;1;libBB07981:1|1308607;1;libBB071024:13,libFAT:2,libBB0740:1,libBB0 +91237:4,libLIVER:7,libBB07981:8,libOVA:38,libBB081093:1,libART:2,libB +B091085:1,libLYMPHG:1,libVEIN:4,libBB04316:2,libADRG:79,libSPL:17,lib +BB04331:1,libCART:4,libBB091661:3,libSKIN:38,libFALLOP:2,libMirbase_h +sa-miR-655:1,libDCOL:1,libKDNCORT:3,libBB091392:17,libSTOM:2,libBB082 +901:3,libMUSCL:4,libBRCH:1,libJEJ:1,libGBLADD:16,libBB091313:2,libPNC +:1 31 284 1 block1352806 + TAATACATGGTTAACCTCTTT #4802530;1;libBB071024:9|4 +802529;1;libBB04316:3,libBB091392:3 3 15 1 block1352806 + TAATACATGGTTAACCTCTTTT #4802531;1;libHRTVALVE:2 +1 2 1 block1352806 + ATAATACATGGTTAACCTCTTTT #1308611;1;libADRG:6|1308 +612;1;libADRG:1|1308613;1;libSTOM:1|1308614;1;libMUSCL:1|1308610;1;li +bBB071024:2,libOVA:2,libBB04316:4,libADRG:10,libBB04331:1,libCART:8,l +ibBB091661:5,libBB091392:1,libGBLADD:2,libPNC:2 12 46 1 block1352806 + ATAATACATGGTTAACCTCT #1308601;1;libBB04316:2,libA +DRG:2,libSPL:1,libGBLADD:2 4 7 1 block1352806 + ATAATACATGGTTAACCTC #1308599;1;libADRG:1|1308600; +1;libBB091661:1 2 2 1 block1352806 + AATACATGGTTAACCTCTTT #406652;1;libOVA:3,libBB04 +322:6 2 9 1 block1352806 + ATAATACATGGTTAACCTCTT #1308603;1;libOVA:3,libJEJ: +2|1308605;1;libOVA:2,libCART:10|1308606;1;libADRG:2|1308602;1;libBB07 +1024:1,libBB07981:1,libOVA:5,libADRG:10,libSPL:2,libBB091392:1 8 39 1 block1352806 + AATAATACATGGTTAACCTCT #404499;1;libBB091392:3 1 3 +1 block1352806 TTCGTTTCAGAACTATGCAAGGATATTTGAGGAGAGGTTATCCGTGTTAT +GTTCGCTTCATTCATCATGAATAATACATGGTTAACCTCTTTTTGAATATCAGACTCTGCCTCGGA chimpZ43p2 ---------------------GATATTTGAGGAGAGGTTATCCGTGTTAT +GTTCGCTTCATTCATCATGAATAATACATGGTTAACCTCTTTTTGAATATC--------------- ***************************** 
+*************************************************** block1352806 ...............>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> +>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>............ +.. #ENST00000362159 ENSG00000207646 hsa-mir-655 block1352806 ((((...((((..........((((((..(((((((((((.((((((..( +(((((............))))))..)))))).)))))))))))..))))))....))))...)))) 0. +590 -37.94 # chimpZ43p2 ((((((..(((((((((((.((((((..( +(((((............))))))..)))))).)))))))))))..)))))) 0. +999 -33.40 # #block1352806 chromosome:14:101515877:101515992:1 Same_strand|Ex +onic_non-coding|ENST00000362159|ENSG00000207646 ## ENSG00000207646|mi +RNA|hsa-mir-655|hsa-mir-655 [Source:miRBase;Acc:MI0003677] ## {MIR: h +sa-mir-655} #chimpZ43p2 chromosome:14:101481114:101481193:1 Same_strand|Exon +ic_non-coding|ENSPTRT00000061396|ENSPTRG00000032316|miRNA| ========== block1594019 (miRBase hsa-mir-184) [miRNAknown_cloningHIGH_randfoldOK] ========== #randfold block1594019: 0.003 randfold ratZ4p3_block1594019: 0.001 ran +dfold zebrafish36p3_block1594019: 0.002 randfold chimpZ12p3_block1594 +019: 0.003 randfold mouseZ3p3_block1594019: 0.001 randfold macaqueZ5p +3_block1594019: 0.001 randfold chickenZ1p3_block1594019: 0.001 #forester ratZ4p3_block1594019: 0.876592 forester zebrafish36p3_block1 +594019: 0.675141 forester chimpZ12p3_block1594019: 0.95772 forester m +ouseZ3p3_block1594019: 0.892995 forester macaqueZ5p3_block1594019: 0. 
+924482 forester chickenZ1p3_block1594019: 0.806755 #identity: 88.98 block1594019 + GGACGGAGAACTGATAAGGG #3945151;1;libBB04316:4,libBB091661:6, +libBB091313:7|3945152;1;libBB091313:1 3 18 1 block1594019 + TGGACGGAGAACTGATAAGG #5548924;1;libBB0740:10,libBB081093:4,l +ibBB04316:4,libBB091661:2,libBB091313:9|5548923;1;libBRESTMUSC:7,libB +B0740:9,libBB081093:4,libART:7,libVEIN:1,libBB04316:2,libHRTVALVE:5,l +ibBB091661:2,libLUNGPAR1:3,libLUNGPAR2:3,libBB091392:3,libBB091313:31 +|5548922;1;libBB0740:8,libCERVIX:15,libBB081093:8,libART:6,libKIDNEYM +:28,libFALLOP:6,libBREST:6,libBB08429:2,libSTOM:2,libMUSCL:2,libBB091 +313:15|5548928;1;libBB0740:1,libBB081093:1|5548990;1;libBB081093:1|55 +48992;1;libBB081093:4,libMUSCL:1|5548993;1;libBB04316:5|5548991;1;lib +BB0740:3,libBB091237:5,libBB081093:1,libBREST:5|5548995;1;libBB091313 +:1|5548994;1;libBB05704:1|5548996;1;libBRESTMUSC:1|5548921;1;libFAT:1 +,libBRESTMUSC:3,libBB0740:8,libBB091237:1,libCERVIX:25,libBB05704:11, +libBB081093:2,libART:4,libKIDNEYM:2,libLYMPHG:1,libBB04316:9,libACOL: +2,libSPL:1,libCART:4,libBB091661:1,libLUNGPAR1:2,libFALLOP:1,libBREST +:7,libKDNCORT:4,libGBLADD:3,libBB091313:12 28 338 1 block1594019 + GACGGAGAACTGATAAGGGTAG #3144863;1;libFALLOP:1 1 1 1 block1594019 + TGGACGGAGAACTGATAA #5548913;1;libBB0740:1,libBB081093:3 2 4 +1 block1594019 + GACGGAGAACTGATAAGGGTA #3144864;1;libBB0740:1 1 1 1 block1594019 + TTGGACGGAGAACTGATAAGGGT #5924355;1;libBB081093:1 1 1 1 block1594019 + GGACGGAGAACTGATAAGG #3945150;1;libBB091313:2 1 2 1 block1594019 + TTGGACGGAGAACTGATAAGGG #5924354;1;libBB0740:1 1 1 1 block1594019 + CGGAGAACTGATAAGGGTA #2423838;1;libBB081093:1 1 1 1 block1594019 + GGACGGAGAACTGATAAGGGT #3945153;1;libBB0740:1,libBB081093:2, +libART:4,libBB04316:3,libSPL:2,libBB091313:1 6 13 1 block1594019 + GACGGAGAACTGATAAGGG #3144860;2;libBB081093:2,libBB091392:3 +|3144859;2;libBB091661:6 3 11 2 block1594019 + TGGACGGAGAACTGATAAGGGTAG #5548962;1;libBB04316:11 1 11 1 block1594019 + GGACGGAGAACTGATAAGGGTA 
#3945154;1;libBB04316:2,libBB091313: +5|3945156;1;libBB091313:4 2 11 1 block1594019 + TGGACGGAGAACTGATAAG #5548916;1;libBB081093:1,libBB091313:1|5 +548919;1;libBB0740:2|5548918;1;libBB04316:1|5548915;1;libFAT:1,libBB0 +7981:1,libBB081093:1,libART:1,libCART:7,libBB091313:2|5548920;1;libSP +L:1|5548998;1;libBB04316:1|5548999;1;libKDNCORT:1|5548997;1;libBRESTM +USC:1|5548914;1;libBRESTMUSC:2,libBB0740:1,libBB05704:1,libKIDNEYM:28 +,libBB04316:3,libSTOM:5 14 62 1 block1594019 + ACGGAGAACTGATAAGGGT #677339;1;libBB04316:4 1 4 1 block1594019 + TTGGACGGAGAACTGATAAGGGTA #5924356;1;libBB04316:1 1 1 1 block1594019 + GGACGGAGAACTGATAAGGGTAG #3945155;1;libBB04316:4 1 4 1 block1594019 + TGGACGGAGAACTGATAAGGG #5548932;1;libFAT:1,libBB0740:4,libBB0 +4316:5,libBB091313:1|5548933;1;libBB04316:4,libBB091313:2|5548934;1;l +ibBB04316:5|5548931;1;libBB0740:4,libBB091237:2,libOVA:1,libBB05704:2 +,libBB081093:23,libVEIN:2,libBB04316:16,libBB072910:1,libCART:12,libB +B091661:3,libLUNGPAR1:1,libSKIN:18,libFALLOP:1,libKDNCORT:5,libBB0913 +92:2,libGBLADD:3,libBB091313:21|5548936;1;libBB04316:2|5548938;1;libB +B081093:1,libBB091313:2|5548937;1;libBB0740:1,libBB081093:1,libVEIN:1 +,libDCOL:2|5548939;1;libBB091313:1|5548930;1;libBRESTMUSC:21,libBB074 +0:56,libBB091237:6,libBB07981:2,libCERVIX:28,libOVA:1,libBB05704:17,l +ibBB081093:100,libART:11,libKIDNEYM:25,libVEIN:5,libBB04316:42,libADR +G:1,libHRTVALVE:1,libSPL:1,libBB091661:11,libLUNGPAR1:3,libFALLOP:5,l +ibLUNGPAR2:6,libBB08429:5,libKDNCORT:24,libBB091392:5,libSTOM:3,libBB +082901:2,libBB091313:35,libUTERIS:1|5548941;1;libCERVIX:1|5548942;1;l +ibFALLOP:1|5548943;1;libKIDNEYM:1|5548944;1;libBRESTMUSC:1|5548947;1; +libBB091392:2|5548946;1;libFAT:1,libBB04322:2|5548948;1;libBB0740:1,l +ibBB081093:2|5548945;1;libVEIN:2,libBB04316:2|5548949;1;libBB081093:1 +,libBB04316:3|5548951;1;libCERVIX:2,libBB04316:1,libFALLOP:1|5548952; +1;libBB0740:1,libART:1|5548950;1;libCERVIX:1,libBB081093:2,libBB07291 
+0:1,libHRTVALVE:1|5548954;1;libCERVIX:1|5548953;1;libKDNCORT:1|554895 +5;1;libSPL:1|5548956;1;libBB4110:1|5548929;1;libBRESTMUSC:12,libBB074 +0:8,libBB091237:3,libBB04322:1,libBB05704:1,libBB081093:16,libART:3,l +ibKIDNEYM:31,libBB04316:33,libACOL:2,libBB04331:4,libFALLOP:2,libBRES +T:4,libKDNCORT:3,libGBLADD:2,libBB091313:5 37 732 1 block1594019 + TGGACGGAGAACTGATAAGGGTA #5548960;1;libBB04316:8|5548959;1;li +bBB0740:1,libBB05704:11,libBB081093:5,libBB04316:51,libBB091661:2,lib +MUSCL:1,libBB091313:17,libUTERIS:1|5548961;1;libBB04316:1,libBB091661 +:2|5548963;1;libKIDNEYM:1|5548965;1;libBB04316:2|5548966;1;libBB08109 +3:2,libBB091661:1|5548964;1;libBB0740:18,libBB091237:6,libBB05704:32, +libBB081093:40,libART:5,libKIDNEYM:10,libLYMPHG:1,libVEIN:2,libBB0431 +6:173,libBB072910:2,libACOL:1,libBB091661:15,libFALLOP:1,libDCOL:2,li +bBB091392:4,libBB082901:4,libMUSCL:7,libJEJ:1,libBB091313:16|5548967; +1;libBB081093:1|5548969;1;libBB04316:1|5548970;1;libBB04316:1|5548971 +;1;libCERVIX:1|5548968;1;libBB091313:1|5548958;1;libFAT:1,libBB0740:2 +5,libCERVIX:28,libBB05704:9,libBB081093:31,libART:10,libLYMPHG:1,libH +EART:3,libBB04316:77,libADRG:2,libBB091661:9,libFALLOP:1,libKDNCORT:6 +,libBB091392:3,libBB082901:2,libMUSCL:1,libBB091313:19 25 679 1 block1594019 + GACGGAGAACTGATAAGGGT #3144865;1;libKIDNEYM:14|3144862;1;li +bBB04316:2 2 16 1 block1594019 + TGGACGGAGAACTGATAAGGGT #5548972;1;libCERVIX:14,libBB05704:4, +libBB081093:4,libSPL:1,libBB091313:2|5548974;1;libCERVIX:1|5548973;1; +libBB081093:1|5548975;1;libBB071024:1|5548978;1;libCERVIX:10,libBB043 +16:1|5548977;1;libBB05704:7,libBB04316:4,libBB091313:1|5548979;1;libB +B04316:7|5548980;1;libBB04316:1|5548981;1;libCERVIX:1,libHEART:1,libF +ALLOP:1|5548982;1;libFALLOP:1|5548983;1;libBB081093:2|5548984;1;libBB +081093:2|5548985;1;libBB04316:1|5548976;1;libFAT:1,libBRESTMUSC:6,lib +BB0740:17,libBB04322:3,libBB05704:36,libBB081093:39,libART:2,libLYMPH +G:3,libBB04316:21,libSPL:4,libBB04331:4,libBB091661:7,libLUNGPAR1:2,l 
+ibLUNGPAR2:2,libBB08429:2,libDCOL:4,libKDNCORT:9,libMUSCL:4,libBRCH:4 +,libJEJ:2,libBB091313:13|5548987;1;libBB081093:2|5548988;1;libBB05704 +:1|5548986;1;libBB0740:4,libBB091237:2,libBB05704:5,libBB081093:3,lib +ART:3,libBB04316:15,libSPL:3,libFALLOP:1,libBB08429:1,libBB091313:3|5 +548989;1;libBB071024:1|5548957;1;libFAT:1,libBRESTMUSC:16,libBB0740:6 +4,libBB091237:12,libCERVIX:80,libMirbase_hsa-miR-184:1,libOVA:1,libBB +04322:8,libBB05704:97,libBB081093:142,libBB4110:2,libART:14,libKIDNEY +M:38,libHEART:6,libVEIN:10,libBB04316:495,libBB04217:2,libBB072910:1, +libACOL:5,libHRTVALVE:1,libOESO:5,libBB072257:3,libSPL:18,libBB04331: +3,libCART:14,libBB091661:14,libLUNGPAR1:13,libFALLOP:25,libLUNGPAR2:5 +,libBREST:11,libBB08429:20,libDCOL:1,libKDNCORT:22,libBB091392:18,lib +STOM:4,libBB082901:3,libMUSCL:25,libGBLADD:3,libBB091313:94,libUTERIS +:1 43 1594 1 block1594019 --CCGGCCAGTCACGTCCCCTTATCACTTTTCCAGCCCAGCTTTGTGACT +-GTAAGTGTTGGACGGAGAACTGATAAGGGTAGGTGATTGACACTCACAGCCTCCGGA ratZ4p3 ----------------TCCCTTATCAGTTTTCCAG-CCAGCTTTGTGACT +-GTAAATGTTGGACGGAGAACTGATAAGGGT--------------------------- zebrafish36p3 -----GTCGAACACGTCTCCTTATCACTTTTCCAGCCCAGCTATCCATTT +AGTATTCGTTGGACGGAGAACTGATAAGGGCATGTGCCCGATACTCCCTG-------- chimpZ12p3 TGCCGGCCAGTCACGTCCCCTTATCACTTTTCCAGCCCAGCTTTGTGACT +-GTAAGTGTTGGACGGAGAACTGATAAGGGTAGGTGATTG------------------ mouseZ3p3 ----------------TTCCTTATCACTTTTCCAG-CCAGCTTTGTGACT +-CTAAGTGTTGGACGGAGAACTGATAAGGGT--------------------------- macaqueZ5p3 ----------------CCCCTTATCACTTTTCCAGCCCAGCTTTATGACT +-GTAAGTGTTGGACGGAGAACTGATAAGGGT--------------------------- chickenZ1p3 ----------------CCCCTTATCACTTTTCCAGCCCAGCTTCTTCGCT +-CTGACTGTTGGACGGAGAACTGATAAGGGT--------------------------- ******** ******** ****** * + * *********************** block1594019 ..... ....>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>...................... 
#ENST +00000384962 ENSG00000207695 hsa-mir-184 block1594019 ((((.(((((((...(((((((((.((((((...(((((((((..... + .)))).)))))..)))))).)))))))))...))))))).............)))). 0.700 -38. +34 # ratZ4p3 .((((((((((((((((.. (((((((((..... + .)))).)))))..)))))))))))))))). 1.000 -34. +10 # zebrafish36p3 ((((..(((((.(((((((((.((((((...(((((......... +.......)))))..)))))).))))))))).)))))..))))........ 0.969 -32. +99 # chimpZ12p3 .......(((((((...(((((((((.((((((...(((((((((..... + .)))).)))))..)))))).)))))))))...))))))) 1.000 -34. +40 # mouseZ3p3 .(((((((((.((((((.. (((((((((..... + .)))).)))))..)))))).))))))))). 1.000 -25. +50 # macaqueZ5p3 .(((((((((.((((((...((((((((((.... + ))))).)))))..)))))).))))))))). 1.000 -28. +00 # chickenZ1p3 .(((((((((.((((((...(((((.((...... + ..))..)))))..)))))).))))))))). 1.000 -27. +10 # #block1594019 chromosome:15:79502126:79502230:1 Same_strand|Exon +ic_non-coding|ENST00000384962|ENSG00000207695 ## ENSG00000207695|miRN +A|hsa-mir-184|hsa-mir-184 [Source:miRBase;Acc:MI0000481] ## {MIR: hsa +-mir-184} #ratZ4p3 chromosome:8:94692535:94692679:-1 Same_strand|Boundary_ +non-coding|ENSRNOT00000053645|ENSRNOG00000035522|miRNA|rno-mir-184 [S +ource:miRBase;Acc:MI0000929] ## {MIR: rno-mir-184} #zebrafish36p3 chromosome:18:26309060:26309154:1 Same_strand|Bou +ndary_non-coding|ENSDART00000119009|ENSDARG00000081763|miRNA| #chimpZ12p3 chromosome:15:77150842:77150930:1 Same_strand|Bounda +ry_non-coding|ENSPTRT00000054446|ENSPTRG00000027776|miRNA|ptr-mir-184 + [Source:miRBase;Acc:MI0002602] ## {MIR: ptr-mir-184} #mouseZ3p3 chromosome:9:89697054:89697198:-1 Same_strand|Boundar +y_non-coding|ENSMUST00000083662|ENSMUSG00000065596|miRNA|mmu-mir-184 + [Source:miRBase;Acc:MI0000226] ## {MIR: mmu-mir-184} #macaqueZ5p3 chromosome:7:58339687:58339831:1 Same_strand|Bounda +ry_non-coding|ENSMMUT00000036552|ENSMMUG00000026851|miRNA|mml-mir-184 + [Source:miRBase;Acc:MI0007648] ## {MIR: mml-mir-184} #chickenZ1p3 chromosome:10:22146221:22146365:1 Same_strand|Bound 
+ary_non-coding|ENSGALT00000028966|ENSGALG00000018259|miRNA|gga-mir-18 +4 [Source:miRBase;Acc:MI0001227] ## {MIR: gga-mir-184}

Until now I wanted to get the data from the txt file that comes after the *********** at the bottom of each block. I achieved that with the following script:

#!/usr/bin/perl
use DBI;
use Data::Dumper;
use strict;
use warnings;
use CGI;

my $filename  = $ARGV[0];
my $tablename = $ARGV[1];
my $outfile   = $ARGV[2];

open(INPUT, $filename);
my $line;
my $i;
my $sth;

while ($line = <INPUT>) {
    # Get the block identifier
    my $blockID;
    $line = <INPUT>;
    chomp($line);
    $blockID = $line;
    $i++;
    #print "\n"."New block: $blockID"."\n";

    # Skip the next four lines
    <INPUT>; <INPUT>; <INPUT>; <INPUT>;

    open (MYOUTFILE, ">>./tmp/$outfile");
    while ($line = <INPUT>) {
        chomp($line);
        if (! $line) {
            last;
        }
        else {
            my $tmp = index($line, '#');
            if ($tmp == -1) # alignment lines
            {
                if (index($line, '*') == -1) {
                    #print "$line\n";
                }
            }
            if ($tmp == 0) # genomic positions
            {
                #print "$line\n";
                my @fields = split('\t', $line);
                my $tmp = $fields[1];
                @fields = split(':', $tmp);
                my $chrom = $fields[1];
                my $start = $fields[2];
                my $stop  = $fields[3];
                #print "$i\t\$blockID\n";
                #print "$i\t$line\t$chrom\t$start\t$stop\t$blockID\n";
                print MYOUTFILE "$i\t$start\t$stop\t$chrom\t$blockID\n";
            }
        }
    }
}
close (MYOUTFILE);
close (INPUT);

Now I am trying to modify it in order to get the lines above (the ones starting with block). I tried changing the index function in my script (to return true instead of false, so that it would read those new lines), but that didn't seem to work. I am pretty inexperienced with regular expressions, and although I read your tutorial, I can't get this to work. Does anyone have any idea how I could get those lines? I know that there is more than one way to do it, but I am really stuck. Thanks in advance for your help.

Replies are listed 'Best First'.
Re: regular expression questions (from someone without experience)
by tospo (Hermit) on Sep 22, 2010 at 08:30 UTC

    Wow, that's a beast of a results file. At the moment, you are not using regular expressions at all. Without going in detail through this massive text file I can't give you the exact regexes but what you need to do is write down (and maybe post here if you need more help) what exactly a new block starts and ends with. This will help you to identify a pattern. Say you have a file like this:

    miRNA1a - results
    SOME DATA HERE
    ####### results end #########
    miRNA2 - results
    ... and so on
    Then you could do:
    while (<>){
        chomp;
        if (/^(miRNA\w+) - results/){
            my $mirna_id = $1;
            while (<>){
                chomp;
                last if /^####### results end #########/;
                # PARSE YOUR DATA HERE
            }
        }
    }
    There are of course other ways of doing this but this should work for you. You just need to identify the patterns that signal the beginning and end of a block.


    Actually, I think this may be even better for you: since you are probably parsing the same data for different microRNAs, all you need to do is keep track of which microRNA you are reading the data for at the moment. So, still using my above example text file, I would do:

    my $current_mirna;
    while(<>){
        chomp;
        $current_mirna = $1 if /^(miRNA\w+) - results/;
        die unless defined $current_mirna;
        # now you always have the miRNA ID of the current block
        # continue here to parse the actual data and insert into the
        # database for that specific miRNA
    }
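Applied to the file in the question, the same "remember the current block" idea might look like the sketch below. It is only an illustration under assumptions that the thread does not confirm: that each header line has the form `blockNNN (...)`, and that the wanted lines are the block ID followed by a strand sign and a nucleotide sequence, separated by whitespace. The helper name `read_lines_per_block` and the in-memory sample are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [block_id, strand, sequence]
# for every line that looks like "blockNNN + ACGT... #...".
sub read_lines_per_block {
    my ($fh) = @_;
    my ($current_block, @reads);
    while (my $line = <$fh>) {
        chomp $line;
        # Assumed header form: "block1352806 (miRBase hsa-mir-655) [...]"
        if ($line =~ /^(block\d+)\s+\(/) {
            $current_block = $1;
            next;
        }
        next unless defined $current_block;
        # Assumed read form: block ID, strand, sequence, then annotation
        if ($line =~ /^\Q$current_block\E\s+([+-])\s+([ACGTUN]+)\b/) {
            push @reads, [ $current_block, $1, $2 ];
        }
    }
    return @reads;
}

# Tiny in-memory sample in the assumed layout (not the real file).
my $sample = join "\n",
    '==========',
    'block1352806 (miRBase hsa-mir-655) [miRNAknown_cloningHIGH_randfoldOK]',
    '==========',
    '#identity: 100.00',
    "block1352806\t+\tATAATACATGGTTAACCTCTTT\t#1308609;1;libSPL:4\t31\t284\t1",
    "block1352806\tTTCGTTTCAGAACTATGCAAGG",
    '';
open my $sample_fh, '<', \$sample or die "cannot open in-memory sample: $!";
print join("\t", @$_), "\n" for read_lines_per_block($sample_fh);
close $sample_fh;
```

The alignment row in the sample is skipped because it has no strand column after the block ID; if the real file separates columns differently, the two patterns are the only places that need adjusting.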
Re: regular expression questions (from someone without experience)
by moritz (Cardinal) on Sep 22, 2010 at 08:34 UTC
    When you want to obtain data before and after a certain delimiter, think of split.

    Here's an example to get you started; it might need some tweaking to do what you want.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use autodie;

    open my $h, '<', 'file.txt';
    local $/ = '';
    while (my $block = <$h>) {
        # split off header
        my $body = (split /==========\n/, $block )[-1];
        # split on the long ***** line
        my ($before, $after) = split /\*{20,}\s*/, $body, 2;
        print "before: $before\n";
        print "after: $after\n";
    }

    Note that I've used the paragraph mode for reading the records; that's explained in the documentation for the $/ variable.
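For readers who have not met paragraph mode before, here is a minimal sketch of what setting `$/` to the empty string does. The helper name `read_paragraphs` and the sample text are invented for the illustration; the only assumption about the real file is that blocks are separated by at least one blank line.

```perl
use strict;
use warnings;

# Hypothetical helper: read blank-line-separated records from a handle.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = '';          # paragraph mode: one or more blank lines end a record
    my @records;
    while (my $para = <$fh>) {
        chomp $para;        # in paragraph mode chomp removes all trailing newlines
        push @records, $para;
    }
    return @records;
}

# In-memory sample: two paragraphs separated by several blank lines.
my $text = "first line\nsecond line\n\n\n\nthird line\n";
open my $text_fh, '<', \$text or die "cannot open in-memory sample: $!";
my @records = read_paragraphs($text_fh);
close $text_fh;
printf "%d records\n", scalar @records;
```

Using `local` keeps the change to `$/` confined to the helper, so line-by-line reads elsewhere in the program are not affected.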

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: regular expression questions (from someone without experience)
by sundialsvc4 (Abbot) on Sep 22, 2010 at 11:13 UTC

    What I generally try to do is to subdivide the problem into four concerns:

    1. Identifying particular records, preferably based on their characteristics, and
    2. Grabbing whatever I need out of each one, writing appropriate logic for each one.
    3. Recognizing when I have found a complete record and need to do something with it... including the last one.
    4. Coding defensively ... arranging for the program, itself, to check its own work.   (Realistically, nothing else can...)
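The four concerns can be sketched roughly as follows. This is not code from the thread: the sub name `count_reads`, the header pattern and the read-line pattern are all assumptions about the file layout, and counting reads per block simply stands in for "doing something" with a record.

```perl
use strict;
use warnings;

# Hypothetical sketch: count "read" lines per block, flushing each record
# when the next header appears and once more at end of input.
sub count_reads {
    my ($fh) = @_;
    my (%count, $current, $n);
    my $flush = sub {
        return unless defined $current;
        # 4. self-check: a block ID should not show up twice
        die "duplicate block $current\n" if exists $count{$current};
        $count{$current} = $n;
    };
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^(block\d+)\s+\(/) {
            # 1. identified a header record
            $flush->();                 # 3. the previous record is complete
            ($current, $n) = ($1, 0);
        }
        elsif ($line =~ /^block\d+\s+[+-]\s+[ACGTUN]+\b/) {
            # 4. self-check: a read line must belong to some block
            die "read line before any header at line $.\n"
                unless defined $current;
            $n++;                       # 2. grab what we need from this line
        }
    }
    $flush->();                         # 3. ... including the last one
    return %count;
}

# In-memory sample in the assumed layout (not the real file).
my $demo = "block1 (a) [b]\nblock1 + ACGT #x 1 1 1\nblock1 + ACG #y 1 1 1\n"
         . "block2 (c) [d]\nblock2 - TTGA #z 1 1 1\n";
open my $demo_fh, '<', \$demo or die "cannot open in-memory sample: $!";
my %reads_per_block = count_reads($demo_fh);
close $demo_fh;
print "$_: $reads_per_block{$_}\n" for sort keys %reads_per_block;
```

The final call to the flush closure after the loop is the part that is easiest to forget; without it the last block in the file is silently dropped.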

    This is the general approach taken by Perl’s little-brother, awk.   (Still a very good tool...)

    There really isn’t much more to be said, I’m afraid.   Regular expressions are a black art.   However, there are a great many “Perl-compatible regular expression checker” web sites to help you with debugging.

Re: regular expression questions (from someone without experience)
by biohisham (Priest) on Sep 22, 2010 at 14:54 UTC
    Well, the miRBase data format is spooky. The first thing I would do is clean up this file by removing the lines that are of no interest to the analysis at hand. You can keep the original file intact and generate the cleaned-up file(s) from it; each one holds its own subset of the original and addresses its own subproblem, and together they add up to the overall analytical goal. (N.B. You haven't mentioned what you intend to do with the file sections you want captured.)

    I have to disagree with Moritz's reliance on the '*' to separate the records (this arose from the OP's description), because the '*'s here have a different meaning altogether and aren't record separators at all. They mark the positions at which two lines of letters (or more, for that matter) are identical at the character level; this is known as Sequence Alignment. If the sequences weren't identical, no '*' would appear and two records could be inadvertently fused; if an alignment appeared mid-record, a record could be split in two without anyone noticing. On a related note, '-' is used to represent alignment gaps.
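To make the meaning of the '*' line concrete, here is a small sketch that builds such a line from aligned rows. It assumes the common convention that a column gets a '*' only when every row carries the same non-gap character; the sub name `conservation_line` and the toy sequences are invented, and the tool that produced the file in the question may apply different rules.

```perl
use strict;
use warnings;

# Hypothetical helper: build a conservation line for equal-length aligned rows.
sub conservation_line {
    my @rows = @_;
    return '' unless @rows;
    my $len = length $rows[0];
    for my $row (@rows) {
        die "aligned rows differ in length\n" unless length($row) == $len;
    }
    my $out = '';
    for my $i (0 .. $len - 1) {
        my $first = substr $rows[0], $i, 1;
        # A column is conserved if it is not a gap and no row disagrees.
        my $same = $first ne '-'
                && !grep { substr($_, $i, 1) ne $first } @rows;
        $out .= $same ? '*' : ' ';
    }
    return $out;
}

# Toy alignment: column 3 has a gap and column 5 has a mismatch.
my @alignment = ('GATTACA', 'GA-TACA', 'GATTGCA');
print "$_\n" for @alignment;
print conservation_line(@alignment), "\n";
```

Because the stars break wherever the sequences differ, the length and number of star runs depend on the data, which is why a pattern such as `\*{20,}` is fragile as a record separator.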

    Back to topic, refining the file by purging the unwanted lines can probably allow you to use one of the BioPerl modules to tackle the entire problem without writing much code after all and can enable us to see a clear definition thereof in order to relevantly provide assistance.

    You may want to read Perl and Bioinformatics in addition.

    Excellence is an Endeavor of Persistence. A Year-Old Monk :D .
Re: regular expression questions (from someone without experience)
by thargas (Deacon) on Sep 22, 2010 at 13:20 UTC
    I'm not a biology guy, but when I see strings like TGGACGGAGAACTGATAAGGGT in a file I think DNA/RNA. Have you looked in CPAN? There are lots of modules there specific to things biological. I.E. you may be re-inventing an existing wheel.

      This certainly looks like a standard output format - maybe OP should also check out BioPerl...

      Just a something something...
Re: regular expression questions (from someone without experience)
by girarde (Hermit) on Sep 22, 2010 at 14:47 UTC
    When including very large blocks of stuff, either code or data, the <readmore> tag is much appreciated.

Node Type: perlquestion [id://861225]
Approved by Corion
Front-paged by Corion