Re^2: skipping lines when parsing a file

Hi ssandv,
I want to remove the text starting at "COMMENT" and ending just before the line that starts with "FEATURES".

LOCUS       4                     302276 bp    DNA     linear   HTG 31
+-OCT-2008
DEFINITION  Mus musculus chromosome 4 NCBIM37 partial sequence
            138489260..138791535 reannotated via EnsEMBL
ACCESSION   chromosome:NCBIM37:4:138489260:138791535:-1
KEYWORDS    .
SOURCE      house mouse.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Eutele
+ostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
COMMENT     This sequence was annotated by the Ensembl system. Please 
+visit
            the Ensembl web site, http://www.ensembl.org/ for more
            information.
            All feature locations are relative to the first (5') base 
+of the
            sequence in this file. The sequence presented is always th
+e
            forward strand of the assembly. Features that lie outside 
+of the
            sequence contained in this file have clonal location coord
+inates
            in the format: <clone accession>.<version>:<start>..<end>
            The /gene indicates a unique id for a gene,
            /note="transcript_id=..." a unique id for a transcript,
            /protein_id a unique id for a peptide and note="exon_id=..
+." a
            unique id for an exon. These ids are maintained wherever p
+ossible
            between versions.
            All the exons and transcripts in Ensembl are confirmed by
            similarity to either protein or cDNA sequences.
            Features not parsed:
                 gene            AL807811.9.1.249021:69347..168228
            /locus_tag="Capzb" /gene="ENSMUSG00000028745" /note="cappi
+ng protein
            (actin filament) muscle Z-line, beta [Source:MGI;Acc:MGI:1
+04652]"
                 mRNA            join(complement(42593..42690),
            AL807811.9.1.249021:115325..115414,
            AL807811.9.1.249021:133677..133798,
            AL807811.9.1.249021:138400..138513,
            AL807811.9.1.249021:156412..156553,
            AL807811.9.1.249021:156884..157000,
            AL807811.9.1.249021:163445..163510,
            AL807811.9.1.249021:164172..164248,
            AL807811.9.1.249021:167419..168228) /gene="ENSMUSG00000028
+745"
            /note="transcript_id=ENSMUST00000102508"
                 CDS             join(complement(42593..42595),
            AL807811.9.1.249021:115325..115414,
            AL807811.9.1.249021:133677..133798,
            AL807811.9.1.249021:138400..138513,
            AL807811.9.1.249021:156412..156553,
            AL807811.9.1.249021:156884..157000,
            AL807811.9.1.249021:163445..163510,
            AL807811.9.1.249021:164172..164248,
            AL807811.9.1.249021:167419..167506) /db_xref="CCDS:CCDS188
+41.1"
            /db_xref="MGI:Capzb"
            /db_xref="Vega_mouse_transcript:OTTMUST00000022955"
            /protein_id="ENSMUSP00000099566" /gene="ENSMUSG00000028745
+"
            /note="transcript_id=ENSMUST00000102508"
FEATURES             Location/Qualifiers
     source          1..302276
                     /db_xref="taxon:10090"
                     /organism="Mus musculus"
     gene            complement(267261..268504)
                     /note="locus_tag=Rnf186"
                     /gene="ENSMUSG00000070661"
                     /note="ring finger protein 186 [Source:MGI;Acc:MG
+I:1914075]

Then print from "FEATURES" through the end of the file.
LomSpace


Replies are listed 'Best First'.
Re^3: skipping lines when parsing a file by ssandv (Hermit) on Aug 20, 2009 at 14:51 UTC
So it appears that sections of the file are defined by words in all caps starting in column 0. This lends itself pretty well to keeping track of the state (in this case, the file section) you're in. There are many other ways to do it, but this is an example of what I was suggesting:

    my $state = '';    # current section name; empty until the first header line
    while (my $line = <$in>) {
        if ($line =~ /^([A-Z]+)/) {
            $state = $1;
        }
        print $line unless $state eq "COMMENT";
    }

which outputs:

07:37<sandvik@sat1> ~/perl$ ./pmtest.pl
LOCUS       4                     302276 bp    DNA     linear   HTG 31
+-OCT-2008
DEFINITION  Mus musculus chromosome 4 NCBIM37 partial sequence
            138489260..138791535 reannotated via EnsEMBL
ACCESSION   chromosome:NCBIM37:4:138489260:138791535:-1
KEYWORDS    .
SOURCE      house mouse.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Eutele
+ostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
FEATURES             Location/Qualifiers
     source          1..302276
                     /db_xref="taxon:10090"
                     /organism="Mus musculus"
     gene            complement(267261..268504)
                     /note="locus_tag=Rnf186"
                     /gene="ENSMUSG00000070661"
                     /note="ring finger protein 186 [Source:MGI;Acc:MG
+I:1914075]
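For anyone who wants to try the state-tracking idea without touching a real file, here is a minimal sketch that wraps the same loop in a sub and feeds it an in-memory string. The name `strip_section` and the miniature record are made up for illustration; they are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the lines of $text
# that lie outside the named section. A section starts at a line with an
# all-caps word in column 0 and runs until the next such line.
sub strip_section {
    my ($text, $section) = @_;
    my $state = '';
    my @kept;
    # Open the string as an in-memory file so the loop reads it line by line.
    open my $in, '<', \$text or die "couldn't open in-memory file: $!";
    while (my $line = <$in>) {
        $state = $1 if $line =~ /^([A-Z]+)/;    # a new section header
        push @kept, $line unless $state eq $section;
    }
    close $in;
    return @kept;
}

# Made-up miniature record, for illustration only.
my $sample = join '',
    "LOCUS       4\n",
    "COMMENT     first comment line\n",
    "            continuation line\n",
    "FEATURES             Location/Qualifiers\n",
    "     source          1..10\n";

print strip_section($sample, 'COMMENT');
```

Indented continuation lines never match `/^([A-Z]+)/`, so they inherit the state of the header above them. That is why the whole COMMENT block is dropped and printing resumes at FEATURES.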
Re^4: skipping lines when parsing a file by lomSpace (Scribe) on Aug 20, 2009 at 17:42 UTC
I am doing this on Windows. Could that be the reason it is not parsing? LomSpace
Re^5: skipping lines when parsing a file by ssandv (Hermit) on Aug 20, 2009 at 17:50 UTC
Conceivably. It's possible Perl is confused by your line terminators on Windows and is slurping the entire file in as one single line. You can test this by writing a quick piece of code that reads and prints one line at a time, printing something after each line (or waiting for <STDIN>, or something similar). If you think it's a data format problem, I recommend debugging by printing to the screen instead of to a file; it's faster. Obviously, if my other response is correct and it's a file access problem, you'll have to do file access to discover it.
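A quick check along those lines might look like the sketch below. The helper name `describe_lines` and the sample data are assumptions for illustration; the sample uses bare carriage returns only to show what "the whole file as one line" would look like, not because the real file is known to contain them.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reads $fh one line at a time
# and returns a numbered report with \r and \n made visible.
sub describe_lines {
    my ($fh) = @_;
    my @report;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
        (my $visible = $line) =~ s/\r/\\r/g;    # show carriage returns
        $visible =~ s/\n/\\n/g;                 # show newlines
        push @report, "$count: [$visible]";
    }
    return @report;
}

# Made-up data terminated by bare carriage returns instead of newlines.
my $cr_only = "LOCUS 4\rCOMMENT x\rFEATURES y\r";
open my $fh, '<', \$cr_only or die "couldn't open in-memory file: $!";
print "$_\n" for describe_lines($fh);
close $fh;
```

With this sample the report contains a single entry, because the default line separator is `\n` and none appears in the data. To run the same check on a real file, open it by path instead; adding the `:raw` layer to the open mode should show the terminators exactly as they are stored on disk.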
Re^4: skipping lines when parsing a file by lomSpace (Scribe) on Aug 20, 2009 at 17:02 UTC
ssandv, OK, this is pretty simple and straightforward. The regex captures the all-caps word at the beginning of a line, and the state stays the same until the next line that begins with an all-caps word. When it reaches that line, it starts printing again? When I run this:

    #!/usr/bin/perl -w
    use strict;

    =cut
    The script parses the targeting.gb file and creates
    a new file that contains removes the comment info.
    =cut

    open(my $in, "C:/Documents and Settings/mydir/Desktop/TARGETING.gb");
    open(my $out, ">C:/Documents and Settings/mydir/Desktop/TARGET.gb");
    my $state;
    while(my $line =<$in>){
        if ($line=~/^([A-Z]+)/) {
            $state=$1;
        }
        print $out $line unless $state eq "COMMENT";
    }
    close $in;
    close $out

the file is not being parsed. The output file is still the same as the input file. Any ideas? LomSpace
Re^5: skipping lines when parsing a file by ssandv (Hermit) on Aug 20, 2009 at 17:46 UTC
A couple of thoughts come to mind. First of all, you're better off getting in the habit of using 3-argument `open`, because it protects you against weird filenames. That shouldn't be a problem in this case, but it's a good habit. Also, `use warnings;` as well as `use strict;`. Most importantly, your `open` calls should have `or die "open failed: $!"` (or something similar) at the end, so that if they fail you know about it. With that in place, if you try to open a locked or read-only file for writing, the script will die (because `open` returns a false value if it can't open the file) instead of continuing on and silently failing on all its print statements, which is probably what's happening to you.

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $readfile  = "C:/Documents and Settings/mydir/Desktop/TARGETING.gb";
    my $writefile = "C:/Documents and Settings/mydir/Desktop/TARGET.gb";

    # Stop with the system error message if either file cannot be opened.
    open my $in,  "<", $readfile  or die "couldn't open $readfile: $!";
    open my $out, ">", $writefile or die "couldn't open $writefile: $!";

    my $state = '';    # current section name; empty until the first header line
    while (my $line = <$in>) {
        if ($line =~ /^([A-Z]+)/) {
            $state = $1;
        }
        print $out $line unless $state eq "COMMENT";
    }
    close $in;
    close $out;

The output I showed above in Re^3 was produced with the code I wrote, which is what leads me to look at this as a file access issue now.
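To see what a failed `open` looks like without touching the real files, a minimal sketch follows. The helper `try_open_for_write` and the deliberately nonexistent directory are made up for illustration; the exact wording of the `$!` message depends on the operating system and locale.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not from the thread): tries to open $path for
# writing and returns an error string on failure, or undef on success.
sub try_open_for_write {
    my ($path) = @_;
    open my $out, '>', $path or return "couldn't open $path: $!";
    close $out or return "couldn't close $path: $!";
    return;
}

# A path inside a directory that is assumed not to exist, so open fails.
my $bad_path = File::Spec->catfile(File::Spec->tmpdir, "no_such_dir_$$", 'TARGET.gb');

my $error = try_open_for_write($bad_path);
print defined $error ? "$error\n" : "opened $bad_path without error\n";
```

Without the `or return` (or `or die`) check, the failure is only visible in the return value of `open`, and every later `print` to that handle fails as well.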