Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

#!/usr/bin/perl while(<DATA>){ if (/\{TT\}/ .. /^\{TAG\}/) { unless (/^\{(TT|TAG)\}/) { $deletestrings = $_; #if($_ =~ m/^D$/){ print $_; # $_ = '' if index( $_, "$deletestrings" ) >= 0; } } } __DATA__ S 9912290449 00005941^B{TT} D {TAG} 9912290449 {PUBLICATION} THE OS {DATE} 000101 S 9912290450 00005941^B{TT} R {TAG} 9912290450 {DATE} 000101 {TDATE} Saturday, January 1, 2000 S 9912290451 00005941^B{TT} D {TAG} 9912290451 {DATE} 000101 {TDATE} Saturday, January 1, 2000
If there is a character 'D' below the {TT} I have to delete all the characters from the line which has text {TT} and to begining of another {TT}. Now I have code to store only text between {TT} and {TAG}.

Replies are listed 'Best First'.
Re: delete lines till
by ikegami (Patriarch) on Aug 25, 2009 at 06:04 UTC
    Deal with records, not lines.
    local $/ = ''; while (<>) { next if /\A.*\nD$/m; print; }
    More readable, more versatile:
    local $/ = ''; while (<>) { chomp( my @rec = split /^/m ); my $hdr1 = shift(@rec); my $hdr2 = shift(@rec); my %rec = @rec; next if $hdr2 eq 'D'; print; }
      Please tell me what does the line mean? /\A.*\nD$/m;. How to store that records with 'D' to a variable?
        Why don't you install YAPE::Regex::Explain?
        The regular expression: (?m-isx:\A.*\nD$) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?m-isx: group, but do not capture (with ^ and $ matching start and end of line) (case- sensitive) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- \A the beginning of the string ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- \n '\n' (newline) ---------------------------------------------------------------------- D 'D' ---------------------------------------------------------------------- $ before an optional \n, and the end of a "line" ---------------------------------------------------------------------- ) end of grouping ----------------------------------------------------------------------
Re: delete lines till
by Sewi (Friar) on Aug 25, 2009 at 07:32 UTC
    Try:
    s/[^\n]+\{TT\}\nD.+?(\{TT\}|$)/$1/gs
    Explained:
  • s/ startes a replace, because we want to replace the {TT}/d - blocks by "nothing" (which means just deleting them)
  • ^\n+ selectes all text (everything which is no newline char = \n) as you said "...delete all chars from the line which has text {TT}", remove this part if you want the chars before the {TT} to stay.
  • \{TT\}\nD matches your "if there is a char D below the {TT}" = A {TT} followed by a newline followed by a D
  • .+? matches everything from the newline following the D, but the ? says that we wan't to match as few as possible chars (missing it would match everything beginning from here)
  • (\{TT\}|$) finally selects the next {TT} and stops the match before at this point. The |$ means that we also accept a "End of string" as match. The last {TT}\nD - block wouldn't be deleted otherwise. You could also add ^\n+ after the ( to keep the whole line which holds the next {TT}
  • /$1 holds the replacement string for the earlier match, in this case the first ( ) block - which is our "beginning of next block" marker
  • /gs has two options: g for "replace all" and s for "make . also match \n" which is important for the .+? - block - it won't work over newlines otherwise.
  • This solution will do what you want as long as you could get the data into a variable. My personal choice would be this short way as long as the amount of data is below 25% of the memory your script may eat. Assuming you could give 1 GB of RAM to it would allow 250GB of data being processed, maybe more.
      #!/usr/bin/perl while(<DATA>){ if (/\{TT\}/ .. /^\{TAG\}/) { unless (/^\{(TT|TAG)\}/) { $deletestrings = $_; #if($_ =~ m/^D$/){ print $_; open FILE, '>list.txt'; print FILE $_; close FILE; $_ = '' if index( $_, "$deletestrings" ) >= 0; } } } __DATA__ S 9912290449 00005941^B{TT} D {TAG} 9912290449 {PUBLICATION} THE OS {DATE} 000101 S 9912290450 00005941^B{TT} R {TAG} 9912290450 {DATE} 000101 {TDATE} Saturday, January 1, 2000 S 9912290451 00005941^B{TT} D {TAG} 9912290451 {DATE} 000101 {TDATE} Saturday, January 1, 2000
      Now the above writes the lines which is between {TT} and {TAG}. The output is
      S 9912290449 00005941^B{TT} D S 9912290450 00005941^B{TT} R S 9912290451 00005941^B{TT} D
      How to write only two lines which has the character D to the file. In the file: only
      S 9912290449 00005941^B{TT} D S 9912290451 00005941^B{TT} D
      should be writeen
        Actually the above statement prints all the lines
        S 9912290449 00005941^B{TT} D S 9912290450 00005941^B{TT} R S 9912290451 00005941^B{TT} D
        But I want to print only the statemtents which has 'D' and its above line. Yhe output should be something like this
        S 9912290449 00005941^B{TT} D S 9912290451 00005941^B{TT} D
Re: delete lines till
by Anonymous Monk on Aug 25, 2009 at 06:02 UTC
    What is the name of this format?
Re: delete lines till
by bichonfrise74 (Vicar) on Aug 25, 2009 at 19:13 UTC
    Try this. Although I'm not sure how this relates to your original question.
    #!/usr/bin/perl use strict; my @records = <DATA>; for (0 .. $#records) { print $records[$_] . $records[$_ + 1] if ( $records[$_] =~ /\{TT\}/ ); } __DATA__ S 9912290449 00005941^B{TT} D {TAG} 9912290449 {PUBLICATION} THE OS {DATE} 000101 S 9912290450 00005941^B{TT} R {TAG} 9912290450 {DATE} 000101 {TDATE} Saturday, January 1, 2000 S 9912290451 00005941^B{TT} D {TAG} 9912290451 {DATE} 000101 {TDATE} Saturday, January 1, 2000
      S 9912290449 00005941^B{TT} D S 9912290450 00005941^B{TT} R S 9912290451 00005941^B{TT} D <code/> But I want to print only the statemtents which has 'D' and its above l +ine. Yhe output should be something like this <code> S 9912290449 00005941^B{TT} D S 9912290451 00005941^B{TT} D