in reply to Re^3: Delete till end of line to another string
in thread Delete till end of line to another string

#!/usr/bin/perl
use strict;
use warnings;

my $tag;
my $output;
my $fh;
my $deletestrings;

#open DATA,"$ARGV[0]";
while (<DATA>) {
    chomp;
    s/[\cA-\cZ]//g;    # To remove control characters

    #To delete the lines which has {IT} and any lines between {IT} and {SOURCETAG}
    if (/\{IT\}/ .. /^\{SOURCETAG\}/) {
        unless (/^\{(IT|SOURCETAG)\}/) {
            $deletestrings = $_;
            $_ = '' if index( $_, "$deletestrings" ) >= 0;
        }
    }
    s/[\\|<]$//g;
    s/^[\\|<]//g;
    s/<//g;
    s/^\s+//g;
    s/\s+$//g;
    if(/^{(.*)}$/) {    # match {METATAG} line
        $fh = output($output, $tag, $fh);
        $output = "";
        $tag = $1;
    }
    else {    # not a {TAG} line
        next unless($tag);
        next if(/^\s*$/);
        s/\\//g;
        $output .= ($output) ? " $_" : "<$tag>$_";
    }
}
$fh = output($output, $tag, $fh);
if($fh) {
    print $fh "</ROOT>\n";
    close($fh);
}
exit(0);

sub output {
    my ($output, $tag, $fh) = @_;
    if($output) {
        if($output =~ m/<SOURCETAG>(.*)/) {
            if($fh) {
                print $fh "</ROOT>\n";
                close($fh);
            }
            open($fh, '>', "$1.xml") or die "$1.xml: $!";
            print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ROOT>\n";
        }
        print $fh "$output</$tag>\n";
    }
    return($fh);
}
__DATA__
^C^D^V^V^A exit 0543 961017 N S 9702050900 00000000^B{IT} N
{SOURCETAG}
9702050900
{ACCESSION}
000000
{DATE}
961017
{TDATE}
Thursday, October 17, 1996
{PUBLICATION}
EXITO
{EDITION}
Exito
{SECTION}
SALUD FAMILIAR
{PAGE}
22
{SOURCE}
Por Nahyr Acosta. Colaboradora de ¡Exito!
{COLUMN}
Nuestro hijos.
{HEADLINE}
EVITE LA ENFERMEDAD PERIODONTAL< AS {LEAD}CTIQUE LA HIGIENE DENTAL PARA CUIDAR SUS ENCÃ Cuando hace casi dos años visité a mi dentista para hacerme un chequeo dental , me encontré ante un problema que podía terminar con mi dentadura. Siempre tuve unos dientes sanos y jamás se me ocurrió pensar que podía perder mis piezas si no tomaba acción inmediata. Me puse en manos del dentista y empezamos a trabajar en mis encías.<
This code creates XML output, but for {LEAD} it does not create the proper tag. When I pass the input file as a command-line argument, the output looks like this:
<SECTION>SALUD FAMILIAR</SECTION>
<PAGE>22</PAGE>
<SOURCE>Por Nahyr Acosta. Colaboradora de ¡Exito!</SOURCE>
<COLUMN>Nuestro hijos.</COLUMN>
+ AS</HEADLINE>
<LEAD>Cuando hace casi dos años visité a mi dentista para hacerme un chequeo dental , me encontré ante un problema que podía terminar con mi dentadura. Siempre tuve unos dientes sanos y jamás se me ocurrió pensar que podía perder mis piezas si no tomaba acción inmediata. Me puse en manos del dentista y empezamos a trabajar en mis encías. La descomposición de la comida, la saliva y las bacterias que hay en la boca producen una sustancia llamada placa, que se adhiere a la superficie de los dientes y, si no se limpia diariamente, irrita las encías y se endurecen, formando el sarro. Esta sustancia, a su vez, se adhiere cada vez más, inflamando las encías y produciendo bolsas que se llenan con colonias de bacterias. Estas son las que con el tiempo, causan la enfermedad periodontal. Las bacterias también producen una sustancia llamada toxina, la cual destruye el hueso.</LEAD>
For {LEAD} it creates the tag, but not for {HEADLINE}.

Re^5: Delete till end of line to another string
by rovf (Priest) on Aug 18, 2009 at 13:32 UTC
    First, may I suggest annotating your code with comments and streamlining it a bit, because I found it quite incomprehensible. For instance, if I look at

    s/[\\|<]$//g; s/^[\\|<]//g; s/<//g;
    You first delete a \, | or < at the end and at the beginning of $_, and then you erase every < from the whole of $_. If you erase all < from $_ anyway, you don't need to handle it separately at the beginning and the end. Also, if you want to erase certain characters from a string completely, tr is more readable, so your code would become
    s/[\\|]$//g; s/^[\\|]//g; tr/<//d;
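    As a minimal sketch of that cleanup, the three statements can be wrapped in a small function and tried on one line. The name clean_line and the sample string are made up for illustration and do not appear in the original script. Note that tr needs both a search list and an (empty) replacement list before the d modifier.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): strip one leading
# and one trailing backslash or pipe, then delete every '<' with tr///d.
sub clean_line {
    my ($line) = @_;
    $line =~ s/[\\|]$//;    # one trailing \ or |
    $line =~ s/^[\\|]//;    # one leading \ or |
    $line =~ tr/<//d;       # every '<', anywhere in the string
    return $line;
}

# Made-up sample input, loosely modelled on the HEADLINE data line.
print clean_line('|EVITE LA ENFERMEDAD PERIODONTAL<'), "\n";
```

    Because the patterns are anchored, the /g modifier adds nothing to the two substitutions, so it is left out here.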
    Also, I find your question self-contradictory. You write
    But for {LEAD} it does not create the proper tag. ... For {LEAD} it creates the tag

    So first you say it does not create the tag, then you say it does. It rather seems to me that the problem is a missing opening tag for HEADLINE only, or is there still something wrong with LEAD?

    You should put print statements at various points in your code so that you can see what is going on, in particular after your program has encountered HEADLINE for the first time. I would do the same if I had to debug such a program...
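    One possible shape for such a debugging print is sketched below. It assumes the trouble might involve characters that a terminal does not show, so it prints every byte outside printable ASCII as \xNN. The helper name visible and the sample string are hypothetical and are not taken from the actual data file.

```perl
use strict;
use warnings;

# Hypothetical helper: render every character outside printable ASCII
# as \xNN so that invisible bytes show up in debug output.
sub visible {
    my ($s) = @_;
    $s =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $s;
}

# Made-up sample line; in the real script this would be $_ or $output.
my $line = "ENC\xC3\x8DAS\r";
print STDERR "DEBUG [", visible($line), "]\n";
```

    A call like this could go right after chomp in the read loop and again inside output(), so the HEADLINE text can be compared before and after the substitutions.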
    -- 
    Ronald Fischer <ynnor@mm.st>
      Hi, I am not able to debug the script. When I grep for the <HEADLINE> tag in the file:
      grep '<HEADLINE>' 9702050900.xml
      it prints the following:
      <HEADLINE>EVITE LA ENFERMEDAD PERIODONTAL PRA CTIQUE LA HIGIENE DENTAL PARA CUIDAR SUS ENC�AS</HEADLINE>
      But when I open the file in vi, it displays
      <COLUMN>Nuestro hijos.</COLUMN>
      + AS</HEADLINE>
      <LEAD>C......</LEAD>
      The problem is very strange. Can anyone please help me?
        Have a look at the file with a hex viewer, for instance (on Unix or other reasonable operating systems) using
        od -cx FILENAME | less
        Maybe you will find some conspicuous control character, such as an isolated carriage return (0x0D).
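        If a hex dump is too noisy, the same search can be sketched in Perl: report only the bytes that are neither printable ASCII nor tab or newline, together with their position. The function name suspicious_bytes and the in-memory sample are assumptions made for illustration; for a real file, one would open it in raw mode (for example '<:raw') and pass that handle instead.

```perl
use strict;
use warnings;

# Hypothetical helper: list bytes that are neither printable ASCII nor
# tab/newline, with 1-based line number and 0-based column.
sub suspicious_bytes {
    my ($fh) = @_;
    my @found;
    while (my $line = <$fh>) {
        while ($line =~ /([^\x20-\x7E\t\n])/g) {
            push @found,
                sprintf('line %d, col %d: 0x%02X', $., pos($line) - 1, ord $1);
        }
    }
    return @found;
}

# Demo on a made-up in-memory sample, not the real 9702050900.xml.
my $sample = "<COLUMN>x</COLUMN>\n<HEADLINE>ENC\xC3\x8DAS\r</HEADLINE>\n";
open(my $fh, '<', \$sample) or die "in-memory open: $!";
print "$_\n" for suspicious_bytes($fh);
close($fh);
```

        Every reported byte is only a candidate: multi-byte UTF-8 sequences for accented letters show up as well and are legitimate, whereas a lone 0x0D or another control byte is worth a closer look.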

        -- 
        Ronald Fischer <ynnor@mm.st>