Re^4: Delete till end of line to another string

#!/usr/bin/perl
use strict;
use warnings;

my $tag;
my $output;
my $fh;
my $deletestrings;
#open DATA,"$ARGV[0]";
while (<DATA>) {
    chomp;
 s/[\cA-\cZ]//g;     # To remove control characters
# Delete the {IT} line and any lines between {IT} and {SOURCETAG}
if (/\{IT\}/ .. /^\{SOURCETAG\}/) {
        unless (/^\{(IT|SOURCETAG)\}/) {
            $deletestrings = $_;
        $_ = '' if index( $_, "$deletestrings" ) >= 0;
        }
    }
        s/[\\|<]$//g;
        s/^[\\|<]//g;
        s/<//g;
        s/^\s+//g;
        s/\s+$//g;
    if(/^{(.*)}$/) {    # match {METATAG} line
        $fh = output($output, $tag, $fh);
        $output = "";
        $tag = $1;
    } else {            # not a {TAG} line

        next unless($tag);
        next if(/^\s*$/);
        s/\\//g;
        $output .= ($output) ? " $_" : "<$tag>$_";
    }
}
$fh = output($output, $tag, $fh);
if($fh) {
    print $fh "</ROOT>\n";
    close($fh);
}
exit(0);

sub output {
    my ($output, $tag, $fh) = @_;
    if($output) {
        if($output =~ m/<SOURCETAG>(.*)/) {
            if($fh) {
                print $fh "</ROOT>\n";
                close($fh);
            }
            open($fh, '>', "$1.xml") or die "$1.xml: $!";
            print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ROOT>\n";
        }
        print $fh "$output</$tag>\n";
    }
    return($fh);
}
__DATA__
^C^D^V^V^A exit 0543 961017 N S 9702050900 00000000^B{IT}
N
{SOURCETAG}
9702050900
{ACCESSION}
000000
{DATE}
961017
{TDATE}
Thursday, October 17, 1996
{PUBLICATION}
EXITO
{EDITION}
Exito
{SECTION}
SALUD FAMILIAR
{PAGE}
22
{SOURCE}
Por Nahyr Acosta. Colaboradora de ¡Exito!
{COLUMN}
Nuestro hijos.
{HEADLINE}
EVITE LA ENFERMEDAD PERIODONTAL<                   AS
{LEAD}CTIQUE LA HIGIENE DENTAL PARA CUIDAR SUS ENCÃ   Cuando hace casi dos años visité a mi dentista para hacerme un chequeo
dental , me encontré ante un problema que podía terminar con mi dentadura.
Siempre tuve unos dientes sanos y jamás se me ocurrió pensar que podía  perder
mis piezas si no tomaba acción inmediata. Me puse en manos del  dentista y
empezamos a trabajar en mis encías.<

This code creates XML output, but for {LEAD} it does not create the proper tag. When I pass the input file as a command-line argument, the output looks like this:

<SECTION>SALUD FAMILIAR</SECTION>
<PAGE>22</PAGE>
<SOURCE>Por Nahyr Acosta. Colaboradora de ¡Exito!</SOURCE>
<COLUMN>Nuestro hijos.</COLUMN>                                                            AS</HEADLINE>
<LEAD>Cuando hace casi dos años visité a mi dentista para hacerme un chequeo dental , me encontré ante un problema que podía terminar con mi dentadura. Siempre tuve unos dientes sanos y jamás se me ocurrió pensar que podía  perder mis piezas si no tomaba acción inmediata. Me puse en manos del  dentista y empezamos a trabajar en mis encías. La descomposición de la comida, la saliva y las bacterias que hay en la boca producen una sustancia llamada placa, que se adhiere a la superficie de los dientes y, si no se limpia diariamente, irrita las encías y se endurecen, formando el sarro. Esta sustancia, a su vez, se adhiere cada vez más, inflamando las encías y produciendo bolsas que se llenan con colonias de bacterias. Estas son las que con el tiempo, causan la enfermedad periodontal. Las bacterias también producen una sustancia llamada toxina, la cual destruye el hueso.</LEAD>

For {LEAD} it creates the tag, but not for {HEADLINE}.

Re^5: Delete till end of line to another string by rovf (Priest) on Aug 18, 2009 at 13:32 UTC
First, may I suggest annotating your code with comments and streamlining it a bit, because I found it quite incomprehensible. For instance, look at `s/[\\|<]$//g; s/^[\\|<]//g; s/<//g;`. You first delete a \, | or < from the end and from the beginning of $_, and then you erase every < from the whole of $_. If you erase every < from $_ anyway, you don't need to handle it separately for the special cases 'beginning' and 'end'. Also, if you want to remove certain characters from a string completely, `tr` is more readable, so your code would become `s/[\\|]$//g; s/^[\\|]//g; tr/<//d;`.

I also find your question self-contradictory. You write "But for {LEAD} it does not create the proper tag" and later "For {LEAD} it creates the tag". So first you say it does not create the tag, then you say it does. It rather seems to me that the problem is a missing opening tag for HEADLINE only, or is there still something wrong with LEAD?

You should put various print statements into your code so that you can see what is going on, in particular after your program has encountered HEADLINE for the first time. I would do the same if I had to debug such a program.

-- Ronald Fischer <ynnor@mm.st>
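To make the `tr` suggestion above concrete, here is a minimal sketch that runs both variants side by side on a few made-up sample strings. The helper names `clean_line` and `clean_line_original` are hypothetical and do not appear in the posted script, and the sample strings are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the streamlined cleanup suggested in the reply.
sub clean_line {
    my ($line) = @_;
    $line =~ s/[\\|]$//;    # drop one trailing backslash or pipe
    $line =~ s/^[\\|]//;    # drop one leading backslash or pipe
    $line =~ tr/<//d;       # delete every '<' anywhere in the line
    return $line;
}

# Hypothetical helper: the three substitutions as written in the posted script.
sub clean_line_original {
    my ($line) = @_;
    $line =~ s/[\\|<]$//g;
    $line =~ s/^[\\|<]//g;
    $line =~ s/<//g;
    return $line;
}

# Made-up samples; in single quotes, '\\' is one literal backslash.
for my $sample ('|EVITE LA ENFERMEDAD<', '<a<b<', '\\text\\') {
    printf "%-24s new: %-22s original: %s\n",
        $sample, clean_line($sample), clean_line_original($sample);
}
```

On these samples both versions should give the same result, which is the point of the simplification: the `<` in the anchored character classes is redundant once every `<` is deleted anyway.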
Re^6: Delete till end of line to another string by Anonymous Monk on Aug 19, 2009 at 05:24 UTC
Hi, I am not able to debug the script. When I grep for `<HEADLINE>` [download] tag in a file grep '<HEADLINE>' 9702050900.xml It prints out as below. `<HEADLINE>EVITE LA ENFERMEDAD PERIODONTAL PRA CTIQUE LA HIGIENE DENTAL PARA CUIDAR SUS ENCï¿½AS</HEADLINE>` But when I vi the file to view, it displays `<COLUMN>Nuestro hijos.</COLUMN> + AS</HEADLINE> <LEAD>C......</LEAD>` [download] The problem is very strange. Anyone please help me	[reply] [d/l] [select]
Re^7: Delete till end of line to another string by rovf (Priest) on Aug 19, 2009 at 08:27 UTC
Have a look at the file with a hex viewer, for instance `od -cx FILENAME | less` on Unix or other reasonable operating systems. Maybe you will find some conspicuous control character, such as an isolated carriage return (0x0D).

-- Ronald Fischer <ynnor@mm.st>
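If a hex dump is awkward to read, the same check can be sketched in Perl itself. The helper `find_control_chars` below is hypothetical, and the sample string is invented to show what a stray carriage return would look like. It is not taken from the actual file, whose real contents are unknown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, byte value] pairs for every control
# character (0x00-0x1F and 0x7F) in $text, ignoring newline (0x0A).
sub find_control_chars {
    my ($text) = @_;
    my @hits;
    while ($text =~ /([\x00-\x09\x0B-\x1F\x7F])/g) {
        # pos() points just past the match, so subtract 1 for the offset
        push @hits, [ pos($text) - 1, ord($1) ];
    }
    return @hits;
}

# Invented sample: a carriage return in the middle of a line (assumption).
my $sample = "<HEADLINE>EVITE LA ENFERMEDAD\rAS</HEADLINE>\n";
for my $hit (find_control_chars($sample)) {
    printf "offset %d: 0x%02X\n", @$hit;
}
```

To apply this to the generated file, one could read it line by line from a handle opened with `binmode` and call the helper on each line, so that no I/O layer alters the bytes before they are inspected.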