Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have an XML file like this:
</S></TEXT><TEXT><S Entail="142" s_id="0"> Annan urges return to democracy in <REF C-ENTID="Nepal" EXT="Nepal" ID="104" S&#1058;YPE="PROPNAME">Nepal</REF></S><S Entail="138-139-142" s_id="1"> UN Secretary General Kofi Annan on Tuesday expressed deep concern over events in <REF A-CLASS="No-Reference" A-REFTYPE="Entity" C-ENTID="Nepal" EXT="Nepal" ID="105" S&#1058;YPE="PROPNAME">Nepal</REF> and urged a return to democracy, after <REF C-ENTID="King Gyanendra Bir Bikram" COMMENT="Coref direction is forward" EXT="King Gyanendra Bir Bikram" ID="100" S&#1058;YPE="APNAME">King Gyanendra Bir Bikram</REF> dismissed <REF A-CLASS="Entity-Entity" A-DIR="Backward" A-RELTYPE="Identity" A-RESTYPE="Intra" A-TYPE="Referential" ANT-ID="105" ID="101">the country</REF> 's coalition government and imposed an indefinite state of emergency. </S><S Entail="138-139-143" s_id="2">
I tried using a regular expression with sed to get rid of all the <REF .......> tags, so my desired output would look like this:
</S></TEXT><TEXT><S Entail="142" s_id="0"> Annan urges return to democracy in Nepal</REF></S><S Entail="138-139-142" s_id="1"> UN Secretary General Kofi Annan on Tuesday expressed deep concern over events in Nepal</REF> and urged a return to democracy, after King Gyanendra Bir Bikram</REF> dismissed the country</REF> 's coalition government and imposed an indefinite state of emergency. </S><S Entail="138-139-143" s_id="2">
I had a sed line like this which does not work well :(
sed -r 's/<REF ([A-Za-z]*[-]{0,1}[A-Za-z]*=["].[A-Za-z0-9-]*.{0,1}[A-Za-z0-9-]*.{0,1}[A-Za-z0-9-]*.{0,1}[A-Za-z0-9-]*.{0,1}[A-Za-z0-9-]*.{0,1}[A-Za-z0-9-]*.{0,1}[A-Za-z0-9-]*.{0,1}[A-Za-z0-9-]*)*>//g' input.xml
Any idea how I can do this with Perl? I would really appreciate it :) Thanks

Replies are listed 'Best First'.
Re: problem with removing something in XML file
by marto (Cardinal) on Sep 18, 2009 at 14:28 UTC
      Dear Martin, my problem is that I have to get rid of all the REF tags. How do you think that is possible? Any sample?
Re: problem with removing something in XML file
by Sandy (Curate) on Sep 18, 2009 at 16:47 UTC
    Normally, one should take the advice of previous suggestions before demanding more answers, but... nonetheless...

    Don't know why your regular expression is so complicated.

    Assuming that all <REF > statements are always complete on a single line...

    XML File Before

    </S></TEXT><TEXT><S Entail="142" s_id="0"> Annan urges return to democracy in <REF C-ENTID="Nepal" EXT="Nepal" ID="104" S&#1058;YPE="PROPNAME">Nepal</REF></S> <S Entail="138-139-142" s_id="1"> UN Secretary General Kofi Annan on Tuesday expressed deep concern over events in <REF A-CLASS="No-Reference" A-REFTYPE="Entity" C-ENTID="Nepal" EXT="Nepal" ID="105" S&#1058;YPE="PROPNAME">Nepal</REF> and urged a return to democracy, after <REF C-ENTID="King Gyanendra Bir Bikram" COMMENT="Coref direction is forward" EXT="King Gyanendra Bir Bikram" ID="100" S&#1058;YPE="APNAME"> King Gyanendra Bir Bikram</REF> dismissed <REF A-CLASS="Entity-Entity" A-DIR="Backward" A-RELTYPE="Identity" A-RESTYPE="Intra" A-TYPE="Referential" ANT-ID="105" ID="101"> the country</REF> 's coalition government and imposed an indefinite state of emergency. </S><S Entail="138-139-143" s_id="2">
    perl one-liner (on DOS)
    perl -pibak -e "s/<\/?REF.*?>//ig" junk.txt
    Result:
    </S></TEXT><TEXT><S Entail="142" s_id="0"> Annan urges return to democracy in Nepal</S> <S Entail="138-139-142" s_id="1"> UN Secretary General Kofi Annan on Tuesday expressed deep concern over events in Nepal and urged a return to democracy, after King Gyanendra Bir Bikram dismissed the country 's coalition government and imposed an indefinite state of emergency. </S><S Entail="138-139-143" s_id="2">
    Sandy

    UPDATE: Also assumes that there are no embedded ">" inside the REF tag

Re: problem with removing something in XML file
by graff (Chancellor) on Sep 19, 2009 at 02:01 UTC
    If you are going to remove <REF ...> tags, you really should be removing the </REF> tags too, don't you think?

    And there is a funny thing about your sample xml data: the "T" in the "STYPE" attribute of the REF tag is actually a Cyrillic "T", not an ASCII "T". Is that why you're trying to get rid of the REF tags, because they all got corrupted somehow? (It must have been caused by someone trying to do stream-edits on the XML data...) You could just fix that:

    perl -CSD -pe 'tr{\x{422}}{T}' file.xml > fixed.file.xml
    As mentioned earlier, just removing the tags is pretty simple -- it can be a one-liner on the command line -- if the <REF...> thing is never split up by a line break, but even if it is, you can just run perl in "file-slurp" mode:
    perl -0777 -pe 's{</?REF[^>]*>}{}g' file.xml > noref.file.xml
    It seems like a pretty safe bet that REF tags will never contain a ">" as part of an attribute, so this approach should suffice.
Re: problem with removing something in XML file
by mirod (Canon) on Sep 19, 2009 at 11:14 UTC

    Using XML::Twig, something like this (untested) should work:

    use strict;
    use warnings;
    use XML::Twig;

    XML::Twig->new( twig_roots => { REF => sub { print $_->inner_xml; } },
                    twig_print_outside_roots => 1,
                  )
             ->parsefile( "my_file.xml");