After reading Cultural and Bibliometric Perl by Ignatius Monk I realized that his code would get tripped up if he had to work with... XML documents (you knew this was coming!), such as those you can get from the Text Encoding Initiative.
So here is a small tool, using XML::TiePYX, that strips all markup from a text. (Note: you know you've spent too much time on PerlMonks, and at least one other person has too, when you keep calling that module XML::TyePYX ;--) Optionally you can pass it a tag and it will only keep the text within that tag, so you can strip out useless header information, for example.
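For anyone who has not met PYX: it is a line-oriented notation for XML in which each parsing event sits on its own line and the first character says what kind of event it is. A minimal sketch of the idea, working on hand-written PYX lines rather than on real XML::TiePYX output (the `pyx_text` helper is made up for this illustration, it is not part of the module):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-written PYX lines for <verso n="1">A mi me gusta</verso>.
# One event per line, the first character gives the event type:
#   (name        start tag
#   Aname value  attribute
#   -text        character data
#   )name        end tag
my @pyx = (
    '(verso',
    'An 1',
    '-A mi me gusta',
    ')verso',
);

# pyx_text is a hypothetical helper: it keeps only the character data,
# i.e. the lines starting with '-', minus the marker itself.
sub pyx_text {
    my @text;
    for my $line (@_) {
        push @text, $1 if $line =~ /^-(.*)/;
    }
    return @text;
}

print "$_\n" for pyx_text(@pyx);
```

Stripping markup then comes down to throwing away every line that does not start with a `-`, which is why a simple line-by-line loop is enough.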
Why did I use XML::TiePYX? Because it is fast, works equally well on Unix and Windows, and it has that clever Latin option that outputs the result in latin1 encoding (no, it does _not_ use Lingua::Romana::Perligata). This is probably the default encoding for a lot of texts, at least the ones that Ignatius Monk might come across. A word of warning though: this option only works with XML::Parser 2.27, not with 2.30 (and I know this is a favourite peeve of mine, but I'd rather warn you than have you /msg me that it does not work).
So here it is, with a sample text, in Spanish:
    #!/bin/perl -w

    # strips all tags from an XML file and outputs the result
    # each tag's content is on a separate line
    # the output is in latin1, remove the Latin => 1 option to get UTF-8 output
    # optionally a tag can be passed, only the text within this tag will be output
    # the script also works with input from STDIN
    #
    # usage: strip_markup [-t <tagname>] <file>

    use strict;
    use XML::TiePYX;
    use Getopt::Std;

    my $tag;
    my $in_tag;

    my %opts;
    getopts('t:',\%opts);
    unless( $tag= $opts{t}) { $in_tag= 1; }    # no tag given: output everything

    die "usage: $0 [-t <tagname>] <file>\n" unless @ARGV<=1;

    tie *XML,'XML::TiePYX', $ARGV[0] || \*STDIN, Condense=>0, Latin => 1;

    while( <XML>)
      { # match the whole tag name, so -t verso does not also match <versos>
        if( $tag && m/^\(\Q$tag\E\s*$/) { $in_tag= 1; }  # check for opening $tag
        if( $tag && m/^\)\Q$tag\E\s*$/) { $in_tag= 0; }  # check for closing $tag
        next unless $in_tag;                             # skip unless $in_tag
        next unless s/^-//;                              # skip markup
        next if m/^\\n$/;                                # skip line returns
        next if m/^\s*$/;                                # skip empty lines
        print;                                           # output the rest (text in $tag)
      }
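If you are stuck with an XML::Parser version where the Latin option does not work, one possible workaround is to drop Latin => 1 and convert each line yourself. This is only a sketch under a few assumptions: that your Perl ships the core Encode module (5.8 or later), and that the lines really arrive as raw UTF-8 bytes, which I have not checked against XML::TiePYX itself. The `to_latin1` helper is a made-up name:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(from_to);

# to_latin1 is a hypothetical helper: it takes a string of UTF-8 bytes
# and returns the same text as ISO-8859-1 (latin1) bytes.
sub to_latin1 {
    my ($bytes) = @_;
    # from_to converts its first argument in place
    from_to($bytes, 'UTF-8', 'ISO-8859-1');
    return $bytes;
}

my $utf8   = "espa\xC3\xB1ol\n";    # "español" as UTF-8 bytes
my $latin1 = to_latin1($utf8);

# the n-tilde takes two bytes in UTF-8 and one in latin1
printf "%d bytes in, %d bytes out\n", length $utf8, length $latin1;
```

In the script above that would mean calling something like `to_latin1($_)` just before the final print, instead of relying on the tie option.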
You can try the script with this sample document, using strip_markup cancion.xml or strip_markup -t texto cancion.xml (oh, and my Spanish is really bad so some of the tag names might be completely wrong!):
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <cancion>
      <titulo>A mi me gusta lo blanco</titulo>
      <origen>
        <libro>
          <titulo>Las canciones del pueblo español</titulo>
          <autor>Juan de Aguila</autor>
          <editor>Unión musical española</editor>
        </libro>
        <pagina>67</pagina>
      </origen>
      <texto>
        <parte>
          <verso>A mi me gusta lo blanco,</verso>
          <verso>viva lo blanco, muera lo negro</verso>
          <verso>que lo negro es consa triste</verso>
          <verso>yo soy alegre yo no lo quiero.</verso>
          <verso>A mi me gusta la gaita,</verso>
        </parte>
        <parte>
          <verso>viva la gaita, viva el gaitero</verso>
          <verso>a mi me gusta la gaita</verso>
          <verso>que tenga el fuelle de terciopelo.</verso>
        </parte>
      </texto>
    </cancion>
In reply to strip_markup by mirod