
After reading Cultural and Bibliometric Perl by Ignatius Monk I realized that his code would trip up if it had to work with... XML documents (you knew this was coming!), such as those you can get from the Text Encoding Initiative.

So here is a small tool that strips all markup from a text, using XML::TiePYX (note: you know you've spent too much time on PerlMonks (and at least one other person has!) when you keep calling that module XML::TyePYX ;--). Optionally you can pass it a tag and it will only keep the text within that tag, so you can strip out useless header information, for example.

Why did I use XML::TiePYX? Because it is fast, works equally well on Unix and Windows, and has that clever Latin option that outputs the result in latin1 encoding (no, it does _not_ use Lingua::Romana::Perligata). Latin1 is probably the default encoding for a lot of texts, at least those that Ignatius Monk might come across. A word of warning though: this option only works with XML::Parser 2.27, not with 2.30 (I know this is a favourite peeve of mine, but I'd rather warn you than have you /msg me that it does not work).

So here it is, with a sample text in Spanish:

#!/bin/perl -w

# strips all tags from an XML file and outputs the result
# each tag's content is on a separate line
# the output is in latin1, remove the Latin => 1 option to get UTF-8 output
# optionally a tag can be passed, only the text within this tag will be output
# the script also works with input from STDIN
#
# usage: strip_markup [-t <tagname>] <file>

use strict;

use XML::TiePYX;
use Getopt::Std;

my $tag;
my $in_tag;

my %opts;
getopts('t:',\%opts);
unless( $tag= $opts{t}) { $in_tag= 1; }

die "usage: $0 [-t <tagname>] <file>\n" unless @ARGV<=1;

tie *XML,'XML::TiePYX', $ARGV[0] || \*STDIN, Condense=>0, Latin => 1;

while( <XML>)
  { if( $tag && m/^\(\Q$tag\E$/) { $in_tag= 1; }   # check for opening $tag (exact name)
    if( $tag && m/^\)\Q$tag\E$/) { $in_tag= 0; }   # check for closing $tag (exact name)
    next unless $in_tag;                           # skip unless $in_tag
    next unless s/^-//;                            # skip markup
    next if m/^\\n$/;                              # skip line returns
    next if m/^\s*$/;                              # skip empty lines
    print;                                         # output the rest (text in $tag)
  }
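If you want to see what the loop above is doing without installing XML::TiePYX, here is a rough sketch that applies the same filter to a hand-written list of PYX lines. In PYX notation a line starting with `(` opens an element, `)` closes it, and `-` carries text, with line returns written as a literal `\n`. The `strip_pyx` sub is a hypothetical helper of mine, not part of the module, and the PYX lines are my assumption of what the parser would emit for a fragment of the sample document, not captured output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::TiePYX): applies the same filter as
# the script's loop to a list of PYX lines and returns the text lines kept.
sub strip_pyx {
    my ($tag, @pyx) = @_;
    my $in_tag = defined $tag ? 0 : 1;     # no tag given: keep all text
    my @text;
    for my $line (@pyx) {
        if (defined $tag) {
            $in_tag = 1 if $line =~ m/^\(\Q$tag\E$/;   # "(tag" opens the element
            $in_tag = 0 if $line =~ m/^\)\Q$tag\E$/;   # ")tag" closes it
        }
        next unless $in_tag;               # outside the wanted tag
        next unless $line =~ m/^-(.*)$/;   # only "-text" lines carry text
        my $text = $1;
        next if $text =~ m/^\\n$/;         # skip line returns (literal \n)
        next if $text =~ m/^\s*$/;         # skip empty lines
        push @text, $text;
    }
    return @text;
}

# Hand-written PYX for a fragment of the sample document (an assumption,
# not real parser output).
my @pyx = (
    '(cancion',
    '-\n',
    '(titulo',
    '-A mi me gusta lo blanco',
    ')titulo',
    '-\n',
    '(texto',
    '(verso',
    '-viva lo blanco, muera lo negro',
    ')verso',
    ')texto',
    ')cancion',
);

print "$_\n" for strip_pyx('texto', @pyx);
```

Called with `'texto'` it keeps only the verse line; called with `undef` as the tag it keeps the title as well, which mirrors running the script without `-t`.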

You can try it with this sample document, using strip_markup cancion.xml or strip_markup -t texto cancion.xml (oh, and my Spanish is really bad so some of the tag names might be completely wrong!):

<?xml version="1.0" encoding="ISO-8859-1"?>
<cancion>
  <titulo>A mi me gusta lo blanco</titulo>
  <origen>
    <libro>
      <titulo>Las canciones del pueblo español</titulo>
      <autor>Juan de Aguila</autor>
      <editor>Unión musical española</editor>
    </libro>
    <pagina>67</pagina>
  </origen>
  <texto>
    <parte>
      <verso>A mi me gusta lo blanco,</verso>
      <verso>viva lo blanco, muera lo negro</verso>
      <verso>que lo negro es cosa triste</verso>
      <verso>yo soy alegre yo no lo quiero.</verso>
      <verso>A mi me gusta la gaita,</verso>
    </parte>
    <parte>
      <verso>viva la gaita, viva el gaitero</verso>
      <verso>a mi me gusta la gaita</verso>
      <verso>que tenga el fuelle de terciopelo.</verso>
    </parte>
  </texto>
</cancion>

In reply to strip_markup by mirod
