pike has asked for the wisdom of the Perl Monks concerning the following question:

Dear brethren,

I have to add XML markup to a text file; for example, I must mark up dates and numbers. Since dates often contain numbers (as in 4/2/2001), my approach would be to first find all dates, mark them up (or is it markup them?), and then scan the remaining (unmarked) text for numbers.

Specifically, I was thinking of using XML::DOM::Node to first create a node containing all the text and then add nodes for dates and numbers as I find them. In the code snippet below, I assume that I have functions findDate and findNumber that return the text before the date/number, the date/number itself, and the text after it (or undef if there is no date in the text). So I end up with the following code:

    # createMarkup creates markup for dates and numbers in the given text,
    # e.g. 'On Oct. 21, the Dow Jones rose to 10043 points' should become
    # '<mytxt>On <date>Oct. 21</date>, the Dow Jones rose to <number>10043</number> points</mytxt>'
    sub createMarkup {
        my ($text, $doc) = @_;

        # create parent node
        my $node = new XML::DOM::Element ($doc, 'mytxt');

        # markup dates
        my $textNode = $node->addText ($text);
        markupElement ($textNode, $node, \&findDate, 'date');

        # markup numbers
        foreach my $child ($node->getChildNodes ()) {
            next unless $child->isTextNode ();
            my $frag = $child->getNodeValue ();
            markupElement ($child, $node, \&findNumber, 'number');
        }
        return $node;
    }

    sub markupElement {
        my ($textNode, $parent, $rFindFunc, $elemName) = @_;
        my $doc = $parent->getOwnerDocument ();
        die unless $textNode->isTextNode ();
        my $nextNode = $textNode->getNextSibling ();
        my $text = $textNode->getValue ();
        while (my ($before, $elem, $after) = &$rFindFunc ($text)) {
            $textNode->setValue ($before);
            my $elemNode = new XML::DOM::Element ($doc, $elemName);
            $elemNode->setValue ($elem);
            $parent->insertBefore ($elemNode, $nextNode);
            $textNode = $doc->createTextNode ($after);
            $parent->insertBefore ($textNode, $nextNode);
            $text = $after;
        }
    }

Is there a more elegant way to do this? And are XML::DOM::Node and its subclasses the right thing to use? Or what should I do? In reality I have about 20 different tags to add to the text, so proposals should not rely on finding just two entities as shown in the example.

pike
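
The question describes findDate and findNumber only by their contract, so here is a minimal sketch of what such a finder could look like. The date pattern is a deliberately crude assumption (abbreviated month, optional dot, day number), not the poster's actual code. One detail matters for a loop like the one in markupElement: a list assignment used as a while condition counts the elements on its right-hand side, so a finder that returns undef (a one-element list) would keep the loop running forever. Returning an empty list avoids that.

```perl
use strict;
use warnings;

# Hypothetical finder: splits $text around the first date-like match.
# Returns (before, match, after), or an empty list when nothing matches.
sub findDate {
    my ($text) = @_;

    # Crude illustrative pattern: abbreviated month, optional dot, day number.
    if ($text =~ /\b((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.? \d{1,2})\b/) {
        # @- and @+ hold the start and end offsets of capture group 1.
        return (substr($text, 0, $-[1]), $1, substr($text, $+[1]));
    }
    return;    # empty list, so a while loop over the result terminates
}

my $text = 'On Oct. 21, the Dow Jones rose to 10043 points';
while (my ($before, $date, $after) = findDate($text)) {
    print "before=[$before] date=[$date] after=[$after]\n";
    $text = $after;    # keep scanning the remainder
}
```

A findNumber with the same contract would only differ in its pattern.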

Re: Building an XML File from text
by mirod (Canon) on Nov 20, 2001 at 18:33 UTC

    The main problem I see here is carving a regexp that will reliably catch dates. The rest looks OK.

    For fun, here is how I would write it with... XML::Twig (surprise surprise! ;--). Note that it is much easier to mark the first thing to mark (dates here) than the following ones. I should probably add an option to ignore some tags (John M Dlugosz suggested this). Also, if you want to use XML::DOM you can use XML::DOM::Twig, which implements a lot of XML::Twig methods over XML::DOM, and just cut'n paste the mark method from XML::Twig.

    #!/bin/perl -w
    use strict;
    use XML::Twig;

    # create the regexp for date, this should be improved
    my $month = qr/(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.?)/;
    my $day   = qr/(?:(?:[0-2]?[0-9]|30|31)(?:st|nd|th)?)/;
    my $year  = qr/(?:,?\s*\d+)?/;
    my $date  = qr/($month $day$year)/;

    # this could probably be improved too!
    my $number= qr/(\d{2,})/;

    while( <DATA>) {
        chomp;

        # create the un-tagged XML
        my $t= XML::Twig->new();
        # XML::Parser::Expat::xml_escape just replaces & by &amp;,
        # < by &lt; etc...
        $t->parse( "<mytxt>" . XML::Parser::Expat::xml_escape( 1, $_) . "</mytxt>");

        # mark the dates
        $t->root->mark( $date, 'date');

        # mark the numbers
        foreach my $elt ($t->descendants( '#PCDATA')) {
            # skip if in date
            next if( $elt->in_context( 'date'));
            $elt->mark( $number, 'number');
        }

        # output
        $t->print;
        print "\n";
    }

    __DATA__
    On Oct. 21, the Dow Jones rose to 10043 points
    On May 1st 2001, 12 people were working in France

    Update: cacharbe is right, the text is probably not "XML-safe", so I added the XML escape when parsing $_.

Re: Building an XML File from text
by cacharbe (Curate) on Nov 20, 2001 at 18:26 UTC
    What you're talking about isn't really XML; it's more of a meta tag insertion. If you were to mark up a small example by hand and try to load it into a parser (any XML parser that checks for well-formed XML, really), I have a feeling you'd be a little let down by the results.

    This document from XML.com might be a good place to start, as well as its parent.

    C-.

Re: Building an XML File from text
by atlantageek (Monk) on Nov 20, 2001 at 18:36 UTC
    I am thinking that this is not the best approach. The XML packages tend to be more for taking structured data and putting it in an XML document (i.e. databases or values of variables/hashes/arrays). To work with unstructured data, regular expressions seem like the best bet. After loading the text into a variable you do some substitutions; numbers are easy:

    $text =~ s/^(|.*\s)(\d+)(\s.*|)$/$1<number>$2<\/number>$3/;
    # The expression looks like the following:
    # beginning of line followed by either nothing or at least
    # one space which neighbors a set of digits followed by
    # either nothing or at least a space and the end of the line

    Dates would be done with a similar approach, replacing \d+ with whatever a date looks like. I think there are some loose date modules you could steal some REs from.
    ----
    I always wanted to be somebody... I guess I should have been more specific.
      The reason why I don't want to use regexes here is that things like '23' can occur within a date or as 'bare' number. If I insert markup tags as you propose here, I will end up with <date>Oct. <number>23</number></date> rather than <date>Oct. 23</date>.

      pike
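
      The nesting problem described above is easy to reproduce. The following is a small sketch using two illustrative patterns (assumptions for the demo, not anyone's production regexes): dates are marked first, then every run of digits, including the digits that are already inside a date element.

```perl
use strict;
use warnings;

# Naive two-pass markup: dates first, then every run of digits.
# The second pass cannot tell that some digits already sit inside <date>.
sub naive_markup {
    my ($text) = @_;
    $text =~ s{((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.? \d{1,2})}{<date>$1</date>}g;
    $text =~ s{(\d+)}{<number>$1</number>}g;
    return $text;
}

print naive_markup('On Oct. 23, 12 people left'), "\n";
```

      The date comes out as <date>Oct. <number>23</number></date>, which is exactly the unwanted nesting.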

        You can still use regexps actually, you just have to "neutralize" the dates once you've marked them, by turning them into some kind of xml entity for example.

        Here is the code:

        #!/bin/perl -w
        use strict;

        # create the regexp for date, this should be improved
        my $month = qr/(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.?)/;
        my $day   = qr/(?:(?:[0-2]?[0-9]|30|31)(?:st|nd|th)?)/;
        my $year  = qr/(?:,?\s*\d+)?/;
        my $date  = qr/($month $day$year)/;

        # this could probably be improved too!
        my $number= qr/(\d{2,})/;

        my %replace;
        my $i=0;

        while( <DATA>) {
            chomp;

            # entitize special characters, those 2 are sufficient here
            s{&}{&amp;}g;
            s{<}{&lt;}g;

            # (?<!&) is a negative lookbehind: skip anything right after an &
            s{(?<!&)$date}  {$i++; $replace{$i}="<date>$1</date>";     "&$i;"}eg; # replace the dates
            s{(?<!&)$number}{$i++; $replace{$i}="<number>$1</number>"; "&$i;"}eg; # replace the numbers
            s{&(\d+);}{$replace{$1}}g;                                            # replace the &n;

            print "<mytext>$_</mytext>\n";
        }

        __DATA__
        On Oct. 21, the Dow Jones rose to 10043 points
        On May 1st 2001, 12 people were working in France
        On Nov. 20, I can still tell that 123 < 234
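
        Since the original question mentions about 20 tags, the "neutralize, then restore" idea above can be driven by an ordered list of rules instead of one substitution per tag. The following is only a sketch under stated assumptions: the @rules table, the markup function and both patterns are hypothetical, and the placeholders are single characters from the Unicode private-use range (U+E000 to U+F8FF), which assumes the input never contains such characters. Digit-free placeholders mean a later number rule can never match inside them.

```perl
use strict;
use warnings;

# Ordered rules: earlier entries win, so dates are protected before numbers.
# Both patterns are illustrative stand-ins; patterns must not use capture groups.
my @rules = (
    [ date   => qr/\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.? \d{1,2}(?:st|nd|rd|th)?(?:,? \d{4})?/ ],
    [ number => qr/\d+/ ],
);

sub markup {
    my ($text) = @_;
    my @stash;    # finished markup, indexed by placeholder offset

    # Escape XML special characters before any tags are added.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    for my $rule (@rules) {
        my ($tag, $re) = @$rule;
        # Replace each match by one private-use character that later rules
        # cannot match. This sketch supports at most 6400 matches per call.
        $text =~ s{($re)}{
            push @stash, "<$tag>$1</$tag>";
            chr(0xE000 + $#stash)
        }ge;
    }

    # Swap every placeholder back for the markup it stands for.
    $text =~ s{([\x{E000}-\x{F8FF}])}{$stash[ord($1) - 0xE000]}ge;
    return "<mytxt>$text</mytxt>";
}

print markup($_), "\n" for (
    'On Oct. 21, the Dow Jones rose to 10043 points',
    'On May 1st 2001, 12 people were working in France',
    'On Nov. 20, I can still tell that 123 < 234',
);
```

        Adding a tag then means adding one line to @rules, placed before any rule whose pattern could match inside it.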