monger has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks, acolytes, etc. I have an XML-related question for the Monks today. I'm trying to debug a module that parses an XML file and creates a text file (called an imp file here). Its job is to take documents in the ThML format from ccel.org and process them for use in the Sword application (www.crosswire.org/sword). The imp file is used to import the document into Sword's format. It contains each section of the document on one line. Here's the LD:

When you parse a valid XML file, it works fine under most circumstances. However, the problem is with the <scripRef> tag. When you have several in sequence, you end up losing the closing tag on all of the instances. For instance, you might have a string in the XML file like this:

<scripRef...>1 John 1:1</scripRef>, <scripRef...>John 3:16</scripRef>, ...

When it's processed to the imp file, you end up with something like this:

<scripRef...>1 John 1:1, <scripRef...>John 3:16, ...

which breaks things.

So, I've looked at the script and the module used here. The code appears to be in the module, but I can't figure out why it's missing things. So, below find the code. If anyone has any ideas on how I could repair this, it would be great. If you need more code, let me know.

NB - I am not the creator or maintainer. I am using this largely for personal stuff.

sub parseStart {
    my $expat = shift;
    my $tag   = shift;
    my %attr  = @_;
    SWITCH: for ($tag) {
        /^DC.(.*)$/      && do { saveDC($1); last SWITCH; };
        /div(\d+)/       && do { start_section($1, $attr{title}); last SWITCH; };
        /^(p|h\d+)$/     && do { passthrough_start($1); last SWITCH; };
        /^(verse)$/      && do { passthrough_start('p'); last SWITCH; };
        /^(span)$/       && do { passthrough_start('b'); last SWITCH; };
        /^(l)$/          && do { $sectionData{$currentDepth} .= '&nbsp;&nbsp;'; last SWITCH; };
        /^(scripRef)$/   && do { $sectionData{$currentDepth} .= "<scripRef passage=\"$attr{passage}\">"; last SWITCH; };
        /^(note|added)$/ && do { ignore(); last SWITCH; };
    }
}

sub parseEnd {
    my ($expat, $tag) = @_;
    SWITCH: for ($tag) {
        /^DC.(.*)$/             && do { end_saveDC($1); last SWITCH; };
        /div(\d+)/              && do { end_section($1); last SWITCH; };
        /^(p|h\d+|scripRef)$/   && do { passthrough_end($1); last SWITCH; };
        /^(verse)$/             && do { passthrough_end('p'); last SWITCH; };
        /^(span)$/              && do { passthrough_end('b'); last SWITCH; };
        /^(br|l)$/              && do { $sectionData{$currentDepth} .= "<br />"; last SWITCH; };
        /^(note|added)$/        && do { unignore(); last SWITCH; };
    }
}

What is happening here is that these two separate subs are gathering the opening tags and stripping out some unneeded info, then finding the closing tag. I can't tell where to start here. Thanks,

Monger

Monger +++++++++++++++++++++++++ Munging Perl on the side

Replies are listed 'Best First'.
Re: Script Misses Close Closing Tags
by Happy-the-monk (Canon) on Mar 15, 2004 at 18:41 UTC

    I can't tell where to start here.

    Nor could I, as the passthrough_...-subs aren't anywhere in your code.

    Could a CPAN Module like XML::Simple help you ease the pain of parsing it with that legacy code?

    Edit:
    Meditating over the code I noticed that the start/end routines seem to deal with <scripRef> differently. Might that be a hint to the source of the error omitting the closing tag?

    Sören

      Here's the passthrough code snip:
      sub passthrough_start {
          return if ($ignore);
          my ($tag) = @_;
          $sectionData{$currentDepth} .= "<$tag>";
      }

      sub passthrough_end {
          return if ($ignore);
          my ($tag) = @_;
          $sectionData{$currentDepth} .= "</$tag>";
      }

      One of my problems is not understanding the SWITCH in the module. Also, I've never written any modules, or looked deeply at existing ones that have something like that.
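For readers equally unfamiliar with the SWITCH construct, here is a minimal, self-contained sketch of the idiom. The sub name classify and the strings it returns are made up for illustration and are not part of the module; only the control flow mirrors parseStart.

```perl
use strict;
use warnings;

# Hypothetical dispatcher, written in the same style as parseStart.
sub classify {
    my ($tag) = @_;
    my $result = 'unhandled';

    # "for" runs its body once with $_ aliased to $tag, so each bare
    # /regex/ below is matched against $tag. SWITCH: is just a loop label.
    SWITCH: for ($tag) {
        # First matching pattern runs its do-block; "last SWITCH" then
        # leaves the loop so no later pattern is tried.
        /^(p|h\d+)$/     && do { $result = "passthrough <$1>";    last SWITCH; };
        /^(scripRef)$/   && do { $result = "scripture reference"; last SWITCH; };
        /^(note|added)$/ && do { $result = "ignored";             last SWITCH; };
    }
    return $result;
}

print classify($_), "\n" for qw(h2 scripRef note table);
```

A tag that matches none of the patterns simply falls out of the loop, which is why unknown tags are silently dropped by the module's handlers.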

      Also, the tags should be dealt with differently. In the original XML file, the opening scripRef tag can have much more information in it. The closing one is generally just </scripRef>.

      Thanks, Monger

      Monger +++++++++++++++++++++++++ Munging Perl on the side
        There's not enough information here to tell for sure what's wrong, but I notice that the closing tag for scripRef checks $ignore (in passthrough_end), while the opening tag doesn't. The SWITCH idiom is explained in perlsyn.
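The asymmetry can be reproduced without the parser. What follows is only a sketch under stated assumptions: the bodies of ignore() and unignore() are not shown in the thread, so the versions below are guesses that just toggle $ignore, and the names start_as_posted, start_guarded, end_scripref and run are invented for the demonstration. Whether the affected scripRef elements really sit inside a note or added element in the ThML source is also unverified.

```perl
use strict;
use warnings;

my ($ignore, $out) = (0, '');

# Assumed behaviour; the real ignore()/unignore() are not in the thread.
sub ignore   { $ignore = 1 }
sub unignore { $ignore = 0 }

# Opening handler as posted: appends the tag without consulting $ignore.
sub start_as_posted {
    my (%attr) = @_;
    $out .= qq{<scripRef passage="$attr{passage}">};
}

# Variant that applies the same guard passthrough_end uses.
sub start_guarded {
    my (%attr) = @_;
    return if $ignore;
    $out .= qq{<scripRef passage="$attr{passage}">};
}

# Closing handler, equivalent to passthrough_end('scripRef').
sub end_scripref {
    return if $ignore;
    $out .= '</scripRef>';
}

# Replays the events for <note><scripRef ...>...</scripRef></note>.
sub run {
    my ($start) = @_;
    ($ignore, $out) = (0, '');
    ignore();                            # <note>
    $start->(passage => 'John 3:16');    # <scripRef> inside the note
    end_scripref();                      # </scripRef> skipped: $ignore is set
    unignore();                          # </note>
    return $out;
}

print "as posted: ", run(\&start_as_posted), "\n";
print "guarded:   '", run(\&start_guarded), "'\n";
```

With the posted handler the opening tag is emitted and its closing tag is not, which matches the symptom. With the guard, both are suppressed together. If references inside notes should be kept instead, the alternative would be to append the closing tag unconditionally in parseEnd.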