I've got an XML document that represents a book. (No, it's not my book.) Each chapter of the book has scenes, each scene has paragraphs, and each paragraph has text and/or quotes. Here's a rough structure...
<BOOK>
  <CHAPTER>
    <SCENE>
      <PARA>
        <TEXT>The man said,</TEXT>
        <QUOTE>"Hello."</QUOTE>
      </PARA>
      <PARA>
        <TEXT>I was startled. I didn't know there was a man there. </TEXT>
        <QUOTE>"Hello,"</QUOTE>
        <TEXT>I said back to him.</TEXT>
      </PARA>
    </SCENE>
    <SCENE>
      <PARA>...</PARA>
      <PARA>...</PARA>
      <PARA>...</PARA>
    </SCENE>
    ...
  </CHAPTER>
  <CHAPTER>
    ...
  </CHAPTER>
</BOOK>
That's pretty much what my XML document looks like. I'm trying to create a web-based search tool for this document. You can search for specific words in TEXT or QUOTE, in PARA, or in SCENE. When the results come back, each XML tag becomes an HTML tag; <SCENE> becomes <SPAN CLASS="SCENE">, and so on.
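
For illustration, here is a minimal sketch of that tag-to-SPAN mapping. It is written in Python with the standard xml.etree.ElementTree module rather than Perl's XML::Parser, purely to keep the example short, and the function name to_html is made up for this example.

```python
import xml.etree.ElementTree as ET
from html import escape


def to_html(elem):
    """Render an element as <SPAN CLASS="TAG">...</SPAN>, recursing into children."""
    # The XML tag name becomes the CLASS attribute of the SPAN.
    parts = [f'<SPAN CLASS="{elem.tag}">']
    # Escape &, < and > in the text; leave double quotes readable.
    parts.append(escape(elem.text or "", quote=False))
    for child in elem:
        parts.append(to_html(child))
        # Text that follows a child but still belongs to this element.
        parts.append(escape(child.tail or "", quote=False))
    parts.append("</SPAN>")
    return "".join(parts)
```

Calling to_html on a parsed PARA element returns one nested SPAN per XML element, with the text content escaped for HTML.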

The problem I have is the filtering. If you search on the "SCENE" scope and any text inside a SCENE tag matches the words you're looking for, the entire SCENE is displayed. If you search on the "PARA" scope, only the paragraphs that match are displayed, but each one still needs its surrounding SCENE ... /SCENE tags. And if you search on the "QUOTE" scope, only the QUOTE tags that match should be printed, but each inside its proper PARA, and each set of PARAs inside its proper SCENE.
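
To pin down those rules, here is a rough sketch of the non-streaming version: parse the whole document, then prune the tree. It uses Python's standard xml.etree.ElementTree instead of XML::Parser, only for illustration. The names filter_book and prune are hypothetical, and whole-word, case-insensitive matching is an assumption about what "matches" means.

```python
import copy
import re
import xml.etree.ElementTree as ET


def filter_book(book, scope, word):
    """Return a new BOOK with only the matching `scope` elements and their ancestors.

    `scope` is an element name such as "SCENE", "PARA", "QUOTE" or "TEXT".
    Assumption: a match is a whole-word, case-insensitive hit on `word`.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)

    def prune(elem):
        if elem.tag == scope:
            # An element at the search scope is kept whole or dropped whole.
            text = "".join(elem.itertext())
            return copy.deepcopy(elem) if pattern.search(text) else None
        # A container survives only if at least one child survives,
        # so empty SCENEs and PARAs never reach the output.
        kept = [c for c in map(prune, elem) if c is not None]
        if not kept and elem.tag != "BOOK":
            return None
        new = ET.Element(elem.tag, elem.attrib)
        new.extend(kept)
        return new

    return prune(book)
```

The result can be serialized with ET.tostring(result, encoding="unicode"). The BOOK root is always returned, even when nothing matches.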

The real problem, I guess, is how to do this efficiently. I don't want to display any empty SCENEs or PARAs AT ALL (that is, any SCENEs or PARAs that don't have any matching elements). I tried doing it on the fly, streaming the XML output from XML::Parser, but it's more difficult than I imagined.

If you want to see sample output, here goes. If I was searching on the "PARA" scope for "the", I'd expect back:

<BOOK>
  <CHAPTER>
    <SCENE>
      <PARA>
        <TEXT>The man said,</TEXT>
        <QUOTE>"Hello."</QUOTE>
      </PARA>
    </SCENE>
  </CHAPTER>
  ...any other matches...
</BOOK>
If I was searching on the "QUOTE" scope for "hello", I'd expect:
<BOOK>
  <CHAPTER>
    <SCENE>
      <PARA>
        <QUOTE>"Hello."</QUOTE>
      </PARA>
      <PARA>
        <QUOTE>"Hello,"</QUOTE>
      </PARA>
    </SCENE>
  </CHAPTER>
  ...any other matches...
</BOOK>
If I was searching on the "SCENE" scope for "apple", I'd expect:
<BOOK>
  ...any other matches...
</BOOK>
So there's my problem. I'd really like to be able to do it on-the-fly, instead of building up results and then filtering them. If I have to, I'll buffer it to an entire scene's contents (meaning, after I've parsed an entire SCENE element, I'll display the contents if there are any matches), which I have a feeling is what I'll end up having to do.
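
The per-SCENE buffering idea might look roughly like the sketch below. It is Python with the standard library's xml.etree.ElementTree.iterparse, not XML::Parser, so treat it as an outline of the approach rather than a drop-in answer. The name stream_matches is hypothetical, matching is assumed to be whole-word and case-insensitive, and the search scope is assumed to be SCENE or something nested inside a SCENE.

```python
import re
import xml.etree.ElementTree as ET


def stream_matches(source, scope, word):
    """Yield output chunks, filtering each SCENE once it has been fully parsed."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)

    def prune(elem):
        if elem.tag == scope:
            if pattern.search("".join(elem.itertext())):
                elem.tail = None  # drop whitespace that follows the element
                return elem
            return None
        # Keep a container only if something inside it survived.
        kept = [c for c in map(prune, list(elem)) if c is not None]
        if not kept:
            return None
        new = ET.Element(elem.tag)
        new.extend(kept)
        return new

    chapter_open = False  # has <CHAPTER> been emitted for the current chapter?
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "BOOK":
                yield "<BOOK>"
            elif elem.tag == "CHAPTER":
                chapter_open = False
            continue
        # Everything below handles "end" events.
        if elem.tag == "SCENE":
            kept = prune(elem)
            if kept is not None:
                if not chapter_open:
                    # Open the chapter lazily, only when a scene has a match.
                    yield "<CHAPTER>"
                    chapter_open = True
                yield ET.tostring(kept, encoding="unicode")
            elem.clear()  # release the buffered scene
        elif elem.tag == "CHAPTER":
            if chapter_open:
                yield "</CHAPTER>"
            elem.clear()
        elif elem.tag == "BOOK":
            yield "</BOOK>"
```

Each chunk can be written out as soon as it is yielded. The CHAPTER open tag is held back until the first matching SCENE, so chapters with no matches produce no output at all.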
_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

In reply to Difficult XML presentation issue by japhy
