japhy has asked for the wisdom of the Perl Monks concerning the following question:

I've got an XML document that represents a book. (No, it's not my book.) Each chapter of the book has scenes, each scene has paragraphs, and each paragraph has text and/or quotes. Here's a rough structure...
<BOOK>
  <CHAPTER>
    <SCENE>
      <PARA>
        <TEXT>The man said,</TEXT>
        <QUOTE>"Hello."</QUOTE>
      </PARA>
      <PARA>
        <TEXT>I was startled. I didn't know there was a man there. </TEXT>
        <QUOTE>"Hello,"</QUOTE>
        <TEXT>I said back to him.</TEXT>
      </PARA>
    </SCENE>
    <SCENE>
      <PARA>...</PARA>
      <PARA>...</PARA>
      <PARA>...</PARA>
    </SCENE>
    ...
  </CHAPTER>
  <CHAPTER>
    ...
  </CHAPTER>
</BOOK>
That's pretty much what my XML document looks like. I'm trying to create a web-based search tool for this document. You can search for specific words in TEXT or QUOTE, in PARA, or in SCENE. When the results come back, each XML tag becomes an HTML tag; <SCENE> becomes <SPAN CLASS="SCENE">, and so on.

The problem I have is the filtering. If you search on the "SCENE" scope and any text inside a SCENE tag matches the words you're looking for, the entire SCENE tag is displayed. If you're searching on the "PARA" scope, only those paragraphs in a scene that match get displayed, but the surrounding SCENE ... /SCENE tags still need to be displayed. And if you're searching on the "QUOTE" scope, only those QUOTE tags that match should get printed, but each in its proper PARA group, and each set of PARAs in its proper SCENE.

The problem I guess is how to do this efficiently. I don't want to display AT ALL any empty SCENEs or PARAs (that is, any SCENEs or PARAs that don't have any matching elements). I tried doing it on the fly, streaming the XML output from XML::Parser, but it's more difficult than I imagined.

If you want to see sample output, here goes. If I was searching on the "PARA" scope for "the", I'd expect back:

<BOOK>
  <CHAPTER>
    <SCENE>
      <PARA>
        <TEXT>The man said,</TEXT>
        <QUOTE>"Hello."</QUOTE>
      </PARA>
    </SCENE>
  </CHAPTER>
  ...any other matches...
</BOOK>
If I was searching on the "QUOTE" scope for "hello", I'd expect:
<BOOK>
  <CHAPTER>
    <SCENE>
      <PARA>
        <QUOTE>"Hello."</QUOTE>
      </PARA>
      <PARA>
        <QUOTE>"Hello,"</QUOTE>
      </PARA>
    </SCENE>
  </CHAPTER>
  ...any other matches...
</BOOK>
If I was searching on the "SCENE" scope for "apple", I'd expect:
<BOOK> ...any other matches... </BOOK>
So there's my problem. I'd really like to be able to do it on-the-fly, instead of building up results and then filtering them. If I have to, I'll buffer it to an entire scene's contents (meaning, after I've parsed an entire SCENE element, I'll display the contents if there are any matches), which I have a feeling is what I'll end up having to do.
_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.
Re: Difficult XML presentation issue
by diotalevi (Canon) on Feb 03, 2004 at 05:38 UTC

    I'd see about writing an XPath expression to match and run that against my XML with XSLT.

    Scene level search

    /BOOK/CHAPTER/SCENE[contains( ., 'match-text' )]

    Paragraph level search

    /BOOK/CHAPTER/SCENE/PARA[contains( ., 'match-text' )]
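
To make the predicate idea concrete, here is a minimal sketch in Python rather than XSLT. XPath's contains() is a case-sensitive substring test, so on the sample document it would miss "The" and would also hit "there" when searching for "the". The sketch therefore swaps in a case-insensitive whole-word regex, which is an assumption about the matching the question wants. The standard library's ElementTree does not support contains() in its path syntax, so the filtering happens in Python; `select` and `SAMPLE` are hypothetical names used only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down version of the book from the question (assumed sample data).
SAMPLE = """<BOOK><CHAPTER>
<SCENE>
<PARA><TEXT>The man said,</TEXT><QUOTE>"Hello."</QUOTE></PARA>
<PARA><TEXT>I didn't know there was a man there.</TEXT></PARA>
</SCENE>
<SCENE>
<PARA><TEXT>Nothing to see.</TEXT></PARA>
</SCENE>
</CHAPTER></BOOK>"""


def select(root, path, word):
    """Return the elements at `path` whose text content contains `word`.

    Plays the role of path[contains(., word)], except that the test is
    a case-insensitive whole-word match instead of a raw substring.
    """
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    return [el for el in root.findall(path)
            # Join with a space so words in adjacent tags are not glued.
            if pattern.search(" ".join(el.itertext()))]


root = ET.fromstring(SAMPLE)                         # root is <BOOK>
scenes = select(root, "CHAPTER/SCENE", "hello")      # scene-level search
paras = select(root, "CHAPTER/SCENE/PARA", "the")    # paragraph-level search
print(len(scenes), len(paras))
```

Selecting the matches is only half the job: the surrounding SCENE and CHAPTER wrappers still have to be rebuilt around whatever was selected.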
Re: Difficult XML presentation issue
by stvn (Monsignor) on Feb 03, 2004 at 02:58 UTC

    In the past, I have found that XSLT can be an extremely powerful searching and filtering tool, as well as a good presentation/transformation tool. I cannot say for sure that it would work in your situation (because I am not sure I understand your problem 100%), but it might be worth looking into. You may be able to kill two birds with one stone here too, using the XSLT to do both the searching and the HTML conversion. The trick with XSLT is to think declaratively, but given what I have seen you do with regexes in the past, I doubt you will have any trouble with it. There is always XPath & XQL (not sure what they are calling them now), which can do some really interesting stuff as well.

    I always like to try and use other XML-based technologies with XML; they tend to be a good fit, and they up the project buzzword quotient significantly too.

    -stvn
      I have noticed that transformation engines such as Xalan can be memory hungry when large conversions are taking place. The solution to this is to divide the XML content up before transforming.

      The transformation that caught me out was on an XML file that was 200 MB in size. Assuming that the book isn't too large (e.g. 2 MB of XML), you will have no problem using XSLT. W3Schools has a nice XSLT tutorial.

        inman

        Most DOM-based parsers and transformation engines can be HUGE memory hogs, since they need to load the whole document into memory. I have found, though, that SAX-based parsers and transformation engines are much less hoggish and many times faster. Being stream-based, they are also ideal for online/real-time web transformations.

        -stvn
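
A rough sketch of the stream-based idea, combined with the "buffer one SCENE at a time" fallback from the question. It is written in Python with the standard library's incremental parser rather than a Perl SAX module, and the names `matching_scenes` and `SAMPLE` are hypothetical. Only the SCENE scope is handled, and the matching rule (case-insensitive whole word) is an assumption.

```python
import io
import re
import xml.etree.ElementTree as ET

# Cut-down book with one matching and one non-matching scene (assumed data).
SAMPLE = (
    '<BOOK><CHAPTER>'
    '<SCENE><PARA><TEXT>The man said,</TEXT>'
    '<QUOTE>"Hello."</QUOTE></PARA></SCENE>'
    '<SCENE><PARA><TEXT>Nothing to see.</TEXT></PARA></SCENE>'
    '</CHAPTER></BOOK>'
)


def matching_scenes(source, word):
    """Yield the serialized XML of each SCENE that contains `word`.

    The document is parsed incrementally; a scene is inspected once its
    closing tag has been read and is emptied right afterwards, so only
    one scene's contents are buffered at any time.
    """
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "SCENE":
            continue
        if pattern.search(" ".join(elem.itertext())):
            # Serialize before clearing, so the caller gets a stable copy.
            yield ET.tostring(elem, encoding="unicode").strip()
        elem.clear()  # drop this scene's children before reading the next


hits = list(matching_scenes(io.BytesIO(SAMPLE.encode("utf-8")), "hello"))
for scene_xml in hits:
    print(scene_xml)
```

A web handler could wrap each yielded scene in the enclosing BOOK and CHAPTER tags as it goes. Emitting a CHAPTER wrapper only when it has at least one matching scene means deferring the opening tag until the first hit arrives.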
Re: Difficult XML presentation issue
by Fletch (Bishop) on Feb 03, 2004 at 01:58 UTC

    Offhand I'd look into XML::Twig. Set it up to look down into SCENE elements to whatever level and grep through the contents for the search term. If it doesn't have anything matching remove it from the tree.
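
The prune-what-doesn't-match approach can be sketched as follows. This is Python with the standard library's ElementTree, not XML::Twig, so none of the calls below are XML::Twig's API; `prune`, `search`, and `SAMPLE` are hypothetical names. It assumes the scope is given as a tag name (SCENE, PARA, QUOTE, or TEXT) and that a match means a case-insensitive whole word.

```python
import re
import xml.etree.ElementTree as ET

# The two paragraphs from the question's example (assumed sample data).
SAMPLE = (
    '<BOOK><CHAPTER><SCENE>'
    '<PARA><TEXT>The man said,</TEXT><QUOTE>"Hello."</QUOTE></PARA>'
    '<PARA><TEXT>I was startled. I didn\'t know there was a man there. </TEXT>'
    '<QUOTE>"Hello,"</QUOTE><TEXT>I said back to him.</TEXT></PARA>'
    '</SCENE></CHAPTER></BOOK>'
)


def prune(el, scope, pattern):
    """Return True if `el` should stay in the tree.

    An element at the search scope stays, with all of its children, when
    its text matches. Any other element stays only if at least one child
    survives, which removes empty SCENEs, PARAs, and CHAPTERs, as well as
    leaves that sit beside the scope tag (e.g. TEXT in a QUOTE search).
    """
    if el.tag == scope:
        return bool(pattern.search(" ".join(el.itertext())))
    for child in list(el):  # iterate over a copy, since children are removed
        if not prune(child, scope, pattern):
            el.remove(child)
    return len(el) > 0


def search(xml_text, scope, word):
    """Parse the book, prune it for `word` at `scope`, return the root."""
    root = ET.fromstring(xml_text)
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    prune(root, scope, pattern)  # the root itself is always kept
    return root


para_hits = ET.tostring(search(SAMPLE, "PARA", "the"), encoding="unicode")
quote_hits = ET.tostring(search(SAMPLE, "QUOTE", "hello"), encoding="unicode")
no_hits = search(SAMPLE, "SCENE", "apple")
print(para_hits)
print(quote_hits)
```

This builds the whole tree first, which is the opposite of on-the-fly processing. For a single book that is probably acceptable, and the same `prune` function could be applied to one buffered SCENE at a time if memory becomes a concern.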
