madbee has asked for the wisdom of the Perl Monks concerning the following question:
Greetings Perl Monks!
I have been tasked with developing a search-and-extract process using Perl and XML. It is not my forte, but thanks to you and the perldocs, I am able to manage. For standard XML like this, my process worked like a charm. (I tried both approaches, XML::LibXML and Anonymous Monk's XPath approach, and got the expected results.) The issue is with heterogeneous XML files, where the structure of each file is different and the path to the content I have to search for and extract is unknown.
<root>
  <part>
    <sect>
      <toc>1.1 Design Purpose...</toc>
    </sect>
    <sect>
      <sect>
        <header2>Purpose</header2>
        <tag>3.3 Design Purpose</tag>
        <tag> Design purpose and description </tag>
        <tag>This is a design XZY document for Project </tag>
        <tag>design specification details </tag>
        <tag>3.4 application purpose</tag>
        <tag> app details </tag>
        <tag> more app details</tag>
      </sect>
    </sect>
  </part>
</root>
Given an XML document like the one above, say I have to search for and extract the section "Design Purpose".
1. I only know that every document definitely has root and part tags. The structure of every document is different from the others; perhaps 2 out of 10 docs have a similar structure. Unless I manually review each one, I wouldn't know which are similar and which are not. There are hundreds of docs that need to be processed.
2. Content in each document is nested under multiple nodes. The complete or partial path, or even the nodes that hold the content I am looking for, is unknown. It can be as shown in the example or in any other form.
3. Content is replicated in multiple sections, including the TOC and Bookmarks. Ignoring those may be easy, but if the content is replicated in sections other than the TOC and Bookmarks, I need to identify and extract the exact section.
4. I need to extract only the child nodes belonging to the section I am looking for, i.e., only the nodes of "Design Purpose" and nothing from the "application purpose" heading onward.
Without knowing where in the document that section is, or which nodes and tags hold it (i.e., without knowing the partial or full path to the content), is it possible to develop a generic search-and-extract process using Perl and XML?
Will something like this work?
1. Search for the string and get its node. So, if I'm searching for "Design Purpose", find its <tag> element.
2. Get all the parent nodes up to the root.
3. Build the path dynamically. Using the constructed path (say, //root//part//sect//tag from the above example), extract the child elements using XPath or XML::LibXML.
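To make the three steps concrete, here is a rough sketch of what I have in mind, using XML::LibXML. It rests on several assumptions of mine that may not hold for the real files: the sample is a repaired, shortened copy of the one above; the sub name extract_section is made up; a heading is assumed to be an element whose own text is a section number followed by the title; and the section is assumed to end at the next sibling that starts with a dotted section number. It uses nodePath() to get the path of the hit instead of walking the parents by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample, modelled on the document above (repaired so it parses).
my $xml = <<'XML';
<root>
  <part>
    <sect><toc>1.1 Design Purpose...</toc></sect>
    <sect>
      <sect>
        <header2>Purpose</header2>
        <tag>3.3 Design Purpose</tag>
        <tag> Design purpose and description </tag>
        <tag>This is a design XZY document for Project </tag>
        <tag>design specification details </tag>
        <tag>3.4 application purpose</tag>
        <tag> app details </tag>
      </sect>
    </sect>
  </part>
</root>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# extract_section is a made-up helper name, not part of XML::LibXML.
# It returns one hash per matching heading: the heading's path and the
# text of the sibling elements that follow it, up to the next heading.
sub extract_section {
    my ($doc, $title) = @_;

    # Assumption: any later heading starts with a dotted number such as "3.4 ".
    my $next_heading = qr/^\s*\d+(?:\.\d+)+\s/;

    my @sections;
    # Step 1: look at every element, at any depth, that has text of its own.
    for my $node ($doc->findnodes('//*[text()]')) {
        my $own = join '', map { $_->nodeValue } $node->findnodes('./text()');

        # Assumption: the heading is "<section number> <title>" and nothing else,
        # which also skips TOC lines such as "1.1 Design Purpose...".
        next unless $own =~ /^\s*[\d.]+\s+\Q$title\E\s*$/i;

        # Requirement 4: collect following siblings until the next heading.
        my @body;
        for my $sib ($node->findnodes('following-sibling::*')) {
            my $text = $sib->textContent;
            last if $text =~ $next_heading;
            $text =~ s/^\s+|\s+$//g;
            push @body, $text;
        }

        # Heuristic: a TOC or bookmark entry has no body after it, so skip it.
        next unless @body;

        # Steps 2 and 3: nodePath() builds the path from the root for us.
        push @sections, { path => $node->nodePath(), body => \@body };
    }
    return @sections;
}

for my $section (extract_section($doc, 'Design Purpose')) {
    print "Found at $section->{path}\n";
    print "  $_\n" for @{ $section->{body} };
}
```

This only covers the case where the section body sits in sibling elements right after the heading, as in my example. For files where the body is nested differently, the inner loop would need another rule.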
Am I on the right track with this? Any pointers to how this can be done?
Could this task be done more easily with Perl regex parsing on text files rather than XML files?
I have to add that some of these XML files are not even trees: they are flat text flanked by tags. All of them were created from PDFs using Save As XML.
Appreciate any thoughts in this regard. Apologies in advance if the post is not very clear.
Thanks in advance, madbee
Re: Search and Extract from XML when path is unknown
by McA (Priest) on Jul 10, 2013 at 06:52 UTC

Re: Search and Extract from XML when path is unknown
by Anonymous Monk on Jul 10, 2013 at 07:19 UTC

Re: Search and Extract from XML when path is unknown
by locked_user sundialsvc4 (Abbot) on Jul 10, 2013 at 12:13 UTC