
Greetings Perl Monks!

I have been tasked with developing a search-and-extract process using Perl and XML. It is not my forte, but thanks to you and perldoc I am able to manage. For standard XML files like this, my process worked like a charm (I tried both XML::LibXML and Anonymous Monk's XPath approach and got the expected results). The issue is with heterogeneous XML files, where the structure of each file is different and the path to the content I have to search for and extract is unknown.

    <root>
        <part>
            <sect> <toc>1.1 Design Purpose...</toc> </sect>
            <sect>
                <header2>Purpose</header2>
                <tag>3.3 Design Purpose</tag>
                <tag> Design purpose and description </tag>
                <tag>This is a design XZY document for Project </tag>
                <tag>design specification details </tag>
                <tag>3.4 application purpose</tag>
                <tag> app details </tag>
                <tag> more app details</tag>
            </sect>
        </part>
    </root>

Given XML like the above, say I have to search for and extract the section "Design Purpose".

1. I only know that every document definitely has root and part tags. The structure of every document differs from the others; perhaps 2 out of 10 docs have a similar structure. Unless I manually review each one, I wouldn't know which are similar and which are not. There are hundreds of docs that need to be processed.

2. Content in each document is nested under multiple nodes. The complete or partial path, and even the names of the nodes that hold the content I am looking for, are unknown. It can be as shown in the example or take any other form.

3. Content is replicated in multiple sections, including the TOC and Bookmarks. Ignoring those may be easy, but if the content is replicated in sections other than the TOC and Bookmarks, I need to identify and extract the exact section.

4. I need to extract only the child nodes belonging to the section I am looking for, i.e. only the nodes under "Design Purpose" and nothing from the "Application Purpose" heading onward.
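A minimal sketch of requirements 3 and 4 with XML::LibXML, under some loud assumptions that may not hold for every file: a section heading is a leaf element whose text starts with a dotted number such as "3.3", TOC entries sit in an element named toc, and the section body consists of the heading's following siblings. The helper name extract_section is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'XML');
<root><part>
  <sect><toc>1.1 Design Purpose...</toc></sect>
  <sect>
    <header2>Purpose</header2>
    <tag>3.3 Design Purpose</tag>
    <tag> Design purpose and description </tag>
    <tag>design specification details </tag>
    <tag>3.4 application purpose</tag>
    <tag> app details </tag>
  </sect>
</part></root>
XML

# Assumption: headings look like "3.3 Some Title" (dotted number, then text).
my $heading_re = qr/^\s*\d+(?:\.\d+)+\s/;

# Hypothetical helper: return the text of every sibling element that follows
# the heading for $title, stopping at the next heading-like sibling.
sub extract_section {
    my ($doc, $title) = @_;
    my @body;
    # Leaf elements only (no element children), at any depth and under any name.
    for my $node ($doc->findnodes('//*[not(*)]')) {
        next if $node->nodeName eq 'toc';          # assumption: TOC uses <toc>
        my $text = $node->textContent;
        next unless $text =~ $heading_re && $text =~ /\Q$title\E\s*$/i;
        for my $sib ($node->findnodes('following-sibling::*')) {
            my $t = $sib->textContent;
            last if $t =~ $heading_re;             # next section starts here
            $t =~ s/^\s+|\s+$//g;                  # trim surrounding whitespace
            push @body, $t;
        }
        last;                                      # first matching heading wins
    }
    return @body;
}

print "$_\n" for extract_section($doc, 'Design Purpose');
```

Because the search uses //* rather than a fixed path, it does not depend on the nesting depth, and it also applies to flat files where headings and body elements are siblings under one parent. It would not cope with a section whose body is nested inside the heading element or split across parents.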

Without knowing where in the document that section is, or which nodes and tags hold it (i.e. without knowing the partial or full path to the content), is it possible to develop a generic search-and-extract process using Perl and XML?

Will something like this work?

1. Search for the string and get its node. So, if I'm searching for Design Purpose: find <tag>.

2. Get all the parent nodes until it hits the root.

3. Build the path dynamically. Using the constructed path (say, //root//part//sect//tag from the above example), extract the child elements using XPath or XML::LibXML.
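The three steps above can be sketched with XML::LibXML roughly as follows. This is only an illustration on a cut-down copy of the sample: the helper name locate and the hash keys are made up, and the search string is interpolated into the XPath, so it must not contain a double quote. The nodePath method already returns an XPath for a node, so the manual walk up the ancestors is shown mainly for comparison.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'XML');
<root><part>
  <sect><toc>1.1 Design Purpose...</toc></sect>
  <sect>
    <header2>Purpose</header2>
    <tag>3.3 Design Purpose</tag>
    <tag> Design purpose and description </tag>
  </sect>
</part></root>
XML

# Hypothetical helper: find leaf elements containing $needle and report
# both a generic path (names only) and the exact path from nodePath.
sub locate {
    my ($doc, $needle) = @_;
    die "needle must not contain a double quote\n" if $needle =~ /"/;

    # Step 1: search every leaf element, whatever its name or depth.
    # contains() is case-sensitive.
    my $xpath = qq{//*[not(*)][contains(normalize-space(.), "$needle")]};

    my @found;
    for my $hit ($doc->findnodes($xpath)) {
        # Step 2: the node and all of its ancestors, root first.
        my @names = map { $_->nodeName } $hit->findnodes('ancestor-or-self::*');

        # Step 3: build the path from the collected names.
        push @found, +{
            node    => $hit,
            generic => '/' . join('/', @names),
            exact   => $hit->nodePath,   # includes positions where needed
        };
    }
    return @found;
}

for my $f (locate($doc, 'Design Purpose')) {
    print "$f->{generic}  ($f->{exact})\n";
}
```

Note that the TOC entry is found as well, so the hits still need filtering (for example by element name or by ancestor) before a path is chosen.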

Am I on the right track with this? Any pointers to how this can be done?

Could this task be done more easily with Perl regex parsing of text files instead of XML files?

I have to add that some of these XML files are not even trees: they are flat, with the content simply flanked by tags. All of them were created from PDFs using Save As XML.

Appreciate any thoughts in this regard. Apologies in advance if the post is not very clear.

Thanks in advance, madbee

In reply to Search and Extract from XML when path is unknown by madbee
