This is a difficult question to ask since I'm not sure of the terminology. Basically, I am looking for a way to parse what I would call a "loose" XML grammar: data is contained between nested tags, just as in XML, but with no requirement on the sequence of subtags.
I'm a novice with regard to XML, but it seems that what I'm looking for is a more generalized grammar parser?
For example, this would be allowed:
<toptag>
<subtag1>element #1</subtag1>
<subtag2>element #2</subtag2>
<subtag3>element #3</subtag3>
</toptag>
<toptag>
<subtag1>element #3</subtag1>
<subtag2>element #2</subtag2>
<subtag1>element #1</subtag1>
</toptag>
<toptag>
<subtag2>element #2</subtag2>
<subtag2>element #2</subtag2>
<subtag2>element #2</subtag2>
</toptag>
The trouble is that the subtags could occur in any order and in any number from 0 to unbounded.
Essentially, I want to build a hash of these tag elements and then walk through the hash to build XML-compliant output.
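To make the hash idea concrete, here is a minimal sketch in Python using only the standard library (not a Perl solution, and not necessarily how any particular module does it). It assumes each toptag block is properly closed and that the blocks are wrapped in a made-up `<doc>` root so the input is one well-formed document. The name `subtags_to_hash` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: toptag blocks like the ones above, wrapped in an
# assumed <doc> root element so the whole input is well-formed XML.
SAMPLE = """<doc>
<toptag>
<subtag1>element #1</subtag1>
<subtag2>element #2</subtag2>
<subtag3>element #3</subtag3>
</toptag>
<toptag>
<subtag2>element #2</subtag2>
<subtag2>element #2</subtag2>
<subtag2>element #2</subtag2>
</toptag>
<toptag/>
</doc>"""


def subtags_to_hash(toptag):
    """Map each subtag name to the list of its text values.

    Order and repetition do not matter: every occurrence of a subtag is
    appended to that subtag's list, and missing subtags are simply absent.
    """
    grouped = defaultdict(list)
    for child in toptag:
        grouped[child.tag].append((child.text or "").strip())
    return dict(grouped)


# One hash per <toptag>, in document order.
hashes = [subtags_to_hash(t) for t in ET.fromstring(SAMPLE).iter("toptag")]
```

Storing a list per subtag name is the design choice that handles "any number from 0 to unbounded": zero occurrences means no key, and repeats just extend the list.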
This is kind of out of my area, and I'm not sure that I'm asking the right questions when I research this. Any suggestions would be appreciated.
Further clarification:
Maybe this will help clarify. Consider it this way. A person is writing a text document. They will tag various words or phrases of that document using a predefined set of tags. Different parts of the document may contain related tags. For example,
<statement>
This is the statement of <person id="001"><name>Joe Smith</name></person>. His mother's name is
<parent><name>Betty</name></parent>. Joe is <person id="001"><age>15</age></person> years old.
</statement>
The person {name/age} sub-elements could occur in any order. In fact, the parent/person elements could occur in any order. There might also be multiple person tag sets.
Ultimately, I want to parse the final document, build a hash from the tags and then process the hash to combine all the elements associated with person id="001" into a single data structure.
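The combining step could be sketched roughly as follows, again in Python with only the standard library and assuming the tagged document is well-formed XML. The function name `merge_people` and the shape of the resulting structure (id, then field name, then list of values) are illustrative assumptions, not something prescribed above.

```python
import xml.etree.ElementTree as ET

# The statement example from above, assumed to be well-formed.
SAMPLE = """<statement>
This is the statement of <person id="001"><name>Joe Smith</name></person>. His mother's name is
<parent><name>Betty</name></parent>. Joe is <person id="001"><age>15</age></person> years old.
</statement>"""


def merge_people(xml_text):
    """Collect the sub-elements of every <person>, keyed by its id attribute.

    Person elements sharing an id are merged into one record, no matter
    where they appear in the text or in what order their fields occur.
    """
    people = {}
    for person in ET.fromstring(xml_text).iter("person"):
        record = people.setdefault(person.get("id"), {})
        for field in person:
            # Keep a list per field so repeated fields are not overwritten.
            record.setdefault(field.tag, []).append((field.text or "").strip())
    return people


people = merge_people(SAMPLE)
```

With the sample above, both person elements with id 001 end up in a single record holding the name and the age. The parent element is skipped because it is not nested inside a person element.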
Update:
I've received several good suggestions and some good advice. XML::Simple seems the most promising at the moment. Of course, I'm open to more suggestions and I'd love to hear from someone who has tackled this problem before.
Well, I've got some exploration to do ...
PJ
use strict; use warnings; use diagnostics;