One idea (thought process):
Until you empty the string:
- Strip leading whitespace.
- If the string starts with a '<', grab everything up to the next '>' (use a non-greedy match) and store it in an array.
- Otherwise, grab everything up to the next '<' and put it into the array.
- Repeat.
You should now have an array like (for the above) ( '<Reference1>', '<reference1_name>', 'jvdsj', ...).
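The splitting loop above can be sketched roughly as follows. This is only an illustration, written in Python rather than Perl; the names `TOKEN` and `tokenize` are made up for the example, and the sample input is a guess reconstructed from the array shown above.

```python
import re

# Hypothetical pattern: a token is either a whole <...> tag (non-greedy)
# or a run of text up to the next '<'.
TOKEN = re.compile(r"<.*?>|[^<]+", re.DOTALL)

def tokenize(text):
    """Split text into a flat list of tags and data chunks."""
    tokens = []
    text = text.lstrip()                  # strip leading whitespace
    while text:                           # until the string is empty
        m = TOKEN.match(text)
        if m is None:                     # a '<' with no closing '>'
            raise ValueError("unterminated tag near: " + text[:20])
        tokens.append(m.group(0))
        text = text[m.end():].lstrip()    # drop the token, strip again, repeat
    return tokens

# Assumed sample input, not taken from the original question.
sample = "<Reference1> <reference1_name>jvdsj</reference1_name></Reference1>"
print(tokenize(sample))
```

Only leading whitespace is stripped, as in the steps above, so trailing whitespace inside a data chunk is kept as-is.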
Now repeat until the array is empty, starting with a fresh hash:
- If the top element starts with < but not with </, then:
  - Create a key with the text of that tag in the current hash, and create a new hash as its value.
  - Push that text onto a stack array, and the new hash onto a stack of hashes. (That is, at all times, your current level is denoted by the [-1] element of the stack array, and the hash to insert into is the [-1] element of the stack of hashes.)
- If the element starts with </, then:
  - Pop items off the two stacks until the item just popped from the stack array equals the rest of the current tag. If you pop all the elements off without finding a match, you've got a malformed element.
- Otherwise, push the element into the [-1] hash of the stack of hashes, say in the 'data' element.
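A minimal sketch of this second pass, again in Python purely for illustration. The function name `build_tree` is made up, and a few details are assumptions on my part: the key is the tag text without the angle brackets, the fresh hash sits at the bottom of the hash stack, and the matching open tag is popped along with anything above it. Attributes, self-closing tags, and repeated sibling tags are not handled.

```python
def build_tree(tokens):
    """Turn a flat list of tags and data chunks into nested dicts."""
    root = {}
    names = []         # stack of open tag names; names[-1] is the current level
    hashes = [root]    # stack of dicts; hashes[-1] is the dict to insert into
    for tok in tokens:
        if tok.startswith("</"):
            wanted = tok[2:-1].strip()    # the rest of the closing tag
            while names:
                hashes.pop()
                if names.pop() == wanted:
                    break                 # found the matching open tag
            else:
                # popped everything (or nothing was open): malformed input
                raise ValueError("unmatched closing tag " + tok)
        elif tok.startswith("<"):
            name = tok[1:-1].strip()      # assumption: key is the bare tag text
            child = {}
            hashes[-1][name] = child      # new key in the current hash
            names.append(name)
            hashes.append(child)
        else:
            hashes[-1]["data"] = tok      # plain text goes under 'data'
    return root

tokens = ["<Reference1>", "<reference1_name>", "jvdsj",
          "</reference1_name>", "</Reference1>"]
print(build_tree(tokens))
# → {'Reference1': {'reference1_name': {'data': 'jvdsj'}}}
```

Because each tag name is used directly as a hash key, a second sibling with the same name would overwrite the first; a real module would need lists of children to avoid that.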
This assumes that leading whitespace in the data itself is not needed (though that can be worked around), and that a literal < never appears unescaped in the data.
Update: I'm watching the other responses to this node, and I'm wondering whether a very lightweight XML parser module based on this idea might be worthwhile. I see that someone has pointed out XML::SAX, which appears not to require external libs, but getting a working XML parser out of it appears to require 3 additional modules. The above pseudo-code, on the other hand, could be made into a single module, say, XML::PureSimple, which would not handle bad XML gracefully, but could be used to handle anything that follows the basic XML patterns. If anyone thinks there might be a use for such a module, drop me a msg or similar.
-----------------------------------------------------
Dr. Michael K. Neylon - mneylon-pm@masemware.com
||
"You've left the lens cap of your mind on again, Pinky" - The Brain
"I can see my house from here!"
It's not what you know, but knowing how to find it if you don't know that's important