Dear Monks,
Perl needs a fast and memory-efficient XPath module.
The key to making a fast and memory-efficient XPath processor is to derive a smart data structure from the XML document and keep only that in memory. The XPath query should then be answered largely from that data structure.
Idea:
Suppose we build an array with one entry per element, each of the form
'elementName,parentId,byteIndex', where
elementName is the name of the element,
parentId is the array index of the parent (-1 for the root element) and
byteIndex is the byte position of the element's start tag in the XML document.
If we have this XML document:
<?xml version="1.0"?>
<colors>
  <paprika>
    <good>red</good>
    <bad>green</bad>
  </paprika>
  <banana>yellow</banana>
</colors>
The above xml doc would generate a data structure like this:
my @dataStructure = (
    'colors,-1,22',
    'paprika,0,33',
    'good,1,47',
    'bad,1,68',
    'banana,0,100',
);
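To make the idea concrete, here is a minimal sketch of how such an array could be built. The name build_index is hypothetical, and the regex scanner is only a stand-in for a real parser: it assumes well-formed input with no comments, CDATA sections, or '>' inside attribute values, and it assumes the document is held as a byte string so that match offsets are byte offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: scan the raw XML and return one
# 'elementName,parentId,byteIndex' string per element.
# NOT a real XML parser; see the assumptions in the text above.
sub build_index {
    my ($xml) = @_;
    my (@index, @stack);    # @stack holds array indexes of the open elements
    while ($xml =~ m{<(/?)([A-Za-z_][\w.:-]*)[^>]*?(/?)>}g) {
        my ($closing, $name, $empty) = ($1, $2, $3);
        if ($closing) {     # end tag: the innermost open element is done
            pop @stack;
            next;
        }
        my $parent = @stack ? $stack[-1] : -1;    # -1 marks the root
        # $-[0] is the offset at which the whole match (the '<') started
        push @index, join ',', $name, $parent, $-[0];
        push @stack, $#index unless $empty;       # <foo/> has no children
    }
    return @index;
}

my $xml = <<'XML';
<?xml version="1.0"?>
<colors>
  <paprika>
    <good>red</good>
    <bad>green</bad>
  </paprika>
  <banana>yellow</banana>
</colors>
XML

print "$_\n" for build_index($xml);
```

With the document indented by two spaces per level as in this sketch, the five lines printed are the same five entries as in the array above. A production version would more likely hook into an event-based parser and record the byte offset it reports for each start tag.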
Now if the XPath query looks like this:
/colors/paprika/bad
then we'd start with the last step, bad: find it in the array and check whether its parent is paprika (this can be done VERY fast since we have the array index of the parent). If it is paprika, look up that node's parent to see whether it's colors. If all of that holds, we have the byteIndex telling us where to look in the XML document for the content of the element.
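The right-to-left lookup described above could be sketched roughly like this. The function name find_path is made up for illustration, and it only handles plain absolute paths of element names (no predicates, wildcards, or '//').

```perl
use strict;
use warnings;

my @dataStructure = (
    'colors,-1,22',
    'paprika,0,33',
    'good,1,47',
    'bad,1,68',
    'banana,0,100',
);

# Hypothetical helper: return the byteIndex of every node whose chain of
# ancestors matches an absolute path such as '/colors/paprika/bad'.
sub find_path {
    my ($index, $path) = @_;
    my @steps = grep { length } split m{/}, $path;
    my @nodes = map { [ split /,/ ] } @$index;    # [name, parentId, byteIndex]
    my @hits;
    NODE: for my $node (@nodes) {
        my $cur = $node;
        # Walk the path from its last step back to the first,
        # following parentId one level up after each match.
        for my $i (reverse 0 .. $#steps) {
            next NODE unless defined $cur && $cur->[0] eq $steps[$i];
            $cur = $cur->[1] >= 0 ? $nodes[ $cur->[1] ] : undef;
        }
        next NODE if defined $cur;    # absolute path: first step must be the root
        push @hits, $node->[2];
    }
    return @hits;
}

print join(',', find_path(\@dataStructure, '/colors/paprika/bad')), "\n";    # prints 68
```

Note that this sketch still scans the whole array to find candidates for the last step, and it splits every string on each query; a name-to-indexes hash built once would avoid both, at the cost of some memory.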
To keep it short now, are there any obvious shortcomings to this approach? Is it worth implementing? I have some optimisations in mind for this, but that would only make the story longer.