
Re^2: How to create XML tree from non-XML source

by H4 (Acolyte)
on Sep 09, 2008 at 09:53 UTC

in reply to Re: How to create XML tree from non-XML source
in thread How to create XML tree from non-XML source

My original data is genealogical data in GEDCOM format. GEDCOM is a well-documented standard, yet every GEDCOM-capable program creates files that, in one way or another, violate that standard. My idea is to create an intermediate form which can be converted to and from all the third-party GEDCOM styles involved. I chose XML because GEDCOM is a tree structure, and I thought it would be better to use existing tools for manipulating trees than to re-invent them.

Yes, I know there is a Gedcom package on CPAN, but it cannot read 5 out of 9 test files, and does not handle character sets correctly.

I want to use XPath expressions to locate the nodes which must be modified, then modify them as required, then save the tree to an XML file. I don't mind saving the unmodified XML tree to an intermediate file if I must. But then, using XPath to locate a node, how do I do my modification? This may include renaming the node's type, changing the text, moving the node up in the tree, or creating subnodes. Are XML and XPath the wrong tools? Maybe I'll have to create my own code to locate nodes, rather than using XPath?

Re^3: How to create XML tree from non-XML source
by GrandFather (Saint) on Sep 09, 2008 at 21:12 UTC

    XML is in essence a file format. It is not generally used as an in-memory representation of data from some other file format. Unless you want to store an intermediate form of the data on disk in some non-GEDCOM format, XML is not appropriate. Even then, you would probably be better off storing any intermediate form of the data on disk as a clean GEDCOM file (although, see below).

    There are many ways to handle trees in Perl (see tree), but you are probably better off writing a GEDCOM object hierarchy that directly addresses the structure you need to manipulate.
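As a rough illustration of what such an object hierarchy could look like, here is a minimal sketch, written in Python rather than Perl purely for illustration. It assumes the basic GEDCOM line layout of `LEVEL [@XREF@] TAG [VALUE]` and uses the level number to decide where each line hangs in the tree. The names `Node` and `parse_gedcom` are hypothetical, and real files that bend the standard would need more forgiving parsing than this.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Node:
    """One GEDCOM line plus the lines nested beneath it."""
    level: int
    tag: str
    value: str = ""
    xref: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def parse_gedcom(lines: Iterable[str]) -> Node:
    """Build a tree from GEDCOM-style lines using their level numbers."""
    root = Node(level=-1, tag="ROOT")   # artificial root above all level-0 records
    stack = [root]                      # path from the root to the most recent node
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split(" ", 2)
        if len(parts) < 2:
            raise ValueError(f"cannot parse line: {raw!r}")
        level = int(parts[0])
        xref = None
        if parts[1].startswith("@"):
            # Record with a cross-reference id: LEVEL @XREF@ TAG [VALUE]
            xref = parts[1]
            rest = parts[2].split(" ", 1) if len(parts) > 2 else [""]
            tag = rest[0]
            value = rest[1] if len(rest) > 1 else ""
        else:
            # Ordinary line: LEVEL TAG [VALUE]
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else ""
        node = Node(level, tag, value, xref)
        # Pop back up until the top of the stack is a valid parent.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


sample = [
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "2 GIVN John",
    "1 BIRT",
    "2 DATE 1 JAN 1900",
    "0 TRLR",
]
tree = parse_gedcom(sample)
```

With a tree like this in memory, per-vendor readers and writers can be layered on top without the core structure knowing about any particular file dialect.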

    I note that GEDCOM 6.0 will be an XML based file, but that needn't alter how you internally represent the data. In fact whatever internal representation you choose now should be completely independent of the external representation and should be chosen to facilitate the creation and manipulation of the internal representation. Then it becomes fairly easy to handle different input file formats and generate different output file formats.

    Perl reduces RSI - it saves typing

      Mhm, I'm not sure if I'd agree that internal representation should be completely independent of expected final output, if for no other reason than the implementation of that internal representation may or may not lend itself well to any given input or output, forcing the developer to spend more time rewriting those parts later. In other words, the input and output formats dictate parts of the creation (input) and manipulation (for later output).

      If he implements, for example, in Tree::DAG_Node, he might get a slight improvement in ease of creation as he steps through his pseudo-GEDCOM files, he could still use XPath for searching (it's not just for XML), and he gets a few other features he may or may not have any use for (such as ASCII-art representations of the tree). But if he ever wants to output to XML, he's going to have to completely rebuild that tree in a form more accessible to an XML writer, or hand-craft his own, as I don't know of any module that will do it for him automatically.

      On a more primitive example, if he implements as a pure hash tree, he can directly modify every element by location and he can convert it directly to XML via XML::TreePP, but it's otherwise going to be a nuisance to manipulate, he gets no XPath searches, and he'd have to write his own tree walking code.
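To make the pure hash tree trade-off concrete, here is a small sketch, again in Python for illustration only, with a nested dict standing in for a Perl hash of hashes. The conversion function `dict_to_xml` is a hand-rolled, hypothetical helper and is not how XML::TreePP works internally; the sample record is made up.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(tag, value):
    """Recursively turn a nested dict (or scalar) into an XML element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            # A list means "several children with the same tag".
            items = child if isinstance(child, list) else [child]
            for item in items:
                elem.append(dict_to_xml(key, item))
    else:
        elem.text = str(value)
    return elem


person = {
    "NAME": "John /Smith/",
    "BIRT": {"DATE": "1 JAN 1900", "PLAC": "London"},
    "FAMS": ["@F1@", "@F2@"],
}

# Direct modification by location: no search needed if you know the path.
person["BIRT"]["DATE"] = "2 JAN 1900"

xml_text = ET.tostring(dict_to_xml("INDI", person), encoding="unicode")
```

Note that Python dicts keep insertion order, whereas plain Perl hashes do not, so a Perl version would need extra care if node order matters.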

      If he implements with XML::Twig, he gets XPath for searches, automatic XML output, and fairly decent tree manipulation functions, but he'd have to write a few small shortcut subs for things like adding a child or a sibling (since creating and pasting an element can take 2-3 lines, perhaps better reduced to one), and of course writing a GEDCOM 5.5 file isn't going to be any easier than with Tree, though it also won't really be any worse.
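The four modifications the original question asks about (rename a node, change its text, move it up, create a subnode) can be sketched against an XML tree as follows. This is not XML::Twig; it uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath, and the sample document and tag names (`WHEN`, `SURN`) are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<INDI id='I1'>"
    "<NAME>John /Smith/</NAME>"
    "<BIRT><DATE>1 JAN 1900</DATE><NOTE>moved later</NOTE></BIRT>"
    "</INDI>"
)

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in doc.iter() for child in parent}

# 1. and 2. Locate by path, then rename the node and change its text.
for date in doc.findall("./BIRT/DATE"):
    date.tag = "WHEN"
    date.text = "1900-01-01"

# 3. Move every NOTE one level up in the tree.
for note in doc.findall(".//NOTE"):
    parent = parent_of[note]
    grandparent = parent_of[parent]
    parent.remove(note)
    grandparent.append(note)

# 4. Create a subnode under an existing node.
name = doc.find("./NAME")
ET.SubElement(name, "SURN").text = "Smith"

result = ET.tostring(doc, encoding="unicode")
```

The pattern is the same whatever library is used: the path expression only finds the nodes, and the edits are ordinary method calls on the nodes it returns.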

      There are a few decent tree-building modules available, but if he wants a tree generator/manipulator with easy XPath-language searches, automatic XML output, and preservation of node order, he's pretty much down to XML::Twig that I know of. Did you have another suggestion in particular?

      (As an aside, I'd dispute that XML is only a file format -- sure, it's a file format primarily, but it's a file format that makes a richer view of the traditional tree easily human-grokkable. Trees are traditionally a collection of node positions, with little more metadata attached to that position than a name or value most of the time. XML trees are trees where the nodes can contain a subtree otherwise unrelated to the main tree containing an unlimited amount of metadata, and where as a result you don't have to think of it as a subtree if you don't want to, but can if you do. If you're working with complex trees and aren't optimizing for nothing but speed, you could do a lot worse than to think of your project in terms of how it would look in XML.)

      (Aside #2: Whatever happened to GEDCOM 6.0 anyway? It was in 'any day now' state half a decade ago, but a quick Googling doesn't show anything ever coming out of it, and the only copies of the specification I can find are incomplete drafts.)

        Hm, "completely independent" was a bit strong. In some way the internal representation has to represent the structure of the data and if the data is tree structured then it is likely that the internal representation will be of a treeish nature.

        My worry with the OP however is that it focuses on solving an implementation problem using specific tools rather than finding the best set of tools to solve the actual problem. In fact Tree::DAG_Node is a strong contender for "best tool", especially when you discard the XML component as the red herring that it really is.

        Update: Doh! s/read/red/

        Perl reduces RSI - it saves typing
