murugu has asked for the wisdom of the Perl Monks concerning the following question:
Hi Great Monks,
I'm converting a plain text file into an XML file. The DTD is quite complicated (the nesting is very deep). For example, the structure looks like this:
<extract> <line> <p> <extract> <show> <list> <listitem><p><extract></extract></p></listitem> </list> </show> </extract> </p></line> </extract>
To convert the text file we use short tags such as <ex> for extract, <ln> for line and <ls> for list. The problem is the nesting: any tag can appear inside any other tag.
The list items are separated by \n. Can we convert this kind of file inside out, that is, by converting the inner elements first, saving them in some data structure, and then converting the outer elements? Is this a good idea?
Can I try Parse::RecDescent or XML::DOM?
Please suggest how to approach this task.
Thanks in advance
--Murugesan--
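The short-tag scheme described in the question can be handled in a single left-to-right pass, with a stack tracking the open elements, so an inside-out conversion is not strictly needed. Below is a minimal sketch in Python (the same approach carries over to Perl). Only the `ex`, `ln` and `ls` mappings come from the post; `TAG_MAP`, `TOKEN` and `expand_short_tags` are hypothetical names, and the sketch assumes the text between tags contains no stray `<name>`-shaped sequences.

```python
import re

# Short-to-long tag names. Only ex, ln and ls are taken from the post;
# any tag not listed here is passed through unchanged.
TAG_MAP = {"ex": "extract", "ln": "line", "ls": "list"}

# Matches an opening or closing tag without attributes, e.g. <ex> or </ex>.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w-]*)>")


def expand_short_tags(text):
    """Rewrite short tags as full element names, checking nesting with a stack."""
    out = []      # output fragments, joined at the end
    stack = []    # full names of the elements that are currently open
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])          # text before this tag
        closing, name = m.group(1), m.group(2)
        full = TAG_MAP.get(name, name)
        if closing:
            # A closing tag must match the most recently opened element.
            if not stack or stack[-1] != full:
                raise ValueError(f"unexpected </{name}> at offset {m.start()}")
            stack.pop()
        else:
            stack.append(full)
        out.append(f"<{closing}{full}>")
        pos = m.end()
    out.append(text[pos:])                       # trailing text
    if stack:
        raise ValueError(f"unclosed element(s): {', '.join(stack)}")
    return "".join(out)
```

Because the stack only records what is open at each point, the depth of nesting does not matter, and a mismatched closing tag is reported instead of silently producing malformed XML.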
Re: Text to XML
by mirod (Canon) on Apr 12, 2004 at 09:14 UTC
As you give neither an example of the original text nor the result you would expect for it, it is very difficult to answer you. A few ideas though: And finally, because no post of mine is complete without a shameless XML::Twig plug, if all you want is to wrap the lines in a list in the appropriate tags, then you can use something like this:
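The snippet that followed is not preserved in this copy of the thread. As a stand-in, here is a minimal sketch of the line-wrapping idea in plain Python. It is not mirod's XML::Twig code; the function name `wrap_lines_as_list` and the default tag names are assumptions, with the `<list>`, `<listitem>` and `<p>` nesting taken from the structure shown in the question.

```python
from xml.sax.saxutils import escape


def wrap_lines_as_list(text, list_tag="list", item_tag="listitem"):
    """Wrap each non-empty line of text in <listitem><p>...</p></listitem>."""
    # One list item per non-empty line, as in the newline-separated input.
    items = [line.strip() for line in text.splitlines() if line.strip()]
    parts = [f"<{list_tag}>"]
    for item in items:
        # escape() protects &, < and > so the result stays well-formed.
        parts.append(f"  <{item_tag}><p>{escape(item)}</p></{item_tag}>")
    parts.append(f"</{list_tag}>")
    return "\n".join(parts)
```

Calling `wrap_lines_as_list("first\nsecond\n")` yields a `<list>` element with two `<listitem>` children, each holding one paragraph.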
by murugu (Curate) on Apr 12, 2004 at 15:34 UTC
Thanks a lot for your reply. Here is the input file:
The output should be like:
The above output is what I need. I have given you just a small part of my text file. Since the other parts are similar to this one, please suggest how I should convert this kind of structure.
--Murugesan
by mirod (Canon) on Apr 12, 2004 at 16:20 UTC
First a complaint: if you really want help, you have to help me: please check your data before posting. I nearly gave up on writing the code, because your input and your output are full of typos: the numbers in the list items don't match, the tag names are inconsistent, so is the indenting... this made it really difficult for me to get the test code to work. So please, next time be more considerate. With this out of my system ;--( here is the code:
Re: Text to XML
by gmpassos (Priest) on Apr 13, 2004 at 07:41 UTC
The output is:

Note that your XML structure is very strange! If you can make it more conventional, that would be better, since the idea of XML is not to declare a tree but to declare a document that can be read by other programs, and a crazy DTD will make this impossible in some languages. Note that if you don't want to share this type of document, maybe XML is not the best choice for storing your tree.

Another crazy thing is the <list> tag. The first one, the root, is used as a node with the list-item inside:

And in the other parts you use it as a simple tag next to the list-item:

You also have two different tags with similar names, <listitem> and <list-item>. It would be better to have something more distinct, like <subitem> and <listitem>. I also don't understand why <list> and <listitem> form a new level for items!

So, my suggestion for your XML is:

It is smaller and represents a similar tree with the same information. So, I say again: if you can, please change this crazy DTD!
Graciliano M. P.
by mirod (Canon) on Apr 13, 2004 at 08:20 UTC
The XML submitted by the OP is indeed strange, but I think it's just typos. list-item and listitem should be just one, and, like you suggested, I would call it item. Once this is fixed, the original XML is perfectly reasonable. In any case I would certainly not call it "crazy". It is standard practice to have a list contain only items, not a mixture of items and lists as you suggest at the end of your post. That's how XHTML, DocBook, and just about any other DTD out there works. If anything, the XML you propose is harder to handle with most tools. It might be easier to process with XML::Smart, but that's a (minor) gripe I have with both XML::Smart and XML::Simple: they sometimes lead to XML that is designed with the tool in mind, instead of following standard practices and proper XML design.
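The convention described above (a list holds only items, and a nested list goes inside an item) can be checked mechanically. Here is a minimal sketch in Python using the standard `xml.etree.ElementTree` module; the function name `misplaced_list_children` is hypothetical, and the tag names `list` and `item` follow the naming suggested in this reply.

```python
import xml.etree.ElementTree as ET


def misplaced_list_children(xml_text, list_tag="list", item_tag="item"):
    """Return the tags of direct children of any list that are not items."""
    root = ET.fromstring(xml_text)
    bad = []
    # iter() visits the root and all of its descendants that match list_tag.
    for lst in root.iter(list_tag):
        for child in lst:
            if child.tag != item_tag:
                bad.append(child.tag)
    return bad


# Nested list placed inside an item: nothing is reported.
nested_in_item = "<list><item>a<list><item>b</item></list></item></list>"
# Nested list placed next to an item: the inner <list> is reported.
mixed_children = "<list><item>a</item><list><item>b</item></list></list>"
```

With these inputs, `misplaced_list_children(nested_in_item)` returns an empty list, while `misplaced_list_children(mixed_children)` returns `["list"]`.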
by gmpassos (Priest) on Apr 13, 2004 at 08:26 UTC
What I call crazy is the use of <list> in two ways, which I don't think can be defined well with a DTD. You also really need to take care with typos: in XML, foo-bar is very different from foobar, which is different from FOOBAR! So when I saw list-item and listitem, as XML tags they were different things to me, similar only in name. The structure I suggest at the end is based on the same tree structure sent in the main post, where yes, it has a list with items and sub-lists inside it. I am not judging that structure; I'm only judging the use of similar names for tags and the use of the same name, <list>, in different ways. And don't forget that without "following standard practices and proper XML design" you don't have real XML, for the real purpose of XML, which is to be a standard format. And without real XML you just don't need XML; you can use better things to declare a tree. Good luck!
Graciliano M. P.
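The point about names is easy to demonstrate: XML element names are case-sensitive and compared character by character, so a parser treats `list-item`, `listitem` and `LISTITEM` as three unrelated elements. Here is a minimal sketch in Python using the standard `xml.etree.ElementTree` module; the function name `count_child_tags` and the sample document are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_child_tags(xml_text):
    """Count the direct children of the root element by exact tag name."""
    root = ET.fromstring(xml_text)
    # Tag names are compared as-is: no case folding, no hyphen stripping.
    return Counter(child.tag for child in root)


sample = "<list><list-item/><listitem/><LISTITEM/><listitem/></list>"
counts = count_child_tags(sample)
```

Here `counts` has three separate keys, with `listitem` counted twice and the other two spellings once each, which is why a typo in a tag name silently creates a new element type.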
by mirod (Canon) on Apr 13, 2004 at 09:37 UTC