murugu has asked for the wisdom of the Perl Monks concerning the following question:

Hi Great Monks,

I'm converting a plain text file into an XML file. The DTD is very complicated (the nesting is very deep). For example, the structure looks like this:

<extract>
  <line>
    <p>
      <extract>
        <show>
          <list>
            <listitem><p><extract></extract></p></listitem>
          </list>
        </show>
      </extract>
    </p>
  </line>
</extract>

To convert the text file we use short tags such as <ex> for extract, <ln> for line and <ls> for the list tag. The problem is the nesting: any tag can appear inside any other tag.

Our list items are separated by \n. Can we convert this kind of file inside out (by converting the inner elements first, saving them in some data structure, and then converting the outer elements)? Is this a good idea?

Could I use Parse::RecDescent or XML::DOM?

Please give me suggestions on how to approach this task.

Thanks in advance

--Murugesan--

Replies are listed 'Best First'.
Re: Text to XML
by mirod (Canon) on Apr 12, 2004 at 09:14 UTC

    As you do not give an example of the original text and of the result you would expect for it, it is very difficult to answer you.

    A few ideas though:

    • if your input is plain text, not XML, use either Parse::RecDescent or just plain regexps, coupled with XML::Writer or a SAX writer (XML::Handler::YAWriter or XML::SAX::Writer),
    • if your input is already XML, then you can use a SAX filter to isolate the bits you want to process further, and treat them as plain text, printing the result or emitting SAX events on it,
    • if you know a bit about SGML, you could try writing the SGML DTD and taking advantage of the minimization features (where the parser infers the markup from the DTD, using the DTD structure and, for example, line returns as an element delimiter) to see if by any chance your text is not already a valid SGML document, or cannot simply be made into one. Going from SGML to XML is as simple as using sx (also called osx in some Linux distributions).

    And finally, because no post of mine is complete without a shameless XML::Twig plug, if all you want is to wrap the lines in a list in the appropriate tags, then you can use something like this:

    #!/usr/bin/perl -w
    use strict;
    use XML::Twig;

    XML::Twig->new( # process just list elements
                    twig_roots => { list => \&process_list },
                    # output the rest as is
                    twig_print_outside_roots => 1,
                  )
             ->parse( \*DATA);

    sub process_list
      { my( $t, $list)= @_;
        # wrap (non-empty) lines in a listitem element
        my @listitems= $list->split( qr/(^.+)\n/m => 'listitem');
        # add the p and extract tags within each listitem
        foreach my $listitem (@listitems)
          { $listitem->insert( 'p', 'extract'); }
        $list->print ;
      }

    __DATA__
    <extract>
      <line>
        <p>
          <extract>
            <show>
              <list>
    first item
    second item
    third item
              </list>
            </show>
          </extract>
        </p>
      </line>
    </extract>

      Thanks a lot for your reply.

      Here is the input file:

      <nl>number list 1
      number list 2
      <ul>unnumbered list1
      unnumbered list 2
      <pl>plain list 1
      plain list 2
      <nl>numbered list 1
      numbered list 2</nl>
      </pl>
      </ul>
      </nl>

      The output should be like:

      <list type="numbered">
          <list-item>numbered list 1</listitem>
          <list-item>numbered list 1</listitem>
          <listitem>
              <list type="unnumbered">
                  <listitem>unnumbered list1</listitem>
                  <listitem>unnumbered list 2</listitem>
                  <listitem>
                      <list type="plain">
                          <listitem>plain list 1</listitem>
                          <listitem>plain list 2</listitem>
                          <listitem>
                              <list type="numbered">
                                  <list-item>numbered list 1</listitem>
                                  <list-item>numbered list 1</listitem>
                              </list>
                          </listitem>
                      </list>
                  </listitem>
              </list>
          </listitem>
      </list>

      The above output is what I need. I have given you just a small part of my text file. Since the other parts are very similar to this one, please give me suggestions for doing the conversion.

      What should I do for this kind of structure?

      --Murugesan

        First a complaint: if you really want help, you have to help me: please check your data before posting. I nearly gave up on writing the code, because your input and your output are full of typos: the numbers in the list items don't match, the tag names are inconsistent, so is the indenting... this made it really difficult for me to get the test code to work. So please, next time be more considerate.

        With this out of my system ;--( here is the code:

        #!/usr/bin/perl -w
        use strict;
        use XML::Twig;
        use Test::More tests => 1;

        my( $input, $expected);
        { local $/="\n\n";
          $input= <DATA>;
          ($expected= <DATA>)=~ s{^\s*}{};
        }

        my $t= XML::Twig->new( twig_handlers => { nl => sub { process_list( numbered   => @_); },
                                                  pl => sub { process_list( plain      => @_); },
                                                  ul => sub { process_list( unnumbered => @_); },
                                                },
                               pretty_print => 'indented',
                             )
                        ->parse( $input);

        $t->set_indent( ' ' x 4); # if you really want 4 space indents
        my $result= $t->sprint;

        is( $result, $expected, "test lists");

        sub process_list
          { my( $type, $t, $list)= @_;
            $list->set_tag( 'list')
                 ->set_att( type => $type);
            foreach my $child ( $list->children)
              { if( $child->is_text)
                  { $child->mark( qr/^(.+?)\s*$/m, 'listitem'); }
                else
                  { $child->wrap_in( 'listitem'); }
              }

            # you need this for the pretty printing to work, or the
            # empty text elements left by mark will mess up XML::Twig
            # this is a bug, I will see how best to fix it in the next version
            foreach my $child ( $list->children)
              { $child->delete if( $child->text=~ m{^\s*$}); }
          }

        __DATA__
        <nl>number list 1
        number list 2
        <ul>unnumbered list1
        unnumbered list 2
        <pl>plain list 1
        plain list 2
        <nl>numbered list 1
        numbered list 2</nl>
        </pl>
        </ul>
        </nl>

        <list type="numbered">
            <listitem>number list 1</listitem>
            <listitem>number list 2</listitem>
            <listitem>
                <list type="unnumbered">
                    <listitem>unnumbered list1</listitem>
                    <listitem>unnumbered list 2</listitem>
                    <listitem>
                        <list type="plain">
                            <listitem>plain list 1</listitem>
                            <listitem>plain list 2</listitem>
                            <listitem>
                                <list type="numbered">
                                    <listitem>numbered list 1</listitem>
                                    <listitem>numbered list 2</listitem>
                                </list>
                            </listitem>
                        </list>
                    </listitem>
                </list>
            </listitem>
        </list>
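        For comparison, the "just plain regexps" route mentioned earlier in the thread can be sketched with core Perl only. This is a minimal sketch, not a replacement for the XML::Twig code above: it tokenizes the text into short tags and text runs and keeps a stack of open lists, so it runs outside in rather than inside out. The helper name convert_lists, the %TYPE table and the 4-space indent are assumptions made for illustration, as is the rule that every non-empty text line is one list item and that short tags carry no attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Short tag => value of the type attribute (mapping taken from the thread).
my %TYPE = ( nl => 'numbered', ul => 'unnumbered', pl => 'plain' );

# Hypothetical helper: convert short-tag text into nested list XML.
# $indent is the indent unit (assumed to be 4 spaces by default).
sub convert_lists {
    my ( $text, $indent ) = @_;
    $indent = ' ' x 4 unless defined $indent;

    my ( @out, @open );    # output lines, stack of open short tags

    # Tokenize from the current position: a tag, or text up to the next '<'.
    while ( $text =~ m{\G(?:<(/?)(\w+)>|([^<]+))}gc ) {
        my ( $close, $tag, $chunk ) = ( $1, $2, $3 );

        if ( defined $chunk ) {
            # Each non-empty line of text becomes one list item.
            for my $line ( split /\n/, $chunk ) {
                $line =~ s/^\s+|\s+$//g;
                next unless length $line;
                die "text outside of a list: '$line'\n" unless @open;
                $line =~ s/&/&amp;/g;
                $line =~ s/>/&gt;/g;
                push @out, $indent x ( 2 * @open - 1 ) . "<listitem>$line</listitem>";
            }
        }
        elsif ( !$close ) {
            # Opening tag: a nested list is wrapped in its own <listitem>.
            die "unknown tag <$tag>\n" unless exists $TYPE{$tag};
            my $depth = @open;
            push @out, $indent x ( 2 * $depth - 1 ) . '<listitem>' if $depth;
            push @out, $indent x ( 2 * $depth ) . qq{<list type="$TYPE{$tag}">};
            push @open, $tag;
        }
        else {
            # Closing tag: it must match the innermost open list.
            my $top = pop @open;
            die "unexpected </$tag>\n" unless defined $top && $top eq $tag;
            my $depth = @open;
            push @out, $indent x ( 2 * $depth ) . '</list>';
            push @out, $indent x ( 2 * $depth - 1 ) . '</listitem>' if $depth;
        }
    }

    # The loop stops early on anything it cannot tokenize (e.g. a stray '<').
    die "cannot parse input at offset " . ( pos($text) // 0 ) . "\n"
        if ( pos($text) // 0 ) < length $text;
    die "unclosed <$open[-1]>\n" if @open;

    return join( "\n", @out ) . "\n";
}

# The input posted in the thread.
my $input = <<'END_INPUT';
<nl>number list 1
number list 2
<ul>unnumbered list1
unnumbered list 2
<pl>plain list 1
plain list 2
<nl>numbered list 1
numbered list 2</nl>
</pl>
</ul>
</nl>
END_INPUT

print convert_lists($input);
```

        The stack is what copes with "any tag inside any tag": the converter never needs to know how deep it is in advance, only which list is innermost. Once other element types with their own content rules come into play, a real parser (Parse::RecDescent) or an XML module is probably the safer choice.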
Re: Text to XML
by gmpassos (Priest) on Apr 13, 2004 at 07:41 UTC
    Here's a solution with XML::Smart:
    use XML::Smart ;

    my $xml = XML::Smart->new(q`
    <nl>
      number list 1
      number list 2
      <ul>
        unnumbered list1
        unnumbered list 2
        <pl>
          plain list 1
          plain list 2
          <nl>
            numbered list 1
            numbered list 2
          </nl>
        </pl>
      </ul>
    </nl>
    ` , 'html');

    my $new_xml = XML::Smart->new() ;

    process($xml , $new_xml) ;

    print $new_xml->data ;

    sub process {
      my $xml = shift ;
      my $new_xml = shift ;

      foreach my $node_i ( $xml->nodes ) {
        my @lines = split(/\s*\n\s*/ , $node_i) ;

        my $type ;
        if    ( $node_i->key eq 'nl' ) { $type = 'numbered' ;}
        elsif ( $node_i->key eq 'ul' ) { $type = 'unnumbered' ;}
        elsif ( $node_i->key eq 'pl' ) { $type = 'plain' ;}

        my $set_root = 1 if $new_xml->base->null ;

        $new_xml->{list}{type} = $type ;
        $new_xml = $new_xml->{list} if $set_root ;

        push( @{$new_xml->{'list-item'}} , @lines) ;

        process($node_i , $new_xml->{listitem} ) ;
      }
    }
    The output is:
    <?xml version="1.0" encoding="iso-8859-1" ?>
    <?meta name="GENERATOR" content="XML::Smart/1.5.9 Perl/5.006001 [MSWin32]" ?>
    <list type="numbered">
      <list-item>number list 1</list-item>
      <list-item>number list 2</list-item>
      <listitem>
        <list type="unnumbered"/>
        <list-item>unnumbered list1</list-item>
        <list-item>unnumbered list 2</list-item>
        <listitem>
          <list type="plain"/>
          <list-item>plain list 1</list-item>
          <list-item>plain list 2</list-item>
          <listitem>
            <list type="numbered"/>
            <list-item>numbered list 1</list-item>
            <list-item>numbered list 2</list-item>
          </listitem>
        </listitem>
      </listitem>
    </list>
    Note that your XML structure is very strange! It would be better if you could make it more conventional, since the idea of XML is not to declare a tree but to declare a document that can be read by other programs, and a crazy DTD will make this impossible in some languages. If you don't intend to share this type of document, maybe XML is not the best choice for storing your tree.

    Another crazy thing is the <list> tag. The first one, the root, is used as a node with the list-item elements inside:

    <list type="numbered">
      <list-item>number list 1</list-item>
      <list-item>number list 2</list-item>
    </list>
    In the other parts you use it as a simple empty tag next to the list-item elements:
    <listitem>
      <list type="numbered"/>
      <list-item>numbered list 1</list-item>
      <list-item>numbered list 2</list-item>
    </listitem>
    You also have two different tags with similar names, <listitem> and <list-item>. It would be better to use clearly distinct names, like <subitem> and <listitem>.

    I also don't understand why <list> and <listitem> are used as a new level for items! So my suggestion for your XML is:

    <list type="numbered">
      <item>number list 1</item>
      <item>number list 2</item>
      <list type="unnumbered">
        <item>unnumbered list 1</item>
        <item>unnumbered list 2</item>
        <list type="plain">
          <item>plain list 1</item>
          <item>plain list 2</item>
          <list type="numbered">
            <item>numbered list 1</item>
            <item>numbered list 2</item>
          </list>
        </list>
      </list>
    </list>
    It is smaller and represents a similar tree with the same information. So, I say again: if you can, please change this crazy DTD!

    Graciliano M. P.
    "Creativity is the expression of the liberty".

      The XML submitted by the OP is indeed strange, but I think it's just typos. list-item and listitem should be just one, and, like you suggested, I would call it item.

      Once this is fixed, the original XML is perfectly reasonable. In any case I would certainly not call it "crazy". It is standard practice to have a list contain only items, not a mixture of items and lists as you suggest at the end of your post. That's how XHTML, DocBook, and just about any other DTD out there works.

      If anything, the XML you propose is harder to handle with most tools. It might be easier to process with XML::Smart, but that's a (minor) gripe I have with both XML::Smart and XML::Simple: they sometimes lead to XML that is designed with the tool in mind, instead of following standard practices and proper XML design.

        XML::Smart and XML::Simple don't follow any DTD to read XML!

        What I call crazy is the use of <list> in two ways, which I don't think can be defined well with a DTD.

        Also, you really need to take care with typos. In XML, foo-bar is very different from foobar, which is different from FOOBAR! So when I saw list-item and listitem as XML tags, for me they were different things that only have similar names. The structure I suggest at the end is based on the same tree structure sent in the main post, where yes, there is a list with items and sublists inside it. I won't judge that structure; I'm only judging the use of similar names for different tags and the use of the same name, <list>, in different ways.

        And don't forget that without "following standard practices and proper XML design" you don't have real XML, for the real purpose of XML: to be a standard format. And without real XML you just don't need XML; you can use better things to declare a tree.

        Good luck!

        Graciliano M. P.
        "Creativity is the expression of the liberty".