bharathinc has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Re: Sort xml based on attribute
by marto (Cardinal) on Feb 18, 2010 at 12:26 UTC
Re: Sort xml based on attribute
by Corion (Patriarch) on Feb 18, 2010 at 12:27 UTC

    While you have shown us some of your data, you haven't shown us any of your code and how it fails to produce the results you want. We cannot help you without knowing what you've done already and where your exact problem is.

    Personally, I would approach the problem in one of two ways: either restructure/restringify the XML fragments to be sorted so that they sort ASCIIbetically, or use XPath queries to extract the attributes by which the fragments are to be sorted and then output the new document by stringifying their nodes.

    For both approaches, XML::Twig or XML::LibXML would work, and also Jenda's XML::Rules, I suppose.
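A minimal sketch of the first idea (making the keys sort ASCIIbetically), using core Perl only. The record layout, the `padded_key` helper and the sample IDs are hypothetical; in practice the IDs would come out of the XML via one of the modules named above.

```perl
use strict;
use warnings;

# Hypothetical records: each fragment is paired with the numeric ID
# that would be extracted from it (e.g. via an XPath query).
my @fragments = (
    { id => 900,  xml => '<item id="900"/>'  },
    { id => 1000, xml => '<item id="1000"/>' },
    { id => 85,   xml => '<item id="85"/>'   },
);

# Build a fixed-width, zero-padded key so that plain string comparison
# (cmp) yields the same order as a numeric comparison would.
sub padded_key {
    my ($id) = @_;
    return sprintf '%010d', $id;
}

my @sorted = sort { padded_key( $a->{id} ) cmp padded_key( $b->{id} ) } @fragments;

print "$_->{xml}\n" for @sorted;
```

Without the padding, a string sort would place 1000 before 85 and 900, which is why the keys have to be restructured first.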

Re: Sort xml based on attribute
by Jenda (Abbot) on Feb 18, 2010 at 22:27 UTC
    use strict;
    use warnings;
    no warnings 'uninitialized';
    use XML::Rules;

    my $parser = XML::Rules->new(
        style => 'filter',
        rules => {
            _default => 'raw',
            itemid => sub {
                my ($tag,$attrs,$context,$parents) = @_;
                $parents->[-4]{':PUI'} = $attrs->{_content} if $attrs->{idtype} eq "PUI";
                return [$tag => $attrs]; # same thing the 'raw' built-in does
            },
            item => 'as array',
            bibdataset => sub {
                my ($tag,$attrs) = @_;
                @{$attrs->{item}} = sort {$a->{':PUI'} <=> $b->{':PUI'}} @{$attrs->{item}};
                $attrs->{_content} = [
                    (map( ( "\n\t", [item => $_]), @{$attrs->{item}})),
                    "\n",
                ];
                delete $attrs->{item};
                return $tag => $attrs;
            },
        }
    );

    $parser->filter(\*DATA);
    __DATA__
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <bibdataset ...

    Basically, whenever an <itemid> tag is fully parsed (including its content and end tag), the code checks whether idtype eq "PUI". If it does, the code remembers the content in the hash of the tag's parent's parent's parent's parent (i.e. the <item> tag; attributes starting with a colon are never exported to the resulting XML) and then adds the tag's data to the parent's content. The <item> tags, in turn, are removed from their parent tag's content and stored in an array kept in the parent tag's hash of attributes under the key "item".

    Then once the XML is fully parsed, the array of items is sorted, some whitespace gets inserted between the items and the resulting array becomes the contents of the root tag. And the tag with the attributes and content (including child tags) gets printed.

    The code assumes that <itemid> will always be at the same level below <item> and that there will only be <item> tags in <bibdataset>!
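The sorting and content-rebuilding step described above can be tried in isolation with plain Perl data. This is only a sketch: the three hashes are hypothetical stand-ins for what XML::Rules would build for each <item>, and no XML is parsed here.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the per-<item> hashes, each carrying the
# remembered ':PUI' key.
my @items = ( { ':PUI' => 30 }, { ':PUI' => 4 }, { ':PUI' => 12 } );

# Numeric sort on the remembered key.
my @sorted = sort { $a->{':PUI'} <=> $b->{':PUI'} } @items;

# Interleave whitespace strings with [tag => data] pairs and finish
# with a newline, mirroring how the new _content list is assembled.
my @content = ( ( map { ( "\n\t", [ item => $_ ] ) } @sorted ), "\n" );

for my $part (@content) {
    if ( ref $part ) {
        print "tag=$part->[0] PUI=$part->[1]{':PUI'}\n";
    }
    else {
        print "text (", length $part, " chars)\n";
    }
}
```

The resulting list alternates plain strings (whitespace) and array references (child tags), which is the shape the rule assigns to _content.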

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      Sorry to be a pest Jenda, but I'm trying to modify this script for my own use but I've only just started learning perl so much of this is completely new to me. Could you explain what each bit does if it's not too much trouble? Or explain how I could modify it to sort an XML file such as this based on category, subcategory and then code1? Sorry to ask so much but I really am at a loss.
      <?xml version="1.0"?>
      <ResultDetail>
        <results>
          <ResultItem>
            <category>AGM</category>
            <subCategory>VAL</subCategory>
            <code1>010000</code1>
            <name>parse</name>
            <type>ERR</type>
            <flags>320</flags>
            <language>EN</language>
            <description>Parse error</description>
            <cause>Parse error in the input XML</cause>
            <action>Correct the error and send your request again</action>
          </ResultItem>
          <ResultItem>
            <category>AGM</category>
            <subCategory>VAL</subCategory>
            <code1>010300</code1>
            <name>client.NotEntered</name>
            <type>ERR</type>
            <flags>320</flags>
            <language>EN</language>
            <description>Client not entered</description>
            <cause>Invalid data field values</cause>
            <action>Correct the problem and send the request again</action>
          </ResultItem>
          <ResultItem>
            <category>AGM</category>
            <subCategory>VAL</subCategory>
            <code1>010400</code1>
            <name>client.notFound</name>
            <type>ERR</type>
            <flags>320</flags>
            <language>EN</language>
            <description>Client not found</description>
            <cause>Invalid data field values</cause>
            <action>Correct the problem and send the request again</action>
          </ResultItem>
          <ResultItem>
            <category>AGM</category>
            <subCategory>VAL</subCategory>
            <code1>010500</code1>
            <name>client.invalidData</name>
            <type>ERR</type>
            <flags>320</flags>
            <language>EN</language>
            <description>Client data invalid</description>
            <cause>Invalid data field values</cause>
            <action>Correct the problem and send the request again</action>
          </ResultItem>
        </results>
      </ResultDetail>
      Note: It's actually a much much larger file (9510 lines)

        The code will be a bit simpler, but whether it will be any easier to understand I don't know. What language(s) do you have experience with?

        use strict;
        use warnings;
        no warnings 'uninitialized';
        use XML::Rules;

        my $parser = XML::Rules->new(
            style => 'filter', # we want to filter (modify) the XML, not extract data
            rules => {
                _default => 'raw',
                    # we want to copy most tags intact, including the whitespace in and around them
                    # the data of the tags will end up in the _content pseudoattribute of the parent tag
                'category,subCategory,code1' => 'raw extended',
                    # these three we need not only to copy, but also to make easier to access.
                    # The "raw extended" rule causes the data of that tag to be available in the hash of the parent tag
                    # also as ":category", ":subCategory" and ":code1" so you do not have to search through the _content array
                'ResultItem' => 'as array',
                    # we expect several <ResultItem> tags and want to store the data of each in an array.
                    # the array will be accessible using the 'ResultItem' key in the hash containing the data of the parent tag
                'results' => sub {
                    my ($tag,$attrs) = @_; # this is the Perl way to assign names to subroutine/function parameters
                    # this subroutine is called whenever the <results>...</results> is fully parsed and the rules
                    # specified for the child tags evaluated.
                    if ($attrs->{ResultItem} and @{$attrs->{ResultItem}} > 1) {
                        # if there are any <ResultItem> tags and there's more than one
                        @{$attrs->{ResultItem}} = sort {
                            # sort allows you to specify the code to be used to compare the items to sort
                            # the items are made available as $a and $b to the code.
                            # in this case $a and $b are hashes created by processing the child tags of the <ResultItem> tags.
                            $a->{':category'} cmp $b->{':category'}
                            or
                            $a->{':subCategory'} cmp $b->{':subCategory'}
                            or
                            $a->{':code1'} cmp $b->{':code1'}
                        } @{$attrs->{ResultItem}};
                    }
                    $attrs->{_content} =~ s/^\s+// if (!ref $attrs->{_content});
                        # remove the accumulated whitespace that was present between the <ResultItem> tags
                    return [$tag => $attrs]
                }
            }
        );

        $parser->filter(\*DATA);
        # see the XML::Rules docs for ways to redirect the output to file
        __DATA__
        <?xml version="1.0"?>
        <ResultDetail>
          <results>
            <ResultItem>
              <category>AGM</category>
              <subCategory>VAL</subCategory>
              <code1>010000</code1>
              <name>parse</name>
              ...

        Update: Please see Re^9: Sort xml based on attribute for a fixed version.
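The chained comparison in the sort block above can be tried on its own, without any XML parsing. The sketch below uses plain hashes as hypothetical stand-ins for the per-<ResultItem> data; the 'ACC', 'SYS' and '020000' values are made up so that every key gets exercised.

```perl
use strict;
use warnings;

# Hypothetical records standing in for the per-<ResultItem> hashes.
my @items = (
    { category => 'AGM', subCategory => 'VAL', code1 => '010300' },
    { category => 'AGM', subCategory => 'SYS', code1 => '020000' },
    { category => 'ACC', subCategory => 'VAL', code1 => '010500' },
    { category => 'AGM', subCategory => 'VAL', code1 => '010000' },
);

# cmp returns 0 when two strings are equal, so "or" falls through to
# the next key only when the previous keys tie.
my @sorted = sort {
       $a->{category}    cmp $b->{category}
    or $a->{subCategory} cmp $b->{subCategory}
    or $a->{code1}       cmp $b->{code1}
} @items;

print join( ' ', @{$_}{qw(category subCategory code1)} ), "\n" for @sorted;
```

Comparing code1 with cmp is safe here only because the sample codes are zero-padded to the same width; for numbers of varying width, <=> would be the comparison to use.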

        Jenda

Re: Sort xml based on attribute
by mirod (Canon) on Feb 20, 2010 at 08:43 UTC

    That's something I've had to do in the past, so as usual it ended up in XML::Twig. The sort_children method gets called on the parent, and gets passed a function, which will be called on each child in turn. That function will return the sort criteria. The method also takes options to specify the type of sort (numeric or alpha) and the order.

    This leads to the code below:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    XML::Twig->parse( pretty_print => 'indented', shift @ARGV)
             ->root
             ->sort_children( \&get_pui, type => 'numeric')
             ->print;

    sub get_pui
      { my( $item)= @_;
        return $item->first_descendant( 'itemid[@idtype="PUI"]')->text;
      }

    Note that this code relies on a couple of assumptions:

    • it assumes that bibdataset contains only elements to be sorted (item containing an itemid descendant with the proper attribute). If that's not the case, you need to tweak get_pui to return a number, either big or small, depending on where you want to put the extra elements. I believe that in recent versions of Perl, if you always return the same number, the order will be the original order in the document.
    • it assumes that you can load the entire document in memory. If that is not the case, you can split the document into 1 file per record and then sort them before merging them back. XML::Twig includes the xml_split tool that can do just that, but you might be better off doing it yourself, using twig_handlers to save each file under a name that includes the PUI (probably padded with 0s so that lexicographic order as used by the shell works), then merging them back.
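For the first assumption, here is a minimal core-Perl sketch of a key function that returns a sentinel for elements without a PUI. Rather than relying on sort stability, it carries the original position along as an explicit tiebreaker. The `@children` data, the `sort_key` helper and the `$SENTINEL` value are hypothetical, and no XML::Twig objects are involved.

```perl
use strict;
use warnings;

# Hypothetical children of the root: some carry a PUI, some do not.
my @children = (
    { name => 'item-a', pui => 300 },
    { name => 'note-1' },                # no PUI
    { name => 'item-b', pui => 20 },
    { name => 'note-2' },                # no PUI
);

# Assumed to be larger than any real PUI, so the extras end up last.
my $SENTINEL = 1e15;

sub sort_key {
    my ($child) = @_;
    return defined $child->{pui} ? $child->{pui} : $SENTINEL;
}

# Decorate each child with its original position, sort by key, break
# ties by position, then strip the decoration again.
my @sorted =
    map  { $_->[1] }
    sort { sort_key( $a->[1] ) <=> sort_key( $b->[1] ) or $a->[0] <=> $b->[0] }
    map  { [ $_, $children[$_] ] } 0 .. $#children;

print "$_->{name}\n" for @sorted;
```

Using a small sentinel (for example 0 or -1) instead would move the elements without a PUI to the front while still keeping their document order.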

    Does this help?

Re: Sort xml based on attribute
by runrig (Abbot) on Aug 12, 2010 at 15:18 UTC
Re: Sort xml based on attribute
by choroba (Cardinal) on Aug 12, 2010 at 17:39 UTC
    I often use XML::XSH2. Leaving perl aside, the XSH code would look like this:
    open 823927.xml ;
    register-namespace ani http://www.elsevier.com/xml/ani/ani ;
    for &{ sort :n /ani:bibdataset/ani:item/ani:bibrecord/ani:item-info/ani:itemidlist/ani:itemid[@idtype="PUI"] }
        ls ancestor::ani:item