Sort xml based on attribute

bharathinc has asked for the wisdom of the Perl Monks concerning the following question:

Re: Sort xml based on attribute by marto (Cardinal) on Feb 18, 2010 at 12:26 UTC
See Re: XML::Twig -- sorting by attribute. Super Search is your friend.
Re: Sort xml based on attribute by Corion (Patriarch) on Feb 18, 2010 at 12:27 UTC
While you have shown us some of your data, you haven't shown us any of your code or how it fails to produce the results you want. We cannot help you without knowing what you've done already and where exactly your problem is.

Personally, I would approach the problem in one of two ways: either restructure/restringify the XML fragments to be sorted so that they sort ASCIIbetically, or use XPath queries to extract the attributes by which the fragments are to be sorted and then output the new document by stringifying their nodes. For both approaches, XML::Twig or XML::LibXML would work, and also Jenda's XML::Rules, I suppose.
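The second approach (extract the sort key with XPath, sort the fragments, write them back) might look roughly like the sketch below, which uses XML::LibXML. The sample document and its element names (`records`, `record`, `id`, `title`) are assumptions made up for illustration, since the original poster's data is not shown here, and a real document with namespaces would also need an XML::LibXML::XPathContext.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data: these element and attribute names are
# assumptions for illustration, not the original poster's document.
my $xml = <<'XML';
<records>
  <record><id type="PUI">30</id><title>third</title></record>
  <record><id type="PUI">4</id><title>first</title></record>
  <record><id type="PUI">12</id><title>second</title></record>
</records>
XML

my $doc  = XML::LibXML->load_xml( string => $xml, no_blanks => 1 );
my $root = $doc->documentElement;

# Pair each fragment with its sort key (extracted by an XPath query),
# sort numerically on the key, then keep only the fragments.
my @sorted =
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map  { [ $_->findvalue('id[@type="PUI"]'), $_ ] }
    $root->findnodes('record');

# Detach every fragment and re-attach it in sorted order.
for my $node (@sorted) {
    $node->unbindNode;
    $root->appendChild($node);
}

# Titles in their new document order, then the re-serialized document.
my @order = map { $_->findvalue('title') } $root->findnodes('record');
print $doc->toString(1);
```

The key is extracted once per fragment rather than once per comparison, which matters when the XPath query is expensive and the document is large.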
Re: Sort xml based on attribute by Jenda (Abbot) on Feb 18, 2010 at 22:27 UTC
    use strict;
    use warnings;
    no warnings 'uninitialized';
    use XML::Rules;

    my $parser = XML::Rules->new(
        style => 'filter',
        rules => {
            _default => 'raw',
            itemid   => sub {
                my ($tag, $attrs, $context, $parents) = @_;
                $parents->[-4]{':PUI'} = $attrs->{_content}
                    if $attrs->{idtype} eq "PUI";
                return [$tag => $attrs];    # same thing the 'raw' built-in does
            },
            item       => 'as array',
            bibdataset => sub {
                my ($tag, $attrs) = @_;
                @{$attrs->{item}} = sort { $a->{':PUI'} <=> $b->{':PUI'} } @{$attrs->{item}};
                $attrs->{_content} = [
                    (map( ( "\n\t", [item => $_] ), @{$attrs->{item}} )),
                    "\n",
                ];
                delete $attrs->{item};
                return $tag => $attrs;
            },
        }
    );

    $parser->filter(\DATA);

    __DATA__
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <bibdataset ...

Basically, whenever an <itemid> tag is fully parsed (including content and end tag), the code checks whether idtype eq "PUI". If it does, the code remembers the content in the hash of the tag's parent's parent's parent's parent (i.e. the <item> tag; attributes starting with a colon are never exported to the resulting XML) and then adds the tag's data to the parent's content.

Then the <item> tags are removed from the parent tag's content and stored in an array kept in the parent tag's hash of attributes under the key "item".

Once the XML is fully parsed, the array of items is sorted, some whitespace is inserted between the items, and the resulting array becomes the content of the root tag. Finally the tag is printed with its attributes and content (including child tags).

The code assumes that <itemid> will always be at the same level below <item> and that there will be only <item> tags in bibdataset!

Jenda
Enoch was right! Enjoy the last years of Rome.
Re^2: Sort xml based on attribute by Anonymous Monk on Aug 12, 2010 at 02:35 UTC
Sorry to be a pest, Jenda, but I'm trying to modify this script for my own use and I've only just started learning Perl, so much of this is completely new to me. Could you explain what each bit does, if it's not too much trouble? Or explain how I could modify it to sort an XML file such as this one based on category, subCategory and then code1? Sorry to ask so much, but I really am at a loss.

    <?xml version="1.0"?>
    <ResultDetail>
      <results>
        <ResultItem>
          <category>AGM</category>
          <subCategory>VAL</subCategory>
          <code1>010000</code1>
          <name>parse</name>
          <type>ERR</type>
          <flags>320</flags>
          <language>EN</language>
          <description>Parse error</description>
          <cause>Parse error in the input XML</cause>
          <action>Correct the error and send your request again</action>
        </ResultItem>
        <ResultItem>
          <category>AGM</category>
          <subCategory>VAL</subCategory>
          <code1>010300</code1>
          <name>client.NotEntered</name>
          <type>ERR</type>
          <flags>320</flags>
          <language>EN</language>
          <description>Client not entered</description>
          <cause>Invalid data field values</cause>
          <action>Correct the problem and send the request again</action>
        </ResultItem>
        <ResultItem>
          <category>AGM</category>
          <subCategory>VAL</subCategory>
          <code1>010400</code1>
          <name>client.notFound</name>
          <type>ERR</type>
          <flags>320</flags>
          <language>EN</language>
          <description>Client not found</description>
          <cause>Invalid data field values</cause>
          <action>Correct the problem and send the request again</action>
        </ResultItem>
        <ResultItem>
          <category>AGM</category>
          <subCategory>VAL</subCategory>
          <code1>010500</code1>
          <name>client.invalidData</name>
          <type>ERR</type>
          <flags>320</flags>
          <language>EN</language>
          <description>Client data invalid</description>
          <cause>Invalid data field values</cause>
          <action>Correct the problem and send the request again</action>
        </ResultItem>
      </results>
    </ResultDetail>

Note: It's actually a much, much larger file (9510 lines).
Re^3: Sort xml based on attribute by Jenda (Abbot) on Aug 12, 2010 at 10:20 UTC
The code will be a bit simpler, but whether it will be any easier to understand I don't know. What language(s) do you have experience with?

    use strict;
    use warnings;
    no warnings 'uninitialized';
    use XML::Rules;

    my $parser = XML::Rules->new(
        style => 'filter',    # we want to filter (modify) the XML, not extract data
        rules => {
            _default => 'raw',
            # we want to copy most tags intact, including the whitespace in and around them
            # the data of the tags will end up in the _content pseudoattribute of the parent tag

            'category,subCategory,code1' => 'raw extended',
            # these three we need not only to copy, but also to make easier to access.
            # The "raw extended" rule causes the data of that tag to be available in the hash of the parent tag
            # also as ":category", ":subCategory" and ":code1" so you do not have to search through the _content array

            'ResultItem' => 'as array',
            # we expect several <ResultItem> tags and want to store the data of each in an array.
            # the array will be accessible using the 'ResultItem' key in the hash containing the data of the parent tag

            'results' => sub {
                my ($tag, $attrs) = @_;    # this is the Perl way to assign names to subroutine/function parameters
                # this subroutine is called whenever the <results>...</results> is fully parsed and the rules
                # specified for the child tags evaluated.
                if ($attrs->{ResultItem} and @{$attrs->{ResultItem}} > 1) {
                    # if there are any <ResultItem> tags and there's more than one
                    @{$attrs->{ResultItem}} = sort {
                        # sort allows you to specify the code to be used to compare the items to sort
                        # the items are made available as $a and $b to the code.
                        # in this case $a and $b are the hashes created by processing the child tags of the <ResultItem> tags.
                        $a->{':category'} cmp $b->{':category'}
                            or $a->{':subCategory'} cmp $b->{':subCategory'}
                            or $a->{':code1'} cmp $b->{':code1'}
                    } @{$attrs->{ResultItem}};
                }
                $attrs->{_content} =~ s/^\s+// if (!ref $attrs->{_content});
                # remove the accumulated whitespace that was present between the <ResultItem> tags
                return [$tag => $attrs]
            }
        }
    );

    $parser->filter(\DATA);
    # see the XML::Rules docs for ways to redirect the output to file

    __DATA__
    <?xml version="1.0"?>
    <ResultDetail>
      <results>
        <ResultItem>
          <category>AGM</category>
          <subCategory>VAL</subCategory>
          <code1>010000</code1>
          <name>parse</name>
          ...

Update: Please see Re^9: Sort xml based on attribute for a fixed version.

Jenda
Enoch was right! Enjoy the last years of Rome.
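For readers new to Perl, the chained comparison inside the sort block can be tried on its own, without any XML involved. The sketch below is only an illustration: the records are made-up plain hashes standing in for the per-<ResultItem> hashes, and the keys have no leading colon.

```perl
use strict;
use warnings;

# Hypothetical records standing in for the per-<ResultItem> hashes;
# the values are made up for illustration.
my @items = (
    { category => 'AGM', subCategory => 'VAL', code1 => '010300' },
    { category => 'AGM', subCategory => 'SYS', code1 => '020000' },
    { category => 'AAA', subCategory => 'VAL', code1 => '990000' },
    { category => 'AGM', subCategory => 'VAL', code1 => '010000' },
);

# cmp returns 0 when two strings are equal, so "or" falls through to the
# next key only when all the previous keys tie.
my @sorted = sort {
       $a->{category}    cmp $b->{category}
    or $a->{subCategory} cmp $b->{subCategory}
    or $a->{code1}       cmp $b->{code1}
} @items;

# One "category/subCategory/code1" string per record, in sorted order.
my @keys = map { join '/', @{$_}{qw(category subCategory code1)} } @sorted;
print "$_\n" for @keys;
```

Because `cmp` compares strings, codes such as 010000 and 010300 sort correctly only as long as they all have the same width. For keys of varying width that should sort as numbers, `<=>` would be the operator to use.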
Re^4: Sort xml based on attribute by Anonymous Monk on Aug 12, 2010 at 11:48 UTC
Re^5: Sort xml based on attribute by Jenda (Abbot) on Aug 12, 2010 at 12:53 UTC
Some notes below your chosen depth have not been shown here
Re^4: Sort xml based on attribute by Anonymous Monk on Aug 12, 2010 at 11:53 UTC
Re: Sort xml based on attribute by mirod (Canon) on Feb 20, 2010 at 08:43 UTC
That's something I've had to do in the past, so as usual it ended up in XML::Twig.

The `sort_children` method is called on the parent and is passed a function, which is called on each child in turn and returns that child's sort key. The method also takes options to specify the type of sort (numeric or alpha) and the order.

This leads to the code below:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    # nparse creates a twig with the given options and parses the file in one call
    XML::Twig->nparse( pretty_print => 'indented', shift @ARGV )
             ->root
             ->sort_children( \&get_pui, type => 'numeric' )
             ->print;

    sub get_pui {
        my ($item) = @_;
        return $item->first_descendant('itemid[@idtype="PUI"]')->text;
    }

Note that this code relies on a couple of assumptions:

- It assumes that `bibdataset` contains only elements to be sorted (`item` containing an `itemid` descendant with the proper attribute). If that's not the case, you need to tweak `get_pui` to return a number, either big or small, depending on where you want to put the extra elements. I believe that in recent versions of Perl, if you always return the same number, the order will be the original order in the document.
- It assumes that you can load the entire document in memory. If that is not the case, you can split the document into one file per record and then sort them before merging them back. XML::Twig includes the `xml_split` tool that can do just that, but you might be better off doing it yourself, using twig_handlers to save each file under a name that includes the PUI (probably padded with 0s so that lexicographic order as used by the shell works), then merging them back.

Does this help?
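The remark about padding the PUI with 0s can be illustrated with a short sketch. The PUI values, the 10-digit width and the `item_....xml` naming scheme below are all assumptions chosen for illustration and do not come from the original post.

```perl
use strict;
use warnings;

# Hypothetical PUI values, made up for illustration.
my @pui = ( 905, 12, 100045, 7 );

# A plain string sort compares character by character,
# so "100045" sorts before "12".
my @as_strings = sort @pui;

# Padding every key to a fixed width (10 digits is an assumed upper bound)
# makes string order agree with numeric order.
my @padded = map { sprintf( 'item_%010d.xml', $_ ) } @pui;
my @names  = sort @padded;

print "$_\n" for @names;
```

With names like these, a shell glob such as `item_*.xml` lists the per-record files in PUI order, so merging them back needs no further sorting.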
Re: Sort xml based on attribute by runrig (Abbot) on Aug 12, 2010 at 15:18 UTC
There is also XML::Filter::Sort, and XSLT (via XML::LibXSLT). There is a complicated example of sorting with XSLT over on use.perl.
Re: Sort xml based on attribute by choroba (Cardinal) on Aug 12, 2010 at 17:39 UTC
I often use XML::XSH2. Leaving Perl aside, the XSH code would look like this:

    open 823927.xml ;
    register-namespace ani http://www.elsevier.com/xml/ani/ani ;
    for &{ sort :n /ani:bibdataset/ani:item/ani:bibrecord/ani:item-info/ani:itemidlist/ani:itemid[@idtype="PUI"] }
        ls ancestor::ani:item