Perl with XML

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Perl with XML by mirod (Canon) on Oct 25, 2001 at 19:47 UTC
If this is homework, then I would like to -- both you (for not running the code under the debugger, using Data::Dumper, or generally searching this site and the web for information on Perl and XML) and your teacher. XML SHOULD NOT BE PARSED USING REGEXPs (unless you seriously know what you're doing, like Paul Kulchenko in XML::Parser::Lite or Matt Sergeant in his new pure-Perl parser). See On XML parsing for a few reasons (and I am not even mentioning non-ASCII encodings in that post).

Now, if you are looking for resources on Perl and XML, have a look at xml.com, which carries a series of really good articles by Kip Hampton on Perl and XML. I have a couple of resources on xmltwig.com, and the mother of all XML resources is of course The XML Cover Pages, which even has a section on Perl and XML.

Update: Oh my! I forgot to mention a web site dedicated to Perl and XML: xmlperl.com!
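For anyone unsure what "using Data::Dumper" means in practice, here is a minimal sketch. The contents of `$portresults` are a hypothetical stand-in: the thread only shows that the quiz code builds a hash reference with keys such as `portfolio`, `Bayer`, `riskgrade` and `xloss`, not what the real values are.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for the structure the quiz code returns;
# the real keys and values are not shown in this thread.
my $portresults = {
    portfolio => { riskgrade => 120 },
    Bayer     => { riskgrade => 95, xloss => -3.14159 },
};

$Data::Dumper::Sortkeys = 1;   # stable key order makes dumps easier to compare
$Data::Dumper::Indent   = 1;   # compact, fixed-width indentation

# Dumper returns a string showing every level of the nested structure.
my $dump = Dumper($portresults);
print $dump;
```

Dumping the structure once usually answers "what is the value of ..." questions faster than reading the parsing code.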
Re: Perl with XML by tomhukins (Curate) on Oct 25, 2001 at 18:18 UTC
Is there any reason you're using your own code rather than standard CPAN XML modules? XML::Parser is good for creating flexible XML parsing code, and XML::Simple is great if you want to convert XML into Perl data structures.

To answer your questions:

1. It's always a good idea to pass data to a subroutine rather than referring to it directly, because this reduces the number of global variables. This makes your code easier to maintain as it grows.

2. You're missing a `/` from the closing `divriskgrade` tag. Why have you offered a hint, though? Can you already answer this question? If so, why ask it? If you were to use standard XML modules, they would report where the error occurs.

3., 4. & 5. Why can't you answer these questions yourself? Run the code and find out! For question 5, though, you'd be much better off using CPAN modules such as HTML::Parser instead of writing your own HTML parsing code, which is liable to fail.

As for good Web sites that discuss Perl and XML, search Google for perl xml or learn how to find information on Perl Monks. Super Search is very useful. Tutorials and Module Reviews contain information that will help you with XML parsing.

If you're not sure why writing your own parsing code is a bad idea, take a look at Re: Parsing HTML and (tye)Re: parsing HTML.

Update: We've been discussing this thread on the CB, and several monks let me know they have downvoted this post because it answers a homework question. I've considered editing my response to question 2, but the questioner has probably read it by now, and the response might help someone else. Overall, I think my answers were vague enough to make the questioner think, and might be useful to others.
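To illustrate the point that a standard XML module reports where an error occurs, here is a small sketch using XML::Parser. The sample documents and the helper name `xml_error` are hypothetical; only the `divriskgrade` tag name comes from this thread. XML::Parser is a CPAN module rather than part of the Perl core, so the script checks for it before use.

```perl
use strict;
use warnings;

# Return the parser's error message for $xml, or '' if it is well-formed.
# xml_error is a hypothetical helper name used only for this sketch.
sub xml_error {
    my ($xml) = @_;
    require XML::Parser;                 # CPAN module, not in the Perl core
    my $parser = XML::Parser->new;
    my $ok = eval { $parser->parse($xml); 1 };   # parse() dies on malformed XML
    return $ok ? '' : $@;
}

# Hypothetical samples: the first repeats the opening tag instead of closing it.
my $bad  = "<results>\n<divriskgrade>12<divriskgrade>\n</results>\n";
my $good = "<results>\n<divriskgrade>12</divriskgrade>\n</results>\n";

if ( eval { require XML::Parser; 1 } ) {
    print "bad:  ", xml_error($bad), "\n";
    print "good: ", ( xml_error($good) || 'well-formed' ), "\n";
}
else {
    print "XML::Parser is not installed\n";
}
```

The error message for the malformed sample includes the position at which the parser gave up, which is usually enough to spot a missing `/`.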
Re: Perl with XML by MZSanford (Curate) on Oct 25, 2001 at 18:14 UTC
I am usually one to help with general questions, but this seems a bit too homew(or|rec)k-ish ... All I will suggest is that XML::Parser would be a very good place to start. As for a good place for general XML/Perl stuff, perl.com is usually OK.

i had a memory leak once, and it ruined my favorite shirt.
Re: Perl with XML by perrin (Chancellor) on Oct 25, 2001 at 21:27 UTC
Hey, this isn't homework, it's part of the job interview at RiskMetrics! That's not very cool, posting it here like this. Why don't you try learning some XML instead? You should keep in mind that at least one former RiskMetrics employee is regularly on this site: japhy.
Re: Re: Perl with XML by Anonymous Monk on Dec 06, 2001 at 06:58 UTC
Yes, he is correct: these are the interview questions ... of risk coders.
Re: Perl with XML by buckaduck (Chaplain) on Oct 25, 2001 at 18:53 UTC
Not that you care, but your code won't run correctly under `use strict`. One way to fix this is to use `my` to declare `$portresults` as a lexical variable: `my $portresults = parsePortfolioResponse($sampleXML);`

If you fix this, you might be able to run the program and answer the teacher's questions yourself.

buckaduck
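A minimal sketch of what `use strict` objects to, assuming nothing about the quiz beyond the variable name `$portresults`. The helper `compile_error` is hypothetical; it compiles a snippet with a string `eval` so that the failure can be captured instead of stopping the script.

```perl
use strict;
use warnings;

# Compile and run a snippet under "use strict"; return the error text,
# or '' if it compiled cleanly. compile_error is a hypothetical helper.
sub compile_error {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? '' : $@;
}

# Single quotes keep $portresults from being interpolated here.
my $without_my = compile_error('$portresults = { portfolio => 1 }');
my $with_my    = compile_error('my $portresults = { portfolio => 1 }');

print "without my: $without_my";
print "with my:    ", ( $with_my || "compiles cleanly\n" );
```

Without `my`, strict refuses to compile the assignment because `$portresults` was never declared; with `my`, the same line is accepted.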
Re: Perl with XML by tachyon (Chancellor) on Oct 25, 2001 at 18:34 UTC
This looks like homework to me. Did you consider doing this?

    # 3. What is the value of "$portresults->{portfolio}->{riskgrade}"?
    print "port ", $portresults->{portfolio}->{riskgrade}, "\n";

    # 3. What is the value of "sprintf("%.2f", -$portresults->{Bayer}->{xloss} )"?
    printf "sprintf: %.2f\n", -$portresults->{Bayer}->{xloss};

    # 4. What is the first element of "sort {lc($b) cmp lc($a)} keys %{$portresults}" ?
    my @ary = sort {lc($b) cmp lc($a)} keys %{$portresults};
    print "sort: $ary[0]\n";

As for Q5, this is a bizarre regex that does this:

    s           # substitute
    /           # begin search sequence
    <           # find a literal <
    (?!         # negative lookahead, i.e. find a < not followed by
        \/*     #   0 or more / chars
        [bi]    #   then either a 'b' or an 'i' char
        >       #   then a literal >
    )           # end of negative lookahead
    .*?         # 0 or more of any characters, non-greedy, i.e. minimal
    >           # a literal >
    /x/         # replacement sequence is a literal 'x' char
    g           # do all occurrences

In a nutshell, this will substitute all HTML tags with the letter 'x' except for <b> </b> <i> </i> tags. It is very broken.

    $HTML = <<THIS;
    <html>
    <head>
    <title>Foo</title>
    </head>
    <body>
    <B>This regex is broken</b>
    <b>It will cope with this</b>
    <b >But not this< /b>
    <I>told you</i>ts<b>roken</b>
    </body>
    </html>
    THIS

    $HTML =~ s/<(?!\/*[bi]>).*?>/x/g;
    print $HTML;

When all else fails, suck it and see!

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Perl with XML by tachyon (Chancellor) on Oct 25, 2001 at 19:02 UTC
Here is how to use HTML::Parser to do the job (your) regex doesn't:

    #!/usr/bin/perl -w
    package Filter;
    use strict;
    use base 'HTML::Parser';

    my ($filter, %ok_tags);
    my @ok_tags = qw ( i b );
    @ok_tags{@ok_tags} = @ok_tags;

    sub start {
        my ($self, $tag, $attr, $attrseq, $origtext) = @_;
        $filter .= exists $ok_tags{$tag} ? $origtext : 'x';
    }

    sub text {
        my ($self, $text) = @_;
        $filter .= $text;
    }

    sub comment {
        my ($self, $comment) = @_;
        $filter .= $comment;
    }

    sub end {
        my ($self, $tag, $origtext) = @_;
        $filter .= $ok_tags{$tag} ? $origtext : 'x';
    }

    my $html = join '', <DATA>;
    my $parser = Filter->new;
    $parser->parse($html);
    $parser->eof;
    print $filter;

    __DATA__
    <html>
    <head>
    <title>Foo</title>
    </head>
    <body>
    <B>This regex is broken</b>
    <b>It will cope with this</b>
    <b >But not this< /b>
    <i>told you</i>ts<b>roken</B>
    </body>
    </html>

Before you lay that on your teacher, make sure you understand how the hash slice lookup table and the V2 interface to HTML::Parser work :-)

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
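For readers who have not met the hash slice lookup table mentioned above, here is a stand-alone sketch of just that idiom. The list of tags to test is made up for illustration; the `@ok_tags` and `%ok_tags` names mirror the post.

```perl
use strict;
use warnings;

my @ok_tags = qw( i b );

# Hash slice assignment: one statement that sets $ok_tags{i} = 'i'
# and $ok_tags{b} = 'b', turning the list into a lookup table.
my %ok_tags;
@ok_tags{@ok_tags} = @ok_tags;

# Membership is now a single hash lookup instead of a loop over the list.
# The tag names tested here are arbitrary examples.
for my $tag (qw( b i title body )) {
    my $verdict = exists $ok_tags{$tag} ? 'kept' : 'replaced with x';
    printf "%-5s %s\n", $tag, $verdict;
}
```

Because the stored values are the tag names themselves, they are all true, so testing with `exists` or testing the value directly gives the same answer for this table.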
Re: Perl with XML by cbeels (Initiate) on Oct 26, 2001 at 13:25 UTC
What a naughty anonymous monk this is. I wrote this quiz to judge the fitness of Perl programmers who were applying for roles with our company. In fact, the illustrious japhy himself blew it away before joining us last year (though he has sadly returned to school since). He was the inspiration for several questions that this naughty monk didn't feel obliged to include. Feel free to check out the test in its original format (and apply for a job, if you're looking in Manhattan) here.
Re: Perl with XML by cbeels (Initiate) on Oct 26, 2001 at 13:43 UTC
And as for the comments about using XML::Parser instead of RegExps, the RiskGrades engine handles thousands of XML requests daily, and using Parser required significantly more overhead (especially under mod_perl, which occasionally has had freaky memory retention problems). We did Benchmark both, and straight RegExps came out well ahead. I do agree that Parser should be used for non-mission-critical apps, though.

The point about use strict is valid and has been corrected in the quiz.

As for the "broken" regexp that tachyon described most eloquently above, it was actually used to translate a bunch of files that had been mangled by DreamWeaver. The only valid tags were the <b> and <i> tags (which were all lower case), but I needed to keep track of where the other ones were, so I used "x"s. Not terribly elegant, but it made for a good question (that most people get wrong).
Re: Re: Perl with XML by mirod (Canon) on Oct 26, 2001 at 16:31 UTC
No! Your code does NOT parse XML. It parses a limited subset of XML. It might be OK for the data you handle right now, but it means that you cannot change this data. Are the restrictions you put on the XML clearly documented somewhere? Because if you have to receive data from a source that you don't control, and if you just tell them "it's XML, here is the DTD/schema", I can tell you that you open the door to tons of problems. People do use entities, comments, processing instructions, namespaces and the like! And as this regexp-based parsing does no validation whatsoever of the incoming XML, how do you know you can trust it?

In short, you are using an internal format that looks a little bit like XML but is not XML. This is fine, except when you call it XML.

I understand that the quiz is for applicants to your company only, so it's not like you were advocating your method in a public forum, but I still want to warn people (and you!) against thinking that XML is simple to process using regexps.

BTW, if you don't want to use XML::Parser you can also use XML::Parser::Lite, which is regexp based, or libXML, or soon the new XML::SAX::PurePerl, or you could use a real (and fast) XML processor to generate a version of the data that you know you can handle (expanding entities, discarding comments...).
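A small sketch of the kind of trouble comments and entities cause for regexp-based extraction. The document below is hypothetical (only the `riskgrade` name appears in this thread), and the pattern is a deliberately naive one, not the actual RiskGrades code, which is not shown here.

```perl
use strict;
use warnings;

# Hypothetical input: well-formed XML containing a comment and an entity.
my $xml = <<'XML';
<results>
  <!-- <riskgrade>999</riskgrade> old value, kept for reference -->
  <riskgrade>AT&amp;T: 42</riskgrade>
</results>
XML

# Naive extraction: grab the text of every <riskgrade> element.
my @all = $xml =~ m{<riskgrade>(.*?)</riskgrade>}sg;

print "match: $_\n" for @all;
```

The regexp reports two values: one that only exists inside a comment, and one with `&amp;` left unexpanded. An XML parser would see a single `riskgrade` element whose text is `AT&T: 42`.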
Re: Re: Perl with XML by buckaduck (Chaplain) on Nov 01, 2001 at 01:19 UTC
"The point about use strict is valid and has been corrected in the quiz."

I hope that RiskGrades counts this in my favor if I should decide to apply! (And I just might do that; I'll be looking for a new job next year when my worksite closes...)

buckaduck