Looks like there is a bug in one of those modules. I've been testing some changes to a script so that it can process the new and even crazier XML format: taking the same data in both formats, processing it and comparing the results. For one job there was one tiny difference. A space. In one result there was "Typically reports to top management. nagement." while in the other just "Typically reports to top management.nagement.". Now that doesn't look like sensible text, and sure enough neither of the source XMLs contained this nonsense. There was just "Typically reports to top management." at the end of a 1058-character line.
I tested the very text I get from XML::Twig, both before and after ->simplify(), and the error WAS there. I have XML::Twig 3.17 and XML::Parser 2.34, Perl v5.8.0 ActiveState build 805, Win2kServer SP4. Here is a short example script with an XML that shows this bug:
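A minimal sketch of how a one-character discrepancy like this can be pinpointed, assuming both results are available as plain strings. The helper `first_diff` and the variables `$result_a` and `$result_b` are hypothetical names used for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset of the first character at which
# the two strings differ, or -1 if they are identical.
sub first_diff {
    my ($x, $y) = @_;
    my $max = length($x) < length($y) ? length($x) : length($y);
    for my $i (0 .. $max - 1) {
        return $i if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    # One string is a prefix of the other, or they are equal.
    return length($x) == length($y) ? -1 : $max;
}

# The two variants quoted in the post (assumed to be the tail of each result).
my $result_a = "Typically reports to top management. nagement.";
my $result_b = "Typically reports to top management.nagement.";

my $pos = first_diff($result_a, $result_b);
if ($pos >= 0) {
    # Show a little context around the first difference.
    printf "first difference at offset %d: '%s' vs '%s'\n",
        $pos, substr($result_a, $pos, 10), substr($result_b, $pos, 10);
}
```

Reporting the offset rather than just "different" makes it easier to see whether the damage sits at a suspicious position, such as the end of a very long line.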
use strict;
use XML::Twig;
# wonder why strict doesn't complain about XML being used.

my $twig = XML::Twig->new(
    twig_roots    => { 'Job' => \&process_job },  # call the handler for each <Job>
    keep_encoding => 1,                           # keep the document's original encoding
) or die "Can't create XML::Twig object!\n";

# Slurp the whole XML document from the DATA section.
my $data = do { local $/; <DATA> };
$twig->parse($data);
$twig->purge;

sub process_job {
    my ($twig, $JobObj) = @_;
    # The duplicated "nagement." fragment is not in the source XML.
    if ($JobObj->first_child("JobBody")->text() =~ /Typically reports to top management\. ?nagement\./) {
        print STDERR "I HATE XML\n";
        exit;
    }
    print $JobObj->first_child("JobBody")->text();
}
__DATA__
<?xml version="1.0" encoding="Windows-1252" standalone="no"?>
<Jobs xmlns="http://schemas.monster.com/Monster" xmlns:Monster="http://schemas.monster.com/Monster" xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <Job jobId="31129343">
    <OriginationJobId>DHRDallas-31129343</OriginationJobId>
    <JobseekerRedirectURL>http://my.monster.com/applyStart.asp?jobid=31129343</JobseekerRedirectURL>
    <RecruiterRedirectURL>http://recruiter.monster.com/submitjob.asp?jobid=31129343</RecruiterRedirectURL>
    <RecruiterUserId>36339124</RecruiterUserId>
    <RecruiterUserName>xaimcodx1</RecruiterUserName>
    <RecruiterFirstName>Scott</RecruiterFirstName>
    <RecruiterLastName>Davis</RecruiterLastName>
    <RecruiterEmailAddress>scott.davis@aimco.com</RecruiterEmailAddress>
    <JobTitle>Director of Human Resources</JobTitle>
    <JobBody><![CDATA[Implements Human Resources policies and programs for the ROC/department. The major areas covered are employment, employee orientation and training, employee relations, compensation, benefits, safety and health, and employee services. Participates with Corporate and Area Human Resources to design, and develop practices and objectives that will provide a balanced program throughout all areas. Implements and/or delivers training for new programs and company plan changes for the employees in their ROC/department. Requires a Bachelor’s Degree with at least 5-8 years experience in the field. SPHR, CCP CEBS certifications a plus. Must have a breath of knowledge of Human Resources functions such as compensation, benefits, staffing, compliance, employee relations, and performance management. Must be familiar with Internet business models and technologies and desktop tools for collection and analysis of data. Experience supporting field organizations and multi-disciplined groups preferred. Typically reports to top management.
Please email resume to chris.delisa@aimco.com along with salary history and requirements.]]></JobBody>
    <ContactName>Chris DeLisa</ContactName>
    <ContactCompanyName>AIMCO</ContactCompanyName>
    <ContactEmailAddress>chris.xxxxa@xxxxx.com</ContactEmailAddress>
    <ContactPhoneNumber></ContactPhoneNumber>
    <ContactFaxNumber></ContactFaxNumber>
    <ContactStreetAddress></ContactStreetAddress>
    <ContactCity></ContactCity>
    <ContactState></ContactState>
    <ContactPostalCode></ContactPostalCode>
    <ContactCountry></ContactCountry>
    <SalaryFrom></SalaryFrom>
    <SalaryTo></SalaryTo>
    <SalaryTime>PerYear</SalaryTime>
    <JobTypeFullTime>0</JobTypeFullTime>
    <JobTypePartTime>0</JobTypePartTime>
    <JobTypeContract>1</JobTypeContract>
    <JobTypePermanent>0</JobTypePermanent>
    <JobLocation locationId="615">
      <JobCity>Dallas</JobCity>
      <ClientJobCity></ClientJobCity>
      <JobState>TX</JobState>
      <JobCountry>US</JobCountry>
      <JobZipCode></JobZipCode>
    </JobLocation>
    <JobCategory>Human Resources/Recruiting</JobCategory>
    <AffiliateXCode>xaimcodx</AffiliateXCode>
    <AffiliateCompanyId>309263</AffiliateCompanyId>
    <ContactType contactTypeId="16">PLEASE APPLY ONLINE AT A URL</ContactType>
    <JobCode>1300080</JobCode>
  </Job>
</Jobs>
Whoever finds and fixes the bug has a bunch of beers on me next time he/she visits Prague.
"According to section 4.7.8.7.8.87.7 of the XML specifications lines should be no longer than 876 characters" is not an answer I would like. Though it would not come as a big surprise either.
Jenda
XML sucks. Badly. SOAP on the other hand is the most powerful vacuum pump ever invented.