cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day bros. The following snippet:
#!/usr/bin/perl -w
use strict;
use Date::Manip;
use HTML::TreeBuilder;
my $htm = '
<html><div class="posthead">
<span class="postdate new"><span class="date">14th August 2017,&nbsp;<span class="time">21:07</span></span></span>
<span class="nodecontrols"><a name="post27949278" href="threads/2360460-product-reviews.htm" class="postcounter">#1937</a></span>
</div></html>';
my $tree = HTML::TreeBuilder->new_from_content($htm);
my $postdate = $tree->look_down('class','date')->as_text();
print "postdate: $postdate\n";
print "postdate parsed: ",ParseDate($postdate),"\n";
my $timestamp = '14th August 2017, 21:07';
print "string parsed: ",ParseDate($timestamp),"\n";
yields output:
postdate: 14th August 2017, 21:07
postdate parsed:
string parsed: 2017081421:07:00
So it fails to parse a date when it's passed to ParseDate as the contents of a variable gotten with HTML::Element, but if I take the exact same text, assign it to a variable as a string literal, and pass it to ParseDate, it parses fine. I've debugged into Date::Manip and it seems to be getting the same string in both cases. Anyone know what's going on here?!?

Replies are listed 'Best First'.
Re: Weird Date::Manip DateParse fail
by Corion (Patriarch) on Aug 17, 2017 at 18:37 UTC

    Maybe this:

    14th August 2017,&nbsp;

    HTML-decodes not to a plain space after the comma but to \x{A0}, a non-breaking space that only looks like a plain space?
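
One way to check this guess is to print the code points of both strings instead of the strings themselves, since U+00A0 and U+0020 look identical on most terminals. The following is only a minimal sketch using core Perl: `codepoints` is a hypothetical helper, and `$from_html` is an assumption standing in for what `as_text()` returns once `&nbsp;` has been decoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Date::Manip or HTML::TreeBuilder):
# render every character of a string as its U+XXXX code point.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# Assumption: as_text() hands back the decoded &nbsp; as \x{A0}.
my $from_html = "2017,\x{A0}21:07";
# The hand-typed literal from the question has a plain space instead.
my $literal   = "2017, 21:07";

print "from HTML: ", codepoints($from_html), "\n";
print "literal:   ", codepoints($literal),   "\n";

# A direct test for the non-breaking space.
print "from HTML contains U+00A0\n" if $from_html =~ /\x{A0}/;
print "literal contains U+00A0\n"   if $literal   =~ /\x{A0}/;
```

If the two dumps differ only in the character after the comma, the strings are not actually the same, which would explain why only one of them parses.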

      For kicks, I ran it on a Windows system, and the evidence on the console supports Corion's conclusion (note the á where the &nbsp; would be, which showed as whitespace on the original example):

      M:\PerlMonks>perl parsedate.pl
      postdate: 14th August 2017,á21:07
      postdate parsed:
      string parsed: 2017081421:07:00

        Thanks guys. That fixed it. I guess I should have checked that before posting!
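
The exact change is not shown in the thread, but a fix along these lines would be to swap the non-breaking space for a plain one before parsing. This is a sketch under assumptions, not the original poster's code: `normalize_spaces` is a hypothetical helper, and `$postdate` is hard-coded here to stand in for the `as_text()` result so the example runs without any CPAN modules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every non-breaking space (U+00A0)
# with an ordinary space (U+0020) and return the cleaned copy.
sub normalize_spaces {
    my ($str) = @_;
    $str =~ tr/\x{A0}/ /;
    return $str;
}

# Assumption: this is what as_text() returned in the question,
# with the decoded &nbsp; written explicitly as \x{A0}.
my $postdate = "14th August 2017,\x{A0}21:07";

my $clean = normalize_spaces($postdate);
print "cleaned: $clean\n";

# $clean is now identical to the string literal from the question,
# so it can be passed to ParseDate($clean) from Date::Manip.
```

`tr///` is used here because only a single character is being swapped; a `s/\x{A0}/ /g` substitution would do the same job.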

      Thank you for this. I'm trying to capture some table data, and the &#160; entities in the source (which show up as &nbsp; entities when inspecting the HTML::Element as_HTML() output) come through in HTML::Element's as_text() method as weird characters. I couldn't figure out how to clean them with regexes. Now I just

      my $el = $_->as_text();
      my $nbsp = chr(160);
      $el =~ s/$nbsp/ /g;

      and all is well :)