SpacemanSpiff has asked for the wisdom of the Perl Monks concerning the following question:

Here's a snippet of the web page I've parsed:
<br> <span class="smalltype"><em>Date:</em></span> Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38 pm <br> <span class="smalltype"><em>Subject:</em></span>
Using TokeParser, I've managed to assign the following to $GetDate:
Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38 pm
Probably as a function of the module, I'm assuming it's compiling the HTML so that when I print back the contents, it comes out as:
Tue Sep 13, 2005 10:38 pm
Now because I want to split this guy up and create a timestamp from it, I figure the best plan is to assign it to an array with the following statement:
my @date = split /\s+/, "$GetDate";
And here's the contents of the array:
0 1 Tue Sep 13, 2005  2 10:38 3 pm 4
Obviously, the nonbreakable spaces (&nbsp;) are preventing me from getting the rest of it split apart. What do I do to clear this up?

I know there are some time modules out there, but I was just going to try this operation manually, as I have the impression that's the best way to keep the speed up on an already deathly slow script. Is this a good plan, or am I just making more work for myself?

Thanks!

Replies are listed 'Best First'.
Re: Array problem when parsing HTML
by polettix (Vicar) on Sep 14, 2005 at 22:05 UTC
    A non-breakable space is, by definition, non-breakable :) - \s simply doesn't match. You can substitute it with plain spaces if you want. The following snippet will work with both decoded and encoded chunks of text:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Entities qw( decode_entities );

    my $EncodedDate = "Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38 pm";
    my $MixedDate = $EncodedDate . "\n------\n" . decode_entities($EncodedDate);

    # $nbsp will contain the decoded version of &nbsp;
    decode_entities(my $nbsp = '&nbsp;');

    # Now, substitute all flavours of non-breakable-spaces
    $MixedDate =~ s/&nbsp;|$nbsp/ /g;

    # Same split as before
    my @date = split /\s+/, $MixedDate;
    print "$_ $date[$_]\n" foreach 0 .. $#date;

    __END__
    0 Tue
    1 Sep
    2 13,
    3 2005
    4 10:38
    5 pm
    6 ------
    7 Tue
    8 Sep
    9 13,
    10 2005
    11 10:38
    12 pm
    You'd probably prefer to split on /[\s:,]+/, anyway: it will get rid of the comma after the "13", and will split "10:38" as well.
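To illustrate that last suggestion, here is a minimal sketch that needs no extra modules. It assumes the decoded non-breaking space shows up as code point 160 (`\xA0`) in your string; `date_fields` is just a hypothetical helper name, not something from TokeParser.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the raw date text into a list of fields.
sub date_fields {
    my ($raw) = @_;

    # Replace both the literal entity and the decoded character
    # (assumed here to be code point 160) with a plain space.
    (my $text = $raw) =~ s/&nbsp;|\xA0/ /g;

    # Split on runs of whitespace, colons and commas; the grep drops
    # an empty leading field if the text starts with a separator.
    return grep { length } split /[\s:,]+/, $text;
}

my @date = date_fields("Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38 pm");
print join('|', @date), "\n";    # Tue|Sep|13|2005|10|38|pm
```

The comma after "13" is gone and the hour and minute land in separate elements, so the fields can go straight into whatever builds the timestamp.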

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
Re: Array problem when parsing HTML
by InfiniteSilence (Curate) on Sep 14, 2005 at 21:31 UTC
    Why don't you use a regex to capture the relevant parts of the string?

    perl -e "$f=q|Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38|; $f=~s/\&nbsp;/ /g; if($f=~m/\S+\s+(\S+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)/){print $1 . $2 . $3 . $4 . $5;};"
    Sep1320051038
    You can push them into an array or whatever and do what you want with it then.
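As a rough sketch of the "do what you want with it" step, the captures can be fed to the core Time::Local module to get an epoch timestamp. Some assumptions here that are not in the original post: the time is treated as UTC (hence `timegm`; swap in `timelocal` if the page shows local time), the am/pm marker is captured as well, and `date_to_epoch` and `%MONTH` are made-up names for illustration.

```perl
use strict;
use warnings;
use Time::Local qw( timegm );

# Map month abbreviations to the 0-based month numbers Time::Local expects.
my @names = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
my %MONTH;
@MONTH{@names} = 0 .. 11;

# Hypothetical helper: returns epoch seconds, or undef if the text doesn't parse.
sub date_to_epoch {
    my ($raw) = @_;

    # Normalise both encoded and decoded non-breaking spaces.
    (my $text = $raw) =~ s/&nbsp;|\xA0/ /g;

    my ($mon, $mday, $year, $hour, $min, $ampm) =
        $text =~ /\S+\s+(\w{3})\s+(\d{1,2}),\s+(\d{4})\s+(\d{1,2}):(\d{2})(?:\s*([ap]m))?/i
        or return;

    $mon = ucfirst lc $mon;
    return unless exists $MONTH{$mon};

    # Convert a 12-hour clock to 24-hour when an am/pm marker is present.
    if (defined $ampm) {
        $hour %= 12;
        $hour += 12 if lc $ampm eq 'pm';
    }

    # Assumption: the time is UTC; seconds are not on the page, so use 0.
    return timegm(0, $min, $hour, $mday, $MONTH{$mon}, $year);
}

my $epoch = date_to_epoch("Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38 pm");
print scalar gmtime($epoch), "\n";    # Tue Sep 13 22:38:00 2005
```

Time::Local ships with Perl, so this adds no install step. Whether it is fast enough for an already slow script is something to measure rather than guess.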

    Celebrate Intellectual Diversity

Re: Array problem when parsing HTML
by saberworks (Curate) on Sep 14, 2005 at 21:49 UTC