Array problem when parsing HTML

SpacemanSpiff has asked for the wisdom of the Perl Monks concerning the following question:

Here's a snippet of the web page I've parsed:

<br> <span class="smalltype"><em>Date:</em></span>
Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38
pm <br> <span class="smalltype"><em>Subject:</em></span>

Using HTML::TokeParser, I've managed to assign the following to $GetDate:

Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38
pm

I'm assuming the module decodes the HTML somewhere along the way, because when I print the contents back, they come out as:

Tue Sep 13, 2005  10:38
pm

Now because I want to split this guy up and create a timestamp from it, I figure the best plan is to assign it to an array with the following statement:

my @date = split /\s+/, "$GetDate";

And here's the contents of the array:

0 
1 Tue Sep 13, 2005 
2 10:38
3 pm
4

Obviously, the non-breaking spaces (&nbsp;) are preventing me from getting the rest of it split apart. What do I do to clear this up?

I know there are some time modules out there, but I was going to do this manually, since I have the impression that's the best way to keep the speed up on an already deathly slow script. Is this a good plan, or am I just making more work for myself?

Thanks!


Replies are listed 'Best First'.
Re: Array problem when parsing HTML by polettix (Vicar) on Sep 14, 2005 at 22:05 UTC
A non-breakable space is, by definition, non-breakable :) - `\s` simply doesn't match. You can substitute it with plain spaces if you want. The following snippet will work with both decoded and encoded chunks of text:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Entities qw( decode_entities );

    my $EncodedDate = "Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38 pm";
    my $MixedDate = $EncodedDate . "\n------\n" . decode_entities($EncodedDate);

    # $nbsp will contain the decoded version of &nbsp;
    decode_entities(my $nbsp = '&nbsp;');

    # Now, substitute all flavours of non-breakable-spaces
    $MixedDate =~ s/&nbsp;|$nbsp/ /g;

    # Same split as before
    my @date = split /\s+/, $MixedDate;
    print "$_ $date[$_]\n" foreach 0 .. $#date;

    __END__
    0 Tue
    1 Sep
    2 13,
    3 2005
    4 10:38
    5 pm
    6 ------
    7 Tue
    8 Sep
    9 13,
    10 2005
    11 10:38
    12 pm

You'd probably prefer to split on `/[\s:,]+/`, anyway: it will get rid of the comma after the "13", and will split "10:38" as well.

Flavio
perl -ple'$_=reverse' <<<ti.xittelop@oivalf

Don't fool yourself.
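A minimal sketch of the `/[\s:,]+/` suggestion, assuming the text may hold either the literal `&nbsp;` entity or the already-decoded character (code point 0xA0). The helper name `split_date_fields` is made up for illustration and does not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): normalise non-breaking
# spaces, then split on whitespace, colons and commas in one go.
sub split_date_fields {
    my ($text) = @_;
    $text =~ s/&nbsp;|\xA0/ /g;      # literal entity or decoded NBSP -> plain space
    $text =~ s/^\s+|\s+$//g;         # trim, so split yields no empty leading field
    return split /[\s:,]+/, $text;   # a run of separators counts as one
}

my @fields = split_date_fields("Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38\npm");
print join('|', @fields), "\n";      # prints Tue|Sep|13|2005|10|38|pm
```

Trimming first matters because `split` keeps an empty leading field when the string starts with a separator, which is likely where the empty element 0 in the original array came from.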
Re: Array problem when parsing HTML by InfiniteSilence (Curate) on Sep 14, 2005 at 21:31 UTC
Why don't you use a regex to capture the relevant parts of the string?

    perl -e "$f=q|Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38|; $f=~s/\&nbsp;/ /g; if($f=~m/\S+\s+(\S+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)/){print $1 . $2 . $3 . $4 . $5;};"
    Sep1320051038

You can push them into an array or whatever and do what you want with it then.

Celebrate Intellectual Diversity
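One way to get the captures into an array, sketched under the assumption that the input looks like the sample in the question: in list context a successful match returns its captured groups directly. The sub name `parse_date_parts` and the extra optional am/pm group are additions for illustration, not part of the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (month, day, year, hour, minute, am/pm),
# or an empty list when the text does not match.
sub parse_date_parts {
    my ($text) = @_;
    $text =~ s/&nbsp;/ /g;   # same substitution as the one-liner
    # List context: the match returns the capture groups.
    return $text =~ m/\S+\s+(\S+)\s+(\d+),\s+(\d+)\s+(\d+):(\d+)\s*([ap]m)?/i;
}

my @date = parse_date_parts("Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38\npm");
print join(' ', @date), "\n";   # prints Sep 13 2005 10 38 pm
```

If the am/pm marker is absent, the last element is `undef`, so check it before use.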
Re: Array problem when parsing HTML by saberworks (Curate) on Sep 14, 2005 at 21:49 UTC
Or simply: `$GetDate =~ s/&nbsp;/ /g;`
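Once the entities are replaced, the timestamp the question is after can be built with the core `Time::Local` module. This is only a sketch: the helper name `date_to_epoch` is hypothetical, and treating the time as UTC (via `timegm`) is an assumption, since the thread does not say which time zone the page uses. Swap in `timelocal` if the times are local.

```perl
use strict;
use warnings;
use Time::Local qw( timegm );

# Month abbreviation -> 0-based month number, as Time::Local expects.
my %MONTH;
@MONTH{qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec )} = 0 .. 11;

# Hypothetical helper: returns epoch seconds, or nothing if parsing fails.
# ASSUMPTION: the time is interpreted as UTC.
sub date_to_epoch {
    my ($text) = @_;
    $text =~ s/&nbsp;/ /g;
    my ($mon, $mday, $year, $hour, $min, $ampm) =
        $text =~ m/(\w{3})\s+(\d+),\s+(\d{4})\s+(\d+):(\d+)\s*([ap]m)/i
        or return;
    return unless exists $MONTH{$mon};
    $hour %= 12;                          # 12 am -> 0, 12 pm -> 0 (+12 below)
    $hour += 12 if lc $ampm eq 'pm';      # convert to 24-hour clock
    return timegm(0, $min, $hour, $mday, $MONTH{$mon}, $year);
}

my $epoch = date_to_epoch("Tue&nbsp;Sep&nbsp;13,&nbsp;2005&nbsp; 10:38\npm");
print scalar gmtime($epoch), "\n";        # prints Tue Sep 13 22:38:00 2005
```

On the speed question, one regex match plus a `timegm` call per date is unlikely to be the bottleneck next to fetching and parsing the HTML, but profiling the script is the only way to be sure.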