help substituting whitespaces?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I'll cut to the chase here. I posted a LONG message earlier today, and wound up not sending it, as I got past that hurdle, and now I'm on another. "$/" was suggested to me once when I wanted to use a regex and have it go over a newline character. And it worked. I'm sucking data from a few hundred HTML files. The code that was given to me was:

local $/ = "<br>"

Since in that case, there was always a newline character after <br>. So what I'm doing now is taking a much larger string (several hundred characters) after matching a similar regular expression. After checking the old Cookbook, I decided slurp mode (page 274 or so) was the best idea. My regular expression is working, except for one thing: I can't seem to get the whitespace out. The string I grab shows up just as it does in the original (crappy FrontPage-produced) HTML file. It looks like this:

This would be the first line.  It's fine and dandy.
         Something tends to go wrong about here though.
         And of course the next line would be here.
         The next line here.  
         And so on and so on.

I assumed I could substitute whitespaces, but that's not happening. Here's a cut down version of the code that generates this.

#!/usr/bin/perl
#
#
use strict;
use warnings;

my $directory = "/path/to/html/files/";
my ($line, $base, $file, $description);

undef $/;
opendir ( DIR, $directory ) or die "Can't open $directory $!";
while ( my $base = readdir( DIR ) ) {
     if ( $base =~ /.htm/ ) {
          $file = $directory . $base;
          open ( FILE, $file ) or die "Can't open $file $!";
          my $whole_file = <FILE>;
          if ( $whole_file =~ /.+?Title:.*?<\/b>*(.*?)<br>*\s*<br>\s*(.+?)</si ) {
               $description = $2;
               $description =~ s/\s{2,}/ /s; # I tried many variations of this to no avail
               $description =~ s/&nbsp;/ /s;
               $description =~ s/<.+?>/ /s;
               print "$description\n\n";
          }
     }
}

So, maybe an explanation of what exactly $/ does would get me started. If I'm understanding it right, it can redefine the newline character, no? Or maybe my substitution of whitespace is just way off. I thought my code above would substitute anything from 2 to an infinite number of whitespaces with a big ole donut, but that didn't work. Thanks.


Replies are listed 'Best First'.
Re: help substituting whitespaces? by Abstraction (Friar) on Aug 07, 2003 at 13:56 UTC
So, maybe an explanation of what exactly $/ does would get me started.

From `perldoc perlvar`:

$/

The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to "undef" to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)

    local $/;           # enable "slurp" mode
    local $_ = <FH>;    # whole file now here
    s/\n[ \t]+/ /g;

Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)

Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:

    local $/ = \32768; # or \"32768", or \$var_containing_32768
    open my $fh, $myfile or die $!;
    local $_ = <$fh>;

will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces.

On VMS, record reads are done with the equivalent of "sysread", so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.
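To see the settings described above side by side, here is a minimal sketch. It reads the same made-up sample text through an in-memory filehandle with four different values of $/. The sample string and the `read_records` helper are assumptions for illustration only, not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for one of the HTML files.
my $data = "first<br>\nsecond<br>\nthird\n";

# Hypothetical helper: read every record from an in-memory
# filehandle using the given input record separator.
sub read_records {
    my ($separator) = @_;
    local $/ = $separator;    # previous value is restored when the sub returns
    open my $fh, '<', \$data or die "Can't open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @by_line  = read_records("\n");    # default: newline-terminated lines
my @by_br    = read_records("<br>");  # records end at the literal string "<br>"
my @slurped  = read_records(undef);   # slurp mode: the whole file is one record
my @by_bytes = read_records(\8);      # fixed-size records of at most 8 bytes

print "newline: ", scalar(@by_line),  " record(s)\n";
print "<br>:    ", scalar(@by_br),    " record(s)\n";
print "undef:   ", scalar(@slurped),  " record(s)\n";
print "\\8:      ", scalar(@by_bytes), " record(s)\n";
```

Using `local` inside a sub or block keeps the change from leaking into other reads, which a bare `undef $/;` at file scope does not.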
Re: help substituting whitespaces? by fourmi (Scribe) on Aug 07, 2003 at 14:07 UTC
Does your conversion of &nbsp; work? Also, you may get spurious multi-spaces, especially if there was &nbsp;<hr>&nbsp; in the string, so I'd put the 'clear multi-spaces' step last in that group to cover that. As for that step, any of `$description =~ s/\s/ /s; $description =~ s/\s+/ /s;` should work AFAIK. I don't think the 's' modifier at the end is strictly needed. Which variation have you tried? And are you just trying to replace multi-spaces with single ones at that step? Try a debug by replacing with a hyphen instead? Perhaps more questions than answers, but may be of some use?!
Re: Re: help substituting whitespaces? by Abstraction (Friar) on Aug 07, 2003 at 14:29 UTC
Try replacing the /s with a /g. `$description =~ s/\s+/ /g;`
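A small sketch of why the modifier matters, using a made-up string shaped like the output in the question. The /s modifier only changes what `.` matches, so the substitution still runs once; /g repeats it for every match.

```perl
use strict;
use warnings;

# Hypothetical input: indented lines like the ones in the question.
my $text = "first line.\n         second line.\n         third line.";

# /s: a single substitution. The first whitespace run is the lone space
# after "first", so replacing it with a space changes nothing visible.
(my $with_s = $text) =~ s/\s+/ /s;

# /g: every run of whitespace, newlines included, becomes one space.
(my $with_g = $text) =~ s/\s+/ /g;

print "with /s: [$with_s]\n";
print "with /g: [$with_g]\n";
```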
Re: Re: Re: help substituting whitespaces? by Anonymous Monk on Aug 07, 2003 at 15:01 UTC
Ah -- dat's it. Global is what did it. I had the /s stuck in there from a while back (copy and paste) and totally forgot about it, and 'global' totally slipped my mind in the process. Fantastic -- as always, the help is much appreciated. If this lesson has taught me one thing, it's to adhere strictly to a template... it's been driving me nuts getting this stuff cleaned up. :) PS - I finally registered. I'm the anonymous monk that posted.
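Putting the two suggestions from this sub-thread together (use /g, and collapse whitespace last), the clean-up step might look roughly like the sketch below. `clean_description` is a hypothetical helper name, the sample string is invented, and the final trim is an extra step not taken from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper, not code from the thread.
sub clean_description {
    my ($description) = @_;
    $description =~ s/<.+?>/ /gs;    # replace tags; /s lets a tag span a newline
    $description =~ s/&nbsp;/ /g;    # turn non-breaking-space entities into spaces
    $description =~ s/\s+/ /g;       # last: collapse every run of whitespace
    $description =~ s/^ | $//g;      # assumption: also trim a space at either end
    return $description;
}

# Invented sample resembling the FrontPage output in the question.
my $raw = "This is the first line.\n         Second&nbsp;&nbsp;line <hr>  here.\n";
print clean_description($raw), "\n";
```

Collapsing whitespace last matters because the tag and entity substitutions each leave new spaces behind.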
Re: help substituting whitespaces? by l2kashe (Deacon) on Aug 07, 2003 at 14:59 UTC
I will toss out the obligatory comment that CPAN is your friend, especially the HTML:: family of modules. It can be very, very tricky to deal with HTML via a regex for much beyond extremely simple (?:X|HT|DHT)ML. You could alleviate a lot of your parsing headaches, and get down to the meat of what you want to do, all with a simple use statement. With that said, if you are doing this simply as an exercise in data munging, have fun! :) `perldoc perlvar` is the place to go for all documentation about variables, or at least a nice springboard. If you are ever lost about what perldoc to read, try `perldoc perltoc` or even `perldoc perl`. use perl;
Re: help substituting whitespaces? by snax (Hermit) on Aug 07, 2003 at 19:11 UTC
Squashing whitespace is most efficiently done with tr: `$description =~ tr/ / /s;` This will leave (at most) a single leading/trailing space if it exists, though. Want to squash all the internal spaces and lose the leading/trailing spaces? Try this one: `$description = join ' ', split ' ', $description;` Read up on join and split if you want to see why this works.
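A quick comparison of the two approaches on an invented string. One thing worth noting: `tr/ / /s` as written squeezes runs of literal spaces only, so newlines and tabs survive. The middle variant, which also maps tabs and newlines to a space, is an assumption about what is wanted here rather than something from the post.

```perl
use strict;
use warnings;

# Hypothetical input with leading, trailing, and mixed whitespace.
my $text = "  first line.\n\t second   line.  ";

# Squeezes runs of spaces only; the "\n" and "\t" are left in place.
(my $squeezed = $text) =~ tr/ / /s;

# Assumed variant: map tabs and newlines to a space as well, then squeeze.
(my $all_ws = $text) =~ tr/ \t\n/ /s;

# split ' ' drops leading whitespace and splits on any whitespace run,
# so join ' ' rebuilds the text with single spaces and no padding.
my $joined = join ' ', split ' ', $text;

print "[$squeezed]\n[$all_ws]\n[$joined]\n";
```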