Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I'll cut to the chase here. I wrote a LONG message earlier today and wound up not sending it, as I got past that hurdle, and now I'm on another. "$/" was suggested to me once when I wanted to use a regex and have it go over a newline character. And it worked. I'm sucking data from a few hundred HTML files. The code that was given to me was:
local $/ = "<br>"
Since in that case, there was always a newline character after <br>. So, what I'm doing now is taking a much larger string (several hundred characters) after matching a similar regular expression. After checking the old Cookbook, I decided slurp mode (page 274 or so) was the best idea. My regular expression is working, except for one thing. I can't seem to get the whitespace out. The string I grab shows up just as it does in the original (crappy Frontpage-produced) HTML file. Looks like this.
This would be the first line. It's fine and dandy. Something tends to go wrong about here though. And of course the next line would be here. The next line here. And so on and so on.
I assumed I could substitute whitespaces, but that's not happening. Here's a cut down version of the code that generates this.
#!/usr/bin/perl
#
#
use strict;
use warnings;

my $directory = "/path/to/html/files/";
my ($line, $base, $file, $description);
undef $/;

opendir ( DIR, $directory ) or die "Can't open $directory $!";
while ( my $base = readdir( DIR ) ) {
    if ( $base =~ /.htm/ ) {
        $file = $directory . $base;
        open ( FILE, $file ) or die "Can't open $file $!";
        my $whole_file = <FILE>;
        if ( $whole_file =~ /.+?Title:.*?<\/b>*(.*?)<br>*\s*<br>\s*(.+?)</si ) {
            $description = $2;
            $description =~ s/\s{2,}/ /s; # I tried many variations of this to no avail
            $description =~ s/&nbsp;/ /s;
            $description =~ s/<.+?>/ /s;
            print "$description\n\n";
        }
    }
}
So, maybe an explanation of what exactly $/ does would get me started. If I'm understanding it right, it can redefine the newline character, no? Or maybe my substitution of whitespace is just way off. I thought my code above would substitute 2 or an infinite number of whitespace characters with a big ole donut, but that didn't work. Thanks.

Replies are listed 'Best First'.
Re: help substituting whitespaces?
by Abstraction (Friar) on Aug 07, 2003 at 13:56 UTC
    So, maybe an explanation of what exactly $/ does would get me started.

    From perldoc perlvar

         $/      The input record separator, newline by default.
                 This influences Perl's idea of what a "line" is.
                 Works like awk's RS variable, including treating
                 empty lines as a terminator if set to the null
                 string.  (An empty line cannot contain any spaces or
                 tabs.)  You may set it to a multi-character string
                 to match a multi-character terminator, or to "undef"
                 to read through the end of file.  Setting it to
                 "\n\n" means something slightly different than
                 setting to "", if the file contains consecutive
                 empty lines.  Setting to "" will treat two or more
                 consecutive empty lines as a single empty line.
                 Setting to "\n\n" will blindly assume that the next
                 input character belongs to the next paragraph, even
                 if it's a newline.  (Mnemonic: / delimits line
                 boundaries when quoting poetry.)
    
                     local $/;           # enable "slurp" mode
                     local $_ = <FH>;    # whole file now here
                     s/\n \t+/ /g;
    
                 Remember: the value of $/ is a string, not a regex.
                 awk has to be better for something. :-)
    
                 Setting $/ to a reference to an integer, scalar
                 containing an integer, or scalar that's convertible
                 to an integer will attempt to read records instead
                 of lines, with the maximum record size being the
                 referenced integer.  So this:
    
                     local $/ = \32768; # or \"32768", or \$var_containing_32768
                     open my $fh, $myfile or die $!;
                     local $_ = <$fh>;
    
                 will read a record of no more than 32768 bytes from
                 FILE.  If you're not reading from a record-oriented
                 file (or your OS doesn't have record-oriented
                 files), then you'll likely get a full chunk of data
                 with every read.  If a record is larger than the
                 record size you've set, you'll get the record back
                 in pieces.
    
                 On VMS, record reads are done with the equivalent of
                 "sysread", so it's best not to mix record and non-
                 record reads on the same file.  (This is unlikely to
                 be a problem, because any file you'd want to read in
                 record mode is probably unusable in line mode.)
                 Non-VMS systems do normal I/O, so it's safe to mix
                 record and non-record reads of a file.
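To make the quoted documentation a bit more concrete, here is a minimal sketch of how the value of $/ changes what a single read returns. The sample string and the helper name read_records are made up for illustration, and an in-memory filehandle stands in for one of the HTML files.

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for a real file.
my $data = "one<br>\ntwo<br>\nthree\n";

# Hypothetical helper: read every record from $text using the given separator.
sub read_records {
    my ($text, $separator) = @_;
    local $/ = $separator;    # the old value is restored when the sub returns
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @lines  = read_records($data, "\n");      # the default: one record per line
my @chunks = read_records($data, "<br>");    # a record ends after each literal "<br>"
my @whole  = read_records($data, undef);     # slurp mode: everything in one record

printf "%d lines, %d chunks, %d slurped\n",
    scalar @lines, scalar @chunks, scalar @whole;
```

So $/ does not redefine the newline character. It only tells the read operator where one record stops and the next begins, and the newlines stay in the data either way.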
    
    
    
Re: help substituting whitespaces?
by fourmi (Scribe) on Aug 07, 2003 at 14:07 UTC
    Does your conversion of &nbsp; work? Also, you may get spurious multi-spaces, esp. if there was &nbsp;&nbsp;<hr>&nbsp;&nbsp; in the string.

    So I'd put the 'clear multi-spaces' step last in that group to cover that. As for that step...
    any of
    $description =~ s/\s*/ /s;
    $description =~ s/\s+/ /s;
    should work AFAIK. I don't think the 's' modifier at the end is strictly needed. Which variations have you tried? And are you just trying to replace multi-spaces with single ones at that step? Try debugging by replacing with a hyphen instead?

    Perhaps more questions than answers, but may be of some use?!
      Try replacing the /s with a /g.
      $description =~ s/\s+/ /g;
        Ah -- dat's it. Global is what did it. I had the /s stuck in there from a while back (copy and paste) and totally forgot about it, and 'global' totally slipped my mind in the process. Fantastic -- as always, the help is much appreciated. If this lesson has taught me one thing, it's to adhere strictly to a template... it's been driving me nuts getting this stuff cleaned up. :) PS - I finally registered. I'm the anonymous monk that posted.
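For anyone finding this later, a small sketch of the difference, assuming a made-up string that resembles the Frontpage output described in the question. The /s modifier only changes what . matches (it lets . match a newline) and has no effect on \s, while /g is what makes the substitution repeat for every match.

```perl
use strict;
use warnings;

# Hypothetical sample resembling the description text from the question.
my $raw = "First line.\n      Second&nbsp;&nbsp;line.<br>\n\n   Third   line.";

# Without /g only the first run of whitespace is replaced.
(my $once = $raw) =~ s/\s{2,}/ /s;

# With /g every match is replaced. Entities and tags go first, so the
# spaces they leave behind are squashed by the final whitespace pass.
my $clean = $raw;
$clean =~ s/&nbsp;/ /g;
$clean =~ s/<.+?>/ /gs;    # /s here lets . match a newline inside a tag
$clean =~ s/\s+/ /g;

print "once:  [$once]\n";
print "clean: [$clean]\n";
```

This also follows the ordering suggested above: the whitespace squash runs last, so anything the &nbsp; and tag substitutions leave behind gets collapsed as well.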
Re: help substituting whitespaces?
by l2kashe (Deacon) on Aug 07, 2003 at 14:59 UTC

    I will toss out the obligatory comment that CPAN is your friend, especially the HTML:: family of modules. It can be very, very tricky to deal with HTML via a regex for much beyond extremely simple (?:X|HT|DHT)ML etc... You could alleviate a lot of your parsing headaches, and get down to the meat of what you want to do, all with a simple use statement.

    With that said, if you are doing this simply as an exercise in data munging, have fun! :) perldoc perlvar is the place to go for all documentation about variables, or at least a nice springboard. If you are ever lost about which perldoc to read, try perldoc perltoc or even perldoc perl.

    use perl;

Re: help substituting whitespaces?
by snax (Hermit) on Aug 07, 2003 at 19:11 UTC
    Squashing whitespace is most efficiently done with tr:
    $description =~ tr/ / /s;
    This will leave (at most) a single leading/trailing space if it exists, though. Want to squash all the internal spaces and lose the leading/trailing spaces? Try this one:
    $description = join ' ', split ' ', $description;
    Read up on join and split if you want to see why this works.
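A quick side-by-side of the two approaches, as a sketch on a made-up input string. One thing worth noticing is that tr/ / /s as written only squeezes literal space characters, so newlines and tabs survive. The second tr line is a variation (not from the post above) that lists the other whitespace characters explicitly.

```perl
use strict;
use warnings;

# Hypothetical input with leading/trailing spaces, newlines and a tab.
my $description = "  This   would be\n\n   the first\tline.  ";

# tr/ / /s squeezes runs of the space character only; newlines and tabs stay.
(my $squeezed = $description) =~ tr/ / /s;

# Listing the whitespace characters maps each one to a space, then squeezes
# the resulting runs down to a single space.
(my $all_ws = $description) =~ tr/ \t\n\r\f/ /s;

# split ' ' skips leading whitespace and splits on any run of whitespace,
# so joining the pieces with one space also trims both ends.
my $trimmed = join ' ', split ' ', $description;

print "squeezed: [$squeezed]\n";
print "all_ws:   [$all_ws]\n";
print "trimmed:  [$trimmed]\n";
```

The join/split version is probably the one wanted for the descriptions in this thread, since the slurped text contains newlines as well as spaces.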