
Hello. I'll cut to the chase here. I posted a LONG message earlier today, and wound up not sending it, as I got past that hurdle, and now I'm on another. "$/" was suggested to me once when I wanted to use a regex and have it go over a newline character. And it worked. I'm sucking data from a few hundred HTML files. The code that was given to me was:

local $/ = "<br>"

Since in that case, there was always a newline character after <br>. So, what I'm doing now is taking a much larger string (several hundred characters) after matching a similar regular expression. After checking the old Cookbook, I decided slurp mode (page 274 or so) was the best idea. My regular expression is working, except for one thing: I can't seem to get the whitespace out. The string I grab shows up just as it does in the original (crappy FrontPage-produced) HTML file. It looks like this:

This would be the first line.  It's fine and dandy.
         Something tends to go wrong about here though.
         And of course the next line would be here.
         The next line here.  
         And so on and so on.

I assumed I could substitute the whitespace away, but that's not happening. Here's a cut-down version of the code that generates this.

#!/usr/bin/perl
#
#
use strict;
use warnings;

my $directory = "/path/to/html/files/";
my ($line, $base, $file, $description);

undef $/;
opendir ( DIR, $directory ) or die "Can't open $directory $!";
while ( my $base = readdir( DIR ) ) {
     if ( $base =~ /.htm/ ) {
          $file = $directory . $base;
          open ( FILE, $file ) or die "Can't open $file $!";
          my $whole_file = <FILE>;
          if ( $whole_file =~ /.+?Title:.*?<\/b>*(.*?)<br>*\s*<br>\s*(.+?)</si ) {
               $description = $2;
               $description =~ s/\s{2,}/ /s; # I tried many variations of this to no avail
               $description =~ s/&nbsp;/ /s;
               $description =~ s/<.+?>/ /s;
               print "$description\n\n";
          }
     }
}

So, maybe an explanation of what exactly $/ does would get me started. If I'm understanding it right, it can redefine the newline character, no? Or maybe my substitution of whitespace is just way off. I thought my code above would replace anywhere from 2 to an infinite number of whitespace characters with a big ole donut, but that didn't work. Thanks.
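
For anyone following along, here is a minimal sketch of the two things in question, not a definitive fix. It assumes the problem is only about how $/ splits input and about collapsing runs of whitespace. The sample string and the helper names read_records and tidy are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample standing in for one of the HTML files.
my $html = "first line<br>\nsecond   line<br>\n     third&nbsp;line<br>\n";

# $/ is the input record separator. It does not redefine the newline
# character; it tells readline where one "line" (record) ends.
# The default is "\n"; undef means "no separator", i.e. slurp mode.
sub read_records {
    my ($text, $separator) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Hypothetical cleanup helper. Without /g, s/// replaces only the FIRST
# match; /s only changes what "." matches and has no effect on \s.
sub tidy {
    my ($text) = @_;
    $text =~ s/&nbsp;/ /g;        # every entity, not just the first
    $text =~ s/<[^>]+>/ /g;       # every tag becomes a space
    $text =~ s/\s+/ /g;           # every whitespace run becomes one space
    $text =~ s/^\s+|\s+$//g;      # trim both ends
    return $text;
}

my @by_br   = read_records($html, "<br>");   # one record per <br>
my @slurped = read_records($html, undef);    # whole input as one record

print scalar(@by_br), " records when \$/ is '<br>'\n";
print scalar(@slurped), " record when \$/ is undef\n";
print tidy($slurped[0]), "\n";
```

Using local inside a sub (or a bare block) keeps the changed $/ from leaking into the rest of the program, which a plain "undef $/;" at file scope does not do.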

In reply to help substituting whitespaces? by Anonymous Monk
