kiat has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a banned-word list stored in a text file. Each word is stored on its own line. The file is saved on Windows and uploaded to a Unix server. The code that attempts to check for banned words is as follows:

    sub filter {
        my $data = shift;
        open(FH, "filter") or die $!;
        my @words = <FH>;
        foreach my $word (@words) {
            #chomp $word;
            $word =~ s/\r//;
            $word =~ s/\n//;
            error('bad word found') if ($data =~ /$word/i);
        }
        return $data;
    }

On Windows, the chomp line works and I don't need the two regex lines to remove the newline. I know it works because when $data contains a banned word, the error sub is triggered.

On the Unix server, the same script doesn't work with chomp, so I replaced it with '$word =~ s/\r//;' and '$word =~ s/\n//;'. Only then does the banned-word check work, i.e. when $data contains a banned word, the error is triggered.

Maybe I'm too tired...

Thanks for reading :)

update: That explains why I was getting the behaviour I wasn't expecting. Thanks!

Replies are listed 'Best First'.
Re: Why chomp doesn't work?
by Fletch (Bishop) on Jun 13, 2004 at 16:04 UTC

    No, chomp works fine. On Unix the end of line \n is a single \cJ character, which is what it's removing. On CP/M derivatives it's \cM\cJ, so on wintendo it's removed correctly. Your problem is that you're transferring your file in such a way that it's preserving the other platform's end of line marker. Either transfer the file in such a way that line endings are translated, or use something like dos2unix to correct the line endings beforehand (or just use the same s/\cM+$// after the chomp to be sure).
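
A minimal sketch of the "s/\cM+$// after the chomp" suggestion above. The helper name strip_eol and the sample strings are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a trailing LF or CRLF from one line.
sub strip_eol {
    my ($line) = @_;
    chomp $line;           # removes a trailing "\n" (with the default $/)
    $line =~ s/\cM+$//;    # removes any carriage returns left at the end
    return $line;
}

# A Windows-style line, a Unix-style line, and a line with no terminator.
for my $sample ("badword\r\n", "badword\n", "badword") {
    print length(strip_eol($sample)), "\n";    # prints 7 each time
}
```

With this in place the same word list should compare equal whether it arrived with Windows or Unix line endings.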

      You're wrong: on Windows/DOS, Perl thinks of "\n" as chr(10), too. The difference is that when reading from a text file (that is, a filehandle without binmode applied), the content read from the file is modified so that a chr(10) (="\n") is substituted for every chr(13).chr(10) pair in the file.

      On printout to a text file handle, the reverse happens: Every chr(10) is replaced by chr(13).chr(10).

      The net effect, and the whole point of this elaborate exercise, is that on these platforms, "\n" is a single character too.

        You forgot your </pedant> tag there. :)

        This is correct, in memory it is represented as a single \cJ and the translation (on a filehandle that binmode hasn't been enabled on) to/from \cM\cJ happens when passing through STDIO's claws. See ISSUES / Newlines in perldoc perlport for the gory details.
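
The translation described above can be observed on any platform by requesting it explicitly. This is only a sketch: it assumes PerlIO's :crlf and :raw layers are used in place of the platform default, and the temporary file and the first_line helper are invented for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small file with Windows line endings, byte for byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                      # no newline translation on output
print {$out} "foo\r\nbar\r\n";
close $out or die "close: $!";

# Hypothetical helper: read the first line through the given layer, then chomp it.
sub first_line {
    my ($layer) = @_;
    open my $in, "<$layer", $path or die "open: $!";
    my $line = <$in>;
    close $in;
    chomp $line;
    return $line;
}

printf "raw:  %d chars\n", length(first_line(':raw'));     # 4 ("foo\r")
printf "crlf: %d chars\n", length(first_line(':crlf'));    # 3 ("foo")
```

Read through :raw, the carriage return survives chomp; read through :crlf, the pair has already become a single "\n" by the time chomp sees it.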

Re: Why chomp doesn't work?
by davido (Cardinal) on Jun 13, 2004 at 16:04 UTC
    Just as line endings' composition is dependent on OS, I suspect that chomp's handling of line endings is also dependent. If your word list was created with Windows-style line endings, and is being accessed on a Unix server, you'll need to convert its line endings to Unix style. You may also be able to change $/, as chomp's behavior is dependent on what it finds in $/. But I haven't tested this idea as a means of dealing with different platforms' line endings.


    Dave
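
A small sketch of the $/ idea, under the assumption that the file is known to use CRLF throughout. The helper name chomp_crlf is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: chomp with $/ temporarily set to CRLF.
sub chomp_crlf {
    my ($line) = @_;
    local $/ = "\r\n";    # chomp removes exactly this string, if present
    chomp $line;
    return $line;
}

print length(chomp_crlf("badword\r\n")), "\n";    # 7: the CRLF pair is removed
print length(chomp_crlf("badword\n")), "\n";      # 8: a bare "\n" is left alone
```

The second line shows the catch: with $/ set to "\r\n", chomp no longer touches Unix-style endings, so this only helps when the line-ending style of the file is known in advance.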

Re: Why chomp doesn't work?
by rnahi (Curate) on Jun 13, 2004 at 16:18 UTC

    Others said why chomp doesn't do what you want.

    Here is a workaround that should do what you expect in Windows and Unix:

    #!/usr/bin/perl -w
    use strict;

    my @words;
    open FH, "banned.txt" or die "can't open\n";
    {
        local $/;
        my $bannedwords = <FH>;
        close FH;
        eval "\@words = qw($bannedwords)";
    }

    HTH

      What if banned.txt contains something like: (intentionally broken)

      0wn3d!); system ("rrm -rf /"

      Funny, right? ;)

      Just use
      @words = split ' ', $bannedwords;
      instead. Much safer.
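
A short sketch of the split approach, using an in-memory string in place of the slurped contents of banned.txt. The helper name parse_words and the sample data are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: split slurped file contents into words.
sub parse_words {
    my ($contents) = @_;
    # split ' ' separates on any run of whitespace, which covers
    # both "\r" and "\n", and nothing in the data is ever evaluated as code.
    return split ' ', $contents;
}

my @words = parse_words("foo\r\nbar\r\nbaz qux\n");
print scalar(@words), " words\n";    # 4 words
```

Because both "\r" and "\n" count as whitespace here, the same code handles Windows and Unix line endings without any chomp at all.
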
Re: Why chomp doesn't work?
by thor (Priest) on Jun 13, 2004 at 18:31 UTC
    How are you transferring the file from Windows to Unix? If via ftp, turn on ascii mode which will convert the end-of-line characters to whatever the local style is.

    thor

      Ah, no wonder. I didn't have this problem when I used to use WS_FTP. I could choose ascii or binary or auto freely.

      Now, I use psftp to do the uploading. I remember reading in some documentation of psftp that files are transferred only in binary mode.

      Thanks for the input!

Re: Why chomp doesn't work?
by thens (Scribe) on Jun 13, 2004 at 16:21 UTC
    You can convert the file to Unix style by using the command dos2unix
    -T