achiles has asked for the wisdom of the Perl Monks concerning the following question:

I have a message board, a file, that looks like this:
From Sam (Wed Jul 11 18:12:08 2001):

This space intentionally filled.
_________________________________________________________
From Blah Blah (Wed Jul 10 01:05:55):

Message
_________________________________________________________
Rinse and repeat. New posts are dumped to the top of the file, and a blank line always follows the header. Anything is fair game between that and the separator.
In a busy place, this file can be filled with lots of random crap, and can bury important and useful info quite quickly.
What I want is a Perl spell I can cast over this file to separate the signal from the noise by checking the name of the person who posted, e.g., posts from "Administrator" would stay around much longer than posts from "Joe Luser."
The tough part for me, being a Perl neophyte, is finding the good ones.
I plan to go about it by grabbing the good posts and throwing them into a new file, and replacing the old file with the new file.

Replies are listed 'Best First'.
Re: Message Board Mangling
by tachyon (Chancellor) on Jul 12, 2001 at 06:24 UTC

    Hi, you will probably find that this does what you want. The comments are in the code. I read from the <DATA> filehandle, but I have included code that shows how to read from a file and how to save the results.

    #!/usr/bin/perl -w
    use strict;

    my @saves;    # we will save our stuff in here
    my @ok_posters = qw( Administrator Sam );

    # generate a regex to match our approved posters
    # (two differently named variables, so -w does not warn that
    # one "my" declaration masks another)
    my $ok_names = join "|", @ok_posters;
    my $ok_regex = qr/(?:$ok_names)/;

    {
        # use local to localise the change to the input record
        # separator to this block only, saves nasty surprises
        local $/ = "_________________________________________________________\n";

        # read in data from file handle one record at a time
        # each record is stored in the magical $_ var
        while (<DATA>) {
            # add data to our array if it matches
            # the required criteria
            push @saves, $_ if /^From\s+$ok_regex/;
        }
    }

    print @saves;

    # this is how you read from the file message.txt instead of <DATA>
    #{
    #    local $/ = "_________________________________________________________\n";
    #    open (FILE, "<message.txt") or die "Unable to open file for reading, Perl says: $!\n";
    #    while (<FILE>) {
    #        push @saves, $_ if /^From\s+$ok_regex/;
    #    }
    #    close FILE;
    #}

    # uncomment this to save the results
    # save("c:/mysaves.txt", @saves);

    sub save {
        my $file  = shift @_;
        my @stuff = @_;
        open (FILE, ">$file") or die "Unable to open file for writing, Perl says: $!\n";
        print FILE @stuff;
        close FILE;
    }

    __DATA__
    From Sam (Wed Jul 11 18:12:08 2001):

    This space intentionally filled.
    _________________________________________________________
    From Administrator (Wed Jul 10 01:05:55):

    Message
    _________________________________________________________
    From Sam (Wed Jul 11 18:12:08 2001):

    This space intentionally filled.
    _________________________________________________________
    From Blah Blah (Wed Jul 10 01:05:55):

    Message
    _________________________________________________________
    From Sam (Wed Jul 11 18:12:08 2001):

    This space intentionally filled.
    _________________________________________________________
    From Blah Blah (Wed Jul 10 01:05:55):

    Message
    _________________________________________________________

    Hope this helps

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Message Board Mangling
by bikeNomad (Priest) on Jul 12, 2001 at 06:24 UTC
    Why bother with re-writing a file every time? Look at, say, BerkeleyDB, which maintains an efficient key/value pair database.

    What you're talking about is being able to re-order existing records. The best strategy for this might be to use a single BerkeleyDB database, and re-define the keys from time to time to re-define the order of the posts. So in the value of each record, you maintain the time of the post (so you can get latest-first), and the ID or quality ranking of the poster, as well as the message. You generate your keys so that the most significant few (say, 4) bytes supply a serial number that gives the number of re-order passes. This way, as you re-order records, they go to the end of the database, and you can then delete them from the beginning.

    Each time you re-order, you do this:

    1. Increment the re-order sequence number.
    2. For each record that was not re-ordered this time through:
         - read the record
         - re-compute the key based on the age and poster ID
         - write the new key/value pair (with the new re-order number)
         - delete the old key/value pair

    The beauty of this is that BerkeleyDB will efficiently reclaim the file space with a minimum of work. When you have new messages, you just give them a key that will allow you to find them before the previous messages (cursors in BerkeleyDB can go forwards (DB_NEXT) or backwards (DB_PREV)).
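The key scheme described above can be sketched in plain Perl. This is only an illustration under stated assumptions: an ordinary hash stands in for the database, the 12-byte key layout and the `make_key` helper are hypothetical, and the rank and epoch values are made up. It shows why packing the re-order pass number into the most significant bytes makes a byte-wise key sort (the order a BTREE database would keep) put the pass first, then the poster rank, then the newest post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical key layout (an assumption, not necessarily bikeNomad's):
#   4 bytes  re-order pass number  (big-endian, most significant)
#   4 bytes  poster rank           (lower sorts first)
#   4 bytes  post time, inverted   (so that newer sorts first)
sub make_key {
    my ($pass, $rank, $time) = @_;
    return pack 'N N N', $pass, $rank, 0xFFFFFFFF - $time;
}

# A plain hash stands in for the database in this sketch.
my %db;
my $pass = 1;

# Made-up ranks and epoch seconds, for illustration only.
$db{ make_key($pass, 9, 994_875_128) } = 'From Sam: This space intentionally filled.';
$db{ make_key($pass, 1, 994_727_155) } = 'From Administrator: Message';

# One re-order pass: rewrite every old record under the new pass number.
$pass++;
for my $old_key (keys %db) {
    my ($old_pass, $rank, $inverted_time) = unpack 'N N N', $old_key;
    next if $old_pass == $pass;    # already re-ordered this time through
    my $new_key = make_key($pass, $rank, 0xFFFFFFFF - $inverted_time);
    $db{$new_key} = delete $db{$old_key};
}

# Byte-wise string sort of the packed keys gives the display order.
my @ordered = map { $db{$_} } sort keys %db;
print "$_\n" for @ordered;
```

With these sample values the Administrator post prints before Sam's, because both records share the same pass number and the lower rank sorts first.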

      I would, except the format isn't up to me (Unless I do some pretty major C hacking, or more code begging :p)
        What if the "OK Posters" is just anyone who puts "savepost" in their nick somewhere? Will
        my @ok_posters = qw ( \savepost\ );
        my $ok_regex = join "|", @ok_posters;
        my $ok_regex = qr/(?:$ok_regex)/;
        
        grab a nick of "John Luser savepost"?
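As far as I can tell, the snippet as written would not do that. Inside `qw` the backslashes stay in the string, so the pattern ends in `\)`, which escapes the closing parenthesis and should stop the regex from compiling. Even without the backslashes, the test `/^From\s+$ok_regex/` only looks directly after "From", so a marker at the end of a nick would be missed. A minimal sketch of one alternative, assuming headers always look like `From NAME (DATE):` and that the marker is the literal word "savepost" (the `keep_post` helper is hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the marker word is the literal string "savepost".
my $marker = 'savepost';

# \Q...\E keeps any punctuation in the marker literal. [^(\n]* lets the
# marker sit anywhere in the nick, but stops at the "(" that opens the
# date, so text in the message body cannot trigger a match.
my $ok_regex = qr/^From\s+[^(\n]*\Q$marker\E/;

# Returns 1 if the record's header nick contains the marker, else 0.
sub keep_post {
    my ($record) = @_;
    return $record =~ $ok_regex ? 1 : 0;
}

for my $header (
    "From John Luser savepost (Wed Jul 11 18:12:08 2001):\n",
    "From Joe Luser (Wed Jul 11 18:12:08 2001):\n",
) {
    my $verdict = keep_post($header) ? 'keep' : 'drop';
    print "$verdict: $header";
}
```

In the record-reading loop this would replace the test with `push @saves, $_ if keep_post($_);`.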
Re: Message Board Mangling
by mattr (Curate) on Jul 14, 2001 at 17:37 UTC
    Most sane strategies would have you being in control of the original writing of the file. Then you could use a db, or whatever. I'm guessing this is a hotline server news file or the like so you can't do that.

    What you can do, though, is control when you yourself post. If you are on Unix, flock wouldn't necessarily stop the server from writing to the file, but changing permissions might (if it doesn't crash the server). I believe flock on Windows stops all programs from writing to a flocked file, but I haven't tried it myself. Anyway, lock the file somehow, record the length of the file and the length of your submission, and so on.

    You should be able to maintain a file with a list of byte offsets of your own submissions, which you could later snip out. If you can't lock, maybe you just want to post when the server is down, or something. Maybe you reboot it daily and can have a script that adds your latest submission to the file at that time.

    Otherwise if you just do pattern matching and don't filter other people's submissions they will always be able to fool it.

      Replying late, but them fooling it is quite alright. Most people aren't going to post song lyrics and mark them archive. I just need to keep the news clean, and having people conscious of just what it is that they are posting keeps stuff sane.