abdiel has asked for the wisdom of the Perl Monks concerning the following question:

Okay, easy question (as I've been away from Perl and scripting for *far* too long): I have a log file that I need to sanitize before sending it to a vendor. The log reads something like this:
7/21/2006 6:22:49 start new visit signin - Requirements Passed
7/21/2006 6:22:49 visitor data captured for JON DOE
7/21/2006 6:22:49 visit record saved for JON DOE
7/21/2006 6:22:49 starting to send print job for JON DOE
7/21/2006 6:22:49 printing: row count is 1
7/21/2006 6:22:49 printing: print badge begin
7/21/2006 6:22:52 printing: in PrintBadges
7/21/2006 6:22:52 printing: about to replace objects 1
7/21/2006 6:22:53 printing: about to start printing
7/21/2006 6:22:55 printing: finished printing
7/21/2006 6:22:55 printing: done
7/21/2006 6:22:55 printing: print badge end
7/21/2006 6:22:59 finished sending print job for JON DOE
7/21/2006 6:22:59 finished with visit sign in for JON DOE
7/21/2006 8:25:42 start new visit signin - Requirements Passed
7/21/2006 8:25:42 visitor data captured for JANE SMITH
7/21/2006 8:25:43 visit record saved for JANE SMITH
7/21/2006 8:25:43 starting to send print job for JANE SMITH
7/21/2006 8:25:43 printing: row count is 1
7/21/2006 8:25:43 printing: print badge begin
7/21/2006 8:25:43 printing: in PrintBadges
7/21/2006 8:25:43 printing: in BdgDsgn_PrintVisitsBadges
7/21/2006 8:25:43 printing: about to replace objects 1
7/21/2006 8:25:44 printing: go object look
7/21/2006 8:25:44 printing: set object look
7/21/2006 8:25:44 printing: got objects
7/21/2006 8:25:44 printing: about to start printing
7/21/2006 8:25:46 printing: finished printing
7/21/2006 8:25:46 printing: done
7/21/2006 8:25:46 printing: print badge end
7/21/2006 8:25:49 finished sending print job for JANE SMITH
7/21/2006 8:25:51 finished with visit sign in for JANE SMITH
The names are always in caps and are the only all caps strings in the document. I want to replace each name with a unique identifier (something generic like AAAAAAAA for JON DOE and BBBBBBBB for JANE SMITH) throughout the document. I know it's a simple solution, basic homework-type stuff but I haven't coded in over two years, and the time crunch to return these logs doesn't afford me the time to refresh my memory. Any quick help available?

Re: Log cleanup script
by andyford (Curate) on Aug 14, 2006 at 20:18 UTC
    You should be able to match the all caps stuff with a simple
    /[A-Z]{2,}/
    for each piece of the name, where the curly bracket part means match 2 or more. Now for the whole name, try something like
    /([A-Z]{2,})\s+([A-Z]{2,})/
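
As a quick check of that whole-name pattern, here is a minimal sketch run against one line copied from the question's log. The variable names ($line, $first, $last) are only illustrative, and it assumes every name has exactly two parts.

```perl
use strict;
use warnings;

# Sample line taken from the log in the question.
my $line = '7/21/2006 6:22:49 visitor data captured for JON DOE';

# Two runs of two or more capital letters, separated by whitespace.
# In list context the match returns the two captured pieces.
my ($first, $last) = $line =~ /([A-Z]{2,})\s+([A-Z]{2,})/;

print "first=$first last=$last\n" if defined $first;
```

Mixed-case words such as "Requirements" or "PrintBadges" do not match, because they never contain two capitals in a row.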

    To generate the unique identifiers, I would just create them as you go along and store them in a hash, so that a name you have already seen gets the same identifier again. Something like this might work:

    if (exists $ids{JON_DOE}) {
        # found it, use old
        s/JON DOE/$ids{JON_DOE}/;
    }
    else {
        # create new
        $ids{JON_DOE} = "UNIQUEID_$i";
        $i++;
    }
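
The same lookup-or-create idea can be folded into a single substitution. The following is only a sketch under some assumptions: scrub_line is a hypothetical helper name, names are assumed to have exactly two all-caps parts, and the UNIQUEID_ prefix simply follows the snippet above.

```perl
use strict;
use warnings;

my %ids;      # name => replacement, so repeats reuse the same identifier
my $i = 1;    # number used for the next new identifier

# Hypothetical helper: replace every two-part all-caps name in one line.
sub scrub_line {
    my ($line) = @_;
    # /e evaluates the replacement as code; ||= creates an identifier
    # only the first time a given name is seen.
    $line =~ s/\b([A-Z]{2,}\s+[A-Z]{2,})\b/ $ids{$1} ||= 'UNIQUEID_' . $i++ /ge;
    return $line;
}

print scrub_line($_), "\n" for (
    '7/21/2006 6:22:49 visitor data captured for JON DOE',
    '7/21/2006 8:25:42 visitor data captured for JANE SMITH',
    '7/21/2006 6:22:59 finished with visit sign in for JON DOE',
);
```

Because the hash is keyed on the matched text itself, no name has to be hard-coded, and the third line gets the same identifier as the first.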
    Updated: regex for whole name
Re: Log cleanup script
by GrandFather (Saint) on Aug 14, 2006 at 20:58 UTC

    The regex to match names is somewhat fussy. Many names are hyphenated, so it is likely a hyphen should be included in the character set. It may be that some names comprise more than two components. You want to grab the whole name including white space, but not the white space at the end of the name (there may be none). Given all that (but ignoring a host of other possible gotchas), here's a solution:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %names;
    my $nextID = 'AAAAAA';

    while (<DATA>) {
        if (/(
                (?:[A-Z-]{2,}
                    (?:(?=\s+[A-Z-]{2})\s+)?
                )+
            )/x) {
            my $name = $1;
            $names{$name} = $nextID++ if ! exists $names{$name};
            s/$name/$names{$name}/g;
        }
        print;
    }

    __DATA__
    7/21/2006 6:22:49 start new visit signin - Requirements Passed
    7/21/2006 6:22:49 visitor data captured for JON DOE
    7/21/2006 6:22:49 visit record saved for JON DOE
    7/21/2006 6:22:49 starting to send print job for JON JOE DOE
    7/21/2006 8:25:42 visitor data captured for JANE SMITH
    7/21/2006 8:25:43 visit record saved for JANE SMITH
    7/21/2006 8:25:43 starting to send print job for JANE-BOB SMITH
    7/21/2006 8:25:51 finished with visit sign in for JANE-BOB SMITH

    Prints:

    7/21/2006 6:22:49 start new visit signin - Requirements Passed
    7/21/2006 6:22:49 visitor data captured for AAAAAA
    7/21/2006 6:22:49 visit record saved for AAAAAA
    7/21/2006 6:22:49 starting to send print job for AAAAAB
    7/21/2006 8:25:42 visitor data captured for AAAAAC
    7/21/2006 8:25:43 visit record saved for AAAAAC
    7/21/2006 8:25:43 starting to send print job for AAAAAD
    7/21/2006 8:25:51 finished with visit sign in for AAAAAD

    Note that the first part of the regex matches a name component. The second part looks ahead to see if there is white space followed by another name component, and consumes the white space if there is. Those two parts are repeated as many times as there are adjacent name components to match.
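
To see the lookahead at work in isolation, here is a small sketch that compiles the same pattern with qr// and tries it on two fragments. The $name_re variable and the sample strings (one with a deliberate trailing space) are illustrative assumptions, not part of the script above.

```perl
use strict;
use warnings;

# Same pattern as in the script, compiled once so it can be reused.
my $name_re = qr/( (?:[A-Z-]{2,} (?:(?=\s+[A-Z-]{2})\s+)?)+ )/x;

# The first fragment ends in a space to show it is not captured.
for my $text ('print job for JON JOE DOE ', 'sign in for JANE-BOB SMITH') {
    my ($name) = $text =~ $name_re;
    print "[$name]\n";
}
```

The brackets in the output make it visible that the three-part name is captured whole and that the space after DOE is left out, since no further name component follows it.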


    DWIM is Perl's answer to Gödel