lambden has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I've been trying to work this out for myself, but I feel I'm missing something fundamental, and without guidance I won't be able to complete this quest:

I have a text file which is output by an automated router provisioning system. The file is in the following format:

Date time IPaddress Action

As the file is produced, each line contains the current date and time, so every line is unique because the time changes. However, the IPaddress and Action parts of a line may be duplicated on other lines; it's these duplicates I want to remove while still keeping the output in time order.

I can produce output which excludes the duplicate IPaddress and Action pairs, but only after removing the time field, since the time makes every line unique.

So how do I go about doing this?

Janitored by davido to provide formatting based on OP's direct input. See Writeup Formatting Tips for details.


Replies are listed 'Best First'.
Re: Removing duplicate entries in a file which has a time stamp on each line
by Old_Gray_Bear (Bishop) on Jan 18, 2006 at 21:25 UTC
    Basically, you want to build a hash keyed off of IPaddress+Action, whose value is the data-line of the file.

    Let us call the hash

    my %ip_address;
    As you read your data file, you check to see if you have seen this action for this IP. If you have, then drop the data and go on to the next; if you haven't, then add the key/value to %ip_address.

    Once you have read to the end of the data, your hash will contain the first occurrences of each IPaddress+action. This pseudo-code (note: it only looks vaguely like Perl, not tested) shows you how to extract and sort to get the final report:

    foreach my $data_line ( sort values %ip_address ) {
        print $data_line;
    }
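
    A runnable sketch of this approach (the sub name dedup_keep_first and the sample log lines are illustrative, not from the thread):

```perl
use strict;
use warnings;

# Keep the first line seen for each "IPaddress Action" pair, then return
# the survivors sorted back into time order.
sub dedup_keep_first {
    my @lines = @_;
    my %ip_action;    # key: "IPaddress Action", value: full line
    for my $line (@lines) {
        my ( $date, $time, $ip, $action ) = split ' ', $line, 4;
        my $key = "$ip $action";
        $ip_action{$key} = $line unless exists $ip_action{$key};
    }
    # Each stored line begins "Date time", so a plain string sort restores
    # time order, assuming a lexically sortable timestamp (e.g. ISO 8601).
    return sort values %ip_action;
}

print "$_\n" for dedup_keep_first(
    "2006-01-18 08:18 1.2.3.4 PowerOff",
    "2006-01-18 09:00 5.6.7.8 PowerOn",
    "2006-01-18 10:20 1.2.3.4 PowerOff",   # duplicate IP+Action, dropped
);
```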

    ----
    I Go Back to Sleep, Now.

    OGB

Re: Removing duplicate entries in a file which has a time stamp on each line
by explorer (Chaplain) on Jan 18, 2006 at 21:30 UTC
    (A very fast solution...)
    my %seen;
    open my $file, '<', 'textfile.txt' or die "Cannot open textfile.txt: $!";
    while ( my $line = <$file> ) {
        my ( $date, $time, $rest ) = split ' ', $line, 3;
        next if $seen{$rest}++;
        print "$date $time $rest";
    }
    close $file;
Re: Removing duplicate entries in a file which has a time stamp on each line
by Random_Walk (Prior) on Jan 18, 2006 at 21:37 UTC

    The easiest way is probably to split your line on space, tab or comma, depending on what your file uses, then take all the fields beyond the date and time and use them as a hash key; join will glue them back together for you. If the key already exists, you have seen this line before and can discard it; if not, write the line out, create the key, and move on to the next line.

    There are a few optimisations depending on your need for speed, code simplicity, or fun and games. The first is to split the line into only three fields, Date, Time, Rest, so that you do not have to join anything back up. Another is to test whether the hash key exists at the same time as you create it:

    if ( $hash{$key}++ ) {
        # key already seen
    }
    else {
        # key not yet seen
    }
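
    A self-contained sketch of the split-into-three optimisation (the sub name filter_log and the sample lines are illustrative):

```perl
use strict;
use warnings;

# Split each line into just three fields -- date, time, and "the rest"
# (IPaddress + Action) -- so there is nothing to join back up; the third
# field serves directly as the dedup key.
sub filter_log {
    my @lines = @_;
    my ( %seen, @out );
    for my $line (@lines) {
        my ( $date, $time, $rest ) = split ' ', $line, 3;
        push @out, $line unless $seen{$rest}++;   # test-and-set in one step
    }
    return @out;
}

print "$_\n" for filter_log(
    "2006-01-18 08:18 1.2.3.4 PowerOff",
    "2006-01-18 09:00 5.6.7.8 PowerOn",
    "2006-01-18 10:20 1.2.3.4 PowerOff",   # duplicate IP+Action, dropped
);
```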

    Have a go and post your code if you have any problems

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: Removing duplicate entries in a file which has a time stamp on each line
by BrowserUk (Patriarch) on Jan 18, 2006 at 21:41 UTC

    Adjust the number to the length of your timestamps.

    uniq -s 20 infile outfile

    If you're on windows and do not have access to uniq, get it here amongst other places.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Removing duplicate entries in a file which has a time stamp on each line
by graff (Chancellor) on Jan 19, 2006 at 04:01 UTC
    I think the earlier replies have probably solved your problem, but I was curious about this statement:
    However the IPaddress and Action part of each line may contain duplicates, its these duplicates I want to remove but still keep the output in time order.
    Now, if "1.2.3.4 PowerOff" occurs today at 08:18 and again today at 10:20, do you want to keep the first record and delete the later one, or vice versa?

    If you keep the first and delete later repeats, you just keep the IP/Action data as hash keys, and assuming the data are being read in chronological order, only output lines whose IP/Action are not yet in the hash.

    In order to delete earlier occurrences and keep only the latest one, you have to store Date/Time as the value for each IP/Action key, and after you've read the whole input stream, sort the hash by its values in order to print each "hash_value hash_key" in chronological order.
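
    A sketch of that keep-the-latest variant (the sub name dedup_keep_last and the sample lines are illustrative):

```perl
use strict;
use warnings;

# Keep only the LATEST occurrence of each IP/Action pair: overwrite the
# stored Date/Time on every repeat, then sort by timestamp at the end.
sub dedup_keep_last {
    my @lines = @_;
    my %last;    # key: "IPaddress Action", value: "Date time"
    for my $line (@lines) {
        my ( $date, $time, $rest ) = split ' ', $line, 3;
        $last{$rest} = "$date $time";    # later lines overwrite earlier ones
    }
    return map  { "$_->[1] $_->[0]" }          # rebuild "Date time IP Action"
           sort { $a->[1] cmp $b->[1] }        # chronological, assuming a
           map  { [ $_, $last{$_} ] }          # lexically sortable timestamp
           keys %last;
}

print "$_\n" for dedup_keep_last(
    "2006-01-18 08:18 1.2.3.4 PowerOff",
    "2006-01-18 09:00 5.6.7.8 PowerOn",
    "2006-01-18 10:20 1.2.3.4 PowerOff",   # later repeat wins
);
```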