Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a list of XML files in a directory. Some of the files share the same base "name" but are prefixed with random numbers. For each of those files, I use an XML parser to read the date and time tags. How can I compare the files that share a "name" and delete the one with the earlier date and time?

Example filenames: 508.ids.xml 1508.ids.xml 1509.id123.xml 1400.id123.xml
#!/usr/bin/perl
use XML::Simple;
use ContentHelper;
use XML::Parser;

my @files = `ls`;

# Sub substitution to check for duplicates
foreach my $input_file (@files) {
    chomp $input_file;
    my $xml_parser = XML::Simple->new();
    my $data = $xml_parser->XMLin($input_file);
    my $Id   = $data->{'root'}->{'id'};
    my $date = $data->{'root'}->{'date'};
    my $time = $data->{'root'}->{'time'};
    ### Please tell me how to add the condition here.
    print "$input_file:$Id:$time:$date\n\n";
}
The above print command prints:

9890.ids.xml:ids:70857:2004-10-02
9893.ids.xml:ids:70859:2004-10-02
9830.ids.xml:ids:2000:2004-10-02
9834.ids.xml:ids:4000:2004-10-01
Now, how do I delete the files 9890.ids.xml:ids:70857:2004-10-02 and 9834.ids.xml:ids:4000:2004-10-01, as they have the earlier date and time?
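For reference, I assume the shared part of a name like 1509.id123.xml can be obtained by stripping the leading digits, something like this (untested):

my $file = "1509.id123.xml";
(my $name = $file) =~ s/^\d+\.//;   # strip the random numeric prefix
print "$name\n";                    # prints "id123.xml"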

Re: Delete the file with checking the value
by almut (Canon) on Feb 26, 2010 at 20:20 UTC
    #!/usr/bin/perl

    my %hash;

    while (<DATA>) {
        my ($input_file, $time, $date) = split ' ';

        # determine common part of file names
        my $common = $input_file;
        $common =~ s/^\d+//;   # remove leading number

        # collect info (name, timestamp) in hash, keyed by common part of file names
        push @{$hash{$common}}, [ $input_file, sprintf("%s %6d", $date, $time) ];
    }

    # use Data::Dumper; print Dumper \%hash;   # debug

    for my $k (keys %hash) {   # for all file sets
        # sort by timestamp
        my @files = sort { $b->[1] cmp $a->[1] } @{$hash{$k}};

        # remove most recent file (the one to keep) from list
        shift @files;

        # delete remaining (older) files
        unlink map $_->[0], @files;
    }

    __DATA__
    508.ids.xml 70857 2004-10-02
    1508.ids.xml 70859 2004-10-02
    1509.id123.xml 2000 2004-10-02
    1400.id123.xml 4000 2004-10-01

    (I left out the XML stuff, as the OP doesn't seem to have problems with that part.)
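    (If you do want to wire the XML part back in, a rough sketch along these lines should do; it assumes the same root/date/time layout your snippet accesses, and is untested:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Simple;

    my %hash;
    my $xs = XML::Simple->new();

    for my $input_file (glob "*.xml") {
        my $data = $xs->XMLin($input_file);
        my $date = $data->{root}{date};   # same access pattern as in your snippet
        my $time = $data->{root}{time};

        (my $common = $input_file) =~ s/^\d+//;   # key on the shared part of the name
        push @{$hash{$common}}, [ $input_file, sprintf("%s %6d", $date, $time) ];
    }

    for my $k (keys %hash) {
        my @files = sort { $b->[1] cmp $a->[1] } @{$hash{$k}};
        shift @files;                  # keep the most recent file
        unlink map $_->[0], @files;    # delete the older ones
    }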

      Thanks for the help.
      __DATA__
      508.ids.xml 70857 2004-10-01
      1508.ids.xml 70859 2004-10-01
      1509.id123.xml 2000 2004-10-01
      1400.id123.xml 4000 2004-10-01
      How do I compare the time if the date is the same?

        As usual, there are several ways to do it. In the sample code I've taken care of it by tagging the time value onto the end of the date string, aligning it such that the whole string can simply be sorted asciibetically to yield the proper result. Note that space (ASCII 32) orders before digits (ASCII 48..57).  This is done with the sprintf("%s %6d",$date,$time).

        With the following sample input

        __DATA__
        0.ids.xml 500 2004-10-01
        1.ids.xml 2 2004-10-01
        2.ids.xml 30 2004-10-01
        3.ids.xml 600 2004-10-01
        4.ids.xml 40 2004-10-01
        5.ids.xml 7000 2004-10-01
        6.ids.xml 8000 2004-10-01
        7.ids.xml 1 2004-10-01
        8.ids.xml 100000 2004-10-01
        9.ids.xml 90000 2004-10-01

        and the comparison operation as shown — $b->[1] cmp $a->[1] (string sort, reversed) — this would order as

        2004-10-01 100000
        2004-10-01  90000
        2004-10-01   8000
        2004-10-01   7000
        2004-10-01    600
        2004-10-01    500
        2004-10-01     40
        2004-10-01     30
        2004-10-01      2
        2004-10-01      1

        i.e. you get the entry with the highest time value as the first element.
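
        A quick stand-alone snippet to convince yourself of the ordering (illustrative only):

        my @stamps = map { sprintf("%s %6d", "2004-10-01", $_) } (500, 2, 100000);
        print "$_\n" for sort { $b cmp $a } @stamps;
        # prints:
        # 2004-10-01 100000
        # 2004-10-01    500
        # 2004-10-01      2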

        Another way would be to store the date and time values separately

        push @{$hash{$common}}, [ $input_file, $date, $time ];

        and then use a generic chained sort operation

        @files = sort {$b->[1] cmp $a->[1] || $b->[2] <=> $a->[2]} @{$hash{$k}};

        This works because if the date value is equal, the first comparison ($b->[1] cmp $a->[1]) evaluates to zero, so the next comparison ($b->[2] <=> $a->[2]) after the logical or "||" is tested to determine if the time differs (it kind of "falls through"). Note that in this case the time value must be compared numerically, i.e. with <=>, or else (with string comparison cmp) the 100000 would be ordered in between 1 and 2.   See sort.
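
        For example, a self-contained toy run of that chained sort (not part of the code above):

        my @recs = (
            [ "a.xml", "2004-10-01",      2 ],
            [ "b.xml", "2004-10-01", 100000 ],
            [ "c.xml", "2004-10-02",      1 ],
        );
        my @sorted = sort { $b->[1] cmp $a->[1] || $b->[2] <=> $a->[2] } @recs;
        print join(" ", @$_), "\n" for @sorted;
        # c.xml 2004-10-02 1
        # b.xml 2004-10-01 100000
        # a.xml 2004-10-01 2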

Re: Delete the file with checking the value
by ack (Deacon) on Mar 01, 2010 at 21:10 UTC

    The most straightforward way to do it, in one continuous pass, would be to:

    1. Get the filename.
    2. Extract the unique part of the filename into a $uniqueId variable.
    3. Use that $uniqueId as a key into a hash.
    4. Get the date-time stamp of the file.
    5. If the $uniqueId key doesn't yet exist in the hash, store the filename and date-time stamp in the hash at that key (perhaps as 'filename':'date-time stamp', so that split(":", $uniqueIds{$uniqueId}) can quickly recover the parts). ELSE: get the filename and date-time stamp already in the hash at that $uniqueId key; compare the date-times (probably using a CPAN module to make this easy and simple); if the new date-time is later than the old one, store the new filename and date-time stamp into the hash and delete the old file identified by the old entry; OTHERWISE, delete this file.
    6. Repeat step 5 until all files are processed.

    This will leave the names and date-time stamps of the 'latest' (or 'newest') filenames in the hash at the end of the run and will also leave only the newest files in the directory.
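
    A rough Perl sketch of that flow might look like the following (it assumes the file naming and root/date/time tags from the OP's post, and it reuses the padded-string comparison from above instead of a CPAN date module):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Simple;

    my %uniqueIds;                 # $uniqueId => "filename:date-time stamp"
    my $xs = XML::Simple->new();

    for my $file (glob "*.xml") {
        (my $uniqueId = $file) =~ s/^\d+\.//;      # step 2: unique part of the name

        my $data  = $xs->XMLin($file);             # step 4: date-time stamp
        my $stamp = sprintf("%s %6d", $data->{root}{date}, $data->{root}{time});

        if (!exists $uniqueIds{$uniqueId}) {       # step 5: first file with this id
            $uniqueIds{$uniqueId} = "$file:$stamp";
        }
        else {
            my ($old_file, $old_stamp) = split /:/, $uniqueIds{$uniqueId}, 2;
            if ($stamp gt $old_stamp) {
                unlink $old_file;                  # new file is later: drop the old one
                $uniqueIds{$uniqueId} = "$file:$stamp";
            }
            else {
                unlink $file;                      # otherwise drop this file
            }
        }
    }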

    ack Albuquerque, NM