PerlMonks  

output unique lines only

by sbp (Initiate)
on Dec 06, 2005 at 16:24 UTC ( [id://514528]=perlquestion )

sbp has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a txt file that contains 5 columns (tab delimited)...
filename col2 col3 col4 col5

I want to read in this file and output to a file only the filenames that are unique from the first file.
For example, if the txt file contains following data:
filename1 col1 col2 col3 col4 col5
filename2 col1a col2b col3c col4d col5e
filename3 col1f col2g col3h col4i col5j
filename2 col1k col2l col3m col4n col5o
filename2 col1p col2q col3r col4s col5t

I want to remove the duplicate filenames, and output the following to another file:
filename1
filename2
filename3

As I'm a beginner to Perl, can you guys provide me with any suggestions on how to approach this?
Thank you!

Replies are listed 'Best First'.
Re: output unique lines only
by swkronenfeld (Hermit) on Dec 06, 2005 at 16:42 UTC
    No need for Perl, unless you're doing something more complicated. Type this from your *IX command line.

    cut -f1 FileName | sort | uniq
      I'd go for a shell pipe as well, and it would be close to your suggestion. Except that I wouldn't use the final pipe, but use sort -u instead. But that's just a minor difference. I won't be handing out 'useless use of uniq' awards.
      Perl --((8:>*
        The only reason to use sort and uniq in combination instead of "sort -u" that I can think of is to skip specific columns when looking for unique instances. Example:
        ...
        RH_MEa0001bG06_5 710 14 16 Invalid starting position (14)
        RH_MEa0001bG06_4 710 125 12 GGGGGACACCTTCTCTCTCT...
        RH_MEa0001bG06_6 710 125 12 GGGGGACACCTTCTCTCTCT...
        ...
        Sending a file containing this output to " | sort -k2 | uniq -f1" would compare each line while ignoring the column you want to skip (column 1 in this case) and keep the first instance of each distinct remainder. Sorting on the remaining columns (-k2) matters because uniq only collapses adjacent lines. That gives you:
        ...
        RH_MEa0001bG06_4 710 125 12 GGGGGACACCTTCTCTCTCT...
        RH_MEa0001bG06_5 710 14 16 Invalid starting position (14)
        ...
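        For comparison, here is a minimal Perl sketch of the same "ignore the first column" idea. The helper name uniq_skipping_first_field is made up for illustration, and it assumes whitespace-separated fields. Unlike uniq, it remembers every remainder in a hash, so duplicates do not have to be adjacent; the sort is only there to pick which instance counts as "first".

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first line seen for each distinct value of
# everything after the first whitespace-separated field.
sub uniq_skipping_first_field {
    my @lines = @_;
    my (%seen, @kept);
    for my $line (@lines) {
        # Split into the first field (discarded) and the rest of the line.
        my (undef, $rest) = split ' ', $line, 2;
        $rest = '' unless defined $rest;
        push @kept, $line unless $seen{$rest}++;
    }
    return @kept;
}

my @sample = (
    'RH_MEa0001bG06_5 710 14 16 Invalid starting position (14)',
    'RH_MEa0001bG06_4 710 125 12 GGGGGACACCTTCTCTCTCT...',
    'RH_MEa0001bG06_6 710 125 12 GGGGGACACCTTCTCTCTCT...',
);

# Prints the _4 line and the _5 line; _6 repeats the remainder of _4.
print "$_\n" for uniq_skipping_first_field(sort @sample);
```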
Re: output unique lines only
by tirwhan (Abbot) on Dec 06, 2005 at 16:33 UTC

    You should try to make a little bit of effort to arrive at a solution on your own, at least say "This is what I've tried but it doesn't work and I don't know why".

    Your task can be solved by reading the file in a loop, using split on each line and then putting the first returned element into a hash as a key (for example $hash{$element}=1). After you have read the whole file you can open another file for writing and do

    for my $name(keys %hash) { print $filehandle "$name\n"; }

    Try to solve it with that information and do come back and ask if you have problems.


    Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
Re: output unique lines only
by davorg (Chancellor) on Dec 06, 2005 at 16:36 UTC

    What parts are you having trouble with?

    • Use "open" to open the file
    • Use "< ... >" to read from the file
    • Use "split" to break each line into its parts
    • Use a hash to store the filename
    • Only print filenames if they don't exist in the hash

    Update: I deliberately didn't give any code as I don't like to help people who show no sign of putting any effort in for themselves. It seems that others don't agree with that policy.

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: output unique lines only
by chibiryuu (Beadle) on Dec 06, 2005 at 16:36 UTC
    my %seen; while (<>) { s/\t.*//s; $seen{$_}++ or print "$_\n"; }
Re: output unique lines only
by blazar (Canon) on Dec 06, 2005 at 16:39 UTC
    IIUC
    $ perl -lne 's/\t.*//; print if !$saw{$_}++' input_file > output_file
Re: output unique lines only
by EdwardG (Vicar) on Dec 06, 2005 at 16:44 UTC

    Here's one approach -

    • Use STDIN and STDOUT for input and output
    • Use a regex to extract the first 'column'. You could also use split, but since you care only about the first column it may be overkill.
    • Use a hash to gather unique filenames

    Put it all together and you will have something like this:

    # uniqfiles.pl
    use strict;      # helps prevent silly mistakes
    use warnings;    # helpful when writing code

    my %uniq_fnames; # must be declared, since 'use strict' is on

    while (<>) {                          # Reads from STDIN
        if (/^(\w+)\t/) {                 # If the line starts with one or more 'word' characters followed by a tab...
            my $filename = $1;            # ...assume we've got a filename captured
            $uniq_fnames{$filename} = 1;  # ...and add it to our hash.
        }
    }

    print $_,"\n" for keys %uniq_fnames;  # prints to STDOUT, can be piped to a file

    Then you could use this as follows

    perl uniqfiles.pl < my_non_unique_list_of_files > my_unique_list_of_files

     

Re: output unique lines only
by cormanaz (Deacon) on Dec 06, 2005 at 19:05 UTC
    This is easy to do with a hash. Open the file, read in one line at a time and use the split function to put the first element in each line (i.e. the filename) into a variable like $fn. If your hash is called %uniquefiles you then set the value for $fn to some arbitrary value, like

    $uniquefiles{$fn} = 1;

    If your loop comes across the same filename again, it will simply set the same value for the same filename, in effect eliminating the dupes. When you're all done %uniquefiles will only contain the unique filenames, which you can print like so:

    foreach my $k (keys %uniquefiles) { print OUT "$k\n"; }
    If you're just learning Perl, make sure you learn about hashes. They're a very powerful feature.
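    The approach described above could be sketched roughly as follows. This is only an illustration: the helper name unique_first_columns and the sample rows are assumptions, and the keys are sorted because a hash does not return them in any predictable order.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct first columns of tab-delimited lines.
sub unique_first_columns {
    my @lines = @_;
    my %uniquefiles;
    for my $line (@lines) {
        chomp $line;
        my ($fn) = split /\t/, $line;    # list assignment keeps the first field only
        next unless defined $fn && length $fn;
        $uniquefiles{$fn} = 1;           # a repeated filename just overwrites itself
    }
    return sort keys %uniquefiles;       # sorted, since hash order is not predictable
}

my @data = (
    "filename1\tcol1\tcol2\n",
    "filename2\tcol1a\tcol2b\n",
    "filename3\tcol1f\tcol2g\n",
    "filename2\tcol1k\tcol2l\n",
);

# Prints filename1, filename2 and filename3, one per line.
print "$_\n" for unique_first_columns(@data);
```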

    Steve

      Thanks everyone for their tips/suggestions. I've decided to approach this using a hashtable.
      I came up with the following script but it doesn't seem to be working correctly.
      #!/usr/bin/perl -w
      $filelist = "/home/exp/acctlist.txt";
      open(FILEDUPS, $filelist) || die ("Cannot open $filelist");
      open($output, '>', '/home/exp/output.txt') || die ("Cannot open file");
      while ($line = <FILEDUPS>) {
          chomp $line;
          ($filename, undef, undef, undef, undef) = split /\t/, $line;
      }
      $uniquefiles{$filename} = 1;
      foreach $k (keys %uniquefiles) {
          print $output "$k\n";
      }
      It currently only outputs one line. For example, if my file contains
      filename1
      filename2
      filename1
      filename4

      Then it outputs the first line only:
      filename1
      Whereas it should output:
      filename1
      filename2
      filename4

      I've spent a long time trying to debug this, but I'm not sure where I'm going wrong.
      Thanks.
        hi,
        I guess you should put the $uniquefiles{$filename} = 1; statement inside the while loop.
        -kulls
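        A minimal sketch of that fix, with the assignment moved inside the loop. The helper name print_unique_filenames is made up, and in-memory filehandles stand in for /home/exp/acctlist.txt and /home/exp/output.txt so the example is self-contained. Sorting the keys is an extra choice here, not something the original script did.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the read loop from the script above.
sub print_unique_filenames {
    my ($in, $out) = @_;
    my %uniquefiles;
    while (my $line = <$in>) {
        chomp $line;
        my ($filename) = split /\t/, $line;
        next unless defined $filename;
        # Recording the name inside the loop is the fix: every line gets
        # stored, not just the last one read.
        $uniquefiles{$filename} = 1;
    }
    print {$out} "$_\n" for sort keys %uniquefiles;
}

# In-memory filehandles stand in for the real input and output files.
my $input = "filename1\nfilename2\nfilename1\nfilename4\n";
open(my $in, '<', \$input) or die "Cannot open input: $!";
my $result = '';
open(my $out, '>', \$result) or die "Cannot open output: $!";

print_unique_filenames($in, $out);
close $out;

# Prints filename1, filename2 and filename4, one per line.
print $result;
```

        With real files, the two open calls would take the paths instead of the scalar references.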
