sm2004 has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file that looks like:
MA01001A1A03.f1 760 5640111 ad1
MA01001A1A03.f1 760 42572233 ubq
MA01001A1A04.f1 300 15232924 ubq
MA01001A1A04.f1 300 145334669 DNA
MA01001A1B22.f1 580 77745475 ra
MA01001A1B22.f1 580 30409730 ra
How do I write a Perl script that extracts only the lines whose first word has not appeared earlier in the list? The extracted list should contain only:
MA01001A1A03.f1 760 5640111 ad1
MA01001A1A04.f1 300 15232924 ubq
MA01001A1B22.f1 580 77745475 ra
Any tips on how to get this done are very much appreciated.

Re: How to extract lines starting with new names/words
by moritz (Cardinal) on Mar 13, 2008 at 07:37 UTC
    You can keep these first words in a hash and check if they have already been stored:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen_words;
    while (<DATA>){
        if (!m/^(\S+)/){
            die "Invalid line: $_";
        }
        my $first_word = $1;
        if (!$seen_words{$first_word}){
            print;
            $seen_words{$first_word} = 1;
        }
    }
    __DATA__
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A03.f1 760 42572233 ubq
    MA01001A1A04.f1 300 15232924 ubq
    MA01001A1A04.f1 300 145334669 DNA
    MA01001A1B22.f1 580 77745475 ra
    MA01001A1B22.f1 580 30409730 ra

    The loop can be written a little more compactly:

    while (<DATA>){
        if (!m/^(\S+)/){
            die "Invalid line: $_";
        }
        print unless $seen_words{$1}++;
    }

    But the first one is easier to read for the beginner ;-)
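
    To run the same idea against a real file instead of the inline __DATA__ section, the logic can be wrapped in a small subroutine. This is only a sketch: the name first_seen_lines and the convention of taking the file name from the command line are assumptions made for illustration, not something given in the question. Unlike the versions above, it silently skips lines that have no leading word (blank lines, or lines starting with whitespace) instead of dying.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (the name is an assumption): reads lines from a
    # filehandle and returns those whose first word has not been seen before.
    sub first_seen_lines {
        my ($fh) = @_;
        my %seen_words;
        my @kept;
        while (my $line = <$fh>) {
            # Skip lines without a leading word rather than dying.
            next unless $line =~ m/^(\S+)/;
            # Post-increment returns the old count: 0 (false) on first sight.
            push @kept, $line unless $seen_words{$1}++;
        }
        return @kept;
    }

    # Usage (assumed): perl script.pl input.txt
    if (@ARGV) {
        open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
        print first_seen_lines($fh);
        close $fh;
    }
    ```

    Returning the kept lines instead of printing them directly makes it easy to write them to another file later.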

      Thanks so much! That was perfect. Exactly what I wanted it to do. I spent several days trying to do this. I just learned Perl two weeks ago. Thanks again.
Re: How to extract lines starting with new names/words
by Thilosophy (Curate) on Mar 13, 2008 at 08:47 UTC
    I believe moritz's script will do what you want, but your expected output is confusing:
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A04.f1 300 15232924 ubq
    MA01001A1B22.f1 580 77745475 ra
    Should not the second line list the first occurrence of ubq? And what happened to DNA?
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A03.f1 760 42572233 ubq
    MA01001A1A04.f1 300 145334669 DNA
    MA01001A1B22.f1 580 77745475 ra

    Update: Ah, yes...
    As moritz points out (in more polite words) below, I am an idiot. Or tired. I repent, expect swift and adequate punishment, and demand this node be voted down to about -5. (but not much more. please)

      but your expected output is confusing:

      I found that the expected output matches the description very well.

      Should not the second line list the first occurrence of ubq?

      No, because the first two lines both start with MA01001A1A03.f1

      And what happened to DNA?
      It starts with the same word as the third line.
Re: How to extract lines starting with new names/words
by poolpi (Hermit) on Mar 13, 2008 at 10:11 UTC

    If some lines begin with a comment, you will need a different regexp, for example:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $line ;
    while (<DATA>) {
        next unless /\A (\w+[.]\w+) \s+ (.+) \z/xms;
        print unless $line->{ $1 }++;
    };
    __DATA__
    # Log file 13/3/2008
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A03.f1 760 42572233 ubq
    MA01001A1A04.f1 300 15232924 ubq
    MA01001A1A04.f1 300 145334669 DNA
    #
    MA01001A1B22.f1 580 77745475 ra
    MA01001A1B22.f1 580 30409730 ra
    MA01001A1A03.f1 760 5640111 foo
    MA01001A1A04.f1 300 15232924 bar
    # End of log
    Output:
    MA01001A1A03.f1 760 5640111 ad1
    MA01001A1A04.f1 300 15232924 ubq
    MA01001A1B22.f1 580 77745475 ra

    hth,

    PooLpi

    'Ebry haffa hoe hab im tik a bush'. Jamaican proverb

    Update : for -> while, thanks johngg ;)

      Your for (<DATA>) would be better written as while (<DATA>). Using for will have the effect of reading the entire file into memory rather than processing a line at a time as with while. Not a problem, perhaps, with small data sets but it's not a good habit to get into.
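
      The difference can be made visible with a small sketch. The subroutine names and the in-memory filehandles are assumptions chosen for illustration. Each subroutine reports the value of $. (the number of lines read so far) at the moment its loop body runs for the first time.

      ```perl
      use strict;
      use warnings;

      # for evaluates <$fh> in list context, so every line has been read
      # before the loop body runs for the first time.
      sub lines_read_at_first_iteration_for {
          my ($fh) = @_;
          for my $line (<$fh>) {
              return $.;
          }
          return 0;
      }

      # while evaluates <$fh> in scalar context: one line per iteration.
      sub lines_read_at_first_iteration_while {
          my ($fh) = @_;
          while (my $line = <$fh>) {
              return $.;
          }
          return 0;
      }

      my $data = "one\ntwo\nthree\n";
      open my $fh_for,   '<', \$data or die "Cannot open in-memory handle: $!";
      open my $fh_while, '<', \$data or die "Cannot open in-memory handle: $!";

      print "for:   ", lines_read_at_first_iteration_for($fh_for),     "\n";
      print "while: ", lines_read_at_first_iteration_while($fh_while), "\n";
      ```

      With the three-line sample, the for version reports 3 and the while version reports 1, which is why while scales better to large files.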

      Cheers,

      JohnGG

      Thanks a lot. I could use the idea for another file I need to extract data from. I'm new to Perl and all your input helped a lot.