How to extract lines starting with new names/words

sm2004 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to extract lines starting with new names/words by moritz (Cardinal) on Mar 13, 2008 at 07:37 UTC
You can keep these first words in a hash and check whether they have already been stored:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen_words;
while (<DATA>){
    if (!m/^(\S+)/){
        die "Invalid line: $_";
    }
    my $first_word = $1;
    if (!$seen_words{$first_word}){
        print;
        $seen_words{$first_word} = 1;
    }
}
__DATA__
MA01001A1A03.f1 760 5640111 ad1
MA01001A1A03.f1 760 42572233 ubq
MA01001A1A04.f1 300 15232924 ubq
MA01001A1A04.f1 300 145334669 DNA
MA01001A1B22.f1 580 77745475 ra
MA01001A1B22.f1 580 30409730 ra
```

This can be written a little more compactly:

```perl
while (<DATA>){
    if (!m/^(\S+)/){
        die "Invalid line: $_";
    }
    print unless $seen_words{$1}++;
}
```

But the first one is easier to read for the beginner ;-)
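For anyone puzzled by why `print unless $seen{...}++` works: the post-increment returns the old value, which is undefined (false) the first time a word is seen and a positive count afterwards. Here is a minimal sketch of the same idea wrapped in a sub so it can be reused on any list of lines. The name `first_occurrences` is made up for illustration, and unlike the code above it skips lines that do not start with a word instead of dying.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return only the lines
# whose first whitespace-delimited word has not appeared on an earlier line.
sub first_occurrences {
    my @lines = @_;
    my %seen_words;
    my @kept;
    for my $line (@lines) {
        # Assumption: lines that do not start with a non-space character
        # (for example blank lines) are skipped rather than treated as fatal.
        next unless $line =~ m/^(\S+)/;

        # $seen_words{$1}++ yields the count *before* incrementing:
        # undef (false) on first sight, 1 or more on every repeat.
        push @kept, $line unless $seen_words{$1}++;
    }
    return @kept;
}

my @sample = (
    "MA01001A1A03.f1 760 5640111 ad1\n",
    "MA01001A1A03.f1 760 42572233 ubq\n",
    "MA01001A1A04.f1 300 15232924 ubq\n",
    "MA01001A1A04.f1 300 145334669 DNA\n",
);

# Prints the ad1 line and the first MA01001A1A04.f1 line only.
print first_occurrences(@sample);
```

Returning a list rather than printing directly makes it easy to feed the result into further processing, at the cost of holding the kept lines in memory.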
Re^2: How to extract lines starting with new names/words by sm2004 (Acolyte) on Mar 13, 2008 at 23:28 UTC
Thanks so much! That was perfect. Exactly what I wanted it to do... I spent several days trying to do this. Just learned Perl two weeks ago. Thanks again.
Re: How to extract lines starting with new names/words by Thilosophy (Curate) on Mar 13, 2008 at 08:47 UTC
I believe moritz's script will do what you want, but your expected output is confusing:

```
MA01001A1A03.f1 760 5640111 ad1
MA01001A1A04.f1 300 15232924 ubq
MA01001A1B22.f1 580 77745475 ra
```

Should not the second line list the first occurrence of `ubq`? And what happened to `DNA`?

```
MA01001A1A03.f1 760 5640111 ad1
MA01001A1A03.f1 760 42572233 ubq
MA01001A1A04.f1 300 145334669 DNA
MA01001A1B22.f1 580 77745475 ra
```

Update: Ah, yes... As moritz points out (in more polite words) below, I am an idiot. Or tired. I repent, expect swift and adequate punishment, and demand this node be voted down to about -5. (But not much more, please.)
Re^2: How to extract lines starting with new names/words by moritz (Cardinal) on Mar 13, 2008 at 08:58 UTC
"but your expected output is confusing:"

I found that the expected output matches the description very well.

"Should not the second line list the first occurrence of ubq?"

No, because both of the first two lines start with `MA01001A1A03.f1`.

"And what happened to DNA?"

It starts with the same word as the third line.
Re: How to extract lines starting with new names/words by poolpi (Hermit) on Mar 13, 2008 at 10:11 UTC
If a line may begin with a comment, for example, you will need a different regexp:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line;

while (<DATA>) {
    next unless /\A (\w+[.]\w+) \s+ (.+) \z/xms;
    print unless $line->{ $1 }++;
}
__DATA__
# Log file 13/3/2008
MA01001A1A03.f1 760 5640111 ad1
MA01001A1A03.f1 760 42572233 ubq
MA01001A1A04.f1 300 15232924 ubq
MA01001A1A04.f1 300 145334669 DNA
#
MA01001A1B22.f1 580 77745475 ra
MA01001A1B22.f1 580 30409730 ra
MA01001A1A03.f1 760 5640111 foo
MA01001A1A04.f1 300 15232924 bar
# End of log
```

Output:

```
MA01001A1A03.f1 760 5640111 ad1
MA01001A1A04.f1 300 15232924 ubq
MA01001A1B22.f1 580 77745475 ra
```

hth,

PooLpi

'Ebry haffa hoe hab im tik a bush'. Jamaican proverb

Update: changed `for` to `while`, thanks johngg ;)
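To see which lines that pattern lets through, here is a small sketch that applies the same regexp to a few sample lines. The sub name `is_data_line` is invented for this illustration. Under `/x` the spaces inside the pattern are ignored, so it reads as: a word, a literal dot, another word, whitespace, then the rest of the line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns 1 if the line
# looks like a data record, i.e. "word.word", whitespace, then more text.
sub is_data_line {
    my ($line) = @_;
    return $line =~ /\A (\w+[.]\w+) \s+ (.+) \z/xms ? 1 : 0;
}

for my $line (
    "# Log file 13/3/2008\n",              # comment: '#' is not a word character
    "MA01001A1A03.f1 760 5640111 ad1\n",   # data record
    "#\n",                                 # bare comment marker
    "\n",                                  # blank line
) {
    printf "%-5s %s", (is_data_line($line) ? 'keep' : 'skip'), $line;
}
```

Note the design difference from the first reply: lines that do not match are skipped silently with `next` rather than stopping the program with `die`, which suits log files with comments but can hide malformed records.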
Re^2: How to extract lines starting with new names/words by johngg (Canon) on Mar 13, 2008 at 10:43 UTC
Your `for (<DATA>)` would be better written as `while (<DATA>)`. Using `for` evaluates `<DATA>` in list context, which reads the entire file into memory before the loop starts, rather than processing a line at a time as with `while`. Not a problem, perhaps, with small data sets, but it's not a good habit to get into.

Cheers,

JohnGG
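One way to observe the difference is to check `eof` inside each loop body. The sketch below reads from an in-memory filehandle so it is self-contained; the sub names `eof_flags_for` and `eof_flags_while` are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: loop with for and record, on each iteration,
# whether the handle has already reached end of file.
sub eof_flags_for {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @flags;
    # <$fh> is in list context here: every line is read up front.
    for my $line (<$fh>) {
        push @flags, eof($fh) ? 1 : 0;
    }
    close $fh;
    return @flags;
}

# Hypothetical helper: same check, but reading one line per iteration.
sub eof_flags_while {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @flags;
    # <$fh> is in scalar context here: one line is read each time round.
    while (my $line = <$fh>) {
        push @flags, eof($fh) ? 1 : 0;
    }
    close $fh;
    return @flags;
}

my $text = "one\ntwo\nthree\n";
print "for:   @{[ eof_flags_for($text) ]}\n";    # prints "for:   1 1 1"
print "while: @{[ eof_flags_while($text) ]}\n";  # prints "while: 0 0 1"
```

With `for`, the handle is already exhausted on the first iteration, which shows the whole input was slurped before any line was processed. With `while`, end of file is only reached after the last line has been read.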
Re^2: How to extract lines starting with new names/words by sm2004 (Acolyte) on Mar 13, 2008 at 23:42 UTC
Thanks a lot. I could use the idea for another file I need to extract data from. I'm new to Perl and all your input helped a lot.