RobertCraven has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

this question has been raised before. Unfortunately I was not able to transfer the answers to my problem.

I have to parse a 9GB textfile, I only want to keep lines containing certain strings (UniProt IDs, like P40303 or Q99436).
I tried to adapt a solution from another node, but had no luck.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

my @items;
foreach my $key (keys %hash){
    push(@items,$key);
}

my $rxMatchItems;
{
    local $" = q{|};
    $rxMatchItems = qr{(?:@items)};
}

open(FH,'<','gene_association.goa_uniprot') or die "Can not open/access 'gene_association.goa_uniprot'\n$!";
while(<FH>){
    next unless m{$rxMatchItems};
    print $_;
}
close(FH);
The runtime is endless, could anyone recommend me a better way?
Many thanks,
Thomas

Replies are listed 'Best First'.
Re: Large File Parsing
by jwkrahn (Abbot) on Jan 03, 2010 at 04:25 UTC
    my @items; foreach my $key (keys %hash){ push(@items,$key); }

    Or simply:

    my @items = keys %hash;


    my $rxMatchItems; { local $" = q{|}; $rxMatchItems = qr{(?:@items)}; }

    Or simply:

    my $rxMatchItems = do { local $" = q{|}; qr{(?:@items)} };


    Because your %hash is empty your pattern match becomes:

    $ perl -le'my @items; my $rxMatchItems = do { local $" = q{|}; qr{(?:@items)} }; print $rxMatchItems'
    (?-xism:(?:))

    And the pattern (?-xism:(?:)) will match everything.
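
A minimal sketch of that behaviour, with a guard added. The helper name build_matcher is hypothetical and does not appear in the thread; it simply refuses to build a pattern from an empty list, so a %hash that was never populated fails loudly instead of printing every line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build an alternation regex,
# refusing an empty list because an empty group matches every line.
sub build_matcher {
    my @items = @_;
    die "No items to match; check that the hash was populated\n" unless @items;
    my $alternation = join '|', @items;
    return qr{(?:$alternation)};
}

# What the original code builds when the list is empty: (?:)
my $empty = do { my @none; local $" = q{|}; qr{(?:@none)} };
print "empty pattern matches any line\n" if 'some unrelated line' =~ $empty;

# With real IDs the pattern is selective again.
my $rx = build_matcher(qw(P40303 Q99436));
print "match\n"    if "UniProtKB\tP40303\tfield3" =~ $rx;
print "no match\n" if "UniProtKB\tO00000\tfield3" !~ $rx;
```

Calling build_matcher() with no arguments dies with the message above, which would have pointed at the unpopulated hash straight away.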

      my $rxMatchItems = do { local $" = q{|}; qr{(?:@items)} };

      Oh noes :)

      my $rxMatchItems = join '|', map quotemeta, @items;
      $rxMatchItems = qr/$rxMatchItems/;

        I'm guessing from the variable name and the use of quoting constructs that the OP grabbed that bit of code from one of my solutions, probably one where @items contained values known not to need quotemeta'ing. Invariable application of quotemeta without any consideration of whether it is necessary is just another form of cargo cult programming. I don't think we can tell from the OP's code whether it is required or not. Even if it is required, the do block construct is as valid as using join.

        my $rxMatchItems = do { local $" = q{|}; qr{(?:@{ [ map quotemeta, @items ] })}; };

        Cheers,

        JohnGG
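
To make the quotemeta point concrete, here is a small sketch. The item 'P40303.1' is made up for illustration and contains a regex metacharacter. IDs of the form shown in the question (P40303, Q99436) contain only letters and digits, so quotemeta leaves them unchanged.

```perl
use strict;
use warnings;

# Illustrative items only; 'P40303.1' is a made-up ID containing a metacharacter.
my @items = ('P40303.1', 'Q99436');

# Same do-block construct as above, without and with quotemeta.
my $raw    = do { local $" = q{|}; qr{(?:@items)} };
my $quoted = do { local $" = q{|}; qr{(?:@{ [ map quotemeta, @items ] })} };

my $line = "UniProtKB\tP40303X1\tnot the ID we asked for";

# Unquoted, the '.' matches any character, including the 'X'.
print "unquoted pattern matches\n" if $line =~ $raw;

# Quoted, the '.' is escaped and only a literal dot will do.
print "quoted pattern does not match\n" if $line !~ $quoted;
```

Whether the extra map is worth it depends, as noted above, on what the items can contain.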

Re: Large File Parsing
by educated_foo (Vicar) on Jan 03, 2010 at 06:44 UTC
    These lines, especially the last, clearly show that you have copy-pasted something you don't understand:
    use warnings;
    use strict;
    use Data::Dumper;
    Not to mention the fact that your script doesn't even run -- where is %hash defined?

      Please don't read the previous posting as "using strict and warnings is nonsense", the opposite is true: Always use strict and warnings (except in rare situations like Perl golf). The use of Data::Dumper is nonsense here, and it really looks like cargo cult.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        You always add in use Data::Dumper; whenever you want to debug-print a data structure, and remove it as soon as you remove all the debug prints that need it? Really? I quite often leave it there, knowing that sooner or later I'll need it again.

        Sometimes the holy war against cargo culting is a bit cargo cultish.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

      The hash is populated from a DB.
Re: Large File Parsing
by Marshall (Canon) on Jan 03, 2010 at 23:20 UTC
    I have to parse a 9GB textfile, I only want to keep lines containing certain strings (UniProt IDs, like P40303 or Q99436).

    I would think that the first thing is to decide whether you even need to write any kind of program (Perl or otherwise)! I figure you are on a Unix-type machine. There is a standard program called "grep" that does what you want.

    Type "man grep", "man egrep" at the command line to get some hints. "grep P40303 *.datafile" will output all lines containing P40303 in all files ending in ".datafile".

    But if you must, here is some Perl code...

    #!/usr/bin/perl -w
    use strict;

    my @items = qw (P40303 Q99436 X1234 W9765543);
    my $regex = join ("|",@items);
    print "$regex\n"; # to see what this does
                      # put something like this in your "grep"
                      # P40303|Q99436|X1234|W9765543

    while (<>)
    {
        print if m/$regex/;
    }
    __END__
    Perl 5.10 is pretty smart. I think that the /o option is not necessary here. I don't think more complex syntaxes are either.
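
An alternative that nobody in the thread proposed: if each ID only ever appears whole in one tab-separated column, a hash lookup avoids the regex altogether. The sketch below assumes the ID sits in the second column and that comment lines start with '!', as in GAF-format gene association files; check both against your own file. The names %wanted and filter_lines are hypothetical.

```perl
use strict;
use warnings;

# Assumption: in the thread this set comes from the OP's database-backed %hash.
my %wanted = map { $_ => 1 } qw(P40303 Q99436);

# Hypothetical helper: copy to $out every line of $in whose second
# tab-separated field is a wanted ID. It reads one line at a time, so
# memory use does not grow with the size of the file.
sub filter_lines {
    my ($in, $out, $wanted) = @_;
    my $kept = 0;
    while (my $line = <$in>) {
        next if $line =~ /^!/;    # skip comment lines (assumed '!' prefix)
        # Lines are assumed to have at least three columns.
        my (undef, $id) = split /\t/, $line, 3;
        next unless defined $id && $wanted->{$id};
        print {$out} $line;
        $kept++;
    }
    return $kept;
}

# Small in-memory demonstration; for the real file, open it and pass the handle.
my $sample = join '',
    "!comment line\n",
    "UniProtKB\tP40303\tfield3\n",
    "UniProtKB\tO00000\tfield3\n",
    "UniProtKB\tQ99436\tfield3\n";
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";
my $kept = filter_lines($in, \*STDOUT, \%wanted);
close $in;
print STDERR "kept $kept lines\n";
```

The trade-off is that this only finds IDs in the assumed column, whereas the regex versions above find them anywhere on the line.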
Re: Large File Parsing
by Anonymous Monk on Jan 03, 2010 at 04:15 UTC
    The runtime is endless, could anyone recommend me a better way?

    Probably because your regex is nonsense

      Harsh, but helped