sagar_qwerty has asked for the wisdom of the Perl Monks concerning the following question:

I have a script like this:

open(INFILE, "<", "words.txt");
my @X;
while (<INFILE>) {
    push @X, (split /\s+/, $_)[0];   # choosing a particular column and pushing all its values into a single array
}

$a = 0;
$main = 'temp.txt';
$mod  = 'temp_mod.txt';
for (0 .. $#X) {
    $b = $main;
    open B, $b;
    open NEWFILE1, ">$mod";
    while (<B>) {
        / $X[$a] / or print NEWFILE1 $_;
    }
    close NEWFILE1;
    $main = $mod;
    $mod  = $b;
    $a++;
}

What it does: it removes from temp.txt the lines containing any of the words stored in the first column of words.txt (collected in array @X). It opens temp.txt, removes the lines matching the first word, and saves the result to temp_mod.txt. That file is then reread to remove the second word from @X, and the loop continues, alternately opening and closing temp.txt and temp_mod.txt.

I have many words and many lines in the file (60 MB), so this open/filter/close/reopen cycle consumes a lot of time. Can I open the file just once and remove, in a single pass, all the lines containing the words stored in words.txt?

Replies are listed 'Best First'.
Re: how to avoid opening and closing files
by davido (Cardinal) on Jun 18, 2012 at 05:29 UTC

    It is inefficient to rewrite your entire target file once for each "drop word". Luckily, there is a better algorithm: read your drop-words into a hash, use the hash as a lookup table, and then run through the words in your temp.txt file one time. Whenever a word in the temp.txt file exists in your hash, drop the line and move on to the next. Any line where you don't come across a drop-word, print to a new file.

    use strict;
    use warnings;
    use autodie;
    use List::MoreUtils qw( any );

    my %drop_words;
    open my $words_ifh, '<', 'words.txt';
    while ( <$words_ifh> ) {
        $drop_words{ ( split /\s+/, $_, 2 )[0] } = 1;
    }
    close $words_ifh;

    open my $temp_ifh,   '<', 'temp.txt';
    open my $result_ofh, '>', 'temp_mod.txt';
    while ( <$temp_ifh> ) {
        chomp;
        next if any { exists $drop_words{$_} } split /\s+/;
        print {$result_ofh} $_, "\n";
    }
    close $temp_ifh;
    close $result_ofh;

    If you're not interested in using the non-core module List::MoreUtils, you could achieve about the same goal by changing the "next if any { ... }" line to look like this:

    next if defined first { exists $drop_words{$_} } split /\s+/;

    ...and replacing the "use List::MoreUtils qw( any );" line with use List::Util qw(first); (a core module).


    Dave

Re: how to avoid opening and closing files
by zentara (Cardinal) on Jun 18, 2012 at 10:55 UTC
    Can I open the file just once and remove, in a single pass, all the lines containing the words stored in words.txt?

    Also see Re: Search Replace String Not Working on text file. You can open a file just once, then truncate and rewrite it. Of course, this requires saving your output temporarily in an array. For a 60 MB file, you might instead want to use @ARGV's special line-by-line in-place editing capability, as also shown in the link. That would spare you building a big array.
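    The in-place @ARGV approach described above can be sketched as follows. This is a minimal, untested-against-your-data sketch: the sample lines and the drop words foo and bar are placeholders standing in for your real temp.txt and words.txt contents.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real temp.txt
open my $fh, '>', 'temp.txt' or die $!;
print $fh "keep this line\nfoo should go\nanother keeper\n";
close $fh;

# Stand-in for the lookup hash built from words.txt
my %drop_words = map { $_ => 1 } qw( foo bar );

{
    local $^I   = '.bak';        # turn on in-place editing, keep a .bak backup
    local @ARGV = ('temp.txt');  # file(s) to edit in place
    while (<>) {
        # Printing a line keeps it; skip lines whose fields hit the hash
        print unless grep { exists $drop_words{$_} } split /\s+/;
    }
}
```

    The file is rewritten once, line by line, with no intermediate array; drop the '.bak' value only if you don't need a backup copy.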


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: how to avoid opening and closing files
by cheekuperl (Monk) on Jun 18, 2012 at 04:01 UTC
    1. Read words from words.txt file into @X.
    2. Go through temp.txt and remove any lines that contain any of the words present in @X.
    This is all you are doing, right? The following code is untested.
    open (TEMP, "<temp.txt");
    open (TEMP_MOD, ">temp_mod.txt");
    while ($line = <TEMP>) {
        $flag = 0;
        foreach $word (@X) {
            if ($line =~ /$word/) {   # Does $line contain this $word?
                $flag++;
                last;
            }
        }
        if ($flag == 0) {             # None of the words from @X is present in $line
            print TEMP_MOD $line;
        }
    }
    close TEMP;
    close TEMP_MOD;
    You can also split the line read from TEMP into an array (say @arr) and then apply the array-intersection logic given in Programming Perl to @X and @arr.
    If the intersection set is non-empty, don't write the line to TEMP_MOD.
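    The intersection test above can be sketched with the usual lookup-hash idiom. The words in @X and the sample line here are placeholders, not data from the original post.

```perl
use strict;
use warnings;

# Stand-ins for the real data
my @X   = qw( apple banana );                   # drop words from words.txt
my @arr = split /\s+/, "cherry banana grape";   # fields of one line from temp.txt

# Intersection via a lookup hash: an element of @arr is in the
# intersection exactly when it is a key of %seen
my %seen = map { $_ => 1 } @X;
my @intersection = grep { $seen{$_} } @arr;

# Skip the line when the intersection is non-empty
my $drop_line = @intersection ? 1 : 0;
```

    Building %seen once outside the per-line loop keeps the whole filter linear in the input size, instead of scanning @X for every line.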
Re: how to avoid opening and closing files
by pvaldes (Chaplain) on Jun 18, 2012 at 18:19 UTC
    Can I just open once a file and remove all lines together having words stored in words.txt?

    See map and grep

    (and especially grep(!/regex/, ...), where regex should match the words stored in words.txt)
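    One way to realize that grep(!/regex/) suggestion is to join the drop words into a single alternation pattern and filter the lines in one pass. This is a sketch under assumed data; the words and lines are invented for illustration.

```perl
use strict;
use warnings;

my @X = qw( foo bar );   # stand-in drop words from words.txt

# One compiled regex matching any drop word as a whole word;
# quotemeta guards against metacharacters in the words
my $re = do {
    my $alt = join '|', map quotemeta, @X;
    qr/\b(?:$alt)\b/;
};

my @lines = ("keep me\n", "drop foo here\n", "also keep\n");
my @kept  = grep { !/$re/ } @lines;   # keep only lines with no drop word
```

    Compiling the alternation once with qr// means the per-line work is a single match, rather than one match per word.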

    my @X;
    while (<INFILE>) {
        push @X, (split /\s+/, $_)[0];   # choosing a particular column and pushing all its values into a single array
    }

    I guess that a hash (%X) could be better here, but if you want an array, consider a unique, sorted list or something similar. The idea is to avoid duplicates.
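    The deduplication suggested above can be done with the standard seen-hash idiom; the sample word list is hypothetical.

```perl
use strict;
use warnings;

my @X = qw( foo bar foo baz bar );   # stand-in list with duplicates

# Keep each word only the first time it is seen
my %uniq;
my @X_dedup = grep { !$uniq{$_}++ } @X;   # preserves first-seen order
```

    With duplicates gone, the filtering loop never tests (or rewrites for) the same word twice; using %uniq itself as the lookup table afterwards avoids the array entirely.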