Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

The source file I'm working on is about 20,000 lines, and all the source files will be about the same size. There is a "junk file", a text file containing strings that need to be removed from the source file. Each "junk" string is on its own line, and there will be quite a few of these (over 40 different strings). Here is an example of the junk file's contents:

First Appeared in the Akron Beacon Journal (Ohio)
Email to friend
Get a Map | Get Directions

There will be dozens more than this, but these three lines have to be slurped and removed from another text file. The main issue here is speed: if the junk file contains 40+ different strings and every occurrence of each has to be removed from a 20k-line file, it could take a while, right?

In short, I need to read from junk.txt and be able to use each line as a new substitution to remove them from source.txt

Thank you wise monks.

Replies are listed 'Best First'.
Re: file substitution
by Zaxo (Archbishop) on Jun 04, 2004 at 18:48 UTC

    20,000 lines is not likely to be more than 2 MB or so. That is not so big as to prevent slurping on most machines. The 40-some phrases in the junk file are insignificant in size, so let's compile them just once.

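        use Fcntl ':flock';

        # Compile each junk phrase once, as a literal (metacharacters escaped).
        my @regexen;
        {
            open my $rx, '<', '/path/to/junk.txt' or die $!;
            @regexen = map { chomp; qr/\Q$_\E/ } <$rx>;
            close $rx or die $!;
        }

        for (@list_of_files) {
            local $/;    # slurp mode: read the whole file in one go
            open my $fh, '+<', $_ or warn $! and next;
            flock $fh, LOCK_EX;
            my $contents = <$fh>;
            # study $contents; # may want to try this
            $contents =~ s/$_//g for @regexen;
            seek $fh, 0, 0;
            print $fh $contents or warn $! and next;
            # The new contents are shorter, so cut off the leftover tail.
            truncate $fh, tell $fh or warn $! and next;
            close $fh or warn $! and next;
        }
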
    That should give decent performance. I've locked the file after opening to read and write, so you are protected from races.

    Update: MidLifeXis' reply reminded me to chomp before compiling the regexen. Added to code, along with revised error handling in the big loop.

    After Compline,
    Zaxo

Re: file substitution
by McMahon (Chaplain) on Jun 04, 2004 at 18:14 UTC
Re: file substitution
by MidLifeXis (Monsignor) on Jun 04, 2004 at 18:32 UTC

    I am not sure on performance, as I have not implemented something like this, but...

    How about something like this....

        ...
        my @junk = map { chomp $_; qr(\Q$_\E) } <JUNK>;
        LINE: while (my $line = <>) {
            # Added / Modified
            # Skip (do not print) any line that contains a junk string.
            foreach my $re (@junk) {
                next LINE if $line =~ $re;
            }
            # End update
            print $line;
        }

    Order the junk file in descending order of occurrences, so that the most common strings are tested first and the inner loop can bail out early.
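
    One way to get that ordering is to count how often each junk string appears in a representative sample and sort on the counts. The following is only a sketch: the helper name order_by_frequency and the sample text are made up for illustration, and whether the reordering pays off would have to be measured on real files.

```perl
use strict;
use warnings;

# Hypothetical helper: return the junk strings sorted so that the ones
# occurring most often in $sample come first.
sub order_by_frequency {
    my ($sample, @junk) = @_;
    my %count;
    for my $phrase (@junk) {
        # The "() =" idiom forces list context, so we count every match.
        $count{$phrase} = () = $sample =~ /\Q$phrase\E/g;
    }
    # Highest count first; ties are broken alphabetically for a stable result.
    return sort { $count{$b} <=> $count{$a} or $a cmp $b } @junk;
}

# Made-up sample text and junk list, for illustration only.
my $sample  = "Email to friend ... Email to friend ... Get a Map | Get Directions";
my @ordered = order_by_frequency(
    $sample,
    'Get a Map | Get Directions',
    'Email to friend',
    'Never seen',
);
print "$_\n" for @ordered;
```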

    Another option, to avoid the inner loop, would be to build a giant alternation regexp (qr((foo|bar|biz|bang))) from the junk file. However, then you need to escape various things. In addition, I am not sure which is faster, the alternation regexp or the nested loop.
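
    The alternation idea can be sketched as follows. This is an untested-in-production illustration, not a benchmark: build_junk_regex is a hypothetical helper name, the sample text is invented, and the junk strings are taken from the question. quotemeta handles the escaping, and sorting longest-first keeps a long phrase from losing to a shorter phrase that is its prefix. Perl 5.10 and later can optimize alternations of literal strings, so the relative speed of the two approaches may depend on the perl in use.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex that matches any junk string literally.
sub build_junk_regex {
    my @junk = grep { length } @_;    # ignore empty lines
    my $alternation = join '|',
        map  { quotemeta($_) }        # escape "|", "(", ")" and friends
        sort { length($b) <=> length($a) or $a cmp $b } @junk;
    return qr/$alternation/;
}

my @junk = (
    'First Appeared in the Akron Beacon Journal (Ohio)',
    'Email to friend',
    'Get a Map | Get Directions',
);
my $junk_re = build_junk_regex(@junk);

# Invented sample text, for illustration only.
my $text = "Story text. Email to friend Get a Map | Get Directions More text.\n";
(my $cleaned = $text) =~ s/$junk_re//g;    # one pass removes every junk phrase
print $cleaned;
```

    With this form the substitution is a single s///g per file instead of one pass per junk string.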

    --MidLifeXis

    P.S. All code is untested, blah blah blah.

    Update: Updated foreach stuff - Thanks BrowserUk. Added \Q..\E. Thanks Berik.

      This foreach my $junk (@junk); isn't valid Perl syntax.

      If you want to name the control variable, you have to use the block form, foreach my $name (@things) { #use $name }, not the postfix form.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
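      Be sure to use quotemeta or \Q..\E, so that characters such as | and ( ) in the junk strings are matched literally:

          my @junk = map { chomp $_; qr/\Q$_\E/ } <JUNK>;
          LINE: while (my $line = <>) {
              foreach my $junk (@junk) {
                  next LINE if $line =~ $junk;
              }
              print $line;
          }

      Perl's regexes are not as fast as grep's, but this will probably be fast enough.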
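
      To see why the escaping matters, here is a small sketch using one of the junk strings from the question. The variable names are only for illustration. Without \Q..\E the "|" acts as alternation, so the pattern means "Get a Map " or " Get Directions" and the pipe itself is left behind.

```perl
use strict;
use warnings;

my $junk = 'Get a Map | Get Directions';
my $text = 'Get a Map | Get Directions';

# Unescaped: "|" is alternation, so each half is removed but the "|" survives.
(my $unescaped = $text) =~ s/$junk//g;

# Escaped: every metacharacter in $junk is matched literally.
(my $escaped = $text) =~ s/\Q$junk\E//g;

print "unescaped: [$unescaped]\n";    # prints "unescaped: [|]"
print "escaped:   [$escaped]\n";      # prints "escaped:   []"
```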
      ---
      Berik
Re: file substitution
by TomDLux (Vicar) on Jun 05, 2004 at 00:29 UTC

    I would use the Unix command, 'comm'.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      How?

      comm requires its inputs to be sorted. I doubt that the OP would want his HTML files sorted lexically.

      comm also works on whole lines, not on substrings embedded within longer ones.

