smbs has asked for the wisdom of the Perl Monks concerning the following question:

I have a large text file: 130,000 lines (about 5 MB).
I want to copy all lines that start with the same 5 letters,
e.g. "abcde", and end with the letters "partname", into
a new file.
I am using a "foreach", reading line by line and doing
a match. It works, but it is very slow!
I need something faster!
open C, ">c:\\somefile.txt";
open (FH, "k:\\1\\somefile.txt") || die "Couldn't open file: $!";
@required = ();
@all = <FH>;
foreach $item (@all) {
    next unless (index($item, 'abcde') == 0);
    if ($item =~ /abcde.*PARTNAME/) {
        push(@required, $item);
    }
}
print C @required;

2005-01-12 Janitored by Arunbear - added code tags, as per Monastery guidelines

Re: looking for speed!! large file search and extract
by bart (Canon) on Jan 12, 2005 at 15:38 UTC
    I see no reason to
    • read in the whole file first and then process it line by line, or
    • do a similar test twice when one will do.

    I'd just do

    while(<FH>) { push @required, $_ if /^abcde.*partname$/; }
    or, if that's all you do with those lines:
    while(<FH>) { print C if /^abcde.*partname$/; }
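    For reference, a complete, runnable version of this streaming approach might look like the sketch below (the paths are reused from the original post; lexical filehandles and three-argument open are just stylistic choices, not part of the reply above):

    use strict;
    use warnings;

    # Read the input line by line and write matching lines straight out,
    # so only one line is ever held in memory.
    open my $in,  '<', 'k:\\1\\somefile.txt' or die "Couldn't open input: $!";
    open my $out, '>', 'c:\\somefile.txt'    or die "Couldn't open output: $!";

    while (<$in>) {
        print {$out} $_ if /^abcde.*partname$/;
    }

    close $in;
    close $out;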
Re: looking for speed!! large file search and extract
by holli (Abbot) on Jan 12, 2005 at 15:47 UTC
    one-liner:
    c:\> perl -n -e "print if /^abcde.+PARTNAME$/" c:\somefile.txt>k:\1\so +mefile.txt
    or
    c:\> perl -n -e "print if /^abcde/ && /PARTNAME$/" c:\somefile.txt>k:\ +1\somefile.txt
    whatever is faster.

    Update: The second one is approx. 50% faster.
    I tried with a file of 73 MB and 900,000 lines, where every second line matches.
    One-liner 1 takes 11 seconds, one-liner 2 takes 6 seconds.

    Update:
    A one-liner using substr():
    c:\> perl -n -e "print if substr($_,0,5) eq q(abcde) && substr($_,-9) eq qq(PARTNAME\n)" c:\somefile.txt > k:\1\somefile.txt
      I recommend looking at substr and (if on Unix) the egrep utility, too.

      Caution: Contents may have been coded under pressure.
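      Following up on the egrep suggestion above: on Unix the whole extraction could be a single shell command, for example (the output filename here is arbitrary):

      egrep '^abcde.*PARTNAME$' somefile.txt > extracted.txt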
        Thanks for the answer, but now I have to make a small change:
        I only want to extract a line on the condition that the
        line directly above it starts and ends with the
        following 5 characters: "xyzdf".
        Basically I am looking for a 2-line match.
        Thanks.
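        One way to handle that (just a sketch, assuming the test means the previous line both starts and ends with "xyzdf" as described above, and using the FH/C filehandles from the original post) is to remember the previous line in a variable:

        my $prev = '';
        while (<FH>) {
            # print the current line only if it matches and the line
            # directly above it starts and ends with "xyzdf"
            print C $_ if /^abcde.*PARTNAME$/
                       && $prev =~ /^xyzdf/ && $prev =~ /xyzdf$/;
            $prev = $_;    # remember this line for the next iteration
        }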
Re: looking for speed!! large file search and extract
by Tanktalus (Canon) on Jan 12, 2005 at 15:40 UTC

    Instead of @required and @all and the foreach:

    while (<FH>) { print C $_ if /^abcde.*PARTNAME$/; }
    Note that if "abcde" or "PARTNAME" (or both) come from variables that don't change while reading this particular file, I would compile that regexp once:
    my $re = qr/^$start.*$end$/;
    while (<FH>) { print C $_ if m/$re/; }
    Another possible improvement may be to remove the .* and do two matches:
    my $startre = qr/^$start/;
    my $endre   = qr/$end$/;
    while (<FH>) { print C $_ if m/$startre/ and m/$endre/; }
    Also, please put your code into <code> and </code> tags - makes it much easier to read. Thanks.
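    To check which variant actually wins on your data, a quick comparison with the core Benchmark module might look like this sketch (the sample lines and the 50% match ratio are made up for illustration):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my ($start, $end) = ('abcde', 'PARTNAME');
    my $combined = qr/^\Q$start\E.*\Q$end\E$/;
    my $startre  = qr/^\Q$start\E/;
    my $endre    = qr/\Q$end\E$/;

    # Half of the test lines match, half do not.
    my @lines = map { $_ % 2 ? "abcde some text PARTNAME" : "zzzzz some text OTHER" } 1 .. 10_000;

    cmpthese(-3, {
        one_regex   => sub { my $n = grep { /$combined/ } @lines },
        two_regexes => sub { my $n = grep { /$startre/ && /$endre/ } @lines },
    });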

Re: looking for speed!! large file search and extract
by waswas-fng (Curate) on Jan 12, 2005 at 15:43 UTC
    open INFI, "c:\\input\\file.txt" or die "cant open infile: $!\n";
    open OUTFI, ">c:\\output\\file.txt" or die "cant open outfile: $!\n";
    while (<INFI>) {
        print OUTFI if /^abcde.*partname$/;
    }


    -Waswas
Re: looking for speed!! large file search and extract
by vek (Prior) on Jan 12, 2005 at 22:29 UTC

    Please remember that @all=<FH>; will read the entire file into memory. Safe for small files but might cause you grief with huge files.

    -- vek --
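    Schematically, the difference vek describes is:

    # slurp: all 130,000 lines are held in memory at once
    my @all = <FH>;

    # stream: only the current line is held in memory
    while (my $line = <FH>) {
        # process $line here
    }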
Re: looking for speed!! large file search and extract
by perlsen (Chaplain) on Jan 13, 2005 at 10:35 UTC

    I think this takes less time to process.
    If you wish, please try this:

    undef $/;
    open (FH, "D:\\temp.txt") || die "Couldn't open file: $!";
    @required = ();
    $str = <FH>;
    close(FH);
    (@arr) = $str =~ m#(abcde.*?PARTNAME)#gsi;
    print "$_\n" for @arr;