Jonathan has asked for the wisdom of the Perl Monks concerning the following question:

I have a tab-delimited file that is rather large, so I slurped it up to process it. (I've got loads of RAM :-)

What I need to do is pull out all the records that have, for example, the string 'LONDON' in the nth field. These records need putting into a new file and deleting from the original. This is a straightforward task, but I'm not convinced that my first solution is at all efficient.
my $tsv_data = shift;   # path to the big TSV file; reused below to hold its contents

# Slurp the file
{
    local $/ = undef;
    local *BIGFILE;
    open BIGFILE, "<$tsv_data" or die "Can't open big file: $!";
    $tsv_data = <BIGFILE>;
    close BIGFILE or die "Can't close big file: $!";
}

# Throw it all in an array for testing
my @records = split /\n/, $tsv_data;

# Oops, need the line feeds back again for the output files
@records = map { "$_\n" } @records;

my @london = grep /LONDON/, @records;

# Write to new file, amend old file...
...
I want to look in each line for the string, but I don't like splitting on newlines and then adding them back with the map line.

Is there a better way?

Replies are listed 'Best First'.
Re: Processing slurped file.
by Masem (Monsignor) on Sep 18, 2001 at 17:17 UTC
    For one thing, you can do something like:
    my @records = <BIGFILE>;
    which will slurp into the array without chomping; the trailing newlines will have little effect on your matching.

    Since you then have to write out to a file, there's no need to split on newlines and then wastefully add them back; instead, you can do:

    my @records = <BIGFILE>;
    my @london  = grep /LONDON/, @records;
    print LONDONFILE join "", @london;
    Finally, this is exactly the type of situation where slurping is inefficient, unless you're repeating the process multiple times; it's probably much easier and less of a memory hog to process the file line by line:
    while ( my $line = <BIGFILE> ) {
        if ( $line =~ /LONDON/ ) {
            print LONDONFILE $line;
        }
        else {
            print OTHERFILE $line;
        }
    }

    Update: as arturo pointed out to me, slurping in array mode doesn't chomp.

    -----------------------------------------------------
    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
    It's not what you know, but knowing how to find it if you don't know that's important

      Hmmm, surely setting $/ to undef will cause @records to contain just one record?
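      A minimal sketch of the difference (the filename and handle here are hypothetical):

      # With $/ undefined, the list-context read returns one element holding the
      # whole file; with the default $/ ("\n"), it returns one element per line.
      open my $fh, '<', 'bigfile.tsv' or die "Can't open file: $!";

      my @one_record = do { local $/; <$fh> };   # 1 element: the entire file
      seek $fh, 0, 0;                            # rewind and read again
      my @per_line = <$fh>;                      # 1 element per line, "\n" kept

      printf "slurped: %d element(s); line mode: %d element(s)\n",
          scalar @one_record, scalar @per_line;
      close $fh;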
Re: Processing slurped file.
by dragonchild (Archbishop) on Sep 18, 2001 at 17:12 UTC
    Slurping, splitting on \n, then adding the \n back in is exactly the same as:
    my @records = <BIGFILE>;
    Why not just do that? In fact, you could do something like:
    my $searchstring = 'LONDON';
    my @records = grep /$searchstring/, <BIGFILE>;
    Voila! Just the lines you want in @records, each still ending with \n. :)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means: worry only about what you need to implement.

Re: Processing slurped file.
by Caillte (Friar) on Sep 18, 2001 at 17:32 UTC

    DBD::CSV... Treat it as a database ;)
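    Something along these lines, perhaps (a rough sketch only: the file name, the table mapping, and the 'city' column name are hypothetical, and it assumes the TSV has a header row):

    use DBI;

    my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
        f_dir        => ".",     # directory holding the file
        csv_sep_char => "\t",    # tab-delimited rather than comma-separated
        RaiseError   => 1,
    });

    # Map a table name onto the actual file (both names are hypothetical)
    $dbh->{csv_tables}{big} = { file => "bigfile.tsv" };

    # 'city' stands in for whatever the nth column is called in the header
    my $sth = $dbh->prepare("SELECT * FROM big WHERE city = 'LONDON'");
    $sth->execute;
    while ( my @row = $sth->fetchrow_array ) {
        print join("\t", @row), "\n";
    }
    $dbh->disconnect;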

    $japh->{'Caillte'} = $me;

Re: Processing slurped file.
by arturo (Vicar) on Sep 18, 2001 at 18:40 UTC

    This suggestion certainly won't help your efficiency, but simply grepping for lines that match /LONDON/ doesn't meet your specs. How important the difference is, I don't know, but to meet those specs (and assuming $n holds the number of the field you want):

    my @london = grep { (split /\t/)[$n] =~ /LONDON/ }
                 grep { /LONDON/ } @records;

    This would only perform the split on the members of @records that matched somewhere.

    perl -e 'print "How sweet does a rose smell? "; chomp ($n = <STDIN>); $rose = "smells sweet to degree $n"; *other_name = *rose; print "$other_name\n"'
Re: Processing slurped file.
by petdance (Parson) on Sep 19, 2001 at 00:02 UTC
    Tell us more about your efficiency concerns. How often is this program run? How long does it take to run now? What sort of time would you like it to run in? How big is the data file?

    If after answering those questions you're sure that efficiency is a concern, run it under the Perl profiler with perl -d:DProf program.pl, and then run dprofpp to process the output (automatically stored in the file tmon.out). The output will give you an idea where to focus your efficiency concerns.

    xoxo,
    Andy
    --
    <megaphone> Throw down the gun and tiara and come out of the float! </megaphone>

Re: Processing slurped file.
by broquaint (Abbot) on Sep 18, 2001 at 17:35 UTC
    Well if you've got more memory than sense, then this might suffice for your task at hand -
    open(BF, $tsv_data) or die("It ain't there ($tsv_data): $!\n");
    my $slurp = do { local $/; <BF> };   # slurp the whole file in one go
    my @res = map { "$_\n" } grep { /LONDON/ } split /\n/, $slurp;
    close(BF);
    Grab every line with 'LONDON' and stick on that ever-useful newline. Not necessarily a 'better' way, but certainly a quicker one ;o)
    HTH

    broquaint