PerlMonks  

Multiple substitutions in large files

by mdi (Acolyte)
on May 09, 2005 at 13:41 UTC ( [id://455179]=perlquestion )

mdi has asked for the wisdom of the Perl Monks concerning the following question:

I need to do multiple substitutions in several large (1-10MB) files. I've been using this:
    use strict;
    use warnings;
    use Tie::File;

    foreach my $x (@ARGV) {
        tie my @f, 'Tie::File', $x or die "Could not tie $x: $!\n";
        for (@f) {
            s/^\|/\\N\|/;
            s/\|\s*$/\|\\N/;
            s/\|\s*\|/\|\\N\|/g;
            s/\|\.\s*\|/\|\\N\|/g;
            s/\|\s+/\|/g;
            s/\s+\|/\|/g;
            s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
            s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
        }
    }
...but this is taking entirely too long, and using up too much CPU. How can I do this more efficiently?

Replies are listed 'Best First'.
Re: Multiple substitutions in large files
by Joost (Canon) on May 09, 2005 at 13:48 UTC
Re: Multiple substitutions in large files
by dragonchild (Archbishop) on May 09, 2005 at 13:46 UTC
    #!/usr/bin/perl -p
    s/^\|/\\N\|/;
    s/\|\s*$/\|\\N/;
    s/\|\s*\|/\|\\N\|/g;
    s/\|\.\s*\|/\|\\N\|/g;
    s/\|\s+/\|/g;
    s/\s+\|/\|/g;
    s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
    s/(\d{5})-(?:\d{1,4}|\s+)/$1/;

    Execute as so:

    my_scriptydoo.pl file1 > file2

    Update: ikegami is absolutely correct. I should be doing a redirect. The next 1st level response provides the -pi version.


    • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
    • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"

      Shouldn't that be -pi (or -pi.bak if a backup is desired)? With just -p, the usage would be: my_scriptydoo.pl file1 > file1.new
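
For readers who would rather not rely on command-line switches, here is a minimal sketch of what the -pi suggestion amounts to when written out as an ordinary script. The helper name fix_files_in_place is hypothetical; the substitutions are copied from the question. The explicit chomp is an addition, not part of any reply: Tie::File hands over records without their line ending, whereas in a plain read loop the `\s*` in the second substitution would also match the trailing newline and delete it.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the question's substitutions to the named
# files in place, keeping a backup copy with the given suffix.
sub fix_files_in_place {
    my ($backup_suffix, @files) = @_;
    local $^I   = $backup_suffix;   # same effect as -i.bak on the command line
    local @ARGV = @files;           # the files that <> will read
    while (<>) {
        chomp;                      # keep \s from swallowing the newline
        s/^\|/\\N\|/;
        s/\|\s*$/\|\\N/;
        s/\|\s*\|/\|\\N\|/g;
        s/\|\.\s*\|/\|\\N\|/g;
        s/\|\s+/\|/g;
        s/\s+\|/\|/g;
        s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
        s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
        print "$_\n";               # with $^I set, print goes to the rewritten file
    }
    return;
}

# Only touch files when some are named on the command line.
fix_files_in_place('.bak', @ARGV) if @ARGV;
```

Each file is read once, line by line, and the original is kept as file.bak. On the command line, adding -l to the switches (perl -lpi.bak ...) gives the same chomp-and-restore behaviour.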
Re: Multiple substitutions in large files
by ikegami (Patriarch) on May 09, 2005 at 14:58 UTC

    a|b||d becomes a|b|\N|d
    |b|c|d becomes \N|b|c|d
    a|b|c| becomes a|b|c|\N
    and similarly,
    a|b|.|d becomes a|b|\N|d
    but
    .|b|c|d does not become \N|b|c|d
    a|b|c|. does not become a|b|c|\N
    Is that a bug?
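
The six cases above can be reproduced with a small script. This is only a sketch: the sub name original_subs is made up for illustration, and it assumes each line has already been chomped, as Tie::File does for its records.

```perl
use strict;
use warnings;

# The eight substitutions from the question, applied to one chomped line.
sub original_subs {
    my ($line) = @_;
    for ($line) {                   # alias $_ to our private copy
        s/^\|/\\N\|/;
        s/\|\s*$/\|\\N/;
        s/\|\s*\|/\|\\N\|/g;
        s/\|\.\s*\|/\|\\N\|/g;
        s/\|\s+/\|/g;
        s/\s+\|/\|/g;
        s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
        s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
    }
    return $line;
}

# Print each example line next to its transformed version.
for my $in ('a|b||d', '|b|c|d', 'a|b|c|', 'a|b|.|d', '.|b|c|d', 'a|b|c|.') {
    printf "%-8s => %s\n", $in, original_subs($in);
}
```

The last two inputs come back unchanged because the only rule that looks for a lone dot requires a pipe on both sides of it.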

    If the above is a bug, the following regexps are probably faster:

    s/\s*\|\s*/\|/g;
    s/^\.?(?=\|)/\\N/;
    s/(?<=\|)\.?(?=\||$)/\\N/g;
    s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    s/(?<=\d{5})-(?:\d{1,4}|\s+)//;

    If the above is not a bug, the following regexps are probably faster:

    s/\s*\|\s*/\|/g;
    s/^(?=\|)/\\N/;
    s/(?<=\|)(?=\||$)/\\N/g;
    s/(?<=\|)\.(?=\|)/\\N/g;
    s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
    s/(?<=\d{5})-(?:\d{1,4}|\s+)//;

    I reduced the number of regexps by combining a few, shortened them by stripping the whitespace around the pipes first (not last), and used zero-width positive lookaheads and lookbehinds to minimize the text being captured and substituted.

    Use this in conjunction with the -p or -pi suggestion for better results.
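
Since "probably faster" is a guess, it is worth checking on real data. The sketch below compares the question's regexps with the "not a bug" set on a few sample lines and, when run with a --bench argument, times both using the core Benchmark module. The sub names, the sample lines, and the --bench flag are all assumptions made for this example. Lines are assumed to be chomped.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The substitutions from the question.
sub original_subs {
    my ($line) = @_;
    for ($line) {
        s/^\|/\\N\|/;
        s/\|\s*$/\|\\N/;
        s/\|\s*\|/\|\\N\|/g;
        s/\|\.\s*\|/\|\\N\|/g;
        s/\|\s+/\|/g;
        s/\s+\|/\|/g;
        s/(\d{2}:\d{2}:\d{2})\.\d+/$1/g;
        s/(\d{5})-(?:\d{1,4}|\s+)/$1/;
    }
    return $line;
}

# The "not a bug" set from this reply: whitespace first, then lookarounds.
sub lookaround_subs {
    my ($line) = @_;
    for ($line) {
        s/\s*\|\s*/\|/g;
        s/^(?=\|)/\\N/;
        s/(?<=\|)(?=\||$)/\\N/g;
        s/(?<=\|)\.(?=\|)/\\N/g;
        s/(?<=\d{2}:\d{2}:\d{2})\.\d+//g;
        s/(?<=\d{5})-(?:\d{1,4}|\s+)//;
    }
    return $line;
}

# Made-up sample lines; substitute lines from the real files.
my @samples = (
    'a|b||d', '|b|c|d', 'a|b|c|', 'a|b|.|d', '.|b|c|d', 'a|b|c|.',
    'x | 12:34:56.789 | 12345-6789 | y',
    'a|||d',
);

# Report whether both sets agree on each sample.
for my $in (@samples) {
    my ($old, $new) = (original_subs($in), lookaround_subs($in));
    printf "%-36s %s\n", $in,
        $old eq $new ? "same: $old" : "DIFFER: $old vs $new";
}

# Optional timing run: at least one CPU second per variant.
if (@ARGV && $ARGV[0] eq '--bench') {
    cmpthese(-1, {
        original   => sub { original_subs($_)   for @samples },
        lookaround => sub { lookaround_subs($_) for @samples },
    });
}
```

One difference to be aware of: on a line with two adjacent empty fields, such as a|||d, the original `s/\|\s*\|/.../g` consumes the middle pipe and fills only the first field, while the zero-width version fills both. Whether that matters depends on the data.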
