OtakuGenX has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write a program that opens a huge file, reads it in one line at a time, works on each line with a regex entered earlier in the code from STDIN, and then outputs it to a new file. In the past I have done this work in Vi, but the files are so big that it takes a LONG time. I had hard-coded Perl scripts to do this, but the files change regularly and/or I am asked for other data. So what I was trying to do was write a script that would ask me for the regex at run time, without me having to hard-code it. Here is what works:
open (INPUTFILE, "< $filein");
while (<INPUTFILE>) {
    my $line=$_;
    $line =~ s/\),\(/\)\n\(/gi;
};
But if I do this:
my $regex=<STDIN>; #Entering s/\),\(/\)\n\(/gi
chomp $regex;
open (INPUTFILE, "< $filein");
while (<INPUTFILE>) {
    my $line=$_;
    $line =~ $regex;
};
It does nothing. I could really use some help :-)

Replies are listed 'Best First'.
Re: Regex stored in a scalar
by Laurent_R (Canon) on Aug 21, 2015 at 19:26 UTC
    A regex should be between / / marks, so something like this:
    $line =~ /$regex/;
    but that won't work either in your case, because:
    s/\),\(/\)\n\(/
    is a substitution operator, it is not a regex. Only the part between the first two slashes is a regex, not the rest.

    So if you want to make substitutions you probably want to capture two inputs from the user or from the command line, the searched pattern and the substitution (not tested).

    my $regex = <STDIN>;
    chomp $regex;
    my $subst = <STDIN>;
    chomp $subst;
    while (<INPUTFILE>) {
        my $line = $_;
        s/$regex/$subst/gi;
    }
    And for a big file, you might want to add the o modifier, it may be faster (but that may depend on your version of Perl).

    Update: Oh, BTW, if you're used to doing it in vi, you might consider sed. You should feel at home.

    Update 2: I crossed out the first part of my answer, as it was at least very incomplete, as kindly pointed out by AnomalousMonk. Only the second part was really relevant to the OP problem.

      It would be better to quote the pattern with qr// rather than using /o. Like this:
      $regex = qr/$regex/
        Yes, you're probably right, ++, this was just a quick additional note for speed, not much to do with the OP question.
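For what it's worth, a minimal sketch of the qr// suggestion: the pattern is compiled once, outside the loop, and then reused for every line. The sub name `replace_in_lines` and the sample data are made up for illustration; in the OP's case the pattern and replacement would come from STDIN.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): compile the user's pattern once
# with qr// and apply it to every line.
sub replace_in_lines {
    my ($pattern, $replacement, @lines) = @_;

    # qr// compiles the pattern here, once; an invalid pattern dies at this
    # point rather than somewhere inside the loop.
    my $re = qr/$pattern/i;

    for my $line (@lines) {
        $line =~ s/$re/$replacement/g;
    }
    return @lines;
}

my @out = replace_in_lines('\),\(', ")\n(", '(abc),(def),(ghi)');
print "$_\n" for @out;
```

Note that the replacement is passed here as a real newline; a replacement typed at a prompt would contain the two characters backslash and n instead.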
      A regex should be between / / marks ...

      The  =~ operator is sufficiently DWIMic that it will take any string as a regex. It will even take all the  qr// regex modifiers if they are embedded as  (?adlupimsx-imsx) extended patterns.

      c:\@Work\Perl>perl -wMstrict -le
      "my $regex = '(?xms) ((.) \2{2,})';
       ;;
       for my $s (qw(aeiou aeeiou aeiiiou aeioooou)) {
         print qq{match: captured '$1'} if $s =~ $regex;
         }
      "
      match: captured 'iii'
      match: captured 'oooo'

      ... but that won't work ...

      The first comment I made is actually rather trivial in the face of your second point; the substitution  s/\),\(/\)\n\(/ is, indeed, a substitution and not a regex — and bang, the whole endeavor hits a brick wall.


      Give a man a fish:  <%-{-{-{-<

        Yes, AnomalousMonk, you are right ++, and I actually knew that what I was saying was not quite right. I was not thinking of what you said in your comments; rather, I was thinking that you can have a regex in the form:
        if (m{pattern}) { # ...
        or many other delimiter pairs. I really wanted to get that point out of the way quickly to get to the real thing about the substitution not being a regex, so I was a bit negligent in the way I wrote that first part.

        You're absolutely right, the first part of my comment stood for correction.

      sed would likely be wicked fast for this, too.
        It depends on the quality of the sed implementation (and also the Perl version). I have seen cases where Perl was 2 to 5 times faster than either sed or awk (I don't remember for sure from which vendor), although this was more than 10 years ago. In the more recent tests I made (but with rather old OS), there was no significant difference and it would also depend on the complexity of the processing being applied.

        I think that, in general, tests are required to decide the best way to go (if it matters at all, e.g. if your files to be processed are really so large that it will make a significant difference for you).

      I do like your sed suggestion. I have a snippet I've used for 20 years to process either one or many files, and it's fast and low load. I've changed tens of thousands of files on a server in mere moments (after extensive testing of course!! to save restoring).

      This command will find and replace the string 'old' with 'new' in all files with the htm/html extension, recursively from where you run the command. Be careful, there's no undo! Use your regex as usual. Hopefully someone will find this snippet useful, I sure have, thousands of times!

      find . -name '*.htm*' -type f | xargs sed -i 's/old/new/g'

      The OP has )\n( as the substitution. I can't see how to enter that with my $subst = <STDIN>;
      poj

        True, it won't work this way, but this quick test under the Perl debugger shows that it might be feasible with a slight syntax tweak and one further step to process it. Here, I am passing the replacement string with \n to my debugger session:
        $ perl -de 42 foo\\n

        Loading DB routines from perl5db.pl version 1.33
        Editor support available.

        Enter h or `h h' for help, or `man perldebug' for more help.

        main::(-e:1):   42
          DB<1> $c = shift;
          DB<2> x $c
        0  'foo\\n'
          DB<3> $c =~ s|\\n|\n|;
          DB<4> x $c
        0  'foo
        '
          DB<5>
        So, this seems to work, although it might not be the most elegant construct. Of course, it also assumes that you know beforehand, when writing your program, that some newline escapes might need to be reprocessed.
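The same reprocessing step can be sketched as a small sub, assuming only a handful of escapes matter. The name `unescape` and the `%ESCAPE` table are illustrative, not something from the thread.

```perl
use strict;
use warnings;

# Escapes this sketch understands; anything else is left untouched.
my %ESCAPE = ('n' => "\n", 't' => "\t", '\\' => '\\');

# Hypothetical helper: turn the two characters '\' 'n' typed by the user
# into a real newline (likewise \t and \\).
sub unescape {
    my ($text) = @_;
    $text =~ s/\\([nt\\])/$ESCAPE{$1}/g;
    return $text;
}

my $typed = ')\n(';          # what chomp(<STDIN>) would hand back
my $subst = unescape($typed);
print $subst eq ")\n(" ? "newline expanded\n" : "unchanged\n";
```

Escapes outside the table, such as \), are left exactly as typed, so with s/$regex/$subst/ the user would enter the replacement as )\n( rather than \)\n\(.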
Re: Regex stored in a scalar
by Anonymous Monk on Aug 21, 2015 at 19:36 UTC

    If the only thing you're doing inside the loop is one or two regex substitutions, and the way you describe it, the scripts sound like they're throwaways, you may want to look at the -e, -p and maybe also -i switches in perlrun, i.e. write one-liners:

    $ cat foo.txt
    one
    two
    three
    $ perl -wMstrict -pe 's/^t(?!h)/th/; s/(.)\1/$1/g' foo.txt > bar.txt
    $ cat bar.txt
    one
    thwo
    thre
    $ perl -wMstrict -pe 's/th/ph/g' -i.bak bar.txt
    $ cat bar.txt
    one
    phwo
    phre
    $ cat bar.txt.bak
    one
    thwo
    thre
Re: Regex stored in a scalar
by BillKSmith (Monsignor) on Aug 22, 2015 at 03:43 UTC
    You can execute your substitution on the default string $_.
    use strict;
    use warnings;
    my $string = '(abc),(def),(ghi)';
    my $substitution = 's/\),\(/\)\n\(/gi';
    $_ = $string;
    eval "$substitution";
    print;

    OUTPUT:

    (abc)
    (def)
    (ghi)
    Bill
      Since my original reply, I have discovered that my concept of "build the command and evaluate it" can be generalized to meet your original requirement.
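      use strict;
      use warnings;
      my $filein = shift;   # assumed: input file name from the command line (not shown in the OP's snippet)
      my $regex=<STDIN>; #Entering s/\),\(/\)\n\(/gi
      chomp $regex;
      open (INPUTFILE, "< $filein");
      while (<INPUTFILE>) {
          my $line=$_;
          #$line =~ $regex;
          eval "\$line =~ $regex";
      };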
      Bill
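Since the file is huge, one variation on this idea is to run the string eval once and get back a code ref, instead of evaluating a string for every line. This is only a sketch: the name `compile_substitution` and its sanity check are assumptions made for illustration, and the check is not a security measure.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a command such as 's/\),\(/\)\n\(/gi' into a
# code ref. The string eval runs once here, not once per input line.
# WARNING: this is still string eval, so the command can run arbitrary code.
sub compile_substitution {
    my ($command) = @_;

    # Crude sanity check (an assumption of this sketch): only accept s///
    # with slash delimiters and the flags g, i, m, s, x.
    die "expected something like s/old/new/g\n"
        unless $command =~ m{\A s/ .* / .* / [gimsx]* \z}xs;

    my $code = eval "sub { \$_[0] =~ $command }";
    die "could not compile '$command': $@" unless $code;
    return $code;
}

my $apply = compile_substitution('s/\),\(/\)\n\(/gi');
my $line  = '(abc),(def),(ghi)';
$apply->($line);    # modifies $line in place through the @_ alias
print "$line\n";
```

Inside the read loop the call would then be `$apply->($line)` for each line.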
Re: Regex stored in a scalar
by atcroft (Abbot) on Aug 22, 2015 at 06:15 UTC

    I wanted to do something similar to this recently, but with the left and right-hand patterns stored in a database. The problem I ran into, however, was if I tried to use capture variables, such as the following (contrived) example:

    Any suggestions?

      Hello atcroft,

      The only way I can find to do this is to pull the substitution apart into its component steps and perform these separately:

      #! perl
      use strict;
      use warnings;

      my $c = q{asdfghjk};
      my @regex = (
          { lh => q{(gh)}, rh => q{__$1__}, },
          { lh => q{(h_)}, rh => q{_h!$1!}, },
      );

      print q{Original: }, $c, "\n";

      for my $i (0 .. $#regex)
      {
          if ($c =~ /$regex[$i]{lh}/)
          {
              my $s = $1;
              my $d = $regex[$i]{rh};
              $d =~ s/\$1/$s/;
              $c =~ s/$regex[$i]{lh}/$d/;
          }
      }

      print q{Final: }, $c, "\n";

      Output:

      17:37 >perl 1352_SoPW.pl
      Original: asdfghjk
      Final: asdf__g_h!h_!_jk

      17:39 >

      This is far from elegant, and I keep thinking there must be a simpler way involving s///ee — but I haven’t found it.

      Anyway, hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        One way to do it in one step per regex:

        c:\@Work\Perl>perl -wMstrict -le
        "my $c = q{asdfghjk};
         print qq{ original: '$c'};
         ;;
         my @regex = (
           { lh => q{(gh)}, rh => q{__$1__}, },
           { lh => q{(h_)}, rh => q{_h!$1!}, },
         );
         ;;
         for my $hr_s (@regex) {
           $c =~ s[ (?-x)$hr_s->{lh}]{ qq{qq{$hr_s->{rh}}} }xmsgee;
           print qq{intermediate: '$c'};
           }
         ;;
         print qq{ final: '$c'};
        "
         original: 'asdfghjk'
        intermediate: 'asdf__gh__jk'
        intermediate: 'asdf__g_h!h_!_jk'
         final: 'asdf__g_h!h_!_jk'
        Since  s///e or  s///ee is string eval, AnonyMonk's warning/advice here still holds. See Re: Evaluating $1 construct in literal replacement expression and associated nodes for more discussion.


        Give a man a fish:  <%-{-{-{-<

      $_ = 'foo';
      $left = '(.)(.)';
      $right = '$1$2$2$1';
      s{$left}{"qq{$right}"}ee;
      print "$_\n";
      s{$left}{eval "qq{$right}"}e;
      print "$_\n";
      __END__
      foofo
      foofofo

      first /e turns "" into a string qq{$1$2$2$1}

      second /e interpolates qq{$1$2$2$1} at the correct time and substitutes into the original string

      string eval is eval so arbitrary code could be executed

      So, to make it safer, instead of eval ... use some form of String::Interpolate/String::Interpolate::RE
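If pulling in a module is not an option, one eval-free alternative (a different technique from the s///ee trick above) is to expand only plain $N placeholders by hand. The helper names `expand_template` and `substitute_all` are hypothetical, and the sketch deliberately ignores everything except $1, $2 and so on in the replacement.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $1, $2, ... in a template with captured text.
# Only plain $N placeholders are recognised; nothing is ever eval'ed.
sub expand_template {
    my ($template, @captures) = @_;    # $captures[0] holds what $1 matched
    $template =~ s{\$(\d+)}{
        my $n = $1;
        $n >= 1 && defined $captures[$n - 1] ? $captures[$n - 1] : ''
    }ge;
    return $template;
}

# Hypothetical helper: like s/$pattern/$template/g, with $N in the template
# filled in from the captures of each match.
sub substitute_all {
    my ($text, $pattern, $template) = @_;
    my $re = qr/$pattern/;
    my ($out, $last) = ('', 0);

    while ($text =~ /$re/g) {
        my ($start, $end) = ($-[0], $+[0]);

        # Copy the captures out of @- and @+ before any other regex runs.
        my @captures;
        for my $i (1 .. $#+) {
            push @captures,
                defined $-[$i] ? substr($text, $-[$i], $+[$i] - $-[$i]) : undef;
        }

        $out .= substr($text, $last, $start - $last)
              . expand_template($template, @captures);
        $last = $end;
    }
    return $out . substr($text, $last);
}

my $c = 'asdfghjk';
$c = substitute_all($c, '(gh)', '__$1__');
$c = substitute_all($c, '(h_)', '_h!$1!');
print "$c\n";    # asdf__g_h!h_!_jk, as in the examples above
```

The trade-off is that the replacement language is much smaller than what string eval allows, which is the point: text read from a user or a database cannot run code this way.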

        Thanks Anonymous Monk and AnomalousMonk,

        So, the technique is to doubly double-stringify the RHS before doubly evaluating it! Analogous to the trick of using @{ [...] } to interpolate a function-returned list into a string.

        I like String::Interpolate (the module, not its documentation!):

        #! perl
        use strict;
        use warnings;
        use String::Interpolate qw( interpolate );

        my $c = q{asdfghjk};
        my @regex = (
            { lh => q{(gh)}, rh => q{__$1__}, },
            { lh => q{(h_)}, rh => q{_h!$1!}, },
        );

        print q{Original: }, $c, "\n";

        for my $i (0 .. $#regex)
        {
            $c =~ s/ $regex[$i]{lh} / interpolate($regex[$i]{rh}) /ex;
        }

        print q{Final: }, $c, "\n";

        Output:

        13:37 >perl 1352_SoPW.pl
        Original: asdfghjk
        Final: asdf__g_h!h_!_jk

        13:37 >

        Cheers,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Regex stored in a scalar
by 1nickt (Canon) on Aug 21, 2015 at 18:57 UTC

    What is in $regex?

    try

    $line =~ /$regex/;

    The way forward always starts with a minimal test.
Re: Regex stored in a scalar
by james28909 (Deacon) on Aug 21, 2015 at 19:43 UTC
    EDIT: Updated script. This works for me:
    use strict;
    use warnings;

    print "Enter left side for search: ";
    my $LeftSide = <STDIN>;
    print "Enter right side replacement: ";
    my $RightSide = <STDIN>;
    chomp($LeftSide);
    chomp($RightSide);

    while(<DATA>){
        print if s/$LeftSide/$RightSide/g;
        #print "Replaced \"$LeftSide\" with \"$RightSide\" at: line $.: $_" if ($_ =~ s/$LeftSide/$RightSide/g);
    }
    __DATA__
    hello
    this. is line 2
    line, 3 ),(
    this is test for line 4 )
    testing line ,5
    now testing line (6)
    Ran it with script.pl \),\( \)\n\(
    Outputs:
    C:\Users\James\Desktop>test.pl \),\( \)\n\(
    Replaced "\),\(" with "\)\n\(" at: line 3 "line, 3 \)\n\("

    Posting some example input would be helpful :)

      Hum, I was just going to say something to the effect that your script was not very useful, but as I hit the reply button, I just saw your edited version. This is indeed much more to the point.
        I tested it then noticed that he was indeed trying to search and replace. Did a ninja edit ; I was also about to change ARGV's to <STDIN>'s and chomp them as well.
      try
      __DATA__
      (123),(456),(789)
      (abc),(def),(ghi)
      the result should be
      (123)
      (456)
      (789)
      (abc)
      (def)
      (ghi)
      poj
        I was able to get that output by changing the script like:
        print $file $_ if s/$LeftSide/$RightSide/eegi;

        And then left side is just a comma ',' without the quotes and right side is "\n" WITH the quotes

        EDIT: Using above little snippet, Use left side as \),\( and right side as "\)\n\(". And from what I understand the above snippet is the same as doing:

        print $_ if s/$LeftSide/eval $RightSide/egi;

        So I also suggest reading PerlDoc: eval.

        Here's another, more hackish way to get the job done haha:

        Create your main script like so, with keywords in the s///

        # main.pl
        # all this is just a template that creates "run.pl"
        use strict;
        use warnings;
        while(<DATA>){
            print $_ if s/search_here/replace_here/gi;
        }
        __DATA__
        (123),(456),(789)
        (abc),(def),(ghi)

        Then this following script will search and replace the keywords "search_here" and "replace_here" in the script above, with whatever you input and put it in "run.pl"!

        # prepare_run.pl
        open my $file, '+<', 'main.pl'; #your original script we will replace keywords
        open my $run, '+>', 'run.pl'; #newly created script that we will execute below
        print "Enter left side of s///: ";
        chomp(my $LeftSide = <STDIN>);
        print "Enter right side of s///: ";
        chomp(my $RightSide = <STDIN>);
        while(my $line = <$file>){
            print $run $line if $line !~ /.*search_here.*/ || /.*replace_here.*/;
            print $run $line if $line =~ s/(.*)search_here(.*)/$1$LeftSide$2/ && $line =~ s/(.*)replace_here(.*)/$1$RightSide$2/;
        }
        close($file);
        close($run);
        system("run.pl"); #or whatever the equivalent is for your OS.

        Here is the script that the above will create and run:

        # run.pl, will be created after running "prepare_run.pl" while using "main.pl" as a template.
        use strict;
        use warnings;
        while(<DATA>){
            print $_ if s/\),\(/\)\n\(/gi;
        }
        __DATA__
        (123),(456),(789)
        (abc),(def),(ghi)

        Download all the above and then just run "prepare_run.pl" and it will copy lines from main.pl while replacing keywords with your regex from STDIN and put it all in run.pl for execution. You can use \),\( and \)\n\( for STDIN per normal without using any quotes.

        Here is the output:

        (123)
        (456)
        (789)
        (abc)
        (def)
        (ghi)
Re: Regex stored in a scalar
by anonymized user 468275 (Curate) on Aug 24, 2015 at 14:08 UTC
    Feeding the regex in via STDIN seems a bit clunky. What about using command line options, e.g. -s "regex" -r "replacement" and see getopt for a wealth of option parsers to load in their arguments.
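A minimal sketch of that idea using the core Getopt::Long module. The option names -s and -r come from the suggestion above; the `parse_args` wrapper, the usage text and the file name are assumptions for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: pull -s (search pattern) and -r (replacement) out of
# an argument list; whatever is left over is treated as file names.
sub parse_args {
    my @args = @_;
    my ($search, $replace);
    GetOptionsFromArray(
        \@args,
        's=s' => \$search,
        'r=s' => \$replace,
    ) or die "usage: $0 -s PATTERN -r REPLACEMENT [FILE...]\n";
    die "both -s and -r are required\n"
        unless defined $search && defined $replace;
    return ($search, $replace, @args);
}

# A real script would call parse_args(@ARGV); the list here is sample input.
my ($search, $replace, @files) = parse_args('-s', '\),\(', '-r', ')(', 'big.txt');
print "search=$search replace=$replace files=@files\n";
```

On the command line the pattern would need shell quoting, e.g. -s '\),\(' in a Unix shell.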

    One world, one people

Re: Regex stored in a scalar
by OtakuGenX (Initiate) on Aug 25, 2015 at 17:08 UTC
    OK, I ended up going with s/$search/$replace/g . I guess I had hoped to do it the other way simply for ease. Thank you ALL, your comments are AWESOME and help a TON!!!
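For anyone landing here later, a rough sketch of how that approach might look as a line-by-line file filter. The sub name `filter_file` and the file names in the comment are made up; three-argument open with lexical filehandles is used instead of the bareword INPUTFILE from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: stream $in_path to $out_path one line at a time,
# applying s/$search/$replace/g to each line.
sub filter_file {
    my ($in_path, $out_path, $search, $replace) = @_;
    my $re = qr/$search/;    # compile the pattern once, before the loop

    open my $in,  '<', $in_path  or die "cannot read $in_path: $!\n";
    open my $out, '>', $out_path or die "cannot write $out_path: $!\n";

    while (my $line = <$in>) {
        $line =~ s/$re/$replace/g;
        print {$out} $line;
    }

    close $in;
    close $out or die "cannot close $out_path: $!\n";
    return;
}

# In the real script $search and $replace would come from STDIN (chomp'ed)
# and the file names from the command line, e.g. (hypothetical names):
#   filter_file('dump.sql', 'dump_split.sql', $search, $replace);
```

Only one line is held in memory at a time, so the size of the input file should not matter much.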