kemuel has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to convert a huge text file into "OSIS XML markup". One thing that I have to do is replace all the quotation marks with tags so that:

He said, »Someone once said ›This is what they said‹ but I say something else.«

is turned into something like:

He said, <q marker="»" sID="1234"/>Someone once said <q marker="›" sID="3456"/>This is what they said<q marker="‹" eID="3456"/> but I say something else.<q marker="«" eID="1234"/>

I wrote the following code and it works, but it is extremely slow, especially with the huge multiline text files I feed it:

    #!/usr/bin/perl
    use warnings;
    use Getopt::Long;

    my $input   = "";
    my $output  = "";
    my $datamat = "";
    my $doodle  = "";

    GetOptions('infile=s' => \$input, 'outfile=s' => \$output) or die $!;
    open my $in_fh,  '<', $input  or die "Can't open $input: $!";
    open my $out_fh, '>', $output or die "Can't open $output: $!";

    while (<$in_fh>) {
        $datamat .= $_;
    }

    my $i = 0;
    while ($datamat =~ m/»(.*?)«/gs) {
        $i++;
        $datamat =~ s/»(.*?)«/<q marker="»" sID="$i"\/>$1<q marker="«" eID="$i"\/>/s;
    }
    while ($datamat =~ m/›(.*?)‹/gs) {
        $i++;
        $datamat =~ s/›(.*?)‹/<q marker="›" sID="$i"\/>$1<q marker="‹" eID="$i"\/>/s;
    }

    print { $out_fh } $datamat or die $!;
    close $in_fh  or die $!;
    close $out_fh or die $!;

This script took about an hour to work through one of my files.

Is there a way to do this that is more efficient and not so time- and CPU-demanding?

Re: Replace quotation-marks with tags in a huge text-file
by Eily (Monsignor) on Sep 12, 2015 at 14:52 UTC

    Here are some things you can do to improve your code:

    First, if you can't process the file line by line, don't read it line by line. You can read it all at once like this:

        my $data;
        {
            local $/ = undef;    # local means this change will be reversed at the end of the block
            $data = <$in_fh>;
        }

    $/ is the input record separator. By default it is "\n", which tells perl to read the file and stop every time it encounters a newline. You could also set it to "", and perl will read the file one paragraph at a time (paragraphs being separated by at least two consecutive "\n").
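
A minimal sketch of the slurp idiom described above, run against an in-memory filehandle so that it works without a real file. The helper name slurp_fh and the sample text are assumptions made for illustration; they are not part of the original post.

```perl
use strict;
use warnings;

# slurp_fh is a hypothetical helper name, not from the original post.
# It reads everything left on a filehandle into one string.
sub slurp_fh {
    my ($fh) = @_;
    local $/ = undef;    # no record separator: one read returns the whole file
    my $data = <$fh>;
    return defined $data ? $data : '';    # an empty file yields undef
}

# Demo on an in-memory filehandle (sample text is made up).
my $text = "line one\nline two\n\nline four\n";
open my $fh, '<', \$text or die "Can't open in-memory file: $!";
my $data = slurp_fh($fh);
close $fh or die $!;

print length($data), " characters read in a single read\n";
```

Because $/ is localized inside the sub, code that runs afterwards still reads line by line as usual.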

        while ($data =~ /REGEX/g) {
            $data =~ s/REGEX/rep/;
        }

    Here the /g is useless: the string changes between each call to the first regex, so perl reads the string (your complete file) from the beginning each time. A single

        $data =~ s/REGEX/rep/g;

    would do what you want, except for the part that you haven't figured out yet: your non-constant replacement.

    The s/// operator allows the right side to be dynamic with the /e switch. To use it, just write Perl code that returns what you want, e.g. $i_want = "string, $1".$i++."another string"; then put that expression in the right side of your replacement and add /e: s/REGEX/"string, $1".$i++."another string"/e; Notice that I have kept the same code, including the quotes.

    This last part even allows you to do all your matches at once. If your regex is />(.*?)<|»(.*?)«/ you can write: $i_want = "String".($1||$2)." ".($i++); So: s/>(.*?)<|»(.*?)«/"String".($1||$2)." ".($i++)/gse;
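
A sketch of how the single-pass substitution above might look when it emits the OSIS-style tags from the question. The helper name tag_quotes is hypothetical, and the code assumes the text is a decoded character string. It tests defined $1 rather than ($1||$2), so a quote whose whole content is "0" or empty is still handled.

```perl
use strict;
use warnings;
use utf8;    # the quote characters below are literal UTF-8 in this source

# tag_quotes is a hypothetical helper name, not from the thread.
# It expects a decoded character string.
sub tag_quotes {
    my ($text) = @_;
    my $i = 0;
    $text =~ s{›(.*?)‹|»(.*?)«}{
        $i++;
        defined $1
            ? qq{<q marker="›" sID="$i"/>$1<q marker="‹" eID="$i"/>}
            : qq{<q marker="»" sID="$i"/>$2<q marker="«" eID="$i"/>};
    }gse;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print tag_quotes("He said »one« and then ›two‹.\n");
```

One caveat with the combined pattern: an outer »...« match consumes any ›...‹ pair inside it, so the inner quotes in the question's example would be left untagged by this single pass. Running one substitution per quote style, or tagging each mark individually, avoids that.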

      Thank you so much. That really simplified my script a lot.
      And processing the file now takes a few seconds instead of an hour.

      I'm really happy right now

Re: Replace quotation-marks with tags in a huge text-file
by AnomalousMonk (Archbishop) on Sep 12, 2015 at 15:25 UTC

    Try something like this (semi-tested):

        use warnings;
        use strict;

        my $datamat = "He said, »Someone once said ›This is what they said‹ but I say something else.«";
        print qq{'$datamat' \n\n};

        my $i = 0;
        $datamat =~ s{ » (.*?) « }
                     { ++$i; qq{<q marker="»" sID="$i"/>$1<q marker="«" eID="$i"/>}; }xmsge;
        $datamat =~ s{ › (.*?) ‹ }
                     { ++$i; qq{<q marker="›" sID="$i"/>$1<q marker="‹" eID="$i"/>}; }xmsge;
        print qq{[[$datamat]] \n\n};

    (Assumes the entire file has been read (i.e., "slurped" (update: see also File::Slurp)) into the $datamat variable.)
    (Update: Also assumes matching » ... « and › ... ‹ quote character pairs are never nested!)

    Update: Please see perlre, perlretut, and perlrequick.

    Update 2: A regex like » (.*?) « may run faster if written as » ([^«]*) « instead.
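
A small sketch comparing the two forms mentioned in Update 2. The helper names tag_lazy and tag_negated and the sample string are assumptions for illustration. The sketch only checks that both patterns give the same result; it makes no timing claim, since any speed difference depends on the input.

```perl
use strict;
use warnings;
use utf8;    # the quote characters below are literal UTF-8 in this source

# Hypothetical helpers: each tags »...« pairs in a decoded character string.
sub tag_lazy {
    my ($text) = @_;
    my $i = 0;
    $text =~ s{ » (.*?) « }
              { ++$i; qq{<q marker="»" sID="$i"/>$1<q marker="«" eID="$i"/>} }xmsge;
    return $text;
}

sub tag_negated {
    my ($text) = @_;
    my $i = 0;
    # [^«]* matches newlines too, so the quote may span several lines
    $text =~ s{ » ([^«]*) « }
              { ++$i; qq{<q marker="»" sID="$i"/>$1<q marker="«" eID="$i"/>} }xmsge;
    return $text;
}

my $sample = "Han sagde: »første\ncitat« og derefter »andet citat«.\n";

binmode STDOUT, ':encoding(UTF-8)';
print tag_negated($sample);
print "both patterns agree\n" if tag_lazy($sample) eq tag_negated($sample);
```

The negated class states directly that the captured text may contain anything except the closing mark, so the engine does not have to test for « after every single character as it does with the lazy quantifier.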


    Give a man a fish:  <%-{-{-{-<

Re: Replace quotation-marks with tags in a huge text-file
by poj (Abbot) on Sep 12, 2015 at 14:20 UTC

    Not very clever but give this a go

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $input  = 'infile.txt';
        my $output = 'outfile.txt';
        my $t0     = time();

        open my $in_fh,  '<', $input  or die "Can't open $input: $!";
        open my $out_fh, '>', $output or die "Can't open $output: $!";

        my $s     = 1234;
        my @id    = ();
        my $count = 0;
        while (my $line = <$in_fh>) {
            my @pos = ();
            while ($line =~ m/([»«›‹])/g) {
                my $q = $1;
                if ($q =~ /[»›]/) {
                    # opening mark: remember its ID until the closing mark turns up
                    push @pos, [pos($line), qq{<q marker="$q" sID="$s"\/>}];
                    push @id, $s++;
                }
                else {
                    # closing mark: pair it with the most recently opened ID
                    my $e = pop @id;
                    push @pos, [pos($line), qq{<q marker="$q" eID="$e"\/>}];
                }
            }
            # reverse to preserve positions after replacement
            for (reverse @pos) {
                substr($line, $_->[0] - 1, 1) = $_->[1];
            }
            print $out_fh $line;
            ++$count;
        }
        close $in_fh  or die $!;
        close $out_fh or die $!;

        my $dur = time() - $t0;    # elapsed seconds
        print "$count lines processed in $dur seconds\n";

    poj

      I like the idea, but it does not work.

      For some reason it matches more than just the quotes and thus inserts quote tags all over the place.
      And every time a match is made it is made twice, so I end up with double the tags.

      Another problem is that this only checks the file line by line, which is no good since many of the quotes span several lines.

      However, it gave me something to work with.

        Do you have an example of the text that fails?

        The quote marks themselves don't span lines, so that shouldn't matter.

        poj

        Here's an example of matching something that is not a quote:

        before:

        Copyright © 2002, 2006 by Biblica, Inc.® 
        Used by permission. All rights reserved worldwide.
        
        These Scriptures are copyrighted and have been made available on the Internet for your personal use only. Any other use including, but not limited to, copying or reposting on the Internet is prohibited. These Scriptures may not be altered or modified in any form and must remain in their original context. These Scriptures may not be sold or otherwise offered for sale. 
        These Scriptures are not shareware and may not be duplicated.
        These Scriptures are not public domain. 
        

        after:

        Copyright <q marker="Â" sID="1"/>© 2002, 2006 by Biblica, Inc.<q marker="Â" sID="2"/>® 
        Used by permission. All rights reserved worldwide.
        
        These Scriptures are copyrighted and have been made available on the Internet for your personal use only. Any other use including, but not limited to, copying or reposting on the Internet is prohibited. These Scriptures may not be altered or modified in any form and must remain in their original context. These Scriptures may not be sold or otherwise offered for sale. 
        These Scriptures are not shareware and may not be duplicated.
        These Scriptures are not public domain. 
        

        And here is a match with too many tags:

        before:

        Dernæst sagde Gud: »Lad vandet under himmelhvælvingen samle sig, så det tørre land kan ses!« Og sådan skete det. 
        

        after:

        Dernæst sagde Gud: <q marker="Â" sID="9"/><q marker="»" sID="10"/>Lad vandet under himmelhvælvingen samle sig, så det tørre land kan ses!<q marker="Â" sID="11"/><q marker="«" eID="11"/> Og sådan skete det.
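
The marker="Â" in the output above suggests that the script is matching bytes rather than characters. In UTF-8, », «, © and ® all begin with the byte 0xC2, which displays as Â when read as Latin-1, so a character class written without use utf8 and applied to undecoded input ends up matching individual bytes. Below is a sketch of the same stack-based idea working on decoded text. It assumes the input is UTF-8; the helper name tag_marks and the "unmatched" placeholder are hypothetical.

```perl
use strict;
use warnings;
use utf8;                  # the quote characters in this source are UTF-8
use Encode qw(decode);

# tag_marks is a hypothetical helper. It expects a decoded character string
# and keeps a stack of open IDs, so nested and multi-line quotes pair up.
sub tag_marks {
    my ($text, $next_id) = @_;
    my @open;
    $text =~ s{([»«›‹])}{
        my $q = $1;
        my $attr;
        if ($q eq '»' or $q eq '›') {
            push @open, $next_id;
            $attr = 'sID="' . $next_id++ . '"';
        }
        else {
            my $id = pop @open;
            # "unmatched" is an assumption for closing marks with no opener
            $attr = 'eID="' . (defined $id ? $id : 'unmatched') . '"';
        }
        qq{<q marker="$q" $attr/>};
    }ge;
    return $text;
}

# Raw UTF-8 bytes, as an undecoded read would deliver them.
my $bytes = "Copyright \xC2\xA9 2002 \xC2\xBBLad vandet\xC2\xAB\n";
my $chars = decode('UTF-8', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print tag_marks($chars, 1);
```

When reading from a file, opening it with '<:encoding(UTF-8)' and the output with '>:encoding(UTF-8)' would take the place of the explicit decode call.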