in reply to Replace quotation-marks with tags in a huge text-file

Not very clever, but give this a go:

#!/usr/bin/perl
use strict;
use warnings;

my $input  = 'infile.txt';
my $output = 'outfile.txt';
my $t0 = time();

open my $in_fh,  '<', $input  or die "Can't open $input: $!";
open my $out_fh, '>', $output or die "Can't open $output: $!";

my $s = 1234;    # next start ID to hand out
my @id = ();     # stack of IDs for quotes that are still open
my $count = 0;

while (my $line = <$in_fh>){
  my @pos = ();  # [position, replacement tag] for each mark on this line
  while ($line =~ m/([»«›‹])/g) {
    my $q = $1;
    if ($q =~ /[»›]/){
      # opening mark: emit a start tag and remember its ID
      push @pos,[pos($line),qq{<q marker="$q" sID="$s"/>}];
      push @id,$s++
    } else {
      # closing mark: pair it with the most recently opened ID
      my $e = pop @id;
      push @pos,[pos($line),qq{<q marker="$q" eID="$e"/>}];
    }
  }
  # reverse to preserve positions after replacement
  for (reverse @pos){
    substr($line,$_->[0]-1,1) = $_->[1];
  }
  print $out_fh $line;
  ++$count;
}

close $in_fh or die $!;
close $out_fh or die $!;

my $dur = time() - $t0;
print "$count lines processed in $dur seconds\n";
poj

Re^2: Replace quotation-marks with tags in a huge text-file
by kemuel (Novice) on Sep 12, 2015 at 14:40 UTC

    I like the idea, but it does not work.

    For some reason it matches more than just the quotation marks, so it inserts tags all over the place.
    And every genuine match is made twice, so I end up with two tags per quotation mark.

    Another problem is that this only checks the file line by line, which is no good since many of the quotes span several lines.

    However, it gave me something to work with.

      Do you have an example of the text that fails?

      The quote marks themselves don't span lines, so reading line by line shouldn't matter: the stack of open IDs (@id) is kept between lines.
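
A minimal sketch of why that holds, assuming the text is handled as decoded UTF-8 characters. The `use utf8` pragma, the helper name `tag_lines`, and the sample lines are assumptions for illustration, not part of the script above. Because the ID stack lives outside the per-line loop, a quote opened on one line still gets the matching eID on a later line:

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical helper, not part of the original script: tag a list of
# lines while keeping one ID stack alive across all of them.
sub tag_lines {
    my @lines   = @_;
    my $next_id = 1;    # next sID to hand out
    my @open_ids;       # IDs of quotes opened but not yet closed

    my $tag = sub {
        my ($q) = @_;
        if ($q =~ /[»›]/) {              # opening mark: new ID, remember it
            push @open_ids, $next_id;
            return sprintf '<q marker="%s" sID="%d"/>', $q, $next_id++;
        }
        my $id = pop(@open_ids) // '?';  # closing mark: most recent open ID
        return sprintf '<q marker="%s" eID="%s"/>', $q, $id;
    };

    # Replace every quotation mark in every line, in order.
    s/([»«›‹])/$tag->($1)/ge for @lines;
    return @lines;
}

binmode STDOUT, ':encoding(UTF-8)';
print tag_lines("He said: »first line\n", "second line« and so on.\n");
```

The opening mark on the first line receives sID="1" and the closing mark on the second line receives eID="1", even though each line is processed separately.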

      poj

      Here's an example of matching something that is not a quote:

      before:

      Copyright © 2002, 2006 by Biblica, Inc.® 
      Used by permission. All rights reserved worldwide.
      
      These Scriptures are copyrighted and have been made available on the Internet for your personal use only. Any other use including, but not limited to, copying or reposting on the Internet is prohibited. These Scriptures may not be altered or modified in any form and must remain in their original context. These Scriptures may not be sold or otherwise offered for sale. 
      These Scriptures are not shareware and may not be duplicated.
      These Scriptures are not public domain. 
      

      after:

      Copyright <q marker="Â" sID="1"/>© 2002, 2006 by Biblica, Inc.<q marker="Â" sID="2"/>® 
      Used by permission. All rights reserved worldwide.
      
      These Scriptures are copyrighted and have been made available on the Internet for your personal use only. Any other use including, but not limited to, copying or reposting on the Internet is prohibited. These Scriptures may not be altered or modified in any form and must remain in their original context. These Scriptures may not be sold or otherwise offered for sale. 
      These Scriptures are not shareware and may not be duplicated.
      These Scriptures are not public domain. 
      

      And here is a match with too many tags:

      before:

      Dernæst sagde Gud: »Lad vandet under himmelhvælvingen samle sig, så det tørre land kan ses!« Og sådan skete det. 
      

      after:

      Dernæst sagde Gud: <q marker="Â" sID="9"/><q marker="»" sID="10"/>Lad vandet under himmelhvælvingen samle sig, så det tørre land kan ses!<q marker="Â" sID="11"/><q marker="«" eID="11"/> Og sådan skete det. 
      

        Which Perl version do you have? Those examples work OK for me on v5.16.1 on Windows 8.1.

        poj
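
One possible explanation for the stray "Â" markers and the doubled tags, assuming both the script and the input file are saved as UTF-8 and are read without `use utf8` or an `:encoding(UTF-8)` layer: the character class is then a set of raw bytes. In UTF-8, » is C2 BB, « is C2 AB, › is E2 80 BA and ‹ is E2 80 B9, so the class also matches the lone C2 byte that starts © (C2 A9) and ® (C2 AE), and it matches both bytes of » and «. The sketch below compares the two modes; the helper names and the sample string are hypothetical:

```perl
use strict;
use warnings;
use utf8;                  # literals in this file are characters
use Encode qw(encode);

# Hypothetical helper: matches when the text is decoded characters.
sub count_char_matches {
    my ($text) = @_;
    my $n = () = $text =~ /[»«›‹]/g;
    return $n;
}

# Hypothetical helper: matches when text and class are raw UTF-8 bytes,
# which is what the script sees without "use utf8" and an encoding layer.
sub count_byte_matches {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    # The distinct bytes that make up » « › ‹ in UTF-8.
    my $n = () = $bytes =~ /[\xC2\xBB\xAB\xE2\x80\xBA\xB9]/g;
    return $n;
}

my $sample = "Inc.® »Lad«";
printf "characters: %d, bytes: %d\n",
    count_char_matches($sample), count_byte_matches($sample);
# prints "characters: 2, bytes: 5"
```

If that assumption is right, adding `use utf8;` to the script and opening the handles with '<:encoding(UTF-8)' and '>:encoding(UTF-8)' should make the original pattern match whole characters only. A single-byte file encoding on Windows would also explain why the examples work there.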