PerlMonks  

multiple-pass search?

by propellerhat (Novice)
on Dec 09, 2021 at 18:54 UTC ([id://11139514]=perlquestion)

propellerhat has asked for the wisdom of the Perl Monks concerning the following question:

I have several hundred text files in which I need to copy a 3-digit serial number from a representation in Arabic numerals ([0-9]) to a representation in English (['zero' - 'nine']).

Thus, article number "345" needs also the label "threefourfive"; article number "004" needs also the label "zerozerofour".

The serial number appears in a single instance in the text of each file with the label "No.", as in "No. 345".

The English representation is a LaTeX command, prefixed by "\" as in "\threefourfive". As a placeholder for the English representation, each file contains the string "\zerozerozero".

If it can be useful, it happens that the serial number appears also in the filename, as in "abstract-345.tex".

The English representation allows (using the LaTeX package "catchfile") a single article title to be maintained in a separate file, so that it may be used in several documents (catalogue, abstract, article).

I do not know how to approach this; perhaps using a substitution with "s///"? It occurs to me that matching with the greedy modifier "/g" could also match against ordinary English words in the text files.

Replies are listed 'Best First'.
Re: multiple-pass search?
by choroba (Cardinal) on Dec 09, 2021 at 19:01 UTC
    Yes, substitution is the right tool. You can use Lingua::EN::Numbers to turn digits into words.

    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    use Lingua::EN::Numbers qw{ num2en };

    my $text = 'abc No. 347 xyz';
    $text =~ s/No\. \K(\d+)/join "", "\\", map num2en($_), split m{}, $1/ge;
    print $text;  # abc No. \threefourseven xyz

    I used /e which evaluates the replacement part as code. The regex matches "No. " followed by a number, but replaces just the number due to \K. It splits the number into digits, replaces each with the word (via num2en) and joins them together with a \ at the beginning.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
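
Applied to the original task, the same idea becomes two small passes over each file's text: one match to read the serial number after "No.", then one substitution to fill the `\zerozerozero` placeholder. The following is only a sketch using core Perl; the helper names `digits_to_words` and `fill_placeholder` are hypothetical, and it assumes each file holds exactly one three-digit "No. NNN" label.

```perl
use strict;
use warnings;

# Lookup table: the index is the digit, the value is its English name.
my @digit_word = qw(zero one two three four five six seven eight nine);

# Hypothetical helper: '345' -> 'threefourfive'
sub digits_to_words {
    my ($digits) = @_;
    return join '', map { $digit_word[$_] } split //, $digits;
}

# Hypothetical helper: fill the placeholder in one document's text.
sub fill_placeholder {
    my ($text) = @_;

    # Pass 1: find the serial number after the "No." label.
    my ($serial) = $text =~ /No\.\s+(\d{3})\b/
        or return $text;    # no serial number found: leave the text unchanged

    # Pass 2: replace the placeholder command, but not a longer command
    # that merely starts with the same letters.
    my $command = digits_to_words($serial);
    $text =~ s/\\zerozerozero(?![A-Za-z])/\\$command/g;
    return $text;
}

my $sample = "\\zerozerozero\nArticle No. 345\n";
print fill_placeholder($sample);    # \threefourfive, then "Article No. 345"
```

Because the pattern in the second pass includes the leading backslash, ordinary English words such as "zero" in the body text are never touched, with or without /g.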
Re: multiple-pass search?
by jdporter (Paladin) on Dec 09, 2021 at 20:29 UTC
    use Tie::File;

    sub replace_serialnumbers_in_file($) {
        my @word = qw( zero one two three four five six seven eight nine );
        my $filename = shift;
        # /g is needed so that every digit is captured, not just the first;
        # assuming no other digits in the filename
        my $serno = join '', map $word[$_], $filename =~ /(\d)/g;
        tie my @lines, 'Tie::File', $filename or die;
        s/\\zerozerozero/\\$serno/g for @lines;
    }

    You don't need multipass if you take the serial number from the filename.

    I reckon we are the only monastery ever to have a dungeon staffed with 16,000 zombies.

      Here is how I extract the serial number from the filename:

      use File::Find;

      my $dir = "documents";

      find( sub {
          my $filename = $_;
          return unless ( $filename =~ /abstract-([0-9][0-9][0-9]).tex/ && -f $filename );
          my $serialnumber = $1 ;

        So I take it the filename will have the exact pattern abstract-NNN.tex. If so, the regex you gave is too broad. It will match, for example, nonabstract-000stexts. You need to anchor the beginning and end, and escape the dot: /^abstract-(\d{3})\.tex$/
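
The difference between the two patterns is easy to check on a few names. This is a small sketch; the variable names `$loose` and `$strict` and the third test filename are made up for illustration.

```perl
use strict;
use warnings;

my $loose  = qr/abstract-([0-9][0-9][0-9]).tex/;   # pattern from the question
my $strict = qr/^abstract-(\d{3})\.tex$/;          # anchored, with the dot escaped

for my $name ('abstract-345.tex', 'nonabstract-000stexts', 'abstract-345.tex.bak') {
    printf "%-22s loose=%-5s strict=%s\n",
        $name,
        ($name =~ $loose  ? 'match' : 'no'),
        ($name =~ $strict ? 'match' : 'no');
}
# abstract-345.tex       loose=match strict=match
# nonabstract-000stexts  loose=match strict=no
# abstract-345.tex.bak   loose=match strict=no
```

Note that `$` still permits a single trailing newline; `\z` is the stricter choice if that matters.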

Re: multiple-pass search?
by LanX (Saint) on Dec 09, 2021 at 20:22 UTC
    This should get you started. I kept it flexible so that you can adjust it.

    DB<49> sub english_num { my ($pre,$num) = @_; my $eng = join "-", map {(qw/zero one two three four five six seven eight nine/)[$_] } split //,$num; return "$pre \\$eng"}

    DB<50> $txt =" some text No. 345 other text No. 123 end text"

    DB<51> $txt =~ s/(No.) (\d{3})/english_num($1,$2)/ge

    DB<52> say $txt
     some text No. \three-four-five other text No. \one-two-three end text

    edit

    In case you are sure that it's always exactly 3 digits, you can also use a hardwired regex, with a lookup array

    s/(No.) (\d)(\d)(\d)/$1 \\$nums[$2]-$nums[$3]-$nums[$4]/g

    DB<94> $_ =" some text No. 345 other text No. 123 end text"

    DB<95> p
     some text No. 345 other text No. 123 end text

    DB<96> s/(No.) (\d)(\d)(\d)/$1 \\$nums[$2]-$nums[$3]-$nums[$4]/g

    DB<97> p
     some text No. \three-four-five other text No. \one-two-three end text
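
The session above uses `@nums` without showing where it comes from. A minimal standalone version, assuming `@nums` is simply a digit-indexed list of English names (that definition is an assumption, not taken from the post), could look like this:

```perl
use strict;
use warnings;

# Assumed definition of the lookup array: the index is the digit.
my @nums = qw(zero one two three four five six seven eight nine);

my $text = ' some text No. 345 other text No. 123 end text';

# Copy the text, then rewrite every "No. NNN" in the copy.
(my $labelled = $text) =~ s/(No\.) (\d)(\d)(\d)/$1 \\$nums[$2]-$nums[$3]-$nums[$4]/g;

print "$labelled\n";
#  some text No. \three-four-five other text No. \one-two-three end text
```

Dropping the hyphens from the replacement gives the `\threefourfive` form asked for in the question.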

    update

    after reading the OP again, please provide an SSCCE clarifying input and expected output.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      This is about the best I can do by way of providing a SSCCE:

      1) files:

         a) title files (one title per file):

            title-001.tex
            title-002.tex
            ...
            title-999.tex

         b) catchfile index (one file of a thousand lines; a thousand titles is about three or four times the number needed):

            \CatchFileDef{\zerozerozero}{title-000.tex}{}
            \CatchFileDef{\zerozeroone}{title-001.tex}{}
            ...
            \CatchFileDef{\nineninenine}{title-999.tex}{}

         c) document files (several categories, having same title):

            article-001.tex article-002.tex ... article-003.tex
            abstract-001.tex abstract-002.tex ... abstract-003.tex
            catalogue-001.tex catalogue-002.tex ... catalogue-003.tex

      2) In the head of each document file is a placeholder for the English representation of the serial number of the title: "\zerozerozero". If the placeholder is not useful, I can delete it.

      3) In the head of each document file is the serial number of the title, in Arabic representation: "No. 345".

      4) The serial number of the title appears also in the filename of the document file: "article-345".

      5) The objective is to write in the document file the English representation of the serial number of the title: "\threefourfive".

      6) Once the English representations are in place, I can use Perl to make necessary adjustments.

        Unfortunately you have not provided so much as a single line of Perl here. As such it is impossible to know at which point you are encountering a problem, let alone what that problem is.

        Here is the sort of SSCCE you could have written:

        use strict;
        use warnings;
        use Test::More tests => 3;

        my $filename = 'abstract-345.tex';
        my $have = <<'EOT';
        foo
        Here: \zerozerozero
        bar
        No. 345
        baz
        EOT

        my $want = <<'EOT';
        foo
        Here: \threefourfive
        bar
        No. 345
        baz
        EOT

        my @digits = qw/zero one two three four five six seven eight nine/;
        my ($arabic) = $filename =~ /-([0-9]{3})\.tex/;
        (my $eng = $arabic) =~ s/([0-9])/$digits[$1]/g;
        $have =~ s/\\zerozerozero/\\$eng/;

        is $arabic, '345', 'Digits extracted';
        is $eng, 'threefourfive', 'Converted to English';
        is $have, $want, 'Replaced in text';

        Now you can see how to perform these three operations. If that doesn't solve your problem you need to provide some runnable code which demonstrates the problem which you are having (ideally with a test such as shown here). In that way we will know what it is you are actually asking.

        There's a detailed rationale at How to ask better questions using Test::More and sample data.


        🦛

Re: multiple-pass search?
by jwkrahn (Abbot) on Dec 10, 2021 at 07:00 UTC
    It occurs to me that matching with the greedy modifier "/g"

    From perlop:

              g Match globally, i.e., find all occurrences.
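
In other words, /g controls how many matches are found, while greediness is a property of quantifiers such as `*` and `+`. A short sketch of the distinction, with a made-up sample string:

```perl
use strict;
use warnings;

my $text = 'No. 345 and No. 123';

# /g is about how many matches are found: all of them, not just the first.
my @first = $text =~ /No\. (\d+)/;     # ('345')
my @all   = $text =~ /No\. (\d+)/g;    # ('345', '123')

# Greediness is about how much a single quantifier consumes.
my ($greedy) = $text =~ /(N.*\d)/;     # longest possible: the whole string
my ($lazy)   = $text =~ /(N.*?\d)/;    # shortest possible: 'No. 3'

print "first:  @first\n";
print "all:    @all\n";
print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Either way, a substitution only changes text that the pattern itself matches, so /g will not touch ordinary English words unless the pattern is written loosely enough to match them.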

Node Type: perlquestion [id://11139514]
Approved by marto
Front-paged by Corion