wrml has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

This is probably a very simple question, but I'm only in my first few days of Perl programming. I'm an economist trying to fix some data that did not use negative signs but instead used letters to represent negative numbers. The data is organized so that these values fill fields that are five characters long, and there are many of them (thousands of columns by thousands of rows). I want a program that will replace them, but not with a simple find-and-replace substitution. Rather, I need to tell Perl to look at an individual string of five columns, and then the next one. That is to say,

OOOOOOOOOJ

is really the same as:

OOOOOJOOOO

I think there are other weird data entries, and I need to convert these in chunks. That is to say, I don't want every J to be -1; rather, I need to give instructions to convert the rows between 955 and 959 to -0001 if that block contains a J anywhere. I have a dictionary in Stata that identifies the blocks (they are not all equally long), and I think I might be able to use it again. Does anyone have ideas for an easy way to do this?

The force is not so much with me when it comes to perl (though I'm learning!).

  • Comment on Find and replace by five character string

Replies are listed 'Best First'.
Re: Easy find and replace loop. HELP!
by Zaxo (Archbishop) on Aug 10, 2005 at 19:03 UTC

    There is an easy way in perl to do such edits in-place, but that's not what you want. Here's how to write the edited data to new copies of your files instead.

    We'll start with the splatline and other preliminaries. The splatline can be useful even on win32 systems, since perl can still obtain options from it.

    #!/usr/bin/perl
    use warnings;
    use strict;

    We'll define a glob pattern for the input files, and an output path, and bring in a standard module function for extracting filenames from paths.
    my $infiles = '/path/to/RETAKC*.txt';
    my $outpath = '/another/path/';
    use File::Basename 'basename';

    Now let's define a hash whose keys are the text to be replaced, and values, their replacement. We'll construct a big alternation regex from the keys. It will capture whatever it matches.
    my %substitute = (
        '0000J' => '000-1',
        # . . .
    );
    my $regex = qr/(${\join '|', keys %substitute})/;

    Now we're ready to do the deed. We get all our filenames with glob and loop over them. We open each to read, and an output file to accept the results. We apply our substitution globally to each line and print to the output file.
    for my $file (glob $infiles) {
        open my $in, '<', $file or warn $! and next;
        open my $out, '>', $outpath.basename($file) or warn $! and next;
        while (<$in>) {
            s/$regex/$substitute{$1}/g;
            print $out $_;
        }
    }

    That's it. All your substitutions made to new copies of all your files.
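
    To check the substitution step on its own, here is a minimal sketch that applies the same hash-driven regex to one in-memory line rather than to files. The '0000K' entry and the sample line are made up for illustration and are not taken from the real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Lookup table: text to find => replacement.
# The '0000K' entry is an assumption added only for this demonstration.
my %substitute = (
    '0000J' => '000-1',
    '0000K' => '000-2',
);

# One alternation built from all keys; the parens capture the matched key in $1.
my $regex = qr/(${\join '|', keys %substitute})/;

# Hypothetical sample line with ordinary data around the coded fields.
my $line = "123450000J000070000K99999";

# Copy the line, then substitute every match in the copy.
(my $fixed = $line) =~ s/$regex/$substitute{$1}/g;

print "$fixed\n";
```

    Note that the regex matches wherever the key text occurs in the line; it does not check that a match starts on a five-character field boundary.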

    After Compline,
    Zaxo

      Thank you very much for the help. I'm interested to learn why this works. I think I understand everything except how the module works and what is happening here:

      my $regex = qr/(${\join '|', keys %substitute})/;

      If you would be so kind as to explicate I would be oh so grateful.

      walter

        The File::Basename module is a library which just defines a number of functions which are useful for extracting pieces of file paths. By giving 'basename' as an argument to the use statement which loads the module, we get the basename() function imported to our main:: namespace. That lets us call it without its fully qualified name, File::Basename::basename, in constructing the output file path. The module is standard; it has shipped with perl for years.
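
        As a small illustration of what basename() contributes to the script, the sketch below builds an output path from an input path. The file name RETAKC01.txt is hypothetical, chosen only to fit the glob pattern used earlier.

```perl
use strict;
use warnings;
use File::Basename 'basename';

# Hypothetical input file and the output directory from the script above.
my $file    = '/path/to/RETAKC01.txt';
my $outpath = '/another/path/';

my $name   = basename($file);      # drops the directory part of the path
my $target = $outpath . $name;     # where the edited copy would be written

print "$name\n$target\n";
```

        On a Unix-like system this should print the bare file name followed by the full output path.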

        The regex construction is somewhat more compact than you should usually expect to see. The qr// operator is a quote which produces a compiled regex out of its contents. It interpolates variables (things starting with $ or @) in the same way double quotes or qq// do. Instead of producing a named variable with the regex text in it, I used dereference (${}) of a reference to the regex text to get an "anonymous variable" to be interpolated. All that because interpolation in quotes does not call functions, but just fills in things with $ or @ sigils. That line could have been split in two in a way that many would prefer,

        my $alt = join '|', keys %substitute;  # make one string of all
                                               # keys with pipes between
        my $regex = qr/($alt)/;                # compile the regex
                                               # with parens to capture

        but I prefer having the assignment made in a single statement without superfluous variables.
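
        To see that the two spellings build the same pattern, here is a short sketch that compiles both and compares them. The second hash entry is illustrative, and the keys are sorted only so that both regexes see them in the same order.

```perl
use strict;
use warnings;

# The '0000K' entry is an assumption added for this comparison.
my %substitute = ( '0000J' => '000-1', '0000K' => '000-2' );

# Fix the key order so both patterns are built from the same list.
my @keys = sort keys %substitute;

# Two-step form: a named temporary holds the alternation text.
my $alt      = join '|', @keys;
my $two_step = qr/($alt)/;

# One-step form: reference to join's result, dereferenced inside the quote.
my $one_step = qr/(${\join '|', @keys})/;

print "same pattern\n" if "$one_step" eq "$two_step";
```

        The keys here are plain digits and letters, so they need no escaping. If keys could contain regex metacharacters, passing each one through quotemeta before joining would be worth considering.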

        After Compline,
        Zaxo

Re: Easy find and replace loop. HELP!
by Transient (Hermit) on Aug 10, 2005 at 18:24 UTC
    Some questions: Is it only upper case letters? Does it stop at Z? Is it a specific 5 character string that has four zeroes then an uppercase letter from J to Z at the end? Can there be more than one on a line? Can there be other information on either side? Is it one big line or are there multiple lines in a file? If J is -1 and K is -2, and T is -11, does the string become 000-11 or 00-11?
      Good questions, I should have clarified.

      The lines are very long and there are a lot of them (these files are very large). I believe the lines are 15,000 characters long and there might be 2,000 of them. There is data on both sides, and there can be more than one "0000J" and "0000K" on the same line. The letters go from J (-1) to U (-12). As for 000-11 or 00-11: the fields will always be five characters long, so it is 00-11. I don't know if the letters are all capitalized, and I would like to be prepared if they are not (though I think I know how to make this change in Perl). There is certainly no other use for those letters, so one need not worry about changing a lowercase letter that means something else.

      thanks again,

      walter
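
      Given that clarification, the lookup table for a hash-based substitution could be generated instead of typed by hand. This is a sketch under two assumptions: every coded field is four zeros followed by the letter, and a lowercase letter means the same as its uppercase form.

```perl
use strict;
use warnings;

my %substitute;
my $n = 0;
for my $letter ('J' .. 'U') {
    $n++;                                    # J => 1, K => 2, ... U => 12
    my $value = '-' . $n;                    # "-1" through "-12"
    my $field = ('0' x (5 - length($value))) . $value;   # left-pad to five characters

    $substitute{ '0000' . $letter }     = $field;
    $substitute{ '0000' . lc($letter) } = $field;   # assumed: lowercase is equivalent
}

print "$_ => $substitute{$_}\n" for sort keys %substitute;
```

      Each replacement is exactly five characters wide, so the rest of the line keeps its column positions.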

        Here's one way to do your search and replace. I suggest you look into each element of why this works. Opening the file and looping through the lines should be simple enough (there are many tutorials on that). Note that this one will treat upper and lower case characters the same.

        Try this example:
        #!/usr/bin/perl
        use strict;
        use warnings;

        my $string = "ABC0000K0000T12345";
        $string =~ s/0000([J-U])/my $x=ord(uc($1))-ord('I');('0'x(4-length($x))).'-'.$x/ige;
        print $string, "\n";
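
        The original question also asked about converting only within known blocks of columns. A minimal sketch of that idea, using substr instead of a line-wide regex, follows. The block positions are hypothetical; the real ones would come from the Stata dictionary mentioned in the question.

```perl
use strict;
use warnings;

# Hypothetical block layout: [1-based start column, width].
my @blocks = ( [ 4, 5 ], [ 9, 5 ] );

my $line = "ABC0000K0000T12345";

for my $block (@blocks) {
    my ($start, $width) = @$block;
    my $field = substr($line, $start - 1, $width);   # substr offsets are 0-based
    next unless $field =~ /([J-U])/i;                # no letter code in this block

    my $n     = ord(uc($1)) - ord('I');              # J => 1 ... U => 12
    my $fixed = ('0' x ($width - 1 - length($n))) . '-' . $n;
    substr($line, $start - 1, $width) = $fixed;      # overwrite the block in place
}

print "$line\n";
```

        Because each block is replaced by text of the same width, later blocks stay at their original positions. Unlike the regex version, this touches only the listed columns and accepts the letter anywhere inside the block.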


        Update (added links): s///, ord, length, tutorial on open