How do I delete lines matching a certain condition and make sure each line has the right prefix and suffix?

cjff150 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'd appreciate any guidance on this. I'm very new to Perl and can't crack this one. I have a pipe-delimited file in which every line must start with | and end with || (no space in between). Blank columns in a row should have a space between the delimiters: | |. Using regexes, I wrote the code below to insert a space between the || delimiters wherever a column has blank data, and that part is working OK.

Here's an example. The code below changes the file from:

|red||blue|yellow||black||

|white|yellow||teal||purple||

|red||black||yellow||teal||

To (note the spaces between delimiters for blank column data):

|red| |blue|yellow| |black||

|white|yellow| |teal| |purple||

|red| |black| |yellow| |teal||

Here's the code


#!C:\Perl64\bin\perl -w

use strict;

# Iteration 1, open myfile.txt and read every line into an array

open(my $in, '<', '/temp/PERL-Samples/myfile.txt')
    or die "Cannot open myfile.txt: $!";
my @lines = <$in>;
close($in);

#This section uses an array and scans each line and does a
#sequential search and replace for the following strings:
#1. Finds any instance of || and changes it to | |
#2. Finds any instance of | || and changes it to | | |
#3. Finds any instance of || | and changes it to | | |
#4. Finds any instance of | | at the end of a line and
#changes it to ||

my @newlines;
foreach (@lines) {
   $_ =~ s/\|\|/\| \|/g;
   $_ =~ s/\| \|\|/\| \| \|/g;   # | || becomes | | |
   $_ =~ s/\|\| \|/\| \| \|/g;   # || | becomes | | |
   $_ =~ s/\| \|$/\|\|/g;
   push(@newlines, $_);
}

#Push changes to MyFileNew.dat

open(my $out, '>', '/temp/PERL-Samples/MyFileNew.dat')
    or die "Cannot write MyFileNew.dat: $!";
print $out @newlines;
close($out);

Another pre-existing problem with this file is that several hundred lines contain only | or || and nothing else. There are also other lines that do not begin with | and end with ||. For example:

|white|yellow| |teal| |purple||

I'm not sure how to reliably and efficiently delete rows where || and | are the only characters in a line and also make sure each line is prefixed with | and suffixed with ||. Thanks in advance for reviewing this.


Re: How do I delete lines matching a certain condition and make sure each line has the right prefix and suffix? by kennethk (Abbot) on Sep 16, 2014 at 23:51 UTC
Welcome to the Monastery, cjff150. As a note, it's considered good form to wrap not just code in `<code>` tags, but input and output as well, in order to avoid whitespace mangling and other formatting issues. See How do I post a question effectively?.

For your various requirements, you will find the `^` and `$` Metacharacters useful; they let you specify the start and end of a line, respectively. You'll also find Look Around Assertions helpful. So for your 4 requirements, something like:

s/^(?!\|)/|/;            # Make sure each line starts w/ |
s/(?<!\|)\|?\n/||\n/;    # Make sure each line ends w/ ||
s/\|\|(?!$)/| |/g;       # Insert space between || if not at the end of a line
s/^[\s\|]*\n//;          # Delete rows w/o meaningful characters

Of course, whoever invented this format has given you a raging case of Leaning_toothpick_syndrome.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
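To try the four substitutions from this reply on sample data, here is a minimal sketch. The sub name `clean_line` and the sample lines are made up for illustration; the regexes are the ones quoted in the reply, applied in the same order to a line that still carries its trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the four substitutions to one line
# (newline included) and returns the result, or '' if the line is dropped.
sub clean_line {
    my ($line) = @_;
    $line =~ s/^(?!\|)/|/;            # make sure the line starts with |
    $line =~ s/(?<!\|)\|?\n/||\n/;    # make sure the line ends with ||
    $line =~ s/\|\|(?!$)/| |/g;       # space out || except at end of line
    $line =~ s/^[\s\|]*\n//;          # drop lines holding only pipes/whitespace
    return $line;
}

# Made-up sample input covering the cases from the question.
my @sample = (
    "|red||blue|yellow||black||\n",   # well-formed, blank columns
    "white|yellow||teal\n",           # missing prefix and suffix
    "|\n",                            # lone pipe
    "||\n",                           # lone double pipe
);
print clean_line($_) for @sample;
```

The last two sample lines come back empty, so nothing is printed for them. One caveat, as far as I can tell: a run of three pipes (two adjacent blank columns) is only partly spaced out by the third substitution, because `/g` resumes after the pipes it has just replaced.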
Re: How do I delete lines matching a certain condition and make sure each line has the right prefix and suffix? by locked_user sundialsvc4 (Abbot) on Sep 17, 2014 at 00:09 UTC
In general, I like to approach problems like this in a slightly different way: I read “the original file” one record at a time, in a loop, and decide which records from that file I want to keep. I can also change the content of each record in any way that I please. I then write those records, one at a time, to “the next generation of ‘the original file.’”

So, when the program is finished, I am left with two files: “before” and “after.” I can then compare the two; the `diff` command (in Unix/Linux) comes in very handy here. If I like what I see, I can (separately) throw away or archive the original file and keep the new one with just a few rename commands in the shell. And, if I don’t like what I see, nothing has been lost or harmed. It is, in other words, a non-destructive process, and because the file is processed line by line, it works equally well for datasets of any size.

So, your program might look something like this ... (caution: extemporaneous coding)

use strict;
use warnings;

# Open the input file, create the output file.
open(my $in, '<', '/temp/PERL-Samples/myfile.txt')
    or die "Cannot open input file: $!";
open(my $out, '>', '/temp/PERL-Samples/myfile_out.txt')
    or die "Cannot create output file: $!";

while (<$in>) {
    #Looks like the actual logic here is actually just a filter:
    #1. Finds any instance of || and changes it to | |
    #2. Finds any instance of | || and changes it to | | |
    #3. Finds any instance of || | and changes it to | | |
    #4. Finds any instance of | | at the end of a line and
    #   changes it to ||
    $_ =~ s/\|\|/\| \|/g;
    $_ =~ s/\| \|\|/\| \| \|/g;
    $_ =~ s/\|\| \|/\| \| \|/g;
    $_ =~ s/\| \|$/\|\|/g;
    print $out $_;
}

close $in;
close $out;
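The line-by-line filter described above can also cover the part of the question the loop leaves out: dropping pipe-only lines and enforcing the | prefix and || suffix. The following is only a sketch under stated assumptions: `normalize_line` and `filter_file` are hypothetical names, and the single lookahead substitution is a swapped-in alternative to the four sequential replacements, not code from the thread. It also treats any line made up solely of pipes and whitespace as disposable.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the cleaned line without its newline,
# or nothing (undef in scalar context) if the line should be dropped.
sub normalize_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                  # remove the line ending
    return if $line =~ /^[\s|]*\z/;        # only pipes and whitespace: drop
    $line =~ s/^(?!\|)/|/;                 # add the leading | if missing
    $line =~ s/(?<!\|)\|?\z/||/;           # complete the trailing ||
    $line =~ s/\|(?=\|(?!\z))/| /g;        # pad each empty column, but leave
                                           # the final || untouched
    return $line;
}

# Hypothetical driver: stream $in_path to $out_path one line at a time,
# leaving the original file untouched.
sub filter_file {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        my $clean = normalize_line($line);
        print {$out} "$clean\n" if defined $clean;
    }
    close($in);
    close($out) or die "Cannot close $out_path: $!";
}
```

A call such as `filter_file('/temp/PERL-Samples/myfile.txt', '/temp/PERL-Samples/myfile_out.txt')` would then produce the "after" file for comparison with `diff`. Because the lookahead does not consume the second pipe, runs of several empty columns (`|||`, `||||`) are padded in a single pass.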
Re^2: How do I delete lines matching a certain condition and make sure each line has the right prefix and suffix? by Laurent_R (Canon) on Sep 17, 2014 at 06:41 UTC
I also prefer the line-by-line approach, especially because I mostly deal with files of tens or even hundreds of millions of lines. Writing to another file is also a must for me, because I may need to undo things if something goes wrong. As a side note, there is no need to escape the pipe ("\|") character in the replacement part of the `s///` operator.
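A quick way to check the point about escaping is to run the same substitution twice, once with `\|` in the replacement and once with a bare `|`. This is a small illustrative sketch; the variable names and the sample string are made up.

```perl
use strict;
use warnings;

my $escaped   = "|red||blue||";
my $unescaped = "|red||blue||";

# The pattern side is a regex, so | must be escaped there (or placed in a
# character class). The replacement side is an ordinary interpolated string,
# where | has no special meaning.
$escaped   =~ s/\|\|(?!$)/\| \|/g;
$unescaped =~ s/\|\|(?!$)/| |/g;

print "$escaped\n";
print "same result\n" if $escaped eq $unescaped;
```

Both variables end up holding the same string, so the shorter replacement is the easier one to read.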