Substitution in range operator

benaw has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a text file that I am trying to clean up. The hurdle I am currently stuck on is removing pipes (|) and newlines (\n\r) from the text between two patterns, namely "Mechanical Notes:|" and "Lighting Notes:".

As you can see from the sample text below, I would like it to read ..LOW LEVEL EXHAUST, POSITIVE.., so I would replace each of them with a space. Thus far I have been playing with regexes and substitution, and was looking at using a substitution with the range operator in the code section "## Remove pipes and new lines in range". (My example is using commas and hyphens at the moment.) It seems to replace the first comma on every line with a hyphen, not limited to the range.

To give you some background: I am trying to analyse some .pdf reports. I've converted them into (horrible) text files and am trying to convert those into something I can suck into SQL. So at the moment I am just trying to format the text as {Field:}{Value}{new line}.

It should be noted that there are multiple sections, so the sample below occurs many times, which is why I am just going for this formatting at this stage. I thought I'd throw that in because usually an OP's approach isn't the most effective.

Sample text from inputfile.txt:

Mechanical Notes:|HEPA FILTERED, LOW LEVEL

|EXHAUST, POSITIVE CIRCULATION

Lighting Notes:|LIGHTING - COLOUR

#!/usr/bin/perl
 
### slurp
open (MYFILE, 'inputfile.txt');
while (<MYFILE>) {
###    chomp;

    ## Substitute 2 or more horizontal white space with pipe
    #    s/[^\S\r\n]{2,}/|/g; #works
        s/[\h+]{2,}/|/g;
    ## Remove Room Name
        s/[\w+\W+]{7,}Room No/Room No/;    
    ## Remove pipe     
        s/(\|Briefed Area:)/Briefed Area:/;
        s/(\|Drawn Area:)/Drawn Area:/;
        s/(\|Maximum\|)/\|Maximum Occupancy:\|/;
        s/(\|Hours Of Use:)/Hours Of Use:/;
    ## Remove obsolete data
        s/(\|Occupancy:)//;
        s/(\|hrs.)//;
        s/(\|sq.m)//;
    ## Remove pipes and new lines in range
    #    s/,+/-/  if /(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/ ;
    #    if (/(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/) {s/,+/-/};
    
    ## Push to new line
        s/(\|Maximum Occupancy:)/\nMaximum Occupancy:/;

### Burp;
    print $_  
    
 }
 close (MYFILE);

Update: I think I need to find the expression for the hex value a7c

Update2: SOLVED: an Anonymous Monk told me about "\R", which saved me from having to use the hex code. I run the above code first, then I use 1nickt's suggestion: slurp the entire output file and replace all instances of a newline followed by a pipe using s/\R\|/ /mg; This can't be done in the original script because the substitution there runs over one line at a time, and a match that includes the newline spans two lines.

#!/usr/bin/perl
local $/;
open my $fh, '<', 'C:/Users/bwinkley/Desktop/test/whitespace.txt' or die $!;
my $txt = <$fh>;

#$txt =~ s/(\n\|)/ /mg;
$txt =~ s/\R\|/ /mg;

for ( split '\n', $txt ) {
    # process each line
}
print $txt;

Thank you all for your help; and to MS for having so many new line options /s
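
A note on the range-operator attempt in the script above, offered as a sketch rather than as the fix the thread settled on. In /(^Mechanical Notes:|)/ the pipe is unescaped, so it acts as alternation with an empty right-hand branch, and an empty branch matches any line. The range therefore opens on every line, and without /g only the first comma on each line is replaced, which fits the behaviour described. The version below escapes the pipe. The "Room No" and "Drawn Area" lines, the ", WARM" suffix, and the sub name hyphenate_in_range are hypothetical, added only to show what happens outside the range and on the end line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The unescaped pipe leaves an empty alternative, so this matches any string.
print "empty alternative matches\n"
    if "unrelated line" =~ /(^Mechanical Notes:|)/;

# Hypothetical input: first and last lines (and ", WARM") are made up.
my @lines = (
    "Room No:|1, 2",
    "Mechanical Notes:|HEPA FILTERED, LOW LEVEL",
    "|EXHAUST, POSITIVE CIRCULATION",
    "Lighting Notes:|LIGHTING - COLOUR, WARM",
    "Drawn Area:|3, 4",
);

sub hyphenate_in_range {
    my @out = @_;    # work on a copy so the caller's data is untouched
    for my $line (@out) {
        # Flip-flop: true from the start line through the end line inclusive.
        if ( $line =~ /^Mechanical Notes:\|/ .. $line =~ /^Lighting Notes:/ ) {
            # Leave the end marker line alone; /g replaces every comma.
            $line =~ s/,/-/g unless $line =~ /^Lighting Notes:/;
        }
    }
    return @out;
}

print "$_\n" for hyphenate_in_range(@lines);
```

Only the two lines from "Mechanical Notes:" up to, but not including, "Lighting Notes:" should come out with hyphens in place of commas. Because the range operator returns true for the end line as well, that line has to be excluded explicitly if it should stay unchanged.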

Re: Substitution in range operator by 1nickt (Canon) on Dec 30, 2015 at 02:39 UTC
If the file size is not too large you could read in the whole text and replace each instance of a newline followed by a pipe with a space. Then afterwards go through it again line by line and make your other substitutions.

local $/;
open my $fh, '<', 'file.txt' or die $!;
my $txt = <$fh>;

$txt =~ s/\n\|/ /g;

for ( split '\n', $txt ) {
    # process each line
}

The way forward always starts with a minimal test.
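
The two passes described here can also live in one program, which is sketched below under some assumptions. The sub name clean_report is hypothetical, the input is an in-memory string instead of a file, and the raw spacing of the sample (two spaces where the original script later inserts a pipe, no blank lines between records) is a guess at what the converted PDF text looks like. `\h` and `\R` both need Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: whole-text passes first, then per-line passes.
sub clean_report {
    my ($txt) = @_;

    # Pass 1a: runs of two or more horizontal whitespace become a pipe.
    # \h never matches a line break, so this is safe on the whole text.
    $txt =~ s/\h{2,}/|/g;

    # Pass 1b: a line break followed by a pipe marks a continuation line.
    $txt =~ s/\R\|/ /g;

    # Pass 2: per-line substitutions, one shown here as a placeholder.
    my @out;
    for my $line ( split /\R/, $txt ) {
        $line =~ s/\|Drawn Area:/Drawn Area:/;
        push @out, $line;
    }
    return join "\n", @out;
}

# Assumed raw input, before any pipes have been inserted.
my $sample = "Mechanical Notes:  HEPA FILTERED, LOW LEVEL\n"
           . "  EXHAUST, POSITIVE CIRCULATION\n"
           . "Lighting Notes:  LIGHTING - COLOUR\n";

print clean_report($sample), "\n";
```

With that input the two "Mechanical Notes" lines should come out as a single line, followed by the "Lighting Notes" line. The ordering matters: the pipes have to exist before the continuation lines can be recognised, so the whitespace substitution runs first.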
Re^2: Substitution in range operator by benaw (Novice) on Dec 30, 2015 at 04:48 UTC
Thanks for your response. I created an output file using my original script, because that is what puts the pipes in:

## Substitute 2 or more horizontal white space with pipe
s/[\h+]{2,}/|/g;

Then I passed that file through your script. It finds the pipe and replaces it with a space, but the text is still on a separate line.

#!/usr/bin/perl
local $/;
open my $fh, '<', 'whitespacegone.txt' or die $!;
my $txt = <$fh>;

$txt =~ s/(\n\|)/ /mg;

for ( split '\n', $txt ) {
    # process each line
}
print $txt;
Re^3: Substitution in range operator by 1nickt (Canon) on Dec 30, 2015 at 05:18 UTC
Hm, odd. Please try running this program:

#!/usr/bin/perl
use strict; use warnings;

local $/;
my $txt = <DATA>;
$txt =~ s/\n\|/ /g;
print $txt;

__DATA__
foo | bar
baz
|qux
fred
|wilma
barney

Output should be:

foo | bar
baz qux
fred wilma
barney

(Note that you don't need to enclose your desired match in parentheses unless you need to "capture" it.)
Re: Substitution in range operator by Anonymous Monk on Dec 30, 2015 at 10:55 UTC
"Update: I think I need to find the expression for the hex value a7c"

That's probably line feed (0xA) and vertical line (0x7C), so the expression is "\n\|". Also note that

"\R" matches anything that can be considered a newline under Unicode rules. It can match a multi-character sequence. It cannot be used inside a bracketed character class.

That might make it easier for you to deal with various newlines (it matches "\n", "\r\n", "\r", among other things).
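
A small check of that last point, as a sketch: the sub name join_pipe_lines and the test strings are hypothetical, and `\R` needs Perl 5.10 or later. The same substitution is applied to text using each of the three line endings named above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace "line break + pipe" with a single space.
sub join_pipe_lines {
    my ($txt) = @_;
    $txt =~ s/\R\|/ /g;    # \R treats "\r\n" as one line break
    return $txt;
}

# Try LF, CRLF and bare CR in turn.
for my $eol ( "\n", "\r\n", "\r" ) {
    my $joined = join_pipe_lines("LOW LEVEL${eol}|EXHAUST");
    print $joined eq "LOW LEVEL EXHAUST" ? "joined\n" : "not joined\n";
}

# By contrast, a plain \n does not match a bare carriage return.
print "plain \\n misses bare CR\n" unless "LOW LEVEL\r|EXHAUST" =~ /\n\|/;
```

All three endings should report "joined", whereas the `\n\|` form only covers input whose line breaks are, or end in, a line feed.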
Re^2: Substitution in range operator by benaw (Novice) on Dec 30, 2015 at 23:55 UTC
Thanks, "\R" was very helpful. With that in 1nickt's script, which I run afterwards, the hurdle is overcome. Now onto the next part!