Hello Monks,

I have a text file that I am trying to clean up. The hurdle I am currently stuck on is removing pipes (|) and newlines (\r\n) from the text between two patterns, namely "Mechanical Notes:|" and "Lighting Notes:".

As you can see from the sample text below, I would like it to read "..LOW LEVEL EXHAUST, POSITIVE..", so I would replace each newline and pipe with a space. Thus far I have been playing with regexes and substitution, and was looking at using a substitution with the range operator in the code section "## Remove pipes and new lines in range". (My example uses commas and hyphens at the moment.) It seems to replace the first comma on every line with a hyphen, without being limited to the range.

To give you some background: I am trying to analyse some .pdf reports. I've converted them into (horrible) text files and am trying to convert those into something I can suck into SQL. So at the moment I am just trying to format the text as {Field:}{Value}{new line}.

It should be noted that there are multiple sections, so the sample below occurs many times, which is why I am only going for this formatting at this stage. I thought I'd throw that in because usually an OP's approach isn't the most effective.

Sample text from inputfile.txt:

Mechanical Notes:|HEPA FILTERED, LOW LEVEL
|EXHAUST, POSITIVE CIRCULATION
Lighting Notes:|LIGHTING - COLOUR

#!/usr/bin/perl
 
### slurp
open (MYFILE, '<', 'inputfile.txt') or die "Cannot open inputfile.txt: $!";
while (<MYFILE>) {
###    chomp;

    ## Substitute 2 or more horizontal white space with pipe
    #    s/[^\S\r\n]{2,}/|/g; #works
        s/\h{2,}/|/g;
    ## Remove Room Name
        s/[\w+\W+]{7,}Room No/Room No/;    
    ## Remove pipe     
        s/(\|Briefed Area:)/Briefed Area:/;
        s/(\|Drawn Area:)/Drawn Area:/;
        s/(\|Maximum\|)/\|Maximum Occupancy:\|/;
        s/(\|Hours Of Use:)/Hours Of Use:/;
    ## Remove obsolete data
        s/(\|Occupancy:)//;
        s/(\|hrs\.)//;
        s/(\|sq\.m)//;
    ## Remove pipes and new lines in range
    #    s/,+/-/  if /(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/ ;
    #    if (/(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/) {s/,+/-/};
    
    ## Push to new line
        s/(\|Maximum Occupancy:)/\nMaximum Occupancy:/;

### Burp;
    print $_  
    
 }
 close (MYFILE);
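
One likely reason the commented-out range attempt hit every line: inside a regex an unescaped | is alternation, so /(^Mechanical Notes:|)/ means "Mechanical Notes: at the start of the line, or the empty string", and the empty string matches anywhere. The sketch below is only an illustration under that assumption. The helper name hyphenate_in_range and the "Room Name, Lab 1" line are made up for the demo and are not part of the original script or data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): applies the
# comma-to-hyphen test substitution only to lines inside the range that
# $start_re opens and a line starting with "Lighting Notes:" closes.
sub hyphenate_in_range {
    my ($start_re, @lines) = @_;    # @lines is a copy, so the caller's data is untouched
    for (@lines) {
        s/,+/-/ if /$start_re/ .. /^Lighting Notes:/;
    }
    return @lines;
}

# The first line is invented for the demo; the rest is the sample text.
my @sample = (
    "Room Name, Lab 1",
    "Mechanical Notes:|HEPA FILTERED, LOW LEVEL",
    "|EXHAUST, POSITIVE CIRCULATION",
    "Lighting Notes:|LIGHTING - COLOUR",
);

# Unescaped pipe: the empty alternative matches, so every line opens the range.
my @unescaped = hyphenate_in_range(qr/^Mechanical Notes:|/, @sample);

# Escaped pipe: only a literal "Mechanical Notes:|" opens the range.
my @escaped = hyphenate_in_range(qr/^Mechanical Notes:\|/, @sample);

print "unescaped: $_\n" for @unescaped;
print "escaped:   $_\n" for @escaped;
```

With the escaped pattern the "Room Name" line keeps its comma, while the two Mechanical Notes lines are still changed. Note that the range operator remembers its state between lines, so this only behaves predictably when every "Mechanical Notes:" block is eventually closed by a "Lighting Notes:" line.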

Update: I think I need to find the expression for the hex value a7c

Update2: SOLVED: an Anonymous Monk told me about "\R", which saved me from having to use the hex code. So I run the above code, and then I use 1nickt's suggestion: slurp the entire output file and replace all instances of a newline followed by a pipe using s/\R\|/ /mg; This can't be done in the original script because the substitution there runs over one line at a time, and because of the newline the match has to span multiple lines.

#!/usr/bin/perl
local $/;
open my $fh, '<', 'C:/Users/bwinkley/Desktop/test/whitespace.txt' or die $!;
my $txt = <$fh>;

#$txt =~ s/(\n\|)/ /mg;
$txt =~ s/\R\|/ /mg;

for ( split '\n', $txt ) {
    # process each line
}
print $txt;
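
The substitution above folds every "newline + pipe" in the file. If the folding should only happen between "Mechanical Notes:|" and the next "Lighting Notes:" line, as in the original question, something along these lines might work. It is a sketch, not a tested drop-in: the helper name fold_mechanical_notes and the "|WARM WHITE" continuation line are invented for the demo, and it assumes both labels always start a line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: folds "newline + pipe" into a single space, but only
# inside the text between "Mechanical Notes:|" and the next line that
# starts with "Lighting Notes:". Everything else is returned unchanged.
sub fold_mechanical_notes {
    my ($txt) = @_;
    # /m: ^ matches at each line start; /s: . also matches newlines;
    # /e: the replacement is code; /g: handle every section in the file.
    $txt =~ s{(^Mechanical Notes:\|)(.*?)(?=^Lighting Notes:)}{
        my ($label, $body) = ($1, $2);
        $body =~ s/\R\|/ /g;    # \R covers \n, \r\n and \r line endings
        $label . $body;
    }gmse;
    return $txt;
}

# Sample text from the post, plus one invented continuation line under
# "Lighting Notes:" to show that text outside the range is left alone.
my $sample = join "\n",
    "Mechanical Notes:|HEPA FILTERED, LOW LEVEL",
    "|EXHAUST, POSITIVE CIRCULATION",
    "Lighting Notes:|LIGHTING - COLOUR",
    "|WARM WHITE",
    "";

print fold_mechanical_notes($sample);
```

The pipe directly after "Mechanical Notes:" is kept as the field separator, and only the continuation lines are joined.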

Thank you all for your help; and to MS for having so many newline options /s

In reply to Substitution in range operator by benaw
