I have a text file that I am trying to clean up. The hurdle I am currently stuck on is removing pipes (|) and new lines (\n\r) from the text between two patterns, namely "Mechanical Notes:|" and "Lighting Notes:".
As you can see from the sample text below, I would like it to read ..LOW LEVEL EXHAUST, POSITIVE.., so I would replace each pipe and new line with a space. Thus far I have been playing with regex and substitution, and was looking at using a substitution with the range operator in the code section "## Remove pipes and new lines in range" (my example is using commas and hyphens at the moment). It seems to replace the first comma on every line with a hyphen, not limited to the range.
To give you some background: I am trying to analyse some .pdf reports. I've converted them into (horrible) text files and am now trying to convert those into something I can suck into SQL. So at the moment I am just trying to format it into {Field:}{Value}{new line}.
It should be noted that there are multiple sections, so the sample below occurs many times, which is why I am just going for this formatting at this stage. I thought I'd throw that in because usually an OP's approach isn't the most effective.
Sample text from inputfile.txt:
Mechanical Notes:|HEPA FILTERED, LOW LEVEL
|EXHAUST, POSITIVE CIRCULATION
Lighting Notes:|LIGHTING - COLOUR
#!/usr/bin/perl
### slurp
open (MYFILE, 'inputfile.txt');
while (<MYFILE>) {
    ### chomp;
    ## Substitute 2 or more horizontal white space with pipe
    # s/[^\S\r\n]{2,}/|/g; #works
    s/[\h+]{2,}/|/g;
    ## Remove Room Name
    s/[\w+\W+]{7,}Room No/Room No/;
    ## Remove pipe
    s/(\|Briefed Area:)/Briefed Area:/;
    s/(\|Drawn Area:)/Drawn Area:/;
    s/(\|Maximum\|)/\|Maximum Occupancy:\|/;
    s/(\|Hours Of Use:)/Hours Of Use:/;
    ## Remove obsolete data
    s/(\|Occupancy:)//;
    s/(\|hrs.)//;
    s/(\|sq.m)//;
    ## Remove pipes and new lines in range
    # s/,+/-/ if /(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/ ;
    # if (/(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/) {s/,+/-/ };
    ## Push to new line
    s/(\|Maximum Occupancy:)/\nMaximum Occupancy:/;
    ### Burp;
    print $_
}
close (MYFILE);
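A likely reason the commented-out range test fires on every line: in /(^Mechanical Notes:|)/ the pipe is not escaped, so it is read as alternation with an empty second branch, and an empty branch matches any line. The range therefore starts again on every line. The sketch below escapes the pipe as \| and runs the same comma-to-hyphen test substitution over an in-memory copy of the sample. The "Other Notes" line is a hypothetical addition, there only to show that a line outside the range is left alone. The range includes its closing line, so that line is skipped explicitly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines from the post, plus one hypothetical line outside the range.
my @lines = (
    "Other Notes:|ONE, TWO\n",    # hypothetical, not in the original sample
    "Mechanical Notes:|HEPA FILTERED, LOW LEVEL\n",
    "|EXHAUST, POSITIVE CIRCULATION\n",
    "Lighting Notes:|LIGHTING - COLOUR\n",
);

my @out;
for (@lines) {
    # \| is a literal pipe. In scalar context .. is the flip-flop operator:
    # true from the line matching the left pattern through the line
    # matching the right pattern, inclusive.
    if (/^Mechanical Notes:\|/ .. /^Lighting Notes:/) {
        # Same test substitution as in the post (first run of commas only);
        # the closing "Lighting Notes:" line is left untouched.
        s/,+/-/ unless /^Lighting Notes:/;
    }
    push @out, $_;
}

print @out;
```

This only shows the range limiting a per-line substitution. It cannot join the two Mechanical Notes lines, because each pass of the loop still sees a single line.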
Update: I think I need to find the expression for the hex value a7c
Update2: SOLVED: an Anonymous Monk told me about "\R", which saved me having to use the hex code. So I run the above code, then use 1nickt's suggestion: slurp the entire output file and replace every new line followed by a pipe using s/\R\|/ /mg; This can't be done in the original script because the substitution there runs over one line at a time, and the new line plus pipe I want to match spans two lines.
#!/usr/bin/perl
local $/;
open my $fh, '<', 'C:/Users/bwinkley/Desktop/test/whitespace.txt' or die $!;
my $txt = <$fh>;
#$txt =~ s/(\n\|)/ /mg;
$txt =~ s/\R\|/ /mg;
for ( split '\n', $txt ) {
    # process each line
}
print $txt;
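The slurp-and-substitute step can be tried without the file by running it on an in-memory copy of the sample. This is a rough sketch, not the method used above: it assumes Perl 5.14 or later for the /r flag (\R itself needs 5.10), it uses \r\n line endings as a guess at what the converted file contains, and the "Other Notes" block is hypothetical. The second substitution is a variant that only joins lines between "Mechanical Notes:|" and "Lighting Notes:", which is what the original question asked for. The /m flag matters only in that variant, because it is the only pattern using ^.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.014;    # /r needs 5.14; \R needs 5.10

# In-memory stand-in for the slurped file.
my $txt = join '',
    "Other Notes:|FIRST\r\n",     # hypothetical block, not in the original sample
    "|SECOND\r\n",
    "Mechanical Notes:|HEPA FILTERED, LOW LEVEL\r\n",
    "|EXHAUST, POSITIVE CIRCULATION\r\n",
    "Lighting Notes:|LIGHTING - COLOUR\r\n";

# As in the update: every line break followed by a pipe becomes one space.
# \R matches \r\n as a single unit, as well as a lone \n or \r.
my $everywhere = $txt =~ s/\R\|/ /gr;

# Variant limited to the text between the two markers: capture that block,
# then run the same substitution on the captured text only.
my $in_range = $txt =~ s{^(Mechanical Notes:\|.*?)(?=^Lighting Notes:)}
                        { $1 =~ s/\R\|/ /gr }gmser;

print $in_range;
```

With the restricted variant the hypothetical "Other Notes" block keeps its line break, while the Mechanical Notes text reads "...LOW LEVEL EXHAUST, POSITIVE CIRCULATION" on one line.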
Thank you all for your help, and to MS for having so many new line options /s
In reply to Substitution in range operator by benaw