benaw has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a text file that I am trying to clean up. The hurdle I am currently stuck on is removing pipes (|) and newlines (\r\n) from the text between two patterns, namely "Mechanical Notes:|" and "Lighting Notes:".

As you can see from the sample text below, I would like it to read ..LOW LEVEL EXHAUST, POSITIVE.. so I would replace each newline-and-pipe with a space. Thus far I have been playing with regexes and substitution, and was looking at using a substitution with the range operator in the code section "## Remove pipes and new lines in range" (my example is using commas and hyphens at the moment). It seems to replace the first comma on every line with a hyphen, and is not limited to the range.

To give you some background: I am trying to analyse some .pdf reports. I've converted them into (horrible) text files and am trying to convert them into something I can suck into SQL. So at the moment I am just trying to format it into {Field:}{Value}{new line}.

It should be noted that there are multiple sections, so the sample below occurs many times, which is why I am just going for this formatting at this stage. I thought I'd throw that in because usually an OP's approach isn't the most effective.

Sample text from inputfile.txt:

Mechanical Notes:|HEPA FILTERED, LOW LEVEL

|EXHAUST, POSITIVE CIRCULATION

Lighting Notes:|LIGHTING - COLOUR

#!/usr/bin/perl
### slurp
open (MYFILE, 'inputfile.txt');
while (<MYFILE>) {
    ### chomp;
    ## Substitute 2 or more horizontal white space with pipe
    # s/[^\S\r\n]{2,}/|/g; #works
    s/[\h+]{2,}/|/g;
    ## Remove Room Name
    s/[\w+\W+]{7,}Room No/Room No/;
    ## Remove pipe
    s/(\|Briefed Area:)/Briefed Area:/;
    s/(\|Drawn Area:)/Drawn Area:/;
    s/(\|Maximum\|)/\|Maximum Occupancy:\|/;
    s/(\|Hours Of Use:)/Hours Of Use:/;
    ## Remove obsolete data
    s/(\|Occupancy:)//;
    s/(\|hrs.)//;
    s/(\|sq.m)//;
    ## Remove pipes and new lines in range
    # s/,+/-/ if /(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/ ;
    # if (/(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/) {s/,+/-/ };
    ## Push to new line
    s/(\|Maximum Occupancy:)/\nMaximum Occupancy:/;
    ### Burp;
    print $_
}
close (MYFILE);
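
A note on the two commented-out range lines above: in /(^Mechanical Notes:|)/ the pipe is unescaped, so it acts as alternation with an empty right-hand alternative, and an empty alternative matches every line. That would keep the range switched on for the whole file, and without /g only the first run of commas on each line is replaced, which fits the symptom described. A minimal sketch of the same comma-to-hyphen test with the pipe escaped follows; the helper name hyphenate_in_range and the sample lines are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): replace commas with
# hyphens only on lines inside the Mechanical Notes .. Lighting Notes range.
sub hyphenate_in_range {
    my @lines = @_;
    my @out;
    for my $orig (@lines) {
        my $line = $orig;    # work on a copy, leave the caller's data alone
        # \| is a literal pipe; the range includes both marker lines.
        if ( $line =~ /^Mechanical Notes:\|/ .. $line =~ /^Lighting Notes:/ ) {
            $line =~ s/,+/-/g;    # /g replaces every run of commas on the line
        }
        push @out, $line;
    }
    return @out;
}

# Assumed sample input, modelled on the text in the question.
print hyphenate_in_range(
    "Room No:|1, 2\n",
    "Mechanical Notes:|HEPA FILTERED, LOW LEVEL\n",
    "|EXHAUST, POSITIVE CIRCULATION\n",
    "Lighting Notes:|LIGHTING - COLOUR\n",
);
```

Because the range operator sees one line at a time, this only limits where a per-line substitution applies. Joining a line to the one before it still needs the text to be read as a whole.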

Update: I think I need to find the expression for the hex value a7c

Update2: SOLVED: an Anonymous Monk told me about "\R", which saved me having to use the hex code. So I run the above code, and then I use 1nickt's suggestion: slurp the entire output file and replace all instances of a newline followed by a pipe using s/\R\|/ /mg; This can't be done in the original script because that substitution runs over one line at a time, and the newline-plus-pipe sequence spans two lines.

#!/usr/bin/perl
local $/;
open my $fh, '<', 'C:/Users/bwinkley/Desktop/test/whitespace.txt' or die $!;
my $txt = <$fh>;
#$txt =~ s/(\n\|)/ /mg;
$txt =~ s/\R\|/ /mg;
for ( split '\n', $txt ) {
    # process each line
}
print $txt;
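
The substitution above joins every newline-plus-pipe in the file. If only the text between "Mechanical Notes:|" and "Lighting Notes:" should be joined, as in the original question, one possible sketch on the slurped text looks like this. The helper name join_notes is hypothetical, and the use of \R+ assumes that blank lines between the pieces (as in the sample) should be swallowed too.

```perl
use strict;
use warnings;

# Hypothetical helper: join continuation lines only inside each
# "Mechanical Notes:|" ... "Lighting Notes:" section of slurped text.
# Inside the section, any run of newlines followed by a pipe becomes one space.
sub join_notes {
    my ($txt) = @_;
    $txt =~ s{
        (^Mechanical\ Notes:\|)    # start marker, kept unchanged
        (.*?)                      # section body, as short as possible
        (?=^Lighting\ Notes:)      # stop just before the end marker
    }{
        my ( $head, $body ) = ( $1, $2 );
        $body =~ s/\R+\|/ /g;
        $head . $body;
    }gmsxe;
    return $txt;
}

# Assumed sample input, modelled on the text in the question.
my $sample = "Mechanical Notes:|HEPA FILTERED, LOW LEVEL\n\n"
           . "|EXHAUST, POSITIVE CIRCULATION\n\n"
           . "Lighting Notes:|LIGHTING - COLOUR\n";
print join_notes($sample);
```

The /g flag lets this handle every such section in the file, and text outside the sections is left alone.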

Thank you all for your help; and to MS for having so many new line options /s

Replies are listed 'Best First'.
Re: Substitution in range operator
by 1nickt (Canon) on Dec 30, 2015 at 02:39 UTC

    If the file size is not too large you could read in the whole text and replace each instance of a new line followed by a pipe, with a space. Then afterwards go through it again line by line and make your other substitutions.

    local $/;
    open my $fh, '<', 'file.txt' or die $!;
    my $txt = <$fh>;
    $txt =~ s/\n\|/ /g;
    for ( split '\n', $txt ) {
        # process each line
    }


    The way forward always starts with a minimal test.

      Thanks for your response,

      I created an output file using my original script because that is what puts the pipes in

      ## Substitute 2 or more horizontal white space with pipe
      s/[\h+]{2,}/|/g;

      Then I passed that file through your script. It finds the pipe and replaces it with a space, but the text is still on a separate line.

      #!/usr/bin/perl
      local $/;
      open my $fh, '<', 'whitespacegone.txt' or die $!;
      my $txt = <$fh>;
      $txt =~ s/(\n\|)/ /mg;
      for ( split '\n', $txt ) {
          # process each line
      }
      print $txt;

        Hm, odd. Please try running this program:

        #!/usr/bin/perl
        use strict;
        use warnings;
        local $/;
        my $txt = <DATA>;
        $txt =~ s/\n\|/ /g;
        print $txt;
        __DATA__
        foo | bar
        baz
        |qux
        fred
        |wilma
        barney

        Output should be:

        foo | bar
        baz qux
        fred wilma
        barney
        (Note that you don't need to enclose your desired match in parentheses unless you need to "capture" it.)


Re: Substitution in range operator
by Anonymous Monk on Dec 30, 2015 at 10:55 UTC
    Update: I think I need to find the expression for the hex value a7c
    That's probably line feed (0x0A) followed by vertical line (0x7C), so the string is "\n|"; inside a regex the pipe must be escaped, giving \n\|. Also note that
    "\R" matches anything that can be considered a newline under Unicode rules. It can match a multi-character sequence. It cannot be used inside a bracketed character class
    That might make it easier for you to deal with various newlines (it matches "\n", "\r\n", "\r", among other things).
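
    To make the difference concrete, here is a small comparison of \n and \R on the three line endings mentioned above. It is only a sketch: the helper names join_with_n and join_with_R and the sample string are made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helpers for comparison (names are illustrative only).
    sub join_with_n { my ($t) = @_; $t =~ s/\n\|/ /g; return $t }
    sub join_with_R { my ($t) = @_; $t =~ s/\R\|/ /g; return $t }

    my %eol = ( LF => "\n", CRLF => "\r\n", CR => "\r" );

    # Report, per line ending, whether each pattern produces a clean join.
    for my $name (qw(LF CRLF CR)) {
        my $in = "LOW LEVEL$eol{$name}|EXHAUST";
        printf "%-4s  \\n: %-3s  \\R: %s\n", $name,
            ( join_with_n($in) eq 'LOW LEVEL EXHAUST' ? 'ok' : 'no' ),
            ( join_with_R($in) eq 'LOW LEVEL EXHAUST' ? 'ok' : 'no' );
    }
    ```

    With a CRLF line ending, the \n version removes only the line feed and leaves a stray carriage return in the text, while \R consumes the whole \r\n pair.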

      Thanks, "\R" was very helpful. With it in 1nickt's script, which I run afterwards, the hurdle is overcome. Now onto the next part!