Hello Monks,

I have a text file that I am trying to clean up. The hurdle I am currently stuck on is removing pipes (|) and new lines (\n\r) from the text between two patterns, namely "Mechanical Notes:|" and "Lighting Notes:".

As you can see from the sample text below, I would like it to read ..LOW LEVEL EXHAUST, POSITIVE.., so I would replace each pipe and new line with a space. Thus far I have been playing with regexes and substitution, and was looking at using a substitution with the range operator; see the code section "## Remove pipes and new lines in range". (My example is using commas and hyphens at the moment.) It seems to replace the first comma on every line with a hyphen, not just on the lines inside the range.

To give you some background: I am trying to analyse some .pdf reports. I've converted them into (horrible) text files and am trying to turn them into something I can suck into SQL. So at the moment I am just trying to format it as {Field:}{Value}{new line}.

It should be noted that there are multiple sections, so the sample below occurs many times, which is why I am only going for this formatting at this stage. I thought I'd throw that in because an OP's approach usually isn't the most effective.

Sample text from inputfile.txt:

Mechanical Notes:|HEPA FILTERED, LOW LEVEL

|EXHAUST, POSITIVE CIRCULATION

Lighting Notes:|LIGHTING - COLOUR

    #!/usr/bin/perl

    ### slurp
    open (MYFILE, 'inputfile.txt');
    while (<MYFILE>) {
        ### chomp;

        ## Substitute 2 or more horizontal white space with pipe
        # s/[^\S\r\n]{2,}/|/g; #works
        s/[\h+]{2,}/|/g;

        ## Remove Room Name
        s/[\w+\W+]{7,}Room No/Room No/;

        ## Remove pipe
        s/(\|Briefed Area:)/Briefed Area:/;
        s/(\|Drawn Area:)/Drawn Area:/;
        s/(\|Maximum\|)/\|Maximum Occupancy:\|/;
        s/(\|Hours Of Use:)/Hours Of Use:/;

        ## Remove obsolete data
        s/(\|Occupancy:)//;
        s/(\|hrs.)//;
        s/(\|sq.m)//;

        ## Remove pipes and new lines in range
        # s/,+/-/ if /(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/ ;
        # if (/(^Mechanical Notes:|)/ .. /(^Lighting Notes:)/) {s/,+/-/};

        ## Push to new line
        s/(\|Maximum Occupancy:)/\nMaximum Occupancy:/;

        ### Burp;
        print $_
    }
    close (MYFILE);
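One possible explanation for the range attempt touching every line, offered as a sketch rather than a definitive diagnosis: in /(^Mechanical Notes:|)/ the unescaped | is alternation, so the pattern also matches the empty string and the range switches on at every line. Separately, s/,+/-/ without /g stops after the first run of commas on a line. The example below escapes the pipe and adds /g; the lines in @lines are made up for illustration and are not taken from the real input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines, loosely modelled on the post's input.
my @lines = (
    "Room Notes:|A, B, C\n",
    "Mechanical Notes:|HEPA FILTERED, LOW LEVEL\n",
    "|EXHAUST, POSITIVE CIRCULATION\n",
    "Lighting Notes:|LIGHTING, COLOUR\n",
    "Other Notes:|X, Y\n",
);

my @out;
for (@lines) {
    # \| matches a literal pipe. An unescaped | would add an empty
    # alternative, which matches on every line.
    if (/^Mechanical Notes:\|/ .. /^Lighting Notes:/) {
        s/,+/-/g;    # /g replaces every run of commas, not just the first
    }
    push @out, $_;
}

print @out;
```

Note that the range is inclusive, so the "Lighting Notes:" line itself is also changed here. If that is not wanted, the substitution could be guarded with something like "unless /^Lighting Notes:/".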

Update: I think I need to find the expression for the hex value a7c

Update2: SOLVED: an Anonymous Monk told me about "\R", which saved me having to use the hex code. I run the above code, then use 1nickt's suggestion: slurp the entire output file and replace every new line followed by a pipe using s/\R\|/ /mg; This can't be done in the original script because the substitution there runs over one line at a time, and since the pattern includes a new line it has to span multiple lines.

    #!/usr/bin/perl

    local $/;
    open my $fh, '<', 'C:/Users/bwinkley/Desktop/test/whitespace.txt' or die $!;
    my $txt = <$fh>;

    #$txt =~ s/(\n\|)/ /mg;
    $txt =~ s/\R\|/ /mg;

    for ( split '\n', $txt ) {
        # process each line
    }

    print $txt;
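The substitution above runs over the whole file. If it ever needs to be limited to the text between the two markers, as in the original question, something along these lines might work. This is a minimal sketch, not a tested replacement for the script above: the string in $txt is invented, it assumes every "Mechanical Notes:|" block is followed by a line starting with "Lighting Notes:", and \R needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample; real input would be slurped from the file instead.
my $txt = "Room No:|101\n"
        . "Mechanical Notes:|HEPA FILTERED, LOW LEVEL\r\n"
        . "\r\n"
        . "|EXHAUST, POSITIVE CIRCULATION\n"
        . "Lighting Notes:|LIGHTING - COLOUR\n"
        . "|SECOND LINE\n";

# Capture the marker and the body up to (not including) the line that
# starts with "Lighting Notes:", then clean only that body.
$txt =~ s{(^Mechanical Notes:\|)(.*?)(?=^Lighting Notes:)}{
    my ($head, $body) = ($1, $2);
    $body =~ s/\R+\|/ /g;    # line break(s) followed by a pipe become one space
    $head . $body;
}gmse;

print $txt;
```

Here /s lets . cross line breaks, /m lets ^ match at the start of each line, and /e evaluates the replacement as code so the inner substitution only sees the captured body. The "|SECOND LINE" continuation after "Lighting Notes:" is left alone because it falls outside the captured range.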

Thank you all for your help; and to MS for having so many new line options /s


In reply to Substitution in range operator by benaw
