
What I am looking for: if & is not followed by #38; then the & should be replaced with a blank.
If a line contains & or &# or &#3 or &#38, that should be replaced with a blank, but &#38; should not be affected.
Note: the file is 200+ MB, so I am thinking of using a sed command.

Re^3: Need a regex to replace incomplete html entities
by Laurent_R (Canon) on Nov 20, 2016 at 12:02 UTC
    If I understand you correctly, the important difference is the semicolon: you want to replace &#38, but not if it is followed by a semicolon (i.e. you don't want to replace &#38;). The poor formatting in your post made that difficult to understand.

    The easy solution is to use a negative look-ahead, as already suggested in other posts, but I doubt that sed supports look-ahead assertions (it may depend on which version you have).

    Besides, even for a 200 MB file, this should not be a problem in Perl. Last time I compared the performance of Perl and sed, I did not find a really significant performance difference between them, but, again, this may depend on the implementation of the sed version you're using.

      You got it right, Laurent. Thanks for the update and the look-ahead suggestion.

      The reason I am focusing on sed is that I want to parse an XML file that contains many similar <Remarks> tags.

      But since the file contains incomplete HTML entities, the parser is not able to parse it.
      Hence I was planning to use a sed command to replace them and then parse the file.
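
If sed is preferred for this pre-parse cleanup, one possible workaround for the missing look-ahead is a three-step substitution: hide every complete &#38; behind a marker byte, delete whatever & fragments remain, then restore the marker. This is only a sketch under stated assumptions: the function name strip_with_sed is hypothetical, the control byte \002 is assumed never to occur in the input, "blank" is taken to mean deletion, and &#38; is assumed to be the only entity in the file.

```shell
#!/bin/sh
# Hypothetical helper: same cleanup as a negative look-ahead would do,
# using only extended regular expressions (sed -E).
strip_with_sed() {
    # Assumption: byte \002 does not appear anywhere in the input.
    marker=$(printf '\002')
    # 1) protect complete entities, 2) drop incomplete ones, 3) restore.
    sed -E \
        -e "s/&#38;/${marker}/g" \
        -e 's/&(#(38?)?)?//g' \
        -e "s/${marker}/\\&#38;/g"
}

# Small demonstration on one line of sample text.
printf '%s\n' 'a &#38; b & c &# d &#3 e &#38 f' | strip_with_sed
```

Because sed processes the stream line by line, the file size should not matter much for memory use; the cleaned output can then be handed to the XML parser.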