in reply to Sanity Check

Do you really want to get nothing for the following line?
  Some content  
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Re^2: Sanity Check
by kcinmd (Initiate) on Jan 28, 2015 at 13:38 UTC

    As far as I can tell, these out-of-range ASCII characters are introduced by the user copying from MS Word into a long description. For example:

    60 TON AIR-COOLED ROTARY SCROLL CHILLER Carrier Model no.30RAP060 Chiller Manufactured in Charlotte, NC ò AHRI Performance: 56.0 Tons, Full Load û 10.2 EER, IPLV &#xFB; 14.5 EER ò Electrical Requirements for 208/3: MCA û 279.7, MOCP &#xFB; 300, Rec $37,855.00 $37,855.00 Quote ref: TII/CA/1214/2895 Page 1 of 5 Fuse Size û 300 ò Electrical Requirements for 460/3: MCA û 134.6, MOCP &#xFB; 150, Rec Fuse Size û 150 ò AHRI STANDARD 550/590 CERTIFIED ò ASHRAE 90.1 COMPLIANT ò 7.5 HP Constant speed, single-pump package (61.3Æ ext. head @ 134 GPM)

    I do not care about the content of the long description. I would much rather have a fully automated solution that keeps a human out of the loop. Without the pre-command, my ETL (Informatica) cannot digest the unknown character and aborts, and then I get a phone call!

      I see. What about the entities in non-starting positions, then?
      Fuse Size û 150
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Good catch, thank you. I changed the command; here it is with the result:

        perl -pi -e 's/&#x.+;//g' ./*.xml

        60 TON AIR-COOLED ROTARY SCROLL CHILLER Carrier Model no.30RAP060 Chiller Manufactured in Charlotte, NC 14.5 EER 300, Rec $37,855.00 $37,855.00 Quote ref: TII/CA/1214/2895 Page 1 of 5 Fuse Size 300 150, Rec Fuse Size 150 AHRI STANDARD 550/590 CERTIFIED ASHRAE 90.1 COMPLIANT ext. head @ 134 GPM)
      Perhaps it would be better to just skip the lines with encoded characters:
      perl -ne 'print unless /&#x/'
      Stuff like "134 GPM)" is on a separate line; I don't know if that's a problem.