ctgIT has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have written a simple Perl script that uses regular expressions to parse XML documents. One element in these documents is called <title>, and it contains the title of the document in the form: <title>Mathematical Spaces in Algebra</title>. I discovered that in some documents the <title></title> element contains one or more newline characters, which I have to remove because I want the whole title on one line. I currently read the whole XML document into a single variable, but I do not know how to remove these newlines in a single pass with a regular expression. Thanks, Christos

Re: XML-related Regular Expression question
by particle (Vicar) on May 14, 2002 at 21:37 UTC
    you'll never get a clean parse of XML with regular expressions alone. i suggest you try an existing wheel, many of which can be found on CPAN: for example, XML::LibXML, XML::Parser, and XML::SAX. there are many good examples of using these modules here and elsewhere.

    oh, i almost forgot XML::Simple, too.

    Update: i did forget... you'll get much better replies if you post an example of code and data that explains your problem. you might want to have a look at Perl Monks FAQ for more (and better) information on posting.

    ~Particle *accelerates*

Re: XML-related Regular Expression question
by Cyrnus (Monk) on May 14, 2002 at 22:27 UTC
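    I would recommend following particle's advice. To answer your question, however: this regex replaces each run of newlines with a single space, as long as the run does not come immediately after a '>' or immediately before a '<'. It is applied to the whole string, not just to <title>. The lookarounds keep the neighbouring characters out of the match, so back-to-back one-character lines are handled too.
    $XML_string =~ s/(?<=[^>])\n+(?=[^<])/ /g;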
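    If only the title should be touched, the substitution can be limited to what sits between the title tags. What follows is a minimal sketch, not a real XML parse: it assumes <title> carries no attributes, has no nested markup or CDATA, and that the document is already in one string. The helper name flatten_titles is made up for this example, and trimming the spaces at the edges of the title is my own choice rather than part of the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse line breaks inside each <title>...</title>
# element only, leaving the rest of the document untouched.
# /s lets . match newlines, /e evaluates the replacement as Perl code.
sub flatten_titles {
    my ($xml) = @_;
    $xml =~ s{(<title>)(.*?)(</title>)}{
        my ($open, $text, $close) = ($1, $2, $3);
        $text =~ s/\s*[\r\n]+\s*/ /g;
        $text =~ s/^ +| +$//g;
        $open . $text . $close;
    }gse;
    return $xml;
}

# Sample data made up for illustration.
my $XML_string = "<doc>\n<title>\nMathematical Spaces\nin Algebra\n</title>\n</doc>\n";
print flatten_titles($XML_string);
```

    The first inner substitution turns every run of line breaks (with any whitespace around it) into one space; the second removes the spaces left right after <title> and right before </title>. For anything beyond this simple case, a real parser such as the modules particle lists is the safer route.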


    John