in reply to Help extracting text from XML data

If you're not perlaphobic, this:

    print $xml =~ m[">([^<]+)</string>]sm;

will do the job just as reliably as the other possibilities.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: Help extracting text from XML data
by Jenda (Abbot) on Oct 20, 2008 at 23:53 UTC

      Yes.

      $xml = '<string>Everyone knows that 1 &lt; 2</string>';;
      print $xml =~ m[>([^<]+)</string>]sm;;
      Everyone knows that 1 &lt; 2

      From the spec:

      The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.


        No.

        use XML::Simple;
        my $data = XMLin('<string>Everyone knows that 1 &lt; 2</string>');
        print $data;
        # ==> Everyone knows that 1 < 2

        Your code did not unescape the string. And before you attempt to add that, keep in mind that there might have been <string><![CDATA[Everyone knows that 1 < 2]]></string>. Or the encoding specified by the <?xml ...?> declaration might have been different and there might have been some accented characters that need to be converted. Or. Or. Or. If you do know your files will never contain anything like that, go ahead. But don't say your script processes XML then, because it doesn't.
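
        For anyone who wants to see the two cases side by side, here is a minimal sketch in core Perl only. It runs the regex from earlier in the thread against the escaped form and the CDATA form of the same text. The helper name extract_with_regex is made up for illustration and is not from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the regex from earlier in the thread and
# returns the captured text, or undef when the pattern does not match.
sub extract_with_regex {
    my ($xml) = @_;
    my ($text) = $xml =~ m[>([^<]+)</string>]sm;
    return $text;
}

# The same sentence, once with an escaped "<" and once inside CDATA.
my $escaped = '<string>Everyone knows that 1 &lt; 2</string>';
my $cdata   = '<string><![CDATA[Everyone knows that 1 < 2]]></string>';

for my $xml ($escaped, $cdata) {
    my $text = extract_with_regex($xml);
    print defined $text ? "got: $text\n" : "no match\n";
}
```

        The first input comes back with the entity still in place (&lt; rather than <), and the second does not match at all, because [^<]+ cannot get past the <![CDATA[ opener or the literal < inside it. A parser such as XML::Simple, as used above, handles both.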