
Re^3: Why XML not well formed?

by BaldPenguin (Friar)
on Jun 30, 2005 at 16:03 UTC


in reply to Re^2: Why XML not well formed?
in thread Why XML not well formed?

You could regex the &:
$line =~ s/(&)/$1amp;/g;

Don
WHITEPAGES.COM | INC

Re^4: Why XML not well formed?
by davorg (Chancellor) on Jun 30, 2005 at 16:11 UTC

    That would potentially have disastrous effects if the XML already contained any entities.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      Yup. Better to make it a bit more restrictive and only touch ampersands inside quoted URLs:
      1 while s/("http[^"]*)&(?!amp;)/$1&amp;/;
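A minimal sketch of the same restriction, assuming the URLs always sit inside double quotes and start with http: match each quoted URL as a whole, then escape the bare ampersands inside it. The helper name escape_amp_in_urls is made up for illustration and does not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: escape bare ampersands, but only inside
# double-quoted values that start with "http".
sub escape_amp_in_urls {
    my ($line) = @_;
    $line =~ s{("http[^"]*")}{
        my $url = $1;                  # copy first, the inner match resets $1
        $url =~ s/&(?!amp;)/&amp;/g;   # leave already-escaped &amp; alone
        $url;
    }ge;
    return $line;
}

my $in = '<a href="http://www.test.me/test.pl?me=1&you=1&amp;x=2">R&D</a>';
print escape_amp_in_urls($in), "\n";
```

Note that an ampersand outside a quoted URL (the R&D in the sample) is left untouched by this approach, so it only helps when URLs are the sole source of bare ampersands.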


      holli, /regexed monk/
        VERY good point

        Don
        WHITEPAGES.COM | INC


Re^4: Why XML not well formed?
by graff (Chancellor) on Jul 03, 2005 at 05:33 UTC
    I think the proper way to control this sort of operation, and keep it from screwing up other entity references (like &egrave; and &lt;), is something like this:
    s/&(?!\w+;)/&amp;/g;
    This assumes that every valid entity reference that might exist in the original text has only alphanumerics and underscores between the initial ampersand and the final semicolon, which is probably a safe enough assumption.

    But keep a backup of the original. If the data still causes parse errors after this simple edit, they might be different problems you haven't fixed yet, or they might be problems created by this simple edit. Careful diagnosis would be needed in that case.
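The lookahead's assumption can be checked against numeric character references such as &#233; or &#xE9;, which contain a '#' and so do not match \w+;. A minimal sketch, using the hypothetical helper name escape_bare_amp, that widens the lookahead to cover them:

```perl
use strict;
use warnings;

# Sketch: treat &name;, &#233; and &#xE9; as existing references and
# escape every other ampersand. The sub name is hypothetical.
sub escape_bare_amp {
    my ($text) = @_;
    $text =~ s/&(?!(?:\w+|#\d+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

print escape_bare_amp('R&D &lt; caf&eacute; &#233; &#xE9; AT&T'), "\n";
# prints: R&amp;D &lt; caf&eacute; &#233; &#xE9; AT&amp;T
```

Text such as a&b;c would still be mistaken for an entity reference, so the advice about keeping a backup still applies.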

Re^4: Why XML not well formed?
by nan (Novice) on Jul 06, 2005 at 17:04 UTC

    Hi Don,

    Sorry, I used $line =~ s/(&)/(&amp;)/g; would that behave differently from yours?

    Many thanks, Nan

      First, the parens in the pattern part of the regex capture the value we find, and the $1 in the replacement puts it back in. In that respect the following regex would do nothing but burn cycles:
      $line =~ s/(&)/$1/g;
      So, in your regex the first set of parens saves the & for later use. But you don't use it; instead you find all of the &s and replace them literally with "(&amp;)", including the parens.
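The difference between the three substitutions can be seen side by side. This is only an illustrative sketch; the sample string and variable names are made up.

```perl
use strict;
use warnings;

my $orig = 'me=1&you=1';

(my $noop   = $orig) =~ s/(&)/$1/g;        # puts back exactly what was captured
(my $parens = $orig) =~ s/(&)/(&amp;)/g;   # parens in the replacement are literal text
(my $fixed  = $orig) =~ s/(&)/$1amp;/g;    # $1 followed by the literal "amp;"

print "$noop\n";    # me=1&you=1
print "$parens\n";  # me=1(&amp;)you=1
print "$fixed\n";   # me=1&amp;you=1
```

Parentheses only capture on the pattern side; on the replacement side they are ordinary characters.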

      That said, the regex I posted would work only if no other entities existed. Take a look at the other regexes above; they do a better job of 'thinking' ahead to prevent possible errors in the future.

      Somebody could, and I have seen it done, place other entities in the URL, such as:
      http://www.test.me/test.pl?me=1&you=1&string=Montr&eacute;al
      In this case the regex I posted would translate the last part to Montr&amp;eacute;al. Not what you want.
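Running that URL through both the blanket substitution and the lookahead version from earlier in the thread shows the difference. A small sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $url = 'http://www.test.me/test.pl?me=1&you=1&string=Montr&eacute;al';

(my $naive   = $url) =~ s/(&)/$1amp;/g;         # escapes every ampersand
(my $careful = $url) =~ s/&(?!\w+;)/&amp;/g;    # skips anything that looks like an entity

print "$naive\n";
# http://www.test.me/test.pl?me=1&amp;you=1&amp;string=Montr&amp;eacute;al
print "$careful\n";
# http://www.test.me/test.pl?me=1&amp;you=1&amp;string=Montr&eacute;al
```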

      Don
      WHITEPAGES.COM | INC

