starbuck has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've been chasing my tail for way too long on this one.

I basically have an XML document and want to remove the last incomplete tag so that I am left with the first two complete XML lines.

<?xml version="1.0" encoding="iso-8859-1" ?><TransactionPart Sub='PRFN' Part='1' Id='ALAABADOCKAGDI'><Event Row='1' File='RELATIONSHIP_FILE' ISN='23290' Time='2005/11/11-17:05:39' Op='Update'></TransactionPart>
<?xml version="1.0" encoding="iso-8859-1" ?><TransactionPart Sub='PRF' Part='1' Id='AABADOCKAGDI'><Event Row='1' File='RELATIONSHIP_FILE' ISN='2180' Time='2005/11/11-17:05:39' Op='Update'></TransactionPart>
<?xml version="1.0" encoding="iso-8859-1" ?><TransactionPart Sub='PRFN' Part='1' Id='AABADOCKAGDI'><Event Row='1' File='PRODUCT_RELATIONSHIP_FILE' ISN='1
I have tried pattern matching the last end tag and have managed to extract the last fragment:
my $terminator = "<\/notification>|<\/TransactionPart>";
$msg_payload =~ /($terminator?([^$terminator]*))$/;
$truncate = $';
But the prematch doesn't give me the rest of the document, which is what I want.

So I tried matching $truncate against the original string in the hope of pulling out everything before it, but that did not work either, and I could not work out why.

So what regexp magic can I use to extract the first n complete XML elements?

Many Thanks,
Starbuck

Re: Regexp magic needed, truncate incomplete XML
by Roy Johnson (Monsignor) on Nov 18, 2005 at 19:41 UTC
    For this example, assuming you've got all three lines stored as one string in $_:
    s/^.*[^>]$//gm;
    That will delete any lines that don't end in a >. It will leave the newline from those lines, though. Season to taste.
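    To see the substitution in action, here is a minimal sketch. The sample records are hypothetical and shortened from the data in the question, and $doc and $clean are names made up for the illustration. It applies the same s///gm to a copy of the string and then trims the newline left behind by the emptied line.

```perl
use strict;
use warnings;

# Hypothetical sample: two complete records and one truncated record,
# one per line, shortened from the data in the question.
my $doc = join "\n",
    "<TransactionPart Part='1'><Event Row='1' Op='Update'></TransactionPart>",
    "<TransactionPart Part='1'><Event Row='2' Op='Update'></TransactionPart>",
    "<TransactionPart Part='1'><Event Row='3' ISN='1";

# With /m, ^ and $ match at line boundaries, so this blanks out
# every line whose last character is not '>'.
(my $clean = $doc) =~ s/^.*[^>]$//gm;

# The emptied line's newline is still there; trim trailing newlines.
$clean =~ s/\n+\z//;

print "$clean\n";
```

    One caveat that may be worth checking against real data: [^>] can also match a newline, so a complete line that is followed directly by an empty line, or whose newline is the last character of the string, would be removed as well. Writing the class as [^>\n] should avoid that.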

    Caution: Contents may have been coded under pressure.
      Great, just the trick I was looking for. I got so caught up in it being the last record that I didn't think to replace incomplete lines.

      Thanks.
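
For input that is not split into lines, an alternative closer to the original attempt is to keep everything up to the last complete closing tag. The following is only a sketch under assumptions: the payload is a hypothetical, shortened version of the data in the question, the two closing tags are the terminators named there, and $complete is a name introduced for the illustration. It relies on a greedy .* backtracking to the last terminator in the string.

```perl
use strict;
use warnings;

# Hypothetical payload, shortened from the question: two complete
# elements followed by a truncated third, all in one string.
my $msg_payload =
      "<TransactionPart Part='1'><Event Row='1'></TransactionPart> "
    . "<TransactionPart Part='2'><Event Row='2'></TransactionPart> "
    . "<TransactionPart Part='3'><Event Row='3' ISN='1";

# Closing tags that mark the end of a complete element (from the question).
# qr// keeps the alternation grouped when it is interpolated below.
my $terminator = qr{</notification>|</TransactionPart>};

# Greedy .* runs to the end of the string and backtracks to the LAST
# terminator; /s lets . match newlines too.
# $1 is the complete part, $2 the trailing fragment.
my ($complete, $truncate) = ('', $msg_payload);
if ($msg_payload =~ /\A(.*$terminator)(.*)\z/s) {
    ($complete, $truncate) = ($1, $2);
}

print "complete: $complete\n";
print "truncate: $truncate\n";
```

Unlike a bracketed class such as [^$terminator], which only excludes the individual characters of the string, the alternation here matches the closing tags as whole units. If no terminator is found, the sketch treats the entire payload as the incomplete fragment.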