thirilog has asked for the wisdom of the Perl Monks concerning the following question:

Could someone please spare sometime lookinto the below issue?

Script:

#!/usr/local/bin/perl use strict; use warnings; undef $/; chomp($input=$ARGV[0]); open (A, "< $input") || die "Couldn't open the input"; open(B, ">A.xml") or die("Sorry!"); while (<A>){ $_=~s/\s\s//gi; $_=~s/\t//gi; $_=~s/^\n$//gi; $_=~s/\n//gi; $_=~s/’/&apos;/gi; $_=~s/—/&mdash;/gi; $_=~s/“/&ldquo;/gi; $_=~s/â€/&rdquo;/gi; print B $_; } close (A); close (B); open(C, "< A.xml") || die "$!"; open (Z, "> $input.sgm") || die "Can't write the output"; $file = <C>; #Body matter begins if ($file =~ m/(<Body [^>]*>(.*?)<\/Body>)/){ $xmlBody = $1; #. #. #. #Processing BlockAmendments while ($xmlBody =~ m/(<BlockAmendment [^>]*>(.+?)<\/BlockAmendment +>)/sgi){ $CbBlkTx = $1; $bBlkTx = $CbBlkTx; if ($bBlkTx =~ m/(<BlockAmendment [^>]+><P2>(.*?)<\/P2><\/Bloc +kAmendment>)/){ $bBlkTx =~ s/<BlockAmendment [^>]+><P2>(.*?)<\/P2><\/Block +Amendment>/\n<lq><P2>$1<\/P2>\n<\/lq>/gi; $bBlkTx =~ s/<P2><Pnumber>([^<]+)<\/Pnumber><P2para><Text> +(.+?)<\/Text>(.*?)<\/P2para><\/P2>/\n<s1><no>($1)<\/no>\n<pt>$2<\/pt> +$3\n<\/s1>/mgi; $bBlkTx =~ s/<P3><Pnumber>([^<]+)<\/Pnumber><P3para><Text> +(.+?)<\/Text>(.*?)<\/P3para><\/P3>/\n<s2><no>($1)<\/no>\n<pt>$2<\/pt> +$3\n<\/s2>/mgi; $bBlkTx =~ s/<P4><Pnumber>([^<]+)<\/Pnumber><P4para><Text> +(.+?)<\/Text>(.*?)<\/P4para><\/P4>/\n<s3><no>($1)<\/no>\n<pt>$2<\/pt> +$3\n<\/s3>/mgi; $bBlkTx =~ s/<P5><Pnumber>([^<]+)<\/Pnumber><P5para><Text> +(.+?)<\/Text>(.*?)<\/P5para><\/P5>/\n<s4><no>($1)<\/no>\n<pt>$2<\/pt> +$3\n<\/s4>/mgi; } $bBlkTx =~ s/<\/pt>\n<\/s([0-9])><Text>(.+?)<\/Text>/ $2<\/pt> +<\/s$1>/mgi; print "$CbBlkTx\n\n============\n"; print "$bBlkTx\n\n"; $xmlBody =~ s/$CbBlkTx/$bBlkTx/gi; } } print Z $xmlBody;

I'm sending you only the problamatical section from the script.

Requirement:

  1. Need to perform only in <Body ..</body> section from input XML. — clamped in $xmlBody variable
  2. Read the <BlockAmendment ...</BlockAmendment> chunks alone from the complete body content variable '$file' and do the simple find and replace. (refer #Processing BlockAmendments from the script)
  3. Then need to replace back the <BlockAmendment ...</BlockAmendment> with processed output from #2 point.
  4. Issue:

    #3 point has not performing within while loop!

    could you please helpout in this?

    Below are the two piece of particular input and required output

    piece of input:

    <BlockAmendment Context="unknown" TargetClass="unknown" TargetSubClass +="unknown" Format="double"> <P2> <Pnumber>3</Pnumber> <P2para><Text>At the time when the order is drawn up, the court office +r will—</Text> <P3> <Pnumber>a</Pnumber> <P3para><Text>where the order made is (or includes) a non-molestation +order; or</Text> </P3para> </P3> <P3> <Pnumber>b</Pnumber> <P3para><Text>where the order made is an occupation order and the cour +t so directs,</Text> </P3para> </P3> <Text>issue a copy of the order, indorsed with or incorporating a noti +ce as to the consequences of disobedience, for service in accordance +with paragraph (2).</Text> </P2para> </P2> </BlockAmendment>

    piece of output:

    <lq> <s1><no>&ldquo;(3)</no> <pt>At the time when the order is drawn up, the court officer will&mda +sh;</pt> <s2><no>(a)</no> <pt>where the order made is (or includes) a non-molestation order; or< +/pt> </s2> <s2><no>(b)</no> <pt>where the order made is an occupation order and the court so direc +ts, issue a copy of the order, indorsed with or incorporating a notic +e as to the consequences of disobedience, for service in accordance w +ith <cit><loc>paragraph (2)</loc></cit>.&rdquo;.</pt> </s2></s1> </lq>

    Thanks in Advance,

    Thirilog

Replies are listed 'Best First'.
Re: Help required inText manipulation
by davido (Cardinal) on Apr 12, 2011 at 08:03 UTC

    Your post makes me dizzy. Can you provide a solution for that first?

    Where did you get this code?

    open(B, ">A.xml") or die("Sorry!");

    I'm sorry too. Rather than apologize, why not spit out a meaningful and useful message, or at least print the contents of $!

    Next let's look at this:

    $_=~s/\s\s//gi; $_=~s/\t//gi; $_=~s/^\n$//gi; $_=~s/\n//gi;

    Could you explain why it's a good idea to use case-insensitivity when you're matching whitespace, tabs, and newlines? (Hint: It isn't, and you shouldn't indiscriminately do so.)

    After that I just lost interest. But I can say that you should parse XML with an XML parser, not regexps. You'll have a better success rate with less brain cramps if you do. Have a look at XML::Simple, or XML::Twig for starters.

    And for heaven's sake, for projects starting in 2011 use lexical filehandles, and three-arg version of open.


    Dave

      Thanks for your comments Dave,
      Yes, you may correct. the requirement i'm trying here is something should work with multiple lines at same time.
      so that i have removed all the whitespace, tabs, and newlines to have the complete content in single line.
      Probably i should not use the ..gi; thanks will update them.
      regarding open(B, ">A.xml") or die("Sorry!");
      just trying to create a intermediate file without whitespace, tabs, and newlines.
      IF the message was not clear am apologize!
      using XML::Simple and XML::Twig can we match multiple elements OR attributes at a time?
      Thanks

        The primary reason you're trying to remove as much whitespace as possible (including and in particular newlines) is probably so that your XML tags don't get line-broken. And this is probably important because you're parsing XML tags using regular expressions. That entire issue and resulting data contortion is avoidable by using a real XML parser. XML::Simple is one of the easiest parsers to use for simple tasks, but there are others.

        The /g modifier is necessary if you stick with the regexp solution, but the /i modifier only applies to characters that have some notion of upper/lower case. Space doesn't have such a context, and so the /i modifier is unnecessary, and in fact does impact performance (though probably not enough to care about). The point is to not wield modifiers unnecessarily without considering what they're being used for.

        The three argument version of open is considered to be a safer programming practice. So is the use of lexical filehandles as opposed to global typeglob filehandles. For example, "open my $infile, '<', $filename or die "Couldn't open the input file $filename: $!\n";....... which reminds me, you should get in the habit of using meaningful messages in die statements. That will aid in debugging.

        The advantage to something like XML::Simple is that you don't have to invent a fragile and probably flawed regexp approach to parsing something that is quite difficult to parse correctly. XML::Simple dumps the XML file into a hash. If you're trying to match multiple things at once, you just have to ask, can I get what I'm after by diving into a hash instead? I think the answer is probably yes. But if a hash based representation of your XML file isn't helpful, XML::Twig give a tree-based representation instead. One of those two strategies ought to satisfy most basic needs. If you have to dig deeper, XML::Parser gives a lower level hook into the parsing mechanics. But I doubt you need to dig that deep.

        Hope this helps...


        Dave

Re: Help required inText manipulation
by jethro (Monsignor) on Apr 12, 2011 at 09:51 UTC

    Could you please remove the duplicate texts in readmore tags?

    First of all there are many perl modules that do the XML parsing for you. Ok, you already are nearly finished, but next time you have to parse XML, you could use something like XML::Parser XML::Rules XML::Bare XML::Smart XML::Easy ...

    To make regexes easier to read (and consequently debug) you could add an 'x' modifier and add spaces and line breaks to your regexes (remember to escape your spaces then). Also if you have to match literal '/' characters, using a different delimiter for the regex is recommended. Example:

    $bBlkTx =~ s/<BlockAmendment [^>]+><P2>(.*?)<\/P2><\/Block Amendment>/ +\n<lq><P2>$1<\/P2>\n<\/lq>/gi; # versus $bBlkTx =~ s{ <BlockAmendment\ [^>]+ ><P2> (.*?) </P2></Block\ Ame +ndment> } { \n<lq><P2> $1 <\/P2>\n<\/lq> }xgi;

    Notice that I highlighted the parts where the substitution happens through spaces. You could even extract part of the regexes into variables with meaningful names and make it even more obvious what you are parsing (though the positive effect in this example is rather small or even nonexistant):

    $P2=qr'\s*<P2>\s*'xgi; $AmendmentPrefix= qr'BlockAmendment\ [^>]+ >'xgi; $AmendmentSuffix= qr'</P2></Block\ Amendment>'xgi; $bBlkTx =~ s{ $AmendmentPrefix $P2 (.*?) $AmendmentSuffix } { \n<lq><P2> $1 <\/P2>\n<\/lq> }xgi;

    I know, this doesn't help you directly, but it helps you detect errors yourself and if you post more readable code, more people will make the effort to read your code

    Do you know which regex is not working? If you don't know, just put print statements in your code that show you the data as it changes, or print the return values from the regexes, they tell you how many substitutions they made

    If you know, make sure you add \s* patterns where line breaks or spaces occur

      Thanks jethro,

      Will update them.

      The troubling regex is $xmlBody =~ s/$CbBlkTx/$bBlkTx/xgi; within the while.

      just trying to replace back the replaced result '$bBlkTx' in "$xmlBody" but its not perfoming.

      I may be wrong in trying this please add some advice in this.

      #Processing BlockAmendments while ($xmlBody =~ m/(<BlockAmendment [^>]*>(.+?)<\/BlockAmendment +>)/gi){ $CbBlkTx = $1; $bBlkTx = $CbBlkTx; if ($bBlkTx =~ m/(<BlockAmendment [^>]+><P2>(.*?)<\/P2><\/Bloc +kAmendment>)/){ $bBlkTx =~ s/<BlockAmendment [^>]+><P2>(.*?)<\/P2><\/Block +Amendment>/\n<lq><P2>$1<\/P2>\n<\/lq>/xgi; $bBlkTx =~ s/<P2><Pnumber>([^<]+)<\/Pnumber><P2para><Text> +(.+?)<\/Text>(.*?)<\/P2para><\/P2>/\n<s1><no>($1)<\/no>\n<pt>$2<\/pt> +$3\n<\/s1>/xgi; $bBlkTx =~ s/<P3><Pnumber>([^<]+)<\/Pnumber><P3para><Text> +(.+?)<\/Text>(.*?)<\/P3para><\/P3>/\n<s2><no>($1)<\/no>\n<pt>$2<\/pt> +$3\n<\/s2>/xgi; $bBlkTx =~ s/<P4><Pnumber>([^<]+)<\/Pnumber><P4para><Text> +(.+?)<\/Text>(.*?)<\/P4para><\/P4>/\n<s3><no>($1)<\/no>\n<pt>$2<\/pt> +$3\n<\/s3>/xgi; $bBlkTx =~ s/<P5><Pnumber>([^<]+)<\/Pnumber><P5para><Text> +(.+?)<\/Text>(.*?)<\/P5para><\/P5>/\n<s4><no>($1)<\/no>\n<pt>$2<\/pt> +$3\n<\/s4>/xgi; } $bBlkTx =~ s/<\/pt>\n<\/s([0-9])><Text>(.+?)<\/Text>/ $2<\/pt> +<\/s$1>/xgi; print "$CbBlkTx\n\n=====>\n$bBlkTx\n\n"; $xmlBody =~ s/$CbBlkTx/$bBlkTx/gi; }

      thanks

        Ok, usual problem is that some characters in $CbBlkTx are regex special chars. If you use

        $xmlBody =~ s/\Q$CbBlkTx\E/$bBlkTx/xgi;

        instead, all special characters between \Q and \E are escaped. If you want more information try 'perldoc -f quotemeta', quotemeta is the function behind \Q\E

        UPDATE: Added missing word "information". I didn't want to insinuate that quotemeta delivers more escapes than \Q\E ;-)