The primary reason you're trying to remove as much whitespace as possible (including and in particular newlines) is probably so that your XML tags don't get line-broken. And this is probably important because you're parsing XML tags using regular expressions. That entire issue and resulting data contortion is avoidable by using a real XML parser. XML::Simple is one of the easiest parsers to use for simple tasks, but there are others.

The /g modifier is necessary if you stick with the regexp solution, but the /i modifier only applies to characters that have some notion of upper/lower case. Space doesn't have such a context, and so the /i modifier is unnecessary, and in fact does impact performance (though probably not enough to care about). The point is to not wield modifiers unnecessarily without considering what they're being used for.
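
To make the difference concrete, here is a minimal sketch. The sample string and variable names are made up for illustration and are not taken from your script:

```perl
use strict;
use warnings;

# Hypothetical sample input: a small XML fragment with line breaks.
my $text = "<a>\n  <b> x </b>\n</a>";

# Without /g only the first run of whitespace is removed.
(my $once = $text) =~ s/\s+//;

# With /g every run of whitespace is removed.
(my $all = $text) =~ s/\s+//g;

# Adding /i gives the same result, since whitespace has no case.
(my $all_i = $text) =~ s/\s+//gi;

print "once:  $once\n";
print "all:   $all\n";
print "all_i: $all_i\n";
```

The /g and /gi results are identical, which is the whole argument for dropping /i here.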

The three-argument version of open is considered a safer programming practice. So is the use of lexical filehandles as opposed to global typeglob filehandles. For example: open my $infile, '<', $filename or die "Couldn't open the input file $filename: $!\n"; Which reminds me, you should get in the habit of using meaningful messages in die statements. That will aid in debugging.
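
Spelled out as a runnable sketch, that might look like the following. The temporary file is only there so the example stands alone; in your script $filename would be your real input file:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption for illustration: create a throwaway input file.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print {$tmp_fh} "first line\nsecond line\n";
close $tmp_fh or die "Couldn't close $filename: $!\n";

# Three-argument open with a lexical filehandle and a meaningful message.
open my $infile, '<', $filename
    or die "Couldn't open the input file $filename: $!\n";
my @lines = <$infile>;
close $infile or die "Couldn't close $filename: $!\n";

chomp @lines;
print scalar(@lines), " lines read\n";
```

Because the mode is a separate argument, nothing odd in $filename can change how the file is opened, and the lexical $infile cannot clash with a handle of the same name elsewhere in the program.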

The advantage to something like XML::Simple is that you don't have to invent a fragile and probably flawed regexp approach to parsing something that is quite difficult to parse correctly. XML::Simple dumps the XML file into a hash. If you're trying to match multiple things at once, you just have to ask: can I get what I'm after by diving into a hash instead? I think the answer is probably yes. But if a hash-based representation of your XML file isn't helpful, XML::Twig gives a tree-based representation instead. One of those two strategies ought to satisfy most basic needs. If you have to dig deeper, XML::Parser gives a lower-level hook into the parsing mechanics. But I doubt you need to dig that deep.
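
As a rough sketch of the hash approach: the XML below is invented for illustration, since I don't know what your data looks like, and it assumes XML::Simple is installed from CPAN. The ForceArray and KeyAttr options are set explicitly so the resulting structure is predictable:

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical document; line breaks inside it are no problem for a parser.
my $xml = <<'XML';
<library>
  <book id="1"><title>Programming Perl</title></book>
  <book id="2"><title>Learning Perl</title></book>
</library>
XML

# ForceArray keeps <book> as a list even if there is only one;
# KeyAttr => [] stops XML::Simple from folding the list into a hash by id.
my $data = XMLin($xml, ForceArray => ['book'], KeyAttr => []);

for my $book (@{ $data->{book} }) {
    print "$book->{id}: $book->{title}\n";
}
```

Note there is no whitespace stripping and no regexp in sight. Getting at the titles is just a matter of walking the data structure.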

Hope this helps...


Dave


In reply to Re^3: Help required inText manipulation by davido
in thread Help required inText manipulation by thirilog
