Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have some questions about using a regex to deal with white space and newlines. I have tried some code, but none of it works. (I am not very good at programming or Perl, but I am hoping to do this in Perl.)

Actually, I want to take out all the white space (spaces and tabs). Also, if there are more than two newlines between paragraphs, they should be reduced to two newlines.

This is a sample input,

START:The Product Manager is a deep expert on specific feature(s), service(s) or interface(s) that Mimecast provides or sells. This could be a service delivered through the cloud, or installable software that customers will use on their computers, SmartPhones, tablets and similar devices. For each product under your remit, you will have a good understanding of the competitors in the market and their pricing, as well as the relevance of the feature to IT administrators, users and customers. You will use this knowledge to formulate a roadmap of feature enhancements or, if required, the development of entirely new products and services. If this sounds like you, then please read on….. You will be required to spend time sharing ideas and product concepts with customers to gauge feedback, honing the feature pitch, and using this information to assist the Product Director with pricing deliberations. Your role will see you managing development through the entire development lifecycle, whereafter you will be responsible for launching improvements to Mimecast internally on a global scale. You will be comfortable assisting with the authoring of collateral for web-sites, printable assets and to be integrated into appropriate media packs. You will take strategic guidance from your line manager, the product director, and will work closely with the technical product managers on the team to ensure products are delivered on time. :END

This is the desired output,

START:The Product Manager is a deep expert on specific feature(s), service(s) or interface(s) that Mimecast provides or sells. This could be a service delivered through the cloud, or installable software that customers will use on their computers, SmartPhones, tablets and similar devices. For each product under your remit, you will have a good understanding of the competitors in the market and their pricing, as well as the relevance of the feature to IT administrators, users and customers. You will use this knowledge to formulate a roadmap of feature enhancements or, if required, the development of entirely new products and services. If this sounds like you, then please read on….. You will be required to spend time sharing ideas and product concepts with customers to gauge feedback, honing the feature pitch, and using this information to assist the Product Director with pricing deliberations. Your role will see you managing development through the entire development lifecycle, whereafter you will be responsible for launching improvements to Mimecast internally on a global scale. You will be comfortable assisting with the authoring of collateral for web-sites, printable assets and to be integrated into appropriate media packs. You will take strategic guidance from your line manager, the product director, and will work closely with the technical product managers on the team to ensure products are delivered on time.:END

Please help monks.

Thanks in advance.

Replies are listed 'Best First'.
Re: How to remove white space and more than a particular number of new lines!
by ww (Archbishop) on Apr 10, 2014 at 11:45 UTC

    Please tell us what you have tried.

    Please don't expect the Monks to do your work for you without any hint of what you've tried, so far. We're here to help you learn, not to perform as code-a-matic.


    Questions containing the words "doesn't work" (or their moral equivalent) will usually get a downvote from me unless accompanied by:
    1. code
    2. verbatim error and/or warning messages
    3. a coherent explanation of what "doesn't work" actually means.

    check Ln42!

Re: How to remove white space and more than a particular number of new lines!
by mr_mischief (Monsignor) on Apr 10, 2014 at 14:56 UTC

    You have a specification problem. A few, actually.

    • You're asking for two newlines and showing three. Did you mean two blank lines? Or are you counting one new line as part of the paragraph?
    • You don't want to get rid of all whitespace. There's a space between each word in the output, except when ':END' appears with no whitespace before it.
    • You have specified a regex but you likely want more than one regex. Some of this can be done as simply with character translation as with a regex. If you're feeding a regex into some third-party system that requires a single regex as a data input then I feel for you. Otherwise break the problem into parts.
    • There is an implicit assumption that text manipulation must be done only with text manipulation tools. Loops, state flags, and counters may actually be clearer in many cases.

    This smells a bit like homework. I don't mind helping with homework, but I'm not going to give a final version you'd want a professor to see. The one-liners below smell. They stink. They are not so much solutions as new problems. If I were teaching a class and a student brought this to me I'd tell her to come back when the assignment was complete.

    perl -e '$/--;$_=<>;s/[ \t]+/ /g;s/ ?(\r?\n) ?/$1/g;s/($1){3,}/$1$1$1/g;s/[\s\r\n]+(:END)/$1/;print'
    or
    perl -e '$/=undef;while(<>){tr/ \t/ /ds;s/ ?(\r?\n) ?/$1/g;s/\s+:/:/g;s/(\r?\n){3,}/$1$1$1/g;print}'

    The above are designed to give the model output from the model input. They don't do so in a clean or maintainable way. You may be able to glean some information from them. Most of all, perhaps you'll learn to ask about which specific parts you don't understand and how to improve them.
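For readers who want to see the "break the problem into parts" advice spelled out, here is a minimal sketch of the same kind of clean-up written as separate, commented steps. It is not mr_mischief's code: the helper name `tidy_text` is made up for illustration, and it assumes that "two newlines" means at most two blank lines between paragraphs (the same limit the one-liners above apply) and that collapsing runs of spaces and tabs inside a line is acceptable.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): tidy a whole text in small steps.
sub tidy_text {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;        # normalise Windows line endings first
    $text =~ tr/ \t/ /s;         # tabs become spaces; runs are squeezed to one space
    $text =~ s/^ +| +$//mg;      # trim spaces at the start and end of every line
    $text =~ s/\n{4,}/\n\n\n/g;  # assumption: keep at most two blank lines in a row
    return $text;
}

# Small demonstration on a hard-coded string.
my $demo = "first  \t paragraph \n\n\n\n\n \tsecond paragraph\n";
print tidy_text($demo);
```

Each step can be tested and changed on its own, which is the main point of splitting the work up. Note that this version does alter lines that contain text (it trims and collapses their white space), so it fits the original question rather than the follow-up request below.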

      Thank you, mr_mischief, it helped me a lot...

      But this code affects all the lines. I want to keep a line as it is if it contains anything other than spaces and tabs.

      For example I am changing the input and output to this,

      Input,

      TEST1 fgdfgb Test2 dlfgndkflgbn test3 /,g;dhjzmpa h7rg tRruber Test4 ';gfojmd ofimt lj kjbkhb The Product Manager is a deep expert on specific feature(s), service(s) or interface(s) that Mimecast provides or sells. This could be a service delivered through the cloud, or installable software that customers will use on their computers, SmartPhones, tablets and similar devices. For each product under your remit, you will have a good understanding of the competitors in the market and their pricing, as well as the relevance of the feature to IT administrators, users and customers. You will use this knowledge to formulate a roadmap of feature enhancements or, if required, the development of entirely new products and services. If this sounds like you, then please read on….. You will be required to spend time sharing ideas and product concepts with customers to gauge feedback, honing the feature pitch, and using this information to assist the Product Director with pricing deliberations. dfgdfg Your role will see you managing development through the entire development lifecycle, whereafter you will be responsible for launching improvements to Mimecast internally on a global scale. You will be comfortable assisting with the authoring of collateral for web-sites, printable assets and to be integrated into appropriate media packs.

      Desired output,

      TEST1 fgdfgb Test2 dlfgndkflgbn test3 /,g;dhjzmpa h7rg tRruber Test4 ';gfojmd ofimt lj kjbkhb The Product Manager is a deep expert on specific feature(s), service(s) or interface(s) that Mimecast provides or sells. This could be a service delivered through the cloud, or installable software that customers will use on their computers, SmartPhones, tablets and similar devices. For each product under your remit, you will have a good understanding of the competitors in the market and their pricing, as well as the relevance of the feature to IT administrators, users and customers. You will use this knowledge to formulate a roadmap of feature enhancements or, if required, the development of entirely new products and services. If this sounds like you, then please read on….. You will be required to spend time sharing ideas and product concepts with customers to gauge feedback, honing the feature pitch, and using this information to assist the Product Director with pricing deliberations. dfgdfg Your role will see you managing development through the entire development lifecycle, whereafter you will be responsible for launching improvements to Mimecast internally on a global scale. You will be comfortable assisting with the authoring of collateral for web-sites, printable assets and to be integrated into appropriate media packs.

      Many thanks,

        You were shown two alternatives. Which one did you use? (Hint: You must copy/paste directly from the shell window where you ran the command.) What actual output did you get? (Again, use copy/paste.)

        BTW, it's not really clear that the "desired output" is consistent with the OP description of what you're trying to do. Please be more explicit in the explanation, or more careful about presenting the "desired output".

        Also, it looks like your input contains some non-ASCII characters. Do you know what sort of character encoding is being used? If it's UTF-8, you'll want to add '-C31' as the first command-line option on the perl command line that mr_mischief gave you (i.e. perl -C31 -e …). If it's something other than UTF-8, you'll want to handle encoding conversion.
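To illustrate why the encoding question matters, here is a small sketch using the core Encode module. The sample string is an assumption made up for the demonstration (the UTF-8 bytes of a horizontal ellipsis after some text); it is not taken from the poster's actual file, whose encoding is unknown.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample: "read on" followed by the UTF-8 bytes of U+2026 and two dots.
my $bytes = "read on\xE2\x80\xA6..";

# Decoding turns the three ellipsis bytes into one character, so regexes
# and length() then work on characters rather than raw bytes.
my $chars = decode('UTF-8', $bytes);

printf "%d bytes, %d characters\n", length($bytes), length($chars);

# Encode again on the way out so the output is valid UTF-8.
print encode('UTF-8', $chars), "\n";
```

If the data turns out to be in another encoding, the first argument to `decode` would need to name that encoding instead.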

Re: How to remove white space and more than a particular number of new lines!
by locked_user sundialsvc4 (Abbot) on Apr 10, 2014 at 12:22 UTC

    Without being a code-a-matic for you either, here’s a classic treatment of this sort of problem which can be done by Perl, or, perhaps even more simply, by awk:

    There are two kinds of lines here: those that are blank, and those that are not. (A blank line would match something like /^\s*$/ ...) If the line is not blank, you output it and reset a counter of consecutive blank lines to zero. If it is blank, you increment that counter, and output the line only if the count so far is two or less. You must initialize the counter to zero before the loop begins. With this algorithm, a gap of one or two blank lines would stand, but a gap of more than two would be reduced to two.

    In data-processing in general, we all encounter a lot of “text-file problems” which can be reduced to the same basic idea: what different types of lines might I encounter, how do I recognize them, and, having recognized them, what do I do with them. (Also, what do I do at the beginning of the file and what do I do at the end.) Well, the awk tool is specifically designed around that very idea, and the Perl programming language was originally birthed as an über-extension of awk ... both for good reasons.

    Now, I’m sure that you can see your way to writing that very-short program based on this?
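The counter-based approach described in this reply might be sketched as follows. This is only one possible reading of the description: the helper name `squeeze_blank_lines` and the `$max_blank` parameter are invented for illustration, and the sketch assumes that a whitespace-only line should be written out as an empty line, which goes slightly beyond "output the line" but matches the stated wish to remove stray spaces and tabs.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a limit and a list of lines, returns the kept lines.
sub squeeze_blank_lines {
    my ($max_blank, @lines) = @_;
    my $blank_run = 0;              # consecutive blank lines seen so far
    my @out;
    for my $line (@lines) {
        if ($line =~ /^\s*$/) {     # blank: empty, or only spaces and tabs
            $blank_run++;
            push @out, "\n" if $blank_run <= $max_blank;
        }
        else {
            $blank_run = 0;         # reset the counter on any real content
            push @out, $line;       # non-blank lines pass through unchanged
        }
    }
    return @out;
}

# Small demonstration: four blank lines between "one" and "two" become two.
my @demo = ("one\n", " \t\n", "\n", "\n", "\n", "two\n");
print squeeze_blank_lines(2, @demo);
```

To run it over a file or standard input, `print squeeze_blank_lines(2, <>);` would read every line in list context. Because non-blank lines are passed through untouched, this version also fits the follow-up request to leave lines with content as they are.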

      I tried to capture whitespace with /^\s*$/ but it's not working. Can you please show the correct way?
        Show us the actual code that you used when you "tried to capture whitespaces". Also tell us what "its not working" really means: what did it do, and how did that differ from what you wanted to do. These details will help us help you.