PerlMonks  

multi line matching problem

by jcpunk (Friar)
on Dec 16, 2003 at 01:42 UTC ( [id://314945]=perlquestion )

jcpunk has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a program (duh). I have a scalar holding a bunch of text and need to remove all the newlines from it, as well as any extra spaces (i.e. runs of " "), so I can format it however I want.
I promptly wrote this line: $text =~ s/\n\s*/ /g; and it failed badly, because it removed a bunch of things besides newlines and space characters. So I decided to forget the spaces and just kill the newlines, and wrote this line: $text =~ s/\n/ /g; sadly, it also fails.

So I turn to you guys. My request is for either a line that will strip out all the spaces and newlines, replacing them with one space, or a multiline matching expression that will find things of this nature:

<thing> condition condition randomness other junk </thing>
and replace it with a string of my choosing.

Why am I asking such an easy question? I think it has something to do with me not entirely understanding how to use s/// and m///. So a tutorial on how they differ would be most helpful (sadly, perldoc is not quite helpful enough for me).

thanks for the help


jcpunk
all code is tested, and doesn't work so there :p (variant on common PM sig for my own amusement)

Replies are listed 'Best First'.
Re: multi line matching problem
by Roger (Parson) on Dec 16, 2003 at 02:07 UTC
    Perhaps you are looking for something like this instead?
     use strict;
     use warnings;

     my $data = do { local $/; <DATA> };
     $data =~ y/ \n/ /s;

     # add the following if you want to strip space after >
     # and before < characters.
     # $data =~ s/(?<=>)\s|\s(?=<)//gm;

     print "$data\n";

     __DATA__
     <thing>
     condition condition
     randomness
     other junk
     more junk
     </thing>
    and the output -
    <thing> condition condition randomness other junk more junk </thing>
      Wow, that's perfect, and answered in record time.
      thanks
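A minimal sketch of what the y/ \n/ /s line above does, using a made-up sample string (the variable names and the text are assumptions for illustration). Note that this only touches spaces and newlines; tabs would pass through unchanged.

```perl
use strict;
use warnings;

# Hypothetical sample text; any string with embedded newlines will do.
my $text = "<thing>\ncondition   condition\n\n  other junk\n</thing>\n";

# tr/// (also spelled y///) maps characters one for one. The replacement
# list is shorter than the search list, so its last character (a space)
# is reused: both " " and "\n" become " ". The /s flag then squeezes each
# run of translated characters down to a single one.
(my $squeezed = $text) =~ tr/ \n/ /s;

print "[$squeezed]\n";   # [<thing> condition condition other junk </thing> ]
```

The trailing space comes from the final newline; strip it separately if it matters.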
Re: multi line matching problem
by Zaxo (Archbishop) on Dec 16, 2003 at 04:10 UTC

    One more way: $text = join ' ', split ' ', $text; uses magical split. There is one difference in the result: magical split will remove leading and trailing whitespace instead of replacing it with a single space.

    After Compline,
    Zaxo
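The difference described above can be seen side by side in a short sketch. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with leading and trailing whitespace.
my $text = "  leading\n and   trailing \n";

# split ' ' (a single-space string, not a regex) splits on any run of
# whitespace and discards the empty leading field; trailing empty
# fields are dropped by split's default behaviour.
my $joined = join ' ', split ' ', $text;

# The substitution leaves one space where the leading and trailing
# whitespace used to be.
(my $subbed = $text) =~ s/\s+/ /g;

print "[$joined]\n";   # [leading and trailing]
print "[$subbed]\n";   # [ leading and trailing ]
```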

Re: multi line matching problem
by SquireJames (Monk) on Dec 16, 2003 at 02:18 UTC
    $data =~ y/\n/ /; # This will remove all newlines and replace them with spaces
    Better still, if you want to collapse every run of whitespace into a single space and take care of any newlines at the same time, you can do so with this regex:
    $data =~ s/\s+|\n/ /g;
    The main key here is that \s* has been replaced with \s+, so it's no longer greedy and will replace only multiple space characters.

    The difference between s and m is that s is used to substitute text matching a pattern, whilst m is used to test for a pattern match. The m is rarely written out, as it is implied when you place a regular expression between two slashes (i.e. $data =~ /\n/g is the same as $data =~ m/\n/g). The y operator (also spelled tr) is a simple character replacement (transliteration).
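As a rough illustration of the three operators mentioned above, here is a small sketch. The sample data and variable names are assumptions, not taken from the original question.

```perl
use strict;
use warnings;

my $data = "one\ntwo\nthree";

# m// only tests for a match; it never changes the string.
# /\n/ and m/\n/ are the same thing.
my $has_newline = ($data =~ /\n/) ? 1 : 0;

# s/// edits the string in place and returns the number of
# substitutions it made.
my $copy  = $data;
my $count = ($copy =~ s/\n/ /g);

# tr/// (y///) maps single characters with no regex involved.
(my $translated = $data) =~ tr/\n/ /;

print "has newline: $has_newline\n";   # has newline: 1
print "substitutions: $count\n";       # substitutions: 2
print "$copy\n";                       # one two three
print "$translated\n";                 # one two three
```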

    Update: With much thanks to Enlil for the explanation, the whole regex could actually be written without the \n (i.e. $data =~ s/\s+/ /g).

      Part of your explanation is flatly erroneous. That \s+ is still greedy. The \n is never matched in your s/\s+|\n/ regex.

        For the record, this is the testing that I did, which works fine by me.
         $data = "15 65\n35 6\n445 34,546 59034584\n54 3,450 805;5409 8534\n\nStuff...";
         print($data);
         print("\nChanging now\n\n");
         $data =~ s/\s+|\n/ /g;
         print($data);
        Sorry if I misunderstood the question, and I'll take the hit on the greedy statement, perhaps I should have said greedy only for space characters, which is what is wanted....
      Thanks for the tip on implied m/// I will keep looking around at these things
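The point about greediness in the exchange above can be checked with a short sketch (sample data and variable names are invented here). It compares the regex with and without the \n alternative, and measures how much a single \s+ consumes.

```perl
use strict;
use warnings;

my $data = "15 65\n35 6\n\n  Stuff...";

# \s already matches "\n", so the first alternative always wins and
# the "|\n" branch is never reached.
(my $with_alt    = $data) =~ s/\s+|\n/ /g;
(my $without_alt = $data) =~ s/\s+/ /g;

print $with_alt eq $without_alt ? "identical\n" : "different\n";   # identical

# \s+ is greedy: it takes the whole run "\n\n  " before "Stuff".
my ($run) = $data =~ /(\s+)S/;
print length($run), "\n";   # 4
```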
Re: multi line matching problem
by Enlil (Parson) on Dec 16, 2003 at 02:22 UTC
    What else are you doing to $text? Your code, $text =~ s/\n\s*/ /g;, replaces each newline and any whitespace following it with a single space, which, looking at your test case, seems to do what you want (in this case):
     my $this = '<thing>
     condition condition
     randomness
     other junk
     </thing>';
     print "before: $this\n\n";
     $this =~ s/\n\s*/ /g;
     print "after: $this\n";
     __END__
     before: <thing>
     condition condition
     randomness
     other junk
     </thing>

     after: <thing> condition condition randomness other junk </thing>
    One thing though is that \n is in the set of \s characters so $text=~s/\s+/ /g; should suffice.

    -enlil
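A small sketch of where the two substitutions discussed above differ, using an invented string: s/\n\s*/ /g only touches whitespace that starts at a newline, so runs of spaces in the middle of a line survive, while s/\s+/ /g collapses every run.

```perl
use strict;
use warnings;

# Hypothetical input: extra spaces mid-line and after a newline.
my $text = "a   b\n   c\n";

(my $newline_only = $text) =~ s/\n\s*/ /g;
(my $all_space    = $text) =~ s/\s+/ /g;

print "[$newline_only]\n";   # [a   b c ]
print "[$all_space]\n";      # [a b c ]
```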

Re: multi line matching problem
by doom (Deacon) on Dec 16, 2003 at 06:56 UTC
    Rather than stripping newlines, I'll try to answer your other question, about doing a "multiline matching expression". Do you know about the "s///ms" trick? Typically you use the m and s modifiers when working on a string with embedded newlines: s changes the meaning of . so that it also matches a \n, and m changes the meaning of ^ and $ so that they match at the beginning and end of each line (most people I know use them both together and don't bother remembering which one does which...):
     my $string = <<ENDSTRING;
     <thing>
     condition condition
       condition   condition
        randomness
          other junk
     </thing>
     ENDSTRING

     print "$string\n";

     $string =~ s{ <thing>.*?</thing> }
                 {<THANG>blah</THANG>}msx;

     print "$string\n";
    That should output:
     <thing>
     condition condition
       condition   condition
        randomness
          other junk
     </thing> 
    
     <THANG>blah</THANG> 
    
    I'd recommend reading the "matching within multiple lines" recipe in the Perl Cookbook (that's recipe 6.6 in both the 1st and 2nd editions).
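For anyone who does want to remember which modifier does which, here is a minimal sketch that applies each one separately. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $string = "<thing>\nfirst line\nsecond line\n</thing>\n";

# /s: lets "." match a newline. Without it, .*? cannot cross line
# boundaries, so the opening and closing tags are never joined up.
my $plain  = ($string =~ m{<thing>.*?</thing>})  ? 1 : 0;
my $dotall = ($string =~ m{<thing>.*?</thing>}s) ? 1 : 0;

# /m: lets ^ and $ match at the start and end of every line.
# Without it, ^ only matches at the very start of the string.
my @no_m   = $string =~ /^(\w+) line$/g;
my @with_m = $string =~ /^(\w+) line$/mg;

print "without /s: $plain, with /s: $dotall\n";   # without /s: 0, with /s: 1
print "without /m: ", scalar(@no_m), " matches\n";   # without /m: 0 matches
print "with /m: @with_m\n";                          # with /m: first second
```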

    And by the way... you're not rolling your own code to parse HTML or XML, are you? You should be looking for already existing modules out on CPAN.

      Ah ha! In my foolish frustration I forgot about the Perl Cookbook and its wealth of useful data. Sadly I consulted Google (with very poor search terms, no less (shame upon me)), and when it turned up crap that wasn't helpful, I got lazy and posted here, after checking the tutorials section.
      The "s///ms" trick is going into my working memory. I thank thee for pointing it out to me.
      In regards to rolling my own code for parsing out HTML/XML, I sort of am, but not without good reason. The program in progress needs to have some files in HTML format for no terribly good reason (insert boss module), and so it shall be. And in regards to CPAN, let's just say an overly paranoid sysadmin stands between me and that alternative. I could of course go for the use lib '/bla/foo/meh' option, but for other reasons too complex to waste your time on, that also isn't workable (see site inconsistency).
      Ending rant before I realize what a crappy assignment I have and quit.

      Again thanks a lot for the help.


      jcpunk
      all code is tested, and doesn't work so there :p (variant on common PM sig for my own amusement)
        Perl By Example (on-line)

        the Perl Cookbook (on-line).

        ...And then, He rested. :) NOTE: I AM HAPPY TO HAVE STUDIED FROM THE PERL BY EXAMPLE BOOK AT MY PUBLIC LIBRARY (you should also try your public library, you'd be amazed! And it is always so quiet... And perhaps you'll find the Camel!).

Node Type: perlquestion [id://314945]
Approved by Roger
Front-paged by PERLscienceman