in reply to regexp over multiple lines
The regex term \s* means zero or more whitespace characters. There are five ASCII ones: space, \n, \r, \t, \f (space, new line, carriage return, tab, form feed); Perl 5.18 and later also count vertical tab. So this code just ignores any spaces or End-of-Line characters around the value (they are optional: zero or more is OK).

    #!/usr/bin/perl -w
    use strict;

    $/ = undef;   # undefines the record separator,
                  # which is "\n" by default;
                  # this means that there is no "line" separator

    my $bigString = <DATA>;   # would normally read one "line",
                              # but since the record separator is undefined
                              # it reads all the data as a single string;
                              # this is what "slurp the file" means

    my @prices = $bigString =~ m|<homePrice>\s*(.+?)\s*</homePrice>|ig;

    print "@prices";   # prints: 1.91 295.3 KEuro

    __DATA__
    <homePrice> 1.91</homePrice>
    <balh></balh><homePrice>295.3 KEuro</homePrice>
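Setting $/ globally affects every later read in the program. A minimal sketch of a common alternative, assuming a hypothetical helper named slurp_handle and made-up sample data, is to localize the change so the record separator is restored automatically:

```perl
use strict;
use warnings;

# Hypothetical helper: read everything left on an open filehandle
# as a single string.
sub slurp_handle {
    my ($fh) = @_;
    local $/;                 # undef only inside this sub; restored on return
    my $content = <$fh>;
    return defined $content ? $content : '';
}

# An in-memory filehandle stands in for a real file (sample data is made up).
my $sample = "<homePrice>\n  1.91\n</homePrice>\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my $big_string = slurp_handle($fh);
close $fh;

# \s* absorbs the newlines and spaces around the value.
my @prices = $big_string =~ m|<homePrice>\s*(.+?)\s*</homePrice>|g;
print "@prices\n";            # prints: 1.91
```

After slurp_handle returns, $/ is "\n" again, so any other line-by-line reads in the same script keep working as usual.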
The (.+?) means one or more of any character (other than a newline, unless the /s switch is used), but "calm your greediness down!": don't keep going, but stop capturing as soon as the term after the (.+?) can match. A "greedy match" would keep going until it reached the last possible match of that next term.
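The difference is easiest to see side by side. This is a small sketch with a made-up one-line string holding two homePrice elements; only the quantifier changes between the two patterns:

```perl
use strict;
use warnings;

# Made-up sample: two elements on a single line.
my $text = '<homePrice>1.91</homePrice><homePrice>295.3 KEuro</homePrice>';

# Non-greedy (.+?): stops at the first closing tag it can reach.
my ($lazy) = $text =~ m|<homePrice>(.+?)</homePrice>|;

# Greedy (.+): runs on to the last closing tag in the string.
my ($greedy) = $text =~ m|<homePrice>(.+)</homePrice>|;

print "non-greedy: $lazy\n";    # prints: non-greedy: 1.91
print "greedy:     $greedy\n";  # captures everything up to the final </homePrice>
```

The greedy capture swallows the first closing tag and the second opening tag, which is why the non-greedy form is the one wanted here.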
The /g switch means "match globally": keep going through the string and return all matches to the list on the left-hand side of the assignment. The /i is not needed here, but it means ignore case.
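As a rough illustration of what each switch contributes, the sketch below runs the same pattern with and without /g and /i. The sample string, with its mixed-case tags, is invented for the example:

```perl
use strict;
use warnings;

# Made-up sample whose tag names differ in case from the pattern.
my $text = '<HomePrice>1.91</HomePrice><homeprice>295.3 KEuro</homeprice>';

# /g in list context: every capture is returned to the array on the left.
my @all = $text =~ m|<homePrice>(.+?)</homePrice>|ig;

# No /g: the match stops after the first success.
my @first_only = $text =~ m|<homePrice>(.+?)</homePrice>|i;

# No /i: the pattern is case-sensitive, so these tags do not match at all.
my @case_sensitive = $text =~ m|<homePrice>(.+?)</homePrice>|g;

printf "with /ig: %d match(es)\n", scalar @all;             # 2
printf "with /i:  %d match(es)\n", scalar @first_only;      # 1
printf "with /g:  %d match(es)\n", scalar @case_sensitive;  # 0
```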
This \n stuff is more complicated to explain than it is to use. Basically, Perl will almost always do what you expect with files written on your own platform. When it reads a text file, it presents your OS's line ending as the single "\n" character. And when you do a write, it will write your OS-specific "\n" sequence.
Unix uses just <line feed> to mean End-of-Line. Windows programs (and many Internet protocols, such as HTTP and SMTP) use <carriage return>, <line feed> to mean End-of-Line, and classic Mac OS (before OS X) used <carriage return> alone. When reading a file written on your own platform, Perl will translate what it reads into a single \n character. That translation is not universal, though: a Perl program on Unix that reads my Windows file will see "\r\n" at the end of each line, and chomp removes only the "\n". In the code above this does not matter, because \s* matches the leftover \r as whitespace.
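When a file's line endings may not match the platform the script runs on, it can help to strip them explicitly instead of relying on chomp. A minimal sketch, assuming a hypothetical helper named strip_eol and working on plain strings so that it behaves the same on every platform:

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing line ending, whichever convention
# produced it: "\r\n" (Windows), "\n" (Unix), or a lone "\r" (classic Mac OS).
sub strip_eol {
    my ($line) = @_;
    $line =~ s/(?:\r\n|\n|\r)\z//;
    return $line;
}

# Made-up sample line with a Windows-style ending.
my $windows_line = "295.3 KEuro\r\n";

# chomp removes only the current value of $/ (by default "\n"),
# so the "\r" is still there afterwards.
chomp(my $chomped = $windows_line);

my $stripped = strip_eol($windows_line);

printf "after chomp:     %d characters\n", length $chomped;   # 12
printf "after strip_eol: %d characters\n", length $stripped;  # 11
```

For pattern matching on slurped data, as in this thread, \s* already covers all of these endings, so the helper is only needed when processing line by line.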