UPDATE!!

It wasn't a regex problem at all! It was the way I was reading in the file.

I had erroneously set $/ = '' (paragraph mode), so only the data up to the first blank line was read.

When I changed that to undef $/ (slurp mode), everything worked fine.

Oops!
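
For anyone who hits the same thing, here is a minimal sketch of the difference between the two settings. It is not the original script: the data is shortened, the helper names read_first_record and count_headers are made up for illustration, and it reads from an in-memory filehandle, which assumes Perl 5.8 or later (so it would not run as-is on the 5.6.1 mentioned below).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened sample data with a blank line before the second header
# (mirrors the failing case; not the original data).
my $data = join "\n",
    'My Header', 'Page 53', 'Some Text', '',
    'My Header', 'Page 54', '3 Chapter Title', '';

# Same header pattern as in the post.
my $header = join "\\s*\\n", 'My\s+Header', 'Page\s+\d+';

# Hypothetical helper: read one record from an in-memory filehandle
# (assumes Perl 5.8+) using the given input record separator.
sub read_first_record {
    my ($text, $separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $separator;    # restored automatically when the sub returns
    my $record = <$fh>;
    close $fh;
    return $record;
}

# Hypothetical helper: count how many headers the pattern finds in a string.
sub count_headers {
    my ($text) = @_;
    my $count = () = $text =~ /$header\n+/g;
    return $count;
}

my $paragraph = read_first_record($data, '');       # paragraph mode
my $slurped   = read_first_record($data, undef);    # slurp mode

print "paragraph mode: ", count_headers($paragraph), " header(s)\n";
print "slurp mode:     ", count_headers($slurped), " header(s)\n";
```

In paragraph mode the first read stops at the blank line, so the regex never sees the second header. Using local $/ inside a block or sub, as above, also keeps the change from leaking into other reads in the program.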

END_OF_UPDATE

Hello all,

I have a file that I need to parse. Each page has a header and a page number.

At this point, I am having difficulties just grabbing the header for each page (part of a more complex regex).

Problem:

If there is an extra blank line in front of page 54's header, the regex does not find that page. If the blank line is replaced with an 'x', the page header is found.

Code & output when it doesn't work:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $header = join("\\s*\\n",
        'My\s+Header',
        'Page\s+\d+',
    );

    $/ = '';
    my $file = <DATA>;
    while ($file =~ /($header\n+)/g) {
        print "pos = ", pos($file), "\n";
        print $1, "\n";
    }
    __DATA__
    My Header
    Page 53
    Some Text
    Some More Text
    Some More Text
    Some More Text

    My Header
    Page 54
    3 Chapter Title
    My Header
    Page 55
    Some Text
    Some More Text
    Some More Text
    Some More Text

Result:

    pos = 18
    My Header
    Page 53

Code & output when it does work:

same code, slightly different data

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $header = join("\\s*\\n",
        'My\s+Header',
        'Page\s+\d+',
    );

    $/ = '';
    my $file = <DATA>;
    while ($file =~ /($header\n+)/g) {
        print "pos = ", pos($file), "\n";
        print $1, "\n";
    }
    __DATA__
    My Header
    Page 53
    Some Text
    Some More Text
    Some More Text
    Some More Text
    x
    My Header
    Page 54
    3 Chapter Title
    My Header
    Page 55
    Some Text
    Some More Text
    Some More Text
    Some More Text

Result:

    pos = 18
    My Header
    Page 53

    pos = 93
    My Header
    Page 54

    pos = 127
    My Header
    Page 55

I have checked the data with a binary editor, and nothing weird is on the blank line.

I'm using ActiveState Perl v5.6.1 on a Windows 2000 Professional machine.

Any insight would be appreciated.

Sandy

UPDATE:

Just a note: the code above was part of a series of tests. The same result is obtained even if I use the 'sgmx' modifiers on the regex.


In reply to Need help with a regex by Sandy
