in reply to Searching data file

In your get_data() sub, your while loop takes up where the other one left off. The first thing it does is read in a line and place it in $_, but since you already read the author line, you lose it.

You could get around that by changing your loop in the get_data() sub to do { . . . } while ( <FH> ); instead. That wouldn't read in the next line until you had processed the first line.
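A minimal sketch of that do/while change, assuming a record layout of `<ref>` ... `</ref>` with one `<author>` line per record. The original data file isn't shown in this thread, so the sample data and the find_author() driver are hypothetical, and the filehandle is passed in rather than using the bareword FH:

```perl
use strict;
use warnings;

# Collects lines from the one currently in $_ through the closing </ref>.
# The do-block runs once BEFORE the first read, so the line the caller
# already matched (the author line) is not thrown away.
sub get_data {
    my ($fh) = @_;
    my @lines;
    do {
        push @lines, $_;
        return @lines if m|</ref>|;
    } while ( <$fh> );
    return @lines;    # hit end of file without seeing </ref>
}

# Hypothetical driver: scan for an author line, then hand off to get_data().
sub find_author {
    my ( $fh, $search ) = @_;
    while ( <$fh> ) {
        return get_data($fh) if /<author>\s*\Q$search\E/i;
    }
    return;
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = "<ref>\n<author>Smith\n<title>First\n</ref>\n\n"
           . "<ref>\n<author>Jones\n<title>Second\n</ref>\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print find_author( $in, 'Jones' );
```

Note that this still only captures the record from the author line onward; anything that came before the match in the same record has already been read past.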

If your records are always separated by a blank line, though, I'd suggest reading a whole record at a time by putting perl in "paragraph" mode. You do that by setting $/ = "";. (See the entry for $/ in perldoc perlvar for more information.) Then, I'd parse each record into a hash. That would give you much more flexibility.
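Here is a rough sketch of the paragraph-mode approach. It assumes each field sits on its own line as `<name>value` with no closing tag, which is a guess at the format since the data file isn't shown here. The read_records() name and the sample data are made up for illustration:

```perl
use strict;
use warnings;

# Reads blank-line-separated records and turns each one into a hash.
sub read_records {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode: one blank-line-delimited chunk per read
    my @records;
    while ( my $para = <$fh> ) {
        my %rec;
        # Assumed field layout: "<name>value" on a single line.
        # Lines with no value (such as a bare <ref>) are skipped.
        while ( $para =~ /^<(\w+)>[ \t]*(.+?)[ \t]*$/mg ) {
            $rec{$1} = $2;
        }
        push @records, \%rec if %rec;
    }
    return @records;
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = "<ref>\n<author>Smith\n<title>First\n</ref>\n\n"
           . "<ref>\n<author>Jones\n<title>Second\n</ref>\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";

for my $rec ( read_records($in) ) {
    print "$rec->{title}\n" if $rec->{author} =~ /jones/i;
}
```

With the record in a hash, searching by a different field is just a matter of testing a different key.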

By the way, do you have any control over the data? Because, if you do, I'd consider changing your format. Real XML would probably be better in the long run than that bizarre broken XML-ish format.

-sauoq
"My two cents aren't worth a dime.";

Re: Re: Searching data file
by ysth (Canon) on Nov 02, 2003 at 19:35 UTC
    Agreement. The key point is that you can't read part of the data, do the match, and then expect to be able to read all of the data. Instead, read each record, see if it matches, then print it. If paragraph mode doesn't work (i.e. not always a blank line between records), you can do it manually with something like:
    sub readrec {
        my $line;
        my $inrec;
        my $rec;
        while (defined($line = <FH>)) {
            last if $line eq "</ref>\n";
            $rec .= $line if $inrec;
            $inrec ||= $line eq "<ref>\n";
        }
        $rec;
    }
    (This actually strips off the <ref> and </ref> tags; if you want to preserve them, reverse the order of the lines in the while loop.)
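For illustration, the tag-preserving variant with the three loop lines reversed might look like the sketch below. The readrec_with_tags() name and the passed-in filehandle are changes made for this example, not part of the code above:

```perl
use strict;
use warnings;

# Same loop as readrec, with the order of the three statements reversed
# so that the <ref> and </ref> lines end up in the returned string.
sub readrec_with_tags {
    my ($fh) = @_;
    my ( $inrec, $rec );
    while ( defined( my $line = <$fh> ) ) {
        $inrec ||= $line eq "<ref>\n";    # start tag switches collection on
        $rec .= $line if $inrec;          # so the start tag itself is kept
        last if $line eq "</ref>\n";      # end tag was appended just above
    }
    return $rec;    # undef if no record was found before end of file
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = "<ref>\n<author>Smith\n</ref>\n\n<ref>\n<author>Jones\n</ref>\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
while ( defined( my $rec = readrec_with_tags($in) ) ) {
    print $rec if $rec =~ /jones/i;
}
```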

    You could also parse the record as you read it, but to search any of the fields, it's probably more convenient to return just a string to match against, and split it up into the components if it matches.

    (BTW, the OP's match statement doesn't look as if it would work at all.)

    (updated to remove comment about match based on misunderstanding)

      BTW, the OP's match statement doesn't look as if it would work at all.

      This one: /\<author\>\s*(\D*)$search/i? That should work fine for the examples he gave. It lacks robustness; it probably isn't the best expression of what he is looking for; a tail search doesn't seem very useful; and it certainly isn't how I would write it. But it should work.

      By the way, if I were to do it the way you suggested, I wouldn't bother to reinvent the flip-flop operator. I'd write it like this:

      sub readrec {
          my $rec = '';
          while ( <FH> ) {
              $rec .= $_ if m|<rec>| .. m|</rec>|;
              last if m|</rec>|;
          }
          $rec;
      }
      It'd be better to pass the filehandle, of course. Also, your version is rather brittle because of your use of string equality. If there happens to be space between a record's start or end tag and the following newline, yours breaks.
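A sketch of the same flip-flop idea with the filehandle passed in. It assumes the records are delimited by `<ref>` and `</ref>` as in ysth's version, and the sample data is invented. Because the tags are matched with a regex rather than string equality, trailing whitespace after a tag doesn't break it:

```perl
use strict;
use warnings;

# Returns the next record (tags included), or '' at end of file.
sub readrec {
    my ($fh) = @_;
    my $rec = '';
    while ( my $line = <$fh> ) {
        # The range operator in scalar context is the flip-flop: it turns
        # true on a <ref> line and stays true through the </ref> line.
        $rec .= $line if $line =~ m|<ref>| .. $line =~ m|</ref>|;
        last if $line =~ m|</ref>|;
    }
    return $rec;
}

# Made-up sample data; note the trailing spaces after the first <ref>.
my $sample = "<ref>  \n<author>Smith\n</ref>\n\n<ref>\n<author>Jones\n</ref>\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
while ( length( my $rec = readrec($in) ) ) {
    print $rec if $rec =~ /<author>\s*jones/i;
}
```

One caveat: the flip-flop keeps its state between calls, so a file that ends in the middle of a record would leave it switched on for the next call.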

      One other thing... in this construct:

      while (defined($line = <FH>)) {
      that defined() check isn't needed. Usually, including it could be classified as so-called "cargo-cult" programming. Honestly, I too probably still do it on occasion out of old habit. If you wanted your code to run quietly with warnings enabled on 5.004_04, it was a necessity.¹ There's really no reason for it these days though, provided you aren't still supporting 5.004_04. And if you are, it's time to consider upgrading. ;-)

      1. I think the practice of using defined() in that manner primarily exists because of that warning emitted by 5.004_04 and not because of a real need. The construct, while ( <FH> ) is somewhat magical and checks for definedness. I think code that included an assignment in the loop, like while ( $line = <FH> ), did not check for definedness until sometime after 5.004_04. It was, however, a minor issue in reality because "\n" and "0\n" are both true values anyway. So, you might've run into an obscure bug if you changed $/ to something like "0" but it wouldn't have affected most code.
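A small sketch to check this on whatever perl you have installed. It reads a made-up string whose last line is a bare "0" with no trailing newline, which is the case where a plain truth test (rather than a definedness test) would end the loop one line early. The count_lines() helper is hypothetical:

```perl
use strict;
use warnings;

# Counts lines using the assignment form of the loop, with no explicit defined().
sub count_lines {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my $n = 0;
    while ( my $line = <$fh> ) {    # perl tests definedness here, not truth
        $n++;
    }
    return $n;
}

# The final line is "0", which is false as a string but still defined.
print count_lines("first\n0"), "\n";    # prints 2
```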

      -sauoq
      "My two cents aren't worth a dime.";
      