Searching data file

parisa has asked for the wisdom of the Perl Monks concerning the following question:

Re: Searching data file by sauoq (Abbot) on Nov 02, 2003 at 10:43 UTC
In your `get_data()` sub, your while loop takes up where the other one left off. The first thing it does is read in a line and place it in `$_`, but since you already read the author line, you lose it. You could get around that by changing your loop in the `get_data()` sub to `do { . . . } while ( <FH> );` instead. That wouldn't read in the next line until you had processed the first line. If your records are always separated by a blank line, though, I'd suggest reading one record at a time by putting perl in "paragraph" mode. You do that by setting `$/ = "";`. (See the entry for the `$/` var in `perldoc perlvar` for more information.) Then, I'd parse each record into a hash. That would give you much more flexibility. By the way, do you have any control over the data? Because, if you do, I'd consider changing your format. Real XML would probably be better in the long run than that bizarre broken XML-ish format. -sauoq "My two cents aren't worth a dime.";
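The paragraph-mode suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample text, the `read_records` name, and the "one `<tag> value` pair per line" layout are all hypothetical, and the real file may not have a blank line between records.

```perl
use strict;
use warnings;

# Hypothetical sample: two records separated by a blank line (assumed layout).
my $sample = <<'END';
<ref>
<author> Smith
<year>1990
<title> First Book
</ref>

<ref>
<author> Jones
<year>1995
<title> Second Book
</ref>
END

# Read blank-line-separated records and parse each one into a hash
# whose values are array refs (a field such as author may repeat).
sub read_records {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode: each read returns one whole record
    my @records;
    while ( my $para = <$fh> ) {
        my %rec;
        # Pick up "<tag> value" lines; bare tags like <ref> are skipped.
        while ( $para =~ /^<(\w+)>[ \t]*(\S.*?)[ \t]*$/mg ) {
            push @{ $rec{$1} }, $2;
        }
        push @records, \%rec;
    }
    return @records;
}

# An in-memory filehandle stands in for the real data file.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;

# Print title and year for every record with an author matching "smith".
for my $rec ( grep { grep { /smith/i } @{ $_->{author} || [] } } @records ) {
    print "$rec->{title}[0] ($rec->{year}[0])\n";
}
```

Because each record arrives whole, the search and the printing both work from the same hash, so no line is consumed by one loop and lost to the other.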
Re: Re: Searching data file by ysth (Canon) on Nov 02, 2003 at 19:35 UTC
Agreement. The key point is that you can't read part of the data, do the match, and then expect to be able to read all of the data. Instead, read each record, see if it matches, then print it. If paragraph mode doesn't work (i.e. not always a blank line between records), you can do it manually with something like: `sub readrec { my $line; my $inrec; my $rec; while (defined($line = <FH>)) { last if $line eq "</ref>\n"; $rec .= $line if $inrec; $inrec ||= $line eq "<ref>\n"; } $rec; }` (This actually strips off the <ref> and </ref> tags; if you want to preserve them, reverse the order of the lines in the while loop.) You could also parse the record as you read it, but to search any of the fields, it's probably more convenient to return just a string to match against, and split it up into the components if it matches. ~~(BTW, the OP's match statement doesn't look as if it would work at all.)~~ (updated to remove comment about match based on misunderstanding)
Re: Re: Re: Searching data file by sauoq (Abbot) on Nov 03, 2003 at 00:22 UTC
"BTW, the OP's match statement doesn't look as if it would work at all."

This one: `/\<author\>\s*(\D*)$search/i`? That should work fine for the examples he gave. It lacks robustness; it probably isn't the best expression of what he is looking for; a tail search doesn't seem very useful; and it certainly isn't how I would write it. But it should work.

By the way, if I were to do it the way you suggested, I wouldn't bother to reinvent the flip-flop operator. I'd write it like this: `sub readrec { my $rec = ''; while ( <FH> ) { $rec .= $_ if m|<ref>| .. m|</ref>|; last if m|</ref>|; } $rec; }` It'd be better to pass the filehandle, of course. Also, your version is rather brittle because of your use of string equality. If there happens to be space between a record's start or end tag and the following newline, yours breaks.

One other thing... in this construct: `while (defined($line = <FH>)) {` that `defined()` check isn't needed. Usually, including it could be classified as so-called "cargo-cult" programming. Honestly, I too probably still do it on occasion out of old habit. If you wanted your code to run quietly with warnings enabled on 5.004_04, it was a necessity.¹ There's really no reason for it these days though, provided you aren't still supporting 5.004_04. And if you are, it's time to consider upgrading. ;-)

¹ I think the practice of using `defined()` in that manner primarily exists because of that warning emitted by 5.004_04 and not because of a real need. The construct `while ( <FH> )` is somewhat magical and checks for definedness. I think code that included an assignment in the loop, like `while ( $line = <FH> )`, did not check for definedness until sometime after 5.004_04. It was, however, a minor issue in reality because `"\n"` and `"0\n"` are both true values anyway. So, you might've run into an obscure bug if you changed `$/` to something like "0" but it wouldn't have affected most code.

-sauoq "My two cents aren't worth a dime.";
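For what it's worth, the two suggestions in this reply (pass the filehandle, and don't depend on exact string equality) might be combined along these lines. It is a sketch, not the code from the thread: the whitespace-tolerant patterns and the sample text are assumptions made for illustration.

```perl
use strict;
use warnings;

# Return the next <ref>...</ref> record from $fh as a string,
# or undef when no further record is found.
sub readrec {
    my ($fh) = @_;
    my $rec = '';
    while ( my $line = <$fh> ) {
        # The flip-flop is true from the opening-tag line through the
        # closing-tag line; whitespace around the tags is tolerated.
        $rec .= $line
            if $line =~ m|^\s*<ref>\s*$| .. $line =~ m|^\s*</ref>\s*$|;
        last if $line =~ m|^\s*</ref>\s*$|;
    }
    return length($rec) ? $rec : undef;
}

# Hypothetical input: leading junk, plus stray spaces after some tags.
my $sample = "junk\n<ref> \n<author> Smith\n</ref>\n<ref>\n<author> Jones\n</ref>  \n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @recs;
# readrec() is an ordinary sub call, so the implicit definedness test that
# while ( <FH> ) gets does not apply here and defined() is spelled out.
while ( defined( my $rec = readrec($fh) ) ) {
    push @recs, $rec;
}
close $fh;

print scalar(@recs), " records read\n";
```

One thing to keep in mind: the flip-flop keeps its state between calls, so a file that ends in the middle of a record would leave it "on" for the next call.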
•Re: Searching data file by merlyn (Sage) on Nov 02, 2003 at 17:13 UTC
It's really too bad that data isn't XML, instead being some SGML, I presume. If it were XML, there'd be a myriad of solutions. -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.
Re: Searching data file by graff (Chancellor) on Nov 02, 2003 at 22:58 UTC
As others have pointed out, part of the problem is having `while (<FH>)` in two different places -- this makes it a lot harder to control what's going on. Even though the data is not XML, it appears to be good enough as SGML. There is SGML::Parser, if you feel so inclined, but with stuff like the example you showed, I tend to be more "trusting" of the data (especially if I have already validated the data separately, e.g. using James Clark's SP package), and I would use some simple perl-isms to chunk my way through it (after I have used a separate, simple perl script to scan the data and make sure my simplifying assumptions are correct). For example:

    if ( open( FH, "<data.sgl" ) ) {
        local $/ = "</ref>\n";  # make close-tag the input rec. separator
        while (<FH>)            # now read a full "ref" element into $_
        {
            next unless ( m{<aulist>.*?$search.*?</aulist>}is );

            # now split the ref element up and print it
            # Since you have a finite number of things you want
            # to put into your output listing, it'll be easiest
            # just to look for those things:
            for my $fld ( qw/author year source keys title/ ) {
                @{$data{$fld}} = m{<$fld>([^<]+)}ig;
            }
            print_record( \%data ); # all hash elements are array refs
                                    # (some may be empty/undef, some may have 1 item)
        }
        close FH;
    }
    else {
        die "Unable to read data.sgl\n";
    }

That assumes that the search term only applies to the contents of the "author list" sub-element within the ref, but it should be easy to see how it would extend to other sub-elements like title, etc. I'm not sure how much you should worry about checking the content of $search before passing it into a regex match... using the "m{...$search...}" type of usage shouldn't risk any real trouble.
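On the question of checking `$search`: one caveat is that an interpolated search term is treated as a pattern, so a term containing something like an unbalanced parenthesis makes the match die with a regex compile error. If the term is meant to be matched literally, wrapping it in `\Q...\E` sidesteps that. A small sketch, where the record text and the `author_matches` name are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical record, shaped like the data discussed in this thread.
my $record = "<ref>\n<aulist>\n<author> Smith, J. (ed.)\n</aulist>\n"
           . "<title> A Title\n</ref>\n";

# Return 1 if $search occurs literally inside the <aulist> element, else 0.
sub author_matches {
    my ( $rec, $search ) = @_;
    # \Q...\E quotes regex metacharacters in the interpolated term.
    return $rec =~ m{<aulist>.*?\Q$search\E.*?</aulist>}is ? 1 : 0;
}

print author_matches( $record, 'J. (ed.)' ), "\n";
print author_matches( $record, 'J. (ed' ),   "\n";  # unbalanced paren is harmless when quoted
print author_matches( $record, 'Jones' ),    "\n";
```

If regex syntax in the search box is actually wanted, leave the `\Q...\E` out and wrap the match in `eval { ... }` to catch bad patterns instead.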
Re: Searching data file by jZed (Prior) on Nov 03, 2003 at 01:35 UTC
I would ask for more details, since you've only shown us sample data. If, for example, everything follows your sample in only omitting end-tags on single-line tags with no embedded tags, then a simple regex (to add end-tags where they are missing) will turn this into real XML and you can come to a real solution. If it's not a huge amount of data, you can just read the whole file into memory as a string, transform it into real XML, and then use one of the XML- or DBI-related modules (or whatever modules you like) to query it. Of course, in the long run, you're better off having whatever produces the data make it real XML to begin with, but in the meantime, this solution should work.
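The "simple regex to add end-tags" idea might look something like this. It is a sketch that leans on the assumptions stated above (an unclosed tag is always followed by its text on the same line, and no line already carries its own end-tag), plus the further assumption that the text contains no `&` or `<` needing to be escaped. The sample data is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample in the XML-ish format discussed in this thread.
my $sgmlish = <<'END';
<ref>
<aulist>
<author> Smith
</aulist>
<year>1990
<title> A Title
</ref>
END

# Close every tag that is followed by text on the same line.
# Container tags (<ref>, <aulist>) have no trailing text and are left alone.
( my $xml = $sgmlish ) =~ s{^<(\w+)>[ \t]*(\S.*?)[ \t]*$}{<$1>$2</$1>}mg;

print $xml;
```

After that transformation the string can be handed to whichever XML module you prefer.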
Re: Searching data file by Roger (Parson) on Nov 03, 2003 at 00:41 UTC
I have rewritten your code with a simpler algorithm.

    use strict;
    use Data::Dumper;

    my %records;   # hash to store book info based on author
    my $data;

    while (<DATA>) {
        chomp;
        $data .= $_;
        if ($_ eq '</ref>') {
            # process what's in the buffer when we see the end tag
            my $rec = process_record($data);
            $records{$rec->{author}} = $rec;
            $data = '';
        }
    }

    print Dumper(\%records);

    sub process_record {
        my $rec = shift;
        my %col;
        ($col{author}) = $rec =~ m/<author>\s*([^<]*)(?=<)/g;
        ($col{year})   = $rec =~ m/<year>\s*([^<]*)(?=<)/g;
        ($col{source}) = $rec =~ m/<source>\s*([^<]*)(?=<)/g;
        ($col{id})     = $rec =~ m/<id>\s*([^<]*)(?=<)/g;
        ($col{title})  = $rec =~ m/<title>\s*([^<]*)(?=<)/g;
        my @keywords   = $rec =~ m/<key>\s*([^<]*)(?=<)/g;
        $col{keywords} = \@keywords;
        return \%col;
    }

    __DATA__
    <ref>
    <provnc>
    <aulist>
    <author> Bin Laden
    </aulist>
    <year>1990
    <source> Cambridge University Press, Cambridge UK, 1st edition
    <id>1
    <keywords>
    <key>terrorism
    <key>whatever
    </keywords>
    </provnc>
    <title> Terrorism
    </ref>
    <ref>
    <provnc>
    <aulist>
    <author> Sydney
    </aulist>
    <year>1990
    <source> Cambridge University Press, Cambridge UK, 1st edition
    <id>1
    <keywords>
    <key>nothing
    <key>whatever
    </keywords>
    </provnc>
    <title> Terrorism
    </ref>

And the output is as expected -

    $VAR1 = {
        'Bin Laden' => {
            'title' => 'Terrorism',
            'author' => 'Bin Laden',
            'keywords' => [
                'terrorism',
                'whatever'
            ],
            'id' => '1',
            'year' => '1990',
            'source' => 'Cambridge University Press, Cambridge UK, 1st edition '
        },
        'Sydney' => {
            'title' => 'Terrorism',
            'author' => 'Sydney',
            'keywords' => [
                'nothing',
                'whatever'
            ],
            'id' => '1',
            'year' => '1990',
            'source' => 'Cambridge University Press, Cambridge UK, 1st edition '
        }
    };
Re: Re: Searching data file by graff (Chancellor) on Nov 03, 2003 at 02:24 UTC
Actually, I think there's a slight problem with this design. The markup structure makes it clear that it is meant to handle refs with multiple authors, and when there is such a ref entry, your "process_record" sub will only return the first author -- then this single author will be the basis for testing if the record matches the given search. So if the name being searched for happens to be the second author in a record, that record won't be returned. You would need the hash element for "author" to be a reference to an array, and then search over the elements of that array, which makes it a lot more complicated than if you were reading a whole `<ref>...</ref>` element at each iteration (by setting $/ as I suggested above), and looking for $search anywhere within the `<aulist>` element.
Re: Re: Re: Searching data file by Roger (Parson) on Nov 03, 2003 at 02:44 UTC
Yes, you are right, there is a problem: my code does not pick out multiple authors. I omitted multiple authors out of laziness. Fixing the code is simple though, just modify it slightly to read multiple authors (same as multiple keys). `my @author = $rec =~ m/<author>\s*([^<]*)(?=<)/g; $col{author} = \@author;` How the hash structure returned by the subroutine is stored needs to be revised too, since there can be multiple authors. That should be a simple exercise.
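One way that exercise could go, once `author` holds an array reference, is to file each record under every one of its authors. This is only a sketch: the `@parsed` records below are hypothetical stand-ins for what `process_record` would return, and `%by_author` is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical parsed records, shaped like process_record() output
# after 'author' has been changed to hold an array reference.
my @parsed = (
    { author => [ 'Smith', 'Jones' ], title => 'Joint Work', year => '1990' },
    { author => ['Jones'],            title => 'Solo Work',  year => '1995' },
);

# Map each author name to the list of records that name appears on.
my %by_author;
for my $rec (@parsed) {
    push @{ $by_author{$_} }, $rec for @{ $rec->{author} };
}

# List the titles filed under each author.
for my $name ( sort keys %by_author ) {
    print "$name: ",
        join( '; ', map { $_->{title} } @{ $by_author{$name} } ), "\n";
}
```

Storing a list per author also avoids the earlier design's silent overwrite when one author has more than one record.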
Re: Re: Re: Re: Searching data file by Anonymous Monk on Nov 03, 2003 at 09:32 UTC