legend has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to implement some data filters on my text data. The filters are defined something like this:
(AND (OR AUTHOR=John PROFIT=90% AUTHOR=Matt PROFIT=80% ) PUBLISHER=OReilly )
I need to somehow convert the above filter into something sensible so that I can apply it to the following data:
Page 1
AUTHOR: John
PROFIT: 20%
PUBLISHER: TMH
BOOK: OPERATING SYSTEMS
Page 2
AUTHOR: John
PROFIT: 90%
PUBLISHER: OREILLY
BOOK: ALGORITHMS
Page 3
AUTHOR: Matt
PROFIT: 80%
PUBLISHER: TMH
BOOK: COMPUTER NETWORKS
Page 4
AUTHOR: Matt
PROFIT: 80%
PUBLISHER: OREILLY
BOOK: COMMUNICATION SYSTEMS
When I apply the filter to this data, I should get the records from Page 2 and Page 4. I thought of pushing the elements onto an array as they are encountered while scanning the filter, but I'm not sure how I would then use that array. Am I missing something easy here?

Replies are listed 'Best First'.
Re: Implementing a text filter on some dataset
by bobf (Monsignor) on Feb 27, 2008 at 05:09 UTC

    My first inclination would be to reformat the data into something that could be read by DBD::AnyData (or just stuff it into a db), then convert your filter into SQL. The former looks pretty easy given your example. The latter is probably quite a bit trickier.

    Without more information about how many filters (queries) you have, how complex and dynamic they are, how many times you're running them, etc., though, this may not be the best solution. I'm sure other monks will provide lots of other WTDI.
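
    A minimal sketch of the "convert your filter into SQL" half, assuming the filter uses explicit (AND ...) / (OR ...) groups and values without spaces. The helper names filter_to_sql and _sql_node are made up for illustration, and the column names are simply the lower-cased keys. It returns a WHERE clause with placeholders plus the bind values, so the values never get pasted into the SQL text.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the prefix filter into a WHERE clause with "?"
# placeholders and a list of bind values.
sub filter_to_sql {
    my ($filter) = @_;
    my @tokens = $filter =~ /\(|\)|[^\s()]+/g;   # parens and bare words
    my @binds;
    my $where = _sql_node( \@tokens, \@binds );
    die "unexpected text after filter\n" if @tokens;
    return ( $where, @binds );
}

# Recursive step: one group "(AND ...)" / "(OR ...)" or one KEY=value test.
sub _sql_node {
    my ( $tokens, $binds ) = @_;
    my $tok = shift @$tokens;
    die "unexpected end of filter\n" unless defined $tok;

    if ( $tok eq '(' ) {
        my $op = uc( shift(@$tokens) // '' );
        die "expected AND or OR, got '$op'\n"
            unless $op eq 'AND' || $op eq 'OR';
        my @parts;
        push @parts, _sql_node( $tokens, $binds )
            while @$tokens && $tokens->[0] ne ')';
        die "missing ')'\n" unless @$tokens;
        shift @$tokens;                          # drop the ')'
        die "empty $op group\n" unless @parts;
        return '(' . join( " $op ", @parts ) . ')';
    }

    my ( $key, $cmp, $value ) = $tok =~ /^(\w+)(!=|=)(.+)$/
        or die "bad condition '$tok'\n";
    push @$binds, $value;
    return lc($key) . ( $cmp eq '=' ? ' = ?' : ' <> ?' );
}

my ( $where, @binds ) = filter_to_sql(
    '(AND (OR (AND AUTHOR=John PROFIT=90%) (AND AUTHOR=Matt PROFIT=80%)) PUBLISHER=OReilly)'
);
print "WHERE $where\n";
print "binds: @binds\n";
```

    The clause and bind list could then go to DBI's prepare and execute against whatever table holds the pages. Note that the sample data spells the publisher OREILLY while the filter says OReilly, so either the data or the comparison would need case folding.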

      Well, for now, I'll just get it working with a two-level filter (two for OR and two for AND)... And thanks for the suggestion, I will look into it now...
Re: Implementing a text filter on some dataset
by BrowserUk (Patriarch) on Feb 27, 2008 at 09:31 UTC

    Cos the problem intrigued me :)

    #! perl -slw
    use strict;

    ## Convert query syntax to evalable Perl code
    sub buildQuery {
        local $_ = uc shift;
        while( m[ \( (?: AND | OR ) | != | = ]x ) {
            s<
                ( [^!=()\s]+ ) ( != | = ) ( [^()\s]+ )
            >{
                my $op = ( $2 eq '!=' ? 'ne' : 'eq' );
                "[\$\L$1\E $op '$3']";
            }xge;
            s<
                \( \s* ( AND ) \s+ ( \[ [^()]+ \] ) \s+ ( \[ [^()]+ \] ) \s* \)
            > {
                "[$2 && $3]"
            }xge;
            s<
                \( \s* ( OR ) \s+ ( \[ [^()]+ \] ) \s+ ( \[ [^()]+ \] ) \s* \)
            > {
                "[$2 || $3]"
            }xge;
        }
        tr/[]/()/;
        return $_;
    }

    ## Read data and UPPER case
    my $data = do{ local $/; uc <DATA> };

    ## Some variables
    my( $author, $profit, $publisher, $book );

    ## And a regex to populate them from each record
    my $re = qr[
        PAGE \s+ \d+ \s+
        AUTHOR: \s+ ( [^\n]+ ) (?{ $author = $^N }) \s+
        PROFIT: \s+ ( [^\n]+ ) (?{ $profit = $^N }) \s+
        PUBLISHER: \s+ ( [^\n]+ ) (?{ $publisher = $^N }) \s+
        BOOK: \s+ ( [^\n]+ ) (?{ $book = $^N }) \s+
    ]x;

    ## Convert the query NOTE: (AND ) syntax is required
    ## where example used implicit AND
    my $query = buildQuery( <<EOQ );
    (AND
        (OR
            (AND AUTHOR=John PROFIT=90% )
            (AND AUTHOR=Matt PROFIT=80% )
        )
        PUBLISHER=OReilly
    )
    EOQ

    print "\nQuery: $query\n";

    ## Test the condition and print the record if it matches
    ## For each record
    eval "$query" and print $1 while $data =~ m[ ( $re ) ]xg;

    ## Same again for another query
    ## Note != also accepted.
    my $query2 = buildQuery( <<EOQ );
    (OR
        (AND AUTHOR=John PROFIT!=90% )
        (AND AUTHOR=Matt PUBLISHER!=OReilly )
    )
    EOQ

    print "\nQuery: $query2\n";

    eval "$query2" and print $1 while $data =~ m[ ( $re ) ]xg;

    __DATA__
    Page 1
    AUTHOR: John
    PROFIT: 20%
    PUBLISHER: TMH
    BOOK: OPERATING SYSTEMS
    Page 2
    AUTHOR: John
    PROFIT: 90%
    PUBLISHER: OREILLY
    BOOK: ALGORITHMS
    Page 3
    AUTHOR: Matt
    PROFIT: 80%
    PUBLISHER: TMH
    BOOK: COMPUTER NETWORKS
    Page 4
    AUTHOR: Matt
    PROFIT: 80%
    PUBLISHER: OREILLY
    BOOK: COMMUNICATION SYSTEMS

    Outputs:

    [ 9:25:03.05]C:\test>670477

    Query: (((($author eq 'JOHN') && ($profit eq '90%')) || (($author eq 'MATT') && ($profit eq '80%'))) && ($publisher eq 'OREILLY'))

    PAGE 2
    AUTHOR: JOHN
    PROFIT: 90%
    PUBLISHER: OREILLY
    BOOK: ALGORITHMS

    PAGE 4
    AUTHOR: MATT
    PROFIT: 80%
    PUBLISHER: OREILLY
    BOOK: COMMUNICATION SYSTEMS

    Query: ((($author eq 'JOHN') && ($profit ne '90%')) || (($author eq 'MATT') && ($publisher ne 'OREILLY')))

    PAGE 1
    AUTHOR: JOHN
    PROFIT: 20%
    PUBLISHER: TMH
    BOOK: OPERATING SYSTEMS

    PAGE 3
    AUTHOR: MATT
    PROFIT: 80%
    PUBLISHER: TMH
    BOOK: COMPUTER NETWORKS

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Bravo! :)

      Another approach, assuming that we can "upgrade" the filter a little:

      $_=<<END_OF_FILTER;
      (AND
          (OR
              (AND
                  (COND AUTHOR=John)
                  (COND PROFIT=90%)
              )
              (AND
                  (COND AUTHOR=Matt)
                  (COND PROFIT=80%)
              )
          )
          (COND PUBLISHER=OReilly)
      )
      END_OF_FILTER

      %hash = (
          AUTHOR=> 'John',
          PROFIT=> '90%',
          PUBLISHER=> 'OReilly',
          BOOK=> 'OPERATING SYSTEMS'
      );

      sub AND {
          print "and debug: [@_]";
          for(@_) { return 0 if ! $_ }
          return 1
      }

      sub OR {
          print "or debug: [@_]";
          for(@_) { return 1 if $_ }
          return 0
      }

      sub COND {
          my ($key, $value)=@_;
          $ret = 0 | ($hash{$key} eq $value);
          print "cond debug: @_ -> $ret";
          return $ret
      }

      # add commas between ) and ( braces
      s/\)(\s*)\(/),$1(/g;

      # convert COND a little - I hope you do not have O'reilly name... :)
      s/\(COND (\w+)=([^)]+)\)/(COND '$1','$2')/g;

      # we have proper parens already, move it only
      s/\((AND|OR|COND)\b/$1(/g;

      # see how it looks now
      print $_;

      # and voila!
      print "ok" if eval;

      Update: The only remaining question is how to add the (COND ...) blocks to the original filter with one regexp. Search for '=' like this?

      s/\w+=\S+/(COND $&)/g

      Or can some values contain whitespace, in which case this regexp fails? As only legend knows the format of the conditions, the answer, and a good regexp, lies with legend :)
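
      If values can contain whitespace, one possible variant of that regexp is sketched here. It rests on an assumption only legend can confirm: a value runs until the next KEY= (or KEY!=) or the next parenthesis. The helper name add_cond is made up, and like the COND sub it only wraps '=' conditions.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each bare KEY=value in (COND ...). Assumption:
# a value ends just before the next "WORD=" / "WORD!=" or a parenthesis,
# so it may contain spaces.
sub add_cond {
    my ($filter) = @_;
    $filter =~ s{
        (\w+) = (                     # the key and the value that follows
            [^\s()]+                  # first word of the value
            (?: \s+ (?! \w+ !?= ) [^\s()]+ )*   # further words of the value
        )
    }{(COND $1=$2)}gx;
    return $filter;
}

print add_cond('(AND (OR AUTHOR=John Nash PROFIT=90% ) PUBLISHER=OReilly )'), "\n";
# prints: (AND (OR (COND AUTHOR=John Nash) (COND PROFIT=90%) ) (COND PUBLISHER=OReilly) )
```

      The negative lookahead is what stops "John Nash" from swallowing "PROFIT=90%". A value that itself looks like WORD= would still confuse it.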

Re: Implementing a text filter on some dataset
by grizzley (Chaplain) on Feb 27, 2008 at 08:40 UTC

    Maybe it would be worth writing a piece of code that converts this filter into something more Perlish, like this:

    (
        (
            ($hash{AUTHOR} eq 'John' and $hash{PROFIT} eq '90%')
            or
            ($hash{AUTHOR} eq 'Matt' and $hash{PROFIT} eq '80%')
        )
        and
        $hash{PUBLISHER} eq 'OReilly'
    )

    Then your filter is just a Perl condition, which you can apply to your input data stored in %hash like this:

    %hash=(
        AUTHOR=> 'John',
        PROFIT=> '20%',
        PUBLISHER=> 'TMH',
        BOOK=> 'OPERATING SYSTEMS'
    )

    BTW, it is not good that the two conditions (AUTHOR=John PROFIT=90%) are separated only by a space; they should be wrapped in (AND ...).
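
    To make that concrete, here is a rough sketch of getting each page into a hash and applying the condition to it. It assumes the data has one "Page N" line followed by "KEY: value" lines, and it compares case-insensitively because the sample data says OREILLY where the filter says OReilly. The names parse_records and matches are invented for the example.

```perl
use strict;
use warnings;

# Sample data in the layout assumed from the question.
my $text = <<'END_OF_DATA';
Page 1
AUTHOR: John
PROFIT: 20%
PUBLISHER: TMH
BOOK: OPERATING SYSTEMS
Page 2
AUTHOR: John
PROFIT: 90%
PUBLISHER: OREILLY
BOOK: ALGORITHMS
Page 3
AUTHOR: Matt
PROFIT: 80%
PUBLISHER: TMH
BOOK: COMPUTER NETWORKS
Page 4
AUTHOR: Matt
PROFIT: 80%
PUBLISHER: OREILLY
BOOK: COMMUNICATION SYSTEMS
END_OF_DATA

# Hypothetical helper: build one hash per page.
sub parse_records {
    my ($text) = @_;
    my @records;
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^\s*Page\s+(\d+)\s*$/i ) {
            push @records, { PAGE => $1 };        # start a new record
        }
        elsif ( @records && $line =~ /^\s*(\w+):\s*(.*?)\s*$/ ) {
            $records[-1]{ uc $1 } = $2;           # add a field to the current one
        }
    }
    return @records;
}

# The condition from above, hand-written, with lc() for case folding.
sub matches {
    my ($rec) = @_;
    my %hash = %$rec;
    my $ok =
        (
            ( lc( $hash{AUTHOR} ) eq 'john' && lc( $hash{PROFIT} ) eq '90%' )
            ||
            ( lc( $hash{AUTHOR} ) eq 'matt' && lc( $hash{PROFIT} ) eq '80%' )
        )
        && lc( $hash{PUBLISHER} ) eq 'oreilly';
    return $ok ? 1 : 0;
}

my @hits = grep { matches($_) } parse_records($text);
print "Page $_->{PAGE}: $_->{BOOK}\n" for @hits;
```

    With the sample data this selects Page 2 and Page 4, which is the result legend asked for.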

Re: Implementing a text filter on some dataset
by legend (Sexton) on Mar 02, 2008 at 21:01 UTC
    Works perfectly... Thanks a lot :) I want to use the same filter in code that I've already written for another format. Some lines in my code print the fields:
    print $db_title."\n";
    print $db_source."\n";
    print $db_length."\n";
    print $db_author."\n";
    print $db_body."\n";
    print $db_language."\n";
    print $db_subject."\n";
    print $db_datepub."\n";
    print $db_loaddate."\n";
    print "\n\n\n";
    Now I have to print all the fields only if the filter condition matches, but I can't work out how to use the filter with this data. Could I have some help here, please?
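
    One way this might be wired up, as a sketch only: keep the fields in a hash instead of separate $db_* scalars, and pass that hash to a filter code ref before printing. The field list is taken from the print statements above; format_if_match, the stand-in filter, and the sample values are all invented for illustration.

```perl
use strict;
use warnings;

# Field order copied from the print statements; the hash keys are the
# $db_* names without the prefix (an assumption about how to store them).
my @fields = qw(title source length author body language subject datepub loaddate);

# Hypothetical helper: return the text the prints would produce, or an
# empty string when the filter rejects the record.
sub format_if_match {
    my ( $filter, $db ) = @_;
    return '' unless $filter->($db);
    return join( '', map { ( $db->{$_} // '' ) . "\n" } @fields ) . "\n\n\n";
}

# Stand-in filter; in practice it would be built from the query text.
my $filter = sub {
    my ($db) = @_;
    return uc( $db->{author} // '' ) eq 'JOHN'
        && ( $db->{language} // '' ) ne 'French';
};

# Made-up record, only to show the call.
my %db = (
    title    => 'ALGORITHMS',
    author   => 'John',
    language => 'English',
);

print format_if_match( $filter, \%db );
```

    The point of the hash is that the filter can look fields up by name, which separate scalars only allow through eval or symbolic references.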
Re: Implementing a text filter on some dataset
by legend (Sexton) on Mar 04, 2008 at 22:52 UTC
    I forgot to say that the function buildQuery doesn't work when a value contains a space. For example:
    my $query = buildQuery( <<EOQ );
    (OR
        (OR AUTHOR=John Nash PROFIT=90% )
        (OR AUTHOR=Matt PROFIT=80% )
    )
    EOQ
    print "\nQuery: $query\n";
    This does not work because John Nash has a space in between. Can someone help me get it to work please?
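
    buildQuery splits conditions on whitespace, so "John Nash" ends up as two tokens. A possible alternative, sketched here under the assumption that a value runs until the next KEY=, KEY!=, or parenthesis, is to tokenize with a lookahead and build a code ref instead of a string for eval. The names compile_filter and _parse are hypothetical, the record is taken to be a hash keyed by upper-case field name, and comparisons are upper-cased as in BrowserUk's version.

```perl
use strict;
use warnings;

# Hypothetical alternative to buildQuery: returns a code ref that takes a
# hash ref of fields and says whether the record passes the filter.
sub compile_filter {
    my ($filter) = @_;
    my @tokens = $filter =~ m{
        ( [()]                                 # a parenthesis
        | \w+ !?= [^\s()]+                     # KEY=value or KEY!=value
          (?: \s+ (?! \w+ !?= ) [^\s()]+ )*    # further words of the value
        | [^\s()]+                             # AND, OR, or a stray word
        )
    }gx;
    my $test = _parse( \@tokens );
    die "unexpected text after filter\n" if @tokens;
    return $test;
}

# Recursive step: one "(AND ...)" / "(OR ...)" group or one condition.
sub _parse {
    my ($tokens) = @_;
    my $tok = shift @$tokens;
    die "unexpected end of filter\n" unless defined $tok;

    if ( $tok eq '(' ) {
        my $op = uc( shift(@$tokens) // '' );
        die "expected AND or OR after '(', got '$op'\n"
            unless $op eq 'AND' || $op eq 'OR';
        my @kids;
        push @kids, _parse($tokens) while @$tokens && $tokens->[0] ne ')';
        die "missing ')'\n" unless @$tokens;
        shift @$tokens;                        # drop the ')'
        return $op eq 'AND'
            ? sub { my ($r) = @_; for my $k (@kids) { return 0 unless $k->($r) } return 1 }
            : sub { my ($r) = @_; for my $k (@kids) { return 1 if $k->($r) } return 0 };
    }

    my ( $key, $cmp, $value ) = $tok =~ /^(\w+)(!?=)(.+)$/s
        or die "bad condition '$tok'\n";
    ( $key, $value ) = ( uc $key, uc $value );
    return sub {
        my $got = uc( $_[0]{$key} // '' );
        return $cmp eq '=' ? $got eq $value : $got ne $value;
    };
}

my $filter = compile_filter(
    '(OR (OR AUTHOR=John Nash PROFIT=90% ) (OR AUTHOR=Matt PROFIT=80% ) )'
);
for my $rec ( { AUTHOR => 'John Nash', PROFIT => '20%' },
              { AUTHOR => 'John',      PROFIT => '20%' } ) {
    print "$rec->{AUTHOR}: ", ( $filter->($rec) ? "match" : "no match" ), "\n";
}
```

    Unlike the eval version, groups here may hold any number of operands, and no value from the query is ever executed as Perl code.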