Re: My first package - need help getting started
by hv (Prior) on Feb 27, 2003 at 02:52 UTC
This may seem like a red herring, but the first thing that worries me is your choice of name, 'DBParser'. The thing about OO is that it is about objects, so the first question to ask is "what is the object here?". My first guess, looking at the data, is that each record represents a person (or perhaps something more general - an "entity", perhaps), in which case I'd be inclined to call the object "Person" (though I'd probably avoid namespace problems by using a prefix that represented the company name, or perhaps my name or the project name, depending on the scope). An alternative approach, if this record format can be used to describe a variety of things, is to name the class instead after the name of the record format; in that case, it might be worth having subclasses for each of the major different types of thing that the format can represent.
Now, what does the data in a typical file look like: is it multiple records each starting with the 'Key' attribute? If so, I could imagine wanting to write the code like:
use Person;

for my $person (Person->parse_from_file('/tmp/somefile')) {
    next unless $person->surname eq 'Region';
    $person->tel('(123) 456-7890');
    print $person->text;
}
Let me be clear: this is just how I like to write my code, and other people (including yourself) will doubtless have different prejudices. The above code assumes that Person::parse_from_file() knows how to read a sequence of records from a file, turn each one into a "Person" object, and return the resulting list.
It also assumes that these objects are opaque, so that all access is via methods. You can choose instead to make them transparent hashrefs with documented keys, but then (for example) you always need to do the work of splitting up the 'Full Name' key so that the 'Surname' key will be there in case someone looks at it, and it probably means you can't allow modification both by way of the 'Surname' field and directly in the 'Full Name' field, because by the time you need to write the record back out you won't know which value is correct.
I tend to like what are sometimes called "polymorphic get/set accessors", which means that you can use the same method either without arguments to fetch the value, or with an argument to set it to a new value. Some others prefer to split such functionality into two methods, eg tel() and set_tel().
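For what it's worth, here is a minimal sketch of the two styles, assuming a hash-based object and using a 'tel' field as in the loop above (only one style would be used in a real class):

package Person;

# Style 1: polymorphic get/set -- no arguments fetches, one argument sets.
sub tel {
    my $self = shift;
    $self->{tel} = shift if @_;
    return $self->{tel};
}

# Style 2 (an alternative, not to be combined with the above):
# a plain reader plus a separate writer.
# sub tel     { $_[0]->{tel} }
# sub set_tel { $_[0]->{tel} = $_[1] }

With style 1, $person->tel reads the value and $person->tel('(123) 456-7890') sets it, which is what the loop above assumes.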
I'm sure there are many other aspects worth talking about, but these are just some initial thoughts.
Hugo
Hugo,
Thank you for your valuable input.
I see that my lack of OO understanding has been blatantly displayed. I also seem to have used keywords that were clear to me, but not to everyone. What I am trying to accomplish is the following:
- Turn a record (bunch of lines) into a complex data structure that can be treated as a single entity.
- Have the ability to manipulate that complex data structure
- Access that complex data structure for printing in the same ugly format that I created it with.
This appears to be what you have gleaned from my poor attempt at explaining this. As far as I am concerned, I do not have a preference on how the code should look, as I am completely inexperienced at this. I appreciate the information, but I really do not understand how to code the opaque objects as you suggest. I know that the full key will always be static, even if the broken-out pieces change, as it will be printed externally. If you could show me some code to illustrate this, I would be very appreciative. If not, what you have already done is appreciated.
You do not have to use my data to create the opaque object - just show me a template to see the methodology. I am a fairly adept student.
Cheers - L~R
Ok, let's assume that the opaque object is implemented internally as a hashref, and that the fullname has a simple format of "surname, initials". Here's a simplistic approach:
package Person;

# polymorphic get/set: no arguments fetches, one argument sets
sub fullname {
    my $self = shift;
    if (@_) {
        $self->{fullname} = shift;
    }
    return $self->{fullname};
}

# derived accessors: both read and write through fullname()
sub initials {
    my $self = shift;
    if (@_) {
        $self->fullname(join ', ', $self->surname, shift);
    }
    return (split ', ', $self->fullname, 2)[1];
}

sub surname {
    my $self = shift;
    if (@_) {
        $self->fullname(join ', ', shift, $self->initials);
    }
    return (split ', ', $self->fullname, 2)[0];
}
In practice, I'd write it a bit differently: I'd probably have many methods very similar to fullname(), and might well generate them rather than write each one out explicitly. Also, I'd probably cache the derived information like surname and initials, to avoid recalculating them each time, in which case I'd need to be careful to decache that information when the source (fullname in this case) changed.
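As a rough sketch of that caching idea, keeping the same hashref layout as above (the cache slots here are just illustrative):

sub fullname {
    my $self = shift;
    if (@_) {
        $self->{fullname} = shift;
        # decache derived values whenever the source changes
        delete @$self{qw(surname initials)};
    }
    return $self->{fullname};
}

sub surname {
    my $self = shift;
    if (@_) {
        $self->fullname(join ', ', shift, $self->initials);
    }
    $self->{surname} = (split ', ', $self->fullname, 2)[0]
        unless defined $self->{surname};
    return $self->{surname};
}

The initials() method would mirror surname(), caching element [1] of the split instead.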
I'm surprised that you don't want the module to parse the data for you, since that seems to be a chunk of code that you'd otherwise need to repeat everywhere you deal with these records. But likely I've misunderstood what you're trying to do.
I guess the most important thing, which I should have said before, is that documentation is the key, particularly in Perl: the docs for your class will say how you're allowed to use the object and what you're allowed to assume about it. And in general, anything the docs don't say, you are not allowed to do or assume when using the class or its objects in other code.
Hugo
Re: My first package - need help getting started
by djantzen (Priest) on Feb 27, 2003 at 02:08 UTC
This seems like a reasonable place for a module. One thing to note is that in your sample interface you're basically splitting the logic between the module and the calling code. The fact that the caller controls opening and reading the file, pulling out a hash for each entry whose structure the caller must know in advance in order to read it, indicates that this isn't a complete modularization of responsibilities.
To my mind, the clearest way to fix these issues is to go down the OO path, in which an instance of DBParser opens a file to read, controls iteration internally, and returns results that match your search criteria for you to print from the calling context or to pass to another module specifically for formatting output. Doing this gives a clean interface, separation of duties, and the ability to create further refined subclasses of both the parsing and printing components.
The difficult part in doing this will be specifying the search criteria since the data is pretty hairy, so it would be good to start with a review of all the ways current scripts access it, and see if there's a method to the madness that you can tease out and formalize.
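Purely to illustrate the shape of that interface (the class and method names below are invented, not an existing API), the calling code might end up looking like:

use DBParser;

my $parser = DBParser->new(file => '/tmp/export.dat');

# The module owns the file handling and the iteration; the caller only
# describes what it wants back and decides how to present it.
while (my $record = $parser->next_match('Full Name' => qr/Region/)) {
    print $record->as_text;
}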
"The dead do not recognize context" -- Kai, Lexx
fever,
After reading a few of these replies, I realize how much I really don't know about what I am getting into. I do not want the module to do the parsing, I just want it to create an object that I can then manipulate. Each program's needs are going to be different. In my very fictitious/contrived example I obviously used the wrong keyword, DBParser. It is really supposed to take a record and build an object. You have given me some food for thought, as have others.
Thank you - L~R
Re: My first package - need help getting started
by pg (Canon) on Feb 27, 2003 at 03:18 UTC
Nice thinking; here are some thoughts I have.
I see three classes here:
- Parser
- Filter
- Formatter
The data would flow in this direction: Parser => Filter => Formatter.
- The Parser takes a stream of characters and parses it into structured data.
The Parser would have methods that allow you to provide the input, which is the entity to be processed. It might be a file, or it might be a string...
The Parser would parse the input into records (which could be the same as lines), and each line into fields. You would allow the user to specify some criteria that define how the records would be extracted, and then how to separate each record into fields. Those criteria might be regexps.
For example, if we look at the sample data you give, you might want to make each line into a record, and within each record the part before ':' is one field and the part after is another field.
The Parser should also have methods that allow you to fetch records and fields; these would be used by the Filter.
- The Filter would accept the structured data coming out of the Parser.
You would allow the user to define criteria for what gets thrown away and what can pass through the Filter. Again, regexps might be a good fit here.
The Filter does not modify the structure of the input data, but the number of output records could be less than the number of input records if some records are thrown away by the Filter.
- The Formatter would take the structured data and format it back into a stream. Of course, the stream is formatted and well presented.
A good way is to allow the user to define a callback function; that callback would format the records, not your module, though your module might provide a default format method if one is not provided by the user. (A rough sketch of this pipeline follows below.)
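Here is a rough sketch of how those three pieces might hang together; every class, method, and parameter name below is invented for illustration:

use Parser;
use Filter;
use Formatter;

# Parser: raw stream -> records made of key/value fields
my $parser = Parser->new(
    file         => 'export.dat',
    record_split => qr/\n(?=\S)/,   # a new record starts in column 0
    field_split  => qr/:\s*/,       # "Key: value" lines
);

# Filter: pass only the records we care about
my $filter = Filter->new(pass => sub { $_[0]->{type} eq 'UR' });

# Formatter: records -> text, via a user-supplied callback
my $formatter = Formatter->new(
    callback => sub {
        my $rec = shift;
        return join '', map { "$_: $rec->{$_}\n" } keys %$rec;
    },
);

print $formatter->format($_) for $filter->apply($parser->records);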
I am thinking it would be really nice if you found a way to wrap around those well-known HTML parsers and XML parsers, and make them available to your Filter.
One thing you may want to do is to have a generic Filter class as the root, with some generic methods defined. Based on this, you could then have some more specific Filters; for example, you may have one Filter that understands the output from a certain XML parser.
You said that you didn't want someone to do it for you, but only want some ideas. The fact is nobody can do this for you ;-), quickly and well.
;-) A lot of the time you will see more than one design fly, and each of them is good. There is no black-and-white answer, and this is why computer science is both science and art.
I agree that you can start with the filter as part of your program instead of a separate module, and later, if you see the functionality needs to be reused, abstract/extract a class out of your existing code.
Traditional software engineering requires you to have everything laid out at the beginning, in the design phase, and there is only one design phase. Modern software engineering allows you to create your software cycle by cycle; each cycle is a whole traditional software life cycle and has its own design phase. In each new cycle, new functionality is added and the design is modified in a constructive way.
This change of methodology came about mainly because:
- People found there is no way the traditional methodology would work for big systems/projects. It is simply impossible for people to get everything straight and right once and for all.
- From a business view, companies sometimes want to be first in the market. They have to prototype things, make their products available quickly, and worry about more functionality later.
For sure, I would like to be one of the people doing code review for you. By doing that, we can learn from each other.
Re: My first package - need help getting started
by zengargoyle (Deacon) on Feb 27, 2003 at 04:24 UTC
yes, an object for your chunk-o-data. but if your stream-o-data isn't likely to change i would say no object for the parser.
just have your object's creation method take a whole chunk-o-data.
package ChunkOData;

sub from {
    my ($class, $chunk) = @_;
    $chunk =~ s/\n\t//g;          # continuation lines are easy
    my %self;
    # parse $chunk like you already know how
    # shove it into %self
    return bless \%self, $class;
}

# write some accessors
# write some common useful junk

package main;

local $/ = '';                    # blank-line separated chunks
while (my $chunk_text = <>) {
    my $chunk = ChunkOData->from($chunk_text);
    next unless $chunk->type eq 'UR';
    $chunk->owner('me');
    if ($chunk->is_a_certain_type) {
        $chunk->do_some_standard_thing;
        $chunk->do_something_else($with_my_info);
        subroutines_are_good($chunk);
    }
    $chunk->print;
}
if your stream-o-data is blank-line separated (or some other $/ -able format) this is a simple way to get started.
you might also use one of the Order-keeping Hash modules from the CPAN in an object for your key field. then you could do something like:
my $key = $chunk->key;
next unless $key->{OU} eq 'VALUE1';
my $otherkey = $chunk->key_as_string;   # X=foo/Y=bar/..
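for example, Tie::IxHash from CPAN keeps hash keys in insertion order, so the key field could be parsed into something that supports both hash-style lookup and faithful round-tripping. a sketch (the parse_key routine and the shortened sample key string are just for illustration):

use Tie::IxHash;

# turn "/C=US/A=BOGUS/OU=VALUE1/..." into an insertion-ordered hash
sub parse_key {
    my $key_text = shift;
    tie my %key, 'Tie::IxHash';
    for my $pair (grep { length } split m{/}, $key_text) {
        my ($k, $v) = split /=/, $pair, 2;
        $key{$k} = $v;
    }
    return \%key;
}

my $key = parse_key('/C=US/A=BOGUS/OU=VALUE1/S=Region/');
print "OU is $key->{OU}\n";
print join('/', map { "$_=$key->{$_}" } keys %$key), "\n";   # original order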
++zengargoyle,
This is a great start - but I want to make sure I grok it before I try to use it, so I may have more questions.
Thanks a million!
Cheers - L~R
Re: My first package - need help getting started
by toma (Vicar) on Feb 27, 2003 at 04:29 UTC
Your data appears to be LDAP data. I searched for LDAP on CPAN and found 179 modules. Probably you don't need to write any new objects if you can use a few of these modules.
Your LDAP data appears to be in LDIF format, which is covered in RFC 2849. There is Net::LDAP::LDIF, which may do exactly what you need, which is to turn LDIF text into a Perl LDAP object.
It should work perfectly the first time! - toma
toma,
Thanks, but no dice. It is a flat file export of a very proprietary database that has no public APIs. I will take a look at your references to see if they provide any insight into my own dilemma though.
Cheers - L~R
use strict;
use warnings;
use diagnostics;
use Data::Dumper;
use Net::LDAP::LDIF;

my $ldif = Net::LDAP::LDIF->new( "file.ldif", "r", onerror => undef );
while ( not $ldif->eof() ) {
    my $entry = $ldif->read_entry();
    if ( $ldif->error() ) {
        print "Error msg: ", $ldif->error(), "\n";
        print "Error lines:\n", $ldif->error_lines(), "\n";
    }
    else {
        print Dumper($entry);    # dump the entry we just read
    }
}
$ldif->done();
Here is the modified input file:
dn: /C=US/A=BOGUS/P=ABC+DEF/O=CONN/OU=VALUE1/S=Region/G=Limbic/I=_/
type: UR
flags: DIRM ADMINM ADIC ADIM
Alias-2: wL_Region
Alias-3: Limbic Region
Alias-4: Limbic._.Region@nowhere.com
Alias-5: Limbic._.Region
Alias-6: Limbic.Region@nowhere.com
Alias-7: ORG
Alias-8: CMP
Alias-10: O=ORG/OU=Some Big Division/CN=Limbic _. Region
Alias-11: Region
Alias-12: Region, Limbic _
Alias-14: Limbic _. Region at WT-CONN
Alias-15: ex:/o=BLANK/ou=ORG/cn=Recipients/cn=Mailboxes/cn=LRegion
Alias-16: WT:Limbic_Region
Alias-17: SMTP:Limbic._.Region@nowhere.com
Alias-18: /o=A.B.C.D./ou=Vermont Ave/cn=Recipients/cn=wt/cn=Limbic_Region
Full Name: Region, Limbic _.
Post Office: WT-CONN
Description: 999-555-1212 Some Big Company
Tel: 999-555-1212
Dept: Some Big Division
Location: EMWS
Address: 123 nowhere street
City: everywhere
State: MD
Zip Code: 20874
Building: BLAH
Building-Code: MD-ABC
owner: CONN
It should work perfectly the first time! - toma
That sample input looks almost exactly like LDAP. It may be an offshoot of DAP. Net::LDAP can almost access that data; it just needs a little more coaxing. Well, ok, not Net::LDAP, but the LDIF modules included with it. I really think this route should be investigated thoroughly.
Re: My first package - need help getting started
by jonadab (Parson) on Feb 27, 2003 at 09:14 UTC
I have a couple of questions and, depending on your answers, a suggestion for how to simplify the problem.
First, you said that the only thing guaranteed to be unique was the key, but you were talking about uniqueness among all the records. In your example, the field names are all unique within the record. Is that the case for every record? If so, it seems to me that a record can be conveniently represented as a hash.
Second, it sounds to me from the description, though you don't really expressly say this, that you generally only need to look at one record at a time.
If I'm understanding right here, then creating an object per se may be an unnecessary complication. It sounds to me like all you need is two functions: one that takes an open filehandle (as a glob maybe), reads off the next record, and returns a reference to a hash, and one that takes a reference to a hash and returns a string. Depending on what you need to do, another routine or several might be in order for testing records (e.g., a routine that takes a hashref and a string and returns the number of Alias fields in the hash whose values match the string).
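A minimal sketch of those two functions, assuming records are blank-line separated and fields look like "Key: value" (both sub names are placeholders):

# read the next record from an open filehandle, return a hashref
# (or undef at end of file); note that a plain hash does not preserve
# the original field order
sub read_record {
    my $fh = shift;
    local $/ = '';                    # paragraph mode: blank-line separated
    defined(my $chunk = <$fh>) or return;
    my %record;
    for my $line (split /\n/, $chunk) {
        my ($field, $value) = split /:\s*/, $line, 2;
        $record{$field} = $value;
    }
    return \%record;
}

# turn a record hashref back into "Key: value" text
sub record_to_string {
    my $record = shift;
    return join '', map { "$_: $record->{$_}\n" } keys %$record;
}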
I know it's heresy to some to suggest not using OO where it's possible to use OO, but it just seems unnecessary here, to me.
The only thing that makes me think I might be wrong, and that OO might in fact be a Good Idea, is that you didn't show what delimits records in the files you're reading. If there's no delimiter, then you are going to be reading until you get the key for the next record, which you then have to save for when you read that record. It is of course possible to do this without real OO, but it's awkward, since it involves a persistent variable (the one-line buffer) that needs to be associated with the specific file in question. If you never have more than one of these files open at the same time you could get by with a magic global ($main::MY_DB_PARSING_PERSISTENT_LINE_BUFFER or whatnot), but that's a kludge, and if you ever need to work through more than one of these files at the same time it will break. It is possible to get around that too, by using the filehandle as a key into a magic global hash, but now we're doing something arguably almost as complex as OO, so I'm not sure this really saves anything.
But it is an option to consider. If your records are delimited by some magic marker in the files (e.g., a blank line), then this problem goes away, and you can just have a couple of routines, as I said.
for(unpack("C*",'GGGG?GGGG?O__\?WccW?{GCw?Wcc{?Wcc~?Wcc{?~cc'
.'W?')){$j=$_-63;++$a;for$p(0..7){$h[$p][$a]=$j%2;$j/=2}}for$
p(0..7){for$a(1..45){$_=($h[$p-1][$a])?'#':' ';print}print$/}
Re: My first package - need help getting started
by tachyon (Chancellor) on Feb 27, 2003 at 10:54 UTC
Re: My first package - need help getting started
by zengargoyle (Deacon) on Feb 27, 2003 at 17:05 UTC
another non-OO way of doing things popped into my head. there's a module for processing NetFlow records ( module CFlow out of the flow-tools package, not on CPAN ) that does things like this:
sub match_func {
    return 0 unless $bytes > 5000;
    return 0 unless $src_port == 80;
    # do_something with matched
    return 1;
}

CFlow::loop( \&match_func, $filehandle );
print "matched $CFlow::match_count records\n";
since you generally work with a single record, if the fields of the record are unique... forget all of the OO stuff and use globals. the 'loop' routine takes a coderef to be run after each record is parsed (and shoved into the global variables) and a filehandle (if the filehandle is undef, read STDIN; if the filehandle is a string, open and use that file). the coderef returns 0 if the record wasn't interesting, else it does whatever and returns 1 (so the module can keep track of how many records matched).
while not OO, it does do an excellent job of hiding the details from the user, and eliminates all of the dereferencing ( $chunk->type() just becomes $type ), which makes it easy to write quick one-off scripts.
sub fix_building {
    return 0 unless $building eq 'FOO';
    $building = 'BAR';     # assignment, not comparison
    print_rec();
    return 1;
}

DBParserThingy::loop( \&fix_building );
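for completeness, here's a guess at how the module side of such a loop routine might look for this record format. none of this is the CFlow module's real API; the package name, the exported globals, and the blank-line record format are all assumptions:

package DBParserThingy;

use strict;
use warnings;
use Exporter 'import';

# globals the callback reads; exported so the caller can use them bare
our @EXPORT_OK = qw($type $building $full_name %fields $match_count);
our ($type, $building, $full_name, %fields);
our $match_count;

sub loop {
    my ($wanted, $fh) = @_;
    if (!defined $fh) {
        $fh = \*STDIN;                                   # undef means read STDIN
    }
    elsif (!ref $fh) {
        open my $in, '<', $fh or die "open $fh: $!";     # a string means a filename
        $fh = $in;
    }
    local $/ = '';                                       # blank-line separated records
    $match_count = 0;
    while (my $chunk = <$fh>) {
        $chunk =~ s/\n\t//g;                             # join continuation lines
        %fields = map  { split /:\s*/, $_, 2 }
                  grep { /:/ }
                  split /\n/, $chunk;
        ($type, $building, $full_name) =
            @fields{'type', 'Building', 'Full Name'};
        $match_count++ if $wanted->();
    }
    return $match_count;
}

1;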