blahblah has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

This problem has been bothering me for some time. It feels like it should be possible, but I don't know where to begin.
I have some text data - a csv file for example. I have a subroutine that can parse csv data, say csvparse(). How can I make csvparse() smart so that it can accept csv data from a string passed as a scalar, from a filehandle, or from a file without slurping in the entire file (except in the scalar case, of course) and without making three separate subs - one for each data type? Is this possible?
Here is some stumbling in the dark...
#!/usr/bin/perl -w
use strict;
use CGI;

{
    # case 1 - a file
    my $file = "csvfile.csv";
    my %parseddata = csvparse($file);
}

{
    # case 2 - a filehandle
    my $uploadedfile = param('uploadfile');
    my %parseddata = csvparse($uploadedfile);
}

{
    # case 3 - a scalar
    my $csvdata = 'first,middle,last,phone,email';
    my %parseddata = csvparse($csvdata);
}

# so how can I avoid having three separate csv parsing subs?
# Once it can get at the data, the parsing is the same for all 3 cases!

sub csvparse {
    my $data = $_[0];
    if (ref($data) eq 'SCALAR') {
        for (split(/\n/,$data)) {
            # parse...
        }
    }
    elsif (-f "$data") {
        open(DATA, "$data") or die("Noooo!\n");
        while (<DATA>) {
            # parse...
        }
        close(DATA);
    }
    elsif () {
        # detect filehandle???
    }
}

yuck.
Thanks

Re: Handling different data types of the same data
by davido (Cardinal) on May 24, 2004 at 06:47 UTC
    Use a two-sub strategy. The first sub extracts the strings to be parsed from whatever it is given: a filename, a ref to a scalar of data, a ref to an array of data, or a ref to a glob filehandle. Within that sub, check ref($_[0]) and dispatch as follows:
    • ref() returns false: The value should be used as a filename. Open the filename and parse the file.
    • ref() returns SCALAR: The value should be used as a reference to a string to be parsed.
    • ref() returns ARRAY: The value should be used as a ref to an array of strings to be parsed.
    • ref() returns GLOB: The value should be used as a filehandle. Read from the filehandle and parse the file.

    From any of those sources, the first sub will send string(s) to the parsing engine one at a time, which is a second sub. That parsing engine needs to only understand how to parse strings. It doesn't care whether it got that string from a filehandle or a scalar, because that's all handled by the invoking sub.
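The dispatch described above can be sketched roughly as follows. This is only an illustration under stated assumptions: parse_line() is a hypothetical stand-in for the real parsing engine (a naive split that ignores quoted fields), and returning an array ref of rows is a choice made for the example, not something the original code specifies.

```perl
use strict;
use warnings;

# Hypothetical parsing engine: it only ever sees one string at a time.
# A naive split stands in for real CSV parsing (no quoted fields).
sub parse_line {
    my ($line) = @_;
    return [ split /,/, $line, -1 ];
}

# Front end: works out what kind of source it was given and feeds
# the engine one line at a time.
sub csvparse {
    my ($src) = @_;
    my $type = ref $src;
    my @rows;

    if ( !$type ) {
        # Plain string: treat it as a filename and recurse with the handle.
        open my $fh, '<', $src or die "Cannot open '$src': $!";
        return csvparse($fh);
    }
    elsif ( $type eq 'SCALAR' ) {
        # Reference to a string holding the data.
        push @rows, parse_line($_) for split /\n/, $$src;
    }
    elsif ( $type eq 'ARRAY' ) {
        # Reference to an array of lines.
        for my $line (@$src) {
            chomp( my $copy = $line );    # leave the caller's data untouched
            push @rows, parse_line($copy);
        }
    }
    elsif ( $type eq 'GLOB' ) {
        # Filehandle: read line by line, never slurping the whole file.
        while ( defined( my $line = <$src> ) ) {
            chomp $line;
            push @rows, parse_line($line);
        }
    }
    else {
        die "Unsupported source type: $type";
    }
    return \@rows;
}
```

A caller would then write csvparse('csvfile.csv'), csvparse(\$csvdata), csvparse(\@lines) or csvparse($fh), and the parsing engine never needs to know which one it was.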


    Dave

Re: Handling different data types of the same data
by Zaxo (Archbishop) on May 24, 2004 at 06:48 UTC

    I'll start with a warning that you need to untaint this data carefully. This appears to be part of some CGI application, so that goes double. Also, the *DATA handle is reserved for data at the tail of your script, after a __END__ or __DATA__ or Ctrl-Z.
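As a rough illustration of the untainting step, a filename coming from a form can be passed through a restrictive regex, since under taint mode (-T) a value captured by a regex group is considered clean. The helper name untaint_filename() and the allowed character set are assumptions made for this sketch; what counts as an acceptable filename depends on the application.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an untainted copy of $name when it looks
# like a plain filename (word characters, dots and hyphens only, so no
# path separators), and undef otherwise.
sub untaint_filename {
    my ($name) = @_;
    return undef unless defined $name;
    # \A and \z anchor the whole string; a trailing newline is rejected.
    return $name =~ /\A(\w[\w.-]*)\z/ ? $1 : undef;
}
```

Anything that comes back undef should be refused rather than repaired.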

    You have an error in if (ref($data) eq 'SCALAR'). That asks if $data is a reference to a scalar, but then you treat $data as a string.

    This is a neat problem to solve in perl 5.8+ because you can open a string as a file. The open function is able to take either a handle or a filename on its own. The difficulty with your puzzle is to distinguish between a string that is a filename and one that is csv data. It looks easy to, say, look for commas with a regex or index but that discounts the possibility of unexpected filenames.
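For anyone who has not seen the string-as-file trick: passing a reference to a scalar where open expects the filename gives an in-memory filehandle. A minimal sketch follows; the variable names and sample data are made up for the example.

```perl
use strict;
use warnings;

my $csvdata = "first,middle,last\nJohn,Q,Public\n";

# A reference to the scalar as the third argument of open yields a
# read handle over the string's contents (Perl 5.8 and later).
open my $fh, '<', \$csvdata or die "Cannot open in-memory handle: $!";

my @lines;
while ( defined( my $line = <$fh> ) ) {
    chomp $line;
    push @lines, $line;    # the real parsing would happen here
}
close $fh;
```

With that, the string case and the filehandle case can share the same read loop.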

    How about taking the calling convention that a reference to a scalar is data and a string is a filename? That gives you what you seem to be writing towards.

    You can detect a reference to a file handle with ref($foo) eq 'GLOB'. Lexical handles will be of that type. You could also insist that global handles be passed by reference. There is a lot of dwimmerie in dealing with filehandles, so testing is much to be desired for your sub.
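A small sketch of that check, with a hypothetical helper name is_handle_ref(). It shows why global handles need to be passed by reference: a bare glob such as *STDIN is not a reference, so ref() returns an empty string for it.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the argument is a reference to a glob,
# which covers lexical handles and global handles passed as \*NAME.
sub is_handle_ref {
    my ($thing) = @_;
    return ref($thing) eq 'GLOB';
}

my $text = "a,b\n";
open my $lexical, '<', \$text or die "Cannot open in-memory handle: $!";

my $lexical_ok  = is_handle_ref($lexical);       # lexical handle: GLOB ref
my $global_ref  = is_handle_ref(\*STDIN);        # global handle by reference
my $bare_glob   = is_handle_ref(*STDIN);         # bare glob: not a reference
my $plain_name  = is_handle_ref('csvfile.csv');  # ordinary string
```

Here the first two checks come out true and the last two false, which matches the convention that a plain string is a filename.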

    After Compline,
    Zaxo

Re: Handling different data types of the same data
by adrianh (Chancellor) on May 24, 2004 at 07:35 UTC
    so how can I avoid having three separate csv parsing subs?

    One way would be to build an iterator for each different input type and pass that to the parsing routine. Something like this:

    use strict;
    use warnings;
    use Carp;
    use Scalar::Util qw( openhandle );

    # Iterator over an already-open filehandle; returns one chomped row per call.
    sub filehandle_iterator {
        my $fh = shift;
        return unless openhandle( $fh );
        return sub {
            my $row = <$fh>;
            chomp $row if defined($row);
            return $row;
        };
    }

    # Iterator over a named file; returns nothing if the file cannot be opened.
    sub file_iterator {
        my $filename = shift;
        no warnings; # to avoid warnings about filenames with \n in
        return unless open my $fh, '<', $filename;
        return filehandle_iterator( $fh );
    }

    # Iterator over the lines of a string.
    sub string_iterator {
        my $csv_string = shift;
        my @lines = split( /\n/, $csv_string );
        return sub { shift @lines };
    }

    sub parse_csv {
        my $csv_input = shift;
        my $row_iterator = file_iterator( $csv_input )
            || filehandle_iterator( $csv_input )
            || string_iterator( $csv_input )
            || croak 'could not find iterator';
        # Test with defined() so a blank row or a row of "0" does not end the loop early.
        while ( defined( my $row = $row_iterator->() ) ) {
            # ... do stuff with row ...
            print "> $row <\n";
        }
    }