aral has asked for the wisdom of the Perl Monks concerning the following question:

I have read numerous articles that detail how it is not possible to have a PEEK at STDIN (if it is linked to a pipe). The typical suggestion - which I would implement, given different circumstances - is to use a read-ahead buffer and redo a loop iteration, reusing the buffer.

My problem is now that I would like to use an XML parser (XML::Twig) if, and only if, the first line of the input contains a valid XML header.

So a read-ahead buffer is not doing me much good if XML::Twig does not accept an incomplete XML file (missing the header) in its "parsefile" function.

Basically, I was hoping to do it like this:
    my $header = peek(<>);
    if (is_xml($header)) {
        my $t = XML::Twig->new();
        $t->parse(\*STDIN);
    }
    else {
        # do something else with <>
    }

Any suggestions on how to tackle this?
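
For reference, the read-ahead buffer idea from the question might look roughly like the sketch below for the non-XML branch. It uses only core Perl. The is_xml helper is a hypothetical implementation (the question only names it), the in-memory handle stands in for piped STDIN, and a "pending" variable replaces the redo-based loop mentioned above. It does not solve the XML case, since the parser would still need the buffered line pushed back onto the handle.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line begins with an XML declaration.
# It does not handle a byte-order mark and does not validate the document.
sub is_xml {
    my ($line) = @_;
    return (defined $line && $line =~ /\A<\?xml\s+version\s*=/) ? 1 : 0;
}

# In-memory stand-in for piped STDIN (assumption, for demonstration only).
my $input = "name,value\nfoo,1\nbar,2\n";
open my $in, '<', \$input or die "Cannot open in-memory handle: $!";

my $pending = <$in>;    # read-ahead buffer holding the first line
my @rows;

if (is_xml($pending)) {
    print "Looks like XML; a parser would need the buffered line back.\n";
}
else {
    # Serve the buffered line first, then continue with the handle.
    while (defined(my $line = defined($pending) ? $pending : <$in>)) {
        undef $pending;
        chomp $line;
        push @rows, $line;
    }
}

print "$_\n" for @rows;
```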

Replies are listed 'Best First'.
Re: peek at STDIN, to determine data type and then pass STDIN to a parser
by Athanasius (Archbishop) on Jan 06, 2015 at 14:58 UTC

    Hello aral,

    The ungets method from FileHandle::Unget works on STDIN:

        #! perl
        use strict;
        use warnings;
        use FileHandle::Unget;

        $| = 1;

        my $fh = FileHandle::Unget->new(\*STDIN)
            or die "Cannot open filehandle: $!";

        print "\nEnter a string: ";
        read($fh, my $buffer1, 10);
        print "\nThe first 10 characters: '$buffer1'\n";

        $fh->ungets($buffer1);
        read($fh, my $buffer2, 15);
        print "The \"next\" 15 characters: '$buffer2'\n";

        $fh->close;

    Output:

        1:17 >perl 1115_SoPW.pl
        Enter a string: abcdefghijklmnopqrstuvwxyz
        The first 10 characters: 'abcdefghij'
        The "next" 15 characters: 'abcdefghijklmno'
        1:17 >

    Update: Added print statements and renamed variables.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      I'd use IO::Unread instead since it uses Perl's builtin support for unreading rather than reimplementing every file operation.

      Excellent solution as well - I just tested it and it worked out of the box like this:

          use FileHandle::Unget;

          my $fh = FileHandle::Unget->new(\*STDIN)
              or die "Cannot open filehandle: $!";

          my $testline = <$fh>;
          $fh->ungets($testline);
          print "$.: $testline";

          for (my $i = 0; $i < 3; $i++) {
              $testline = <$fh>;
              print "$.: $testline";
          }

      Output for "cat xmlfile | ./perscript.pl" is:

          1: <?xml version="1.0" encoding="UTF-8"?>
          1: <?xml version="1.0" encoding="UTF-8"?>
          2: <MFOP>
          3: <Basics>

      Thank you very much! FileHandle::Unget is *the* answer to my original question.

      @ikegami: Unfortunately, the install script (Makefile) for IO::Unread fails with error messages, and there seems to be no Debian package for it in jessie, so I was not able to test it.

Re: peek at STDIN, to determine data type and then pass STDIN to a parser
by MidLifeXis (Monsignor) on Jan 06, 2015 at 14:37 UTC

    Perhaps using an iterator might be a solution. Create an Iterator::Simple iterator object out of the original file handle, pull the first couple of lines from the original file handle to validate file type, and then use the iterator as the file handle passed to the actual processing code. IIRC, the iterator can behave like a standard file handle. You will need to manage the storage of the first bit of text that you check on, but the coding is pretty simple.

    --MidLifeXis

      Thank you for the suggestion. Are you still talking about possibilities for STDIN? For normal filehandles I would be able to use a seek operation anyways. My problem seems to be limited to pipes.
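
To illustrate the distinction drawn above, here is a minimal core-Perl sketch, assuming a POSIX-like system: seek rewinds a regular file after peeking at the first line, but fails on a pipe. The temporary file and its contents are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Regular file: write two lines, reopen for reading, peek, then rewind.
my ($tmp, $name) = tempfile(UNLINK => 1);
print {$tmp} qq{<?xml version="1.0"?>\n<root/>\n};
close($tmp) or die "Cannot close temp file: $!";

open my $fh, '<', $name or die "Cannot open $name: $!";
my $first   = <$fh>;                      # peek at the first line
my $file_ok = seek($fh, 0, 0) ? 1 : 0;    # rewind to the start
my $again   = <$fh>;                      # same line is read again

# Pipe: there is nothing to rewind to, so seek reports failure.
pipe(my $reader, my $writer) or die "Cannot create pipe: $!";
my $pipe_ok = seek($reader, 0, 0) ? 1 : 0;

print "file seek ok: $file_ok, pipe seek ok: $pipe_ok\n";
```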

        Yes. It is an option. It may not be the best option for your uses.

        I use iterators when schlepping event logs through my monitoring system, whether they come from a real-time event queue, stored log files, or current state of a system. To my consumer software, all of the data looks the same.

        The reason I suggested this technique is that it does not significantly increase the memory or filesystem requirements (as reading files fully into memory or storing in a temp file and processing would^Wcould do). It also allows the consumer (your XML processing in this case) to treat it as just a file handle.

            # UNTESTED
            #
            # This is for line-by-line reading, not block-by-block reading.
            # Adjust as necessary.
            use IO::Handle;                   # provides $fh->getline
            use Iterator::Simple qw(iter);    # provides iter()

            sub create_iterator {
                my $original_fh  = \*STDIN;
                my @cached_data  = $original_fh->getline;    # enough to id the file
                my $data_type_id = identify_data_type( \@cached_data );    # Remove from @cached if provided

                my $iterator = iter( sub {
                    my $retval;
                    if ( $data_type_id ) {
                        # First call: hand out the type id once
                        $retval       = $data_type_id;
                        $data_type_id = undef;
                    }
                    elsif ( @cached_data ) {
                        # Then replay the lines already read
                        $retval = shift( @cached_data );
                    }
                    else {
                        # Finally continue with the original handle
                        $retval = $original_fh->getline;
                    }
                    return $retval;
                } );

                return $iterator;
            }

        --MidLifeXis
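
The replay idea can also be sketched with a plain closure and no extra modules. This is not the Iterator::Simple code above, only a core-Perl approximation of the same technique; make_replaying_reader is a hypothetical name, and the in-memory handle stands in for STDIN.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields the already-read
# lines first and then keeps reading from the underlying handle.
sub make_replaying_reader {
    my ($fh, @peeked) = @_;
    return sub {
        return shift @peeked if @peeked;
        return scalar readline($fh);
    };
}

# In-memory stand-in for piped STDIN (assumption, for demonstration only).
my $data = qq{<?xml version="1.0"?>\n<MFOP>\n<Basics>\n};
open my $in, '<', \$data or die "Cannot open in-memory handle: $!";

my $first     = <$in>;                                # peek at the first line
my $next_line = make_replaying_reader($in, $first);   # replay it afterwards

my @seen;
while (defined(my $line = $next_line->())) {
    push @seen, $line;
}
print @seen;
```

Note that a closure is a code reference, not a file handle, so code that expects a real handle would still need some wrapper around it.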

      What is the difference between reading line by line using the filehandle with the diamond operator and using an iterator?

        Nothing if you are just reading. The benefit can arise if you want to rearrange, inject, or modify the incoming data on the file handle and make the resulting stream look like a plain old file handle. I understand the OP to want to maybe inject a proper doctype into the data stream if needed.

        Perhaps not the best tool for this particular case, but a tool for the generic case.

        --MidLifeXis
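
One way to make an injected stream "look like a plain old file handle" in core Perl is a tied handle. The sketch below is an assumption-laden illustration, not code from this thread: PrependHandle is a hypothetical class, only readline-style reads are supported, and the injected XML declaration is just an example. A consumer that calls read instead of readline would also need a READ method.

```perl
use strict;
use warnings;

# Hypothetical class: a tied handle that serves injected lines first,
# then falls through to the wrapped handle.
package PrependHandle;

sub TIEHANDLE {
    my ($class, $fh, @lines) = @_;
    return bless { fh => $fh, pending => [@lines] }, $class;
}

sub READLINE {
    my ($self) = @_;
    if (wantarray) {
        # List context: everything pending plus the rest of the handle.
        my @all = (@{ $self->{pending} }, readline($self->{fh}));
        $self->{pending} = [];
        return @all;
    }
    return shift @{ $self->{pending} } if @{ $self->{pending} };
    return scalar readline($self->{fh});
}

package main;

# In-memory stand-in for piped input that lacks an XML declaration.
my $body = "<MFOP>\n<Basics/>\n</MFOP>\n";
open my $raw, '<', \$body or die "Cannot open in-memory handle: $!";

my $first  = <$raw>;          # peek at the first line
my @inject = ($first);        # it must be replayed in any case
unshift @inject, qq{<?xml version="1.0" encoding="UTF-8"?>\n}
    unless $first =~ /\A<\?xml\b/;

tie *WRAPPED, 'PrependHandle', $raw, @inject;
my @lines = <WRAPPED>;
print @lines;
```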
