PerlMonks  

uhead: "head -c" for utf8 data

by graff (Chancellor)
on Sep 01, 2006 at 03:25 UTC
Category: Utility Scripts
Author/Contact Info graff(at)ldc.upenn.edu
Description: This simple command-line utility does for utf8 text data what GNU "head -c N" does for ASCII data: print just the first N characters of files (or STDIN). Since Perl's built-in "read" function is able to read characters (rather than just bytes) when the file handle has a utf8 layer, this is a pretty trivial exercise. But I wanted to post it anyway, because it's a nice demonstration of a fairly complex process (handling variable-width characters) being made really simple.

#!/usr/bin/perl

=head1 NAME

uhead -- unicode-aware version of unix "head"

=head1 SYNOPSIS

uhead -c N [file ...]   show first N unicode chars from file(s)

=head1 DESCRIPTION

This does what the standard "head -c N" command (GNU version) would do
(i.e. show the first N characters from one or more files), with just
the following differences:

=over 4

=item *

The "-c N" option is required (not optional)

=item *

N refers to a number of UTF-8 encoded unicode characters rather than
bytes

=item *

"Negative" values for N are not supported (you cannot elect to view
all but the last N characters)

=back

If no files are provided on the command line, it will read from STDIN
instead. (But if it notices that STDIN is actually the user's tty, not
a pipe or redirection from a file, it will exit with a suitable error
message.)

=head1 AUTHOR

David Graff <graff(at)ldc.upenn.edu>

=cut

use strict;

my $Usage = "Usage: $0 -c N [file ...]\n";
die $Usage unless ( @ARGV > 1 and $ARGV[0] eq '-c' and
                    $ARGV[1] =~ /^\d+$/ );

shift;
my $show_chrs = shift;
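# Fall back to STDIN only when no files were named on the command line,
# and refuse to read from an interactive terminal.
unless ( @ARGV ) {
    -t STDIN
        and die "You need to provide some data (pipe or file(s))\n$Usage";
    @ARGV = ( '__STDIN__' );
}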

binmode STDOUT, ":utf8";
my $nfiles = @ARGV;

while ( @ARGV ) {
    my $file = shift;
    my $head;
    if ( $file eq '__STDIN__' ) {
        binmode STDIN, ":utf8";
        read STDIN, $head, $show_chrs;
    }
    else {
        # Lexical handle: closed automatically when it goes out of scope.
        if ( open( my $in, "<:utf8", $file )) {
            read $in, $head, $show_chrs;
        }
        else {
            warn "open failed on $file: $!\n";
            next;
        }
    }
    print "\n==> $file <==\n" if ( $nfiles > 1 );
    print $head,"\n";
}
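
The point made in the description, that "read" counts characters rather than bytes once the handle has a utf8 layer, can be seen in isolation with a small sketch. This is not part of uhead; the helper name first_n and the sample string are made up for illustration, and it assumes a Perl with in-memory file handles (5.8 or later).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data (hypothetical): "h", e-acute (2 bytes), "l", "l",
# o-umlaut (2 bytes). That is 5 characters stored in 7 bytes of UTF-8.
my $bytes = "h\xC3\xA9ll\xC3\xB6";

# Illustrative helper, not from uhead: read the first $n units from an
# in-memory copy of $data opened with the given PerlIO layer.
sub first_n {
    my ( $layer, $data, $n ) = @_;
    open( my $fh, "<$layer", \$data ) or die "in-memory open failed: $!";
    my $got = read( $fh, my $buf, $n );
    defined $got or die "read failed: $!";
    close $fh;
    return $buf;
}

# With :raw, N counts bytes, so N=2 cuts the e-acute in half.
my $as_bytes = first_n( ':raw', $bytes, 2 );

# With :utf8, N counts characters, so N=2 yields "h" plus a whole e-acute.
my $as_chars = first_n( ':utf8', $bytes, 2 );

binmode STDOUT, ':utf8';
printf "raw:  length %d, hex %s\n", length $as_bytes, unpack( 'H*', $as_bytes );
printf "utf8: length %d, text %s\n", length $as_chars, $as_chars;
```

The raw read stops in the middle of a multi-byte sequence, which is what a byte-oriented "head -c" would do to utf8 input; the utf8 read returns whole characters, which is the behaviour uhead relies on.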
