jcg3525 has asked for the wisdom of the Perl Monks concerning the following question:

I'd be grateful for help in solving this little problem. I'm trying to convert a large body of ASCII text to delimited text (50 mgb).

Each item of text is currently formatted as follows (there are about 20k of these that need to be converted):

^46004 % Tamerlane.| Tamerlane - Sheridan; Bajazet - Barry; Moneses - A Gentleman; Arpasia - Mrs. Furnival; Selima - Mrs. Elmy; +

I would like it to look like this (but w/o double spacing) for importation into a database:

"46004","Tamerlane","Tamerlane","Sheridan"
"46004","Tamerlane","Bajazet","Barry"
"46004","Tamerlane","Moneses","A Gentleman"
"46004","Tamerlane","Arpasia","Mrs. Furnival"
"46004","Tamerlane","Selima","Mrs. Elmy"
Here's another example of the sort of text:

^46005 % Hamlet.| Hamlet - Sheridan; Polonius - J. Morris; Laertes- Lacy; Ophelia- Mrs. Storer; Queen - Mrs. Furnival; +

Changed to:

"46005","Hamlet","Hamlet","Sheridan"
"46005","Hamlet","Polonius ","J. Morris"
"46005","Hamlet","Laertes","Lacy"
"46005","Hamlet","Ophelia","Mrs. Storer"
"46005","Hamlet","Queen","Mrs. Furnival"
Any help gratefully acknowledged.

  • Comment on Help formatting text to delimited text in file

Replies are listed 'Best First'.
Re: Help formatting text to delimited text in file (updated)
by AnomalousMonk (Archbishop) on Apr 21, 2019 at 03:18 UTC

    Here's another approach. The module is extremely simple: no exportation, no OO. Invocation must be by fully-qualified function names, but that's part of the simplicity.

    The idea is that the regexes used for record field validation and extraction can be adapted very precisely to the data. (These regexes are currently rather general and naive; names and titles and such can be very tricky.) No effort is made to check for play index number/name duplication. The test script should be expanded to cover many edge cases. Note that the "Polonius" output field of play 46005 has been "fixed": The trailing space present in the OP has been removed; if this whitespace needs to remain, you need to specify under what conditions trailing whitespace is preserved.

    Output:
    c:\@Work\Perl\monks\jcg3525>perl ParsePlaybill.t ok 1 - use ParsePlaybill; ok 2 - ^46004 % Tamerlane.| Tamerlane - Sheridan; Bajazet - Barry; Mon +eses - A Gentleman; Arpasia - Mrs. Furnival; Selima - Mrs. Elmy; + # # Polonius "fixed" in this test case ok 3 - ^46005 % Hamlet.| Hamlet - Sheridan; Polonius - J. Morris; Laer +tes- Lacy; Ophelia- Mrs. Storer; Queen - Mrs. Furnival; + # # one-character play ok 4 - ^1 % Mark Twain Tonight .| Mark Twain -Hal Holbrook;+ # ok 5 - no warnings 1..5
    Run under Perl version 5.8.9.

    Update: As it stands, this code has some (small, I hope) problems:

    1. After looking at some of the posts of holli and Marshall, I realized that character names with embedded hyphens would cause a problem (conflict with the character-actor field separator hyphen). There's a simple fix for this, but it's fragile.
    2. Names of a play, character or actor (player) can include  - ' . characters after the first character, but cannot end in one of these characters due to the  \b assertion in the regexes for these record fields. Again, I think there's a simple fix, but it may be fragile. (These fields probably could not end in a hyphen in any case.)
    I don't know the real requirements of this parsing task and my code was intended only as a general example anyway, so rather than posting any changes (which I may yet do), I will just post this warning and await events.


    Give a man a fish:  <%-{-{-{-<

Re: Help formatting text to delimited text in file
by holli (Abbot) on Apr 20, 2019 at 23:25 UTC
    How does your input look like when Mrs. Storer marries again and is now known as Mrs. Storer-Goods? You know, she'sone of these modern women who won't give up her name. Especially not since she had to walk over dead former Mrs. Storers' body to achieve it, but I disgress.

    In other Words, can there be "-" in the data and if so, is it escaped?


    holli

    You can lead your users to water, but alas, you cannot drown them.
      This is a completely valid point.
      With names there is of course always the additional complications of titles, XXX,MD or Jr. Sr., III etc. Sometimes people have four or just two names instead of three. Wild cases like a single name "Madonna" also happen. A friend of mine didn't get fully divorced, but did drop her last name (which she got from her husband). A weird thing short of going back to her maiden name.

      In my code at Re: Help formatting text to delimited text in file, my line my ($name1, $name2) = split /\s*-\s*/,$sub_name; was my ($name1, $name2) = split /\s+-\s+/,$sub_name; until I ran it and saw: Ophelia- Mrs. Storer instead of the expected Ophelia - Mrs. Storer.

      I am not sure if the missing space before the "-" is a typo or not? I changed the split regex to allow optional spaces before and after to work with the OP's posted data in a quick decision. Usually a hyphenated name will be printed without spaces either before or after each surname. Mileage varies.

      With just 2 example lines, we can't solve every potential case. There is always some iteration involved when working ad-hoc without a complete spec. It could be that requiring a space after the "-" is enough to differentiate between "Smith-Jones"? Not sure.

      I think my suggestion to count hyphens on each line and identify outliers is a good one. Modify code accordingly.
      I still suspect that: "Ophelia- Mrs. Storer" is a typo.
      Update: Oh, in this type of printout, I sincerely doubt that that there are any "escape" characters like "\" to guide the process. Could be, but doubtful. I am working on a project right now where I have to parse several types of printouts designed for humans. Perl is an excellent language for this! Regex is an operator instead of an object and I can quickly iterate and fine tune the parsing functions.

Re: Help formatting text to delimited text in file
by jwkrahn (Abbot) on Apr 21, 2019 at 00:26 UTC
    #!/usr/bin/perl use warnings; use strict; open my $IN, '<', 'input.txt' or die "Cannot open 'input.txt' becaus +e: $!"; open my $OUT, '>', 'output.txt' or die "Cannot open 'output.txt' becau +se: $!"; # Input record separator $/ = '+'; while ( <$IN> ) { # skip empty records next unless /\S/; # remove input record separator chomp; my ( $id, $name ) = ( $1, $2 ) if s/ \s* \^ \s* ( \d+ ) \s* % \s* +( .+? ) \s* \.\| \s* //x; for my $data ( split /\s*;\s*/ ) { print $OUT join( ',', map qq/"$_"/, $id, $name, split /\s*-\s* +/, $data ), "\n"; } }
Re: Help formatting text to delimited text in file
by LanX (Saint) on Apr 20, 2019 at 22:14 UTC
    Please use  <code> tags around your data

    What did you code so far?

    Hint: Looks like you need to apply multiple rounds of split , i.e. first on | then ; then -

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: Help formatting text to delimited text in file
by Marshall (Canon) on Apr 21, 2019 at 19:01 UTC
    This was easy enough, I went ahead and wrote the code for you. In situations like this, use a combination of regex and split. It takes some experience to decide which to use. In general use regex when you know what to keep and use split when you know what to throw away. Below I started with a regex to make sure that I'm looking at something that looks like a valid line. Then successive splits are applied which throw away various stuff. You will have to consider if there are hyphenated names, maybe search your input for lines that contain more than the usual number of hyphens? There is always a bit of trial and error when parsing something without at formal spec that covers all cases. Good luck.
    #!/usr/bin/perl use strict; use warnings; while (my $line = <DATA>) { if (my( $num, $root_name, $rest)=$line =~ /^\^(\d+)\s*%\s*(\w+)[. | +]+(.+)\+$/) { my @sub_names = split /;\s*/,$rest; foreach my $sub_name (@sub_names) { my ($name1, $name2) = split /\s*-\s*/,$sub_name; print "\"$num\",\"$root_name\",\"$name1\",\"$name2\"\n"; } } } =Prints "46004","Tamerlane","Tamerlane","Sheridan" "46004","Tamerlane","Bajazet","Barry" "46004","Tamerlane","Moneses","A Gentleman" "46004","Tamerlane","Arpasia","Mrs. Furnival" "46004","Tamerlane","Selima","Mrs. Elmy" "46005","Hamlet","Hamlet","Sheridan" "46005","Hamlet","Polonius","J. Morris" "46005","Hamlet","Laertes","Lacy" "46005","Hamlet","Ophelia","Mrs. Storer" "46005","Hamlet","Queen","Mrs. Furnival" =cut __DATA__ ^46004 % Tamerlane.| Tamerlane - Sheridan; Bajazet - Barry; Moneses - +A Gentleman; Arpasia - Mrs. Furnival; Selima - Mrs. Elmy; + ^46005 % Hamlet.| Hamlet - Sheridan; Polonius - J. Morris; Laertes- La +cy; Ophelia- Mrs. Storer; Queen - Mrs. Furnival; +
Re: Help formatting text to delimited text in file
by clueless newbie (Curate) on Apr 21, 2019 at 15:44 UTC
    Somedays one feels like being a "jerk" so
    #!/usr/bin/env perl use Data::Dumper; use 5.14.0; my @data=( q{^46004 % Tamerlane.| Tamerlane - Sheridan; Bajazet - Barry; +Moneses - A Gentleman; Arpasia - Mrs. Furnival; Selima - Mrs. Elmy; + +} ,q{^46005 % Hamlet.| Hamlet - Sheridan; Polonius - J. Morris; +Laertes- Lacy; Ophelia- Mrs. Storer; Queen - Mrs. Furnival; +} ); local $"='","';#" for (@data) { my %c_h; if (m{^\^(.+?) % (.+?)\.\|(?{$c_h{1}=$1; $c_h{2}=$2;}) ?(?:(.+?) * +- *(.+?); ?(?{$c_h{3}=$3;$c_h{4}=$4; say qq{"@c_h{qw(1 2 3 4)}"}}))*} +) { }; }; __DATA__
    which gives on my machine
    "46004","Tamerlane","Tamerlane","Sheridan" "46004","Tamerlane","Bajazet","Barry" "46004","Tamerlane","Moneses","A Gentleman" "46004","Tamerlane","Arpasia","Mrs. Furnival" "46004","Tamerlane","Selima","Mrs. Elmy" "46005","Hamlet","Hamlet","Sheridan" "46005","Hamlet","Polonius","J. Morris" "46005","Hamlet","Laertes","Lacy" "46005","Hamlet","Ophelia","Mrs. Storer" "46005","Hamlet","Queen","Mrs. Furnival"
    See A bit of magic: executing Perl code in a regular expression. Of course, you might as well do something like this (and go straight to the database):
    #!/usr/bin/env perl use Data::Dumper; use DBI; use 5.14.0; use if ($ENV{DBG} || $ENV{DEBUG}),'Devel::Include','keep',$ENV{MASK} ? + $ENV{MASK} : '#=#'; my @data=( q{^46004 % Tamerlane.| Tamerlane - Sheridan; Bajazet - Barry; +Moneses - A Gentleman; Arpasia - Mrs. Furnival; Selima - Mrs. Elmy; + +} ,q{^46005 % Hamlet.| Hamlet - Sheridan; Polonius - J. Morris; +Laertes- Lacy; Ophelia- Mrs. Storer; Queen - Mrs. Furnival; +} ); my $dbh=DBI->connect('dbi:SQLite:dbname=test.sqlite','','',{ PrintErro +r=>1, RaiseError=>1 }); $dbh->do(<<"__CREATE__"); CREATE TABLE titles ( number integer, name text); __CREATE__ $dbh->do(<<"__CREATE__"); CREATE TABLE roles ( number integer, role text, actor text); __CREATE__ my %sth; $sth{title}=$dbh->prepare(<<"__TITLE__"); INSERT INTO titles (number,name) VALUES (?,?); __TITLE__ $sth{role}=$dbh->prepare(<<"__ROLE__"); INSERT INTO roles (number,role,actor) VALUES (?,?,?); __ROLE__ local $"='","';#" for (@data) { if (m{^\^(.+?) % (.+?)\.\|(?{$sth{title}->execute($1,$2)}) ?(?:(.+ +?) *- *(.+?); ?(?{$sth{role}->execute($1,$3,$4)}))*}) { }; }; $dbh->disconnect;