Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Ok, I give... Let me give you an example of the data I need to reformat.

Bob Smith bsmith 00001234567 01/01/1986 00:00:00
Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00
Gilligan Q Smith gsmith 00001234569 01/01/1986 00:00:00

So the format is:

Name     LoginID    Token    Last Login Time (... possible extra tokens and login times)

The trouble is this: the numbers are security token serial numbers (changed to protect the innocent), and the time is the last login time with that token. People may have multiple tokens, which is why Mary Ann has a double entry above. I need to find a way to comma-delimit this file, and the hardest part seems to be placing the comma between the name and the login ID, since no standard was imposed on how names were entered. My thought is something along the lines of an expression that searches for the 0000 in the token, steps back a space, then a word, then swaps the space ahead of that with a comma, like

s/\s(\w\s0000.*)/,$1/mg

This isn't working; it has no effect on the data at all that I can see. Any ideas would be GREATLY appreciated. Thanks!

Replies are listed 'Best First'.
Re: Backtracking for substitutions
by chromatic (Archbishop) on Oct 02, 2000 at 01:33 UTC
    Here's what I did:
    #!/usr/bin/perl -w
    use strict;

    my @lines = (
        'Bob Smith bsmith 00001234567 01/01/1986 00:00:00',
        'Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00',
        'Gilligan Q Smith gsmith 00001234569 01/01/1986 00:00:00'
    );

    foreach my $line (@lines) {
        $line =~ s/(\s[a-z]+\s0{4})/,$1/g;
        print "New line:\n'$line'\n";
    }
    I didn't use \w because that goes for alphanumerics. I matched the whole login name with the character class [a-z]+, with one whitespace character before and after it. I didn't need to match the rest of the string, because that particular combination only occurs once in your dataset.

    If I were doing this on my own, I might look into split. I like split.

    Split on the token (you know it's an n-digit number):

    my ($user, $token, $login) = split(/\s(\d{11})\s/, $line, 2);
    (my $name = $user) =~ s/\s(\w+)$//;    # untested construct: strips the trailing login ID
    my $username = $1;                     # the login ID captured by the substitution above
    print "$username, claiming to be $name, logged on with token $token at $login.\n";
    Then again, the regex seems pretty straightforward.
Re: Backtracking for substitutions
by japhy (Canon) on Oct 02, 2000 at 02:24 UTC
    Your only problem was that \w matches a word character, not a word. Add a + after it, and you should be fine.
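    A minimal sketch of that fix, applied to two of the sample lines from the question. The sub name add_first_comma is made up for illustration, and the /mg flags are dropped on the assumption that each line is processed on its own; otherwise the only change is the added +.

```perl
use strict;
use warnings;

# Hypothetical helper: swaps the space in front of the login ID for a comma,
# using the original substitution plus the missing "+" after \w.
sub add_first_comma {
    my ($line) = @_;
    $line =~ s/\s(\w+\s0000.*)/,$1/;
    return $line;
}

print add_first_comma($_), "\n" for (
    'Bob Smith bsmith 00001234567 01/01/1986 00:00:00',
    'Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00',
);
# Bob Smith,bsmith 00001234567 01/01/1986 00:00:00
# Mary Ann Doe,mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00
```

    This places only the first comma, between the name and the login ID; the remaining fields still need their own substitutions.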

    $_="goto+F.print+chop;\n=yhpaj";F1:eval
Re: Backtracking for substitutions
by extremely (Priest) on Oct 02, 2000 at 01:44 UTC
    Are all the names guaranteed to start with capitals?
    Are all the logins lowercase only?

    If you can match on the case variation, this works:

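    foreach (<>) {
        s/\s+(?=[a-z])/,/;          # first whitespace before a lowercase word (name/login boundary)
        s/\b([a-z]+)\s/$1,/;        # space after the login ID
        s/\b\s+\b(\d\d\/)/,$1/g;    # space between each token and its date
        s/(:\d\d)\s\b/$1,/g;        # space between a time and the next token
        print;                      # emit the reformatted line
    }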

    It ain't purty but it works.

    --
    $you = new YOU;
    honk() if $you->love(perl)

Re: Backtracking for substitutions
by 2501 (Pilgrim) on Oct 02, 2000 at 08:27 UTC
    my @lines = (
        'Bob Smith bsmith 00001234567 01/01/1986 00:00:00',
        'Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00',
        'Gilligan Q Smith gsmith 00001234569 01/01/1986 00:00:00'
    );
    foreach $line (@lines) {
        $data = $line;
        $line =~ s/^(\D+)(.*)/$1/;
        $data =~ s/^(\D+)(.*)/$2/;
        $data =~ s/(\d+)\s(\w+)\s(\w+)/$line,$1,$2,$3,\n/g
    }
    This is still a bit rough around the edges, but I hope you get the idea. I took the original question to mean that he wanted extra tokens identifiable on separate lines in a CSV format, so Mary would end up with two lines in the final output. What I tried to do was store the alpha text, identify the data, break the data down into 'sets', and finish up by attaching the alpha text and delimiting it.
    Now that I think about it, would a \w pick up slashes and colons? Maybe it needs an '|' in there also :P
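    On that last question: \w matches only letters, digits, and underscore, so it will not match the slashes in the date or the colons in the time. Below is a rough sketch of the same one-row-per-token idea that uses \S+ (any run of non-whitespace) for those fields instead. It is not the code above, just one possible variant; the sub name token_rows is hypothetical, and it assumes every token starts with 0000 as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one "name,login,token,date,time" row per token.
sub token_rows {
    my ($line) = @_;

    # Name = everything before the login ID; login ID = the single word
    # right before the first token; the rest is token/date/time triples.
    my ($name, $login, $rest) = $line =~ /^(.*?)\s+(\w+)\s+(0000\d+\s.*)$/
        or return;

    my @rows;
    # \S+ accepts the slashes and colons that \w would reject.
    while ($rest =~ /(\d+)\s+(\S+)\s+(\S+)/g) {
        push @rows, join ',', $name, $login, $1, $2, $3;
    }
    return @rows;
}

print "$_\n" for token_rows(
    'Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:00 00001234563 01/01/1986 00:00:00'
);
# Mary Ann Doe,mdoe,00001234568,01/01/1986,00:00:00
# Mary Ann Doe,mdoe,00001234563,01/01/1986,00:00:00
```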
Re: Backtracking for substitutions
by runrig (Abbot) on Oct 02, 2000 at 04:18 UTC
    In two steps:
    s|^(.*?)\s(\w+)\s(0000\d+)\s(\d+/\d+/\d+\s\d+:\d+:\d+)|$1,$2,$3,$4|;
    s|\s(0000\d+)\s(\d+/\d+/\d+\s\d+:\d+:\d+)|,$1,$2|g;
A two-liner for Backtracking for substitutions
by Anonymous Monk on Oct 02, 2000 at 23:02 UTC
    My solution is really just two lines of substitution code. Here they are:
    $mydata =~ s/([\w\s]+)\s([\w\d]+)\s(0000.*)/$1,$2,$3/g;
    $mydata =~ s/(:\d{2})\s(?=0000)/$1,/g;
    And if you're not easily overwhelmed by lots of documentation, here's the whole program with setup code, comments, and output:
    #!/usr/bin/perl
    # NODE34853.pl
    # Assumptions: There are potentially any number of parts of a user's name.
    # For example, "Bill Clinton" might be a user's name, but
    # "William Jefferson Clinton the Liar" might also be his name.
    # The username comes next and is always one word long. It might also have
    # numbers in it, such as Bill69.
    # The data is currently in a single scalar (as though you read it from a flat file).
    # And, it's not clear what granularity you want the data to have. I'm assuming
    # that you want the user's name, his username, and the individual chunks of login
    # data. Do you also want to split up the login data? Your post didn't say.
    #
    # Knowing what you want to do with this data afterwards would also help. If you want to
    # load this into a SQL database, then you'd probably want to do this a bit differently.
    # But, if your goal is just to comma-delimit the file so you can load it into
    # a spreadsheet, then this oughta do the trick.
    #
    # This solution is really just a two line program with lots of comments and some
    # stuff to set up the environment and print the results.
    # I hope it helps.
    # --Mark
    #
    # This line just sets up the scalar variable you want to parse.
    # I'm assuming you have other methods of doing this (reading from
    # CSV, etc.)
    $mydata = <<ENDDATA;
    Bob Smith bsmith 00001234567 01/01/1986 00:00:00
    Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:01 00001234563 01/01/1986 00:00:02 00001234563 01/01/1986 00:00:03
    Gilligan Q Smith gsmith 00001234569 01/01/1986 00:00:01 00001234569 01/01/1986 00:00:02
    ENDDATA

    # The purpose of this regex is just to split out the user's NAME,
    # USERNAME, and associated DATA. We're leaving the guts of the DATA alone for now.
    $mydata =~ s/([\w\s]+)\s([\w\d]+)\s(0000.*)/$1,$2,$3/g;

    #MyData temporarily looks like this:
    #Bob Smith,bsmith,00001234567 01/01/1986 00:00:00
    #Mary Ann Doe,mdoe,00001234568 01/01/1986 00:00:01 00001234563 01/01/1986 00:00:02 00001234563 01/01/1986 00:00:03
    #Gilligan Q Smith,gsmith,00001234569 01/01/1986 00:00:01 00001234569 01/01/1986 00:00:02

    # Now, let's split up the DATA parts by looking for the space between the :00 and 0000.
    # The (?=0000) lookahead checks for the token prefix without consuming it,
    # so the token keeps its leading zeros.
    $mydata =~ s/(:\d{2})\s(?=0000)/$1,/g;

    print "All done. MyData now looks like this\n$mydata\n\n";
    #Bob Smith,bsmith,00001234567 01/01/1986 00:00:00
    #Mary Ann Doe,mdoe,00001234568 01/01/1986 00:00:01,00001234563 01/01/1986 00:00:02,00001234563 01/01/1986 00:00:03
    #Gilligan Q Smith,gsmith,00001234569 01/01/1986 00:00:01,00001234569 01/01/1986 00:00:02
    I hope this helps. Let us know. --Mark
      I didn't go over your code in detail, but I did notice a common regex error:
      [\w\d]
      Many people (including yours truly at one time), mistakenly assume that \w does not match 0-9. Surprise! It does. This caused me a problem when I was trying to do the following:
      my $text = "product1234imageSmall.jpg";
      ($type, $id, $property) = ($1, $2, $3) if $text =~ /^(\w+)(\d+)(\w+)/;
      It failed pretty quickly because $type was getting set to product123 (it didn't pick up the "4" because \d had to match something).

      In this case, because you are including both \w and \d in a character class, the only issue is redundancy; it doesn't affect how the regex functions. I just wanted to point this out because it's easy to miss and get bitten in other situations.
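      A small sketch of one way around that greediness, assuming the prefix is purely alphabetic: use [a-zA-Z]+ instead of \w+ for the first group so it cannot swallow the digits. The sub name parse_image_name is hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: splits a name like "product1234imageSmall.jpg" into
# an alphabetic type, a numeric id, and the property that follows.
sub parse_image_name {
    my ($text) = @_;
    # [a-zA-Z]+ stops at the first digit, so \d+ gets the whole number.
    return $text =~ /^([a-zA-Z]+)(\d+)(\w+)/ ? ($1, $2, $3) : ();
}

my ($type, $id, $property) = parse_image_name("product1234imageSmall.jpg");
print "$type $id $property\n";   # product 1234 imageSmall
```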

      Cheers,
      Ovid

      Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

        Thanks Ovid! I appreciate the friendly amendment. How does the rest of the code look? I'd also like to hear from the original anonymous poster. Did he get his problem solved? --Mark
      Woops. Forgot to log in before I sent that last post. Still hoping it helps. --Mark