in reply to Backtracking for substitutions

My solution is really just two lines of substitution code. Here they are:
$mydata =~ s/([\w\s]+)\s([\w\d]+)\s(0000.*)/$1,$2,$3/g;
$mydata =~ s/(:\d{2})\s0000/$1,$2/g;
And if you're not easily overwhelmed by lots of documentation, here's the whole program with setup code, comments, and output:
#!/usr/bin/perl
# NODE34853.pl
# Assumptions: There are potentially any number of parts of a user's name.
# For example, "Bill Clinton" might be a user's name, but
# "William Jefferson Clinton the Liar" might also be his name.
# The user's name is next, which is always one word long. Might also have numbers
# in it, such as Bill69.
# The data is currently in a single scalar (as though you read it from a flat file).
# And, It's not clear what granularity you want the data to have. I'm assuming
# that you want the user's name, his username, and the individual chunks of login
# data. Do you also want to split up the login data? Your post didn't say.
#
# Knowing what you want to do with this data afterwards would also help. If you want to
# load this into a SQL database, then you'd probably want to do this a bit differently.
# But, if your goal is just to comma-delimit the file so you can load it into
# a spreadsheet, then this oughta do the trick.
#
# This solution is really just a two line program with lots of comments and some
# stuff to setup the environment and print the results.
# I hope it helps.
# --Mark
#
# This line just sets up the scalar variable you want to parse.
# I'm assuming you have other methods of doing this (reading from
# CSV, etc.)
$mydata = <<ENDDATA;
Bob Smith bsmith 00001234567 01/01/1986 00:00:00
Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:01 00001234563 01/01/1986 00:00:02 00001234563 01/01/1986 00:00:03
Gilligan Q Smith gsmith 00001234569 01/01/1986 00:00:01 00001234569 01/01/1986 00:00:02
ENDDATA

# The purpose of this regex is just to split out the user's NAME,
# USERNAME, and associated DATA. We're leaving the guts of the DATA alone for now.
$mydata =~ s/([\w\s]+)\s([\w\d]+)\s(0000.*)/$1,$2,$3/g;

#MyData temporarily looks like this:
#Bob Smith,bsmith,00001234567 01/01/1986 00:00:00
#Mary Ann Doe,mdoe,00001234568 01/01/1986 00:00:01 00001234563 01/01/1986 00:00:02 00001234563 01/01/1986 00:00:03
#Gilligan Q Smith,gsmith,00001234569 01/01/1986 00:00:01 00001234569 01/01/1986 00:00:02

# Now, let's split up the DATA parts by looking for the space between the :00 and 0000
$mydata =~ s/(:\d{2})\s0000/$1,$2/g;

print "All done. MyData now looks like this\n$mydata\n\n";

#Bob Smith,bsmith,00001234567 01/01/1986 00:00:00
#Mary Ann Doe,mdoe,00001234568 01/01/1986 00:00:01,00001234563 01/01/1986 00:00:02,00001234563 01/01/1986 00:00:03
#Gilligan Q Smith,gsmith,00001234569 01/01/1986 00:00:01,00001234569 01/01/1986 00:00:02
I hope this helps. Let us know. --Mark

(Ovid - Common regex error) RE: A two-liner for Backtracking for substitutions
by Ovid (Cardinal) on Oct 02, 2000 at 23:16 UTC
    I didn't go over your code in detail, but I did notice a common regex error:
    [\w\d]
    Many people (including yours truly at one time) mistakenly assume that \w does not match 0-9. Surprise! It does. This caused me a problem when I was trying to do the following:
    my $text = "product1234imageSmall.jpg";
    ($type, $id, $property) = ($1, $2, $3) if $text =~ /^(\w+)(\d+)(\w+)/;
    It failed pretty quickly because $type was getting set to product123 (it didn't pick up the "4" because \d had to match something).
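
A minimal sketch of the behaviour described above, together with one possible fix. The letters-only class `[A-Za-z]+` for the first group is an assumption about what the filename prefix may contain, not something from the original post, and the `$type2`-style variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "product1234imageSmall.jpg";

# Greedy \w+ also matches digits, so it grabs as much as it can and
# leaves only a single digit for \d+.
my ($type, $id, $property) = $text =~ /^(\w+)(\d+)(\w+)/;
print "greedy: $type | $id | $property\n";    # greedy: product123 | 4 | imageSmall

# Assumed fix: restrict the first group to letters so it cannot eat digits.
my ($type2, $id2, $property2) = $text =~ /^([A-Za-z]+)(\d+)(\w+)/;
print "fixed:  $type2 | $id2 | $property2\n"; # fixed:  product | 1234 | imageSmall
```

If the prefix could legitimately contain digits or underscores, a different boundary (for example a fixed digit count for the ID) would be needed instead.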

    In this case, because you are including both \w and \d in a character class, the only issue is redundancy, and it doesn't affect how the regex functions. I just wanted to point this out because it's easy to miss and get bitten by in other situations.

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

      Thanks Ovid! I appreciate the friendly amendment. How does the rest of the code look? I'd also like to hear from the original anonymous poster. Did he get his problem solved? --Mark
        There are a few problems with your code. However, if I write a book, I'm going to dedicate it to:
        #!/usr/bin/perl -w
        use strict;
        Those two little lines have saved me more trouble than you can possibly imagine and I would strongly recommend that you incorporate them. Admittedly, you just posted what you did for testing purposes, but I still have this "knee jerk" reaction regarding anything without the -w switch or use strict.
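
As one illustration of what warnings can catch, here is a small sketch that runs the second substitution from the original post with warnings enabled. It uses `use warnings` rather than the `-w` switch, and the `__WARN__` handler and the shortened sample line are assumptions added only so the warning can be inspected; they are not part of the original program.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $mydata = "mdoe 00001234568 01/01/1986 00:00:01 00001234563 01/01/1986 00:00:02";

# The second substitution from the original post: the pattern has only one
# capture group, so $2 is undefined when the replacement is built.
$mydata =~ s/(:\d{2})\s0000/$1,$2/g;

print "warning: $_" for @warnings;   # an "uninitialized value" warning
print "result: $mydata\n";           # the matched 0000 is gone from the result
```

Without warnings enabled, the same substitution silently drops the `0000` prefix from every record after the first.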

        Your first regex can be made a bit more efficient (and accurate) by eliminating the .*, anchoring the match to the beginning of the line, and using the /m switch:

        $mydata =~ s/^([\w\s]+)\s([\w]+)\s(0000)/$1,$2,$3/mg;
        I haven't actually benchmarked this, but I'd bet good money that this is the case. See Death to Dot Star! for information on why .* is problematic. The accuracy issue is probably a moot point if you have relatively clean data.
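
A short sketch of the anchored substitution applied to two records abbreviated from the original post. The sample string is an assumption made for brevity; the regex is the one shown above, unchanged.

```perl
use strict;
use warnings;

# Two sample records, shortened from the data in the original post.
my $mydata = "Bob Smith bsmith 00001234567 01/01/1986 00:00:00\n"
           . "Mary Ann Doe mdoe 00001234568 01/01/1986 00:00:01\n";

# With /m, ^ anchors each match at the start of a line. The pattern stops
# at the leading 0000 instead of consuming the rest of the line with .*
$mydata =~ s/^([\w\s]+)\s([\w]+)\s(0000)/$1,$2,$3/mg;

print $mydata;
# Bob Smith,bsmith,00001234567 01/01/1986 00:00:00
# Mary Ann Doe,mdoe,00001234568 01/01/1986 00:00:01
```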

        The second regex has two issues. You forgot to put parentheses around the \s0000. Those parentheses were supposed to capture this data and substitute it back using $2. I just changed it to the following:

        $mydata =~ s/(:\d\d)\s0000/$1, 0000/g;
        The other problem is really just a minor efficiency issue: \d{2} is better written as \d\d (this is from MRE, so it may be out of date for newer regex engines). Basically, when you use \d{2}, the regex engine is forced to keep track of the number of instances of \d. This slows it down just a tad (which can be significant when iterating over a large amount of data). However, when the regex engine sees \d\d, it just matches each instance of \d, which is faster.
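
Two ways the second substitution could be written so that the `0000` survives, sketched on a shortened sample line. The first adds the capture group the original seems to have intended; the second uses a lookahead, which is an alternative not mentioned in the thread. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = "mdoe 00001234568 01/01/1986 00:00:01 00001234563 01/01/1986 00:00:02";

# Variant 1: capture the 0000 so that $2 is defined in the replacement.
(my $captured = $line) =~ s/(:\d\d)\s(0000)/$1,$2/g;

# Variant 2 (assumed alternative): a lookahead checks for 0000 without
# consuming it, so only the space is replaced by a comma.
(my $lookahead = $line) =~ s/(:\d\d)\s(?=0000)/$1,/g;

print "$captured\n";
print "$lookahead\n";
# Both print:
# mdoe 00001234568 01/01/1986 00:00:01,00001234563 01/01/1986 00:00:02
```

Unlike the hard-coded `$1, 0000` replacement, neither variant inserts a space after the comma, which matches the output shown in the original post's comments.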

        Hope this helps!

        Cheers,
        Ovid


RE: A two-liner for Backtracking for substitutions
by markwild (Sexton) on Oct 02, 2000 at 23:05 UTC
    Woops. Forgot to log in before I sent that last post. Still hoping it helps. --Mark