Please critique this script -- Read emails and write URLs into bookmarks

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Please critique this script -- Read emails and write URLs into bookmarks by talexb (Chancellor) on Oct 06, 2016 at 18:39 UTC
Hmm .. well, I don't see any error-checking, you're using a regex for something that isn't clear (re-formatting the content of the subject line?), and you're assuming there is only going to be a single URI in each message. Alex / talexb / Toronto Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.	[reply]
Re^2: Please critique this script -- Read emails and write URLs into bookmarks by Anonymous Monk on Oct 06, 2016 at 18:59 UTC
There is only one valid URL per message. This is a situational constraint (also written in the spec). I know that because these are 'Emails-to-self' that contain at max two URLs (one in the plain text part, one in the HTML part if it exists) Maybe it's a bit difficult to read but the regex is transforming the subject into a suitable filename, escaping problematic characters. I tried to be sensible about the error checking and assumed the modules do error handling for their part, outside of that there's hardly any code in its own right. Now that I think of it though, input should have been checked. That was a mistake. What else do you think needs explicit error handling?	[reply]
Re^3: Please critique this script -- Read emails and write URLs into bookmarks by talexb (Chancellor) on Oct 06, 2016 at 19:06 UTC
I would check the following: Input arguments -- if they are files, do they exist, and are they readable and the right type? Object from Path::Tiny -- check that this is valid. Perhaps check after the call to `URI::Find->new` so see that there is something in the `@uris` list? To me, that's basic defensive programming. Check everything -- you never know when Something Weird will happen; it's better that your script captures it rather than the script dying for some unknown reason. Alex / talexb / Toronto Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.	[reply] [d/l] [select]
Re^4: Please critique this script -- Read emails and write URLs into bookmarks by Anonymous Monk on Oct 07, 2016 at 08:13 UTC
Re^5: Please critique this script -- Read emails and write URLs into bookmarks by talexb (Chancellor) on Oct 07, 2016 at 20:47 UTC
Re^4: Please critique this script -- Read emails and write URLs into bookmarks by Anonymous Monk on Oct 06, 2016 at 19:28 UTC
Re^5: Please critique this script -- Read emails and write URLs into bookmarks by Anonymous Monk on Oct 06, 2016 at 19:49 UTC
Some notes below your chosen depth have not been shown here
Re: Please critique this script -- Read emails and write URLs into bookmarks by kcott (Archbishop) on Oct 07, 2016 at 21:02 UTC
"This is my first Perl script." ++ This is an excellent, first effort! "I'm looking for any kind of feedback, especially pitfalls and 'better' idioms etc." Here's some observations; along with various suggestions and recommendations: I'd personally avoid hard-coding a path to `perl` in the shebang line. Consider using the current environment's `perl`: `#!/usr/bin/env perl` [download] This allows you to change the version of Perl in the current environment and run your script under this new version. In particular, it's useful when testing how your script runs under different versions of Perl (perlbrew is a common choice for this). You're using the '`/r`' modifier in your regex substitution. The way you're using it is fine. The problem is, it was introduced in `v5.14`^{[perl5140delta: Non-destructive substitution]}: anyone using an earlier version will have problems. The earliest version I have installed is `v5.14.2`, so I can't test this. Whenever I use the '`/r`' modifier, I typically start my scripts like this: `#!/usr/bin/env perl use 5.014; use strict; use warnings; ...` [download] It's fine to use advanced features; however, if you do, it's a good idea to keep an eye on their statuses (e.g. stable, experimental, and so on). Take a look at the `perl5delta` and `perl5vvsdelta`_{[vv=version; s=subversion]} manpages in the perldoc: perl: Miscellaneous section. New features are typically added in the first version (i.e. subversion=0): I typically start searches in the `perl5vv0delta` pages. Here's an example of the types of changes that can occur over just a few versions: `v5.24` - works as expected; `v5.22` and `v5.20` - can be made to work with two features, warnings silenced and experimental status; `v5.18` - fatal error. If '`$x->@`' is uncharted territory, take a look at "perlref: Postfix Dereference Syntax". $ alias perle alias perle='perl -Mstrict -Mwarnings -Mautodie=:all -E' $ perlbrew switch perl-5.24.0t $ perl -v \| head -2 \| tail -1 This is perl 5, version 24, subversion 0 (v5.24.0) built for darwin-th +read-multi-2level $ perle 'my $x = [qw{a b c}]; say "@{$x}"; say "$x->@"' a b c a b c $ perlbrew switch perl-5.22.0t $ perl -v \| head -2 \| tail -1 This is perl 5, version 22, subversion 0 (v5.22.0) built for darwin-th +read-multi-2level $ perle 'my $x = [qw{a b c}]; say "@{$x}"; say "$x->@"' a b c ARRAY(0x7ff4d2805480)->@ $ perle 'no warnings "experimental::postderef"; use feature qw{postder +ef postderef_qq}; my $x = [qw{a b c}]; say "@{$x}"; say "$x->@"' a b c a b c $ perlbrew switch perl-5.20.0t $ perl -v \| head -2 \| tail -1 This is perl 5, version 20, subversion 0 (v5.20.0) built for darwin-th +read-multi-2level $ perle 'no warnings "experimental::postderef"; use feature qw{postder +ef postderef_qq}; my $x = [qw{a b c}]; say "@{$x}"; say "$x->@"' a b c a b c $ perlbrew switch perl-5.18.0t $ perl -v \| head -2 \| tail -1 This is perl 5, version 18, subversion 0 (v5.18.0) built for darwin-th +read-multi-2level $ perle 'no warnings "experimental::postderef"; use feature qw{postder +ef postderef_qq}; my $x = [qw{a b c}]; say "@{$x}"; say "$x->@"' Unknown warnings category 'experimental::postderef' at -e line 1. BEGIN failed--compilation aborted at -e line 1. [download] I don't see any reason for `$VERSION`; however, if you do need it, first see "version: SYNOPSIS". Of particular note: `... use version; our $VERSION = qv("v1.2.3"); # deprecated ... our $VERSION = "1.0203"; # recommended ...` [download] Consider compiling your regex only once, and doing so at compile time. By the way, much of your escaping was unnecessary; I've removed the superfluous backslashes (i.e. all except '`\\`'); see "perlrecharclass: Special Characters Inside a Bracketed Character Class". I might have written something along the lines of: `... { my $re; BEGIN { $re = qr{(?x: [%/?<>\\:\|":] )} } for ... { ... s/$re/_/gr; } }` [download] Take a look at "perlperf - Perl Performance and Optimization Techniques"; in particular, BENCHMARKS's "Search and replace or tr" section. While both substitution and transliteration can be used for this task, Benchmarking indicates transliteration can do it 25 times faster. $ perle 'my $re; BEGIN { $re = qr{(?x: [%/?<>\\:\|":] )} } say q{%/?<> +\\:\|":} =~ s/$re/_/gr' ___________ $ perle 'say q{%/?<>\\:\|":} =~ y{%/?<>\\:\|":}{_}r' ___________ $ for i in {1..5}; do perle 'use Benchmark qw{cmpthese}; my $re; BEGIN + { $re = qr{(?x: [%/?<>\\:\\|":] )} }; my $str = q{%/?<>\\:\|":}; cmp +these 0 => { s => sub { $str =~ s/$re/_/gr }, y => sub { $str =~ y{%/ +?<>\\:\|":}{_}r } }'; done Rate s y s 212144/s -- -96% y 5420361/s 2455% -- Rate s y s 204213/s -- -96% y 5632150/s 2658% -- Rate s y s 201600/s -- -97% y 5771041/s 2763% -- Rate s y s 204103/s -- -96% y 5026909/s 2363% -- Rate s y s 222651/s -- -96% y 5600560/s 2415% -- [download] Perhaps a regex substitution was not the best choice. I don't see any need for the intermediary variable '`$uri`': you can just write '`push @uris, shift`'. I'd also question the need to create a new coderef on every iteration. Consider something like one of these two ways (untested): `{ my @uris; my $finder_code = sub { push @uris, shift }; sub finder_sub { push @uris, shift } for ... { ... @uris = (); my $code_finder = URI::Find::->new($finder_code); my $sub_finder = URI::Find::->new(\&finder_sub); ... } }` [download] If you were wondering about the extra pair of colons I added ('`::`‍`->`'), see "perlobj: Invoking Class Methods". There may be other issues, but these were the ones that stood out for me. Having pointed out all those issues, I just wanted to reiterate that this is a well-written script and an excellent first attempt. See also: the online perl manpage (which I have bookmarked and refer to often); "Create A New User". Update:* Fixed typo: `s/obversations/observations/` — Ken	[reply] [d/l] [select]
Re^2: Please critique this script -- Read emails and write URLs into bookmarks by afoken (Chancellor) on Oct 13, 2016 at 06:29 UTC
I'd personally avoid hard-coding a path to perl in the shebang line. [...] This allows you to change the version of Perl in the current environment and run your script under this new version. In particular, it's useful when testing how your script runs under different versions of Perl The shebang line with a fixed interpreter does not prevent you from trying other interpreters: `./foobar.pl # <- perl from shebang perl foobar.pl # <- whatever perl is found first in $ENV{'PATH'} /home/me/bin/perl-special foobar.pl # <- a special perl /usr/local/bin/perl foobar.pl # <- common location for a local perl /usr/bin/perl foobar.pl # <- common location for system perl` [download] perlrun has an entire chapter "Location of Perl" with recommendations: When possible, it's good for both /usr/bin/perl and /usr/local/bin/perl to be symlinks to the actual binary. If that can't be done, system administrators are strongly encouraged to put (symlinks to) perl and its accompanying utilities into a directory typically found along a user's PATH, or in some other obvious and convenient place. You are advised to use a specific path if you care about a specific version. Perl has some surprises when it comes to the shebang line, see perlrun: Parsing of the `#!` switches starts wherever "`perl`" is mentioned in the line. The sequences "`-`" and "`-`" are specifically ignored* If the `#!` line does not contain the word "`perl`" nor the word "`indir`", the program named after the `#!` is executed instead of the Perl interpreter. This is slightly bizarre, but it helps people on machines that don't do `#!` Also, while most systems have /usr/bin/env, not all systems have it. Some have the env utility in /bin/env. Some systems use a symlink in one directory pointing to env in the other directory. See also http://www.in-ulm.de/~mascheck/various/shebang/#env. env is another process that has to be started. env does not just start perl; at least, it has to parse the command line for arguments. See busybox env.c for a quite minimal implementation. Entry point is `env_main()`. The GNU version does much more, see GNU env.c. So using env in the #! line increases the startup time. Generally, it does not matter that much, especially on modern machines. But the overhead is not zero. I typically start my scripts like this: `#!/usr/bin/env perl use 5.014; use strict; use warnings;` [download] `use strict` is redundant here, as `use VERSION` with a VERSION >= 5.12.0 implies `use strict`. See use. Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)	[reply] [d/l] [select]
Re^3: Please critique this script -- Read emails and write URLs into bookmarks by kcott (Archbishop) on Oct 13, 2016 at 16:09 UTC
++ Thanks for the feedback. I'm aware of the shebang line points you raise. Hard-coding paths to `perl` has, in my (>20 years) experience, many more problems than using `env`. Of course, there are occasions where a hard-coded path might be more appropriate. You quoted me as saying: "I typically start my scripts like this: ..." A more accurate quote, that conveyed the point I was making, and didn't hide the text you elided, would have been: "Whenever I use the '`/r`' modifier,* I typically start my scripts like this: ..."* If I removed the '`/r`' modifier, but still had code with `say`, `//=`, etc., I only need to change a '`4`' to a '`0`' (i.e. `5.014` → `5.010`); I do not need to remember to add `use strict`. — Ken	[reply] [d/l] [select]
Re^2: Please critique this script -- Read emails and write URLs into bookmarks by Anonymous Monk on Oct 12, 2016 at 08:56 UTC
This was a long one to parse, thank you for that :) I've done almost all you've recommended, especially the trivial ones. Nice performance considerations as well.	[reply]
Re: Please critique this script -- Read emails and write URLs into bookmarks by eyepopslikeamosquito (Archbishop) on Oct 07, 2016 at 23:23 UTC
Modern Perl assumes you're already decent at programming, so it elides some basic stuff in favor of explaining how Perl works from philosophy to programming in the large. Learning Perl assumes you've never programmed before, so it spends more time on the basics, covers less of the language, and doesn't explore the philosophy of Perl in as much detail. -- chromatic (author of Modern Perl) comments on the philosophy behind his book Given this is your first Perl script, you have demonstrated that you are an accomplished programmer, who is new to Perl. Accordingly, I feel Modern Perl is more appropriate for you than Learning Perl. Modern Perl is free and very quick to read. If you have more time, and want to get inside the head of the language designer, Programming Perl is also good. As for good "second" books on Perl, I suggest: Effective Perl Programming Perl Best Practices In particular, Perl Best Practices, Chapter 4, "Values and Expressions" in the "Don't mix high- and low-precedence booleans" item, discusses a stylistic point already raised in this thread. For flow of control, prefer the low precedence `and` and `or` operators. A common example of preferring `or` to `\|\|` is: `open(my $fh, '<', $file) or die "error opening '$file': $!";` [download] Reserve `&&` and `\|\|` for logical expressions, not flow of control, for example: `if ($x > 5 && $y < 10) ...` [download]	[reply] [d/l] [select]
Re^2: Please critique this script -- Read emails and write URLs into bookmarks by Anonymous Monk on Oct 12, 2016 at 08:41 UTC
Thank you for the book recommendations. "For flow of control, prefer the low precedence and and or operators. A common example of preferring or to \|\| is:" "Reserve && and \|\| for logical expressions, not flow of control, for example:" I'm inclined not to follow this. I don't like to rely on subtle precedence rules. Is there any reasoning to not always use parentheses instead? (Not that I did that in the versions you know.)	[reply]
Re^3: Please critique this script -- Read emails and write URLs into bookmarks by afoken (Chancellor) on Oct 13, 2016 at 07:13 UTC
I'm inclined not to follow this. I don't like to rely on subtle precedence rules. Is there any reasoning to not always use parentheses instead? (This is a quite generic question, applying not only to Perl, and so the answer does not only apply to Perl.) There might be reasons, I guess the most common ones are avoiding typing and cleaner code. And I think both do not apply, for exactly the reason you gave: "I don't like to rely on subtle precedence rules." Treating operators as having completely random precedence, except for parentheses and the "multiplication and division before addition and subtraction" rule definitively prevents nasty surprises with operator precedence. Adding parentheses does not hurt, the compiler or interpreter should need just a few CPU cycles to handle them. Parentheses clearly indicate the programmer's intended order of evaluation, without forcing the next programmer to memorize precedence rules. Example: `if (a && b \|\| c && d \|\| e) // wtf? Don't make me guess what you intend +ed! if ((a && b) \|\| (c && d) \|\| e) // clear if (a && (b \|\| c) && (d \|\| e)) // as clear, but a very different condi +tion if (a && (b \|\| (c && (d \|\| e)))) // a different condition if ((((a && b) \|\| c) && d) \|\| e) // another different condition` [download] My current job is developing embedded software for medical and aerospace devices. We have a very strict rule that parentheses are required whereever code might be ambiguous. The first example would never get through the obligatory code review; you will never find that in production code. And by the way: If you are working with several languages, as I do, you will notice that precedence rules vary from language to language. Most copy or inherit from C, but some are very different. MUMPS evaluates strictly from left to right, unless parentheses are used. In MUMPS, `WRITE 1+23` writes 9, not 7. `WRITE 1+(23)` writes 7. Of course, redundant parentheses are rubbish: `if ((((((((a && b))) \|\| (((c)))))))) // make extra-sure that parenthes +es have precedence ;-)` [download] Some times, lots of parentheses are required, adding some white space and indention clearly helps: `if ( ((a == A_MAGIC_1) \|\| (a == A_MAGIC_2)) && ((b == B_MAGIC_1) \|\| (b == B_MAGIC_2)) && ( (c == C_MAGIC_1) \|\| ((c == C_MAGIC_2) && (a == A_MAGIC_2) && (b == B_MAGIC_2)) ) )` [download] Yes, this will make the C compiler "waste" about thirty precious CPU cycles eating up all of that redundant white space and braces. Just think how long that takes at 3 GHz! And it gets worse, because all of the source code "wastes" CPU cycles. So compiling the enitre project takes 1.5 msec more time than needed. Flashing into the target takes 20 +/- 2 seconds. Deciphering source code "optimized" for better compile times takes minutes per line, hours per file, every time a change is required. This is a real waste of time and money. Computers and compilers become faster every day. A modern desktop runs circles around old super computers even with crap components. "Optimizing" source code for compile time is ridiculous, except in very rare high-performance cases. Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)	[reply] [d/l] [select]
Re^3: Please critique this script -- Read emails and write URLs into bookmarks by eyepopslikeamosquito (Archbishop) on Oct 13, 2016 at 09:09 UTC
Is there any reasoning to not always use parentheses instead? Yes, clarity and readability. While I certainly do not endorse memorising the precedence table, I feel non-casual programmers should know the precedence of very commonly used operators. To illustrate, if you follow the "always use parentheses" rule, you'd be obliged to write: `( my $x ) = ( $y + 42 );` [download] which I feel is borderline obscene. This is surely more clearly written, without the clutter of parentheses, simply as: `my $x = $y + 42;` [download] BTW, I would similarly object to parentheses here in C or in any language where assignment is an operator. In Perl, I place the ultra low precedence `and` and `or` operators in the same category as the `=` operator, at least for non-novice Perl programmers. Being very low precedence makes them especially well-suited to flow of control, and I agree with Conway that reserving them exclusively for that purpose, for example: `open(my $fh, '<', $file) or die "error opening '$file': $!";` [download] or: `$meaning == 42 or next;` [download] is idiomatic Perl and makes the code clearer and more enjoyable to read.	[reply] [d/l] [select]