in reply to extract ids

Your regex is close; very close.

However, the character class [A-Za-z] matches ONLY ONE character.

Since you want to match the many characters before the \smolecule, you need a quantifier after the char class, thusly:

if ($line =~ /^<[A-Za-z]+\smolecule_idref="(\d+)">$/) {

where the "+" says "Match one or more members of the class." (Note that even though the "+" makes the regex "greedy," there's no harm here: the class can't match whitespace, so it stops at the whitespace character you specify next.)
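
If you want to see the difference for yourself, here is a minimal sketch. The sample line and the tag name "atom" are my assumptions, since I don't have your real data; substitute one of your own lines.

```perl
use strict;
use warnings;

# Hypothetical sample line; the tag name "atom" is an assumption.
my $line = '<atom molecule_idref="42">';

# Without a quantifier the class matches exactly one letter,
# so \s is then tested against the "t" of "atom" and the match fails.
my $single = $line =~ /^<[A-Za-z]\smolecule_idref="(\d+)">$/ ? "match" : "no match";

# With "+" the class consumes all of "atom" and stops at the whitespace.
my $plus = $line =~ /^<[A-Za-z]+\smolecule_idref="(\d+)">$/ ? "match" : "no match";

print "without +: $single\n";
print "with +:    $plus\n";
```

With that sample line, the first pattern should report "no match" and the second "match".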

I've also changed your capture to specify one or more digits. Your .* matches ZERO or more of any character (other than a newline), which isn't what you've specified.
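
A quick way to compare the two captures is sketched below. Again, the sample lines (including the empty and non-numeric values) are made up for illustration and are not taken from your file.

```perl
use strict;
use warnings;

# Hypothetical sample lines; tag name and attribute values are assumptions.
my @lines = (
    '<atom molecule_idref="42">',
    '<atom molecule_idref="">',
    '<atom molecule_idref="n/a">',
);

my (@loose, @digits);
for my $line (@lines) {
    # (.*) also accepts an empty or non-numeric value.
    push @loose, $1 if $line =~ /^<[A-Za-z]+\smolecule_idref="(.*)">$/;

    # (\d+) requires at least one digit and nothing but digits.
    push @digits, $1 if $line =~ /^<[A-Za-z]+\smolecule_idref="(\d+)">$/;
}

print "with .*:  ", join(", ", map { "'$_'" } @loose),  "\n";
print "with \\d+: ", join(", ", map { "'$_'" } @digits), "\n";
```

On these three lines the .* version captures all three values, including the empty string, while the \d+ version captures only 42.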