I recently came up with a "clever" idea to eliminate one jump from an inner loop, and was surprised to find that it slowed things down. Allow me to explain my terrible error, so that you don't fall victim in the future.
Here, foo is kinda like a naked function: it uses the same stack frame and registers as the parent function, reads from s1, and writes to s0.
The call to foo uses the bl instruction, which is "branch and link": it jumps to the given label, and stores the next instruction address in the link register (lr or x30).
When foo is done, the ret instruction jumps to the address in the link register, which is the instruction following the original bl.
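To make the bl / ret handshake concrete, here is a minimal Python sketch of a toy machine with a program counter and a link register. The instruction tuples, the `run` helper, and the `nop` / `halt` opcodes are made up for illustration; this is not real AArch64 encoding, only a model of the control flow described above.

```python
def run(program, start=0):
    """Execute a toy instruction list and return the sequence of visited pcs."""
    pc, lr, trace = start, None, []
    while pc is not None and pc < len(program):
        op, arg = program[pc]
        trace.append(pc)
        if op == "bl":
            lr = pc + 1   # link: remember the instruction after the call
            pc = arg      # branch: jump to the target label
        elif op == "ret":
            pc = lr       # jump to whatever address the link register holds
        elif op == "halt":
            break
        else:             # "nop": fall through to the next instruction
            pc += 1
    return trace


program = [
    ("bl", 3),       # 0: call foo
    ("nop", None),   # 1: execution resumes here after foo returns
    ("halt", None),  # 2: stop
    ("nop", None),   # 3: foo body
    ("ret", None),   # 4: return through the link register
]

trace = run(program)
print(trace)
```

The trace visits the call, the body of foo, the ret, and then the instruction right after the call, which is the same round trip that bl and ret make in the real code.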
Looking at this code, I was struck by the fact that it does two branches, one after the other. Surely, it would be more efficient to only branch once.
Why do we need a special function return instruction? Functionally, BR LR would do the same job as RET. Using RET tells the processor that this is a function return. Most modern processors, and all Cortex-A processors, support branch prediction. Knowing that this is a function return allows processors to more accurately predict the branch.
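One common way for a processor to exploit that hint is a small stack of predicted return addresses: push on every call, pop on every return. The sketch below is a toy model of that idea in Python, not a description of any particular Cortex-A core. The class name, the depth of 16, and the addresses are all assumptions chosen for illustration.

```python
class ReturnStackPredictor:
    """Toy return-address predictor: calls push, returns pop."""

    def __init__(self, depth=16):  # depth is an arbitrary illustrative value
        self.depth = depth
        self.stack = []
        self.hits = 0
        self.misses = 0

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overflow: the oldest entry is lost
        self.stack.append(return_address)

    def on_return(self, actual_target):
        predicted = self.stack.pop() if self.stack else None
        if predicted == actual_target:
            self.hits += 1
        else:
            self.misses += 1
        return predicted


# Every return is paired with a call, so every prediction is correct.
paired = ReturnStackPredictor()
for _ in range(100):
    paired.on_call(0x1004)
    paired.on_return(0x1004)

# Returns with no matching call leave the predictor with nothing to pop.
unpaired = ReturnStackPredictor()
for _ in range(100):
    unpaired.on_return(0x1004)

print(paired.hits, paired.misses)      # 100 0
print(unpaired.hits, unpaired.misses)  # 0 100
```

In this model, a return is only cheap when the predictor saw the call that goes with it. A real core has other predictors to fall back on, so the second case is a worst-case caricature rather than a measurement.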