This does require another register, but we skip having to use shifted ADD, which takes two cycles on some CPUs, and we gain instruction-level parallelism.