important question for anyone good at x86. can microcode cache the top of the stack in processor registers for sufficiently nearby pushes and pops or do stack accesses always require a cache access no matter what
@mothcompute idk if there are additional microarchitectural register caches for stack values, but I know on Intel there's some caching of past stack values as part of IBP for speculating returns. when you reach a RET the Return Stack Buffer (RSB) contains the target prediction for speculative execution beyond that point, and when the stack pop comes back (from cache hit or memory fetch) it can either confirm successful prediction or rewind.
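A rough way to poke at the RSB side of this (a hedged sketch, assuming x86-64 and GCC/Clang-style inline asm; the file name and loop count are made up): time a loop whose ret has a matching call against one that builds its return address with a push and then rets, which leaves the return-stack prediction pointing at stale history. On parts that fall back to the indirect predictor once the RSB is out of sync the gap can shrink, so treat the numbers as suggestive rather than definitive.

/* Hedged sketch: matched call/ret (the RSB predicts the ret target) versus a
 * hand-built push+ret (the RSB's top entry is stale, so the ret tends to
 * mispredict). Results vary by microarchitecture. Build: cc -O2 rsb.c */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void) {
    const long N = 20000000;

    uint64_t t0 = now_ns();
    for (long i = 0; i < N; i++) {
        /* call pushes an RSB entry, ret consumes it: predicted return */
        __asm__ volatile(
            "call 1f\n\t"
            "jmp 2f\n"
            "1: ret\n"
            "2:" ::: "memory");
    }
    uint64_t t1 = now_ns();

    for (long i = 0; i < N; i++) {
        /* no matching call: the ret's target comes from a manual push, so
         * the RSB prediction points at stale history */
        __asm__ volatile(
            "lea 3f(%%rip), %%rax\n\t"
            "push %%rax\n\t"
            "ret\n"
            "3:" ::: "rax", "memory");
    }
    uint64_t t2 = now_ns();

    printf("matched call/ret: %.2f ns/iter\n", (double)(t1 - t0) / N);
    printf("push + ret:       %.2f ns/iter\n", (double)(t2 - t1) / N);
    return 0;
}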
@mothcompute there may also be some cleverness with writeback coalescing on temporally grouped stack ops, but I'd bet they at least go as far as L1 each time; otherwise coherence would be a mess. (this is mostly an educated guess though and I am prepared to be surprised)
@gsuberland @mothcompute there's also usually a queue of pending stores, the store buffer (i.e. "when you load from an address, try to find the value from the youngest store to this address").
but like any other kind of caching strategy, it's used to *predict* the values of a load, and you still have to validate it against whatever is being held in the next level of the hierarchy
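To make the forward-then-validate picture concrete, a rough sketch (assuming x86-64 and GCC/Clang-style inline asm; names and counts are made up): the first loop's 8-byte load is fed by the 8-byte store right before it, which the store buffer can forward, while the second loop's 8-byte load spans two pending 4-byte stores, which usually can't be stitched together and so waits for the data to reach L1 (a store-forwarding stall).

/* Sketch: store-to-load forwarding hit vs. a forwarding stall. The second
 * loop's 8-byte load needs bytes from two pending 4-byte stores, which the
 * store buffer typically can't stitch together. Build: cc -O2 fwd.c */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void) {
    const long N = 100000000;
    union { uint64_t q; uint32_t d[2]; } slot = { 0 };
    uint64_t acc = 1;

    uint64_t t0 = now_ns();
    for (long i = 0; i < N; i++) {
        /* 8-byte store, then a dependent 8-byte load of the same slot:
         * the value can come straight out of the store buffer */
        __asm__ volatile(
            "mov %[acc], %[q]\n\t"
            "mov %[q], %[acc]\n\t"
            : [acc] "+r"(acc), [q] "+m"(slot.q));
    }
    uint64_t t1 = now_ns();

    for (long i = 0; i < N; i++) {
        /* two 4-byte stores, then a dependent 8-byte load covering both:
         * forwarding usually fails and the load waits for L1 */
        __asm__ volatile(
            "movl %k[acc], %[lo]\n\t"
            "movl %k[acc], %[hi]\n\t"
            "mov  %[q], %[acc]\n\t"
            : [acc] "+r"(acc), [q] "+m"(slot.q),
              [lo] "+m"(slot.d[0]), [hi] "+m"(slot.d[1]));
    }
    uint64_t t2 = now_ns();

    printf("forwarded reload:   %.2f ns/iter\n", (double)(t1 - t0) / N);
    printf("split-store reload: %.2f ns/iter\n", (double)(t2 - t1) / N);
    printf("(acc = %llu)\n", (unsigned long long)acc);
    return 0;
}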
@gsuberland @mothcompute yeh it will have to get to L1 eventually, but x86 explicitly allows ordering to be broken by forwarding within the processor that did the store, so it's a bit weaker (I came across this a while back because it makes emulating architectures with stronger ordering on x86 tricky). see 'Intra-Processor Forwarding Is Allowed'
(8.2.3.5 in my copy of the SDM)
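The litmus test in that section, written out as a throwaway program (the names, loop count, and crude spin barrier are mine; the racy plain accesses are deliberate): with x = y = 0 initially, x86 permits r2 == 0 && r4 == 0 because each core may satisfy its own load from its own store buffer before the store becomes visible to the other core. You may need many runs, or core pinning, for the outcome to actually show up.

/* Sketch of the SDM 8.2.3.5 litmus test ("Intra-Processor Forwarding Is
 * Allowed"). Initially x = y = 0. Thread 0: x = 1; r1 = x; r2 = y.
 * Thread 1: y = 1; r3 = y; r4 = x. Observing r2 == 0 && r4 == 0 is
 * architecturally allowed on x86. Build: cc -O2 -pthread litmus.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static volatile int x, y;       /* deliberately racy shared locations */
static int r2, r4;              /* results reported back by the threads */
static atomic_int go;           /* crude spin barrier */

static void *t0(void *arg) {
    (void)arg;
    while (!atomic_load(&go)) ;
    int r1, r2l;
    __asm__ volatile(
        "movl $1, %[x]\n\t"     /* store x = 1                             */
        "movl %[x], %[r1]\n\t"  /* r1 = x (may come from the store buffer) */
        "movl %[y], %[r2]\n\t"  /* r2 = y (can still observe 0)            */
        : [r1] "=&r"(r1), [r2] "=&r"(r2l), [x] "+m"(x)
        : [y] "m"(y));
    (void)r1;
    r2 = r2l;
    return NULL;
}

static void *t1(void *arg) {
    (void)arg;
    while (!atomic_load(&go)) ;
    int r3, r4l;
    __asm__ volatile(
        "movl $1, %[y]\n\t"
        "movl %[y], %[r3]\n\t"
        "movl %[x], %[r4]\n\t"
        : [r3] "=&r"(r3), [r4] "=&r"(r4l), [y] "+m"(y)
        : [x] "m"(x));
    (void)r3;
    r4 = r4l;
    return NULL;
}

int main(void) {
    const long iters = 200000;
    long hits = 0;
    for (long i = 0; i < iters; i++) {
        x = 0; y = 0;
        atomic_store(&go, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        atomic_store(&go, 1);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r2 == 0 && r4 == 0)
            hits++;
    }
    printf("r2==0 && r4==0 seen %ld / %ld times\n", hits, iters);
    return 0;
}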
@mothcompute I think so. I don't have a source handy, but I recall reading up on this once while teaching compilers, after I unexpectedly measured no performance difference when we implemented register allocation, where before we had passed everything on the stack. Some reading led me to conclude x86 was secretly putting stuff in secret registers, and our language was too small to express programs that could defeat the microcode and measure a difference.
@mothcompute Apparently so; I was thinking 'store forwarding' would be the thing that lets this happen, but when I was hunting for a reference I came across 'Mirroring memory operands' (24.17 / page 236, see below), which says 'It also works with PUSH and POP instructions.' Note: I doubt microcode gets involved - I think microcode only happens for big complex stuff, not anything fast.
@penguin42 that's *exactly* what i was looking for. thank you
@penguin42 it's very interesting that it mentions it's present in zen 2 but not zen 3, because those are the two machines i usually write for. maybe i can try comparing performance per clock between them in these sections
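One way that comparison could look (a sketch of a possible methodology, assuming x86-64 and GCC/Clang-style inline asm; not taken from the manual): a dependency chain threaded through push/pop, where each pop reads back exactly what the preceding push wrote. A core that mirrors the stack slot into a register should run this close to the 1-cycle add, while one paying normal store-forwarding latency will take several cycles per iteration; divide ns per iteration by your clock period to get a per-clock figure.

/* Sketch: latency of a push/pop round trip in a dependency chain, for
 * comparing cores with and without memory-operand mirroring.
 * Build: cc -O2 pushpop.c */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void) {
    const long N = 100000000;
    uint64_t acc = 1;

    uint64_t t0 = now_ns();
    for (long i = 0; i < N; i++) {
        /* the pop reads exactly what the push just wrote, so every
         * iteration's add depends on the stack round trip */
        __asm__ volatile(
            "push %[acc]\n\t"
            "pop  %[acc]\n\t"
            "add  $1, %[acc]\n\t"
            : [acc] "+r"(acc) : : "cc", "memory");
    }
    uint64_t t1 = now_ns();

    printf("push/pop chain: %.2f ns/iter (acc = %llu)\n",
           (double)(t1 - t0) / N, (unsigned long long)acc);
    return 0;
}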
@mothcompute Yeh I guess all these forwarding mechanisms are really complex and interact with the rest of the out-of-order pipeline; it's possible they hit a bug in zen 3 and decided to take it out/turn it off rather than fix it before release; difficult to know; and I'm guessing some of the other parts of the forwarding from the store buffer might get you some of the performance anyway; shrug.
…one of these days you’re gonna have to tell me what exactly it is that you do.
@rk @penguin42 funnily enough pretty much just what you see on here