gh-109329: Support for basic pystats for Tier 2 #109913
brandtbucher merged 17 commits into python:main
Conversation
brandtbucher
left a comment
This is so great to have, thanks! We can tweak the collection and output further if need be, but this is already going to be so much more insightful!
Just a couple of suggestions, but otherwise I'll plan on merging tomorrow.
Include/internal/pycore_code.h
Outdated
#define OPT_STAT_INC(name) do { if (_Py_stats) _Py_stats->optimization_stats.name++; } while (0)
#define UOP_EXE_INC(opname) do { if (_Py_stats) _Py_stats->optimization_stats.opcode[opname].execution_count++; } while (0)
#define OPT_UNSUPPORTED_OPCODE(opname) do { if (_Py_stats) _Py_stats->optimization_stats.unsupported_opcode[opname]++; } while (0)
#define OPT_HIST(length, name) do { if (_Py_stats) _Py_stats->optimization_stats.name[_Py_bit_length((length - 1) >> 2)]++; } while (0)
If I understand correctly, there is a minimum trace size of 2, so the first two power-of-two buckets (for 0 and 1) would always be empty, so might as well just not support them (that's the >> 2). Then I also think it's easier to reason about buckets that have an inclusive maximum of a round number (that's the - 1).
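The bucketing scheme described above can be sketched in Python (a sketch only, using `int.bit_length` as a stand-in for `_Py_bit_length`; the bias of 2 matches the `>> 2` in the macro):

```python
# Sketch of the power-of-two bucketing discussed above (not the CPython
# implementation itself). With a bias of 2, bucket i has an inclusive
# maximum of 2**(i + 2): lengths 1-4 land in bucket 0, 5-8 in bucket 1,
# 9-16 in bucket 2, and so on. The "- 1" makes the round number the
# inclusive maximum of its bucket rather than the start of the next one.
_Py_UOP_HIST_BIAS = 2

def bucket_for(length):
    # Equivalent to _Py_bit_length((length - 1) >> _Py_UOP_HIST_BIAS)
    return ((length - 1) >> _Py_UOP_HIST_BIAS).bit_length()
```

For example, `bucket_for(8)` and `bucket_for(16)` land in buckets 1 and 2 respectively, while `bucket_for(17)` spills into bucket 3.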
Include/cpython/pystats.h
Outdated
    uint64_t execution_count;
} UOpStats;

#define _Py_UOP_HIST_SIZE 5
We'll probably want more buckets than this (since execution lengths for closed loops will get pretty big), but it's probably fine for now. We can just bump it whenever.
I thought the max trace length was 64 (_Py_UOP_MAX_TRACE_LENGTH). Can they be longer? (It's quite possible I don't understand the code). In any event, I should probably add a comment here that this should be log2 of whatever the expected maximum length is.
Can they be longer? (It's quite possible I don't understand the code).
Not statically, but dynamically they can be quite long (imagine executing a trace for a closed loop hundreds of times).
So we should probably modify the code that adds to the histogram to be Py_MIN(_Py_UOP_HIST_SIZE - 1, _Py_bit_length((length - 1) >> _Py_UOP_HIST_BIAS)), so that anything overflowing the histogram ends up in the largest bucket instead of writing past the end of the array.
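The proposed range check amounts to clamping the bucket index to the last bucket, i.e. taking the minimum of the computed index and `_Py_UOP_HIST_SIZE - 1`. A Python sketch of that clamping:

```python
# Sketch of the range-checked bucketing discussed above; the constants
# are the values from this PR (5 buckets, bias of 2).
_Py_UOP_HIST_SIZE = 5
_Py_UOP_HIST_BIAS = 2

def clamped_bucket(length):
    # Any run length too large for the histogram falls into the last
    # bucket instead of indexing past the end of the array.
    bucket = ((length - 1) >> _Py_UOP_HIST_BIAS).bit_length()
    return min(bucket, _Py_UOP_HIST_SIZE - 1)
```

With these constants, a length of 64 fits exactly in the last bucket, and a dynamically long run (say, a closed loop executed a million times) clamps to that same bucket rather than overflowing.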
Yeah, that's a good idea to add the range-checking.
Also, this made me realize I was counting the wrong thing -- I was recording the "instruction pointer at exit" to be the execution length. But (after discussing in person) it seems what we want is the number of uops executed per execution of the trace. I've updated this to report that instead.
Python/specialize.c
Outdated
print_histogram(FILE *out, const char *name, uint64_t hist[_Py_UOP_HIST_SIZE])
{
    for (int i = 0; i < _Py_UOP_HIST_SIZE; i++) {
        fprintf(out, "%s[%d]: %" PRIu64 "\n", name, (1 << (i + 2)), hist[i]);
Why i + 2? Is it to correct for the >> 2 when recording the stats?
Oh, I think I see. We're starting with the "third" bucket, since tiny traces are basically nonexistent. Maybe add another constant next to _Py_UOP_HIST_SIZE (like #define _Py_UOP_HIST_BIAS 2 or something) so it's clearer why we're doing this and where.
Yes, that's right. Adding the #define for bias makes sense.
Tools/scripts/summarize_stats.py
Outdated
if not key.startswith(prefix):
    continue
-    name, _, rest = key[7:].partition("]")
+    name, _, rest = key[len(prefix) + 2:].partition("]")
Is this correct? The old value was 7, but len("opcode") + 2 is 8.
Good catch. It seems to not matter, but might as well revert it just in case.
Tools/scripts/summarize_stats.py
Outdated
def _emit_comparative_execution_counts(base_rows, head_rows):
    base_data = dict((x[0], x[1:]) for x in base_rows)
    head_data = dict((x[0], x[1:]) for x in head_rows)
    opcodes = set(base_data.keys()) | set(head_data.keys())
Dictionary keys support set operations:
-    opcodes = set(base_data.keys()) | set(head_data.keys())
+    opcodes = base_data.keys() | head_data.keys()
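As the suggestion notes, `dict.keys()` returns a view object that already supports set operations, so the extra `set(...)` conversions are redundant. A small self-contained example (with made-up opcode data):

```python
# dict key views implement the set protocol; "|" on two views yields a
# plain set containing the union of the keys.
base_data = {"LOAD_FAST": (10,), "STORE_FAST": (5,)}
head_data = {"LOAD_FAST": (12,), "LOAD_CONST": (3,)}

opcodes = base_data.keys() | head_data.keys()
assert opcodes == {"LOAD_FAST", "STORE_FAST", "LOAD_CONST"}
```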
    rows = []
    default = [0, "0.0%", "0.0%", 0]
    for opcode in opcodes:
Maybe give these a deterministic (and somewhat meaningful) order?
-    for opcode in opcodes:
+    for opcode in sorted(opcodes):
We could, but we end up sorting by the counts a few lines below anyway...
Tools/scripts/summarize_stats.py
Outdated
            (opcode, base_entry[0], head_entry[0],
             f"{100*change:0.1f}%"))

    rows.sort(key=lambda x: -abs(percentage_to_float(x[-1])))
-    rows.sort(key=lambda x: -abs(percentage_to_float(x[-1])))
+    rows.sort(key=lambda x: abs(percentage_to_float(x[-1])), reverse=True)
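The two spellings sort identically; `reverse=True` just avoids negating the key. A runnable sketch with hypothetical rows (opcode, base count, head count, percentage change), and a minimal stand-in for the script's `percentage_to_float` helper:

```python
def percentage_to_float(s):
    # Minimal stand-in for the helper used in summarize_stats.py:
    # strip the trailing "%" and parse the number.
    return float(s.rstrip("%"))

# Hypothetical comparison rows, for illustration only.
rows = [("LOAD_FAST", 10, 12, "20.0%"),
        ("STORE_FAST", 5, 4, "-25.0%"),
        ("LOAD_CONST", 8, 8, "0.0%")]

# Sort by magnitude of the change, largest first.
rows.sort(key=lambda x: abs(percentage_to_float(x[-1])), reverse=True)
```

After sorting, the rows are ordered STORE_FAST (25.0%), LOAD_FAST (20.0%), LOAD_CONST (0.0%).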
(Also, it looks like CI caught some trailing whitespace.)
This implements a set of Tier 2 pystats mentioned in #109329.
This implements:
It also fixes a "bug" where specialization "hits" in the running Tier 2 interpreter would count against the Tier 1 stats.
Not implemented (since it will require reworking of DEOPT_IF calls):
Example output (for nbody benchmark):
Optimization (Tier 2) stats
  statistics about the Tier 2 optimizer
  Overall stats
  Trace length histogram
  Optimized trace length histogram
  Trace run length histogram
  Uop stats
  Unsupported opcodes