gh-110019: Refactor summarize_stats#110398
Conversation
markshannon
left a comment
There was a problem hiding this comment.
Thanks for this. This file definitely needed refactoring.
I have few comments. I think a more functional (as opposed to OO) style would help clarity, but the general design seems sound.
Tools/scripts/summarize_stats.py
Outdated
| a_ncols = list(set(len(x) for x in a_rows)) | ||
| if len(a_ncols) != 1: | ||
| raise ValueError("Table a is ragged") | ||
| class Stats(dict): |
There was a problem hiding this comment.
Inheriting from builtin collections can be awkward.
Could you wrap the dict?
Tools/scripts/summarize_stats.py
Outdated
| else: | ||
| ncols = b_ncols[0] | ||
| elif input.is_dir(): | ||
| stats: collections.Counter = collections.Counter() |
There was a problem hiding this comment.
This type annotation seems redundant
There was a problem hiding this comment.
Yep, but mypy requires it :(
There was a problem hiding this comment.
Don't use mypy then?
Is there a Mypy issue for this?
There was a problem hiding this comment.
This should make mypy happy and Mark less unhappy:
| stats: collections.Counter = collections.Counter() | |
| stats = collections.Counter[str]() |
There was a problem hiding this comment.
Don't use mypy then?
Is there a Mypy issue for this?
It's not a mypy bug. Pyright will complain at you in exactly the same way about this. Mypy can't tell what kind of items are going to be stored as keys for the Counter, so it demands an explicit annotation. @mdboom's annotation shuts mypy up, because collections.Counter as a type annotation is equivalent to collections.Counter[Any]. But the better solution is to use collections.Counter[str], because they keys of stats are all strings.
There was a problem hiding this comment.
It is a mypy bug.
Omitting the annotation and providing collections.Counter as an annotation has exactly the same information content. So complaining about one and not the other is erroneous.
I assume mypy will not complain about stats = collections.Counter[str]() then.
There was a problem hiding this comment.
I agree there's a usability bug here; mypy isn't communicating what it wants from you very clearly at all.
I assume mypy will not complain about
stats = collections.Counter[str]()then.
Correct, that will make mypy happy (#110398 (comment))
Tools/scripts/summarize_stats.py
Outdated
| self["_defines"] = get_defines(Path("Python") / "specialize.c") | ||
|
|
||
| @property | ||
| def defines(self) -> Defines: |
There was a problem hiding this comment.
If you wrap the dict, then this can be a normal attribute.
There was a problem hiding this comment.
Yes, but since this value should be saved, it's convenient for it to be on the dictionary which is dumped to JSON.
Tools/scripts/summarize_stats.py
Outdated
| ] | ||
| stats["_stats_defines"] = get_stats_defines() | ||
| stats["_defines"] = get_defines() | ||
| class CountPer: |
There was a problem hiding this comment.
Why not use a Ratio, or just a float?
There was a problem hiding this comment.
Because my secret plan is to also introduce CSV output, where we would want to format this differently.
There was a problem hiding this comment.
And also a Ratio is rendered as a percentage, a CountPer is rendered as an integer. "Uops run per trace" is much better represented as an integer rather than a percentage.
There was a problem hiding this comment.
I think uops per trace is more naturally a float than an int. I agree that 30[.0] is better than 3000% though.
Maybe add a percent: bool=True argument to Ratio's __init__?
Tools/scripts/summarize_stats.py
Outdated
| comparative: bool = True, | ||
| ): | ||
| self.title = title | ||
| if not summary: |
There was a problem hiding this comment.
If the summary is explicitly "", it is ignored. Keeping the default as None seems better.
Tools/scripts/summarize_stats.py
Outdated
| print(file=out) | ||
|
|
||
|
|
||
| class FixedTable(Table): |
There was a problem hiding this comment.
I don't like all these subclasses of Table, it mixes up the responsibilities of creating the contents and the formatting.
Having a single Table class which does the formatting and takes the contents as a parameter to its __init__ would separate the responsibilities better.
So instead of ExecutionCountTable("uops"), it would be Table("uops", get_uop_counts())
There was a problem hiding this comment.
It's more complicated than that. Every table knows how to generate a single set of results, and then also supports combining two tables for comparative results. This usually is straightforward, but for some tables (e.g. execution count table) that behavior needs to be overridden. But I'll take a look at doing all of this in a more functional way -- I'm a little worried we'll end up back where we started, though.
There was a problem hiding this comment.
The data can support merging, etc. get_uop_counts() can return an object that supports that functionality.
I just think that separating formating from data manipulation will make things more maintainable.
This refactors summarize_stats so that the comparative tables are easier to make and use more common code.
4fde5b4 to
901b952
Compare
|
I have modified this to:
|
|
I think what bothers me with this PR is that the data processing is mixed in with the data storage. This can be a problem with OOP. There is nothing wrong with objects that enhance and structure the data, but data processing should be separated, otherwise the code can be hard to follow. I think we need the following:
Each "function" above doesn't have to one function, but should be independent of the others. By all means use objects (this is Python, not Haskell), but I'd recommend trying to stick to functional(ish) principles:
With that, the base pipeline (raw stats files to markdown) would look something like: def main():
# Process args and find folders, etc.
raw_stats = gather_stats(stats_dir)
stats = structure_stats(raw_stats)
output_stats(stats, outfile)The A This PR contains the following comment:
That should be a method on the Stats (or Section): Rows: TypeAlias = list[tuple[str]]
def to_rows(self) -> Rows:
... |
Indeed, it isn't. There are 4 separate layers:
I agree 100%, but I think this refactor does that.
Wouldn't that be more of a combining of processing and presentation? I think what would address your concerns is:
|
|
@markshannon: I've largely moved the data processing inside of the |
|
Have you checked that the latest commit produces the same output as main? |
Yes, for single datasets. For comparative the results are different, but due to bugfixes. |
This refactors summarize_stats so that the comparative tables are easier to make and use more common code.
Reviewing this as a diff may be rather difficult -- instead maybe just look at the file verbatim.
Tools/scripts/summarize_stats.py#110019