gh-110019: Refactor summarize_stats by mdboom · Pull Request #110398 · python/cpython

mdboom · 2023-10-05T13:55:17Z

This refactors summarize_stats so that the comparative tables are easier to make and use more common code.

Reviewing this as a diff may be rather difficult -- instead maybe just look at the file verbatim.

Issue: Refactor to reduce code duplication in Tools/scripts/summarize_stats.py #110019

markshannon

Thanks for this. This file definitely needed refactoring.

I have a few comments. I think a more functional (as opposed to OO) style would help clarity, but the general design seems sound.

markshannon · 2023-10-05T14:44:01Z

Tools/scripts/summarize_stats.py

-        a_ncols = list(set(len(x) for x in a_rows))
-        if len(a_ncols) != 1:
-            raise ValueError("Table a is ragged")
+class Stats(dict):


Inheriting from builtin collections can be awkward.
Could you wrap the dict?
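
As a rough illustration of the suggestion, wrapping could look something like the sketch below. The class body and the method names (`get`, `to_json`) are hypothetical and not taken from the PR; only the idea of holding a dict instead of inheriting from `dict` comes from the comment above.

```python
import json
from typing import Any


class Stats:
    """Hypothetical wrapper: holds a dict rather than inheriting from dict."""

    def __init__(self, data: dict[str, Any]) -> None:
        # Copy so later changes to the caller's dict do not leak in.
        self._data = dict(data)

    def get(self, key: str, default: Any = 0) -> Any:
        return self._data.get(key, default)

    def to_json(self) -> str:
        # Only the wrapped dict is serialized.
        return json.dumps(self._data, sort_keys=True)


# The key below is illustrative only.
stats = Stats({"opcode[LOAD_FAST].execution_count": 10})
```

The wrapper exposes only the operations it chooses to, so callers cannot rely on the full `dict` interface by accident.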

markshannon · 2023-10-05T14:44:05Z

Tools/scripts/summarize_stats.py

-    else:
-        ncols = b_ncols[0]
+        elif input.is_dir():
+            stats: collections.Counter = collections.Counter()


This type annotation seems redundant

Yep, but mypy requires it :(

Don't use mypy then?

Is there a Mypy issue for this?

This should make mypy happy and Mark less unhappy:

Suggested change

-            stats: collections.Counter = collections.Counter()
+            stats = collections.Counter[str]()

Don't use mypy then?

Is there a Mypy issue for this?

It's not a mypy bug. Pyright will complain at you in exactly the same way about this. Mypy can't tell what kind of items are going to be stored as keys for the Counter, so it demands an explicit annotation. @mdboom's annotation shuts mypy up, because collections.Counter as a type annotation is equivalent to collections.Counter[Any]. But the better solution is to use collections.Counter[str], because the keys of stats are all strings.

It is a mypy bug.
Omitting the annotation and providing collections.Counter as an annotation has exactly the same information content. So complaining about one and not the other is erroneous.
I assume mypy will not complain about stats = collections.Counter[str]() then.

I agree there's a usability bug here; mypy isn't communicating what it wants from you very clearly at all.

I assume mypy will not complain about stats = collections.Counter[str]() then.

Correct, that will make mypy happy (#110398 (comment))
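
For readers unfamiliar with the suggested spelling, here is a small sketch of how it behaves at runtime, assuming Python 3.9 or later (where `collections.Counter` can be subscripted). The opcode names used as keys are illustrative only.

```python
import collections

# Subscripting the class gives a generic alias; calling it builds a Counter.
stats = collections.Counter[str]()
stats.update(["LOAD_FAST", "LOAD_FAST", "STORE_FAST"])

# At runtime this is an ordinary Counter; the [str] only informs the checker.
is_plain_counter = type(stats) is collections.Counter
load_fast_count = stats["LOAD_FAST"]
```

The type parameter has no effect on the stored data; it only tells a checker that the keys are strings.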

markshannon · 2023-10-05T14:45:29Z

Tools/scripts/summarize_stats.py

+        self["_defines"] = get_defines(Path("Python") / "specialize.c")

+    @property
+    def defines(self) -> Defines:


If you wrap the dict, then this can be a normal attribute.

Yes, but since this value should be saved, it's convenient for it to be on the dictionary which is dumped to JSON.

markshannon · 2023-10-05T14:50:25Z

Tools/scripts/summarize_stats.py

-        ]
-        stats["_stats_defines"] = get_stats_defines()
-        stats["_defines"] = get_defines()
+class CountPer:


Why not use a Ratio, or just a float?

Because my secret plan is to also introduce CSV output, where we would want to format this differently.

And also a Ratio is rendered as a percentage, a CountPer is rendered as an integer. "Uops run per trace" is much better represented as an integer rather than a percentage.

I think uops per trace is more naturally a float than an int. I agree that 30[.0] is better than 3000% though.
Maybe add a percent: bool=True argument to Ratio's __init__?
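
A minimal sketch of that idea follows. This `Ratio` is a hypothetical stand-in written for illustration, not the class from the script; the field names and the `markdown` method are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ratio:
    """Hypothetical stand-in for the script's Ratio, not the real class."""

    num: int
    den: Optional[int]
    percent: bool = True

    def markdown(self) -> str:
        # A missing or zero denominator yields an empty cell.
        if not self.den:
            return ""
        value = self.num / self.den
        if self.percent:
            return f"{value:.1%}"
        return f"{value:.1f}"


# 3000 uops over 100 traces: a float reads better than a percentage here.
uops_per_trace = Ratio(3000, 100, percent=False).markdown()
as_percentage = Ratio(3000, 100).markdown()
```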

markshannon · 2023-10-05T14:53:41Z

Tools/scripts/summarize_stats.py

+        comparative: bool = True,
+    ):
+        self.title = title
+        if not summary:


If the summary is explicitly "", it is ignored. Keeping the default as None seems better.
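
The difference can be sketched as below. The `Section` class here is a reduced, hypothetical version, and the fallback to `title.lower()` is an assumption made for the example.

```python
from typing import Optional


class Section:
    """Hypothetical, reduced version of a section header."""

    def __init__(self, title: str, summary: Optional[str] = None) -> None:
        self.title = title
        # `is None` keeps an explicit "" distinct from "not provided".
        if summary is None:
            self.summary = title.lower()
        else:
            self.summary = summary


defaulted = Section("Uop stats")
explicit_empty = Section("Uop stats", "")
```

With `if not summary:` both calls would end up with the fallback text, which is the behavior the comment objects to.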

markshannon · 2023-10-05T15:00:28Z

Tools/scripts/summarize_stats.py

+        print(file=out)
+
+
+class FixedTable(Table):


I don't like all these subclasses of Table, it mixes up the responsibilities of creating the contents and the formatting.
Having a single Table class which does the formatting and takes the contents as a parameter to its __init__ would separate the responsibilities better.
So instead of ExecutionCountTable("uops"), it would be Table("uops", get_uop_counts())

It's more complicated than that. Every table knows how to generate a single set of results, and then also supports combining two tables for comparative results. This usually is straightforward, but for some tables (e.g. execution count table) that behavior needs to be overridden. But I'll take a look at doing all of this in a more functional way -- I'm a little worried we'll end up back where we started, though.

The data can support merging, etc. get_uop_counts() can return an object that supports that functionality.
I just think that separating formatting from data manipulation will make things more maintainable.
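
One way to picture this split is sketched below, assuming a data object that knows how to merge and a separate function that only formats. `Counts`, `merge`, and `format_markdown` are hypothetical names, and the uop names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical data object: knows how to merge, not how to print."""

    values: dict[str, int]

    def merge(self, other: "Counts") -> list[tuple[str, int, int]]:
        # Pair up base and head counts by name; a missing name counts as 0.
        names = sorted(self.values.keys() | other.values.keys())
        return [(n, self.values.get(n, 0), other.values.get(n, 0)) for n in names]


def format_markdown(header: tuple[str, ...], rows: list[tuple]) -> str:
    """Formatting only: turns already-computed rows into a markdown table."""
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


base = Counts({"_LOAD_FAST": 5, "_EXIT_TRACE": 1})
head = Counts({"_LOAD_FAST": 7})
table = format_markdown(("Name", "Base", "Head"), base.merge(head))
```

Here a comparative table needs no special table subclass: the merge lives with the data, and the formatter never sees anything but rows.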

mdboom · 2023-10-05T20:29:35Z

I have modified this to:

Take a more functional approach without inheritance
Separate out markdown output from data handling
Not have Stats inherit from dict

markshannon · 2023-10-06T09:29:26Z

I think what bothers me with this PR is that the data processing is mixed in with the data storage. This can be a problem with OOP.

There is nothing wrong with objects that enhance and structure the data, but data processing should be separated, otherwise the code can be hard to follow.

I think we need the following:

To read the data off disk and into one big poorly structured object ("blob") (we already have this https://github.com/python/cpython/blob/main/Tools/scripts/summarize_stats.py#L229-L244)
A structured data type, Stats.
A function to convert the "blob" to Stats
Functions/methods to save and load the Stats to a file (currently json)
A function/method to make a single Stats from the diff of two Stats
A function to output Stats to a human readable format (currently markdown)

Each "function" above doesn't have to be one function, but should be independent of the others.
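
As a sketch of how a few of these independent steps might fit together, the example below converts a "blob" to a structured type and round-trips it through JSON. Everything here is hypothetical: the single `execution_counts` field, the helper names, and the assumed key shape `opcode[NAME].execution_count`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Stats:
    """Hypothetical structured type with just one category of data."""

    execution_counts: dict[str, int]


def structure_stats(blob: dict[str, int]) -> Stats:
    # Assumed key shape "opcode[NAME].execution_count"; other keys are ignored.
    prefix, suffix = "opcode[", "].execution_count"
    counts = {
        key[len(prefix):-len(suffix)]: value
        for key, value in blob.items()
        if key.startswith(prefix) and key.endswith(suffix)
    }
    return Stats(counts)


def dumps_stats(stats: Stats) -> str:
    return json.dumps(asdict(stats), sort_keys=True)


def loads_stats(text: str) -> Stats:
    return Stats(**json.loads(text))


blob = {"opcode[LOAD_FAST].execution_count": 3, "some other key": 1}
stats = structure_stats(blob)
restored = loads_stats(dumps_stats(stats))
```

Each function can be tested without the others, which is the independence the list asks for.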

By all means use objects (this is Python, not Haskell), but I'd recommend trying to stick to functional(ish) principles:

All classes should have a simple __init__ (as would be generated for a dataclass)
No method should mutate the object
No special methods
Methods should stick to the domain of the object. So for a Section object, to_rows()->list[tuple[str]] is fine, but write_markdown() maybe not so much.
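
A small sketch of these principles, under the assumption that a frozen dataclass is an acceptable way to get a simple `__init__` and no mutation. `PairCounts` and `top` are hypothetical names, and the pair labels are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairCounts:
    """Hypothetical section data; the generated __init__ only stores fields."""

    counts: tuple[tuple[str, int], ...]

    def top(self, n: int) -> "PairCounts":
        # Returns a new object instead of sorting or truncating in place.
        ordered = sorted(self.counts, key=lambda item: item[1], reverse=True)
        return PairCounts(tuple(ordered[:n]))


pairs = PairCounts((("A B", 2), ("C D", 9), ("E F", 5)))
top_two = pairs.top(2)
```

The original `pairs` object is unchanged after the call, so it can be reused for other views of the same data.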

With that, the base pipeline (raw stats files to markdown) would look something like:

def main():
    # Process args and find folders, etc.
    raw_stats = gather_stats(stats_dir)
    stats = structure_stats(raw_stats)
    output_stats(stats, outfile)

The Stats class should be a structured data type, with the top level containing attributes for each of the top level categories in the data, execution_counts, pair_counts, predecessor_pairs, etc.
It could simply be a list of Sections, where the Sections describe the data, if that makes more sense. The markdown file is structured as a list of sections.

A Section will need to describe how to present the data as well as contain it.
For example, execution counts is a list of 3-tuples: name, count, miss. But it also needs data on how to present it:
It should be sorted by count, have a ratio and cumulative ratio column, and the miss should be presented as a percentage.
That could be a method, which converts the data to a table (where table is list[tuple[str]])
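
The presentation rules just described could be sketched as a method like the one below. `ExecutionCounts` is a hypothetical class, the miss percentage is assumed to be relative to that row's count, and the row type is spelled `tuple[str, ...]` to allow any number of string columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionCounts:
    """Hypothetical Section holding (name, count, miss) tuples."""

    data: tuple[tuple[str, int, int], ...]

    def to_rows(self) -> list[tuple[str, ...]]:
        total = sum(count for _, count, _ in self.data)
        rows = []
        cumulative = 0
        # Sort by count, largest first.
        for name, count, miss in sorted(self.data, key=lambda t: t[1], reverse=True):
            cumulative += count
            rows.append((
                name,
                f"{count:,}",
                f"{count / total:.1%}" if total else "",
                f"{cumulative / total:.1%}" if total else "",
                f"{miss / count:.1%}" if count else "",
            ))
        return rows


section = ExecutionCounts((("LOAD_ATTR", 400, 100), ("LOAD_FAST", 600, 0)))
rows = section.to_rows()
```

The method returns plain strings only, so writing markdown (or any other format) stays outside the class.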

This PR contains the following comment:

A Table defines how to convert a set of Stats into a specific set of rows displaying some aspect of the data.

That should be a method on the Stats (or Section):

Rows: TypeAlias = list[tuple[str]]
def to_rows(self) -> Rows:
    ...

mdboom · 2023-10-06T13:55:17Z

I think what bothers me with this PR is that the data processing is mixed in with the data storage.

Indeed, it isn't. There are 4 separate layers:

raw data (Stats)
abstract views of that data (Table)
organization of that data (Section)
output (output_markdown)

There is nothing wrong with objects that enhance and structure the data, but data processing should be separated, otherwise the code can be hard to follow.

I agree 100%, but I think this refactor does that.

The Stats class should be a structured data type, with the top level containing attributes for each of the top level categories in the data, execution_counts, pair_counts, predecessor_pairs, etc.

It could simply be a list of Sections, where the Sections describe the data, if that makes more sense. The markdown file is structured as a list of sections.

Wouldn't that be more of a combining of processing and presentation?

I think what would address your concerns is:

The calc_*_ functions become methods on the Stats class.
The Table class would go away.
The Section class would describe the organization of the file and how tables need to be combined.

mdboom · 2023-10-06T19:01:27Z

@markshannon: I've largely moved the data processing inside of the Stats class and the new OpcodeStats class. The Table/Section distinction is still required, since a Section may have multiple tables etc. But I hope this is closer to what you had in mind in terms of separation of concerns.

markshannon · 2023-10-23T15:31:13Z

Have you checked that the latest commit produces the same output as main?

mdboom · 2023-10-23T16:53:50Z

Have you checked that the latest commit produces the same output as main?

Yes, for single datasets. For comparative output the results differ, but that is due to bug fixes.

mdboom added the skip news label Oct 5, 2023

mdboom requested review from brandtbucher and markshannon October 5, 2023 13:55

bedevere-app bot added the awaiting review label Oct 5, 2023

bedevere-app bot mentioned this pull request Oct 5, 2023

Refactor to reduce code duplication in Tools/scripts/summarize_stats.py #110019

Closed

markshannon reviewed Oct 5, 2023

mdboom added 3 commits October 5, 2023 11:21

pythongh-110019: Refactor summarize_stats

c897328

This refactors summarize_stats so that the comparative tables are easier to make and use more common code.

Truncate histograms

4925c33

Include commas in histogram bins

901b952

mdboom force-pushed the summarize_stats-refactor branch from 4fde5b4 to 901b952 Compare October 5, 2023 15:22

mdboom added 2 commits October 5, 2023 11:42

Don't have stats be a subclass of dict

fa176e7

Use a more functional approach

193b450

mdboom requested a review from markshannon October 5, 2023 20:28

Separate data processing from display

582b1b6

Typing

a4fce82

mdboom mentioned this pull request Oct 9, 2023

gh-109329: Count tier2 miss opcodes #110561

Merged

Fix denominator

8a535fe

markshannon mentioned this pull request Oct 23, 2023

GH-111213: Fix a few broken stats #111216

Merged

markshannon merged commit 81eba76 into python:main Oct 24, 2023

bedevere-app bot removed the awaiting review label Oct 24, 2023

aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024

pythongh-110019: Refactor summarize_stats (pythonGH-110398)

b192fe6

Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024

pythongh-110019: Refactor summarize_stats (pythonGH-110398)

27ae8e0
