Programming Metagame Balance Metrics - Gen 6 in Retrospective

migzoo · Dec 25, 2016

Hello friends,

Over two years ago, I authored this thread, which proposed using usage statistics to quantitatively measure the oft-used, otherwise hand-wavy concept of metagame balance, inspired by Doug's manifesto. This is the much-delayed realization of that idea, which I figured would be quite timely, given that we are currently in the transition between Generations 6 and 7.

In the last two years I have become less of a programming scrub, so I was able to write up a quick Python script to compute various inequality/diversity metrics taken from the social sciences for a given usage statistics file. I originally proposed using the Gini coefficient, a popular measure of income inequality used in economics, the idea being that income inequality is analogous to imbalance. This time around, I also looked into diversity indices, which in theory should capture both balance and variety, as defined by Doug in his post, since the "value of a diversity index increases both when the number of types increases and when evenness increases." In particular, I implemented:

Gini coefficient
True diversity
Richness: this most directly quantifies Doug's notion of "variety," since it "simply quantifies how many different types the dataset of interest contains." I propose using some epsilon minimum usage threshold when calculating richness to make it a more meaningful metric.
Shannon index
Rényi entropy, Gini-Simpson index, Berger-Parker index: these are transformations of one of the above, which I threw in for fun.
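
For anyone who wants to play with these, here is a minimal sketch of the diversity indices listed above, using their standard definitions. It is not the script from the repo: the function names are made up for illustration, and the input is assumed to be a list of usage shares that sum to 1.

```python
import math

def shannon_index(p):
    """Shannon index H = -sum(p_i * ln p_i); zero shares contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def true_diversity(p, q):
    """True diversity of order q: (sum p_i^q)^(1/(1-q)).

    q = 1 is handled as the limit, exp(Shannon index).
    """
    if q == 1:
        return math.exp(shannon_index(p))
    return sum(x ** q for x in p if x > 0) ** (1.0 / (1.0 - q))

def renyi_entropy(p, q):
    """Renyi entropy of order q, the log of the true diversity."""
    return math.log(true_diversity(p, q))

def gini_simpson(p):
    """Gini-Simpson index: 1 - sum(p_i^2)."""
    return 1.0 - sum(x * x for x in p)

def berger_parker(p):
    """Berger-Parker index: the share of the single most used type."""
    return max(p)

# Hypothetical example: four equally used Pokemon.
even = [0.25, 0.25, 0.25, 0.25]
print(true_diversity(even, 2), gini_simpson(even), berger_parker(even))
```

With perfectly even usage, true diversity equals the number of types for every q, which is a handy sanity check.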

All of the diversity indices and their computations are described on the Wikipedia page. It should be noted that I have little to no background in the social sciences, so I don't know the domain-specific subtleties and differences behind the different metrics. Also, a caveat: I worked with the json files under the chaos directories for each month of usage stats, which have a "Raw count" for each pokemon. These numbers don't match the "Raw" numbers shown in the .txt usage stat files, and I don't know why, but I just rolled with it (perhaps Antar can give some insight here?).
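
As a rough sketch of how the "Raw count" values could be pulled out and turned into shares: the only thing taken from the post is the "Raw count" field name. The top-level "data" key, the helper names, and the sample values are all assumptions for illustration, so check them against an actual chaos file.

```python
import json

def raw_counts(chaos):
    """Map each pokemon name to its "Raw count" in a parsed chaos JSON dict.

    Assumption: per-pokemon entries live under a top-level "data" key.
    """
    return {name: entry["Raw count"] for name, entry in chaos["data"].items()}

def proportions(counts):
    """Normalize raw counts so that the shares sum to 1."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Made-up sample in the assumed shape (not real usage data).
sample = json.loads(
    '{"data": {"A": {"Raw count": 60}, "B": {"Raw count": 30}, "C": {"Raw count": 10}}}'
)
shares = proportions(raw_counts(sample))
print(shares)
```

For a real file, `json.load(open(path))` would take the place of the inline sample.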

Anyways, here are some interesting findings and interpretations that I have come upon thus far:

For reasons that I don't understand, diversity is remarkably stable across cutoffs for the same tier. In this plot, for example, you can barely even tell that there are multiple lines. This is true for all tiers that I've seen. Perhaps this means that diversity is a robust/good balance metric?

Here is the diversity plot for all the main tiers side-by-side (sans LC, I guess):

Some observations:

All tiers seem to have a pretty steady upwards trend in diversity, except NU and RU, which decline/plateau in late 2015/early 2016 respectively. Idk if there is a good explanation for this, but one can interpret the upwards trend as suspect testing doing its intended job: balancing the metagame.
Ubers is always the least diverse, and PU is always the most diverse, which is as expected.
All tiers jump significantly in diversity between May and June of 2015. Again, I'm not sure why.

When looking at the Gini coefficient, we see that the lines for different cutoffs of the same tier run roughly parallel, with cutoff 0 being the least "equal" and cutoff 1760 the most. This is somewhat counterintuitive, since players at lower cutoffs are prone to not know what they are doing and use random mons, which one would expect to spread usage more evenly. Also, we can see that Gini tends to fluctuate more than diversity.
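
For reference, here is a minimal sketch of one common way to compute a Gini coefficient from a list of usage values (the sorted-rank formula). This is an illustration and not necessarily how the script in the repo does it; the function name is made up.

```python
def gini(values):
    """Gini coefficient via the sorted-rank formula.

    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted
    in ascending order and i starting at 1.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

# Hypothetical examples: perfectly even usage vs. one mon taking everything.
print(gini([1, 1, 1, 1]))
print(gini([0, 0, 0, 1]))
```

A value of 0 means perfectly even usage, and with this formula the maximum for n entries is (n - 1) / n, so a higher Gini means a less "equal" metagame.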

Here is the Gini plot for all the main tiers side-by-side, with cutoffs 0 and 1760 (1825 for OU):

We can see that PU has the lowest Gini coefficient, which is consistent with the diversity plot. Interestingly, RU has had an even higher Gini coefficient than Ubers for a while now, despite Ubers having lower diversity.

Richness, like diversity, is pretty much the same across different cutoffs for the same tier. What's somewhat interesting is the comparison of tiers for different values of epsilon:

This is the plot for epsilon = 0.01, i.e. the number of pokemon with usage above 1% for each tier.

This is the plot for epsilon = 0.2, i.e. the number of pokemon with usage above 20% for each tier.

We see that Ubers has relatively high richness in the first plot, but is in the middle for the second. This suggests that there are a few highly viable pokemon, and a decent number of somewhat viable ones. The fact that PU has very low richness in the second plot, i.e. there are few pokemon with high usage, makes sense since PU is a relatively undeveloped metagame with a lot of useless pokemon, so only a handful are known to be viable. Also, RU has the lowest richness in the first plot but one of the highest in the second. This suggests that there is a clear divide between the viable and unviable pokemon in the RU metagame.
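
The thresholded richness described above boils down to counting entries above epsilon. A minimal sketch, with a made-up function name and made-up usage numbers (not real stats):

```python
def richness(usage, epsilon=0.0):
    """Count the pokemon whose usage is strictly above epsilon."""
    return sum(1 for u in usage.values() if u > epsilon)

# Hypothetical usage fractions for a five-mon "tier".
usage = {"A": 0.45, "B": 0.25, "C": 0.15, "D": 0.05, "E": 0.005}

print(richness(usage, epsilon=0.01))  # mons above 1% usage
print(richness(usage, epsilon=0.2))   # mons above 20% usage
```

Sweeping epsilon over a range of values, rather than picking two, would show where each tier's count drops off.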

I encourage anyone interested to poke around and try stuff on their own – I haven't looked at LC or past gens for example. The git repo for the project has steps to run; it should be pretty straightforward. If you have any feedback, suggestions, or interpretations of plots, those are all welcome as well.

flamewizzy21 · Jan 8, 2017

This is a really cool project. Quick question: how is your diversity less than one if it's an exponential of a non-negative number?

RetroBacon · Jan 8, 2017

Look at how diverse PU is; no wonder people say it's a fun tier!

migzoo · Jan 9, 2017

flamewizzy21 said:
This is a really cool project. Quick question: how is your diversity less than one if it's an exponential of a non-negative number?

the exponent is (1/(1-q)); for q=2 the exponent is thus -1
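
For reference, assuming the standard definition of true diversity of order q (with p_i as the usage share of pokemon i and R as the number of pokemon, which is my notation rather than the script's), the q = 2 case works out as:

```latex
{}^{q}D = \left( \sum_{i=1}^{R} p_i^{\,q} \right)^{1/(1-q)}
\qquad\Longrightarrow\qquad
{}^{2}D = \left( \sum_{i=1}^{R} p_i^{2} \right)^{-1}
```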

flamewizzy21 · Jan 15, 2017

migzoo said:
the exponent is (1/(1-q)); for q=2 the exponent is thus -1

Ah, I see. I didn't realize people used different q's for diversity (my bad). I was more interested in seeing true diversity for q = 1 (limit), or specifically the exponential of Shannon entropy. This is because that should be the effective number of pokemon in the tier. I don't know how fast it would be to add this to your code, but that is what I had originally envisioned.

migzoo · Jan 15, 2017

flamewizzy21 said:
Ah, I see. I didn't realize people used different q's for diversity (my bad). I was more interested in seeing true diversity for q = 1 (limit), or specifically the exponential of Shannon entropy. This is because that should be the effective number of pokemon in the tier. I don't know how fast it would be to add this to your code, but that is what I had originally envisioned.

That's actually already implemented
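
For reference, the q = 1 case has to be taken as a limit, since the exponent 1/(1-q) is undefined there. Assuming the standard definition (p_i as the usage share of pokemon i and R as the number of pokemon, in my notation), it equals the exponential of the Shannon entropy:

```latex
{}^{1}D = \lim_{q \to 1} \left( \sum_{i=1}^{R} p_i^{\,q} \right)^{1/(1-q)}
        = \exp\!\left( -\sum_{i=1}^{R} p_i \ln p_i \right)
```

For R equally used pokemon this gives exactly R, which is why it can be read as an effective number of pokemon.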
