The magic of 2 by 2 tables

# Our usual imports
import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')
# HIDDEN
# An extra tweak to make sure we always get the same random numbers.
# Do not use this in your own code; you nearly always want an unpredictable
# stream of random numbers.  Making them predictable in this way only makes
# sense for a very limited range of things, like tutorials and tests.
np.random.seed(1938)

At the end of the noble politics page, you came across an enigma, from analyzing two by two tables of counts.

This page shows you another version of that enigma, and gives a few explanations for these strange effects.

To start with, we make a simple data frame, where each row corresponds to a person. Each person, and therefore, each row, has two labels. If you have read the Noble Politics page, you will see what inspired this example.

The first label for each person, called “Party” can be either “Yes”, or “No”.

The second label for each person, called “Respondent”, can be either “R” or “B”.

Here is our data frame from scratch:

# Create a data frame from scratch with some data.
people = pd.DataFrame([["Yes", "R"],
                       ["Yes", "R"],
                       ["Yes","R"],
                       ["No", "R"],
                       ["No", "R"],
                       ["Yes","B"],
                       ["Yes","B"],
                       ["No","B"]],
                       columns = ['Respondent', 'Party'])
# Show the result
people
Respondent Party
0 Yes R
1 Yes R
2 Yes R
3 No R
4 No R
5 Yes B
6 Yes B
7 No B

We can ask Pandas to make a 2 by 2 table of counts, for these labels. Each entry in the table gives the number of times the combination of labels occurs. For example, the cell in the “Yes” row, “R” column gives the number of rows in the original table that have a “Yes” label for “Party” and a “R” label for “Respondent”. Call this the Yes-with-R cell.

two_by_two = pd.crosstab(people['Party'], people['Respondent'])
two_by_two
Respondent No Yes
Party
B 1 2
R 2 3

You might want to confirm these counts. You should see that there are in fact 3 rows which have both the “Yes” label and the “R” label, and this matches the bottom right entry in the pd.crosstab table.

We can add the row and column totals to this table, with some fancy Pandas code (unhide the code in this page, or see the notebook version for details):

# Fancy code for adding row / column totals, giving them colors.
from matplotlib.colors import ListedColormap

PALE_GREY = ListedColormap([0.8, 0.8, 0.8])
MID_GREY = ListedColormap([0.6, 0.6, 0.6])

def with_margins(df):
    """ Data frame displayed with row, column totals added, colored.
    """
    # Add row totals to copy of data frame.
    row_margin = df.sum().astype(int).rename('Row totals')
    df_margins = df.append(row_margin)
    # Add column totals.
    df_o = df_margins.assign(**{'Col totals': df_margins.sum(axis='columns')})
    # Set colors for row, column totals.
    styler = df_o.style.background_gradient(subset='Col totals', cmap=PALE_GREY)
    styler = styler.background_gradient(subset=('Row totals',), cmap=PALE_GREY)
    styler = styler.background_gradient(subset=('Row totals', 'Col totals'),
                                                cmap=MID_GREY)
    return styler
# Display the table with colors, using fancy plotting function.
with_margins(two_by_two)
Respondent No Yes Col totals
Party      
B 1 2 3
R 2 3 5
Row totals 3 5 8

Now let’s say we want to create a random association between the “Party” (Yes/No) column and the “Respondent” (R/B) column. We can do that by shuffling (permuting) the Yes/No values, and recreating the table, like this:

parties = people['Party']
respondents = people['Respondent']
permuted_tab_1 = pd.crosstab(np.random.permutation(parties), respondents)
with_margins(permuted_tab_1)
Respondent No Yes Col totals
row_0      
B 2 1 3
R 1 4 5
Row totals 3 5 8

Note that I could have shuffled the “Respondent” labels instead, or both sets of labels - any of these operations creates what we want - which is a random association between the “Party” and “Respondent” labels.

Let’s make another few of those random tables:

permuted_tab_2 = pd.crosstab(np.random.permutation(parties), respondents)
with_margins(permuted_tab_2)
Respondent No Yes Col totals
row_0      
B 0 3 3
R 3 2 5
Row totals 3 5 8
permuted_tab_3 = pd.crosstab(np.random.permutation(parties), respondents)
with_margins(permuted_tab_3)
Respondent No Yes Col totals
row_0      
B 1 2 3
R 2 3 5
Row totals 3 5 8

You will see that, if the top left value goes up, or down, compared to the original table, then the bottom right value goes up, or down, by the same amount. Call the change in the top-right value, from the original c. The original value for that element was 1, so if the new value is 2, \(c = 1\). If the new value was 0, then \(c = -1\).

You will also see that the bottom left value goes down by the same amount as the top left value goes up. If the top left value goes up by \(c\), then the bottom left value does down by \(c\).

All the values in the table appear to be linked, and move in lock-step.

Notice too that the column and row totals do not — and in fact cannot — change, whatever the association of the R/B and Yes/No labels, for reasons we go into below.

This is another version of the enigma you saw at the end of the Noble Politics page.

Why are the table values linked?

Explanation 1: the row and column totals cannot change

The first explanation for the relationship between the values is — the total counts over the rows and columns are fixed.

They are fixed because they depend only on the labels, not on the relationship of the labels. So, we cannot change the counts over rows or columns by shuffling the order of either or both set of labels.

For example, consider the sum of counts over columns. In our example, the columns are for “B” and “R”, and the column sums of the counts are the total number of “B” and “R” labels.

By shuffling the order of the labels, we can change the number in the B-with-Yes cell, but we can’t change the total number of “B” labels, so, if the B-with-Yes number increases, the B-with-No number has to decrease by the same amount.

The same argument applies for the row totals.

You may now see that, if you give me the value for any one of the four elements in the table, I can fill in the rest, because they follow from the fact that I know what the rows and columns have to add up to.

Explanation 2: what goes up, must come down

Here we think about what happens as we permute the Yes/No labels. The same argument applies to permuting the R/B labels, and in fact, to permuting both the labels.

Consider the original table.

with_margins(two_by_two)
Respondent No Yes Col totals
Party      
B 1 2 3
R 2 3 5
Row totals 3 5 8

Now consider permutations in the order of the Yes/No column of the original table. We will just permute the first “Yes”, for now.

Imagine the first “Yes” gets swapped with the “Yes” in one of the other “R”, “Yes” rows. We still end up with the same two by two counts table.

Next swap the first “Yes” with the “No” in one of the “R”, “No” rows. The first row is “R”, “No” but the other row has now become “R”, “Yes”, we still have the same number of “R”, “Yes” and “R”, “No”, and the counts table doesn’t change.

Now swap the first “Yes” with the “Yes” in one of the “B”, “Yes” rows. We still have the same number of “R”, “Yes” and “B”, “Yes” rows, just in a different row order, and the counts table doesn’t change.

Last, we swap the first “Yes” with the “No” in one of the “B”, “No” rows. Now, for the first time, the counts table does change, like this:

# The data frame with first "Yes" permuted to last row.
perm_people = pd.DataFrame([["No", "R"],
                            ["Yes", "R"],
                            ["Yes","R"],
                            ["No", "R"],
                            ["No", "R"],
                            ["Yes","B"],
                            ["Yes","B"],
                            ["Yes","B"]],
                            columns = ['Respondent', 'Party'])
perm_people
Respondent Party
0 No R
1 Yes R
2 Yes R
3 No R
4 No R
5 Yes B
6 Yes B
7 Yes B

Here is the cross-tabulation for that case:

perm_crosstab = pd.crosstab(perm_people['Party'], perm_people['Respondent'])
with_margins(perm_crosstab)
Respondent No Yes Col totals
Party      
B 0 3 3
R 3 2 5
Row totals 3 5 8

When we do this single change, we have:

  • One less “R”, “Yes” row (bottom-right count goes down by 1)

  • One less “B”, “No” row (top-left count down by 1)

  • One more “R”, “No” row (bottom-left count up by 1)

  • One more “B”, “Yes” row (top-right count up by 1)

As we permute the “Yes”, “No” labels, we either leave the total number of unique “R/B”, “Yes/No” pairings unchanged, or we change the number of all the pairings at once.