20 Dec 2015

Python: Squashing 'duplicate' pairs together

As part of a data cleaning pipeline I had pairs of ids of duplicate addresses that I wanted to group together.

I couldn’t work out how to solve the problem immediately so I simplified the problem into pairs of letters i.e.

A	B		(A is the same as B)
B	C		(B is the same as C)
C	D		...
E	F		(E is the same as F)
F	G		...

The output that I want to get is:

(A, B, C, D)
(E, F, G)

I spent several hours trying to come up with a clever data structure to do this until Reshmee suggested tracking the sets of duplicates using an array of arrays or list of lists since we’re going to script this using Python.

The actual data is in a CSV file but we’ll create a list of tuples to save ourselves some work:

pairs = [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ]

We’re going to iterate through the list of pairs and on each iteration we’ll check if there’s an entry in the list containing either of the values. There can be three outcomes from this check:

No entry - we’ll add a new entry with our pair of values.
One entry - we’ll add the other value to that entry.
Two entries - we’ll merge them together replacing the existing entry.

The first step is to write a function to check the list of lists for a matching pair:

def find_matching_index(pair, dups):
    return [index
            for index, dup in enumerate(dups)
            if pair[0] in dup or pair[1] in dup]

print find_matching_index(("A", "B"), [set(["D", "E"])])
[]

print find_matching_index(("B", "C"), [set(["A", "B"])])
[0]

print find_matching_index(("B", "C"), [set(["A", "B"]), set(["C", "D"])])
[0, 1]

Next we need to write a function which iterates over all our pairs of values and uses find_matching_index to work out which decision to make:

def extract_groups(items):
    dups = []
    for pair in items:
        matching_index = find_matching_index(pair, dups)

        if len(matching_index) == 0:
            dups.append(set([pair[0], pair[1]]))
        elif len(matching_index) == 1:
            index = matching_index[0]
            matching_dup = dups[index]
            dups.pop(index)
            dups.append(matching_dup.union([pair[0], pair[1]]))
        else:
            index1, index2 = matching_index
            dup1 = dups[index1]
            dup2 = dups[index2]

            dups.pop(index1)
            dups.pop(index2 - 1) # the index decrements since we removed one entry on the previous line
            dups.append(dup1.union(dup2))
    return dups

Now let’s run this with a few test cases:

test_cases = [
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ],
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G"), ("G", "A"), ("G", "Z"), ("B", "D") ],
    [ ("A", "B"), ("B", "C"), ("C", "E"), ("E", "A") ],
    [ ("A", "B"), ("C", "D"), ("F", "G"), ("H", "I"), ("J", "A") ]
]

for test_case in test_cases:
    print extract_groups(test_case)

[set(['A', 'C', 'B', 'D']), set(['E', 'G', 'F'])]
[set(['A', 'C', 'B', 'E', 'D', 'G', 'F', 'Z'])]
[set(['A', 'C', 'B', 'E'])]
[set(['C', 'D']), set(['G', 'F']), set(['I', 'H']), set(['A', 'J', 'B'])]

This certainly doesn’t scale very well but since I only have a few hundred duplicate addresses it does the job for me.

It feels like there should be a more functional way to write these functions without mutating all these lists but I haven’t figured out what that is yet.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.