Mark Needham

Thoughts on Software Development

Python: Squashing ‘duplicate’ pairs together

with 5 comments

As part of a data cleaning pipeline I had pairs of ids of duplicate addresses that I wanted to group together.

I couldn’t work out how to solve the problem immediately so I simplified the problem into pairs of letters i.e.

A	B		(A is the same as B)
B	C		(B is the same as C)
C	D		...
E	F		(E is the same as F)
F	G		...

The output that I want to get is:

(A, B, C, D)
(E, F, G)

I spent several hours trying to come up with a clever data structure to do this until Reshmee suggested tracking the sets of duplicates using an array of arrays or list of lists since we’re going to script this using Python.

The actual data is in a CSV file but we’ll create a list of tuples to save ourselves some work:

pairs = [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ]

We’re going to iterate through the list of pairs and on each iteration we’ll check if there’s an entry in the list containing either of the values. There can be three outcomes from this check:

  1. No entry – we’ll add a new entry with our pair of values.
  2. One entry – we’ll add the other value to that entry.
  3. Two entries – we’ll merge them together replacing the existing entry.

The first step is to write a function to check the list of lists for a matching pair:

def find_matching_index(pair, dups):
    return [index
            for index, dup in enumerate(dups)
            if pair[0] in dup or pair[1] in dup]
print find_matching_index(("A", "B"), [set(["D", "E"])])
print find_matching_index(("B", "C"), [set(["A", "B"])])
print find_matching_index(("B", "C"), [set(["A", "B"]), set(["C", "D"])])
[0, 1]

Next we need to write a function which iterates over all our pairs of values and uses find_matching_index to work out which decision to make:

def extract_groups(items):
    dups = []
    for pair in items:
        matching_index = find_matching_index(pair, dups)
        if len(matching_index) == 0:
            dups.append(set([pair[0], pair[1]]))
        elif len(matching_index) == 1:
            index = matching_index[0]
            matching_dup = dups[index]
            dups.append(matching_dup.union([pair[0], pair[1]]))
            index1, index2 = matching_index
            dup1 = dups[index1]
            dup2 = dups[index2]
            dups.pop(index2 - 1) # the index decrements since we removed one entry on the previous line
    return dups

Now let’s run this with a few test cases:

test_cases = [
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ],
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G"), ("G", "A"), ("G", "Z"), ("B", "D") ],
    [ ("A", "B"), ("B", "C"), ("C", "E"), ("E", "A") ],
    [ ("A", "B"), ("C", "D"), ("F", "G"), ("H", "I"), ("J", "A") ]
for test_case in test_cases:
    print extract_groups(test_case)
[set(['A', 'C', 'B', 'D']), set(['E', 'G', 'F'])]
[set(['A', 'C', 'B', 'E', 'D', 'G', 'F', 'Z'])]
[set(['A', 'C', 'B', 'E'])]
[set(['C', 'D']), set(['G', 'F']), set(['I', 'H']), set(['A', 'J', 'B'])]

This certainly doesn’t scale very well but since I only have a few hundred duplicate addresses it does the job for me.

It feels like there should be a more functional way to write these functions without mutating all these lists but I haven’t figured out what that is yet.

Be Sociable, Share!

Written by Mark Needham

December 20th, 2015 at 12:12 pm

Posted in Python

Tagged with

  • Bill Cooperman

    This problem is actually just finding connected components in a graph. Each adress is a node, and there is an edge between them if they represent the same place.

    DFS can solve this in O(# edges).

  • @billcooperman:disqus Hah, you’re right! And after all this time working with graphs I didn’t spot that 🙂

  • cybervegan

    Here’s an alternative approach (sorry the comment software is screwing with indents, so I’ve used underscores instead). I also assume your expected output was actually intended to be (‘A’, ‘B’, ‘C’, ‘D’), (‘E’, ‘F’, ‘G’) and not (‘A’, ‘B’, ‘C’, ‘F’), (‘E’, ‘F’, ‘G’) as nowhere in your test data does F pair with A, B, or C.

    pairs = [ (“A”, “B”), (“B”, “C”), (“C”, “D”), (“E”, “F”), (“F”, “G”) ]
    similar = dict(pairs)

    def chain(i):
    ____if similar[i] in similar:
    ________return [ i ] + chain(similar[i])
    ____return [i, similar[i] ]

    print chain(“A”)
    [‘A’, ‘B’, ‘C’, ‘D’]


    for a,b in pairs:
    ____if a not in answers:
    ____for i in c:
    ____answers[i] = c

    print set( tuple(ans) for ans in answers.values() )
    set([(‘A’, ‘B’, ‘C’, ‘D’), (‘E’, ‘F’, ‘G’)])

  • Good spot. Have fixed that typo

  • wampastompah

    When this was posted on the /r/programming subreddit I was tempted to throw my hat into the ring. My solution is O(n^2) but since you only have a couple hundred pairs, it should be fine.

    Not sure about formatting in comments here, so here’s a link to my comment on reddit: