Mark Needham

Thoughts on Software Development

Archive for the ‘Python’ Category

Python: Squashing ‘duplicate’ pairs together

with 5 comments

As part of a data cleaning pipeline I had pairs of ids of duplicate addresses that I wanted to group together.

I couldn’t immediately work out how to solve the problem, so I simplified it to pairs of letters i.e.

A	B		(A is the same as B)
B	C		(B is the same as C)
C	D		...
E	F		(E is the same as F)
F	G		...

The output that I want to get is:

(A, B, C, D)
(E, F, G)

I spent several hours trying to come up with a clever data structure to do this until Reshmee suggested tracking the sets of duplicates using an array of arrays, or a list of lists since we’re going to script this in Python.

The actual data is in a CSV file but we’ll create a list of tuples to save ourselves some work:

pairs = [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ]

We’re going to iterate through the list of pairs and on each iteration we’ll check if there’s an entry in the list containing either of the values. There can be three outcomes from this check:

  1. No entry – we’ll add a new entry with our pair of values.
  2. One entry – we’ll add the other value to that entry.
  3. Two entries – we’ll merge them together replacing the existing entry.

The first step is to write a function to check the list of lists for a matching pair:

def find_matching_index(pair, dups):
    return [index
            for index, dup in enumerate(dups)
            if pair[0] in dup or pair[1] in dup]
 
print find_matching_index(("A", "B"), [set(["D", "E"])])
[]
 
print find_matching_index(("B", "C"), [set(["A", "B"])])
[0]
 
print find_matching_index(("B", "C"), [set(["A", "B"]), set(["C", "D"])])
[0, 1]

Next we need to write a function which iterates over all our pairs of values and uses find_matching_index to work out which decision to make:

def extract_groups(items):
    dups = []
    for pair in items:
        matching_index = find_matching_index(pair, dups)
 
        if len(matching_index) == 0:
            dups.append(set([pair[0], pair[1]]))
        elif len(matching_index) == 1:
            index = matching_index[0]
            matching_dup = dups[index]
            dups.pop(index)
            dups.append(matching_dup.union([pair[0], pair[1]]))
        else:
            index1, index2 = matching_index
            dup1 = dups[index1]
            dup2 = dups[index2]
 
            dups.pop(index1)
            dups.pop(index2 - 1) # index2 has shifted down by one because we removed an earlier entry on the previous line
            dups.append(dup1.union(dup2))
    return dups

Now let’s run this with a few test cases:

test_cases = [
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ],
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G"), ("G", "A"), ("G", "Z"), ("B", "D") ],
    [ ("A", "B"), ("B", "C"), ("C", "E"), ("E", "A") ],
    [ ("A", "B"), ("C", "D"), ("F", "G"), ("H", "I"), ("J", "A") ]
]
 
for test_case in test_cases:
    print extract_groups(test_case)
 
[set(['A', 'C', 'B', 'D']), set(['E', 'G', 'F'])]
[set(['A', 'C', 'B', 'E', 'D', 'G', 'F', 'Z'])]
[set(['A', 'C', 'B', 'E'])]
[set(['C', 'D']), set(['G', 'F']), set(['I', 'H']), set(['A', 'J', 'B'])]

This certainly doesn’t scale very well but since I only have a few hundred duplicate addresses it does the job for me.

It feels like there should be a more functional way to write these functions without mutating all these lists but I haven’t figured out what that is yet.
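One possible direction, sketched below rather than actually used for the cleaning job, is a small union-find (disjoint set) structure: it avoids rescanning the list of lists for every pair and should cope better as the number of pairs grows. The function name and layout here are purely illustrative:

def extract_groups_union_find(pairs):
    # parent maps each value to its current representative
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())

pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G")]
print extract_groups_union_find(pairs)
# [set(['A', 'C', 'B', 'D']), set(['E', 'G', 'F'])]   (ordering may vary)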

Written by Mark Needham

December 20th, 2015 at 12:12 pm

Posted in Python


Python: Parsing a JSON HTTP chunking stream

without comments

I’ve been playing around with meetup.com’s API again and this time wanted to consume the chunked HTTP RSVP stream and filter RSVPs for events I’m interested in.

I use Python for most of my hacking these days and, if HTTP requests are required, the requests library is my first port of call.

I started out with the following script:

import requests
import json
 
def stream_meetup_initial():
    uri = "http://stream.meetup.com/2/rsvps"
    response = requests.get(uri, stream = True)
    for chunk in response.iter_content(chunk_size = None):
        yield chunk
 
for raw_rsvp in stream_meetup_initial():
    print raw_rsvp
    try:
        rsvp = json.loads(raw_rsvp)
    except ValueError as e:
        print e
        continue

This mostly worked but I also noticed the following error from time to time:

No JSON object could be decoded

Less frequently, I also saw errors suggesting I was trying to parse an incomplete JSON object, so I tweaked the function to keep a local buffer and only yield it once a chunk ended in a newline character:

def stream_meetup_newline():
    uri = "http://stream.meetup.com/2/rsvps"
    response = requests.get(uri, stream = True)
    buffer = ""
    for chunk in response.iter_content(chunk_size = 1):
        if chunk.endswith("\n"):
            buffer += chunk
            yield buffer
            buffer = ""
        else:
            buffer += chunk

This mostly works although I’m sure I’ve seen some occasions where two JSON objects were being yielded and then the call to ‘json.loads’ failed. I haven’t been able to reproduce that though.
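One tweak that should guard against that case is to split the buffer on newlines and yield each complete line separately. This is a sketch along the same lines as the function above (the name stream_meetup_lines is made up):

import requests

def stream_meetup_lines():
    uri = "http://stream.meetup.com/2/rsvps"
    response = requests.get(uri, stream = True)
    buffer = ""
    for chunk in response.iter_content(chunk_size = 1):
        buffer += chunk
        # a chunk could in theory contain more than one newline, so keep
        # splitting until only a partial line remains in the buffer
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line:
                yield line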

A second read through the requests documentation made me realise I hadn’t read it very carefully the first time since we can make our lives much easier by using ‘iter_lines’ rather than ‘iter_content’:

r = requests.get('http://stream.meetup.com/2/rsvps', stream=True)
for raw_rsvp in r.iter_lines():
    if raw_rsvp:
        rsvp = json.loads(raw_rsvp)
        print rsvp

We can then process ‘rsvp’, filtering out the ones we’re interested in.
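For example, assuming the RSVP JSON exposes a ‘group’ object with a ‘group_name’ field (worth checking against the actual payload), the filtering could look something like this:

import requests
import json

interesting_groups = ["Neo4j - London User Group"]  # hypothetical group name

r = requests.get('http://stream.meetup.com/2/rsvps', stream=True)
for raw_rsvp in r.iter_lines():
    if not raw_rsvp:
        continue
    rsvp = json.loads(raw_rsvp)
    # keep only RSVPs for groups we care about
    if rsvp.get("group", {}).get("group_name") in interesting_groups:
        print rsvp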

Written by Mark Needham

November 28th, 2015 at 1:56 pm

Posted in Python


Python: Extracting Excel spreadsheet into CSV files

without comments

I’ve been playing around with the Road Safety open data set and the download comes with several CSV files and an Excel spreadsheet containing the legend.

There are 45 sheets in total, each containing a simple lookup of codes and their labels (a few samples are shown as CSV further down).

I wanted to create a CSV file for each sheet so that I could import the data set into Neo4j using the LOAD CSV command.

I came across the Python Excel website which pointed me at the xlrd library, since I’m working with a pre-2010 Excel file.


The main documentation is very extensive but I found the GitHub example much easier to follow.

I ended up with the following script which iterates through all but the first two sheets in the spreadsheet – the first two sheets contain instructions rather than data:

from xlrd import open_workbook
import csv
 
wb = open_workbook('Road-Accident-Safety-Data-Guide-1979-2004.xls')
 
for i in range(2, wb.nsheets):
    sheet = wb.sheet_by_index(i)
    print sheet.name
    with open("data/%s.csv" %(sheet.name.replace(" ","")), "w") as file:
        writer = csv.writer(file, delimiter = ",")
        print sheet, sheet.name, sheet.ncols, sheet.nrows
 
        header = [cell.value for cell in sheet.row(0)]
        writer.writerow(header)
 
        for row_idx in range(1, sheet.nrows):
            row = [int(cell.value) if isinstance(cell.value, float) else cell.value
                   for cell in sheet.row(row_idx)]
            writer.writerow(row)

I’ve replaced spaces in the sheet names so that the file names on disk are a bit easier to work with. The numeric values all come back as floats, since Excel stores numbers as floating point, whereas I wanted them as ints, so I had to apply that transformation explicitly.
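An alternative to checking isinstance(value, float) is to ask xlrd for the cell type via cell.ctype and convert only genuine number cells. A sketch of the same row-building step using that approach:

import xlrd

def clean_row(sheet, row_idx):
    row = []
    for cell in sheet.row(row_idx):
        if cell.ctype == xlrd.XL_CELL_NUMBER:
            row.append(int(cell.value))  # Excel stores all numbers as floats
        else:
            row.append(cell.value)
    return row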

Here are a few examples of what the CSV files look like:

$ cat data/1stPointofImpact.csv
code,label
0,Did not impact
1,Front
2,Back
3,Offside
4,Nearside
-1,Data missing or out of range
 
$ cat data/RoadType.csv
code,label
1,Roundabout
2,One way street
3,Dual carriageway
6,Single carriageway
7,Slip road
9,Unknown
12,One way street/Slip road
-1,Data missing or out of range
 
$ cat data/Weather.csv
code,label
1,Fine no high winds
2,Raining no high winds
3,Snowing no high winds
4,Fine + high winds
5,Raining + high winds
6,Snowing + high winds
7,Fog or mist
8,Other
9,Unknown
-1,Data missing or out of range

And that’s it. Not too difficult!

Written by Mark Needham

August 19th, 2015 at 11:27 pm

Posted in Python


Python: Difference between two datetimes in milliseconds

without comments

I’ve been doing a bit of ad hoc measurement of some Cypher queries executed via py2neo and wanted to work out how many milliseconds each query was taking end to end.

I thought there’d be an obvious way of doing this but, if there is, it’s evaded me so far. I ended up calculating the difference between two datetime objects, which gave me the following timedelta object:

>>> import datetime
>>> start = datetime.datetime.now()
>>> end = datetime.datetime.now()
 
>>> end - start
datetime.timedelta(0, 3, 519319)

The 3 parts of this object are ‘days’, ‘seconds’ and ‘microseconds’ which I found quite strange!

These are the methods/attributes we have available to us:

>>> dir(end - start)
['__abs__', '__add__', '__class__', '__delattr__', '__div__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__pos__', '__radd__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmul__', '__rsub__', '__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', 'days', 'max', 'microseconds', 'min', 'resolution', 'seconds', 'total_seconds']

There’s no ‘milliseconds’ on there so we’ll have to calculate it from what we do have:

>>> diff = end - start
>>> elapsed_ms = (diff.days * 86400000) + (diff.seconds * 1000) + (diff.microseconds / 1000)
 
>>> elapsed_ms
3519

Or we could do the following slightly simpler calculation:

>>> diff.total_seconds() * 1000
3519.319
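If this comes up a lot it might be worth wrapping the pattern in a little helper. A sketch only, and time_it is an invented name rather than anything from py2neo:

import datetime

def time_it(fn, *args, **kwargs):
    start = datetime.datetime.now()
    result = fn(*args, **kwargs)
    elapsed_ms = (datetime.datetime.now() - start).total_seconds() * 1000
    return result, elapsed_ms

# usage (hypothetical): result, ms = time_it(run_query, "MATCH (n) RETURN count(n)")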

And now back to the query profiling!

Written by Mark Needham

July 28th, 2015 at 8:05 pm

Posted in Python


Python: UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xe2 in position 0: ordinal not in range(128)

with one comment

I was recently doing some text scrubbing and had difficulty working out how to remove the ‘†’ character from strings.

e.g. I had a string like this:

>>> u'foo †'
u'foo \u2020'

I wanted to get rid of the ‘†’ character and then strip any trailing spaces so I’d end up with the string ‘foo’. I tried to do this in one call to ‘replace’:

>>> u'foo †'.replace(" †", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

It took me a while to work out that the ‘ †’ literal was a plain byte string which Python was trying to decode as ASCII, rather than UTF-8, when combining it with the unicode string. Let’s fix that by using a unicode literal:

>>> u'foo †'.replace(u' †', "")
u'foo'

I think the following call to unicode, which I’ve written about before, is equivalent:

>>> u'foo †'.replace(unicode(' †', "utf-8"), "")
u'foo'
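Another Python 2 spelling of the same thing is to decode the byte string explicitly:

>>> u'foo †'.replace(' †'.decode("utf-8"), "")
u'foo'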

Now back to the scrubbing!

Written by Mark Needham

July 15th, 2015 at 6:20 am

Posted in Python


Python: Converting WordPress posts in CSV format

without comments

Over the weekend I wanted to look into the WordPress data behind this blog (very meta!) and get it into CSV format so I could do some analysis in R.


I found a couple of WordPress CSV plugins but unfortunately I couldn’t get any of them to work, so I ended up working with the raw XML data that WordPress produces when you ‘export’ a blog.

I had the problem of the export being incomplete, which I ‘solved’ by exporting the posts in two parts of a few years each.

I then spent quite a few hours struggling to get the data into shape using R’s rvest library, but eventually decided to do the scraping using Python’s BeautifulSoup and save the results to a CSV file for analysis in R.

The structure of the XML that we want to extract is as follows:

<rss version="2.0"
	xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:wp="http://wordpress.org/export/1.2/"
>
...
    <channel>
		<item>
		<title>First thoughts on Ruby...</title>
		<link>http://www.markhneedham.com/blog/2006/08/29/first-thoughts-on-ruby/</link>
		<pubDate>Tue, 29 Aug 2006 13:31:05 +0000</pubDate>
...

I wrote the following script to parse the files:

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser
 
import csv
 
def read_page(page):
    return BeautifulSoup(open(page, 'r').read())
 
with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])
 
    for row in select(read_page("part2.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])
 
    for row in select(read_page("part1.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])

We end up with a CSV file that looks like this:

$ head -n 10 posts.csv
title,date
Functional C#: Writing a 'partition' function,2010-02-01 23:34:02+00:00
Coding: Wrapping/not wrapping 3rd party libraries and DSLs,2010-02-02 23:54:21+00:00
Functional C#: LINQ vs Method chaining,2010-02-05 18:06:28+00:00
F#: function keyword,2010-02-07 02:54:13+00:00
Willed vs Forced designs,2010-02-08 22:48:05+00:00
Functional C#: Extracting a higher order function with generics,2010-02-08 23:17:47+00:00
Javascript: File encoding when using string.replace,2010-02-10 00:02:02+00:00
F#: Inline functions and statically resolved type parameters,2010-02-10 23:06:14+00:00
Javascript: Passing functions around with call and apply,2010-02-12 20:18:02+00:00

Let’s quickly look over the data in R and check it’s been exported correctly:

require(dplyr)
require(lubridate)
 
df = read.csv("posts.csv")
 
> df %>% count()
Source: local data frame [1 x 1]
 
     n
1 1501

So we’ve exported 1,501 posts. Cross-checking against the WordPress dashboard shows that’s two more than the number of published posts: we’ve gained two extra posts! A bit more exploration of the dashboard reveals that there are actually 2 draft posts lying around.

We probably want to remove those from the export and luckily there’s a ‘status’ tag for each post that we can check. We want to make sure it doesn’t have the value ‘draft’:

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser
 
import csv
 
def read_page(page):
    return BeautifulSoup(open(page, 'r').read())
 
with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])
 
    for row in select(read_page("part2.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])
 
    for row in select(read_page("part1.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])

I also had to check whether that tag actually existed, since there were a couple of posts which didn’t have it but had been published. If we check the resulting CSV file in R we can see that we’ve now got just the published posts:

> df = read.csv("posts.csv")
> df %>% count()
Source: local data frame [1 x 1]
 
     n
1 1499
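As an aside, the two per-file loops in the script are identical, so they could be pulled into a small helper. A sketch of that refactor, reusing read_page and the imports from the script above (write_posts is an invented name):

def write_posts(writer, page):
    for row in select(read_page(page), "item"):
        status = row.find("wp:status")
        if (not status) or status.text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])

with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])
    for page in ["part2.xml", "part1.xml"]:
        write_posts(writer, page)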

Now we’re ready to test a couple of hypotheses that I have but that’s for another post!

Written by Mark Needham

July 7th, 2015 at 6:28 am

Posted in Python


Python: CSV writing – TypeError: ‘builtin_function_or_method’ object has no attribute ‘__getitem__’

without comments

When I’m working in Python I often find myself writing to CSV files using the built-in csv library, and every now and then I make a mistake when calling writerow:

import csv
writer = csv.writer(file, delimiter=",")
writer.writerow["player", "team"]

This results in the following error message:

TypeError: 'builtin_function_or_method' object has no attribute '__getitem__'

The error message is a bit weird at first, but it’s basically saying that I’ve tried to do an item lookup on an object (the writerow method) which doesn’t support that operation.
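The __getitem__ in the message is there because square brackets are shorthand for an item lookup via that method, and the bound writerow method simply doesn’t have one. We can see that directly:

>>> writer.writerow.__getitem__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'builtin_function_or_method' object has no attribute '__getitem__'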

The resolution is simply to include the appropriate parentheses instead of leaving them out!

writer.writerow(["player", "team"])

This one’s for future Mark.

Written by Mark Needham

May 31st, 2015 at 10:33 pm

Posted in Python


Python: Look ahead multiple elements in an iterator/generator

without comments

As part of the BBC live text scraping code I’ve been working on, I needed to take an iterator of raw events created by a generator and transform it into an iterator of cards shown in a match.

The structure of the raw events I’m interested in is as follows:

  • Line 1: Player booked
  • Line 2: Player fouled
  • Line 3: Information about the foul

e.g.

events = [
  {'event': u'Booking     Pedro (Barcelona) is shown the yellow card for a bad foul.', 'sortable_time': 5083, 'match_id': '32683310', 'formatted_time': u'84:43'}, 
  {'event': u'Rafinha (FC Bayern M\xfcnchen) wins a free kick on the right wing.', 'sortable_time': 5078, 'match_id': '32683310', 'formatted_time': u'84:38'}, 
  {'event': u'Foul by Pedro (Barcelona).', 'sortable_time': 5078, 'match_id': '32683310', 'formatted_time': u'84:38'}
]

We want to take these 3 raw events and create one ‘booking event’. We therefore need to have access to all 3 lines at the same time.

I started with the following function:

def cards(events):
    events = iter(events)
 
    item = events.next()
    next = events.next()
    event_id = 0
 
    for next_next in events:
        event = item["event"]
        booking = re.findall("Booking.*", event)
        if booking:
            player = re.findall("Booking([^(]*)", event)[0].strip()
            team = re.findall("Booking([^(]*) \((.*)\)", event)[0][1]
 
            associated_foul = [x for x in [(next, event_id+1), (next_next, event_id+2)]
                                 if re.findall("Foul by.*", x[0]["event"])]
 
            if associated_foul:
                associated_foul = associated_foul[0]
                yield event_id, associated_foul[1], player, team, item, "yellow"
            else:
                yield event_id, "", player, team, item, "yellow"
 
        item = next
        next = next_next
        event_id += 1

If we run the sample events through it we’d see this output:

>>> for card_id, associated_foul, player, team, item, card_type in cards(iter(events)):
      print card_id, associated_foul, player, team, item, card_type
 
0 2 Pedro Barcelona {'match_id': '32683310', 'event': u'Booking     Pedro (Barcelona) is shown the yellow card for a bad foul.', 'formatted_time': u'84:43', 'sortable_time': 5083} yellow

In retrospect, it’s a bit of a hacky way of moving a window of 3 items over the iterator of raw events and yielding a ‘booking’ event if the regex matches.

I thought there was probably a better way of doing this using itertools and indeed there is!

First we need to create a window function which will return us an iterator of tuples containing consecutive events:

from itertools import tee, izip
 
def window(iterable, size):
    iters = tee(iterable, size)
    for i in xrange(1, size):
        for each in iters[i:]:
            next(each, None)
    return izip(*iters)

Now let’s have a look at how it works using a simple example:

>>> numbers = iter(range(0,10))
>>> for triple in window(numbers, 3):
      print triple
 
(0, 1, 2)
(1, 2, 3)
(2, 3, 4)
(3, 4, 5)
(4, 5, 6)
(5, 6, 7)
(6, 7, 8)
(7, 8, 9)

Exactly what we need. Now let’s plug the window function into our cards function:

def cards(events):
    events = iter(events)
    for event_id, triple in enumerate(window(events, 3)):
        item = triple[0]
        event = triple[0]["event"]
 
        booking = re.findall("Booking.*", event)
        if booking:
            player = re.findall("Booking([^(]*)", event)[0].strip()
            team = re.findall("Booking([^(]*) \((.*)\)", event)[0][1]
 
            associated_foul = [x for x in [(triple[1], event_id+1), (triple[2], event_id+2)]
                                 if re.findall("Foul by.*", x[0]["event"])]
 
            if associated_foul:
                associated_foul = associated_foul[0]
                yield event_id, associated_foul[1], player, team, item, "yellow"
            else:
                yield event_id, "", player, team, item, "yellow"

And finally check it still processes events correctly:

>>> for card_id, associated_foul, player, team, item, card_type in cards(iter(events)):
      print card_id, associated_foul, player, team, item, card_type
 
0 2 Pedro Barcelona {'match_id': '32683310', 'event': u'Booking     Pedro (Barcelona) is shown the yellow card for a bad foul.', 'formatted_time': u'84:43', 'sortable_time': 5083} yellow

Written by Mark Needham

May 28th, 2015 at 8:56 pm

Posted in Python


Python: Joining multiple generators/iterators

with 2 comments

In my previous blog post I described how I’d refactored some scraping code I’ve been working on to use iterators and ended up with a function which returned a generator containing all the events for one BBC live text match:

match_id = "32683310"
events = extract_events("data/raw/%s" % (match_id))
 
>>> print type(events)
<type 'generator'>

The next thing I wanted to do was get the events for multiple matches, which meant I needed to glue multiple generators together into one big generator.

itertools’ chain function does exactly what we want:

itertools.chain(*iterables)

Make an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted. Used for treating consecutive sequences as a single sequence.

First let’s try it out on a collection of range generators:

import itertools
gens = [(n*2 for n in range(0, 3)), (n*2 for n in range(4,7))]
>>> gens
[<generator object <genexpr> at 0x10ff3b140>, <generator object <genexpr> at 0x10ff7d870>]
 
output = itertools.chain()
for gen in gens:
  output = itertools.chain(output, gen)

Now if we iterate through ‘output’ we’d expect to see the doubled values from both generators i.e. 0, 2 and 4 followed by 8, 10 and 12:

>>> for item in output:
...   print item
...
0
2
4
8
10
12

Exactly as we expected! Our scraping code looks like this once we plug the chaining in:

matches = ["32683310", "32683303", "32384894", "31816155"]
 
raw_events = itertools.chain()
for match_id in matches:
    raw_events = itertools.chain(raw_events, extract_events("data/raw/%s" % (match_id)))

‘raw_events’ now contains a single generator that we can iterate through and process the events for all matches.
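As an aside, itertools also provides chain.from_iterable, which lazily flattens one level of nesting and would let us drop the re-chaining loop altogether. A sketch of the same wiring, reusing extract_events from before:

import itertools

matches = ["32683310", "32683303", "32384894", "31816155"]

raw_events = itertools.chain.from_iterable(
    extract_events("data/raw/%s" % (match_id)) for match_id in matches)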

Written by Mark Needham

May 24th, 2015 at 11:51 pm

Posted in Python


Python: Refactoring to iterator

without comments

Over the last week I’ve been building a set of scripts to scrape the events from the Bayern Munich/Barcelona game and I’ve ended up with a few hundred lines of nested for statements, if statements and mutated lists. I thought it was about time I did a bit of refactoring.

The following is a function which takes in a match file and spits out a collection of dictionaries containing times & events.

import bs4
import re
from bs4 import BeautifulSoup
from soupselect import select
 
def extract_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
 
    all_events = []
    for event in select(soup, 'div#live-text-commentary-wrapper div#live-text'):
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())
 
    for event in select(soup, 'div#live-text-commentary-wrapper div#more-live-text'):
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())
 
    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

We call it like this:

match_id = "32683310"
for event in extract_events("data/%s" % (match_id))[:10]:
    print event

The file we’re loading is the Bayern Munich vs Barcelona match HTML file, which I have saved locally. Once we’ve read that into Beautiful Soup we locate the two divs on the page which contain the match events.

We then iterate over that list and create a new list containing (time, event) pairs which we return.

I think we should be able to get to our resulting collection without persisting an intermediate list, but first things first – let’s remove the duplicated for loops:

def extract_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
 
    all_events = []
    events = select(soup, 'div#live-text-commentary-wrapper div#live-text')
    more_events = select(soup, 'div#live-text-commentary-wrapper div#more-live-text')
 
    for event in events + more_events:
        for child in event.children:
            if type(child) is bs4.element.Tag:
                all_events.append(child.getText().strip())
 
    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

The next step is to refactor towards using an iterator. After a bit of reading I realised a generator would make life even easier.

I created a function which returned an iterator of the raw events and plugged that into the original function:

def raw_events(file):
    match = open(file, 'r')
    soup = BeautifulSoup(match.read())
    events = select(soup, 'div#live-text-commentary-wrapper div#live-text')
    more_events = select(soup, 'div#live-text-commentary-wrapper div#more-live-text')
    for event in events + more_events:
        for child in event.children:
            if type(child) is bs4.element.Tag:
                yield child.getText().strip()
 
def extract_events(file):
    all_events = list(raw_events(file))
 
    timed_events = []
    for i in range(0, len(all_events)):
        event = all_events[i]
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            timed_events.append({'time': formatted_time, 'event': all_events[i+1]})
    return timed_events

If we run that function we still get the same output as before which is good. Now we need to work out how to clean up the second bit of the code which groups the appropriate rows together.

The goal is that ‘extract_events’ returns an iterator rather than a list – we need to figure out how to iterate over the output of ‘raw_events’ in such a way that when we find a ‘time row’ we can yield that and the row immediately after.

Luckily I found a Stack Overflow post explaining that you can call ‘next’ on the iterator from inside the generator to achieve this:

def extract_events(file):
    events = raw_events(file)
    for event in events:
        time =  re.findall("\d{1,2}:\d{2}", event)
        formatted_time = " +".join(time)
        if time:
            yield {'time': formatted_time, 'event': next(events)}

It’s not that much less code than the original function but I think it’s an improvement. Any thoughts/tips to simplify it further are always welcome.

Written by Mark Needham

May 23rd, 2015 at 10:14 am

Posted in Python
