Mark Needham

Thoughts on Software Development

Archive for the ‘Python’ Category

Python: Regex – matching foreign characters/unicode letters

without comments

I’ve been back in the land of screen scraping this week, extracting data from the Game of Thrones wiki, and needed to write a regular expression to pull out characters and actors.

Here are some examples of the format of the data:

Peter Dinklage as Tyrion Lannister
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
Filip Lozić as Young Nobleman
Morgan C. Jones as a Braavosi captain
Adewale Akinnuoye-Agbaje as Malko

So the pattern is:

<actor> as <character>

optionally followed by some other text that we’re not interested in.

The output I want to get is:

Peter Dinklage, Tyrion Lannister
Daniel Naprous, Oznak zo Pahl
Filip Lozić, Young Nobleman
Morgan C. Jones, a Braavosi captain
Adewale Akinnuoye-Agbaje, Malko

I started using the ‘split’ command on the word ‘as’ but that broke down when I realised some of the characters had the letters ‘as’ in the middle of their name. So regex it is!

This was my first attempt:

import re
 
strings = [
    "Peter Dinklage as Tyrion Lannister",
    "Filip Lozić as Young Nobleman",
    "Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)",
    "Morgan C. Jones as a Braavosi captain",
    "Adewale Akinnuoye-Agbaje as Malko"
]
 
regex = "([A-Za-z\-'\. ]*) as ([A-Za-z\-'\. ]*)"
 
for string in strings:
    print string
    match = re.match(regex, string)
    if match is not None:
        print match.groups()
    else:
        print "FAIL"
    print ""
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Filip Lozić as Young Nobleman
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

It works for 4 of the 5 scenarios but not for Filip Lozić. The ‘ć’ character causes the issue: we need to be able to match foreign characters, which the character set I defined in the regex doesn’t cover.

I came across this Stack Overflow post which said that in some regex libraries you can use ‘\p{L}’ to match all letters. I gave that a try:

regex = "([\p{L}\-'\. ]*) as ([\p{L}\-'\. ]*)"

And then re-ran the script:

Peter Dinklage as Tyrion Lannister
FAIL
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
FAIL
 
Filip Lozić as Young Nobleman
FAIL
 
Morgan C. Jones as a Braavosi captain
FAIL
 
Adewale Akinnuoye-Agbaje as Malko
FAIL

Hmmm, not sure if I did it wrong or if that isn’t available in Python. I’ll assume the latter but feel free to correct me in the comments and I’ll update the post.
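For what it’s worth, the third-party regex module (pip install regex) does support ‘\p{L}’ as far as I know, so a sketch along these lines ought to work. Note this isn’t what I ran at the time:

import regex
 
pattern = regex.compile(ur"([\p{L}\-'\. ]*) as ([\p{L}\-'\. ]*)")
print pattern.match(u"Filip Lozi\u0107 as Young Nobleman").groups()
# (u'Filip Lozi\u0107', u'Young Nobleman')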

I went searching again and found this post which suggested another approach:

You can construct a new character class:

[^\W\d_]

instead of \w. Translated into English, it means “Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore”.

Let’s try plugging that in:

regex = "([A-Za-z\-'\.^\W\d_ ]*) as ([A-Za-z\-'\.^\W\d_ ]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous as Oznak zo Pahl(credited', 'Stunt Performer)')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

So that’s fixed Filip but now Daniel Naprous is being incorrectly parsed. That makes sense in hindsight: the ‘^’ isn’t a negation when it appears in the middle of a character class, and ‘\W’ matches any non-word character, so the class now also matches ‘(’ and the first group greedily runs on to the last ‘ as ’ in the line. (It’s also why Filip works: the bytes of ‘ć’ count as non-word characters.)

For Attempt #4 I decided to try excluding what I don’t want instead:

regex = "([^0-9\(]*) as ([^0-9\(]*)"
Peter Dinklage as Tyrion Lannister
('Peter Dinklage', 'Tyrion Lannister')
 
Daniel Naprous as Oznak zo Pahl(credited as Stunt Performer)
('Daniel Naprous', 'Oznak zo Pahl')
 
Filip Lozić as Young Nobleman
('Filip Lozi\xc4\x87', 'Young Nobleman')
 
Morgan C. Jones as a Braavosi captain
('Morgan C. Jones', 'a Braavosi captain')
 
Adewale Akinnuoye-Agbaje as Malko
('Adewale Akinnuoye-Agbaje', 'Malko')

That does the job but has exposed my lack of regex skillz. If you know a better way let me know in the comments.
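For reference, the standard library’s re can also cope with the accented characters if everything is decoded to unicode and the re.UNICODE flag is used, so that ‘\w’ covers letters like ‘ć’. A minimal sketch, not what I used above:

import re
 
unicode_regex = re.compile(ur"([\w\-'. ]+) as ([\w\-'. ]+)", re.UNICODE)
print unicode_regex.match(u"Filip Lozi\u0107 as Young Nobleman").groups()
# (u'Filip Lozi\u0107', u'Young Nobleman')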

Written by Mark Needham

June 18th, 2016 at 7:38 am

Posted in Python


Python: Squashing ‘duplicate’ pairs together

with 5 comments

As part of a data cleaning pipeline I had pairs of ids of duplicate addresses that I wanted to group together.

I couldn’t work out how to solve the problem immediately so I simplified it to pairs of letters i.e.

A	B		(A is the same as B)
B	C		(B is the same as C)
C	D		...
E	F		(E is the same as F)
F	G		...

The output that I want to get is:

(A, B, C, D)
(E, F, G)

I spent several hours trying to come up with a clever data structure to do this until Reshmee suggested tracking the sets of duplicates using an array of arrays or list of lists since we’re going to script this using Python.

The actual data is in a CSV file but we’ll create a list of tuples to save ourselves some work:

pairs = [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ]

We’re going to iterate through the list of pairs and on each iteration we’ll check if there’s an entry in the list containing either of the values. There can be three outcomes from this check:

  1. No entry – we’ll add a new entry with our pair of values.
  2. One entry – we’ll add the other value to that entry.
  3. Two entries – we’ll merge them together replacing the existing entry.

The first step is to write a function to check the list of lists for a matching pair:

def find_matching_index(pair, dups):
    return [index
            for index, dup in enumerate(dups)
            if pair[0] in dup or pair[1] in dup]
 
print find_matching_index(("A", "B"), [set(["D", "E"])])
[]
 
print find_matching_index(("B", "C"), [set(["A", "B"])])
[0]
 
print find_matching_index(("B", "C"), [set(["A", "B"]), set(["C", "D"])])
[0, 1]

Next we need to write a function which iterates over all our pairs of values and uses find_matching_index to work out which decision to make:

def extract_groups(items):
    dups = []
    for pair in items:
        matching_index = find_matching_index(pair, dups)
 
        if len(matching_index) == 0:
            dups.append(set([pair[0], pair[1]]))
        elif len(matching_index) == 1:
            index = matching_index[0]
            matching_dup = dups[index]
            dups.pop(index)
            dups.append(matching_dup.union([pair[0], pair[1]]))
        else:
            index1, index2 = matching_index
            dup1 = dups[index1]
            dup2 = dups[index2]
 
            dups.pop(index1)
            dups.pop(index2 - 1) # the index decrements since we removed one entry on the previous line
            dups.append(dup1.union(dup2))
    return dups

Now let’s run this with a few test cases:

test_cases = [
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G") ],
    [ ("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G"), ("G", "A"), ("G", "Z"), ("B", "D") ],
    [ ("A", "B"), ("B", "C"), ("C", "E"), ("E", "A") ],
    [ ("A", "B"), ("C", "D"), ("F", "G"), ("H", "I"), ("J", "A") ]
]
 
for test_case in test_cases:
    print extract_groups(test_case)
 
[set(['A', 'C', 'B', 'D']), set(['E', 'G', 'F'])]
[set(['A', 'C', 'B', 'E', 'D', 'G', 'F', 'Z'])]
[set(['A', 'C', 'B', 'E'])]
[set(['C', 'D']), set(['G', 'F']), set(['I', 'H']), set(['A', 'J', 'B'])]

This certainly doesn’t scale very well but since I only have a few hundred duplicate addresses it does the job for me.

It feels like there should be a more functional way to write these functions without mutating all these lists but I haven’t figured out what that is yet.
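One alternative, just as a sketch rather than what I actually used, is to treat each pair as an edge and track the groups with a union-find (disjoint set) structure, which avoids repeatedly scanning and rebuilding the list:

def extract_groups_union_find(items):
    parent = {}
 
    def find(x):
        # walk up to the root of x's group, halving the path as we go
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
 
    def union(a, b):
        # point one root at the other so the two groups merge
        parent[find(a)] = find(b)
 
    for a, b in items:
        union(a, b)
 
    # collect the members of each root into a set
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return groups.values()
 
print extract_groups_union_find([("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("F", "G")])
# [set(['A', 'C', 'B', 'D']), set(['E', 'G', 'F'])] (ordering may differ)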

Written by Mark Needham

December 20th, 2015 at 12:12 pm

Posted in Python


Python: Parsing a JSON HTTP chunking stream

without comments

I’ve been playing around with meetup.com’s API again and this time wanted to consume the chunked HTTP RSVP stream and filter RSVPs for events I’m interested in.

I use Python for most of my hacking these days and if HTTP requests are required the requests library is my first port of call.

I started out with the following script:

import requests
import json
 
def stream_meetup_initial():
    uri = "http://stream.meetup.com/2/rsvps"
    response = requests.get(uri, stream = True)
    for chunk in response.iter_content(chunk_size = None):
        yield chunk
 
for raw_rsvp in stream_meetup_initial():
    print raw_rsvp
    try:
        rsvp = json.loads(raw_rsvp)
    except ValueError as e:
        print e
        continue

This mostly worked but I also noticed the following error from time to time:

No JSON object could be decoded

Although less frequent, I also saw errors suggesting I was trying to parse an incomplete JSON object. I tweaked the function to keep a local buffer and only yield that if the chunk ended in a new line character:

def stream_meetup_newline():
    uri = "http://stream.meetup.com/2/rsvps"
    response = requests.get(uri, stream = True)
    buffer = ""
    for chunk in response.iter_content(chunk_size = 1):
        if chunk.endswith("\n"):
            buffer += chunk
            yield buffer
            buffer = ""
        else:
            buffer += chunk

This mostly works although I’m sure I’ve seen some occasions where two JSON objects were being yielded and then the call to ‘json.loads’ failed. I haven’t been able to reproduce that though.
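If that is what was happening, splitting the buffer on new line characters rather than yielding it whole would cope with a chunk that contains more than one object. A sketch, since I couldn’t reproduce the problem to verify it:

import requests
 
def stream_meetup_split_lines():
    uri = "http://stream.meetup.com/2/rsvps"
    response = requests.get(uri, stream = True)
    buffer = ""
    for chunk in response.iter_content(chunk_size = 1):
        buffer += chunk
        # yield each complete line separately in case one chunk holds several objects
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line:
                yield line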

A second read through the requests documentation made me realise I hadn’t read it very carefully the first time: we can make our lives much easier by using ‘iter_lines’ rather than ‘iter_content’:

r = requests.get('http://stream.meetup.com/2/rsvps', stream=True)
for raw_rsvp in r.iter_lines():
    if raw_rsvp:
        rsvp = json.loads(raw_rsvp)
        print rsvp

We can then process ‘rsvp’, filtering out the ones we’re interested in.
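For example, filtering on the group name might turn that last loop into something like this. The ‘group’/‘group_name’ field names are from memory rather than checked against the docs, so treat them as an assumption:

interesting_groups = ["Neo4j - London User Group"]  # a made up example group
 
for raw_rsvp in r.iter_lines():
    if raw_rsvp:
        rsvp = json.loads(raw_rsvp)
        # keep only RSVPs for groups we care about
        if rsvp.get("group", {}).get("group_name") in interesting_groups:
            print rsvp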

Written by Mark Needham

November 28th, 2015 at 1:56 pm

Posted in Python


Python: Extracting Excel spreadsheet into CSV files

without comments

I’ve been playing around with the Road Safety open data set and the download comes with several CSV files and an Excel spreadsheet containing the legend.

There are 45 sheets in total and each of them looks like this:

[screenshot of one of the spreadsheet sheets]

I wanted to create a CSV file for each sheet so that I can import the data set into Neo4j using the LOAD CSV command.

I came across the Python Excel website which pointed me at the xlrd library since I’m working with an older .xls Excel file.


The main documentation is very extensive but I found the github example much easier to follow.

I ended up with the following script which iterates through all but the first two sheets in the spreadsheet – the first two sheets contain instructions rather than data:

from xlrd import open_workbook
import csv
 
wb = open_workbook('Road-Accident-Safety-Data-Guide-1979-2004.xls')
 
for i in range(2, wb.nsheets):
    sheet = wb.sheet_by_index(i)
    print sheet.name
    with open("data/%s.csv" %(sheet.name.replace(" ","")), "w") as file:
        writer = csv.writer(file, delimiter = ",")
        print sheet, sheet.name, sheet.ncols, sheet.nrows
 
        header = [cell.value for cell in sheet.row(0)]
        writer.writerow(header)
 
        for row_idx in range(1, sheet.nrows):
            row = [int(cell.value) if isinstance(cell.value, float) else cell.value
                   for cell in sheet.row(row_idx)]
            writer.writerow(row)

I’ve replaced spaces in the sheet name so that the file name on disk is a bit easier to work with. xlrd reads the numeric values as floats whereas I wanted them as ints, so I had to explicitly apply that transformation.
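A slightly more defensive version of that conversion uses xlrd’s cell types and only converts whole numbers back to ints. A sketch (reusing sheet and row_idx from the script above), in case a sheet ever contains genuine decimal values:

import xlrd
 
def clean_value(cell):
    # xlrd reads every numeric cell as a float; only turn the whole ones back into ints
    if cell.ctype == xlrd.XL_CELL_NUMBER and cell.value == int(cell.value):
        return int(cell.value)
    return cell.value
 
row = [clean_value(cell) for cell in sheet.row(row_idx)]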

Here are a few examples of what the CSV files look like:

$ cat data/1stPointofImpact.csv
code,label
0,Did not impact
1,Front
2,Back
3,Offside
4,Nearside
-1,Data missing or out of range
 
$ cat data/RoadType.csv
code,label
1,Roundabout
2,One way street
3,Dual carriageway
6,Single carriageway
7,Slip road
9,Unknown
12,One way street/Slip road
-1,Data missing or out of range
 
$ cat data/Weather.csv
code,label
1,Fine no high winds
2,Raining no high winds
3,Snowing no high winds
4,Fine + high winds
5,Raining + high winds
6,Snowing + high winds
7,Fog or mist
8,Other
9,Unknown
-1,Data missing or out of range

And that’s it. Not too difficult!

Written by Mark Needham

August 19th, 2015 at 11:27 pm

Posted in Python


Python: Difference between two datetimes in milliseconds

without comments

I’ve been doing a bit of ad hoc measurement of some Cypher queries executed via py2neo and wanted to work out how many milliseconds each query was taking end to end.

I thought there’d be an obvious way of doing this but if there is it’s evaded me so far, so I ended up calculating the difference between two datetime objects, which gave me the following timedelta object:

>>> import datetime
>>> start = datetime.datetime.now()
>>> end = datetime.datetime.now()
 
>>> end - start
datetime.timedelta(0, 3, 519319)

The 3 parts of this object are ‘days’, ‘seconds’ and ‘microseconds’ which I found quite strange!

These are the methods/attributes we have available to us:

>>> dir(end - start)
['__abs__', '__add__', '__class__', '__delattr__', '__div__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__pos__', '__radd__', '__rdiv__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmul__', '__rsub__', '__setattr__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', 'days', 'max', 'microseconds', 'min', 'resolution', 'seconds', 'total_seconds']

There’s no ‘milliseconds’ on there so we’ll have to calculate it from what we do have:

>>> diff = end - start
>>> elapsed_ms = (diff.days * 86400000) + (diff.seconds * 1000) + (diff.microseconds / 1000)
 
>>> elapsed_ms
3519

Or we could do the following slightly simpler calculation:

>>> diff.total_seconds() * 1000
3519.319
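Wrapped up as a helper for the profiling loop it might look something like this (a sketch rather than the exact code I ended up with):

import datetime
 
def duration_in_ms(start, end):
    # total_seconds() already accounts for days, seconds and microseconds
    return (end - start).total_seconds() * 1000
 
start = datetime.datetime.now()
# ... run the query here ...
end = datetime.datetime.now()
print "%.0f ms" % duration_in_ms(start, end)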

And now back to the query profiling!

Written by Mark Needham

July 28th, 2015 at 8:05 pm

Posted in Python


Python: UnicodeDecodeError: ‘ascii’ codec can’t decode byte 0xe2 in position 0: ordinal not in range(128)

with one comment

I was recently doing some text scrubbing and had difficulty working out how to remove the ‘†’ character from strings.

e.g. I had a string like this:

>>> u'foo †'
u'foo \u2020'

I wanted to get rid of the ‘†’ character and then strip any trailing spaces so I’d end up with the string ‘foo’. I tried to do this in one call to ‘replace’:

>>> u'foo †'.replace(" †", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

It took me a while to work out that the plain string ‘ †’ was being decoded as ASCII rather than UTF-8 when mixed with the unicode string. Let’s fix that by using a unicode literal:

>>> u'foo †'.replace(u' †', "")
u'foo'

I think the following call to unicode, which I’ve written about before, is equivalent:

>>> u'foo †'.replace(unicode(' †', "utf-8"), "")
u'foo'
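Another option that side-steps the source encoding altogether is to use the character’s escape sequence (the dagger is U+2020) and strip the trailing space afterwards:

>>> u'foo †'.replace(u'\u2020', '').strip()
u'foo'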

Now back to the scrubbing!

Written by Mark Needham

July 15th, 2015 at 6:20 am

Posted in Python


Python: Converting WordPress posts in CSV format

without comments

Over the weekend I wanted to look into the WordPress data behind this blog (very meta!) and wanted to get the data in CSV format so I could do some analysis in R.


I found a couple of WordPress CSV plugins but unfortunately I couldn’t get any of them to work and ended up working with the raw XML data that WordPress produces when you ‘export’ a blog.

I had the problem of the export being incomplete, which I ‘solved’ by exporting the posts in two parts of a few years each.

I then spent quite a few hours struggling to get the data into shape using R’s rvest library but eventually decided to do the scraping using Python’s beautifulsoup and save it to a CSV file for analysis in R.

The structure of the XML that we want to extract is as follows:

<rss version="2.0"
	xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:wp="http://wordpress.org/export/1.2/"
>
...
    <channel>
		<item>
		<title>First thoughts on Ruby...</title>
		<link>http://www.markhneedham.com/blog/2006/08/29/first-thoughts-on-ruby/</link>
		<pubDate>Tue, 29 Aug 2006 13:31:05 +0000</pubDate>
...

I wrote the following script to parse the files:

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser
 
import csv
 
def read_page(page):
    return BeautifulSoup(open(page, 'r').read())
 
with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])
 
    for row in select(read_page("part2.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])
 
    for row in select(read_page("part1.xml"), "item"):
        title = select(row, "title")[0].text.encode("utf-8")
        date = parser.parse(select(row, "pubdate")[0].text)
        writer.writerow([title, date])

We end up with a CSV file that looks like this:

$ head -n 10 posts.csv
title,date
Functional C#: Writing a 'partition' function,2010-02-01 23:34:02+00:00
Coding: Wrapping/not wrapping 3rd party libraries and DSLs,2010-02-02 23:54:21+00:00
Functional C#: LINQ vs Method chaining,2010-02-05 18:06:28+00:00
F#: function keyword,2010-02-07 02:54:13+00:00
Willed vs Forced designs,2010-02-08 22:48:05+00:00
Functional C#: Extracting a higher order function with generics,2010-02-08 23:17:47+00:00
Javascript: File encoding when using string.replace,2010-02-10 00:02:02+00:00
F#: Inline functions and statically resolved type parameters,2010-02-10 23:06:14+00:00
Javascript: Passing functions around with call and apply,2010-02-12 20:18:02+00:00

Let’s quickly look over the data in R and check it’s being correctly exported:

require(dplyr)
require(lubridate)
 
df = read.csv("posts.csv")
 
> df %>% count()
Source: local data frame [1 x 1]
 
     n
1 1501

So we’ve exported 1501 posts. Let’s cross check with the WordPress dashboard:

[screenshot of the WordPress dashboard post count]

We’ve gained two extra posts! A bit more exploration of the WordPress dashboard reveals that there are actually 2 draft posts lying around.

We probably want to remove those from the export and luckily there’s a ‘status’ tag for each post that we can check. We want to make sure it doesn’t have the value ‘draft’:

from bs4 import BeautifulSoup
from soupselect import select
from dateutil import parser
 
import csv
 
def read_page(page):
    return BeautifulSoup(open(page, 'r').read())
 
with open("posts.csv", "w") as file:
    writer = csv.writer(file, delimiter=",")
    writer.writerow(["title", "date"])
 
    for row in select(read_page("part2.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])
 
    for row in select(read_page("part1.xml"), "item"):
        if (not row.find("wp:status")) or row.find("wp:status").text != "draft":
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])
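The two loops only differ in the file they read, so they could be collapsed into one. A sketch that would sit inside the same ‘with’ block:

    for page in ["part2.xml", "part1.xml"]:
        for row in select(read_page(page), "item"):
            status = row.find("wp:status")
            if status is not None and status.text == "draft":
                continue
            title = select(row, "title")[0].text.encode("utf-8")
            date = parser.parse(select(row, "pubdate")[0].text)
            writer.writerow([title, date])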

I also had to check if that tag actually existed since there were a couple of posts which didn’t have it but had been published. If we check the resulting CSV file in R we can see that we’ve now got all the posts:

> df = read.csv("posts.csv")
> df %>% count()
Source: local data frame [1 x 1]
 
     n
1 1499

Now we’re ready to test a couple of hypotheses that I have but that’s for another post!

Written by Mark Needham

July 7th, 2015 at 6:28 am

Posted in Python


Python: CSV writing – TypeError: ‘builtin_function_or_method’ object has no attribute ‘__getitem__’

without comments

When I’m working in Python I often find myself writing to CSV files using the built-in csv library, and every now and then I make a mistake when calling writerow:

import csv
writer = csv.writer(file, delimiter=",")
writer.writerow["player", "team"]

This results in the following error message:

TypeError: 'builtin_function_or_method' object has no attribute '__getitem__'

The error message is a bit weird at first but it’s basically saying that I’ve tried to index (‘__getitem__’) an object which doesn’t support that operation.

The resolution is simply to include the appropriate parentheses instead of leaving them out!

writer.writerow(["player", "team"])

This one’s for future Mark.

Written by Mark Needham

May 31st, 2015 at 10:33 pm

Posted in Python


Python: Look ahead multiple elements in an iterator/generator

without comments

As part of the BBC live text scraping code I’ve been working on I needed to take an iterator of raw events created by a generator and transform this into an iterator of cards shown in a match.

The structure of the raw events I’m interested in is as follows:

  • Line 1: Player booked
  • Line 2: Player fouled
  • Line 3: Information about the foul

e.g.

events = [
  {'event': u'Booking     Pedro (Barcelona) is shown the yellow card for a bad foul.', 'sortable_time': 5083, 'match_id': '32683310', 'formatted_time': u'84:43'}, 
  {'event': u'Rafinha (FC Bayern M\xfcnchen) wins a free kick on the right wing.', 'sortable_time': 5078, 'match_id': '32683310', 'formatted_time': u'84:38'}, 
  {'event': u'Foul by Pedro (Barcelona).', 'sortable_time': 5078, 'match_id': '32683310', 'formatted_time': u'84:38'}
]

We want to take these 3 raw events and create one ‘booking event’. We therefore need to have access to all 3 lines at the same time.

I started with the following function:

def cards(events):
    events = iter(events)
 
    item = events.next()
    next = events.next()
    event_id = 0
 
    for next_next in events:
        event = item["event"]
        booking = re.findall("Booking.*", event)
        if booking:
            player = re.findall("Booking([^(]*)", event)[0].strip()
            team = re.findall("Booking([^(]*) \((.*)\)", event)[0][1]
 
            associated_foul = [x for x in [(next, event_id+1), (next_next, event_id+2)]
                                 if re.findall("Foul by.*", x[0]["event"])]
 
            if associated_foul:
                associated_foul = associated_foul[0]
                yield event_id, associated_foul[1], player, team, item, "yellow"
            else:
                yield event_id, "", player, team, item, "yellow"
 
        item = next
        next = next_next
        event_id += 1

If we run the sample events through it we’d see this output:

>>> for card_id, associated_foul, player, team, item, card_type in cards(iter(events)):
      print card_id, associated_foul, player, team, item, card_type
 
0 2 Pedro Barcelona {'match_id': '32683310', 'event': u'Booking     Pedro (Barcelona) is shown the yellow card for a bad foul.', 'formatted_time': u'84:43', 'sortable_time': 5083} yellow

In retrospect, it’s a bit of a hacky way of moving a window of 3 items over the iterator of raw events and yielding a ‘booking’ event if the regex matches.

I thought there was probably a better way of doing this using itertools and indeed there is!

First we need to create a window function which will return us an iterator of tuples containing consecutive events:

from itertools import tee, izip
 
def window(iterable, size):
    iters = tee(iterable, size)
    for i in xrange(1, size):
        for each in iters[i:]:
            next(each, None)
    return izip(*iters)

Now let’s have a look at how it works using a simple example:

>>> numbers = iter(range(0,10))
>>> for triple in window(numbers, 3):
      print triple
 
(0, 1, 2)
(1, 2, 3)
(2, 3, 4)
(3, 4, 5)
(4, 5, 6)
(5, 6, 7)
(6, 7, 8)
(7, 8, 9)

Exactly what we need. Now let’s plug the window function into our cards function:

def cards(events):
    events = iter(events)
    for event_id, triple in enumerate(window(events, 3)):
        item = triple[0]
        event = triple[0]["event"]
 
        booking = re.findall("Booking.*", event)
        if booking:
            player = re.findall("Booking([^(]*)", event)[0].strip()
            team = re.findall("Booking([^(]*) \((.*)\)", event)[0][1]
 
            associated_foul = [x for x in [(triple[1], event_id+1), (triple[2], event_id+2)]
                                 if re.findall("Foul by.*", x[0]["event"])]
 
            if associated_foul:
                associated_foul = associated_foul[0]
                yield event_id, associated_foul[1], player, team, item, "yellow"
            else:
                yield event_id, "", player, team, item, "yellow"

And finally check it still processes events correctly:

>>> for card_id, associated_foul, player, team, item, card_type in cards(iter(events)):
      print card_id, associated_foul, player, team, item, card_type
 
0 2 Pedro Barcelona {'match_id': '32683310', 'event': u'Booking     Pedro (Barcelona) is shown the yellow card for a bad foul.', 'formatted_time': u'84:43', 'sortable_time': 5083} yellow
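As an aside, the same sliding window can be built with a collections.deque instead of tee/izip. A sketch I haven’t plugged into the scraper:

from collections import deque
from itertools import islice
 
def window_with_deque(iterable, size):
    it = iter(iterable)
    win = deque(islice(it, size), maxlen = size)
    if len(win) == size:
        yield tuple(win)
    for item in it:
        # the deque drops the oldest item automatically as each new one arrives
        win.append(item)
        yield tuple(win)
 
for triple in window_with_deque(range(0, 10), 3):
    print triple
# (0, 1, 2) up to (7, 8, 9), the same output as before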

Written by Mark Needham

May 28th, 2015 at 8:56 pm

Posted in Python


Python: Joining multiple generators/iterators

with 2 comments

In my previous blog post I described how I’d refactored some scraping code I’ve been working on to use iterators and ended up with a function which returned a generator containing all the events for one BBC live text match:

match_id = "32683310"
events = extract_events("data/raw/%s" % (match_id))
 
>>> print type(events)
<type 'generator'>

The next thing I wanted to do is get the events for multiple matches which meant I needed to glue together multiple generators into one big generator.

itertools’ chain function does exactly what we want:

itertools.chain(*iterables)

Make an iterator that returns elements from the first iterable until it is exhausted, then proceeds to the next iterable, until all of the iterables are exhausted. Used for treating consecutive sequences as a single sequence.

First let’s try it out on a collection of range generators:

import itertools
gens = [(n*2 for n in range(0, 3)), (n*2 for n in range(4,7))]
>>> gens
[<generator object <genexpr> at 0x10ff3b140>, <generator object <genexpr> at 0x10ff7d870>]
 
output = itertools.chain()
for gen in gens:
  output = itertools.chain(output, gen)

Now if we iterate through ‘output’ we’d expect to see the doubles of 0, 1, 2 and of 4, 5, 6 i.e. 0, 2, 4, 8, 10 and 12:

>>> for item in output:
...   print item
...
0
2
4
8
10
12

Exactly as we expected! Our scraping code looks like this once we plug the chaining in:

matches = ["32683310", "32683303", "32384894", "31816155"]
 
raw_events = itertools.chain()
for match_id in matches:
    raw_events = itertools.chain(raw_events, extract_events("data/raw/%s" % (match_id)))

‘raw_events’ now contains a single generator that we can iterate through and process the events for all matches.
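The chaining loop can also be collapsed using itertools.chain.from_iterable, which lazily chains an iterable of generators in one go:

matches = ["32683310", "32683303", "32384894", "31816155"]
 
raw_events = itertools.chain.from_iterable(
    extract_events("data/raw/%s" % match_id) for match_id in matches)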

Written by Mark Needham

May 24th, 2015 at 11:51 pm

Posted in Python
