30 Jun 2016

Python: BeautifulSoup - Insert tag

I’ve been scraping the Game of Thrones wiki in preparation for a meetup at Women Who Code next week. While attempting to extract character allegiances, I wanted to insert missing line breaks to separate the different allegiances.

I initially tried creating a line break like this:

>>> from bs4 import BeautifulSoup
>>> tag = BeautifulSoup("<br />", "html.parser")
>>> tag
<br/>

It looks like it should work, but later on in my script I check the 'name' attribute to work out whether I’ve got a line break, and it doesn’t return the value I expected:

>>> tag.name
u'[document]'
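To see where '[document]' comes from, here is a minimal sketch, assuming Python 3 and the same "html.parser" backend. The variable names `soup` and `br` are mine, not from the script. It shows that the object returned by the constructor wraps the whole parsed document, and that the actual line break lives inside it:

```python
from bs4 import BeautifulSoup

# The constructor returns a document object, not the <br> tag itself
soup = BeautifulSoup("<br />", "html.parser")
print(soup.name)

# The parsed <br> element is a child of that document
br = soup.br
print(br.name)
```

So pulling the tag out of the parsed document with `soup.br` would be one way to get something whose name is 'br'.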

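The BeautifulSoup constructor returns an object representing the whole parsed document rather than the line break itself, which is why the name comes back as '[document]'. My script assumes it’s going to return the string 'br', so I needed another way of creating the tag. The following does the trick: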

>>> from bs4 import Tag
>>> tag = Tag(name="br")
>>> tag
<br></br>

>>> tag.name
'br'
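For completeness, here is a rough sketch of actually inserting a line break into a parsed fragment. It uses `soup.new_tag`, which is an alternative to constructing `Tag` directly rather than what my script does, and the HTML snippet with its house names is made up for illustration rather than taken from the wiki:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two allegiances run together with no separator
html = "<td>House Stark<a>House Tully</a></td>"
soup = BeautifulSoup(html, "html.parser")

# new_tag creates a tag tied to this document's parser settings
br = soup.new_tag("br")

# Place the line break just before the second allegiance
soup.td.a.insert_before(br)

print(br.name)
print(soup.td)
```

Because the tag is created through the document, it should serialise as a self-closing <br/> rather than the <br></br> shown above, and the name check still sees 'br'.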

That’s all for now, back to scraping for me!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.