Mark Needham

Thoughts on Software Development

Archive for the ‘R’ Category

R: Building up a data frame row by row


Jen and I recently started working on the Kaggle Titanic problem and we thought it’d probably be useful to start with some exploratory data analysis to get a feel for the data set.

For this problem you are given a selection of features describing the passengers on board the Titanic and you have to predict whether or not each passenger survived based on those features.

I thought an interesting first thing to look at would be the survival rate for passengers of different socio-economic status, as you would imagine that people of a higher status were more likely to have survived.

The data looks like this:

> head(titanic)
  survived pclass                                                name    sex age sibsp parch           ticket    fare cabin embarked
1        0      3                             Braund, Mr. Owen Harris   male  22     1     0        A/5 21171  7.2500              S
2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0         PC 17599 71.2833   C85        C
3        1      3                              Heikkinen, Miss. Laina female  26     0     0 STON/O2. 3101282  7.9250              S
4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0           113803 53.1000  C123        S
5        0      3                            Allen, Mr. William Henry   male  35     0     0           373450  8.0500              S
6        0      3                                    Moran, Mr. James   male  NA     0     0           330877  8.4583              Q

I started by writing the following function to work out how many people in a particular class survived:

survived <- function(class) {
  peopleInThatClass <- titanic[titanic$pclass == class,]
  survived <- peopleInThatClass[peopleInThatClass$survived == 1,]  
  nrow(survived)
}
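
As an aside, since TRUE counts as 1 when summed, the body of that function could be collapsed into a single expression (a sketch of an equivalent version):

# equivalent sketch: count the rows where both conditions hold
sum(titanic$pclass == class & titanic$survived == 1, na.rm = TRUE)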

I could then work out how many people in each class survived by using lapply:

> lapply(c(1,2,3), survived)
[[1]]
[1] 136
 
[[2]]
[1] 87
 
[[3]]
[1] 119

This works but the output isn’t very nice to read, and I wanted to put that data side by side with other information about each class, for which I figured I’d need to construct a data frame.

I came across a solution which works quite well about halfway down a Stack Overflow question and reworked my code to make use of it:

survivalSummary <- function(classes) {
  summary<-NULL
  for(class in classes) {
    numberSurvived <- survived(class)
    summary <- rbind(summary,data.frame(class=class, survived=numberSurvived))
  }
 
  summary[with(summary, order(class)),]  
}
> survivalSummary(c(1,2,3))
  class survived
1     1      136
2     2       87
3     3      119

We make use of the rbind function, which returns a new data frame with the extra row appended, so we then have to reassign the result to the summary variable.
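
As an aside, the same summary can be built without growing the data frame inside the loop: collect one single-row data frame per class and bind them together in one go with do.call. This is just a sketch of an alternative (the helper name is mine):

# hypothetical alternative: build one row per class, then bind them all at once
survivalSummaryAlternative <- function(classes) {
  rows <- lapply(classes, function(class) data.frame(class = class, survived = survived(class)))
  summary <- do.call(rbind, rows)
  summary[order(summary$class), ]
}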

If we want to add other information about the class onto each row such as the total number of passengers and the survival rate it’s very easy:

survivalSummary <- function(classes) {
  summary<-NULL
  for(class in classes) {
    numberSurvived <- survived(class)
    total <- passengers(class)
    percentageSurvived <- numberSurvived / total
    summary <- rbind(summary,data.frame(class=class, survived=numberSurvived, total=total, percentSurvived=percentageSurvived))
  }
 
  summary[with(summary, order(class)),]  
}
> survivalSummary(unique(titanic$pclass))
  class survived total percentSurvived
2     1      136   216       0.6296296
3     2       87   184       0.4728261
1     3      119   491       0.2423625

We’re also now using unique to get the different classes rather than hard coding them.
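
One thing to note: the passengers function used above isn’t defined in this post. A minimal sketch of what it might look like (my assumption is that it simply counts the rows for a class):

# hypothetical helper: total number of passengers in a given class
passengers <- function(class) {
  nrow(titanic[titanic$pclass == class, ])
}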

Written by Mark Needham

February 10th, 2013 at 1:29 pm

Posted in R


R: Modelling a conversion rate with a binomial distribution


As part of some work Sid and I were doing last week we wanted to simulate the conversion rate for an A/B test we were planning.

We started with the following function, which returns simulated conversion rates for two groups, each with an underlying conversion rate of 12%:

generateConversionRates <- function(sampleSize) {
  # rbinom's first argument is the number of draws, so pass the sample size directly
  # (seq(0, sampleSize) would give us sampleSize + 1 draws)
  sample_a <- rbinom(sampleSize, 1, 0.12)
  conversion_a <- length(sample_a[sample_a == 1]) / sampleSize

  sample_b <- rbinom(sampleSize, 1, 0.12)
  conversion_b <- length(sample_b[sample_b == 1]) / sampleSize

  c(conversion_a, conversion_b)
}

If we call it:

> generateConversionRates(10000)
[1] 0.1230 0.1207

We have a 12.3% conversion rate on A and a 12.07% conversion rate on B based on 10,000 sample values.

We then wrote the following function to come up with 1000 versions of those conversion rates:

generateSample <- function(sampleSize) {
	lapply(seq(1, 1000), function(x) generateConversionRates(sampleSize))
}

We can call that like this:

> generateSample(10000)
...
[[998]]
[1] 0.1179 0.1216
 
[[999]]
[1] 0.1246 0.1211
 
[[1000]]
[1] 0.1248 0.1234
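
As an aside, replicate is arguably a more natural fit than lapply for repeating the same simulation a fixed number of times. A sketch of an equivalent version:

generateSample <- function(sampleSize) {
  # run the simulation 1000 times and keep the results as a list
  replicate(1000, generateConversionRates(sampleSize), simplify = FALSE)
}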

We were then using these conversion rates to try to work out how many samples we needed to include in an A/B test to have reasonable confidence that it represented the population.
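
A rough sketch of the kind of summary we had in mind (not the analysis we actually ran): since both groups have the same underlying rate, the spread of the observed differences gives a feel for how much variation comes from sampling alone.

# sketch: distribution of the observed difference between two identical 12% groups
simulations <- generateSample(10000)
differences <- sapply(simulations, function(rates) rates[1] - rates[2])
quantile(differences, c(0.025, 0.975))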

We actually ended up abandoning that exercise but I thought I’d record the code because I thought it was pretty interesting.

Written by Mark Needham

February 7th, 2013 at 1:26 am

Posted in R


R: Mapping over a list of lists


As part of the Coursera Data Analysis course I had the following code to download and then read in a file:

> file <- "https://dl.dropbox.com/u/7710864/data/csv_hid/ss06hid.csv"
> download.file(file, destfile="americancommunity.csv", method="curl")
> acomm <- read.csv("americancommunity.csv")

We then had to filter the data based on the values in a couple of columns and work out how many rows were returned in each case:

> one <- acomm[acomm$RMS == 4 & !is.na(acomm$RMS) 
               & acomm$BDS == 3 & !is.na(acomm$BDS), c("RMS")]
> one
  [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
...
[137] 4 4 4 4 4 4 4 4 4 4 4 4
> two <- acomm[acomm$RMS == 5 & !is.na(acomm$RMS) 
               & acomm$BDS == 2 & !is.na(acomm$BDS), c("RMS")]
> two
  [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
...
[375] 5 5 5 5 5 5 5 5 5 5 5 5
 
> three <- acomm[acomm$RMS == 7 & !is.na(acomm$RMS) 
                 & acomm$BDS == 2 & !is.na(acomm$BDS), c("RMS")]
> three
 [1] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
[36] 7 7 7 7 7 7 7 7 7 7 7 7 7 7

So I needed to know how many values were in the variables one, two and three.

I thought I could probably put those lists into another list and then use apply or one of its variants to get the length of each one.

I usually use the c function to help me create lists but it’s not helpful in this case as it creates one massive vector with all the values concatenated together.

Calling lapply on that combined vector doesn’t have the intended outcome:

> lapply(c(one, two, three), length)
...
[[582]]
[1] 1
 
[[583]]
[1] 1

Instead what we need is the list function:

> lapply(list(one, two, three), length)
[[1]]
[1] 148
 
[[2]]
[1] 386
 
[[3]]
[1] 49

Et voila!
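
As a small aside, if a plain vector is nicer to read than a list, sapply does the same mapping but simplifies the result:

> sapply(list(one, two, three), length)
[1] 148 386  49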

The code is on GitHub as usual.

Written by Mark Needham

February 3rd, 2013 at 10:40 am

Posted in R


R: Ordering rows in a data frame by multiple columns


In one of the assignments of Computing for Data Analysis we needed to sort a data frame based on the values in two of the columns and then return the top value.

The initial data frame looked a bit like this:

> names <- c("paul", "mark", "dave", "will", "john")
> values <- c(1,4,1,2,1)
> smallData <- data.frame(name = names, value = values)
> smallData
  name value
1 paul     1
2 mark     4
3 dave     1
4 will     2
5 john     1

I want to be able to sort the data frame by value and then by name, both in ascending order, so the final result should look like this:

  name value
3 dave     1
5 john     1
1 paul     1
4 will     2
2 mark     4

To do that we can use the order function, which returns the indices that would put the vector into sorted order.

e.g. in our case

> order(c(1,4,1,2,1))
[1] 1 3 5 4 2

If we pass a collection of indices to the extract operation it’ll reorder the rows, e.g.

> smallData[c(5,4,3,2,1),]
  name value
5 john     1
4 will     2
3 dave     1
2 mark     4
1 paul     1

In our case we wire everything together like this to sort by the second column (value):

> smallData[order(smallData[,2]),]
  name value
1 paul     1
3 dave     1
5 john     1
4 will     2
2 mark     4

It’s a reasonably small tweak to get it to sort first by the second column and then by the first (name) which is what we want:

> smallData[order(smallData[,2], smallData[,1]),]
  name value
3 dave     1
5 john     1
1 paul     1
4 will     2
2 mark     4

If we wanted to use the column names instead of indices we’d do the following:

> smallData[order(smallData$value, smallData$name),]
  name value
3 dave     1
5 john     1
1 paul     1
4 will     2
2 mark     4

We could also rewrite it using the with function if we want to reduce the code further:

> smallData[with(smallData, order(value, name)),]
  name value
3 dave     1
5 john     1
1 paul     1
4 will     2
2 mark     4

As I understand it, with constructs an environment from smallData and evaluates its second argument in that environment, which is why we can refer to the column names of smallData directly.
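
As an aside (not something the assignment needed), if we wanted value in descending order while keeping name ascending, negating the numeric column does the trick:

> smallData[order(-smallData$value, smallData$name),]
  name value
2 mark     4
4 will     2
3 dave     1
5 john     1
1 paul     1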

Written by Mark Needham

January 23rd, 2013 at 11:09 pm

Posted in R


R: Filter a data frame based on values in two columns


In the most recent assignment of the Computing for Data Analysis course we had to filter a data frame containing NA values in two of its columns so that we only returned rows with no NAs.

I started with a data frame that looked like this:

> data <- read.csv("specdata/002.csv") 
> # we'll just use a few rows to make it easier to see what's going on
> data[2494:2500,]
           Date sulfate nitrate ID
2494 2007-10-30    3.25   0.902  2
2495 2007-10-31      NA      NA  2
2496 2007-11-01      NA      NA  2
2497 2007-11-02    6.56   1.270  2
2498 2007-11-03      NA      NA  2
2499 2007-11-04      NA      NA  2
2500 2007-11-05    6.10   0.772  2

We want to only return the rows which have a value in both the ‘sulfate’ and the ‘nitrate’ columns.

I initially tried to use the Filter function but wasn’t very successful:

> smallData <- data[2494:2500,]
> Filter(function(x) !is.na(x$sulfate), smallData)
Error in x$sulfate : $ operator is invalid for atomic vectors

I don’t think Filter is designed to filter the rows of a data frame – it treats the data frame as a list of columns and passes each column (an atomic vector) to the function, hence the error – so I ended up filtering the data frame using what I think is called an extract operation:

> smallData[!is.na(smallData$sulfate) & !is.na(smallData$nitrate),]
           Date sulfate nitrate ID
2494 2007-10-30    3.25   0.902  2
2497 2007-11-02    6.56   1.270  2
2500 2007-11-05    6.10   0.772  2

The code inside the square brackets returns a logical vector indicating whether or not we should return each row:

> !is.na(smallData$sulfate) & !is.na(smallData$nitrate)
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE

which is equivalent to doing this:

> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),]
           Date sulfate nitrate ID
2494 2007-10-30    3.25   0.902  2
2497 2007-11-02    6.56   1.270  2
2500 2007-11-05    6.10   0.772  2

We put a comma after the vector of TRUE/FALSE values to indicate that we want to return all the columns, otherwise we’d get this error:

> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)]
Error in `[.data.frame`(smallData, c(TRUE, FALSE, FALSE, TRUE, FALSE,  : 
  undefined columns selected

We could filter the columns as well if we wanted to:

> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE), c(1,2)]
           Date sulfate
2494 2007-10-30    3.25
2497 2007-11-02    6.56
2500 2007-11-05    6.10

As is no doubt obvious, I don’t know much R so if there’s a better way to do anything please let me know.
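
One tidier alternative, offered as a sketch rather than the approach used in the assignment, is complete.cases, which returns TRUE for the rows that have no NAs in the columns you give it:

> smallData[complete.cases(smallData[, c("sulfate", "nitrate")]),]
           Date sulfate nitrate ID
2494 2007-10-30    3.25   0.902  2
2497 2007-11-02    6.56   1.270  2
2500 2007-11-05    6.10   0.772  2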

The full code is on GitHub if you’re interested in seeing it in context.

Written by Mark Needham

January 23rd, 2013 at 10:34 pm

Posted in R


Kaggle Digit Recognizer: Finding pixels with no variance using R


I’ve written previously about our attempts at the Kaggle Digit Recognizer problem, and our approach so far has been to take the data provided, plug it into different algorithms and see what we end up with.

From browsing through the forums we saw others mentioning feature extraction – an approach where we transform the data into another format, the thinking being that we can train a better classifier with better data.

There was quite a nice quote from a post written by Rachel Schutt about the Columbia Data Science course which summed up the mistake we’d made:

The Space between the Data Set and the Algorithm

Many people go straight from a data set to applying an algorithm. But there’s a huge space in between of important stuff. It’s easy to run a piece of code that predicts or classifies. That’s not the hard part. The hard part is doing it well.

One thing we’d noticed while visually scanning the data set was that a lot of the features seemed to consistently have a value of 0. We thought we’d try and find out which pixels those were by finding the features which had zero variance in their values.

I started out by loading a subset of the data set and then taking a sample of that to play around with:

initial <- read.csv("train.csv", nrows=10000, header = TRUE)
 
# take a sample of 1000 rows of the input 
sampleSet <- initial[sample(1:nrow(initial), 1000), ]

Just for fun I thought it’d be interesting to see how well the labels were distributed in my sample which we can do with the following code:

# get all the labels
sampleSet.labels <- as.factor(sampleSet$label)
 
> table(sampleSet.labels)
sampleSet.labels
  0   1   2   3   4   5   6   7   8   9 
102 116 100  97  95  91  79 122 102  96

There are a few more 1s and 7s than the other labels but they’re roughly in the same ballpark so it’s OK.

I wanted to exclude the ‘label’ field from the data set because the variance of labels isn’t interesting to us on this occasion. We can do that with the following code:

# get data set excluding label
excludingLabel <- subset( sampleSet, select = -label)

To find all the features with no variance we then do this:

# show all the features which don't have any variance - all have the same value
variances <- apply(excludingLabel, 2, var)
 
# get the names of the features which have no variance
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
> pointlessFeatures
 [1] "pixel0"   "pixel1"   "pixel2"   "pixel3"   "pixel4"   "pixel5"   "pixel6"   "pixel7"  
  [9] "pixel8"   "pixel9"   "pixel10"  "pixel11"  "pixel12"  "pixel13"  "pixel14"  "pixel15" 
 [17] "pixel16"  "pixel17"  "pixel18"  "pixel19"  "pixel20"  "pixel21"  "pixel22"  "pixel23" 
 [25] "pixel24"  "pixel25"  "pixel26"  "pixel27"  "pixel28"  "pixel29"  "pixel30"  "pixel31" 
 [33] "pixel32"  "pixel33"  "pixel51"  "pixel52"  "pixel53"  "pixel54"  "pixel55"  "pixel56" 
 [41] "pixel57"  "pixel58"  "pixel59"  "pixel60"  "pixel82"  "pixel83"  "pixel84"  "pixel85" 
 [49] "pixel86"  "pixel88"  "pixel110" "pixel111" "pixel112" "pixel113" "pixel114" "pixel139"
 [57] "pixel140" "pixel141" "pixel142" "pixel168" "pixel169" "pixel196" "pixel252" "pixel280"
 [65] "pixel308" "pixel335" "pixel336" "pixel364" "pixel365" "pixel392" "pixel393" "pixel420"
 [73] "pixel421" "pixel448" "pixel476" "pixel504" "pixel532" "pixel559" "pixel560" "pixel587"
 [81] "pixel615" "pixel643" "pixel644" "pixel645" "pixel671" "pixel672" "pixel673" "pixel699"
 [89] "pixel700" "pixel701" "pixel727" "pixel728" "pixel729" "pixel730" "pixel731" "pixel752"
 [97] "pixel753" "pixel754" "pixel755" "pixel756" "pixel757" "pixel758" "pixel759" "pixel760"
[105] "pixel779" "pixel780" "pixel781" "pixel782" "pixel783"

We can count how many such features there are by using the length function:

# count how many features have no variance
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 109

I then wrote those out to a file so that we could use them as the input to the code which builds up our classifier.

write(file="pointless-features.txt", pointlessFeatures)

Of course we should really run the variance test against the full data set rather than just a sample; on the whole data set there are only 76 features with zero variance:

> sampleSet <- read.csv("train.csv", header = TRUE)
> sampleSet.labels <- as.factor(sampleSet$label)
> table(sampleSet.labels)
sampleSet.labels
   0    1    2    3    4    5    6    7    8    9 
4132 4684 4177 4351 4072 3795 4137 4401 4063 4188 
> excludingLabel <- subset( sampleSet, select = -label)
> variances <- apply(excludingLabel, 2, var)
> pointlessFeatures <- names(excludingLabel[variances == 0][1,])
> length(names(excludingLabel[apply(excludingLabel, 2, var) == 0][1,]))
[1] 76

We’ve built decision trees using this reduced data set but haven’t yet submitted the forest to Kaggle to see if it’s any more accurate!

I picked up the little R I know from the Computing for Data Analysis course which started last week and from the book ‘R in a Nutshell’, which my colleague Paul Lam recommended.

Written by Mark Needham

January 8th, 2013 at 12:48 am

Posted in Machine Learning,R


R: Mapping a function over a collection of values


I spent a bit of Sunday playing around with R and one thing I wanted to do was map a function over a collection of values and transform each value slightly.

I loaded my data set using the ‘Import Dataset’ option in RStudio (suggested to me by Rob), which gets converted to the following function call:

> data <-  read.csv("~/data.csv", header=T, encoding="ISO-8859")
> data
  Column1 InterestingColumn
1    Mark             12.50
2    Dave            100.00
3    John          1,231.00

data.csv

Column1, InterestingColumn
Mark, 12.50
Dave, 100.00 
John, 1,231.00

data is a data frame shown as ‘3 obs. of 2 variables’ in RStudio.

I was only interested in the values in the 2nd column so I selected those like this:

> data$InterestingColumn
[1]  12.50     100.00    1,231.00
Levels:  1,231.00  100.00  12.50

I wanted to apply a function over each of the numbers and return a new list containing those transformations.

I initially had a look at doing this with a for loop but it didn’t turn out to be as easy as I’d expected, and I eventually came across the lapply function, which allows you to apply a function over a list or vector.

> values <- data$InterestingColumn
> lapply(values, function(x) 5000 - as.numeric(gsub("\\s|,","", x)))
[[1]]
[1] 4987.5
 
[[2]]
[1] 4900
 
[[3]]
[1] 3769

We define a function which subtracts the value in the column from 5000 since the CSV file contained derived values and I was interested in the original value.

In order to do that subtraction I needed to convert the value from the CSV file to a number – the column was read in as a factor, as the ‘Levels’ output above shows – which involved stripping out any spaces or commas using gsub and then casting the string with as.numeric.

If we want to have a table structure then we can use the ‘by’ function to do a similar thing:

> as.table(by(data$InterestingColumn, data$Column1, function(x) 5000 - as.numeric(gsub("\\s|,","", x))))
data$Column1
  Dave   John   Mark 
4900.0 3769.0 4987.5

I don’t know enough R to know how to keep the data in exactly the same structure as we got it so if anyone can point me in the right direction that’d be cool.
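
One way that seems to work, though this is a sketch rather than a definitive answer, is to compute the transformed values as a plain numeric vector and attach them to the data frame as a new column:

> data$Original <- 5000 - as.numeric(gsub("\\s|,", "", as.character(data$InterestingColumn)))
> data
  Column1 InterestingColumn Original
1    Mark             12.50   4987.5
2    Dave            100.00   4900.0
3    John          1,231.00   3769.0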

Written by Mark Needham

July 23rd, 2012 at 11:25 pm

Posted in R
