I’ve been following Michy Alice’s logistic regression tutorial to create an attendance model for London dev meetups and ran into an interesting problem while doing so.
Our dataset has a class imbalance, i.e. most people RSVP ‘no’ to events. This can make the accuracy score misleading: a model that predicts ‘no’ every time would appear to be highly accurate.
Source: local data frame [2 x 2]

  attended     n
     (dbl) (int)
1        0  1541
2        1    53
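To put a number on how misleading accuracy can be here, this is a minimal sketch that computes the accuracy of a baseline that always predicts ‘no’, using only the counts from the table above. The variable names are my own for illustration.

```r
# Class counts copied from the table above (0 = did not attend, 1 = attended)
counts <- c(no = 1541, yes = 53)

# Accuracy of a hypothetical baseline that predicts 'no' for every RSVP:
# it gets every 'no' right and every 'yes' wrong
baseline_accuracy <- counts[["no"]] / sum(counts)

print(round(baseline_accuracy, 3))
```

That works out to roughly 96.7% accuracy without the model ever identifying a single attendee.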
I upsampled the minority class using caret’s upSample function to avoid this:
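library(dplyr)
library(caret)

# upSample expects the class labels as a factor and returns them in a 'Class' column
attended = as.factor((df %>% dplyr::select(attended))$attended)
upSampledDf = upSample(df %>% dplyr::select(-attended), attended)
upSampledDf$attended = as.numeric(as.character(upSampledDf$Class))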
I then trained a logistic regression model, but when I tried to compute the ROC curve, the first step towards the area under the curve, I ran into trouble:
library(ROCR)

p <- predict(model, newdata=test, type="response")
pr <- prediction(p, test$attended)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")

Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, :
  zero non-NA points
I don’t have any NA values in my data frame, so this message was a bit confusing to start with. As usual Stack Overflow came to the rescue with the suggestion that I was probably missing positive/negative values for the dependent variable, i.e. ‘attended’ in my case.
A quick count on the test data frame using dplyr confirmed my mistake:
> test %>% count(attended)
Source: local data frame [1 x 2]

  attended     n
     (dbl) (int)
1        1   582
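A check like this could be run before handing labels to ROCR so the problem shows up with a clearer message. It is only a sketch: `has_both_classes` and `test_labels` are names I made up for illustration, and the labels are a stand-in that mirrors the count above rather than the real data.

```r
# Hypothetical stand-in for test$attended: 582 rows, all in the positive class
test_labels <- rep(1, 582)

# Returns TRUE only if both the negative (0) and positive (1) class are present
has_both_classes <- function(labels) {
  all(c(0, 1) %in% labels)
}

if (!has_both_classes(test_labels)) {
  message("Test labels contain a single class; an ROC curve cannot be computed")
}
```

An ROC curve needs both a true positive rate and a false positive rate, and the latter is undefined when there are no negative cases at all.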
The test data frame only contains people who attended, so I’ll have to randomly shuffle the rows of the data frame and then reassign my training and test data frames to work around it.
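The shuffle-then-split could look something like the following. This is a sketch in base R on a toy data frame rather than my real data: `toy`, the 80/20 split and the seed are all assumptions made for the example.

```r
set.seed(42)  # arbitrary seed so the shuffle is reproducible

# Toy stand-in for the upsampled data frame, deliberately ordered by class
toy <- data.frame(x = seq_len(200), attended = rep(c(0, 1), each = 100))

# Shuffle the rows by indexing with a random permutation of the row numbers
shuffled <- toy[sample(nrow(toy)), ]

# Take the first 80% of the shuffled rows for training and the rest for testing
cutoff <- floor(0.8 * nrow(shuffled))
train <- shuffled[seq_len(cutoff), ]
test <- shuffled[(cutoff + 1):nrow(shuffled), ]

# Both classes should now show up in the test set
print(table(test$attended))
```

A random shuffle makes a single-class test set very unlikely but does not strictly rule it out. caret’s createDataPartition function is an alternative that samples within each class, which keeps the class proportions similar in both data frames.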