· r-2

R: Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1, : zero non-NA points

I’ve been following Michy Alice’s logistic regression tutorial to create an attendance model for London dev meetups and ran into an interesting problem while doing so.

Our dataset has a class imbalance i.e. most people RSVP 'no' to events which can lead to misleading accuracy score where predicting 'no' every time would lead to supposed high accuracy.

Source: local data frame [2 x 2]

  attended     n
     (dbl) (int)
1        0  1541
2        1    53

I sampled the data using caret's http://www.inside-r.org/packages/cran/caret/docs/upSample function to avoid this:

attended = as.factor((df %>% dplyr::select(attended))$attended)
upSampledDf = upSample(df %>% dplyr::select(-attended), attended)
upSampledDf$attended = as.numeric(as.character(upSampledDf$Class))

I then trained a logistic regression model but when I tried to plot the area under the curve I ran into trouble:

p <- predict(model, newdata=test, type="response")
pr <- prediction(p, test$attended)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")

Error in approxfun(x.values.1, y.values.1, method = "constant", f = 1,  :
  zero non-NA points

I don’t have any NA values in my data frame so this message was a bit confusing to start with. As usual Stack Overflow came to the rescue with the suggestion that I was probably missing positive/negative values for the independent variable i.e. 'approved'.

A quick count on the test data frame using dplyr confirmed my mistake:

> test %>% count(attended)
Source: local data frame [1 x 2]

  attended     n
     (dbl) (int)
1        1   582

I’ll have to randomly sort the data frame and then reassign my training and test data frames to work around it.

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket