Skip to content

Conversation

@andrewbaxter439
Copy link

@andrewbaxter439 andrewbaxter439 commented Mar 1, 2024

Hello, I came across the same error message as in #20:

Error in seq.int(min(dim(R))) : 'from' must be a finite number
In addition: Warning message:
In min(dim(R)) : no non-missing arguments to min; returning Inf

This was being produced when NA coefficients were being returned by stats::glm in pred_fun_Y, which then breaks stats::predict in 'simulate.R'. This only happened when:

  1. I included a time variable in ymodel in a "continuous_eof" model. I think I follow that this is because the model is being fitted at a single time point, therefore there's no variation in time across observations.
  2. I include a custom variable - say a cumulative sum of A - alongside all lags of A. I presume here it's multicollinearity (as A and all lags of A perfectly predict the cumulative sum of A).
  3. I included an unnecessary linear variable alongside the polynomial version (i.e. ~ A + stats::poly(A, 2)), as R assigned NA to one of the duplicates (A or A^1)

So, my fault all round and I just need to understand the model better!

My suggestion here is to have a clearer message for tracking where this error is being produced and what might be causing it. A potential solution would be adding such an error checker to give clear suggestions. A rough skeleton could be: "Error: NA coefficients produced in prediction model. This may be due to (multi)collinearity in ymodel predictor variables, or including time_name variable in a 'continuous_eof' model."

That's just me hazarding a guess though. There may be clearer messages it'd be helpful to give, or even a more generic message merely that the model didn't provide a fit.

Example run with new error (produced by including t0 in ymodel):

devtools::load_all()
#> ℹ Loading gfoRmula
library('Hmisc')
#> 
#> Attaching package: 'Hmisc'
#> 
#> The following objects are masked from 'package:base':
#> 
#>     format.pval, units
id <- 'id'
time_name <- 't0'
covnames <- c('L1', 'L2', 'A')
outcome_name <- 'Y'
outcome_type <- 'continuous_eof'
covtypes <- c('categorical', 'normal', 'binary')
histories <- c(lagged)
histvars <- list(c('A', 'L1', 'L2'))
covparams <- list(covmodels = c(L1 ~ lag1_A + lag1_L1 + L3 + t0 +
                                  rcspline.eval(lag1_L2, knots = c(-1, 0, 1)),
                                L2 ~ lag1_A + L1 + lag1_L1 + lag1_L2 + L3 + t0,
                                A ~ lag1_A + L1 + L2 + lag1_L1 + lag1_L2 + L3 + t0))
ymodel <- Y ~ A + L1 + L2 + lag1_A + lag1_L1 + lag1_L2 + L3 + t0
intvars <- list('A', 'A')
interventions <- list(list(c(static, rep(0, 7))),
                      list(c(static, rep(1, 7))))
int_descript <- c('Never treat', 'Always treat')
nsimul <- 10000

gform_cont_eof <- gformula(obs_data = continuous_eofdata,
                           id = id, time_name = time_name,
                           covnames = covnames, outcome_name = outcome_name,
                           outcome_type = outcome_type, covtypes = covtypes,
                           covparams = covparams, ymodel = ymodel,
                           intvars = intvars, interventions = interventions,
                           int_descript = int_descript,
                           histories = histories, histvars = histvars,
                           basecovs = c("L3"), nsimul = nsimul, seed = 1234)
#> Error in pred_fun_Y(ymodel, yrestrictions, outcome_type, outcome_name, : `NA` coefficients produced in prediction model. This may be due to (multi)collinearity in `ymodel` predictor variables, or including `time_name` variable in a 'continuous_eof' model.

@andrewbaxter439
Copy link
Author

And a suggested error message for invalid coefficients in the covmodels formulae:

devtools::load_all()
#> ℹ Loading gfoRmula
library('Hmisc')
#> 
#> Attaching package: 'Hmisc'
#> 
#> The following objects are masked from 'package:base':
#> 
#>     format.pval, units

id <- 'id'
time_name <- 't0'
covnames <- c('L1', 'L2', 'A')
outcome_name <- 'Y'
outcome_type <- 'continuous_eof'
covtypes <- c('categorical', 'normal', 'binary')
histories <- c(lagged)
histvars <- list(c('A', 'L1', 'L2'))
covparams <- list(covmodels = c(L1 ~ lag1_A + lag1_L1 + L3 + t0 +
                                  rcspline.eval(lag1_L2, knots = c(-1, 0, 1)),
                                L2 ~ lag1_A + L1 + stats::poly(L3, 2) + lag1_L1 + lag1_L2 + L3 + t0,
                                A ~ lag1_A + L1 + L2 + lag1_L1 + lag1_L2 + L3 + t0))
ymodel <- Y ~ A + L1 + L2 + lag1_A + lag1_L1 + lag1_L2 + L3
intvars <- list('A', 'A')
interventions <- list(list(c(static, rep(0, 7))),
                      list(c(static, rep(1, 7))))
int_descript <- c('Never treat', 'Always treat')
nsimul <- 10000

gform_cont_eof <- gformula(obs_data = continuous_eofdata,
                           id = id, time_name = time_name,
                           covnames = covnames, outcome_name = outcome_name,
                           outcome_type = outcome_type, covtypes = covtypes,
                           covparams = covparams, ymodel = ymodel,
                           intvars = intvars, interventions = interventions,
                           int_descript = int_descript,
                           histories = histories, histvars = histvars,
                           basecovs = c("L3"), nsimul = nsimul, seed = 1234)
#> Error in simulate(fitcov = fitcov, fitY = fitY, fitD = NA, yrestrictions = yrestrictions, : `NA` coefficients produced in covariate prediction model. This may be due to (multi)collinearity in `covmodels` predictor variables, or including `time_name` variable in a 'continuous_eof' model.
#> Variables returning `NA` coefficients:
#> L2
#>  - L3

@rwlogan
Copy link

rwlogan commented Mar 1, 2024

HI,
Thanks for looking into possible solutions to this issue.

I will check, but part of the problem might be that in R, glm does not check for the full rank of the design matrix being used in the supplied model. When including time in an eof type outcome model, this variable will be a multiple of the column for the intercept and the design matrix will not have full rank. When include cumsum of A and all lags of A, the cumsum is a linear function of the lags, so again the design matrix will not have full rank.

I am not sure about having a check for including variables that are not included in the basecov or covnames vectors. For the various models being used we might be able to include a check for looking at the rank of the design matrix for each model being used. This might help, but might not be able to capture all problems. Would it be able to capture the error for L3 being in a model/data and not being included in the covnames or basecovs list.

Thanks again for your suggested help.
Roger

@andrewbaxter439
Copy link
Author

Thanks for the response Roger.

This suggestion of course comes with my full admission of my very basic understanding of what should be happening within the model and how to fix it, so your expert oversight on all of this is much appreciated!

To clarify, with the time problem above for example, is the problem caused by the user's incorrect specification of the time variable in the model, rather than an underlying bug? My proposed message here is to tell people like me to not be silly and put such things in there (if they're not meant to be), but if it's hinting at a deeper underlying problem then that may need a different approach.

The simulated 'L3' problem above (again one that I've tripped over in my stupidity) is created by having both 'L3' and 'stats::poly(L3, 2)' in the 'L2' covmodel. R (sensibly) sees that 'L3' and 'L3^1' are identical and returns 'NA' for one of them (which works fine in a normal R context). So it hasn't caught an error before this (as L3 is in basecovs) and only hits the problem when stats::predict tries to return new values for L2 in the data frame.

In both cases I've tried a simple check of all coefficients of the models in each case, on the basis that one NA makes the prediction parts crash with a cryptic message. The user, I imagine, should be told to take the redundant variables out. This might be the wrong place to check though, or the wrong message/solution. Do edit/reject/request edits to suggested code as helpful.

I really appreciate this package btw, and do please excuse any confusions and other questions that come your way as I get my head around it a bit more!

@stmcg
Copy link
Collaborator

stmcg commented Mar 2, 2024

Hi Andrew,

Thanks for posting this as well! To your point, I think it'd be good if we put each instance of model fitting (ie, each covariate model, outcome model) in a try-catch block and also check for issues such as NA coefficients, etc. Otherwise, as you describe, cryptic error messages can appear downstream in the analyses.

I'll try to make these updates within the next week or two

Best,
Sean

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants