Model predictions Inf, under particular data structures

Problem: with certain real-world data, in an otherwise typical model selection workflow, model predictions end up as `Inf`.
This seems to result from high collinearity among the explanatory variables, which makes the parameter estimates unstable.

Demonstration:
``` r
library(MIAmaxent)
packageVersion("MIAmaxent")
#> [1] '1.2.0.9000'

reprexdata <- structure(list(RV = c(1, 1, 1, NA, NA, NA, NA, NA, NA, NA), 
                             EV_L = c(0.98, 0.83, 0.94, 0.72, 0.89, 0.7, 0.67, 0.64, 0.39, 0.83), 
                             EV_D2 = c(0, 0.03, 0, 0.07, 0.01, 0.09, 0.1, 0.13, 0.37, 0.03), 
                             EV_M = c(0.96, 0.74, 0.9, 0.6, 0.83, 0.58, 0.54, 0.5, 0.26, 0.74)), 
                        class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

plot(reprexdata[,-1])
```

![](https://i.imgur.com/rbD3Ft7.png)

``` r
cor(reprexdata[,-1])
#>             EV_L      EV_D2       EV_M
#> EV_L   1.0000000 -0.9451377  0.9946556
#> EV_D2 -0.9451377  1.0000000 -0.9069347
#> EV_M   0.9946556 -0.9069347  1.0000000

iwlr <- MIAmaxent:::.runIWLR(formula("RV ~ EV_L + EV_D2 + EV_M"), reprexdata)
coef(iwlr)
#> (Intercept)        EV_L       EV_D2        EV_M 
#>  -2531.4654   7431.9418    802.6664  -4954.6279
iwlr$entropy
#> [1] 0
iwlr$alpha
#> [1] -Inf
```
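
One way to quantify the collinearity suspected above is the variance inflation factor (VIF) of each predictor. The following is a minimal base-R sketch, not part of MIAmaxent; it copies the predictor columns from the reprex and uses the identity that the VIFs are the diagonal of the inverse correlation matrix. The usual rule of thumb that values above 10 signal problematic collinearity is an assumption here, not a threshold the package uses.

```r
# Predictor columns copied from the reprex above (RV is not needed here).
evs <- data.frame(
  EV_L  = c(0.98, 0.83, 0.94, 0.72, 0.89, 0.7, 0.67, 0.64, 0.39, 0.83),
  EV_D2 = c(0, 0.03, 0, 0.07, 0.01, 0.09, 0.1, 0.13, 0.37, 0.03),
  EV_M  = c(0.96, 0.74, 0.9, 0.6, 0.83, 0.58, 0.54, 0.5, 0.26, 0.74)
)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing predictor j on the
# other predictors. This equals the j-th diagonal element of the inverse of
# the predictor correlation matrix.
vif <- diag(solve(cor(evs)))

# Large values mean a predictor is almost a linear combination of the others.
round(vif, 1)
```

If all three values come out far above 10, as the pairwise correlations suggest they will, the coefficient estimates are poorly determined by the data, which is consistent with the very large coefficients shown above.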

The instability depends on the data in unexpected ways. Keep only the two most highly correlated variables (EV_L and EV_M) and the problem disappears:
``` r
iwlr <- MIAmaxent:::.runIWLR(formula("RV ~ EV_L + EV_M"), reprexdata)
coef(iwlr)
#> (Intercept)        EV_L        EV_M 
#>   -262.0798    782.4296   -530.8548
iwlr$entropy
#> [1] 1.41998
iwlr$alpha
#> [1] -258.5736
```
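
Because the failure is silent, a small guard that checks a fit for non-finite components may help catch it before predictions are made. The sketch below is only an illustration: `fit_is_finite()` is a hypothetical helper, not a MIAmaxent function, and the two lists are stand-ins that carry just the `alpha` and `entropy` values printed in the two outputs above. Whether finite values in these two components are sufficient for finite predictions is an assumption.

```r
# Hypothetical helper: TRUE only if every named component of the fit is finite.
fit_is_finite <- function(fit, fields = c("alpha", "entropy")) {
  all(vapply(fit[fields], function(x) all(is.finite(x)), logical(1)))
}

# Stand-ins for the two fits above, holding only the values that were printed.
fit_three_vars <- list(alpha = -Inf,      entropy = 0)
fit_two_vars   <- list(alpha = -258.5736, entropy = 1.41998)

fit_is_finite(fit_three_vars)  # three-variable model: alpha is -Inf
fit_is_finite(fit_two_vars)    # two-variable model: both values finite
```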

This problem can appear in a real-world MIAmaxent workflow, because highly correlated derived variables *are* selected together given a sufficient number of background points:
``` r
set.seed(42)
longerdata <- dplyr::slice_sample(reprexdata, n = 1e4, replace = TRUE)
selection <- selectDVforEV(list(RV = longerdata$RV, EV = longerdata[,-1]))
#> Forward selection of DVs for 1 EVs
#>   |======================================================================| 100%
selection$selection
#> $EV
#>   round           variables m   Dsq    Chisq df        P
#> 1     1               EV_D2 1 0.099 4033.427  1 0.00e+00
#> 2     1                EV_L 1 0.097 3945.131  1 0.00e+00
#> 3     1                EV_M 1 0.094 3834.647  1 0.00e+00
#> 4     2        EV_D2 + EV_L 2 0.100   26.073  1 3.29e-07
#> 5     2        EV_D2 + EV_M 2 0.099    9.107  1 2.55e-03
#> 6     3 EV_D2 + EV_L + EV_M 3 0.140 1660.015  1 0.00e+00
```

I have only encountered the issue when both 'L'- and 'M'-type transformations are used. The derived variables that result from these two types are often highly correlated, so I expect that using only one of them in `deriveVars()` will resolve the problem in most real-world cases.
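
As an alternative to restricting the transformation types, near-duplicate derived variables could be filtered out before selection. The sketch below is a different technique from the one suggested above and is not part of MIAmaxent: `drop_collinear()` is a hypothetical helper, and the cutoff of 0.99 is an arbitrary choice made for illustration. It keeps columns in order and drops any column whose absolute correlation with an already kept column reaches the cutoff.

```r
# Derived variables copied from the reprex above.
dvs <- data.frame(
  EV_L  = c(0.98, 0.83, 0.94, 0.72, 0.89, 0.7, 0.67, 0.64, 0.39, 0.83),
  EV_D2 = c(0, 0.03, 0, 0.07, 0.01, 0.09, 0.1, 0.13, 0.37, 0.03),
  EV_M  = c(0.96, 0.74, 0.9, 0.6, 0.83, 0.58, 0.54, 0.5, 0.26, 0.74)
)

# Hypothetical helper: greedy filter on pairwise absolute correlation.
drop_collinear <- function(dvs, cutoff = 0.99) {
  r <- abs(cor(dvs))
  keep <- character(0)
  for (v in colnames(dvs)) {
    # Keep v only if it is not nearly collinear with any column kept so far.
    if (length(keep) == 0 || all(r[v, keep] < cutoff)) {
      keep <- c(keep, v)
    }
  }
  dvs[, keep, drop = FALSE]
}

filtered <- drop_collinear(dvs)
names(filtered)
```

With the correlations reported above, EV_M is dropped because its correlation with EV_L exceeds the cutoff, while EV_D2 is retained. The result depends on column order, so a real workflow would need a more deliberate rule for which variable of a pair to keep.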

<sup>Created on 2022-11-22 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

