dependence on scaling of data #26

@stephens999

I noticed that the results of cpt.mean depend on the scale of the data, which seems undesirable (and unnecessary).

For example:

library(changepoint)
set.seed(51)
true_mean = rep(c(-0.2,0.1,1,-0.5,0.2,-0.5,0.1,-0.2),c(137,87,17,49,29,52,87,42))
genomdat = list(x = rnorm(500, sd=0.2) + true_mean, true_mean=true_mean)

The cpt.mean default does not find any changepoints:

genomdat.cp = cpt.mean(genomdat$x,method="PELT")
plot(genomdat.cp)

But if we multiply the data by 10, we find many changepoints:

genomdat.cp = cpt.mean(10*genomdat$x,method="PELT")
plot(genomdat.cp)

I speculate that perhaps the cost function (log-likelihood) implicitly assumes the variance is 1?
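To make the speculation concrete, here is a short worked sketch. It assumes (this is not confirmed from the source) that the mean-change cost for a segment is the residual sum of squares, i.e. the Gaussian negative log-likelihood with the variance fixed at 1 up to constants, and that the penalty \beta depends only on the number of observations, not on the scale of the data. The symbols C, \beta, \tau_j, m and c are notation introduced here for illustration.

```latex
\begin{align*}
  % Assumed segment cost: residual sum of squares (unit-variance Gaussian)
  C(x_{s:t}) &= \sum_{i=s}^{t} \left(x_i - \bar{x}_{s:t}\right)^2 \\
  % Penalised objective over m changepoints \tau_1 < \dots < \tau_m,
  % with \tau_0 = 0 and \tau_{m+1} = n
  Q(x; \beta) &= \sum_{j=1}^{m+1} C\!\left(x_{(\tau_{j-1}+1):\tau_j}\right) + \beta m \\
  % Rescaling the data by a constant c rescales every segment cost by c^2
  C(c\,x_{s:t}) &= c^2\, C(x_{s:t}) \\
  % so the objective on the rescaled data is the original objective
  % with a penalty shrunk by c^2
  Q(c\,x; \beta) &= c^2 \left[ \sum_{j=1}^{m+1} C\!\left(x_{(\tau_{j-1}+1):\tau_j}\right)
                   + \frac{\beta}{c^2}\, m \right]
                 = c^2\, Q\!\left(x; \beta / c^2\right)
\end{align*}
```

Under these assumptions, multiplying the data by c = 10 is equivalent to dividing the penalty by 100 on the original data, which would be consistent with seeing no changepoints at sd = 0.2 and many after rescaling. Dividing the cost by the noise variance \sigma^2 (known or estimated) would cancel the c^2 factor and make the objective scale-invariant.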

Incidentally, while digging through the code to try to understand the issue, I noticed that some places
use "norm.mean" whereas others use "mean.norm". I'm not sure whether that was intended:

Matthews-MacBook-Air-2:changepoint stephens$ grep norm.mean src/*
src/BinSeg_one_func_minseglen.c:     char **cost_func; //Descibe the cost function used i.e. norm.mean.cost (change in mean in normal distributed data)  
src/BinSeg_one_func_minseglen.c:  {"norm.mean", mll_mean},
src/BinSeg_one_func_minseglen.c:  {"norm.meanvar", mll_meanvar},
Matthews-MacBook-Air-2:changepoint stephens$ grep mean.norm src/*
src/BinSeg_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm")==0){
src/BinSeg_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm.mbic")==0){
src/PELT_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm")==0){
src/PELT_one_func_minseglen.c:   else if (strcmp(*cost_func,"mean.norm.mbic")==0){
