Is it possible to use different bandwidth / binwidth for different variables / rows? #153

@adrianolszewski

Hello,
My goal is to summarize a number of variables of very different nature and scales with inline plots.
I would like to be able to manually provide the parameters to density or histogram plots.

The defaults are computed from each variable separately, but they do not work well for me in all cases.
When I provide my own value, it applies to every plot in the column.

set.seed(1030)
data <- data.frame(Age = rnorm(40, mean=44, sd =20),
                   Sex = factor(rbinom(40, 1, prob = c(0.4, 0.6)),
                                levels = 0:1, 
                                labels = c("Male", "Female")),
                   X = runif(40, 10, 20),
                   Y = c(rbeta(40, 0.15, 0.3) * 40))


library(gtExtras)
library(tidyr)
library(dplyr)

data_l <- data %>%
  pivot_longer(cols = X:Y, names_to = "Variable", values_to = "Value" ) %>% 
  group_by(Variable) %>%
  summarize(Mean= mean(Value),
            SD = sd(Value),
            Value = list(Value)) %>% 
  mutate(Value1 = Value)

data_l %>%
  gt() %>% 
  gt_plt_dist( Value,
               type = "boxplot", line_color = "purple", fill_color = "green", same_limit = FALSE) %>% 
  gt_plt_dist(Value1,
              type = "density", line_color = "purple", fill_color = "green", same_limit = FALSE) 
Image

With type = "histogram" the upper plot looks much better, but the lower one looks worse (to me).

Image

data_l %>%
  gt() %>% 
  gt_plt_dist( Value,
               type = "boxplot", line_color = "purple", fill_color = "green", same_limit = FALSE) %>% 
  gt_plt_dist(Value1,
              type = "density", line_color = "purple", fill_color = "green", same_limit = FALSE, bw = .8) 

OK, this is better but I'd prefer to adjust it more per case:

Image

For the histogram I used the Freedman-Diaconis rule as implemented in R (nclass.FD), so the plot now looks a bit more like the U-shaped beta distribution:

fd_binwidth <- function(x) {
  num_bins <- nclass.FD(x)
  data_range <- max(x) - min(x)
  bin_width <- data_range / num_bins
  return(bin_width)
}

data_l %>%
  gt() %>% 
  gt_plt_dist( Value,
               type = "boxplot", line_color = "purple", fill_color = "green", same_limit = FALSE) %>% 
  gt_plt_dist(Value1,
              type = "histogram", line_color = "purple", fill_color = "green", same_limit = FALSE, bw = fd_binwidth) 

which gives a little bit better result.
Image
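
For reference, here is a minimal base-R sketch of the per-variable bin widths this rule produces. It regenerates X and Y on their own, so the draws will differ from the `data` above (where Age and Sex are sampled first). The names `vars` and `bin_widths` are only illustrative and not part of gtExtras.

```r
# Freedman-Diaconis bin width: data range divided by the FD bin count.
fd_binwidth <- function(x) {
  num_bins <- nclass.FD(x)
  (max(x) - min(x)) / num_bins
}

set.seed(1030)
# Illustrative stand-ins for the X and Y columns (assumption: same
# distributions as above, but not the same random draws).
vars <- list(
  X = runif(40, 10, 20),
  Y = rbeta(40, 0.15, 0.3) * 40
)

# One bin width per variable, returned as a named numeric vector.
bin_widths <- vapply(vars, fd_binwidth, numeric(1))
print(bin_widths)
```

A named vector like this is the kind of per-row input I would like to be able to hand to the plotting function.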

But is there any trick, any way to tell the function to use a different BW for different variables, e.g. 0.1 for variable 1, 0.5 for variable 2, and so on? A named vector, a list, etc.?

Or maybe these table rows could be made separately, row-by-row in a loop / map, each with appropriate BW, and then somehow combined into the final table?
