You can install the development version of MaxentDisaggregation from GitHub with:
# install.packages("devtools")
devtools::install_github("simschul/MaxentDisaggregation")
Note that this package is under active development. Together with co-authors, I’m currently preparing a journal article that describes the background of data disaggregation in more detail and shows use cases within the field of Industrial Ecology.
MaxentDisaggregation is an R package that helps you with uncertainty propagation when data disaggregation is involved. Data disaggregation usually involves splitting one data point into several disaggregates using proxy data. It is a common problem in many research disciplines.
flowchart-elk TD
%% Define node classes
classDef Aggregate fill:#eeeee4,color:black,stroke:none;
classDef DisAgg1 fill:#abdbe3,color:black,stroke:none;
classDef DisAgg2 fill:#e28743,color:black,stroke:none;
classDef DisAgg3 fill:#abdbe3,color:black,stroke:none;
agg("Y_0"):::Aggregate
disagg1("Y_1=x_1 Y_0"):::DisAgg1
disagg2("Y_2=x_2 Y_0"):::DisAgg1
disagg3("Y_3=x_3 Y_0"):::DisAgg1
%% Define connections
agg --> disagg1
agg --> disagg2
agg --> disagg3
Data disaggregation usually involves an aggregate flow $Y_0$ that is split into several disaggregate flows $Y_i$, which must sum to the aggregate:
$$Y_0 = \sum_i Y_i$$
This equation, also called an accounting identity, introduces
dependencies/correlations between the individual disaggregate flows.
To get estimates for the disaggregate flows, one usually looks for proxy
data. Those proxy data are used to calculate shares (ratios/fractions)
$x_i$ of the respective disaggregate units, with $\sum_i x_i = 1$.
Disaggregate flows are calculated as
$$Y_i = x_i Y_0$$
This package generates a random sample of disaggregates based on the information provided. The aggregate and the shares are sampled independently. The distribution from which to sample is determined internally based on the information provided by the user. This choice of distribution is mostly based on the principle of Maximum Entropy (MaxEnt).
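The idea of sampling the aggregate and the shares independently and then combining them can be illustrated in base R. This is only a minimal sketch, not the package's implementation: the normal distribution for the aggregate, the Dirichlet concentration parameters `alpha`, and all variable names are assumptions made for illustration.

```r
set.seed(42)
n <- 1000

# Aggregate Y_0: a normal distribution is assumed here purely for illustration.
y0 <- rnorm(n, mean = 100, sd = 5)

# Shares x: Dirichlet draws obtained by normalising independent gamma variates.
# The concentration parameters `alpha` are an assumption of this sketch.
alpha <- c(1, 3, 6)
g <- matrix(rgamma(n * length(alpha), shape = rep(alpha, each = n)), nrow = n)
x <- g / rowSums(g)

# Disaggregates Y_i = x_i * Y_0 (y0 is recycled down each column).
y <- x * y0

# The accounting identity holds for every draw ...
all.equal(rowSums(y), y0)
# ... and it induces correlations between the disaggregates.
round(cor(y), 2)
```

Because every row of `x` sums to one, each row of `y` sums exactly to the corresponding aggregate draw, which is what makes the disaggregates dependent on each other.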
The aggregate distribution is determined using the following decision tree:
flowchart-elk TD
MeanDecision{{"Best guess/
mean available?"}} -- no --> BoundsDecision1{{"Bounds available?"}}
MeanDecision -- yes --> SDDecision{{"Standard deviation available?"}}
SDDecision -- yes --> BoundsDecision2{{"Bounds available?"}}
BoundsDecision2 -- yes --> GeneralBounds{{"General Bounds a,b"}}
GeneralBounds -- "no, $$a=0, b=\infty$$" --> LogNorm("LogNormal distribution
or
Truncated Normal")
GeneralBounds -- yes --> TruncNorm("Truncated Normal
(Maximum Entropy distribution)")
BoundsDecision2 -- no --> Normal("Normal distribution")
SDDecision -- no --> LowerBound0{{"Lower bound = 0?"}}
LowerBound0 -- yes --> Exponential("Exponential distribution")
LowerBound0 -- no --> NotImplemented["No MaxEnt solution
(currently not implemented)"]
BoundsDecision1 -- yes --> Uniform("Uniform distribution on [a,b]")
BoundsDecision1 -- no --> GoBackToStart["☠️ !Game Over!
We suggest rethinking your problem... 🤓"]
MeanDecision:::decision
BoundsDecision1:::decision
SDDecision:::decision
BoundsDecision2:::decision
GeneralBounds:::decision
LogNorm:::distribution
TruncNorm:::distribution
Normal:::distribution
LowerBound0:::decision
Exponential:::distribution
NotImplemented:::notimplementednode
Uniform:::distribution
GoBackToStart:::notimplementednode
classDef decision fill:#e28743,color:black,stroke:none
classDef distribution fill:#abdbe3,color:black,stroke:none
classDef notimplementednode fill:#eeeee4,color:black,stroke:none
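The decision tree above can be read as a small lookup function. The sketch below is hypothetical and not part of the package: the function name `choose_agg_distribution` and the arguments `lower` and `upper` are invented for illustration, and it assumes that "bounds available" for the uniform case means both bounds are finite.

```r
# Hypothetical helper (not part of the package): returns the name of the
# distribution that the decision tree above would select.
choose_agg_distribution <- function(mean = NULL, sd = NULL,
                                    lower = -Inf, upper = Inf) {
  has_mean   <- !is.null(mean)
  has_sd     <- !is.null(sd)
  has_bounds <- is.finite(lower) || is.finite(upper)

  if (!has_mean) {
    # Assumption: a uniform distribution needs both bounds to be finite.
    if (is.finite(lower) && is.finite(upper)) return("uniform")
    return("no solution: rethink the problem")
  }
  if (!has_sd) {
    if (lower == 0) return("exponential")
    return("not implemented")
  }
  if (!has_bounds) return("normal")
  if (lower == 0 && upper == Inf) return("lognormal or truncated normal")
  "truncated normal"
}

choose_agg_distribution(mean = 100, sd = 5, lower = 0)
choose_agg_distribution(mean = 100, sd = 5)
choose_agg_distribution(lower = 0, upper = 10)
```

The three calls walk through the lognormal/truncated normal, normal, and uniform branches of the tree, respectively.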
The shares are sampled from different variants of the Dirichlet distribution:
flowchart-elk TD
%% Define node classes
classDef decision fill:#e28743,color:black,stroke:none;
classDef distribution fill:#abdbe3,color:black,stroke:none;
classDef explanationnode fill:#eeeee4,color:black,stroke:none;
MeanDecision{{"Best guess/mean available?"}}:::decision
SDDecision{{"Standard deviation available?"}}:::decision
MaxEntDir("Maximum Entropy Dirichlet"):::distribution
GenDir("Generalised Dirichlet"):::distribution
NestedDir("Nested Dirichlet"):::distribution
UniformDir("Uniform Dirichlet"):::distribution
%% Define connections
MeanDecision -- "no" --> UniformDir
MeanDecision -- "yes" --> SDDecision
MeanDecision -- "partially" --> NestedDir
SDDecision -- "no" --> MaxEntDir
SDDecision -- "yes" --> GenDir
SDDecision -- "partially" --> NestedDir
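To give a feel for what sampling shares from a Dirichlet distribution looks like, here is a minimal base R sketch. It does not reproduce the package's Maximum Entropy, generalised, or nested variants: the helper `rdirichlet_sketch` and the scaling factor `concentration` are assumptions introduced only for this illustration.

```r
# Hypothetical helper: draw n samples from a Dirichlet(alpha) distribution
# by normalising independent gamma variates.
rdirichlet_sketch <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  g / rowSums(g)
}

set.seed(1)

# No information on the shares: uniform Dirichlet (all alpha_i = 1).
x_unif <- rdirichlet_sketch(1000, alpha = rep(1, 3))

# Best guesses available: alpha proportional to the shares. The scaling
# factor `concentration` is an arbitrary assumption of this sketch.
shares <- c(0.1, 0.3, 0.6)
concentration <- 20
x_mean <- rdirichlet_sketch(1000, alpha = concentration * shares)

colMeans(x_unif)  # each component averages roughly 1/3
colMeans(x_mean)  # averages are close to the best-guess shares
```

In both cases every sampled row sums to one, so the draws can be used directly as shares of an aggregate.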
The main function is rdisagg, which creates a random sample of
disaggregates based on the information provided:
library(MaxentDisaggregation)
#> Loading required package: truncnorm
#> Loading required package: nloptr
#> Loading required package: gtools
#> Loading required package: data.table
#>
#> Attaching package: 'MaxentDisaggregation'
#> The following object is masked from 'package:gtools':
#>
#> rdirichlet
sample <- rdisagg(n = 1000, mean_0 = 100, sd_0 = 5, min = 0, shares = c(0.1, 0.3, 0.6))
head(sample)
#>           [,1]     [,2]     [,3]
#> [1,] 25.542248 15.44779 54.45482
#> [2,]  2.194530 25.33249 79.60015
#> [3,]  1.481192 16.74706 82.64684
#> [4,] 19.181736 43.81459 45.58907
#> [5,] 18.558886 49.36833 36.33028
#> [6,]  2.633181 38.11875 60.79836
We can plot the marginal histograms of the sample:
hist(sample[,1])
hist(sample[,2])
hist(sample[,3])
The samples are consistent with all information provided. Thus, summing the disaggregate samples should provide an aggregate sample consistent with the information provided (mean: 100, sd: 5):
sample_agg <- rowSums(sample)
hist(sample_agg)
And indeed:
cat('Mean: ', mean(sample_agg), '\n')
#> Mean: 99.91481
cat('SD: ', sd(sample_agg))
#> SD: 5.083025
With MaxentDisaggregation you can also sample the aggregate and the
shares independently using the ragg and rshares functions:
sample_agg <- ragg(1000, mean = 100, sd = 5)
hist(sample_agg)
sample_shares <- rshares(1000, shares = c(0.1, 0.3, 0.6))
boxplot(sample_shares)




