
Dataset Splitting

Ivan Zvonkov edited this page Apr 3, 2023 · 1 revision

In [1a Goal Setup] Evaluation Metric we identified the goal and an evaluation metric to measure whether we are doing well.

The purpose of this page is to identify how to obtain the data needed to compute that evaluation metric appropriately.

Test Set

Used to give an unbiased estimate of performance in the real world.

Test set distribution

In order to give an unbiased estimate of performance, the distribution of the test set must match that of the real world.

Example: If we are generating a crop mask for Kenya and have an estimate that 10.19% of Kenya was crop in 2015 (source), then our test set should contain a similar distribution of crop.
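As a quick sanity check, the test set's class balance can be compared against the real-world estimate. A minimal sketch, assuming labels are encoded as 1 = crop and 0 = non-crop; the function name and the 2-percentage-point tolerance are illustrative choices, not part of the original method:

```python
# Check that the test set's crop fraction is close to the real-world
# estimate (~10.19% crop for Kenya in 2015).
def check_test_distribution(labels, real_world_crop_fraction=0.1019, tolerance=0.02):
    crop_fraction = sum(labels) / len(labels)
    matches = abs(crop_fraction - real_world_crop_fraction) <= tolerance
    return crop_fraction, matches

# Hypothetical test set: 102 crop points out of 1000.
fraction, ok = check_test_distribution([1] * 102 + [0] * 898)
print(f"crop fraction: {fraction:.4f}, matches real world: {ok}")  # 0.1020, True
```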

Validation Set

Used to assess performance and make decisions on how to improve performance.

Example decisions based on validation metric

  • Should more training data be obtained?
  • Is the crop mask accurate enough to be released?
  • Should a model hyperparameter be tweaked?

The sample held out for making these decisions is known as the validation set.

Can the test set be used as the validation set?

No. If the test set metric is used to make decisions regarding the model, then the test set metric is no longer an unbiased estimate of performance in the real world.

Validation set distribution

  • The validation metric is used to make decisions on how to improve model performance.
  • These decisions would only be useful if the validation metric was reflective of the test metric.
  • If the validation and test set distribution differ substantially, then the decisions made using the validation metric to improve the model would likely have no meaningful effect on the test metric.

Getting the data

Methods to generate a validation and test set

1. Traditional ML

  • Take all the available training data and hold out a certain fraction to evaluate the model
  • Pros: Little to no labeling cost
  • Cons: May not be representative of the region of interest
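The Traditional ML approach can be sketched as a simple shuffled hold-out split; the point format, function name, and 20% hold-out ratio are assumptions for illustration:

```python
import random

# Hold out a fraction of the existing labeled points for evaluation,
# without collecting any new labels.
def holdout_split(points, holdout_ratio=0.2, seed=42):
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_ratio)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (train, evaluation)

train, evaluation = holdout_split(list(range(100)))
print(len(train), len(evaluation))  # 80 20
```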

2. Simple Random Sampling

  • Randomly sample the data points in the region of interest
  • Pros: Representative of large region of interest
  • Cons: Medium cost of labeling, minority class may have too few examples
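Simple random sampling can be sketched by drawing uniform points to label over the region of interest; here the region is approximated by a rough lat/lon bounding box for Kenya (the bounds are an assumption, and a real implementation would sample within the actual country boundary polygon):

```python
import random

# Draw n uniformly random (lat, lon) locations to be labeled.
def simple_random_sample(n, lat_range=(-4.7, 5.5), lon_range=(33.9, 41.9), seed=0):
    rng = random.Random(seed)
    return [
        (rng.uniform(*lat_range), rng.uniform(*lon_range))
        for _ in range(n)
    ]

points = simple_random_sample(525)
print(len(points))  # 525
```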

3. Stratified Random Sampling

  • Partition the region of interest into strata such that the minority class can be sampled at a higher rate.
  • Pros: Representative of a large region of interest and ensures the minority class accuracy can be estimated with sufficient precision
  • Cons: High cost of labeling
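Stratified random sampling can be sketched as drawing a fixed number of candidate locations per stratum, where the strata might come from a preliminary (possibly noisy) crop/non-crop map so the minority crop class is sampled at a higher rate. The candidate sets and per-stratum sample sizes below are assumptions:

```python
import random

# Sample a fixed number of candidate locations from each stratum.
def stratified_sample(candidates_by_stratum, n_per_stratum, seed=0):
    rng = random.Random(seed)
    sample = []
    for stratum, candidates in candidates_by_stratum.items():
        n = min(n_per_stratum[stratum], len(candidates))
        sample.extend((stratum, point) for point in rng.sample(candidates, n))
    return sample

# Hypothetical candidate locations per stratum (crop is the minority class,
# so it is oversampled relative to its mapped proportion).
candidates = {"crop": list(range(1000)), "non_crop": list(range(9000))}
sample = stratified_sample(candidates, {"crop": 250, "non_crop": 275})
print(len(sample))  # 525
```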

4. Wall to wall

  • Instead of labeling individual points, densely label an entire area and use it to evaluate the model
  • Pros: Can be representative for small region of interest with consistent land features
  • Cons: High cost of labeling

Which validation/test set generation method is best for the crop-mask use case?

Since crop-masks are being created for countries (large regions of interest) where one class (non-crop) makes up the majority and the other class (crop) makes up the minority, 3. Stratified Random Sampling is most fitting. However, due to its high cost of labeling, 2. Simple Random Sampling for the region of interest can also be used.

If the validation and test sets are generated using the 1. Traditional ML method then it is highly likely that the dataset is not representative of the large region of interest and therefore the generated metric will not reflect the real world metric.

If the validation and test sets are generated using the 4. Wall to wall method then it is likely that the small region used to generate the datasets is not representative of the large region of interest and therefore the generated metric will not reflect the real world metric.

Conclusion: 2. Simple Random Sampling should be used to collect validation and test data.

How much data?

We use the method identified by Olofsson et al. in Good practices for estimating area and assessing accuracy of land change (eq 13) to determine sample size:

n ≈ ( Σ(Wi·Si) / S(Ô) )²

Where:

| Symbol | Meaning |
| --- | --- |
| Wi | Mapped proportion of class i |
| Si | Standard deviation, √(Ui(1 − Ui)) |
| Ui | Expected user's accuracy for class i |
| S(Ô) | Target standard error of overall accuracy |
| n | Sample size |

If we assume a user's accuracy of 0.7 for both classes, then Si = √(0.7 × 0.3) ≈ 0.458 for each class. Choosing a standard error of 0.02 and using Σ(Wi) = 1,

n ≈ ((W0(0.458) + W1(0.458)) / 0.02)² ≈ 525
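The calculation above can be sketched in Python. The mapped proportions W0 ≈ 0.8981 and W1 ≈ 0.1019 are taken from the Kenya estimate earlier on this page (since both classes share the same user's accuracy, they cancel out of the result); the function name is an assumption:

```python
import math

# Olofsson et al. (2014), eq. 13: n ≈ (Σ(Wi·Si) / S(Ô))², with
# Si = sqrt(Ui * (1 - Ui)).
def olofsson_sample_size(mapped_proportions, user_accuracies, standard_error):
    weighted_sd = sum(
        w * math.sqrt(u * (1 - u))
        for w, u in zip(mapped_proportions, user_accuracies)
    )
    return (weighted_sd / standard_error) ** 2

n = olofsson_sample_size([0.8981, 0.1019], [0.7, 0.7], 0.02)
print(round(n))  # 525
```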

Conclusion: The test set and the validation set should each have a sample size of at least 525.

Remaining Questions

  1. Do we really need an unbiased estimate of performance (a test set)? Or do we just want to generate the best crop masks we can?
