Dataset Splitting
In [1a Goal Setup] Evaluation Metric we identified the goal and an evaluation metric for measuring whether we are doing well.
The purpose of this page is to identify how we can obtain the data needed to compute that evaluation metric appropriately.
The **test set** is used to give an unbiased estimate of performance in the real world.
In order to give an unbiased estimate of performance, the distribution of the test set must match that of the real world.
Example: If we are generating a crop mask for Kenya and have an estimate that 10.19% of Kenya was crop in 2015 (source), then our test set should contain a similar distribution of crop.
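As a sanity check, the test set's class proportion can be compared against the regional estimate. The sketch below uses a toy label array standing in for real test labels; the array contents and the 0.02 tolerance are assumptions for illustration:

```python
import numpy as np

# Toy stand-in for a test set of 10,000 labeled points (1 = crop, 0 = non-crop),
# constructed so that 10.19% of points are crop, matching the Kenya estimate.
test_labels = np.zeros(10_000, dtype=int)
test_labels[:1019] = 1

crop_fraction = test_labels.mean()
print(f"Test-set crop fraction: {crop_fraction:.2%}")  # 10.19%

# A real check would compare the actual labels against the regional estimate:
assert abs(crop_fraction - 0.1019) < 0.02, "test set not representative of region"
```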
Used to assess performance and to make decisions on how to improve it:
- Should more training data be obtained?
- Is the crop mask accurate enough to be released?
- Should a model hyperparameter be tweaked?
Here, the held-out sample is known as the validation set.
Can the test set also serve as the validation set? No: if the test set metric is used to make decisions about the model, then the test set metric is no longer an unbiased estimate of performance in the real world.
- The validation metric is used to make decisions on how to improve model performance.
- These decisions would only be useful if the validation metric was reflective of the test metric.
- If the validation and test set distribution differ substantially, then the decisions made using the validation metric to improve the model would likely have no meaningful effect on the test metric.
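One simple way to keep the validation and test distributions aligned is to collect a single sample from the region of interest, shuffle it, and split it in half. A minimal sketch, where the point count and 50/50 split are illustrative assumptions:

```python
import numpy as np

def split_val_test(indices: np.ndarray, seed: int = 0):
    """Shuffle sampled point indices and split them in half, so the validation
    and test sets are drawn from the same distribution and the validation
    metric is reflective of the test metric."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(indices)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Hypothetical: 1,050 sampled points split into two sets of 525
val_idx, test_idx = split_val_test(np.arange(1_050))
print(len(val_idx), len(test_idx))  # 525 525
```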
1. Traditional ML
- Take all the available training data, and hold out a certain ratio to evaluate the model
- Pros: No to low cost of labeling
- Cons: May not be representative of region of interest
2. Simple Random Sampling
- Randomly sample the data points in the region of interest
- Pros: Representative of large region of interest
- Cons: Medium cost of labeling, minority class may have too few examples
3. Stratified Random Sampling
- Partition the region of interest into strata such that the minority class can be sampled at a higher rate.
- Pros: Representative of a large region of interest and ensures the minority class is sampled often enough for its accuracy estimate to be statistically meaningful
- Cons: High cost of labeling
4. Wall to wall
- Instead of labeling individual points, densely label an entire area and use it to evaluate the model
- Pros: Can be representative for small region of interest with consistent land features
- Cons: High cost of labeling
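For illustration, stratified random sampling (option 3) can be sketched as follows, using the mapped classes as strata so the minority crop class is sampled at a higher rate than its prevalence. All class proportions and per-stratum counts here are assumptions:

```python
import numpy as np

def stratified_sample(labels: np.ndarray, n_per_stratum: dict, seed: int = 0) -> np.ndarray:
    """Sample point indices from each stratum (mapped class) at its own rate,
    so the minority (crop) class can be oversampled relative to its prevalence."""
    rng = np.random.default_rng(seed)
    chosen = []
    for stratum, n in n_per_stratum.items():
        idx = np.flatnonzero(labels == stratum)
        chosen.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(chosen)

# Hypothetical mapped classes over 10,000 pixels: ~10% crop (1), ~90% non-crop (0)
rng = np.random.default_rng(1)
mapped = (rng.random(10_000) < 0.10).astype(int)

# Oversample the minority crop class: 250 crop + 275 non-crop points
sample_idx = stratified_sample(mapped, {1: 250, 0: 275})
print(len(sample_idx))  # 525
```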
Since crop-masks are being created for countries (large regions of interest) where one class (non-crop) makes up the majority and the other class (crop) makes up the minority, 3. Stratified Random Sampling is most fitting. However, due to its high cost of labeling, 2. Simple Random Sampling for the region of interest can also be used.
If the validation and test sets are generated using the 1. Traditional ML method, it is highly likely that the dataset is not representative of the large region of interest, and the generated metric will therefore not reflect the real-world metric.
If the validation and test sets are generated using the 4. Wall to wall method, it is likely that the small region used to generate the datasets is not representative of the large region of interest, and the generated metric will therefore not reflect the real-world metric.
Conclusion: 2. Simple Random Sampling should be used to collect validation and test data.
We use the method identified by Olofsson et al. in Good practices for estimating area and assessing accuracy of land change (eq 13) to determine sample size:
n ≈ (Σ(WiSi) / S(Ô))²
| Symbol | Meaning |
|---|---|
| Wi | Mapped proportion of class i |
| Si | Standard deviation of class i: Si = √(Ui(1 − Ui)) |
| Ui | Expected user's accuracy for class i |
| S(Ô) | Target standard error of overall accuracy |
| n | Sample size |
If we assume a user's accuracy of 0.7 for both classes and choose a standard error of 0.02, then Si = √(0.7 × 0.3) ≈ 0.458 for both classes, and since Σ(Wi) = 1,
n ≈ ((W0 + W1)(0.458) / 0.02)² ≈ 525
Conclusion: The validation set and the test set should each have a sample size of at least 525.
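The sample-size computation above can be reproduced with a small helper. This is a sketch of eq. 13; the 0.7 user's accuracies, the 0.02 standard error, and the 10.19% mapped crop proportion are the assumed values from this page:

```python
import math

def olofsson_sample_size(w, u, target_se):
    """Olofsson et al. (2014), eq. 13: n ≈ (Σ Wi·Si / S(Ô))²,
    where Si = sqrt(Ui·(1 − Ui)) is the standard deviation of class i."""
    numerator = sum(wi * math.sqrt(ui * (1 - ui)) for wi, ui in zip(w, u))
    return (numerator / target_se) ** 2

# Two classes (crop / non-crop), each with an assumed user's accuracy of 0.7;
# the mapped proportions sum to 1, so they cancel out of the calculation.
n = olofsson_sample_size(w=[0.1019, 0.8981], u=[0.7, 0.7], target_se=0.02)
print(round(n))  # 525
```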
- Do we really need an unbiased estimate of performance (a test set)? Or do we just want to generate the best crop masks we can?