# Appendix 2: NuSEDS Data Processing {#appendix-2 .unnumbered}

This is a summary of the NuSEDS cleaning procedure and attribution of the data to Conservation Units (CUs) as defined in the Pacific Salmon Explorer ([PSE](https://www.salmonexplorer.ca/)). The complete details of the cleaning procedure can be found at [1_nuseds_collation.html](https://salmonwatersheds.github.io/population-indicators/spawner-surveys/code/1_nuseds_collation.html) and the corresponding Rmd script can be accessed from [github/.../spawner-surveys/code/1_nuseds_collation.rmd](https://github.com/salmonwatersheds/population-indicators/blob/master/spawner-surveys/code/1_nuseds_collation.rmd).

This is the first step in preparing the spawner-survey data and it concerns the cleaning of NuSEDS exclusively. The subsequent steps and associated documentation are listed at the end of this page.


## Cleaning Procedure {-}

The objective of the procedure is to obtain the yearly counts (i.e. "time series") of each salmon **population** -- defined as a group of salmon belonging to the same conservation unit (CU) and spawning in the same stream -- and to associate these populations with their CU. The NuSEDS data are split into two datasets. The *All Areas NuSEDS* dataset contains the observed yearly counts (related fields: `NATURAL_ADULT_SPAWNERS`, `NATURAL_SPAWNERS_TOTAL`, etc.) for each population (related fields: `SPECIES`, `POP_ID`, `POPULATION`) at its respective site (related fields: `AREA`, `GFE_ID`, `WATERBODY`, `GAZETTED_NAME`, etc.), along with the associated methods used (related fields: `ESTIMATE_METHOD`, `ESTIMATE_CLASSIFICATION`, `ENUMERATION_METHODS`). Note that as of the 2025-11-03 NuSEDS update, the CU-related field `FULL_CU_IN` is also present in *All Areas NuSEDS*.

The second dataset, *Conservation Unit Census Sites*, links each population (related field: `POP_ID`) to its respective CU (related fields: `CU_NAME`, `FULL_CU_IN`, `CU_LONGT`, `CU_LAT`, etc.) and site (related fields: `GFE_ID`, `CENSUS_SITE`, `X_LONGT`, `Y_LAT`). Ideally, attributing to each time series in *All Areas NuSEDS* its corresponding CU and location coordinates from *Conservation Unit Census Sites* would simply consist of merging the two datasets using the population and location identifiers, `POP_ID` and `GFE_ID`, respectively. Unfortunately, numerous time series are problematic. Problems arise when:

- A time series is present in *All Areas NuSEDS* but its `POP_ID` and `GFE_ID` association is absent from *Conservation Unit Census Sites* (this is the case for 4447 populations).

- Multiple time series of the same population (i.e. same `POP_ID`) are observed in multiple locations (i.e. different `GFE_ID` values), which should not occur because a `POP_ID` should be defined for a unique location.

- Multiple populations (i.e. different `POP_ID`) of the same CU are observed in the same location (i.e. same `GFE_ID`), suggesting that these populations should form one unique population (these `POP_ID` are probably related to different surveys of the same population).

Inspecting the problematic time series revealed inconsistencies such as missing, duplicated, and conflicting data points (i.e. different counts in the same year). The goal of the procedure is to fix these time series to rescue as many data points as possible.
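
The three problem types listed above can be detected from the `POP_ID` and `GFE_ID` associations alone. The following is a minimal sketch in Python rather than the R code used in the actual scripts; the toy identifiers, the CU labels, and the variable names are all hypothetical and serve only to illustrate the checks.

```python
from collections import defaultdict

# Hypothetical toy data: (POP_ID, GFE_ID) pairs observed in All Areas NuSEDS
nuseds_pairs = {(1, 10), (2, 20), (2, 21), (3, 30), (4, 30)}
# Hypothetical toy reference table: (POP_ID, GFE_ID) -> CU label
cu_sites = {(1, 10): "CU-A", (2, 20): "CU-B", (3, 30): "CU-C", (4, 30): "CU-C"}

# Problem 1: the POP_ID/GFE_ID association is absent from the reference table
missing_reference = sorted(p for p in nuseds_pairs if p not in cu_sites)

# Problem 2: one POP_ID is observed at several GFE_ID
sites_per_pop = defaultdict(set)
for pop_id, gfe_id in nuseds_pairs:
    sites_per_pop[pop_id].add(gfe_id)
multi_site_pops = sorted(p for p, sites in sites_per_pop.items() if len(sites) > 1)

# Problem 3: several POP_ID of the same CU share one GFE_ID
pops_per_cu_site = defaultdict(set)
for (pop_id, gfe_id), cu in cu_sites.items():
    pops_per_cu_site[(cu, gfe_id)].add(pop_id)
shared_sites = {k: sorted(v) for k, v in pops_per_cu_site.items() if len(v) > 1}
```

In this toy example, the pair `(2, 21)` has no reference, `POP_ID` 2 appears at two sites, and `POP_ID` 3 and 4 share one site within the same CU.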


### Determine total count (MAX_ESTIMATE) {-}

We first define the unique yearly count field `MAX_ESTIMATE` for each population as the maximum value of the count-related fields in *All Areas NuSEDS*, i.e., `NATURAL_ADULT_SPAWNERS`, `NATURAL_JACK_SPAWNERS`, `NATURAL_SPAWNERS_TOTAL`, `ADULT_BROODSTOCK_REMOVALS`, `JACK_BROODSTOCK_REMOVALS`, `TOTAL_BROODSTOCK_REMOVALS`, `OTHER_REMOVALS` and `TOTAL_RETURN_TO_RIVER`. `MAX_ESTIMATE` is the only count-related field we use in the rest of the procedure. A population’s `MAX_ESTIMATE` data points are referred to as its “time series”.
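
As an illustration, the row-wise maximum could be computed as in the sketch below. It is written in Python for brevity and is not the code used in the actual R scripts; the function name `max_estimate` is hypothetical, and we assume here that missing values are ignored and that a row with no counts at all yields a missing `MAX_ESTIMATE`.

```python
COUNT_FIELDS = [
    "NATURAL_ADULT_SPAWNERS", "NATURAL_JACK_SPAWNERS", "NATURAL_SPAWNERS_TOTAL",
    "ADULT_BROODSTOCK_REMOVALS", "JACK_BROODSTOCK_REMOVALS",
    "TOTAL_BROODSTOCK_REMOVALS", "OTHER_REMOVALS", "TOTAL_RETURN_TO_RIVER",
]

def max_estimate(row):
    """Return the largest non-missing count in `row`, or None if all are missing."""
    # Assumption: missing counts (None) are skipped rather than propagated
    values = [row[f] for f in COUNT_FIELDS if row.get(f) is not None]
    return max(values) if values else None

# Hypothetical record with some missing count fields
row = {
    "NATURAL_ADULT_SPAWNERS": 120,
    "NATURAL_JACK_SPAWNERS": 8,
    "NATURAL_SPAWNERS_TOTAL": 128,
    "TOTAL_RETURN_TO_RIVER": None,
}
estimate = max_estimate(row)
```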


### Remove duplicated and conflicting rows {-}

There are 60 duplicated rows in *All Areas NuSEDS* when considering the fields related to population (`SPECIES`, `POP_ID`, `POPULATION`), location (`GFE_ID`) and counts (`Year`, `NATURAL_ADULT_SPAWNERS`, `NATURAL_JACK_SPAWNERS`, `NATURAL_SPAWNERS_TOTAL`, `ADULT_BROODSTOCK_REMOVALS`, `JACK_BROODSTOCK_REMOVALS`, `TOTAL_BROODSTOCK_REMOVALS`, `OTHER_REMOVALS`, `TOTAL_RETURN_TO_RIVER`, `ENUMERATION_METHODS`, `ESTIMATE_CLASSIFICATION`), most of them having NA for `MAX_ESTIMATE`. These rows are removed. There are no duplicated rows in *Conservation Unit Census Sites*.
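
A sketch of this kind of de-duplication is shown below, in Python rather than the R used in the actual scripts. The helper name `drop_duplicate_rows` and the toy rows are hypothetical, the key is shortened to four fields for readability, and we assume that the first occurrence of each duplicated row is the one retained.

```python
def drop_duplicate_rows(rows, key_fields):
    """Keep the first row for each distinct combination of `key_fields`."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Hypothetical rows: the first two are identical on every key field
rows = [
    {"POP_ID": 1, "GFE_ID": 10, "Year": 2000, "NATURAL_SPAWNERS_TOTAL": None},
    {"POP_ID": 1, "GFE_ID": 10, "Year": 2000, "NATURAL_SPAWNERS_TOTAL": None},
    {"POP_ID": 1, "GFE_ID": 10, "Year": 2001, "NATURAL_SPAWNERS_TOTAL": 50},
]
unique_rows = drop_duplicate_rows(
    rows, ["POP_ID", "GFE_ID", "Year", "NATURAL_SPAWNERS_TOTAL"]
)
```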

There are two instances in *All Areas NuSEDS* where the same `POP_ID` has two different `MAX_ESTIMATE` values in the same `Year`. We keep the value corresponding to the better method (`ESTIMATE_METHOD`) or the most recent entry.


### Find missing stream coordinates {-}

There are nine locations (`GFE_ID`) without coordinates (`Y_LAT`, `X_LONGT`) in *Conservation Unit Census Sites* and 23 in *All Areas NuSEDS* (the coordinates could not be found in the other DFO files with `GFE_ID` that were sent to us). We define these coordinates manually using the best information available.


### Remove time series only made of NAs {-}

There are 4425 time series in *All Areas NuSEDS* that only have NAs for `MAX_ESTIMATE`. The corresponding 102,104 rows (24.4%) are removed (time series containing both NAs and 0s are kept).
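
The filter can be sketched as follows, in Python rather than the R used in the actual scripts. The function name `drop_all_na_series` and the toy rows are hypothetical, and we assume a time series is identified by its `POP_ID` and `GFE_ID` pair. A count of 0 is treated as a real observation, so a series made of NAs and 0s is kept.

```python
from collections import defaultdict

def drop_all_na_series(rows):
    """Remove every row of a series whose MAX_ESTIMATE values are all missing."""
    has_value = defaultdict(bool)
    for r in rows:
        key = (r["POP_ID"], r["GFE_ID"])
        # A value of 0 is not None, so it counts as an observation
        has_value[key] = has_value[key] or r["MAX_ESTIMATE"] is not None
    return [r for r in rows if has_value[(r["POP_ID"], r["GFE_ID"])]]

# Hypothetical rows: series (1, 10) is all NA, series (2, 20) has NAs and a 0
rows = [
    {"POP_ID": 1, "GFE_ID": 10, "Year": 2000, "MAX_ESTIMATE": None},
    {"POP_ID": 1, "GFE_ID": 10, "Year": 2001, "MAX_ESTIMATE": None},
    {"POP_ID": 2, "GFE_ID": 20, "Year": 2000, "MAX_ESTIMATE": None},
    {"POP_ID": 2, "GFE_ID": 20, "Year": 2001, "MAX_ESTIMATE": 0},
]
kept_rows = drop_all_na_series(rows)
```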


### Time series not in Conservation Unit Census Sites {-}

There are 264 time series in *All Areas NuSEDS* whose reference (i.e. `POP_ID` and `GFE_ID` association) is not in *Conservation Unit Census Sites*. Among those, only 53 have a `CU_NAME` and `FULL_CU_IN`, which we use to find the corresponding PSE `cuid` and `cu_name_pse`. For the remaining 211 time series without a `CU_NAME` and `FULL_CU_IN` (corresponding to 208 `POP_ID`), we first find their `cuid` and `cu_name_pse` by intersecting their stream coordinates (`X_LONGT`, `Y_LAT`) with the CU shapefiles used in the PSE. When more than one CU layer is intersected (for the same species), we use the information in `POPULATION` and `WATERBODY` to manually select the correct CU. Once their `cuid` is found, we can find their `CU_NAME` and `FULL_CU_IN`.

After the procedure, there remain (1) five populations for which we found their `cuid` and `cu_name_pse` but did not find the corresponding `CU_NAME` and `FULL_CU_IN`, and (2) two time series with a `CU_NAME` and `FULL_CU_IN` for which we could not find a `cuid` and `cu_name_pse`.

The reference of these time series is then added to *Conservation Unit Census Sites*.
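
The coordinate-based step described above amounts to a point-in-polygon test against each CU boundary. The sketch below illustrates the idea in plain Python with a planar ray-casting test; it is not the implementation used in the actual scripts, which rely on the PSE shapefiles, and the function names, the `cuid` values 101 and 102, and the square boundaries are all hypothetical. A real analysis would use a GIS library and handle projections, holes, and multi-part polygons.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon vertex list."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def candidate_cus(x, y, cu_polygons):
    """Return the sorted cuid values whose polygon contains the point."""
    return sorted(
        cuid for cuid, poly in cu_polygons.items() if point_in_polygon(x, y, poly)
    )

# Hypothetical, overlapping CU boundaries for one species
cu_polygons = {
    101: [(0, 0), (2, 0), (2, 2), (0, 2)],
    102: [(1, 1), (3, 1), (3, 3), (1, 3)],
}
single_match = candidate_cus(0.5, 0.5, cu_polygons)
double_match = candidate_cus(1.5, 1.5, cu_polygons)
```

A point returning more than one candidate, as in the second call, corresponds to the situation where the correct CU has to be selected manually.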


### Find the cuid and cu_name_pse of the remaining time series {-}

We now find the `cuid` and `cu_name_pse` for all the remaining time series using `FULL_CU_IN` and `CU_NAME` in both *All Areas NuSEDS* and *Conservation Unit Census Sites*.

After the procedure, there remain (1) the five time series with a `cuid` and `cu_name_pse` but no `CU_NAME` and `FULL_CU_IN` (the ones mentioned in the section above) and (2) 86 time series (corresponding to 22 `FULL_CU_IN` and `CU_NAME`) for which we could not find a `cuid` and `cu_name_pse`. These series are kept at this stage.


### Cases where a CU has multiple time series in a single location {-}

There are 79 instances where multiple time series of a single CU are associated with a single location (`GFE_ID`). Checking all these cases reveals clearly duplicated data points or single data points that are not worth keeping. To fix these issues, we proceed as follows:

-   **Case 1**: one of the duplicated series has only one data point:

    -   if it is complementary: merge to the other (longer) series

    -   if it is in conflict or a duplicate: remove the focal series

-   **Case 2**: the shorter series is 100% duplicated: remove the focal series

-   **Case 3**: for the rest of the duplicated series:

    -   points that are conflicting or duplicated are summed up

    -   points that are complementary are merged

In the few cases where conflicting data points are summed, we assume that the different runs (e.g., “Chinook Run 1” and “Chinook Run 2”) can be considered a single population. For example, the Bridge River has "Summer" and "Late run" sockeye surveys, but these both belong to the MIDDLE FRASER river-type sockeye CU.

In the few instances where data points are summed, we define the `ESTIMATE_CLASSIFICATION` (e.g., “RELATIVE ABUNDANCE (TYPE-3)”) as the value corresponding to the highest `MAX_ESTIMATE` value between the two data points.
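
The three cases above can be summarized in a small function. This is a simplified sketch in Python, not the code of the actual R scripts: each series is represented as a hypothetical mapping from year to `MAX_ESTIMATE`, the function name `combine_series` is ours, and the handling of `ESTIMATE_CLASSIFICATION` is left out.

```python
def combine_series(longer, shorter):
    """Combine two {year: count} series for the same CU and location."""
    overlap = set(longer) & set(shorter)
    # Case 1: the shorter series has a single data point
    if len(shorter) == 1:
        if overlap:
            return dict(longer)  # conflict or duplicate: drop the single point
        return {**longer, **shorter}  # complementary: merge
    # Case 2: the shorter series is entirely duplicated in the longer one
    if all(y in longer and longer[y] == c for y, c in shorter.items()):
        return dict(longer)
    # Case 3: sum overlapping years, merge complementary years
    merged = dict(longer)
    for year, count in shorter.items():
        merged[year] = merged[year] + count if year in merged else count
    return merged

# Hypothetical series illustrating case 3
summed = combine_series({2000: 10, 2001: 12}, {2001: 3, 2002: 4})
```

Here the year 2001 is present in both series and its counts are summed, while 2002 is complementary and simply added.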


### Cases where a population has time series in multiple locations {-}

There are 37 instances where a single `POP_ID` is associated with multiple locations (`GFE_ID`). As in the previous section, checking all these cases reveals inconsistencies in the data. For instance, the `POPULATION` "Fennel Creek Early Summer Sockeye" (`POP_ID` = 3416) has two complementary time series, one in the `WATERBODY` "FENNEL CREEK AND SAKUM CREEK" (`GFE_ID` = 2746) and one in "FENNEL CREEK" (`GFE_ID` = 261), and these two locations have the same coordinates (`Y_LAT` and `X_LONGT`). In such cases, we merge the two series by replacing the location-related fields (i.e. `WATERBODY`, `CENSUS_SITE`, `GFE_ID`, etc.) of one time series with the values of the other time series in *All Areas NuSEDS*. We only make changes to the data when the issue is obvious and the appropriate information is available to make the correction.
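
Such a merge boils down to overwriting the location-related fields of one series. The sketch below uses Python rather than the R of the actual scripts; the function name `relocate`, the years, and the counts are hypothetical, only two location fields are shown, and the choice of which location is retained is arbitrary here because the text above does not specify it.

```python
def relocate(rows, pop_id, from_gfe_id, target_location):
    """Overwrite the location fields of one series with those of another site."""
    out = []
    for r in rows:
        if r["POP_ID"] == pop_id and r["GFE_ID"] == from_gfe_id:
            r = {**r, **target_location}  # copy, leaving the input untouched
        out.append(r)
    return out

# Hypothetical data points for the Fennel Creek example
rows = [
    {"POP_ID": 3416, "GFE_ID": 2746, "Year": 1990, "MAX_ESTIMATE": 100,
     "WATERBODY": "FENNEL CREEK AND SAKUM CREEK"},
    {"POP_ID": 3416, "GFE_ID": 261, "Year": 1991, "MAX_ESTIMATE": 150,
     "WATERBODY": "FENNEL CREEK"},
]
# Assumption for illustration: the series is moved to GFE_ID 261
merged_rows = relocate(
    rows, 3416, 2746, {"GFE_ID": 261, "WATERBODY": "FENNEL CREEK"}
)
```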


### Additional corrections for the Northern Transboundary {-}

PSF formed a technical working group (TWG) specifically to compile data for the Transboundary region (cf. [transboundary-data](https://github.com/salmonwatersheds/transboundary-data/tree/main)). As part of this work, the following modifications were requested:


- Records of TATSAMENIE RIVER coho (`POP_ID = 45152`) for 1994 and earlier are changed to TATSATUA RIVER (`POP_ID = 45154`), and the remaining records (1995 onward) are removed.

- Records of `POP_ID = 45151` (Tatsamenie River lake-type sockeye) for 1994 and earlier are changed to `POP_ID = 45153` (TATSATUA RIVER river-type sockeye).

- One record with `POP_ID = 45165` (Chinook Run 2) is changed to `POP_ID = 45164` (Chinook Run 1) in the Nahlin River.
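
The first two rules can be expressed as a simple year-dependent reassignment, sketched below in Python rather than the R of the actual scripts. The function name `apply_transboundary_rules` and the toy rows are hypothetical, only `POP_ID` is updated (the associated name and location fields are omitted), and the third rule is left out because the record concerned is not identified here.

```python
def apply_transboundary_rules(rows):
    """Apply the two year-based POP_ID reassignments requested by the TWG."""
    out = []
    for r in rows:
        pop_id, year = r["POP_ID"], r["Year"]
        if pop_id == 45152:
            if year >= 1995:
                continue  # coho records from 1995 onward are removed
            r = {**r, "POP_ID": 45154}
        elif pop_id == 45151 and year <= 1994:
            r = {**r, "POP_ID": 45153}
        out.append(r)
    return out

# Hypothetical records around the 1994/1995 cut-off
rows = [
    {"POP_ID": 45152, "Year": 1990},
    {"POP_ID": 45152, "Year": 1996},
    {"POP_ID": 45151, "Year": 1994},
    {"POP_ID": 45151, "Year": 2000},
]
corrected = apply_transboundary_rules(rows)
```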


### Merge the two datasets {-}

We merge *All Areas NuSEDS* and *Conservation Unit Census Sites* and the resulting file is exported.
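
Conceptually, this is a join on the `POP_ID` and `GFE_ID` pair. The sketch below shows one way to express it in Python; it is not the code of the actual R scripts, the function name `merge_on_reference` and the toy values are hypothetical, and we assume every row of *All Areas NuSEDS* is retained (a left join), with its own values taking precedence where both tables share a field.

```python
def merge_on_reference(nuseds_rows, cu_site_rows):
    """Attach the CU and site fields to each count row via (POP_ID, GFE_ID)."""
    lookup = {(s["POP_ID"], s["GFE_ID"]): s for s in cu_site_rows}
    merged = []
    for r in nuseds_rows:
        site = lookup.get((r["POP_ID"], r["GFE_ID"]), {})
        merged.append({**site, **r})  # count-row values win on shared fields
    return merged

# Hypothetical toy tables
nuseds_rows = [
    {"POP_ID": 1, "GFE_ID": 10, "Year": 2000, "MAX_ESTIMATE": 40},
    {"POP_ID": 2, "GFE_ID": 20, "Year": 2000, "MAX_ESTIMATE": 15},
]
cu_site_rows = [
    {"POP_ID": 1, "GFE_ID": 10, "CU_NAME": "EXAMPLE CU", "Y_LAT": 50.0, "X_LONGT": -125.0},
]
merged = merge_on_reference(nuseds_rows, cu_site_rows)
```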


### Next steps {-}

Several subsequent scripts process the spawner-survey data before it is ready for the PSE:

- [2_nuseds_cuid_pse.rmd](https://github.com/salmonwatersheds/population-indicators/blob/master/spawner-surveys/code/2_nuseds_cuid_pse.rmd): to apply additional corrections to the time series and modifications related to the PSE; for all details: [2_nuseds_cuid_pse.html](https://salmonwatersheds.github.io/population-indicators/spawner-surveys/code/2_nuseds_cuid_pse.html); the different versions of the dataset can be downloaded from [zenodo/.../2_nuseds_cuid_streamid](https://doi.org/10.5281/zenodo.14194638).

- [3_data_extra_Reynolds_lab.rmd](https://github.com/salmonwatersheds/population-indicators/blob/master/spawner-surveys/code/3_data_extra_Reynolds_lab.rmd): to include the Reynolds's Lab data for several populations in the Central Coast; for all details:  [3_data_extra_Reynolds_lab.html](https://salmonwatersheds.github.io/population-indicators/spawner-surveys/code/3_data_extra_Reynolds_lab.html).

- [yukon-data](https://github.com/salmonwatersheds/yukon-data): to compile additional data for all the CUs.

- [columbia-data](https://github.com/salmonwatersheds/columbia-data): to compile additional data for the lake sockeye "Osoyoos" CU.

- [transboundary-data](https://github.com/salmonwatersheds/transboundary-data): to compile additional data and corrections to NuSEDS data for multiple CUs.

- [central-coast-data](https://github.com/salmonwatersheds/central-coast-data): to compile additional data for the lake sockeye "South Atnarko Lake" CU.

- [steelhead-data](https://github.com/salmonwatersheds/steelhead-data): to compile the data for all CUs across all regions.

- [4_datasets_for_PSE.R](https://github.com/salmonwatersheds/population-indicators/blob/master/spawner-surveys/code/4_datasets_for_PSE.R): to combine all the datasets generated in the different repositories above.