forked from ices-eg/WKSSFGEO
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathprepare_data_WKSSFGEO.Rmd
More file actions
409 lines (296 loc) · 11.5 KB
/
prepare_data_WKSSFGEO.Rmd
File metadata and controls
409 lines (296 loc) · 11.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
---
title: "Prepare tracking data"
author: 'M.M.Rufino, T. Mendo and J.Egekvist'
date: "Last compiled on `r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
toc: yes
toc_floot: true
number_sections: yes
theme: cosmo
word_document:
fig_caption: yes
fig_height: 5
fig_width: 10
pdf_document:
fig_height: 5
fig_width: 10
#highlight: tango
keep_tex: yes
number_sections: TRUE
toc: true
#fontfamily: mathpazo
fontsize: 11pt
geometry: margin=0.5in
papersize: a4
---
```{r setup, include=FALSE}
# first date: "15/11/2021"
# clean workspace
rm(list = ls())
## Load packages
# ploting
require(ggplot2)
require(RColorBrewer)
require(viridis)
require(gridExtra)
# spatial
require(sf)
require(raster)
# for interactive mapping
require(mapview)
# gramatics and facilitators
require(dplyr)
# for handling time series
require(lubridate)
# for coastline
#require(rnaturalearth)
#require(marmap)
# test1: from Rstudio mac to git push
# 2nd try test1: from Rstudio mac to git push
```
# General remarks
The following is a conceptual workflow for identifying fishing trips from highly resolved spatial data. This script will be used and improved during the WKSSFGEO workshop, hosted by IPMA.\
Please remember every fishery as its specificities, which should be known in detail to perform any analysis.\
These data can be obtained from different devices, namely:
- Automatic Identification Systems (AIS)\
- GPS trackers\
- Electronic monitoring system (can be equiped with video or other devices that permit to make the validation)\
- Any other trackers transmitting at least latitude, longitude, and a time stamp.\
# I. Data preparation
We will first import the data into R, and check if it is ok.\
Then we will format the columns into the correct type (i.e. data, etc.) and make it a spatial object.\
```{r open_data}
#######################
# Import the data from the url
# kk <- read.table(url("https://raw.githubusercontent.com/ices-eg/WKSSFGEO/main/example_data_AIS.csv"), head=TRUE, sep=",")
# or download the file and import from local drive
# or open it directly from github folder
kk1 <- read.table("example_data_AIS.csv", head=TRUE, sep=",")
dim(kk1)
# 395843 8
names(kk1)
# "vessel_id", "time_stamp","lon","lat",
# "speed", "course","gear"
head(kk1)
# note there is an extra col in this dataset
table(kk1$behaviour, useNA = "always")
# select data where behaviour is 'Fishing'
kk1 <- kk1[kk1$behaviour=="Fishing",]
# 1. All records have vessel_id?
table(kk1$vessel_id, useNA = "always")
# 2. All records have lat/lon? Extent of the study area.
range(kk1$lat)
range(kk1$lon)
# 3. Dates
# format time cols
kk1$time_stamp[1:2]
head(kk1$data<-as.POSIXct(as.character(kk1$time_stamp), format = "%Y-%m-%d %H:%M:%S"))
# temporal duration
range(kk1$data)
# check difference between successive points (to redo after filtering perhaps)
kk1 %>%
group_by(vessel_id) %>%
summarise(diff=diff(data)) %>%
distinct(diff)
# other situations:
# also check if you have any special code for missing values
# note that if the data is imported from excel we might need to make something like this:
# as.Date(as.numeric(record.date), origin = "1899-12-30")
```
# Workflow
## 1) Extent of the study area
Define the extent of the study area and remove latitudes and longitudes outside this area extent.
Rationale: Some devices might send 0,0 messages for lats and longs when there is not good reception of satellites. Also, if using on-board observers or fishers that transport the devices you might get points in land if they turned them on before the trip.
```{r area_extent}
######################
# range of the data
range(kk1$lat)
range(kk1$lon)
# in this case it appears to be no issue on the lats/lons
#######################
# If the lat/lons are not correct, cutoff the area
# dim(kk1)
# kk1 <- kk1 %>%
# filter(lat>55,lon<0, lon>(-8))
# dim(kk1)
# make a spatial object
## note we are assuming CRS=4326, which might not be the case.
## You should know the spatial reference system of the data
# interactive plot the first day only (to avoid blocking the computer with number of points)
# kk1.sf %>%
# st_as_sf(kk1, coords=c("lon","lat"),crs=4326, remove=FALSE))
# filter(day(data)==1) %>%
# group_by(vessel_id) %>%
# summarize(do_union=FALSE) %>%
# st_cast("LINESTRING") %>%
# mapview(zcol="vessel_id")
# plot all data to check for dubious lats/lons:
kk1 %>%
#filter(day(data)==1) %>%
ggplot()+
geom_path(aes(x=lat, y=lon, col=month(data)))+
scale_color_viridis()+
facet_wrap(~vessel_id, scales="free")+
theme_minimal()
```
## 2) Remove duplicates
For each vessel, we should check for the presence of duplicated time_stamps. In this case we can do the average of lat/lon/speed, etc. or use the first value (or any other criteria).\
These duplicates tend to occur more in AIS.\
```{r remove_duplicates}
# check overall duplicates. There is none.
kk1[duplicated(kk1),]
# check duplicated time_stamps by vessel_id:
dim(kk1[duplicated(paste(kk1$vessel_id,kk1$data)),])
# this could also be done by
# kk1 %>%
# group_by(data, vessel_id) %>%
# summarise(N=n()) %>%
# filter(N>1)
# remove duplicated time/vessel_id entries (OR we could also average those)
kk1 <- kk1[!duplicated(paste(kk1$vessel_id,kk1$data)),]
dim(kk1)
```
## 3) Remove points on land or on harbour.
This aspect is particular sensitive to the case study (fishery), as:\
a) We can have points on harbour with very low speeds and very close to each other - the boats are resting but some movement is recorded which induces errors. In this case, these should be removed either by defining a polygon of the harbour and exclude the points inside (essential for fisheries that operate very close to land) or with a distance buffer around the harbour location.\
b) We can have points that were recorded in land - which are not correct. In this case, you can use a high resolution coastline to remove those points.\
c) We can have boats that do not leave from an harbour, but directly from the coast/beach.\
```{r remove_harbour}
# Open harbor's file EU without Portugal (sf object)
har <- readRDS("jepol/harbours.rds")
plot(har)
# Select points inside the harbours?
# make spatial object of boats tracks
kk1.sf <- kk1 %>%
st_as_sf(coords=c("lon","lat"),crs=4326, remove=FALSE)
# select points inside polygons
kk1.sf <- st_join(kk1.sf, har)
# points inside the ports (none)
table(kk1.sf$SI_HARB, useNA = "always")
```
## 4) Correct speeds
Some speeds might be weird.\
For most cases we expect the valid speeds to vary between 0.1 and 12 knots, but this should be studied for the fishery under work.\
We can than either remove unrealistic speeds considering a threshold (taking into account the knowledge of the fishery) or use a mathematical criteria (e.g. mean+5*SD).\
In either case, we can remove those speeds or replace them by nearby speeds in time (for example by a moving average of 5 points).\
```{r correct_speeds}
# exclude weird speeds
dim(kk1.sf)
summary(kk1.sf$speed)
# if there were weird speeds:
# kk1.sf <- kk1 %>% filter(speed < 12) #speed >0.1,
# dim(kk1)
```
## **Alternatively** (from Einer:
```{r}
library(tidyverse)
# remotes::install_github("Hafro/geo")
track <-
read_csv("https://raw.githubusercontent.com/ices-eg/WKSSFGEO/main/example_data_AIS.csv") %>%
# save some downstream typing by using shorter variable names
rename(vid = vessel_id, time = time_stamp) %>%
# unique rows
distinct() %>%
distinct(vid, time, lon, lat, .keep_all = TRUE) %>%
# just in case
arrange(vid, time) %>%
group_by(vid) %>%
mutate(duration = difftime(time, lag(time), units = "secs"),
# warnings here are just because of the first datapoint
distance = geo::arcdist(lat, lon, lag(lat), lag(lon), scale = "km") * 1000,
# derived speed
speed2 = distance / as.numeric(duration)) %>%
ungroup() %>%
st_as_sf(coords = c("lon", "lat"),
crs = 4326,
remove = FALSE)
```
```{r}
# alternative to mapview:
library(mapdeck) # greater control on points and things, one need though a token
# to get a background map
mapdeck(location = c(11, 56.5), zoom = 8) %>%
add_scatterplot(data = track %>% sample_n(5e4),
lon = "lon",
lat = "lat",
fill_colour = "speed",
legend = FALSE,
tooltip = "time",
layer_id = "points",
radius = 10,
radius_min_pixels = 2,
radius_max_pixels = 10,
update_view = FALSE,
stroke_opacity = 1,
palette = "inferno")
```
# II. Define individual trips
Two alternatives (at least):\
a) If each trip is done in one day only, then we can construct the trajectories considering one trip by day;
b) If there are boat trips that go from one day to another, we can use a time threshold for defining trips (general: applicable in all cases);\
c) We can define that a trip starts when a vessel leaves the harbour and finishes when it returns; For this case we will use a function found in the functions folder ('define_trips.R').\
```{r individual_trips}
dim(kk1.sf)
kk1.sf <- kk1.sf %>%
dplyr::group_by(vessel_id) %>%
dplyr::arrange(data) %>%
dplyr::mutate(
#diff.time = round(difftime(data, lag(data, 1, default = data[1])),1),
#diff.cum = round(cumsum(as.numeric(diff.time))),
trip = cumsum(c(TRUE, diff(data) >= 10800)), #this is 3h*60min*60 sec because it is in seconds: head(diff(boats$horas))
vessel_trip = paste(vessel_id, trip, sep="_")) %>%
ungroup()
table(kk1.sf$vessel_trip, useNA = "always")
# Plot one example
kk1.sf %>%
filter(vessel_id=="EX_9") %>%
ggplot()+
geom_path(aes(x=data, y=speed, col=speed))+
facet_wrap(~vessel_trip, scales="free_x")+
scale_color_viridis()+
theme_minimal()
# map it
kk1.sf %>%
filter(vessel_id=="EX_9") %>%
ggplot()+
geom_sf(aes(col=speed))+
facet_wrap(~vessel_trip)+
scale_color_viridis()+
theme_minimal()
```
```{r individual_trips2}
# Second alternative, we will use the function developped by DTU-Aqua
## NOT WORKING
load("/Functions/define_trips.R")
# Error in load("Functions/define_trips.R") :
# bad restore file magic number (file may be corrupted) -- no data loaded
define_trips(kk1.sf, min_dur = 0.5, max_dur = 72, split_trips = T)
# does not run
```
6) Check trip duration and distance covered - threshold required (again knowledge of the fishery required) - check trips visually (this can only be done after individuals trips are defined).\
```{r trip_duration}
# Plot duration histogram
kk1.sf %>%
group_by(vessel_trip) %>%
summarise(duration = round(difftime(max(data),min(data), units="mins"))) %>%
arrange(duration) %>%
# here we see there is a big gap between <99 mins and the remaining of the trips. We will cut there, then.
ggplot()+
geom_histogram(aes(x=duration), bins=50)+
theme_minimal()
# geom_point(aes(x=vessel.trip, y=duration))
dim(kk1)
kk1.sf <- kk1.sf %>%
group_by(vessel.trip) %>%
mutate(duration = round(difftime(max(data),min(data), units="mins"))) %>%
filter(duration>99) %>%
ungroup()
dim(kk1)
```
# III. Identifying fishing events
1) Cut trajectories where distance between consecutive observations is above a threshold?
For example- 1000 metres
2) Interpolation - specify frequency – 60 seconds?
3) Remove points on land that might result from this interpolation
4) Try: Random Forest, HMMs, EM algorithm, fixed threshold expert based or estimated with classification trees – use speed instead of distance?