We have been using zrep for several years to replicate over SSH, but in the last few months we have suffered from unreliable connectivity, which is causing problems with zrep.
The error message is misleading; it appears to result from zrep being confused by its own error handling. When we encounter the issue, manual steps are needed to resolve it.
Order of events
Connection problem
We encounter a connection problem during zrep synconly. This log is from the source, where we run the command:
packet_write_wait: Connection to x.x.x.x port 22: Broken pipe
packet_write_wait: Connection to UNKNOWN port 65535: Broken pipe
Error: Problem doing sync for esrepo/snap_esids@zrep_000bc6. Renamed to esrepo/snap_esids@zrep_000bc6_unsent
Next run fails
This happens with all subsequent runs:
cannot receive incremental stream: destination esrepo-targethost/snap_esids has been modified since most recent snapshot
warning: cannot send 'esrepo/snap_esids@zrep_000bc6_unsent': signal received
Error: Problem doing sync for esrepo/snap_esids@zrep_000bc7. Renamed to esrepo/snap_esids@zrep_000bc7_unsent
We usually run zrep snaponly followed by zrep synconly to ensure consistency with systems running on top of the replicated dataset.
Failing run with ZREP_VERBOSE
# ZREP_VERBOSE=true zrep synconly esrepo/snap_esids
sending esrepo/snap_esids@zrep_000bfb to targethost.domain.com:esrepo-targethost/snap_esids
send from @zrep_000bc5 to esrepo/snap_esids@zrep_000bc6_unsent estimated size is 2.90M
send from @zrep_000bc6_unsent to esrepo/snap_esids@zrep_000bc7_unsent estimated size is 6.42M
# more in between
send from @zrep_000bf8_unsent to esrepo/snap_esids@zrep_000bf9_unsent estimated size is 11.3M
send from @zrep_000bf9_unsent to esrepo/snap_esids@zrep_000bfa_unsent_unsent_unsent estimated size is 27.4M
# multiple _unsent suffix because we did some failing synconly calls without snaponly in between
send from @zrep_000bfa_unsent_unsent_unsent to esrepo/snap_esids@zrep_000bfb estimated size is 13.6M
total estimated size is 2.57G
TIME SENT SNAPSHOT
cannot receive incremental stream: destination esrepo-targethost/snap_esids has been modified
since most recent snapshot
warning: cannot send 'esrepo/snap_esids@zrep_000bc6_unsent': signal received
Error: Problem doing sync for esrepo/snap_esids@zrep_000bfb. Renamed to esrepo/snap_esids@zrep_000bfb_unsent
Further info / workaround
Latest snapshot on target
$ zfs list -t snap -r esrepo-targethost/snap_esids | tail -1
esrepo-targethost/snap_esids@zrep_000bc6 0B - 12.3G -
But we see above that zrep thinks it needs to send from @zrep_000bc5 instead of @zrep_000bc6. @zrep_000bc6 has been renamed to @zrep_000bc6_unsent on the source.
Comparing @zrep_000bc6
Source
# zfs get guid esrepo/snap_esids@zrep_000bc6_unsent
NAME PROPERTY VALUE SOURCE
esrepo/snap_esids@zrep_000bc6_unsent guid 12096180512352983159 -
Target
$ zfs get guid esrepo-targethost/snap_esids@zrep_000bc6
NAME PROPERTY VALUE SOURCE
esrepo-targethost/snap_esids@zrep_000bc6 guid 12096180512352983159 -
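The manual guid comparison above could be scripted roughly as follows. This is only a sketch of our manual check, not something zrep provides: the function names snap_guid, remote_snap_guid and same_snapshot are made up for illustration, and it assumes the script runs on the source with non-interactive SSH access to the target and sufficient privileges to call zfs on both sides.

```shell
#!/bin/sh
# Sketch: check whether a source snapshot that zrep renamed to *_unsent
# actually arrived on the target, by comparing snapshot guids.
# All function names here are hypothetical helpers, not part of zrep.

# Print the guid of a local snapshot ($1 = snapshot name).
snap_guid() {
    zfs get -H -o value guid "$1"
}

# Print the guid of a snapshot on a remote host ($1 = host, $2 = snapshot name).
remote_snap_guid() {
    ssh "$1" zfs get -H -o value guid "$2"
}

# Succeed only if both guids could be read and are identical.
# $1 = source snapshot, $2 = target host, $3 = target snapshot
same_snapshot() {
    src_guid=$(snap_guid "$1") || return 1
    dst_guid=$(remote_snap_guid "$2" "$3") || return 1
    [ -n "$src_guid" ] && [ "$src_guid" = "$dst_guid" ]
}
```

With the names from this report, the check would be called as: same_snapshot esrepo/snap_esids@zrep_000bc6_unsent targethost.domain.com esrepo-targethost/snap_esids@zrep_000bc6 && echo "already transferred"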
Fixing replication
Rename
Since @zrep_000bc6 has actually been successfully transferred (same guid), we rename it on the source so it has the same name as on the target:
# zfs rename esrepo/snap_esids@zrep_000bc6_unsent esrepo/snap_esids@zrep_000bc6
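A small helper along these lines could derive the original name from an _unsent snapshot and perform the rename. Again, this is a hedged sketch of our manual workaround: strip_unsent, restore_snapshot_name and the DRY_RUN variable are hypothetical names, and the rename should only be done after confirming that the guid matches the snapshot on the target.

```shell
#!/bin/sh
# Sketch: restore the original name of a snapshot that carries one or more
# "_unsent" suffixes. Hypothetical helpers, not part of zrep.

# Strip every trailing "_unsent" from a snapshot name ($1) and print the result.
strip_unsent() {
    name=$1
    while [ "${name%_unsent}" != "$name" ]; do
        name=${name%_unsent}
    done
    printf '%s\n' "$name"
}

# Rename an *_unsent snapshot ($1) back to its original name.
# By default (DRY_RUN unset or 1) the command is only printed;
# set DRY_RUN=0 to actually run zfs rename.
restore_snapshot_name() {
    old=$1
    new=$(strip_unsent "$old")
    if [ "$old" = "$new" ]; then
        return 0    # no _unsent suffix, nothing to do
    fi
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "zfs rename $old $new"
    else
        zfs rename "$old" "$new"
    fi
}
```

In our case only @zrep_000bc6_unsent had to be renamed; the later _unsent snapshots had not reached the target and were sent under their _unsent names in the rerun.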
Rerun
... and run the sync again, this time successfully:
# ZREP_VERBOSE=true zrep synconly esrepo/snap_esids
sending esrepo/snap_esids@zrep_000bfb_unsent to targethost.domain.com:esrepo-targethost/snap_esids
send from @zrep_000bc5 to esrepo/snap_esids@zrep_000bc6 estimated size is 2.90M
send from @zrep_000bc6 to esrepo/snap_esids@zrep_000bc7_unsent estimated size is 6.42M
send from @zrep_000bc7_unsent to esrepo/snap_esids@zrep_000bc8_unsent estimated size is 7.53M
send from @zrep_000bc8_unsent to esrepo/snap_esids@zrep_000bc9_unsent estimated size is 2.73M
send from @zrep_000bc9_unsent to esrepo/snap_esids@zrep_000bca_unsent estimated size is 4.49M
send from @zrep_000bca_unsent to esrepo/snap_esids@zrep_000bcb_unsent estimated size is 6.07M
# more in between
send from @zrep_000bf4_unsent to esrepo/snap_esids@zrep_000bf5_unsent estimated size is 6.74M
send from @zrep_000bf5_unsent to esrepo/snap_esids@zrep_000bf6_unsent estimated size is 6.10M
send from @zrep_000bf6_unsent to esrepo/snap_esids@zrep_000bf7_unsent estimated size is 11.9M
send from @zrep_000bf7_unsent to esrepo/snap_esids@zrep_000bf8_unsent estimated size is 10.6M
send from @zrep_000bf8_unsent to esrepo/snap_esids@zrep_000bf9_unsent estimated size is 11.3M
send from @zrep_000bf9_unsent to esrepo/snap_esids@zrep_000bfa_unsent_unsent_unsent estimated size is 27.4M
send from @zrep_000bfa_unsent_unsent_unsent to esrepo/snap_esids@zrep_000bfb_unsent estimated size is 13.6M
total estimated size is 2.57G
TIME SENT SNAPSHOT
TIME SENT SNAPSHOT
07:41:56 2.48M esrepo/snap_esids@zrep_000bc7_unsent
TIME SENT SNAPSHOT
TIME SENT SNAPSHOT
TIME SENT SNAPSHOT
TIME SENT SNAPSHOT
07:42:00 5.89M esrepo/snap_esids@zrep_000bcb_unsent
# cut out some progress
07:43:31 1013M esrepo/snap_esids@zrep_000bf4_unsent
07:43:32 1.04G esrepo/snap_esids@zrep_000bf4_unsent
07:43:33 1.09G esrepo/snap_esids@zrep_000bf4_unsent
TIME SENT SNAPSHOT
07:43:35 3.16M esrepo/snap_esids@zrep_000bf5_unsent
07:43:36 3.16M esrepo/snap_esids@zrep_000bf5_unsent
07:43:37 3.16M esrepo/snap_esids@zrep_000bf5_unsent
07:43:38 3.16M esrepo/snap_esids@zrep_000bf5_unsent
07:43:39 3.16M esrepo/snap_esids@zrep_000bf5_unsent
TIME SENT SNAPSHOT
TIME SENT SNAPSHOT
TIME SENT SNAPSHOT
07:43:42 6.27M esrepo/snap_esids@zrep_000bf8_unsent
TIME SENT SNAPSHOT
TIME SENT SNAPSHOT
07:43:44 21.8M esrepo/snap_esids@zrep_000bfa_unsent_unsent_unsent
TIME SENT SNAPSHOT
07:43:45 3.00M esrepo/snap_esids@zrep_000bfb_unsent
Also running expire on targethost.domain.com:esrepo-targethost/snap_esids now...
Expiring zrep snaps on esrepo-targethost/snap_esids
Version used
We are running zrep 1.9.0 on the source and 1.8.0 on the target. We are going to update (and align versions) soon, but from the changelog it's not obvious whether there were any related changes.