Jan Veverka edited this page Apr 2, 2015 · 16 revisions

19 March 2015 - 1 April 2015

Week 12 - 19-22 March

Week 13 - 23-29 March

  • The transfer machine srv-c2c07-16 seems to be in bad shape. The dmesg output shows signs of rfcp hitting page allocation errors.
  • Reported by Federico de G.: Recent runs like 238799 are appearing (albeit with some delay) on our ramdisk, but not on the output disk at /fff/output/transfer/; the last one I see there is 238649.
  • Rebooted srv-c2c07-16 on March 23 around 17:30 during the transfers of run 238752:
    • The ongoing transfers failed because of a wrong sequence of manual interventions. The correct sequence is: service sm stop; umount /store/lustre
    • After the reboot, one has to mount Lustre and the DQM area by hand:
      mount 10.180.9.17:/fff/output /dqmburam
      /bin/mount -t lustre -o flock 10.180.7.44@tcp0,10.180.7.45@tcp0:/cmsfs /store/lustre/
  • The copy worker crashed at 13:52 on Tuesday 24 March and was restarted at 15:29; as a result, run 238832 has some missing files.
  • The inject worker terminated at 03:25:02; transfers were restarted at 10:44:40 and caught up with data taking by 16:00.
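
The reboot procedure for srv-c2c07-16 noted above could be captured in a small helper script, so that the order of the manual steps is not left to memory. This is only a sketch: the `run` helper, the `DRY_RUN` variable and the function names `pre_reboot` and `post_reboot` are hypothetical and not part of any existing tooling, and the Lustre mount is written as a single `/bin/mount` invocation, which is an assumption about the intended command. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the manual reboot procedure for the transfer machine.
# DRY_RUN=1 (the default, hypothetical variable) prints commands
# instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Before the reboot: stop the transfer service first, then unmount Lustre.
pre_reboot() {
    run service sm stop &&
    run umount /store/lustre
}

# After the reboot: remount the DQM output area and Lustre by hand.
post_reboot() {
    run mount 10.180.9.17:/fff/output /dqmburam &&
    run /bin/mount -t lustre -o flock 10.180.7.44@tcp0,10.180.7.45@tcp0:/cmsfs /store/lustre/
}
```

Sourcing the file and calling `pre_reboot` before the reboot and `post_reboot` afterwards prints the commands in order; setting `DRY_RUN=0` as root would execute them instead. Each function stops at the first failing step, so Lustre is not unmounted if stopping the service fails.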
