Hi Folks,
This is not a bug as such; I am just not sure of the capabilities of the --regex and --replacement features.
What I want, ideally, is to convert directories from "one directory per hour", e.g.
...somedirectory/2015/05/10/21/...lots of files...
...somedirectory/2015/05/10/22/...lots of files...
...somedirectory/2015/05/10/23/...lots of files...
...somedirectory/2015/05/11/00/...lots of files...
...somedirectory/2015/05/11/01/...lots of files...
...somedirectory/2015/05/11/02/...lots of files...
into "one directory per day":
...somedirectory/2015/05/10/oneBigFile
...somedirectory/2015/05/11/oneBigFile
or, if necessary
...somedirectory/2015/05/10/00/oneBigFile
...somedirectory/2015/05/11/00/oneBigFile
(And ideally I'd love it to tell Hive HCatalog at the same time, but that might be asking too much)
I am trying to use the --regex and --replacement features to do this. Should it work?
This just adds a new directory:
--regex=".*/\d\d/(.+)"
--replacement=00/$1-${crush.timestamp}-${crush.task.num}-${crush.file.num} \
Or should I be trying something like:
--regex=".*/(\d\d)/(.+)"
--replacement=00/$2-${crush.timestamp}-${crush.task.num}-${crush.file.num} \
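To illustrate what I expect those capture groups to do, here is a quick sanity check using Python's re module (the paths are made up, and crush uses Java regexes, though for this pattern the behaviour should be the same):

```python
import re

# Second attempt: capture the hour directory as group 1 and the
# file name as group 2. The leading greedy ".*" should consume
# everything up to the last "/dd/" component, i.e. the hour.
pattern = re.compile(r".*/(\d\d)/(.+)")

m = pattern.match("/somedirectory/2015/05/10/21/part-00000")
print(m.group(1))  # hour directory: "21"
print(m.group(2))  # file name: "part-00000"
```

So with --replacement=00/$2-... the hour captured in group 1 is simply discarded and every file lands under the single 00/ directory.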
I suppose my fallback solution would be to move everything from the low-level directories one directory up before running the file crush. That would be a bit of a pain; I suppose I could write a Perl or shell script to do that, running "hadoop fs -mv" commands.
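Something like this is what I have in mind for that script: untested, with made-up example paths, and it only generates the "hadoop fs -mv" command strings (prefixing the hour onto the file name so files from different hours don't collide in the day directory):

```python
import posixpath

# Hypothetical hour-level files to be lifted up into the day directory.
hour_files = [
    "/somedirectory/2015/05/10/21/part-00000",
    "/somedirectory/2015/05/10/22/part-00000",
]

commands = []
for src in hour_files:
    # ".../2015/05/10/21/part-00000" -> day dir ".../2015/05/10", hour "21"
    day_dir, hour = posixpath.split(posixpath.dirname(src))
    name = posixpath.basename(src)
    # Prefix the hour so files from different hours keep unique names.
    dst = posixpath.join(day_dir, f"{hour}-{name}")
    commands.append(f"hadoop fs -mv {src} {dst}")

for c in commands:
    print(c)
```

The printed commands could then be piped to a shell (or executed via subprocess) before running the crush at the day level.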
Alex