Hi Folks,
This is not a bug as such; I am just not sure of the capabilities of the --regex and --replacement features.
What I want, ideally, is to convert directories from "one directory per hour", e.g.
...somedirectory/2015/05/10/21/...lots of files...
...somedirectory/2015/05/10/22/...lots of files...
...somedirectory/2015/05/10/23/...lots of files...
...somedirectory/2015/05/11/00/...lots of files...
...somedirectory/2015/05/11/01/...lots of files...
...somedirectory/2015/05/11/02/...lots of files...
into "one directory per day":
...somedirectory/2015/05/10/oneBigFile
...somedirectory/2015/05/11/oneBigFile
or, if necessary
...somedirectory/2015/05/10/00/oneBigFile
...somedirectory/2015/05/11/00/oneBigFile
(And ideally I'd love it to tell Hive HCatalog at the same time, but that might be asking too much)
I am trying to use the --regex and --replacement features to do this. Should it work?
This just adds a new directory:
--regex=".*/\d\d/(.+)"
--replacement=00/$1-${crush.timestamp}-${crush.task.num}-${crush.file.num} \
Or should I be trying something like:
--regex=".*/(\d\d)/(.+)"
--replacement=00/$2-${crush.timestamp}-${crush.task.num}-${crush.file.num} \
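To illustrate what I expect those capture groups to do, here is a quick sanity check using Python's re module (the paths are made up, and crush uses Java regexes, though for this pattern the behaviour should be the same):

```python
import re

# Second attempt: capture the hour directory as group 1 and the
# file name as group 2. The leading greedy ".*" should consume
# everything up to the last "/dd/" component, i.e. the hour.
pattern = re.compile(r".*/(\d\d)/(.+)")

m = pattern.match("/somedirectory/2015/05/10/21/part-00000")
print(m.group(1))  # hour directory: "21"
print(m.group(2))  # file name: "part-00000"
```

So with --replacement=00/$2-... the hour captured in group 1 is simply discarded and every file lands under the single 00/ directory.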
I suppose my fallback solution would be to move everything from the low-level directories one directory up before running the file crush. That would be a bit of a pain; I suppose I could write a Perl or shell script to do that, running "hadoop fs -mv" commands.
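Something like this is what I have in mind for that script: untested, with made-up example paths, and it only generates the "hadoop fs -mv" command strings (prefixing the hour onto the file name so files from different hours don't collide in the day directory):

```python
import posixpath

# Hypothetical hour-level files to be lifted up into the day directory.
hour_files = [
    "/somedirectory/2015/05/10/21/part-00000",
    "/somedirectory/2015/05/10/22/part-00000",
]

commands = []
for src in hour_files:
    # ".../2015/05/10/21/part-00000" -> day dir ".../2015/05/10", hour "21"
    day_dir, hour = posixpath.split(posixpath.dirname(src))
    name = posixpath.basename(src)
    # Prefix the hour so files from different hours keep unique names.
    dst = posixpath.join(day_dir, f"{hour}-{name}")
    commands.append(f"hadoop fs -mv {src} {dst}")

for c in commands:
    print(c)
```

The printed commands could then be piped to a shell (or executed via subprocess) before running the crush at the day level.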
Alex