
How to merge multiple directories into one (Documentation/Help Request) #10

@alexmc6


Hi Folks,

This is not a bug as such. I am just not sure of the capabilities of the --regex and --replacement features.

Ideally, I want to convert a "one directory per hour" layout, e.g.

...somedirectory/2015/05/10/21/...lots of files...
...somedirectory/2015/05/10/22/...lots of files...
...somedirectory/2015/05/10/23/...lots of files...
...somedirectory/2015/05/11/00/...lots of files...
...somedirectory/2015/05/11/01/...lots of files...
...somedirectory/2015/05/11/02/...lots of files...

into "one directory per day"

...somedirectory/2015/05/10/oneBigFile
...somedirectory/2015/05/11/oneBigFile

or, if necessary

...somedirectory/2015/05/10/00/oneBigFile
...somedirectory/2015/05/11/00/oneBigFile

(And ideally I'd love it to tell Hive HCatalog at the same time, but that might be asking too much)

I am trying to use the --regex and --replacement features to do this. Should it work?

This attempt just adds a new directory:

--regex=".*/\d\d/(.+)"
--replacement=00/$1-${crush.timestamp}-${crush.task.num}-${crush.file.num} \

Should I be trying something like this instead?

--regex=".*/(\d\d)/(.+)"
--replacement=00/$2-${crush.timestamp}-${crush.task.num}-${crush.file.num} \
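To make the intended rewrite concrete, here is a minimal sketch of the capture-group idea using Python's `re` module. It only illustrates the regex logic: whether the tool matches --regex against the directory path in this way is an assumption, the paths and the `day_target` helper are hypothetical, and Python writes group references as `\1` where the replacement above uses `$1`.

```python
import re

# Hypothetical hourly directory paths, following the layout in the question.
hourly_dirs = [
    "somedirectory/2015/05/10/21",
    "somedirectory/2015/05/10/22",
    "somedirectory/2015/05/11/00",
]

# Capture everything up to and including the day component.
# The trailing two-digit hour is matched but kept outside the group,
# so it is dropped from the result.
pattern = re.compile(r"(.*/\d{4}/\d{2}/\d{2})/\d{2}")

def day_target(path):
    """Map an hourly directory to a single per-day output file path."""
    return pattern.sub(r"\1/oneBigFile", path)

targets = [day_target(d) for d in hourly_dirs]
for source, target in zip(hourly_dirs, targets):
    print(source, "->", target)
```

The point of the sketch is that the hour has to sit outside the captured group (or be captured and then not referenced) for every hour of a day to map to the same output path.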

I suppose my fallback solution would be to move everything from the hourly directories up one level before running the file crush. That would be a bit of a pain, but I could write a Perl or shell script that runs "hadoop fs -mv" commands to do it.
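The fallback could be sketched roughly as follows, in Python rather than Perl or shell. This is an untested outline: the directory names and the `build_mv_commands` helper are hypothetical, and it assumes file names do not collide between hours of the same day. It only builds and prints the commands, so they can be reviewed before anything is moved.

```python
import posixpath
import shlex

def build_mv_commands(hour_dirs):
    """Return one 'hadoop fs -mv' argument list per hourly directory."""
    commands = []
    for hour_dir in hour_dirs:
        hour_dir = hour_dir.rstrip("/")
        # The parent of an hourly directory is the per-day directory.
        day_dir = posixpath.dirname(hour_dir)
        # The glob is kept literal so that Hadoop, not the local shell, expands it.
        commands.append(["hadoop", "fs", "-mv", hour_dir + "/*", day_dir + "/"])
    return commands

# Hypothetical hourly directories following the layout in the question.
commands = build_mv_commands([
    "somedirectory/2015/05/10/21",
    "somedirectory/2015/05/10/22",
])

for cmd in commands:
    # shlex.join quotes the glob, so the printed line is safe to paste into a shell.
    print(shlex.join(cmd))
```

Each argument list could also be passed to `subprocess.run` directly once the printed commands look right.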

Alex
