GitHub - ijan10/pyspark_handle_skewed_data

To use this code, note the two methods below: left_join_with_skew_key and repartition_dfs_for_join.

The first is for a simple join with no additional expression; the function returns the DataFrame that results from the join. The second is for cases where you want to add an expression or customize the join; in that case the method returns the DataFrame after repartitioning with bin_id. Keep in mind that bin_id is now part of the join. We tested this solution with "big" and "small" tables with a broadcast hint. In my scenario, we reduced the job's running time from 6 hours to a chain of jobs totaling 1 hour (on the same type of cluster). The job also became much more stable.
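To make the bin_id idea more concrete, here is a minimal pure-Python sketch of one way keys could be packed into bins so that each bin holds a similar number of rows. It is not the repository's code: the names assign_bins and bin_loads are hypothetical, and the greedy "heaviest key into the lightest bin" strategy is an assumption about what a bin-packing step might look like, not a description of bin_packing.py.

```python
import heapq
from typing import Dict, Hashable, List


def assign_bins(key_counts: Dict[Hashable, int], num_bins: int) -> Dict[Hashable, int]:
    """Map each join key to a bin_id (hypothetical helper, not the repo's API).

    Greedy packing: keys are visited from most to fewest rows, and each key
    is placed in the bin that currently holds the fewest rows.
    """
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")

    # Min-heap of (current_row_count, bin_id); the lightest bin is popped first.
    heap = [(0, bin_id) for bin_id in range(num_bins)]
    heapq.heapify(heap)

    assignment: Dict[Hashable, int] = {}
    for key, count in sorted(key_counts.items(), key=lambda kv: kv[1], reverse=True):
        load, bin_id = heapq.heappop(heap)
        assignment[key] = bin_id
        heapq.heappush(heap, (load + count, bin_id))
    return assignment


def bin_loads(key_counts: Dict[Hashable, int],
              assignment: Dict[Hashable, int],
              num_bins: int) -> List[int]:
    """Return the total row count per bin, to check how balanced the packing is."""
    loads = [0] * num_bins
    for key, bin_id in assignment.items():
        loads[bin_id] += key_counts[key]
    return loads


if __name__ == "__main__":
    # Illustrative row counts per join key; "a" is the skewed key.
    counts = {"a": 100, "b": 40, "c": 30, "d": 20, "e": 10}
    bins = assign_bins(counts, num_bins=2)
    print(bins)
    print(bin_loads(counts, bins, num_bins=2))
```

In this example the skewed key "a" gets a bin to itself while the four smaller keys share the other, so both bins end up with 100 rows. Under this reading, repartitioning both tables on bin_id would give partitions of similar size, which is why bin_id would then appear in the join condition alongside the original key.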

Folders and files (15 commits):
join_skew_data
LICENSE.txt
README.md
bin_packing.py
setup.cfg
setup.py

Releases 5
