ijan10/pyspark_handle_skewed_data

To use this code, please note the two methods below: left_join_with_skew_key and repartition_dfs_for_join.

Use the first one when you want a simple join with no additional expression; the function returns the transformed DataFrame holding the join results. Use the second one when you want to add an expression or customize the join; in that case the method returns the DataFrame after repartitioning with bin_id. Please keep in mind that bin_id is then part of the join. We tested this solution with "big" and "small" tables, using the broadcast hint. In my scenario, we reduced the job's running time from 6 hours to a chain of jobs totaling 1 hour (on the same type of cluster). The job also became much more stable.
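To illustrate why adding bin_id to the join helps with a skewed key, here is a minimal plain-Python sketch of the general "salting" idea. It is not this repository's implementation. The helper names (salted_left_join, largest_group), the random bin assignment, and the replication of the right side once per bin are all assumptions made for illustration. In Spark the same idea would be expressed with DataFrame columns instead of lists.

```python
import random
from collections import Counter


def salted_left_join(left, right, num_bins, seed=0):
    """Hypothetical helper: left join on (key, bin_id) instead of key alone.

    left  -- list of (key, value) rows, possibly dominated by one hot key
    right -- list of (key, value) rows, assumed small
    """
    rng = random.Random(seed)
    # Skewed side: tag every row with a random bin_id in [0, num_bins).
    salted_left = [(k, rng.randrange(num_bins), v) for k, v in left]
    # Small side: replicate each row once per bin so every (key, bin_id) can match.
    right_index = {}
    for k, v in right:
        for b in range(num_bins):
            right_index.setdefault((k, b), []).append(v)
    joined = []
    for k, b, lv in salted_left:
        matches = right_index.get((k, b))
        if matches:
            joined.extend((k, lv, rv) for rv in matches)
        else:
            joined.append((k, lv, None))  # left join keeps unmatched rows
    return salted_left, joined


def largest_group(rows):
    """Size of the biggest (key, bin_id) group, a proxy for the heaviest task."""
    return max(Counter((k, b) for k, b, _ in rows).values())


# One "hot" key with 1000 rows and one "cold" key with 10 rows.
left = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
right = [("hot", "H"), ("cold", "C")]

unsalted = [(k, 0, v) for k, v in left]  # every row of a key shares one group
salted, joined = salted_left_join(left, right, num_bins=8)

print("largest group without bin_id:", largest_group(unsalted))
print("largest group with bin_id:   ", largest_group(salted))
```

The join result is the same as an ordinary left join on the key; only the grouping changes, so the hot key's rows are split across several smaller groups. The cost is that the small side is duplicated num_bins times, which is why this approach fits a big skewed table joined to a much smaller one.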
