Python 3.12.1 with pandarallel==1.6.5: parallel_apply run time increased almost 3x #261

@mdclone-oa

General

  • Operating System: Red Hat Enterprise Linux 8.9 (Ootpa)
  • Python version: 3.12.1
  • Pandas version: 2.1.3
  • Pandarallel version: 1.6.5

Acknowledgement

After upgrading from Python 3.10 to Python 3.12, the run time of parallel_apply increased almost 3x.
The code runs in Docker on Red Hat Enterprise Linux 8.9 (Ootpa).

This is the information about the OS that Docker is running:

NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.9"

Python 3.12 packages

annotated-types==0.6.0
astroid==3.0.1
attrs==23.1.0
Cerberus==1.3.5
certifi==2023.11.17
charset-normalizer==3.3.2
contourpy==1.2.0
coverage==7.3.2
cycler==0.12.1
debugpy==1.8.0
dill==0.3.7
distlib==0.3.7
docopt==0.6.2
execnet==2.0.2
fonttools==4.46.0
idna==3.6
iniconfig==2.0.0
isort==5.13.0
Jinja2==3.1.2
joblib==1.3.2
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
kiwisolver==1.4.5
MarkupSafe==2.1.3
matplotlib==3.8.2
mccabe==0.7.0
mlxtend==0.23.0
numpy==1.26.2
packaging==23.2
pandarallel==1.6.5
pandas==2.1.3
pep517==0.13.1
pika==1.3.2
Pillow==10.1.0
pip-api==0.0.30
pipreqs==0.4.13
platformdirs==4.1.0
plette==0.4.4
pluggy==1.3.0
psutil==5.9.6
py-cpuinfo==9.0.0
pydantic==2.5.2
pydantic_core==2.14.5
pylint==3.0.2
pyparsing==3.1.1
pytest==7.4.3
pytest-benchmark==4.0.0
pytest-cov==4.1.0
pytest-html==4.1.1
pytest-metadata==3.0.0
pytest-mock==3.12.0
pytest-order==1.2.0
pytest-ordering==0.6
pytest-timeout==2.2.0
pytest-xdist==3.4.0
python-dateutil==2.8.2
pytz==2023.3.post1
redis==5.0.1
referencing==0.32.0
requests==2.31.0
requirementslib==3.0.0
rpds-py==0.13.2
scikit-learn==1.3.2
scipy==1.11.4
seaborn==0.13.0
setuptools==68.2.2
six==1.16.0
threadpoolctl==3.2.0
tomlkit==0.12.3
typing_extensions==4.9.0
tzdata==2023.3
urllib3==2.1.0
yarg==0.1.9

I can't share all of my code, but this is part of it:

results = combined.groupby(by='NewGroup').parallel_apply(
    lambda group: TestClass(data=group.drop(columns=columns, inplace=False)).run()
)

TestClass - a class that is initialized with the group's data after the columns are dropped
columns - the list of columns to drop
run - the method that runs on each group

The servers are the same and the code didn't change, but the run time still increased almost 3x.

With Python 3.10.11, pandarallel==1.6.5, and pandas==2.0.0, the data frame takes 2.49 min.
With Python 3.12.1, the same data frame takes 7.22 min.
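
Since the real code can't be shared, a minimal sketch like the one below might help reproduce the comparison under both interpreters. Everything in it is a hypothetical stand-in and not the actual code from this report: the synthetic data, the `work` function (a placeholder for `TestClass(data=...).run()`), and the helper names `best_of`, `slowdown`, and `run_benchmark`. The two setups above also differ in pandas version (2.0.0 vs 2.1.3), so timing plain `apply` next to `parallel_apply` may help show whether the slowdown is in the per-group work or in the parallel layer.

```python
import time


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time (in seconds) over `repeats` calls to fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def slowdown(baseline, candidate):
    """Ratio of candidate time to baseline time (both in the same unit)."""
    return candidate / baseline


def run_benchmark(n_rows=200_000, n_groups=500):
    """Time groupby.apply against groupby.parallel_apply on synthetic data."""
    # Imported here so the helpers above can be used without these packages.
    import numpy as np
    import pandas as pd
    from pandarallel import pandarallel

    pandarallel.initialize(progress_bar=False)

    rng = np.random.default_rng(0)
    combined = pd.DataFrame({
        "NewGroup": rng.integers(0, n_groups, size=n_rows),
        "value": rng.random(n_rows),
        "extra": rng.random(n_rows),  # stand-in for a column that gets dropped
    })
    columns = ["extra"]

    def work(group):
        # Hypothetical placeholder for TestClass(data=...).run()
        return group.drop(columns=columns)["value"].sum()

    grouped = combined.groupby(by="NewGroup")
    serial = best_of(lambda: grouped.apply(work))
    parallel = best_of(lambda: grouped.parallel_apply(work))
    return serial, parallel
```

Calling `run_benchmark()` in each environment returns a `(serial, parallel)` pair of timings in seconds. For the numbers reported above, `slowdown(2.49, 7.22)` gives roughly 2.9, which matches the "almost 3x" figure.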
