Can't access training data in s3 storage #878

@canonic-epicure

🐛 Describe the bug

Hi,

I'm trying to launch local training of the "tiny" configuration (after a fresh GitHub clone and pip install):
torchrun scripts/train.py ./workspace/OLMo-20M/config.yaml --save_overwrite

However, I get the following error:

[olmo.util:623, rank=0] _s3_file_size failed attempt 2 with retriable error: An error occurred (403) when calling the HeadObject operation: Forbidden error
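The log line above names `_s3_file_size`, which suggests the failure happens while training data files are being sized. To see which remote files the run would touch, the config can be scanned for S3 URIs. The following is a minimal sketch, assuming the config lists its data files as plain `s3://` URIs; the helper names `find_s3_uris` and `find_s3_uris_in_file` are hypothetical and not part of OLMo.

```python
import re
from pathlib import Path

# Matches an s3:// URI up to the next whitespace, quote, comma, bracket or '#'.
S3_URI = re.compile(r"s3://[^\s'\"#,\]]+")


def find_s3_uris(text: str) -> list[str]:
    """Return the distinct s3:// URIs found in `text`, in first-seen order."""
    return list(dict.fromkeys(S3_URI.findall(text)))


def find_s3_uris_in_file(path: str) -> list[str]:
    """Same as find_s3_uris, but reads the text from a file (e.g. a YAML config)."""
    return find_s3_uris(Path(path).read_text(encoding="utf-8"))
```

Calling `find_s3_uris_in_file("./workspace/OLMo-20M/config.yaml")` would list every S3 path in the config, including any that sit in commented-out lines, since the scan is purely textual.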

I then tried to list the bucket contents with aws s3 ls s3://ai2-llm/preprocessed/megawika/v1/gpt-neox-olmo-dolma-v1_5/ --no-sign-request and got:

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
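A denied listing does not by itself show that individual objects are unreadable, because S3 grants listing and object reads as separate permissions. One way to test a single object is an unsigned HTTPS HEAD request. This is a minimal sketch using only the standard library; it assumes the bucket is reachable through the virtual-hosted-style endpoint `<bucket>.s3.amazonaws.com`, and the function names are hypothetical.

```python
import urllib.error
import urllib.request
from urllib.parse import quote, urlparse


def s3_uri_to_https(uri: str) -> str:
    """Map s3://bucket/key to a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    key = quote(parsed.path.lstrip("/"))  # keeps '/' separators, escapes the rest
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"


def anonymous_head_status(uri: str, timeout: float = 10.0) -> int:
    """Send an unsigned HEAD request for the object and return the HTTP status."""
    request = urllib.request.Request(s3_uri_to_https(uri), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers arrive as exceptions, e.g. 403 when anonymous reads are denied.
        return err.code
```

Passing one full object URI from the config to `anonymous_head_status` returns 200 if the object is publicly readable. A 403 means anonymous access is denied, so credentials with access to the bucket (or a different data source) would be needed.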

Is the data public?

Screenshot from https://us-east-1.console.aws.amazon.com/s3/buckets/ai2-llm?region=us-east-1&bucketType=general&tab=objects


Or am I missing something?

Versions

(.venv) nickolay@leblanc:~/workspace/python/OLMo$ python --version && pip freeze
Python 3.12.3
-e git+ssh://git@github.com/allenai/OLMo.git@04820704616af5d25cdba4df23aa7b4d9ce86cad#egg=ai2_olmo
ai2-olmo-core==2.1.0
ai2-olmo-eval==0.7.1
aiohappyeyeballs==2.6.1
aiohttp==3.12.15
aiosignal==1.4.0
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
attrs==25.3.0
beaker-gantry==3.0.0
beaker-py==2.4.7
black==23.12.1
boltons==25.0.0
boto3==1.40.13
botocore==1.40.13
build==1.3.0
cached_path==1.7.3
cachetools==5.5.2
certifi==2025.8.3
cffi==1.17.1
charset-normalizer==3.4.3
click==8.2.1
click-help-colors==0.9.4
click-option-group==0.5.7
cryptography==45.0.6
datasets==4.0.0
dill==0.3.8
docutils==0.22
face==24.0.0
filelock==3.19.1
frozenlist==1.7.0
fsspec==2025.3.0
ftfy==6.3.1
gitdb==4.0.12
GitPython==3.1.45
glom==24.11.0
google-api-core==2.25.1
google-auth==2.40.3
google-cloud-core==2.4.3
google-cloud-storage==2.19.0
google-crc32c==1.7.1
google-resumable-media==2.7.2
googleapis-common-protos==1.70.0
grpcio==1.74.0
hf-xet==1.1.8
huggingface-hub==0.34.4
id==1.5.0
idna==3.10
importlib_resources==6.5.2
iniconfig==2.1.0
isort==5.12.0
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.3.0
jeepney==0.9.0
Jinja2==3.1.6
jmespath==1.0.1
joblib==1.5.1
keyring==25.6.0
lightning-utilities==0.15.2
markdown-it-py==4.0.0
MarkupSafe==3.0.2
mdurl==0.1.2
more-itertools==10.7.0
mpmath==1.3.0
msgspec==0.19.0
multidict==6.6.4
multiprocess==0.70.16
mypy==1.3.0
mypy_extensions==1.1.0
necessary==0.4.3
networkx==3.5
nh3==0.3.0
numpy==1.26.4
nvidia-cublas-cu12==12.8.4.1
nvidia-cuda-cupti-cu12==12.8.90
nvidia-cuda-nvrtc-cu12==12.8.93
nvidia-cuda-runtime-cu12==12.8.90
nvidia-cudnn-cu12==9.10.2.21
nvidia-cufft-cu12==11.3.3.83
nvidia-cufile-cu12==1.13.1.3
nvidia-curand-cu12==10.3.9.90
nvidia-cusolver-cu12==11.7.3.90
nvidia-cusparse-cu12==12.5.8.93
nvidia-cusparselt-cu12==0.7.1
nvidia-nccl-cu12==2.27.3
nvidia-nvjitlink-cu12==12.8.93
nvidia-nvtx-cu12==12.8.90
omegaconf==2.3.0
packaging==25.0
pandas==2.3.1
pathspec==0.12.1
petname==2.6
platformdirs==4.3.8
pluggy==1.6.0
propcache==0.3.2
proto-plus==1.26.1
protobuf==5.29.5
pyarrow==21.0.0
pyasn1==0.6.1
pyasn1_modules==0.4.2
pycparser==2.22
pydantic==2.11.7
pydantic_core==2.33.2
Pygments==2.19.2
pyproject_hooks==1.2.0
pytest==8.4.1
pytest-sphinx==0.6.3
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.2
readme_renderer==44.0
regex==2025.7.34
requests==2.32.5
requests-toolbelt==1.0.0
requirements-parser==0.13.0
rfc3986==2.0.0
rich==13.9.4
rsa==4.9.1
ruff==0.12.9
s3transfer==0.13.1
safetensors==0.6.2
scikit-learn==1.7.1
scipy==1.16.1
SecretStorage==3.3.3
sentry-sdk==2.35.0
setuptools==80.9.0
six==1.17.0
smart_open==7.3.0.post1
smashed==0.21.5
smmap==5.0.2
sympy==1.14.0
threadpoolctl==3.6.0
tokenizers==0.21.4
torch==2.8.0
torchmetrics==1.8.1
tqdm==4.67.1
transformers==4.55.2
triton==3.4.0
trouting==0.3.3
twine==6.1.0
typing-inspection==0.4.1
typing_extensions==4.14.1
tzdata==2025.2
urllib3==2.5.0
wandb==0.21.1
wcwidth==0.2.13
wheel==0.45.1
wrapt==1.17.3
xxhash==3.5.0
yarl==1.20.1

Labels: type/bug (An issue about a bug)
