Conversation

@Jeronymous
Member

No description provided.

try:
    loss = model[0].eval_batch(data_iterator)  # average loss per sample per microbatch
    # difficult to know if it is the right way to get the total loss
    loss = loss * args.micro_batch_size * args.seq_length  # losses per token
Member Author

@Jeronymous Jeronymous Jun 26, 2024

Why do you want a total loss rather than an average loss?

I am not sure micro_batch_size is the correct factor here: it is the batch size per GPU, while the effective batch size is macro_batch_size.

I would suggest saving the average loss per token AND the total number of tokens in the dataset (separately), so that we can choose between the statistics (average / total) and run checks based on the number of tokens.
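
A minimal sketch of the idea (the helper and field names below are hypothetical, not part of this PR):

def summarize_eval(avg_loss_per_token: float, num_tokens: int) -> dict:
    # Save the average loss per token and the token count separately,
    # so either statistic (average or total) can be chosen later
    # and runs can be sanity-checked by comparing token counts.
    return {
        "avg_loss_per_token": avg_loss_per_token,
        "num_tokens": num_tokens,
        # the total loss can always be recovered as average * count:
        "total_loss": avg_loss_per_token * num_tokens,
    }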

Member Author

@Jeronymous Jeronymous Jun 26, 2024

To clarify, I suggest using

loss = model[0].eval_batch(data_iterator)
loss_dicts = [{'lm loss': loss, 'num_batches': 1}]

and aggregating the losses and the number of batches where relevant (I think it's around line 417), then normalizing by the number of batches only at the very end.
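
A minimal sketch of the aggregation I have in mind (the values below are only illustrative, not the actual code around line 417):

# Illustrative per-batch results, in the format suggested above
loss_dicts = [
    {'lm loss': 2.31, 'num_batches': 1},
    {'lm loss': 2.47, 'num_batches': 1},
]

# Aggregate the losses and the batch counts where relevant...
total_loss = sum(d['lm loss'] for d in loss_dicts)
num_batches = sum(d['num_batches'] for d in loss_dicts)

# ...and normalize only once, at the very end.
avg_loss = total_loss / num_batches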

Comment on lines 415 to 417

if is_last_rank():
    val_loss = total_loss_dict['lm loss'].item() / (num_tokenized_tokens - 1)
Member Author

@Jeronymous Jeronymous Jun 26, 2024

Also, it seems to me (I may be wrong) that "is_last_rank" is only True on a single GPU in a multi-GPU run.
Which would mean that, with multiple GPUs, we would ignore the results from the other n-1 GPUs?
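
For context, in Megatron-style code is_last_rank is usually just a rank check, roughly like this (an assumption about the upstream helper, not code from this PR):

import torch

def is_last_rank():
    # Only the rank with the highest index returns True, so in a
    # multi-GPU run every other rank would skip the block below.
    return torch.distributed.get_rank() == (torch.distributed.get_world_size() - 1)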

if is_last_rank():
    val_loss = total_loss_dict['lm loss'].item() / (num_tokenized_tokens - 1)
    ppl = math.exp(min(20, val_loss))
Collaborator

Suggested change

-    ppl = math.exp(min(20, val_loss))
+    dist.all_reduce(val_loss, op=ReduceOp.SUM)  # mean reduction is not supported
+    dist.all_reduce(ppl, op=ReduceOp.SUM)
+    dist.all_reduce(adjusted_ppl, op=ReduceOp.SUM)
+    dist.all_reduce(token_ratio, op=ReduceOp.SUM)
+    val_loss = val_loss / NB_SHARDS
+    token_ratio = token_ratio / NB_SHARDS
+    ppl = math.exp(min(20, val_loss))
+    adjusted_ppl = math.exp(min(20, val_loss * token_ratio))
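
If it helps, here is a minimal sketch of doing the reduction on tensors (it assumes torch.distributed is already initialized with a CUDA backend, and it uses the world size where the suggestion uses NB_SHARDS; the helper names are hypothetical):

import math
import torch
import torch.distributed as dist

def reduce_mean(value: float) -> float:
    # all_reduce operates on tensors, so wrap the Python scalar first;
    # mean reduction is emulated with SUM followed by a division.
    t = torch.tensor(value, device='cuda')
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t / dist.get_world_size()).item()

def final_metrics(local_val_loss: float, local_token_ratio: float):
    # Average the per-rank statistics across GPUs, then recompute the
    # perplexities from the averaged loss rather than reducing them directly.
    val_loss = reduce_mean(local_val_loss)
    token_ratio = reduce_mean(local_token_ratio)
    ppl = math.exp(min(20, val_loss))
    adjusted_ppl = math.exp(min(20, val_loss * token_ratio))
    return val_loss, ppl, adjusted_ppl, token_ratio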

Collaborator

Thanks, I'll try it out, hope it solves the synchronization problem ;)

lucashervier and others added 23 commits August 21, 2024 16:51
- Add datasets: Pile (WIP) and Stac (tiny).
- Improve the folder organization a bit.
- Add zstandard to requirements (to read datasets in .jsonl.zst format).