[2025-08-03 01:38:26,089] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,089] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,090] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,112] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,113] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,113] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,113] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,096] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,099] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,104] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,104] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,104] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,129] [INFO] [real_accelerator.py:191:get_accelerator] 
Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,129] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,129] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,132] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,128] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,128] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,133] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,133] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,135] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,135] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,140] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,135] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,135] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,140] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,141] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:38:26,141] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-03 01:39:10,022] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,012] [INFO] [comm.py:637:init_distributed] cdb=None 
[2025-08-03 01:39:10,038] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,038] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [2025-08-03 01:39:10,034] [INFO] [comm.py:637:init_distributed] cdb=None 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage3_config_100b_1e8.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=4, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, 
logging_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/runs/Aug03_01-39-10_HOST-10-140-60-108, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=2000, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=3.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=1000, save_strategy=steps, save_total_limit=10000, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.05, ) 08/03/2025 01:39:10 - INFO - __main__ - Loading Tokenizer: /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B [2025-08-03 01:39:10,308] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,322] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,323] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,324] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,327] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,334] [INFO] [comm.py:637:init_distributed] 
cdb=None [2025-08-03 01:39:10,336] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,346] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,343] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,351] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,344] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,353] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,353] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,347] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,356] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,350] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,352] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,353] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,354] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,361] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,363] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,393] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,394] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,395] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,397] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,398] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,399] [INFO] [comm.py:637:init_distributed] cdb=None [2025-08-03 01:39:10,400] [INFO] [comm.py:637:init_distributed] cdb=None [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file vocab.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file merges.txt [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file special_tokens_map.json 
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file tokenizer.json 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - 
__main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits 
training: False [WARNING|logging.py:314] 2025-08-03 01:39:10,730 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [WARNING|logging.py:314] 2025-08-03 01:39:10,827 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,828 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,828 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,829 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,829 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,829 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,829 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [WARNING|logging.py:314] 2025-08-03 01:39:10,871 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,872 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,874 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,877 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,879 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,875 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,883 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[WARNING|logging.py:314] 2025-08-03 01:39:10,876 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,876 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,876 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,876 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [WARNING|logging.py:314] 2025-08-03 01:39:10,881 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [WARNING|logging.py:314] 2025-08-03 01:39:10,893 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [WARNING|logging.py:314] 2025-08-03 01:39:10,917 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [WARNING|logging.py:314] 2025-08-03 01:39:10,976 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,979 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,982 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,987 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
[WARNING|logging.py:314] 2025-08-03 01:39:10,987 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2025-08-03 01:39:10,987 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [WARNING|logging.py:314] 2025-08-03 01:39:10,988 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) 08/03/2025 01:39:11 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:11 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage3_config_100b_1e8.json, disable_tqdm=False, dispatch_batches=None, 
do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=4, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/runs/Aug03_01-39-11_HOST-10-140-66-41, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=2000, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=3.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802, 
save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=1000, save_strategy=steps, save_total_limit=10000, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.05, ) 08/03/2025 01:39:11 - INFO - __main__ - Loading Tokenizer: /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,198 >> loading file vocab.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,198 >> loading file merges.txt [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,199 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,199 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,199 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,199 >> loading file tokenizer.json 08/03/2025 01:39:11 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:11 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage3_config_100b_1e8.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, 
eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=4, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/runs/Aug03_01-39-11_HOST-10-140-66-62, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=2000, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=3.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=1000, save_strategy=steps, 
save_total_limit=10000, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.05, ) 08/03/2025 01:39:11 - INFO - __main__ - Loading Tokenizer: /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file vocab.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file merges.txt [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file tokenizer.json 08/03/2025 01:39:11 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 08/03/2025 01:39:11 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage3_config_100b_1e8.json, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, 
fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=4, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/runs/Aug03_01-39-11_HOST-10-140-60-44, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=2000, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=3.0, optim=adamw_torch, optim_args=None, output_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=1000, save_strategy=steps, save_total_limit=10000, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, 
torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.05, ) 08/03/2025 01:39:11 - INFO - __main__ - Loading Tokenizer: /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file vocab.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file merges.txt [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file tokenizer.json [WARNING|logging.py:314] 2025-08-03 01:39:11,394 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [WARNING|logging.py:314] 2025-08-03 01:39:11,396 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) [WARNING|logging.py:314] 2025-08-03 01:39:11,520 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf --> before Client(conf_path) --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! 
Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! 08/03/2025 01:42:12 - INFO - __main__ - Loading InternVLChatModel... 
[INFO|configuration_utils.py:727] 2025-08-03 01:42:12,183 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/config.json [INFO|configuration_utils.py:792] 2025-08-03 01:42:12,185 >> Model config InternVLChatConfig { "_commit_hash": null, "_name_or_path": "/mnt/petrelfs/wangweiyun/workspace_wwy/open_source/InternVL/internvl_chat/work_dirs/internvl_chat_v3_0/InternVL3_0-2B-MPO-try0-2", "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "hidden_size": 1536, "image_fold": null, "llm_config": { "_attn_implementation_autoset": true, "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151643, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 1536, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 8960, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 70, "min_length": 0, "model_type": "qwen2", "moe_config": null, "no_repeat_ngram_size": 0, "num_attention_heads": 12, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 28, "num_key_value_heads": 2, "num_return_sequences": 
1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": { "factor": 2.0, "rope_type": "dynamic", "type": "dynamic" }, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "system_message": null, "template": "internvl2_5", "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": true, "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "auto_map": { "AutoConfig": "configuration_intern_vit.InternVisionConfig", "AutoModel": "modeling_intern_vit.InternVisionModel" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "capacity_factor": 1.2, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.1, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "eval_capacity_factor": 1.4, "exponential_decay_length_penalty": null, 
"finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 0.1, "initializer_range": 1e-10, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "laux_allreduce": "all_nodes", "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "moe_coeff_ratio": 0.5, "moe_intermediate_size": 768, "moe_output_scale": 4.0, "no_repeat_ngram_size": 0, "noisy_gate_policy": "RSample_before", "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_experts": 8, "num_hidden_layers": 24, "num_return_sequences": 1, "num_routed_experts": 4, "num_shared_experts": 4, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "shared_expert_intermediate_size": 3072, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true, "use_moe": false, "use_residual": true, "use_rts": false, "use_weighted_residual": false } } 08/03/2025 01:42:12 - INFO - __main__ - Using flash_attention_2 for LLaMA [INFO|modeling_utils.py:3473] 2025-08-03 01:42:12,191 >> loading weights file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/model.safetensors --> after 
Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! 08/03/2025 01:42:12 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:727] 2025-08-03 01:42:12,272 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/config.json [INFO|configuration_utils.py:792] 2025-08-03 01:42:12,274 >> Model config InternVLChatConfig { "_commit_hash": null, "_name_or_path": "/mnt/petrelfs/wangweiyun/workspace_wwy/open_source/InternVL/internvl_chat/work_dirs/internvl_chat_v3_0/InternVL3_0-2B-MPO-try0-2", "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "hidden_size": 1536, "image_fold": null, "llm_config": { "_attn_implementation_autoset": true, "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, 
"cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151643, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 1536, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 8960, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 70, "min_length": 0, "model_type": "qwen2", "moe_config": null, "no_repeat_ngram_size": 0, "num_attention_heads": 12, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 28, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": { "factor": 2.0, "rope_type": "dynamic", "type": "dynamic" }, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "system_message": null, "template": "internvl2_5", "tie_word_embeddings": false, "torch_dtype": "bfloat16", 
"transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": true, "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "auto_map": { "AutoConfig": "configuration_intern_vit.InternVisionConfig", "AutoModel": "modeling_intern_vit.InternVisionModel" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "capacity_factor": 1.2, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.1, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "eval_capacity_factor": 1.4, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 0.1, "initializer_range": 1e-10, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "laux_allreduce": "all_nodes", "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "moe_coeff_ratio": 0.5, "moe_intermediate_size": 768, "moe_output_scale": 4.0, "no_repeat_ngram_size": 0, "noisy_gate_policy": "RSample_before", "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_experts": 8, "num_hidden_layers": 24, "num_return_sequences": 1, "num_routed_experts": 4, "num_shared_experts": 4, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, 
"remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "shared_expert_intermediate_size": 3072, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true, "use_moe": false, "use_residual": true, "use_rts": false, "use_weighted_residual": false } } 08/03/2025 01:42:12 - INFO - __main__ - Using flash_attention_2 for LLaMA [INFO|modeling_utils.py:3473] 2025-08-03 01:42:12,279 >> loading weights file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/model.safetensors [INFO|modeling_utils.py:1426] 2025-08-03 01:42:12,296 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|modeling_utils.py:3582] 2025-08-03 01:42:12,296 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model [INFO|configuration_utils.py:826] 2025-08-03 01:42:12,309 >> Generate config GenerationConfig {} [INFO|modeling_utils.py:1426] 2025-08-03 01:42:12,735 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|modeling_utils.py:3582] 2025-08-03 01:42:12,735 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model [INFO|configuration_utils.py:826] 2025-08-03 01:42:12,753 >> Generate config GenerationConfig {} --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! 
Replace PHI3_ATTENTION_CLASSES to support packed training!! 
Replace LLAMA_ATTENTION_CLASSES to support packed training!! --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! 08/03/2025 01:42:14 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:727] 2025-08-03 01:42:14,229 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/config.json [INFO|configuration_utils.py:792] 2025-08-03 01:42:14,230 >> Model config InternVLChatConfig { "_commit_hash": null, "_name_or_path": "/mnt/petrelfs/wangweiyun/workspace_wwy/open_source/InternVL/internvl_chat/work_dirs/internvl_chat_v3_0/InternVL3_0-2B-MPO-try0-2", "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "hidden_size": 1536, "image_fold": null, "llm_config": { "_attn_implementation_autoset": true, "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151643, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 1536, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 8960, "is_decoder": 
false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 70, "min_length": 0, "model_type": "qwen2", "moe_config": null, "no_repeat_ngram_size": 0, "num_attention_heads": 12, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 28, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": { "factor": 2.0, "rope_type": "dynamic", "type": "dynamic" }, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "system_message": null, "template": "internvl2_5", "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": true, "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "auto_map": { "AutoConfig": "configuration_intern_vit.InternVisionConfig", "AutoModel": "modeling_intern_vit.InternVisionModel" }, "bad_words_ids": null, "begin_suppress_tokens": 
null, "bos_token_id": null, "capacity_factor": 1.2, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.1, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "eval_capacity_factor": 1.4, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 0.1, "initializer_range": 1e-10, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "laux_allreduce": "all_nodes", "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "moe_coeff_ratio": 0.5, "moe_intermediate_size": 768, "moe_output_scale": 4.0, "no_repeat_ngram_size": 0, "noisy_gate_policy": "RSample_before", "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_experts": 8, "num_hidden_layers": 24, "num_return_sequences": 1, "num_routed_experts": 4, "num_shared_experts": 4, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "shared_expert_intermediate_size": 3072, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, 
"use_bfloat16": true, "use_flash_attn": true, "use_moe": false, "use_residual": true, "use_rts": false, "use_weighted_residual": false } } 08/03/2025 01:42:14 - INFO - __main__ - Using flash_attention_2 for LLaMA [INFO|modeling_utils.py:3473] 2025-08-03 01:42:14,236 >> loading weights file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/model.safetensors [INFO|modeling_utils.py:1426] 2025-08-03 01:42:14,253 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|modeling_utils.py:3582] 2025-08-03 01:42:14,253 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model [INFO|configuration_utils.py:826] 2025-08-03 01:42:14,267 >> Generate config GenerationConfig {} --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! 
Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! 08/03/2025 01:42:18 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:727] 2025-08-03 01:42:18,408 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/config.json --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! 
Replace LLAMA_ATTENTION_CLASSES to support packed training!! [INFO|configuration_utils.py:792] 2025-08-03 01:42:18,410 >> Model config InternVLChatConfig { "_commit_hash": null, "_name_or_path": "/mnt/petrelfs/wangweiyun/workspace_wwy/open_source/InternVL/internvl_chat/work_dirs/internvl_chat_v3_0/InternVL3_0-2B-MPO-try0-2", "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "hidden_size": 1536, "image_fold": null, "llm_config": { "_attn_implementation_autoset": true, "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151643, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 1536, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 8960, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 70, "min_length": 0, "model_type": "qwen2", "moe_config": null, "no_repeat_ngram_size": 0, "num_attention_heads": 12, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 28, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 
null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_scaling": { "factor": 2.0, "rope_type": "dynamic", "type": "dynamic" }, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": false, "use_sliding_window": false, "vocab_size": 151674 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "system_message": null, "template": "internvl2_5", "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": true, "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "auto_map": { "AutoConfig": "configuration_intern_vit.InternVisionConfig", "AutoModel": "modeling_intern_vit.InternVisionModel" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "capacity_factor": 1.2, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.1, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "eval_capacity_factor": 1.4, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 
1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 0.1, "initializer_range": 1e-10, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "laux_allreduce": "all_nodes", "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "moe_coeff_ratio": 0.5, "moe_intermediate_size": 768, "moe_output_scale": 4.0, "no_repeat_ngram_size": 0, "noisy_gate_policy": "RSample_before", "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_experts": 8, "num_hidden_layers": 24, "num_return_sequences": 1, "num_routed_experts": 4, "num_shared_experts": 4, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "shared_expert_intermediate_size": 3072, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true, "use_moe": false, "use_residual": true, "use_rts": false, "use_weighted_residual": false } } 08/03/2025 01:42:18 - INFO - __main__ - Using flash_attention_2 for LLaMA [INFO|modeling_utils.py:3473] 2025-08-03 01:42:18,416 >> loading weights file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/model.safetensors [INFO|modeling_utils.py:1426] 2025-08-03 01:42:18,443 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. 
[INFO|modeling_utils.py:3582] 2025-08-03 01:42:18,444 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model [INFO|configuration_utils.py:826] 2025-08-03 01:42:18,476 >> Generate config GenerationConfig {} --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! --> after Client(conf_path) Replace INTERNLM2_ATTENTION_CLASSES to support packed training!! Replace QWEN2_ATTENTION_CLASSES to support packed training!! Replace PHI3_ATTENTION_CLASSES to support packed training!! Replace LLAMA_ATTENTION_CLASSES to support packed training!! [INFO|configuration_utils.py:826] 2025-08-03 01:42:22,338 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "use_cache": false } [INFO|configuration_utils.py:826] 2025-08-03 01:42:22,326 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "use_cache": false } [INFO|configuration_utils.py:826] 2025-08-03 01:42:22,404 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "use_cache": false } [INFO|configuration_utils.py:826] 2025-08-03 01:42:22,466 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151643, "use_cache": false } [2025-08-03 01:42:22,888] [INFO] [partition_parameters.py:343:__exit__] finished initializing model - num_params = 685, num_elems = 2.09B [INFO|modeling_utils.py:4350] 2025-08-03 01:42:28,929 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4350] 2025-08-03 01:42:28,948 >> All model checkpoint weights were used when initializing InternVLChatModel. 
[INFO|modeling_utils.py:4358] 2025-08-03 01:42:28,929 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|modeling_utils.py:4358] 2025-08-03 01:42:28,948 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|modeling_utils.py:4350] 2025-08-03 01:42:28,933 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4358] 2025-08-03 01:42:28,934 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. [INFO|modeling_utils.py:4350] 2025-08-03 01:42:28,945 >> All model checkpoint weights were used when initializing InternVLChatModel. [INFO|modeling_utils.py:4358] 2025-08-03 01:42:28,945 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B. If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training. 
[INFO|configuration_utils.py:779] 2025-08-03 01:42:28,941 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/generation_config.json [INFO|configuration_utils.py:779] 2025-08-03 01:42:28,946 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/generation_config.json [INFO|configuration_utils.py:779] 2025-08-03 01:42:28,961 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/generation_config.json [INFO|configuration_utils.py:826] 2025-08-03 01:42:28,942 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:826] 2025-08-03 01:42:28,946 >> Generate config GenerationConfig {} [INFO|configuration_utils.py:826] 2025-08-03 01:42:28,961 >> Generate config GenerationConfig {} 08/03/2025 01:42:28 - INFO - __main__ - Finished 08/03/2025 01:42:28 - INFO - __main__ - model.config.force_image_size: 448 08/03/2025 01:42:28 - INFO - __main__ - Finished 08/03/2025 01:42:28 - INFO - __main__ - data_args.force_image_size: 448 08/03/2025 01:42:28 - INFO - __main__ - model.config.vision_config.image_size: 448 08/03/2025 01:42:28 - INFO - __main__ - model.config.force_image_size: 448 08/03/2025 01:42:28 - INFO - __main__ - data_args.force_image_size: 448 08/03/2025 01:42:28 - INFO - __main__ - model.config.vision_config.image_size: 448 08/03/2025 01:42:28 - INFO - __main__ - Finished 08/03/2025 01:42:28 - INFO - __main__ - model.config.force_image_size: 448 08/03/2025 01:42:28 - INFO - __main__ - data_args.force_image_size: 448 08/03/2025 01:42:28 - INFO - __main__ - model.config.vision_config.image_size: 448 [INFO|configuration_utils.py:779] 2025-08-03 01:42:28,959 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/generation_config.json [INFO|configuration_utils.py:826] 2025-08-03 01:42:28,960 >> Generate config GenerationConfig {} 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] num_image_token: 256 
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:28 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:28 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:28 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:28 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:29 - INFO - __main__ - Finished 08/03/2025 01:42:29 - INFO - __main__ - model.config.force_image_size: 448 08/03/2025 01:42:29 - INFO - __main__ - data_args.force_image_size: 448 08/03/2025 01:42:29 - INFO - __main__ - model.config.vision_config.image_size: 448 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:29 - INFO - __main__ - Add dataset: point_xy_format with length: 666578 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] dynamic_image_size: True 
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:29 - INFO - __main__ - Add dataset: converted_affordance with length: 16305 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:29 - INFO - __main__ - Add dataset: point_xy_format with length: 666578 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:30 - INFO - __main__ - Add dataset: converted_affordance with length: 16305 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:30 - INFO - __main__ - Add dataset: converted_trajectory with length: 17175 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True 
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:30 - INFO - __main__ - Add dataset: converted_trajectory with length: 17175 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:30 - INFO - __main__ - Add dataset: point_xy_format with length: 666578 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:30 - INFO - __main__ - Add dataset: converted_affordance with length: 16305 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:31 - INFO - __main__ - Add dataset: converted_trajectory with length: 17175 08/03/2025 01:42:31 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:31 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:31 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:31 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, 
max_dynamic_patch: 12 08/03/2025 01:42:31 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:32 - INFO - __main__ - Add dataset: pixmo-points with length: 161095 08/03/2025 01:42:32 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:32 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:32 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:32 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:32 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:32 - INFO - __main__ - Add dataset: pixmo-points with length: 161095 08/03/2025 01:42:32 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:32 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:32 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:32 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:32 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:33 - INFO - __main__ - Add dataset: pixmo-points with length: 161095 08/03/2025 01:42:33 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:33 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:33 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:33 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:33 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:34 - INFO - __main__ - Add dataset: point_xy_format with length: 666578 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:34 - INFO - __main__ - Formatting inputs...Skip in lazy mode 
08/03/2025 01:42:34 - INFO - __main__ - Add dataset: paco_lvis_v1_train with length: 228945 08/03/2025 01:42:34 - INFO - __main__ - Add dataset: paco_lvis_v1_train with length: 228945 08/03/2025 01:42:34 - INFO - __main__ - Add dataset: converted_affordance with length: 16305 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:34 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:34 - INFO - __main__ - Add dataset: converted_trajectory with length: 17175 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] num_image_token: 256 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] dynamic_image_size: True 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] use_thumbnail: True 08/03/2025 01:42:34 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12 08/03/2025 01:42:34 - INFO - __main__ - Formatting inputs...Skip in lazy mode 08/03/2025 01:42:35 - INFO - __main__ - Add dataset: paco_lvis_v1_train with length: 228945 [INFO|trainer.py:522] 2025-08-03 01:42:36,070 >> max_steps is given, it will override any value given in num_train_epochs [INFO|trainer.py:571] 2025-08-03 01:42:36,070 >> Using auto half precision backend [INFO|trainer.py:522] 2025-08-03 01:42:36,054 >> max_steps is given, it will override any value given in num_train_epochs [INFO|trainer.py:571] 2025-08-03 01:42:36,054 >> Using auto half precision backend [INFO|trainer.py:522] 2025-08-03 01:42:36,328 >> max_steps is given, it will override any value given in num_train_epochs [INFO|trainer.py:571] 2025-08-03 01:42:36,328 >> Using auto half precision backend 08/03/2025 01:42:36 - INFO - __main__ - Add dataset: pixmo-points with length: 161095 08/03/2025 01:42:36 - INFO - __main__ - [Dataset] 
num_image_token: 256
08/03/2025 01:42:36 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:36 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:36 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:36 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
08/03/2025 01:42:38 - INFO - __main__ - Add dataset: paco_lvis_v1_train with length: 228945
08/03/2025 01:42:38 - INFO - internvl.train.dataset_packed - Loaded dataset to pack: ['point_xy_format', 'converted_affordance', 'converted_trajectory', 'pixmo-points', 'paco_lvis_v1_train'], self.num_images_expected=48, self.max_packed_tokens=16384, self.replacement=True, self.allow_overflow=False
08/03/2025 01:42:38 - INFO - internvl.train.dataset_packed - Sampling prob for each dataset:
point_xy_format : 61.15%
converted_affordance : 1.50%
converted_trajectory : 1.58%
pixmo-points : 14.78%
paco_lvis_v1_train : 21.00%
08/03/2025 01:42:38 - INFO - __main__ - vision_model.embeddings.class_embedding
08/03/2025 01:42:38 - INFO - __main__ - vision_model.embeddings.position_embedding
08/03/2025 01:42:38 - INFO - __main__ - vision_model.embeddings.patch_embedding.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.embeddings.patch_embedding.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.attn.proj.bias
08/03/2025 01:42:38 - INFO - 
__main__ - vision_model.encoder.layers.0.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.ls2 08/03/2025 01:42:38 - INFO - __main__ - 
vision_model.encoder.layers.2.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - 
vision_model.encoder.layers.3.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - 
vision_model.encoder.layers.5.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - 
vision_model.encoder.layers.7.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - 
vision_model.encoder.layers.9.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - 
vision_model.encoder.layers.10.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - 
vision_model.encoder.layers.12.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.ls2 08/03/2025 01:42:38 - INFO - __main__ - 
vision_model.encoder.layers.14.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.norm1.bias 08/03/2025 01:42:38 
- INFO - __main__ - vision_model.encoder.layers.15.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.mlp.fc1.bias 08/03/2025 01:42:38 
- INFO - __main__ - vision_model.encoder.layers.17.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.attn.qkv.bias 08/03/2025 01:42:38 - INFO 
- __main__ - vision_model.encoder.layers.19.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.norm2.bias 08/03/2025 
01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.mlp.fc2.bias 
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.ls1 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.ls2 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.attn.qkv.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.attn.qkv.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.attn.proj.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.attn.proj.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.mlp.fc1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.mlp.fc1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.mlp.fc2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.mlp.fc2.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.norm1.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.norm1.bias 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.norm2.weight 08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.norm2.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.embed_tokens.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.0.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.2.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.input_layernorm.weight 
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - 
__main__ - language_model.model.layers.5.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.7.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.post_attention_layernorm.weight
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.10.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.12.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.14.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.16.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.17.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.19.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.21.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.23.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.24.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - 
language_model.model.layers.26.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.q_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.q_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.k_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.k_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.v_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.v_proj.bias 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.o_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.mlp.gate_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.mlp.up_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.mlp.down_proj.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.input_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.post_attention_layernorm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.model.norm.weight 08/03/2025 01:42:38 - INFO - __main__ - language_model.lm_head.weight 08/03/2025 01:42:38 - INFO - __main__ - mlp1.0.weight 08/03/2025 01:42:38 - INFO - __main__ - mlp1.0.bias 08/03/2025 01:42:38 - INFO - __main__ - mlp1.1.weight 08/03/2025 01:42:38 - INFO - __main__ - 
mlp1.1.bias 08/03/2025 01:42:38 - INFO - __main__ - mlp1.3.weight 08/03/2025 01:42:38 - INFO - __main__ - mlp1.3.bias Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root... 08/03/2025 01:42:38 - WARNING - accelerate.utils.other - Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. 
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.4399733543395996 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.4024319648742676 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.4023013114929199 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.4023604393005371 seconds Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.4067509174346924 seconds Time to load fused_adam op: 0.40588808059692383 seconds Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.4028747081756592 seconds Time to load fused_adam op: 0.40257835388183594 seconds Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.4022998809814453 seconds Time to load fused_adam op: 0.4024229049682617 seconds Loading extension module fused_adam... Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.5733091831207275 seconds Time to load fused_adam op: 0.5024425983428955 seconds Time to load fused_adam op: 0.515697717666626 seconds Loading extension module fused_adam... 
Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.5719597339630127 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.5374770164489746 seconds Loading extension module fused_adam... Loading extension module fused_adam... Loading extension module fused_adam... Loading extension module fused_adam... Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.4998030662536621 seconds Time to load fused_adam op: 0.47902798652648926 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.5331830978393555 seconds Time to load fused_adam op: 0.5076618194580078 seconds Time to load fused_adam op: 0.49834728240966797 seconds Time to load fused_adam op: 0.4961533546447754 seconds Time to load fused_adam op: 0.4804708957672119 seconds Time to load fused_adam op: 0.5025105476379395 seconds Time to load fused_adam op: 0.5071156024932861 seconds [INFO|trainer.py:522] 2025-08-03 01:42:39,397 >> max_steps is given, it will override any value given in num_train_epochs [INFO|trainer.py:571] 2025-08-03 01:42:39,397 >> Using auto half precision backend [2025-08-03 01:42:39,734] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown [2025-08-03 01:42:39,752] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root... 
Detected CUDA files, patching ldflags Emitting ninja build file /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.6324775218963623 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.5039777755737305 seconds Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.5029892921447754 seconds Time to load fused_adam op: 0.5027832984924316 seconds [2025-08-03 01:42:41,439] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2025-08-03 01:42:41,439] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer Loading extension module fused_adam... Time to load fused_adam op: 0.6037395000457764 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.6040046215057373 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.602717399597168 seconds Loading extension module fused_adam... 
Time to load fused_adam op: 0.6030519008636475 seconds [2025-08-03 01:42:41,528] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2025-08-03 01:42:41,528] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2025-08-03 01:42:41,528] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False [2025-08-03 01:42:41,528] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer [2025-08-03 01:42:41,883] [INFO] [utils.py:800:see_memory_usage] Stage 3 initialize beginning [2025-08-03 01:42:41,884] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB Max_MA 1.43 GB CA 0.89 GB Max_CA 2 GB [2025-08-03 01:42:41,885] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 213.78 GB, percent = 10.6% [2025-08-03 01:42:41,888] [INFO] [stage3.py:130:__init__] Reduce bucket size 100000000 [2025-08-03 01:42:41,888] [INFO] [stage3.py:131:__init__] Prefetch bucket size 100000000 [2025-08-03 01:42:42,106] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2025-08-03 01:42:42,107] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB Max_MA 0.56 GB CA 0.89 GB Max_CA 1 GB [2025-08-03 01:42:42,108] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 213.78 GB, percent = 10.6% Parameter Offload: Total persistent parameters: 526848 in 387 params [2025-08-03 01:42:42,353] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2025-08-03 01:42:42,354] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB Max_MA 0.56 GB CA 0.89 GB Max_CA 1 GB [2025-08-03 01:42:42,355] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 213.78 GB, percent = 10.6% [2025-08-03 01:42:42,568] [INFO] [utils.py:800:see_memory_usage] Before creating fp16 partitions [2025-08-03 01:42:42,569] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB Max_MA 0.56 GB CA 
0.89 GB Max_CA 1 GB [2025-08-03 01:42:42,570] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 213.77 GB, percent = 10.6% [2025-08-03 01:42:43,178] [INFO] [utils.py:800:see_memory_usage] After creating fp16 partitions: 1 [2025-08-03 01:42:43,179] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB Max_MA 0.56 GB CA 0.78 GB Max_CA 1 GB [2025-08-03 01:42:43,181] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 213.89 GB, percent = 10.6% [2025-08-03 01:42:43,361] [INFO] [utils.py:800:see_memory_usage] Before creating fp32 partitions [2025-08-03 01:42:43,362] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB Max_MA 0.56 GB CA 0.78 GB Max_CA 1 GB [2025-08-03 01:42:43,363] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 213.76 GB, percent = 10.6% [2025-08-03 01:42:43,584] [INFO] [utils.py:800:see_memory_usage] After creating fp32 partitions [2025-08-03 01:42:43,585] [INFO] [utils.py:801:see_memory_usage] MA 0.8 GB Max_MA 0.92 GB CA 1.14 GB Max_CA 1 GB [2025-08-03 01:42:43,586] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 213.76 GB, percent = 10.6% [2025-08-03 01:42:43,777] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states [2025-08-03 01:42:43,778] [INFO] [utils.py:801:see_memory_usage] MA 0.8 GB Max_MA 0.8 GB CA 1.14 GB Max_CA 1 GB [2025-08-03 01:42:43,779] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 213.76 GB, percent = 10.6% [2025-08-03 01:42:43,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | init_optimizer_state: 0.12 [2025-08-03 01:42:44,074] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states [2025-08-03 01:42:44,075] [INFO] [utils.py:801:see_memory_usage] MA 0.8 GB Max_MA 1.05 GB CA 1.39 GB Max_CA 1 GB [2025-08-03 01:42:44,076] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 213.76 GB, percent = 10.6% [2025-08-03 01:42:44,077] [INFO] [stage3.py:486:_setup_for_real_optimizer] optimizer state 
initialized [INFO|trainer.py:1721] 2025-08-03 01:42:44,750 >> ***** Running training ***** [INFO|trainer.py:1722] 2025-08-03 01:42:44,750 >> Num examples = 256,000 [INFO|trainer.py:1723] 2025-08-03 01:42:44,750 >> Num Epochs = 9,223,372,036,854,775,807 [INFO|trainer.py:1724] 2025-08-03 01:42:44,750 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1721] 2025-08-03 01:42:44,769 >> ***** Running training ***** [INFO|trainer.py:1727] 2025-08-03 01:42:44,750 >> Total train batch size (w. parallel, distributed & accumulation) = 128 [INFO|trainer.py:1728] 2025-08-03 01:42:44,750 >> Gradient Accumulation steps = 4 [INFO|trainer.py:1729] 2025-08-03 01:42:44,750 >> Total optimization steps = 2,000 [INFO|trainer.py:1722] 2025-08-03 01:42:44,769 >> Num examples = 256,000 [INFO|trainer.py:1723] 2025-08-03 01:42:44,769 >> Num Epochs = 9,223,372,036,854,775,807 [INFO|trainer.py:1724] 2025-08-03 01:42:44,770 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1727] 2025-08-03 01:42:44,770 >> Total train batch size (w. parallel, distributed & accumulation) = 128 [INFO|trainer.py:1728] 2025-08-03 01:42:44,770 >> Gradient Accumulation steps = 4 [INFO|trainer.py:1729] 2025-08-03 01:42:44,770 >> Total optimization steps = 2,000 [INFO|trainer.py:1730] 2025-08-03 01:42:44,752 >> Number of trainable parameters = 2,088,957,440 [INFO|trainer.py:1730] 2025-08-03 01:42:44,772 >> Number of trainable parameters = 2,088,957,440 [INFO|trainer.py:1721] 2025-08-03 01:42:44,759 >> ***** Running training ***** [INFO|trainer.py:1722] 2025-08-03 01:42:44,760 >> Num examples = 256,000 [INFO|trainer.py:1723] 2025-08-03 01:42:44,760 >> Num Epochs = 9,223,372,036,854,775,807 [INFO|trainer.py:1724] 2025-08-03 01:42:44,760 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1727] 2025-08-03 01:42:44,760 >> Total train batch size (w. 
parallel, distributed & accumulation) = 128 [INFO|trainer.py:1728] 2025-08-03 01:42:44,760 >> Gradient Accumulation steps = 4 [INFO|trainer.py:1729] 2025-08-03 01:42:44,760 >> Total optimization steps = 2,000 [INFO|trainer.py:1730] 2025-08-03 01:42:44,762 >> Number of trainable parameters = 2,088,957,440 [2025-08-03 01:42:45,053] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer [2025-08-03 01:42:45,054] [INFO] [utils.py:801:see_memory_usage] MA 1.11 GB Max_MA 1.98 GB CA 2.7 GB Max_CA 3 GB [2025-08-03 01:42:45,055] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 214.06 GB, percent = 10.6% [2025-08-03 01:42:45,055] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw [2025-08-03 01:42:45,055] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2025-08-03 01:42:45,055] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-08-03 01:42:45,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2025-08-03 01:42:45,057] [INFO] [config.py:996:print] DeepSpeedEngine configuration: [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] amp_enabled .................. False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] amp_params ................... False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] bfloat16_enabled ............. True [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] comms_config ................. [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] communication_data_type ...... None [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={} [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] dataloader_drop_last ......... False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] disable_allgather ............ False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] dump_state ................... 
False [2025-08-03 01:42:45,058] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] elasticity_enabled ........... False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] fp16_auto_cast ............... None [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] fp16_enabled ................. False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] global_rank .................. 0 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] grad_accum_dtype ............. None [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 4 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] gradient_clipping ............ 1.0 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] graph_harvesting ............. 
False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 1 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] load_universal_checkpoint .... False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] loss_scale ................... 1.0 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] memory_breakdown ............. False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] mics_shard_size .............. -1 [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] optimizer_name ............... adamw [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.05} [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] pld_enabled .................. False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] pld_params ................... False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] prescale_gradients ........... False [2025-08-03 01:42:45,059] [INFO] [config.py:1000:print] scheduler_name ............... None [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] scheduler_params ............. None [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32 [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] sparse_attention ............. None [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] steps_per_print .............. inf [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] train_batch_size ............. 128 [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 1 [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] use_data_before_expert_parallel_ False [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] use_node_local_storage ....... False [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] wall_clock_breakdown ......... True [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] weight_quantization_config ... None [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] world_size ................... 32 [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] zero_allow_untested_optimizer False [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=100000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=100000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=100000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] zero_enabled ................. True [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True [2025-08-03 01:42:45,060] [INFO] [config.py:1000:print] zero_optimization_stage ...... 
3 [2025-08-03 01:42:45,060] [INFO] [config.py:986:print_user_config] json = { "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1.000000e+08, "reduce_bucket_size": 1.000000e+08, "stage3_prefetch_bucket_size": 1.000000e+08, "stage3_param_persistence_threshold": 1.000000e+04, "stage3_max_live_parameters": 1.000000e+09, "stage3_max_reuse_distance": 1.000000e+09, "stage3_gather_16bit_weights_on_model_save": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 2e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.05 } }, "gradient_accumulation_steps": 4, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 128, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": true } [INFO|trainer.py:1721] 2025-08-03 01:42:45,060 >> ***** Running training ***** [INFO|trainer.py:1722] 2025-08-03 01:42:45,060 >> Num examples = 256,000 [INFO|trainer.py:1723] 2025-08-03 01:42:45,060 >> Num Epochs = 9,223,372,036,854,775,807 [INFO|trainer.py:1724] 2025-08-03 01:42:45,060 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1727] 2025-08-03 01:42:45,061 >> Total train batch size (w. 
parallel, distributed & accumulation) = 128 [INFO|trainer.py:1728] 2025-08-03 01:42:45,061 >> Gradient Accumulation steps = 4 [INFO|trainer.py:1729] 2025-08-03 01:42:45,061 >> Total optimization steps = 2,000 [INFO|trainer.py:1730] 2025-08-03 01:42:45,063 >> Number of trainable parameters = 2,088,957,440 [2025-08-03 01:42:54,054] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13342 total_samples=4, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:17,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms)
| fwd_microstep: 9622.70 | bwd_microstep: 3464.68 | bwd_inner_microstep: 3449.24 | bwd_allreduce_microstep: 15.28 | step_microstep: 0.06 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13827 total_samples=8, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:24,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2499.24 | bwd_microstep: 4776.29 | bwd_inner_microstep: 4752.78 | bwd_allreduce_microstep: 23.45 | step_microstep: 0.07 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13444 total_samples=12, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:28,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1486.90 | bwd_microstep: 2472.37 | bwd_inner_microstep: 2332.45 | bwd_allreduce_microstep: 139.85 | step_microstep: 0.06 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13780 total_samples=16, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:33,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.49 [2025-08-03 01:44:33,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1628.79 | bwd_microstep: 2939.01 | bwd_inner_microstep: 2460.50 | bwd_allreduce_microstep: 478.37 | step_microstep: 154.60 [2025-08-03 01:44:33,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 15237.58 | bwd: 13652.38 | bwd_inner: 12995.01 | bwd_allreduce: 657.01 | step: 154.80 {'loss': 2.4023, 'learning_rate': 3.3333333333333335e-07, 'epoch': 0.0} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13811 total_samples=21, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:36,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1425.59 | bwd_microstep: 2143.14 | bwd_inner_microstep: 2002.90 | bwd_allreduce_microstep: 140.17 | step_microstep: 0.06 dynamic ViT batch size: 48, 
images per sample: 48.0, dynamic token length: 13382 total_samples=25, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:41,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1503.37 | bwd_microstep: 3043.16 | bwd_inner_microstep: 2895.22 | bwd_allreduce_microstep: 147.86 | step_microstep: 0.08 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14242 total_samples=29, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:45,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1478.95 | bwd_microstep: 2405.54 | bwd_inner_microstep: 2383.43 | bwd_allreduce_microstep: 22.06 | step_microstep: 0.05 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13549 total_samples=33, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:48,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.63 [2025-08-03 01:44:48,912] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. 
If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time [2025-08-03 01:44:48,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1448.58 | bwd_microstep: 1769.16 | bwd_inner_microstep: 1696.72 | bwd_allreduce_microstep: 72.35 | step_microstep: 137.52 [2025-08-03 01:44:48,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5856.42 | bwd: 9361.04 | bwd_inner: 8978.27 | bwd_allreduce: 382.52 | step: 137.73 {'loss': 2.3879, 'learning_rate': 6.666666666666667e-07, 'epoch': 0.0} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15000 total_samples=38, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:52,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.34 | bwd_microstep: 2043.81 | bwd_inner_microstep: 1938.67 | bwd_allreduce_microstep: 105.07 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13658 total_samples=42, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:55,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.59 | bwd_microstep: 1749.25 | bwd_inner_microstep: 1687.29 | bwd_allreduce_microstep: 61.89 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14057 total_samples=46, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:44:58,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1448.02 | bwd_microstep: 1831.49 | bwd_inner_microstep: 1731.25 | bwd_allreduce_microstep: 100.18 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14561 total_samples=50, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:02,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_step: 14.40 [2025-08-03 01:45:02,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1450.32 | bwd_microstep: 2125.63 | bwd_inner_microstep: 2023.28 | bwd_allreduce_microstep: 102.29 | step_microstep: 123.93 [2025-08-03 01:45:02,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5070.21 | bwd: 7750.22 | bwd_inner: 7380.48 | bwd_allreduce: 369.51 | step: 124.28 {'loss': 2.4169, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13215 total_samples=54, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:05,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1451.74 | bwd_microstep: 1771.67 | bwd_inner_microstep: 1691.28 | bwd_allreduce_microstep: 80.32 | step_microstep: 0.17 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14164 total_samples=59, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:08,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1144.33 | bwd_microstep: 1803.45 | bwd_inner_microstep: 1734.69 | bwd_allreduce_microstep: 68.70 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13826 total_samples=63, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:11,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.65 | bwd_microstep: 2018.88 | bwd_inner_microstep: 1855.80 | bwd_allreduce_microstep: 163.01 | step_microstep: 0.20 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13482 total_samples=67, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:16,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51 [2025-08-03 01:45:16,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.46 | bwd_microstep: 3373.24 | bwd_inner_microstep: 
2892.18 | bwd_allreduce_microstep: 480.96 | step_microstep: 953.91 [2025-08-03 01:45:16,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4056.09 | bwd: 8967.29 | bwd_inner: 8173.96 | bwd_allreduce: 793.06 | step: 954.39 {'loss': 2.3752, 'learning_rate': 1.3333333333333334e-06, 'epoch': 0.0} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14053 total_samples=71, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:19,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.38 | bwd_microstep: 2024.75 | bwd_inner_microstep: 1897.41 | bwd_allreduce_microstep: 127.28 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13469 total_samples=75, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:23,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1484.92 | bwd_microstep: 2638.82 | bwd_inner_microstep: 2627.48 | bwd_allreduce_microstep: 11.27 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13533 total_samples=79, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:27,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1422.82 | bwd_microstep: 2280.57 | bwd_inner_microstep: 2109.13 | bwd_allreduce_microstep: 171.37 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14828 total_samples=86, num_samples=7, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:30,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87 [2025-08-03 01:45:30,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1504.09 | bwd_microstep: 1778.89 | bwd_inner_microstep: 1735.27 | bwd_allreduce_microstep: 43.56 | step_microstep: 133.61 [2025-08-03 01:45:30,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5114.15 | bwd: 
8723.08 | bwd_inner: 8369.28 | bwd_allreduce: 353.56 | step: 134.09 {'loss': 2.3778, 'learning_rate': 1.6666666666666667e-06, 'epoch': 0.0} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13349 total_samples=90, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:34,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1396.85 | bwd_microstep: 2294.00 | bwd_inner_microstep: 2121.57 | bwd_allreduce_microstep: 172.37 | step_microstep: 0.09 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13432 total_samples=94, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:39,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1408.40 | bwd_microstep: 3254.09 | bwd_inner_microstep: 2772.67 | bwd_allreduce_microstep: 481.36 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13493 total_samples=98, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:43,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1827.83 | bwd_microstep: 1798.47 | bwd_inner_microstep: 1712.90 | bwd_allreduce_microstep: 85.51 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13837 total_samples=102, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:46,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98 [2025-08-03 01:45:46,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1187.49 | bwd_microstep: 2282.90 | bwd_inner_microstep: 2113.95 | bwd_allreduce_microstep: 168.89 | step_microstep: 114.58 [2025-08-03 01:45:46,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5820.50 | bwd: 9629.52 | bwd_inner: 8721.09 | bwd_allreduce: 908.20 | step: 114.90 {'loss': 2.3989, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0} dynamic ViT batch size: 48, 
images per sample: 48.0, dynamic token length: 13604 total_samples=106, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:49,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.80 | bwd_microstep: 1816.05 | bwd_inner_microstep: 1714.51 | bwd_allreduce_microstep: 101.48 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14006 total_samples=110, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:52,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1150.88 | bwd_microstep: 2246.20 | bwd_inner_microstep: 2088.92 | bwd_allreduce_microstep: 157.22 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14214 total_samples=114, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:55,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.48 | bwd_microstep: 1734.21 | bwd_inner_microstep: 1698.69 | bwd_allreduce_microstep: 35.46 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14825 total_samples=118, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 01:45:59,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01 [2025-08-03 01:45:59,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1506.42 | bwd_microstep: 2408.00 | bwd_inner_microstep: 2258.29 | bwd_allreduce_microstep: 149.66 | step_microstep: 113.57 [2025-08-03 01:45:59,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4074.51 | bwd: 8204.50 | bwd_inner: 7760.39 | bwd_allreduce: 443.88 | step: 114.00 0%| | 0/2000 [00:00> Saving model checkpoint to work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000 [INFO|configuration_utils.py:473] 2025-08-03 04:47:37,728 >> Configuration saved in 
work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/config.json [INFO|configuration_utils.py:594] 2025-08-03 04:47:37,732 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/generation_config.json [INFO|modeling_utils.py:2493] 2025-08-03 04:47:41,733 >> Model weights saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/model.safetensors [INFO|tokenization_utils_base.py:2433] 2025-08-03 04:47:41,739 >> tokenizer config file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/tokenizer_config.json [INFO|tokenization_utils_base.py:2442] 2025-08-03 04:47:41,743 >> Special tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/special_tokens_map.json [INFO|tokenization_utils_base.py:2493] 2025-08-03 04:47:41,745 >> added tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/added_tokens.json [2025-08-03 04:47:42,273] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step1000 is about to be saved! [2025-08-03 04:47:42,287] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_24_mp_rank_00_model_states.pt... 
[2025-08-03 04:47:42,294] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_model_states.pt [2025-08-03 04:47:42,294] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_model_states.pt... [2025-08-03 04:47:42,328] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_8_mp_rank_00_model_states.pt... [2025-08-03 04:47:42,295] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_16_mp_rank_00_model_states.pt... [2025-08-03 04:47:42,354] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_model_states.pt. [2025-08-03 04:47:42,385] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_16_mp_rank_00_model_states.pt. [2025-08-03 04:47:42,456] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_8_mp_rank_00_model_states.pt. [2025-08-03 04:47:42,438] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_24_mp_rank_00_model_states.pt. 
[2025-08-03 04:47:42,479] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... [2025-08-03 04:47:42,478] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2025-08-03 04:47:42,512] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... [2025-08-03 04:47:42,474] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... [2025-08-03 04:47:43,856] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. [2025-08-03 04:47:43,857] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt [2025-08-03 04:47:44,043] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
[2025-08-03 04:47:44,044] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt [2025-08-03 04:47:44,065] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. [2025-08-03 04:47:44,065] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt [2025-08-03 04:47:44,136] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2025-08-03 04:47:44,161] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-08-03 04:47:44,429] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1000 is ready now! [2025-08-03 04:47:44,463] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1000 is ready now! [2025-08-03 04:47:44,430] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1000 is ready now! [2025-08-03 04:47:44,424] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1000 is ready now! 
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15085 total_samples=15190, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:47:47,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.23 | bwd_microstep: 1741.66 | bwd_inner_microstep: 1735.42 | bwd_allreduce_microstep: 6.18 | step_microstep: 0.11 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12262 total_samples=15194, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:47:49,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.85 | bwd_microstep: 1753.62 | bwd_inner_microstep: 1574.48 | bwd_allreduce_microstep: 179.08 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11949 total_samples=15197, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:47:52,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.28 | bwd_microstep: 1873.81 | bwd_inner_microstep: 1554.20 | bwd_allreduce_microstep: 319.55 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13303 total_samples=15201, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:47:55,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.74 [2025-08-03 04:47:55,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.45 | bwd_microstep: 2020.97 | bwd_inner_microstep: 1878.44 | bwd_allreduce_microstep: 142.47 | step_microstep: 135.80 [2025-08-03 04:47:55,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.75 | bwd: 7390.11 | bwd_inner: 6742.51 | bwd_allreduce: 647.36 | step: 136.13 {'loss': 0.7533, 'learning_rate': 1.046944692098213e-05, 'epoch': 0.5} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13912 total_samples=15206, num_samples=5, num_padding_tokens=0, num_padding_images=0 
[2025-08-03 04:47:57,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.28 | bwd_microstep: 1937.82 | bwd_inner_microstep: 1778.89 | bwd_allreduce_microstep: 158.87 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13738 total_samples=15210, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:00,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.96 | bwd_microstep: 1823.58 | bwd_inner_microstep: 1748.85 | bwd_allreduce_microstep: 74.67 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14041 total_samples=15214, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:02,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.78 | bwd_microstep: 1773.63 | bwd_inner_microstep: 1729.50 | bwd_allreduce_microstep: 44.06 | step_microstep: 0.11 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13202 total_samples=15218, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:05,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88 [2025-08-03 04:48:05,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.43 | bwd_microstep: 1860.89 | bwd_inner_microstep: 1756.32 | bwd_allreduce_microstep: 104.51 | step_microstep: 131.27 [2025-08-03 04:48:05,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.38 | bwd: 7395.97 | bwd_inner: 7013.56 | bwd_allreduce: 382.18 | step: 131.60 {'loss': 0.759, 'learning_rate': 1.0453270389749956e-05, 'epoch': 0.5} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15024 total_samples=15223, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:08,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.64 | bwd_microstep: 1901.65 | bwd_inner_microstep: 1862.91 | 
bwd_allreduce_microstep: 38.67 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13602 total_samples=15227, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:10,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.92 | bwd_microstep: 1763.52 | bwd_inner_microstep: 1700.76 | bwd_allreduce_microstep: 62.69 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11562 total_samples=15230, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:13,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.71 | bwd_microstep: 1788.26 | bwd_inner_microstep: 1564.57 | bwd_allreduce_microstep: 223.61 | step_microstep: 0.13 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12114 total_samples=15234, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:16,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88 [2025-08-03 04:48:16,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.84 | bwd_microstep: 2004.52 | bwd_inner_microstep: 1622.16 | bwd_allreduce_microstep: 382.30 | step_microstep: 111.21 [2025-08-03 04:48:16,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.03 | bwd: 7457.99 | bwd_inner: 6750.39 | bwd_allreduce: 707.36 | step: 111.71 {'loss': 0.7616, 'learning_rate': 1.0437092669869025e-05, 'epoch': 0.5} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14875 total_samples=15238, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:19,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.95 | bwd_microstep: 1990.82 | bwd_inner_microstep: 1888.38 | bwd_allreduce_microstep: 102.38 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13334 total_samples=15242, 
num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:21,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.20 | bwd_microstep: 1686.06 | bwd_inner_microstep: 1647.06 | bwd_allreduce_microstep: 38.94 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12774 total_samples=15245, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:24,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.55 | bwd_microstep: 1955.05 | bwd_inner_microstep: 1791.23 | bwd_allreduce_microstep: 163.74 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12502 total_samples=15248, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:27,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43 [2025-08-03 04:48:27,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 874.34 | bwd_microstep: 1779.61 | bwd_inner_microstep: 1580.97 | bwd_allreduce_microstep: 198.57 | step_microstep: 117.13 [2025-08-03 04:48:27,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2968.98 | bwd: 7411.59 | bwd_inner: 6907.64 | bwd_allreduce: 503.70 | step: 117.60 {'loss': 0.7575, 'learning_rate': 1.0420913803763522e-05, 'epoch': 0.5} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12710 total_samples=15252, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:29,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.88 | bwd_microstep: 1869.15 | bwd_inner_microstep: 1658.77 | bwd_allreduce_microstep: 210.32 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14002 total_samples=15256, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:32,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.62 | 
bwd_microstep: 1774.63 | bwd_inner_microstep: 1723.30 | bwd_allreduce_microstep: 51.26 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11957 total_samples=15259, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:35,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.17 | bwd_microstep: 1981.25 | bwd_inner_microstep: 1586.29 | bwd_allreduce_microstep: 394.89 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13962 total_samples=15263, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:38,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91 [2025-08-03 04:48:38,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.36 | bwd_microstep: 2123.32 | bwd_inner_microstep: 1819.79 | bwd_allreduce_microstep: 303.47 | step_microstep: 112.68 [2025-08-03 04:48:38,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.96 | bwd: 7748.40 | bwd_inner: 6788.14 | bwd_allreduce: 960.02 | step: 113.14 {'loss': 0.7425, 'learning_rate': 1.0404733833860639e-05, 'epoch': 0.5} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13580 total_samples=15267, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:40,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.36 | bwd_microstep: 1799.90 | bwd_inner_microstep: 1700.24 | bwd_allreduce_microstep: 99.58 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13383 total_samples=15271, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:43,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.60 | bwd_microstep: 1717.24 | bwd_inner_microstep: 1659.19 | bwd_allreduce_microstep: 57.97 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, 
dynamic token length: 11933 total_samples=15274, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:45,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.33 | bwd_microstep: 1808.92 | bwd_inner_microstep: 1584.61 | bwd_allreduce_microstep: 224.25 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13250 total_samples=15278, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:48,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24 [2025-08-03 04:48:48,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.58 | bwd_microstep: 2098.62 | bwd_inner_microstep: 1667.52 | bwd_allreduce_microstep: 431.04 | step_microstep: 123.67 [2025-08-03 04:48:48,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.79 | bwd: 7424.73 | bwd_inner: 6611.56 | bwd_allreduce: 812.92 | step: 124.17 50%|█████ | 1001/2000 [3:05:09<3:50:21, 13.84s/it] 50%|█████ | 1001/2000 [3:05:10<3:50:21, 13.84s/it] 50%|█████ | 1002/2000 [3:05:20<3:33:58, 12.86s/it] 50%|█████ | 1003/2000 [3:05:31<3:23:06, 12.22s/it] 50%|█████ | 1004/2000 [3:05:42<3:15:49, 11.80s/it] 50%|█████ | 1005/2000 [3:05:53<3:11:43, 11.56s/it] 50%|█████ | 1006/2000 [3:06:03<3:06:57, 11.29s/it] {'loss': 0.742, 'learning_rate': 1.0388552802590461e-05, 'epoch': 0.5} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468 total_samples=15282, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:51,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.41 | bwd_microstep: 1876.99 | bwd_inner_microstep: 1796.26 | bwd_allreduce_microstep: 80.65 | step_microstep: 0.17 dynamic ViT batch size: 48, 
images per sample: 48.0, dynamic token length: 13858 total_samples=15287, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:54,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.70 | bwd_microstep: 1726.62 | bwd_inner_microstep: 1683.35 | bwd_allreduce_microstep: 43.21 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13915 total_samples=15291, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:56,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.47 | bwd_microstep: 2056.81 | bwd_inner_microstep: 2050.72 | bwd_allreduce_microstep: 6.03 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15045 total_samples=15295, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:48:59,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18 [2025-08-03 04:48:59,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.31 | bwd_microstep: 1808.56 | bwd_inner_microstep: 1764.28 | bwd_allreduce_microstep: 44.21 | step_microstep: 120.28 [2025-08-03 04:48:59,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.82 | bwd: 7469.03 | bwd_inner: 7294.60 | bwd_allreduce: 174.19 | step: 120.76 {'loss': 0.76, 'learning_rate': 1.0372370752385854e-05, 'epoch': 0.5} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13620 total_samples=15299, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:02,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.46 | bwd_microstep: 1840.60 | bwd_inner_microstep: 1709.75 | bwd_allreduce_microstep: 130.78 | step_microstep: 0.12 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13082 total_samples=15303, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:04,895] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.94 | bwd_microstep: 1845.84 | bwd_inner_microstep: 1652.52 | bwd_allreduce_microstep: 193.25 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13300 total_samples=15307, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:07,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.85 | bwd_microstep: 2009.51 | bwd_inner_microstep: 1864.43 | bwd_allreduce_microstep: 145.01 | step_microstep: 0.21 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11840 total_samples=15310, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:10,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48 [2025-08-03 04:49:10,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.26 | bwd_microstep: 1850.05 | bwd_inner_microstep: 1602.55 | bwd_allreduce_microstep: 247.43 | step_microstep: 123.44 [2025-08-03 04:49:10,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2849.44 | bwd: 7546.04 | bwd_inner: 6829.24 | bwd_allreduce: 716.56 | step: 123.91 {'loss': 0.753, 'learning_rate': 1.0356187725682359e-05, 'epoch': 0.5} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13969 total_samples=15314, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:13,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.23 | bwd_microstep: 1932.33 | bwd_inner_microstep: 1727.81 | bwd_allreduce_microstep: 204.46 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13123 total_samples=15318, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:15,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.95 | bwd_microstep: 1783.92 | bwd_inner_microstep: 1682.76 | bwd_allreduce_microstep: 101.09 | 
step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13554 total_samples=15323, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:18,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.93 | bwd_microstep: 1980.93 | bwd_inner_microstep: 1692.82 | bwd_allreduce_microstep: 288.04 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13475 total_samples=15327, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:21,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38 [2025-08-03 04:49:21,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.27 | bwd_microstep: 1845.86 | bwd_inner_microstep: 1797.20 | bwd_allreduce_microstep: 48.59 | step_microstep: 134.05 [2025-08-03 04:49:21,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.31 | bwd: 7543.09 | bwd_inner: 6900.58 | bwd_allreduce: 642.27 | step: 134.40 {'loss': 0.7435, 'learning_rate': 1.0340003764918078e-05, 'epoch': 0.5} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13300 total_samples=15331, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:24,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.04 | bwd_microstep: 2125.11 | bwd_inner_microstep: 1999.06 | bwd_allreduce_microstep: 125.99 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16000 total_samples=15337, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:26,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.26 | bwd_microstep: 1799.73 | bwd_inner_microstep: 1785.03 | bwd_allreduce_microstep: 14.64 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12283 total_samples=15340, num_samples=3, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 04:49:29,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.97 | bwd_microstep: 1759.30 | bwd_inner_microstep: 1579.64 | bwd_allreduce_microstep: 179.60 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13358 total_samples=15344, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:32,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53 [2025-08-03 04:49:32,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.32 | bwd_microstep: 1955.30 | bwd_inner_microstep: 1770.09 | bwd_allreduce_microstep: 185.14 | step_microstep: 153.62 [2025-08-03 04:49:32,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.52 | bwd: 7639.49 | bwd_inner: 7133.82 | bwd_allreduce: 505.44 | step: 154.07 {'loss': 0.7645, 'learning_rate': 1.0323818912533561e-05, 'epoch': 0.51} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13778 total_samples=15348, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:34,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.39 | bwd_microstep: 1876.65 | bwd_inner_microstep: 1836.50 | bwd_allreduce_microstep: 40.09 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12154 total_samples=15351, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:37,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.14 | bwd_microstep: 1926.54 | bwd_inner_microstep: 1736.23 | bwd_allreduce_microstep: 190.24 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14344 total_samples=15355, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:40,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.48 | bwd_microstep: 1798.71 | 
bwd_inner_microstep: 1731.12 | bwd_allreduce_microstep: 67.53 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13547 total_samples=15359, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:43,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00 [2025-08-03 04:49:43,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.90 | bwd_microstep: 2135.72 | bwd_inner_microstep: 1930.10 | bwd_allreduce_microstep: 205.56 | step_microstep: 107.37 [2025-08-03 04:49:43,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.85 | bwd: 7737.68 | bwd_inner: 7233.95 | bwd_allreduce: 503.49 | step: 107.85 {'loss': 0.7422, 'learning_rate': 1.0307633210971697e-05, 'epoch': 0.51} 50%|█████ | 1006/2000 [3:06:03<3:06:57, 11.29s/it] 50%|█████ | 1007/2000 [3:06:14<3:03:55, 11.11s/it] 50%|█████ | 1008/2000 [3:06:25<3:02:21, 11.03s/it] 50%|█████ | 1009/2000 [3:06:36<3:00:49, 10.95s/it] 50%|█████ | 1010/2000 [3:06:47<3:00:39, 10.95s/it] 51%|█████ | 1011/2000 [3:06:57<3:00:32, 10.95s/it] dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14663 total_samples=15364, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:45,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.32 | bwd_microstep: 1762.46 | bwd_inner_microstep: 1712.34 | bwd_allreduce_microstep: 50.05 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14115 total_samples=15368, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:48,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.21 | bwd_microstep: 1760.73 | 
bwd_inner_microstep: 1711.14 | bwd_allreduce_microstep: 49.53 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13247 total_samples=15372, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:50,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.69 | bwd_microstep: 1780.02 | bwd_inner_microstep: 1680.35 | bwd_allreduce_microstep: 99.61 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13328 total_samples=15376, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:53,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86 [2025-08-03 04:49:53,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.80 | bwd_microstep: 1981.94 | bwd_inner_microstep: 1905.26 | bwd_allreduce_microstep: 76.61 | step_microstep: 108.48 [2025-08-03 04:49:53,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.96 | bwd: 7285.20 | bwd_inner: 7009.09 | bwd_allreduce: 275.88 | step: 108.81 {'loss': 0.761, 'learning_rate': 1.0291446702677598e-05, 'epoch': 0.51} dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13211 total_samples=15381, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:56,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 988.57 | bwd_microstep: 1750.62 | bwd_inner_microstep: 1652.88 | bwd_allreduce_microstep: 97.67 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13949 total_samples=15385, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:49:59,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.10 | bwd_microstep: 1842.41 | bwd_inner_microstep: 1737.08 | bwd_allreduce_microstep: 105.27 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 
14635 total_samples=15389, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:01,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.77 | bwd_microstep: 2160.36 | bwd_inner_microstep: 2131.94 | bwd_allreduce_microstep: 28.34 | step_microstep: 0.13 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15089 total_samples=15394, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:04,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32 [2025-08-03 04:50:04,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.00 | bwd_microstep: 1799.71 | bwd_inner_microstep: 1737.48 | bwd_allreduce_microstep: 62.16 | step_microstep: 120.41 [2025-08-03 04:50:04,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3121.36 | bwd: 7553.16 | bwd_inner: 7259.38 | bwd_allreduce: 293.52 | step: 120.79 {'loss': 0.7565, 'learning_rate': 1.0275259430098502e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11576 total_samples=15397, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:07,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.19 | bwd_microstep: 1991.54 | bwd_inner_microstep: 1800.59 | bwd_allreduce_microstep: 190.88 | step_microstep: 0.19 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13556 total_samples=15401, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:10,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.32 | bwd_microstep: 1814.43 | bwd_inner_microstep: 1807.40 | bwd_allreduce_microstep: 6.97 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13571 total_samples=15405, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:12,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 705.23 | bwd_microstep: 1745.55 | bwd_inner_microstep: 1694.18 | bwd_allreduce_microstep: 51.31 | step_microstep: 0.11 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13111 total_samples=15409, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:15,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16 [2025-08-03 04:50:15,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.37 | bwd_microstep: 1875.85 | bwd_inner_microstep: 1665.06 | bwd_allreduce_microstep: 210.73 | step_microstep: 109.62 [2025-08-03 04:50:15,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.05 | bwd: 7427.42 | bwd_inner: 6967.22 | bwd_allreduce: 459.97 | step: 110.04 {'loss': 0.7556, 'learning_rate': 1.0259071435683636e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12071 total_samples=15412, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:18,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.36 | bwd_microstep: 2026.84 | bwd_inner_microstep: 1842.31 | bwd_allreduce_microstep: 184.46 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13590 total_samples=15416, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:20,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.89 | bwd_microstep: 1774.96 | bwd_inner_microstep: 1714.01 | bwd_allreduce_microstep: 60.88 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13543 total_samples=15420, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:23,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.94 | bwd_microstep: 2060.64 | bwd_inner_microstep: 1844.47 | bwd_allreduce_microstep: 216.10 | step_microstep: 0.22 dynamic ViT batch size: 42, 
images per sample: 42.0, dynamic token length: 11685 total_samples=15423, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:26,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02 [2025-08-03 04:50:26,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.17 | bwd_microstep: 2129.91 | bwd_inner_microstep: 1652.80 | bwd_allreduce_microstep: 477.02 | step_microstep: 125.73 [2025-08-03 04:50:26,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2873.29 | bwd: 7992.40 | bwd_inner: 7053.61 | bwd_allreduce: 938.52 | step: 126.18 {'loss': 0.7609, 'learning_rate': 1.0242882761884132e-05, 'epoch': 0.51} dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 14645 total_samples=15427, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:29,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.89 | bwd_microstep: 1758.04 | bwd_inner_microstep: 1692.06 | bwd_allreduce_microstep: 65.92 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14576 total_samples=15432, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:31,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.47 | bwd_microstep: 2073.79 | bwd_inner_microstep: 1763.18 | bwd_allreduce_microstep: 310.54 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15005 total_samples=15437, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:34,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.87 | bwd_microstep: 2063.22 | bwd_inner_microstep: 1930.12 | bwd_allreduce_microstep: 133.04 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13661 total_samples=15441, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:37,666] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85 [2025-08-03 04:50:37,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.61 | bwd_microstep: 1971.30 | bwd_inner_microstep: 1746.24 | bwd_allreduce_microstep: 225.00 | step_microstep: 133.20 [2025-08-03 04:50:37,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.78 | bwd: 7866.41 | bwd_inner: 7131.60 | bwd_allreduce: 734.57 | step: 133.54 {'loss': 0.7602, 'learning_rate': 1.02266934511529e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11627 total_samples=15444, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:40,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.95 | bwd_microstep: 1853.60 | bwd_inner_microstep: 1582.37 | bwd_allreduce_microstep: 271.17 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13324 total_samples=15448, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:42,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.59 | bwd_microstep: 1790.44 | bwd_inner_microstep: 1708.97 | bwd_allreduce_microstep: 81.41 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515 total_samples=15452, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:45,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.42 | bwd_microstep: 1721.37 | bwd_inner_microstep: 1679.54 | bwd_allreduce_microstep: 41.77 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11720 total_samples=15455, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:48,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26 [2025-08-03 04:50:48,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 697.67 | bwd_microstep: 1753.59 | bwd_inner_microstep: 1550.55 | bwd_allreduce_microstep: 202.98 | step_microstep: 114.17 [2025-08-03 04:50:48,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.55 | bwd: 7119.05 | bwd_inner: 6521.42 | bwd_allreduce: 597.41 | step: 114.51 00 [3:06:57<3:00:32, 10.95s/it] 51%|█████ | 1012/2000 [3:07:08<2:57:50, 10.80s/it] 51%|█████ | 1012/2000 [3:07:08<2:57:50, 10.80s/it] 51%|█████ | 1013/2000 [3:07:19<2:59:06, 10.89s/it] 51%|█████ | 1013/2000 [3:07:19<2:59:06, 10.89s/it] 51%|█████ | 1014/2000 [3:07:30<2:57:43, 10.82s/it] 51%|█████ | 1014/2000 [3:07:30<2:57:43, 10.82s/it] 51%|█████ | 1015/2000 [3:07:41<2:59:43, 10.95s/it] 51%|█████ | 1015/2000 [3:07:41<2:59:43, 10.95s/it] 51%|█████ | 1016/2000 [3:07:52<3:00:20, 11.00s/it] 51%|█████ | 1016/2000 [3:07:52<3:00:20, 11.00s/it] 51%|█████ | 1017/2000 [3:08:02<2{'loss': 0.7474, 'learning_rate': 1.0210503545944522e-05, 'epoch': 0.51} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13393 total_samples=15459, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:50,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.95 | bwd_microstep: 2017.09 | bwd_inner_microstep: 1861.89 | bwd_allreduce_microstep: 155.13 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13953 total_samples=15463, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:53,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.28 | bwd_microstep: 1843.49 | bwd_inner_microstep: 1726.33 | bwd_allreduce_microstep: 117.10 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13450 total_samples=15467, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:56,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.05 | bwd_microstep: 1997.57 | 
bwd_inner_microstep: 1731.05 | bwd_allreduce_microstep: 266.46 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13378 total_samples=15471, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:50:59,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27 [2025-08-03 04:50:59,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.68 | bwd_microstep: 1945.93 | bwd_inner_microstep: 1849.30 | bwd_allreduce_microstep: 96.55 | step_microstep: 121.67 [2025-08-03 04:50:59,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.88 | bwd: 7804.13 | bwd_inner: 7168.57 | bwd_allreduce: 635.32 | step: 122.13 {'loss': 0.7487, 'learning_rate': 1.0194313088715135e-05, 'epoch': 0.51} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13565 total_samples=15475, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:01,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.24 | bwd_microstep: 2020.27 | bwd_inner_microstep: 1689.06 | bwd_allreduce_microstep: 331.15 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13618 total_samples=15479, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:04,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.05 | bwd_microstep: 1793.39 | bwd_inner_microstep: 1711.01 | bwd_allreduce_microstep: 82.32 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16371 total_samples=15483, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:07,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.30 | bwd_microstep: 2047.30 | bwd_inner_microstep: 1974.26 | bwd_allreduce_microstep: 72.98 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 
13517 total_samples=15487, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:10,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06 [2025-08-03 04:51:10,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.32 | bwd_microstep: 2008.96 | bwd_inner_microstep: 1877.02 | bwd_allreduce_microstep: 131.87 | step_microstep: 122.71 [2025-08-03 04:51:10,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.83 | bwd: 7869.98 | bwd_inner: 7251.34 | bwd_allreduce: 618.40 | step: 123.04 {'loss': 0.7677, 'learning_rate': 1.0178122121922324e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11916 total_samples=15490, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:12,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.30 | bwd_microstep: 1861.33 | bwd_inner_microstep: 1549.59 | bwd_allreduce_microstep: 311.69 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12978 total_samples=15494, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:15,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.21 | bwd_microstep: 1834.33 | bwd_inner_microstep: 1698.10 | bwd_allreduce_microstep: 136.16 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13667 total_samples=15498, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:18,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.15 | bwd_microstep: 1782.43 | bwd_inner_microstep: 1705.42 | bwd_allreduce_microstep: 76.95 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13054 total_samples=15502, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:21,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| optimizer_step: 14.38 [2025-08-03 04:51:21,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.94 | bwd_microstep: 2336.02 | bwd_inner_microstep: 2168.68 | bwd_allreduce_microstep: 167.27 | step_microstep: 113.17 [2025-08-03 04:51:21,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.52 | bwd: 7814.16 | bwd_inner: 7121.78 | bwd_allreduce: 692.15 | step: 113.51 {'loss': 0.74, 'learning_rate': 1.0161930688025018e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11660 total_samples=15505, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:24,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.75 | bwd_microstep: 2022.65 | bwd_inner_microstep: 1861.96 | bwd_allreduce_microstep: 160.63 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11773 total_samples=15508, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:26,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.86 | bwd_microstep: 1925.97 | bwd_inner_microstep: 1788.96 | bwd_allreduce_microstep: 136.96 | step_microstep: 0.20 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15763 total_samples=15513, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:29,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.10 | bwd_microstep: 1955.12 | bwd_inner_microstep: 1781.95 | bwd_allreduce_microstep: 173.07 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12984 total_samples=15517, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:32,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89 [2025-08-03 04:51:32,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.82 | bwd_microstep: 1786.74 | 
bwd_inner_microstep: 1670.59 | bwd_allreduce_microstep: 116.07 | step_microstep: 143.15 [2025-08-03 04:51:32,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.45 | bwd: 7690.51 | bwd_inner: 7103.44 | bwd_allreduce: 586.81 | step: 143.57 {'loss': 0.7453, 'learning_rate': 1.0145738829483354e-05, 'epoch': 0.51} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13165 total_samples=15521, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:34,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.75 | bwd_microstep: 1870.54 | bwd_inner_microstep: 1677.42 | bwd_allreduce_microstep: 193.06 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13456 total_samples=15525, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:37,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.54 | bwd_microstep: 1700.96 | bwd_inner_microstep: 1667.71 | bwd_allreduce_microstep: 33.19 | step_microstep: 0.10 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12850 total_samples=15529, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:40,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.68 | bwd_microstep: 2036.50 | bwd_inner_microstep: 1862.03 | bwd_allreduce_microstep: 174.40 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12797 total_samples=15533, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:43,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85 [2025-08-03 04:51:43,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.69 | bwd_microstep: 2032.27 | bwd_inner_microstep: 1786.78 | bwd_allreduce_microstep: 245.43 | step_microstep: 133.58 [2025-08-03 04:51:43,028] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd: 2784.58 | bwd: 7640.32 | bwd_inner: 6993.93 | bwd_allreduce: 646.16 | step: 133.91 {'loss': 0.7384, 'learning_rate': 1.0129546588758605e-05, 'epoch': 0.51} :57:00, 10.80s/it] 51%|█████ | 1017/2000 [3:08:02<2:57:00, 10.80s/it] 51%|█████ | 1018/2000 [3:08:13<2:58:02, 10.88s/it] 51%|█████ | 1018/2000 [3:08:13<2:58:02, 10.88s/it] 51%|█████ | 1019/2000 [3:08:25<2:59:13, 10.96s/it] 51%|█████ | 1019/2000 [3:08:25<2:59:13, 10.96s/it] 51%|█████ | 1020/2000 [3:08:36<2:59:16, 10.98s/it] 51%|█████ | 1020/2000 [3:08:36<2:59:16, 10.98s/it] 51%|█████ | 1021/2000 [3:08:47<2:58:53, 10.96s/it] 51%|█████ | 1021/2000 [3:08:47<2:58:53, 10.96s/it] 51%|█████ | 1022/2000 [3:08:57<2:58:09, 10.93s/it] 51dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14088 total_samples=15537, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:45,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.23 | bwd_microstep: 1766.19 | bwd_inner_microstep: 1704.71 | bwd_allreduce_microstep: 61.41 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11982 total_samples=15540, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:48,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.32 | bwd_microstep: 1861.75 | bwd_inner_microstep: 1612.92 | bwd_allreduce_microstep: 248.78 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14095 total_samples=15544, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:50,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.90 | bwd_microstep: 1832.60 | bwd_inner_microstep: 1756.48 | bwd_allreduce_microstep: 76.06 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13265 total_samples=15548, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
04:51:53,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03 [2025-08-03 04:51:53,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.01 | bwd_microstep: 2215.00 | bwd_inner_microstep: 1884.32 | bwd_allreduce_microstep: 330.62 | step_microstep: 109.89 [2025-08-03 04:51:53,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2870.39 | bwd: 7675.59 | bwd_inner: 6958.42 | bwd_allreduce: 716.95 | step: 110.24 {'loss': 0.7578, 'learning_rate': 1.0113354008313025e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770 total_samples=15551, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:56,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.98 | bwd_microstep: 1992.75 | bwd_inner_microstep: 1526.89 | bwd_allreduce_microstep: 465.79 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13285 total_samples=15555, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:51:59,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.54 | bwd_microstep: 2058.83 | bwd_inner_microstep: 1905.11 | bwd_allreduce_microstep: 153.66 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12417 total_samples=15558, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:02,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.32 | bwd_microstep: 1805.71 | bwd_inner_microstep: 1592.47 | bwd_allreduce_microstep: 213.17 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11586 total_samples=15561, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:04,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93 [2025-08-03 04:52:04,786] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 691.70 | bwd_microstep: 1759.81 | bwd_inner_microstep: 1541.05 | bwd_allreduce_microstep: 218.70 | step_microstep: 121.87 [2025-08-03 04:52:04,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.46 | bwd: 7617.15 | bwd_inner: 6565.51 | bwd_allreduce: 1051.40 | step: 122.31 {'loss': 0.7455, 'learning_rate': 1.0097161130609774e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11870 total_samples=15564, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:07,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.89 | bwd_microstep: 2178.47 | bwd_inner_microstep: 1951.47 | bwd_allreduce_microstep: 226.93 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13961 total_samples=15568, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:10,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.60 | bwd_microstep: 1998.48 | bwd_inner_microstep: 1860.09 | bwd_allreduce_microstep: 138.32 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14636 total_samples=15572, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:13,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 1742.55 | bwd_inner_microstep: 1712.81 | bwd_allreduce_microstep: 29.68 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11705 total_samples=15575, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:15,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07 [2025-08-03 04:52:15,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.93 | bwd_microstep: 1831.22 | bwd_inner_microstep: 1568.86 | bwd_allreduce_microstep: 262.30 | step_microstep: 157.42 [2025-08-03 
04:52:15,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.42 | bwd: 7750.77 | bwd_inner: 7093.22 | bwd_allreduce: 657.31 | step: 157.88 {'loss': 0.7531, 'learning_rate': 1.0080967998112787e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11852 total_samples=15578, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:18,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.09 | bwd_microstep: 1724.75 | bwd_inner_microstep: 1539.98 | bwd_allreduce_microstep: 184.71 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13633 total_samples=15582, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:20,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.09 | bwd_microstep: 1788.25 | bwd_inner_microstep: 1698.27 | bwd_allreduce_microstep: 89.92 | step_microstep: 0.16 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12693 total_samples=15586, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:23,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.41 | bwd_microstep: 2156.85 | bwd_inner_microstep: 1985.98 | bwd_allreduce_microstep: 170.81 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13343 total_samples=15590, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:26,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47 [2025-08-03 04:52:26,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.87 | bwd_microstep: 2113.43 | bwd_inner_microstep: 1961.38 | bwd_allreduce_microstep: 151.99 | step_microstep: 116.16 [2025-08-03 04:52:26,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.39 | bwd: 7783.34 | bwd_inner: 7185.61 | bwd_allreduce: 597.50 | step: 116.53 
{'loss': 0.751, 'learning_rate': 1.0064774653286662e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12187 total_samples=15593, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:29,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.05 | bwd_microstep: 1935.18 | bwd_inner_microstep: 1766.26 | bwd_allreduce_microstep: 168.86 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13291 total_samples=15597, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:32,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.94 | bwd_microstep: 1775.13 | bwd_inner_microstep: 1689.92 | bwd_allreduce_microstep: 85.15 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11763 total_samples=15600, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:34,960] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.77 | bwd_microstep: 2101.87 | bwd_inner_microstep: 1861.70 | bwd_allreduce_microstep: 240.10 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13851 total_samples=15604, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:37,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16 [2025-08-03 04:52:37,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.78 | bwd_microstep: 1981.46 | bwd_inner_microstep: 1973.23 | bwd_allreduce_microstep: 8.16 | step_microstep: 143.65 [2025-08-03 04:52:37,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.46 | bwd: 7793.68 | bwd_inner: 7291.11 | bwd_allreduce: 502.36 | step: 143.98 {'loss': 0.7441, 'learning_rate': 1.0048581138596563e-05, 'epoch': 0.51} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11927 
total_samples=15607, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:40,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.39 | bwd_microstep: 2033.30 | bwd_inner_microstep: 1802.10 | bwd_allreduce_microstep: 231.13 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13841 total_samples=15611, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:43,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.62 | bwd_microstep: 1786.51 | bwd_inner_microstep: 1714.47 | bwd_allreduce_microstep: 71.98 | step_microstep: 0.12 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13508 total_samples=15615, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:45,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.46 | bwd_microstep: 1837.72 | bwd_inner_microstep: 1697.52 | bwd_allreduce_microstep: 140.14 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14524 total_samples=15619, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:48,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93 [2025-08-03 04:52:48,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.08 | bwd_microstep: 1891.80 | bwd_inner_microstep: 1871.50 | bwd_allreduce_microstep: 20.23 | step_microstep: 140.15 [2025-08-03 04:52:48,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.47 | bwd: 7549.37 | bwd_inner: 7085.59 | bwd_allreduce: 463.54 | step: 140.49 %|█████ | 1022/2000 [3:08:57<2:58:09, 10.93s/it] 51%|█████ | 1023/2000 [3:09:08<2:58:03, 10.93s/it] 51%|█████ | 1023/2000 [3:09:08<2:58:03, 10.93s/it] 51%|█████ | 1024/2000 [3:09:19<2:57:16, 10.90s/it] 51%|█████ | 1024/2000 [3:09:19<2:57:16, 10.90s/it] 51%|█████▏ | 1025/2000 [3:09:30<2:57:52, 10.95s/it] 51%|█████▏ 
| 1025/2000 [3:09:30<2:57:52, 10.95s/it] 51%|█████▏ | 1026/2000 [3:09:41<2:57:55, 10.96s/it] 51%|█████▏ | 1026/2000 [3:09:41<2:57:55, 10.96s/it] 51%|█████▏ | 1027/2000 [3:09:52<2:58:02, 10.98s/it] 51%|█████▏ | 1027/2000 [3:09:52<2:58:02, 10.98s/it] 51%{'loss': 0.7585, 'learning_rate': 1.003238749650809e-05, 'epoch': 0.51} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13487 total_samples=15623, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:51,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.59 | bwd_microstep: 2015.62 | bwd_inner_microstep: 1737.57 | bwd_allreduce_microstep: 277.99 | step_microstep: 0.20 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13387 total_samples=15627, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:54,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.41 | bwd_microstep: 2097.95 | bwd_inner_microstep: 1946.58 | bwd_allreduce_microstep: 151.30 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13894 total_samples=15631, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:56,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.30 | bwd_microstep: 1851.25 | bwd_inner_microstep: 1703.84 | bwd_allreduce_microstep: 147.36 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13876 total_samples=15635, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:52:59,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09 [2025-08-03 04:52:59,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.25 | bwd_microstep: 1990.95 | bwd_inner_microstep: 1884.21 | bwd_allreduce_microstep: 106.68 | step_microstep: 114.16 [2025-08-03 04:52:59,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd: 2789.48 | bwd: 7955.81 | bwd_inner: 7272.19 | bwd_allreduce: 683.40 | step: 114.58 {'loss': 0.7525, 'learning_rate': 1.001619376948718e-05, 'epoch': 0.51} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13495 total_samples=15639, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:02,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.97 | bwd_microstep: 1764.27 | bwd_inner_microstep: 1656.53 | bwd_allreduce_microstep: 107.67 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14686 total_samples=15643, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:04,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.53 | bwd_microstep: 1845.41 | bwd_inner_microstep: 1786.23 | bwd_allreduce_microstep: 59.12 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13230 total_samples=15647, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:07,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.86 | bwd_microstep: 1975.61 | bwd_inner_microstep: 1820.35 | bwd_allreduce_microstep: 155.18 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13199 total_samples=15651, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:10,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97 [2025-08-03 04:53:10,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.02 | bwd_microstep: 1823.08 | bwd_inner_microstep: 1711.22 | bwd_allreduce_microstep: 111.79 | step_microstep: 124.54 [2025-08-03 04:53:10,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.32 | bwd: 7408.42 | bwd_inner: 6974.31 | bwd_allreduce: 433.85 | step: 124.86 {'loss': 0.757, 'learning_rate': 1e-05, 'epoch': 0.52} dynamic ViT batch 
size: 42, images per sample: 42.0, dynamic token length: 12396 total_samples=15654, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:13,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.77 | bwd_microstep: 1834.47 | bwd_inner_microstep: 1597.13 | bwd_allreduce_microstep: 237.28 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14345 total_samples=15658, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:15,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.18 | bwd_microstep: 1785.90 | bwd_inner_microstep: 1735.78 | bwd_allreduce_microstep: 50.05 | step_microstep: 0.22 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12926 total_samples=15662, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:18,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.52 | bwd_microstep: 2165.74 | bwd_inner_microstep: 1991.55 | bwd_allreduce_microstep: 174.13 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12515 total_samples=15665, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:21,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12 [2025-08-03 04:53:21,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.67 | bwd_microstep: 1984.82 | bwd_inner_microstep: 1766.05 | bwd_allreduce_microstep: 218.71 | step_microstep: 398.15 [2025-08-03 04:53:21,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.08 | bwd: 7770.99 | bwd_inner: 7090.52 | bwd_allreduce: 680.25 | step: 398.60 {'loss': 0.7407, 'learning_rate': 9.98380623051282e-06, 'epoch': 0.52} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13144 total_samples=15669, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
04:53:24,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.24 | bwd_microstep: 1954.14 | bwd_inner_microstep: 1703.12 | bwd_allreduce_microstep: 250.95 | step_microstep: 0.12 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13599 total_samples=15673, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:26,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.90 | bwd_microstep: 1720.36 | bwd_inner_microstep: 1633.05 | bwd_allreduce_microstep: 87.24 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11694 total_samples=15676, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:29,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.79 | bwd_microstep: 2111.22 | bwd_inner_microstep: 1870.18 | bwd_allreduce_microstep: 240.97 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14567 total_samples=15680, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:32,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14 [2025-08-03 04:53:32,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.84 | bwd_microstep: 1853.61 | bwd_inner_microstep: 1765.84 | bwd_allreduce_microstep: 87.71 | step_microstep: 137.06 [2025-08-03 04:53:32,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.71 | bwd: 7639.37 | bwd_inner: 6972.19 | bwd_allreduce: 666.96 | step: 137.39 {'loss': 0.7502, 'learning_rate': 9.967612503491915e-06, 'epoch': 0.52} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11862 total_samples=15683, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:35,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.41 | bwd_microstep: 2027.93 | bwd_inner_microstep: 1803.43 | 
bwd_allreduce_microstep: 224.43 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11898 total_samples=15686, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:38,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.81 | bwd_microstep: 1847.71 | bwd_inner_microstep: 1598.81 | bwd_allreduce_microstep: 248.84 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14538 total_samples=15690, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:40,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.44 | bwd_microstep: 2009.07 | bwd_inner_microstep: 1778.30 | bwd_allreduce_microstep: 230.71 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12891 total_samples=15694, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:43,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.76 [2025-08-03 04:53:43,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.56 | bwd_microstep: 2121.99 | bwd_inner_microstep: 1968.12 | bwd_allreduce_microstep: 153.82 | step_microstep: 106.55 [2025-08-03 04:53:43,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2888.14 | bwd: 8006.75 | bwd_inner: 7148.65 | bwd_allreduce: 857.88 | step: 106.86 51%|█████▏ | 1028/2000 [3:10:03<2:56:54, 10.92s/it] 51%|█████▏ | 1029/2000 [3:10:14<2:57:58, 11.00s/it] 52%|█████▏ | 1030/2000 [3:10:25<2:56:04, 10.89s/it] 52%|█████▏ | 1031/2000 [3:10:36<2:57:50, 11.01s/it] 52%|█████▏ | 1032/2000 [3:10:47<2:57:08, 10.98s/it]
{'loss': 0.7549, 'learning_rate': 9.95141886140344e-06, 'epoch': 0.52} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13661 total_samples=15698, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:46,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.68 | bwd_microstep: 2143.50 | bwd_inner_microstep: 1878.16 | bwd_allreduce_microstep: 265.27 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11631 total_samples=15701, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:49,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.12 | bwd_microstep: 1710.16 | bwd_inner_microstep: 1674.84 | bwd_allreduce_microstep: 35.26 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14562 total_samples=15705, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:51,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.84 | bwd_microstep: 1713.51 | bwd_inner_microstep: 1698.73 | bwd_allreduce_microstep: 14.72 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11651 total_samples=15708, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:54,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80 [2025-08-03 04:53:54,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.79 | bwd_microstep: 1986.17 | bwd_inner_microstep: 1811.87 | bwd_allreduce_microstep: 174.23 | step_microstep: 112.61 [2025-08-03 04:53:54,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.36 | bwd: 7553.38 | bwd_inner: 7063.60 | bwd_allreduce: 489.55 | step: 113.06 {'loss': 0.7625, 'learning_rate': 9.935225346713341e-06, 'epoch': 0.52} dynamic ViT batch size: 46, images per sample: 46.0, dynamic token
length: 13308 total_samples=15712, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:57,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.65 | bwd_microstep: 1741.32 | bwd_inner_microstep: 1651.69 | bwd_allreduce_microstep: 89.56 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11726 total_samples=15715, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:53:59,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.10 | bwd_microstep: 1802.66 | bwd_inner_microstep: 1570.67 | bwd_allreduce_microstep: 231.93 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11895 total_samples=15718, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:02,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.88 | bwd_microstep: 2084.77 | bwd_inner_microstep: 1859.17 | bwd_allreduce_microstep: 225.53 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14132 total_samples=15722, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:05,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37 [2025-08-03 04:54:05,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.77 | bwd_microstep: 2090.17 | bwd_inner_microstep: 1782.43 | bwd_allreduce_microstep: 307.67 | step_microstep: 126.38 [2025-08-03 04:54:05,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.33 | bwd: 7718.96 | bwd_inner: 6863.96 | bwd_allreduce: 854.77 | step: 126.71 {'loss': 0.7612, 'learning_rate': 9.919032001887215e-06, 'epoch': 0.52} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12372 total_samples=15726, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:08,258] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 723.23 | bwd_microstep: 1807.79 | bwd_inner_microstep: 1567.56 | bwd_allreduce_microstep: 240.18 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13148 total_samples=15730, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:11,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.11 | bwd_microstep: 2032.64 | bwd_inner_microstep: 1981.36 | bwd_allreduce_microstep: 51.22 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13972 total_samples=15734, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:13,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.80 | bwd_microstep: 1833.42 | bwd_inner_microstep: 1748.89 | bwd_allreduce_microstep: 84.46 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11576 total_samples=15737, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:16,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84 [2025-08-03 04:54:16,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.76 | bwd_microstep: 1978.95 | bwd_inner_microstep: 1536.04 | bwd_allreduce_microstep: 442.84 | step_microstep: 145.86 [2025-08-03 04:54:16,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.82 | bwd: 7652.85 | bwd_inner: 6833.85 | bwd_allreduce: 818.78 | step: 146.18 {'loss': 0.7537, 'learning_rate': 9.90283886939023e-06, 'epoch': 0.52} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11726 total_samples=15741, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:19,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.48 | bwd_microstep: 1852.84 | bwd_inner_microstep: 1608.15 | bwd_allreduce_microstep: 244.63 | step_microstep: 0.12 dynamic ViT batch 
size: 43, images per sample: 43.0, dynamic token length: 12854 total_samples=15745, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:22,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.98 | bwd_microstep: 2048.83 | bwd_inner_microstep: 1686.99 | bwd_allreduce_microstep: 361.79 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12045 total_samples=15748, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:24,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.99 | bwd_microstep: 1735.61 | bwd_inner_microstep: 1552.33 | bwd_allreduce_microstep: 183.22 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12540 total_samples=15751, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:27,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28 [2025-08-03 04:54:27,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.60 | bwd_microstep: 1946.97 | bwd_inner_microstep: 1815.26 | bwd_allreduce_microstep: 131.65 | step_microstep: 109.55 [2025-08-03 04:54:27,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.98 | bwd: 7584.31 | bwd_inner: 6662.74 | bwd_allreduce: 921.35 | step: 109.88 {'loss': 0.7554, 'learning_rate': 9.886645991686977e-06, 'epoch': 0.52} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14098 total_samples=15755, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:30,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.72 | bwd_microstep: 2637.81 | bwd_inner_microstep: 1946.53 | bwd_allreduce_microstep: 691.21 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13661 total_samples=15759, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
04:54:33,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.81 | bwd_microstep: 1826.11 | bwd_inner_microstep: 1729.98 | bwd_allreduce_microstep: 96.07 | step_microstep: 0.20 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14190 total_samples=15763, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:36,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.07 | bwd_microstep: 2003.69 | bwd_inner_microstep: 1761.17 | bwd_allreduce_microstep: 242.46 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11858 total_samples=15766, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:38,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79 [2025-08-03 04:54:38,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.13 | bwd_microstep: 1869.51 | bwd_inner_microstep: 1608.48 | bwd_allreduce_microstep: 260.97 | step_microstep: 114.45 [2025-08-03 04:54:38,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.66 | bwd: 8337.17 | bwd_inner: 7046.15 | bwd_allreduce: 1290.79 | step: 114.88 52%|█████▏ | 1033/2000 [3:10:58<2:58:20, 11.07s/it] 52%|█████▏ | 1034/2000 [3:11:09<2:56:35, 10.97s/it] 52%|█████▏ | 1035/2000 [3:11:20<2:56:28, 10.97s/it] 52%|█████▏ | 1036/2000 [3:11:31<2:55:59, 10.95s/it] 52%|█████▏ | 1037/2000 [3:11:42<2:55:05, 10.91s/it] 52%|█████▏ | 1038/2000 [3:11:53<2:58:00, 11.10s/it] {'loss': 0.7528, 'learning_rate': 9.870453411241399e-06, 'epoch': 0.52} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14236 total_samples=15770, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03
04:54:41,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.69 | bwd_microstep: 1781.98 | bwd_inner_microstep: 1742.63 | bwd_allreduce_microstep: 39.28 | step_microstep: 0.12 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12581 total_samples=15774, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:44,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 765.19 | bwd_microstep: 1993.37 | bwd_inner_microstep: 1816.84 | bwd_allreduce_microstep: 176.47 | step_microstep: 0.21 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13566 total_samples=15778, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:46,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.49 | bwd_microstep: 1758.56 | bwd_inner_microstep: 1687.88 | bwd_allreduce_microstep: 70.63 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12046 total_samples=15781, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:49,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57 [2025-08-03 04:54:49,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.31 | bwd_microstep: 1815.76 | bwd_inner_microstep: 1579.25 | bwd_allreduce_microstep: 236.44 | step_microstep: 146.04 [2025-08-03 04:54:49,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2904.61 | bwd: 7349.72 | bwd_inner: 6826.59 | bwd_allreduce: 522.89 | step: 146.47 {'loss': 0.7617, 'learning_rate': 9.854261170516648e-06, 'epoch': 0.52} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11601 total_samples=15784, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:52,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.29 | bwd_microstep: 2009.25 | bwd_inner_microstep: 1808.10 | 
bwd_allreduce_microstep: 201.08 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11714 total_samples=15787, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:55,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.94 | bwd_microstep: 1889.69 | bwd_inner_microstep: 1531.49 | bwd_allreduce_microstep: 358.14 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13980 total_samples=15791, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:54:57,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.42 | bwd_microstep: 1958.39 | bwd_inner_microstep: 1771.42 | bwd_allreduce_microstep: 186.91 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12510 total_samples=15794, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:00,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15 [2025-08-03 04:55:00,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.48 | bwd_microstep: 1841.82 | bwd_inner_microstep: 1596.50 | bwd_allreduce_microstep: 245.24 | step_microstep: 107.76 [2025-08-03 04:55:00,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.05 | bwd: 7699.20 | bwd_inner: 6707.51 | bwd_allreduce: 991.46 | step: 108.20 {'loss': 0.751, 'learning_rate': 9.838069311974986e-06, 'epoch': 0.52} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14616 total_samples=15798, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:03,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.58 | bwd_microstep: 1832.47 | bwd_inner_microstep: 1826.50 | bwd_allreduce_microstep: 5.92 | step_microstep: 0.19 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11485 total_samples=15801, 
num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:05,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.24 | bwd_microstep: 1710.74 | bwd_inner_microstep: 1525.49 | bwd_allreduce_microstep: 185.19 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13635 total_samples=15805, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:08,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.41 | bwd_microstep: 2195.76 | bwd_inner_microstep: 2051.72 | bwd_allreduce_microstep: 143.97 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14800 total_samples=15809, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:11,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00 [2025-08-03 04:55:11,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.61 | bwd_microstep: 1837.50 | bwd_inner_microstep: 1814.13 | bwd_allreduce_microstep: 23.31 | step_microstep: 140.74 [2025-08-03 04:55:11,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.78 | bwd: 7576.52 | bwd_inner: 7217.84 | bwd_allreduce: 358.46 | step: 141.27 {'loss': 0.7435, 'learning_rate': 9.821877878077678e-06, 'epoch': 0.52} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14435 total_samples=15813, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:14,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.32 | bwd_microstep: 2025.24 | bwd_inner_microstep: 1899.68 | bwd_allreduce_microstep: 125.49 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14995 total_samples=15817, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:16,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.16 | 
bwd_microstep: 2013.34 | bwd_inner_microstep: 1885.25 | bwd_allreduce_microstep: 128.03 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13492 total_samples=15821, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:19,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.33 | bwd_microstep: 1749.18 | bwd_inner_microstep: 1680.42 | bwd_allreduce_microstep: 68.69 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13589 total_samples=15825, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:22,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14 [2025-08-03 04:55:22,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.26 | bwd_microstep: 2092.37 | bwd_inner_microstep: 1948.11 | bwd_allreduce_microstep: 144.19 | step_microstep: 139.52 [2025-08-03 04:55:22,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.96 | bwd: 7880.18 | bwd_inner: 7413.45 | bwd_allreduce: 466.49 | step: 139.98 {'loss': 0.7529, 'learning_rate': 9.805686911284867e-06, 'epoch': 0.52} dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12423 total_samples=15829, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:25,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.63 | bwd_microstep: 2077.14 | bwd_inner_microstep: 1955.66 | bwd_allreduce_microstep: 121.42 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14980 total_samples=15833, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:27,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.10 | bwd_microstep: 1747.26 | bwd_inner_microstep: 1734.29 | bwd_allreduce_microstep: 12.90 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, 
dynamic token length: 13369 total_samples=15837, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:30,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.44 | bwd_microstep: 2081.12 | bwd_inner_microstep: 1921.53 | bwd_allreduce_microstep: 159.53 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13522 total_samples=15841, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:33,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27 [2025-08-03 04:55:33,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.79 | bwd_microstep: 1801.82 | bwd_inner_microstep: 1717.21 | bwd_allreduce_microstep: 84.54 | step_microstep: 141.23 [2025-08-03 04:55:33,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.89 | bwd: 7707.39 | bwd_inner: 7328.68 | bwd_allreduce: 378.47 | step: 141.70 {'loss': 0.7578, 'learning_rate': 9.789496454055482e-06, 'epoch': 0.52} 52%|█████▏ | 1039/2000 [3:12:04<2:55:53, 10.98s/it] 52%|█████▏ | 1040/2000 [3:12:15<2:55:30, 10.97s/it] 52%|█████▏ | 1041/2000 [3:12:26<2:54:41, 10.93s/it] 52%|█████▏ | 1042/2000 [3:12:37<2:55:32, 10.99s/it] 52%|█████▏ | 1043/2000 [3:12:48<2:55:03, 10.98s/it] dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13949 total_samples=15845, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:36,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.64 | bwd_microstep: 1822.33 | bwd_inner_microstep: 1731.77 | bwd_allreduce_microstep: 90.49 | step_microstep: 0.23 dynamic ViT batch size: 48,
images per sample: 48.0, dynamic token length: 13607 total_samples=15849, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:38,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.00 | bwd_microstep: 1846.42 | bwd_inner_microstep: 1744.18 | bwd_allreduce_microstep: 102.17 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13765 total_samples=15853, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:41,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.15 | bwd_microstep: 1980.98 | bwd_inner_microstep: 1883.65 | bwd_allreduce_microstep: 97.25 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14078 total_samples=15859, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:44,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89 [2025-08-03 04:55:44,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.18 | bwd_microstep: 2126.67 | bwd_inner_microstep: 2026.08 | bwd_allreduce_microstep: 100.53 | step_microstep: 126.62 [2025-08-03 04:55:44,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.91 | bwd: 7776.44 | bwd_inner: 7385.68 | bwd_allreduce: 390.53 | step: 127.19 {'loss': 0.7457, 'learning_rate': 9.773306548847102e-06, 'epoch': 0.52} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11773 total_samples=15862, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:47,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.05 | bwd_microstep: 2143.24 | bwd_inner_microstep: 1923.25 | bwd_allreduce_microstep: 219.93 | step_microstep: 0.23 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12790 total_samples=15866, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:50,322] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.00 | bwd_microstep: 2124.87 | bwd_inner_microstep: 1970.15 | bwd_allreduce_microstep: 154.66 | step_microstep: 0.10 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15276 total_samples=15870, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:52,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.79 | bwd_microstep: 1746.33 | bwd_inner_microstep: 1697.21 | bwd_allreduce_microstep: 49.05 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13428 total_samples=15874, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:55,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.41 [2025-08-03 04:55:55,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.58 | bwd_microstep: 1712.18 | bwd_inner_microstep: 1672.08 | bwd_allreduce_microstep: 40.03 | step_microstep: 135.75 [2025-08-03 04:55:55,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.35 | bwd: 7726.68 | bwd_inner: 7262.69 | bwd_allreduce: 463.74 | step: 136.19 {'loss': 0.7492, 'learning_rate': 9.757117238115871e-06, 'epoch': 0.52} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11982 total_samples=15877, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:55:57,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.10 | bwd_microstep: 1722.35 | bwd_inner_microstep: 1541.15 | bwd_allreduce_microstep: 181.14 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13571 total_samples=15881, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:00,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1002.73 | bwd_microstep: 1753.73 | bwd_inner_microstep: 1678.39 | bwd_allreduce_microstep: 
75.27 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13550 total_samples=15885, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:03,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.11 | bwd_microstep: 1721.55 | bwd_inner_microstep: 1671.06 | bwd_allreduce_microstep: 50.43 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13998 total_samples=15889, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:06,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24 [2025-08-03 04:56:06,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.08 | bwd_microstep: 1794.92 | bwd_inner_microstep: 1736.15 | bwd_allreduce_microstep: 58.70 | step_microstep: 159.02 [2025-08-03 04:56:06,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3070.93 | bwd: 6992.59 | bwd_inner: 6626.75 | bwd_allreduce: 365.62 | step: 159.45 {'loss': 0.7461, 'learning_rate': 9.740928564316369e-06, 'epoch': 0.52} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11638 total_samples=15892, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:08,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.48 | bwd_microstep: 1734.25 | bwd_inner_microstep: 1522.88 | bwd_allreduce_microstep: 211.31 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13263 total_samples=15896, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:11,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.05 | bwd_microstep: 2013.87 | bwd_inner_microstep: 1709.04 | bwd_allreduce_microstep: 304.76 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11889 total_samples=15899, num_samples=3, 
num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:13,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.01 | bwd_microstep: 1733.11 | bwd_inner_microstep: 1539.92 | bwd_allreduce_microstep: 193.12 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13713 total_samples=15903, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:16,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31 [2025-08-03 04:56:16,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.69 | bwd_microstep: 1895.81 | bwd_inner_microstep: 1865.82 | bwd_allreduce_microstep: 29.93 | step_microstep: 130.15 [2025-08-03 04:56:16,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.16 | bwd: 7377.09 | bwd_inner: 6637.67 | bwd_allreduce: 739.19 | step: 130.61 {'loss': 0.7578, 'learning_rate': 9.724740569901503e-06, 'epoch': 0.52} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14345 total_samples=15907, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:19,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.34 | bwd_microstep: 1952.01 | bwd_inner_microstep: 1773.60 | bwd_allreduce_microstep: 178.35 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13772 total_samples=15911, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:22,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.45 | bwd_microstep: 1919.16 | bwd_inner_microstep: 1870.48 | bwd_allreduce_microstep: 48.61 | step_microstep: 0.12 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12556 total_samples=15915, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:24,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.40 | bwd_microstep: 
1804.16 | bwd_inner_microstep: 1608.94 | bwd_allreduce_microstep: 195.16 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13579 total_samples=15919, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:27,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20 [2025-08-03 04:56:27,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.23 | bwd_microstep: 1993.80 | bwd_inner_microstep: 1870.88 | bwd_allreduce_microstep: 122.86 | step_microstep: 112.62 [2025-08-03 04:56:27,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.33 | bwd: 7669.18 | bwd_inner: 7123.89 | bwd_allreduce: 545.05 | step: 112.95 {'loss': 0.7362, 'learning_rate': 9.708553297322407e-06, 'epoch': 0.52} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13217 total_samples=15923, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:30,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.39 | bwd_microstep: 1981.18 | bwd_inner_microstep: 1687.78 | bwd_allreduce_microstep: 293.34 | step_microstep: 0.11 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13048 total_samples=15927, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:32,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.09 | bwd_microstep: 1809.46 | bwd_inner_microstep: 1653.19 | bwd_allreduce_microstep: 156.20 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14242 total_samples=15931, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:35,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.45 | bwd_microstep: 1781.49 | bwd_inner_microstep: 1730.16 | bwd_allreduce_microstep: 51.26 | step_microstep: 0.24 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token 
length: 12978 total_samples=15935, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:38,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88 [2025-08-03 04:56:38,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.25 | bwd_microstep: 2226.37 | bwd_inner_microstep: 1807.30 | bwd_allreduce_microstep: 419.01 | step_microstep: 132.42 [2025-08-03 04:56:38,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.11 | bwd: 7798.55 | bwd_inner: 6878.43 | bwd_allreduce: 919.89 | step: 132.89 10.98s/it] 52%|█████▏ | 1044/2000 [3:12:59<2:55:05, 10.99s/it] 52%|█████▏ | 1044/2000 [3:12:59<2:55:05, 10.99s/it] 52%|█████▏ | 1045/2000 [3:13:10<2:54:49, 10.98s/it] 52%|█████▏ | 1045/2000 [3:13:10<2:54:49, 10.98s/it] 52%|█████▏ | 1046/2000 [3:13:20<2:52:32, 10.85s/it] 52%|█████▏ | 1046/2000 [3:13:20<2:52:32, 10.85s/it] 52%|█████▏ | 1047/2000 [3:13:31<2:51:22, 10.79s/it] 52%|█████▏ | 1047/2000 [3:13:31<2:51:22, 10.79s/it] 52%|█████▏ | 1048/2000 [3:13:42<2:51:39, 10.82s/it] 52%|█████▏ | 1048/2000 [3:13:42<2:51:39, 10.82s/it] 52%|█████▏ | 1049/2000 [3:13:53{'loss': 0.751, 'learning_rate': 9.692366789028308e-06, 'epoch': 0.52} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13505 total_samples=15939, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:41,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.48 | bwd_microstep: 2021.72 | bwd_inner_microstep: 1894.13 | bwd_allreduce_microstep: 127.52 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11888 total_samples=15942, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:44,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.13 | bwd_microstep: 1854.96 | bwd_inner_microstep: 1700.03 | bwd_allreduce_microstep: 154.86 | step_microstep: 0.22 dynamic ViT batch size: 42, images per 
sample: 42.0, dynamic token length: 11575 total_samples=15945, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:46,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.05 | bwd_microstep: 1806.41 | bwd_inner_microstep: 1680.22 | bwd_allreduce_microstep: 126.13 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14313 total_samples=15949, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:49,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11 [2025-08-03 04:56:49,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.14 | bwd_microstep: 1794.36 | bwd_inner_microstep: 1728.52 | bwd_allreduce_microstep: 65.78 | step_microstep: 121.61 [2025-08-03 04:56:49,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.73 | bwd: 7477.50 | bwd_inner: 7002.89 | bwd_allreduce: 474.37 | step: 122.07 {'loss': 0.7567, 'learning_rate': 9.676181087466444e-06, 'epoch': 0.53} dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14255 total_samples=15955, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:52,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.07 | bwd_microstep: 1966.64 | bwd_inner_microstep: 1888.50 | bwd_allreduce_microstep: 78.06 | step_microstep: 0.30 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11954 total_samples=15958, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:54,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.88 | bwd_microstep: 1704.17 | bwd_inner_microstep: 1560.67 | bwd_allreduce_microstep: 143.43 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13176 total_samples=15962, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:56:57,504] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 981.89 | bwd_microstep: 1854.04 | bwd_inner_microstep: 1705.50 | bwd_allreduce_microstep: 148.47 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13520 total_samples=15966, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:00,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40 [2025-08-03 04:57:00,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.60 | bwd_microstep: 1950.77 | bwd_inner_microstep: 1716.97 | bwd_allreduce_microstep: 233.69 | step_microstep: 141.94 [2025-08-03 04:57:00,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3115.36 | bwd: 7475.69 | bwd_inner: 6871.64 | bwd_allreduce: 603.76 | step: 142.59 {'loss': 0.7523, 'learning_rate': 9.659996235081926e-06, 'epoch': 0.53} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12840 total_samples=15969, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:02,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.91 | bwd_microstep: 1765.01 | bwd_inner_microstep: 1603.57 | bwd_allreduce_microstep: 161.38 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12225 total_samples=15972, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:05,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.43 | bwd_microstep: 1842.36 | bwd_inner_microstep: 1593.67 | bwd_allreduce_microstep: 248.62 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13488 total_samples=15976, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:08,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.99 | bwd_microstep: 1820.89 | bwd_inner_microstep: 1735.99 | bwd_allreduce_microstep: 84.83 | 
step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13522 total_samples=15980, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:10,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72 [2025-08-03 04:57:10,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.53 | bwd_microstep: 1801.64 | bwd_inner_microstep: 1714.30 | bwd_allreduce_microstep: 87.27 | step_microstep: 134.43 [2025-08-03 04:57:10,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2865.80 | bwd: 7229.96 | bwd_inner: 6647.53 | bwd_allreduce: 582.20 | step: 135.00 {'loss': 0.7514, 'learning_rate': 9.643812274317644e-06, 'epoch': 0.53} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13259 total_samples=15984, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:13,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.94 | bwd_microstep: 1791.73 | bwd_inner_microstep: 1689.97 | bwd_allreduce_microstep: 101.69 | step_microstep: 0.16 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891 total_samples=15987, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:15,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.60 | bwd_microstep: 1735.31 | bwd_inner_microstep: 1536.91 | bwd_allreduce_microstep: 198.34 | step_microstep: 0.22 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13586 total_samples=15991, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:18,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.15 | bwd_microstep: 1778.23 | bwd_inner_microstep: 1684.51 | bwd_allreduce_microstep: 93.66 | step_microstep: 0.13 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13498 total_samples=15995, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 04:57:21,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44 [2025-08-03 04:57:21,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.20 | bwd_microstep: 1756.64 | bwd_inner_microstep: 1689.62 | bwd_allreduce_microstep: 66.94 | step_microstep: 113.84 [2025-08-03 04:57:21,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.81 | bwd: 7061.96 | bwd_inner: 6601.00 | bwd_allreduce: 460.71 | step: 114.36 {'loss': 0.7555, 'learning_rate': 9.627629247614151e-06, 'epoch': 0.53} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12917 total_samples=15999, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:24,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 770.53 | bwd_microstep: 2153.09 | bwd_inner_microstep: 1954.53 | bwd_allreduce_microstep: 198.49 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13643 total_samples=16003, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:26,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.14 | bwd_microstep: 1810.12 | bwd_inner_microstep: 1721.14 | bwd_allreduce_microstep: 88.91 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15387 total_samples=16008, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:29,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.11 | bwd_microstep: 2138.53 | bwd_inner_microstep: 2132.22 | bwd_allreduce_microstep: 6.25 | step_microstep: 0.21 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13567 total_samples=16012, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:32,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40 [2025-08-03 04:57:32,459] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.55 | bwd_microstep: 1842.24 | bwd_inner_microstep: 1720.65 | bwd_allreduce_microstep: 121.53 | step_microstep: 135.06 [2025-08-03 04:57:32,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2902.27 | bwd: 7944.04 | bwd_inner: 7528.55 | bwd_allreduce: 415.25 | step: 135.50 <2:52:37, 10.89s/it] 52%|█████▏ | 1049/2000 [3:13:53<2:52:37, 10.89s/it] 52%|█████▎ | 1050/2000 [3:14:04<2:51:43, 10.85s/it] 52%|█████▎ | 1050/2000 [3:14:04<2:51:43, 10.85s/it] 53%|█████▎ | 1051/2000 [3:14:15<2:52:23, 10.90s/it] 53%|█████▎ | 1051/2000 [3:14:15<2:52:23, 10.90s/it] 53%|█████▎ | 1052/2000 [3:14:25<2:50:25, 10.79s/it] 53%|█████▎ | 1052/2000 [3:14:25<2:50:25, 10.79s/it] 53%|█████▎ | 1053/2000 [3:14:36<2:47:51, 10.64s/it] 53%|█████▎ | 1053/2000 [3:14:36<2:47:51, 10.64s/it] 53%|█████▎ | 1054/2000 [3:14:47<2:50:46, 10.83s/it] {'loss': 0.7407, 'learning_rate': 9.611447197409544e-06, 'epoch': 0.53} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13391 total_samples=16016, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:35,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.42 | bwd_microstep: 1800.33 | bwd_inner_microstep: 1703.25 | bwd_allreduce_microstep: 97.01 | step_microstep: 0.14 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12122 total_samples=16020, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:37,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.20 | bwd_microstep: 1789.88 | bwd_inner_microstep: 1579.19 | bwd_allreduce_microstep: 210.62 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13300 total_samples=16024, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:40,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.38 | bwd_microstep: 1836.13 | 
bwd_inner_microstep: 1686.85 | bwd_allreduce_microstep: 149.21 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14362 total_samples=16030, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:42,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25 [2025-08-03 04:57:42,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.68 | bwd_microstep: 1814.91 | bwd_inner_microstep: 1752.48 | bwd_allreduce_microstep: 62.36 | step_microstep: 122.60 [2025-08-03 04:57:42,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.61 | bwd: 7241.29 | bwd_inner: 6721.77 | bwd_allreduce: 519.28 | step: 123.07 {'loss': 0.7449, 'learning_rate': 9.595266166139366e-06, 'epoch': 0.53} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13564 total_samples=16034, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:45,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.57 | bwd_microstep: 2012.00 | bwd_inner_microstep: 1866.70 | bwd_allreduce_microstep: 145.24 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14079 total_samples=16038, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:48,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.87 | bwd_microstep: 1725.51 | bwd_inner_microstep: 1678.06 | bwd_allreduce_microstep: 47.38 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13555 total_samples=16042, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:50,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.59 | bwd_microstep: 1784.72 | bwd_inner_microstep: 1726.00 | bwd_allreduce_microstep: 58.66 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 
13264 total_samples=16046, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:53,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85 [2025-08-03 04:57:53,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.36 | bwd_microstep: 2115.52 | bwd_inner_microstep: 1843.49 | bwd_allreduce_microstep: 271.97 | step_microstep: 109.30 [2025-08-03 04:57:53,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.32 | bwd: 7637.80 | bwd_inner: 7114.24 | bwd_allreduce: 523.32 | step: 109.76 {'loss': 0.7528, 'learning_rate': 9.579086196236483e-06, 'epoch': 0.53} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12361 total_samples=16049, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:56,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.36 | bwd_microstep: 2037.83 | bwd_inner_microstep: 1810.38 | bwd_allreduce_microstep: 227.39 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11950 total_samples=16052, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:57:59,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.85 | bwd_microstep: 1857.07 | bwd_inner_microstep: 1569.83 | bwd_allreduce_microstep: 287.18 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12744 total_samples=16056, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:01,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.37 | bwd_microstep: 1696.04 | bwd_inner_microstep: 1618.53 | bwd_allreduce_microstep: 77.45 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13504 total_samples=16060, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:04,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| optimizer_step: 13.77 [2025-08-03 04:58:04,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.05 | bwd_microstep: 2032.30 | bwd_inner_microstep: 1719.30 | bwd_allreduce_microstep: 312.93 | step_microstep: 110.04 [2025-08-03 04:58:04,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.57 | bwd: 7623.28 | bwd_inner: 6718.03 | bwd_allreduce: 905.02 | step: 110.37 {'loss': 0.7475, 'learning_rate': 9.562907330130981e-06, 'epoch': 0.53} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12338 total_samples=16063, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:07,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.10 | bwd_microstep: 2161.73 | bwd_inner_microstep: 1811.90 | bwd_allreduce_microstep: 349.76 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14165 total_samples=16067, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:10,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.14 | bwd_microstep: 2044.12 | bwd_inner_microstep: 1902.59 | bwd_allreduce_microstep: 141.46 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13542 total_samples=16071, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:12,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.11 | bwd_microstep: 1692.55 | bwd_inner_microstep: 1661.64 | bwd_allreduce_microstep: 30.85 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12386 total_samples=16075, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:15,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93 [2025-08-03 04:58:15,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.69 | bwd_microstep: 1831.59 | 
bwd_inner_microstep: 1637.78 | bwd_allreduce_microstep: 193.74 | step_microstep: 126.64 [2025-08-03 04:58:15,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.98 | bwd: 7730.02 | bwd_inner: 7013.91 | bwd_allreduce: 715.89 | step: 127.07 {'loss': 0.7493, 'learning_rate': 9.54672961025005e-06, 'epoch': 0.53} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12065 total_samples=16078, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:18,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.90 | bwd_microstep: 1757.09 | bwd_inner_microstep: 1556.60 | bwd_allreduce_microstep: 200.42 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13251 total_samples=16082, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:20,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.52 | bwd_microstep: 1995.10 | bwd_inner_microstep: 1855.66 | bwd_allreduce_microstep: 139.38 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14220 total_samples=16086, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:23,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.23 | bwd_microstep: 1802.10 | bwd_inner_microstep: 1734.32 | bwd_allreduce_microstep: 67.71 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13187 total_samples=16090, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:26,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85 [2025-08-03 04:58:26,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.02 | bwd_microstep: 1719.62 | bwd_inner_microstep: 1654.49 | bwd_allreduce_microstep: 65.07 | step_microstep: 147.99 [2025-08-03 04:58:26,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd: 2838.59 | bwd: 7273.96 | bwd_inner: 6801.06 | bwd_allreduce: 472.67 | step: 148.45 {'loss': 0.7457, 'learning_rate': 9.530553079017872e-06, 'epoch': 0.53} 53%|█████▎ | 1054/2000 [3:14:47<2:50:46, 10.83s/it] 53%|█████▎ | 1055/2000 [3:14:57<2:49:03, 10.73s/it] 53%|█████▎ | 1055/2000 [3:14:57<2:49:03, 10.73s/it] 53%|█████▎ | 1056/2000 [3:15:08<2:49:28, 10.77s/it] 53%|█████▎ | 1056/2000 [3:15:08<2:49:28, 10.77s/it] 53%|█████▎ | 1057/2000 [3:15:19<2:49:32, 10.79s/it] 53%|█████▎ | 1057/2000 [3:15:19<2:49:32, 10.79s/it] 53%|█████▎ | 1058/2000 [3:15:30<2:50:09, 10.84s/it] 53%|█████▎ | 1058/2000 [3:15:30<2:50:09, 10.84s/it] 53%|█████▎ | 1059/2000 [3:15:41<2:48:39, 10.75s/it] 53%|█████▎ | 1059/2000 [dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11610 total_samples=16093, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:28,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.30 | bwd_microstep: 1829.56 | bwd_inner_microstep: 1580.54 | bwd_allreduce_microstep: 248.96 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12805 total_samples=16097, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:31,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.86 | bwd_microstep: 1754.43 | bwd_inner_microstep: 1652.93 | bwd_allreduce_microstep: 101.43 | step_microstep: 0.14 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13595 total_samples=16101, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:34,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.39 | bwd_microstep: 1953.46 | bwd_inner_microstep: 1716.66 | bwd_allreduce_microstep: 236.74 | step_microstep: 0.21 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13653 total_samples=16105, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
04:58:36,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69 [2025-08-03 04:58:36,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.24 | bwd_microstep: 1838.97 | bwd_inner_microstep: 1725.91 | bwd_allreduce_microstep: 112.99 | step_microstep: 144.58 [2025-08-03 04:58:36,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2862.72 | bwd: 7376.48 | bwd_inner: 6676.03 | bwd_allreduce: 700.20 | step: 145.05 {'loss': 0.7606, 'learning_rate': 9.514377778855521e-06, 'epoch': 0.53} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11818 total_samples=16108, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:39,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.46 | bwd_microstep: 1836.65 | bwd_inner_microstep: 1608.33 | bwd_allreduce_microstep: 228.25 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12914 total_samples=16112, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:42,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.01 | bwd_microstep: 1900.62 | bwd_inner_microstep: 1889.21 | bwd_allreduce_microstep: 11.35 | step_microstep: 0.26 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13059 total_samples=16116, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:44,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.10 | bwd_microstep: 1900.18 | bwd_inner_microstep: 1712.43 | bwd_allreduce_microstep: 187.68 | step_microstep: 0.29 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14706 total_samples=16120, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:47,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.74 [2025-08-03 04:58:47,442] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 696.20 | bwd_microstep: 1754.68 | bwd_inner_microstep: 1725.37 | bwd_allreduce_microstep: 29.24 | step_microstep: 113.77 [2025-08-03 04:58:47,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.69 | bwd: 7392.18 | bwd_inner: 6935.33 | bwd_allreduce: 456.60 | step: 114.43 {'loss': 0.7472, 'learning_rate': 9.498203752180827e-06, 'epoch': 0.53} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12007 total_samples=16123, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:50,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.16 | bwd_microstep: 1902.80 | bwd_inner_microstep: 1556.75 | bwd_allreduce_microstep: 345.97 | step_microstep: 0.27 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13978 total_samples=16127, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:52,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.10 | bwd_microstep: 1752.72 | bwd_inner_microstep: 1723.92 | bwd_allreduce_microstep: 28.74 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12853 total_samples=16131, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:55,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.13 | bwd_microstep: 1960.32 | bwd_inner_microstep: 1828.48 | bwd_allreduce_microstep: 131.78 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13532 total_samples=16135, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:58:58,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.92 [2025-08-03 04:58:58,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.59 | bwd_microstep: 1841.43 | bwd_inner_microstep: 1709.33 | bwd_allreduce_microstep: 132.03 | step_microstep: 129.63 [2025-08-03 
04:58:58,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.91 | bwd: 7457.32 | bwd_inner: 6818.46 | bwd_allreduce: 638.61 | step: 130.24 {'loss': 0.7558, 'learning_rate': 9.482031041408296e-06, 'epoch': 0.53} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11592 total_samples=16138, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:00,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.02 | bwd_microstep: 2056.99 | bwd_inner_microstep: 1815.50 | bwd_allreduce_microstep: 241.42 | step_microstep: 0.13 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13364 total_samples=16143, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:03,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.90 | bwd_microstep: 1822.70 | bwd_inner_microstep: 1746.94 | bwd_allreduce_microstep: 75.69 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13427 total_samples=16148, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:06,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.52 | bwd_microstep: 1982.86 | bwd_inner_microstep: 1830.05 | bwd_allreduce_microstep: 152.75 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13802 total_samples=16152, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:09,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.02 [2025-08-03 04:59:09,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.57 | bwd_microstep: 2036.66 | bwd_inner_microstep: 1825.91 | bwd_allreduce_microstep: 210.68 | step_microstep: 113.37 [2025-08-03 04:59:09,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.93 | bwd: 7899.26 | bwd_inner: 7218.39 | bwd_allreduce: 680.62 | step: 113.87 
{'loss': 0.7476, 'learning_rate': 9.465859688948977e-06, 'epoch': 0.53} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12152 total_samples=16155, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:11,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.44 | bwd_microstep: 1806.76 | bwd_inner_microstep: 1573.53 | bwd_allreduce_microstep: 233.16 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13264 total_samples=16159, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:14,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.36 | bwd_microstep: 1737.98 | bwd_inner_microstep: 1670.23 | bwd_allreduce_microstep: 67.68 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12799 total_samples=16163, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:16,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.02 | bwd_microstep: 1826.28 | bwd_inner_microstep: 1679.25 | bwd_allreduce_microstep: 146.96 | step_microstep: 0.25 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13512 total_samples=16168, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:19,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.95 [2025-08-03 04:59:19,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.37 | bwd_microstep: 2080.79 | bwd_inner_microstep: 1924.17 | bwd_allreduce_microstep: 156.55 | step_microstep: 141.67 [2025-08-03 04:59:19,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.12 | bwd: 7451.86 | bwd_inner: 6847.18 | bwd_allreduce: 604.44 | step: 142.26 {'loss': 0.7553, 'learning_rate': 9.449689737210352e-06, 'epoch': 0.53} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13290 
total_samples=16172, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:22,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.71 | bwd_microstep: 1769.41 | bwd_inner_microstep: 1694.23 | bwd_allreduce_microstep: 75.12 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11689 total_samples=16175, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:25,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.22 | bwd_microstep: 1759.80 | bwd_inner_microstep: 1544.94 | bwd_allreduce_microstep: 214.80 | step_microstep: 0.16 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13890 total_samples=16179, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:27,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 750.92 | bwd_microstep: 1867.45 | bwd_inner_microstep: 1689.15 | bwd_allreduce_microstep: 178.22 | step_microstep: 0.27 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14990 total_samples=16184, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:30,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89 [2025-08-03 04:59:30,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.34 | bwd_microstep: 1773.40 | bwd_inner_microstep: 1741.18 | bwd_allreduce_microstep: 32.15 | step_microstep: 119.96 [2025-08-03 04:59:30,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2874.12 | bwd: 7170.12 | bwd_inner: 6669.49 | bwd_allreduce: 500.39 | step: 120.50 3:15:41<2:48:39, 10.75s/it] 53%|█████▎ | 1060/2000 [3:15:51<2:48:03, 10.73s/it] 53%|█████▎ | 1060/2000 [3:15:51<2:48:03, 10.73s/it] 53%|█████▎ | 1061/2000 [3:16:02<2:47:20, 10.69s/it] 53%|█████▎ | 1061/2000 [3:16:02<2:47:20, 10.69s/it] 53%|█████▎ | 1062/2000 [3:16:13<2:47:14, 10.70s/it] 53%|█████▎ | 1062/2000 
[3:16:13<2:47:14, 10.70s/it] 53%|█████▎ | 1063/2000 [3:16:24<2:48:58, 10.82s/it] 53%|█████▎ | 1063/2000 [3:16:24<2:48:58, 10.82s/it] 53%|█████▎ | 1064/2000 [3:16:34<2:48:10, 10.78s/it] 53%|█████▎ | 1064/2000 [3:16:34<2:48:10, 10.78s/it] 53%|█████▎ | 10{'loss': 0.747, 'learning_rate': 9.433521228596237e-06, 'epoch': 0.53} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13384 total_samples=16188, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:33,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.28 | bwd_microstep: 1998.65 | bwd_inner_microstep: 1865.60 | bwd_allreduce_microstep: 132.98 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16231 total_samples=16192, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:35,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.94 | bwd_microstep: 1835.23 | bwd_inner_microstep: 1828.72 | bwd_allreduce_microstep: 6.45 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13305 total_samples=16196, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:38,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.68 | bwd_microstep: 1773.83 | bwd_inner_microstep: 1695.06 | bwd_allreduce_microstep: 78.70 | step_microstep: 0.14 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12921 total_samples=16200, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:41,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61 [2025-08-03 04:59:41,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.88 | bwd_microstep: 1845.14 | bwd_inner_microstep: 1665.59 | bwd_allreduce_microstep: 179.49 | step_microstep: 135.12 [2025-08-03 04:59:41,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd: 2825.70 | bwd: 7452.90 | bwd_inner: 7054.96 | bwd_allreduce: 397.70 | step: 135.61 {'loss': 0.7552, 'learning_rate': 9.417354205506663e-06, 'epoch': 0.53} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13904 total_samples=16204, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:43,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.03 | bwd_microstep: 1780.21 | bwd_inner_microstep: 1708.37 | bwd_allreduce_microstep: 71.77 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12352 total_samples=16207, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:46,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.71 | bwd_microstep: 1773.96 | bwd_inner_microstep: 1568.59 | bwd_allreduce_microstep: 205.31 | step_microstep: 0.22 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13462 total_samples=16211, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:48,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.90 | bwd_microstep: 1776.53 | bwd_inner_microstep: 1693.01 | bwd_allreduce_microstep: 83.46 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13604 total_samples=16215, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:51,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09 [2025-08-03 04:59:51,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.65 | bwd_microstep: 1984.96 | bwd_inner_microstep: 1913.30 | bwd_allreduce_microstep: 71.60 | step_microstep: 111.86 [2025-08-03 04:59:51,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.21 | bwd: 7315.72 | bwd_inner: 6883.27 | bwd_allreduce: 432.22 | step: 112.33 {'loss': 0.765, 'learning_rate': 9.401188710337757e-06, 'epoch': 0.53} 
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13472 total_samples=16219, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:54,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.07 | bwd_microstep: 2165.23 | bwd_inner_microstep: 2159.13 | bwd_allreduce_microstep: 6.03 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11617 total_samples=16222, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 04:59:58,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.99 | bwd_microstep: 3019.75 | bwd_inner_microstep: 2887.49 | bwd_allreduce_microstep: 132.19 | step_microstep: 0.21 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12955 total_samples=16226, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:01,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.64 | bwd_microstep: 1973.62 | bwd_inner_microstep: 1653.98 | bwd_allreduce_microstep: 319.57 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12501 total_samples=16229, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:03,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04 [2025-08-03 05:00:03,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.15 | bwd_microstep: 1788.80 | bwd_inner_microstep: 1578.64 | bwd_allreduce_microstep: 210.08 | step_microstep: 147.27 [2025-08-03 05:00:03,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.77 | bwd: 8947.44 | bwd_inner: 8279.24 | bwd_allreduce: 667.96 | step: 147.73 {'loss': 0.7531, 'learning_rate': 9.385024785481653e-06, 'epoch': 0.53} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11903 total_samples=16232, num_samples=3, num_padding_tokens=0, num_padding_images=0 
[2025-08-03 05:00:06,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.03 | bwd_microstep: 1745.79 | bwd_inner_microstep: 1549.94 | bwd_allreduce_microstep: 195.79 | step_microstep: 0.19 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11683 total_samples=16235, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:09,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.31 | bwd_microstep: 2034.36 | bwd_inner_microstep: 1801.10 | bwd_allreduce_microstep: 233.18 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13253 total_samples=16239, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:11,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.27 | bwd_microstep: 1856.32 | bwd_inner_microstep: 1729.58 | bwd_allreduce_microstep: 126.67 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13137 total_samples=16243, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:14,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72 [2025-08-03 05:00:14,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.47 | bwd_microstep: 1796.79 | bwd_inner_microstep: 1703.43 | bwd_allreduce_microstep: 93.30 | step_microstep: 143.88 [2025-08-03 05:00:14,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.99 | bwd: 7433.31 | bwd_inner: 6784.04 | bwd_allreduce: 649.03 | step: 144.34 {'loss': 0.7475, 'learning_rate': 9.368862473326355e-06, 'epoch': 0.53} dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 12100 total_samples=16246, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:18,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1511.33 | bwd_microstep: 2530.53 | bwd_inner_microstep: 2280.76 | 
bwd_allreduce_microstep: 249.71 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16165 total_samples=16250, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:21,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.50 | bwd_microstep: 1948.06 | bwd_inner_microstep: 1843.74 | bwd_allreduce_microstep: 104.25 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13719 total_samples=16255, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:24,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.71 | bwd_microstep: 1895.46 | bwd_inner_microstep: 1669.18 | bwd_allreduce_microstep: 226.22 | step_microstep: 0.18 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13333 total_samples=16259, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:27,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12 [2025-08-03 05:00:27,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.31 | bwd_microstep: 2012.92 | bwd_inner_microstep: 1876.27 | bwd_allreduce_microstep: 136.58 | step_microstep: 110.54 [2025-08-03 05:00:27,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3646.78 | bwd: 8387.13 | bwd_inner: 7669.94 | bwd_allreduce: 716.84 | step: 110.92 65/2000 [3:16:45<2:46:34, 10.69s/it] 53%|█████▎ | 1065/2000 [3:16:45<2:46:34, 10.69s/it] 53%|█████▎ | 1066/2000 [3:16:55<2:46:33, 10.70s/it] 53%|█████▎ | 1066/2000 [3:16:56<2:46:33, 10.70s/it] 53%|█████▎ | 1067/2000 [3:17:06<2:45:42, 10.66s/it] 53%|█████▎ | 1067/2000 [3:17:06<2:45:42, 10.66s/it] 53%|█████▎ | 1068/2000 [3:17:18<2:52:42, 11.12s/it] 53%|█████▎ | 1068/2000 [3:17:18<2:52:42, 11.12s/it] 53%|█████▎ | 1069/2000 [3:17:29<2:50:46, 11.01s/it] 53%|█████▎ | 1069/2000 [3:17:29<2:50:46, 11.01s/it] 54%|█████▎ | 1070/2000 [3:17:41<2:57:14, 
11.44s/it] {'loss': 0.7611, 'learning_rate': 9.352701816255643e-06, 'epoch': 0.54} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11734 total_samples=16263, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:29,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.23 | bwd_microstep: 1898.51 | bwd_inner_microstep: 1728.75 | bwd_allreduce_microstep: 169.69 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13853 total_samples=16267, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:32,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.28 | bwd_microstep: 1819.39 | bwd_inner_microstep: 1737.83 | bwd_allreduce_microstep: 81.49 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13282 total_samples=16271, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:35,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.18 | bwd_microstep: 2114.24 | bwd_inner_microstep: 1903.29 | bwd_allreduce_microstep: 210.89 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13402 total_samples=16275, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:37,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20 [2025-08-03 05:00:37,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.42 | bwd_microstep: 1851.04 | bwd_inner_microstep: 1699.40 | bwd_allreduce_microstep: 151.57 | step_microstep: 134.33 [2025-08-03 05:00:37,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.06 | bwd: 7683.22 | bwd_inner: 7069.27 | bwd_allreduce: 613.71 | step: 134.81 {'loss': 0.7401, 'learning_rate': 9.336542856648958e-06, 'epoch': 0.54} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 
12227 total_samples=16278, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:40,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.37 | bwd_microstep: 1882.70 | bwd_inner_microstep: 1563.67 | bwd_allreduce_microstep: 318.96 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13587 total_samples=16282, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:43,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.45 | bwd_microstep: 1744.22 | bwd_inner_microstep: 1685.53 | bwd_allreduce_microstep: 58.63 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12737 total_samples=16286, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:45,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.01 | bwd_microstep: 1767.65 | bwd_inner_microstep: 1659.38 | bwd_allreduce_microstep: 108.21 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15187 total_samples=16291, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:48,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.71 [2025-08-03 05:00:48,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.74 | bwd_microstep: 1843.46 | bwd_inner_microstep: 1790.18 | bwd_allreduce_microstep: 53.21 | step_microstep: 144.81 [2025-08-03 05:00:48,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.50 | bwd: 7238.08 | bwd_inner: 6698.76 | bwd_allreduce: 539.09 | step: 145.28 {'loss': 0.7416, 'learning_rate': 9.320385636881283e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14312 total_samples=16295, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:51,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 704.39 | bwd_microstep: 1781.82 | bwd_inner_microstep: 1736.58 | bwd_allreduce_microstep: 45.18 | step_microstep: 0.10 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13360 total_samples=16299, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:53,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.48 | bwd_microstep: 1772.37 | bwd_inner_microstep: 1669.61 | bwd_allreduce_microstep: 102.69 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14235 total_samples=16303, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:56,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.64 | bwd_microstep: 1801.78 | bwd_inner_microstep: 1753.15 | bwd_allreduce_microstep: 48.56 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13871 total_samples=16307, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:00:58,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.65 [2025-08-03 05:00:58,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.42 | bwd_microstep: 1812.28 | bwd_inner_microstep: 1757.44 | bwd_allreduce_microstep: 54.77 | step_microstep: 117.65 [2025-08-03 05:00:58,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.86 | bwd: 7168.29 | bwd_inner: 6916.77 | bwd_allreduce: 251.28 | step: 118.10 {'loss': 0.7482, 'learning_rate': 9.30423019932305e-06, 'epoch': 0.54} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11755 total_samples=16310, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:01,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.71 | bwd_microstep: 2106.48 | bwd_inner_microstep: 1900.05 | bwd_allreduce_microstep: 206.37 | step_microstep: 0.12 dynamic ViT batch size: 48, 
images per sample: 48.0, dynamic token length: 13917 total_samples=16314, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:04,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.55 | bwd_microstep: 1898.63 | bwd_inner_microstep: 1729.73 | bwd_allreduce_microstep: 168.84 | step_microstep: 0.22 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12495 total_samples=16318, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:07,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.98 | bwd_microstep: 1799.84 | bwd_inner_microstep: 1614.80 | bwd_allreduce_microstep: 184.98 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13502 total_samples=16322, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:10,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84 [2025-08-03 05:01:10,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.72 | bwd_microstep: 2138.59 | bwd_inner_microstep: 2132.67 | bwd_allreduce_microstep: 5.86 | step_microstep: 113.46 [2025-08-03 05:01:10,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.88 | bwd: 7943.59 | bwd_inner: 7377.24 | bwd_allreduce: 566.12 | step: 113.94 {'loss': 0.7551, 'learning_rate': 9.288076586340005e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13312 total_samples=16326, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:12,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.22 | bwd_microstep: 2173.81 | bwd_inner_microstep: 2056.27 | bwd_allreduce_microstep: 117.47 | step_microstep: 0.17 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14327 total_samples=16330, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:15,994] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.69 | bwd_microstep: 2240.76 | bwd_inner_microstep: 2113.42 | bwd_allreduce_microstep: 127.27 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12549 total_samples=16333, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:18,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.10 | bwd_microstep: 2041.81 | bwd_inner_microstep: 1827.60 | bwd_allreduce_microstep: 214.15 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14010 total_samples=16337, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:21,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43 [2025-08-03 05:01:21,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.75 | bwd_microstep: 1785.03 | bwd_inner_microstep: 1742.63 | bwd_allreduce_microstep: 42.35 | step_microstep: 135.20 [2025-08-03 05:01:21,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.70 | bwd: 8241.47 | bwd_inner: 7739.91 | bwd_allreduce: 501.32 | step: 135.75 {'loss': 0.7498, 'learning_rate': 9.27192484029312e-06, 'epoch': 0.54} 54%|█████▎ | 1070/2000 [3:17:41<2:57:14, 11.44s/it] 54%|█████▎ | 1071/2000 [3:17:52<2:54:37, 11.28s/it] 54%|█████▎ | 1071/2000 [3:17:52<2:54:37, 11.28s/it] 54%|█████▎ | 1072/2000 [3:18:03<2:50:42, 11.04s/it] 54%|█████▎ | 1072/2000 [3:18:03<2:50:42, 11.04s/it] 54%|█████▎ | 1073/2000 [3:18:13<2:47:34, 10.85s/it] 54%|█████▎ | 1073/2000 [3:18:13<2:47:34, 10.85s/it] 54%|█████▎ | 1074/2000 [3:18:24<2:48:57, 10.95s/it] 54%|█████▎ | 1074/2000 [3:18:24<2:48:57, 10.95s/it] 54%|█████▍ | 1075/2000 [3:18:36<2:51:05, 11.10s/it] 54%|█████▍ dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13483 total_samples=16341, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:24,002] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.38 | bwd_microstep: 1722.28 | bwd_inner_microstep: 1673.84 | bwd_allreduce_microstep: 48.37 | step_microstep: 0.71 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13371 total_samples=16345, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:26,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.02 | bwd_microstep: 2195.59 | bwd_inner_microstep: 2064.11 | bwd_allreduce_microstep: 131.43 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12645 total_samples=16348, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:29,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.81 | bwd_microstep: 1978.23 | bwd_inner_microstep: 1613.88 | bwd_allreduce_microstep: 364.28 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13761 total_samples=16352, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:32,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92 [2025-08-03 05:01:32,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.29 | bwd_microstep: 1832.59 | bwd_inner_microstep: 1747.31 | bwd_allreduce_microstep: 85.22 | step_microstep: 113.65 [2025-08-03 05:01:32,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.42 | bwd: 7728.74 | bwd_inner: 7099.13 | bwd_allreduce: 629.38 | step: 114.58 {'loss': 0.7559, 'learning_rate': 9.255775003538462e-06, 'epoch': 0.54} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11872 total_samples=16355, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:35,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.04 | bwd_microstep: 1924.99 | bwd_inner_microstep: 1560.42 | bwd_allreduce_microstep: 364.49 | 
step_microstep: 0.15 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13880 total_samples=16360, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:37,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.71 | bwd_microstep: 1832.32 | bwd_inner_microstep: 1690.42 | bwd_allreduce_microstep: 141.83 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891 total_samples=16363, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:40,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.55 | bwd_microstep: 1846.54 | bwd_inner_microstep: 1542.99 | bwd_allreduce_microstep: 303.48 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13507 total_samples=16367, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:43,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17 [2025-08-03 05:01:43,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1001.10 | bwd_microstep: 2171.20 | bwd_inner_microstep: 2057.94 | bwd_allreduce_microstep: 113.19 | step_microstep: 117.33 [2025-08-03 05:01:43,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3140.34 | bwd: 7775.10 | bwd_inner: 6851.76 | bwd_allreduce: 923.07 | step: 117.86 {'loss': 0.7632, 'learning_rate': 9.239627118427098e-06, 'epoch': 0.54} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11859 total_samples=16370, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:46,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.28 | bwd_microstep: 1797.55 | bwd_inner_microstep: 1556.25 | bwd_allreduce_microstep: 241.23 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13298 total_samples=16374, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:01:49,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.78 | bwd_microstep: 2057.31 | bwd_inner_microstep: 1880.43 | bwd_allreduce_microstep: 176.82 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11605 total_samples=16377, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:52,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.04 | bwd_microstep: 2248.14 | bwd_inner_microstep: 1915.10 | bwd_allreduce_microstep: 332.97 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13776 total_samples=16381, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:55,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46 [2025-08-03 05:01:55,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.03 | bwd_microstep: 1912.09 | bwd_inner_microstep: 1861.79 | bwd_allreduce_microstep: 50.24 | step_microstep: 135.15 [2025-08-03 05:01:55,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.08 | bwd: 8015.14 | bwd_inner: 7213.56 | bwd_allreduce: 801.34 | step: 135.50 {'loss': 0.7444, 'learning_rate': 9.22348122730497e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13413 total_samples=16385, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:01:57,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.66 | bwd_microstep: 2092.20 | bwd_inner_microstep: 1933.67 | bwd_allreduce_microstep: 158.47 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13496 total_samples=16389, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:00,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.92 | bwd_microstep: 1733.82 | 
bwd_inner_microstep: 1674.24 | bwd_allreduce_microstep: 59.52 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12052 total_samples=16392, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:03,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.14 | bwd_microstep: 1990.89 | bwd_inner_microstep: 1783.19 | bwd_allreduce_microstep: 207.64 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13326 total_samples=16396, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:05,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17 [2025-08-03 05:02:05,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.43 | bwd_microstep: 1734.13 | bwd_inner_microstep: 1672.78 | bwd_allreduce_microstep: 61.28 | step_microstep: 110.54 [2025-08-03 05:02:05,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.08 | bwd: 7551.08 | bwd_inner: 7063.87 | bwd_allreduce: 486.99 | step: 110.99 {'loss': 0.7546, 'learning_rate': 9.207337372512797e-06, 'epoch': 0.54} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13746 total_samples=16400, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:08,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.38 | bwd_microstep: 1794.62 | bwd_inner_microstep: 1706.91 | bwd_allreduce_microstep: 87.64 | step_microstep: 0.09 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11846 total_samples=16403, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:11,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.51 | bwd_microstep: 1852.62 | bwd_inner_microstep: 1591.27 | bwd_allreduce_microstep: 261.28 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 
12067 total_samples=16406, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:13,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.99 | bwd_microstep: 1823.35 | bwd_inner_microstep: 1596.53 | bwd_allreduce_microstep: 226.74 | step_microstep: 0.17 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13922 total_samples=16410, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:16,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90 [2025-08-03 05:02:16,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.56 | bwd_microstep: 2132.24 | bwd_inner_microstep: 1984.35 | bwd_allreduce_microstep: 147.83 | step_microstep: 111.95 [2025-08-03 05:02:16,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2868.37 | bwd: 7602.88 | bwd_inner: 6879.05 | bwd_allreduce: 723.59 | step: 112.35 {'loss': 0.7426, 'learning_rate': 9.19119559638596e-06, 'epoch': 0.54} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11739 total_samples=16413, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:19,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.05 | bwd_microstep: 1775.19 | bwd_inner_microstep: 1542.61 | bwd_allreduce_microstep: 232.51 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11811 total_samples=16416, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:21,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.07 | bwd_microstep: 1800.68 | bwd_inner_microstep: 1565.36 | bwd_allreduce_microstep: 235.26 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14008 total_samples=16420, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:24,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 688.38 | bwd_microstep: 1760.65 | bwd_inner_microstep: 1739.99 | bwd_allreduce_microstep: 20.59 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13671 total_samples=16424, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:28,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78 [2025-08-03 05:02:28,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.56 | bwd_microstep: 2934.02 | bwd_inner_microstep: 2927.98 | bwd_allreduce_microstep: 5.98 | step_microstep: 116.34 [2025-08-03 05:02:28,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.99 | bwd: 8270.59 | bwd_inner: 7775.94 | bwd_allreduce: 494.42 | step: 116.79 | 1075/2000 [3:18:36<2:51:05, 11.10s/it] 54%|█████▍ | 1076/2000 [3:18:47<2:50:16, 11.06s/it] 54%|█████▍ | 1076/2000 [3:18:47<2:50:16, 11.06s/it] 54%|█████▍ | 1077/2000 [3:18:58<2:51:19, 11.14s/it] 54%|█████▍ | 1077/2000 [3:18:58<2:51:19, 11.14s/it] 54%|█████▍ | 1078/2000 [3:19:09<2:51:47, 11.18s/it] 54%|█████▍ | 1078/2000 [3:19:09<2:51:47, 11.18s/it] 54%|█████▍ | 1079/2000 [3:19:20<2:49:54, 11.07s/it] 54%|█████▍ | 1079/2000 [3:19:20<2:49:54, 11.07s/it] 54%|█████▍ | 1080/2000 [3:19:31<2:48:48, 11.01s/it] 54%|█████▍ | 1080/2000 [3:19:31<2:48:48, 11.01s/it] 54%|███{'loss': 0.7616, 'learning_rate': 9.17505594125438e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13452 total_samples=16428, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:30,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.28 | bwd_microstep: 1944.45 | bwd_inner_microstep: 1723.86 | bwd_allreduce_microstep: 220.52 | step_microstep: 0.16 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11595 total_samples=16431, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:33,643] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.77 | bwd_microstep: 1922.86 | bwd_inner_microstep: 1584.47 | bwd_allreduce_microstep: 338.32 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11829 total_samples=16434, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:36,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.21 | bwd_microstep: 1721.70 | bwd_inner_microstep: 1539.37 | bwd_allreduce_microstep: 182.26 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14949 total_samples=16438, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:38,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:02:38,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.31 | bwd_microstep: 1821.90 | bwd_inner_microstep: 1769.00 | bwd_allreduce_microstep: 52.84 | step_microstep: 111.43 [2025-08-03 05:02:38,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.49 | bwd: 7410.97 | bwd_inner: 6616.69 | bwd_allreduce: 794.03 | step: 111.97 {'loss': 0.736, 'learning_rate': 9.158918449442425e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13292 total_samples=16442, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:41,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.22 | bwd_microstep: 1759.05 | bwd_inner_microstep: 1702.14 | bwd_allreduce_microstep: 56.84 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13697 total_samples=16446, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:44,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.93 | bwd_microstep: 2154.71 | bwd_inner_microstep: 2055.23 | bwd_allreduce_microstep: 99.42 | 
step_microstep: 0.15 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14147 total_samples=16450, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:46,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.90 | bwd_microstep: 1873.28 | bwd_inner_microstep: 1765.26 | bwd_allreduce_microstep: 107.96 | step_microstep: 0.13 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14268 total_samples=16454, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:49,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99 [2025-08-03 05:02:49,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.19 | bwd_microstep: 2035.36 | bwd_inner_microstep: 1951.38 | bwd_allreduce_microstep: 83.91 | step_microstep: 108.26 [2025-08-03 05:02:49,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.16 | bwd: 7822.44 | bwd_inner: 7473.99 | bwd_allreduce: 348.22 | step: 108.79 {'loss': 0.7577, 'learning_rate': 9.142783163268782e-06, 'epoch': 0.54} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11717 total_samples=16457, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:52,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.63 | bwd_microstep: 1849.02 | bwd_inner_microstep: 1597.54 | bwd_allreduce_microstep: 251.42 | step_microstep: 0.27 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12678 total_samples=16461, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:02:55,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.03 | bwd_microstep: 1778.14 | bwd_inner_microstep: 1611.79 | bwd_allreduce_microstep: 166.28 | step_microstep: 0.12 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12968 total_samples=16465, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:02:57,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.34 | bwd_microstep: 1930.76 | bwd_inner_microstep: 1654.52 | bwd_allreduce_microstep: 276.18 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13820 total_samples=16469, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:00,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19 [2025-08-03 05:03:00,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.68 | bwd_microstep: 2066.27 | bwd_inner_microstep: 1958.39 | bwd_allreduce_microstep: 107.80 | step_microstep: 131.72 [2025-08-03 05:03:00,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.60 | bwd: 7624.24 | bwd_inner: 6822.23 | bwd_allreduce: 801.75 | step: 132.23 {'loss': 0.7509, 'learning_rate': 9.126650125046361e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14462 total_samples=16473, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:03,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.26 | bwd_microstep: 1900.97 | bwd_inner_microstep: 1739.99 | bwd_allreduce_microstep: 160.92 | step_microstep: 0.31 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11683 total_samples=16476, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:06,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.82 | bwd_microstep: 1759.36 | bwd_inner_microstep: 1540.89 | bwd_allreduce_microstep: 218.40 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13193 total_samples=16480, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:08,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.74 | bwd_microstep: 2159.99 | 
bwd_inner_microstep: 1917.59 | bwd_allreduce_microstep: 242.33 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13263 total_samples=16485, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:11,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09 [2025-08-03 05:03:11,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.32 | bwd_microstep: 2063.38 | bwd_inner_microstep: 1904.09 | bwd_allreduce_microstep: 159.22 | step_microstep: 113.25 [2025-08-03 05:03:11,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.07 | bwd: 7883.74 | bwd_inner: 7102.56 | bwd_allreduce: 780.95 | step: 113.81 {'loss': 0.7587, 'learning_rate': 9.110519377082174e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13086 total_samples=16489, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:14,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.79 | bwd_microstep: 2038.69 | bwd_inner_microstep: 1896.94 | bwd_allreduce_microstep: 141.69 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13326 total_samples=16493, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:17,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.94 | bwd_microstep: 1753.14 | bwd_inner_microstep: 1686.40 | bwd_allreduce_microstep: 66.66 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13860 total_samples=16497, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:19,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.79 | bwd_microstep: 1777.29 | bwd_inner_microstep: 1720.83 | bwd_allreduce_microstep: 56.39 | step_microstep: 0.22 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 
13146 total_samples=16501, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:22,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 34.23 [2025-08-03 05:03:22,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.95 | bwd_microstep: 1858.86 | bwd_inner_microstep: 1656.79 | bwd_allreduce_microstep: 202.00 | step_microstep: 142.93 [2025-08-03 05:03:22,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.39 | bwd: 7428.03 | bwd_inner: 6960.96 | bwd_allreduce: 466.82 | step: 143.40 █▍ | 1081/2000 [3:19:43<2:50:46, 11.15s/it] 54%|█████▍ | 1081/2000 [3:19:43<2:50:46, 11.15s/it] 54%|█████▍ | 1082/2000 [3:19:53<2:48:15, 11.00s/it] 54%|█████▍ | 1082/2000 [3:19:53<2:48:15, 11.00s/it] 54%|█████▍ | 1083/2000 [3:20:04<2:48:22, 11.02s/it] 54%|█████▍ | 1083/2000 [3:20:04<2:48:22, 11.02s/it] 54%|█████▍ | 1084/2000 [3:20:15<2:47:41, 10.98s/it] 54%|█████▍ | 1084/2000 [3:20:15<2:47:41, 10.98s/it] 54%|█████▍ | 1085/2000 [3:20:26<2:47:56, 11.01s/it] 54%|█████▍ | 1085/2000 [3:20:26<2:47:56, 11.01s/it] 54%|█████▍ | 1086/2000 [3:20:37<2:46:25, 10.93s/it] {'loss': 0.7564, 'learning_rate': 9.094390961677223e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13237 total_samples=16505, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:25,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.13 | bwd_microstep: 1729.48 | bwd_inner_microstep: 1679.00 | bwd_allreduce_microstep: 50.41 | step_microstep: 0.24 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12405 total_samples=16509, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:27,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 746.73 | bwd_microstep: 1835.33 | bwd_inner_microstep: 1630.54 | bwd_allreduce_microstep: 204.71 | step_microstep: 0.18 dynamic ViT batch size: 48, images per sample: 
48.0, dynamic token length: 13335 total_samples=16513, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:30,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.91 | bwd_microstep: 1702.33 | bwd_inner_microstep: 1661.14 | bwd_allreduce_microstep: 41.12 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13535 total_samples=16517, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:33,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.80 [2025-08-03 05:03:33,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.26 | bwd_microstep: 2081.50 | bwd_inner_microstep: 1923.84 | bwd_allreduce_microstep: 157.59 | step_microstep: 150.62 [2025-08-03 05:03:33,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.94 | bwd: 7348.70 | bwd_inner: 6894.52 | bwd_allreduce: 453.93 | step: 151.28 {'loss': 0.734, 'learning_rate': 9.078264921126405e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13270 total_samples=16521, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:35,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.79 | bwd_microstep: 1773.65 | bwd_inner_microstep: 1691.32 | bwd_allreduce_microstep: 82.26 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12376 total_samples=16524, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:38,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.35 | bwd_microstep: 1856.64 | bwd_inner_microstep: 1598.79 | bwd_allreduce_microstep: 257.78 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13188 total_samples=16528, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:41,227] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.16 | bwd_microstep: 1961.21 | bwd_inner_microstep: 1733.25 | bwd_allreduce_microstep: 227.88 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13726 total_samples=16532, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:44,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91 [2025-08-03 05:03:44,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.27 | bwd_microstep: 2338.12 | bwd_inner_microstep: 2328.33 | bwd_allreduce_microstep: 9.72 | step_microstep: 112.03 [2025-08-03 05:03:44,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.50 | bwd: 7929.66 | bwd_inner: 7351.69 | bwd_allreduce: 577.72 | step: 112.62 {'loss': 0.7569, 'learning_rate': 9.062141297718372e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13428 total_samples=16536, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:47,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.45 | bwd_microstep: 1805.34 | bwd_inner_microstep: 1703.64 | bwd_allreduce_microstep: 101.63 | step_microstep: 0.25 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12324 total_samples=16540, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:49,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.73 | bwd_microstep: 1960.45 | bwd_inner_microstep: 1785.27 | bwd_allreduce_microstep: 175.12 | step_microstep: 0.11 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12052 total_samples=16544, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:52,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.06 | bwd_microstep: 1789.99 | bwd_inner_microstep: 1596.08 | bwd_allreduce_microstep: 193.84 | 
step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12846 total_samples=16548, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:54,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09 [2025-08-03 05:03:54,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.74 | bwd_microstep: 1748.47 | bwd_inner_microstep: 1664.97 | bwd_allreduce_microstep: 83.44 | step_microstep: 116.23 [2025-08-03 05:03:54,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.91 | bwd: 7304.31 | bwd_inner: 6749.95 | bwd_allreduce: 554.10 | step: 116.70 {'loss': 0.7608, 'learning_rate': 9.046020133735455e-06, 'epoch': 0.54} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13533 total_samples=16552, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:03:57,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.50 | bwd_microstep: 1955.53 | bwd_inner_microstep: 1715.64 | bwd_allreduce_microstep: 239.83 | step_microstep: 0.12 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12106 total_samples=16556, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:00,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.71 | bwd_microstep: 1734.66 | bwd_inner_microstep: 1566.24 | bwd_allreduce_microstep: 168.36 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876 total_samples=16559, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:02,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 847.60 | bwd_microstep: 1809.35 | bwd_inner_microstep: 1593.01 | bwd_allreduce_microstep: 216.26 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12042 total_samples=16562, num_samples=3, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:04:05,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.57 [2025-08-03 05:04:05,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.03 | bwd_microstep: 2004.73 | bwd_inner_microstep: 1778.62 | bwd_allreduce_microstep: 226.04 | step_microstep: 130.65 [2025-08-03 05:04:05,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2940.77 | bwd: 7504.33 | bwd_inner: 6653.49 | bwd_allreduce: 850.57 | step: 131.16 {'loss': 0.7553, 'learning_rate': 9.02990147145352e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13294 total_samples=16566, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:08,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.20 | bwd_microstep: 1721.92 | bwd_inner_microstep: 1672.81 | bwd_allreduce_microstep: 49.04 | step_microstep: 0.19 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14470 total_samples=16570, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:11,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.58 | bwd_microstep: 2104.37 | bwd_inner_microstep: 1903.69 | bwd_allreduce_microstep: 200.61 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11902 total_samples=16573, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:14,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 750.11 | bwd_microstep: 2035.31 | bwd_inner_microstep: 1833.46 | bwd_allreduce_microstep: 201.78 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12037 total_samples=16576, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:17,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:04:17,224] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.07 | bwd_microstep: 2213.61 | bwd_inner_microstep: 2012.25 | bwd_allreduce_microstep: 201.30 | step_microstep: 134.18 [2025-08-03 05:04:17,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.88 | bwd: 8075.27 | bwd_inner: 7422.21 | bwd_allreduce: 652.82 | step: 134.75 {'loss': 0.737, 'learning_rate': 9.013785353141887e-06, 'epoch': 0.55} 54%|█████▍ | 1086/2000 [3:20:37<2:46:25, 10.93s/it] 54%|█████▍ | 1087/2000 [3:20:48<2:44:49, 10.83s/it] 54%|█████▍ | 1087/2000 [3:20:48<2:44:49, 10.83s/it] 54%|█████▍ | 1088/2000 [3:20:59<2:46:16, 10.94s/it] 54%|█████▍ | 1088/2000 [3:20:59<2:46:16, 10.94s/it] 54%|█████▍ | 1089/2000 [3:21:09<2:44:13, 10.82s/it] 54%|█████▍ | 1089/2000 [3:21:09<2:44:13, 10.82s/it] 55%|█████▍ | 1090/2000 [3:21:20<2:44:24, 10.84s/it] 55%|█████▍ | 1090/2000 [3:21:20<2:44:24, 10.84s/it] 55%|█████▍ | 1091/2000 [3:21:32<2:46:36, 11.00s/it] 55%|█dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13725 total_samples=16580, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:20,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.25 | bwd_microstep: 2045.88 | bwd_inner_microstep: 1891.80 | bwd_allreduce_microstep: 154.02 | step_microstep: 0.09 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14036 total_samples=16584, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:22,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.83 | bwd_microstep: 1796.03 | bwd_inner_microstep: 1736.49 | bwd_allreduce_microstep: 59.47 | step_microstep: 0.31 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11622 total_samples=16587, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:25,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.02 | bwd_microstep: 1781.89 | 
bwd_inner_microstep: 1549.44 | bwd_allreduce_microstep: 232.39 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12754 total_samples=16590, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:27,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58 [2025-08-03 05:04:27,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.00 | bwd_microstep: 1816.62 | bwd_inner_microstep: 1598.21 | bwd_allreduce_microstep: 218.33 | step_microstep: 127.12 [2025-08-03 05:04:27,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.04 | bwd: 7440.49 | bwd_inner: 6775.93 | bwd_allreduce: 664.30 | step: 127.65 {'loss': 0.749, 'learning_rate': 8.99767182106319e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13490 total_samples=16594, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:30,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.64 | bwd_microstep: 1881.19 | bwd_inner_microstep: 1827.13 | bwd_allreduce_microstep: 53.99 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13058 total_samples=16598, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:33,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.08 | bwd_microstep: 1753.75 | bwd_inner_microstep: 1684.04 | bwd_allreduce_microstep: 69.64 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12294 total_samples=16603, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:35,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.04 | bwd_microstep: 1784.21 | bwd_inner_microstep: 1558.03 | bwd_allreduce_microstep: 226.12 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 
11573 total_samples=16606, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:38,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72 [2025-08-03 05:04:38,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.61 | bwd_microstep: 1748.79 | bwd_inner_microstep: 1529.62 | bwd_allreduce_microstep: 219.09 | step_microstep: 153.08 [2025-08-03 05:04:38,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.29 | bwd: 7168.00 | bwd_inner: 6598.82 | bwd_allreduce: 568.92 | step: 153.66 {'loss': 0.7593, 'learning_rate': 8.981560917473292e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407 total_samples=16610, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:40,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.12 | bwd_microstep: 1767.64 | bwd_inner_microstep: 1693.60 | bwd_allreduce_microstep: 73.98 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13397 total_samples=16614, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:43,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.23 | bwd_microstep: 1971.44 | bwd_inner_microstep: 1906.27 | bwd_allreduce_microstep: 65.09 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885 total_samples=16617, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:46,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.10 | bwd_microstep: 1817.40 | bwd_inner_microstep: 1572.79 | bwd_allreduce_microstep: 244.54 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11659 total_samples=16620, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:48,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_step: 14.18 [2025-08-03 05:04:48,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.61 | bwd_microstep: 1791.82 | bwd_inner_microstep: 1550.24 | bwd_allreduce_microstep: 241.52 | step_microstep: 109.67 [2025-08-03 05:04:48,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.99 | bwd: 7348.34 | bwd_inner: 6722.89 | bwd_allreduce: 625.21 | step: 110.24 {'loss': 0.7564, 'learning_rate': 8.965452684621164e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14158 total_samples=16625, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:51,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.11 | bwd_microstep: 1778.07 | bwd_inner_microstep: 1735.07 | bwd_allreduce_microstep: 42.93 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13532 total_samples=16629, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:54,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.72 | bwd_microstep: 1990.19 | bwd_inner_microstep: 1764.78 | bwd_allreduce_microstep: 225.31 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13444 total_samples=16633, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:56,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.37 | bwd_microstep: 1747.76 | bwd_inner_microstep: 1683.18 | bwd_allreduce_microstep: 64.51 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11754 total_samples=16636, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:04:59,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95 [2025-08-03 05:04:59,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.60 | bwd_microstep: 2049.37 | 
bwd_inner_microstep: 1826.21 | bwd_allreduce_microstep: 223.09 | step_microstep: 114.71 [2025-08-03 05:04:59,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2876.73 | bwd: 7565.43 | bwd_inner: 7009.27 | bwd_allreduce: 555.90 | step: 115.15 {'loss': 0.7453, 'learning_rate': 8.949347164748761e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13215 total_samples=16640, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:02,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.00 | bwd_microstep: 1851.62 | bwd_inner_microstep: 1729.23 | bwd_allreduce_microstep: 122.34 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13445 total_samples=16644, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:04,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.67 | bwd_microstep: 1742.52 | bwd_inner_microstep: 1682.18 | bwd_allreduce_microstep: 60.27 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11851 total_samples=16647, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:07,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.18 | bwd_microstep: 1707.14 | bwd_inner_microstep: 1538.31 | bwd_allreduce_microstep: 168.77 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11969 total_samples=16650, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:10,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.31 [2025-08-03 05:05:10,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.92 | bwd_microstep: 1910.92 | bwd_inner_microstep: 1745.65 | bwd_allreduce_microstep: 165.20 | step_microstep: 138.48 [2025-08-03 05:05:10,150] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd: 2805.70 | bwd: 7212.25 | bwd_inner: 6695.35 | bwd_allreduce: 516.66 | step: 139.06 {'loss': 0.7425, 'learning_rate': 8.933244400090937e-06, 'epoch': 0.55} ████▍ | 1091/2000 [3:21:32<2:46:36, 11.00s/it] 55%|█████▍ | 1092/2000 [3:21:42<2:44:51, 10.89s/it] 55%|█████▍ | 1092/2000 [3:21:42<2:44:51, 10.89s/it] 55%|█████▍ | 1093/2000 [3:21:53<2:42:27, 10.75s/it] 55%|█████▍ | 1093/2000 [3:21:53<2:42:27, 10.75s/it] 55%|█████▍ | 1094/2000 [3:22:03<2:41:29, 10.70s/it] 55%|█████▍ | 1094/2000 [3:22:03<2:41:29, 10.70s/it] 55%|█████▍ | 1095/2000 [3:22:14<2:41:59, 10.74s/it] 55%|█████▍ | 1095/2000 [3:22:14<2:41:59, 10.74s/it] 55%|█████▍ | 1096/2000 [3:22:25<2:40:30, 10.65s/it] 55%|█████▍ | 1096/2000 [3:22:25<2:40:30, 10.65s/it]dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15086 total_samples=16654, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:13,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.36 | bwd_microstep: 2121.86 | bwd_inner_microstep: 1834.91 | bwd_allreduce_microstep: 286.88 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14143 total_samples=16658, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:15,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.22 | bwd_microstep: 2063.38 | bwd_inner_microstep: 1921.80 | bwd_allreduce_microstep: 141.51 | step_microstep: 0.94 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12265 total_samples=16661, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:18,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.49 | bwd_microstep: 2008.42 | bwd_inner_microstep: 1784.98 | bwd_allreduce_microstep: 223.37 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14833 total_samples=16665, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:05:21,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60 [2025-08-03 05:05:21,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.11 | bwd_microstep: 1856.89 | bwd_inner_microstep: 1751.33 | bwd_allreduce_microstep: 105.49 | step_microstep: 129.22 [2025-08-03 05:05:21,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.11 | bwd: 8050.60 | bwd_inner: 7293.01 | bwd_allreduce: 757.33 | step: 130.43 {'loss': 0.7551, 'learning_rate': 8.91714443287531e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13700 total_samples=16670, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:24,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.29 | bwd_microstep: 1731.73 | bwd_inner_microstep: 1675.99 | bwd_allreduce_microstep: 55.67 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13341 total_samples=16674, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:26,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.02 | bwd_microstep: 1802.72 | bwd_inner_microstep: 1725.11 | bwd_allreduce_microstep: 77.54 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11835 total_samples=16677, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:29,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.86 | bwd_microstep: 2283.25 | bwd_inner_microstep: 1989.93 | bwd_allreduce_microstep: 293.23 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13210 total_samples=16682, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:32,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.03 [2025-08-03 05:05:32,656] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.80 | bwd_microstep: 2056.92 | bwd_inner_microstep: 1877.89 | bwd_allreduce_microstep: 178.97 | step_microstep: 135.73 [2025-08-03 05:05:32,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.92 | bwd: 7874.68 | bwd_inner: 7268.91 | bwd_allreduce: 605.50 | step: 136.36 {'loss': 0.7472, 'learning_rate': 8.901047305322172e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13972 total_samples=16686, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:35,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.18 | bwd_microstep: 2015.70 | bwd_inner_microstep: 1889.27 | bwd_allreduce_microstep: 126.38 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13961 total_samples=16690, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:38,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.12 | bwd_microstep: 1789.08 | bwd_inner_microstep: 1736.68 | bwd_allreduce_microstep: 52.32 | step_microstep: 0.29 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12122 total_samples=16693, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:40,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.57 | bwd_microstep: 1782.48 | bwd_inner_microstep: 1585.34 | bwd_allreduce_microstep: 197.08 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13573 total_samples=16697, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:43,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61 [2025-08-03 05:05:43,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.65 | bwd_microstep: 2166.11 | bwd_inner_microstep: 2000.15 | bwd_allreduce_microstep: 165.88 | 
step_microstep: 138.49 [2025-08-03 05:05:43,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.44 | bwd: 7753.43 | bwd_inner: 7211.43 | bwd_allreduce: 541.74 | step: 139.04 {'loss': 0.7561, 'learning_rate': 8.88495305964436e-06, 'epoch': 0.55} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13391 total_samples=16701, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:46,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.39 | bwd_microstep: 1803.57 | bwd_inner_microstep: 1689.73 | bwd_allreduce_microstep: 113.78 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13803 total_samples=16705, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:48,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.19 | bwd_microstep: 1804.99 | bwd_inner_microstep: 1741.69 | bwd_allreduce_microstep: 63.24 | step_microstep: 0.12 dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 15137 total_samples=16708, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:51,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.22 | bwd_microstep: 1804.79 | bwd_inner_microstep: 1721.11 | bwd_allreduce_microstep: 83.61 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14636 total_samples=16712, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:54,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 20.56 [2025-08-03 05:05:54,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.25 | bwd_microstep: 1745.35 | bwd_inner_microstep: 1719.95 | bwd_allreduce_microstep: 25.33 | step_microstep: 124.40 [2025-08-03 05:05:54,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.98 | bwd: 7158.76 | bwd_inner: 6872.47 | 
bwd_allreduce: 286.03 | step: 124.77 {'loss': 0.7579, 'learning_rate': 8.868861738047158e-06, 'epoch': 0.55} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13479 total_samples=16716, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:56,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.43 | bwd_microstep: 1880.76 | bwd_inner_microstep: 1703.75 | bwd_allreduce_microstep: 176.94 | step_microstep: 0.29 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13015 total_samples=16720, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:05:59,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.20 | bwd_microstep: 1989.77 | bwd_inner_microstep: 1884.84 | bwd_allreduce_microstep: 104.86 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13264 total_samples=16724, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:02,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.46 | bwd_microstep: 1764.70 | bwd_inner_microstep: 1669.70 | bwd_allreduce_microstep: 94.93 | step_microstep: 0.27 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15238 total_samples=16729, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:04,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06 [2025-08-03 05:06:04,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.94 | bwd_microstep: 1753.12 | bwd_inner_microstep: 1747.02 | bwd_allreduce_microstep: 6.04 | step_microstep: 114.27 [2025-08-03 05:06:04,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.95 | bwd: 7388.40 | bwd_inner: 7005.30 | bwd_allreduce: 382.85 | step: 114.97 {'loss': 0.7557, 'learning_rate': 8.852773382728184e-06, 'epoch': 0.55} dynamic ViT batch size: 47, images per sample: 47.0, 
dynamic token length: 12847 total_samples=16733, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:07,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.18 | bwd_microstep: 1785.87 | bwd_inner_microstep: 1664.11 | bwd_allreduce_microstep: 121.70 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13045 total_samples=16737, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:10,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.03 | bwd_microstep: 2015.25 | bwd_inner_microstep: 2003.04 | bwd_allreduce_microstep: 12.15 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13718 total_samples=16741, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:12,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.80 | bwd_microstep: 1828.80 | bwd_inner_microstep: 1749.18 | bwd_allreduce_microstep: 79.55 | step_microstep: 0.19 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12907 total_samples=16745, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:15,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65 [2025-08-03 05:06:15,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1833.46 | bwd_inner_microstep: 1773.40 | bwd_allreduce_microstep: 59.99 | step_microstep: 148.90 [2025-08-03 05:06:15,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.54 | bwd: 7463.44 | bwd_inner: 7189.73 | bwd_allreduce: 273.48 | step: 149.43 55%|█████▍ | 1097/2000 [3:22:36<2:43:31, 10.87s/it] 55%|█████▍ | 1098/2000 [3:22:47<2:44:36, 10.95s/it] 55%|█████▍ | 1099/2000 [3:22:58<2:44:46, 10.97s/it] 55%|█████▌ | 1100/2000 [3:23:08<2:42:05, 10.81s/it] 55%|█████▌ | 1101/2000 [3:23:19<2:41:00, 10.75s/it]
{'loss': 0.7642, 'learning_rate': 8.836688035877268e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14330 total_samples=16750, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:17,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.79 | bwd_microstep: 1727.24 | bwd_inner_microstep: 1698.73 | bwd_allreduce_microstep: 28.45 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13481 total_samples=16754, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:20,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.15 | bwd_microstep: 1806.22 | bwd_inner_microstep: 1702.11 | bwd_allreduce_microstep: 104.03 | step_microstep: 0.32 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13255 total_samples=16758, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:23,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.42 | bwd_microstep: 1858.28 | bwd_inner_microstep: 1807.50 | bwd_allreduce_microstep: 50.72 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12518 total_samples=16761, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:26,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 36.66 [2025-08-03 05:06:26,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.40 | bwd_microstep: 1904.65 | bwd_inner_microstep: 1601.74 | bwd_allreduce_microstep: 302.84 | step_microstep: 162.85 [2025-08-03 05:06:26,057] [INFO]
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.69 | bwd: 7296.44 | bwd_inner: 6810.07 | bwd_allreduce: 486.13 | step: 163.42 {'loss': 0.7465, 'learning_rate': 8.820605739676363e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14254 total_samples=16765, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:28,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.42 | bwd_microstep: 2002.25 | bwd_inner_microstep: 1875.94 | bwd_allreduce_microstep: 126.23 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332 total_samples=16769, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:31,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.26 | bwd_microstep: 1793.07 | bwd_inner_microstep: 1705.28 | bwd_allreduce_microstep: 87.72 | step_microstep: 0.70 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468 total_samples=16773, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:34,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.61 | bwd_microstep: 1846.65 | bwd_inner_microstep: 1722.30 | bwd_allreduce_microstep: 124.27 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13567 total_samples=16777, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:36,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.29 [2025-08-03 05:06:36,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.09 | bwd_microstep: 1751.30 | bwd_inner_microstep: 1690.94 | bwd_allreduce_microstep: 60.27 | step_microstep: 145.05 [2025-08-03 05:06:36,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.31 | bwd: 7393.33 | bwd_inner: 6994.47 | bwd_allreduce: 398.59 | step: 146.14 {'loss': 0.7503, 
'learning_rate': 8.804526536299413e-06, 'epoch': 0.55} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14049 total_samples=16782, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:39,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.27 | bwd_microstep: 2157.80 | bwd_inner_microstep: 2052.74 | bwd_allreduce_microstep: 105.00 | step_microstep: 0.24 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13646 total_samples=16786, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:42,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.15 | bwd_microstep: 1971.00 | bwd_inner_microstep: 1867.31 | bwd_allreduce_microstep: 103.61 | step_microstep: 0.18 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13319 total_samples=16790, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:45,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.62 | bwd_microstep: 1977.78 | bwd_inner_microstep: 1722.60 | bwd_allreduce_microstep: 255.13 | step_microstep: 0.68 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11720 total_samples=16793, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:48,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04 [2025-08-03 05:06:48,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.41 | bwd_microstep: 1996.23 | bwd_inner_microstep: 1813.21 | bwd_allreduce_microstep: 182.95 | step_microstep: 120.77 [2025-08-03 05:06:48,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2867.38 | bwd: 8102.86 | bwd_inner: 7455.85 | bwd_allreduce: 646.77 | step: 121.88 {'loss': 0.7456, 'learning_rate': 8.788450467912254e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13693 total_samples=16797, 
num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:51,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.71 | bwd_microstep: 2231.14 | bwd_inner_microstep: 2132.58 | bwd_allreduce_microstep: 98.50 | step_microstep: 0.21 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12935 total_samples=16801, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:53,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.74 | bwd_microstep: 1795.45 | bwd_inner_microstep: 1674.31 | bwd_allreduce_microstep: 121.04 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13527 total_samples=16805, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:56,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.44 | bwd_microstep: 1734.79 | bwd_inner_microstep: 1691.89 | bwd_allreduce_microstep: 42.83 | step_microstep: 0.25 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13748 total_samples=16809, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:06:59,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48 [2025-08-03 05:06:59,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.66 | bwd_microstep: 2141.23 | bwd_inner_microstep: 1934.26 | bwd_allreduce_microstep: 206.91 | step_microstep: 145.14 [2025-08-03 05:06:59,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.47 | bwd: 7902.67 | bwd_inner: 7433.04 | bwd_allreduce: 469.38 | step: 145.86 {'loss': 0.7584, 'learning_rate': 8.772377576672502e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13831 total_samples=16815, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:01,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.50 | 
bwd_microstep: 1860.17 | bwd_inner_microstep: 1832.50 | bwd_allreduce_microstep: 27.60 | step_microstep: 0.27 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12828 total_samples=16819, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:04,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.82 | bwd_microstep: 2005.03 | bwd_inner_microstep: 1844.81 | bwd_allreduce_microstep: 160.16 | step_microstep: 0.16 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14871 total_samples=16823, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:07,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.93 | bwd_microstep: 1781.42 | bwd_inner_microstep: 1745.20 | bwd_allreduce_microstep: 36.15 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12060 total_samples=16826, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:10,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.32 [2025-08-03 05:07:10,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.53 | bwd_microstep: 1964.78 | bwd_inner_microstep: 1756.16 | bwd_allreduce_microstep: 208.54 | step_microstep: 158.41 [2025-08-03 05:07:10,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.70 | bwd: 7611.46 | bwd_inner: 7178.67 | bwd_allreduce: 432.54 | step: 158.98 55%|█████▌ | 1102/2000 [3:23:30<2:40:46, 10.74s/it] 55%|█████▌ | 1103/2000 [3:23:40<2:40:02, 10.70s/it] 55%|█████▌ | 1104/2000 [3:23:51<2:39:41, 10.69s/it] 55%|█████▌ | 1105/2000 [3:24:02<2:42:35, 10.90s/it] 55%|█████▌ | 1106/2000 [3:24:14<2:43:41, 10.99s/it] 55%|█████▌ | 1107/2000 [3:24:25<2:42:58, 10.95s/it]
{'loss': 0.7537, 'learning_rate': 8.75630790472944e-06, 'epoch': 0.55} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13830 total_samples=16831, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:12,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.98 | bwd_microstep: 2033.69 | bwd_inner_microstep: 1694.98 | bwd_allreduce_microstep: 338.64 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16278 total_samples=16835, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:15,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.79 | bwd_microstep: 1850.97 | bwd_inner_microstep: 1844.14 | bwd_allreduce_microstep: 6.75 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14075 total_samples=16841, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:18,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.03 | bwd_microstep: 1756.04 | bwd_inner_microstep: 1708.88 | bwd_allreduce_microstep: 47.09 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12214 total_samples=16844, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:20,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35 [2025-08-03 05:07:20,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.63 | bwd_microstep: 1881.62 | bwd_inner_microstep: 1600.35 | bwd_allreduce_microstep: 281.20 | step_microstep: 111.97 [2025-08-03 05:07:20,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.37 | bwd: 7522.38 | bwd_inner: 6848.34 | bwd_allreduce: 673.79 | step: 112.47 {'loss': 0.7545, 'learning_rate': 8.740241494223911e-06, 'epoch': 0.55} dynamic ViT batch size: 47, images per sample: 47.0,
dynamic token length: 14334 total_samples=16848, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:23,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.08 | bwd_microstep: 1915.68 | bwd_inner_microstep: 1908.92 | bwd_allreduce_microstep: 6.65 | step_microstep: 0.30 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11579 total_samples=16851, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:26,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.68 | bwd_microstep: 1947.47 | bwd_inner_microstep: 1728.49 | bwd_allreduce_microstep: 218.92 | step_microstep: 0.19 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12451 total_samples=16855, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:29,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.29 | bwd_microstep: 2021.96 | bwd_inner_microstep: 1834.64 | bwd_allreduce_microstep: 187.25 | step_microstep: 0.85 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15253 total_samples=16860, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:32,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36 [2025-08-03 05:07:32,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.85 | bwd_microstep: 1960.78 | bwd_inner_microstep: 1791.77 | bwd_allreduce_microstep: 168.94 | step_microstep: 137.07 [2025-08-03 05:07:32,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.83 | bwd: 7845.94 | bwd_inner: 7263.84 | bwd_allreduce: 581.82 | step: 138.42 {'loss': 0.7548, 'learning_rate': 8.724178387288202e-06, 'epoch': 0.55} dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15510 total_samples=16865, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:34,632] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.24 | bwd_microstep: 1779.90 | bwd_inner_microstep: 1717.02 | bwd_allreduce_microstep: 62.81 | step_microstep: 0.12 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13319 total_samples=16869, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:37,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.74 | bwd_microstep: 1912.39 | bwd_inner_microstep: 1818.88 | bwd_allreduce_microstep: 93.44 | step_microstep: 0.16 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12011 total_samples=16872, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:39,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.34 | bwd_microstep: 1765.88 | bwd_inner_microstep: 1560.73 | bwd_allreduce_microstep: 205.09 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12184 total_samples=16875, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:42,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01 [2025-08-03 05:07:42,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.60 | bwd_microstep: 2222.66 | bwd_inner_microstep: 1954.48 | bwd_allreduce_microstep: 268.11 | step_microstep: 108.62 [2025-08-03 05:07:42,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.85 | bwd: 7680.88 | bwd_inner: 7051.10 | bwd_allreduce: 629.53 | step: 109.03 {'loss': 0.7478, 'learning_rate': 8.708118626045939e-06, 'epoch': 0.56} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15425 total_samples=16879, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:45,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.94 | bwd_microstep: 1822.68 | bwd_inner_microstep: 1764.43 | bwd_allreduce_microstep: 58.18 | 
step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13469 total_samples=16884, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:48,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.88 | bwd_microstep: 1802.44 | bwd_inner_microstep: 1710.16 | bwd_allreduce_microstep: 92.22 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13181 total_samples=16888, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:50,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.42 | bwd_microstep: 1938.84 | bwd_inner_microstep: 1888.63 | bwd_allreduce_microstep: 50.13 | step_microstep: 0.20 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12037 total_samples=16891, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:53,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33 [2025-08-03 05:07:53,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.52 | bwd_microstep: 2122.28 | bwd_inner_microstep: 1932.21 | bwd_allreduce_microstep: 189.99 | step_microstep: 113.56 [2025-08-03 05:07:53,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.70 | bwd: 7686.28 | bwd_inner: 7295.41 | bwd_allreduce: 390.60 | step: 114.16 {'loss': 0.7509, 'learning_rate': 8.692062252611973e-06, 'epoch': 0.56} dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12674 total_samples=16895, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:07:56,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.02 | bwd_microstep: 1816.34 | bwd_inner_microstep: 1600.34 | bwd_allreduce_microstep: 215.93 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14319 total_samples=16899, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:07:59,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.42 | bwd_microstep: 2052.85 | bwd_inner_microstep: 1882.54 | bwd_allreduce_microstep: 170.24 | step_microstep: 0.11 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12503 total_samples=16903, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:01,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.86 | bwd_microstep: 1743.50 | bwd_inner_microstep: 1579.89 | bwd_allreduce_microstep: 163.54 | step_microstep: 0.17 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11788 total_samples=16906, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:04,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40 [2025-08-03 05:08:04,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.22 | bwd_microstep: 1724.17 | bwd_inner_microstep: 1527.45 | bwd_allreduce_microstep: 196.64 | step_microstep: 471.74 [2025-08-03 05:08:04,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.45 | bwd: 7336.91 | bwd_inner: 6590.21 | bwd_allreduce: 746.43 | step: 472.15 {'loss': 0.7498, 'learning_rate': 8.676009309092273e-06, 'epoch': 0.56} 55%|█████▌ | 1107/2000 [3:24:25<2:42:58, 10.95s/it] 55%|█████▌ | 1108/2000 [3:24:35<2:42:05, 10.90s/it] 55%|█████▌ | 1109/2000 [3:24:46<2:42:48, 10.96s/it] 56%|█████▌ | 1110/2000 [3:24:57<2:42:25, 10.95s/it] 56%|█████▌ | 1111/2000 [3:25:08<2:42:17, 10.95s/it] 56%|█████▌ | 1112/2000 [3:25:19<2:42:01, 10.95s/it] dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13550 total_samples=16910, num_samples=4, 
num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:07,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.90 | bwd_microstep: 2087.93 | bwd_inner_microstep: 1908.64 | bwd_allreduce_microstep: 179.23 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12199 total_samples=16913, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:10,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.22 | bwd_microstep: 1738.54 | bwd_inner_microstep: 1552.20 | bwd_allreduce_microstep: 186.27 | step_microstep: 0.11 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13204 total_samples=16917, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:12,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.84 | bwd_microstep: 1787.20 | bwd_inner_microstep: 1630.29 | bwd_allreduce_microstep: 156.84 | step_microstep: 0.45 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11625 total_samples=16920, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:15,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18 [2025-08-03 05:08:15,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.87 | bwd_microstep: 1862.70 | bwd_inner_microstep: 1597.58 | bwd_allreduce_microstep: 265.05 | step_microstep: 133.18 [2025-08-03 05:08:15,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.76 | bwd: 7476.42 | bwd_inner: 6688.71 | bwd_allreduce: 787.46 | step: 133.86 {'loss': 0.7539, 'learning_rate': 8.659959837583808e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13409 total_samples=16924, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:18,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.91 | bwd_microstep: 
1975.62 | bwd_inner_microstep: 1851.53 | bwd_allreduce_microstep: 124.02 | step_microstep: 0.18 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12866 total_samples=16927, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:20,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.99 | bwd_microstep: 1726.20 | bwd_inner_microstep: 1579.34 | bwd_allreduce_microstep: 146.75 | step_microstep: 1.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13649 total_samples=16931, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:23,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.21 | bwd_microstep: 1795.90 | bwd_inner_microstep: 1710.85 | bwd_allreduce_microstep: 84.98 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13542 total_samples=16936, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:26,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03 [2025-08-03 05:08:26,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.34 | bwd_microstep: 1891.53 | bwd_inner_microstep: 1711.28 | bwd_allreduce_microstep: 180.18 | step_microstep: 135.56 [2025-08-03 05:08:26,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.38 | bwd: 7389.30 | bwd_inner: 6853.00 | bwd_allreduce: 536.05 | step: 136.99 {'loss': 0.7477, 'learning_rate': 8.643913880174449e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13148 total_samples=16940, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:28,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.94 | bwd_microstep: 1809.41 | bwd_inner_microstep: 1689.67 | bwd_allreduce_microstep: 119.67 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token 
length: 13760 total_samples=16944, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:31,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.26 | bwd_microstep: 1716.71 | bwd_inner_microstep: 1678.37 | bwd_allreduce_microstep: 38.27 | step_microstep: 0.26 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13421 total_samples=16949, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:33,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.80 | bwd_microstep: 1822.73 | bwd_inner_microstep: 1688.62 | bwd_allreduce_microstep: 134.05 | step_microstep: 0.88 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12927 total_samples=16953, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:36,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.76 [2025-08-03 05:08:36,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.97 | bwd_microstep: 1788.14 | bwd_inner_microstep: 1732.29 | bwd_allreduce_microstep: 55.78 | step_microstep: 114.58 [2025-08-03 05:08:36,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.90 | bwd: 7137.05 | bwd_inner: 6788.95 | bwd_allreduce: 347.85 | step: 115.86 {'loss': 0.7491, 'learning_rate': 8.62787147894285e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13418 total_samples=16957, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:39,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.36 | bwd_microstep: 2071.71 | bwd_inner_microstep: 1905.04 | bwd_allreduce_microstep: 166.62 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11541 total_samples=16960, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:42,027] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 684.91 | bwd_microstep: 1750.19 | bwd_inner_microstep: 1524.80 | bwd_allreduce_microstep: 225.32 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11822 total_samples=16963, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:44,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.09 | bwd_microstep: 1889.26 | bwd_inner_microstep: 1556.51 | bwd_allreduce_microstep: 332.68 | step_microstep: 0.15 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11690 total_samples=16966, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:47,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45 [2025-08-03 05:08:47,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.92 | bwd_microstep: 2010.69 | bwd_inner_microstep: 1580.40 | bwd_allreduce_microstep: 430.15 | step_microstep: 110.80 [2025-08-03 05:08:47,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.21 | bwd: 7721.90 | bwd_inner: 6566.78 | bwd_allreduce: 1154.81 | step: 111.18 {'loss': 0.7402, 'learning_rate': 8.611832675958335e-06, 'epoch': 0.56} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11821 total_samples=16969, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:50,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.46 | bwd_microstep: 2024.17 | bwd_inner_microstep: 1817.89 | bwd_allreduce_microstep: 206.21 | step_microstep: 0.12 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14983 total_samples=16973, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:53,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.53 | bwd_microstep: 1824.01 | bwd_inner_microstep: 1678.80 | bwd_allreduce_microstep: 145.14 | step_microstep: 0.24 dynamic ViT batch 
size: 48, images per sample: 48.0, dynamic token length: 14101 total_samples=16977, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:55,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.48 | bwd_microstep: 1753.44 | bwd_inner_microstep: 1713.31 | bwd_allreduce_microstep: 40.07 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13142 total_samples=16981, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:08:58,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48 [2025-08-03 05:08:58,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.66 | bwd_microstep: 1741.17 | bwd_inner_microstep: 1662.48 | bwd_allreduce_microstep: 78.62 | step_microstep: 406.08 [2025-08-03 05:08:58,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.06 | bwd: 7342.84 | bwd_inner: 6872.48 | bwd_allreduce: 470.12 | step: 406.55 {'loss': 0.7495, 'learning_rate': 8.595797513280799e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13355 total_samples=16985, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:01,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.54 | bwd_microstep: 1817.49 | bwd_inner_microstep: 1700.33 | bwd_allreduce_microstep: 117.09 | step_microstep: 0.28 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11574 total_samples=16988, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:04,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.87 | bwd_microstep: 2276.00 | bwd_inner_microstep: 2033.27 | bwd_allreduce_microstep: 242.66 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13325 total_samples=16992, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
05:09:06,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.78 | bwd_microstep: 1824.10 | bwd_inner_microstep: 1719.52 | bwd_allreduce_microstep: 104.52 | step_microstep: 0.15 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12709 total_samples=16996, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:09,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42 [2025-08-03 05:09:09,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.80 | bwd_microstep: 1927.14 | bwd_inner_microstep: 1619.71 | bwd_allreduce_microstep: 307.37 | step_microstep: 154.19 [2025-08-03 05:09:09,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.92 | bwd: 7844.78 | bwd_inner: 7072.83 | bwd_allreduce: 771.71 | step: 154.73 2:01, 10.95s/it] 56%|█████▌ | 1113/2000 [3:25:30<2:40:51, 10.88s/it] 56%|█████▌ | 1113/2000 [3:25:30<2:40:51, 10.88s/it] 56%|█████▌ | 1114/2000 [3:25:41<2:39:48, 10.82s/it] 56%|█████▌ | 1114/2000 [3:25:41<2:39:48, 10.82s/it] 56%|█████▌ | 1115/2000 [3:25:51<2:37:38, 10.69s/it] 56%|█████▌ | 1115/2000 [3:25:51<2:37:38, 10.69s/it] 56%|█████▌ | 1116/2000 [3:26:02<2:38:38, 10.77s/it] 56%|█████▌ | 1116/2000 [3:26:02<2:38:38, 10.77s/it] 56%|█████▌ | 1117/2000 [3:26:13<2:38:46, 10.79s/it] 56%|█████▌ | 1117/2000 [3:26:13<2:38:46, 10.79s/it] 56%|█████▌ | 1118/2000 [3:{'loss': 0.7394, 'learning_rate': 8.579766032960582e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13681 total_samples=17002, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:12,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.36 | bwd_microstep: 1782.03 | bwd_inner_microstep: 1676.14 | bwd_allreduce_microstep: 105.82 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11632 total_samples=17005, num_samples=3, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:09:14,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.08 | bwd_microstep: 1929.65 | bwd_inner_microstep: 1730.89 | bwd_allreduce_microstep: 198.70 | step_microstep: 0.30 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13353 total_samples=17009, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:17,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.76 | bwd_microstep: 1815.64 | bwd_inner_microstep: 1748.43 | bwd_allreduce_microstep: 67.14 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11941 total_samples=17012, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:20,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20 [2025-08-03 05:09:20,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.27 | bwd_microstep: 2215.31 | bwd_inner_microstep: 1790.57 | bwd_allreduce_microstep: 424.63 | step_microstep: 109.26 [2025-08-03 05:09:20,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.40 | bwd: 7742.69 | bwd_inner: 6946.04 | bwd_allreduce: 796.36 | step: 109.80 {'loss': 0.7505, 'learning_rate': 8.563738277038376e-06, 'epoch': 0.56} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12296 total_samples=17015, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:23,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.47 | bwd_microstep: 1991.25 | bwd_inner_microstep: 1773.90 | bwd_allreduce_microstep: 217.27 | step_microstep: 0.18 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12464 total_samples=17018, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:26,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.76 | bwd_microstep: 2079.03 | 
bwd_inner_microstep: 1876.04 | bwd_allreduce_microstep: 202.93 | step_microstep: 0.79 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12073 total_samples=17021, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:28,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.42 | bwd_microstep: 1924.92 | bwd_inner_microstep: 1759.76 | bwd_allreduce_microstep: 165.08 | step_microstep: 0.13 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13197 total_samples=17025, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:31,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.17 [2025-08-03 05:09:31,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.30 | bwd_microstep: 1716.80 | bwd_inner_microstep: 1625.22 | bwd_allreduce_microstep: 91.49 | step_microstep: 153.36 [2025-08-03 05:09:31,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.88 | bwd: 7712.05 | bwd_inner: 7034.91 | bwd_allreduce: 676.86 | step: 154.46 {'loss': 0.7543, 'learning_rate': 8.5477142875451e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13405 total_samples=17029, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:34,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.48 | bwd_microstep: 1834.37 | bwd_inner_microstep: 1728.22 | bwd_allreduce_microstep: 106.08 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11897 total_samples=17032, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:36,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.13 | bwd_microstep: 1813.21 | bwd_inner_microstep: 1583.98 | bwd_allreduce_microstep: 229.16 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 
14097 total_samples=17036, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:39,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.60 | bwd_microstep: 2079.78 | bwd_inner_microstep: 1937.85 | bwd_allreduce_microstep: 141.86 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11614 total_samples=17039, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:42,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.89 [2025-08-03 05:09:42,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.77 | bwd_microstep: 1857.87 | bwd_inner_microstep: 1540.50 | bwd_allreduce_microstep: 317.28 | step_microstep: 540.53 [2025-08-03 05:09:42,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.92 | bwd: 7585.29 | bwd_inner: 6790.54 | bwd_allreduce: 794.47 | step: 541.00 {'loss': 0.7494, 'learning_rate': 8.531694106501796e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14929 total_samples=17043, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:45,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.83 | bwd_microstep: 1769.64 | bwd_inner_microstep: 1745.25 | bwd_allreduce_microstep: 24.31 | step_microstep: 0.26 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11846 total_samples=17046, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:48,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.64 | bwd_microstep: 2042.01 | bwd_inner_microstep: 1969.47 | bwd_allreduce_microstep: 72.48 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13529 total_samples=17050, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:50,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 712.14 | bwd_microstep: 1939.50 | bwd_inner_microstep: 1739.66 | bwd_allreduce_microstep: 199.77 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876 total_samples=17053, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:53,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69 [2025-08-03 05:09:53,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.40 | bwd_microstep: 1713.91 | bwd_inner_microstep: 1537.87 | bwd_allreduce_microstep: 175.98 | step_microstep: 125.10 [2025-08-03 05:09:53,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.95 | bwd: 7465.11 | bwd_inner: 6992.25 | bwd_allreduce: 472.62 | step: 125.75 {'loss': 0.7513, 'learning_rate': 8.515677775919528e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14031 total_samples=17057, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:56,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.56 | bwd_microstep: 1971.93 | bwd_inner_microstep: 1702.96 | bwd_allreduce_microstep: 268.90 | step_microstep: 0.30 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11846 total_samples=17060, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:09:58,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.75 | bwd_microstep: 2059.51 | bwd_inner_microstep: 1900.58 | bwd_allreduce_microstep: 158.88 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11575 total_samples=17063, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:02,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.52 | bwd_microstep: 2384.53 | bwd_inner_microstep: 2151.13 | bwd_allreduce_microstep: 233.32 | step_microstep: 0.26 dynamic ViT batch size: 
42, images per sample: 42.0, dynamic token length: 11537 total_samples=17066, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:04,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43 [2025-08-03 05:10:04,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.25 | bwd_microstep: 1802.77 | bwd_inner_microstep: 1553.80 | bwd_allreduce_microstep: 248.91 | step_microstep: 118.08 [2025-08-03 05:10:04,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.01 | bwd: 8218.79 | bwd_inner: 7308.46 | bwd_allreduce: 910.09 | step: 118.76 26:24<2:40:00, 10.89s/it] 56%|█████▌ | 1118/2000 [3:26:24<2:40:00, 10.89s/it] 56%|█████▌ | 1119/2000 [3:26:35<2:39:54, 10.89s/it] 56%|█████▌ | 1119/2000 [3:26:35<2:39:54, 10.89s/it] 56%|█████▌ | 1120/2000 [3:26:46<2:40:10, 10.92s/it] 56%|█████▌ | 1120/2000 [3:26:46<2:40:10, 10.92s/it] 56%|█████▌ | 1121/2000 [3:26:57<2:41:15, 11.01s/it] 56%|█████▌ | 1121/2000 [3:26:57<2:41:15, 11.01s/it] 56%|█████▌ | 1122/2000 [3:27:08<2:39:40, 10.91s/it] 56%|█████▌ | 1122/2000 [3:27:08<2:39:40, 10.91s/it] 56%|█████▌ | 1123/2000 [3:27:19<2:41:47, 11.07s/it] {'loss': 0.7639, 'learning_rate': 8.499665337799254e-06, 'epoch': 0.56} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13051 total_samples=17070, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:07,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.98 | bwd_microstep: 1812.09 | bwd_inner_microstep: 1684.35 | bwd_allreduce_microstep: 127.68 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12275 total_samples=17073, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:10,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.32 | bwd_microstep: 1825.78 | bwd_inner_microstep: 1567.96 | bwd_allreduce_microstep: 257.76 | step_microstep: 0.24 dynamic ViT 
batch size: 42, images per sample: 42.0, dynamic token length: 13107 total_samples=17076, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:12,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.80 | bwd_microstep: 2060.29 | bwd_inner_microstep: 1702.10 | bwd_allreduce_microstep: 358.11 | step_microstep: 0.15 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13385 total_samples=17080, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:15,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26 [2025-08-03 05:10:15,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.24 | bwd_microstep: 1973.82 | bwd_inner_microstep: 1901.85 | bwd_allreduce_microstep: 71.90 | step_microstep: 141.09 [2025-08-03 05:10:15,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2861.27 | bwd: 7672.03 | bwd_inner: 6856.26 | bwd_allreduce: 815.52 | step: 141.60 {'loss': 0.7425, 'learning_rate': 8.48365683413172e-06, 'epoch': 0.56} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15978 total_samples=17084, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:18,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.79 | bwd_microstep: 2079.49 | bwd_inner_microstep: 1817.75 | bwd_allreduce_microstep: 261.68 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12078 total_samples=17087, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:21,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.44 | bwd_microstep: 1804.16 | bwd_inner_microstep: 1579.52 | bwd_allreduce_microstep: 224.58 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13834 total_samples=17091, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
05:10:23,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.04 | bwd_microstep: 1947.54 | bwd_inner_microstep: 1906.63 | bwd_allreduce_microstep: 40.83 | step_microstep: 0.32 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13635 total_samples=17095, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:26,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45 [2025-08-03 05:10:26,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.95 | bwd_microstep: 1989.13 | bwd_inner_microstep: 1753.87 | bwd_allreduce_microstep: 235.18 | step_microstep: 140.82 [2025-08-03 05:10:26,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.15 | bwd: 7820.38 | bwd_inner: 7057.77 | bwd_allreduce: 762.36 | step: 141.49 {'loss': 0.7485, 'learning_rate': 8.46765230689737e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13690 total_samples=17099, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:29,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.48 | bwd_microstep: 1864.08 | bwd_inner_microstep: 1727.45 | bwd_allreduce_microstep: 136.57 | step_microstep: 0.10 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12474 total_samples=17104, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:32,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.68 | bwd_microstep: 1767.54 | bwd_inner_microstep: 1620.18 | bwd_allreduce_microstep: 147.28 | step_microstep: 0.17 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 15316 total_samples=17108, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:34,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.77 | bwd_microstep: 1972.10 | bwd_inner_microstep: 1777.10 | 
bwd_allreduce_microstep: 194.93 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13192 total_samples=17112, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:37,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64 [2025-08-03 05:10:37,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.46 | bwd_microstep: 1811.17 | bwd_inner_microstep: 1694.09 | bwd_allreduce_microstep: 117.02 | step_microstep: 113.40 [2025-08-03 05:10:37,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.28 | bwd: 7414.95 | bwd_inner: 6818.82 | bwd_allreduce: 595.89 | step: 113.81 {'loss': 0.7469, 'learning_rate': 8.451651798066203e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13733 total_samples=17116, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:40,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.06 | bwd_microstep: 1916.59 | bwd_inner_microstep: 1831.34 | bwd_allreduce_microstep: 85.19 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938 total_samples=17119, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:42,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.50 | bwd_microstep: 1701.19 | bwd_inner_microstep: 1541.43 | bwd_allreduce_microstep: 159.70 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446 total_samples=17123, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:45,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.26 | bwd_microstep: 1800.60 | bwd_inner_microstep: 1705.09 | bwd_allreduce_microstep: 95.45 | step_microstep: 0.13 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13298 total_samples=17127, 
num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:47,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.02 [2025-08-03 05:10:47,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.11 | bwd_microstep: 1784.70 | bwd_inner_microstep: 1665.81 | bwd_allreduce_microstep: 118.81 | step_microstep: 142.76 [2025-08-03 05:10:47,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.86 | bwd: 7203.13 | bwd_inner: 6743.66 | bwd_allreduce: 459.23 | step: 143.26 {'loss': 0.7526, 'learning_rate': 8.43565534959769e-06, 'epoch': 0.56} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12139 total_samples=17130, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:51,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.83 | bwd_microstep: 2423.53 | bwd_inner_microstep: 2198.84 | bwd_allreduce_microstep: 224.61 | step_microstep: 0.90 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13450 total_samples=17134, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:53,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.44 | bwd_microstep: 1846.46 | bwd_inner_microstep: 1714.54 | bwd_allreduce_microstep: 131.86 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13244 total_samples=17138, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:56,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.93 | bwd_microstep: 1855.53 | bwd_inner_microstep: 1719.94 | bwd_allreduce_microstep: 135.51 | step_microstep: 0.15 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12003 total_samples=17141, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:10:59,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19 
[2025-08-03 05:10:59,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.48 | bwd_microstep: 2107.14 | bwd_inner_microstep: 1890.08 | bwd_allreduce_microstep: 217.00 | step_microstep: 132.25 [2025-08-03 05:10:59,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.61 | bwd: 8232.74 | bwd_inner: 7523.41 | bwd_allreduce: 709.05 | step: 133.42 {'loss': 0.7537, 'learning_rate': 8.419663003440657e-06, 'epoch': 0.56} 56%|█████▌ | 1123/2000 [3:27:19<2:41:47, 11.07s/it] 56%|█████▌ | 1124/2000 [3:27:30<2:41:11, 11.04s/it] 56%|█████▌ | 1124/2000 [3:27:30<2:41:11, 11.04s/it] 56%|█████▋ | 1125/2000 [3:27:41<2:41:14, 11.06s/it] 56%|█████▋ | 1125/2000 [3:27:41<2:41:14, 11.06s/it] 56%|█████▋ | 1126/2000 [3:27:52<2:39:12, 10.93s/it] 56%|█████▋ | 1126/2000 [3:27:52<2:39:12, 10.93s/it] 56%|█████▋ | 1127/2000 [3:28:02<2:37:05, 10.80s/it] 56%|█████▋ | 1127/2000 [3:28:02<2:37:05, 10.80s/it] 56%|█████▋ | 1128/2000 [3:28:14<2:39:44, 10.99s/it] 56%|█████▋ | 1128/2dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12204 total_samples=17144, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:02,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.59 | bwd_microstep: 2513.69 | bwd_inner_microstep: 2482.06 | bwd_allreduce_microstep: 31.55 | step_microstep: 0.19 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13713 total_samples=17148, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:05,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.34 | bwd_microstep: 1948.09 | bwd_inner_microstep: 1853.97 | bwd_allreduce_microstep: 94.05 | step_microstep: 0.77 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11769 total_samples=17151, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:08,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
684.30 | bwd_microstep: 2061.89 | bwd_inner_microstep: 1766.79 | bwd_allreduce_microstep: 295.02 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13247 total_samples=17155, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:11,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.94 [2025-08-03 05:11:11,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.56 | bwd_microstep: 1809.00 | bwd_inner_microstep: 1696.27 | bwd_allreduce_microstep: 112.65 | step_microstep: 157.93 [2025-08-03 05:11:11,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.72 | bwd: 8332.74 | bwd_inner: 7799.08 | bwd_allreduce: 533.38 | step: 159.02 {'loss': 0.7508, 'learning_rate': 8.40367480153316e-06, 'epoch': 0.56} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13221 total_samples=17159, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:14,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.08 | bwd_microstep: 2153.24 | bwd_inner_microstep: 1941.67 | bwd_allreduce_microstep: 211.51 | step_microstep: 0.15 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12045 total_samples=17162, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:16,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.16 | bwd_microstep: 1789.27 | bwd_inner_microstep: 1566.49 | bwd_allreduce_microstep: 222.71 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12182 total_samples=17165, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:19,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.50 | bwd_microstep: 1781.44 | bwd_inner_microstep: 1603.83 | bwd_allreduce_microstep: 177.54 | step_microstep: 0.27 dynamic ViT batch size: 42, images per 
sample: 42.0, dynamic token length: 11887 total_samples=17168, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:21,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.08 [2025-08-03 05:11:21,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.20 | bwd_microstep: 1719.16 | bwd_inner_microstep: 1534.50 | bwd_allreduce_microstep: 184.59 | step_microstep: 127.61 [2025-08-03 05:11:21,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.85 | bwd: 7443.16 | bwd_inner: 6646.47 | bwd_allreduce: 796.42 | step: 128.16 {'loss': 0.7458, 'learning_rate': 8.387690785802403e-06, 'epoch': 0.56} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11862 total_samples=17171, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:24,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.54 | bwd_microstep: 1793.10 | bwd_inner_microstep: 1573.85 | bwd_allreduce_microstep: 219.17 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13252 total_samples=17175, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:26,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.19 | bwd_microstep: 1827.24 | bwd_inner_microstep: 1697.04 | bwd_allreduce_microstep: 130.12 | step_microstep: 0.20 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11958 total_samples=17178, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:29,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.64 | bwd_microstep: 1841.14 | bwd_inner_microstep: 1558.83 | bwd_allreduce_microstep: 282.24 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12998 total_samples=17182, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:32,392] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31 [2025-08-03 05:11:32,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.83 | bwd_microstep: 1975.98 | bwd_inner_microstep: 1803.21 | bwd_allreduce_microstep: 172.69 | step_microstep: 111.49 [2025-08-03 05:11:32,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.12 | bwd: 7437.51 | bwd_inner: 6632.94 | bwd_allreduce: 804.32 | step: 111.94 {'loss': 0.7496, 'learning_rate': 8.371710998164595e-06, 'epoch': 0.57} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12053 total_samples=17185, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:35,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.84 | bwd_microstep: 2210.12 | bwd_inner_microstep: 2043.35 | bwd_allreduce_microstep: 166.71 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11713 total_samples=17188, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:38,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.90 | bwd_microstep: 1868.49 | bwd_inner_microstep: 1737.75 | bwd_allreduce_microstep: 130.67 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11652 total_samples=17191, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:40,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.96 | bwd_microstep: 1942.02 | bwd_inner_microstep: 1758.80 | bwd_allreduce_microstep: 183.15 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11786 total_samples=17194, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:43,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.94 [2025-08-03 05:11:43,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
719.40 | bwd_microstep: 1881.26 | bwd_inner_microstep: 1739.43 | bwd_allreduce_microstep: 141.75 | step_microstep: 131.93 [2025-08-03 05:11:43,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.02 | bwd: 7901.94 | bwd_inner: 7279.33 | bwd_allreduce: 622.36 | step: 132.26 {'loss': 0.7516, 'learning_rate': 8.355735480524874e-06, 'epoch': 0.57} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14691 total_samples=17198, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:46,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.21 | bwd_microstep: 1845.88 | bwd_inner_microstep: 1750.26 | bwd_allreduce_microstep: 95.56 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12370 total_samples=17202, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:48,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.18 | bwd_microstep: 1727.93 | bwd_inner_microstep: 1558.94 | bwd_allreduce_microstep: 168.92 | step_microstep: 0.13 dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 16280 total_samples=17206, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:51,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.16 | bwd_microstep: 1989.57 | bwd_inner_microstep: 1755.07 | bwd_allreduce_microstep: 234.43 | step_microstep: 0.14 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12412 total_samples=17210, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:54,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22 [2025-08-03 05:11:54,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.01 | bwd_microstep: 1982.92 | bwd_inner_microstep: 1569.57 | bwd_allreduce_microstep: 413.28 | step_microstep: 129.44 [2025-08-03 05:11:54,349] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.49 | bwd: 7546.33 | bwd_inner: 6633.83 | bwd_allreduce: 912.26 | step: 129.83 {'loss': 0.7416, 'learning_rate': 8.339764274777165e-06, 'epoch': 0.57} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13146 total_samples=17214, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:56,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.87 | bwd_microstep: 1827.46 | bwd_inner_microstep: 1702.32 | bwd_allreduce_microstep: 125.08 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11948 total_samples=17217, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:11:59,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.38 | bwd_microstep: 1753.57 | bwd_inner_microstep: 1550.67 | bwd_allreduce_microstep: 202.83 | step_microstep: 0.33 dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 16344 total_samples=17220, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:02,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.88 | bwd_microstep: 1829.30 | bwd_inner_microstep: 1776.82 | bwd_allreduce_microstep: 52.42 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13271 total_samples=17224, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:04,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19 [2025-08-03 05:12:04,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.29 | bwd_microstep: 1960.84 | bwd_inner_microstep: 1903.32 | bwd_allreduce_microstep: 57.46 | step_microstep: 114.06 [2025-08-03 05:12:04,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.35 | bwd: 7371.23 | bwd_inner: 6933.12 | bwd_allreduce: 437.87 | step: 114.62 000 [3:28:14<2:39:44, 
10.99s/it] 56%|█████▋ | 1129/2000 [3:28:25<2:42:12, 11.17s/it] 56%|█████▋ | 1129/2000 [3:28:25<2:42:12, 11.17s/it] 56%|█████▋ | 1130/2000 [3:28:36<2:39:54, 11.03s/it] 56%|█████▋ | 1130/2000 [3:28:36<2:39:54, 11.03s/it] 57%|█████▋ | 1131/2000 [3:28:47<2:38:14, 10.93s/it] 57%|█████▋ | 1131/2000 [3:28:47<2:38:14, 10.93s/it] 57%|█████▋ | 1132/2000 [3:28:58<2:39:03, 11.00s/it] 57%|█████▋ | 1132/2000 [3:28:58<2:39:03, 11.00s/it] 57%|█████▋ | 1133/2000 [3:29:09<2:38:00, 10.94s/it] 57%|█████▋ | 1133/2000 [3:29:09<2:38:00, 10.94s/it] 57%|█████▋ {'loss': 0.7566, 'learning_rate': 8.3237974228041e-06, 'epoch': 0.57} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13793 total_samples=17228, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:07,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.25 | bwd_microstep: 1816.04 | bwd_inner_microstep: 1720.57 | bwd_allreduce_microstep: 95.41 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11562 total_samples=17231, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:10,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.85 | bwd_microstep: 1714.53 | bwd_inner_microstep: 1535.35 | bwd_allreduce_microstep: 179.13 | step_microstep: 0.12 dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11320 total_samples=17234, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:12,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.53 | bwd_microstep: 2057.33 | bwd_inner_microstep: 1785.04 | bwd_allreduce_microstep: 272.23 | step_microstep: 0.26 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12899 total_samples=17238, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:15,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14 
[2025-08-03 05:12:15,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.84 | bwd_microstep: 2141.73 | bwd_inner_microstep: 1884.02 | bwd_allreduce_microstep: 257.64 | step_microstep: 112.02 [2025-08-03 05:12:15,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.40 | bwd: 7729.68 | bwd_inner: 6924.96 | bwd_allreduce: 804.49 | step: 112.54 {'loss': 0.7423, 'learning_rate': 8.307834966476885e-06, 'epoch': 0.57} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11948 total_samples=17241, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:18,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.28 | bwd_microstep: 1760.36 | bwd_inner_microstep: 1550.06 | bwd_allreduce_microstep: 210.22 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13555 total_samples=17246, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:21,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.66 | bwd_microstep: 1864.90 | bwd_inner_microstep: 1718.70 | bwd_allreduce_microstep: 146.13 | step_microstep: 0.13 dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11487 total_samples=17249, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:23,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.02 | bwd_microstep: 1710.66 | bwd_inner_microstep: 1507.05 | bwd_allreduce_microstep: 203.52 | step_microstep: 0.14 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13083 total_samples=17253, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:26,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58 [2025-08-03 05:12:26,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.94 | bwd_microstep: 1952.14 | bwd_inner_microstep: 1685.94 | 
bwd_allreduce_microstep: 266.14 | step_microstep: 171.55 [2025-08-03 05:12:26,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.83 | bwd: 7288.11 | bwd_inner: 6461.75 | bwd_allreduce: 826.10 | step: 171.96 {'loss': 0.7459, 'learning_rate': 8.291876947655197e-06, 'epoch': 0.57} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13462 total_samples=17257, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:29,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.99 | bwd_microstep: 2006.68 | bwd_inner_microstep: 1737.31 | bwd_allreduce_microstep: 269.29 | step_microstep: 0.13 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15947 total_samples=17262, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:31,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.60 | bwd_microstep: 1816.65 | bwd_inner_microstep: 1743.62 | bwd_allreduce_microstep: 72.96 | step_microstep: 0.12 dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11214 total_samples=17265, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:34,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.30 | bwd_microstep: 2010.84 | bwd_inner_microstep: 1776.45 | bwd_allreduce_microstep: 234.33 | step_microstep: 0.18 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13364 total_samples=17269, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:37,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.20 [2025-08-03 05:12:37,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.10 | bwd_microstep: 1751.25 | bwd_inner_microstep: 1646.53 | bwd_allreduce_microstep: 104.64 | step_microstep: 141.02 [2025-08-03 05:12:37,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.92 | bwd: 
7585.46 | bwd_inner: 6903.90 | bwd_allreduce: 681.32 | step: 141.45 {'loss': 0.747, 'learning_rate': 8.275923408187086e-06, 'epoch': 0.57} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12800 total_samples=17273, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:40,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.46 | bwd_microstep: 2035.61 | bwd_inner_microstep: 1627.91 | bwd_allreduce_microstep: 407.64 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12089 total_samples=17276, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:43,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.83 | bwd_microstep: 2166.81 | bwd_inner_microstep: 1942.27 | bwd_allreduce_microstep: 224.46 | step_microstep: 0.23 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13215 total_samples=17280, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:45,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.77 | bwd_microstep: 2119.74 | bwd_inner_microstep: 2003.90 | bwd_allreduce_microstep: 115.76 | step_microstep: 0.32 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14078 total_samples=17284, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:48,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97 [2025-08-03 05:12:48,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.39 | bwd_microstep: 1973.00 | bwd_inner_microstep: 1785.23 | bwd_allreduce_microstep: 187.70 | step_microstep: 126.93 [2025-08-03 05:12:48,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.39 | bwd: 8295.21 | bwd_inner: 7359.31 | bwd_allreduce: 935.65 | step: 127.62 {'loss': 0.7524, 'learning_rate': 8.259974389908842e-06, 'epoch': 0.57} dynamic ViT batch 
size: 48, images per sample: 48.0, dynamic token length: 13707 total_samples=17288, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:51,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.38 | bwd_microstep: 2057.98 | bwd_inner_microstep: 1895.98 | bwd_allreduce_microstep: 161.94 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12056 total_samples=17291, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:54,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.76 | bwd_microstep: 1859.48 | bwd_inner_microstep: 1615.07 | bwd_allreduce_microstep: 244.34 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13211 total_samples=17295, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:57,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.64 | bwd_microstep: 1874.70 | bwd_inner_microstep: 1831.80 | bwd_allreduce_microstep: 42.84 | step_microstep: 0.14 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12926 total_samples=17299, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:12:59,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95 [2025-08-03 05:12:59,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.85 | bwd_microstep: 1846.32 | bwd_inner_microstep: 1679.49 | bwd_allreduce_microstep: 166.76 | step_microstep: 108.96 [2025-08-03 05:12:59,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.56 | bwd: 7638.54 | bwd_inner: 7022.33 | bwd_allreduce: 615.97 | step: 109.45 | 1134/2000 [3:29:19<2:36:21, 10.83s/it] 57%|█████▋ | 1134/2000 [3:29:19<2:36:21, 10.83s/it] 57%|█████▋ | 1135/2000 [3:29:30<2:36:47, 10.88s/it] 57%|█████▋ | 1135/2000 [3:29:30<2:36:47, 10.88s/it] 57%|█████▋ | 1136/2000 [3:29:41<2:35:14, 10.78s/it] 
57%|█████▋ | 1136/2000 [3:29:41<2:35:14, 10.78s/it] 57%|█████▋ | 1137/2000 [3:29:52<2:35:25, 10.81s/it] 57%|█████▋ | 1137/2000 [3:29:52<2:35:25, 10.81s/it] 57%|█████▋ | 1138/2000 [3:30:03<2:38:20, 11.02s/it] 57%|█████▋ | 1138/2000 [3:30:03<2:38:20, 11.02s/it] 57%|█████▋ | 1139/2000 [3:30:14<2:37:45, 10.99s/it] {'loss': 0.7534, 'learning_rate': 8.244029934644916e-06, 'epoch': 0.57} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13463 total_samples=17303, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:02,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.64 | bwd_microstep: 1745.47 | bwd_inner_microstep: 1671.90 | bwd_allreduce_microstep: 73.49 | step_microstep: 0.28 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12167 total_samples=17306, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:05,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.94 | bwd_microstep: 2066.69 | bwd_inner_microstep: 1849.63 | bwd_allreduce_microstep: 217.00 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13752 total_samples=17310, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:07,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.45 | bwd_microstep: 1817.09 | bwd_inner_microstep: 1729.94 | bwd_allreduce_microstep: 87.08 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809 total_samples=17313, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:10,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12 [2025-08-03 05:13:10,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.71 | bwd_microstep: 2138.87 | bwd_inner_microstep: 1910.83 | bwd_allreduce_microstep: 227.97 | step_microstep: 130.01 [2025-08-03 
05:13:10,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2879.66 | bwd: 7768.18 | bwd_inner: 7162.30 | bwd_allreduce: 605.64 | step: 130.54 {'loss': 0.7475, 'learning_rate': 8.228090084207773e-06, 'epoch': 0.57} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13524 total_samples=17317, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:13,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.10 | bwd_microstep: 1817.43 | bwd_inner_microstep: 1709.19 | bwd_allreduce_microstep: 108.17 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11761 total_samples=17320, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:16,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.98 | bwd_microstep: 1794.85 | bwd_inner_microstep: 1575.97 | bwd_allreduce_microstep: 218.81 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13528 total_samples=17324, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:18,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.47 | bwd_microstep: 1757.27 | bwd_inner_microstep: 1688.80 | bwd_allreduce_microstep: 68.41 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13651 total_samples=17328, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:21,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.25 [2025-08-03 05:13:21,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.62 | bwd_microstep: 1733.11 | bwd_inner_microstep: 1681.51 | bwd_allreduce_microstep: 51.53 | step_microstep: 138.78 [2025-08-03 05:13:21,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.08 | bwd: 7102.73 | bwd_inner: 6655.47 | bwd_allreduce: 447.01 | step: 139.15 {'loss': 
0.7536, 'learning_rate': 8.212154880397817e-06, 'epoch': 0.57} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13658 total_samples=17332, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:24,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.18 | bwd_microstep: 2082.92 | bwd_inner_microstep: 2076.44 | bwd_allreduce_microstep: 6.41 | step_microstep: 0.16 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13501 total_samples=17336, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:26,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.41 | bwd_microstep: 2046.96 | bwd_inner_microstep: 1913.53 | bwd_allreduce_microstep: 133.36 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15245 total_samples=17341, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:29,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.41 | bwd_microstep: 1871.23 | bwd_inner_microstep: 1773.93 | bwd_allreduce_microstep: 97.23 | step_microstep: 0.93 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13669 total_samples=17345, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:32,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22 [2025-08-03 05:13:32,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.67 | bwd_microstep: 2015.43 | bwd_inner_microstep: 1872.60 | bwd_allreduce_microstep: 142.76 | step_microstep: 136.35 [2025-08-03 05:13:32,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.60 | bwd: 8016.59 | bwd_inner: 7636.49 | bwd_allreduce: 379.85 | step: 137.57 {'loss': 0.7499, 'learning_rate': 8.196224365003267e-06, 'epoch': 0.57} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12918 total_samples=17349, 
num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:35,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.25 | bwd_microstep: 1838.17 | bwd_inner_microstep: 1692.96 | bwd_allreduce_microstep: 145.14 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13375 total_samples=17353, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:37,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.13 | bwd_microstep: 1982.39 | bwd_inner_microstep: 1886.02 | bwd_allreduce_microstep: 96.30 | step_microstep: 0.27 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14338 total_samples=17357, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:40,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.79 | bwd_microstep: 2026.86 | bwd_inner_microstep: 1892.70 | bwd_allreduce_microstep: 134.10 | step_microstep: 0.11 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13065 total_samples=17361, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:43,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24 [2025-08-03 05:13:43,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 919.36 | bwd_microstep: 2097.70 | bwd_inner_microstep: 1951.44 | bwd_allreduce_microstep: 146.18 | step_microstep: 109.92 [2025-08-03 05:13:43,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3034.47 | bwd: 7945.18 | bwd_inner: 7423.11 | bwd_allreduce: 521.80 | step: 110.42 {'loss': 0.7536, 'learning_rate': 8.180298579800034e-06, 'epoch': 0.57} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12679 total_samples=17365, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:46,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.11 | 
bwd_microstep: 1912.47 | bwd_inner_microstep: 1837.85 | bwd_allreduce_microstep: 74.55 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14882 total_samples=17369, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:49,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.91 | bwd_microstep: 1781.37 | bwd_inner_microstep: 1748.98 | bwd_allreduce_microstep: 32.32 | step_microstep: 0.19 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13377 total_samples=17374, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:51,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.87 | bwd_microstep: 1747.64 | bwd_inner_microstep: 1682.63 | bwd_allreduce_microstep: 64.95 | step_microstep: 0.24 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12460 total_samples=17378, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:54,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11 [2025-08-03 05:13:54,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.73 | bwd_microstep: 2040.27 | bwd_inner_microstep: 1875.90 | bwd_allreduce_microstep: 164.31 | step_microstep: 111.60 [2025-08-03 05:13:54,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.55 | bwd: 7481.82 | bwd_inner: 7145.35 | bwd_allreduce: 336.21 | step: 112.17 {'loss': 0.7474, 'learning_rate': 8.16437756655164e-06, 'epoch': 0.57} 57%|█████▋ | 1139/2000 [3:30:14<2:37:45, 10.99s/it] 57%|█████▋ | 1140/2000 [3:30:25<2:37:51, 11.01s/it] 57%|█████▋ | 1140/2000 [3:30:25<2:37:51, 11.01s/it] 57%|█████▋ | 1141/2000 [3:30:36<2:34:48, 10.81s/it] 57%|█████▋ | 1141/2000 [3:30:36<2:34:48, 10.81s/it] 57%|█████▋ | 1142/2000 [3:30:47<2:36:34, 10.95s/it] 57%|█████▋ | 1142/2000 [3:30:47<2:36:34, 10.95s/it] 57%|█████▋ | 1143/2000 [3:30:58<2:38:17, 11.08s/it] 57%|█████▋ | 
1143/2000 [3:30:58<2:38:17, 11.08s/it] 57%|█████▋ | 1144/2000 [3:31:09<2:36:21, 10.96s/it] 57%|████dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13223 total_samples=17382, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:57,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.23 | bwd_microstep: 1717.34 | bwd_inner_microstep: 1659.27 | bwd_allreduce_microstep: 58.00 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13287 total_samples=17386, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:13:59,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.02 | bwd_microstep: 2016.01 | bwd_inner_microstep: 1886.90 | bwd_allreduce_microstep: 129.04 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14465 total_samples=17391, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:02,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.99 | bwd_microstep: 1999.84 | bwd_inner_microstep: 1792.37 | bwd_allreduce_microstep: 207.41 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11686 total_samples=17394, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:05,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24 [2025-08-03 05:14:05,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.98 | bwd_microstep: 1718.33 | bwd_inner_microstep: 1525.76 | bwd_allreduce_microstep: 192.49 | step_microstep: 134.55 [2025-08-03 05:14:05,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.16 | bwd: 7451.60 | bwd_inner: 6864.31 | bwd_allreduce: 587.03 | step: 135.05 {'loss': 0.7375, 'learning_rate': 8.148461367009081e-06, 'epoch': 0.57} dynamic ViT batch size: 47, images per sample: 47.0, dynamic 
token length: 13614 total_samples=17400, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:07,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.77 | bwd_microstep: 1744.04 | bwd_inner_microstep: 1690.66 | bwd_allreduce_microstep: 53.31 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13904 total_samples=17405, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:10,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.82 | bwd_microstep: 1841.82 | bwd_inner_microstep: 1804.28 | bwd_allreduce_microstep: 37.47 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13283 total_samples=17409, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:12,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.27 | bwd_microstep: 1749.59 | bwd_inner_microstep: 1706.37 | bwd_allreduce_microstep: 43.15 | step_microstep: 0.29 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12401 total_samples=17412, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:15,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01 [2025-08-03 05:14:15,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.73 | bwd_microstep: 2028.03 | bwd_inner_microstep: 1808.59 | bwd_allreduce_microstep: 219.37 | step_microstep: 116.33 [2025-08-03 05:14:15,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.53 | bwd: 7363.53 | bwd_inner: 7009.91 | bwd_allreduce: 353.37 | step: 116.97 {'loss': 0.7389, 'learning_rate': 8.132550022910737e-06, 'epoch': 0.57} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13228 total_samples=17416, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:18,353] [INFO] [logging.py:96:log_dist] [Rank 
0] time (ms) | fwd_microstep: 682.66 | bwd_microstep: 1823.46 | bwd_inner_microstep: 1765.33 | bwd_allreduce_microstep: 58.07 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14627 total_samples=17420, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:20,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.59 | bwd_microstep: 1820.82 | bwd_inner_microstep: 1744.28 | bwd_allreduce_microstep: 76.47 | step_microstep: 0.27 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13445 total_samples=17424, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:23,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.20 | bwd_microstep: 1770.92 | bwd_inner_microstep: 1666.19 | bwd_allreduce_microstep: 104.67 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11532 total_samples=17427, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:26,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19 [2025-08-03 05:14:26,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.37 | bwd_microstep: 1778.41 | bwd_inner_microstep: 1540.87 | bwd_allreduce_microstep: 237.48 | step_microstep: 114.77 [2025-08-03 05:14:26,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.74 | bwd: 7193.67 | bwd_inner: 6716.66 | bwd_allreduce: 476.77 | step: 115.32 {'loss': 0.748, 'learning_rate': 8.116643575982254e-06, 'epoch': 0.57} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13491 total_samples=17431, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:28,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.98 | bwd_microstep: 1727.34 | bwd_inner_microstep: 1673.81 | bwd_allreduce_microstep: 53.47 | step_microstep: 0.12 dynamic ViT 
batch size: 48, images per sample: 48.0, dynamic token length: 13316 total_samples=17435, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:31,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.54 | bwd_microstep: 1819.04 | bwd_inner_microstep: 1707.71 | bwd_allreduce_microstep: 111.26 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13229 total_samples=17439, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:33,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.19 | bwd_microstep: 1821.50 | bwd_inner_microstep: 1715.97 | bwd_allreduce_microstep: 105.48 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11811 total_samples=17442, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:36,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07 [2025-08-03 05:14:36,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.33 | bwd_microstep: 2166.37 | bwd_inner_microstep: 1697.65 | bwd_allreduce_microstep: 468.65 | step_microstep: 108.25 [2025-08-03 05:14:36,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.97 | bwd: 7534.31 | bwd_inner: 6795.14 | bwd_allreduce: 738.94 | step: 108.73 {'loss': 0.7511, 'learning_rate': 8.100742067936432e-06, 'epoch': 0.57} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12569 total_samples=17445, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:39,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.86 | bwd_microstep: 2055.35 | bwd_inner_microstep: 1823.63 | bwd_allreduce_microstep: 231.65 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14135 total_samples=17449, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
05:14:42,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.76 | bwd_microstep: 1822.43 | bwd_inner_microstep: 1738.76 | bwd_allreduce_microstep: 83.61 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13728 total_samples=17453, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:45,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.20 | bwd_microstep: 2146.89 | bwd_inner_microstep: 2081.55 | bwd_allreduce_microstep: 65.27 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11925 total_samples=17456, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:48,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04 [2025-08-03 05:14:48,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.53 | bwd_microstep: 2054.16 | bwd_inner_microstep: 1822.09 | bwd_allreduce_microstep: 232.01 | step_microstep: 119.14 [2025-08-03 05:14:48,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.27 | bwd: 8078.88 | bwd_inner: 7466.03 | bwd_allreduce: 612.62 | step: 119.61 {'loss': 0.7472, 'learning_rate': 8.084845540473127e-06, 'epoch': 0.57} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12439 total_samples=17459, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:50,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.98 | bwd_microstep: 1842.38 | bwd_inner_microstep: 1703.12 | bwd_allreduce_microstep: 139.19 | step_microstep: 0.15 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12273 total_samples=17463, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:53,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.41 | bwd_microstep: 1820.93 | bwd_inner_microstep: 1591.04 | 
bwd_allreduce_microstep: 229.82 | step_microstep: 0.26 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770 total_samples=17466, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:56,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.99 | bwd_microstep: 1764.88 | bwd_inner_microstep: 1540.94 | bwd_allreduce_microstep: 223.87 | step_microstep: 0.12 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13534 total_samples=17470, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:14:58,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.99 [2025-08-03 05:14:58,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.30 | bwd_microstep: 1879.97 | bwd_inner_microstep: 1675.14 | bwd_allreduce_microstep: 204.72 | step_microstep: 142.36 [2025-08-03 05:14:58,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.62 | bwd: 7308.21 | bwd_inner: 6510.24 | bwd_allreduce: 797.71 | step: 142.90 ▋ | 1144/2000 [3:31:09<2:36:21, 10.96s/it] 57%|█████▋ | 1145/2000 [3:31:20<2:34:53, 10.87s/it] 57%|█████▋ | 1145/2000 [3:31:20<2:34:53, 10.87s/it] 57%|█████▋ | 1146/2000 [3:31:30<2:33:20, 10.77s/it] 57%|█████▋ | 1146/2000 [3:31:30<2:33:20, 10.77s/it] 57%|█████▋ | 1147/2000 [3:31:41<2:31:44, 10.67s/it] 57%|█████▋ | 1147/2000 [3:31:41<2:31:44, 10.67s/it] 57%|█████▋ | 1148/2000 [3:31:51<2:31:57, 10.70s/it] 57%|█████▋ | 1148/2000 [3:31:51<2:31:57, 10.70s/it] 57%|█████▋ | 1149/2000 [3:32:03<2:34:18, 10.88s/it] 57%|█████▋ | 1149/2000 [3:32:03<2:34:18, 10.88s/it] 57%|█{'loss': 0.7465, 'learning_rate': 8.068954035279121e-06, 'epoch': 0.57} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13611 total_samples=17474, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:01,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.77 | bwd_microstep: 1801.15 
| bwd_inner_microstep: 1718.17 | bwd_allreduce_microstep: 82.91 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12388 total_samples=17477, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:04,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.95 | bwd_microstep: 2137.22 | bwd_inner_microstep: 1844.06 | bwd_allreduce_microstep: 293.08 | step_microstep: 0.34 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13709 total_samples=17481, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:06,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.90 | bwd_microstep: 1825.17 | bwd_inner_microstep: 1732.30 | bwd_allreduce_microstep: 92.80 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11597 total_samples=17484, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:09,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60 [2025-08-03 05:15:09,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.61 | bwd_microstep: 2046.25 | bwd_inner_microstep: 1813.71 | bwd_allreduce_microstep: 232.46 | step_microstep: 112.45 [2025-08-03 05:15:09,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.15 | bwd: 7809.85 | bwd_inner: 7108.24 | bwd_allreduce: 701.35 | step: 113.02 {'loss': 0.7445, 'learning_rate': 8.053067594028044e-06, 'epoch': 0.58} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13555 total_samples=17488, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:12,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.64 | bwd_microstep: 1985.58 | bwd_inner_microstep: 1854.50 | bwd_allreduce_microstep: 131.01 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 
13686 total_samples=17492, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:15,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.49 | bwd_microstep: 1855.11 | bwd_inner_microstep: 1739.18 | bwd_allreduce_microstep: 115.86 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11596 total_samples=17495, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:18,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.80 | bwd_microstep: 2044.69 | bwd_inner_microstep: 1830.39 | bwd_allreduce_microstep: 214.23 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13489 total_samples=17499, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:21,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64 [2025-08-03 05:15:21,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.00 | bwd_microstep: 1978.78 | bwd_inner_microstep: 1741.85 | bwd_allreduce_microstep: 236.86 | step_microstep: 118.16 [2025-08-03 05:15:21,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2877.86 | bwd: 7864.21 | bwd_inner: 7165.92 | bwd_allreduce: 698.05 | step: 118.66 {'loss': 0.7362, 'learning_rate': 8.037186258380226e-06, 'epoch': 0.58} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13837 total_samples=17503, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:23,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.21 | bwd_microstep: 2173.66 | bwd_inner_microstep: 2048.54 | bwd_allreduce_microstep: 125.06 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13190 total_samples=17507, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:26,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 717.81 | bwd_microstep: 1809.61 | bwd_inner_microstep: 1704.92 | bwd_allreduce_microstep: 104.62 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14912 total_samples=17511, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:29,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.30 | bwd_microstep: 1817.04 | bwd_inner_microstep: 1744.86 | bwd_allreduce_microstep: 72.10 | step_microstep: 0.18 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13729 total_samples=17516, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:31,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94 [2025-08-03 05:15:31,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.69 | bwd_microstep: 1825.77 | bwd_inner_microstep: 1728.25 | bwd_allreduce_microstep: 97.45 | step_microstep: 137.44 [2025-08-03 05:15:31,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.93 | bwd: 7626.13 | bwd_inner: 7226.56 | bwd_allreduce: 399.33 | step: 137.87 {'loss': 0.7475, 'learning_rate': 8.021310069982624e-06, 'epoch': 0.58} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12905 total_samples=17520, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:34,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.04 | bwd_microstep: 1807.18 | bwd_inner_microstep: 1688.67 | bwd_allreduce_microstep: 118.44 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14908 total_samples=17524, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:37,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.78 | bwd_microstep: 1803.50 | bwd_inner_microstep: 1763.54 | bwd_allreduce_microstep: 39.88 | step_microstep: 0.20 dynamic ViT batch size: 43, 
images per sample: 43.0, dynamic token length: 12069 total_samples=17528, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:39,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.64 | bwd_microstep: 1799.62 | bwd_inner_microstep: 1568.88 | bwd_allreduce_microstep: 230.68 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13207 total_samples=17532, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:42,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95 [2025-08-03 05:15:42,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.57 | bwd_microstep: 1765.56 | bwd_inner_microstep: 1688.46 | bwd_allreduce_microstep: 77.04 | step_microstep: 120.42 [2025-08-03 05:15:42,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.96 | bwd: 7175.92 | bwd_inner: 6709.55 | bwd_allreduce: 466.13 | step: 120.98 {'loss': 0.7634, 'learning_rate': 8.005439070468692e-06, 'epoch': 0.58} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13046 total_samples=17536, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:45,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.60 | bwd_microstep: 2053.44 | bwd_inner_microstep: 1869.13 | bwd_allreduce_microstep: 184.26 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14994 total_samples=17540, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:47,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.10 | bwd_microstep: 1839.36 | bwd_inner_microstep: 1770.51 | bwd_allreduce_microstep: 68.78 | step_microstep: 0.16 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11751 total_samples=17543, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:50,744] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.51 | bwd_microstep: 2185.35 | bwd_inner_microstep: 2059.64 | bwd_allreduce_microstep: 125.66 | step_microstep: 0.20 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13302 total_samples=17547, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:53,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11 [2025-08-03 05:15:53,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.94 | bwd_microstep: 1827.05 | bwd_inner_microstep: 1696.29 | bwd_allreduce_microstep: 130.69 | step_microstep: 160.10 [2025-08-03 05:15:53,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.09 | bwd: 7905.25 | bwd_inner: 7395.57 | bwd_allreduce: 509.45 | step: 160.57 ███▊ | 1150/2000 [3:32:13<2:32:49, 10.79s/it] 57%|█████▊ | 1150/2000 [3:32:13<2:32:49, 10.79s/it] 58%|█████▊ | 1151/2000 [3:32:24<2:33:50, 10.87s/it] 58%|█████▊ | 1151/2000 [3:32:24<2:33:50, 10.87s/it] 58%|█████▊ | 1152/2000 [3:32:35<2:34:51, 10.96s/it] 58%|█████▊ | 1152/2000 [3:32:35<2:34:51, 10.96s/it] 58%|█████▊ | 1153/2000 [3:32:46<2:34:14, 10.93s/it] 58%|█████▊ | 1153/2000 [3:32:46<2:34:14, 10.93s/it] 58%|█████▊ | 1154/2000 [3:32:57<2:32:01, 10.78s/it] 58%|█████▊ | 1154/2000 [3:32:57<2:32:01, 10.78s/it] 58%|█████▊ | 1155/2000 [3:33:08<2:33:26, 10.89s/it] {'loss': 0.7532, 'learning_rate': 7.989573301458274e-06, 'epoch': 0.58} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12020 total_samples=17550, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:15:56,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.64 | bwd_microstep: 1972.69 | bwd_inner_microstep: 1759.60 | bwd_allreduce_microstep: 213.02 | step_microstep: 0.17 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14874 total_samples=17554, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:15:58,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.95 | bwd_microstep: 1881.58 | bwd_inner_microstep: 1827.99 | bwd_allreduce_microstep: 53.52 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12384 total_samples=17557, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:01,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.61 | bwd_microstep: 1817.56 | bwd_inner_microstep: 1588.43 | bwd_allreduce_microstep: 229.06 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13929 total_samples=17561, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:04,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58 [2025-08-03 05:16:04,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.63 | bwd_microstep: 1922.10 | bwd_inner_microstep: 1764.52 | bwd_allreduce_microstep: 157.51 | step_microstep: 123.39 [2025-08-03 05:16:04,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.76 | bwd: 7594.00 | bwd_inner: 6940.54 | bwd_allreduce: 653.20 | step: 123.92 {'loss': 0.747, 'learning_rate': 7.9737128045575e-06, 'epoch': 0.58} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14048 total_samples=17565, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:07,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.65 | bwd_microstep: 1947.28 | bwd_inner_microstep: 1884.34 | bwd_allreduce_microstep: 62.86 | step_microstep: 0.95 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13041 total_samples=17569, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:10,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.64 | bwd_microstep: 2094.97 | 
bwd_inner_microstep: 1960.68 | bwd_allreduce_microstep: 134.23 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13206 total_samples=17573, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:12,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.37 | bwd_microstep: 1726.37 | bwd_inner_microstep: 1670.40 | bwd_allreduce_microstep: 55.91 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13298 total_samples=17577, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:15,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30 [2025-08-03 05:16:15,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.90 | bwd_microstep: 1851.63 | bwd_inner_microstep: 1719.68 | bwd_allreduce_microstep: 131.88 | step_microstep: 134.57 [2025-08-03 05:16:15,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.50 | bwd: 7620.32 | bwd_inner: 7235.10 | bwd_allreduce: 384.96 | step: 135.78 {'loss': 0.741, 'learning_rate': 7.957857621358674e-06, 'epoch': 0.58} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14675 total_samples=17581, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:18,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.41 | bwd_microstep: 2051.44 | bwd_inner_microstep: 1934.67 | bwd_allreduce_microstep: 116.70 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15623 total_samples=17585, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:20,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.77 | bwd_microstep: 1918.40 | bwd_inner_microstep: 1824.63 | bwd_allreduce_microstep: 93.71 | step_microstep: 0.23 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 
13198 total_samples=17589, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:23,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.45 | bwd_microstep: 1779.18 | bwd_inner_microstep: 1674.99 | bwd_allreduce_microstep: 104.13 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13166 total_samples=17593, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:26,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27 [2025-08-03 05:16:26,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.67 | bwd_microstep: 1968.58 | bwd_inner_microstep: 1856.19 | bwd_allreduce_microstep: 112.32 | step_microstep: 115.88 [2025-08-03 05:16:26,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.22 | bwd: 7717.66 | bwd_inner: 7290.47 | bwd_allreduce: 426.94 | step: 116.46 {'loss': 0.7645, 'learning_rate': 7.942007793440165e-06, 'epoch': 0.58} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13748 total_samples=17597, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:28,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.87 | bwd_microstep: 1782.55 | bwd_inner_microstep: 1710.44 | bwd_allreduce_microstep: 72.05 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14171 total_samples=17602, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:31,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.47 | bwd_microstep: 2160.57 | bwd_inner_microstep: 1947.42 | bwd_allreduce_microstep: 213.07 | step_microstep: 0.29 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15255 total_samples=17607, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:34,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 722.99 | bwd_microstep: 2207.13 | bwd_inner_microstep: 2011.55 | bwd_allreduce_microstep: 195.50 | step_microstep: 0.32 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13392 total_samples=17611, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:37,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13 [2025-08-03 05:16:37,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.69 | bwd_microstep: 1800.34 | bwd_inner_microstep: 1702.80 | bwd_allreduce_microstep: 97.46 | step_microstep: 122.86 [2025-08-03 05:16:37,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.93 | bwd: 7950.65 | bwd_inner: 7372.22 | bwd_allreduce: 578.16 | step: 123.60 {'loss': 0.7521, 'learning_rate': 7.9261633623663e-06, 'epoch': 0.58} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13911 total_samples=17615, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:40,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.53 | bwd_microstep: 1896.98 | bwd_inner_microstep: 1846.52 | bwd_allreduce_microstep: 50.39 | step_microstep: 0.81 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13521 total_samples=17619, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:42,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.05 | bwd_microstep: 2014.77 | bwd_inner_microstep: 1984.43 | bwd_allreduce_microstep: 30.27 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11664 total_samples=17622, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:45,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.14 | bwd_microstep: 1784.87 | bwd_inner_microstep: 1548.05 | bwd_allreduce_microstep: 236.73 | step_microstep: 0.20 dynamic ViT batch size: 48, 
images per sample: 48.0, dynamic token length: 14136 total_samples=17626, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:48,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25 [2025-08-03 05:16:48,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.81 | bwd_microstep: 1821.36 | bwd_inner_microstep: 1738.22 | bwd_allreduce_microstep: 83.07 | step_microstep: 113.05 [2025-08-03 05:16:48,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.45 | bwd: 7518.03 | bwd_inner: 7117.22 | bwd_allreduce: 400.56 | step: 114.19 {'loss': 0.753, 'learning_rate': 7.91032436968725e-06, 'epoch': 0.58} 58%|█████▊ | 1155/2000 [3:33:08<2:33:26, 10.89s/it] 58%|█████▊ | 1156/2000 [3:33:19<2:33:12, 10.89s/it] 58%|█████▊ | 1156/2000 [3:33:19<2:33:12, 10.89s/it] 58%|█████▊ | 1157/2000 [3:33:30<2:33:01, 10.89s/it] 58%|█████▊ | 1157/2000 [3:33:30<2:33:01, 10.89s/it] 58%|█████▊ | 1158/2000 [3:33:41<2:33:16, 10.92s/it] 58%|█████▊ | 1158/2000 [3:33:41<2:33:16, 10.92s/it] 58%|█████▊ | 1159/2000 [3:33:52<2:34:23, 11.01s/it] 58%|█████▊ | 1159/2000 [3:33:52<2:34:23, 11.01s/it] 58%|█████▊ | 1160/2000 [3:34:03<2:32:55, 10.92s/it] 58dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13746 total_samples=17630, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:50,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.10 | bwd_microstep: 2015.20 | bwd_inner_microstep: 1982.92 | bwd_allreduce_microstep: 32.20 | step_microstep: 0.12 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12248 total_samples=17634, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:53,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.42 | bwd_microstep: 1797.79 | bwd_inner_microstep: 1593.72 | bwd_allreduce_microstep: 204.01 | step_microstep: 0.12 dynamic ViT batch size: 48, images per 
sample: 48.0, dynamic token length: 13396 total_samples=17638, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:56,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.38 | bwd_microstep: 1755.53 | bwd_inner_microstep: 1686.98 | bwd_allreduce_microstep: 68.47 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13933 total_samples=17642, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:16:58,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.83 [2025-08-03 05:16:58,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.08 | bwd_microstep: 1726.64 | bwd_inner_microstep: 1700.32 | bwd_allreduce_microstep: 26.25 | step_microstep: 134.42 [2025-08-03 05:16:58,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.91 | bwd: 7295.22 | bwd_inner: 6963.93 | bwd_allreduce: 331.02 | step: 134.89 {'loss': 0.7512, 'learning_rate': 7.894490856938931e-06, 'epoch': 0.58} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11924 total_samples=17645, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:01,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.08 | bwd_microstep: 1803.82 | bwd_inner_microstep: 1596.02 | bwd_allreduce_microstep: 207.74 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13527 total_samples=17649, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:04,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.12 | bwd_microstep: 2161.74 | bwd_inner_microstep: 2102.05 | bwd_allreduce_microstep: 59.61 | step_microstep: 0.29 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13693 total_samples=17653, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:07,133] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.55 | bwd_microstep: 2124.68 | bwd_inner_microstep: 1817.42 | bwd_allreduce_microstep: 307.20 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13230 total_samples=17657, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:09,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96 [2025-08-03 05:17:09,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.36 | bwd_microstep: 1942.97 | bwd_inner_microstep: 1905.64 | bwd_allreduce_microstep: 37.28 | step_microstep: 116.54 [2025-08-03 05:17:09,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.03 | bwd: 8033.26 | bwd_inner: 7421.12 | bwd_allreduce: 611.91 | step: 117.07 {'loss': 0.7419, 'learning_rate': 7.87866286564288e-06, 'epoch': 0.58} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13239 total_samples=17661, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:12,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.08 | bwd_microstep: 1773.16 | bwd_inner_microstep: 1681.09 | bwd_allreduce_microstep: 92.00 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11686 total_samples=17664, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:15,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.00 | bwd_microstep: 2325.94 | bwd_inner_microstep: 2019.70 | bwd_allreduce_microstep: 306.18 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14009 total_samples=17669, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:18,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.93 | bwd_microstep: 1833.08 | bwd_inner_microstep: 1741.11 | bwd_allreduce_microstep: 91.89 | 
step_microstep: 0.29 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13867 total_samples=17673, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:21,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50 [2025-08-03 05:17:21,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.36 | bwd_microstep: 1986.01 | bwd_inner_microstep: 1892.88 | bwd_allreduce_microstep: 93.06 | step_microstep: 126.73 [2025-08-03 05:17:21,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.29 | bwd: 7918.25 | bwd_inner: 7334.78 | bwd_allreduce: 583.22 | step: 127.38 {'loss': 0.7344, 'learning_rate': 7.862840437306165e-06, 'epoch': 0.58} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13473 total_samples=17677, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:23,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.93 | bwd_microstep: 1688.18 | bwd_inner_microstep: 1652.44 | bwd_allreduce_microstep: 35.68 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14050 total_samples=17681, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:26,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.69 | bwd_microstep: 2047.96 | bwd_inner_microstep: 1760.38 | bwd_allreduce_microstep: 287.50 | step_microstep: 0.27 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13981 total_samples=17686, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:29,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.16 | bwd_microstep: 1772.50 | bwd_inner_microstep: 1714.84 | bwd_allreduce_microstep: 57.61 | step_microstep: 0.23 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13348 total_samples=17691, num_samples=5, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:17:31,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27 [2025-08-03 05:17:31,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.85 | bwd_microstep: 1862.61 | bwd_inner_microstep: 1805.47 | bwd_allreduce_microstep: 57.07 | step_microstep: 113.75 [2025-08-03 05:17:31,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.57 | bwd: 7371.30 | bwd_inner: 6933.12 | bwd_allreduce: 437.93 | step: 114.35 {'loss': 0.7438, 'learning_rate': 7.847023613421251e-06, 'epoch': 0.58} dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12743 total_samples=17695, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:34,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.12 | bwd_microstep: 1891.13 | bwd_inner_microstep: 1660.42 | bwd_allreduce_microstep: 230.65 | step_microstep: 0.22 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12473 total_samples=17699, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:37,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.76 | bwd_microstep: 1772.41 | bwd_inner_microstep: 1586.12 | bwd_allreduce_microstep: 186.23 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13429 total_samples=17704, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:39,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.24 | bwd_microstep: 1835.01 | bwd_inner_microstep: 1780.94 | bwd_allreduce_microstep: 53.99 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13673 total_samples=17708, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:42,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31 [2025-08-03 05:17:42,615] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.93 | bwd_microstep: 2045.05 | bwd_inner_microstep: 1749.55 | bwd_allreduce_microstep: 295.44 | step_microstep: 129.14 [2025-08-03 05:17:42,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.97 | bwd: 7543.66 | bwd_inner: 6777.03 | bwd_allreduce: 766.38 | step: 129.73 {'loss': 0.743, 'learning_rate': 7.831212435465925e-06, 'epoch': 0.58} %|█████▊ | 1160/2000 [3:34:03<2:32:55, 10.92s/it] 58%|█████▊ | 1161/2000 [3:34:13<2:30:59, 10.80s/it] 58%|█████▊ | 1161/2000 [3:34:13<2:30:59, 10.80s/it] 58%|█████▊ | 1162/2000 [3:34:24<2:32:49, 10.94s/it] 58%|█████▊ | 1162/2000 [3:34:24<2:32:49, 10.94s/it] 58%|█████▊ | 1163/2000 [3:34:36<2:33:45, 11.02s/it] 58%|█████▊ | 1163/2000 [3:34:36<2:33:45, 11.02s/it] 58%|█████▊ | 1164/2000 [3:34:46<2:31:52, 10.90s/it] 58%|█████▊ | 1164/2000 [3:34:46<2:31:52, 10.90s/it] 58%|█████▊ | 1165/2000 [3:34:57<2:31:15, 10.87s/it] 58%|█████▊ | 1165/2000 [3:34:57<2:31:15, 10.87dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11829 total_samples=17711, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:45,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.63 | bwd_microstep: 2076.96 | bwd_inner_microstep: 1674.10 | bwd_allreduce_microstep: 402.80 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12071 total_samples=17714, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:48,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.45 | bwd_microstep: 2081.36 | bwd_inner_microstep: 1620.10 | bwd_allreduce_microstep: 461.15 | step_microstep: 0.25 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13085 total_samples=17718, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:51,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.05 | 
bwd_microstep: 2112.34 | bwd_inner_microstep: 1865.65 | bwd_allreduce_microstep: 246.62 | step_microstep: 0.14 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13494 total_samples=17722, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:54,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.12 [2025-08-03 05:17:54,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.57 | bwd_microstep: 1768.12 | bwd_inner_microstep: 1681.29 | bwd_allreduce_microstep: 86.77 | step_microstep: 400.71 [2025-08-03 05:17:54,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.63 | bwd: 8038.83 | bwd_inner: 6841.16 | bwd_allreduce: 1197.41 | step: 401.22 {'loss': 0.758, 'learning_rate': 7.815406944903148e-06, 'epoch': 0.58} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13667 total_samples=17726, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:56,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.30 | bwd_microstep: 1832.63 | bwd_inner_microstep: 1644.26 | bwd_allreduce_microstep: 188.31 | step_microstep: 0.09 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11796 total_samples=17729, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:17:59,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.78 | bwd_microstep: 1739.95 | bwd_inner_microstep: 1537.55 | bwd_allreduce_microstep: 202.33 | step_microstep: 0.14 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12633 total_samples=17733, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:01,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.18 | bwd_microstep: 1786.24 | bwd_inner_microstep: 1622.02 | bwd_allreduce_microstep: 164.15 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 
48.0, dynamic token length: 13492 total_samples=17737, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:05,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89 [2025-08-03 05:18:05,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.22 | bwd_microstep: 2563.63 | bwd_inner_microstep: 2511.08 | bwd_allreduce_microstep: 52.49 | step_microstep: 106.85 [2025-08-03 05:18:05,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.40 | bwd: 7922.51 | bwd_inner: 7314.90 | bwd_allreduce: 607.36 | step: 107.38 {'loss': 0.7565, 'learning_rate': 7.799607183180981e-06, 'epoch': 0.58} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12063 total_samples=17740, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:07,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.38 | bwd_microstep: 1768.76 | bwd_inner_microstep: 1555.62 | bwd_allreduce_microstep: 213.07 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11795 total_samples=17743, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:10,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.25 | bwd_microstep: 1819.43 | bwd_inner_microstep: 1607.00 | bwd_allreduce_microstep: 212.36 | step_microstep: 0.22 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 16351 total_samples=17748, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:12,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.23 | bwd_microstep: 1793.55 | bwd_inner_microstep: 1754.74 | bwd_allreduce_microstep: 38.75 | step_microstep: 0.14 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13332 total_samples=17752, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:15,883] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27 [2025-08-03 05:18:15,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.82 | bwd_microstep: 2038.29 | bwd_inner_microstep: 1894.91 | bwd_allreduce_microstep: 143.31 | step_microstep: 126.41 [2025-08-03 05:18:15,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.61 | bwd: 7420.09 | bwd_inner: 6812.27 | bwd_allreduce: 607.57 | step: 126.90 {'loss': 0.7445, 'learning_rate': 7.78381319173246e-06, 'epoch': 0.58} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12481 total_samples=17756, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:18,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.18 | bwd_microstep: 2071.65 | bwd_inner_microstep: 2064.53 | bwd_allreduce_microstep: 7.06 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12407 total_samples=17759, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:21,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.88 | bwd_microstep: 2116.59 | bwd_inner_microstep: 1885.17 | bwd_allreduce_microstep: 231.36 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13308 total_samples=17763, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:24,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.08 | bwd_microstep: 1863.40 | bwd_inner_microstep: 1714.08 | bwd_allreduce_microstep: 149.26 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13175 total_samples=17767, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:26,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48 [2025-08-03 05:18:26,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
691.80 | bwd_microstep: 1752.80 | bwd_inner_microstep: 1672.08 | bwd_allreduce_microstep: 80.65 | step_microstep: 133.39 [2025-08-03 05:18:26,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.86 | bwd: 7804.50 | bwd_inner: 7335.87 | bwd_allreduce: 468.40 | step: 133.84 {'loss': 0.7427, 'learning_rate': 7.768025011975481e-06, 'epoch': 0.58} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938 total_samples=17770, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:29,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.07 | bwd_microstep: 1691.73 | bwd_inner_microstep: 1532.10 | bwd_allreduce_microstep: 159.54 | step_microstep: 0.17 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14173 total_samples=17775, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:32,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.51 | bwd_microstep: 1846.03 | bwd_inner_microstep: 1733.43 | bwd_allreduce_microstep: 112.54 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13497 total_samples=17779, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:34,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.77 | bwd_microstep: 1773.97 | bwd_inner_microstep: 1702.68 | bwd_allreduce_microstep: 71.22 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15469 total_samples=17786, num_samples=7, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:37,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.97 [2025-08-03 05:18:37,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.39 | bwd_microstep: 1756.89 | bwd_inner_microstep: 1749.99 | bwd_allreduce_microstep: 6.83 | step_microstep: 145.30 [2025-08-03 05:18:37,254] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.65 | bwd: 7068.67 | bwd_inner: 6718.19 | bwd_allreduce: 350.22 | step: 145.83 {'loss': 0.7231, 'learning_rate': 7.752242685312709e-06, 'epoch': 0.58} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11818 total_samples=17789, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:39,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.21 | bwd_microstep: 1773.34 | bwd_inner_microstep: 1570.12 | bwd_allreduce_microstep: 203.16 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13340 total_samples=17793, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:42,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.81 | bwd_microstep: 2149.29 | bwd_inner_microstep: 1935.48 | bwd_allreduce_microstep: 213.73 | step_microstep: 0.31 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13317 total_samples=17797, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:45,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 660.23 | bwd_microstep: 1849.19 | bwd_inner_microstep: 1710.21 | bwd_allreduce_microstep: 138.91 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13103 total_samples=17801, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:48,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33 [2025-08-03 05:18:48,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.62 | bwd_microstep: 2123.79 | bwd_inner_microstep: 1880.39 | bwd_allreduce_microstep: 243.33 | step_microstep: 109.12 [2025-08-03 05:18:48,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.80 | bwd: 7895.67 | bwd_inner: 7096.19 | bwd_allreduce: 799.22 | step: 109.65 s/it] 58%|█████▊ | 
1166/2000 [3:35:08<2:33:44, 11.06s/it] 58%|█████▊ | 1166/2000 [3:35:09<2:33:44, 11.06s/it] 58%|█████▊ | 1167/2000 [3:35:20<2:33:50, 11.08s/it] 58%|█████▊ | 1167/2000 [3:35:20<2:33:50, 11.08s/it] 58%|█████▊ | 1168/2000 [3:35:30<2:31:47, 10.95s/it] 58%|█████▊ | 1168/2000 [3:35:30<2:31:47, 10.95s/it] 58%|█████▊ | 1169/2000 [3:35:41<2:32:05, 10.98s/it] 58%|█████▊ | 1169/2000 [3:35:41<2:32:05, 10.98s/it] 58%|█████▊ | 1170/2000 [3:35:52<2:29:06, 10.78s/it] 58%|█████▊ | 1170/2000 [3:35:52<2:29:06, 10.78s/it] 59%|█████▊ | 1171/2000 [3:36:03<2:30:{'loss': 0.7452, 'learning_rate': 7.736466253131451e-06, 'epoch': 0.59} dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12829 total_samples=17805, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:51,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.09 | bwd_microstep: 2012.71 | bwd_inner_microstep: 1917.05 | bwd_allreduce_microstep: 95.59 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13735 total_samples=17809, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:53,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.80 | bwd_microstep: 1782.33 | bwd_inner_microstep: 1718.12 | bwd_allreduce_microstep: 64.14 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12906 total_samples=17813, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:56,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.58 | bwd_microstep: 1848.45 | bwd_inner_microstep: 1671.65 | bwd_allreduce_microstep: 176.73 | step_microstep: 0.15 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15309 total_samples=17817, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:18:59,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.81 
[2025-08-03 05:18:59,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.11 | bwd_microstep: 1785.43 | bwd_inner_microstep: 1765.79 | bwd_allreduce_microstep: 19.58 | step_microstep: 126.42 [2025-08-03 05:18:59,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2862.51 | bwd: 7428.97 | bwd_inner: 7072.61 | bwd_allreduce: 356.13 | step: 126.90 {'loss': 0.7508, 'learning_rate': 7.720695756803569e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13108 total_samples=17821, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:02,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.32 | bwd_microstep: 2280.04 | bwd_inner_microstep: 2075.99 | bwd_allreduce_microstep: 203.98 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13778 total_samples=17825, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:05,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.32 | bwd_microstep: 2332.12 | bwd_inner_microstep: 2045.80 | bwd_allreduce_microstep: 286.26 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13416 total_samples=17829, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:08,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.60 | bwd_microstep: 2038.88 | bwd_inner_microstep: 1910.82 | bwd_allreduce_microstep: 127.99 | step_microstep: 0.29 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13940 total_samples=17833, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:10,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39 [2025-08-03 05:19:10,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.28 | bwd_microstep: 1799.34 | bwd_inner_microstep: 1727.00 | 
bwd_allreduce_microstep: 72.27 | step_microstep: 123.05 [2025-08-03 05:19:10,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.44 | bwd: 8450.44 | bwd_inner: 7759.60 | bwd_allreduce: 690.60 | step: 123.72 {'loss': 0.7384, 'learning_rate': 7.704931237685342e-06, 'epoch': 0.59} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11866 total_samples=17837, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:13,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.05 | bwd_microstep: 1766.43 | bwd_inner_microstep: 1527.92 | bwd_allreduce_microstep: 238.44 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13800 total_samples=17841, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:16,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.64 | bwd_microstep: 2178.00 | bwd_inner_microstep: 2083.94 | bwd_allreduce_microstep: 93.99 | step_microstep: 0.29 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13561 total_samples=17845, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:18,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.59 | bwd_microstep: 1816.69 | bwd_inner_microstep: 1717.60 | bwd_allreduce_microstep: 99.02 | step_microstep: 0.25 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 15917 total_samples=17849, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:21,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 33.60 [2025-08-03 05:19:21,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.95 | bwd_microstep: 1775.30 | bwd_inner_microstep: 1717.93 | bwd_allreduce_microstep: 57.30 | step_microstep: 142.13 [2025-08-03 05:19:21,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.15 | bwd: 7536.47 | 
bwd_inner: 7047.38 | bwd_allreduce: 488.84 | step: 142.92 {'loss': 0.7443, 'learning_rate': 7.689172737117389e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332 total_samples=17853, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:24,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.56 | bwd_microstep: 2241.66 | bwd_inner_microstep: 2149.36 | bwd_allreduce_microstep: 92.23 | step_microstep: 0.32 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13833 total_samples=17857, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:27,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.94 | bwd_microstep: 1820.48 | bwd_inner_microstep: 1722.68 | bwd_allreduce_microstep: 97.73 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13087 total_samples=17861, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:30,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.22 | bwd_microstep: 2054.22 | bwd_inner_microstep: 1953.92 | bwd_allreduce_microstep: 100.24 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11533 total_samples=17864, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:33,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12 [2025-08-03 05:19:33,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.28 | bwd_microstep: 2146.37 | bwd_inner_microstep: 2135.24 | bwd_allreduce_microstep: 11.07 | step_microstep: 120.76 [2025-08-03 05:19:33,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.94 | bwd: 8262.78 | bwd_inner: 7961.18 | bwd_allreduce: 301.35 | step: 121.32 {'loss': 0.749, 'learning_rate': 7.673420296424541e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images 
per sample: 48.0, dynamic token length: 13970 total_samples=17868, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:35,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.57 | bwd_microstep: 1849.51 | bwd_inner_microstep: 1723.49 | bwd_allreduce_microstep: 125.95 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15122 total_samples=17873, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:38,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.51 | bwd_microstep: 1769.21 | bwd_inner_microstep: 1730.61 | bwd_allreduce_microstep: 38.53 | step_microstep: 0.17 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14119 total_samples=17877, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:40,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.82 | bwd_microstep: 1758.08 | bwd_inner_microstep: 1701.54 | bwd_allreduce_microstep: 56.47 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12916 total_samples=17881, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:43,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21 [2025-08-03 05:19:43,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.83 | bwd_microstep: 1906.33 | bwd_inner_microstep: 1794.74 | bwd_allreduce_microstep: 111.52 | step_microstep: 132.70 [2025-08-03 05:19:43,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2866.66 | bwd: 7283.18 | bwd_inner: 6950.38 | bwd_allreduce: 332.56 | step: 133.11 12, 10.87s/it] 59%|█████▊ | 1171/2000 [3:36:03<2:30:12, 10.87s/it] 59%|█████▊ | 1172/2000 [3:36:13<2:29:23, 10.83s/it] 59%|█████▊ | 1172/2000 [3:36:13<2:29:23, 10.83s/it] 59%|█████▊ | 1173/2000 [3:36:25<2:32:46, 11.08s/it] 59%|█████▊ | 1173/2000 [3:36:25<2:32:46, 
11.08s/it] 59%|█████▊ | 1174/2000 [3:36:36<2:31:28, 11.00s/it] 59%|█████▊ | 1174/2000 [3:36:36<2:31:28, 11.00s/it] 59%|█████▉ | 1175/2000 [3:36:47<2:33:25, 11.16s/it] 59%|█████▉ | 1175/2000 [3:36:47<2:33:25, 11.16s/it] 59%|█████▉ | 1176/2000 [3:36:58<2:30:59, 10.99s/it] {'loss': 0.7484, 'learning_rate': 7.657673956915735e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13952 total_samples=17885, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:46,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.98 | bwd_microstep: 1855.13 | bwd_inner_microstep: 1722.27 | bwd_allreduce_microstep: 132.79 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11762 total_samples=17888, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:48,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.82 | bwd_microstep: 1809.67 | bwd_inner_microstep: 1571.50 | bwd_allreduce_microstep: 238.10 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13166 total_samples=17892, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:51,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.94 | bwd_microstep: 1843.00 | bwd_inner_microstep: 1725.49 | bwd_allreduce_microstep: 117.44 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11777 total_samples=17895, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:54,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17 [2025-08-03 05:19:54,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.73 | bwd_microstep: 1848.84 | bwd_inner_microstep: 1616.41 | bwd_allreduce_microstep: 232.37 | step_microstep: 138.57 [2025-08-03 05:19:54,319] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd: 2807.39 | bwd: 7356.69 | bwd_inner: 6635.66 | bwd_allreduce: 720.78 | step: 139.09 {'loss': 0.7419, 'learning_rate': 7.641933759883913e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13495 total_samples=17899, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:56,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 1826.16 | bwd_inner_microstep: 1690.65 | bwd_allreduce_microstep: 135.44 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14499 total_samples=17903, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:19:59,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.05 | bwd_microstep: 1867.58 | bwd_inner_microstep: 1772.96 | bwd_allreduce_microstep: 94.55 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13206 total_samples=17907, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:02,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.83 | bwd_microstep: 2019.33 | bwd_inner_microstep: 1723.74 | bwd_allreduce_microstep: 295.51 | step_microstep: 0.14 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12497 total_samples=17911, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:05,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.92 [2025-08-03 05:20:05,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.96 | bwd_microstep: 1721.59 | bwd_inner_microstep: 1577.00 | bwd_allreduce_microstep: 144.52 | step_microstep: 160.05 [2025-08-03 05:20:05,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.82 | bwd: 7434.72 | bwd_inner: 6764.36 | bwd_allreduce: 670.10 | step: 160.58 {'loss': 0.7499, 'learning_rate': 
7.6261997466059035e-06, 'epoch': 0.59} dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13627 total_samples=17915, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:07,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.88 | bwd_microstep: 1988.18 | bwd_inner_microstep: 1896.98 | bwd_allreduce_microstep: 91.14 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12659 total_samples=17919, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:10,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.19 | bwd_microstep: 1718.50 | bwd_inner_microstep: 1624.53 | bwd_allreduce_microstep: 93.91 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13811 total_samples=17924, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:13,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.44 | bwd_microstep: 2241.93 | bwd_inner_microstep: 2042.17 | bwd_allreduce_microstep: 199.70 | step_microstep: 0.26 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11913 total_samples=17927, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:16,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22 [2025-08-03 05:20:16,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.54 | bwd_microstep: 1977.64 | bwd_inner_microstep: 1563.40 | bwd_allreduce_microstep: 414.18 | step_microstep: 142.91 [2025-08-03 05:20:16,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.98 | bwd: 7926.31 | bwd_inner: 7127.08 | bwd_allreduce: 799.00 | step: 143.41 {'loss': 0.7383, 'learning_rate': 7.610471958342326e-06, 'epoch': 0.59} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13038 total_samples=17931, num_samples=4, 
num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:19,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.57 | bwd_microstep: 2342.19 | bwd_inner_microstep: 2026.23 | bwd_allreduce_microstep: 315.89 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13221 total_samples=17935, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:22,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.20 | bwd_microstep: 2002.02 | bwd_inner_microstep: 1866.15 | bwd_allreduce_microstep: 135.81 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14596 total_samples=17942, num_samples=7, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:24,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.83 | bwd_microstep: 1802.66 | bwd_inner_microstep: 1761.72 | bwd_allreduce_microstep: 40.87 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12178 total_samples=17945, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:27,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61 [2025-08-03 05:20:27,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.64 | bwd_microstep: 1945.60 | bwd_inner_microstep: 1736.68 | bwd_allreduce_microstep: 208.86 | step_microstep: 129.13 [2025-08-03 05:20:27,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.17 | bwd: 8092.53 | bwd_inner: 7390.77 | bwd_allreduce: 701.50 | step: 129.63 {'loss': 0.7438, 'learning_rate': 7.594750436337467e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13453 total_samples=17949, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:30,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.39 | bwd_microstep: 
2016.99 | bwd_inner_microstep: 1757.60 | bwd_allreduce_microstep: 259.32 | step_microstep: 0.24 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12510 total_samples=17953, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:33,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.96 | bwd_microstep: 1878.35 | bwd_inner_microstep: 1778.33 | bwd_allreduce_microstep: 99.96 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13111 total_samples=17957, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:35,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.17 | bwd_microstep: 1752.60 | bwd_inner_microstep: 1664.52 | bwd_allreduce_microstep: 88.02 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11765 total_samples=17960, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:38,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72 [2025-08-03 05:20:38,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1083.29 | bwd_microstep: 1928.39 | bwd_inner_microstep: 1604.80 | bwd_allreduce_microstep: 323.52 | step_microstep: 112.12 [2025-08-03 05:20:38,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3204.73 | bwd: 7576.39 | bwd_inner: 6805.24 | bwd_allreduce: 770.90 | step: 112.60 {'loss': 0.7568, 'learning_rate': 7.579035221819188e-06, 'epoch': 0.59} 59%|█████▉ | 1176/2000 [3:36:58<2:30:59, 10.99s/it] 59%|█████▉ | 1177/2000 [3:37:09<2:29:17, 10.88s/it] 59%|█████▉ | 1177/2000 [3:37:09<2:29:17, 10.88s/it] 59%|█████▉ | 1178/2000 [3:37:19<2:28:29, 10.84s/it] 59%|█████▉ | 1178/2000 [3:37:19<2:28:29, 10.84s/it] 59%|█████▉ | 1179/2000 [3:37:31<2:29:45, 10.95s/it] 59%|█████▉ | 1179/2000 [3:37:31<2:29:45, 10.95s/it] 59%|█████▉ | 1180/2000 [3:37:42<2:31:11, 11.06s/it] 59%|█████▉ | 1180/2000 
[3:37:42<2:31:11, 11.06s/it] 59%|█████▉ | 1181/2000 [3:37:53<2:31:31, 11.10s/it] 59%|█████▉ | 1181/2000 [3:37:5dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885 total_samples=17963, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:41,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.90 | bwd_microstep: 1927.07 | bwd_inner_microstep: 1768.87 | bwd_allreduce_microstep: 158.13 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11717 total_samples=17966, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:44,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.29 | bwd_microstep: 2048.90 | bwd_inner_microstep: 1881.66 | bwd_allreduce_microstep: 167.17 | step_microstep: 0.27 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12940 total_samples=17970, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:46,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.12 | bwd_microstep: 1767.68 | bwd_inner_microstep: 1641.33 | bwd_allreduce_microstep: 126.27 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11535 total_samples=17973, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:49,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27 [2025-08-03 05:20:49,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.82 | bwd_microstep: 2073.51 | bwd_inner_microstep: 2067.41 | bwd_allreduce_microstep: 6.04 | step_microstep: 135.05 [2025-08-03 05:20:49,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2751.06 | bwd: 7817.21 | bwd_inner: 7359.26 | bwd_allreduce: 457.69 | step: 135.56 {'loss': 0.75, 'learning_rate': 7.5633263559988035e-06, 'epoch': 0.59} dynamic ViT batch size: 42, images per sample: 42.0, 
dynamic token length: 12002 total_samples=17976, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:52,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.92 | bwd_microstep: 1855.73 | bwd_inner_microstep: 1730.10 | bwd_allreduce_microstep: 125.56 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13405 total_samples=17980, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:54,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.99 | bwd_microstep: 1711.05 | bwd_inner_microstep: 1670.61 | bwd_allreduce_microstep: 40.37 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11861 total_samples=17983, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:20:57,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.77 | bwd_microstep: 1750.58 | bwd_inner_microstep: 1549.88 | bwd_allreduce_microstep: 200.63 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11894 total_samples=17986, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:00,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21 [2025-08-03 05:21:00,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 862.27 | bwd_microstep: 1757.62 | bwd_inner_microstep: 1538.78 | bwd_allreduce_microstep: 218.78 | step_microstep: 114.15 [2025-08-03 05:21:00,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2913.88 | bwd: 7075.03 | bwd_inner: 6489.35 | bwd_allreduce: 585.42 | step: 114.65 {'loss': 0.748, 'learning_rate': 7.547623880070992e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14671 total_samples=17990, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:02,983] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.52 | bwd_microstep: 2026.57 | bwd_inner_microstep: 1904.12 | bwd_allreduce_microstep: 122.38 | step_microstep: 0.27 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11642 total_samples=17993, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:05,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.50 | bwd_microstep: 2023.80 | bwd_inner_microstep: 1831.77 | bwd_allreduce_microstep: 191.97 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12465 total_samples=17996, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:08,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.46 | bwd_microstep: 1794.40 | bwd_inner_microstep: 1614.62 | bwd_allreduce_microstep: 179.72 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12504 total_samples=17999, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:10,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:21:10,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.66 | bwd_microstep: 1708.84 | bwd_inner_microstep: 1556.37 | bwd_allreduce_microstep: 152.40 | step_microstep: 114.91 [2025-08-03 05:21:10,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.07 | bwd: 7553.67 | bwd_inner: 6906.87 | bwd_allreduce: 646.55 | step: 115.43 {'loss': 0.7498, 'learning_rate': 7.531927835213657e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508 total_samples=18003, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:13,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.69 | bwd_microstep: 2184.02 | bwd_inner_microstep: 1921.45 | bwd_allreduce_microstep: 262.50 | 
step_microstep: 0.89 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13402 total_samples=18007, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:16,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.71 | bwd_microstep: 1789.09 | bwd_inner_microstep: 1697.48 | bwd_allreduce_microstep: 91.54 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11778 total_samples=18010, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:19,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.33 | bwd_microstep: 1884.65 | bwd_inner_microstep: 1542.52 | bwd_allreduce_microstep: 342.06 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11940 total_samples=18013, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:21,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32 [2025-08-03 05:21:21,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 652.03 | bwd_microstep: 1759.91 | bwd_inner_microstep: 1586.39 | bwd_allreduce_microstep: 173.45 | step_microstep: 126.37 [2025-08-03 05:21:21,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.69 | bwd: 7617.72 | bwd_inner: 6747.84 | bwd_allreduce: 869.63 | step: 127.49 {'loss': 0.7448, 'learning_rate': 7.516238262587851e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13758 total_samples=18017, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:24,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.13 | bwd_microstep: 1852.84 | bwd_inner_microstep: 1694.07 | bwd_allreduce_microstep: 158.71 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13590 total_samples=18021, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:21:27,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.08 | bwd_microstep: 2238.80 | bwd_inner_microstep: 1921.02 | bwd_allreduce_microstep: 317.71 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13465 total_samples=18025, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:30,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.15 | bwd_microstep: 1826.23 | bwd_inner_microstep: 1720.70 | bwd_allreduce_microstep: 105.47 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12077 total_samples=18028, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:33,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51 [2025-08-03 05:21:33,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.35 | bwd_microstep: 2227.35 | bwd_inner_microstep: 2004.73 | bwd_allreduce_microstep: 222.55 | step_microstep: 115.11 [2025-08-03 05:21:33,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.62 | bwd: 8145.28 | bwd_inner: 7340.52 | bwd_allreduce: 804.52 | step: 115.46 {'loss': 0.7605, 'learning_rate': 7.500555203337647e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14065 total_samples=18033, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:35,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.13 | bwd_microstep: 1701.38 | bwd_inner_microstep: 1679.98 | bwd_allreduce_microstep: 21.34 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348 total_samples=18037, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:38,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.51 | bwd_microstep: 1748.44 | 
bwd_inner_microstep: 1675.89 | bwd_allreduce_microstep: 72.47 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13437 total_samples=18041, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:40,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.00 | bwd_microstep: 1816.61 | bwd_inner_microstep: 1719.82 | bwd_allreduce_microstep: 96.71 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13498 total_samples=18045, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:43,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20 [2025-08-03 05:21:43,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.41 | bwd_microstep: 1750.21 | bwd_inner_microstep: 1684.83 | bwd_allreduce_microstep: 65.32 | step_microstep: 110.66 [2025-08-03 05:21:43,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.98 | bwd: 7016.69 | bwd_inner: 6760.52 | bwd_allreduce: 255.92 | step: 111.29 3<2:31:31, 11.10s/it] 59%|█████▉ | 1182/2000 [3:38:04<2:30:54, 11.07s/it] 59%|█████▉ | 1182/2000 [3:38:04<2:30:54, 11.07s/it] 59%|█████▉ | 1183/2000 [3:38:15<2:28:01, 10.87s/it] 59%|█████▉ | 1183/2000 [3:38:15<2:28:01, 10.87s/it] 59%|█████▉ | 1184/2000 [3:38:25<2:27:22, 10.84s/it] 59%|█████▉ | 1184/2000 [3:38:25<2:27:22, 10.84s/it] 59%|█████▉ | 1185/2000 [3:38:36<2:27:17, 10.84s/it] 59%|█████▉ | 1185/2000 [3:38:36<2:27:17, 10.84s/it] 59%|█████▉ | 1186/2000 [3:38:48<2:29:16, 11.00s/it] 59%|█████▉ | 1186/2000 [3:38:48<2:29:16, 11.00s/it] 59%|█████▉ | 1187/200{'loss': 0.7414, 'learning_rate': 7.48487869859004e-06, 'epoch': 0.59} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14031 total_samples=18050, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:46,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 766.90 | 
bwd_microstep: 1821.33 | bwd_inner_microstep: 1711.88 | bwd_allreduce_microstep: 109.39 | step_microstep: 0.10 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12555 total_samples=18055, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:48,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 748.07 | bwd_microstep: 1871.73 | bwd_inner_microstep: 1631.68 | bwd_allreduce_microstep: 239.99 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13394 total_samples=18059, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:51,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.09 | bwd_microstep: 1837.37 | bwd_inner_microstep: 1721.83 | bwd_allreduce_microstep: 115.48 | step_microstep: 0.11 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12674 total_samples=18063, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:54,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52 [2025-08-03 05:21:54,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.68 | bwd_microstep: 1809.78 | bwd_inner_microstep: 1632.93 | bwd_allreduce_microstep: 176.79 | step_microstep: 143.00 [2025-08-03 05:21:54,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2935.67 | bwd: 7340.28 | bwd_inner: 6698.31 | bwd_allreduce: 641.73 | step: 143.46 {'loss': 0.7507, 'learning_rate': 7.469208789454838e-06, 'epoch': 0.59} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13620 total_samples=18066, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:56,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.29 | bwd_microstep: 2041.73 | bwd_inner_microstep: 1641.64 | bwd_allreduce_microstep: 400.02 | step_microstep: 0.26 dynamic ViT batch size: 43, images per sample: 
43.0, dynamic token length: 12612 total_samples=18070, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:21:59,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.36 | bwd_microstep: 1768.88 | bwd_inner_microstep: 1591.34 | bwd_allreduce_microstep: 177.48 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13946 total_samples=18074, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:02,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.94 | bwd_microstep: 2122.88 | bwd_inner_microstep: 1953.54 | bwd_allreduce_microstep: 169.28 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11643 total_samples=18077, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:05,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20 [2025-08-03 05:22:05,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.32 | bwd_microstep: 1804.60 | bwd_inner_microstep: 1784.96 | bwd_allreduce_microstep: 19.58 | step_microstep: 110.54 [2025-08-03 05:22:05,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.86 | bwd: 7738.14 | bwd_inner: 6971.47 | bwd_allreduce: 766.43 | step: 111.02 {'loss': 0.7454, 'learning_rate': 7.4535455170245476e-06, 'epoch': 0.59} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13554 total_samples=18081, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:07,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.27 | bwd_microstep: 1738.67 | bwd_inner_microstep: 1668.94 | bwd_allreduce_microstep: 69.66 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446 total_samples=18085, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:10,170] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.13 | bwd_microstep: 1748.33 | bwd_inner_microstep: 1687.47 | bwd_allreduce_microstep: 60.80 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13142 total_samples=18089, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:12,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.06 | bwd_microstep: 1746.66 | bwd_inner_microstep: 1660.33 | bwd_allreduce_microstep: 86.27 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11605 total_samples=18092, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:16,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.84 [2025-08-03 05:22:16,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.42 | bwd_microstep: 2324.28 | bwd_inner_microstep: 2087.10 | bwd_allreduce_microstep: 237.12 | step_microstep: 417.05 [2025-08-03 05:22:16,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.82 | bwd: 7558.00 | bwd_inner: 7103.85 | bwd_allreduce: 453.92 | step: 417.37 {'loss': 0.7497, 'learning_rate': 7.4378889223742766e-06, 'epoch': 0.59} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12721 total_samples=18096, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:18,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.79 | bwd_microstep: 1775.52 | bwd_inner_microstep: 1632.48 | bwd_allreduce_microstep: 142.98 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12150 total_samples=18099, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:21,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.74 | bwd_microstep: 2022.25 | bwd_inner_microstep: 1802.77 | bwd_allreduce_microstep: 219.41 | 
step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11707 total_samples=18102, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:24,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.65 | bwd_microstep: 1952.54 | bwd_inner_microstep: 1775.68 | bwd_allreduce_microstep: 176.79 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11855 total_samples=18105, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:27,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.39 [2025-08-03 05:22:27,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.86 | bwd_microstep: 1852.71 | bwd_inner_microstep: 1632.08 | bwd_allreduce_microstep: 220.56 | step_microstep: 135.53 [2025-08-03 05:22:27,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.98 | bwd: 7603.07 | bwd_inner: 6843.01 | bwd_allreduce: 759.81 | step: 135.87 {'loss': 0.744, 'learning_rate': 7.422239046561619e-06, 'epoch': 0.6} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11713 total_samples=18108, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:29,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.18 | bwd_microstep: 1787.09 | bwd_inner_microstep: 1541.61 | bwd_allreduce_microstep: 245.42 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11883 total_samples=18111, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:32,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.48 | bwd_microstep: 2181.18 | bwd_inner_microstep: 1834.10 | bwd_allreduce_microstep: 347.01 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11717 total_samples=18114, num_samples=3, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:22:35,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.30 | bwd_microstep: 2002.73 | bwd_inner_microstep: 1763.98 | bwd_allreduce_microstep: 238.69 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11802 total_samples=18117, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:38,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93 [2025-08-03 05:22:38,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.26 | bwd_microstep: 1796.15 | bwd_inner_microstep: 1554.66 | bwd_allreduce_microstep: 241.42 | step_microstep: 111.94 [2025-08-03 05:22:38,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.14 | bwd: 7767.20 | bwd_inner: 6694.35 | bwd_allreduce: 1072.62 | step: 112.26 0 [3:38:58<2:26:02, 10.78s/it] 59%|█████▉ | 1187/2000 [3:38:58<2:26:02, 10.78s/it] 59%|█████▉ | 1188/2000 [3:39:08<2:25:34, 10.76s/it] 59%|█████▉ | 1188/2000 [3:39:09<2:25:34, 10.76s/it] 59%|█████▉ | 1189/2000 [3:39:19<2:26:24, 10.83s/it] 59%|█████▉ | 1189/2000 [3:39:20<2:26:24, 10.83s/it] 60%|█████▉ | 1190/2000 [3:39:31<2:27:09, 10.90s/it] 60%|█████▉ | 1190/2000 [3:39:31<2:27:09, 10.90s/it] 60%|█████▉ | 1191/2000 [3:39:41<2:26:46, 10.89s/it] 60%|█████▉ | 1191/2000 [3:39:41<2:26:46, 10.89s/it] 60%|█████▉ | 1192/2000 [3:39:52<2:27:01, 10.92s/it] {'loss': 0.7434, 'learning_rate': 7.40659593062655e-06, 'epoch': 0.6} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11777 total_samples=18120, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:40,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.09 | bwd_microstep: 1762.21 | bwd_inner_microstep: 1543.98 | bwd_allreduce_microstep: 218.16 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13089 total_samples=18124, num_samples=4, 
num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:43,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.41 | bwd_microstep: 1994.11 | bwd_inner_microstep: 1897.45 | bwd_allreduce_microstep: 96.59 | step_microstep: 0.26 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12285 total_samples=18127, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:46,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.96 | bwd_microstep: 2238.05 | bwd_inner_microstep: 1932.97 | bwd_allreduce_microstep: 305.02 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11680 total_samples=18130, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:49,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21 [2025-08-03 05:22:49,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.96 | bwd_microstep: 1954.12 | bwd_inner_microstep: 1761.34 | bwd_allreduce_microstep: 192.72 | step_microstep: 137.06 [2025-08-03 05:22:49,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2859.35 | bwd: 7948.54 | bwd_inner: 7135.73 | bwd_allreduce: 812.56 | step: 137.56 {'loss': 0.7453, 'learning_rate': 7.390959615591315e-06, 'epoch': 0.6} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11914 total_samples=18133, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:51,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.35 | bwd_microstep: 1844.02 | bwd_inner_microstep: 1586.89 | bwd_allreduce_microstep: 257.06 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14382 total_samples=18137, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:54,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.79 | bwd_microstep: 
2012.19 | bwd_inner_microstep: 1904.42 | bwd_allreduce_microstep: 107.71 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13169 total_samples=18141, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:22:57,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.01 | bwd_microstep: 1805.58 | bwd_inner_microstep: 1690.66 | bwd_allreduce_microstep: 114.86 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12120 total_samples=18144, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:00,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94 [2025-08-03 05:23:00,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.13 | bwd_microstep: 1848.90 | bwd_inner_microstep: 1594.35 | bwd_allreduce_microstep: 254.48 | step_microstep: 111.61 [2025-08-03 05:23:00,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2895.21 | bwd: 7510.74 | bwd_inner: 6776.30 | bwd_allreduce: 734.20 | step: 112.09 {'loss': 0.7552, 'learning_rate': 7.375330142460331e-06, 'epoch': 0.6} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13327 total_samples=18148, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:02,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.98 | bwd_microstep: 1742.54 | bwd_inner_microstep: 1670.79 | bwd_allreduce_microstep: 71.68 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14613 total_samples=18152, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:05,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.49 | bwd_microstep: 2523.60 | bwd_inner_microstep: 2457.43 | bwd_allreduce_microstep: 66.08 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token 
length: 14960 total_samples=18156, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:08,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.41 | bwd_microstep: 1781.70 | bwd_inner_microstep: 1744.38 | bwd_allreduce_microstep: 37.26 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13658 total_samples=18160, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:11,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.85 [2025-08-03 05:23:11,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.51 | bwd_microstep: 1862.12 | bwd_inner_microstep: 1718.23 | bwd_allreduce_microstep: 143.82 | step_microstep: 132.30 [2025-08-03 05:23:11,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.31 | bwd: 7910.00 | bwd_inner: 7590.83 | bwd_allreduce: 318.91 | step: 132.66 {'loss': 0.7514, 'learning_rate': 7.35970755222007e-06, 'epoch': 0.6} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13158 total_samples=18164, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:14,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.41 | bwd_microstep: 2065.32 | bwd_inner_microstep: 1963.97 | bwd_allreduce_microstep: 101.28 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14996 total_samples=18168, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:16,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.81 | bwd_microstep: 1939.29 | bwd_inner_microstep: 1759.41 | bwd_allreduce_microstep: 179.82 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13239 total_samples=18172, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:19,857] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 731.47 | bwd_microstep: 2191.97 | bwd_inner_microstep: 1980.29 | bwd_allreduce_microstep: 211.61 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11859 total_samples=18175, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:22,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00 [2025-08-03 05:23:22,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.94 | bwd_microstep: 1811.04 | bwd_inner_microstep: 1562.97 | bwd_allreduce_microstep: 248.00 | step_microstep: 110.93 [2025-08-03 05:23:22,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2891.56 | bwd: 8007.66 | bwd_inner: 7266.64 | bwd_allreduce: 740.79 | step: 111.29 {'loss': 0.7552, 'learning_rate': 7.344091885838949e-06, 'epoch': 0.6} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12013 total_samples=18178, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:25,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.89 | bwd_microstep: 1797.82 | bwd_inner_microstep: 1583.90 | bwd_allreduce_microstep: 213.84 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13315 total_samples=18182, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:27,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.77 | bwd_microstep: 1768.13 | bwd_inner_microstep: 1690.84 | bwd_allreduce_microstep: 77.21 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13128 total_samples=18186, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:30,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.41 | bwd_microstep: 2025.59 | bwd_inner_microstep: 1735.91 | bwd_allreduce_microstep: 289.60 | step_microstep: 0.24 dynamic ViT batch 
size: 42, images per sample: 42.0, dynamic token length: 11586 total_samples=18189, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:33,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47 [2025-08-03 05:23:33,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.55 | bwd_microstep: 1765.07 | bwd_inner_microstep: 1558.89 | bwd_allreduce_microstep: 206.12 | step_microstep: 111.43 [2025-08-03 05:23:33,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2879.54 | bwd: 7356.66 | bwd_inner: 6569.54 | bwd_allreduce: 786.86 | step: 111.94 {'loss': 0.7482, 'learning_rate': 7.328483184267236e-06, 'epoch': 0.6} 60%|█████▉ | 1192/2000 [3:39:52<2:27:01, 10.92s/it] 60%|█████▉ | 1193/2000 [3:40:04<2:28:12, 11.02s/it] 60%|█████▉ | 1193/2000 [3:40:04<2:28:12, 11.02s/it] 60%|█████▉ | 1194/2000 [3:40:14<2:27:06, 10.95s/it] 60%|█████▉ | 1194/2000 [3:40:14<2:27:06, 10.95s/it] 60%|█████▉ | 1195/2000 [3:40:26<2:27:50, 11.02s/it] 60%|█████▉ | 1195/2000 [3:40:26<2:27:50, 11.02s/it] 60%|█████▉ | 1196/2000 [3:40:37<2:28:50, 11.11s/it] 60%|█████▉ | 1196/2000 [3:40:37<2:28:50, 11.11s/it] 60%|█████▉ | 1197/2000 [3:40:48<2:26:47, 10.97s/it] 60%|█████▉ | 1dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13717 total_samples=18193, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:35,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.04 | bwd_microstep: 1790.35 | bwd_inner_microstep: 1721.55 | bwd_allreduce_microstep: 68.72 | step_microstep: 0.30 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13717 total_samples=18197, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:38,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.13 | bwd_microstep: 1834.02 | bwd_inner_microstep: 1712.25 | bwd_allreduce_microstep: 121.71 | step_microstep: 0.10 dynamic ViT batch 
size: 42, images per sample: 42.0, dynamic token length: 11790 total_samples=18200, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:41,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.38 | bwd_microstep: 1821.20 | bwd_inner_microstep: 1599.17 | bwd_allreduce_microstep: 221.97 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13427 total_samples=18204, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:43,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12 [2025-08-03 05:23:43,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.09 | bwd_microstep: 1834.25 | bwd_inner_microstep: 1734.48 | bwd_allreduce_microstep: 99.70 | step_microstep: 110.15 [2025-08-03 05:23:43,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.55 | bwd: 7279.87 | bwd_inner: 6767.45 | bwd_allreduce: 512.18 | step: 110.66 {'loss': 0.7542, 'learning_rate': 7.312881488436928e-06, 'epoch': 0.6} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13036 total_samples=18208, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:46,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.45 | bwd_microstep: 1893.65 | bwd_inner_microstep: 1839.95 | bwd_allreduce_microstep: 53.62 | step_microstep: 0.17 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12715 total_samples=18212, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:49,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.13 | bwd_microstep: 2292.70 | bwd_inner_microstep: 1974.56 | bwd_allreduce_microstep: 318.08 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13519 total_samples=18216, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
05:23:52,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.14 | bwd_microstep: 2246.96 | bwd_inner_microstep: 1949.03 | bwd_allreduce_microstep: 297.86 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11803 total_samples=18219, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:55,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28 [2025-08-03 05:23:55,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.35 | bwd_microstep: 1692.71 | bwd_inner_microstep: 1523.93 | bwd_allreduce_microstep: 168.71 | step_microstep: 124.29 [2025-08-03 05:23:55,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.99 | bwd: 8126.07 | bwd_inner: 7287.46 | bwd_allreduce: 838.36 | step: 124.83 {'loss': 0.7439, 'learning_rate': 7.297286839261659e-06, 'epoch': 0.6} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13418 total_samples=18223, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:23:57,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.70 | bwd_microstep: 1714.70 | bwd_inner_microstep: 1658.32 | bwd_allreduce_microstep: 56.30 | step_microstep: 0.79 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12142 total_samples=18227, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:00,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.22 | bwd_microstep: 2134.02 | bwd_inner_microstep: 1761.60 | bwd_allreduce_microstep: 372.35 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11594 total_samples=18230, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:03,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.15 | bwd_microstep: 2002.54 | bwd_inner_microstep: 1695.56 | 
bwd_allreduce_microstep: 306.89 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13276 total_samples=18234, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:05,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43 [2025-08-03 05:24:05,960] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.94 | bwd_microstep: 1771.09 | bwd_inner_microstep: 1683.76 | bwd_allreduce_microstep: 87.26 | step_microstep: 110.59 [2025-08-03 05:24:05,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.92 | bwd: 7622.41 | bwd_inner: 6799.22 | bwd_allreduce: 822.90 | step: 111.75 {'loss': 0.7375, 'learning_rate': 7.2816992776365714e-06, 'epoch': 0.6} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13066 total_samples=18238, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:08,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.53 | bwd_microstep: 2035.58 | bwd_inner_microstep: 1844.99 | bwd_allreduce_microstep: 190.52 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13375 total_samples=18242, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:11,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.11 | bwd_microstep: 1769.79 | bwd_inner_microstep: 1682.23 | bwd_allreduce_microstep: 87.49 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13313 total_samples=18246, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:13,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.76 | bwd_microstep: 1748.62 | bwd_inner_microstep: 1673.56 | bwd_allreduce_microstep: 74.98 | step_microstep: 0.16 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407 total_samples=18250, 
num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:16,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14 [2025-08-03 05:24:16,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.30 | bwd_microstep: 2085.71 | bwd_inner_microstep: 1977.55 | bwd_allreduce_microstep: 108.10 | step_microstep: 115.03 [2025-08-03 05:24:16,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.62 | bwd: 7639.75 | bwd_inner: 7178.33 | bwd_allreduce: 461.17 | step: 115.61 {'loss': 0.7444, 'learning_rate': 7.2661188444382345e-06, 'epoch': 0.6} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12819 total_samples=18254, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:19,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.06 | bwd_microstep: 1883.19 | bwd_inner_microstep: 1671.54 | bwd_allreduce_microstep: 211.58 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12026 total_samples=18257, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:22,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.60 | bwd_microstep: 1817.01 | bwd_inner_microstep: 1577.90 | bwd_allreduce_microstep: 239.05 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11420 total_samples=18260, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:24,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.77 | bwd_microstep: 2006.28 | bwd_inner_microstep: 1551.37 | bwd_allreduce_microstep: 454.84 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13467 total_samples=18264, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:27,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21 
[2025-08-03 05:24:27,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.82 | bwd_microstep: 1790.17 | bwd_inner_microstep: 1699.69 | bwd_allreduce_microstep: 90.42 | step_microstep: 132.12 [2025-08-03 05:24:27,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.17 | bwd: 7496.70 | bwd_inner: 6500.50 | bwd_allreduce: 995.96 | step: 132.57 {'loss': 0.7548, 'learning_rate': 7.250545580524515e-06, 'epoch': 0.6} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508 total_samples=18268, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:30,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.48 | bwd_microstep: 1892.85 | bwd_inner_microstep: 1723.68 | bwd_allreduce_microstep: 169.10 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12107 total_samples=18271, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:33,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.76 | bwd_microstep: 2774.53 | bwd_inner_microstep: 2496.66 | bwd_allreduce_microstep: 277.81 | step_microstep: 0.19 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13688 total_samples=18275, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:36,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.70 | bwd_microstep: 1785.49 | bwd_inner_microstep: 1713.84 | bwd_allreduce_microstep: 71.59 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13248 total_samples=18279, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:39,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.66 [2025-08-03 05:24:39,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.45 | bwd_microstep: 1749.15 | bwd_inner_microstep: 1679.18 | 
bwd_allreduce_microstep: 69.90 | step_microstep: 157.69 [2025-08-03 05:24:39,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.31 | bwd: 8202.07 | bwd_inner: 7613.36 | bwd_allreduce: 588.47 | step: 158.27 197/2000 [3:40:48<2:26:47, 10.97s/it] 60%|█████▉ | 1198/2000 [3:40:58<2:24:58, 10.85s/it] 60%|█████▉ | 1198/2000 [3:40:58<2:24:58, 10.85s/it] 60%|█████▉ | 1199/2000 [3:41:09<2:26:37, 10.98s/it] 60%|█████▉ | 1199/2000 [3:41:09<2:26:37, 10.98s/it] 60%|██████ | 1200/2000 [3:41:20<2:26:00, 10.95s/it] 60%|██████ | 1200/2000 [3:41:20<2:26:00, 10.95s/it] 60%|██████ | 1201/2000 [3:41:31<2:25:22, 10.92s/it] 60%|██████ | 1201/2000 [3:41:31<2:25:22, 10.92s/it] 60%|██████ | 1202/2000 [3:41:42<2:24:32, 10.87s/it] 60%|██████ | 1202/2000 [3:41:42<2:24:32, 10.87s/it] 60%|█████{'loss': 0.7388, 'learning_rate': 7.234979526734482e-06, 'epoch': 0.6} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13415 total_samples=18283, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:41,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.87 | bwd_microstep: 1813.32 | bwd_inner_microstep: 1727.61 | bwd_allreduce_microstep: 85.65 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13190 total_samples=18287, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:44,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.97 | bwd_microstep: 1737.22 | bwd_inner_microstep: 1647.61 | bwd_allreduce_microstep: 89.54 | step_microstep: 0.34 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12231 total_samples=18290, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:46,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.31 | bwd_microstep: 1713.37 | bwd_inner_microstep: 1558.83 | bwd_allreduce_microstep: 154.48 | step_microstep: 0.13 dynamic ViT batch 
size: 48, images per sample: 48.0, dynamic token length: 13466 total_samples=18294, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:49,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.18 [2025-08-03 05:24:49,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.89 | bwd_microstep: 1837.89 | bwd_inner_microstep: 1804.15 | bwd_allreduce_microstep: 33.66 | step_microstep: 161.87 [2025-08-03 05:24:49,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.98 | bwd: 7101.86 | bwd_inner: 6738.19 | bwd_allreduce: 363.41 | step: 162.45 {'loss': 0.7393, 'learning_rate': 7.219420723888301e-06, 'epoch': 0.6} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13769 total_samples=18298, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:52,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.65 | bwd_microstep: 1961.82 | bwd_inner_microstep: 1699.24 | bwd_allreduce_microstep: 262.51 | step_microstep: 0.12 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13622 total_samples=18302, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:54,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.59 | bwd_microstep: 1791.17 | bwd_inner_microstep: 1670.60 | bwd_allreduce_microstep: 120.51 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11685 total_samples=18305, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:24:57,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.25 | bwd_microstep: 1909.74 | bwd_inner_microstep: 1747.19 | bwd_allreduce_microstep: 162.48 | step_microstep: 0.29 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13434 total_samples=18309, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
05:25:00,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64 [2025-08-03 05:25:00,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.67 | bwd_microstep: 1950.08 | bwd_inner_microstep: 1701.40 | bwd_allreduce_microstep: 248.58 | step_microstep: 139.62 [2025-08-03 05:25:00,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2870.10 | bwd: 7612.88 | bwd_inner: 6818.43 | bwd_allreduce: 794.19 | step: 140.16 {'loss': 0.751, 'learning_rate': 7.203869212787112e-06, 'epoch': 0.6} dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13141 total_samples=18313, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:03,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.72 | bwd_microstep: 2047.21 | bwd_inner_microstep: 1864.78 | bwd_allreduce_microstep: 182.35 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13686 total_samples=18317, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:05,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.33 | bwd_microstep: 1816.66 | bwd_inner_microstep: 1655.98 | bwd_allreduce_microstep: 160.62 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13560 total_samples=18321, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:08,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.19 | bwd_microstep: 1710.68 | bwd_inner_microstep: 1666.90 | bwd_allreduce_microstep: 43.72 | step_microstep: 0.28 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13157 total_samples=18325, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:10,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.26 [2025-08-03 05:25:10,937] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 715.43 | bwd_microstep: 1803.87 | bwd_inner_microstep: 1660.35 | bwd_allreduce_microstep: 143.45 | step_microstep: 155.86 [2025-08-03 05:25:10,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.60 | bwd: 7378.48 | bwd_inner: 6848.01 | bwd_allreduce: 530.23 | step: 156.51 {'loss': 0.7439, 'learning_rate': 7.188325034212944e-06, 'epoch': 0.6} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11963 total_samples=18328, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:13,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.96 | bwd_microstep: 1825.33 | bwd_inner_microstep: 1590.68 | bwd_allreduce_microstep: 234.59 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14526 total_samples=18332, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:16,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.31 | bwd_microstep: 1755.09 | bwd_inner_microstep: 1702.99 | bwd_allreduce_microstep: 52.03 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12093 total_samples=18335, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:18,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 756.51 | bwd_microstep: 1807.38 | bwd_inner_microstep: 1561.78 | bwd_allreduce_microstep: 245.52 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13389 total_samples=18339, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:21,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18 [2025-08-03 05:25:21,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.38 | bwd_microstep: 1992.95 | bwd_inner_microstep: 1723.15 | bwd_allreduce_microstep: 269.73 | step_microstep: 114.90 [2025-08-03 
05:25:21,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2896.09 | bwd: 7380.80 | bwd_inner: 6578.60 | bwd_allreduce: 801.96 | step: 115.39 {'loss': 0.7433, 'learning_rate': 7.1727882289285915e-06, 'epoch': 0.6} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12773 total_samples=18343, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:24,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.21 | bwd_microstep: 1939.72 | bwd_inner_microstep: 1658.64 | bwd_allreduce_microstep: 281.00 | step_microstep: 0.18 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12993 total_samples=18347, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:27,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.53 | bwd_microstep: 2025.49 | bwd_inner_microstep: 1781.89 | bwd_allreduce_microstep: 243.53 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11667 total_samples=18350, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:29,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.09 | bwd_microstep: 1967.13 | bwd_inner_microstep: 1831.73 | bwd_allreduce_microstep: 135.32 | step_microstep: 0.15 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11756 total_samples=18353, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:32,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58 [2025-08-03 05:25:32,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.25 | bwd_microstep: 1753.30 | bwd_inner_microstep: 1542.78 | bwd_allreduce_microstep: 210.44 | step_microstep: 115.51 [2025-08-03 05:25:32,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.01 | bwd: 7685.68 | bwd_inner: 6815.03 | bwd_allreduce: 870.38 | step: 116.10 | 
1203/2000 [3:41:53<2:26:48, 11.05s/it] 60%|██████ | 1203/2000 [3:41:53<2:26:48, 11.05s/it] 60%|██████ | 1204/2000 [3:42:04<2:23:56, 10.85s/it] 60%|██████ | 1204/2000 [3:42:04<2:23:56, 10.85s/it] 60%|██████ | 1205/2000 [3:42:15<2:24:00, 10.87s/it] 60%|██████ | 1205/2000 [3:42:15<2:24:00, 10.87s/it] 60%|██████ | 1206/2000 [3:42:25<2:22:48, 10.79s/it] 60%|██████ | 1206/2000 [3:42:25<2:22:48, 10.79s/it] 60%|██████ | 1207/2000 [3:42:36<2:22:09, 10.76s/it] 60%|██████ | 1207/2000 [3:42:36<2:22:09, 10.76s/it] 60%|██████ | 1208/2000 [3:42:47<2:22:29, 10.80s/it] {'loss': 0.7463, 'learning_rate': 7.157258837677514e-06, 'epoch': 0.6} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13148 total_samples=18357, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:35,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.83 | bwd_microstep: 1778.72 | bwd_inner_microstep: 1672.43 | bwd_allreduce_microstep: 106.22 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13701 total_samples=18360, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:37,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.20 | bwd_microstep: 1993.51 | bwd_inner_microstep: 1843.60 | bwd_allreduce_microstep: 149.84 | step_microstep: 0.28 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12009 total_samples=18363, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:40,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.42 | bwd_microstep: 1938.09 | bwd_inner_microstep: 1753.00 | bwd_allreduce_microstep: 185.02 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13278 total_samples=18367, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:43,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_step: 14.40 [2025-08-03 05:25:43,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.75 | bwd_microstep: 1789.75 | bwd_inner_microstep: 1695.59 | bwd_allreduce_microstep: 94.08 | step_microstep: 111.39 [2025-08-03 05:25:43,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.12 | bwd: 7500.13 | bwd_inner: 6964.63 | bwd_allreduce: 535.25 | step: 111.91 {'loss': 0.7416, 'learning_rate': 7.1417369011837355e-06, 'epoch': 0.6} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12947 total_samples=18371, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:46,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.77 | bwd_microstep: 2168.00 | bwd_inner_microstep: 1735.81 | bwd_allreduce_microstep: 432.11 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15187 total_samples=18375, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:48,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 1762.82 | bwd_inner_microstep: 1749.55 | bwd_allreduce_microstep: 13.21 | step_microstep: 0.15 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14063 total_samples=18379, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:51,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1773.13 | bwd_inner_microstep: 1733.99 | bwd_allreduce_microstep: 39.06 | step_microstep: 0.29 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13409 total_samples=18383, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:54,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52 [2025-08-03 05:25:54,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.39 | bwd_microstep: 1795.03 | 
bwd_inner_microstep: 1711.75 | bwd_allreduce_microstep: 83.20 | step_microstep: 116.44 [2025-08-03 05:25:54,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.74 | bwd: 7499.02 | bwd_inner: 6931.10 | bwd_allreduce: 567.67 | step: 117.01 {'loss': 0.7337, 'learning_rate': 7.126222460151719e-06, 'epoch': 0.6} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13760 total_samples=18387, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:56,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.70 | bwd_microstep: 1791.82 | bwd_inner_microstep: 1709.13 | bwd_allreduce_microstep: 82.62 | step_microstep: 0.15 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12206 total_samples=18391, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:25:59,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.06 | bwd_microstep: 1749.40 | bwd_inner_microstep: 1581.01 | bwd_allreduce_microstep: 168.32 | step_microstep: 0.18 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14199 total_samples=18395, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:01,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.54 | bwd_microstep: 2081.12 | bwd_inner_microstep: 1754.28 | bwd_allreduce_microstep: 326.77 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13481 total_samples=18399, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:04,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:26:04,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.17 | bwd_microstep: 1986.71 | bwd_inner_microstep: 1866.97 | bwd_allreduce_microstep: 119.67 | step_microstep: 138.62 [2025-08-03 05:26:04,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd: 2844.39 | bwd: 7609.11 | bwd_inner: 6911.38 | bwd_allreduce: 697.48 | step: 139.09 {'loss': 0.7451, 'learning_rate': 7.110715555266281e-06, 'epoch': 0.61} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12747 total_samples=18403, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:07,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.54 | bwd_microstep: 1819.29 | bwd_inner_microstep: 1624.97 | bwd_allreduce_microstep: 194.26 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11853 total_samples=18406, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:10,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.83 | bwd_microstep: 1957.75 | bwd_inner_microstep: 1585.63 | bwd_allreduce_microstep: 372.05 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13062 total_samples=18410, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:13,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 901.38 | bwd_microstep: 1889.03 | bwd_inner_microstep: 1821.50 | bwd_allreduce_microstep: 67.47 | step_microstep: 0.23 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13357 total_samples=18414, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:16,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91 [2025-08-03 05:26:16,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.92 | bwd_microstep: 2103.45 | bwd_inner_microstep: 1938.87 | bwd_allreduce_microstep: 164.51 | step_microstep: 394.66 [2025-08-03 05:26:16,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2983.60 | bwd: 7769.56 | bwd_inner: 6970.97 | bwd_allreduce: 798.36 | step: 395.23 {'loss': 0.7489, 'learning_rate': 7.095216227192467e-06, 'epoch': 0.61} 
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15036 total_samples=18418, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:19,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.73 | bwd_microstep: 2279.89 | bwd_inner_microstep: 1952.99 | bwd_allreduce_microstep: 326.83 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13638 total_samples=18422, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:22,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.44 | bwd_microstep: 1984.97 | bwd_inner_microstep: 1872.75 | bwd_allreduce_microstep: 112.16 | step_microstep: 0.25 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13773 total_samples=18427, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:24,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.12 | bwd_microstep: 1763.57 | bwd_inner_microstep: 1705.94 | bwd_allreduce_microstep: 57.57 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11760 total_samples=18430, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:28,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03 [2025-08-03 05:26:28,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1243.15 | bwd_microstep: 2514.73 | bwd_inner_microstep: 2293.18 | bwd_allreduce_microstep: 221.47 | step_microstep: 116.81 [2025-08-03 05:26:28,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3362.37 | bwd: 8543.22 | bwd_inner: 7824.86 | bwd_allreduce: 718.10 | step: 117.43 {'loss': 0.7373, 'learning_rate': 7.0797245165754654e-06, 'epoch': 0.61} 60%|██████ | 1208/2000 [3:42:47<2:22:29, 10.80s/it] 60%|██████ | 1209/2000 [3:42:58<2:22:11, 10.79s/it] 60%|██████ | 1209/2000 [3:42:58<2:22:11, 10.79s/it] 
60%|██████ | 1210/2000 [3:43:08<2:21:53, 10.78s/it] 60%|██████ | 1210/2000 [3:43:08<2:21:53, 10.78s/it] 61%|██████ | 1211/2000 [3:43:19<2:22:11, 10.81s/it] 61%|██████ | 1211/2000 [3:43:19<2:22:11, 10.81s/it] 61%|██████ | 1212/2000 [3:43:31<2:24:33, 11.01s/it] 61%|██████ | 1212/2000 [3:43:31<2:24:33, 11.01s/it] 61%|██████ | 1213/2000 [3:43:43<2:29:35, 11.40s/it] 61%|███dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13841 total_samples=18435, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:31,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.86 | bwd_microstep: 1728.27 | bwd_inner_microstep: 1674.39 | bwd_allreduce_microstep: 53.81 | step_microstep: 0.11 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12681 total_samples=18439, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:33,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.03 | bwd_microstep: 1847.54 | bwd_inner_microstep: 1644.67 | bwd_allreduce_microstep: 202.81 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13630 total_samples=18443, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:36,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.37 | bwd_microstep: 1992.86 | bwd_inner_microstep: 1878.95 | bwd_allreduce_microstep: 113.83 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13376 total_samples=18447, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:39,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86 [2025-08-03 05:26:39,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.94 | bwd_microstep: 1816.66 | bwd_inner_microstep: 1714.56 | bwd_allreduce_microstep: 102.03 | step_microstep: 119.27 [2025-08-03 05:26:39,349] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.11 | bwd: 7385.38 | bwd_inner: 6912.58 | bwd_allreduce: 472.56 | step: 119.73 {'loss': 0.7381, 'learning_rate': 7.064240464040472e-06, 'epoch': 0.61} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14037 total_samples=18451, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:42,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.95 | bwd_microstep: 2018.34 | bwd_inner_microstep: 1880.68 | bwd_allreduce_microstep: 137.59 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12189 total_samples=18454, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:45,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.20 | bwd_microstep: 2097.64 | bwd_inner_microstep: 1875.66 | bwd_allreduce_microstep: 221.91 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13217 total_samples=18458, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:47,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.03 | bwd_microstep: 1731.67 | bwd_inner_microstep: 1667.86 | bwd_allreduce_microstep: 63.74 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13140 total_samples=18462, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:50,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38 [2025-08-03 05:26:50,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.32 | bwd_microstep: 1814.33 | bwd_inner_microstep: 1700.17 | bwd_allreduce_microstep: 114.09 | step_microstep: 122.48 [2025-08-03 05:26:50,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.42 | bwd: 7662.02 | bwd_inner: 7124.36 | bwd_allreduce: 537.41 | step: 122.93 {'loss': 0.745, 
'learning_rate': 7.048764110192618e-06, 'epoch': 0.61} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13703 total_samples=18466, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:53,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 747.05 | bwd_microstep: 2168.18 | bwd_inner_microstep: 1976.14 | bwd_allreduce_microstep: 191.97 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14328 total_samples=18470, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:55,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.89 | bwd_microstep: 1755.00 | bwd_inner_microstep: 1733.67 | bwd_allreduce_microstep: 21.26 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13697 total_samples=18474, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:26:58,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.66 | bwd_microstep: 1727.03 | bwd_inner_microstep: 1655.76 | bwd_allreduce_microstep: 71.21 | step_microstep: 0.17 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13760 total_samples=18478, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:01,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23 [2025-08-03 05:27:01,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.24 | bwd_microstep: 1818.58 | bwd_inner_microstep: 1812.17 | bwd_allreduce_microstep: 6.34 | step_microstep: 149.27 [2025-08-03 05:27:01,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.78 | bwd: 7468.84 | bwd_inner: 7177.73 | bwd_allreduce: 290.88 | step: 149.68 {'loss': 0.7443, 'learning_rate': 7.033295495616834e-06, 'epoch': 0.61} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12866 total_samples=18482, 
num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:03,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.70 | bwd_microstep: 2063.10 | bwd_inner_microstep: 1645.13 | bwd_allreduce_microstep: 417.90 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14262 total_samples=18486, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:06,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.24 | bwd_microstep: 1719.82 | bwd_inner_microstep: 1700.38 | bwd_allreduce_microstep: 19.36 | step_microstep: 0.16 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14327 total_samples=18490, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:08,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.85 | bwd_microstep: 1835.25 | bwd_inner_microstep: 1754.81 | bwd_allreduce_microstep: 80.37 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332 total_samples=18494, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:11,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24 [2025-08-03 05:27:11,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.90 | bwd_microstep: 1764.79 | bwd_inner_microstep: 1693.06 | bwd_allreduce_microstep: 71.65 | step_microstep: 113.27 [2025-08-03 05:27:11,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.62 | bwd: 7383.00 | bwd_inner: 6793.39 | bwd_allreduce: 589.37 | step: 113.77 {'loss': 0.7397, 'learning_rate': 7.017834660877756e-06, 'epoch': 0.61} dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12807 total_samples=18498, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:14,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.15 | 
bwd_microstep: 1736.56 | bwd_inner_microstep: 1594.24 | bwd_allreduce_microstep: 142.25 | step_microstep: 0.12 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12423 total_samples=18502, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:16,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.02 | bwd_microstep: 1714.85 | bwd_inner_microstep: 1589.31 | bwd_allreduce_microstep: 125.48 | step_microstep: 0.23 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13374 total_samples=18506, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:19,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.23 | bwd_microstep: 1987.06 | bwd_inner_microstep: 1859.95 | bwd_allreduce_microstep: 127.05 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14662 total_samples=18510, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:22,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19 [2025-08-03 05:27:22,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.21 | bwd_microstep: 2170.22 | bwd_inner_microstep: 1780.86 | bwd_allreduce_microstep: 389.28 | step_microstep: 132.63 [2025-08-03 05:27:22,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.55 | bwd: 7608.74 | bwd_inner: 6824.35 | bwd_allreduce: 784.14 | step: 133.10 {'loss': 0.7522, 'learning_rate': 7.002381646519625e-06, 'epoch': 0.61} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11746 total_samples=18513, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:25,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.07 | bwd_microstep: 1957.00 | bwd_inner_microstep: 1601.49 | bwd_allreduce_microstep: 355.44 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 
48.0, dynamic token length: 13135 total_samples=18517, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:27,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.38 | bwd_microstep: 1787.10 | bwd_inner_microstep: 1723.08 | bwd_allreduce_microstep: 63.94 | step_microstep: 0.16 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11874 total_samples=18520, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:30,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.48 | bwd_microstep: 2135.02 | bwd_inner_microstep: 1773.35 | bwd_allreduce_microstep: 361.61 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11940 total_samples=18523, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:33,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.13 [2025-08-03 05:27:33,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.77 | bwd_microstep: 2177.63 | bwd_inner_microstep: 1926.59 | bwd_allreduce_microstep: 250.97 | step_microstep: 160.37 [2025-08-03 05:27:33,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.62 | bwd: 8056.81 | bwd_inner: 7024.51 | bwd_allreduce: 1032.05 | step: 160.88 ███ | 1213/2000 [3:43:43<2:29:35, 11.40s/it] 61%|██████ | 1214/2000 [3:43:54<2:26:23, 11.18s/it] 61%|██████ | 1214/2000 [3:43:54<2:26:23, 11.18s/it] 61%|██████ | 1215/2000 [3:44:05<2:25:09, 11.10s/it] 61%|██████ | 1215/2000 [3:44:05<2:25:09, 11.10s/it] 61%|██████ | 1216/2000 [3:44:15<2:23:39, 10.99s/it] 61%|██████ | 1216/2000 [3:44:15<2:23:39, 10.99s/it] 61%|██████ | 1217/2000 [3:44:26<2:22:06, 10.89s/it] 61%|██████ | 1217/2000 [3:44:26<2:22:06, 10.89s/it] 61%|██████ | 1218/2000 [3:44:37<2:21:45, 10.88s/it] 61%|██████ | 1218/2000 [3:44:37<2:21:45, 10.88s/it] 61%|{'loss': 0.7476, 'learning_rate': 6.986936493066165e-06, 'epoch': 0.61} dynamic ViT 
batch size: 42, images per sample: 42.0, dynamic token length: 11723 total_samples=18526, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:36,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.29 | bwd_microstep: 1988.67 | bwd_inner_microstep: 1759.82 | bwd_allreduce_microstep: 228.78 | step_microstep: 0.12 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13989 total_samples=18530, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:39,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.21 | bwd_microstep: 2031.47 | bwd_inner_microstep: 1886.37 | bwd_allreduce_microstep: 145.03 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14722 total_samples=18534, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:42,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.69 | bwd_microstep: 1837.88 | bwd_inner_microstep: 1783.39 | bwd_allreduce_microstep: 54.42 | step_microstep: 0.83 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13103 total_samples=18538, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:44,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38 [2025-08-03 05:27:44,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.24 | bwd_microstep: 1744.51 | bwd_inner_microstep: 1675.33 | bwd_allreduce_microstep: 69.12 | step_microstep: 133.56 [2025-08-03 05:27:44,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.36 | bwd: 7602.59 | bwd_inner: 7104.91 | bwd_allreduce: 497.43 | step: 134.61 {'loss': 0.7395, 'learning_rate': 6.971499241020495e-06, 'epoch': 0.61} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11870 total_samples=18541, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
05:27:47,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.11 | bwd_microstep: 1838.52 | bwd_inner_microstep: 1576.48 | bwd_allreduce_microstep: 261.98 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16239 total_samples=18545, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:50,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.68 | bwd_microstep: 2163.56 | bwd_inner_microstep: 1961.18 | bwd_allreduce_microstep: 202.31 | step_microstep: 0.77 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13241 total_samples=18549, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:52,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.05 | bwd_microstep: 1717.25 | bwd_inner_microstep: 1655.82 | bwd_allreduce_microstep: 61.37 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13136 total_samples=18553, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:55,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50 [2025-08-03 05:27:55,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.95 | bwd_microstep: 1725.17 | bwd_inner_microstep: 1663.99 | bwd_allreduce_microstep: 61.12 | step_microstep: 128.76 [2025-08-03 05:27:55,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.71 | bwd: 7444.56 | bwd_inner: 6857.47 | bwd_allreduce: 586.86 | step: 129.90 {'loss': 0.732, 'learning_rate': 6.956069930865005e-06, 'epoch': 0.61} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13292 total_samples=18557, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:27:58,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.56 | bwd_microstep: 2149.47 | bwd_inner_microstep: 2004.92 | 
bwd_allreduce_microstep: 144.48 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13648 total_samples=18561, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:00,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.85 | bwd_microstep: 1822.01 | bwd_inner_microstep: 1723.19 | bwd_allreduce_microstep: 98.75 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11651 total_samples=18564, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:03,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.91 | bwd_microstep: 1777.86 | bwd_inner_microstep: 1547.77 | bwd_allreduce_microstep: 230.02 | step_microstep: 0.25 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13551 total_samples=18568, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:06,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.15 [2025-08-03 05:28:06,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.96 | bwd_microstep: 2113.42 | bwd_inner_microstep: 1916.45 | bwd_allreduce_microstep: 196.91 | step_microstep: 114.81 [2025-08-03 05:28:06,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2875.20 | bwd: 7862.81 | bwd_inner: 7192.32 | bwd_allreduce: 670.24 | step: 115.44 {'loss': 0.7562, 'learning_rate': 6.940648603061263e-06, 'epoch': 0.61} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11680 total_samples=18571, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:09,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.08 | bwd_microstep: 1851.50 | bwd_inner_microstep: 1602.78 | bwd_allreduce_microstep: 248.65 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14209 total_samples=18575, 
num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:11,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.66 | bwd_microstep: 1919.51 | bwd_inner_microstep: 1890.91 | bwd_allreduce_microstep: 28.54 | step_microstep: 0.24 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12269 total_samples=18579, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:14,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.59 | bwd_microstep: 1978.11 | bwd_inner_microstep: 1971.07 | bwd_allreduce_microstep: 6.96 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13745 total_samples=18583, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:17,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24 [2025-08-03 05:28:17,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.37 | bwd_microstep: 1771.50 | bwd_inner_microstep: 1720.35 | bwd_allreduce_microstep: 51.09 | step_microstep: 132.05 [2025-08-03 05:28:17,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.63 | bwd: 7520.68 | bwd_inner: 7185.11 | bwd_allreduce: 335.32 | step: 132.55 {'loss': 0.7594, 'learning_rate': 6.925235298049906e-06, 'epoch': 0.61} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12547 total_samples=18587, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:19,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.02 | bwd_microstep: 1773.83 | bwd_inner_microstep: 1636.67 | bwd_allreduce_microstep: 137.09 | step_microstep: 0.27 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13315 total_samples=18591, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:22,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.37 | 
bwd_microstep: 2304.07 | bwd_inner_microstep: 2266.93 | bwd_allreduce_microstep: 37.08 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11988 total_samples=18594, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:25,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.99 | bwd_microstep: 1741.44 | bwd_inner_microstep: 1547.34 | bwd_allreduce_microstep: 194.01 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13229 total_samples=18598, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:28,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.68 [2025-08-03 05:28:28,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.38 | bwd_microstep: 1734.72 | bwd_inner_microstep: 1673.20 | bwd_allreduce_microstep: 61.45 | step_microstep: 133.55 [2025-08-03 05:28:28,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.69 | bwd: 7554.11 | bwd_inner: 7124.13 | bwd_allreduce: 429.73 | step: 134.17 ██████ | 1219/2000 [3:44:48<2:23:28, 11.02s/it] 61%|██████ | 1219/2000 [3:44:48<2:23:28, 11.02s/it] 61%|██████ | 1220/2000 [3:44:59<2:22:33, 10.97s/it] 61%|██████ | 1220/2000 [3:44:59<2:22:33, 10.97s/it] 61%|██████ | 1221/2000 [3:45:10<2:21:14, 10.88s/it] 61%|██████ | 1221/2000 [3:45:10<2:21:14, 10.88s/it] 61%|██████ | 1222/2000 [3:45:21<2:22:00, 10.95s/it] 61%|██████ | 1222/2000 [3:45:21<2:22:00, 10.95s/it] 61%|██████ | 1223/2000 [3:45:32<2:21:18, 10.91s/it] 61%|██████ | 1223/2000 [3:45:32<2:21:18, 10.91s/it] 61%|██████ | 1224/2000 [3:45:42<2:20:35, 10.87s/{'loss': 0.7455, 'learning_rate': 6.909830056250527e-06, 'epoch': 0.61} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15547 total_samples=18602, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:30,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 700.98 | bwd_microstep: 1773.26 | bwd_inner_microstep: 1766.08 | bwd_allreduce_microstep: 7.11 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14218 total_samples=18606, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:33,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.86 | bwd_microstep: 1822.69 | bwd_inner_microstep: 1723.22 | bwd_allreduce_microstep: 99.40 | step_microstep: 0.84 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13732 total_samples=18610, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:35,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.31 | bwd_microstep: 1744.19 | bwd_inner_microstep: 1702.45 | bwd_allreduce_microstep: 41.67 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13210 total_samples=18614, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:38,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12 [2025-08-03 05:28:38,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.23 | bwd_microstep: 2007.76 | bwd_inner_microstep: 1882.18 | bwd_allreduce_microstep: 125.52 | step_microstep: 115.70 [2025-08-03 05:28:38,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.31 | bwd: 7347.95 | bwd_inner: 7073.93 | bwd_allreduce: 273.77 | step: 116.77 {'loss': 0.7395, 'learning_rate': 6.894432918061579e-06, 'epoch': 0.61} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13378 total_samples=18618, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:41,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.45 | bwd_microstep: 1905.23 | bwd_inner_microstep: 1812.52 | bwd_allreduce_microstep: 92.62 | step_microstep: 0.30 dynamic ViT batch size: 48, 
images per sample: 48.0, dynamic token length: 13464 total_samples=18622, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:43,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.35 | bwd_microstep: 1774.96 | bwd_inner_microstep: 1692.61 | bwd_allreduce_microstep: 82.28 | step_microstep: 0.13 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12945 total_samples=18626, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:46,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.01 | bwd_microstep: 2045.14 | bwd_inner_microstep: 1899.63 | bwd_allreduce_microstep: 145.44 | step_microstep: 0.09 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14815 total_samples=18631, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:49,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.05 [2025-08-03 05:28:49,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.32 | bwd_microstep: 1761.70 | bwd_inner_microstep: 1734.91 | bwd_allreduce_microstep: 26.72 | step_microstep: 119.75 [2025-08-03 05:28:49,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.07 | bwd: 7487.08 | bwd_inner: 7139.67 | bwd_allreduce: 347.15 | step: 120.28 {'loss': 0.7352, 'learning_rate': 6.8790439238602576e-06, 'epoch': 0.61} dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12115 total_samples=18635, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:52,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.29 | bwd_microstep: 1964.29 | bwd_inner_microstep: 1799.98 | bwd_allreduce_microstep: 164.24 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13746 total_samples=18640, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:54,819] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.09 | bwd_microstep: 2020.38 | bwd_inner_microstep: 1878.12 | bwd_allreduce_microstep: 142.20 | step_microstep: 0.24 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14031 total_samples=18644, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:28:57,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.77 | bwd_microstep: 1811.06 | bwd_inner_microstep: 1725.34 | bwd_allreduce_microstep: 85.66 | step_microstep: 0.09 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11953 total_samples=18647, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:00,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11 [2025-08-03 05:29:00,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.86 | bwd_microstep: 1969.08 | bwd_inner_microstep: 1825.22 | bwd_allreduce_microstep: 143.79 | step_microstep: 110.69 [2025-08-03 05:29:00,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2750.94 | bwd: 7764.86 | bwd_inner: 7228.65 | bwd_allreduce: 535.96 | step: 111.14 {'loss': 0.7387, 'learning_rate': 6.863663114002411e-06, 'epoch': 0.61} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11981 total_samples=18650, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:02,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.25 | bwd_microstep: 1724.99 | bwd_inner_microstep: 1536.83 | bwd_allreduce_microstep: 188.09 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13273 total_samples=18654, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:05,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.33 | bwd_microstep: 1823.62 | bwd_inner_microstep: 1677.23 | bwd_allreduce_microstep: 
146.33 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12831 total_samples=18658, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:08,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.26 | bwd_microstep: 2125.43 | bwd_inner_microstep: 1905.49 | bwd_allreduce_microstep: 219.88 | step_microstep: 0.15 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11663 total_samples=18661, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:11,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64 [2025-08-03 05:29:11,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.07 | bwd_microstep: 2084.23 | bwd_inner_microstep: 1838.21 | bwd_allreduce_microstep: 245.96 | step_microstep: 112.80 [2025-08-03 05:29:11,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.83 | bwd: 7758.31 | bwd_inner: 6957.75 | bwd_allreduce: 800.33 | step: 113.16 {'loss': 0.7489, 'learning_rate': 6.848290528822417e-06, 'epoch': 0.61} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13292 total_samples=18665, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:13,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.44 | bwd_microstep: 1854.41 | bwd_inner_microstep: 1798.01 | bwd_allreduce_microstep: 56.33 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13243 total_samples=18669, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:16,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.97 | bwd_microstep: 1758.60 | bwd_inner_microstep: 1689.88 | bwd_allreduce_microstep: 68.65 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12777 total_samples=18673, num_samples=4, 
num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:19,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.12 | bwd_microstep: 1867.05 | bwd_inner_microstep: 1671.24 | bwd_allreduce_microstep: 195.75 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13305 total_samples=18677, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:21,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99 [2025-08-03 05:29:21,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.55 | bwd_microstep: 1878.19 | bwd_inner_microstep: 1689.94 | bwd_allreduce_microstep: 188.19 | step_microstep: 115.95 [2025-08-03 05:29:21,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.01 | bwd: 7358.31 | bwd_inner: 6849.07 | bwd_allreduce: 509.01 | step: 116.42 it] 61%|██████ | 1224/2000 [3:45:42<2:20:35, 10.87s/it] 61%|██████▏ | 1225/2000 [3:45:53<2:19:09, 10.77s/it] 61%|██████▏ | 1225/2000 [3:45:53<2:19:09, 10.77s/it] 61%|██████▏ | 1226/2000 [3:46:04<2:18:34, 10.74s/it] 61%|██████▏ | 1226/2000 [3:46:04<2:18:34, 10.74s/it] 61%|██████▏ | 1227/2000 [3:46:15<2:19:10, 10.80s/it] 61%|██████▏ | 1227/2000 [3:46:15<2:19:10, 10.80s/it] 61%|██████▏ | 1228/2000 [3:46:26<2:19:44, 10.86s/it] 61%|██████▏ | 1228/2000 [3:46:26<2:19:44, 10.86s/it] 61%|██████▏ | 1229/2000 [3:46:36<2:18:20, 10.77s/it] {'loss': 0.7468, 'learning_rate': 6.8329262086330864e-06, 'epoch': 0.61} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13395 total_samples=18681, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:24,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.04 | bwd_microstep: 1775.57 | bwd_inner_microstep: 1690.34 | bwd_allreduce_microstep: 85.16 | step_microstep: 0.13 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12890 total_samples=18685, num_samples=4, 
num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:27,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.54 | bwd_microstep: 2304.95 | bwd_inner_microstep: 2299.26 | bwd_allreduce_microstep: 5.63 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13426 total_samples=18689, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:29,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.03 | bwd_microstep: 1772.17 | bwd_inner_microstep: 1695.21 | bwd_allreduce_microstep: 76.89 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11404 total_samples=18692, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:32,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90 [2025-08-03 05:29:32,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.48 | bwd_microstep: 1812.38 | bwd_inner_microstep: 1585.88 | bwd_allreduce_microstep: 226.44 | step_microstep: 131.76 [2025-08-03 05:29:32,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.01 | bwd: 7665.12 | bwd_inner: 7270.69 | bwd_allreduce: 394.20 | step: 132.25 {'loss': 0.7451, 'learning_rate': 6.8175701937255645e-06, 'epoch': 0.61} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11993 total_samples=18695, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:35,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.81 | bwd_microstep: 2020.07 | bwd_inner_microstep: 1788.29 | bwd_allreduce_microstep: 231.72 | step_microstep: 0.17 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14282 total_samples=18699, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:38,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.68 | bwd_microstep: 
1878.17 | bwd_inner_microstep: 1734.02 | bwd_allreduce_microstep: 144.08 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13564 total_samples=18703, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:40,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.23 | bwd_microstep: 1777.16 | bwd_inner_microstep: 1695.24 | bwd_allreduce_microstep: 81.87 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11596 total_samples=18706, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:43,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44 [2025-08-03 05:29:43,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.32 | bwd_microstep: 1986.64 | bwd_inner_microstep: 1798.32 | bwd_allreduce_microstep: 188.25 | step_microstep: 112.78 [2025-08-03 05:29:43,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.97 | bwd: 7662.10 | bwd_inner: 7015.85 | bwd_allreduce: 646.00 | step: 113.33 {'loss': 0.7473, 'learning_rate': 6.802222524369202e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13032 total_samples=18710, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:46,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.98 | bwd_microstep: 1892.43 | bwd_inner_microstep: 1822.98 | bwd_allreduce_microstep: 69.38 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13279 total_samples=18714, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:49,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.75 | bwd_microstep: 1966.70 | bwd_inner_microstep: 1863.51 | bwd_allreduce_microstep: 103.13 | step_microstep: 0.25 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token 
length: 13280 total_samples=18719, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:51,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.43 | bwd_microstep: 2059.96 | bwd_inner_microstep: 1901.55 | bwd_allreduce_microstep: 158.35 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13373 total_samples=18723, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:54,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52 [2025-08-03 05:29:54,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.74 | bwd_microstep: 2092.02 | bwd_inner_microstep: 1746.89 | bwd_allreduce_microstep: 345.06 | step_microstep: 112.84 [2025-08-03 05:29:54,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.82 | bwd: 8011.16 | bwd_inner: 7334.93 | bwd_allreduce: 675.99 | step: 113.47 {'loss': 0.7461, 'learning_rate': 6.786883240811479e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13818 total_samples=18727, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:29:57,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.86 | bwd_microstep: 1913.90 | bwd_inner_microstep: 1865.09 | bwd_allreduce_microstep: 48.75 | step_microstep: 0.13 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13709 total_samples=18731, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:00,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.08 | bwd_microstep: 2229.71 | bwd_inner_microstep: 1877.67 | bwd_allreduce_microstep: 351.98 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13305 total_samples=18735, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:03,102] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 688.81 | bwd_microstep: 1768.89 | bwd_inner_microstep: 1687.59 | bwd_allreduce_microstep: 81.24 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13338 total_samples=18739, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:05,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61 [2025-08-03 05:30:05,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.44 | bwd_microstep: 1822.28 | bwd_inner_microstep: 1711.97 | bwd_allreduce_microstep: 110.24 | step_microstep: 116.53 [2025-08-03 05:30:05,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.12 | bwd: 7734.84 | bwd_inner: 7142.31 | bwd_allreduce: 592.29 | step: 117.13 {'loss': 0.7554, 'learning_rate': 6.771552383277875e-06, 'epoch': 0.62} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13355 total_samples=18743, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:08,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.79 | bwd_microstep: 1785.50 | bwd_inner_microstep: 1671.56 | bwd_allreduce_microstep: 113.87 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14101 total_samples=18747, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:11,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.02 | bwd_microstep: 1909.66 | bwd_inner_microstep: 1854.79 | bwd_allreduce_microstep: 54.80 | step_microstep: 0.29 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15507 total_samples=18751, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:13,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.15 | bwd_microstep: 2016.70 | bwd_inner_microstep: 1931.81 | bwd_allreduce_microstep: 84.82 | step_microstep: 0.12 dynamic ViT batch 
size: 48, images per sample: 48.0, dynamic token length: 14433 total_samples=18756, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:16,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23 [2025-08-03 05:30:16,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.12 | bwd_microstep: 1746.00 | bwd_inner_microstep: 1718.90 | bwd_allreduce_microstep: 27.03 | step_microstep: 125.76 [2025-08-03 05:30:16,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.01 | bwd: 7457.91 | bwd_inner: 7177.06 | bwd_allreduce: 280.61 | step: 126.29 {'loss': 0.7478, 'learning_rate': 6.756229991971779e-06, 'epoch': 0.62} 61%|██████▏ | 1229/2000 [3:46:36<2:18:20, 10.77s/it] 62%|██████▏ | 1230/2000 [3:46:47<2:18:38, 10.80s/it] 62%|██████▏ | 1230/2000 [3:46:47<2:18:38, 10.80s/it] 62%|██████▏ | 1231/2000 [3:46:58<2:18:54, 10.84s/it] 62%|██████▏ | 1231/2000 [3:46:58<2:18:54, 10.84s/it] 62%|██████▏ | 1232/2000 [3:47:09<2:20:14, 10.96s/it] 62%|██████▏ | 1232/2000 [3:47:09<2:20:14, 10.96s/it] 62%|██████▏ | 1233/2000 [3:47:20<2:20:15, 10.97s/it] 62%|██████▏ | 1233/2000 [3:47:20<2:20:15, 10.97s/it] 62%|██████▏ | 1234/2000 [3:47:31<2:19:03, 10.89s/it] 62%|████dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12120 total_samples=18759, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:19,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 748.72 | bwd_microstep: 1851.22 | bwd_inner_microstep: 1594.92 | bwd_allreduce_microstep: 256.23 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13274 total_samples=18763, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:22,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.32 | bwd_microstep: 2003.96 | bwd_inner_microstep: 1860.90 | bwd_allreduce_microstep: 143.00 | step_microstep: 0.10 dynamic ViT 
batch size: 48, images per sample: 48.0, dynamic token length: 13462 total_samples=18768, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:24,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.82 | bwd_microstep: 2047.07 | bwd_inner_microstep: 1890.89 | bwd_allreduce_microstep: 156.09 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13495 total_samples=18773, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:27,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60 [2025-08-03 05:30:27,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.29 | bwd_microstep: 1841.75 | bwd_inner_microstep: 1729.88 | bwd_allreduce_microstep: 111.80 | step_microstep: 124.14 [2025-08-03 05:30:27,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2886.08 | bwd: 7744.05 | bwd_inner: 7076.60 | bwd_allreduce: 667.20 | step: 124.63 {'loss': 0.7509, 'learning_rate': 6.740916107074372e-06, 'epoch': 0.62} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13013 total_samples=18777, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:30,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.11 | bwd_microstep: 1870.34 | bwd_inner_microstep: 1655.18 | bwd_allreduce_microstep: 215.10 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12203 total_samples=18780, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:32,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.94 | bwd_microstep: 1788.13 | bwd_inner_microstep: 1590.35 | bwd_allreduce_microstep: 197.72 | step_microstep: 0.26 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13366 total_samples=18784, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
05:30:35,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.88 | bwd_microstep: 1960.64 | bwd_inner_microstep: 1715.87 | bwd_allreduce_microstep: 244.70 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14000 total_samples=18789, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:38,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64 [2025-08-03 05:30:38,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.50 | bwd_microstep: 1820.11 | bwd_inner_microstep: 1745.50 | bwd_allreduce_microstep: 74.54 | step_microstep: 131.85 [2025-08-03 05:30:38,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2757.36 | bwd: 7439.29 | bwd_inner: 6706.89 | bwd_allreduce: 732.15 | step: 132.33 {'loss': 0.748, 'learning_rate': 6.725610768744535e-06, 'epoch': 0.62} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11953 total_samples=18792, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:40,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.55 | bwd_microstep: 1817.85 | bwd_inner_microstep: 1781.41 | bwd_allreduce_microstep: 36.38 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11535 total_samples=18795, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:43,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.45 | bwd_microstep: 1773.04 | bwd_inner_microstep: 1539.46 | bwd_allreduce_microstep: 233.50 | step_microstep: 0.27 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13140 total_samples=18799, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:45,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.23 | bwd_microstep: 1794.84 | bwd_inner_microstep: 1686.76 | 
bwd_allreduce_microstep: 108.02 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14324 total_samples=18803, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:48,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60 [2025-08-03 05:30:48,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.15 | bwd_microstep: 1733.08 | bwd_inner_microstep: 1705.72 | bwd_allreduce_microstep: 27.29 | step_microstep: 127.51 [2025-08-03 05:30:48,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.30 | bwd: 7118.87 | bwd_inner: 6713.35 | bwd_allreduce: 405.27 | step: 128.02 {'loss': 0.7446, 'learning_rate': 6.710314017118734e-06, 'epoch': 0.62} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14194 total_samples=18807, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:51,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.77 | bwd_microstep: 1889.63 | bwd_inner_microstep: 1832.22 | bwd_allreduce_microstep: 57.34 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12082 total_samples=18810, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:54,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.43 | bwd_microstep: 1997.17 | bwd_inner_microstep: 1777.16 | bwd_allreduce_microstep: 219.95 | step_microstep: 0.13 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13234 total_samples=18814, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:56,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.46 | bwd_microstep: 2076.88 | bwd_inner_microstep: 1909.62 | bwd_allreduce_microstep: 167.19 | step_microstep: 0.29 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12783 total_samples=18818, 
num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:30:59,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57 [2025-08-03 05:30:59,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.97 | bwd_microstep: 1818.92 | bwd_inner_microstep: 1667.80 | bwd_allreduce_microstep: 151.05 | step_microstep: 130.61 [2025-08-03 05:30:59,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.55 | bwd: 7782.66 | bwd_inner: 7186.80 | bwd_allreduce: 595.61 | step: 131.27 {'loss': 0.7335, 'learning_rate': 6.695025892310913e-06, 'epoch': 0.62} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12093 total_samples=18821, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:02,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.23 | bwd_microstep: 2129.68 | bwd_inner_microstep: 1870.39 | bwd_allreduce_microstep: 259.22 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11654 total_samples=18824, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:05,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.73 | bwd_microstep: 2042.11 | bwd_inner_microstep: 1816.85 | bwd_allreduce_microstep: 225.20 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13388 total_samples=18828, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:08,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.39 | bwd_microstep: 1928.35 | bwd_inner_microstep: 1726.29 | bwd_allreduce_microstep: 201.99 | step_microstep: 0.13 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12795 total_samples=18832, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:11,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32 
[2025-08-03 05:31:11,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.65 | bwd_microstep: 2055.36 | bwd_inner_microstep: 1864.30 | bwd_allreduce_microstep: 191.00 | step_microstep: 130.33 [2025-08-03 05:31:11,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.94 | bwd: 8155.55 | bwd_inner: 7277.82 | bwd_allreduce: 877.49 | step: 130.79 {'loss': 0.7381, 'learning_rate': 6.6797464344124045e-06, 'epoch': 0.62} █▏ | 1234/2000 [3:47:31<2:19:03, 10.89s/it] 62%|██████▏ | 1235/2000 [3:47:42<2:19:27, 10.94s/it] 62%|██████▏ | 1235/2000 [3:47:42<2:19:27, 10.94s/it] 62%|██████▏ | 1236/2000 [3:47:53<2:18:12, 10.85s/it] 62%|██████▏ | 1236/2000 [3:47:53<2:18:12, 10.85s/it] 62%|██████▏ | 1237/2000 [3:48:03<2:16:07, 10.70s/it] 62%|██████▏ | 1237/2000 [3:48:03<2:16:07, 10.70s/it] 62%|██████▏ | 1238/2000 [3:48:14<2:17:22, 10.82s/it] 62%|██████▏ | 1238/2000 [3:48:14<2:17:22, 10.82s/it] 62%|██████▏ | 1239/2000 [3:48:25<2:19:21, 10.99s/it] 62%|██████▏ | 1239/2000 [3:48:25<2:19:21,dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13708 total_samples=18836, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:13,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.94 | bwd_microstep: 1943.69 | bwd_inner_microstep: 1691.32 | bwd_allreduce_microstep: 252.30 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11812 total_samples=18839, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:16,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.62 | bwd_microstep: 1873.14 | bwd_inner_microstep: 1713.67 | bwd_allreduce_microstep: 159.40 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13903 total_samples=18844, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:18,895] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 671.74 | bwd_microstep: 1728.55 | bwd_inner_microstep: 1663.38 | bwd_allreduce_microstep: 65.10 | step_microstep: 0.27 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12605 total_samples=18847, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:21,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93 [2025-08-03 05:31:21,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.53 | bwd_microstep: 1822.51 | bwd_inner_microstep: 1598.95 | bwd_allreduce_microstep: 223.50 | step_microstep: 118.93 [2025-08-03 05:31:21,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.76 | bwd: 7367.95 | bwd_inner: 6667.31 | bwd_allreduce: 700.39 | step: 119.44 {'loss': 0.7529, 'learning_rate': 6.664475683491797e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14274 total_samples=18851, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:24,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.89 | bwd_microstep: 1810.95 | bwd_inner_microstep: 1737.66 | bwd_allreduce_microstep: 73.23 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14730 total_samples=18855, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:26,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.40 | bwd_microstep: 1733.99 | bwd_inner_microstep: 1721.33 | bwd_allreduce_microstep: 12.59 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13210 total_samples=18859, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:29,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.21 | bwd_microstep: 1966.47 | bwd_inner_microstep: 1874.00 | bwd_allreduce_microstep: 92.39 | step_microstep: 0.12 dynamic ViT batch size: 
42, images per sample: 42.0, dynamic token length: 12023 total_samples=18862, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:32,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16 [2025-08-03 05:31:32,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.09 | bwd_microstep: 2046.28 | bwd_inner_microstep: 1808.41 | bwd_allreduce_microstep: 237.80 | step_microstep: 107.60 [2025-08-03 05:31:32,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.52 | bwd: 7557.73 | bwd_inner: 7141.40 | bwd_allreduce: 416.08 | step: 108.07 {'loss': 0.7565, 'learning_rate': 6.649213679594859e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13993 total_samples=18867, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:35,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.90 | bwd_microstep: 2233.81 | bwd_inner_microstep: 2017.99 | bwd_allreduce_microstep: 215.76 | step_microstep: 0.13 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12214 total_samples=18871, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:38,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.69 | bwd_microstep: 2022.39 | bwd_inner_microstep: 1607.98 | bwd_allreduce_microstep: 414.31 | step_microstep: 0.26 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13374 total_samples=18875, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:42,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1883.44 | bwd_microstep: 2420.38 | bwd_inner_microstep: 2353.51 | bwd_allreduce_microstep: 66.80 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12051 total_samples=18878, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:45,533] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16 [2025-08-03 05:31:45,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.73 | bwd_microstep: 2020.26 | bwd_inner_microstep: 1791.67 | bwd_allreduce_microstep: 228.52 | step_microstep: 111.35 [2025-08-03 05:31:45,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4002.68 | bwd: 8696.91 | bwd_inner: 7771.16 | bwd_allreduce: 925.46 | step: 111.88 {'loss': 0.7442, 'learning_rate': 6.633960462744415e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13182 total_samples=18882, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:48,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.13 | bwd_microstep: 2001.92 | bwd_inner_microstep: 1862.73 | bwd_allreduce_microstep: 139.13 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13862 total_samples=18886, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:50,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.18 | bwd_microstep: 1854.46 | bwd_inner_microstep: 1710.26 | bwd_allreduce_microstep: 144.15 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13752 total_samples=18890, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:53,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.21 | bwd_microstep: 2218.10 | bwd_inner_microstep: 2096.32 | bwd_allreduce_microstep: 121.71 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11954 total_samples=18893, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:56,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.28 [2025-08-03 05:31:56,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 693.23 | bwd_microstep: 1972.06 | bwd_inner_microstep: 1562.40 | bwd_allreduce_microstep: 409.59 | step_microstep: 137.85 [2025-08-03 05:31:56,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.68 | bwd: 8046.59 | bwd_inner: 7231.71 | bwd_allreduce: 814.64 | step: 138.32 {'loss': 0.7431, 'learning_rate': 6.618716072940248e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13699 total_samples=18897, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:31:59,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.73 | bwd_microstep: 1758.12 | bwd_inner_microstep: 1689.76 | bwd_allreduce_microstep: 68.29 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11928 total_samples=18900, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:01,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.66 | bwd_microstep: 1711.87 | bwd_inner_microstep: 1544.68 | bwd_allreduce_microstep: 167.12 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13254 total_samples=18904, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:04,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.84 | bwd_microstep: 2082.38 | bwd_inner_microstep: 1737.37 | bwd_allreduce_microstep: 344.94 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891 total_samples=18907, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:07,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16 [2025-08-03 05:32:07,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.06 | bwd_microstep: 1776.56 | bwd_inner_microstep: 1555.23 | bwd_allreduce_microstep: 221.26 | step_microstep: 135.15 [2025-08-03 05:32:07,413] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.21 | bwd: 7328.97 | bwd_inner: 6527.04 | bwd_allreduce: 801.68 | step: 135.54 {'loss': 0.7419, 'learning_rate': 6.603480550158995e-06, 'epoch': 0.62} dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13724 total_samples=18912, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:09,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.80 | bwd_microstep: 1764.04 | bwd_inner_microstep: 1654.70 | bwd_allreduce_microstep: 109.28 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11799 total_samples=18915, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:12,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.04 | bwd_microstep: 1897.67 | bwd_inner_microstep: 1620.83 | bwd_allreduce_microstep: 276.78 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14370 total_samples=18919, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:15,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.89 | bwd_microstep: 1949.32 | bwd_inner_microstep: 1915.44 | bwd_allreduce_microstep: 33.82 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11693 total_samples=18922, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:18,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89 [2025-08-03 05:32:18,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.31 | bwd_microstep: 2110.66 | bwd_inner_microstep: 1921.63 | bwd_allreduce_microstep: 188.97 | step_microstep: 111.10 [2025-08-03 05:32:18,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.97 | bwd: 7721.75 | bwd_inner: 7112.59 | bwd_allreduce: 608.92 | step: 111.55 10.99s/it] 
62%|██████▏ | 1240/2000 [3:48:36<2:17:31, 10.86s/it] 62%|██████▏ | 1240/2000 [3:48:36<2:17:31, 10.86s/it] 62%|██████▏ | 1241/2000 [3:48:47<2:17:07, 10.84s/it] 62%|██████▏ | 1241/2000 [3:48:47<2:17:07, 10.84s/it] 62%|██████▏ | 1242/2000 [3:49:00<2:25:32, 11.52s/it] 62%|██████▏ | 1242/2000 [3:49:00<2:25:32, 11.52s/it] 62%|██████▏ | 1243/2000 [3:49:11<2:24:24, 11.45s/it] 62%|██████▏ | 1243/2000 [3:49:11<2:24:24, 11.45s/it] 62%|██████▏ | 1244/2000 [3:49:22<2:21:02, 11.19s/it] 62%|██████▏ | 1244/2000 [3:49:22<2:21:02, 11.19s/it] 62%|██████▏ {'loss': 0.7415, 'learning_rate': 6.588253934354039e-06, 'epoch': 0.62} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11721 total_samples=18925, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:21,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.06 | bwd_microstep: 2013.58 | bwd_inner_microstep: 1800.20 | bwd_allreduce_microstep: 213.32 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770 total_samples=18928, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:24,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.59 | bwd_microstep: 2162.93 | bwd_inner_microstep: 1927.91 | bwd_allreduce_microstep: 234.96 | step_microstep: 0.25 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14277 total_samples=18934, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:26,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.60 | bwd_microstep: 1780.65 | bwd_inner_microstep: 1671.81 | bwd_allreduce_microstep: 108.78 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11765 total_samples=18937, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:29,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70 
[2025-08-03 05:32:29,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.82 | bwd_microstep: 1768.25 | bwd_inner_microstep: 1536.37 | bwd_allreduce_microstep: 231.81 | step_microstep: 110.96 [2025-08-03 05:32:29,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.00 | bwd: 7725.47 | bwd_inner: 6936.29 | bwd_allreduce: 788.94 | step: 111.43 {'loss': 0.7458, 'learning_rate': 6.5730362654554015e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14005 total_samples=18941, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:32,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.36 | bwd_microstep: 1998.64 | bwd_inner_microstep: 1886.19 | bwd_allreduce_microstep: 112.38 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12297 total_samples=18944, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:34,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.25 | bwd_microstep: 1751.16 | bwd_inner_microstep: 1573.76 | bwd_allreduce_microstep: 177.34 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13767 total_samples=18948, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:37,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.27 | bwd_microstep: 1837.30 | bwd_inner_microstep: 1727.41 | bwd_allreduce_microstep: 109.83 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14200 total_samples=18952, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:39,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18 [2025-08-03 05:32:39,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.56 | bwd_microstep: 1851.08 | bwd_inner_microstep: 1776.66 | 
bwd_allreduce_microstep: 74.36 | step_microstep: 136.01 [2025-08-03 05:32:39,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.38 | bwd: 7438.24 | bwd_inner: 6964.01 | bwd_allreduce: 473.98 | step: 136.51 {'loss': 0.7476, 'learning_rate': 6.5578275833696485e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14101 total_samples=18956, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:42,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.35 | bwd_microstep: 1810.70 | bwd_inner_microstep: 1738.04 | bwd_allreduce_microstep: 72.58 | step_microstep: 0.16 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11947 total_samples=18959, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:45,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.99 | bwd_microstep: 1833.82 | bwd_inner_microstep: 1605.17 | bwd_allreduce_microstep: 228.59 | step_microstep: 0.22 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12561 total_samples=18963, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:48,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.99 | bwd_microstep: 2006.50 | bwd_inner_microstep: 1630.09 | bwd_allreduce_microstep: 376.36 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13873 total_samples=18967, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:50,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.25 [2025-08-03 05:32:50,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.98 | bwd_microstep: 1739.97 | bwd_inner_microstep: 1695.08 | bwd_allreduce_microstep: 44.81 | step_microstep: 140.33 [2025-08-03 05:32:50,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.24 | bwd: 7391.05 
| bwd_inner: 6668.37 | bwd_allreduce: 722.42 | step: 140.83 {'loss': 0.7495, 'learning_rate': 6.542627927979772e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13714 total_samples=18971, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:53,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.03 | bwd_microstep: 2185.55 | bwd_inner_microstep: 2062.50 | bwd_allreduce_microstep: 122.98 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13582 total_samples=18975, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:56,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.89 | bwd_microstep: 2109.22 | bwd_inner_microstep: 1858.63 | bwd_allreduce_microstep: 250.53 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11551 total_samples=18978, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:32:59,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.78 | bwd_microstep: 1871.91 | bwd_inner_microstep: 1620.43 | bwd_allreduce_microstep: 251.43 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13471 total_samples=18982, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:01,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:33:01,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.52 | bwd_microstep: 1736.31 | bwd_inner_microstep: 1679.80 | bwd_allreduce_microstep: 56.43 | step_microstep: 111.65 [2025-08-03 05:33:01,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.16 | bwd: 7903.05 | bwd_inner: 7221.35 | bwd_allreduce: 681.46 | step: 112.13 {'loss': 0.7468, 'learning_rate': 6.527437339145097e-06, 'epoch': 0.62} dynamic ViT batch size: 47, 
images per sample: 47.0, dynamic token length: 14705 total_samples=18986, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:04,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.12 | bwd_microstep: 2017.17 | bwd_inner_microstep: 1955.52 | bwd_allreduce_microstep: 61.57 | step_microstep: 0.29 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13374 total_samples=18990, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:07,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.86 | bwd_microstep: 2059.33 | bwd_inner_microstep: 2053.17 | bwd_allreduce_microstep: 6.09 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14083 total_samples=18994, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:10,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.46 | bwd_microstep: 2158.74 | bwd_inner_microstep: 2022.76 | bwd_allreduce_microstep: 135.92 | step_microstep: 0.23 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14974 total_samples=18998, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:12,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45 [2025-08-03 05:33:12,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.41 | bwd_microstep: 1776.21 | bwd_inner_microstep: 1726.55 | bwd_allreduce_microstep: 49.57 | step_microstep: 141.31 [2025-08-03 05:33:12,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.78 | bwd: 8011.48 | bwd_inner: 7757.99 | bwd_allreduce: 253.23 | step: 141.94 | 1245/2000 [3:49:33<2:19:58, 11.12s/it] 62%|██████▏ | 1245/2000 [3:49:33<2:19:58, 11.12s/it] 62%|██████▏ | 1246/2000 [3:49:44<2:19:02, 11.06s/it] 62%|██████▏ | 1246/2000 [3:49:44<2:19:02, 11.06s/it] 62%|██████▏ | 1247/2000 [3:49:54<2:17:24, 10.95s/it] 62%|██████▏ 
| 1247/2000 [3:49:54<2:17:24, 10.95s/it] 62%|██████▏ | 1248/2000 [3:50:05<2:16:07, 10.86s/it] 62%|██████▏ | 1248/2000 [3:50:05<2:16:07, 10.86s/it] 62%|██████▏ | 1249/2000 [3:50:16<2:16:47, 10.93s/it] 62%|██████▏ | 1249/2000 [3:50:16<2:16:47, 10.93s/it] 62%|██████▎ | 1250/2000 [3:50:27<2:17:50, 11.03s/{'loss': 0.7445, 'learning_rate': 6.5122558567011775e-06, 'epoch': 0.62} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13008 total_samples=19002, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:15,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 914.62 | bwd_microstep: 1752.42 | bwd_inner_microstep: 1675.47 | bwd_allreduce_microstep: 76.88 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13566 total_samples=19006, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:18,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.38 | bwd_microstep: 1767.67 | bwd_inner_microstep: 1706.93 | bwd_allreduce_microstep: 60.67 | step_microstep: 0.30 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11750 total_samples=19009, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:21,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.82 | bwd_microstep: 2099.38 | bwd_inner_microstep: 1863.38 | bwd_allreduce_microstep: 235.93 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13599 total_samples=19013, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:23,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35 [2025-08-03 05:33:23,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.99 | bwd_microstep: 1896.90 | bwd_inner_microstep: 1707.08 | bwd_allreduce_microstep: 189.76 | step_microstep: 158.14 [2025-08-03 05:33:23,952] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3018.75 | bwd: 7516.43 | bwd_inner: 6952.86 | bwd_allreduce: 563.32 | step: 158.67 {'loss': 0.7508, 'learning_rate': 6.497083520459674e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16009 total_samples=19017, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:26,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.04 | bwd_microstep: 1808.79 | bwd_inner_microstep: 1802.11 | bwd_allreduce_microstep: 6.61 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16144 total_samples=19021, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:29,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.24 | bwd_microstep: 2010.53 | bwd_inner_microstep: 2004.46 | bwd_allreduce_microstep: 6.01 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13725 total_samples=19025, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:31,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.30 | bwd_microstep: 1720.00 | bwd_inner_microstep: 1678.91 | bwd_allreduce_microstep: 41.02 | step_microstep: 0.26 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11981 total_samples=19028, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:35,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33 [2025-08-03 05:33:35,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.17 | bwd_microstep: 2029.17 | bwd_inner_microstep: 1787.92 | bwd_allreduce_microstep: 241.18 | step_microstep: 469.85 [2025-08-03 05:33:35,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.67 | bwd: 7568.54 | bwd_inner: 7273.40 | bwd_allreduce: 294.90 | step: 470.51 {'loss': 0.7378, 
'learning_rate': 6.481920370208274e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13664 total_samples=19032, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:37,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.13 | bwd_microstep: 1731.56 | bwd_inner_microstep: 1677.68 | bwd_allreduce_microstep: 53.81 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13733 total_samples=19036, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:40,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.69 | bwd_microstep: 1806.39 | bwd_inner_microstep: 1742.76 | bwd_allreduce_microstep: 63.55 | step_microstep: 0.42 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13635 total_samples=19040, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:42,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.35 | bwd_microstep: 1978.54 | bwd_inner_microstep: 1876.46 | bwd_allreduce_microstep: 101.99 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14715 total_samples=19044, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:45,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.85 [2025-08-03 05:33:45,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.94 | bwd_microstep: 1755.98 | bwd_inner_microstep: 1726.32 | bwd_allreduce_microstep: 29.59 | step_microstep: 149.67 [2025-08-03 05:33:45,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.04 | bwd: 7272.54 | bwd_inner: 7023.22 | bwd_allreduce: 249.04 | step: 150.34 {'loss': 0.7426, 'learning_rate': 6.466766445710568e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14922 total_samples=19049, 
num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:48,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.42 | bwd_microstep: 2107.12 | bwd_inner_microstep: 1971.42 | bwd_allreduce_microstep: 135.63 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13483 total_samples=19053, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:51,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.91 | bwd_microstep: 2012.72 | bwd_inner_microstep: 1825.08 | bwd_allreduce_microstep: 187.57 | step_microstep: 0.27 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13393 total_samples=19057, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:54,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.60 | bwd_microstep: 1854.94 | bwd_inner_microstep: 1814.76 | bwd_allreduce_microstep: 40.11 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11972 total_samples=19060, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:56,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33 [2025-08-03 05:33:56,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.19 | bwd_microstep: 1912.36 | bwd_inner_microstep: 1793.47 | bwd_allreduce_microstep: 118.83 | step_microstep: 111.77 [2025-08-03 05:33:56,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.06 | bwd: 7887.19 | bwd_inner: 7404.72 | bwd_allreduce: 482.22 | step: 112.27 {'loss': 0.7311, 'learning_rate': 6.4516217867059615e-06, 'epoch': 0.63} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938 total_samples=19063, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:33:59,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.70 | 
bwd_microstep: 1782.21 | bwd_inner_microstep: 1572.75 | bwd_allreduce_microstep: 209.39 | step_microstep: 0.26 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13829 total_samples=19069, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:01,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.51 | bwd_microstep: 1763.26 | bwd_inner_microstep: 1691.88 | bwd_allreduce_microstep: 71.31 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11855 total_samples=19072, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:04,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.21 | bwd_microstep: 1761.85 | bwd_inner_microstep: 1553.27 | bwd_allreduce_microstep: 208.51 | step_microstep: 0.19 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11715 total_samples=19075, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:07,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04 [2025-08-03 05:34:07,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.09 | bwd_microstep: 2164.06 | bwd_inner_microstep: 1833.03 | bwd_allreduce_microstep: 330.96 | step_microstep: 116.36 [2025-08-03 05:34:07,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.43 | bwd: 7471.43 | bwd_inner: 6650.92 | bwd_allreduce: 820.26 | step: 116.93 it] 62%|██████▎ | 1250/2000 [3:50:27<2:17:50, 11.03s/it] 63%|██████▎ | 1251/2000 [3:50:38<2:17:26, 11.01s/it] 63%|██████▎ | 1251/2000 [3:50:38<2:17:26, 11.01s/it] 63%|██████▎ | 1252/2000 [3:50:49<2:17:54, 11.06s/it] 63%|██████▎ | 1252/2000 [3:50:50<2:17:54, 11.06s/it] 63%|██████▎ | 1253/2000 [3:51:00<2:15:51, 10.91s/it] 63%|██████▎ | 1253/2000 [3:51:00<2:15:51, 10.91s/it] 63%|██████▎ | 1254/2000 [3:51:11<2:16:18, 10.96s/it] 63%|██████▎ | 1254/2000 [3:51:11<2:16:18, 10.96s/it] 63%|██████▎ | 
1255/2000 [3:51:22<2:15:04, 10.88s/it] {'loss': 0.7435, 'learning_rate': 6.43648643290955e-06, 'epoch': 0.63} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11518 total_samples=19078, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:10,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.39 | bwd_microstep: 2218.64 | bwd_inner_microstep: 1855.95 | bwd_allreduce_microstep: 362.62 | step_microstep: 0.13 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14456 total_samples=19082, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:13,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.81 | bwd_microstep: 1936.20 | bwd_inner_microstep: 1820.44 | bwd_allreduce_microstep: 115.69 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13737 total_samples=19086, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:15,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.06 | bwd_microstep: 1758.24 | bwd_inner_microstep: 1698.90 | bwd_allreduce_microstep: 59.27 | step_microstep: 0.27 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13110 total_samples=19089, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:18,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77 [2025-08-03 05:34:18,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.41 | bwd_microstep: 1892.29 | bwd_inner_microstep: 1744.76 | bwd_allreduce_microstep: 147.45 | step_microstep: 160.97 [2025-08-03 05:34:18,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2884.59 | bwd: 7805.43 | bwd_inner: 7120.05 | bwd_allreduce: 685.13 | step: 161.49 {'loss': 0.7442, 'learning_rate': 6.421360424012039e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, 
dynamic token length: 13314 total_samples=19093, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:21,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.13 | bwd_microstep: 2274.62 | bwd_inner_microstep: 2119.72 | bwd_allreduce_microstep: 154.83 | step_microstep: 0.13 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13246 total_samples=19097, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:24,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.37 | bwd_microstep: 2040.83 | bwd_inner_microstep: 1872.48 | bwd_allreduce_microstep: 168.27 | step_microstep: 0.26 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11625 total_samples=19100, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:26,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.27 | bwd_microstep: 1733.18 | bwd_inner_microstep: 1538.15 | bwd_allreduce_microstep: 194.97 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12116 total_samples=19103, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:29,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69 [2025-08-03 05:34:29,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.80 | bwd_microstep: 1867.13 | bwd_inner_microstep: 1567.56 | bwd_allreduce_microstep: 299.50 | step_microstep: 111.79 [2025-08-03 05:34:29,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.50 | bwd: 7915.82 | bwd_inner: 7097.91 | bwd_allreduce: 817.66 | step: 112.29 {'loss': 0.7373, 'learning_rate': 6.406243799679625e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13544 total_samples=19107, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:32,364] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.15 | bwd_microstep: 1861.41 | bwd_inner_microstep: 1820.60 | bwd_allreduce_microstep: 40.74 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12253 total_samples=19110, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:34,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.71 | bwd_microstep: 1791.60 | bwd_inner_microstep: 1574.24 | bwd_allreduce_microstep: 217.29 | step_microstep: 0.26 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12889 total_samples=19114, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:37,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.61 | bwd_microstep: 1989.78 | bwd_inner_microstep: 1676.52 | bwd_allreduce_microstep: 313.20 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11826 total_samples=19117, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:40,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.78 [2025-08-03 05:34:40,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.76 | bwd_microstep: 1811.07 | bwd_inner_microstep: 1579.03 | bwd_allreduce_microstep: 231.96 | step_microstep: 130.10 [2025-08-03 05:34:40,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.17 | bwd: 7453.93 | bwd_inner: 6650.39 | bwd_allreduce: 803.29 | step: 130.59 {'loss': 0.7422, 'learning_rate': 6.39113659955389e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13915 total_samples=19121, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:43,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.71 | bwd_microstep: 2185.19 | bwd_inner_microstep: 2060.50 | bwd_allreduce_microstep: 124.62 | 
step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13293 total_samples=19125, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:46,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.73 | bwd_microstep: 2025.59 | bwd_inner_microstep: 1721.03 | bwd_allreduce_microstep: 304.49 | step_microstep: 0.42 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13756 total_samples=19129, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:49,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.07 | bwd_microstep: 2087.82 | bwd_inner_microstep: 1933.13 | bwd_allreduce_microstep: 154.63 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11919 total_samples=19132, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:51,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50 [2025-08-03 05:34:51,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.02 | bwd_microstep: 1837.22 | bwd_inner_microstep: 1605.03 | bwd_allreduce_microstep: 232.12 | step_microstep: 112.95 [2025-08-03 05:34:51,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.45 | bwd: 8135.88 | bwd_inner: 7319.69 | bwd_allreduce: 815.94 | step: 113.61 {'loss': 0.7451, 'learning_rate': 6.376038863251706e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14032 total_samples=19138, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:54,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.37 | bwd_microstep: 1721.54 | bwd_inner_microstep: 1681.65 | bwd_allreduce_microstep: 39.82 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14265 total_samples=19143, num_samples=5, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:34:57,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.87 | bwd_microstep: 1873.51 | bwd_inner_microstep: 1817.00 | bwd_allreduce_microstep: 56.43 | step_microstep: 0.25 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13738 total_samples=19147, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:34:59,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.24 | bwd_microstep: 1845.20 | bwd_inner_microstep: 1783.40 | bwd_allreduce_microstep: 61.73 | step_microstep: 0.74 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11758 total_samples=19150, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:02,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57 [2025-08-03 05:35:02,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.09 | bwd_microstep: 1847.98 | bwd_inner_microstep: 1596.31 | bwd_allreduce_microstep: 251.60 | step_microstep: 111.48 [2025-08-03 05:35:02,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.49 | bwd: 7288.28 | bwd_inner: 6878.36 | bwd_allreduce: 409.67 | step: 112.60 {'loss': 0.7394, 'learning_rate': 6.360950630365126e-06, 'epoch': 0.63} 63%|██████▎ | 1255/2000 [3:51:22<2:15:04, 10.88s/it] 63%|██████▎ | 1256/2000 [3:51:33<2:15:49, 10.95s/it] 63%|██████▎ | 1257/2000 [3:51:44<2:16:15, 11.00s/it] 63%|██████▎ | 1258/2000 [3:51:55<2:15:05, 10.92s/it] 63%|██████▎ | 1259/2000 [3:52:06<2:16:42, 11.07s/it] 63%|██████▎ | 1260/2000 [3:52:17<2:14:37, 10.92s/it] dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15621 total_samples=19154, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:35:04,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.18 | bwd_microstep: 1777.67 | bwd_inner_microstep: 1768.78 | bwd_allreduce_microstep: 8.82 | step_microstep: 0.30 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13314 total_samples=19158, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:07,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.78 | bwd_microstep: 1860.48 | bwd_inner_microstep: 1726.47 | bwd_allreduce_microstep: 133.95 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11689 total_samples=19161, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:10,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.20 | bwd_microstep: 1777.11 | bwd_inner_microstep: 1566.39 | bwd_allreduce_microstep: 210.64 | step_microstep: 0.82 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723 total_samples=19164, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:12,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77 [2025-08-03 05:35:12,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.45 | bwd_microstep: 1725.45 | bwd_inner_microstep: 1534.95 | bwd_allreduce_microstep: 190.43 | step_microstep: 159.42 [2025-08-03 05:35:12,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2753.54 | bwd: 7140.77 | bwd_inner: 6596.59 | bwd_allreduce: 543.92 | step: 160.66 {'loss': 0.7508, 'learning_rate': 6.345871940461282e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14586 total_samples=19168, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:15,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.91 | bwd_microstep: 1750.13 | 
bwd_inner_microstep: 1716.57 | bwd_allreduce_microstep: 33.49 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12055 total_samples=19171, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:17,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.64 | bwd_microstep: 1872.05 | bwd_inner_microstep: 1574.93 | bwd_allreduce_microstep: 297.06 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11879 total_samples=19174, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:20,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.34 | bwd_microstep: 2012.43 | bwd_inner_microstep: 1556.75 | bwd_allreduce_microstep: 455.62 | step_microstep: 0.12 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12618 total_samples=19178, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:23,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.95 [2025-08-03 05:35:23,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.72 | bwd_microstep: 1844.78 | bwd_inner_microstep: 1644.84 | bwd_allreduce_microstep: 199.87 | step_microstep: 123.80 [2025-08-03 05:35:23,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.54 | bwd: 7479.45 | bwd_inner: 6493.08 | bwd_allreduce: 986.12 | step: 124.31 {'loss': 0.7307, 'learning_rate': 6.33080283308228e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13132 total_samples=19182, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:26,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.13 | bwd_microstep: 1958.49 | bwd_inner_microstep: 1838.40 | bwd_allreduce_microstep: 120.03 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 
14696 total_samples=19186, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:28,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.99 | bwd_microstep: 1919.73 | bwd_inner_microstep: 1758.20 | bwd_allreduce_microstep: 161.45 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13620 total_samples=19190, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:31,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.23 | bwd_microstep: 1803.66 | bwd_inner_microstep: 1687.87 | bwd_allreduce_microstep: 115.72 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11787 total_samples=19193, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:34,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38 [2025-08-03 05:35:34,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.82 | bwd_microstep: 2032.11 | bwd_inner_microstep: 1923.74 | bwd_allreduce_microstep: 108.30 | step_microstep: 113.77 [2025-08-03 05:35:34,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.10 | bwd: 7714.03 | bwd_inner: 7208.19 | bwd_allreduce: 505.58 | step: 114.38 {'loss': 0.7522, 'learning_rate': 6.315743347745098e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14879 total_samples=19198, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:36,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.17 | bwd_microstep: 1731.18 | bwd_inner_microstep: 1719.47 | bwd_allreduce_microstep: 11.65 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13322 total_samples=19202, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:39,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 698.14 | bwd_microstep: 1796.83 | bwd_inner_microstep: 1714.71 | bwd_allreduce_microstep: 82.04 | step_microstep: 0.15 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13787 total_samples=19206, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:42,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.65 | bwd_microstep: 1795.50 | bwd_inner_microstep: 1721.21 | bwd_allreduce_microstep: 74.21 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13504 total_samples=19210, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:44,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15 [2025-08-03 05:35:44,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.22 | bwd_microstep: 1925.59 | bwd_inner_microstep: 1885.71 | bwd_allreduce_microstep: 39.81 | step_microstep: 126.13 [2025-08-03 05:35:44,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.11 | bwd: 7249.16 | bwd_inner: 7041.10 | bwd_allreduce: 207.80 | step: 126.64 {'loss': 0.7422, 'learning_rate': 6.300693523941481e-06, 'epoch': 0.63} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13312 total_samples=19214, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:47,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.13 | bwd_microstep: 1811.80 | bwd_inner_microstep: 1681.03 | bwd_allreduce_microstep: 130.71 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13488 total_samples=19218, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:49,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.79 | bwd_microstep: 1744.65 | bwd_inner_microstep: 1695.32 | bwd_allreduce_microstep: 49.27 | step_microstep: 0.23 dynamic ViT batch size: 42, 
images per sample: 42.0, dynamic token length: 11640 total_samples=19221, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:52,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.25 | bwd_microstep: 2144.48 | bwd_inner_microstep: 1936.96 | bwd_allreduce_microstep: 207.46 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13252 total_samples=19225, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:55,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98 [2025-08-03 05:35:55,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.38 | bwd_microstep: 1805.72 | bwd_inner_microstep: 1698.21 | bwd_allreduce_microstep: 107.44 | step_microstep: 134.58 [2025-08-03 05:35:55,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.48 | bwd: 7506.70 | bwd_inner: 7011.50 | bwd_allreduce: 494.95 | step: 135.06 {'loss': 0.7547, 'learning_rate': 6.2856534011378365e-06, 'epoch': 0.63} 63%|██████▎ | 1261/2000 [3:52:27<2:12:22, 10.75s/it] 63%|██████▎ | 1262/2000 [3:52:38<2:12:02, 10.74s/it] 63%|██████▎ | 1263/2000 [3:52:49<2:12:38, 10.80s/it] 63%|██████▎ | 1264/2000 [3:52:59<2:11:11, 10.70s/it] 63%|██████▎ | 1265/2000 [3:53:10<2:11:11, 10.71s/it] dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15728 total_samples=19229, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:35:58,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.44 | bwd_microstep: 2183.04 | bwd_inner_microstep: 2098.96 | bwd_allreduce_microstep: 84.01 | step_microstep: 0.24 
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14725 total_samples=19233, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:01,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.17 | bwd_microstep: 1780.18 | bwd_inner_microstep: 1737.68 | bwd_allreduce_microstep: 42.44 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13758 total_samples=19237, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:03,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.17 | bwd_microstep: 1974.65 | bwd_inner_microstep: 1906.79 | bwd_allreduce_microstep: 67.80 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13215 total_samples=19241, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:06,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49 [2025-08-03 05:36:06,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.50 | bwd_microstep: 1694.74 | bwd_inner_microstep: 1636.52 | bwd_allreduce_microstep: 58.16 | step_microstep: 145.25 [2025-08-03 05:36:06,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.22 | bwd: 7632.66 | bwd_inner: 7379.93 | bwd_allreduce: 252.48 | step: 145.73 {'loss': 0.7441, 'learning_rate': 6.270623018775135e-06, 'epoch': 0.63} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13008 total_samples=19245, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:09,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 981.80 | bwd_microstep: 2026.83 | bwd_inner_microstep: 1865.42 | bwd_allreduce_microstep: 161.35 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508 total_samples=19249, num_samples=4, num_padding_tokens=0, num_padding_images=0 
[2025-08-03 05:36:12,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.39 | bwd_microstep: 1963.48 | bwd_inner_microstep: 1872.63 | bwd_allreduce_microstep: 90.79 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14781 total_samples=19255, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:14,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.26 | bwd_microstep: 1768.73 | bwd_inner_microstep: 1753.58 | bwd_allreduce_microstep: 15.09 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309 total_samples=19259, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:17,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27 [2025-08-03 05:36:17,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.74 | bwd_microstep: 2055.39 | bwd_inner_microstep: 1901.56 | bwd_allreduce_microstep: 153.77 | step_microstep: 132.32 [2025-08-03 05:36:17,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3075.13 | bwd: 7814.48 | bwd_inner: 7393.17 | bwd_allreduce: 421.08 | step: 132.64 {'loss': 0.7438, 'learning_rate': 6.255602416268799e-06, 'epoch': 0.63} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13744 total_samples=19263, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:20,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.77 | bwd_microstep: 1829.64 | bwd_inner_microstep: 1708.48 | bwd_allreduce_microstep: 121.09 | step_microstep: 0.16 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13599 total_samples=19267, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:23,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.95 | bwd_microstep: 1768.35 | bwd_inner_microstep: 1695.06 | 
bwd_allreduce_microstep: 73.22 | step_microstep: 0.32 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12172 total_samples=19270, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:25,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.28 | bwd_microstep: 1762.48 | bwd_inner_microstep: 1570.62 | bwd_allreduce_microstep: 191.80 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348 total_samples=19274, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:28,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14 [2025-08-03 05:36:28,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.03 | bwd_microstep: 1972.80 | bwd_inner_microstep: 1876.00 | bwd_allreduce_microstep: 96.74 | step_microstep: 109.99 [2025-08-03 05:36:28,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.93 | bwd: 7333.32 | bwd_inner: 6850.15 | bwd_allreduce: 482.93 | step: 110.58 {'loss': 0.7451, 'learning_rate': 6.2405916330086106e-06, 'epoch': 0.63} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13463 total_samples=19278, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:30,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.93 | bwd_microstep: 1736.92 | bwd_inner_microstep: 1675.07 | bwd_allreduce_microstep: 61.79 | step_microstep: 0.10 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15218 total_samples=19282, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:33,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.93 | bwd_microstep: 2112.12 | bwd_inner_microstep: 1797.14 | bwd_allreduce_microstep: 314.91 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14307 total_samples=19286, 
num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:36,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.38 | bwd_microstep: 1746.61 | bwd_inner_microstep: 1722.21 | bwd_allreduce_microstep: 24.34 | step_microstep: 0.16 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14415 total_samples=19290, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:38,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15 [2025-08-03 05:36:38,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.13 | bwd_microstep: 1719.47 | bwd_inner_microstep: 1713.34 | bwd_allreduce_microstep: 6.06 | step_microstep: 138.06 [2025-08-03 05:36:38,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.32 | bwd: 7315.18 | bwd_inner: 6907.75 | bwd_allreduce: 407.18 | step: 138.43 {'loss': 0.7591, 'learning_rate': 6.225590708358596e-06, 'epoch': 0.63} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13022 total_samples=19294, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:41,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.07 | bwd_microstep: 1733.54 | bwd_inner_microstep: 1634.65 | bwd_allreduce_microstep: 98.82 | step_microstep: 0.22 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13419 total_samples=19298, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:44,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.41 | bwd_microstep: 2190.11 | bwd_inner_microstep: 2184.09 | bwd_allreduce_microstep: 5.96 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11830 total_samples=19301, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:47,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.25 | 
bwd_microstep: 2022.57 | bwd_inner_microstep: 1563.27 | bwd_allreduce_microstep: 459.24 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13847 total_samples=19305, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:49,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.46 [2025-08-03 05:36:49,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.44 | bwd_microstep: 1726.19 | bwd_inner_microstep: 1693.92 | bwd_allreduce_microstep: 32.21 | step_microstep: 129.37 [2025-08-03 05:36:49,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.10 | bwd: 7672.46 | bwd_inner: 7075.93 | bwd_allreduce: 596.30 | step: 129.82 {'loss': 0.7515, 'learning_rate': 6.210599681656933e-06, 'epoch': 0.64} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15025 total_samples=19310, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:52,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.52 | bwd_microstep: 1769.72 | bwd_inner_microstep: 1701.97 | bwd_allreduce_microstep: 67.68 | step_microstep: 0.13 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13168 total_samples=19314, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:54,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.94 | bwd_microstep: 1736.05 | bwd_inner_microstep: 1629.58 | bwd_allreduce_microstep: 106.40 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11727 total_samples=19317, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:36:57,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.49 | bwd_microstep: 2028.84 | bwd_inner_microstep: 1860.60 | bwd_allreduce_microstep: 168.17 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, 
dynamic token length: 13215 total_samples=19321, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:00,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37 [2025-08-03 05:37:00,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.94 | bwd_microstep: 2027.92 | bwd_inner_microstep: 1896.48 | bwd_allreduce_microstep: 131.37 | step_microstep: 151.67 [2025-08-03 05:37:00,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.81 | bwd: 7562.58 | bwd_inner: 7088.63 | bwd_allreduce: 473.71 | step: 152.03 63%|██████▎ | 1266/2000 [3:53:21<2:11:42, 10.77s/it] 63%|██████▎ | 1267/2000 [3:53:32<2:13:39, 10.94s/it] 63%|██████▎ | 1268/2000 [3:53:43<2:12:07, 10.83s/it] 63%|██████▎ | 1269/2000 [3:53:53<2:10:53, 10.74s/it] 64%|██████▎ | 1270/2000 [3:54:04<2:11:16, 10.79s/it] {'loss': 0.7453, 'learning_rate': 6.1956185922158445e-06, 'epoch': 0.64} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14082 total_samples=19325, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:03,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.76 | bwd_microstep: 2127.67 | bwd_inner_microstep: 1734.31 | bwd_allreduce_microstep: 393.30 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11930 total_samples=19328, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:06,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.14 | bwd_microstep: 1825.01 | bwd_inner_microstep: 1589.60 | bwd_allreduce_microstep: 235.35 | step_microstep: 0.13 dynamic ViT batch size: 48, 
images per sample: 48.0, dynamic token length: 13354 total_samples=19332, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:09,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.19 | bwd_microstep: 1995.22 | bwd_inner_microstep: 1883.96 | bwd_allreduce_microstep: 111.19 | step_microstep: 0.27 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14106 total_samples=19336, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:11,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03 [2025-08-03 05:37:11,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.76 | bwd_microstep: 1796.23 | bwd_inner_microstep: 1731.87 | bwd_allreduce_microstep: 64.30 | step_microstep: 148.88 [2025-08-03 05:37:11,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.77 | bwd: 7744.19 | bwd_inner: 6939.73 | bwd_allreduce: 804.22 | step: 149.39 {'loss': 0.7517, 'learning_rate': 6.180647479321484e-06, 'epoch': 0.64} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13132 total_samples=19340, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:14,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.74 | bwd_microstep: 2007.63 | bwd_inner_microstep: 1877.60 | bwd_allreduce_microstep: 129.96 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13518 total_samples=19344, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:17,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.26 | bwd_microstep: 2147.57 | bwd_inner_microstep: 1746.20 | bwd_allreduce_microstep: 401.27 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14230 total_samples=19348, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:20,194] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.35 | bwd_microstep: 1924.06 | bwd_inner_microstep: 1849.54 | bwd_allreduce_microstep: 74.45 | step_microstep: 0.29 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11907 total_samples=19351, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:22,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.73 [2025-08-03 05:37:22,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.93 | bwd_microstep: 1708.35 | bwd_inner_microstep: 1536.35 | bwd_allreduce_microstep: 171.94 | step_microstep: 126.35 [2025-08-03 05:37:22,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.21 | bwd: 7787.66 | bwd_inner: 7009.70 | bwd_allreduce: 777.70 | step: 126.88 {'loss': 0.734, 'learning_rate': 6.165686382233856e-06, 'epoch': 0.64} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16315 total_samples=19355, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:25,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.45 | bwd_microstep: 1996.71 | bwd_inner_microstep: 1933.03 | bwd_allreduce_microstep: 63.62 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11968 total_samples=19358, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:28,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.85 | bwd_microstep: 1766.96 | bwd_inner_microstep: 1575.39 | bwd_allreduce_microstep: 191.50 | step_microstep: 0.13 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13530 total_samples=19362, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:30,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.18 | bwd_microstep: 1728.67 | bwd_inner_microstep: 1664.85 | bwd_allreduce_microstep: 63.75 
| step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309 total_samples=19366, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:33,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.74 [2025-08-03 05:37:33,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.84 | bwd_microstep: 2045.71 | bwd_inner_microstep: 1917.20 | bwd_allreduce_microstep: 128.43 | step_microstep: 115.71 [2025-08-03 05:37:33,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.24 | bwd: 7538.10 | bwd_inner: 7090.45 | bwd_allreduce: 447.38 | step: 116.06 {'loss': 0.7303, 'learning_rate': 6.150735340186689e-06, 'epoch': 0.64} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348 total_samples=19370, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:36,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.68 | bwd_microstep: 1791.00 | bwd_inner_microstep: 1693.22 | bwd_allreduce_microstep: 97.71 | step_microstep: 0.12 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12298 total_samples=19374, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:38,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.77 | bwd_microstep: 1908.36 | bwd_inner_microstep: 1590.42 | bwd_allreduce_microstep: 317.87 | step_microstep: 0.11 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13203 total_samples=19379, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:41,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.35 | bwd_microstep: 1724.75 | bwd_inner_microstep: 1620.41 | bwd_allreduce_microstep: 104.28 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13487 total_samples=19383, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:37:44,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.81 [2025-08-03 05:37:44,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.06 | bwd_microstep: 1887.88 | bwd_inner_microstep: 1828.88 | bwd_allreduce_microstep: 58.91 | step_microstep: 158.69 [2025-08-03 05:37:44,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.79 | bwd: 7312.05 | bwd_inner: 6732.93 | bwd_allreduce: 578.86 | step: 159.06 {'loss': 0.7423, 'learning_rate': 6.135794392387353e-06, 'epoch': 0.64} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12982 total_samples=19387, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:46,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.19 | bwd_microstep: 1970.01 | bwd_inner_microstep: 1642.36 | bwd_allreduce_microstep: 327.58 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11585 total_samples=19390, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:49,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.02 | bwd_microstep: 1751.68 | bwd_inner_microstep: 1531.95 | bwd_allreduce_microstep: 219.67 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12257 total_samples=19393, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:51,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.80 | bwd_microstep: 1779.24 | bwd_inner_microstep: 1577.23 | bwd_allreduce_microstep: 201.95 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14016 total_samples=19397, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:55,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57 [2025-08-03 05:37:55,211] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.13 | bwd_microstep: 2043.67 | bwd_inner_microstep: 1892.84 | bwd_allreduce_microstep: 150.74 | step_microstep: 415.03 [2025-08-03 05:37:55,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.07 | bwd: 7544.67 | bwd_inner: 6644.39 | bwd_allreduce: 900.01 | step: 415.42 | 1271/2000 [3:54:15<2:11:14, 10.80s/it] 64%|██████▎ | 1271/2000 [3:54:15<2:11:14, 10.80s/it] 64%|██████▎ | 1272/2000 [3:54:26<2:11:54, 10.87s/it] 64%|██████▎ | 1272/2000 [3:54:26<2:11:54, 10.87s/it] 64%|██████▎ | 1273/2000 [3:54:37<2:12:26, 10.93s/it] 64%|██████▎ | 1273/2000 [3:54:37<2:12:26, 10.93s/it] 64%|██████▎ | 1274/2000 [3:54:48<2:11:43, 10.89s/it] 64%|██████▎ | 1274/2000 [3:54:48<2:11:43, 10.89s/it] 64%|██████▍ | 1275/2000 [3:54:59<2:10:26, 10.80s/it] 64%|██████▍ | 1275/2000 [3:54:59<2:10:26, 10.80s/it] 64%|██████▍ | 1276/2000 [3:55:10<2:11:08, 10.87{'loss': 0.7434, 'learning_rate': 6.120863578016736e-06, 'epoch': 0.64} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12826 total_samples=19401, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:37:58,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.84 | bwd_microstep: 2126.74 | bwd_inner_microstep: 2106.64 | bwd_allreduce_microstep: 20.03 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11871 total_samples=19404, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:00,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.52 | bwd_microstep: 1836.97 | bwd_inner_microstep: 1565.42 | bwd_allreduce_microstep: 271.47 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11544 total_samples=19407, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:03,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.62 | 
bwd_microstep: 1831.39 | bwd_inner_microstep: 1586.97 | bwd_allreduce_microstep: 244.35 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13345 total_samples=19411, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:06,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96 [2025-08-03 05:38:06,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.50 | bwd_microstep: 1755.32 | bwd_inner_microstep: 1686.04 | bwd_allreduce_microstep: 69.21 | step_microstep: 158.09 [2025-08-03 05:38:06,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.40 | bwd: 7550.47 | bwd_inner: 6945.08 | bwd_allreduce: 605.14 | step: 158.62 {'loss': 0.7449, 'learning_rate': 6.1059429362291615e-06, 'epoch': 0.64} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13353 total_samples=19415, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:08,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.43 | bwd_microstep: 1799.12 | bwd_inner_microstep: 1681.04 | bwd_allreduce_microstep: 117.99 | step_microstep: 0.96 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11693 total_samples=19418, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:11,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.54 | bwd_microstep: 1728.20 | bwd_inner_microstep: 1533.50 | bwd_allreduce_microstep: 194.63 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13132 total_samples=19422, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:13,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.73 | bwd_microstep: 1981.04 | bwd_inner_microstep: 1894.57 | bwd_allreduce_microstep: 86.41 | step_microstep: 0.31 dynamic ViT batch size: 47, images per sample: 
47.0, dynamic token length: 13328 total_samples=19426, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:17,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24 [2025-08-03 05:38:17,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.77 | bwd_microstep: 2084.60 | bwd_inner_microstep: 1919.59 | bwd_allreduce_microstep: 164.94 | step_microstep: 444.22 [2025-08-03 05:38:17,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.40 | bwd: 7593.03 | bwd_inner: 7028.68 | bwd_allreduce: 564.08 | step: 445.63 {'loss': 0.7475, 'learning_rate': 6.091032506152274e-06, 'epoch': 0.64} dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12517 total_samples=19430, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:19,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.76 | bwd_microstep: 1820.08 | bwd_inner_microstep: 1620.85 | bwd_allreduce_microstep: 199.15 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14830 total_samples=19434, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:22,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.03 | bwd_microstep: 2259.81 | bwd_inner_microstep: 2001.95 | bwd_allreduce_microstep: 257.80 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12255 total_samples=19437, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:25,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.32 | bwd_microstep: 1739.64 | bwd_inner_microstep: 1562.95 | bwd_allreduce_microstep: 176.62 | step_microstep: 0.17 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13841 total_samples=19442, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:28,414] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.23 [2025-08-03 05:38:28,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.14 | bwd_microstep: 2123.69 | bwd_inner_microstep: 1809.88 | bwd_allreduce_microstep: 313.75 | step_microstep: 132.65 [2025-08-03 05:38:28,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.18 | bwd: 7943.29 | bwd_inner: 6995.63 | bwd_allreduce: 947.40 | step: 133.25 {'loss': 0.7587, 'learning_rate': 6.076132326886934e-06, 'epoch': 0.64} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13617 total_samples=19446, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:31,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.04 | bwd_microstep: 2006.67 | bwd_inner_microstep: 1882.09 | bwd_allreduce_microstep: 124.52 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13375 total_samples=19450, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:34,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.96 | bwd_microstep: 2186.28 | bwd_inner_microstep: 2052.17 | bwd_allreduce_microstep: 134.05 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13175 total_samples=19454, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:37,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.17 | bwd_microstep: 2172.99 | bwd_inner_microstep: 2056.10 | bwd_allreduce_microstep: 116.82 | step_microstep: 0.83 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11843 total_samples=19457, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:39,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50 [2025-08-03 05:38:39,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
718.17 | bwd_microstep: 1747.98 | bwd_inner_microstep: 1553.48 | bwd_allreduce_microstep: 194.43 | step_microstep: 110.93 [2025-08-03 05:38:39,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.26 | bwd: 8113.98 | bwd_inner: 7543.84 | bwd_allreduce: 569.88 | step: 112.14 {'loss': 0.7398, 'learning_rate': 6.061242437507131e-06, 'epoch': 0.64} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13009 total_samples=19461, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:42,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.80 | bwd_microstep: 1969.06 | bwd_inner_microstep: 1865.98 | bwd_allreduce_microstep: 103.01 | step_microstep: 0.28 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12922 total_samples=19465, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:45,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.48 | bwd_microstep: 1733.31 | bwd_inner_microstep: 1660.81 | bwd_allreduce_microstep: 72.43 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13642 total_samples=19469, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:47,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 756.84 | bwd_microstep: 1979.51 | bwd_inner_microstep: 1738.09 | bwd_allreduce_microstep: 241.36 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13797 total_samples=19473, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:50,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40 [2025-08-03 05:38:50,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.68 | bwd_microstep: 1841.35 | bwd_inner_microstep: 1731.32 | bwd_allreduce_microstep: 109.96 | step_microstep: 112.85 [2025-08-03 05:38:50,626] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2875.74 | bwd: 7523.28 | bwd_inner: 6996.19 | bwd_allreduce: 526.84 | step: 113.36 s/it] 64%|██████▍ | 1276/2000 [3:55:10<2:11:08, 10.87s/it] 64%|██████▍ | 1277/2000 [3:55:20<2:10:58, 10.87s/it] 64%|██████▍ | 1277/2000 [3:55:20<2:10:58, 10.87s/it] 64%|██████▍ | 1278/2000 [3:55:32<2:11:45, 10.95s/it] 64%|██████▍ | 1278/2000 [3:55:32<2:11:45, 10.95s/it] 64%|██████▍ | 1279/2000 [3:55:43<2:12:28, 11.02s/it] 64%|██████▍ | 1279/2000 [3:55:43<2:12:28, 11.02s/it] 64%|██████▍ | 1280/2000 [3:55:54<2:13:40, 11.14s/it] 64%|██████▍ | 1280/2000 [3:55:54<2:13:40, 11.14s/it] 64%|██████▍ | 1281/2000 [3:56:05<2:12:15, 11.04s/it] {'loss': 0.7437, 'learning_rate': 6.0463628770598574e-06, 'epoch': 0.64} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11795 total_samples=19476, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:53,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.17 | bwd_microstep: 1781.03 | bwd_inner_microstep: 1547.94 | bwd_allreduce_microstep: 233.01 | step_microstep: 0.15 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13153 total_samples=19480, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:55,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.52 | bwd_microstep: 1723.03 | bwd_inner_microstep: 1641.31 | bwd_allreduce_microstep: 81.64 | step_microstep: 0.23 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12924 total_samples=19484, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:38:58,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.78 | bwd_microstep: 2063.69 | bwd_inner_microstep: 1767.94 | bwd_allreduce_microstep: 295.69 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14359 total_samples=19488, num_samples=4, 
num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:01,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70 [2025-08-03 05:39:01,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.41 | bwd_microstep: 1929.16 | bwd_inner_microstep: 1786.84 | bwd_allreduce_microstep: 142.26 | step_microstep: 134.26 [2025-08-03 05:39:01,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.81 | bwd: 7496.96 | bwd_inner: 6744.03 | bwd_allreduce: 752.69 | step: 134.75 {'loss': 0.7442, 'learning_rate': 6.0314936845650296e-06, 'epoch': 0.64} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12529 total_samples=19491, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:03,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.17 | bwd_microstep: 1778.16 | bwd_inner_microstep: 1586.46 | bwd_allreduce_microstep: 191.63 | step_microstep: 0.32 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13351 total_samples=19495, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:06,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.07 | bwd_microstep: 1858.44 | bwd_inner_microstep: 1726.11 | bwd_allreduce_microstep: 132.27 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14095 total_samples=19499, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:09,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.43 | bwd_microstep: 2359.05 | bwd_inner_microstep: 2352.71 | bwd_allreduce_microstep: 6.27 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13453 total_samples=19503, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:12,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25 [2025-08-03 
05:39:12,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.59 | bwd_microstep: 1829.48 | bwd_inner_microstep: 1733.92 | bwd_allreduce_microstep: 95.50 | step_microstep: 425.37 [2025-08-03 05:39:12,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.19 | bwd: 7825.19 | bwd_inner: 7399.20 | bwd_allreduce: 425.75 | step: 426.04 {'loss': 0.7382, 'learning_rate': 6.016634899015369e-06, 'epoch': 0.64} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13318 total_samples=19507, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:15,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.02 | bwd_microstep: 2036.29 | bwd_inner_microstep: 1796.87 | bwd_allreduce_microstep: 239.35 | step_microstep: 0.27 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12983 total_samples=19511, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:18,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.73 | bwd_microstep: 1720.69 | bwd_inner_microstep: 1643.26 | bwd_allreduce_microstep: 77.37 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13484 total_samples=19515, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:20,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.37 | bwd_microstep: 1762.55 | bwd_inner_microstep: 1703.63 | bwd_allreduce_microstep: 58.84 | step_microstep: 0.36 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13888 total_samples=19520, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:23,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.79 [2025-08-03 05:39:23,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.08 | bwd_microstep: 1848.01 | bwd_inner_microstep: 1733.72 | 
bwd_allreduce_microstep: 114.23 | step_microstep: 156.19 [2025-08-03 05:39:23,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.13 | bwd: 7367.60 | bwd_inner: 6877.47 | bwd_allreduce: 489.88 | step: 156.94 {'loss': 0.7522, 'learning_rate': 6.00178655937631e-06, 'epoch': 0.64} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11687 total_samples=19523, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:26,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.37 | bwd_microstep: 1796.36 | bwd_inner_microstep: 1554.31 | bwd_allreduce_microstep: 241.98 | step_microstep: 0.12 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12785 total_samples=19527, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:28,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.47 | bwd_microstep: 1839.30 | bwd_inner_microstep: 1605.76 | bwd_allreduce_microstep: 233.47 | step_microstep: 0.42 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11685 total_samples=19530, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:31,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.78 | bwd_microstep: 2070.57 | bwd_inner_microstep: 1599.32 | bwd_allreduce_microstep: 471.15 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13390 total_samples=19534, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:34,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.75 [2025-08-03 05:39:34,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.29 | bwd_microstep: 1935.85 | bwd_inner_microstep: 1859.87 | bwd_allreduce_microstep: 75.91 | step_microstep: 115.21 [2025-08-03 05:39:34,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.83 | bwd: 7642.14 
| bwd_inner: 6619.27 | bwd_allreduce: 1022.58 | step: 115.88 {'loss': 0.7338, 'learning_rate': 5.986948704585895e-06, 'epoch': 0.64} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13844 total_samples=19538, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:36,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.46 | bwd_microstep: 1765.31 | bwd_inner_microstep: 1700.97 | bwd_allreduce_microstep: 64.28 | step_microstep: 0.25 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15416 total_samples=19542, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:39,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.93 | bwd_microstep: 2102.90 | bwd_inner_microstep: 1967.86 | bwd_allreduce_microstep: 134.98 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13420 total_samples=19547, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:42,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1091.26 | bwd_microstep: 1821.81 | bwd_inner_microstep: 1735.89 | bwd_allreduce_microstep: 85.85 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13142 total_samples=19551, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:45,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97 [2025-08-03 05:39:45,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.45 | bwd_microstep: 2000.56 | bwd_inner_microstep: 1700.77 | bwd_allreduce_microstep: 299.72 | step_microstep: 125.87 [2025-08-03 05:39:45,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3204.03 | bwd: 7690.62 | bwd_inner: 7105.48 | bwd_allreduce: 584.90 | step: 126.46 {'loss': 0.7438, 'learning_rate': 5.972121373554665e-06, 'epoch': 0.64} 64%|██████▍ | 1281/2000 
[3:56:05<2:12:15, 11.04s/it] 64%|██████▍ | 1282/2000 [3:56:16<2:11:08, 10.96s/it] 64%|██████▍ | 1282/2000 [3:56:16<2:11:08, 10.96s/it] 64%|██████▍ | 1283/2000 [3:56:27<2:12:24, 11.08s/it] 64%|██████▍ | 1283/2000 [3:56:27<2:12:24, 11.08s/it] 64%|██████▍ | 1284/2000 [3:56:38<2:10:36, 10.94s/it] 64%|██████▍ | 1284/2000 [3:56:38<2:10:36, 10.94s/it] 64%|██████▍ | 1285/2000 [3:56:49<2:10:13, 10.93s/it] 64%|██████▍ | 1285/2000 [3:56:49<2:10:13, 10.93s/it] 64%|██████▍ | 1286/2000 [3:57:00<2:11:26, 11.05s/it] 64%|███dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12142 total_samples=19554, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:48,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.41 | bwd_microstep: 1862.56 | bwd_inner_microstep: 1739.67 | bwd_allreduce_microstep: 122.83 | step_microstep: 0.12 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15919 total_samples=19559, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:50,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.00 | bwd_microstep: 1797.25 | bwd_inner_microstep: 1735.75 | bwd_allreduce_microstep: 61.43 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11708 total_samples=19562, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:53,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.89 | bwd_microstep: 2113.07 | bwd_inner_microstep: 1830.01 | bwd_allreduce_microstep: 282.99 | step_microstep: 0.77 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14011 total_samples=19566, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:56,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59 [2025-08-03 05:39:56,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.22 | 
bwd_microstep: 1844.14 | bwd_inner_microstep: 1733.29 | bwd_allreduce_microstep: 110.79 | step_microstep: 113.26 [2025-08-03 05:39:56,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.45 | bwd: 7617.07 | bwd_inner: 7038.72 | bwd_allreduce: 578.11 | step: 114.39 {'loss': 0.7287, 'learning_rate': 5.957304605165567e-06, 'epoch': 0.64} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13155 total_samples=19570, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:39:59,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.47 | bwd_microstep: 2033.70 | bwd_inner_microstep: 1692.79 | bwd_allreduce_microstep: 340.84 | step_microstep: 0.14 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14907 total_samples=19574, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:01,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.78 | bwd_microstep: 1758.77 | bwd_inner_microstep: 1733.21 | bwd_allreduce_microstep: 25.49 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11923 total_samples=19577, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:04,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.66 | bwd_microstep: 1771.26 | bwd_inner_microstep: 1563.05 | bwd_allreduce_microstep: 208.11 | step_microstep: 0.20 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13259 total_samples=19581, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:07,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18 [2025-08-03 05:40:07,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.96 | bwd_microstep: 1795.54 | bwd_inner_microstep: 1674.96 | bwd_allreduce_microstep: 120.51 | step_microstep: 113.33 [2025-08-03 05:40:07,024] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.80 | bwd: 7359.33 | bwd_inner: 6664.01 | bwd_allreduce: 695.05 | step: 113.92 {'loss': 0.748, 'learning_rate': 5.942498438273849e-06, 'epoch': 0.64} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13497 total_samples=19586, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:09,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.36 | bwd_microstep: 2055.56 | bwd_inner_microstep: 1901.89 | bwd_allreduce_microstep: 153.60 | step_microstep: 0.75 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12451 total_samples=19590, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:12,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.48 | bwd_microstep: 1727.02 | bwd_inner_microstep: 1574.39 | bwd_allreduce_microstep: 152.56 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13682 total_samples=19594, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:14,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.16 | bwd_microstep: 1752.76 | bwd_inner_microstep: 1690.93 | bwd_allreduce_microstep: 61.76 | step_microstep: 0.15 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13236 total_samples=19598, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:17,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.23 [2025-08-03 05:40:17,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.28 | bwd_microstep: 1806.76 | bwd_inner_microstep: 1700.00 | bwd_allreduce_microstep: 106.69 | step_microstep: 133.07 [2025-08-03 05:40:17,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.21 | bwd: 7342.16 | bwd_inner: 6867.21 | bwd_allreduce: 474.70 | step: 134.09 {'loss': 0.735, 
'learning_rate': 5.927702911706961e-06, 'epoch': 0.64} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12100 total_samples=19601, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:20,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.47 | bwd_microstep: 1979.58 | bwd_inner_microstep: 1698.15 | bwd_allreduce_microstep: 281.37 | step_microstep: 0.19 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13304 total_samples=19605, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:23,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.02 | bwd_microstep: 2008.88 | bwd_inner_microstep: 1899.28 | bwd_allreduce_microstep: 109.53 | step_microstep: 0.24 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12462 total_samples=19608, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:26,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.96 | bwd_microstep: 2346.98 | bwd_inner_microstep: 2135.00 | bwd_allreduce_microstep: 211.92 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13523 total_samples=19612, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:28,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.29 [2025-08-03 05:40:28,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.05 | bwd_microstep: 1760.15 | bwd_inner_microstep: 1691.26 | bwd_allreduce_microstep: 68.82 | step_microstep: 120.24 [2025-08-03 05:40:28,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.42 | bwd: 8095.65 | bwd_inner: 7423.68 | bwd_allreduce: 671.73 | step: 120.78 {'loss': 0.7352, 'learning_rate': 5.912918064264441e-06, 'epoch': 0.65} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11598 total_samples=19615, 
num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:31,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.87 | bwd_microstep: 2016.63 | bwd_inner_microstep: 1839.13 | bwd_allreduce_microstep: 177.44 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12158 total_samples=19618, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:34,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.01 | bwd_microstep: 2017.81 | bwd_inner_microstep: 1792.13 | bwd_allreduce_microstep: 225.61 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13213 total_samples=19622, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:37,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.61 | bwd_microstep: 1748.65 | bwd_inner_microstep: 1671.71 | bwd_allreduce_microstep: 76.87 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13537 total_samples=19626, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:39,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50 [2025-08-03 05:40:39,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1819.76 | bwd_inner_microstep: 1729.33 | bwd_allreduce_microstep: 90.36 | step_microstep: 133.13 [2025-08-03 05:40:39,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.02 | bwd: 7602.92 | bwd_inner: 7032.29 | bwd_allreduce: 570.36 | step: 133.74 {'loss': 0.7315, 'learning_rate': 5.898143934717831e-06, 'epoch': 0.65} ██▍ | 1286/2000 [3:57:00<2:11:26, 11.05s/it] 64%|██████▍ | 1287/2000 [3:57:11<2:10:35, 10.99s/it] 64%|██████▍ | 1287/2000 [3:57:11<2:10:35, 10.99s/it] 64%|██████▍ | 1288/2000 [3:57:21<2:08:53, 10.86s/it] 64%|██████▍ | 1288/2000 [3:57:21<2:08:53, 10.86s/it] 
64%|██████▍ | 1289/2000 [3:57:32<2:07:43, 10.78s/it] 64%|██████▍ | 1289/2000 [3:57:32<2:07:43, 10.78s/it] 64%|██████▍ | 1290/2000 [3:57:43<2:09:34, 10.95s/it] 64%|██████▍ | 1290/2000 [3:57:43<2:09:34, 10.95s/it] 65%|██████▍ | 1291/2000 [3:57:54<2:08:57, 10.91s/it] 65%|██████▍ | 1291/2000 [3:57:54<2:08dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12818 total_samples=19630, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:42,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.56 | bwd_microstep: 1832.23 | bwd_inner_microstep: 1617.77 | bwd_allreduce_microstep: 214.39 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15171 total_samples=19634, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:45,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.31 | bwd_microstep: 1771.59 | bwd_inner_microstep: 1751.24 | bwd_allreduce_microstep: 20.29 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14715 total_samples=19638, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:47,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.13 | bwd_microstep: 1754.57 | bwd_inner_microstep: 1739.78 | bwd_allreduce_microstep: 14.72 | step_microstep: 0.14 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13228 total_samples=19643, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:50,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28 [2025-08-03 05:40:50,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.12 | bwd_microstep: 1819.39 | bwd_inner_microstep: 1707.36 | bwd_allreduce_microstep: 111.95 | step_microstep: 114.39 [2025-08-03 05:40:50,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.05 | bwd: 
7177.82 | bwd_inner: 6816.15 | bwd_allreduce: 361.43 | step: 114.93 {'loss': 0.7484, 'learning_rate': 5.8833805618105635e-06, 'epoch': 0.65} dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12898 total_samples=19647, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:52,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.27 | bwd_microstep: 1732.75 | bwd_inner_microstep: 1613.73 | bwd_allreduce_microstep: 118.96 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13161 total_samples=19651, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:55,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.93 | bwd_microstep: 2022.71 | bwd_inner_microstep: 1915.27 | bwd_allreduce_microstep: 107.37 | step_microstep: 0.18 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13132 total_samples=19655, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:40:58,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.84 | bwd_microstep: 2033.62 | bwd_inner_microstep: 1898.04 | bwd_allreduce_microstep: 135.51 | step_microstep: 0.26 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12846 total_samples=19659, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:01,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:41:01,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.48 | bwd_microstep: 2066.43 | bwd_inner_microstep: 1701.95 | bwd_allreduce_microstep: 364.41 | step_microstep: 146.82 [2025-08-03 05:41:01,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.43 | bwd: 7855.57 | bwd_inner: 7128.98 | bwd_allreduce: 726.34 | step: 147.38 {'loss': 0.7443, 'learning_rate': 5.868627984257862e-06, 'epoch': 0.65} dynamic ViT batch 
size: 42, images per sample: 42.0, dynamic token length: 11779 total_samples=19662, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:04,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.61 | bwd_microstep: 1990.78 | bwd_inner_microstep: 1798.57 | bwd_allreduce_microstep: 192.13 | step_microstep: 0.19 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13884 total_samples=19666, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:06,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.85 | bwd_microstep: 1700.70 | bwd_inner_microstep: 1661.72 | bwd_allreduce_microstep: 38.92 | step_microstep: 0.24 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13057 total_samples=19670, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:09,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.36 | bwd_microstep: 1843.26 | bwd_inner_microstep: 1670.87 | bwd_allreduce_microstep: 172.32 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12023 total_samples=19673, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:11,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49 [2025-08-03 05:41:11,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.82 | bwd_microstep: 1772.73 | bwd_inner_microstep: 1598.82 | bwd_allreduce_microstep: 173.84 | step_microstep: 115.17 [2025-08-03 05:41:11,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2744.58 | bwd: 7307.52 | bwd_inner: 6729.97 | bwd_allreduce: 577.30 | step: 115.74 {'loss': 0.7379, 'learning_rate': 5.853886240746643e-06, 'epoch': 0.65} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12036 total_samples=19676, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
05:41:14,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.33 | bwd_microstep: 2011.79 | bwd_inner_microstep: 1808.77 | bwd_allreduce_microstep: 202.96 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15581 total_samples=19681, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:17,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.15 | bwd_microstep: 1792.05 | bwd_inner_microstep: 1766.47 | bwd_allreduce_microstep: 25.51 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11735 total_samples=19684, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:19,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.09 | bwd_microstep: 1841.58 | bwd_inner_microstep: 1592.54 | bwd_allreduce_microstep: 248.97 | step_microstep: 0.27 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14068 total_samples=19688, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:22,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.73 [2025-08-03 05:41:22,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.11 | bwd_microstep: 1972.16 | bwd_inner_microstep: 1897.49 | bwd_allreduce_microstep: 74.60 | step_microstep: 111.13 [2025-08-03 05:41:22,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.62 | bwd: 7617.62 | bwd_inner: 7065.25 | bwd_allreduce: 552.12 | step: 111.76 {'loss': 0.7466, 'learning_rate': 5.839155369935407e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13630 total_samples=19692, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:25,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.55 | bwd_microstep: 1752.99 | bwd_inner_microstep: 1723.83 | 
bwd_allreduce_microstep: 29.09 | step_microstep: 0.23 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12760 total_samples=19696, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:27,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.10 | bwd_microstep: 1746.00 | bwd_inner_microstep: 1621.36 | bwd_allreduce_microstep: 124.57 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13036 total_samples=19700, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:30,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.76 | bwd_microstep: 2054.72 | bwd_inner_microstep: 1710.07 | bwd_allreduce_microstep: 344.60 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13370 total_samples=19704, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:33,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:41:33,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.30 | bwd_microstep: 1749.68 | bwd_inner_microstep: 1687.24 | bwd_allreduce_microstep: 62.37 | step_microstep: 152.66 [2025-08-03 05:41:33,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.63 | bwd: 7303.46 | bwd_inner: 6742.49 | bwd_allreduce: 560.71 | step: 153.25 {'loss': 0.7389, 'learning_rate': 5.82443541045415e-06, 'epoch': 0.65} dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12615 total_samples=19708, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:35,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.49 | bwd_microstep: 1943.54 | bwd_inner_microstep: 1616.75 | bwd_allreduce_microstep: 326.72 | step_microstep: 0.15 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13491 total_samples=19713, 
num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:38,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.52 | bwd_microstep: 1899.72 | bwd_inner_microstep: 1888.98 | bwd_allreduce_microstep: 10.68 | step_microstep: 0.21 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11695 total_samples=19717, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:41,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.14 | bwd_microstep: 2133.80 | bwd_inner_microstep: 1915.81 | bwd_allreduce_microstep: 217.93 | step_microstep: 0.10 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13963 total_samples=19721, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:44,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06 [2025-08-03 05:41:44,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.53 | bwd_microstep: 1773.88 | bwd_inner_microstep: 1703.09 | bwd_allreduce_microstep: 70.72 | step_microstep: 141.02 [2025-08-03 05:41:44,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.62 | bwd: 7750.99 | bwd_inner: 7124.63 | bwd_allreduce: 626.13 | step: 141.48 :57, 10.91s/it] 65%|██████▍ | 1292/2000 [3:58:05<2:07:11, 10.78s/it] 65%|██████▍ | 1292/2000 [3:58:05<2:07:11, 10.78s/it] 65%|██████▍ | 1293/2000 [3:58:16<2:08:04, 10.87s/it] 65%|██████▍ | 1293/2000 [3:58:16<2:08:04, 10.87s/it] 65%|██████▍ | 1294/2000 [3:58:26<2:06:30, 10.75s/it] 65%|██████▍ | 1294/2000 [3:58:26<2:06:30, 10.75s/it] 65%|██████▍ | 1295/2000 [3:58:37<2:06:44, 10.79s/it] 65%|██████▍ | 1295/2000 [3:58:37<2:06:44, 10.79s/it] 65%|██████▍ | 1296/2000 [3:58:48<2:05:38, 10.71s/it] 65%|██████▍ | 1296/2000 [3:58:48<2:05:38, 10.71s/it] 65%|██████{'loss': 0.748, 'learning_rate': 5.809726400904242e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13991 
total_samples=19725, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:47,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 939.87 | bwd_microstep: 1792.52 | bwd_inner_microstep: 1728.14 | bwd_allreduce_microstep: 64.30 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13569 total_samples=19729, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:49,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.49 | bwd_microstep: 1750.14 | bwd_inner_microstep: 1696.63 | bwd_allreduce_microstep: 53.45 | step_microstep: 0.23 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13689 total_samples=19733, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:52,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.28 | bwd_microstep: 1983.21 | bwd_inner_microstep: 1778.28 | bwd_allreduce_microstep: 204.86 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11649 total_samples=19736, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:55,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41 [2025-08-03 05:41:55,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.34 | bwd_microstep: 1789.65 | bwd_inner_microstep: 1541.52 | bwd_allreduce_microstep: 248.06 | step_microstep: 135.52 [2025-08-03 05:41:55,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3099.91 | bwd: 7315.57 | bwd_inner: 6744.58 | bwd_allreduce: 570.74 | step: 135.99 {'loss': 0.7459, 'learning_rate': 5.795028379858355e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13657 total_samples=19740, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:41:57,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 720.23 | bwd_microstep: 1825.33 | bwd_inner_microstep: 1725.95 | bwd_allreduce_microstep: 99.32 | step_microstep: 0.11 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12941 total_samples=19744, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:00,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.61 | bwd_microstep: 2113.06 | bwd_inner_microstep: 1828.77 | bwd_allreduce_microstep: 284.23 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13377 total_samples=19748, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:03,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.21 | bwd_microstep: 1789.43 | bwd_inner_microstep: 1710.69 | bwd_allreduce_microstep: 78.67 | step_microstep: 0.30 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14508 total_samples=19752, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:05,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37 [2025-08-03 05:42:05,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.20 | bwd_microstep: 1773.23 | bwd_inner_microstep: 1729.11 | bwd_allreduce_microstep: 44.06 | step_microstep: 138.38 [2025-08-03 05:42:05,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.16 | bwd: 7501.12 | bwd_inner: 6994.51 | bwd_allreduce: 506.36 | step: 138.89 {'loss': 0.7398, 'learning_rate': 5.780341385860333e-06, 'epoch': 0.65} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11865 total_samples=19755, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:08,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.32 | bwd_microstep: 1843.99 | bwd_inner_microstep: 1560.64 | bwd_allreduce_microstep: 283.28 | step_microstep: 0.12 dynamic ViT batch size: 42, 
images per sample: 42.0, dynamic token length: 11651 total_samples=19758, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:10,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.04 | bwd_microstep: 1744.02 | bwd_inner_microstep: 1555.81 | bwd_allreduce_microstep: 188.14 | step_microstep: 0.30 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13359 total_samples=19762, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:13,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.89 | bwd_microstep: 1796.18 | bwd_inner_microstep: 1704.13 | bwd_allreduce_microstep: 91.99 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14010 total_samples=19766, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:16,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19 [2025-08-03 05:42:16,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.76 | bwd_microstep: 1739.47 | bwd_inner_microstep: 1704.47 | bwd_allreduce_microstep: 34.93 | step_microstep: 116.20 [2025-08-03 05:42:16,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2756.95 | bwd: 7123.71 | bwd_inner: 6525.04 | bwd_allreduce: 598.42 | step: 116.73 {'loss': 0.7393, 'learning_rate': 5.765665457425102e-06, 'epoch': 0.65} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13851 total_samples=19770, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:18,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.04 | bwd_microstep: 1825.69 | bwd_inner_microstep: 1682.23 | bwd_allreduce_microstep: 143.38 | step_microstep: 0.18 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12277 total_samples=19773, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:21,436] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.88 | bwd_microstep: 1861.99 | bwd_inner_microstep: 1617.18 | bwd_allreduce_microstep: 244.74 | step_microstep: 0.80 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12487 total_samples=19777, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:24,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.60 | bwd_microstep: 1971.62 | bwd_inner_microstep: 1793.89 | bwd_allreduce_microstep: 177.65 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13527 total_samples=19781, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:27,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05 [2025-08-03 05:42:27,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.53 | bwd_microstep: 1959.09 | bwd_inner_microstep: 1853.96 | bwd_allreduce_microstep: 105.06 | step_microstep: 108.78 [2025-08-03 05:42:27,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.97 | bwd: 7618.44 | bwd_inner: 6947.26 | bwd_allreduce: 670.93 | step: 109.89 {'loss': 0.7416, 'learning_rate': 5.751000633038573e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14208 total_samples=19786, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:29,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.99 | bwd_microstep: 2175.81 | bwd_inner_microstep: 2120.14 | bwd_allreduce_microstep: 55.60 | step_microstep: 0.27 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13723 total_samples=19790, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:32,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.35 | bwd_microstep: 1872.54 | bwd_inner_microstep: 1686.40 | bwd_allreduce_microstep: 
186.07 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11834 total_samples=19793, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:35,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.52 | bwd_microstep: 1874.10 | bwd_inner_microstep: 1698.39 | bwd_allreduce_microstep: 175.64 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13780 total_samples=19799, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:38,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.15 [2025-08-03 05:42:38,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.60 | bwd_microstep: 1899.13 | bwd_inner_microstep: 1830.44 | bwd_allreduce_microstep: 68.61 | step_microstep: 154.44 [2025-08-03 05:42:38,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.40 | bwd: 7821.63 | bwd_inner: 7335.36 | bwd_allreduce: 486.01 | step: 155.08 ▍ | 1297/2000 [3:58:59<2:06:38, 10.81s/it] 65%|██████▍ | 1297/2000 [3:58:59<2:06:38, 10.81s/it] 65%|██████▍ | 1298/2000 [3:59:09<2:06:32, 10.82s/it] 65%|██████▍ | 1298/2000 [3:59:09<2:06:32, 10.82s/it] 65%|██████▍ | 1299/2000 [3:59:20<2:06:07, 10.79s/it] 65%|██████▍ | 1299/2000 [3:59:20<2:06:07, 10.79s/it] 65%|██████▌ | 1300/2000 [3:59:31<2:04:20, 10.66s/it] 65%|██████▌ | 1300/2000 [3:59:31<2:04:20, 10.66s/it] 65%|██████▌ | 1301/2000 [3:59:41<2:04:53, 10.72s/it] 65%|██████▌ | 1301/2000 [3:59:41<2:04:53, 10.72s/it] 65%|██████▌ | 1302/2000 [3:59:52<2:05:56, 10.{'loss': 0.7343, 'learning_rate': 5.736346951157544e-06, 'epoch': 0.65} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12168 total_samples=19802, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:40,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.52 | bwd_microstep: 1842.18 | bwd_inner_microstep: 
1614.95 | bwd_allreduce_microstep: 227.16 | step_microstep: 0.12 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11927 total_samples=19805, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:43,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.98 | bwd_microstep: 1809.92 | bwd_inner_microstep: 1564.03 | bwd_allreduce_microstep: 245.81 | step_microstep: 0.14 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12722 total_samples=19809, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:45,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.76 | bwd_microstep: 1833.02 | bwd_inner_microstep: 1627.23 | bwd_allreduce_microstep: 205.73 | step_microstep: 0.23 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13519 total_samples=19814, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:48,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13 [2025-08-03 05:42:48,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.70 | bwd_microstep: 1996.71 | bwd_inner_microstep: 1913.81 | bwd_allreduce_microstep: 82.84 | step_microstep: 112.50 [2025-08-03 05:42:48,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2880.91 | bwd: 7481.89 | bwd_inner: 6720.02 | bwd_allreduce: 761.62 | step: 113.01 {'loss': 0.7434, 'learning_rate': 5.721704450209581e-06, 'epoch': 0.65} dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12323 total_samples=19818, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:52,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.52 | bwd_microstep: 2337.36 | bwd_inner_microstep: 2331.24 | bwd_allreduce_microstep: 6.05 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13624 
total_samples=19822, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:54,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.90 | bwd_microstep: 2066.13 | bwd_inner_microstep: 2034.48 | bwd_allreduce_microstep: 31.59 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11657 total_samples=19825, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:42:57,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.39 | bwd_microstep: 1982.57 | bwd_inner_microstep: 1784.55 | bwd_allreduce_microstep: 197.96 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11488 total_samples=19828, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:00,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70 [2025-08-03 05:43:00,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.78 | bwd_microstep: 1847.85 | bwd_inner_microstep: 1616.40 | bwd_allreduce_microstep: 231.38 | step_microstep: 142.52 [2025-08-03 05:43:00,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2892.52 | bwd: 8233.96 | bwd_inner: 7766.67 | bwd_allreduce: 467.06 | step: 142.99 {'loss': 0.7477, 'learning_rate': 5.707073168592943e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14403 total_samples=19833, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:02,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.98 | bwd_microstep: 1741.09 | bwd_inner_microstep: 1717.01 | bwd_allreduce_microstep: 24.01 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13598 total_samples=19837, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:05,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 715.14 | bwd_microstep: 1751.18 | bwd_inner_microstep: 1684.58 | bwd_allreduce_microstep: 66.54 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13645 total_samples=19841, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:08,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.05 | bwd_microstep: 2123.33 | bwd_inner_microstep: 1878.10 | bwd_allreduce_microstep: 245.17 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12285 total_samples=19844, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:11,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04 [2025-08-03 05:43:11,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.33 | bwd_microstep: 2028.63 | bwd_inner_microstep: 1809.54 | bwd_allreduce_microstep: 219.02 | step_microstep: 114.96 [2025-08-03 05:43:11,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.42 | bwd: 7644.29 | bwd_inner: 7089.23 | bwd_allreduce: 554.82 | step: 115.44 {'loss': 0.7488, 'learning_rate': 5.692453144676451e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14330 total_samples=19848, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:14,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.10 | bwd_microstep: 1890.87 | bwd_inner_microstep: 1843.56 | bwd_allreduce_microstep: 47.24 | step_microstep: 0.15 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723 total_samples=19851, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:16,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.77 | bwd_microstep: 1732.51 | bwd_inner_microstep: 1536.55 | bwd_allreduce_microstep: 195.90 | step_microstep: 0.26 dynamic ViT batch size: 48, 
images per sample: 48.0, dynamic token length: 13950 total_samples=19855, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:19,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.32 | bwd_microstep: 1819.75 | bwd_inner_microstep: 1745.20 | bwd_allreduce_microstep: 74.49 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14139 total_samples=19859, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:22,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01 [2025-08-03 05:43:22,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.39 | bwd_microstep: 2029.85 | bwd_inner_microstep: 1896.31 | bwd_allreduce_microstep: 133.47 | step_microstep: 125.80 [2025-08-03 05:43:22,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.51 | bwd: 7473.02 | bwd_inner: 7021.61 | bwd_allreduce: 451.17 | step: 126.33 {'loss': 0.7495, 'learning_rate': 5.677844416799424e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13243 total_samples=19863, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:24,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.09 | bwd_microstep: 1770.53 | bwd_inner_microstep: 1691.52 | bwd_allreduce_microstep: 78.94 | step_microstep: 0.24 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12818 total_samples=19867, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:27,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.54 | bwd_microstep: 1841.60 | bwd_inner_microstep: 1662.18 | bwd_allreduce_microstep: 179.36 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13987 total_samples=19871, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:29,913] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.11 | bwd_microstep: 1845.27 | bwd_inner_microstep: 1754.20 | bwd_allreduce_microstep: 90.99 | step_microstep: 0.15 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14593 total_samples=19876, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:32,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33 [2025-08-03 05:43:32,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.44 | bwd_microstep: 1822.63 | bwd_inner_microstep: 1800.21 | bwd_allreduce_microstep: 22.36 | step_microstep: 135.99 [2025-08-03 05:43:32,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.12 | bwd: 7280.08 | bwd_inner: 6908.10 | bwd_allreduce: 371.73 | step: 136.51 83s/it] 65%|██████▌ | 1302/2000 [3:59:52<2:05:56, 10.83s/it] 65%|██████▌ | 1303/2000 [4:00:03<2:05:30, 10.80s/it] 65%|██████▌ | 1303/2000 [4:00:03<2:05:30, 10.80s/it] 65%|██████▌ | 1304/2000 [4:00:15<2:08:03, 11.04s/it] 65%|██████▌ | 1304/2000 [4:00:15<2:08:03, 11.04s/it] 65%|██████▌ | 1305/2000 [4:00:26<2:07:27, 11.00s/it] 65%|██████▌ | 1305/2000 [4:00:26<2:07:27, 11.00s/it] 65%|██████▌ | 1306/2000 [4:00:36<2:06:15, 10.92s/it] 65%|██████▌ | 1306/2000 [4:00:36<2:06:15, 10.92s/it] 65%|██████▌ | 1307/2000 [4:00:47<2:04:56, 10.82s/it] {'loss': 0.7464, 'learning_rate': 5.663247023271543e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13307 total_samples=19880, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:35,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.52 | bwd_microstep: 1690.57 | bwd_inner_microstep: 1644.78 | bwd_allreduce_microstep: 45.72 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15258 total_samples=19884, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:37,928] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.96 | bwd_microstep: 2048.35 | bwd_inner_microstep: 1930.24 | bwd_allreduce_microstep: 118.04 | step_microstep: 0.30 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14731 total_samples=19888, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:40,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.04 | bwd_microstep: 1903.73 | bwd_inner_microstep: 1743.69 | bwd_allreduce_microstep: 159.97 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11806 total_samples=19891, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:43,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13 [2025-08-03 05:43:43,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.67 | bwd_microstep: 1783.96 | bwd_inner_microstep: 1564.93 | bwd_allreduce_microstep: 218.97 | step_microstep: 127.41 [2025-08-03 05:43:43,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2751.11 | bwd: 7426.67 | bwd_inner: 6883.65 | bwd_allreduce: 542.78 | step: 127.96 {'loss': 0.7526, 'learning_rate': 5.648661002372769e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13576 total_samples=19895, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:46,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.29 | bwd_microstep: 1963.60 | bwd_inner_microstep: 1867.14 | bwd_allreduce_microstep: 96.39 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11865 total_samples=19898, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:48,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.23 | bwd_microstep: 1762.67 | bwd_inner_microstep: 1567.66 | bwd_allreduce_microstep: 
194.94 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13657 total_samples=19902, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:51,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.57 | bwd_microstep: 1996.22 | bwd_inner_microstep: 1784.28 | bwd_allreduce_microstep: 211.86 | step_microstep: 0.98 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11650 total_samples=19905, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:53,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:43:53,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.62 | bwd_microstep: 1738.55 | bwd_inner_microstep: 1538.23 | bwd_allreduce_microstep: 200.25 | step_microstep: 108.49 [2025-08-03 05:43:53,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.65 | bwd: 7461.09 | bwd_inner: 6757.30 | bwd_allreduce: 703.54 | step: 109.86 {'loss': 0.7327, 'learning_rate': 5.63408639235324e-06, 'epoch': 0.65} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15180 total_samples=19909, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:56,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.21 | bwd_microstep: 1753.92 | bwd_inner_microstep: 1736.54 | bwd_allreduce_microstep: 17.32 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11633 total_samples=19912, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:43:59,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.39 | bwd_microstep: 1858.76 | bwd_inner_microstep: 1535.19 | bwd_allreduce_microstep: 323.51 | step_microstep: 0.12 dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13516 total_samples=19916, num_samples=4, 
num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:02,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.64 | bwd_microstep: 2121.48 | bwd_inner_microstep: 1933.13 | bwd_allreduce_microstep: 188.29 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12093 total_samples=19919, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:04,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48 [2025-08-03 05:44:04,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.87 | bwd_microstep: 2046.27 | bwd_inner_microstep: 1814.37 | bwd_allreduce_microstep: 231.83 | step_microstep: 113.19 [2025-08-03 05:44:04,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.03 | bwd: 7780.49 | bwd_inner: 7019.22 | bwd_allreduce: 761.03 | step: 113.65 {'loss': 0.733, 'learning_rate': 5.619523231433177e-06, 'epoch': 0.66} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12973 total_samples=19923, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:07,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.39 | bwd_microstep: 1727.47 | bwd_inner_microstep: 1660.91 | bwd_allreduce_microstep: 66.48 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407 total_samples=19927, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:09,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.29 | bwd_microstep: 1769.56 | bwd_inner_microstep: 1698.68 | bwd_allreduce_microstep: 70.82 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11515 total_samples=19930, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:12,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.17 | bwd_microstep: 
1822.45 | bwd_inner_microstep: 1582.56 | bwd_allreduce_microstep: 239.81 | step_microstep: 0.17 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12586 total_samples=19934, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:15,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36 [2025-08-03 05:44:15,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.54 | bwd_microstep: 1806.17 | bwd_inner_microstep: 1632.48 | bwd_allreduce_microstep: 173.62 | step_microstep: 117.94 [2025-08-03 05:44:15,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.32 | bwd: 7125.71 | bwd_inner: 6574.63 | bwd_allreduce: 550.82 | step: 118.47 {'loss': 0.7447, 'learning_rate': 5.604971557802769e-06, 'epoch': 0.66} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12806 total_samples=19938, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:17,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.95 | bwd_microstep: 1792.35 | bwd_inner_microstep: 1669.92 | bwd_allreduce_microstep: 122.35 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13507 total_samples=19942, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:20,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.56 | bwd_microstep: 1888.43 | bwd_inner_microstep: 1841.48 | bwd_allreduce_microstep: 46.89 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13988 total_samples=19946, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:23,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.92 | bwd_microstep: 2042.79 | bwd_inner_microstep: 1936.68 | bwd_allreduce_microstep: 106.04 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token 
length: 12009 total_samples=19949, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:26,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54 [2025-08-03 05:44:26,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.30 | bwd_microstep: 1963.18 | bwd_inner_microstep: 1727.22 | bwd_allreduce_microstep: 235.89 | step_microstep: 129.12 [2025-08-03 05:44:26,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.66 | bwd: 7686.80 | bwd_inner: 7175.31 | bwd_allreduce: 511.26 | step: 129.60 {'loss': 0.738, 'learning_rate': 5.590431409622081e-06, 'epoch': 0.66} 65%|██████▌ | 1307/2000 [4:00:47<2:04:56, 10.82s/it] 65%|██████▌ | 1308/2000 [4:00:58<2:04:02, 10.75s/it] 65%|██████▌ | 1308/2000 [4:00:58<2:04:02, 10.75s/it] 65%|██████▌ | 1309/2000 [4:01:08<2:03:38, 10.74s/it] 65%|██████▌ | 1309/2000 [4:01:08<2:03:38, 10.74s/it] 66%|██████▌ | 1310/2000 [4:01:19<2:04:16, 10.81s/it] 66%|██████▌ | 1310/2000 [4:01:19<2:04:16, 10.81s/it] 66%|██████▌ | 1311/2000 [4:01:30<2:02:36, 10.68s/it] 66%|██████▌ | 1311/2000 [4:01:30<2:02:36, 10.68s/it] 66%|██████▌ | 1312/2000 [4:01:41<2:03:25, 10.76s/it] 66%|██dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14554 total_samples=19955, num_samples=6, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:28,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.13 | bwd_microstep: 1766.12 | bwd_inner_microstep: 1720.35 | bwd_allreduce_microstep: 45.70 | step_microstep: 0.35 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13217 total_samples=19959, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:31,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.71 | bwd_microstep: 1968.69 | bwd_inner_microstep: 1840.47 | bwd_allreduce_microstep: 128.15 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token 
length: 13329 total_samples=19963, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:34,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 914.85 | bwd_microstep: 1782.68 | bwd_inner_microstep: 1712.98 | bwd_allreduce_microstep: 69.64 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13286 total_samples=19967, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:37,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15 [2025-08-03 05:44:37,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.28 | bwd_microstep: 2024.86 | bwd_inner_microstep: 1735.98 | bwd_allreduce_microstep: 288.81 | step_microstep: 111.97 [2025-08-03 05:44:37,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3064.90 | bwd: 7542.41 | bwd_inner: 7009.77 | bwd_allreduce: 532.39 | step: 112.58 {'loss': 0.7434, 'learning_rate': 5.575902825020962e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14807 total_samples=19971, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:39,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.82 | bwd_microstep: 1894.27 | bwd_inner_microstep: 1869.03 | bwd_allreduce_microstep: 25.18 | step_microstep: 0.11 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13497 total_samples=19975, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:42,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.25 | bwd_microstep: 1799.60 | bwd_inner_microstep: 1694.83 | bwd_allreduce_microstep: 104.67 | step_microstep: 0.19 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12861 total_samples=19979, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:45,179] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 710.88 | bwd_microstep: 1834.19 | bwd_inner_microstep: 1632.33 | bwd_allreduce_microstep: 201.79 | step_microstep: 0.12 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12670 total_samples=19983, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:48,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37 [2025-08-03 05:44:48,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.78 | bwd_microstep: 2308.05 | bwd_inner_microstep: 1987.60 | bwd_allreduce_microstep: 320.38 | step_microstep: 154.91 [2025-08-03 05:44:48,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.66 | bwd: 7836.18 | bwd_inner: 7183.77 | bwd_allreduce: 652.13 | step: 155.34 {'loss': 0.7468, 'learning_rate': 5.56138584209893e-06, 'epoch': 0.66} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11863 total_samples=19986, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:50,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.78 | bwd_microstep: 1779.59 | bwd_inner_microstep: 1568.18 | bwd_allreduce_microstep: 211.29 | step_microstep: 0.61 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14921 total_samples=19991, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:53,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.57 | bwd_microstep: 1781.62 | bwd_inner_microstep: 1726.15 | bwd_allreduce_microstep: 55.40 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12202 total_samples=19994, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:56,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.12 | bwd_microstep: 1970.63 | bwd_inner_microstep: 1763.68 | bwd_allreduce_microstep: 206.86 | step_microstep: 0.32 dynamic ViT 
batch size: 48, images per sample: 48.0, dynamic token length: 14284 total_samples=19998, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:44:58,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.33 [2025-08-03 05:44:58,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.80 | bwd_microstep: 1834.38 | bwd_inner_microstep: 1761.91 | bwd_allreduce_microstep: 72.41 | step_microstep: 125.99 [2025-08-03 05:44:59,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.16 | bwd: 7366.29 | bwd_inner: 6819.93 | bwd_allreduce: 546.08 | step: 127.05 {'loss': 0.7404, 'learning_rate': 5.546880498925079e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14605 total_samples=20002, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:01,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.54 | bwd_microstep: 2048.15 | bwd_inner_microstep: 1948.30 | bwd_allreduce_microstep: 99.79 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11791 total_samples=20006, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:04,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.45 | bwd_microstep: 2030.21 | bwd_inner_microstep: 1817.11 | bwd_allreduce_microstep: 213.03 | step_microstep: 0.31 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13320 total_samples=20010, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:07,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.97 | bwd_microstep: 1781.99 | bwd_inner_microstep: 1697.03 | bwd_allreduce_microstep: 84.89 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13805 total_samples=20014, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 
05:45:10,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.29 [2025-08-03 05:45:10,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.04 | bwd_microstep: 2050.48 | bwd_inner_microstep: 1938.27 | bwd_allreduce_microstep: 112.14 | step_microstep: 139.93 [2025-08-03 05:45:10,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.91 | bwd: 7910.88 | bwd_inner: 7400.70 | bwd_allreduce: 509.93 | step: 140.48 {'loss': 0.7448, 'learning_rate': 5.5323868335379775e-06, 'epoch': 0.66} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11831 total_samples=20017, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:13,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.70 | bwd_microstep: 2042.76 | bwd_inner_microstep: 1786.47 | bwd_allreduce_microstep: 256.22 | step_microstep: 0.17 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11985 total_samples=20020, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:15,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.09 | bwd_microstep: 1741.47 | bwd_inner_microstep: 1549.05 | bwd_allreduce_microstep: 192.36 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11869 total_samples=20023, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:18,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.84 | bwd_microstep: 2247.42 | bwd_inner_microstep: 2238.75 | bwd_allreduce_microstep: 8.61 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15368 total_samples=20027, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:21,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24 [2025-08-03 05:45:21,403] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 717.94 | bwd_microstep: 1911.32 | bwd_inner_microstep: 1809.87 | bwd_allreduce_microstep: 101.38 | step_microstep: 124.32 [2025-08-03 05:45:21,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.50 | bwd: 7943.02 | bwd_inner: 7384.14 | bwd_allreduce: 558.65 | step: 124.95 {'loss': 0.7543, 'learning_rate': 5.517904883945577e-06, 'epoch': 0.66} ███▌ | 1312/2000 [4:01:41<2:03:25, 10.76s/it] 66%|██████▌ | 1313/2000 [4:01:52<2:04:10, 10.85s/it] 66%|██████▌ | 1313/2000 [4:01:52<2:04:10, 10.85s/it] 66%|██████▌ | 1314/2000 [4:02:03<2:04:53, 10.92s/it] 66%|██████▌ | 1314/2000 [4:02:03<2:04:53, 10.92s/it] 66%|██████▌ | 1315/2000 [4:02:13<2:03:33, 10.82s/it] 66%|██████▌ | 1315/2000 [4:02:13<2:03:33, 10.82s/it] 66%|██████▌ | 1316/2000 [4:02:25<2:04:41, 10.94s/it] 66%|██████▌ | 1316/2000 [4:02:25<2:04:41, 10.94s/it] 66%|██████▌ | 1317/2000 [4:02:36<2:05:23, 11.02s/it] 66%|██████▌ | 1317/2000 [4:02:36<2:dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12765 total_samples=20031, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:24,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.91 | bwd_microstep: 1825.05 | bwd_inner_microstep: 1667.60 | bwd_allreduce_microstep: 157.37 | step_microstep: 0.15 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11757 total_samples=20034, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:26,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.91 | bwd_microstep: 1794.63 | bwd_inner_microstep: 1550.16 | bwd_allreduce_microstep: 244.40 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14295 total_samples=20038, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:29,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.97 | bwd_microstep: 2010.88 | bwd_inner_microstep: 
1874.75 | bwd_allreduce_microstep: 136.06 | step_microstep: 0.14 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13031 total_samples=20042, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:32,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.51 [2025-08-03 05:45:32,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.88 | bwd_microstep: 1921.05 | bwd_inner_microstep: 1660.30 | bwd_allreduce_microstep: 260.68 | step_microstep: 127.42 [2025-08-03 05:45:32,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.59 | bwd: 7551.68 | bwd_inner: 6752.81 | bwd_allreduce: 798.61 | step: 127.95 {'loss': 0.7396, 'learning_rate': 5.503434688125104e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13752 total_samples=20046, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:35,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.73 | bwd_microstep: 2034.97 | bwd_inner_microstep: 1896.88 | bwd_allreduce_microstep: 138.02 | step_microstep: 0.17 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13910 total_samples=20050, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:37,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.27 | bwd_microstep: 1884.41 | bwd_inner_microstep: 1754.12 | bwd_allreduce_microstep: 130.22 | step_microstep: 0.17 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12423 total_samples=20054, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:40,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.34 | bwd_microstep: 2061.32 | bwd_inner_microstep: 1840.93 | bwd_allreduce_microstep: 220.33 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14357 
total_samples=20058, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:43,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31 [2025-08-03 05:45:43,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.55 | bwd_microstep: 1775.32 | bwd_inner_microstep: 1768.65 | bwd_allreduce_microstep: 6.61 | step_microstep: 148.39 [2025-08-03 05:45:43,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.83 | bwd: 7756.07 | bwd_inner: 7260.57 | bwd_allreduce: 495.26 | step: 148.92 {'loss': 0.7425, 'learning_rate': 5.488976284022953e-06, 'epoch': 0.66} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14298 total_samples=20062, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:45,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.26 | bwd_microstep: 1759.35 | bwd_inner_microstep: 1705.03 | bwd_allreduce_microstep: 54.23 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11681 total_samples=20065, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:48,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.58 | bwd_microstep: 1758.21 | bwd_inner_microstep: 1552.63 | bwd_allreduce_microstep: 205.52 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11819 total_samples=20069, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:50,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.94 | bwd_microstep: 1786.58 | bwd_inner_microstep: 1559.63 | bwd_allreduce_microstep: 226.88 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13350 total_samples=20073, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:53,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_step: 14.96 [2025-08-03 05:45:53,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.25 | bwd_microstep: 1734.97 | bwd_inner_microstep: 1673.56 | bwd_allreduce_microstep: 61.33 | step_microstep: 140.20 [2025-08-03 05:45:53,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.95 | bwd: 7039.15 | bwd_inner: 6490.84 | bwd_allreduce: 548.04 | step: 140.68 {'loss': 0.7509, 'learning_rate': 5.4745297095546125e-06, 'epoch': 0.66} dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14359 total_samples=20077, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:56,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.77 | bwd_microstep: 1757.83 | bwd_inner_microstep: 1688.67 | bwd_allreduce_microstep: 69.09 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11579 total_samples=20080, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:45:58,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.98 | bwd_microstep: 1809.37 | bwd_inner_microstep: 1580.55 | bwd_allreduce_microstep: 228.75 | step_microstep: 0.16 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11613 total_samples=20083, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:01,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.13 | bwd_microstep: 1849.01 | bwd_inner_microstep: 1555.02 | bwd_allreduce_microstep: 293.84 | step_microstep: 0.71 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13774 total_samples=20087, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:04,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 33.78 [2025-08-03 05:46:04,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.53 | bwd_microstep: 1892.43 | 
bwd_inner_microstep: 1706.05 | bwd_allreduce_microstep: 186.31 | step_microstep: 144.94 [2025-08-03 05:46:04,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.33 | bwd: 7308.71 | bwd_inner: 6530.28 | bwd_allreduce: 778.13 | step: 145.93 {'loss': 0.7439, 'learning_rate': 5.460095002604533e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13188 total_samples=20091, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:06,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.79 | bwd_microstep: 2017.34 | bwd_inner_microstep: 1880.80 | bwd_allreduce_microstep: 136.47 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14038 total_samples=20095, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:09,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.43 | bwd_microstep: 1815.74 | bwd_inner_microstep: 1740.16 | bwd_allreduce_microstep: 75.52 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13526 total_samples=20099, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:12,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.90 | bwd_microstep: 1968.47 | bwd_inner_microstep: 1854.59 | bwd_allreduce_microstep: 113.81 | step_microstep: 0.17 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14502 total_samples=20103, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:14,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.84 [2025-08-03 05:46:14,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.87 | bwd_microstep: 1743.62 | bwd_inner_microstep: 1722.04 | bwd_allreduce_microstep: 21.52 | step_microstep: 143.05 [2025-08-03 05:46:14,942] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd: 2775.91 | bwd: 7545.22 | bwd_inner: 7197.58 | bwd_allreduce: 347.40 | step: 143.62 {'loss': 0.7436, 'learning_rate': 5.445672201026054e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13325 total_samples=20107, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:17,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.57 | bwd_microstep: 2059.17 | bwd_inner_microstep: 1783.19 | bwd_allreduce_microstep: 275.91 | step_microstep: 0.11 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13547 total_samples=20111, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:20,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.08 | bwd_microstep: 1790.67 | bwd_inner_microstep: 1698.62 | bwd_allreduce_microstep: 91.98 | step_microstep: 0.28 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13491 total_samples=20115, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:23,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.45 | bwd_microstep: 2115.63 | bwd_inner_microstep: 1948.80 | bwd_allreduce_microstep: 166.75 | step_microstep: 0.21 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13091 total_samples=20119, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:26,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.23 [2025-08-03 05:46:26,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.67 | bwd_microstep: 1838.56 | bwd_inner_microstep: 1696.52 | bwd_allreduce_microstep: 141.98 | step_microstep: 124.13 [2025-08-03 05:46:26,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.70 | bwd: 7804.09 | bwd_inner: 7127.13 | bwd_allreduce: 676.72 | step: 124.74 05:23, 11.02s/it] 66%|██████▌ | 1318/2000 [4:02:47<2:04:32, 
10.96s/it] 66%|██████▌ | 1318/2000 [4:02:47<2:04:32, 10.96s/it] 66%|██████▌ | 1319/2000 [4:02:58<2:04:46, 10.99s/it] 66%|██████▌ | 1319/2000 [4:02:58<2:04:46, 10.99s/it] 66%|██████▌ | 1320/2000 [4:03:08<2:02:10, 10.78s/it] 66%|██████▌ | 1320/2000 [4:03:08<2:02:10, 10.78s/it] 66%|██████▌ | 1321/2000 [4:03:19<2:01:16, 10.72s/it] 66%|██████▌ | 1321/2000 [4:03:19<2:01:16, 10.72s/it] 66%|██████▌ | 1322/2000 [4:03:29<2:01:20, 10.74s/it] 66%|██████▌ | 1322/2000 [4:03:29<2:01:20, 10.74s/it] 66%|█████{'loss': 0.7444, 'learning_rate': 5.431261342641287e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13478 total_samples=20124, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:28,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.24 | bwd_microstep: 1769.59 | bwd_inner_microstep: 1694.43 | bwd_allreduce_microstep: 75.09 | step_microstep: 0.16 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13584 total_samples=20128, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:31,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.51 | bwd_microstep: 1960.54 | bwd_inner_microstep: 1953.72 | bwd_allreduce_microstep: 6.71 | step_microstep: 0.23 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12982 total_samples=20132, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:33,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.96 | bwd_microstep: 1722.64 | bwd_inner_microstep: 1624.22 | bwd_allreduce_microstep: 98.35 | step_microstep: 0.27 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12898 total_samples=20136, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:36,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.73 [2025-08-03 05:46:36,606] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.32 | bwd_microstep: 1864.63 | bwd_inner_microstep: 1683.94 | bwd_allreduce_microstep: 180.63 | step_microstep: 160.79 [2025-08-03 05:46:36,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.96 | bwd: 7317.45 | bwd_inner: 6956.33 | bwd_allreduce: 360.85 | step: 161.46 {'loss': 0.742, 'learning_rate': 5.416862465241033e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14876 total_samples=20140, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:39,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.33 | bwd_microstep: 1751.73 | bwd_inner_microstep: 1742.55 | bwd_allreduce_microstep: 9.11 | step_microstep: 0.13 dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12099 total_samples=20144, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:41,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.43 | bwd_microstep: 1780.13 | bwd_inner_microstep: 1599.55 | bwd_allreduce_microstep: 180.50 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12276 total_samples=20147, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:44,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.38 | bwd_microstep: 1819.52 | bwd_inner_microstep: 1592.90 | bwd_allreduce_microstep: 226.55 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13594 total_samples=20151, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:47,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.98 [2025-08-03 05:46:47,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.17 | bwd_microstep: 2129.07 | bwd_inner_microstep: 1904.43 | bwd_allreduce_microstep: 224.58 | 
step_microstep: 455.90 [2025-08-03 05:46:47,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.23 | bwd: 7480.50 | bwd_inner: 6839.42 | bwd_allreduce: 640.82 | step: 456.41 {'loss': 0.7509, 'learning_rate': 5.40247560658467e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14395 total_samples=20156, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:50,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.19 | bwd_microstep: 2032.15 | bwd_inner_microstep: 1900.32 | bwd_allreduce_microstep: 131.76 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13268 total_samples=20160, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:53,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.64 | bwd_microstep: 1813.81 | bwd_inner_microstep: 1711.68 | bwd_allreduce_microstep: 102.07 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13431 total_samples=20164, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:56,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.08 | bwd_microstep: 2545.77 | bwd_inner_microstep: 2164.00 | bwd_allreduce_microstep: 381.71 | step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13004 total_samples=20168, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:46:59,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48 [2025-08-03 05:46:59,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.61 | bwd_microstep: 1793.50 | bwd_inner_microstep: 1675.58 | bwd_allreduce_microstep: 117.85 | step_microstep: 136.75 [2025-08-03 05:46:59,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.45 | bwd: 8185.29 | bwd_inner: 7451.57 | 
bwd_allreduce: 733.48 | step: 137.29 {'loss': 0.739, 'learning_rate': 5.3881008044000495e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14068 total_samples=20172, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:01,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.26 | bwd_microstep: 1807.34 | bwd_inner_microstep: 1740.76 | bwd_allreduce_microstep: 66.52 | step_microstep: 0.15 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13159 total_samples=20176, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:04,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.78 | bwd_microstep: 1925.28 | bwd_inner_microstep: 1808.29 | bwd_allreduce_microstep: 116.93 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13876 total_samples=20180, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:07,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.51 | bwd_microstep: 2666.79 | bwd_inner_microstep: 2577.35 | bwd_allreduce_microstep: 89.37 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13349 total_samples=20184, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:10,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.62 [2025-08-03 05:47:10,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.90 | bwd_microstep: 1753.99 | bwd_inner_microstep: 1672.66 | bwd_allreduce_microstep: 81.26 | step_microstep: 143.41 [2025-08-03 05:47:10,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.38 | bwd: 8153.46 | bwd_inner: 7799.06 | bwd_allreduce: 354.16 | step: 143.94 {'loss': 0.738, 'learning_rate': 5.373738096383423e-06, 'epoch': 0.66} dynamic ViT batch size: 45, images per sample: 45.0, 
dynamic token length: 13338 total_samples=20188, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:13,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.49 | bwd_microstep: 1903.30 | bwd_inner_microstep: 1666.30 | bwd_allreduce_microstep: 236.94 | step_microstep: 0.24 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13443 total_samples=20192, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:16,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 975.33 | bwd_microstep: 2080.42 | bwd_inner_microstep: 1776.83 | bwd_allreduce_microstep: 303.53 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13795 total_samples=20196, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:18,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.72 | bwd_microstep: 1736.98 | bwd_inner_microstep: 1692.05 | bwd_allreduce_microstep: 44.86 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15431 total_samples=20201, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:21,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.62 [2025-08-03 05:47:21,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.69 | bwd_microstep: 1750.80 | bwd_inner_microstep: 1744.37 | bwd_allreduce_microstep: 6.36 | step_microstep: 112.19 [2025-08-03 05:47:21,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3081.16 | bwd: 7471.56 | bwd_inner: 6879.54 | bwd_allreduce: 591.78 | step: 112.68 ▌ | 1323/2000 [4:03:40<2:02:15, 10.84s/it] 66%|██████▌ | 1323/2000 [4:03:40<2:02:15, 10.84s/it] 66%|██████▌ | 1324/2000 [4:03:51<2:01:17, 10.77s/it] 66%|██████▌ | 1324/2000 [4:03:51<2:01:17, 10.77s/it] 66%|██████▋ | 1325/2000 [4:04:02<2:02:08, 10.86s/it] 66%|██████▋ | 1325/2000 
[4:04:02<2:02:08, 10.86s/it] 66%|██████▋ | 1326/2000 [4:04:13<2:03:59, 11.04s/it] 66%|██████▋ | 1326/2000 [4:04:14<2:03:59, 11.04s/it] 66%|██████▋ | 1327/2000 [4:04:25<2:05:03, 11.15s/it] 66%|██████▋ | 1327/2000 [4:04:25<2:05:03, 11.15s/it] {'loss': 0.7377, 'learning_rate': 5.359387520199317e-06, 'epoch': 0.66} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13953 total_samples=20205, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:24,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.47 | bwd_microstep: 2005.08 | bwd_inner_microstep: 1873.52 | bwd_allreduce_microstep: 131.49 | step_microstep: 0.12 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13284 total_samples=20209, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:27,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.75 | bwd_microstep: 2251.86 | bwd_inner_microstep: 2082.71 | bwd_allreduce_microstep: 169.08 | step_microstep: 0.28 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14154 total_samples=20214, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:30,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.39 | bwd_microstep: 2164.69 | bwd_inner_microstep: 1903.74 | bwd_allreduce_microstep: 260.88 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13411 total_samples=20218, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:33,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.85 [2025-08-03 05:47:33,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.47 | bwd_microstep: 1802.08 | bwd_inner_microstep: 1715.06 | bwd_allreduce_microstep: 86.95 | step_microstep: 157.18 [2025-08-03 05:47:33,031] [INFO]
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.01 | bwd: 8223.76 | bwd_inner: 7575.03 | bwd_allreduce: 648.49 | step: 157.71 {'loss': 0.7364, 'learning_rate': 5.3450491134804416e-06, 'epoch': 0.66} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12038 total_samples=20221, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:35,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.92 | bwd_microstep: 1761.77 | bwd_inner_microstep: 1558.51 | bwd_allreduce_microstep: 203.18 | step_microstep: 0.15 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11942 total_samples=20224, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:38,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.21 | bwd_microstep: 2148.26 | bwd_inner_microstep: 1939.29 | bwd_allreduce_microstep: 208.87 | step_microstep: 0.87 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13513 total_samples=20228, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:41,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.54 | bwd_microstep: 1710.52 | bwd_inner_microstep: 1658.52 | bwd_allreduce_microstep: 51.94 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14598 total_samples=20233, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:43,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77 [2025-08-03 05:47:43,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.62 | bwd_microstep: 1900.40 | bwd_inner_microstep: 1739.54 | bwd_allreduce_microstep: 160.77 | step_microstep: 123.24 [2025-08-03 05:47:43,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.22 | bwd: 7521.02 | bwd_inner: 6895.85 | bwd_allreduce: 624.88 | step: 124.38 {'loss': 0.7403, 
'learning_rate': 5.330722913827594e-06, 'epoch': 0.67} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12075 total_samples=20236, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:47,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 971.12 | bwd_microstep: 1951.42 | bwd_inner_microstep: 1904.63 | bwd_allreduce_microstep: 46.73 | step_microstep: 0.14 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12103 total_samples=20239, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:49,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.70 | bwd_microstep: 1713.41 | bwd_inner_microstep: 1539.92 | bwd_allreduce_microstep: 173.41 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13252 total_samples=20243, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:52,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.60 | bwd_microstep: 1764.01 | bwd_inner_microstep: 1694.39 | bwd_allreduce_microstep: 69.50 | step_microstep: 0.44 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14501 total_samples=20247, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:54,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04 [2025-08-03 05:47:54,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.30 | bwd_microstep: 1784.68 | bwd_inner_microstep: 1727.50 | bwd_allreduce_microstep: 57.10 | step_microstep: 111.30 [2025-08-03 05:47:55,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3062.64 | bwd: 7213.58 | bwd_inner: 6866.44 | bwd_allreduce: 346.87 | step: 112.14 {'loss': 0.7419, 'learning_rate': 5.3164089588095705e-06, 'epoch': 0.67} dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12925 total_samples=20252, 
num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:47:57,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.36 | bwd_microstep: 2027.08 | bwd_inner_microstep: 1835.26 | bwd_allreduce_microstep: 191.73 | step_microstep: 0.34 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13798 total_samples=20255, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:00,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.77 | bwd_microstep: 1880.66 | bwd_inner_microstep: 1663.25 | bwd_allreduce_microstep: 217.28 | step_microstep: 0.89 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13965 total_samples=20259, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:03,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.60 | bwd_microstep: 1797.35 | bwd_inner_microstep: 1735.90 | bwd_allreduce_microstep: 61.38 | step_microstep: 0.13 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13205 total_samples=20263, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:05,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:48:05,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 2010.64 | bwd_inner_microstep: 1866.93 | bwd_allreduce_microstep: 143.64 | step_microstep: 120.48 [2025-08-03 05:48:05,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.72 | bwd: 7715.78 | bwd_inner: 7101.34 | bwd_allreduce: 614.16 | step: 121.85 {'loss': 0.7437, 'learning_rate': 5.302107285963045e-06, 'epoch': 0.67} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13935 total_samples=20267, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:09,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.90 | 
bwd_microstep: 2404.26 | bwd_inner_microstep: 1713.17 | bwd_allreduce_microstep: 691.02 | step_microstep: 0.30 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11984 total_samples=20270, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:11,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.52 | bwd_microstep: 1746.24 | bwd_inner_microstep: 1558.61 | bwd_allreduce_microstep: 187.57 | step_microstep: 0.89 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13947 total_samples=20274, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:14,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 915.67 | bwd_microstep: 1815.26 | bwd_inner_microstep: 1738.96 | bwd_allreduce_microstep: 76.23 | step_microstep: 0.25 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14159 total_samples=20279, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:17,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04 [2025-08-03 05:48:17,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.33 | bwd_microstep: 1729.09 | bwd_inner_microstep: 1666.22 | bwd_allreduce_microstep: 62.80 | step_microstep: 123.83 [2025-08-03 05:48:17,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3000.34 | bwd: 7694.90 | bwd_inner: 6676.95 | bwd_allreduce: 1017.70 | step: 125.27 66%|██████▋ | 1328/2000 [4:04:36<2:04:17, 11.10s/it] 66%|██████▋ | 1329/2000 [4:04:47<2:05:29, 11.22s/it] 66%|██████▋ | 1329/2000 [4:04:47<2:05:29, 11.22s/it] 66%|██████▋ | 1330/2000 [4:04:58<2:03:45, 11.08s/it] 66%|██████▋ | 1330/2000 [4:04:59<2:03:45, 11.08s/it] 67%|██████▋ | 1331/2000 [4:05:09<2:03:59, 11.12s/it] 67%|██████▋ | 1331/2000 [4:05:09<2:03:59, 11.12s/it] 67%|██████▋ | 1332/2000 [4:05:20<2:03:22, 11.08s/it] 67%|██████▋ | 1332/2000 [4:05:20<2:03:22, 11.08s/it]
67%|██████▋ | 1333/2000 [4:05:31<2:03:21, 11.10s/it] {'loss': 0.7362, 'learning_rate': 5.287817932792485e-06, 'epoch': 0.67} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15401 total_samples=20283, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:19,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.27 | bwd_microstep: 1802.57 | bwd_inner_microstep: 1775.75 | bwd_allreduce_microstep: 26.76 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12800 total_samples=20286, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:22,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.23 | bwd_microstep: 1733.21 | bwd_inner_microstep: 1576.94 | bwd_allreduce_microstep: 156.20 | step_microstep: 0.24 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15890 total_samples=20290, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:24,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.70 | bwd_microstep: 1774.55 | bwd_inner_microstep: 1756.47 | bwd_allreduce_microstep: 18.01 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13188 total_samples=20294, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:27,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91 [2025-08-03 05:48:27,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.37 | bwd_microstep: 2088.07 | bwd_inner_microstep: 2036.77 | bwd_allreduce_microstep: 51.24 | step_microstep: 105.98 [2025-08-03 05:48:27,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.49 | bwd: 7398.45 | bwd_inner: 7145.91 | bwd_allreduce: 252.30 | step: 106.59 {'loss': 0.7324, 'learning_rate': 5.273540936770059e-06, 'epoch': 0.67} dynamic ViT batch size: 42, images per 
sample: 42.0, dynamic token length: 11569 total_samples=20297, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:30,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.96 | bwd_microstep: 1781.00 | bwd_inner_microstep: 1541.89 | bwd_allreduce_microstep: 239.05 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14079 total_samples=20301, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:32,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.92 | bwd_microstep: 1723.89 | bwd_inner_microstep: 1709.70 | bwd_allreduce_microstep: 14.13 | step_microstep: 0.25 dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13578 total_samples=20306, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:35,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.53 | bwd_microstep: 1957.19 | bwd_inner_microstep: 1852.42 | bwd_allreduce_microstep: 104.68 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11914 total_samples=20309, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:38,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96 [2025-08-03 05:48:38,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.50 | bwd_microstep: 2072.94 | bwd_inner_microstep: 1837.03 | bwd_allreduce_microstep: 235.85 | step_microstep: 133.09 [2025-08-03 05:48:38,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.85 | bwd: 7535.08 | bwd_inner: 6941.02 | bwd_allreduce: 593.81 | step: 133.67 {'loss': 0.7459, 'learning_rate': 5.259276335335522e-06, 'epoch': 0.67} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885 total_samples=20312, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:41,190] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.86 | bwd_microstep: 1921.62 | bwd_inner_microstep: 1552.34 | bwd_allreduce_microstep: 369.21 | step_microstep: 0.22 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14165 total_samples=20317, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:44,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.68 | bwd_microstep: 2077.90 | bwd_inner_microstep: 1988.71 | bwd_allreduce_microstep: 89.13 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14341 total_samples=20322, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:46,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.33 | bwd_microstep: 1716.00 | bwd_inner_microstep: 1693.76 | bwd_allreduce_microstep: 22.17 | step_microstep: 0.23 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11966 total_samples=20325, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:49,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24 [2025-08-03 05:48:49,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.64 | bwd_microstep: 1796.19 | bwd_inner_microstep: 1564.07 | bwd_allreduce_microstep: 232.05 | step_microstep: 116.32 [2025-08-03 05:48:49,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.44 | bwd: 7511.76 | bwd_inner: 6798.85 | bwd_allreduce: 712.65 | step: 116.91 {'loss': 0.735, 'learning_rate': 5.245024165896126e-06, 'epoch': 0.67} dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13502 total_samples=20329, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:51,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.77 | bwd_microstep: 1798.70 | bwd_inner_microstep: 1681.15 | bwd_allreduce_microstep: 117.49 | 
step_microstep: 0.14 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13510 total_samples=20333, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:54,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.55 | bwd_microstep: 2112.37 | bwd_inner_microstep: 1954.86 | bwd_allreduce_microstep: 157.45 | step_microstep: 0.20 dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12936 total_samples=20338, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:48:57,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.28 | bwd_microstep: 1962.43 | bwd_inner_microstep: 1651.27 | bwd_allreduce_microstep: 311.10 | step_microstep: 0.11 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11811 total_samples=20341, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:00,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.80 [2025-08-03 05:49:00,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.33 | bwd_microstep: 1733.07 | bwd_inner_microstep: 1705.20 | bwd_allreduce_microstep: 27.81 | step_microstep: 129.16 [2025-08-03 05:49:00,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.86 | bwd: 7606.62 | bwd_inner: 6992.46 | bwd_allreduce: 613.92 | step: 129.63 {'loss': 0.7469, 'learning_rate': 5.2307844658265236e-06, 'epoch': 0.67} dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13029 total_samples=20345, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:02,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.09 | bwd_microstep: 1766.81 | bwd_inner_microstep: 1661.21 | bwd_allreduce_microstep: 105.53 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13248 total_samples=20349, num_samples=4, num_padding_tokens=0, 
num_padding_images=0 [2025-08-03 05:49:05,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.33 | bwd_microstep: 1765.46 | bwd_inner_microstep: 1702.81 | bwd_allreduce_microstep: 62.59 | step_microstep: 0.12 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13800 total_samples=20353, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:08,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.30 | bwd_microstep: 2025.83 | bwd_inner_microstep: 1905.22 | bwd_allreduce_microstep: 120.53 | step_microstep: 0.27 dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13393 total_samples=20357, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:10,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07 [2025-08-03 05:49:10,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.67 | bwd_microstep: 1798.99 | bwd_inner_microstep: 1667.80 | bwd_allreduce_microstep: 131.12 | step_microstep: 111.06 [2025-08-03 05:49:10,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2871.31 | bwd: 7357.13 | bwd_inner: 6937.04 | bwd_allreduce: 419.85 | step: 111.58 {'loss': 0.7325, 'learning_rate': 5.216557272468675e-06, 'epoch': 0.67} 67%|██████▋ | 1333/2000 [4:05:32<2:03:21, 11.10s/it] 67%|██████▋ | 1334/2000 [4:05:42<2:01:33, 10.95s/it] 67%|██████▋ | 1334/2000 [4:05:42<2:01:33, 10.95s/it] 67%|██████▋ | 1335/2000 [4:05:53<2:00:41, 10.89s/it] 67%|██████▋ | 1335/2000 [4:05:53<2:00:41, 10.89s/it] 67%|██████▋ | 1336/2000 [4:06:04<1:59:59, 10.84s/it] 67%|██████▋ | 1336/2000 [4:06:04<1:59:59, 10.84s/it] 67%|██████▋ | 1337/2000 [4:06:14<1:59:49, 10.84s/it] 67%|██████▋ | 1337/2000 [4:06:14<1:59:49, 10.84s/it] 67%|██████▋ | 1338/2000 [4:06:25<1:58:56, 10.78s/it] dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13573 total_samples=20361, num_samples=4, num_padding_tokens=0,
num_padding_images=0 [2025-08-03 05:49:13,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.62 | bwd_microstep: 1836.09 | bwd_inner_microstep: 1639.95 | bwd_allreduce_microstep: 196.07 | step_microstep: 0.26 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13287 total_samples=20365, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:15,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.70 | bwd_microstep: 1880.78 | bwd_inner_microstep: 1710.50 | bwd_allreduce_microstep: 170.21 | step_microstep: 0.25 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13863 total_samples=20369, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:18,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.14 | bwd_microstep: 2130.81 | bwd_inner_microstep: 1756.91 | bwd_allreduce_microstep: 373.85 | step_microstep: 0.10 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11928 total_samples=20372, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:21,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.97 [2025-08-03 05:49:21,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.04 | bwd_microstep: 1790.92 | bwd_inner_microstep: 1583.20 | bwd_allreduce_microstep: 207.66 | step_microstep: 119.28 [2025-08-03 05:49:21,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.44 | bwd: 7638.66 | bwd_inner: 6690.55 | bwd_allreduce: 947.86 | step: 119.90 {'loss': 0.7366, 'learning_rate': 5.202342623131731e-06, 'epoch': 0.67} dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12173 total_samples=20375, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:24,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.79 | bwd_microstep: 1844.56 | 
bwd_inner_microstep: 1723.97 | bwd_allreduce_microstep: 120.52 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13621 total_samples=20379, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:26,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.91 | bwd_microstep: 1832.98 | bwd_inner_microstep: 1748.43 | bwd_allreduce_microstep: 84.48 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13623 total_samples=20383, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:29,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.50 | bwd_microstep: 2077.56 | bwd_inner_microstep: 1759.71 | bwd_allreduce_microstep: 317.79 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11641 total_samples=20386, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:32,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98 [2025-08-03 05:49:32,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.13 | bwd_microstep: 1861.07 | bwd_inner_microstep: 1529.39 | bwd_allreduce_microstep: 331.61 | step_microstep: 106.84 [2025-08-03 05:49:32,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.26 | bwd: 7616.22 | bwd_inner: 6761.50 | bwd_allreduce: 854.48 | step: 107.30 {'loss': 0.7311, 'learning_rate': 5.18814055509195e-06, 'epoch': 0.67} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13874 total_samples=20390, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:35,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.98 | bwd_microstep: 1833.24 | bwd_inner_microstep: 1735.67 | bwd_allreduce_microstep: 97.50 | step_microstep: 0.16 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 
13515 total_samples=20394, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:37,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.77 | bwd_microstep: 2002.88 | bwd_inner_microstep: 1827.46 | bwd_allreduce_microstep: 175.35 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13745 total_samples=20399, num_samples=5, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:40,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.68 | bwd_microstep: 1797.72 | bwd_inner_microstep: 1727.10 | bwd_allreduce_microstep: 70.56 | step_microstep: 0.22 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11604 total_samples=20402, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:43,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44 [2025-08-03 05:49:43,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.58 | bwd_microstep: 2177.66 | bwd_inner_microstep: 1884.25 | bwd_allreduce_microstep: 293.34 | step_microstep: 126.65 [2025-08-03 05:49:43,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.94 | bwd: 7811.55 | bwd_inner: 7174.47 | bwd_allreduce: 636.83 | step: 127.14 {'loss': 0.7416, 'learning_rate': 5.173951105592605e-06, 'epoch': 0.67} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13688 total_samples=20406, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:46,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.70 | bwd_microstep: 1734.93 | bwd_inner_microstep: 1680.92 | bwd_allreduce_microstep: 53.94 | step_microstep: 0.11 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13896 total_samples=20410, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:48,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 690.45 | bwd_microstep: 1951.25 | bwd_inner_microstep: 1856.14 | bwd_allreduce_microstep: 95.04 | step_microstep: 0.17 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13354 total_samples=20414, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:51,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.94 | bwd_microstep: 1741.22 | bwd_inner_microstep: 1683.48 | bwd_allreduce_microstep: 57.67 | step_microstep: 0.13 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11810 total_samples=20417, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:54,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25 [2025-08-03 05:49:54,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.92 | bwd_microstep: 1948.59 | bwd_inner_microstep: 1561.06 | bwd_allreduce_microstep: 387.47 | step_microstep: 109.39 [2025-08-03 05:49:54,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.93 | bwd: 7376.05 | bwd_inner: 6781.60 | bwd_allreduce: 594.21 | step: 109.80 {'loss': 0.7423, 'learning_rate': 5.1597743118438725e-06, 'epoch': 0.67} dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13126 total_samples=20421, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:56,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.39 | bwd_microstep: 1963.95 | bwd_inner_microstep: 1862.64 | bwd_allreduce_microstep: 101.24 | step_microstep: 0.10 dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14220 total_samples=20425, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:49:59,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.03 | bwd_microstep: 1713.85 | bwd_inner_microstep: 1697.03 | bwd_allreduce_microstep: 16.75 | step_microstep: 0.16 dynamic ViT batch size: 48, 
images per sample: 48.0, dynamic token length: 13629 total_samples=20429, num_samples=4, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:50:01,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.03 | bwd_microstep: 1743.57 | bwd_inner_microstep: 1685.36 | bwd_allreduce_microstep: 58.14 | step_microstep: 0.25 dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11676 total_samples=20432, num_samples=3, num_padding_tokens=0, num_padding_images=0 [2025-08-03 05:50:04,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08 [2025-08-03 05:50:04,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.19 | bwd_microstep: 1930.60 | bwd_inner_microstep: 1745.54 | bwd_allreduce_microstep: 184.99 | step_microstep: 139.59 [2025-08-03 05:50:04,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.57 | bwd: 7352.02 | bwd_inner: 6990.56 | bwd_allreduce: 361.21 | step: 140.12 {'loss': 0.7402, 'learning_rate': 5.145610211022738e-06, 'epoch': 0.67} 67%|██████▋ | 1339/2000 [4:06:36<1:59:05, 10.81s/it] 67%|██████▋ | 1339/2000 [4:06:36<1:59:05, 10.81s/it] 67%|██████▋ | 1340/2000 [4:06:47<1:59:08, 10.83s/it] 67%|██████▋ | 1340/2000 [4:06:47<1:59:08, 10.83s/it] 67%|██████▋ | 1341/2000 [4:06:58<1:59:45, 10.90s/it] 67%|██████▋ | 1341/2000 [4:06:58<1:59:45, 10.90s/it] 67%|██████▋ | 1342/2000 [4:07:08<1:58:33, 10.81s/it] 67%|██████▋ | 1342/2000 [4:07:08<1:58:33, 10.81s/it] 67%|██████▋ | 1343/2000 [4:07:19<1:57:39, 10.75s/it] 67%|██████▋ | 1343/2000 [4:07:19> Saving model checkpoint to work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000 [INFO|configuration_utils.py:473] 2025-08-03 07:49:29,761 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/config.json [INFO|configuration_utils.py:594] 2025-08-03 07:49:29,765
>> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/generation_config.json [INFO|modeling_utils.py:2493] 2025-08-03 07:49:34,265 >> Model weights saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/model.safetensors [INFO|tokenization_utils_base.py:2433] 2025-08-03 07:49:34,271 >> tokenizer config file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/tokenizer_config.json [INFO|tokenization_utils_base.py:2442] 2025-08-03 07:49:34,277 >> Special tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/special_tokens_map.json [INFO|tokenization_utils_base.py:2493] 2025-08-03 07:49:34,279 >> added tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/added_tokens.json [2025-08-03 07:49:34,911] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step2000 is about to be saved! [2025-08-03 07:49:34,972] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_16_mp_rank_00_model_states.pt... [2025-08-03 07:49:34,932] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt [2025-08-03 07:49:34,932] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt... 
[2025-08-03 07:49:34,950] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_24_mp_rank_00_model_states.pt... [2025-08-03 07:49:34,954] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_8_mp_rank_00_model_states.pt... [2025-08-03 07:49:35,022] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_24_mp_rank_00_model_states.pt. [2025-08-03 07:49:35,053] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_8_mp_rank_00_model_states.pt. [2025-08-03 07:49:35,044] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt. [2025-08-03 07:49:35,112] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_16_mp_rank_00_model_states.pt. [2025-08-03 07:49:35,108] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... [2025-08-03 07:49:35,130] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
[2025-08-03 07:49:35,090] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... [2025-08-03 07:49:35,107] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... [2025-08-03 07:49:36,654] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. [2025-08-03 07:49:36,654] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt [2025-08-03 07:49:36,690] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. [2025-08-03 07:49:36,691] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt [2025-08-03 07:49:36,718] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
[2025-08-03 07:49:36,718] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt [2025-08-03 07:49:36,731] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. [2025-08-03 07:49:36,752] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt [2025-08-03 07:49:36,782] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2000 is ready now! [2025-08-03 07:49:36,799] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2000 is ready now! [2025-08-03 07:49:36,802] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2000 is ready now! [2025-08-03 07:49:36,824] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2000 is ready now! 
Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table fp16_matmul._update_autotune_table() File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 422, in _update_autotune_table TritonMatmul._update_autotune_table(__class__.__name__ + "_4d_kernel", __class__._4d_kernel) File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table autotune_table = cache_manager.load() File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load with open(self.file_path, 'rb') as handle: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_4d_kernel.pickle' Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table fp16_matmul._update_autotune_table() File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel) File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table autotune_table = cache_manager.load() File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load with open(self.file_path, 'rb') as handle: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_2d_kernel.pickle' Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 74, in load loaded_dict = pickle.load(handle) FileNotFoundError: [Errno 2] No such file or directory [INFO|trainer.py:1962] 2025-08-03 07:49:40,263 >> Training completed. Do not forget to share your model on huggingface.co/models =) [INFO|trainer.py:1962] 2025-08-03 07:49:40,744 >> Training completed. Do not forget to share your model on huggingface.co/models =) [INFO|trainer.py:1962] 2025-08-03 07:49:41,108 >> Training completed. Do not forget to share your model on huggingface.co/models =) [INFO|trainer.py:1962] 2025-08-03 07:49:41,645 >> Training completed. 
Do not forget to share your model on huggingface.co/models =) {'train_runtime': 22016.0455, 'train_samples_per_second': 11.628, 'train_steps_per_second': 0.091, 'train_loss': 0.7865399915277957, 'epoch': 1.0} 100%|██████████| 2000/2000 [6:07:00<00:00, 10.97s/it] 100%|██████████| 2000/2000 [6:07:00<00:00, 11.01s/it] [INFO|trainer.py:2936] 2025-08-03 07:49:47,696 >> Saving model checkpoint to work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802 [INFO|configuration_utils.py:473] 2025-08-03 07:49:47,703 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/config.json [INFO|configuration_utils.py:594] 2025-08-03 07:49:47,707 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/generation_config.json Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table fp16_matmul._update_autotune_table() File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 422, in _update_autotune_table TritonMatmul._update_autotune_table(__class__.__name__ + "_4d_kernel", __class__._4d_kernel) File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table autotune_table = cache_manager.load() File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load with open(self.file_path, 'rb') as handle: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_4d_kernel.pickle' [INFO|modeling_utils.py:2493] 2025-08-03 
07:49:54,598 >> Model weights saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/model.safetensors [INFO|tokenization_utils_base.py:2433] 2025-08-03 07:49:54,603 >> tokenizer config file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tokenizer_config.json [INFO|tokenization_utils_base.py:2442] 2025-08-03 07:49:54,607 >> Special tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/special_tokens_map.json [INFO|tokenization_utils_base.py:2493] 2025-08-03 07:49:54,609 >> added tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/added_tokens.json
***** train metrics *****
  epoch                    =        1.0
  train_loss               =     0.7865
  train_runtime            = 6:06:56.04
  train_samples            =         -1
  train_samples_per_second =     11.628
  train_steps_per_second   =      0.091