[2025-08-03 01:38:26,089] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,089] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,090] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,112] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,113] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,113] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,113] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,096] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,099] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,104] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,104] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,104] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,129] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,129] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,129] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,132] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,128] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,128] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,133] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,133] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,135] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,135] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,140] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,135] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,135] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,140] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,141] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:38:26,141] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:39:10,022] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,012] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,038] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,038] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2025-08-03 01:39:10,034] [INFO] [comm.py:637:init_distributed] cdb=None
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=zero_stage3_config_100b_1e8.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/runs/Aug03_01-39-10_HOST-10-140-60-108,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=2000,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=steps,
save_total_limit=10000,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.05,
)
08/03/2025 01:39:10 - INFO - __main__ - Loading Tokenizer: /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B
[2025-08-03 01:39:10,308] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,322] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,323] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,324] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,327] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,334] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,336] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,346] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,343] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,351] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,344] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,353] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,353] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,347] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,356] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,350] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,352] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,353] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,354] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,361] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,363] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,393] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,394] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,395] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,397] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,398] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,399] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-08-03 01:39:10,400] [INFO] [comm.py:637:init_distributed] cdb=None
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:10,410 >> loading file tokenizer.json
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:10 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: False
[WARNING|logging.py:314] 2025-08-03 01:39:10,730 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[WARNING|logging.py:314] 2025-08-03 01:39:10,827 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,828 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,828 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,829 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,829 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,829 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,829 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[WARNING|logging.py:314] 2025-08-03 01:39:10,871 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,872 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,874 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,877 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,879 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,875 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,883 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,876 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,876 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,876 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,876 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[WARNING|logging.py:314] 2025-08-03 01:39:10,881 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[WARNING|logging.py:314] 2025-08-03 01:39:10,893 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[WARNING|logging.py:314] 2025-08-03 01:39:10,917 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[WARNING|logging.py:314] 2025-08-03 01:39:10,976 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,979 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,982 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,987 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,987 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2025-08-03 01:39:10,987 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[WARNING|logging.py:314] 2025-08-03 01:39:10,988 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
08/03/2025 01:39:11 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:11 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=zero_stage3_config_100b_1e8.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/runs/Aug03_01-39-11_HOST-10-140-66-41,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=2000,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=steps,
save_total_limit=10000,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.05,
)
08/03/2025 01:39:11 - INFO - __main__ - Loading Tokenizer: /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,198 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,198 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,199 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,199 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,199 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,199 >> loading file tokenizer.json
08/03/2025 01:39:11 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:11 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=zero_stage3_config_100b_1e8.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/runs/Aug03_01-39-11_HOST-10-140-66-62,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=2000,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=steps,
save_total_limit=10000,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.05,
)
08/03/2025 01:39:11 - INFO - __main__ - Loading Tokenizer: /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,197 >> loading file tokenizer.json
08/03/2025 01:39:11 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
08/03/2025 01:39:11 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=zero_stage3_config_100b_1e8.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/runs/Aug03_01-39-11_HOST-10-140-60-44,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_steps=2000,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=1000,
save_strategy=steps,
save_total_limit=10000,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.05,
)
08/03/2025 01:39:11 - INFO - __main__ - Loading Tokenizer: /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2025] 2025-08-03 01:39:11,253 >> loading file tokenizer.json
[WARNING|logging.py:314] 2025-08-03 01:39:11,394 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[WARNING|logging.py:314] 2025-08-03 01:39:11,396 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
[WARNING|logging.py:314] 2025-08-03 01:39:11,520 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TCSLoader] config_path: /mnt/petrelfs/yangganlin/petreloss_config_qingyun.conf
--> before Client(conf_path)
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
08/03/2025 01:42:12 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2025-08-03 01:42:12,183 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/config.json
[INFO|configuration_utils.py:792] 2025-08-03 01:42:12,185 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "_name_or_path": "/mnt/petrelfs/wangweiyun/workspace_wwy/open_source/InternVL/internvl_chat/work_dirs/internvl_chat_v3_0/InternVL3_0-2B-MPO-try0-2",
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "hidden_size": 1536,
  "image_fold": null,
  "llm_config": {
    "_attn_implementation_autoset": true,
    "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct",
    "add_cross_attention": false,
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 151643,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 151643,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 1536,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 8960,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "max_window_layers": 70,
    "min_length": 0,
    "model_type": "qwen2",
    "moe_config": null,
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 28,
    "num_key_value_heads": 2,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "factor": 2.0,
      "rope_type": "dynamic",
      "type": "dynamic"
    },
    "rope_theta": 1000000.0,
    "sep_token_id": null,
    "sliding_window": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": false,
    "use_sliding_window": false,
    "vocab_size": 151674
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "system_message": null,
  "template": "internvl2_5",
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_attn_implementation_autoset": true,
    "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "auto_map": {
      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
      "AutoModel": "modeling_intern_vit.InternVisionModel"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "capacity_factor": 1.2,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.1,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "eval_capacity_factor": 1.4,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 0.1,
    "initializer_range": 1e-10,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "laux_allreduce": "all_nodes",
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "moe_coeff_ratio": 0.5,
    "moe_intermediate_size": 768,
    "moe_output_scale": 4.0,
    "no_repeat_ngram_size": 0,
    "noisy_gate_policy": "RSample_before",
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_experts": 8,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "num_routed_experts": 4,
    "num_shared_experts": 4,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "shared_expert_intermediate_size": 3072,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true,
    "use_moe": false,
    "use_residual": true,
    "use_rts": false,
    "use_weighted_residual": false
  }
}

08/03/2025 01:42:12 - INFO - __main__ - Using flash_attention_2 for LLaMA
[INFO|modeling_utils.py:3473] 2025-08-03 01:42:12,191 >> loading weights file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/model.safetensors
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
08/03/2025 01:42:12 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2025-08-03 01:42:12,272 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/config.json
08/03/2025 01:42:12 - INFO - __main__ - Using flash_attention_2 for LLaMA
[INFO|modeling_utils.py:3473] 2025-08-03 01:42:12,279 >> loading weights file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/model.safetensors
[INFO|modeling_utils.py:1426] 2025-08-03 01:42:12,296 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|modeling_utils.py:3582] 2025-08-03 01:42:12,296 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:826] 2025-08-03 01:42:12,309 >> Generate config GenerationConfig {}

[INFO|modeling_utils.py:1426] 2025-08-03 01:42:12,735 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|modeling_utils.py:3582] 2025-08-03 01:42:12,735 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:826] 2025-08-03 01:42:12,753 >> Generate config GenerationConfig {}

--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
08/03/2025 01:42:14 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2025-08-03 01:42:14,229 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/config.json
08/03/2025 01:42:14 - INFO - __main__ - Using flash_attention_2 for LLaMA
[INFO|modeling_utils.py:3473] 2025-08-03 01:42:14,236 >> loading weights file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/model.safetensors
[INFO|modeling_utils.py:1426] 2025-08-03 01:42:14,253 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|modeling_utils.py:3582] 2025-08-03 01:42:14,253 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:826] 2025-08-03 01:42:14,267 >> Generate config GenerationConfig {}

--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
Replace QWEN2_ATTENTION_CLASSES to support packed training!!
Replace PHI3_ATTENTION_CLASSES to support packed training!!
Replace LLAMA_ATTENTION_CLASSES to support packed training!!
--> after Client(conf_path)
Replace INTERNLM2_ATTENTION_CLASSES to support packed training!!
08/03/2025 01:42:18 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:727] 2025-08-03 01:42:18,408 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/config.json
[INFO|configuration_utils.py:792] 2025-08-03 01:42:18,410 >> Model config InternVLChatConfig {
  "_commit_hash": null,
  "_name_or_path": "/mnt/petrelfs/wangweiyun/workspace_wwy/open_source/InternVL/internvl_chat/work_dirs/internvl_chat_v3_0/InternVL3_0-2B-MPO-try0-2",
  "architectures": [
    "InternVLChatModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
  },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "hidden_size": 1536,
  "image_fold": null,
  "llm_config": {
    "_attn_implementation_autoset": true,
    "_name_or_path": "./pretrained/Qwen2.5-32B-Instruct",
    "add_cross_attention": false,
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 151643,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 151643,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 1536,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 8960,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "max_window_layers": 70,
    "min_length": 0,
    "model_type": "qwen2",
    "moe_config": null,
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 12,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 28,
    "num_key_value_heads": 2,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "factor": 2.0,
      "rope_type": "dynamic",
      "type": "dynamic"
    },
    "rope_theta": 1000000.0,
    "sep_token_id": null,
    "sliding_window": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": false,
    "use_sliding_window": false,
    "vocab_size": 151674
  },
  "max_dynamic_patch": 12,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "pad2square": false,
  "ps_version": "v2",
  "select_layer": -1,
  "system_message": null,
  "template": "internvl2_5",
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": {
    "_attn_implementation_autoset": true,
    "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5",
    "add_cross_attention": false,
    "architectures": [
      "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "auto_map": {
      "AutoConfig": "configuration_intern_vit.InternVisionConfig",
      "AutoModel": "modeling_intern_vit.InternVisionModel"
    },
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "capacity_factor": 1.2,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.1,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "eval_capacity_factor": 1.4,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 0.1,
    "initializer_range": 1e-10,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "laux_allreduce": "all_nodes",
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "moe_coeff_ratio": 0.5,
    "moe_intermediate_size": 768,
    "moe_output_scale": 4.0,
    "no_repeat_ngram_size": 0,
    "noisy_gate_policy": "RSample_before",
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_experts": 8,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "num_routed_experts": 4,
    "num_shared_experts": 4,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "shared_expert_intermediate_size": 3072,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.37.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": true,
    "use_moe": false,
    "use_residual": true,
    "use_rts": false,
    "use_weighted_residual": false
  }
}
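
Note on the config above: the `[Dataset] num_image_token: 256` value reported later in this log appears consistent with the `force_image_size`, vision `patch_size`, and `downsample_ratio` fields printed here. The snippet below is a minimal sketch of that arithmetic, assuming the token count per 448x448 tile is the patch-grid area scaled by the square of the downsample ratio; the function name `num_image_token` is hypothetical and is not taken from the training code.

```python
# Values copied from the InternVLChatConfig dump in this log.
force_image_size = 448      # "force_image_size": 448
patch_size = 14             # vision_config "patch_size": 14
downsample_ratio = 0.5      # "downsample_ratio": 0.5


def num_image_token(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    """Tokens per image tile (assumed formula, not confirmed by the log).

    The ViT splits the tile into (image_size // patch_size)^2 patches, and
    the downsample step is assumed to shrink each spatial side by
    `downsample_ratio`, hence the squared factor.
    """
    patches_per_side = image_size // patch_size
    return int(patches_per_side ** 2 * downsample_ratio ** 2)


tokens = num_image_token(force_image_size, patch_size, downsample_ratio)
print(f"num_image_token: {tokens}")
```

Under this assumption a 32x32 patch grid reduces to 16x16 visual tokens per tile, which matches the value the dataset code logs.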

08/03/2025 01:42:18 - INFO - __main__ - Using flash_attention_2 for LLaMA
[INFO|modeling_utils.py:3473] 2025-08-03 01:42:18,416 >> loading weights file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/model.safetensors
[INFO|modeling_utils.py:1426] 2025-08-03 01:42:18,443 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|modeling_utils.py:3582] 2025-08-03 01:42:18,444 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:826] 2025-08-03 01:42:18,476 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:826] 2025-08-03 01:42:22,338 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "use_cache": false
}

[INFO|configuration_utils.py:826] 2025-08-03 01:42:22,326 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "use_cache": false
}

[INFO|configuration_utils.py:826] 2025-08-03 01:42:22,404 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "use_cache": false
}

[INFO|configuration_utils.py:826] 2025-08-03 01:42:22,466 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "use_cache": false
}

[2025-08-03 01:42:22,888] [INFO] [partition_parameters.py:343:__exit__] finished initializing model - num_params = 685, num_elems = 2.09B
[INFO|modeling_utils.py:4350] 2025-08-03 01:42:28,929 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4350] 2025-08-03 01:42:28,948 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4358] 2025-08-03 01:42:28,929 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|modeling_utils.py:4358] 2025-08-03 01:42:28,948 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|modeling_utils.py:4350] 2025-08-03 01:42:28,933 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4358] 2025-08-03 01:42:28,934 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|modeling_utils.py:4350] 2025-08-03 01:42:28,945 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4358] 2025-08-03 01:42:28,945 >> All the weights of InternVLChatModel were initialized from the model checkpoint at /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:779] 2025-08-03 01:42:28,941 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/generation_config.json
[INFO|configuration_utils.py:779] 2025-08-03 01:42:28,946 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/generation_config.json
[INFO|configuration_utils.py:779] 2025-08-03 01:42:28,961 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/generation_config.json
[INFO|configuration_utils.py:826] 2025-08-03 01:42:28,942 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:826] 2025-08-03 01:42:28,946 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:826] 2025-08-03 01:42:28,961 >> Generate config GenerationConfig {}

08/03/2025 01:42:28 - INFO - __main__ - Finished
08/03/2025 01:42:28 - INFO - __main__ - model.config.force_image_size: 448
08/03/2025 01:42:28 - INFO - __main__ - Finished
08/03/2025 01:42:28 - INFO - __main__ - data_args.force_image_size: 448
08/03/2025 01:42:28 - INFO - __main__ - model.config.vision_config.image_size: 448
08/03/2025 01:42:28 - INFO - __main__ - model.config.force_image_size: 448
08/03/2025 01:42:28 - INFO - __main__ - data_args.force_image_size: 448
08/03/2025 01:42:28 - INFO - __main__ - model.config.vision_config.image_size: 448
08/03/2025 01:42:28 - INFO - __main__ - Finished
08/03/2025 01:42:28 - INFO - __main__ - model.config.force_image_size: 448
08/03/2025 01:42:28 - INFO - __main__ - data_args.force_image_size: 448
08/03/2025 01:42:28 - INFO - __main__ - model.config.vision_config.image_size: 448
[INFO|configuration_utils.py:779] 2025-08-03 01:42:28,959 >> loading configuration file /mnt/petrelfs/share_data/wangweiyun/share_internvl/InternVL3-2B/generation_config.json
[INFO|configuration_utils.py:826] 2025-08-03 01:42:28,960 >> Generate config GenerationConfig {}

08/03/2025 01:42:28 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:28 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:28 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:28 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:28 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:29 - INFO - __main__ - Finished
08/03/2025 01:42:29 - INFO - __main__ - model.config.force_image_size: 448
08/03/2025 01:42:29 - INFO - __main__ - data_args.force_image_size: 448
08/03/2025 01:42:29 - INFO - __main__ - model.config.vision_config.image_size: 448
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:29 - INFO - __main__ - Add dataset: point_xy_format with length: 666578
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:29 - INFO - __main__ - Add dataset: converted_affordance with length: 16305
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:29 - INFO - __main__ - Add dataset: point_xy_format with length: 666578
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:29 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:29 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:30 - INFO - __main__ - Add dataset: converted_affordance with length: 16305
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:30 - INFO - __main__ - Add dataset: converted_trajectory with length: 17175
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:30 - INFO - __main__ - Add dataset: converted_trajectory with length: 17175
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:30 - INFO - __main__ - Add dataset: point_xy_format with length: 666578
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:30 - INFO - __main__ - Add dataset: converted_affordance with length: 16305
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:30 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:30 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:31 - INFO - __main__ - Add dataset: converted_trajectory with length: 17175
08/03/2025 01:42:31 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:31 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:31 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:31 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:31 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:32 - INFO - __main__ - Add dataset: pixmo-points with length: 161095
08/03/2025 01:42:32 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:32 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:32 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:32 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:32 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:32 - INFO - __main__ - Add dataset: pixmo-points with length: 161095
08/03/2025 01:42:32 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:32 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:32 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:32 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:32 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:33 - INFO - __main__ - Add dataset: pixmo-points with length: 161095
08/03/2025 01:42:33 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:33 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:33 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:33 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:33 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:34 - INFO - __main__ - Add dataset: point_xy_format with length: 666578
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:34 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:34 - INFO - __main__ - Add dataset: paco_lvis_v1_train with length: 228945
08/03/2025 01:42:34 - INFO - __main__ - Add dataset: paco_lvis_v1_train with length: 228945
08/03/2025 01:42:34 - INFO - __main__ - Add dataset: converted_affordance with length: 16305
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:34 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:34 - INFO - __main__ - Add dataset: converted_trajectory with length: 17175
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:34 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:34 - INFO - __main__ - Formatting inputs...Skip in lazy mode
08/03/2025 01:42:35 - INFO - __main__ - Add dataset: paco_lvis_v1_train with length: 228945
[INFO|trainer.py:522] 2025-08-03 01:42:36,070 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:571] 2025-08-03 01:42:36,070 >> Using auto half precision backend
[INFO|trainer.py:522] 2025-08-03 01:42:36,054 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:571] 2025-08-03 01:42:36,054 >> Using auto half precision backend
[INFO|trainer.py:522] 2025-08-03 01:42:36,328 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:571] 2025-08-03 01:42:36,328 >> Using auto half precision backend
08/03/2025 01:42:36 - INFO - __main__ - Add dataset: pixmo-points with length: 161095
08/03/2025 01:42:36 - INFO - __main__ - [Dataset] num_image_token: 256
08/03/2025 01:42:36 - INFO - __main__ - [Dataset] dynamic_image_size: True
08/03/2025 01:42:36 - INFO - __main__ - [Dataset] use_thumbnail: True
08/03/2025 01:42:36 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 12
08/03/2025 01:42:36 - INFO - __main__ - Formatting inputs...Skip in lazy mode
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
08/03/2025 01:42:38 - INFO - __main__ - Add dataset: paco_lvis_v1_train with length: 228945
08/03/2025 01:42:38 - INFO - internvl.train.dataset_packed - Loaded dataset to pack: ['point_xy_format', 'converted_affordance', 'converted_trajectory', 'pixmo-points', 'paco_lvis_v1_train'], self.num_images_expected=48, self.max_packed_tokens=16384, self.replacement=True, self.allow_overflow=False
08/03/2025 01:42:38 - INFO - internvl.train.dataset_packed - Sampling prob for each dataset:
point_xy_format          : 61.15%
converted_affordance     : 1.50%
converted_trajectory     : 1.58%
pixmo-points             : 14.78%
paco_lvis_v1_train       : 21.00%
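
Note on the sampling probabilities above: they appear to be proportional to the dataset lengths reported by the `Add dataset: ...` lines earlier in this log. The snippet below is a minimal sketch that checks this, assuming plain length-proportional weighting; the helper `sampling_probs` is hypothetical and is not the implementation in `internvl.train.dataset_packed`.

```python
# Dataset lengths copied from the "Add dataset: <name> with length: <n>" lines.
lengths = {
    "point_xy_format": 666578,
    "converted_affordance": 16305,
    "converted_trajectory": 17175,
    "pixmo-points": 161095,
    "paco_lvis_v1_train": 228945,
}


def sampling_probs(lengths: dict) -> dict:
    """Return a probability per dataset, proportional to its length (assumed)."""
    total = sum(lengths.values())
    return {name: n / total for name, n in lengths.items()}


probs = sampling_probs(lengths)
for name, p in probs.items():
    # Same column layout as the log: name left-padded to 25 characters.
    print(f"{name:<25}: {p * 100:.2f}%")
```

If the run used per-dataset repeat factors or other weighting, the result would differ; here the length-proportional values round to the same percentages the log prints.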
08/03/2025 01:42:38 - INFO - __main__ - vision_model.embeddings.class_embedding
08/03/2025 01:42:38 - INFO - __main__ - vision_model.embeddings.position_embedding
08/03/2025 01:42:38 - INFO - __main__ - vision_model.embeddings.patch_embedding.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.embeddings.patch_embedding.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.0.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.1.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.2.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.3.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.4.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.5.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.6.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.7.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.8.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.9.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.10.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.11.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.12.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.13.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.14.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.15.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.16.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.17.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.18.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.19.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.20.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.21.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.22.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.ls1
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.ls2
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.attn.qkv.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.attn.qkv.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.attn.proj.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.attn.proj.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.mlp.fc1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.mlp.fc1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.mlp.fc2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.mlp.fc2.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.norm1.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.norm1.bias
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.norm2.weight
08/03/2025 01:42:38 - INFO - __main__ - vision_model.encoder.layers.23.norm2.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.embed_tokens.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.0.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.1.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.2.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.3.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.4.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.5.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.6.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.7.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.8.post_attention_layernorm.weight
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.9.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.10.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.11.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.12.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.13.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.14.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.15.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.16.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.17.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.18.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.19.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.20.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.21.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.22.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.23.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.24.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.25.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.26.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.q_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.q_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.k_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.k_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.v_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.v_proj.bias
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.self_attn.o_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.mlp.gate_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.mlp.up_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.mlp.down_proj.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.input_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.layers.27.post_attention_layernorm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.model.norm.weight
08/03/2025 01:42:38 - INFO - __main__ - language_model.lm_head.weight
08/03/2025 01:42:38 - INFO - __main__ - mlp1.0.weight
08/03/2025 01:42:38 - INFO - __main__ - mlp1.0.bias
08/03/2025 01:42:38 - INFO - __main__ - mlp1.1.weight
08/03/2025 01:42:38 - INFO - __main__ - mlp1.1.bias
08/03/2025 01:42:38 - INFO - __main__ - mlp1.3.weight
08/03/2025 01:42:38 - INFO - __main__ - mlp1.3.bias
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
08/03/2025 01:42:38 - WARNING - accelerate.utils.other - Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.4399733543395996 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.4024319648742676 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.4023013114929199 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.4023604393005371 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.4067509174346924 seconds
Time to load fused_adam op: 0.40588808059692383 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.4028747081756592 seconds
Time to load fused_adam op: 0.40257835388183594 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.4022998809814453 seconds
Time to load fused_adam op: 0.4024229049682617 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.5733091831207275 seconds
Time to load fused_adam op: 0.5024425983428955 seconds
Time to load fused_adam op: 0.515697717666626 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.5719597339630127 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.5374770164489746 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.4998030662536621 seconds
Time to load fused_adam op: 0.47902798652648926 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.5331830978393555 seconds
Time to load fused_adam op: 0.5076618194580078 seconds
Time to load fused_adam op: 0.49834728240966797 seconds
Time to load fused_adam op: 0.4961533546447754 seconds
Time to load fused_adam op: 0.4804708957672119 seconds
Time to load fused_adam op: 0.5025105476379395 seconds
Time to load fused_adam op: 0.5071156024932861 seconds
[INFO|trainer.py:522] 2025-08-03 01:42:39,397 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:571] 2025-08-03 01:42:39,397 >> Using auto half precision backend
[2025-08-03 01:42:39,734] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown
[2025-08-03 01:42:39,752] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Using /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/hwfile/yangganlin/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.6324775218963623 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.5039777755737305 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.5029892921447754 seconds
Time to load fused_adam op: 0.5027832984924316 seconds
[2025-08-03 01:42:41,439] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-08-03 01:42:41,439] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
Loading extension module fused_adam...
Time to load fused_adam op: 0.6037395000457764 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.6040046215057373 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.602717399597168 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.6030519008636475 seconds
[2025-08-03 01:42:41,528] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-08-03 01:42:41,528] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2025-08-03 01:42:41,528] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2025-08-03 01:42:41,528] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2025-08-03 01:42:41,883] [INFO] [utils.py:800:see_memory_usage] Stage 3 initialize beginning
[2025-08-03 01:42:41,884] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB         Max_MA 1.43 GB         CA 0.89 GB         Max_CA 2 GB 
[2025-08-03 01:42:41,885] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 213.78 GB, percent = 10.6%
[2025-08-03 01:42:41,888] [INFO] [stage3.py:130:__init__] Reduce bucket size 100000000
[2025-08-03 01:42:41,888] [INFO] [stage3.py:131:__init__] Prefetch bucket size 100000000
[2025-08-03 01:42:42,106] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2025-08-03 01:42:42,107] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB         Max_MA 0.56 GB         CA 0.89 GB         Max_CA 1 GB 
[2025-08-03 01:42:42,108] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 213.78 GB, percent = 10.6%
Parameter Offload: Total persistent parameters: 526848 in 387 params
[2025-08-03 01:42:42,353] [INFO] [utils.py:800:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2025-08-03 01:42:42,354] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB         Max_MA 0.56 GB         CA 0.89 GB         Max_CA 1 GB 
[2025-08-03 01:42:42,355] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 213.78 GB, percent = 10.6%
[2025-08-03 01:42:42,568] [INFO] [utils.py:800:see_memory_usage] Before creating fp16 partitions
[2025-08-03 01:42:42,569] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB         Max_MA 0.56 GB         CA 0.89 GB         Max_CA 1 GB 
[2025-08-03 01:42:42,570] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 213.77 GB, percent = 10.6%
[2025-08-03 01:42:43,178] [INFO] [utils.py:800:see_memory_usage] After creating fp16 partitions: 1
[2025-08-03 01:42:43,179] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB         Max_MA 0.56 GB         CA 0.78 GB         Max_CA 1 GB 
[2025-08-03 01:42:43,181] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 213.89 GB, percent = 10.6%
[2025-08-03 01:42:43,361] [INFO] [utils.py:800:see_memory_usage] Before creating fp32 partitions
[2025-08-03 01:42:43,362] [INFO] [utils.py:801:see_memory_usage] MA 0.56 GB         Max_MA 0.56 GB         CA 0.78 GB         Max_CA 1 GB 
[2025-08-03 01:42:43,363] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 213.76 GB, percent = 10.6%
[2025-08-03 01:42:43,584] [INFO] [utils.py:800:see_memory_usage] After creating fp32 partitions
[2025-08-03 01:42:43,585] [INFO] [utils.py:801:see_memory_usage] MA 0.8 GB         Max_MA 0.92 GB         CA 1.14 GB         Max_CA 1 GB 
[2025-08-03 01:42:43,586] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 213.76 GB, percent = 10.6%
[2025-08-03 01:42:43,777] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2025-08-03 01:42:43,778] [INFO] [utils.py:801:see_memory_usage] MA 0.8 GB         Max_MA 0.8 GB         CA 1.14 GB         Max_CA 1 GB 
[2025-08-03 01:42:43,779] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 213.76 GB, percent = 10.6%
[2025-08-03 01:42:43,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | init_optimizer_state: 0.12
[2025-08-03 01:42:44,074] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2025-08-03 01:42:44,075] [INFO] [utils.py:801:see_memory_usage] MA 0.8 GB         Max_MA 1.05 GB         CA 1.39 GB         Max_CA 1 GB 
[2025-08-03 01:42:44,076] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 213.76 GB, percent = 10.6%
[2025-08-03 01:42:44,077] [INFO] [stage3.py:486:_setup_for_real_optimizer] optimizer state initialized
[INFO|trainer.py:1721] 2025-08-03 01:42:44,750 >> ***** Running training *****
[INFO|trainer.py:1722] 2025-08-03 01:42:44,750 >>   Num examples = 256,000
[INFO|trainer.py:1723] 2025-08-03 01:42:44,750 >>   Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:1724] 2025-08-03 01:42:44,750 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1721] 2025-08-03 01:42:44,769 >> ***** Running training *****
[INFO|trainer.py:1727] 2025-08-03 01:42:44,750 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1728] 2025-08-03 01:42:44,750 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1729] 2025-08-03 01:42:44,750 >>   Total optimization steps = 2,000
[INFO|trainer.py:1722] 2025-08-03 01:42:44,769 >>   Num examples = 256,000
[INFO|trainer.py:1723] 2025-08-03 01:42:44,769 >>   Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:1724] 2025-08-03 01:42:44,770 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2025-08-03 01:42:44,770 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1728] 2025-08-03 01:42:44,770 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1729] 2025-08-03 01:42:44,770 >>   Total optimization steps = 2,000
[INFO|trainer.py:1730] 2025-08-03 01:42:44,752 >>   Number of trainable parameters = 2,088,957,440
[INFO|trainer.py:1730] 2025-08-03 01:42:44,772 >>   Number of trainable parameters = 2,088,957,440
[INFO|trainer.py:1721] 2025-08-03 01:42:44,759 >> ***** Running training *****
[INFO|trainer.py:1722] 2025-08-03 01:42:44,760 >>   Num examples = 256,000
[INFO|trainer.py:1723] 2025-08-03 01:42:44,760 >>   Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:1724] 2025-08-03 01:42:44,760 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2025-08-03 01:42:44,760 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1728] 2025-08-03 01:42:44,760 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1729] 2025-08-03 01:42:44,760 >>   Total optimization steps = 2,000
[INFO|trainer.py:1730] 2025-08-03 01:42:44,762 >>   Number of trainable parameters = 2,088,957,440
[2025-08-03 01:42:45,053] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2025-08-03 01:42:45,054] [INFO] [utils.py:801:see_memory_usage] MA 1.11 GB         Max_MA 1.98 GB         CA 2.7 GB         Max_CA 3 GB 
[2025-08-03 01:42:45,055] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory:  used = 214.06 GB, percent = 10.6%
[2025-08-03 01:42:45,055] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2025-08-03 01:42:45,055] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2025-08-03 01:42:45,055] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fefb867e130>
[2025-08-03 01:42:45,056] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2025-08-03 01:42:45,057] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   amp_enabled .................. False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   amp_params ................... False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   bfloat16_enabled ............. True
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   bfloat16_immediate_grad_update  False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   checkpoint_parallel_write_pipeline  False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   checkpoint_tag_validation_enabled  True
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   checkpoint_tag_validation_fail  False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fefd0078ac0>
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   communication_data_type ...... None
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   curriculum_enabled_legacy .... False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   curriculum_params_legacy ..... False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   data_efficiency_enabled ...... False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   dataloader_drop_last ......... False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   disable_allgather ............ False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   dump_state ................... False
[2025-08-03 01:42:45,058] [INFO] [config.py:1000:print]   dynamic_loss_scale_args ...... None
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   eigenvalue_enabled ........... False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   eigenvalue_gas_boundary_resolution  1
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   eigenvalue_layer_num ......... 0
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   eigenvalue_max_iter .......... 100
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   eigenvalue_stability ......... 1e-06
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   eigenvalue_tol ............... 0.01
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   eigenvalue_verbose ........... False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   elasticity_enabled ........... False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   fp16_auto_cast ............... None
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   fp16_enabled ................. False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   fp16_master_weights_and_gradients  False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   global_rank .................. 0
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   grad_accum_dtype ............. None
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   gradient_accumulation_steps .. 4
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   gradient_clipping ............ 1.0
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   gradient_predivide_factor .... 1.0
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   graph_harvesting ............. False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   initial_dynamic_scale ........ 1
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   load_universal_checkpoint .... False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   loss_scale ................... 1.0
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   memory_breakdown ............. False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   mics_hierarchial_params_gather  False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   mics_shard_size .............. -1
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   optimizer_legacy_fusion ...... False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   optimizer_name ............... adamw
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.05}
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   pld_enabled .................. False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   pld_params ................... False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   prescale_gradients ........... False
[2025-08-03 01:42:45,059] [INFO] [config.py:1000:print]   scheduler_name ............... None
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   scheduler_params ............. None
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   seq_parallel_communication_data_type  torch.float32
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   sparse_attention ............. None
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   sparse_gradients_enabled ..... False
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   steps_per_print .............. inf
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   train_batch_size ............. 128
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   train_micro_batch_size_per_gpu  1
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   use_data_before_expert_parallel_  False
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   use_node_local_storage ....... False
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   wall_clock_breakdown ......... True
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   weight_quantization_config ... None
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   world_size ................... 32
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   zero_allow_untested_optimizer  False
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=100000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=100000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=100000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   zero_enabled ................. True
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   zero_force_ds_cpu_optimizer .. True
[2025-08-03 01:42:45,060] [INFO] [config.py:1000:print]   zero_optimization_stage ...... 3
[2025-08-03 01:42:45,060] [INFO] [config.py:986:print_user_config]   json = {
    "zero_optimization": {
        "stage": 3, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+08, 
        "reduce_bucket_size": 1.000000e+08, 
        "stage3_prefetch_bucket_size": 1.000000e+08, 
        "stage3_param_persistence_threshold": 1.000000e+04, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "fp16": {
        "enabled": false, 
        "auto_cast": true, 
        "loss_scale": 0, 
        "initial_scale_power": 32, 
        "loss_scale_window": 1000, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 2e-05, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08, 
            "weight_decay": 0.05
        }
    }, 
    "gradient_accumulation_steps": 4, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 128, 
    "train_micro_batch_size_per_gpu": 1, 
    "wall_clock_breakdown": true
}
[INFO|trainer.py:1721] 2025-08-03 01:42:45,060 >> ***** Running training *****
[INFO|trainer.py:1722] 2025-08-03 01:42:45,060 >>   Num examples = 256,000
[INFO|trainer.py:1723] 2025-08-03 01:42:45,060 >>   Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:1724] 2025-08-03 01:42:45,060 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2025-08-03 01:42:45,061 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1728] 2025-08-03 01:42:45,061 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1729] 2025-08-03 01:42:45,061 >>   Total optimization steps = 2,000
[INFO|trainer.py:1730] 2025-08-03 01:42:45,063 >>   Number of trainable parameters = 2,088,957,440
[2025-08-03 01:42:54,054] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,055] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,060] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,066] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,067] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,067] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,067] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,067] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,114] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,114] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,114] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,114] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,125] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,125] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,125] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,420] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,420] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,421] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,421] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,421] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,422] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,422] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:54,422] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:56,719] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:56,719] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:56,720] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:56,720] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:56,722] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:56,722] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:56,722] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:42:56,722] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:09,704] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:09,704] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:09,764] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:09,778] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:09,847] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:09,920] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:09,927] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:09,947] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:09,968] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,056] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,058] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,085] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,112] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,125] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,127] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,129] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,917] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,917] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,918] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,932] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,963] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,975] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,977] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:10,977] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:20,211] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:20,211] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:20,211] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:20,221] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:20,221] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:20,222] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:20,222] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:20,222] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,515] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,602] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,643] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,646] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,647] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,647] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,647] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,647] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,647] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,647] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,647] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,669] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,671] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,691] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,693] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:22,695] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:25,237] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:25,237] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:25,240] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:25,277] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:25,303] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:25,305] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:25,310] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:25,327] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,042] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,077] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,079] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,122] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,143] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,144] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,176] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,198] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,597] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,635] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,655] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,676] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,676] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,701] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,728] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:35,734] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,033] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,033] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,034] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,034] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,034] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,034] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,034] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,034] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,321] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,472] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,480] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,594] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,595] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,635] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,665] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:39,668] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:54,462] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:54,467] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:54,468] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:54,468] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:54,469] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:54,470] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:54,509] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-08-03 01:43:54,509] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13342
total_samples=4, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:17,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 9622.70 | bwd_microstep: 3464.68 | bwd_inner_microstep: 3449.24 | bwd_allreduce_microstep: 15.28 | step_microstep: 0.06
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13827
total_samples=8, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:24,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2499.24 | bwd_microstep: 4776.29 | bwd_inner_microstep: 4752.78 | bwd_allreduce_microstep: 23.45 | step_microstep: 0.07
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13444
total_samples=12, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:28,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1486.90 | bwd_microstep: 2472.37 | bwd_inner_microstep: 2332.45 | bwd_allreduce_microstep: 139.85 | step_microstep: 0.06
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13780
total_samples=16, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:33,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.49
[2025-08-03 01:44:33,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1628.79 | bwd_microstep: 2939.01 | bwd_inner_microstep: 2460.50 | bwd_allreduce_microstep: 478.37 | step_microstep: 154.60
[2025-08-03 01:44:33,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 15237.58 | bwd: 13652.38 | bwd_inner: 12995.01 | bwd_allreduce: 657.01 | step: 154.80
{'loss': 2.4023, 'learning_rate': 3.3333333333333335e-07, 'epoch': 0.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13811
total_samples=21, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:36,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1425.59 | bwd_microstep: 2143.14 | bwd_inner_microstep: 2002.90 | bwd_allreduce_microstep: 140.17 | step_microstep: 0.06
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13382
total_samples=25, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:41,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1503.37 | bwd_microstep: 3043.16 | bwd_inner_microstep: 2895.22 | bwd_allreduce_microstep: 147.86 | step_microstep: 0.08
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14242
total_samples=29, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:45,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1478.95 | bwd_microstep: 2405.54 | bwd_inner_microstep: 2383.43 | bwd_allreduce_microstep: 22.06 | step_microstep: 0.05
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13549
total_samples=33, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:48,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.63
[2025-08-03 01:44:48,912] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2025-08-03 01:44:48,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1448.58 | bwd_microstep: 1769.16 | bwd_inner_microstep: 1696.72 | bwd_allreduce_microstep: 72.35 | step_microstep: 137.52
[2025-08-03 01:44:48,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5856.42 | bwd: 9361.04 | bwd_inner: 8978.27 | bwd_allreduce: 382.52 | step: 137.73
{'loss': 2.3879, 'learning_rate': 6.666666666666667e-07, 'epoch': 0.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15000
total_samples=38, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:52,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1465.34 | bwd_microstep: 2043.81 | bwd_inner_microstep: 1938.67 | bwd_allreduce_microstep: 105.07 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13658
total_samples=42, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:55,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.59 | bwd_microstep: 1749.25 | bwd_inner_microstep: 1687.29 | bwd_allreduce_microstep: 61.89 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14057
total_samples=46, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:44:58,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1448.02 | bwd_microstep: 1831.49 | bwd_inner_microstep: 1731.25 | bwd_allreduce_microstep: 100.18 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14561
total_samples=50, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:02,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 01:45:02,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1450.32 | bwd_microstep: 2125.63 | bwd_inner_microstep: 2023.28 | bwd_allreduce_microstep: 102.29 | step_microstep: 123.93
[2025-08-03 01:45:02,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5070.21 | bwd: 7750.22 | bwd_inner: 7380.48 | bwd_allreduce: 369.51 | step: 124.28
{'loss': 2.4169, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13215
total_samples=54, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:05,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1451.74 | bwd_microstep: 1771.67 | bwd_inner_microstep: 1691.28 | bwd_allreduce_microstep: 80.32 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14164
total_samples=59, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:08,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1144.33 | bwd_microstep: 1803.45 | bwd_inner_microstep: 1734.69 | bwd_allreduce_microstep: 68.70 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13826
total_samples=63, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:11,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.65 | bwd_microstep: 2018.88 | bwd_inner_microstep: 1855.80 | bwd_allreduce_microstep: 163.01 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13482
total_samples=67, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:16,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51
[2025-08-03 01:45:16,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.46 | bwd_microstep: 3373.24 | bwd_inner_microstep: 2892.18 | bwd_allreduce_microstep: 480.96 | step_microstep: 953.91
[2025-08-03 01:45:16,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4056.09 | bwd: 8967.29 | bwd_inner: 8173.96 | bwd_allreduce: 793.06 | step: 954.39
{'loss': 2.3752, 'learning_rate': 1.3333333333333334e-06, 'epoch': 0.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14053
total_samples=71, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:19,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.38 | bwd_microstep: 2024.75 | bwd_inner_microstep: 1897.41 | bwd_allreduce_microstep: 127.28 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13469
total_samples=75, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:23,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1484.92 | bwd_microstep: 2638.82 | bwd_inner_microstep: 2627.48 | bwd_allreduce_microstep: 11.27 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13533
total_samples=79, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:27,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1422.82 | bwd_microstep: 2280.57 | bwd_inner_microstep: 2109.13 | bwd_allreduce_microstep: 171.37 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14828
total_samples=86, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:30,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 01:45:30,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1504.09 | bwd_microstep: 1778.89 | bwd_inner_microstep: 1735.27 | bwd_allreduce_microstep: 43.56 | step_microstep: 133.61
[2025-08-03 01:45:30,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5114.15 | bwd: 8723.08 | bwd_inner: 8369.28 | bwd_allreduce: 353.56 | step: 134.09
{'loss': 2.3778, 'learning_rate': 1.6666666666666667e-06, 'epoch': 0.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13349
total_samples=90, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:34,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1396.85 | bwd_microstep: 2294.00 | bwd_inner_microstep: 2121.57 | bwd_allreduce_microstep: 172.37 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13432
total_samples=94, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:39,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1408.40 | bwd_microstep: 3254.09 | bwd_inner_microstep: 2772.67 | bwd_allreduce_microstep: 481.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13493
total_samples=98, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:43,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1827.83 | bwd_microstep: 1798.47 | bwd_inner_microstep: 1712.90 | bwd_allreduce_microstep: 85.51 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13837
total_samples=102, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:46,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 01:45:46,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1187.49 | bwd_microstep: 2282.90 | bwd_inner_microstep: 2113.95 | bwd_allreduce_microstep: 168.89 | step_microstep: 114.58
[2025-08-03 01:45:46,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5820.50 | bwd: 9629.52 | bwd_inner: 8721.09 | bwd_allreduce: 908.20 | step: 114.90
{'loss': 2.3989, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13604
total_samples=106, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:49,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.80 | bwd_microstep: 1816.05 | bwd_inner_microstep: 1714.51 | bwd_allreduce_microstep: 101.48 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14006
total_samples=110, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:52,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1150.88 | bwd_microstep: 2246.20 | bwd_inner_microstep: 2088.92 | bwd_allreduce_microstep: 157.22 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14214
total_samples=114, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:55,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.48 | bwd_microstep: 1734.21 | bwd_inner_microstep: 1698.69 | bwd_allreduce_microstep: 35.46 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14825
total_samples=118, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:45:59,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 01:45:59,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1506.42 | bwd_microstep: 2408.00 | bwd_inner_microstep: 2258.29 | bwd_allreduce_microstep: 149.66 | step_microstep: 113.57
[2025-08-03 01:45:59,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4074.51 | bwd: 8204.50 | bwd_inner: 7760.39 | bwd_allreduce: 443.88 | step: 114.00
  0%|          | 0/2000 [00:00<?, ?it/s]
  0%|          | 1/2000 [01:48<60:00:45, 108.08s/it]
  0%|          | 2/2000 [02:03<29:49:15, 53.73s/it]
  0%|          | 3/2000 [02:17<19:33:54, 35.27s/it]
  0%|          | 4/2000 [02:31<14:57:48, 26.99s/it]
  0%|          | 5/2000 [02:45<12:25:23, 22.42s/it]
  0%|          | 6/2000 [03:01<11:11:22, 20.20s/it]
{'loss': 2.2732, 'learning_rate': 2.3333333333333336e-06, 'epoch': 0.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13109
total_samples=122, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:02,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.51 | bwd_microstep: 1843.82 | bwd_inner_microstep: 1693.67 | bwd_allreduce_microstep: 150.08 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14370
total_samples=127, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:05,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1393.74 | bwd_microstep: 2234.29 | bwd_inner_microstep: 2090.33 | bwd_allreduce_microstep: 143.91 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13987
total_samples=131, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:09,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1396.67 | bwd_microstep: 2237.42 | bwd_inner_microstep: 2086.73 | bwd_allreduce_microstep: 150.62 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13702
total_samples=135, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:13,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 01:46:13,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1526.12 | bwd_microstep: 1801.24 | bwd_inner_microstep: 1725.23 | bwd_allreduce_microstep: 75.94 | step_microstep: 109.87
[2025-08-03 01:46:13,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5052.95 | bwd: 8116.83 | bwd_inner: 7595.96 | bwd_allreduce: 520.63 | step: 110.21
{'loss': 2.2673, 'learning_rate': 2.666666666666667e-06, 'epoch': 0.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14060
total_samples=139, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:16,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1909.94 | bwd_microstep: 1793.68 | bwd_inner_microstep: 1725.44 | bwd_allreduce_microstep: 68.17 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13317
total_samples=143, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:20,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1502.18 | bwd_microstep: 2408.95 | bwd_inner_microstep: 2241.52 | bwd_allreduce_microstep: 167.37 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13695
total_samples=147, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:23,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.33 | bwd_microstep: 1795.01 | bwd_inner_microstep: 1707.41 | bwd_allreduce_microstep: 87.54 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14006
total_samples=151, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:27,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 01:46:27,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1549.93 | bwd_microstep: 1794.12 | bwd_inner_microstep: 1730.50 | bwd_allreduce_microstep: 63.55 | step_microstep: 122.71
[2025-08-03 01:46:27,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5656.30 | bwd: 7791.80 | bwd_inner: 7404.86 | bwd_allreduce: 386.71 | step: 123.14
{'loss': 2.0511, 'learning_rate': 3e-06, 'epoch': 0.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13940
total_samples=155, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:29,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.44 | bwd_microstep: 1829.86 | bwd_inner_microstep: 1754.74 | bwd_allreduce_microstep: 75.07 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13263
total_samples=159, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:32,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1440.67 | bwd_microstep: 1846.59 | bwd_inner_microstep: 1709.29 | bwd_allreduce_microstep: 137.24 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13360
total_samples=163, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:36,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1556.68 | bwd_microstep: 2164.97 | bwd_inner_microstep: 2003.90 | bwd_allreduce_microstep: 161.01 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14937
total_samples=167, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:39,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 01:46:39,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 752.86 | bwd_microstep: 1852.68 | bwd_inner_microstep: 1777.86 | bwd_allreduce_microstep: 74.76 | step_microstep: 154.26
[2025-08-03 01:46:39,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4452.59 | bwd: 7694.16 | bwd_inner: 7245.76 | bwd_allreduce: 448.16 | step: 154.59
{'loss': 2.0299, 'learning_rate': 3.3333333333333333e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13970
total_samples=173, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:44,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1580.62 | bwd_microstep: 3081.40 | bwd_inner_microstep: 2269.47 | bwd_allreduce_microstep: 811.87 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13114
total_samples=177, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:48,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1181.81 | bwd_microstep: 2620.74 | bwd_inner_microstep: 2240.16 | bwd_allreduce_microstep: 380.53 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13838
total_samples=181, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:52,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.51 | bwd_microstep: 3664.22 | bwd_inner_microstep: 2868.12 | bwd_allreduce_microstep: 795.98 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15164
total_samples=185, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:46:57,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 01:46:57,088] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2025-08-03 01:46:57,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1145.46 | bwd_microstep: 3038.69 | bwd_inner_microstep: 2573.88 | bwd_allreduce_microstep: 464.72 | step_microstep: 119.23
[2025-08-03 01:46:57,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4612.33 | bwd: 12405.10 | bwd_inner: 9951.67 | bwd_allreduce: 2453.14 | step: 119.66
{'loss': 1.9622, 'learning_rate': 3.6666666666666666e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13416
total_samples=190, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:01,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1161.25 | bwd_microstep: 2690.41 | bwd_inner_microstep: 2629.70 | bwd_allreduce_microstep: 60.65 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16090
total_samples=197, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:04,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1169.79 | bwd_microstep: 2307.52 | bwd_inner_microstep: 2164.33 | bwd_allreduce_microstep: 143.13 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13862
total_samples=201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:08,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1778.47 | bwd_microstep: 1812.57 | bwd_inner_microstep: 1735.50 | bwd_allreduce_microstep: 77.00 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12936
total_samples=205, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:11,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 01:47:11,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.64 | bwd_microstep: 1869.76 | bwd_inner_microstep: 1677.09 | bwd_allreduce_microstep: 192.61 | step_microstep: 115.86
[2025-08-03 01:47:11,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4842.08 | bwd: 8680.30 | bwd_inner: 8206.61 | bwd_allreduce: 473.46 | step: 116.29
{'loss': 1.7955, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13378
total_samples=209, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:13,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.40 | bwd_microstep: 1775.32 | bwd_inner_microstep: 1686.97 | bwd_allreduce_microstep: 88.29 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13298
total_samples=213, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:16,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.38 | bwd_microstep: 1776.91 | bwd_inner_microstep: 1688.86 | bwd_allreduce_microstep: 87.98 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14046
total_samples=217, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:18,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.08 | bwd_microstep: 1821.22 | bwd_inner_microstep: 1744.15 | bwd_allreduce_microstep: 77.01 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13307
total_samples=221, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:21,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51
[2025-08-03 01:47:21,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.49 | bwd_microstep: 1764.31 | bwd_inner_microstep: 1673.09 | bwd_allreduce_microstep: 91.15 | step_microstep: 146.43
[2025-08-03 01:47:21,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.29 | bwd: 7137.81 | bwd_inner: 6793.07 | bwd_allreduce: 344.51 | step: 146.77
{'loss': 1.7824, 'learning_rate': 4.333333333333334e-06, 'epoch': 0.01}
  0%|          | 7/2000 [03:14<9:49:50, 17.76s/it]
  0%|          | 8/2000 [03:27<9:05:49, 16.44s/it]
  0%|          | 9/2000 [03:41<8:39:34, 15.66s/it]
  0%|          | 10/2000 [03:54<8:08:09, 14.72s/it]
  1%|          | 11/2000 [04:11<8:35:50, 15.56s/it]
  1%|          | 12/2000 [04:25<8:19:37, 15.08s/it]
  1%|          | 13/2000 [04:36<7:33:01, 13.68s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13240
total_samples=225, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:24,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.15 | bwd_microstep: 1777.18 | bwd_inner_microstep: 1674.47 | bwd_allreduce_microstep: 102.64 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13267
total_samples=229, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:26,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.39 | bwd_microstep: 1845.45 | bwd_inner_microstep: 1702.12 | bwd_allreduce_microstep: 143.27 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14413
total_samples=237, num_samples=8, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:29,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.06 | bwd_microstep: 1763.46 | bwd_inner_microstep: 1714.30 | bwd_allreduce_microstep: 49.10 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13321
total_samples=241, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:32,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 01:47:32,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.84 | bwd_microstep: 1812.64 | bwd_inner_microstep: 1719.69 | bwd_allreduce_microstep: 92.89 | step_microstep: 113.03
[2025-08-03 01:47:32,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.38 | bwd: 7198.76 | bwd_inner: 6810.56 | bwd_allreduce: 387.96 | step: 113.36
{'loss': 1.7422, 'learning_rate': 4.666666666666667e-06, 'epoch': 0.01}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12721
total_samples=245, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:34,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.05 | bwd_microstep: 1803.94 | bwd_inner_microstep: 1645.05 | bwd_allreduce_microstep: 158.83 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13419
total_samples=249, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:37,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.00 | bwd_microstep: 1769.60 | bwd_inner_microstep: 1686.27 | bwd_allreduce_microstep: 83.26 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13155
total_samples=253, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:40,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1564.03 | bwd_microstep: 2156.78 | bwd_inner_microstep: 1989.41 | bwd_allreduce_microstep: 167.31 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13430
total_samples=257, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:43,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 01:47:43,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.93 | bwd_microstep: 1784.46 | bwd_inner_microstep: 1698.38 | bwd_allreduce_microstep: 86.01 | step_microstep: 130.92
[2025-08-03 01:47:43,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3683.94 | bwd: 7514.83 | bwd_inner: 7019.11 | bwd_allreduce: 495.49 | step: 131.37
{'loss': 1.6717, 'learning_rate': 5e-06, 'epoch': 0.01}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13141
total_samples=261, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:46,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.13 | bwd_microstep: 1774.31 | bwd_inner_microstep: 1697.38 | bwd_allreduce_microstep: 76.86 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14512
total_samples=266, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:48,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.08 | bwd_microstep: 1828.43 | bwd_inner_microstep: 1772.04 | bwd_allreduce_microstep: 56.32 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13488
total_samples=270, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:51,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.98 | bwd_microstep: 1801.72 | bwd_inner_microstep: 1675.94 | bwd_allreduce_microstep: 125.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14619
total_samples=274, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:54,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 01:47:54,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.63 | bwd_microstep: 1811.40 | bwd_inner_microstep: 1754.41 | bwd_allreduce_microstep: 56.92 | step_microstep: 118.68
[2025-08-03 01:47:54,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.76 | bwd: 7215.91 | bwd_inner: 6899.77 | bwd_allreduce: 315.89 | step: 119.02
{'loss': 1.632, 'learning_rate': 5.333333333333334e-06, 'epoch': 0.01}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13024
total_samples=278, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:47:56,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.71 | bwd_microstep: 1755.17 | bwd_inner_microstep: 1667.24 | bwd_allreduce_microstep: 87.87 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13272
total_samples=282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:00,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1572.66 | bwd_microstep: 1864.40 | bwd_inner_microstep: 1728.34 | bwd_allreduce_microstep: 135.99 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13715
total_samples=286, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:03,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.01 | bwd_microstep: 2126.20 | bwd_inner_microstep: 1986.82 | bwd_allreduce_microstep: 139.32 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15055
total_samples=290, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:06,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 01:48:06,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.69 | bwd_microstep: 1921.86 | bwd_inner_microstep: 1715.04 | bwd_allreduce_microstep: 206.75 | step_microstep: 117.53
[2025-08-03 01:48:06,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3668.00 | bwd: 7667.68 | bwd_inner: 7097.44 | bwd_allreduce: 570.00 | step: 117.84
{'loss': 1.5736, 'learning_rate': 5.666666666666667e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13462
total_samples=294, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:09,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.65 | bwd_microstep: 2789.54 | bwd_inner_microstep: 2638.88 | bwd_allreduce_microstep: 150.60 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13278
total_samples=298, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:12,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.60 | bwd_microstep: 1861.51 | bwd_inner_microstep: 1709.04 | bwd_allreduce_microstep: 152.41 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13496
total_samples=302, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:15,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.46 | bwd_microstep: 2762.02 | bwd_inner_microstep: 2284.03 | bwd_allreduce_microstep: 477.93 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13376
total_samples=306, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:19,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 01:48:19,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.97 | bwd_microstep: 2131.87 | bwd_inner_microstep: 1827.98 | bwd_allreduce_microstep: 303.83 | step_microstep: 446.57
[2025-08-03 01:48:19,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.61 | bwd: 9544.99 | bwd_inner: 8459.92 | bwd_allreduce: 1084.84 | step: 446.89
{'loss': 1.5352, 'learning_rate': 6e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13342
total_samples=310, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:22,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.96 | bwd_microstep: 2079.02 | bwd_inner_microstep: 1714.92 | bwd_allreduce_microstep: 364.04 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14519
total_samples=314, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:24,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.74 | bwd_microstep: 1806.32 | bwd_inner_microstep: 1750.58 | bwd_allreduce_microstep: 55.68 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13261
total_samples=318, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:27,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.35 | bwd_microstep: 1752.39 | bwd_inner_microstep: 1676.92 | bwd_allreduce_microstep: 75.41 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14159
total_samples=322, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:30,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 01:48:30,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1497.91 | bwd_microstep: 2011.16 | bwd_inner_microstep: 1826.44 | bwd_allreduce_microstep: 184.67 | step_microstep: 109.72
[2025-08-03 01:48:30,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3612.89 | bwd: 7648.95 | bwd_inner: 6968.86 | bwd_allreduce: 679.86 | step: 110.07
{'loss': 1.5105, 'learning_rate': 6.333333333333333e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13272
total_samples=327, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:34,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1716.82 | bwd_microstep: 1850.99 | bwd_inner_microstep: 1698.37 | bwd_allreduce_microstep: 152.56 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14337
total_samples=332, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:37,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.52 | bwd_microstep: 1830.18 | bwd_inner_microstep: 1754.66 | bwd_allreduce_microstep: 75.45 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13884
total_samples=336, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:41,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1572.06 | bwd_microstep: 2315.96 | bwd_inner_microstep: 2178.79 | bwd_allreduce_microstep: 137.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12924
total_samples=340, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:43,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 01:48:43,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.04 | bwd_microstep: 1800.03 | bwd_inner_microstep: 1685.12 | bwd_allreduce_microstep: 114.84 | step_microstep: 137.93
[2025-08-03 01:48:43,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4693.37 | bwd: 7797.20 | bwd_inner: 7316.92 | bwd_allreduce: 480.03 | step: 138.25
000 [04:36<7:33:01, 13.68s/it]
  1%|          | 14/2000 [04:46<7:01:09, 12.72s/it]
  1%|          | 15/2000 [04:58<6:50:28, 12.41s/it]
  1%|          | 16/2000 [05:09<6:31:27, 11.84s/it]
  1%|          | 17/2000 [05:20<6:30:47, 11.82s/it]
  1%|          | 18/2000 [05:34<6:43:52, 12.23s/it]
  1%|          | 19/2000 [05:45<6:38:43, 12.08s/it]
  1%|          | 20/2000 [05:58<6:47:18, 12.34s/it]
{'loss': 1.4713, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13885
total_samples=344, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:46,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.71 | bwd_microstep: 1859.62 | bwd_inner_microstep: 1731.28 | bwd_allreduce_microstep: 128.28 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13646
total_samples=349, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:49,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.87 | bwd_microstep: 1786.86 | bwd_inner_microstep: 1696.29 | bwd_allreduce_microstep: 90.51 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13272
total_samples=353, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:51,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.54 | bwd_microstep: 1802.82 | bwd_inner_microstep: 1702.93 | bwd_allreduce_microstep: 99.83 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13323
total_samples=357, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:54,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.82
[2025-08-03 01:48:54,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.84 | bwd_microstep: 1742.97 | bwd_inner_microstep: 1670.50 | bwd_allreduce_microstep: 72.40 | step_microstep: 129.96
[2025-08-03 01:48:54,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.89 | bwd: 7192.33 | bwd_inner: 6800.99 | bwd_allreduce: 391.10 | step: 130.30
{'loss': 1.4375, 'learning_rate': 7e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14072
total_samples=362, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:48:56,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.09 | bwd_microstep: 1757.00 | bwd_inner_microstep: 1702.36 | bwd_allreduce_microstep: 54.58 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13686
total_samples=366, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:00,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2163.93 | bwd_microstep: 1844.12 | bwd_inner_microstep: 1700.67 | bwd_allreduce_microstep: 143.39 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16370
total_samples=370, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:04,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1866.94 | bwd_microstep: 1831.30 | bwd_inner_microstep: 1823.07 | bwd_allreduce_microstep: 8.17 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12267
total_samples=374, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:07,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.00
[2025-08-03 01:49:07,419] [WARNING] [stage3.py:2069:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
[2025-08-03 01:49:07,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.61 | bwd_microstep: 1763.01 | bwd_inner_microstep: 1583.53 | bwd_allreduce_microstep: 179.40 | step_microstep: 128.75
[2025-08-03 01:49:07,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5417.50 | bwd: 7195.49 | bwd_inner: 6809.61 | bwd_allreduce: 385.62 | step: 129.29
{'loss': 1.4096, 'learning_rate': 7.333333333333333e-06, 'epoch': 0.01}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12611
total_samples=378, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:10,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.49 | bwd_microstep: 1806.18 | bwd_inner_microstep: 1633.21 | bwd_allreduce_microstep: 172.90 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13779
total_samples=382, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:12,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.56 | bwd_microstep: 1771.03 | bwd_inner_microstep: 1684.37 | bwd_allreduce_microstep: 86.59 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13327
total_samples=386, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:15,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.98 | bwd_microstep: 1725.68 | bwd_inner_microstep: 1669.77 | bwd_allreduce_microstep: 55.84 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12918
total_samples=390, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:17,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 01:49:17,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.19 | bwd_microstep: 1792.98 | bwd_inner_microstep: 1626.64 | bwd_allreduce_microstep: 166.27 | step_microstep: 111.28
[2025-08-03 01:49:17,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.17 | bwd: 7095.91 | bwd_inner: 6613.99 | bwd_allreduce: 481.69 | step: 111.84
{'loss': 1.3267, 'learning_rate': 7.666666666666667e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13153
total_samples=394, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:20,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.77 | bwd_microstep: 1797.81 | bwd_inner_microstep: 1696.14 | bwd_allreduce_microstep: 101.60 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13895
total_samples=398, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:23,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.12 | bwd_microstep: 2482.52 | bwd_inner_microstep: 1922.79 | bwd_allreduce_microstep: 559.66 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12482
total_samples=401, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:26,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.28 | bwd_microstep: 1979.63 | bwd_inner_microstep: 1583.59 | bwd_allreduce_microstep: 395.98 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13535
total_samples=405, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:29,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 01:49:29,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.27 | bwd_microstep: 2443.05 | bwd_inner_microstep: 2172.98 | bwd_allreduce_microstep: 270.02 | step_microstep: 132.36
[2025-08-03 01:49:29,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.37 | bwd: 8703.05 | bwd_inner: 7375.50 | bwd_allreduce: 1327.33 | step: 132.80
{'loss': 1.3389, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13623
total_samples=410, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:32,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.15 | bwd_microstep: 2137.30 | bwd_inner_microstep: 1903.42 | bwd_allreduce_microstep: 233.82 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12785
total_samples=414, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:35,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.08 | bwd_microstep: 2161.37 | bwd_inner_microstep: 1819.56 | bwd_allreduce_microstep: 341.75 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13433
total_samples=418, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:38,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.30 | bwd_microstep: 2446.37 | bwd_inner_microstep: 2029.40 | bwd_allreduce_microstep: 416.87 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14168
total_samples=422, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:41,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 01:49:41,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.58 | bwd_microstep: 2060.24 | bwd_inner_microstep: 2054.23 | bwd_allreduce_microstep: 5.95 | step_microstep: 107.35
[2025-08-03 01:49:41,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.03 | bwd: 8805.33 | bwd_inner: 7806.63 | bwd_allreduce: 998.45 | step: 107.81
{'loss': 1.2893, 'learning_rate': 8.333333333333334e-06, 'epoch': 0.01}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15358
total_samples=427, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:44,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.95 | bwd_microstep: 1741.46 | bwd_inner_microstep: 1719.21 | bwd_allreduce_microstep: 22.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13211
total_samples=431, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:46,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.34 | bwd_microstep: 1812.00 | bwd_inner_microstep: 1714.61 | bwd_allreduce_microstep: 97.33 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12996
total_samples=435, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:49,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.18 | bwd_microstep: 1841.53 | bwd_inner_microstep: 1686.10 | bwd_allreduce_microstep: 155.37 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13036
total_samples=439, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:52,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 01:49:52,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.04 | bwd_microstep: 1824.19 | bwd_inner_microstep: 1685.41 | bwd_allreduce_microstep: 138.72 | step_microstep: 139.10
[2025-08-03 01:49:52,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.45 | bwd: 7219.22 | bwd_inner: 6805.32 | bwd_allreduce: 413.68 | step: 139.52
{'loss': 1.2338, 'learning_rate': 8.666666666666668e-06, 'epoch': 0.01}
  1%|          | 21/2000 [06:09<6:28:26, 11.78s/it]
  1%|          | 22/2000 [06:22<6:41:12, 12.17s/it]
  1%|          | 23/2000 [06:32<6:22:58, 11.62s/it]
  1%|          | 24/2000 [06:44<6:26:13, 11.73s/it]
  1%|▏         | 25/2000 [06:56<6:29:15, 11.83s/it]
  1%|▏         | 26/2000 [07:07<6:16:01, 11.43s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14735
total_samples=445, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:54,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.94 | bwd_microstep: 1744.81 | bwd_inner_microstep: 1712.20 | bwd_allreduce_microstep: 32.55 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13325
total_samples=449, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:57,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.77 | bwd_microstep: 1759.01 | bwd_inner_microstep: 1685.33 | bwd_allreduce_microstep: 73.62 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13880
total_samples=453, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:49:59,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.75 | bwd_microstep: 1731.30 | bwd_inner_microstep: 1698.87 | bwd_allreduce_microstep: 32.36 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12179
total_samples=456, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:02,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 01:50:02,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.40 | bwd_microstep: 1822.52 | bwd_inner_microstep: 1583.12 | bwd_allreduce_microstep: 239.34 | step_microstep: 143.20
[2025-08-03 01:50:02,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.79 | bwd: 7057.70 | bwd_inner: 6679.51 | bwd_allreduce: 377.95 | step: 143.55
{'loss': 1.262, 'learning_rate': 9e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13756
total_samples=460, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:06,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1941.73 | bwd_microstep: 2119.74 | bwd_inner_microstep: 1935.58 | bwd_allreduce_microstep: 184.10 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13929
total_samples=464, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:09,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.89 | bwd_microstep: 1781.77 | bwd_inner_microstep: 1720.60 | bwd_allreduce_microstep: 61.11 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11954
total_samples=467, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:11,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.11 | bwd_microstep: 1766.79 | bwd_inner_microstep: 1549.10 | bwd_allreduce_microstep: 217.63 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13276
total_samples=471, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:15,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 01:50:15,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1826.55 | bwd_microstep: 1850.62 | bwd_inner_microstep: 1726.37 | bwd_allreduce_microstep: 124.19 | step_microstep: 115.80
[2025-08-03 01:50:15,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5157.20 | bwd: 7518.97 | bwd_inner: 6931.64 | bwd_allreduce: 587.10 | step: 116.12
{'loss': 1.2276, 'learning_rate': 9.333333333333334e-06, 'epoch': 0.01}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13471
total_samples=475, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:18,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.13 | bwd_microstep: 1773.52 | bwd_inner_microstep: 1680.26 | bwd_allreduce_microstep: 93.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15824
total_samples=479, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:20,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.82 | bwd_microstep: 1836.25 | bwd_inner_microstep: 1797.43 | bwd_allreduce_microstep: 38.74 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13023
total_samples=483, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:23,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.82 | bwd_microstep: 1811.07 | bwd_inner_microstep: 1659.91 | bwd_allreduce_microstep: 151.10 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14493
total_samples=487, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:26,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 01:50:26,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.63 | bwd_microstep: 1811.08 | bwd_inner_microstep: 1732.00 | bwd_allreduce_microstep: 79.01 | step_microstep: 167.39
[2025-08-03 01:50:26,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.33 | bwd: 7231.97 | bwd_inner: 6869.60 | bwd_allreduce: 362.13 | step: 167.85
{'loss': 1.2105, 'learning_rate': 9.666666666666667e-06, 'epoch': 0.01}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15363
total_samples=491, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:28,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.75 | bwd_microstep: 1824.98 | bwd_inner_microstep: 1796.63 | bwd_allreduce_microstep: 28.28 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14029
total_samples=495, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:31,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.19 | bwd_microstep: 1732.11 | bwd_inner_microstep: 1696.26 | bwd_allreduce_microstep: 35.79 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14433
total_samples=499, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:33,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.07 | bwd_microstep: 1728.00 | bwd_inner_microstep: 1698.38 | bwd_allreduce_microstep: 29.55 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12903
total_samples=502, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:38,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 01:50:38,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1800.71 | bwd_microstep: 2195.80 | bwd_inner_microstep: 2034.67 | bwd_allreduce_microstep: 161.07 | step_microstep: 108.13
[2025-08-03 01:50:38,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3919.65 | bwd: 7480.93 | bwd_inner: 7225.92 | bwd_allreduce: 254.77 | step: 108.59
{'loss': 1.1972, 'learning_rate': 1e-05, 'epoch': 0.01}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11725
total_samples=505, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:41,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.97 | bwd_microstep: 2066.42 | bwd_inner_microstep: 1801.05 | bwd_allreduce_microstep: 265.31 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13702
total_samples=509, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:43,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.21 | bwd_microstep: 2130.42 | bwd_inner_microstep: 1737.18 | bwd_allreduce_microstep: 393.18 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13313
total_samples=513, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:46,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.63 | bwd_microstep: 2265.59 | bwd_inner_microstep: 2118.52 | bwd_allreduce_microstep: 147.01 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13605
total_samples=517, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:50,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 01:50:50,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.22 | bwd_microstep: 2201.43 | bwd_inner_microstep: 2047.82 | bwd_allreduce_microstep: 153.54 | step_microstep: 110.81
[2025-08-03 01:50:50,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.96 | bwd: 8663.90 | bwd_inner: 7704.56 | bwd_allreduce: 959.11 | step: 111.15
{'loss': 1.1875, 'learning_rate': 1.0333333333333335e-05, 'epoch': 0.02}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12237
total_samples=520, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:52,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.88 | bwd_microstep: 1980.19 | bwd_inner_microstep: 1735.03 | bwd_allreduce_microstep: 245.09 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13512
total_samples=525, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:55,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.93 | bwd_microstep: 1889.78 | bwd_inner_microstep: 1722.96 | bwd_allreduce_microstep: 166.75 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13191
total_samples=529, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:50:58,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 960.05 | bwd_microstep: 2173.23 | bwd_inner_microstep: 2042.26 | bwd_allreduce_microstep: 130.91 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14014
total_samples=533, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:01,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 01:51:01,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1014.15 | bwd_microstep: 1842.22 | bwd_inner_microstep: 1759.97 | bwd_allreduce_microstep: 82.19 | step_microstep: 111.56
[2025-08-03 01:51:01,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3334.93 | bwd: 7885.46 | bwd_inner: 7260.22 | bwd_allreduce: 625.00 | step: 112.02
{'loss': 1.1659, 'learning_rate': 1.0666666666666667e-05, 'epoch': 0.02}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13196
total_samples=537, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:04,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.13 | bwd_microstep: 1774.95 | bwd_inner_microstep: 1691.05 | bwd_allreduce_microstep: 83.83 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13633
total_samples=541, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:06,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.42 | bwd_microstep: 1767.85 | bwd_inner_microstep: 1665.20 | bwd_allreduce_microstep: 102.59 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13492
total_samples=546, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:09,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.91 | bwd_microstep: 1738.51 | bwd_inner_microstep: 1661.00 | bwd_allreduce_microstep: 77.45 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13865
total_samples=550, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:11,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.82
[2025-08-03 01:51:11,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.73 | bwd_microstep: 1761.01 | bwd_inner_microstep: 1713.64 | bwd_allreduce_microstep: 47.30 | step_microstep: 119.48
[2025-08-03 01:51:11,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.12 | bwd: 7042.36 | bwd_inner: 6730.90 | bwd_allreduce: 311.24 | step: 119.81
2000 [07:07<6:16:01, 11.43s/it]
  1%|▏         | 27/2000 [07:17<6:05:05, 11.10s/it]
  1%|▏         | 28/2000 [07:30<6:24:33, 11.70s/it]
  1%|▏         | 29/2000 [07:41<6:12:59, 11.35s/it]
  2%|▏         | 30/2000 [07:52<6:17:49, 11.51s/it]
  2%|▏         | 30/2000 [07:53<6:17:49, 11.51s/it]
  2%|▏         | 31/2000 [08:04<6:21:10, 11.62s/it]
  2%|▏         | 32/2000 [08:16<6:21:26, 11.63s/it]
{'loss': 1.1657, 'learning_rate': 1.1000000000000001e-05, 'epoch': 0.02}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14955
total_samples=554, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:14,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.17 | bwd_microstep: 1744.67 | bwd_inner_microstep: 1708.12 | bwd_allreduce_microstep: 36.49 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12655
total_samples=558, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:16,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.11 | bwd_microstep: 1746.48 | bwd_inner_microstep: 1609.09 | bwd_allreduce_microstep: 137.32 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14507
total_samples=563, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:21,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1874.40 | bwd_microstep: 2085.25 | bwd_inner_microstep: 1909.81 | bwd_allreduce_microstep: 175.38 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13870
total_samples=567, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:23,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 01:51:23,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.15 | bwd_microstep: 1768.67 | bwd_inner_microstep: 1727.27 | bwd_allreduce_microstep: 41.33 | step_microstep: 139.08
[2025-08-03 01:51:23,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3949.76 | bwd: 7345.11 | bwd_inner: 6954.29 | bwd_allreduce: 390.60 | step: 139.40
{'loss': 1.1602, 'learning_rate': 1.1333333333333334e-05, 'epoch': 0.02}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13169
total_samples=571, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:26,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.26 | bwd_microstep: 1804.52 | bwd_inner_microstep: 1685.79 | bwd_allreduce_microstep: 118.67 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468
total_samples=575, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:28,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.54 | bwd_microstep: 1738.83 | bwd_inner_microstep: 1687.99 | bwd_allreduce_microstep: 50.77 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11709
total_samples=578, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:31,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.27 | bwd_microstep: 1729.59 | bwd_inner_microstep: 1536.27 | bwd_allreduce_microstep: 193.26 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15931
total_samples=584, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:33,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 01:51:33,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.36 | bwd_microstep: 1753.40 | bwd_inner_microstep: 1744.21 | bwd_allreduce_microstep: 9.14 | step_microstep: 116.49
[2025-08-03 01:51:33,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.35 | bwd: 7026.39 | bwd_inner: 6654.25 | bwd_allreduce: 371.91 | step: 116.91
{'loss': 1.1493, 'learning_rate': 1.1666666666666668e-05, 'epoch': 0.02}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13990
total_samples=588, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:36,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.03 | bwd_microstep: 1784.21 | bwd_inner_microstep: 1730.46 | bwd_allreduce_microstep: 53.69 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13918
total_samples=592, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:39,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.62 | bwd_microstep: 1747.20 | bwd_inner_microstep: 1697.96 | bwd_allreduce_microstep: 49.18 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12669
total_samples=595, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:41,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.41 | bwd_microstep: 1742.14 | bwd_inner_microstep: 1584.57 | bwd_allreduce_microstep: 157.50 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13880
total_samples=599, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:44,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.81
[2025-08-03 01:51:44,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.80 | bwd_microstep: 1729.17 | bwd_inner_microstep: 1688.75 | bwd_allreduce_microstep: 40.35 | step_microstep: 146.45
[2025-08-03 01:51:44,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.77 | bwd: 7002.76 | bwd_inner: 6701.74 | bwd_allreduce: 300.80 | step: 146.78
{'loss': 1.1404, 'learning_rate': 1.2e-05, 'epoch': 0.02}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13222
total_samples=603, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:46,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.59 | bwd_microstep: 1718.74 | bwd_inner_microstep: 1639.83 | bwd_allreduce_microstep: 78.84 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11956
total_samples=607, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:49,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.86 | bwd_microstep: 1778.35 | bwd_inner_microstep: 1553.65 | bwd_allreduce_microstep: 224.64 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11831
total_samples=610, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:52,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.36 | bwd_microstep: 2049.13 | bwd_inner_microstep: 1816.33 | bwd_allreduce_microstep: 232.73 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14274
total_samples=614, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:55,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 01:51:55,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.08 | bwd_microstep: 2099.67 | bwd_inner_microstep: 1952.73 | bwd_allreduce_microstep: 146.88 | step_microstep: 153.04
[2025-08-03 01:51:55,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2747.83 | bwd: 7645.94 | bwd_inner: 6962.53 | bwd_allreduce: 683.16 | step: 153.39
{'loss': 1.149, 'learning_rate': 1.2333333333333334e-05, 'epoch': 0.02}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12720
total_samples=618, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:51:57,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.03 | bwd_microstep: 2114.39 | bwd_inner_microstep: 1850.20 | bwd_allreduce_microstep: 264.13 | step_microstep: 0.20
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12809
total_samples=622, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:00,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.27 | bwd_microstep: 2123.89 | bwd_inner_microstep: 1852.63 | bwd_allreduce_microstep: 271.19 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11416
total_samples=625, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:03,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.53 | bwd_microstep: 2096.16 | bwd_inner_microstep: 1625.77 | bwd_allreduce_microstep: 470.29 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13569
total_samples=629, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:06,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 01:52:06,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.59 | bwd_microstep: 2120.68 | bwd_inner_microstep: 1950.81 | bwd_allreduce_microstep: 169.79 | step_microstep: 113.22
[2025-08-03 01:52:06,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.34 | bwd: 8455.16 | bwd_inner: 7279.43 | bwd_allreduce: 1175.45 | step: 113.64
{'loss': 1.1282, 'learning_rate': 1.2666666666666667e-05, 'epoch': 0.02}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15392
total_samples=634, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:09,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.95 | bwd_microstep: 2201.84 | bwd_inner_microstep: 2066.94 | bwd_allreduce_microstep: 134.84 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11729
total_samples=637, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:13,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1488.54 | bwd_microstep: 1969.44 | bwd_inner_microstep: 1860.09 | bwd_allreduce_microstep: 109.29 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13483
total_samples=641, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:16,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 667.32 | bwd_microstep: 2277.02 | bwd_inner_microstep: 1868.51 | bwd_allreduce_microstep: 408.44 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13264
total_samples=645, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:19,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 01:52:19,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.18 | bwd_microstep: 2047.12 | bwd_inner_microstep: 1864.08 | bwd_allreduce_microstep: 182.97 | step_microstep: 113.91
[2025-08-03 01:52:19,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3535.91 | bwd: 8495.46 | bwd_inner: 7659.62 | bwd_allreduce: 835.62 | step: 114.23
  2%|▏         | 33/2000 [08:26<6:07:52, 11.22s/it]
  2%|▏         | 34/2000 [08:38<6:13:03, 11.39s/it]
  2%|▏         | 35/2000 [08:48<6:01:35, 11.04s/it]
  2%|▏         | 36/2000 [08:59<5:53:55, 10.81s/it]
  2%|▏         | 37/2000 [09:09<5:54:25, 10.83s/it]
  2%|▏         | 38/2000 [09:21<6:02:26, 11.08s/it]
  2%|▏         | 39/2000 [09:34<6:15:54, 11.50s/it]
{'loss': 1.1266, 'learning_rate': 1.3000000000000001e-05, 'epoch': 0.02}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13428
total_samples=650, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:21,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.63 | bwd_microstep: 1823.34 | bwd_inner_microstep: 1697.87 | bwd_allreduce_microstep: 125.40 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12042
total_samples=653, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:24,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 662.40 | bwd_microstep: 1752.84 | bwd_inner_microstep: 1562.53 | bwd_allreduce_microstep: 190.25 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13383
total_samples=657, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:26,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.95 | bwd_microstep: 1766.67 | bwd_inner_microstep: 1683.07 | bwd_allreduce_microstep: 83.54 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15676
total_samples=661, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:29,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 01:52:29,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.55 | bwd_microstep: 1775.57 | bwd_inner_microstep: 1769.50 | bwd_allreduce_microstep: 6.00 | step_microstep: 136.24
[2025-08-03 01:52:29,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.46 | bwd: 7118.46 | bwd_inner: 6712.96 | bwd_allreduce: 405.27 | step: 136.56
{'loss': 1.1207, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.02}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11819
total_samples=664, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:32,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.37 | bwd_microstep: 1769.41 | bwd_inner_microstep: 1545.31 | bwd_allreduce_microstep: 224.03 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12962
total_samples=668, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:34,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.37 | bwd_microstep: 1716.34 | bwd_inner_microstep: 1588.27 | bwd_allreduce_microstep: 128.01 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14322
total_samples=672, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:37,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.11 | bwd_microstep: 1831.14 | bwd_inner_microstep: 1746.20 | bwd_allreduce_microstep: 84.87 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13366
total_samples=676, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:39,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 01:52:39,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.53 | bwd_microstep: 1780.43 | bwd_inner_microstep: 1707.63 | bwd_allreduce_microstep: 72.73 | step_microstep: 118.46
[2025-08-03 01:52:39,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.32 | bwd: 7097.36 | bwd_inner: 6587.42 | bwd_allreduce: 509.72 | step: 118.80
{'loss': 1.1191, 'learning_rate': 1.3666666666666667e-05, 'epoch': 0.02}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11748
total_samples=679, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:42,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.30 | bwd_microstep: 1809.15 | bwd_inner_microstep: 1564.26 | bwd_allreduce_microstep: 244.82 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12040
total_samples=682, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:45,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.48 | bwd_microstep: 1768.28 | bwd_inner_microstep: 1575.88 | bwd_allreduce_microstep: 192.34 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11840
total_samples=685, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:47,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.18 | bwd_microstep: 1724.54 | bwd_inner_microstep: 1531.65 | bwd_allreduce_microstep: 192.82 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14575
total_samples=691, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:50,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.82
[2025-08-03 01:52:50,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.58 | bwd_microstep: 1829.64 | bwd_inner_microstep: 1739.00 | bwd_allreduce_microstep: 90.58 | step_microstep: 141.81
[2025-08-03 01:52:50,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.48 | bwd: 7131.67 | bwd_inner: 6410.79 | bwd_allreduce: 720.64 | step: 142.15
{'loss': 1.0977, 'learning_rate': 1.4e-05, 'epoch': 0.02}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13837
total_samples=695, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:53,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.84 | bwd_microstep: 1823.75 | bwd_inner_microstep: 1714.05 | bwd_allreduce_microstep: 109.65 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12204
total_samples=699, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:55,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.65 | bwd_microstep: 1745.30 | bwd_inner_microstep: 1570.14 | bwd_allreduce_microstep: 175.09 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12321
total_samples=702, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:52:58,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.22 | bwd_microstep: 1780.57 | bwd_inner_microstep: 1572.11 | bwd_allreduce_microstep: 208.38 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13279
total_samples=706, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:00,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 01:53:00,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.08 | bwd_microstep: 1801.15 | bwd_inner_microstep: 1693.91 | bwd_allreduce_microstep: 107.17 | step_microstep: 140.29
[2025-08-03 01:53:00,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.72 | bwd: 7150.81 | bwd_inner: 6550.21 | bwd_allreduce: 600.36 | step: 140.79
{'loss': 1.1208, 'learning_rate': 1.4333333333333334e-05, 'epoch': 0.02}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13264
total_samples=710, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:03,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.00 | bwd_microstep: 1739.03 | bwd_inner_microstep: 1664.67 | bwd_allreduce_microstep: 74.29 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14139
total_samples=714, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:05,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.68 | bwd_microstep: 1724.94 | bwd_inner_microstep: 1696.06 | bwd_allreduce_microstep: 28.83 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11783
total_samples=717, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:08,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.92 | bwd_microstep: 2217.84 | bwd_inner_microstep: 1837.15 | bwd_allreduce_microstep: 380.62 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14007
total_samples=721, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:12,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 01:53:12,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.95 | bwd_microstep: 2334.67 | bwd_inner_microstep: 2193.79 | bwd_allreduce_microstep: 140.81 | step_microstep: 126.94
[2025-08-03 01:53:12,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.49 | bwd: 8016.52 | bwd_inner: 7391.67 | bwd_allreduce: 624.62 | step: 127.26
{'loss': 1.0997, 'learning_rate': 1.4666666666666666e-05, 'epoch': 0.02}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15159
total_samples=725, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:15,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.40 | bwd_microstep: 2109.52 | bwd_inner_microstep: 1905.08 | bwd_allreduce_microstep: 204.38 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12134
total_samples=728, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:17,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.73 | bwd_microstep: 1955.93 | bwd_inner_microstep: 1809.92 | bwd_allreduce_microstep: 145.96 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11821
total_samples=731, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:20,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.34 | bwd_microstep: 1963.01 | bwd_inner_microstep: 1617.48 | bwd_allreduce_microstep: 345.48 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13544
total_samples=735, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:23,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.75
[2025-08-03 01:53:23,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.32 | bwd_microstep: 1962.25 | bwd_inner_microstep: 1907.07 | bwd_allreduce_microstep: 55.12 | step_microstep: 112.50
[2025-08-03 01:53:23,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2754.72 | bwd: 7990.75 | bwd_inner: 7239.53 | bwd_allreduce: 751.00 | step: 112.93
{'loss': 1.1095, 'learning_rate': 1.5000000000000002e-05, 'epoch': 0.02}
  2%|▏         | 40/2000 [09:44<6:04:51, 11.17s/it]
  2%|▏         | 41/2000 [09:54<5:56:32, 10.92s/it]
  2%|▏         | 42/2000 [10:05<5:51:31, 10.77s/it]
  2%|▏         | 43/2000 [10:15<5:48:10, 10.67s/it]
  2%|▏         | 44/2000 [10:26<5:53:39, 10.85s/it]
  2%|▏         | 44/2000 [10:27<5:53:39, 10.85s/it]
  2%|▏         | 45/2000 [10:38<5:56:45, 10.95s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13019
total_samples=739, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:25,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.17 | bwd_microstep: 1767.80 | bwd_inner_microstep: 1676.33 | bwd_allreduce_microstep: 91.40 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11650
total_samples=742, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:28,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.60 | bwd_microstep: 2048.86 | bwd_inner_microstep: 1572.67 | bwd_allreduce_microstep: 476.09 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13368
total_samples=746, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:31,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.50 | bwd_microstep: 2166.05 | bwd_inner_microstep: 1689.76 | bwd_allreduce_microstep: 476.21 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13124
total_samples=750, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:34,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 01:53:34,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.07 | bwd_microstep: 1821.59 | bwd_inner_microstep: 1690.62 | bwd_allreduce_microstep: 130.91 | step_microstep: 117.80
[2025-08-03 01:53:34,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.29 | bwd: 7804.34 | bwd_inner: 6629.40 | bwd_allreduce: 1174.68 | step: 118.18
{'loss': 1.0887, 'learning_rate': 1.5333333333333334e-05, 'epoch': 0.02}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13651
total_samples=754, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:36,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.11 | bwd_microstep: 1726.98 | bwd_inner_microstep: 1686.38 | bwd_allreduce_microstep: 40.54 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11637
total_samples=757, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:39,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.00 | bwd_microstep: 1783.10 | bwd_inner_microstep: 1560.16 | bwd_allreduce_microstep: 222.87 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11879
total_samples=760, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:41,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.14 | bwd_microstep: 1704.50 | bwd_inner_microstep: 1526.28 | bwd_allreduce_microstep: 178.15 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13499
total_samples=764, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:44,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 01:53:44,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.27 | bwd_microstep: 1757.32 | bwd_inner_microstep: 1700.30 | bwd_allreduce_microstep: 56.95 | step_microstep: 139.07
[2025-08-03 01:53:44,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2710.45 | bwd: 6971.95 | bwd_inner: 6473.12 | bwd_allreduce: 498.59 | step: 139.52
{'loss': 1.0811, 'learning_rate': 1.5666666666666667e-05, 'epoch': 0.02}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14131
total_samples=768, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:46,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.04 | bwd_microstep: 1703.56 | bwd_inner_microstep: 1687.21 | bwd_allreduce_microstep: 16.28 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14265
total_samples=772, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:49,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.88 | bwd_microstep: 1728.35 | bwd_inner_microstep: 1698.73 | bwd_allreduce_microstep: 29.55 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14415
total_samples=776, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:52,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.87 | bwd_microstep: 1819.01 | bwd_inner_microstep: 1739.30 | bwd_allreduce_microstep: 79.64 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13766
total_samples=780, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:54,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 01:53:54,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.95 | bwd_microstep: 1813.21 | bwd_inner_microstep: 1726.49 | bwd_allreduce_microstep: 86.65 | step_microstep: 137.72
[2025-08-03 01:53:54,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2757.68 | bwd: 7064.17 | bwd_inner: 6851.73 | bwd_allreduce: 212.21 | step: 138.06
{'loss': 1.0936, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.02}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13373
total_samples=784, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:57,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.00 | bwd_microstep: 1693.41 | bwd_inner_microstep: 1634.24 | bwd_allreduce_microstep: 59.11 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13326
total_samples=788, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:53:59,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.15 | bwd_microstep: 1747.60 | bwd_inner_microstep: 1673.54 | bwd_allreduce_microstep: 74.00 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12982
total_samples=792, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:02,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.92 | bwd_microstep: 1778.84 | bwd_inner_microstep: 1643.15 | bwd_allreduce_microstep: 135.63 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13616
total_samples=796, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:05,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 01:54:05,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.90 | bwd_microstep: 1735.47 | bwd_inner_microstep: 1663.93 | bwd_allreduce_microstep: 71.48 | step_microstep: 118.42
[2025-08-03 01:54:05,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.90 | bwd: 6955.37 | bwd_inner: 6614.86 | bwd_allreduce: 340.29 | step: 118.86
{'loss': 1.0887, 'learning_rate': 1.6333333333333335e-05, 'epoch': 0.02}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11679
total_samples=799, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:07,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.31 | bwd_microstep: 1807.25 | bwd_inner_microstep: 1560.74 | bwd_allreduce_microstep: 246.45 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12025
total_samples=802, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:10,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.90 | bwd_microstep: 1777.37 | bwd_inner_microstep: 1555.81 | bwd_allreduce_microstep: 221.50 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11608
total_samples=805, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:12,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.40 | bwd_microstep: 1817.86 | bwd_inner_microstep: 1584.16 | bwd_allreduce_microstep: 233.64 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12945
total_samples=809, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:15,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 01:54:15,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.54 | bwd_microstep: 1783.58 | bwd_inner_microstep: 1668.29 | bwd_allreduce_microstep: 115.22 | step_microstep: 484.05
[2025-08-03 01:54:15,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.09 | bwd: 7186.10 | bwd_inner: 6369.00 | bwd_allreduce: 816.88 | step: 484.39
{'loss': 1.0827, 'learning_rate': 1.6666666666666667e-05, 'epoch': 0.03}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14682
total_samples=814, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:18,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.77 | bwd_microstep: 1955.35 | bwd_inner_microstep: 1719.82 | bwd_allreduce_microstep: 235.46 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12201
total_samples=817, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:21,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.33 | bwd_microstep: 1941.11 | bwd_inner_microstep: 1731.26 | bwd_allreduce_microstep: 209.79 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11702
total_samples=820, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:24,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.50 | bwd_microstep: 2186.86 | bwd_inner_microstep: 1963.45 | bwd_allreduce_microstep: 223.35 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12794
total_samples=824, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:26,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 01:54:26,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.75 | bwd_microstep: 1856.36 | bwd_inner_microstep: 1663.23 | bwd_allreduce_microstep: 193.07 | step_microstep: 134.22
[2025-08-03 01:54:26,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.28 | bwd: 7939.74 | bwd_inner: 7077.75 | bwd_allreduce: 861.75 | step: 134.67
{'loss': 1.0819, 'learning_rate': 1.7e-05, 'epoch': 0.03}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13844
total_samples=828, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:29,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.59 | bwd_microstep: 1888.22 | bwd_inner_microstep: 1736.05 | bwd_allreduce_microstep: 152.11 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11630
total_samples=831, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:32,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.40 | bwd_microstep: 1790.23 | bwd_inner_microstep: 1537.51 | bwd_allreduce_microstep: 252.65 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11768
total_samples=834, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:35,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.29 | bwd_microstep: 2138.02 | bwd_inner_microstep: 1926.65 | bwd_allreduce_microstep: 211.31 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14333
total_samples=838, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:37,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 01:54:37,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.56 | bwd_microstep: 1820.36 | bwd_inner_microstep: 1746.30 | bwd_allreduce_microstep: 73.99 | step_microstep: 134.61
[2025-08-03 01:54:37,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.77 | bwd: 7636.88 | bwd_inner: 6946.51 | bwd_allreduce: 690.14 | step: 135.06
  2%|▏         | 46/2000 [10:49<5:57:16, 10.97s/it]
  2%|▏         | 47/2000 [10:59<5:49:07, 10.73s/it]
  2%|▏         | 48/2000 [11:09<5:44:56, 10.60s/it]
  2%|▏         | 49/2000 [11:19<5:41:02, 10.49s/it]
  2%|▎         | 50/2000 [11:30<5:43:52, 10.58s/it]
  3%|▎         | 51/2000 [11:41<5:49:30, 10.76s/it]
{'loss': 1.1096, 'learning_rate': 1.7333333333333336e-05, 'epoch': 0.03}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13234
total_samples=842, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:40,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.01 | bwd_microstep: 2178.93 | bwd_inner_microstep: 2172.58 | bwd_allreduce_microstep: 6.29 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13152
total_samples=846, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:43,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.03 | bwd_microstep: 2229.34 | bwd_inner_microstep: 2049.98 | bwd_allreduce_microstep: 179.28 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11810
total_samples=849, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:46,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.04 | bwd_microstep: 1723.28 | bwd_inner_microstep: 1533.74 | bwd_allreduce_microstep: 189.47 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12699
total_samples=853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:48,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 01:54:48,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.73 | bwd_microstep: 1728.92 | bwd_inner_microstep: 1623.16 | bwd_allreduce_microstep: 105.69 | step_microstep: 113.24
[2025-08-03 01:54:48,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.74 | bwd: 7860.52 | bwd_inner: 7379.45 | bwd_allreduce: 480.83 | step: 113.56
{'loss': 1.0702, 'learning_rate': 1.7666666666666668e-05, 'epoch': 0.03}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13521
total_samples=858, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:51,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.80 | bwd_microstep: 1750.67 | bwd_inner_microstep: 1686.21 | bwd_allreduce_microstep: 64.39 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11784
total_samples=861, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:54,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.99 | bwd_microstep: 1789.98 | bwd_inner_microstep: 1553.07 | bwd_allreduce_microstep: 236.85 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12046
total_samples=864, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:54:56,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.67 | bwd_microstep: 1779.84 | bwd_inner_microstep: 1562.34 | bwd_allreduce_microstep: 217.44 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12871
total_samples=868, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:00,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 01:55:00,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.09 | bwd_microstep: 1874.11 | bwd_inner_microstep: 1653.18 | bwd_allreduce_microstep: 220.87 | step_microstep: 114.33
[2025-08-03 01:55:00,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4225.49 | bwd: 7194.65 | bwd_inner: 6454.80 | bwd_allreduce: 739.62 | step: 114.67
{'loss': 1.0861, 'learning_rate': 1.8e-05, 'epoch': 0.03}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13478
total_samples=872, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:03,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.83 | bwd_microstep: 1818.23 | bwd_inner_microstep: 1708.28 | bwd_allreduce_microstep: 109.88 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11683
total_samples=875, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:05,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.80 | bwd_microstep: 1759.71 | bwd_inner_microstep: 1538.21 | bwd_allreduce_microstep: 221.44 | step_microstep: 0.09
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13151
total_samples=879, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:08,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.60 | bwd_microstep: 1709.00 | bwd_inner_microstep: 1613.28 | bwd_allreduce_microstep: 95.66 | step_microstep: 0.21
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12824
total_samples=883, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:11,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 01:55:11,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.67 | bwd_microstep: 1820.72 | bwd_inner_microstep: 1607.54 | bwd_allreduce_microstep: 213.12 | step_microstep: 109.12
[2025-08-03 01:55:11,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2770.83 | bwd: 7107.71 | bwd_inner: 6467.30 | bwd_allreduce: 640.17 | step: 109.54
{'loss': 1.0817, 'learning_rate': 1.8333333333333333e-05, 'epoch': 0.03}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14502
total_samples=887, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:13,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.11 | bwd_microstep: 1717.96 | bwd_inner_microstep: 1685.25 | bwd_allreduce_microstep: 32.64 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12002
total_samples=890, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:16,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.37 | bwd_microstep: 1717.91 | bwd_inner_microstep: 1544.59 | bwd_allreduce_microstep: 173.26 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14679
total_samples=894, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:18,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.52 | bwd_microstep: 1821.83 | bwd_inner_microstep: 1769.39 | bwd_allreduce_microstep: 52.38 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12411
total_samples=897, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:21,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 01:55:21,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.84 | bwd_microstep: 1813.64 | bwd_inner_microstep: 1576.21 | bwd_allreduce_microstep: 237.36 | step_microstep: 139.92
[2025-08-03 01:55:21,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.78 | bwd: 7071.39 | bwd_inner: 6575.44 | bwd_allreduce: 495.72 | step: 140.27
{'loss': 1.0789, 'learning_rate': 1.866666666666667e-05, 'epoch': 0.03}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12605
total_samples=900, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:24,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.96 | bwd_microstep: 1870.02 | bwd_inner_microstep: 1612.04 | bwd_allreduce_microstep: 257.92 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11436
total_samples=903, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:26,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.87 | bwd_microstep: 1712.06 | bwd_inner_microstep: 1528.07 | bwd_allreduce_microstep: 183.93 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11789
total_samples=906, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:29,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.06 | bwd_microstep: 1786.49 | bwd_inner_microstep: 1542.21 | bwd_allreduce_microstep: 244.22 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12521
total_samples=909, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:31,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 01:55:31,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.58 | bwd_microstep: 1767.24 | bwd_inner_microstep: 1571.80 | bwd_allreduce_microstep: 195.36 | step_microstep: 137.87
[2025-08-03 01:55:31,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.41 | bwd: 7135.86 | bwd_inner: 6254.12 | bwd_allreduce: 881.51 | step: 138.28
{'loss': 1.0609, 'learning_rate': 1.9e-05, 'epoch': 0.03}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11747
total_samples=912, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:34,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.99 | bwd_microstep: 1847.78 | bwd_inner_microstep: 1683.21 | bwd_allreduce_microstep: 164.51 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13083
total_samples=915, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:37,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.52 | bwd_microstep: 2063.94 | bwd_inner_microstep: 1838.82 | bwd_allreduce_microstep: 225.06 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11955
total_samples=918, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:40,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.46 | bwd_microstep: 2547.68 | bwd_inner_microstep: 2330.24 | bwd_allreduce_microstep: 217.38 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13399
total_samples=922, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:43,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 01:55:43,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.81 | bwd_microstep: 2090.33 | bwd_inner_microstep: 1911.53 | bwd_allreduce_microstep: 178.75 | step_microstep: 113.78
[2025-08-03 01:55:43,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.70 | bwd: 8549.79 | bwd_inner: 7763.79 | bwd_allreduce: 785.77 | step: 114.13
  3%|▎         | 52/2000 [11:52<5:50:56, 10.81s/it]
  3%|▎         | 53/2000 [12:03<5:53:17, 10.89s/it]
  3%|▎         | 54/2000 [12:15<6:02:43, 11.18s/it]
  3%|▎         | 55/2000 [12:26<5:54:04, 10.92s/it]
  3%|▎         | 56/2000 [12:36<5:48:13, 10.75s/it]
  3%|▎         | 57/2000 [12:46<5:44:26, 10.64s/it]
  3%|▎         | 58/2000 [12:58<5:55:18, 10.98s/it]
{'loss': 1.0681, 'learning_rate': 1.9333333333333333e-05, 'epoch': 0.03}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14022
total_samples=926, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:46,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.22 | bwd_microstep: 1960.28 | bwd_inner_microstep: 1825.73 | bwd_allreduce_microstep: 134.49 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13592
total_samples=930, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:49,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.48 | bwd_microstep: 2179.00 | bwd_inner_microstep: 2072.40 | bwd_allreduce_microstep: 106.54 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11719
total_samples=933, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:51,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.06 | bwd_microstep: 1810.58 | bwd_inner_microstep: 1597.55 | bwd_allreduce_microstep: 212.95 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13266
total_samples=937, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:54,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 01:55:54,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.90 | bwd_microstep: 1820.18 | bwd_inner_microstep: 1702.89 | bwd_allreduce_microstep: 117.23 | step_microstep: 113.91
[2025-08-03 01:55:54,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.58 | bwd: 7770.08 | bwd_inner: 7198.57 | bwd_allreduce: 571.28 | step: 114.21
{'loss': 1.0723, 'learning_rate': 1.9666666666666666e-05, 'epoch': 0.03}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12367
total_samples=940, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:55:58,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1754.82 | bwd_microstep: 2047.58 | bwd_inner_microstep: 1833.76 | bwd_allreduce_microstep: 213.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14109
total_samples=944, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:01,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.31 | bwd_microstep: 2244.29 | bwd_inner_microstep: 1775.53 | bwd_allreduce_microstep: 468.70 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11728
total_samples=947, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:04,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.16 | bwd_microstep: 1820.80 | bwd_inner_microstep: 1703.29 | bwd_allreduce_microstep: 117.43 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13762
total_samples=951, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:06,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 01:56:06,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.61 | bwd_microstep: 1792.38 | bwd_inner_microstep: 1715.19 | bwd_allreduce_microstep: 77.13 | step_microstep: 153.12
[2025-08-03 01:56:06,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3870.85 | bwd: 7905.08 | bwd_inner: 7027.78 | bwd_allreduce: 877.06 | step: 153.56
{'loss': 1.0553, 'learning_rate': 2e-05, 'epoch': 0.03}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13671
total_samples=955, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:09,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.57 | bwd_microstep: 1842.20 | bwd_inner_microstep: 1737.83 | bwd_allreduce_microstep: 104.31 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14224
total_samples=959, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:12,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.25 | bwd_microstep: 1813.98 | bwd_inner_microstep: 1739.47 | bwd_allreduce_microstep: 74.45 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11736
total_samples=962, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:14,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.84 | bwd_microstep: 1792.66 | bwd_inner_microstep: 1557.90 | bwd_allreduce_microstep: 234.70 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13719
total_samples=966, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:17,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 01:56:17,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.47 | bwd_microstep: 1721.74 | bwd_inner_microstep: 1659.73 | bwd_allreduce_microstep: 61.95 | step_microstep: 117.23
[2025-08-03 01:56:17,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.07 | bwd: 7170.64 | bwd_inner: 6694.94 | bwd_allreduce: 475.48 | step: 117.64
{'loss': 1.0551, 'learning_rate': 1.9999986888082895e-05, 'epoch': 0.03}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15597
total_samples=971, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:19,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.06 | bwd_microstep: 1747.27 | bwd_inner_microstep: 1702.47 | bwd_allreduce_microstep: 44.73 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13694
total_samples=975, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:22,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.99 | bwd_microstep: 1722.83 | bwd_inner_microstep: 1672.28 | bwd_allreduce_microstep: 50.49 | step_microstep: 0.11
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 12328
total_samples=978, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:24,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.55 | bwd_microstep: 1806.23 | bwd_inner_microstep: 1581.99 | bwd_allreduce_microstep: 224.18 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12941
total_samples=982, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:27,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 01:56:27,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.92 | bwd_microstep: 1740.57 | bwd_inner_microstep: 1666.06 | bwd_allreduce_microstep: 74.44 | step_microstep: 127.67
[2025-08-03 01:56:27,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2756.45 | bwd: 7016.96 | bwd_inner: 6622.81 | bwd_allreduce: 393.91 | step: 128.03
{'loss': 1.0486, 'learning_rate': 1.999994755236596e-05, 'epoch': 0.03}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13206
total_samples=986, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:30,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.60 | bwd_microstep: 1766.94 | bwd_inner_microstep: 1683.65 | bwd_allreduce_microstep: 83.22 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13632
total_samples=990, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:32,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.92 | bwd_microstep: 1859.43 | bwd_inner_microstep: 1728.85 | bwd_allreduce_microstep: 130.52 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12311
total_samples=993, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:35,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.08 | bwd_microstep: 1771.66 | bwd_inner_microstep: 1573.73 | bwd_allreduce_microstep: 197.87 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11914
total_samples=996, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:38,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 01:56:38,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.68 | bwd_microstep: 1731.20 | bwd_inner_microstep: 1536.60 | bwd_allreduce_microstep: 194.53 | step_microstep: 140.23
[2025-08-03 01:56:38,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.21 | bwd: 7129.28 | bwd_inner: 6522.83 | bwd_allreduce: 606.22 | step: 140.66
{'loss': 1.0355, 'learning_rate': 1.9999881992952353e-05, 'epoch': 0.03}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13280
total_samples=1000, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:40,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.59 | bwd_microstep: 1837.97 | bwd_inner_microstep: 1687.10 | bwd_allreduce_microstep: 150.80 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14143
total_samples=1004, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:43,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.45 | bwd_microstep: 1761.03 | bwd_inner_microstep: 1714.29 | bwd_allreduce_microstep: 46.68 | step_microstep: 0.11
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11496
total_samples=1007, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:45,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.26 | bwd_microstep: 1924.51 | bwd_inner_microstep: 1550.52 | bwd_allreduce_microstep: 373.93 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13381
total_samples=1011, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:48,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 01:56:48,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.40 | bwd_microstep: 1761.48 | bwd_inner_microstep: 1677.42 | bwd_allreduce_microstep: 83.99 | step_microstep: 420.99
[2025-08-03 01:56:48,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.62 | bwd: 7285.04 | bwd_inner: 6629.33 | bwd_allreduce: 655.48 | step: 421.32
{'loss': 1.0396, 'learning_rate': 1.999979021001399e-05, 'epoch': 0.03}
  3%|▎         | 59/2000 [13:09<5:55:43, 11.00s/it]
  3%|▎         | 60/2000 [13:21<6:07:41, 11.37s/it]
  3%|▎         | 61/2000 [13:32<5:58:25, 11.09s/it]
  3%|▎         | 62/2000 [13:42<5:49:55, 10.83s/it]
  3%|▎         | 63/2000 [13:52<5:45:34, 10.70s/it]
  3%|▎         | 64/2000 [14:03<5:46:22, 10.73s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13009
total_samples=1014, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:51,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.92 | bwd_microstep: 1901.65 | bwd_inner_microstep: 1756.87 | bwd_allreduce_microstep: 144.71 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13379
total_samples=1018, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:54,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.06 | bwd_microstep: 2311.47 | bwd_inner_microstep: 2235.66 | bwd_allreduce_microstep: 75.75 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13259
total_samples=1022, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:56:57,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.41 | bwd_microstep: 2254.27 | bwd_inner_microstep: 1884.09 | bwd_allreduce_microstep: 370.13 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13895
total_samples=1027, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:00,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 01:57:00,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.13 | bwd_microstep: 2095.27 | bwd_inner_microstep: 1990.85 | bwd_allreduce_microstep: 104.35 | step_microstep: 389.17
[2025-08-03 01:57:00,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.44 | bwd: 8562.70 | bwd_inner: 7867.47 | bwd_allreduce: 695.01 | step: 389.49
{'loss': 1.0198, 'learning_rate': 1.9999672203791564e-05, 'epoch': 0.03}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11669
total_samples=1030, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:03,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.36 | bwd_microstep: 2193.07 | bwd_inner_microstep: 1905.72 | bwd_allreduce_microstep: 287.28 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14531
total_samples=1034, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:07,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 982.23 | bwd_microstep: 2178.75 | bwd_inner_microstep: 2063.73 | bwd_allreduce_microstep: 114.96 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11916
total_samples=1037, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:09,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.11 | bwd_microstep: 1818.91 | bwd_inner_microstep: 1589.84 | bwd_allreduce_microstep: 229.01 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13333
total_samples=1041, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:12,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 01:57:12,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.10 | bwd_microstep: 1906.15 | bwd_inner_microstep: 1715.46 | bwd_allreduce_microstep: 190.63 | step_microstep: 119.17
[2025-08-03 01:57:12,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3096.72 | bwd: 8096.92 | bwd_inner: 7274.74 | bwd_allreduce: 821.95 | step: 119.58
{'loss': 1.015, 'learning_rate': 1.999952797459453e-05, 'epoch': 0.03}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 11984
total_samples=1045, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:15,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.59 | bwd_microstep: 1750.67 | bwd_inner_microstep: 1556.84 | bwd_allreduce_microstep: 193.77 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13628
total_samples=1049, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:17,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.26 | bwd_microstep: 1972.70 | bwd_inner_microstep: 1862.33 | bwd_allreduce_microstep: 110.30 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13108
total_samples=1053, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:20,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.24 | bwd_microstep: 1786.83 | bwd_inner_microstep: 1694.93 | bwd_allreduce_microstep: 91.84 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12126
total_samples=1056, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:22,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 01:57:22,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.94 | bwd_microstep: 1762.38 | bwd_inner_microstep: 1574.82 | bwd_allreduce_microstep: 187.49 | step_microstep: 112.24
[2025-08-03 01:57:22,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.95 | bwd: 7272.63 | bwd_inner: 6688.92 | bwd_allreduce: 583.48 | step: 112.57
{'loss': 1.0298, 'learning_rate': 1.9999357522801125e-05, 'epoch': 0.03}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11854
total_samples=1059, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:25,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.25 | bwd_microstep: 1770.28 | bwd_inner_microstep: 1555.23 | bwd_allreduce_microstep: 214.98 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13420
total_samples=1063, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:28,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.10 | bwd_microstep: 1824.37 | bwd_inner_microstep: 1712.82 | bwd_allreduce_microstep: 111.48 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13442
total_samples=1067, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:30,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.06 | bwd_microstep: 1827.46 | bwd_inner_microstep: 1700.95 | bwd_allreduce_microstep: 126.43 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12628
total_samples=1070, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:33,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 01:57:33,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.98 | bwd_microstep: 1781.68 | bwd_inner_microstep: 1586.43 | bwd_allreduce_microstep: 195.19 | step_microstep: 158.17
[2025-08-03 01:57:33,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.33 | bwd: 7203.83 | bwd_inner: 6555.42 | bwd_allreduce: 648.16 | step: 158.69
{'loss': 1.0186, 'learning_rate': 1.999916084885832e-05, 'epoch': 0.03}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11909
total_samples=1073, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:36,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.43 | bwd_microstep: 1757.82 | bwd_inner_microstep: 1539.49 | bwd_allreduce_microstep: 218.26 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11922
total_samples=1076, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:38,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.81 | bwd_microstep: 1773.35 | bwd_inner_microstep: 1548.44 | bwd_allreduce_microstep: 224.84 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13926
total_samples=1080, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:41,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.93 | bwd_microstep: 1810.67 | bwd_inner_microstep: 1729.94 | bwd_allreduce_microstep: 80.67 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13707
total_samples=1084, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:43,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 01:57:43,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.79 | bwd_microstep: 1735.17 | bwd_inner_microstep: 1677.92 | bwd_allreduce_microstep: 57.18 | step_microstep: 142.82
[2025-08-03 01:57:43,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2745.89 | bwd: 7077.06 | bwd_inner: 6495.79 | bwd_allreduce: 581.02 | step: 143.15
{'loss': 1.0147, 'learning_rate': 1.999893795328188e-05, 'epoch': 0.03}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13463
total_samples=1088, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:46,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.50 | bwd_microstep: 1808.92 | bwd_inner_microstep: 1722.38 | bwd_allreduce_microstep: 86.47 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11905
total_samples=1091, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:48,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 666.44 | bwd_microstep: 1728.82 | bwd_inner_microstep: 1532.14 | bwd_allreduce_microstep: 196.62 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12364
total_samples=1095, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:51,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.43 | bwd_microstep: 1773.55 | bwd_inner_microstep: 1589.13 | bwd_allreduce_microstep: 184.36 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13389
total_samples=1099, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:54,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 01:57:54,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.46 | bwd_microstep: 1869.84 | bwd_inner_microstep: 1807.94 | bwd_allreduce_microstep: 61.83 | step_microstep: 123.73
[2025-08-03 01:57:54,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.75 | bwd: 7181.17 | bwd_inner: 6651.59 | bwd_allreduce: 529.36 | step: 124.15
{'loss': 1.0187, 'learning_rate': 1.9998688836656322e-05, 'epoch': 0.04}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13089
total_samples=1103, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:56,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.96 | bwd_microstep: 1812.58 | bwd_inner_microstep: 1648.93 | bwd_allreduce_microstep: 163.58 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13266
total_samples=1107, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:57:59,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.25 | bwd_microstep: 2061.76 | bwd_inner_microstep: 1921.38 | bwd_allreduce_microstep: 140.32 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11889
total_samples=1110, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:02,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.85 | bwd_microstep: 2111.00 | bwd_inner_microstep: 1937.81 | bwd_allreduce_microstep: 173.13 | step_microstep: 0.20
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14540
total_samples=1114, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:05,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 01:58:05,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.64 | bwd_microstep: 1972.66 | bwd_inner_microstep: 1822.51 | bwd_allreduce_microstep: 150.09 | step_microstep: 111.60
[2025-08-03 01:58:05,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.62 | bwd: 7958.04 | bwd_inner: 7330.62 | bwd_allreduce: 627.19 | step: 112.03
  3%|▎         | 65/2000 [14:15<5:59:02, 11.13s/it]
  3%|▎         | 66/2000 [14:27<6:03:31, 11.28s/it]
  3%|▎         | 67/2000 [14:37<5:55:31, 11.04s/it]
  3%|▎         | 68/2000 [14:48<5:50:21, 10.88s/it]
  3%|▎         | 69/2000 [14:58<5:44:36, 10.71s/it]
  4%|▎         | 70/2000 [15:09<5:41:37, 10.62s/it]
  4%|▎         | 71/2000 [15:20<5:47:15, 10.80s/it]
{'loss': 0.9964, 'learning_rate': 1.9998413499634927e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13170
total_samples=1118, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:09,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1969.61 | bwd_microstep: 2041.37 | bwd_inner_microstep: 1892.65 | bwd_allreduce_microstep: 148.66 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14085
total_samples=1122, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:12,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.88 | bwd_microstep: 2029.37 | bwd_inner_microstep: 1757.19 | bwd_allreduce_microstep: 272.11 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11615
total_samples=1125, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:15,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.79 | bwd_microstep: 2311.26 | bwd_inner_microstep: 2098.91 | bwd_allreduce_microstep: 212.29 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11746
total_samples=1128, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:18,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 01:58:18,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.77 | bwd_microstep: 2329.24 | bwd_inner_microstep: 2031.91 | bwd_allreduce_microstep: 297.27 | step_microstep: 448.92
[2025-08-03 01:58:18,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4050.99 | bwd: 8711.28 | bwd_inner: 7780.66 | bwd_allreduce: 930.39 | step: 449.25
{'loss': 0.9874, 'learning_rate': 1.9998111942939727e-05, 'epoch': 0.04}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12423
total_samples=1132, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:21,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.18 | bwd_microstep: 2043.00 | bwd_inner_microstep: 1916.80 | bwd_allreduce_microstep: 126.13 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13081
total_samples=1136, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:24,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.14 | bwd_microstep: 1774.26 | bwd_inner_microstep: 1683.94 | bwd_allreduce_microstep: 90.25 | step_microstep: 0.12
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11276
total_samples=1139, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:27,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.25 | bwd_microstep: 2288.67 | bwd_inner_microstep: 2024.56 | bwd_allreduce_microstep: 264.03 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=1142, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:31,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 01:58:31,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1579.63 | bwd_microstep: 2453.34 | bwd_inner_microstep: 2217.06 | bwd_allreduce_microstep: 236.23 | step_microstep: 108.00
[2025-08-03 01:58:31,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3713.13 | bwd: 8559.31 | bwd_inner: 7842.36 | bwd_allreduce: 716.71 | step: 108.35
{'loss': 0.9834, 'learning_rate': 1.9997784167361526e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14034
total_samples=1146, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:34,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.03 | bwd_microstep: 1778.39 | bwd_inner_microstep: 1704.90 | bwd_allreduce_microstep: 73.43 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11799
total_samples=1149, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:36,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.62 | bwd_microstep: 1800.91 | bwd_inner_microstep: 1594.58 | bwd_allreduce_microstep: 206.28 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13497
total_samples=1153, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:39,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.64 | bwd_microstep: 1806.22 | bwd_inner_microstep: 1711.97 | bwd_allreduce_microstep: 94.19 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11767
total_samples=1156, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:42,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 01:58:42,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.43 | bwd_microstep: 1809.90 | bwd_inner_microstep: 1565.84 | bwd_allreduce_microstep: 244.00 | step_microstep: 128.57
[2025-08-03 01:58:42,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.65 | bwd: 7195.47 | bwd_inner: 6577.28 | bwd_allreduce: 617.96 | step: 128.91
{'loss': 0.9952, 'learning_rate': 1.9997430173759876e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14023
total_samples=1160, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:44,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.67 | bwd_microstep: 1786.24 | bwd_inner_microstep: 1718.66 | bwd_allreduce_microstep: 67.52 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11779
total_samples=1163, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:49,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1863.17 | bwd_microstep: 2905.57 | bwd_inner_microstep: 2234.74 | bwd_allreduce_microstep: 670.73 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13075
total_samples=1167, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:52,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 951.03 | bwd_microstep: 1816.60 | bwd_inner_microstep: 1694.22 | bwd_allreduce_microstep: 122.31 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12015
total_samples=1170, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:55,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 01:58:55,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.58 | bwd_microstep: 1749.56 | bwd_inner_microstep: 1548.54 | bwd_allreduce_microstep: 200.96 | step_microstep: 122.23
[2025-08-03 01:58:55,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4187.38 | bwd: 8258.01 | bwd_inner: 7196.17 | bwd_allreduce: 1061.58 | step: 122.55
{'loss': 0.9663, 'learning_rate': 1.999704996306308e-05, 'epoch': 0.04}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12205
total_samples=1173, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:58:57,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.53 | bwd_microstep: 1794.42 | bwd_inner_microstep: 1561.31 | bwd_allreduce_microstep: 233.04 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13857
total_samples=1177, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:00,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.66 | bwd_microstep: 1730.34 | bwd_inner_microstep: 1688.50 | bwd_allreduce_microstep: 41.78 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13439
total_samples=1181, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:02,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.17 | bwd_microstep: 1768.39 | bwd_inner_microstep: 1692.37 | bwd_allreduce_microstep: 75.95 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12708
total_samples=1185, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:05,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 01:59:05,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.00 | bwd_microstep: 1819.59 | bwd_inner_microstep: 1622.20 | bwd_allreduce_microstep: 197.33 | step_microstep: 134.12
[2025-08-03 01:59:05,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.30 | bwd: 7112.80 | bwd_inner: 6564.38 | bwd_allreduce: 548.18 | step: 134.48
{'loss': 0.9595, 'learning_rate': 1.9996643536268202e-05, 'epoch': 0.04}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12106
total_samples=1189, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:07,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.50 | bwd_microstep: 1733.33 | bwd_inner_microstep: 1537.11 | bwd_allreduce_microstep: 196.16 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11825
total_samples=1192, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:10,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.89 | bwd_microstep: 1800.76 | bwd_inner_microstep: 1545.16 | bwd_allreduce_microstep: 255.54 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15467
total_samples=1198, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:13,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.49 | bwd_microstep: 1752.07 | bwd_inner_microstep: 1734.71 | bwd_allreduce_microstep: 17.30 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938
total_samples=1201, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:15,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 01:59:15,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.83 | bwd_microstep: 1815.43 | bwd_inner_microstep: 1584.48 | bwd_allreduce_microstep: 230.87 | step_microstep: 112.97
[2025-08-03 01:59:15,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.64 | bwd: 7101.63 | bwd_inner: 6401.47 | bwd_allreduce: 699.93 | step: 113.29
{'loss': 0.9614, 'learning_rate': 1.9996210894441047e-05, 'epoch': 0.04}
  4%|▎         | 72/2000 [15:33<6:13:21, 11.62s/it]
  4%|▎         | 73/2000 [15:46<6:23:41, 11.95s/it]
  4%|▎         | 74/2000 [15:56<6:09:15, 11.50s/it]
  4%|▍         | 75/2000 [16:09<6:22:43, 11.93s/it]
  4%|▍         | 76/2000 [16:20<6:07:40, 11.47s/it]
  4%|▍         | 77/2000 [16:30<5:56:49, 11.13s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12274
total_samples=1204, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:18,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.10 | bwd_microstep: 1822.81 | bwd_inner_microstep: 1585.45 | bwd_allreduce_microstep: 237.29 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12493
total_samples=1208, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:21,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.61 | bwd_microstep: 2043.22 | bwd_inner_microstep: 1829.65 | bwd_allreduce_microstep: 213.52 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14937
total_samples=1212, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:24,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.81 | bwd_microstep: 2096.40 | bwd_inner_microstep: 2055.18 | bwd_allreduce_microstep: 41.17 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11785
total_samples=1215, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:27,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 01:59:27,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.47 | bwd_microstep: 2026.02 | bwd_inner_microstep: 1799.96 | bwd_allreduce_microstep: 225.98 | step_microstep: 126.13
[2025-08-03 01:59:27,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.91 | bwd: 7988.49 | bwd_inner: 7270.24 | bwd_allreduce: 718.02 | step: 126.46
{'loss': 0.9548, 'learning_rate': 1.9995752038716166e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14703
total_samples=1219, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:29,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.45 | bwd_microstep: 2056.47 | bwd_inner_microstep: 1945.49 | bwd_allreduce_microstep: 110.92 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12969
total_samples=1223, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:32,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.97 | bwd_microstep: 1902.09 | bwd_inner_microstep: 1807.85 | bwd_allreduce_microstep: 94.17 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13752
total_samples=1228, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:35,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.29 | bwd_microstep: 2035.66 | bwd_inner_microstep: 1912.88 | bwd_allreduce_microstep: 122.71 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13232
total_samples=1232, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:38,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 01:59:38,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.88 | bwd_microstep: 1993.75 | bwd_inner_microstep: 1696.84 | bwd_allreduce_microstep: 296.85 | step_microstep: 113.79
[2025-08-03 01:59:38,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.51 | bwd: 7988.01 | bwd_inner: 7363.06 | bwd_allreduce: 624.72 | step: 114.22
{'loss': 0.9534, 'learning_rate': 1.9995266970296856e-05, 'epoch': 0.04}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12649
total_samples=1236, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:41,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.11 | bwd_microstep: 2171.38 | bwd_inner_microstep: 2071.51 | bwd_allreduce_microstep: 99.80 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13784
total_samples=1240, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:44,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.91 | bwd_microstep: 2056.48 | bwd_inner_microstep: 1919.60 | bwd_allreduce_microstep: 136.83 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13297
total_samples=1244, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:46,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.02 | bwd_microstep: 1786.26 | bwd_inner_microstep: 1680.57 | bwd_allreduce_microstep: 105.63 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14743
total_samples=1248, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:49,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 01:59:49,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.98 | bwd_microstep: 2155.28 | bwd_inner_microstep: 1937.81 | bwd_allreduce_microstep: 217.40 | step_microstep: 137.03
[2025-08-03 01:59:49,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.96 | bwd: 8169.45 | bwd_inner: 7609.47 | bwd_allreduce: 559.74 | step: 137.39
{'loss': 0.9403, 'learning_rate': 1.9994755690455154e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13097
total_samples=1252, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:52,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.22 | bwd_microstep: 2050.08 | bwd_inner_microstep: 1922.07 | bwd_allreduce_microstep: 127.94 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13799
total_samples=1257, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:55,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.97 | bwd_microstep: 1775.72 | bwd_inner_microstep: 1695.46 | bwd_allreduce_microstep: 80.20 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12944
total_samples=1261, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 01:59:57,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.83 | bwd_microstep: 1789.04 | bwd_inner_microstep: 1656.46 | bwd_allreduce_microstep: 132.51 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13115
total_samples=1265, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:00,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 02:00:00,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.89 | bwd_microstep: 1817.86 | bwd_inner_microstep: 1672.55 | bwd_allreduce_microstep: 145.25 | step_microstep: 152.63
[2025-08-03 02:00:00,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.84 | bwd: 7432.75 | bwd_inner: 6946.54 | bwd_allreduce: 485.97 | step: 152.97
{'loss': 0.9504, 'learning_rate': 1.9994218200531823e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13474
total_samples=1269, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:03,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.25 | bwd_microstep: 1843.23 | bwd_inner_microstep: 1719.75 | bwd_allreduce_microstep: 123.41 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11694
total_samples=1272, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:05,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.69 | bwd_microstep: 1798.00 | bwd_inner_microstep: 1558.80 | bwd_allreduce_microstep: 239.14 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13529
total_samples=1276, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:08,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.27 | bwd_microstep: 1773.23 | bwd_inner_microstep: 1705.30 | bwd_allreduce_microstep: 67.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14470
total_samples=1280, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:11,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 02:00:11,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.71 | bwd_microstep: 1778.45 | bwd_inner_microstep: 1747.60 | bwd_allreduce_microstep: 30.79 | step_microstep: 138.93
[2025-08-03 02:00:11,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.85 | bwd: 7192.95 | bwd_inner: 6731.44 | bwd_allreduce: 461.27 | step: 139.36
{'loss': 0.9442, 'learning_rate': 1.999365450193638e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13799
total_samples=1284, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:13,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.49 | bwd_microstep: 1783.82 | bwd_inner_microstep: 1707.07 | bwd_allreduce_microstep: 76.68 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11587
total_samples=1287, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:16,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.96 | bwd_microstep: 1812.58 | bwd_inner_microstep: 1575.90 | bwd_allreduce_microstep: 236.61 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13763
total_samples=1291, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:18,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.97 | bwd_microstep: 1943.33 | bwd_inner_microstep: 1860.01 | bwd_allreduce_microstep: 83.27 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12218
total_samples=1294, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:21,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 02:00:21,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.87 | bwd_microstep: 1811.94 | bwd_inner_microstep: 1566.59 | bwd_allreduce_microstep: 245.29 | step_microstep: 113.60
[2025-08-03 02:00:21,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.22 | bwd: 7351.73 | bwd_inner: 6709.56 | bwd_allreduce: 641.94 | step: 113.95
{'loss': 0.932, 'learning_rate': 1.999306459614705e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13443
total_samples=1298, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:24,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.44 | bwd_microstep: 1829.82 | bwd_inner_microstep: 1705.91 | bwd_allreduce_microstep: 123.84 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13753
total_samples=1302, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:26,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.46 | bwd_microstep: 1695.82 | bwd_inner_microstep: 1659.16 | bwd_allreduce_microstep: 36.60 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13207
total_samples=1306, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:29,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.60 | bwd_microstep: 1797.26 | bwd_inner_microstep: 1702.80 | bwd_allreduce_microstep: 94.39 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13750
total_samples=1310, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:31,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 02:00:31,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 664.90 | bwd_microstep: 1718.16 | bwd_inner_microstep: 1676.95 | bwd_allreduce_microstep: 41.14 | step_microstep: 136.23
[2025-08-03 02:00:31,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2754.34 | bwd: 7041.11 | bwd_inner: 6744.82 | bwd_allreduce: 296.05 | step: 136.69
  4%|▍         | 77/2000 [16:30<5:56:49, 11.13s/it]
  4%|▍         | 78/2000 [16:41<5:58:15, 11.18s/it]
  4%|▍         | 78/2000 [16:42<5:58:15, 11.18s/it]
  4%|▍         | 79/2000 [16:53<5:58:37, 11.20s/it]
  4%|▍         | 80/2000 [17:04<6:01:08, 11.29s/it]
  4%|▍         | 81/2000 [17:15<5:55:42, 11.12s/it]
  4%|▍         | 82/2000 [17:25<5:49:37, 10.94s/it]
  4%|▍         | 83/2000 [17:36<5:46:18, 10.84s/it]
{'loss': 0.9248, 'learning_rate': 1.99924484847108e-05, 'epoch': 0.04}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13437
total_samples=1314, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:35,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.30 | bwd_microstep: 2208.84 | bwd_inner_microstep: 2023.46 | bwd_allreduce_microstep: 185.32 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12750
total_samples=1318, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:38,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.51 | bwd_microstep: 2297.90 | bwd_inner_microstep: 2208.07 | bwd_allreduce_microstep: 89.77 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13715
total_samples=1322, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:40,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.06 | bwd_microstep: 1846.42 | bwd_inner_microstep: 1727.75 | bwd_allreduce_microstep: 118.61 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13373
total_samples=1326, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:43,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 02:00:43,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.00 | bwd_microstep: 1760.28 | bwd_inner_microstep: 1689.22 | bwd_allreduce_microstep: 71.00 | step_microstep: 131.00
[2025-08-03 02:00:43,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.80 | bwd: 8113.48 | bwd_inner: 7648.50 | bwd_allreduce: 464.76 | step: 131.30
{'loss': 0.9391, 'learning_rate': 1.9991806169243302e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13910
total_samples=1330, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:46,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.08 | bwd_microstep: 2085.90 | bwd_inner_microstep: 1899.13 | bwd_allreduce_microstep: 186.71 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11720
total_samples=1333, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:49,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.24 | bwd_microstep: 2379.26 | bwd_inner_microstep: 2223.01 | bwd_allreduce_microstep: 156.20 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15460
total_samples=1338, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:52,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.68 | bwd_microstep: 2064.06 | bwd_inner_microstep: 1812.34 | bwd_allreduce_microstep: 251.66 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13162
total_samples=1343, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:55,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 02:00:55,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.15 | bwd_microstep: 2201.66 | bwd_inner_microstep: 2113.63 | bwd_allreduce_microstep: 87.98 | step_microstep: 123.07
[2025-08-03 02:00:55,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.07 | bwd: 8730.93 | bwd_inner: 8048.10 | bwd_allreduce: 682.61 | step: 123.40
{'loss': 0.9373, 'learning_rate': 1.9991137651428957e-05, 'epoch': 0.04}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12905
total_samples=1347, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:00:58,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.79 | bwd_microstep: 1909.84 | bwd_inner_microstep: 1801.93 | bwd_allreduce_microstep: 107.84 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11507
total_samples=1350, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:00,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.76 | bwd_microstep: 1870.93 | bwd_inner_microstep: 1572.55 | bwd_allreduce_microstep: 298.31 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13233
total_samples=1354, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:03,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.40 | bwd_microstep: 1847.85 | bwd_inner_microstep: 1716.27 | bwd_allreduce_microstep: 131.52 | step_microstep: 0.09
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12848
total_samples=1358, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:05,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 02:01:05,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.66 | bwd_microstep: 1741.24 | bwd_inner_microstep: 1624.52 | bwd_allreduce_microstep: 116.66 | step_microstep: 153.42
[2025-08-03 02:01:05,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.54 | bwd: 7369.90 | bwd_inner: 6715.27 | bwd_allreduce: 654.40 | step: 153.75
{'loss': 0.933, 'learning_rate': 1.999044293302088e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14543
total_samples=1363, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:09,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.90 | bwd_microstep: 2272.46 | bwd_inner_microstep: 2266.51 | bwd_allreduce_microstep: 5.88 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13641
total_samples=1367, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:11,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.24 | bwd_microstep: 1840.86 | bwd_inner_microstep: 1724.78 | bwd_allreduce_microstep: 116.02 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13863
total_samples=1371, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:14,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.76 | bwd_microstep: 1752.72 | bwd_inner_microstep: 1689.96 | bwd_allreduce_microstep: 62.71 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13439
total_samples=1375, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:16,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 02:01:16,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.24 | bwd_microstep: 1794.13 | bwd_inner_microstep: 1692.88 | bwd_allreduce_microstep: 101.19 | step_microstep: 130.97
[2025-08-03 02:01:16,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.07 | bwd: 7660.21 | bwd_inner: 7374.13 | bwd_allreduce: 285.86 | step: 131.27
{'loss': 0.9266, 'learning_rate': 1.998972201584088e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14035
total_samples=1379, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:19,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.72 | bwd_microstep: 1778.59 | bwd_inner_microstep: 1723.56 | bwd_allreduce_microstep: 54.97 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12972
total_samples=1383, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:21,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.28 | bwd_microstep: 1683.64 | bwd_inner_microstep: 1620.83 | bwd_allreduce_microstep: 62.73 | step_microstep: 0.17
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14106
total_samples=1387, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:24,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.55 | bwd_microstep: 1742.25 | bwd_inner_microstep: 1695.36 | bwd_allreduce_microstep: 46.83 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11720
total_samples=1390, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:27,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 02:01:27,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.98 | bwd_microstep: 1741.06 | bwd_inner_microstep: 1525.32 | bwd_allreduce_microstep: 215.68 | step_microstep: 131.58
[2025-08-03 02:01:27,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2754.45 | bwd: 6945.59 | bwd_inner: 6565.06 | bwd_allreduce: 380.28 | step: 132.08
{'loss': 0.9153, 'learning_rate': 1.9988974901779482e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15419
total_samples=1395, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:29,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.98 | bwd_microstep: 1889.59 | bwd_inner_microstep: 1866.71 | bwd_allreduce_microstep: 22.82 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13349
total_samples=1399, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:32,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.55 | bwd_microstep: 1761.26 | bwd_inner_microstep: 1665.71 | bwd_allreduce_microstep: 95.49 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13333
total_samples=1403, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:34,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.58 | bwd_microstep: 1732.19 | bwd_inner_microstep: 1671.04 | bwd_allreduce_microstep: 61.10 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13258
total_samples=1407, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:37,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 02:01:37,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.61 | bwd_microstep: 1736.12 | bwd_inner_microstep: 1666.82 | bwd_allreduce_microstep: 69.23 | step_microstep: 159.68
[2025-08-03 02:01:37,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2737.66 | bwd: 7119.21 | bwd_inner: 6870.27 | bwd_allreduce: 248.71 | step: 160.14
  4%|▍         | 84/2000 [17:46<5:41:00, 10.68s/it]
  4%|▍         | 85/2000 [17:58<5:47:20, 10.88s/it]
  4%|▍         | 86/2000 [18:10<5:57:34, 11.21s/it]
  4%|▍         | 87/2000 [18:20<5:51:57, 11.04s/it]
  4%|▍         | 88/2000 [18:31<5:50:43, 11.01s/it]
  4%|▍         | 89/2000 [18:41<5:42:39, 10.76s/it]
  4%|▍         | 90/2000 [18:52<5:38:37, 10.64s/it]
{'loss': 0.9212, 'learning_rate': 1.998820159279591e-05, 'epoch': 0.04}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13479
total_samples=1411, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:39,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.16 | bwd_microstep: 1776.44 | bwd_inner_microstep: 1701.48 | bwd_allreduce_microstep: 74.90 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12237
total_samples=1414, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:42,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.71 | bwd_microstep: 2141.44 | bwd_inner_microstep: 1976.92 | bwd_allreduce_microstep: 164.46 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13560
total_samples=1418, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:45,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.35 | bwd_microstep: 1850.89 | bwd_inner_microstep: 1712.47 | bwd_allreduce_microstep: 138.34 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12020
total_samples=1421, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:48,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:01:48,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.57 | bwd_microstep: 1885.95 | bwd_inner_microstep: 1549.54 | bwd_allreduce_microstep: 336.35 | step_microstep: 111.70
[2025-08-03 02:01:48,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.71 | bwd: 7654.77 | bwd_inner: 6940.41 | bwd_allreduce: 714.12 | step: 112.16
{'loss': 0.917, 'learning_rate': 1.998740209091807e-05, 'epoch': 0.05}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14745
total_samples=1427, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:50,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.67 | bwd_microstep: 1865.30 | bwd_inner_microstep: 1780.54 | bwd_allreduce_microstep: 84.70 | step_microstep: 0.20
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12453
total_samples=1431, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:53,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.05 | bwd_microstep: 2112.62 | bwd_inner_microstep: 1896.52 | bwd_allreduce_microstep: 216.04 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11591
total_samples=1434, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:56,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1010.08 | bwd_microstep: 1801.37 | bwd_inner_microstep: 1521.54 | bwd_allreduce_microstep: 279.77 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13524
total_samples=1438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:01:59,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:01:59,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.15 | bwd_microstep: 1848.89 | bwd_inner_microstep: 1708.25 | bwd_allreduce_microstep: 140.58 | step_microstep: 152.69
[2025-08-03 02:01:59,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3088.88 | bwd: 7628.22 | bwd_inner: 6906.84 | bwd_allreduce: 721.15 | step: 153.10
{'loss': 0.9262, 'learning_rate': 1.9986576398242566e-05, 'epoch': 0.05}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15671
total_samples=1443, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:02,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.82 | bwd_microstep: 2214.07 | bwd_inner_microstep: 2023.15 | bwd_allreduce_microstep: 190.86 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12635
total_samples=1446, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:05,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.60 | bwd_microstep: 2141.50 | bwd_inner_microstep: 1899.37 | bwd_allreduce_microstep: 242.06 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11587
total_samples=1449, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:08,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.55 | bwd_microstep: 1951.08 | bwd_inner_microstep: 1624.97 | bwd_allreduce_microstep: 326.05 | step_microstep: 0.19
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12877
total_samples=1453, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:11,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 02:02:11,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.58 | bwd_microstep: 2076.55 | bwd_inner_microstep: 1909.13 | bwd_allreduce_microstep: 167.36 | step_microstep: 114.72
[2025-08-03 02:02:11,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.46 | bwd: 8383.25 | bwd_inner: 7456.61 | bwd_allreduce: 926.41 | step: 115.20
{'loss': 0.9182, 'learning_rate': 1.998572451693468e-05, 'epoch': 0.05}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13412
total_samples=1457, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:13,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.63 | bwd_microstep: 1761.62 | bwd_inner_microstep: 1684.88 | bwd_allreduce_microstep: 76.68 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15879
total_samples=1461, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:16,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.15 | bwd_microstep: 2082.13 | bwd_inner_microstep: 1934.22 | bwd_allreduce_microstep: 147.86 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11906
total_samples=1464, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:19,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.39 | bwd_microstep: 1736.60 | bwd_inner_microstep: 1542.48 | bwd_allreduce_microstep: 194.05 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14249
total_samples=1468, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:21,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 02:02:21,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.72 | bwd_microstep: 1824.42 | bwd_inner_microstep: 1722.35 | bwd_allreduce_microstep: 102.01 | step_microstep: 128.21
[2025-08-03 02:02:21,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.83 | bwd: 7404.82 | bwd_inner: 6883.93 | bwd_allreduce: 520.66 | step: 128.53
{'loss': 0.9253, 'learning_rate': 1.998484644922837e-05, 'epoch': 0.05}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13358
total_samples=1472, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:24,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.60 | bwd_microstep: 1721.82 | bwd_inner_microstep: 1625.59 | bwd_allreduce_microstep: 96.17 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14374
total_samples=1478, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:27,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.30 | bwd_microstep: 1846.95 | bwd_inner_microstep: 1729.37 | bwd_allreduce_microstep: 117.51 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14322
total_samples=1482, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:29,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.73 | bwd_microstep: 1778.45 | bwd_inner_microstep: 1729.55 | bwd_allreduce_microstep: 48.84 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 15674
total_samples=1486, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:33,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 02:02:33,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1887.22 | bwd_microstep: 2077.26 | bwd_inner_microstep: 1894.29 | bwd_allreduce_microstep: 182.91 | step_microstep: 112.08
[2025-08-03 02:02:33,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3961.79 | bwd: 7424.54 | bwd_inner: 6978.80 | bwd_allreduce: 445.50 | step: 112.42
{'loss': 0.9318, 'learning_rate': 1.9983942197426272e-05, 'epoch': 0.05}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12896
total_samples=1491, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:36,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.11 | bwd_microstep: 1821.01 | bwd_inner_microstep: 1605.79 | bwd_allreduce_microstep: 215.16 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=1494, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:38,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.74 | bwd_microstep: 1732.53 | bwd_inner_microstep: 1547.28 | bwd_allreduce_microstep: 185.19 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13560
total_samples=1498, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:41,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.94 | bwd_microstep: 1761.60 | bwd_inner_microstep: 1691.86 | bwd_allreduce_microstep: 69.68 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13699
total_samples=1502, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:44,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 02:02:44,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.06 | bwd_microstep: 1801.63 | bwd_inner_microstep: 1678.92 | bwd_allreduce_microstep: 122.64 | step_microstep: 167.90
[2025-08-03 02:02:44,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.78 | bwd: 7116.82 | bwd_inner: 6523.84 | bwd_allreduce: 592.75 | step: 168.23
{'loss': 0.9158, 'learning_rate': 1.9983011763899674e-05, 'epoch': 0.05}
  5%|▍         | 91/2000 [19:03<5:40:51, 10.71s/it]
  5%|▍         | 92/2000 [19:14<5:45:37, 10.87s/it]
  5%|▍         | 93/2000 [19:26<5:53:03, 11.11s/it]
  5%|▍         | 94/2000 [19:36<5:48:37, 10.97s/it]
  5%|▍         | 95/2000 [19:48<5:56:46, 11.24s/it]
  5%|▍         | 96/2000 [19:58<5:48:21, 10.98s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11979
total_samples=1505, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:46,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.96 | bwd_microstep: 1773.29 | bwd_inner_microstep: 1557.45 | bwd_allreduce_microstep: 215.77 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=1508, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:49,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.38 | bwd_microstep: 1742.64 | bwd_inner_microstep: 1533.28 | bwd_allreduce_microstep: 209.30 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11590
total_samples=1511, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:51,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.81 | bwd_microstep: 1763.76 | bwd_inner_microstep: 1541.01 | bwd_allreduce_microstep: 222.68 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13038
total_samples=1515, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:54,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:02:54,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.80 | bwd_microstep: 1755.99 | bwd_inner_microstep: 1654.93 | bwd_allreduce_microstep: 101.01 | step_microstep: 157.45
[2025-08-03 02:02:54,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.88 | bwd: 7035.74 | bwd_inner: 6286.66 | bwd_allreduce: 748.83 | step: 157.94
{'loss': 0.914, 'learning_rate': 1.998205515108853e-05, 'epoch': 0.05}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14349
total_samples=1519, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:57,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 874.31 | bwd_microstep: 2000.75 | bwd_inner_microstep: 1901.73 | bwd_allreduce_microstep: 98.95 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11623
total_samples=1522, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:02:59,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.76 | bwd_microstep: 1800.24 | bwd_inner_microstep: 1607.99 | bwd_allreduce_microstep: 192.18 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11692
total_samples=1525, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:02,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.51 | bwd_microstep: 1729.52 | bwd_inner_microstep: 1524.77 | bwd_allreduce_microstep: 204.69 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11711
total_samples=1528, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:05,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 02:03:05,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.58 | bwd_microstep: 2181.83 | bwd_inner_microstep: 1588.48 | bwd_allreduce_microstep: 593.29 | step_microstep: 110.37
[2025-08-03 02:03:05,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2961.10 | bwd: 7712.38 | bwd_inner: 6622.97 | bwd_allreduce: 1089.19 | step: 110.68
{'loss': 0.9046, 'learning_rate': 1.998107236150145e-05, 'epoch': 0.05}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13222
total_samples=1531, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:08,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.38 | bwd_microstep: 2015.39 | bwd_inner_microstep: 1799.33 | bwd_allreduce_microstep: 216.00 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11879
total_samples=1534, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:11,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.18 | bwd_microstep: 2534.79 | bwd_inner_microstep: 2318.13 | bwd_allreduce_microstep: 216.59 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13232
total_samples=1538, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:14,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.47 | bwd_microstep: 2172.50 | bwd_inner_microstep: 1851.22 | bwd_allreduce_microstep: 321.22 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11704
total_samples=1541, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:18,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 02:03:18,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.11 | bwd_microstep: 2466.68 | bwd_inner_microstep: 2246.13 | bwd_allreduce_microstep: 220.49 | step_microstep: 443.58
[2025-08-03 02:03:18,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.07 | bwd: 9189.41 | bwd_inner: 8214.81 | bwd_allreduce: 974.37 | step: 443.89
{'loss': 0.9044, 'learning_rate': 1.9980063397715685e-05, 'epoch': 0.05}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13524
total_samples=1545, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:21,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.52 | bwd_microstep: 2157.33 | bwd_inner_microstep: 1882.13 | bwd_allreduce_microstep: 275.14 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12154
total_samples=1548, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:24,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.83 | bwd_microstep: 1971.29 | bwd_inner_microstep: 1854.92 | bwd_allreduce_microstep: 116.30 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13660
total_samples=1552, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:26,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.03 | bwd_microstep: 2161.17 | bwd_inner_microstep: 2006.09 | bwd_allreduce_microstep: 155.02 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13502
total_samples=1556, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:29,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 02:03:29,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.48 | bwd_microstep: 1823.18 | bwd_inner_microstep: 1712.13 | bwd_allreduce_microstep: 110.98 | step_microstep: 409.65
[2025-08-03 02:03:29,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.80 | bwd: 8113.02 | bwd_inner: 7455.26 | bwd_allreduce: 657.52 | step: 410.11
{'loss': 0.9087, 'learning_rate': 1.997902826237712e-05, 'epoch': 0.05}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13327
total_samples=1560, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:32,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.88 | bwd_microstep: 2105.15 | bwd_inner_microstep: 1661.21 | bwd_allreduce_microstep: 443.89 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11966
total_samples=1563, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:35,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.05 | bwd_microstep: 1763.63 | bwd_inner_microstep: 1555.60 | bwd_allreduce_microstep: 207.97 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13461
total_samples=1567, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:37,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.10 | bwd_microstep: 1767.84 | bwd_inner_microstep: 1699.37 | bwd_allreduce_microstep: 68.40 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12765
total_samples=1571, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:40,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 32.38
[2025-08-03 02:03:40,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.10 | bwd_microstep: 1782.62 | bwd_inner_microstep: 1608.45 | bwd_allreduce_microstep: 174.11 | step_microstep: 150.53
[2025-08-03 02:03:40,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2740.07 | bwd: 7419.29 | bwd_inner: 6524.62 | bwd_allreduce: 894.44 | step: 150.86
{'loss': 0.9064, 'learning_rate': 1.9977966958200276e-05, 'epoch': 0.05}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11800
total_samples=1574, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:43,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.60 | bwd_microstep: 1807.76 | bwd_inner_microstep: 1574.67 | bwd_allreduce_microstep: 233.03 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12098
total_samples=1577, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:45,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.15 | bwd_microstep: 1746.01 | bwd_inner_microstep: 1555.65 | bwd_allreduce_microstep: 190.30 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12672
total_samples=1581, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:48,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.59 | bwd_microstep: 1778.42 | bwd_inner_microstep: 1597.17 | bwd_allreduce_microstep: 181.19 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13644
total_samples=1585, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:51,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 02:03:51,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.99 | bwd_microstep: 1773.75 | bwd_inner_microstep: 1700.04 | bwd_allreduce_microstep: 73.66 | step_microstep: 145.85
[2025-08-03 02:03:51,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.28 | bwd: 7106.00 | bwd_inner: 6427.53 | bwd_allreduce: 678.25 | step: 146.18
{'loss': 0.9038, 'learning_rate': 1.997687948796831e-05, 'epoch': 0.05}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11678
total_samples=1588, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:53,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 667.88 | bwd_microstep: 1701.84 | bwd_inner_microstep: 1518.61 | bwd_allreduce_microstep: 183.17 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11693
total_samples=1591, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:56,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.03 | bwd_microstep: 1767.36 | bwd_inner_microstep: 1546.11 | bwd_allreduce_microstep: 221.19 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12405
total_samples=1594, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:03:58,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.96 | bwd_microstep: 2049.29 | bwd_inner_microstep: 1814.63 | bwd_allreduce_microstep: 234.60 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12773
total_samples=1598, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:01,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 02:04:01,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.90 | bwd_microstep: 1769.66 | bwd_inner_microstep: 1611.41 | bwd_allreduce_microstep: 158.18 | step_microstep: 118.70
[2025-08-03 02:04:01,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.69 | bwd: 7288.21 | bwd_inner: 6490.76 | bwd_allreduce: 797.21 | step: 119.04
  5%|▍         | 97/2000 [20:09<5:42:09, 10.79s/it]
  5%|▍         | 98/2000 [20:20<5:45:02, 10.88s/it]
  5%|▍         | 99/2000 [20:33<6:02:37, 11.45s/it]
  5%|▌         | 100/2000 [20:44<6:04:39, 11.52s/it]
  5%|▌         | 101/2000 [20:55<5:56:09, 11.25s/it]
  5%|▌         | 102/2000 [21:05<5:47:54, 11.00s/it]
{'loss': 0.9128, 'learning_rate': 1.9975765854532974e-05, 'epoch': 0.05}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11854
total_samples=1601, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:04,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.09 | bwd_microstep: 1769.52 | bwd_inner_microstep: 1542.02 | bwd_allreduce_microstep: 227.43 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14140
total_samples=1605, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:06,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.83 | bwd_microstep: 1803.52 | bwd_inner_microstep: 1739.98 | bwd_allreduce_microstep: 63.48 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13081
total_samples=1609, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:09,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.15 | bwd_microstep: 1994.67 | bwd_inner_microstep: 1866.16 | bwd_allreduce_microstep: 128.45 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12402
total_samples=1613, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:12,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:04:12,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.09 | bwd_microstep: 1807.08 | bwd_inner_microstep: 1585.21 | bwd_allreduce_microstep: 221.80 | step_microstep: 126.62
[2025-08-03 02:04:12,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.11 | bwd: 7374.83 | bwd_inner: 6733.37 | bwd_allreduce: 641.23 | step: 126.93
{'loss': 0.9017, 'learning_rate': 1.997462606081465e-05, 'epoch': 0.05}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12228
total_samples=1616, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:14,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.67 | bwd_microstep: 1850.02 | bwd_inner_microstep: 1595.13 | bwd_allreduce_microstep: 254.83 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13547
total_samples=1620, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:17,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.00 | bwd_microstep: 1801.89 | bwd_inner_microstep: 1708.18 | bwd_allreduce_microstep: 93.64 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14381
total_samples=1624, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:21,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1420.80 | bwd_microstep: 2156.00 | bwd_inner_microstep: 1995.24 | bwd_allreduce_microstep: 160.71 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13961
total_samples=1628, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:24,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 02:04:24,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.34 | bwd_microstep: 2214.14 | bwd_inner_microstep: 2070.55 | bwd_allreduce_microstep: 143.53 | step_microstep: 106.76
[2025-08-03 02:04:24,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3566.74 | bwd: 8022.10 | bwd_inner: 7369.09 | bwd_allreduce: 652.78 | step: 107.18
{'loss': 0.9125, 'learning_rate': 1.9973460109802306e-05, 'epoch': 0.05}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13565
total_samples=1632, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:27,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.96 | bwd_microstep: 2592.99 | bwd_inner_microstep: 2322.73 | bwd_allreduce_microstep: 270.20 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12005
total_samples=1635, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:30,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.72 | bwd_microstep: 2290.99 | bwd_inner_microstep: 1596.17 | bwd_allreduce_microstep: 694.76 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14334
total_samples=1639, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:33,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.67 | bwd_microstep: 2207.40 | bwd_inner_microstep: 1918.16 | bwd_allreduce_microstep: 289.18 | step_microstep: 0.26
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14587
total_samples=1643, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:36,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 02:04:36,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.64 | bwd_microstep: 1765.69 | bwd_inner_microstep: 1685.79 | bwd_allreduce_microstep: 79.83 | step_microstep: 134.99
[2025-08-03 02:04:36,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.92 | bwd: 8857.12 | bwd_inner: 7522.84 | bwd_allreduce: 1334.05 | step: 135.48
{'loss': 0.9029, 'learning_rate': 1.997226800455352e-05, 'epoch': 0.05}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11480
total_samples=1646, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:39,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.63 | bwd_microstep: 2003.92 | bwd_inner_microstep: 1838.72 | bwd_allreduce_microstep: 165.13 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15851
total_samples=1650, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:41,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.50 | bwd_microstep: 1793.37 | bwd_inner_microstep: 1773.74 | bwd_allreduce_microstep: 19.57 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16072
total_samples=1654, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:44,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.36 | bwd_microstep: 1804.39 | bwd_inner_microstep: 1797.83 | bwd_allreduce_microstep: 6.49 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12521
total_samples=1657, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:47,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 02:04:47,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.76 | bwd_microstep: 1877.55 | bwd_inner_microstep: 1573.95 | bwd_allreduce_microstep: 303.53 | step_microstep: 118.08
[2025-08-03 02:04:47,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.16 | bwd: 7479.27 | bwd_inner: 6984.23 | bwd_allreduce: 494.80 | step: 118.41
{'loss': 0.896, 'learning_rate': 1.9971049748194448e-05, 'epoch': 0.05}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12133
total_samples=1660, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:49,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.67 | bwd_microstep: 1789.22 | bwd_inner_microstep: 1549.97 | bwd_allreduce_microstep: 239.19 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11782
total_samples=1663, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:52,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.52 | bwd_microstep: 1766.97 | bwd_inner_microstep: 1559.36 | bwd_allreduce_microstep: 207.54 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14134
total_samples=1667, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:54,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.81 | bwd_microstep: 1822.52 | bwd_inner_microstep: 1750.82 | bwd_allreduce_microstep: 71.64 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13212
total_samples=1671, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:04:57,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:04:57,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.49 | bwd_microstep: 1822.57 | bwd_inner_microstep: 1663.80 | bwd_allreduce_microstep: 158.70 | step_microstep: 144.34
[2025-08-03 02:04:57,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.44 | bwd: 7201.33 | bwd_inner: 6523.96 | bwd_allreduce: 677.14 | step: 144.77
{'loss': 0.9029, 'learning_rate': 1.9969805343919822e-05, 'epoch': 0.05}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11932
total_samples=1674, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:00,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.21 | bwd_microstep: 1835.36 | bwd_inner_microstep: 1589.62 | bwd_allreduce_microstep: 245.67 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13879
total_samples=1679, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:04,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1931.25 | bwd_microstep: 2036.51 | bwd_inner_microstep: 1879.27 | bwd_allreduce_microstep: 157.18 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13880
total_samples=1684, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:06,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.35 | bwd_microstep: 1746.90 | bwd_inner_microstep: 1699.94 | bwd_allreduce_microstep: 46.90 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11771
total_samples=1687, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:09,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 02:05:09,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.92 | bwd_microstep: 1749.16 | bwd_inner_microstep: 1535.59 | bwd_allreduce_microstep: 213.51 | step_microstep: 111.83
[2025-08-03 02:05:09,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4012.65 | bwd: 7367.98 | bwd_inner: 6704.42 | bwd_allreduce: 663.33 | step: 112.29
  5%|▌         | 103/2000 [21:16<5:42:54, 10.85s/it]
  5%|▌         | 104/2000 [21:27<5:40:51, 10.79s/it]
  5%|▌         | 105/2000 [21:39<5:52:37, 11.17s/it]
  5%|▌         | 106/2000 [21:51<6:01:23, 11.45s/it]
  5%|▌         | 107/2000 [22:01<5:54:38, 11.24s/it]
  5%|▌         | 108/2000 [22:12<5:47:24, 11.02s/it]
  5%|▌         | 109/2000 [22:24<5:54:46, 11.26s/
{'loss': 0.8997, 'learning_rate': 1.9968534794992947e-05, 'epoch': 0.05}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12499
total_samples=1691, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:12,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.12 | bwd_microstep: 1981.99 | bwd_inner_microstep: 1597.52 | bwd_allreduce_microstep: 384.41 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13256
total_samples=1695, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:14,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.46 | bwd_microstep: 1807.77 | bwd_inner_microstep: 1709.65 | bwd_allreduce_microstep: 98.06 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13402
total_samples=1699, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:17,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.38 | bwd_microstep: 1841.52 | bwd_inner_microstep: 1718.70 | bwd_allreduce_microstep: 122.76 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13730
total_samples=1703, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:20,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 02:05:20,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.75 | bwd_microstep: 1750.00 | bwd_inner_microstep: 1703.08 | bwd_allreduce_microstep: 46.85 | step_microstep: 137.11
[2025-08-03 02:05:20,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.64 | bwd: 7381.35 | bwd_inner: 6728.95 | bwd_allreduce: 652.17 | step: 137.58
{'loss': 0.9057, 'learning_rate': 1.9967238104745695e-05, 'epoch': 0.06}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13172
total_samples=1707, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:22,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.64 | bwd_microstep: 1766.70 | bwd_inner_microstep: 1672.56 | bwd_allreduce_microstep: 94.07 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13948
total_samples=1711, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:25,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.13 | bwd_microstep: 2099.34 | bwd_inner_microstep: 2058.39 | bwd_allreduce_microstep: 40.88 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13179
total_samples=1715, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:27,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.79 | bwd_microstep: 1745.15 | bwd_inner_microstep: 1669.12 | bwd_allreduce_microstep: 75.97 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14705
total_samples=1719, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:30,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 02:05:30,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.78 | bwd_microstep: 1744.75 | bwd_inner_microstep: 1712.30 | bwd_allreduce_microstep: 32.39 | step_microstep: 156.06
[2025-08-03 02:05:30,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.28 | bwd: 7355.98 | bwd_inner: 7112.37 | bwd_allreduce: 243.38 | step: 156.47
{'loss': 0.8862, 'learning_rate': 1.996591527657848e-05, 'epoch': 0.06}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13906
total_samples=1723, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:33,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.44 | bwd_microstep: 1782.56 | bwd_inner_microstep: 1709.36 | bwd_allreduce_microstep: 73.14 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15194
total_samples=1727, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:35,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.85 | bwd_microstep: 1847.72 | bwd_inner_microstep: 1810.85 | bwd_allreduce_microstep: 36.80 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13296
total_samples=1731, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:39,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.29 | bwd_microstep: 2482.04 | bwd_inner_microstep: 2016.39 | bwd_allreduce_microstep: 465.54 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11581
total_samples=1734, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:42,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:05:42,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.64 | bwd_microstep: 2045.64 | bwd_inner_microstep: 1808.55 | bwd_allreduce_microstep: 237.03 | step_microstep: 111.81
[2025-08-03 02:05:42,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.16 | bwd: 8158.02 | bwd_inner: 7345.17 | bwd_allreduce: 812.58 | step: 112.27
{'loss': 0.9042, 'learning_rate': 1.9964566313960265e-05, 'epoch': 0.06}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15687
total_samples=1738, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:44,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.59 | bwd_microstep: 2036.34 | bwd_inner_microstep: 1796.58 | bwd_allreduce_microstep: 239.70 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13290
total_samples=1742, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:47,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.30 | bwd_microstep: 2136.79 | bwd_inner_microstep: 2012.33 | bwd_allreduce_microstep: 124.41 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12734
total_samples=1746, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:50,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.83 | bwd_microstep: 2031.92 | bwd_inner_microstep: 1891.83 | bwd_allreduce_microstep: 140.03 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11711
total_samples=1749, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:53,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 02:05:53,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 762.59 | bwd_microstep: 1936.89 | bwd_inner_microstep: 1774.07 | bwd_allreduce_microstep: 162.77 | step_microstep: 112.18
[2025-08-03 02:05:53,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2872.23 | bwd: 8141.99 | bwd_inner: 7474.79 | bwd_allreduce: 666.97 | step: 112.61
{'loss': 0.8966, 'learning_rate': 1.9963191220428552e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11703
total_samples=1753, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:56,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.39 | bwd_microstep: 2262.35 | bwd_inner_microstep: 2124.32 | bwd_allreduce_microstep: 137.97 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13166
total_samples=1757, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:05:59,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.86 | bwd_microstep: 1843.11 | bwd_inner_microstep: 1692.35 | bwd_allreduce_microstep: 150.70 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12991
total_samples=1761, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:01,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 663.37 | bwd_microstep: 1753.96 | bwd_inner_microstep: 1673.54 | bwd_allreduce_microstep: 80.34 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11705
total_samples=1764, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:04,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.56
[2025-08-03 02:06:04,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.63 | bwd_microstep: 1850.89 | bwd_inner_microstep: 1716.19 | bwd_allreduce_microstep: 134.64 | step_microstep: 161.45
[2025-08-03 02:06:04,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2715.19 | bwd: 7710.36 | bwd_inner: 7206.40 | bwd_allreduce: 503.73 | step: 161.80
{'loss': 0.9014, 'learning_rate': 1.9961789999589357e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11508
total_samples=1767, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:07,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.00 | bwd_microstep: 2051.20 | bwd_inner_microstep: 1816.80 | bwd_allreduce_microstep: 234.33 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13316
total_samples=1771, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:10,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.71 | bwd_microstep: 2066.27 | bwd_inner_microstep: 1942.58 | bwd_allreduce_microstep: 123.63 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13110
total_samples=1775, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:12,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.72 | bwd_microstep: 1757.53 | bwd_inner_microstep: 1662.08 | bwd_allreduce_microstep: 95.38 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13231
total_samples=1779, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:15,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.76
[2025-08-03 02:06:15,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.71 | bwd_microstep: 1773.24 | bwd_inner_microstep: 1689.22 | bwd_allreduce_microstep: 83.95 | step_microstep: 110.53
[2025-08-03 02:06:15,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.08 | bwd: 7648.29 | bwd_inner: 7110.69 | bwd_allreduce: 537.37 | step: 110.86
  5%|▌         | 109/2000 [22:24<5:54:46, 11.26s/it]
  6%|▌         | 110/2000 [22:34<5:48:38, 11.07s/it]
  6%|▌         | 111/2000 [22:45<5:44:11, 10.93s/it]
  6%|▌         | 112/2000 [22:56<5:48:28, 11.07s/it]
  6%|▌         | 113/2000 [23:08<5:51:48, 11.19s/it]
  6%|▌         | 114/2000 [23:19<5:49:04, 11.11s/it]
  6%|▌         | 115/2000 [23:30<5:46:56, 11.04s/it]
{'loss': 0.8834, 'learning_rate': 1.996036265511722e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11786
total_samples=1782, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:17,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.77 | bwd_microstep: 1758.10 | bwd_inner_microstep: 1539.28 | bwd_allreduce_microstep: 218.74 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13385
total_samples=1786, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:20,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.88 | bwd_microstep: 1781.81 | bwd_inner_microstep: 1694.83 | bwd_allreduce_microstep: 86.92 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13630
total_samples=1790, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:23,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.23 | bwd_microstep: 1819.97 | bwd_inner_microstep: 1730.15 | bwd_allreduce_microstep: 89.76 | step_microstep: 0.22
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14129
total_samples=1795, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:25,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:06:25,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.67 | bwd_microstep: 1713.70 | bwd_inner_microstep: 1654.71 | bwd_allreduce_microstep: 58.93 | step_microstep: 138.00
[2025-08-03 02:06:25,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.48 | bwd: 7073.64 | bwd_inner: 6618.96 | bwd_allreduce: 454.44 | step: 138.51
{'loss': 0.9013, 'learning_rate': 1.995890919075519e-05, 'epoch': 0.06}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14247
total_samples=1800, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:28,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.26 | bwd_microstep: 1836.30 | bwd_inner_microstep: 1741.70 | bwd_allreduce_microstep: 94.54 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14642
total_samples=1804, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:30,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.97 | bwd_microstep: 1768.92 | bwd_inner_microstep: 1759.54 | bwd_allreduce_microstep: 9.32 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13200
total_samples=1808, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:33,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.74 | bwd_microstep: 1820.90 | bwd_inner_microstep: 1685.18 | bwd_allreduce_microstep: 135.65 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14351
total_samples=1812, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:36,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 02:06:36,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.16 | bwd_microstep: 1796.30 | bwd_inner_microstep: 1738.25 | bwd_allreduce_microstep: 57.99 | step_microstep: 137.73
[2025-08-03 02:06:36,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.07 | bwd: 7222.48 | bwd_inner: 6924.67 | bwd_allreduce: 297.58 | step: 138.20
{'loss': 0.8885, 'learning_rate': 1.9957429610314797e-05, 'epoch': 0.06}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13793
total_samples=1816, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:39,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.46 | bwd_microstep: 2063.63 | bwd_inner_microstep: 1772.72 | bwd_allreduce_microstep: 290.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13796
total_samples=1820, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:41,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.60 | bwd_microstep: 1829.44 | bwd_inner_microstep: 1731.36 | bwd_allreduce_microstep: 98.02 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13682
total_samples=1825, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:44,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.19 | bwd_microstep: 1813.25 | bwd_inner_microstep: 1730.44 | bwd_allreduce_microstep: 82.74 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12121
total_samples=1828, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:46,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:06:46,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.83 | bwd_microstep: 1792.07 | bwd_inner_microstep: 1559.21 | bwd_allreduce_microstep: 232.79 | step_microstep: 152.18
[2025-08-03 02:06:46,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.02 | bwd: 7498.44 | bwd_inner: 6793.72 | bwd_allreduce: 704.48 | step: 152.76
{'loss': 0.8844, 'learning_rate': 1.995592391767608e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12497
total_samples=1831, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:50,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.50 | bwd_microstep: 2573.24 | bwd_inner_microstep: 2378.14 | bwd_allreduce_microstep: 195.03 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14178
total_samples=1836, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:53,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.12 | bwd_microstep: 2172.09 | bwd_inner_microstep: 1750.75 | bwd_allreduce_microstep: 421.28 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12946
total_samples=1840, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:56,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.81 | bwd_microstep: 2166.78 | bwd_inner_microstep: 1871.43 | bwd_allreduce_microstep: 295.29 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13281
total_samples=1844, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:06:58,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 02:06:58,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.04 | bwd_microstep: 1767.72 | bwd_inner_microstep: 1692.95 | bwd_allreduce_microstep: 74.72 | step_microstep: 154.97
[2025-08-03 02:06:58,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.40 | bwd: 8679.88 | bwd_inner: 7693.27 | bwd_allreduce: 986.39 | step: 155.40
{'loss': 0.8839, 'learning_rate': 1.995439211678754e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11763
total_samples=1848, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:01,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.51 | bwd_microstep: 2095.21 | bwd_inner_microstep: 1868.78 | bwd_allreduce_microstep: 226.37 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14307
total_samples=1853, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:04,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.68 | bwd_microstep: 1945.10 | bwd_inner_microstep: 1754.71 | bwd_allreduce_microstep: 190.33 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11657
total_samples=1856, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:07,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.06 | bwd_microstep: 1865.44 | bwd_inner_microstep: 1856.41 | bwd_allreduce_microstep: 8.96 | step_microstep: 0.20
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12266
total_samples=1860, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:10,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 02:07:10,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.26 | bwd_microstep: 2050.72 | bwd_inner_microstep: 1859.95 | bwd_allreduce_microstep: 190.69 | step_microstep: 111.34
[2025-08-03 02:07:10,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.44 | bwd: 7956.52 | bwd_inner: 7339.84 | bwd_allreduce: 616.44 | step: 111.79
{'loss': 0.8737, 'learning_rate': 1.995283421166614e-05, 'epoch': 0.06}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13999
total_samples=1864, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:12,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.73 | bwd_microstep: 2029.09 | bwd_inner_microstep: 1753.04 | bwd_allreduce_microstep: 275.99 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13172
total_samples=1868, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:15,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.51 | bwd_microstep: 1775.52 | bwd_inner_microstep: 1691.86 | bwd_allreduce_microstep: 83.60 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11823
total_samples=1871, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:18,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.79 | bwd_microstep: 1826.83 | bwd_inner_microstep: 1563.11 | bwd_allreduce_microstep: 263.66 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12032
total_samples=1874, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:20,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 02:07:20,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.74 | bwd_microstep: 1770.08 | bwd_inner_microstep: 1554.71 | bwd_allreduce_microstep: 215.31 | step_microstep: 148.90
[2025-08-03 02:07:20,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.70 | bwd: 7401.57 | bwd_inner: 6562.72 | bwd_allreduce: 838.63 | step: 149.32
{'loss': 0.8839, 'learning_rate': 1.995125020639731e-05, 'epoch': 0.06}
  6%|▌         | 116/2000 [23:40<5:40:10, 10.83s/it]
  6%|▌         | 117/2000 [23:51<5:36:59, 10.74s/it]
  6%|▌         | 118/2000 [24:01<5:37:08, 10.75s/it]
  6%|▌         | 119/2000 [24:13<5:48:54, 11.13s/it]
  6%|▌         | 120/2000 [24:25<5:49:30, 11.15s/it]
  6%|▌         | 121/2000 [24:35<5:44:59, 11.02s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13559
total_samples=1878, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:23,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.16 | bwd_microstep: 2229.23 | bwd_inner_microstep: 1926.03 | bwd_allreduce_microstep: 303.15 | step_microstep: 0.09
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13268
total_samples=1883, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:26,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.79 | bwd_microstep: 1831.20 | bwd_inner_microstep: 1687.13 | bwd_allreduce_microstep: 144.00 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11944
total_samples=1886, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:30,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2277.13 | bwd_microstep: 1872.45 | bwd_inner_microstep: 1601.34 | bwd_allreduce_microstep: 271.05 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11798
total_samples=1889, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:33,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 02:07:33,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.09 | bwd_microstep: 1803.27 | bwd_inner_microstep: 1571.46 | bwd_allreduce_microstep: 231.74 | step_microstep: 141.34
[2025-08-03 02:07:33,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4359.10 | bwd: 7736.19 | bwd_inner: 6785.93 | bwd_allreduce: 950.01 | step: 141.64
{'loss': 0.8853, 'learning_rate': 1.994964010513492e-05, 'epoch': 0.06}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13767
total_samples=1893, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:36,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.52 | bwd_microstep: 1845.73 | bwd_inner_microstep: 1717.24 | bwd_allreduce_microstep: 128.43 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13260
total_samples=1897, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:38,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.03 | bwd_microstep: 1712.11 | bwd_inner_microstep: 1657.09 | bwd_allreduce_microstep: 54.96 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13307
total_samples=1901, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:41,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.96 | bwd_microstep: 1696.72 | bwd_inner_microstep: 1654.16 | bwd_allreduce_microstep: 42.50 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11950
total_samples=1904, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:43,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 02:07:43,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.74 | bwd_microstep: 1752.21 | bwd_inner_microstep: 1550.14 | bwd_allreduce_microstep: 202.01 | step_microstep: 112.18
[2025-08-03 02:07:43,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.18 | bwd: 7006.81 | bwd_inner: 6578.62 | bwd_allreduce: 427.97 | step: 112.62
{'loss': 0.8729, 'learning_rate': 1.9948003912101274e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11654
total_samples=1907, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:47,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1209.79 | bwd_microstep: 2578.44 | bwd_inner_microstep: 2344.15 | bwd_allreduce_microstep: 234.23 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13409
total_samples=1911, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:50,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.82 | bwd_microstep: 1828.09 | bwd_inner_microstep: 1654.41 | bwd_allreduce_microstep: 173.62 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13377
total_samples=1915, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:53,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.01 | bwd_microstep: 2058.65 | bwd_inner_microstep: 1848.30 | bwd_allreduce_microstep: 210.27 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11727
total_samples=1918, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:55,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 02:07:55,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.40 | bwd_microstep: 1703.39 | bwd_inner_microstep: 1517.63 | bwd_allreduce_microstep: 185.70 | step_microstep: 116.03
[2025-08-03 02:07:55,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3262.95 | bwd: 8168.62 | bwd_inner: 7364.46 | bwd_allreduce: 803.90 | step: 116.41
{'loss': 0.8779, 'learning_rate': 1.9946341631587086e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938
total_samples=1921, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:07:58,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.06 | bwd_microstep: 1706.70 | bwd_inner_microstep: 1532.16 | bwd_allreduce_microstep: 174.48 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13250
total_samples=1925, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:00,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.23 | bwd_microstep: 1708.83 | bwd_inner_microstep: 1658.59 | bwd_allreduce_microstep: 50.17 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14174
total_samples=1929, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:03,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.88 | bwd_microstep: 1861.90 | bwd_inner_microstep: 1730.70 | bwd_allreduce_microstep: 131.13 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12287
total_samples=1933, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:06,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 02:08:06,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.22 | bwd_microstep: 1969.13 | bwd_inner_microstep: 1583.89 | bwd_allreduce_microstep: 385.19 | step_microstep: 148.57
[2025-08-03 02:08:06,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.32 | bwd: 7246.60 | bwd_inner: 6505.33 | bwd_allreduce: 741.05 | step: 149.01
{'loss': 0.8763, 'learning_rate': 1.9944653267951507e-05, 'epoch': 0.06}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13631
total_samples=1937, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:09,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.90 | bwd_microstep: 2213.71 | bwd_inner_microstep: 2024.79 | bwd_allreduce_microstep: 188.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13578
total_samples=1941, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:12,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.66 | bwd_microstep: 2107.99 | bwd_inner_microstep: 1937.74 | bwd_allreduce_microstep: 170.19 | step_microstep: 0.15
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12949
total_samples=1945, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:14,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.52 | bwd_microstep: 1748.78 | bwd_inner_microstep: 1653.91 | bwd_allreduce_microstep: 94.80 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13788
total_samples=1949, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:17,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:08:17,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.61 | bwd_microstep: 1946.68 | bwd_inner_microstep: 1859.10 | bwd_allreduce_microstep: 87.52 | step_microstep: 167.77
[2025-08-03 02:08:17,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.62 | bwd: 8017.20 | bwd_inner: 7475.53 | bwd_allreduce: 541.44 | step: 168.16
{'loss': 0.8898, 'learning_rate': 1.9942938825622064e-05, 'epoch': 0.06}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13918
total_samples=1953, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:20,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.04 | bwd_microstep: 1936.08 | bwd_inner_microstep: 1929.91 | bwd_allreduce_microstep: 6.11 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14023
total_samples=1959, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:22,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.87 | bwd_microstep: 1827.54 | bwd_inner_microstep: 1737.13 | bwd_allreduce_microstep: 90.35 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12978
total_samples=1963, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:25,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.90 | bwd_microstep: 2097.26 | bwd_inner_microstep: 1965.26 | bwd_allreduce_microstep: 131.95 | step_microstep: 0.20
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13425
total_samples=1968, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:28,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 02:08:28,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.96 | bwd_microstep: 1930.18 | bwd_inner_microstep: 1676.94 | bwd_allreduce_microstep: 253.18 | step_microstep: 111.71
[2025-08-03 02:08:28,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2752.69 | bwd: 7791.11 | bwd_inner: 7309.22 | bwd_allreduce: 481.67 | step: 112.14
{'loss': 0.8749, 'learning_rate': 1.994119830909469e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11618
total_samples=1971, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:30,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.89 | bwd_microstep: 1700.01 | bwd_inner_microstep: 1511.69 | bwd_allreduce_microstep: 188.25 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14571
total_samples=1975, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:33,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.39 | bwd_microstep: 2114.49 | bwd_inner_microstep: 1731.12 | bwd_allreduce_microstep: 383.30 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13799
total_samples=1979, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:36,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.48 | bwd_microstep: 1782.87 | bwd_inner_microstep: 1694.66 | bwd_allreduce_microstep: 88.14 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12450
total_samples=1983, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:39,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 02:08:39,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.66 | bwd_microstep: 1844.82 | bwd_inner_microstep: 1722.38 | bwd_allreduce_microstep: 122.38 | step_microstep: 146.17
[2025-08-03 02:08:39,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.36 | bwd: 7442.24 | bwd_inner: 6659.86 | bwd_allreduce: 782.15 | step: 146.52
  6%|▌         | 122/2000 [24:48<5:59:26, 11.48s/it]
  6%|▌         | 123/2000 [24:58<5:47:45, 11.12s/it]
  6%|▌         | 124/2000 [25:10<5:54:46, 11.35s/it]
  6%|▋         | 125/2000 [25:20<5:46:54, 11.10s/it]
  6%|▋         | 126/2000 [25:32<5:48:27, 11.16s/it]
  6%|▋         | 127/2000 [25:43<5:46:51, 11.11s/it]
{'loss': 0.8747, 'learning_rate': 1.9939431722933678e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12246
total_samples=1986, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:41,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.08 | bwd_microstep: 1827.14 | bwd_inner_microstep: 1603.02 | bwd_allreduce_microstep: 224.06 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13470
total_samples=1990, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:44,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.26 | bwd_microstep: 1782.18 | bwd_inner_microstep: 1694.44 | bwd_allreduce_microstep: 87.68 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15343
total_samples=1994, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:46,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.92 | bwd_microstep: 1745.23 | bwd_inner_microstep: 1739.25 | bwd_allreduce_microstep: 5.92 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12928
total_samples=1998, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:49,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 02:08:49,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.56 | bwd_microstep: 2099.10 | bwd_inner_microstep: 1657.19 | bwd_allreduce_microstep: 441.84 | step_microstep: 126.82
[2025-08-03 02:08:49,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.74 | bwd: 7453.71 | bwd_inner: 6693.90 | bwd_allreduce: 759.59 | step: 127.15
{'loss': 0.8743, 'learning_rate': 1.9937639071771704e-05, 'epoch': 0.06}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11630
total_samples=2001, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:52,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.60 | bwd_microstep: 1769.91 | bwd_inner_microstep: 1530.06 | bwd_allreduce_microstep: 239.78 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12932
total_samples=2005, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:54,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.58 | bwd_microstep: 1739.81 | bwd_inner_microstep: 1648.98 | bwd_allreduce_microstep: 90.75 | step_microstep: 0.17
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12912
total_samples=2009, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:08:57,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 663.90 | bwd_microstep: 1762.00 | bwd_inner_microstep: 1633.19 | bwd_allreduce_microstep: 128.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13623
total_samples=2013, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:00,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 02:09:00,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.38 | bwd_microstep: 1708.67 | bwd_inner_microstep: 1668.97 | bwd_allreduce_microstep: 39.63 | step_microstep: 452.03
[2025-08-03 02:09:00,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2754.38 | bwd: 6980.44 | bwd_inner: 6481.20 | bwd_allreduce: 499.00 | step: 452.43
{'loss': 0.8704, 'learning_rate': 1.993582036030978e-05, 'epoch': 0.07}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11967
total_samples=2016, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:03,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.74 | bwd_microstep: 2110.23 | bwd_inner_microstep: 1691.13 | bwd_allreduce_microstep: 418.99 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13037
total_samples=2020, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:05,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.07 | bwd_microstep: 1725.35 | bwd_inner_microstep: 1616.86 | bwd_allreduce_microstep: 108.43 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11836
total_samples=2023, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:08,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.69 | bwd_microstep: 1942.02 | bwd_inner_microstep: 1555.23 | bwd_allreduce_microstep: 386.73 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14695
total_samples=2027, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:11,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 02:09:11,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.10 | bwd_microstep: 1726.78 | bwd_inner_microstep: 1710.86 | bwd_allreduce_microstep: 15.86 | step_microstep: 155.42
[2025-08-03 02:09:11,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.53 | bwd: 7504.42 | bwd_inner: 6574.09 | bwd_allreduce: 930.07 | step: 155.88
{'loss': 0.8686, 'learning_rate': 1.9933975593317263e-05, 'epoch': 0.07}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13447
total_samples=2031, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:14,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1491.96 | bwd_microstep: 1853.28 | bwd_inner_microstep: 1707.64 | bwd_allreduce_microstep: 145.57 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11981
total_samples=2034, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:18,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1397.69 | bwd_microstep: 2087.82 | bwd_inner_microstep: 1831.22 | bwd_allreduce_microstep: 256.53 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11929
total_samples=2037, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:20,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.57 | bwd_microstep: 1853.78 | bwd_inner_microstep: 1582.33 | bwd_allreduce_microstep: 271.39 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13711
total_samples=2041, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:24,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 02:09:24,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.62 | bwd_microstep: 2098.86 | bwd_inner_microstep: 1938.66 | bwd_allreduce_microstep: 160.13 | step_microstep: 435.81
[2025-08-03 02:09:24,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4276.79 | bwd: 7893.79 | bwd_inner: 7059.84 | bwd_allreduce: 833.70 | step: 436.25
{'loss': 0.8782, 'learning_rate': 1.9932104775631847e-05, 'epoch': 0.07}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13610
total_samples=2045, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:26,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 666.16 | bwd_microstep: 1959.63 | bwd_inner_microstep: 1700.05 | bwd_allreduce_microstep: 259.51 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11767
total_samples=2048, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:29,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.15 | bwd_microstep: 2191.21 | bwd_inner_microstep: 2153.35 | bwd_allreduce_microstep: 37.79 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13372
total_samples=2052, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:32,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 662.65 | bwd_microstep: 2159.35 | bwd_inner_microstep: 1932.85 | bwd_allreduce_microstep: 226.44 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11618
total_samples=2055, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:35,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 02:09:35,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.96 | bwd_microstep: 1759.70 | bwd_inner_microstep: 1529.41 | bwd_allreduce_microstep: 230.23 | step_microstep: 131.22
[2025-08-03 02:09:35,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2713.84 | bwd: 8069.93 | bwd_inner: 7315.66 | bwd_allreduce: 754.05 | step: 131.64
{'loss': 0.8695, 'learning_rate': 1.993020791215953e-05, 'epoch': 0.07}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13512
total_samples=2059, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:37,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.33 | bwd_microstep: 1754.45 | bwd_inner_microstep: 1665.68 | bwd_allreduce_microstep: 88.70 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13529
total_samples=2063, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:40,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.47 | bwd_microstep: 1876.04 | bwd_inner_microstep: 1736.78 | bwd_allreduce_microstep: 139.20 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11738
total_samples=2066, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:43,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.89 | bwd_microstep: 1821.34 | bwd_inner_microstep: 1576.76 | bwd_allreduce_microstep: 244.51 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11837
total_samples=2069, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:45,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 02:09:45,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.38 | bwd_microstep: 1833.94 | bwd_inner_microstep: 1543.46 | bwd_allreduce_microstep: 290.40 | step_microstep: 163.09
[2025-08-03 02:09:45,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.01 | bwd: 7285.80 | bwd_inner: 6522.67 | bwd_allreduce: 762.89 | step: 163.59
  6%|▋         | 128/2000 [25:54<5:43:08, 11.00s/it]
  6%|▋         | 129/2000 [26:04<5:40:24, 10.92s/it]
  6%|▋         | 130/2000 [26:15<5:36:33, 10.80s/it]
  7%|▋         | 131/2000 [26:26<5:36:08, 10.79s/it]
  7%|▋         | 132/2000 [26:38<5:56:04, 11.44s/it]
  7%|▋         | 133/2000 [26:50<5:54:22, 11.39s/it]
{'loss': 0.8799, 'learning_rate': 1.992828500787461e-05, 'epoch': 0.07}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11558
total_samples=2072, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:48,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.83 | bwd_microstep: 1970.77 | bwd_inner_microstep: 1750.34 | bwd_allreduce_microstep: 220.37 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11814
total_samples=2075, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:51,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.13 | bwd_microstep: 1909.50 | bwd_inner_microstep: 1559.22 | bwd_allreduce_microstep: 350.21 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13549
total_samples=2079, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:53,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.59 | bwd_microstep: 1809.71 | bwd_inner_microstep: 1702.45 | bwd_allreduce_microstep: 107.20 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13616
total_samples=2083, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:56,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 02:09:56,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.97 | bwd_microstep: 1776.37 | bwd_inner_microstep: 1662.72 | bwd_allreduce_microstep: 113.58 | step_microstep: 132.03
[2025-08-03 02:09:56,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.44 | bwd: 7466.39 | bwd_inner: 6674.73 | bwd_allreduce: 791.43 | step: 132.39
{'loss': 0.8782, 'learning_rate': 1.9926336067819686e-05, 'epoch': 0.07}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13383
total_samples=2087, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:09:59,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.42 | bwd_microstep: 1794.33 | bwd_inner_microstep: 1692.41 | bwd_allreduce_microstep: 101.86 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11667
total_samples=2090, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:01,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.77 | bwd_microstep: 1997.85 | bwd_inner_microstep: 1776.54 | bwd_allreduce_microstep: 221.24 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14325
total_samples=2094, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:04,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.54 | bwd_microstep: 1852.29 | bwd_inner_microstep: 1781.58 | bwd_allreduce_microstep: 70.64 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13601
total_samples=2098, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:07,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 02:10:07,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.43 | bwd_microstep: 2041.08 | bwd_inner_microstep: 1730.72 | bwd_allreduce_microstep: 310.29 | step_microstep: 140.44
[2025-08-03 02:10:07,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.10 | bwd: 7685.60 | bwd_inner: 6981.25 | bwd_allreduce: 704.12 | step: 140.90
{'loss': 0.8847, 'learning_rate': 1.9924361097105624e-05, 'epoch': 0.07}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13174
total_samples=2102, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:10,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.41 | bwd_microstep: 1723.53 | bwd_inner_microstep: 1658.45 | bwd_allreduce_microstep: 65.02 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15357
total_samples=2106, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:12,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.86 | bwd_microstep: 1755.06 | bwd_inner_microstep: 1748.79 | bwd_allreduce_microstep: 6.21 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13226
total_samples=2110, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:15,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.24 | bwd_microstep: 1851.34 | bwd_inner_microstep: 1729.03 | bwd_allreduce_microstep: 122.25 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14476
total_samples=2115, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:17,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 02:10:17,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.00 | bwd_microstep: 1804.78 | bwd_inner_microstep: 1760.13 | bwd_allreduce_microstep: 44.59 | step_microstep: 110.94
[2025-08-03 02:10:17,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2751.45 | bwd: 7134.77 | bwd_inner: 6896.39 | bwd_allreduce: 238.14 | step: 111.38
{'loss': 0.872, 'learning_rate': 1.9922360100911553e-05, 'epoch': 0.07}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13741
total_samples=2119, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:20,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.87 | bwd_microstep: 1904.28 | bwd_inner_microstep: 1833.71 | bwd_allreduce_microstep: 70.50 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13203
total_samples=2123, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:23,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.39 | bwd_microstep: 1734.16 | bwd_inner_microstep: 1665.81 | bwd_allreduce_microstep: 68.28 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14094
total_samples=2128, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:25,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.14 | bwd_microstep: 1737.71 | bwd_inner_microstep: 1703.55 | bwd_allreduce_microstep: 34.10 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13580
total_samples=2132, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:28,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 02:10:28,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.47 | bwd_microstep: 1926.87 | bwd_inner_microstep: 1726.52 | bwd_allreduce_microstep: 200.30 | step_microstep: 143.37
[2025-08-03 02:10:28,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2749.80 | bwd: 7303.08 | bwd_inner: 6929.58 | bwd_allreduce: 373.26 | step: 143.72
{'loss': 0.8552, 'learning_rate': 1.992033308448486e-05, 'epoch': 0.07}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14305
total_samples=2136, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:31,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.61 | bwd_microstep: 1729.60 | bwd_inner_microstep: 1662.53 | bwd_allreduce_microstep: 67.01 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12955
total_samples=2140, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:33,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.94 | bwd_microstep: 2115.32 | bwd_inner_microstep: 1923.67 | bwd_allreduce_microstep: 191.60 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13293
total_samples=2144, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:36,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.20 | bwd_microstep: 1782.78 | bwd_inner_microstep: 1702.97 | bwd_allreduce_microstep: 79.74 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11559
total_samples=2147, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:39,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:10:39,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.19 | bwd_microstep: 2177.41 | bwd_inner_microstep: 2048.36 | bwd_allreduce_microstep: 128.98 | step_microstep: 149.87
[2025-08-03 02:10:39,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.87 | bwd: 7805.15 | bwd_inner: 7337.52 | bwd_allreduce: 467.41 | step: 150.31
{'loss': 0.8518, 'learning_rate': 1.9918280053141144e-05, 'epoch': 0.07}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12670
total_samples=2151, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:42,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 658.37 | bwd_microstep: 1992.82 | bwd_inner_microstep: 1775.16 | bwd_allreduce_microstep: 217.59 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15141
total_samples=2155, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:44,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.10 | bwd_microstep: 1811.94 | bwd_inner_microstep: 1769.15 | bwd_allreduce_microstep: 42.72 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13836
total_samples=2159, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:47,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.35 | bwd_microstep: 2080.51 | bwd_inner_microstep: 1933.03 | bwd_allreduce_microstep: 147.43 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12650
total_samples=2163, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:52,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:10:52,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1552.25 | bwd_microstep: 2580.99 | bwd_inner_microstep: 2441.73 | bwd_allreduce_microstep: 139.19 | step_microstep: 108.62
[2025-08-03 02:10:52,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3609.01 | bwd: 8466.32 | bwd_inner: 7919.06 | bwd_allreduce: 547.01 | step: 109.04
  7%|▋         | 134/2000 [27:00<5:46:27, 11.14s/it]
  7%|▋         | 135/2000 [27:11<5:42:02, 11.00s/it]
  7%|▋         | 136/2000 [27:22<5:41:26, 10.99s/it]
  7%|▋         | 137/2000 [27:32<5:35:34, 10.81s/it]
  7%|▋         | 138/2000 [27:43<5:32:52, 10.73s/it]
  7%|▋         | 139/2000 [27:54<5:36:07, 10.84s/it]
  7%|▋         | 140/2000 [28:06<5:51:36, 11.34s/it]
{'loss': 0.8684, 'learning_rate': 1.9916201012264255e-05, 'epoch': 0.07}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12409
total_samples=2167, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:54,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.83 | bwd_microstep: 1720.38 | bwd_inner_microstep: 1569.91 | bwd_allreduce_microstep: 150.41 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12230
total_samples=2170, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:10:57,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.27 | bwd_microstep: 1992.32 | bwd_inner_microstep: 1572.64 | bwd_allreduce_microstep: 419.61 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13965
total_samples=2174, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:00,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.60 | bwd_microstep: 2017.82 | bwd_inner_microstep: 1898.67 | bwd_allreduce_microstep: 119.07 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12154
total_samples=2177, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:02,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 02:11:02,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.98 | bwd_microstep: 1888.79 | bwd_inner_microstep: 1716.39 | bwd_allreduce_microstep: 172.33 | step_microstep: 122.21
[2025-08-03 02:11:02,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.60 | bwd: 7619.35 | bwd_inner: 6757.61 | bwd_allreduce: 861.51 | step: 122.83
{'loss': 0.8644, 'learning_rate': 1.9914095967306224e-05, 'epoch': 0.07}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13555
total_samples=2181, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:05,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.86 | bwd_microstep: 1908.14 | bwd_inner_microstep: 1802.54 | bwd_allreduce_microstep: 105.54 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13561
total_samples=2185, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:08,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.30 | bwd_microstep: 1775.48 | bwd_inner_microstep: 1714.00 | bwd_allreduce_microstep: 61.40 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14144
total_samples=2189, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:10,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.41 | bwd_microstep: 1725.96 | bwd_inner_microstep: 1696.81 | bwd_allreduce_microstep: 29.09 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13266
total_samples=2193, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:13,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 02:11:13,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.81 | bwd_microstep: 1806.29 | bwd_inner_microstep: 1695.54 | bwd_allreduce_microstep: 110.69 | step_microstep: 139.49
[2025-08-03 02:11:13,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2750.31 | bwd: 7215.91 | bwd_inner: 6908.89 | bwd_allreduce: 306.80 | step: 139.88
{'loss': 0.8614, 'learning_rate': 1.9911964923787295e-05, 'epoch': 0.07}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13362
total_samples=2197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:16,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.26 | bwd_microstep: 1980.70 | bwd_inner_microstep: 1865.65 | bwd_allreduce_microstep: 114.98 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11677
total_samples=2200, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:18,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.19 | bwd_microstep: 1920.65 | bwd_inner_microstep: 1720.19 | bwd_allreduce_microstep: 200.39 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13542
total_samples=2204, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:21,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.72 | bwd_microstep: 1864.71 | bwd_inner_microstep: 1731.36 | bwd_allreduce_microstep: 133.28 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11920
total_samples=2207, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:24,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:11:24,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.47 | bwd_microstep: 1827.22 | bwd_inner_microstep: 1586.32 | bwd_allreduce_microstep: 240.84 | step_microstep: 141.58
[2025-08-03 02:11:24,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.57 | bwd: 7593.33 | bwd_inner: 6903.53 | bwd_allreduce: 689.56 | step: 141.92
{'loss': 0.8572, 'learning_rate': 1.990980788729588e-05, 'epoch': 0.07}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13906
total_samples=2212, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:26,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.42 | bwd_microstep: 1808.56 | bwd_inner_microstep: 1724.07 | bwd_allreduce_microstep: 84.43 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11502
total_samples=2215, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:29,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.32 | bwd_microstep: 1891.60 | bwd_inner_microstep: 1532.29 | bwd_allreduce_microstep: 359.25 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13308
total_samples=2219, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:32,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.27 | bwd_microstep: 1920.05 | bwd_inner_microstep: 1687.85 | bwd_allreduce_microstep: 232.13 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12018
total_samples=2222, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:35,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 02:11:35,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.70 | bwd_microstep: 1812.56 | bwd_inner_microstep: 1581.34 | bwd_allreduce_microstep: 231.15 | step_microstep: 111.68
[2025-08-03 02:11:35,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.63 | bwd: 7432.82 | bwd_inner: 6525.56 | bwd_allreduce: 907.03 | step: 112.04
{'loss': 0.8655, 'learning_rate': 1.990762486348855e-05, 'epoch': 0.07}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13969
total_samples=2229, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:38,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1568.22 | bwd_microstep: 1846.58 | bwd_inner_microstep: 1712.32 | bwd_allreduce_microstep: 134.20 | step_microstep: 0.09
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12584
total_samples=2233, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:41,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.79 | bwd_microstep: 1763.63 | bwd_inner_microstep: 1607.16 | bwd_allreduce_microstep: 156.41 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13740
total_samples=2237, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:43,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.94 | bwd_microstep: 1798.30 | bwd_inner_microstep: 1723.09 | bwd_allreduce_microstep: 75.14 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11625
total_samples=2240, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:46,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 02:11:46,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.41 | bwd_microstep: 1956.23 | bwd_inner_microstep: 1747.00 | bwd_allreduce_microstep: 209.17 | step_microstep: 135.80
[2025-08-03 02:11:46,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3662.30 | bwd: 7364.81 | bwd_inner: 6789.58 | bwd_allreduce: 574.99 | step: 136.16
{'loss': 0.8496, 'learning_rate': 1.9905415858090036e-05, 'epoch': 0.07}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12742
total_samples=2244, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:49,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.13 | bwd_microstep: 2046.83 | bwd_inner_microstep: 1827.89 | bwd_allreduce_microstep: 218.89 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11661
total_samples=2247, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:51,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.07 | bwd_microstep: 1857.44 | bwd_inner_microstep: 1545.26 | bwd_allreduce_microstep: 312.11 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13556
total_samples=2251, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:54,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.81 | bwd_microstep: 2147.86 | bwd_inner_microstep: 1930.48 | bwd_allreduce_microstep: 217.33 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12277
total_samples=2254, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:11:57,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 02:11:57,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.35 | bwd_microstep: 2193.01 | bwd_inner_microstep: 1886.15 | bwd_allreduce_microstep: 306.79 | step_microstep: 111.19
[2025-08-03 02:11:57,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2762.28 | bwd: 8245.22 | bwd_inner: 7189.76 | bwd_allreduce: 1055.19 | step: 111.55
{'loss': 0.8518, 'learning_rate': 1.9903180876893195e-05, 'epoch': 0.07}
  7%|▋         | 140/2000 [28:07<5:51:36, 11.34s/it]
  7%|▋         | 141/2000 [28:17<5:46:54, 11.20s/it]
  7%|▋         | 142/2000 [28:28<5:39:46, 10.97s/it]
  7%|▋         | 143/2000 [28:39<5:39:10, 10.96s/it]
  7%|▋         | 144/2000 [28:49<5:36:36, 10.88s/it]
  7%|▋         | 145/2000 [29:01<5:41:52, 11.06s/it]
  7%|▋         | 146/2000 [29:12<5:45:21, 11.18s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11958
total_samples=2257, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:00,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.18 | bwd_microstep: 2178.83 | bwd_inner_microstep: 1972.68 | bwd_allreduce_microstep: 206.09 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13035
total_samples=2262, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:03,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.85 | bwd_microstep: 2101.93 | bwd_inner_microstep: 1982.83 | bwd_allreduce_microstep: 119.04 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13998
total_samples=2266, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:06,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.44 | bwd_microstep: 2038.23 | bwd_inner_microstep: 1889.69 | bwd_allreduce_microstep: 148.47 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11912
total_samples=2269, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:09,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:12:09,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.38 | bwd_microstep: 1740.01 | bwd_inner_microstep: 1543.62 | bwd_allreduce_microstep: 196.33 | step_microstep: 147.57
[2025-08-03 02:12:09,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.79 | bwd: 8059.05 | bwd_inner: 7388.82 | bwd_allreduce: 670.00 | step: 147.99
{'loss': 0.8541, 'learning_rate': 1.9900919925759e-05, 'epoch': 0.07}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11643
total_samples=2272, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:12,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.11 | bwd_microstep: 2245.09 | bwd_inner_microstep: 1893.55 | bwd_allreduce_microstep: 351.47 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11744
total_samples=2275, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:14,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.81 | bwd_microstep: 1730.30 | bwd_inner_microstep: 1525.30 | bwd_allreduce_microstep: 204.94 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13847
total_samples=2279, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:17,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.32 | bwd_microstep: 2197.15 | bwd_inner_microstep: 1763.39 | bwd_allreduce_microstep: 433.70 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11786
total_samples=2282, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:20,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 02:12:20,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.97 | bwd_microstep: 1846.94 | bwd_inner_microstep: 1605.78 | bwd_allreduce_microstep: 241.09 | step_microstep: 148.62
[2025-08-03 02:12:20,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.14 | bwd: 8019.52 | bwd_inner: 6788.01 | bwd_allreduce: 1231.29 | step: 149.06
{'loss': 0.8639, 'learning_rate': 1.989863301061654e-05, 'epoch': 0.07}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12041
total_samples=2285, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:23,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.59 | bwd_microstep: 1768.89 | bwd_inner_microstep: 1574.31 | bwd_allreduce_microstep: 194.51 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11803
total_samples=2288, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:25,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.95 | bwd_microstep: 1804.13 | bwd_inner_microstep: 1555.39 | bwd_allreduce_microstep: 248.65 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13469
total_samples=2292, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:28,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.30 | bwd_microstep: 1851.78 | bwd_inner_microstep: 1805.03 | bwd_allreduce_microstep: 46.67 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14083
total_samples=2296, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:31,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 02:12:31,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.72 | bwd_microstep: 1726.50 | bwd_inner_microstep: 1694.96 | bwd_allreduce_microstep: 31.48 | step_microstep: 143.60
[2025-08-03 02:12:31,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.49 | bwd: 7151.35 | bwd_inner: 6629.69 | bwd_allreduce: 521.40 | step: 144.28
{'loss': 0.862, 'learning_rate': 1.9896320137462984e-05, 'epoch': 0.07}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 15926
total_samples=2300, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:35,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1826.16 | bwd_microstep: 2037.61 | bwd_inner_microstep: 1847.57 | bwd_allreduce_microstep: 189.98 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11918
total_samples=2303, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:37,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.07 | bwd_microstep: 1856.10 | bwd_inner_microstep: 1725.83 | bwd_allreduce_microstep: 130.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13632
total_samples=2307, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:40,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.96 | bwd_microstep: 1794.73 | bwd_inner_microstep: 1717.29 | bwd_allreduce_microstep: 77.37 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13952
total_samples=2311, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:42,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 02:12:42,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.01 | bwd_microstep: 1777.00 | bwd_inner_microstep: 1716.30 | bwd_allreduce_microstep: 60.63 | step_microstep: 146.37
[2025-08-03 02:12:42,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3859.13 | bwd: 7465.48 | bwd_inner: 7006.99 | bwd_allreduce: 458.26 | step: 146.70
{'loss': 0.8498, 'learning_rate': 1.9893981312363563e-05, 'epoch': 0.07}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12027
total_samples=2314, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:45,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.94 | bwd_microstep: 1740.70 | bwd_inner_microstep: 1582.19 | bwd_allreduce_microstep: 158.44 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12340
total_samples=2317, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:48,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.51 | bwd_microstep: 2024.05 | bwd_inner_microstep: 1815.45 | bwd_allreduce_microstep: 208.54 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13686
total_samples=2321, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:50,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.70 | bwd_microstep: 1771.01 | bwd_inner_microstep: 1697.21 | bwd_allreduce_microstep: 73.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14303
total_samples=2325, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:53,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 02:12:53,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.87 | bwd_microstep: 1762.58 | bwd_inner_microstep: 1734.15 | bwd_allreduce_microstep: 28.37 | step_microstep: 144.87
[2025-08-03 02:12:53,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.96 | bwd: 7298.38 | bwd_inner: 6828.99 | bwd_allreduce: 469.16 | step: 145.23
{'loss': 0.8676, 'learning_rate': 1.989161654145158e-05, 'epoch': 0.08}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12049
total_samples=2328, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:56,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.40 | bwd_microstep: 1828.59 | bwd_inner_microstep: 1594.02 | bwd_allreduce_microstep: 234.51 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12229
total_samples=2331, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:12:58,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.31 | bwd_microstep: 1926.36 | bwd_inner_microstep: 1574.79 | bwd_allreduce_microstep: 351.51 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13766
total_samples=2335, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:01,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.91 | bwd_microstep: 1815.29 | bwd_inner_microstep: 1695.55 | bwd_allreduce_microstep: 119.68 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12317
total_samples=2338, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:04,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 02:13:04,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.07 | bwd_microstep: 1945.66 | bwd_inner_microstep: 1580.58 | bwd_allreduce_microstep: 365.02 | step_microstep: 134.84
[2025-08-03 02:13:04,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.62 | bwd: 7515.95 | bwd_inner: 6444.93 | bwd_allreduce: 1070.80 | step: 135.26
{'loss': 0.8554, 'learning_rate': 1.9889225830928365e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13169
total_samples=2342, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:06,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.26 | bwd_microstep: 1909.90 | bwd_inner_microstep: 1690.99 | bwd_allreduce_microstep: 218.84 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13283
total_samples=2346, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:09,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.85 | bwd_microstep: 2202.76 | bwd_inner_microstep: 2074.93 | bwd_allreduce_microstep: 127.76 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13072
total_samples=2350, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:12,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.97 | bwd_microstep: 1993.83 | bwd_inner_microstep: 1695.80 | bwd_allreduce_microstep: 297.96 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14737
total_samples=2354, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:15,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 02:13:15,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.46 | bwd_microstep: 1758.74 | bwd_inner_microstep: 1734.18 | bwd_allreduce_microstep: 24.50 | step_microstep: 132.16
[2025-08-03 02:13:15,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2759.47 | bwd: 7865.28 | bwd_inner: 7195.90 | bwd_allreduce: 669.15 | step: 132.49
  7%|▋         | 147/2000 [29:24<5:46:27, 11.22s/it]
  7%|▋         | 148/2000 [29:35<5:47:22, 11.25s/it]
  7%|▋         | 149/2000 [29:45<5:39:33, 11.01s/it]
  8%|▊         | 150/2000 [29:57<5:46:45, 11.25s/it]
  8%|▊         | 151/2000 [30:08<5:40:00, 11.03s/it]
  8%|▊         | 152/2000 [30:19<5:37:14, 10.95s/it]
{'loss': 0.8564, 'learning_rate': 1.9886809187063285e-05, 'epoch': 0.08}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12104
total_samples=2358, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:18,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.15 | bwd_microstep: 2699.35 | bwd_inner_microstep: 1907.52 | bwd_allreduce_microstep: 791.76 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12370
total_samples=2362, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:21,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.40 | bwd_microstep: 1804.35 | bwd_inner_microstep: 1582.26 | bwd_allreduce_microstep: 222.03 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13285
total_samples=2366, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:23,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.96 | bwd_microstep: 1870.41 | bwd_inner_microstep: 1823.06 | bwd_allreduce_microstep: 47.29 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13329
total_samples=2370, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:26,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 02:13:26,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.64 | bwd_microstep: 1928.40 | bwd_inner_microstep: 1683.82 | bwd_allreduce_microstep: 244.52 | step_microstep: 132.06
[2025-08-03 02:13:26,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.08 | bwd: 8302.56 | bwd_inner: 6996.65 | bwd_allreduce: 1305.68 | step: 132.49
{'loss': 0.8541, 'learning_rate': 1.9884366616193707e-05, 'epoch': 0.08}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12127
total_samples=2373, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:29,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.42 | bwd_microstep: 2085.38 | bwd_inner_microstep: 1873.84 | bwd_allreduce_microstep: 211.47 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13442
total_samples=2377, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:32,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.78 | bwd_microstep: 2126.35 | bwd_inner_microstep: 1906.46 | bwd_allreduce_microstep: 219.81 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14257
total_samples=2381, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:35,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.92 | bwd_microstep: 1739.39 | bwd_inner_microstep: 1695.32 | bwd_allreduce_microstep: 44.01 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13414
total_samples=2385, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:37,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:13:37,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.61 | bwd_microstep: 1873.34 | bwd_inner_microstep: 1673.45 | bwd_allreduce_microstep: 199.83 | step_microstep: 127.59
[2025-08-03 02:13:37,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.64 | bwd: 7824.51 | bwd_inner: 7149.07 | bwd_allreduce: 675.19 | step: 127.91
{'loss': 0.8541, 'learning_rate': 1.988189812472498e-05, 'epoch': 0.08}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11575
total_samples=2388, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:40,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 749.36 | bwd_microstep: 1843.97 | bwd_inner_microstep: 1602.18 | bwd_allreduce_microstep: 241.73 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11997
total_samples=2391, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:43,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.06 | bwd_microstep: 2075.54 | bwd_inner_microstep: 1833.06 | bwd_allreduce_microstep: 242.42 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12793
total_samples=2395, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:45,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.32 | bwd_microstep: 1770.20 | bwd_inner_microstep: 1640.51 | bwd_allreduce_microstep: 129.62 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11498
total_samples=2398, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:48,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 02:13:48,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.08 | bwd_microstep: 1765.55 | bwd_inner_microstep: 1559.74 | bwd_allreduce_microstep: 205.73 | step_microstep: 133.46
[2025-08-03 02:13:48,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2889.75 | bwd: 7455.31 | bwd_inner: 6635.49 | bwd_allreduce: 819.59 | step: 133.77
{'loss': 0.8576, 'learning_rate': 1.987940371913044e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14816
total_samples=2402, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:51,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.34 | bwd_microstep: 1825.69 | bwd_inner_microstep: 1759.97 | bwd_allreduce_microstep: 65.66 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508
total_samples=2406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:53,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.95 | bwd_microstep: 1751.00 | bwd_inner_microstep: 1679.42 | bwd_allreduce_microstep: 71.51 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12890
total_samples=2410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:56,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.69 | bwd_microstep: 1824.02 | bwd_inner_microstep: 1671.29 | bwd_allreduce_microstep: 152.66 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13448
total_samples=2414, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:13:59,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 02:13:59,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.60 | bwd_microstep: 1850.38 | bwd_inner_microstep: 1721.69 | bwd_allreduce_microstep: 128.63 | step_microstep: 124.94
[2025-08-03 02:13:59,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.51 | bwd: 7251.13 | bwd_inner: 6832.38 | bwd_allreduce: 418.52 | step: 125.39
{'loss': 0.8577, 'learning_rate': 1.9876883405951378e-05, 'epoch': 0.08}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12158
total_samples=2417, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:01,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.95 | bwd_microstep: 1977.99 | bwd_inner_microstep: 1558.84 | bwd_allreduce_microstep: 419.04 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13347
total_samples=2421, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:04,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.75 | bwd_microstep: 1804.98 | bwd_inner_microstep: 1710.27 | bwd_allreduce_microstep: 94.65 | step_microstep: 0.18
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14255
total_samples=2425, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:07,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.93 | bwd_microstep: 1854.56 | bwd_inner_microstep: 1724.82 | bwd_allreduce_microstep: 129.67 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13319
total_samples=2429, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:09,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:14:09,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.77 | bwd_microstep: 1775.45 | bwd_inner_microstep: 1697.64 | bwd_allreduce_microstep: 77.74 | step_microstep: 110.14
[2025-08-03 02:14:09,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2708.32 | bwd: 7413.02 | bwd_inner: 6691.59 | bwd_allreduce: 721.17 | step: 110.63
{'loss': 0.8601, 'learning_rate': 1.987433719179702e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13388
total_samples=2433, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:12,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.05 | bwd_microstep: 2022.50 | bwd_inner_microstep: 1876.02 | bwd_allreduce_microstep: 146.42 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12844
total_samples=2437, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:15,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.39 | bwd_microstep: 1763.25 | bwd_inner_microstep: 1657.97 | bwd_allreduce_microstep: 105.22 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13311
total_samples=2441, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:17,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.66 | bwd_microstep: 1816.90 | bwd_inner_microstep: 1760.00 | bwd_allreduce_microstep: 56.84 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13031
total_samples=2444, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:20,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 02:14:20,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.18 | bwd_microstep: 1779.01 | bwd_inner_microstep: 1598.76 | bwd_allreduce_microstep: 180.18 | step_microstep: 145.57
[2025-08-03 02:14:20,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2762.20 | bwd: 7381.71 | bwd_inner: 6892.74 | bwd_allreduce: 488.74 | step: 145.92
  8%|▊         | 153/2000 [30:30<5:38:21, 10.99s/it]
  8%|▊         | 154/2000 [30:41<5:43:17, 11.16s/it]
  8%|▊         | 155/2000 [30:52<5:42:30, 11.14s/it]
  8%|▊         | 156/2000 [31:03<5:38:48, 11.02s/it]
  8%|▊         | 157/2000 [31:14<5:34:02, 10.87s/it]
  8%|▊         | 158/2000 [31:24<5:31:08, 10.79s/it]
  8%|▊         | 159/2000 [31:35<5:29:28,
{'loss': 0.8528, 'learning_rate': 1.987176508334451e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13620
total_samples=2448, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:23,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.77 | bwd_microstep: 1985.66 | bwd_inner_microstep: 1736.28 | bwd_allreduce_microstep: 249.32 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11567
total_samples=2451, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:25,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.30 | bwd_microstep: 1859.42 | bwd_inner_microstep: 1832.35 | bwd_allreduce_microstep: 27.01 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13506
total_samples=2455, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:28,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.19 | bwd_microstep: 2164.28 | bwd_inner_microstep: 1904.78 | bwd_allreduce_microstep: 259.43 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12296
total_samples=2458, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:31,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 02:14:31,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.93 | bwd_microstep: 1887.52 | bwd_inner_microstep: 1737.50 | bwd_allreduce_microstep: 149.96 | step_microstep: 422.05
[2025-08-03 02:14:31,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.11 | bwd: 7896.93 | bwd_inner: 7210.90 | bwd_allreduce: 685.79 | step: 422.37
{'loss': 0.8584, 'learning_rate': 1.9869167087338908e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14028
total_samples=2462, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:34,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 749.42 | bwd_microstep: 1810.35 | bwd_inner_microstep: 1731.66 | bwd_allreduce_microstep: 78.62 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15003
total_samples=2466, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:37,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.30 | bwd_microstep: 1996.09 | bwd_inner_microstep: 1875.79 | bwd_allreduce_microstep: 120.25 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15593
total_samples=2471, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:39,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.74 | bwd_microstep: 1774.57 | bwd_inner_microstep: 1752.91 | bwd_allreduce_microstep: 21.60 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12005
total_samples=2474, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:42,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 02:14:42,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.76 | bwd_microstep: 1719.85 | bwd_inner_microstep: 1543.20 | bwd_allreduce_microstep: 176.58 | step_microstep: 154.24
[2025-08-03 02:14:42,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2849.15 | bwd: 7300.91 | bwd_inner: 6903.55 | bwd_allreduce: 397.13 | step: 154.71
{'loss': 0.856, 'learning_rate': 1.9866543210593154e-05, 'epoch': 0.08}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13038
total_samples=2478, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:45,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.46 | bwd_microstep: 1903.65 | bwd_inner_microstep: 1642.71 | bwd_allreduce_microstep: 260.87 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11967
total_samples=2481, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:47,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.37 | bwd_microstep: 1831.24 | bwd_inner_microstep: 1596.85 | bwd_allreduce_microstep: 234.33 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12683
total_samples=2485, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:50,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.46 | bwd_microstep: 1913.65 | bwd_inner_microstep: 1801.60 | bwd_allreduce_microstep: 111.99 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13555
total_samples=2489, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:53,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 02:14:53,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.01 | bwd_microstep: 2176.43 | bwd_inner_microstep: 1916.34 | bwd_allreduce_microstep: 260.02 | step_microstep: 112.36
[2025-08-03 02:14:53,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.24 | bwd: 7825.01 | bwd_inner: 6957.48 | bwd_allreduce: 867.28 | step: 112.82
{'loss': 0.8475, 'learning_rate': 1.986389345998806e-05, 'epoch': 0.08}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11766
total_samples=2492, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:56,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.38 | bwd_microstep: 1715.59 | bwd_inner_microstep: 1518.66 | bwd_allreduce_microstep: 196.87 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11767
total_samples=2495, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:14:58,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.69 | bwd_microstep: 1786.05 | bwd_inner_microstep: 1548.51 | bwd_allreduce_microstep: 237.47 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14029
total_samples=2500, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:01,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.44 | bwd_microstep: 1779.45 | bwd_inner_microstep: 1694.92 | bwd_allreduce_microstep: 84.46 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12891
total_samples=2504, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:03,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 02:15:03,960] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.74 | bwd_microstep: 1899.05 | bwd_inner_microstep: 1831.18 | bwd_allreduce_microstep: 67.82 | step_microstep: 108.95
[2025-08-03 02:15:03,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.17 | bwd: 7180.20 | bwd_inner: 6593.28 | bwd_allreduce: 586.69 | step: 109.41
{'loss': 0.8444, 'learning_rate': 1.986121784247229e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13247
total_samples=2508, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:06,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.79 | bwd_microstep: 1772.27 | bwd_inner_microstep: 1671.95 | bwd_allreduce_microstep: 100.26 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14098
total_samples=2512, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:09,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 914.69 | bwd_microstep: 1808.18 | bwd_inner_microstep: 1734.25 | bwd_allreduce_microstep: 73.88 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11636
total_samples=2515, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:11,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.13 | bwd_microstep: 1750.48 | bwd_inner_microstep: 1526.79 | bwd_allreduce_microstep: 223.62 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11894
total_samples=2518, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:14,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 02:15:14,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.15 | bwd_microstep: 1724.88 | bwd_inner_microstep: 1552.13 | bwd_allreduce_microstep: 172.68 | step_microstep: 138.85
[2025-08-03 02:15:14,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3010.70 | bwd: 7055.86 | bwd_inner: 6485.12 | bwd_allreduce: 570.51 | step: 139.19
{'loss': 0.8663, 'learning_rate': 1.9858516365062334e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13784
total_samples=2522, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:17,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.03 | bwd_microstep: 2419.38 | bwd_inner_microstep: 2254.13 | bwd_allreduce_microstep: 165.18 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13328
total_samples=2526, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:20,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.21 | bwd_microstep: 1759.65 | bwd_inner_microstep: 1692.82 | bwd_allreduce_microstep: 66.76 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13130
total_samples=2530, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:22,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.49 | bwd_microstep: 1773.79 | bwd_inner_microstep: 1681.89 | bwd_allreduce_microstep: 91.83 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12740
total_samples=2534, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:25,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 02:15:25,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.25 | bwd_microstep: 1971.29 | bwd_inner_microstep: 1662.76 | bwd_allreduce_microstep: 308.47 | step_microstep: 137.69
[2025-08-03 02:15:25,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.91 | bwd: 7924.16 | bwd_inner: 7291.60 | bwd_allreduce: 632.32 | step: 138.13
  8%|▊         | 159/2000 [31:35<5:29:28, 10.74s/it]
  8%|▊         | 160/2000 [31:46<5:35:37, 10.94s/it]
  8%|▊         | 161/2000 [31:57<5:32:36, 10.85s/it]
  8%|▊         | 162/2000 [32:08<5:34:36, 10.92s/it]
  8%|▊         | 163/2000 [32:18<5:29:53, 10.78s/it]
  8%|▊         | 164/2000 [32:29<5:27:28, 10.70s/it]
  8%|▊         | 165/2000 [32:40<5:31:53, 10.85s/it]
{'loss': 0.8579, 'learning_rate': 1.9855789034842504e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13418
total_samples=2538, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:28,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.83 | bwd_microstep: 2074.86 | bwd_inner_microstep: 1912.73 | bwd_allreduce_microstep: 162.07 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14689
total_samples=2545, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:31,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.00 | bwd_microstep: 1758.68 | bwd_inner_microstep: 1730.42 | bwd_allreduce_microstep: 28.20 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11709
total_samples=2548, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:34,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.17 | bwd_microstep: 2132.78 | bwd_inner_microstep: 1812.74 | bwd_allreduce_microstep: 319.97 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332
total_samples=2552, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:36,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 02:15:36,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.16 | bwd_microstep: 1757.15 | bwd_inner_microstep: 1667.97 | bwd_allreduce_microstep: 89.11 | step_microstep: 134.78
[2025-08-03 02:15:36,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.08 | bwd: 7723.52 | bwd_inner: 7123.85 | bwd_allreduce: 599.43 | step: 135.27
{'loss': 0.8395, 'learning_rate': 1.9853035858964907e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13250
total_samples=2556, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:39,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.21 | bwd_microstep: 1960.58 | bwd_inner_microstep: 1954.71 | bwd_allreduce_microstep: 5.81 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13309
total_samples=2560, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:42,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.05 | bwd_microstep: 1975.65 | bwd_inner_microstep: 1844.60 | bwd_allreduce_microstep: 130.99 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13152
total_samples=2564, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:44,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.88 | bwd_microstep: 1956.54 | bwd_inner_microstep: 1618.46 | bwd_allreduce_microstep: 338.01 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13202
total_samples=2568, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:48,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.76
[2025-08-03 02:15:48,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.24 | bwd_microstep: 2108.76 | bwd_inner_microstep: 2060.52 | bwd_allreduce_microstep: 48.16 | step_microstep: 440.44
[2025-08-03 02:15:48,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.31 | bwd: 8001.58 | bwd_inner: 7478.29 | bwd_allreduce: 523.05 | step: 440.87
{'loss': 0.8477, 'learning_rate': 1.9850256844649422e-05, 'epoch': 0.08}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12019
total_samples=2572, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:51,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.12 | bwd_microstep: 2012.69 | bwd_inner_microstep: 1797.06 | bwd_allreduce_microstep: 215.56 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12465
total_samples=2576, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:53,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.14 | bwd_microstep: 1864.20 | bwd_inner_microstep: 1577.37 | bwd_allreduce_microstep: 286.77 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13249
total_samples=2580, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:56,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.61 | bwd_microstep: 1778.86 | bwd_inner_microstep: 1684.72 | bwd_allreduce_microstep: 94.05 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13732
total_samples=2584, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:15:59,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 02:15:59,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.29 | bwd_microstep: 1922.62 | bwd_inner_microstep: 1720.86 | bwd_allreduce_microstep: 201.68 | step_microstep: 145.97
[2025-08-03 02:15:59,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.10 | bwd: 7578.39 | bwd_inner: 6780.01 | bwd_allreduce: 798.14 | step: 146.39
{'loss': 0.8437, 'learning_rate': 1.9847451999183692e-05, 'epoch': 0.08}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13423
total_samples=2588, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:01,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.79 | bwd_microstep: 1812.78 | bwd_inner_microstep: 1786.84 | bwd_allreduce_microstep: 25.88 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13281
total_samples=2592, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:04,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.70 | bwd_microstep: 1819.63 | bwd_inner_microstep: 1694.77 | bwd_allreduce_microstep: 124.79 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11840
total_samples=2595, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:06,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.59 | bwd_microstep: 1750.34 | bwd_inner_microstep: 1552.31 | bwd_allreduce_microstep: 197.96 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13637
total_samples=2599, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:09,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:16:09,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.61 | bwd_microstep: 1902.63 | bwd_inner_microstep: 1713.94 | bwd_allreduce_microstep: 188.63 | step_microstep: 127.53
[2025-08-03 02:16:09,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2746.64 | bwd: 7285.43 | bwd_inner: 6747.85 | bwd_allreduce: 537.34 | step: 127.99
{'loss': 0.8444, 'learning_rate': 1.98446213299231e-05, 'epoch': 0.08}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13525
total_samples=2603, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:12,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.95 | bwd_microstep: 1924.72 | bwd_inner_microstep: 1694.40 | bwd_allreduce_microstep: 230.24 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11724
total_samples=2606, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:14,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.56 | bwd_microstep: 1742.23 | bwd_inner_microstep: 1542.35 | bwd_allreduce_microstep: 199.82 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13860
total_samples=2610, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:17,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 664.84 | bwd_microstep: 1879.03 | bwd_inner_microstep: 1746.42 | bwd_allreduce_microstep: 132.55 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13696
total_samples=2614, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:20,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 02:16:20,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.39 | bwd_microstep: 1740.57 | bwd_inner_microstep: 1676.06 | bwd_allreduce_microstep: 64.44 | step_microstep: 128.52
[2025-08-03 02:16:20,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2715.68 | bwd: 7286.60 | bwd_inner: 6659.21 | bwd_allreduce: 627.13 | step: 129.02
{'loss': 0.8447, 'learning_rate': 1.9841764844290744e-05, 'epoch': 0.09}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12317
total_samples=2618, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:22,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.58 | bwd_microstep: 1801.31 | bwd_inner_microstep: 1602.57 | bwd_allreduce_microstep: 198.68 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11739
total_samples=2621, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:25,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.63 | bwd_microstep: 1786.52 | bwd_inner_microstep: 1537.04 | bwd_allreduce_microstep: 249.42 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15416
total_samples=2625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:27,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.25 | bwd_microstep: 1803.83 | bwd_inner_microstep: 1774.24 | bwd_allreduce_microstep: 29.53 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13823
total_samples=2629, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:30,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 02:16:30,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.18 | bwd_microstep: 1834.08 | bwd_inner_microstep: 1717.87 | bwd_allreduce_microstep: 116.15 | step_microstep: 137.23
[2025-08-03 02:16:30,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.58 | bwd: 7225.79 | bwd_inner: 6631.71 | bwd_allreduce: 593.85 | step: 137.71
{'loss': 0.8513, 'learning_rate': 1.9838882549777426e-05, 'epoch': 0.09}
  8%|▊         | 166/2000 [32:51<5:32:57, 10.89s/it]
  8%|▊         | 167/2000 [33:03<5:39:22, 11.11s/it]
  8%|▊         | 168/2000 [33:13<5:36:42, 11.03s/it]
  8%|▊         | 169/2000 [33:24<5:31:37, 10.87s/it]
  8%|▊         | 170/2000 [33:34<5:27:44, 10.75s/it]
  9%|▊         | 171/2000 [33:45<5:25:24, 10.67s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891
total_samples=2632, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:33,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.62 | bwd_microstep: 2068.88 | bwd_inner_microstep: 1813.15 | bwd_allreduce_microstep: 255.67 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12370
total_samples=2635, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:36,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.13 | bwd_microstep: 2038.83 | bwd_inner_microstep: 1805.92 | bwd_allreduce_microstep: 232.84 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11773
total_samples=2638, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:39,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.90 | bwd_microstep: 1994.97 | bwd_inner_microstep: 1777.23 | bwd_allreduce_microstep: 217.68 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15372
total_samples=2643, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:41,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 02:16:41,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.81 | bwd_microstep: 1903.03 | bwd_inner_microstep: 1754.01 | bwd_allreduce_microstep: 148.95 | step_microstep: 124.84
[2025-08-03 02:16:41,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2753.38 | bwd: 8005.74 | bwd_inner: 7150.29 | bwd_allreduce: 855.22 | step: 125.27
{'loss': 0.8425, 'learning_rate': 1.9835974453941623e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14405
total_samples=2647, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:44,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.92 | bwd_microstep: 2157.77 | bwd_inner_microstep: 1914.31 | bwd_allreduce_microstep: 243.40 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11879
total_samples=2650, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:47,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.54 | bwd_microstep: 1786.99 | bwd_inner_microstep: 1588.67 | bwd_allreduce_microstep: 198.25 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11655
total_samples=2653, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:49,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.00 | bwd_microstep: 1847.62 | bwd_inner_microstep: 1841.56 | bwd_allreduce_microstep: 6.00 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14377
total_samples=2657, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:53,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:16:53,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 667.26 | bwd_microstep: 2491.89 | bwd_inner_microstep: 2141.61 | bwd_allreduce_microstep: 350.22 | step_microstep: 108.81
[2025-08-03 02:16:53,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.65 | bwd: 8284.32 | bwd_inner: 7486.14 | bwd_allreduce: 797.93 | step: 109.26
{'loss': 0.8456, 'learning_rate': 1.983304056440948e-05, 'epoch': 0.09}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12017
total_samples=2660, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:56,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.72 | bwd_microstep: 2269.08 | bwd_inner_microstep: 1935.97 | bwd_allreduce_microstep: 333.05 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11877
total_samples=2663, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:16:59,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.90 | bwd_microstep: 1905.01 | bwd_inner_microstep: 1580.14 | bwd_allreduce_microstep: 324.81 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13350
total_samples=2667, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:01,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.64 | bwd_microstep: 1894.53 | bwd_inner_microstep: 1686.58 | bwd_allreduce_microstep: 207.88 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14342
total_samples=2671, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:04,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 02:17:04,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.46 | bwd_microstep: 1915.59 | bwd_inner_microstep: 1748.53 | bwd_allreduce_microstep: 166.99 | step_microstep: 109.86
[2025-08-03 02:17:04,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.64 | bwd: 7984.26 | bwd_inner: 6951.22 | bwd_allreduce: 1032.80 | step: 110.21
{'loss': 0.8502, 'learning_rate': 1.983008088887478e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13957
total_samples=2675, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:07,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.33 | bwd_microstep: 1879.16 | bwd_inner_microstep: 1746.07 | bwd_allreduce_microstep: 133.03 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14314
total_samples=2679, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:09,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.53 | bwd_microstep: 1880.24 | bwd_inner_microstep: 1849.25 | bwd_allreduce_microstep: 30.93 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13384
total_samples=2683, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:12,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.19 | bwd_microstep: 1808.91 | bwd_inner_microstep: 1715.54 | bwd_allreduce_microstep: 93.30 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13727
total_samples=2687, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:15,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 02:17:15,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.29 | bwd_microstep: 1762.42 | bwd_inner_microstep: 1681.79 | bwd_allreduce_microstep: 80.56 | step_microstep: 135.75
[2025-08-03 02:17:15,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.28 | bwd: 7330.78 | bwd_inner: 6992.64 | bwd_allreduce: 337.89 | step: 136.20
{'loss': 0.8396, 'learning_rate': 1.9827095435098926e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13504
total_samples=2691, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:18,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.44 | bwd_microstep: 2072.13 | bwd_inner_microstep: 1914.41 | bwd_allreduce_microstep: 157.66 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13795
total_samples=2695, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:20,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.88 | bwd_microstep: 1782.07 | bwd_inner_microstep: 1709.27 | bwd_allreduce_microstep: 72.73 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=2698, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:23,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.61 | bwd_microstep: 1746.48 | bwd_inner_microstep: 1550.66 | bwd_allreduce_microstep: 195.75 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13660
total_samples=2702, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:25,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 02:17:25,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.97 | bwd_microstep: 1793.61 | bwd_inner_microstep: 1683.06 | bwd_allreduce_microstep: 110.49 | step_microstep: 114.49
[2025-08-03 02:17:25,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.83 | bwd: 7394.35 | bwd_inner: 6857.40 | bwd_allreduce: 536.71 | step: 114.93
{'loss': 0.8398, 'learning_rate': 1.9824084210910924e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13865
total_samples=2706, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:28,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.64 | bwd_microstep: 1757.87 | bwd_inner_microstep: 1725.28 | bwd_allreduce_microstep: 32.52 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14150
total_samples=2710, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:30,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.75 | bwd_microstep: 1792.01 | bwd_inner_microstep: 1740.38 | bwd_allreduce_microstep: 51.57 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15035
total_samples=2714, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:33,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.70 | bwd_microstep: 1976.95 | bwd_inner_microstep: 1780.29 | bwd_allreduce_microstep: 196.59 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13902
total_samples=2719, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:36,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 02:17:36,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.29 | bwd_microstep: 1933.47 | bwd_inner_microstep: 1906.46 | bwd_allreduce_microstep: 26.95 | step_microstep: 110.24
[2025-08-03 02:17:36,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.32 | bwd: 7460.33 | bwd_inner: 7152.40 | bwd_allreduce: 307.69 | step: 110.71
{'loss': 0.8333, 'learning_rate': 1.9821047224207362e-05, 'epoch': 0.09}
  9%|▊         | 172/2000 [33:56<5:30:24, 10.85s/it]
  9%|▊         | 173/2000 [34:08<5:36:25, 11.05s/it]
  9%|▊         | 174/2000 [34:19<5:37:44, 11.10s/it]
  9%|▉         | 175/2000 [34:30<5:33:19, 10.96s/it]
  9%|▉         | 176/2000 [34:40<5:30:01, 10.86s/it]
  9%|▉         | 177/2000 [34:51<5:28:24, 10.81s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13337
total_samples=2723, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:39,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.48 | bwd_microstep: 1811.83 | bwd_inner_microstep: 1697.14 | bwd_allreduce_microstep: 114.62 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14164
total_samples=2727, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:41,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.12 | bwd_microstep: 1718.13 | bwd_inner_microstep: 1697.73 | bwd_allreduce_microstep: 20.33 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14558
total_samples=2731, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:44,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.55 | bwd_microstep: 2046.74 | bwd_inner_microstep: 1904.19 | bwd_allreduce_microstep: 142.48 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14630
total_samples=2735, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:47,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 02:17:47,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.42 | bwd_microstep: 1717.41 | bwd_inner_microstep: 1698.08 | bwd_allreduce_microstep: 19.27 | step_microstep: 141.90
[2025-08-03 02:17:47,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2744.49 | bwd: 7294.15 | bwd_inner: 6997.14 | bwd_allreduce: 296.78 | step: 142.35
{'loss': 0.8343, 'learning_rate': 1.9817984482952378e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14896
total_samples=2739, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:49,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.97 | bwd_microstep: 1799.72 | bwd_inner_microstep: 1749.70 | bwd_allreduce_microstep: 49.96 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13340
total_samples=2743, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:52,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.11 | bwd_microstep: 2185.40 | bwd_inner_microstep: 2010.01 | bwd_allreduce_microstep: 175.32 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16104
total_samples=2747, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:55,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.61 | bwd_microstep: 1960.60 | bwd_inner_microstep: 1821.11 | bwd_allreduce_microstep: 139.43 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13826
total_samples=2751, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:17:58,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 02:17:58,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.30 | bwd_microstep: 1736.28 | bwd_inner_microstep: 1699.06 | bwd_allreduce_microstep: 37.15 | step_microstep: 114.76
[2025-08-03 02:17:58,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.92 | bwd: 7682.05 | bwd_inner: 7279.88 | bwd_allreduce: 401.93 | step: 115.09
{'loss': 0.8342, 'learning_rate': 1.9814895995177653e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15373
total_samples=2755, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:00,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.97 | bwd_microstep: 1755.92 | bwd_inner_microstep: 1740.75 | bwd_allreduce_microstep: 15.10 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13345
total_samples=2759, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:03,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.73 | bwd_microstep: 2090.40 | bwd_inner_microstep: 1719.05 | bwd_allreduce_microstep: 371.29 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13824
total_samples=2763, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:06,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.09 | bwd_microstep: 1848.41 | bwd_inner_microstep: 1816.50 | bwd_allreduce_microstep: 31.85 | step_microstep: 0.19
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13737
total_samples=2767, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:08,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 02:18:08,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.12 | bwd_microstep: 1836.77 | bwd_inner_microstep: 1714.55 | bwd_allreduce_microstep: 122.16 | step_microstep: 131.74
[2025-08-03 02:18:08,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.84 | bwd: 7531.56 | bwd_inner: 6990.84 | bwd_allreduce: 540.48 | step: 132.16
{'loss': 0.842, 'learning_rate': 1.9811781768982392e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13345
total_samples=2771, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:11,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.82 | bwd_microstep: 2110.16 | bwd_inner_microstep: 1895.83 | bwd_allreduce_microstep: 214.26 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13478
total_samples=2776, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:14,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.66 | bwd_microstep: 1730.94 | bwd_inner_microstep: 1664.41 | bwd_allreduce_microstep: 66.47 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13133
total_samples=2780, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:17,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.92 | bwd_microstep: 2037.68 | bwd_inner_microstep: 1873.33 | bwd_allreduce_microstep: 164.29 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12418
total_samples=2783, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:19,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 02:18:19,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.97 | bwd_microstep: 1827.32 | bwd_inner_microstep: 1717.44 | bwd_allreduce_microstep: 109.81 | step_microstep: 109.54
[2025-08-03 02:18:19,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.31 | bwd: 7706.15 | bwd_inner: 7151.01 | bwd_allreduce: 554.91 | step: 109.86
{'loss': 0.8371, 'learning_rate': 1.9808641812533286e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13306
total_samples=2787, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:22,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 666.60 | bwd_microstep: 1738.57 | bwd_inner_microstep: 1668.55 | bwd_allreduce_microstep: 69.95 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13536
total_samples=2791, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:24,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.34 | bwd_microstep: 1965.58 | bwd_inner_microstep: 1850.64 | bwd_allreduce_microstep: 114.87 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11863
total_samples=2794, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:27,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.19 | bwd_microstep: 1780.81 | bwd_inner_microstep: 1544.64 | bwd_allreduce_microstep: 236.10 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12435
total_samples=2797, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:30,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 02:18:30,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.00 | bwd_microstep: 2027.94 | bwd_inner_microstep: 1671.17 | bwd_allreduce_microstep: 356.70 | step_microstep: 118.35
[2025-08-03 02:18:30,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2756.06 | bwd: 7512.94 | bwd_inner: 6735.00 | bwd_allreduce: 777.70 | step: 118.82
{'loss': 0.8474, 'learning_rate': 1.980547613406451e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14011
total_samples=2801, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:33,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.68 | bwd_microstep: 1744.88 | bwd_inner_microstep: 1700.06 | bwd_allreduce_microstep: 44.75 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13416
total_samples=2805, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:35,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.61 | bwd_microstep: 2166.89 | bwd_inner_microstep: 1735.21 | bwd_allreduce_microstep: 431.61 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11642
total_samples=2808, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:38,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.23 | bwd_microstep: 1764.86 | bwd_inner_microstep: 1534.65 | bwd_allreduce_microstep: 230.15 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13552
total_samples=2813, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:41,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 02:18:41,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.94 | bwd_microstep: 1853.40 | bwd_inner_microstep: 1801.09 | bwd_allreduce_microstep: 52.24 | step_microstep: 137.40
[2025-08-03 02:18:41,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2732.41 | bwd: 7530.08 | bwd_inner: 6771.02 | bwd_allreduce: 758.82 | step: 137.73
{'loss': 0.8575, 'learning_rate': 1.9802284741877674e-05, 'epoch': 0.09}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14239
total_samples=2819, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:43,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.03 | bwd_microstep: 1763.79 | bwd_inner_microstep: 1701.11 | bwd_allreduce_microstep: 62.63 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13365
total_samples=2823, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:46,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.92 | bwd_microstep: 1828.38 | bwd_inner_microstep: 1700.86 | bwd_allreduce_microstep: 127.45 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13147
total_samples=2827, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:48,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.07 | bwd_microstep: 1765.69 | bwd_inner_microstep: 1670.00 | bwd_allreduce_microstep: 95.62 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13388
total_samples=2831, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:51,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 02:18:51,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.81 | bwd_microstep: 1762.61 | bwd_inner_microstep: 1682.48 | bwd_allreduce_microstep: 80.07 | step_microstep: 135.33
[2025-08-03 02:18:51,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.75 | bwd: 7120.51 | bwd_inner: 6754.44 | bwd_allreduce: 365.85 | step: 135.66
  9%|▉         | 178/2000 [35:01<5:25:46, 10.73s/it]
  9%|▉         | 179/2000 [35:12<5:27:32, 10.79s/it]
  9%|▉         | 180/2000 [35:23<5:27:16, 10.79s/it]
  9%|▉         | 181/2000 [35:34<5:28:28, 10.83s/it]
  9%|▉         | 182/2000 [35:45<5:27:18, 10.80s/it]
  9%|▉         | 183/2000 [35:56<5:26:46, 10.79s/it]
{'loss': 0.8394, 'learning_rate': 1.9799067644341844e-05, 'epoch': 0.09}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11932
total_samples=2834, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:54,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.60 | bwd_microstep: 1977.93 | bwd_inner_microstep: 1726.84 | bwd_allreduce_microstep: 251.03 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14251
total_samples=2838, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:56,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.67 | bwd_microstep: 1758.19 | bwd_inner_microstep: 1716.92 | bwd_allreduce_microstep: 41.21 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13649
total_samples=2842, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:18:59,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.26 | bwd_microstep: 2039.15 | bwd_inner_microstep: 1884.49 | bwd_allreduce_microstep: 154.60 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13039
total_samples=2846, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:02,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 02:19:02,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.92 | bwd_microstep: 1876.46 | bwd_inner_microstep: 1657.09 | bwd_allreduce_microstep: 219.31 | step_microstep: 122.42
[2025-08-03 02:19:02,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2766.40 | bwd: 7651.77 | bwd_inner: 6985.33 | bwd_allreduce: 666.22 | step: 122.76
{'loss': 0.8363, 'learning_rate': 1.9795824849893483e-05, 'epoch': 0.09}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11905
total_samples=2849, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:05,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.31 | bwd_microstep: 2047.34 | bwd_inner_microstep: 1662.67 | bwd_allreduce_microstep: 384.60 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446
total_samples=2853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:07,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.70 | bwd_microstep: 1769.11 | bwd_inner_microstep: 1698.83 | bwd_allreduce_microstep: 70.21 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11772
total_samples=2856, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:10,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.79 | bwd_microstep: 2100.81 | bwd_inner_microstep: 1632.22 | bwd_allreduce_microstep: 468.49 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11589
total_samples=2859, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:14,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:19:14,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.33 | bwd_microstep: 2423.46 | bwd_inner_microstep: 1634.86 | bwd_allreduce_microstep: 788.54 | step_microstep: 109.57
[2025-08-03 02:19:14,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.06 | bwd: 8340.77 | bwd_inner: 6628.61 | bwd_allreduce: 1711.90 | step: 110.02
{'loss': 0.8348, 'learning_rate': 1.9792556367036432e-05, 'epoch': 0.09}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12036
total_samples=2863, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:16,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.54 | bwd_microstep: 1763.42 | bwd_inner_microstep: 1632.79 | bwd_allreduce_microstep: 130.57 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13494
total_samples=2867, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:19,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.31 | bwd_microstep: 1813.82 | bwd_inner_microstep: 1708.83 | bwd_allreduce_microstep: 104.92 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11845
total_samples=2870, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:22,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.54 | bwd_microstep: 2133.60 | bwd_inner_microstep: 1765.58 | bwd_allreduce_microstep: 367.95 | step_microstep: 0.15
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12459
total_samples=2874, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:24,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.01
[2025-08-03 02:19:24,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.07 | bwd_microstep: 1740.31 | bwd_inner_microstep: 1596.62 | bwd_allreduce_microstep: 143.63 | step_microstep: 112.14
[2025-08-03 02:19:24,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.40 | bwd: 7451.20 | bwd_inner: 6703.82 | bwd_allreduce: 747.15 | step: 112.52
{'loss': 0.8451, 'learning_rate': 1.9789262204341918e-05, 'epoch': 0.09}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11963
total_samples=2877, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:27,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.35 | bwd_microstep: 1765.89 | bwd_inner_microstep: 1545.56 | bwd_allreduce_microstep: 220.27 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13290
total_samples=2881, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:30,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.26 | bwd_microstep: 1935.56 | bwd_inner_microstep: 1841.98 | bwd_allreduce_microstep: 93.52 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11663
total_samples=2884, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:32,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.91 | bwd_microstep: 1744.99 | bwd_inner_microstep: 1534.78 | bwd_allreduce_microstep: 210.14 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13204
total_samples=2888, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:35,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 02:19:35,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.13 | bwd_microstep: 2039.93 | bwd_inner_microstep: 1916.93 | bwd_allreduce_microstep: 118.94 | step_microstep: 110.04
[2025-08-03 02:19:35,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.58 | bwd: 7486.43 | bwd_inner: 6839.24 | bwd_allreduce: 646.86 | step: 110.38
{'loss': 0.8292, 'learning_rate': 1.978594237044849e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13349
total_samples=2892, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:38,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.46 | bwd_microstep: 1845.18 | bwd_inner_microstep: 1728.67 | bwd_allreduce_microstep: 116.45 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13101
total_samples=2896, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:40,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.98 | bwd_microstep: 1882.79 | bwd_inner_microstep: 1831.32 | bwd_allreduce_microstep: 51.42 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11798
total_samples=2899, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:43,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.10 | bwd_microstep: 2096.52 | bwd_inner_microstep: 1895.82 | bwd_allreduce_microstep: 200.63 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11597
total_samples=2902, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:46,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 02:19:46,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.28 | bwd_microstep: 1754.86 | bwd_inner_microstep: 1534.16 | bwd_allreduce_microstep: 220.64 | step_microstep: 140.02
[2025-08-03 02:19:46,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.75 | bwd: 7579.41 | bwd_inner: 6989.97 | bwd_allreduce: 589.21 | step: 140.46
{'loss': 0.8377, 'learning_rate': 1.9782596874062028e-05, 'epoch': 0.09}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13291
total_samples=2906, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:49,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.48 | bwd_microstep: 2038.45 | bwd_inner_microstep: 1720.43 | bwd_allreduce_microstep: 317.96 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13311
total_samples=2910, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:51,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.21 | bwd_microstep: 1840.49 | bwd_inner_microstep: 1798.43 | bwd_allreduce_microstep: 42.00 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12832
total_samples=2913, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:54,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.97 | bwd_microstep: 1936.16 | bwd_inner_microstep: 1607.13 | bwd_allreduce_microstep: 328.97 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14040
total_samples=2917, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:57,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 02:19:57,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.66 | bwd_microstep: 1781.11 | bwd_inner_microstep: 1717.92 | bwd_allreduce_microstep: 63.13 | step_microstep: 127.02
[2025-08-03 02:19:57,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.25 | bwd: 7596.26 | bwd_inner: 6843.89 | bwd_allreduce: 752.13 | step: 127.37
  9%|▉         | 184/2000 [36:06<5:23:03, 10.67s/it]
  9%|▉         | 185/2000 [36:17<5:24:41, 10.73s/it]
  9%|▉         | 186/2000 [36:28<5:32:26, 11.00s/it]
  9%|▉         | 187/2000 [36:39<5:29:15, 10.90s/it]
  9%|▉         | 188/2000 [36:50<5:27:22, 10.84s/it]
  9%|▉         | 189/2000 [37:01<5:27:30, 10.85s/it]
 10%|▉         | 190/2000 [37:12<5:27:05, 10.84s/it]
{'loss': 0.8539, 'learning_rate': 1.977922572395571e-05, 'epoch': 0.1}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13500
total_samples=2921, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:19:59,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.57 | bwd_microstep: 1765.94 | bwd_inner_microstep: 1695.99 | bwd_allreduce_microstep: 69.89 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12878
total_samples=2925, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:02,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.46 | bwd_microstep: 1753.98 | bwd_inner_microstep: 1648.03 | bwd_allreduce_microstep: 105.88 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14264
total_samples=2930, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:04,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.19 | bwd_microstep: 1810.18 | bwd_inner_microstep: 1736.88 | bwd_allreduce_microstep: 73.24 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13526
total_samples=2934, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:07,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 02:20:07,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 657.02 | bwd_microstep: 1824.44 | bwd_inner_microstep: 1692.22 | bwd_allreduce_microstep: 132.15 | step_microstep: 141.34
[2025-08-03 02:20:07,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2741.18 | bwd: 7154.59 | bwd_inner: 6773.11 | bwd_allreduce: 381.24 | step: 141.71
{'loss': 0.8343, 'learning_rate': 1.9775828928969976e-05, 'epoch': 0.1}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13532
total_samples=2938, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:10,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.95 | bwd_microstep: 2009.91 | bwd_inner_microstep: 1710.80 | bwd_allreduce_microstep: 299.06 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13966
total_samples=2943, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:12,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.59 | bwd_microstep: 1825.15 | bwd_inner_microstep: 1724.08 | bwd_allreduce_microstep: 101.01 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13614
total_samples=2947, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:15,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.29 | bwd_microstep: 1744.02 | bwd_inner_microstep: 1686.22 | bwd_allreduce_microstep: 57.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13357
total_samples=2951, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:18,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 02:20:18,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.90 | bwd_microstep: 1870.91 | bwd_inner_microstep: 1706.60 | bwd_allreduce_microstep: 164.24 | step_microstep: 111.47
[2025-08-03 02:20:18,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.66 | bwd: 7450.04 | bwd_inner: 6827.69 | bwd_allreduce: 622.11 | step: 111.80
{'loss': 0.8397, 'learning_rate': 1.977240649801253e-05, 'epoch': 0.1}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13682
total_samples=2956, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:20,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.48 | bwd_microstep: 1889.16 | bwd_inner_microstep: 1828.77 | bwd_allreduce_microstep: 60.32 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13691
total_samples=2960, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:23,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.64 | bwd_microstep: 1815.10 | bwd_inner_microstep: 1714.05 | bwd_allreduce_microstep: 100.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13476
total_samples=2964, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:26,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.42 | bwd_microstep: 1807.80 | bwd_inner_microstep: 1704.31 | bwd_allreduce_microstep: 103.43 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13030
total_samples=2968, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:29,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 02:20:29,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.43 | bwd_microstep: 2007.08 | bwd_inner_microstep: 1871.35 | bwd_allreduce_microstep: 135.68 | step_microstep: 116.62
[2025-08-03 02:20:29,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.91 | bwd: 7519.19 | bwd_inner: 7118.47 | bwd_allreduce: 400.49 | step: 117.05
{'loss': 0.8344, 'learning_rate': 1.97689584400583e-05, 'epoch': 0.1}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12073
total_samples=2971, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:31,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.18 | bwd_microstep: 1757.15 | bwd_inner_microstep: 1556.25 | bwd_allreduce_microstep: 200.83 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12994
total_samples=2975, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:34,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.24 | bwd_microstep: 1777.67 | bwd_inner_microstep: 1675.17 | bwd_allreduce_microstep: 102.43 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13318
total_samples=2979, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:37,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.86 | bwd_microstep: 2297.26 | bwd_inner_microstep: 2204.41 | bwd_allreduce_microstep: 92.78 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12293
total_samples=2983, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:39,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 02:20:39,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.29 | bwd_microstep: 1719.79 | bwd_inner_microstep: 1592.09 | bwd_allreduce_microstep: 127.64 | step_microstep: 161.95
[2025-08-03 02:20:39,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2710.51 | bwd: 7551.90 | bwd_inner: 7027.91 | bwd_allreduce: 523.75 | step: 162.38
{'loss': 0.8178, 'learning_rate': 1.9765484764149413e-05, 'epoch': 0.1}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13460
total_samples=2987, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:42,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.32 | bwd_microstep: 1970.84 | bwd_inner_microstep: 1726.35 | bwd_allreduce_microstep: 244.43 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13581
total_samples=2991, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:45,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.24 | bwd_microstep: 1987.91 | bwd_inner_microstep: 1890.90 | bwd_allreduce_microstep: 96.95 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13676
total_samples=2995, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:47,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.09 | bwd_microstep: 1765.05 | bwd_inner_microstep: 1672.23 | bwd_allreduce_microstep: 92.76 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11749
total_samples=2998, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:50,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 02:20:50,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.48 | bwd_microstep: 2032.62 | bwd_inner_microstep: 1764.10 | bwd_allreduce_microstep: 268.45 | step_microstep: 115.77
[2025-08-03 02:20:50,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.06 | bwd: 7756.47 | bwd_inner: 7053.57 | bwd_allreduce: 702.66 | step: 116.23
{'loss': 0.8186, 'learning_rate': 1.976198547939518e-05, 'epoch': 0.1}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11948
total_samples=3001, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:53,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.07 | bwd_microstep: 1770.54 | bwd_inner_microstep: 1547.29 | bwd_allreduce_microstep: 223.19 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13633
total_samples=3005, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:55,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.51 | bwd_microstep: 1795.95 | bwd_inner_microstep: 1710.45 | bwd_allreduce_microstep: 85.43 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14538
total_samples=3010, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:20:58,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.82 | bwd_microstep: 1891.24 | bwd_inner_microstep: 1705.29 | bwd_allreduce_microstep: 185.89 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13255
total_samples=3014, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:01,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 02:21:01,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.12 | bwd_microstep: 2254.15 | bwd_inner_microstep: 2087.90 | bwd_allreduce_microstep: 166.17 | step_microstep: 136.32
[2025-08-03 02:21:01,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.46 | bwd: 7711.93 | bwd_inner: 7050.93 | bwd_allreduce: 660.75 | step: 136.64
{'loss': 0.8358, 'learning_rate': 1.9758460594972068e-05, 'epoch': 0.1}
10%|▉         | 190/2000 [37:12<5:27:05, 10.84s/it]
10%|▉         | 191/2000 [37:22<5:22:49, 10.71s/it]
10%|▉         | 192/2000 [37:33<5:22:39, 10.71s/it]
10%|▉         | 193/2000 [37:43<5:22:51, 10.72s/it]
10%|▉         | 194/2000 [37:54<5:22:51, 10.73s/it]
10%|▉         | 195/2000 [38:05<5:24:49, 10.80s/it]
10%|▉         | 196/2000 [38:16<5:26:16, 10.85s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13420
total_samples=3018, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:04,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.76 | bwd_microstep: 1772.48 | bwd_inner_microstep: 1690.91 | bwd_allreduce_microstep: 81.51 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14084
total_samples=3022, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:06,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.06 | bwd_microstep: 1855.58 | bwd_inner_microstep: 1754.19 | bwd_allreduce_microstep: 101.33 | step_microstep: 0.19
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12560
total_samples=3026, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:09,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.85 | bwd_microstep: 1729.11 | bwd_inner_microstep: 1567.42 | bwd_allreduce_microstep: 161.62 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13823
total_samples=3030, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:12,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 02:21:12,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.33 | bwd_microstep: 1707.08 | bwd_inner_microstep: 1654.08 | bwd_allreduce_microstep: 52.93 | step_microstep: 120.97
[2025-08-03 02:21:12,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.92 | bwd: 7064.29 | bwd_inner: 6666.59 | bwd_allreduce: 397.47 | step: 121.39
{'loss': 0.829, 'learning_rate': 1.9754910120123675e-05, 'epoch': 0.1}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12987
total_samples=3034, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:14,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.16 | bwd_microstep: 1811.47 | bwd_inner_microstep: 1686.64 | bwd_allreduce_microstep: 124.76 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14376
total_samples=3038, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:18,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1976.73 | bwd_microstep: 1859.91 | bwd_inner_microstep: 1749.51 | bwd_allreduce_microstep: 110.34 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13627
total_samples=3042, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:21,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.60 | bwd_microstep: 1785.02 | bwd_inner_microstep: 1721.75 | bwd_allreduce_microstep: 63.20 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11493
total_samples=3045, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:23,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:21:23,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.69 | bwd_microstep: 1885.85 | bwd_inner_microstep: 1707.50 | bwd_allreduce_microstep: 178.29 | step_microstep: 137.18
[2025-08-03 02:21:23,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4088.11 | bwd: 7342.30 | bwd_inner: 6865.39 | bwd_allreduce: 476.67 | step: 137.61
{'loss': 0.8289, 'learning_rate': 1.9751334064160708e-05, 'epoch': 0.1}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13408
total_samples=3049, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:27,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 991.44 | bwd_microstep: 2084.27 | bwd_inner_microstep: 2030.84 | bwd_allreduce_microstep: 53.38 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14381
total_samples=3054, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:29,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.68 | bwd_microstep: 1695.14 | bwd_inner_microstep: 1664.85 | bwd_allreduce_microstep: 30.23 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13450
total_samples=3058, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:32,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.04 | bwd_microstep: 2023.38 | bwd_inner_microstep: 1890.73 | bwd_allreduce_microstep: 132.59 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13360
total_samples=3062, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:35,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:21:35,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.23 | bwd_microstep: 2155.31 | bwd_inner_microstep: 1997.34 | bwd_allreduce_microstep: 157.91 | step_microstep: 107.44
[2025-08-03 02:21:35,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3039.32 | bwd: 7958.15 | bwd_inner: 7583.75 | bwd_allreduce: 374.18 | step: 107.76
{'loss': 0.8411, 'learning_rate': 1.9747732436460955e-05, 'epoch': 0.1}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12700
total_samples=3066, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:38,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.33 | bwd_microstep: 2005.63 | bwd_inner_microstep: 1829.28 | bwd_allreduce_microstep: 176.29 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13657
total_samples=3070, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:40,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.41 | bwd_microstep: 1723.73 | bwd_inner_microstep: 1688.48 | bwd_allreduce_microstep: 35.17 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13886
total_samples=3074, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:43,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.69 | bwd_microstep: 1815.13 | bwd_inner_microstep: 1733.58 | bwd_allreduce_microstep: 81.49 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12640
total_samples=3078, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:46,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.81
[2025-08-03 02:21:46,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.53 | bwd_microstep: 2210.18 | bwd_inner_microstep: 1900.04 | bwd_allreduce_microstep: 310.08 | step_microstep: 137.58
[2025-08-03 02:21:46,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.89 | bwd: 7754.72 | bwd_inner: 7151.37 | bwd_allreduce: 603.10 | step: 137.91
{'loss': 0.8203, 'learning_rate': 1.9744105246469264e-05, 'epoch': 0.1}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14074
total_samples=3082, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:49,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.94 | bwd_microstep: 1877.31 | bwd_inner_microstep: 1827.28 | bwd_allreduce_microstep: 49.96 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13894
total_samples=3086, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:51,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.33 | bwd_microstep: 2051.51 | bwd_inner_microstep: 1914.23 | bwd_allreduce_microstep: 137.21 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13580
total_samples=3090, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:54,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.81 | bwd_microstep: 1938.45 | bwd_inner_microstep: 1711.80 | bwd_allreduce_microstep: 226.58 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11625
total_samples=3093, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:21:57,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 02:21:57,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.66 | bwd_microstep: 1901.30 | bwd_inner_microstep: 1656.23 | bwd_allreduce_microstep: 245.00 | step_microstep: 137.38
[2025-08-03 02:21:57,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.68 | bwd: 7768.61 | bwd_inner: 7109.54 | bwd_allreduce: 658.83 | step: 137.73
{'loss': 0.8263, 'learning_rate': 1.9740452503697518e-05, 'epoch': 0.1}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11740
total_samples=3096, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:00,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.51 | bwd_microstep: 2002.29 | bwd_inner_microstep: 1782.14 | bwd_allreduce_microstep: 220.09 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15841
total_samples=3102, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:03,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.85 | bwd_microstep: 2149.71 | bwd_inner_microstep: 2068.60 | bwd_allreduce_microstep: 81.05 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13105
total_samples=3106, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:06,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.06 | bwd_microstep: 2055.53 | bwd_inner_microstep: 1902.62 | bwd_allreduce_microstep: 152.85 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11698
total_samples=3109, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:08,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:22:08,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.09 | bwd_microstep: 1960.53 | bwd_inner_microstep: 1555.98 | bwd_allreduce_microstep: 404.48 | step_microstep: 111.25
[2025-08-03 02:22:08,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.43 | bwd: 8168.11 | bwd_inner: 7309.34 | bwd_allreduce: 858.55 | step: 111.57
{'loss': 0.8336, 'learning_rate': 1.9736774217724614e-05, 'epoch': 0.1}
10%|▉         | 197/2000 [38:26<5:21:11, 10.69s/it]
10%|▉         | 198/2000 [38:38<5:32:11, 11.06s/it]
10%|▉         | 199/2000 [38:50<5:35:22, 11.17s/it]
10%|█         | 200/2000 [39:01<5:33:47, 11.13s/it]
10%|█         | 201/2000 [39:12<5:32:49, 11.10s/it]
10%|█         | 202/2000 [39:23<5:35:49, 11.21s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13689
total_samples=3113, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:11,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.01 | bwd_microstep: 1779.37 | bwd_inner_microstep: 1675.23 | bwd_allreduce_microstep: 104.07 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15038
total_samples=3118, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:14,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.92 | bwd_microstep: 2193.57 | bwd_inner_microstep: 2151.84 | bwd_allreduce_microstep: 41.67 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13118
total_samples=3122, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:17,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.59 | bwd_microstep: 2591.21 | bwd_inner_microstep: 1712.03 | bwd_allreduce_microstep: 879.12 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14467
total_samples=3126, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:20,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 02:22:20,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1744.25 | bwd_inner_microstep: 1730.47 | bwd_allreduce_microstep: 13.71 | step_microstep: 113.91
[2025-08-03 02:22:20,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.05 | bwd: 8308.45 | bwd_inner: 7269.57 | bwd_allreduce: 1038.66 | step: 114.34
{'loss': 0.8434, 'learning_rate': 1.9733070398196423e-05, 'epoch': 0.1}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13015
total_samples=3130, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:23,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.60 | bwd_microstep: 1783.08 | bwd_inner_microstep: 1601.22 | bwd_allreduce_microstep: 181.80 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14383
total_samples=3134, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:25,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.34 | bwd_microstep: 1840.44 | bwd_inner_microstep: 1769.81 | bwd_allreduce_microstep: 70.56 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15819
total_samples=3138, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:28,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.49 | bwd_microstep: 1779.70 | bwd_inner_microstep: 1754.80 | bwd_allreduce_microstep: 24.84 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12077
total_samples=3141, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:31,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 02:22:31,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.69 | bwd_microstep: 1959.58 | bwd_inner_microstep: 1757.19 | bwd_allreduce_microstep: 202.32 | step_microstep: 159.63
[2025-08-03 02:22:31,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2757.05 | bwd: 7362.86 | bwd_inner: 6883.00 | bwd_allreduce: 479.60 | step: 159.98
{'loss': 0.8348, 'learning_rate': 1.9729341054825783e-05, 'epoch': 0.1}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13811
total_samples=3145, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:33,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.52 | bwd_microstep: 1726.91 | bwd_inner_microstep: 1686.17 | bwd_allreduce_microstep: 40.68 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13258
total_samples=3149, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:36,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.27 | bwd_microstep: 1818.20 | bwd_inner_microstep: 1690.15 | bwd_allreduce_microstep: 127.99 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13570
total_samples=3153, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:38,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.89 | bwd_microstep: 1811.83 | bwd_inner_microstep: 1726.64 | bwd_allreduce_microstep: 85.12 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=3157, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:41,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 02:22:41,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.97 | bwd_microstep: 1761.54 | bwd_inner_microstep: 1699.86 | bwd_allreduce_microstep: 61.62 | step_microstep: 127.21
[2025-08-03 02:22:41,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.57 | bwd: 7118.53 | bwd_inner: 6802.82 | bwd_allreduce: 315.49 | step: 127.54
{'loss': 0.8179, 'learning_rate': 1.972558619739246e-05, 'epoch': 0.1}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13656
total_samples=3161, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:44,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.60 | bwd_microstep: 1776.72 | bwd_inner_microstep: 1691.63 | bwd_allreduce_microstep: 85.04 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13899
total_samples=3165, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:47,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.78 | bwd_microstep: 2901.52 | bwd_inner_microstep: 2063.19 | bwd_allreduce_microstep: 838.26 | step_microstep: 0.16
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13738
total_samples=3169, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:50,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.51 | bwd_microstep: 1757.38 | bwd_inner_microstep: 1680.91 | bwd_allreduce_microstep: 76.40 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12388
total_samples=3172, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:53,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 02:22:53,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.59 | bwd_microstep: 1970.71 | bwd_inner_microstep: 1800.16 | bwd_allreduce_microstep: 170.49 | step_microstep: 144.95
[2025-08-03 02:22:53,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.41 | bwd: 8406.38 | bwd_inner: 7235.88 | bwd_allreduce: 1170.27 | step: 145.42
{'loss': 0.8361, 'learning_rate': 1.972180583574313e-05, 'epoch': 0.1}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12743
total_samples=3176, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:55,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.78 | bwd_microstep: 1790.79 | bwd_inner_microstep: 1630.27 | bwd_allreduce_microstep: 160.46 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 15317
total_samples=3180, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:22:58,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.48 | bwd_microstep: 1901.85 | bwd_inner_microstep: 1772.12 | bwd_allreduce_microstep: 129.67 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13584
total_samples=3184, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:01,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.52 | bwd_microstep: 2223.99 | bwd_inner_microstep: 2093.35 | bwd_allreduce_microstep: 130.58 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13718
total_samples=3188, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:04,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 02:23:04,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.68 | bwd_microstep: 1771.02 | bwd_inner_microstep: 1704.35 | bwd_allreduce_microstep: 66.61 | step_microstep: 128.83
[2025-08-03 02:23:04,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.40 | bwd: 7687.69 | bwd_inner: 7200.08 | bwd_allreduce: 487.39 | step: 129.15
{'loss': 0.8128, 'learning_rate': 1.9717999979791356e-05, 'epoch': 0.1}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12318
total_samples=3191, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:06,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.34 | bwd_microstep: 1979.80 | bwd_inner_microstep: 1759.01 | bwd_allreduce_microstep: 220.71 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11800
total_samples=3194, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:09,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.33 | bwd_microstep: 1860.17 | bwd_inner_microstep: 1535.03 | bwd_allreduce_microstep: 325.08 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13788
total_samples=3198, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:12,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.86 | bwd_microstep: 1778.53 | bwd_inner_microstep: 1710.93 | bwd_allreduce_microstep: 67.55 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13124
total_samples=3202, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:14,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 02:23:14,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.97 | bwd_microstep: 1767.60 | bwd_inner_microstep: 1670.57 | bwd_allreduce_microstep: 96.96 | step_microstep: 118.80
[2025-08-03 02:23:14,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.43 | bwd: 7386.15 | bwd_inner: 6675.53 | bwd_allreduce: 710.38 | step: 119.29
{'loss': 0.8417, 'learning_rate': 1.9714168639517543e-05, 'epoch': 0.1}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11843
total_samples=3205, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:17,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.69 | bwd_microstep: 1986.34 | bwd_inner_microstep: 1732.22 | bwd_allreduce_microstep: 254.05 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12785
total_samples=3209, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:20,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.91 | bwd_microstep: 2015.17 | bwd_inner_microstep: 1636.17 | bwd_allreduce_microstep: 378.94 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13167
total_samples=3213, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:22,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.71 | bwd_microstep: 1879.26 | bwd_inner_microstep: 1818.89 | bwd_allreduce_microstep: 60.31 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11759
total_samples=3216, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:25,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:23:25,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.59 | bwd_microstep: 2117.11 | bwd_inner_microstep: 1888.96 | bwd_allreduce_microstep: 228.09 | step_microstep: 139.37
[2025-08-03 02:23:25,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.83 | bwd: 7997.91 | bwd_inner: 7076.24 | bwd_allreduce: 921.45 | step: 139.70
10%|█         | 203/2000 [39:35<5:38:55, 11.32s/it]
10%|█         | 204/2000 [39:45<5:32:31, 11.11s/it]
10%|█         | 205/2000 [39:56<5:25:27, 10.88s/it]
10%|█         | 206/2000 [40:07<5:32:43, 11.13s/it]
10%|█         | 206/2000 [40:08<5:32:43, 11.13s/it]
10%|█         | 207/2000 [40:18<5:30:45, 11.07s/it]
10%|█         | 208/2000 [40:29<5:26:25, 10.93s/it]
10%|█         | 209/2000 [40:40<
{'loss': 0.8303, 'learning_rate': 1.9710311824968942e-05, 'epoch': 0.1}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13770
total_samples=3220, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:28,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.06 | bwd_microstep: 1823.62 | bwd_inner_microstep: 1699.97 | bwd_allreduce_microstep: 123.59 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12072
total_samples=3223, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:31,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.52 | bwd_microstep: 2194.57 | bwd_inner_microstep: 1955.28 | bwd_allreduce_microstep: 239.22 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 11987
total_samples=3227, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:34,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.32 | bwd_microstep: 1846.80 | bwd_inner_microstep: 1636.59 | bwd_allreduce_microstep: 210.14 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12943
total_samples=3231, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:36,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 02:23:36,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.87 | bwd_microstep: 1730.56 | bwd_inner_microstep: 1644.04 | bwd_allreduce_microstep: 86.45 | step_microstep: 160.55
[2025-08-03 02:23:36,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.69 | bwd: 7595.59 | bwd_inner: 6935.88 | bwd_allreduce: 659.48 | step: 160.88
{'loss': 0.8353, 'learning_rate': 1.9706429546259592e-05, 'epoch': 0.1}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13481
total_samples=3235, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:39,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.94 | bwd_microstep: 1782.67 | bwd_inner_microstep: 1703.94 | bwd_allreduce_microstep: 78.67 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14633
total_samples=3239, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:42,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.60 | bwd_microstep: 1828.06 | bwd_inner_microstep: 1768.87 | bwd_allreduce_microstep: 59.13 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14153
total_samples=3243, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:44,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.98 | bwd_microstep: 1914.64 | bwd_inner_microstep: 1725.95 | bwd_allreduce_microstep: 188.63 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 11923
total_samples=3247, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:47,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.93
[2025-08-03 02:23:47,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.52 | bwd_microstep: 1949.76 | bwd_inner_microstep: 1773.21 | bwd_allreduce_microstep: 176.49 | step_microstep: 134.11
[2025-08-03 02:23:47,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.97 | bwd: 7475.18 | bwd_inner: 6971.96 | bwd_allreduce: 502.99 | step: 134.52
{'loss': 0.8316, 'learning_rate': 1.9702521813570322e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13665
total_samples=3251, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:50,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.72 | bwd_microstep: 1735.76 | bwd_inner_microstep: 1707.91 | bwd_allreduce_microstep: 27.78 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11793
total_samples=3254, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:52,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.31 | bwd_microstep: 1906.41 | bwd_inner_microstep: 1727.57 | bwd_allreduce_microstep: 178.77 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13727
total_samples=3259, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:55,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.16 | bwd_microstep: 1785.36 | bwd_inner_microstep: 1703.69 | bwd_allreduce_microstep: 81.60 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13159
total_samples=3263, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:23:58,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:23:58,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.45 | bwd_microstep: 2102.90 | bwd_inner_microstep: 1950.30 | bwd_allreduce_microstep: 152.54 | step_microstep: 112.96
[2025-08-03 02:23:58,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2728.57 | bwd: 7530.47 | bwd_inner: 7089.47 | bwd_allreduce: 440.77 | step: 113.42
{'loss': 0.8247, 'learning_rate': 1.9698588637148705e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13802
total_samples=3267, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:00,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.97 | bwd_microstep: 1797.43 | bwd_inner_microstep: 1696.65 | bwd_allreduce_microstep: 100.72 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13764
total_samples=3271, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:03,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 908.34 | bwd_microstep: 1917.00 | bwd_inner_microstep: 1730.54 | bwd_allreduce_microstep: 186.40 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11675
total_samples=3274, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:06,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.29 | bwd_microstep: 1765.35 | bwd_inner_microstep: 1533.06 | bwd_allreduce_microstep: 232.23 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12116
total_samples=3277, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:09,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:24:09,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.78 | bwd_microstep: 2045.92 | bwd_inner_microstep: 1603.24 | bwd_allreduce_microstep: 442.57 | step_microstep: 120.09
[2025-08-03 02:24:09,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3013.33 | bwd: 7525.75 | bwd_inner: 6563.50 | bwd_allreduce: 961.99 | step: 120.53
{'loss': 0.8162, 'learning_rate': 1.9694630027309035e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14898
total_samples=3281, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:12,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.49 | bwd_microstep: 2093.32 | bwd_inner_microstep: 2022.51 | bwd_allreduce_microstep: 70.75 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12641
total_samples=3284, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:14,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.83 | bwd_microstep: 1820.06 | bwd_inner_microstep: 1606.17 | bwd_allreduce_microstep: 213.84 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13200
total_samples=3288, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:17,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 667.75 | bwd_microstep: 1848.89 | bwd_inner_microstep: 1772.77 | bwd_allreduce_microstep: 76.05 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13597
total_samples=3292, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:20,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 02:24:20,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.08 | bwd_microstep: 1804.18 | bwd_inner_microstep: 1713.96 | bwd_allreduce_microstep: 90.15 | step_microstep: 135.12
[2025-08-03 02:24:20,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.09 | bwd: 7566.50 | bwd_inner: 7115.41 | bwd_allreduce: 450.87 | step: 135.47
{'loss': 0.8298, 'learning_rate': 1.9690645994432307e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14368
total_samples=3296, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:22,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.05 | bwd_microstep: 1860.77 | bwd_inner_microstep: 1762.32 | bwd_allreduce_microstep: 98.39 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=3300, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:25,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.19 | bwd_microstep: 2005.49 | bwd_inner_microstep: 1717.60 | bwd_allreduce_microstep: 287.83 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11704
total_samples=3303, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:28,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.75 | bwd_microstep: 2032.24 | bwd_inner_microstep: 1707.07 | bwd_allreduce_microstep: 325.12 | step_microstep: 0.09
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12239
total_samples=3307, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:31,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:24:31,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.68 | bwd_microstep: 1734.83 | bwd_inner_microstep: 1563.89 | bwd_allreduce_microstep: 170.87 | step_microstep: 148.53
[2025-08-03 02:24:31,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.61 | bwd: 7633.39 | bwd_inner: 6750.87 | bwd_allreduce: 882.29 | step: 148.93
10%|█         | 209/2000 [40:40<5:29:47, 11.05s/it]
10%|█         | 210/2000 [40:51<5:28:32, 11.01s/it]
11%|█         | 211/2000 [41:02<5:26:06, 10.94s/it]
11%|█         | 212/2000 [41:13<5:23:53, 10.87s/it]
11%|█         | 213/2000 [41:24<5:24:37, 10.90s/it]
11%|█         | 214/2000 [41:35<5:23:47, 10.88s/it]
11%|█         | 215/2000 [41:45<5:23:47, 10.88s/it]
{'loss': 0.826, 'learning_rate': 1.9686636548966177e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13801
total_samples=3311, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:33,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.96 | bwd_microstep: 1916.59 | bwd_inner_microstep: 1858.40 | bwd_allreduce_microstep: 58.13 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515
total_samples=3316, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:37,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.33 | bwd_microstep: 2432.90 | bwd_inner_microstep: 2312.78 | bwd_allreduce_microstep: 120.05 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11784
total_samples=3319, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:39,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.13 | bwd_microstep: 1969.08 | bwd_inner_microstep: 1831.74 | bwd_allreduce_microstep: 137.27 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11661
total_samples=3322, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:42,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 02:24:42,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.18 | bwd_microstep: 2254.05 | bwd_inner_microstep: 1932.17 | bwd_allreduce_microstep: 321.82 | step_microstep: 112.01
[2025-08-03 02:24:42,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.52 | bwd: 8572.67 | bwd_inner: 7935.09 | bwd_allreduce: 637.35 | step: 112.49
{'loss': 0.8278, 'learning_rate': 1.9682601701424958e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13353
total_samples=3326, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:45,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.13 | bwd_microstep: 1846.76 | bwd_inner_microstep: 1736.57 | bwd_allreduce_microstep: 110.11 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13513
total_samples=3330, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:48,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.83 | bwd_microstep: 1808.56 | bwd_inner_microstep: 1703.92 | bwd_allreduce_microstep: 104.58 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12657
total_samples=3333, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:50,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.32 | bwd_microstep: 1847.06 | bwd_inner_microstep: 1722.80 | bwd_allreduce_microstep: 124.21 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13065
total_samples=3336, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:53,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 02:24:53,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.35 | bwd_microstep: 1851.86 | bwd_inner_microstep: 1603.84 | bwd_allreduce_microstep: 247.95 | step_microstep: 120.89
[2025-08-03 02:24:53,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.57 | bwd: 7354.30 | bwd_inner: 6767.13 | bwd_allreduce: 586.93 | step: 121.34
{'loss': 0.8271, 'learning_rate': 1.9678541462389564e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13602
total_samples=3340, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:56,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.67 | bwd_microstep: 1866.14 | bwd_inner_microstep: 1701.28 | bwd_allreduce_microstep: 164.80 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13351
total_samples=3345, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:24:58,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.30 | bwd_microstep: 1758.59 | bwd_inner_microstep: 1670.13 | bwd_allreduce_microstep: 88.40 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14132
total_samples=3349, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:01,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.76 | bwd_microstep: 1744.06 | bwd_inner_microstep: 1726.08 | bwd_allreduce_microstep: 17.91 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11872
total_samples=3352, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:04,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 02:25:04,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.39 | bwd_microstep: 2254.80 | bwd_inner_microstep: 1890.09 | bwd_allreduce_microstep: 364.65 | step_microstep: 118.29
[2025-08-03 02:25:04,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.05 | bwd: 7623.63 | bwd_inner: 6987.56 | bwd_allreduce: 635.84 | step: 118.68
{'loss': 0.8296, 'learning_rate': 1.9674455842507494e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13744
total_samples=3356, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:06,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.83 | bwd_microstep: 1759.77 | bwd_inner_microstep: 1699.82 | bwd_allreduce_microstep: 59.89 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11703
total_samples=3360, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:10,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1460.05 | bwd_microstep: 2436.48 | bwd_inner_microstep: 2207.29 | bwd_allreduce_microstep: 229.13 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12081
total_samples=3363, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:13,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.34 | bwd_microstep: 1743.07 | bwd_inner_microstep: 1553.11 | bwd_allreduce_microstep: 189.90 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13726
total_samples=3368, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:17,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:25:17,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1209.02 | bwd_microstep: 2624.71 | bwd_inner_microstep: 2499.65 | bwd_allreduce_microstep: 125.00 | step_microstep: 115.06
[2025-08-03 02:25:17,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4070.17 | bwd: 8564.07 | bwd_inner: 7959.87 | bwd_allreduce: 603.98 | step: 115.47
{'loss': 0.8341, 'learning_rate': 1.9670344852492814e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13564
total_samples=3373, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:19,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.33 | bwd_microstep: 1755.89 | bwd_inner_microstep: 1685.00 | bwd_allreduce_microstep: 70.82 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13363
total_samples=3377, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:22,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.58 | bwd_microstep: 2138.10 | bwd_inner_microstep: 2034.17 | bwd_allreduce_microstep: 103.87 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13411
total_samples=3381, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:25,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.97 | bwd_microstep: 1790.34 | bwd_inner_microstep: 1701.17 | bwd_allreduce_microstep: 89.11 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11705
total_samples=3384, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:28,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 02:25:28,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.04 | bwd_microstep: 1742.67 | bwd_inner_microstep: 1560.68 | bwd_allreduce_microstep: 181.92 | step_microstep: 156.86
[2025-08-03 02:25:28,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.85 | bwd: 7427.04 | bwd_inner: 6981.02 | bwd_allreduce: 445.79 | step: 157.23
{'loss': 0.8172, 'learning_rate': 1.9666208503126115e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14001
total_samples=3388, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:30,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.83 | bwd_microstep: 1919.32 | bwd_inner_microstep: 1744.06 | bwd_allreduce_microstep: 175.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13248
total_samples=3392, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:33,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.02 | bwd_microstep: 1784.27 | bwd_inner_microstep: 1694.75 | bwd_allreduce_microstep: 89.45 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12470
total_samples=3396, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:36,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.11 | bwd_microstep: 2010.14 | bwd_inner_microstep: 1801.97 | bwd_allreduce_microstep: 208.11 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12410
total_samples=3399, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:39,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:25:39,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.84 | bwd_microstep: 2110.14 | bwd_inner_microstep: 1941.84 | bwd_allreduce_microstep: 168.24 | step_microstep: 116.36
[2025-08-03 02:25:39,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.74 | bwd: 7823.93 | bwd_inner: 7182.61 | bwd_allreduce: 641.09 | step: 116.69
{'loss': 0.8234, 'learning_rate': 1.966204680525449e-05, 'epoch': 0.11}
11%|█         | 216/2000 [41:57<5:31:55, 11.16s/it]
11%|█         | 217/2000 [42:08<5:26:23, 10.98s/it]
11%|█         | 218/2000 [42:19<5:25:05, 10.95s/it]
11%|█         | 219/2000 [42:32<5:43:57, 11.59s/it]
11%|█         | 220/2000 [42:42<5:35:51, 11.32s/it]
11%|█         | 221/2000 [42:54<5:33:19, 11.24s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11978
total_samples=3402, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:42,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.59 | bwd_microstep: 2353.38 | bwd_inner_microstep: 1969.35 | bwd_allreduce_microstep: 383.96 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13478
total_samples=3406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:45,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.20 | bwd_microstep: 2036.19 | bwd_inner_microstep: 1891.46 | bwd_allreduce_microstep: 144.66 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12303
total_samples=3409, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:47,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.63 | bwd_microstep: 1885.17 | bwd_inner_microstep: 1624.50 | bwd_allreduce_microstep: 260.62 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15304
total_samples=3413, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:50,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 02:25:50,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.11 | bwd_microstep: 1788.54 | bwd_inner_microstep: 1763.98 | bwd_allreduce_microstep: 24.50 | step_microstep: 145.19
[2025-08-03 02:25:50,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.46 | bwd: 8063.35 | bwd_inner: 7249.28 | bwd_allreduce: 813.82 | step: 145.64
{'loss': 0.8329, 'learning_rate': 1.9657859769791506e-05, 'epoch': 0.11}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11742
total_samples=3416, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:53,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.12 | bwd_microstep: 2047.33 | bwd_inner_microstep: 1816.47 | bwd_allreduce_microstep: 230.79 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11740
total_samples=3419, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:56,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.19 | bwd_microstep: 1975.78 | bwd_inner_microstep: 1753.27 | bwd_allreduce_microstep: 222.44 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12975
total_samples=3423, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:25:58,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.42 | bwd_microstep: 1921.37 | bwd_inner_microstep: 1808.66 | bwd_allreduce_microstep: 112.65 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13626
total_samples=3427, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:01,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 02:26:01,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.88 | bwd_microstep: 1791.61 | bwd_inner_microstep: 1710.96 | bwd_allreduce_microstep: 80.58 | step_microstep: 109.56
[2025-08-03 02:26:01,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.53 | bwd: 7736.15 | bwd_inner: 7089.35 | bwd_allreduce: 646.55 | step: 109.90
{'loss': 0.8211, 'learning_rate': 1.965364740771718e-05, 'epoch': 0.11}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11835
total_samples=3430, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:04,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.09 | bwd_microstep: 1839.84 | bwd_inner_microstep: 1606.35 | bwd_allreduce_microstep: 233.42 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13490
total_samples=3434, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:06,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.37 | bwd_microstep: 1828.52 | bwd_inner_microstep: 1732.43 | bwd_allreduce_microstep: 96.04 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13571
total_samples=3438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:09,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.74 | bwd_microstep: 2024.36 | bwd_inner_microstep: 1898.65 | bwd_allreduce_microstep: 125.65 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13274
total_samples=3442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:12,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 02:26:12,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.30 | bwd_microstep: 1727.06 | bwd_inner_microstep: 1669.53 | bwd_allreduce_microstep: 57.47 | step_microstep: 140.38
[2025-08-03 02:26:12,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.43 | bwd: 7419.83 | bwd_inner: 6906.96 | bwd_allreduce: 512.65 | step: 140.80
{'loss': 0.8344, 'learning_rate': 1.9649409730077934e-05, 'epoch': 0.11}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12315
total_samples=3446, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:14,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.67 | bwd_microstep: 1819.41 | bwd_inner_microstep: 1586.93 | bwd_allreduce_microstep: 232.42 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12869
total_samples=3451, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:17,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.50 | bwd_microstep: 1814.47 | bwd_inner_microstep: 1625.25 | bwd_allreduce_microstep: 189.15 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14279
total_samples=3455, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:20,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.21 | bwd_microstep: 1850.23 | bwd_inner_microstep: 1731.78 | bwd_allreduce_microstep: 118.40 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13451
total_samples=3459, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:22,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 02:26:22,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.28 | bwd_microstep: 1954.94 | bwd_inner_microstep: 1845.42 | bwd_allreduce_microstep: 109.45 | step_microstep: 113.64
[2025-08-03 02:26:22,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.58 | bwd: 7439.11 | bwd_inner: 6789.36 | bwd_allreduce: 649.51 | step: 114.06
{'loss': 0.8207, 'learning_rate': 1.964514674798659e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13718
total_samples=3463, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:25,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.20 | bwd_microstep: 1774.17 | bwd_inner_microstep: 1707.45 | bwd_allreduce_microstep: 66.66 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13225
total_samples=3467, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:27,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.28 | bwd_microstep: 1740.32 | bwd_inner_microstep: 1675.26 | bwd_allreduce_microstep: 64.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13746
total_samples=3471, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:30,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.33 | bwd_microstep: 1983.78 | bwd_inner_microstep: 1728.62 | bwd_allreduce_microstep: 255.10 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13220
total_samples=3475, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:33,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 02:26:33,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.94 | bwd_microstep: 1770.94 | bwd_inner_microstep: 1689.60 | bwd_allreduce_microstep: 81.27 | step_microstep: 414.87
[2025-08-03 02:26:33,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.67 | bwd: 7269.26 | bwd_inner: 6800.93 | bwd_allreduce: 468.10 | step: 415.22
{'loss': 0.8273, 'learning_rate': 1.9640858472622316e-05, 'epoch': 0.11}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15598
total_samples=3479, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:36,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.97 | bwd_microstep: 1902.97 | bwd_inner_microstep: 1770.57 | bwd_allreduce_microstep: 132.33 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13154
total_samples=3483, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:39,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.55 | bwd_microstep: 1946.07 | bwd_inner_microstep: 1934.34 | bwd_allreduce_microstep: 11.66 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14401
total_samples=3487, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:41,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.83 | bwd_microstep: 1760.09 | bwd_inner_microstep: 1715.06 | bwd_allreduce_microstep: 44.95 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13623
total_samples=3492, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:44,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 02:26:44,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.26 | bwd_microstep: 2046.56 | bwd_inner_microstep: 1874.61 | bwd_allreduce_microstep: 171.89 | step_microstep: 121.22
[2025-08-03 02:26:44,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2766.54 | bwd: 7655.74 | bwd_inner: 7294.58 | bwd_allreduce: 360.93 | step: 121.74
{'loss': 0.8323, 'learning_rate': 1.963654491523062e-05, 'epoch': 0.11}
11%|█         | 222/2000 [43:05<5:34:06, 11.27s/it]
11%|█         | 223/2000 [43:16<5:30:54, 11.17s/it]
11%|█         | 224/2000 [43:27<5:26:52, 11.04s/it]
11%|█▏        | 225/2000 [43:37<5:23:26, 10.93s/it]
11%|█▏        | 226/2000 [43:48<5:21:51, 10.89s/it]
11%|█▏        | 227/2000 [43:59<5:21:30, 10.88s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13844
total_samples=3496, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:47,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.15 | bwd_microstep: 2039.96 | bwd_inner_microstep: 1898.20 | bwd_allreduce_microstep: 141.69 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11422
total_samples=3499, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:49,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.69 | bwd_microstep: 1873.49 | bwd_inner_microstep: 1527.02 | bwd_allreduce_microstep: 346.40 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13236
total_samples=3503, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:52,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.43 | bwd_microstep: 1837.85 | bwd_inner_microstep: 1700.78 | bwd_allreduce_microstep: 137.01 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12856
total_samples=3507, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:55,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 02:26:55,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.06 | bwd_microstep: 2029.46 | bwd_inner_microstep: 1875.11 | bwd_allreduce_microstep: 154.29 | step_microstep: 131.75
[2025-08-03 02:26:55,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2741.26 | bwd: 7780.80 | bwd_inner: 7001.10 | bwd_allreduce: 779.47 | step: 132.08
{'loss': 0.8341, 'learning_rate': 1.9632206087123296e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13247
total_samples=3511, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:26:58,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.84 | bwd_microstep: 1917.17 | bwd_inner_microstep: 1855.16 | bwd_allreduce_microstep: 61.94 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12833
total_samples=3515, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:00,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.07 | bwd_microstep: 1974.76 | bwd_inner_microstep: 1804.06 | bwd_allreduce_microstep: 170.64 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13246
total_samples=3519, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:03,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.38 | bwd_microstep: 1975.57 | bwd_inner_microstep: 1826.50 | bwd_allreduce_microstep: 149.01 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13638
total_samples=3523, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:06,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:27:06,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.42 | bwd_microstep: 1770.98 | bwd_inner_microstep: 1672.83 | bwd_allreduce_microstep: 98.09 | step_microstep: 116.48
[2025-08-03 02:27:06,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.64 | bwd: 7638.53 | bwd_inner: 7158.55 | bwd_allreduce: 479.75 | step: 116.91
{'loss': 0.8274, 'learning_rate': 1.9627841999678422e-05, 'epoch': 0.11}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15229
total_samples=3527, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:09,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 2148.37 | bwd_inner_microstep: 1999.49 | bwd_allreduce_microstep: 148.81 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11772
total_samples=3530, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:11,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.64 | bwd_microstep: 1811.22 | bwd_inner_microstep: 1602.82 | bwd_allreduce_microstep: 208.33 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11992
total_samples=3533, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:14,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.95 | bwd_microstep: 2046.81 | bwd_inner_microstep: 1827.73 | bwd_allreduce_microstep: 219.01 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12102
total_samples=3537, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:17,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78
[2025-08-03 02:27:17,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.60 | bwd_microstep: 1903.21 | bwd_inner_microstep: 1778.49 | bwd_allreduce_microstep: 124.66 | step_microstep: 118.81
[2025-08-03 02:27:17,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.20 | bwd: 7909.66 | bwd_inner: 7208.53 | bwd_allreduce: 700.89 | step: 119.16
{'loss': 0.8178, 'learning_rate': 1.9623452664340305e-05, 'epoch': 0.12}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15020
total_samples=3541, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:20,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.43 | bwd_microstep: 1782.84 | bwd_inner_microstep: 1739.94 | bwd_allreduce_microstep: 42.84 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14129
total_samples=3546, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:22,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.08 | bwd_microstep: 1706.03 | bwd_inner_microstep: 1681.54 | bwd_allreduce_microstep: 24.43 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12424
total_samples=3549, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:25,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 747.07 | bwd_microstep: 1784.69 | bwd_inner_microstep: 1576.55 | bwd_allreduce_microstep: 208.08 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13893
total_samples=3553, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:27,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 02:27:27,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.23 | bwd_microstep: 1834.11 | bwd_inner_microstep: 1744.08 | bwd_allreduce_microstep: 89.97 | step_microstep: 141.68
[2025-08-03 02:27:27,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.74 | bwd: 7107.72 | bwd_inner: 6742.11 | bwd_allreduce: 365.38 | step: 142.10
{'loss': 0.8299, 'learning_rate': 1.9619038092619465e-05, 'epoch': 0.12}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13244
total_samples=3557, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:30,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.24 | bwd_microstep: 1730.91 | bwd_inner_microstep: 1652.05 | bwd_allreduce_microstep: 78.80 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15784
total_samples=3562, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:33,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.23 | bwd_microstep: 2099.03 | bwd_inner_microstep: 1763.77 | bwd_allreduce_microstep: 335.19 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11758
total_samples=3565, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:35,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.95 | bwd_microstep: 1794.19 | bwd_inner_microstep: 1557.47 | bwd_allreduce_microstep: 236.65 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13542
total_samples=3569, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:38,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 02:27:38,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.12 | bwd_microstep: 2043.91 | bwd_inner_microstep: 1882.66 | bwd_allreduce_microstep: 161.19 | step_microstep: 113.96
[2025-08-03 02:27:38,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.47 | bwd: 7668.09 | bwd_inner: 6855.95 | bwd_allreduce: 811.91 | step: 114.27
{'loss': 0.8249, 'learning_rate': 1.9614598296092603e-05, 'epoch': 0.12}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13754
total_samples=3574, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:41,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.51 | bwd_microstep: 1776.37 | bwd_inner_microstep: 1692.88 | bwd_allreduce_microstep: 83.41 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11816
total_samples=3578, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:43,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.92 | bwd_microstep: 1737.26 | bwd_inner_microstep: 1540.12 | bwd_allreduce_microstep: 197.08 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13329
total_samples=3582, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:46,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.97 | bwd_microstep: 2047.76 | bwd_inner_microstep: 1889.31 | bwd_allreduce_microstep: 158.39 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13583
total_samples=3586, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:49,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 02:27:49,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.06 | bwd_microstep: 2342.33 | bwd_inner_microstep: 2289.97 | bwd_allreduce_microstep: 52.30 | step_microstep: 108.47
[2025-08-03 02:27:49,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.39 | bwd: 7903.76 | bwd_inner: 7412.27 | bwd_allreduce: 491.26 | step: 108.91
{'loss': 0.831, 'learning_rate': 1.9610133286402565e-05, 'epoch': 0.12}
11%|█▏        | 228/2000 [44:10<5:22:16, 10.91s/it]
11%|█▏        | 229/2000 [44:21<5:21:40, 10.90s/it]
12%|█▏        | 230/2000 [44:32<5:23:35, 10.97s/it]
12%|█▏        | 231/2000 [44:42<5:18:39, 10.81s/it]
12%|█▏        | 232/2000 [44:53<5:19:12, 10.83s/it]
12%|█▏        | 233/2000 [45:04<5:21:32, 10.92s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13257
total_samples=3590, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:52,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 962.38 | bwd_microstep: 1868.67 | bwd_inner_microstep: 1687.54 | bwd_allreduce_microstep: 181.06 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14178
total_samples=3594, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:55,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.87 | bwd_microstep: 1840.54 | bwd_inner_microstep: 1753.13 | bwd_allreduce_microstep: 87.35 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11757
total_samples=3597, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:27:58,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.68 | bwd_microstep: 1991.81 | bwd_inner_microstep: 1778.31 | bwd_allreduce_microstep: 213.42 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11848
total_samples=3600, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:00,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 02:28:00,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.07 | bwd_microstep: 1758.23 | bwd_inner_microstep: 1545.55 | bwd_allreduce_microstep: 212.62 | step_microstep: 133.15
[2025-08-03 02:28:00,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3046.93 | bwd: 7459.30 | bwd_inner: 6764.54 | bwd_allreduce: 694.52 | step: 133.60
{'loss': 0.8276, 'learning_rate': 1.9605643075258323e-05, 'epoch': 0.12}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13899
total_samples=3604, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:03,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.05 | bwd_microstep: 2050.00 | bwd_inner_microstep: 1721.46 | bwd_allreduce_microstep: 328.47 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13227
total_samples=3608, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:06,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.06 | bwd_microstep: 2021.65 | bwd_inner_microstep: 1740.54 | bwd_allreduce_microstep: 281.04 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11720
total_samples=3611, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:09,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.99 | bwd_microstep: 1835.85 | bwd_inner_microstep: 1705.70 | bwd_allreduce_microstep: 130.08 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13982
total_samples=3615, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:12,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 02:28:12,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 958.05 | bwd_microstep: 1979.58 | bwd_inner_microstep: 1744.88 | bwd_allreduce_microstep: 234.63 | step_microstep: 107.64
[2025-08-03 02:28:12,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3018.06 | bwd: 7887.11 | bwd_inner: 6912.57 | bwd_allreduce: 974.31 | step: 108.09
{'loss': 0.8125, 'learning_rate': 1.960112767443493e-05, 'epoch': 0.12}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13460
total_samples=3619, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:14,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.04 | bwd_microstep: 1933.04 | bwd_inner_microstep: 1804.36 | bwd_allreduce_microstep: 128.62 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13628
total_samples=3623, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:17,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.37 | bwd_microstep: 1767.84 | bwd_inner_microstep: 1705.50 | bwd_allreduce_microstep: 62.28 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14629
total_samples=3627, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:20,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.98 | bwd_microstep: 1723.24 | bwd_inner_microstep: 1714.47 | bwd_allreduce_microstep: 8.72 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14290
total_samples=3631, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:22,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 02:28:22,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.91 | bwd_microstep: 1746.06 | bwd_inner_microstep: 1711.91 | bwd_allreduce_microstep: 34.10 | step_microstep: 132.95
[2025-08-03 02:28:22,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.21 | bwd: 7170.23 | bwd_inner: 6936.22 | bwd_allreduce: 233.78 | step: 133.31
{'loss': 0.8126, 'learning_rate': 1.9596587095773496e-05, 'epoch': 0.12}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13134
total_samples=3635, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:25,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.33 | bwd_microstep: 1912.49 | bwd_inner_microstep: 1687.44 | bwd_allreduce_microstep: 224.99 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770
total_samples=3638, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:28,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.44 | bwd_microstep: 2112.43 | bwd_inner_microstep: 1876.32 | bwd_allreduce_microstep: 236.05 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13282
total_samples=3642, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:30,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.62 | bwd_microstep: 1843.38 | bwd_inner_microstep: 1704.85 | bwd_allreduce_microstep: 138.47 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15654
total_samples=3646, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:33,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 02:28:33,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.66 | bwd_microstep: 2005.22 | bwd_inner_microstep: 1905.58 | bwd_allreduce_microstep: 99.56 | step_microstep: 151.17
[2025-08-03 02:28:33,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.97 | bwd: 7873.57 | bwd_inner: 7174.19 | bwd_allreduce: 699.15 | step: 151.58
{'loss': 0.8156, 'learning_rate': 1.9592021351181163e-05, 'epoch': 0.12}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13795
total_samples=3650, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:36,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.72 | bwd_microstep: 2012.22 | bwd_inner_microstep: 1861.96 | bwd_allreduce_microstep: 150.20 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13188
total_samples=3653, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:39,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.12 | bwd_microstep: 1741.76 | bwd_inner_microstep: 1603.57 | bwd_allreduce_microstep: 138.13 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13216
total_samples=3657, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:41,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.80 | bwd_microstep: 1728.21 | bwd_inner_microstep: 1657.62 | bwd_allreduce_microstep: 70.53 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13400
total_samples=3661, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:44,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 02:28:44,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.26 | bwd_microstep: 1967.89 | bwd_inner_microstep: 1899.01 | bwd_allreduce_microstep: 68.81 | step_microstep: 116.78
[2025-08-03 02:28:44,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2750.83 | bwd: 7450.12 | bwd_inner: 7022.14 | bwd_allreduce: 427.75 | step: 117.11
{'loss': 0.8129, 'learning_rate': 1.958743045263106e-05, 'epoch': 0.12}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13069
total_samples=3665, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:47,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.34 | bwd_microstep: 1951.08 | bwd_inner_microstep: 1843.65 | bwd_allreduce_microstep: 107.36 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13920
total_samples=3669, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:49,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.08 | bwd_microstep: 1798.13 | bwd_inner_microstep: 1722.90 | bwd_allreduce_microstep: 75.18 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14890
total_samples=3673, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:52,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.82 | bwd_microstep: 2096.49 | bwd_inner_microstep: 1862.45 | bwd_allreduce_microstep: 233.98 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13517
total_samples=3677, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:55,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.78
[2025-08-03 02:28:55,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.93 | bwd_microstep: 2036.41 | bwd_inner_microstep: 1922.77 | bwd_allreduce_microstep: 113.59 | step_microstep: 118.60
[2025-08-03 02:28:55,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.08 | bwd: 7882.16 | bwd_inner: 7351.76 | bwd_allreduce: 530.17 | step: 119.02
{'loss': 0.8182, 'learning_rate': 1.9582814412162288e-05, 'epoch': 0.12}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15750
total_samples=3682, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:28:58,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.78 | bwd_microstep: 1819.03 | bwd_inner_microstep: 1789.88 | bwd_allreduce_microstep: 29.08 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11606
total_samples=3685, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:00,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.26 | bwd_microstep: 1749.02 | bwd_inner_microstep: 1521.65 | bwd_allreduce_microstep: 227.30 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11853
total_samples=3688, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:03,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.95 | bwd_microstep: 1756.29 | bwd_inner_microstep: 1547.11 | bwd_allreduce_microstep: 209.12 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13895
total_samples=3693, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:06,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 02:29:06,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.74 | bwd_microstep: 2058.73 | bwd_inner_microstep: 2052.85 | bwd_allreduce_microstep: 5.82 | step_microstep: 107.57
[2025-08-03 02:29:06,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.67 | bwd: 7383.11 | bwd_inner: 6911.49 | bwd_allreduce: 471.39 | step: 107.92
12%|█▏        | 234/2000 [45:15<5:21:54, 10.94s/it]
12%|█▏        | 235/2000 [45:27<5:25:17, 11.06s/it]
12%|█▏        | 236/2000 [45:37<5:19:18, 10.86s/it]
12%|█▏        | 237/2000 [45:48<5:21:46, 10.95s/it]
12%|█▏        | 238/2000 [45:59<5:18:52, 10.86s/it]
12%|█▏        | 239/2000 [46:10<5:21:02, 10.94s/it]
{'loss': 0.8069, 'learning_rate': 1.957817324187987e-05, 'epoch': 0.12}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11754
total_samples=3696, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:08,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.65 | bwd_microstep: 1963.40 | bwd_inner_microstep: 1747.26 | bwd_allreduce_microstep: 216.08 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11800
total_samples=3699, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:11,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.80 | bwd_microstep: 2103.29 | bwd_inner_microstep: 1954.30 | bwd_allreduce_microstep: 148.93 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11756
total_samples=3702, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:14,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.03 | bwd_microstep: 1844.53 | bwd_inner_microstep: 1608.66 | bwd_allreduce_microstep: 235.81 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13682
total_samples=3706, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:17,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 02:29:17,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.20 | bwd_microstep: 1823.46 | bwd_inner_microstep: 1713.82 | bwd_allreduce_microstep: 109.59 | step_microstep: 113.93
[2025-08-03 02:29:17,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.61 | bwd: 7734.73 | bwd_inner: 7024.03 | bwd_allreduce: 710.48 | step: 114.37
{'loss': 0.8152, 'learning_rate': 1.957350695395474e-05, 'epoch': 0.12}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11920
total_samples=3709, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:19,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.27 | bwd_microstep: 1907.10 | bwd_inner_microstep: 1735.18 | bwd_allreduce_microstep: 171.85 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11675
total_samples=3712, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:22,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.70 | bwd_microstep: 2126.99 | bwd_inner_microstep: 1941.91 | bwd_allreduce_microstep: 185.02 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13396
total_samples=3716, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:25,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.36 | bwd_microstep: 1754.31 | bwd_inner_microstep: 1677.63 | bwd_allreduce_microstep: 76.61 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16149
total_samples=3720, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:28,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 02:29:28,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.89 | bwd_microstep: 2078.41 | bwd_inner_microstep: 1927.17 | bwd_allreduce_microstep: 151.19 | step_microstep: 391.72
[2025-08-03 02:29:28,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.15 | bwd: 7866.86 | bwd_inner: 7281.88 | bwd_allreduce: 584.75 | step: 392.06
{'loss': 0.8148, 'learning_rate': 1.956881556062369e-05, 'epoch': 0.12}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11793
total_samples=3723, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:31,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.05 | bwd_microstep: 2009.52 | bwd_inner_microstep: 1792.76 | bwd_allreduce_microstep: 216.69 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11870
total_samples=3726, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:34,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.20 | bwd_microstep: 1978.76 | bwd_inner_microstep: 1755.69 | bwd_allreduce_microstep: 223.01 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15146
total_samples=3730, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:36,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.19 | bwd_microstep: 1778.77 | bwd_inner_microstep: 1751.46 | bwd_allreduce_microstep: 27.24 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13487
total_samples=3735, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:39,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 02:29:39,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.20 | bwd_microstep: 1771.69 | bwd_inner_microstep: 1711.63 | bwd_allreduce_microstep: 60.00 | step_microstep: 120.99
[2025-08-03 02:29:39,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.56 | bwd: 7538.79 | bwd_inner: 7011.54 | bwd_allreduce: 527.02 | step: 121.32
{'loss': 0.8295, 'learning_rate': 1.956409907418935e-05, 'epoch': 0.12}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12008
total_samples=3738, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:42,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.19 | bwd_microstep: 2007.09 | bwd_inner_microstep: 1799.37 | bwd_allreduce_microstep: 207.66 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11746
total_samples=3742, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:46,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1526.11 | bwd_microstep: 2442.53 | bwd_inner_microstep: 2215.50 | bwd_allreduce_microstep: 226.96 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12188
total_samples=3746, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:48,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.27 | bwd_microstep: 2000.97 | bwd_inner_microstep: 1785.52 | bwd_allreduce_microstep: 215.39 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13829
total_samples=3750, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:51,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 02:29:51,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.24 | bwd_microstep: 1758.48 | bwd_inner_microstep: 1722.15 | bwd_allreduce_microstep: 36.27 | step_microstep: 131.85
[2025-08-03 02:29:51,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3600.75 | bwd: 8209.12 | bwd_inner: 7522.53 | bwd_allreduce: 686.36 | step: 132.16
{'loss': 0.8113, 'learning_rate': 1.9559357507020163e-05, 'epoch': 0.12}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11563
total_samples=3753, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:54,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.01 | bwd_microstep: 1975.47 | bwd_inner_microstep: 1756.23 | bwd_allreduce_microstep: 219.18 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12462
total_samples=3757, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:56,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.45 | bwd_microstep: 1794.22 | bwd_inner_microstep: 1593.85 | bwd_allreduce_microstep: 200.30 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11860
total_samples=3760, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:29:59,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.97 | bwd_microstep: 1722.93 | bwd_inner_microstep: 1534.48 | bwd_allreduce_microstep: 188.39 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15637
total_samples=3765, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:02,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 02:30:02,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.71 | bwd_microstep: 1800.26 | bwd_inner_microstep: 1794.19 | bwd_allreduce_microstep: 6.00 | step_microstep: 127.21
[2025-08-03 02:30:02,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.07 | bwd: 7292.93 | bwd_inner: 6678.74 | bwd_allreduce: 613.96 | step: 127.54
{'loss': 0.8055, 'learning_rate': 1.955459087155033e-05, 'epoch': 0.12}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12125
total_samples=3768, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:05,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.07 | bwd_microstep: 2208.20 | bwd_inner_microstep: 1983.95 | bwd_allreduce_microstep: 224.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13098
total_samples=3772, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:07,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.17 | bwd_microstep: 1928.82 | bwd_inner_microstep: 1719.23 | bwd_allreduce_microstep: 209.54 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11871
total_samples=3775, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:10,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.98 | bwd_microstep: 1782.41 | bwd_inner_microstep: 1540.23 | bwd_allreduce_microstep: 242.12 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14867
total_samples=3780, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:13,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 02:30:13,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.48 | bwd_microstep: 1884.76 | bwd_inner_microstep: 1840.11 | bwd_allreduce_microstep: 44.59 | step_microstep: 148.76
[2025-08-03 02:30:13,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.64 | bwd: 7804.24 | bwd_inner: 7083.52 | bwd_allreduce: 720.50 | step: 149.06
 12%|█▏        | 240/2000 [46:21<5:17:50, 10.84s/it]
 12%|█▏        | 241/2000 [46:32<5:18:52, 10.88s/it]
 12%|█▏        | 242/2000 [46:43<5:23:20, 11.04s/it]
 12%|█▏        | 243/2000 [46:54<5:21:06, 10.97s/it]
 12%|█▏        | 244/2000 [47:06<5:32:07, 11.35s/it]
 12%|█▏        | 245/2000 [47:16<5:24:27, 11.09s/it]
{'loss': 0.8093, 'learning_rate': 1.9549799180279793e-05, 'epoch': 0.12}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13404
total_samples=3784, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:15,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.53 | bwd_microstep: 1765.51 | bwd_inner_microstep: 1676.46 | bwd_allreduce_microstep: 88.98 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13706
total_samples=3788, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:18,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.10 | bwd_microstep: 1967.72 | bwd_inner_microstep: 1890.35 | bwd_allreduce_microstep: 77.31 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12551
total_samples=3791, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:21,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.46 | bwd_microstep: 1920.95 | bwd_inner_microstep: 1864.39 | bwd_allreduce_microstep: 56.49 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13313
total_samples=3795, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:24,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 02:30:24,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.46 | bwd_microstep: 2120.09 | bwd_inner_microstep: 1972.15 | bwd_allreduce_microstep: 147.87 | step_microstep: 118.84
[2025-08-03 02:30:24,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.48 | bwd: 7774.32 | bwd_inner: 7403.36 | bwd_allreduce: 370.74 | step: 119.26
{'loss': 0.8098, 'learning_rate': 1.9544982445774217e-05, 'epoch': 0.12}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12879
total_samples=3799, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:26,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.57 | bwd_microstep: 1839.12 | bwd_inner_microstep: 1739.39 | bwd_allreduce_microstep: 99.66 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13474
total_samples=3803, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:29,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.31 | bwd_microstep: 1739.53 | bwd_inner_microstep: 1678.19 | bwd_allreduce_microstep: 61.27 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16347
total_samples=3807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:32,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.93 | bwd_microstep: 2006.67 | bwd_inner_microstep: 1852.49 | bwd_allreduce_microstep: 154.12 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13597
total_samples=3811, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:34,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 02:30:34,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.06 | bwd_microstep: 1717.41 | bwd_inner_microstep: 1666.94 | bwd_allreduce_microstep: 50.41 | step_microstep: 114.93
[2025-08-03 02:30:34,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.80 | bwd: 7302.78 | bwd_inner: 6937.01 | bwd_allreduce: 365.54 | step: 115.28
{'loss': 0.8135, 'learning_rate': 1.9540140680664915e-05, 'epoch': 0.12}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11678
total_samples=3814, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:37,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.57 | bwd_microstep: 1965.64 | bwd_inner_microstep: 1618.30 | bwd_allreduce_microstep: 347.28 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13173
total_samples=3818, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:40,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.11 | bwd_microstep: 1961.34 | bwd_inner_microstep: 1844.83 | bwd_allreduce_microstep: 116.45 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13723
total_samples=3822, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:42,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.64 | bwd_microstep: 1807.43 | bwd_inner_microstep: 1721.56 | bwd_allreduce_microstep: 85.81 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14015
total_samples=3827, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:45,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 02:30:45,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.01 | bwd_microstep: 2105.63 | bwd_inner_microstep: 1909.80 | bwd_allreduce_microstep: 195.77 | step_microstep: 122.24
[2025-08-03 02:30:45,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.26 | bwd: 7840.10 | bwd_inner: 7094.49 | bwd_allreduce: 745.38 | step: 122.70
{'loss': 0.8154, 'learning_rate': 1.9535273897648857e-05, 'epoch': 0.12}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12149
total_samples=3830, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:48,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.99 | bwd_microstep: 1995.37 | bwd_inner_microstep: 1799.62 | bwd_allreduce_microstep: 195.69 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15589
total_samples=3834, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:51,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.87 | bwd_microstep: 2153.05 | bwd_inner_microstep: 1996.79 | bwd_allreduce_microstep: 156.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14128
total_samples=3838, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:54,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.97 | bwd_microstep: 2127.44 | bwd_inner_microstep: 1980.34 | bwd_allreduce_microstep: 147.04 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13171
total_samples=3843, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:57,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78
[2025-08-03 02:30:57,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.22 | bwd_microstep: 1777.29 | bwd_inner_microstep: 1638.61 | bwd_allreduce_microstep: 138.62 | step_microstep: 122.23
[2025-08-03 02:30:57,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.99 | bwd: 8053.19 | bwd_inner: 7415.36 | bwd_allreduce: 637.61 | step: 122.54
{'loss': 0.8267, 'learning_rate': 1.953038210948861e-05, 'epoch': 0.12}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12004
total_samples=3846, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:30:59,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.36 | bwd_microstep: 1806.29 | bwd_inner_microstep: 1559.56 | bwd_allreduce_microstep: 246.66 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12042
total_samples=3849, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:02,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.77 | bwd_microstep: 1816.07 | bwd_inner_microstep: 1550.05 | bwd_allreduce_microstep: 265.96 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13618
total_samples=3853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:04,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.02 | bwd_microstep: 1737.04 | bwd_inner_microstep: 1683.65 | bwd_allreduce_microstep: 53.32 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12785
total_samples=3857, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:07,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 02:31:07,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.57 | bwd_microstep: 1735.44 | bwd_inner_microstep: 1592.88 | bwd_allreduce_microstep: 142.49 | step_microstep: 137.96
[2025-08-03 02:31:07,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.66 | bwd: 7094.87 | bwd_inner: 6386.14 | bwd_allreduce: 708.50 | step: 138.29
{'loss': 0.8183, 'learning_rate': 1.9525465329012322e-05, 'epoch': 0.13}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13320
total_samples=3861, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:11,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1615.81 | bwd_microstep: 2451.57 | bwd_inner_microstep: 2291.10 | bwd_allreduce_microstep: 160.41 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12785
total_samples=3865, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:14,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.43 | bwd_microstep: 1842.22 | bwd_inner_microstep: 1627.98 | bwd_allreduce_microstep: 214.18 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13933
total_samples=3869, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:16,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.73 | bwd_microstep: 1779.27 | bwd_inner_microstep: 1721.90 | bwd_allreduce_microstep: 57.31 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13260
total_samples=3873, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:19,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:31:19,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.44 | bwd_microstep: 1907.68 | bwd_inner_microstep: 1792.14 | bwd_allreduce_microstep: 115.49 | step_microstep: 111.88
[2025-08-03 02:31:19,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3742.35 | bwd: 7980.78 | bwd_inner: 7433.11 | bwd_allreduce: 547.44 | step: 112.20
 12%|█▏        | 246/2000 [47:28<5:24:04, 11.09s/it]
 12%|█▏        | 247/2000 [47:39<5:23:11, 11.06s/it]
 12%|█▏        | 248/2000 [47:49<5:18:50, 10.92s/it]
 12%|█▏        | 249/2000 [48:00<5:20:38, 10.99s/it]
 12%|█▎        | 250/2000 [48:12<5:23:19, 11.09s/it]
 13%|█▎        | 251/2000 [48:22<5:16:36, 10.86s/it]
{'loss': 0.8265, 'learning_rate': 1.952052356911368e-05, 'epoch': 0.13}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13451
total_samples=3877, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:22,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.24 | bwd_microstep: 2002.78 | bwd_inner_microstep: 1654.79 | bwd_allreduce_microstep: 347.93 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11693
total_samples=3880, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:25,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.22 | bwd_microstep: 1779.23 | bwd_inner_microstep: 1537.11 | bwd_allreduce_microstep: 242.05 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13401
total_samples=3884, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:27,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.42 | bwd_microstep: 1816.68 | bwd_inner_microstep: 1715.21 | bwd_allreduce_microstep: 101.40 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13697
total_samples=3888, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:30,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 02:31:30,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.00 | bwd_microstep: 1908.83 | bwd_inner_microstep: 1804.28 | bwd_allreduce_microstep: 104.48 | step_microstep: 136.19
[2025-08-03 02:31:30,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.80 | bwd: 7507.57 | bwd_inner: 6711.40 | bwd_allreduce: 795.94 | step: 136.63
{'loss': 0.8159, 'learning_rate': 1.9515556842751863e-05, 'epoch': 0.13}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13151
total_samples=3892, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:33,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.32 | bwd_microstep: 1939.76 | bwd_inner_microstep: 1803.06 | bwd_allreduce_microstep: 136.63 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13855
total_samples=3897, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:35,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.48 | bwd_microstep: 1760.16 | bwd_inner_microstep: 1740.72 | bwd_allreduce_microstep: 19.35 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13457
total_samples=3901, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:38,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.44 | bwd_microstep: 2086.63 | bwd_inner_microstep: 1939.25 | bwd_allreduce_microstep: 147.31 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13702
total_samples=3905, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:41,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 02:31:41,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.46 | bwd_microstep: 2144.71 | bwd_inner_microstep: 1983.18 | bwd_allreduce_microstep: 161.47 | step_microstep: 111.38
[2025-08-03 02:31:41,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.63 | bwd: 7931.29 | bwd_inner: 7466.22 | bwd_allreduce: 464.81 | step: 111.69
{'loss': 0.8247, 'learning_rate': 1.9510565162951538e-05, 'epoch': 0.13}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12155
total_samples=3908, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:44,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.58 | bwd_microstep: 2233.58 | bwd_inner_microstep: 2014.25 | bwd_allreduce_microstep: 219.27 | step_microstep: 0.09
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13624
total_samples=3912, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:47,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.99 | bwd_microstep: 1838.21 | bwd_inner_microstep: 1707.30 | bwd_allreduce_microstep: 130.85 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13974
total_samples=3916, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:49,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.11 | bwd_microstep: 1801.40 | bwd_inner_microstep: 1728.70 | bwd_allreduce_microstep: 72.63 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12458
total_samples=3920, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:52,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 02:31:52,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.62 | bwd_microstep: 1761.25 | bwd_inner_microstep: 1592.57 | bwd_allreduce_microstep: 168.62 | step_microstep: 111.98
[2025-08-03 02:31:52,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.24 | bwd: 7634.49 | bwd_inner: 7042.83 | bwd_allreduce: 591.44 | step: 112.31
{'loss': 0.817, 'learning_rate': 1.9505548542802805e-05, 'epoch': 0.13}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11847
total_samples=3923, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:55,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.15 | bwd_microstep: 1732.29 | bwd_inner_microstep: 1531.38 | bwd_allreduce_microstep: 200.84 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 11975
total_samples=3927, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:31:57,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.75 | bwd_microstep: 1845.97 | bwd_inner_microstep: 1614.02 | bwd_allreduce_microstep: 231.89 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13505
total_samples=3931, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:00,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.11 | bwd_microstep: 1812.61 | bwd_inner_microstep: 1687.51 | bwd_allreduce_microstep: 125.04 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13193
total_samples=3935, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:03,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 02:32:03,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.37 | bwd_microstep: 2092.64 | bwd_inner_microstep: 1947.40 | bwd_allreduce_microstep: 145.18 | step_microstep: 127.88
[2025-08-03 02:32:03,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.32 | bwd: 7483.56 | bwd_inner: 6780.29 | bwd_allreduce: 703.03 | step: 128.36
{'loss': 0.7915, 'learning_rate': 1.950050699546116e-05, 'epoch': 0.13}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11707
total_samples=3938, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:06,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.94 | bwd_microstep: 1845.66 | bwd_inner_microstep: 1694.14 | bwd_allreduce_microstep: 151.46 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11463
total_samples=3941, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:08,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.78 | bwd_microstep: 2221.03 | bwd_inner_microstep: 1990.56 | bwd_allreduce_microstep: 230.41 | step_microstep: 0.09
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13427
total_samples=3946, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:11,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.03 | bwd_microstep: 1739.93 | bwd_inner_microstep: 1645.89 | bwd_allreduce_microstep: 93.98 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15657
total_samples=3950, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:14,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 02:32:14,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.80 | bwd_microstep: 1970.28 | bwd_inner_microstep: 1914.96 | bwd_allreduce_microstep: 55.25 | step_microstep: 133.47
[2025-08-03 02:32:14,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.47 | bwd: 7776.95 | bwd_inner: 7245.55 | bwd_allreduce: 531.17 | step: 133.78
{'loss': 0.8134, 'learning_rate': 1.949544053414748e-05, 'epoch': 0.13}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13184
total_samples=3954, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:16,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.50 | bwd_microstep: 1749.12 | bwd_inner_microstep: 1687.39 | bwd_allreduce_microstep: 61.67 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12590
total_samples=3957, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:19,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.58 | bwd_microstep: 1874.72 | bwd_inner_microstep: 1716.05 | bwd_allreduce_microstep: 158.61 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12875
total_samples=3961, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:22,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.04 | bwd_microstep: 1733.94 | bwd_inner_microstep: 1621.20 | bwd_allreduce_microstep: 112.67 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15120
total_samples=3965, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:24,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 02:32:24,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.33 | bwd_microstep: 1801.37 | bwd_inner_microstep: 1762.83 | bwd_allreduce_microstep: 38.48 | step_microstep: 152.18
[2025-08-03 02:32:24,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.37 | bwd: 7159.20 | bwd_inner: 6787.47 | bwd_allreduce: 371.50 | step: 152.54
 13%|█▎        | 252/2000 [48:34<5:27:57, 11.26s/it]
 13%|█▎        | 253/2000 [48:45<5:23:51, 11.12s/it]
 13%|█▎        | 254/2000 [48:56<5:24:02, 11.14s/it]
 13%|█▎        | 255/2000 [49:07<5:21:39, 11.06s/it]
 13%|█▎        | 256/2000 [49:18<5:18:54, 10.97s/it]
 13%|█▎        | 257/2000 [49:29<5:19:23, 10.99s/it]
 13%|█▎        | 258/2000 [49:39<5:14:27, 10.83s/it]
{'loss': 0.8152, 'learning_rate': 1.9490349172147964e-05, 'epoch': 0.13}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12051
total_samples=3968, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:27,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.60 | bwd_microstep: 1941.73 | bwd_inner_microstep: 1552.09 | bwd_allreduce_microstep: 389.57 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13550
total_samples=3972, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:30,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.82 | bwd_microstep: 1746.64 | bwd_inner_microstep: 1684.71 | bwd_allreduce_microstep: 61.87 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15870
total_samples=3976, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:32,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.70 | bwd_microstep: 1781.56 | bwd_inner_microstep: 1775.25 | bwd_allreduce_microstep: 6.25 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12017
total_samples=3980, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:35,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 02:32:35,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1091.25 | bwd_microstep: 1747.46 | bwd_inner_microstep: 1542.83 | bwd_allreduce_microstep: 204.57 | step_microstep: 109.77
[2025-08-03 02:32:35,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3187.31 | bwd: 7217.43 | bwd_inner: 6554.88 | bwd_allreduce: 662.33 | step: 110.19
{'loss': 0.815, 'learning_rate': 1.9485232922814117e-05, 'epoch': 0.13}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14344
total_samples=3984, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:38,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.46 | bwd_microstep: 1879.65 | bwd_inner_microstep: 1769.77 | bwd_allreduce_microstep: 109.81 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13447
total_samples=3988, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:41,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.75 | bwd_microstep: 1823.02 | bwd_inner_microstep: 1734.28 | bwd_allreduce_microstep: 88.68 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13127
total_samples=3992, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:43,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.81 | bwd_microstep: 1887.52 | bwd_inner_microstep: 1642.89 | bwd_allreduce_microstep: 244.58 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13595
total_samples=3996, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:46,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:32:46,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.04 | bwd_microstep: 1729.04 | bwd_inner_microstep: 1689.69 | bwd_allreduce_microstep: 39.29 | step_microstep: 116.33
[2025-08-03 02:32:46,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.96 | bwd: 7319.28 | bwd_inner: 6836.63 | bwd_allreduce: 482.43 | step: 116.62
{'loss': 0.8121, 'learning_rate': 1.9480091799562706e-05, 'epoch': 0.13}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14371
total_samples=4000, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:49,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.34 | bwd_microstep: 2157.94 | bwd_inner_microstep: 2020.04 | bwd_allreduce_microstep: 137.83 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12784
total_samples=4004, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:52,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.61 | bwd_microstep: 2035.89 | bwd_inner_microstep: 1954.72 | bwd_allreduce_microstep: 81.12 | step_microstep: 0.22
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12308
total_samples=4008, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:54,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.88 | bwd_microstep: 2082.17 | bwd_inner_microstep: 1643.30 | bwd_allreduce_microstep: 438.82 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12263
total_samples=4011, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:32:57,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:32:57,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.42 | bwd_microstep: 1790.13 | bwd_inner_microstep: 1565.84 | bwd_allreduce_microstep: 224.23 | step_microstep: 140.85
[2025-08-03 02:32:57,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.19 | bwd: 8066.17 | bwd_inner: 7183.89 | bwd_allreduce: 882.07 | step: 141.27
{'loss': 0.8175, 'learning_rate': 1.947492581587573e-05, 'epoch': 0.13}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13232
total_samples=4014, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:00,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.97 | bwd_microstep: 2010.87 | bwd_inner_microstep: 1826.09 | bwd_allreduce_microstep: 184.71 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11604
total_samples=4017, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:03,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.06 | bwd_microstep: 1788.50 | bwd_inner_microstep: 1544.39 | bwd_allreduce_microstep: 244.04 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13769
total_samples=4021, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:05,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.26 | bwd_microstep: 1799.96 | bwd_inner_microstep: 1692.68 | bwd_allreduce_microstep: 107.22 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11914
total_samples=4024, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:08,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 02:33:08,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.10 | bwd_microstep: 1814.31 | bwd_inner_microstep: 1576.02 | bwd_allreduce_microstep: 238.22 | step_microstep: 163.86
[2025-08-03 02:33:08,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2752.33 | bwd: 7413.68 | bwd_inner: 6639.18 | bwd_allreduce: 774.27 | step: 164.18
{'loss': 0.8067, 'learning_rate': 1.9469734985300373e-05, 'epoch': 0.13}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14044
total_samples=4028, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:11,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.99 | bwd_microstep: 1891.16 | bwd_inner_microstep: 1780.30 | bwd_allreduce_microstep: 110.80 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11638
total_samples=4031, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:13,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.53 | bwd_microstep: 2032.57 | bwd_inner_microstep: 1799.53 | bwd_allreduce_microstep: 232.97 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12911
total_samples=4034, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:16,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.46 | bwd_microstep: 1885.11 | bwd_inner_microstep: 1616.40 | bwd_allreduce_microstep: 268.65 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11670
total_samples=4037, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:19,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 02:33:19,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.29 | bwd_microstep: 2018.31 | bwd_inner_microstep: 1801.44 | bwd_allreduce_microstep: 216.81 | step_microstep: 107.38
[2025-08-03 02:33:19,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.22 | bwd: 7827.19 | bwd_inner: 6997.65 | bwd_allreduce: 829.31 | step: 107.72
{'loss': 0.812, 'learning_rate': 1.9464519321448988e-05, 'epoch': 0.13}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12847
total_samples=4041, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:23,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1736.66 | bwd_microstep: 1873.84 | bwd_inner_microstep: 1677.70 | bwd_allreduce_microstep: 196.07 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11726
total_samples=4044, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:25,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.54 | bwd_microstep: 2004.99 | bwd_inner_microstep: 1806.32 | bwd_allreduce_microstep: 198.60 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11804
total_samples=4047, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:28,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.78 | bwd_microstep: 2034.16 | bwd_inner_microstep: 1845.54 | bwd_allreduce_microstep: 188.56 | step_microstep: 0.20
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12528
total_samples=4051, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:31,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 02:33:31,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.12 | bwd_microstep: 1748.50 | bwd_inner_microstep: 1606.78 | bwd_allreduce_microstep: 141.65 | step_microstep: 122.94
[2025-08-03 02:33:31,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3802.02 | bwd: 7661.53 | bwd_inner: 6936.34 | bwd_allreduce: 724.96 | step: 123.36
 13%|█▎        | 259/2000 [49:50<5:14:32, 10.84s/it]
 13%|█▎        | 260/2000 [50:01<5:12:13, 10.77s/it]
 13%|█▎        | 261/2000 [50:12<5:17:20, 10.95s/it]
 13%|█▎        | 262/2000 [50:23<5:14:45, 10.87s/it]
 13%|█▎        | 263/2000 [50:34<5:16:17, 10.93s/it]
 13%|█▎        | 264/2000 [50:46<5:24:36, 11.22s/it]
{'loss': 0.82, 'learning_rate': 1.9459278837999048e-05, 'epoch': 0.13}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13199
total_samples=4055, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:33,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.38 | bwd_microstep: 1778.20 | bwd_inner_microstep: 1682.46 | bwd_allreduce_microstep: 95.68 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12031
total_samples=4058, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:36,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.90 | bwd_microstep: 1764.25 | bwd_inner_microstep: 1552.96 | bwd_allreduce_microstep: 211.23 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12979
total_samples=4062, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:39,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.97 | bwd_microstep: 1830.34 | bwd_inner_microstep: 1684.64 | bwd_allreduce_microstep: 145.63 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13072
total_samples=4066, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:41,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 02:33:41,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.52 | bwd_microstep: 1925.19 | bwd_inner_microstep: 1852.42 | bwd_allreduce_microstep: 72.71 | step_microstep: 109.36
[2025-08-03 02:33:41,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.70 | bwd: 7298.02 | bwd_inner: 6772.47 | bwd_allreduce: 525.33 | step: 109.66
{'loss': 0.8124, 'learning_rate': 1.9454013548693103e-05, 'epoch': 0.13}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12264
total_samples=4070, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:44,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.27 | bwd_microstep: 1826.04 | bwd_inner_microstep: 1594.04 | bwd_allreduce_microstep: 231.93 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11818
total_samples=4073, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:47,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.20 | bwd_microstep: 1820.50 | bwd_inner_microstep: 1570.94 | bwd_allreduce_microstep: 249.50 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11868
total_samples=4076, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:49,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.72 | bwd_microstep: 1777.21 | bwd_inner_microstep: 1548.57 | bwd_allreduce_microstep: 228.58 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11614
total_samples=4079, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:52,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 02:33:52,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.58 | bwd_microstep: 2206.88 | bwd_inner_microstep: 1970.33 | bwd_allreduce_microstep: 236.49 | step_microstep: 107.06
[2025-08-03 02:33:52,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2869.70 | bwd: 7630.68 | bwd_inner: 6683.86 | bwd_allreduce: 946.58 | step: 107.40
{'loss': 0.8133, 'learning_rate': 1.9448723467338765e-05, 'epoch': 0.13}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13690
total_samples=4083, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:55,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.21 | bwd_microstep: 1755.23 | bwd_inner_microstep: 1686.13 | bwd_allreduce_microstep: 69.05 | step_microstep: 0.20
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12526
total_samples=4087, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:33:57,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.92 | bwd_microstep: 1788.03 | bwd_inner_microstep: 1606.05 | bwd_allreduce_microstep: 181.92 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11852
total_samples=4090, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:00,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.03 | bwd_microstep: 1779.23 | bwd_inner_microstep: 1544.31 | bwd_allreduce_microstep: 234.86 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13374
total_samples=4094, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:03,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 02:34:03,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.09 | bwd_microstep: 2058.74 | bwd_inner_microstep: 1853.84 | bwd_allreduce_microstep: 204.84 | step_microstep: 132.71
[2025-08-03 02:34:03,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.18 | bwd: 7381.29 | bwd_inner: 6690.31 | bwd_allreduce: 690.74 | step: 133.14
{'loss': 0.8125, 'learning_rate': 1.944340860780865e-05, 'epoch': 0.13}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13179
total_samples=4098, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:06,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.96 | bwd_microstep: 1758.06 | bwd_inner_microstep: 1664.45 | bwd_allreduce_microstep: 93.55 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11778
total_samples=4101, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:09,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 933.91 | bwd_microstep: 2019.52 | bwd_inner_microstep: 1811.33 | bwd_allreduce_microstep: 208.12 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13566
total_samples=4105, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:11,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.59 | bwd_microstep: 2062.09 | bwd_inner_microstep: 1911.68 | bwd_allreduce_microstep: 150.35 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11907
total_samples=4108, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:15,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 02:34:15,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.72 | bwd_microstep: 2295.44 | bwd_inner_microstep: 2122.25 | bwd_allreduce_microstep: 173.13 | step_microstep: 111.13
[2025-08-03 02:34:15,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3020.12 | bwd: 8135.16 | bwd_inner: 7509.69 | bwd_allreduce: 625.21 | step: 111.45
{'loss': 0.8114, 'learning_rate': 1.9438068984040366e-05, 'epoch': 0.13}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13369
total_samples=4112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:17,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.24 | bwd_microstep: 1773.92 | bwd_inner_microstep: 1678.76 | bwd_allreduce_microstep: 95.09 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13169
total_samples=4116, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:20,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.12 | bwd_microstep: 1824.85 | bwd_inner_microstep: 1689.11 | bwd_allreduce_microstep: 135.69 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13543
total_samples=4120, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:22,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.19 | bwd_microstep: 1730.05 | bwd_inner_microstep: 1674.78 | bwd_allreduce_microstep: 55.20 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11842
total_samples=4123, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:25,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 02:34:25,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.82 | bwd_microstep: 1744.52 | bwd_inner_microstep: 1539.08 | bwd_allreduce_microstep: 205.37 | step_microstep: 133.40
[2025-08-03 02:34:25,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.30 | bwd: 7073.38 | bwd_inner: 6581.72 | bwd_allreduce: 491.41 | step: 133.80
{'loss': 0.8171, 'learning_rate': 1.9432704610036448e-05, 'epoch': 0.13}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12959
total_samples=4127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:28,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.47 | bwd_microstep: 1764.09 | bwd_inner_microstep: 1659.57 | bwd_allreduce_microstep: 104.46 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14116
total_samples=4132, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:30,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.00 | bwd_microstep: 2196.35 | bwd_inner_microstep: 2027.01 | bwd_allreduce_microstep: 169.28 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12905
total_samples=4136, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:33,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.93 | bwd_microstep: 1897.84 | bwd_inner_microstep: 1807.00 | bwd_allreduce_microstep: 90.79 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13279
total_samples=4140, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:36,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.97
[2025-08-03 02:34:36,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.18 | bwd_microstep: 1785.64 | bwd_inner_microstep: 1696.81 | bwd_allreduce_microstep: 88.78 | step_microstep: 124.37
[2025-08-03 02:34:36,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.50 | bwd: 7643.97 | bwd_inner: 7190.38 | bwd_allreduce: 453.38 | step: 124.79
 13%|█▎        | 265/2000 [50:56<5:18:54, 11.03s/it]
 13%|█▎        | 266/2000 [51:07<5:17:46, 11.00s/it]
 13%|█▎        | 267/2000 [51:18<5:14:40, 10.89s/it]
 13%|█▎        | 268/2000 [51:30<5:20:40, 11.11s/it]
 13%|█▎        | 269/2000 [51:40<5:14:05, 10.89s/it]
 14%|█▎        | 270/2000 [51:51<5:13:48, 10.88s/it]
{'loss': 0.8102, 'learning_rate': 1.9427315499864345e-05, 'epoch': 0.14}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12937
total_samples=4144, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:39,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.70 | bwd_microstep: 1991.66 | bwd_inner_microstep: 1862.25 | bwd_allreduce_microstep: 129.34 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12386
total_samples=4148, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:41,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.98 | bwd_microstep: 1907.08 | bwd_inner_microstep: 1754.08 | bwd_allreduce_microstep: 152.94 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13084
total_samples=4152, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:44,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.72 | bwd_microstep: 1902.79 | bwd_inner_microstep: 1691.48 | bwd_allreduce_microstep: 211.24 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13644
total_samples=4156, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:47,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 02:34:47,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.91 | bwd_microstep: 2013.09 | bwd_inner_microstep: 1911.05 | bwd_allreduce_microstep: 101.98 | step_microstep: 112.16
[2025-08-03 02:34:47,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.25 | bwd: 7814.68 | bwd_inner: 7218.86 | bwd_allreduce: 595.58 | step: 112.50
{'loss': 0.8267, 'learning_rate': 1.9421901667656364e-05, 'epoch': 0.14}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12712
total_samples=4160, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:50,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.44 | bwd_microstep: 1827.19 | bwd_inner_microstep: 1616.46 | bwd_allreduce_microstep: 210.66 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12015
total_samples=4163, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:52,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.78 | bwd_microstep: 1721.18 | bwd_inner_microstep: 1539.53 | bwd_allreduce_microstep: 181.58 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12799
total_samples=4167, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:55,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.23 | bwd_microstep: 1920.73 | bwd_inner_microstep: 1854.07 | bwd_allreduce_microstep: 66.61 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13957
total_samples=4171, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:34:58,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 02:34:58,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.95 | bwd_microstep: 2062.24 | bwd_inner_microstep: 1929.10 | bwd_allreduce_microstep: 133.07 | step_microstep: 110.70
[2025-08-03 02:34:58,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2763.32 | bwd: 7531.39 | bwd_inner: 6939.17 | bwd_allreduce: 591.99 | step: 111.12
{'loss': 0.8019, 'learning_rate': 1.9416463127609655e-05, 'epoch': 0.14}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11657
total_samples=4174, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:00,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.09 | bwd_microstep: 1756.81 | bwd_inner_microstep: 1528.72 | bwd_allreduce_microstep: 228.02 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14403
total_samples=4178, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:03,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.70 | bwd_microstep: 1736.34 | bwd_inner_microstep: 1701.63 | bwd_allreduce_microstep: 34.65 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309
total_samples=4182, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:06,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.14 | bwd_microstep: 2063.91 | bwd_inner_microstep: 1911.28 | bwd_allreduce_microstep: 152.56 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309
total_samples=4186, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:08,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:35:08,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.21 | bwd_microstep: 1878.09 | bwd_inner_microstep: 1833.41 | bwd_allreduce_microstep: 44.62 | step_microstep: 113.94
[2025-08-03 02:35:08,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.09 | bwd: 7435.20 | bwd_inner: 6975.04 | bwd_allreduce: 459.93 | step: 114.29
{'loss': 0.8187, 'learning_rate': 1.9410999893986157e-05, 'epoch': 0.14}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13706
total_samples=4190, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:11,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.51 | bwd_microstep: 1943.24 | bwd_inner_microstep: 1706.87 | bwd_allreduce_microstep: 236.30 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11584
total_samples=4193, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:14,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.02 | bwd_microstep: 1756.50 | bwd_inner_microstep: 1535.49 | bwd_allreduce_microstep: 220.95 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13444
total_samples=4197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:16,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.30 | bwd_microstep: 2061.70 | bwd_inner_microstep: 1928.24 | bwd_allreduce_microstep: 133.39 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13402
total_samples=4201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:19,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 02:35:19,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.17 | bwd_microstep: 2080.90 | bwd_inner_microstep: 1977.88 | bwd_allreduce_microstep: 102.97 | step_microstep: 132.37
[2025-08-03 02:35:19,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2746.94 | bwd: 7842.39 | bwd_inner: 7148.48 | bwd_allreduce: 693.68 | step: 132.70
{'loss': 0.8253, 'learning_rate': 1.9405511981112553e-05, 'epoch': 0.14}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13602
total_samples=4205, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:22,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.74 | bwd_microstep: 2125.90 | bwd_inner_microstep: 1893.72 | bwd_allreduce_microstep: 232.13 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13486
total_samples=4209, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:25,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.70 | bwd_microstep: 1906.53 | bwd_inner_microstep: 1818.56 | bwd_allreduce_microstep: 87.91 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15062
total_samples=4213, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:28,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 753.15 | bwd_microstep: 1818.02 | bwd_inner_microstep: 1772.05 | bwd_allreduce_microstep: 45.91 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13552
total_samples=4217, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:31,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 02:35:31,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.31 | bwd_microstep: 1963.40 | bwd_inner_microstep: 1865.84 | bwd_allreduce_microstep: 97.50 | step_microstep: 107.93
[2025-08-03 02:35:31,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.82 | bwd: 7813.91 | bwd_inner: 7350.16 | bwd_allreduce: 463.52 | step: 108.36
{'loss': 0.8169, 'learning_rate': 1.9399999403380266e-05, 'epoch': 0.14}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14830
total_samples=4221, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:33,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.20 | bwd_microstep: 1725.59 | bwd_inner_microstep: 1707.16 | bwd_allreduce_microstep: 18.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12968
total_samples=4225, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:36,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.31 | bwd_microstep: 1708.05 | bwd_inner_microstep: 1649.18 | bwd_allreduce_microstep: 58.81 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13400
total_samples=4229, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:38,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.48 | bwd_microstep: 2037.42 | bwd_inner_microstep: 1902.80 | bwd_allreduce_microstep: 134.56 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12371
total_samples=4233, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:41,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:35:41,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.92 | bwd_microstep: 1798.43 | bwd_inner_microstep: 1625.42 | bwd_allreduce_microstep: 172.95 | step_microstep: 126.06
[2025-08-03 02:35:41,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.84 | bwd: 7269.53 | bwd_inner: 6884.56 | bwd_allreduce: 384.75 | step: 126.37
 14%|█▎        | 270/2000 [51:51<5:13:48, 10.88s/it]
 14%|█▎        | 271/2000 [52:02<5:15:19, 10.94s/it]
 14%|█▎        | 272/2000 [52:13<5:13:34, 10.89s/it]
 14%|█▎        | 273/2000 [52:23<5:11:34, 10.82s/it]
 14%|█▎        | 274/2000 [52:34<5:13:30, 10.90s/it]
 14%|█▍        | 275/2000 [52:45<5:14:54, 10.95s/it]
 14%|█▍        | 276/2000 [52:56<5:11:00, 10.82s/it]
{'loss': 0.8123, 'learning_rate': 1.9394462175245382e-05, 'epoch': 0.14}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13137
total_samples=4237, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:44,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.66 | bwd_microstep: 2353.28 | bwd_inner_microstep: 2183.82 | bwd_allreduce_microstep: 169.39 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11890
total_samples=4240, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:47,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.80 | bwd_microstep: 1812.90 | bwd_inner_microstep: 1577.02 | bwd_allreduce_microstep: 235.82 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12269
total_samples=4243, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:50,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.66 | bwd_microstep: 2253.38 | bwd_inner_microstep: 1821.15 | bwd_allreduce_microstep: 432.18 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12846
total_samples=4247, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:53,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78
[2025-08-03 02:35:53,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.26 | bwd_microstep: 1819.09 | bwd_inner_microstep: 1639.28 | bwd_allreduce_microstep: 179.74 | step_microstep: 109.24
[2025-08-03 02:35:53,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.31 | bwd: 8238.69 | bwd_inner: 7221.27 | bwd_allreduce: 1017.20 | step: 109.55
{'loss': 0.8126, 'learning_rate': 1.9388900311228636e-05, 'epoch': 0.14}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13872
total_samples=4252, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:55,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.92 | bwd_microstep: 1811.77 | bwd_inner_microstep: 1719.36 | bwd_allreduce_microstep: 92.34 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13304
total_samples=4256, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:35:58,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.66 | bwd_microstep: 1754.03 | bwd_inner_microstep: 1680.38 | bwd_allreduce_microstep: 73.60 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12994
total_samples=4260, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:00,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.47 | bwd_microstep: 1766.07 | bwd_inner_microstep: 1651.32 | bwd_allreduce_microstep: 114.69 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13414
total_samples=4264, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:03,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 02:36:03,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.55 | bwd_microstep: 2000.41 | bwd_inner_microstep: 1868.67 | bwd_allreduce_microstep: 131.68 | step_microstep: 106.90
[2025-08-03 02:36:03,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.52 | bwd: 7332.33 | bwd_inner: 6919.71 | bwd_allreduce: 412.38 | step: 107.25
{'loss': 0.8082, 'learning_rate': 1.9383313825915372e-05, 'epoch': 0.14}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13662
total_samples=4268, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:06,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.73 | bwd_microstep: 1765.01 | bwd_inner_microstep: 1692.79 | bwd_allreduce_microstep: 72.17 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11625
total_samples=4271, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:08,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.24 | bwd_microstep: 1739.22 | bwd_inner_microstep: 1538.54 | bwd_allreduce_microstep: 200.62 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13531
total_samples=4275, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:11,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.16 | bwd_microstep: 1744.59 | bwd_inner_microstep: 1689.91 | bwd_allreduce_microstep: 54.61 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11978
total_samples=4278, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:14,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 02:36:14,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.75 | bwd_microstep: 2188.55 | bwd_inner_microstep: 1715.97 | bwd_allreduce_microstep: 472.48 | step_microstep: 113.02
[2025-08-03 02:36:14,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2747.81 | bwd: 7437.41 | bwd_inner: 6637.22 | bwd_allreduce: 799.93 | step: 113.36
{'loss': 0.81, 'learning_rate': 1.9377702733955493e-05, 'epoch': 0.14}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13246
total_samples=4282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:16,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.81 | bwd_microstep: 1757.77 | bwd_inner_microstep: 1679.53 | bwd_allreduce_microstep: 78.18 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11790
total_samples=4285, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:19,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.72 | bwd_microstep: 1830.99 | bwd_inner_microstep: 1583.65 | bwd_allreduce_microstep: 247.27 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13175
total_samples=4289, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:22,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.63 | bwd_microstep: 2006.54 | bwd_inner_microstep: 1851.15 | bwd_allreduce_microstep: 155.33 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12951
total_samples=4293, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:25,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:36:25,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.22 | bwd_microstep: 1977.94 | bwd_inner_microstep: 1688.24 | bwd_allreduce_microstep: 289.64 | step_microstep: 108.13
[2025-08-03 02:36:25,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.31 | bwd: 7573.29 | bwd_inner: 6802.57 | bwd_allreduce: 770.49 | step: 108.56
{'loss': 0.8014, 'learning_rate': 1.937206705006344e-05, 'epoch': 0.14}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13539
total_samples=4297, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:27,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.99 | bwd_microstep: 2038.57 | bwd_inner_microstep: 1853.37 | bwd_allreduce_microstep: 185.15 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13042
total_samples=4301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:30,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.39 | bwd_microstep: 2114.95 | bwd_inner_microstep: 1972.10 | bwd_allreduce_microstep: 142.79 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12198
total_samples=4304, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:33,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.76 | bwd_microstep: 1897.58 | bwd_inner_microstep: 1856.87 | bwd_allreduce_microstep: 40.65 | step_microstep: 0.09
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14258
total_samples=4308, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:36,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 02:36:36,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.57 | bwd_microstep: 1825.26 | bwd_inner_microstep: 1652.20 | bwd_allreduce_microstep: 173.00 | step_microstep: 144.89
[2025-08-03 02:36:36,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2862.64 | bwd: 7876.42 | bwd_inner: 7334.54 | bwd_allreduce: 541.65 | step: 145.18
{'loss': 0.821, 'learning_rate': 1.9366406789018127e-05, 'epoch': 0.14}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12839
total_samples=4312, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:39,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.61 | bwd_microstep: 2135.94 | bwd_inner_microstep: 1871.36 | bwd_allreduce_microstep: 264.52 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14184
total_samples=4316, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:41,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.30 | bwd_microstep: 1785.46 | bwd_inner_microstep: 1721.29 | bwd_allreduce_microstep: 64.11 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13857
total_samples=4320, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:44,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.69 | bwd_microstep: 1729.67 | bwd_inner_microstep: 1667.87 | bwd_allreduce_microstep: 61.73 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14708
total_samples=4325, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:47,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 02:36:47,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.81 | bwd_microstep: 2015.29 | bwd_inner_microstep: 1898.27 | bwd_allreduce_microstep: 116.97 | step_microstep: 139.86
[2025-08-03 02:36:47,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.35 | bwd: 7666.42 | bwd_inner: 7158.78 | bwd_allreduce: 507.41 | step: 140.19
{'loss': 0.8097, 'learning_rate': 1.9360721965662934e-05, 'epoch': 0.14}
 14%|█▍        | 277/2000 [53:07<5:16:26, 11.02s/it]
 14%|█▍        | 278/2000 [53:18<5:12:13, 10.88s/it]
 14%|█▍        | 279/2000 [53:29<5:09:46, 10.80s/it]
 14%|█▍        | 280/2000 [53:39<5:10:05, 10.82s/it]
 14%|█▍        | 281/2000 [53:51<5:13:06, 10.93s/it]
 14%|█▍        | 282/2000 [54:02<5:13:10, 10.94s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13133
total_samples=4329, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:49,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.57 | bwd_microstep: 1773.16 | bwd_inner_microstep: 1638.07 | bwd_allreduce_microstep: 135.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13852
total_samples=4333, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:52,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.96 | bwd_microstep: 1862.75 | bwd_inner_microstep: 1709.27 | bwd_allreduce_microstep: 153.41 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11807
total_samples=4336, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:55,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.95 | bwd_microstep: 1978.90 | bwd_inner_microstep: 1542.55 | bwd_allreduce_microstep: 436.28 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11739
total_samples=4339, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:36:58,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 02:36:58,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.83 | bwd_microstep: 2135.23 | bwd_inner_microstep: 1797.32 | bwd_allreduce_microstep: 337.85 | step_microstep: 134.91
[2025-08-03 02:36:58,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.23 | bwd: 7750.09 | bwd_inner: 6687.21 | bwd_allreduce: 1062.65 | step: 135.24
{'loss': 0.8111, 'learning_rate': 1.9355012594905645e-05, 'epoch': 0.14}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11521
total_samples=4342, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:00,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.84 | bwd_microstep: 1881.73 | bwd_inner_microstep: 1516.32 | bwd_allreduce_microstep: 365.35 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13938
total_samples=4346, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:03,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.31 | bwd_microstep: 1773.12 | bwd_inner_microstep: 1720.25 | bwd_allreduce_microstep: 52.80 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13579
total_samples=4350, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:06,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.30 | bwd_microstep: 1974.48 | bwd_inner_microstep: 1752.83 | bwd_allreduce_microstep: 221.58 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11927
total_samples=4353, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:09,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 02:37:09,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.42 | bwd_microstep: 2061.39 | bwd_inner_microstep: 1584.75 | bwd_allreduce_microstep: 476.53 | step_microstep: 130.07
[2025-08-03 02:37:09,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.80 | bwd: 7690.75 | bwd_inner: 6574.17 | bwd_allreduce: 1116.31 | step: 130.38
{'loss': 0.7972, 'learning_rate': 1.9349278691718426e-05, 'epoch': 0.14}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11812
total_samples=4356, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:11,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.71 | bwd_microstep: 1769.81 | bwd_inner_microstep: 1536.55 | bwd_allreduce_microstep: 233.20 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14177
total_samples=4360, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:14,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.12 | bwd_microstep: 1828.80 | bwd_inner_microstep: 1738.38 | bwd_allreduce_microstep: 90.35 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12213
total_samples=4363, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:17,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.98 | bwd_microstep: 2651.83 | bwd_inner_microstep: 2435.90 | bwd_allreduce_microstep: 215.87 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11924
total_samples=4366, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:20,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:37:20,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.69 | bwd_microstep: 1776.96 | bwd_inner_microstep: 1540.13 | bwd_allreduce_microstep: 236.76 | step_microstep: 109.78
[2025-08-03 02:37:20,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2867.43 | bwd: 8027.44 | bwd_inner: 7250.96 | bwd_allreduce: 776.25 | step: 110.19
{'loss': 0.8086, 'learning_rate': 1.9343520271137764e-05, 'epoch': 0.14}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13602
total_samples=4370, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:23,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.65 | bwd_microstep: 2022.37 | bwd_inner_microstep: 1873.32 | bwd_allreduce_microstep: 148.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13893
total_samples=4374, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:25,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.53 | bwd_microstep: 1812.51 | bwd_inner_microstep: 1720.55 | bwd_allreduce_microstep: 91.90 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809
total_samples=4377, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:28,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.84 | bwd_microstep: 1716.67 | bwd_inner_microstep: 1533.13 | bwd_allreduce_microstep: 183.47 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13373
total_samples=4381, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:31,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 02:37:31,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.87 | bwd_microstep: 2054.00 | bwd_inner_microstep: 1936.39 | bwd_allreduce_microstep: 117.54 | step_microstep: 115.96
[2025-08-03 02:37:31,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.81 | bwd: 7605.61 | bwd_inner: 7063.39 | bwd_allreduce: 541.98 | step: 116.35
{'loss': 0.8097, 'learning_rate': 1.9337737348264448e-05, 'epoch': 0.14}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14004
total_samples=4385, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:34,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.00 | bwd_microstep: 2060.92 | bwd_inner_microstep: 1713.96 | bwd_allreduce_microstep: 346.90 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13361
total_samples=4389, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:36,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.17 | bwd_microstep: 1830.19 | bwd_inner_microstep: 1714.56 | bwd_allreduce_microstep: 115.56 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11495
total_samples=4392, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:39,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.13 | bwd_microstep: 1955.06 | bwd_inner_microstep: 1520.29 | bwd_allreduce_microstep: 434.71 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14034
total_samples=4396, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:42,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 02:37:42,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.85 | bwd_microstep: 1750.11 | bwd_inner_microstep: 1669.94 | bwd_allreduce_microstep: 80.11 | step_microstep: 107.23
[2025-08-03 02:37:42,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.07 | bwd: 7596.31 | bwd_inner: 6618.74 | bwd_allreduce: 977.35 | step: 107.67
{'loss': 0.8243, 'learning_rate': 1.9331929938263515e-05, 'epoch': 0.14}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11708
total_samples=4399, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:45,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.81 | bwd_microstep: 2214.85 | bwd_inner_microstep: 1810.20 | bwd_allreduce_microstep: 404.58 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14218
total_samples=4403, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:48,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.09 | bwd_microstep: 2057.53 | bwd_inner_microstep: 2051.38 | bwd_allreduce_microstep: 6.09 | step_microstep: 0.18
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13447
total_samples=4407, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:50,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.73 | bwd_microstep: 1903.14 | bwd_inner_microstep: 1660.03 | bwd_allreduce_microstep: 243.05 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12892
total_samples=4410, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:53,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 02:37:53,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.64 | bwd_microstep: 1742.65 | bwd_inner_microstep: 1589.78 | bwd_allreduce_microstep: 152.81 | step_microstep: 131.00
[2025-08-03 02:37:53,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.20 | bwd: 7918.23 | bwd_inner: 7111.39 | bwd_allreduce: 806.61 | step: 131.46
{'loss': 0.8061, 'learning_rate': 1.9326098056364224e-05, 'epoch': 0.14}
 14%|█▍        | 283/2000 [54:13<5:13:39, 10.96s/it]
 14%|█▍        | 284/2000 [54:24<5:13:41, 10.97s/it]
 14%|█▍        | 285/2000 [54:35<5:16:17, 11.07s/it]
 14%|█▍        | 286/2000 [54:46<5:14:46, 11.02s/it]
 14%|█▍        | 287/2000 [54:57<5:12:51, 10.96s/it]
 14%|█▍        | 288/2000 [55:08<5:14:34, 11.02s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11856
total_samples=4413, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:56,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.77 | bwd_microstep: 2017.46 | bwd_inner_microstep: 1818.50 | bwd_allreduce_microstep: 198.89 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13723
total_samples=4417, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:37:59,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.42 | bwd_microstep: 2053.02 | bwd_inner_microstep: 1721.60 | bwd_allreduce_microstep: 331.36 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11595
total_samples=4420, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:01,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.86 | bwd_microstep: 1788.43 | bwd_inner_microstep: 1539.66 | bwd_allreduce_microstep: 248.71 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11888
total_samples=4423, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:04,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58
[2025-08-03 02:38:04,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.52 | bwd_microstep: 1815.76 | bwd_inner_microstep: 1573.92 | bwd_allreduce_microstep: 241.76 | step_microstep: 145.61
[2025-08-03 02:38:04,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.51 | bwd: 7674.72 | bwd_inner: 6653.68 | bwd_allreduce: 1020.80 | step: 146.04
{'loss': 0.8147, 'learning_rate': 1.9320241717860007e-05, 'epoch': 0.14}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11962
total_samples=4426, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:07,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.59 | bwd_microstep: 1942.11 | bwd_inner_microstep: 1556.19 | bwd_allreduce_microstep: 385.85 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12662
total_samples=4430, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:09,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.46 | bwd_microstep: 1939.30 | bwd_inner_microstep: 1826.61 | bwd_allreduce_microstep: 112.62 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14609
total_samples=4434, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:12,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.49 | bwd_microstep: 1868.71 | bwd_inner_microstep: 1787.69 | bwd_allreduce_microstep: 80.96 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13129
total_samples=4438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:15,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.22
[2025-08-03 02:38:15,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.03 | bwd_microstep: 2193.86 | bwd_inner_microstep: 2037.83 | bwd_allreduce_microstep: 155.96 | step_microstep: 111.03
[2025-08-03 02:38:15,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.49 | bwd: 7944.03 | bwd_inner: 7208.30 | bwd_allreduce: 735.48 | step: 111.47
{'loss': 0.8144, 'learning_rate': 1.9314360938108427e-05, 'epoch': 0.14}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13622
total_samples=4442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:18,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.19 | bwd_microstep: 1949.89 | bwd_inner_microstep: 1727.08 | bwd_allreduce_microstep: 222.74 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407
total_samples=4446, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:20,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.75 | bwd_microstep: 1833.30 | bwd_inner_microstep: 1714.98 | bwd_allreduce_microstep: 118.26 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14337
total_samples=4450, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:24,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1545.00 | bwd_microstep: 2260.23 | bwd_inner_microstep: 1956.48 | bwd_allreduce_microstep: 303.69 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12675
total_samples=4454, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:27,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:38:27,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.17 | bwd_microstep: 2168.96 | bwd_inner_microstep: 1950.54 | bwd_allreduce_microstep: 218.36 | step_microstep: 148.55
[2025-08-03 02:38:27,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3648.03 | bwd: 8212.42 | bwd_inner: 7349.08 | bwd_allreduce: 863.12 | step: 148.97
{'loss': 0.7944, 'learning_rate': 1.930845573253114e-05, 'epoch': 0.15}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13310
total_samples=4459, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:30,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.63 | bwd_microstep: 1873.68 | bwd_inner_microstep: 1778.76 | bwd_allreduce_microstep: 94.85 | step_microstep: 0.16
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12101
total_samples=4463, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:33,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.03 | bwd_microstep: 1753.15 | bwd_inner_microstep: 1592.38 | bwd_allreduce_microstep: 160.71 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13436
total_samples=4467, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:35,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.73 | bwd_microstep: 1856.65 | bwd_inner_microstep: 1715.57 | bwd_allreduce_microstep: 141.01 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11736
total_samples=4471, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:39,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:38:39,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1598.99 | bwd_microstep: 1847.17 | bwd_inner_microstep: 1599.24 | bwd_allreduce_microstep: 247.87 | step_microstep: 112.46
[2025-08-03 02:38:39,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3685.30 | bwd: 7330.70 | bwd_inner: 6685.94 | bwd_allreduce: 644.53 | step: 112.94
{'loss': 0.8178, 'learning_rate': 1.9302526116613863e-05, 'epoch': 0.15}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13901
total_samples=4475, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:42,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.28 | bwd_microstep: 1980.45 | bwd_inner_microstep: 1745.44 | bwd_allreduce_microstep: 234.95 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13246
total_samples=4479, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:44,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.34 | bwd_microstep: 1831.95 | bwd_inner_microstep: 1695.80 | bwd_allreduce_microstep: 136.09 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13357
total_samples=4483, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:47,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.13 | bwd_microstep: 1781.39 | bwd_inner_microstep: 1690.58 | bwd_allreduce_microstep: 90.74 | step_microstep: 0.23
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12361
total_samples=4487, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:49,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 02:38:49,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.60 | bwd_microstep: 1738.54 | bwd_inner_microstep: 1573.76 | bwd_allreduce_microstep: 164.71 | step_microstep: 110.13
[2025-08-03 02:38:49,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.28 | bwd: 7332.37 | bwd_inner: 6705.58 | bwd_allreduce: 626.57 | step: 110.57
{'loss': 0.8027, 'learning_rate': 1.9296572105906323e-05, 'epoch': 0.15}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13656
total_samples=4491, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:52,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.57 | bwd_microstep: 2226.14 | bwd_inner_microstep: 1905.91 | bwd_allreduce_microstep: 320.17 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13483
total_samples=4495, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:55,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.81 | bwd_microstep: 1720.68 | bwd_inner_microstep: 1664.11 | bwd_allreduce_microstep: 56.51 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13110
total_samples=4500, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:38:58,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.62 | bwd_microstep: 1877.44 | bwd_inner_microstep: 1642.37 | bwd_allreduce_microstep: 235.01 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13302
total_samples=4504, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:00,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 02:39:00,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.69 | bwd_microstep: 1866.91 | bwd_inner_microstep: 1789.29 | bwd_allreduce_microstep: 77.55 | step_microstep: 107.94
[2025-08-03 02:39:00,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2770.63 | bwd: 7691.21 | bwd_inner: 7001.67 | bwd_allreduce: 689.32 | step: 108.25
{'loss': 0.7935, 'learning_rate': 1.9290593716022218e-05, 'epoch': 0.15}
0 [55:08<5:14:34, 11.02s/it]
14%|█▍        | 289/2000 [55:19<5:13:55, 11.01s/it]
14%|█▍        | 290/2000 [55:30<5:15:38, 11.08s/it]
15%|█▍        | 291/2000 [55:42<5:26:19, 11.46s/it]
15%|█▍        | 292/2000 [55:54<5:26:08, 11.46s/it]
15%|█▍        | 293/2000 [56:04<5:18:21, 11.19s/it]
15%|█▍        | 294/2000 [56:15<5:15:47, 11.11s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12919
total_samples=4508, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:03,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.54 | bwd_microstep: 2046.11 | bwd_inner_microstep: 1896.76 | bwd_allreduce_microstep: 149.28 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14192
total_samples=4514, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:06,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.35 | bwd_microstep: 1912.16 | bwd_inner_microstep: 1770.37 | bwd_allreduce_microstep: 141.72 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11761
total_samples=4517, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:09,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.05 | bwd_microstep: 1981.29 | bwd_inner_microstep: 1780.67 | bwd_allreduce_microstep: 200.55 | step_microstep: 0.20
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12365
total_samples=4521, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:12,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 02:39:12,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.26 | bwd_microstep: 2042.38 | bwd_inner_microstep: 1591.29 | bwd_allreduce_microstep: 451.03 | step_microstep: 138.09
[2025-08-03 02:39:12,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.14 | bwd: 7981.98 | bwd_inner: 7039.08 | bwd_allreduce: 942.66 | step: 138.53
{'loss': 0.8231, 'learning_rate': 1.928459096263918e-05, 'epoch': 0.15}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11749
total_samples=4524, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:14,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.66 | bwd_microstep: 1859.05 | bwd_inner_microstep: 1688.97 | bwd_allreduce_microstep: 170.01 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13721
total_samples=4529, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:17,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.15 | bwd_microstep: 1727.50 | bwd_inner_microstep: 1684.31 | bwd_allreduce_microstep: 43.12 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12784
total_samples=4532, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:19,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.63 | bwd_microstep: 1896.50 | bwd_inner_microstep: 1764.47 | bwd_allreduce_microstep: 131.96 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12171
total_samples=4536, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:22,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 02:39:22,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.01 | bwd_microstep: 2150.03 | bwd_inner_microstep: 2030.58 | bwd_allreduce_microstep: 119.39 | step_microstep: 109.60
[2025-08-03 02:39:22,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.37 | bwd: 7633.12 | bwd_inner: 7168.33 | bwd_allreduce: 464.55 | step: 109.93
{'loss': 0.8124, 'learning_rate': 1.9278563861498726e-05, 'epoch': 0.15}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12984
total_samples=4540, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:25,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.30 | bwd_microstep: 1728.64 | bwd_inner_microstep: 1632.71 | bwd_allreduce_microstep: 95.86 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15019
total_samples=4544, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:28,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.67 | bwd_microstep: 2137.15 | bwd_inner_microstep: 1978.67 | bwd_allreduce_microstep: 158.42 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12108
total_samples=4547, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:30,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.49 | bwd_microstep: 1736.05 | bwd_inner_microstep: 1556.44 | bwd_allreduce_microstep: 179.55 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13326
total_samples=4551, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:33,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 02:39:33,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.77 | bwd_microstep: 2105.74 | bwd_inner_microstep: 2057.59 | bwd_allreduce_microstep: 48.09 | step_microstep: 130.39
[2025-08-03 02:39:33,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.16 | bwd: 7707.63 | bwd_inner: 7225.41 | bwd_allreduce: 482.00 | step: 130.87
{'loss': 0.8038, 'learning_rate': 1.927251242840623e-05, 'epoch': 0.15}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13287
total_samples=4555, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:36,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.61 | bwd_microstep: 1756.40 | bwd_inner_microstep: 1648.09 | bwd_allreduce_microstep: 108.24 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13969
total_samples=4558, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:39,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.57 | bwd_microstep: 1769.82 | bwd_inner_microstep: 1675.30 | bwd_allreduce_microstep: 94.46 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12047
total_samples=4561, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:42,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.25 | bwd_microstep: 2299.25 | bwd_inner_microstep: 2042.87 | bwd_allreduce_microstep: 256.31 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13403
total_samples=4566, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:44,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:39:44,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.70 | bwd_microstep: 1779.82 | bwd_inner_microstep: 1675.97 | bwd_allreduce_microstep: 103.79 | step_microstep: 107.14
[2025-08-03 02:39:44,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.06 | bwd: 7605.33 | bwd_inner: 7042.22 | bwd_allreduce: 562.87 | step: 107.46
{'loss': 0.8088, 'learning_rate': 1.9266436679230866e-05, 'epoch': 0.15}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13406
total_samples=4570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:47,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.30 | bwd_microstep: 1757.10 | bwd_inner_microstep: 1620.76 | bwd_allreduce_microstep: 136.28 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15381
total_samples=4574, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:49,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.06 | bwd_microstep: 1812.81 | bwd_inner_microstep: 1797.63 | bwd_allreduce_microstep: 15.12 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13476
total_samples=4578, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:52,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.00 | bwd_microstep: 1772.89 | bwd_inner_microstep: 1696.72 | bwd_allreduce_microstep: 76.10 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13538
total_samples=4582, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:55,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 02:39:55,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.42 | bwd_microstep: 1791.02 | bwd_inner_microstep: 1707.88 | bwd_allreduce_microstep: 83.08 | step_microstep: 123.89
[2025-08-03 02:39:55,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.73 | bwd: 7133.87 | bwd_inner: 6822.99 | bwd_allreduce: 310.65 | step: 124.21
{'loss': 0.8168, 'learning_rate': 1.926033662990558e-05, 'epoch': 0.15}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11761
total_samples=4585, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:39:57,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.23 | bwd_microstep: 1742.46 | bwd_inner_microstep: 1526.59 | bwd_allreduce_microstep: 215.79 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14592
total_samples=4589, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:00,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.50 | bwd_microstep: 1775.02 | bwd_inner_microstep: 1730.36 | bwd_allreduce_microstep: 44.58 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13206
total_samples=4593, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:02,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 657.70 | bwd_microstep: 1726.61 | bwd_inner_microstep: 1655.86 | bwd_allreduce_microstep: 70.68 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12654
total_samples=4597, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:05,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 02:40:05,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.48 | bwd_microstep: 2034.88 | bwd_inner_microstep: 1779.24 | bwd_allreduce_microstep: 255.57 | step_microstep: 133.41
[2025-08-03 02:40:05,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2770.82 | bwd: 7279.01 | bwd_inner: 6692.05 | bwd_allreduce: 586.71 | step: 133.90
{'loss': 0.802, 'learning_rate': 1.9254212296427043e-05, 'epoch': 0.15}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13322
total_samples=4601, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:08,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.69 | bwd_microstep: 1967.84 | bwd_inner_microstep: 1879.85 | bwd_allreduce_microstep: 87.93 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13145
total_samples=4605, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:11,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.54 | bwd_microstep: 1967.43 | bwd_inner_microstep: 1689.81 | bwd_allreduce_microstep: 277.56 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13676
total_samples=4609, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:13,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.53 | bwd_microstep: 1934.62 | bwd_inner_microstep: 1928.34 | bwd_allreduce_microstep: 6.22 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12034
total_samples=4612, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:16,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:40:16,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.55 | bwd_microstep: 1757.75 | bwd_inner_microstep: 1554.88 | bwd_allreduce_microstep: 202.80 | step_microstep: 123.88
[2025-08-03 02:40:16,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.23 | bwd: 7627.69 | bwd_inner: 7052.88 | bwd_allreduce: 574.58 | step: 124.21
15%|█▍        | 295/2000 [56:26<5:16:47, 11.15s/it]
15%|█▍        | 295/2000 [56:27<5:16:47, 11.15s/it]
15%|█▍        | 296/2000 [56:37<5:14:11, 11.06s/it]
15%|█▍        | 297/2000 [56:48<5:13:03, 11.03s/it]
15%|█▍        | 298/2000 [56:59<5:11:02, 10.97s/it]
15%|█▍        | 299/2000 [57:10<5:06:14, 10.80s/it]
15%|█▌        | 300/2000 [57:20<5:03:57, 10.73s/it]
{'loss': 0.8101, 'learning_rate': 1.9248063694855603e-05, 'epoch': 0.15}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14759
total_samples=4616, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:19,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.55 | bwd_microstep: 1714.16 | bwd_inner_microstep: 1698.83 | bwd_allreduce_microstep: 15.27 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13360
total_samples=4620, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:21,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.67 | bwd_microstep: 2063.44 | bwd_inner_microstep: 1941.09 | bwd_allreduce_microstep: 122.28 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13683
total_samples=4624, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:24,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.52 | bwd_microstep: 2178.03 | bwd_inner_microstep: 2012.42 | bwd_allreduce_microstep: 165.55 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13713
total_samples=4628, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:27,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 02:40:27,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.89 | bwd_microstep: 1765.97 | bwd_inner_microstep: 1711.63 | bwd_allreduce_microstep: 54.29 | step_microstep: 136.88
[2025-08-03 02:40:27,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.55 | bwd: 7721.65 | bwd_inner: 7363.96 | bwd_allreduce: 357.47 | step: 137.31
{'loss': 0.8076, 'learning_rate': 1.924189084131525e-05, 'epoch': 0.15}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14216
total_samples=4632, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:30,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.71 | bwd_microstep: 1810.78 | bwd_inner_microstep: 1728.87 | bwd_allreduce_microstep: 81.85 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13335
total_samples=4636, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:33,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.54 | bwd_microstep: 2132.23 | bwd_inner_microstep: 1875.33 | bwd_allreduce_microstep: 256.83 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16314
total_samples=4640, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:35,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.72 | bwd_microstep: 2005.56 | bwd_inner_microstep: 1921.50 | bwd_allreduce_microstep: 83.99 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11669
total_samples=4643, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:38,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 02:40:38,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.74 | bwd_microstep: 2154.76 | bwd_inner_microstep: 1948.27 | bwd_allreduce_microstep: 206.43 | step_microstep: 133.50
[2025-08-03 02:40:38,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.63 | bwd: 8103.38 | bwd_inner: 7473.96 | bwd_allreduce: 629.18 | step: 133.82
{'loss': 0.8148, 'learning_rate': 1.923569375199357e-05, 'epoch': 0.15}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12538
total_samples=4647, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:41,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.70 | bwd_microstep: 2152.75 | bwd_inner_microstep: 1827.46 | bwd_allreduce_microstep: 325.22 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11565
total_samples=4650, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:44,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.74 | bwd_microstep: 1876.06 | bwd_inner_microstep: 1700.24 | bwd_allreduce_microstep: 175.75 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15681
total_samples=4655, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:47,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.81 | bwd_microstep: 1771.82 | bwd_inner_microstep: 1765.76 | bwd_allreduce_microstep: 6.00 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13114
total_samples=4659, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:49,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 02:40:49,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.10 | bwd_microstep: 1711.37 | bwd_inner_microstep: 1651.85 | bwd_allreduce_microstep: 59.46 | step_microstep: 147.76
[2025-08-03 02:40:49,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.29 | bwd: 7512.05 | bwd_inner: 6945.32 | bwd_allreduce: 566.50 | step: 148.07
{'loss': 0.804, 'learning_rate': 1.922947244314172e-05, 'epoch': 0.15}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13361
total_samples=4663, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:52,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.83 | bwd_microstep: 1787.45 | bwd_inner_microstep: 1637.91 | bwd_allreduce_microstep: 149.47 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12583
total_samples=4666, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:54,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.94 | bwd_microstep: 1902.85 | bwd_inner_microstep: 1733.20 | bwd_allreduce_microstep: 169.58 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13302
total_samples=4671, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:40:57,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.62 | bwd_microstep: 2137.72 | bwd_inner_microstep: 1803.89 | bwd_allreduce_microstep: 333.77 | step_microstep: 0.21
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12765
total_samples=4675, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:00,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 02:41:00,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.85 | bwd_microstep: 1882.33 | bwd_inner_microstep: 1601.33 | bwd_allreduce_microstep: 280.93 | step_microstep: 110.69
[2025-08-03 02:41:00,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.16 | bwd: 7710.40 | bwd_inner: 6776.33 | bwd_allreduce: 933.83 | step: 111.12
{'loss': 0.8053, 'learning_rate': 1.922322693107434e-05, 'epoch': 0.15}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11931
total_samples=4678, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:03,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.76 | bwd_microstep: 1784.14 | bwd_inner_microstep: 1545.70 | bwd_allreduce_microstep: 238.38 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13389
total_samples=4682, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:05,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.39 | bwd_microstep: 1737.81 | bwd_inner_microstep: 1675.36 | bwd_allreduce_microstep: 62.38 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14470
total_samples=4686, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:08,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.23 | bwd_microstep: 1783.26 | bwd_inner_microstep: 1742.50 | bwd_allreduce_microstep: 40.69 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12707
total_samples=4690, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:11,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 02:41:11,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.06 | bwd_microstep: 1873.36 | bwd_inner_microstep: 1753.79 | bwd_allreduce_microstep: 119.51 | step_microstep: 163.34
[2025-08-03 02:41:11,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.35 | bwd: 7178.62 | bwd_inner: 6717.35 | bwd_allreduce: 461.04 | step: 163.67
{'loss': 0.8019, 'learning_rate': 1.9216957232169567e-05, 'epoch': 0.15}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13438
total_samples=4694, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:13,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.04 | bwd_microstep: 1863.49 | bwd_inner_microstep: 1817.96 | bwd_allreduce_microstep: 45.47 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11727
total_samples=4697, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:16,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.44 | bwd_microstep: 1851.43 | bwd_inner_microstep: 1593.94 | bwd_allreduce_microstep: 257.43 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13397
total_samples=4701, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:19,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.76 | bwd_microstep: 2199.27 | bwd_inner_microstep: 1915.50 | bwd_allreduce_microstep: 283.71 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11847
total_samples=4704, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:22,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 02:41:22,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.18 | bwd_microstep: 2075.58 | bwd_inner_microstep: 1832.68 | bwd_allreduce_microstep: 242.84 | step_microstep: 106.59
[2025-08-03 02:41:22,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2878.36 | bwd: 7989.82 | bwd_inner: 7160.07 | bwd_allreduce: 829.52 | step: 106.91
15%|█▌        | 301/2000 [57:31<5:04:47, 10.76s/it]
15%|█▌        | 302/2000 [57:42<5:06:17, 10.82s/it]
15%|█▌        | 303/2000 [57:53<5:10:52, 10.99s/it]
15%|█▌        | 304/2000 [58:04<5:08:59, 10.93s/it]
15%|█▌        | 305/2000 [58:15<5:08:47, 10.93s/it]
15%|█▌        | 306/2000 [58:26<5:05:00, 10.80s/it]
{'loss': 0.8064, 'learning_rate': 1.9210663362868956e-05, 'epoch': 0.15}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11776
total_samples=4707, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:25,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.24 | bwd_microstep: 1839.58 | bwd_inner_microstep: 1586.59 | bwd_allreduce_microstep: 252.93 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11703
total_samples=4710, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:27,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.19 | bwd_microstep: 1758.79 | bwd_inner_microstep: 1608.87 | bwd_allreduce_microstep: 149.86 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13151
total_samples=4714, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:30,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.13 | bwd_microstep: 2236.78 | bwd_inner_microstep: 2066.37 | bwd_allreduce_microstep: 170.35 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12311
total_samples=4717, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:33,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 02:41:33,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.35 | bwd_microstep: 1866.10 | bwd_inner_microstep: 1564.24 | bwd_allreduce_microstep: 301.80 | step_microstep: 149.07
[2025-08-03 02:41:33,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.83 | bwd: 7701.31 | bwd_inner: 6826.06 | bwd_allreduce: 875.02 | step: 149.48
{'loss': 0.8055, 'learning_rate': 1.9204345339677442e-05, 'epoch': 0.15}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11983
total_samples=4720, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:36,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.51 | bwd_microstep: 1920.18 | bwd_inner_microstep: 1769.25 | bwd_allreduce_microstep: 150.88 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13326
total_samples=4724, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:38,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.71 | bwd_microstep: 1776.74 | bwd_inner_microstep: 1667.24 | bwd_allreduce_microstep: 109.44 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11555
total_samples=4727, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:41,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.25 | bwd_microstep: 2065.97 | bwd_inner_microstep: 1845.07 | bwd_allreduce_microstep: 220.84 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12095
total_samples=4731, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:44,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:41:44,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.94 | bwd_microstep: 1719.73 | bwd_inner_microstep: 1551.01 | bwd_allreduce_microstep: 168.65 | step_microstep: 158.75
[2025-08-03 02:41:44,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.34 | bwd: 7482.66 | bwd_inner: 6832.56 | bwd_allreduce: 649.88 | step: 159.06
{'loss': 0.8041, 'learning_rate': 1.9198003179163308e-05, 'epoch': 0.15}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13215
total_samples=4735, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:47,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.77 | bwd_microstep: 2106.44 | bwd_inner_microstep: 1995.52 | bwd_allreduce_microstep: 110.86 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13659
total_samples=4739, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:49,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.45 | bwd_microstep: 1768.62 | bwd_inner_microstep: 1714.92 | bwd_allreduce_microstep: 53.63 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13272
total_samples=4743, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:52,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.54 | bwd_microstep: 1827.82 | bwd_inner_microstep: 1708.86 | bwd_allreduce_microstep: 118.88 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14850
total_samples=4747, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:55,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 02:41:55,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.43 | bwd_microstep: 2173.21 | bwd_inner_microstep: 1881.15 | bwd_allreduce_microstep: 292.00 | step_microstep: 122.11
[2025-08-03 02:41:55,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.12 | bwd: 7876.14 | bwd_inner: 7300.45 | bwd_allreduce: 575.45 | step: 122.60
{'loss': 0.8034, 'learning_rate': 1.9191636897958123e-05, 'epoch': 0.15}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809
total_samples=4750, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:41:57,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.30 | bwd_microstep: 1739.40 | bwd_inner_microstep: 1531.56 | bwd_allreduce_microstep: 207.78 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885
total_samples=4753, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:00,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.17 | bwd_microstep: 1785.14 | bwd_inner_microstep: 1548.73 | bwd_allreduce_microstep: 236.35 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14469
total_samples=4757, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:03,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 751.09 | bwd_microstep: 1826.60 | bwd_inner_microstep: 1738.33 | bwd_allreduce_microstep: 88.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13649
total_samples=4761, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:06,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.85
[2025-08-03 02:42:06,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.79 | bwd_microstep: 2124.27 | bwd_inner_microstep: 1956.47 | bwd_allreduce_microstep: 167.73 | step_microstep: 138.00
[2025-08-03 02:42:06,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.28 | bwd: 7475.46 | bwd_inner: 6775.09 | bwd_allreduce: 700.13 | step: 138.33
{'loss': 0.8144, 'learning_rate': 1.9185246512756727e-05, 'epoch': 0.16}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12829
total_samples=4765, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:08,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.39 | bwd_microstep: 1848.98 | bwd_inner_microstep: 1796.26 | bwd_allreduce_microstep: 52.65 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11701
total_samples=4768, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:11,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.39 | bwd_microstep: 1878.34 | bwd_inner_microstep: 1539.63 | bwd_allreduce_microstep: 338.64 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12787
total_samples=4772, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:14,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.75 | bwd_microstep: 1738.12 | bwd_inner_microstep: 1630.14 | bwd_allreduce_microstep: 107.92 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13369
total_samples=4776, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:16,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 02:42:16,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.45 | bwd_microstep: 1889.66 | bwd_inner_microstep: 1668.49 | bwd_allreduce_microstep: 221.10 | step_microstep: 113.13
[2025-08-03 02:42:16,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.91 | bwd: 7355.14 | bwd_inner: 6634.52 | bwd_allreduce: 720.39 | step: 113.43
{'loss': 0.7989, 'learning_rate': 1.9178832040317153e-05, 'epoch': 0.16}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11842
total_samples=4779, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:19,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.35 | bwd_microstep: 1801.37 | bwd_inner_microstep: 1578.73 | bwd_allreduce_microstep: 222.58 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11627
total_samples=4782, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:21,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.64 | bwd_microstep: 1705.41 | bwd_inner_microstep: 1518.01 | bwd_allreduce_microstep: 187.34 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13171
total_samples=4786, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:24,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.56 | bwd_microstep: 1733.21 | bwd_inner_microstep: 1650.53 | bwd_allreduce_microstep: 82.61 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13475
total_samples=4790, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:27,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 02:42:27,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.96 | bwd_microstep: 1765.89 | bwd_inner_microstep: 1697.00 | bwd_allreduce_microstep: 68.82 | step_microstep: 132.12
[2025-08-03 02:42:27,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.45 | bwd: 7005.93 | bwd_inner: 6444.27 | bwd_allreduce: 561.41 | step: 132.56
15%|█▌        | 307/2000 [58:37<5:08:48, 10.94s/it]
15%|█▌        | 308/2000 [58:48<5:09:12, 10.96s/it]
15%|█▌        | 309/2000 [58:59<5:07:33, 10.91s/it]
16%|█▌        | 310/2000 [59:10<5:09:28, 10.99s/it]
16%|█▌        | 311/2000 [59:21<5:07:49, 10.94s/it]
16%|█▌        | 312/2000 [59:31<5:04:38, 10.83s/it]
{'loss': 0.7942, 'learning_rate': 1.917239349746061e-05, 'epoch': 0.16}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13456
total_samples=4794, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:29,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.56 | bwd_microstep: 2002.92 | bwd_inner_microstep: 1871.01 | bwd_allreduce_microstep: 131.84 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14179
total_samples=4798, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:32,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.20 | bwd_microstep: 2002.81 | bwd_inner_microstep: 1890.05 | bwd_allreduce_microstep: 112.70 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12919
total_samples=4802, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:35,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.61 | bwd_microstep: 1761.43 | bwd_inner_microstep: 1650.41 | bwd_allreduce_microstep: 110.95 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13965
total_samples=4806, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:38,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 02:42:38,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.82 | bwd_microstep: 2041.95 | bwd_inner_microstep: 1717.09 | bwd_allreduce_microstep: 324.80 | step_microstep: 133.54
[2025-08-03 02:42:38,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.13 | bwd: 7809.16 | bwd_inner: 7128.56 | bwd_allreduce: 680.36 | step: 133.86
{'loss': 0.8091, 'learning_rate': 1.916593090107143e-05, 'epoch': 0.16}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12586
total_samples=4810, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:40,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.94 | bwd_microstep: 1802.18 | bwd_inner_microstep: 1588.05 | bwd_allreduce_microstep: 214.06 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13944
total_samples=4814, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:43,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.93 | bwd_microstep: 1966.25 | bwd_inner_microstep: 1960.30 | bwd_allreduce_microstep: 5.88 | step_microstep: 0.19
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13772
total_samples=4818, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:46,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.65 | bwd_microstep: 1832.71 | bwd_inner_microstep: 1718.47 | bwd_allreduce_microstep: 114.16 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12913
total_samples=4824, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:48,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 02:42:48,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.88 | bwd_microstep: 1760.00 | bwd_inner_microstep: 1593.39 | bwd_allreduce_microstep: 166.55 | step_microstep: 131.23
[2025-08-03 02:42:48,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.32 | bwd: 7361.18 | bwd_inner: 6860.21 | bwd_allreduce: 500.73 | step: 131.64
{'loss': 0.8046, 'learning_rate': 1.9159444268097012e-05, 'epoch': 0.16}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13578
total_samples=4828, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:51,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.22 | bwd_microstep: 1962.97 | bwd_inner_microstep: 1857.22 | bwd_allreduce_microstep: 105.69 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12230
total_samples=4831, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:54,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.70 | bwd_microstep: 1816.36 | bwd_inner_microstep: 1587.13 | bwd_allreduce_microstep: 229.16 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13513
total_samples=4835, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:56,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.05 | bwd_microstep: 1965.08 | bwd_inner_microstep: 1814.36 | bwd_allreduce_microstep: 150.66 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12514
total_samples=4839, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:42:59,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 02:42:59,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.16 | bwd_microstep: 1825.05 | bwd_inner_microstep: 1597.50 | bwd_allreduce_microstep: 227.49 | step_microstep: 162.07
[2025-08-03 02:42:59,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.05 | bwd: 7569.52 | bwd_inner: 6856.21 | bwd_allreduce: 713.07 | step: 162.40
{'loss': 0.7924, 'learning_rate': 1.91529336155478e-05, 'epoch': 0.16}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12620
total_samples=4843, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:02,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.76 | bwd_microstep: 1843.46 | bwd_inner_microstep: 1710.24 | bwd_allreduce_microstep: 133.17 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12138
total_samples=4846, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:05,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.96 | bwd_microstep: 2148.14 | bwd_inner_microstep: 2015.41 | bwd_allreduce_microstep: 132.67 | step_microstep: 0.09
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13593
total_samples=4850, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:07,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 661.96 | bwd_microstep: 1875.50 | bwd_inner_microstep: 1645.33 | bwd_allreduce_microstep: 230.10 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12928
total_samples=4854, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:10,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 02:43:10,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.39 | bwd_microstep: 1698.34 | bwd_inner_microstep: 1599.81 | bwd_allreduce_microstep: 98.46 | step_microstep: 153.03
[2025-08-03 02:43:10,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.00 | bwd: 7565.49 | bwd_inner: 6970.79 | bwd_allreduce: 594.47 | step: 153.35
{'loss': 0.802, 'learning_rate': 1.9146398960497213e-05, 'epoch': 0.16}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13817
total_samples=4858, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:14,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1585.03 | bwd_microstep: 1928.87 | bwd_inner_microstep: 1807.69 | bwd_allreduce_microstep: 121.12 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11700
total_samples=4861, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:16,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.29 | bwd_microstep: 1735.48 | bwd_inner_microstep: 1534.71 | bwd_allreduce_microstep: 200.71 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13417
total_samples=4865, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:19,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.85 | bwd_microstep: 1928.74 | bwd_inner_microstep: 1759.61 | bwd_allreduce_microstep: 169.08 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14089
total_samples=4869, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:22,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 02:43:22,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.79 | bwd_microstep: 1878.25 | bwd_inner_microstep: 1736.58 | bwd_allreduce_microstep: 141.61 | step_microstep: 155.70
[2025-08-03 02:43:22,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3682.89 | bwd: 7471.40 | bwd_inner: 6838.58 | bwd_allreduce: 632.58 | step: 156.09
{'loss': 0.7958, 'learning_rate': 1.913984032008163e-05, 'epoch': 0.16}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11711
total_samples=4872, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:24,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.90 | bwd_microstep: 1839.07 | bwd_inner_microstep: 1581.27 | bwd_allreduce_microstep: 257.74 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11807
total_samples=4875, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:27,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.06 | bwd_microstep: 1760.80 | bwd_inner_microstep: 1546.75 | bwd_allreduce_microstep: 213.98 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11965
total_samples=4878, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:30,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.72 | bwd_microstep: 2171.97 | bwd_inner_microstep: 1692.32 | bwd_allreduce_microstep: 479.55 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 15766
total_samples=4881, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:33,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.82
[2025-08-03 02:43:33,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.29 | bwd_microstep: 1847.17 | bwd_inner_microstep: 1739.84 | bwd_allreduce_microstep: 107.27 | step_microstep: 109.55
[2025-08-03 02:43:33,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.90 | bwd: 7619.05 | bwd_inner: 6560.20 | bwd_allreduce: 1058.59 | step: 109.85
16%|█▌        | 313/2000 [59:41<4:59:37, 10.66s/it]
16%|█▌        | 314/2000 [59:52<5:02:57, 10.78s/it]
16%|█▌        | 315/2000 [1:00:03<5:01:37, 10.74s/it]
16%|█▌        | 316/2000 [1:00:14<5:02:44, 10.79s/it]
16%|█▌        | 317/2000 [1:00:25<5:02:54, 10.80s/it]
16%|█▌        | 318/2000 [1:00:36<5:09:50, 11.05s/it]
{'loss': 0.8064, 'learning_rate': 1.9133257711500318e-05, 'epoch': 0.16}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13504
total_samples=4885, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:35,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.30 | bwd_microstep: 1773.53 | bwd_inner_microstep: 1677.11 | bwd_allreduce_microstep: 96.36 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11900
total_samples=4888, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:38,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.19 | bwd_microstep: 1809.13 | bwd_inner_microstep: 1578.20 | bwd_allreduce_microstep: 230.87 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15450
total_samples=4892, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:40,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.54 | bwd_microstep: 1758.60 | bwd_inner_microstep: 1745.94 | bwd_allreduce_microstep: 12.60 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11738
total_samples=4895, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:43,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 02:43:43,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.12 | bwd_microstep: 1834.15 | bwd_inner_microstep: 1586.25 | bwd_allreduce_microstep: 247.84 | step_microstep: 108.38
[2025-08-03 02:43:43,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.08 | bwd: 7175.46 | bwd_inner: 6587.49 | bwd_allreduce: 587.74 | step: 108.71
{'loss': 0.8061, 'learning_rate': 1.9126651152015404e-05, 'epoch': 0.16}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13749
total_samples=4899, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:46,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.16 | bwd_microstep: 1941.57 | bwd_inner_microstep: 1727.39 | bwd_allreduce_microstep: 214.10 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11596
total_samples=4902, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:48,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.45 | bwd_microstep: 1780.22 | bwd_inner_microstep: 1535.06 | bwd_allreduce_microstep: 245.10 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12916
total_samples=4906, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:51,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.40 | bwd_microstep: 2060.35 | bwd_inner_microstep: 1658.06 | bwd_allreduce_microstep: 402.23 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12028
total_samples=4909, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:54,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 02:43:54,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.59 | bwd_microstep: 2041.54 | bwd_inner_microstep: 1820.12 | bwd_allreduce_microstep: 221.37 | step_microstep: 132.10
[2025-08-03 02:43:54,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.53 | bwd: 7823.73 | bwd_inner: 6740.62 | bwd_allreduce: 1082.88 | step: 132.42
{'loss': 0.7985, 'learning_rate': 1.9120020658951814e-05, 'epoch': 0.16}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14132
total_samples=4913, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:57,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.37 | bwd_microstep: 1957.66 | bwd_inner_microstep: 1809.18 | bwd_allreduce_microstep: 148.42 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14226
total_samples=4917, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:43:59,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.56 | bwd_microstep: 1737.60 | bwd_inner_microstep: 1701.87 | bwd_allreduce_microstep: 35.67 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11962
total_samples=4920, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:02,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.03 | bwd_microstep: 1954.86 | bwd_inner_microstep: 1575.70 | bwd_allreduce_microstep: 379.10 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13951
total_samples=4924, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:05,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 02:44:05,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.24 | bwd_microstep: 2064.49 | bwd_inner_microstep: 1929.02 | bwd_allreduce_microstep: 135.40 | step_microstep: 109.78
[2025-08-03 02:44:05,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.13 | bwd: 7714.65 | bwd_inner: 7015.75 | bwd_allreduce: 698.66 | step: 110.09
{'loss': 0.7958, 'learning_rate': 1.911336624969725e-05, 'epoch': 0.16}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14214
total_samples=4928, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:08,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.51 | bwd_microstep: 2562.93 | bwd_inner_microstep: 1729.28 | bwd_allreduce_microstep: 833.59 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407
total_samples=4932, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:11,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.94 | bwd_microstep: 1743.43 | bwd_inner_microstep: 1676.80 | bwd_allreduce_microstep: 66.56 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13040
total_samples=4936, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:14,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.25 | bwd_microstep: 2038.50 | bwd_inner_microstep: 1670.59 | bwd_allreduce_microstep: 367.84 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13361
total_samples=4940, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:17,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 02:44:17,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.72 | bwd_microstep: 1974.66 | bwd_inner_microstep: 1849.64 | bwd_allreduce_microstep: 124.96 | step_microstep: 120.73
[2025-08-03 02:44:17,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2757.33 | bwd: 8319.56 | bwd_inner: 6926.30 | bwd_allreduce: 1393.03 | step: 121.06
{'loss': 0.803, 'learning_rate': 1.910668794170212e-05, 'epoch': 0.16}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11892
total_samples=4943, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:19,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.06 | bwd_microstep: 1957.68 | bwd_inner_microstep: 1590.25 | bwd_allreduce_microstep: 367.37 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14694
total_samples=4947, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:22,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.83 | bwd_microstep: 1876.23 | bwd_inner_microstep: 1727.57 | bwd_allreduce_microstep: 148.61 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13721
total_samples=4951, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:25,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.47 | bwd_microstep: 2147.06 | bwd_inner_microstep: 1738.09 | bwd_allreduce_microstep: 408.92 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13604
total_samples=4955, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:28,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 02:44:28,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.03 | bwd_microstep: 1981.34 | bwd_inner_microstep: 1870.48 | bwd_allreduce_microstep: 110.80 | step_microstep: 109.11
[2025-08-03 02:44:28,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.32 | bwd: 7962.36 | bwd_inner: 6926.38 | bwd_allreduce: 1035.77 | step: 109.43
{'loss': 0.7967, 'learning_rate': 1.9099985752479505e-05, 'epoch': 0.16}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12449
total_samples=4958, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:30,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.36 | bwd_microstep: 1786.17 | bwd_inner_microstep: 1570.63 | bwd_allreduce_microstep: 215.48 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13586
total_samples=4962, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:33,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 661.53 | bwd_microstep: 1748.07 | bwd_inner_microstep: 1693.66 | bwd_allreduce_microstep: 54.35 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11771
total_samples=4965, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:35,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.28 | bwd_microstep: 1786.41 | bwd_inner_microstep: 1549.80 | bwd_allreduce_microstep: 236.54 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13772
total_samples=4969, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:38,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 02:44:38,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.69 | bwd_microstep: 2117.99 | bwd_inner_microstep: 1921.75 | bwd_allreduce_microstep: 196.18 | step_microstep: 135.62
[2025-08-03 02:44:38,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.80 | bwd: 7438.69 | bwd_inner: 6735.83 | bwd_allreduce: 702.63 | step: 135.94
 16%|█▌        | 319/2000 [1:00:47<5:08:13, 11.00s/it]
 16%|█▌        | 320/2000 [1:00:58<5:03:17, 10.83s/it]
 16%|█▌        | 321/2000 [1:01:09<5:05:26, 10.92s/it]
 16%|█▌        | 322/2000 [1:01:20<5:05:22, 10.92s/it]
 16%|█▌        | 323/2000 [1:01:31<5:10:25, 11.11s/it]
 16%|█▌        | 324/2000 [1:01:43<5:10:49, 11.13s/it]
{'loss': 0.8021, 'learning_rate': 1.9093259699605125e-05, 'epoch': 0.16}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13312
total_samples=4973, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:41,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.59 | bwd_microstep: 1820.62 | bwd_inner_microstep: 1695.85 | bwd_allreduce_microstep: 124.71 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13260
total_samples=4977, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:44,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.51 | bwd_microstep: 1871.74 | bwd_inner_microstep: 1685.91 | bwd_allreduce_microstep: 185.76 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13396
total_samples=4981, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:46,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.70 | bwd_microstep: 1858.88 | bwd_inner_microstep: 1709.49 | bwd_allreduce_microstep: 149.33 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14752
total_samples=4987, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:49,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 02:44:49,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.57 | bwd_microstep: 1725.19 | bwd_inner_microstep: 1705.67 | bwd_allreduce_microstep: 19.44 | step_microstep: 111.53
[2025-08-03 02:44:49,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.29 | bwd: 7276.46 | bwd_inner: 6796.90 | bwd_allreduce: 479.32 | step: 111.85
{'loss': 0.8031, 'learning_rate': 1.908650980071726e-05, 'epoch': 0.16}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13968
total_samples=4991, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:52,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.02 | bwd_microstep: 1859.09 | bwd_inner_microstep: 1664.16 | bwd_allreduce_microstep: 194.86 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13505
total_samples=4995, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:54,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.47 | bwd_microstep: 1911.30 | bwd_inner_microstep: 1704.04 | bwd_allreduce_microstep: 207.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14116
total_samples=5000, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:44:57,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.50 | bwd_microstep: 2093.22 | bwd_inner_microstep: 1943.15 | bwd_allreduce_microstep: 150.01 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13454
total_samples=5004, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:00,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:45:00,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.53 | bwd_microstep: 1697.83 | bwd_inner_microstep: 1653.22 | bwd_allreduce_microstep: 44.54 | step_microstep: 140.03
[2025-08-03 02:45:00,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2742.45 | bwd: 7561.48 | bwd_inner: 6964.57 | bwd_allreduce: 596.68 | step: 140.44
{'loss': 0.8055, 'learning_rate': 1.9079736073516735e-05, 'epoch': 0.16}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11958
total_samples=5007, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:02,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.56 | bwd_microstep: 1802.99 | bwd_inner_microstep: 1550.88 | bwd_allreduce_microstep: 252.05 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13608
total_samples=5011, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:05,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.39 | bwd_microstep: 1887.61 | bwd_inner_microstep: 1747.28 | bwd_allreduce_microstep: 140.26 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13362
total_samples=5015, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:08,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.66 | bwd_microstep: 2003.55 | bwd_inner_microstep: 1863.91 | bwd_allreduce_microstep: 139.56 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13483
total_samples=5019, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:11,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 02:45:11,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.40 | bwd_microstep: 2017.03 | bwd_inner_microstep: 2011.06 | bwd_allreduce_microstep: 5.91 | step_microstep: 135.57
[2025-08-03 02:45:11,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.95 | bwd: 7711.21 | bwd_inner: 7173.11 | bwd_allreduce: 537.86 | step: 135.90
{'loss': 0.8086, 'learning_rate': 1.9072938535766864e-05, 'epoch': 0.16}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12298
total_samples=5022, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:13,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.66 | bwd_microstep: 1750.65 | bwd_inner_microstep: 1555.54 | bwd_allreduce_microstep: 195.05 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13353
total_samples=5026, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:16,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.86 | bwd_microstep: 1773.04 | bwd_inner_microstep: 1667.24 | bwd_allreduce_microstep: 105.73 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13959
total_samples=5030, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:18,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.98 | bwd_microstep: 1775.19 | bwd_inner_microstep: 1725.47 | bwd_allreduce_microstep: 49.66 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13660
total_samples=5035, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:21,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 02:45:21,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.35 | bwd_microstep: 1888.19 | bwd_inner_microstep: 1840.20 | bwd_allreduce_microstep: 47.93 | step_microstep: 161.53
[2025-08-03 02:45:21,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.78 | bwd: 7187.11 | bwd_inner: 6788.45 | bwd_allreduce: 398.44 | step: 161.86
{'loss': 0.7859, 'learning_rate': 1.9066117205293393e-05, 'epoch': 0.16}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12154
total_samples=5038, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:24,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.09 | bwd_microstep: 1805.90 | bwd_inner_microstep: 1577.52 | bwd_allreduce_microstep: 228.31 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14272
total_samples=5042, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:27,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.13 | bwd_microstep: 2070.75 | bwd_inner_microstep: 1900.44 | bwd_allreduce_microstep: 170.23 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15573
total_samples=5046, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:29,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.91 | bwd_microstep: 1925.21 | bwd_inner_microstep: 1751.32 | bwd_allreduce_microstep: 173.83 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13074
total_samples=5051, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:32,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 02:45:32,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.74 | bwd_microstep: 2003.11 | bwd_inner_microstep: 1707.56 | bwd_allreduce_microstep: 295.48 | step_microstep: 109.36
[2025-08-03 02:45:32,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.80 | bwd: 7805.02 | bwd_inner: 6936.83 | bwd_allreduce: 867.95 | step: 109.68
{'loss': 0.8023, 'learning_rate': 1.905927209998447e-05, 'epoch': 0.17}
 16%|█▋        | 325/2000 [1:01:53<5:07:04, 11.00s/it]
 16%|█▋        | 326/2000 [1:02:04<5:02:39, 10.85s/it]
 16%|█▋        | 327/2000 [1:02:15<5:02:04, 10.83s/it]
 16%|█▋        | 328/2000 [1:02:26<5:02:55, 10.87s/it]
 16%|█▋        | 329/2000 [1:02:36<4:59:20, 10.75s/it]
 16%|█▋        | 330/2000 [1:02:47<5:01:42, 10.84s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11863
total_samples=5054, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:35,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.51 | bwd_microstep: 2173.19 | bwd_inner_microstep: 1920.57 | bwd_allreduce_microstep: 252.55 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13393
total_samples=5058, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:38,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.81 | bwd_microstep: 1813.75 | bwd_inner_microstep: 1706.76 | bwd_allreduce_microstep: 106.92 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13930
total_samples=5062, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:40,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.92 | bwd_microstep: 1811.99 | bwd_inner_microstep: 1731.74 | bwd_allreduce_microstep: 80.18 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13490
total_samples=5066, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:43,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 02:45:43,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.11 | bwd_microstep: 1838.75 | bwd_inner_microstep: 1726.03 | bwd_allreduce_microstep: 112.66 | step_microstep: 131.21
[2025-08-03 02:45:43,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.29 | bwd: 7637.71 | bwd_inner: 7085.11 | bwd_allreduce: 552.38 | step: 131.62
{'loss': 0.7975, 'learning_rate': 1.905240323779058e-05, 'epoch': 0.17}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11692
total_samples=5069, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:46,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.67 | bwd_microstep: 1776.22 | bwd_inner_microstep: 1529.75 | bwd_allreduce_microstep: 246.41 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16183
total_samples=5074, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:48,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.93 | bwd_microstep: 1825.67 | bwd_inner_microstep: 1819.63 | bwd_allreduce_microstep: 5.97 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12882
total_samples=5078, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:51,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 662.15 | bwd_microstep: 1701.10 | bwd_inner_microstep: 1601.39 | bwd_allreduce_microstep: 99.64 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13362
total_samples=5082, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:54,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 02:45:54,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.69 | bwd_microstep: 2050.14 | bwd_inner_microstep: 1755.31 | bwd_allreduce_microstep: 294.77 | step_microstep: 110.90
[2025-08-03 02:45:54,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.37 | bwd: 7353.18 | bwd_inner: 6706.05 | bwd_allreduce: 646.87 | step: 111.22
{'loss': 0.7935, 'learning_rate': 1.904551063672452e-05, 'epoch': 0.17}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11913
total_samples=5085, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:45:57,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.49 | bwd_microstep: 2572.48 | bwd_inner_microstep: 2318.17 | bwd_allreduce_microstep: 254.25 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14054
total_samples=5090, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:00,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.73 | bwd_microstep: 2118.25 | bwd_inner_microstep: 1993.01 | bwd_allreduce_microstep: 125.18 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11827
total_samples=5093, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:03,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.47 | bwd_microstep: 1948.99 | bwd_inner_microstep: 1749.90 | bwd_allreduce_microstep: 199.02 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13070
total_samples=5098, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:05,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 02:46:05,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.94 | bwd_microstep: 1767.46 | bwd_inner_microstep: 1653.83 | bwd_allreduce_microstep: 113.57 | step_microstep: 150.51
[2025-08-03 02:46:05,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.56 | bwd: 8407.23 | bwd_inner: 7714.90 | bwd_allreduce: 692.09 | step: 150.81
{'loss': 0.8087, 'learning_rate': 1.9038594314861328e-05, 'epoch': 0.17}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13362
total_samples=5102, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:08,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.25 | bwd_microstep: 1838.65 | bwd_inner_microstep: 1781.89 | bwd_allreduce_microstep: 56.69 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13657
total_samples=5106, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:11,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.90 | bwd_microstep: 1859.24 | bwd_inner_microstep: 1692.57 | bwd_allreduce_microstep: 166.60 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12006
total_samples=5109, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:13,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.86 | bwd_microstep: 1811.23 | bwd_inner_microstep: 1566.13 | bwd_allreduce_microstep: 245.03 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13597
total_samples=5113, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:16,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.55
[2025-08-03 02:46:16,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.54 | bwd_microstep: 1809.32 | bwd_inner_microstep: 1707.33 | bwd_allreduce_microstep: 101.93 | step_microstep: 152.17
[2025-08-03 02:46:16,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.49 | bwd: 7318.47 | bwd_inner: 6747.91 | bwd_allreduce: 570.32 | step: 152.49
{'loss': 0.8074, 'learning_rate': 1.9031654290338256e-05, 'epoch': 0.17}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12885
total_samples=5117, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:19,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.07 | bwd_microstep: 1757.51 | bwd_inner_microstep: 1631.93 | bwd_allreduce_microstep: 125.52 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12944
total_samples=5121, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:21,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.82 | bwd_microstep: 1914.37 | bwd_inner_microstep: 1640.71 | bwd_allreduce_microstep: 273.60 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11724
total_samples=5124, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:24,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.63 | bwd_microstep: 2054.15 | bwd_inner_microstep: 1864.14 | bwd_allreduce_microstep: 189.95 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11733
total_samples=5127, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:27,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:46:27,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.55 | bwd_microstep: 2001.85 | bwd_inner_microstep: 1768.56 | bwd_allreduce_microstep: 233.23 | step_microstep: 129.41
[2025-08-03 02:46:27,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.00 | bwd: 7727.92 | bwd_inner: 6905.34 | bwd_allreduce: 822.36 | step: 129.83
{'loss': 0.7943, 'learning_rate': 1.90246905813547e-05, 'epoch': 0.17}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13068
total_samples=5131, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:30,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.62 | bwd_microstep: 1831.27 | bwd_inner_microstep: 1684.02 | bwd_allreduce_microstep: 147.18 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13340
total_samples=5136, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:32,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.61 | bwd_microstep: 1954.94 | bwd_inner_microstep: 1685.45 | bwd_allreduce_microstep: 269.43 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13467
total_samples=5140, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:35,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.44 | bwd_microstep: 1946.19 | bwd_inner_microstep: 1825.12 | bwd_allreduce_microstep: 121.00 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11898
total_samples=5143, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:38,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:46:38,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.48 | bwd_microstep: 1750.29 | bwd_inner_microstep: 1547.63 | bwd_allreduce_microstep: 202.59 | step_microstep: 129.10
[2025-08-03 02:46:38,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2745.08 | bwd: 7482.73 | bwd_inner: 6742.22 | bwd_allreduce: 740.28 | step: 129.43
{'loss': 0.7917, 'learning_rate': 1.9017703206172187e-05, 'epoch': 0.17}
 17%|█▋        | 331/2000 [1:02:58<5:02:19, 10.87s/it]
 17%|█▋        | 332/2000 [1:03:09<4:59:51, 10.79s/it]
 17%|█▋        | 333/2000 [1:03:20<5:07:16, 11.06s/it]
 17%|█▋        | 334/2000 [1:03:31<5:03:09, 10.92s/it]
 17%|█▋        | 335/2000 [1:03:42<5:03:09, 10.92s/it]
 17%|█▋        | 336/2000 [1:03:52<5:01:01, 10.85s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12022
total_samples=5146, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:40,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.29 | bwd_microstep: 1795.10 | bwd_inner_microstep: 1539.61 | bwd_allreduce_microstep: 255.42 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12797
total_samples=5150, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:43,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.44 | bwd_microstep: 1771.89 | bwd_inner_microstep: 1638.29 | bwd_allreduce_microstep: 133.54 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13703
total_samples=5154, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:46,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.47 | bwd_microstep: 1965.83 | bwd_inner_microstep: 1892.18 | bwd_allreduce_microstep: 73.60 | step_microstep: 0.09
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12193
total_samples=5158, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:48,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:46:48,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.91 | bwd_microstep: 1759.78 | bwd_inner_microstep: 1581.15 | bwd_allreduce_microstep: 178.57 | step_microstep: 133.29
[2025-08-03 02:46:48,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2749.05 | bwd: 7292.64 | bwd_inner: 6651.23 | bwd_allreduce: 641.20 | step: 133.60
{'loss': 0.7983, 'learning_rate': 1.9010692183114285e-05, 'epoch': 0.17}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13558
total_samples=5162, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:51,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.93 | bwd_microstep: 2043.30 | bwd_inner_microstep: 1857.12 | bwd_allreduce_microstep: 186.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13539
total_samples=5166, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:54,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.52 | bwd_microstep: 1907.27 | bwd_inner_microstep: 1856.99 | bwd_allreduce_microstep: 50.22 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15320
total_samples=5171, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:56,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.90 | bwd_microstep: 1844.78 | bwd_inner_microstep: 1794.72 | bwd_allreduce_microstep: 49.99 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13754
total_samples=5175, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:46:59,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 02:46:59,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.95 | bwd_microstep: 1846.59 | bwd_inner_microstep: 1731.57 | bwd_allreduce_microstep: 114.95 | step_microstep: 111.40
[2025-08-03 02:46:59,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.23 | bwd: 7641.98 | bwd_inner: 7240.40 | bwd_allreduce: 401.35 | step: 111.82
{'loss': 0.7782, 'learning_rate': 1.900365753056659e-05, 'epoch': 0.17}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11987
total_samples=5178, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:02,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.46 | bwd_microstep: 1852.64 | bwd_inner_microstep: 1699.31 | bwd_allreduce_microstep: 153.27 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11709
total_samples=5181, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:04,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.94 | bwd_microstep: 1805.43 | bwd_inner_microstep: 1541.50 | bwd_allreduce_microstep: 263.86 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13735
total_samples=5185, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:07,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.28 | bwd_microstep: 1913.55 | bwd_inner_microstep: 1727.84 | bwd_allreduce_microstep: 185.65 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14876
total_samples=5189, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:10,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 02:47:10,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.98 | bwd_microstep: 1826.96 | bwd_inner_microstep: 1772.67 | bwd_allreduce_microstep: 54.24 | step_microstep: 110.26
[2025-08-03 02:47:10,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.58 | bwd: 7398.63 | bwd_inner: 6741.31 | bwd_allreduce: 657.09 | step: 110.60
{'loss': 0.8075, 'learning_rate': 1.8996599266976658e-05, 'epoch': 0.17}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14195
total_samples=5193, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:12,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.45 | bwd_microstep: 1798.56 | bwd_inner_microstep: 1749.88 | bwd_allreduce_microstep: 48.61 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13040
total_samples=5197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:15,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.16 | bwd_microstep: 1982.19 | bwd_inner_microstep: 1873.92 | bwd_allreduce_microstep: 108.22 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13825
total_samples=5201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:18,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.56 | bwd_microstep: 2038.53 | bwd_inner_microstep: 1892.58 | bwd_allreduce_microstep: 145.89 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11537
total_samples=5204, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:21,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 02:47:21,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.04 | bwd_microstep: 2006.47 | bwd_inner_microstep: 1807.61 | bwd_allreduce_microstep: 198.80 | step_microstep: 126.82
[2025-08-03 02:47:21,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.14 | bwd: 7825.81 | bwd_inner: 7323.99 | bwd_allreduce: 501.59 | step: 127.12
{'loss': 0.8082, 'learning_rate': 1.8989517410853956e-05, 'epoch': 0.17}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12117
total_samples=5208, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:23,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.73 | bwd_microstep: 1754.89 | bwd_inner_microstep: 1558.23 | bwd_allreduce_microstep: 196.59 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11751
total_samples=5211, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:26,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.65 | bwd_microstep: 1746.81 | bwd_inner_microstep: 1535.69 | bwd_allreduce_microstep: 211.05 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13181
total_samples=5216, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:28,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.24 | bwd_microstep: 1774.66 | bwd_inner_microstep: 1637.90 | bwd_allreduce_microstep: 136.70 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13241
total_samples=5220, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:31,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 02:47:31,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.96 | bwd_microstep: 1971.47 | bwd_inner_microstep: 1879.82 | bwd_allreduce_microstep: 91.56 | step_microstep: 112.92
[2025-08-03 02:47:31,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.52 | bwd: 7247.88 | bwd_inner: 6611.67 | bwd_allreduce: 635.96 | step: 113.23
{'loss': 0.7963, 'learning_rate': 1.898241198076983e-05, 'epoch': 0.17}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11983
total_samples=5223, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:34,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.84 | bwd_microstep: 1724.01 | bwd_inner_microstep: 1546.03 | bwd_allreduce_microstep: 177.92 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15918
total_samples=5227, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:37,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.71 | bwd_microstep: 2288.68 | bwd_inner_microstep: 2147.13 | bwd_allreduce_microstep: 141.49 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13228
total_samples=5231, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:40,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.21 | bwd_microstep: 2177.93 | bwd_inner_microstep: 2028.05 | bwd_allreduce_microstep: 149.83 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12421
total_samples=5234, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:42,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 02:47:42,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.79 | bwd_microstep: 1698.99 | bwd_inner_microstep: 1544.02 | bwd_allreduce_microstep: 154.90 | step_microstep: 152.43
[2025-08-03 02:47:42,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.48 | bwd: 7889.66 | bwd_inner: 7265.23 | bwd_allreduce: 624.21 | step: 152.86
{'loss': 0.801, 'learning_rate': 1.8975282995357448e-05, 'epoch': 0.17}
000 [1:03:53<5:01:01, 10.85s/it]
17%|█▋        | 337/2000 [1:04:03<4:58:13, 10.76s/it]
17%|█▋        | 338/2000 [1:04:14<4:59:11, 10.80s/it]
17%|█▋        | 339/2000 [1:04:25<4:57:34, 10.75s/it]
17%|█▋        | 340/2000 [1:04:36<5:00:08, 10.85s/it]
17%|█▋        | 341/2000 [1:04:46<4:56:55, 10.74s/it]
17%|█▋        | 342/2000 [1:04:57<5:00:11, 10.86s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11967
total_samples=5237, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:45,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.67 | bwd_microstep: 1849.51 | bwd_inner_microstep: 1581.68 | bwd_allreduce_microstep: 267.77 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11761
total_samples=5240, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:48,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.37 | bwd_microstep: 2220.57 | bwd_inner_microstep: 1986.03 | bwd_allreduce_microstep: 234.48 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13304
total_samples=5244, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:51,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.82 | bwd_microstep: 1778.58 | bwd_inner_microstep: 1686.79 | bwd_allreduce_microstep: 91.73 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13171
total_samples=5248, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:53,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 02:47:53,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.99 | bwd_microstep: 1949.16 | bwd_inner_microstep: 1843.43 | bwd_allreduce_microstep: 105.66 | step_microstep: 109.95
[2025-08-03 02:47:53,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.77 | bwd: 7797.86 | bwd_inner: 7097.92 | bwd_allreduce: 699.70 | step: 110.30
{'loss': 0.8035, 'learning_rate': 1.8968130473311732e-05, 'epoch': 0.17}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14142
total_samples=5252, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:56,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.07 | bwd_microstep: 1749.81 | bwd_inner_microstep: 1706.57 | bwd_allreduce_microstep: 43.17 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12013
total_samples=5255, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:47:59,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.76 | bwd_microstep: 2337.22 | bwd_inner_microstep: 1737.41 | bwd_allreduce_microstep: 599.75 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11752
total_samples=5258, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:02,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.04 | bwd_microstep: 1863.90 | bwd_inner_microstep: 1537.57 | bwd_allreduce_microstep: 326.26 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13699
total_samples=5262, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:04,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 02:48:04,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.70 | bwd_microstep: 1845.22 | bwd_inner_microstep: 1685.21 | bwd_allreduce_microstep: 159.96 | step_microstep: 110.29
[2025-08-03 02:48:04,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2740.51 | bwd: 7796.21 | bwd_inner: 6666.76 | bwd_allreduce: 1129.22 | step: 110.62
{'loss': 0.8003, 'learning_rate': 1.896095443338935e-05, 'epoch': 0.17}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13036
total_samples=5266, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:07,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.89 | bwd_microstep: 1962.62 | bwd_inner_microstep: 1690.29 | bwd_allreduce_microstep: 272.26 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11599
total_samples=5269, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:10,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.27 | bwd_microstep: 1718.83 | bwd_inner_microstep: 1542.15 | bwd_allreduce_microstep: 176.61 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13160
total_samples=5273, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:13,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1581.69 | bwd_microstep: 2086.43 | bwd_inner_microstep: 1875.33 | bwd_allreduce_microstep: 211.04 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12059
total_samples=5277, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:16,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.74
[2025-08-03 02:48:16,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.53 | bwd_microstep: 1984.88 | bwd_inner_microstep: 1765.01 | bwd_allreduce_microstep: 219.81 | step_microstep: 113.61
[2025-08-03 02:48:16,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3641.30 | bwd: 7752.80 | bwd_inner: 6872.78 | bwd_allreduce: 879.79 | step: 114.04
{'loss': 0.7937, 'learning_rate': 1.8953754894408617e-05, 'epoch': 0.17}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13633
total_samples=5282, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:19,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.25 | bwd_microstep: 1754.83 | bwd_inner_microstep: 1699.02 | bwd_allreduce_microstep: 55.75 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13512
total_samples=5286, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:22,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.19 | bwd_microstep: 1930.35 | bwd_inner_microstep: 1821.53 | bwd_allreduce_microstep: 108.76 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12380
total_samples=5290, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:24,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.79 | bwd_microstep: 1735.68 | bwd_inner_microstep: 1602.18 | bwd_allreduce_microstep: 133.43 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12105
total_samples=5293, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:27,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.81
[2025-08-03 02:48:27,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.30 | bwd_microstep: 2149.20 | bwd_inner_microstep: 1725.15 | bwd_allreduce_microstep: 424.00 | step_microstep: 109.06
[2025-08-03 02:48:27,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2746.47 | bwd: 7570.11 | bwd_inner: 6847.87 | bwd_allreduce: 722.01 | step: 109.38
{'loss': 0.7994, 'learning_rate': 1.8946531875249496e-05, 'epoch': 0.17}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13940
total_samples=5297, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:30,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.16 | bwd_microstep: 1866.93 | bwd_inner_microstep: 1708.09 | bwd_allreduce_microstep: 158.78 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13478
total_samples=5301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:32,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.21 | bwd_microstep: 1797.61 | bwd_inner_microstep: 1704.11 | bwd_allreduce_microstep: 93.43 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=5305, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:35,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.02 | bwd_microstep: 1721.20 | bwd_inner_microstep: 1663.76 | bwd_allreduce_microstep: 57.38 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13649
total_samples=5309, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:38,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 02:48:38,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.85 | bwd_microstep: 1764.43 | bwd_inner_microstep: 1659.69 | bwd_allreduce_microstep: 104.67 | step_microstep: 442.41
[2025-08-03 02:48:38,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2763.16 | bwd: 7150.21 | bwd_inner: 6735.65 | bwd_allreduce: 414.33 | step: 442.73
{'loss': 0.8126, 'learning_rate': 1.89392853948535e-05, 'epoch': 0.17}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13249
total_samples=5314, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:40,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.26 | bwd_microstep: 1802.98 | bwd_inner_microstep: 1699.29 | bwd_allreduce_microstep: 103.63 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11890
total_samples=5317, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:43,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.16 | bwd_microstep: 1777.14 | bwd_inner_microstep: 1553.51 | bwd_allreduce_microstep: 223.56 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12201
total_samples=5320, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:46,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.14 | bwd_microstep: 1822.21 | bwd_inner_microstep: 1578.47 | bwd_allreduce_microstep: 243.68 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13029
total_samples=5324, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:48,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:48:48,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.02 | bwd_microstep: 2040.98 | bwd_inner_microstep: 1986.59 | bwd_allreduce_microstep: 54.34 | step_microstep: 122.00
[2025-08-03 02:48:48,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.50 | bwd: 7443.36 | bwd_inner: 6817.85 | bwd_allreduce: 625.28 | step: 122.32
17%|█▋        | 343/2000 [1:05:08<5:01:31, 10.92s/it]
17%|█▋        | 344/2000 [1:05:19<5:01:51, 10.94s/it]
17%|█▋        | 345/2000 [1:05:31<5:09:10, 11.21s/it]
17%|█▋        | 346/2000 [1:05:42<5:05:20, 11.08s/it]
17%|█▋        | 347/2000 [1:05:53<5:01:59, 10.96s/it]
17%|█▋        | 348/2000 [1:06:03<4:59:57, 10.89s/it]
{'loss': 0.8007, 'learning_rate': 1.8932015472223692e-05, 'epoch': 0.17}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14174
total_samples=5329, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:51,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.03 | bwd_microstep: 2091.27 | bwd_inner_microstep: 1946.35 | bwd_allreduce_microstep: 144.86 | step_microstep: 0.19
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13277
total_samples=5333, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:54,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.60 | bwd_microstep: 1875.09 | bwd_inner_microstep: 1776.89 | bwd_allreduce_microstep: 98.14 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11729
total_samples=5336, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:48:57,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.48 | bwd_microstep: 1800.22 | bwd_inner_microstep: 1556.88 | bwd_allreduce_microstep: 243.27 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11594
total_samples=5339, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:00,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 02:49:00,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.47 | bwd_microstep: 2017.95 | bwd_inner_microstep: 1748.20 | bwd_allreduce_microstep: 269.69 | step_microstep: 105.83
[2025-08-03 02:49:00,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.53 | bwd: 7784.58 | bwd_inner: 7028.33 | bwd_allreduce: 756.03 | step: 106.23
{'loss': 0.804, 'learning_rate': 1.892472212642459e-05, 'epoch': 0.17}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13144
total_samples=5343, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:02,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.26 | bwd_microstep: 1780.55 | bwd_inner_microstep: 1688.38 | bwd_allreduce_microstep: 92.11 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11655
total_samples=5346, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:05,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.42 | bwd_microstep: 2211.84 | bwd_inner_microstep: 1807.17 | bwd_allreduce_microstep: 404.61 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11873
total_samples=5349, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:08,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.05 | bwd_microstep: 1793.85 | bwd_inner_microstep: 1573.45 | bwd_allreduce_microstep: 220.35 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13384
total_samples=5353, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:11,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 02:49:11,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.69 | bwd_microstep: 2048.08 | bwd_inner_microstep: 1963.54 | bwd_allreduce_microstep: 84.47 | step_microstep: 132.94
[2025-08-03 02:49:11,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.35 | bwd: 7834.37 | bwd_inner: 7032.53 | bwd_allreduce: 801.61 | step: 133.27
{'loss': 0.7994, 'learning_rate': 1.8917405376582144e-05, 'epoch': 0.17}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13314
total_samples=5358, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:13,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.33 | bwd_microstep: 2046.36 | bwd_inner_microstep: 1888.93 | bwd_allreduce_microstep: 157.36 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11824
total_samples=5361, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:16,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.98 | bwd_microstep: 1943.98 | bwd_inner_microstep: 1762.98 | bwd_allreduce_microstep: 180.93 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14189
total_samples=5365, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:19,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.31 | bwd_microstep: 1827.06 | bwd_inner_microstep: 1723.46 | bwd_allreduce_microstep: 103.53 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13052
total_samples=5369, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:22,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 02:49:22,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.86 | bwd_microstep: 1975.27 | bwd_inner_microstep: 1696.78 | bwd_allreduce_microstep: 278.44 | step_microstep: 113.09
[2025-08-03 02:49:22,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.41 | bwd: 7792.72 | bwd_inner: 7072.14 | bwd_allreduce: 720.34 | step: 113.40
{'loss': 0.8088, 'learning_rate': 1.891006524188368e-05, 'epoch': 0.18}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12664
total_samples=5373, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:24,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.34 | bwd_microstep: 1818.22 | bwd_inner_microstep: 1605.17 | bwd_allreduce_microstep: 212.99 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13915
total_samples=5378, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:27,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.92 | bwd_microstep: 1831.22 | bwd_inner_microstep: 1708.48 | bwd_allreduce_microstep: 122.68 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13181
total_samples=5382, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:30,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.16 | bwd_microstep: 1801.87 | bwd_inner_microstep: 1694.99 | bwd_allreduce_microstep: 106.81 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13218
total_samples=5386, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:32,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 02:49:32,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.83 | bwd_microstep: 1891.34 | bwd_inner_microstep: 1637.27 | bwd_allreduce_microstep: 254.01 | step_microstep: 122.70
[2025-08-03 02:49:32,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.20 | bwd: 7342.70 | bwd_inner: 6645.90 | bwd_allreduce: 696.55 | step: 123.05
{'loss': 0.8014, 'learning_rate': 1.8902701741577844e-05, 'epoch': 0.18}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13466
total_samples=5390, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:35,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.31 | bwd_microstep: 2012.40 | bwd_inner_microstep: 1858.87 | bwd_allreduce_microstep: 153.47 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14546
total_samples=5394, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:38,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.38 | bwd_microstep: 1821.99 | bwd_inner_microstep: 1765.52 | bwd_allreduce_microstep: 56.40 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13256
total_samples=5398, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:40,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.83 | bwd_microstep: 1740.55 | bwd_inner_microstep: 1672.36 | bwd_allreduce_microstep: 68.13 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11820
total_samples=5401, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:43,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 02:49:43,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.49 | bwd_microstep: 1772.73 | bwd_inner_microstep: 1545.40 | bwd_allreduce_microstep: 227.27 | step_microstep: 115.26
[2025-08-03 02:49:43,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.95 | bwd: 7347.72 | bwd_inner: 6842.14 | bwd_allreduce: 505.34 | step: 115.57
{'loss': 0.7991, 'learning_rate': 1.889531489497455e-05, 'epoch': 0.18}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13585
total_samples=5405, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:46,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.94 | bwd_microstep: 1807.85 | bwd_inner_microstep: 1704.64 | bwd_allreduce_microstep: 103.14 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13609
total_samples=5409, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:48,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.49 | bwd_microstep: 1780.81 | bwd_inner_microstep: 1707.32 | bwd_allreduce_microstep: 73.43 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13380
total_samples=5413, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:51,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.66 | bwd_microstep: 1828.68 | bwd_inner_microstep: 1720.72 | bwd_allreduce_microstep: 107.90 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12315
total_samples=5416, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:54,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 02:49:54,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.53 | bwd_microstep: 1779.35 | bwd_inner_microstep: 1581.76 | bwd_allreduce_microstep: 197.53 | step_microstep: 150.35
[2025-08-03 02:49:54,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.55 | bwd: 7196.73 | bwd_inner: 6714.43 | bwd_allreduce: 482.07 | step: 150.65
18%|█▊        | 349/2000 [1:06:14<5:01:02, 10.94s/it]
18%|█▊        | 350/2000 [1:06:25<5:02:21, 10.99s/it]
18%|█▊        | 351/2000 [1:06:37<5:02:51, 11.02s/it]
18%|█▊        | 352/2000 [1:06:47<4:59:23, 10.90s/it]
18%|█▊        | 353/2000 [1:06:58<4:56:42, 10.81s/it]
18%|█▊        | 354/2000 [1:07:08<4:54:32, 10.74s/it]
{'loss': 0.7915, 'learning_rate': 1.8887904721444955e-05, 'epoch': 0.18}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13604
total_samples=5420, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:56,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.20 | bwd_microstep: 1818.41 | bwd_inner_microstep: 1703.39 | bwd_allreduce_microstep: 114.95 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13256
total_samples=5424, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:49:59,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.86 | bwd_microstep: 2124.75 | bwd_inner_microstep: 2007.01 | bwd_allreduce_microstep: 117.68 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15282
total_samples=5428, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:02,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.36 | bwd_microstep: 1919.62 | bwd_inner_microstep: 1802.19 | bwd_allreduce_microstep: 117.37 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12495
total_samples=5431, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:05,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 02:50:05,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.96 | bwd_microstep: 1918.08 | bwd_inner_microstep: 1870.18 | bwd_allreduce_microstep: 47.85 | step_microstep: 123.89
[2025-08-03 02:50:05,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.30 | bwd: 7780.91 | bwd_inner: 7382.76 | bwd_allreduce: 397.92 | step: 124.21
{'loss': 0.7935, 'learning_rate': 1.8880471240421365e-05, 'epoch': 0.18}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12886
total_samples=5435, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:08,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 978.47 | bwd_microstep: 2203.96 | bwd_inner_microstep: 2038.48 | bwd_allreduce_microstep: 165.42 | step_microstep: 0.09
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13531
total_samples=5440, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:11,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.35 | bwd_microstep: 1933.18 | bwd_inner_microstep: 1695.73 | bwd_allreduce_microstep: 237.38 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14541
total_samples=5444, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:14,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.57 | bwd_microstep: 2148.52 | bwd_inner_microstep: 1992.40 | bwd_allreduce_microstep: 156.06 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12449
total_samples=5448, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:16,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 02:50:16,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.52 | bwd_microstep: 1830.22 | bwd_inner_microstep: 1640.09 | bwd_allreduce_microstep: 190.06 | step_microstep: 131.64
[2025-08-03 02:50:16,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3157.85 | bwd: 8115.93 | bwd_inner: 7366.67 | bwd_allreduce: 749.00 | step: 132.03
{'loss': 0.7992, 'learning_rate': 1.8873014471397225e-05, 'epoch': 0.18}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12872
total_samples=5452, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:19,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.78 | bwd_microstep: 1901.16 | bwd_inner_microstep: 1638.27 | bwd_allreduce_microstep: 262.83 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891
total_samples=5455, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:22,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.46 | bwd_microstep: 1878.66 | bwd_inner_microstep: 1753.94 | bwd_allreduce_microstep: 124.66 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14125
total_samples=5460, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:24,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.48 | bwd_microstep: 1706.87 | bwd_inner_microstep: 1686.37 | bwd_allreduce_microstep: 20.44 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13500
total_samples=5464, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:27,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 02:50:27,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.91 | bwd_microstep: 1777.99 | bwd_inner_microstep: 1682.27 | bwd_allreduce_microstep: 95.66 | step_microstep: 145.13
[2025-08-03 02:50:27,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.55 | bwd: 7264.73 | bwd_inner: 6760.85 | bwd_allreduce: 503.66 | step: 145.45
{'loss': 0.8024, 'learning_rate': 1.8865534433927034e-05, 'epoch': 0.18}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14310
total_samples=5468, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:30,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.22 | bwd_microstep: 2350.30 | bwd_inner_microstep: 2227.88 | bwd_allreduce_microstep: 122.35 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12397
total_samples=5471, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:33,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.33 | bwd_microstep: 2007.60 | bwd_inner_microstep: 1782.96 | bwd_allreduce_microstep: 224.58 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16120
total_samples=5475, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:36,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.89 | bwd_microstep: 2230.49 | bwd_inner_microstep: 2006.02 | bwd_allreduce_microstep: 224.41 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13824
total_samples=5479, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:39,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 02:50:39,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.60 | bwd_microstep: 2093.06 | bwd_inner_microstep: 1787.67 | bwd_allreduce_microstep: 305.32 | step_microstep: 114.92
[2025-08-03 02:50:39,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.97 | bwd: 8681.50 | bwd_inner: 7804.52 | bwd_allreduce: 876.75 | step: 115.25
{'loss': 0.8085, 'learning_rate': 1.8858031147626326e-05, 'epoch': 0.18}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13477
total_samples=5483, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:41,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.73 | bwd_microstep: 1717.47 | bwd_inner_microstep: 1660.55 | bwd_allreduce_microstep: 56.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14021
total_samples=5487, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:44,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.39 | bwd_microstep: 1718.71 | bwd_inner_microstep: 1686.15 | bwd_allreduce_microstep: 32.49 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14873
total_samples=5491, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:47,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.00 | bwd_microstep: 2144.99 | bwd_inner_microstep: 1912.20 | bwd_allreduce_microstep: 232.73 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14009
total_samples=5495, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:49,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 02:50:49,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.45 | bwd_microstep: 1841.02 | bwd_inner_microstep: 1751.85 | bwd_allreduce_microstep: 89.11 | step_microstep: 136.00
[2025-08-03 02:50:49,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2744.50 | bwd: 7422.26 | bwd_inner: 7010.74 | bwd_allreduce: 411.28 | step: 136.34
{'loss': 0.7996, 'learning_rate': 1.885050463217159e-05, 'epoch': 0.18}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13791
total_samples=5499, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:52,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.38 | bwd_microstep: 1910.65 | bwd_inner_microstep: 1719.04 | bwd_allreduce_microstep: 191.54 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12777
total_samples=5502, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:55,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.69 | bwd_microstep: 1910.18 | bwd_inner_microstep: 1619.71 | bwd_allreduce_microstep: 290.41 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12893
total_samples=5506, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:50:57,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 773.01 | bwd_microstep: 1791.83 | bwd_inner_microstep: 1688.20 | bwd_allreduce_microstep: 103.56 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13354
total_samples=5510, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:00,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:51:00,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.33 | bwd_microstep: 1855.01 | bwd_inner_microstep: 1819.95 | bwd_allreduce_microstep: 35.00 | step_microstep: 158.21
[2025-08-03 02:51:00,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.36 | bwd: 7467.71 | bwd_inner: 6846.89 | bwd_allreduce: 620.59 | step: 158.62
 18%|█▊        | 355/2000 [1:07:19<4:57:21, 10.85s/it]
 18%|█▊        | 356/2000 [1:07:31<5:04:23, 11.11s/it]
 18%|█▊        | 357/2000 [1:07:42<4:59:19, 10.93s/it]
 18%|█▊        | 358/2000 [1:07:54<5:07:43, 11.24s/it]
 18%|█▊        | 359/2000 [1:08:04<5:02:39, 11.07s/it]
 18%|█▊        | 360/2000 [1:08:15<5:00:15, 10.98s/it]
{'loss': 0.7885, 'learning_rate': 1.8842954907300236e-05, 'epoch': 0.18}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12981
total_samples=5514, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:03,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.26 | bwd_microstep: 1932.26 | bwd_inner_microstep: 1660.42 | bwd_allreduce_microstep: 271.79 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12728
total_samples=5518, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:06,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.10 | bwd_microstep: 1756.93 | bwd_inner_microstep: 1623.89 | bwd_allreduce_microstep: 132.98 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14915
total_samples=5522, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:08,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.19 | bwd_microstep: 1742.84 | bwd_inner_microstep: 1708.56 | bwd_allreduce_microstep: 34.21 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13025
total_samples=5526, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:11,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 02:51:11,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.00 | bwd_microstep: 1938.62 | bwd_inner_microstep: 1708.28 | bwd_allreduce_microstep: 230.27 | step_microstep: 131.51
[2025-08-03 02:51:11,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.48 | bwd: 7370.71 | bwd_inner: 6701.15 | bwd_allreduce: 669.34 | step: 131.84
{'loss': 0.805, 'learning_rate': 1.883538199281054e-05, 'epoch': 0.18}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11497
total_samples=5529, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:13,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.38 | bwd_microstep: 1777.10 | bwd_inner_microstep: 1530.55 | bwd_allreduce_microstep: 246.49 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13358
total_samples=5533, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:16,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.14 | bwd_microstep: 1828.60 | bwd_inner_microstep: 1702.22 | bwd_allreduce_microstep: 126.32 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15940
total_samples=5537, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:19,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.74 | bwd_microstep: 1988.68 | bwd_inner_microstep: 1901.83 | bwd_allreduce_microstep: 86.80 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13678
total_samples=5541, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:22,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 02:51:22,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.66 | bwd_microstep: 1846.33 | bwd_inner_microstep: 1654.68 | bwd_allreduce_microstep: 191.59 | step_microstep: 118.78
[2025-08-03 02:51:22,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.85 | bwd: 7440.75 | bwd_inner: 6789.27 | bwd_allreduce: 651.27 | step: 119.10
{'loss': 0.8014, 'learning_rate': 1.8827785908561585e-05, 'epoch': 0.18}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11648
total_samples=5544, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:24,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.03 | bwd_microstep: 2141.85 | bwd_inner_microstep: 1743.03 | bwd_allreduce_microstep: 398.76 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14277
total_samples=5548, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:27,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.54 | bwd_microstep: 1962.58 | bwd_inner_microstep: 1835.92 | bwd_allreduce_microstep: 126.60 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12970
total_samples=5552, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:30,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.97 | bwd_microstep: 2010.87 | bwd_inner_microstep: 1877.44 | bwd_allreduce_microstep: 133.36 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14732
total_samples=5557, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:33,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 02:51:33,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.90 | bwd_microstep: 1899.18 | bwd_inner_microstep: 1742.11 | bwd_allreduce_microstep: 157.01 | step_microstep: 131.66
[2025-08-03 02:51:33,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.36 | bwd: 8014.53 | bwd_inner: 7198.49 | bwd_allreduce: 815.81 | step: 132.08
{'loss': 0.7884, 'learning_rate': 1.8820166674473217e-05, 'epoch': 0.18}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12052
total_samples=5560, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:35,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.47 | bwd_microstep: 1831.96 | bwd_inner_microstep: 1708.66 | bwd_allreduce_microstep: 123.23 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11945
total_samples=5563, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:38,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.02 | bwd_microstep: 2025.05 | bwd_inner_microstep: 1820.64 | bwd_allreduce_microstep: 204.35 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12795
total_samples=5567, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:41,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.56 | bwd_microstep: 1955.40 | bwd_inner_microstep: 1658.07 | bwd_allreduce_microstep: 297.25 | step_microstep: 0.31
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13689
total_samples=5571, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:44,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 02:51:44,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.38 | bwd_microstep: 1733.51 | bwd_inner_microstep: 1676.24 | bwd_allreduce_microstep: 57.21 | step_microstep: 132.46
[2025-08-03 02:51:44,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2751.36 | bwd: 7545.97 | bwd_inner: 6863.61 | bwd_allreduce: 682.13 | step: 132.99
{'loss': 0.7968, 'learning_rate': 1.881252431052599e-05, 'epoch': 0.18}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12199
total_samples=5574, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:46,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.45 | bwd_microstep: 2051.38 | bwd_inner_microstep: 1826.78 | bwd_allreduce_microstep: 224.53 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16363
total_samples=5578, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:49,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.85 | bwd_microstep: 2053.67 | bwd_inner_microstep: 1966.70 | bwd_allreduce_microstep: 86.91 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13501
total_samples=5582, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:52,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.33 | bwd_microstep: 1974.59 | bwd_inner_microstep: 1871.99 | bwd_allreduce_microstep: 102.54 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12518
total_samples=5586, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:55,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 02:51:55,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.57 | bwd_microstep: 1808.15 | bwd_inner_microstep: 1602.32 | bwd_allreduce_microstep: 205.77 | step_microstep: 124.41
[2025-08-03 02:51:55,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.14 | bwd: 7887.83 | bwd_inner: 7267.79 | bwd_allreduce: 619.81 | step: 124.71
{'loss': 0.793, 'learning_rate': 1.880485883676111e-05, 'epoch': 0.18}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14285
total_samples=5590, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:51:57,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.38 | bwd_microstep: 1744.27 | bwd_inner_microstep: 1695.37 | bwd_allreduce_microstep: 48.84 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13229
total_samples=5594, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:00,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 888.25 | bwd_microstep: 1867.96 | bwd_inner_microstep: 1809.01 | bwd_allreduce_microstep: 58.88 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16161
total_samples=5598, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:03,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.55 | bwd_microstep: 1815.07 | bwd_inner_microstep: 1794.40 | bwd_allreduce_microstep: 20.60 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11834
total_samples=5601, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:05,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 02:52:05,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.82 | bwd_microstep: 1830.63 | bwd_inner_microstep: 1613.99 | bwd_allreduce_microstep: 216.57 | step_microstep: 116.54
[2025-08-03 02:52:05,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2991.94 | bwd: 7257.98 | bwd_inner: 6912.78 | bwd_allreduce: 344.97 | step: 116.92
 18%|█▊        | 361/2000 [1:08:26<4:57:04, 10.88s/it]
 18%|█▊        | 362/2000 [1:08:36<4:55:20, 10.82s/it]
 18%|█▊        | 363/2000 [1:08:48<4:58:48, 10.95s/it]
 18%|█▊        | 364/2000 [1:08:58<4:57:06, 10.90s/it]
 18%|█▊        | 365/2000 [1:09:10<4:59:11, 10.98s/it]
 18%|█▊        | 366/2000 [1:09:20<4:
{'loss': 0.7964, 'learning_rate': 1.879717027328039e-05, 'epoch': 0.18}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12344
total_samples=5605, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:08,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.01 | bwd_microstep: 1853.97 | bwd_inner_microstep: 1619.90 | bwd_allreduce_microstep: 233.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13883
total_samples=5609, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:11,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.92 | bwd_microstep: 1856.96 | bwd_inner_microstep: 1831.90 | bwd_allreduce_microstep: 25.00 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15485
total_samples=5614, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:13,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.65 | bwd_microstep: 1761.88 | bwd_inner_microstep: 1755.87 | bwd_allreduce_microstep: 5.95 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12215
total_samples=5617, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:16,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 02:52:16,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.54 | bwd_microstep: 1774.31 | bwd_inner_microstep: 1577.22 | bwd_allreduce_microstep: 197.03 | step_microstep: 126.73
[2025-08-03 02:52:16,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.04 | bwd: 7247.17 | bwd_inner: 6784.89 | bwd_allreduce: 462.05 | step: 127.07
{'loss': 0.793, 'learning_rate': 1.8789458640246193e-05, 'epoch': 0.18}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13634
total_samples=5621, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:19,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.89 | bwd_microstep: 1750.01 | bwd_inner_microstep: 1680.74 | bwd_allreduce_microstep: 69.21 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12376
total_samples=5625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:21,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.48 | bwd_microstep: 1793.30 | bwd_inner_microstep: 1596.04 | bwd_allreduce_microstep: 197.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13871
total_samples=5629, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:24,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.85 | bwd_microstep: 2005.43 | bwd_inner_microstep: 1871.12 | bwd_allreduce_microstep: 134.24 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13251
total_samples=5633, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:28,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 02:52:28,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1803.22 | bwd_microstep: 1798.22 | bwd_inner_microstep: 1705.26 | bwd_allreduce_microstep: 92.89 | step_microstep: 112.36
[2025-08-03 02:52:28,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3870.38 | bwd: 7347.00 | bwd_inner: 6853.16 | bwd_allreduce: 493.61 | step: 112.72
{'loss': 0.7892, 'learning_rate': 1.8781723957881374e-05, 'epoch': 0.18}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13059
total_samples=5637, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:30,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.81 | bwd_microstep: 1717.84 | bwd_inner_microstep: 1650.74 | bwd_allreduce_microstep: 67.04 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13287
total_samples=5641, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:33,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 748.03 | bwd_microstep: 2030.25 | bwd_inner_microstep: 2014.48 | bwd_allreduce_microstep: 15.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14661
total_samples=5646, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:36,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.82 | bwd_microstep: 1944.82 | bwd_inner_microstep: 1723.28 | bwd_allreduce_microstep: 221.47 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13276
total_samples=5650, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:38,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.84
[2025-08-03 02:52:38,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.17 | bwd_microstep: 1714.46 | bwd_inner_microstep: 1623.79 | bwd_allreduce_microstep: 90.60 | step_microstep: 132.05
[2025-08-03 02:52:38,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.76 | bwd: 7407.41 | bwd_inner: 7012.28 | bwd_allreduce: 394.90 | step: 132.38
{'loss': 0.8023, 'learning_rate': 1.8773966246469238e-05, 'epoch': 0.18}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13248
total_samples=5654, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:41,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.19 | bwd_microstep: 2128.43 | bwd_inner_microstep: 1982.91 | bwd_allreduce_microstep: 145.46 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13191
total_samples=5658, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:44,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.62 | bwd_microstep: 1840.86 | bwd_inner_microstep: 1677.88 | bwd_allreduce_microstep: 162.91 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13503
total_samples=5662, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:47,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.93 | bwd_microstep: 1977.78 | bwd_inner_microstep: 1703.76 | bwd_allreduce_microstep: 273.96 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13893
total_samples=5667, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:50,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:52:50,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.64 | bwd_microstep: 2050.01 | bwd_inner_microstep: 2016.66 | bwd_allreduce_microstep: 33.28 | step_microstep: 150.75
[2025-08-03 02:52:50,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2746.30 | bwd: 7997.13 | bwd_inner: 7381.22 | bwd_allreduce: 615.69 | step: 151.19
{'loss': 0.789, 'learning_rate': 1.876618552635348e-05, 'epoch': 0.18}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12494
total_samples=5671, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:52,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.04 | bwd_microstep: 1937.52 | bwd_inner_microstep: 1630.26 | bwd_allreduce_microstep: 307.19 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14027
total_samples=5675, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:55,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.08 | bwd_microstep: 1854.20 | bwd_inner_microstep: 1822.51 | bwd_allreduce_microstep: 31.63 | step_microstep: 0.09
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12940
total_samples=5679, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:52:58,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.53 | bwd_microstep: 1989.62 | bwd_inner_microstep: 1693.70 | bwd_allreduce_microstep: 295.86 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13175
total_samples=5683, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:00,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 02:53:00,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.01 | bwd_microstep: 1740.48 | bwd_inner_microstep: 1619.47 | bwd_allreduce_microstep: 120.95 | step_microstep: 143.38
[2025-08-03 02:53:00,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2763.59 | bwd: 7521.88 | bwd_inner: 6765.94 | bwd_allreduce: 755.71 | step: 143.71
{'loss': 0.7956, 'learning_rate': 1.8758381817938126e-05, 'epoch': 0.19}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12234
total_samples=5687, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:03,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.92 | bwd_microstep: 1842.03 | bwd_inner_microstep: 1614.51 | bwd_allreduce_microstep: 227.46 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13282
total_samples=5692, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:06,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.30 | bwd_microstep: 1735.93 | bwd_inner_microstep: 1670.97 | bwd_allreduce_microstep: 64.89 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12660
total_samples=5696, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:08,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.16 | bwd_microstep: 2089.11 | bwd_inner_microstep: 1625.46 | bwd_allreduce_microstep: 463.59 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13546
total_samples=5700, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:11,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 02:53:11,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.12 | bwd_microstep: 1706.65 | bwd_inner_microstep: 1626.17 | bwd_allreduce_microstep: 80.42 | step_microstep: 130.92
[2025-08-03 02:53:11,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.43 | bwd: 7373.77 | bwd_inner: 6537.11 | bwd_allreduce: 836.43 | step: 131.24
18%|█▊        | 366/2000 [1:09:20<4:56:35, 10.89s/it]
18%|█▊        | 367/2000 [1:09:31<4:53:14, 10.77s/it]
18%|█▊        | 368/2000 [1:09:42<5:00:22, 11.04s/it]
18%|█▊        | 368/2000 [1:09:43<5:00:22, 11.04s/it]
18%|█▊        | 369/2000 [1:09:53<4:57:24, 10.94s/it]
18%|█▊        | 370/2000 [1:10:04<4:59:46, 11.03s/it]
19%|█▊        | 371/2000 [1:10:15<4:57:28, 10.96s/it]
{'loss': 0.7859, 'learning_rate': 1.87505551416875e-05, 'epoch': 0.19}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12195
total_samples=5704, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:14,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.74 | bwd_microstep: 1826.28 | bwd_inner_microstep: 1584.71 | bwd_allreduce_microstep: 241.51 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13119
total_samples=5708, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:17,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.93 | bwd_microstep: 2129.41 | bwd_inner_microstep: 1706.66 | bwd_allreduce_microstep: 422.70 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11896
total_samples=5711, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:19,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.19 | bwd_microstep: 1788.01 | bwd_inner_microstep: 1565.15 | bwd_allreduce_microstep: 222.79 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13301
total_samples=5715, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:22,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 02:53:22,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.81 | bwd_microstep: 1755.15 | bwd_inner_microstep: 1685.18 | bwd_allreduce_microstep: 69.90 | step_microstep: 133.83
[2025-08-03 02:53:22,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.61 | bwd: 7498.90 | bwd_inner: 6541.70 | bwd_allreduce: 956.97 | step: 134.15
{'loss': 0.7997, 'learning_rate': 1.874270551812614e-05, 'epoch': 0.19}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11576
total_samples=5718, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:24,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.75 | bwd_microstep: 1822.03 | bwd_inner_microstep: 1610.45 | bwd_allreduce_microstep: 211.51 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13058
total_samples=5722, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:27,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.75 | bwd_microstep: 2056.42 | bwd_inner_microstep: 1911.12 | bwd_allreduce_microstep: 145.25 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12221
total_samples=5725, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:30,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.82 | bwd_microstep: 1720.33 | bwd_inner_microstep: 1558.04 | bwd_allreduce_microstep: 162.22 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14083
total_samples=5730, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:33,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 02:53:33,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 748.33 | bwd_microstep: 1846.18 | bwd_inner_microstep: 1711.61 | bwd_allreduce_microstep: 134.50 | step_microstep: 143.40
[2025-08-03 02:53:33,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.58 | bwd: 7445.00 | bwd_inner: 6791.22 | bwd_allreduce: 653.55 | step: 143.75
{'loss': 0.8023, 'learning_rate': 1.8734832967838775e-05, 'epoch': 0.19}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14176
total_samples=5735, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:35,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.24 | bwd_microstep: 2107.40 | bwd_inner_microstep: 1777.88 | bwd_allreduce_microstep: 329.45 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14515
total_samples=5739, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:38,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.20 | bwd_microstep: 1794.41 | bwd_inner_microstep: 1741.97 | bwd_allreduce_microstep: 52.37 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14236
total_samples=5743, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:41,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.21 | bwd_microstep: 1695.95 | bwd_inner_microstep: 1683.03 | bwd_allreduce_microstep: 12.85 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11903
total_samples=5746, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:44,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 02:53:44,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.70 | bwd_microstep: 2113.84 | bwd_inner_microstep: 2108.02 | bwd_allreduce_microstep: 5.76 | step_microstep: 113.09
[2025-08-03 02:53:44,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.27 | bwd: 7711.65 | bwd_inner: 7310.90 | bwd_allreduce: 400.52 | step: 113.46
{'loss': 0.7956, 'learning_rate': 1.8726937511470247e-05, 'epoch': 0.19}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13784
total_samples=5750, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:46,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.58 | bwd_microstep: 1771.37 | bwd_inner_microstep: 1686.52 | bwd_allreduce_microstep: 84.79 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12747
total_samples=5754, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:49,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.73 | bwd_microstep: 1989.22 | bwd_inner_microstep: 1659.18 | bwd_allreduce_microstep: 329.98 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13643
total_samples=5758, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:52,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.95 | bwd_microstep: 2248.76 | bwd_inner_microstep: 2104.02 | bwd_allreduce_microstep: 144.68 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11849
total_samples=5761, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:55,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 02:53:55,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.01 | bwd_microstep: 1856.11 | bwd_inner_microstep: 1543.11 | bwd_allreduce_microstep: 312.93 | step_microstep: 112.76
[2025-08-03 02:53:55,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2734.21 | bwd: 7865.52 | bwd_inner: 6992.82 | bwd_allreduce: 872.46 | step: 113.20
{'loss': 0.8032, 'learning_rate': 1.871901916972547e-05, 'epoch': 0.19}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14178
total_samples=5765, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:53:57,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.00 | bwd_microstep: 1923.49 | bwd_inner_microstep: 1843.71 | bwd_allreduce_microstep: 79.69 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13454
total_samples=5769, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:00,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.15 | bwd_microstep: 1848.24 | bwd_inner_microstep: 1715.70 | bwd_allreduce_microstep: 132.47 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14550
total_samples=5773, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:03,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.51 | bwd_microstep: 1919.03 | bwd_inner_microstep: 1859.10 | bwd_allreduce_microstep: 59.86 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12099
total_samples=5776, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:06,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 02:54:06,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.99 | bwd_microstep: 2277.11 | bwd_inner_microstep: 2053.13 | bwd_allreduce_microstep: 223.92 | step_microstep: 139.46
[2025-08-03 02:54:06,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.59 | bwd: 7967.90 | bwd_inner: 7471.64 | bwd_allreduce: 496.03 | step: 139.79
{'loss': 0.7977, 'learning_rate': 1.8711077963369377e-05, 'epoch': 0.19}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11936
total_samples=5779, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:08,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.33 | bwd_microstep: 1767.41 | bwd_inner_microstep: 1550.63 | bwd_allreduce_microstep: 216.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14964
total_samples=5785, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:11,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.59 | bwd_microstep: 1828.53 | bwd_inner_microstep: 1777.42 | bwd_allreduce_microstep: 51.04 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14004
total_samples=5789, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:14,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.70 | bwd_microstep: 2096.25 | bwd_inner_microstep: 1916.63 | bwd_allreduce_microstep: 179.56 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11823
total_samples=5792, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:16,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 02:54:16,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.27 | bwd_microstep: 1710.68 | bwd_inner_microstep: 1533.76 | bwd_allreduce_microstep: 176.86 | step_microstep: 126.95
[2025-08-03 02:54:16,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.82 | bwd: 7402.92 | bwd_inner: 6778.44 | bwd_allreduce: 624.24 | step: 127.45
19%|█▊        | 372/2000 [1:10:26<4:54:28, 10.85s/it]
19%|█▊        | 373/2000 [1:10:37<4:53:59, 10.84s/it]
19%|█▊        | 374/2000 [1:10:47<4:53:17, 10.82s/it]
19%|█▉        | 375/2000 [1:10:58<4:54:11, 10.86s/it]
19%|█▉        | 376/2000 [1:11:09<4:55:26, 10.92s/it]
19%|█▉        | 377/2000 [1:11:21<4:57:50, 11.01s/it]
{'loss': 0.8049, 'learning_rate': 1.8703113913226847e-05, 'epoch': 0.19}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13438
total_samples=5796, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:19,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.33 | bwd_microstep: 2075.77 | bwd_inner_microstep: 1930.02 | bwd_allreduce_microstep: 145.68 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13397
total_samples=5800, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:22,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.32 | bwd_microstep: 1855.74 | bwd_inner_microstep: 1699.72 | bwd_allreduce_microstep: 155.96 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13318
total_samples=5804, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:24,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.81 | bwd_microstep: 1768.05 | bwd_inner_microstep: 1678.90 | bwd_allreduce_microstep: 89.09 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13116
total_samples=5808, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:27,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 02:54:27,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.48 | bwd_microstep: 2108.64 | bwd_inner_microstep: 1940.92 | bwd_allreduce_microstep: 167.66 | step_microstep: 118.24
[2025-08-03 02:54:27,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.88 | bwd: 7808.24 | bwd_inner: 7249.55 | bwd_allreduce: 558.46 | step: 118.58
{'loss': 0.7914, 'learning_rate': 1.8695127040182678e-05, 'epoch': 0.19}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14065
total_samples=5812, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:30,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.35 | bwd_microstep: 1749.74 | bwd_inner_microstep: 1699.96 | bwd_allreduce_microstep: 49.70 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12864
total_samples=5816, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:33,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.84 | bwd_microstep: 1936.71 | bwd_inner_microstep: 1639.92 | bwd_allreduce_microstep: 296.73 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14458
total_samples=5820, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:35,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.89 | bwd_microstep: 1764.84 | bwd_inner_microstep: 1714.68 | bwd_allreduce_microstep: 50.09 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12162
total_samples=5823, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:38,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 02:54:38,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.18 | bwd_microstep: 1760.80 | bwd_inner_microstep: 1564.42 | bwd_allreduce_microstep: 196.31 | step_microstep: 120.73
[2025-08-03 02:54:38,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2733.17 | bwd: 7212.14 | bwd_inner: 6618.98 | bwd_allreduce: 592.91 | step: 121.21
{'loss': 0.8085, 'learning_rate': 1.8687117365181514e-05, 'epoch': 0.19}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13361
total_samples=5827, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:41,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.15 | bwd_microstep: 1982.85 | bwd_inner_microstep: 1858.77 | bwd_allreduce_microstep: 124.02 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12919
total_samples=5830, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:43,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.66 | bwd_microstep: 1774.57 | bwd_inner_microstep: 1611.44 | bwd_allreduce_microstep: 163.06 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13457
total_samples=5834, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:46,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.12 | bwd_microstep: 2032.34 | bwd_inner_microstep: 1869.67 | bwd_allreduce_microstep: 162.60 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12080
total_samples=5838, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:49,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58
[2025-08-03 02:54:49,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.22 | bwd_microstep: 1737.92 | bwd_inner_microstep: 1561.45 | bwd_allreduce_microstep: 176.41 | step_microstep: 142.13
[2025-08-03 02:54:49,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2714.08 | bwd: 7527.72 | bwd_inner: 6901.33 | bwd_allreduce: 626.17 | step: 142.56
{'loss': 0.7803, 'learning_rate': 1.867908490922779e-05, 'epoch': 0.19}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13294
total_samples=5842, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:52,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.80 | bwd_microstep: 2134.43 | bwd_inner_microstep: 1985.31 | bwd_allreduce_microstep: 149.05 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11763
total_samples=5845, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:54,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.81 | bwd_microstep: 1730.71 | bwd_inner_microstep: 1596.99 | bwd_allreduce_microstep: 133.66 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12920
total_samples=5849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:54:57,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.06 | bwd_microstep: 1785.77 | bwd_inner_microstep: 1668.70 | bwd_allreduce_microstep: 117.01 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13491
total_samples=5853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:00,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 02:55:00,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.49 | bwd_microstep: 2163.79 | bwd_inner_microstep: 1987.66 | bwd_allreduce_microstep: 176.06 | step_microstep: 109.42
[2025-08-03 02:55:00,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.10 | bwd: 7814.74 | bwd_inner: 7238.65 | bwd_allreduce: 575.85 | step: 109.73
{'loss': 0.799, 'learning_rate': 1.867102969338569e-05, 'epoch': 0.19}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13897
total_samples=5857, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:02,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.94 | bwd_microstep: 1944.45 | bwd_inner_microstep: 1727.90 | bwd_allreduce_microstep: 216.49 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11697
total_samples=5860, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:05,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.35 | bwd_microstep: 1917.84 | bwd_inner_microstep: 1534.69 | bwd_allreduce_microstep: 383.09 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13561
total_samples=5865, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:08,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.09 | bwd_microstep: 1796.53 | bwd_inner_microstep: 1694.40 | bwd_allreduce_microstep: 102.06 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13518
total_samples=5869, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:10,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 02:55:10,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.58 | bwd_microstep: 1850.75 | bwd_inner_microstep: 1669.62 | bwd_allreduce_microstep: 181.06 | step_microstep: 116.74
[2025-08-03 02:55:10,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2824.89 | bwd: 7509.62 | bwd_inner: 6626.61 | bwd_allreduce: 882.77 | step: 117.20
{'loss': 0.7989, 'learning_rate': 1.8662951738779077e-05, 'epoch': 0.19}
19%|█▉        | 378/2000 [1:11:31<4:54:48, 10.91s/it]
19%|█▉        | 379/2000 [1:11:42<4:55:35, 10.94s/it]
19%|█▉        | 380/2000 [1:11:53<4:51:03, 10.78s/it]
19%|█▉        | 381/2000 [1:12:03<4:50:23, 10.76s/it]
19%|█▉        | 382/2000 [1:12:15<4:52:31, 10.85s/it]
19%|█▉        | 383/2000 [1:12:25<4:51:49, 10.83s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15038
total_samples=5873, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:13,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.26 | bwd_microstep: 1879.35 | bwd_inner_microstep: 1837.80 | bwd_allreduce_microstep: 41.49 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12903
total_samples=5877, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:16,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.10 | bwd_microstep: 2104.16 | bwd_inner_microstep: 1909.99 | bwd_allreduce_microstep: 194.11 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15556
total_samples=5881, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:19,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.44 | bwd_microstep: 1808.12 | bwd_inner_microstep: 1768.51 | bwd_allreduce_microstep: 39.54 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12238
total_samples=5884, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:21,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 02:55:21,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.07 | bwd_microstep: 1988.37 | bwd_inner_microstep: 1585.28 | bwd_allreduce_microstep: 403.03 | step_microstep: 109.60
[2025-08-03 02:55:21,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.81 | bwd: 7780.05 | bwd_inner: 7101.58 | bwd_allreduce: 678.24 | step: 110.05
{'loss': 0.7951, 'learning_rate': 1.865485106659145e-05, 'epoch': 0.19}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13580
total_samples=5888, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:24,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.17 | bwd_microstep: 1753.67 | bwd_inner_microstep: 1679.39 | bwd_allreduce_microstep: 74.21 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13073
total_samples=5892, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:27,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.79 | bwd_microstep: 1821.76 | bwd_inner_microstep: 1682.06 | bwd_allreduce_microstep: 139.64 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11630
total_samples=5895, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:29,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.35 | bwd_microstep: 1784.54 | bwd_inner_microstep: 1568.86 | bwd_allreduce_microstep: 215.61 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11868
total_samples=5898, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:32,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 02:55:32,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.71 | bwd_microstep: 1715.11 | bwd_inner_microstep: 1540.33 | bwd_allreduce_microstep: 174.73 | step_microstep: 142.38
[2025-08-03 02:55:32,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.96 | bwd: 7075.13 | bwd_inner: 6470.64 | bwd_allreduce: 604.27 | step: 142.72
{'loss': 0.7948, 'learning_rate': 1.8646727698065865e-05, 'epoch': 0.19}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13526
total_samples=5902, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:34,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.45 | bwd_microstep: 1774.71 | bwd_inner_microstep: 1693.38 | bwd_allreduce_microstep: 81.26 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13537
total_samples=5906, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:37,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.51 | bwd_microstep: 1737.78 | bwd_inner_microstep: 1680.76 | bwd_allreduce_microstep: 56.96 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14108
total_samples=5910, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:40,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.09 | bwd_microstep: 1935.48 | bwd_inner_microstep: 1890.33 | bwd_allreduce_microstep: 45.08 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11750
total_samples=5913, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:42,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 02:55:42,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.42 | bwd_microstep: 1842.21 | bwd_inner_microstep: 1531.54 | bwd_allreduce_microstep: 310.60 | step_microstep: 114.94
[2025-08-03 02:55:42,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.40 | bwd: 7290.22 | bwd_inner: 6796.01 | bwd_allreduce: 493.98 | step: 115.39
{'loss': 0.8062, 'learning_rate': 1.863858165450492e-05, 'epoch': 0.19}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13138
total_samples=5917, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:45,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.54 | bwd_microstep: 1908.50 | bwd_inner_microstep: 1846.44 | bwd_allreduce_microstep: 61.99 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14213
total_samples=5921, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:48,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.61 | bwd_microstep: 1771.93 | bwd_inner_microstep: 1722.64 | bwd_allreduce_microstep: 49.23 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11727
total_samples=5924, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:50,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.81 | bwd_microstep: 1840.09 | bwd_inner_microstep: 1604.24 | bwd_allreduce_microstep: 235.79 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13893
total_samples=5928, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:53,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 02:55:53,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.09 | bwd_microstep: 1828.49 | bwd_inner_microstep: 1722.26 | bwd_allreduce_microstep: 106.16 | step_microstep: 152.46
[2025-08-03 02:55:53,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.97 | bwd: 7349.06 | bwd_inner: 6895.57 | bwd_allreduce: 453.25 | step: 153.02
{'loss': 0.8037, 'learning_rate': 1.863041295727066e-05, 'epoch': 0.19}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15076
total_samples=5932, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:56,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.14 | bwd_microstep: 1886.14 | bwd_inner_microstep: 1704.19 | bwd_allreduce_microstep: 181.89 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13216
total_samples=5936, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:55:58,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.91 | bwd_microstep: 1853.33 | bwd_inner_microstep: 1687.93 | bwd_allreduce_microstep: 165.34 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11634
total_samples=5939, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:01,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.15 | bwd_microstep: 1756.55 | bwd_inner_microstep: 1524.86 | bwd_allreduce_microstep: 231.63 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14648
total_samples=5943, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:03,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 02:56:03,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.19 | bwd_microstep: 1776.27 | bwd_inner_microstep: 1746.14 | bwd_allreduce_microstep: 30.06 | step_microstep: 111.04
[2025-08-03 02:56:03,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.33 | bwd: 7272.33 | bwd_inner: 6663.12 | bwd_allreduce: 608.99 | step: 111.35
{'loss': 0.7996, 'learning_rate': 1.862222162778454e-05, 'epoch': 0.19}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14367
total_samples=5948, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:06,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.14 | bwd_microstep: 1954.83 | bwd_inner_microstep: 1812.12 | bwd_allreduce_microstep: 142.66 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14055
total_samples=5952, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:09,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.52 | bwd_microstep: 1764.75 | bwd_inner_microstep: 1721.50 | bwd_allreduce_microstep: 43.19 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14573
total_samples=5956, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:11,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.81 | bwd_microstep: 1727.17 | bwd_inner_microstep: 1721.13 | bwd_allreduce_microstep: 5.98 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11714
total_samples=5959, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:14,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 02:56:14,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.91 | bwd_microstep: 1738.27 | bwd_inner_microstep: 1530.72 | bwd_allreduce_microstep: 207.48 | step_microstep: 144.34
[2025-08-03 02:56:14,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2739.32 | bwd: 7185.07 | bwd_inner: 6785.47 | bwd_allreduce: 399.38 | step: 144.66
{'loss': 0.791, 'learning_rate': 1.8614007687527374e-05, 'epoch': 0.19}
 19%|█▉        | 384/2000 [1:12:36<4:53:08, 10.88s/it]
 19%|█▉        | 385/2000 [1:12:47<4:48:29, 10.72s/it]
 19%|█▉        | 386/2000 [1:12:57<4:46:51, 10.66s/it]
 19%|█▉        | 387/2000 [1:13:08<4:46:33, 10.66s/it]
 19%|█▉        | 388/2000 [1:13:18<4:45:23, 10.62s/it]
 19%|█▉        | 389/2000 [1:13:29<4:43:29, 10.56s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14751
total_samples=5963, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:17,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.71 | bwd_microstep: 2089.13 | bwd_inner_microstep: 1849.70 | bwd_allreduce_microstep: 239.38 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15261
total_samples=5967, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:20,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.61 | bwd_microstep: 1991.81 | bwd_inner_microstep: 1863.50 | bwd_allreduce_microstep: 128.25 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11724
total_samples=5970, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:22,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.89 | bwd_microstep: 1760.83 | bwd_inner_microstep: 1541.97 | bwd_allreduce_microstep: 218.80 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11761
total_samples=5973, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:25,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:56:25,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.99 | bwd_microstep: 1987.36 | bwd_inner_microstep: 1789.67 | bwd_allreduce_microstep: 197.62 | step_microstep: 110.84
[2025-08-03 02:56:25,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.13 | bwd: 7829.18 | bwd_inner: 7044.83 | bwd_allreduce: 784.12 | step: 111.23
{'loss': 0.7937, 'learning_rate': 1.8605771158039253e-05, 'epoch': 0.2}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14043
total_samples=5979, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:28,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.86 | bwd_microstep: 1840.75 | bwd_inner_microstep: 1714.98 | bwd_allreduce_microstep: 125.69 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12712
total_samples=5983, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:30,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.99 | bwd_microstep: 2056.03 | bwd_inner_microstep: 1636.02 | bwd_allreduce_microstep: 419.94 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11712
total_samples=5986, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:33,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.77 | bwd_microstep: 1894.06 | bwd_inner_microstep: 1546.05 | bwd_allreduce_microstep: 347.94 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12947
total_samples=5990, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:36,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 02:56:36,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.36 | bwd_microstep: 2068.84 | bwd_inner_microstep: 1875.96 | bwd_allreduce_microstep: 192.83 | step_microstep: 111.98
[2025-08-03 02:56:36,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.92 | bwd: 7859.72 | bwd_inner: 6773.01 | bwd_allreduce: 1086.48 | step: 112.35
{'loss': 0.8096, 'learning_rate': 1.8597512060919523e-05, 'epoch': 0.2}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13503
total_samples=5994, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:39,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.01 | bwd_microstep: 1976.20 | bwd_inner_microstep: 1825.17 | bwd_allreduce_microstep: 150.97 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11765
total_samples=5997, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:42,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.79 | bwd_microstep: 2068.64 | bwd_inner_microstep: 1823.29 | bwd_allreduce_microstep: 245.27 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13425
total_samples=6001, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:44,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.10 | bwd_microstep: 1824.91 | bwd_inner_microstep: 1704.74 | bwd_allreduce_microstep: 120.11 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11494
total_samples=6004, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:47,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 02:56:47,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.71 | bwd_microstep: 1899.97 | bwd_inner_microstep: 1747.82 | bwd_allreduce_microstep: 152.08 | step_microstep: 130.18
[2025-08-03 02:56:47,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.55 | bwd: 7769.76 | bwd_inner: 7101.02 | bwd_allreduce: 668.51 | step: 130.53
{'loss': 0.8006, 'learning_rate': 1.85892304178267e-05, 'epoch': 0.2}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13289
total_samples=6008, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:50,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 661.32 | bwd_microstep: 2383.04 | bwd_inner_microstep: 2228.94 | bwd_allreduce_microstep: 154.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14152
total_samples=6012, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:53,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.14 | bwd_microstep: 1780.19 | bwd_inner_microstep: 1723.74 | bwd_allreduce_microstep: 56.38 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13488
total_samples=6016, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:56,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.03 | bwd_microstep: 1975.61 | bwd_inner_microstep: 1865.32 | bwd_allreduce_microstep: 110.23 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11533
total_samples=6021, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:56:58,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 02:56:58,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.56 | bwd_microstep: 1906.70 | bwd_inner_microstep: 1727.83 | bwd_allreduce_microstep: 178.80 | step_microstep: 130.34
[2025-08-03 02:56:58,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2753.98 | bwd: 8045.58 | bwd_inner: 7545.83 | bwd_allreduce: 499.51 | step: 130.74
{'loss': 0.7849, 'learning_rate': 1.8580926250478425e-05, 'epoch': 0.2}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12734
total_samples=6025, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:01,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.65 | bwd_microstep: 1728.90 | bwd_inner_microstep: 1596.21 | bwd_allreduce_microstep: 132.63 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13413
total_samples=6028, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:03,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.01 | bwd_microstep: 1787.87 | bwd_inner_microstep: 1634.14 | bwd_allreduce_microstep: 153.67 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11688
total_samples=6031, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:06,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.30 | bwd_microstep: 1996.42 | bwd_inner_microstep: 1776.59 | bwd_allreduce_microstep: 219.76 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11710
total_samples=6034, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:09,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 02:57:09,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.95 | bwd_microstep: 2147.75 | bwd_inner_microstep: 1915.85 | bwd_allreduce_microstep: 231.84 | step_microstep: 111.99
[2025-08-03 02:57:09,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.84 | bwd: 7660.98 | bwd_inner: 6922.79 | bwd_allreduce: 737.96 | step: 112.33
{'loss': 0.7949, 'learning_rate': 1.8572599580651415e-05, 'epoch': 0.2}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11469
total_samples=6037, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:12,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.46 | bwd_microstep: 2051.17 | bwd_inner_microstep: 1821.58 | bwd_allreduce_microstep: 229.53 | step_microstep: 0.19
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15872
total_samples=6041, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:15,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.86 | bwd_microstep: 1778.47 | bwd_inner_microstep: 1769.72 | bwd_allreduce_microstep: 8.69 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13996
total_samples=6045, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:17,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.04 | bwd_microstep: 1718.12 | bwd_inner_microstep: 1671.67 | bwd_allreduce_microstep: 46.37 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13553
total_samples=6049, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:20,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.62
[2025-08-03 02:57:20,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.09 | bwd_microstep: 1774.36 | bwd_inner_microstep: 1711.73 | bwd_allreduce_microstep: 62.56 | step_microstep: 161.83
[2025-08-03 02:57:20,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2745.38 | bwd: 7322.16 | bwd_inner: 6974.70 | bwd_allreduce: 347.23 | step: 162.22
{'loss': 0.7831, 'learning_rate': 1.8564250430181387e-05, 'epoch': 0.2}
 20%|█▉        | 390/2000 [1:13:40<4:47:13, 10.70s/it]
 20%|█▉        | 391/2000 [1:13:51<4:50:40, 10.84s/it]
 20%|█▉        | 392/2000 [1:14:02<4:52:08, 10.90s/it]
 20%|█▉        | 393/2000 [1:14:13<4:54:45, 11.01s/it]
 20%|█▉        | 394/2000 [1:14:24<4:53:39, 10.97s/it]
 20%|█▉        | 395/2000 [1:14:35<4:50:05, 10.84s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11887
total_samples=6052, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:23,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.18 | bwd_microstep: 2156.25 | bwd_inner_microstep: 1831.90 | bwd_allreduce_microstep: 324.29 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11889
total_samples=6055, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:25,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.42 | bwd_microstep: 1782.12 | bwd_inner_microstep: 1562.73 | bwd_allreduce_microstep: 219.32 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11838
total_samples=6058, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:28,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.48 | bwd_microstep: 1795.77 | bwd_inner_microstep: 1546.65 | bwd_allreduce_microstep: 249.06 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12993
total_samples=6062, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:31,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.83
[2025-08-03 02:57:31,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.21 | bwd_microstep: 1800.98 | bwd_inner_microstep: 1674.60 | bwd_allreduce_microstep: 126.31 | step_microstep: 135.43
[2025-08-03 02:57:31,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.22 | bwd: 7535.17 | bwd_inner: 6615.87 | bwd_allreduce: 919.06 | step: 135.76
{'loss': 0.791, 'learning_rate': 1.8555878820963014e-05, 'epoch': 0.2}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=6066, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:33,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.75 | bwd_microstep: 1823.05 | bwd_inner_microstep: 1705.90 | bwd_allreduce_microstep: 117.09 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11671
total_samples=6069, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:36,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.47 | bwd_microstep: 1787.84 | bwd_inner_microstep: 1538.57 | bwd_allreduce_microstep: 249.21 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11679
total_samples=6072, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:39,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.33 | bwd_microstep: 1932.90 | bwd_inner_microstep: 1736.84 | bwd_allreduce_microstep: 195.98 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515
total_samples=6076, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:41,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 02:57:41,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 660.30 | bwd_microstep: 1748.65 | bwd_inner_microstep: 1691.26 | bwd_allreduce_microstep: 57.33 | step_microstep: 140.08
[2025-08-03 02:57:41,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.77 | bwd: 7292.49 | bwd_inner: 6672.56 | bwd_allreduce: 619.70 | step: 140.50
{'loss': 0.7803, 'learning_rate': 1.8547484774949865e-05, 'epoch': 0.2}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11894
total_samples=6079, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:44,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.31 | bwd_microstep: 1814.51 | bwd_inner_microstep: 1563.33 | bwd_allreduce_microstep: 251.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13442
total_samples=6083, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:46,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.35 | bwd_microstep: 1734.36 | bwd_inner_microstep: 1670.58 | bwd_allreduce_microstep: 63.72 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11788
total_samples=6086, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:49,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.29 | bwd_microstep: 1915.26 | bwd_inner_microstep: 1733.12 | bwd_allreduce_microstep: 182.08 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11662
total_samples=6090, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:52,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 02:57:52,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.61 | bwd_microstep: 1788.01 | bwd_inner_microstep: 1553.49 | bwd_allreduce_microstep: 234.45 | step_microstep: 138.62
[2025-08-03 02:57:52,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.49 | bwd: 7252.18 | bwd_inner: 6520.52 | bwd_allreduce: 731.43 | step: 139.05
{'loss': 0.7883, 'learning_rate': 1.8539068314154355e-05, 'epoch': 0.2}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12085
total_samples=6093, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:54,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.37 | bwd_microstep: 1805.48 | bwd_inner_microstep: 1581.26 | bwd_allreduce_microstep: 224.14 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12897
total_samples=6097, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:57:57,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.08 | bwd_microstep: 2242.21 | bwd_inner_microstep: 2123.01 | bwd_allreduce_microstep: 119.14 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13754
total_samples=6101, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:00,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.72 | bwd_microstep: 1784.71 | bwd_inner_microstep: 1715.44 | bwd_allreduce_microstep: 69.19 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13443
total_samples=6105, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:03,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 02:58:03,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.15 | bwd_microstep: 2113.20 | bwd_inner_microstep: 1974.59 | bwd_allreduce_microstep: 138.54 | step_microstep: 109.46
[2025-08-03 02:58:03,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.26 | bwd: 7945.64 | bwd_inner: 7394.31 | bwd_allreduce: 551.09 | step: 109.85
{'loss': 0.8005, 'learning_rate': 1.8530629460647658e-05, 'epoch': 0.2}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13611
total_samples=6109, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:06,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.79 | bwd_microstep: 1777.73 | bwd_inner_microstep: 1699.38 | bwd_allreduce_microstep: 78.29 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13890
total_samples=6114, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:08,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.72 | bwd_microstep: 1855.97 | bwd_inner_microstep: 1745.44 | bwd_allreduce_microstep: 110.47 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13582
total_samples=6118, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:11,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.54 | bwd_microstep: 1749.12 | bwd_inner_microstep: 1692.97 | bwd_allreduce_microstep: 56.08 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11505
total_samples=6122, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:13,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 02:58:13,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.07 | bwd_microstep: 1831.00 | bwd_inner_microstep: 1595.42 | bwd_allreduce_microstep: 235.52 | step_microstep: 110.02
[2025-08-03 02:58:13,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.04 | bwd: 7213.87 | bwd_inner: 6733.20 | bwd_allreduce: 480.44 | step: 110.37
{'loss': 0.7781, 'learning_rate': 1.8522168236559693e-05, 'epoch': 0.2}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13318
total_samples=6126, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:16,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.18 | bwd_microstep: 1909.49 | bwd_inner_microstep: 1832.97 | bwd_allreduce_microstep: 76.46 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13276
total_samples=6130, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:19,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.34 | bwd_microstep: 1814.89 | bwd_inner_microstep: 1706.81 | bwd_allreduce_microstep: 108.02 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13690
total_samples=6134, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:21,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.41 | bwd_microstep: 1761.23 | bwd_inner_microstep: 1694.47 | bwd_allreduce_microstep: 66.69 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13332
total_samples=6138, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:24,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 02:58:24,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.21 | bwd_microstep: 2012.01 | bwd_inner_microstep: 1813.13 | bwd_allreduce_microstep: 198.81 | step_microstep: 134.15
[2025-08-03 02:58:24,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.05 | bwd: 7497.68 | bwd_inner: 7047.38 | bwd_allreduce: 450.06 | step: 134.60
 20%|█▉        | 396/2000 [1:14:46<4:49:40, 10.84s/it]
 20%|█▉        | 397/2000 [1:14:56<4:47:23, 10.76s/it]
 20%|█▉        | 398/2000 [1:15:07<4:45:37, 10.70s/it]
 20%|█▉        | 399/2000 [1:15:18<4:49:09, 10.84s/it]
 20%|██        | 400/2000 [1:15:28<4:46:18, 10.74s/it]
 20%|██        | 401/2000 [1:15:39<4:46:16, 10.74s/it]
{'loss': 0.7966, 'learning_rate': 1.8513684664079033e-05, 'epoch': 0.2}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13763
total_samples=6143, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:27,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.50 | bwd_microstep: 1834.56 | bwd_inner_microstep: 1700.60 | bwd_allreduce_microstep: 133.90 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14503
total_samples=6148, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:29,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.73 | bwd_microstep: 1807.45 | bwd_inner_microstep: 1737.16 | bwd_allreduce_microstep: 70.23 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13658
total_samples=6152, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:32,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.32 | bwd_microstep: 1857.93 | bwd_inner_microstep: 1712.32 | bwd_allreduce_microstep: 145.54 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11793
total_samples=6155, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:35,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 02:58:35,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.60 | bwd_microstep: 1897.48 | bwd_inner_microstep: 1841.89 | bwd_allreduce_microstep: 55.53 | step_microstep: 119.72
[2025-08-03 02:58:35,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.08 | bwd: 7397.47 | bwd_inner: 6991.97 | bwd_allreduce: 405.28 | step: 120.03
{'loss': 0.8017, 'learning_rate': 1.8505178765452853e-05, 'epoch': 0.2}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14382
total_samples=6159, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:37,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.99 | bwd_microstep: 1783.51 | bwd_inner_microstep: 1665.15 | bwd_allreduce_microstep: 118.29 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12855
total_samples=6163, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:40,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.07 | bwd_microstep: 2088.11 | bwd_inner_microstep: 1666.10 | bwd_allreduce_microstep: 421.93 | step_microstep: 0.09
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12863
total_samples=6167, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:43,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 977.44 | bwd_microstep: 1766.37 | bwd_inner_microstep: 1648.73 | bwd_allreduce_microstep: 117.58 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11937
total_samples=6170, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:46,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 02:58:46,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.52 | bwd_microstep: 1855.83 | bwd_inner_microstep: 1597.03 | bwd_allreduce_microstep: 258.73 | step_microstep: 122.07
[2025-08-03 02:58:46,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3131.94 | bwd: 7493.86 | bwd_inner: 6577.01 | bwd_allreduce: 916.62 | step: 122.41
{'loss': 0.7761, 'learning_rate': 1.8496650562986888e-05, 'epoch': 0.2}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13582
total_samples=6174, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:49,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.66 | bwd_microstep: 2036.88 | bwd_inner_microstep: 1710.89 | bwd_allreduce_microstep: 325.93 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11832
total_samples=6177, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:52,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.04 | bwd_microstep: 2059.70 | bwd_inner_microstep: 1818.73 | bwd_allreduce_microstep: 240.91 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11844
total_samples=6180, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:54,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.52 | bwd_microstep: 2015.00 | bwd_inner_microstep: 1799.52 | bwd_allreduce_microstep: 215.42 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809
total_samples=6183, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:58:57,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 02:58:57,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.15 | bwd_microstep: 2029.25 | bwd_inner_microstep: 1812.06 | bwd_allreduce_microstep: 217.12 | step_microstep: 136.13
[2025-08-03 02:58:57,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.30 | bwd: 8140.87 | bwd_inner: 7141.19 | bwd_allreduce: 999.45 | step: 136.47
{'loss': 0.7946, 'learning_rate': 1.8488100079045345e-05, 'epoch': 0.2}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13164
total_samples=6187, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:00,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.75 | bwd_microstep: 2143.65 | bwd_inner_microstep: 1877.15 | bwd_allreduce_microstep: 266.43 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11966
total_samples=6190, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:03,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.87 | bwd_microstep: 1751.01 | bwd_inner_microstep: 1552.23 | bwd_allreduce_microstep: 198.72 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14147
total_samples=6194, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:05,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.39 | bwd_microstep: 1840.46 | bwd_inner_microstep: 1754.35 | bwd_allreduce_microstep: 86.04 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13274
total_samples=6198, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:08,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 02:59:08,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.11 | bwd_microstep: 2103.72 | bwd_inner_microstep: 1889.56 | bwd_allreduce_microstep: 214.11 | step_microstep: 135.09
[2025-08-03 02:59:08,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2757.05 | bwd: 7838.89 | bwd_inner: 7073.28 | bwd_allreduce: 765.37 | step: 135.47
{'loss': 0.7858, 'learning_rate': 1.847952733605088e-05, 'epoch': 0.2}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13143
total_samples=6202, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:11,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.58 | bwd_microstep: 1888.17 | bwd_inner_microstep: 1677.61 | bwd_allreduce_microstep: 210.50 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13366
total_samples=6206, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:14,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.07 | bwd_microstep: 1999.05 | bwd_inner_microstep: 1714.36 | bwd_allreduce_microstep: 284.63 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13167
total_samples=6210, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:17,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.62 | bwd_microstep: 2200.72 | bwd_inner_microstep: 2074.92 | bwd_allreduce_microstep: 125.73 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13151
total_samples=6214, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:20,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 02:59:20,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.60 | bwd_microstep: 2017.42 | bwd_inner_microstep: 1820.70 | bwd_allreduce_microstep: 196.65 | step_microstep: 134.38
[2025-08-03 02:59:20,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.80 | bwd: 8105.40 | bwd_inner: 7287.57 | bwd_allreduce: 817.59 | step: 134.81
{'loss': 0.7934, 'learning_rate': 1.847093235648451e-05, 'epoch': 0.2}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12103
total_samples=6217, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:22,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.50 | bwd_microstep: 1805.63 | bwd_inner_microstep: 1570.97 | bwd_allreduce_microstep: 234.60 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13685
total_samples=6221, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:25,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.31 | bwd_microstep: 1799.60 | bwd_inner_microstep: 1718.77 | bwd_allreduce_microstep: 80.77 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13963
total_samples=6225, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:27,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.64 | bwd_microstep: 1766.03 | bwd_inner_microstep: 1715.69 | bwd_allreduce_microstep: 50.27 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13656
total_samples=6229, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:30,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 02:59:30,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.37 | bwd_microstep: 1927.20 | bwd_inner_microstep: 1826.35 | bwd_allreduce_microstep: 100.79 | step_microstep: 122.43
[2025-08-03 02:59:30,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.75 | bwd: 7298.51 | bwd_inner: 6831.78 | bwd_allreduce: 466.49 | step: 122.89
20%|██        | 401/2000 [1:15:39<4:46:16, 10.74s/it]
20%|██        | 402/2000 [1:15:50<4:45:21, 10.71s/it]
20%|██        | 403/2000 [1:16:01<4:47:58, 10.82s/it]
20%|██        | 404/2000 [1:16:12<4:52:41, 11.00s/it]
20%|██        | 405/2000 [1:16:23<4:53:02, 11.02s/it]
20%|██        | 406/2000 [1:16:35<4:55:25, 11.12s/it]
20%|██        | 407/2000 [1:16:45<4:50:33, 10.94s/it]
{'loss': 0.7868, 'learning_rate': 1.8462315162885563e-05, 'epoch': 0.2}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12904
total_samples=6232, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:33,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.29 | bwd_microstep: 1846.51 | bwd_inner_microstep: 1622.99 | bwd_allreduce_microstep: 223.46 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13313
total_samples=6236, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:35,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.37 | bwd_microstep: 1736.04 | bwd_inner_microstep: 1677.54 | bwd_allreduce_microstep: 58.44 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13470
total_samples=6240, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:38,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.43 | bwd_microstep: 1813.61 | bwd_inner_microstep: 1703.62 | bwd_allreduce_microstep: 109.92 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13261
total_samples=6244, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:41,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.82
[2025-08-03 02:59:41,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.84 | bwd_microstep: 1981.95 | bwd_inner_microstep: 1876.69 | bwd_allreduce_microstep: 105.19 | step_microstep: 113.22
[2025-08-03 02:59:41,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.84 | bwd: 7378.16 | bwd_inner: 6880.83 | bwd_allreduce: 497.09 | step: 113.64
{'loss': 0.7888, 'learning_rate': 1.8453675777851627e-05, 'epoch': 0.2}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11963
total_samples=6247, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:44,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.37 | bwd_microstep: 1847.73 | bwd_inner_microstep: 1538.31 | bwd_allreduce_microstep: 309.36 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13960
total_samples=6251, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:46,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.77 | bwd_microstep: 2142.18 | bwd_inner_microstep: 1965.56 | bwd_allreduce_microstep: 176.56 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11691
total_samples=6254, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:49,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.48 | bwd_microstep: 1978.43 | bwd_inner_microstep: 1972.53 | bwd_allreduce_microstep: 5.84 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13144
total_samples=6258, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:52,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 02:59:52,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.75 | bwd_microstep: 2087.57 | bwd_inner_microstep: 1766.25 | bwd_allreduce_microstep: 321.26 | step_microstep: 110.89
[2025-08-03 02:59:52,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.29 | bwd: 8055.95 | bwd_inner: 7242.64 | bwd_allreduce: 813.09 | step: 111.33
{'loss': 0.7952, 'learning_rate': 1.8445014224038485e-05, 'epoch': 0.2}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12914
total_samples=6262, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:55,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.24 | bwd_microstep: 1975.64 | bwd_inner_microstep: 1677.21 | bwd_allreduce_microstep: 298.37 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12289
total_samples=6265, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 02:59:57,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.66 | bwd_microstep: 1739.38 | bwd_inner_microstep: 1567.42 | bwd_allreduce_microstep: 171.90 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11975
total_samples=6268, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:00,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.52 | bwd_microstep: 1763.43 | bwd_inner_microstep: 1552.48 | bwd_allreduce_microstep: 210.88 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13774
total_samples=6273, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:03,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 03:00:03,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.51 | bwd_microstep: 1759.41 | bwd_inner_microstep: 1696.25 | bwd_allreduce_microstep: 63.10 | step_microstep: 149.67
[2025-08-03 03:00:03,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2749.86 | bwd: 7237.91 | bwd_inner: 6493.35 | bwd_allreduce: 744.31 | step: 150.03
{'loss': 0.7846, 'learning_rate': 1.8436330524160048e-05, 'epoch': 0.2}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13977
total_samples=6277, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:05,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.39 | bwd_microstep: 1866.12 | bwd_inner_microstep: 1714.82 | bwd_allreduce_microstep: 151.24 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13591
total_samples=6281, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:08,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.60 | bwd_microstep: 1755.80 | bwd_inner_microstep: 1671.28 | bwd_allreduce_microstep: 84.45 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13532
total_samples=6285, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:11,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.10 | bwd_microstep: 2052.24 | bwd_inner_microstep: 1837.90 | bwd_allreduce_microstep: 214.28 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15477
total_samples=6289, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:14,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 03:00:14,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.80 | bwd_microstep: 1970.60 | bwd_inner_microstep: 1880.60 | bwd_allreduce_microstep: 89.94 | step_microstep: 135.01
[2025-08-03 03:00:14,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.81 | bwd: 7644.81 | bwd_inner: 7104.60 | bwd_allreduce: 539.99 | step: 135.34
{'loss': 0.8015, 'learning_rate': 1.8427624700988308e-05, 'epoch': 0.21}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13034
total_samples=6294, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:17,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.99 | bwd_microstep: 2156.30 | bwd_inner_microstep: 1850.48 | bwd_allreduce_microstep: 305.75 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14247
total_samples=6298, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:19,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.80 | bwd_microstep: 1741.91 | bwd_inner_microstep: 1689.89 | bwd_allreduce_microstep: 51.96 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11637
total_samples=6301, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:22,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.13 | bwd_microstep: 2074.98 | bwd_inner_microstep: 1858.69 | bwd_allreduce_microstep: 216.22 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13502
total_samples=6305, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:25,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 03:00:25,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.61 | bwd_microstep: 2056.04 | bwd_inner_microstep: 1920.69 | bwd_allreduce_microstep: 135.28 | step_microstep: 114.24
[2025-08-03 03:00:25,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.46 | bwd: 8029.28 | bwd_inner: 7319.74 | bwd_allreduce: 709.29 | step: 114.70
{'loss': 0.788, 'learning_rate': 1.8418896777353272e-05, 'epoch': 0.21}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12241
total_samples=6308, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:28,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.72 | bwd_microstep: 1812.31 | bwd_inner_microstep: 1582.32 | bwd_allreduce_microstep: 229.92 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13718
total_samples=6312, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:30,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.81 | bwd_microstep: 2102.19 | bwd_inner_microstep: 2071.01 | bwd_allreduce_microstep: 31.13 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13157
total_samples=6316, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:34,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.16 | bwd_microstep: 2383.17 | bwd_inner_microstep: 2199.90 | bwd_allreduce_microstep: 183.21 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13703
total_samples=6320, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:36,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 03:00:36,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.30 | bwd_microstep: 1812.52 | bwd_inner_microstep: 1727.33 | bwd_allreduce_microstep: 85.12 | step_microstep: 128.06
[2025-08-03 03:00:36,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.93 | bwd: 8110.23 | bwd_inner: 7580.55 | bwd_allreduce: 529.46 | step: 128.38
20%|██        | 408/2000 [1:16:56<4:47:53, 10.85s/it]
20%|██        | 409/2000 [1:17:07<4:51:15, 10.98s/it]
20%|██        | 410/2000 [1:17:18<4:46:55, 10.83s/it]
21%|██        | 411/2000 [1:17:29<4:47:47, 10.87s/it]
21%|██        | 412/2000 [1:17:40<4:50:42, 10.98s/it]
{'loss': 0.8009, 'learning_rate': 1.84101467761429e-05, 'epoch': 0.21}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11944
total_samples=6323, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:39,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.39 | bwd_microstep: 1773.60 | bwd_inner_microstep: 1546.59 | bwd_allreduce_microstep: 226.94 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14333
total_samples=6327, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:41,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.38 | bwd_microstep: 1742.31 | bwd_inner_microstep: 1671.71 | bwd_allreduce_microstep: 70.55 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11904
total_samples=6330, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:44,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.68 | bwd_microstep: 1840.00 | bwd_inner_microstep: 1595.74 | bwd_allreduce_microstep: 244.20 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13476
total_samples=6335, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:47,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 03:00:47,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.28 | bwd_microstep: 1721.28 | bwd_inner_microstep: 1647.69 | bwd_allreduce_microstep: 73.52 | step_microstep: 111.43
[2025-08-03 03:00:47,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.67 | bwd: 7077.24 | bwd_inner: 6461.72 | bwd_allreduce: 615.27 | step: 111.73
{'loss': 0.7853, 'learning_rate': 1.8401374720303054e-05, 'epoch': 0.21}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12584
total_samples=6339, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:49,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 748.64 | bwd_microstep: 1938.53 | bwd_inner_microstep: 1785.62 | bwd_allreduce_microstep: 152.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13122
total_samples=6343, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:52,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.70 | bwd_microstep: 1723.72 | bwd_inner_microstep: 1661.39 | bwd_allreduce_microstep: 62.27 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11696
total_samples=6346, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:54,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.43 | bwd_microstep: 1746.52 | bwd_inner_microstep: 1537.67 | bwd_allreduce_microstep: 208.79 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14061
total_samples=6350, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:00:57,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 03:00:57,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.05 | bwd_microstep: 1751.92 | bwd_inner_microstep: 1721.41 | bwd_allreduce_microstep: 30.45 | step_microstep: 145.89
[2025-08-03 03:00:57,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.75 | bwd: 7160.74 | bwd_inner: 6706.08 | bwd_allreduce: 454.43 | step: 146.19
{'loss': 0.7862, 'learning_rate': 1.8392580632837423e-05, 'epoch': 0.21}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13133
total_samples=6354, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:00,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.87 | bwd_microstep: 1985.10 | bwd_inner_microstep: 1779.43 | bwd_allreduce_microstep: 205.60 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12301
total_samples=6357, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:02,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.80 | bwd_microstep: 1712.40 | bwd_inner_microstep: 1553.50 | bwd_allreduce_microstep: 158.83 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938
total_samples=6360, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:05,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.18 | bwd_microstep: 1945.31 | bwd_inner_microstep: 1771.65 | bwd_allreduce_microstep: 173.60 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13814
total_samples=6364, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:08,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 03:01:08,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.07 | bwd_microstep: 2063.63 | bwd_inner_microstep: 1900.00 | bwd_allreduce_microstep: 163.57 | step_microstep: 110.53
[2025-08-03 03:01:08,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.86 | bwd: 7706.48 | bwd_inner: 7004.58 | bwd_allreduce: 701.67 | step: 110.98
{'loss': 0.7692, 'learning_rate': 1.8383764536807486e-05, 'epoch': 0.21}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407
total_samples=6368, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:11,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.13 | bwd_microstep: 2072.80 | bwd_inner_microstep: 1759.51 | bwd_allreduce_microstep: 313.22 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12717
total_samples=6371, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:14,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.62 | bwd_microstep: 1964.82 | bwd_inner_microstep: 1752.35 | bwd_allreduce_microstep: 212.40 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12269
total_samples=6374, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:16,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.77 | bwd_microstep: 1833.40 | bwd_inner_microstep: 1595.21 | bwd_allreduce_microstep: 238.13 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13176
total_samples=6378, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:19,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 03:01:19,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.21 | bwd_microstep: 2056.70 | bwd_inner_microstep: 1951.02 | bwd_allreduce_microstep: 105.62 | step_microstep: 133.12
[2025-08-03 03:01:19,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2762.66 | bwd: 7927.76 | bwd_inner: 7058.09 | bwd_allreduce: 869.44 | step: 133.46
{'loss': 0.7833, 'learning_rate': 1.837492645533241e-05, 'epoch': 0.21}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13291
total_samples=6382, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:22,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.79 | bwd_microstep: 1761.88 | bwd_inner_microstep: 1606.86 | bwd_allreduce_microstep: 154.96 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11905
total_samples=6385, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:25,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.18 | bwd_microstep: 2062.74 | bwd_inner_microstep: 1773.63 | bwd_allreduce_microstep: 289.05 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13932
total_samples=6389, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:27,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.81 | bwd_microstep: 2034.89 | bwd_inner_microstep: 1827.01 | bwd_allreduce_microstep: 207.82 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11638
total_samples=6392, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:30,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 03:01:30,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.77 | bwd_microstep: 1767.26 | bwd_inner_microstep: 1542.33 | bwd_allreduce_microstep: 224.87 | step_microstep: 153.38
[2025-08-03 03:01:30,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2743.46 | bwd: 7626.82 | bwd_inner: 6749.82 | bwd_allreduce: 876.77 | step: 153.80
{'loss': 0.7961, 'learning_rate': 1.836606641158905e-05, 'epoch': 0.21}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11631
total_samples=6395, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:33,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.46 | bwd_microstep: 1766.92 | bwd_inner_microstep: 1533.71 | bwd_allreduce_microstep: 233.14 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13241
total_samples=6399, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:35,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.91 | bwd_microstep: 1797.71 | bwd_inner_microstep: 1698.15 | bwd_allreduce_microstep: 99.50 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12792
total_samples=6403, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:38,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.53 | bwd_microstep: 1838.84 | bwd_inner_microstep: 1637.93 | bwd_allreduce_microstep: 200.85 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11887
total_samples=6406, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:42,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 03:01:42,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1930.65 | bwd_microstep: 2284.55 | bwd_inner_microstep: 1914.10 | bwd_allreduce_microstep: 370.37 | step_microstep: 107.33
[2025-08-03 03:01:42,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4014.48 | bwd: 7688.07 | bwd_inner: 6783.89 | bwd_allreduce: 903.94 | step: 107.65
21%|██        | 413/2000 [1:17:51<4:53:42, 11.10s/it]
21%|██        | 414/2000 [1:18:01<4:47:06, 10.86s/it]
21%|██        | 415/2000 [1:18:12<4:43:47, 10.74s/it]
21%|██        | 416/2000 [1:18:23<4:45:21, 10.81s/it]
21%|██        | 417/2000 [1:18:34<4:47:59, 10.92s/it]
21%|██        | 418/2000 [1:18:45<4:47:22, 10.90s/it]
21%|██        | 419/2000 [1:18
{'loss': 0.7954, 'learning_rate': 1.835718442881183e-05, 'epoch': 0.21}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11826
total_samples=6409, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:45,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.05 | bwd_microstep: 1860.14 | bwd_inner_microstep: 1701.00 | bwd_allreduce_microstep: 159.07 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13695
total_samples=6413, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:48,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.90 | bwd_microstep: 2212.62 | bwd_inner_microstep: 1937.76 | bwd_allreduce_microstep: 274.79 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13451
total_samples=6417, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:50,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.07 | bwd_microstep: 1763.72 | bwd_inner_microstep: 1687.89 | bwd_allreduce_microstep: 75.77 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11883
total_samples=6420, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:53,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 03:01:53,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.64 | bwd_microstep: 1791.77 | bwd_inner_microstep: 1555.41 | bwd_allreduce_microstep: 236.30 | step_microstep: 125.98
[2025-08-03 03:01:53,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.60 | bwd: 7628.30 | bwd_inner: 6882.05 | bwd_allreduce: 746.01 | step: 126.42
{'loss': 0.8034, 'learning_rate': 1.8348280530292712e-05, 'epoch': 0.21}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13356
total_samples=6424, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:56,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.42 | bwd_microstep: 1796.50 | bwd_inner_microstep: 1690.58 | bwd_allreduce_microstep: 105.85 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13284
total_samples=6428, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:01:58,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.09 | bwd_microstep: 1802.69 | bwd_inner_microstep: 1700.61 | bwd_allreduce_microstep: 102.02 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12108
total_samples=6431, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:01,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.18 | bwd_microstep: 1784.12 | bwd_inner_microstep: 1569.05 | bwd_allreduce_microstep: 215.00 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12033
total_samples=6434, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:03,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 03:02:03,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1780.88 | bwd_inner_microstep: 1566.95 | bwd_allreduce_microstep: 213.87 | step_microstep: 120.73
[2025-08-03 03:02:03,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.22 | bwd: 7164.24 | bwd_inner: 6527.19 | bwd_allreduce: 636.82 | step: 121.14
{'loss': 0.7906, 'learning_rate': 1.8339354739381138e-05, 'epoch': 0.21}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13796
total_samples=6438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:06,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 666.81 | bwd_microstep: 2023.73 | bwd_inner_microstep: 1900.95 | bwd_allreduce_microstep: 122.71 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14215
total_samples=6442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:09,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.79 | bwd_microstep: 1711.16 | bwd_inner_microstep: 1693.40 | bwd_allreduce_microstep: 17.69 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11822
total_samples=6445, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:11,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.14 | bwd_microstep: 1776.17 | bwd_inner_microstep: 1581.72 | bwd_allreduce_microstep: 194.38 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12249
total_samples=6448, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:14,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:02:14,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.72 | bwd_microstep: 1885.72 | bwd_inner_microstep: 1843.42 | bwd_allreduce_microstep: 42.23 | step_microstep: 127.50
[2025-08-03 03:02:14,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2740.39 | bwd: 7396.83 | bwd_inner: 7019.49 | bwd_allreduce: 377.10 | step: 127.85
{'loss': 0.7875, 'learning_rate': 1.833040707948395e-05, 'epoch': 0.21}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11861
total_samples=6451, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:17,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.10 | bwd_microstep: 1786.47 | bwd_inner_microstep: 1543.51 | bwd_allreduce_microstep: 242.89 | step_microstep: 0.27
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14092
total_samples=6455, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:19,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.85 | bwd_microstep: 1751.37 | bwd_inner_microstep: 1650.57 | bwd_allreduce_microstep: 100.73 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11837
total_samples=6458, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:22,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.67 | bwd_microstep: 1759.43 | bwd_inner_microstep: 1546.47 | bwd_allreduce_microstep: 212.90 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11668
total_samples=6461, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:25,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 03:02:25,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.55 | bwd_microstep: 2070.78 | bwd_inner_microstep: 1839.81 | bwd_allreduce_microstep: 230.90 | step_microstep: 109.27
[2025-08-03 03:02:25,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.09 | bwd: 7368.10 | bwd_inner: 6580.35 | bwd_allreduce: 787.50 | step: 109.76
{'loss': 0.7872, 'learning_rate': 1.8321437574065347e-05, 'epoch': 0.21}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12881
total_samples=6465, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:27,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.33 | bwd_microstep: 1988.24 | bwd_inner_microstep: 1680.07 | bwd_allreduce_microstep: 308.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14072
total_samples=6469, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:30,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.83 | bwd_microstep: 1788.59 | bwd_inner_microstep: 1724.17 | bwd_allreduce_microstep: 64.36 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13736
total_samples=6473, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:33,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.90 | bwd_microstep: 1810.79 | bwd_inner_microstep: 1716.50 | bwd_allreduce_microstep: 94.23 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11694
total_samples=6476, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:36,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 03:02:36,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.43 | bwd_microstep: 2074.51 | bwd_inner_microstep: 1791.66 | bwd_allreduce_microstep: 282.79 | step_microstep: 116.92
[2025-08-03 03:02:36,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.41 | bwd: 7662.18 | bwd_inner: 6912.39 | bwd_allreduce: 749.56 | step: 117.37
{'loss': 0.8016, 'learning_rate': 1.831244624664681e-05, 'epoch': 0.21}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12401
total_samples=6480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:38,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.07 | bwd_microstep: 2053.64 | bwd_inner_microstep: 1891.68 | bwd_allreduce_microstep: 161.90 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12863
total_samples=6484, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:41,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.39 | bwd_microstep: 2103.63 | bwd_inner_microstep: 1935.99 | bwd_allreduce_microstep: 167.56 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13015
total_samples=6488, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:44,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.82 | bwd_microstep: 2298.83 | bwd_inner_microstep: 2155.52 | bwd_allreduce_microstep: 143.25 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12183
total_samples=6491, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:47,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 03:02:47,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.44 | bwd_microstep: 1787.08 | bwd_inner_microstep: 1583.49 | bwd_allreduce_microstep: 203.52 | step_microstep: 125.22
[2025-08-03 03:02:47,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2762.65 | bwd: 8243.23 | bwd_inner: 7566.68 | bwd_allreduce: 676.31 | step: 125.67
21%|██        | 419/2000 [1:18:57<4:57:03, 11.27s/it]
21%|██        | 420/2000 [1:19:08<4:53:39, 11.15s/it]
21%|██        | 421/2000 [1:19:18<4:47:33, 10.93s/it]
21%|██        | 422/2000 [1:19:29<4:44:51, 10.83s/it]
21%|██        | 423/2000 [1:19:40<4:42:46, 10.76s/it]
21%|██        | 424/2000 [1:19:50<4:43:56, 10.81s/it]
{'loss': 0.7944, 'learning_rate': 1.8303433120807043e-05, 'epoch': 0.21}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12017
total_samples=6494, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:50,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.44 | bwd_microstep: 1815.29 | bwd_inner_microstep: 1592.73 | bwd_allreduce_microstep: 222.50 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15174
total_samples=6498, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:52,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.19 | bwd_microstep: 1763.58 | bwd_inner_microstep: 1723.79 | bwd_allreduce_microstep: 39.72 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11740
total_samples=6501, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:55,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.94 | bwd_microstep: 2058.34 | bwd_inner_microstep: 1702.62 | bwd_allreduce_microstep: 355.65 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12072
total_samples=6504, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:02:58,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 03:02:58,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.52 | bwd_microstep: 1895.41 | bwd_inner_microstep: 1711.71 | bwd_allreduce_microstep: 183.64 | step_microstep: 134.29
[2025-08-03 03:02:58,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.02 | bwd: 7532.67 | bwd_inner: 6730.84 | bwd_allreduce: 801.59 | step: 134.73
{'loss': 0.7917, 'learning_rate': 1.829439822018192e-05, 'epoch': 0.21}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13296
total_samples=6508, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:00,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.36 | bwd_microstep: 1850.76 | bwd_inner_microstep: 1683.00 | bwd_allreduce_microstep: 167.68 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809
total_samples=6511, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:03,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.52 | bwd_microstep: 1912.85 | bwd_inner_microstep: 1753.56 | bwd_allreduce_microstep: 159.23 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11760
total_samples=6514, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:06,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.12 | bwd_microstep: 1728.45 | bwd_inner_microstep: 1538.76 | bwd_allreduce_microstep: 189.62 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13298
total_samples=6518, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:08,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 03:03:08,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.25 | bwd_microstep: 1772.55 | bwd_inner_microstep: 1685.66 | bwd_allreduce_microstep: 86.83 | step_microstep: 111.15
[2025-08-03 03:03:08,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2763.18 | bwd: 7264.66 | bwd_inner: 6660.98 | bwd_allreduce: 603.44 | step: 111.66
{'loss': 0.788, 'learning_rate': 1.8285341568464416e-05, 'epoch': 0.21}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12692
total_samples=6522, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:11,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.63 | bwd_microstep: 1792.25 | bwd_inner_microstep: 1598.34 | bwd_allreduce_microstep: 193.84 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11685
total_samples=6525, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:14,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.41 | bwd_microstep: 1908.84 | bwd_inner_microstep: 1646.45 | bwd_allreduce_microstep: 262.32 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13824
total_samples=6529, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:16,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.95 | bwd_microstep: 1828.10 | bwd_inner_microstep: 1720.69 | bwd_allreduce_microstep: 107.35 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14392
total_samples=6533, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:19,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 03:03:19,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.02 | bwd_microstep: 1800.59 | bwd_inner_microstep: 1746.05 | bwd_allreduce_microstep: 54.48 | step_microstep: 128.23
[2025-08-03 03:03:19,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.93 | bwd: 7329.83 | bwd_inner: 6711.52 | bwd_allreduce: 618.07 | step: 128.74
{'loss': 0.7794, 'learning_rate': 1.827626318940454e-05, 'epoch': 0.21}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13604
total_samples=6537, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:22,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.54 | bwd_microstep: 1955.90 | bwd_inner_microstep: 1834.52 | bwd_allreduce_microstep: 121.31 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11804
total_samples=6540, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:24,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.88 | bwd_microstep: 1821.20 | bwd_inner_microstep: 1582.81 | bwd_allreduce_microstep: 238.33 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11943
total_samples=6543, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:27,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.26 | bwd_microstep: 1861.52 | bwd_inner_microstep: 1602.86 | bwd_allreduce_microstep: 258.59 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13492
total_samples=6547, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:30,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 03:03:30,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.17 | bwd_microstep: 1838.09 | bwd_inner_microstep: 1723.73 | bwd_allreduce_microstep: 114.29 | step_microstep: 131.59
[2025-08-03 03:03:30,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2824.77 | bwd: 7476.75 | bwd_inner: 6743.92 | bwd_allreduce: 732.60 | step: 132.16
{'loss': 0.7841, 'learning_rate': 1.8267163106809288e-05, 'epoch': 0.21}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11674
total_samples=6550, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:32,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.94 | bwd_microstep: 1773.19 | bwd_inner_microstep: 1529.19 | bwd_allreduce_microstep: 243.92 | step_microstep: 0.16
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13087
total_samples=6554, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:35,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.34 | bwd_microstep: 2054.96 | bwd_inner_microstep: 1896.12 | bwd_allreduce_microstep: 158.78 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12056
total_samples=6557, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:38,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.60 | bwd_microstep: 2044.85 | bwd_inner_microstep: 1826.38 | bwd_allreduce_microstep: 218.40 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13840
total_samples=6561, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:41,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 03:03:41,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.83 | bwd_microstep: 1994.46 | bwd_inner_microstep: 1753.95 | bwd_allreduce_microstep: 240.44 | step_microstep: 111.76
[2025-08-03 03:03:41,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.64 | bwd: 7867.50 | bwd_inner: 7005.64 | bwd_allreduce: 861.63 | step: 112.24
{'loss': 0.7839, 'learning_rate': 1.8258041344542567e-05, 'epoch': 0.21}
21%|██▏       | 425/2000 [1:20:02<4:49:00, 11.01s/it]
21%|██▏       | 426/2000 [1:20:13<4:47:17, 10.95s/it]
21%|██▏       | 427/2000 [1:20:23<4:43:23, 10.81s/it]
21%|██▏       | 428/2000 [1:20:34<4:41:59, 10.76s/it]
21%|██▏       | 429/2000 [1:20:45<4:41:52, 10.77s/it]
22%|██▏       | 430/2000 [1:20:56<4:44:27, 10.87s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13844
total_samples=6565, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:44,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.83 | bwd_microstep: 1847.93 | bwd_inner_microstep: 1744.79 | bwd_allreduce_microstep: 103.07 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12163
total_samples=6568, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:46,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.78 | bwd_microstep: 1773.71 | bwd_inner_microstep: 1576.54 | bwd_allreduce_microstep: 197.10 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11713
total_samples=6571, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:49,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.29 | bwd_microstep: 1782.39 | bwd_inner_microstep: 1536.06 | bwd_allreduce_microstep: 246.27 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13762
total_samples=6575, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:52,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 03:03:52,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 667.58 | bwd_microstep: 2244.73 | bwd_inner_microstep: 1920.57 | bwd_allreduce_microstep: 324.10 | step_microstep: 110.64
[2025-08-03 03:03:52,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2752.41 | bwd: 7648.80 | bwd_inner: 6777.95 | bwd_allreduce: 870.62 | step: 111.11
{'loss': 0.7728, 'learning_rate': 1.824889792652513e-05, 'epoch': 0.22}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14122
total_samples=6579, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:54,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.32 | bwd_microstep: 1890.92 | bwd_inner_microstep: 1858.78 | bwd_allreduce_microstep: 32.07 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12499
total_samples=6583, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:03:57,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.42 | bwd_microstep: 2012.33 | bwd_inner_microstep: 1852.32 | bwd_allreduce_microstep: 159.94 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13543
total_samples=6587, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:00,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.97 | bwd_microstep: 2019.00 | bwd_inner_microstep: 1911.10 | bwd_allreduce_microstep: 107.84 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14411
total_samples=6591, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:03,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54
[2025-08-03 03:04:03,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.73 | bwd_microstep: 1803.42 | bwd_inner_microstep: 1758.43 | bwd_allreduce_microstep: 44.92 | step_microstep: 139.79
[2025-08-03 03:04:03,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.37 | bwd: 7725.71 | bwd_inner: 7380.64 | bwd_allreduce: 344.84 | step: 140.13
{'loss': 0.7918, 'learning_rate': 1.8239732876734525e-05, 'epoch': 0.22}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13348
total_samples=6595, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:05,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.76 | bwd_microstep: 1731.22 | bwd_inner_microstep: 1624.59 | bwd_allreduce_microstep: 106.56 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13683
total_samples=6600, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:08,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.69 | bwd_microstep: 1862.93 | bwd_inner_microstep: 1705.95 | bwd_allreduce_microstep: 156.91 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14204
total_samples=6604, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:11,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.90 | bwd_microstep: 2062.30 | bwd_inner_microstep: 1924.11 | bwd_allreduce_microstep: 138.13 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13502
total_samples=6610, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:14,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 03:04:14,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.62 | bwd_microstep: 2129.43 | bwd_inner_microstep: 1791.61 | bwd_allreduce_microstep: 337.75 | step_microstep: 106.24
[2025-08-03 03:04:14,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.90 | bwd: 7785.93 | bwd_inner: 7046.26 | bwd_allreduce: 739.43 | step: 106.61
{'loss': 0.8026, 'learning_rate': 1.8230546219205032e-05, 'epoch': 0.22}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11644
total_samples=6613, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:16,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.65 | bwd_microstep: 1831.98 | bwd_inner_microstep: 1703.93 | bwd_allreduce_microstep: 127.98 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13230
total_samples=6617, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:19,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.55 | bwd_microstep: 2057.76 | bwd_inner_microstep: 1987.63 | bwd_allreduce_microstep: 70.07 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13455
total_samples=6621, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:22,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.18 | bwd_microstep: 1775.78 | bwd_inner_microstep: 1700.81 | bwd_allreduce_microstep: 74.89 | step_microstep: 0.16
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13588
total_samples=6625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:24,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.36
[2025-08-03 03:04:24,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.62 | bwd_microstep: 1785.01 | bwd_inner_microstep: 1715.29 | bwd_allreduce_microstep: 69.65 | step_microstep: 136.56
[2025-08-03 03:04:24,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2753.93 | bwd: 7450.57 | bwd_inner: 7107.66 | bwd_allreduce: 342.68 | step: 136.95
{'loss': 0.7917, 'learning_rate': 1.822133797802758e-05, 'epoch': 0.22}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13611
total_samples=6630, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:27,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.12 | bwd_microstep: 1773.37 | bwd_inner_microstep: 1691.69 | bwd_allreduce_microstep: 81.61 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11989
total_samples=6633, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:30,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.06 | bwd_microstep: 1743.08 | bwd_inner_microstep: 1543.17 | bwd_allreduce_microstep: 199.84 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13692
total_samples=6637, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:32,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.67 | bwd_microstep: 1796.42 | bwd_inner_microstep: 1719.58 | bwd_allreduce_microstep: 76.77 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15967
total_samples=6641, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:35,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 03:04:35,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.61 | bwd_microstep: 1795.71 | bwd_inner_microstep: 1783.27 | bwd_allreduce_microstep: 12.38 | step_microstep: 142.85
[2025-08-03 03:04:35,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.40 | bwd: 7108.65 | bwd_inner: 6737.71 | bwd_allreduce: 370.69 | step: 143.33
{'loss': 0.7929, 'learning_rate': 1.8212108177349722e-05, 'epoch': 0.22}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15861
total_samples=6645, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:37,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.07 | bwd_microstep: 1822.64 | bwd_inner_microstep: 1783.45 | bwd_allreduce_microstep: 39.12 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12070
total_samples=6648, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:40,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.23 | bwd_microstep: 1807.00 | bwd_inner_microstep: 1785.89 | bwd_allreduce_microstep: 21.05 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13511
total_samples=6653, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:43,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.74 | bwd_microstep: 2088.85 | bwd_inner_microstep: 1898.08 | bwd_allreduce_microstep: 190.70 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13747
total_samples=6658, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:46,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 03:04:46,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.50 | bwd_microstep: 2171.45 | bwd_inner_microstep: 2022.83 | bwd_allreduce_microstep: 148.54 | step_microstep: 158.39
[2025-08-03 03:04:46,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.49 | bwd: 7889.98 | bwd_inner: 7490.24 | bwd_allreduce: 399.51 | step: 158.74
[1:20:56<4:44:27, 10.87s/it]
 22%|██▏       | 431/2000 [1:21:07<4:44:07, 10.86s/it]
 22%|██▏       | 432/2000 [1:21:18<4:44:58, 10.90s/it]
 22%|██▏       | 433/2000 [1:21:29<4:45:37, 10.94s/it]
 22%|██▏       | 434/2000 [1:21:39<4:43:26, 10.86s/it]
 22%|██▏       | 435/2000 [1:21:50<4:39:32, 10.72s/it]
 22%|██▏       | 436/2000 [1:22:01<4:42:48, 10.85s/it]
{'loss': 0.7855, 'learning_rate': 1.8202856841375517e-05, 'epoch': 0.22}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13367
total_samples=6663, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:49,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.25 | bwd_microstep: 1786.06 | bwd_inner_microstep: 1687.48 | bwd_allreduce_microstep: 98.51 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13380
total_samples=6667, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:51,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.29 | bwd_microstep: 1767.98 | bwd_inner_microstep: 1696.81 | bwd_allreduce_microstep: 71.11 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13519
total_samples=6671, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:54,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.28 | bwd_microstep: 1872.78 | bwd_inner_microstep: 1731.28 | bwd_allreduce_microstep: 141.45 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13057
total_samples=6675, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:57,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 03:04:57,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.32 | bwd_microstep: 1963.59 | bwd_inner_microstep: 1842.16 | bwd_allreduce_microstep: 121.37 | step_microstep: 109.92
[2025-08-03 03:04:57,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.08 | bwd: 7390.46 | bwd_inner: 6957.72 | bwd_allreduce: 432.51 | step: 110.35
{'loss': 0.7889, 'learning_rate': 1.819358399436553e-05, 'epoch': 0.22}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11672
total_samples=6678, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:04:59,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.01 | bwd_microstep: 1771.03 | bwd_inner_microstep: 1533.83 | bwd_allreduce_microstep: 237.15 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11779
total_samples=6681, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:02,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.44 | bwd_microstep: 1943.98 | bwd_inner_microstep: 1765.51 | bwd_allreduce_microstep: 178.41 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13576
total_samples=6685, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:05,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.94 | bwd_microstep: 2034.33 | bwd_inner_microstep: 1897.10 | bwd_allreduce_microstep: 137.17 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12474
total_samples=6688, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:08,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 03:05:08,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.82 | bwd_microstep: 1945.00 | bwd_inner_microstep: 1678.81 | bwd_allreduce_microstep: 266.13 | step_microstep: 115.18
[2025-08-03 03:05:08,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.15 | bwd: 7694.40 | bwd_inner: 6875.24 | bwd_allreduce: 818.93 | step: 115.61
{'loss': 0.7891, 'learning_rate': 1.8184289660636715e-05, 'epoch': 0.22}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13803
total_samples=6692, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:10,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.10 | bwd_microstep: 1925.31 | bwd_inner_microstep: 1711.44 | bwd_allreduce_microstep: 213.81 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11792
total_samples=6695, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:13,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.88 | bwd_microstep: 1771.14 | bwd_inner_microstep: 1553.88 | bwd_allreduce_microstep: 217.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16228
total_samples=6700, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:15,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.86 | bwd_microstep: 1814.68 | bwd_inner_microstep: 1806.93 | bwd_allreduce_microstep: 7.70 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12841
total_samples=6704, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:18,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 03:05:18,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.78 | bwd_microstep: 2007.23 | bwd_inner_microstep: 1684.76 | bwd_allreduce_microstep: 322.41 | step_microstep: 129.59
[2025-08-03 03:05:18,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.55 | bwd: 7518.41 | bwd_inner: 6757.00 | bwd_allreduce: 761.18 | step: 129.91
{'loss': 0.7911, 'learning_rate': 1.817497386456238e-05, 'epoch': 0.22}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11515
total_samples=6707, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:21,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.63 | bwd_microstep: 1853.28 | bwd_inner_microstep: 1604.09 | bwd_allreduce_microstep: 249.13 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13226
total_samples=6711, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:24,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.46 | bwd_microstep: 2398.28 | bwd_inner_microstep: 2058.39 | bwd_allreduce_microstep: 339.82 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13433
total_samples=6715, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:27,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.77 | bwd_microstep: 2028.74 | bwd_inner_microstep: 1889.39 | bwd_allreduce_microstep: 139.28 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15964
total_samples=6720, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:30,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:05:30,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.24 | bwd_microstep: 1948.53 | bwd_inner_microstep: 1780.64 | bwd_allreduce_microstep: 167.82 | step_microstep: 112.89
[2025-08-03 03:05:30,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.04 | bwd: 8228.88 | bwd_inner: 7332.51 | bwd_allreduce: 896.14 | step: 113.21
{'loss': 0.7865, 'learning_rate': 1.816563663057211e-05, 'epoch': 0.22}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13252
total_samples=6724, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:32,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.47 | bwd_microstep: 1713.92 | bwd_inner_microstep: 1619.52 | bwd_allreduce_microstep: 94.33 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14265
total_samples=6728, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:35,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.52 | bwd_microstep: 1968.66 | bwd_inner_microstep: 1866.67 | bwd_allreduce_microstep: 101.92 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14422
total_samples=6733, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:38,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.92 | bwd_microstep: 1835.28 | bwd_inner_microstep: 1757.77 | bwd_allreduce_microstep: 77.46 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13838
total_samples=6737, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:40,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 03:05:40,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.01 | bwd_microstep: 1792.69 | bwd_inner_microstep: 1724.37 | bwd_allreduce_microstep: 68.26 | step_microstep: 109.86
[2025-08-03 03:05:40,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2757.86 | bwd: 7310.60 | bwd_inner: 6968.33 | bwd_allreduce: 342.04 | step: 110.31
{'loss': 0.7917, 'learning_rate': 1.815627798315172e-05, 'epoch': 0.22}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13623
total_samples=6741, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:43,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.89 | bwd_microstep: 1785.98 | bwd_inner_microstep: 1702.10 | bwd_allreduce_microstep: 83.81 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13247
total_samples=6745, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:45,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.00 | bwd_microstep: 1770.33 | bwd_inner_microstep: 1718.74 | bwd_allreduce_microstep: 51.52 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16056
total_samples=6749, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:48,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 756.11 | bwd_microstep: 2034.34 | bwd_inner_microstep: 1931.35 | bwd_allreduce_microstep: 102.93 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14115
total_samples=6754, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:51,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.73
[2025-08-03 03:05:51,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.43 | bwd_microstep: 1830.46 | bwd_inner_microstep: 1722.62 | bwd_allreduce_microstep: 107.77 | step_microstep: 115.91
[2025-08-03 03:05:51,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.36 | bwd: 7421.15 | bwd_inner: 7074.82 | bwd_allreduce: 346.10 | step: 116.23
 22%|██▏       | 436/2000 [1:22:01<4:42:48, 10.85s/it]
 22%|██▏       | 437/2000 [1:22:11<4:40:59, 10.79s/it]
 22%|██▏       | 438/2000 [1:22:22<4:41:53, 10.83s/it]
 22%|██▏       | 439/2000 [1:22:33<4:41:39, 10.83s/it]
 22%|██▏       | 440/2000 [1:22:45<4:46:33, 11.02s/it]
 22%|██▏       | 441/2000 [1:22:55<4:42:30, 10.87s/it]
 22%|██▏       | 442/2000 [1:23:06<4:40:58, 10.82s/it]
{'loss': 0.7936, 'learning_rate': 1.8146897946843162e-05, 'epoch': 0.22}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13589
total_samples=6758, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:54,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.66 | bwd_microstep: 2011.66 | bwd_inner_microstep: 1902.43 | bwd_allreduce_microstep: 109.17 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13534
total_samples=6762, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:57,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.75 | bwd_microstep: 1982.76 | bwd_inner_microstep: 1976.70 | bwd_allreduce_microstep: 5.99 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12568
total_samples=6765, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:05:59,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.14 | bwd_microstep: 1787.83 | bwd_inner_microstep: 1594.23 | bwd_allreduce_microstep: 193.53 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13177
total_samples=6769, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:02,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 03:06:02,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.41 | bwd_microstep: 1868.58 | bwd_inner_microstep: 1663.82 | bwd_allreduce_microstep: 204.70 | step_microstep: 120.88
[2025-08-03 03:06:02,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.90 | bwd: 7650.88 | bwd_inner: 7137.17 | bwd_allreduce: 513.47 | step: 121.23
{'loss': 0.789, 'learning_rate': 1.81374965462445e-05, 'epoch': 0.22}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12445
total_samples=6773, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:05,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.90 | bwd_microstep: 2069.31 | bwd_inner_microstep: 1857.59 | bwd_allreduce_microstep: 211.65 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13356
total_samples=6777, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:07,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.91 | bwd_microstep: 1799.39 | bwd_inner_microstep: 1700.69 | bwd_allreduce_microstep: 98.63 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11784
total_samples=6780, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:11,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1633.54 | bwd_microstep: 2060.01 | bwd_inner_microstep: 1861.68 | bwd_allreduce_microstep: 198.27 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11656
total_samples=6783, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:14,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 03:06:14,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.36 | bwd_microstep: 1869.60 | bwd_inner_microstep: 1634.40 | bwd_allreduce_microstep: 235.13 | step_microstep: 120.49
[2025-08-03 03:06:14,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3748.64 | bwd: 7798.36 | bwd_inner: 7054.37 | bwd_allreduce: 743.76 | step: 120.80
{'loss': 0.7761, 'learning_rate': 1.81280738060098e-05, 'epoch': 0.22}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11726
total_samples=6786, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:17,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.45 | bwd_microstep: 1837.57 | bwd_inner_microstep: 1687.62 | bwd_allreduce_microstep: 149.89 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12747
total_samples=6790, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:19,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.29 | bwd_microstep: 1746.49 | bwd_inner_microstep: 1630.22 | bwd_allreduce_microstep: 116.21 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11776
total_samples=6793, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:22,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.08 | bwd_microstep: 1978.94 | bwd_inner_microstep: 1757.40 | bwd_allreduce_microstep: 221.47 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11825
total_samples=6796, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:25,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 03:06:25,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 952.58 | bwd_microstep: 2006.00 | bwd_inner_microstep: 1748.40 | bwd_allreduce_microstep: 257.54 | step_microstep: 131.49
[2025-08-03 03:06:25,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3078.34 | bwd: 7569.05 | bwd_inner: 6823.64 | bwd_allreduce: 745.18 | step: 131.81
{'loss': 0.7914, 'learning_rate': 1.8118629750849106e-05, 'epoch': 0.22}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11884
total_samples=6799, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:28,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.06 | bwd_microstep: 1823.49 | bwd_inner_microstep: 1577.89 | bwd_allreduce_microstep: 245.54 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14563
total_samples=6803, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:30,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.10 | bwd_microstep: 1840.92 | bwd_inner_microstep: 1768.67 | bwd_allreduce_microstep: 72.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14165
total_samples=6807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:33,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.66 | bwd_microstep: 1768.23 | bwd_inner_microstep: 1714.93 | bwd_allreduce_microstep: 53.24 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13247
total_samples=6811, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:36,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 03:06:36,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.55 | bwd_microstep: 2278.37 | bwd_inner_microstep: 2208.68 | bwd_allreduce_microstep: 69.62 | step_microstep: 133.49
[2025-08-03 03:06:36,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.31 | bwd: 7711.06 | bwd_inner: 7270.17 | bwd_allreduce: 440.67 | step: 133.81
{'loss': 0.7905, 'learning_rate': 1.810916440552835e-05, 'epoch': 0.22}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11888
total_samples=6814, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:39,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.96 | bwd_microstep: 2007.74 | bwd_inner_microstep: 1782.34 | bwd_allreduce_microstep: 225.34 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13943
total_samples=6818, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:41,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.34 | bwd_microstep: 1716.63 | bwd_inner_microstep: 1686.95 | bwd_allreduce_microstep: 29.61 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11683
total_samples=6821, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:44,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.16 | bwd_microstep: 1829.37 | bwd_inner_microstep: 1584.33 | bwd_allreduce_microstep: 244.98 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13182
total_samples=6825, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:47,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 03:06:47,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.98 | bwd_microstep: 1812.95 | bwd_inner_microstep: 1663.46 | bwd_allreduce_microstep: 149.42 | step_microstep: 130.85
[2025-08-03 03:06:47,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.37 | bwd: 7366.73 | bwd_inner: 6717.07 | bwd_allreduce: 649.44 | step: 131.32
{'loss': 0.7932, 'learning_rate': 1.8099677794869297e-05, 'epoch': 0.22}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13444
total_samples=6829, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:49,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.07 | bwd_microstep: 1765.88 | bwd_inner_microstep: 1696.42 | bwd_allreduce_microstep: 69.39 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13324
total_samples=6833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:52,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.12 | bwd_microstep: 1796.24 | bwd_inner_microstep: 1664.47 | bwd_allreduce_microstep: 131.71 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12122
total_samples=6836, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:55,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.59 | bwd_microstep: 1934.85 | bwd_inner_microstep: 1744.57 | bwd_allreduce_microstep: 190.21 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14246
total_samples=6840, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:06:57,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 03:06:57,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.11 | bwd_microstep: 1835.08 | bwd_inner_microstep: 1765.47 | bwd_allreduce_microstep: 69.56 | step_microstep: 145.78
[2025-08-03 03:06:57,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.82 | bwd: 7332.10 | bwd_inner: 6870.93 | bwd_allreduce: 460.94 | step: 146.21
 22%|██▏       | 442/2000 [1:23:06<4:40:58, 10.82s/it]
 22%|██▏       | 443/2000 [1:23:17<4:41:29, 10.85s/it]
 22%|██▏       | 444/2000 [1:23:29<4:50:11, 11.19s/it]
 22%|██▏       | 445/2000 [1:23:40<4:49:40, 11.18s/it]
 22%|██▏       | 446/2000 [1:23:51<4:47:46, 11.11s/it]
 22%|██▏       | 447/2000 [1:24:02<4:43:53, 10.97s/it]
 22%|██▏
{'loss': 0.7818, 'learning_rate': 1.8090169943749477e-05, 'epoch': 0.22}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15092
total_samples=6845, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:00,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.29 | bwd_microstep: 2026.49 | bwd_inner_microstep: 1823.73 | bwd_allreduce_microstep: 202.71 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11799
total_samples=6848, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:03,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.44 | bwd_microstep: 1789.60 | bwd_inner_microstep: 1563.80 | bwd_allreduce_microstep: 225.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13430
total_samples=6852, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:05,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.61 | bwd_microstep: 1799.62 | bwd_inner_microstep: 1793.59 | bwd_allreduce_microstep: 5.97 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13231
total_samples=6856, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:08,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 03:07:08,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.21 | bwd_microstep: 1752.11 | bwd_inner_microstep: 1678.17 | bwd_allreduce_microstep: 73.87 | step_microstep: 158.48
[2025-08-03 03:07:08,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.49 | bwd: 7367.88 | bwd_inner: 6859.29 | bwd_allreduce: 508.35 | step: 158.80
{'loss': 0.7832, 'learning_rate': 1.808064087710212e-05, 'epoch': 0.22}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13422
total_samples=6860, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:10,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.10 | bwd_microstep: 1700.05 | bwd_inner_microstep: 1639.46 | bwd_allreduce_microstep: 60.53 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11781
total_samples=6863, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:13,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.20 | bwd_microstep: 1797.84 | bwd_inner_microstep: 1548.71 | bwd_allreduce_microstep: 249.07 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14199
total_samples=6867, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:16,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.95 | bwd_microstep: 1754.12 | bwd_inner_microstep: 1709.57 | bwd_allreduce_microstep: 44.49 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13371
total_samples=6871, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:19,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 03:07:19,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.09 | bwd_microstep: 2016.35 | bwd_inner_microstep: 1898.51 | bwd_allreduce_microstep: 117.77 | step_microstep: 143.78
[2025-08-03 03:07:19,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.27 | bwd: 7268.41 | bwd_inner: 6796.25 | bwd_allreduce: 471.92 | step: 144.20
{'loss': 0.7887, 'learning_rate': 1.8071090619916095e-05, 'epoch': 0.23}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13643
total_samples=6875, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:21,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.03 | bwd_microstep: 1830.58 | bwd_inner_microstep: 1719.42 | bwd_allreduce_microstep: 111.09 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12066
total_samples=6878, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:24,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.45 | bwd_microstep: 1736.19 | bwd_inner_microstep: 1551.61 | bwd_allreduce_microstep: 184.51 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13799
total_samples=6882, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:26,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.43 | bwd_microstep: 2024.88 | bwd_inner_microstep: 1888.20 | bwd_allreduce_microstep: 136.61 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13232
total_samples=6886, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:29,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 03:07:29,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.45 | bwd_microstep: 1819.24 | bwd_inner_microstep: 1711.34 | bwd_allreduce_microstep: 107.84 | step_microstep: 156.25
[2025-08-03 03:07:29,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.29 | bwd: 7410.94 | bwd_inner: 6870.57 | bwd_allreduce: 540.13 | step: 156.59
{'loss': 0.7913, 'learning_rate': 1.8061519197235835e-05, 'epoch': 0.23}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12616
total_samples=6890, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:32,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.49 | bwd_microstep: 1964.22 | bwd_inner_microstep: 1626.88 | bwd_allreduce_microstep: 337.28 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13943
total_samples=6894, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:34,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.97 | bwd_microstep: 1721.84 | bwd_inner_microstep: 1691.72 | bwd_allreduce_microstep: 30.06 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15094
total_samples=6898, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:37,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.76 | bwd_microstep: 2202.59 | bwd_inner_microstep: 2052.35 | bwd_allreduce_microstep: 150.16 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12242
total_samples=6901, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:40,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.81
[2025-08-03 03:07:40,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.87 | bwd_microstep: 1915.06 | bwd_inner_microstep: 1581.19 | bwd_allreduce_microstep: 333.80 | step_microstep: 153.89
[2025-08-03 03:07:40,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.03 | bwd: 7803.77 | bwd_inner: 6952.12 | bwd_allreduce: 851.39 | step: 154.26
{'loss': 0.7922, 'learning_rate': 1.8051926634161282e-05, 'epoch': 0.23}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13561
total_samples=6905, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:43,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.55 | bwd_microstep: 1972.47 | bwd_inner_microstep: 1794.55 | bwd_allreduce_microstep: 177.86 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770
total_samples=6908, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:46,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.23 | bwd_microstep: 2451.11 | bwd_inner_microstep: 1759.76 | bwd_allreduce_microstep: 691.29 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12632
total_samples=6911, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:49,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.00 | bwd_microstep: 1779.79 | bwd_inner_microstep: 1591.92 | bwd_allreduce_microstep: 187.80 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11732
total_samples=6914, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:52,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 03:07:52,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.42 | bwd_microstep: 1932.65 | bwd_inner_microstep: 1576.91 | bwd_allreduce_microstep: 355.67 | step_microstep: 126.53
[2025-08-03 03:07:52,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.14 | bwd: 8136.08 | bwd_inner: 6723.13 | bwd_allreduce: 1412.71 | step: 126.97
{'loss': 0.7928, 'learning_rate': 1.804231295584782e-05, 'epoch': 0.23}
 22%|██▏       | 448/2000 [1:24:12<4:40:39, 10.85s/it]
 22%|██▏       | 449/2000 [1:24:23<4:39:18, 10.80s/it]
 22%|██▎       | 450/2000 [1:24:33<4:37:04, 10.73s/it]
 23%|██▎       | 451/2000 [1:24:44<4:36:30, 10.71s/it]
 23%|██▎       | 452/2000 [1:24:55<4:39:16, 10.82s/it]
 23%|██▎       | 453/2000 [1:25:07<4:43:38, 11.00s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11748
total_samples=6917, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:54,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.98 | bwd_microstep: 1871.94 | bwd_inner_microstep: 1602.55 | bwd_allreduce_microstep: 269.34 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11802
total_samples=6920, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:07:57,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.93 | bwd_microstep: 1793.23 | bwd_inner_microstep: 1552.55 | bwd_allreduce_microstep: 240.61 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13345
total_samples=6924, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:00,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.21 | bwd_microstep: 1980.01 | bwd_inner_microstep: 1869.75 | bwd_allreduce_microstep: 110.20 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11848
total_samples=6927, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:02,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 03:08:02,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.19 | bwd_microstep: 1816.46 | bwd_inner_microstep: 1584.53 | bwd_allreduce_microstep: 231.87 | step_microstep: 131.76
[2025-08-03 03:08:02,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.24 | bwd: 7461.68 | bwd_inner: 6609.37 | bwd_allreduce: 852.10 | step: 132.19
{'loss': 0.7965, 'learning_rate': 1.8032678187506187e-05, 'epoch': 0.23}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12128
total_samples=6930, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:05,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.29 | bwd_microstep: 1759.04 | bwd_inner_microstep: 1560.04 | bwd_allreduce_microstep: 198.93 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13709
total_samples=6934, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:08,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.82 | bwd_microstep: 1809.88 | bwd_inner_microstep: 1732.67 | bwd_allreduce_microstep: 77.14 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14062
total_samples=6938, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:10,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.96 | bwd_microstep: 1853.70 | bwd_inner_microstep: 1744.60 | bwd_allreduce_microstep: 109.03 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13349
total_samples=6942, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:13,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 03:08:13,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.46 | bwd_microstep: 1893.02 | bwd_inner_microstep: 1812.92 | bwd_allreduce_microstep: 80.03 | step_microstep: 112.70
[2025-08-03 03:08:13,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2862.46 | bwd: 7315.68 | bwd_inner: 6850.23 | bwd_allreduce: 465.22 | step: 113.03
{'loss': 0.7895, 'learning_rate': 1.802302235440245e-05, 'epoch': 0.23}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12220
total_samples=6946, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:16,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.37 | bwd_microstep: 1841.34 | bwd_inner_microstep: 1727.59 | bwd_allreduce_microstep: 113.68 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13300
total_samples=6950, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:18,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.54 | bwd_microstep: 1839.92 | bwd_inner_microstep: 1720.22 | bwd_allreduce_microstep: 119.64 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13301
total_samples=6954, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:21,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.67 | bwd_microstep: 1803.63 | bwd_inner_microstep: 1709.20 | bwd_allreduce_microstep: 94.36 | step_microstep: 0.20
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12608
total_samples=6958, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:24,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.42
[2025-08-03 03:08:24,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.09 | bwd_microstep: 1816.75 | bwd_inner_microstep: 1631.81 | bwd_allreduce_microstep: 184.88 | step_microstep: 131.45
[2025-08-03 03:08:24,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.61 | bwd: 7301.69 | bwd_inner: 6788.82 | bwd_allreduce: 512.64 | step: 131.91
{'loss': 0.7865, 'learning_rate': 1.8013345481857903e-05, 'epoch': 0.23}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13262
total_samples=6962, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:26,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.23 | bwd_microstep: 1836.40 | bwd_inner_microstep: 1704.53 | bwd_allreduce_microstep: 131.80 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12900
total_samples=6966, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:29,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.62 | bwd_microstep: 1827.41 | bwd_inner_microstep: 1659.06 | bwd_allreduce_microstep: 168.28 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12944
total_samples=6970, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:32,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.49 | bwd_microstep: 1888.07 | bwd_inner_microstep: 1664.14 | bwd_allreduce_microstep: 223.87 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13394
total_samples=6974, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:35,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.71
[2025-08-03 03:08:35,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.91 | bwd_microstep: 1854.97 | bwd_inner_microstep: 1693.59 | bwd_allreduce_microstep: 161.30 | step_microstep: 429.54
[2025-08-03 03:08:35,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.18 | bwd: 7406.90 | bwd_inner: 6721.32 | bwd_allreduce: 685.34 | step: 429.97
{'loss': 0.8082, 'learning_rate': 1.8003647595249016e-05, 'epoch': 0.23}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13471
total_samples=6978, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:37,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.32 | bwd_microstep: 1810.90 | bwd_inner_microstep: 1705.13 | bwd_allreduce_microstep: 105.70 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13283
total_samples=6982, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:40,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.14 | bwd_microstep: 1861.92 | bwd_inner_microstep: 1802.04 | bwd_allreduce_microstep: 59.82 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13282
total_samples=6986, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:43,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.34 | bwd_microstep: 1793.56 | bwd_inner_microstep: 1671.02 | bwd_allreduce_microstep: 122.47 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13430
total_samples=6990, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:45,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 03:08:45,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.92 | bwd_microstep: 1746.88 | bwd_inner_microstep: 1680.59 | bwd_allreduce_microstep: 66.23 | step_microstep: 153.98
[2025-08-03 03:08:45,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.64 | bwd: 7213.30 | bwd_inner: 6858.78 | bwd_allreduce: 354.29 | step: 154.44
{'loss': 0.7824, 'learning_rate': 1.799392872000736e-05, 'epoch': 0.23}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12684
total_samples=6994, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:48,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.56 | bwd_microstep: 1946.75 | bwd_inner_microstep: 1625.11 | bwd_allreduce_microstep: 321.57 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13343
total_samples=6998, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:50,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.12 | bwd_microstep: 1792.34 | bwd_inner_microstep: 1693.74 | bwd_allreduce_microstep: 98.55 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13340
total_samples=7002, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:53,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.26 | bwd_microstep: 1972.01 | bwd_inner_microstep: 1880.45 | bwd_allreduce_microstep: 91.50 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13367
total_samples=7006, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:56,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 03:08:56,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.88 | bwd_microstep: 2193.07 | bwd_inner_microstep: 1814.98 | bwd_allreduce_microstep: 378.02 | step_microstep: 130.83
[2025-08-03 03:08:56,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.75 | bwd: 7904.23 | bwd_inner: 7014.28 | bwd_allreduce: 889.72 | step: 131.16
 23%|██▎       | 454/2000 [1:25:17<4:41:43, 10.93s/it]
 23%|██▎       | 455/2000 [1:25:28<4:39:11, 10.84s/it]
 23%|██▎       | 456/2000 [1:25:39<4:36:46, 10.76s/it]
 23%|██▎       | 457/2000 [1:25:49<4:38:17, 10.82s/it]
 23%|██▎       | 458/2000 [1:26:00<4:35:56, 10.74s/it]
 23%|██▎       | 459/2000 [1:26:11<4:38:53, 10.86s/it]
{'loss': 0.7814, 'learning_rate': 1.7984188881619563e-05, 'epoch': 0.23}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13467
total_samples=7010, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:08:59,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.89 | bwd_microstep: 1990.87 | bwd_inner_microstep: 1873.42 | bwd_allreduce_microstep: 117.38 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13334
total_samples=7014, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:02,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.88 | bwd_microstep: 2059.89 | bwd_inner_microstep: 1931.71 | bwd_allreduce_microstep: 128.12 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13877
total_samples=7019, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:05,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.00 | bwd_microstep: 1885.28 | bwd_inner_microstep: 1759.16 | bwd_allreduce_microstep: 126.07 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11812
total_samples=7022, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:08,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 03:09:08,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.97 | bwd_microstep: 2055.58 | bwd_inner_microstep: 1839.39 | bwd_allreduce_microstep: 216.13 | step_microstep: 127.22
[2025-08-03 03:09:08,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.68 | bwd: 7991.66 | bwd_inner: 7403.67 | bwd_allreduce: 587.77 | step: 127.69
{'loss': 0.7771, 'learning_rate': 1.797442810562721e-05, 'epoch': 0.23}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11904
total_samples=7025, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:10,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.38 | bwd_microstep: 2073.36 | bwd_inner_microstep: 1794.58 | bwd_allreduce_microstep: 278.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13024
total_samples=7029, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:13,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.17 | bwd_microstep: 1854.63 | bwd_inner_microstep: 1679.53 | bwd_allreduce_microstep: 175.03 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13432
total_samples=7033, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:16,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.33 | bwd_microstep: 1797.97 | bwd_inner_microstep: 1691.70 | bwd_allreduce_microstep: 106.20 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11801
total_samples=7036, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:18,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 03:09:18,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.59 | bwd_microstep: 2001.67 | bwd_inner_microstep: 1795.60 | bwd_allreduce_microstep: 206.01 | step_microstep: 109.88
[2025-08-03 03:09:18,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.40 | bwd: 7727.67 | bwd_inner: 6961.42 | bwd_allreduce: 766.01 | step: 110.23
{'loss': 0.7855, 'learning_rate': 1.79646464176268e-05, 'epoch': 0.23}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14979
total_samples=7040, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:21,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.35 | bwd_microstep: 1760.04 | bwd_inner_microstep: 1741.77 | bwd_allreduce_microstep: 18.21 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15655
total_samples=7044, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:24,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.31 | bwd_microstep: 1867.40 | bwd_inner_microstep: 1861.42 | bwd_allreduce_microstep: 5.91 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13157
total_samples=7048, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:26,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.60 | bwd_microstep: 1706.29 | bwd_inner_microstep: 1635.56 | bwd_allreduce_microstep: 70.67 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12964
total_samples=7052, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:29,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 03:09:29,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.13 | bwd_microstep: 2103.08 | bwd_inner_microstep: 1961.81 | bwd_allreduce_microstep: 141.20 | step_microstep: 112.91
[2025-08-03 03:09:29,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.34 | bwd: 7436.86 | bwd_inner: 7200.56 | bwd_allreduce: 236.07 | step: 113.35
{'loss': 0.7867, 'learning_rate': 1.7954843843269665e-05, 'epoch': 0.23}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11990
total_samples=7055, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:34,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1603.13 | bwd_microstep: 2916.08 | bwd_inner_microstep: 2478.28 | bwd_allreduce_microstep: 437.74 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13730
total_samples=7060, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:37,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 798.36 | bwd_microstep: 1914.99 | bwd_inner_microstep: 1687.91 | bwd_allreduce_microstep: 227.02 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13914
total_samples=7065, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:39,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.52 | bwd_microstep: 1735.53 | bwd_inner_microstep: 1661.42 | bwd_allreduce_microstep: 74.05 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13390
total_samples=7069, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:42,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 03:09:42,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.12 | bwd_microstep: 1867.48 | bwd_inner_microstep: 1727.37 | bwd_allreduce_microstep: 140.04 | step_microstep: 130.54
[2025-08-03 03:09:42,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3800.07 | bwd: 8434.12 | bwd_inner: 7554.98 | bwd_allreduce: 878.91 | step: 130.85
{'loss': 0.7903, 'learning_rate': 1.794502040826192e-05, 'epoch': 0.23}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11673
total_samples=7073, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:45,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.76 | bwd_microstep: 1881.37 | bwd_inner_microstep: 1731.92 | bwd_allreduce_microstep: 149.39 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14190
total_samples=7077, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:47,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.29 | bwd_microstep: 2048.19 | bwd_inner_microstep: 1912.38 | bwd_allreduce_microstep: 135.75 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13965
total_samples=7081, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:50,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.49 | bwd_microstep: 1819.96 | bwd_inner_microstep: 1723.88 | bwd_allreduce_microstep: 96.02 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11839
total_samples=7084, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:54,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 03:09:54,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.56 | bwd_microstep: 2702.53 | bwd_inner_microstep: 2230.56 | bwd_allreduce_microstep: 471.86 | step_microstep: 134.97
[2025-08-03 03:09:54,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.03 | bwd: 8452.10 | bwd_inner: 7598.75 | bwd_allreduce: 853.09 | step: 135.31
{'loss': 0.7948, 'learning_rate': 1.793517613836437e-05, 'epoch': 0.23}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11832
total_samples=7087, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:09:58,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1570.84 | bwd_microstep: 2258.95 | bwd_inner_microstep: 2046.49 | bwd_allreduce_microstep: 212.40 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12877
total_samples=7091, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:00,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.93 | bwd_microstep: 2049.93 | bwd_inner_microstep: 1996.34 | bwd_allreduce_microstep: 53.52 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13982
total_samples=7095, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:03,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.36 | bwd_microstep: 1812.12 | bwd_inner_microstep: 1731.07 | bwd_allreduce_microstep: 80.98 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12855
total_samples=7098, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:06,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 03:10:06,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.15 | bwd_microstep: 1984.28 | bwd_inner_microstep: 1790.73 | bwd_allreduce_microstep: 193.50 | step_microstep: 121.54
[2025-08-03 03:10:06,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3716.21 | bwd: 8105.32 | bwd_inner: 7564.63 | bwd_allreduce: 540.47 | step: 121.95
 23%|██▎       | 460/2000 [1:26:22<4:41:26, 10.97s/it]
 23%|██▎       | 461/2000 [1:26:33<4:41:03, 10.96s/it]
 23%|██▎       | 462/2000 [1:26:44<4:38:48, 10.88s/it]
 23%|██▎       | 463/2000 [1:26:57<4:52:39, 11.42s/it]
 23%|██▎       | 464/2000 [1:27:08<4:55:05, 11.53s/it]
 23%|██▎       | 465/2000 [1:27:21<5:00:24, 11.74s/it]
{'loss': 0.7786, 'learning_rate': 1.7925311059392472e-05, 'epoch': 0.23}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13772
total_samples=7102, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:09,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.94 | bwd_microstep: 2023.23 | bwd_inner_microstep: 1874.32 | bwd_allreduce_microstep: 148.84 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12879
total_samples=7106, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:11,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.48 | bwd_microstep: 1892.31 | bwd_inner_microstep: 1743.22 | bwd_allreduce_microstep: 149.02 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12043
total_samples=7110, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:14,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.33 | bwd_microstep: 2023.91 | bwd_inner_microstep: 1779.77 | bwd_allreduce_microstep: 244.08 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11906
total_samples=7113, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:17,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 03:10:17,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.24 | bwd_microstep: 1830.63 | bwd_inner_microstep: 1576.79 | bwd_allreduce_microstep: 253.77 | step_microstep: 138.16
[2025-08-03 03:10:17,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2762.92 | bwd: 7770.12 | bwd_inner: 6974.10 | bwd_allreduce: 795.79 | step: 138.48
{'loss': 0.7869, 'learning_rate': 1.7915425197216246e-05, 'epoch': 0.23}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12161
total_samples=7117, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:19,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.51 | bwd_microstep: 1821.28 | bwd_inner_microstep: 1588.22 | bwd_allreduce_microstep: 233.00 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13217
total_samples=7121, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:22,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.79 | bwd_microstep: 1940.41 | bwd_inner_microstep: 1682.21 | bwd_allreduce_microstep: 258.13 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13488
total_samples=7125, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:25,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.12 | bwd_microstep: 1908.92 | bwd_inner_microstep: 1643.69 | bwd_allreduce_microstep: 265.18 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13593
total_samples=7129, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:28,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 03:10:28,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.95 | bwd_microstep: 1902.44 | bwd_inner_microstep: 1657.75 | bwd_allreduce_microstep: 244.62 | step_microstep: 132.77
[2025-08-03 03:10:28,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.30 | bwd: 7573.10 | bwd_inner: 6571.87 | bwd_allreduce: 1001.01 | step: 133.10
{'loss': 0.7778, 'learning_rate': 1.7905518577760207e-05, 'epoch': 0.23}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12176
total_samples=7132, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:31,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.00 | bwd_microstep: 2557.45 | bwd_inner_microstep: 1763.04 | bwd_allreduce_microstep: 794.34 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12486
total_samples=7136, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:34,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.30 | bwd_microstep: 1809.51 | bwd_inner_microstep: 1601.13 | bwd_allreduce_microstep: 208.32 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11714
total_samples=7139, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:36,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.03 | bwd_microstep: 1840.02 | bwd_inner_microstep: 1604.99 | bwd_allreduce_microstep: 234.97 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13982
total_samples=7143, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:39,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 03:10:39,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.83 | bwd_microstep: 2025.31 | bwd_inner_microstep: 1907.31 | bwd_allreduce_microstep: 117.95 | step_microstep: 135.35
[2025-08-03 03:10:39,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2770.09 | bwd: 8232.33 | bwd_inner: 6876.46 | bwd_allreduce: 1355.64 | step: 135.69
{'loss': 0.793, 'learning_rate': 1.7895591227003316e-05, 'epoch': 0.23}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11660
total_samples=7146, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:42,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.97 | bwd_microstep: 1767.04 | bwd_inner_microstep: 1544.51 | bwd_allreduce_microstep: 222.46 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13611
total_samples=7150, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:44,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.00 | bwd_microstep: 1716.23 | bwd_inner_microstep: 1654.47 | bwd_allreduce_microstep: 61.71 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11812
total_samples=7153, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:47,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.31 | bwd_microstep: 2032.16 | bwd_inner_microstep: 1770.21 | bwd_allreduce_microstep: 261.89 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13927
total_samples=7157, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:50,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 03:10:50,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.02 | bwd_microstep: 1874.84 | bwd_inner_microstep: 1747.76 | bwd_allreduce_microstep: 127.01 | step_microstep: 127.81
[2025-08-03 03:10:50,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.22 | bwd: 7390.31 | bwd_inner: 6716.94 | bwd_allreduce: 673.14 | step: 128.23
{'loss': 0.7926, 'learning_rate': 1.788564317097889e-05, 'epoch': 0.23}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11674
total_samples=7160, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:52,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.18 | bwd_microstep: 1787.94 | bwd_inner_microstep: 1553.94 | bwd_allreduce_microstep: 233.93 | step_microstep: 0.16
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13241
total_samples=7164, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:55,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.45 | bwd_microstep: 1821.72 | bwd_inner_microstep: 1659.64 | bwd_allreduce_microstep: 162.02 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13259
total_samples=7168, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:10:57,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.87 | bwd_microstep: 1747.98 | bwd_inner_microstep: 1672.63 | bwd_allreduce_microstep: 75.29 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12357
total_samples=7172, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:00,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 03:11:00,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.01 | bwd_microstep: 1772.66 | bwd_inner_microstep: 1594.87 | bwd_allreduce_microstep: 177.73 | step_microstep: 115.89
[2025-08-03 03:11:00,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.44 | bwd: 7130.35 | bwd_inner: 6481.06 | bwd_allreduce: 649.04 | step: 116.35
{'loss': 0.7839, 'learning_rate': 1.7875674435774546e-05, 'epoch': 0.23}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13395
total_samples=7176, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:03,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.59 | bwd_microstep: 2468.19 | bwd_inner_microstep: 2282.23 | bwd_allreduce_microstep: 185.90 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11713
total_samples=7179, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:06,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.19 | bwd_microstep: 1846.31 | bwd_inner_microstep: 1552.76 | bwd_allreduce_microstep: 293.49 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13395
total_samples=7183, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:09,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.47 | bwd_microstep: 1904.02 | bwd_inner_microstep: 1670.98 | bwd_allreduce_microstep: 232.97 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14576
total_samples=7187, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:11,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 03:11:11,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.65 | bwd_microstep: 1706.98 | bwd_inner_microstep: 1692.17 | bwd_allreduce_microstep: 14.76 | step_microstep: 131.60
[2025-08-03 03:11:11,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.83 | bwd: 7925.56 | bwd_inner: 7198.13 | bwd_allreduce: 727.19 | step: 131.99
 23%|██▎       | 465/2000 [1:27:21<5:00:24, 11.74s/it]
 23%|██▎       | 466/2000 [1:27:32<4:54:23, 11.51s/it]
 23%|██▎       | 467/2000 [1:27:43<4:48:56, 11.31s/it]
 23%|██▎       | 468/2000 [1:27:54<4:49:54, 11.35s/it]
 23%|██▎       | 469/2000 [1:28:05<4:44:00, 11.13s/it]
 24%|██▎       | 470/2000 [1:28:15<4:38:03, 10.90s/it]
{'loss': 0.7839, 'learning_rate': 1.786568504753213e-05, 'epoch': 0.24}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14680
total_samples=7191, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:14,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.16 | bwd_microstep: 1803.13 | bwd_inner_microstep: 1764.51 | bwd_allreduce_microstep: 38.56 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11775
total_samples=7194, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:17,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.35 | bwd_microstep: 1812.72 | bwd_inner_microstep: 1568.76 | bwd_allreduce_microstep: 243.90 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11620
total_samples=7197, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:19,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.94 | bwd_microstep: 1748.52 | bwd_inner_microstep: 1527.82 | bwd_allreduce_microstep: 220.64 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13753
total_samples=7201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:22,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 03:11:22,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.32 | bwd_microstep: 2073.15 | bwd_inner_microstep: 1948.25 | bwd_allreduce_microstep: 124.83 | step_microstep: 113.25
[2025-08-03 03:11:22,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.70 | bwd: 7437.57 | bwd_inner: 6809.33 | bwd_allreduce: 628.01 | step: 113.58
{'loss': 0.7835, 'learning_rate': 1.7855675032447648e-05, 'epoch': 0.24}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13500
total_samples=7205, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:25,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.68 | bwd_microstep: 1933.23 | bwd_inner_microstep: 1704.26 | bwd_allreduce_microstep: 228.91 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13181
total_samples=7209, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:28,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.63 | bwd_microstep: 2019.45 | bwd_inner_microstep: 1872.77 | bwd_allreduce_microstep: 146.61 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12365
total_samples=7212, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:30,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.82 | bwd_microstep: 2104.51 | bwd_inner_microstep: 1880.01 | bwd_allreduce_microstep: 224.43 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13541
total_samples=7216, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:33,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 03:11:33,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.18 | bwd_microstep: 1717.27 | bwd_inner_microstep: 1669.38 | bwd_allreduce_microstep: 47.82 | step_microstep: 445.53
[2025-08-03 03:11:33,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2756.23 | bwd: 7774.50 | bwd_inner: 7126.43 | bwd_allreduce: 647.84 | step: 445.98
{'loss': 0.7909, 'learning_rate': 1.78456444167712e-05, 'epoch': 0.24}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13188
total_samples=7220, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:36,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.96 | bwd_microstep: 1832.90 | bwd_inner_microstep: 1698.29 | bwd_allreduce_microstep: 134.55 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14521
total_samples=7225, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:38,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.51 | bwd_microstep: 1727.33 | bwd_inner_microstep: 1671.13 | bwd_allreduce_microstep: 56.14 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11945
total_samples=7228, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:41,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.31 | bwd_microstep: 1711.40 | bwd_inner_microstep: 1539.06 | bwd_allreduce_microstep: 172.28 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13147
total_samples=7232, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:44,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:11:44,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.82 | bwd_microstep: 2052.39 | bwd_inner_microstep: 1912.02 | bwd_allreduce_microstep: 140.31 | step_microstep: 113.50
[2025-08-03 03:11:44,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.53 | bwd: 7324.08 | bwd_inner: 6820.49 | bwd_allreduce: 503.35 | step: 113.84
{'loss': 0.7914, 'learning_rate': 1.7835593226806902e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11796
total_samples=7235, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:47,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.31 | bwd_microstep: 2008.20 | bwd_inner_microstep: 1788.62 | bwd_allreduce_microstep: 219.51 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11983
total_samples=7238, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:50,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.34 | bwd_microstep: 2070.75 | bwd_inner_microstep: 1854.67 | bwd_allreduce_microstep: 216.03 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12137
total_samples=7241, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:52,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.87 | bwd_microstep: 1790.54 | bwd_inner_microstep: 1600.50 | bwd_allreduce_microstep: 189.98 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13462
total_samples=7245, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:55,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 03:11:55,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.30 | bwd_microstep: 2055.99 | bwd_inner_microstep: 1732.31 | bwd_allreduce_microstep: 323.61 | step_microstep: 109.91
[2025-08-03 03:11:55,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.76 | bwd: 7925.53 | bwd_inner: 6976.09 | bwd_allreduce: 949.20 | step: 110.24
{'loss': 0.7837, 'learning_rate': 1.7825521488912833e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12162
total_samples=7248, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:11:58,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.84 | bwd_microstep: 1795.68 | bwd_inner_microstep: 1560.04 | bwd_allreduce_microstep: 235.58 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12120
total_samples=7251, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:00,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.80 | bwd_microstep: 1871.85 | bwd_inner_microstep: 1865.76 | bwd_allreduce_microstep: 6.03 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13620
total_samples=7255, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:03,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.13 | bwd_microstep: 1914.18 | bwd_inner_microstep: 1736.95 | bwd_allreduce_microstep: 177.16 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13335
total_samples=7259, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:06,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 03:12:06,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.80 | bwd_microstep: 1811.53 | bwd_inner_microstep: 1706.63 | bwd_allreduce_microstep: 104.83 | step_microstep: 151.61
[2025-08-03 03:12:06,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.50 | bwd: 7393.28 | bwd_inner: 6869.38 | bwd_allreduce: 523.67 | step: 151.96
{'loss': 0.7897, 'learning_rate': 1.7815429229500946e-05, 'epoch': 0.24}
 24%|██▎       | 471/2000 [1:28:26<4:40:02, 10.99s/it]
 24%|██▎       | 472/2000 [1:28:37<4:37:21, 10.89s/it]
 24%|██▎       | 473/2000 [1:28:48<4:40:32, 11.02s/it]
 24%|██▎       | 474/2000 [1:28:59<4:36:57, 10.89s/it]
 24%|██▍       | 475/2000 [1:29:10<4:38:49, 10.97s/it]
 24%|██▍       | 476/2000 [1:29:21<4:36:37, 10.89s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11625
total_samples=7262, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:09,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.16 | bwd_microstep: 2178.89 | bwd_inner_microstep: 1966.91 | bwd_allreduce_microstep: 211.91 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12157
total_samples=7265, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:11,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.23 | bwd_microstep: 1728.95 | bwd_inner_microstep: 1570.30 | bwd_allreduce_microstep: 158.58 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12634
total_samples=7268, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:14,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.98 | bwd_microstep: 1758.97 | bwd_inner_microstep: 1561.00 | bwd_allreduce_microstep: 197.91 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13814
total_samples=7272, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:17,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 03:12:17,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.72 | bwd_microstep: 2073.76 | bwd_inner_microstep: 1925.56 | bwd_allreduce_microstep: 148.14 | step_microstep: 133.41
[2025-08-03 03:12:17,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.01 | bwd: 7740.61 | bwd_inner: 7023.77 | bwd_allreduce: 716.62 | step: 133.72
{'loss': 0.7746, 'learning_rate': 1.7805316475037016e-05, 'epoch': 0.24}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13525
total_samples=7276, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:20,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.08 | bwd_microstep: 2003.71 | bwd_inner_microstep: 1756.01 | bwd_allreduce_microstep: 247.63 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11704
total_samples=7279, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:22,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.96 | bwd_microstep: 1762.02 | bwd_inner_microstep: 1537.44 | bwd_allreduce_microstep: 224.51 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11712
total_samples=7282, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:25,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.95 | bwd_microstep: 1950.74 | bwd_inner_microstep: 1755.09 | bwd_allreduce_microstep: 195.57 | step_microstep: 0.16
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13308
total_samples=7287, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:27,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 03:12:27,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.35 | bwd_microstep: 1726.24 | bwd_inner_microstep: 1639.08 | bwd_allreduce_microstep: 87.09 | step_microstep: 136.09
[2025-08-03 03:12:27,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.28 | bwd: 7442.76 | bwd_inner: 6687.62 | bwd_allreduce: 754.90 | step: 136.48
{'loss': 0.7919, 'learning_rate': 1.7795183252040568e-05, 'epoch': 0.24}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13169
total_samples=7291, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:30,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.71 | bwd_microstep: 1807.13 | bwd_inner_microstep: 1697.70 | bwd_allreduce_microstep: 109.37 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13668
total_samples=7295, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:33,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.22 | bwd_microstep: 1715.28 | bwd_inner_microstep: 1672.93 | bwd_allreduce_microstep: 42.28 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11711
total_samples=7298, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:35,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.47 | bwd_microstep: 1770.11 | bwd_inner_microstep: 1533.28 | bwd_allreduce_microstep: 236.77 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13262
total_samples=7301, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:38,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 03:12:38,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.01 | bwd_microstep: 1698.11 | bwd_inner_microstep: 1588.23 | bwd_allreduce_microstep: 109.81 | step_microstep: 141.05
[2025-08-03 03:12:38,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2770.34 | bwd: 6990.67 | bwd_inner: 6492.14 | bwd_allreduce: 498.30 | step: 141.37
{'loss': 0.7734, 'learning_rate': 1.7785029587084793e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12018
total_samples=7304, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:41,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.73 | bwd_microstep: 2145.19 | bwd_inner_microstep: 1804.36 | bwd_allreduce_microstep: 340.77 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13775
total_samples=7308, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:43,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 893.09 | bwd_microstep: 1741.51 | bwd_inner_microstep: 1691.95 | bwd_allreduce_microstep: 49.50 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11450
total_samples=7311, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:46,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.08 | bwd_microstep: 2050.53 | bwd_inner_microstep: 1825.37 | bwd_allreduce_microstep: 225.09 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13376
total_samples=7315, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:49,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 03:12:49,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.48 | bwd_microstep: 1742.57 | bwd_inner_microstep: 1691.55 | bwd_allreduce_microstep: 50.95 | step_microstep: 149.42
[2025-08-03 03:12:49,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2986.33 | bwd: 7679.85 | bwd_inner: 7013.22 | bwd_allreduce: 666.39 | step: 149.75
{'loss': 0.7767, 'learning_rate': 1.7774855506796497e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12092
total_samples=7318, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:52,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.19 | bwd_microstep: 2057.77 | bwd_inner_microstep: 1838.51 | bwd_allreduce_microstep: 219.19 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13656
total_samples=7322, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:54,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.19 | bwd_microstep: 1751.86 | bwd_inner_microstep: 1690.03 | bwd_allreduce_microstep: 61.77 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11749
total_samples=7326, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:12:57,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.82 | bwd_microstep: 1909.26 | bwd_inner_microstep: 1762.67 | bwd_allreduce_microstep: 146.53 | step_microstep: 0.23
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12736
total_samples=7330, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:00,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 03:13:00,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.09 | bwd_microstep: 1763.07 | bwd_inner_microstep: 1598.88 | bwd_allreduce_microstep: 164.12 | step_microstep: 157.85
[2025-08-03 03:13:00,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2749.21 | bwd: 7482.01 | bwd_inner: 6890.09 | bwd_allreduce: 591.69 | step: 158.30
{'loss': 0.7809, 'learning_rate': 1.7764661037856013e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12099
total_samples=7333, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:02,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.97 | bwd_microstep: 1736.00 | bwd_inner_microstep: 1551.48 | bwd_allreduce_microstep: 184.45 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13142
total_samples=7338, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:05,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.41 | bwd_microstep: 2356.32 | bwd_inner_microstep: 1612.64 | bwd_allreduce_microstep: 743.61 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11907
total_samples=7341, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:08,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.79 | bwd_microstep: 2075.86 | bwd_inner_microstep: 1609.95 | bwd_allreduce_microstep: 465.81 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13716
total_samples=7345, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:11,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.79
[2025-08-03 03:13:11,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.31 | bwd_microstep: 1702.00 | bwd_inner_microstep: 1663.53 | bwd_allreduce_microstep: 38.41 | step_microstep: 125.01
[2025-08-03 03:13:11,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2744.41 | bwd: 7870.22 | bwd_inner: 6437.61 | bwd_allreduce: 1432.35 | step: 125.51
 24%|██▍       | 477/2000 [1:29:32<4:37:40, 10.94s/it]
 24%|██▍       | 478/2000 [1:29:42<4:35:31, 10.86s/it]
 24%|██▍       | 479/2000 [1:29:53<4:30:36, 10.67s/it]
 24%|██▍       | 480/2000 [1:30:04<4:33:51, 10.81s/it]
 24%|██▍       | 481/2000 [1:30:14<4:33:09, 10.79s/it]
 24%|██▍       | 482/2000 [1:30:26<4:35:21, 10.88s/it]
{'loss': 0.7902, 'learning_rate': 1.7754446206997152e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11776
total_samples=7348, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:13,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.69 | bwd_microstep: 1802.66 | bwd_inner_microstep: 1789.61 | bwd_allreduce_microstep: 12.99 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12237
total_samples=7351, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:16,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 667.41 | bwd_microstep: 1779.22 | bwd_inner_microstep: 1577.68 | bwd_allreduce_microstep: 201.48 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11680
total_samples=7354, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:18,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.84 | bwd_microstep: 1692.73 | bwd_inner_microstep: 1516.89 | bwd_allreduce_microstep: 175.77 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13417
total_samples=7359, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:21,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 03:13:21,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.87 | bwd_microstep: 1755.97 | bwd_inner_microstep: 1677.18 | bwd_allreduce_microstep: 78.73 | step_microstep: 134.17
[2025-08-03 03:13:21,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2748.73 | bwd: 7030.62 | bwd_inner: 6561.35 | bwd_allreduce: 469.04 | step: 134.53
{'loss': 0.7847, 'learning_rate': 1.774421104100712e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12183
total_samples=7362, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:24,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.61 | bwd_microstep: 1980.58 | bwd_inner_microstep: 1573.48 | bwd_allreduce_microstep: 407.03 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14841
total_samples=7366, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:27,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.57 | bwd_microstep: 2036.17 | bwd_inner_microstep: 1936.49 | bwd_allreduce_microstep: 99.62 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13936
total_samples=7370, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:30,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.52 | bwd_microstep: 2203.47 | bwd_inner_microstep: 2132.20 | bwd_allreduce_microstep: 71.21 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12857
total_samples=7374, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:32,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.52
[2025-08-03 03:13:32,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.95 | bwd_microstep: 1935.58 | bwd_inner_microstep: 1801.13 | bwd_allreduce_microstep: 134.39 | step_microstep: 127.90
[2025-08-03 03:13:32,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2739.59 | bwd: 8155.83 | bwd_inner: 7443.29 | bwd_allreduce: 712.32 | step: 128.34
{'loss': 0.7875, 'learning_rate': 1.7733955566726438e-05, 'epoch': 0.24}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12604
total_samples=7378, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:35,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.15 | bwd_microstep: 2109.32 | bwd_inner_microstep: 1777.94 | bwd_allreduce_microstep: 331.31 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14300
total_samples=7382, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:38,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.76 | bwd_microstep: 1719.17 | bwd_inner_microstep: 1679.52 | bwd_allreduce_microstep: 39.58 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13522
total_samples=7386, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:40,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.35 | bwd_microstep: 1749.44 | bwd_inner_microstep: 1690.68 | bwd_allreduce_microstep: 58.70 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14779
total_samples=7390, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:43,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 03:13:43,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.30 | bwd_microstep: 1830.37 | bwd_inner_microstep: 1773.78 | bwd_allreduce_microstep: 56.53 | step_microstep: 147.83
[2025-08-03 03:13:43,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2745.49 | bwd: 7408.35 | bwd_inner: 6921.91 | bwd_allreduce: 486.21 | step: 148.29
{'loss': 0.7926, 'learning_rate': 1.7723679811048904e-05, 'epoch': 0.24}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12325
total_samples=7394, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:46,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.25 | bwd_microstep: 1947.26 | bwd_inner_microstep: 1598.63 | bwd_allreduce_microstep: 348.57 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14000
total_samples=7398, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:48,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.10 | bwd_microstep: 1973.29 | bwd_inner_microstep: 1707.28 | bwd_allreduce_microstep: 265.95 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13758
total_samples=7402, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:51,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.65 | bwd_microstep: 1927.90 | bwd_inner_microstep: 1880.09 | bwd_allreduce_microstep: 47.74 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14163
total_samples=7406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:54,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 03:13:54,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.25 | bwd_microstep: 2101.29 | bwd_inner_microstep: 2056.91 | bwd_allreduce_microstep: 44.31 | step_microstep: 109.64
[2025-08-03 03:13:54,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.18 | bwd: 7949.79 | bwd_inner: 7242.91 | bwd_allreduce: 706.65 | step: 109.99
{'loss': 0.7931, 'learning_rate': 1.771338380092148e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12014
total_samples=7409, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:13:57,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.33 | bwd_microstep: 2027.47 | bwd_inner_microstep: 1801.20 | bwd_allreduce_microstep: 226.20 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14775
total_samples=7413, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:00,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.40 | bwd_microstep: 2015.97 | bwd_inner_microstep: 1779.61 | bwd_allreduce_microstep: 236.29 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12602
total_samples=7417, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:03,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.78 | bwd_microstep: 2040.04 | bwd_inner_microstep: 1856.04 | bwd_allreduce_microstep: 183.93 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12757
total_samples=7421, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:05,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 03:14:05,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.25 | bwd_microstep: 1858.96 | bwd_inner_microstep: 1654.05 | bwd_allreduce_microstep: 204.85 | step_microstep: 141.71
[2025-08-03 03:14:05,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.67 | bwd: 7942.48 | bwd_inner: 7090.90 | bwd_allreduce: 851.35 | step: 142.15
{'loss': 0.7769, 'learning_rate': 1.7703067563344252e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11895
total_samples=7424, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:08,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.70 | bwd_microstep: 1965.93 | bwd_inner_microstep: 1615.90 | bwd_allreduce_microstep: 349.96 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12939
total_samples=7428, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:11,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.39 | bwd_microstep: 1748.95 | bwd_inner_microstep: 1652.27 | bwd_allreduce_microstep: 96.62 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468
total_samples=7432, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:14,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.19 | bwd_microstep: 2087.89 | bwd_inner_microstep: 1943.38 | bwd_allreduce_microstep: 144.44 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14088
total_samples=7437, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:16,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 03:14:16,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.03 | bwd_microstep: 1764.50 | bwd_inner_microstep: 1680.82 | bwd_allreduce_microstep: 83.60 | step_microstep: 154.71
[2025-08-03 03:14:16,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.26 | bwd: 7567.31 | bwd_inner: 6892.37 | bwd_allreduce: 674.70 | step: 155.16
24%|██▍       | 482/2000 [1:30:26<4:35:21, 10.88s/it]
24%|██▍       | 483/2000 [1:30:36<4:30:24, 10.70s/it]
24%|██▍       | 484/2000 [1:30:47<4:35:24, 10.90s/it]
24%|██▍       | 485/2000 [1:30:58<4:33:20, 10.83s/it]
24%|██▍       | 486/2000 [1:31:09<4:35:59, 10.94s/it]
24%|██▍       | 487/2000 [1:31:20<4:37:49, 11.02s/it]
24%|██▍       | 488/2000 [1:31:31<4:36:07, 10.96s/it]
{'loss': 0.7935, 'learning_rate': 1.7692731125370355e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12254
total_samples=7440, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:19,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.02 | bwd_microstep: 1848.80 | bwd_inner_microstep: 1590.50 | bwd_allreduce_microstep: 258.23 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13379
total_samples=7444, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:22,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.65 | bwd_microstep: 2077.01 | bwd_inner_microstep: 1927.75 | bwd_allreduce_microstep: 149.20 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12983
total_samples=7448, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:24,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.83 | bwd_microstep: 1773.71 | bwd_inner_microstep: 1633.23 | bwd_allreduce_microstep: 140.41 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14329
total_samples=7453, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:27,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 03:14:27,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.97 | bwd_microstep: 1842.81 | bwd_inner_microstep: 1748.88 | bwd_allreduce_microstep: 93.87 | step_microstep: 147.71
[2025-08-03 03:14:27,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.40 | bwd: 7542.37 | bwd_inner: 6900.36 | bwd_allreduce: 641.78 | step: 148.03
{'loss': 0.795, 'learning_rate': 1.768237451410589e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11678
total_samples=7456, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:30,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.14 | bwd_microstep: 1793.89 | bwd_inner_microstep: 1771.16 | bwd_allreduce_microstep: 22.67 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11745
total_samples=7459, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:32,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.80 | bwd_microstep: 1674.62 | bwd_inner_microstep: 1509.37 | bwd_allreduce_microstep: 165.18 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11691
total_samples=7462, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:35,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.20 | bwd_microstep: 2057.63 | bwd_inner_microstep: 1817.02 | bwd_allreduce_microstep: 240.55 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13208
total_samples=7465, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:38,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 03:14:38,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.83 | bwd_microstep: 2116.07 | bwd_inner_microstep: 1641.92 | bwd_allreduce_microstep: 474.09 | step_microstep: 110.83
[2025-08-03 03:14:38,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.90 | bwd: 7642.26 | bwd_inner: 6739.46 | bwd_allreduce: 902.57 | step: 111.13
{'loss': 0.7836, 'learning_rate': 1.767199775670986e-05, 'epoch': 0.24}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12488
total_samples=7468, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:41,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.16 | bwd_microstep: 2087.73 | bwd_inner_microstep: 1801.58 | bwd_allreduce_microstep: 286.09 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13588
total_samples=7472, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:43,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.14 | bwd_microstep: 1756.38 | bwd_inner_microstep: 1675.16 | bwd_allreduce_microstep: 81.16 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13669
total_samples=7477, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:46,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.03 | bwd_microstep: 1973.36 | bwd_inner_microstep: 1809.24 | bwd_allreduce_microstep: 164.05 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11649
total_samples=7480, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:49,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.95
[2025-08-03 03:14:49,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.98 | bwd_microstep: 1739.16 | bwd_inner_microstep: 1527.97 | bwd_allreduce_microstep: 211.12 | step_microstep: 131.63
[2025-08-03 03:14:49,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2756.24 | bwd: 7556.68 | bwd_inner: 6813.94 | bwd_allreduce: 742.50 | step: 132.09
{'loss': 0.7812, 'learning_rate': 1.7661600880394113e-05, 'epoch': 0.25}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12222
total_samples=7483, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:51,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.64 | bwd_microstep: 1742.91 | bwd_inner_microstep: 1559.37 | bwd_allreduce_microstep: 183.47 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14031
total_samples=7487, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:54,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.49 | bwd_microstep: 1717.82 | bwd_inner_microstep: 1686.25 | bwd_allreduce_microstep: 31.52 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11850
total_samples=7490, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:56,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.77 | bwd_microstep: 1949.78 | bwd_inner_microstep: 1641.80 | bwd_allreduce_microstep: 307.91 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12197
total_samples=7493, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:14:59,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 03:14:59,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.97 | bwd_microstep: 1978.07 | bwd_inner_microstep: 1563.58 | bwd_allreduce_microstep: 414.43 | step_microstep: 119.17
[2025-08-03 03:14:59,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.80 | bwd: 7388.62 | bwd_inner: 6450.98 | bwd_allreduce: 937.40 | step: 119.52
{'loss': 0.7843, 'learning_rate': 1.7651183912423228e-05, 'epoch': 0.25}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13193
total_samples=7498, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:02,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.65 | bwd_microstep: 1890.87 | bwd_inner_microstep: 1660.11 | bwd_allreduce_microstep: 230.70 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11915
total_samples=7501, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:05,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.12 | bwd_microstep: 2016.67 | bwd_inner_microstep: 1805.20 | bwd_allreduce_microstep: 211.41 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12800
total_samples=7505, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:07,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.95 | bwd_microstep: 1870.19 | bwd_inner_microstep: 1635.25 | bwd_allreduce_microstep: 234.89 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12050
total_samples=7508, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:11,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 03:15:11,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.49 | bwd_microstep: 2356.64 | bwd_inner_microstep: 2114.20 | bwd_allreduce_microstep: 242.38 | step_microstep: 106.89
[2025-08-03 03:15:11,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2859.14 | bwd: 8134.43 | bwd_inner: 7214.75 | bwd_allreduce: 919.46 | step: 107.31
{'loss': 0.7841, 'learning_rate': 1.7640746880114505e-05, 'epoch': 0.25}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13373
total_samples=7512, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:13,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.88 | bwd_microstep: 1878.48 | bwd_inner_microstep: 1788.66 | bwd_allreduce_microstep: 89.76 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16093
total_samples=7516, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:16,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.21 | bwd_microstep: 2067.79 | bwd_inner_microstep: 1821.00 | bwd_allreduce_microstep: 246.73 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12363
total_samples=7519, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:19,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.99 | bwd_microstep: 1793.83 | bwd_inner_microstep: 1581.46 | bwd_allreduce_microstep: 212.30 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14500
total_samples=7523, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:22,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54
[2025-08-03 03:15:22,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.90 | bwd_microstep: 1976.01 | bwd_inner_microstep: 1759.87 | bwd_allreduce_microstep: 216.08 | step_microstep: 139.02
[2025-08-03 03:15:22,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.90 | bwd: 7716.17 | bwd_inner: 6950.98 | bwd_allreduce: 764.96 | step: 139.52
24%|██▍       | 489/2000 [1:31:42<4:34:48, 10.91s/it]
24%|██▍       | 490/2000 [1:31:53<4:34:06, 10.89s/it]
25%|██▍       | 491/2000 [1:32:03<4:33:03, 10.86s/it]
25%|██▍       | 491/2000 [1:32:04<4:33:03, 10.86s/it]
25%|██▍       | 492/2000 [1:32:14<4:31:16, 10.79s/it]
25%|██▍       | 493/2000 [1:32:26<4:35:45, 10.98s/it]
{'loss': 0.7921, 'learning_rate': 1.7630289810837836e-05, 'epoch': 0.25}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14273
total_samples=7527, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:24,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.41 | bwd_microstep: 1852.79 | bwd_inner_microstep: 1744.42 | bwd_allreduce_microstep: 108.31 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14625
total_samples=7531, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:27,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.57 | bwd_microstep: 2066.55 | bwd_inner_microstep: 1933.08 | bwd_allreduce_microstep: 133.41 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13688
total_samples=7535, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:30,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.99 | bwd_microstep: 1878.96 | bwd_inner_microstep: 1744.85 | bwd_allreduce_microstep: 134.03 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13609
total_samples=7539, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:33,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 03:15:33,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.62 | bwd_microstep: 1813.18 | bwd_inner_microstep: 1713.39 | bwd_allreduce_microstep: 99.72 | step_microstep: 111.78
[2025-08-03 03:15:33,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.53 | bwd: 7611.54 | bwd_inner: 7135.74 | bwd_allreduce: 475.55 | step: 112.37
{'loss': 0.7901, 'learning_rate': 1.7619812732015664e-05, 'epoch': 0.25}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13564
total_samples=7543, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:35,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.24 | bwd_microstep: 1992.42 | bwd_inner_microstep: 1715.84 | bwd_allreduce_microstep: 276.52 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12447
total_samples=7546, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:38,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.99 | bwd_microstep: 1822.33 | bwd_inner_microstep: 1594.16 | bwd_allreduce_microstep: 228.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14150
total_samples=7550, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:41,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.56 | bwd_microstep: 1772.36 | bwd_inner_microstep: 1718.46 | bwd_allreduce_microstep: 53.83 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12608
total_samples=7554, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:43,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 03:15:43,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.11 | bwd_microstep: 2038.94 | bwd_inner_microstep: 2017.00 | bwd_allreduce_microstep: 21.87 | step_microstep: 129.68
[2025-08-03 03:15:43,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.83 | bwd: 7626.11 | bwd_inner: 7045.46 | bwd_allreduce: 580.41 | step: 130.14
{'loss': 0.7779, 'learning_rate': 1.7609315671122912e-05, 'epoch': 0.25}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15596
total_samples=7558, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:46,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.38 | bwd_microstep: 1789.31 | bwd_inner_microstep: 1774.65 | bwd_allreduce_microstep: 14.60 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13680
total_samples=7562, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:49,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.65 | bwd_microstep: 1719.39 | bwd_inner_microstep: 1675.97 | bwd_allreduce_microstep: 43.35 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13982
total_samples=7566, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:52,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.46 | bwd_microstep: 2249.94 | bwd_inner_microstep: 2110.22 | bwd_allreduce_microstep: 139.67 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12437
total_samples=7569, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:55,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 03:15:55,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.56 | bwd_microstep: 2142.56 | bwd_inner_microstep: 1839.22 | bwd_allreduce_microstep: 303.27 | step_microstep: 110.13
[2025-08-03 03:15:55,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.98 | bwd: 7901.26 | bwd_inner: 7400.05 | bwd_allreduce: 500.97 | step: 110.50
{'loss': 0.7865, 'learning_rate': 1.75987986556869e-05, 'epoch': 0.25}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13995
total_samples=7573, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:15:57,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.28 | bwd_microstep: 1755.77 | bwd_inner_microstep: 1720.14 | bwd_allreduce_microstep: 35.57 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14380
total_samples=7577, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:00,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.18 | bwd_microstep: 2105.30 | bwd_inner_microstep: 1927.14 | bwd_allreduce_microstep: 178.10 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13424
total_samples=7581, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:03,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.44 | bwd_microstep: 2295.45 | bwd_inner_microstep: 2192.22 | bwd_allreduce_microstep: 103.16 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11765
total_samples=7584, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:06,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 03:16:06,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.90 | bwd_microstep: 2041.33 | bwd_inner_microstep: 1805.51 | bwd_allreduce_microstep: 235.75 | step_microstep: 135.08
[2025-08-03 03:16:06,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.73 | bwd: 8197.91 | bwd_inner: 7645.00 | bwd_allreduce: 552.66 | step: 135.59
{'loss': 0.7817, 'learning_rate': 1.758826171328727e-05, 'epoch': 0.25}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13205
total_samples=7588, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:09,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.02 | bwd_microstep: 1805.11 | bwd_inner_microstep: 1683.72 | bwd_allreduce_microstep: 121.31 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11699
total_samples=7591, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:11,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.56 | bwd_microstep: 1788.41 | bwd_inner_microstep: 1551.39 | bwd_allreduce_microstep: 236.96 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12221
total_samples=7595, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:14,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 928.57 | bwd_microstep: 1892.47 | bwd_inner_microstep: 1586.63 | bwd_allreduce_microstep: 305.78 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12828
total_samples=7598, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:17,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 03:16:17,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.44 | bwd_microstep: 1749.74 | bwd_inner_microstep: 1595.13 | bwd_allreduce_microstep: 154.55 | step_microstep: 140.49
[2025-08-03 03:16:17,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3006.53 | bwd: 7235.78 | bwd_inner: 6416.87 | bwd_allreduce: 818.69 | step: 140.93
{'loss': 0.7848, 'learning_rate': 1.7577704871555924e-05, 'epoch': 0.25}
25%|██▍       | 494/2000 [1:32:37<4:36:02, 11.00s/it]
25%|██▍       | 495/2000 [1:32:47<4:34:49, 10.96s/it]
25%|██▍       | 496/2000 [1:32:58<4:34:04, 10.93s/it]
25%|██▍       | 497/2000 [1:33:09<4:35:15, 10.99s/it]
25%|██▍       | 498/2000 [1:33:21<4:38:32, 11.13s/it]
25%|██▍       | 499/2000 [1:33:32<4:35:21, 11.01s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13759
total_samples=7603, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:19,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.80 | bwd_microstep: 1736.97 | bwd_inner_microstep: 1677.06 | bwd_allreduce_microstep: 59.84 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11420
total_samples=7606, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:22,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 941.25 | bwd_microstep: 1780.16 | bwd_inner_microstep: 1546.69 | bwd_allreduce_microstep: 233.41 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13361
total_samples=7610, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:25,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.00 | bwd_microstep: 1724.19 | bwd_inner_microstep: 1672.28 | bwd_allreduce_microstep: 51.84 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11736
total_samples=7613, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:27,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 03:16:27,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.35 | bwd_microstep: 1955.82 | bwd_inner_microstep: 1751.90 | bwd_allreduce_microstep: 203.86 | step_microstep: 133.02
[2025-08-03 03:16:27,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3054.34 | bwd: 7197.18 | bwd_inner: 6647.92 | bwd_allreduce: 549.03 | step: 133.38
{'loss': 0.7825, 'learning_rate': 1.7567128158176955e-05, 'epoch': 0.25}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13487
total_samples=7617, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:30,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.05 | bwd_microstep: 1999.71 | bwd_inner_microstep: 1859.85 | bwd_allreduce_microstep: 139.80 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14467
total_samples=7621, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:33,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.13 | bwd_microstep: 1895.06 | bwd_inner_microstep: 1733.22 | bwd_allreduce_microstep: 161.78 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13828
total_samples=7625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:36,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.90 | bwd_microstep: 1834.72 | bwd_inner_microstep: 1733.56 | bwd_allreduce_microstep: 101.10 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11735
total_samples=7628, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:38,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 03:16:38,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.28 | bwd_microstep: 1888.96 | bwd_inner_microstep: 1755.78 | bwd_allreduce_microstep: 133.12 | step_microstep: 118.78
[2025-08-03 03:16:38,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.28 | bwd: 7618.50 | bwd_inner: 7082.40 | bwd_allreduce: 535.86 | step: 119.12
{'loss': 0.7835, 'learning_rate': 1.7556531600886554e-05, 'epoch': 0.25}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12951
total_samples=7632, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:41,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.36 | bwd_microstep: 1825.32 | bwd_inner_microstep: 1757.51 | bwd_allreduce_microstep: 67.74 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13652
total_samples=7636, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:44,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.99 | bwd_microstep: 2038.60 | bwd_inner_microstep: 1897.51 | bwd_allreduce_microstep: 141.02 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13923
total_samples=7640, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:46,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.19 | bwd_microstep: 1841.88 | bwd_inner_microstep: 1677.80 | bwd_allreduce_microstep: 164.01 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13401
total_samples=7644, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:49,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:16:49,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.03 | bwd_microstep: 1996.81 | bwd_inner_microstep: 1862.29 | bwd_allreduce_microstep: 134.46 | step_microstep: 109.22
[2025-08-03 03:16:49,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.51 | bwd: 7702.65 | bwd_inner: 7195.10 | bwd_allreduce: 507.31 | step: 109.68
{'loss': 0.7713, 'learning_rate': 1.7545915227472967e-05, 'epoch': 0.25}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11760
total_samples=7647, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:52,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.78 | bwd_microstep: 1797.85 | bwd_inner_microstep: 1567.12 | bwd_allreduce_microstep: 230.66 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13699
total_samples=7651, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:55,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.88 | bwd_microstep: 2060.62 | bwd_inner_microstep: 2054.54 | bwd_allreduce_microstep: 6.02 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13411
total_samples=7655, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:16:57,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.37 | bwd_microstep: 1738.27 | bwd_inner_microstep: 1694.68 | bwd_allreduce_microstep: 43.53 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13931
total_samples=7659, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:00,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 03:17:00,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.71 | bwd_microstep: 1805.03 | bwd_inner_microstep: 1749.01 | bwd_allreduce_microstep: 55.94 | step_microstep: 136.41
[2025-08-03 03:17:00,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2749.68 | bwd: 7401.82 | bwd_inner: 7065.34 | bwd_allreduce: 336.23 | step: 136.88
{'loss': 0.7858, 'learning_rate': 1.753527906577638e-05, 'epoch': 0.25}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11678
total_samples=7662, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:03,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.66 | bwd_microstep: 1854.25 | bwd_inner_microstep: 1537.94 | bwd_allreduce_microstep: 316.24 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15888
total_samples=7666, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:06,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.66 | bwd_microstep: 2092.47 | bwd_inner_microstep: 1819.14 | bwd_allreduce_microstep: 273.26 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11633
total_samples=7669, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:08,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.76 | bwd_microstep: 2022.11 | bwd_inner_microstep: 1824.69 | bwd_allreduce_microstep: 197.35 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14180
total_samples=7673, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:11,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 03:17:11,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.06 | bwd_microstep: 1802.13 | bwd_inner_microstep: 1724.75 | bwd_allreduce_microstep: 77.32 | step_microstep: 135.03
[2025-08-03 03:17:11,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.08 | bwd: 7771.01 | bwd_inner: 6906.52 | bwd_allreduce: 864.25 | step: 135.40
{'loss': 0.7927, 'learning_rate': 1.7524623143688905e-05, 'epoch': 0.25}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11962
total_samples=7676, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:14,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.42 | bwd_microstep: 2155.09 | bwd_inner_microstep: 1896.02 | bwd_allreduce_microstep: 259.00 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15613
total_samples=7681, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:17,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.52 | bwd_microstep: 1883.48 | bwd_inner_microstep: 1735.90 | bwd_allreduce_microstep: 147.52 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13516
total_samples=7685, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:19,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.91 | bwd_microstep: 1748.71 | bwd_inner_microstep: 1697.86 | bwd_allreduce_microstep: 50.78 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15835
total_samples=7690, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:22,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 03:17:22,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.84 | bwd_microstep: 2026.79 | bwd_inner_microstep: 1902.86 | bwd_allreduce_microstep: 123.86 | step_microstep: 111.76
[2025-08-03 03:17:22,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.62 | bwd: 7814.11 | bwd_inner: 7232.64 | bwd_allreduce: 581.24 | step: 112.21
 25%|██▌       | 500/2000 [1:33:42<4:33:01, 10.92s/it]
 25%|██▌       | 501/2000 [1:33:53<4:32:35, 10.91s/it]
 25%|██▌       | 502/2000 [1:34:04<4:32:44, 10.92s/it]
 25%|██▌       | 503/2000 [1:34:15<4:30:22, 10.84s/it]
 25%|██▌       | 504/2000 [1:34:26<4:31:49, 10.90s/it]
 25%|██▌       | 505/2000 [1:34:37<4:33:00, 10.96s/it]
{'loss': 0.7912, 'learning_rate': 1.7513947489154443e-05, 'epoch': 0.25}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13711
total_samples=7693, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:25,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.67 | bwd_microstep: 1728.92 | bwd_inner_microstep: 1622.57 | bwd_allreduce_microstep: 106.29 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11624
total_samples=7696, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:27,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.68 | bwd_microstep: 1768.72 | bwd_inner_microstep: 1542.91 | bwd_allreduce_microstep: 225.75 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12628
total_samples=7700, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:30,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.20 | bwd_microstep: 2061.84 | bwd_inner_microstep: 1730.90 | bwd_allreduce_microstep: 330.88 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13706
total_samples=7704, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:33,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 03:17:33,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.49 | bwd_microstep: 1838.67 | bwd_inner_microstep: 1645.12 | bwd_allreduce_microstep: 193.48 | step_microstep: 151.79
[2025-08-03 03:17:33,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.98 | bwd: 7398.20 | bwd_inner: 6541.49 | bwd_allreduce: 856.47 | step: 152.11
{'loss': 0.7828, 'learning_rate': 1.7503252130168657e-05, 'epoch': 0.25}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15020
total_samples=7711, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:36,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.11 | bwd_microstep: 2043.50 | bwd_inner_microstep: 1876.21 | bwd_allreduce_microstep: 167.22 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11594
total_samples=7714, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:38,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.59 | bwd_microstep: 1791.37 | bwd_inner_microstep: 1542.44 | bwd_allreduce_microstep: 248.87 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11826
total_samples=7717, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:41,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.74 | bwd_microstep: 1947.70 | bwd_inner_microstep: 1741.50 | bwd_allreduce_microstep: 206.14 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11790
total_samples=7720, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:44,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 03:17:44,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.66 | bwd_microstep: 2036.33 | bwd_inner_microstep: 1812.49 | bwd_allreduce_microstep: 223.77 | step_microstep: 112.31
[2025-08-03 03:17:44,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.04 | bwd: 7818.95 | bwd_inner: 6972.64 | bwd_allreduce: 846.08 | step: 112.74
{'loss': 0.7615, 'learning_rate': 1.749253709477888e-05, 'epoch': 0.25}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14827
total_samples=7724, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:46,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.52 | bwd_microstep: 1845.39 | bwd_inner_microstep: 1787.67 | bwd_allreduce_microstep: 57.65 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14025
total_samples=7728, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:49,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.11 | bwd_microstep: 1738.68 | bwd_inner_microstep: 1695.09 | bwd_allreduce_microstep: 43.53 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12507
total_samples=7731, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:52,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.73 | bwd_microstep: 1862.82 | bwd_inner_microstep: 1745.40 | bwd_allreduce_microstep: 117.34 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11716
total_samples=7734, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:54,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 03:17:54,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.40 | bwd_microstep: 1757.38 | bwd_inner_microstep: 1535.28 | bwd_allreduce_microstep: 222.03 | step_microstep: 121.05
[2025-08-03 03:17:54,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.67 | bwd: 7204.32 | bwd_inner: 6763.44 | bwd_allreduce: 440.62 | step: 121.49
{'loss': 0.7847, 'learning_rate': 1.748180241108404e-05, 'epoch': 0.25}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12079
total_samples=7738, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:17:57,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.67 | bwd_microstep: 2094.75 | bwd_inner_microstep: 1877.07 | bwd_allreduce_microstep: 217.61 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13252
total_samples=7742, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:00,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.05 | bwd_microstep: 1754.55 | bwd_inner_microstep: 1672.05 | bwd_allreduce_microstep: 82.44 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13541
total_samples=7746, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:02,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.17 | bwd_microstep: 1765.75 | bwd_inner_microstep: 1696.37 | bwd_allreduce_microstep: 69.32 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11620
total_samples=7749, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:05,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 03:18:05,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.19 | bwd_microstep: 2119.13 | bwd_inner_microstep: 1953.52 | bwd_allreduce_microstep: 165.55 | step_microstep: 110.75
[2025-08-03 03:18:05,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.02 | bwd: 7734.22 | bwd_inner: 7199.00 | bwd_allreduce: 534.99 | step: 111.08
{'loss': 0.7807, 'learning_rate': 1.74710481072346e-05, 'epoch': 0.25}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14063
total_samples=7753, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:08,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.04 | bwd_microstep: 1724.74 | bwd_inner_microstep: 1686.38 | bwd_allreduce_microstep: 38.30 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15458
total_samples=7758, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:10,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.96 | bwd_microstep: 1753.69 | bwd_inner_microstep: 1747.44 | bwd_allreduce_microstep: 6.18 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12628
total_samples=7762, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:13,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.39 | bwd_microstep: 1774.24 | bwd_inner_microstep: 1600.46 | bwd_allreduce_microstep: 173.72 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13723
total_samples=7766, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:16,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 03:18:16,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.82 | bwd_microstep: 2103.08 | bwd_inner_microstep: 2009.95 | bwd_allreduce_microstep: 93.07 | step_microstep: 128.91
[2025-08-03 03:18:16,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.13 | bwd: 7355.79 | bwd_inner: 7044.22 | bwd_allreduce: 311.35 | step: 129.24
{'loss': 0.7708, 'learning_rate': 1.7460274211432463e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12145
total_samples=7770, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:18,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.03 | bwd_microstep: 1806.02 | bwd_inner_microstep: 1570.54 | bwd_allreduce_microstep: 235.41 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11758
total_samples=7773, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:21,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.50 | bwd_microstep: 2004.60 | bwd_inner_microstep: 1810.38 | bwd_allreduce_microstep: 194.16 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13461
total_samples=7777, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:24,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.03 | bwd_microstep: 1930.44 | bwd_inner_microstep: 1694.75 | bwd_allreduce_microstep: 235.63 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13338
total_samples=7781, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:27,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:18:27,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.56 | bwd_microstep: 1904.41 | bwd_inner_microstep: 1717.50 | bwd_allreduce_microstep: 186.85 | step_microstep: 152.50
[2025-08-03 03:18:27,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.04 | bwd: 7645.52 | bwd_inner: 6793.16 | bwd_allreduce: 852.11 | step: 152.82
 25%|██▌       | 506/2000 [1:34:48<4:30:22, 10.86s/it]
 25%|██▌       | 507/2000 [1:34:59<4:31:19, 10.90s/it]
 25%|██▌       | 508/2000 [1:35:09<4:27:37, 10.76s/it]
 25%|██▌       | 509/2000 [1:35:20<4:29:26, 10.84s/it]
 26%|██▌       | 510/2000 [1:35:31<4:27:23, 10.77s/it]
 26%|██▌       | 511/2000 [1:35:42<4:28:08, 10.81s/it]
{'loss': 0.7669, 'learning_rate': 1.7449480751930915e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11575
total_samples=7784, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:29,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.01 | bwd_microstep: 1844.33 | bwd_inner_microstep: 1612.58 | bwd_allreduce_microstep: 231.68 | step_microstep: 0.22
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12327
total_samples=7788, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:32,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.42 | bwd_microstep: 1834.12 | bwd_inner_microstep: 1719.87 | bwd_allreduce_microstep: 114.19 | step_microstep: 0.09
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12804
total_samples=7792, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:35,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.95 | bwd_microstep: 2095.05 | bwd_inner_microstep: 1925.66 | bwd_allreduce_microstep: 169.32 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12124
total_samples=7795, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:38,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 03:18:38,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.38 | bwd_microstep: 1822.36 | bwd_inner_microstep: 1593.19 | bwd_allreduce_microstep: 229.11 | step_microstep: 133.20
[2025-08-03 03:18:38,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.69 | bwd: 7595.90 | bwd_inner: 6851.30 | bwd_allreduce: 744.37 | step: 133.66
{'loss': 0.7803, 'learning_rate': 1.7438667757034547e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12199
total_samples=7798, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:40,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.44 | bwd_microstep: 1765.99 | bwd_inner_microstep: 1559.38 | bwd_allreduce_microstep: 206.54 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13533
total_samples=7802, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:43,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.59 | bwd_microstep: 2057.20 | bwd_inner_microstep: 1909.19 | bwd_allreduce_microstep: 147.96 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11736
total_samples=7805, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:46,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.67 | bwd_microstep: 1991.13 | bwd_inner_microstep: 1790.36 | bwd_allreduce_microstep: 200.70 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12571
total_samples=7810, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:48,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 03:18:48,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.38 | bwd_microstep: 1736.12 | bwd_inner_microstep: 1573.33 | bwd_allreduce_microstep: 162.73 | step_microstep: 149.13
[2025-08-03 03:18:48,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.00 | bwd: 7550.50 | bwd_inner: 6832.25 | bwd_allreduce: 718.00 | step: 149.59
{'loss': 0.7702, 'learning_rate': 1.7427835255099173e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12257
total_samples=7814, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:51,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.72 | bwd_microstep: 2155.35 | bwd_inner_microstep: 1926.32 | bwd_allreduce_microstep: 228.95 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14034
total_samples=7818, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:54,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.85 | bwd_microstep: 1848.98 | bwd_inner_microstep: 1776.77 | bwd_allreduce_microstep: 72.15 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13657
total_samples=7822, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:57,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.81 | bwd_microstep: 2042.61 | bwd_inner_microstep: 1929.94 | bwd_allreduce_microstep: 112.61 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11971
total_samples=7825, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:18:59,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:18:59,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.67 | bwd_microstep: 1770.45 | bwd_inner_microstep: 1551.23 | bwd_allreduce_microstep: 219.15 | step_microstep: 138.24
[2025-08-03 03:18:59,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.98 | bwd: 7817.42 | bwd_inner: 7184.25 | bwd_allreduce: 632.93 | step: 138.58
{'loss': 0.7723, 'learning_rate': 1.7416983274531777e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11725
total_samples=7828, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:02,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.64 | bwd_microstep: 1808.63 | bwd_inner_microstep: 1560.12 | bwd_allreduce_microstep: 248.43 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=7831, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:05,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.65 | bwd_microstep: 2124.01 | bwd_inner_microstep: 1813.04 | bwd_allreduce_microstep: 310.90 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11963
total_samples=7834, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:08,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.58 | bwd_microstep: 1797.95 | bwd_inner_microstep: 1596.53 | bwd_allreduce_microstep: 201.36 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11928
total_samples=7837, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:11,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.82
[2025-08-03 03:19:11,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.80 | bwd_microstep: 2079.35 | bwd_inner_microstep: 1862.40 | bwd_allreduce_microstep: 216.89 | step_microstep: 157.14
[2025-08-03 03:19:11,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.59 | bwd: 7809.98 | bwd_inner: 6832.09 | bwd_allreduce: 977.67 | step: 157.60
{'loss': 0.7839, 'learning_rate': 1.74061118437904e-05, 'epoch': 0.26}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13239
total_samples=7841, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:13,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.99 | bwd_microstep: 1909.25 | bwd_inner_microstep: 1903.07 | bwd_allreduce_microstep: 6.12 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12812
total_samples=7845, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:16,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.52 | bwd_microstep: 1997.22 | bwd_inner_microstep: 1847.83 | bwd_allreduce_microstep: 149.32 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13436
total_samples=7849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:19,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.22 | bwd_microstep: 2072.55 | bwd_inner_microstep: 1942.57 | bwd_allreduce_microstep: 129.91 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12705
total_samples=7853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:22,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 03:19:22,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.92 | bwd_microstep: 1801.71 | bwd_inner_microstep: 1639.05 | bwd_allreduce_microstep: 162.59 | step_microstep: 127.87
[2025-08-03 03:19:22,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.58 | bwd: 7780.78 | bwd_inner: 7332.52 | bwd_allreduce: 448.03 | step: 128.22
{'loss': 0.7888, 'learning_rate': 1.739522099138411e-05, 'epoch': 0.26}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14113
total_samples=7857, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:24,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.75 | bwd_microstep: 1819.60 | bwd_inner_microstep: 1737.19 | bwd_allreduce_microstep: 82.34 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12770
total_samples=7861, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:27,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.88 | bwd_microstep: 1758.53 | bwd_inner_microstep: 1634.01 | bwd_allreduce_microstep: 124.47 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13302
total_samples=7865, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:30,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 950.98 | bwd_microstep: 1851.93 | bwd_inner_microstep: 1702.06 | bwd_allreduce_microstep: 149.81 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15090
total_samples=7869, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:33,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.94
[2025-08-03 03:19:33,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.20 | bwd_microstep: 2027.78 | bwd_inner_microstep: 1756.24 | bwd_allreduce_microstep: 271.47 | step_microstep: 112.26
[2025-08-03 03:19:33,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3068.75 | bwd: 7457.89 | bwd_inner: 6829.50 | bwd_allreduce: 628.16 | step: 112.70
 26%|██▌       | 511/2000 [1:35:42<4:28:08, 10.81s/it]
 26%|██▌       | 512/2000 [1:35:52<4:28:31, 10.83s/it]
 26%|██▌       | 513/2000 [1:36:03<4:28:10, 10.82s/it]
 26%|██▌       | 514/2000 [1:36:14<4:30:04, 10.91s/it]
 26%|██▌       | 515/2000 [1:36:25<4:31:33, 10.97s/it]
 26%|██▌       | 516/2000 [1:36:36<4:31:52, 10.99s/it]
{'loss': 0.7881, 'learning_rate': 1.7384310745872896e-05, 'epoch': 0.26}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14184
total_samples=7873, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:35,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.77 | bwd_microstep: 1793.28 | bwd_inner_microstep: 1754.80 | bwd_allreduce_microstep: 38.42 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14244
total_samples=7878, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:38,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.17 | bwd_microstep: 1793.26 | bwd_inner_microstep: 1707.23 | bwd_allreduce_microstep: 85.96 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14022
total_samples=7882, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:41,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.47 | bwd_microstep: 2065.98 | bwd_inner_microstep: 1895.42 | bwd_allreduce_microstep: 170.50 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13558
total_samples=7886, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:44,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 03:19:44,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 919.35 | bwd_microstep: 1976.75 | bwd_inner_microstep: 1834.97 | bwd_allreduce_microstep: 141.71 | step_microstep: 115.46
[2025-08-03 03:19:44,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2998.69 | bwd: 7629.31 | bwd_inner: 7192.40 | bwd_allreduce: 436.67 | step: 115.90
{'loss': 0.7828, 'learning_rate': 1.7373381135867605e-05, 'epoch': 0.26}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14291
total_samples=7890, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:46,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.41 | bwd_microstep: 2013.32 | bwd_inner_microstep: 1863.31 | bwd_allreduce_microstep: 149.94 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13718
total_samples=7894, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:49,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.89 | bwd_microstep: 2101.05 | bwd_inner_microstep: 1963.86 | bwd_allreduce_microstep: 137.13 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13366
total_samples=7899, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:52,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.55 | bwd_microstep: 1714.64 | bwd_inner_microstep: 1663.47 | bwd_allreduce_microstep: 51.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14406
total_samples=7903, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:54,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 03:19:54,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.37 | bwd_microstep: 1762.00 | bwd_inner_microstep: 1730.88 | bwd_allreduce_microstep: 31.06 | step_microstep: 122.83
[2025-08-03 03:19:54,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2740.16 | bwd: 7591.06 | bwd_inner: 7221.52 | bwd_allreduce: 369.31 | step: 123.15
{'loss': 0.7785, 'learning_rate': 1.7362432190029862e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11905
total_samples=7906, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:19:57,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.65 | bwd_microstep: 2042.03 | bwd_inner_microstep: 1830.92 | bwd_allreduce_microstep: 211.05 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13713
total_samples=7910, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:00,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.25 | bwd_microstep: 1776.35 | bwd_inner_microstep: 1707.57 | bwd_allreduce_microstep: 68.72 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14799
total_samples=7915, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:03,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.40 | bwd_microstep: 2120.05 | bwd_inner_microstep: 1951.28 | bwd_allreduce_microstep: 168.70 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13362
total_samples=7919, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:06,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 03:20:06,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.75 | bwd_microstep: 1929.60 | bwd_inner_microstep: 1835.12 | bwd_allreduce_microstep: 94.42 | step_microstep: 108.44
[2025-08-03 03:20:06,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.98 | bwd: 7868.07 | bwd_inner: 7324.88 | bwd_allreduce: 542.96 | step: 108.77
{'loss': 0.775, 'learning_rate': 1.7351463937072008e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885
total_samples=7922, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:08,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 663.82 | bwd_microstep: 1892.16 | bwd_inner_microstep: 1753.61 | bwd_allreduce_microstep: 138.49 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13600
total_samples=7926, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:11,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.50 | bwd_microstep: 1883.29 | bwd_inner_microstep: 1822.28 | bwd_allreduce_microstep: 60.94 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12773
total_samples=7930, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:14,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.87 | bwd_microstep: 2090.68 | bwd_inner_microstep: 1925.89 | bwd_allreduce_microstep: 164.73 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13381
total_samples=7934, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:17,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 03:20:17,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.66 | bwd_microstep: 1971.71 | bwd_inner_microstep: 1868.95 | bwd_allreduce_microstep: 102.70 | step_microstep: 137.65
[2025-08-03 03:20:17,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2747.78 | bwd: 7837.88 | bwd_inner: 7370.72 | bwd_allreduce: 466.94 | step: 138.08
{'loss': 0.7754, 'learning_rate': 1.7340476405757e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12360
total_samples=7937, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:20,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.42 | bwd_microstep: 2045.31 | bwd_inner_microstep: 1818.14 | bwd_allreduce_microstep: 227.11 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13360
total_samples=7941, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:22,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.87 | bwd_microstep: 2052.96 | bwd_inner_microstep: 1694.39 | bwd_allreduce_microstep: 358.51 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13497
total_samples=7945, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:25,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.07 | bwd_microstep: 1715.34 | bwd_inner_microstep: 1662.15 | bwd_allreduce_microstep: 53.13 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12725
total_samples=7949, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:27,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:20:27,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.13 | bwd_microstep: 1745.36 | bwd_inner_microstep: 1647.96 | bwd_allreduce_microstep: 97.34 | step_microstep: 112.65
[2025-08-03 03:20:27,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.43 | bwd: 7559.01 | bwd_inner: 6822.61 | bwd_allreduce: 736.15 | step: 113.10
{'loss': 0.7688, 'learning_rate': 1.732946962489836e-05, 'epoch': 0.26}
 26%|██▌       | 517/2000 [1:36:47<4:31:37, 10.99s/it]
 26%|██▌       | 518/2000 [1:36:59<4:32:01, 11.01s/it]
 26%|██▌       | 519/2000 [1:37:09<4:30:09, 10.95s/it]
 26%|██▌       | 520/2000 [1:37:20<4:31:26, 11.00s/it]
 26%|██▌       | 521/2000 [1:37:32<4:31:38, 11.02s/it]
 26%|██▌       | 522/2000 [1:37:42<4:29:55, 10.96s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11681
total_samples=7952, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:30,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.15 | bwd_microstep: 1825.42 | bwd_inner_microstep: 1691.82 | bwd_allreduce_microstep: 133.52 | step_microstep: 0.14
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12871
total_samples=7956, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:33,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 666.18 | bwd_microstep: 1974.96 | bwd_inner_microstep: 1835.65 | bwd_allreduce_microstep: 139.24 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12409
total_samples=7960, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:35,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1771.36 | bwd_inner_microstep: 1584.56 | bwd_allreduce_microstep: 186.72 | step_microstep: 0.15
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13094
total_samples=7964, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:38,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 03:20:38,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.26 | bwd_microstep: 1823.08 | bwd_inner_microstep: 1687.63 | bwd_allreduce_microstep: 135.39 | step_microstep: 134.56
[2025-08-03 03:20:38,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.13 | bwd: 7394.86 | bwd_inner: 6799.67 | bwd_allreduce: 594.96 | step: 134.96
{'loss': 0.7842, 'learning_rate': 1.7318443623360092e-05, 'epoch': 0.26}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13358
total_samples=7968, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:41,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.67 | bwd_microstep: 1803.11 | bwd_inner_microstep: 1704.29 | bwd_allreduce_microstep: 98.76 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11623
total_samples=7971, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:43,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.63 | bwd_microstep: 1789.92 | bwd_inner_microstep: 1575.39 | bwd_allreduce_microstep: 214.46 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11827
total_samples=7974, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:46,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.69 | bwd_microstep: 1847.82 | bwd_inner_microstep: 1546.15 | bwd_allreduce_microstep: 301.60 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11695
total_samples=7977, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:49,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.82
[2025-08-03 03:20:49,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.59 | bwd_microstep: 2027.82 | bwd_inner_microstep: 1802.39 | bwd_allreduce_microstep: 225.36 | step_microstep: 124.59
[2025-08-03 03:20:49,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.53 | bwd: 7468.71 | bwd_inner: 6628.22 | bwd_allreduce: 840.26 | step: 124.90
{'loss': 0.785, 'learning_rate': 1.7307398430056595e-05, 'epoch': 0.26}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13154
total_samples=7981, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:52,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.37 | bwd_microstep: 1981.41 | bwd_inner_microstep: 1847.97 | bwd_allreduce_microstep: 133.38 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13261
total_samples=7985, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:54,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.81 | bwd_microstep: 1844.43 | bwd_inner_microstep: 1698.85 | bwd_allreduce_microstep: 145.51 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14519
total_samples=7989, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:20:57,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.27 | bwd_microstep: 1802.15 | bwd_inner_microstep: 1757.79 | bwd_allreduce_microstep: 44.30 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13690
total_samples=7994, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:00,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 03:21:00,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.56 | bwd_microstep: 2000.05 | bwd_inner_microstep: 1903.56 | bwd_allreduce_microstep: 96.44 | step_microstep: 126.61
[2025-08-03 03:21:00,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.94 | bwd: 7628.09 | bwd_inner: 7208.17 | bwd_allreduce: 419.69 | step: 126.94
{'loss': 0.7852, 'learning_rate': 1.7296334073952606e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11717
total_samples=7997, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:03,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.38 | bwd_microstep: 1986.10 | bwd_inner_microstep: 1786.69 | bwd_allreduce_microstep: 199.34 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12225
total_samples=8000, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:05,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.38 | bwd_microstep: 1874.86 | bwd_inner_microstep: 1738.17 | bwd_allreduce_microstep: 136.62 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13256
total_samples=8004, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:08,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.48 | bwd_microstep: 1804.21 | bwd_inner_microstep: 1694.06 | bwd_allreduce_microstep: 110.08 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12948
total_samples=8008, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:10,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 03:21:11,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.78 | bwd_microstep: 1797.18 | bwd_inner_microstep: 1641.44 | bwd_allreduce_microstep: 155.68 | step_microstep: 130.84
[2025-08-03 03:21:11,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.96 | bwd: 7462.40 | bwd_inner: 6860.34 | bwd_allreduce: 601.80 | step: 131.17
{'loss': 0.782, 'learning_rate': 1.72852505840631e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12124
total_samples=8011, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:15,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1958.53 | bwd_microstep: 2368.49 | bwd_inner_microstep: 2091.49 | bwd_allreduce_microstep: 276.94 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11580
total_samples=8014, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:18,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.55 | bwd_microstep: 1887.02 | bwd_inner_microstep: 1665.36 | bwd_allreduce_microstep: 221.61 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14200
total_samples=8019, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:20,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.26 | bwd_microstep: 2143.70 | bwd_inner_microstep: 1991.50 | bwd_allreduce_microstep: 152.13 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11737
total_samples=8022, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:23,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 03:21:23,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.06 | bwd_microstep: 1942.39 | bwd_inner_microstep: 1601.17 | bwd_allreduce_microstep: 341.15 | step_microstep: 140.55
[2025-08-03 03:21:23,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4082.34 | bwd: 8341.66 | bwd_inner: 7349.52 | bwd_allreduce: 991.91 | step: 140.88
{'loss': 0.7791, 'learning_rate': 1.7274147989453246e-05, 'epoch': 0.26}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13246
total_samples=8026, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:26,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.58 | bwd_microstep: 1822.00 | bwd_inner_microstep: 1791.33 | bwd_allreduce_microstep: 30.61 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15664
total_samples=8030, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:29,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.89 | bwd_microstep: 2149.05 | bwd_inner_microstep: 2005.37 | bwd_allreduce_microstep: 143.62 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12993
total_samples=8034, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:31,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.52 | bwd_microstep: 1725.85 | bwd_inner_microstep: 1658.50 | bwd_allreduce_microstep: 67.29 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12991
total_samples=8037, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:34,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 03:21:34,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.54 | bwd_microstep: 2188.75 | bwd_inner_microstep: 1841.11 | bwd_allreduce_microstep: 347.58 | step_microstep: 111.47
[2025-08-03 03:21:34,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.47 | bwd: 7885.70 | bwd_inner: 7296.30 | bwd_allreduce: 589.19 | step: 111.89
 26%|██▌       | 523/2000 [1:37:53<4:27:25, 10.86s/it]
 26%|██▌       | 524/2000 [1:38:04<4:26:32, 10.84s/it]
 26%|██▋       | 525/2000 [1:38:15<4:27:01, 10.86s/it]
 26%|██▋       | 526/2000 [1:38:25<4:25:28, 10.81s/it]
 26%|██▋       | 527/2000 [1:38:38<4:40:26, 11.42s/it]
 26%|██▋       | 528/2000 [1:38:49<4:37:58, 11.33s/it]
{'loss': 0.7708, 'learning_rate': 1.72630263192383e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11698
total_samples=8040, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:37,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.76 | bwd_microstep: 1801.03 | bwd_inner_microstep: 1539.64 | bwd_allreduce_microstep: 261.33 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12065
total_samples=8043, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:40,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.37 | bwd_microstep: 1771.56 | bwd_inner_microstep: 1558.97 | bwd_allreduce_microstep: 212.52 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14987
total_samples=8048, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:42,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.42 | bwd_microstep: 1920.01 | bwd_inner_microstep: 1907.25 | bwd_allreduce_microstep: 12.71 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14433
total_samples=8054, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:45,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 03:21:45,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.01 | bwd_microstep: 1809.10 | bwd_inner_microstep: 1728.73 | bwd_allreduce_microstep: 80.30 | step_microstep: 129.55
[2025-08-03 03:21:45,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.49 | bwd: 7301.75 | bwd_inner: 6734.59 | bwd_allreduce: 566.93 | step: 129.86
{'loss': 0.7794, 'learning_rate': 1.7251885602583547e-05, 'epoch': 0.26}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11785
total_samples=8057, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:48,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.34 | bwd_microstep: 1800.82 | bwd_inner_microstep: 1560.13 | bwd_allreduce_microstep: 240.63 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13811
total_samples=8061, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:50,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.87 | bwd_microstep: 1837.24 | bwd_inner_microstep: 1705.16 | bwd_allreduce_microstep: 132.01 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14201
total_samples=8066, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:53,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.55 | bwd_microstep: 1699.00 | bwd_inner_microstep: 1682.54 | bwd_allreduce_microstep: 16.40 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13721
total_samples=8070, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:55,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 03:21:55,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.31 | bwd_microstep: 1836.98 | bwd_inner_microstep: 1702.73 | bwd_allreduce_microstep: 134.18 | step_microstep: 132.40
[2025-08-03 03:21:55,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.00 | bwd: 7174.08 | bwd_inner: 6650.56 | bwd_allreduce: 523.30 | step: 132.75
{'loss': 0.7817, 'learning_rate': 1.7240725868704218e-05, 'epoch': 0.27}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13649
total_samples=8074, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:21:58,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.49 | bwd_microstep: 1995.68 | bwd_inner_microstep: 1676.72 | bwd_allreduce_microstep: 318.89 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12552
total_samples=8077, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:01,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.45 | bwd_microstep: 1726.32 | bwd_inner_microstep: 1560.92 | bwd_allreduce_microstep: 165.34 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13269
total_samples=8081, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:03,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.19 | bwd_microstep: 1841.36 | bwd_inner_microstep: 1791.74 | bwd_allreduce_microstep: 49.56 | step_microstep: 0.20
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12795
total_samples=8085, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:06,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 03:22:06,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.85 | bwd_microstep: 1854.10 | bwd_inner_microstep: 1633.31 | bwd_allreduce_microstep: 220.72 | step_microstep: 111.13
[2025-08-03 03:22:06,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.91 | bwd: 7417.51 | bwd_inner: 6662.69 | bwd_allreduce: 754.59 | step: 111.56
{'loss': 0.7671, 'learning_rate': 1.722954714686541e-05, 'epoch': 0.27}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11888
total_samples=8088, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:09,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.82 | bwd_microstep: 2034.10 | bwd_inner_microstep: 1803.33 | bwd_allreduce_microstep: 230.70 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13765
total_samples=8092, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:12,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.69 | bwd_microstep: 1828.04 | bwd_inner_microstep: 1684.91 | bwd_allreduce_microstep: 143.07 | step_microstep: 0.21
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14314
total_samples=8097, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:15,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.51 | bwd_microstep: 2194.52 | bwd_inner_microstep: 2025.42 | bwd_allreduce_microstep: 169.02 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13504
total_samples=8101, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:17,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 03:22:17,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.63 | bwd_microstep: 2060.80 | bwd_inner_microstep: 1917.91 | bwd_allreduce_microstep: 142.83 | step_microstep: 152.10
[2025-08-03 03:22:17,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.57 | bwd: 8117.50 | bwd_inner: 7431.55 | bwd_allreduce: 685.69 | step: 152.55
{'loss': 0.7842, 'learning_rate': 1.7218349466382024e-05, 'epoch': 0.27}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11974
total_samples=8104, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:20,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.26 | bwd_microstep: 1781.70 | bwd_inner_microstep: 1551.10 | bwd_allreduce_microstep: 230.54 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11907
total_samples=8107, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:23,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.63 | bwd_microstep: 1839.26 | bwd_inner_microstep: 1602.60 | bwd_allreduce_microstep: 236.60 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12330
total_samples=8110, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:25,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.18 | bwd_microstep: 1789.15 | bwd_inner_microstep: 1560.17 | bwd_allreduce_microstep: 228.92 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14262
total_samples=8115, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:28,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.82
[2025-08-03 03:22:28,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.42 | bwd_microstep: 1803.91 | bwd_inner_microstep: 1739.33 | bwd_allreduce_microstep: 64.52 | step_microstep: 403.15
[2025-08-03 03:22:28,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2862.42 | bwd: 7214.07 | bwd_inner: 6453.20 | bwd_allreduce: 760.64 | step: 403.47
{'loss': 0.783, 'learning_rate': 1.7207132856618668e-05, 'epoch': 0.27}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12081
total_samples=8119, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:31,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.06 | bwd_microstep: 1858.85 | bwd_inner_microstep: 1561.05 | bwd_allreduce_microstep: 297.72 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13373
total_samples=8123, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:34,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.86 | bwd_microstep: 2247.92 | bwd_inner_microstep: 1980.60 | bwd_allreduce_microstep: 267.27 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14057
total_samples=8127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:37,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.80 | bwd_microstep: 1823.32 | bwd_inner_microstep: 1686.93 | bwd_allreduce_microstep: 136.31 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11702
total_samples=8130, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:39,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 03:22:39,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.86 | bwd_microstep: 2014.21 | bwd_inner_microstep: 1671.30 | bwd_allreduce_microstep: 342.84 | step_microstep: 130.12
[2025-08-03 03:22:39,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.51 | bwd: 7944.35 | bwd_inner: 6899.88 | bwd_allreduce: 1044.23 | step: 130.63
 26%|██▋       | 528/2000 [1:38:49<4:37:58, 11.33s/it]
 26%|██▋       | 529/2000 [1:39:00<4:31:59, 11.09s/it]
 26%|██▋       | 530/2000 [1:39:10<4:26:56, 10.90s/it]
 27%|██▋       | 531/2000 [1:39:21<4:24:59, 10.82s/it]
 27%|██▋       | 532/2000 [1:39:32<4:28:52, 10.99s/it]
 27%|██▋       | 533/2000 [1:39:43<4:27:21, 10.94s/it]
 27%|██▋       | 534/2000 [1:39:54<4:29:01, 11.01s/it]
{'loss': 0.7739, 'learning_rate': 1.719589734698959e-05, 'epoch': 0.27}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13731
total_samples=8134, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:42,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.39 | bwd_microstep: 1756.51 | bwd_inner_microstep: 1708.32 | bwd_allreduce_microstep: 48.13 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12076
total_samples=8137, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:45,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.10 | bwd_microstep: 2344.86 | bwd_inner_microstep: 2082.14 | bwd_allreduce_microstep: 262.67 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15479
total_samples=8141, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:48,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.76 | bwd_microstep: 1779.63 | bwd_inner_microstep: 1768.91 | bwd_allreduce_microstep: 10.66 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11966
total_samples=8144, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:50,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 03:22:50,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.44 | bwd_microstep: 1895.03 | bwd_inner_microstep: 1758.46 | bwd_allreduce_microstep: 136.51 | step_microstep: 129.18
[2025-08-03 03:22:50,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.62 | bwd: 7776.08 | bwd_inner: 7317.82 | bwd_allreduce: 458.04 | step: 129.52
{'loss': 0.7801, 'learning_rate': 1.718464296695861e-05, 'epoch': 0.27}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13537
total_samples=8148, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:53,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.16 | bwd_microstep: 1705.97 | bwd_inner_microstep: 1671.90 | bwd_allreduce_microstep: 34.00 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13276
total_samples=8152, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:56,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.51 | bwd_microstep: 1842.26 | bwd_inner_microstep: 1682.78 | bwd_allreduce_microstep: 159.41 | step_microstep: 0.20
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12236
total_samples=8156, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:22:58,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.79 | bwd_microstep: 1895.38 | bwd_inner_microstep: 1875.94 | bwd_allreduce_microstep: 19.38 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11628
total_samples=8159, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:01,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 03:23:01,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.07 | bwd_microstep: 1985.66 | bwd_inner_microstep: 1755.06 | bwd_allreduce_microstep: 230.53 | step_microstep: 113.53
[2025-08-03 03:23:01,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.47 | bwd: 7429.32 | bwd_inner: 6985.67 | bwd_allreduce: 443.40 | step: 113.97
{'loss': 0.7722, 'learning_rate': 1.7173369746039026e-05, 'epoch': 0.27}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11789
total_samples=8162, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:04,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.49 | bwd_microstep: 1791.69 | bwd_inner_microstep: 1559.76 | bwd_allreduce_microstep: 231.86 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12552
total_samples=8165, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:06,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.87 | bwd_microstep: 1801.51 | bwd_inner_microstep: 1599.48 | bwd_allreduce_microstep: 201.97 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13550
total_samples=8169, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:09,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1747.22 | bwd_inner_microstep: 1683.86 | bwd_allreduce_microstep: 63.29 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11922
total_samples=8172, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:12,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 03:23:12,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.78 | bwd_microstep: 1931.74 | bwd_inner_microstep: 1791.69 | bwd_allreduce_microstep: 139.97 | step_microstep: 110.04
[2025-08-03 03:23:12,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.68 | bwd: 7272.21 | bwd_inner: 6634.80 | bwd_allreduce: 637.17 | step: 110.39
{'loss': 0.7875, 'learning_rate': 1.7162077713793547e-05, 'epoch': 0.27}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14030
total_samples=8176, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:14,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.52 | bwd_microstep: 2012.35 | bwd_inner_microstep: 1759.59 | bwd_allreduce_microstep: 252.70 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11707
total_samples=8179, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:17,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.42 | bwd_microstep: 1855.09 | bwd_inner_microstep: 1604.50 | bwd_allreduce_microstep: 250.53 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11714
total_samples=8182, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:20,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.95 | bwd_microstep: 1815.04 | bwd_inner_microstep: 1564.51 | bwd_allreduce_microstep: 250.45 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11695
total_samples=8185, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:23,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 03:23:23,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.03 | bwd_microstep: 2014.98 | bwd_inner_microstep: 1558.11 | bwd_allreduce_microstep: 456.77 | step_microstep: 116.86
[2025-08-03 03:23:23,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.87 | bwd: 7697.51 | bwd_inner: 6486.73 | bwd_allreduce: 1210.51 | step: 117.21
{'loss': 0.781, 'learning_rate': 1.7150766899834205e-05, 'epoch': 0.27}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11972
total_samples=8188, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:25,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.20 | bwd_microstep: 1983.74 | bwd_inner_microstep: 1849.66 | bwd_allreduce_microstep: 134.02 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13053
total_samples=8192, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:28,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.54 | bwd_microstep: 1989.44 | bwd_inner_microstep: 1831.41 | bwd_allreduce_microstep: 157.97 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11857
total_samples=8195, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:31,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.33 | bwd_microstep: 2017.30 | bwd_inner_microstep: 1544.40 | bwd_allreduce_microstep: 472.84 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12866
total_samples=8199, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:34,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 03:23:34,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.29 | bwd_microstep: 1740.37 | bwd_inner_microstep: 1597.19 | bwd_allreduce_microstep: 143.10 | step_microstep: 108.20
[2025-08-03 03:23:34,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.28 | bwd: 7730.90 | bwd_inner: 6822.66 | bwd_allreduce: 908.01 | step: 108.55
{'loss': 0.7739, 'learning_rate': 1.7139437333822303e-05, 'epoch': 0.27}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12057
total_samples=8202, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:37,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.88 | bwd_microstep: 2193.27 | bwd_inner_microstep: 1872.83 | bwd_allreduce_microstep: 320.38 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11984
total_samples=8205, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:39,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.88 | bwd_microstep: 1772.85 | bwd_inner_microstep: 1553.52 | bwd_allreduce_microstep: 219.26 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12150
total_samples=8208, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:42,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.49 | bwd_microstep: 1782.12 | bwd_inner_microstep: 1563.97 | bwd_allreduce_microstep: 218.09 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11930
total_samples=8211, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:45,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 03:23:45,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.50 | bwd_microstep: 2211.10 | bwd_inner_microstep: 1853.75 | bwd_allreduce_microstep: 357.29 | step_microstep: 110.42
[2025-08-03 03:23:45,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2845.68 | bwd: 7959.39 | bwd_inner: 6844.06 | bwd_allreduce: 1115.09 | step: 110.75
 27%|██▋       | 535/2000 [1:40:05<4:28:47, 11.01s/it]
 27%|██▋       | 536/2000 [1:40:16<4:26:04, 10.90s/it]
 27%|██▋       | 537/2000 [1:40:27<4:22:58, 10.79s/it]
 27%|██▋       | 538/2000 [1:40:37<4:24:08, 10.84s/it]
 27%|██▋       | 539/2000 [1:40:48<4:24:46, 10.87s/it]
{'loss': 0.7808, 'learning_rate': 1.7128089045468294e-05, 'epoch': 0.27}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13093
total_samples=8215, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:48,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.30 | bwd_microstep: 2032.22 | bwd_inner_microstep: 1673.22 | bwd_allreduce_microstep: 358.93 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11980
total_samples=8218, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:50,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.56 | bwd_microstep: 1852.03 | bwd_inner_microstep: 1598.43 | bwd_allreduce_microstep: 253.54 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15646
total_samples=8222, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:53,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.77 | bwd_microstep: 1803.63 | bwd_inner_microstep: 1778.39 | bwd_allreduce_microstep: 25.18 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14664
total_samples=8226, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:56,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 03:23:56,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.50 | bwd_microstep: 1836.65 | bwd_inner_microstep: 1758.68 | bwd_allreduce_microstep: 77.90 | step_microstep: 137.71
[2025-08-03 03:23:56,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.07 | bwd: 7524.58 | bwd_inner: 6808.72 | bwd_allreduce: 715.63 | step: 138.16
{'loss': 0.7847, 'learning_rate': 1.711672206453175e-05, 'epoch': 0.27}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15695
total_samples=8230, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:23:59,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.58 | bwd_microstep: 2044.78 | bwd_inner_microstep: 1895.42 | bwd_allreduce_microstep: 149.29 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11752
total_samples=8233, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:01,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.10 | bwd_microstep: 2088.48 | bwd_inner_microstep: 1843.55 | bwd_allreduce_microstep: 244.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13533
total_samples=8237, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:04,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.02 | bwd_microstep: 1781.86 | bwd_inner_microstep: 1711.41 | bwd_allreduce_microstep: 70.38 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13316
total_samples=8241, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:07,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 03:24:07,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.07 | bwd_microstep: 1790.54 | bwd_inner_microstep: 1700.28 | bwd_allreduce_microstep: 90.20 | step_microstep: 113.95
[2025-08-03 03:24:07,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.70 | bwd: 7705.70 | bwd_inner: 7150.66 | bwd_allreduce: 554.81 | step: 114.26
{'loss': 0.7828, 'learning_rate': 1.7105336420821247e-05, 'epoch': 0.27}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13962
total_samples=8245, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:09,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.27 | bwd_microstep: 2051.02 | bwd_inner_microstep: 2044.94 | bwd_allreduce_microstep: 6.02 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11652
total_samples=8248, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:12,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.51 | bwd_microstep: 1987.82 | bwd_inner_microstep: 1753.97 | bwd_allreduce_microstep: 233.77 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13732
total_samples=8252, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:15,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.06 | bwd_microstep: 1801.72 | bwd_inner_microstep: 1705.00 | bwd_allreduce_microstep: 96.66 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13624
total_samples=8256, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:17,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 03:24:17,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.70 | bwd_microstep: 1731.64 | bwd_inner_microstep: 1677.72 | bwd_allreduce_microstep: 53.85 | step_microstep: 142.00
[2025-08-03 03:24:17,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.46 | bwd: 7572.25 | bwd_inner: 7181.63 | bwd_allreduce: 390.39 | step: 142.43
{'loss': 0.7833, 'learning_rate': 1.709393214419431e-05, 'epoch': 0.27}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13623
total_samples=8260, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:20,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.68 | bwd_microstep: 1868.80 | bwd_inner_microstep: 1735.13 | bwd_allreduce_microstep: 133.60 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12685
total_samples=8263, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:23,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.46 | bwd_microstep: 1752.08 | bwd_inner_microstep: 1580.50 | bwd_allreduce_microstep: 171.51 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13207
total_samples=8267, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:26,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.82 | bwd_microstep: 2073.76 | bwd_inner_microstep: 1956.24 | bwd_allreduce_microstep: 117.46 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13706
total_samples=8271, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:28,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 03:24:28,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.22 | bwd_microstep: 1994.81 | bwd_inner_microstep: 1847.29 | bwd_allreduce_microstep: 147.46 | step_microstep: 130.30
[2025-08-03 03:24:28,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.12 | bwd: 7689.49 | bwd_inner: 7119.16 | bwd_allreduce: 570.11 | step: 130.66
{'loss': 0.778, 'learning_rate': 1.7082509264557333e-05, 'epoch': 0.27}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13342
total_samples=8275, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:31,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.81 | bwd_microstep: 1755.36 | bwd_inner_microstep: 1683.12 | bwd_allreduce_microstep: 72.16 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11692
total_samples=8278, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:34,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.77 | bwd_microstep: 1726.18 | bwd_inner_microstep: 1522.75 | bwd_allreduce_microstep: 203.37 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13579
total_samples=8282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:36,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.08 | bwd_microstep: 1711.01 | bwd_inner_microstep: 1674.76 | bwd_allreduce_microstep: 36.18 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13638
total_samples=8287, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:39,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 03:24:39,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.50 | bwd_microstep: 2368.68 | bwd_inner_microstep: 1914.79 | bwd_allreduce_microstep: 453.82 | step_microstep: 128.36
[2025-08-03 03:24:39,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.10 | bwd: 7561.27 | bwd_inner: 6795.42 | bwd_allreduce: 765.62 | step: 128.83
{'loss': 0.7799, 'learning_rate': 1.7071067811865477e-05, 'epoch': 0.27}
 27%|██▋       | 540/2000 [1:41:00<4:27:06, 10.98s/it]
 27%|██▋       | 541/2000 [1:41:10<4:26:00, 10.94s/it]
 27%|██▋       | 542/2000 [1:41:21<4:25:55, 10.94s/it]
 27%|██▋       | 543/2000 [1:41:32<4:25:25, 10.93s/it]
 27%|██▋       | 544/2000 [1:41:43<4:25:35, 10.94s/it]
 27%|██▋       | 545/2000 [1:41:54<4:24:18, 10.90s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13343
total_samples=8291, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:42,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.51 | bwd_microstep: 2274.06 | bwd_inner_microstep: 2033.52 | bwd_allreduce_microstep: 240.47 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13426
total_samples=8295, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:45,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.71 | bwd_microstep: 1784.47 | bwd_inner_microstep: 1689.63 | bwd_allreduce_microstep: 94.77 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13201
total_samples=8299, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:47,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.49 | bwd_microstep: 1717.12 | bwd_inner_microstep: 1651.72 | bwd_allreduce_microstep: 65.34 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13709
total_samples=8303, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:50,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 03:24:50,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.50 | bwd_microstep: 1866.36 | bwd_inner_microstep: 1827.96 | bwd_allreduce_microstep: 38.33 | step_microstep: 129.01
[2025-08-03 03:24:50,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.14 | bwd: 7642.06 | bwd_inner: 7202.82 | bwd_allreduce: 439.00 | step: 129.37
{'loss': 0.7785, 'learning_rate': 1.705960781612262e-05, 'epoch': 0.27}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12632
total_samples=8307, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:53,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.70 | bwd_microstep: 1994.05 | bwd_inner_microstep: 1834.38 | bwd_allreduce_microstep: 159.61 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14102
total_samples=8311, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:55,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.27 | bwd_microstep: 1762.81 | bwd_inner_microstep: 1719.93 | bwd_allreduce_microstep: 42.82 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13549
total_samples=8315, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:24:58,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.14 | bwd_microstep: 2111.02 | bwd_inner_microstep: 1936.11 | bwd_allreduce_microstep: 174.85 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14522
total_samples=8321, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:01,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 03:25:01,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.43 | bwd_microstep: 1810.19 | bwd_inner_microstep: 1737.82 | bwd_allreduce_microstep: 72.32 | step_microstep: 131.04
[2025-08-03 03:25:01,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2754.47 | bwd: 7678.13 | bwd_inner: 7228.23 | bwd_allreduce: 449.68 | step: 131.48
{'loss': 0.7838, 'learning_rate': 1.7048129307381266e-05, 'epoch': 0.27}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938
total_samples=8324, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:04,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.74 | bwd_microstep: 1776.12 | bwd_inner_microstep: 1561.16 | bwd_allreduce_microstep: 214.90 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13868
total_samples=8328, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:06,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.44 | bwd_microstep: 1763.72 | bwd_inner_microstep: 1712.27 | bwd_allreduce_microstep: 51.39 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13244
total_samples=8332, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:09,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.86 | bwd_microstep: 1968.08 | bwd_inner_microstep: 1893.50 | bwd_allreduce_microstep: 74.52 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13763
total_samples=8336, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:12,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 03:25:12,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.27 | bwd_microstep: 1791.94 | bwd_inner_microstep: 1714.35 | bwd_allreduce_microstep: 77.52 | step_microstep: 157.99
[2025-08-03 03:25:12,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.23 | bwd: 7299.91 | bwd_inner: 6881.28 | bwd_allreduce: 418.40 | step: 158.40
{'loss': 0.7829, 'learning_rate': 1.7036632315742464e-05, 'epoch': 0.27}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11816
total_samples=8339, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:14,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.50 | bwd_microstep: 1840.78 | bwd_inner_microstep: 1573.66 | bwd_allreduce_microstep: 267.05 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13754
total_samples=8343, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:17,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.00 | bwd_microstep: 2231.18 | bwd_inner_microstep: 2051.38 | bwd_allreduce_microstep: 179.73 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14270
total_samples=8347, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:20,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.30 | bwd_microstep: 2023.75 | bwd_inner_microstep: 2012.56 | bwd_allreduce_microstep: 11.12 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13870
total_samples=8351, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:23,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 03:25:23,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.55 | bwd_microstep: 1784.61 | bwd_inner_microstep: 1718.50 | bwd_allreduce_microstep: 66.05 | step_microstep: 120.20
[2025-08-03 03:25:23,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.28 | bwd: 7880.36 | bwd_inner: 7356.11 | bwd_allreduce: 524.02 | step: 120.53
{'loss': 0.7714, 'learning_rate': 1.7025116871355737e-05, 'epoch': 0.27}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12166
total_samples=8354, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:25,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.62 | bwd_microstep: 1836.45 | bwd_inner_microstep: 1605.60 | bwd_allreduce_microstep: 230.79 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11903
total_samples=8357, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:28,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.13 | bwd_microstep: 1825.05 | bwd_inner_microstep: 1579.69 | bwd_allreduce_microstep: 245.28 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15433
total_samples=8362, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:31,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.73 | bwd_microstep: 1805.88 | bwd_inner_microstep: 1753.95 | bwd_allreduce_microstep: 51.87 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14520
total_samples=8366, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:33,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 03:25:33,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.58 | bwd_microstep: 1754.86 | bwd_inner_microstep: 1710.99 | bwd_allreduce_microstep: 43.80 | step_microstep: 124.25
[2025-08-03 03:25:33,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.99 | bwd: 7222.28 | bwd_inner: 6650.21 | bwd_allreduce: 571.82 | step: 124.56
{'loss': 0.7904, 'learning_rate': 1.7013583004418994e-05, 'epoch': 0.28}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13471
total_samples=8370, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:36,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.58 | bwd_microstep: 1834.02 | bwd_inner_microstep: 1711.83 | bwd_allreduce_microstep: 122.12 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12612
total_samples=8373, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:39,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.14 | bwd_microstep: 1974.31 | bwd_inner_microstep: 1738.09 | bwd_allreduce_microstep: 236.15 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13880
total_samples=8377, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:41,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.69 | bwd_microstep: 1774.89 | bwd_inner_microstep: 1720.75 | bwd_allreduce_microstep: 54.07 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13246
total_samples=8381, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:44,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 03:25:44,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.00 | bwd_microstep: 1907.00 | bwd_inner_microstep: 1687.91 | bwd_allreduce_microstep: 219.03 | step_microstep: 115.17
[2025-08-03 03:25:44,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.34 | bwd: 7490.27 | bwd_inner: 6858.58 | bwd_allreduce: 631.45 | step: 115.60
:24:18, 10.90s/it]
27%|██▋       | 546/2000 [1:42:05<4:24:04, 10.90s/it]
27%|██▋       | 547/2000 [1:42:16<4:23:53, 10.90s/it]
27%|██▋       | 548/2000 [1:42:26<4:21:19, 10.80s/it]
27%|██▋       | 549/2000 [1:42:38<4:23:37, 10.90s/it]
28%|██▊       | 550/2000 [1:42:48<4:20:36, 10.78s/it]
28%|██▊       | 551/2000 [1:42:59<4:20:18, 10.78s/it]
{'loss': 0.7855, 'learning_rate': 1.7002030745178455e-05, 'epoch': 0.28}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14190
total_samples=8385, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:47,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.09 | bwd_microstep: 1780.62 | bwd_inner_microstep: 1727.30 | bwd_allreduce_microstep: 53.26 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11625
total_samples=8388, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:49,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.29 | bwd_microstep: 1865.39 | bwd_inner_microstep: 1617.90 | bwd_allreduce_microstep: 247.43 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13233
total_samples=8392, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:52,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.68 | bwd_microstep: 2109.11 | bwd_inner_microstep: 1830.31 | bwd_allreduce_microstep: 278.74 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11968
total_samples=8395, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:55,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 03:25:55,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.87 | bwd_microstep: 2052.64 | bwd_inner_microstep: 1575.98 | bwd_allreduce_microstep: 476.57 | step_microstep: 134.98
[2025-08-03 03:25:55,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.87 | bwd: 7807.81 | bwd_inner: 6751.50 | bwd_allreduce: 1056.06 | step: 135.32
{'loss': 0.7768, 'learning_rate': 1.6990460123928577e-05, 'epoch': 0.28}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14009
total_samples=8399, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:25:58,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.69 | bwd_microstep: 1743.55 | bwd_inner_microstep: 1734.39 | bwd_allreduce_microstep: 9.10 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12215
total_samples=8402, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:00,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.85 | bwd_microstep: 1782.99 | bwd_inner_microstep: 1569.48 | bwd_allreduce_microstep: 213.45 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14615
total_samples=8406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:03,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.48 | bwd_microstep: 1983.52 | bwd_inner_microstep: 1734.39 | bwd_allreduce_microstep: 249.07 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13541
total_samples=8410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:06,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 03:26:06,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.54 | bwd_microstep: 2099.97 | bwd_inner_microstep: 1800.15 | bwd_allreduce_microstep: 299.76 | step_microstep: 124.55
[2025-08-03 03:26:06,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.49 | bwd: 7610.08 | bwd_inner: 6838.40 | bwd_allreduce: 771.46 | step: 124.87
{'loss': 0.7824, 'learning_rate': 1.6978871171011963e-05, 'epoch': 0.28}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11667
total_samples=8413, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:09,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.59 | bwd_microstep: 2242.71 | bwd_inner_microstep: 2028.72 | bwd_allreduce_microstep: 213.93 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13638
total_samples=8417, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:12,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.21 | bwd_microstep: 2024.03 | bwd_inner_microstep: 1712.46 | bwd_allreduce_microstep: 311.50 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11620
total_samples=8420, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:15,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.77 | bwd_microstep: 2046.36 | bwd_inner_microstep: 1580.08 | bwd_allreduce_microstep: 466.18 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12217
total_samples=8423, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:17,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 03:26:17,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.22 | bwd_microstep: 1819.27 | bwd_inner_microstep: 1601.27 | bwd_allreduce_microstep: 217.93 | step_microstep: 151.84
[2025-08-03 03:26:17,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.72 | bwd: 8132.42 | bwd_inner: 6922.56 | bwd_allreduce: 1209.61 | step: 152.27
{'loss': 0.7799, 'learning_rate': 1.696726391681929e-05, 'epoch': 0.28}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13848
total_samples=8427, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:20,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.09 | bwd_microstep: 1971.34 | bwd_inner_microstep: 1720.75 | bwd_allreduce_microstep: 250.52 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11672
total_samples=8431, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:23,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.62 | bwd_microstep: 1764.57 | bwd_inner_microstep: 1534.21 | bwd_allreduce_microstep: 230.29 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11700
total_samples=8434, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:25,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.51 | bwd_microstep: 1700.02 | bwd_inner_microstep: 1524.73 | bwd_allreduce_microstep: 175.22 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13754
total_samples=8438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:28,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:26:28,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.96 | bwd_microstep: 2067.70 | bwd_inner_microstep: 1920.25 | bwd_allreduce_microstep: 147.39 | step_microstep: 108.98
[2025-08-03 03:26:28,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2744.10 | bwd: 7503.68 | bwd_inner: 6699.94 | bwd_allreduce: 803.50 | step: 109.45
{'loss': 0.7822, 'learning_rate': 1.695563839178923e-05, 'epoch': 0.28}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14342
total_samples=8442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:31,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.41 | bwd_microstep: 1945.37 | bwd_inner_microstep: 1825.22 | bwd_allreduce_microstep: 120.08 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12021
total_samples=8445, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:34,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.92 | bwd_microstep: 1852.03 | bwd_inner_microstep: 1846.10 | bwd_allreduce_microstep: 5.87 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13353
total_samples=8449, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:36,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.95 | bwd_microstep: 1748.89 | bwd_inner_microstep: 1681.69 | bwd_allreduce_microstep: 67.13 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14590
total_samples=8453, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:39,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.70
[2025-08-03 03:26:39,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.07 | bwd_microstep: 1908.25 | bwd_inner_microstep: 1859.12 | bwd_allreduce_microstep: 49.07 | step_microstep: 105.87
[2025-08-03 03:26:39,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.28 | bwd: 7454.59 | bwd_inner: 7212.13 | bwd_allreduce: 242.23 | step: 106.20
{'loss': 0.7629, 'learning_rate': 1.6943994626408365e-05, 'epoch': 0.28}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14801
total_samples=8457, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:42,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.49 | bwd_microstep: 1828.70 | bwd_inner_microstep: 1736.52 | bwd_allreduce_microstep: 92.11 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12045
total_samples=8460, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:44,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.36 | bwd_microstep: 2000.30 | bwd_inner_microstep: 1782.38 | bwd_allreduce_microstep: 217.86 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12903
total_samples=8464, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:47,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.19 | bwd_microstep: 1695.35 | bwd_inner_microstep: 1627.79 | bwd_allreduce_microstep: 67.50 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13476
total_samples=8468, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:50,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 03:26:50,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.46 | bwd_microstep: 1918.91 | bwd_inner_microstep: 1722.31 | bwd_allreduce_microstep: 196.54 | step_microstep: 130.32
[2025-08-03 03:26:50,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.44 | bwd: 7443.31 | bwd_inner: 6869.00 | bwd_allreduce: 574.09 | step: 130.76
28%|██▊       | 551/2000 [1:42:59<4:20:18, 10.78s/it]
28%|██▊       | 552/2000 [1:43:10<4:22:25, 10.87s/it]
28%|██▊       | 553/2000 [1:43:21<4:22:02, 10.87s/it]
28%|██▊       | 554/2000 [1:43:32<4:26:10, 11.04s/it]
28%|██▊       | 555/2000 [1:43:43<4:23:34, 10.94s/it]
28%|██▊       | 556/2000 [1:43:54<4:21:59, 10.89s/it]
28%|██▊       | 557/2000 [1:44:04<4:20:26, 10.83s/it]
{'loss': 0.789, 'learning_rate': 1.6932332651211115e-05, 'epoch': 0.28}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12772
total_samples=8472, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:52,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.93 | bwd_microstep: 2020.37 | bwd_inner_microstep: 1856.88 | bwd_allreduce_microstep: 163.42 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11663
total_samples=8475, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:55,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.26 | bwd_microstep: 1897.94 | bwd_inner_microstep: 1620.66 | bwd_allreduce_microstep: 277.21 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13692
total_samples=8479, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:26:58,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.82 | bwd_microstep: 1832.75 | bwd_inner_microstep: 1729.43 | bwd_allreduce_microstep: 103.25 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 14765
total_samples=8482, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:01,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 03:27:01,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.38 | bwd_microstep: 1848.66 | bwd_inner_microstep: 1721.36 | bwd_allreduce_microstep: 127.24 | step_microstep: 133.89
[2025-08-03 03:27:01,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.31 | bwd: 7599.77 | bwd_inner: 6928.33 | bwd_allreduce: 671.20 | step: 134.33
{'loss': 0.7903, 'learning_rate': 1.692065249677965e-05, 'epoch': 0.28}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12537
total_samples=8486, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:03,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.65 | bwd_microstep: 1883.04 | bwd_inner_microstep: 1649.52 | bwd_allreduce_microstep: 233.44 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12918
total_samples=8490, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:06,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.91 | bwd_microstep: 2102.19 | bwd_inner_microstep: 1780.37 | bwd_allreduce_microstep: 321.76 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13486
total_samples=8494, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:09,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.01 | bwd_microstep: 2283.60 | bwd_inner_microstep: 2116.49 | bwd_allreduce_microstep: 167.06 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13970
total_samples=8498, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:12,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69
[2025-08-03 03:27:12,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.72 | bwd_microstep: 1770.44 | bwd_inner_microstep: 1716.73 | bwd_allreduce_microstep: 53.64 | step_microstep: 132.62
[2025-08-03 03:27:12,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.22 | bwd: 8039.32 | bwd_inner: 7263.10 | bwd_allreduce: 775.97 | step: 132.95
{'loss': 0.7847, 'learning_rate': 1.6908954193743816e-05, 'epoch': 0.28}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13445
total_samples=8502, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:14,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.17 | bwd_microstep: 1710.16 | bwd_inner_microstep: 1656.11 | bwd_allreduce_microstep: 53.99 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12038
total_samples=8505, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:17,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.72 | bwd_microstep: 1708.69 | bwd_inner_microstep: 1537.69 | bwd_allreduce_microstep: 170.93 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13586
total_samples=8509, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:20,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.03 | bwd_microstep: 2044.74 | bwd_inner_microstep: 1903.57 | bwd_allreduce_microstep: 141.10 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13157
total_samples=8513, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:22,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 03:27:22,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.64 | bwd_microstep: 1780.59 | bwd_inner_microstep: 1687.48 | bwd_allreduce_microstep: 93.05 | step_microstep: 111.71
[2025-08-03 03:27:22,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2707.50 | bwd: 7244.23 | bwd_inner: 6784.85 | bwd_allreduce: 459.15 | step: 112.05
{'loss': 0.7688, 'learning_rate': 1.6897237772781046e-05, 'epoch': 0.28}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13924
total_samples=8517, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:25,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.55 | bwd_microstep: 1974.11 | bwd_inner_microstep: 1882.86 | bwd_allreduce_microstep: 91.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13475
total_samples=8521, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:28,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.46 | bwd_microstep: 1735.08 | bwd_inner_microstep: 1674.96 | bwd_allreduce_microstep: 60.05 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12911
total_samples=8525, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:30,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.44 | bwd_microstep: 1998.97 | bwd_inner_microstep: 1888.78 | bwd_allreduce_microstep: 110.13 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13955
total_samples=8529, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:33,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 03:27:33,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.45 | bwd_microstep: 1820.97 | bwd_inner_microstep: 1736.99 | bwd_allreduce_microstep: 83.91 | step_microstep: 113.08
[2025-08-03 03:27:33,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.82 | bwd: 7529.18 | bwd_inner: 7183.58 | bwd_allreduce: 345.36 | step: 113.40
{'loss': 0.7852, 'learning_rate': 1.6885503264616282e-05, 'epoch': 0.28}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12465
total_samples=8533, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:36,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.34 | bwd_microstep: 1751.47 | bwd_inner_microstep: 1590.94 | bwd_allreduce_microstep: 160.47 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13810
total_samples=8537, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:38,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.97 | bwd_microstep: 1730.40 | bwd_inner_microstep: 1683.48 | bwd_allreduce_microstep: 46.85 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11838
total_samples=8540, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:41,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.70 | bwd_microstep: 1790.67 | bwd_inner_microstep: 1579.47 | bwd_allreduce_microstep: 211.13 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13341
total_samples=8544, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:44,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 03:27:44,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.56 | bwd_microstep: 1901.49 | bwd_inner_microstep: 1856.06 | bwd_allreduce_microstep: 45.36 | step_microstep: 137.53
[2025-08-03 03:27:44,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.50 | bwd: 7174.09 | bwd_inner: 6709.95 | bwd_allreduce: 463.89 | step: 137.98
{'loss': 0.7767, 'learning_rate': 1.6873750700021917e-05, 'epoch': 0.28}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11644
total_samples=8547, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:46,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.04 | bwd_microstep: 1843.41 | bwd_inner_microstep: 1793.05 | bwd_allreduce_microstep: 50.30 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13369
total_samples=8551, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:49,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.12 | bwd_microstep: 1786.97 | bwd_inner_microstep: 1695.58 | bwd_allreduce_microstep: 91.33 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12069
total_samples=8554, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:52,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.59 | bwd_microstep: 2069.88 | bwd_inner_microstep: 1723.98 | bwd_allreduce_microstep: 345.84 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13183
total_samples=8558, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:54,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 03:27:54,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.85 | bwd_microstep: 1909.58 | bwd_inner_microstep: 1839.45 | bwd_allreduce_microstep: 70.07 | step_microstep: 107.57
[2025-08-03 03:27:54,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.52 | bwd: 7609.88 | bwd_inner: 7052.05 | bwd_allreduce: 557.60 | step: 108.00
28%|██▊       | 557/2000 [1:44:04<4:20:26, 10.83s/it]
28%|██▊       | 558/2000 [1:44:15<4:21:02, 10.86s/it]
28%|██▊       | 559/2000 [1:44:27<4:24:22, 11.01s/it]
28%|██▊       | 560/2000 [1:44:37<4:19:55, 10.83s/it]
28%|██▊       | 561/2000 [1:44:48<4:19:35, 10.82s/it]
28%|██▊       | 562/2000 [1:44:58<4:16:28, 10.70s/it]
28%|██▊       | 563/
{'loss': 0.7847, 'learning_rate': 1.686198010981767e-05, 'epoch': 0.28}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13231
total_samples=8562, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:27:57,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.68 | bwd_microstep: 1960.19 | bwd_inner_microstep: 1807.46 | bwd_allreduce_microstep: 152.67 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12829
total_samples=8566, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:00,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.43 | bwd_microstep: 2170.89 | bwd_inner_microstep: 1967.95 | bwd_allreduce_microstep: 202.88 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14296
total_samples=8570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:03,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.82 | bwd_microstep: 1755.07 | bwd_inner_microstep: 1722.75 | bwd_allreduce_microstep: 32.26 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13946
total_samples=8576, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:05,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 03:28:05,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.59 | bwd_microstep: 1789.81 | bwd_inner_microstep: 1686.01 | bwd_allreduce_microstep: 103.75 | step_microstep: 161.42
[2025-08-03 03:28:05,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2866.44 | bwd: 7676.01 | bwd_inner: 7184.16 | bwd_allreduce: 491.63 | step: 161.74
{'loss': 0.7885, 'learning_rate': 1.6850191524870548e-05, 'epoch': 0.28}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13297
total_samples=8580, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:09,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 763.23 | bwd_microstep: 2309.51 | bwd_inner_microstep: 1957.91 | bwd_allreduce_microstep: 351.53 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13781
total_samples=8584, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:11,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.41 | bwd_microstep: 1825.58 | bwd_inner_microstep: 1742.92 | bwd_allreduce_microstep: 82.60 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13757
total_samples=8588, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:14,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.26 | bwd_microstep: 1793.23 | bwd_inner_microstep: 1709.62 | bwd_allreduce_microstep: 83.55 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13513
total_samples=8592, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:17,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 03:28:17,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.84 | bwd_microstep: 2087.47 | bwd_inner_microstep: 1936.57 | bwd_allreduce_microstep: 150.84 | step_microstep: 109.88
[2025-08-03 03:28:17,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2866.66 | bwd: 8015.84 | bwd_inner: 7347.01 | bwd_allreduce: 668.61 | step: 110.21
{'loss': 0.7733, 'learning_rate': 1.6838384976094738e-05, 'epoch': 0.28}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11720
total_samples=8595, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:20,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.89 | bwd_microstep: 2249.14 | bwd_inner_microstep: 1894.18 | bwd_allreduce_microstep: 354.90 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13299
total_samples=8599, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:22,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.38 | bwd_microstep: 1860.39 | bwd_inner_microstep: 1820.50 | bwd_allreduce_microstep: 39.82 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12490
total_samples=8603, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:25,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.20 | bwd_microstep: 1751.29 | bwd_inner_microstep: 1583.93 | bwd_allreduce_microstep: 167.31 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13318
total_samples=8607, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:28,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54
[2025-08-03 03:28:28,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.15 | bwd_microstep: 1812.05 | bwd_inner_microstep: 1704.49 | bwd_allreduce_microstep: 107.50 | step_microstep: 126.69
[2025-08-03 03:28:28,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.54 | bwd: 7672.92 | bwd_inner: 7003.09 | bwd_allreduce: 669.60 | step: 127.03
{'loss': 0.7696, 'learning_rate': 1.682656049445154e-05, 'epoch': 0.28}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309
total_samples=8611, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:30,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.59 | bwd_microstep: 1823.88 | bwd_inner_microstep: 1714.10 | bwd_allreduce_microstep: 109.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14593
total_samples=8616, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:33,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.96 | bwd_microstep: 2105.90 | bwd_inner_microstep: 1932.46 | bwd_allreduce_microstep: 173.37 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11836
total_samples=8619, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:36,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.98 | bwd_microstep: 2103.81 | bwd_inner_microstep: 1870.32 | bwd_allreduce_microstep: 233.43 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13046
total_samples=8623, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:39,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 03:28:39,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.90 | bwd_microstep: 1729.05 | bwd_inner_microstep: 1635.47 | bwd_allreduce_microstep: 93.51 | step_microstep: 115.93
[2025-08-03 03:28:39,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2762.35 | bwd: 7762.68 | bwd_inner: 7152.36 | bwd_allreduce: 610.09 | step: 116.25
{'loss': 0.7716, 'learning_rate': 1.6814718110949274e-05, 'epoch': 0.28}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14470
total_samples=8627, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:42,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.43 | bwd_microstep: 2147.20 | bwd_inner_microstep: 2088.56 | bwd_allreduce_microstep: 58.58 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13490
total_samples=8631, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:44,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.80 | bwd_microstep: 1861.70 | bwd_inner_microstep: 1716.87 | bwd_allreduce_microstep: 144.77 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13240
total_samples=8635, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:47,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.65 | bwd_microstep: 1820.19 | bwd_inner_microstep: 1722.75 | bwd_allreduce_microstep: 97.37 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12957
total_samples=8639, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:49,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 03:28:50,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.54 | bwd_microstep: 1771.17 | bwd_inner_microstep: 1651.79 | bwd_allreduce_microstep: 119.32 | step_microstep: 124.74
[2025-08-03 03:28:50,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.36 | bwd: 7600.30 | bwd_inner: 7179.97 | bwd_allreduce: 420.11 | step: 125.07
{'loss': 0.7776, 'learning_rate': 1.6802857856643214e-05, 'epoch': 0.28}
 28%|██▊       | 563/2000 [1:45:09<4:17:13, 10.74s/it]
 28%|██▊       | 564/2000 [1:45:20<4:19:06, 10.83s/it]
 28%|██▊       | 565/2000 [1:45:32<4:22:40, 10.98s/it]
 28%|██▊       | 566/2000 [1:45:42<4:21:57, 10.96s/it]
 28%|██▊       | 567/2000 [1:45:53<4:21:59, 10.97s/it]
 28%|██▊       | 568/2000 [1:46:04<4:21:03, 10.94s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15550
total_samples=8644, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:52,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.03 | bwd_microstep: 1785.27 | bwd_inner_microstep: 1758.88 | bwd_allreduce_microstep: 26.32 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12742
total_samples=8648, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:55,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.34 | bwd_microstep: 1983.65 | bwd_inner_microstep: 1689.01 | bwd_allreduce_microstep: 294.58 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13496
total_samples=8652, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:28:58,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1031.05 | bwd_microstep: 1815.59 | bwd_inner_microstep: 1809.64 | bwd_allreduce_microstep: 5.89 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13813
total_samples=8656, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:01,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 03:29:01,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.24 | bwd_microstep: 2054.46 | bwd_inner_microstep: 1818.73 | bwd_allreduce_microstep: 235.66 | step_microstep: 114.56
[2025-08-03 03:29:01,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3134.58 | bwd: 7639.02 | bwd_inner: 7076.26 | bwd_allreduce: 562.53 | step: 114.89
{'loss': 0.7802, 'learning_rate': 1.6790979762635497e-05, 'epoch': 0.28}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14042
total_samples=8660, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:03,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.05 | bwd_microstep: 1817.53 | bwd_inner_microstep: 1692.37 | bwd_allreduce_microstep: 125.09 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14680
total_samples=8664, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:06,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.73 | bwd_microstep: 1739.91 | bwd_inner_microstep: 1697.73 | bwd_allreduce_microstep: 42.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14841
total_samples=8668, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:09,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.99 | bwd_microstep: 2081.57 | bwd_inner_microstep: 1960.39 | bwd_allreduce_microstep: 121.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13249
total_samples=8672, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:12,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 03:29:12,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.61 | bwd_microstep: 2022.99 | bwd_inner_microstep: 1790.86 | bwd_allreduce_microstep: 232.07 | step_microstep: 120.72
[2025-08-03 03:29:12,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.32 | bwd: 7662.04 | bwd_inner: 7141.34 | bwd_allreduce: 520.45 | step: 121.05
{'loss': 0.7913, 'learning_rate': 1.6779083860075032e-05, 'epoch': 0.28}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13466
total_samples=8676, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:14,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.77 | bwd_microstep: 2034.25 | bwd_inner_microstep: 1905.79 | bwd_allreduce_microstep: 128.39 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13401
total_samples=8680, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:17,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.99 | bwd_microstep: 1812.37 | bwd_inner_microstep: 1672.74 | bwd_allreduce_microstep: 139.57 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14981
total_samples=8684, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:20,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.94 | bwd_microstep: 1745.64 | bwd_inner_microstep: 1733.25 | bwd_allreduce_microstep: 12.33 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13277
total_samples=8688, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:22,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 03:29:22,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.78 | bwd_microstep: 1795.31 | bwd_inner_microstep: 1672.97 | bwd_allreduce_microstep: 122.28 | step_microstep: 112.03
[2025-08-03 03:29:22,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.42 | bwd: 7387.62 | bwd_inner: 6984.76 | bwd_allreduce: 402.64 | step: 112.35
{'loss': 0.7787, 'learning_rate': 1.6767170180157442e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12033
total_samples=8691, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:25,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.87 | bwd_microstep: 1763.38 | bwd_inner_microstep: 1551.16 | bwd_allreduce_microstep: 212.16 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12605
total_samples=8695, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:28,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.94 | bwd_microstep: 2071.28 | bwd_inner_microstep: 1879.58 | bwd_allreduce_microstep: 191.62 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11898
total_samples=8698, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:30,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.59 | bwd_microstep: 1901.16 | bwd_inner_microstep: 1756.62 | bwd_allreduce_microstep: 144.48 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13316
total_samples=8704, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:33,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 03:29:33,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.36 | bwd_microstep: 1751.92 | bwd_inner_microstep: 1644.55 | bwd_allreduce_microstep: 107.31 | step_microstep: 155.16
[2025-08-03 03:29:33,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.70 | bwd: 7487.79 | bwd_inner: 6831.90 | bwd_allreduce: 655.64 | step: 155.49
{'loss': 0.7736, 'learning_rate': 1.6755238754124965e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11830
total_samples=8707, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:36,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.44 | bwd_microstep: 1716.07 | bwd_inner_microstep: 1542.99 | bwd_allreduce_microstep: 173.01 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12080
total_samples=8711, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:38,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.42 | bwd_microstep: 2013.86 | bwd_inner_microstep: 1681.69 | bwd_allreduce_microstep: 332.11 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11724
total_samples=8714, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:41,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.11 | bwd_microstep: 1991.75 | bwd_inner_microstep: 1771.11 | bwd_allreduce_microstep: 220.58 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12363
total_samples=8718, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:44,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 03:29:44,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.45 | bwd_microstep: 1735.25 | bwd_inner_microstep: 1569.74 | bwd_allreduce_microstep: 165.44 | step_microstep: 137.33
[2025-08-03 03:29:44,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.36 | bwd: 7456.99 | bwd_inner: 6565.54 | bwd_allreduce: 891.21 | step: 137.65
{'loss': 0.774, 'learning_rate': 1.674328961326637e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12243
total_samples=8721, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:47,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.17 | bwd_microstep: 2095.15 | bwd_inner_microstep: 1718.58 | bwd_allreduce_microstep: 376.51 | step_microstep: 0.15
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13934
total_samples=8725, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:49,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.18 | bwd_microstep: 1790.93 | bwd_inner_microstep: 1706.70 | bwd_allreduce_microstep: 84.17 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14360
total_samples=8729, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:52,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.68 | bwd_microstep: 2153.48 | bwd_inner_microstep: 1987.39 | bwd_allreduce_microstep: 166.02 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13039
total_samples=8733, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:55,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.93
[2025-08-03 03:29:55,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.38 | bwd_microstep: 1854.65 | bwd_inner_microstep: 1665.91 | bwd_allreduce_microstep: 188.69 | step_microstep: 123.60
[2025-08-03 03:29:55,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.35 | bwd: 7894.26 | bwd_inner: 7078.57 | bwd_allreduce: 815.47 | step: 123.98
{'loss': 0.7687, 'learning_rate': 1.6731322788916892e-05, 'epoch': 0.29}
 28%|██▊       | 569/2000 [1:46:16<4:22:50, 11.02s/it]
 28%|██▊       | 570/2000 [1:46:27<4:22:02, 11.00s/it]
 29%|██▊       | 571/2000 [1:46:37<4:18:59, 10.87s/it]
 29%|██▊       | 572/2000 [1:46:48<4:18:08, 10.85s/it]
 29%|██▊       | 573/2000 [1:46:59<4:17:19, 10.82s/it]
 29%|██▊       | 574/2000 [1:47:10<4:19:24, 10.91s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14195
total_samples=8737, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:29:58,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.06 | bwd_microstep: 2217.60 | bwd_inner_microstep: 2097.35 | bwd_allreduce_microstep: 120.19 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11689
total_samples=8740, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:01,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.70 | bwd_microstep: 1799.87 | bwd_inner_microstep: 1565.42 | bwd_allreduce_microstep: 234.39 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11722
total_samples=8743, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:03,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.20 | bwd_microstep: 1768.72 | bwd_inner_microstep: 1543.51 | bwd_allreduce_microstep: 225.15 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13278
total_samples=8747, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:06,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 03:30:06,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.33 | bwd_microstep: 1922.79 | bwd_inner_microstep: 1839.25 | bwd_allreduce_microstep: 83.48 | step_microstep: 110.91
[2025-08-03 03:30:06,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.22 | bwd: 7709.02 | bwd_inner: 7045.52 | bwd_allreduce: 663.28 | step: 111.24
{'loss': 0.7772, 'learning_rate': 1.6719338312458123e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11741
total_samples=8750, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:08,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.80 | bwd_microstep: 1713.28 | bwd_inner_microstep: 1581.56 | bwd_allreduce_microstep: 131.66 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11767
total_samples=8753, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:11,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.94 | bwd_microstep: 1783.29 | bwd_inner_microstep: 1536.19 | bwd_allreduce_microstep: 247.04 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332
total_samples=8757, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:14,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.51 | bwd_microstep: 1770.55 | bwd_inner_microstep: 1685.67 | bwd_allreduce_microstep: 84.82 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13565
total_samples=8761, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:17,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.30
[2025-08-03 03:30:17,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1305.21 | bwd_microstep: 1809.49 | bwd_inner_microstep: 1708.54 | bwd_allreduce_microstep: 100.89 | step_microstep: 116.59
[2025-08-03 03:30:17,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3417.37 | bwd: 7076.66 | bwd_inner: 6511.94 | bwd_allreduce: 564.49 | step: 116.92
{'loss': 0.7825, 'learning_rate': 1.6707336215317968e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11805
total_samples=8764, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:19,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.06 | bwd_microstep: 1729.02 | bwd_inner_microstep: 1538.56 | bwd_allreduce_microstep: 190.40 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13448
total_samples=8768, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:22,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.68 | bwd_microstep: 1982.66 | bwd_inner_microstep: 1700.60 | bwd_allreduce_microstep: 282.00 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13239
total_samples=8772, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:25,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.51 | bwd_microstep: 2027.04 | bwd_inner_microstep: 1720.60 | bwd_allreduce_microstep: 306.37 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11696
total_samples=8775, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:28,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 03:30:28,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.52 | bwd_microstep: 1920.48 | bwd_inner_microstep: 1546.42 | bwd_allreduce_microstep: 374.00 | step_microstep: 396.59
[2025-08-03 03:30:28,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.71 | bwd: 7659.25 | bwd_inner: 6506.17 | bwd_allreduce: 1152.86 | step: 397.01
{'loss': 0.7787, 'learning_rate': 1.6695316528970517e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12266
total_samples=8778, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:31,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.77 | bwd_microstep: 1741.91 | bwd_inner_microstep: 1568.99 | bwd_allreduce_microstep: 172.86 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16159
total_samples=8782, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:33,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.17 | bwd_microstep: 1839.77 | bwd_inner_microstep: 1833.36 | bwd_allreduce_microstep: 6.34 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13487
total_samples=8786, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:36,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.63 | bwd_microstep: 2196.64 | bwd_inner_microstep: 2003.99 | bwd_allreduce_microstep: 192.58 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13092
total_samples=8790, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:39,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 03:30:39,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 985.25 | bwd_microstep: 1886.58 | bwd_inner_microstep: 1745.00 | bwd_allreduce_microstep: 141.53 | step_microstep: 135.63
[2025-08-03 03:30:39,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3071.75 | bwd: 7664.94 | bwd_inner: 7151.34 | bwd_allreduce: 513.38 | step: 136.22
{'loss': 0.7712, 'learning_rate': 1.6683279284936004e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12274
total_samples=8793, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:42,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.93 | bwd_microstep: 2012.78 | bwd_inner_microstep: 1788.33 | bwd_allreduce_microstep: 224.39 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12450
total_samples=8797, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:46,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1125.26 | bwd_microstep: 2614.02 | bwd_inner_microstep: 2396.41 | bwd_allreduce_microstep: 217.55 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11647
total_samples=8800, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:48,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.19 | bwd_microstep: 1741.03 | bwd_inner_microstep: 1525.81 | bwd_allreduce_microstep: 215.17 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12465
total_samples=8804, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:51,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 03:30:51,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.70 | bwd_microstep: 1798.81 | bwd_inner_microstep: 1613.06 | bwd_allreduce_microstep: 185.69 | step_microstep: 140.33
[2025-08-03 03:30:51,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3191.00 | bwd: 8166.70 | bwd_inner: 7323.60 | bwd_allreduce: 842.87 | step: 140.75
{'loss': 0.7825, 'learning_rate': 1.6671224514780692e-05, 'epoch': 0.29}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14186
total_samples=8808, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:54,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.38 | bwd_microstep: 1803.92 | bwd_inner_microstep: 1730.69 | bwd_allreduce_microstep: 73.16 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13700
total_samples=8813, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:56,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.33 | bwd_microstep: 1886.91 | bwd_inner_microstep: 1836.52 | bwd_allreduce_microstep: 50.32 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13171
total_samples=8817, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:30:59,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.15 | bwd_microstep: 1810.89 | bwd_inner_microstep: 1654.79 | bwd_allreduce_microstep: 156.04 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11503
total_samples=8820, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:02,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 03:31:02,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.33 | bwd_microstep: 2054.36 | bwd_inner_microstep: 1578.47 | bwd_allreduce_microstep: 475.78 | step_microstep: 112.06
[2025-08-03 03:31:02,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.12 | bwd: 7556.12 | bwd_inner: 6800.50 | bwd_allreduce: 755.36 | step: 112.49
 29%|██▉       | 575/2000 [1:47:21<4:19:39, 10.93s/it]
 29%|██▉       | 576/2000 [1:47:32<4:19:35, 10.94s/it]
 29%|██▉       | 577/2000 [1:47:43<4:21:19, 11.02s/it]
 29%|██▉       | 578/2000 [1:47:54<4:22:32, 11.08s/it]
 29%|██▉       | 579/2000 [1:48:06<4:27:54, 11.31s/it]
 29%|██▉       | 580/2000 [1:48:17<4:24:30, 11.18s/it]
{'loss': 0.7832, 'learning_rate': 1.665915225011681e-05, 'epoch': 0.29}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13525
total_samples=8824, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:05,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.35 | bwd_microstep: 2013.03 | bwd_inner_microstep: 1827.73 | bwd_allreduce_microstep: 185.24 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12418
total_samples=8829, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:07,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.19 | bwd_microstep: 1778.16 | bwd_inner_microstep: 1606.11 | bwd_allreduce_microstep: 171.99 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12863
total_samples=8833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:10,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.24 | bwd_microstep: 2275.47 | bwd_inner_microstep: 2026.15 | bwd_allreduce_microstep: 249.26 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13681
total_samples=8837, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:13,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 03:31:13,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.01 | bwd_microstep: 1782.96 | bwd_inner_microstep: 1716.81 | bwd_allreduce_microstep: 66.07 | step_microstep: 139.26
[2025-08-03 03:31:13,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.75 | bwd: 7849.67 | bwd_inner: 7176.80 | bwd_allreduce: 672.64 | step: 139.56
{'loss': 0.7715, 'learning_rate': 1.6647062522602474e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11819
total_samples=8840, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:16,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.08 | bwd_microstep: 1743.44 | bwd_inner_microstep: 1547.96 | bwd_allreduce_microstep: 195.42 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13669
total_samples=8844, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:18,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.89 | bwd_microstep: 1907.46 | bwd_inner_microstep: 1700.72 | bwd_allreduce_microstep: 206.67 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12051
total_samples=8848, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:21,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.49 | bwd_microstep: 1797.81 | bwd_inner_microstep: 1560.73 | bwd_allreduce_microstep: 237.02 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12247
total_samples=8852, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:24,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 03:31:24,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.12 | bwd_microstep: 1884.24 | bwd_inner_microstep: 1878.15 | bwd_allreduce_microstep: 6.03 | step_microstep: 112.52
[2025-08-03 03:31:24,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.50 | bwd: 7333.00 | bwd_inner: 6687.56 | bwd_allreduce: 645.21 | step: 112.85
{'loss': 0.7756, 'learning_rate': 1.6634955363941573e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11888
total_samples=8855, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:27,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.08 | bwd_microstep: 2237.89 | bwd_inner_microstep: 1766.54 | bwd_allreduce_microstep: 471.29 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13427
total_samples=8859, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:29,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.01 | bwd_microstep: 1729.37 | bwd_inner_microstep: 1671.15 | bwd_allreduce_microstep: 58.16 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11749
total_samples=8862, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:32,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.01 | bwd_microstep: 1751.26 | bwd_inner_microstep: 1537.81 | bwd_allreduce_microstep: 213.38 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11967
total_samples=8865, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:35,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 03:31:35,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.23 | bwd_microstep: 1916.19 | bwd_inner_microstep: 1546.74 | bwd_allreduce_microstep: 369.39 | step_microstep: 128.11
[2025-08-03 03:31:35,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.24 | bwd: 7634.76 | bwd_inner: 6522.23 | bwd_allreduce: 1112.30 | step: 128.44
{'loss': 0.7725, 'learning_rate': 1.662283080588373e-05, 'epoch': 0.29}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13414
total_samples=8869, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:37,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.44 | bwd_microstep: 1727.47 | bwd_inner_microstep: 1665.23 | bwd_allreduce_microstep: 62.17 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13643
total_samples=8873, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:40,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.33 | bwd_microstep: 2018.79 | bwd_inner_microstep: 1892.57 | bwd_allreduce_microstep: 126.15 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13856
total_samples=8877, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:43,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.40 | bwd_microstep: 1871.68 | bwd_inner_microstep: 1801.18 | bwd_allreduce_microstep: 70.44 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11839
total_samples=8880, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:46,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 03:31:46,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 748.40 | bwd_microstep: 2101.25 | bwd_inner_microstep: 1895.46 | bwd_allreduce_microstep: 205.72 | step_microstep: 128.46
[2025-08-03 03:31:46,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.51 | bwd: 7719.24 | bwd_inner: 7254.44 | bwd_allreduce: 464.57 | step: 128.79
{'loss': 0.7744, 'learning_rate': 1.6610688880224178e-05, 'epoch': 0.29}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13812
total_samples=8884, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:48,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.48 | bwd_microstep: 1906.29 | bwd_inner_microstep: 1694.74 | bwd_allreduce_microstep: 211.49 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11812
total_samples=8887, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:51,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.66 | bwd_microstep: 1857.15 | bwd_inner_microstep: 1732.06 | bwd_allreduce_microstep: 125.03 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11680
total_samples=8890, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:54,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.18 | bwd_microstep: 2185.80 | bwd_inner_microstep: 1935.53 | bwd_allreduce_microstep: 250.21 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11727
total_samples=8893, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:57,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 03:31:57,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.03 | bwd_microstep: 1788.35 | bwd_inner_microstep: 1545.27 | bwd_allreduce_microstep: 243.02 | step_microstep: 160.37
[2025-08-03 03:31:57,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.28 | bwd: 7737.64 | bwd_inner: 6907.60 | bwd_allreduce: 829.81 | step: 160.71
{'loss': 0.7837, 'learning_rate': 1.65985296188037e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13132
total_samples=8896, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:31:59,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.52 | bwd_microstep: 1916.11 | bwd_inner_microstep: 1742.67 | bwd_allreduce_microstep: 173.38 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14054
total_samples=8900, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:02,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.01 | bwd_microstep: 2061.98 | bwd_inner_microstep: 1941.06 | bwd_allreduce_microstep: 120.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13379
total_samples=8904, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:05,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.40 | bwd_microstep: 1831.52 | bwd_inner_microstep: 1717.23 | bwd_allreduce_microstep: 114.22 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12039
total_samples=8907, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:07,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 03:32:07,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.59 | bwd_microstep: 1736.59 | bwd_inner_microstep: 1552.52 | bwd_allreduce_microstep: 184.01 | step_microstep: 111.19
[2025-08-03 03:32:07,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.45 | bwd: 7546.25 | bwd_inner: 6953.48 | bwd_allreduce: 592.54 | step: 111.50
29%|██▉       | 581/2000 [1:48:28<4:23:48, 11.15s/it]
29%|██▉       | 582/2000 [1:48:39<4:19:28, 10.98s/it]
29%|██▉       | 583/2000 [1:48:49<4:18:54, 10.96s/it]
29%|██▉       | 584/2000 [1:49:00<4:19:02, 10.98s/it]
29%|██▉       | 585/2000 [1:49:11<4:19:05, 10.99s/it]
{'loss': 0.7697, 'learning_rate': 1.6586353053508548e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11898
total_samples=8910, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:10,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.76 | bwd_microstep: 1735.00 | bwd_inner_microstep: 1543.27 | bwd_allreduce_microstep: 191.67 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14067
total_samples=8915, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:13,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.98 | bwd_microstep: 1947.20 | bwd_inner_microstep: 1869.31 | bwd_allreduce_microstep: 77.83 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14424
total_samples=8919, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:15,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.77 | bwd_microstep: 1921.35 | bwd_inner_microstep: 1836.70 | bwd_allreduce_microstep: 84.58 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11637
total_samples=8922, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:18,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 03:32:18,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.55 | bwd_microstep: 1804.92 | bwd_inner_microstep: 1536.29 | bwd_allreduce_microstep: 268.56 | step_microstep: 112.58
[2025-08-03 03:32:18,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.99 | bwd: 7408.52 | bwd_inner: 6785.57 | bwd_allreduce: 622.71 | step: 113.00
{'loss': 0.7645, 'learning_rate': 1.657415921627034e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11869
total_samples=8925, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:21,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.87 | bwd_microstep: 1794.92 | bwd_inner_microstep: 1575.54 | bwd_allreduce_microstep: 219.32 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11682
total_samples=8928, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:23,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.56 | bwd_microstep: 1733.25 | bwd_inner_microstep: 1532.31 | bwd_allreduce_microstep: 200.87 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13238
total_samples=8932, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:26,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.36 | bwd_microstep: 1863.93 | bwd_inner_microstep: 1696.41 | bwd_allreduce_microstep: 167.45 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14777
total_samples=8937, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:29,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 03:32:29,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.97 | bwd_microstep: 2181.69 | bwd_inner_microstep: 1881.98 | bwd_allreduce_microstep: 299.64 | step_microstep: 136.24
[2025-08-03 03:32:29,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.69 | bwd: 7573.84 | bwd_inner: 6686.25 | bwd_allreduce: 887.35 | step: 136.56
{'loss': 0.7771, 'learning_rate': 1.6561948139065997e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12395
total_samples=8940, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:32,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.56 | bwd_microstep: 2000.18 | bwd_inner_microstep: 1786.98 | bwd_allreduce_microstep: 213.13 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13066
total_samples=8944, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:34,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.70 | bwd_microstep: 1968.05 | bwd_inner_microstep: 1962.13 | bwd_allreduce_microstep: 5.86 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13783
total_samples=8948, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:37,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.39 | bwd_microstep: 1817.95 | bwd_inner_microstep: 1749.95 | bwd_allreduce_microstep: 67.94 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14851
total_samples=8952, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:40,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 03:32:40,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.71 | bwd_microstep: 1989.76 | bwd_inner_microstep: 1854.55 | bwd_allreduce_microstep: 135.15 | step_microstep: 134.89
[2025-08-03 03:32:40,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.30 | bwd: 7775.99 | bwd_inner: 7353.60 | bwd_allreduce: 422.16 | step: 135.19
{'loss': 0.7816, 'learning_rate': 1.654971985391764e-05, 'epoch': 0.29}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14332
total_samples=8956, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:43,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.06 | bwd_microstep: 1832.18 | bwd_inner_microstep: 1753.15 | bwd_allreduce_microstep: 78.97 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12187
total_samples=8959, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:45,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.79 | bwd_microstep: 1756.03 | bwd_inner_microstep: 1564.45 | bwd_allreduce_microstep: 191.52 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13723
total_samples=8964, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:48,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.39 | bwd_microstep: 1760.91 | bwd_inner_microstep: 1697.69 | bwd_allreduce_microstep: 63.17 | step_microstep: 0.09
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14583
total_samples=8969, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:50,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78
[2025-08-03 03:32:50,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.12 | bwd_microstep: 1737.33 | bwd_inner_microstep: 1685.87 | bwd_allreduce_microstep: 51.40 | step_microstep: 134.47
[2025-08-03 03:32:50,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.30 | bwd: 7086.51 | bwd_inner: 6701.15 | bwd_allreduce: 385.13 | step: 134.78
{'loss': 0.7815, 'learning_rate': 1.6537474392892527e-05, 'epoch': 0.29}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12127
total_samples=8972, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:53,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.37 | bwd_microstep: 1994.17 | bwd_inner_microstep: 1803.35 | bwd_allreduce_microstep: 190.77 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12148
total_samples=8975, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:56,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.68 | bwd_microstep: 1980.70 | bwd_inner_microstep: 1756.85 | bwd_allreduce_microstep: 223.79 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12265
total_samples=8979, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:32:58,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.90 | bwd_microstep: 1782.43 | bwd_inner_microstep: 1595.26 | bwd_allreduce_microstep: 187.10 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13526
total_samples=8983, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:01,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 03:33:01,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.52 | bwd_microstep: 1815.80 | bwd_inner_microstep: 1738.91 | bwd_allreduce_microstep: 76.82 | step_microstep: 140.81
[2025-08-03 03:33:01,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.41 | bwd: 7573.15 | bwd_inner: 6894.37 | bwd_allreduce: 678.55 | step: 141.14
{'loss': 0.7767, 'learning_rate': 1.6525211788102946e-05, 'epoch': 0.3}
29%|██▉       | 586/2000 [1:49:22<4:17:22, 10.92s/it]
29%|██▉       | 587/2000 [1:49:33<4:15:06, 10.83s/it]
29%|██▉       | 588/2000 [1:49:44<4:15:22, 10.85s/it]
29%|██▉       | 589/2000 [1:49:55<4:16:37, 10.91s/it]
30%|██▉       | 590/2000 [1:50:05<4:12:22, 10.74s/it]
30%|██▉       | 591/2000 [1:50:16<4:13:07, 10.78s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11865
total_samples=8986, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:04,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.67 | bwd_microstep: 2149.93 | bwd_inner_microstep: 1921.46 | bwd_allreduce_microstep: 228.40 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13687
total_samples=8990, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:07,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1012.23 | bwd_microstep: 1911.82 | bwd_inner_microstep: 1713.81 | bwd_allreduce_microstep: 197.94 | step_microstep: 0.09
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12329
total_samples=8994, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:10,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.70 | bwd_microstep: 1765.67 | bwd_inner_microstep: 1587.71 | bwd_allreduce_microstep: 177.89 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13861
total_samples=8998, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:12,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 03:33:12,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.82 | bwd_microstep: 1757.17 | bwd_inner_microstep: 1714.90 | bwd_allreduce_microstep: 42.21 | step_microstep: 113.37
[2025-08-03 03:33:12,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3133.35 | bwd: 7584.62 | bwd_inner: 6937.88 | bwd_allreduce: 646.51 | step: 113.77
{'loss': 0.7756, 'learning_rate': 1.6512932071706153e-05, 'epoch': 0.3}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11534
total_samples=9001, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:15,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.34 | bwd_microstep: 1804.38 | bwd_inner_microstep: 1537.12 | bwd_allreduce_microstep: 267.20 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13484
total_samples=9006, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:17,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.46 | bwd_microstep: 1748.87 | bwd_inner_microstep: 1662.80 | bwd_allreduce_microstep: 86.00 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13406
total_samples=9010, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:20,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.82 | bwd_microstep: 1735.60 | bwd_inner_microstep: 1668.72 | bwd_allreduce_microstep: 66.82 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16053
total_samples=9014, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:23,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.22
[2025-08-03 03:33:23,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.01 | bwd_microstep: 1904.96 | bwd_inner_microstep: 1805.75 | bwd_allreduce_microstep: 99.14 | step_microstep: 129.67
[2025-08-03 03:33:23,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.57 | bwd: 7193.87 | bwd_inner: 6674.39 | bwd_allreduce: 519.24 | step: 130.00
{'loss': 0.7757, 'learning_rate': 1.6500635275904274e-05, 'epoch': 0.3}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14451
total_samples=9018, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:25,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.15 | bwd_microstep: 1883.47 | bwd_inner_microstep: 1711.37 | bwd_allreduce_microstep: 172.04 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12387
total_samples=9021, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:28,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.15 | bwd_microstep: 2269.01 | bwd_inner_microstep: 1808.19 | bwd_allreduce_microstep: 460.71 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13096
total_samples=9025, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:31,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.92 | bwd_microstep: 1916.93 | bwd_inner_microstep: 1852.06 | bwd_allreduce_microstep: 64.82 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12785
total_samples=9029, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:34,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 03:33:34,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.50 | bwd_microstep: 1756.35 | bwd_inner_microstep: 1646.17 | bwd_allreduce_microstep: 110.11 | step_microstep: 154.16
[2025-08-03 03:33:34,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.64 | bwd: 7825.81 | bwd_inner: 7017.81 | bwd_allreduce: 807.74 | step: 154.60
{'loss': 0.7763, 'learning_rate': 1.6488321432944218e-05, 'epoch': 0.3}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14734
total_samples=9033, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:37,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.28 | bwd_microstep: 1879.96 | bwd_inner_microstep: 1761.37 | bwd_allreduce_microstep: 118.53 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12079
total_samples=9036, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:39,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.09 | bwd_microstep: 1784.17 | bwd_inner_microstep: 1560.88 | bwd_allreduce_microstep: 223.22 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13333
total_samples=9040, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:42,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.03 | bwd_microstep: 1861.62 | bwd_inner_microstep: 1610.30 | bwd_allreduce_microstep: 251.24 | step_microstep: 0.28
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14945
total_samples=9044, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:45,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 03:33:45,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.06 | bwd_microstep: 1958.99 | bwd_inner_microstep: 1755.59 | bwd_allreduce_microstep: 203.29 | step_microstep: 135.83
[2025-08-03 03:33:45,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.39 | bwd: 7484.79 | bwd_inner: 6688.16 | bwd_allreduce: 796.36 | step: 136.34
{'loss': 0.7848, 'learning_rate': 1.6475990575117603e-05, 'epoch': 0.3}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11816
total_samples=9047, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:47,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.98 | bwd_microstep: 2013.02 | bwd_inner_microstep: 1812.67 | bwd_allreduce_microstep: 200.28 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12028
total_samples=9050, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:50,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.01 | bwd_microstep: 2016.88 | bwd_inner_microstep: 1852.27 | bwd_allreduce_microstep: 164.55 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13109
total_samples=9054, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:53,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.35 | bwd_microstep: 2021.25 | bwd_inner_microstep: 1864.14 | bwd_allreduce_microstep: 157.04 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12816
total_samples=9058, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:56,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 03:33:56,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.70 | bwd_microstep: 1795.62 | bwd_inner_microstep: 1658.10 | bwd_allreduce_microstep: 137.45 | step_microstep: 120.63
[2025-08-03 03:33:56,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.97 | bwd: 7846.82 | bwd_inner: 7187.18 | bwd_allreduce: 659.40 | step: 120.95
{'loss': 0.7644, 'learning_rate': 1.646364273476067e-05, 'epoch': 0.3}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11549
total_samples=9061, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:33:58,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.66 | bwd_microstep: 1716.26 | bwd_inner_microstep: 1529.00 | bwd_allreduce_microstep: 187.19 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12146
total_samples=9064, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:01,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.96 | bwd_microstep: 1809.14 | bwd_inner_microstep: 1574.75 | bwd_allreduce_microstep: 234.33 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13355
total_samples=9068, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:03,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 667.88 | bwd_microstep: 1748.05 | bwd_inner_microstep: 1683.93 | bwd_allreduce_microstep: 64.06 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12123
total_samples=9072, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:06,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.35
[2025-08-03 03:34:06,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.63 | bwd_microstep: 2089.34 | bwd_inner_microstep: 1857.25 | bwd_allreduce_microstep: 232.03 | step_microstep: 114.61
[2025-08-03 03:34:06,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2747.04 | bwd: 7362.83 | bwd_inner: 6644.92 | bwd_allreduce: 717.67 | step: 114.94
{'loss': 0.7611, 'learning_rate': 1.6451277944254186e-05, 'epoch': 0.3}
 30%|██▉       | 592/2000 [1:50:27<4:15:30, 10.89s/it]
 30%|██▉       | 593/2000 [1:50:38<4:12:19, 10.76s/it]
 30%|██▉       | 594/2000 [1:50:49<4:14:46, 10.87s/it]
 30%|██▉       | 595/2000 [1:51:00<4:13:50, 10.84s/it]
 30%|██▉       | 596/2000 [1:51:11<4:15:23, 10.91s/it]
 30%|██▉       | 597/2000 [1:51:21<4:12:47, 10.81s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12290
total_samples=9075, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:09,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.23 | bwd_microstep: 1768.31 | bwd_inner_microstep: 1565.58 | bwd_allreduce_microstep: 202.68 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13229
total_samples=9079, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:12,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.38 | bwd_microstep: 2063.81 | bwd_inner_microstep: 1704.78 | bwd_allreduce_microstep: 358.97 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11980
total_samples=9082, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:15,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.75 | bwd_microstep: 2031.99 | bwd_inner_microstep: 1809.10 | bwd_allreduce_microstep: 222.82 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13261
total_samples=9086, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:17,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:34:17,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.26 | bwd_microstep: 1751.90 | bwd_inner_microstep: 1663.09 | bwd_allreduce_microstep: 88.74 | step_microstep: 130.94
[2025-08-03 03:34:17,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.56 | bwd: 7616.06 | bwd_inner: 6742.56 | bwd_allreduce: 873.27 | step: 131.25
{'loss': 0.7729, 'learning_rate': 1.6438896236023374e-05, 'epoch': 0.3}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13580
total_samples=9090, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:20,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.86 | bwd_microstep: 1722.02 | bwd_inner_microstep: 1695.80 | bwd_allreduce_microstep: 26.16 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13045
total_samples=9094, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:22,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.17 | bwd_microstep: 1969.87 | bwd_inner_microstep: 1848.90 | bwd_allreduce_microstep: 120.91 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11608
total_samples=9097, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:25,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.28 | bwd_microstep: 1793.46 | bwd_inner_microstep: 1568.54 | bwd_allreduce_microstep: 224.85 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12363
total_samples=9100, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:28,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 03:34:28,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.98 | bwd_microstep: 2116.13 | bwd_inner_microstep: 1858.88 | bwd_allreduce_microstep: 257.19 | step_microstep: 136.25
[2025-08-03 03:34:28,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.21 | bwd: 7601.53 | bwd_inner: 6972.11 | bwd_allreduce: 629.19 | step: 136.58
{'loss': 0.7622, 'learning_rate': 1.6426497642537826e-05, 'epoch': 0.3}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885
total_samples=9103, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:31,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.13 | bwd_microstep: 2053.87 | bwd_inner_microstep: 1826.17 | bwd_allreduce_microstep: 227.63 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13603
total_samples=9107, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:34,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.71 | bwd_microstep: 1849.58 | bwd_inner_microstep: 1806.24 | bwd_allreduce_microstep: 43.29 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11957
total_samples=9110, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:36,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.64 | bwd_microstep: 1764.22 | bwd_inner_microstep: 1553.46 | bwd_allreduce_microstep: 210.69 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12944
total_samples=9114, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:39,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 03:34:39,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.54 | bwd_microstep: 1768.35 | bwd_inner_microstep: 1650.10 | bwd_allreduce_microstep: 118.19 | step_microstep: 160.59
[2025-08-03 03:34:39,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.96 | bwd: 7436.07 | bwd_inner: 6835.96 | bwd_allreduce: 599.87 | step: 161.04
{'loss': 0.7776, 'learning_rate': 1.6414082196311402e-05, 'epoch': 0.3}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13596
total_samples=9118, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:41,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.89 | bwd_microstep: 1824.95 | bwd_inner_microstep: 1713.64 | bwd_allreduce_microstep: 111.25 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12021
total_samples=9121, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:44,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.27 | bwd_microstep: 2087.52 | bwd_inner_microstep: 1840.83 | bwd_allreduce_microstep: 246.63 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13644
total_samples=9125, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:47,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.13 | bwd_microstep: 2094.31 | bwd_inner_microstep: 1743.72 | bwd_allreduce_microstep: 350.53 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11946
total_samples=9128, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:50,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 03:34:50,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.56 | bwd_microstep: 1755.73 | bwd_inner_microstep: 1550.80 | bwd_allreduce_microstep: 204.86 | step_microstep: 137.98
[2025-08-03 03:34:50,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.76 | bwd: 7762.57 | bwd_inner: 6848.98 | bwd_allreduce: 913.35 | step: 138.40
{'loss': 0.7702, 'learning_rate': 1.640164992990216e-05, 'epoch': 0.3}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11802
total_samples=9132, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:52,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.41 | bwd_microstep: 1761.11 | bwd_inner_microstep: 1540.57 | bwd_allreduce_microstep: 220.48 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11591
total_samples=9135, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:55,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.84 | bwd_microstep: 2048.50 | bwd_inner_microstep: 1814.17 | bwd_allreduce_microstep: 234.26 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12927
total_samples=9139, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:34:58,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.37 | bwd_microstep: 1786.57 | bwd_inner_microstep: 1660.56 | bwd_allreduce_microstep: 125.95 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11934
total_samples=9142, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:01,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 03:35:01,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.03 | bwd_microstep: 2342.64 | bwd_inner_microstep: 2303.02 | bwd_allreduce_microstep: 39.55 | step_microstep: 121.86
[2025-08-03 03:35:01,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2766.58 | bwd: 7938.86 | bwd_inner: 7318.32 | bwd_allreduce: 620.31 | step: 122.29
{'loss': 0.7784, 'learning_rate': 1.638920087591228e-05, 'epoch': 0.3}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11766
total_samples=9145, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:04,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.96 | bwd_microstep: 2341.96 | bwd_inner_microstep: 1986.23 | bwd_allreduce_microstep: 355.67 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11674
total_samples=9148, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:07,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.77 | bwd_microstep: 2008.45 | bwd_inner_microstep: 1810.60 | bwd_allreduce_microstep: 197.79 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12136
total_samples=9152, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:09,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.01 | bwd_microstep: 1821.38 | bwd_inner_microstep: 1583.37 | bwd_allreduce_microstep: 237.94 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13510
total_samples=9156, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:12,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 03:35:12,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.73 | bwd_microstep: 1829.07 | bwd_inner_microstep: 1714.37 | bwd_allreduce_microstep: 114.64 | step_microstep: 120.59
[2025-08-03 03:35:12,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.40 | bwd: 8000.91 | bwd_inner: 7094.57 | bwd_allreduce: 906.10 | step: 120.94
{'loss': 0.775, 'learning_rate': 1.637673506698794e-05, 'epoch': 0.3}
 30%|██▉       | 598/2000 [1:51:32<4:13:08, 10.83s/it]
 30%|██▉       | 599/2000 [1:51:43<4:13:25, 10.85s/it]
 30%|███       | 600/2000 [1:51:54<4:12:16, 10.81s/it]
 30%|███       | 601/2000 [1:52:05<4:13:27, 10.87s/it]
 30%|███       | 602/2000 [1:52:16<4:15:12, 10.95s/it]
 30%|███       | 603/2000 [1:52:27<4:16:52, 11.03s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12156
total_samples=9159, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:15,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.33 | bwd_microstep: 1829.49 | bwd_inner_microstep: 1572.27 | bwd_allreduce_microstep: 257.16 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14199
total_samples=9163, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:17,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.13 | bwd_microstep: 1836.29 | bwd_inner_microstep: 1799.02 | bwd_allreduce_microstep: 37.21 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13157
total_samples=9167, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:20,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.10 | bwd_microstep: 2072.62 | bwd_inner_microstep: 1714.63 | bwd_allreduce_microstep: 357.93 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13664
total_samples=9171, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:23,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 03:35:23,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.63 | bwd_microstep: 2097.10 | bwd_inner_microstep: 1896.40 | bwd_allreduce_microstep: 200.63 | step_microstep: 110.08
[2025-08-03 03:35:23,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2759.13 | bwd: 7835.56 | bwd_inner: 6982.32 | bwd_allreduce: 853.01 | step: 110.49
{'loss': 0.7868, 'learning_rate': 1.6364252535819284e-05, 'epoch': 0.3}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12054
total_samples=9174, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:26,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.10 | bwd_microstep: 1768.97 | bwd_inner_microstep: 1548.46 | bwd_allreduce_microstep: 220.44 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13522
total_samples=9178, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:28,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.16 | bwd_microstep: 1881.66 | bwd_inner_microstep: 1812.45 | bwd_allreduce_microstep: 69.14 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11942
total_samples=9181, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:31,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.19 | bwd_microstep: 1925.90 | bwd_inner_microstep: 1755.02 | bwd_allreduce_microstep: 170.81 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11609
total_samples=9184, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:34,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.69
[2025-08-03 03:35:34,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 747.44 | bwd_microstep: 1736.75 | bwd_inner_microstep: 1527.52 | bwd_allreduce_microstep: 209.16 | step_microstep: 114.08
[2025-08-03 03:35:34,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.83 | bwd: 7313.32 | bwd_inner: 6643.45 | bwd_allreduce: 669.63 | step: 114.41
{'loss': 0.7759, 'learning_rate': 1.6351753315140285e-05, 'epoch': 0.3}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12067
total_samples=9187, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:36,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.20 | bwd_microstep: 1783.48 | bwd_inner_microstep: 1565.47 | bwd_allreduce_microstep: 217.96 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12321
total_samples=9191, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:39,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.52 | bwd_microstep: 1794.08 | bwd_inner_microstep: 1580.28 | bwd_allreduce_microstep: 213.72 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11650
total_samples=9195, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:42,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.43 | bwd_microstep: 1762.30 | bwd_inner_microstep: 1537.90 | bwd_allreduce_microstep: 224.33 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11852
total_samples=9198, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:44,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 03:35:44,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.79 | bwd_microstep: 1907.08 | bwd_inner_microstep: 1595.28 | bwd_allreduce_microstep: 311.73 | step_microstep: 117.10
[2025-08-03 03:35:44,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.87 | bwd: 7246.99 | bwd_inner: 6278.93 | bwd_allreduce: 967.82 | step: 117.48
{'loss': 0.7727, 'learning_rate': 1.63392374377287e-05, 'epoch': 0.3}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12200
total_samples=9201, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:47,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.21 | bwd_microstep: 1838.62 | bwd_inner_microstep: 1596.13 | bwd_allreduce_microstep: 242.43 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11599
total_samples=9204, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:49,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.61 | bwd_microstep: 1750.82 | bwd_inner_microstep: 1530.60 | bwd_allreduce_microstep: 220.15 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15064
total_samples=9208, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:52,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.88 | bwd_microstep: 1857.45 | bwd_inner_microstep: 1773.43 | bwd_allreduce_microstep: 83.96 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13374
total_samples=9212, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:55,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 03:35:55,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.63 | bwd_microstep: 1816.26 | bwd_inner_microstep: 1707.87 | bwd_allreduce_microstep: 108.32 | step_microstep: 113.90
[2025-08-03 03:35:55,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2742.26 | bwd: 7263.19 | bwd_inner: 6608.03 | bwd_allreduce: 654.93 | step: 114.23
{'loss': 0.7756, 'learning_rate': 1.6326704936405953e-05, 'epoch': 0.3}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14234
total_samples=9217, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:35:58,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.03 | bwd_microstep: 2041.87 | bwd_inner_microstep: 1742.19 | bwd_allreduce_microstep: 299.62 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11627
total_samples=9220, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:00,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.27 | bwd_microstep: 1940.89 | bwd_inner_microstep: 1737.76 | bwd_allreduce_microstep: 203.07 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13685
total_samples=9224, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:03,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.86 | bwd_microstep: 1955.10 | bwd_inner_microstep: 1846.63 | bwd_allreduce_microstep: 108.40 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13352
total_samples=9228, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:06,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.84
[2025-08-03 03:36:06,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.77 | bwd_microstep: 1780.74 | bwd_inner_microstep: 1682.76 | bwd_allreduce_microstep: 97.91 | step_microstep: 165.14
[2025-08-03 03:36:06,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.85 | bwd: 7718.64 | bwd_inner: 7009.34 | bwd_allreduce: 709.07 | step: 165.59
{'loss': 0.7619, 'learning_rate': 1.6314155844037074e-05, 'epoch': 0.3}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13322
total_samples=9232, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:09,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.51 | bwd_microstep: 1963.44 | bwd_inner_microstep: 1932.97 | bwd_allreduce_microstep: 30.41 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11927
total_samples=9235, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:12,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.28 | bwd_microstep: 2185.09 | bwd_inner_microstep: 1775.73 | bwd_allreduce_microstep: 409.31 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13075
total_samples=9239, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:14,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.45 | bwd_microstep: 2079.79 | bwd_inner_microstep: 1983.04 | bwd_allreduce_microstep: 96.69 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13444
total_samples=9244, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:17,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 03:36:17,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.98 | bwd_microstep: 1755.65 | bwd_inner_microstep: 1669.18 | bwd_allreduce_microstep: 86.40 | step_microstep: 145.04
[2025-08-03 03:36:17,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2748.15 | bwd: 7984.02 | bwd_inner: 7360.90 | bwd_allreduce: 622.88 | step: 145.37
{'loss': 0.7716, 'learning_rate': 1.6301590193530585e-05, 'epoch': 0.3}
 30%|███       | 604/2000 [1:52:38<4:16:46, 11.04s/it]
 30%|███       | 605/2000 [1:52:49<4:13:26, 10.90s/it]
 30%|███       | 606/2000 [1:52:59<4:10:29, 10.78s/it]
 30%|███       | 607/2000 [1:53:10<4:08:03, 10.68s/it]
 30%|███       | 608/2000 [1:53:21<4:10:11, 10.78s/it]
 30%|███       | 609/2000
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12794
total_samples=9248, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:20,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.10 | bwd_microstep: 1794.62 | bwd_inner_microstep: 1655.36 | bwd_allreduce_microstep: 139.20 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11872
total_samples=9252, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:22,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.13 | bwd_microstep: 1750.97 | bwd_inner_microstep: 1539.23 | bwd_allreduce_microstep: 211.68 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13170
total_samples=9256, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:25,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.57 | bwd_microstep: 1982.87 | bwd_inner_microstep: 1883.21 | bwd_allreduce_microstep: 99.59 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15629
total_samples=9262, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:28,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 03:36:28,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.60 | bwd_microstep: 1828.74 | bwd_inner_microstep: 1792.41 | bwd_allreduce_microstep: 36.27 | step_microstep: 108.15
[2025-08-03 03:36:28,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.35 | bwd: 7357.24 | bwd_inner: 6870.20 | bwd_allreduce: 486.81 | step: 108.59
{'loss': 0.7863, 'learning_rate': 1.6289008017838447e-05, 'epoch': 0.3}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13246
total_samples=9266, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:30,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.15 | bwd_microstep: 1775.62 | bwd_inner_microstep: 1689.28 | bwd_allreduce_microstep: 86.28 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14643
total_samples=9270, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:33,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.14 | bwd_microstep: 1821.20 | bwd_inner_microstep: 1761.44 | bwd_allreduce_microstep: 59.70 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13043
total_samples=9274, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:36,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.31 | bwd_microstep: 2000.02 | bwd_inner_microstep: 1835.40 | bwd_allreduce_microstep: 164.55 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11570
total_samples=9277, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:39,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 03:36:39,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 976.79 | bwd_microstep: 1972.36 | bwd_inner_microstep: 1763.66 | bwd_allreduce_microstep: 208.63 | step_microstep: 133.96
[2025-08-03 03:36:39,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3113.33 | bwd: 7569.24 | bwd_inner: 7049.78 | bwd_allreduce: 519.24 | step: 134.29
{'loss': 0.7642, 'learning_rate': 1.6276409349955945e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13350
total_samples=9281, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:42,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.33 | bwd_microstep: 1966.51 | bwd_inner_microstep: 1705.26 | bwd_allreduce_microstep: 261.18 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11994
total_samples=9284, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:44,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.14 | bwd_microstep: 1758.63 | bwd_inner_microstep: 1543.72 | bwd_allreduce_microstep: 214.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13429
total_samples=9288, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:47,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.28 | bwd_microstep: 1774.75 | bwd_inner_microstep: 1693.51 | bwd_allreduce_microstep: 81.17 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11608
total_samples=9291, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:50,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 03:36:50,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.13 | bwd_microstep: 2137.16 | bwd_inner_microstep: 1895.79 | bwd_allreduce_microstep: 241.30 | step_microstep: 114.42
[2025-08-03 03:36:50,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.83 | bwd: 7637.08 | bwd_inner: 6838.27 | bwd_allreduce: 798.58 | step: 114.86
{'loss': 0.7829, 'learning_rate': 1.626379422292162e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13388
total_samples=9296, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:52,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.10 | bwd_microstep: 1812.02 | bwd_inner_microstep: 1724.25 | bwd_allreduce_microstep: 87.71 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11882
total_samples=9299, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:55,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.70 | bwd_microstep: 1848.74 | bwd_inner_microstep: 1590.37 | bwd_allreduce_microstep: 258.30 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 11906
total_samples=9303, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:36:58,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 751.72 | bwd_microstep: 1852.24 | bwd_inner_microstep: 1609.41 | bwd_allreduce_microstep: 242.77 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12645
total_samples=9307, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:01,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:37:01,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.96 | bwd_microstep: 1779.04 | bwd_inner_microstep: 1619.37 | bwd_allreduce_microstep: 159.60 | step_microstep: 411.67
[2025-08-03 03:37:01,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.41 | bwd: 7292.08 | bwd_inner: 6543.39 | bwd_allreduce: 748.46 | step: 412.01
{'loss': 0.7773, 'learning_rate': 1.6251162669817172e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13205
total_samples=9311, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:03,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.80 | bwd_microstep: 1775.32 | bwd_inner_microstep: 1671.99 | bwd_allreduce_microstep: 103.27 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13277
total_samples=9315, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:06,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.90 | bwd_microstep: 1894.78 | bwd_inner_microstep: 1649.51 | bwd_allreduce_microstep: 245.21 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13074
total_samples=9319, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:09,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.49 | bwd_microstep: 2200.52 | bwd_inner_microstep: 1918.54 | bwd_allreduce_microstep: 281.91 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12127
total_samples=9322, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:12,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 03:37:12,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.82 | bwd_microstep: 1895.87 | bwd_inner_microstep: 1707.76 | bwd_allreduce_microstep: 188.04 | step_microstep: 138.65
[2025-08-03 03:37:12,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.95 | bwd: 7766.55 | bwd_inner: 6947.80 | bwd_allreduce: 818.51 | step: 139.00
{'loss': 0.7684, 'learning_rate': 1.6238514723767372e-05, 'epoch': 0.31}
 30%|███       | 609/2000 [1:53:32<4:13:06, 10.92s/it]
 30%|███       | 610/2000 [1:53:42<4:10:45, 10.82s/it]
 31%|███       | 611/2000 [1:53:54<4:12:48, 10.92s/it]
 31%|███       | 612/2000 [1:54:05<4:12:20, 10.91s/it]
 31%|███       | 613/2000 [1:54:15<4:11:54, 10.90s/it]
 31%|███       | 614/2000 [1:54:26<4:12:48, 10.94s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15600
total_samples=9327, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:14,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.34 | bwd_microstep: 1794.52 | bwd_inner_microstep: 1788.56 | bwd_allreduce_microstep: 5.91 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11718
total_samples=9330, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:17,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.73 | bwd_microstep: 1931.28 | bwd_inner_microstep: 1556.03 | bwd_allreduce_microstep: 375.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13751
total_samples=9334, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:19,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.51 | bwd_microstep: 1731.51 | bwd_inner_microstep: 1681.09 | bwd_allreduce_microstep: 50.36 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12668
total_samples=9338, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:22,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 03:37:22,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.84 | bwd_microstep: 1915.32 | bwd_inner_microstep: 1766.61 | bwd_allreduce_microstep: 148.65 | step_microstep: 129.09
[2025-08-03 03:37:22,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.36 | bwd: 7372.69 | bwd_inner: 6792.28 | bwd_allreduce: 580.19 | step: 129.52
{'loss': 0.784, 'learning_rate': 1.622585041793999e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14511
total_samples=9342, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:26,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.80 | bwd_microstep: 3104.33 | bwd_inner_microstep: 3002.51 | bwd_allreduce_microstep: 101.73 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12129
total_samples=9345, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:29,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.86 | bwd_microstep: 1993.09 | bwd_inner_microstep: 1866.42 | bwd_allreduce_microstep: 126.60 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12176
total_samples=9349, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:31,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.41 | bwd_microstep: 1758.42 | bwd_inner_microstep: 1557.32 | bwd_allreduce_microstep: 201.04 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11715
total_samples=9352, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:34,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:37:34,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.49 | bwd_microstep: 1829.10 | bwd_inner_microstep: 1594.61 | bwd_allreduce_microstep: 234.43 | step_microstep: 158.49
[2025-08-03 03:37:34,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.49 | bwd: 8685.00 | bwd_inner: 8020.87 | bwd_allreduce: 663.87 | step: 158.80
{'loss': 0.7813, 'learning_rate': 1.6213169785545688e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332
total_samples=9356, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:37,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.88 | bwd_microstep: 2211.32 | bwd_inner_microstep: 1934.42 | bwd_allreduce_microstep: 276.84 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13546
total_samples=9360, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:40,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.16 | bwd_microstep: 1737.21 | bwd_inner_microstep: 1681.00 | bwd_allreduce_microstep: 56.15 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12719
total_samples=9364, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:42,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.34 | bwd_microstep: 1845.50 | bwd_inner_microstep: 1640.97 | bwd_allreduce_microstep: 204.45 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11687
total_samples=9367, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:45,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 03:37:45,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.03 | bwd_microstep: 1831.90 | bwd_inner_microstep: 1579.14 | bwd_allreduce_microstep: 252.70 | step_microstep: 153.27
[2025-08-03 03:37:45,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.34 | bwd: 7625.97 | bwd_inner: 6835.52 | bwd_allreduce: 790.22 | step: 153.58
{'loss': 0.7864, 'learning_rate': 1.6200472859837946e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13009
total_samples=9371, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:48,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.46 | bwd_microstep: 1771.30 | bwd_inner_microstep: 1672.64 | bwd_allreduce_microstep: 98.60 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12116
total_samples=9374, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:50,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.39 | bwd_microstep: 1942.52 | bwd_inner_microstep: 1769.35 | bwd_allreduce_microstep: 173.10 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12716
total_samples=9378, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:53,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.21 | bwd_microstep: 1836.56 | bwd_inner_microstep: 1671.95 | bwd_allreduce_microstep: 164.55 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11872
total_samples=9381, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:56,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 03:37:56,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.48 | bwd_microstep: 2167.94 | bwd_inner_microstep: 1898.97 | bwd_allreduce_microstep: 268.90 | step_microstep: 114.99
[2025-08-03 03:37:56,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.43 | bwd: 7718.36 | bwd_inner: 7012.90 | bwd_allreduce: 705.22 | step: 115.35
{'loss': 0.7773, 'learning_rate': 1.6187759674112972e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13435
total_samples=9385, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:37:59,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.08 | bwd_microstep: 2082.94 | bwd_inner_microstep: 1964.68 | bwd_allreduce_microstep: 118.19 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14015
total_samples=9389, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:01,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.56 | bwd_microstep: 1733.08 | bwd_inner_microstep: 1701.09 | bwd_allreduce_microstep: 31.92 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14027
total_samples=9393, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:04,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.11 | bwd_microstep: 1732.96 | bwd_inner_microstep: 1697.96 | bwd_allreduce_microstep: 34.93 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11987
total_samples=9396, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:07,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 03:38:07,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.73 | bwd_microstep: 1747.13 | bwd_inner_microstep: 1548.43 | bwd_allreduce_microstep: 198.64 | step_microstep: 140.52
[2025-08-03 03:38:07,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.40 | bwd: 7296.15 | bwd_inner: 6912.16 | bwd_allreduce: 383.75 | step: 140.88
{'loss': 0.7643, 'learning_rate': 1.6175030261709615e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14349
total_samples=9400, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:09,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.17 | bwd_microstep: 2003.31 | bwd_inner_microstep: 1772.64 | bwd_allreduce_microstep: 230.61 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14336
total_samples=9404, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:12,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.09 | bwd_microstep: 2175.32 | bwd_inner_microstep: 1871.63 | bwd_allreduce_microstep: 303.63 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13126
total_samples=9409, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:15,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 1856.85 | bwd_inner_microstep: 1687.91 | bwd_allreduce_microstep: 168.89 | step_microstep: 0.21
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12697
total_samples=9413, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:18,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 03:38:18,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.81 | bwd_microstep: 1789.01 | bwd_inner_microstep: 1617.63 | bwd_allreduce_microstep: 171.32 | step_microstep: 117.09
[2025-08-03 03:38:18,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.05 | bwd: 7824.55 | bwd_inner: 6949.79 | bwd_allreduce: 874.52 | step: 117.55
{'loss': 0.7669, 'learning_rate': 1.6162284656009276e-05, 'epoch': 0.31}
 31%|███       | 615/2000 [1:54:37<4:10:23, 10.85s/it]
 31%|███       | 616/2000 [1:54:49<4:18:09, 11.19s/it]
 31%|███       | 617/2000 [1:55:00<4:16:13, 11.12s/it]
 31%|███       | 618/2000 [1:55:11<4:14:55, 11.07s/it]
 31%|███       | 619/2000 [1:55:21<4:11:06, 10.91s/it]
 31%|███       | 620/2000 [1:55:33<4:12:08, 10.96s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12346
total_samples=9416, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:21,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.53 | bwd_microstep: 2012.09 | bwd_inner_microstep: 1771.95 | bwd_allreduce_microstep: 240.08 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13615
total_samples=9420, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:23,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.63 | bwd_microstep: 1726.56 | bwd_inner_microstep: 1674.02 | bwd_allreduce_microstep: 52.48 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11791
total_samples=9423, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:25,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.91 | bwd_microstep: 1728.82 | bwd_inner_microstep: 1539.78 | bwd_allreduce_microstep: 188.94 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11791
total_samples=9426, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:28,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:38:28,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.51 | bwd_microstep: 1769.13 | bwd_inner_microstep: 1553.53 | bwd_allreduce_microstep: 215.54 | step_microstep: 152.71
[2025-08-03 03:38:28,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2763.51 | bwd: 7236.66 | bwd_inner: 6539.27 | bwd_allreduce: 697.14 | step: 153.25
{'loss': 0.7616, 'learning_rate': 1.6149522890435815e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13900
total_samples=9430, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:31,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.26 | bwd_microstep: 1740.99 | bwd_inner_microstep: 1694.95 | bwd_allreduce_microstep: 45.97 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13671
total_samples=9434, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:34,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.98 | bwd_microstep: 1995.86 | bwd_inner_microstep: 1867.91 | bwd_allreduce_microstep: 127.89 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13408
total_samples=9438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:36,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.44 | bwd_microstep: 2051.54 | bwd_inner_microstep: 1727.18 | bwd_allreduce_microstep: 324.30 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15246
total_samples=9442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:39,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 03:38:39,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.40 | bwd_microstep: 2183.03 | bwd_inner_microstep: 2032.82 | bwd_allreduce_microstep: 150.16 | step_microstep: 120.37
[2025-08-03 03:38:39,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.00 | bwd: 7971.47 | bwd_inner: 7322.85 | bwd_allreduce: 648.39 | step: 120.69
{'loss': 0.7784, 'learning_rate': 1.6136744998455477e-05, 'epoch': 0.31}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13953
total_samples=9446, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:42,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.69 | bwd_microstep: 1861.68 | bwd_inner_microstep: 1739.49 | bwd_allreduce_microstep: 122.13 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13018
total_samples=9450, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:45,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.74 | bwd_microstep: 1792.75 | bwd_inner_microstep: 1630.50 | bwd_allreduce_microstep: 162.18 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11847
total_samples=9453, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:47,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.07 | bwd_microstep: 1805.69 | bwd_inner_microstep: 1570.38 | bwd_allreduce_microstep: 235.25 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13113
total_samples=9457, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:50,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 03:38:50,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.54 | bwd_microstep: 2124.76 | bwd_inner_microstep: 1960.16 | bwd_allreduce_microstep: 164.53 | step_microstep: 110.95
[2025-08-03 03:38:50,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.96 | bwd: 7584.94 | bwd_inner: 6900.53 | bwd_allreduce: 684.17 | step: 111.42
{'loss': 0.7659, 'learning_rate': 1.6123951013576796e-05, 'epoch': 0.31}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11887
total_samples=9460, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:53,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.88 | bwd_microstep: 1858.98 | bwd_inner_microstep: 1603.18 | bwd_allreduce_microstep: 255.73 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12258
total_samples=9463, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:56,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.37 | bwd_microstep: 1728.85 | bwd_inner_microstep: 1555.99 | bwd_allreduce_microstep: 172.80 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12002
total_samples=9466, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:38:58,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.32 | bwd_microstep: 1763.07 | bwd_inner_microstep: 1555.21 | bwd_allreduce_microstep: 207.79 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14579
total_samples=9471, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:01,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 03:39:01,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.78 | bwd_microstep: 1747.80 | bwd_inner_microstep: 1712.25 | bwd_allreduce_microstep: 35.48 | step_microstep: 164.55
[2025-08-03 03:39:01,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.28 | bwd: 7098.75 | bwd_inner: 6426.63 | bwd_allreduce: 671.88 | step: 164.90
{'loss': 0.7767, 'learning_rate': 1.6111140969350504e-05, 'epoch': 0.31}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11626
total_samples=9474, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:04,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.70 | bwd_microstep: 1985.24 | bwd_inner_microstep: 1758.33 | bwd_allreduce_microstep: 226.84 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891
total_samples=9477, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:06,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.62 | bwd_microstep: 1871.17 | bwd_inner_microstep: 1745.66 | bwd_allreduce_microstep: 125.44 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11880
total_samples=9480, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:09,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.22 | bwd_microstep: 1888.19 | bwd_inner_microstep: 1759.85 | bwd_allreduce_microstep: 128.27 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13519
total_samples=9484, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:12,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 03:39:12,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.43 | bwd_microstep: 1907.93 | bwd_inner_microstep: 1683.61 | bwd_allreduce_microstep: 224.25 | step_microstep: 139.46
[2025-08-03 03:39:12,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.89 | bwd: 7652.58 | bwd_inner: 6947.46 | bwd_allreduce: 704.88 | step: 139.98
{'loss': 0.7883, 'learning_rate': 1.6098314899369446e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13554
total_samples=9488, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:14,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.90 | bwd_microstep: 1945.17 | bwd_inner_microstep: 1850.88 | bwd_allreduce_microstep: 94.24 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11681
total_samples=9491, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:17,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.10 | bwd_microstep: 1929.83 | bwd_inner_microstep: 1550.38 | bwd_allreduce_microstep: 379.39 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12576
total_samples=9495, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:20,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.22 | bwd_microstep: 1768.85 | bwd_inner_microstep: 1617.22 | bwd_allreduce_microstep: 151.56 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13706
total_samples=9500, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:22,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 03:39:22,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.46 | bwd_microstep: 1865.31 | bwd_inner_microstep: 1686.22 | bwd_allreduce_microstep: 179.03 | step_microstep: 127.02
[2025-08-03 03:39:22,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2730.61 | bwd: 7509.21 | bwd_inner: 6704.70 | bwd_allreduce: 804.28 | step: 127.44
 31%|███       | 621/2000 [1:55:43<4:08:42, 10.82s/it]
 31%|███       | 622/2000 [1:55:54<4:11:38, 10.96s/it]
 31%|███       | 623/2000 [1:56:05<4:10:54, 10.93s/it]
 31%|███       | 624/2000 [1:56:16<4:06:52, 10.77s/it]
 31%|███▏      | 625/2000 [1:56:27<4:07:53, 10.82s/it]
 31%|███▏      | 626/2000 [1:56:37<4:07:10, 10.79s/it]
{'loss': 0.7795, 'learning_rate': 1.6085472837268504e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13629
total_samples=9504, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:25,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.05 | bwd_microstep: 2096.22 | bwd_inner_microstep: 1920.67 | bwd_allreduce_microstep: 175.48 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11757
total_samples=9507, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:28,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.74 | bwd_microstep: 1889.60 | bwd_inner_microstep: 1632.66 | bwd_allreduce_microstep: 256.87 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11653
total_samples=9510, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:30,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.31 | bwd_microstep: 1768.38 | bwd_inner_microstep: 1532.63 | bwd_allreduce_microstep: 235.68 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13864
total_samples=9514, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:33,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 03:39:33,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.53 | bwd_microstep: 2039.09 | bwd_inner_microstep: 1880.43 | bwd_allreduce_microstep: 158.59 | step_microstep: 122.18
[2025-08-03 03:39:33,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.57 | bwd: 7793.33 | bwd_inner: 6966.39 | bwd_allreduce: 826.71 | step: 122.51
{'loss': 0.7754, 'learning_rate': 1.607261481672448e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13491
total_samples=9518, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:36,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.75 | bwd_microstep: 1822.20 | bwd_inner_microstep: 1727.31 | bwd_allreduce_microstep: 94.82 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13490
total_samples=9522, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:39,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.45 | bwd_microstep: 1753.49 | bwd_inner_microstep: 1682.40 | bwd_allreduce_microstep: 71.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14717
total_samples=9527, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:41,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.37 | bwd_microstep: 1976.74 | bwd_inner_microstep: 1918.24 | bwd_allreduce_microstep: 58.43 | step_microstep: 0.14
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13394
total_samples=9532, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:44,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 03:39:44,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.44 | bwd_microstep: 1794.99 | bwd_inner_microstep: 1674.06 | bwd_allreduce_microstep: 120.86 | step_microstep: 147.58
[2025-08-03 03:39:44,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.95 | bwd: 7347.47 | bwd_inner: 7002.01 | bwd_allreduce: 345.23 | step: 147.93
{'loss': 0.7702, 'learning_rate': 1.6059740871456035e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13187
total_samples=9536, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:47,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.90 | bwd_microstep: 1797.20 | bwd_inner_microstep: 1685.09 | bwd_allreduce_microstep: 112.05 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13213
total_samples=9540, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:49,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.25 | bwd_microstep: 1862.52 | bwd_inner_microstep: 1806.88 | bwd_allreduce_microstep: 55.58 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13192
total_samples=9544, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:52,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 984.67 | bwd_microstep: 2122.83 | bwd_inner_microstep: 1987.66 | bwd_allreduce_microstep: 135.10 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14128
total_samples=9548, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:56,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 03:39:56,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 897.80 | bwd_microstep: 2061.94 | bwd_inner_microstep: 2021.99 | bwd_allreduce_microstep: 39.89 | step_microstep: 129.56
[2025-08-03 03:39:56,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3255.54 | bwd: 7844.54 | bwd_inner: 7501.61 | bwd_allreduce: 342.69 | step: 129.91
{'loss': 0.7673, 'learning_rate': 1.6046851035223594e-05, 'epoch': 0.31}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13669
total_samples=9552, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:39:58,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.74 | bwd_microstep: 1787.92 | bwd_inner_microstep: 1727.67 | bwd_allreduce_microstep: 60.18 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13075
total_samples=9556, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:01,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.25 | bwd_microstep: 1818.48 | bwd_inner_microstep: 1694.81 | bwd_allreduce_microstep: 123.61 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13608
total_samples=9560, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:03,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.42 | bwd_microstep: 1822.66 | bwd_inner_microstep: 1719.35 | bwd_allreduce_microstep: 103.24 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14359
total_samples=9564, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:06,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 03:40:06,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.81 | bwd_microstep: 1798.34 | bwd_inner_microstep: 1698.38 | bwd_allreduce_microstep: 99.89 | step_microstep: 155.56
[2025-08-03 03:40:06,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.15 | bwd: 7227.44 | bwd_inner: 6840.21 | bwd_allreduce: 386.99 | step: 155.89
{'loss': 0.7853, 'learning_rate': 1.603394534182925e-05, 'epoch': 0.32}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14253
total_samples=9570, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:09,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.94 | bwd_microstep: 1795.20 | bwd_inner_microstep: 1712.17 | bwd_allreduce_microstep: 82.96 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14238
total_samples=9575, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:11,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.38 | bwd_microstep: 1867.13 | bwd_inner_microstep: 1757.31 | bwd_allreduce_microstep: 109.76 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13094
total_samples=9579, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:14,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.81 | bwd_microstep: 1899.20 | bwd_inner_microstep: 1834.31 | bwd_allreduce_microstep: 64.83 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11637
total_samples=9582, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:17,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 03:40:17,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.23 | bwd_microstep: 1769.43 | bwd_inner_microstep: 1548.46 | bwd_allreduce_microstep: 220.91 | step_microstep: 113.71
[2025-08-03 03:40:17,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.28 | bwd: 7331.02 | bwd_inner: 6852.25 | bwd_allreduce: 478.53 | step: 114.16
{'loss': 0.7742, 'learning_rate': 1.6021023825116672e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13956
total_samples=9586, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:20,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.48 | bwd_microstep: 2073.38 | bwd_inner_microstep: 1876.10 | bwd_allreduce_microstep: 197.21 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12443
total_samples=9589, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:22,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.93 | bwd_microstep: 1752.42 | bwd_inner_microstep: 1593.87 | bwd_allreduce_microstep: 158.49 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13615
total_samples=9593, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:25,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.10 | bwd_microstep: 1802.60 | bwd_inner_microstep: 1693.75 | bwd_allreduce_microstep: 108.79 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11989
total_samples=9596, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:27,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 03:40:27,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.79 | bwd_microstep: 1827.02 | bwd_inner_microstep: 1579.04 | bwd_allreduce_microstep: 247.91 | step_microstep: 136.09
[2025-08-03 03:40:27,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.23 | bwd: 7455.47 | bwd_inner: 6742.74 | bwd_allreduce: 712.48 | step: 136.43
31%|███▏      | 626/2000 [1:56:37<4:07:10, 10.79s/it]
31%|███▏      | 627/2000 [1:56:48<4:08:28, 10.86s/it]
31%|███▏      | 628/2000 [1:56:59<4:06:38, 10.79s/it]
31%|███▏      | 629/2000 [1:57:10<4:11:52, 11.02s/it]
32%|███▏      | 630/2000 [1:57:21<4:08:17, 10.87s/it]
32%|███▏      | 631/2000 [1:57:32<4:06:15, 10.79s/it]
{'loss': 0.7787, 'learning_rate': 1.6008086518971037e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12969
total_samples=9600, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:30,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.52 | bwd_microstep: 1777.60 | bwd_inner_microstep: 1674.08 | bwd_allreduce_microstep: 103.45 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13611
total_samples=9604, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:33,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.17 | bwd_microstep: 1831.67 | bwd_inner_microstep: 1743.09 | bwd_allreduce_microstep: 88.50 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 14004
total_samples=9609, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:35,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.93 | bwd_microstep: 1935.62 | bwd_inner_microstep: 1711.33 | bwd_allreduce_microstep: 224.23 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12410
total_samples=9612, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:39,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 03:40:39,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1204.36 | bwd_microstep: 1907.46 | bwd_inner_microstep: 1584.62 | bwd_allreduce_microstep: 322.78 | step_microstep: 109.47
[2025-08-03 03:40:39,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3337.92 | bwd: 7452.39 | bwd_inner: 6713.11 | bwd_allreduce: 739.03 | step: 109.94
{'loss': 0.774, 'learning_rate': 1.599513345731892e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14458
total_samples=9616, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:41,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.85 | bwd_microstep: 1710.79 | bwd_inner_microstep: 1684.18 | bwd_allreduce_microstep: 26.55 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13340
total_samples=9620, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:44,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.91 | bwd_microstep: 1806.79 | bwd_inner_microstep: 1695.23 | bwd_allreduce_microstep: 111.50 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12252
total_samples=9624, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:47,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.26 | bwd_microstep: 2079.64 | bwd_inner_microstep: 1797.33 | bwd_allreduce_microstep: 282.24 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14978
total_samples=9628, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:49,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 03:40:49,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.65 | bwd_microstep: 1740.55 | bwd_inner_microstep: 1724.33 | bwd_allreduce_microstep: 16.16 | step_microstep: 125.62
[2025-08-03 03:40:49,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.60 | bwd: 7337.82 | bwd_inner: 6901.05 | bwd_allreduce: 436.53 | step: 125.98
{'loss': 0.772, 'learning_rate': 1.598216467412822e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13431
total_samples=9632, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:52,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.94 | bwd_microstep: 2197.69 | bwd_inner_microstep: 2074.05 | bwd_allreduce_microstep: 123.58 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14065
total_samples=9636, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:55,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.19 | bwd_microstep: 1971.01 | bwd_inner_microstep: 1876.92 | bwd_allreduce_microstep: 94.02 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13815
total_samples=9640, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:40:58,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.09 | bwd_microstep: 1781.09 | bwd_inner_microstep: 1724.67 | bwd_allreduce_microstep: 56.34 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12847
total_samples=9644, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:00,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 03:41:00,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.27 | bwd_microstep: 1810.71 | bwd_inner_microstep: 1639.74 | bwd_allreduce_microstep: 170.90 | step_microstep: 114.55
[2025-08-03 03:41:00,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.42 | bwd: 7760.53 | bwd_inner: 7315.38 | bwd_allreduce: 444.91 | step: 114.87
{'loss': 0.7863, 'learning_rate': 1.5969180203408052e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13384
total_samples=9648, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:03,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.51 | bwd_microstep: 2090.50 | bwd_inner_microstep: 1960.02 | bwd_allreduce_microstep: 130.42 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13207
total_samples=9652, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:06,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.67 | bwd_microstep: 2070.19 | bwd_inner_microstep: 1905.22 | bwd_allreduce_microstep: 164.89 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13383
total_samples=9656, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:09,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.85 | bwd_microstep: 1852.16 | bwd_inner_microstep: 1696.58 | bwd_allreduce_microstep: 155.51 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14793
total_samples=9660, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:12,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 03:41:12,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.41 | bwd_microstep: 2054.44 | bwd_inner_microstep: 1791.34 | bwd_allreduce_microstep: 263.04 | step_microstep: 126.27
[2025-08-03 03:41:12,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.38 | bwd: 8067.32 | bwd_inner: 7353.14 | bwd_allreduce: 713.93 | step: 126.73
{'loss': 0.7792, 'learning_rate': 1.5956180079208684e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13964
total_samples=9664, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:14,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.28 | bwd_microstep: 1762.28 | bwd_inner_microstep: 1715.23 | bwd_allreduce_microstep: 46.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13516
total_samples=9668, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:17,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.91 | bwd_microstep: 1751.20 | bwd_inner_microstep: 1692.04 | bwd_allreduce_microstep: 59.10 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16168
total_samples=9672, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:20,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.83 | bwd_microstep: 2156.40 | bwd_inner_microstep: 2096.32 | bwd_allreduce_microstep: 60.02 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13525
total_samples=9676, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:22,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.81
[2025-08-03 03:41:22,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.66 | bwd_microstep: 1771.92 | bwd_inner_microstep: 1693.74 | bwd_allreduce_microstep: 78.12 | step_microstep: 144.45
[2025-08-03 03:41:22,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.61 | bwd: 7441.85 | bwd_inner: 7197.32 | bwd_allreduce: 244.31 | step: 144.76
{'loss': 0.7842, 'learning_rate': 1.5943164335621418e-05, 'epoch': 0.32}
32%|███▏      | 632/2000 [1:57:42<4:05:47, 10.78s/it]
32%|███▏      | 633/2000 [1:57:54<4:08:37, 10.91s/it]
32%|███▏      | 634/2000 [1:58:04<4:06:03, 10.81s/it]
32%|███▏      | 635/2000 [1:58:15<4:07:18, 10.87s/it]
32%|███▏      | 636/2000 [1:58:26<4:10:07, 11.00s/it]
32%|███▏      | 637/2000 [1:58:37<4:08:06, 10.92s/it]
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13723
total_samples=9681, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:25,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.58 | bwd_microstep: 1793.34 | bwd_inner_microstep: 1701.09 | bwd_allreduce_microstep: 92.18 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=9684, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:28,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.26 | bwd_microstep: 2018.31 | bwd_inner_microstep: 1807.00 | bwd_allreduce_microstep: 211.25 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12018
total_samples=9687, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:30,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.55 | bwd_microstep: 1768.13 | bwd_inner_microstep: 1558.29 | bwd_allreduce_microstep: 209.77 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13757
total_samples=9691, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:33,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 03:41:33,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.38 | bwd_microstep: 2046.96 | bwd_inner_microstep: 1760.11 | bwd_allreduce_microstep: 286.77 | step_microstep: 113.16
[2025-08-03 03:41:33,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.71 | bwd: 7626.79 | bwd_inner: 6826.49 | bwd_allreduce: 800.06 | step: 113.49
{'loss': 0.7669, 'learning_rate': 1.593013300677853e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14916
total_samples=9695, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:36,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.00 | bwd_microstep: 2049.84 | bwd_inner_microstep: 1926.44 | bwd_allreduce_microstep: 123.33 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11741
total_samples=9698, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:39,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.13 | bwd_microstep: 1738.64 | bwd_inner_microstep: 1536.48 | bwd_allreduce_microstep: 202.10 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12107
total_samples=9702, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:41,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.41 | bwd_microstep: 1839.13 | bwd_inner_microstep: 1593.85 | bwd_allreduce_microstep: 245.21 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13338
total_samples=9706, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:44,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 03:41:44,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.68 | bwd_microstep: 1959.61 | bwd_inner_microstep: 1707.52 | bwd_allreduce_microstep: 252.03 | step_microstep: 112.45
[2025-08-03 03:41:44,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.14 | bwd: 7587.27 | bwd_inner: 6764.29 | bwd_allreduce: 822.75 | step: 112.87
{'loss': 0.7797, 'learning_rate': 1.591708612685316e-05, 'epoch': 0.32}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13584
total_samples=9711, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:47,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.20 | bwd_microstep: 1826.42 | bwd_inner_microstep: 1697.78 | bwd_allreduce_microstep: 128.57 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12998
total_samples=9715, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:49,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.81 | bwd_microstep: 2027.39 | bwd_inner_microstep: 1870.29 | bwd_allreduce_microstep: 157.04 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11936
total_samples=9718, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:52,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.08 | bwd_microstep: 1788.91 | bwd_inner_microstep: 1552.36 | bwd_allreduce_microstep: 236.47 | step_microstep: 0.16
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13836
total_samples=9722, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:55,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 03:41:55,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.67 | bwd_microstep: 1775.66 | bwd_inner_microstep: 1706.46 | bwd_allreduce_microstep: 69.14 | step_microstep: 112.69
[2025-08-03 03:41:55,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.70 | bwd: 7418.44 | bwd_inner: 6826.90 | bwd_allreduce: 591.29 | step: 113.09
{'loss': 0.7685, 'learning_rate': 1.5904023730059227e-05, 'epoch': 0.32}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11799
total_samples=9725, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:41:57,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.29 | bwd_microstep: 1851.93 | bwd_inner_microstep: 1560.97 | bwd_allreduce_microstep: 290.90 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12168
total_samples=9728, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:00,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.17 | bwd_microstep: 2049.05 | bwd_inner_microstep: 1816.68 | bwd_allreduce_microstep: 232.30 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13520
total_samples=9732, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:03,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.08 | bwd_microstep: 1907.02 | bwd_inner_microstep: 1682.40 | bwd_allreduce_microstep: 224.56 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13282
total_samples=9736, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:06,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 03:42:06,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.23 | bwd_microstep: 1825.09 | bwd_inner_microstep: 1707.90 | bwd_allreduce_microstep: 117.12 | step_microstep: 109.78
[2025-08-03 03:42:06,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.70 | bwd: 7633.15 | bwd_inner: 6767.95 | bwd_allreduce: 864.96 | step: 110.12
{'loss': 0.7825, 'learning_rate': 1.5890945850651347e-05, 'epoch': 0.32}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11845
total_samples=9739, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:08,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.66 | bwd_microstep: 1742.89 | bwd_inner_microstep: 1543.55 | bwd_allreduce_microstep: 199.28 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11699
total_samples=9742, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:11,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.36 | bwd_microstep: 1939.75 | bwd_inner_microstep: 1737.85 | bwd_allreduce_microstep: 201.84 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11929
total_samples=9745, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:14,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.36 | bwd_microstep: 1970.76 | bwd_inner_microstep: 1755.88 | bwd_allreduce_microstep: 214.82 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13755
total_samples=9750, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:16,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 03:42:16,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.10 | bwd_microstep: 2040.47 | bwd_inner_microstep: 1903.59 | bwd_allreduce_microstep: 136.79 | step_microstep: 110.95
[2025-08-03 03:42:16,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.41 | bwd: 7693.91 | bwd_inner: 6940.89 | bwd_allreduce: 752.78 | step: 111.35
{'loss': 0.7848, 'learning_rate': 1.5877852522924733e-05, 'epoch': 0.32}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12385
total_samples=9753, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:19,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.83 | bwd_microstep: 2019.25 | bwd_inner_microstep: 1793.91 | bwd_allreduce_microstep: 225.27 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11868
total_samples=9756, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:22,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.54 | bwd_microstep: 2170.91 | bwd_inner_microstep: 1949.21 | bwd_allreduce_microstep: 221.63 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16014
total_samples=9760, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:25,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.62 | bwd_microstep: 2017.84 | bwd_inner_microstep: 1974.45 | bwd_allreduce_microstep: 43.34 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13052
total_samples=9763, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:28,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 03:42:28,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.72 | bwd_microstep: 1996.74 | bwd_inner_microstep: 1781.01 | bwd_allreduce_microstep: 215.66 | step_microstep: 114.99
[2025-08-03 03:42:28,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.64 | bwd: 8204.78 | bwd_inner: 7498.58 | bwd_allreduce: 705.98 | step: 115.31
32%|███▏      | 638/2000 [1:58:48<4:07:36, 10.91s/it]
32%|███▏      | 639/2000 [1:58:59<4:06:53, 10.88s/it]
32%|███▏      | 640/2000 [1:59:10<4:05:12, 10.82s/it]
32%|███▏      | 641/2000 [1:59:20<4:05:27, 10.84s/it]
32%|███▏      | 642/2000 [1:59:31<4:05:45, 10.86s/it]
32%|███▏      | 643/2000 [1:59:43<4:09:27,
{'loss': 0.7692, 'learning_rate': 1.586474378121511e-05, 'epoch': 0.32}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11699
total_samples=9766, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:31,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.78 | bwd_microstep: 2007.75 | bwd_inner_microstep: 1562.96 | bwd_allreduce_microstep: 444.73 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770
total_samples=9769, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:33,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 666.06 | bwd_microstep: 1760.62 | bwd_inner_microstep: 1542.27 | bwd_allreduce_microstep: 218.28 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12632
total_samples=9772, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:36,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.33 | bwd_microstep: 1826.40 | bwd_inner_microstep: 1603.96 | bwd_allreduce_microstep: 222.37 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16083
total_samples=9777, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:39,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 03:42:39,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.01 | bwd_microstep: 1832.87 | bwd_inner_microstep: 1801.20 | bwd_allreduce_microstep: 31.61 | step_microstep: 143.51
[2025-08-03 03:42:39,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.11 | bwd: 7427.69 | bwd_inner: 6510.39 | bwd_allreduce: 917.07 | step: 143.98
{'loss': 0.777, 'learning_rate': 1.5851619659898623e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13241
total_samples=9781, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:41,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.95 | bwd_microstep: 1776.50 | bwd_inner_microstep: 1699.35 | bwd_allreduce_microstep: 77.08 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13735
total_samples=9785, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:44,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.96 | bwd_microstep: 1820.95 | bwd_inner_microstep: 1697.71 | bwd_allreduce_microstep: 123.18 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11691
total_samples=9788, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:47,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.99 | bwd_microstep: 2007.73 | bwd_inner_microstep: 1781.56 | bwd_allreduce_microstep: 226.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13178
total_samples=9792, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:49,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 03:42:49,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.65 | bwd_microstep: 1838.01 | bwd_inner_microstep: 1709.52 | bwd_allreduce_microstep: 128.43 | step_microstep: 143.65
[2025-08-03 03:42:49,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.48 | bwd: 7443.23 | bwd_inner: 6888.13 | bwd_allreduce: 554.87 | step: 143.99
{'loss': 0.781, 'learning_rate': 1.5838480193391753e-05, 'epoch': 0.32}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14626
total_samples=9796, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:52,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.98 | bwd_microstep: 1894.33 | bwd_inner_microstep: 1779.15 | bwd_allreduce_microstep: 115.11 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15329
total_samples=9800, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:55,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.24 | bwd_microstep: 2047.95 | bwd_inner_microstep: 1800.16 | bwd_allreduce_microstep: 247.73 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11789
total_samples=9803, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:42:57,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.99 | bwd_microstep: 1795.31 | bwd_inner_microstep: 1558.25 | bwd_allreduce_microstep: 236.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13218
total_samples=9807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:00,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 03:43:00,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.99 | bwd_microstep: 2026.21 | bwd_inner_microstep: 1899.07 | bwd_allreduce_microstep: 127.07 | step_microstep: 130.27
[2025-08-03 03:43:00,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.12 | bwd: 7763.84 | bwd_inner: 7036.64 | bwd_allreduce: 726.98 | step: 130.71
{'loss': 0.774, 'learning_rate': 1.582532541615122e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13448
total_samples=9811, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:03,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.57 | bwd_microstep: 1770.61 | bwd_inner_microstep: 1685.35 | bwd_allreduce_microstep: 85.20 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12125
total_samples=9814, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:06,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.05 | bwd_microstep: 2073.63 | bwd_inner_microstep: 1591.30 | bwd_allreduce_microstep: 482.22 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13925
total_samples=9818, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:09,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.72 | bwd_microstep: 2171.82 | bwd_inner_microstep: 2005.14 | bwd_allreduce_microstep: 166.62 | step_microstep: 0.20
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12582
total_samples=9822, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:11,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 03:43:11,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.67 | bwd_microstep: 1828.84 | bwd_inner_microstep: 1613.13 | bwd_allreduce_microstep: 215.65 | step_microstep: 109.05
[2025-08-03 03:43:11,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.95 | bwd: 7844.96 | bwd_inner: 6894.94 | bwd_allreduce: 949.76 | step: 109.52
{'loss': 0.7875, 'learning_rate': 1.5812155362673895e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14607
total_samples=9826, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:14,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.29 | bwd_microstep: 1775.22 | bwd_inner_microstep: 1738.88 | bwd_allreduce_microstep: 36.28 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11956
total_samples=9829, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:17,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.18 | bwd_microstep: 1839.94 | bwd_inner_microstep: 1613.40 | bwd_allreduce_microstep: 226.48 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13807
total_samples=9834, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:19,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.49 | bwd_microstep: 1800.70 | bwd_inner_microstep: 1722.56 | bwd_allreduce_microstep: 78.05 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13028
total_samples=9838, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:22,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:43:22,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.21 | bwd_microstep: 1943.41 | bwd_inner_microstep: 1831.62 | bwd_allreduce_microstep: 111.72 | step_microstep: 142.56
[2025-08-03 03:43:22,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.10 | bwd: 7359.32 | bwd_inner: 6906.46 | bwd_allreduce: 452.63 | step: 142.92
{'loss': 0.7767, 'learning_rate': 1.57989700674967e-05, 'epoch': 0.32}
 32%|███▏      | 643/2000 [1:59:43<4:09:27, 11.03s/it]
 32%|███▏      | 644/2000 [1:59:53<4:07:08, 10.94s/it]
 32%|███▏      | 645/2000 [2:00:04<4:05:40, 10.88s/it]
 32%|███▏      | 646/2000 [2:00:15<4:06:19, 10.92s/it]
 32%|███▏      | 647/2000 [2:00:26<4:06:57, 10.95s/it]
 32%|███▏      | 648/2000 [2:00:37<4:04:41, 10.86s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11862
total_samples=9841, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:25,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.83 | bwd_microstep: 1731.23 | bwd_inner_microstep: 1538.03 | bwd_allreduce_microstep: 193.13 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11604
total_samples=9844, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:27,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.46 | bwd_microstep: 2001.98 | bwd_inner_microstep: 1779.65 | bwd_allreduce_microstep: 222.26 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14294
total_samples=9848, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:30,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.82 | bwd_microstep: 1785.35 | bwd_inner_microstep: 1734.77 | bwd_allreduce_microstep: 50.52 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12272
total_samples=9851, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:33,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 03:43:33,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.92 | bwd_microstep: 2060.65 | bwd_inner_microstep: 1837.43 | bwd_allreduce_microstep: 223.15 | step_microstep: 115.83
[2025-08-03 03:43:33,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.96 | bwd: 7579.25 | bwd_inner: 6889.88 | bwd_allreduce: 689.14 | step: 116.13
{'loss': 0.7736, 'learning_rate': 1.5785769565196543e-05, 'epoch': 0.32}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13366
total_samples=9855, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:35,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.80 | bwd_microstep: 1703.03 | bwd_inner_microstep: 1655.44 | bwd_allreduce_microstep: 47.53 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14348
total_samples=9859, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:38,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.04 | bwd_microstep: 1759.70 | bwd_inner_microstep: 1736.46 | bwd_allreduce_microstep: 23.18 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11947
total_samples=9863, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:41,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.70 | bwd_microstep: 1995.25 | bwd_inner_microstep: 1767.44 | bwd_allreduce_microstep: 227.76 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11587
total_samples=9866, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:43,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 03:43:43,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.15 | bwd_microstep: 1875.90 | bwd_inner_microstep: 1602.02 | bwd_allreduce_microstep: 273.81 | step_microstep: 118.03
[2025-08-03 03:43:43,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.63 | bwd: 7333.93 | bwd_inner: 6761.35 | bwd_allreduce: 572.35 | step: 118.36
{'loss': 0.7724, 'learning_rate': 1.5772553890390196e-05, 'epoch': 0.33}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13386
total_samples=9870, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:46,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.42 | bwd_microstep: 1797.31 | bwd_inner_microstep: 1705.40 | bwd_allreduce_microstep: 91.85 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14234
total_samples=9874, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:49,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.81 | bwd_microstep: 1837.54 | bwd_inner_microstep: 1745.86 | bwd_allreduce_microstep: 91.61 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14380
total_samples=9878, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:51,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.28 | bwd_microstep: 1833.56 | bwd_inner_microstep: 1766.99 | bwd_allreduce_microstep: 66.50 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11677
total_samples=9881, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:54,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 03:43:54,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.90 | bwd_microstep: 2006.86 | bwd_inner_microstep: 1801.03 | bwd_allreduce_microstep: 205.77 | step_microstep: 110.04
[2025-08-03 03:43:54,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.34 | bwd: 7475.31 | bwd_inner: 7019.28 | bwd_allreduce: 455.81 | step: 110.38
{'loss': 0.7727, 'learning_rate': 1.5759323077734233e-05, 'epoch': 0.33}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13976
total_samples=9885, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:43:57,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.43 | bwd_microstep: 1708.73 | bwd_inner_microstep: 1674.38 | bwd_allreduce_microstep: 34.29 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13679
total_samples=9889, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:00,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.80 | bwd_microstep: 2187.36 | bwd_inner_microstep: 1904.90 | bwd_allreduce_microstep: 282.40 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14400
total_samples=9893, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:02,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.74 | bwd_microstep: 1953.51 | bwd_inner_microstep: 1880.23 | bwd_allreduce_microstep: 73.21 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11947
total_samples=9896, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:05,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 03:44:05,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.11 | bwd_microstep: 1753.30 | bwd_inner_microstep: 1553.60 | bwd_allreduce_microstep: 199.63 | step_microstep: 130.65
[2025-08-03 03:44:05,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.01 | bwd: 7602.94 | bwd_inner: 7013.10 | bwd_allreduce: 589.61 | step: 131.09
{'loss': 0.7676, 'learning_rate': 1.5746077161924905e-05, 'epoch': 0.33}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14083
total_samples=9900, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:08,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.25 | bwd_microstep: 2130.14 | bwd_inner_microstep: 1689.57 | bwd_allreduce_microstep: 440.51 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13512
total_samples=9904, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:11,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.40 | bwd_microstep: 1809.18 | bwd_inner_microstep: 1711.87 | bwd_allreduce_microstep: 97.24 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14041
total_samples=9908, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:13,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.20 | bwd_microstep: 2108.46 | bwd_inner_microstep: 1994.52 | bwd_allreduce_microstep: 113.89 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14541
total_samples=9912, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:16,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 03:44:16,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.53 | bwd_microstep: 1994.99 | bwd_inner_microstep: 1881.09 | bwd_allreduce_microstep: 113.85 | step_microstep: 137.12
[2025-08-03 03:44:16,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.33 | bwd: 8042.83 | bwd_inner: 7277.04 | bwd_allreduce: 765.57 | step: 137.59
{'loss': 0.7759, 'learning_rate': 1.5732816177698097e-05, 'epoch': 0.33}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12230
total_samples=9915, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:20,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.28 | bwd_microstep: 2427.02 | bwd_inner_microstep: 1850.19 | bwd_allreduce_microstep: 576.76 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14366
total_samples=9919, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:22,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.80 | bwd_microstep: 1823.64 | bwd_inner_microstep: 1773.53 | bwd_allreduce_microstep: 50.04 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11864
total_samples=9922, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:25,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.32 | bwd_microstep: 1703.04 | bwd_inner_microstep: 1533.20 | bwd_allreduce_microstep: 169.78 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12942
total_samples=9926, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:28,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 03:44:28,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 916.36 | bwd_microstep: 1836.04 | bwd_inner_microstep: 1630.66 | bwd_allreduce_microstep: 205.32 | step_microstep: 109.79
[2025-08-03 03:44:28,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3004.69 | bwd: 7789.79 | bwd_inner: 6787.58 | bwd_allreduce: 1001.98 | step: 110.12
 32%|███▏      | 649/2000 [2:00:48<4:04:18, 10.85s/it]
 32%|███▎      | 650/2000 [2:00:58<4:02:07, 10.76s/it]
 33%|███▎      | 651/2000 [2:01:09<4:01:48, 10.76s/it]
 33%|███▎      | 652/2000 [2:01:20<4:02:22, 10.79s/it]
 33%|███▎      | 653/2000 [2:01:31<4:05:33, 10.94s/it]
 33%|███▎      | 654/2000 [2:01:42<4:07:21, 11.03s/it]
{'loss': 0.7662, 'learning_rate': 1.5719540159829185e-05, 'epoch': 0.33}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12752
total_samples=9930, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:30,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.84 | bwd_microstep: 1817.72 | bwd_inner_microstep: 1614.75 | bwd_allreduce_microstep: 202.91 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12466
total_samples=9935, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:33,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.58 | bwd_microstep: 1726.52 | bwd_inner_microstep: 1581.50 | bwd_allreduce_microstep: 144.95 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=9939, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:35,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.60 | bwd_microstep: 1916.05 | bwd_inner_microstep: 1737.15 | bwd_allreduce_microstep: 178.84 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12702
total_samples=9943, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:38,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 03:44:38,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.37 | bwd_microstep: 1758.75 | bwd_inner_microstep: 1588.49 | bwd_allreduce_microstep: 170.18 | step_microstep: 115.18
[2025-08-03 03:44:38,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.32 | bwd: 7219.09 | bwd_inner: 6521.89 | bwd_allreduce: 696.95 | step: 115.52
{'loss': 0.7559, 'learning_rate': 1.5706249143132982e-05, 'epoch': 0.33}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13330
total_samples=9947, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:41,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.78 | bwd_microstep: 1851.46 | bwd_inner_microstep: 1712.35 | bwd_allreduce_microstep: 139.05 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13284
total_samples=9951, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:44,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.44 | bwd_microstep: 2391.33 | bwd_inner_microstep: 2266.45 | bwd_allreduce_microstep: 124.82 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12583
total_samples=9955, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:46,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.88 | bwd_microstep: 1800.98 | bwd_inner_microstep: 1580.74 | bwd_allreduce_microstep: 220.18 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13027
total_samples=9959, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:50,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 03:44:50,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.50 | bwd_microstep: 2285.54 | bwd_inner_microstep: 1842.18 | bwd_allreduce_microstep: 443.30 | step_microstep: 112.97
[2025-08-03 03:44:50,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.54 | bwd: 8329.36 | bwd_inner: 7401.72 | bwd_allreduce: 927.41 | step: 113.38
{'loss': 0.778, 'learning_rate': 1.5692943162463628e-05, 'epoch': 0.33}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11790
total_samples=9962, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:52,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.80 | bwd_microstep: 1811.06 | bwd_inner_microstep: 1556.87 | bwd_allreduce_microstep: 254.13 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14183
total_samples=9966, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:55,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.91 | bwd_microstep: 1794.86 | bwd_inner_microstep: 1729.30 | bwd_allreduce_microstep: 65.50 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13189
total_samples=9970, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:44:58,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.22 | bwd_microstep: 2384.35 | bwd_inner_microstep: 2063.92 | bwd_allreduce_microstep: 320.38 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11721
total_samples=9973, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:01,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.79
[2025-08-03 03:45:01,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.05 | bwd_microstep: 1759.25 | bwd_inner_microstep: 1535.26 | bwd_allreduce_microstep: 223.92 | step_microstep: 144.39
[2025-08-03 03:45:01,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.91 | bwd: 7749.57 | bwd_inner: 6885.35 | bwd_allreduce: 864.00 | step: 144.70
{'loss': 0.7819, 'learning_rate': 1.5679622252714507e-05, 'epoch': 0.33}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11669
total_samples=9976, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:03,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.26 | bwd_microstep: 1789.22 | bwd_inner_microstep: 1588.25 | bwd_allreduce_microstep: 200.90 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13270
total_samples=9980, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:06,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.12 | bwd_microstep: 2107.70 | bwd_inner_microstep: 1806.40 | bwd_allreduce_microstep: 301.23 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12568
total_samples=9983, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:09,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.63 | bwd_microstep: 1810.65 | bwd_inner_microstep: 1579.56 | bwd_allreduce_microstep: 231.04 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14396
total_samples=9987, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:11,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 03:45:11,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.09 | bwd_microstep: 1796.34 | bwd_inner_microstep: 1750.22 | bwd_allreduce_microstep: 46.06 | step_microstep: 113.45
[2025-08-03 03:45:11,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.03 | bwd: 7503.97 | bwd_inner: 6724.41 | bwd_allreduce: 779.31 | step: 113.89
{'loss': 0.7754, 'learning_rate': 1.5666286448818152e-05, 'epoch': 0.33}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11937
total_samples=9990, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:14,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.50 | bwd_microstep: 2084.52 | bwd_inner_microstep: 1841.11 | bwd_allreduce_microstep: 243.35 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12965
total_samples=9994, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:17,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.29 | bwd_microstep: 2171.56 | bwd_inner_microstep: 1664.18 | bwd_allreduce_microstep: 507.31 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11960
total_samples=9998, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:20,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.92 | bwd_microstep: 1712.60 | bwd_inner_microstep: 1533.09 | bwd_allreduce_microstep: 179.45 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12138
total_samples=10001, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:22,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 03:45:22,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.09 | bwd_microstep: 1805.07 | bwd_inner_microstep: 1574.18 | bwd_allreduce_microstep: 230.81 | step_microstep: 133.09
[2025-08-03 03:45:22,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.72 | bwd: 7773.79 | bwd_inner: 6612.55 | bwd_allreduce: 1161.01 | step: 133.50
{'loss': 0.7751, 'learning_rate': 1.565293578574615e-05, 'epoch': 0.33}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12058
total_samples=10004, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:25,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.21 | bwd_microstep: 2019.45 | bwd_inner_microstep: 1554.99 | bwd_allreduce_microstep: 464.40 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12748
total_samples=10008, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:28,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.63 | bwd_microstep: 1727.17 | bwd_inner_microstep: 1611.08 | bwd_allreduce_microstep: 116.03 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12459
total_samples=10011, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:30,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.80 | bwd_microstep: 1839.26 | bwd_inner_microstep: 1582.72 | bwd_allreduce_microstep: 256.47 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15894
total_samples=10015, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:33,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 03:45:33,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.60 | bwd_microstep: 1795.67 | bwd_inner_microstep: 1785.01 | bwd_allreduce_microstep: 10.59 | step_microstep: 121.67
[2025-08-03 03:45:33,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.17 | bwd: 7381.60 | bwd_inner: 6533.81 | bwd_allreduce: 847.56 | step: 122.00
 33%|███▎      | 655/2000 [2:01:53<4:03:20, 10.86s/it]
 33%|███▎      | 656/2000 [2:02:04<4:07:58, 11.07s/it]
 33%|███▎      | 657/2000 [2:02:15<4:07:20, 11.05s/it]
 33%|███▎      | 658/2000 [2:02:26<4:05:04, 10.96s/it]
 33%|███▎      | 659/2000 [2:02:37<4:05:12, 10.97s/it]
 33%|███▎      | 660/
{'loss': 0.7704, 'learning_rate': 1.5639570298509067e-05, 'epoch': 0.33}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12070
total_samples=10019, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:36,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.88 | bwd_microstep: 1802.54 | bwd_inner_microstep: 1561.15 | bwd_allreduce_microstep: 241.32 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11987
total_samples=10022, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:38,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.08 | bwd_microstep: 1844.66 | bwd_inner_microstep: 1590.83 | bwd_allreduce_microstep: 253.77 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13687
total_samples=10026, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:41,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.03 | bwd_microstep: 2114.32 | bwd_inner_microstep: 2007.64 | bwd_allreduce_microstep: 106.62 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13283
total_samples=10030, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:44,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 03:45:44,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 664.79 | bwd_microstep: 1921.47 | bwd_inner_microstep: 1679.56 | bwd_allreduce_microstep: 241.84 | step_microstep: 140.60
[2025-08-03 03:45:44,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2748.72 | bwd: 7683.04 | bwd_inner: 6839.18 | bwd_allreduce: 843.63 | step: 141.05
{'loss': 0.7599, 'learning_rate': 1.5626190022156328e-05, 'epoch': 0.33}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11789
total_samples=10033, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:46,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.01 | bwd_microstep: 1754.20 | bwd_inner_microstep: 1561.77 | bwd_allreduce_microstep: 192.37 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11780
total_samples=10036, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:49,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.13 | bwd_microstep: 1763.50 | bwd_inner_microstep: 1563.27 | bwd_allreduce_microstep: 200.14 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13210
total_samples=10040, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:52,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.03 | bwd_microstep: 2029.52 | bwd_inner_microstep: 1880.74 | bwd_allreduce_microstep: 148.71 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13677
total_samples=10044, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:55,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 03:45:55,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 909.65 | bwd_microstep: 1792.60 | bwd_inner_microstep: 1699.30 | bwd_allreduce_microstep: 93.24 | step_microstep: 108.30
[2025-08-03 03:45:55,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2990.76 | bwd: 7339.86 | bwd_inner: 6705.08 | bwd_allreduce: 634.54 | step: 108.78
{'loss': 0.7726, 'learning_rate': 1.5612794991776147e-05, 'epoch': 0.33}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12614
total_samples=10048, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:45:58,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.28 | bwd_microstep: 2100.55 | bwd_inner_microstep: 2053.05 | bwd_allreduce_microstep: 47.45 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13322
total_samples=10052, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:00,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.89 | bwd_microstep: 1857.75 | bwd_inner_microstep: 1721.82 | bwd_allreduce_microstep: 135.87 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13747
total_samples=10056, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:03,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.78 | bwd_microstep: 2148.33 | bwd_inner_microstep: 2033.70 | bwd_allreduce_microstep: 114.57 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14028
total_samples=10060, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:06,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 03:46:06,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.19 | bwd_microstep: 2230.70 | bwd_inner_microstep: 1935.74 | bwd_allreduce_microstep: 294.89 | step_microstep: 148.87
[2025-08-03 03:46:06,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.06 | bwd: 8337.39 | bwd_inner: 7744.32 | bwd_allreduce: 592.84 | step: 149.18
{'loss': 0.7744, 'learning_rate': 1.5599385242495437e-05, 'epoch': 0.33}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13536
total_samples=10064, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:09,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.41 | bwd_microstep: 1775.51 | bwd_inner_microstep: 1695.82 | bwd_allreduce_microstep: 79.63 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13545
total_samples=10068, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:11,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.29 | bwd_microstep: 1789.04 | bwd_inner_microstep: 1715.25 | bwd_allreduce_microstep: 73.73 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13278
total_samples=10072, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:14,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.62 | bwd_microstep: 2010.19 | bwd_inner_microstep: 1891.27 | bwd_allreduce_microstep: 118.86 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13506
total_samples=10076, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:17,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 03:46:17,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.50 | bwd_microstep: 2130.76 | bwd_inner_microstep: 2040.14 | bwd_allreduce_microstep: 90.57 | step_microstep: 109.69
[2025-08-03 03:46:17,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.76 | bwd: 7705.55 | bwd_inner: 7342.48 | bwd_allreduce: 362.86 | step: 110.11
{'loss': 0.7764, 'learning_rate': 1.5585960809479698e-05, 'epoch': 0.33}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13509
total_samples=10080, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:20,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.21 | bwd_microstep: 1721.13 | bwd_inner_microstep: 1677.41 | bwd_allreduce_microstep: 43.66 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14665
total_samples=10084, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:22,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.17 | bwd_microstep: 1893.99 | bwd_inner_microstep: 1753.28 | bwd_allreduce_microstep: 140.66 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14303
total_samples=10089, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:25,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.47 | bwd_microstep: 1740.23 | bwd_inner_microstep: 1711.54 | bwd_allreduce_microstep: 28.62 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13184
total_samples=10094, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:28,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:46:28,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.45 | bwd_microstep: 1815.46 | bwd_inner_microstep: 1685.04 | bwd_allreduce_microstep: 130.35 | step_microstep: 121.47
[2025-08-03 03:46:28,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2742.24 | bwd: 7170.86 | bwd_inner: 6827.26 | bwd_allreduce: 343.37 | step: 121.84
{'loss': 0.7697, 'learning_rate': 1.5572521727932937e-05, 'epoch': 0.33}
33%|███▎      | 660/2000 [2:02:48<4:02:42, 10.87s/it]
33%|███▎      | 661/2000 [2:02:59<4:02:59, 10.89s/it]
33%|███▎      | 662/2000 [2:03:10<4:02:03, 10.85s/it]
33%|███▎      | 663/2000 [2:03:21<4:06:51, 11.08s/it]
33%|███▎      | 664/2000 [2:03:32<4:05:47, 11.04s/it]
33%|███▎      | 665/2000 [2:03:42<4:01:06, 10.84s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15651
total_samples=10098, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:30,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.84 | bwd_microstep: 1856.84 | bwd_inner_microstep: 1770.09 | bwd_allreduce_microstep: 86.69 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14136
total_samples=10102, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:33,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 750.26 | bwd_microstep: 2270.00 | bwd_inner_microstep: 2106.40 | bwd_allreduce_microstep: 163.54 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13693
total_samples=10106, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:36,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.63 | bwd_microstep: 1764.43 | bwd_inner_microstep: 1707.02 | bwd_allreduce_microstep: 57.35 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12803
total_samples=10110, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:39,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:46:39,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.78 | bwd_microstep: 1804.23 | bwd_inner_microstep: 1650.58 | bwd_allreduce_microstep: 153.59 | step_microstep: 115.08
[2025-08-03 03:46:39,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2867.44 | bwd: 7695.55 | bwd_inner: 7234.08 | bwd_allreduce: 461.24 | step: 115.42
{'loss': 0.7729, 'learning_rate': 1.5559068033097583e-05, 'epoch': 0.33}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13895
total_samples=10115, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:41,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.98 | bwd_microstep: 1810.84 | bwd_inner_microstep: 1683.74 | bwd_allreduce_microstep: 127.04 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13720
total_samples=10120, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:44,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.16 | bwd_microstep: 2018.83 | bwd_inner_microstep: 1885.45 | bwd_allreduce_microstep: 133.32 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12137
total_samples=10123, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:47,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.93 | bwd_microstep: 1800.07 | bwd_inner_microstep: 1579.97 | bwd_allreduce_microstep: 220.03 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11842
total_samples=10126, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:49,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.73
[2025-08-03 03:46:49,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.70 | bwd_microstep: 1833.48 | bwd_inner_microstep: 1607.89 | bwd_allreduce_microstep: 225.52 | step_microstep: 113.93
[2025-08-03 03:46:49,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.70 | bwd: 7463.27 | bwd_inner: 6757.05 | bwd_allreduce: 705.99 | step: 114.35
{'loss': 0.767, 'learning_rate': 1.554559976025438e-05, 'epoch': 0.33}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12728
total_samples=10130, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:52,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.05 | bwd_microstep: 2236.74 | bwd_inner_microstep: 2021.23 | bwd_allreduce_microstep: 215.44 | step_microstep: 0.28
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13371
total_samples=10134, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:55,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.41 | bwd_microstep: 1711.32 | bwd_inner_microstep: 1627.51 | bwd_allreduce_microstep: 83.75 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13650
total_samples=10138, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:46:58,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.57 | bwd_microstep: 2147.85 | bwd_inner_microstep: 1929.86 | bwd_allreduce_microstep: 217.92 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11485
total_samples=10141, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:01,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 03:47:01,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.86 | bwd_microstep: 1964.42 | bwd_inner_microstep: 1542.31 | bwd_allreduce_microstep: 422.05 | step_microstep: 129.52
[2025-08-03 03:47:01,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.82 | bwd: 8060.39 | bwd_inner: 7120.90 | bwd_allreduce: 939.25 | step: 130.04
{'loss': 0.7675, 'learning_rate': 1.5532116944722308e-05, 'epoch': 0.33}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12243
total_samples=10144, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:03,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.52 | bwd_microstep: 1823.56 | bwd_inner_microstep: 1577.91 | bwd_allreduce_microstep: 245.58 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13988
total_samples=10148, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:06,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.48 | bwd_microstep: 1718.95 | bwd_inner_microstep: 1696.19 | bwd_allreduce_microstep: 22.70 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13962
total_samples=10152, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:08,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.37 | bwd_microstep: 1847.24 | bwd_inner_microstep: 1841.43 | bwd_allreduce_microstep: 5.76 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14046
total_samples=10156, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:11,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 03:47:11,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.76 | bwd_microstep: 2165.31 | bwd_inner_microstep: 1904.96 | bwd_allreduce_microstep: 260.28 | step_microstep: 108.86
[2025-08-03 03:47:11,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.05 | bwd: 7555.11 | bwd_inner: 7020.49 | bwd_allreduce: 534.39 | step: 109.33
{'loss': 0.7667, 'learning_rate': 1.5518619621858474e-05, 'epoch': 0.33}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13439
total_samples=10160, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:14,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.82 | bwd_microstep: 2138.39 | bwd_inner_microstep: 1961.83 | bwd_allreduce_microstep: 176.49 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13004
total_samples=10164, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:17,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.63 | bwd_microstep: 1977.53 | bwd_inner_microstep: 1690.14 | bwd_allreduce_microstep: 287.33 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13025
total_samples=10168, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:20,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.03 | bwd_microstep: 1978.59 | bwd_inner_microstep: 1830.29 | bwd_allreduce_microstep: 148.24 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14041
total_samples=10172, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:23,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 03:47:23,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.53 | bwd_microstep: 1733.67 | bwd_inner_microstep: 1705.29 | bwd_allreduce_microstep: 28.31 | step_microstep: 145.19
[2025-08-03 03:47:23,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2770.94 | bwd: 7828.24 | bwd_inner: 7187.55 | bwd_allreduce: 640.45 | step: 145.64
{'loss': 0.758, 'learning_rate': 1.5505107827058038e-05, 'epoch': 0.34}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11829
total_samples=10175, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:25,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.41 | bwd_microstep: 1817.53 | bwd_inner_microstep: 1589.09 | bwd_allreduce_microstep: 228.38 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13205
total_samples=10179, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:28,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.71 | bwd_microstep: 1755.45 | bwd_inner_microstep: 1687.17 | bwd_allreduce_microstep: 68.21 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13189
total_samples=10183, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:30,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.79 | bwd_microstep: 1899.78 | bwd_inner_microstep: 1709.10 | bwd_allreduce_microstep: 190.61 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12180
total_samples=10186, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:33,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78
[2025-08-03 03:47:33,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.33 | bwd_microstep: 1805.76 | bwd_inner_microstep: 1569.34 | bwd_allreduce_microstep: 236.36 | step_microstep: 115.61
[2025-08-03 03:47:33,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.17 | bwd: 7278.57 | bwd_inner: 6554.70 | bwd_allreduce: 723.64 | step: 116.05
33%|███▎      | 666/2000 [2:03:53<4:02:08, 10.89s/it]
33%|███▎      | 667/2000 [2:04:04<4:00:46, 10.84s/it]
33%|███▎      | 668/2000 [2:04:15<4:03:37, 10.97s/it]
33%|███▎      | 669/2000 [2:04:26<4:02:19, 10.92s/it]
34%|███▎      | 670/2000 [2:04:37<4:03:13, 10.97s/it]
34%|███▎      | 671/2000 [2:04:48<4:00:12, 10.84s/it]
{'loss': 0.7748, 'learning_rate': 1.5491581595754102e-05, 'epoch': 0.34}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12068
total_samples=10189, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:36,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.97 | bwd_microstep: 1758.06 | bwd_inner_microstep: 1549.99 | bwd_allreduce_microstep: 208.00 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13572
total_samples=10193, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:38,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.93 | bwd_microstep: 1757.07 | bwd_inner_microstep: 1683.12 | bwd_allreduce_microstep: 73.88 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14392
total_samples=10197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:41,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.73 | bwd_microstep: 1896.21 | bwd_inner_microstep: 1790.17 | bwd_allreduce_microstep: 105.98 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12966
total_samples=10201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:44,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 03:47:44,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.51 | bwd_microstep: 2365.87 | bwd_inner_microstep: 1659.80 | bwd_allreduce_microstep: 706.02 | step_microstep: 112.45
[2025-08-03 03:47:44,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.07 | bwd: 7777.27 | bwd_inner: 6683.08 | bwd_allreduce: 1093.95 | step: 112.90
{'loss': 0.7717, 'learning_rate': 1.547804096341763e-05, 'epoch': 0.34}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13269
total_samples=10205, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:47,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.25 | bwd_microstep: 2037.89 | bwd_inner_microstep: 1912.46 | bwd_allreduce_microstep: 125.37 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13474
total_samples=10209, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:50,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.48 | bwd_microstep: 1814.57 | bwd_inner_microstep: 1702.09 | bwd_allreduce_microstep: 112.42 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12053
total_samples=10212, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:52,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.98 | bwd_microstep: 1771.85 | bwd_inner_microstep: 1563.34 | bwd_allreduce_microstep: 208.43 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12013
total_samples=10215, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:55,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 03:47:55,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.65 | bwd_microstep: 1747.17 | bwd_inner_microstep: 1550.11 | bwd_allreduce_microstep: 197.00 | step_microstep: 113.82
[2025-08-03 03:47:55,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.28 | bwd: 7371.53 | bwd_inner: 6727.99 | bwd_allreduce: 643.30 | step: 114.22
{'loss': 0.7666, 'learning_rate': 1.546448596555736e-05, 'epoch': 0.34}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13237
total_samples=10219, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:47:57,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 662.23 | bwd_microstep: 1691.14 | bwd_inner_microstep: 1659.06 | bwd_allreduce_microstep: 32.02 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15934
total_samples=10223, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:00,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.73 | bwd_microstep: 2043.96 | bwd_inner_microstep: 1893.25 | bwd_allreduce_microstep: 150.64 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13373
total_samples=10227, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:03,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.87 | bwd_microstep: 1767.94 | bwd_inner_microstep: 1704.88 | bwd_allreduce_microstep: 63.00 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11947
total_samples=10230, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:05,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:48:05,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.99 | bwd_microstep: 1793.62 | bwd_inner_microstep: 1578.76 | bwd_allreduce_microstep: 214.80 | step_microstep: 125.95
[2025-08-03 03:48:05,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.75 | bwd: 7296.72 | bwd_inner: 6835.94 | bwd_allreduce: 460.54 | step: 126.53
{'loss': 0.776, 'learning_rate': 1.5450916637719683e-05, 'epoch': 0.34}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13698
total_samples=10234, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:08,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.59 | bwd_microstep: 1804.28 | bwd_inner_microstep: 1701.10 | bwd_allreduce_microstep: 103.12 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13470
total_samples=10238, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:11,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.78 | bwd_microstep: 2050.91 | bwd_inner_microstep: 1889.18 | bwd_allreduce_microstep: 161.65 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14575
total_samples=10242, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:13,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.33 | bwd_microstep: 1844.55 | bwd_inner_microstep: 1763.05 | bwd_allreduce_microstep: 81.44 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11763
total_samples=10245, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:17,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 03:48:17,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 666.30 | bwd_microstep: 2348.25 | bwd_inner_microstep: 1549.63 | bwd_allreduce_microstep: 798.54 | step_microstep: 133.15
[2025-08-03 03:48:17,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.94 | bwd: 8048.04 | bwd_inner: 6902.97 | bwd_allreduce: 1144.83 | step: 133.50
{'loss': 0.7719, 'learning_rate': 1.5437333015488586e-05, 'epoch': 0.34}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14031
total_samples=10249, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:19,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.79 | bwd_microstep: 1732.99 | bwd_inner_microstep: 1691.98 | bwd_allreduce_microstep: 40.93 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12489
total_samples=10253, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:22,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.96 | bwd_microstep: 2340.83 | bwd_inner_microstep: 2101.77 | bwd_allreduce_microstep: 238.99 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13229
total_samples=10257, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:25,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.63 | bwd_microstep: 1829.07 | bwd_inner_microstep: 1706.42 | bwd_allreduce_microstep: 122.58 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14464
total_samples=10261, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:28,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.03
[2025-08-03 03:48:28,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.17 | bwd_microstep: 1807.47 | bwd_inner_microstep: 1753.56 | bwd_allreduce_microstep: 53.83 | step_microstep: 425.97
[2025-08-03 03:48:28,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.48 | bwd: 7710.41 | bwd_inner: 7253.73 | bwd_allreduce: 456.43 | step: 426.47
{'loss': 0.7732, 'learning_rate': 1.5423735134485537e-05, 'epoch': 0.34}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13491
total_samples=10265, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:31,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.78 | bwd_microstep: 2125.86 | bwd_inner_microstep: 1861.42 | bwd_allreduce_microstep: 264.37 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12544
total_samples=10268, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:33,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.06 | bwd_microstep: 1842.89 | bwd_inner_microstep: 1715.21 | bwd_allreduce_microstep: 127.59 | step_microstep: 0.47
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12034
total_samples=10271, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:36,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.96 | bwd_microstep: 1784.83 | bwd_inner_microstep: 1563.61 | bwd_allreduce_microstep: 221.16 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13530
total_samples=10275, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:39,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 03:48:39,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.28 | bwd_microstep: 1836.38 | bwd_inner_microstep: 1729.27 | bwd_allreduce_microstep: 107.05 | step_microstep: 114.64
[2025-08-03 03:48:39,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.01 | bwd: 7590.02 | bwd_inner: 6869.50 | bwd_allreduce: 720.26 | step: 115.33
 34%|███▎      | 671/2000 [2:04:48<4:00:12, 10.84s/it]
 34%|███▎      | 672/2000 [2:04:59<4:01:14, 10.90s/it]
 34%|███▎      | 673/2000 [2:05:10<3:59:07, 10.81s/it]
 34%|███▎      | 674/2000 [2:05:20<3:57:08, 10.73s/it]
 34%|███▍      | 675/2000 [2:05:31<4:00:53, 10.91s/it]
 34%|███▍      | 676/2000 [2:05:43<4:03:03, 11.01s/it]
{'loss': 0.7667, 'learning_rate': 1.5410123030369387e-05, 'epoch': 0.34}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13465
total_samples=10279, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:41,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.37 | bwd_microstep: 1829.43 | bwd_inner_microstep: 1716.51 | bwd_allreduce_microstep: 112.85 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11691
total_samples=10282, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:44,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.59 | bwd_microstep: 1903.05 | bwd_inner_microstep: 1580.06 | bwd_allreduce_microstep: 322.93 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12556
total_samples=10286, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:47,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.74 | bwd_microstep: 2070.32 | bwd_inner_microstep: 1879.90 | bwd_allreduce_microstep: 190.35 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12939
total_samples=10290, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:50,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 03:48:50,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.96 | bwd_microstep: 2045.35 | bwd_inner_microstep: 1697.28 | bwd_allreduce_microstep: 348.00 | step_microstep: 111.45
[2025-08-03 03:48:50,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.59 | bwd: 7848.20 | bwd_inner: 6873.72 | bwd_allreduce: 974.21 | step: 111.92
{'loss': 0.7843, 'learning_rate': 1.5396496738836292e-05, 'epoch': 0.34}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12860
total_samples=10294, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:52,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.85 | bwd_microstep: 1761.64 | bwd_inner_microstep: 1668.28 | bwd_allreduce_microstep: 93.30 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12550
total_samples=10297, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:55,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.40 | bwd_microstep: 1768.32 | bwd_inner_microstep: 1586.73 | bwd_allreduce_microstep: 181.52 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12070
total_samples=10300, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:48:57,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.60 | bwd_microstep: 1797.29 | bwd_inner_microstep: 1563.92 | bwd_allreduce_microstep: 233.29 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15812
total_samples=10304, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:00,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 03:49:00,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.66 | bwd_microstep: 1851.70 | bwd_inner_microstep: 1734.29 | bwd_allreduce_microstep: 117.35 | step_microstep: 139.32
[2025-08-03 03:49:00,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.45 | bwd: 7179.01 | bwd_inner: 6553.22 | bwd_allreduce: 625.55 | step: 139.89
{'loss': 0.772, 'learning_rate': 1.5382856295619622e-05, 'epoch': 0.34}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12787
total_samples=10308, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:03,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.89 | bwd_microstep: 1786.18 | bwd_inner_microstep: 1650.50 | bwd_allreduce_microstep: 135.62 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13483
total_samples=10312, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:06,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.66 | bwd_microstep: 2018.55 | bwd_inner_microstep: 1888.26 | bwd_allreduce_microstep: 130.21 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13228
total_samples=10316, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:08,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.09 | bwd_microstep: 1773.30 | bwd_inner_microstep: 1704.21 | bwd_allreduce_microstep: 69.02 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13440
total_samples=10320, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:11,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77
[2025-08-03 03:49:11,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.97 | bwd_microstep: 2030.31 | bwd_inner_microstep: 1894.71 | bwd_allreduce_microstep: 135.54 | step_microstep: 126.46
[2025-08-03 03:49:11,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.53 | bwd: 7608.39 | bwd_inner: 7137.68 | bwd_allreduce: 470.47 | step: 127.06
{'loss': 0.7741, 'learning_rate': 1.536920173648984e-05, 'epoch': 0.34}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13361
total_samples=10324, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:14,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.24 | bwd_microstep: 2005.42 | bwd_inner_microstep: 1910.90 | bwd_allreduce_microstep: 94.46 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11664
total_samples=10327, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:16,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.26 | bwd_microstep: 1761.59 | bwd_inner_microstep: 1526.99 | bwd_allreduce_microstep: 234.54 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12268
total_samples=10330, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:19,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.33 | bwd_microstep: 1804.69 | bwd_inner_microstep: 1586.85 | bwd_allreduce_microstep: 217.77 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11766
total_samples=10333, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:22,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 03:49:22,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.41 | bwd_microstep: 1774.68 | bwd_inner_microstep: 1546.03 | bwd_allreduce_microstep: 228.58 | step_microstep: 131.62
[2025-08-03 03:49:22,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.17 | bwd: 7346.42 | bwd_inner: 6570.77 | bwd_allreduce: 775.42 | step: 132.09
{'loss': 0.7698, 'learning_rate': 1.535553309725444e-05, 'epoch': 0.34}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13386
total_samples=10337, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:25,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.80 | bwd_microstep: 2087.04 | bwd_inner_microstep: 1711.76 | bwd_allreduce_microstep: 375.21 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12730
total_samples=10340, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:27,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.59 | bwd_microstep: 2125.44 | bwd_inner_microstep: 1920.18 | bwd_allreduce_microstep: 205.19 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14256
total_samples=10344, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:31,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.66 | bwd_microstep: 2369.45 | bwd_inner_microstep: 1902.83 | bwd_allreduce_microstep: 466.50 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12072
total_samples=10347, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:33,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 03:49:33,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.46 | bwd_microstep: 1739.46 | bwd_inner_microstep: 1544.42 | bwd_allreduce_microstep: 194.98 | step_microstep: 119.91
[2025-08-03 03:49:33,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2750.44 | bwd: 8321.45 | bwd_inner: 7079.21 | bwd_allreduce: 1241.95 | step: 120.52
{'loss': 0.7654, 'learning_rate': 1.5341850413757834e-05, 'epoch': 0.34}
 34%|███▍      | 677/2000 [2:05:54<4:01:42, 10.96s/it]
 34%|███▍      | 678/2000 [2:06:05<4:02:12, 10.99s/it]
 34%|███▍      | 679/2000 [2:06:15<3:58:29, 10.83s/it]
 34%|███▍      | 680/2000 [2:06:26<3:58:29, 10.84s/it]
 34%|███▍      | 681/2000 [2:06:36<3:56:43, 10.77s/it]
 34%|███▍      | 682/2000 [2:06:48<4:01:32, 11.00s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16028
total_samples=10351, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:36,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.06 | bwd_microstep: 2163.19 | bwd_inner_microstep: 2157.12 | bwd_allreduce_microstep: 6.01 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12730
total_samples=10355, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:39,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.47 | bwd_microstep: 1807.96 | bwd_inner_microstep: 1626.69 | bwd_allreduce_microstep: 181.20 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15935
total_samples=10361, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:41,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.24 | bwd_microstep: 1841.49 | bwd_inner_microstep: 1813.24 | bwd_allreduce_microstep: 28.19 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11746
total_samples=10364, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:44,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 03:49:44,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.87 | bwd_microstep: 2007.75 | bwd_inner_microstep: 1598.99 | bwd_allreduce_microstep: 408.70 | step_microstep: 143.15
[2025-08-03 03:49:44,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.57 | bwd: 7820.45 | bwd_inner: 7196.04 | bwd_allreduce: 624.18 | step: 143.61
{'loss': 0.7676, 'learning_rate': 1.532815372188126e-05, 'epoch': 0.34}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12810
total_samples=10368, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:47,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.58 | bwd_microstep: 2328.68 | bwd_inner_microstep: 2035.95 | bwd_allreduce_microstep: 292.67 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11973
total_samples=10371, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:50,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.55 | bwd_microstep: 1825.73 | bwd_inner_microstep: 1563.13 | bwd_allreduce_microstep: 262.54 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11958
total_samples=10374, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:53,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.58 | bwd_microstep: 1798.48 | bwd_inner_microstep: 1562.04 | bwd_allreduce_microstep: 236.37 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11763
total_samples=10377, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:55,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.30
[2025-08-03 03:49:55,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.67 | bwd_microstep: 1825.52 | bwd_inner_microstep: 1695.45 | bwd_allreduce_microstep: 130.00 | step_microstep: 134.08
[2025-08-03 03:49:55,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.31 | bwd: 7778.46 | bwd_inner: 6856.58 | bwd_allreduce: 921.65 | step: 134.53
{'loss': 0.7733, 'learning_rate': 1.5314443057542703e-05, 'epoch': 0.34}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12734
total_samples=10381, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:49:58,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.66 | bwd_microstep: 1790.39 | bwd_inner_microstep: 1656.60 | bwd_allreduce_microstep: 133.72 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11632
total_samples=10384, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:01,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.57 | bwd_microstep: 1938.77 | bwd_inner_microstep: 1529.50 | bwd_allreduce_microstep: 409.20 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13858
total_samples=10388, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:03,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.38 | bwd_microstep: 1802.36 | bwd_inner_microstep: 1727.09 | bwd_allreduce_microstep: 75.20 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13340
total_samples=10392, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:07,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 03:50:07,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.00 | bwd_microstep: 2128.05 | bwd_inner_microstep: 1893.82 | bwd_allreduce_microstep: 234.17 | step_microstep: 397.69
[2025-08-03 03:50:07,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.54 | bwd: 7659.63 | bwd_inner: 6807.01 | bwd_allreduce: 852.38 | step: 398.11
{'loss': 0.7657, 'learning_rate': 1.530071845669678e-05, 'epoch': 0.34}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13475
total_samples=10396, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:09,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.77 | bwd_microstep: 1777.78 | bwd_inner_microstep: 1671.32 | bwd_allreduce_microstep: 106.39 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11678
total_samples=10399, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:12,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.84 | bwd_microstep: 1983.75 | bwd_inner_microstep: 1770.95 | bwd_allreduce_microstep: 212.74 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13955
total_samples=10403, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:14,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.63 | bwd_microstep: 1808.67 | bwd_inner_microstep: 1740.64 | bwd_allreduce_microstep: 67.96 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11978
total_samples=10406, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:17,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 03:50:17,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.69 | bwd_microstep: 1686.31 | bwd_inner_microstep: 1530.21 | bwd_allreduce_microstep: 156.04 | step_microstep: 137.38
[2025-08-03 03:50:17,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2756.87 | bwd: 7256.56 | bwd_inner: 6713.12 | bwd_allreduce: 543.20 | step: 137.87
{'loss': 0.77, 'learning_rate': 1.5286979955334655e-05, 'epoch': 0.34}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13293
total_samples=10410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:20,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.45 | bwd_microstep: 2055.84 | bwd_inner_microstep: 1701.93 | bwd_allreduce_microstep: 353.85 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12257
total_samples=10413, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:23,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.38 | bwd_microstep: 1940.81 | bwd_inner_microstep: 1562.69 | bwd_allreduce_microstep: 378.05 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13204
total_samples=10417, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:25,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 747.97 | bwd_microstep: 1857.96 | bwd_inner_microstep: 1736.70 | bwd_allreduce_microstep: 121.19 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13052
total_samples=10421, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:28,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 03:50:28,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.60 | bwd_microstep: 1901.49 | bwd_inner_microstep: 1707.37 | bwd_allreduce_microstep: 194.06 | step_microstep: 127.62
[2025-08-03 03:50:28,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.33 | bwd: 7756.15 | bwd_inner: 6708.69 | bwd_allreduce: 1047.22 | step: 128.12
{'loss': 0.7782, 'learning_rate': 1.5273227589483945e-05, 'epoch': 0.34}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12641
total_samples=10425, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:31,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.90 | bwd_microstep: 1734.95 | bwd_inner_microstep: 1644.16 | bwd_allreduce_microstep: 90.72 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13550
total_samples=10429, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:33,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.07 | bwd_microstep: 1735.82 | bwd_inner_microstep: 1677.56 | bwd_allreduce_microstep: 58.20 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13287
total_samples=10433, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:36,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.29 | bwd_microstep: 1776.46 | bwd_inner_microstep: 1689.65 | bwd_allreduce_microstep: 86.75 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11912
total_samples=10436, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:38,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.80
[2025-08-03 03:50:38,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.46 | bwd_microstep: 1863.50 | bwd_inner_microstep: 1714.77 | bwd_allreduce_microstep: 148.67 | step_microstep: 108.35
[2025-08-03 03:50:38,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.64 | bwd: 7110.78 | bwd_inner: 6726.13 | bwd_allreduce: 384.41 | step: 108.80
 34%|███▍      | 683/2000 [2:06:59<4:02:19, 11.04s/it]
 34%|███▍      | 684/2000 [2:07:10<4:02:05, 11.04s/it]
 34%|███▍      | 685/2000 [2:07:21<4:02:54, 11.08s/it]
 34%|███▍      | 686/2000 [2:07:32<3:59:01, 10.91s/it]
 34%|███▍      | 687/2000 [2:07:43<3:59:51, 10.96s/it]
{'loss': 0.7714, 'learning_rate': 1.5259461395208628e-05, 'epoch': 0.34}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13077
total_samples=10440, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:41,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.99 | bwd_microstep: 2015.12 | bwd_inner_microstep: 1845.07 | bwd_allreduce_microstep: 169.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13876
total_samples=10444, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:44,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.25 | bwd_microstep: 2344.45 | bwd_inner_microstep: 2208.31 | bwd_allreduce_microstep: 136.08 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13327
total_samples=10448, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:47,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.06 | bwd_microstep: 1696.94 | bwd_inner_microstep: 1650.02 | bwd_allreduce_microstep: 46.86 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12052
total_samples=10451, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:51,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:50:51,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1197.23 | bwd_microstep: 2448.55 | bwd_inner_microstep: 2251.44 | bwd_allreduce_microstep: 197.05 | step_microstep: 108.76
[2025-08-03 03:50:51,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3277.44 | bwd: 8505.11 | bwd_inner: 7954.84 | bwd_allreduce: 550.05 | step: 109.10
{'loss': 0.7617, 'learning_rate': 1.5245681408608946e-05, 'epoch': 0.34}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14516
total_samples=10456, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:53,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.57 | bwd_microstep: 1821.80 | bwd_inner_microstep: 1747.92 | bwd_allreduce_microstep: 73.82 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13738
total_samples=10460, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:56,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.64 | bwd_microstep: 1831.74 | bwd_inner_microstep: 1731.71 | bwd_allreduce_microstep: 99.95 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14754
total_samples=10467, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:50:59,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.39 | bwd_microstep: 2018.32 | bwd_inner_microstep: 1940.05 | bwd_allreduce_microstep: 78.20 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16356
total_samples=10471, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:02,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 03:51:02,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.32 | bwd_microstep: 1896.55 | bwd_inner_microstep: 1890.22 | bwd_allreduce_microstep: 6.28 | step_microstep: 127.37
[2025-08-03 03:51:02,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.85 | bwd: 7568.45 | bwd_inner: 7309.88 | bwd_allreduce: 258.33 | step: 127.73
{'loss': 0.7624, 'learning_rate': 1.52318876658213e-05, 'epoch': 0.34}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13997
total_samples=10475, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:04,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.77 | bwd_microstep: 1711.92 | bwd_inner_microstep: 1680.21 | bwd_allreduce_microstep: 31.65 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13893
total_samples=10479, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:07,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.52 | bwd_microstep: 1870.13 | bwd_inner_microstep: 1745.75 | bwd_allreduce_microstep: 124.30 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13570
total_samples=10483, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:09,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.73 | bwd_microstep: 1717.47 | bwd_inner_microstep: 1671.56 | bwd_allreduce_microstep: 45.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14911
total_samples=10487, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:12,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 03:51:12,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.26 | bwd_microstep: 2023.00 | bwd_inner_microstep: 1905.40 | bwd_allreduce_microstep: 117.54 | step_microstep: 107.57
[2025-08-03 03:51:12,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.21 | bwd: 7322.57 | bwd_inner: 7002.92 | bwd_allreduce: 319.42 | step: 107.91
{'loss': 0.7661, 'learning_rate': 1.5218080203018181e-05, 'epoch': 0.35}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14102
total_samples=10491, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:15,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.76 | bwd_microstep: 1807.77 | bwd_inner_microstep: 1727.05 | bwd_allreduce_microstep: 80.66 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13487
total_samples=10495, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:17,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.27 | bwd_microstep: 1772.19 | bwd_inner_microstep: 1681.36 | bwd_allreduce_microstep: 90.77 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 16150
total_samples=10499, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:20,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.19 | bwd_microstep: 1831.59 | bwd_inner_microstep: 1762.57 | bwd_allreduce_microstep: 68.96 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13228
total_samples=10503, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:23,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 03:51:23,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.22 | bwd_microstep: 2085.83 | bwd_inner_microstep: 1712.73 | bwd_allreduce_microstep: 373.04 | step_microstep: 110.18
[2025-08-03 03:51:23,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.37 | bwd: 7497.44 | bwd_inner: 6883.69 | bwd_allreduce: 613.50 | step: 110.62
{'loss': 0.7785, 'learning_rate': 1.5204259056408046e-05, 'epoch': 0.35}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12832
total_samples=10507, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:26,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.10 | bwd_microstep: 2068.35 | bwd_inner_microstep: 1827.42 | bwd_allreduce_microstep: 240.87 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13438
total_samples=10511, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:28,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.93 | bwd_microstep: 1744.75 | bwd_inner_microstep: 1682.92 | bwd_allreduce_microstep: 61.77 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11993
total_samples=10514, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:31,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.61 | bwd_microstep: 1773.59 | bwd_inner_microstep: 1562.73 | bwd_allreduce_microstep: 210.79 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13926
total_samples=10518, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:33,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 03:51:33,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.96 | bwd_microstep: 1804.78 | bwd_inner_microstep: 1706.46 | bwd_allreduce_microstep: 98.25 | step_microstep: 137.42
[2025-08-03 03:51:33,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.53 | bwd: 7391.52 | bwd_inner: 6779.53 | bwd_allreduce: 611.75 | step: 137.73
{'loss': 0.7609, 'learning_rate': 1.5190424262235241e-05, 'epoch': 0.35}
 34%|███▍      | 688/2000 [2:07:53<3:55:43, 10.78s/it]
 34%|███▍      | 689/2000 [2:08:06<4:05:01, 11.21s/it]
 34%|███▍      | 690/2000 [2:08:16<4:02:30, 11.11s/it]
 35%|███▍      | 691/2000 [2:08:27<3:58:28, 10.93s/it]
 35%|███▍      | 692/2000 [2:08:38<3:57:12, 10.88s/it]
 35%|███▍      | 693/2000 [2:08:48<3:55:23, 10.81s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14061
total_samples=10523, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:36,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.64 | bwd_microstep: 1995.37 | bwd_inner_microstep: 1846.74 | bwd_allreduce_microstep: 148.58 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13845
total_samples=10527, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:39,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.84 | bwd_microstep: 1710.98 | bwd_inner_microstep: 1681.39 | bwd_allreduce_microstep: 29.53 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13391
total_samples=10531, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:41,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.44 | bwd_microstep: 1888.18 | bwd_inner_microstep: 1777.04 | bwd_allreduce_microstep: 111.08 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14116
total_samples=10535, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:44,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 03:51:44,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.85 | bwd_microstep: 1761.85 | bwd_inner_microstep: 1712.33 | bwd_allreduce_microstep: 49.46 | step_microstep: 125.53
[2025-08-03 03:51:44,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.68 | bwd: 7356.43 | bwd_inner: 7017.49 | bwd_allreduce: 338.71 | step: 125.97
{'loss': 0.7714, 'learning_rate': 1.5176575856779904e-05, 'epoch': 0.35}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13284
total_samples=10539, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:47,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.55 | bwd_microstep: 1858.72 | bwd_inner_microstep: 1678.34 | bwd_allreduce_microstep: 180.31 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13347
total_samples=10543, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:50,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.67 | bwd_microstep: 2209.21 | bwd_inner_microstep: 2084.99 | bwd_allreduce_microstep: 124.15 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12048
total_samples=10546, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:52,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.18 | bwd_microstep: 1994.27 | bwd_inner_microstep: 1565.46 | bwd_allreduce_microstep: 428.74 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15771
total_samples=10551, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:55,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 03:51:55,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.53 | bwd_microstep: 1795.34 | bwd_inner_microstep: 1789.13 | bwd_allreduce_microstep: 6.14 | step_microstep: 133.03
[2025-08-03 03:51:55,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.86 | bwd: 7857.58 | bwd_inner: 7117.91 | bwd_allreduce: 739.43 | step: 133.55
{'loss': 0.7681, 'learning_rate': 1.516271387635786e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11826
total_samples=10554, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:51:58,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.49 | bwd_microstep: 1877.85 | bwd_inner_microstep: 1545.45 | bwd_allreduce_microstep: 332.34 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13415
total_samples=10559, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:01,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.50 | bwd_microstep: 2015.72 | bwd_inner_microstep: 2009.69 | bwd_allreduce_microstep: 5.96 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13270
total_samples=10563, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:03,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.97 | bwd_microstep: 1939.89 | bwd_inner_microstep: 1720.30 | bwd_allreduce_microstep: 219.52 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13190
total_samples=10567, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:06,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 03:52:06,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.70 | bwd_microstep: 1822.66 | bwd_inner_microstep: 1712.88 | bwd_allreduce_microstep: 109.72 | step_microstep: 113.63
[2025-08-03 03:52:06,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.60 | bwd: 7656.17 | bwd_inner: 6988.32 | bwd_allreduce: 667.62 | step: 113.97
{'loss': 0.7672, 'learning_rate': 1.5148838357320537e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11737
total_samples=10570, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:09,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.11 | bwd_microstep: 2206.09 | bwd_inner_microstep: 2009.27 | bwd_allreduce_microstep: 196.77 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12307
total_samples=10573, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:12,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.71 | bwd_microstep: 1856.23 | bwd_inner_microstep: 1598.15 | bwd_allreduce_microstep: 258.02 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12763
total_samples=10577, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:14,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.75 | bwd_microstep: 1802.62 | bwd_inner_microstep: 1620.08 | bwd_allreduce_microstep: 182.47 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13173
total_samples=10581, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:17,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 03:52:17,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.04 | bwd_microstep: 1789.25 | bwd_inner_microstep: 1690.47 | bwd_allreduce_microstep: 98.71 | step_microstep: 147.43
[2025-08-03 03:52:17,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.53 | bwd: 7654.25 | bwd_inner: 6917.97 | bwd_allreduce: 736.04 | step: 147.78
{'loss': 0.7616, 'learning_rate': 1.5134949336054866e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11972
total_samples=10584, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:20,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.01 | bwd_microstep: 1789.52 | bwd_inner_microstep: 1552.50 | bwd_allreduce_microstep: 236.97 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11657
total_samples=10587, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:22,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.87 | bwd_microstep: 1923.88 | bwd_inner_microstep: 1557.07 | bwd_allreduce_microstep: 366.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13403
total_samples=10591, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:25,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.75 | bwd_microstep: 1839.59 | bwd_inner_microstep: 1707.96 | bwd_allreduce_microstep: 131.55 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13651
total_samples=10595, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:28,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 03:52:28,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.87 | bwd_microstep: 1812.56 | bwd_inner_microstep: 1716.53 | bwd_allreduce_microstep: 95.97 | step_microstep: 114.75
[2025-08-03 03:52:28,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.43 | bwd: 7365.61 | bwd_inner: 6534.06 | bwd_allreduce: 831.31 | step: 115.14
{'loss': 0.7684, 'learning_rate': 1.512104684898319e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12002
total_samples=10598, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:30,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.34 | bwd_microstep: 1843.30 | bwd_inner_microstep: 1700.61 | bwd_allreduce_microstep: 142.62 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14250
total_samples=10602, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:33,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.09 | bwd_microstep: 1785.56 | bwd_inner_microstep: 1727.38 | bwd_allreduce_microstep: 58.12 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13595
total_samples=10606, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:36,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.43 | bwd_microstep: 2234.74 | bwd_inner_microstep: 1902.32 | bwd_allreduce_microstep: 332.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13592
total_samples=10610, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:39,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 03:52:39,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.54 | bwd_microstep: 1861.53 | bwd_inner_microstep: 1728.37 | bwd_allreduce_microstep: 133.11 | step_microstep: 157.42
[2025-08-03 03:52:39,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.31 | bwd: 7725.18 | bwd_inner: 7058.67 | bwd_allreduce: 666.28 | step: 157.88
 35%|███▍      | 694/2000 [2:08:59<3:53:47, 10.74s/it]
 35%|███▍      | 695/2000 [2:09:10<3:56:00, 10.85s/it]
 35%|███▍      | 696/2000 [2:09:21<3:56:24, 10.88s/it]
 35%|███▍      | 697/2000 [2:09:32<3:56:23, 10.89s/it]
 35%|███▍      | 698/2000 [2:09:43<3:54:45, 10.82s/it]
 35%|███▍      | 699/2000 [2:09:54<3:55:49, 10.88s/it]
{'loss': 0.7654, 'learning_rate': 1.5107130932563151e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11986
total_samples=10613, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:41,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.39 | bwd_microstep: 1805.61 | bwd_inner_microstep: 1559.64 | bwd_allreduce_microstep: 245.90 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13932
total_samples=10617, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:44,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.92 | bwd_microstep: 2174.67 | bwd_inner_microstep: 1998.72 | bwd_allreduce_microstep: 175.89 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13870
total_samples=10621, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:47,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.20 | bwd_microstep: 1733.86 | bwd_inner_microstep: 1689.96 | bwd_allreduce_microstep: 43.84 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13491
total_samples=10625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:49,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 03:52:49,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.64 | bwd_microstep: 1765.87 | bwd_inner_microstep: 1698.68 | bwd_allreduce_microstep: 67.13 | step_microstep: 145.54
[2025-08-03 03:52:49,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.08 | bwd: 7480.05 | bwd_inner: 6946.99 | bwd_allreduce: 532.83 | step: 145.86
{'loss': 0.7687, 'learning_rate': 1.5093201623287631e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11834
total_samples=10628, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:52,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.47 | bwd_microstep: 2038.57 | bwd_inner_microstep: 1558.66 | bwd_allreduce_microstep: 479.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13752
total_samples=10632, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:55,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.25 | bwd_microstep: 2077.07 | bwd_inner_microstep: 1767.97 | bwd_allreduce_microstep: 309.04 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13191
total_samples=10636, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:52:58,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.38 | bwd_microstep: 1960.80 | bwd_inner_microstep: 1863.64 | bwd_allreduce_microstep: 97.09 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16099
total_samples=10642, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:01,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 03:53:01,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.63 | bwd_microstep: 1861.97 | bwd_inner_microstep: 1827.74 | bwd_allreduce_microstep: 34.17 | step_microstep: 137.18
[2025-08-03 03:53:01,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.66 | bwd: 7938.46 | bwd_inner: 7018.01 | bwd_allreduce: 920.23 | step: 137.59
{'loss': 0.7673, 'learning_rate': 1.507925895768461e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11771
total_samples=10645, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:03,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.35 | bwd_microstep: 1709.26 | bwd_inner_microstep: 1519.20 | bwd_allreduce_microstep: 189.99 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13090
total_samples=10649, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:06,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.94 | bwd_microstep: 2162.52 | bwd_inner_microstep: 2008.75 | bwd_allreduce_microstep: 153.71 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13206
total_samples=10653, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:09,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.81 | bwd_microstep: 2296.20 | bwd_inner_microstep: 1886.99 | bwd_allreduce_microstep: 409.11 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13451
total_samples=10657, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:12,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 03:53:12,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.16 | bwd_microstep: 1758.80 | bwd_inner_microstep: 1682.82 | bwd_allreduce_microstep: 75.91 | step_microstep: 141.98
[2025-08-03 03:53:12,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.19 | bwd: 7926.82 | bwd_inner: 7097.78 | bwd_allreduce: 828.79 | step: 142.30
{'loss': 0.7689, 'learning_rate': 1.5065302972317108e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11684
total_samples=10660, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:15,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.88 | bwd_microstep: 1928.08 | bwd_inner_microstep: 1754.61 | bwd_allreduce_microstep: 173.41 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13770
total_samples=10664, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:17,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.78 | bwd_microstep: 1893.75 | bwd_inner_microstep: 1887.80 | bwd_allreduce_microstep: 5.89 | step_microstep: 0.09
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13967
total_samples=10669, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:20,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 997.49 | bwd_microstep: 1743.11 | bwd_inner_microstep: 1665.92 | bwd_allreduce_microstep: 77.12 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16208
total_samples=10673, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:23,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 03:53:23,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.08 | bwd_microstep: 1812.81 | bwd_inner_microstep: 1800.08 | bwd_allreduce_microstep: 12.67 | step_microstep: 135.95
[2025-08-03 03:53:23,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3126.16 | bwd: 7377.79 | bwd_inner: 7108.41 | bwd_allreduce: 269.16 | step: 136.25
{'loss': 0.7635, 'learning_rate': 1.5051333703783069e-05, 'epoch': 0.35}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12357
total_samples=10677, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:26,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 667.01 | bwd_microstep: 2092.04 | bwd_inner_microstep: 1946.70 | bwd_allreduce_microstep: 145.28 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14906
total_samples=10682, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:28,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.80 | bwd_microstep: 1760.82 | bwd_inner_microstep: 1750.44 | bwd_allreduce_microstep: 10.32 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13680
total_samples=10686, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:31,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.35 | bwd_microstep: 1926.47 | bwd_inner_microstep: 1851.92 | bwd_allreduce_microstep: 74.49 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13429
total_samples=10690, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:34,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 03:53:34,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.44 | bwd_microstep: 1735.68 | bwd_inner_microstep: 1655.30 | bwd_allreduce_microstep: 80.32 | step_microstep: 155.06
[2025-08-03 03:53:34,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2746.53 | bwd: 7515.06 | bwd_inner: 7204.35 | bwd_allreduce: 310.49 | step: 155.39
{'loss': 0.7638, 'learning_rate': 1.5037351188715265e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12468
total_samples=10693, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:36,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.16 | bwd_microstep: 1938.36 | bwd_inner_microstep: 1607.46 | bwd_allreduce_microstep: 330.84 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13029
total_samples=10697, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:41,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2227.99 | bwd_microstep: 2097.14 | bwd_inner_microstep: 1919.41 | bwd_allreduce_microstep: 177.67 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11942
total_samples=10700, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:44,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.10 | bwd_microstep: 2230.35 | bwd_inner_microstep: 1931.21 | bwd_allreduce_microstep: 299.08 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11568
total_samples=10703, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:46,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 03:53:46,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.75 | bwd_microstep: 1805.01 | bwd_inner_microstep: 1554.60 | bwd_allreduce_microstep: 250.33 | step_microstep: 132.84
[2025-08-03 03:53:46,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4321.94 | bwd: 8070.91 | bwd_inner: 7012.67 | bwd_allreduce: 1058.01 | step: 133.30
 35%|███▌      | 700/2000 [2:10:04<3:55:06, 10.85s/it]
 35%|███▌      | 701/2000 [2:10:16<3:57:20, 10.96s/it]
 35%|███▌      | 702/2000 [2:10:27<3:58:36, 11.03s/it]
 35%|███▌      | 703/2000 [2:10:38<3:57:59, 11.01s/it]
 35%|███▌      | 704/2000 [2:10:48<3:56:05, 10.93s/it]
 35%|███▌      | 705/20
{'loss': 0.76, 'learning_rate': 1.5023355463781221e-05, 'epoch': 0.35}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11540
total_samples=10706, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:49,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.55 | bwd_microstep: 2130.54 | bwd_inner_microstep: 1927.15 | bwd_allreduce_microstep: 203.33 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11999
total_samples=10709, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:52,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.46 | bwd_microstep: 1822.13 | bwd_inner_microstep: 1591.15 | bwd_allreduce_microstep: 230.91 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14267
total_samples=10713, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:55,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.46 | bwd_microstep: 1746.24 | bwd_inner_microstep: 1677.90 | bwd_allreduce_microstep: 68.27 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11722
total_samples=10716, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:53:58,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 03:53:58,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.92 | bwd_microstep: 2857.28 | bwd_inner_microstep: 2535.78 | bwd_allreduce_microstep: 321.44 | step_microstep: 112.11
[2025-08-03 03:53:58,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.31 | bwd: 8556.25 | bwd_inner: 7731.99 | bwd_allreduce: 824.01 | step: 112.45
{'loss': 0.7625, 'learning_rate': 1.5009346565683088e-05, 'epoch': 0.35}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13199
total_samples=10720, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:01,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.58 | bwd_microstep: 1995.85 | bwd_inner_microstep: 1871.21 | bwd_allreduce_microstep: 124.57 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13821
total_samples=10725, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:04,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.70 | bwd_microstep: 2047.26 | bwd_inner_microstep: 1738.27 | bwd_allreduce_microstep: 308.93 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13877
total_samples=10729, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:06,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.07 | bwd_microstep: 1784.98 | bwd_inner_microstep: 1721.68 | bwd_allreduce_microstep: 63.24 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13371
total_samples=10733, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:09,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 03:54:09,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.04 | bwd_microstep: 1697.98 | bwd_inner_microstep: 1613.99 | bwd_allreduce_microstep: 83.92 | step_microstep: 135.74
[2025-08-03 03:54:09,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2749.33 | bwd: 7526.11 | bwd_inner: 6945.14 | bwd_allreduce: 580.74 | step: 136.17
{'loss': 0.7618, 'learning_rate': 1.4995324531157569e-05, 'epoch': 0.35}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13369
total_samples=10737, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:12,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.54 | bwd_microstep: 2177.59 | bwd_inner_microstep: 2011.27 | bwd_allreduce_microstep: 166.25 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12884
total_samples=10741, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:15,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.66 | bwd_microstep: 1877.95 | bwd_inner_microstep: 1687.76 | bwd_allreduce_microstep: 190.12 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14757
total_samples=10745, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:17,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.68 | bwd_microstep: 1930.93 | bwd_inner_microstep: 1900.02 | bwd_allreduce_microstep: 30.84 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11929
total_samples=10748, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:20,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 03:54:20,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.86 | bwd_microstep: 1884.32 | bwd_inner_microstep: 1561.69 | bwd_allreduce_microstep: 322.57 | step_microstep: 115.04
[2025-08-03 03:54:20,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.66 | bwd: 7870.83 | bwd_inner: 7160.74 | bwd_allreduce: 709.86 | step: 115.38
{'loss': 0.7658, 'learning_rate': 1.4981289396975818e-05, 'epoch': 0.35}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14096
total_samples=10752, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:23,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.22 | bwd_microstep: 1963.91 | bwd_inner_microstep: 1883.41 | bwd_allreduce_microstep: 80.44 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13160
total_samples=10756, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:26,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.52 | bwd_microstep: 1976.74 | bwd_inner_microstep: 1851.69 | bwd_allreduce_microstep: 124.98 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13735
total_samples=10762, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:28,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.65 | bwd_microstep: 2016.31 | bwd_inner_microstep: 1714.50 | bwd_allreduce_microstep: 301.74 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11749
total_samples=10765, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:31,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 03:54:31,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.41 | bwd_microstep: 1853.73 | bwd_inner_microstep: 1619.85 | bwd_allreduce_microstep: 233.81 | step_microstep: 127.55
[2025-08-03 03:54:31,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.75 | bwd: 7810.73 | bwd_inner: 7069.44 | bwd_allreduce: 741.05 | step: 127.89
{'loss': 0.7709, 'learning_rate': 1.4967241199943332e-05, 'epoch': 0.35}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13454
total_samples=10769, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:34,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.64 | bwd_microstep: 1749.65 | bwd_inner_microstep: 1683.56 | bwd_allreduce_microstep: 66.02 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13022
total_samples=10773, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:36,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.82 | bwd_microstep: 1841.85 | bwd_inner_microstep: 1835.40 | bwd_allreduce_microstep: 6.38 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12681
total_samples=10777, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:39,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.26 | bwd_microstep: 1878.12 | bwd_inner_microstep: 1812.65 | bwd_allreduce_microstep: 65.41 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11861
total_samples=10780, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:42,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 03:54:42,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.67 | bwd_microstep: 1878.99 | bwd_inner_microstep: 1703.79 | bwd_allreduce_microstep: 175.14 | step_microstep: 106.95
[2025-08-03 03:54:42,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.34 | bwd: 7348.65 | bwd_inner: 7035.38 | bwd_allreduce: 313.03 | step: 107.32
{'loss': 0.7696, 'learning_rate': 1.4953179976899878e-05, 'epoch': 0.35}
 35%|███▌      | 705/2000 [2:11:01<4:08:23, 11.51s/it]
 35%|███▌      | 706/2000 [2:11:13<4:10:12, 11.60s/it]
 35%|███▌      | 707/2000 [2:11:24<4:04:30, 11.35s/it]
 35%|███▌      | 708/2000 [2:11:35<4:02:51, 11.28s/it]
 35%|███▌      | 709/2000 [2:11:46<4:01:04, 11.20s/it]
 36%|███▌      | 710/2000 [2:11:57<3:56:52, 11.02s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14036
total_samples=10785, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:44,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.88 | bwd_microstep: 1790.16 | bwd_inner_microstep: 1711.18 | bwd_allreduce_microstep: 78.91 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11689
total_samples=10788, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:47,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.26 | bwd_microstep: 2046.83 | bwd_inner_microstep: 1826.49 | bwd_allreduce_microstep: 220.28 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13713
total_samples=10793, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:50,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.83 | bwd_microstep: 1772.05 | bwd_inner_microstep: 1694.41 | bwd_allreduce_microstep: 77.57 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11930
total_samples=10796, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:52,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 03:54:52,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.40 | bwd_microstep: 1791.47 | bwd_inner_microstep: 1553.32 | bwd_allreduce_microstep: 238.09 | step_microstep: 118.79
[2025-08-03 03:54:52,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.30 | bwd: 7400.56 | bwd_inner: 6785.40 | bwd_allreduce: 614.92 | step: 119.18
{'loss': 0.7639, 'learning_rate': 1.4939105764719369e-05, 'epoch': 0.36}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13425
total_samples=10800, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:55,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.40 | bwd_microstep: 1859.41 | bwd_inner_microstep: 1656.75 | bwd_allreduce_microstep: 202.61 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11946
total_samples=10803, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:54:58,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.08 | bwd_microstep: 1853.49 | bwd_inner_microstep: 1556.68 | bwd_allreduce_microstep: 296.75 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14406
total_samples=10807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:00,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.30 | bwd_microstep: 1867.95 | bwd_inner_microstep: 1825.37 | bwd_allreduce_microstep: 42.52 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12081
total_samples=10810, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:03,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 03:55:03,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.98 | bwd_microstep: 1716.26 | bwd_inner_microstep: 1545.01 | bwd_allreduce_microstep: 171.18 | step_microstep: 141.22
[2025-08-03 03:55:03,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.69 | bwd: 7297.16 | bwd_inner: 6583.79 | bwd_allreduce: 713.13 | step: 141.57
{'loss': 0.767, 'learning_rate': 1.4925018600309784e-05, 'epoch': 0.36}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11984
total_samples=10813, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:06,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.99 | bwd_microstep: 2225.00 | bwd_inner_microstep: 1995.95 | bwd_allreduce_microstep: 228.99 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11793
total_samples=10816, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:09,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.72 | bwd_microstep: 1873.00 | bwd_inner_microstep: 1543.28 | bwd_allreduce_microstep: 329.65 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12993
total_samples=10820, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:11,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.28 | bwd_microstep: 1837.36 | bwd_inner_microstep: 1670.03 | bwd_allreduce_microstep: 167.26 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11918
total_samples=10823, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:15,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 03:55:15,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.94 | bwd_microstep: 2348.84 | bwd_inner_microstep: 2108.59 | bwd_allreduce_microstep: 240.19 | step_microstep: 129.25
[2025-08-03 03:55:15,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.86 | bwd: 8284.25 | bwd_inner: 7317.84 | bwd_allreduce: 966.17 | step: 129.68
{'loss': 0.7571, 'learning_rate': 1.4910918520613074e-05, 'epoch': 0.36}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13475
total_samples=10827, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:17,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.44 | bwd_microstep: 1788.00 | bwd_inner_microstep: 1671.24 | bwd_allreduce_microstep: 116.69 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11925
total_samples=10830, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:20,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.02 | bwd_microstep: 1815.27 | bwd_inner_microstep: 1557.46 | bwd_allreduce_microstep: 257.75 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15231
total_samples=10834, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:23,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.73 | bwd_microstep: 2199.63 | bwd_inner_microstep: 2005.83 | bwd_allreduce_microstep: 193.73 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13699
total_samples=10838, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:25,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 03:55:25,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.48 | bwd_microstep: 1774.45 | bwd_inner_microstep: 1659.89 | bwd_allreduce_microstep: 114.51 | step_microstep: 136.80
[2025-08-03 03:55:25,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.60 | bwd: 7577.39 | bwd_inner: 6894.41 | bwd_allreduce: 682.75 | step: 137.14
{'loss': 0.7691, 'learning_rate': 1.4896805562605052e-05, 'epoch': 0.36}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13571
total_samples=10842, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:28,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.94 | bwd_microstep: 1764.29 | bwd_inner_microstep: 1693.90 | bwd_allreduce_microstep: 70.33 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13323
total_samples=10846, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:31,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.52 | bwd_microstep: 1817.12 | bwd_inner_microstep: 1711.63 | bwd_allreduce_microstep: 105.43 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13642
total_samples=10850, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:33,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.46 | bwd_microstep: 1798.61 | bwd_inner_microstep: 1704.66 | bwd_allreduce_microstep: 93.89 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11895
total_samples=10853, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:36,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 03:55:36,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.97 | bwd_microstep: 1701.18 | bwd_inner_microstep: 1538.72 | bwd_allreduce_microstep: 162.40 | step_microstep: 121.88
[2025-08-03 03:55:36,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2754.80 | bwd: 7081.26 | bwd_inner: 6648.90 | bwd_allreduce: 432.13 | step: 122.33
{'loss': 0.7583, 'learning_rate': 1.4882679763295307e-05, 'epoch': 0.36}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12089
total_samples=10856, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:39,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.74 | bwd_microstep: 2069.63 | bwd_inner_microstep: 1834.36 | bwd_allreduce_microstep: 235.21 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11652
total_samples=10859, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:41,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.74 | bwd_microstep: 1844.17 | bwd_inner_microstep: 1598.21 | bwd_allreduce_microstep: 245.90 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15370
total_samples=10865, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:44,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.02 | bwd_microstep: 1744.26 | bwd_inner_microstep: 1738.20 | bwd_allreduce_microstep: 6.00 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11864
total_samples=10868, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:47,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 03:55:47,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.67 | bwd_microstep: 2041.86 | bwd_inner_microstep: 1970.41 | bwd_allreduce_microstep: 71.39 | step_microstep: 145.87
[2025-08-03 03:55:47,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.11 | bwd: 7699.97 | bwd_inner: 7141.18 | bwd_allreduce: 558.57 | step: 146.29
 36%|███▌      | 711/2000 [2:12:07<3:54:13, 10.90s/it]
 36%|███▌      | 712/2000 [2:12:18<3:51:54, 10.80s/it]
 36%|███▌      | 713/2000 [2:12:29<3:56:34, 11.03s/it]
 36%|███▌      | 714/2000 [2:12:40<3:55:09, 10.97s/it]
 36%|███▌      | 715/2000 [2:12:51<3:50:47, 10.78s/it]
 36%|███▌      | 716/2000 [2:13:02<3:52:04, 10.84s/it]
{'loss': 0.7794, 'learning_rate': 1.4868541159727097e-05, 'epoch': 0.36}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12458
total_samples=10871, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:49,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.30 | bwd_microstep: 1754.38 | bwd_inner_microstep: 1572.54 | bwd_allreduce_microstep: 181.77 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12214
total_samples=10875, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:52,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.93 | bwd_microstep: 2048.17 | bwd_inner_microstep: 1804.26 | bwd_allreduce_microstep: 243.85 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13578
total_samples=10879, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:55,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.16 | bwd_microstep: 2127.60 | bwd_inner_microstep: 2018.02 | bwd_allreduce_microstep: 109.53 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13755
total_samples=10883, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:55:58,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 03:55:58,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.35 | bwd_microstep: 1793.65 | bwd_inner_microstep: 1717.62 | bwd_allreduce_microstep: 75.97 | step_microstep: 161.01
[2025-08-03 03:55:58,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.68 | bwd: 7723.85 | bwd_inner: 7112.44 | bwd_allreduce: 611.18 | step: 161.34
{'loss': 0.7766, 'learning_rate': 1.4854389788977266e-05, 'epoch': 0.36}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13361
total_samples=10887, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:00,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.87 | bwd_microstep: 1798.30 | bwd_inner_microstep: 1706.99 | bwd_allreduce_microstep: 91.24 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11620
total_samples=10890, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:03,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.48 | bwd_microstep: 1742.06 | bwd_inner_microstep: 1530.02 | bwd_allreduce_microstep: 211.97 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14854
total_samples=10894, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:06,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.21 | bwd_microstep: 2079.15 | bwd_inner_microstep: 1935.34 | bwd_allreduce_microstep: 143.74 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13465
total_samples=10898, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:08,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 03:56:08,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.50 | bwd_microstep: 1781.01 | bwd_inner_microstep: 1706.85 | bwd_allreduce_microstep: 74.09 | step_microstep: 133.48
[2025-08-03 03:56:08,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.98 | bwd: 7400.57 | bwd_inner: 6879.20 | bwd_allreduce: 521.13 | step: 133.92
{'loss': 0.7608, 'learning_rate': 1.4840225688156132e-05, 'epoch': 0.36}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 11999
total_samples=10902, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:11,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.11 | bwd_microstep: 2105.42 | bwd_inner_microstep: 1981.01 | bwd_allreduce_microstep: 124.35 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14491
total_samples=10906, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:14,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.95 | bwd_microstep: 1942.82 | bwd_inner_microstep: 1882.81 | bwd_allreduce_microstep: 59.94 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12898
total_samples=10910, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:17,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.39 | bwd_microstep: 1914.50 | bwd_inner_microstep: 1659.09 | bwd_allreduce_microstep: 255.35 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13293
total_samples=10914, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:19,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.55
[2025-08-03 03:56:19,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.54 | bwd_microstep: 1770.06 | bwd_inner_microstep: 1684.19 | bwd_allreduce_microstep: 85.80 | step_microstep: 118.09
[2025-08-03 03:56:19,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.92 | bwd: 7732.85 | bwd_inner: 7207.10 | bwd_allreduce: 525.52 | step: 118.43
{'loss': 0.7678, 'learning_rate': 1.4826048894407396e-05, 'epoch': 0.36}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13494
total_samples=10918, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:22,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.53 | bwd_microstep: 2233.80 | bwd_inner_microstep: 2101.42 | bwd_allreduce_microstep: 132.31 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13759
total_samples=10922, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:25,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.37 | bwd_microstep: 1785.38 | bwd_inner_microstep: 1705.78 | bwd_allreduce_microstep: 79.54 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15969
total_samples=10926, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:28,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.99 | bwd_microstep: 2241.82 | bwd_inner_microstep: 1929.26 | bwd_allreduce_microstep: 312.50 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12281
total_samples=10930, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:31,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 03:56:31,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.50 | bwd_microstep: 1982.01 | bwd_inner_microstep: 1629.53 | bwd_allreduce_microstep: 352.42 | step_microstep: 109.24
[2025-08-03 03:56:31,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.33 | bwd: 8243.05 | bwd_inner: 7365.98 | bwd_allreduce: 876.85 | step: 109.68
{'loss': 0.7816, 'learning_rate': 1.4811859444908053e-05, 'epoch': 0.36}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13369
total_samples=10934, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:33,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.63 | bwd_microstep: 1739.80 | bwd_inner_microstep: 1674.88 | bwd_allreduce_microstep: 64.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13734
total_samples=10938, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:36,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.79 | bwd_microstep: 2291.44 | bwd_inner_microstep: 1984.25 | bwd_allreduce_microstep: 307.13 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12925
total_samples=10942, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:39,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.72 | bwd_microstep: 1829.88 | bwd_inner_microstep: 1658.74 | bwd_allreduce_microstep: 171.08 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13520
total_samples=10946, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:42,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 03:56:42,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.86 | bwd_microstep: 1778.92 | bwd_inner_microstep: 1705.97 | bwd_allreduce_microstep: 72.88 | step_microstep: 141.67
[2025-08-03 03:56:42,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.93 | bwd: 7640.09 | bwd_inner: 7023.83 | bwd_allreduce: 616.02 | step: 141.98
{'loss': 0.7684, 'learning_rate': 1.4797657376868273e-05, 'epoch': 0.36}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13866
total_samples=10950, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:44,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.99 | bwd_microstep: 1768.20 | bwd_inner_microstep: 1699.13 | bwd_allreduce_microstep: 69.01 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15468
total_samples=10955, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:47,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.60 | bwd_microstep: 2037.58 | bwd_inner_microstep: 1801.11 | bwd_allreduce_microstep: 236.41 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11887
total_samples=10958, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:50,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.19 | bwd_microstep: 1772.68 | bwd_inner_microstep: 1548.73 | bwd_allreduce_microstep: 223.89 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14371
total_samples=10963, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:52,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 03:56:52,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.93 | bwd_microstep: 1751.83 | bwd_inner_microstep: 1709.97 | bwd_allreduce_microstep: 41.80 | step_microstep: 154.11
[2025-08-03 03:56:52,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.64 | bwd: 7330.34 | bwd_inner: 6758.93 | bwd_allreduce: 571.18 | step: 154.45
 36%|███▌      | 717/2000 [2:13:13<3:52:54, 10.89s/it]
 36%|███▌      | 718/2000 [2:13:23<3:51:24, 10.83s/it]
 36%|███▌      | 719/2000 [2:13:34<3:52:09, 10.87s/it]
 36%|███▌      | 720/2000 [2:13:46<3:55:49, 11.05s/it]
 36%|███▌      | 721/2000 [2:13:57<3:54:35, 11.01s/it]
{'loss': 0.7617, 'learning_rate': 1.4783442727531328e-05, 'epoch': 0.36}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12227
total_samples=10966, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:55,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.19 | bwd_microstep: 1736.40 | bwd_inner_microstep: 1558.91 | bwd_allreduce_microstep: 177.42 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13436
total_samples=10970, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:56:58,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.87 | bwd_microstep: 2002.41 | bwd_inner_microstep: 1862.09 | bwd_allreduce_microstep: 140.25 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446
total_samples=10974, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:00,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.32 | bwd_microstep: 1802.90 | bwd_inner_microstep: 1696.40 | bwd_allreduce_microstep: 106.43 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13149
total_samples=10978, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:03,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 03:57:03,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.28 | bwd_microstep: 1869.71 | bwd_inner_microstep: 1814.58 | bwd_allreduce_microstep: 55.07 | step_microstep: 138.98
[2025-08-03 03:57:03,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.59 | bwd: 7411.46 | bwd_inner: 6931.98 | bwd_allreduce: 479.25 | step: 139.55
{'loss': 0.7697, 'learning_rate': 1.4769215534173476e-05, 'epoch': 0.36}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13628
total_samples=10982, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:06,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.03 | bwd_microstep: 2010.70 | bwd_inner_microstep: 1879.59 | bwd_allreduce_microstep: 131.04 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11859
total_samples=10985, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:09,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 962.71 | bwd_microstep: 1973.19 | bwd_inner_microstep: 1796.63 | bwd_allreduce_microstep: 176.50 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13169
total_samples=10989, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:12,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.41 | bwd_microstep: 2080.35 | bwd_inner_microstep: 1950.61 | bwd_allreduce_microstep: 129.67 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13582
total_samples=10993, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:14,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54
[2025-08-03 03:57:14,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.91 | bwd_microstep: 1866.00 | bwd_inner_microstep: 1706.99 | bwd_allreduce_microstep: 158.94 | step_microstep: 127.68
[2025-08-03 03:57:14,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3040.99 | bwd: 7930.29 | bwd_inner: 7333.82 | bwd_allreduce: 596.22 | step: 128.30
{'loss': 0.7656, 'learning_rate': 1.4754975834103877e-05, 'epoch': 0.36}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13199
total_samples=10997, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:17,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.81 | bwd_microstep: 1792.67 | bwd_inner_microstep: 1653.72 | bwd_allreduce_microstep: 138.88 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12137
total_samples=11000, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:20,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.29 | bwd_microstep: 1758.66 | bwd_inner_microstep: 1583.22 | bwd_allreduce_microstep: 175.36 | step_microstep: 0.16
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12875
total_samples=11004, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:22,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.49 | bwd_microstep: 1875.24 | bwd_inner_microstep: 1794.53 | bwd_allreduce_microstep: 80.65 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13650
total_samples=11008, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:25,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:57:25,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.55 | bwd_microstep: 2106.22 | bwd_inner_microstep: 1884.92 | bwd_allreduce_microstep: 221.23 | step_microstep: 131.58
[2025-08-03 03:57:25,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.07 | bwd: 7532.83 | bwd_inner: 6916.38 | bwd_allreduce: 616.21 | step: 131.97
{'loss': 0.7677, 'learning_rate': 1.4740723664664483e-05, 'epoch': 0.36}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13245
total_samples=11012, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:28,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.63 | bwd_microstep: 2034.60 | bwd_inner_microstep: 1950.91 | bwd_allreduce_microstep: 83.62 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12906
total_samples=11016, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:31,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.42 | bwd_microstep: 2102.16 | bwd_inner_microstep: 1920.41 | bwd_allreduce_microstep: 181.69 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11774
total_samples=11019, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:34,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.53 | bwd_microstep: 1883.41 | bwd_inner_microstep: 1700.80 | bwd_allreduce_microstep: 182.53 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13524
total_samples=11023, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:36,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 03:57:36,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.70 | bwd_microstep: 1820.70 | bwd_inner_microstep: 1697.57 | bwd_allreduce_microstep: 123.07 | step_microstep: 117.83
[2025-08-03 03:57:36,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.22 | bwd: 7840.91 | bwd_inner: 7269.68 | bwd_allreduce: 570.99 | step: 118.28
{'loss': 0.7585, 'learning_rate': 1.4726459063229946e-05, 'epoch': 0.36}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14883
total_samples=11027, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:39,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.72 | bwd_microstep: 2037.64 | bwd_inner_microstep: 1725.22 | bwd_allreduce_microstep: 312.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13486
total_samples=11031, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:42,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.54 | bwd_microstep: 1806.81 | bwd_inner_microstep: 1714.32 | bwd_allreduce_microstep: 92.42 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12772
total_samples=11035, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:45,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1089.30 | bwd_microstep: 1768.88 | bwd_inner_microstep: 1666.15 | bwd_allreduce_microstep: 102.68 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14277
total_samples=11041, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:47,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 03:57:47,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.59 | bwd_microstep: 1734.71 | bwd_inner_microstep: 1686.33 | bwd_allreduce_microstep: 48.31 | step_microstep: 146.12
[2025-08-03 03:57:47,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3194.09 | bwd: 7348.09 | bwd_inner: 6792.02 | bwd_allreduce: 555.84 | step: 146.45
{'loss': 0.7704, 'learning_rate': 1.4712182067207516e-05, 'epoch': 0.36}
36%|███▌      | 722/2000 [2:14:07<3:51:51, 10.89s/it]
36%|███▌      | 723/2000 [2:14:18<3:50:22, 10.82s/it]
36%|███▌      | 724/2000 [2:14:29<3:54:09, 11.01s/it]
36%|███▋      | 725/2000 [2:14:40<3:52:32, 10.94s/it]
36%|███▋      | 726/2000 [2:14:51<3:53:22, 10.99s/it]
36%|███▋      | 727/2000 [2:15:02<3:53:12, 10.99s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13339
total_samples=11045, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:50,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.21 | bwd_microstep: 2195.58 | bwd_inner_microstep: 2077.32 | bwd_allreduce_microstep: 118.19 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11698
total_samples=11048, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:53,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.06 | bwd_microstep: 1834.03 | bwd_inner_microstep: 1559.32 | bwd_allreduce_microstep: 274.64 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12634
total_samples=11051, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:57:57,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1963.11 | bwd_microstep: 1854.16 | bwd_inner_microstep: 1611.88 | bwd_allreduce_microstep: 242.22 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 16307
total_samples=11056, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:00,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 03:58:00,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.11 | bwd_microstep: 2123.91 | bwd_inner_microstep: 1884.96 | bwd_allreduce_microstep: 238.87 | step_microstep: 150.93
[2025-08-03 03:58:00,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4064.43 | bwd: 8007.72 | bwd_inner: 7133.48 | bwd_allreduce: 874.01 | step: 151.24
{'loss': 0.7668, 'learning_rate': 1.4697892714036959e-05, 'epoch': 0.36}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13719
total_samples=11060, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:02,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.22 | bwd_microstep: 1756.63 | bwd_inner_microstep: 1694.15 | bwd_allreduce_microstep: 62.41 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11798
total_samples=11063, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:05,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.35 | bwd_microstep: 1785.90 | bwd_inner_microstep: 1542.18 | bwd_allreduce_microstep: 243.65 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13302
total_samples=11067, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:08,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.12 | bwd_microstep: 1828.31 | bwd_inner_microstep: 1777.75 | bwd_allreduce_microstep: 50.50 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11689
total_samples=11070, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:10,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 03:58:10,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.38 | bwd_microstep: 1805.02 | bwd_inner_microstep: 1576.39 | bwd_allreduce_microstep: 228.57 | step_microstep: 123.01
[2025-08-03 03:58:10,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.99 | bwd: 7175.91 | bwd_inner: 6590.48 | bwd_allreduce: 585.21 | step: 123.34
{'loss': 0.7718, 'learning_rate': 1.4683591041190433e-05, 'epoch': 0.36}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13143
total_samples=11074, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:13,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.45 | bwd_microstep: 1999.63 | bwd_inner_microstep: 1866.77 | bwd_allreduce_microstep: 132.79 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11692
total_samples=11077, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:16,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.44 | bwd_microstep: 2491.87 | bwd_inner_microstep: 1530.58 | bwd_allreduce_microstep: 961.23 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13022
total_samples=11081, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:19,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.85 | bwd_microstep: 1791.59 | bwd_inner_microstep: 1691.01 | bwd_allreduce_microstep: 100.52 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12137
total_samples=11084, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:22,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 03:58:22,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.89 | bwd_microstep: 2028.90 | bwd_inner_microstep: 2008.13 | bwd_allreduce_microstep: 20.71 | step_microstep: 136.30
[2025-08-03 03:58:22,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.56 | bwd: 8312.03 | bwd_inner: 7096.48 | bwd_allreduce: 1215.33 | step: 136.63
{'loss': 0.7641, 'learning_rate': 1.4669277086172406e-05, 'epoch': 0.36}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13766
total_samples=11088, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:25,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.48 | bwd_microstep: 2107.77 | bwd_inner_microstep: 1944.34 | bwd_allreduce_microstep: 163.38 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12004
total_samples=11091, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:28,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.70 | bwd_microstep: 2060.98 | bwd_inner_microstep: 1836.18 | bwd_allreduce_microstep: 224.74 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13430
total_samples=11096, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:30,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.71 | bwd_microstep: 1719.19 | bwd_inner_microstep: 1676.25 | bwd_allreduce_microstep: 42.87 | step_microstep: 0.09
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13409
total_samples=11100, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:33,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.83
[2025-08-03 03:58:33,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.25 | bwd_microstep: 1976.96 | bwd_inner_microstep: 1665.11 | bwd_allreduce_microstep: 311.78 | step_microstep: 125.06
[2025-08-03 03:58:33,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.06 | bwd: 7864.95 | bwd_inner: 7121.87 | bwd_allreduce: 742.85 | step: 125.34
{'loss': 0.7642, 'learning_rate': 1.4654950886519563e-05, 'epoch': 0.37}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12980
total_samples=11104, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:36,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.81 | bwd_microstep: 1924.69 | bwd_inner_microstep: 1789.52 | bwd_allreduce_microstep: 135.11 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11857
total_samples=11107, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:39,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.23 | bwd_microstep: 2111.42 | bwd_inner_microstep: 1949.95 | bwd_allreduce_microstep: 161.40 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14106
total_samples=11112, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:42,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 891.05 | bwd_microstep: 2086.84 | bwd_inner_microstep: 1906.89 | bwd_allreduce_microstep: 179.89 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13589
total_samples=11116, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:44,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 03:58:44,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.70 | bwd_microstep: 1754.04 | bwd_inner_microstep: 1688.55 | bwd_allreduce_microstep: 65.42 | step_microstep: 161.32
[2025-08-03 03:58:44,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2961.73 | bwd: 7877.03 | bwd_inner: 7334.89 | bwd_allreduce: 541.90 | step: 161.65
{'loss': 0.7676, 'learning_rate': 1.4640612479800686e-05, 'epoch': 0.37}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12971
total_samples=11120, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:47,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.15 | bwd_microstep: 1927.23 | bwd_inner_microstep: 1815.04 | bwd_allreduce_microstep: 112.12 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12046
total_samples=11123, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:50,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.11 | bwd_microstep: 2109.51 | bwd_inner_microstep: 1892.54 | bwd_allreduce_microstep: 216.90 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13915
total_samples=11127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:53,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.17 | bwd_microstep: 1784.94 | bwd_inner_microstep: 1724.38 | bwd_allreduce_microstep: 60.49 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13086
total_samples=11131, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:55,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.70
[2025-08-03 03:58:55,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.92 | bwd_microstep: 1727.43 | bwd_inner_microstep: 1645.86 | bwd_allreduce_microstep: 81.51 | step_microstep: 135.78
[2025-08-03 03:58:55,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.28 | bwd: 7549.16 | bwd_inner: 7077.83 | bwd_allreduce: 471.10 | step: 136.11
36%|███▋      | 728/2000 [2:15:15<4:03:04, 11.47s/it]
36%|███▋      | 729/2000 [2:15:25<3:56:25, 11.16s/it]
36%|███▋      | 730/2000 [2:15:37<3:58:38, 11.27s/it]
37%|███▋      | 731/2000 [2:15:48<3:57:13, 11.22s/it]
37%|███▋      | 732/2000 [2:15:59<3:57:54, 11.26s/it]
{'loss': 0.7737, 'learning_rate': 1.4626261903616579e-05, 'epoch': 0.37}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12947
total_samples=11135, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:58:58,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 980.59 | bwd_microstep: 1731.15 | bwd_inner_microstep: 1625.45 | bwd_allreduce_microstep: 105.64 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14161
total_samples=11139, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:01,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 980.42 | bwd_microstep: 1896.12 | bwd_inner_microstep: 1753.56 | bwd_allreduce_microstep: 142.50 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13341
total_samples=11143, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:04,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.00 | bwd_microstep: 2292.65 | bwd_inner_microstep: 1994.33 | bwd_allreduce_microstep: 298.26 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13533
total_samples=11147, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:07,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78
[2025-08-03 03:59:07,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.59 | bwd_microstep: 1940.03 | bwd_inner_microstep: 1831.57 | bwd_allreduce_microstep: 108.39 | step_microstep: 113.00
[2025-08-03 03:59:07,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3346.54 | bwd: 7860.00 | bwd_inner: 7204.91 | bwd_allreduce: 654.86 | step: 113.31
{'loss': 0.7619, 'learning_rate': 1.4611899195599952e-05, 'epoch': 0.37}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12445
total_samples=11150, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:09,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.58 | bwd_microstep: 1784.88 | bwd_inner_microstep: 1580.53 | bwd_allreduce_microstep: 204.29 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13117
total_samples=11154, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:12,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.46 | bwd_microstep: 1842.86 | bwd_inner_microstep: 1634.86 | bwd_allreduce_microstep: 207.95 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14618
total_samples=11158, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:15,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.76 | bwd_microstep: 2080.95 | bwd_inner_microstep: 1925.70 | bwd_allreduce_microstep: 155.19 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14189
total_samples=11162, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:18,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:59:18,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.09 | bwd_microstep: 1827.51 | bwd_inner_microstep: 1709.75 | bwd_allreduce_microstep: 117.69 | step_microstep: 150.94
[2025-08-03 03:59:18,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.81 | bwd: 7536.26 | bwd_inner: 6850.83 | bwd_allreduce: 685.19 | step: 151.35
{'loss': 0.7704, 'learning_rate': 1.4597524393415336e-05, 'epoch': 0.37}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14096
total_samples=11166, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:20,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.65 | bwd_microstep: 1880.93 | bwd_inner_microstep: 1724.00 | bwd_allreduce_microstep: 156.87 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14750
total_samples=11170, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:23,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.20 | bwd_microstep: 1794.36 | bwd_inner_microstep: 1724.43 | bwd_allreduce_microstep: 69.87 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13526
total_samples=11174, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:25,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.81 | bwd_microstep: 1779.29 | bwd_inner_microstep: 1689.06 | bwd_allreduce_microstep: 90.16 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12403
total_samples=11178, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:28,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 03:59:28,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.85 | bwd_microstep: 1809.59 | bwd_inner_microstep: 1594.14 | bwd_allreduce_microstep: 215.38 | step_microstep: 136.84
[2025-08-03 03:59:28,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.45 | bwd: 7264.21 | bwd_inner: 6731.63 | bwd_allreduce: 532.36 | step: 137.17
{'loss': 0.7708, 'learning_rate': 1.4583137534758968e-05, 'epoch': 0.37}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11710
total_samples=11181, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:31,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.53 | bwd_microstep: 1961.35 | bwd_inner_microstep: 1517.50 | bwd_allreduce_microstep: 443.78 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13824
total_samples=11185, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:34,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.20 | bwd_microstep: 1821.32 | bwd_inner_microstep: 1764.99 | bwd_allreduce_microstep: 56.26 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14024
total_samples=11190, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:37,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.58 | bwd_microstep: 2502.92 | bwd_inner_microstep: 2217.92 | bwd_allreduce_microstep: 284.94 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13221
total_samples=11194, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:40,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 03:59:40,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.30 | bwd_microstep: 1790.73 | bwd_inner_microstep: 1694.86 | bwd_allreduce_microstep: 95.81 | step_microstep: 131.08
[2025-08-03 03:59:40,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.52 | bwd: 8076.36 | bwd_inner: 7195.26 | bwd_allreduce: 880.88 | step: 131.41
{'loss': 0.7492, 'learning_rate': 1.4568738657358715e-05, 'epoch': 0.37}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14208
total_samples=11198, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:42,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.69 | bwd_microstep: 1908.40 | bwd_inner_microstep: 1810.19 | bwd_allreduce_microstep: 98.15 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12000
total_samples=11201, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:45,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.77 | bwd_microstep: 1854.49 | bwd_inner_microstep: 1593.20 | bwd_allreduce_microstep: 261.23 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12243
total_samples=11205, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:47,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.88 | bwd_microstep: 1811.23 | bwd_inner_microstep: 1619.40 | bwd_allreduce_microstep: 191.76 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13359
total_samples=11210, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:50,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 03:59:50,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.85 | bwd_microstep: 1736.97 | bwd_inner_microstep: 1642.92 | bwd_allreduce_microstep: 93.99 | step_microstep: 456.31
[2025-08-03 03:59:50,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.12 | bwd: 7311.14 | bwd_inner: 6665.71 | bwd_allreduce: 645.20 | step: 456.64
{'loss': 0.763, 'learning_rate': 1.455432779897395e-05, 'epoch': 0.37}
37%|███▋      | 733/2000 [2:16:10<3:55:00, 11.13s/it]
37%|███▋      | 734/2000 [2:16:22<3:58:11, 11.29s/it]
37%|███▋      | 735/2000 [2:16:32<3:55:02, 11.15s/it]
37%|███▋      | 736/2000 [2:16:43<3:51:03, 10.97s/it]
37%|███▋      | 737/2000 [2:16:54<3:53:02, 11.07s/it]
37%|███▋      | 738/2000 [2:17:05<3:51:33, 11.01s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12694
total_samples=11214, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:53,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.18 | bwd_microstep: 2000.49 | bwd_inner_microstep: 1900.61 | bwd_allreduce_microstep: 99.82 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11853
total_samples=11217, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:56,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.45 | bwd_microstep: 2131.74 | bwd_inner_microstep: 1746.67 | bwd_allreduce_microstep: 385.01 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11862
total_samples=11220, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 03:59:59,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.39 | bwd_microstep: 1788.74 | bwd_inner_microstep: 1557.63 | bwd_allreduce_microstep: 231.06 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12183
total_samples=11223, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:02,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 04:00:02,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.28 | bwd_microstep: 2085.22 | bwd_inner_microstep: 1806.45 | bwd_allreduce_microstep: 278.70 | step_microstep: 119.84
[2025-08-03 04:00:02,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.23 | bwd: 8006.24 | bwd_inner: 7011.35 | bwd_allreduce: 994.67 | step: 120.17
{'loss': 0.7659, 'learning_rate': 1.4539904997395468e-05, 'epoch': 0.37}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11934
total_samples=11226, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:04,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.98 | bwd_microstep: 2081.92 | bwd_inner_microstep: 1605.24 | bwd_allreduce_microstep: 476.58 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13351
total_samples=11230, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:07,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.97 | bwd_microstep: 1817.15 | bwd_inner_microstep: 1686.85 | bwd_allreduce_microstep: 130.24 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 16139
total_samples=11235, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:10,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.90 | bwd_microstep: 1810.13 | bwd_inner_microstep: 1767.87 | bwd_allreduce_microstep: 42.20 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11679
total_samples=11238, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:12,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 04:00:12,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.79 | bwd_microstep: 1702.04 | bwd_inner_microstep: 1523.90 | bwd_allreduce_microstep: 178.08 | step_microstep: 160.01
[2025-08-03 04:00:12,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2753.57 | bwd: 7411.28 | bwd_inner: 6583.86 | bwd_allreduce: 827.15 | step: 160.44
{'loss': 0.7666, 'learning_rate': 1.4525470290445392e-05, 'epoch': 0.37}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11760
total_samples=11241, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:15,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.92 | bwd_microstep: 2048.47 | bwd_inner_microstep: 1831.76 | bwd_allreduce_microstep: 216.64 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13380
total_samples=11246, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:18,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.19 | bwd_microstep: 1783.83 | bwd_inner_microstep: 1674.57 | bwd_allreduce_microstep: 109.19 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12297
total_samples=11249, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:20,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.91 | bwd_microstep: 1768.19 | bwd_inner_microstep: 1564.74 | bwd_allreduce_microstep: 203.39 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14405
total_samples=11253, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:23,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 04:00:23,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.27 | bwd_microstep: 2024.78 | bwd_inner_microstep: 1902.36 | bwd_allreduce_microstep: 122.34 | step_microstep: 110.25
[2025-08-03 04:00:23,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.22 | bwd: 7625.32 | bwd_inner: 6973.43 | bwd_allreduce: 651.65 | step: 110.59
{'loss': 0.7583, 'learning_rate': 1.4511023715977048e-05, 'epoch': 0.37}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11602
total_samples=11256, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:26,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.66 | bwd_microstep: 2083.62 | bwd_inner_microstep: 1762.23 | bwd_allreduce_microstep: 321.32 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13258
total_samples=11260, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:29,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.15 | bwd_microstep: 1779.00 | bwd_inner_microstep: 1695.29 | bwd_allreduce_microstep: 83.64 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13322
total_samples=11264, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:31,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.10 | bwd_microstep: 1927.42 | bwd_inner_microstep: 1732.41 | bwd_allreduce_microstep: 194.94 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14345
total_samples=11268, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:34,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 04:00:34,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.84 | bwd_microstep: 1843.81 | bwd_inner_microstep: 1779.17 | bwd_allreduce_microstep: 64.57 | step_microstep: 148.63
[2025-08-03 04:00:34,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.68 | bwd: 7633.90 | bwd_inner: 6969.10 | bwd_allreduce: 664.55 | step: 149.00
{'loss': 0.7611, 'learning_rate': 1.4496565311874902e-05, 'epoch': 0.37}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11900
total_samples=11271, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:37,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.11 | bwd_microstep: 1737.92 | bwd_inner_microstep: 1545.62 | bwd_allreduce_microstep: 192.22 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12768
total_samples=11275, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:39,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.90 | bwd_microstep: 1797.24 | bwd_inner_microstep: 1625.71 | bwd_allreduce_microstep: 171.46 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12225
total_samples=11278, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:42,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.28 | bwd_microstep: 1745.04 | bwd_inner_microstep: 1557.16 | bwd_allreduce_microstep: 187.82 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13287
total_samples=11282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:45,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 04:00:45,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.77 | bwd_microstep: 2021.18 | bwd_inner_microstep: 1871.54 | bwd_allreduce_microstep: 149.57 | step_microstep: 143.92
[2025-08-03 04:00:45,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.98 | bwd: 7301.43 | bwd_inner: 6600.03 | bwd_allreduce: 701.15 | step: 144.41
{'loss': 0.7605, 'learning_rate': 1.4482095116054421e-05, 'epoch': 0.37}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13526
total_samples=11286, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:48,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.66 | bwd_microstep: 2373.21 | bwd_inner_microstep: 2155.99 | bwd_allreduce_microstep: 217.16 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13162
total_samples=11290, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:50,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.34 | bwd_microstep: 1755.23 | bwd_inner_microstep: 1673.37 | bwd_allreduce_microstep: 81.80 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11672
total_samples=11293, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:53,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.53 | bwd_microstep: 2064.66 | bwd_inner_microstep: 1589.72 | bwd_allreduce_microstep: 474.83 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13514
total_samples=11297, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:56,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 04:00:56,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.47 | bwd_microstep: 1940.03 | bwd_inner_microstep: 1873.34 | bwd_allreduce_microstep: 66.60 | step_microstep: 121.15
[2025-08-03 04:00:56,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.92 | bwd: 8133.16 | bwd_inner: 7292.43 | bwd_allreduce: 840.45 | step: 121.48
 37%|███▋      | 739/2000 [2:17:16<3:52:59, 11.09s/it]
 37%|███▋      | 740/2000 [2:17:27<3:50:02, 10.95s/it]
 37%|███▋      | 741/2000 [2:17:38<3:49:07, 10.92s/it]
 37%|███▋      | 742/2000 [2:17:49<3:49:00, 10.92s/it]
 37%|███▋      | 743/2000 [2:17:59<3:46:34, 10.82s/it]
 37%|███▋      | 744/2000 [2:18:11<3:49:56, 10.98s/it]
{'loss': 0.7597, 'learning_rate': 1.4467613166462024e-05, 'epoch': 0.37}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12814
total_samples=11301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:00:59,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.41 | bwd_microstep: 1720.98 | bwd_inner_microstep: 1634.56 | bwd_allreduce_microstep: 86.35 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12625
total_samples=11305, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:01,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.19 | bwd_microstep: 1994.67 | bwd_inner_microstep: 1818.41 | bwd_allreduce_microstep: 176.20 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11841
total_samples=11308, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:04,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.11 | bwd_microstep: 1923.74 | bwd_inner_microstep: 1789.54 | bwd_allreduce_microstep: 134.12 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11866
total_samples=11311, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:07,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 04:01:07,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.03 | bwd_microstep: 1799.38 | bwd_inner_microstep: 1556.99 | bwd_allreduce_microstep: 242.33 | step_microstep: 133.46
[2025-08-03 04:01:07,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.68 | bwd: 7438.81 | bwd_inner: 6799.50 | bwd_allreduce: 639.06 | step: 133.83
{'loss': 0.7565, 'learning_rate': 1.4453119501074924e-05, 'epoch': 0.37}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11878
total_samples=11314, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:09,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.62 | bwd_microstep: 1769.78 | bwd_inner_microstep: 1558.23 | bwd_allreduce_microstep: 211.48 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11940
total_samples=11317, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:12,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.14 | bwd_microstep: 2044.28 | bwd_inner_microstep: 1832.31 | bwd_allreduce_microstep: 211.91 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12064
total_samples=11320, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:15,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.72 | bwd_microstep: 1946.46 | bwd_inner_microstep: 1765.46 | bwd_allreduce_microstep: 180.94 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13507
total_samples=11324, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:18,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 04:01:18,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.72 | bwd_microstep: 2225.52 | bwd_inner_microstep: 2069.85 | bwd_allreduce_microstep: 155.61 | step_microstep: 159.29
[2025-08-03 04:01:18,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.13 | bwd: 7986.10 | bwd_inner: 7225.84 | bwd_allreduce: 760.02 | step: 159.75
{'loss': 0.7679, 'learning_rate': 1.4438614157901073e-05, 'epoch': 0.37}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12018
total_samples=11327, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:21,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.08 | bwd_microstep: 1832.99 | bwd_inner_microstep: 1582.23 | bwd_allreduce_microstep: 250.70 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11911
total_samples=11330, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:23,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.83 | bwd_microstep: 1932.13 | bwd_inner_microstep: 1571.64 | bwd_allreduce_microstep: 360.42 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11665
total_samples=11333, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:26,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.54 | bwd_microstep: 1800.86 | bwd_inner_microstep: 1565.05 | bwd_allreduce_microstep: 235.75 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13046
total_samples=11337, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:29,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 04:01:29,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.97 | bwd_microstep: 1981.74 | bwd_inner_microstep: 1658.36 | bwd_allreduce_microstep: 323.31 | step_microstep: 114.78
[2025-08-03 04:01:29,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.35 | bwd: 7547.77 | bwd_inner: 6377.27 | bwd_allreduce: 1170.27 | step: 115.12
{'loss': 0.7599, 'learning_rate': 1.4424097174979038e-05, 'epoch': 0.37}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446
total_samples=11341, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:31,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.37 | bwd_microstep: 1790.02 | bwd_inner_microstep: 1707.33 | bwd_allreduce_microstep: 82.62 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12569
total_samples=11344, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:34,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.29 | bwd_microstep: 1851.29 | bwd_inner_microstep: 1585.64 | bwd_allreduce_microstep: 265.59 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14941
total_samples=11348, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:37,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.72 | bwd_microstep: 1917.78 | bwd_inner_microstep: 1740.01 | bwd_allreduce_microstep: 177.71 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13350
total_samples=11352, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:40,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 04:01:40,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.74 | bwd_microstep: 1837.58 | bwd_inner_microstep: 1676.32 | bwd_allreduce_microstep: 161.19 | step_microstep: 143.08
[2025-08-03 04:01:40,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2855.05 | bwd: 7396.72 | bwd_inner: 6709.30 | bwd_allreduce: 687.19 | step: 143.51
{'loss': 0.7685, 'learning_rate': 1.4409568590377918e-05, 'epoch': 0.37}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11983
total_samples=11355, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:42,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.11 | bwd_microstep: 1888.35 | bwd_inner_microstep: 1715.16 | bwd_allreduce_microstep: 173.13 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13050
total_samples=11359, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:45,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.38 | bwd_microstep: 1974.85 | bwd_inner_microstep: 1681.70 | bwd_allreduce_microstep: 293.09 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14615
total_samples=11363, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:48,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.15 | bwd_microstep: 1785.95 | bwd_inner_microstep: 1739.51 | bwd_allreduce_microstep: 46.37 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13927
total_samples=11367, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:50,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 04:01:50,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.31 | bwd_microstep: 1881.48 | bwd_inner_microstep: 1755.30 | bwd_allreduce_microstep: 126.11 | step_microstep: 120.90
[2025-08-03 04:01:50,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.88 | bwd: 7530.68 | bwd_inner: 6891.67 | bwd_allreduce: 638.78 | step: 121.26
{'loss': 0.7732, 'learning_rate': 1.4395028442197231e-05, 'epoch': 0.37}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13608
total_samples=11371, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:53,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.20 | bwd_microstep: 1751.84 | bwd_inner_microstep: 1618.70 | bwd_allreduce_microstep: 133.08 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13827
total_samples=11375, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:55,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.31 | bwd_microstep: 1740.87 | bwd_inner_microstep: 1704.13 | bwd_allreduce_microstep: 36.67 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13154
total_samples=11379, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:01:58,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.96 | bwd_microstep: 2038.53 | bwd_inner_microstep: 2030.49 | bwd_allreduce_microstep: 7.98 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13837
total_samples=11383, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:01,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 04:02:01,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.07 | bwd_microstep: 2269.48 | bwd_inner_microstep: 1912.80 | bwd_allreduce_microstep: 356.62 | step_microstep: 114.94
[2025-08-03 04:02:01,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.47 | bwd: 7800.78 | bwd_inner: 7266.11 | bwd_allreduce: 534.43 | step: 115.27
 37%|███▋      | 745/2000 [2:18:22<3:48:03, 10.90s/it]
 37%|███▋      | 746/2000 [2:18:33<3:50:24, 11.02s/it]
 37%|███▋      | 747/2000 [2:18:44<3:48:53, 10.96s/it]
 37%|███▋      | 748/2000 [2:18:54<3:47:12, 10.89s/it]
 37%|███▋      | 749/2000 [2:19:05<3:46:38, 10.87s/it]
{'loss': 0.7784, 'learning_rate': 1.4380476768566825e-05, 'epoch': 0.38}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11953
total_samples=11386, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:04,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.98 | bwd_microstep: 1802.81 | bwd_inner_microstep: 1586.05 | bwd_allreduce_microstep: 216.69 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13426
total_samples=11390, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:07,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.09 | bwd_microstep: 1946.52 | bwd_inner_microstep: 1745.92 | bwd_allreduce_microstep: 200.53 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14098
total_samples=11395, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:09,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.74 | bwd_microstep: 1768.58 | bwd_inner_microstep: 1709.69 | bwd_allreduce_microstep: 58.83 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13274
total_samples=11399, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:12,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 04:02:12,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.91 | bwd_microstep: 2227.44 | bwd_inner_microstep: 1949.92 | bwd_allreduce_microstep: 277.46 | step_microstep: 113.92
[2025-08-03 04:02:12,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.65 | bwd: 7745.40 | bwd_inner: 6991.57 | bwd_allreduce: 753.59 | step: 114.38
{'loss': 0.755, 'learning_rate': 1.4365913607646762e-05, 'epoch': 0.38}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11974
total_samples=11402, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:15,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.42 | bwd_microstep: 1737.08 | bwd_inner_microstep: 1539.56 | bwd_allreduce_microstep: 197.45 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13681
total_samples=11406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:18,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.82 | bwd_microstep: 1995.48 | bwd_inner_microstep: 1903.78 | bwd_allreduce_microstep: 91.64 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14262
total_samples=11410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:20,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.24 | bwd_microstep: 1787.97 | bwd_inner_microstep: 1716.44 | bwd_allreduce_microstep: 71.46 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12893
total_samples=11414, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:23,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 04:02:23,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.32 | bwd_microstep: 1797.61 | bwd_inner_microstep: 1695.41 | bwd_allreduce_microstep: 102.14 | step_microstep: 136.02
[2025-08-03 04:02:23,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.73 | bwd: 7318.19 | bwd_inner: 6855.18 | bwd_allreduce: 462.77 | step: 136.50
{'loss': 0.7668, 'learning_rate': 1.4351338997627233e-05, 'epoch': 0.38}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12144
total_samples=11418, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:26,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.54 | bwd_microstep: 1966.45 | bwd_inner_microstep: 1544.81 | bwd_allreduce_microstep: 421.57 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13813
total_samples=11422, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:29,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.15 | bwd_microstep: 2242.00 | bwd_inner_microstep: 2148.09 | bwd_allreduce_microstep: 93.84 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15573
total_samples=11426, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:31,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.56 | bwd_microstep: 1960.76 | bwd_inner_microstep: 1903.15 | bwd_allreduce_microstep: 57.54 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11766
total_samples=11429, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:34,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 04:02:34,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.81 | bwd_microstep: 1752.74 | bwd_inner_microstep: 1541.19 | bwd_allreduce_microstep: 211.49 | step_microstep: 169.80
[2025-08-03 04:02:34,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.99 | bwd: 7922.00 | bwd_inner: 7137.24 | bwd_allreduce: 784.52 | step: 170.17
{'loss': 0.7631, 'learning_rate': 1.433675297672846e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13528
total_samples=11433, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:37,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.44 | bwd_microstep: 1762.67 | bwd_inner_microstep: 1710.04 | bwd_allreduce_microstep: 52.57 | step_microstep: 0.20
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14751
total_samples=11437, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:40,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.01 | bwd_microstep: 2389.09 | bwd_inner_microstep: 2185.56 | bwd_allreduce_microstep: 203.47 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13253
total_samples=11441, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:43,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.65 | bwd_microstep: 2042.15 | bwd_inner_microstep: 1902.37 | bwd_allreduce_microstep: 139.71 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13719
total_samples=11445, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:46,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 04:02:46,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.18 | bwd_microstep: 2034.68 | bwd_inner_microstep: 1850.19 | bwd_allreduce_microstep: 184.44 | step_microstep: 110.96
[2025-08-03 04:02:46,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.21 | bwd: 8228.64 | bwd_inner: 7648.14 | bwd_allreduce: 580.27 | step: 111.46
{'loss': 0.764, 'learning_rate': 1.4322155583200577e-05, 'epoch': 0.38}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13437
total_samples=11450, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:48,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.99 | bwd_microstep: 1715.55 | bwd_inner_microstep: 1639.78 | bwd_allreduce_microstep: 75.70 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13139
total_samples=11454, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:51,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.29 | bwd_microstep: 1800.59 | bwd_inner_microstep: 1690.43 | bwd_allreduce_microstep: 110.09 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11496
total_samples=11457, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:53,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.41 | bwd_microstep: 1842.00 | bwd_inner_microstep: 1588.80 | bwd_allreduce_microstep: 253.14 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13117
total_samples=11461, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:56,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.55
[2025-08-03 04:02:56,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.24 | bwd_microstep: 1921.96 | bwd_inner_microstep: 1809.29 | bwd_allreduce_microstep: 112.59 | step_microstep: 140.22
[2025-08-03 04:02:56,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.87 | bwd: 7280.14 | bwd_inner: 6728.30 | bwd_allreduce: 551.60 | step: 140.65
{'loss': 0.7623, 'learning_rate': 1.4307546855323549e-05, 'epoch': 0.38}
 38%|███▊      | 750/2000 [2:19:16<3:47:24, 10.92s/it]
 38%|███▊      | 751/2000 [2:19:27<3:47:45, 10.94s/it]
 38%|███▊      | 752/2000 [2:19:38<3:44:59, 10.82s/it]
 38%|███▊      | 753/2000 [2:19:49<3:47:18, 10.94s/it]
 38%|███▊      | 754/2000 [2:20:00<3:50:40, 11.11s/it]
 38%|███▊      | 755/2000 [2:20:11<3:47:01, 10.94s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13492
total_samples=11465, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:02:59,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.06 | bwd_microstep: 1732.27 | bwd_inner_microstep: 1678.79 | bwd_allreduce_microstep: 53.41 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13889
total_samples=11469, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:02,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.50 | bwd_microstep: 2298.70 | bwd_inner_microstep: 2121.21 | bwd_allreduce_microstep: 177.43 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12633
total_samples=11472, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:04,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.09 | bwd_microstep: 1832.54 | bwd_inner_microstep: 1609.61 | bwd_allreduce_microstep: 222.86 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13498
total_samples=11476, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:07,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 04:03:07,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.22 | bwd_microstep: 1766.46 | bwd_inner_microstep: 1676.21 | bwd_allreduce_microstep: 90.20 | step_microstep: 112.85
[2025-08-03 04:03:07,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.80 | bwd: 7630.03 | bwd_inner: 7085.81 | bwd_allreduce: 543.98 | step: 113.32
{'loss': 0.7583, 'learning_rate': 1.429292683140706e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13795
total_samples=11480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:10,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.79 | bwd_microstep: 2052.93 | bwd_inner_microstep: 2005.58 | bwd_allreduce_microstep: 47.29 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14315
total_samples=11484, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:13,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.12 | bwd_microstep: 1846.25 | bwd_inner_microstep: 1743.48 | bwd_allreduce_microstep: 102.70 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14192
total_samples=11488, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:15,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.73 | bwd_microstep: 1794.66 | bwd_inner_microstep: 1727.13 | bwd_allreduce_microstep: 67.46 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13950
total_samples=11493, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:18,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 04:03:18,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.09 | bwd_microstep: 1745.32 | bwd_inner_microstep: 1705.19 | bwd_allreduce_microstep: 40.07 | step_microstep: 135.84
[2025-08-03 04:03:18,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.66 | bwd: 7439.21 | bwd_inner: 7181.38 | bwd_allreduce: 257.60 | step: 136.16
{'loss': 0.7733, 'learning_rate': 1.4278295549790419e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13658
total_samples=11497, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:21,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.24 | bwd_microstep: 2018.23 | bwd_inner_microstep: 1902.86 | bwd_allreduce_microstep: 115.30 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13580
total_samples=11501, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:23,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.46 | bwd_microstep: 1715.20 | bwd_inner_microstep: 1670.60 | bwd_allreduce_microstep: 44.54 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12580
total_samples=11505, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:26,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.40 | bwd_microstep: 1942.63 | bwd_inner_microstep: 1793.24 | bwd_allreduce_microstep: 149.32 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13447
total_samples=11509, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:29,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 04:03:29,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.06 | bwd_microstep: 1941.11 | bwd_inner_microstep: 1860.90 | bwd_allreduce_microstep: 80.16 | step_microstep: 461.81
[2025-08-03 04:03:29,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.10 | bwd: 7617.22 | bwd_inner: 7227.59 | bwd_allreduce: 389.40 | step: 462.16
{'loss': 0.7747, 'learning_rate': 1.4263653048842461e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13418
total_samples=11513, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:32,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.87 | bwd_microstep: 1768.89 | bwd_inner_microstep: 1690.15 | bwd_allreduce_microstep: 78.67 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12519
total_samples=11517, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:34,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.42 | bwd_microstep: 1795.64 | bwd_inner_microstep: 1612.51 | bwd_allreduce_microstep: 183.08 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12345
total_samples=11520, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:37,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.74 | bwd_microstep: 2050.86 | bwd_inner_microstep: 1840.70 | bwd_allreduce_microstep: 210.10 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12851
total_samples=11524, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:40,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 04:03:40,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.81 | bwd_microstep: 2254.38 | bwd_inner_microstep: 2194.28 | bwd_allreduce_microstep: 60.03 | step_microstep: 115.67
[2025-08-03 04:03:40,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.77 | bwd: 7869.82 | bwd_inner: 7337.63 | bwd_allreduce: 531.96 | step: 115.99
{'loss': 0.772, 'learning_rate': 1.424899936696143e-05, 'epoch': 0.38}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14310
total_samples=11530, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:43,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.98 | bwd_microstep: 1835.20 | bwd_inner_microstep: 1788.39 | bwd_allreduce_microstep: 46.75 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13764
total_samples=11534, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:46,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.79 | bwd_microstep: 2111.42 | bwd_inner_microstep: 2016.18 | bwd_allreduce_microstep: 95.18 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13734
total_samples=11538, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:48,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.46 | bwd_microstep: 1843.18 | bwd_inner_microstep: 1724.96 | bwd_allreduce_microstep: 118.16 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13030
total_samples=11542, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:51,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 04:03:51,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.99 | bwd_microstep: 1835.71 | bwd_inner_microstep: 1705.22 | bwd_allreduce_microstep: 130.42 | step_microstep: 160.47
[2025-08-03 04:03:51,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2877.15 | bwd: 7625.56 | bwd_inner: 7234.75 | bwd_allreduce: 390.58 | step: 160.81
{'loss': 0.7752, 'learning_rate': 1.4234334542574906e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13288
total_samples=11546, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:54,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.19 | bwd_microstep: 2267.22 | bwd_inner_microstep: 1950.43 | bwd_allreduce_microstep: 316.73 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14329
total_samples=11551, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:57,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.95 | bwd_microstep: 1811.86 | bwd_inner_microstep: 1721.34 | bwd_allreduce_microstep: 90.45 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13601
total_samples=11555, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:03:59,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.36 | bwd_microstep: 1802.29 | bwd_inner_microstep: 1717.29 | bwd_allreduce_microstep: 84.93 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13339
total_samples=11559, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:02,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 04:04:02,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.36 | bwd_microstep: 1845.35 | bwd_inner_microstep: 1722.78 | bwd_allreduce_microstep: 122.51 | step_microstep: 133.54
[2025-08-03 04:04:02,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.78 | bwd: 7726.76 | bwd_inner: 7111.83 | bwd_allreduce: 614.70 | step: 134.01
/2000 [2:20:11<3:47:01, 10.94s/it]
38%|███▊      | 756/2000 [2:20:22<3:46:33, 10.93s/it]
38%|███▊      | 757/2000 [2:20:33<3:45:14, 10.87s/it]
38%|███▊      | 758/2000 [2:20:44<3:47:19, 10.98s/it]
38%|███▊      | 759/2000 [2:20:55<3:48:06, 11.03s/it]
38%|███▊      | 760/2000 [2:21:06<3:47:38, 11.01s/it]
38%|███▊      | 761/2000 [2:21:17<3:47:26, 11.01s/it]
{'loss': 0.7648, 'learning_rate': 1.4219658614139674e-05, 'epoch': 0.38}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11568
total_samples=11563, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:05,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.02 | bwd_microstep: 1773.73 | bwd_inner_microstep: 1531.72 | bwd_allreduce_microstep: 241.94 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13642
total_samples=11567, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:08,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.03 | bwd_microstep: 2035.16 | bwd_inner_microstep: 1989.59 | bwd_allreduce_microstep: 45.49 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13016
total_samples=11571, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:10,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.30 | bwd_microstep: 1749.00 | bwd_inner_microstep: 1639.07 | bwd_allreduce_microstep: 109.86 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14575
total_samples=11576, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:13,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.80
[2025-08-03 04:04:13,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.76 | bwd_microstep: 1860.26 | bwd_inner_microstep: 1777.24 | bwd_allreduce_microstep: 82.95 | step_microstep: 131.70
[2025-08-03 04:04:13,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.03 | bwd: 7418.20 | bwd_inner: 6937.62 | bwd_allreduce: 480.33 | step: 132.04
{'loss': 0.7694, 'learning_rate': 1.4204971620141648e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13274
total_samples=11580, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:16,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.31 | bwd_microstep: 2186.37 | bwd_inner_microstep: 1860.72 | bwd_allreduce_microstep: 325.58 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15449
total_samples=11584, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:18,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.99 | bwd_microstep: 1756.73 | bwd_inner_microstep: 1725.48 | bwd_allreduce_microstep: 31.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13374
total_samples=11588, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:21,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.39 | bwd_microstep: 1815.39 | bwd_inner_microstep: 1706.58 | bwd_allreduce_microstep: 108.75 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13148
total_samples=11592, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:24,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 04:04:24,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.92 | bwd_microstep: 1924.06 | bwd_inner_microstep: 1851.23 | bwd_allreduce_microstep: 72.77 | step_microstep: 136.83
[2025-08-03 04:04:24,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.54 | bwd: 7682.59 | bwd_inner: 7144.00 | bwd_allreduce: 538.36 | step: 137.16
{'loss': 0.7631, 'learning_rate': 1.4190273599095761e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13377
total_samples=11596, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:26,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.80 | bwd_microstep: 1724.68 | bwd_inner_microstep: 1665.58 | bwd_allreduce_microstep: 59.04 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13245
total_samples=11601, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:29,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 756.67 | bwd_microstep: 1844.58 | bwd_inner_microstep: 1672.38 | bwd_allreduce_microstep: 172.14 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13696
total_samples=11605, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:32,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.25 | bwd_microstep: 1910.39 | bwd_inner_microstep: 1842.57 | bwd_allreduce_microstep: 67.76 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13258
total_samples=11610, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:35,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 04:04:35,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.20 | bwd_microstep: 2222.54 | bwd_inner_microstep: 2216.40 | bwd_allreduce_microstep: 6.08 | step_microstep: 123.72
[2025-08-03 04:04:35,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.86 | bwd: 7702.23 | bwd_inner: 7396.93 | bwd_allreduce: 305.08 | step: 124.16
{'loss': 0.7625, 'learning_rate': 1.4175564589545853e-05, 'epoch': 0.38}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14164
total_samples=11615, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:37,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.57 | bwd_microstep: 1723.80 | bwd_inner_microstep: 1673.75 | bwd_allreduce_microstep: 49.99 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13567
total_samples=11619, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:40,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.87 | bwd_microstep: 1798.61 | bwd_inner_microstep: 1719.06 | bwd_allreduce_microstep: 79.48 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14636
total_samples=11623, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:43,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.57 | bwd_microstep: 2411.69 | bwd_inner_microstep: 2216.14 | bwd_allreduce_microstep: 195.49 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13110
total_samples=11627, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:46,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 04:04:46,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.36 | bwd_microstep: 2253.32 | bwd_inner_microstep: 2039.51 | bwd_allreduce_microstep: 213.74 | step_microstep: 112.72
[2025-08-03 04:04:46,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.31 | bwd: 8187.47 | bwd_inner: 7648.45 | bwd_allreduce: 538.78 | step: 113.05
{'loss': 0.7714, 'learning_rate': 1.4160844630064596e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13791
total_samples=11631, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:49,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.43 | bwd_microstep: 1826.34 | bwd_inner_microstep: 1785.99 | bwd_allreduce_microstep: 40.29 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13134
total_samples=11635, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:52,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.68 | bwd_microstep: 1876.82 | bwd_inner_microstep: 1703.52 | bwd_allreduce_microstep: 173.23 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13660
total_samples=11640, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:54,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.49 | bwd_microstep: 1996.10 | bwd_inner_microstep: 1936.47 | bwd_allreduce_microstep: 59.57 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12941
total_samples=11644, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:04:57,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 04:04:57,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.33 | bwd_microstep: 1808.99 | bwd_inner_microstep: 1667.58 | bwd_allreduce_microstep: 141.35 | step_microstep: 133.76
[2025-08-03 04:04:57,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.88 | bwd: 7508.30 | bwd_inner: 7093.56 | bwd_allreduce: 414.52 | step: 134.19
{'loss': 0.7746, 'learning_rate': 1.4146113759253362e-05, 'epoch': 0.38}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14320
total_samples=11649, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:00,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.17 | bwd_microstep: 2238.23 | bwd_inner_microstep: 1908.22 | bwd_allreduce_microstep: 329.95 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13272
total_samples=11653, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:03,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.05 | bwd_microstep: 1827.40 | bwd_inner_microstep: 1663.04 | bwd_allreduce_microstep: 164.29 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12841
total_samples=11657, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:05,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.54 | bwd_microstep: 1923.13 | bwd_inner_microstep: 1640.71 | bwd_allreduce_microstep: 282.35 | step_microstep: 0.27
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13590
total_samples=11661, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:08,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 04:05:08,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.01 | bwd_microstep: 1992.47 | bwd_inner_microstep: 1687.47 | bwd_allreduce_microstep: 304.94 | step_microstep: 141.34
[2025-08-03 04:05:08,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.71 | bwd: 7981.28 | bwd_inner: 6899.43 | bwd_allreduce: 1081.62 | step: 141.85
38%|███▊      | 762/2000 [2:21:28<3:45:20, 10.92s/it]
38%|███▊      | 763/2000 [2:21:39<3:45:35, 10.94s/it]
38%|███▊      | 764/2000 [2:21:50<3:45:39, 10.95s/it]
38%|███▊      | 765/2000 [2:22:01<3:48:15, 11.09s/it]
38%|███▊      | 766/2000 [2:22:12<3:46:08, 11.00s/it]
{'loss': 0.7566, 'learning_rate': 1.4131372015742141e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13696
total_samples=11665, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:11,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.25 | bwd_microstep: 2054.78 | bwd_inner_microstep: 1707.78 | bwd_allreduce_microstep: 346.94 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12636
total_samples=11669, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:14,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.96 | bwd_microstep: 1767.09 | bwd_inner_microstep: 1603.08 | bwd_allreduce_microstep: 163.95 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11857
total_samples=11672, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:16,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.30 | bwd_microstep: 1812.85 | bwd_inner_microstep: 1551.74 | bwd_allreduce_microstep: 261.05 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13344
total_samples=11676, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:19,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 04:05:19,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.48 | bwd_microstep: 1720.50 | bwd_inner_microstep: 1670.30 | bwd_allreduce_microstep: 50.15 | step_microstep: 110.70
[2025-08-03 04:05:19,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.92 | bwd: 7355.28 | bwd_inner: 6532.88 | bwd_allreduce: 822.17 | step: 111.03
{'loss': 0.7615, 'learning_rate': 1.411661943818944e-05, 'epoch': 0.38}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11753
total_samples=11679, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:21,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.08 | bwd_microstep: 1811.55 | bwd_inner_microstep: 1576.57 | bwd_allreduce_microstep: 234.91 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11863
total_samples=11682, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:24,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.30 | bwd_microstep: 1749.21 | bwd_inner_microstep: 1549.03 | bwd_allreduce_microstep: 200.12 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13209
total_samples=11686, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:27,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.40 | bwd_microstep: 1878.73 | bwd_inner_microstep: 1695.48 | bwd_allreduce_microstep: 183.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13494
total_samples=11690, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:30,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 04:05:30,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.95 | bwd_microstep: 2178.08 | bwd_inner_microstep: 1881.27 | bwd_allreduce_microstep: 296.75 | step_microstep: 123.78
[2025-08-03 04:05:30,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.66 | bwd: 7617.63 | bwd_inner: 6702.34 | bwd_allreduce: 915.05 | step: 124.09
{'loss': 0.7673, 'learning_rate': 1.4101856065282174e-05, 'epoch': 0.38}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14271
total_samples=11694, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:32,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.07 | bwd_microstep: 1749.39 | bwd_inner_microstep: 1703.38 | bwd_allreduce_microstep: 45.95 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13595
total_samples=11698, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:35,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.60 | bwd_microstep: 1713.99 | bwd_inner_microstep: 1666.35 | bwd_allreduce_microstep: 47.58 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12189
total_samples=11701, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:37,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.07 | bwd_microstep: 1787.50 | bwd_inner_microstep: 1579.94 | bwd_allreduce_microstep: 207.50 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13873
total_samples=11705, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:40,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 04:05:40,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.71 | bwd_microstep: 1829.59 | bwd_inner_microstep: 1706.02 | bwd_allreduce_microstep: 123.50 | step_microstep: 141.08
[2025-08-03 04:05:40,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2824.37 | bwd: 7080.53 | bwd_inner: 6655.66 | bwd_allreduce: 424.60 | step: 141.41
{'loss': 0.7605, 'learning_rate': 1.4087081935735565e-05, 'epoch': 0.39}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11586
total_samples=11708, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:43,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.96 | bwd_microstep: 1910.20 | bwd_inner_microstep: 1701.64 | bwd_allreduce_microstep: 208.49 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11642
total_samples=11711, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:45,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.64 | bwd_microstep: 1797.75 | bwd_inner_microstep: 1565.39 | bwd_allreduce_microstep: 232.30 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12081
total_samples=11714, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:48,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.14 | bwd_microstep: 1945.95 | bwd_inner_microstep: 1569.86 | bwd_allreduce_microstep: 376.03 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13543
total_samples=11718, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:51,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 04:05:51,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.99 | bwd_microstep: 2151.39 | bwd_inner_microstep: 2018.80 | bwd_allreduce_microstep: 132.53 | step_microstep: 122.67
[2025-08-03 04:05:51,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2885.67 | bwd: 7805.34 | bwd_inner: 6855.68 | bwd_allreduce: 949.44 | step: 123.00
{'loss': 0.7571, 'learning_rate': 1.4072297088293043e-05, 'epoch': 0.39}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11810
total_samples=11721, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:54,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.93 | bwd_microstep: 1783.88 | bwd_inner_microstep: 1546.35 | bwd_allreduce_microstep: 237.45 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12059
total_samples=11724, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:05:57,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.96 | bwd_microstep: 2149.02 | bwd_inner_microstep: 1940.43 | bwd_allreduce_microstep: 208.53 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11656
total_samples=11727, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:00,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.45 | bwd_microstep: 2096.72 | bwd_inner_microstep: 1965.93 | bwd_allreduce_microstep: 130.72 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13035
total_samples=11731, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:02,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 04:06:02,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.45 | bwd_microstep: 1795.22 | bwd_inner_microstep: 1685.06 | bwd_allreduce_microstep: 110.10 | step_microstep: 120.65
[2025-08-03 04:06:02,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.73 | bwd: 7824.88 | bwd_inner: 7137.75 | bwd_allreduce: 686.88 | step: 121.06
{'loss': 0.7746, 'learning_rate': 1.4057501561726157e-05, 'epoch': 0.39}
38%|███▊      | 767/2000 [2:22:23<3:47:22, 11.06s/it]
38%|███▊      | 768/2000 [2:22:34<3:44:03, 10.91s/it]
38%|███▊      | 769/2000 [2:22:45<3:43:34, 10.90s/it]
38%|███▊      | 770/2000 [2:22:55<3:40:05, 10.74s/it]
39%|███▊      | 771/2000 [2:23:06<3:42:20, 10.85s/it]
39%|███▊      | 772/2000 [2:23:17<3:43:23, 10.91s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13499
total_samples=11735, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:05,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.64 | bwd_microstep: 1741.69 | bwd_inner_microstep: 1683.23 | bwd_allreduce_microstep: 58.39 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13686
total_samples=11739, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:08,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.92 | bwd_microstep: 2017.58 | bwd_inner_microstep: 1885.01 | bwd_allreduce_microstep: 132.51 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11953
total_samples=11742, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:10,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.68 | bwd_microstep: 2106.44 | bwd_inner_microstep: 1924.78 | bwd_allreduce_microstep: 181.59 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14804
total_samples=11746, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:14,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 04:06:14,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.62 | bwd_microstep: 1850.83 | bwd_inner_microstep: 1764.46 | bwd_allreduce_microstep: 86.31 | step_microstep: 448.06
[2025-08-03 04:06:14,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.75 | bwd: 7716.58 | bwd_inner: 7257.48 | bwd_allreduce: 458.88 | step: 448.39
{'loss': 0.7599, 'learning_rate': 1.4042695394834435e-05, 'epoch': 0.39}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12494
total_samples=11749, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:16,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.33 | bwd_microstep: 1983.26 | bwd_inner_microstep: 1762.60 | bwd_allreduce_microstep: 220.59 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13151
total_samples=11753, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:19,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.81 | bwd_microstep: 1815.11 | bwd_inner_microstep: 1703.60 | bwd_allreduce_microstep: 111.44 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11965
total_samples=11756, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:21,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.09 | bwd_microstep: 1728.51 | bwd_inner_microstep: 1533.74 | bwd_allreduce_microstep: 194.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13240
total_samples=11760, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:24,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 04:06:24,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.88 | bwd_microstep: 1861.18 | bwd_inner_microstep: 1808.68 | bwd_allreduce_microstep: 52.44 | step_microstep: 111.49
[2025-08-03 04:06:24,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.03 | bwd: 7388.11 | bwd_inner: 6808.62 | bwd_allreduce: 579.26 | step: 111.83
{'loss': 0.7682, 'learning_rate': 1.4027878626445339e-05, 'epoch': 0.39}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11532
total_samples=11763, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:27,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.55 | bwd_microstep: 1841.04 | bwd_inner_microstep: 1589.41 | bwd_allreduce_microstep: 251.56 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11611
total_samples=11766, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:30,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 752.24 | bwd_microstep: 2737.66 | bwd_inner_microstep: 2500.75 | bwd_allreduce_microstep: 236.86 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11763
total_samples=11769, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:33,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.80 | bwd_microstep: 2163.04 | bwd_inner_microstep: 1697.03 | bwd_allreduce_microstep: 465.91 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13194
total_samples=11773, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:37,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.75
[2025-08-03 04:06:37,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1471.47 | bwd_microstep: 2000.58 | bwd_inner_microstep: 1893.89 | bwd_allreduce_microstep: 106.63 | step_microstep: 121.64
[2025-08-03 04:06:37,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3654.99 | bwd: 8742.36 | bwd_inner: 7681.09 | bwd_allreduce: 1061.02 | step: 122.04
{'loss': 0.7638, 'learning_rate': 1.4013051295414108e-05, 'epoch': 0.39}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13903
total_samples=11777, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:40,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.76 | bwd_microstep: 1844.87 | bwd_inner_microstep: 1746.03 | bwd_allreduce_microstep: 98.77 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13714
total_samples=11781, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:42,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.76 | bwd_microstep: 1818.74 | bwd_inner_microstep: 1692.48 | bwd_allreduce_microstep: 126.20 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11762
total_samples=11784, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:45,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.06 | bwd_microstep: 2013.43 | bwd_inner_microstep: 1798.51 | bwd_allreduce_microstep: 214.85 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12497
total_samples=11788, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:48,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 04:06:48,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.03 | bwd_microstep: 2055.86 | bwd_inner_microstep: 1600.61 | bwd_allreduce_microstep: 455.15 | step_microstep: 132.76
[2025-08-03 04:06:48,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.54 | bwd: 7732.95 | bwd_inner: 6837.64 | bwd_allreduce: 895.04 | step: 133.22
{'loss': 0.7619, 'learning_rate': 1.3998213440623691e-05, 'epoch': 0.39}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12716
total_samples=11792, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:51,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.10 | bwd_microstep: 1739.85 | bwd_inner_microstep: 1605.60 | bwd_allreduce_microstep: 134.18 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13604
total_samples=11796, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:53,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.52 | bwd_microstep: 1789.46 | bwd_inner_microstep: 1738.43 | bwd_allreduce_microstep: 50.96 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14036
total_samples=11800, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:56,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.08 | bwd_microstep: 1796.01 | bwd_inner_microstep: 1732.96 | bwd_allreduce_microstep: 62.98 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12039
total_samples=11803, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:06:58,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.77
[2025-08-03 04:06:58,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.34 | bwd_microstep: 1750.84 | bwd_inner_microstep: 1564.81 | bwd_allreduce_microstep: 185.97 | step_microstep: 121.84
[2025-08-03 04:06:58,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.96 | bwd: 7076.20 | bwd_inner: 6641.80 | bwd_allreduce: 434.16 | step: 122.17
{'loss': 0.7574, 'learning_rate': 1.3983365100984633e-05, 'epoch': 0.39}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=11807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:01,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.83 | bwd_microstep: 1789.77 | bwd_inner_microstep: 1694.23 | bwd_allreduce_microstep: 95.47 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11790
total_samples=11810, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:04,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.86 | bwd_microstep: 1927.85 | bwd_inner_microstep: 1643.74 | bwd_allreduce_microstep: 284.04 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13642
total_samples=11814, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:06,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.08 | bwd_microstep: 1744.27 | bwd_inner_microstep: 1684.82 | bwd_allreduce_microstep: 59.39 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13060
total_samples=11818, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:09,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 04:07:09,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.35 | bwd_microstep: 1966.80 | bwd_inner_microstep: 1941.24 | bwd_allreduce_microstep: 25.49 | step_microstep: 130.07
[2025-08-03 04:07:09,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.05 | bwd: 7428.74 | bwd_inner: 6964.03 | bwd_allreduce: 464.48 | step: 130.41
39%|███▊      | 773/2000 [2:23:28<3:45:35, 11.03s/it]
39%|███▊      | 774/2000 [2:23:39<3:43:07, 10.92s/it]
39%|███▉      | 775/2000 [2:23:52<3:54:34, 11.49s/it]
39%|███▉      | 776/2000 [2:24:03<3:51:24, 11.34s/it]
39%|███▉      | 777/2000 [2:24:13<3:44:58, 11.04s/it]
{'loss': 0.7553, 'learning_rate': 1.3968506315434973e-05, 'epoch': 0.39}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11751
total_samples=11821, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:12,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.38 | bwd_microstep: 1997.33 | bwd_inner_microstep: 1788.73 | bwd_allreduce_microstep: 208.54 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13526
total_samples=11825, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:14,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.74 | bwd_microstep: 1745.62 | bwd_inner_microstep: 1675.25 | bwd_allreduce_microstep: 70.30 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13720
total_samples=11829, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:17,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.39 | bwd_microstep: 1973.59 | bwd_inner_microstep: 1903.31 | bwd_allreduce_microstep: 70.22 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13593
total_samples=11833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:20,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 04:07:20,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.59 | bwd_microstep: 2184.30 | bwd_inner_microstep: 2056.10 | bwd_allreduce_microstep: 128.13 | step_microstep: 124.49
[2025-08-03 04:07:20,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.03 | bwd: 7900.88 | bwd_inner: 7423.39 | bwd_allreduce: 477.26 | step: 124.93
{'loss': 0.7603, 'learning_rate': 1.3953637122940147e-05, 'epoch': 0.39}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13222
total_samples=11837, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:23,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.70 | bwd_microstep: 1972.20 | bwd_inner_microstep: 1866.18 | bwd_allreduce_microstep: 105.95 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13023
total_samples=11841, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:25,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.00 | bwd_microstep: 1728.40 | bwd_inner_microstep: 1650.27 | bwd_allreduce_microstep: 78.06 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12746
total_samples=11845, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:28,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.42 | bwd_microstep: 1727.08 | bwd_inner_microstep: 1613.11 | bwd_allreduce_microstep: 113.91 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13334
total_samples=11849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:32,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 04:07:32,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.96 | bwd_microstep: 2900.58 | bwd_inner_microstep: 2828.39 | bwd_allreduce_microstep: 72.13 | step_microstep: 129.17
[2025-08-03 04:07:32,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.01 | bwd: 8328.31 | bwd_inner: 7957.95 | bwd_allreduce: 370.12 | step: 129.63
{'loss': 0.7654, 'learning_rate': 1.3938757562492873e-05, 'epoch': 0.39}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14491
total_samples=11853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:35,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.03 | bwd_microstep: 2056.46 | bwd_inner_microstep: 1770.34 | bwd_allreduce_microstep: 286.04 | step_microstep: 0.15
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12757
total_samples=11857, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:37,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.89 | bwd_microstep: 1963.20 | bwd_inner_microstep: 1793.94 | bwd_allreduce_microstep: 169.19 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12288
total_samples=11860, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:40,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.29 | bwd_microstep: 1839.32 | bwd_inner_microstep: 1601.09 | bwd_allreduce_microstep: 238.16 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14808
total_samples=11865, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:43,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69
[2025-08-03 04:07:43,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.33 | bwd_microstep: 1709.27 | bwd_inner_microstep: 1700.86 | bwd_allreduce_microstep: 8.34 | step_microstep: 159.57
[2025-08-03 04:07:43,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.47 | bwd: 7568.29 | bwd_inner: 6866.23 | bwd_allreduce: 701.82 | step: 160.08
{'loss': 0.7601, 'learning_rate': 1.3923867673113067e-05, 'epoch': 0.39}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12718
total_samples=11869, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:45,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.12 | bwd_microstep: 2086.77 | bwd_inner_microstep: 1585.35 | bwd_allreduce_microstep: 501.36 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14585
total_samples=11873, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:48,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.15 | bwd_microstep: 2010.77 | bwd_inner_microstep: 1864.79 | bwd_allreduce_microstep: 145.90 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12963
total_samples=11876, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:51,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.47 | bwd_microstep: 1791.66 | bwd_inner_microstep: 1630.75 | bwd_allreduce_microstep: 160.84 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13260
total_samples=11880, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:53,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 04:07:53,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.49 | bwd_microstep: 1707.81 | bwd_inner_microstep: 1660.49 | bwd_allreduce_microstep: 47.25 | step_microstep: 142.40
[2025-08-03 04:07:53,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.16 | bwd: 7597.07 | bwd_inner: 6741.39 | bwd_allreduce: 855.44 | step: 142.80
{'loss': 0.7684, 'learning_rate': 1.390896749384773e-05, 'epoch': 0.39}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12667
total_samples=11884, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:56,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.54 | bwd_microstep: 1774.77 | bwd_inner_microstep: 1600.69 | bwd_allreduce_microstep: 174.01 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13323
total_samples=11888, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:07:58,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.95 | bwd_microstep: 1718.96 | bwd_inner_microstep: 1663.33 | bwd_allreduce_microstep: 55.56 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12040
total_samples=11891, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:01,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.06 | bwd_microstep: 1826.99 | bwd_inner_microstep: 1691.62 | bwd_allreduce_microstep: 135.31 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12945
total_samples=11895, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:04,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 04:08:04,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.33 | bwd_microstep: 1846.47 | bwd_inner_microstep: 1761.52 | bwd_allreduce_microstep: 84.88 | step_microstep: 124.41
[2025-08-03 04:08:04,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2747.81 | bwd: 7167.23 | bwd_inner: 6717.16 | bwd_allreduce: 449.84 | step: 124.76
{'loss': 0.7561, 'learning_rate': 1.3894057063770841e-05, 'epoch': 0.39}
39%|███▉      | 778/2000 [2:24:24<3:42:30, 10.93s/it]
39%|███▉      | 779/2000 [2:24:35<3:43:55, 11.00s/it]
39%|███▉      | 780/2000 [2:24:47<3:47:08, 11.17s/it]
39%|███▉      | 781/2000 [2:24:57<3:45:00, 11.07s/it]
39%|███▉      | 782/2000 [2:25:08<3:43:23, 11.00s/it]
39%|███▉      | 783/2000 [2:25:19<3:39:20, 10.81s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12879
total_samples=11898, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:07,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 752.36 | bwd_microstep: 1949.13 | bwd_inner_microstep: 1779.58 | bwd_allreduce_microstep: 169.48 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13204
total_samples=11902, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:09,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.83 | bwd_microstep: 1828.47 | bwd_inner_microstep: 1688.15 | bwd_allreduce_microstep: 140.25 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13837
total_samples=11906, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:12,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.00 | bwd_microstep: 1752.00 | bwd_inner_microstep: 1703.82 | bwd_allreduce_microstep: 48.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14042
total_samples=11910, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:14,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 04:08:14,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.95 | bwd_microstep: 1757.95 | bwd_inner_microstep: 1722.74 | bwd_allreduce_microstep: 35.15 | step_microstep: 132.10
[2025-08-03 04:08:14,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.06 | bwd: 7287.59 | bwd_inner: 6894.29 | bwd_allreduce: 393.08 | step: 132.54
{'loss': 0.76, 'learning_rate': 1.3879136421983265e-05, 'epoch': 0.39}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13998
total_samples=11914, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:17,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.63 | bwd_microstep: 2054.59 | bwd_inner_microstep: 1774.33 | bwd_allreduce_microstep: 280.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14999
total_samples=11919, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:20,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.03 | bwd_microstep: 1808.57 | bwd_inner_microstep: 1746.99 | bwd_allreduce_microstep: 61.51 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13618
total_samples=11923, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:23,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.82 | bwd_microstep: 2037.15 | bwd_inner_microstep: 2031.29 | bwd_allreduce_microstep: 5.80 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13717
total_samples=11927, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:25,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 04:08:25,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.92 | bwd_microstep: 1757.28 | bwd_inner_microstep: 1684.32 | bwd_allreduce_microstep: 72.89 | step_microstep: 154.25
[2025-08-03 04:08:25,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.34 | bwd: 7657.64 | bwd_inner: 7236.92 | bwd_allreduce: 420.49 | step: 154.63
{'loss': 0.7706, 'learning_rate': 1.3864205607612648e-05, 'epoch': 0.39}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13726
total_samples=11931, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:28,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.16 | bwd_microstep: 1808.19 | bwd_inner_microstep: 1724.47 | bwd_allreduce_microstep: 83.65 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11711
total_samples=11934, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:31,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.38 | bwd_microstep: 2094.75 | bwd_inner_microstep: 1811.00 | bwd_allreduce_microstep: 283.68 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14128
total_samples=11939, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:34,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.50 | bwd_microstep: 2092.27 | bwd_inner_microstep: 1758.60 | bwd_allreduce_microstep: 333.61 | step_microstep: 0.11
dynamic ViT batch size: 7, images per sample: 7.0, dynamic token length: 16384
total_samples=11940, num_samples=1, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:37,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 04:08:37,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.90 | bwd_microstep: 1945.20 | bwd_inner_microstep: 1920.79 | bwd_allreduce_microstep: 24.31 | step_microstep: 145.65
[2025-08-03 04:08:37,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.89 | bwd: 7940.46 | bwd_inner: 7214.88 | bwd_allreduce: 725.31 | step: 146.02
{'loss': 0.7626, 'learning_rate': 1.3849264659813314e-05, 'epoch': 0.39}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12915
total_samples=11944, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:39,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.38 | bwd_microstep: 2043.84 | bwd_inner_microstep: 1859.13 | bwd_allreduce_microstep: 184.65 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13153
total_samples=11948, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:42,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.26 | bwd_microstep: 1765.89 | bwd_inner_microstep: 1683.43 | bwd_allreduce_microstep: 82.40 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13425
total_samples=11953, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:46,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1745.19 | bwd_microstep: 1945.31 | bwd_inner_microstep: 1742.75 | bwd_allreduce_microstep: 202.50 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13606
total_samples=11957, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:49,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 04:08:49,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.70 | bwd_microstep: 2039.18 | bwd_inner_microstep: 1732.84 | bwd_allreduce_microstep: 306.27 | step_microstep: 135.41
[2025-08-03 04:08:49,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3824.47 | bwd: 7794.27 | bwd_inner: 7018.16 | bwd_allreduce: 775.89 | step: 135.71
{'loss': 0.7544, 'learning_rate': 1.3834313617766146e-05, 'epoch': 0.39}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11803
total_samples=11960, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:52,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.00 | bwd_microstep: 2290.04 | bwd_inner_microstep: 1993.69 | bwd_allreduce_microstep: 296.30 | step_microstep: 0.09
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14742
total_samples=11964, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:54,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.24 | bwd_microstep: 1790.60 | bwd_inner_microstep: 1739.84 | bwd_allreduce_microstep: 50.70 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12980
total_samples=11968, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:08:57,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.25 | bwd_microstep: 1737.62 | bwd_inner_microstep: 1651.22 | bwd_allreduce_microstep: 86.34 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13793
total_samples=11972, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:00,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 04:09:00,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.80 | bwd_microstep: 1797.19 | bwd_inner_microstep: 1708.44 | bwd_allreduce_microstep: 88.68 | step_microstep: 142.09
[2025-08-03 04:09:00,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.23 | bwd: 7615.51 | bwd_inner: 7093.18 | bwd_allreduce: 522.10 | step: 142.41
{'loss': 0.7535, 'learning_rate': 1.3819352520678519e-05, 'epoch': 0.39}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13583
total_samples=11976, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:03,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.35 | bwd_microstep: 2261.30 | bwd_inner_microstep: 2226.83 | bwd_allreduce_microstep: 34.41 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14040
total_samples=11980, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:06,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.31 | bwd_microstep: 2292.97 | bwd_inner_microstep: 2137.82 | bwd_allreduce_microstep: 155.09 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12823
total_samples=11984, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:08,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.87 | bwd_microstep: 1755.26 | bwd_inner_microstep: 1653.82 | bwd_allreduce_microstep: 101.37 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12289
total_samples=11988, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:11,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 33.26
[2025-08-03 04:09:11,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.52 | bwd_microstep: 1729.52 | bwd_inner_microstep: 1583.83 | bwd_allreduce_microstep: 145.63 | step_microstep: 155.48
[2025-08-03 04:09:11,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.98 | bwd: 8039.09 | bwd_inner: 7602.30 | bwd_allreduce: 436.57 | step: 155.90
39%|███▉      | 784/2000 [2:25:29<3:37:55, 10.75s/it]
39%|███▉      | 785/2000 [2:25:40<3:38:46, 10.80s/it]
39%|███▉      | 786/2000 [2:25:51<3:41:03, 10.93s/it]
39%|███▉      | 787/2000 [2:26:04<3:47:54, 11.27s/it]
39%|███▉      | 788/2000 [2:26:14<3:45:14, 11.15s/it]
39%|███▉      | 789/2000 [2:26:26<3:46:14, 11.21s/it]
{'loss': 0.7615, 'learning_rate': 1.380438140778416e-05, 'epoch': 0.39}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11722
total_samples=11991, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:13,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.68 | bwd_microstep: 1803.36 | bwd_inner_microstep: 1563.86 | bwd_allreduce_microstep: 239.44 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13502
total_samples=11996, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:16,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.11 | bwd_microstep: 2076.89 | bwd_inner_microstep: 1927.67 | bwd_allreduce_microstep: 149.16 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13170
total_samples=12000, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:19,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.14 | bwd_microstep: 2052.93 | bwd_inner_microstep: 1893.84 | bwd_allreduce_microstep: 159.02 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11753
total_samples=12003, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:22,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 04:09:22,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.07 | bwd_microstep: 1869.53 | bwd_inner_microstep: 1549.35 | bwd_allreduce_microstep: 320.11 | step_microstep: 136.14
[2025-08-03 04:09:22,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.93 | bwd: 7802.76 | bwd_inner: 6934.72 | bwd_allreduce: 867.81 | step: 136.48
{'loss': 0.7644, 'learning_rate': 1.378940031834307e-05, 'epoch': 0.4}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13681
total_samples=12007, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:25,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 2051.32 | bwd_inner_microstep: 1727.79 | bwd_allreduce_microstep: 323.46 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15036
total_samples=12011, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:27,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.09 | bwd_microstep: 1737.51 | bwd_inner_microstep: 1731.15 | bwd_allreduce_microstep: 6.29 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13376
total_samples=12015, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:30,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.08 | bwd_microstep: 1727.61 | bwd_inner_microstep: 1658.90 | bwd_allreduce_microstep: 68.65 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13311
total_samples=12019, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:33,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 04:09:33,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.97 | bwd_microstep: 2093.05 | bwd_inner_microstep: 1733.84 | bwd_allreduce_microstep: 359.16 | step_microstep: 111.00
[2025-08-03 04:09:33,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.14 | bwd: 7609.54 | bwd_inner: 6851.68 | bwd_allreduce: 757.63 | step: 111.42
{'loss': 0.7618, 'learning_rate': 1.3774409291641407e-05, 'epoch': 0.4}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12432
total_samples=12022, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:36,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.00 | bwd_microstep: 1967.37 | bwd_inner_microstep: 1751.26 | bwd_allreduce_microstep: 216.04 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13096
total_samples=12027, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:38,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.35 | bwd_microstep: 2175.76 | bwd_inner_microstep: 2169.60 | bwd_allreduce_microstep: 6.10 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12931
total_samples=12031, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:41,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.73 | bwd_microstep: 1764.16 | bwd_inner_microstep: 1655.03 | bwd_allreduce_microstep: 109.07 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11851
total_samples=12034, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:44,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70
[2025-08-03 04:09:44,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.09 | bwd_microstep: 1960.29 | bwd_inner_microstep: 1541.46 | bwd_allreduce_microstep: 418.77 | step_microstep: 172.28
[2025-08-03 04:09:44,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.10 | bwd: 7867.64 | bwd_inner: 7117.35 | bwd_allreduce: 750.05 | step: 172.62
{'loss': 0.7513, 'learning_rate': 1.3759408366991391e-05, 'epoch': 0.4}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14299
total_samples=12038, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:47,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.07 | bwd_microstep: 1772.43 | bwd_inner_microstep: 1732.80 | bwd_allreduce_microstep: 39.56 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13333
total_samples=12042, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:49,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.33 | bwd_microstep: 1741.18 | bwd_inner_microstep: 1677.75 | bwd_allreduce_microstep: 63.36 | step_microstep: 0.29
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 15409
total_samples=12046, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:52,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.31 | bwd_microstep: 2083.10 | bwd_inner_microstep: 1930.34 | bwd_allreduce_microstep: 152.69 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15219
total_samples=12050, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:55,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 04:09:55,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.41 | bwd_microstep: 1761.16 | bwd_inner_microstep: 1712.29 | bwd_allreduce_microstep: 48.80 | step_microstep: 116.23
[2025-08-03 04:09:55,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.05 | bwd: 7357.92 | bwd_inner: 7053.18 | bwd_allreduce: 304.50 | step: 116.73
{'loss': 0.7645, 'learning_rate': 1.3744397583731204e-05, 'epoch': 0.4}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13662
total_samples=12054, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:09:57,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.76 | bwd_microstep: 1869.40 | bwd_inner_microstep: 1813.03 | bwd_allreduce_microstep: 56.29 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13922
total_samples=12059, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:00,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.35 | bwd_microstep: 1857.35 | bwd_inner_microstep: 1739.37 | bwd_allreduce_microstep: 117.92 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12408
total_samples=12062, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:03,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.34 | bwd_microstep: 2149.18 | bwd_inner_microstep: 1795.81 | bwd_allreduce_microstep: 353.31 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11861
total_samples=12065, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:06,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 04:10:06,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.53 | bwd_microstep: 2239.06 | bwd_inner_microstep: 1838.02 | bwd_allreduce_microstep: 400.97 | step_microstep: 115.50
[2025-08-03 04:10:06,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.91 | bwd: 8115.05 | bwd_inner: 7186.23 | bwd_allreduce: 928.57 | step: 116.02
{'loss': 0.763, 'learning_rate': 1.3729376981224869e-05, 'epoch': 0.4}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13938
total_samples=12069, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:09,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.75 | bwd_microstep: 1856.61 | bwd_inner_microstep: 1822.87 | bwd_allreduce_microstep: 33.68 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12773
total_samples=12073, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:11,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 667.63 | bwd_microstep: 1737.85 | bwd_inner_microstep: 1634.58 | bwd_allreduce_microstep: 103.22 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11577
total_samples=12076, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:14,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.92 | bwd_microstep: 1766.92 | bwd_inner_microstep: 1533.54 | bwd_allreduce_microstep: 233.32 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11801
total_samples=12079, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:16,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 04:10:16,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.12 | bwd_microstep: 1814.75 | bwd_inner_microstep: 1581.77 | bwd_allreduce_microstep: 232.91 | step_microstep: 153.00
[2025-08-03 04:10:16,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.35 | bwd: 7176.18 | bwd_inner: 6572.75 | bwd_allreduce: 603.20 | step: 153.42
 40%|███▉      | 790/2000 [2:26:37<3:45:03, 11.16s/it]
 40%|███▉      | 791/2000 [2:26:48<3:42:53, 11.06s/it]
 40%|███▉      | 792/2000 [2:26:59<3:43:28, 11.10s/it]
 40%|███▉      | 793/2000 [2:27:09<3:40:13, 10.95s/it]
 40%|███▉      | 794/2000 [2:27:21<3:42:33, 11.07s/it]
{'loss': 0.7763, 'learning_rate': 1.3714346598862168e-05, 'epoch': 0.4}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13359
total_samples=12083, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:19,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 654.16 | bwd_microstep: 1929.47 | bwd_inner_microstep: 1673.93 | bwd_allreduce_microstep: 255.45 | step_microstep: 0.16
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14280
total_samples=12087, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:22,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.93 | bwd_microstep: 1871.87 | bwd_inner_microstep: 1733.58 | bwd_allreduce_microstep: 138.23 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11575
total_samples=12090, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:24,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.63 | bwd_microstep: 1863.31 | bwd_inner_microstep: 1545.97 | bwd_allreduce_microstep: 317.27 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14448
total_samples=12094, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:27,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 04:10:27,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.55 | bwd_microstep: 1781.10 | bwd_inner_microstep: 1728.98 | bwd_allreduce_microstep: 52.06 | step_microstep: 142.44
[2025-08-03 04:10:27,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.21 | bwd: 7445.81 | bwd_inner: 6682.45 | bwd_allreduce: 763.10 | step: 142.84
{'loss': 0.7656, 'learning_rate': 1.3699306476058523e-05, 'epoch': 0.4}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13304
total_samples=12098, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:30,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.30 | bwd_microstep: 1737.61 | bwd_inner_microstep: 1673.87 | bwd_allreduce_microstep: 63.67 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12072
total_samples=12101, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:32,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.06 | bwd_microstep: 1984.74 | bwd_inner_microstep: 1806.24 | bwd_allreduce_microstep: 178.44 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11902
total_samples=12104, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:35,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.16 | bwd_microstep: 1762.10 | bwd_inner_microstep: 1553.50 | bwd_allreduce_microstep: 208.53 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13632
total_samples=12108, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:38,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 04:10:38,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.13 | bwd_microstep: 1805.80 | bwd_inner_microstep: 1722.44 | bwd_allreduce_microstep: 83.31 | step_microstep: 135.69
[2025-08-03 04:10:38,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.57 | bwd: 7290.30 | bwd_inner: 6756.05 | bwd_allreduce: 534.02 | step: 136.03
{'loss': 0.7549, 'learning_rate': 1.3684256652254906e-05, 'epoch': 0.4}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11794
total_samples=12111, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:41,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.62 | bwd_microstep: 2258.59 | bwd_inner_microstep: 1898.04 | bwd_allreduce_microstep: 360.49 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13375
total_samples=12115, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:43,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.43 | bwd_microstep: 1891.94 | bwd_inner_microstep: 1697.94 | bwd_allreduce_microstep: 193.93 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11741
total_samples=12118, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:46,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.75 | bwd_microstep: 2047.36 | bwd_inner_microstep: 1817.55 | bwd_allreduce_microstep: 229.74 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13312
total_samples=12122, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:49,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 04:10:49,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.45 | bwd_microstep: 1930.39 | bwd_inner_microstep: 1701.54 | bwd_allreduce_microstep: 228.78 | step_microstep: 177.62
[2025-08-03 04:10:49,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.17 | bwd: 8128.32 | bwd_inner: 7115.07 | bwd_allreduce: 1013.02 | step: 177.94
{'loss': 0.7637, 'learning_rate': 1.3669197166917723e-05, 'epoch': 0.4}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11869
total_samples=12125, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:52,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.40 | bwd_microstep: 1946.13 | bwd_inner_microstep: 1727.93 | bwd_allreduce_microstep: 218.13 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13937
total_samples=12129, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:55,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.89 | bwd_microstep: 2097.61 | bwd_inner_microstep: 2065.51 | bwd_allreduce_microstep: 32.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13561
total_samples=12133, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:10:57,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.14 | bwd_microstep: 1912.81 | bwd_inner_microstep: 1731.66 | bwd_allreduce_microstep: 181.09 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14367
total_samples=12137, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:00,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 04:11:00,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.51 | bwd_microstep: 2052.56 | bwd_inner_microstep: 1768.39 | bwd_allreduce_microstep: 284.10 | step_microstep: 134.08
[2025-08-03 04:11:00,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.87 | bwd: 8009.15 | bwd_inner: 7293.47 | bwd_allreduce: 715.43 | step: 134.51
{'loss': 0.7688, 'learning_rate': 1.365412805953872e-05, 'epoch': 0.4}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14107
total_samples=12141, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:03,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.75 | bwd_microstep: 1733.55 | bwd_inner_microstep: 1703.44 | bwd_allreduce_microstep: 30.05 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13258
total_samples=12145, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:06,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.04 | bwd_microstep: 1982.57 | bwd_inner_microstep: 1717.63 | bwd_allreduce_microstep: 264.86 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13489
total_samples=12150, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:08,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.23 | bwd_microstep: 1745.47 | bwd_inner_microstep: 1664.90 | bwd_allreduce_microstep: 80.50 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13935
total_samples=12154, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:11,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 04:11:11,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.55 | bwd_microstep: 1906.49 | bwd_inner_microstep: 1698.16 | bwd_allreduce_microstep: 208.25 | step_microstep: 139.39
[2025-08-03 04:11:11,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.50 | bwd: 7368.12 | bwd_inner: 6784.12 | bwd_allreduce: 583.75 | step: 139.73
{'loss': 0.765, 'learning_rate': 1.3639049369634878e-05, 'epoch': 0.4}
 40%|███▉      | 795/2000 [2:27:31<3:38:28, 10.88s/it]
 40%|███▉      | 796/2000 [2:27:42<3:37:06, 10.82s/it]
 40%|███▉      | 797/2000 [2:27:52<3:35:17, 10.74s/it]
 40%|███▉      | 798/2000 [2:28:04<3:39:17, 10.95s/it]
 40%|███▉      | 799/2000 [2:28:15<3:41:11, 11.05s/it]
 40%|████      | 800/2000 [2:28:26<3:38:29, 10.92s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12908
total_samples=12158, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:14,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.02 | bwd_microstep: 2047.83 | bwd_inner_microstep: 1673.33 | bwd_allreduce_microstep: 374.43 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13326
total_samples=12162, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:16,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.43 | bwd_microstep: 1883.49 | bwd_inner_microstep: 1698.52 | bwd_allreduce_microstep: 184.90 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13262
total_samples=12166, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:20,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.60 | bwd_microstep: 2364.69 | bwd_inner_microstep: 1959.61 | bwd_allreduce_microstep: 404.96 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12966
total_samples=12170, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:22,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 04:11:22,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.81 | bwd_microstep: 1873.21 | bwd_inner_microstep: 1664.29 | bwd_allreduce_microstep: 208.86 | step_microstep: 114.72
[2025-08-03 04:11:22,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.80 | bwd: 8169.26 | bwd_inner: 6995.78 | bwd_allreduce: 1173.21 | step: 115.16
{'loss': 0.7519, 'learning_rate': 1.3623961136748296e-05, 'epoch': 0.4}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14167
total_samples=12174, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:25,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.72 | bwd_microstep: 1995.30 | bwd_inner_microstep: 1868.81 | bwd_allreduce_microstep: 126.42 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16070
total_samples=12178, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:28,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.99 | bwd_microstep: 1804.53 | bwd_inner_microstep: 1798.48 | bwd_allreduce_microstep: 5.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13666
total_samples=12182, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:30,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.79 | bwd_microstep: 1814.39 | bwd_inner_microstep: 1720.83 | bwd_allreduce_microstep: 93.50 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11672
total_samples=12185, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:33,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.62
[2025-08-03 04:11:33,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.62 | bwd_microstep: 2125.67 | bwd_inner_microstep: 1760.51 | bwd_allreduce_microstep: 365.08 | step_microstep: 113.54
[2025-08-03 04:11:33,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.06 | bwd: 7739.94 | bwd_inner: 7148.63 | bwd_allreduce: 591.08 | step: 113.87
{'loss': 0.7637, 'learning_rate': 1.3608863400446113e-05, 'epoch': 0.4}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 15730
total_samples=12189, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:36,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.92 | bwd_microstep: 1941.91 | bwd_inner_microstep: 1740.32 | bwd_allreduce_microstep: 201.53 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14035
total_samples=12194, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:39,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.16 | bwd_microstep: 1833.60 | bwd_inner_microstep: 1745.72 | bwd_allreduce_microstep: 87.81 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13232
total_samples=12198, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:41,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.57 | bwd_microstep: 1817.37 | bwd_inner_microstep: 1707.65 | bwd_allreduce_microstep: 109.66 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13911
total_samples=12202, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:44,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 04:11:44,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.58 | bwd_microstep: 1962.90 | bwd_inner_microstep: 1734.96 | bwd_allreduce_microstep: 227.88 | step_microstep: 146.04
[2025-08-03 04:11:44,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.17 | bwd: 7555.82 | bwd_inner: 6928.64 | bwd_allreduce: 626.95 | step: 146.37
{'loss': 0.7701, 'learning_rate': 1.3593756200320373e-05, 'epoch': 0.4}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13635
total_samples=12206, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:47,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.47 | bwd_microstep: 1848.33 | bwd_inner_microstep: 1802.70 | bwd_allreduce_microstep: 45.56 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12677
total_samples=12210, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:49,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.14 | bwd_microstep: 1767.94 | bwd_inner_microstep: 1646.58 | bwd_allreduce_microstep: 121.30 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14341
total_samples=12214, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:52,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.21 | bwd_microstep: 1695.97 | bwd_inner_microstep: 1688.31 | bwd_allreduce_microstep: 7.60 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11591
total_samples=12217, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:55,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 04:11:55,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.35 | bwd_microstep: 1819.13 | bwd_inner_microstep: 1583.40 | bwd_allreduce_microstep: 235.67 | step_microstep: 154.54
[2025-08-03 04:11:55,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.11 | bwd: 7131.42 | bwd_inner: 6720.99 | bwd_allreduce: 410.20 | step: 154.98
{'loss': 0.7636, 'learning_rate': 1.357863957598796e-05, 'epoch': 0.4}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11777
total_samples=12220, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:11:57,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.13 | bwd_microstep: 2052.30 | bwd_inner_microstep: 1856.12 | bwd_allreduce_microstep: 196.12 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12495
total_samples=12224, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:00,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 902.38 | bwd_microstep: 1784.36 | bwd_inner_microstep: 1647.63 | bwd_allreduce_microstep: 136.67 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11864
total_samples=12227, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:03,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.08 | bwd_microstep: 1727.64 | bwd_inner_microstep: 1613.21 | bwd_allreduce_microstep: 114.37 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11868
total_samples=12230, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:06,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 04:12:06,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.69 | bwd_microstep: 2062.66 | bwd_inner_microstep: 1839.09 | bwd_allreduce_microstep: 223.50 | step_microstep: 116.62
[2025-08-03 04:12:06,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3015.20 | bwd: 7627.01 | bwd_inner: 6956.03 | bwd_allreduce: 670.74 | step: 116.94
{'loss': 0.7599, 'learning_rate': 1.356351356709045e-05, 'epoch': 0.4}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14319
total_samples=12234, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:08,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.07 | bwd_microstep: 1859.15 | bwd_inner_microstep: 1741.16 | bwd_allreduce_microstep: 117.92 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13759
total_samples=12238, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:11,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.79 | bwd_microstep: 2231.66 | bwd_inner_microstep: 2054.89 | bwd_allreduce_microstep: 176.71 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13854
total_samples=12242, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:14,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.74 | bwd_microstep: 2095.36 | bwd_inner_microstep: 1986.60 | bwd_allreduce_microstep: 108.70 | step_microstep: 0.20
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13437
total_samples=12246, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:17,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 04:12:17,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.65 | bwd_microstep: 1751.64 | bwd_inner_microstep: 1659.12 | bwd_allreduce_microstep: 92.46 | step_microstep: 138.71
[2025-08-03 04:12:17,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.18 | bwd: 7937.85 | bwd_inner: 7441.77 | bwd_allreduce: 495.85 | step: 139.13
000 [2:28:26<3:38:29, 10.92s/it]
 40%|████      | 801/2000 [2:28:37<3:41:09, 11.07s/it]
 40%|████      | 802/2000 [2:28:48<3:40:37, 11.05s/it]
 40%|████      | 803/2000 [2:28:59<3:39:09, 10.99s/it]
 40%|████      | 804/2000 [2:29:09<3:35:24, 10.81s/it]
 40%|████      | 805/2000 [2:29:20<3:36:52, 10.89s/it]
 40%|████      | 806/2000 [2:29:32<3:38:31, 10.98s/it]
{'loss': 0.7681, 'learning_rate': 1.3548378213294042e-05, 'epoch': 0.4}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11699
total_samples=12249, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:19,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.24 | bwd_microstep: 1785.88 | bwd_inner_microstep: 1549.23 | bwd_allreduce_microstep: 236.59 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15288
total_samples=12254, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:22,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.66 | bwd_microstep: 1826.70 | bwd_inner_microstep: 1792.65 | bwd_allreduce_microstep: 33.98 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12429
total_samples=12257, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:25,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.57 | bwd_microstep: 1829.50 | bwd_inner_microstep: 1593.23 | bwd_allreduce_microstep: 236.21 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13945
total_samples=12261, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:28,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 04:12:28,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.94 | bwd_microstep: 2096.24 | bwd_inner_microstep: 1940.86 | bwd_allreduce_microstep: 155.32 | step_microstep: 115.03
[2025-08-03 04:12:28,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.33 | bwd: 7538.37 | bwd_inner: 6875.96 | bwd_allreduce: 662.18 | step: 115.38
{'loss': 0.7745, 'learning_rate': 1.3533233554289433e-05, 'epoch': 0.4}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11788
total_samples=12264, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:31,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.61 | bwd_microstep: 2262.61 | bwd_inner_microstep: 1996.73 | bwd_allreduce_microstep: 265.80 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11843
total_samples=12267, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:33,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.98 | bwd_microstep: 1777.65 | bwd_inner_microstep: 1571.49 | bwd_allreduce_microstep: 206.09 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11925
total_samples=12270, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:36,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.45 | bwd_microstep: 2401.57 | bwd_inner_microstep: 1566.04 | bwd_allreduce_microstep: 835.47 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13834
total_samples=12274, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:39,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.55
[2025-08-03 04:12:39,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.05 | bwd_microstep: 1838.46 | bwd_inner_microstep: 1746.95 | bwd_allreduce_microstep: 91.45 | step_microstep: 157.82
[2025-08-03 04:12:39,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.02 | bwd: 8280.34 | bwd_inner: 6881.20 | bwd_allreduce: 1398.90 | step: 158.27
{'loss': 0.7637, 'learning_rate': 1.3518079629791725e-05, 'epoch': 0.4}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11962
total_samples=12277, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:42,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.16 | bwd_microstep: 1734.54 | bwd_inner_microstep: 1536.47 | bwd_allreduce_microstep: 198.01 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12657
total_samples=12280, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:44,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.78 | bwd_microstep: 1731.03 | bwd_inner_microstep: 1581.29 | bwd_allreduce_microstep: 149.67 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12407
total_samples=12284, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:47,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.16 | bwd_microstep: 1750.21 | bwd_inner_microstep: 1581.12 | bwd_allreduce_microstep: 169.02 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14855
total_samples=12288, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:49,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 04:12:49,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.01 | bwd_microstep: 1746.34 | bwd_inner_microstep: 1732.01 | bwd_allreduce_microstep: 14.27 | step_microstep: 140.05
[2025-08-03 04:12:49,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.04 | bwd: 6962.17 | bwd_inner: 6430.89 | bwd_allreduce: 531.05 | step: 140.39
{'loss': 0.7616, 'learning_rate': 1.3502916479540327e-05, 'epoch': 0.4}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13165
total_samples=12292, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:52,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.67 | bwd_microstep: 1816.79 | bwd_inner_microstep: 1691.88 | bwd_allreduce_microstep: 124.84 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13662
total_samples=12296, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:55,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.42 | bwd_microstep: 1944.23 | bwd_inner_microstep: 1717.16 | bwd_allreduce_microstep: 227.00 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13295
total_samples=12300, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:12:58,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.10 | bwd_microstep: 2138.20 | bwd_inner_microstep: 1968.45 | bwd_allreduce_microstep: 169.69 | step_microstep: 0.20
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12994
total_samples=12304, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:01,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.54
[2025-08-03 04:13:01,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.89 | bwd_microstep: 2141.57 | bwd_inner_microstep: 1692.56 | bwd_allreduce_microstep: 448.95 | step_microstep: 112.50
[2025-08-03 04:13:01,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.01 | bwd: 8040.84 | bwd_inner: 7070.05 | bwd_allreduce: 970.57 | step: 112.98
{'loss': 0.7707, 'learning_rate': 1.3487744143298822e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15581
total_samples=12308, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:03,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.95 | bwd_microstep: 1911.74 | bwd_inner_microstep: 1785.66 | bwd_allreduce_microstep: 126.02 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13242
total_samples=12312, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:06,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.97 | bwd_microstep: 1903.32 | bwd_inner_microstep: 1717.59 | bwd_allreduce_microstep: 185.66 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13279
total_samples=12316, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:09,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.55 | bwd_microstep: 1847.25 | bwd_inner_microstep: 1676.19 | bwd_allreduce_microstep: 170.99 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13202
total_samples=12320, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:11,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 04:13:11,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.13 | bwd_microstep: 1739.02 | bwd_inner_microstep: 1668.77 | bwd_allreduce_microstep: 70.19 | step_microstep: 113.40
[2025-08-03 04:13:11,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.53 | bwd: 7401.39 | bwd_inner: 6848.20 | bwd_allreduce: 552.94 | step: 113.76
{'loss': 0.7715, 'learning_rate': 1.3472562660854902e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13788
total_samples=12324, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:14,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.81 | bwd_microstep: 1816.38 | bwd_inner_microstep: 1722.57 | bwd_allreduce_microstep: 93.73 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15406
total_samples=12329, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:17,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.81 | bwd_microstep: 2205.42 | bwd_inner_microstep: 2074.92 | bwd_allreduce_microstep: 130.43 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13962
total_samples=12333, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:20,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.02 | bwd_microstep: 2137.28 | bwd_inner_microstep: 2030.26 | bwd_allreduce_microstep: 106.96 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12877
total_samples=12337, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:23,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.58
[2025-08-03 04:13:23,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.64 | bwd_microstep: 2017.66 | bwd_inner_microstep: 1803.71 | bwd_allreduce_microstep: 213.89 | step_microstep: 135.64
[2025-08-03 04:13:23,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2763.21 | bwd: 8176.77 | bwd_inner: 7631.45 | bwd_allreduce: 545.08 | step: 135.97
 40%|████      | 807/2000 [2:29:42<3:37:06, 10.92s/it]
 40%|████      | 808/2000 [2:29:54<3:40:49, 11.12s/it]
 40%|████      | 809/2000 [2:30:04<3:35:09, 10.84s/it]
 40%|████      | 810/2000 [2:30:15<3:37:32, 10.97s/it]
 41%|████      | 811/2000 [2:30:26<3:35:14, 10.86s/it]
{'loss': 0.7638, 'learning_rate': 1.345737207202023e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13394
total_samples=12341, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:25,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.35 | bwd_microstep: 1793.15 | bwd_inner_microstep: 1703.38 | bwd_allreduce_microstep: 89.70 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13151
total_samples=12345, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:28,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.76 | bwd_microstep: 1764.60 | bwd_inner_microstep: 1695.61 | bwd_allreduce_microstep: 68.93 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13484
total_samples=12349, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:30,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.54 | bwd_microstep: 1766.09 | bwd_inner_microstep: 1705.24 | bwd_allreduce_microstep: 60.79 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11918
total_samples=12352, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:34,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:13:34,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.73 | bwd_microstep: 2387.91 | bwd_inner_microstep: 2169.38 | bwd_allreduce_microstep: 218.46 | step_microstep: 117.17
[2025-08-03 04:13:34,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2748.31 | bwd: 7711.81 | bwd_inner: 7273.61 | bwd_allreduce: 437.97 | step: 117.51
{'loss': 0.7591, 'learning_rate': 1.3442172416630355e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13222
total_samples=12356, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:36,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.54 | bwd_microstep: 1730.45 | bwd_inner_microstep: 1669.17 | bwd_allreduce_microstep: 61.21 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11755
total_samples=12359, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:39,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.39 | bwd_microstep: 1980.21 | bwd_inner_microstep: 1604.97 | bwd_allreduce_microstep: 375.18 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15402
total_samples=12363, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:42,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.47 | bwd_microstep: 1955.32 | bwd_inner_microstep: 1884.01 | bwd_allreduce_microstep: 71.24 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14111
total_samples=12367, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:45,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:13:45,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.66 | bwd_microstep: 2073.18 | bwd_inner_microstep: 1919.36 | bwd_allreduce_microstep: 153.75 | step_microstep: 110.82
[2025-08-03 04:13:45,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.99 | bwd: 7739.20 | bwd_inner: 7077.51 | bwd_allreduce: 661.46 | step: 111.41
{'loss': 0.7581, 'learning_rate': 1.3426963734544601e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13624
total_samples=12371, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:47,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.19 | bwd_microstep: 1750.41 | bwd_inner_microstep: 1693.41 | bwd_allreduce_microstep: 56.94 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13951
total_samples=12375, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:50,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.30 | bwd_microstep: 1982.40 | bwd_inner_microstep: 1731.33 | bwd_allreduce_microstep: 251.01 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13215
total_samples=12379, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:53,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 653.20 | bwd_microstep: 2024.90 | bwd_inner_microstep: 1696.77 | bwd_allreduce_microstep: 328.07 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13092
total_samples=12383, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:55,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 04:13:55,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.65 | bwd_microstep: 1734.30 | bwd_inner_microstep: 1672.37 | bwd_allreduce_microstep: 61.87 | step_microstep: 133.84
[2025-08-03 04:13:55,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2732.27 | bwd: 7492.06 | bwd_inner: 6793.87 | bwd_allreduce: 697.96 | step: 134.23
{'loss': 0.7608, 'learning_rate': 1.3411746065645961e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13417
total_samples=12387, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:13:58,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.09 | bwd_microstep: 1784.05 | bwd_inner_microstep: 1702.75 | bwd_allreduce_microstep: 81.23 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11827
total_samples=12390, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:00,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.87 | bwd_microstep: 1732.28 | bwd_inner_microstep: 1548.87 | bwd_allreduce_microstep: 183.34 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13293
total_samples=12395, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:03,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.01 | bwd_microstep: 1801.46 | bwd_inner_microstep: 1688.72 | bwd_allreduce_microstep: 112.68 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12836
total_samples=12399, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:06,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 04:14:06,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.45 | bwd_microstep: 2167.14 | bwd_inner_microstep: 2033.03 | bwd_allreduce_microstep: 134.05 | step_microstep: 113.10
[2025-08-03 04:14:06,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2744.35 | bwd: 7484.97 | bwd_inner: 6973.37 | bwd_allreduce: 511.37 | step: 113.46
{'loss': 0.761, 'learning_rate': 1.3396519449841006e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14313
total_samples=12404, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:09,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.00 | bwd_microstep: 1742.48 | bwd_inner_microstep: 1707.72 | bwd_allreduce_microstep: 34.68 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11777
total_samples=12407, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:11,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.99 | bwd_microstep: 1991.21 | bwd_inner_microstep: 1779.58 | bwd_allreduce_microstep: 211.57 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13973
total_samples=12412, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:14,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.73 | bwd_microstep: 2009.03 | bwd_inner_microstep: 1896.26 | bwd_allreduce_microstep: 112.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14316
total_samples=12416, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:17,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 04:14:17,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.88 | bwd_microstep: 1990.58 | bwd_inner_microstep: 1896.70 | bwd_allreduce_microstep: 93.82 | step_microstep: 127.68
[2025-08-03 04:14:17,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.53 | bwd: 7733.35 | bwd_inner: 7280.25 | bwd_allreduce: 452.87 | step: 128.20
{'loss': 0.7718, 'learning_rate': 1.3381283927059751e-05, 'epoch': 0.41}
 41%|████      | 812/2000 [2:30:38<3:38:30, 11.04s/it]
 41%|████      | 813/2000 [2:30:48<3:37:42, 11.00s/it]
 41%|████      | 814/2000 [2:30:59<3:37:32, 11.01s/it]
 41%|████      | 815/2000 [2:31:10<3:35:33, 10.91s/it]
 41%|████      | 816/2000 [2:31:21<3:33:57, 10.84s/it]
 41%|████      | 817/2000 [2:31:32<3:34:31, 10.88s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13432
total_samples=12420, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:20,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.94 | bwd_microstep: 1822.44 | bwd_inner_microstep: 1719.41 | bwd_allreduce_microstep: 102.97 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12630
total_samples=12423, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:22,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.97 | bwd_microstep: 1999.54 | bwd_inner_microstep: 1817.63 | bwd_allreduce_microstep: 181.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16059
total_samples=12427, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:25,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.72 | bwd_microstep: 1810.36 | bwd_inner_microstep: 1804.31 | bwd_allreduce_microstep: 5.99 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14007
total_samples=12431, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:28,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 04:14:28,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.11 | bwd_microstep: 1844.76 | bwd_inner_microstep: 1708.75 | bwd_allreduce_microstep: 135.94 | step_microstep: 111.67
[2025-08-03 04:14:28,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.68 | bwd: 7477.15 | bwd_inner: 7050.09 | bwd_allreduce: 426.83 | step: 112.12
{'loss': 0.7615, 'learning_rate': 1.3366039537255589e-05, 'epoch': 0.41}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12840
total_samples=12435, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:31,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.19 | bwd_microstep: 2159.62 | bwd_inner_microstep: 1994.27 | bwd_allreduce_microstep: 165.28 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11839
total_samples=12438, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:33,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.10 | bwd_microstep: 2002.41 | bwd_inner_microstep: 1665.64 | bwd_allreduce_microstep: 336.71 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12884
total_samples=12442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:36,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.34 | bwd_microstep: 1942.07 | bwd_inner_microstep: 1931.44 | bwd_allreduce_microstep: 10.57 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11998
total_samples=12445, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:39,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 04:14:39,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.39 | bwd_microstep: 1752.05 | bwd_inner_microstep: 1554.01 | bwd_allreduce_microstep: 197.98 | step_microstep: 124.77
[2025-08-03 04:14:39,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2742.94 | bwd: 7856.20 | bwd_inner: 7145.36 | bwd_allreduce: 710.60 | step: 125.11
{'loss': 0.7582, 'learning_rate': 1.3350786320405145e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13530
total_samples=12449, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:41,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.95 | bwd_microstep: 1686.36 | bwd_inner_microstep: 1646.90 | bwd_allreduce_microstep: 39.39 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11970
total_samples=12452, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:44,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.54 | bwd_microstep: 1865.30 | bwd_inner_microstep: 1546.57 | bwd_allreduce_microstep: 318.66 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12891
total_samples=12456, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:47,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.92 | bwd_microstep: 1940.15 | bwd_inner_microstep: 1791.49 | bwd_allreduce_microstep: 148.60 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12164
total_samples=12460, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:50,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 04:14:50,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.87 | bwd_microstep: 2037.60 | bwd_inner_microstep: 1830.00 | bwd_allreduce_microstep: 207.53 | step_microstep: 130.64
[2025-08-03 04:14:50,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.20 | bwd: 7529.46 | bwd_inner: 6814.96 | bwd_allreduce: 714.27 | step: 130.98
{'loss': 0.7588, 'learning_rate': 1.3335524316508208e-05, 'epoch': 0.41}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11849
total_samples=12463, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:52,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.28 | bwd_microstep: 1858.05 | bwd_inner_microstep: 1721.37 | bwd_allreduce_microstep: 136.62 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12518
total_samples=12466, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:55,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.89 | bwd_microstep: 2072.74 | bwd_inner_microstep: 1837.85 | bwd_allreduce_microstep: 234.83 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12021
total_samples=12469, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:14:58,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.51 | bwd_microstep: 2008.40 | bwd_inner_microstep: 1795.28 | bwd_allreduce_microstep: 213.05 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11629
total_samples=12472, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:01,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 04:15:01,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.76 | bwd_microstep: 1817.84 | bwd_inner_microstep: 1566.77 | bwd_allreduce_microstep: 251.00 | step_microstep: 115.29
[2025-08-03 04:15:01,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.37 | bwd: 7757.09 | bwd_inner: 6921.27 | bwd_allreduce: 835.58 | step: 115.61
{'loss': 0.7599, 'learning_rate': 1.3320253565587602e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13352
total_samples=12476, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:03,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.16 | bwd_microstep: 1720.20 | bwd_inner_microstep: 1661.09 | bwd_allreduce_microstep: 59.04 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11805
total_samples=12479, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:06,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.00 | bwd_microstep: 2068.78 | bwd_inner_microstep: 1606.31 | bwd_allreduce_microstep: 462.37 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12385
total_samples=12482, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:09,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.80 | bwd_microstep: 1850.95 | bwd_inner_microstep: 1602.53 | bwd_allreduce_microstep: 248.35 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11982
total_samples=12485, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:12,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 04:15:12,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.59 | bwd_microstep: 2069.36 | bwd_inner_microstep: 1719.59 | bwd_allreduce_microstep: 349.71 | step_microstep: 114.28
[2025-08-03 04:15:12,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.49 | bwd: 7709.33 | bwd_inner: 6589.54 | bwd_allreduce: 1119.53 | step: 114.62
{'loss': 0.7709, 'learning_rate': 1.3304974107689088e-05, 'epoch': 0.41}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13573
total_samples=12490, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:14,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.04 | bwd_microstep: 1774.90 | bwd_inner_microstep: 1683.25 | bwd_allreduce_microstep: 91.58 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11534
total_samples=12493, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:17,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.17 | bwd_microstep: 1874.20 | bwd_inner_microstep: 1592.11 | bwd_allreduce_microstep: 282.02 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11839
total_samples=12496, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:19,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.17 | bwd_microstep: 1775.97 | bwd_inner_microstep: 1556.66 | bwd_allreduce_microstep: 219.24 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12148
total_samples=12499, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:22,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 04:15:22,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.68 | bwd_microstep: 1883.39 | bwd_inner_microstep: 1747.10 | bwd_allreduce_microstep: 136.22 | step_microstep: 131.82
[2025-08-03 04:15:22,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2756.98 | bwd: 7308.50 | bwd_inner: 6579.12 | bwd_allreduce: 729.14 | step: 132.30
817/2000 [2:31:32<3:34:31, 10.88s/it]
41%|████      | 818/2000 [2:31:43<3:33:47, 10.85s/it]
41%|████      | 819/2000 [2:31:54<3:34:49, 10.91s/it]
41%|████      | 820/2000 [2:32:04<3:33:54, 10.88s/it]
41%|████      | 821/2000 [2:32:15<3:34:29, 10.92s/it]
41%|████      | 822/2000 [2:32:26<3:34:51, 10.94s/it]
{'loss': 0.7526, 'learning_rate': 1.328968598288127e-05, 'epoch': 0.41}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13951
total_samples=12503, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:25,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.48 | bwd_microstep: 1759.68 | bwd_inner_microstep: 1676.15 | bwd_allreduce_microstep: 83.46 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11548
total_samples=12506, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:27,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.92 | bwd_microstep: 1784.28 | bwd_inner_microstep: 1576.90 | bwd_allreduce_microstep: 207.32 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14777
total_samples=12510, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:30,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.06 | bwd_microstep: 1753.59 | bwd_inner_microstep: 1730.27 | bwd_allreduce_microstep: 23.26 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13243
total_samples=12514, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:32,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 04:15:32,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.23 | bwd_microstep: 1734.09 | bwd_inner_microstep: 1669.26 | bwd_allreduce_microstep: 64.77 | step_microstep: 129.35
[2025-08-03 04:15:32,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.62 | bwd: 7031.70 | bwd_inner: 6652.57 | bwd_allreduce: 378.89 | step: 129.84
{'loss': 0.7575, 'learning_rate': 1.3274389231255466e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13747
total_samples=12518, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:35,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.86 | bwd_microstep: 1734.60 | bwd_inner_microstep: 1687.84 | bwd_allreduce_microstep: 46.70 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14030
total_samples=12522, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:38,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.98 | bwd_microstep: 1803.98 | bwd_inner_microstep: 1771.64 | bwd_allreduce_microstep: 32.28 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14025
total_samples=12526, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:40,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.56 | bwd_microstep: 1831.69 | bwd_inner_microstep: 1739.48 | bwd_allreduce_microstep: 92.15 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13679
total_samples=12530, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:43,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 04:15:43,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 957.84 | bwd_microstep: 1832.76 | bwd_inner_microstep: 1724.05 | bwd_allreduce_microstep: 108.64 | step_microstep: 109.22
[2025-08-03 04:15:43,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3052.18 | bwd: 7203.09 | bwd_inner: 6923.00 | bwd_allreduce: 279.85 | step: 109.56
{'loss': 0.7563, 'learning_rate': 1.3259083892925633e-05, 'epoch': 0.41}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11606
total_samples=12533, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:46,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.53 | bwd_microstep: 1947.24 | bwd_inner_microstep: 1537.60 | bwd_allreduce_microstep: 409.57 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15038
total_samples=12537, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:48,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.62 | bwd_microstep: 1875.65 | bwd_inner_microstep: 1869.80 | bwd_allreduce_microstep: 5.79 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12201
total_samples=12541, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:51,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.04 | bwd_microstep: 1995.28 | bwd_inner_microstep: 1591.95 | bwd_allreduce_microstep: 403.26 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13306
total_samples=12545, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:54,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69
[2025-08-03 04:15:54,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.59 | bwd_microstep: 1772.62 | bwd_inner_microstep: 1694.81 | bwd_allreduce_microstep: 77.75 | step_microstep: 157.47
[2025-08-03 04:15:54,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.72 | bwd: 7590.83 | bwd_inner: 6694.16 | bwd_allreduce: 896.44 | step: 157.89
{'loss': 0.7557, 'learning_rate': 1.3243770008028225e-05, 'epoch': 0.41}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11758
total_samples=12548, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:15:57,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.08 | bwd_microstep: 2182.03 | bwd_inner_microstep: 1707.41 | bwd_allreduce_microstep: 474.52 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13224
total_samples=12552, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:00,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.82 | bwd_microstep: 2053.08 | bwd_inner_microstep: 1923.87 | bwd_allreduce_microstep: 129.15 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12895
total_samples=12555, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:02,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.98 | bwd_microstep: 1800.28 | bwd_inner_microstep: 1615.22 | bwd_allreduce_microstep: 185.00 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12781
total_samples=12559, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:05,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 04:16:05,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.85 | bwd_microstep: 1975.13 | bwd_inner_microstep: 1796.65 | bwd_allreduce_microstep: 178.42 | step_microstep: 129.71
[2025-08-03 04:16:05,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.65 | bwd: 8010.57 | bwd_inner: 7043.17 | bwd_allreduce: 967.15 | step: 130.05
{'loss': 0.7626, 'learning_rate': 1.3228447616722128e-05, 'epoch': 0.41}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11738
total_samples=12562, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:08,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.59 | bwd_microstep: 1989.69 | bwd_inner_microstep: 1777.07 | bwd_allreduce_microstep: 212.55 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14090
total_samples=12566, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:11,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.49 | bwd_microstep: 2144.62 | bwd_inner_microstep: 1873.56 | bwd_allreduce_microstep: 270.99 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13190
total_samples=12570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:13,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.22 | bwd_microstep: 1797.33 | bwd_inner_microstep: 1711.78 | bwd_allreduce_microstep: 85.48 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12197
total_samples=12573, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:16,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 04:16:16,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.22 | bwd_microstep: 1835.07 | bwd_inner_microstep: 1597.60 | bwd_allreduce_microstep: 237.41 | step_microstep: 397.72
[2025-08-03 04:16:16,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.45 | bwd: 7766.76 | bwd_inner: 6960.01 | bwd_allreduce: 806.51 | step: 398.18
{'loss': 0.7662, 'learning_rate': 1.3213116759188525e-05, 'epoch': 0.41}
41%|████      | 823/2000 [2:32:37<3:32:10, 10.82s/it]
41%|████      | 824/2000 [2:32:47<3:28:39, 10.65s/it]
41%|████▏     | 825/2000 [2:32:58<3:28:55, 10.67s/it]
41%|████▏     | 826/2000 [2:33:09<3:29:45, 10.72s/it]
41%|████▏     | 827/2000 [2:33:20<3:32:41, 10.88s/it]
41%|████▏     | 828/2000 [2:33:31<3:34:49, 11.00s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11791
total_samples=12576, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:19,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.47 | bwd_microstep: 1921.55 | bwd_inner_microstep: 1557.12 | bwd_allreduce_microstep: 364.37 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13955
total_samples=12580, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:22,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.09 | bwd_microstep: 1842.89 | bwd_inner_microstep: 1739.07 | bwd_allreduce_microstep: 103.75 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12765
total_samples=12584, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:24,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.01 | bwd_microstep: 1767.39 | bwd_inner_microstep: 1639.07 | bwd_allreduce_microstep: 128.26 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13083
total_samples=12588, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:27,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 04:16:27,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.14 | bwd_microstep: 1978.12 | bwd_inner_microstep: 1700.00 | bwd_allreduce_microstep: 278.06 | step_microstep: 115.34
[2025-08-03 04:16:27,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.64 | bwd: 7509.99 | bwd_inner: 6635.25 | bwd_allreduce: 874.52 | step: 115.79
{'loss': 0.7721, 'learning_rate': 1.31977774756308e-05, 'epoch': 0.41}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14303
total_samples=12592, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:30,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.42 | bwd_microstep: 1796.45 | bwd_inner_microstep: 1738.19 | bwd_allreduce_microstep: 58.20 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13188
total_samples=12596, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:33,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.08 | bwd_microstep: 2300.76 | bwd_inner_microstep: 2148.46 | bwd_allreduce_microstep: 152.24 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309
total_samples=12600, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:36,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.14 | bwd_microstep: 1763.92 | bwd_inner_microstep: 1684.28 | bwd_allreduce_microstep: 79.58 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12539
total_samples=12604, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:38,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 04:16:38,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.90 | bwd_microstep: 2029.14 | bwd_inner_microstep: 1855.00 | bwd_allreduce_microstep: 174.07 | step_microstep: 141.55
[2025-08-03 04:16:38,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.49 | bwd: 7890.32 | bwd_inner: 7425.92 | bwd_allreduce: 464.17 | step: 141.87
{'loss': 0.7674, 'learning_rate': 1.3182429806274442e-05, 'epoch': 0.41}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11611
total_samples=12607, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:41,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.79 | bwd_microstep: 1822.50 | bwd_inner_microstep: 1536.82 | bwd_allreduce_microstep: 285.61 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12531
total_samples=12611, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:44,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.61 | bwd_microstep: 1752.63 | bwd_inner_microstep: 1598.20 | bwd_allreduce_microstep: 154.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14783
total_samples=12615, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:46,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.13 | bwd_microstep: 1950.65 | bwd_inner_microstep: 1871.33 | bwd_allreduce_microstep: 79.25 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12219
total_samples=12618, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:49,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 04:16:49,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.51 | bwd_microstep: 1828.44 | bwd_inner_microstep: 1707.28 | bwd_allreduce_microstep: 121.09 | step_microstep: 143.80
[2025-08-03 04:16:49,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.98 | bwd: 7354.26 | bwd_inner: 6713.64 | bwd_allreduce: 640.39 | step: 144.23
{'loss': 0.7649, 'learning_rate': 1.3167073791366915e-05, 'epoch': 0.42}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11810
total_samples=12621, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:52,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.22 | bwd_microstep: 1858.97 | bwd_inner_microstep: 1600.19 | bwd_allreduce_microstep: 258.72 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12954
total_samples=12625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:54,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.06 | bwd_microstep: 1809.31 | bwd_inner_microstep: 1648.83 | bwd_allreduce_microstep: 160.41 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14681
total_samples=12629, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:16:57,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.91 | bwd_microstep: 2150.42 | bwd_inner_microstep: 1985.23 | bwd_allreduce_microstep: 165.13 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13380
total_samples=12633, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:00,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 04:17:00,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.52 | bwd_microstep: 1999.74 | bwd_inner_microstep: 1881.02 | bwd_allreduce_microstep: 118.66 | step_microstep: 108.66
[2025-08-03 04:17:00,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.66 | bwd: 7818.49 | bwd_inner: 7115.26 | bwd_allreduce: 703.00 | step: 109.01
{'loss': 0.7663, 'learning_rate': 1.3151709471177589e-05, 'epoch': 0.42}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12298
total_samples=12636, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:03,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.06 | bwd_microstep: 2054.85 | bwd_inner_microstep: 1820.77 | bwd_allreduce_microstep: 234.01 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12609
total_samples=12639, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:06,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.35 | bwd_microstep: 2355.86 | bwd_inner_microstep: 2349.83 | bwd_allreduce_microstep: 5.97 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12943
total_samples=12643, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:09,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.25 | bwd_microstep: 1972.37 | bwd_inner_microstep: 1812.17 | bwd_allreduce_microstep: 160.14 | step_microstep: 0.19
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14117
total_samples=12648, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:12,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 04:17:12,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.52 | bwd_microstep: 1757.75 | bwd_inner_microstep: 1691.75 | bwd_allreduce_microstep: 65.93 | step_microstep: 158.14
[2025-08-03 04:17:12,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2738.09 | bwd: 8140.88 | bwd_inner: 7674.51 | bwd_allreduce: 466.13 | step: 158.60
{'loss': 0.7688, 'learning_rate': 1.3136336885997591e-05, 'epoch': 0.42}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11660
total_samples=12651, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:14,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.61 | bwd_microstep: 1780.40 | bwd_inner_microstep: 1546.57 | bwd_allreduce_microstep: 233.77 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11776
total_samples=12654, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:17,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.92 | bwd_microstep: 1960.41 | bwd_inner_microstep: 1542.13 | bwd_allreduce_microstep: 418.22 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12475
total_samples=12658, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:20,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.30 | bwd_microstep: 1988.92 | bwd_inner_microstep: 1802.49 | bwd_allreduce_microstep: 186.37 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446
total_samples=12662, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:23,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 04:17:23,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.87 | bwd_microstep: 2109.53 | bwd_inner_microstep: 2058.65 | bwd_allreduce_microstep: 50.82 | step_microstep: 110.25
[2025-08-03 04:17:23,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.64 | bwd: 7839.32 | bwd_inner: 6949.84 | bwd_allreduce: 889.24 | step: 110.60
41%|████▏     | 829/2000 [2:33:42<3:33:36, 10.94s/it]
42%|████▏     | 830/2000 [2:33:53<3:34:48, 11.02s/it]
42%|████▏     | 831/2000 [2:34:04<3:32:20, 10.90s/it]
42%|████▏     | 832/2000 [2:34:15<3:33:02, 10.94s/it]
42%|████▏     | 833/2000 [2:34:26<3:35:25, 11.08s/it]
42%|████▏     | 834/2000 [2:34:37<3:35:10, 11.07s/it]
{'loss': 0.7551, 'learning_rate': 1.3120956076139746e-05, 'epoch': 0.42}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13801
total_samples=12666, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:25,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 663.74 | bwd_microstep: 2012.17 | bwd_inner_microstep: 1868.10 | bwd_allreduce_microstep: 144.01 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16350
total_samples=12670, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:28,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.64 | bwd_microstep: 1836.85 | bwd_inner_microstep: 1830.77 | bwd_allreduce_microstep: 6.02 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11696
total_samples=12673, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:31,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.21 | bwd_microstep: 1797.83 | bwd_inner_microstep: 1583.59 | bwd_allreduce_microstep: 214.17 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12348
total_samples=12678, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:34,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 04:17:34,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.39 | bwd_microstep: 2045.53 | bwd_inner_microstep: 1577.81 | bwd_allreduce_microstep: 467.62 | step_microstep: 123.40
[2025-08-03 04:17:34,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.92 | bwd: 7692.43 | bwd_inner: 6860.28 | bwd_allreduce: 831.89 | step: 123.87
{'loss': 0.7525, 'learning_rate': 1.3105567081938423e-05, 'epoch': 0.42}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13224
total_samples=12682, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:36,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.18 | bwd_microstep: 1800.98 | bwd_inner_microstep: 1701.60 | bwd_allreduce_microstep: 99.31 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12064
total_samples=12685, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:39,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.74 | bwd_microstep: 1799.83 | bwd_inner_microstep: 1569.72 | bwd_allreduce_microstep: 230.05 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12035
total_samples=12688, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:41,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.29 | bwd_microstep: 1790.08 | bwd_inner_microstep: 1551.15 | bwd_allreduce_microstep: 238.87 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12942
total_samples=12692, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:44,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 04:17:44,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.18 | bwd_microstep: 1917.56 | bwd_inner_microstep: 1669.14 | bwd_allreduce_microstep: 248.36 | step_microstep: 128.18
[2025-08-03 04:17:44,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.32 | bwd: 7308.52 | bwd_inner: 6491.60 | bwd_allreduce: 816.68 | step: 128.68
{'loss': 0.7763, 'learning_rate': 1.3090169943749475e-05, 'epoch': 0.42}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11954
total_samples=12695, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:47,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.81 | bwd_microstep: 1798.49 | bwd_inner_microstep: 1571.91 | bwd_allreduce_microstep: 226.52 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13542
total_samples=12699, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:49,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.99 | bwd_microstep: 1824.21 | bwd_inner_microstep: 1700.71 | bwd_allreduce_microstep: 123.44 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12342
total_samples=12702, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:52,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.73 | bwd_microstep: 2043.03 | bwd_inner_microstep: 1808.67 | bwd_allreduce_microstep: 234.30 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11830
total_samples=12705, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:55,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 04:17:55,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.85 | bwd_microstep: 2103.82 | bwd_inner_microstep: 1871.41 | bwd_allreduce_microstep: 232.34 | step_microstep: 149.35
[2025-08-03 04:17:55,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.31 | bwd: 7769.59 | bwd_inner: 6952.70 | bwd_allreduce: 816.67 | step: 149.68
{'loss': 0.7597, 'learning_rate': 1.3074764701950095e-05, 'epoch': 0.42}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11836
total_samples=12708, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:17:58,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.69 | bwd_microstep: 1735.19 | bwd_inner_microstep: 1535.23 | bwd_allreduce_microstep: 199.90 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12762
total_samples=12713, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:00,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.30 | bwd_microstep: 1923.79 | bwd_inner_microstep: 1647.64 | bwd_allreduce_microstep: 276.07 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13688
total_samples=12717, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:03,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.93 | bwd_microstep: 1748.95 | bwd_inner_microstep: 1696.29 | bwd_allreduce_microstep: 52.60 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14634
total_samples=12722, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:06,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.78
[2025-08-03 04:18:06,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.26 | bwd_microstep: 1788.66 | bwd_inner_microstep: 1717.21 | bwd_allreduce_microstep: 71.39 | step_microstep: 138.96
[2025-08-03 04:18:06,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.12 | bwd: 7196.66 | bwd_inner: 6596.37 | bwd_allreduce: 600.05 | step: 139.41
{'loss': 0.7647, 'learning_rate': 1.305935139693874e-05, 'epoch': 0.42}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13203
total_samples=12726, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:08,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.93 | bwd_microstep: 1894.63 | bwd_inner_microstep: 1687.04 | bwd_allreduce_microstep: 207.52 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12201
total_samples=12729, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:11,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.82 | bwd_microstep: 1779.81 | bwd_inner_microstep: 1595.07 | bwd_allreduce_microstep: 184.68 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12158
total_samples=12732, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:14,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.12 | bwd_microstep: 2101.34 | bwd_inner_microstep: 1814.51 | bwd_allreduce_microstep: 286.77 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13336
total_samples=12736, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:16,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 04:18:16,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.93 | bwd_microstep: 1779.42 | bwd_inner_microstep: 1669.29 | bwd_allreduce_microstep: 110.06 | step_microstep: 138.76
[2025-08-03 04:18:16,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.72 | bwd: 7555.26 | bwd_inner: 6765.91 | bwd_allreduce: 789.11 | step: 139.09
{'loss': 0.7666, 'learning_rate': 1.3043930069134998e-05, 'epoch': 0.42}
42%|████▏     | 834/2000 [2:34:37<3:35:10, 11.07s/it]
42%|████▏     | 835/2000 [2:34:48<3:34:14, 11.03s/it]
42%|████▏     | 836/2000 [2:34:59<3:31:12, 10.89s/it]
42%|████▏     | 837/2000 [2:35:10<3:31:54, 10.93s/it]
42%|████▏     | 838/2000 [2:35:20<3:29:04, 10.80s/it]
42%|████▏     | 839/2000 [2:35:31<3:29:01, 10.80s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12318
total_samples=12739, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:19,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.92 | bwd_microstep: 1854.15 | bwd_inner_microstep: 1615.72 | bwd_allreduce_microstep: 238.37 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12025
total_samples=12742, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:22,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.19 | bwd_microstep: 2006.38 | bwd_inner_microstep: 1780.34 | bwd_allreduce_microstep: 225.96 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12842
total_samples=12746, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:25,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.91 | bwd_microstep: 1864.03 | bwd_inner_microstep: 1655.04 | bwd_allreduce_microstep: 208.91 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11662
total_samples=12749, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:27,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.61
[2025-08-03 04:18:27,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.09 | bwd_microstep: 1779.14 | bwd_inner_microstep: 1560.75 | bwd_allreduce_microstep: 218.33 | step_microstep: 138.24
[2025-08-03 04:18:27,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.04 | bwd: 7503.75 | bwd_inner: 6611.85 | bwd_allreduce: 891.66 | step: 138.70
{'loss': 0.7769, 'learning_rate': 1.3028500758979507e-05, 'epoch': 0.42}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13453
total_samples=12753, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:30,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.65 | bwd_microstep: 1808.98 | bwd_inner_microstep: 1711.06 | bwd_allreduce_microstep: 97.85 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13658
total_samples=12757, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:32,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.33 | bwd_microstep: 1819.13 | bwd_inner_microstep: 1740.22 | bwd_allreduce_microstep: 78.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13684
total_samples=12761, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:36,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.24 | bwd_microstep: 2263.07 | bwd_inner_microstep: 1995.26 | bwd_allreduce_microstep: 267.75 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11967
total_samples=12764, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:38,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 04:18:38,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.00 | bwd_microstep: 1769.02 | bwd_inner_microstep: 1562.58 | bwd_allreduce_microstep: 206.36 | step_microstep: 133.40
[2025-08-03 04:18:38,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.15 | bwd: 7660.25 | bwd_inner: 7009.12 | bwd_allreduce: 650.89 | step: 133.88
{'loss': 0.7706, 'learning_rate': 1.3013063506933838e-05, 'epoch': 0.42}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11735
total_samples=12767, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:41,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.88 | bwd_microstep: 1848.40 | bwd_inner_microstep: 1604.72 | bwd_allreduce_microstep: 243.61 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12692
total_samples=12771, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:43,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.11 | bwd_microstep: 1871.20 | bwd_inner_microstep: 1636.00 | bwd_allreduce_microstep: 235.14 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11674
total_samples=12774, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:46,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.28 | bwd_microstep: 1759.53 | bwd_inner_microstep: 1530.70 | bwd_allreduce_microstep: 228.77 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11797
total_samples=12777, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:49,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 04:18:49,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.73 | bwd_microstep: 1786.24 | bwd_inner_microstep: 1567.07 | bwd_allreduce_microstep: 219.11 | step_microstep: 110.42
[2025-08-03 04:18:49,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.93 | bwd: 7265.42 | bwd_inner: 6338.49 | bwd_allreduce: 926.69 | step: 110.75
{'loss': 0.7677, 'learning_rate': 1.299761835348038e-05, 'epoch': 0.42}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13254
total_samples=12781, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:51,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.83 | bwd_microstep: 1842.09 | bwd_inner_microstep: 1698.14 | bwd_allreduce_microstep: 143.89 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13626
total_samples=12785, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:54,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.71 | bwd_microstep: 1816.55 | bwd_inner_microstep: 1733.45 | bwd_allreduce_microstep: 83.04 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12858
total_samples=12789, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:18:56,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.32 | bwd_microstep: 1797.95 | bwd_inner_microstep: 1644.96 | bwd_allreduce_microstep: 152.92 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11842
total_samples=12792, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:00,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 04:19:00,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.03 | bwd_microstep: 2275.76 | bwd_inner_microstep: 2070.27 | bwd_allreduce_microstep: 205.42 | step_microstep: 146.49
[2025-08-03 04:19:00,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.82 | bwd: 7732.40 | bwd_inner: 7146.81 | bwd_allreduce: 585.35 | step: 146.94
{'loss': 0.7728, 'learning_rate': 1.2982165339122248e-05, 'epoch': 0.42}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13386
total_samples=12796, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:02,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.23 | bwd_microstep: 1767.02 | bwd_inner_microstep: 1683.92 | bwd_allreduce_microstep: 83.04 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13579
total_samples=12800, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:05,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.76 | bwd_microstep: 2057.59 | bwd_inner_microstep: 1918.63 | bwd_allreduce_microstep: 138.89 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11878
total_samples=12803, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:08,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.72 | bwd_microstep: 1813.58 | bwd_inner_microstep: 1570.10 | bwd_allreduce_microstep: 243.41 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11937
total_samples=12806, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:10,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 04:19:10,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.81 | bwd_microstep: 1833.87 | bwd_inner_microstep: 1545.14 | bwd_allreduce_microstep: 288.67 | step_microstep: 138.85
[2025-08-03 04:19:10,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2754.44 | bwd: 7472.11 | bwd_inner: 6717.79 | bwd_allreduce: 754.09 | step: 139.30
{'loss': 0.7677, 'learning_rate': 1.296670450438317e-05, 'epoch': 0.42}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13946
total_samples=12810, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:13,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.77 | bwd_microstep: 1842.35 | bwd_inner_microstep: 1746.68 | bwd_allreduce_microstep: 95.61 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13257
total_samples=12814, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:16,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.01 | bwd_microstep: 1762.49 | bwd_inner_microstep: 1629.84 | bwd_allreduce_microstep: 132.59 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11703
total_samples=12817, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:18,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.97 | bwd_microstep: 2044.85 | bwd_inner_microstep: 1861.51 | bwd_allreduce_microstep: 183.28 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14524
total_samples=12821, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:21,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 04:19:21,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.62 | bwd_microstep: 1832.19 | bwd_inner_microstep: 1734.00 | bwd_allreduce_microstep: 98.12 | step_microstep: 127.31
[2025-08-03 04:19:21,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.30 | bwd: 7481.92 | bwd_inner: 6972.02 | bwd_allreduce: 509.67 | step: 127.65
42%|████▏     | 840/2000 [2:35:42<3:28:53, 10.80s/it]
42%|████▏     | 841/2000 [2:35:53<3:29:31, 10.85s/it]
42%|████▏     | 842/2000 [2:36:04<3:27:15, 10.74s/it]
42%|████▏     | 843/2000 [2:36:15<3:28:46, 10.83s/it]
42%|████▏     | 844/2000 [2:36:25<3:27:59, 10.80s/it]
42%|████▏     | 845/2000 [2:36:36<3:27:26, 10.78s/it]
{'loss': 0.7573, 'learning_rate': 1.2951235889807386e-05, 'epoch': 0.42}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13993
total_samples=12826, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:24,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.49 | bwd_microstep: 1836.38 | bwd_inner_microstep: 1804.25 | bwd_allreduce_microstep: 32.03 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12011
total_samples=12829, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:26,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.65 | bwd_microstep: 1911.18 | bwd_inner_microstep: 1552.36 | bwd_allreduce_microstep: 358.74 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14807
total_samples=12833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:29,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.69 | bwd_microstep: 2042.85 | bwd_inner_microstep: 1781.25 | bwd_allreduce_microstep: 261.54 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14100
total_samples=12837, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:32,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 04:19:32,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.06 | bwd_microstep: 1793.68 | bwd_inner_microstep: 1748.79 | bwd_allreduce_microstep: 44.82 | step_microstep: 137.92
[2025-08-03 04:19:32,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.83 | bwd: 7584.13 | bwd_inner: 6886.66 | bwd_allreduce: 697.19 | step: 138.38
{'loss': 0.7565, 'learning_rate': 1.2935759535959528e-05, 'epoch': 0.42}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13703
total_samples=12841, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:35,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.55 | bwd_microstep: 2249.38 | bwd_inner_microstep: 2004.85 | bwd_allreduce_microstep: 244.47 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12243
total_samples=12844, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:38,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.91 | bwd_microstep: 1893.77 | bwd_inner_microstep: 1741.68 | bwd_allreduce_microstep: 152.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14197
total_samples=12849, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:40,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.19 | bwd_microstep: 1732.49 | bwd_inner_microstep: 1695.02 | bwd_allreduce_microstep: 37.40 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13343
total_samples=12853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:43,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51
[2025-08-03 04:19:43,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.32 | bwd_microstep: 1751.75 | bwd_inner_microstep: 1702.44 | bwd_allreduce_microstep: 49.25 | step_microstep: 138.75
[2025-08-03 04:19:43,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2754.91 | bwd: 7627.44 | bwd_inner: 7143.97 | bwd_allreduce: 483.23 | step: 139.20
{'loss': 0.7611, 'learning_rate': 1.2920275483424538e-05, 'epoch': 0.42}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13612
total_samples=12857, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:46,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.97 | bwd_microstep: 1988.26 | bwd_inner_microstep: 1720.70 | bwd_allreduce_microstep: 267.50 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12319
total_samples=12860, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:48,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.86 | bwd_microstep: 1762.15 | bwd_inner_microstep: 1574.70 | bwd_allreduce_microstep: 187.39 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13872
total_samples=12864, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:51,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.11 | bwd_microstep: 2063.41 | bwd_inner_microstep: 1942.92 | bwd_allreduce_microstep: 120.43 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11905
total_samples=12867, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:54,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 04:19:54,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.18 | bwd_microstep: 2191.50 | bwd_inner_microstep: 1799.95 | bwd_allreduce_microstep: 391.48 | step_microstep: 119.64
[2025-08-03 04:19:54,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2752.04 | bwd: 8005.37 | bwd_inner: 7038.26 | bwd_allreduce: 966.87 | step: 120.08
{'loss': 0.7588, 'learning_rate': 1.2904783772807534e-05, 'epoch': 0.42}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14029
total_samples=12872, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:57,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.95 | bwd_microstep: 1750.33 | bwd_inner_microstep: 1704.60 | bwd_allreduce_microstep: 45.67 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13642
total_samples=12876, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:19:59,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.20 | bwd_microstep: 1773.61 | bwd_inner_microstep: 1703.15 | bwd_allreduce_microstep: 70.40 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12614
total_samples=12880, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:02,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.95 | bwd_microstep: 1968.99 | bwd_inner_microstep: 1618.76 | bwd_allreduce_microstep: 350.16 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14196
total_samples=12884, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:05,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 04:20:05,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.94 | bwd_microstep: 1784.76 | bwd_inner_microstep: 1734.03 | bwd_allreduce_microstep: 50.67 | step_microstep: 149.47
[2025-08-03 04:20:05,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.96 | bwd: 7277.74 | bwd_inner: 6760.53 | bwd_allreduce: 516.98 | step: 149.86
{'loss': 0.7483, 'learning_rate': 1.2889284444733722e-05, 'epoch': 0.42}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13436
total_samples=12888, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:07,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.46 | bwd_microstep: 1988.51 | bwd_inner_microstep: 1709.63 | bwd_allreduce_microstep: 278.82 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13534
total_samples=12892, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:10,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.02 | bwd_microstep: 1749.11 | bwd_inner_microstep: 1672.57 | bwd_allreduce_microstep: 76.48 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13890
total_samples=12896, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:12,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.23 | bwd_microstep: 1739.78 | bwd_inner_microstep: 1696.45 | bwd_allreduce_microstep: 43.26 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15430
total_samples=12901, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:15,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 04:20:15,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.93 | bwd_microstep: 1755.46 | bwd_inner_microstep: 1744.26 | bwd_allreduce_microstep: 11.12 | step_microstep: 109.46
[2025-08-03 04:20:15,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.57 | bwd: 7232.90 | bwd_inner: 6822.91 | bwd_allreduce: 409.76 | step: 109.89
{'loss': 0.7715, 'learning_rate': 1.2873777539848284e-05, 'epoch': 0.42}
42%|████▏     | 846/2000 [2:36:47<3:27:33, 10.79s/it]
42%|████▏     | 847/2000 [2:36:58<3:27:41, 10.81s/it]
42%|████▏     | 848/2000 [2:37:09<3:29:51, 10.93s/it]
42%|████▏     | 849/2000 [2:37:19<3:27:26, 10.81s/it]
42%|████▎     | 850/2000 [2:37:30<3:25:17, 10.71s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14246
total_samples=12905, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:18,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.50 | bwd_microstep: 1797.03 | bwd_inner_microstep: 1731.46 | bwd_allreduce_microstep: 65.50 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13597
total_samples=12909, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:20,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.08 | bwd_microstep: 2058.72 | bwd_inner_microstep: 1991.54 | bwd_allreduce_microstep: 67.12 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14148
total_samples=12915, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:23,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.58 | bwd_microstep: 1895.55 | bwd_inner_microstep: 1861.30 | bwd_allreduce_microstep: 34.18 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13784
total_samples=12919, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:26,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 04:20:26,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.67 | bwd_microstep: 2147.31 | bwd_inner_microstep: 1860.85 | bwd_allreduce_microstep: 286.40 | step_microstep: 129.81
[2025-08-03 04:20:26,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.76 | bwd: 7898.65 | bwd_inner: 7445.15 | bwd_allreduce: 453.28 | step: 130.27
{'loss': 0.7604, 'learning_rate': 1.2858263098816265e-05, 'epoch': 0.43}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13668
total_samples=12923, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:29,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.00 | bwd_microstep: 2077.96 | bwd_inner_microstep: 1908.36 | bwd_allreduce_microstep: 169.54 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11594
total_samples=12926, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:32,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.64 | bwd_microstep: 1822.90 | bwd_inner_microstep: 1608.01 | bwd_allreduce_microstep: 214.82 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14846
total_samples=12930, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:34,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.76 | bwd_microstep: 1820.77 | bwd_inner_microstep: 1770.10 | bwd_allreduce_microstep: 50.60 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13403
total_samples=12934, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:37,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 04:20:37,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.43 | bwd_microstep: 2123.79 | bwd_inner_microstep: 1977.71 | bwd_allreduce_microstep: 146.01 | step_microstep: 118.83
[2025-08-03 04:20:37,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.77 | bwd: 7845.47 | bwd_inner: 7264.17 | bwd_allreduce: 581.06 | step: 119.25
{'loss': 0.7577, 'learning_rate': 1.2842741162322487e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14113
total_samples=12938, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:40,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.62 | bwd_microstep: 1963.84 | bwd_inner_microstep: 1861.50 | bwd_allreduce_microstep: 102.27 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11581
total_samples=12941, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:43,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.23 | bwd_microstep: 1824.31 | bwd_inner_microstep: 1540.49 | bwd_allreduce_microstep: 283.77 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 16198
total_samples=12945, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:45,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.24 | bwd_microstep: 1771.15 | bwd_inner_microstep: 1734.99 | bwd_allreduce_microstep: 36.10 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13466
total_samples=12949, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:48,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 04:20:48,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.18 | bwd_microstep: 1780.82 | bwd_inner_microstep: 1683.27 | bwd_allreduce_microstep: 97.49 | step_microstep: 156.56
[2025-08-03 04:20:48,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.20 | bwd: 7340.17 | bwd_inner: 6820.24 | bwd_allreduce: 519.71 | step: 156.87
{'loss': 0.7552, 'learning_rate': 1.282721177107141e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14090
total_samples=12953, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:51,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.70 | bwd_microstep: 1833.16 | bwd_inner_microstep: 1739.75 | bwd_allreduce_microstep: 93.34 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12058
total_samples=12956, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:54,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 833.42 | bwd_microstep: 2023.41 | bwd_inner_microstep: 1584.40 | bwd_allreduce_microstep: 438.94 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14359
total_samples=12960, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:56,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.72 | bwd_microstep: 1724.69 | bwd_inner_microstep: 1696.45 | bwd_allreduce_microstep: 28.17 | step_microstep: 0.19
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12803
total_samples=12964, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:20:59,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 04:20:59,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.53 | bwd_microstep: 2007.80 | bwd_inner_microstep: 1845.83 | bwd_allreduce_microstep: 161.91 | step_microstep: 108.29
[2025-08-03 04:20:59,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2918.30 | bwd: 7589.11 | bwd_inner: 6866.42 | bwd_allreduce: 722.44 | step: 108.72
{'loss': 0.7663, 'learning_rate': 1.2811674965787058e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13488
total_samples=12968, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:02,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.16 | bwd_microstep: 2069.04 | bwd_inner_microstep: 1737.86 | bwd_allreduce_microstep: 331.12 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12212
total_samples=12972, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:05,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.70 | bwd_microstep: 1986.25 | bwd_inner_microstep: 1768.89 | bwd_allreduce_microstep: 217.29 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11694
total_samples=12975, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:07,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.55 | bwd_microstep: 1788.20 | bwd_inner_microstep: 1592.73 | bwd_allreduce_microstep: 195.41 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15353
total_samples=12979, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:10,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 04:21:10,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.27 | bwd_microstep: 1964.43 | bwd_inner_microstep: 1808.80 | bwd_allreduce_microstep: 155.56 | step_microstep: 109.05
[2025-08-03 04:21:10,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.61 | bwd: 7807.97 | bwd_inner: 6908.28 | bwd_allreduce: 899.46 | step: 109.37
{'loss': 0.7674, 'learning_rate': 1.279613078721289e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13333
total_samples=12983, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:12,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.35 | bwd_microstep: 1714.03 | bwd_inner_microstep: 1663.72 | bwd_allreduce_microstep: 50.25 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13112
total_samples=12987, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:15,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.59 | bwd_microstep: 1821.98 | bwd_inner_microstep: 1695.31 | bwd_allreduce_microstep: 126.60 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11695
total_samples=12990, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:18,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.49 | bwd_microstep: 2462.73 | bwd_inner_microstep: 1601.65 | bwd_allreduce_microstep: 861.01 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13295
total_samples=12994, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:21,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 04:21:21,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.11 | bwd_microstep: 1877.94 | bwd_inner_microstep: 1704.16 | bwd_allreduce_microstep: 173.72 | step_microstep: 127.68
[2025-08-03 04:21:21,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.46 | bwd: 7876.73 | bwd_inner: 6664.84 | bwd_allreduce: 1211.65 | step: 128.07
<3:25:17, 10.71s/it]
 43%|████▎     | 851/2000 [2:37:41<3:27:39, 10.84s/it]
 43%|████▎     | 852/2000 [2:37:52<3:29:00, 10.92s/it]
 43%|████▎     | 853/2000 [2:38:03<3:26:59, 10.83s/it]
 43%|████▎     | 854/2000 [2:38:14<3:27:35, 10.87s/it]
 43%|████▎     | 855/2000 [2:38:25<3:28:30, 10.93s/it]
 43%|████▎     | 856/2000 [2:38:36<3:29:27, 10.99s/it]
{'loss': 0.7532, 'learning_rate': 1.2780579276111702e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13592
total_samples=12998, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.91 | bwd_microstep: 1772.00 | bwd_inner_microstep: 1703.86 | bwd_allreduce_microstep: 68.07 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13283
total_samples=13002, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:26,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.28 | bwd_microstep: 1830.68 | bwd_inner_microstep: 1725.79 | bwd_allreduce_microstep: 104.84 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13442
total_samples=13007, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:29,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.94 | bwd_microstep: 1932.87 | bwd_inner_microstep: 1701.37 | bwd_allreduce_microstep: 231.43 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13622
total_samples=13012, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:32,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 04:21:32,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.05 | bwd_microstep: 2189.59 | bwd_inner_microstep: 2024.84 | bwd_allreduce_microstep: 164.68 | step_microstep: 141.60
[2025-08-03 04:21:32,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.11 | bwd: 7725.19 | bwd_inner: 7155.86 | bwd_allreduce: 569.10 | step: 142.02
{'loss': 0.7561, 'learning_rate': 1.276502047326552e-05, 'epoch': 0.43}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12064
total_samples=13015, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:35,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.29 | bwd_microstep: 1795.50 | bwd_inner_microstep: 1565.21 | bwd_allreduce_microstep: 230.22 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13467
total_samples=13019, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:37,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.38 | bwd_microstep: 1854.23 | bwd_inner_microstep: 1796.58 | bwd_allreduce_microstep: 57.58 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12874
total_samples=13022, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:40,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.93 | bwd_microstep: 2042.74 | bwd_inner_microstep: 1811.31 | bwd_allreduce_microstep: 231.37 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13895
total_samples=13026, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:43,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 04:21:43,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.17 | bwd_microstep: 1998.12 | bwd_inner_microstep: 1739.34 | bwd_allreduce_microstep: 258.71 | step_microstep: 140.49
[2025-08-03 04:21:43,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.70 | bwd: 7690.64 | bwd_inner: 6912.43 | bwd_allreduce: 777.97 | step: 140.82
{'loss': 0.7641, 'learning_rate': 1.2749454419475486e-05, 'epoch': 0.43}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12030
total_samples=13029, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:46,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.86 | bwd_microstep: 1817.50 | bwd_inner_microstep: 1580.65 | bwd_allreduce_microstep: 236.79 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13875
total_samples=13033, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:48,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.68 | bwd_microstep: 1865.23 | bwd_inner_microstep: 1707.90 | bwd_allreduce_microstep: 157.27 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13947
total_samples=13037, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:51,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.15 | bwd_microstep: 2034.86 | bwd_inner_microstep: 2028.69 | bwd_allreduce_microstep: 6.10 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11932
total_samples=13040, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:54,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.01
[2025-08-03 04:21:54,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.73 | bwd_microstep: 1879.73 | bwd_inner_microstep: 1610.36 | bwd_allreduce_microstep: 269.31 | step_microstep: 139.06
[2025-08-03 04:21:54,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.35 | bwd: 7597.36 | bwd_inner: 6927.59 | bwd_allreduce: 669.54 | step: 139.49
{'loss': 0.7652, 'learning_rate': 1.273388115556177e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13272
total_samples=13044, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:57,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.61 | bwd_microstep: 1902.98 | bwd_inner_microstep: 1708.82 | bwd_allreduce_microstep: 194.10 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13581
total_samples=13048, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:21:59,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.36 | bwd_microstep: 1975.20 | bwd_inner_microstep: 1930.16 | bwd_allreduce_microstep: 44.98 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12188
total_samples=13051, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:02,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.01 | bwd_microstep: 1753.68 | bwd_inner_microstep: 1567.18 | bwd_allreduce_microstep: 186.44 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13737
total_samples=13055, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:05,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.75
[2025-08-03 04:22:05,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 863.67 | bwd_microstep: 1818.76 | bwd_inner_microstep: 1727.18 | bwd_allreduce_microstep: 91.52 | step_microstep: 112.20
[2025-08-03 04:22:05,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2947.59 | bwd: 7450.68 | bwd_inner: 6933.34 | bwd_allreduce: 517.11 | step: 112.54
{'loss': 0.7631, 'learning_rate': 1.2718300722363431e-05, 'epoch': 0.43}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13420
total_samples=13060, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:07,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.10 | bwd_microstep: 1703.03 | bwd_inner_microstep: 1632.15 | bwd_allreduce_microstep: 70.83 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13935
total_samples=13064, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:10,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.08 | bwd_microstep: 2125.66 | bwd_inner_microstep: 1947.25 | bwd_allreduce_microstep: 178.34 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11670
total_samples=13067, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:13,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.66 | bwd_microstep: 1748.07 | bwd_inner_microstep: 1530.15 | bwd_allreduce_microstep: 217.85 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11986
total_samples=13070, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:16,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 04:22:16,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.16 | bwd_microstep: 1965.21 | bwd_inner_microstep: 1788.15 | bwd_allreduce_microstep: 177.00 | step_microstep: 162.83
[2025-08-03 04:22:16,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.93 | bwd: 7542.02 | bwd_inner: 6897.69 | bwd_allreduce: 644.10 | step: 163.20
{'loss': 0.7575, 'learning_rate': 1.2702713160738344e-05, 'epoch': 0.43}
 43%|████▎     | 857/2000 [2:38:47<3:29:10, 10.98s/it]
 43%|████▎     | 858/2000 [2:38:58<3:28:53, 10.98s/it]
 43%|████▎     | 859/2000 [2:39:09<3:28:23, 10.96s/it]
 43%|████▎     | 860/2000 [2:39:20<3:27:39, 10.93s/it]
 43%|████▎     | 861/2000 [2:39:30<3:26:39, 10.89s/it]
 43%|████▎     | 861/2000 [2:39:30<3
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891
total_samples=13073, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:18,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.32 | bwd_microstep: 2099.37 | bwd_inner_microstep: 1880.38 | bwd_allreduce_microstep: 218.93 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12805
total_samples=13078, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:21,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.69 | bwd_microstep: 1846.47 | bwd_inner_microstep: 1646.91 | bwd_allreduce_microstep: 199.49 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12687
total_samples=13082, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:24,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.06 | bwd_microstep: 1765.43 | bwd_inner_microstep: 1622.77 | bwd_allreduce_microstep: 142.60 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13649
total_samples=13086, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:27,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 04:22:27,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.58 | bwd_microstep: 2002.45 | bwd_inner_microstep: 1888.77 | bwd_allreduce_microstep: 113.62 | step_microstep: 108.53
[2025-08-03 04:22:27,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.59 | bwd: 7713.76 | bwd_inner: 7038.82 | bwd_allreduce: 674.71 | step: 108.85
{'loss': 0.7691, 'learning_rate': 1.2687118511563075e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14988
total_samples=13090, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:29,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.57 | bwd_microstep: 1732.67 | bwd_inner_microstep: 1719.40 | bwd_allreduce_microstep: 13.21 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12606
total_samples=13094, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:32,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.56 | bwd_microstep: 2115.99 | bwd_inner_microstep: 1763.32 | bwd_allreduce_microstep: 352.61 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11771
total_samples=13097, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:34,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.27 | bwd_microstep: 1706.92 | bwd_inner_microstep: 1519.61 | bwd_allreduce_microstep: 187.24 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11781
total_samples=13100, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:37,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 04:22:37,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.72 | bwd_microstep: 2030.22 | bwd_inner_microstep: 1565.29 | bwd_allreduce_microstep: 464.87 | step_microstep: 118.36
[2025-08-03 04:22:37,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.04 | bwd: 7585.84 | bwd_inner: 6567.62 | bwd_allreduce: 1018.00 | step: 118.68
{'loss': 0.7621, 'learning_rate': 1.2671516815732767e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13665
total_samples=13104, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:40,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.08 | bwd_microstep: 1708.58 | bwd_inner_microstep: 1668.76 | bwd_allreduce_microstep: 39.74 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12770
total_samples=13108, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:42,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.97 | bwd_microstep: 1761.17 | bwd_inner_microstep: 1630.32 | bwd_allreduce_microstep: 130.78 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13512
total_samples=13112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:45,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.64 | bwd_microstep: 1772.12 | bwd_inner_microstep: 1699.82 | bwd_allreduce_microstep: 72.22 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14328
total_samples=13116, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:48,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 04:22:48,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.55 | bwd_microstep: 1786.85 | bwd_inner_microstep: 1726.92 | bwd_allreduce_microstep: 59.87 | step_microstep: 134.00
[2025-08-03 04:22:48,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2763.18 | bwd: 7028.75 | bwd_inner: 6725.82 | bwd_allreduce: 302.68 | step: 134.33
{'loss': 0.7619, 'learning_rate': 1.2655908114161053e-05, 'epoch': 0.43}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13881
total_samples=13119, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:50,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.88 | bwd_microstep: 1763.16 | bwd_inner_microstep: 1637.63 | bwd_allreduce_microstep: 125.46 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13324
total_samples=13123, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:53,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.99 | bwd_microstep: 2035.26 | bwd_inner_microstep: 1981.80 | bwd_allreduce_microstep: 53.40 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13903
total_samples=13128, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:56,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.42 | bwd_microstep: 1770.57 | bwd_inner_microstep: 1710.49 | bwd_allreduce_microstep: 60.02 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12106
total_samples=13131, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:22:58,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 04:22:58,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.90 | bwd_microstep: 1757.66 | bwd_inner_microstep: 1556.27 | bwd_allreduce_microstep: 201.32 | step_microstep: 112.62
[2025-08-03 04:22:58,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2745.12 | bwd: 7326.71 | bwd_inner: 6886.20 | bwd_allreduce: 440.27 | step: 113.05
{'loss': 0.7613, 'learning_rate': 1.2640292447779932e-05, 'epoch': 0.43}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11710
total_samples=13134, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:01,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.94 | bwd_microstep: 1932.81 | bwd_inner_microstep: 1926.57 | bwd_allreduce_microstep: 6.18 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11597
total_samples=13137, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:03,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.33 | bwd_microstep: 1711.91 | bwd_inner_microstep: 1516.29 | bwd_allreduce_microstep: 195.55 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13974
total_samples=13141, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:06,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.77 | bwd_microstep: 2044.69 | bwd_inner_microstep: 1906.43 | bwd_allreduce_microstep: 138.19 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13296
total_samples=13145, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:09,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 04:23:09,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.98 | bwd_microstep: 2028.91 | bwd_inner_microstep: 1853.03 | bwd_allreduce_microstep: 175.81 | step_microstep: 113.21
[2025-08-03 04:23:09,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.96 | bwd: 7718.36 | bwd_inner: 7202.32 | bwd_allreduce: 515.81 | step: 113.54
{'loss': 0.7621, 'learning_rate': 1.2624669857539669e-05, 'epoch': 0.43}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12553
total_samples=13149, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:12,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.12 | bwd_microstep: 2055.88 | bwd_inner_microstep: 1830.22 | bwd_allreduce_microstep: 225.59 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13775
total_samples=13153, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:15,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.29 | bwd_microstep: 1755.83 | bwd_inner_microstep: 1691.46 | bwd_allreduce_microstep: 64.31 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12732
total_samples=13156, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:17,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.16 | bwd_microstep: 1771.47 | bwd_inner_microstep: 1597.85 | bwd_allreduce_microstep: 173.54 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13716
total_samples=13160, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:20,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 04:23:20,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.01 | bwd_microstep: 1831.74 | bwd_inner_microstep: 1731.21 | bwd_allreduce_microstep: 100.47 | step_microstep: 114.51
[2025-08-03 04:23:20,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.51 | bwd: 7414.97 | bwd_inner: 6850.74 | bwd_allreduce: 563.99 | step: 114.90
:26:39, 10.89s/it]
 43%|████▎     | 862/2000 [2:39:41<3:27:07, 10.92s/it]
 43%|████▎     | 863/2000 [2:39:52<3:26:22, 10.89s/it]
 43%|████▎     | 864/2000 [2:40:03<3:22:40, 10.70s/it]
 43%|████▎     | 865/2000 [2:40:13<3:21:29, 10.65s/it]
 43%|████▎     | 866/2000 [2:40:24<3:22:52, 10.73s/it]
 43%|████▎     | 867/2000 [2:40:35<3:22:25, 10.72s/it]
{'loss': 0.7671, 'learning_rate': 1.2609040384408685e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13254
total_samples=13164, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:23,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.22 | bwd_microstep: 2134.67 | bwd_inner_microstep: 2008.43 | bwd_allreduce_microstep: 126.17 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13115
total_samples=13168, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:25,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.12 | bwd_microstep: 1786.49 | bwd_inner_microstep: 1696.39 | bwd_allreduce_microstep: 90.03 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12692
total_samples=13172, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:28,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.00 | bwd_microstep: 1912.68 | bwd_inner_microstep: 1648.73 | bwd_allreduce_microstep: 263.88 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15242
total_samples=13176, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:31,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 04:23:31,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.57 | bwd_microstep: 2044.89 | bwd_inner_microstep: 1845.47 | bwd_allreduce_microstep: 199.36 | step_microstep: 114.04
[2025-08-03 04:23:31,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.84 | bwd: 7878.78 | bwd_inner: 7199.02 | bwd_allreduce: 679.52 | step: 114.48
{'loss': 0.7642, 'learning_rate': 1.2593404069373452e-05, 'epoch': 0.43}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12712
total_samples=13179, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:33,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.33 | bwd_microstep: 1769.69 | bwd_inner_microstep: 1585.72 | bwd_allreduce_microstep: 183.90 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11845
total_samples=13182, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:36,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 746.21 | bwd_microstep: 1844.10 | bwd_inner_microstep: 1628.61 | bwd_allreduce_microstep: 215.42 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13188
total_samples=13186, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:39,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.30 | bwd_microstep: 1829.65 | bwd_inner_microstep: 1791.30 | bwd_allreduce_microstep: 38.28 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12715
total_samples=13190, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:42,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 31.25
[2025-08-03 04:23:42,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.98 | bwd_microstep: 1952.88 | bwd_inner_microstep: 1676.43 | bwd_allreduce_microstep: 276.38 | step_microstep: 135.08
[2025-08-03 04:23:42,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.74 | bwd: 7396.36 | bwd_inner: 6682.06 | bwd_allreduce: 714.06 | step: 135.41
{'loss': 0.7529, 'learning_rate': 1.2577760953438382e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15214
total_samples=13194, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:44,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.03 | bwd_microstep: 1911.93 | bwd_inner_microstep: 1795.20 | bwd_allreduce_microstep: 116.66 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13356
total_samples=13198, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:47,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.60 | bwd_microstep: 1731.56 | bwd_inner_microstep: 1689.70 | bwd_allreduce_microstep: 41.80 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13204
total_samples=13202, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:49,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.55 | bwd_microstep: 1778.32 | bwd_inner_microstep: 1686.48 | bwd_allreduce_microstep: 91.77 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13451
total_samples=13206, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:52,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 04:23:52,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.55 | bwd_microstep: 2177.66 | bwd_inner_microstep: 1983.79 | bwd_allreduce_microstep: 193.81 | step_microstep: 148.97
[2025-08-03 04:23:52,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2750.65 | bwd: 7599.51 | bwd_inner: 7155.17 | bwd_allreduce: 444.12 | step: 149.41
{'loss': 0.7586, 'learning_rate': 1.2562111077625723e-05, 'epoch': 0.43}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13271
total_samples=13210, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:56,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.34 | bwd_microstep: 2215.27 | bwd_inner_microstep: 1939.38 | bwd_allreduce_microstep: 275.83 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468
total_samples=13214, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:23:59,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.34 | bwd_microstep: 2241.70 | bwd_inner_microstep: 2117.10 | bwd_allreduce_microstep: 124.54 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14168
total_samples=13218, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:01,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.50 | bwd_microstep: 1949.68 | bwd_inner_microstep: 1833.51 | bwd_allreduce_microstep: 116.11 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13719
total_samples=13222, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:04,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 04:24:04,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.93 | bwd_microstep: 1797.54 | bwd_inner_microstep: 1703.83 | bwd_allreduce_microstep: 93.63 | step_microstep: 163.33
[2025-08-03 04:24:04,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.03 | bwd: 8204.24 | bwd_inner: 7593.81 | bwd_allreduce: 610.19 | step: 163.78
{'loss': 0.7607, 'learning_rate': 1.2546454482975454e-05, 'epoch': 0.44}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13769
total_samples=13227, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:07,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.19 | bwd_microstep: 1931.93 | bwd_inner_microstep: 1908.01 | bwd_allreduce_microstep: 23.86 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13509
total_samples=13231, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:10,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.33 | bwd_microstep: 2083.68 | bwd_inner_microstep: 1932.52 | bwd_allreduce_microstep: 151.10 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12700
total_samples=13235, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:12,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.03 | bwd_microstep: 1732.37 | bwd_inner_microstep: 1631.28 | bwd_allreduce_microstep: 101.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13462
total_samples=13239, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:15,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.73
[2025-08-03 04:24:15,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.69 | bwd_microstep: 1756.15 | bwd_inner_microstep: 1688.40 | bwd_allreduce_microstep: 67.69 | step_microstep: 147.51
[2025-08-03 04:24:15,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2855.17 | bwd: 7504.18 | bwd_inner: 7160.20 | bwd_allreduce: 343.75 | step: 147.84
{'loss': 0.7548, 'learning_rate': 1.2530791210545163e-05, 'epoch': 0.44}
 43%|████▎     | 867/2000 [2:40:35<3:22:25, 10.72s/it]
 43%|████▎     | 868/2000 [2:40:46<3:24:27, 10.84s/it]
 43%|████▎     | 869/2000 [2:40:56<3:23:34, 10.80s/it]
 44%|████▎     | 870/2000 [2:41:07<3:23:41, 10.82s/it]
 44%|████▎     | 871/2000 [2:41:19<3:27:24, 11.02s/it]
 44%|████▎     | 872/2000 [2:41:30<3:26:02, 10.96s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12502
total_samples=13243, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:18,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.09 | bwd_microstep: 1943.60 | bwd_inner_microstep: 1822.04 | bwd_allreduce_microstep: 121.50 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15005
total_samples=13248, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:20,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.39 | bwd_microstep: 2138.21 | bwd_inner_microstep: 2109.51 | bwd_allreduce_microstep: 28.65 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13384
total_samples=13252, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:23,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.12 | bwd_microstep: 1871.50 | bwd_inner_microstep: 1762.69 | bwd_allreduce_microstep: 108.74 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11952
total_samples=13255, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:26,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 04:24:26,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.36 | bwd_microstep: 1776.91 | bwd_inner_microstep: 1563.40 | bwd_allreduce_microstep: 213.44 | step_microstep: 149.64
[2025-08-03 04:24:26,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2743.90 | bwd: 7730.27 | bwd_inner: 7257.63 | bwd_allreduce: 472.41 | step: 149.98
{'loss': 0.7653, 'learning_rate': 1.251512130140996e-05, 'epoch': 0.44}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13443
total_samples=13259, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:29,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.00 | bwd_microstep: 1993.26 | bwd_inner_microstep: 1858.85 | bwd_allreduce_microstep: 134.35 | step_microstep: 0.15
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12955
total_samples=13263, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:31,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.91 | bwd_microstep: 1847.74 | bwd_inner_microstep: 1694.54 | bwd_allreduce_microstep: 153.14 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15009
total_samples=13267, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:34,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.49 | bwd_microstep: 1960.24 | bwd_inner_microstep: 1863.47 | bwd_allreduce_microstep: 96.70 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13163
total_samples=13271, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:37,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51
[2025-08-03 04:24:37,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.74 | bwd_microstep: 1727.63 | bwd_inner_microstep: 1665.67 | bwd_allreduce_microstep: 61.90 | step_microstep: 135.92
[2025-08-03 04:24:37,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2824.07 | bwd: 7528.93 | bwd_inner: 7082.53 | bwd_allreduce: 446.16 | step: 136.39
{'loss': 0.7619, 'learning_rate': 1.2499444796662354e-05, 'epoch': 0.44}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11910
total_samples=13274, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:39,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.03 | bwd_microstep: 1768.26 | bwd_inner_microstep: 1573.94 | bwd_allreduce_microstep: 194.25 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12896
total_samples=13280, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:42,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.38 | bwd_microstep: 2044.39 | bwd_inner_microstep: 1657.85 | bwd_allreduce_microstep: 386.49 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14295
total_samples=13284, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:44,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.21 | bwd_microstep: 1778.52 | bwd_inner_microstep: 1733.71 | bwd_allreduce_microstep: 44.75 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13654
total_samples=13288, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:47,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 04:24:47,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.87 | bwd_microstep: 1719.68 | bwd_inner_microstep: 1673.10 | bwd_allreduce_microstep: 46.51 | step_microstep: 131.47
[2025-08-03 04:24:47,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.43 | bwd: 7310.89 | bwd_inner: 6638.59 | bwd_allreduce: 672.07 | step: 131.81
{'loss': 0.7494, 'learning_rate': 1.248376173741215e-05, 'epoch': 0.44}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11661
total_samples=13291, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:50,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.65 | bwd_microstep: 1955.23 | bwd_inner_microstep: 1789.72 | bwd_allreduce_microstep: 165.45 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11772
total_samples=13294, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:52,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.84 | bwd_microstep: 1817.31 | bwd_inner_microstep: 1550.30 | bwd_allreduce_microstep: 266.95 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13201
total_samples=13298, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:55,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.32 | bwd_microstep: 1735.39 | bwd_inner_microstep: 1666.80 | bwd_allreduce_microstep: 68.53 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13390
total_samples=13302, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:24:58,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 04:24:58,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.65 | bwd_microstep: 2023.16 | bwd_inner_microstep: 1843.94 | bwd_allreduce_microstep: 179.15 | step_microstep: 116.04
[2025-08-03 04:24:58,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.40 | bwd: 7531.14 | bwd_inner: 6850.75 | bwd_allreduce: 680.15 | step: 116.38
{'loss': 0.7672, 'learning_rate': 1.2468072164786342e-05, 'epoch': 0.44}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13492
total_samples=13306, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:00,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.36 | bwd_microstep: 1743.80 | bwd_inner_microstep: 1697.08 | bwd_allreduce_microstep: 46.66 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11976
total_samples=13309, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:03,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.72 | bwd_microstep: 1835.02 | bwd_inner_microstep: 1603.42 | bwd_allreduce_microstep: 231.54 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13304
total_samples=13313, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:06,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.85 | bwd_microstep: 1818.68 | bwd_inner_microstep: 1694.93 | bwd_allreduce_microstep: 123.68 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13773
total_samples=13317, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:09,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 04:25:09,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.63 | bwd_microstep: 1958.76 | bwd_inner_microstep: 1690.01 | bwd_allreduce_microstep: 268.68 | step_microstep: 143.48
[2025-08-03 04:25:09,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.49 | bwd: 7356.30 | bwd_inner: 6685.44 | bwd_allreduce: 670.64 | step: 143.82
{'loss': 0.7714, 'learning_rate': 1.2452376119929009e-05, 'epoch': 0.44}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15009
total_samples=13321, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:11,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.06 | bwd_microstep: 1824.30 | bwd_inner_microstep: 1777.07 | bwd_allreduce_microstep: 47.16 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14451
total_samples=13325, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:14,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.35 | bwd_microstep: 1799.33 | bwd_inner_microstep: 1754.52 | bwd_allreduce_microstep: 44.75 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12899
total_samples=13329, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:17,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.64 | bwd_microstep: 2032.72 | bwd_inner_microstep: 1863.88 | bwd_allreduce_microstep: 168.78 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13723
total_samples=13334, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:19,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.68
[2025-08-03 04:25:19,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.94 | bwd_microstep: 1887.04 | bwd_inner_microstep: 1731.50 | bwd_allreduce_microstep: 155.48 | step_microstep: 112.30
[2025-08-03 04:25:19,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.91 | bwd: 7543.44 | bwd_inner: 7126.97 | bwd_allreduce: 416.23 | step: 112.62
 44%|████▎     | 873/2000 [2:41:41<3:25:56, 10.96s/it]
 44%|████▎     | 874/2000 [2:41:51<3:24:54, 10.92s/it]
 44%|████▍     | 875/2000 [2:42:02<3:22:36, 10.81s/it]
 44%|████▍     | 876/2000 [2:42:13<3:22:17, 10.80s/it]
 44%|████▍     | 877/2000 [2:42:23<3:21:04, 10.74s/it]
 44%|████▍     | 878/2000 [2:42:34<3:21:19, 10.77s/it]
{'loss': 0.7645, 'learning_rate': 1.2436673644001196e-05, 'epoch': 0.44}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14999
total_samples=13338, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:22,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.05 | bwd_microstep: 2126.63 | bwd_inner_microstep: 1992.39 | bwd_allreduce_microstep: 134.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13537
total_samples=13342, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:25,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.29 | bwd_microstep: 1810.92 | bwd_inner_microstep: 1718.58 | bwd_allreduce_microstep: 92.28 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14191
total_samples=13347, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:27,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.35 | bwd_microstep: 1844.18 | bwd_inner_microstep: 1828.85 | bwd_allreduce_microstep: 15.26 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13811
total_samples=13351, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:30,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.66
[2025-08-03 04:25:30,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.41 | bwd_microstep: 1934.50 | bwd_inner_microstep: 1928.62 | bwd_allreduce_microstep: 5.82 | step_microstep: 112.01
[2025-08-03 04:25:30,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.03 | bwd: 7716.28 | bwd_inner: 7468.43 | bwd_allreduce: 247.63 | step: 112.34
{'loss': 0.754, 'learning_rate': 1.2420964778180815e-05, 'epoch': 0.44}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12693
total_samples=13355, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:33,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.80 | bwd_microstep: 1932.66 | bwd_inner_microstep: 1637.74 | bwd_allreduce_microstep: 294.84 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13242
total_samples=13359, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:36,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.74 | bwd_microstep: 2048.87 | bwd_inner_microstep: 1692.98 | bwd_allreduce_microstep: 355.83 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13513
total_samples=13364, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:39,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.80 | bwd_microstep: 2265.42 | bwd_inner_microstep: 2092.60 | bwd_allreduce_microstep: 172.76 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13799
total_samples=13368, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:42,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.63
[2025-08-03 04:25:42,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.17 | bwd_microstep: 2082.49 | bwd_inner_microstep: 1943.91 | bwd_allreduce_microstep: 138.52 | step_microstep: 114.03
[2025-08-03 04:25:42,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.45 | bwd: 8329.50 | bwd_inner: 7367.22 | bwd_allreduce: 962.03 | step: 114.39
{'loss': 0.7608, 'learning_rate': 1.2405249563662539e-05, 'epoch': 0.44}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809
total_samples=13371, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:45,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.76 | bwd_microstep: 1809.39 | bwd_inner_microstep: 1572.86 | bwd_allreduce_microstep: 236.47 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15878
total_samples=13375, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:47,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.62 | bwd_microstep: 1809.54 | bwd_inner_microstep: 1803.48 | bwd_allreduce_microstep: 6.00 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13728
total_samples=13379, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:50,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.83 | bwd_microstep: 1790.45 | bwd_inner_microstep: 1701.07 | bwd_allreduce_microstep: 89.32 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13426
total_samples=13383, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:52,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 33.34
[2025-08-03 04:25:52,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.84 | bwd_microstep: 1885.78 | bwd_inner_microstep: 1812.05 | bwd_allreduce_microstep: 73.67 | step_microstep: 150.99
[2025-08-03 04:25:52,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.96 | bwd: 7295.21 | bwd_inner: 6889.45 | bwd_allreduce: 405.54 | step: 151.34
{'loss': 0.7584, 'learning_rate': 1.2389528041657679e-05, 'epoch': 0.44}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15860
total_samples=13387, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:55,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.34 | bwd_microstep: 1933.46 | bwd_inner_microstep: 1927.51 | bwd_allreduce_microstep: 5.89 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13763
total_samples=13392, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:25:58,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 995.48 | bwd_microstep: 1719.75 | bwd_inner_microstep: 1653.16 | bwd_allreduce_microstep: 66.53 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13424
total_samples=13396, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:01,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.12 | bwd_microstep: 2115.58 | bwd_inner_microstep: 2018.16 | bwd_allreduce_microstep: 97.35 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13416
total_samples=13400, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:04,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 04:26:04,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.99 | bwd_microstep: 1734.35 | bwd_inner_microstep: 1670.01 | bwd_allreduce_microstep: 64.28 | step_microstep: 136.17
[2025-08-03 04:26:04,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3056.85 | bwd: 7503.18 | bwd_inner: 7268.84 | bwd_allreduce: 234.11 | step: 136.50
{'loss': 0.7444, 'learning_rate': 1.23738002533941e-05, 'epoch': 0.44}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13977
total_samples=13404, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:06,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.68 | bwd_microstep: 1707.41 | bwd_inner_microstep: 1676.22 | bwd_allreduce_microstep: 31.13 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11736
total_samples=13407, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:09,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.68 | bwd_microstep: 1780.50 | bwd_inner_microstep: 1550.09 | bwd_allreduce_microstep: 230.32 | step_microstep: 0.14
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13024
total_samples=13412, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:12,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.16 | bwd_microstep: 2770.86 | bwd_inner_microstep: 2291.31 | bwd_allreduce_microstep: 479.45 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14026
total_samples=13416, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:15,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 04:26:15,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.23 | bwd_microstep: 1863.10 | bwd_inner_microstep: 1829.36 | bwd_allreduce_microstep: 33.68 | step_microstep: 118.77
[2025-08-03 04:26:15,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.69 | bwd: 8121.93 | bwd_inner: 7346.99 | bwd_allreduce: 774.65 | step: 119.13
{'loss': 0.7608, 'learning_rate': 1.2358066240116092e-05, 'epoch': 0.44}
 44%|████▍     | 879/2000 [2:42:45<3:22:16, 10.83s/it]
 44%|████▍     | 880/2000 [2:42:57<3:26:23, 11.06s/it]
 44%|████▍     | 881/2000 [2:43:07<3:23:37, 10.92s/it]
 44%|████▍     | 882/2000 [2:43:18<3:24:06, 10.95s/it]
 44%|████▍     | 883/2000 [2:43:30<3:26:10, 11.08s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12099
total_samples=13419, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:17,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.43 | bwd_microstep: 1772.11 | bwd_inner_microstep: 1564.14 | bwd_allreduce_microstep: 207.89 | step_microstep: 0.16
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12921
total_samples=13423, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:20,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.65 | bwd_microstep: 1712.31 | bwd_inner_microstep: 1635.25 | bwd_allreduce_microstep: 77.00 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11801
total_samples=13426, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:22,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.51 | bwd_microstep: 1751.69 | bwd_inner_microstep: 1541.81 | bwd_allreduce_microstep: 209.82 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13226
total_samples=13430, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:25,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 04:26:25,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.32 | bwd_microstep: 2071.28 | bwd_inner_microstep: 1712.43 | bwd_allreduce_microstep: 358.79 | step_microstep: 114.34
[2025-08-03 04:26:25,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.85 | bwd: 7307.43 | bwd_inner: 6453.63 | bwd_allreduce: 853.57 | step: 114.84
{'loss': 0.7611, 'learning_rate': 1.2342326043084268e-05, 'epoch': 0.44}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11980
total_samples=13433, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:28,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.94 | bwd_microstep: 1784.26 | bwd_inner_microstep: 1555.95 | bwd_allreduce_microstep: 228.25 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14421
total_samples=13437, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:31,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.68 | bwd_microstep: 2072.84 | bwd_inner_microstep: 1996.79 | bwd_allreduce_microstep: 75.98 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12534
total_samples=13441, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:33,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.60 | bwd_microstep: 1817.59 | bwd_inner_microstep: 1622.60 | bwd_allreduce_microstep: 194.92 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13293
total_samples=13446, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:36,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 04:26:36,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.42 | bwd_microstep: 1845.08 | bwd_inner_microstep: 1705.15 | bwd_allreduce_microstep: 139.87 | step_microstep: 159.50
[2025-08-03 04:26:36,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.56 | bwd: 7519.82 | bwd_inner: 6880.49 | bwd_allreduce: 639.09 | step: 159.82
{'loss': 0.7687, 'learning_rate': 1.2326579703575464e-05, 'epoch': 0.44}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11788
total_samples=13449, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:39,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.91 | bwd_microstep: 1903.88 | bwd_inner_microstep: 1594.42 | bwd_allreduce_microstep: 309.40 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 11879
total_samples=13453, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:42,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 757.65 | bwd_microstep: 1830.36 | bwd_inner_microstep: 1583.03 | bwd_allreduce_microstep: 247.27 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11868
total_samples=13456, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:44,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 932.06 | bwd_microstep: 1785.61 | bwd_inner_microstep: 1594.75 | bwd_allreduce_microstep: 190.79 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13109
total_samples=13460, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:47,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 04:26:47,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.29 | bwd_microstep: 1902.02 | bwd_inner_microstep: 1895.98 | bwd_allreduce_microstep: 5.97 | step_microstep: 133.12
[2025-08-03 04:26:47,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3096.84 | bwd: 7421.92 | bwd_inner: 6668.17 | bwd_allreduce: 753.51 | step: 133.44
{'loss': 0.7699, 'learning_rate': 1.2310827262882614e-05, 'epoch': 0.44}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13116
total_samples=13464, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:50,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.01 | bwd_microstep: 1855.56 | bwd_inner_microstep: 1681.78 | bwd_allreduce_microstep: 173.72 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11765
total_samples=13467, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:53,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.19 | bwd_microstep: 2168.49 | bwd_inner_microstep: 1954.91 | bwd_allreduce_microstep: 213.52 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11586
total_samples=13470, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:56,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.01 | bwd_microstep: 2161.39 | bwd_inner_microstep: 1941.46 | bwd_allreduce_microstep: 219.87 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12132
total_samples=13473, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:26:58,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 04:26:58,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.70 | bwd_microstep: 1880.29 | bwd_inner_microstep: 1755.25 | bwd_allreduce_microstep: 124.99 | step_microstep: 127.88
[2025-08-03 04:26:58,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2762.84 | bwd: 8065.79 | bwd_inner: 7333.38 | bwd_allreduce: 732.17 | step: 128.33
{'loss': 0.7461, 'learning_rate': 1.2295068762314661e-05, 'epoch': 0.44}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13497
total_samples=13477, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:01,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.91 | bwd_microstep: 1787.38 | bwd_inner_microstep: 1703.25 | bwd_allreduce_microstep: 84.06 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11620
total_samples=13480, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:04,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.64 | bwd_microstep: 1806.91 | bwd_inner_microstep: 1576.70 | bwd_allreduce_microstep: 230.15 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14075
total_samples=13485, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:06,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.64 | bwd_microstep: 1949.69 | bwd_inner_microstep: 1766.02 | bwd_allreduce_microstep: 183.60 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13450
total_samples=13489, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:09,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 04:27:09,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.31 | bwd_microstep: 1825.51 | bwd_inner_microstep: 1633.20 | bwd_allreduce_microstep: 192.24 | step_microstep: 131.73
[2025-08-03 04:27:09,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.43 | bwd: 7369.53 | bwd_inner: 6679.18 | bwd_allreduce: 690.13 | step: 132.06
{'loss': 0.7519, 'learning_rate': 1.2279304243196438e-05, 'epoch': 0.44}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11988
total_samples=13492, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:12,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.10 | bwd_microstep: 1942.36 | bwd_inner_microstep: 1724.82 | bwd_allreduce_microstep: 217.47 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14417
total_samples=13496, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:15,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.70 | bwd_microstep: 2150.53 | bwd_inner_microstep: 2014.51 | bwd_allreduce_microstep: 135.94 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13548
total_samples=13500, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:17,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.02 | bwd_microstep: 1828.12 | bwd_inner_microstep: 1710.62 | bwd_allreduce_microstep: 117.43 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13593
total_samples=13504, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:20,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 04:27:20,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.65 | bwd_microstep: 1724.53 | bwd_inner_microstep: 1659.49 | bwd_allreduce_microstep: 64.97 | step_microstep: 122.02
[2025-08-03 04:27:20,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.39 | bwd: 7645.58 | bwd_inner: 7109.44 | bwd_allreduce: 535.90 | step: 122.46
44%|████▍     | 884/2000 [2:43:40<3:23:18, 10.93s/it]
44%|████▍     | 885/2000 [2:43:51<3:22:19, 10.89s/it]
44%|████▍     | 886/2000 [2:44:02<3:22:24, 10.90s/it]
44%|████▍     | 887/2000 [2:44:13<3:24:21, 11.02s/it]
44%|████▍     | 888/2000 [2:44:24<3:21:55, 10.90s/it]
44%|████▍     | 889/2000 [2:44:35<3:21:44, 10.89s/it]
{'loss': 0.7466, 'learning_rate': 1.2263533746868552e-05, 'epoch': 0.44}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14449
total_samples=13508, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:23,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.36 | bwd_microstep: 1922.34 | bwd_inner_microstep: 1736.33 | bwd_allreduce_microstep: 185.94 | step_microstep: 0.14
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12663
total_samples=13512, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:26,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.94 | bwd_microstep: 2235.61 | bwd_inner_microstep: 2130.13 | bwd_allreduce_microstep: 105.42 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13761
total_samples=13516, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:28,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.95 | bwd_microstep: 1811.10 | bwd_inner_microstep: 1712.24 | bwd_allreduce_microstep: 98.79 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12761
total_samples=13520, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:31,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 04:27:31,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.84 | bwd_microstep: 2032.37 | bwd_inner_microstep: 1669.22 | bwd_allreduce_microstep: 363.09 | step_microstep: 111.77
[2025-08-03 04:27:31,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.03 | bwd: 8001.48 | bwd_inner: 7247.91 | bwd_allreduce: 753.33 | step: 112.28
{'loss': 0.7649, 'learning_rate': 1.2247757314687296e-05, 'epoch': 0.45}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12239
total_samples=13523, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:34,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.96 | bwd_microstep: 1825.86 | bwd_inner_microstep: 1585.11 | bwd_allreduce_microstep: 240.68 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12244
total_samples=13527, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:36,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 755.78 | bwd_microstep: 1826.83 | bwd_inner_microstep: 1591.46 | bwd_allreduce_microstep: 235.30 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12777
total_samples=13531, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:39,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.53 | bwd_microstep: 1808.50 | bwd_inner_microstep: 1616.93 | bwd_allreduce_microstep: 191.50 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13691
total_samples=13535, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:42,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 04:27:42,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.38 | bwd_microstep: 1844.25 | bwd_inner_microstep: 1698.37 | bwd_allreduce_microstep: 145.82 | step_microstep: 146.79
[2025-08-03 04:27:42,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2884.58 | bwd: 7305.49 | bwd_inner: 6491.87 | bwd_allreduce: 813.38 | step: 147.27
{'loss': 0.7679, 'learning_rate': 1.2231974988024522e-05, 'epoch': 0.45}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13778
total_samples=13539, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:45,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.36 | bwd_microstep: 1852.64 | bwd_inner_microstep: 1802.24 | bwd_allreduce_microstep: 50.33 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11817
total_samples=13542, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:47,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.48 | bwd_microstep: 1720.15 | bwd_inner_microstep: 1542.62 | bwd_allreduce_microstep: 177.46 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12529
total_samples=13546, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:50,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.48 | bwd_microstep: 2155.71 | bwd_inner_microstep: 1820.70 | bwd_allreduce_microstep: 334.95 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14014
total_samples=13550, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:53,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 04:27:53,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.50 | bwd_microstep: 2236.76 | bwd_inner_microstep: 1768.86 | bwd_allreduce_microstep: 467.79 | step_microstep: 136.69
[2025-08-03 04:27:53,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.75 | bwd: 7965.31 | bwd_inner: 6934.46 | bwd_allreduce: 1030.59 | step: 137.14
{'loss': 0.7631, 'learning_rate': 1.2216186808267544e-05, 'epoch': 0.45}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13296
total_samples=13554, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:56,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.76 | bwd_microstep: 1837.75 | bwd_inner_microstep: 1703.37 | bwd_allreduce_microstep: 134.31 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13426
total_samples=13558, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:27:59,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.91 | bwd_microstep: 2023.07 | bwd_inner_microstep: 2016.72 | bwd_allreduce_microstep: 6.29 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12391
total_samples=13561, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:01,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.11 | bwd_microstep: 1736.63 | bwd_inner_microstep: 1576.05 | bwd_allreduce_microstep: 160.51 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13352
total_samples=13566, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:04,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 04:28:04,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.54 | bwd_microstep: 1775.79 | bwd_inner_microstep: 1694.26 | bwd_allreduce_microstep: 81.47 | step_microstep: 115.41
[2025-08-03 04:28:04,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.26 | bwd: 7373.29 | bwd_inner: 6990.40 | bwd_allreduce: 382.67 | step: 115.84
{'loss': 0.7568, 'learning_rate': 1.2200392816819022e-05, 'epoch': 0.45}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14416
total_samples=13570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:07,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.23 | bwd_microstep: 2014.14 | bwd_inner_microstep: 1862.26 | bwd_allreduce_microstep: 151.81 | step_microstep: 0.17
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13395
total_samples=13574, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:09,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.23 | bwd_microstep: 2195.69 | bwd_inner_microstep: 1908.24 | bwd_allreduce_microstep: 287.39 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13794
total_samples=13578, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:12,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.76 | bwd_microstep: 1989.71 | bwd_inner_microstep: 1758.06 | bwd_allreduce_microstep: 231.59 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12150
total_samples=13581, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:15,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 04:28:15,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.19 | bwd_microstep: 2185.75 | bwd_inner_microstep: 1820.38 | bwd_allreduce_microstep: 365.31 | step_microstep: 128.38
[2025-08-03 04:28:15,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.34 | bwd: 8385.35 | bwd_inner: 7348.93 | bwd_allreduce: 1036.18 | step: 128.87
{'loss': 0.7635, 'learning_rate': 1.2184593055096853e-05, 'epoch': 0.45}
44%|████▍     | 890/2000 [2:44:46<3:23:24, 11.00s/it]
45%|████▍     | 891/2000 [2:44:57<3:21:14, 10.89s/it]
45%|████▍     | 892/2000 [2:45:08<3:23:11, 11.00s/it]
45%|████▍     | 893/2000 [2:45:19<3:20:47, 10.88s/it]
45%|████▍     | 894/2000 [2:45:30<3:24:41, 11.10s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13838
total_samples=13585, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:18,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.39 | bwd_microstep: 1850.42 | bwd_inner_microstep: 1736.41 | bwd_allreduce_microstep: 113.94 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11715
total_samples=13588, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:21,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.02 | bwd_microstep: 1803.93 | bwd_inner_microstep: 1569.69 | bwd_allreduce_microstep: 234.18 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12400
total_samples=13591, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:23,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.50 | bwd_microstep: 1953.09 | bwd_inner_microstep: 1600.73 | bwd_allreduce_microstep: 352.30 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13209
total_samples=13595, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:26,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 04:28:26,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.40 | bwd_microstep: 1972.51 | bwd_inner_microstep: 1838.14 | bwd_allreduce_microstep: 134.30 | step_microstep: 145.00
[2025-08-03 04:28:26,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.24 | bwd: 7580.00 | bwd_inner: 6744.97 | bwd_allreduce: 834.81 | step: 145.35
{'loss': 0.7633, 'learning_rate': 1.2168787564534078e-05, 'epoch': 0.45}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13899
total_samples=13599, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:29,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.96 | bwd_microstep: 1963.85 | bwd_inner_microstep: 1749.23 | bwd_allreduce_microstep: 214.55 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11681
total_samples=13602, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:32,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.60 | bwd_microstep: 2040.14 | bwd_inner_microstep: 1823.24 | bwd_allreduce_microstep: 216.83 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13726
total_samples=13606, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:35,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.67 | bwd_microstep: 1989.78 | bwd_inner_microstep: 1875.44 | bwd_allreduce_microstep: 114.28 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13250
total_samples=13610, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:37,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 04:28:37,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.40 | bwd_microstep: 1795.91 | bwd_inner_microstep: 1697.92 | bwd_allreduce_microstep: 97.93 | step_microstep: 118.03
[2025-08-03 04:28:37,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.55 | bwd: 7789.73 | bwd_inner: 7145.82 | bwd_allreduce: 643.67 | step: 118.50
{'loss': 0.7574, 'learning_rate': 1.215297638657875e-05, 'epoch': 0.45}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13405
total_samples=13614, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:40,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.39 | bwd_microstep: 1735.37 | bwd_inner_microstep: 1666.32 | bwd_allreduce_microstep: 68.98 | step_microstep: 0.13
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12041
total_samples=13618, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:42,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.47 | bwd_microstep: 1783.76 | bwd_inner_microstep: 1567.48 | bwd_allreduce_microstep: 216.21 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13660
total_samples=13622, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:45,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.24 | bwd_microstep: 1770.23 | bwd_inner_microstep: 1697.18 | bwd_allreduce_microstep: 72.98 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13480
total_samples=13626, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:48,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 04:28:48,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.23 | bwd_microstep: 1796.90 | bwd_inner_microstep: 1712.63 | bwd_allreduce_microstep: 84.21 | step_microstep: 440.26
[2025-08-03 04:28:48,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.26 | bwd: 7086.31 | bwd_inner: 6643.60 | bwd_allreduce: 442.47 | step: 440.72
{'loss': 0.7637, 'learning_rate': 1.2137159562693839e-05, 'epoch': 0.45}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 15629
total_samples=13630, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:51,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.58 | bwd_microstep: 2540.35 | bwd_inner_microstep: 2395.88 | bwd_allreduce_microstep: 144.41 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11841
total_samples=13633, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:54,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.38 | bwd_microstep: 1839.61 | bwd_inner_microstep: 1601.78 | bwd_allreduce_microstep: 237.76 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14871
total_samples=13637, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:28:58,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1509.81 | bwd_microstep: 2602.61 | bwd_inner_microstep: 2448.23 | bwd_allreduce_microstep: 154.31 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13500
total_samples=13641, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:01,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.73
[2025-08-03 04:29:01,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.69 | bwd_microstep: 1720.21 | bwd_inner_microstep: 1674.37 | bwd_allreduce_microstep: 45.76 | step_microstep: 114.09
[2025-08-03 04:29:01,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3603.39 | bwd: 8702.82 | bwd_inner: 8120.25 | bwd_allreduce: 582.34 | step: 114.46
{'loss': 0.7674, 'learning_rate': 1.2121337134357121e-05, 'epoch': 0.45}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11759
total_samples=13644, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:04,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.88 | bwd_microstep: 1997.39 | bwd_inner_microstep: 1775.88 | bwd_allreduce_microstep: 221.44 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11777
total_samples=13648, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:06,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.73 | bwd_microstep: 2003.54 | bwd_inner_microstep: 1809.47 | bwd_allreduce_microstep: 194.00 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12902
total_samples=13652, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:09,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.02 | bwd_microstep: 1847.41 | bwd_inner_microstep: 1666.97 | bwd_allreduce_microstep: 180.38 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13617
total_samples=13656, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:12,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 04:29:12,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.33 | bwd_microstep: 2208.47 | bwd_inner_microstep: 2055.20 | bwd_allreduce_microstep: 153.15 | step_microstep: 120.36
[2025-08-03 04:29:12,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.90 | bwd: 8056.86 | bwd_inner: 7307.54 | bwd_allreduce: 749.04 | step: 120.81
{'loss': 0.7575, 'learning_rate': 1.2105509143061072e-05, 'epoch': 0.45}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11848
total_samples=13659, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:15,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.80 | bwd_microstep: 1830.41 | bwd_inner_microstep: 1569.87 | bwd_allreduce_microstep: 260.47 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12239
total_samples=13662, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:17,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.97 | bwd_microstep: 1842.73 | bwd_inner_microstep: 1606.50 | bwd_allreduce_microstep: 236.16 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13395
total_samples=13666, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:20,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.09 | bwd_microstep: 1836.93 | bwd_inner_microstep: 1713.18 | bwd_allreduce_microstep: 123.67 | step_microstep: 0.19
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12379
total_samples=13670, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:23,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 04:29:23,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.96 | bwd_microstep: 1986.05 | bwd_inner_microstep: 1851.74 | bwd_allreduce_microstep: 134.25 | step_microstep: 115.55
[2025-08-03 04:29:23,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.75 | bwd: 7496.16 | bwd_inner: 6741.29 | bwd_allreduce: 754.63 | step: 116.06
45%|████▍     | 895/2000 [2:45:41<3:23:20, 11.04s/it]
45%|████▍     | 896/2000 [2:45:52<3:23:10, 11.04s/it]
45%|████▍     | 897/2000 [2:46:03<3:20:57, 10.93s/it]
45%|████▍     | 898/2000 [2:46:16<3:30:54, 11.48s/it]
45%|████▍     | 899/2000 [2:46:27<3:29:47, 11.43s/it]
45%|████▌     | 900/2000 [2:46:38<3:26:01, 11.24s/it]
{'loss': 0.753, 'learning_rate': 1.2089675630312755e-05, 'epoch': 0.45}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11977
total_samples=13673, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:26,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.11 | bwd_microstep: 2143.60 | bwd_inner_microstep: 2004.28 | bwd_allreduce_microstep: 139.26 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12068
total_samples=13676, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:29,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 959.44 | bwd_microstep: 1742.46 | bwd_inner_microstep: 1577.05 | bwd_allreduce_microstep: 165.34 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14555
total_samples=13680, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:31,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.88 | bwd_microstep: 2124.71 | bwd_inner_microstep: 1942.98 | bwd_allreduce_microstep: 181.66 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13242
total_samples=13684, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:34,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 04:29:34,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.68 | bwd_microstep: 2062.80 | bwd_inner_microstep: 1990.45 | bwd_allreduce_microstep: 72.28 | step_microstep: 113.05
[2025-08-03 04:29:34,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3058.01 | bwd: 8073.62 | bwd_inner: 7514.76 | bwd_allreduce: 558.63 | step: 113.50
{'loss': 0.7638, 'learning_rate': 1.2073836637633705e-05, 'epoch': 0.45}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12243
total_samples=13688, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:37,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.40 | bwd_microstep: 1834.90 | bwd_inner_microstep: 1587.23 | bwd_allreduce_microstep: 247.60 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11949
total_samples=13691, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:40,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.37 | bwd_microstep: 2000.13 | bwd_inner_microstep: 1787.81 | bwd_allreduce_microstep: 212.26 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12285
total_samples=13694, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:43,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.27 | bwd_microstep: 1932.34 | bwd_inner_microstep: 1802.41 | bwd_allreduce_microstep: 129.87 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13401
total_samples=13699, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:45,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 04:29:45,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 769.91 | bwd_microstep: 1845.21 | bwd_inner_microstep: 1690.08 | bwd_allreduce_microstep: 155.06 | step_microstep: 116.09
[2025-08-03 04:29:45,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2932.88 | bwd: 7612.62 | bwd_inner: 6867.53 | bwd_allreduce: 744.87 | step: 116.52
{'loss': 0.7606, 'learning_rate': 1.2057992206559837e-05, 'epoch': 0.45}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11780
total_samples=13702, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:48,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.85 | bwd_microstep: 1999.31 | bwd_inner_microstep: 1763.01 | bwd_allreduce_microstep: 236.24 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12148
total_samples=13705, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:51,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.12 | bwd_microstep: 1786.93 | bwd_inner_microstep: 1588.31 | bwd_allreduce_microstep: 198.55 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12891
total_samples=13708, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:53,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.17 | bwd_microstep: 1718.23 | bwd_inner_microstep: 1588.88 | bwd_allreduce_microstep: 129.28 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13180
total_samples=13712, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:56,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 04:29:56,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.03 | bwd_microstep: 2088.92 | bwd_inner_microstep: 1937.15 | bwd_allreduce_microstep: 151.70 | step_microstep: 128.99
[2025-08-03 04:29:56,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.09 | bwd: 7593.43 | bwd_inner: 6877.34 | bwd_allreduce: 715.85 | step: 129.43
{'loss': 0.7558, 'learning_rate': 1.204214237864133e-05, 'epoch': 0.45}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12149
total_samples=13716, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:29:59,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.28 | bwd_microstep: 1789.26 | bwd_inner_microstep: 1590.10 | bwd_allreduce_microstep: 199.09 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13774
total_samples=13720, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:02,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.31 | bwd_microstep: 1866.21 | bwd_inner_microstep: 1836.22 | bwd_allreduce_microstep: 29.93 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13432
total_samples=13724, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:04,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.58 | bwd_microstep: 1827.23 | bwd_inner_microstep: 1680.26 | bwd_allreduce_microstep: 146.90 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13681
total_samples=13728, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:07,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 04:30:07,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.75 | bwd_microstep: 1902.44 | bwd_inner_microstep: 1738.13 | bwd_allreduce_microstep: 164.25 | step_microstep: 133.43
[2025-08-03 04:30:07,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.85 | bwd: 7385.18 | bwd_inner: 6844.71 | bwd_allreduce: 540.24 | step: 133.77
{'loss': 0.7538, 'learning_rate': 1.2026287195442503e-05, 'epoch': 0.45}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14046
total_samples=13732, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:10,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.44 | bwd_microstep: 2072.33 | bwd_inner_microstep: 1918.70 | bwd_allreduce_microstep: 153.56 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13393
total_samples=13736, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:12,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.36 | bwd_microstep: 1854.55 | bwd_inner_microstep: 1804.00 | bwd_allreduce_microstep: 50.48 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12126
total_samples=13739, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:15,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.99 | bwd_microstep: 2021.30 | bwd_inner_microstep: 1582.91 | bwd_allreduce_microstep: 438.32 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13613
total_samples=13743, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:18,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:30:18,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.67 | bwd_microstep: 1763.24 | bwd_inner_microstep: 1675.35 | bwd_allreduce_microstep: 87.82 | step_microstep: 155.26
[2025-08-03 04:30:18,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.39 | bwd: 7711.46 | bwd_inner: 6980.96 | bwd_allreduce: 730.28 | step: 155.60
{'loss': 0.7608, 'learning_rate': 1.2010426698541728e-05, 'epoch': 0.45}
 45%|████▌     | 900/2000 [2:46:38<3:26:01, 11.24s/it]
 45%|████▌     | 901/2000 [2:46:49<3:27:41, 11.34s/it]
 45%|████▌     | 902/2000 [2:47:00<3:25:29, 11.23s/it]
 45%|████▌     | 903/2000 [2:47:11<3:23:21, 11.12s/it]
 45%|████▌     | 904/2000 [2:47:22<3:20:49, 10.99s/it]
 45%|████▌     | 905/2000 [2:47:33<3:20:45, 11.00s/it]
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13060
total_samples=13747, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:21,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.05 | bwd_microstep: 1805.79 | bwd_inner_microstep: 1674.86 | bwd_allreduce_microstep: 130.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13328
total_samples=13751, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:23,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.63 | bwd_microstep: 1932.91 | bwd_inner_microstep: 1811.31 | bwd_allreduce_microstep: 121.53 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 14005
total_samples=13755, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:26,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.52 | bwd_microstep: 1991.05 | bwd_inner_microstep: 1984.99 | bwd_allreduce_microstep: 5.99 | step_microstep: 0.20
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13804
total_samples=13759, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:29,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 04:30:29,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.52 | bwd_microstep: 1846.51 | bwd_inner_microstep: 1685.65 | bwd_allreduce_microstep: 160.80 | step_microstep: 133.71
[2025-08-03 04:30:29,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.63 | bwd: 7576.30 | bwd_inner: 7156.81 | bwd_allreduce: 419.26 | step: 134.15
{'loss': 0.7741, 'learning_rate': 1.199456092953131e-05, 'epoch': 0.45}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11904
total_samples=13762, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:32,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.79 | bwd_microstep: 1999.69 | bwd_inner_microstep: 1796.63 | bwd_allreduce_microstep: 202.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13627
total_samples=13766, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:34,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.14 | bwd_microstep: 1792.38 | bwd_inner_microstep: 1712.33 | bwd_allreduce_microstep: 79.99 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11980
total_samples=13769, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:37,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.30 | bwd_microstep: 1793.85 | bwd_inner_microstep: 1555.90 | bwd_allreduce_microstep: 237.88 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12813
total_samples=13774, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:39,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 04:30:39,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.84 | bwd_microstep: 1814.86 | bwd_inner_microstep: 1647.70 | bwd_allreduce_microstep: 167.10 | step_microstep: 118.16
[2025-08-03 04:30:39,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.00 | bwd: 7400.83 | bwd_inner: 6712.55 | bwd_allreduce: 688.04 | step: 118.49
{'loss': 0.7489, 'learning_rate': 1.197868993001738e-05, 'epoch': 0.45}
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11558
total_samples=13777, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:42,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.64 | bwd_microstep: 1843.78 | bwd_inner_microstep: 1524.18 | bwd_allreduce_microstep: 319.54 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13278
total_samples=13781, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:45,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.26 | bwd_microstep: 2084.96 | bwd_inner_microstep: 1932.68 | bwd_allreduce_microstep: 152.21 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11907
total_samples=13784, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:48,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.34 | bwd_microstep: 1768.60 | bwd_inner_microstep: 1559.92 | bwd_allreduce_microstep: 208.61 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11924
total_samples=13787, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:50,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 04:30:50,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.46 | bwd_microstep: 1741.14 | bwd_inner_microstep: 1537.61 | bwd_allreduce_microstep: 203.46 | step_microstep: 117.63
[2025-08-03 04:30:50,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.62 | bwd: 7438.52 | bwd_inner: 6554.39 | bwd_allreduce: 883.90 | step: 118.01
{'loss': 0.7515, 'learning_rate': 1.1962813741619777e-05, 'epoch': 0.45}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13631
total_samples=13791, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:53,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.22 | bwd_microstep: 1799.93 | bwd_inner_microstep: 1718.74 | bwd_allreduce_microstep: 81.12 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15100
total_samples=13796, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:55,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.20 | bwd_microstep: 1798.31 | bwd_inner_microstep: 1760.66 | bwd_allreduce_microstep: 37.59 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11715
total_samples=13799, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:30:58,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.39 | bwd_microstep: 1857.17 | bwd_inner_microstep: 1737.09 | bwd_allreduce_microstep: 120.01 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12117
total_samples=13802, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:01,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 04:31:01,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.00 | bwd_microstep: 1812.48 | bwd_inner_microstep: 1562.17 | bwd_allreduce_microstep: 250.25 | step_microstep: 146.04
[2025-08-03 04:31:01,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.74 | bwd: 7267.93 | bwd_inner: 6778.65 | bwd_allreduce: 489.04 | step: 146.38
{'loss': 0.7603, 'learning_rate': 1.194693240597196e-05, 'epoch': 0.45}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13126
total_samples=13806, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:03,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.10 | bwd_microstep: 1733.43 | bwd_inner_microstep: 1660.36 | bwd_allreduce_microstep: 73.00 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13616
total_samples=13810, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:06,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.37 | bwd_microstep: 2103.66 | bwd_inner_microstep: 1870.95 | bwd_allreduce_microstep: 232.65 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11651
total_samples=13813, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:09,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.06 | bwd_microstep: 2060.27 | bwd_inner_microstep: 1839.36 | bwd_allreduce_microstep: 220.84 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13150
total_samples=13816, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:12,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:31:12,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.19 | bwd_microstep: 1813.64 | bwd_inner_microstep: 1629.61 | bwd_allreduce_microstep: 183.97 | step_microstep: 136.89
[2025-08-03 04:31:12,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.67 | bwd: 7711.05 | bwd_inner: 7000.29 | bwd_allreduce: 710.53 | step: 137.23
{'loss': 0.7544, 'learning_rate': 1.1931045964720882e-05, 'epoch': 0.46}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15962
total_samples=13821, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:14,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.25 | bwd_microstep: 1887.80 | bwd_inner_microstep: 1862.51 | bwd_allreduce_microstep: 25.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13375
total_samples=13825, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:17,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.28 | bwd_microstep: 1730.34 | bwd_inner_microstep: 1672.33 | bwd_allreduce_microstep: 57.95 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12536
total_samples=13829, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:20,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.68 | bwd_microstep: 1957.31 | bwd_inner_microstep: 1603.07 | bwd_allreduce_microstep: 354.19 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14455
total_samples=13833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:22,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 04:31:22,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.61 | bwd_microstep: 1857.74 | bwd_inner_microstep: 1802.27 | bwd_allreduce_microstep: 55.41 | step_microstep: 130.06
[2025-08-03 04:31:22,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.75 | bwd: 7433.25 | bwd_inner: 6940.17 | bwd_allreduce: 492.85 | step: 130.50
 45%|████▌     | 906/2000 [2:47:44<3:19:48, 10.96s/it]
 45%|████▌     | 907/2000 [2:47:54<3:17:50, 10.86s/it]
 45%|████▌     | 908/2000 [2:48:05<3:16:40, 10.81s/it]
 45%|████▌     | 909/2000 [2:48:16<3:15:03, 10.73s/it]
 46%|████▌     | 910/2000 [2:48:27<3:16:17, 10.81s/it]
 46%|████▌     | 911/2000 [2:48:37<3:15:17, 10.76s/it]
{'loss': 0.7605, 'learning_rate': 1.1915154459526876e-05, 'epoch': 0.46}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14168
total_samples=13839, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:25,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.92 | bwd_microstep: 1940.89 | bwd_inner_microstep: 1735.52 | bwd_allreduce_microstep: 205.31 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14415
total_samples=13844, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:28,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.73 | bwd_microstep: 1745.62 | bwd_inner_microstep: 1716.09 | bwd_allreduce_microstep: 29.47 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11993
total_samples=13847, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:30,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.80 | bwd_microstep: 1832.87 | bwd_inner_microstep: 1596.88 | bwd_allreduce_microstep: 235.93 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11803
total_samples=13850, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:33,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 04:31:33,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.12 | bwd_microstep: 1984.03 | bwd_inner_microstep: 1776.12 | bwd_allreduce_microstep: 207.85 | step_microstep: 134.32
[2025-08-03 04:31:33,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.50 | bwd: 7503.46 | bwd_inner: 6824.59 | bwd_allreduce: 678.63 | step: 134.64
{'loss': 0.759, 'learning_rate': 1.189925793206357e-05, 'epoch': 0.46}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13397
total_samples=13854, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:36,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.27 | bwd_microstep: 1935.29 | bwd_inner_microstep: 1929.21 | bwd_allreduce_microstep: 6.02 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13121
total_samples=13858, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:39,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.03 | bwd_microstep: 1955.38 | bwd_inner_microstep: 1840.29 | bwd_allreduce_microstep: 115.03 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11910
total_samples=13861, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:41,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.86 | bwd_microstep: 2008.55 | bwd_inner_microstep: 1874.46 | bwd_allreduce_microstep: 134.02 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11778
total_samples=13864, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:44,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 04:31:44,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.96 | bwd_microstep: 1768.43 | bwd_inner_microstep: 1547.95 | bwd_allreduce_microstep: 220.40 | step_microstep: 141.33
[2025-08-03 04:31:44,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.05 | bwd: 7667.71 | bwd_inner: 7191.91 | bwd_allreduce: 475.55 | step: 141.65
{'loss': 0.764, 'learning_rate': 1.188335642401775e-05, 'epoch': 0.46}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13653
total_samples=13868, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:47,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.67 | bwd_microstep: 1859.07 | bwd_inner_microstep: 1728.73 | bwd_allreduce_microstep: 130.28 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13419
total_samples=13872, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:49,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.51 | bwd_microstep: 1998.19 | bwd_inner_microstep: 1901.04 | bwd_allreduce_microstep: 97.09 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15734
total_samples=13876, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:52,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.35 | bwd_microstep: 1784.48 | bwd_inner_microstep: 1778.27 | bwd_allreduce_microstep: 6.14 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11960
total_samples=13879, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:55,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 04:31:55,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.64 | bwd_microstep: 2025.20 | bwd_inner_microstep: 1674.61 | bwd_allreduce_microstep: 350.53 | step_microstep: 134.62
[2025-08-03 04:31:55,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.11 | bwd: 7666.99 | bwd_inner: 7082.64 | bwd_allreduce: 584.12 | step: 135.06
{'loss': 0.752, 'learning_rate': 1.1867449977089264e-05, 'epoch': 0.46}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13545
total_samples=13883, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:31:58,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.25 | bwd_microstep: 2139.95 | bwd_inner_microstep: 1977.37 | bwd_allreduce_microstep: 162.51 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12742
total_samples=13887, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:01,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.80 | bwd_microstep: 1787.77 | bwd_inner_microstep: 1660.85 | bwd_allreduce_microstep: 126.86 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11517
total_samples=13890, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:03,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.59 | bwd_microstep: 1821.64 | bwd_inner_microstep: 1541.08 | bwd_allreduce_microstep: 280.49 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12190
total_samples=13893, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:06,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 04:32:06,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.77 | bwd_microstep: 1785.44 | bwd_inner_microstep: 1562.20 | bwd_allreduce_microstep: 223.17 | step_microstep: 137.25
[2025-08-03 04:32:06,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.35 | bwd: 7534.86 | bwd_inner: 6741.49 | bwd_allreduce: 793.12 | step: 137.64
{'loss': 0.7531, 'learning_rate': 1.1851538632990922e-05, 'epoch': 0.46}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12793
total_samples=13897, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:09,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.10 | bwd_microstep: 1918.89 | bwd_inner_microstep: 1865.55 | bwd_allreduce_microstep: 53.27 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13530
total_samples=13901, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:11,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.27 | bwd_microstep: 1828.60 | bwd_inner_microstep: 1730.60 | bwd_allreduce_microstep: 97.95 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12280
total_samples=13905, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:14,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.87 | bwd_microstep: 1814.10 | bwd_inner_microstep: 1561.38 | bwd_allreduce_microstep: 252.65 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14659
total_samples=13909, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:16,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 04:32:16,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.60 | bwd_microstep: 1815.43 | bwd_inner_microstep: 1719.80 | bwd_allreduce_microstep: 95.57 | step_microstep: 129.37
[2025-08-03 04:32:16,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.77 | bwd: 7377.07 | bwd_inner: 6877.33 | bwd_allreduce: 499.51 | step: 129.81
{'loss': 0.751, 'learning_rate': 1.1835622433448361e-05, 'epoch': 0.46}
 46%|████▌     | 912/2000 [2:48:48<3:15:14, 10.77s/it]
 46%|████▌     | 913/2000 [2:48:59<3:15:57, 10.82s/it]
 46%|████▌     | 914/2000 [2:49:10<3:16:34, 10.86s/it]
 46%|████▌     | 915/2000 [2:49:21<3:16:06, 10.84s/it]
 46%|████▌     | 916/2000 [2:49:31<3:14:48, 10.78s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11591
total_samples=13912, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:19,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.07 | bwd_microstep: 1850.49 | bwd_inner_microstep: 1581.06 | bwd_allreduce_microstep: 269.36 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13768
total_samples=13916, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:22,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.89 | bwd_microstep: 1886.11 | bwd_inner_microstep: 1734.81 | bwd_allreduce_microstep: 151.23 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11973
total_samples=13919, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:25,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.09 | bwd_microstep: 1986.85 | bwd_inner_microstep: 1573.34 | bwd_allreduce_microstep: 413.45 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13428
total_samples=13923, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:27,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.62
[2025-08-03 04:32:27,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.99 | bwd_microstep: 1725.20 | bwd_inner_microstep: 1679.86 | bwd_allreduce_microstep: 45.28 | step_microstep: 142.07
[2025-08-03 04:32:27,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.99 | bwd: 7448.69 | bwd_inner: 6569.06 | bwd_allreduce: 879.40 | step: 142.41
{'loss': 0.754, 'learning_rate': 1.181970142019997e-05, 'epoch': 0.46}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12732
total_samples=13927, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:30,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.37 | bwd_microstep: 1856.83 | bwd_inner_microstep: 1639.15 | bwd_allreduce_microstep: 217.62 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13570
total_samples=13931, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:32,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.30 | bwd_microstep: 1836.97 | bwd_inner_microstep: 1691.05 | bwd_allreduce_microstep: 145.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14301
total_samples=13935, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:35,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.10 | bwd_microstep: 1777.97 | bwd_inner_microstep: 1728.71 | bwd_allreduce_microstep: 49.20 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13580
total_samples=13939, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:38,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 04:32:38,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.31 | bwd_microstep: 1804.44 | bwd_inner_microstep: 1724.12 | bwd_allreduce_microstep: 80.26 | step_microstep: 137.03
[2025-08-03 04:32:38,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.01 | bwd: 7276.26 | bwd_inner: 6783.02 | bwd_allreduce: 493.02 | step: 137.36
{'loss': 0.7561, 'learning_rate': 1.1803775634996735e-05, 'epoch': 0.46}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11694
total_samples=13942, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:40,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.79 | bwd_microstep: 1838.66 | bwd_inner_microstep: 1533.97 | bwd_allreduce_microstep: 304.62 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12947
total_samples=13946, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:43,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.18 | bwd_microstep: 1831.10 | bwd_inner_microstep: 1683.61 | bwd_allreduce_microstep: 147.43 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12805
total_samples=13950, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:46,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.10 | bwd_microstep: 2252.34 | bwd_inner_microstep: 2082.41 | bwd_allreduce_microstep: 169.85 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13756
total_samples=13954, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:49,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 04:32:49,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 753.28 | bwd_microstep: 2035.48 | bwd_inner_microstep: 1766.92 | bwd_allreduce_microstep: 268.50 | step_microstep: 114.60
[2025-08-03 04:32:49,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2867.27 | bwd: 7957.63 | bwd_inner: 7066.91 | bwd_allreduce: 890.47 | step: 114.97
{'loss': 0.7596, 'learning_rate': 1.1787845119602184e-05, 'epoch': 0.46}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13787
total_samples=13958, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:52,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.06 | bwd_microstep: 2039.75 | bwd_inner_microstep: 2033.72 | bwd_allreduce_microstep: 5.96 | step_microstep: 0.20
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 16284
total_samples=13962, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:54,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.79 | bwd_microstep: 1805.62 | bwd_inner_microstep: 1795.98 | bwd_allreduce_microstep: 9.58 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11913
total_samples=13965, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:32:57,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.13 | bwd_microstep: 1823.43 | bwd_inner_microstep: 1580.93 | bwd_allreduce_microstep: 242.44 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14369
total_samples=13969, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:00,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 04:33:00,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.79 | bwd_microstep: 1987.55 | bwd_inner_microstep: 1917.36 | bwd_allreduce_microstep: 70.12 | step_microstep: 137.38
[2025-08-03 04:33:00,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.69 | bwd: 7656.39 | bwd_inner: 7327.98 | bwd_allreduce: 328.18 | step: 137.79
{'loss': 0.7457, 'learning_rate': 1.177190991579223e-05, 'epoch': 0.46}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13101
total_samples=13973, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:03,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.58 | bwd_microstep: 1998.59 | bwd_inner_microstep: 1856.21 | bwd_allreduce_microstep: 142.32 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13540
total_samples=13977, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:05,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.38 | bwd_microstep: 1759.80 | bwd_inner_microstep: 1682.07 | bwd_allreduce_microstep: 77.66 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12109
total_samples=13980, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:08,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.59 | bwd_microstep: 1726.47 | bwd_inner_microstep: 1551.68 | bwd_allreduce_microstep: 174.73 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13066
total_samples=13984, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:11,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 04:33:11,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.30 | bwd_microstep: 2179.55 | bwd_inner_microstep: 2045.48 | bwd_allreduce_microstep: 134.01 | step_microstep: 111.30
[2025-08-03 04:33:11,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.79 | bwd: 7664.46 | bwd_inner: 7135.44 | bwd_allreduce: 528.79 | step: 111.75
{'loss': 0.7528, 'learning_rate': 1.1755970065355087e-05, 'epoch': 0.46}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13091
total_samples=13988, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:14,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.63 | bwd_microstep: 2053.88 | bwd_inner_microstep: 1898.41 | bwd_allreduce_microstep: 155.40 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15608
total_samples=13992, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:17,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.69 | bwd_microstep: 2062.27 | bwd_inner_microstep: 1893.48 | bwd_allreduce_microstep: 168.71 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12012
total_samples=13995, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:19,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.76 | bwd_microstep: 1766.57 | bwd_inner_microstep: 1559.55 | bwd_allreduce_microstep: 206.96 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13422
total_samples=13999, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:22,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 04:33:22,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.14 | bwd_microstep: 1836.14 | bwd_inner_microstep: 1701.26 | bwd_allreduce_microstep: 134.82 | step_microstep: 126.44
[2025-08-03 04:33:22,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.15 | bwd: 7718.91 | bwd_inner: 7052.68 | bwd_allreduce: 665.98 | step: 126.83
46%|████▌     | 917/2000 [2:49:42<3:14:18, 10.76s/it]
46%|████▌     | 918/2000 [2:49:53<3:13:03, 10.71s/it]
46%|████▌     | 919/2000 [2:50:04<3:16:00, 10.88s/it]
46%|████▌     | 920/2000 [2:50:15<3:16:20, 10.91s/it]
46%|████▌     | 921/2000 [2:50:26<3:15:54, 10.89s/it]
46%|████▌     | 922/2000 [2:50:37<3:16:10, 10.92s/it]
{'loss': 0.762, 'learning_rate': 1.174002561009116e-05, 'epoch': 0.46}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11882
total_samples=14002, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:25,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.89 | bwd_microstep: 1947.33 | bwd_inner_microstep: 1740.68 | bwd_allreduce_microstep: 206.58 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309
total_samples=14006, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:27,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.10 | bwd_microstep: 1986.13 | bwd_inner_microstep: 1868.35 | bwd_allreduce_microstep: 117.72 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11820
total_samples=14009, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:30,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.97 | bwd_microstep: 2002.81 | bwd_inner_microstep: 1769.42 | bwd_allreduce_microstep: 233.32 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13062
total_samples=14014, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:33,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.19
[2025-08-03 04:33:33,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.99 | bwd_microstep: 1813.77 | bwd_inner_microstep: 1665.00 | bwd_allreduce_microstep: 148.70 | step_microstep: 153.95
[2025-08-03 04:33:33,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.87 | bwd: 7750.09 | bwd_inner: 7043.45 | bwd_allreduce: 706.40 | step: 154.28
{'loss': 0.748, 'learning_rate': 1.1724076591812919e-05, 'epoch': 0.46}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12820
total_samples=14018, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:36,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.59 | bwd_microstep: 2110.63 | bwd_inner_microstep: 1673.71 | bwd_allreduce_microstep: 436.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14380
total_samples=14022, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:38,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.47 | bwd_microstep: 1794.13 | bwd_inner_microstep: 1746.41 | bwd_allreduce_microstep: 47.65 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11794
total_samples=14025, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:41,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.36 | bwd_microstep: 2051.06 | bwd_inner_microstep: 1824.69 | bwd_allreduce_microstep: 226.31 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11875
total_samples=14028, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:44,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 04:33:44,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.33 | bwd_microstep: 2065.42 | bwd_inner_microstep: 1584.61 | bwd_allreduce_microstep: 480.74 | step_microstep: 111.33
[2025-08-03 04:33:44,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2869.68 | bwd: 8021.29 | bwd_inner: 6829.42 | bwd_allreduce: 1191.64 | step: 111.77
{'loss': 0.7613, 'learning_rate': 1.1708123052344803e-05, 'epoch': 0.46}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11679
total_samples=14031, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:47,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.13 | bwd_microstep: 1741.54 | bwd_inner_microstep: 1528.89 | bwd_allreduce_microstep: 212.58 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13150
total_samples=14035, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:49,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.28 | bwd_microstep: 1727.79 | bwd_inner_microstep: 1659.35 | bwd_allreduce_microstep: 68.37 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11946
total_samples=14038, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:52,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.28 | bwd_microstep: 1735.00 | bwd_inner_microstep: 1543.64 | bwd_allreduce_microstep: 191.29 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12751
total_samples=14041, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:54,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 04:33:54,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.66 | bwd_microstep: 1760.97 | bwd_inner_microstep: 1587.54 | bwd_allreduce_microstep: 173.37 | step_microstep: 140.46
[2025-08-03 04:33:54,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2742.26 | bwd: 6965.35 | bwd_inner: 6319.42 | bwd_allreduce: 645.70 | step: 140.81
{'loss': 0.7435, 'learning_rate': 1.1692165033523117e-05, 'epoch': 0.46}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12187
total_samples=14044, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:33:57,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.43 | bwd_microstep: 1968.18 | bwd_inner_microstep: 1692.83 | bwd_allreduce_microstep: 275.28 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12692
total_samples=14048, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:00,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.93 | bwd_microstep: 2198.62 | bwd_inner_microstep: 1727.64 | bwd_allreduce_microstep: 470.87 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12124
total_samples=14051, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:03,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.72 | bwd_microstep: 2208.73 | bwd_inner_microstep: 1980.67 | bwd_allreduce_microstep: 227.99 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13607
total_samples=14055, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:06,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 04:34:06,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 849.35 | bwd_microstep: 1827.89 | bwd_inner_microstep: 1714.52 | bwd_allreduce_microstep: 113.30 | step_microstep: 108.36
[2025-08-03 04:34:06,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2920.35 | bwd: 8203.45 | bwd_inner: 7115.66 | bwd_allreduce: 1087.50 | step: 108.70
{'loss': 0.7614, 'learning_rate': 1.1676202577195901e-05, 'epoch': 0.46}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13876
total_samples=14060, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:09,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.76 | bwd_microstep: 2007.53 | bwd_inner_microstep: 2001.21 | bwd_allreduce_microstep: 6.25 | step_microstep: 0.19
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13454
total_samples=14064, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:11,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.29 | bwd_microstep: 1914.28 | bwd_inner_microstep: 1716.10 | bwd_allreduce_microstep: 198.12 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11698
total_samples=14067, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:14,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.00 | bwd_microstep: 1834.35 | bwd_inner_microstep: 1603.12 | bwd_allreduce_microstep: 231.17 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13472
total_samples=14071, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:17,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 04:34:17,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.95 | bwd_microstep: 1917.69 | bwd_inner_microstep: 1840.65 | bwd_allreduce_microstep: 76.98 | step_microstep: 111.83
[2025-08-03 04:34:17,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2861.92 | bwd: 7673.90 | bwd_inner: 7161.08 | bwd_allreduce: 512.59 | step: 112.25
{'loss': 0.7673, 'learning_rate': 1.1660235725222835e-05, 'epoch': 0.46}
46%|████▌     | 923/2000 [2:50:48<3:16:37, 10.95s/it]
46%|████▌     | 924/2000 [2:50:59<3:18:24, 11.06s/it]
46%|████▋     | 925/2000 [2:51:09<3:13:24, 10.80s/it]
46%|████▋     | 926/2000 [2:51:21<3:17:24, 11.03s/it]
46%|████▋     | 927/2000 [2:51:32<3:16:54, 11.01s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15785
total_samples=14075, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:20,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.91 | bwd_microstep: 2016.56 | bwd_inner_microstep: 1869.39 | bwd_allreduce_microstep: 147.11 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13674
total_samples=14079, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:22,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.90 | bwd_microstep: 1794.08 | bwd_inner_microstep: 1685.00 | bwd_allreduce_microstep: 109.01 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11764
total_samples=14082, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:25,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.54 | bwd_microstep: 1805.68 | bwd_inner_microstep: 1559.84 | bwd_allreduce_microstep: 245.78 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14039
total_samples=14087, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:28,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 04:34:28,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.39 | bwd_microstep: 1994.53 | bwd_inner_microstep: 1737.77 | bwd_allreduce_microstep: 256.70 | step_microstep: 114.30
[2025-08-03 04:34:28,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.68 | bwd: 7610.90 | bwd_inner: 6851.97 | bwd_allreduce: 758.68 | step: 114.73
{'loss': 0.7567, 'learning_rate': 1.164426451947513e-05, 'epoch': 0.46}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15909
total_samples=14091, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:30,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.37 | bwd_microstep: 1833.08 | bwd_inner_microstep: 1813.28 | bwd_allreduce_microstep: 19.74 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=14095, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:33,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.78 | bwd_microstep: 1821.80 | bwd_inner_microstep: 1690.16 | bwd_allreduce_microstep: 131.57 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12203
total_samples=14098, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:36,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.20 | bwd_microstep: 2010.85 | bwd_inner_microstep: 1812.41 | bwd_allreduce_microstep: 198.37 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13568
total_samples=14102, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:39,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 04:34:39,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.32 | bwd_microstep: 1828.83 | bwd_inner_microstep: 1712.69 | bwd_allreduce_microstep: 116.07 | step_microstep: 112.27
[2025-08-03 04:34:39,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.60 | bwd: 7494.61 | bwd_inner: 7028.53 | bwd_allreduce: 465.84 | step: 112.63
{'loss': 0.7526, 'learning_rate': 1.1628289001835405e-05, 'epoch': 0.46}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14624
total_samples=14106, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:41,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.51 | bwd_microstep: 1779.29 | bwd_inner_microstep: 1730.96 | bwd_allreduce_microstep: 48.25 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15728
total_samples=14111, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:44,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.74 | bwd_microstep: 2191.55 | bwd_inner_microstep: 1896.65 | bwd_allreduce_microstep: 294.84 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13843
total_samples=14115, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:47,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.65 | bwd_microstep: 1798.06 | bwd_inner_microstep: 1732.93 | bwd_allreduce_microstep: 65.07 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12681
total_samples=14120, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:49,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 04:34:49,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.40 | bwd_microstep: 1739.05 | bwd_inner_microstep: 1590.34 | bwd_allreduce_microstep: 148.65 | step_microstep: 133.69
[2025-08-03 04:34:49,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.23 | bwd: 7508.01 | bwd_inner: 6950.88 | bwd_allreduce: 556.89 | step: 134.02
{'loss': 0.7562, 'learning_rate': 1.1612309214197599e-05, 'epoch': 0.47}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13224
total_samples=14124, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:52,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.43 | bwd_microstep: 2092.14 | bwd_inner_microstep: 1868.99 | bwd_allreduce_microstep: 223.08 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13813
total_samples=14129, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:55,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.51 | bwd_microstep: 2242.58 | bwd_inner_microstep: 2008.68 | bwd_allreduce_microstep: 233.83 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13671
total_samples=14133, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:34:58,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.57 | bwd_microstep: 1766.69 | bwd_inner_microstep: 1703.40 | bwd_allreduce_microstep: 63.22 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11778
total_samples=14136, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:01,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.87
[2025-08-03 04:35:01,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.80 | bwd_microstep: 1904.76 | bwd_inner_microstep: 1539.76 | bwd_allreduce_microstep: 364.93 | step_microstep: 137.68
[2025-08-03 04:35:01,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.24 | bwd: 8006.20 | bwd_inner: 7120.82 | bwd_allreduce: 885.14 | step: 138.01
{'loss': 0.7546, 'learning_rate': 1.1596325198466841e-05, 'epoch': 0.47}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13193
total_samples=14140, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:03,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.99 | bwd_microstep: 1812.04 | bwd_inner_microstep: 1702.99 | bwd_allreduce_microstep: 108.98 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13409
total_samples=14144, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:06,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.12 | bwd_microstep: 1831.82 | bwd_inner_microstep: 1706.17 | bwd_allreduce_microstep: 125.59 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13536
total_samples=14148, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:08,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.81 | bwd_microstep: 1786.97 | bwd_inner_microstep: 1699.88 | bwd_allreduce_microstep: 87.02 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12313
total_samples=14151, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:11,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 04:35:11,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.83 | bwd_microstep: 1924.52 | bwd_inner_microstep: 1581.14 | bwd_allreduce_microstep: 343.32 | step_microstep: 114.77
[2025-08-03 04:35:11,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.66 | bwd: 7355.40 | bwd_inner: 6690.19 | bwd_allreduce: 664.98 | step: 115.20
{'loss': 0.763, 'learning_rate': 1.1580336996559343e-05, 'epoch': 0.47}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12687
total_samples=14155, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:14,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.11 | bwd_microstep: 1864.83 | bwd_inner_microstep: 1759.48 | bwd_allreduce_microstep: 105.29 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13147
total_samples=14159, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:17,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.72 | bwd_microstep: 2054.84 | bwd_inner_microstep: 1886.74 | bwd_allreduce_microstep: 168.03 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12424
total_samples=14163, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:19,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.71 | bwd_microstep: 1997.66 | bwd_inner_microstep: 1819.33 | bwd_allreduce_microstep: 178.28 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14284
total_samples=14167, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:22,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 04:35:22,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.68 | bwd_microstep: 2001.56 | bwd_inner_microstep: 1764.50 | bwd_allreduce_microstep: 237.00 | step_microstep: 128.67
[2025-08-03 04:35:22,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2749.16 | bwd: 7918.94 | bwd_inner: 7230.04 | bwd_allreduce: 688.68 | step: 128.98
46%|████▋     | 928/2000 [2:51:43<3:15:51, 10.96s/it]
46%|████▋     | 929/2000 [2:51:53<3:14:39, 10.91s/it]
46%|████▋     | 930/2000 [2:52:04<3:13:49, 10.87s/it]
47%|████▋     | 931/2000 [2:52:15<3:15:30, 10.97s/it]
47%|████▋     | 932/2000 [2:52:26<3:13:26, 10.87s/it]
47%|████▋     | 933/2000 [2:52:37<3:14:37, 10.94s/it]
{'loss': 0.7413, 'learning_rate': 1.156434465040231e-05, 'epoch': 0.47}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13513
total_samples=14171, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:25,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.02 | bwd_microstep: 2039.01 | bwd_inner_microstep: 2032.97 | bwd_allreduce_microstep: 5.98 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13196
total_samples=14175, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:28,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.62 | bwd_microstep: 2293.76 | bwd_inner_microstep: 2132.75 | bwd_allreduce_microstep: 160.96 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14381
total_samples=14179, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:31,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.08 | bwd_microstep: 2067.23 | bwd_inner_microstep: 1917.11 | bwd_allreduce_microstep: 150.05 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12323
total_samples=14183, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:34,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 04:35:34,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 752.82 | bwd_microstep: 1835.98 | bwd_inner_microstep: 1617.23 | bwd_allreduce_microstep: 218.69 | step_microstep: 132.48
[2025-08-03 04:35:34,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.46 | bwd: 8236.03 | bwd_inner: 7700.06 | bwd_allreduce: 535.75 | step: 132.83
{'loss': 0.7633, 'learning_rate': 1.1548348201933799e-05, 'epoch': 0.47}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14183
total_samples=14187, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:37,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.84 | bwd_microstep: 2232.18 | bwd_inner_microstep: 2071.02 | bwd_allreduce_microstep: 161.10 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13379
total_samples=14191, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:39,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.83 | bwd_microstep: 1749.76 | bwd_inner_microstep: 1668.03 | bwd_allreduce_microstep: 81.67 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13248
total_samples=14195, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:42,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 968.63 | bwd_microstep: 1752.10 | bwd_inner_microstep: 1680.47 | bwd_allreduce_microstep: 71.56 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13900
total_samples=14199, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:45,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 04:35:45,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.59 | bwd_microstep: 1771.05 | bwd_inner_microstep: 1693.49 | bwd_allreduce_microstep: 77.49 | step_microstep: 144.19
[2025-08-03 04:35:45,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3020.81 | bwd: 7505.12 | bwd_inner: 7113.01 | bwd_allreduce: 391.89 | step: 144.60
{'loss': 0.7576, 'learning_rate': 1.1532347693102632e-05, 'epoch': 0.47}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13379
total_samples=14203, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:47,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.58 | bwd_microstep: 1805.30 | bwd_inner_microstep: 1705.28 | bwd_allreduce_microstep: 99.96 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15302
total_samples=14207, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:50,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.45 | bwd_microstep: 1849.08 | bwd_inner_microstep: 1785.05 | bwd_allreduce_microstep: 63.96 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13279
total_samples=14211, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:53,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.20 | bwd_microstep: 1730.27 | bwd_inner_microstep: 1688.39 | bwd_allreduce_microstep: 41.81 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12261
total_samples=14214, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:55,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 04:35:55,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.81 | bwd_microstep: 1836.79 | bwd_inner_microstep: 1679.59 | bwd_allreduce_microstep: 157.14 | step_microstep: 120.88
[2025-08-03 04:35:55,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.97 | bwd: 7221.49 | bwd_inner: 6858.31 | bwd_allreduce: 362.94 | step: 121.31
{'loss': 0.7489, 'learning_rate': 1.151634316586828e-05, 'epoch': 0.47}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12933
total_samples=14218, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:35:58,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.23 | bwd_microstep: 1918.51 | bwd_inner_microstep: 1688.25 | bwd_allreduce_microstep: 230.20 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13045
total_samples=14222, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:01,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.17 | bwd_microstep: 2055.65 | bwd_inner_microstep: 1860.58 | bwd_allreduce_microstep: 195.01 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14474
total_samples=14226, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:03,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.17 | bwd_microstep: 1712.49 | bwd_inner_microstep: 1701.48 | bwd_allreduce_microstep: 10.94 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14459
total_samples=14230, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:06,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 04:36:06,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.43 | bwd_microstep: 1863.84 | bwd_inner_microstep: 1829.45 | bwd_allreduce_microstep: 34.33 | step_microstep: 152.40
[2025-08-03 04:36:06,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.92 | bwd: 7550.54 | bwd_inner: 7079.76 | bwd_allreduce: 470.56 | step: 152.74
{'loss': 0.7519, 'learning_rate': 1.150033466220075e-05, 'epoch': 0.47}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11849
total_samples=14233, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:09,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.47 | bwd_microstep: 2283.59 | bwd_inner_microstep: 1540.08 | bwd_allreduce_microstep: 743.44 | step_microstep: 0.15
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12997
total_samples=14237, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:12,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.18 | bwd_microstep: 1729.58 | bwd_inner_microstep: 1640.60 | bwd_allreduce_microstep: 88.91 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13418
total_samples=14241, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:14,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.45 | bwd_microstep: 1785.32 | bwd_inner_microstep: 1705.60 | bwd_allreduce_microstep: 79.66 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14636
total_samples=14245, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:17,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 04:36:17,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.59 | bwd_microstep: 1794.54 | bwd_inner_microstep: 1762.64 | bwd_allreduce_microstep: 31.83 | step_microstep: 116.97
[2025-08-03 04:36:17,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2730.63 | bwd: 7593.07 | bwd_inner: 6648.91 | bwd_allreduce: 943.92 | step: 117.44
{'loss': 0.7489, 'learning_rate': 1.1484322224080474e-05, 'epoch': 0.47}
47%|████▋     | 934/2000 [2:52:49<3:17:37, 11.12s/it]
47%|████▋     | 935/2000 [2:53:00<3:16:47, 11.09s/it]
47%|████▋     | 936/2000 [2:53:10<3:13:28, 10.91s/it]
47%|████▋     | 937/2000 [2:53:21<3:12:44, 10.88s/it]
47%|████▋     | 938/2000 [2:53:32<3:12:04, 10.85s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14399
total_samples=14249, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:20,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.32 | bwd_microstep: 2038.89 | bwd_inner_microstep: 1891.86 | bwd_allreduce_microstep: 146.97 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12010
total_samples=14252, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:22,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.95 | bwd_microstep: 1727.77 | bwd_inner_microstep: 1550.94 | bwd_allreduce_microstep: 176.77 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13243
total_samples=14256, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:25,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.64 | bwd_microstep: 2085.01 | bwd_inner_microstep: 1834.85 | bwd_allreduce_microstep: 250.10 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13541
total_samples=14260, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:28,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 04:36:28,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.29 | bwd_microstep: 2087.56 | bwd_inner_microstep: 1896.31 | bwd_allreduce_microstep: 191.17 | step_microstep: 137.67
[2025-08-03 04:36:28,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.13 | bwd: 7939.27 | bwd_inner: 7173.95 | bwd_allreduce: 765.09 | step: 137.97
{'loss': 0.7639, 'learning_rate': 1.1468305893498204e-05, 'epoch': 0.47}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12681
total_samples=14264, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:31,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.46 | bwd_microstep: 1964.79 | bwd_inner_microstep: 1845.08 | bwd_allreduce_microstep: 119.64 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11738
total_samples=14267, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:34,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.71 | bwd_microstep: 1964.84 | bwd_inner_microstep: 1760.83 | bwd_allreduce_microstep: 203.95 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13780
total_samples=14271, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:36,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.13 | bwd_microstep: 1815.41 | bwd_inner_microstep: 1727.41 | bwd_allreduce_microstep: 87.93 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13334
total_samples=14275, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:39,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 04:36:39,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.73 | bwd_microstep: 2003.76 | bwd_inner_microstep: 1996.05 | bwd_allreduce_microstep: 7.65 | step_microstep: 115.63
[2025-08-03 04:36:39,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2721.97 | bwd: 7748.84 | bwd_inner: 7329.36 | bwd_allreduce: 419.24 | step: 115.98
{'loss': 0.7572, 'learning_rate': 1.1452285712454905e-05, 'epoch': 0.47}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11921
total_samples=14278, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:42,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.85 | bwd_microstep: 2169.41 | bwd_inner_microstep: 2063.19 | bwd_allreduce_microstep: 106.16 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11933
total_samples=14282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:45,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.22 | bwd_microstep: 1870.82 | bwd_inner_microstep: 1620.95 | bwd_allreduce_microstep: 249.81 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13305
total_samples=14286, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:47,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.68 | bwd_microstep: 1834.97 | bwd_inner_microstep: 1693.42 | bwd_allreduce_microstep: 141.49 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13252
total_samples=14290, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:50,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 04:36:50,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.39 | bwd_microstep: 2110.43 | bwd_inner_microstep: 1961.36 | bwd_allreduce_microstep: 149.00 | step_microstep: 119.06
[2025-08-03 04:36:50,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2855.07 | bwd: 7985.68 | bwd_inner: 7338.92 | bwd_allreduce: 646.54 | step: 119.48
{'loss': 0.7502, 'learning_rate': 1.1436261722961627e-05, 'epoch': 0.47}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11789
total_samples=14293, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:53,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.11 | bwd_microstep: 2097.79 | bwd_inner_microstep: 1843.15 | bwd_allreduce_microstep: 254.59 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15919
total_samples=14297, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:56,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.47 | bwd_microstep: 1805.69 | bwd_inner_microstep: 1767.44 | bwd_allreduce_microstep: 38.18 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12876
total_samples=14301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:36:58,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.93 | bwd_microstep: 1820.93 | bwd_inner_microstep: 1637.68 | bwd_allreduce_microstep: 183.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13893
total_samples=14305, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:01,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 04:37:01,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.52 | bwd_microstep: 1791.74 | bwd_inner_microstep: 1728.41 | bwd_allreduce_microstep: 63.25 | step_microstep: 124.64
[2025-08-03 04:37:01,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.96 | bwd: 7516.21 | bwd_inner: 6976.68 | bwd_allreduce: 539.29 | step: 124.97
{'loss': 0.762, 'learning_rate': 1.1420233967039423e-05, 'epoch': 0.47}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12949
total_samples=14308, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:04,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.08 | bwd_microstep: 1827.26 | bwd_inner_microstep: 1609.27 | bwd_allreduce_microstep: 217.92 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13264
total_samples=14312, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:06,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.27 | bwd_microstep: 1723.01 | bwd_inner_microstep: 1661.11 | bwd_allreduce_microstep: 61.84 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12406
total_samples=14315, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:09,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.99 | bwd_microstep: 1716.20 | bwd_inner_microstep: 1555.67 | bwd_allreduce_microstep: 160.47 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13396
total_samples=14319, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:12,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 04:37:12,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.79 | bwd_microstep: 1987.64 | bwd_inner_microstep: 1851.33 | bwd_allreduce_microstep: 136.24 | step_microstep: 123.11
[2025-08-03 04:37:12,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2746.07 | bwd: 7254.18 | bwd_inner: 6677.38 | bwd_allreduce: 576.55 | step: 123.55
{'loss': 0.7491, 'learning_rate': 1.1404202486719205e-05, 'epoch': 0.47}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13322
total_samples=14323, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:14,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.47 | bwd_microstep: 2026.72 | bwd_inner_microstep: 1844.87 | bwd_allreduce_microstep: 181.79 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11624
total_samples=14326, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:17,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.74 | bwd_microstep: 1811.72 | bwd_inner_microstep: 1559.37 | bwd_allreduce_microstep: 252.28 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11858
total_samples=14329, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:20,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.34 | bwd_microstep: 2263.07 | bwd_inner_microstep: 1829.66 | bwd_allreduce_microstep: 433.34 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13339
total_samples=14333, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:23,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 04:37:23,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.33 | bwd_microstep: 1810.81 | bwd_inner_microstep: 1709.36 | bwd_allreduce_microstep: 101.39 | step_microstep: 155.61
[2025-08-03 04:37:23,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.82 | bwd: 7912.36 | bwd_inner: 6943.25 | bwd_allreduce: 968.88 | step: 156.04
47%|████▋     | 939/2000 [2:53:43<3:13:36, 10.95s/it]
47%|████▋     | 940/2000 [2:53:54<3:13:21, 10.94s/it]
47%|████▋     | 941/2000 [2:54:05<3:15:02, 11.05s/it]
47%|████▋     | 942/2000 [2:54:16<3:13:21, 10.97s/it]
47%|████▋     | 943/2000 [2:54:26<3:10:28, 10.81s/it]
47%|████▋     | 944/2000 [2:54:38<3:12:06, 10.92s/it]
{'loss': 0.7534, 'learning_rate': 1.138816732404167e-05, 'epoch': 0.47}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15840
total_samples=14337, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:26,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.56 | bwd_microstep: 2015.19 | bwd_inner_microstep: 1917.05 | bwd_allreduce_microstep: 98.08 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13225
total_samples=14341, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:28,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.93 | bwd_microstep: 1733.11 | bwd_inner_microstep: 1628.78 | bwd_allreduce_microstep: 104.26 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11951
total_samples=14344, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:31,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.85 | bwd_microstep: 1776.18 | bwd_inner_microstep: 1560.74 | bwd_allreduce_microstep: 215.37 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13793
total_samples=14348, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:33,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51
[2025-08-03 04:37:33,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.32 | bwd_microstep: 1932.37 | bwd_inner_microstep: 1721.92 | bwd_allreduce_microstep: 210.38 | step_microstep: 152.43
[2025-08-03 04:37:33,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.59 | bwd: 7456.89 | bwd_inner: 6828.48 | bwd_allreduce: 628.17 | step: 152.76
{'loss': 0.7505, 'learning_rate': 1.1372128521057155e-05, 'epoch': 0.47}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14724
total_samples=14352, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:36,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.12 | bwd_microstep: 1946.56 | bwd_inner_microstep: 1900.60 | bwd_allreduce_microstep: 45.89 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12123
total_samples=14355, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:39,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.80 | bwd_microstep: 1915.81 | bwd_inner_microstep: 1725.47 | bwd_allreduce_microstep: 190.28 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11829
total_samples=14358, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:41,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.08 | bwd_microstep: 1900.27 | bwd_inner_microstep: 1554.03 | bwd_allreduce_microstep: 346.17 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13645
total_samples=14362, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:44,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 04:37:44,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.57 | bwd_microstep: 2074.17 | bwd_inner_microstep: 1923.60 | bwd_allreduce_microstep: 150.50 | step_microstep: 113.12
[2025-08-03 04:37:44,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.50 | bwd: 7836.85 | bwd_inner: 7103.70 | bwd_allreduce: 732.92 | step: 113.47
{'loss': 0.7466, 'learning_rate': 1.1356086119825553e-05, 'epoch': 0.47}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13182
total_samples=14366, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:48,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 666.47 | bwd_microstep: 2338.75 | bwd_inner_microstep: 1857.73 | bwd_allreduce_microstep: 480.95 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12376
total_samples=14370, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:50,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.71 | bwd_microstep: 2049.57 | bwd_inner_microstep: 1825.98 | bwd_allreduce_microstep: 223.52 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13379
total_samples=14374, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:53,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.86 | bwd_microstep: 1948.05 | bwd_inner_microstep: 1941.70 | bwd_allreduce_microstep: 6.29 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13791
total_samples=14378, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:56,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 04:37:56,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.68 | bwd_microstep: 2078.95 | bwd_inner_microstep: 1958.11 | bwd_allreduce_microstep: 120.77 | step_microstep: 112.42
[2025-08-03 04:37:56,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.66 | bwd: 8415.37 | bwd_inner: 7583.52 | bwd_allreduce: 831.61 | step: 112.87
{'loss': 0.7484, 'learning_rate': 1.1340040162416197e-05, 'epoch': 0.47}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14118
total_samples=14383, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:37:59,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.35 | bwd_microstep: 2042.53 | bwd_inner_microstep: 1889.68 | bwd_allreduce_microstep: 152.78 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14188
total_samples=14387, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:01,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.00 | bwd_microstep: 1765.16 | bwd_inner_microstep: 1715.62 | bwd_allreduce_microstep: 49.47 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13672
total_samples=14391, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:04,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.06 | bwd_microstep: 1858.68 | bwd_inner_microstep: 1801.59 | bwd_allreduce_microstep: 57.03 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13189
total_samples=14395, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:07,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 04:38:07,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.17 | bwd_microstep: 1783.43 | bwd_inner_microstep: 1690.01 | bwd_allreduce_microstep: 93.35 | step_microstep: 132.40
[2025-08-03 04:38:07,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2824.51 | bwd: 7449.84 | bwd_inner: 7096.89 | bwd_allreduce: 352.71 | step: 132.74
{'loss': 0.7636, 'learning_rate': 1.1323990690907734e-05, 'epoch': 0.47}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12526
total_samples=14399, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:10,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.60 | bwd_microstep: 2071.42 | bwd_inner_microstep: 1906.61 | bwd_allreduce_microstep: 164.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13276
total_samples=14403, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:12,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.32 | bwd_microstep: 1807.95 | bwd_inner_microstep: 1703.44 | bwd_allreduce_microstep: 104.44 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12216
total_samples=14406, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:15,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.69 | bwd_microstep: 1876.48 | bwd_inner_microstep: 1594.41 | bwd_allreduce_microstep: 282.00 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13257
total_samples=14410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:18,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 04:38:18,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.35 | bwd_microstep: 2307.02 | bwd_inner_microstep: 1943.46 | bwd_allreduce_microstep: 363.50 | step_microstep: 132.27
[2025-08-03 04:38:18,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.88 | bwd: 8062.91 | bwd_inner: 7147.92 | bwd_allreduce: 914.76 | step: 132.59
{'loss': 0.7425, 'learning_rate': 1.1307937747388034e-05, 'epoch': 0.47}
47%|████▋     | 945/2000 [2:54:48<3:10:46, 10.85s/it]
47%|████▋     | 946/2000 [2:54:59<3:11:37, 10.91s/it]
47%|████▋     | 947/2000 [2:55:11<3:15:11, 11.12s/it]
47%|████▋     | 948/2000 [2:55:22<3:12:56, 11.00s/it]
47%|████▋     | 949/2000 [2:55:33<3:14:26, 11.10s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11808
total_samples=14413, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:21,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.79 | bwd_microstep: 1842.62 | bwd_inner_microstep: 1572.91 | bwd_allreduce_microstep: 269.64 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13690
total_samples=14417, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:23,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.05 | bwd_microstep: 1830.04 | bwd_inner_microstep: 1712.06 | bwd_allreduce_microstep: 117.92 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13124
total_samples=14421, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:26,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.47 | bwd_microstep: 1846.79 | bwd_inner_microstep: 1704.64 | bwd_allreduce_microstep: 142.09 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15459
total_samples=14428, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:29,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 04:38:29,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.38 | bwd_microstep: 1920.36 | bwd_inner_microstep: 1774.30 | bwd_allreduce_microstep: 145.99 | step_microstep: 129.66
[2025-08-03 04:38:29,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.63 | bwd: 7439.86 | bwd_inner: 6763.91 | bwd_allreduce: 675.72 | step: 130.09
{'loss': 0.7502, 'learning_rate': 1.1291881373954066e-05, 'epoch': 0.47}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13569
total_samples=14432, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:31,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.29 | bwd_microstep: 1736.14 | bwd_inner_microstep: 1682.86 | bwd_allreduce_microstep: 53.22 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12757
total_samples=14436, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:34,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.86 | bwd_microstep: 1941.15 | bwd_inner_microstep: 1611.12 | bwd_allreduce_microstep: 329.96 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13423
total_samples=14440, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:37,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.04 | bwd_microstep: 2015.46 | bwd_inner_microstep: 1911.54 | bwd_allreduce_microstep: 103.85 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12143
total_samples=14443, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:40,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 04:38:40,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.75 | bwd_microstep: 2035.94 | bwd_inner_microstep: 1823.44 | bwd_allreduce_microstep: 212.43 | step_microstep: 133.47
[2025-08-03 04:38:40,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2722.87 | bwd: 7728.74 | bwd_inner: 7028.95 | bwd_allreduce: 699.55 | step: 133.96
{'loss': 0.7478, 'learning_rate': 1.1275821612711803e-05, 'epoch': 0.48}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13334
total_samples=14447, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:42,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.39 | bwd_microstep: 1752.00 | bwd_inner_microstep: 1682.37 | bwd_allreduce_microstep: 69.55 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11883
total_samples=14450, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:45,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.95 | bwd_microstep: 1745.93 | bwd_inner_microstep: 1543.38 | bwd_allreduce_microstep: 202.48 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14013
total_samples=14454, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:47,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.58 | bwd_microstep: 1792.46 | bwd_inner_microstep: 1741.52 | bwd_allreduce_microstep: 50.88 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14101
total_samples=14458, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:51,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 04:38:51,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.27 | bwd_microstep: 2205.85 | bwd_inner_microstep: 2004.95 | bwd_allreduce_microstep: 200.83 | step_microstep: 134.63
[2025-08-03 04:38:51,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.11 | bwd: 7496.28 | bwd_inner: 6972.22 | bwd_allreduce: 523.83 | step: 134.96
{'loss': 0.7616, 'learning_rate': 1.1259758505776092e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11643
total_samples=14461, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:54,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.38 | bwd_microstep: 2189.33 | bwd_inner_microstep: 1950.43 | bwd_allreduce_microstep: 238.84 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13328
total_samples=14465, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:56,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.12 | bwd_microstep: 1788.34 | bwd_inner_microstep: 1706.01 | bwd_allreduce_microstep: 82.26 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13398
total_samples=14469, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:38:59,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.83 | bwd_microstep: 1755.05 | bwd_inner_microstep: 1684.71 | bwd_allreduce_microstep: 70.28 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13719
total_samples=14473, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:02,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 04:39:02,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.49 | bwd_microstep: 2176.74 | bwd_inner_microstep: 1861.22 | bwd_allreduce_microstep: 315.46 | step_microstep: 119.20
[2025-08-03 04:39:02,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.75 | bwd: 7909.51 | bwd_inner: 7202.36 | bwd_allreduce: 706.92 | step: 119.64
{'loss': 0.7558, 'learning_rate': 1.1243692095270565e-05, 'epoch': 0.48}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13024
total_samples=14477, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:04,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.06 | bwd_microstep: 1884.48 | bwd_inner_microstep: 1669.82 | bwd_allreduce_microstep: 214.59 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14592
total_samples=14481, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:07,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.78 | bwd_microstep: 1741.70 | bwd_inner_microstep: 1720.98 | bwd_allreduce_microstep: 20.64 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13414
total_samples=14485, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:10,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.13 | bwd_microstep: 1990.39 | bwd_inner_microstep: 1984.28 | bwd_allreduce_microstep: 6.04 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13368
total_samples=14489, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:13,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 04:39:13,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.88 | bwd_microstep: 2448.68 | bwd_inner_microstep: 2292.78 | bwd_allreduce_microstep: 155.83 | step_microstep: 142.29
[2025-08-03 04:39:13,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.79 | bwd: 8065.28 | bwd_inner: 7667.84 | bwd_allreduce: 397.18 | step: 142.63
{'loss': 0.7557, 'learning_rate': 1.1227622423327501e-05, 'epoch': 0.48}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13966
total_samples=14493, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:16,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.23 | bwd_microstep: 1804.33 | bwd_inner_microstep: 1686.79 | bwd_allreduce_microstep: 117.47 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13498
total_samples=14497, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:18,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.63 | bwd_microstep: 1861.56 | bwd_inner_microstep: 1707.06 | bwd_allreduce_microstep: 154.43 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13542
total_samples=14501, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:21,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 654.97 | bwd_microstep: 1731.78 | bwd_inner_microstep: 1687.87 | bwd_allreduce_microstep: 43.84 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13728
total_samples=14505, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:24,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 04:39:24,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.44 | bwd_microstep: 1950.83 | bwd_inner_microstep: 1911.15 | bwd_allreduce_microstep: 39.62 | step_microstep: 117.98
[2025-08-03 04:39:24,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.20 | bwd: 7348.54 | bwd_inner: 6992.88 | bwd_allreduce: 355.44 | step: 118.44
 48%|████▊     | 950/2000 [2:55:44<3:12:17, 10.99s/it]
 48%|████▊     | 951/2000 [2:55:55<3:11:53, 10.98s/it]
 48%|████▊     | 952/2000 [2:56:05<3:10:40, 10.92s/it]
 48%|████▊     | 953/2000 [2:56:17<3:11:39, 10.98s/it]
 48%|████▊     | 954/2000 [2:56:28<3:13:18, 11.09s/it]
 48%|████▊     | 955/2000 [2:56:38<3:10:31, 10.94s/it]
{'loss': 0.7552, 'learning_rate': 1.1211549532087749e-05, 'epoch': 0.48}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12097
total_samples=14509, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:26,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.89 | bwd_microstep: 2021.53 | bwd_inner_microstep: 1570.72 | bwd_allreduce_microstep: 450.71 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14687
total_samples=14513, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:29,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.31 | bwd_microstep: 2031.62 | bwd_inner_microstep: 1743.30 | bwd_allreduce_microstep: 288.26 | step_microstep: 0.23
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12293
total_samples=14517, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:32,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.41 | bwd_microstep: 1954.88 | bwd_inner_microstep: 1760.91 | bwd_allreduce_microstep: 193.91 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12977
total_samples=14521, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:35,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.56
[2025-08-03 04:39:35,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.23 | bwd_microstep: 1783.75 | bwd_inner_microstep: 1615.07 | bwd_allreduce_microstep: 168.62 | step_microstep: 116.80
[2025-08-03 04:39:35,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.76 | bwd: 7791.84 | bwd_inner: 6690.01 | bwd_allreduce: 1101.57 | step: 117.26
{'loss': 0.7524, 'learning_rate': 1.119547346370059e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11845
total_samples=14524, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:37,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.51 | bwd_microstep: 1769.36 | bwd_inner_microstep: 1548.96 | bwd_allreduce_microstep: 220.33 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15402
total_samples=14528, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:40,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.57 | bwd_microstep: 1879.39 | bwd_inner_microstep: 1860.47 | bwd_allreduce_microstep: 18.85 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13695
total_samples=14532, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:42,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.69 | bwd_microstep: 1762.63 | bwd_inner_microstep: 1693.76 | bwd_allreduce_microstep: 68.81 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13435
total_samples=14536, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:45,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 04:39:45,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.98 | bwd_microstep: 1733.58 | bwd_inner_microstep: 1678.60 | bwd_allreduce_microstep: 54.91 | step_microstep: 115.00
[2025-08-03 04:39:45,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2739.68 | bwd: 7145.00 | bwd_inner: 6781.80 | bwd_allreduce: 362.98 | step: 115.34
{'loss': 0.7535, 'learning_rate': 1.1179394260323639e-05, 'epoch': 0.48}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13554
total_samples=14540, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:48,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.31 | bwd_microstep: 1704.31 | bwd_inner_microstep: 1665.22 | bwd_allreduce_microstep: 39.02 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13489
total_samples=14544, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:50,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.09 | bwd_microstep: 1761.18 | bwd_inner_microstep: 1724.33 | bwd_allreduce_microstep: 36.78 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13451
total_samples=14548, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:54,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 982.75 | bwd_microstep: 2598.48 | bwd_inner_microstep: 2467.05 | bwd_allreduce_microstep: 131.37 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11629
total_samples=14551, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:57,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:39:57,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.36 | bwd_microstep: 1898.08 | bwd_inner_microstep: 1600.83 | bwd_allreduce_microstep: 297.18 | step_microstep: 131.57
[2025-08-03 04:39:57,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3074.44 | bwd: 7962.09 | bwd_inner: 7457.43 | bwd_allreduce: 504.43 | step: 132.00
{'loss': 0.7411, 'learning_rate': 1.1163311964122733e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11682
total_samples=14554, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:39:59,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.48 | bwd_microstep: 1756.80 | bwd_inner_microstep: 1537.68 | bwd_allreduce_microstep: 219.06 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14132
total_samples=14559, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:02,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.67 | bwd_microstep: 1820.71 | bwd_inner_microstep: 1747.49 | bwd_allreduce_microstep: 73.15 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13094
total_samples=14563, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:05,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.88 | bwd_microstep: 2168.06 | bwd_inner_microstep: 1807.55 | bwd_allreduce_microstep: 360.44 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11637
total_samples=14566, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:08,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 04:40:08,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.28 | bwd_microstep: 2065.99 | bwd_inner_microstep: 1848.18 | bwd_allreduce_microstep: 217.75 | step_microstep: 394.19
[2025-08-03 04:40:08,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.24 | bwd: 7811.61 | bwd_inner: 6940.89 | bwd_allreduce: 870.48 | step: 394.57
{'loss': 0.7509, 'learning_rate': 1.114722661727182e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=14569, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:11,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.36 | bwd_microstep: 2065.43 | bwd_inner_microstep: 1839.12 | bwd_allreduce_microstep: 226.24 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13921
total_samples=14575, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:14,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 865.03 | bwd_microstep: 2008.82 | bwd_inner_microstep: 1883.08 | bwd_allreduce_microstep: 125.68 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14388
total_samples=14579, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:16,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.04 | bwd_microstep: 1816.04 | bwd_inner_microstep: 1747.21 | bwd_allreduce_microstep: 68.77 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12111
total_samples=14582, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:19,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.75
[2025-08-03 04:40:19,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.29 | bwd_microstep: 1736.37 | bwd_inner_microstep: 1554.06 | bwd_allreduce_microstep: 182.24 | step_microstep: 122.46
[2025-08-03 04:40:19,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2961.65 | bwd: 7626.71 | bwd_inner: 7023.47 | bwd_allreduce: 603.00 | step: 122.79
{'loss': 0.7538, 'learning_rate': 1.1131138261952845e-05, 'epoch': 0.48}
 48%|████▊     | 955/2000 [2:56:39<3:10:31, 10.94s/it]
 48%|████▊     | 956/2000 [2:56:50<3:10:51, 10.97s/it]
 48%|████▊     | 957/2000 [2:57:00<3:07:23, 10.78s/it]
 48%|████▊     | 958/2000 [2:57:11<3:11:05, 11.00s/it]
 48%|████▊     | 959/2000 [2:57:23<3:12:49, 11.11s/it]
 48%|████▊     | 960/2000 [2:57:34<3:12:10, 11.09s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11688
total_samples=14585, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:21,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.77 | bwd_microstep: 1776.40 | bwd_inner_microstep: 1678.48 | bwd_allreduce_microstep: 97.85 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16292
total_samples=14589, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:24,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.55 | bwd_microstep: 2119.39 | bwd_inner_microstep: 1891.52 | bwd_allreduce_microstep: 227.81 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15040
total_samples=14593, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:27,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.81 | bwd_microstep: 1759.16 | bwd_inner_microstep: 1738.90 | bwd_allreduce_microstep: 20.20 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11835
total_samples=14596, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:30,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:40:30,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.67 | bwd_microstep: 1937.30 | bwd_inner_microstep: 1550.19 | bwd_allreduce_microstep: 387.03 | step_microstep: 114.42
[2025-08-03 04:40:30,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.74 | bwd: 7592.30 | bwd_inner: 6859.09 | bwd_allreduce: 732.97 | step: 114.87
{'loss': 0.7579, 'learning_rate': 1.1115046940355643e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11601
total_samples=14599, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:32,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.65 | bwd_microstep: 1796.62 | bwd_inner_microstep: 1573.82 | bwd_allreduce_microstep: 222.73 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13287
total_samples=14603, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:35,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.30 | bwd_microstep: 1991.89 | bwd_inner_microstep: 1825.56 | bwd_allreduce_microstep: 166.27 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13169
total_samples=14607, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:38,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 874.80 | bwd_microstep: 1778.08 | bwd_inner_microstep: 1690.92 | bwd_allreduce_microstep: 87.10 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11563
total_samples=14610, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:41,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 04:40:41,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.57 | bwd_microstep: 1826.66 | bwd_inner_microstep: 1566.11 | bwd_allreduce_microstep: 260.49 | step_microstep: 167.99
[2025-08-03 04:40:41,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2973.25 | bwd: 7393.30 | bwd_inner: 6656.41 | bwd_allreduce: 736.66 | step: 168.43
{'loss': 0.7547, 'learning_rate': 1.109895269467783e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11713
total_samples=14613, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:43,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.59 | bwd_microstep: 1863.64 | bwd_inner_microstep: 1532.80 | bwd_allreduce_microstep: 330.79 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12209
total_samples=14616, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:46,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.52 | bwd_microstep: 2131.18 | bwd_inner_microstep: 1952.40 | bwd_allreduce_microstep: 178.71 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12923
total_samples=14620, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:49,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.61 | bwd_microstep: 1802.69 | bwd_inner_microstep: 1685.24 | bwd_allreduce_microstep: 117.39 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14643
total_samples=14624, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:52,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 04:40:52,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.65 | bwd_microstep: 1804.22 | bwd_inner_microstep: 1766.42 | bwd_allreduce_microstep: 37.74 | step_microstep: 133.14
[2025-08-03 04:40:52,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.30 | bwd: 7601.77 | bwd_inner: 6936.84 | bwd_allreduce: 664.70 | step: 133.48
{'loss': 0.7571, 'learning_rate': 1.1082855567124693e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11823
total_samples=14627, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:54,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.68 | bwd_microstep: 2156.14 | bwd_inner_microstep: 1956.35 | bwd_allreduce_microstep: 199.72 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13928
total_samples=14631, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:40:57,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.66 | bwd_microstep: 1961.16 | bwd_inner_microstep: 1849.54 | bwd_allreduce_microstep: 111.55 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13165
total_samples=14635, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:00,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.70 | bwd_microstep: 1746.59 | bwd_inner_microstep: 1656.11 | bwd_allreduce_microstep: 90.42 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14417
total_samples=14639, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:03,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.76
[2025-08-03 04:41:03,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.41 | bwd_microstep: 2021.51 | bwd_inner_microstep: 1908.03 | bwd_allreduce_microstep: 113.42 | step_microstep: 111.76
[2025-08-03 04:41:03,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.38 | bwd: 7885.44 | bwd_inner: 7370.02 | bwd_allreduce: 515.19 | step: 112.18
{'loss': 0.7622, 'learning_rate': 1.1066755599909065e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11612
total_samples=14642, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:05,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.89 | bwd_microstep: 1842.58 | bwd_inner_microstep: 1550.71 | bwd_allreduce_microstep: 291.80 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14260
total_samples=14646, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:08,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.50 | bwd_microstep: 2029.14 | bwd_inner_microstep: 1918.01 | bwd_allreduce_microstep: 111.05 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14125
total_samples=14650, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:11,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.59 | bwd_microstep: 1833.05 | bwd_inner_microstep: 1758.21 | bwd_allreduce_microstep: 74.77 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13716
total_samples=14654, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:13,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 04:41:13,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.52 | bwd_microstep: 1793.14 | bwd_inner_microstep: 1753.84 | bwd_allreduce_microstep: 39.23 | step_microstep: 111.59
[2025-08-03 04:41:13,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2762.42 | bwd: 7497.96 | bwd_inner: 6980.77 | bwd_allreduce: 516.95 | step: 112.08
{'loss': 0.7562, 'learning_rate': 1.105065283525124e-05, 'epoch': 0.48}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 11991
total_samples=14658, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:16,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.53 | bwd_microstep: 1795.09 | bwd_inner_microstep: 1573.15 | bwd_allreduce_microstep: 221.87 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13697
total_samples=14662, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:18,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.76 | bwd_microstep: 1728.17 | bwd_inner_microstep: 1681.28 | bwd_allreduce_microstep: 46.82 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13775
total_samples=14667, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:21,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.42 | bwd_microstep: 1750.26 | bwd_inner_microstep: 1676.16 | bwd_allreduce_microstep: 74.02 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13192
total_samples=14671, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:24,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 04:41:24,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.78 | bwd_microstep: 1906.09 | bwd_inner_microstep: 1824.69 | bwd_allreduce_microstep: 81.34 | step_microstep: 111.92
[2025-08-03 04:41:24,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2756.42 | bwd: 7179.66 | bwd_inner: 6755.26 | bwd_allreduce: 424.15 | step: 112.24
 48%|████▊     | 961/2000 [2:57:45<3:10:43, 11.01s/it]
 48%|████▊     | 962/2000 [2:57:56<3:09:44, 10.97s/it]
 48%|████▊     | 963/2000 [2:58:06<3:09:07, 10.94s/it]
 48%|████▊     | 964/2000 [2:58:18<3:10:03, 11.01s/it]
 48%|████▊     | 965/2000 [2:58:28<3:08:18, 10.92s/it]
 48%|████▊     | 966/2000 [2:58:39<3:05:20, 10.76s/it]
{'loss': 0.7564, 'learning_rate': 1.1034547315378838e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11754
total_samples=14674, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:27,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.70 | bwd_microstep: 2180.99 | bwd_inner_microstep: 1905.04 | bwd_allreduce_microstep: 275.87 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15084
total_samples=14678, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:30,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.85 | bwd_microstep: 2021.19 | bwd_inner_microstep: 1889.84 | bwd_allreduce_microstep: 131.28 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14773
total_samples=14682, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:32,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.32 | bwd_microstep: 2033.65 | bwd_inner_microstep: 1932.65 | bwd_allreduce_microstep: 100.94 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13218
total_samples=14686, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:35,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 04:41:35,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.61 | bwd_microstep: 2018.77 | bwd_inner_microstep: 2012.68 | bwd_allreduce_microstep: 6.02 | step_microstep: 132.51
[2025-08-03 04:41:35,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.42 | bwd: 8254.63 | bwd_inner: 7740.21 | bwd_allreduce: 514.19 | step: 132.92
{'loss': 0.7641, 'learning_rate': 1.101843908252671e-05, 'epoch': 0.48}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12651
total_samples=14690, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:38,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.58 | bwd_microstep: 1866.39 | bwd_inner_microstep: 1634.83 | bwd_allreduce_microstep: 231.49 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15010
total_samples=14694, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:41,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.17 | bwd_microstep: 1995.31 | bwd_inner_microstep: 1938.39 | bwd_allreduce_microstep: 56.84 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468
total_samples=14698, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:43,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.86 | bwd_microstep: 1849.92 | bwd_inner_microstep: 1692.43 | bwd_allreduce_microstep: 157.43 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13363
total_samples=14702, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:46,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 04:41:46,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.49 | bwd_microstep: 2010.57 | bwd_inner_microstep: 1882.69 | bwd_allreduce_microstep: 127.81 | step_microstep: 138.01
[2025-08-03 04:41:46,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.02 | bwd: 7722.23 | bwd_inner: 7148.34 | bwd_allreduce: 573.65 | step: 138.49
{'loss': 0.7579, 'learning_rate': 1.1002328178936813e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11916
total_samples=14705, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:49,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.90 | bwd_microstep: 2195.63 | bwd_inner_microstep: 1789.79 | bwd_allreduce_microstep: 405.77 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14092
total_samples=14709, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:52,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.21 | bwd_microstep: 1939.05 | bwd_inner_microstep: 1704.86 | bwd_allreduce_microstep: 234.13 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14206
total_samples=14713, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:55,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.13 | bwd_microstep: 1793.14 | bwd_inner_microstep: 1748.31 | bwd_allreduce_microstep: 44.75 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13214
total_samples=14717, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:41:57,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 04:41:57,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.55 | bwd_microstep: 2030.55 | bwd_inner_microstep: 1751.21 | bwd_allreduce_microstep: 279.27 | step_microstep: 110.04
[2025-08-03 04:41:57,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2759.72 | bwd: 7958.41 | bwd_inner: 6994.17 | bwd_allreduce: 964.00 | step: 110.38
{'loss': 0.7672, 'learning_rate': 1.0986214646858115e-05, 'epoch': 0.48}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11757
total_samples=14720, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:00,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.91 | bwd_microstep: 1839.32 | bwd_inner_microstep: 1720.06 | bwd_allreduce_microstep: 119.19 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14191
total_samples=14724, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:03,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.75 | bwd_microstep: 1846.27 | bwd_inner_microstep: 1739.05 | bwd_allreduce_microstep: 107.16 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13335
total_samples=14728, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:05,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.82 | bwd_microstep: 1875.42 | bwd_inner_microstep: 1693.94 | bwd_allreduce_microstep: 181.42 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14023
total_samples=14732, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:08,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 04:42:08,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.80 | bwd_microstep: 1820.75 | bwd_inner_microstep: 1753.48 | bwd_allreduce_microstep: 67.21 | step_microstep: 115.01
[2025-08-03 04:42:08,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.21 | bwd: 7381.81 | bwd_inner: 6906.52 | bwd_allreduce: 475.06 | step: 115.34
{'loss': 0.7632, 'learning_rate': 1.0970098528546482e-05, 'epoch': 0.48}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13606
total_samples=14736, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:11,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.93 | bwd_microstep: 2021.90 | bwd_inner_microstep: 1876.89 | bwd_allreduce_microstep: 144.95 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13159
total_samples=14740, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:13,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.61 | bwd_microstep: 1759.92 | bwd_inner_microstep: 1647.23 | bwd_allreduce_microstep: 112.63 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13297
total_samples=14744, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:16,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.99 | bwd_microstep: 1737.30 | bwd_inner_microstep: 1689.57 | bwd_allreduce_microstep: 47.67 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13912
total_samples=14749, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:19,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78
[2025-08-03 04:42:19,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.52 | bwd_microstep: 1988.81 | bwd_inner_microstep: 1764.18 | bwd_allreduce_microstep: 224.57 | step_microstep: 134.34
[2025-08-03 04:42:19,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.98 | bwd: 7507.98 | bwd_inner: 6977.86 | bwd_allreduce: 529.89 | step: 134.67
{'loss': 0.7578, 'learning_rate': 1.0953979866264549e-05, 'epoch': 0.49}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15109
total_samples=14753, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:22,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.33 | bwd_microstep: 2039.52 | bwd_inner_microstep: 1952.70 | bwd_allreduce_microstep: 86.77 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11851
total_samples=14756, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:24,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.52 | bwd_microstep: 1937.84 | bwd_inner_microstep: 1559.74 | bwd_allreduce_microstep: 378.03 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13462
total_samples=14761, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:27,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.14 | bwd_microstep: 1856.61 | bwd_inner_microstep: 1698.70 | bwd_allreduce_microstep: 157.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13324
total_samples=14765, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:30,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 04:42:30,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.88 | bwd_microstep: 1910.19 | bwd_inner_microstep: 1852.57 | bwd_allreduce_microstep: 57.56 | step_microstep: 138.97
[2025-08-03 04:42:30,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2763.80 | bwd: 7744.21 | bwd_inner: 7063.69 | bwd_allreduce: 680.28 | step: 139.31
 48%|████▊     | 967/2000 [2:58:50<3:09:27, 11.00s/it]
 48%|████▊     | 968/2000 [2:59:01<3:09:03, 10.99s/it]
 48%|████▊     | 969/2000 [2:59:12<3:09:43, 11.04s/it]
 48%|████▊     | 970/2000 [2:59:23<3:07:24, 10.92s/it]
 49%|████▊     | 971/2000 [2:59:34<3:06:14, 10.86s/it]
{'loss': 0.7591, 'learning_rate': 1.0937858702281631e-05, 'epoch': 0.49}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14244
total_samples=14769, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:32,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.87 | bwd_microstep: 1844.06 | bwd_inner_microstep: 1752.92 | bwd_allreduce_microstep: 91.09 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11695
total_samples=14772, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:35,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.12 | bwd_microstep: 1854.40 | bwd_inner_microstep: 1604.58 | bwd_allreduce_microstep: 249.75 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13532
total_samples=14776, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:38,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.93 | bwd_microstep: 1817.79 | bwd_inner_microstep: 1724.24 | bwd_allreduce_microstep: 93.48 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13475
total_samples=14781, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:40,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 04:42:40,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.60 | bwd_microstep: 1859.91 | bwd_inner_microstep: 1700.82 | bwd_allreduce_microstep: 159.02 | step_microstep: 137.08
[2025-08-03 04:42:40,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.44 | bwd: 7376.21 | bwd_inner: 6782.56 | bwd_allreduce: 593.41 | step: 137.43
{'loss': 0.7487, 'learning_rate': 1.0921735078873599e-05, 'epoch': 0.49}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13784
total_samples=14785, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:43,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.13 | bwd_microstep: 2172.41 | bwd_inner_microstep: 1905.63 | bwd_allreduce_microstep: 266.70 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508
total_samples=14789, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:46,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.08 | bwd_microstep: 2035.00 | bwd_inner_microstep: 1902.81 | bwd_allreduce_microstep: 132.13 | step_microstep: 0.20
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13325
total_samples=14793, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:49,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.17 | bwd_microstep: 1751.85 | bwd_inner_microstep: 1638.76 | bwd_allreduce_microstep: 113.04 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13360
total_samples=14797, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:52,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 04:42:52,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.22 | bwd_microstep: 1944.28 | bwd_inner_microstep: 1825.17 | bwd_allreduce_microstep: 119.05 | step_microstep: 112.51
[2025-08-03 04:42:52,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.53 | bwd: 7903.59 | bwd_inner: 7272.36 | bwd_allreduce: 631.00 | step: 112.97
{'loss': 0.7517, 'learning_rate': 1.090560903832278e-05, 'epoch': 0.49}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11853
total_samples=14801, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:54,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.80 | bwd_microstep: 1994.29 | bwd_inner_microstep: 1798.28 | bwd_allreduce_microstep: 195.95 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11858
total_samples=14804, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:42:57,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.60 | bwd_microstep: 1812.50 | bwd_inner_microstep: 1585.33 | bwd_allreduce_microstep: 227.10 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11569
total_samples=14807, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:00,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.56 | bwd_microstep: 1863.92 | bwd_inner_microstep: 1737.01 | bwd_allreduce_microstep: 126.84 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12950
total_samples=14812, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:02,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:43:02,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.16 | bwd_microstep: 1904.67 | bwd_inner_microstep: 1674.20 | bwd_allreduce_microstep: 230.41 | step_microstep: 113.31
[2025-08-03 04:43:02,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.06 | bwd: 7575.43 | bwd_inner: 6794.81 | bwd_allreduce: 780.38 | step: 113.77
{'loss': 0.7537, 'learning_rate': 1.088948062291783e-05, 'epoch': 0.49}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13633
total_samples=14817, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:05,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.25 | bwd_microstep: 1893.01 | bwd_inner_microstep: 1848.55 | bwd_allreduce_microstep: 44.40 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13087
total_samples=14821, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:08,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.03 | bwd_microstep: 1868.65 | bwd_inner_microstep: 1629.97 | bwd_allreduce_microstep: 238.61 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13217
total_samples=14825, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:10,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.23 | bwd_microstep: 1949.08 | bwd_inner_microstep: 1855.97 | bwd_allreduce_microstep: 93.05 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13535
total_samples=14829, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:14,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 04:43:14,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.71 | bwd_microstep: 1874.68 | bwd_inner_microstep: 1827.44 | bwd_allreduce_microstep: 47.18 | step_microstep: 467.03
[2025-08-03 04:43:14,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2752.15 | bwd: 7585.47 | bwd_inner: 7161.93 | bwd_allreduce: 423.32 | step: 467.37
{'loss': 0.751, 'learning_rate': 1.087334987495364e-05, 'epoch': 0.49}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12920
total_samples=14833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:16,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.14 | bwd_microstep: 1714.56 | bwd_inner_microstep: 1630.89 | bwd_allreduce_microstep: 83.61 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13161
total_samples=14837, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:19,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.98 | bwd_microstep: 1998.39 | bwd_inner_microstep: 1879.93 | bwd_allreduce_microstep: 118.39 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14489
total_samples=14841, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:21,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.08 | bwd_microstep: 1859.46 | bwd_inner_microstep: 1853.42 | bwd_allreduce_microstep: 5.98 | step_microstep: 0.20
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13091
total_samples=14845, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:24,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 04:43:24,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.92 | bwd_microstep: 1973.81 | bwd_inner_microstep: 1815.68 | bwd_allreduce_microstep: 158.07 | step_microstep: 113.00
[2025-08-03 04:43:24,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2750.06 | bwd: 7546.27 | bwd_inner: 7179.91 | bwd_allreduce: 366.13 | step: 113.47
 49%|████▊     | 972/2000 [2:59:45<3:06:54, 10.91s/it]
 49%|████▊     | 973/2000 [2:59:55<3:05:19, 10.83s/it]
 49%|████▊     | 974/2000 [3:00:06<3:06:33, 10.91s/it]
 49%|████▉     | 975/2000 [3:00:17<3:06:06, 10.89s/it]
 49%|████▉     | 976/2000 [3:00:28<3:07:08, 10.97s/it]
 49%|████▉     | 977/2000 [3:00:39<3:05:54, 10.90s/it]
{'loss': 0.7619, 'learning_rate': 1.0857216836731221e-05, 'epoch': 0.49}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12176
total_samples=14848, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:27,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.09 | bwd_microstep: 1697.13 | bwd_inner_microstep: 1539.02 | bwd_allreduce_microstep: 158.04 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15055
total_samples=14852, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:29,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.39 | bwd_microstep: 1800.27 | bwd_inner_microstep: 1762.87 | bwd_allreduce_microstep: 37.34 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14169
total_samples=14856, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:32,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.98 | bwd_microstep: 1710.88 | bwd_inner_microstep: 1687.64 | bwd_allreduce_microstep: 23.18 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13876
total_samples=14860, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:35,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 04:43:35,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.10 | bwd_microstep: 1890.72 | bwd_inner_microstep: 1707.99 | bwd_allreduce_microstep: 182.66 | step_microstep: 428.54
[2025-08-03 04:43:35,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.48 | bwd: 7099.03 | bwd_inner: 6697.51 | bwd_allreduce: 401.29 | step: 428.88
{'loss': 0.7588, 'learning_rate': 1.0841081550557577e-05, 'epoch': 0.49}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11983
total_samples=14863, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:38,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.38 | bwd_microstep: 1954.53 | bwd_inner_microstep: 1547.34 | bwd_allreduce_microstep: 407.13 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13694
total_samples=14867, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:40,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.47 | bwd_microstep: 1841.28 | bwd_inner_microstep: 1717.48 | bwd_allreduce_microstep: 123.73 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12570
total_samples=14871, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:43,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.13 | bwd_microstep: 1818.42 | bwd_inner_microstep: 1626.69 | bwd_allreduce_microstep: 191.66 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14237
total_samples=14875, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:46,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 04:43:46,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.80 | bwd_microstep: 1751.67 | bwd_inner_microstep: 1658.96 | bwd_allreduce_microstep: 92.65 | step_microstep: 135.79
[2025-08-03 04:43:46,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.70 | bwd: 7365.95 | bwd_inner: 6550.46 | bwd_allreduce: 815.26 | step: 136.21
{'loss': 0.7643, 'learning_rate': 1.0824944058745623e-05, 'epoch': 0.49}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13592
total_samples=14879, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:48,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.66 | bwd_microstep: 1957.21 | bwd_inner_microstep: 1822.12 | bwd_allreduce_microstep: 135.02 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12722
total_samples=14883, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:51,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.65 | bwd_microstep: 2333.20 | bwd_inner_microstep: 1883.05 | bwd_allreduce_microstep: 450.09 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11748
total_samples=14886, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:54,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.04 | bwd_microstep: 1787.93 | bwd_inner_microstep: 1549.76 | bwd_allreduce_microstep: 238.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13247
total_samples=14890, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:43:57,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 04:43:57,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.40 | bwd_microstep: 2010.09 | bwd_inner_microstep: 1878.69 | bwd_allreduce_microstep: 131.30 | step_microstep: 135.67
[2025-08-03 04:43:57,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.68 | bwd: 8088.48 | bwd_inner: 7133.64 | bwd_allreduce: 954.59 | step: 135.99
{'loss': 0.7477, 'learning_rate': 1.0808804403614044e-05, 'epoch': 0.49}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15361
total_samples=14894, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:00,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.35 | bwd_microstep: 1771.36 | bwd_inner_microstep: 1764.76 | bwd_allreduce_microstep: 6.54 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12783
total_samples=14898, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:02,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.76 | bwd_microstep: 1768.57 | bwd_inner_microstep: 1649.27 | bwd_allreduce_microstep: 119.24 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11912
total_samples=14901, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:05,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.81 | bwd_microstep: 1746.51 | bwd_inner_microstep: 1546.43 | bwd_allreduce_microstep: 200.02 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13251
total_samples=14905, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:07,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 04:44:07,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.77 | bwd_microstep: 1963.86 | bwd_inner_microstep: 1850.09 | bwd_allreduce_microstep: 113.71 | step_microstep: 130.79
[2025-08-03 04:44:07,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.62 | bwd: 7250.36 | bwd_inner: 6810.54 | bwd_allreduce: 439.58 | step: 131.12
{'loss': 0.7507, 'learning_rate': 1.0792662627487207e-05, 'epoch': 0.49}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13392
total_samples=14909, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:10,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.34 | bwd_microstep: 1816.96 | bwd_inner_microstep: 1702.33 | bwd_allreduce_microstep: 114.57 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13571
total_samples=14913, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:13,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.63 | bwd_microstep: 1980.55 | bwd_inner_microstep: 1847.95 | bwd_allreduce_microstep: 132.53 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13370
total_samples=14917, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:16,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.60 | bwd_microstep: 2222.46 | bwd_inner_microstep: 2057.16 | bwd_allreduce_microstep: 165.24 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14792
total_samples=14921, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:19,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 04:44:19,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.12 | bwd_microstep: 1854.51 | bwd_inner_microstep: 1758.22 | bwd_allreduce_microstep: 96.23 | step_microstep: 130.47
[2025-08-03 04:44:19,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.62 | bwd: 7874.53 | bwd_inner: 7365.65 | bwd_allreduce: 508.65 | step: 130.94
{'loss': 0.7582, 'learning_rate': 1.0776518772695035e-05, 'epoch': 0.49}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13602
total_samples=14925, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:21,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.88 | bwd_microstep: 1786.84 | bwd_inner_microstep: 1704.34 | bwd_allreduce_microstep: 82.44 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14231
total_samples=14929, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:24,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.33 | bwd_microstep: 1784.14 | bwd_inner_microstep: 1739.80 | bwd_allreduce_microstep: 44.27 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11611
total_samples=14932, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:26,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.47 | bwd_microstep: 1918.68 | bwd_inner_microstep: 1531.63 | bwd_allreduce_microstep: 386.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13935
total_samples=14936, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:30,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 04:44:30,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.61 | bwd_microstep: 2413.69 | bwd_inner_microstep: 2063.36 | bwd_allreduce_microstep: 350.28 | step_microstep: 110.05
[2025-08-03 04:44:30,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.22 | bwd: 7903.41 | bwd_inner: 7039.11 | bwd_allreduce: 864.06 | step: 110.54
 49%|████▉     | 977/2000 [3:00:39<3:05:54, 10.90s/it]
 49%|████▉     | 978/2000 [3:00:50<3:04:22, 10.82s/it]
 49%|████▉     | 979/2000 [3:01:00<3:03:15, 10.77s/it]
 49%|████▉     | 980/2000 [3:01:12<3:05:59, 10.94s/it]
 49%|████▉     | 981/2000 [3:01:22<3:03:37, 10.81s/it]
 49%|████▉     | 982/2000 [3:01:33<3:05:06, 10.91s/it]
{'loss': 0.7611, 'learning_rate': 1.0760372881572904e-05, 'epoch': 0.49}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13782
total_samples=14940, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:32,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.79 | bwd_microstep: 1717.24 | bwd_inner_microstep: 1667.70 | bwd_allreduce_microstep: 49.47 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11621
total_samples=14943, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:35,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.33 | bwd_microstep: 1793.22 | bwd_inner_microstep: 1561.61 | bwd_allreduce_microstep: 231.55 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11595
total_samples=14946, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:37,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.48 | bwd_microstep: 1774.15 | bwd_inner_microstep: 1545.79 | bwd_allreduce_microstep: 228.29 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11736
total_samples=14949, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:40,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 04:44:40,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.03 | bwd_microstep: 2066.85 | bwd_inner_microstep: 1850.71 | bwd_allreduce_microstep: 216.07 | step_microstep: 143.48
[2025-08-03 04:44:40,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2746.56 | bwd: 7351.52 | bwd_inner: 6625.80 | bwd_allreduce: 725.46 | step: 143.98
{'loss': 0.7421, 'learning_rate': 1.0744224996461541e-05, 'epoch': 0.49}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13130
total_samples=14953, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:43,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.52 | bwd_microstep: 1749.12 | bwd_inner_microstep: 1656.55 | bwd_allreduce_microstep: 92.49 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11721
total_samples=14956, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:45,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.10 | bwd_microstep: 1704.84 | bwd_inner_microstep: 1529.20 | bwd_allreduce_microstep: 175.58 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13551
total_samples=14960, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:48,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.76 | bwd_microstep: 2085.83 | bwd_inner_microstep: 1867.00 | bwd_allreduce_microstep: 218.77 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11726
total_samples=14963, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:51,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 04:44:51,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.72 | bwd_microstep: 2163.90 | bwd_inner_microstep: 1958.77 | bwd_allreduce_microstep: 205.06 | step_microstep: 118.55
[2025-08-03 04:44:51,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.04 | bwd: 7703.74 | bwd_inner: 7011.52 | bwd_allreduce: 691.98 | step: 119.05
{'loss': 0.7571, 'learning_rate': 1.0728075159706881e-05, 'epoch': 0.49}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14062
total_samples=14968, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:54,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.21 | bwd_microstep: 1775.68 | bwd_inner_microstep: 1713.66 | bwd_allreduce_microstep: 61.96 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11973
total_samples=14971, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:56,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.28 | bwd_microstep: 1798.77 | bwd_inner_microstep: 1571.68 | bwd_allreduce_microstep: 227.04 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13650
total_samples=14975, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:44:59,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.80 | bwd_microstep: 1820.89 | bwd_inner_microstep: 1736.29 | bwd_allreduce_microstep: 84.54 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11572
total_samples=14978, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:02,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 04:45:02,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.08 | bwd_microstep: 1783.33 | bwd_inner_microstep: 1545.57 | bwd_allreduce_microstep: 237.69 | step_microstep: 134.12
[2025-08-03 04:45:02,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.30 | bwd: 7178.73 | bwd_inner: 6567.19 | bwd_allreduce: 611.30 | step: 134.55
{'loss': 0.7614, 'learning_rate': 1.0711923413659995e-05, 'epoch': 0.49}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13383
total_samples=14982, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:04,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.08 | bwd_microstep: 1771.95 | bwd_inner_microstep: 1686.47 | bwd_allreduce_microstep: 85.41 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12082
total_samples=14985, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:07,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.52 | bwd_microstep: 1952.89 | bwd_inner_microstep: 1747.89 | bwd_allreduce_microstep: 204.94 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13887
total_samples=14989, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:10,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.61 | bwd_microstep: 2021.63 | bwd_inner_microstep: 1889.38 | bwd_allreduce_microstep: 132.19 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13434
total_samples=14993, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:13,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 04:45:13,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.77 | bwd_microstep: 2016.14 | bwd_inner_microstep: 1992.88 | bwd_allreduce_microstep: 23.19 | step_microstep: 128.02
[2025-08-03 04:45:13,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2762.91 | bwd: 7762.66 | bwd_inner: 7316.62 | bwd_allreduce: 445.80 | step: 128.57
{'loss': 0.7589, 'learning_rate': 1.069576980067695e-05, 'epoch': 0.49}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12862
total_samples=14997, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:15,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.69 | bwd_microstep: 1788.64 | bwd_inner_microstep: 1657.95 | bwd_allreduce_microstep: 130.62 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11676
total_samples=15000, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:18,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.94 | bwd_microstep: 1809.60 | bwd_inner_microstep: 1577.76 | bwd_allreduce_microstep: 231.78 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12431
total_samples=15004, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:21,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.88 | bwd_microstep: 1996.63 | bwd_inner_microstep: 1793.18 | bwd_allreduce_microstep: 203.38 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13414
total_samples=15008, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:23,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.66
[2025-08-03 04:45:23,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.99 | bwd_microstep: 1829.77 | bwd_inner_microstep: 1771.79 | bwd_allreduce_microstep: 57.92 | step_microstep: 140.58
[2025-08-03 04:45:23,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.43 | bwd: 7424.69 | bwd_inner: 6800.67 | bwd_allreduce: 623.78 | step: 141.08
 49%|████▉     | 983/2000 [3:01:45<3:06:06, 10.98s/it]
 49%|████▉     | 984/2000 [3:01:55<3:03:54, 10.86s/it]
 49%|████▉     | 985/2000 [3:02:06<3:04:06, 10.88s/it]
 49%|████▉     | 986/2000 [3:02:17<3:01:49, 10.76s/it]
 49%|████▉     | 987/2000 [3:02:28<3:02:51, 10.83s/it]
 49%|████▉     | 988/2000 [3:02:38<3:02:02, 10.79s/it]
{'loss': 0.763, 'learning_rate': 1.0679614363118718e-05, 'epoch': 0.49}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13424
total_samples=15012, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:26,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 926.55 | bwd_microstep: 1728.81 | bwd_inner_microstep: 1663.79 | bwd_allreduce_microstep: 64.95 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11638
total_samples=15015, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:29,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.46 | bwd_microstep: 1739.41 | bwd_inner_microstep: 1549.39 | bwd_allreduce_microstep: 189.95 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12014
total_samples=15018, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:32,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.19 | bwd_microstep: 2214.99 | bwd_inner_microstep: 1862.12 | bwd_allreduce_microstep: 352.81 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14038
total_samples=15022, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:35,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 04:45:35,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.33 | bwd_microstep: 1947.32 | bwd_inner_microstep: 1737.88 | bwd_allreduce_microstep: 209.37 | step_microstep: 125.35
[2025-08-03 04:45:35,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3000.46 | bwd: 7630.57 | bwd_inner: 6813.16 | bwd_allreduce: 817.17 | step: 125.70
{'loss': 0.7569, 'learning_rate': 1.0663457143351044e-05, 'epoch': 0.49}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15114
total_samples=15028, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:37,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.35 | bwd_microstep: 1897.13 | bwd_inner_microstep: 1753.01 | bwd_allreduce_microstep: 144.05 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12368
total_samples=15031, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:40,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.96 | bwd_microstep: 1927.20 | bwd_inner_microstep: 1746.80 | bwd_allreduce_microstep: 180.34 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13385
total_samples=15035, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:43,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.95 | bwd_microstep: 1935.91 | bwd_inner_microstep: 1717.83 | bwd_allreduce_microstep: 218.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11987
total_samples=15038, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:45,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 04:45:45,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 659.40 | bwd_microstep: 1965.81 | bwd_inner_microstep: 1629.22 | bwd_allreduce_microstep: 336.52 | step_microstep: 111.59
[2025-08-03 04:45:45,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2735.59 | bwd: 7726.09 | bwd_inner: 6846.84 | bwd_allreduce: 878.99 | step: 111.94
{'loss': 0.7492, 'learning_rate': 1.0647298183744359e-05, 'epoch': 0.49}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11654
total_samples=15041, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:48,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.47 | bwd_microstep: 1990.51 | bwd_inner_microstep: 1763.00 | bwd_allreduce_microstep: 227.44 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13175
total_samples=15045, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:51,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.02 | bwd_microstep: 1809.67 | bwd_inner_microstep: 1713.92 | bwd_allreduce_microstep: 95.68 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11911
total_samples=15048, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:53,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.67 | bwd_microstep: 1727.57 | bwd_inner_microstep: 1553.35 | bwd_allreduce_microstep: 174.15 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13609
total_samples=15052, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:56,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.56
[2025-08-03 04:45:56,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.14 | bwd_microstep: 1758.32 | bwd_inner_microstep: 1703.10 | bwd_allreduce_microstep: 55.15 | step_microstep: 123.99
[2025-08-03 04:45:56,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.22 | bwd: 7286.11 | bwd_inner: 6733.36 | bwd_allreduce: 552.50 | step: 124.46
{'loss': 0.7523, 'learning_rate': 1.0631137526673647e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13260
total_samples=15056, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:45:59,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.06 | bwd_microstep: 1955.66 | bwd_inner_microstep: 1905.65 | bwd_allreduce_microstep: 49.94 | step_microstep: 0.17
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13171
total_samples=15060, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:01,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.37 | bwd_microstep: 1978.22 | bwd_inner_microstep: 1689.00 | bwd_allreduce_microstep: 289.15 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13785
total_samples=15064, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:04,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 660.59 | bwd_microstep: 1957.06 | bwd_inner_microstep: 1895.08 | bwd_allreduce_microstep: 61.91 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13784
total_samples=15068, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:07,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 31.51
[2025-08-03 04:46:07,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.86 | bwd_microstep: 1780.53 | bwd_inner_microstep: 1694.63 | bwd_allreduce_microstep: 85.84 | step_microstep: 136.54
[2025-08-03 04:46:07,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.80 | bwd: 7671.52 | bwd_inner: 7184.36 | bwd_allreduce: 486.92 | step: 137.02
{'loss': 0.7522, 'learning_rate': 1.061497521451835e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13296
total_samples=15072, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:10,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.93 | bwd_microstep: 1897.27 | bwd_inner_microstep: 1715.65 | bwd_allreduce_microstep: 181.52 | step_microstep: 0.24
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12544
total_samples=15076, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:12,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.10 | bwd_microstep: 1794.95 | bwd_inner_microstep: 1632.53 | bwd_allreduce_microstep: 162.36 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11792
total_samples=15079, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:15,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.12 | bwd_microstep: 1735.16 | bwd_inner_microstep: 1540.06 | bwd_allreduce_microstep: 195.03 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13988
total_samples=15083, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:18,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 04:46:18,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.40 | bwd_microstep: 2039.36 | bwd_inner_microstep: 1764.64 | bwd_allreduce_microstep: 274.66 | step_microstep: 124.09
[2025-08-03 04:46:18,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.48 | bwd: 7466.79 | bwd_inner: 6652.88 | bwd_allreduce: 813.66 | step: 124.65
{'loss': 0.7584, 'learning_rate': 1.0598811289662243e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13650
total_samples=15087, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:20,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.52 | bwd_microstep: 1945.21 | bwd_inner_microstep: 1939.09 | bwd_allreduce_microstep: 6.05 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11795
total_samples=15090, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:23,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.17 | bwd_microstep: 1768.60 | bwd_inner_microstep: 1552.02 | bwd_allreduce_microstep: 216.50 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11712
total_samples=15093, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:25,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.61 | bwd_microstep: 1748.88 | bwd_inner_microstep: 1535.15 | bwd_allreduce_microstep: 213.66 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12935
total_samples=15097, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:28,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:46:28,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.98 | bwd_microstep: 1821.15 | bwd_inner_microstep: 1674.97 | bwd_allreduce_microstep: 146.11 | step_microstep: 113.71
[2025-08-03 04:46:28,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.21 | bwd: 7283.88 | bwd_inner: 6701.23 | bwd_allreduce: 582.41 | step: 114.15
 49%|████▉     | 989/2000 [3:02:49<3:03:19, 10.88s/it]
 50%|████▉     | 990/2000 [3:03:00<3:03:19, 10.89s/it]
 50%|████▉     | 991/2000 [3:03:11<3:01:14, 10.78s/it]
 50%|████▉     | 992/2000 [3:03:22<3:01:53, 10.83s/it]
 50%|████▉     | 993/2000 [3:03:32<3:01:05, 10.79s/it]
{'loss': 0.7505, 'learning_rate': 1.0582645794493337e-05, 'epoch': 0.5}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12862
total_samples=15101, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:31,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.62 | bwd_microstep: 2104.45 | bwd_inner_microstep: 1864.97 | bwd_allreduce_microstep: 239.39 | step_microstep: 0.37
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13474
total_samples=15105, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:34,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.56 | bwd_microstep: 1730.36 | bwd_inner_microstep: 1672.66 | bwd_allreduce_microstep: 57.62 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11729
total_samples=15108, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:36,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.05 | bwd_microstep: 1914.30 | bwd_inner_microstep: 1726.18 | bwd_allreduce_microstep: 188.05 | step_microstep: 0.26
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12048
total_samples=15112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:39,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 04:46:39,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.78 | bwd_microstep: 2023.04 | bwd_inner_microstep: 1816.83 | bwd_allreduce_microstep: 206.13 | step_microstep: 143.46
[2025-08-03 04:46:39,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.94 | bwd: 7772.19 | bwd_inner: 7080.65 | bwd_allreduce: 691.28 | step: 144.21
{'loss': 0.7638, 'learning_rate': 1.0566478771403763e-05, 'epoch': 0.5}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15118
total_samples=15116, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:42,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.50 | bwd_microstep: 2165.85 | bwd_inner_microstep: 2001.53 | bwd_allreduce_microstep: 164.21 | step_microstep: 0.48
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13310
total_samples=15120, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:45,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.80 | bwd_microstep: 1738.74 | bwd_inner_microstep: 1625.37 | bwd_allreduce_microstep: 113.29 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11739
total_samples=15123, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:47,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.54 | bwd_microstep: 1768.59 | bwd_inner_microstep: 1553.59 | bwd_allreduce_microstep: 214.93 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13391
total_samples=15127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:50,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 04:46:50,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.17 | bwd_microstep: 2015.90 | bwd_inner_microstep: 1911.10 | bwd_allreduce_microstep: 104.73 | step_microstep: 117.06
[2025-08-03 04:46:50,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2751.94 | bwd: 7689.13 | bwd_inner: 7091.59 | bwd_allreduce: 597.27 | step: 117.81
{'loss': 0.7586, 'learning_rate': 1.055031026278965e-05, 'epoch': 0.5}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13641
total_samples=15131, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:53,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.05 | bwd_microstep: 1746.88 | bwd_inner_microstep: 1661.85 | bwd_allreduce_microstep: 84.96 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13333
total_samples=15135, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:55,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.08 | bwd_microstep: 1766.37 | bwd_inner_microstep: 1689.67 | bwd_allreduce_microstep: 76.63 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12696
total_samples=15138, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:46:58,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.08 | bwd_microstep: 2233.54 | bwd_inner_microstep: 2039.88 | bwd_allreduce_microstep: 193.60 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14314
total_samples=15142, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:01,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 04:47:01,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.79 | bwd_microstep: 2009.10 | bwd_inner_microstep: 1894.83 | bwd_allreduce_microstep: 114.20 | step_microstep: 131.01
[2025-08-03 04:47:01,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2731.93 | bwd: 7755.96 | bwd_inner: 7286.23 | bwd_allreduce: 469.47 | step: 131.58
{'loss': 0.7601, 'learning_rate': 1.0534140311051026e-05, 'epoch': 0.5}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13441
total_samples=15146, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:04,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.12 | bwd_microstep: 1730.59 | bwd_inner_microstep: 1658.07 | bwd_allreduce_microstep: 72.46 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14028
total_samples=15150, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:06,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.93 | bwd_microstep: 1807.19 | bwd_inner_microstep: 1714.04 | bwd_allreduce_microstep: 93.08 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11705
total_samples=15153, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:09,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.89 | bwd_microstep: 1775.88 | bwd_inner_microstep: 1540.37 | bwd_allreduce_microstep: 235.45 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11889
total_samples=15156, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:12,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.63
[2025-08-03 04:47:12,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.29 | bwd_microstep: 1896.79 | bwd_inner_microstep: 1742.78 | bwd_allreduce_microstep: 153.94 | step_microstep: 153.28
[2025-08-03 04:47:12,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.15 | bwd: 7210.49 | bwd_inner: 6655.25 | bwd_allreduce: 555.01 | step: 153.60
{'loss': 0.7377, 'learning_rate': 1.0517968958591705e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13203
total_samples=15160, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:14,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.23 | bwd_microstep: 2202.71 | bwd_inner_microstep: 2039.52 | bwd_allreduce_microstep: 163.13 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12805
total_samples=15163, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:17,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.98 | bwd_microstep: 1781.97 | bwd_inner_microstep: 1614.63 | bwd_allreduce_microstep: 167.27 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13777
total_samples=15167, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:21,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.18 | bwd_microstep: 2743.55 | bwd_inner_microstep: 2463.03 | bwd_allreduce_microstep: 280.46 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13877
total_samples=15171, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:23,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.55
[2025-08-03 04:47:23,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.76 | bwd_microstep: 1743.61 | bwd_inner_microstep: 1696.67 | bwd_allreduce_microstep: 46.87 | step_microstep: 112.76
[2025-08-03 04:47:23,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2748.09 | bwd: 8471.90 | bwd_inner: 7813.85 | bwd_allreduce: 657.81 | step: 113.22
50%|████▉     | 994/2000 [3:03:43<2:59:33, 10.71s/it]
50%|████▉     | 995/2000 [3:03:54<3:01:08, 10.81s/it]
50%|████▉     | 996/2000 [3:04:05<3:01:23, 10.84s/it]
50%|████▉     | 997/2000 [3:04:16<3:02:00, 10.89s/it]
50%|████▉     | 998/2000 [3:04:26<2:59:44, 10.76s/it]
50%|████▉     | 999/2000 [3:04:38<3:04:07, 11.04s/it]
{'loss': 0.7506, 'learning_rate': 1.0501796247819176e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15887
total_samples=15176, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:26,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.56 | bwd_microstep: 1805.84 | bwd_inner_microstep: 1777.46 | bwd_allreduce_microstep: 28.31 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12896
total_samples=15180, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:28,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.77 | bwd_microstep: 1798.85 | bwd_inner_microstep: 1715.90 | bwd_allreduce_microstep: 82.84 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11851
total_samples=15183, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:31,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.00 | bwd_microstep: 2254.82 | bwd_inner_microstep: 2037.48 | bwd_allreduce_microstep: 217.28 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11925
total_samples=15186, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:34,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 04:47:34,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.38 | bwd_microstep: 1955.60 | bwd_inner_microstep: 1776.31 | bwd_allreduce_microstep: 179.22 | step_microstep: 133.96
[2025-08-03 04:47:34,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.65 | bwd: 7815.15 | bwd_inner: 7307.17 | bwd_allreduce: 507.72 | step: 134.29
{'loss': 0.7476, 'learning_rate': 1.0485622221144485e-05, 'epoch': 0.5}
50%|█████     | 1000/2000 [3:04:49<3:04:05, 11.05s/it]
[INFO|trainer.py:2936] 2025-08-03 04:47:37,712 >> Saving model checkpoint to work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000
[INFO|configuration_utils.py:473] 2025-08-03 04:47:37,728 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/config.json
[INFO|configuration_utils.py:594] 2025-08-03 04:47:37,732 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/generation_config.json
[INFO|modeling_utils.py:2493] 2025-08-03 04:47:41,733 >> Model weights saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/model.safetensors
[INFO|tokenization_utils_base.py:2433] 2025-08-03 04:47:41,739 >> tokenizer config file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2025-08-03 04:47:41,743 >> Special tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2025-08-03 04:47:41,745 >> added tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/added_tokens.json
[2025-08-03 04:47:42,273] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step1000 is about to be saved!
[2025-08-03 04:47:42,287] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_24_mp_rank_00_model_states.pt...
[2025-08-03 04:47:42,294] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-08-03 04:47:42,294] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-08-03 04:47:42,328] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_8_mp_rank_00_model_states.pt...
[2025-08-03 04:47:42,295] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_16_mp_rank_00_model_states.pt...
[2025-08-03 04:47:42,354] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-08-03 04:47:42,385] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_16_mp_rank_00_model_states.pt.
[2025-08-03 04:47:42,456] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_8_mp_rank_00_model_states.pt.
[2025-08-03 04:47:42,438] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/zero_pp_rank_24_mp_rank_00_model_states.pt.
[2025-08-03 04:47:42,479] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt...
[2025-08-03 04:47:42,478] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-08-03 04:47:42,512] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt...
[2025-08-03 04:47:42,474] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt...
[2025-08-03 04:47:43,856] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt.
[2025-08-03 04:47:43,857] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt
[2025-08-03 04:47:44,043] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt.
[2025-08-03 04:47:44,044] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt
[2025-08-03 04:47:44,065] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt.
[2025-08-03 04:47:44,065] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt
[2025-08-03 04:47:44,136] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-08-03 04:47:44,161] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-1000/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-08-03 04:47:44,429] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1000 is ready now!
[2025-08-03 04:47:44,463] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1000 is ready now!
[2025-08-03 04:47:44,430] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1000 is ready now!
[2025-08-03 04:47:44,424] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1000 is ready now!
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15085
total_samples=15190, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:47,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.23 | bwd_microstep: 1741.66 | bwd_inner_microstep: 1735.42 | bwd_allreduce_microstep: 6.18 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12262
total_samples=15194, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:49,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.85 | bwd_microstep: 1753.62 | bwd_inner_microstep: 1574.48 | bwd_allreduce_microstep: 179.08 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11949
total_samples=15197, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:52,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.28 | bwd_microstep: 1873.81 | bwd_inner_microstep: 1554.20 | bwd_allreduce_microstep: 319.55 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13303
total_samples=15201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:55,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.74
[2025-08-03 04:47:55,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.45 | bwd_microstep: 2020.97 | bwd_inner_microstep: 1878.44 | bwd_allreduce_microstep: 142.47 | step_microstep: 135.80
[2025-08-03 04:47:55,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.75 | bwd: 7390.11 | bwd_inner: 6742.51 | bwd_allreduce: 647.36 | step: 136.13
{'loss': 0.7533, 'learning_rate': 1.046944692098213e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13912
total_samples=15206, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:47:57,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.28 | bwd_microstep: 1937.82 | bwd_inner_microstep: 1778.89 | bwd_allreduce_microstep: 158.87 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13738
total_samples=15210, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:00,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.96 | bwd_microstep: 1823.58 | bwd_inner_microstep: 1748.85 | bwd_allreduce_microstep: 74.67 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14041
total_samples=15214, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:02,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.78 | bwd_microstep: 1773.63 | bwd_inner_microstep: 1729.50 | bwd_allreduce_microstep: 44.06 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13202
total_samples=15218, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:05,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 04:48:05,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.43 | bwd_microstep: 1860.89 | bwd_inner_microstep: 1756.32 | bwd_allreduce_microstep: 104.51 | step_microstep: 131.27
[2025-08-03 04:48:05,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.38 | bwd: 7395.97 | bwd_inner: 7013.56 | bwd_allreduce: 382.18 | step: 131.60
{'loss': 0.759, 'learning_rate': 1.0453270389749956e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15024
total_samples=15223, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:08,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.64 | bwd_microstep: 1901.65 | bwd_inner_microstep: 1862.91 | bwd_allreduce_microstep: 38.67 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13602
total_samples=15227, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:10,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.92 | bwd_microstep: 1763.52 | bwd_inner_microstep: 1700.76 | bwd_allreduce_microstep: 62.69 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11562
total_samples=15230, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:13,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.71 | bwd_microstep: 1788.26 | bwd_inner_microstep: 1564.57 | bwd_allreduce_microstep: 223.61 | step_microstep: 0.13
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12114
total_samples=15234, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:16,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 04:48:16,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.84 | bwd_microstep: 2004.52 | bwd_inner_microstep: 1622.16 | bwd_allreduce_microstep: 382.30 | step_microstep: 111.21
[2025-08-03 04:48:16,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.03 | bwd: 7457.99 | bwd_inner: 6750.39 | bwd_allreduce: 707.36 | step: 111.71
{'loss': 0.7616, 'learning_rate': 1.0437092669869025e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14875
total_samples=15238, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:19,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.95 | bwd_microstep: 1990.82 | bwd_inner_microstep: 1888.38 | bwd_allreduce_microstep: 102.38 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13334
total_samples=15242, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:21,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.20 | bwd_microstep: 1686.06 | bwd_inner_microstep: 1647.06 | bwd_allreduce_microstep: 38.94 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12774
total_samples=15245, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:24,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.55 | bwd_microstep: 1955.05 | bwd_inner_microstep: 1791.23 | bwd_allreduce_microstep: 163.74 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12502
total_samples=15248, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:27,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 04:48:27,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 874.34 | bwd_microstep: 1779.61 | bwd_inner_microstep: 1580.97 | bwd_allreduce_microstep: 198.57 | step_microstep: 117.13
[2025-08-03 04:48:27,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2968.98 | bwd: 7411.59 | bwd_inner: 6907.64 | bwd_allreduce: 503.70 | step: 117.60
{'loss': 0.7575, 'learning_rate': 1.0420913803763522e-05, 'epoch': 0.5}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12710
total_samples=15252, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:29,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.88 | bwd_microstep: 1869.15 | bwd_inner_microstep: 1658.77 | bwd_allreduce_microstep: 210.32 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14002
total_samples=15256, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:32,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.62 | bwd_microstep: 1774.63 | bwd_inner_microstep: 1723.30 | bwd_allreduce_microstep: 51.26 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11957
total_samples=15259, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:35,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.17 | bwd_microstep: 1981.25 | bwd_inner_microstep: 1586.29 | bwd_allreduce_microstep: 394.89 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13962
total_samples=15263, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:38,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 04:48:38,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.36 | bwd_microstep: 2123.32 | bwd_inner_microstep: 1819.79 | bwd_allreduce_microstep: 303.47 | step_microstep: 112.68
[2025-08-03 04:48:38,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.96 | bwd: 7748.40 | bwd_inner: 6788.14 | bwd_allreduce: 960.02 | step: 113.14
{'loss': 0.7425, 'learning_rate': 1.0404733833860639e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13580
total_samples=15267, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:40,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.36 | bwd_microstep: 1799.90 | bwd_inner_microstep: 1700.24 | bwd_allreduce_microstep: 99.58 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13383
total_samples=15271, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:43,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.60 | bwd_microstep: 1717.24 | bwd_inner_microstep: 1659.19 | bwd_allreduce_microstep: 57.97 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11933
total_samples=15274, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:45,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.33 | bwd_microstep: 1808.92 | bwd_inner_microstep: 1584.61 | bwd_allreduce_microstep: 224.25 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13250
total_samples=15278, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:48,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 04:48:48,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.58 | bwd_microstep: 2098.62 | bwd_inner_microstep: 1667.52 | bwd_allreduce_microstep: 431.04 | step_microstep: 123.67
[2025-08-03 04:48:48,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.79 | bwd: 7424.73 | bwd_inner: 6611.56 | bwd_allreduce: 812.92 | step: 124.17
50%|█████     | 1001/2000 [3:05:09<3:50:21, 13.84s/it]
50%|█████     | 1002/2000 [3:05:20<3:33:58, 12.86s/it]
50%|█████     | 1003/2000 [3:05:31<3:23:06, 12.22s/it]
50%|█████     | 1004/2000 [3:05:42<3:15:49, 11.80s/it]
50%|█████     | 1005/2000 [3:05:53<3:11:43, 11.56s/it]
50%|█████     | 1006/2000 [3:06:03<3:06:57, 11.29s/it]
{'loss': 0.742, 'learning_rate': 1.0388552802590461e-05, 'epoch': 0.5}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468
total_samples=15282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:51,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.41 | bwd_microstep: 1876.99 | bwd_inner_microstep: 1796.26 | bwd_allreduce_microstep: 80.65 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13858
total_samples=15287, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:54,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.70 | bwd_microstep: 1726.62 | bwd_inner_microstep: 1683.35 | bwd_allreduce_microstep: 43.21 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13915
total_samples=15291, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:56,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.47 | bwd_microstep: 2056.81 | bwd_inner_microstep: 2050.72 | bwd_allreduce_microstep: 6.03 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15045
total_samples=15295, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:48:59,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 04:48:59,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.31 | bwd_microstep: 1808.56 | bwd_inner_microstep: 1764.28 | bwd_allreduce_microstep: 44.21 | step_microstep: 120.28
[2025-08-03 04:48:59,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.82 | bwd: 7469.03 | bwd_inner: 7294.60 | bwd_allreduce: 174.19 | step: 120.76
{'loss': 0.76, 'learning_rate': 1.0372370752385854e-05, 'epoch': 0.5}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13620
total_samples=15299, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:02,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.46 | bwd_microstep: 1840.60 | bwd_inner_microstep: 1709.75 | bwd_allreduce_microstep: 130.78 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13082
total_samples=15303, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:04,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.94 | bwd_microstep: 1845.84 | bwd_inner_microstep: 1652.52 | bwd_allreduce_microstep: 193.25 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13300
total_samples=15307, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:07,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.85 | bwd_microstep: 2009.51 | bwd_inner_microstep: 1864.43 | bwd_allreduce_microstep: 145.01 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11840
total_samples=15310, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:10,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 04:49:10,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.26 | bwd_microstep: 1850.05 | bwd_inner_microstep: 1602.55 | bwd_allreduce_microstep: 247.43 | step_microstep: 123.44
[2025-08-03 04:49:10,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2849.44 | bwd: 7546.04 | bwd_inner: 6829.24 | bwd_allreduce: 716.56 | step: 123.91
{'loss': 0.753, 'learning_rate': 1.0356187725682359e-05, 'epoch': 0.5}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13969
total_samples=15314, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:13,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.23 | bwd_microstep: 1932.33 | bwd_inner_microstep: 1727.81 | bwd_allreduce_microstep: 204.46 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13123
total_samples=15318, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:15,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.95 | bwd_microstep: 1783.92 | bwd_inner_microstep: 1682.76 | bwd_allreduce_microstep: 101.09 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13554
total_samples=15323, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:18,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.93 | bwd_microstep: 1980.93 | bwd_inner_microstep: 1692.82 | bwd_allreduce_microstep: 288.04 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13475
total_samples=15327, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:21,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 04:49:21,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.27 | bwd_microstep: 1845.86 | bwd_inner_microstep: 1797.20 | bwd_allreduce_microstep: 48.59 | step_microstep: 134.05
[2025-08-03 04:49:21,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.31 | bwd: 7543.09 | bwd_inner: 6900.58 | bwd_allreduce: 642.27 | step: 134.40
{'loss': 0.7435, 'learning_rate': 1.0340003764918078e-05, 'epoch': 0.5}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13300
total_samples=15331, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:24,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.04 | bwd_microstep: 2125.11 | bwd_inner_microstep: 1999.06 | bwd_allreduce_microstep: 125.99 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16000
total_samples=15337, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:26,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.26 | bwd_microstep: 1799.73 | bwd_inner_microstep: 1785.03 | bwd_allreduce_microstep: 14.64 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12283
total_samples=15340, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:29,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.97 | bwd_microstep: 1759.30 | bwd_inner_microstep: 1579.64 | bwd_allreduce_microstep: 179.60 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13358
total_samples=15344, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:32,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 04:49:32,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.32 | bwd_microstep: 1955.30 | bwd_inner_microstep: 1770.09 | bwd_allreduce_microstep: 185.14 | step_microstep: 153.62
[2025-08-03 04:49:32,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.52 | bwd: 7639.49 | bwd_inner: 7133.82 | bwd_allreduce: 505.44 | step: 154.07
{'loss': 0.7645, 'learning_rate': 1.0323818912533561e-05, 'epoch': 0.51}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13778
total_samples=15348, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:34,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.39 | bwd_microstep: 1876.65 | bwd_inner_microstep: 1836.50 | bwd_allreduce_microstep: 40.09 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12154
total_samples=15351, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:37,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.14 | bwd_microstep: 1926.54 | bwd_inner_microstep: 1736.23 | bwd_allreduce_microstep: 190.24 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14344
total_samples=15355, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:40,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.48 | bwd_microstep: 1798.71 | bwd_inner_microstep: 1731.12 | bwd_allreduce_microstep: 67.53 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13547
total_samples=15359, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:43,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 04:49:43,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.90 | bwd_microstep: 2135.72 | bwd_inner_microstep: 1930.10 | bwd_allreduce_microstep: 205.56 | step_microstep: 107.37
[2025-08-03 04:49:43,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.85 | bwd: 7737.68 | bwd_inner: 7233.95 | bwd_allreduce: 503.49 | step: 107.85
{'loss': 0.7422, 'learning_rate': 1.0307633210971697e-05, 'epoch': 0.51}
50%|█████     | 1006/2000 [3:06:03<3:06:57, 11.29s/it]
50%|█████     | 1007/2000 [3:06:14<3:03:55, 11.11s/it]
50%|█████     | 1008/2000 [3:06:25<3:02:21, 11.03s/it]
50%|█████     | 1009/2000 [3:06:36<3:00:49, 10.95s/it]
50%|█████     | 1010/2000 [3:06:47<3:00:39, 10.95s/it]
51%|█████     | 1011/2000 [3:06:57<3:00:32, 10.95s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14663
total_samples=15364, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:45,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.32 | bwd_microstep: 1762.46 | bwd_inner_microstep: 1712.34 | bwd_allreduce_microstep: 50.05 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14115
total_samples=15368, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:48,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.21 | bwd_microstep: 1760.73 | bwd_inner_microstep: 1711.14 | bwd_allreduce_microstep: 49.53 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13247
total_samples=15372, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:50,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.69 | bwd_microstep: 1780.02 | bwd_inner_microstep: 1680.35 | bwd_allreduce_microstep: 99.61 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13328
total_samples=15376, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:53,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 04:49:53,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.80 | bwd_microstep: 1981.94 | bwd_inner_microstep: 1905.26 | bwd_allreduce_microstep: 76.61 | step_microstep: 108.48
[2025-08-03 04:49:53,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.96 | bwd: 7285.20 | bwd_inner: 7009.09 | bwd_allreduce: 275.88 | step: 108.81
{'loss': 0.761, 'learning_rate': 1.0291446702677598e-05, 'epoch': 0.51}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13211
total_samples=15381, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:56,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 988.57 | bwd_microstep: 1750.62 | bwd_inner_microstep: 1652.88 | bwd_allreduce_microstep: 97.67 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13949
total_samples=15385, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:49:59,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.10 | bwd_microstep: 1842.41 | bwd_inner_microstep: 1737.08 | bwd_allreduce_microstep: 105.27 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14635
total_samples=15389, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:01,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.77 | bwd_microstep: 2160.36 | bwd_inner_microstep: 2131.94 | bwd_allreduce_microstep: 28.34 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15089
total_samples=15394, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:04,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 04:50:04,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.00 | bwd_microstep: 1799.71 | bwd_inner_microstep: 1737.48 | bwd_allreduce_microstep: 62.16 | step_microstep: 120.41
[2025-08-03 04:50:04,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3121.36 | bwd: 7553.16 | bwd_inner: 7259.38 | bwd_allreduce: 293.52 | step: 120.79
{'loss': 0.7565, 'learning_rate': 1.0275259430098502e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11576
total_samples=15397, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:07,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.19 | bwd_microstep: 1991.54 | bwd_inner_microstep: 1800.59 | bwd_allreduce_microstep: 190.88 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13556
total_samples=15401, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:10,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.32 | bwd_microstep: 1814.43 | bwd_inner_microstep: 1807.40 | bwd_allreduce_microstep: 6.97 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13571
total_samples=15405, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:12,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.23 | bwd_microstep: 1745.55 | bwd_inner_microstep: 1694.18 | bwd_allreduce_microstep: 51.31 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13111
total_samples=15409, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:15,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 04:50:15,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.37 | bwd_microstep: 1875.85 | bwd_inner_microstep: 1665.06 | bwd_allreduce_microstep: 210.73 | step_microstep: 109.62
[2025-08-03 04:50:15,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.05 | bwd: 7427.42 | bwd_inner: 6967.22 | bwd_allreduce: 459.97 | step: 110.04
{'loss': 0.7556, 'learning_rate': 1.0259071435683636e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12071
total_samples=15412, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:18,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.36 | bwd_microstep: 2026.84 | bwd_inner_microstep: 1842.31 | bwd_allreduce_microstep: 184.46 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13590
total_samples=15416, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:20,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.89 | bwd_microstep: 1774.96 | bwd_inner_microstep: 1714.01 | bwd_allreduce_microstep: 60.88 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13543
total_samples=15420, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:23,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.94 | bwd_microstep: 2060.64 | bwd_inner_microstep: 1844.47 | bwd_allreduce_microstep: 216.10 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11685
total_samples=15423, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:26,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 04:50:26,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.17 | bwd_microstep: 2129.91 | bwd_inner_microstep: 1652.80 | bwd_allreduce_microstep: 477.02 | step_microstep: 125.73
[2025-08-03 04:50:26,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2873.29 | bwd: 7992.40 | bwd_inner: 7053.61 | bwd_allreduce: 938.52 | step: 126.18
{'loss': 0.7609, 'learning_rate': 1.0242882761884132e-05, 'epoch': 0.51}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 14645
total_samples=15427, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:29,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.89 | bwd_microstep: 1758.04 | bwd_inner_microstep: 1692.06 | bwd_allreduce_microstep: 65.92 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14576
total_samples=15432, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:31,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.47 | bwd_microstep: 2073.79 | bwd_inner_microstep: 1763.18 | bwd_allreduce_microstep: 310.54 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15005
total_samples=15437, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:34,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.87 | bwd_microstep: 2063.22 | bwd_inner_microstep: 1930.12 | bwd_allreduce_microstep: 133.04 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13661
total_samples=15441, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:37,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 04:50:37,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.61 | bwd_microstep: 1971.30 | bwd_inner_microstep: 1746.24 | bwd_allreduce_microstep: 225.00 | step_microstep: 133.20
[2025-08-03 04:50:37,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.78 | bwd: 7866.41 | bwd_inner: 7131.60 | bwd_allreduce: 734.57 | step: 133.54
{'loss': 0.7602, 'learning_rate': 1.02266934511529e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11627
total_samples=15444, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:40,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.95 | bwd_microstep: 1853.60 | bwd_inner_microstep: 1582.37 | bwd_allreduce_microstep: 271.17 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13324
total_samples=15448, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:42,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.59 | bwd_microstep: 1790.44 | bwd_inner_microstep: 1708.97 | bwd_allreduce_microstep: 81.41 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515
total_samples=15452, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:45,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.42 | bwd_microstep: 1721.37 | bwd_inner_microstep: 1679.54 | bwd_allreduce_microstep: 41.77 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11720
total_samples=15455, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:48,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 04:50:48,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.67 | bwd_microstep: 1753.59 | bwd_inner_microstep: 1550.55 | bwd_allreduce_microstep: 202.98 | step_microstep: 114.17
[2025-08-03 04:50:48,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.55 | bwd: 7119.05 | bwd_inner: 6521.42 | bwd_allreduce: 597.41 | step: 114.51
51%|█████     | 1012/2000 [3:07:08<2:57:50, 10.80s/it]
51%|█████     | 1013/2000 [3:07:19<2:59:06, 10.89s/it]
51%|█████     | 1014/2000 [3:07:30<2:57:43, 10.82s/it]
51%|█████     | 1015/2000 [3:07:41<2:59:43, 10.95s/it]
51%|█████     | 1016/2000 [3:07:52<3:00:20, 11.00s/it]
{'loss': 0.7474, 'learning_rate': 1.0210503545944522e-05, 'epoch': 0.51}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13393
total_samples=15459, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:50,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.95 | bwd_microstep: 2017.09 | bwd_inner_microstep: 1861.89 | bwd_allreduce_microstep: 155.13 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13953
total_samples=15463, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:53,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.28 | bwd_microstep: 1843.49 | bwd_inner_microstep: 1726.33 | bwd_allreduce_microstep: 117.10 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13450
total_samples=15467, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:56,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.05 | bwd_microstep: 1997.57 | bwd_inner_microstep: 1731.05 | bwd_allreduce_microstep: 266.46 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13378
total_samples=15471, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:50:59,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:50:59,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.68 | bwd_microstep: 1945.93 | bwd_inner_microstep: 1849.30 | bwd_allreduce_microstep: 96.55 | step_microstep: 121.67
[2025-08-03 04:50:59,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.88 | bwd: 7804.13 | bwd_inner: 7168.57 | bwd_allreduce: 635.32 | step: 122.13
{'loss': 0.7487, 'learning_rate': 1.0194313088715135e-05, 'epoch': 0.51}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13565
total_samples=15475, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:01,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.24 | bwd_microstep: 2020.27 | bwd_inner_microstep: 1689.06 | bwd_allreduce_microstep: 331.15 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13618
total_samples=15479, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:04,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.05 | bwd_microstep: 1793.39 | bwd_inner_microstep: 1711.01 | bwd_allreduce_microstep: 82.32 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16371
total_samples=15483, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:07,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.30 | bwd_microstep: 2047.30 | bwd_inner_microstep: 1974.26 | bwd_allreduce_microstep: 72.98 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13517
total_samples=15487, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:10,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 04:51:10,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.32 | bwd_microstep: 2008.96 | bwd_inner_microstep: 1877.02 | bwd_allreduce_microstep: 131.87 | step_microstep: 122.71
[2025-08-03 04:51:10,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.83 | bwd: 7869.98 | bwd_inner: 7251.34 | bwd_allreduce: 618.40 | step: 123.04
{'loss': 0.7677, 'learning_rate': 1.0178122121922324e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11916
total_samples=15490, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:12,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.30 | bwd_microstep: 1861.33 | bwd_inner_microstep: 1549.59 | bwd_allreduce_microstep: 311.69 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12978
total_samples=15494, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:15,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.21 | bwd_microstep: 1834.33 | bwd_inner_microstep: 1698.10 | bwd_allreduce_microstep: 136.16 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13667
total_samples=15498, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:18,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.15 | bwd_microstep: 1782.43 | bwd_inner_microstep: 1705.42 | bwd_allreduce_microstep: 76.95 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13054
total_samples=15502, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:21,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 04:51:21,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.94 | bwd_microstep: 2336.02 | bwd_inner_microstep: 2168.68 | bwd_allreduce_microstep: 167.27 | step_microstep: 113.17
[2025-08-03 04:51:21,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.52 | bwd: 7814.16 | bwd_inner: 7121.78 | bwd_allreduce: 692.15 | step: 113.51
{'loss': 0.74, 'learning_rate': 1.0161930688025018e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11660
total_samples=15505, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:24,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.75 | bwd_microstep: 2022.65 | bwd_inner_microstep: 1861.96 | bwd_allreduce_microstep: 160.63 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11773
total_samples=15508, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:26,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.86 | bwd_microstep: 1925.97 | bwd_inner_microstep: 1788.96 | bwd_allreduce_microstep: 136.96 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15763
total_samples=15513, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:29,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.10 | bwd_microstep: 1955.12 | bwd_inner_microstep: 1781.95 | bwd_allreduce_microstep: 173.07 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12984
total_samples=15517, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:32,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 04:51:32,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.82 | bwd_microstep: 1786.74 | bwd_inner_microstep: 1670.59 | bwd_allreduce_microstep: 116.07 | step_microstep: 143.15
[2025-08-03 04:51:32,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.45 | bwd: 7690.51 | bwd_inner: 7103.44 | bwd_allreduce: 586.81 | step: 143.57
{'loss': 0.7453, 'learning_rate': 1.0145738829483354e-05, 'epoch': 0.51}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13165
total_samples=15521, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:34,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.75 | bwd_microstep: 1870.54 | bwd_inner_microstep: 1677.42 | bwd_allreduce_microstep: 193.06 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13456
total_samples=15525, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:37,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.54 | bwd_microstep: 1700.96 | bwd_inner_microstep: 1667.71 | bwd_allreduce_microstep: 33.19 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12850
total_samples=15529, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:40,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.68 | bwd_microstep: 2036.50 | bwd_inner_microstep: 1862.03 | bwd_allreduce_microstep: 174.40 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12797
total_samples=15533, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:43,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 04:51:43,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.69 | bwd_microstep: 2032.27 | bwd_inner_microstep: 1786.78 | bwd_allreduce_microstep: 245.43 | step_microstep: 133.58
[2025-08-03 04:51:43,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.58 | bwd: 7640.32 | bwd_inner: 6993.93 | bwd_allreduce: 646.16 | step: 133.91
{'loss': 0.7384, 'learning_rate': 1.0129546588758605e-05, 'epoch': 0.51}
51%|█████     | 1017/2000 [3:08:02<2:57:00, 10.80s/it]
51%|█████     | 1018/2000 [3:08:13<2:58:02, 10.88s/it]
51%|█████     | 1019/2000 [3:08:25<2:59:13, 10.96s/it]
51%|█████     | 1020/2000 [3:08:36<2:59:16, 10.98s/it]
51%|█████     | 1021/2000 [3:08:47<2:58:53, 10.96s/it]
51%|█████     | 1022/2000 [3:08:57<2:58:09, 10.93s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14088
total_samples=15537, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:45,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.23 | bwd_microstep: 1766.19 | bwd_inner_microstep: 1704.71 | bwd_allreduce_microstep: 61.41 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11982
total_samples=15540, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:48,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.32 | bwd_microstep: 1861.75 | bwd_inner_microstep: 1612.92 | bwd_allreduce_microstep: 248.78 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14095
total_samples=15544, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:50,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.90 | bwd_microstep: 1832.60 | bwd_inner_microstep: 1756.48 | bwd_allreduce_microstep: 76.06 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13265
total_samples=15548, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:53,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 04:51:53,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.01 | bwd_microstep: 2215.00 | bwd_inner_microstep: 1884.32 | bwd_allreduce_microstep: 330.62 | step_microstep: 109.89
[2025-08-03 04:51:53,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2870.39 | bwd: 7675.59 | bwd_inner: 6958.42 | bwd_allreduce: 716.95 | step: 110.24
{'loss': 0.7578, 'learning_rate': 1.0113354008313025e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770
total_samples=15551, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:56,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.98 | bwd_microstep: 1992.75 | bwd_inner_microstep: 1526.89 | bwd_allreduce_microstep: 465.79 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13285
total_samples=15555, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:51:59,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.54 | bwd_microstep: 2058.83 | bwd_inner_microstep: 1905.11 | bwd_allreduce_microstep: 153.66 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12417
total_samples=15558, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:02,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.32 | bwd_microstep: 1805.71 | bwd_inner_microstep: 1592.47 | bwd_allreduce_microstep: 213.17 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11586
total_samples=15561, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:04,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 04:52:04,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.70 | bwd_microstep: 1759.81 | bwd_inner_microstep: 1541.05 | bwd_allreduce_microstep: 218.70 | step_microstep: 121.87
[2025-08-03 04:52:04,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.46 | bwd: 7617.15 | bwd_inner: 6565.51 | bwd_allreduce: 1051.40 | step: 122.31
{'loss': 0.7455, 'learning_rate': 1.0097161130609774e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11870
total_samples=15564, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:07,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.89 | bwd_microstep: 2178.47 | bwd_inner_microstep: 1951.47 | bwd_allreduce_microstep: 226.93 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13961
total_samples=15568, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:10,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.60 | bwd_microstep: 1998.48 | bwd_inner_microstep: 1860.09 | bwd_allreduce_microstep: 138.32 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14636
total_samples=15572, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:13,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 1742.55 | bwd_inner_microstep: 1712.81 | bwd_allreduce_microstep: 29.68 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11705
total_samples=15575, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:15,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 04:52:15,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.93 | bwd_microstep: 1831.22 | bwd_inner_microstep: 1568.86 | bwd_allreduce_microstep: 262.30 | step_microstep: 157.42
[2025-08-03 04:52:15,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.42 | bwd: 7750.77 | bwd_inner: 7093.22 | bwd_allreduce: 657.31 | step: 157.88
{'loss': 0.7531, 'learning_rate': 1.0080967998112787e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11852
total_samples=15578, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:18,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.09 | bwd_microstep: 1724.75 | bwd_inner_microstep: 1539.98 | bwd_allreduce_microstep: 184.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13633
total_samples=15582, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:20,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.09 | bwd_microstep: 1788.25 | bwd_inner_microstep: 1698.27 | bwd_allreduce_microstep: 89.92 | step_microstep: 0.16
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12693
total_samples=15586, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:23,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.41 | bwd_microstep: 2156.85 | bwd_inner_microstep: 1985.98 | bwd_allreduce_microstep: 170.81 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13343
total_samples=15590, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:26,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 04:52:26,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.87 | bwd_microstep: 2113.43 | bwd_inner_microstep: 1961.38 | bwd_allreduce_microstep: 151.99 | step_microstep: 116.16
[2025-08-03 04:52:26,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.39 | bwd: 7783.34 | bwd_inner: 7185.61 | bwd_allreduce: 597.50 | step: 116.53
{'loss': 0.751, 'learning_rate': 1.0064774653286662e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12187
total_samples=15593, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:29,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.05 | bwd_microstep: 1935.18 | bwd_inner_microstep: 1766.26 | bwd_allreduce_microstep: 168.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13291
total_samples=15597, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:32,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.94 | bwd_microstep: 1775.13 | bwd_inner_microstep: 1689.92 | bwd_allreduce_microstep: 85.15 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11763
total_samples=15600, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:34,960] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.77 | bwd_microstep: 2101.87 | bwd_inner_microstep: 1861.70 | bwd_allreduce_microstep: 240.10 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13851
total_samples=15604, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:37,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 04:52:37,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.78 | bwd_microstep: 1981.46 | bwd_inner_microstep: 1973.23 | bwd_allreduce_microstep: 8.16 | step_microstep: 143.65
[2025-08-03 04:52:37,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.46 | bwd: 7793.68 | bwd_inner: 7291.11 | bwd_allreduce: 502.36 | step: 143.98
{'loss': 0.7441, 'learning_rate': 1.0048581138596563e-05, 'epoch': 0.51}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11927
total_samples=15607, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:40,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.39 | bwd_microstep: 2033.30 | bwd_inner_microstep: 1802.10 | bwd_allreduce_microstep: 231.13 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13841
total_samples=15611, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:43,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.62 | bwd_microstep: 1786.51 | bwd_inner_microstep: 1714.47 | bwd_allreduce_microstep: 71.98 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13508
total_samples=15615, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:45,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.46 | bwd_microstep: 1837.72 | bwd_inner_microstep: 1697.52 | bwd_allreduce_microstep: 140.14 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14524
total_samples=15619, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:48,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 04:52:48,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.08 | bwd_microstep: 1891.80 | bwd_inner_microstep: 1871.50 | bwd_allreduce_microstep: 20.23 | step_microstep: 140.15
[2025-08-03 04:52:48,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.47 | bwd: 7549.37 | bwd_inner: 7085.59 | bwd_allreduce: 463.54 | step: 140.49
51%|█████     | 1022/2000 [3:08:57<2:58:09, 10.93s/it]
51%|█████     | 1023/2000 [3:09:08<2:58:03, 10.93s/it]
51%|█████     | 1024/2000 [3:09:19<2:57:16, 10.90s/it]
51%|█████▏    | 1025/2000 [3:09:30<2:57:52, 10.95s/it]
51%|█████▏    | 1026/2000 [3:09:41<2:57:55, 10.96s/it]
51%|█████▏    | 1027/2000 [3:09:52<2:58:02, 10.98s/it]
{'loss': 0.7585, 'learning_rate': 1.003238749650809e-05, 'epoch': 0.51}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13487
total_samples=15623, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:51,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.59 | bwd_microstep: 2015.62 | bwd_inner_microstep: 1737.57 | bwd_allreduce_microstep: 277.99 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13387
total_samples=15627, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:54,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.41 | bwd_microstep: 2097.95 | bwd_inner_microstep: 1946.58 | bwd_allreduce_microstep: 151.30 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13894
total_samples=15631, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:56,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.30 | bwd_microstep: 1851.25 | bwd_inner_microstep: 1703.84 | bwd_allreduce_microstep: 147.36 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13876
total_samples=15635, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:52:59,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 04:52:59,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.25 | bwd_microstep: 1990.95 | bwd_inner_microstep: 1884.21 | bwd_allreduce_microstep: 106.68 | step_microstep: 114.16
[2025-08-03 04:52:59,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.48 | bwd: 7955.81 | bwd_inner: 7272.19 | bwd_allreduce: 683.40 | step: 114.58
{'loss': 0.7525, 'learning_rate': 1.001619376948718e-05, 'epoch': 0.51}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13495
total_samples=15639, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:02,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.97 | bwd_microstep: 1764.27 | bwd_inner_microstep: 1656.53 | bwd_allreduce_microstep: 107.67 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14686
total_samples=15643, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:04,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.53 | bwd_microstep: 1845.41 | bwd_inner_microstep: 1786.23 | bwd_allreduce_microstep: 59.12 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13230
total_samples=15647, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:07,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.86 | bwd_microstep: 1975.61 | bwd_inner_microstep: 1820.35 | bwd_allreduce_microstep: 155.18 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13199
total_samples=15651, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:10,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 04:53:10,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.02 | bwd_microstep: 1823.08 | bwd_inner_microstep: 1711.22 | bwd_allreduce_microstep: 111.79 | step_microstep: 124.54
[2025-08-03 04:53:10,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.32 | bwd: 7408.42 | bwd_inner: 6974.31 | bwd_allreduce: 433.85 | step: 124.86
{'loss': 0.757, 'learning_rate': 1e-05, 'epoch': 0.52}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12396
total_samples=15654, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:13,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.77 | bwd_microstep: 1834.47 | bwd_inner_microstep: 1597.13 | bwd_allreduce_microstep: 237.28 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14345
total_samples=15658, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:15,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.18 | bwd_microstep: 1785.90 | bwd_inner_microstep: 1735.78 | bwd_allreduce_microstep: 50.05 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12926
total_samples=15662, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:18,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.52 | bwd_microstep: 2165.74 | bwd_inner_microstep: 1991.55 | bwd_allreduce_microstep: 174.13 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12515
total_samples=15665, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:21,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 04:53:21,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.67 | bwd_microstep: 1984.82 | bwd_inner_microstep: 1766.05 | bwd_allreduce_microstep: 218.71 | step_microstep: 398.15
[2025-08-03 04:53:21,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.08 | bwd: 7770.99 | bwd_inner: 7090.52 | bwd_allreduce: 680.25 | step: 398.60
{'loss': 0.7407, 'learning_rate': 9.98380623051282e-06, 'epoch': 0.52}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13144
total_samples=15669, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:24,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.24 | bwd_microstep: 1954.14 | bwd_inner_microstep: 1703.12 | bwd_allreduce_microstep: 250.95 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13599
total_samples=15673, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:26,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.90 | bwd_microstep: 1720.36 | bwd_inner_microstep: 1633.05 | bwd_allreduce_microstep: 87.24 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11694
total_samples=15676, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:29,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.79 | bwd_microstep: 2111.22 | bwd_inner_microstep: 1870.18 | bwd_allreduce_microstep: 240.97 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14567
total_samples=15680, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:32,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 04:53:32,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.84 | bwd_microstep: 1853.61 | bwd_inner_microstep: 1765.84 | bwd_allreduce_microstep: 87.71 | step_microstep: 137.06
[2025-08-03 04:53:32,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.71 | bwd: 7639.37 | bwd_inner: 6972.19 | bwd_allreduce: 666.96 | step: 137.39
{'loss': 0.7502, 'learning_rate': 9.967612503491915e-06, 'epoch': 0.52}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11862
total_samples=15683, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:35,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.41 | bwd_microstep: 2027.93 | bwd_inner_microstep: 1803.43 | bwd_allreduce_microstep: 224.43 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11898
total_samples=15686, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:38,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.81 | bwd_microstep: 1847.71 | bwd_inner_microstep: 1598.81 | bwd_allreduce_microstep: 248.84 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14538
total_samples=15690, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:40,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.44 | bwd_microstep: 2009.07 | bwd_inner_microstep: 1778.30 | bwd_allreduce_microstep: 230.71 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12891
total_samples=15694, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:43,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.76
[2025-08-03 04:53:43,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.56 | bwd_microstep: 2121.99 | bwd_inner_microstep: 1968.12 | bwd_allreduce_microstep: 153.82 | step_microstep: 106.55
[2025-08-03 04:53:43,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2888.14 | bwd: 8006.75 | bwd_inner: 7148.65 | bwd_allreduce: 857.88 | step: 106.86
51%|█████▏    | 1028/2000 [3:10:03<2:56:54, 10.92s/it]
51%|█████▏    | 1029/2000 [3:10:14<2:57:58, 11.00s/it]
52%|█████▏    | 1030/2000 [3:10:25<2:56:04, 10.89s/it]
52%|█████▏    | 1031/2000 [3:10:36<2:57:50, 11.01s/it]
52%|█████▏    | 1032/2000 [3:10:47<2:57:08, 10.98s/it]
{'loss': 0.7549, 'learning_rate': 9.95141886140344e-06, 'epoch': 0.52}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13661
total_samples=15698, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:46,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.68 | bwd_microstep: 2143.50 | bwd_inner_microstep: 1878.16 | bwd_allreduce_microstep: 265.27 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11631
total_samples=15701, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:49,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.12 | bwd_microstep: 1710.16 | bwd_inner_microstep: 1674.84 | bwd_allreduce_microstep: 35.26 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14562
total_samples=15705, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:51,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.84 | bwd_microstep: 1713.51 | bwd_inner_microstep: 1698.73 | bwd_allreduce_microstep: 14.72 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11651
total_samples=15708, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:54,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.80
[2025-08-03 04:53:54,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.79 | bwd_microstep: 1986.17 | bwd_inner_microstep: 1811.87 | bwd_allreduce_microstep: 174.23 | step_microstep: 112.61
[2025-08-03 04:53:54,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.36 | bwd: 7553.38 | bwd_inner: 7063.60 | bwd_allreduce: 489.55 | step: 113.06
{'loss': 0.7625, 'learning_rate': 9.935225346713341e-06, 'epoch': 0.52}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13308
total_samples=15712, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:57,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.65 | bwd_microstep: 1741.32 | bwd_inner_microstep: 1651.69 | bwd_allreduce_microstep: 89.56 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11726
total_samples=15715, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:53:59,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.10 | bwd_microstep: 1802.66 | bwd_inner_microstep: 1570.67 | bwd_allreduce_microstep: 231.93 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11895
total_samples=15718, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:02,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.88 | bwd_microstep: 2084.77 | bwd_inner_microstep: 1859.17 | bwd_allreduce_microstep: 225.53 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14132
total_samples=15722, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:05,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 04:54:05,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.77 | bwd_microstep: 2090.17 | bwd_inner_microstep: 1782.43 | bwd_allreduce_microstep: 307.67 | step_microstep: 126.38
[2025-08-03 04:54:05,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.33 | bwd: 7718.96 | bwd_inner: 6863.96 | bwd_allreduce: 854.77 | step: 126.71
{'loss': 0.7612, 'learning_rate': 9.919032001887215e-06, 'epoch': 0.52}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12372
total_samples=15726, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:08,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.23 | bwd_microstep: 1807.79 | bwd_inner_microstep: 1567.56 | bwd_allreduce_microstep: 240.18 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13148
total_samples=15730, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:11,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.11 | bwd_microstep: 2032.64 | bwd_inner_microstep: 1981.36 | bwd_allreduce_microstep: 51.22 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13972
total_samples=15734, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:13,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.80 | bwd_microstep: 1833.42 | bwd_inner_microstep: 1748.89 | bwd_allreduce_microstep: 84.46 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11576
total_samples=15737, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:16,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 04:54:16,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.76 | bwd_microstep: 1978.95 | bwd_inner_microstep: 1536.04 | bwd_allreduce_microstep: 442.84 | step_microstep: 145.86
[2025-08-03 04:54:16,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.82 | bwd: 7652.85 | bwd_inner: 6833.85 | bwd_allreduce: 818.78 | step: 146.18
{'loss': 0.7537, 'learning_rate': 9.90283886939023e-06, 'epoch': 0.52}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11726
total_samples=15741, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:19,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.48 | bwd_microstep: 1852.84 | bwd_inner_microstep: 1608.15 | bwd_allreduce_microstep: 244.63 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12854
total_samples=15745, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:22,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.98 | bwd_microstep: 2048.83 | bwd_inner_microstep: 1686.99 | bwd_allreduce_microstep: 361.79 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12045
total_samples=15748, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:24,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.99 | bwd_microstep: 1735.61 | bwd_inner_microstep: 1552.33 | bwd_allreduce_microstep: 183.22 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12540
total_samples=15751, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:27,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 04:54:27,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.60 | bwd_microstep: 1946.97 | bwd_inner_microstep: 1815.26 | bwd_allreduce_microstep: 131.65 | step_microstep: 109.55
[2025-08-03 04:54:27,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.98 | bwd: 7584.31 | bwd_inner: 6662.74 | bwd_allreduce: 921.35 | step: 109.88
{'loss': 0.7554, 'learning_rate': 9.886645991686977e-06, 'epoch': 0.52}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14098
total_samples=15755, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:30,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.72 | bwd_microstep: 2637.81 | bwd_inner_microstep: 1946.53 | bwd_allreduce_microstep: 691.21 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13661
total_samples=15759, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:33,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.81 | bwd_microstep: 1826.11 | bwd_inner_microstep: 1729.98 | bwd_allreduce_microstep: 96.07 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14190
total_samples=15763, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:36,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.07 | bwd_microstep: 2003.69 | bwd_inner_microstep: 1761.17 | bwd_allreduce_microstep: 242.46 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11858
total_samples=15766, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:38,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 04:54:38,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.13 | bwd_microstep: 1869.51 | bwd_inner_microstep: 1608.48 | bwd_allreduce_microstep: 260.97 | step_microstep: 114.45
[2025-08-03 04:54:38,920] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.66 | bwd: 8337.17 | bwd_inner: 7046.15 | bwd_allreduce: 1290.79 | step: 114.88
52%|█████▏    | 1033/2000 [3:10:58<2:58:20, 11.07s/it]
52%|█████▏    | 1034/2000 [3:11:09<2:56:35, 10.97s/it]
52%|█████▏    | 1035/2000 [3:11:20<2:56:28, 10.97s/it]
52%|█████▏    | 1036/2000 [3:11:31<2:55:59, 10.95s/it]
52%|█████▏    | 1037/2000 [3:11:42<2:55:05, 10.91s/it]
52%|█████▏    | 1038/2000 [3:11:53<2:58:00, 11.10s/it]
{'loss': 0.7528, 'learning_rate': 9.870453411241399e-06, 'epoch': 0.52}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14236
total_samples=15770, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:41,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.69 | bwd_microstep: 1781.98 | bwd_inner_microstep: 1742.63 | bwd_allreduce_microstep: 39.28 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12581
total_samples=15774, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:44,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 765.19 | bwd_microstep: 1993.37 | bwd_inner_microstep: 1816.84 | bwd_allreduce_microstep: 176.47 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13566
total_samples=15778, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:46,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.49 | bwd_microstep: 1758.56 | bwd_inner_microstep: 1687.88 | bwd_allreduce_microstep: 70.63 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12046
total_samples=15781, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:49,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 04:54:49,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.31 | bwd_microstep: 1815.76 | bwd_inner_microstep: 1579.25 | bwd_allreduce_microstep: 236.44 | step_microstep: 146.04
[2025-08-03 04:54:49,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2904.61 | bwd: 7349.72 | bwd_inner: 6826.59 | bwd_allreduce: 522.89 | step: 146.47
{'loss': 0.7617, 'learning_rate': 9.854261170516648e-06, 'epoch': 0.52}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11601
total_samples=15784, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:52,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.29 | bwd_microstep: 2009.25 | bwd_inner_microstep: 1808.10 | bwd_allreduce_microstep: 201.08 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11714
total_samples=15787, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:55,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.94 | bwd_microstep: 1889.69 | bwd_inner_microstep: 1531.49 | bwd_allreduce_microstep: 358.14 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13980
total_samples=15791, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:54:57,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.42 | bwd_microstep: 1958.39 | bwd_inner_microstep: 1771.42 | bwd_allreduce_microstep: 186.91 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12510
total_samples=15794, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:00,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 04:55:00,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.48 | bwd_microstep: 1841.82 | bwd_inner_microstep: 1596.50 | bwd_allreduce_microstep: 245.24 | step_microstep: 107.76
[2025-08-03 04:55:00,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.05 | bwd: 7699.20 | bwd_inner: 6707.51 | bwd_allreduce: 991.46 | step: 108.20
{'loss': 0.751, 'learning_rate': 9.838069311974986e-06, 'epoch': 0.52}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14616
total_samples=15798, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:03,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.58 | bwd_microstep: 1832.47 | bwd_inner_microstep: 1826.50 | bwd_allreduce_microstep: 5.92 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11485
total_samples=15801, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:05,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.24 | bwd_microstep: 1710.74 | bwd_inner_microstep: 1525.49 | bwd_allreduce_microstep: 185.19 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13635
total_samples=15805, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:08,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.41 | bwd_microstep: 2195.76 | bwd_inner_microstep: 2051.72 | bwd_allreduce_microstep: 143.97 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14800
total_samples=15809, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:11,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 04:55:11,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.61 | bwd_microstep: 1837.50 | bwd_inner_microstep: 1814.13 | bwd_allreduce_microstep: 23.31 | step_microstep: 140.74
[2025-08-03 04:55:11,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.78 | bwd: 7576.52 | bwd_inner: 7217.84 | bwd_allreduce: 358.46 | step: 141.27
{'loss': 0.7435, 'learning_rate': 9.821877878077678e-06, 'epoch': 0.52}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14435
total_samples=15813, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:14,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.32 | bwd_microstep: 2025.24 | bwd_inner_microstep: 1899.68 | bwd_allreduce_microstep: 125.49 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14995
total_samples=15817, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:16,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.16 | bwd_microstep: 2013.34 | bwd_inner_microstep: 1885.25 | bwd_allreduce_microstep: 128.03 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13492
total_samples=15821, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:19,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.33 | bwd_microstep: 1749.18 | bwd_inner_microstep: 1680.42 | bwd_allreduce_microstep: 68.69 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13589
total_samples=15825, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:22,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 04:55:22,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.26 | bwd_microstep: 2092.37 | bwd_inner_microstep: 1948.11 | bwd_allreduce_microstep: 144.19 | step_microstep: 139.52
[2025-08-03 04:55:22,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.96 | bwd: 7880.18 | bwd_inner: 7413.45 | bwd_allreduce: 466.49 | step: 139.98
{'loss': 0.7529, 'learning_rate': 9.805686911284867e-06, 'epoch': 0.52}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12423
total_samples=15829, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:25,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.63 | bwd_microstep: 2077.14 | bwd_inner_microstep: 1955.66 | bwd_allreduce_microstep: 121.42 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14980
total_samples=15833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:27,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.10 | bwd_microstep: 1747.26 | bwd_inner_microstep: 1734.29 | bwd_allreduce_microstep: 12.90 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13369
total_samples=15837, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:30,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.44 | bwd_microstep: 2081.12 | bwd_inner_microstep: 1921.53 | bwd_allreduce_microstep: 159.53 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13522
total_samples=15841, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:33,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 04:55:33,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.79 | bwd_microstep: 1801.82 | bwd_inner_microstep: 1717.21 | bwd_allreduce_microstep: 84.54 | step_microstep: 141.23
[2025-08-03 04:55:33,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.89 | bwd: 7707.39 | bwd_inner: 7328.68 | bwd_allreduce: 378.47 | step: 141.70
{'loss': 0.7578, 'learning_rate': 9.789496454055482e-06, 'epoch': 0.52}
 52%|█████▏    | 1038/2000 [3:11:53<2:58:00, 11.10s/it]
 52%|█████▏    | 1039/2000 [3:12:04<2:55:53, 10.98s/it]
 52%|█████▏    | 1040/2000 [3:12:15<2:55:30, 10.97s/it]
 52%|█████▏    | 1041/2000 [3:12:26<2:54:41, 10.93s/it]
 52%|█████▏    | 1042/2000 [3:12:37<2:55:32, 10.99s/it]
 52%|█████▏    | 1043/2000 [3:12:48<2:55:03, 10.98s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13949
total_samples=15845, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:36,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.64 | bwd_microstep: 1822.33 | bwd_inner_microstep: 1731.77 | bwd_allreduce_microstep: 90.49 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13607
total_samples=15849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:38,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.00 | bwd_microstep: 1846.42 | bwd_inner_microstep: 1744.18 | bwd_allreduce_microstep: 102.17 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13765
total_samples=15853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:41,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.15 | bwd_microstep: 1980.98 | bwd_inner_microstep: 1883.65 | bwd_allreduce_microstep: 97.25 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14078
total_samples=15859, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:44,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 04:55:44,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.18 | bwd_microstep: 2126.67 | bwd_inner_microstep: 2026.08 | bwd_allreduce_microstep: 100.53 | step_microstep: 126.62
[2025-08-03 04:55:44,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.91 | bwd: 7776.44 | bwd_inner: 7385.68 | bwd_allreduce: 390.53 | step: 127.19
{'loss': 0.7457, 'learning_rate': 9.773306548847102e-06, 'epoch': 0.52}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11773
total_samples=15862, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:47,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.05 | bwd_microstep: 2143.24 | bwd_inner_microstep: 1923.25 | bwd_allreduce_microstep: 219.93 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12790
total_samples=15866, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:50,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.00 | bwd_microstep: 2124.87 | bwd_inner_microstep: 1970.15 | bwd_allreduce_microstep: 154.66 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15276
total_samples=15870, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:52,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.79 | bwd_microstep: 1746.33 | bwd_inner_microstep: 1697.21 | bwd_allreduce_microstep: 49.05 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13428
total_samples=15874, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:55,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.41
[2025-08-03 04:55:55,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.58 | bwd_microstep: 1712.18 | bwd_inner_microstep: 1672.08 | bwd_allreduce_microstep: 40.03 | step_microstep: 135.75
[2025-08-03 04:55:55,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.35 | bwd: 7726.68 | bwd_inner: 7262.69 | bwd_allreduce: 463.74 | step: 136.19
{'loss': 0.7492, 'learning_rate': 9.757117238115871e-06, 'epoch': 0.52}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11982
total_samples=15877, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:55:57,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.10 | bwd_microstep: 1722.35 | bwd_inner_microstep: 1541.15 | bwd_allreduce_microstep: 181.14 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13571
total_samples=15881, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:00,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1002.73 | bwd_microstep: 1753.73 | bwd_inner_microstep: 1678.39 | bwd_allreduce_microstep: 75.27 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13550
total_samples=15885, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:03,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.11 | bwd_microstep: 1721.55 | bwd_inner_microstep: 1671.06 | bwd_allreduce_microstep: 50.43 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13998
total_samples=15889, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:06,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 04:56:06,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.08 | bwd_microstep: 1794.92 | bwd_inner_microstep: 1736.15 | bwd_allreduce_microstep: 58.70 | step_microstep: 159.02
[2025-08-03 04:56:06,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3070.93 | bwd: 6992.59 | bwd_inner: 6626.75 | bwd_allreduce: 365.62 | step: 159.45
{'loss': 0.7461, 'learning_rate': 9.740928564316369e-06, 'epoch': 0.52}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11638
total_samples=15892, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:08,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.48 | bwd_microstep: 1734.25 | bwd_inner_microstep: 1522.88 | bwd_allreduce_microstep: 211.31 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13263
total_samples=15896, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:11,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.05 | bwd_microstep: 2013.87 | bwd_inner_microstep: 1709.04 | bwd_allreduce_microstep: 304.76 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11889
total_samples=15899, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:13,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.01 | bwd_microstep: 1733.11 | bwd_inner_microstep: 1539.92 | bwd_allreduce_microstep: 193.12 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13713
total_samples=15903, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:16,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 04:56:16,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.69 | bwd_microstep: 1895.81 | bwd_inner_microstep: 1865.82 | bwd_allreduce_microstep: 29.93 | step_microstep: 130.15
[2025-08-03 04:56:16,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.16 | bwd: 7377.09 | bwd_inner: 6637.67 | bwd_allreduce: 739.19 | step: 130.61
{'loss': 0.7578, 'learning_rate': 9.724740569901503e-06, 'epoch': 0.52}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14345
total_samples=15907, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:19,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.34 | bwd_microstep: 1952.01 | bwd_inner_microstep: 1773.60 | bwd_allreduce_microstep: 178.35 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13772
total_samples=15911, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:22,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.45 | bwd_microstep: 1919.16 | bwd_inner_microstep: 1870.48 | bwd_allreduce_microstep: 48.61 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12556
total_samples=15915, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:24,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.40 | bwd_microstep: 1804.16 | bwd_inner_microstep: 1608.94 | bwd_allreduce_microstep: 195.16 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13579
total_samples=15919, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:27,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 04:56:27,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.23 | bwd_microstep: 1993.80 | bwd_inner_microstep: 1870.88 | bwd_allreduce_microstep: 122.86 | step_microstep: 112.62
[2025-08-03 04:56:27,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.33 | bwd: 7669.18 | bwd_inner: 7123.89 | bwd_allreduce: 545.05 | step: 112.95
{'loss': 0.7362, 'learning_rate': 9.708553297322407e-06, 'epoch': 0.52}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13217
total_samples=15923, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:30,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.39 | bwd_microstep: 1981.18 | bwd_inner_microstep: 1687.78 | bwd_allreduce_microstep: 293.34 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13048
total_samples=15927, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:32,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.09 | bwd_microstep: 1809.46 | bwd_inner_microstep: 1653.19 | bwd_allreduce_microstep: 156.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14242
total_samples=15931, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:35,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.45 | bwd_microstep: 1781.49 | bwd_inner_microstep: 1730.16 | bwd_allreduce_microstep: 51.26 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12978
total_samples=15935, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:38,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 04:56:38,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.25 | bwd_microstep: 2226.37 | bwd_inner_microstep: 1807.30 | bwd_allreduce_microstep: 419.01 | step_microstep: 132.42
[2025-08-03 04:56:38,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.11 | bwd: 7798.55 | bwd_inner: 6878.43 | bwd_allreduce: 919.89 | step: 132.89
 52%|█████▏    | 1044/2000 [3:12:59<2:55:05, 10.99s/it]
 52%|█████▏    | 1045/2000 [3:13:10<2:54:49, 10.98s/it]
 52%|█████▏    | 1046/2000 [3:13:20<2:52:32, 10.85s/it]
 52%|█████▏    | 1047/2000 [3:13:31<2:51:22, 10.79s/it]
 52%|█████▏    | 1048/2000 [3:13:42<2:51:39, 10.82s/it]
{'loss': 0.751, 'learning_rate': 9.692366789028308e-06, 'epoch': 0.52}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13505
total_samples=15939, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:41,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.48 | bwd_microstep: 2021.72 | bwd_inner_microstep: 1894.13 | bwd_allreduce_microstep: 127.52 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11888
total_samples=15942, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:44,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.13 | bwd_microstep: 1854.96 | bwd_inner_microstep: 1700.03 | bwd_allreduce_microstep: 154.86 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11575
total_samples=15945, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:46,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.05 | bwd_microstep: 1806.41 | bwd_inner_microstep: 1680.22 | bwd_allreduce_microstep: 126.13 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14313
total_samples=15949, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:49,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 04:56:49,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.14 | bwd_microstep: 1794.36 | bwd_inner_microstep: 1728.52 | bwd_allreduce_microstep: 65.78 | step_microstep: 121.61
[2025-08-03 04:56:49,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.73 | bwd: 7477.50 | bwd_inner: 7002.89 | bwd_allreduce: 474.37 | step: 122.07
{'loss': 0.7567, 'learning_rate': 9.676181087466444e-06, 'epoch': 0.53}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14255
total_samples=15955, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:52,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.07 | bwd_microstep: 1966.64 | bwd_inner_microstep: 1888.50 | bwd_allreduce_microstep: 78.06 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11954
total_samples=15958, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:54,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.88 | bwd_microstep: 1704.17 | bwd_inner_microstep: 1560.67 | bwd_allreduce_microstep: 143.43 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13176
total_samples=15962, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:56:57,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 981.89 | bwd_microstep: 1854.04 | bwd_inner_microstep: 1705.50 | bwd_allreduce_microstep: 148.47 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13520
total_samples=15966, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:00,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 04:57:00,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.60 | bwd_microstep: 1950.77 | bwd_inner_microstep: 1716.97 | bwd_allreduce_microstep: 233.69 | step_microstep: 141.94
[2025-08-03 04:57:00,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3115.36 | bwd: 7475.69 | bwd_inner: 6871.64 | bwd_allreduce: 603.76 | step: 142.59
{'loss': 0.7523, 'learning_rate': 9.659996235081926e-06, 'epoch': 0.53}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12840
total_samples=15969, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:02,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.91 | bwd_microstep: 1765.01 | bwd_inner_microstep: 1603.57 | bwd_allreduce_microstep: 161.38 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12225
total_samples=15972, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:05,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.43 | bwd_microstep: 1842.36 | bwd_inner_microstep: 1593.67 | bwd_allreduce_microstep: 248.62 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13488
total_samples=15976, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:08,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.99 | bwd_microstep: 1820.89 | bwd_inner_microstep: 1735.99 | bwd_allreduce_microstep: 84.83 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13522
total_samples=15980, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:10,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 04:57:10,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.53 | bwd_microstep: 1801.64 | bwd_inner_microstep: 1714.30 | bwd_allreduce_microstep: 87.27 | step_microstep: 134.43
[2025-08-03 04:57:10,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2865.80 | bwd: 7229.96 | bwd_inner: 6647.53 | bwd_allreduce: 582.20 | step: 135.00
{'loss': 0.7514, 'learning_rate': 9.643812274317644e-06, 'epoch': 0.53}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13259
total_samples=15984, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:13,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.94 | bwd_microstep: 1791.73 | bwd_inner_microstep: 1689.97 | bwd_allreduce_microstep: 101.69 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891
total_samples=15987, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:15,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.60 | bwd_microstep: 1735.31 | bwd_inner_microstep: 1536.91 | bwd_allreduce_microstep: 198.34 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13586
total_samples=15991, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:18,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.15 | bwd_microstep: 1778.23 | bwd_inner_microstep: 1684.51 | bwd_allreduce_microstep: 93.66 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13498
total_samples=15995, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:21,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 04:57:21,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.20 | bwd_microstep: 1756.64 | bwd_inner_microstep: 1689.62 | bwd_allreduce_microstep: 66.94 | step_microstep: 113.84
[2025-08-03 04:57:21,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.81 | bwd: 7061.96 | bwd_inner: 6601.00 | bwd_allreduce: 460.71 | step: 114.36
{'loss': 0.7555, 'learning_rate': 9.627629247614151e-06, 'epoch': 0.53}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12917
total_samples=15999, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:24,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 770.53 | bwd_microstep: 2153.09 | bwd_inner_microstep: 1954.53 | bwd_allreduce_microstep: 198.49 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13643
total_samples=16003, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:26,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.14 | bwd_microstep: 1810.12 | bwd_inner_microstep: 1721.14 | bwd_allreduce_microstep: 88.91 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15387
total_samples=16008, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:29,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.11 | bwd_microstep: 2138.53 | bwd_inner_microstep: 2132.22 | bwd_allreduce_microstep: 6.25 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13567
total_samples=16012, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:32,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 04:57:32,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.55 | bwd_microstep: 1842.24 | bwd_inner_microstep: 1720.65 | bwd_allreduce_microstep: 121.53 | step_microstep: 135.06
[2025-08-03 04:57:32,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2902.27 | bwd: 7944.04 | bwd_inner: 7528.55 | bwd_allreduce: 415.25 | step: 135.50
 52%|█████▏    | 1049/2000 [3:13:53<2:52:37, 10.89s/it]
 52%|█████▎    | 1050/2000 [3:14:04<2:51:43, 10.85s/it]
 53%|█████▎    | 1051/2000 [3:14:15<2:52:23, 10.90s/it]
 53%|█████▎    | 1052/2000 [3:14:25<2:50:25, 10.79s/it]
 53%|█████▎    | 1053/2000 [3:14:36<2:47:51, 10.64s/it]
 53%|█████▎    | 1054/2000 [3:14:47<2:50:46, 10.83s/it]
{'loss': 0.7407, 'learning_rate': 9.611447197409544e-06, 'epoch': 0.53}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13391
total_samples=16016, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:35,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.42 | bwd_microstep: 1800.33 | bwd_inner_microstep: 1703.25 | bwd_allreduce_microstep: 97.01 | step_microstep: 0.14
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12122
total_samples=16020, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:37,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.20 | bwd_microstep: 1789.88 | bwd_inner_microstep: 1579.19 | bwd_allreduce_microstep: 210.62 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13300
total_samples=16024, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:40,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.38 | bwd_microstep: 1836.13 | bwd_inner_microstep: 1686.85 | bwd_allreduce_microstep: 149.21 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14362
total_samples=16030, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:42,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 04:57:42,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.68 | bwd_microstep: 1814.91 | bwd_inner_microstep: 1752.48 | bwd_allreduce_microstep: 62.36 | step_microstep: 122.60
[2025-08-03 04:57:42,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.61 | bwd: 7241.29 | bwd_inner: 6721.77 | bwd_allreduce: 519.28 | step: 123.07
{'loss': 0.7449, 'learning_rate': 9.595266166139366e-06, 'epoch': 0.53}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13564
total_samples=16034, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:45,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.57 | bwd_microstep: 2012.00 | bwd_inner_microstep: 1866.70 | bwd_allreduce_microstep: 145.24 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14079
total_samples=16038, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:48,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.87 | bwd_microstep: 1725.51 | bwd_inner_microstep: 1678.06 | bwd_allreduce_microstep: 47.38 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13555
total_samples=16042, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:50,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.59 | bwd_microstep: 1784.72 | bwd_inner_microstep: 1726.00 | bwd_allreduce_microstep: 58.66 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13264
total_samples=16046, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:53,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 04:57:53,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.36 | bwd_microstep: 2115.52 | bwd_inner_microstep: 1843.49 | bwd_allreduce_microstep: 271.97 | step_microstep: 109.30
[2025-08-03 04:57:53,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.32 | bwd: 7637.80 | bwd_inner: 7114.24 | bwd_allreduce: 523.32 | step: 109.76
{'loss': 0.7528, 'learning_rate': 9.579086196236483e-06, 'epoch': 0.53}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12361
total_samples=16049, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:56,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.36 | bwd_microstep: 2037.83 | bwd_inner_microstep: 1810.38 | bwd_allreduce_microstep: 227.39 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11950
total_samples=16052, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:57:59,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.85 | bwd_microstep: 1857.07 | bwd_inner_microstep: 1569.83 | bwd_allreduce_microstep: 287.18 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12744
total_samples=16056, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:01,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.37 | bwd_microstep: 1696.04 | bwd_inner_microstep: 1618.53 | bwd_allreduce_microstep: 77.45 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13504
total_samples=16060, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:04,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.77
[2025-08-03 04:58:04,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.05 | bwd_microstep: 2032.30 | bwd_inner_microstep: 1719.30 | bwd_allreduce_microstep: 312.93 | step_microstep: 110.04
[2025-08-03 04:58:04,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.57 | bwd: 7623.28 | bwd_inner: 6718.03 | bwd_allreduce: 905.02 | step: 110.37
{'loss': 0.7475, 'learning_rate': 9.562907330130981e-06, 'epoch': 0.53}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12338
total_samples=16063, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:07,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.10 | bwd_microstep: 2161.73 | bwd_inner_microstep: 1811.90 | bwd_allreduce_microstep: 349.76 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14165
total_samples=16067, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:10,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.14 | bwd_microstep: 2044.12 | bwd_inner_microstep: 1902.59 | bwd_allreduce_microstep: 141.46 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13542
total_samples=16071, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:12,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.11 | bwd_microstep: 1692.55 | bwd_inner_microstep: 1661.64 | bwd_allreduce_microstep: 30.85 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12386
total_samples=16075, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:15,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 04:58:15,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.69 | bwd_microstep: 1831.59 | bwd_inner_microstep: 1637.78 | bwd_allreduce_microstep: 193.74 | step_microstep: 126.64
[2025-08-03 04:58:15,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.98 | bwd: 7730.02 | bwd_inner: 7013.91 | bwd_allreduce: 715.89 | step: 127.07
{'loss': 0.7493, 'learning_rate': 9.54672961025005e-06, 'epoch': 0.53}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12065
total_samples=16078, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:18,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.90 | bwd_microstep: 1757.09 | bwd_inner_microstep: 1556.60 | bwd_allreduce_microstep: 200.42 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13251
total_samples=16082, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:20,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.52 | bwd_microstep: 1995.10 | bwd_inner_microstep: 1855.66 | bwd_allreduce_microstep: 139.38 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14220
total_samples=16086, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:23,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.23 | bwd_microstep: 1802.10 | bwd_inner_microstep: 1734.32 | bwd_allreduce_microstep: 67.71 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13187
total_samples=16090, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:26,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 04:58:26,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.02 | bwd_microstep: 1719.62 | bwd_inner_microstep: 1654.49 | bwd_allreduce_microstep: 65.07 | step_microstep: 147.99
[2025-08-03 04:58:26,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.59 | bwd: 7273.96 | bwd_inner: 6801.06 | bwd_allreduce: 472.67 | step: 148.45
{'loss': 0.7457, 'learning_rate': 9.530553079017872e-06, 'epoch': 0.53}
 53%|█████▎    | 1054/2000 [3:14:47<2:50:46, 10.83s/it]
 53%|█████▎    | 1055/2000 [3:14:57<2:49:03, 10.73s/it]
 53%|█████▎    | 1056/2000 [3:15:08<2:49:28, 10.77s/it]
 53%|█████▎    | 1057/2000 [3:15:19<2:49:32, 10.79s/it]
 53%|█████▎    | 1058/2000 [3:15:30<2:50:09, 10.84s/it]
 53%|█████▎    | 1059/2000 [3:15:41<2:48:39, 10.75s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11610
total_samples=16093, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:28,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.30 | bwd_microstep: 1829.56 | bwd_inner_microstep: 1580.54 | bwd_allreduce_microstep: 248.96 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12805
total_samples=16097, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:31,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.86 | bwd_microstep: 1754.43 | bwd_inner_microstep: 1652.93 | bwd_allreduce_microstep: 101.43 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13595
total_samples=16101, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:34,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.39 | bwd_microstep: 1953.46 | bwd_inner_microstep: 1716.66 | bwd_allreduce_microstep: 236.74 | step_microstep: 0.21
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13653
total_samples=16105, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:36,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69
[2025-08-03 04:58:36,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.24 | bwd_microstep: 1838.97 | bwd_inner_microstep: 1725.91 | bwd_allreduce_microstep: 112.99 | step_microstep: 144.58
[2025-08-03 04:58:36,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2862.72 | bwd: 7376.48 | bwd_inner: 6676.03 | bwd_allreduce: 700.20 | step: 145.05
{'loss': 0.7606, 'learning_rate': 9.514377778855521e-06, 'epoch': 0.53}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11818
total_samples=16108, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:39,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.46 | bwd_microstep: 1836.65 | bwd_inner_microstep: 1608.33 | bwd_allreduce_microstep: 228.25 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12914
total_samples=16112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:42,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.01 | bwd_microstep: 1900.62 | bwd_inner_microstep: 1889.21 | bwd_allreduce_microstep: 11.35 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13059
total_samples=16116, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:44,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.10 | bwd_microstep: 1900.18 | bwd_inner_microstep: 1712.43 | bwd_allreduce_microstep: 187.68 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14706
total_samples=16120, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:47,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.74
[2025-08-03 04:58:47,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.20 | bwd_microstep: 1754.68 | bwd_inner_microstep: 1725.37 | bwd_allreduce_microstep: 29.24 | step_microstep: 113.77
[2025-08-03 04:58:47,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.69 | bwd: 7392.18 | bwd_inner: 6935.33 | bwd_allreduce: 456.60 | step: 114.43
{'loss': 0.7472, 'learning_rate': 9.498203752180827e-06, 'epoch': 0.53}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12007
total_samples=16123, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:50,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.16 | bwd_microstep: 1902.80 | bwd_inner_microstep: 1556.75 | bwd_allreduce_microstep: 345.97 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13978
total_samples=16127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:52,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.10 | bwd_microstep: 1752.72 | bwd_inner_microstep: 1723.92 | bwd_allreduce_microstep: 28.74 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12853
total_samples=16131, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:55,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.13 | bwd_microstep: 1960.32 | bwd_inner_microstep: 1828.48 | bwd_allreduce_microstep: 131.78 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13532
total_samples=16135, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:58:58,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.92
[2025-08-03 04:58:58,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.59 | bwd_microstep: 1841.43 | bwd_inner_microstep: 1709.33 | bwd_allreduce_microstep: 132.03 | step_microstep: 129.63
[2025-08-03 04:58:58,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.91 | bwd: 7457.32 | bwd_inner: 6818.46 | bwd_allreduce: 638.61 | step: 130.24
{'loss': 0.7558, 'learning_rate': 9.482031041408296e-06, 'epoch': 0.53}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11592
total_samples=16138, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:00,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.02 | bwd_microstep: 2056.99 | bwd_inner_microstep: 1815.50 | bwd_allreduce_microstep: 241.42 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13364
total_samples=16143, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:03,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.90 | bwd_microstep: 1822.70 | bwd_inner_microstep: 1746.94 | bwd_allreduce_microstep: 75.69 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13427
total_samples=16148, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:06,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.52 | bwd_microstep: 1982.86 | bwd_inner_microstep: 1830.05 | bwd_allreduce_microstep: 152.75 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13802
total_samples=16152, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:09,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.02
[2025-08-03 04:59:09,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.57 | bwd_microstep: 2036.66 | bwd_inner_microstep: 1825.91 | bwd_allreduce_microstep: 210.68 | step_microstep: 113.37
[2025-08-03 04:59:09,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.93 | bwd: 7899.26 | bwd_inner: 7218.39 | bwd_allreduce: 680.62 | step: 113.87
{'loss': 0.7476, 'learning_rate': 9.465859688948977e-06, 'epoch': 0.53}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12152
total_samples=16155, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:11,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.44 | bwd_microstep: 1806.76 | bwd_inner_microstep: 1573.53 | bwd_allreduce_microstep: 233.16 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13264
total_samples=16159, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:14,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.36 | bwd_microstep: 1737.98 | bwd_inner_microstep: 1670.23 | bwd_allreduce_microstep: 67.68 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12799
total_samples=16163, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:16,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.02 | bwd_microstep: 1826.28 | bwd_inner_microstep: 1679.25 | bwd_allreduce_microstep: 146.96 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13512
total_samples=16168, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:19,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.95
[2025-08-03 04:59:19,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.37 | bwd_microstep: 2080.79 | bwd_inner_microstep: 1924.17 | bwd_allreduce_microstep: 156.55 | step_microstep: 141.67
[2025-08-03 04:59:19,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.12 | bwd: 7451.86 | bwd_inner: 6847.18 | bwd_allreduce: 604.44 | step: 142.26
{'loss': 0.7553, 'learning_rate': 9.449689737210352e-06, 'epoch': 0.53}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13290
total_samples=16172, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:22,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.71 | bwd_microstep: 1769.41 | bwd_inner_microstep: 1694.23 | bwd_allreduce_microstep: 75.12 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11689
total_samples=16175, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:25,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.22 | bwd_microstep: 1759.80 | bwd_inner_microstep: 1544.94 | bwd_allreduce_microstep: 214.80 | step_microstep: 0.16
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13890
total_samples=16179, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:27,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 750.92 | bwd_microstep: 1867.45 | bwd_inner_microstep: 1689.15 | bwd_allreduce_microstep: 178.22 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14990
total_samples=16184, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:30,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 04:59:30,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.34 | bwd_microstep: 1773.40 | bwd_inner_microstep: 1741.18 | bwd_allreduce_microstep: 32.15 | step_microstep: 119.96
[2025-08-03 04:59:30,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2874.12 | bwd: 7170.12 | bwd_inner: 6669.49 | bwd_allreduce: 500.39 | step: 120.50
 53%|█████▎    | 1060/2000 [3:15:51<2:48:03, 10.73s/it]
 53%|█████▎    | 1061/2000 [3:16:02<2:47:20, 10.69s/it]
 53%|█████▎    | 1062/2000 [3:16:13<2:47:14, 10.70s/it]
 53%|█████▎    | 1063/2000 [3:16:24<2:48:58, 10.82s/it]
 53%|█████▎    | 1064/2000 [3:16:34<2:48:10, 10.78s/it]
{'loss': 0.747, 'learning_rate': 9.433521228596237e-06, 'epoch': 0.53}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13384
total_samples=16188, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:33,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.28 | bwd_microstep: 1998.65 | bwd_inner_microstep: 1865.60 | bwd_allreduce_microstep: 132.98 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16231
total_samples=16192, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:35,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.94 | bwd_microstep: 1835.23 | bwd_inner_microstep: 1828.72 | bwd_allreduce_microstep: 6.45 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13305
total_samples=16196, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:38,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.68 | bwd_microstep: 1773.83 | bwd_inner_microstep: 1695.06 | bwd_allreduce_microstep: 78.70 | step_microstep: 0.14
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12921
total_samples=16200, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:41,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61
[2025-08-03 04:59:41,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.88 | bwd_microstep: 1845.14 | bwd_inner_microstep: 1665.59 | bwd_allreduce_microstep: 179.49 | step_microstep: 135.12
[2025-08-03 04:59:41,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.70 | bwd: 7452.90 | bwd_inner: 7054.96 | bwd_allreduce: 397.70 | step: 135.61
{'loss': 0.7552, 'learning_rate': 9.417354205506663e-06, 'epoch': 0.53}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13904
total_samples=16204, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:43,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.03 | bwd_microstep: 1780.21 | bwd_inner_microstep: 1708.37 | bwd_allreduce_microstep: 71.77 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12352
total_samples=16207, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:46,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.71 | bwd_microstep: 1773.96 | bwd_inner_microstep: 1568.59 | bwd_allreduce_microstep: 205.31 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13462
total_samples=16211, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:48,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.90 | bwd_microstep: 1776.53 | bwd_inner_microstep: 1693.01 | bwd_allreduce_microstep: 83.46 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13604
total_samples=16215, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:51,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 04:59:51,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.65 | bwd_microstep: 1984.96 | bwd_inner_microstep: 1913.30 | bwd_allreduce_microstep: 71.60 | step_microstep: 111.86
[2025-08-03 04:59:51,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.21 | bwd: 7315.72 | bwd_inner: 6883.27 | bwd_allreduce: 432.22 | step: 112.33
{'loss': 0.765, 'learning_rate': 9.401188710337757e-06, 'epoch': 0.53}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13472
total_samples=16219, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:54,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.07 | bwd_microstep: 2165.23 | bwd_inner_microstep: 2159.13 | bwd_allreduce_microstep: 6.03 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11617
total_samples=16222, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 04:59:58,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.99 | bwd_microstep: 3019.75 | bwd_inner_microstep: 2887.49 | bwd_allreduce_microstep: 132.19 | step_microstep: 0.21
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12955
total_samples=16226, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:01,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.64 | bwd_microstep: 1973.62 | bwd_inner_microstep: 1653.98 | bwd_allreduce_microstep: 319.57 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12501
total_samples=16229, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:03,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 05:00:03,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.15 | bwd_microstep: 1788.80 | bwd_inner_microstep: 1578.64 | bwd_allreduce_microstep: 210.08 | step_microstep: 147.27
[2025-08-03 05:00:03,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.77 | bwd: 8947.44 | bwd_inner: 8279.24 | bwd_allreduce: 667.96 | step: 147.73
{'loss': 0.7531, 'learning_rate': 9.385024785481653e-06, 'epoch': 0.53}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11903
total_samples=16232, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:06,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.03 | bwd_microstep: 1745.79 | bwd_inner_microstep: 1549.94 | bwd_allreduce_microstep: 195.79 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11683
total_samples=16235, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:09,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.31 | bwd_microstep: 2034.36 | bwd_inner_microstep: 1801.10 | bwd_allreduce_microstep: 233.18 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13253
total_samples=16239, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:11,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.27 | bwd_microstep: 1856.32 | bwd_inner_microstep: 1729.58 | bwd_allreduce_microstep: 126.67 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13137
total_samples=16243, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:14,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 05:00:14,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.47 | bwd_microstep: 1796.79 | bwd_inner_microstep: 1703.43 | bwd_allreduce_microstep: 93.30 | step_microstep: 143.88
[2025-08-03 05:00:14,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.99 | bwd: 7433.31 | bwd_inner: 6784.04 | bwd_allreduce: 649.03 | step: 144.34
{'loss': 0.7475, 'learning_rate': 9.368862473326355e-06, 'epoch': 0.53}
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 12100
total_samples=16246, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:18,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1511.33 | bwd_microstep: 2530.53 | bwd_inner_microstep: 2280.76 | bwd_allreduce_microstep: 249.71 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16165
total_samples=16250, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:21,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.50 | bwd_microstep: 1948.06 | bwd_inner_microstep: 1843.74 | bwd_allreduce_microstep: 104.25 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13719
total_samples=16255, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:24,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.71 | bwd_microstep: 1895.46 | bwd_inner_microstep: 1669.18 | bwd_allreduce_microstep: 226.22 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13333
total_samples=16259, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:27,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 05:00:27,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.31 | bwd_microstep: 2012.92 | bwd_inner_microstep: 1876.27 | bwd_allreduce_microstep: 136.58 | step_microstep: 110.54
[2025-08-03 05:00:27,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3646.78 | bwd: 8387.13 | bwd_inner: 7669.94 | bwd_allreduce: 716.84 | step: 110.92
 53%|█████▎    | 1065/2000 [3:16:45<2:46:34, 10.69s/it]
 53%|█████▎    | 1066/2000 [3:16:55<2:46:33, 10.70s/it]
 53%|█████▎    | 1067/2000 [3:17:06<2:45:42, 10.66s/it]
 53%|█████▎    | 1068/2000 [3:17:18<2:52:42, 11.12s/it]
 53%|█████▎    | 1069/2000 [3:17:29<2:50:46, 11.01s/it]
 54%|█████▎    | 1070/2000 [3:17:41<2:57:14, 11.44s/it]
{'loss': 0.7611, 'learning_rate': 9.352701816255643e-06, 'epoch': 0.54}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11734
total_samples=16263, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:29,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.23 | bwd_microstep: 1898.51 | bwd_inner_microstep: 1728.75 | bwd_allreduce_microstep: 169.69 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13853
total_samples=16267, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:32,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.28 | bwd_microstep: 1819.39 | bwd_inner_microstep: 1737.83 | bwd_allreduce_microstep: 81.49 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13282
total_samples=16271, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:35,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.18 | bwd_microstep: 2114.24 | bwd_inner_microstep: 1903.29 | bwd_allreduce_microstep: 210.89 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13402
total_samples=16275, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:37,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 05:00:37,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.42 | bwd_microstep: 1851.04 | bwd_inner_microstep: 1699.40 | bwd_allreduce_microstep: 151.57 | step_microstep: 134.33
[2025-08-03 05:00:37,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.06 | bwd: 7683.22 | bwd_inner: 7069.27 | bwd_allreduce: 613.71 | step: 134.81
{'loss': 0.7401, 'learning_rate': 9.336542856648958e-06, 'epoch': 0.54}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12227
total_samples=16278, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:40,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.37 | bwd_microstep: 1882.70 | bwd_inner_microstep: 1563.67 | bwd_allreduce_microstep: 318.96 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13587
total_samples=16282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:43,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.45 | bwd_microstep: 1744.22 | bwd_inner_microstep: 1685.53 | bwd_allreduce_microstep: 58.63 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12737
total_samples=16286, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:45,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.01 | bwd_microstep: 1767.65 | bwd_inner_microstep: 1659.38 | bwd_allreduce_microstep: 108.21 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15187
total_samples=16291, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:48,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.71
[2025-08-03 05:00:48,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.74 | bwd_microstep: 1843.46 | bwd_inner_microstep: 1790.18 | bwd_allreduce_microstep: 53.21 | step_microstep: 144.81
[2025-08-03 05:00:48,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.50 | bwd: 7238.08 | bwd_inner: 6698.76 | bwd_allreduce: 539.09 | step: 145.28
{'loss': 0.7416, 'learning_rate': 9.320385636881283e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14312
total_samples=16295, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:51,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.39 | bwd_microstep: 1781.82 | bwd_inner_microstep: 1736.58 | bwd_allreduce_microstep: 45.18 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13360
total_samples=16299, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:53,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.48 | bwd_microstep: 1772.37 | bwd_inner_microstep: 1669.61 | bwd_allreduce_microstep: 102.69 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14235
total_samples=16303, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:56,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.64 | bwd_microstep: 1801.78 | bwd_inner_microstep: 1753.15 | bwd_allreduce_microstep: 48.56 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13871
total_samples=16307, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:00:58,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.65
[2025-08-03 05:00:58,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.42 | bwd_microstep: 1812.28 | bwd_inner_microstep: 1757.44 | bwd_allreduce_microstep: 54.77 | step_microstep: 117.65
[2025-08-03 05:00:58,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.86 | bwd: 7168.29 | bwd_inner: 6916.77 | bwd_allreduce: 251.28 | step: 118.10
{'loss': 0.7482, 'learning_rate': 9.30423019932305e-06, 'epoch': 0.54}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11755
total_samples=16310, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:01,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.71 | bwd_microstep: 2106.48 | bwd_inner_microstep: 1900.05 | bwd_allreduce_microstep: 206.37 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13917
total_samples=16314, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:04,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.55 | bwd_microstep: 1898.63 | bwd_inner_microstep: 1729.73 | bwd_allreduce_microstep: 168.84 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12495
total_samples=16318, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:07,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.98 | bwd_microstep: 1799.84 | bwd_inner_microstep: 1614.80 | bwd_allreduce_microstep: 184.98 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13502
total_samples=16322, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:10,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 05:01:10,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.72 | bwd_microstep: 2138.59 | bwd_inner_microstep: 2132.67 | bwd_allreduce_microstep: 5.86 | step_microstep: 113.46
[2025-08-03 05:01:10,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.88 | bwd: 7943.59 | bwd_inner: 7377.24 | bwd_allreduce: 566.12 | step: 113.94
{'loss': 0.7551, 'learning_rate': 9.288076586340005e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13312
total_samples=16326, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:12,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.22 | bwd_microstep: 2173.81 | bwd_inner_microstep: 2056.27 | bwd_allreduce_microstep: 117.47 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14327
total_samples=16330, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:15,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.69 | bwd_microstep: 2240.76 | bwd_inner_microstep: 2113.42 | bwd_allreduce_microstep: 127.27 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12549
total_samples=16333, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:18,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.10 | bwd_microstep: 2041.81 | bwd_inner_microstep: 1827.60 | bwd_allreduce_microstep: 214.15 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14010
total_samples=16337, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:21,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 05:01:21,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.75 | bwd_microstep: 1785.03 | bwd_inner_microstep: 1742.63 | bwd_allreduce_microstep: 42.35 | step_microstep: 135.20
[2025-08-03 05:01:21,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.70 | bwd: 8241.47 | bwd_inner: 7739.91 | bwd_allreduce: 501.32 | step: 135.75
{'loss': 0.7498, 'learning_rate': 9.27192484029312e-06, 'epoch': 0.54}
54%|█████▎    | 1070/2000 [3:17:41<2:57:14, 11.44s/it]
54%|█████▎    | 1071/2000 [3:17:52<2:54:37, 11.28s/it]
54%|█████▎    | 1072/2000 [3:18:03<2:50:42, 11.04s/it]
54%|█████▎    | 1073/2000 [3:18:13<2:47:34, 10.85s/it]
54%|█████▎    | 1074/2000 [3:18:24<2:48:57, 10.95s/it]
54%|█████▍    | 1075/2000 [3:18:36<2:51:05, 11.10s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13483
total_samples=16341, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:24,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.38 | bwd_microstep: 1722.28 | bwd_inner_microstep: 1673.84 | bwd_allreduce_microstep: 48.37 | step_microstep: 0.71
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13371
total_samples=16345, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:26,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.02 | bwd_microstep: 2195.59 | bwd_inner_microstep: 2064.11 | bwd_allreduce_microstep: 131.43 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12645
total_samples=16348, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:29,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.81 | bwd_microstep: 1978.23 | bwd_inner_microstep: 1613.88 | bwd_allreduce_microstep: 364.28 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13761
total_samples=16352, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:32,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 05:01:32,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.29 | bwd_microstep: 1832.59 | bwd_inner_microstep: 1747.31 | bwd_allreduce_microstep: 85.22 | step_microstep: 113.65
[2025-08-03 05:01:32,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.42 | bwd: 7728.74 | bwd_inner: 7099.13 | bwd_allreduce: 629.38 | step: 114.58
{'loss': 0.7559, 'learning_rate': 9.255775003538462e-06, 'epoch': 0.54}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11872
total_samples=16355, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:35,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.04 | bwd_microstep: 1924.99 | bwd_inner_microstep: 1560.42 | bwd_allreduce_microstep: 364.49 | step_microstep: 0.15
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13880
total_samples=16360, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:37,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.71 | bwd_microstep: 1832.32 | bwd_inner_microstep: 1690.42 | bwd_allreduce_microstep: 141.83 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891
total_samples=16363, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:40,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.55 | bwd_microstep: 1846.54 | bwd_inner_microstep: 1542.99 | bwd_allreduce_microstep: 303.48 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13507
total_samples=16367, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:43,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 05:01:43,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1001.10 | bwd_microstep: 2171.20 | bwd_inner_microstep: 2057.94 | bwd_allreduce_microstep: 113.19 | step_microstep: 117.33
[2025-08-03 05:01:43,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3140.34 | bwd: 7775.10 | bwd_inner: 6851.76 | bwd_allreduce: 923.07 | step: 117.86
{'loss': 0.7632, 'learning_rate': 9.239627118427098e-06, 'epoch': 0.54}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11859
total_samples=16370, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:46,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.28 | bwd_microstep: 1797.55 | bwd_inner_microstep: 1556.25 | bwd_allreduce_microstep: 241.23 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13298
total_samples=16374, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:49,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.78 | bwd_microstep: 2057.31 | bwd_inner_microstep: 1880.43 | bwd_allreduce_microstep: 176.82 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11605
total_samples=16377, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:52,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.04 | bwd_microstep: 2248.14 | bwd_inner_microstep: 1915.10 | bwd_allreduce_microstep: 332.97 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13776
total_samples=16381, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:55,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 05:01:55,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.03 | bwd_microstep: 1912.09 | bwd_inner_microstep: 1861.79 | bwd_allreduce_microstep: 50.24 | step_microstep: 135.15
[2025-08-03 05:01:55,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.08 | bwd: 8015.14 | bwd_inner: 7213.56 | bwd_allreduce: 801.34 | step: 135.50
{'loss': 0.7444, 'learning_rate': 9.22348122730497e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13413
total_samples=16385, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:01:57,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.66 | bwd_microstep: 2092.20 | bwd_inner_microstep: 1933.67 | bwd_allreduce_microstep: 158.47 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13496
total_samples=16389, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:00,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.92 | bwd_microstep: 1733.82 | bwd_inner_microstep: 1674.24 | bwd_allreduce_microstep: 59.52 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12052
total_samples=16392, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:03,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.14 | bwd_microstep: 1990.89 | bwd_inner_microstep: 1783.19 | bwd_allreduce_microstep: 207.64 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13326
total_samples=16396, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:05,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 05:02:05,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.43 | bwd_microstep: 1734.13 | bwd_inner_microstep: 1672.78 | bwd_allreduce_microstep: 61.28 | step_microstep: 110.54
[2025-08-03 05:02:05,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.08 | bwd: 7551.08 | bwd_inner: 7063.87 | bwd_allreduce: 486.99 | step: 110.99
{'loss': 0.7546, 'learning_rate': 9.207337372512797e-06, 'epoch': 0.54}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13746
total_samples=16400, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:08,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.38 | bwd_microstep: 1794.62 | bwd_inner_microstep: 1706.91 | bwd_allreduce_microstep: 87.64 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11846
total_samples=16403, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:11,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.51 | bwd_microstep: 1852.62 | bwd_inner_microstep: 1591.27 | bwd_allreduce_microstep: 261.28 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12067
total_samples=16406, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:13,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.99 | bwd_microstep: 1823.35 | bwd_inner_microstep: 1596.53 | bwd_allreduce_microstep: 226.74 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13922
total_samples=16410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:16,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 05:02:16,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.56 | bwd_microstep: 2132.24 | bwd_inner_microstep: 1984.35 | bwd_allreduce_microstep: 147.83 | step_microstep: 111.95
[2025-08-03 05:02:16,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2868.37 | bwd: 7602.88 | bwd_inner: 6879.05 | bwd_allreduce: 723.59 | step: 112.35
{'loss': 0.7426, 'learning_rate': 9.19119559638596e-06, 'epoch': 0.54}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11739
total_samples=16413, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:19,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.05 | bwd_microstep: 1775.19 | bwd_inner_microstep: 1542.61 | bwd_allreduce_microstep: 232.51 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11811
total_samples=16416, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:21,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.07 | bwd_microstep: 1800.68 | bwd_inner_microstep: 1565.36 | bwd_allreduce_microstep: 235.26 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14008
total_samples=16420, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:24,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.38 | bwd_microstep: 1760.65 | bwd_inner_microstep: 1739.99 | bwd_allreduce_microstep: 20.59 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13671
total_samples=16424, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:28,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78
[2025-08-03 05:02:28,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.56 | bwd_microstep: 2934.02 | bwd_inner_microstep: 2927.98 | bwd_allreduce_microstep: 5.98 | step_microstep: 116.34
[2025-08-03 05:02:28,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.99 | bwd: 8270.59 | bwd_inner: 7775.94 | bwd_allreduce: 494.42 | step: 116.79
54%|█████▍    | 1076/2000 [3:18:47<2:50:16, 11.06s/it]
54%|█████▍    | 1077/2000 [3:18:58<2:51:19, 11.14s/it]
54%|█████▍    | 1078/2000 [3:19:09<2:51:47, 11.18s/it]
54%|█████▍    | 1079/2000 [3:19:20<2:49:54, 11.07s/it]
54%|█████▍    | 1080/2000 [3:19:31<2:48:48, 11.01s/it]
{'loss': 0.7616, 'learning_rate': 9.17505594125438e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13452
total_samples=16428, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:30,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.28 | bwd_microstep: 1944.45 | bwd_inner_microstep: 1723.86 | bwd_allreduce_microstep: 220.52 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11595
total_samples=16431, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:33,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.77 | bwd_microstep: 1922.86 | bwd_inner_microstep: 1584.47 | bwd_allreduce_microstep: 338.32 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11829
total_samples=16434, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:36,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.21 | bwd_microstep: 1721.70 | bwd_inner_microstep: 1539.37 | bwd_allreduce_microstep: 182.26 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14949
total_samples=16438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:38,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:02:38,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.31 | bwd_microstep: 1821.90 | bwd_inner_microstep: 1769.00 | bwd_allreduce_microstep: 52.84 | step_microstep: 111.43
[2025-08-03 05:02:38,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.49 | bwd: 7410.97 | bwd_inner: 6616.69 | bwd_allreduce: 794.03 | step: 111.97
{'loss': 0.736, 'learning_rate': 9.158918449442425e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13292
total_samples=16442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:41,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.22 | bwd_microstep: 1759.05 | bwd_inner_microstep: 1702.14 | bwd_allreduce_microstep: 56.84 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13697
total_samples=16446, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:44,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.93 | bwd_microstep: 2154.71 | bwd_inner_microstep: 2055.23 | bwd_allreduce_microstep: 99.42 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14147
total_samples=16450, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:46,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.90 | bwd_microstep: 1873.28 | bwd_inner_microstep: 1765.26 | bwd_allreduce_microstep: 107.96 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14268
total_samples=16454, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:49,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 05:02:49,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.19 | bwd_microstep: 2035.36 | bwd_inner_microstep: 1951.38 | bwd_allreduce_microstep: 83.91 | step_microstep: 108.26
[2025-08-03 05:02:49,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.16 | bwd: 7822.44 | bwd_inner: 7473.99 | bwd_allreduce: 348.22 | step: 108.79
{'loss': 0.7577, 'learning_rate': 9.142783163268782e-06, 'epoch': 0.54}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11717
total_samples=16457, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:52,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.63 | bwd_microstep: 1849.02 | bwd_inner_microstep: 1597.54 | bwd_allreduce_microstep: 251.42 | step_microstep: 0.27
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12678
total_samples=16461, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:55,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.03 | bwd_microstep: 1778.14 | bwd_inner_microstep: 1611.79 | bwd_allreduce_microstep: 166.28 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12968
total_samples=16465, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:02:57,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.34 | bwd_microstep: 1930.76 | bwd_inner_microstep: 1654.52 | bwd_allreduce_microstep: 276.18 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13820
total_samples=16469, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:00,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 05:03:00,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.68 | bwd_microstep: 2066.27 | bwd_inner_microstep: 1958.39 | bwd_allreduce_microstep: 107.80 | step_microstep: 131.72
[2025-08-03 05:03:00,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.60 | bwd: 7624.24 | bwd_inner: 6822.23 | bwd_allreduce: 801.75 | step: 132.23
{'loss': 0.7509, 'learning_rate': 9.126650125046361e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14462
total_samples=16473, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:03,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.26 | bwd_microstep: 1900.97 | bwd_inner_microstep: 1739.99 | bwd_allreduce_microstep: 160.92 | step_microstep: 0.31
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11683
total_samples=16476, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:06,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.82 | bwd_microstep: 1759.36 | bwd_inner_microstep: 1540.89 | bwd_allreduce_microstep: 218.40 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13193
total_samples=16480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:08,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.74 | bwd_microstep: 2159.99 | bwd_inner_microstep: 1917.59 | bwd_allreduce_microstep: 242.33 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13263
total_samples=16485, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:11,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 05:03:11,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.32 | bwd_microstep: 2063.38 | bwd_inner_microstep: 1904.09 | bwd_allreduce_microstep: 159.22 | step_microstep: 113.25
[2025-08-03 05:03:11,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.07 | bwd: 7883.74 | bwd_inner: 7102.56 | bwd_allreduce: 780.95 | step: 113.81
{'loss': 0.7587, 'learning_rate': 9.110519377082174e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13086
total_samples=16489, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:14,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.79 | bwd_microstep: 2038.69 | bwd_inner_microstep: 1896.94 | bwd_allreduce_microstep: 141.69 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13326
total_samples=16493, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:17,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.94 | bwd_microstep: 1753.14 | bwd_inner_microstep: 1686.40 | bwd_allreduce_microstep: 66.66 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13860
total_samples=16497, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:19,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.79 | bwd_microstep: 1777.29 | bwd_inner_microstep: 1720.83 | bwd_allreduce_microstep: 56.39 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13146
total_samples=16501, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:22,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 34.23
[2025-08-03 05:03:22,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.95 | bwd_microstep: 1858.86 | bwd_inner_microstep: 1656.79 | bwd_allreduce_microstep: 202.00 | step_microstep: 142.93
[2025-08-03 05:03:22,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.39 | bwd: 7428.03 | bwd_inner: 6960.96 | bwd_allreduce: 466.82 | step: 143.40
54%|█████▍    | 1081/2000 [3:19:43<2:50:46, 11.15s/it]
54%|█████▍    | 1082/2000 [3:19:53<2:48:15, 11.00s/it]
54%|█████▍    | 1083/2000 [3:20:04<2:48:22, 11.02s/it]
54%|█████▍    | 1084/2000 [3:20:15<2:47:41, 10.98s/it]
54%|█████▍    | 1085/2000 [3:20:26<2:47:56, 11.01s/it]
54%|█████▍    | 1086/2000 [3:20:37<2:46:25, 10.93s/it]
{'loss': 0.7564, 'learning_rate': 9.094390961677223e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13237
total_samples=16505, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:25,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.13 | bwd_microstep: 1729.48 | bwd_inner_microstep: 1679.00 | bwd_allreduce_microstep: 50.41 | step_microstep: 0.24
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12405
total_samples=16509, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:27,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 746.73 | bwd_microstep: 1835.33 | bwd_inner_microstep: 1630.54 | bwd_allreduce_microstep: 204.71 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13335
total_samples=16513, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:30,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.91 | bwd_microstep: 1702.33 | bwd_inner_microstep: 1661.14 | bwd_allreduce_microstep: 41.12 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13535
total_samples=16517, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:33,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.80
[2025-08-03 05:03:33,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.26 | bwd_microstep: 2081.50 | bwd_inner_microstep: 1923.84 | bwd_allreduce_microstep: 157.59 | step_microstep: 150.62
[2025-08-03 05:03:33,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.94 | bwd: 7348.70 | bwd_inner: 6894.52 | bwd_allreduce: 453.93 | step: 151.28
{'loss': 0.734, 'learning_rate': 9.078264921126405e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13270
total_samples=16521, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:35,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.79 | bwd_microstep: 1773.65 | bwd_inner_microstep: 1691.32 | bwd_allreduce_microstep: 82.26 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12376
total_samples=16524, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:38,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.35 | bwd_microstep: 1856.64 | bwd_inner_microstep: 1598.79 | bwd_allreduce_microstep: 257.78 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13188
total_samples=16528, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:41,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.16 | bwd_microstep: 1961.21 | bwd_inner_microstep: 1733.25 | bwd_allreduce_microstep: 227.88 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13726
total_samples=16532, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:44,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 05:03:44,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.27 | bwd_microstep: 2338.12 | bwd_inner_microstep: 2328.33 | bwd_allreduce_microstep: 9.72 | step_microstep: 112.03
[2025-08-03 05:03:44,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.50 | bwd: 7929.66 | bwd_inner: 7351.69 | bwd_allreduce: 577.72 | step: 112.62
{'loss': 0.7569, 'learning_rate': 9.062141297718372e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13428
total_samples=16536, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:47,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.45 | bwd_microstep: 1805.34 | bwd_inner_microstep: 1703.64 | bwd_allreduce_microstep: 101.63 | step_microstep: 0.25
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12324
total_samples=16540, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:49,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.73 | bwd_microstep: 1960.45 | bwd_inner_microstep: 1785.27 | bwd_allreduce_microstep: 175.12 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12052
total_samples=16544, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:52,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.06 | bwd_microstep: 1789.99 | bwd_inner_microstep: 1596.08 | bwd_allreduce_microstep: 193.84 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12846
total_samples=16548, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:54,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 05:03:54,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.74 | bwd_microstep: 1748.47 | bwd_inner_microstep: 1664.97 | bwd_allreduce_microstep: 83.44 | step_microstep: 116.23
[2025-08-03 05:03:54,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.91 | bwd: 7304.31 | bwd_inner: 6749.95 | bwd_allreduce: 554.10 | step: 116.70
{'loss': 0.7608, 'learning_rate': 9.046020133735455e-06, 'epoch': 0.54}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13533
total_samples=16552, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:03:57,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.50 | bwd_microstep: 1955.53 | bwd_inner_microstep: 1715.64 | bwd_allreduce_microstep: 239.83 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12106
total_samples=16556, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:00,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.71 | bwd_microstep: 1734.66 | bwd_inner_microstep: 1566.24 | bwd_allreduce_microstep: 168.36 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=16559, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:02,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 847.60 | bwd_microstep: 1809.35 | bwd_inner_microstep: 1593.01 | bwd_allreduce_microstep: 216.26 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12042
total_samples=16562, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:05,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.57
[2025-08-03 05:04:05,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.03 | bwd_microstep: 2004.73 | bwd_inner_microstep: 1778.62 | bwd_allreduce_microstep: 226.04 | step_microstep: 130.65
[2025-08-03 05:04:05,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2940.77 | bwd: 7504.33 | bwd_inner: 6653.49 | bwd_allreduce: 850.57 | step: 131.16
{'loss': 0.7553, 'learning_rate': 9.02990147145352e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13294
total_samples=16566, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:08,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.20 | bwd_microstep: 1721.92 | bwd_inner_microstep: 1672.81 | bwd_allreduce_microstep: 49.04 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14470
total_samples=16570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:11,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.58 | bwd_microstep: 2104.37 | bwd_inner_microstep: 1903.69 | bwd_allreduce_microstep: 200.61 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11902
total_samples=16573, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:14,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 750.11 | bwd_microstep: 2035.31 | bwd_inner_microstep: 1833.46 | bwd_allreduce_microstep: 201.78 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12037
total_samples=16576, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:17,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:04:17,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.07 | bwd_microstep: 2213.61 | bwd_inner_microstep: 2012.25 | bwd_allreduce_microstep: 201.30 | step_microstep: 134.18
[2025-08-03 05:04:17,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.88 | bwd: 8075.27 | bwd_inner: 7422.21 | bwd_allreduce: 652.82 | step: 134.75
{'loss': 0.737, 'learning_rate': 9.013785353141887e-06, 'epoch': 0.55}
 54%|█████▍    | 1086/2000 [3:20:37<2:46:25, 10.93s/it]
 54%|█████▍    | 1087/2000 [3:20:48<2:44:49, 10.83s/it]
 54%|█████▍    | 1088/2000 [3:20:59<2:46:16, 10.94s/it]
 54%|█████▍    | 1089/2000 [3:21:09<2:44:13, 10.82s/it]
 55%|█████▍    | 1090/2000 [3:21:20<2:44:24, 10.84s/it]
 55%|█████▍    | 1091/2000 [3:21:32<2:46:36, 11.00s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13725
total_samples=16580, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:20,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.25 | bwd_microstep: 2045.88 | bwd_inner_microstep: 1891.80 | bwd_allreduce_microstep: 154.02 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14036
total_samples=16584, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:22,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.83 | bwd_microstep: 1796.03 | bwd_inner_microstep: 1736.49 | bwd_allreduce_microstep: 59.47 | step_microstep: 0.31
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11622
total_samples=16587, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:25,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.02 | bwd_microstep: 1781.89 | bwd_inner_microstep: 1549.44 | bwd_allreduce_microstep: 232.39 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12754
total_samples=16590, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:27,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58
[2025-08-03 05:04:27,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.00 | bwd_microstep: 1816.62 | bwd_inner_microstep: 1598.21 | bwd_allreduce_microstep: 218.33 | step_microstep: 127.12
[2025-08-03 05:04:27,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.04 | bwd: 7440.49 | bwd_inner: 6775.93 | bwd_allreduce: 664.30 | step: 127.65
{'loss': 0.749, 'learning_rate': 8.99767182106319e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13490
total_samples=16594, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:30,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.64 | bwd_microstep: 1881.19 | bwd_inner_microstep: 1827.13 | bwd_allreduce_microstep: 53.99 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13058
total_samples=16598, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:33,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.08 | bwd_microstep: 1753.75 | bwd_inner_microstep: 1684.04 | bwd_allreduce_microstep: 69.64 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12294
total_samples=16603, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:35,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.04 | bwd_microstep: 1784.21 | bwd_inner_microstep: 1558.03 | bwd_allreduce_microstep: 226.12 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11573
total_samples=16606, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:38,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 05:04:38,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.61 | bwd_microstep: 1748.79 | bwd_inner_microstep: 1529.62 | bwd_allreduce_microstep: 219.09 | step_microstep: 153.08
[2025-08-03 05:04:38,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.29 | bwd: 7168.00 | bwd_inner: 6598.82 | bwd_allreduce: 568.92 | step: 153.66
{'loss': 0.7593, 'learning_rate': 8.981560917473292e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407
total_samples=16610, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:40,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.12 | bwd_microstep: 1767.64 | bwd_inner_microstep: 1693.60 | bwd_allreduce_microstep: 73.98 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13397
total_samples=16614, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:43,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.23 | bwd_microstep: 1971.44 | bwd_inner_microstep: 1906.27 | bwd_allreduce_microstep: 65.09 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885
total_samples=16617, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:46,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.10 | bwd_microstep: 1817.40 | bwd_inner_microstep: 1572.79 | bwd_allreduce_microstep: 244.54 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11659
total_samples=16620, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:48,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 05:04:48,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.61 | bwd_microstep: 1791.82 | bwd_inner_microstep: 1550.24 | bwd_allreduce_microstep: 241.52 | step_microstep: 109.67
[2025-08-03 05:04:48,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.99 | bwd: 7348.34 | bwd_inner: 6722.89 | bwd_allreduce: 625.21 | step: 110.24
{'loss': 0.7564, 'learning_rate': 8.965452684621164e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14158
total_samples=16625, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:51,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.11 | bwd_microstep: 1778.07 | bwd_inner_microstep: 1735.07 | bwd_allreduce_microstep: 42.93 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13532
total_samples=16629, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:54,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.72 | bwd_microstep: 1990.19 | bwd_inner_microstep: 1764.78 | bwd_allreduce_microstep: 225.31 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13444
total_samples=16633, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:56,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.37 | bwd_microstep: 1747.76 | bwd_inner_microstep: 1683.18 | bwd_allreduce_microstep: 64.51 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11754
total_samples=16636, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:04:59,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 05:04:59,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.60 | bwd_microstep: 2049.37 | bwd_inner_microstep: 1826.21 | bwd_allreduce_microstep: 223.09 | step_microstep: 114.71
[2025-08-03 05:04:59,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2876.73 | bwd: 7565.43 | bwd_inner: 7009.27 | bwd_allreduce: 555.90 | step: 115.15
{'loss': 0.7453, 'learning_rate': 8.949347164748761e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13215
total_samples=16640, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:02,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.00 | bwd_microstep: 1851.62 | bwd_inner_microstep: 1729.23 | bwd_allreduce_microstep: 122.34 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13445
total_samples=16644, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:04,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.67 | bwd_microstep: 1742.52 | bwd_inner_microstep: 1682.18 | bwd_allreduce_microstep: 60.27 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11851
total_samples=16647, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:07,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.18 | bwd_microstep: 1707.14 | bwd_inner_microstep: 1538.31 | bwd_allreduce_microstep: 168.77 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11969
total_samples=16650, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:10,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.31
[2025-08-03 05:05:10,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.92 | bwd_microstep: 1910.92 | bwd_inner_microstep: 1745.65 | bwd_allreduce_microstep: 165.20 | step_microstep: 138.48
[2025-08-03 05:05:10,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.70 | bwd: 7212.25 | bwd_inner: 6695.35 | bwd_allreduce: 516.66 | step: 139.06
{'loss': 0.7425, 'learning_rate': 8.933244400090937e-06, 'epoch': 0.55}
 55%|█████▍    | 1092/2000 [3:21:42<2:44:51, 10.89s/it]
 55%|█████▍    | 1093/2000 [3:21:53<2:42:27, 10.75s/it]
 55%|█████▍    | 1094/2000 [3:22:03<2:41:29, 10.70s/it]
 55%|█████▍    | 1095/2000 [3:22:14<2:41:59, 10.74s/it]
 55%|█████▍    | 1096/2000 [3:22:25<2:40:30, 10.65s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15086
total_samples=16654, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:13,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.36 | bwd_microstep: 2121.86 | bwd_inner_microstep: 1834.91 | bwd_allreduce_microstep: 286.88 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14143
total_samples=16658, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:15,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.22 | bwd_microstep: 2063.38 | bwd_inner_microstep: 1921.80 | bwd_allreduce_microstep: 141.51 | step_microstep: 0.94
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12265
total_samples=16661, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:18,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.49 | bwd_microstep: 2008.42 | bwd_inner_microstep: 1784.98 | bwd_allreduce_microstep: 223.37 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14833
total_samples=16665, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:21,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 05:05:21,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.11 | bwd_microstep: 1856.89 | bwd_inner_microstep: 1751.33 | bwd_allreduce_microstep: 105.49 | step_microstep: 129.22
[2025-08-03 05:05:21,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.11 | bwd: 8050.60 | bwd_inner: 7293.01 | bwd_allreduce: 757.33 | step: 130.43
{'loss': 0.7551, 'learning_rate': 8.91714443287531e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13700
total_samples=16670, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:24,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.29 | bwd_microstep: 1731.73 | bwd_inner_microstep: 1675.99 | bwd_allreduce_microstep: 55.67 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13341
total_samples=16674, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:26,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.02 | bwd_microstep: 1802.72 | bwd_inner_microstep: 1725.11 | bwd_allreduce_microstep: 77.54 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11835
total_samples=16677, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:29,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.86 | bwd_microstep: 2283.25 | bwd_inner_microstep: 1989.93 | bwd_allreduce_microstep: 293.23 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13210
total_samples=16682, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:32,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.03
[2025-08-03 05:05:32,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.80 | bwd_microstep: 2056.92 | bwd_inner_microstep: 1877.89 | bwd_allreduce_microstep: 178.97 | step_microstep: 135.73
[2025-08-03 05:05:32,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.92 | bwd: 7874.68 | bwd_inner: 7268.91 | bwd_allreduce: 605.50 | step: 136.36
{'loss': 0.7472, 'learning_rate': 8.901047305322172e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13972
total_samples=16686, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:35,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.18 | bwd_microstep: 2015.70 | bwd_inner_microstep: 1889.27 | bwd_allreduce_microstep: 126.38 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13961
total_samples=16690, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:38,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.12 | bwd_microstep: 1789.08 | bwd_inner_microstep: 1736.68 | bwd_allreduce_microstep: 52.32 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12122
total_samples=16693, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:40,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.57 | bwd_microstep: 1782.48 | bwd_inner_microstep: 1585.34 | bwd_allreduce_microstep: 197.08 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13573
total_samples=16697, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:43,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61
[2025-08-03 05:05:43,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.65 | bwd_microstep: 2166.11 | bwd_inner_microstep: 2000.15 | bwd_allreduce_microstep: 165.88 | step_microstep: 138.49
[2025-08-03 05:05:43,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.44 | bwd: 7753.43 | bwd_inner: 7211.43 | bwd_allreduce: 541.74 | step: 139.04
{'loss': 0.7561, 'learning_rate': 8.88495305964436e-06, 'epoch': 0.55}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13391
total_samples=16701, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:46,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.39 | bwd_microstep: 1803.57 | bwd_inner_microstep: 1689.73 | bwd_allreduce_microstep: 113.78 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13803
total_samples=16705, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:48,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.19 | bwd_microstep: 1804.99 | bwd_inner_microstep: 1741.69 | bwd_allreduce_microstep: 63.24 | step_microstep: 0.12
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 15137
total_samples=16708, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:51,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.22 | bwd_microstep: 1804.79 | bwd_inner_microstep: 1721.11 | bwd_allreduce_microstep: 83.61 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14636
total_samples=16712, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:54,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 20.56
[2025-08-03 05:05:54,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.25 | bwd_microstep: 1745.35 | bwd_inner_microstep: 1719.95 | bwd_allreduce_microstep: 25.33 | step_microstep: 124.40
[2025-08-03 05:05:54,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.98 | bwd: 7158.76 | bwd_inner: 6872.47 | bwd_allreduce: 286.03 | step: 124.77
{'loss': 0.7579, 'learning_rate': 8.868861738047158e-06, 'epoch': 0.55}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13479
total_samples=16716, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:56,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.43 | bwd_microstep: 1880.76 | bwd_inner_microstep: 1703.75 | bwd_allreduce_microstep: 176.94 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13015
total_samples=16720, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:05:59,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.20 | bwd_microstep: 1989.77 | bwd_inner_microstep: 1884.84 | bwd_allreduce_microstep: 104.86 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13264
total_samples=16724, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:02,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.46 | bwd_microstep: 1764.70 | bwd_inner_microstep: 1669.70 | bwd_allreduce_microstep: 94.93 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15238
total_samples=16729, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:04,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 05:06:04,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.94 | bwd_microstep: 1753.12 | bwd_inner_microstep: 1747.02 | bwd_allreduce_microstep: 6.04 | step_microstep: 114.27
[2025-08-03 05:06:04,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.95 | bwd: 7388.40 | bwd_inner: 7005.30 | bwd_allreduce: 382.85 | step: 114.97
{'loss': 0.7557, 'learning_rate': 8.852773382728184e-06, 'epoch': 0.55}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12847
total_samples=16733, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:07,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.18 | bwd_microstep: 1785.87 | bwd_inner_microstep: 1664.11 | bwd_allreduce_microstep: 121.70 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13045
total_samples=16737, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:10,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.03 | bwd_microstep: 2015.25 | bwd_inner_microstep: 2003.04 | bwd_allreduce_microstep: 12.15 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13718
total_samples=16741, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:12,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.80 | bwd_microstep: 1828.80 | bwd_inner_microstep: 1749.18 | bwd_allreduce_microstep: 79.55 | step_microstep: 0.19
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12907
total_samples=16745, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:15,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 05:06:15,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1833.46 | bwd_inner_microstep: 1773.40 | bwd_allreduce_microstep: 59.99 | step_microstep: 148.90
[2025-08-03 05:06:15,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.54 | bwd: 7463.44 | bwd_inner: 7189.73 | bwd_allreduce: 273.48 | step: 149.43
 55%|█████▍    | 1097/2000 [3:22:36<2:43:31, 10.87s/it]
 55%|█████▍    | 1098/2000 [3:22:47<2:44:36, 10.95s/it]
 55%|█████▍    | 1099/2000 [3:22:58<2:44:46, 10.97s/it]
 55%|█████▌    | 1100/2000 [3:23:08<2:42:05, 10.81s/it]
 55%|█████▌    | 1101/2000 [3:23:19<2:41:00, 10.75s/it]
{'loss': 0.7642, 'learning_rate': 8.836688035877268e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14330
total_samples=16750, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:17,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.79 | bwd_microstep: 1727.24 | bwd_inner_microstep: 1698.73 | bwd_allreduce_microstep: 28.45 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13481
total_samples=16754, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:20,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.15 | bwd_microstep: 1806.22 | bwd_inner_microstep: 1702.11 | bwd_allreduce_microstep: 104.03 | step_microstep: 0.32
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13255
total_samples=16758, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:23,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.42 | bwd_microstep: 1858.28 | bwd_inner_microstep: 1807.50 | bwd_allreduce_microstep: 50.72 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12518
total_samples=16761, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:26,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 36.66
[2025-08-03 05:06:26,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.40 | bwd_microstep: 1904.65 | bwd_inner_microstep: 1601.74 | bwd_allreduce_microstep: 302.84 | step_microstep: 162.85
[2025-08-03 05:06:26,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.69 | bwd: 7296.44 | bwd_inner: 6810.07 | bwd_allreduce: 486.13 | step: 163.42
{'loss': 0.7465, 'learning_rate': 8.820605739676363e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14254
total_samples=16765, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:28,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.42 | bwd_microstep: 2002.25 | bwd_inner_microstep: 1875.94 | bwd_allreduce_microstep: 126.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332
total_samples=16769, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:31,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.26 | bwd_microstep: 1793.07 | bwd_inner_microstep: 1705.28 | bwd_allreduce_microstep: 87.72 | step_microstep: 0.70
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468
total_samples=16773, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:34,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.61 | bwd_microstep: 1846.65 | bwd_inner_microstep: 1722.30 | bwd_allreduce_microstep: 124.27 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13567
total_samples=16777, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:36,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.29
[2025-08-03 05:06:36,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.09 | bwd_microstep: 1751.30 | bwd_inner_microstep: 1690.94 | bwd_allreduce_microstep: 60.27 | step_microstep: 145.05
[2025-08-03 05:06:36,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.31 | bwd: 7393.33 | bwd_inner: 6994.47 | bwd_allreduce: 398.59 | step: 146.14
{'loss': 0.7503, 'learning_rate': 8.804526536299413e-06, 'epoch': 0.55}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14049
total_samples=16782, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:39,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.27 | bwd_microstep: 2157.80 | bwd_inner_microstep: 2052.74 | bwd_allreduce_microstep: 105.00 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13646
total_samples=16786, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:42,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.15 | bwd_microstep: 1971.00 | bwd_inner_microstep: 1867.31 | bwd_allreduce_microstep: 103.61 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13319
total_samples=16790, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:45,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.62 | bwd_microstep: 1977.78 | bwd_inner_microstep: 1722.60 | bwd_allreduce_microstep: 255.13 | step_microstep: 0.68
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11720
total_samples=16793, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:48,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 05:06:48,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.41 | bwd_microstep: 1996.23 | bwd_inner_microstep: 1813.21 | bwd_allreduce_microstep: 182.95 | step_microstep: 120.77
[2025-08-03 05:06:48,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2867.38 | bwd: 8102.86 | bwd_inner: 7455.85 | bwd_allreduce: 646.77 | step: 121.88
{'loss': 0.7456, 'learning_rate': 8.788450467912254e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13693
total_samples=16797, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:51,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.71 | bwd_microstep: 2231.14 | bwd_inner_microstep: 2132.58 | bwd_allreduce_microstep: 98.50 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12935
total_samples=16801, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:53,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.74 | bwd_microstep: 1795.45 | bwd_inner_microstep: 1674.31 | bwd_allreduce_microstep: 121.04 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13527
total_samples=16805, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:56,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.44 | bwd_microstep: 1734.79 | bwd_inner_microstep: 1691.89 | bwd_allreduce_microstep: 42.83 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13748
total_samples=16809, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:06:59,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 05:06:59,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.66 | bwd_microstep: 2141.23 | bwd_inner_microstep: 1934.26 | bwd_allreduce_microstep: 206.91 | step_microstep: 145.14
[2025-08-03 05:06:59,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.47 | bwd: 7902.67 | bwd_inner: 7433.04 | bwd_allreduce: 469.38 | step: 145.86
{'loss': 0.7584, 'learning_rate': 8.772377576672502e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13831
total_samples=16815, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:01,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.50 | bwd_microstep: 1860.17 | bwd_inner_microstep: 1832.50 | bwd_allreduce_microstep: 27.60 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12828
total_samples=16819, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:04,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.82 | bwd_microstep: 2005.03 | bwd_inner_microstep: 1844.81 | bwd_allreduce_microstep: 160.16 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14871
total_samples=16823, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:07,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.93 | bwd_microstep: 1781.42 | bwd_inner_microstep: 1745.20 | bwd_allreduce_microstep: 36.15 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12060
total_samples=16826, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:10,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.32
[2025-08-03 05:07:10,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.53 | bwd_microstep: 1964.78 | bwd_inner_microstep: 1756.16 | bwd_allreduce_microstep: 208.54 | step_microstep: 158.41
[2025-08-03 05:07:10,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.70 | bwd: 7611.46 | bwd_inner: 7178.67 | bwd_allreduce: 432.54 | step: 158.98
55%|█████▌    | 1102/2000 [3:23:30<2:40:46, 10.74s/it]
55%|█████▌    | 1103/2000 [3:23:40<2:40:02, 10.70s/it]
55%|█████▌    | 1104/2000 [3:23:51<2:39:41, 10.69s/it]
55%|█████▌    | 1105/2000 [3:24:03<2:42:35, 10.90s/it]
55%|█████▌    | 1106/2000 [3:24:14<2:43:41, 10.99s/it]
55%|█████▌    | 1107/2000 [3:24:25<2:42:58, 10.95s/it]
{'loss': 0.7537, 'learning_rate': 8.75630790472944e-06, 'epoch': 0.55}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13830
total_samples=16831, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:12,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.98 | bwd_microstep: 2033.69 | bwd_inner_microstep: 1694.98 | bwd_allreduce_microstep: 338.64 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16278
total_samples=16835, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:15,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.79 | bwd_microstep: 1850.97 | bwd_inner_microstep: 1844.14 | bwd_allreduce_microstep: 6.75 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14075
total_samples=16841, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:18,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.03 | bwd_microstep: 1756.04 | bwd_inner_microstep: 1708.88 | bwd_allreduce_microstep: 47.09 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12214
total_samples=16844, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:20,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 05:07:20,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.63 | bwd_microstep: 1881.62 | bwd_inner_microstep: 1600.35 | bwd_allreduce_microstep: 281.20 | step_microstep: 111.97
[2025-08-03 05:07:20,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.37 | bwd: 7522.38 | bwd_inner: 6848.34 | bwd_allreduce: 673.79 | step: 112.47
{'loss': 0.7545, 'learning_rate': 8.740241494223911e-06, 'epoch': 0.55}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14334
total_samples=16848, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:23,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.08 | bwd_microstep: 1915.68 | bwd_inner_microstep: 1908.92 | bwd_allreduce_microstep: 6.65 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11579
total_samples=16851, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:26,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.68 | bwd_microstep: 1947.47 | bwd_inner_microstep: 1728.49 | bwd_allreduce_microstep: 218.92 | step_microstep: 0.19
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12451
total_samples=16855, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:29,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.29 | bwd_microstep: 2021.96 | bwd_inner_microstep: 1834.64 | bwd_allreduce_microstep: 187.25 | step_microstep: 0.85
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15253
total_samples=16860, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:32,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 05:07:32,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.85 | bwd_microstep: 1960.78 | bwd_inner_microstep: 1791.77 | bwd_allreduce_microstep: 168.94 | step_microstep: 137.07
[2025-08-03 05:07:32,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.83 | bwd: 7845.94 | bwd_inner: 7263.84 | bwd_allreduce: 581.82 | step: 138.42
{'loss': 0.7548, 'learning_rate': 8.724178387288202e-06, 'epoch': 0.55}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15510
total_samples=16865, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:34,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.24 | bwd_microstep: 1779.90 | bwd_inner_microstep: 1717.02 | bwd_allreduce_microstep: 62.81 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13319
total_samples=16869, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:37,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.74 | bwd_microstep: 1912.39 | bwd_inner_microstep: 1818.88 | bwd_allreduce_microstep: 93.44 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12011
total_samples=16872, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:39,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.34 | bwd_microstep: 1765.88 | bwd_inner_microstep: 1560.73 | bwd_allreduce_microstep: 205.09 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12184
total_samples=16875, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:42,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 05:07:42,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.60 | bwd_microstep: 2222.66 | bwd_inner_microstep: 1954.48 | bwd_allreduce_microstep: 268.11 | step_microstep: 108.62
[2025-08-03 05:07:42,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.85 | bwd: 7680.88 | bwd_inner: 7051.10 | bwd_allreduce: 629.53 | step: 109.03
{'loss': 0.7478, 'learning_rate': 8.708118626045939e-06, 'epoch': 0.56}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15425
total_samples=16879, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:45,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.94 | bwd_microstep: 1822.68 | bwd_inner_microstep: 1764.43 | bwd_allreduce_microstep: 58.18 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13469
total_samples=16884, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:48,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.88 | bwd_microstep: 1802.44 | bwd_inner_microstep: 1710.16 | bwd_allreduce_microstep: 92.22 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13181
total_samples=16888, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:50,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.42 | bwd_microstep: 1938.84 | bwd_inner_microstep: 1888.63 | bwd_allreduce_microstep: 50.13 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12037
total_samples=16891, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:53,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 05:07:53,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.52 | bwd_microstep: 2122.28 | bwd_inner_microstep: 1932.21 | bwd_allreduce_microstep: 189.99 | step_microstep: 113.56
[2025-08-03 05:07:53,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.70 | bwd: 7686.28 | bwd_inner: 7295.41 | bwd_allreduce: 390.60 | step: 114.16
{'loss': 0.7509, 'learning_rate': 8.692062252611973e-06, 'epoch': 0.56}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12674
total_samples=16895, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:56,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.02 | bwd_microstep: 1816.34 | bwd_inner_microstep: 1600.34 | bwd_allreduce_microstep: 215.93 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14319
total_samples=16899, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:07:59,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.42 | bwd_microstep: 2052.85 | bwd_inner_microstep: 1882.54 | bwd_allreduce_microstep: 170.24 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12503
total_samples=16903, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:01,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.86 | bwd_microstep: 1743.50 | bwd_inner_microstep: 1579.89 | bwd_allreduce_microstep: 163.54 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11788
total_samples=16906, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:04,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 05:08:04,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.22 | bwd_microstep: 1724.17 | bwd_inner_microstep: 1527.45 | bwd_allreduce_microstep: 196.64 | step_microstep: 471.74
[2025-08-03 05:08:04,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.45 | bwd: 7336.91 | bwd_inner: 6590.21 | bwd_allreduce: 746.43 | step: 472.15
{'loss': 0.7498, 'learning_rate': 8.676009309092273e-06, 'epoch': 0.56}
55%|█████▌    | 1108/2000 [3:24:35<2:42:05, 10.90s/it]
55%|█████▌    | 1109/2000 [3:24:46<2:42:48, 10.96s/it]
56%|█████▌    | 1110/2000 [3:24:57<2:42:25, 10.95s/it]
56%|█████▌    | 1111/2000 [3:25:08<2:42:17, 10.95s/it]
56%|█████▌    | 1112/2000 [3:25:19<2:42:01, 10.95s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13550
total_samples=16910, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:07,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.90 | bwd_microstep: 2087.93 | bwd_inner_microstep: 1908.64 | bwd_allreduce_microstep: 179.23 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12199
total_samples=16913, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:10,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.22 | bwd_microstep: 1738.54 | bwd_inner_microstep: 1552.20 | bwd_allreduce_microstep: 186.27 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13204
total_samples=16917, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:12,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.84 | bwd_microstep: 1787.20 | bwd_inner_microstep: 1630.29 | bwd_allreduce_microstep: 156.84 | step_microstep: 0.45
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11625
total_samples=16920, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:15,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 05:08:15,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.87 | bwd_microstep: 1862.70 | bwd_inner_microstep: 1597.58 | bwd_allreduce_microstep: 265.05 | step_microstep: 133.18
[2025-08-03 05:08:15,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.76 | bwd: 7476.42 | bwd_inner: 6688.71 | bwd_allreduce: 787.46 | step: 133.86
{'loss': 0.7539, 'learning_rate': 8.659959837583808e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13409
total_samples=16924, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:18,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.91 | bwd_microstep: 1975.62 | bwd_inner_microstep: 1851.53 | bwd_allreduce_microstep: 124.02 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12866
total_samples=16927, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:20,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.99 | bwd_microstep: 1726.20 | bwd_inner_microstep: 1579.34 | bwd_allreduce_microstep: 146.75 | step_microstep: 1.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13649
total_samples=16931, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:23,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.21 | bwd_microstep: 1795.90 | bwd_inner_microstep: 1710.85 | bwd_allreduce_microstep: 84.98 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13542
total_samples=16936, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:26,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 05:08:26,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.34 | bwd_microstep: 1891.53 | bwd_inner_microstep: 1711.28 | bwd_allreduce_microstep: 180.18 | step_microstep: 135.56
[2025-08-03 05:08:26,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.38 | bwd: 7389.30 | bwd_inner: 6853.00 | bwd_allreduce: 536.05 | step: 136.99
{'loss': 0.7477, 'learning_rate': 8.643913880174449e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13148
total_samples=16940, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:28,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.94 | bwd_microstep: 1809.41 | bwd_inner_microstep: 1689.67 | bwd_allreduce_microstep: 119.67 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13760
total_samples=16944, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:31,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.26 | bwd_microstep: 1716.71 | bwd_inner_microstep: 1678.37 | bwd_allreduce_microstep: 38.27 | step_microstep: 0.26
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13421
total_samples=16949, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:33,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.80 | bwd_microstep: 1822.73 | bwd_inner_microstep: 1688.62 | bwd_allreduce_microstep: 134.05 | step_microstep: 0.88
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12927
total_samples=16953, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:36,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.76
[2025-08-03 05:08:36,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.97 | bwd_microstep: 1788.14 | bwd_inner_microstep: 1732.29 | bwd_allreduce_microstep: 55.78 | step_microstep: 114.58
[2025-08-03 05:08:36,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.90 | bwd: 7137.05 | bwd_inner: 6788.95 | bwd_allreduce: 347.85 | step: 115.86
{'loss': 0.7491, 'learning_rate': 8.62787147894285e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13418
total_samples=16957, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:39,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.36 | bwd_microstep: 2071.71 | bwd_inner_microstep: 1905.04 | bwd_allreduce_microstep: 166.62 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11541
total_samples=16960, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:42,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.91 | bwd_microstep: 1750.19 | bwd_inner_microstep: 1524.80 | bwd_allreduce_microstep: 225.32 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11822
total_samples=16963, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:44,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.09 | bwd_microstep: 1889.26 | bwd_inner_microstep: 1556.51 | bwd_allreduce_microstep: 332.68 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11690
total_samples=16966, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:47,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 05:08:47,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.92 | bwd_microstep: 2010.69 | bwd_inner_microstep: 1580.40 | bwd_allreduce_microstep: 430.15 | step_microstep: 110.80
[2025-08-03 05:08:47,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.21 | bwd: 7721.90 | bwd_inner: 6566.78 | bwd_allreduce: 1154.81 | step: 111.18
{'loss': 0.7402, 'learning_rate': 8.611832675958335e-06, 'epoch': 0.56}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11821
total_samples=16969, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:50,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.46 | bwd_microstep: 2024.17 | bwd_inner_microstep: 1817.89 | bwd_allreduce_microstep: 206.21 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14983
total_samples=16973, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:53,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.53 | bwd_microstep: 1824.01 | bwd_inner_microstep: 1678.80 | bwd_allreduce_microstep: 145.14 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14101
total_samples=16977, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:55,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.48 | bwd_microstep: 1753.44 | bwd_inner_microstep: 1713.31 | bwd_allreduce_microstep: 40.07 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13142
total_samples=16981, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:08:58,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 05:08:58,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.66 | bwd_microstep: 1741.17 | bwd_inner_microstep: 1662.48 | bwd_allreduce_microstep: 78.62 | step_microstep: 406.08
[2025-08-03 05:08:58,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.06 | bwd: 7342.84 | bwd_inner: 6872.48 | bwd_allreduce: 470.12 | step: 406.55
{'loss': 0.7495, 'learning_rate': 8.595797513280799e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13355
total_samples=16985, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:01,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.54 | bwd_microstep: 1817.49 | bwd_inner_microstep: 1700.33 | bwd_allreduce_microstep: 117.09 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11574
total_samples=16988, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:04,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.87 | bwd_microstep: 2276.00 | bwd_inner_microstep: 2033.27 | bwd_allreduce_microstep: 242.66 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13325
total_samples=16992, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:06,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.78 | bwd_microstep: 1824.10 | bwd_inner_microstep: 1719.52 | bwd_allreduce_microstep: 104.52 | step_microstep: 0.15
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12709
total_samples=16996, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:09,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 05:09:09,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.80 | bwd_microstep: 1927.14 | bwd_inner_microstep: 1619.71 | bwd_allreduce_microstep: 307.37 | step_microstep: 154.19
[2025-08-03 05:09:09,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.92 | bwd: 7844.78 | bwd_inner: 7072.83 | bwd_allreduce: 771.71 | step: 154.73
56%|█████▌    | 1113/2000 [3:25:30<2:40:51, 10.88s/it]
56%|█████▌    | 1114/2000 [3:25:41<2:39:48, 10.82s/it]
56%|█████▌    | 1115/2000 [3:25:51<2:37:38, 10.69s/it]
56%|█████▌    | 1116/2000 [3:26:02<2:38:38, 10.77s/it]
56%|█████▌    | 1117/2000 [3:26:13<2:38:46, 10.79s/it]
{'loss': 0.7394, 'learning_rate': 8.579766032960582e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13681
total_samples=17002, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:12,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.36 | bwd_microstep: 1782.03 | bwd_inner_microstep: 1676.14 | bwd_allreduce_microstep: 105.82 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11632
total_samples=17005, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:14,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.08 | bwd_microstep: 1929.65 | bwd_inner_microstep: 1730.89 | bwd_allreduce_microstep: 198.70 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13353
total_samples=17009, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:17,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.76 | bwd_microstep: 1815.64 | bwd_inner_microstep: 1748.43 | bwd_allreduce_microstep: 67.14 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11941
total_samples=17012, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:20,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 05:09:20,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.27 | bwd_microstep: 2215.31 | bwd_inner_microstep: 1790.57 | bwd_allreduce_microstep: 424.63 | step_microstep: 109.26
[2025-08-03 05:09:20,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.40 | bwd: 7742.69 | bwd_inner: 6946.04 | bwd_allreduce: 796.36 | step: 109.80
{'loss': 0.7505, 'learning_rate': 8.563738277038376e-06, 'epoch': 0.56}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12296
total_samples=17015, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:23,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.47 | bwd_microstep: 1991.25 | bwd_inner_microstep: 1773.90 | bwd_allreduce_microstep: 217.27 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12464
total_samples=17018, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:26,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.76 | bwd_microstep: 2079.03 | bwd_inner_microstep: 1876.04 | bwd_allreduce_microstep: 202.93 | step_microstep: 0.79
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12073
total_samples=17021, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:28,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.42 | bwd_microstep: 1924.92 | bwd_inner_microstep: 1759.76 | bwd_allreduce_microstep: 165.08 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13197
total_samples=17025, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:31,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.17
[2025-08-03 05:09:31,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.30 | bwd_microstep: 1716.80 | bwd_inner_microstep: 1625.22 | bwd_allreduce_microstep: 91.49 | step_microstep: 153.36
[2025-08-03 05:09:31,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.88 | bwd: 7712.05 | bwd_inner: 7034.91 | bwd_allreduce: 676.86 | step: 154.46
{'loss': 0.7543, 'learning_rate': 8.5477142875451e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13405
total_samples=17029, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:34,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.48 | bwd_microstep: 1834.37 | bwd_inner_microstep: 1728.22 | bwd_allreduce_microstep: 106.08 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11897
total_samples=17032, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:36,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.13 | bwd_microstep: 1813.21 | bwd_inner_microstep: 1583.98 | bwd_allreduce_microstep: 229.16 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14097
total_samples=17036, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:39,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.60 | bwd_microstep: 2079.78 | bwd_inner_microstep: 1937.85 | bwd_allreduce_microstep: 141.86 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11614
total_samples=17039, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:42,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.89
[2025-08-03 05:09:42,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.77 | bwd_microstep: 1857.87 | bwd_inner_microstep: 1540.50 | bwd_allreduce_microstep: 317.28 | step_microstep: 540.53
[2025-08-03 05:09:42,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.92 | bwd: 7585.29 | bwd_inner: 6790.54 | bwd_allreduce: 794.47 | step: 541.00
{'loss': 0.7494, 'learning_rate': 8.531694106501796e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14929
total_samples=17043, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:45,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.83 | bwd_microstep: 1769.64 | bwd_inner_microstep: 1745.25 | bwd_allreduce_microstep: 24.31 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11846
total_samples=17046, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:48,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.64 | bwd_microstep: 2042.01 | bwd_inner_microstep: 1969.47 | bwd_allreduce_microstep: 72.48 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13529
total_samples=17050, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:50,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.14 | bwd_microstep: 1939.50 | bwd_inner_microstep: 1739.66 | bwd_allreduce_microstep: 199.77 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=17053, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:53,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69
[2025-08-03 05:09:53,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.40 | bwd_microstep: 1713.91 | bwd_inner_microstep: 1537.87 | bwd_allreduce_microstep: 175.98 | step_microstep: 125.10
[2025-08-03 05:09:53,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.95 | bwd: 7465.11 | bwd_inner: 6992.25 | bwd_allreduce: 472.62 | step: 125.75
{'loss': 0.7513, 'learning_rate': 8.515677775919528e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14031
total_samples=17057, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:56,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.56 | bwd_microstep: 1971.93 | bwd_inner_microstep: 1702.96 | bwd_allreduce_microstep: 268.90 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11846
total_samples=17060, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:09:58,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.75 | bwd_microstep: 2059.51 | bwd_inner_microstep: 1900.58 | bwd_allreduce_microstep: 158.88 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11575
total_samples=17063, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:02,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.52 | bwd_microstep: 2384.53 | bwd_inner_microstep: 2151.13 | bwd_allreduce_microstep: 233.32 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11537
total_samples=17066, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:04,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 05:10:04,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.25 | bwd_microstep: 1802.77 | bwd_inner_microstep: 1553.80 | bwd_allreduce_microstep: 248.91 | step_microstep: 118.08
[2025-08-03 05:10:04,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.01 | bwd: 8218.79 | bwd_inner: 7308.46 | bwd_allreduce: 910.09 | step: 118.76
56%|█████▌    | 1118/2000 [3:26:24<2:40:00, 10.89s/it]
56%|█████▌    | 1119/2000 [3:26:35<2:39:54, 10.89s/it]
56%|█████▌    | 1120/2000 [3:26:46<2:40:10, 10.92s/it]
56%|█████▌    | 1121/2000 [3:26:57<2:41:15, 11.01s/it]
56%|█████▌    | 1122/2000 [3:27:08<2:39:40, 10.91s/it]
56%|█████▌    | 1123/2000 [3:27:19<2:41:47, 11.07s/it]
{'loss': 0.7639, 'learning_rate': 8.499665337799254e-06, 'epoch': 0.56}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13051
total_samples=17070, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:07,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.98 | bwd_microstep: 1812.09 | bwd_inner_microstep: 1684.35 | bwd_allreduce_microstep: 127.68 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12275
total_samples=17073, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:10,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.32 | bwd_microstep: 1825.78 | bwd_inner_microstep: 1567.96 | bwd_allreduce_microstep: 257.76 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13107
total_samples=17076, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:12,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.80 | bwd_microstep: 2060.29 | bwd_inner_microstep: 1702.10 | bwd_allreduce_microstep: 358.11 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13385
total_samples=17080, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:15,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 05:10:15,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.24 | bwd_microstep: 1973.82 | bwd_inner_microstep: 1901.85 | bwd_allreduce_microstep: 71.90 | step_microstep: 141.09
[2025-08-03 05:10:15,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2861.27 | bwd: 7672.03 | bwd_inner: 6856.26 | bwd_allreduce: 815.52 | step: 141.60
{'loss': 0.7425, 'learning_rate': 8.48365683413172e-06, 'epoch': 0.56}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15978
total_samples=17084, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:18,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.79 | bwd_microstep: 2079.49 | bwd_inner_microstep: 1817.75 | bwd_allreduce_microstep: 261.68 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12078
total_samples=17087, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:21,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.44 | bwd_microstep: 1804.16 | bwd_inner_microstep: 1579.52 | bwd_allreduce_microstep: 224.58 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13834
total_samples=17091, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:23,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.04 | bwd_microstep: 1947.54 | bwd_inner_microstep: 1906.63 | bwd_allreduce_microstep: 40.83 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13635
total_samples=17095, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:26,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 05:10:26,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.95 | bwd_microstep: 1989.13 | bwd_inner_microstep: 1753.87 | bwd_allreduce_microstep: 235.18 | step_microstep: 140.82
[2025-08-03 05:10:26,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.15 | bwd: 7820.38 | bwd_inner: 7057.77 | bwd_allreduce: 762.36 | step: 141.49
{'loss': 0.7485, 'learning_rate': 8.46765230689737e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13690
total_samples=17099, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:29,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.48 | bwd_microstep: 1864.08 | bwd_inner_microstep: 1727.45 | bwd_allreduce_microstep: 136.57 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12474
total_samples=17104, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:32,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.68 | bwd_microstep: 1767.54 | bwd_inner_microstep: 1620.18 | bwd_allreduce_microstep: 147.28 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 15316
total_samples=17108, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:34,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.77 | bwd_microstep: 1972.10 | bwd_inner_microstep: 1777.10 | bwd_allreduce_microstep: 194.93 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13192
total_samples=17112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:37,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 05:10:37,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.46 | bwd_microstep: 1811.17 | bwd_inner_microstep: 1694.09 | bwd_allreduce_microstep: 117.02 | step_microstep: 113.40
[2025-08-03 05:10:37,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.28 | bwd: 7414.95 | bwd_inner: 6818.82 | bwd_allreduce: 595.89 | step: 113.81
{'loss': 0.7469, 'learning_rate': 8.451651798066203e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13733
total_samples=17116, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:40,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.06 | bwd_microstep: 1916.59 | bwd_inner_microstep: 1831.34 | bwd_allreduce_microstep: 85.19 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938
total_samples=17119, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:42,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.50 | bwd_microstep: 1701.19 | bwd_inner_microstep: 1541.43 | bwd_allreduce_microstep: 159.70 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446
total_samples=17123, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:45,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.26 | bwd_microstep: 1800.60 | bwd_inner_microstep: 1705.09 | bwd_allreduce_microstep: 95.45 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13298
total_samples=17127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:47,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.02
[2025-08-03 05:10:47,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.11 | bwd_microstep: 1784.70 | bwd_inner_microstep: 1665.81 | bwd_allreduce_microstep: 118.81 | step_microstep: 142.76
[2025-08-03 05:10:47,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.86 | bwd: 7203.13 | bwd_inner: 6743.66 | bwd_allreduce: 459.23 | step: 143.26
{'loss': 0.7526, 'learning_rate': 8.43565534959769e-06, 'epoch': 0.56}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12139
total_samples=17130, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:51,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.83 | bwd_microstep: 2423.53 | bwd_inner_microstep: 2198.84 | bwd_allreduce_microstep: 224.61 | step_microstep: 0.90
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13450
total_samples=17134, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:53,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.44 | bwd_microstep: 1846.46 | bwd_inner_microstep: 1714.54 | bwd_allreduce_microstep: 131.86 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13244
total_samples=17138, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:56,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.93 | bwd_microstep: 1855.53 | bwd_inner_microstep: 1719.94 | bwd_allreduce_microstep: 135.51 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12003
total_samples=17141, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:10:59,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 05:10:59,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.48 | bwd_microstep: 2107.14 | bwd_inner_microstep: 1890.08 | bwd_allreduce_microstep: 217.00 | step_microstep: 132.25
[2025-08-03 05:10:59,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.61 | bwd: 8232.74 | bwd_inner: 7523.41 | bwd_allreduce: 709.05 | step: 133.42
{'loss': 0.7537, 'learning_rate': 8.419663003440657e-06, 'epoch': 0.56}
56%|█████▌    | 1124/2000 [3:27:30<2:41:11, 11.04s/it]
56%|█████▋    | 1125/2000 [3:27:41<2:41:14, 11.06s/it]
56%|█████▋    | 1126/2000 [3:27:52<2:39:12, 10.93s/it]
56%|█████▋    | 1127/2000 [3:28:02<2:37:05, 10.80s/it]
56%|█████▋    | 1128/2000 [3:28:14<2:39:44, 10.99s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12204
total_samples=17144, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:02,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.59 | bwd_microstep: 2513.69 | bwd_inner_microstep: 2482.06 | bwd_allreduce_microstep: 31.55 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13713
total_samples=17148, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:05,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.34 | bwd_microstep: 1948.09 | bwd_inner_microstep: 1853.97 | bwd_allreduce_microstep: 94.05 | step_microstep: 0.77
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11769
total_samples=17151, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:08,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.30 | bwd_microstep: 2061.89 | bwd_inner_microstep: 1766.79 | bwd_allreduce_microstep: 295.02 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13247
total_samples=17155, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:11,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.94
[2025-08-03 05:11:11,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.56 | bwd_microstep: 1809.00 | bwd_inner_microstep: 1696.27 | bwd_allreduce_microstep: 112.65 | step_microstep: 157.93
[2025-08-03 05:11:11,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.72 | bwd: 8332.74 | bwd_inner: 7799.08 | bwd_allreduce: 533.38 | step: 159.02
{'loss': 0.7508, 'learning_rate': 8.40367480153316e-06, 'epoch': 0.56}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13221
total_samples=17159, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:14,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.08 | bwd_microstep: 2153.24 | bwd_inner_microstep: 1941.67 | bwd_allreduce_microstep: 211.51 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12045
total_samples=17162, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:16,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.16 | bwd_microstep: 1789.27 | bwd_inner_microstep: 1566.49 | bwd_allreduce_microstep: 222.71 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12182
total_samples=17165, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:19,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.50 | bwd_microstep: 1781.44 | bwd_inner_microstep: 1603.83 | bwd_allreduce_microstep: 177.54 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11887
total_samples=17168, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:21,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.08
[2025-08-03 05:11:21,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.20 | bwd_microstep: 1719.16 | bwd_inner_microstep: 1534.50 | bwd_allreduce_microstep: 184.59 | step_microstep: 127.61
[2025-08-03 05:11:21,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.85 | bwd: 7443.16 | bwd_inner: 6646.47 | bwd_allreduce: 796.42 | step: 128.16
{'loss': 0.7458, 'learning_rate': 8.387690785802403e-06, 'epoch': 0.56}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11862
total_samples=17171, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:24,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.54 | bwd_microstep: 1793.10 | bwd_inner_microstep: 1573.85 | bwd_allreduce_microstep: 219.17 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13252
total_samples=17175, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:26,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.19 | bwd_microstep: 1827.24 | bwd_inner_microstep: 1697.04 | bwd_allreduce_microstep: 130.12 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11958
total_samples=17178, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:29,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.64 | bwd_microstep: 1841.14 | bwd_inner_microstep: 1558.83 | bwd_allreduce_microstep: 282.24 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12998
total_samples=17182, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:32,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 05:11:32,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.83 | bwd_microstep: 1975.98 | bwd_inner_microstep: 1803.21 | bwd_allreduce_microstep: 172.69 | step_microstep: 111.49
[2025-08-03 05:11:32,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.12 | bwd: 7437.51 | bwd_inner: 6632.94 | bwd_allreduce: 804.32 | step: 111.94
{'loss': 0.7496, 'learning_rate': 8.371710998164595e-06, 'epoch': 0.57}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12053
total_samples=17185, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:35,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.84 | bwd_microstep: 2210.12 | bwd_inner_microstep: 2043.35 | bwd_allreduce_microstep: 166.71 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11713
total_samples=17188, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:38,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.90 | bwd_microstep: 1868.49 | bwd_inner_microstep: 1737.75 | bwd_allreduce_microstep: 130.67 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11652
total_samples=17191, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:40,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.96 | bwd_microstep: 1942.02 | bwd_inner_microstep: 1758.80 | bwd_allreduce_microstep: 183.15 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11786
total_samples=17194, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:43,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.94
[2025-08-03 05:11:43,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.40 | bwd_microstep: 1881.26 | bwd_inner_microstep: 1739.43 | bwd_allreduce_microstep: 141.75 | step_microstep: 131.93
[2025-08-03 05:11:43,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.02 | bwd: 7901.94 | bwd_inner: 7279.33 | bwd_allreduce: 622.36 | step: 132.26
{'loss': 0.7516, 'learning_rate': 8.355735480524874e-06, 'epoch': 0.57}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14691
total_samples=17198, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:46,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.21 | bwd_microstep: 1845.88 | bwd_inner_microstep: 1750.26 | bwd_allreduce_microstep: 95.56 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12370
total_samples=17202, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:48,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.18 | bwd_microstep: 1727.93 | bwd_inner_microstep: 1558.94 | bwd_allreduce_microstep: 168.92 | step_microstep: 0.13
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 16280
total_samples=17206, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:51,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.16 | bwd_microstep: 1989.57 | bwd_inner_microstep: 1755.07 | bwd_allreduce_microstep: 234.43 | step_microstep: 0.14
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12412
total_samples=17210, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:54,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 05:11:54,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.01 | bwd_microstep: 1982.92 | bwd_inner_microstep: 1569.57 | bwd_allreduce_microstep: 413.28 | step_microstep: 129.44
[2025-08-03 05:11:54,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.49 | bwd: 7546.33 | bwd_inner: 6633.83 | bwd_allreduce: 912.26 | step: 129.83
{'loss': 0.7416, 'learning_rate': 8.339764274777165e-06, 'epoch': 0.57}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13146
total_samples=17214, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:56,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.87 | bwd_microstep: 1827.46 | bwd_inner_microstep: 1702.32 | bwd_allreduce_microstep: 125.08 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11948
total_samples=17217, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:11:59,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.38 | bwd_microstep: 1753.57 | bwd_inner_microstep: 1550.67 | bwd_allreduce_microstep: 202.83 | step_microstep: 0.33
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 16344
total_samples=17220, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:02,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.88 | bwd_microstep: 1829.30 | bwd_inner_microstep: 1776.82 | bwd_allreduce_microstep: 52.42 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13271
total_samples=17224, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:04,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 05:12:04,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.29 | bwd_microstep: 1960.84 | bwd_inner_microstep: 1903.32 | bwd_allreduce_microstep: 57.46 | step_microstep: 114.06
[2025-08-03 05:12:04,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.35 | bwd: 7371.23 | bwd_inner: 6933.12 | bwd_allreduce: 437.87 | step: 114.62
56%|█████▋    | 1129/2000 [3:28:25<2:42:12, 11.17s/it]
56%|█████▋    | 1130/2000 [3:28:36<2:39:54, 11.03s/it]
57%|█████▋    | 1131/2000 [3:28:47<2:38:14, 10.93s/it]
57%|█████▋    | 1132/2000 [3:28:58<2:39:03, 11.00s/it]
57%|█████▋    | 1133/2000 [3:29:09<2:38:00, 10.94s/it]
{'loss': 0.7566, 'learning_rate': 8.3237974228041e-06, 'epoch': 0.57}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13793
total_samples=17228, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:07,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.25 | bwd_microstep: 1816.04 | bwd_inner_microstep: 1720.57 | bwd_allreduce_microstep: 95.41 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11562
total_samples=17231, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:10,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.85 | bwd_microstep: 1714.53 | bwd_inner_microstep: 1535.35 | bwd_allreduce_microstep: 179.13 | step_microstep: 0.12
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11320
total_samples=17234, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:12,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.53 | bwd_microstep: 2057.33 | bwd_inner_microstep: 1785.04 | bwd_allreduce_microstep: 272.23 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12899
total_samples=17238, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:15,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 05:12:15,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.84 | bwd_microstep: 2141.73 | bwd_inner_microstep: 1884.02 | bwd_allreduce_microstep: 257.64 | step_microstep: 112.02
[2025-08-03 05:12:15,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.40 | bwd: 7729.68 | bwd_inner: 6924.96 | bwd_allreduce: 804.49 | step: 112.54
{'loss': 0.7423, 'learning_rate': 8.307834966476885e-06, 'epoch': 0.57}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11948
total_samples=17241, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:18,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.28 | bwd_microstep: 1760.36 | bwd_inner_microstep: 1550.06 | bwd_allreduce_microstep: 210.22 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13555
total_samples=17246, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:21,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.66 | bwd_microstep: 1864.90 | bwd_inner_microstep: 1718.70 | bwd_allreduce_microstep: 146.13 | step_microstep: 0.13
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11487
total_samples=17249, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:23,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.02 | bwd_microstep: 1710.66 | bwd_inner_microstep: 1507.05 | bwd_allreduce_microstep: 203.52 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13083
total_samples=17253, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:26,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58
[2025-08-03 05:12:26,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.94 | bwd_microstep: 1952.14 | bwd_inner_microstep: 1685.94 | bwd_allreduce_microstep: 266.14 | step_microstep: 171.55
[2025-08-03 05:12:26,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.83 | bwd: 7288.11 | bwd_inner: 6461.75 | bwd_allreduce: 826.10 | step: 171.96
{'loss': 0.7459, 'learning_rate': 8.291876947655197e-06, 'epoch': 0.57}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13462
total_samples=17257, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:29,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.99 | bwd_microstep: 2006.68 | bwd_inner_microstep: 1737.31 | bwd_allreduce_microstep: 269.29 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15947
total_samples=17262, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:31,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.60 | bwd_microstep: 1816.65 | bwd_inner_microstep: 1743.62 | bwd_allreduce_microstep: 72.96 | step_microstep: 0.12
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11214
total_samples=17265, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:34,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.30 | bwd_microstep: 2010.84 | bwd_inner_microstep: 1776.45 | bwd_allreduce_microstep: 234.33 | step_microstep: 0.18
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13364
total_samples=17269, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:37,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.20
[2025-08-03 05:12:37,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.10 | bwd_microstep: 1751.25 | bwd_inner_microstep: 1646.53 | bwd_allreduce_microstep: 104.64 | step_microstep: 141.02
[2025-08-03 05:12:37,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.92 | bwd: 7585.46 | bwd_inner: 6903.90 | bwd_allreduce: 681.32 | step: 141.45
{'loss': 0.747, 'learning_rate': 8.275923408187086e-06, 'epoch': 0.57}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12800
total_samples=17273, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:40,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.46 | bwd_microstep: 2035.61 | bwd_inner_microstep: 1627.91 | bwd_allreduce_microstep: 407.64 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12089
total_samples=17276, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:43,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.83 | bwd_microstep: 2166.81 | bwd_inner_microstep: 1942.27 | bwd_allreduce_microstep: 224.46 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13215
total_samples=17280, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:45,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.77 | bwd_microstep: 2119.74 | bwd_inner_microstep: 2003.90 | bwd_allreduce_microstep: 115.76 | step_microstep: 0.32
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14078
total_samples=17284, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:48,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 05:12:48,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.39 | bwd_microstep: 1973.00 | bwd_inner_microstep: 1785.23 | bwd_allreduce_microstep: 187.70 | step_microstep: 126.93
[2025-08-03 05:12:48,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.39 | bwd: 8295.21 | bwd_inner: 7359.31 | bwd_allreduce: 935.65 | step: 127.62
{'loss': 0.7524, 'learning_rate': 8.259974389908842e-06, 'epoch': 0.57}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13707
total_samples=17288, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:51,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.38 | bwd_microstep: 2057.98 | bwd_inner_microstep: 1895.98 | bwd_allreduce_microstep: 161.94 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12056
total_samples=17291, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:54,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.76 | bwd_microstep: 1859.48 | bwd_inner_microstep: 1615.07 | bwd_allreduce_microstep: 244.34 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13211
total_samples=17295, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:57,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.64 | bwd_microstep: 1874.70 | bwd_inner_microstep: 1831.80 | bwd_allreduce_microstep: 42.84 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12926
total_samples=17299, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:12:59,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 05:12:59,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.85 | bwd_microstep: 1846.32 | bwd_inner_microstep: 1679.49 | bwd_allreduce_microstep: 166.76 | step_microstep: 108.96
[2025-08-03 05:12:59,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.56 | bwd: 7638.54 | bwd_inner: 7022.33 | bwd_allreduce: 615.97 | step: 109.45
 57%|█████▋    | 1134/2000 [3:29:19<2:36:21, 10.83s/it]
 57%|█████▋    | 1135/2000 [3:29:30<2:36:47, 10.88s/it]
 57%|█████▋    | 1136/2000 [3:29:41<2:35:14, 10.78s/it]
 57%|█████▋    | 1137/2000 [3:29:52<2:35:25, 10.81s/it]
 57%|█████▋    | 1138/2000 [3:30:03<2:38:20, 11.02s/it]
 57%|█████▋    | 1139/2000 [3:30:14<2:37:45, 10.99s/it]
{'loss': 0.7534, 'learning_rate': 8.244029934644916e-06, 'epoch': 0.57}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13463
total_samples=17303, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:02,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.64 | bwd_microstep: 1745.47 | bwd_inner_microstep: 1671.90 | bwd_allreduce_microstep: 73.49 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12167
total_samples=17306, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:05,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.94 | bwd_microstep: 2066.69 | bwd_inner_microstep: 1849.63 | bwd_allreduce_microstep: 217.00 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13752
total_samples=17310, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:07,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.45 | bwd_microstep: 1817.09 | bwd_inner_microstep: 1729.94 | bwd_allreduce_microstep: 87.08 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809
total_samples=17313, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:10,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 05:13:10,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.71 | bwd_microstep: 2138.87 | bwd_inner_microstep: 1910.83 | bwd_allreduce_microstep: 227.97 | step_microstep: 130.01
[2025-08-03 05:13:10,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2879.66 | bwd: 7768.18 | bwd_inner: 7162.30 | bwd_allreduce: 605.64 | step: 130.54
{'loss': 0.7475, 'learning_rate': 8.228090084207773e-06, 'epoch': 0.57}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13524
total_samples=17317, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:13,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.10 | bwd_microstep: 1817.43 | bwd_inner_microstep: 1709.19 | bwd_allreduce_microstep: 108.17 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11761
total_samples=17320, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:16,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.98 | bwd_microstep: 1794.85 | bwd_inner_microstep: 1575.97 | bwd_allreduce_microstep: 218.81 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13528
total_samples=17324, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:18,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.47 | bwd_microstep: 1757.27 | bwd_inner_microstep: 1688.80 | bwd_allreduce_microstep: 68.41 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13651
total_samples=17328, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:21,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.25
[2025-08-03 05:13:21,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.62 | bwd_microstep: 1733.11 | bwd_inner_microstep: 1681.51 | bwd_allreduce_microstep: 51.53 | step_microstep: 138.78
[2025-08-03 05:13:21,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.08 | bwd: 7102.73 | bwd_inner: 6655.47 | bwd_allreduce: 447.01 | step: 139.15
{'loss': 0.7536, 'learning_rate': 8.212154880397817e-06, 'epoch': 0.57}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13658
total_samples=17332, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:24,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.18 | bwd_microstep: 2082.92 | bwd_inner_microstep: 2076.44 | bwd_allreduce_microstep: 6.41 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13501
total_samples=17336, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:26,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.41 | bwd_microstep: 2046.96 | bwd_inner_microstep: 1913.53 | bwd_allreduce_microstep: 133.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15245
total_samples=17341, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:29,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.41 | bwd_microstep: 1871.23 | bwd_inner_microstep: 1773.93 | bwd_allreduce_microstep: 97.23 | step_microstep: 0.93
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13669
total_samples=17345, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:32,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 05:13:32,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.67 | bwd_microstep: 2015.43 | bwd_inner_microstep: 1872.60 | bwd_allreduce_microstep: 142.76 | step_microstep: 136.35
[2025-08-03 05:13:32,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.60 | bwd: 8016.59 | bwd_inner: 7636.49 | bwd_allreduce: 379.85 | step: 137.57
{'loss': 0.7499, 'learning_rate': 8.196224365003267e-06, 'epoch': 0.57}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12918
total_samples=17349, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:35,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.25 | bwd_microstep: 1838.17 | bwd_inner_microstep: 1692.96 | bwd_allreduce_microstep: 145.14 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13375
total_samples=17353, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:37,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.13 | bwd_microstep: 1982.39 | bwd_inner_microstep: 1886.02 | bwd_allreduce_microstep: 96.30 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14338
total_samples=17357, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:40,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.79 | bwd_microstep: 2026.86 | bwd_inner_microstep: 1892.70 | bwd_allreduce_microstep: 134.10 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13065
total_samples=17361, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:43,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 05:13:43,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 919.36 | bwd_microstep: 2097.70 | bwd_inner_microstep: 1951.44 | bwd_allreduce_microstep: 146.18 | step_microstep: 109.92
[2025-08-03 05:13:43,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3034.47 | bwd: 7945.18 | bwd_inner: 7423.11 | bwd_allreduce: 521.80 | step: 110.42
{'loss': 0.7536, 'learning_rate': 8.180298579800034e-06, 'epoch': 0.57}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12679
total_samples=17365, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:46,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.11 | bwd_microstep: 1912.47 | bwd_inner_microstep: 1837.85 | bwd_allreduce_microstep: 74.55 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14882
total_samples=17369, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:49,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.91 | bwd_microstep: 1781.37 | bwd_inner_microstep: 1748.98 | bwd_allreduce_microstep: 32.32 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13377
total_samples=17374, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:51,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.87 | bwd_microstep: 1747.64 | bwd_inner_microstep: 1682.63 | bwd_allreduce_microstep: 64.95 | step_microstep: 0.24
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12460
total_samples=17378, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:54,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 05:13:54,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.73 | bwd_microstep: 2040.27 | bwd_inner_microstep: 1875.90 | bwd_allreduce_microstep: 164.31 | step_microstep: 111.60
[2025-08-03 05:13:54,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.55 | bwd: 7481.82 | bwd_inner: 7145.35 | bwd_allreduce: 336.21 | step: 112.17
{'loss': 0.7474, 'learning_rate': 8.16437756655164e-06, 'epoch': 0.57}
 57%|█████▋    | 1140/2000 [3:30:25<2:37:51, 11.01s/it]
 57%|█████▋    | 1141/2000 [3:30:36<2:34:48, 10.81s/it]
 57%|█████▋    | 1142/2000 [3:30:47<2:36:34, 10.95s/it]
 57%|█████▋    | 1143/2000 [3:30:58<2:38:17, 11.08s/it]
 57%|█████▋    | 1144/2000 [3:31:09<2:36:21, 10.96s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13223
total_samples=17382, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:57,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.23 | bwd_microstep: 1717.34 | bwd_inner_microstep: 1659.27 | bwd_allreduce_microstep: 58.00 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13287
total_samples=17386, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:13:59,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.02 | bwd_microstep: 2016.01 | bwd_inner_microstep: 1886.90 | bwd_allreduce_microstep: 129.04 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14465
total_samples=17391, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:02,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.99 | bwd_microstep: 1999.84 | bwd_inner_microstep: 1792.37 | bwd_allreduce_microstep: 207.41 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11686
total_samples=17394, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:05,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 05:14:05,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.98 | bwd_microstep: 1718.33 | bwd_inner_microstep: 1525.76 | bwd_allreduce_microstep: 192.49 | step_microstep: 134.55
[2025-08-03 05:14:05,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.16 | bwd: 7451.60 | bwd_inner: 6864.31 | bwd_allreduce: 587.03 | step: 135.05
{'loss': 0.7375, 'learning_rate': 8.148461367009081e-06, 'epoch': 0.57}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13614
total_samples=17400, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:07,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.77 | bwd_microstep: 1744.04 | bwd_inner_microstep: 1690.66 | bwd_allreduce_microstep: 53.31 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13904
total_samples=17405, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:10,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.82 | bwd_microstep: 1841.82 | bwd_inner_microstep: 1804.28 | bwd_allreduce_microstep: 37.47 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13283
total_samples=17409, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:12,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.27 | bwd_microstep: 1749.59 | bwd_inner_microstep: 1706.37 | bwd_allreduce_microstep: 43.15 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12401
total_samples=17412, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:15,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 05:14:15,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.73 | bwd_microstep: 2028.03 | bwd_inner_microstep: 1808.59 | bwd_allreduce_microstep: 219.37 | step_microstep: 116.33
[2025-08-03 05:14:15,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.53 | bwd: 7363.53 | bwd_inner: 7009.91 | bwd_allreduce: 353.37 | step: 116.97
{'loss': 0.7389, 'learning_rate': 8.132550022910737e-06, 'epoch': 0.57}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13228
total_samples=17416, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:18,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.66 | bwd_microstep: 1823.46 | bwd_inner_microstep: 1765.33 | bwd_allreduce_microstep: 58.07 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14627
total_samples=17420, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:20,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.59 | bwd_microstep: 1820.82 | bwd_inner_microstep: 1744.28 | bwd_allreduce_microstep: 76.47 | step_microstep: 0.27
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13445
total_samples=17424, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:23,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.20 | bwd_microstep: 1770.92 | bwd_inner_microstep: 1666.19 | bwd_allreduce_microstep: 104.67 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11532
total_samples=17427, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:26,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 05:14:26,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.37 | bwd_microstep: 1778.41 | bwd_inner_microstep: 1540.87 | bwd_allreduce_microstep: 237.48 | step_microstep: 114.77
[2025-08-03 05:14:26,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.74 | bwd: 7193.67 | bwd_inner: 6716.66 | bwd_allreduce: 476.77 | step: 115.32
{'loss': 0.748, 'learning_rate': 8.116643575982254e-06, 'epoch': 0.57}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13491
total_samples=17431, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:28,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.98 | bwd_microstep: 1727.34 | bwd_inner_microstep: 1673.81 | bwd_allreduce_microstep: 53.47 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13316
total_samples=17435, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:31,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.54 | bwd_microstep: 1819.04 | bwd_inner_microstep: 1707.71 | bwd_allreduce_microstep: 111.26 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13229
total_samples=17439, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:33,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.19 | bwd_microstep: 1821.50 | bwd_inner_microstep: 1715.97 | bwd_allreduce_microstep: 105.48 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11811
total_samples=17442, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:36,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 05:14:36,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.33 | bwd_microstep: 2166.37 | bwd_inner_microstep: 1697.65 | bwd_allreduce_microstep: 468.65 | step_microstep: 108.25
[2025-08-03 05:14:36,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.97 | bwd: 7534.31 | bwd_inner: 6795.14 | bwd_allreduce: 738.94 | step: 108.73
{'loss': 0.7511, 'learning_rate': 8.100742067936432e-06, 'epoch': 0.57}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12569
total_samples=17445, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:39,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.86 | bwd_microstep: 2055.35 | bwd_inner_microstep: 1823.63 | bwd_allreduce_microstep: 231.65 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14135
total_samples=17449, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:42,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.76 | bwd_microstep: 1822.43 | bwd_inner_microstep: 1738.76 | bwd_allreduce_microstep: 83.61 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13728
total_samples=17453, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:45,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.20 | bwd_microstep: 2146.89 | bwd_inner_microstep: 2081.55 | bwd_allreduce_microstep: 65.27 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11925
total_samples=17456, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:48,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 05:14:48,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.53 | bwd_microstep: 2054.16 | bwd_inner_microstep: 1822.09 | bwd_allreduce_microstep: 232.01 | step_microstep: 119.14
[2025-08-03 05:14:48,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.27 | bwd: 8078.88 | bwd_inner: 7466.03 | bwd_allreduce: 612.62 | step: 119.61
{'loss': 0.7472, 'learning_rate': 8.084845540473127e-06, 'epoch': 0.57}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12439
total_samples=17459, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:50,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.98 | bwd_microstep: 1842.38 | bwd_inner_microstep: 1703.12 | bwd_allreduce_microstep: 139.19 | step_microstep: 0.15
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12273
total_samples=17463, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:53,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.41 | bwd_microstep: 1820.93 | bwd_inner_microstep: 1591.04 | bwd_allreduce_microstep: 229.82 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770
total_samples=17466, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:56,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.99 | bwd_microstep: 1764.88 | bwd_inner_microstep: 1540.94 | bwd_allreduce_microstep: 223.87 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13534
total_samples=17470, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:14:58,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.99
[2025-08-03 05:14:58,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.30 | bwd_microstep: 1879.97 | bwd_inner_microstep: 1675.14 | bwd_allreduce_microstep: 204.72 | step_microstep: 142.36
[2025-08-03 05:14:58,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.62 | bwd: 7308.21 | bwd_inner: 6510.24 | bwd_allreduce: 797.71 | step: 142.90
 57%|█████▋    | 1145/2000 [3:31:20<2:34:53, 10.87s/it]
 57%|█████▋    | 1146/2000 [3:31:30<2:33:20, 10.77s/it]
 57%|█████▋    | 1147/2000 [3:31:41<2:31:44, 10.67s/it]
 57%|█████▋    | 1148/2000 [3:31:51<2:31:57, 10.70s/it]
 57%|█████▋    | 1149/2000 [3:32:03<2:34:18, 10.88s/it]
{'loss': 0.7465, 'learning_rate': 8.068954035279121e-06, 'epoch': 0.57}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13611
total_samples=17474, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:01,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.77 | bwd_microstep: 1801.15 | bwd_inner_microstep: 1718.17 | bwd_allreduce_microstep: 82.91 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12388
total_samples=17477, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:04,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.95 | bwd_microstep: 2137.22 | bwd_inner_microstep: 1844.06 | bwd_allreduce_microstep: 293.08 | step_microstep: 0.34
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13709
total_samples=17481, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:06,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.90 | bwd_microstep: 1825.17 | bwd_inner_microstep: 1732.30 | bwd_allreduce_microstep: 92.80 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11597
total_samples=17484, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:09,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 05:15:09,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.61 | bwd_microstep: 2046.25 | bwd_inner_microstep: 1813.71 | bwd_allreduce_microstep: 232.46 | step_microstep: 112.45
[2025-08-03 05:15:09,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.15 | bwd: 7809.85 | bwd_inner: 7108.24 | bwd_allreduce: 701.35 | step: 113.02
{'loss': 0.7445, 'learning_rate': 8.053067594028044e-06, 'epoch': 0.58}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13555
total_samples=17488, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:12,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.64 | bwd_microstep: 1985.58 | bwd_inner_microstep: 1854.50 | bwd_allreduce_microstep: 131.01 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13686
total_samples=17492, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:15,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.49 | bwd_microstep: 1855.11 | bwd_inner_microstep: 1739.18 | bwd_allreduce_microstep: 115.86 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11596
total_samples=17495, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:18,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.80 | bwd_microstep: 2044.69 | bwd_inner_microstep: 1830.39 | bwd_allreduce_microstep: 214.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13489
total_samples=17499, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:21,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 05:15:21,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.00 | bwd_microstep: 1978.78 | bwd_inner_microstep: 1741.85 | bwd_allreduce_microstep: 236.86 | step_microstep: 118.16
[2025-08-03 05:15:21,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2877.86 | bwd: 7864.21 | bwd_inner: 7165.92 | bwd_allreduce: 698.05 | step: 118.66
{'loss': 0.7362, 'learning_rate': 8.037186258380226e-06, 'epoch': 0.58}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13837
total_samples=17503, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:23,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.21 | bwd_microstep: 2173.66 | bwd_inner_microstep: 2048.54 | bwd_allreduce_microstep: 125.06 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13190
total_samples=17507, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:26,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.81 | bwd_microstep: 1809.61 | bwd_inner_microstep: 1704.92 | bwd_allreduce_microstep: 104.62 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14912
total_samples=17511, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:29,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.30 | bwd_microstep: 1817.04 | bwd_inner_microstep: 1744.86 | bwd_allreduce_microstep: 72.10 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13729
total_samples=17516, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:31,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 05:15:31,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.69 | bwd_microstep: 1825.77 | bwd_inner_microstep: 1728.25 | bwd_allreduce_microstep: 97.45 | step_microstep: 137.44
[2025-08-03 05:15:31,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.93 | bwd: 7626.13 | bwd_inner: 7226.56 | bwd_allreduce: 399.33 | step: 137.87
{'loss': 0.7475, 'learning_rate': 8.021310069982624e-06, 'epoch': 0.58}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12905
total_samples=17520, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:34,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.04 | bwd_microstep: 1807.18 | bwd_inner_microstep: 1688.67 | bwd_allreduce_microstep: 118.44 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14908
total_samples=17524, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:37,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.78 | bwd_microstep: 1803.50 | bwd_inner_microstep: 1763.54 | bwd_allreduce_microstep: 39.88 | step_microstep: 0.20
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12069
total_samples=17528, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:39,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.64 | bwd_microstep: 1799.62 | bwd_inner_microstep: 1568.88 | bwd_allreduce_microstep: 230.68 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13207
total_samples=17532, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:42,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 05:15:42,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.57 | bwd_microstep: 1765.56 | bwd_inner_microstep: 1688.46 | bwd_allreduce_microstep: 77.04 | step_microstep: 120.42
[2025-08-03 05:15:42,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.96 | bwd: 7175.92 | bwd_inner: 6709.55 | bwd_allreduce: 466.13 | step: 120.98
{'loss': 0.7634, 'learning_rate': 8.005439070468692e-06, 'epoch': 0.58}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13046
total_samples=17536, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:45,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.60 | bwd_microstep: 2053.44 | bwd_inner_microstep: 1869.13 | bwd_allreduce_microstep: 184.26 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14994
total_samples=17540, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:47,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.10 | bwd_microstep: 1839.36 | bwd_inner_microstep: 1770.51 | bwd_allreduce_microstep: 68.78 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11751
total_samples=17543, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:50,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.51 | bwd_microstep: 2185.35 | bwd_inner_microstep: 2059.64 | bwd_allreduce_microstep: 125.66 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13302
total_samples=17547, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:53,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 05:15:53,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.94 | bwd_microstep: 1827.05 | bwd_inner_microstep: 1696.29 | bwd_allreduce_microstep: 130.69 | step_microstep: 160.10
[2025-08-03 05:15:53,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.09 | bwd: 7905.25 | bwd_inner: 7395.57 | bwd_allreduce: 509.45 | step: 160.57
 57%|█████▊    | 1150/2000 [3:32:13<2:32:49, 10.79s/it]
 58%|█████▊    | 1151/2000 [3:32:24<2:33:50, 10.87s/it]
 58%|█████▊    | 1152/2000 [3:32:35<2:34:51, 10.96s/it]
 58%|█████▊    | 1153/2000 [3:32:46<2:34:14, 10.93s/it]
 58%|█████▊    | 1154/2000 [3:32:57<2:32:01, 10.78s/it]
 58%|█████▊    | 1155/2000 [3:33:08<2:33:26, 10.89s/it]
{'loss': 0.7532, 'learning_rate': 7.989573301458274e-06, 'epoch': 0.58}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12020
total_samples=17550, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:56,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.64 | bwd_microstep: 1972.69 | bwd_inner_microstep: 1759.60 | bwd_allreduce_microstep: 213.02 | step_microstep: 0.17
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14874
total_samples=17554, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:15:58,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.95 | bwd_microstep: 1881.58 | bwd_inner_microstep: 1827.99 | bwd_allreduce_microstep: 53.52 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12384
total_samples=17557, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:01,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.61 | bwd_microstep: 1817.56 | bwd_inner_microstep: 1588.43 | bwd_allreduce_microstep: 229.06 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13929
total_samples=17561, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:04,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58
[2025-08-03 05:16:04,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.63 | bwd_microstep: 1922.10 | bwd_inner_microstep: 1764.52 | bwd_allreduce_microstep: 157.51 | step_microstep: 123.39
[2025-08-03 05:16:04,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.76 | bwd: 7594.00 | bwd_inner: 6940.54 | bwd_allreduce: 653.20 | step: 123.92
{'loss': 0.747, 'learning_rate': 7.9737128045575e-06, 'epoch': 0.58}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14048
total_samples=17565, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:07,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.65 | bwd_microstep: 1947.28 | bwd_inner_microstep: 1884.34 | bwd_allreduce_microstep: 62.86 | step_microstep: 0.95
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13041
total_samples=17569, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:10,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.64 | bwd_microstep: 2094.97 | bwd_inner_microstep: 1960.68 | bwd_allreduce_microstep: 134.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13206
total_samples=17573, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:12,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.37 | bwd_microstep: 1726.37 | bwd_inner_microstep: 1670.40 | bwd_allreduce_microstep: 55.91 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13298
total_samples=17577, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:15,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 05:16:15,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.90 | bwd_microstep: 1851.63 | bwd_inner_microstep: 1719.68 | bwd_allreduce_microstep: 131.88 | step_microstep: 134.57
[2025-08-03 05:16:15,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.50 | bwd: 7620.32 | bwd_inner: 7235.10 | bwd_allreduce: 384.96 | step: 135.78
{'loss': 0.741, 'learning_rate': 7.957857621358674e-06, 'epoch': 0.58}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14675
total_samples=17581, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:18,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.41 | bwd_microstep: 2051.44 | bwd_inner_microstep: 1934.67 | bwd_allreduce_microstep: 116.70 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15623
total_samples=17585, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:20,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.77 | bwd_microstep: 1918.40 | bwd_inner_microstep: 1824.63 | bwd_allreduce_microstep: 93.71 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13198
total_samples=17589, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:23,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.45 | bwd_microstep: 1779.18 | bwd_inner_microstep: 1674.99 | bwd_allreduce_microstep: 104.13 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13166
total_samples=17593, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:26,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 05:16:26,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.67 | bwd_microstep: 1968.58 | bwd_inner_microstep: 1856.19 | bwd_allreduce_microstep: 112.32 | step_microstep: 115.88
[2025-08-03 05:16:26,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.22 | bwd: 7717.66 | bwd_inner: 7290.47 | bwd_allreduce: 426.94 | step: 116.46
{'loss': 0.7645, 'learning_rate': 7.942007793440165e-06, 'epoch': 0.58}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13748
total_samples=17597, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:28,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.87 | bwd_microstep: 1782.55 | bwd_inner_microstep: 1710.44 | bwd_allreduce_microstep: 72.05 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14171
total_samples=17602, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:31,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.47 | bwd_microstep: 2160.57 | bwd_inner_microstep: 1947.42 | bwd_allreduce_microstep: 213.07 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15255
total_samples=17607, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:34,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.99 | bwd_microstep: 2207.13 | bwd_inner_microstep: 2011.55 | bwd_allreduce_microstep: 195.50 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13392
total_samples=17611, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:37,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 05:16:37,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.69 | bwd_microstep: 1800.34 | bwd_inner_microstep: 1702.80 | bwd_allreduce_microstep: 97.46 | step_microstep: 122.86
[2025-08-03 05:16:37,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.93 | bwd: 7950.65 | bwd_inner: 7372.22 | bwd_allreduce: 578.16 | step: 123.60
{'loss': 0.7521, 'learning_rate': 7.9261633623663e-06, 'epoch': 0.58}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13911
total_samples=17615, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:40,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.53 | bwd_microstep: 1896.98 | bwd_inner_microstep: 1846.52 | bwd_allreduce_microstep: 50.39 | step_microstep: 0.81
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13521
total_samples=17619, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:42,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.05 | bwd_microstep: 2014.77 | bwd_inner_microstep: 1984.43 | bwd_allreduce_microstep: 30.27 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11664
total_samples=17622, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:45,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.14 | bwd_microstep: 1784.87 | bwd_inner_microstep: 1548.05 | bwd_allreduce_microstep: 236.73 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14136
total_samples=17626, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:48,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 05:16:48,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.81 | bwd_microstep: 1821.36 | bwd_inner_microstep: 1738.22 | bwd_allreduce_microstep: 83.07 | step_microstep: 113.05
[2025-08-03 05:16:48,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.45 | bwd: 7518.03 | bwd_inner: 7117.22 | bwd_allreduce: 400.56 | step: 114.19
{'loss': 0.753, 'learning_rate': 7.91032436968725e-06, 'epoch': 0.58}
 58%|█████▊    | 1156/2000 [3:33:19<2:33:12, 10.89s/it]
 58%|█████▊    | 1157/2000 [3:33:30<2:33:01, 10.89s/it]
 58%|█████▊    | 1158/2000 [3:33:41<2:33:16, 10.92s/it]
 58%|█████▊    | 1159/2000 [3:33:52<2:34:23, 11.01s/it]
 58%|█████▊    | 1160/2000 [3:34:03<2:32:55, 10.92s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13746
total_samples=17630, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:50,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.10 | bwd_microstep: 2015.20 | bwd_inner_microstep: 1982.92 | bwd_allreduce_microstep: 32.20 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12248
total_samples=17634, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:53,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.42 | bwd_microstep: 1797.79 | bwd_inner_microstep: 1593.72 | bwd_allreduce_microstep: 204.01 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13396
total_samples=17638, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:56,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.38 | bwd_microstep: 1755.53 | bwd_inner_microstep: 1686.98 | bwd_allreduce_microstep: 68.47 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13933
total_samples=17642, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:16:58,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.83
[2025-08-03 05:16:58,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.08 | bwd_microstep: 1726.64 | bwd_inner_microstep: 1700.32 | bwd_allreduce_microstep: 26.25 | step_microstep: 134.42
[2025-08-03 05:16:58,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.91 | bwd: 7295.22 | bwd_inner: 6963.93 | bwd_allreduce: 331.02 | step: 134.89
{'loss': 0.7512, 'learning_rate': 7.894490856938931e-06, 'epoch': 0.58}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11924
total_samples=17645, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:01,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.08 | bwd_microstep: 1803.82 | bwd_inner_microstep: 1596.02 | bwd_allreduce_microstep: 207.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13527
total_samples=17649, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:04,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.12 | bwd_microstep: 2161.74 | bwd_inner_microstep: 2102.05 | bwd_allreduce_microstep: 59.61 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13693
total_samples=17653, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:07,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.55 | bwd_microstep: 2124.68 | bwd_inner_microstep: 1817.42 | bwd_allreduce_microstep: 307.20 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13230
total_samples=17657, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:09,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 05:17:09,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.36 | bwd_microstep: 1942.97 | bwd_inner_microstep: 1905.64 | bwd_allreduce_microstep: 37.28 | step_microstep: 116.54
[2025-08-03 05:17:09,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.03 | bwd: 8033.26 | bwd_inner: 7421.12 | bwd_allreduce: 611.91 | step: 117.07
{'loss': 0.7419, 'learning_rate': 7.87866286564288e-06, 'epoch': 0.58}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13239
total_samples=17661, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:12,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.08 | bwd_microstep: 1773.16 | bwd_inner_microstep: 1681.09 | bwd_allreduce_microstep: 92.00 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11686
total_samples=17664, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:15,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.00 | bwd_microstep: 2325.94 | bwd_inner_microstep: 2019.70 | bwd_allreduce_microstep: 306.18 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14009
total_samples=17669, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:18,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.93 | bwd_microstep: 1833.08 | bwd_inner_microstep: 1741.11 | bwd_allreduce_microstep: 91.89 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13867
total_samples=17673, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:21,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 05:17:21,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.36 | bwd_microstep: 1986.01 | bwd_inner_microstep: 1892.88 | bwd_allreduce_microstep: 93.06 | step_microstep: 126.73
[2025-08-03 05:17:21,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.29 | bwd: 7918.25 | bwd_inner: 7334.78 | bwd_allreduce: 583.22 | step: 127.38
{'loss': 0.7344, 'learning_rate': 7.862840437306165e-06, 'epoch': 0.58}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13473
total_samples=17677, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:23,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.93 | bwd_microstep: 1688.18 | bwd_inner_microstep: 1652.44 | bwd_allreduce_microstep: 35.68 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14050
total_samples=17681, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:26,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.69 | bwd_microstep: 2047.96 | bwd_inner_microstep: 1760.38 | bwd_allreduce_microstep: 287.50 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13981
total_samples=17686, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:29,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.16 | bwd_microstep: 1772.50 | bwd_inner_microstep: 1714.84 | bwd_allreduce_microstep: 57.61 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13348
total_samples=17691, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:31,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 05:17:31,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.85 | bwd_microstep: 1862.61 | bwd_inner_microstep: 1805.47 | bwd_allreduce_microstep: 57.07 | step_microstep: 113.75
[2025-08-03 05:17:31,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.57 | bwd: 7371.30 | bwd_inner: 6933.12 | bwd_allreduce: 437.93 | step: 114.35
{'loss': 0.7438, 'learning_rate': 7.847023613421251e-06, 'epoch': 0.58}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12743
total_samples=17695, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:34,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.12 | bwd_microstep: 1891.13 | bwd_inner_microstep: 1660.42 | bwd_allreduce_microstep: 230.65 | step_microstep: 0.22
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12473
total_samples=17699, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:37,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.76 | bwd_microstep: 1772.41 | bwd_inner_microstep: 1586.12 | bwd_allreduce_microstep: 186.23 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13429
total_samples=17704, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:39,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.24 | bwd_microstep: 1835.01 | bwd_inner_microstep: 1780.94 | bwd_allreduce_microstep: 53.99 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13673
total_samples=17708, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:42,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 05:17:42,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.93 | bwd_microstep: 2045.05 | bwd_inner_microstep: 1749.55 | bwd_allreduce_microstep: 295.44 | step_microstep: 129.14
[2025-08-03 05:17:42,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.97 | bwd: 7543.66 | bwd_inner: 6777.03 | bwd_allreduce: 766.38 | step: 129.73
{'loss': 0.743, 'learning_rate': 7.831212435465925e-06, 'epoch': 0.58}
 58%|█████▊    | 1161/2000 [3:34:13<2:30:59, 10.80s/it]
 58%|█████▊    | 1162/2000 [3:34:24<2:32:49, 10.94s/it]
 58%|█████▊    | 1163/2000 [3:34:36<2:33:45, 11.02s/it]
 58%|█████▊    | 1164/2000 [3:34:46<2:31:52, 10.90s/it]
 58%|█████▊    | 1165/2000 [3:34:57<2:31:15, 10.87s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11829
total_samples=17711, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:45,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.63 | bwd_microstep: 2076.96 | bwd_inner_microstep: 1674.10 | bwd_allreduce_microstep: 402.80 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12071
total_samples=17714, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:48,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.45 | bwd_microstep: 2081.36 | bwd_inner_microstep: 1620.10 | bwd_allreduce_microstep: 461.15 | step_microstep: 0.25
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13085
total_samples=17718, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:51,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.05 | bwd_microstep: 2112.34 | bwd_inner_microstep: 1865.65 | bwd_allreduce_microstep: 246.62 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13494
total_samples=17722, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:54,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.12
[2025-08-03 05:17:54,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.57 | bwd_microstep: 1768.12 | bwd_inner_microstep: 1681.29 | bwd_allreduce_microstep: 86.77 | step_microstep: 400.71
[2025-08-03 05:17:54,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.63 | bwd: 8038.83 | bwd_inner: 6841.16 | bwd_allreduce: 1197.41 | step: 401.22
{'loss': 0.758, 'learning_rate': 7.815406944903148e-06, 'epoch': 0.58}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13667
total_samples=17726, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:56,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.30 | bwd_microstep: 1832.63 | bwd_inner_microstep: 1644.26 | bwd_allreduce_microstep: 188.31 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11796
total_samples=17729, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:17:59,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.78 | bwd_microstep: 1739.95 | bwd_inner_microstep: 1537.55 | bwd_allreduce_microstep: 202.33 | step_microstep: 0.14
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12633
total_samples=17733, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:01,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.18 | bwd_microstep: 1786.24 | bwd_inner_microstep: 1622.02 | bwd_allreduce_microstep: 164.15 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13492
total_samples=17737, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:05,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 05:18:05,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.22 | bwd_microstep: 2563.63 | bwd_inner_microstep: 2511.08 | bwd_allreduce_microstep: 52.49 | step_microstep: 106.85
[2025-08-03 05:18:05,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.40 | bwd: 7922.51 | bwd_inner: 7314.90 | bwd_allreduce: 607.36 | step: 107.38
{'loss': 0.7565, 'learning_rate': 7.799607183180981e-06, 'epoch': 0.58}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12063
total_samples=17740, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:07,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.38 | bwd_microstep: 1768.76 | bwd_inner_microstep: 1555.62 | bwd_allreduce_microstep: 213.07 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11795
total_samples=17743, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:10,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.25 | bwd_microstep: 1819.43 | bwd_inner_microstep: 1607.00 | bwd_allreduce_microstep: 212.36 | step_microstep: 0.22
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 16351
total_samples=17748, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:12,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.23 | bwd_microstep: 1793.55 | bwd_inner_microstep: 1754.74 | bwd_allreduce_microstep: 38.75 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13332
total_samples=17752, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:15,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 05:18:15,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.82 | bwd_microstep: 2038.29 | bwd_inner_microstep: 1894.91 | bwd_allreduce_microstep: 143.31 | step_microstep: 126.41
[2025-08-03 05:18:15,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.61 | bwd: 7420.09 | bwd_inner: 6812.27 | bwd_allreduce: 607.57 | step: 126.90
{'loss': 0.7445, 'learning_rate': 7.78381319173246e-06, 'epoch': 0.58}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12481
total_samples=17756, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:18,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.18 | bwd_microstep: 2071.65 | bwd_inner_microstep: 2064.53 | bwd_allreduce_microstep: 7.06 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12407
total_samples=17759, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:21,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.88 | bwd_microstep: 2116.59 | bwd_inner_microstep: 1885.17 | bwd_allreduce_microstep: 231.36 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13308
total_samples=17763, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:24,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.08 | bwd_microstep: 1863.40 | bwd_inner_microstep: 1714.08 | bwd_allreduce_microstep: 149.26 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13175
total_samples=17767, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:26,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 05:18:26,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.80 | bwd_microstep: 1752.80 | bwd_inner_microstep: 1672.08 | bwd_allreduce_microstep: 80.65 | step_microstep: 133.39
[2025-08-03 05:18:26,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.86 | bwd: 7804.50 | bwd_inner: 7335.87 | bwd_allreduce: 468.40 | step: 133.84
{'loss': 0.7427, 'learning_rate': 7.768025011975481e-06, 'epoch': 0.58}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938
total_samples=17770, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:29,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.07 | bwd_microstep: 1691.73 | bwd_inner_microstep: 1532.10 | bwd_allreduce_microstep: 159.54 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14173
total_samples=17775, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:32,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.51 | bwd_microstep: 1846.03 | bwd_inner_microstep: 1733.43 | bwd_allreduce_microstep: 112.54 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13497
total_samples=17779, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:34,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.77 | bwd_microstep: 1773.97 | bwd_inner_microstep: 1702.68 | bwd_allreduce_microstep: 71.22 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15469
total_samples=17786, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:37,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.97
[2025-08-03 05:18:37,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.39 | bwd_microstep: 1756.89 | bwd_inner_microstep: 1749.99 | bwd_allreduce_microstep: 6.83 | step_microstep: 145.30
[2025-08-03 05:18:37,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.65 | bwd: 7068.67 | bwd_inner: 6718.19 | bwd_allreduce: 350.22 | step: 145.83
{'loss': 0.7231, 'learning_rate': 7.752242685312709e-06, 'epoch': 0.58}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11818
total_samples=17789, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:39,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.21 | bwd_microstep: 1773.34 | bwd_inner_microstep: 1570.12 | bwd_allreduce_microstep: 203.16 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13340
total_samples=17793, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:42,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.81 | bwd_microstep: 2149.29 | bwd_inner_microstep: 1935.48 | bwd_allreduce_microstep: 213.73 | step_microstep: 0.31
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13317
total_samples=17797, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:45,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 660.23 | bwd_microstep: 1849.19 | bwd_inner_microstep: 1710.21 | bwd_allreduce_microstep: 138.91 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13103
total_samples=17801, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:48,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 05:18:48,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.62 | bwd_microstep: 2123.79 | bwd_inner_microstep: 1880.39 | bwd_allreduce_microstep: 243.33 | step_microstep: 109.12
[2025-08-03 05:18:48,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.80 | bwd: 7895.67 | bwd_inner: 7096.19 | bwd_allreduce: 799.22 | step: 109.65
s/it]
 58%|█████▊    | 1166/2000 [3:35:08<2:33:44, 11.06s/it]
 58%|█████▊    | 1166/2000 [3:35:09<2:33:44, 11.06s/it]
 58%|█████▊    | 1167/2000 [3:35:20<2:33:50, 11.08s/it]
 58%|█████▊    | 1168/2000 [3:35:30<2:31:47, 10.95s/it]
 58%|█████▊    | 1169/2000 [3:35:41<2:32:05, 10.98s/it]
 58%|█████▊    | 1170/2000 [3:35:52<2:29:06, 10.78s/it]
{'loss': 0.7452, 'learning_rate': 7.736466253131451e-06, 'epoch': 0.59}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12829
total_samples=17805, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:51,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.09 | bwd_microstep: 2012.71 | bwd_inner_microstep: 1917.05 | bwd_allreduce_microstep: 95.59 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13735
total_samples=17809, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:53,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.80 | bwd_microstep: 1782.33 | bwd_inner_microstep: 1718.12 | bwd_allreduce_microstep: 64.14 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12906
total_samples=17813, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:56,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.58 | bwd_microstep: 1848.45 | bwd_inner_microstep: 1671.65 | bwd_allreduce_microstep: 176.73 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15309
total_samples=17817, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:18:59,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.81
[2025-08-03 05:18:59,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.11 | bwd_microstep: 1785.43 | bwd_inner_microstep: 1765.79 | bwd_allreduce_microstep: 19.58 | step_microstep: 126.42
[2025-08-03 05:18:59,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2862.51 | bwd: 7428.97 | bwd_inner: 7072.61 | bwd_allreduce: 356.13 | step: 126.90
{'loss': 0.7508, 'learning_rate': 7.720695756803569e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13108
total_samples=17821, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:02,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.32 | bwd_microstep: 2280.04 | bwd_inner_microstep: 2075.99 | bwd_allreduce_microstep: 203.98 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13778
total_samples=17825, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:05,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.32 | bwd_microstep: 2332.12 | bwd_inner_microstep: 2045.80 | bwd_allreduce_microstep: 286.26 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13416
total_samples=17829, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:08,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.60 | bwd_microstep: 2038.88 | bwd_inner_microstep: 1910.82 | bwd_allreduce_microstep: 127.99 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13940
total_samples=17833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:10,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 05:19:10,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.28 | bwd_microstep: 1799.34 | bwd_inner_microstep: 1727.00 | bwd_allreduce_microstep: 72.27 | step_microstep: 123.05
[2025-08-03 05:19:10,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.44 | bwd: 8450.44 | bwd_inner: 7759.60 | bwd_allreduce: 690.60 | step: 123.72
{'loss': 0.7384, 'learning_rate': 7.704931237685342e-06, 'epoch': 0.59}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11866
total_samples=17837, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:13,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.05 | bwd_microstep: 1766.43 | bwd_inner_microstep: 1527.92 | bwd_allreduce_microstep: 238.44 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13800
total_samples=17841, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:16,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.64 | bwd_microstep: 2178.00 | bwd_inner_microstep: 2083.94 | bwd_allreduce_microstep: 93.99 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13561
total_samples=17845, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:18,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.59 | bwd_microstep: 1816.69 | bwd_inner_microstep: 1717.60 | bwd_allreduce_microstep: 99.02 | step_microstep: 0.25
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 15917
total_samples=17849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:21,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 33.60
[2025-08-03 05:19:21,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.95 | bwd_microstep: 1775.30 | bwd_inner_microstep: 1717.93 | bwd_allreduce_microstep: 57.30 | step_microstep: 142.13
[2025-08-03 05:19:21,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.15 | bwd: 7536.47 | bwd_inner: 7047.38 | bwd_allreduce: 488.84 | step: 142.92
{'loss': 0.7443, 'learning_rate': 7.689172737117389e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332
total_samples=17853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:24,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.56 | bwd_microstep: 2241.66 | bwd_inner_microstep: 2149.36 | bwd_allreduce_microstep: 92.23 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13833
total_samples=17857, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:27,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.94 | bwd_microstep: 1820.48 | bwd_inner_microstep: 1722.68 | bwd_allreduce_microstep: 97.73 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13087
total_samples=17861, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:30,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.22 | bwd_microstep: 2054.22 | bwd_inner_microstep: 1953.92 | bwd_allreduce_microstep: 100.24 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11533
total_samples=17864, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:33,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 05:19:33,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.28 | bwd_microstep: 2146.37 | bwd_inner_microstep: 2135.24 | bwd_allreduce_microstep: 11.07 | step_microstep: 120.76
[2025-08-03 05:19:33,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.94 | bwd: 8262.78 | bwd_inner: 7961.18 | bwd_allreduce: 301.35 | step: 121.32
{'loss': 0.749, 'learning_rate': 7.673420296424541e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13970
total_samples=17868, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:35,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.57 | bwd_microstep: 1849.51 | bwd_inner_microstep: 1723.49 | bwd_allreduce_microstep: 125.95 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15122
total_samples=17873, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:38,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.51 | bwd_microstep: 1769.21 | bwd_inner_microstep: 1730.61 | bwd_allreduce_microstep: 38.53 | step_microstep: 0.17
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14119
total_samples=17877, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:40,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.82 | bwd_microstep: 1758.08 | bwd_inner_microstep: 1701.54 | bwd_allreduce_microstep: 56.47 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12916
total_samples=17881, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:43,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 05:19:43,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.83 | bwd_microstep: 1906.33 | bwd_inner_microstep: 1794.74 | bwd_allreduce_microstep: 111.52 | step_microstep: 132.70
[2025-08-03 05:19:43,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2866.66 | bwd: 7283.18 | bwd_inner: 6950.38 | bwd_allreduce: 332.56 | step: 133.11
 59%|█████▊    | 1171/2000 [3:36:03<2:30:12, 10.87s/it]
 59%|█████▊    | 1172/2000 [3:36:13<2:29:23, 10.83s/it]
 59%|█████▊    | 1173/2000 [3:36:25<2:32:46, 11.08s/it]
 59%|█████▊    | 1174/2000 [3:36:36<2:31:28, 11.00s/it]
 59%|█████▉    | 1175/2000 [3:36:47<2:33:25, 11.16s/it]
 59%|█████▉    | 1176/2000 [3:36:58<2:30:59, 10.99s/it]
{'loss': 0.7484, 'learning_rate': 7.657673956915735e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13952
total_samples=17885, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:46,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.98 | bwd_microstep: 1855.13 | bwd_inner_microstep: 1722.27 | bwd_allreduce_microstep: 132.79 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11762
total_samples=17888, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:48,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.82 | bwd_microstep: 1809.67 | bwd_inner_microstep: 1571.50 | bwd_allreduce_microstep: 238.10 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13166
total_samples=17892, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:51,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.94 | bwd_microstep: 1843.00 | bwd_inner_microstep: 1725.49 | bwd_allreduce_microstep: 117.44 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11777
total_samples=17895, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:54,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 05:19:54,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.73 | bwd_microstep: 1848.84 | bwd_inner_microstep: 1616.41 | bwd_allreduce_microstep: 232.37 | step_microstep: 138.57
[2025-08-03 05:19:54,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.39 | bwd: 7356.69 | bwd_inner: 6635.66 | bwd_allreduce: 720.78 | step: 139.09
{'loss': 0.7419, 'learning_rate': 7.641933759883913e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13495
total_samples=17899, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:56,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 1826.16 | bwd_inner_microstep: 1690.65 | bwd_allreduce_microstep: 135.44 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14499
total_samples=17903, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:19:59,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.05 | bwd_microstep: 1867.58 | bwd_inner_microstep: 1772.96 | bwd_allreduce_microstep: 94.55 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13206
total_samples=17907, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:02,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.83 | bwd_microstep: 2019.33 | bwd_inner_microstep: 1723.74 | bwd_allreduce_microstep: 295.51 | step_microstep: 0.14
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12497
total_samples=17911, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:05,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.92
[2025-08-03 05:20:05,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.96 | bwd_microstep: 1721.59 | bwd_inner_microstep: 1577.00 | bwd_allreduce_microstep: 144.52 | step_microstep: 160.05
[2025-08-03 05:20:05,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.82 | bwd: 7434.72 | bwd_inner: 6764.36 | bwd_allreduce: 670.10 | step: 160.58
{'loss': 0.7499, 'learning_rate': 7.6261997466059035e-06, 'epoch': 0.59}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13627
total_samples=17915, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:07,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.88 | bwd_microstep: 1988.18 | bwd_inner_microstep: 1896.98 | bwd_allreduce_microstep: 91.14 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12659
total_samples=17919, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:10,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.19 | bwd_microstep: 1718.50 | bwd_inner_microstep: 1624.53 | bwd_allreduce_microstep: 93.91 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13811
total_samples=17924, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:13,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.44 | bwd_microstep: 2241.93 | bwd_inner_microstep: 2042.17 | bwd_allreduce_microstep: 199.70 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11913
total_samples=17927, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:16,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 05:20:16,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.54 | bwd_microstep: 1977.64 | bwd_inner_microstep: 1563.40 | bwd_allreduce_microstep: 414.18 | step_microstep: 142.91
[2025-08-03 05:20:16,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.98 | bwd: 7926.31 | bwd_inner: 7127.08 | bwd_allreduce: 799.00 | step: 143.41
{'loss': 0.7383, 'learning_rate': 7.610471958342326e-06, 'epoch': 0.59}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13038
total_samples=17931, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:19,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.57 | bwd_microstep: 2342.19 | bwd_inner_microstep: 2026.23 | bwd_allreduce_microstep: 315.89 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13221
total_samples=17935, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:22,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.20 | bwd_microstep: 2002.02 | bwd_inner_microstep: 1866.15 | bwd_allreduce_microstep: 135.81 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14596
total_samples=17942, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:24,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.83 | bwd_microstep: 1802.66 | bwd_inner_microstep: 1761.72 | bwd_allreduce_microstep: 40.87 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12178
total_samples=17945, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:27,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61
[2025-08-03 05:20:27,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.64 | bwd_microstep: 1945.60 | bwd_inner_microstep: 1736.68 | bwd_allreduce_microstep: 208.86 | step_microstep: 129.13
[2025-08-03 05:20:27,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.17 | bwd: 8092.53 | bwd_inner: 7390.77 | bwd_allreduce: 701.50 | step: 129.63
{'loss': 0.7438, 'learning_rate': 7.594750436337467e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13453
total_samples=17949, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:30,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.39 | bwd_microstep: 2016.99 | bwd_inner_microstep: 1757.60 | bwd_allreduce_microstep: 259.32 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12510
total_samples=17953, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:33,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.96 | bwd_microstep: 1878.35 | bwd_inner_microstep: 1778.33 | bwd_allreduce_microstep: 99.96 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13111
total_samples=17957, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:35,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.17 | bwd_microstep: 1752.60 | bwd_inner_microstep: 1664.52 | bwd_allreduce_microstep: 88.02 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11765
total_samples=17960, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:38,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 05:20:38,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1083.29 | bwd_microstep: 1928.39 | bwd_inner_microstep: 1604.80 | bwd_allreduce_microstep: 323.52 | step_microstep: 112.12
[2025-08-03 05:20:38,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3204.73 | bwd: 7576.39 | bwd_inner: 6805.24 | bwd_allreduce: 770.90 | step: 112.60
{'loss': 0.7568, 'learning_rate': 7.579035221819188e-06, 'epoch': 0.59}
 59%|█████▉    | 1177/2000 [3:37:09<2:29:17, 10.88s/it]
 59%|█████▉    | 1178/2000 [3:37:19<2:28:29, 10.84s/it]
 59%|█████▉    | 1179/2000 [3:37:31<2:29:45, 10.95s/it]
 59%|█████▉    | 1180/2000 [3:37:42<2:31:11, 11.06s/it]
 59%|█████▉    | 1181/2000 [3:37:53<2:31:31, 11.10s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885
total_samples=17963, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:41,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.90 | bwd_microstep: 1927.07 | bwd_inner_microstep: 1768.87 | bwd_allreduce_microstep: 158.13 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11717
total_samples=17966, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:44,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.29 | bwd_microstep: 2048.90 | bwd_inner_microstep: 1881.66 | bwd_allreduce_microstep: 167.17 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12940
total_samples=17970, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:46,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.12 | bwd_microstep: 1767.68 | bwd_inner_microstep: 1641.33 | bwd_allreduce_microstep: 126.27 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11535
total_samples=17973, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:49,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 05:20:49,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.82 | bwd_microstep: 2073.51 | bwd_inner_microstep: 2067.41 | bwd_allreduce_microstep: 6.04 | step_microstep: 135.05
[2025-08-03 05:20:49,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2751.06 | bwd: 7817.21 | bwd_inner: 7359.26 | bwd_allreduce: 457.69 | step: 135.56
{'loss': 0.75, 'learning_rate': 7.5633263559988035e-06, 'epoch': 0.59}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12002
total_samples=17976, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:52,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.92 | bwd_microstep: 1855.73 | bwd_inner_microstep: 1730.10 | bwd_allreduce_microstep: 125.56 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13405
total_samples=17980, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:54,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.99 | bwd_microstep: 1711.05 | bwd_inner_microstep: 1670.61 | bwd_allreduce_microstep: 40.37 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11861
total_samples=17983, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:20:57,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.77 | bwd_microstep: 1750.58 | bwd_inner_microstep: 1549.88 | bwd_allreduce_microstep: 200.63 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11894
total_samples=17986, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:00,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 05:21:00,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 862.27 | bwd_microstep: 1757.62 | bwd_inner_microstep: 1538.78 | bwd_allreduce_microstep: 218.78 | step_microstep: 114.15
[2025-08-03 05:21:00,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2913.88 | bwd: 7075.03 | bwd_inner: 6489.35 | bwd_allreduce: 585.42 | step: 114.65
{'loss': 0.748, 'learning_rate': 7.547623880070992e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14671
total_samples=17990, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:02,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.52 | bwd_microstep: 2026.57 | bwd_inner_microstep: 1904.12 | bwd_allreduce_microstep: 122.38 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11642
total_samples=17993, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:05,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.50 | bwd_microstep: 2023.80 | bwd_inner_microstep: 1831.77 | bwd_allreduce_microstep: 191.97 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12465
total_samples=17996, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:08,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.46 | bwd_microstep: 1794.40 | bwd_inner_microstep: 1614.62 | bwd_allreduce_microstep: 179.72 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12504
total_samples=17999, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:10,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:21:10,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.66 | bwd_microstep: 1708.84 | bwd_inner_microstep: 1556.37 | bwd_allreduce_microstep: 152.40 | step_microstep: 114.91
[2025-08-03 05:21:10,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.07 | bwd: 7553.67 | bwd_inner: 6906.87 | bwd_allreduce: 646.55 | step: 115.43
{'loss': 0.7498, 'learning_rate': 7.531927835213657e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508
total_samples=18003, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:13,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.69 | bwd_microstep: 2184.02 | bwd_inner_microstep: 1921.45 | bwd_allreduce_microstep: 262.50 | step_microstep: 0.89
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13402
total_samples=18007, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:16,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.71 | bwd_microstep: 1789.09 | bwd_inner_microstep: 1697.48 | bwd_allreduce_microstep: 91.54 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11778
total_samples=18010, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:19,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.33 | bwd_microstep: 1884.65 | bwd_inner_microstep: 1542.52 | bwd_allreduce_microstep: 342.06 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11940
total_samples=18013, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:21,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 05:21:21,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 652.03 | bwd_microstep: 1759.91 | bwd_inner_microstep: 1586.39 | bwd_allreduce_microstep: 173.45 | step_microstep: 126.37
[2025-08-03 05:21:21,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.69 | bwd: 7617.72 | bwd_inner: 6747.84 | bwd_allreduce: 869.63 | step: 127.49
{'loss': 0.7448, 'learning_rate': 7.516238262587851e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13758
total_samples=18017, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:24,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.13 | bwd_microstep: 1852.84 | bwd_inner_microstep: 1694.07 | bwd_allreduce_microstep: 158.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13590
total_samples=18021, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:27,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.08 | bwd_microstep: 2238.80 | bwd_inner_microstep: 1921.02 | bwd_allreduce_microstep: 317.71 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13465
total_samples=18025, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:30,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.15 | bwd_microstep: 1826.23 | bwd_inner_microstep: 1720.70 | bwd_allreduce_microstep: 105.47 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12077
total_samples=18028, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:33,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51
[2025-08-03 05:21:33,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.35 | bwd_microstep: 2227.35 | bwd_inner_microstep: 2004.73 | bwd_allreduce_microstep: 222.55 | step_microstep: 115.11
[2025-08-03 05:21:33,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.62 | bwd: 8145.28 | bwd_inner: 7340.52 | bwd_allreduce: 804.52 | step: 115.46
{'loss': 0.7605, 'learning_rate': 7.500555203337647e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14065
total_samples=18033, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:35,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.13 | bwd_microstep: 1701.38 | bwd_inner_microstep: 1679.98 | bwd_allreduce_microstep: 21.34 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=18037, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:38,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.51 | bwd_microstep: 1748.44 | bwd_inner_microstep: 1675.89 | bwd_allreduce_microstep: 72.47 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13437
total_samples=18041, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:40,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.00 | bwd_microstep: 1816.61 | bwd_inner_microstep: 1719.82 | bwd_allreduce_microstep: 96.71 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13498
total_samples=18045, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:43,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 05:21:43,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.41 | bwd_microstep: 1750.21 | bwd_inner_microstep: 1684.83 | bwd_allreduce_microstep: 65.32 | step_microstep: 110.66
[2025-08-03 05:21:43,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.98 | bwd: 7016.69 | bwd_inner: 6760.52 | bwd_allreduce: 255.92 | step: 111.29
59%|█████▉    | 1182/2000 [3:38:04<2:30:54, 11.07s/it]
59%|█████▉    | 1183/2000 [3:38:15<2:28:01, 10.87s/it]
59%|█████▉    | 1184/2000 [3:38:25<2:27:22, 10.84s/it]
59%|█████▉    | 1185/2000 [3:38:36<2:27:17, 10.84s/it]
59%|█████▉    | 1186/2000 [3:38:48<2:29:16, 11.00s/it]
{'loss': 0.7414, 'learning_rate': 7.48487869859004e-06, 'epoch': 0.59}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14031
total_samples=18050, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:46,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 766.90 | bwd_microstep: 1821.33 | bwd_inner_microstep: 1711.88 | bwd_allreduce_microstep: 109.39 | step_microstep: 0.10
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12555
total_samples=18055, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:48,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 748.07 | bwd_microstep: 1871.73 | bwd_inner_microstep: 1631.68 | bwd_allreduce_microstep: 239.99 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13394
total_samples=18059, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:51,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.09 | bwd_microstep: 1837.37 | bwd_inner_microstep: 1721.83 | bwd_allreduce_microstep: 115.48 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12674
total_samples=18063, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:54,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 05:21:54,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.68 | bwd_microstep: 1809.78 | bwd_inner_microstep: 1632.93 | bwd_allreduce_microstep: 176.79 | step_microstep: 143.00
[2025-08-03 05:21:54,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2935.67 | bwd: 7340.28 | bwd_inner: 6698.31 | bwd_allreduce: 641.73 | step: 143.46
{'loss': 0.7507, 'learning_rate': 7.469208789454838e-06, 'epoch': 0.59}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13620
total_samples=18066, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:56,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.29 | bwd_microstep: 2041.73 | bwd_inner_microstep: 1641.64 | bwd_allreduce_microstep: 400.02 | step_microstep: 0.26
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12612
total_samples=18070, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:21:59,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.36 | bwd_microstep: 1768.88 | bwd_inner_microstep: 1591.34 | bwd_allreduce_microstep: 177.48 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13946
total_samples=18074, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:02,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.94 | bwd_microstep: 2122.88 | bwd_inner_microstep: 1953.54 | bwd_allreduce_microstep: 169.28 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11643
total_samples=18077, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:05,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 05:22:05,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.32 | bwd_microstep: 1804.60 | bwd_inner_microstep: 1784.96 | bwd_allreduce_microstep: 19.58 | step_microstep: 110.54
[2025-08-03 05:22:05,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.86 | bwd: 7738.14 | bwd_inner: 6971.47 | bwd_allreduce: 766.43 | step: 111.02
{'loss': 0.7454, 'learning_rate': 7.4535455170245476e-06, 'epoch': 0.59}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13554
total_samples=18081, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:07,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.27 | bwd_microstep: 1738.67 | bwd_inner_microstep: 1668.94 | bwd_allreduce_microstep: 69.66 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446
total_samples=18085, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:10,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.13 | bwd_microstep: 1748.33 | bwd_inner_microstep: 1687.47 | bwd_allreduce_microstep: 60.80 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13142
total_samples=18089, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:12,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.06 | bwd_microstep: 1746.66 | bwd_inner_microstep: 1660.33 | bwd_allreduce_microstep: 86.27 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11605
total_samples=18092, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:16,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.84
[2025-08-03 05:22:16,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.42 | bwd_microstep: 2324.28 | bwd_inner_microstep: 2087.10 | bwd_allreduce_microstep: 237.12 | step_microstep: 417.05
[2025-08-03 05:22:16,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.82 | bwd: 7558.00 | bwd_inner: 7103.85 | bwd_allreduce: 453.92 | step: 417.37
{'loss': 0.7497, 'learning_rate': 7.4378889223742766e-06, 'epoch': 0.59}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12721
total_samples=18096, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:18,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.79 | bwd_microstep: 1775.52 | bwd_inner_microstep: 1632.48 | bwd_allreduce_microstep: 142.98 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12150
total_samples=18099, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:21,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.74 | bwd_microstep: 2022.25 | bwd_inner_microstep: 1802.77 | bwd_allreduce_microstep: 219.41 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11707
total_samples=18102, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:24,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.65 | bwd_microstep: 1952.54 | bwd_inner_microstep: 1775.68 | bwd_allreduce_microstep: 176.79 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11855
total_samples=18105, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:27,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.39
[2025-08-03 05:22:27,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.86 | bwd_microstep: 1852.71 | bwd_inner_microstep: 1632.08 | bwd_allreduce_microstep: 220.56 | step_microstep: 135.53
[2025-08-03 05:22:27,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.98 | bwd: 7603.07 | bwd_inner: 6843.01 | bwd_allreduce: 759.81 | step: 135.87
{'loss': 0.744, 'learning_rate': 7.422239046561619e-06, 'epoch': 0.6}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11713
total_samples=18108, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:29,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.18 | bwd_microstep: 1787.09 | bwd_inner_microstep: 1541.61 | bwd_allreduce_microstep: 245.42 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11883
total_samples=18111, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:32,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.48 | bwd_microstep: 2181.18 | bwd_inner_microstep: 1834.10 | bwd_allreduce_microstep: 347.01 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11717
total_samples=18114, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:35,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.30 | bwd_microstep: 2002.73 | bwd_inner_microstep: 1763.98 | bwd_allreduce_microstep: 238.69 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11802
total_samples=18117, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:38,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 05:22:38,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.26 | bwd_microstep: 1796.15 | bwd_inner_microstep: 1554.66 | bwd_allreduce_microstep: 241.42 | step_microstep: 111.94
[2025-08-03 05:22:38,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.14 | bwd: 7767.20 | bwd_inner: 6694.35 | bwd_allreduce: 1072.62 | step: 112.26
59%|█████▉    | 1187/2000 [3:38:58<2:26:02, 10.78s/it]
59%|█████▉    | 1188/2000 [3:39:08<2:25:34, 10.76s/it]
59%|█████▉    | 1189/2000 [3:39:19<2:26:24, 10.83s/it]
60%|█████▉    | 1190/2000 [3:39:31<2:27:09, 10.90s/it]
60%|█████▉    | 1191/2000 [3:39:41<2:26:46, 10.89s/it]
60%|█████▉    | 1192/2000 [3:39:52<2:27:01, 10.92s/it]
{'loss': 0.7434, 'learning_rate': 7.40659593062655e-06, 'epoch': 0.6}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11777
total_samples=18120, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:40,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.09 | bwd_microstep: 1762.21 | bwd_inner_microstep: 1543.98 | bwd_allreduce_microstep: 218.16 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13089
total_samples=18124, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:43,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.41 | bwd_microstep: 1994.11 | bwd_inner_microstep: 1897.45 | bwd_allreduce_microstep: 96.59 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12285
total_samples=18127, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:46,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.96 | bwd_microstep: 2238.05 | bwd_inner_microstep: 1932.97 | bwd_allreduce_microstep: 305.02 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11680
total_samples=18130, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:49,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 05:22:49,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.96 | bwd_microstep: 1954.12 | bwd_inner_microstep: 1761.34 | bwd_allreduce_microstep: 192.72 | step_microstep: 137.06
[2025-08-03 05:22:49,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2859.35 | bwd: 7948.54 | bwd_inner: 7135.73 | bwd_allreduce: 812.56 | step: 137.56
{'loss': 0.7453, 'learning_rate': 7.390959615591315e-06, 'epoch': 0.6}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11914
total_samples=18133, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:51,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.35 | bwd_microstep: 1844.02 | bwd_inner_microstep: 1586.89 | bwd_allreduce_microstep: 257.06 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14382
total_samples=18137, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:54,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.79 | bwd_microstep: 2012.19 | bwd_inner_microstep: 1904.42 | bwd_allreduce_microstep: 107.71 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13169
total_samples=18141, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:22:57,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.01 | bwd_microstep: 1805.58 | bwd_inner_microstep: 1690.66 | bwd_allreduce_microstep: 114.86 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12120
total_samples=18144, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:00,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 05:23:00,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.13 | bwd_microstep: 1848.90 | bwd_inner_microstep: 1594.35 | bwd_allreduce_microstep: 254.48 | step_microstep: 111.61
[2025-08-03 05:23:00,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2895.21 | bwd: 7510.74 | bwd_inner: 6776.30 | bwd_allreduce: 734.20 | step: 112.09
{'loss': 0.7552, 'learning_rate': 7.375330142460331e-06, 'epoch': 0.6}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13327
total_samples=18148, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:02,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.98 | bwd_microstep: 1742.54 | bwd_inner_microstep: 1670.79 | bwd_allreduce_microstep: 71.68 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14613
total_samples=18152, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:05,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.49 | bwd_microstep: 2523.60 | bwd_inner_microstep: 2457.43 | bwd_allreduce_microstep: 66.08 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14960
total_samples=18156, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:08,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.41 | bwd_microstep: 1781.70 | bwd_inner_microstep: 1744.38 | bwd_allreduce_microstep: 37.26 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13658
total_samples=18160, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:11,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.85
[2025-08-03 05:23:11,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.51 | bwd_microstep: 1862.12 | bwd_inner_microstep: 1718.23 | bwd_allreduce_microstep: 143.82 | step_microstep: 132.30
[2025-08-03 05:23:11,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.31 | bwd: 7910.00 | bwd_inner: 7590.83 | bwd_allreduce: 318.91 | step: 132.66
{'loss': 0.7514, 'learning_rate': 7.35970755222007e-06, 'epoch': 0.6}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13158
total_samples=18164, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:14,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.41 | bwd_microstep: 2065.32 | bwd_inner_microstep: 1963.97 | bwd_allreduce_microstep: 101.28 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14996
total_samples=18168, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:16,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.81 | bwd_microstep: 1939.29 | bwd_inner_microstep: 1759.41 | bwd_allreduce_microstep: 179.82 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13239
total_samples=18172, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:19,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.47 | bwd_microstep: 2191.97 | bwd_inner_microstep: 1980.29 | bwd_allreduce_microstep: 211.61 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11859
total_samples=18175, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:22,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 05:23:22,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.94 | bwd_microstep: 1811.04 | bwd_inner_microstep: 1562.97 | bwd_allreduce_microstep: 248.00 | step_microstep: 110.93
[2025-08-03 05:23:22,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2891.56 | bwd: 8007.66 | bwd_inner: 7266.64 | bwd_allreduce: 740.79 | step: 111.29
{'loss': 0.7552, 'learning_rate': 7.344091885838949e-06, 'epoch': 0.6}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12013
total_samples=18178, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:25,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.89 | bwd_microstep: 1797.82 | bwd_inner_microstep: 1583.90 | bwd_allreduce_microstep: 213.84 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13315
total_samples=18182, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:27,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.77 | bwd_microstep: 1768.13 | bwd_inner_microstep: 1690.84 | bwd_allreduce_microstep: 77.21 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13128
total_samples=18186, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:30,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.41 | bwd_microstep: 2025.59 | bwd_inner_microstep: 1735.91 | bwd_allreduce_microstep: 289.60 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11586
total_samples=18189, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:33,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 05:23:33,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.55 | bwd_microstep: 1765.07 | bwd_inner_microstep: 1558.89 | bwd_allreduce_microstep: 206.12 | step_microstep: 111.43
[2025-08-03 05:23:33,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2879.54 | bwd: 7356.66 | bwd_inner: 6569.54 | bwd_allreduce: 786.86 | step: 111.94
{'loss': 0.7482, 'learning_rate': 7.328483184267236e-06, 'epoch': 0.6}
60%|█████▉    | 1193/2000 [3:40:04<2:28:12, 11.02s/it]
60%|█████▉    | 1194/2000 [3:40:14<2:27:06, 10.95s/it]
60%|█████▉    | 1195/2000 [3:40:26<2:27:50, 11.02s/it]
60%|█████▉    | 1196/2000 [3:40:37<2:28:50, 11.11s/it]
60%|█████▉    | 1197/2000 [3:40:48<2:26:47, 10.97s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13717
total_samples=18193, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:35,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.04 | bwd_microstep: 1790.35 | bwd_inner_microstep: 1721.55 | bwd_allreduce_microstep: 68.72 | step_microstep: 0.30
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13717
total_samples=18197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:38,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.13 | bwd_microstep: 1834.02 | bwd_inner_microstep: 1712.25 | bwd_allreduce_microstep: 121.71 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11790
total_samples=18200, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:41,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.38 | bwd_microstep: 1821.20 | bwd_inner_microstep: 1599.17 | bwd_allreduce_microstep: 221.97 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13427
total_samples=18204, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:43,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 05:23:43,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.09 | bwd_microstep: 1834.25 | bwd_inner_microstep: 1734.48 | bwd_allreduce_microstep: 99.70 | step_microstep: 110.15
[2025-08-03 05:23:43,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.55 | bwd: 7279.87 | bwd_inner: 6767.45 | bwd_allreduce: 512.18 | step: 110.66
{'loss': 0.7542, 'learning_rate': 7.312881488436928e-06, 'epoch': 0.6}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13036
total_samples=18208, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:46,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.45 | bwd_microstep: 1893.65 | bwd_inner_microstep: 1839.95 | bwd_allreduce_microstep: 53.62 | step_microstep: 0.17
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12715
total_samples=18212, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:49,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.13 | bwd_microstep: 2292.70 | bwd_inner_microstep: 1974.56 | bwd_allreduce_microstep: 318.08 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13519
total_samples=18216, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:52,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.14 | bwd_microstep: 2246.96 | bwd_inner_microstep: 1949.03 | bwd_allreduce_microstep: 297.86 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11803
total_samples=18219, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:55,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 05:23:55,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.35 | bwd_microstep: 1692.71 | bwd_inner_microstep: 1523.93 | bwd_allreduce_microstep: 168.71 | step_microstep: 124.29
[2025-08-03 05:23:55,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.99 | bwd: 8126.07 | bwd_inner: 7287.46 | bwd_allreduce: 838.36 | step: 124.83
{'loss': 0.7439, 'learning_rate': 7.297286839261659e-06, 'epoch': 0.6}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13418
total_samples=18223, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:23:57,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.70 | bwd_microstep: 1714.70 | bwd_inner_microstep: 1658.32 | bwd_allreduce_microstep: 56.30 | step_microstep: 0.79
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12142
total_samples=18227, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:00,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.22 | bwd_microstep: 2134.02 | bwd_inner_microstep: 1761.60 | bwd_allreduce_microstep: 372.35 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11594
total_samples=18230, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:03,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.15 | bwd_microstep: 2002.54 | bwd_inner_microstep: 1695.56 | bwd_allreduce_microstep: 306.89 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13276
total_samples=18234, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:05,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 05:24:05,960] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.94 | bwd_microstep: 1771.09 | bwd_inner_microstep: 1683.76 | bwd_allreduce_microstep: 87.26 | step_microstep: 110.59
[2025-08-03 05:24:05,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.92 | bwd: 7622.41 | bwd_inner: 6799.22 | bwd_allreduce: 822.90 | step: 111.75
{'loss': 0.7375, 'learning_rate': 7.2816992776365714e-06, 'epoch': 0.6}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13066
total_samples=18238, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:08,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.53 | bwd_microstep: 2035.58 | bwd_inner_microstep: 1844.99 | bwd_allreduce_microstep: 190.52 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13375
total_samples=18242, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:11,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.11 | bwd_microstep: 1769.79 | bwd_inner_microstep: 1682.23 | bwd_allreduce_microstep: 87.49 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13313
total_samples=18246, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:13,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.76 | bwd_microstep: 1748.62 | bwd_inner_microstep: 1673.56 | bwd_allreduce_microstep: 74.98 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407
total_samples=18250, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:16,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 05:24:16,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.30 | bwd_microstep: 2085.71 | bwd_inner_microstep: 1977.55 | bwd_allreduce_microstep: 108.10 | step_microstep: 115.03
[2025-08-03 05:24:16,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.62 | bwd: 7639.75 | bwd_inner: 7178.33 | bwd_allreduce: 461.17 | step: 115.61
{'loss': 0.7444, 'learning_rate': 7.2661188444382345e-06, 'epoch': 0.6}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12819
total_samples=18254, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:19,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.06 | bwd_microstep: 1883.19 | bwd_inner_microstep: 1671.54 | bwd_allreduce_microstep: 211.58 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12026
total_samples=18257, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:22,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.60 | bwd_microstep: 1817.01 | bwd_inner_microstep: 1577.90 | bwd_allreduce_microstep: 239.05 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11420
total_samples=18260, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:24,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.77 | bwd_microstep: 2006.28 | bwd_inner_microstep: 1551.37 | bwd_allreduce_microstep: 454.84 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13467
total_samples=18264, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:27,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 05:24:27,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.82 | bwd_microstep: 1790.17 | bwd_inner_microstep: 1699.69 | bwd_allreduce_microstep: 90.42 | step_microstep: 132.12
[2025-08-03 05:24:27,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.17 | bwd: 7496.70 | bwd_inner: 6500.50 | bwd_allreduce: 995.96 | step: 132.57
{'loss': 0.7548, 'learning_rate': 7.250545580524515e-06, 'epoch': 0.6}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508
total_samples=18268, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:30,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.48 | bwd_microstep: 1892.85 | bwd_inner_microstep: 1723.68 | bwd_allreduce_microstep: 169.10 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12107
total_samples=18271, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:33,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.76 | bwd_microstep: 2774.53 | bwd_inner_microstep: 2496.66 | bwd_allreduce_microstep: 277.81 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13688
total_samples=18275, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:36,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.70 | bwd_microstep: 1785.49 | bwd_inner_microstep: 1713.84 | bwd_allreduce_microstep: 71.59 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13248
total_samples=18279, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:39,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.66
[2025-08-03 05:24:39,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.45 | bwd_microstep: 1749.15 | bwd_inner_microstep: 1679.18 | bwd_allreduce_microstep: 69.90 | step_microstep: 157.69
[2025-08-03 05:24:39,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.31 | bwd: 8202.07 | bwd_inner: 7613.36 | bwd_allreduce: 588.47 | step: 158.27
197/2000 [3:40:48<2:26:47, 10.97s/it]
 60%|█████▉    | 1198/2000 [3:40:58<2:24:58, 10.85s/it]
 60%|█████▉    | 1199/2000 [3:41:09<2:26:37, 10.98s/it]
 60%|██████    | 1200/2000 [3:41:20<2:26:00, 10.95s/it]
 60%|██████    | 1201/2000 [3:41:31<2:25:22, 10.92s/it]
 60%|██████    | 1202/2000 [3:41:42<2:24:32, 10.87s/it]
{'loss': 0.7388, 'learning_rate': 7.234979526734482e-06, 'epoch': 0.6}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13415
total_samples=18283, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:41,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.87 | bwd_microstep: 1813.32 | bwd_inner_microstep: 1727.61 | bwd_allreduce_microstep: 85.65 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13190
total_samples=18287, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:44,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.97 | bwd_microstep: 1737.22 | bwd_inner_microstep: 1647.61 | bwd_allreduce_microstep: 89.54 | step_microstep: 0.34
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12231
total_samples=18290, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:46,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.31 | bwd_microstep: 1713.37 | bwd_inner_microstep: 1558.83 | bwd_allreduce_microstep: 154.48 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13466
total_samples=18294, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:49,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.18
[2025-08-03 05:24:49,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.89 | bwd_microstep: 1837.89 | bwd_inner_microstep: 1804.15 | bwd_allreduce_microstep: 33.66 | step_microstep: 161.87
[2025-08-03 05:24:49,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.98 | bwd: 7101.86 | bwd_inner: 6738.19 | bwd_allreduce: 363.41 | step: 162.45
{'loss': 0.7393, 'learning_rate': 7.219420723888301e-06, 'epoch': 0.6}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13769
total_samples=18298, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:52,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.65 | bwd_microstep: 1961.82 | bwd_inner_microstep: 1699.24 | bwd_allreduce_microstep: 262.51 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13622
total_samples=18302, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:54,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.59 | bwd_microstep: 1791.17 | bwd_inner_microstep: 1670.60 | bwd_allreduce_microstep: 120.51 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11685
total_samples=18305, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:24:57,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.25 | bwd_microstep: 1909.74 | bwd_inner_microstep: 1747.19 | bwd_allreduce_microstep: 162.48 | step_microstep: 0.29
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13434
total_samples=18309, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:00,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 05:25:00,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.67 | bwd_microstep: 1950.08 | bwd_inner_microstep: 1701.40 | bwd_allreduce_microstep: 248.58 | step_microstep: 139.62
[2025-08-03 05:25:00,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2870.10 | bwd: 7612.88 | bwd_inner: 6818.43 | bwd_allreduce: 794.19 | step: 140.16
{'loss': 0.751, 'learning_rate': 7.203869212787112e-06, 'epoch': 0.6}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13141
total_samples=18313, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:03,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.72 | bwd_microstep: 2047.21 | bwd_inner_microstep: 1864.78 | bwd_allreduce_microstep: 182.35 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13686
total_samples=18317, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:05,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.33 | bwd_microstep: 1816.66 | bwd_inner_microstep: 1655.98 | bwd_allreduce_microstep: 160.62 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13560
total_samples=18321, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:08,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.19 | bwd_microstep: 1710.68 | bwd_inner_microstep: 1666.90 | bwd_allreduce_microstep: 43.72 | step_microstep: 0.28
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13157
total_samples=18325, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:10,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.26
[2025-08-03 05:25:10,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.43 | bwd_microstep: 1803.87 | bwd_inner_microstep: 1660.35 | bwd_allreduce_microstep: 143.45 | step_microstep: 155.86
[2025-08-03 05:25:10,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.60 | bwd: 7378.48 | bwd_inner: 6848.01 | bwd_allreduce: 530.23 | step: 156.51
{'loss': 0.7439, 'learning_rate': 7.188325034212944e-06, 'epoch': 0.6}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11963
total_samples=18328, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:13,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.96 | bwd_microstep: 1825.33 | bwd_inner_microstep: 1590.68 | bwd_allreduce_microstep: 234.59 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14526
total_samples=18332, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:16,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.31 | bwd_microstep: 1755.09 | bwd_inner_microstep: 1702.99 | bwd_allreduce_microstep: 52.03 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12093
total_samples=18335, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:18,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 756.51 | bwd_microstep: 1807.38 | bwd_inner_microstep: 1561.78 | bwd_allreduce_microstep: 245.52 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13389
total_samples=18339, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:21,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 05:25:21,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.38 | bwd_microstep: 1992.95 | bwd_inner_microstep: 1723.15 | bwd_allreduce_microstep: 269.73 | step_microstep: 114.90
[2025-08-03 05:25:21,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2896.09 | bwd: 7380.80 | bwd_inner: 6578.60 | bwd_allreduce: 801.96 | step: 115.39
{'loss': 0.7433, 'learning_rate': 7.1727882289285915e-06, 'epoch': 0.6}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12773
total_samples=18343, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:24,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.21 | bwd_microstep: 1939.72 | bwd_inner_microstep: 1658.64 | bwd_allreduce_microstep: 281.00 | step_microstep: 0.18
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12993
total_samples=18347, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:27,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.53 | bwd_microstep: 2025.49 | bwd_inner_microstep: 1781.89 | bwd_allreduce_microstep: 243.53 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11667
total_samples=18350, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:29,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.09 | bwd_microstep: 1967.13 | bwd_inner_microstep: 1831.73 | bwd_allreduce_microstep: 135.32 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11756
total_samples=18353, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:32,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58
[2025-08-03 05:25:32,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.25 | bwd_microstep: 1753.30 | bwd_inner_microstep: 1542.78 | bwd_allreduce_microstep: 210.44 | step_microstep: 115.51
[2025-08-03 05:25:32,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.01 | bwd: 7685.68 | bwd_inner: 6815.03 | bwd_allreduce: 870.38 | step: 116.10
 60%|██████    | 1203/2000 [3:41:53<2:26:48, 11.05s/it]
 60%|██████    | 1204/2000 [3:42:04<2:23:56, 10.85s/it]
 60%|██████    | 1205/2000 [3:42:15<2:24:00, 10.87s/it]
 60%|██████    | 1206/2000 [3:42:25<2:22:48, 10.79s/it]
 60%|██████    | 1207/2000 [3:42:36<2:22:09, 10.76s/it]
 60%|██████    | 1208/2000 [3:42:47<2:22:29, 10.80s/it]
{'loss': 0.7463, 'learning_rate': 7.157258837677514e-06, 'epoch': 0.6}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13148
total_samples=18357, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:35,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.83 | bwd_microstep: 1778.72 | bwd_inner_microstep: 1672.43 | bwd_allreduce_microstep: 106.22 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13701
total_samples=18360, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:37,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.20 | bwd_microstep: 1993.51 | bwd_inner_microstep: 1843.60 | bwd_allreduce_microstep: 149.84 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12009
total_samples=18363, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:40,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.42 | bwd_microstep: 1938.09 | bwd_inner_microstep: 1753.00 | bwd_allreduce_microstep: 185.02 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13278
total_samples=18367, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:43,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 05:25:43,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.75 | bwd_microstep: 1789.75 | bwd_inner_microstep: 1695.59 | bwd_allreduce_microstep: 94.08 | step_microstep: 111.39
[2025-08-03 05:25:43,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.12 | bwd: 7500.13 | bwd_inner: 6964.63 | bwd_allreduce: 535.25 | step: 111.91
{'loss': 0.7416, 'learning_rate': 7.1417369011837355e-06, 'epoch': 0.6}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12947
total_samples=18371, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:46,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.77 | bwd_microstep: 2168.00 | bwd_inner_microstep: 1735.81 | bwd_allreduce_microstep: 432.11 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15187
total_samples=18375, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:48,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 1762.82 | bwd_inner_microstep: 1749.55 | bwd_allreduce_microstep: 13.21 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14063
total_samples=18379, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:51,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1773.13 | bwd_inner_microstep: 1733.99 | bwd_allreduce_microstep: 39.06 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13409
total_samples=18383, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:54,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 05:25:54,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.39 | bwd_microstep: 1795.03 | bwd_inner_microstep: 1711.75 | bwd_allreduce_microstep: 83.20 | step_microstep: 116.44
[2025-08-03 05:25:54,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.74 | bwd: 7499.02 | bwd_inner: 6931.10 | bwd_allreduce: 567.67 | step: 117.01
{'loss': 0.7337, 'learning_rate': 7.126222460151719e-06, 'epoch': 0.6}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13760
total_samples=18387, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:56,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.70 | bwd_microstep: 1791.82 | bwd_inner_microstep: 1709.13 | bwd_allreduce_microstep: 82.62 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12206
total_samples=18391, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:25:59,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.06 | bwd_microstep: 1749.40 | bwd_inner_microstep: 1581.01 | bwd_allreduce_microstep: 168.32 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14199
total_samples=18395, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:01,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.54 | bwd_microstep: 2081.12 | bwd_inner_microstep: 1754.28 | bwd_allreduce_microstep: 326.77 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13481
total_samples=18399, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:04,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:26:04,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.17 | bwd_microstep: 1986.71 | bwd_inner_microstep: 1866.97 | bwd_allreduce_microstep: 119.67 | step_microstep: 138.62
[2025-08-03 05:26:04,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.39 | bwd: 7609.11 | bwd_inner: 6911.38 | bwd_allreduce: 697.48 | step: 139.09
{'loss': 0.7451, 'learning_rate': 7.110715555266281e-06, 'epoch': 0.61}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12747
total_samples=18403, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:07,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.54 | bwd_microstep: 1819.29 | bwd_inner_microstep: 1624.97 | bwd_allreduce_microstep: 194.26 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11853
total_samples=18406, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:10,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.83 | bwd_microstep: 1957.75 | bwd_inner_microstep: 1585.63 | bwd_allreduce_microstep: 372.05 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13062
total_samples=18410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:13,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 901.38 | bwd_microstep: 1889.03 | bwd_inner_microstep: 1821.50 | bwd_allreduce_microstep: 67.47 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13357
total_samples=18414, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:16,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 05:26:16,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.92 | bwd_microstep: 2103.45 | bwd_inner_microstep: 1938.87 | bwd_allreduce_microstep: 164.51 | step_microstep: 394.66
[2025-08-03 05:26:16,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2983.60 | bwd: 7769.56 | bwd_inner: 6970.97 | bwd_allreduce: 798.36 | step: 395.23
{'loss': 0.7489, 'learning_rate': 7.095216227192467e-06, 'epoch': 0.61}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15036
total_samples=18418, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:19,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.73 | bwd_microstep: 2279.89 | bwd_inner_microstep: 1952.99 | bwd_allreduce_microstep: 326.83 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13638
total_samples=18422, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:22,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.44 | bwd_microstep: 1984.97 | bwd_inner_microstep: 1872.75 | bwd_allreduce_microstep: 112.16 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13773
total_samples=18427, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:24,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.12 | bwd_microstep: 1763.57 | bwd_inner_microstep: 1705.94 | bwd_allreduce_microstep: 57.57 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11760
total_samples=18430, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:28,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 05:26:28,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1243.15 | bwd_microstep: 2514.73 | bwd_inner_microstep: 2293.18 | bwd_allreduce_microstep: 221.47 | step_microstep: 116.81
[2025-08-03 05:26:28,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3362.37 | bwd: 8543.22 | bwd_inner: 7824.86 | bwd_allreduce: 718.10 | step: 117.43
{'loss': 0.7373, 'learning_rate': 7.0797245165754654e-06, 'epoch': 0.61}
 60%|██████    | 1209/2000 [3:42:58<2:22:11, 10.79s/it]
 60%|██████    | 1210/2000 [3:43:08<2:21:53, 10.78s/it]
 61%|██████    | 1211/2000 [3:43:19<2:22:11, 10.81s/it]
 61%|██████    | 1212/2000 [3:43:31<2:24:33, 11.01s/it]
 61%|██████    | 1213/2000 [3:43:43<2:29:35, 11.40s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13841
total_samples=18435, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:31,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.86 | bwd_microstep: 1728.27 | bwd_inner_microstep: 1674.39 | bwd_allreduce_microstep: 53.81 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12681
total_samples=18439, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:33,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.03 | bwd_microstep: 1847.54 | bwd_inner_microstep: 1644.67 | bwd_allreduce_microstep: 202.81 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13630
total_samples=18443, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:36,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.37 | bwd_microstep: 1992.86 | bwd_inner_microstep: 1878.95 | bwd_allreduce_microstep: 113.83 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13376
total_samples=18447, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:39,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 05:26:39,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.94 | bwd_microstep: 1816.66 | bwd_inner_microstep: 1714.56 | bwd_allreduce_microstep: 102.03 | step_microstep: 119.27
[2025-08-03 05:26:39,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.11 | bwd: 7385.38 | bwd_inner: 6912.58 | bwd_allreduce: 472.56 | step: 119.73
{'loss': 0.7381, 'learning_rate': 7.064240464040472e-06, 'epoch': 0.61}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14037
total_samples=18451, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:42,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.95 | bwd_microstep: 2018.34 | bwd_inner_microstep: 1880.68 | bwd_allreduce_microstep: 137.59 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12189
total_samples=18454, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:45,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.20 | bwd_microstep: 2097.64 | bwd_inner_microstep: 1875.66 | bwd_allreduce_microstep: 221.91 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13217
total_samples=18458, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:47,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.03 | bwd_microstep: 1731.67 | bwd_inner_microstep: 1667.86 | bwd_allreduce_microstep: 63.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13140
total_samples=18462, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:50,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 05:26:50,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.32 | bwd_microstep: 1814.33 | bwd_inner_microstep: 1700.17 | bwd_allreduce_microstep: 114.09 | step_microstep: 122.48
[2025-08-03 05:26:50,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.42 | bwd: 7662.02 | bwd_inner: 7124.36 | bwd_allreduce: 537.41 | step: 122.93
{'loss': 0.745, 'learning_rate': 7.048764110192618e-06, 'epoch': 0.61}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13703
total_samples=18466, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:53,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 747.05 | bwd_microstep: 2168.18 | bwd_inner_microstep: 1976.14 | bwd_allreduce_microstep: 191.97 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14328
total_samples=18470, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:55,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.89 | bwd_microstep: 1755.00 | bwd_inner_microstep: 1733.67 | bwd_allreduce_microstep: 21.26 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13697
total_samples=18474, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:26:58,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.66 | bwd_microstep: 1727.03 | bwd_inner_microstep: 1655.76 | bwd_allreduce_microstep: 71.21 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13760
total_samples=18478, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:01,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 05:27:01,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.24 | bwd_microstep: 1818.58 | bwd_inner_microstep: 1812.17 | bwd_allreduce_microstep: 6.34 | step_microstep: 149.27
[2025-08-03 05:27:01,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.78 | bwd: 7468.84 | bwd_inner: 7177.73 | bwd_allreduce: 290.88 | step: 149.68
{'loss': 0.7443, 'learning_rate': 7.033295495616834e-06, 'epoch': 0.61}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12866
total_samples=18482, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:03,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.70 | bwd_microstep: 2063.10 | bwd_inner_microstep: 1645.13 | bwd_allreduce_microstep: 417.90 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14262
total_samples=18486, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:06,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.24 | bwd_microstep: 1719.82 | bwd_inner_microstep: 1700.38 | bwd_allreduce_microstep: 19.36 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14327
total_samples=18490, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:08,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.85 | bwd_microstep: 1835.25 | bwd_inner_microstep: 1754.81 | bwd_allreduce_microstep: 80.37 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332
total_samples=18494, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:11,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 05:27:11,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.90 | bwd_microstep: 1764.79 | bwd_inner_microstep: 1693.06 | bwd_allreduce_microstep: 71.65 | step_microstep: 113.27
[2025-08-03 05:27:11,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.62 | bwd: 7383.00 | bwd_inner: 6793.39 | bwd_allreduce: 589.37 | step: 113.77
{'loss': 0.7397, 'learning_rate': 7.017834660877756e-06, 'epoch': 0.61}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12807
total_samples=18498, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:14,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.15 | bwd_microstep: 1736.56 | bwd_inner_microstep: 1594.24 | bwd_allreduce_microstep: 142.25 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12423
total_samples=18502, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:16,653] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.02 | bwd_microstep: 1714.85 | bwd_inner_microstep: 1589.31 | bwd_allreduce_microstep: 125.48 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13374
total_samples=18506, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:19,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.23 | bwd_microstep: 1987.06 | bwd_inner_microstep: 1859.95 | bwd_allreduce_microstep: 127.05 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14662
total_samples=18510, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:22,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 05:27:22,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.21 | bwd_microstep: 2170.22 | bwd_inner_microstep: 1780.86 | bwd_allreduce_microstep: 389.28 | step_microstep: 132.63
[2025-08-03 05:27:22,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.55 | bwd: 7608.74 | bwd_inner: 6824.35 | bwd_allreduce: 784.14 | step: 133.10
{'loss': 0.7522, 'learning_rate': 7.002381646519625e-06, 'epoch': 0.61}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11746
total_samples=18513, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:25,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.07 | bwd_microstep: 1957.00 | bwd_inner_microstep: 1601.49 | bwd_allreduce_microstep: 355.44 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13135
total_samples=18517, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:27,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.38 | bwd_microstep: 1787.10 | bwd_inner_microstep: 1723.08 | bwd_allreduce_microstep: 63.94 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11874
total_samples=18520, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:30,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.48 | bwd_microstep: 2135.02 | bwd_inner_microstep: 1773.35 | bwd_allreduce_microstep: 361.61 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11940
total_samples=18523, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:33,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.13
[2025-08-03 05:27:33,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.77 | bwd_microstep: 2177.63 | bwd_inner_microstep: 1926.59 | bwd_allreduce_microstep: 250.97 | step_microstep: 160.37
[2025-08-03 05:27:33,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.62 | bwd: 8056.81 | bwd_inner: 7024.51 | bwd_allreduce: 1032.05 | step: 160.88
███    | 1213/2000 [3:43:43<2:29:35, 11.40s/it]
61%|██████    | 1214/2000 [3:43:54<2:26:23, 11.18s/it]
61%|██████    | 1215/2000 [3:44:05<2:25:09, 11.10s/it]
61%|██████    | 1216/2000 [3:44:15<2:23:39, 10.99s/it]
61%|██████    | 1217/2000 [3:44:26<2:22:06, 10.89s/it]
61%|██████    | 1218/2000 [3:44:37<2:21:45, 10.88s/it]
{'loss': 0.7476, 'learning_rate': 6.986936493066165e-06, 'epoch': 0.61}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=18526, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:36,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.29 | bwd_microstep: 1988.67 | bwd_inner_microstep: 1759.82 | bwd_allreduce_microstep: 228.78 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13989
total_samples=18530, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:39,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.21 | bwd_microstep: 2031.47 | bwd_inner_microstep: 1886.37 | bwd_allreduce_microstep: 145.03 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14722
total_samples=18534, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:42,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.69 | bwd_microstep: 1837.88 | bwd_inner_microstep: 1783.39 | bwd_allreduce_microstep: 54.42 | step_microstep: 0.83
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13103
total_samples=18538, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:44,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 05:27:44,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.24 | bwd_microstep: 1744.51 | bwd_inner_microstep: 1675.33 | bwd_allreduce_microstep: 69.12 | step_microstep: 133.56
[2025-08-03 05:27:44,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.36 | bwd: 7602.59 | bwd_inner: 7104.91 | bwd_allreduce: 497.43 | step: 134.61
{'loss': 0.7395, 'learning_rate': 6.971499241020495e-06, 'epoch': 0.61}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11870
total_samples=18541, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:47,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.11 | bwd_microstep: 1838.52 | bwd_inner_microstep: 1576.48 | bwd_allreduce_microstep: 261.98 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16239
total_samples=18545, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:50,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.68 | bwd_microstep: 2163.56 | bwd_inner_microstep: 1961.18 | bwd_allreduce_microstep: 202.31 | step_microstep: 0.77
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13241
total_samples=18549, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:52,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.05 | bwd_microstep: 1717.25 | bwd_inner_microstep: 1655.82 | bwd_allreduce_microstep: 61.37 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13136
total_samples=18553, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:55,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 05:27:55,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.95 | bwd_microstep: 1725.17 | bwd_inner_microstep: 1663.99 | bwd_allreduce_microstep: 61.12 | step_microstep: 128.76
[2025-08-03 05:27:55,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.71 | bwd: 7444.56 | bwd_inner: 6857.47 | bwd_allreduce: 586.86 | step: 129.90
{'loss': 0.732, 'learning_rate': 6.956069930865005e-06, 'epoch': 0.61}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13292
total_samples=18557, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:27:58,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.56 | bwd_microstep: 2149.47 | bwd_inner_microstep: 2004.92 | bwd_allreduce_microstep: 144.48 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13648
total_samples=18561, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:00,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.85 | bwd_microstep: 1822.01 | bwd_inner_microstep: 1723.19 | bwd_allreduce_microstep: 98.75 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11651
total_samples=18564, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:03,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.91 | bwd_microstep: 1777.86 | bwd_inner_microstep: 1547.77 | bwd_allreduce_microstep: 230.02 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13551
total_samples=18568, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:06,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.15
[2025-08-03 05:28:06,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.96 | bwd_microstep: 2113.42 | bwd_inner_microstep: 1916.45 | bwd_allreduce_microstep: 196.91 | step_microstep: 114.81
[2025-08-03 05:28:06,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2875.20 | bwd: 7862.81 | bwd_inner: 7192.32 | bwd_allreduce: 670.24 | step: 115.44
{'loss': 0.7562, 'learning_rate': 6.940648603061263e-06, 'epoch': 0.61}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11680
total_samples=18571, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:09,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.08 | bwd_microstep: 1851.50 | bwd_inner_microstep: 1602.78 | bwd_allreduce_microstep: 248.65 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14209
total_samples=18575, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:11,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.66 | bwd_microstep: 1919.51 | bwd_inner_microstep: 1890.91 | bwd_allreduce_microstep: 28.54 | step_microstep: 0.24
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12269
total_samples=18579, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:14,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.59 | bwd_microstep: 1978.11 | bwd_inner_microstep: 1971.07 | bwd_allreduce_microstep: 6.96 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13745
total_samples=18583, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:17,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 05:28:17,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.37 | bwd_microstep: 1771.50 | bwd_inner_microstep: 1720.35 | bwd_allreduce_microstep: 51.09 | step_microstep: 132.05
[2025-08-03 05:28:17,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.63 | bwd: 7520.68 | bwd_inner: 7185.11 | bwd_allreduce: 335.32 | step: 132.55
{'loss': 0.7594, 'learning_rate': 6.925235298049906e-06, 'epoch': 0.61}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12547
total_samples=18587, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:19,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.02 | bwd_microstep: 1773.83 | bwd_inner_microstep: 1636.67 | bwd_allreduce_microstep: 137.09 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13315
total_samples=18591, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:22,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.37 | bwd_microstep: 2304.07 | bwd_inner_microstep: 2266.93 | bwd_allreduce_microstep: 37.08 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11988
total_samples=18594, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:25,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.99 | bwd_microstep: 1741.44 | bwd_inner_microstep: 1547.34 | bwd_allreduce_microstep: 194.01 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13229
total_samples=18598, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:28,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.68
[2025-08-03 05:28:28,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.38 | bwd_microstep: 1734.72 | bwd_inner_microstep: 1673.20 | bwd_allreduce_microstep: 61.45 | step_microstep: 133.55
[2025-08-03 05:28:28,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.69 | bwd: 7554.11 | bwd_inner: 7124.13 | bwd_allreduce: 429.73 | step: 134.17
61%|██████    | 1219/2000 [3:44:48<2:23:28, 11.02s/it]
61%|██████    | 1220/2000 [3:44:59<2:22:33, 10.97s/it]
61%|██████    | 1221/2000 [3:45:10<2:21:14, 10.88s/it]
61%|██████    | 1222/2000 [3:45:21<2:22:00, 10.95s/it]
61%|██████    | 1223/2000 [3:45:32<2:21:18, 10.91s/it]
{'loss': 0.7455, 'learning_rate': 6.909830056250527e-06, 'epoch': 0.61}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15547
total_samples=18602, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:30,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.98 | bwd_microstep: 1773.26 | bwd_inner_microstep: 1766.08 | bwd_allreduce_microstep: 7.11 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14218
total_samples=18606, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:33,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.86 | bwd_microstep: 1822.69 | bwd_inner_microstep: 1723.22 | bwd_allreduce_microstep: 99.40 | step_microstep: 0.84
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13732
total_samples=18610, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:35,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.31 | bwd_microstep: 1744.19 | bwd_inner_microstep: 1702.45 | bwd_allreduce_microstep: 41.67 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13210
total_samples=18614, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:38,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 05:28:38,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.23 | bwd_microstep: 2007.76 | bwd_inner_microstep: 1882.18 | bwd_allreduce_microstep: 125.52 | step_microstep: 115.70
[2025-08-03 05:28:38,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.31 | bwd: 7347.95 | bwd_inner: 7073.93 | bwd_allreduce: 273.77 | step: 116.77
{'loss': 0.7395, 'learning_rate': 6.894432918061579e-06, 'epoch': 0.61}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13378
total_samples=18618, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:41,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.45 | bwd_microstep: 1905.23 | bwd_inner_microstep: 1812.52 | bwd_allreduce_microstep: 92.62 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13464
total_samples=18622, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:43,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.35 | bwd_microstep: 1774.96 | bwd_inner_microstep: 1692.61 | bwd_allreduce_microstep: 82.28 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12945
total_samples=18626, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:46,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.01 | bwd_microstep: 2045.14 | bwd_inner_microstep: 1899.63 | bwd_allreduce_microstep: 145.44 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14815
total_samples=18631, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:49,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.05
[2025-08-03 05:28:49,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.32 | bwd_microstep: 1761.70 | bwd_inner_microstep: 1734.91 | bwd_allreduce_microstep: 26.72 | step_microstep: 119.75
[2025-08-03 05:28:49,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2761.07 | bwd: 7487.08 | bwd_inner: 7139.67 | bwd_allreduce: 347.15 | step: 120.28
{'loss': 0.7352, 'learning_rate': 6.8790439238602576e-06, 'epoch': 0.61}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12115
total_samples=18635, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:52,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.29 | bwd_microstep: 1964.29 | bwd_inner_microstep: 1799.98 | bwd_allreduce_microstep: 164.24 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13746
total_samples=18640, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:54,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.09 | bwd_microstep: 2020.38 | bwd_inner_microstep: 1878.12 | bwd_allreduce_microstep: 142.20 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14031
total_samples=18644, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:28:57,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.77 | bwd_microstep: 1811.06 | bwd_inner_microstep: 1725.34 | bwd_allreduce_microstep: 85.66 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11953
total_samples=18647, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:00,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 05:29:00,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.86 | bwd_microstep: 1969.08 | bwd_inner_microstep: 1825.22 | bwd_allreduce_microstep: 143.79 | step_microstep: 110.69
[2025-08-03 05:29:00,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2750.94 | bwd: 7764.86 | bwd_inner: 7228.65 | bwd_allreduce: 535.96 | step: 111.14
{'loss': 0.7387, 'learning_rate': 6.863663114002411e-06, 'epoch': 0.61}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11981
total_samples=18650, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:02,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.25 | bwd_microstep: 1724.99 | bwd_inner_microstep: 1536.83 | bwd_allreduce_microstep: 188.09 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13273
total_samples=18654, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:05,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.33 | bwd_microstep: 1823.62 | bwd_inner_microstep: 1677.23 | bwd_allreduce_microstep: 146.33 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12831
total_samples=18658, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:08,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.26 | bwd_microstep: 2125.43 | bwd_inner_microstep: 1905.49 | bwd_allreduce_microstep: 219.88 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11663
total_samples=18661, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:11,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 05:29:11,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.07 | bwd_microstep: 2084.23 | bwd_inner_microstep: 1838.21 | bwd_allreduce_microstep: 245.96 | step_microstep: 112.80
[2025-08-03 05:29:11,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.83 | bwd: 7758.31 | bwd_inner: 6957.75 | bwd_allreduce: 800.33 | step: 113.16
{'loss': 0.7489, 'learning_rate': 6.848290528822417e-06, 'epoch': 0.61}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13292
total_samples=18665, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:13,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.44 | bwd_microstep: 1854.41 | bwd_inner_microstep: 1798.01 | bwd_allreduce_microstep: 56.33 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13243
total_samples=18669, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:16,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.97 | bwd_microstep: 1758.60 | bwd_inner_microstep: 1689.88 | bwd_allreduce_microstep: 68.65 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12777
total_samples=18673, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:19,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.12 | bwd_microstep: 1867.05 | bwd_inner_microstep: 1671.24 | bwd_allreduce_microstep: 195.75 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13305
total_samples=18677, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:21,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 05:29:21,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.55 | bwd_microstep: 1878.19 | bwd_inner_microstep: 1689.94 | bwd_allreduce_microstep: 188.19 | step_microstep: 115.95
[2025-08-03 05:29:21,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.01 | bwd: 7358.31 | bwd_inner: 6849.07 | bwd_allreduce: 509.01 | step: 116.42
61%|██████    | 1224/2000 [3:45:42<2:20:35, 10.87s/it]
61%|██████▏   | 1225/2000 [3:45:53<2:19:09, 10.77s/it]
61%|██████▏   | 1226/2000 [3:46:04<2:18:34, 10.74s/it]
61%|██████▏   | 1227/2000 [3:46:15<2:19:10, 10.80s/it]
61%|██████▏   | 1228/2000 [3:46:26<2:19:44, 10.86s/it]
61%|██████▏   | 1229/2000 [3:46:36<2:18:20, 10.77s/it]
{'loss': 0.7468, 'learning_rate': 6.8329262086330864e-06, 'epoch': 0.61}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13395
total_samples=18681, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:24,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.04 | bwd_microstep: 1775.57 | bwd_inner_microstep: 1690.34 | bwd_allreduce_microstep: 85.16 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12890
total_samples=18685, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:27,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.54 | bwd_microstep: 2304.95 | bwd_inner_microstep: 2299.26 | bwd_allreduce_microstep: 5.63 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13426
total_samples=18689, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:29,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.03 | bwd_microstep: 1772.17 | bwd_inner_microstep: 1695.21 | bwd_allreduce_microstep: 76.89 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11404
total_samples=18692, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:32,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 05:29:32,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.48 | bwd_microstep: 1812.38 | bwd_inner_microstep: 1585.88 | bwd_allreduce_microstep: 226.44 | step_microstep: 131.76
[2025-08-03 05:29:32,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.01 | bwd: 7665.12 | bwd_inner: 7270.69 | bwd_allreduce: 394.20 | step: 132.25
{'loss': 0.7451, 'learning_rate': 6.8175701937255645e-06, 'epoch': 0.61}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11993
total_samples=18695, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:35,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.81 | bwd_microstep: 2020.07 | bwd_inner_microstep: 1788.29 | bwd_allreduce_microstep: 231.72 | step_microstep: 0.17
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14282
total_samples=18699, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:38,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.68 | bwd_microstep: 1878.17 | bwd_inner_microstep: 1734.02 | bwd_allreduce_microstep: 144.08 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13564
total_samples=18703, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:40,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.23 | bwd_microstep: 1777.16 | bwd_inner_microstep: 1695.24 | bwd_allreduce_microstep: 81.87 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11596
total_samples=18706, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:43,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 05:29:43,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.32 | bwd_microstep: 1986.64 | bwd_inner_microstep: 1798.32 | bwd_allreduce_microstep: 188.25 | step_microstep: 112.78
[2025-08-03 05:29:43,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.97 | bwd: 7662.10 | bwd_inner: 7015.85 | bwd_allreduce: 646.00 | step: 113.33
{'loss': 0.7473, 'learning_rate': 6.802222524369202e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13032
total_samples=18710, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:46,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.98 | bwd_microstep: 1892.43 | bwd_inner_microstep: 1822.98 | bwd_allreduce_microstep: 69.38 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13279
total_samples=18714, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:49,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.75 | bwd_microstep: 1966.70 | bwd_inner_microstep: 1863.51 | bwd_allreduce_microstep: 103.13 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13280
total_samples=18719, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:51,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.43 | bwd_microstep: 2059.96 | bwd_inner_microstep: 1901.55 | bwd_allreduce_microstep: 158.35 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13373
total_samples=18723, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:54,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 05:29:54,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.74 | bwd_microstep: 2092.02 | bwd_inner_microstep: 1746.89 | bwd_allreduce_microstep: 345.06 | step_microstep: 112.84
[2025-08-03 05:29:54,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.82 | bwd: 8011.16 | bwd_inner: 7334.93 | bwd_allreduce: 675.99 | step: 113.47
{'loss': 0.7461, 'learning_rate': 6.786883240811479e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13818
total_samples=18727, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:29:57,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.86 | bwd_microstep: 1913.90 | bwd_inner_microstep: 1865.09 | bwd_allreduce_microstep: 48.75 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13709
total_samples=18731, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:00,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.08 | bwd_microstep: 2229.71 | bwd_inner_microstep: 1877.67 | bwd_allreduce_microstep: 351.98 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13305
total_samples=18735, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:03,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.81 | bwd_microstep: 1768.89 | bwd_inner_microstep: 1687.59 | bwd_allreduce_microstep: 81.24 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13338
total_samples=18739, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:05,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61
[2025-08-03 05:30:05,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.44 | bwd_microstep: 1822.28 | bwd_inner_microstep: 1711.97 | bwd_allreduce_microstep: 110.24 | step_microstep: 116.53
[2025-08-03 05:30:05,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.12 | bwd: 7734.84 | bwd_inner: 7142.31 | bwd_allreduce: 592.29 | step: 117.13
{'loss': 0.7554, 'learning_rate': 6.771552383277875e-06, 'epoch': 0.62}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13355
total_samples=18743, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:08,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.79 | bwd_microstep: 1785.50 | bwd_inner_microstep: 1671.56 | bwd_allreduce_microstep: 113.87 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14101
total_samples=18747, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:11,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.02 | bwd_microstep: 1909.66 | bwd_inner_microstep: 1854.79 | bwd_allreduce_microstep: 54.80 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15507
total_samples=18751, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:13,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.15 | bwd_microstep: 2016.70 | bwd_inner_microstep: 1931.81 | bwd_allreduce_microstep: 84.82 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14433
total_samples=18756, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:16,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 05:30:16,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.12 | bwd_microstep: 1746.00 | bwd_inner_microstep: 1718.90 | bwd_allreduce_microstep: 27.03 | step_microstep: 125.76
[2025-08-03 05:30:16,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.01 | bwd: 7457.91 | bwd_inner: 7177.06 | bwd_allreduce: 280.61 | step: 126.29
{'loss': 0.7478, 'learning_rate': 6.756229991971779e-06, 'epoch': 0.62}
62%|██████▏   | 1230/2000 [3:46:47<2:18:38, 10.80s/it]
62%|██████▏   | 1231/2000 [3:46:58<2:18:54, 10.84s/it]
62%|██████▏   | 1232/2000 [3:47:09<2:20:14, 10.96s/it]
62%|██████▏   | 1233/2000 [3:47:20<2:20:15, 10.97s/it]
62%|██████▏   | 1234/2000 [3:47:31<2:19:03, 10.89s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12120
total_samples=18759, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:19,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 748.72 | bwd_microstep: 1851.22 | bwd_inner_microstep: 1594.92 | bwd_allreduce_microstep: 256.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13274
total_samples=18763, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:22,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.32 | bwd_microstep: 2003.96 | bwd_inner_microstep: 1860.90 | bwd_allreduce_microstep: 143.00 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13462
total_samples=18768, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:24,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.82 | bwd_microstep: 2047.07 | bwd_inner_microstep: 1890.89 | bwd_allreduce_microstep: 156.09 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13495
total_samples=18773, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:27,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 05:30:27,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.29 | bwd_microstep: 1841.75 | bwd_inner_microstep: 1729.88 | bwd_allreduce_microstep: 111.80 | step_microstep: 124.14
[2025-08-03 05:30:27,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2886.08 | bwd: 7744.05 | bwd_inner: 7076.60 | bwd_allreduce: 667.20 | step: 124.63
{'loss': 0.7509, 'learning_rate': 6.740916107074372e-06, 'epoch': 0.62}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13013
total_samples=18777, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:30,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.11 | bwd_microstep: 1870.34 | bwd_inner_microstep: 1655.18 | bwd_allreduce_microstep: 215.10 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12203
total_samples=18780, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:32,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.94 | bwd_microstep: 1788.13 | bwd_inner_microstep: 1590.35 | bwd_allreduce_microstep: 197.72 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13366
total_samples=18784, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:35,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.88 | bwd_microstep: 1960.64 | bwd_inner_microstep: 1715.87 | bwd_allreduce_microstep: 244.70 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14000
total_samples=18789, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:38,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 05:30:38,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.50 | bwd_microstep: 1820.11 | bwd_inner_microstep: 1745.50 | bwd_allreduce_microstep: 74.54 | step_microstep: 131.85
[2025-08-03 05:30:38,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2757.36 | bwd: 7439.29 | bwd_inner: 6706.89 | bwd_allreduce: 732.15 | step: 132.33
{'loss': 0.748, 'learning_rate': 6.725610768744535e-06, 'epoch': 0.62}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11953
total_samples=18792, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:40,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.55 | bwd_microstep: 1817.85 | bwd_inner_microstep: 1781.41 | bwd_allreduce_microstep: 36.38 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11535
total_samples=18795, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:43,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.45 | bwd_microstep: 1773.04 | bwd_inner_microstep: 1539.46 | bwd_allreduce_microstep: 233.50 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13140
total_samples=18799, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:45,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.23 | bwd_microstep: 1794.84 | bwd_inner_microstep: 1686.76 | bwd_allreduce_microstep: 108.02 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14324
total_samples=18803, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:48,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 05:30:48,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.15 | bwd_microstep: 1733.08 | bwd_inner_microstep: 1705.72 | bwd_allreduce_microstep: 27.29 | step_microstep: 127.51
[2025-08-03 05:30:48,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.30 | bwd: 7118.87 | bwd_inner: 6713.35 | bwd_allreduce: 405.27 | step: 128.02
{'loss': 0.7446, 'learning_rate': 6.710314017118734e-06, 'epoch': 0.62}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14194
total_samples=18807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:51,314] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.77 | bwd_microstep: 1889.63 | bwd_inner_microstep: 1832.22 | bwd_allreduce_microstep: 57.34 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12082
total_samples=18810, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:54,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.43 | bwd_microstep: 1997.17 | bwd_inner_microstep: 1777.16 | bwd_allreduce_microstep: 219.95 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13234
total_samples=18814, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:56,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.46 | bwd_microstep: 2076.88 | bwd_inner_microstep: 1909.62 | bwd_allreduce_microstep: 167.19 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12783
total_samples=18818, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:30:59,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 05:30:59,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.97 | bwd_microstep: 1818.92 | bwd_inner_microstep: 1667.80 | bwd_allreduce_microstep: 151.05 | step_microstep: 130.61
[2025-08-03 05:30:59,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.55 | bwd: 7782.66 | bwd_inner: 7186.80 | bwd_allreduce: 595.61 | step: 131.27
{'loss': 0.7335, 'learning_rate': 6.695025892310913e-06, 'epoch': 0.62}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12093
total_samples=18821, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:02,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.23 | bwd_microstep: 2129.68 | bwd_inner_microstep: 1870.39 | bwd_allreduce_microstep: 259.22 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11654
total_samples=18824, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:05,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.73 | bwd_microstep: 2042.11 | bwd_inner_microstep: 1816.85 | bwd_allreduce_microstep: 225.20 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13388
total_samples=18828, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:08,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.39 | bwd_microstep: 1928.35 | bwd_inner_microstep: 1726.29 | bwd_allreduce_microstep: 201.99 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12795
total_samples=18832, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:11,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 05:31:11,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.65 | bwd_microstep: 2055.36 | bwd_inner_microstep: 1864.30 | bwd_allreduce_microstep: 191.00 | step_microstep: 130.33
[2025-08-03 05:31:11,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.94 | bwd: 8155.55 | bwd_inner: 7277.82 | bwd_allreduce: 877.49 | step: 130.79
{'loss': 0.7381, 'learning_rate': 6.6797464344124045e-06, 'epoch': 0.62}
62%|██████▏   | 1235/2000 [3:47:42<2:19:27, 10.94s/it]
62%|██████▏   | 1236/2000 [3:47:53<2:18:12, 10.85s/it]
62%|██████▏   | 1237/2000 [3:48:03<2:16:07, 10.70s/it]
62%|██████▏   | 1238/2000 [3:48:14<2:17:22, 10.82s/it]
62%|██████▏   | 1239/2000 [3:48:25<2:19:21, 10.99s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13708
total_samples=18836, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:13,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.94 | bwd_microstep: 1943.69 | bwd_inner_microstep: 1691.32 | bwd_allreduce_microstep: 252.30 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11812
total_samples=18839, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:16,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.62 | bwd_microstep: 1873.14 | bwd_inner_microstep: 1713.67 | bwd_allreduce_microstep: 159.40 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13903
total_samples=18844, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:18,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.74 | bwd_microstep: 1728.55 | bwd_inner_microstep: 1663.38 | bwd_allreduce_microstep: 65.10 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12605
total_samples=18847, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:21,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 05:31:21,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.53 | bwd_microstep: 1822.51 | bwd_inner_microstep: 1598.95 | bwd_allreduce_microstep: 223.50 | step_microstep: 118.93
[2025-08-03 05:31:21,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.76 | bwd: 7367.95 | bwd_inner: 6667.31 | bwd_allreduce: 700.39 | step: 119.44
{'loss': 0.7529, 'learning_rate': 6.664475683491797e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14274
total_samples=18851, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:24,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.89 | bwd_microstep: 1810.95 | bwd_inner_microstep: 1737.66 | bwd_allreduce_microstep: 73.23 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14730
total_samples=18855, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:26,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.40 | bwd_microstep: 1733.99 | bwd_inner_microstep: 1721.33 | bwd_allreduce_microstep: 12.59 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13210
total_samples=18859, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:29,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.21 | bwd_microstep: 1966.47 | bwd_inner_microstep: 1874.00 | bwd_allreduce_microstep: 92.39 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12023
total_samples=18862, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:32,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 05:31:32,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.09 | bwd_microstep: 2046.28 | bwd_inner_microstep: 1808.41 | bwd_allreduce_microstep: 237.80 | step_microstep: 107.60
[2025-08-03 05:31:32,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.52 | bwd: 7557.73 | bwd_inner: 7141.40 | bwd_allreduce: 416.08 | step: 108.07
{'loss': 0.7565, 'learning_rate': 6.649213679594859e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13993
total_samples=18867, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:35,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.90 | bwd_microstep: 2233.81 | bwd_inner_microstep: 2017.99 | bwd_allreduce_microstep: 215.76 | step_microstep: 0.13
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12214
total_samples=18871, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:38,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.69 | bwd_microstep: 2022.39 | bwd_inner_microstep: 1607.98 | bwd_allreduce_microstep: 414.31 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13374
total_samples=18875, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:42,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1883.44 | bwd_microstep: 2420.38 | bwd_inner_microstep: 2353.51 | bwd_allreduce_microstep: 66.80 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12051
total_samples=18878, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:45,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 05:31:45,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.73 | bwd_microstep: 2020.26 | bwd_inner_microstep: 1791.67 | bwd_allreduce_microstep: 228.52 | step_microstep: 111.35
[2025-08-03 05:31:45,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4002.68 | bwd: 8696.91 | bwd_inner: 7771.16 | bwd_allreduce: 925.46 | step: 111.88
{'loss': 0.7442, 'learning_rate': 6.633960462744415e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13182
total_samples=18882, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:48,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.13 | bwd_microstep: 2001.92 | bwd_inner_microstep: 1862.73 | bwd_allreduce_microstep: 139.13 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13862
total_samples=18886, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:50,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.18 | bwd_microstep: 1854.46 | bwd_inner_microstep: 1710.26 | bwd_allreduce_microstep: 144.15 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13752
total_samples=18890, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:53,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.21 | bwd_microstep: 2218.10 | bwd_inner_microstep: 2096.32 | bwd_allreduce_microstep: 121.71 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11954
total_samples=18893, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:56,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.28
[2025-08-03 05:31:56,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.23 | bwd_microstep: 1972.06 | bwd_inner_microstep: 1562.40 | bwd_allreduce_microstep: 409.59 | step_microstep: 137.85
[2025-08-03 05:31:56,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.68 | bwd: 8046.59 | bwd_inner: 7231.71 | bwd_allreduce: 814.64 | step: 138.32
{'loss': 0.7431, 'learning_rate': 6.618716072940248e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13699
total_samples=18897, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:31:59,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.73 | bwd_microstep: 1758.12 | bwd_inner_microstep: 1689.76 | bwd_allreduce_microstep: 68.29 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11928
total_samples=18900, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:01,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.66 | bwd_microstep: 1711.87 | bwd_inner_microstep: 1544.68 | bwd_allreduce_microstep: 167.12 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13254
total_samples=18904, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:04,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.84 | bwd_microstep: 2082.38 | bwd_inner_microstep: 1737.37 | bwd_allreduce_microstep: 344.94 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11891
total_samples=18907, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:07,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 05:32:07,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.06 | bwd_microstep: 1776.56 | bwd_inner_microstep: 1555.23 | bwd_allreduce_microstep: 221.26 | step_microstep: 135.15
[2025-08-03 05:32:07,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.21 | bwd: 7328.97 | bwd_inner: 6527.04 | bwd_allreduce: 801.68 | step: 135.54
{'loss': 0.7419, 'learning_rate': 6.603480550158995e-06, 'epoch': 0.62}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13724
total_samples=18912, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:09,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.80 | bwd_microstep: 1764.04 | bwd_inner_microstep: 1654.70 | bwd_allreduce_microstep: 109.28 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11799
total_samples=18915, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:12,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.04 | bwd_microstep: 1897.67 | bwd_inner_microstep: 1620.83 | bwd_allreduce_microstep: 276.78 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14370
total_samples=18919, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:15,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.89 | bwd_microstep: 1949.32 | bwd_inner_microstep: 1915.44 | bwd_allreduce_microstep: 33.82 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11693
total_samples=18922, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:18,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 05:32:18,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.31 | bwd_microstep: 2110.66 | bwd_inner_microstep: 1921.63 | bwd_allreduce_microstep: 188.97 | step_microstep: 111.10
[2025-08-03 05:32:18,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.97 | bwd: 7721.75 | bwd_inner: 7112.59 | bwd_allreduce: 608.92 | step: 111.55
62%|██████▏   | 1240/2000 [3:48:36<2:17:31, 10.86s/it]
62%|██████▏   | 1241/2000 [3:48:47<2:17:07, 10.84s/it]
62%|██████▏   | 1242/2000 [3:49:00<2:25:32, 11.52s/it]
62%|██████▏   | 1243/2000 [3:49:11<2:24:24, 11.45s/it]
62%|██████▏   | 1244/2000 [3:49:22<2:21:02, 11.19s/it]
{'loss': 0.7415, 'learning_rate': 6.588253934354039e-06, 'epoch': 0.62}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11721
total_samples=18925, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:21,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.06 | bwd_microstep: 2013.58 | bwd_inner_microstep: 1800.20 | bwd_allreduce_microstep: 213.32 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770
total_samples=18928, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:24,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.59 | bwd_microstep: 2162.93 | bwd_inner_microstep: 1927.91 | bwd_allreduce_microstep: 234.96 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14277
total_samples=18934, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:26,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.60 | bwd_microstep: 1780.65 | bwd_inner_microstep: 1671.81 | bwd_allreduce_microstep: 108.78 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11765
total_samples=18937, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:29,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70
[2025-08-03 05:32:29,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.82 | bwd_microstep: 1768.25 | bwd_inner_microstep: 1536.37 | bwd_allreduce_microstep: 231.81 | step_microstep: 110.96
[2025-08-03 05:32:29,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.00 | bwd: 7725.47 | bwd_inner: 6936.29 | bwd_allreduce: 788.94 | step: 111.43
{'loss': 0.7458, 'learning_rate': 6.5730362654554015e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14005
total_samples=18941, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:32,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.36 | bwd_microstep: 1998.64 | bwd_inner_microstep: 1886.19 | bwd_allreduce_microstep: 112.38 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12297
total_samples=18944, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:34,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.25 | bwd_microstep: 1751.16 | bwd_inner_microstep: 1573.76 | bwd_allreduce_microstep: 177.34 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13767
total_samples=18948, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:37,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.27 | bwd_microstep: 1837.30 | bwd_inner_microstep: 1727.41 | bwd_allreduce_microstep: 109.83 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14200
total_samples=18952, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:39,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 05:32:39,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.56 | bwd_microstep: 1851.08 | bwd_inner_microstep: 1776.66 | bwd_allreduce_microstep: 74.36 | step_microstep: 136.01
[2025-08-03 05:32:39,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.38 | bwd: 7438.24 | bwd_inner: 6964.01 | bwd_allreduce: 473.98 | step: 136.51
{'loss': 0.7476, 'learning_rate': 6.5578275833696485e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14101
total_samples=18956, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:42,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.35 | bwd_microstep: 1810.70 | bwd_inner_microstep: 1738.04 | bwd_allreduce_microstep: 72.58 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11947
total_samples=18959, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:45,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.99 | bwd_microstep: 1833.82 | bwd_inner_microstep: 1605.17 | bwd_allreduce_microstep: 228.59 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12561
total_samples=18963, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:48,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.99 | bwd_microstep: 2006.50 | bwd_inner_microstep: 1630.09 | bwd_allreduce_microstep: 376.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13873
total_samples=18967, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:50,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.25
[2025-08-03 05:32:50,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.98 | bwd_microstep: 1739.97 | bwd_inner_microstep: 1695.08 | bwd_allreduce_microstep: 44.81 | step_microstep: 140.33
[2025-08-03 05:32:50,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.24 | bwd: 7391.05 | bwd_inner: 6668.37 | bwd_allreduce: 722.42 | step: 140.83
{'loss': 0.7495, 'learning_rate': 6.542627927979772e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13714
total_samples=18971, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:53,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.03 | bwd_microstep: 2185.55 | bwd_inner_microstep: 2062.50 | bwd_allreduce_microstep: 122.98 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13582
total_samples=18975, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:56,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.89 | bwd_microstep: 2109.22 | bwd_inner_microstep: 1858.63 | bwd_allreduce_microstep: 250.53 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11551
total_samples=18978, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:32:59,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.78 | bwd_microstep: 1871.91 | bwd_inner_microstep: 1620.43 | bwd_allreduce_microstep: 251.43 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13471
total_samples=18982, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:01,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:33:01,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.52 | bwd_microstep: 1736.31 | bwd_inner_microstep: 1679.80 | bwd_allreduce_microstep: 56.43 | step_microstep: 111.65
[2025-08-03 05:33:01,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.16 | bwd: 7903.05 | bwd_inner: 7221.35 | bwd_allreduce: 681.46 | step: 112.13
{'loss': 0.7468, 'learning_rate': 6.527437339145097e-06, 'epoch': 0.62}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14705
total_samples=18986, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:04,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.12 | bwd_microstep: 2017.17 | bwd_inner_microstep: 1955.52 | bwd_allreduce_microstep: 61.57 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13374
total_samples=18990, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:07,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.86 | bwd_microstep: 2059.33 | bwd_inner_microstep: 2053.17 | bwd_allreduce_microstep: 6.09 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14083
total_samples=18994, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:10,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.46 | bwd_microstep: 2158.74 | bwd_inner_microstep: 2022.76 | bwd_allreduce_microstep: 135.92 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14974
total_samples=18998, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:12,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 05:33:12,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.41 | bwd_microstep: 1776.21 | bwd_inner_microstep: 1726.55 | bwd_allreduce_microstep: 49.57 | step_microstep: 141.31
[2025-08-03 05:33:12,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.78 | bwd: 8011.48 | bwd_inner: 7757.99 | bwd_allreduce: 253.23 | step: 141.94
62%|██████▏   | 1245/2000 [3:49:33<2:19:58, 11.12s/it]
62%|██████▏   | 1246/2000 [3:49:44<2:19:02, 11.06s/it]
62%|██████▏   | 1247/2000 [3:49:54<2:17:24, 10.95s/it]
62%|██████▏   | 1248/2000 [3:50:05<2:16:07, 10.86s/it]
62%|██████▏   | 1249/2000 [3:50:16<2:16:47, 10.93s/it]
62%|██████▎   | 1250/2000 [3:50:27<2:17:50, 11.03s/
{'loss': 0.7445, 'learning_rate': 6.5122558567011775e-06, 'epoch': 0.62}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13008
total_samples=19002, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:15,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 914.62 | bwd_microstep: 1752.42 | bwd_inner_microstep: 1675.47 | bwd_allreduce_microstep: 76.88 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13566
total_samples=19006, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:18,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.38 | bwd_microstep: 1767.67 | bwd_inner_microstep: 1706.93 | bwd_allreduce_microstep: 60.67 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11750
total_samples=19009, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:21,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.82 | bwd_microstep: 2099.38 | bwd_inner_microstep: 1863.38 | bwd_allreduce_microstep: 235.93 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13599
total_samples=19013, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:23,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 05:33:23,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.99 | bwd_microstep: 1896.90 | bwd_inner_microstep: 1707.08 | bwd_allreduce_microstep: 189.76 | step_microstep: 158.14
[2025-08-03 05:33:23,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3018.75 | bwd: 7516.43 | bwd_inner: 6952.86 | bwd_allreduce: 563.32 | step: 158.67
{'loss': 0.7508, 'learning_rate': 6.497083520459674e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16009
total_samples=19017, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:26,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.04 | bwd_microstep: 1808.79 | bwd_inner_microstep: 1802.11 | bwd_allreduce_microstep: 6.61 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16144
total_samples=19021, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:29,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.24 | bwd_microstep: 2010.53 | bwd_inner_microstep: 2004.46 | bwd_allreduce_microstep: 6.01 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13725
total_samples=19025, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:31,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.30 | bwd_microstep: 1720.00 | bwd_inner_microstep: 1678.91 | bwd_allreduce_microstep: 41.02 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11981
total_samples=19028, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:35,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 05:33:35,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.17 | bwd_microstep: 2029.17 | bwd_inner_microstep: 1787.92 | bwd_allreduce_microstep: 241.18 | step_microstep: 469.85
[2025-08-03 05:33:35,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.67 | bwd: 7568.54 | bwd_inner: 7273.40 | bwd_allreduce: 294.90 | step: 470.51
{'loss': 0.7378, 'learning_rate': 6.481920370208274e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13664
total_samples=19032, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:37,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.13 | bwd_microstep: 1731.56 | bwd_inner_microstep: 1677.68 | bwd_allreduce_microstep: 53.81 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13733
total_samples=19036, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:40,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.69 | bwd_microstep: 1806.39 | bwd_inner_microstep: 1742.76 | bwd_allreduce_microstep: 63.55 | step_microstep: 0.42
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13635
total_samples=19040, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:42,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.35 | bwd_microstep: 1978.54 | bwd_inner_microstep: 1876.46 | bwd_allreduce_microstep: 101.99 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14715
total_samples=19044, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:45,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.85
[2025-08-03 05:33:45,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.94 | bwd_microstep: 1755.98 | bwd_inner_microstep: 1726.32 | bwd_allreduce_microstep: 29.59 | step_microstep: 149.67
[2025-08-03 05:33:45,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.04 | bwd: 7272.54 | bwd_inner: 7023.22 | bwd_allreduce: 249.04 | step: 150.34
{'loss': 0.7426, 'learning_rate': 6.466766445710568e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14922
total_samples=19049, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:48,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.42 | bwd_microstep: 2107.12 | bwd_inner_microstep: 1971.42 | bwd_allreduce_microstep: 135.63 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13483
total_samples=19053, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:51,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.91 | bwd_microstep: 2012.72 | bwd_inner_microstep: 1825.08 | bwd_allreduce_microstep: 187.57 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13393
total_samples=19057, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:54,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.60 | bwd_microstep: 1854.94 | bwd_inner_microstep: 1814.76 | bwd_allreduce_microstep: 40.11 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11972
total_samples=19060, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:56,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 05:33:56,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.19 | bwd_microstep: 1912.36 | bwd_inner_microstep: 1793.47 | bwd_allreduce_microstep: 118.83 | step_microstep: 111.77
[2025-08-03 05:33:56,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.06 | bwd: 7887.19 | bwd_inner: 7404.72 | bwd_allreduce: 482.22 | step: 112.27
{'loss': 0.7311, 'learning_rate': 6.4516217867059615e-06, 'epoch': 0.63}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11938
total_samples=19063, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:33:59,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.70 | bwd_microstep: 1782.21 | bwd_inner_microstep: 1572.75 | bwd_allreduce_microstep: 209.39 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13829
total_samples=19069, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:01,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.51 | bwd_microstep: 1763.26 | bwd_inner_microstep: 1691.88 | bwd_allreduce_microstep: 71.31 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11855
total_samples=19072, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:04,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.21 | bwd_microstep: 1761.85 | bwd_inner_microstep: 1553.27 | bwd_allreduce_microstep: 208.51 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11715
total_samples=19075, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:07,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 05:34:07,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.09 | bwd_microstep: 2164.06 | bwd_inner_microstep: 1833.03 | bwd_allreduce_microstep: 330.96 | step_microstep: 116.36
[2025-08-03 05:34:07,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.43 | bwd: 7471.43 | bwd_inner: 6650.92 | bwd_allreduce: 820.26 | step: 116.93
 62%|██████▎   | 1250/2000 [3:50:27<2:17:50, 11.03s/it]
 63%|██████▎   | 1251/2000 [3:50:38<2:17:26, 11.01s/it]
 63%|██████▎   | 1252/2000 [3:50:49<2:17:54, 11.06s/it]
 63%|██████▎   | 1253/2000 [3:51:00<2:15:51, 10.91s/it]
 63%|██████▎   | 1254/2000 [3:51:11<2:16:18, 10.96s/it]
 63%|██████▎   | 1255/2000 [3:51:22<2:15:04, 10.88s/it]
{'loss': 0.7435, 'learning_rate': 6.43648643290955e-06, 'epoch': 0.63}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11518
total_samples=19078, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:10,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.39 | bwd_microstep: 2218.64 | bwd_inner_microstep: 1855.95 | bwd_allreduce_microstep: 362.62 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14456
total_samples=19082, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:13,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.81 | bwd_microstep: 1936.20 | bwd_inner_microstep: 1820.44 | bwd_allreduce_microstep: 115.69 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13737
total_samples=19086, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:15,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.06 | bwd_microstep: 1758.24 | bwd_inner_microstep: 1698.90 | bwd_allreduce_microstep: 59.27 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13110
total_samples=19089, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:18,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77
[2025-08-03 05:34:18,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.41 | bwd_microstep: 1892.29 | bwd_inner_microstep: 1744.76 | bwd_allreduce_microstep: 147.45 | step_microstep: 160.97
[2025-08-03 05:34:18,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2884.59 | bwd: 7805.43 | bwd_inner: 7120.05 | bwd_allreduce: 685.13 | step: 161.49
{'loss': 0.7442, 'learning_rate': 6.421360424012039e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13314
total_samples=19093, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:21,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.13 | bwd_microstep: 2274.62 | bwd_inner_microstep: 2119.72 | bwd_allreduce_microstep: 154.83 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13246
total_samples=19097, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:24,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.37 | bwd_microstep: 2040.83 | bwd_inner_microstep: 1872.48 | bwd_allreduce_microstep: 168.27 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11625
total_samples=19100, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:26,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.27 | bwd_microstep: 1733.18 | bwd_inner_microstep: 1538.15 | bwd_allreduce_microstep: 194.97 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12116
total_samples=19103, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:29,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69
[2025-08-03 05:34:29,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.80 | bwd_microstep: 1867.13 | bwd_inner_microstep: 1567.56 | bwd_allreduce_microstep: 299.50 | step_microstep: 111.79
[2025-08-03 05:34:29,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.50 | bwd: 7915.82 | bwd_inner: 7097.91 | bwd_allreduce: 817.66 | step: 112.29
{'loss': 0.7373, 'learning_rate': 6.406243799679625e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13544
total_samples=19107, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:32,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.15 | bwd_microstep: 1861.41 | bwd_inner_microstep: 1820.60 | bwd_allreduce_microstep: 40.74 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12253
total_samples=19110, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:34,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.71 | bwd_microstep: 1791.60 | bwd_inner_microstep: 1574.24 | bwd_allreduce_microstep: 217.29 | step_microstep: 0.26
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12889
total_samples=19114, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:37,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.61 | bwd_microstep: 1989.78 | bwd_inner_microstep: 1676.52 | bwd_allreduce_microstep: 313.20 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11826
total_samples=19117, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:40,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.78
[2025-08-03 05:34:40,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.76 | bwd_microstep: 1811.07 | bwd_inner_microstep: 1579.03 | bwd_allreduce_microstep: 231.96 | step_microstep: 130.10
[2025-08-03 05:34:40,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.17 | bwd: 7453.93 | bwd_inner: 6650.39 | bwd_allreduce: 803.29 | step: 130.59
{'loss': 0.7422, 'learning_rate': 6.39113659955389e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13915
total_samples=19121, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:43,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.71 | bwd_microstep: 2185.19 | bwd_inner_microstep: 2060.50 | bwd_allreduce_microstep: 124.62 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13293
total_samples=19125, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:46,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.73 | bwd_microstep: 2025.59 | bwd_inner_microstep: 1721.03 | bwd_allreduce_microstep: 304.49 | step_microstep: 0.42
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13756
total_samples=19129, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:49,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.07 | bwd_microstep: 2087.82 | bwd_inner_microstep: 1933.13 | bwd_allreduce_microstep: 154.63 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11919
total_samples=19132, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:51,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 05:34:51,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.02 | bwd_microstep: 1837.22 | bwd_inner_microstep: 1605.03 | bwd_allreduce_microstep: 232.12 | step_microstep: 112.95
[2025-08-03 05:34:51,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.45 | bwd: 8135.88 | bwd_inner: 7319.69 | bwd_allreduce: 815.94 | step: 113.61
{'loss': 0.7451, 'learning_rate': 6.376038863251706e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14032
total_samples=19138, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:54,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.37 | bwd_microstep: 1721.54 | bwd_inner_microstep: 1681.65 | bwd_allreduce_microstep: 39.82 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14265
total_samples=19143, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:57,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.87 | bwd_microstep: 1873.51 | bwd_inner_microstep: 1817.00 | bwd_allreduce_microstep: 56.43 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13738
total_samples=19147, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:34:59,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.24 | bwd_microstep: 1845.20 | bwd_inner_microstep: 1783.40 | bwd_allreduce_microstep: 61.73 | step_microstep: 0.74
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11758
total_samples=19150, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:02,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 05:35:02,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.09 | bwd_microstep: 1847.98 | bwd_inner_microstep: 1596.31 | bwd_allreduce_microstep: 251.60 | step_microstep: 111.48
[2025-08-03 05:35:02,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.49 | bwd: 7288.28 | bwd_inner: 6878.36 | bwd_allreduce: 409.67 | step: 112.60
{'loss': 0.7394, 'learning_rate': 6.360950630365126e-06, 'epoch': 0.63}
 63%|██████▎   | 1256/2000 [3:51:33<2:15:49, 10.95s/it]
 63%|██████▎   | 1257/2000 [3:51:44<2:16:15, 11.00s/it]
 63%|██████▎   | 1258/2000 [3:51:55<2:15:05, 10.92s/it]
 63%|██████▎   | 1259/2000 [3:52:06<2:16:42, 11.07s/it]
 63%|██████▎   | 1260/2000 [3:52:17<2:14:37, 10.92s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15621
total_samples=19154, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:04,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.18 | bwd_microstep: 1777.67 | bwd_inner_microstep: 1768.78 | bwd_allreduce_microstep: 8.82 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13314
total_samples=19158, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:07,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.78 | bwd_microstep: 1860.48 | bwd_inner_microstep: 1726.47 | bwd_allreduce_microstep: 133.95 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11689
total_samples=19161, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:10,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.20 | bwd_microstep: 1777.11 | bwd_inner_microstep: 1566.39 | bwd_allreduce_microstep: 210.64 | step_microstep: 0.82
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=19164, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:12,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77
[2025-08-03 05:35:12,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.45 | bwd_microstep: 1725.45 | bwd_inner_microstep: 1534.95 | bwd_allreduce_microstep: 190.43 | step_microstep: 159.42
[2025-08-03 05:35:12,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2753.54 | bwd: 7140.77 | bwd_inner: 6596.59 | bwd_allreduce: 543.92 | step: 160.66
{'loss': 0.7508, 'learning_rate': 6.345871940461282e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14586
total_samples=19168, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:15,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.91 | bwd_microstep: 1750.13 | bwd_inner_microstep: 1716.57 | bwd_allreduce_microstep: 33.49 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12055
total_samples=19171, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:17,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.64 | bwd_microstep: 1872.05 | bwd_inner_microstep: 1574.93 | bwd_allreduce_microstep: 297.06 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11879
total_samples=19174, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:20,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.34 | bwd_microstep: 2012.43 | bwd_inner_microstep: 1556.75 | bwd_allreduce_microstep: 455.62 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12618
total_samples=19178, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:23,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.95
[2025-08-03 05:35:23,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.72 | bwd_microstep: 1844.78 | bwd_inner_microstep: 1644.84 | bwd_allreduce_microstep: 199.87 | step_microstep: 123.80
[2025-08-03 05:35:23,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.54 | bwd: 7479.45 | bwd_inner: 6493.08 | bwd_allreduce: 986.12 | step: 124.31
{'loss': 0.7307, 'learning_rate': 6.33080283308228e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13132
total_samples=19182, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:26,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.13 | bwd_microstep: 1958.49 | bwd_inner_microstep: 1838.40 | bwd_allreduce_microstep: 120.03 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14696
total_samples=19186, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:28,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.99 | bwd_microstep: 1919.73 | bwd_inner_microstep: 1758.20 | bwd_allreduce_microstep: 161.45 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13620
total_samples=19190, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:31,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.23 | bwd_microstep: 1803.66 | bwd_inner_microstep: 1687.87 | bwd_allreduce_microstep: 115.72 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11787
total_samples=19193, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:34,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 05:35:34,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.82 | bwd_microstep: 2032.11 | bwd_inner_microstep: 1923.74 | bwd_allreduce_microstep: 108.30 | step_microstep: 113.77
[2025-08-03 05:35:34,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.10 | bwd: 7714.03 | bwd_inner: 7208.19 | bwd_allreduce: 505.58 | step: 114.38
{'loss': 0.7522, 'learning_rate': 6.315743347745098e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14879
total_samples=19198, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:36,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.17 | bwd_microstep: 1731.18 | bwd_inner_microstep: 1719.47 | bwd_allreduce_microstep: 11.65 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13322
total_samples=19202, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:39,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.14 | bwd_microstep: 1796.83 | bwd_inner_microstep: 1714.71 | bwd_allreduce_microstep: 82.04 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13787
total_samples=19206, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:42,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.65 | bwd_microstep: 1795.50 | bwd_inner_microstep: 1721.21 | bwd_allreduce_microstep: 74.21 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13504
total_samples=19210, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:44,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 05:35:44,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.22 | bwd_microstep: 1925.59 | bwd_inner_microstep: 1885.71 | bwd_allreduce_microstep: 39.81 | step_microstep: 126.13
[2025-08-03 05:35:44,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.11 | bwd: 7249.16 | bwd_inner: 7041.10 | bwd_allreduce: 207.80 | step: 126.64
{'loss': 0.7422, 'learning_rate': 6.300693523941481e-06, 'epoch': 0.63}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13312
total_samples=19214, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:47,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.13 | bwd_microstep: 1811.80 | bwd_inner_microstep: 1681.03 | bwd_allreduce_microstep: 130.71 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13488
total_samples=19218, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:49,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.79 | bwd_microstep: 1744.65 | bwd_inner_microstep: 1695.32 | bwd_allreduce_microstep: 49.27 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11640
total_samples=19221, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:52,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.25 | bwd_microstep: 2144.48 | bwd_inner_microstep: 1936.96 | bwd_allreduce_microstep: 207.46 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13252
total_samples=19225, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:55,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 05:35:55,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.38 | bwd_microstep: 1805.72 | bwd_inner_microstep: 1698.21 | bwd_allreduce_microstep: 107.44 | step_microstep: 134.58
[2025-08-03 05:35:55,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.48 | bwd: 7506.70 | bwd_inner: 7011.50 | bwd_allreduce: 494.95 | step: 135.06
{'loss': 0.7547, 'learning_rate': 6.2856534011378365e-06, 'epoch': 0.63}
 63%|██████▎   | 1261/2000 [3:52:27<2:12:22, 10.75s/it]
 63%|██████▎   | 1262/2000 [3:52:38<2:12:02, 10.74s/it]
 63%|██████▎   | 1263/2000 [3:52:49<2:12:38, 10.80s/it]
 63%|██████▎   | 1264/2000 [3:52:59<2:11:11, 10.70s/it]
 63%|██████▎   | 1265/2000 [3:53:10<2:11:11, 10.71s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15728
total_samples=19229, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:35:58,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.44 | bwd_microstep: 2183.04 | bwd_inner_microstep: 2098.96 | bwd_allreduce_microstep: 84.01 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14725
total_samples=19233, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:01,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.17 | bwd_microstep: 1780.18 | bwd_inner_microstep: 1737.68 | bwd_allreduce_microstep: 42.44 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13758
total_samples=19237, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:03,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.17 | bwd_microstep: 1974.65 | bwd_inner_microstep: 1906.79 | bwd_allreduce_microstep: 67.80 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13215
total_samples=19241, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:06,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 05:36:06,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.50 | bwd_microstep: 1694.74 | bwd_inner_microstep: 1636.52 | bwd_allreduce_microstep: 58.16 | step_microstep: 145.25
[2025-08-03 05:36:06,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.22 | bwd: 7632.66 | bwd_inner: 7379.93 | bwd_allreduce: 252.48 | step: 145.73
{'loss': 0.7441, 'learning_rate': 6.270623018775135e-06, 'epoch': 0.63}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13008
total_samples=19245, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:09,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 981.80 | bwd_microstep: 2026.83 | bwd_inner_microstep: 1865.42 | bwd_allreduce_microstep: 161.35 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508
total_samples=19249, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:12,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.39 | bwd_microstep: 1963.48 | bwd_inner_microstep: 1872.63 | bwd_allreduce_microstep: 90.79 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14781
total_samples=19255, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:14,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.26 | bwd_microstep: 1768.73 | bwd_inner_microstep: 1753.58 | bwd_allreduce_microstep: 15.09 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309
total_samples=19259, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:17,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 05:36:17,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.74 | bwd_microstep: 2055.39 | bwd_inner_microstep: 1901.56 | bwd_allreduce_microstep: 153.77 | step_microstep: 132.32
[2025-08-03 05:36:17,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3075.13 | bwd: 7814.48 | bwd_inner: 7393.17 | bwd_allreduce: 421.08 | step: 132.64
{'loss': 0.7438, 'learning_rate': 6.255602416268799e-06, 'epoch': 0.63}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13744
total_samples=19263, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:20,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.77 | bwd_microstep: 1829.64 | bwd_inner_microstep: 1708.48 | bwd_allreduce_microstep: 121.09 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13599
total_samples=19267, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:23,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.95 | bwd_microstep: 1768.35 | bwd_inner_microstep: 1695.06 | bwd_allreduce_microstep: 73.22 | step_microstep: 0.32
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12172
total_samples=19270, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:25,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.28 | bwd_microstep: 1762.48 | bwd_inner_microstep: 1570.62 | bwd_allreduce_microstep: 191.80 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=19274, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:28,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 05:36:28,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.03 | bwd_microstep: 1972.80 | bwd_inner_microstep: 1876.00 | bwd_allreduce_microstep: 96.74 | step_microstep: 109.99
[2025-08-03 05:36:28,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.93 | bwd: 7333.32 | bwd_inner: 6850.15 | bwd_allreduce: 482.93 | step: 110.58
{'loss': 0.7451, 'learning_rate': 6.2405916330086106e-06, 'epoch': 0.63}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13463
total_samples=19278, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:30,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.93 | bwd_microstep: 1736.92 | bwd_inner_microstep: 1675.07 | bwd_allreduce_microstep: 61.79 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15218
total_samples=19282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:33,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.93 | bwd_microstep: 2112.12 | bwd_inner_microstep: 1797.14 | bwd_allreduce_microstep: 314.91 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14307
total_samples=19286, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:36,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.38 | bwd_microstep: 1746.61 | bwd_inner_microstep: 1722.21 | bwd_allreduce_microstep: 24.34 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14415
total_samples=19290, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:38,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 05:36:38,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.13 | bwd_microstep: 1719.47 | bwd_inner_microstep: 1713.34 | bwd_allreduce_microstep: 6.06 | step_microstep: 138.06
[2025-08-03 05:36:38,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.32 | bwd: 7315.18 | bwd_inner: 6907.75 | bwd_allreduce: 407.18 | step: 138.43
{'loss': 0.7591, 'learning_rate': 6.225590708358596e-06, 'epoch': 0.63}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13022
total_samples=19294, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:41,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.07 | bwd_microstep: 1733.54 | bwd_inner_microstep: 1634.65 | bwd_allreduce_microstep: 98.82 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13419
total_samples=19298, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:44,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.41 | bwd_microstep: 2190.11 | bwd_inner_microstep: 2184.09 | bwd_allreduce_microstep: 5.96 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11830
total_samples=19301, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:47,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.25 | bwd_microstep: 2022.57 | bwd_inner_microstep: 1563.27 | bwd_allreduce_microstep: 459.24 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13847
total_samples=19305, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:49,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.46
[2025-08-03 05:36:49,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.44 | bwd_microstep: 1726.19 | bwd_inner_microstep: 1693.92 | bwd_allreduce_microstep: 32.21 | step_microstep: 129.37
[2025-08-03 05:36:49,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.10 | bwd: 7672.46 | bwd_inner: 7075.93 | bwd_allreduce: 596.30 | step: 129.82
{'loss': 0.7515, 'learning_rate': 6.210599681656933e-06, 'epoch': 0.64}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15025
total_samples=19310, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:52,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.52 | bwd_microstep: 1769.72 | bwd_inner_microstep: 1701.97 | bwd_allreduce_microstep: 67.68 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13168
total_samples=19314, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:54,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.94 | bwd_microstep: 1736.05 | bwd_inner_microstep: 1629.58 | bwd_allreduce_microstep: 106.40 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11727
total_samples=19317, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:36:57,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.49 | bwd_microstep: 2028.84 | bwd_inner_microstep: 1860.60 | bwd_allreduce_microstep: 168.17 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13215
total_samples=19321, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:00,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 05:37:00,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.94 | bwd_microstep: 2027.92 | bwd_inner_microstep: 1896.48 | bwd_allreduce_microstep: 131.37 | step_microstep: 151.67
[2025-08-03 05:37:00,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.81 | bwd: 7562.58 | bwd_inner: 7088.63 | bwd_allreduce: 473.71 | step: 152.03
 63%|██████▎   | 1266/2000 [3:53:21<2:11:42, 10.77s/it]
 63%|██████▎   | 1267/2000 [3:53:32<2:13:39, 10.94s/it]
 63%|██████▎   | 1268/2000 [3:53:43<2:12:07, 10.83s/it]
 63%|██████▎   | 1269/2000 [3:53:53<2:10:53, 10.74s/it]
 64%|██████▎   | 1270/2000 [3:54:04<2:11:16, 10.79s/it]
{'loss': 0.7453, 'learning_rate': 6.1956185922158445e-06, 'epoch': 0.64}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14082
total_samples=19325, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:03,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.76 | bwd_microstep: 2127.67 | bwd_inner_microstep: 1734.31 | bwd_allreduce_microstep: 393.30 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11930
total_samples=19328, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:06,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.14 | bwd_microstep: 1825.01 | bwd_inner_microstep: 1589.60 | bwd_allreduce_microstep: 235.35 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13354
total_samples=19332, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:09,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.19 | bwd_microstep: 1995.22 | bwd_inner_microstep: 1883.96 | bwd_allreduce_microstep: 111.19 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14106
total_samples=19336, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:11,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 05:37:11,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.76 | bwd_microstep: 1796.23 | bwd_inner_microstep: 1731.87 | bwd_allreduce_microstep: 64.30 | step_microstep: 148.88
[2025-08-03 05:37:11,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.77 | bwd: 7744.19 | bwd_inner: 6939.73 | bwd_allreduce: 804.22 | step: 149.39
{'loss': 0.7517, 'learning_rate': 6.180647479321484e-06, 'epoch': 0.64}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13132
total_samples=19340, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:14,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.74 | bwd_microstep: 2007.63 | bwd_inner_microstep: 1877.60 | bwd_allreduce_microstep: 129.96 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13518
total_samples=19344, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:17,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.26 | bwd_microstep: 2147.57 | bwd_inner_microstep: 1746.20 | bwd_allreduce_microstep: 401.27 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14230
total_samples=19348, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:20,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.35 | bwd_microstep: 1924.06 | bwd_inner_microstep: 1849.54 | bwd_allreduce_microstep: 74.45 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11907
total_samples=19351, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:22,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.73
[2025-08-03 05:37:22,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.93 | bwd_microstep: 1708.35 | bwd_inner_microstep: 1536.35 | bwd_allreduce_microstep: 171.94 | step_microstep: 126.35
[2025-08-03 05:37:22,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.21 | bwd: 7787.66 | bwd_inner: 7009.70 | bwd_allreduce: 777.70 | step: 126.88
{'loss': 0.734, 'learning_rate': 6.165686382233856e-06, 'epoch': 0.64}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16315
total_samples=19355, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:25,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.45 | bwd_microstep: 1996.71 | bwd_inner_microstep: 1933.03 | bwd_allreduce_microstep: 63.62 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11968
total_samples=19358, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:28,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.85 | bwd_microstep: 1766.96 | bwd_inner_microstep: 1575.39 | bwd_allreduce_microstep: 191.50 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13530
total_samples=19362, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:30,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.18 | bwd_microstep: 1728.67 | bwd_inner_microstep: 1664.85 | bwd_allreduce_microstep: 63.75 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309
total_samples=19366, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:33,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.74
[2025-08-03 05:37:33,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.84 | bwd_microstep: 2045.71 | bwd_inner_microstep: 1917.20 | bwd_allreduce_microstep: 128.43 | step_microstep: 115.71
[2025-08-03 05:37:33,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.24 | bwd: 7538.10 | bwd_inner: 7090.45 | bwd_allreduce: 447.38 | step: 116.06
{'loss': 0.7303, 'learning_rate': 6.150735340186689e-06, 'epoch': 0.64}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=19370, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:36,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.68 | bwd_microstep: 1791.00 | bwd_inner_microstep: 1693.22 | bwd_allreduce_microstep: 97.71 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12298
total_samples=19374, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:38,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.77 | bwd_microstep: 1908.36 | bwd_inner_microstep: 1590.42 | bwd_allreduce_microstep: 317.87 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13203
total_samples=19379, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:41,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.35 | bwd_microstep: 1724.75 | bwd_inner_microstep: 1620.41 | bwd_allreduce_microstep: 104.28 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13487
total_samples=19383, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:44,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.81
[2025-08-03 05:37:44,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.06 | bwd_microstep: 1887.88 | bwd_inner_microstep: 1828.88 | bwd_allreduce_microstep: 58.91 | step_microstep: 158.69
[2025-08-03 05:37:44,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.79 | bwd: 7312.05 | bwd_inner: 6732.93 | bwd_allreduce: 578.86 | step: 159.06
{'loss': 0.7423, 'learning_rate': 6.135794392387353e-06, 'epoch': 0.64}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12982
total_samples=19387, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:46,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.19 | bwd_microstep: 1970.01 | bwd_inner_microstep: 1642.36 | bwd_allreduce_microstep: 327.58 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11585
total_samples=19390, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:49,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.02 | bwd_microstep: 1751.68 | bwd_inner_microstep: 1531.95 | bwd_allreduce_microstep: 219.67 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12257
total_samples=19393, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:51,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.80 | bwd_microstep: 1779.24 | bwd_inner_microstep: 1577.23 | bwd_allreduce_microstep: 201.95 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14016
total_samples=19397, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:55,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 05:37:55,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.13 | bwd_microstep: 2043.67 | bwd_inner_microstep: 1892.84 | bwd_allreduce_microstep: 150.74 | step_microstep: 415.03
[2025-08-03 05:37:55,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.07 | bwd: 7544.67 | bwd_inner: 6644.39 | bwd_allreduce: 900.01 | step: 415.42
 64%|██████▎   | 1271/2000 [3:54:15<2:11:14, 10.80s/it]
 64%|██████▎   | 1272/2000 [3:54:26<2:11:54, 10.87s/it]
 64%|██████▎   | 1273/2000 [3:54:37<2:12:26, 10.93s/it]
 64%|██████▎   | 1274/2000 [3:54:48<2:11:43, 10.89s/it]
 64%|██████▍   | 1275/2000 [3:54:59<2:10:26, 10.80s/it]
{'loss': 0.7434, 'learning_rate': 6.120863578016736e-06, 'epoch': 0.64}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12826
total_samples=19401, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:37:58,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.84 | bwd_microstep: 2126.74 | bwd_inner_microstep: 2106.64 | bwd_allreduce_microstep: 20.03 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11871
total_samples=19404, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:00,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.52 | bwd_microstep: 1836.97 | bwd_inner_microstep: 1565.42 | bwd_allreduce_microstep: 271.47 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11544
total_samples=19407, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:03,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.62 | bwd_microstep: 1831.39 | bwd_inner_microstep: 1586.97 | bwd_allreduce_microstep: 244.35 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13345
total_samples=19411, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:06,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 05:38:06,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.50 | bwd_microstep: 1755.32 | bwd_inner_microstep: 1686.04 | bwd_allreduce_microstep: 69.21 | step_microstep: 158.09
[2025-08-03 05:38:06,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.40 | bwd: 7550.47 | bwd_inner: 6945.08 | bwd_allreduce: 605.14 | step: 158.62
{'loss': 0.7449, 'learning_rate': 6.1059429362291615e-06, 'epoch': 0.64}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13353
total_samples=19415, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:08,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.43 | bwd_microstep: 1799.12 | bwd_inner_microstep: 1681.04 | bwd_allreduce_microstep: 117.99 | step_microstep: 0.96
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11693
total_samples=19418, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:11,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.54 | bwd_microstep: 1728.20 | bwd_inner_microstep: 1533.50 | bwd_allreduce_microstep: 194.63 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13132
total_samples=19422, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:13,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.73 | bwd_microstep: 1981.04 | bwd_inner_microstep: 1894.57 | bwd_allreduce_microstep: 86.41 | step_microstep: 0.31
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13328
total_samples=19426, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:17,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 05:38:17,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.77 | bwd_microstep: 2084.60 | bwd_inner_microstep: 1919.59 | bwd_allreduce_microstep: 164.94 | step_microstep: 444.22
[2025-08-03 05:38:17,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.40 | bwd: 7593.03 | bwd_inner: 7028.68 | bwd_allreduce: 564.08 | step: 445.63
{'loss': 0.7475, 'learning_rate': 6.091032506152274e-06, 'epoch': 0.64}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12517
total_samples=19430, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:19,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.76 | bwd_microstep: 1820.08 | bwd_inner_microstep: 1620.85 | bwd_allreduce_microstep: 199.15 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14830
total_samples=19434, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:22,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.03 | bwd_microstep: 2259.81 | bwd_inner_microstep: 2001.95 | bwd_allreduce_microstep: 257.80 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12255
total_samples=19437, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:25,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.32 | bwd_microstep: 1739.64 | bwd_inner_microstep: 1562.95 | bwd_allreduce_microstep: 176.62 | step_microstep: 0.17
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13841
total_samples=19442, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:28,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.23
[2025-08-03 05:38:28,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.14 | bwd_microstep: 2123.69 | bwd_inner_microstep: 1809.88 | bwd_allreduce_microstep: 313.75 | step_microstep: 132.65
[2025-08-03 05:38:28,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.18 | bwd: 7943.29 | bwd_inner: 6995.63 | bwd_allreduce: 947.40 | step: 133.25
{'loss': 0.7587, 'learning_rate': 6.076132326886934e-06, 'epoch': 0.64}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13617
total_samples=19446, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:31,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.04 | bwd_microstep: 2006.67 | bwd_inner_microstep: 1882.09 | bwd_allreduce_microstep: 124.52 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13375
total_samples=19450, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:34,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.96 | bwd_microstep: 2186.28 | bwd_inner_microstep: 2052.17 | bwd_allreduce_microstep: 134.05 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13175
total_samples=19454, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:37,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.17 | bwd_microstep: 2172.99 | bwd_inner_microstep: 2056.10 | bwd_allreduce_microstep: 116.82 | step_microstep: 0.83
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11843
total_samples=19457, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:39,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 05:38:39,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.17 | bwd_microstep: 1747.98 | bwd_inner_microstep: 1553.48 | bwd_allreduce_microstep: 194.43 | step_microstep: 110.93
[2025-08-03 05:38:39,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.26 | bwd: 8113.98 | bwd_inner: 7543.84 | bwd_allreduce: 569.88 | step: 112.14
{'loss': 0.7398, 'learning_rate': 6.061242437507131e-06, 'epoch': 0.64}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13009
total_samples=19461, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:42,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.80 | bwd_microstep: 1969.06 | bwd_inner_microstep: 1865.98 | bwd_allreduce_microstep: 103.01 | step_microstep: 0.28
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12922
total_samples=19465, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:45,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.48 | bwd_microstep: 1733.31 | bwd_inner_microstep: 1660.81 | bwd_allreduce_microstep: 72.43 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13642
total_samples=19469, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:47,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 756.84 | bwd_microstep: 1979.51 | bwd_inner_microstep: 1738.09 | bwd_allreduce_microstep: 241.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13797
total_samples=19473, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:50,624] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 05:38:50,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.68 | bwd_microstep: 1841.35 | bwd_inner_microstep: 1731.32 | bwd_allreduce_microstep: 109.96 | step_microstep: 112.85
[2025-08-03 05:38:50,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2875.74 | bwd: 7523.28 | bwd_inner: 6996.19 | bwd_allreduce: 526.84 | step: 113.36
 64%|██████▍   | 1276/2000 [3:55:10<2:11:08, 10.87s/it]
 64%|██████▍   | 1277/2000 [3:55:20<2:10:58, 10.87s/it]
 64%|██████▍   | 1278/2000 [3:55:32<2:11:45, 10.95s/it]
 64%|██████▍   | 1279/2000 [3:55:43<2:12:28, 11.02s/it]
 64%|██████▍   | 1280/2000 [3:55:54<2:13:40, 11.14s/it]
 64%|██████▍   | 1281/2000 [3:56:05<2:12:15, 11.04s/it]
{'loss': 0.7437, 'learning_rate': 6.0463628770598574e-06, 'epoch': 0.64}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11795
total_samples=19476, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:53,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.17 | bwd_microstep: 1781.03 | bwd_inner_microstep: 1547.94 | bwd_allreduce_microstep: 233.01 | step_microstep: 0.15
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13153
total_samples=19480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:55,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.52 | bwd_microstep: 1723.03 | bwd_inner_microstep: 1641.31 | bwd_allreduce_microstep: 81.64 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12924
total_samples=19484, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:38:58,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.78 | bwd_microstep: 2063.69 | bwd_inner_microstep: 1767.94 | bwd_allreduce_microstep: 295.69 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14359
total_samples=19488, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:01,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70
[2025-08-03 05:39:01,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.41 | bwd_microstep: 1929.16 | bwd_inner_microstep: 1786.84 | bwd_allreduce_microstep: 142.26 | step_microstep: 134.26
[2025-08-03 05:39:01,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.81 | bwd: 7496.96 | bwd_inner: 6744.03 | bwd_allreduce: 752.69 | step: 134.75
{'loss': 0.7442, 'learning_rate': 6.0314936845650296e-06, 'epoch': 0.64}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12529
total_samples=19491, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:03,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.17 | bwd_microstep: 1778.16 | bwd_inner_microstep: 1586.46 | bwd_allreduce_microstep: 191.63 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13351
total_samples=19495, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:06,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.07 | bwd_microstep: 1858.44 | bwd_inner_microstep: 1726.11 | bwd_allreduce_microstep: 132.27 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14095
total_samples=19499, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:09,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.43 | bwd_microstep: 2359.05 | bwd_inner_microstep: 2352.71 | bwd_allreduce_microstep: 6.27 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13453
total_samples=19503, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:12,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 05:39:12,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.59 | bwd_microstep: 1829.48 | bwd_inner_microstep: 1733.92 | bwd_allreduce_microstep: 95.50 | step_microstep: 425.37
[2025-08-03 05:39:12,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.19 | bwd: 7825.19 | bwd_inner: 7399.20 | bwd_allreduce: 425.75 | step: 426.04
{'loss': 0.7382, 'learning_rate': 6.016634899015369e-06, 'epoch': 0.64}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13318
total_samples=19507, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:15,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.02 | bwd_microstep: 2036.29 | bwd_inner_microstep: 1796.87 | bwd_allreduce_microstep: 239.35 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12983
total_samples=19511, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:18,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.73 | bwd_microstep: 1720.69 | bwd_inner_microstep: 1643.26 | bwd_allreduce_microstep: 77.37 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13484
total_samples=19515, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:20,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.37 | bwd_microstep: 1762.55 | bwd_inner_microstep: 1703.63 | bwd_allreduce_microstep: 58.84 | step_microstep: 0.36
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13888
total_samples=19520, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:23,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.79
[2025-08-03 05:39:23,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.08 | bwd_microstep: 1848.01 | bwd_inner_microstep: 1733.72 | bwd_allreduce_microstep: 114.23 | step_microstep: 156.19
[2025-08-03 05:39:23,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.13 | bwd: 7367.60 | bwd_inner: 6877.47 | bwd_allreduce: 489.88 | step: 156.94
{'loss': 0.7522, 'learning_rate': 6.00178655937631e-06, 'epoch': 0.64}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11687
total_samples=19523, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:26,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.37 | bwd_microstep: 1796.36 | bwd_inner_microstep: 1554.31 | bwd_allreduce_microstep: 241.98 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12785
total_samples=19527, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:28,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.47 | bwd_microstep: 1839.30 | bwd_inner_microstep: 1605.76 | bwd_allreduce_microstep: 233.47 | step_microstep: 0.42
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11685
total_samples=19530, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:31,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.78 | bwd_microstep: 2070.57 | bwd_inner_microstep: 1599.32 | bwd_allreduce_microstep: 471.15 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13390
total_samples=19534, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:34,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.75
[2025-08-03 05:39:34,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.29 | bwd_microstep: 1935.85 | bwd_inner_microstep: 1859.87 | bwd_allreduce_microstep: 75.91 | step_microstep: 115.21
[2025-08-03 05:39:34,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.83 | bwd: 7642.14 | bwd_inner: 6619.27 | bwd_allreduce: 1022.58 | step: 115.88
{'loss': 0.7338, 'learning_rate': 5.986948704585895e-06, 'epoch': 0.64}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13844
total_samples=19538, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:36,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.46 | bwd_microstep: 1765.31 | bwd_inner_microstep: 1700.97 | bwd_allreduce_microstep: 64.28 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15416
total_samples=19542, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:39,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.93 | bwd_microstep: 2102.90 | bwd_inner_microstep: 1967.86 | bwd_allreduce_microstep: 134.98 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13420
total_samples=19547, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:42,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1091.26 | bwd_microstep: 1821.81 | bwd_inner_microstep: 1735.89 | bwd_allreduce_microstep: 85.85 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13142
total_samples=19551, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:45,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 05:39:45,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.45 | bwd_microstep: 2000.56 | bwd_inner_microstep: 1700.77 | bwd_allreduce_microstep: 299.72 | step_microstep: 125.87
[2025-08-03 05:39:45,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3204.03 | bwd: 7690.62 | bwd_inner: 7105.48 | bwd_allreduce: 584.90 | step: 126.46
{'loss': 0.7438, 'learning_rate': 5.972121373554665e-06, 'epoch': 0.64}
64%|██████▍   | 1282/2000 [3:56:16<2:11:08, 10.96s/it]
64%|██████▍   | 1283/2000 [3:56:27<2:12:24, 11.08s/it]
64%|██████▍   | 1284/2000 [3:56:38<2:10:36, 10.94s/it]
64%|██████▍   | 1285/2000 [3:56:49<2:10:13, 10.93s/it]
64%|██████▍   | 1286/2000 [3:57:00<2:11:26, 11.05s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12142
total_samples=19554, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:48,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.41 | bwd_microstep: 1862.56 | bwd_inner_microstep: 1739.67 | bwd_allreduce_microstep: 122.83 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15919
total_samples=19559, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:50,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.00 | bwd_microstep: 1797.25 | bwd_inner_microstep: 1735.75 | bwd_allreduce_microstep: 61.43 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11708
total_samples=19562, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:53,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.89 | bwd_microstep: 2113.07 | bwd_inner_microstep: 1830.01 | bwd_allreduce_microstep: 282.99 | step_microstep: 0.77
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14011
total_samples=19566, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:56,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 05:39:56,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.22 | bwd_microstep: 1844.14 | bwd_inner_microstep: 1733.29 | bwd_allreduce_microstep: 110.79 | step_microstep: 113.26
[2025-08-03 05:39:56,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.45 | bwd: 7617.07 | bwd_inner: 7038.72 | bwd_allreduce: 578.11 | step: 114.39
{'loss': 0.7287, 'learning_rate': 5.957304605165567e-06, 'epoch': 0.64}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13155
total_samples=19570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:39:59,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.47 | bwd_microstep: 2033.70 | bwd_inner_microstep: 1692.79 | bwd_allreduce_microstep: 340.84 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14907
total_samples=19574, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:01,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.78 | bwd_microstep: 1758.77 | bwd_inner_microstep: 1733.21 | bwd_allreduce_microstep: 25.49 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11923
total_samples=19577, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:04,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.66 | bwd_microstep: 1771.26 | bwd_inner_microstep: 1563.05 | bwd_allreduce_microstep: 208.11 | step_microstep: 0.20
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13259
total_samples=19581, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:07,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 05:40:07,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.96 | bwd_microstep: 1795.54 | bwd_inner_microstep: 1674.96 | bwd_allreduce_microstep: 120.51 | step_microstep: 113.33
[2025-08-03 05:40:07,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.80 | bwd: 7359.33 | bwd_inner: 6664.01 | bwd_allreduce: 695.05 | step: 113.92
{'loss': 0.748, 'learning_rate': 5.942498438273849e-06, 'epoch': 0.64}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13497
total_samples=19586, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:09,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.36 | bwd_microstep: 2055.56 | bwd_inner_microstep: 1901.89 | bwd_allreduce_microstep: 153.60 | step_microstep: 0.75
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12451
total_samples=19590, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:12,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.48 | bwd_microstep: 1727.02 | bwd_inner_microstep: 1574.39 | bwd_allreduce_microstep: 152.56 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13682
total_samples=19594, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:14,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.16 | bwd_microstep: 1752.76 | bwd_inner_microstep: 1690.93 | bwd_allreduce_microstep: 61.76 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13236
total_samples=19598, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:17,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.23
[2025-08-03 05:40:17,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.28 | bwd_microstep: 1806.76 | bwd_inner_microstep: 1700.00 | bwd_allreduce_microstep: 106.69 | step_microstep: 133.07
[2025-08-03 05:40:17,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.21 | bwd: 7342.16 | bwd_inner: 6867.21 | bwd_allreduce: 474.70 | step: 134.09
{'loss': 0.735, 'learning_rate': 5.927702911706961e-06, 'epoch': 0.64}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12100
total_samples=19601, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:20,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.47 | bwd_microstep: 1979.58 | bwd_inner_microstep: 1698.15 | bwd_allreduce_microstep: 281.37 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13304
total_samples=19605, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:23,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.02 | bwd_microstep: 2008.88 | bwd_inner_microstep: 1899.28 | bwd_allreduce_microstep: 109.53 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12462
total_samples=19608, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:26,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.96 | bwd_microstep: 2346.98 | bwd_inner_microstep: 2135.00 | bwd_allreduce_microstep: 211.92 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13523
total_samples=19612, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:28,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.29
[2025-08-03 05:40:28,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.05 | bwd_microstep: 1760.15 | bwd_inner_microstep: 1691.26 | bwd_allreduce_microstep: 68.82 | step_microstep: 120.24
[2025-08-03 05:40:28,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.42 | bwd: 8095.65 | bwd_inner: 7423.68 | bwd_allreduce: 671.73 | step: 120.78
{'loss': 0.7352, 'learning_rate': 5.912918064264441e-06, 'epoch': 0.65}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11598
total_samples=19615, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:31,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.87 | bwd_microstep: 2016.63 | bwd_inner_microstep: 1839.13 | bwd_allreduce_microstep: 177.44 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12158
total_samples=19618, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:34,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.01 | bwd_microstep: 2017.81 | bwd_inner_microstep: 1792.13 | bwd_allreduce_microstep: 225.61 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13213
total_samples=19622, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:37,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.61 | bwd_microstep: 1748.65 | bwd_inner_microstep: 1671.71 | bwd_allreduce_microstep: 76.87 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13537
total_samples=19626, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:39,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 05:40:39,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.60 | bwd_microstep: 1819.76 | bwd_inner_microstep: 1729.33 | bwd_allreduce_microstep: 90.36 | step_microstep: 133.13
[2025-08-03 05:40:39,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.02 | bwd: 7602.92 | bwd_inner: 7032.29 | bwd_allreduce: 570.36 | step: 133.74
{'loss': 0.7315, 'learning_rate': 5.898143934717831e-06, 'epoch': 0.65}
64%|██████▍   | 1287/2000 [3:57:11<2:10:35, 10.99s/it]
64%|██████▍   | 1288/2000 [3:57:21<2:08:53, 10.86s/it]
64%|██████▍   | 1289/2000 [3:57:32<2:07:43, 10.78s/it]
64%|██████▍   | 1290/2000 [3:57:43<2:09:34, 10.95s/it]
65%|██████▍   | 1291/2000 [3:57:54<2:08:57, 10.91s/it]
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12818
total_samples=19630, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:42,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.56 | bwd_microstep: 1832.23 | bwd_inner_microstep: 1617.77 | bwd_allreduce_microstep: 214.39 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15171
total_samples=19634, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:45,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.31 | bwd_microstep: 1771.59 | bwd_inner_microstep: 1751.24 | bwd_allreduce_microstep: 20.29 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14715
total_samples=19638, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:47,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.13 | bwd_microstep: 1754.57 | bwd_inner_microstep: 1739.78 | bwd_allreduce_microstep: 14.72 | step_microstep: 0.14
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13228
total_samples=19643, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:50,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 05:40:50,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.12 | bwd_microstep: 1819.39 | bwd_inner_microstep: 1707.36 | bwd_allreduce_microstep: 111.95 | step_microstep: 114.39
[2025-08-03 05:40:50,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.05 | bwd: 7177.82 | bwd_inner: 6816.15 | bwd_allreduce: 361.43 | step: 114.93
{'loss': 0.7484, 'learning_rate': 5.8833805618105635e-06, 'epoch': 0.65}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12898
total_samples=19647, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:52,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.27 | bwd_microstep: 1732.75 | bwd_inner_microstep: 1613.73 | bwd_allreduce_microstep: 118.96 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13161
total_samples=19651, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:55,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.93 | bwd_microstep: 2022.71 | bwd_inner_microstep: 1915.27 | bwd_allreduce_microstep: 107.37 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13132
total_samples=19655, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:40:58,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.84 | bwd_microstep: 2033.62 | bwd_inner_microstep: 1898.04 | bwd_allreduce_microstep: 135.51 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12846
total_samples=19659, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:01,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:41:01,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.48 | bwd_microstep: 2066.43 | bwd_inner_microstep: 1701.95 | bwd_allreduce_microstep: 364.41 | step_microstep: 146.82
[2025-08-03 05:41:01,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.43 | bwd: 7855.57 | bwd_inner: 7128.98 | bwd_allreduce: 726.34 | step: 147.38
{'loss': 0.7443, 'learning_rate': 5.868627984257862e-06, 'epoch': 0.65}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11779
total_samples=19662, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:04,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.61 | bwd_microstep: 1990.78 | bwd_inner_microstep: 1798.57 | bwd_allreduce_microstep: 192.13 | step_microstep: 0.19
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13884
total_samples=19666, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:06,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.85 | bwd_microstep: 1700.70 | bwd_inner_microstep: 1661.72 | bwd_allreduce_microstep: 38.92 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13057
total_samples=19670, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:09,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.36 | bwd_microstep: 1843.26 | bwd_inner_microstep: 1670.87 | bwd_allreduce_microstep: 172.32 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12023
total_samples=19673, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:11,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 05:41:11,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.82 | bwd_microstep: 1772.73 | bwd_inner_microstep: 1598.82 | bwd_allreduce_microstep: 173.84 | step_microstep: 115.17
[2025-08-03 05:41:11,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2744.58 | bwd: 7307.52 | bwd_inner: 6729.97 | bwd_allreduce: 577.30 | step: 115.74
{'loss': 0.7379, 'learning_rate': 5.853886240746643e-06, 'epoch': 0.65}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12036
total_samples=19676, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:14,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.33 | bwd_microstep: 2011.79 | bwd_inner_microstep: 1808.77 | bwd_allreduce_microstep: 202.96 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15581
total_samples=19681, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:17,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.15 | bwd_microstep: 1792.05 | bwd_inner_microstep: 1766.47 | bwd_allreduce_microstep: 25.51 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11735
total_samples=19684, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:19,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.09 | bwd_microstep: 1841.58 | bwd_inner_microstep: 1592.54 | bwd_allreduce_microstep: 248.97 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14068
total_samples=19688, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:22,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.73
[2025-08-03 05:41:22,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.11 | bwd_microstep: 1972.16 | bwd_inner_microstep: 1897.49 | bwd_allreduce_microstep: 74.60 | step_microstep: 111.13
[2025-08-03 05:41:22,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.62 | bwd: 7617.62 | bwd_inner: 7065.25 | bwd_allreduce: 552.12 | step: 111.76
{'loss': 0.7466, 'learning_rate': 5.839155369935407e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13630
total_samples=19692, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:25,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.55 | bwd_microstep: 1752.99 | bwd_inner_microstep: 1723.83 | bwd_allreduce_microstep: 29.09 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12760
total_samples=19696, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:27,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.10 | bwd_microstep: 1746.00 | bwd_inner_microstep: 1621.36 | bwd_allreduce_microstep: 124.57 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13036
total_samples=19700, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:30,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.76 | bwd_microstep: 2054.72 | bwd_inner_microstep: 1710.07 | bwd_allreduce_microstep: 344.60 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13370
total_samples=19704, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:33,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:41:33,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.30 | bwd_microstep: 1749.68 | bwd_inner_microstep: 1687.24 | bwd_allreduce_microstep: 62.37 | step_microstep: 152.66
[2025-08-03 05:41:33,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2760.63 | bwd: 7303.46 | bwd_inner: 6742.49 | bwd_allreduce: 560.71 | step: 153.25
{'loss': 0.7389, 'learning_rate': 5.82443541045415e-06, 'epoch': 0.65}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12615
total_samples=19708, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:35,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.49 | bwd_microstep: 1943.54 | bwd_inner_microstep: 1616.75 | bwd_allreduce_microstep: 326.72 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13491
total_samples=19713, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:38,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.52 | bwd_microstep: 1899.72 | bwd_inner_microstep: 1888.98 | bwd_allreduce_microstep: 10.68 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11695
total_samples=19717, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:41,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.14 | bwd_microstep: 2133.80 | bwd_inner_microstep: 1915.81 | bwd_allreduce_microstep: 217.93 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13963
total_samples=19721, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:44,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 05:41:44,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.53 | bwd_microstep: 1773.88 | bwd_inner_microstep: 1703.09 | bwd_allreduce_microstep: 70.72 | step_microstep: 141.02
[2025-08-03 05:41:44,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.62 | bwd: 7750.99 | bwd_inner: 7124.63 | bwd_allreduce: 626.13 | step: 141.48
65%|██████▍   | 1292/2000 [3:58:05<2:07:11, 10.78s/it]
65%|██████▍   | 1293/2000 [3:58:16<2:08:04, 10.87s/it]
65%|██████▍   | 1294/2000 [3:58:26<2:06:30, 10.75s/it]
65%|██████▍   | 1295/2000 [3:58:37<2:06:44, 10.79s/it]
65%|██████▍   | 1296/2000 [3:58:48<2:05:38, 10.71s/it]
{'loss': 0.748, 'learning_rate': 5.809726400904242e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13991
total_samples=19725, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:47,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 939.87 | bwd_microstep: 1792.52 | bwd_inner_microstep: 1728.14 | bwd_allreduce_microstep: 64.30 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13569
total_samples=19729, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:49,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.49 | bwd_microstep: 1750.14 | bwd_inner_microstep: 1696.63 | bwd_allreduce_microstep: 53.45 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13689
total_samples=19733, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:52,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.28 | bwd_microstep: 1983.21 | bwd_inner_microstep: 1778.28 | bwd_allreduce_microstep: 204.86 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11649
total_samples=19736, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:55,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 05:41:55,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.34 | bwd_microstep: 1789.65 | bwd_inner_microstep: 1541.52 | bwd_allreduce_microstep: 248.06 | step_microstep: 135.52
[2025-08-03 05:41:55,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3099.91 | bwd: 7315.57 | bwd_inner: 6744.58 | bwd_allreduce: 570.74 | step: 135.99
{'loss': 0.7459, 'learning_rate': 5.795028379858355e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13657
total_samples=19740, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:41:57,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.23 | bwd_microstep: 1825.33 | bwd_inner_microstep: 1725.95 | bwd_allreduce_microstep: 99.32 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12941
total_samples=19744, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:00,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.61 | bwd_microstep: 2113.06 | bwd_inner_microstep: 1828.77 | bwd_allreduce_microstep: 284.23 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13377
total_samples=19748, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:03,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.21 | bwd_microstep: 1789.43 | bwd_inner_microstep: 1710.69 | bwd_allreduce_microstep: 78.67 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14508
total_samples=19752, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:05,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 05:42:05,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.20 | bwd_microstep: 1773.23 | bwd_inner_microstep: 1729.11 | bwd_allreduce_microstep: 44.06 | step_microstep: 138.38
[2025-08-03 05:42:05,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.16 | bwd: 7501.12 | bwd_inner: 6994.51 | bwd_allreduce: 506.36 | step: 138.89
{'loss': 0.7398, 'learning_rate': 5.780341385860333e-06, 'epoch': 0.65}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11865
total_samples=19755, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:08,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.32 | bwd_microstep: 1843.99 | bwd_inner_microstep: 1560.64 | bwd_allreduce_microstep: 283.28 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11651
total_samples=19758, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:10,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.04 | bwd_microstep: 1744.02 | bwd_inner_microstep: 1555.81 | bwd_allreduce_microstep: 188.14 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13359
total_samples=19762, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:13,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.89 | bwd_microstep: 1796.18 | bwd_inner_microstep: 1704.13 | bwd_allreduce_microstep: 91.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14010
total_samples=19766, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:16,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 05:42:16,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.76 | bwd_microstep: 1739.47 | bwd_inner_microstep: 1704.47 | bwd_allreduce_microstep: 34.93 | step_microstep: 116.20
[2025-08-03 05:42:16,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2756.95 | bwd: 7123.71 | bwd_inner: 6525.04 | bwd_allreduce: 598.42 | step: 116.73
{'loss': 0.7393, 'learning_rate': 5.765665457425102e-06, 'epoch': 0.65}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13851
total_samples=19770, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:18,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.04 | bwd_microstep: 1825.69 | bwd_inner_microstep: 1682.23 | bwd_allreduce_microstep: 143.38 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12277
total_samples=19773, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:21,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.88 | bwd_microstep: 1861.99 | bwd_inner_microstep: 1617.18 | bwd_allreduce_microstep: 244.74 | step_microstep: 0.80
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12487
total_samples=19777, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:24,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.60 | bwd_microstep: 1971.62 | bwd_inner_microstep: 1793.89 | bwd_allreduce_microstep: 177.65 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13527
total_samples=19781, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:27,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 05:42:27,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.53 | bwd_microstep: 1959.09 | bwd_inner_microstep: 1853.96 | bwd_allreduce_microstep: 105.06 | step_microstep: 108.78
[2025-08-03 05:42:27,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.97 | bwd: 7618.44 | bwd_inner: 6947.26 | bwd_allreduce: 670.93 | step: 109.89
{'loss': 0.7416, 'learning_rate': 5.751000633038573e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14208
total_samples=19786, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:29,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.99 | bwd_microstep: 2175.81 | bwd_inner_microstep: 2120.14 | bwd_allreduce_microstep: 55.60 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13723
total_samples=19790, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:32,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.35 | bwd_microstep: 1872.54 | bwd_inner_microstep: 1686.40 | bwd_allreduce_microstep: 186.07 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11834
total_samples=19793, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:35,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.52 | bwd_microstep: 1874.10 | bwd_inner_microstep: 1698.39 | bwd_allreduce_microstep: 175.64 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13780
total_samples=19799, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:38,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.15
[2025-08-03 05:42:38,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.60 | bwd_microstep: 1899.13 | bwd_inner_microstep: 1830.44 | bwd_allreduce_microstep: 68.61 | step_microstep: 154.44
[2025-08-03 05:42:38,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.40 | bwd: 7821.63 | bwd_inner: 7335.36 | bwd_allreduce: 486.01 | step: 155.08
65%|██████▍   | 1297/2000 [3:58:59<2:06:38, 10.81s/it]
65%|██████▍   | 1298/2000 [3:59:09<2:06:32, 10.82s/it]
65%|██████▍   | 1299/2000 [3:59:20<2:06:07, 10.79s/it]
65%|██████▌   | 1300/2000 [3:59:31<2:04:20, 10.66s/it]
65%|██████▌   | 1301/2000 [3:59:41<2:04:53, 10.72s/it]
{'loss': 0.7343, 'learning_rate': 5.736346951157544e-06, 'epoch': 0.65}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12168
total_samples=19802, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:40,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.52 | bwd_microstep: 1842.18 | bwd_inner_microstep: 1614.95 | bwd_allreduce_microstep: 227.16 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11927
total_samples=19805, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:43,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.98 | bwd_microstep: 1809.92 | bwd_inner_microstep: 1564.03 | bwd_allreduce_microstep: 245.81 | step_microstep: 0.14
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12722
total_samples=19809, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:45,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.76 | bwd_microstep: 1833.02 | bwd_inner_microstep: 1627.23 | bwd_allreduce_microstep: 205.73 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13519
total_samples=19814, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:48,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 05:42:48,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.70 | bwd_microstep: 1996.71 | bwd_inner_microstep: 1913.81 | bwd_allreduce_microstep: 82.84 | step_microstep: 112.50
[2025-08-03 05:42:48,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2880.91 | bwd: 7481.89 | bwd_inner: 6720.02 | bwd_allreduce: 761.62 | step: 113.01
{'loss': 0.7434, 'learning_rate': 5.721704450209581e-06, 'epoch': 0.65}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12323
total_samples=19818, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:52,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.52 | bwd_microstep: 2337.36 | bwd_inner_microstep: 2331.24 | bwd_allreduce_microstep: 6.05 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13624
total_samples=19822, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:54,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.90 | bwd_microstep: 2066.13 | bwd_inner_microstep: 2034.48 | bwd_allreduce_microstep: 31.59 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11657
total_samples=19825, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:42:57,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.39 | bwd_microstep: 1982.57 | bwd_inner_microstep: 1784.55 | bwd_allreduce_microstep: 197.96 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11488
total_samples=19828, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:00,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70
[2025-08-03 05:43:00,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.78 | bwd_microstep: 1847.85 | bwd_inner_microstep: 1616.40 | bwd_allreduce_microstep: 231.38 | step_microstep: 142.52
[2025-08-03 05:43:00,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2892.52 | bwd: 8233.96 | bwd_inner: 7766.67 | bwd_allreduce: 467.06 | step: 142.99
{'loss': 0.7477, 'learning_rate': 5.707073168592943e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14403
total_samples=19833, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:02,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.98 | bwd_microstep: 1741.09 | bwd_inner_microstep: 1717.01 | bwd_allreduce_microstep: 24.01 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13598
total_samples=19837, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:05,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.14 | bwd_microstep: 1751.18 | bwd_inner_microstep: 1684.58 | bwd_allreduce_microstep: 66.54 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13645
total_samples=19841, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:08,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.05 | bwd_microstep: 2123.33 | bwd_inner_microstep: 1878.10 | bwd_allreduce_microstep: 245.17 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12285
total_samples=19844, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:11,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 05:43:11,359] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.33 | bwd_microstep: 2028.63 | bwd_inner_microstep: 1809.54 | bwd_allreduce_microstep: 219.02 | step_microstep: 114.96
[2025-08-03 05:43:11,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.42 | bwd: 7644.29 | bwd_inner: 7089.23 | bwd_allreduce: 554.82 | step: 115.44
{'loss': 0.7488, 'learning_rate': 5.692453144676451e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14330
total_samples=19848, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:14,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.10 | bwd_microstep: 1890.87 | bwd_inner_microstep: 1843.56 | bwd_allreduce_microstep: 47.24 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=19851, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:16,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.77 | bwd_microstep: 1732.51 | bwd_inner_microstep: 1536.55 | bwd_allreduce_microstep: 195.90 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13950
total_samples=19855, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:19,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.32 | bwd_microstep: 1819.75 | bwd_inner_microstep: 1745.20 | bwd_allreduce_microstep: 74.49 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14139
total_samples=19859, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:22,068] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 05:43:22,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.39 | bwd_microstep: 2029.85 | bwd_inner_microstep: 1896.31 | bwd_allreduce_microstep: 133.47 | step_microstep: 125.80
[2025-08-03 05:43:22,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.51 | bwd: 7473.02 | bwd_inner: 7021.61 | bwd_allreduce: 451.17 | step: 126.33
{'loss': 0.7495, 'learning_rate': 5.677844416799424e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13243
total_samples=19863, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:24,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.09 | bwd_microstep: 1770.53 | bwd_inner_microstep: 1691.52 | bwd_allreduce_microstep: 78.94 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12818
total_samples=19867, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:27,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.54 | bwd_microstep: 1841.60 | bwd_inner_microstep: 1662.18 | bwd_allreduce_microstep: 179.36 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13987
total_samples=19871, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:29,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.11 | bwd_microstep: 1845.27 | bwd_inner_microstep: 1754.20 | bwd_allreduce_microstep: 90.99 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14593
total_samples=19876, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:32,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 05:43:32,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.44 | bwd_microstep: 1822.63 | bwd_inner_microstep: 1800.21 | bwd_allreduce_microstep: 22.36 | step_microstep: 135.99
[2025-08-03 05:43:32,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.12 | bwd: 7280.08 | bwd_inner: 6908.10 | bwd_allreduce: 371.73 | step: 136.51
65%|██████▌   | 1302/2000 [3:59:52<2:05:56, 10.83s/it]
65%|██████▌   | 1303/2000 [4:00:03<2:05:30, 10.80s/it]
65%|██████▌   | 1304/2000 [4:00:15<2:08:03, 11.04s/it]
65%|██████▌   | 1305/2000 [4:00:26<2:07:27, 11.00s/it]
65%|██████▌   | 1306/2000 [4:00:36<2:06:15, 10.92s/it]
65%|██████▌   | 1307/2000 [4:00:47<2:04:56, 10.82s/it]
{'loss': 0.7464, 'learning_rate': 5.663247023271543e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13307
total_samples=19880, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:35,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.52 | bwd_microstep: 1690.57 | bwd_inner_microstep: 1644.78 | bwd_allreduce_microstep: 45.72 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15258
total_samples=19884, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:37,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.96 | bwd_microstep: 2048.35 | bwd_inner_microstep: 1930.24 | bwd_allreduce_microstep: 118.04 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14731
total_samples=19888, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:40,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.04 | bwd_microstep: 1903.73 | bwd_inner_microstep: 1743.69 | bwd_allreduce_microstep: 159.97 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11806
total_samples=19891, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:43,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 05:43:43,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.67 | bwd_microstep: 1783.96 | bwd_inner_microstep: 1564.93 | bwd_allreduce_microstep: 218.97 | step_microstep: 127.41
[2025-08-03 05:43:43,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2751.11 | bwd: 7426.67 | bwd_inner: 6883.65 | bwd_allreduce: 542.78 | step: 127.96
{'loss': 0.7526, 'learning_rate': 5.648661002372769e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13576
total_samples=19895, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:46,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.29 | bwd_microstep: 1963.60 | bwd_inner_microstep: 1867.14 | bwd_allreduce_microstep: 96.39 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11865
total_samples=19898, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:48,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.23 | bwd_microstep: 1762.67 | bwd_inner_microstep: 1567.66 | bwd_allreduce_microstep: 194.94 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13657
total_samples=19902, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:51,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.57 | bwd_microstep: 1996.22 | bwd_inner_microstep: 1784.28 | bwd_allreduce_microstep: 211.86 | step_microstep: 0.98
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11650
total_samples=19905, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:53,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:43:53,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.62 | bwd_microstep: 1738.55 | bwd_inner_microstep: 1538.23 | bwd_allreduce_microstep: 200.25 | step_microstep: 108.49
[2025-08-03 05:43:53,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.65 | bwd: 7461.09 | bwd_inner: 6757.30 | bwd_allreduce: 703.54 | step: 109.86
{'loss': 0.7327, 'learning_rate': 5.63408639235324e-06, 'epoch': 0.65}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15180
total_samples=19909, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:56,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.21 | bwd_microstep: 1753.92 | bwd_inner_microstep: 1736.54 | bwd_allreduce_microstep: 17.32 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11633
total_samples=19912, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:43:59,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.39 | bwd_microstep: 1858.76 | bwd_inner_microstep: 1535.19 | bwd_allreduce_microstep: 323.51 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13516
total_samples=19916, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:02,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.64 | bwd_microstep: 2121.48 | bwd_inner_microstep: 1933.13 | bwd_allreduce_microstep: 188.29 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12093
total_samples=19919, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:04,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 05:44:04,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.87 | bwd_microstep: 2046.27 | bwd_inner_microstep: 1814.37 | bwd_allreduce_microstep: 231.83 | step_microstep: 113.19
[2025-08-03 05:44:04,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.03 | bwd: 7780.49 | bwd_inner: 7019.22 | bwd_allreduce: 761.03 | step: 113.65
{'loss': 0.733, 'learning_rate': 5.619523231433177e-06, 'epoch': 0.66}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12973
total_samples=19923, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:07,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.39 | bwd_microstep: 1727.47 | bwd_inner_microstep: 1660.91 | bwd_allreduce_microstep: 66.48 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407
total_samples=19927, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:09,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.29 | bwd_microstep: 1769.56 | bwd_inner_microstep: 1698.68 | bwd_allreduce_microstep: 70.82 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11515
total_samples=19930, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:12,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.17 | bwd_microstep: 1822.45 | bwd_inner_microstep: 1582.56 | bwd_allreduce_microstep: 239.81 | step_microstep: 0.17
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12586
total_samples=19934, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:15,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 05:44:15,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.54 | bwd_microstep: 1806.17 | bwd_inner_microstep: 1632.48 | bwd_allreduce_microstep: 173.62 | step_microstep: 117.94
[2025-08-03 05:44:15,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.32 | bwd: 7125.71 | bwd_inner: 6574.63 | bwd_allreduce: 550.82 | step: 118.47
{'loss': 0.7447, 'learning_rate': 5.604971557802769e-06, 'epoch': 0.66}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12806
total_samples=19938, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:17,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.95 | bwd_microstep: 1792.35 | bwd_inner_microstep: 1669.92 | bwd_allreduce_microstep: 122.35 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13507
total_samples=19942, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:20,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.56 | bwd_microstep: 1888.43 | bwd_inner_microstep: 1841.48 | bwd_allreduce_microstep: 46.89 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13988
total_samples=19946, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:23,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.92 | bwd_microstep: 2042.79 | bwd_inner_microstep: 1936.68 | bwd_allreduce_microstep: 106.04 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12009
total_samples=19949, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:26,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54
[2025-08-03 05:44:26,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.30 | bwd_microstep: 1963.18 | bwd_inner_microstep: 1727.22 | bwd_allreduce_microstep: 235.89 | step_microstep: 129.12
[2025-08-03 05:44:26,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.66 | bwd: 7686.80 | bwd_inner: 7175.31 | bwd_allreduce: 511.26 | step: 129.60
{'loss': 0.738, 'learning_rate': 5.590431409622081e-06, 'epoch': 0.66}
65%|██████▌   | 1308/2000 [4:00:58<2:04:02, 10.75s/it]
65%|██████▌   | 1309/2000 [4:01:08<2:03:38, 10.74s/it]
66%|██████▌   | 1310/2000 [4:01:19<2:04:16, 10.81s/it]
66%|██████▌   | 1311/2000 [4:01:30<2:02:36, 10.68s/it]
66%|██████▌   | 1312/2000 [4:01:41<2:03:25, 10.76s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14554
total_samples=19955, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:28,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.13 | bwd_microstep: 1766.12 | bwd_inner_microstep: 1720.35 | bwd_allreduce_microstep: 45.70 | step_microstep: 0.35
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13217
total_samples=19959, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:31,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.71 | bwd_microstep: 1968.69 | bwd_inner_microstep: 1840.47 | bwd_allreduce_microstep: 128.15 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13329
total_samples=19963, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:34,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 914.85 | bwd_microstep: 1782.68 | bwd_inner_microstep: 1712.98 | bwd_allreduce_microstep: 69.64 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13286
total_samples=19967, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:37,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 05:44:37,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.28 | bwd_microstep: 2024.86 | bwd_inner_microstep: 1735.98 | bwd_allreduce_microstep: 288.81 | step_microstep: 111.97
[2025-08-03 05:44:37,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3064.90 | bwd: 7542.41 | bwd_inner: 7009.77 | bwd_allreduce: 532.39 | step: 112.58
{'loss': 0.7434, 'learning_rate': 5.575902825020962e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14807
total_samples=19971, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:39,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.82 | bwd_microstep: 1894.27 | bwd_inner_microstep: 1869.03 | bwd_allreduce_microstep: 25.18 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13497
total_samples=19975, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:42,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.25 | bwd_microstep: 1799.60 | bwd_inner_microstep: 1694.83 | bwd_allreduce_microstep: 104.67 | step_microstep: 0.19
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12861
total_samples=19979, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:45,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.88 | bwd_microstep: 1834.19 | bwd_inner_microstep: 1632.33 | bwd_allreduce_microstep: 201.79 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12670
total_samples=19983, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:48,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 05:44:48,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.78 | bwd_microstep: 2308.05 | bwd_inner_microstep: 1987.60 | bwd_allreduce_microstep: 320.38 | step_microstep: 154.91
[2025-08-03 05:44:48,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.66 | bwd: 7836.18 | bwd_inner: 7183.77 | bwd_allreduce: 652.13 | step: 155.34
{'loss': 0.7468, 'learning_rate': 5.56138584209893e-06, 'epoch': 0.66}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11863
total_samples=19986, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:50,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.78 | bwd_microstep: 1779.59 | bwd_inner_microstep: 1568.18 | bwd_allreduce_microstep: 211.29 | step_microstep: 0.61
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14921
total_samples=19991, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:53,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.57 | bwd_microstep: 1781.62 | bwd_inner_microstep: 1726.15 | bwd_allreduce_microstep: 55.40 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12202
total_samples=19994, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:56,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.12 | bwd_microstep: 1970.63 | bwd_inner_microstep: 1763.68 | bwd_allreduce_microstep: 206.86 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14284
total_samples=19998, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:44:58,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.33
[2025-08-03 05:44:58,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.80 | bwd_microstep: 1834.38 | bwd_inner_microstep: 1761.91 | bwd_allreduce_microstep: 72.41 | step_microstep: 125.99
[2025-08-03 05:44:59,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.16 | bwd: 7366.29 | bwd_inner: 6819.93 | bwd_allreduce: 546.08 | step: 127.05
{'loss': 0.7404, 'learning_rate': 5.546880498925079e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14605
total_samples=20002, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:01,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.54 | bwd_microstep: 2048.15 | bwd_inner_microstep: 1948.30 | bwd_allreduce_microstep: 99.79 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11791
total_samples=20006, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:04,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.45 | bwd_microstep: 2030.21 | bwd_inner_microstep: 1817.11 | bwd_allreduce_microstep: 213.03 | step_microstep: 0.31
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13320
total_samples=20010, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:07,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.97 | bwd_microstep: 1781.99 | bwd_inner_microstep: 1697.03 | bwd_allreduce_microstep: 84.89 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13805
total_samples=20014, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:10,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.29
[2025-08-03 05:45:10,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.04 | bwd_microstep: 2050.48 | bwd_inner_microstep: 1938.27 | bwd_allreduce_microstep: 112.14 | step_microstep: 139.93
[2025-08-03 05:45:10,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.91 | bwd: 7910.88 | bwd_inner: 7400.70 | bwd_allreduce: 509.93 | step: 140.48
{'loss': 0.7448, 'learning_rate': 5.5323868335379775e-06, 'epoch': 0.66}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11831
total_samples=20017, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:13,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.70 | bwd_microstep: 2042.76 | bwd_inner_microstep: 1786.47 | bwd_allreduce_microstep: 256.22 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11985
total_samples=20020, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:15,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.09 | bwd_microstep: 1741.47 | bwd_inner_microstep: 1549.05 | bwd_allreduce_microstep: 192.36 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11869
total_samples=20023, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:18,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.84 | bwd_microstep: 2247.42 | bwd_inner_microstep: 2238.75 | bwd_allreduce_microstep: 8.61 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15368
total_samples=20027, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:21,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 05:45:21,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.94 | bwd_microstep: 1911.32 | bwd_inner_microstep: 1809.87 | bwd_allreduce_microstep: 101.38 | step_microstep: 124.32
[2025-08-03 05:45:21,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.50 | bwd: 7943.02 | bwd_inner: 7384.14 | bwd_allreduce: 558.65 | step: 124.95
{'loss': 0.7543, 'learning_rate': 5.517904883945577e-06, 'epoch': 0.66}
66%|██████▌   | 1312/2000 [4:01:41<2:03:25, 10.76s/it]
66%|██████▌   | 1313/2000 [4:01:52<2:04:10, 10.85s/it]
66%|██████▌   | 1314/2000 [4:02:03<2:04:53, 10.92s/it]
66%|██████▌   | 1315/2000 [4:02:13<2:03:33, 10.82s/it]
66%|██████▌   | 1316/2000 [4:02:25<2:04:41, 10.94s/it]
66%|██████▌   | 1317/2000 [4:02:36<2:05:23, 11.02s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12765
total_samples=20031, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:24,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.91 | bwd_microstep: 1825.05 | bwd_inner_microstep: 1667.60 | bwd_allreduce_microstep: 157.37 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11757
total_samples=20034, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:26,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.91 | bwd_microstep: 1794.63 | bwd_inner_microstep: 1550.16 | bwd_allreduce_microstep: 244.40 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14295
total_samples=20038, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:29,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.97 | bwd_microstep: 2010.88 | bwd_inner_microstep: 1874.75 | bwd_allreduce_microstep: 136.06 | step_microstep: 0.14
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13031
total_samples=20042, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:32,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.51
[2025-08-03 05:45:32,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.88 | bwd_microstep: 1921.05 | bwd_inner_microstep: 1660.30 | bwd_allreduce_microstep: 260.68 | step_microstep: 127.42
[2025-08-03 05:45:32,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.59 | bwd: 7551.68 | bwd_inner: 6752.81 | bwd_allreduce: 798.61 | step: 127.95
{'loss': 0.7396, 'learning_rate': 5.503434688125104e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13752
total_samples=20046, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:35,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.73 | bwd_microstep: 2034.97 | bwd_inner_microstep: 1896.88 | bwd_allreduce_microstep: 138.02 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13910
total_samples=20050, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:37,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.27 | bwd_microstep: 1884.41 | bwd_inner_microstep: 1754.12 | bwd_allreduce_microstep: 130.22 | step_microstep: 0.17
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12423
total_samples=20054, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:40,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.34 | bwd_microstep: 2061.32 | bwd_inner_microstep: 1840.93 | bwd_allreduce_microstep: 220.33 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14357
total_samples=20058, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:43,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 05:45:43,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.55 | bwd_microstep: 1775.32 | bwd_inner_microstep: 1768.65 | bwd_allreduce_microstep: 6.61 | step_microstep: 148.39
[2025-08-03 05:45:43,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.83 | bwd: 7756.07 | bwd_inner: 7260.57 | bwd_allreduce: 495.26 | step: 148.92
{'loss': 0.7425, 'learning_rate': 5.488976284022953e-06, 'epoch': 0.66}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14298
total_samples=20062, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:45,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.26 | bwd_microstep: 1759.35 | bwd_inner_microstep: 1705.03 | bwd_allreduce_microstep: 54.23 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11681
total_samples=20065, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:48,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.58 | bwd_microstep: 1758.21 | bwd_inner_microstep: 1552.63 | bwd_allreduce_microstep: 205.52 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11819
total_samples=20069, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:50,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.94 | bwd_microstep: 1786.58 | bwd_inner_microstep: 1559.63 | bwd_allreduce_microstep: 226.88 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13350
total_samples=20073, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:53,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.96
[2025-08-03 05:45:53,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.25 | bwd_microstep: 1734.97 | bwd_inner_microstep: 1673.56 | bwd_allreduce_microstep: 61.33 | step_microstep: 140.20
[2025-08-03 05:45:53,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.95 | bwd: 7039.15 | bwd_inner: 6490.84 | bwd_allreduce: 548.04 | step: 140.68
{'loss': 0.7509, 'learning_rate': 5.4745297095546125e-06, 'epoch': 0.66}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14359
total_samples=20077, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:56,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.77 | bwd_microstep: 1757.83 | bwd_inner_microstep: 1688.67 | bwd_allreduce_microstep: 69.09 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11579
total_samples=20080, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:45:58,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.98 | bwd_microstep: 1809.37 | bwd_inner_microstep: 1580.55 | bwd_allreduce_microstep: 228.75 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11613
total_samples=20083, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:01,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.13 | bwd_microstep: 1849.01 | bwd_inner_microstep: 1555.02 | bwd_allreduce_microstep: 293.84 | step_microstep: 0.71
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13774
total_samples=20087, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:04,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 33.78
[2025-08-03 05:46:04,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.53 | bwd_microstep: 1892.43 | bwd_inner_microstep: 1706.05 | bwd_allreduce_microstep: 186.31 | step_microstep: 144.94
[2025-08-03 05:46:04,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.33 | bwd: 7308.71 | bwd_inner: 6530.28 | bwd_allreduce: 778.13 | step: 145.93
{'loss': 0.7439, 'learning_rate': 5.460095002604533e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13188
total_samples=20091, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:06,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.79 | bwd_microstep: 2017.34 | bwd_inner_microstep: 1880.80 | bwd_allreduce_microstep: 136.47 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14038
total_samples=20095, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:09,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.43 | bwd_microstep: 1815.74 | bwd_inner_microstep: 1740.16 | bwd_allreduce_microstep: 75.52 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13526
total_samples=20099, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:12,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.90 | bwd_microstep: 1968.47 | bwd_inner_microstep: 1854.59 | bwd_allreduce_microstep: 113.81 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14502
total_samples=20103, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:14,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.84
[2025-08-03 05:46:14,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.87 | bwd_microstep: 1743.62 | bwd_inner_microstep: 1722.04 | bwd_allreduce_microstep: 21.52 | step_microstep: 143.05
[2025-08-03 05:46:14,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.91 | bwd: 7545.22 | bwd_inner: 7197.58 | bwd_allreduce: 347.40 | step: 143.62
{'loss': 0.7436, 'learning_rate': 5.445672201026054e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13325
total_samples=20107, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:17,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.57 | bwd_microstep: 2059.17 | bwd_inner_microstep: 1783.19 | bwd_allreduce_microstep: 275.91 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13547
total_samples=20111, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:20,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.08 | bwd_microstep: 1790.67 | bwd_inner_microstep: 1698.62 | bwd_allreduce_microstep: 91.98 | step_microstep: 0.28
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13491
total_samples=20115, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:23,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.45 | bwd_microstep: 2115.63 | bwd_inner_microstep: 1948.80 | bwd_allreduce_microstep: 166.75 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13091
total_samples=20119, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:26,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.23
[2025-08-03 05:46:26,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.67 | bwd_microstep: 1838.56 | bwd_inner_microstep: 1696.52 | bwd_allreduce_microstep: 141.98 | step_microstep: 124.13
[2025-08-03 05:46:26,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.70 | bwd: 7804.09 | bwd_inner: 7127.13 | bwd_allreduce: 676.72 | step: 124.74
66%|██████▌   | 1318/2000 [4:02:47<2:04:32, 10.96s/it]
66%|██████▌   | 1319/2000 [4:02:58<2:04:46, 10.99s/it]
66%|██████▌   | 1320/2000 [4:03:08<2:02:10, 10.78s/it]
66%|██████▌   | 1321/2000 [4:03:19<2:01:16, 10.72s/it]
66%|██████▌   | 1322/2000 [4:03:29<2:01:20, 10.74s/it]
{'loss': 0.7444, 'learning_rate': 5.431261342641287e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13478
total_samples=20124, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:28,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.24 | bwd_microstep: 1769.59 | bwd_inner_microstep: 1694.43 | bwd_allreduce_microstep: 75.09 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13584
total_samples=20128, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:31,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.51 | bwd_microstep: 1960.54 | bwd_inner_microstep: 1953.72 | bwd_allreduce_microstep: 6.71 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12982
total_samples=20132, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:33,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.96 | bwd_microstep: 1722.64 | bwd_inner_microstep: 1624.22 | bwd_allreduce_microstep: 98.35 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12898
total_samples=20136, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:36,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.73
[2025-08-03 05:46:36,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.32 | bwd_microstep: 1864.63 | bwd_inner_microstep: 1683.94 | bwd_allreduce_microstep: 180.63 | step_microstep: 160.79
[2025-08-03 05:46:36,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.96 | bwd: 7317.45 | bwd_inner: 6956.33 | bwd_allreduce: 360.85 | step: 161.46
{'loss': 0.742, 'learning_rate': 5.416862465241033e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14876
total_samples=20140, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:39,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.33 | bwd_microstep: 1751.73 | bwd_inner_microstep: 1742.55 | bwd_allreduce_microstep: 9.11 | step_microstep: 0.13
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12099
total_samples=20144, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:41,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.43 | bwd_microstep: 1780.13 | bwd_inner_microstep: 1599.55 | bwd_allreduce_microstep: 180.50 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12276
total_samples=20147, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:44,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.38 | bwd_microstep: 1819.52 | bwd_inner_microstep: 1592.90 | bwd_allreduce_microstep: 226.55 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13594
total_samples=20151, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:47,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.98
[2025-08-03 05:46:47,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.17 | bwd_microstep: 2129.07 | bwd_inner_microstep: 1904.43 | bwd_allreduce_microstep: 224.58 | step_microstep: 455.90
[2025-08-03 05:46:47,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.23 | bwd: 7480.50 | bwd_inner: 6839.42 | bwd_allreduce: 640.82 | step: 456.41
{'loss': 0.7509, 'learning_rate': 5.40247560658467e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14395
total_samples=20156, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:50,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.19 | bwd_microstep: 2032.15 | bwd_inner_microstep: 1900.32 | bwd_allreduce_microstep: 131.76 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13268
total_samples=20160, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:53,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.64 | bwd_microstep: 1813.81 | bwd_inner_microstep: 1711.68 | bwd_allreduce_microstep: 102.07 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13431
total_samples=20164, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:56,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.08 | bwd_microstep: 2545.77 | bwd_inner_microstep: 2164.00 | bwd_allreduce_microstep: 381.71 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13004
total_samples=20168, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:46:59,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 05:46:59,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.61 | bwd_microstep: 1793.50 | bwd_inner_microstep: 1675.58 | bwd_allreduce_microstep: 117.85 | step_microstep: 136.75
[2025-08-03 05:46:59,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.45 | bwd: 8185.29 | bwd_inner: 7451.57 | bwd_allreduce: 733.48 | step: 137.29
{'loss': 0.739, 'learning_rate': 5.3881008044000495e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14068
total_samples=20172, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:01,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.26 | bwd_microstep: 1807.34 | bwd_inner_microstep: 1740.76 | bwd_allreduce_microstep: 66.52 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13159
total_samples=20176, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:04,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.78 | bwd_microstep: 1925.28 | bwd_inner_microstep: 1808.29 | bwd_allreduce_microstep: 116.93 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13876
total_samples=20180, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:07,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.51 | bwd_microstep: 2666.79 | bwd_inner_microstep: 2577.35 | bwd_allreduce_microstep: 89.37 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13349
total_samples=20184, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:10,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.62
[2025-08-03 05:47:10,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.90 | bwd_microstep: 1753.99 | bwd_inner_microstep: 1672.66 | bwd_allreduce_microstep: 81.26 | step_microstep: 143.41
[2025-08-03 05:47:10,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.38 | bwd: 8153.46 | bwd_inner: 7799.06 | bwd_allreduce: 354.16 | step: 143.94
{'loss': 0.738, 'learning_rate': 5.373738096383423e-06, 'epoch': 0.66}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13338
total_samples=20188, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:13,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.49 | bwd_microstep: 1903.30 | bwd_inner_microstep: 1666.30 | bwd_allreduce_microstep: 236.94 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13443
total_samples=20192, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:16,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 975.33 | bwd_microstep: 2080.42 | bwd_inner_microstep: 1776.83 | bwd_allreduce_microstep: 303.53 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13795
total_samples=20196, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:18,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.72 | bwd_microstep: 1736.98 | bwd_inner_microstep: 1692.05 | bwd_allreduce_microstep: 44.86 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15431
total_samples=20201, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:21,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.62
[2025-08-03 05:47:21,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.69 | bwd_microstep: 1750.80 | bwd_inner_microstep: 1744.37 | bwd_allreduce_microstep: 6.36 | step_microstep: 112.19
[2025-08-03 05:47:21,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3081.16 | bwd: 7471.56 | bwd_inner: 6879.54 | bwd_allreduce: 591.78 | step: 112.68
66%|██████▌   | 1323/2000 [4:03:40<2:02:15, 10.84s/it]
66%|██████▌   | 1324/2000 [4:03:51<2:01:17, 10.77s/it]
66%|██████▋   | 1325/2000 [4:04:02<2:02:08, 10.86s/it]
66%|██████▋   | 1326/2000 [4:04:13<2:03:59, 11.04s/it]
66%|██████▋   | 1327/2000 [4:04:25<2:05:03, 11.15s/it]
66%|██████▋   | 1328/2000 [4:04:36<2:04:17, 1
{'loss': 0.7377, 'learning_rate': 5.359387520199317e-06, 'epoch': 0.66}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13953
total_samples=20205, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:24,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.47 | bwd_microstep: 2005.08 | bwd_inner_microstep: 1873.52 | bwd_allreduce_microstep: 131.49 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13284
total_samples=20209, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:27,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.75 | bwd_microstep: 2251.86 | bwd_inner_microstep: 2082.71 | bwd_allreduce_microstep: 169.08 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14154
total_samples=20214, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:30,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.39 | bwd_microstep: 2164.69 | bwd_inner_microstep: 1903.74 | bwd_allreduce_microstep: 260.88 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13411
total_samples=20218, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:33,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.85
[2025-08-03 05:47:33,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.47 | bwd_microstep: 1802.08 | bwd_inner_microstep: 1715.06 | bwd_allreduce_microstep: 86.95 | step_microstep: 157.18
[2025-08-03 05:47:33,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.01 | bwd: 8223.76 | bwd_inner: 7575.03 | bwd_allreduce: 648.49 | step: 157.71
{'loss': 0.7364, 'learning_rate': 5.3450491134804416e-06, 'epoch': 0.66}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12038
total_samples=20221, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:35,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.92 | bwd_microstep: 1761.77 | bwd_inner_microstep: 1558.51 | bwd_allreduce_microstep: 203.18 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11942
total_samples=20224, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:38,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.21 | bwd_microstep: 2148.26 | bwd_inner_microstep: 1939.29 | bwd_allreduce_microstep: 208.87 | step_microstep: 0.87
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13513
total_samples=20228, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:41,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.54 | bwd_microstep: 1710.52 | bwd_inner_microstep: 1658.52 | bwd_allreduce_microstep: 51.94 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14598
total_samples=20233, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:43,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77
[2025-08-03 05:47:43,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.62 | bwd_microstep: 1900.40 | bwd_inner_microstep: 1739.54 | bwd_allreduce_microstep: 160.77 | step_microstep: 123.24
[2025-08-03 05:47:43,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.22 | bwd: 7521.02 | bwd_inner: 6895.85 | bwd_allreduce: 624.88 | step: 124.38
{'loss': 0.7403, 'learning_rate': 5.330722913827594e-06, 'epoch': 0.67}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12075
total_samples=20236, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:47,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 971.12 | bwd_microstep: 1951.42 | bwd_inner_microstep: 1904.63 | bwd_allreduce_microstep: 46.73 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12103
total_samples=20239, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:49,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.70 | bwd_microstep: 1713.41 | bwd_inner_microstep: 1539.92 | bwd_allreduce_microstep: 173.41 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13252
total_samples=20243, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:52,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.60 | bwd_microstep: 1764.01 | bwd_inner_microstep: 1694.39 | bwd_allreduce_microstep: 69.50 | step_microstep: 0.44
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14501
total_samples=20247, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:54,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 05:47:54,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.30 | bwd_microstep: 1784.68 | bwd_inner_microstep: 1727.50 | bwd_allreduce_microstep: 57.10 | step_microstep: 111.30
[2025-08-03 05:47:55,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3062.64 | bwd: 7213.58 | bwd_inner: 6866.44 | bwd_allreduce: 346.87 | step: 112.14
{'loss': 0.7419, 'learning_rate': 5.3164089588095705e-06, 'epoch': 0.67}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12925
total_samples=20252, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:47:57,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.36 | bwd_microstep: 2027.08 | bwd_inner_microstep: 1835.26 | bwd_allreduce_microstep: 191.73 | step_microstep: 0.34
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13798
total_samples=20255, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:00,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.77 | bwd_microstep: 1880.66 | bwd_inner_microstep: 1663.25 | bwd_allreduce_microstep: 217.28 | step_microstep: 0.89
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13965
total_samples=20259, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:03,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.60 | bwd_microstep: 1797.35 | bwd_inner_microstep: 1735.90 | bwd_allreduce_microstep: 61.38 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13205
total_samples=20263, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:05,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:48:05,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 2010.64 | bwd_inner_microstep: 1866.93 | bwd_allreduce_microstep: 143.64 | step_microstep: 120.48
[2025-08-03 05:48:05,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.72 | bwd: 7715.78 | bwd_inner: 7101.34 | bwd_allreduce: 614.16 | step: 121.85
{'loss': 0.7437, 'learning_rate': 5.302107285963045e-06, 'epoch': 0.67}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13935
total_samples=20267, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:09,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.90 | bwd_microstep: 2404.26 | bwd_inner_microstep: 1713.17 | bwd_allreduce_microstep: 691.02 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11984
total_samples=20270, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:11,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.52 | bwd_microstep: 1746.24 | bwd_inner_microstep: 1558.61 | bwd_allreduce_microstep: 187.57 | step_microstep: 0.89
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13947
total_samples=20274, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:14,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 915.67 | bwd_microstep: 1815.26 | bwd_inner_microstep: 1738.96 | bwd_allreduce_microstep: 76.23 | step_microstep: 0.25
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14159
total_samples=20279, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:17,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 05:48:17,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.33 | bwd_microstep: 1729.09 | bwd_inner_microstep: 1666.22 | bwd_allreduce_microstep: 62.80 | step_microstep: 123.83
[2025-08-03 05:48:17,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3000.34 | bwd: 7694.90 | bwd_inner: 6676.95 | bwd_allreduce: 1017.70 | step: 125.27
1.10s/it]
 66%|██████▋   | 1328/2000 [4:04:36<2:04:17, 11.10s/it]
 66%|██████▋   | 1329/2000 [4:04:47<2:05:29, 11.22s/it]
 66%|██████▋   | 1330/2000 [4:04:58<2:03:45, 11.08s/it]
 66%|██████▋   | 1330/2000 [4:04:59<2:03:45, 11.08s/it]
 67%|██████▋   | 1331/2000 [4:05:09<2:03:59, 11.12s/it]
 67%|██████▋   | 1332/2000 [4:05:20<2:03:22, 11.08s/it]
 67%|██████▋   | 1333/2000 [4:05:31<2:03:21, 11.10s/it]
{'loss': 0.7362, 'learning_rate': 5.287817932792485e-06, 'epoch': 0.67}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15401
total_samples=20283, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:19,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.27 | bwd_microstep: 1802.57 | bwd_inner_microstep: 1775.75 | bwd_allreduce_microstep: 26.76 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12800
total_samples=20286, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:22,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.23 | bwd_microstep: 1733.21 | bwd_inner_microstep: 1576.94 | bwd_allreduce_microstep: 156.20 | step_microstep: 0.24
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15890
total_samples=20290, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:24,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.70 | bwd_microstep: 1774.55 | bwd_inner_microstep: 1756.47 | bwd_allreduce_microstep: 18.01 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13188
total_samples=20294, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:27,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 05:48:27,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.37 | bwd_microstep: 2088.07 | bwd_inner_microstep: 2036.77 | bwd_allreduce_microstep: 51.24 | step_microstep: 105.98
[2025-08-03 05:48:27,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.49 | bwd: 7398.45 | bwd_inner: 7145.91 | bwd_allreduce: 252.30 | step: 106.59
{'loss': 0.7324, 'learning_rate': 5.273540936770059e-06, 'epoch': 0.67}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11569
total_samples=20297, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:30,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.96 | bwd_microstep: 1781.00 | bwd_inner_microstep: 1541.89 | bwd_allreduce_microstep: 239.05 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14079
total_samples=20301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:32,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.92 | bwd_microstep: 1723.89 | bwd_inner_microstep: 1709.70 | bwd_allreduce_microstep: 14.13 | step_microstep: 0.25
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13578
total_samples=20306, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:35,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.53 | bwd_microstep: 1957.19 | bwd_inner_microstep: 1852.42 | bwd_allreduce_microstep: 104.68 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11914
total_samples=20309, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:38,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 05:48:38,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.50 | bwd_microstep: 2072.94 | bwd_inner_microstep: 1837.03 | bwd_allreduce_microstep: 235.85 | step_microstep: 133.09
[2025-08-03 05:48:38,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.85 | bwd: 7535.08 | bwd_inner: 6941.02 | bwd_allreduce: 593.81 | step: 133.67
{'loss': 0.7459, 'learning_rate': 5.259276335335522e-06, 'epoch': 0.67}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885
total_samples=20312, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:41,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.86 | bwd_microstep: 1921.62 | bwd_inner_microstep: 1552.34 | bwd_allreduce_microstep: 369.21 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14165
total_samples=20317, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:44,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.68 | bwd_microstep: 2077.90 | bwd_inner_microstep: 1988.71 | bwd_allreduce_microstep: 89.13 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14341
total_samples=20322, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:46,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.33 | bwd_microstep: 1716.00 | bwd_inner_microstep: 1693.76 | bwd_allreduce_microstep: 22.17 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11966
total_samples=20325, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:49,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 05:48:49,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.64 | bwd_microstep: 1796.19 | bwd_inner_microstep: 1564.07 | bwd_allreduce_microstep: 232.05 | step_microstep: 116.32
[2025-08-03 05:48:49,216] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.44 | bwd: 7511.76 | bwd_inner: 6798.85 | bwd_allreduce: 712.65 | step: 116.91
{'loss': 0.735, 'learning_rate': 5.245024165896126e-06, 'epoch': 0.67}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13502
total_samples=20329, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:51,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.77 | bwd_microstep: 1798.70 | bwd_inner_microstep: 1681.15 | bwd_allreduce_microstep: 117.49 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13510
total_samples=20333, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:54,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.55 | bwd_microstep: 2112.37 | bwd_inner_microstep: 1954.86 | bwd_allreduce_microstep: 157.45 | step_microstep: 0.20
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12936
total_samples=20338, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:48:57,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.28 | bwd_microstep: 1962.43 | bwd_inner_microstep: 1651.27 | bwd_allreduce_microstep: 311.10 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11811
total_samples=20341, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:00,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.80
[2025-08-03 05:49:00,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.33 | bwd_microstep: 1733.07 | bwd_inner_microstep: 1705.20 | bwd_allreduce_microstep: 27.81 | step_microstep: 129.16
[2025-08-03 05:49:00,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.86 | bwd: 7606.62 | bwd_inner: 6992.46 | bwd_allreduce: 613.92 | step: 129.63
{'loss': 0.7469, 'learning_rate': 5.2307844658265236e-06, 'epoch': 0.67}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13029
total_samples=20345, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:02,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.09 | bwd_microstep: 1766.81 | bwd_inner_microstep: 1661.21 | bwd_allreduce_microstep: 105.53 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13248
total_samples=20349, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:05,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.33 | bwd_microstep: 1765.46 | bwd_inner_microstep: 1702.81 | bwd_allreduce_microstep: 62.59 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13800
total_samples=20353, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:08,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.30 | bwd_microstep: 2025.83 | bwd_inner_microstep: 1905.22 | bwd_allreduce_microstep: 120.53 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13393
total_samples=20357, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:10,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 05:49:10,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.67 | bwd_microstep: 1798.99 | bwd_inner_microstep: 1667.80 | bwd_allreduce_microstep: 131.12 | step_microstep: 111.06
[2025-08-03 05:49:10,691] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2871.31 | bwd: 7357.13 | bwd_inner: 6937.04 | bwd_allreduce: 419.85 | step: 111.58
{'loss': 0.7325, 'learning_rate': 5.216557272468675e-06, 'epoch': 0.67}
 67%|██████▋   | 1333/2000 [4:05:32<2:03:21, 11.10s/it]
 67%|██████▋   | 1334/2000 [4:05:42<2:01:33, 10.95s/it]
 67%|██████▋   | 1335/2000 [4:05:53<2:00:41, 10.89s/it]
 67%|██████▋   | 1336/2000 [4:06:04<1:59:59, 10.84s/it]
 67%|██████▋   | 1337/2000 [4:06:14<1:59:49, 10.84s/it]
 67%|██████▋   | 1338/2000 [4:06:25<1:58:56, 10.78s/it]
 67%|██
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13573
total_samples=20361, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:13,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.62 | bwd_microstep: 1836.09 | bwd_inner_microstep: 1639.95 | bwd_allreduce_microstep: 196.07 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13287
total_samples=20365, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:15,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.70 | bwd_microstep: 1880.78 | bwd_inner_microstep: 1710.50 | bwd_allreduce_microstep: 170.21 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13863
total_samples=20369, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:18,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.14 | bwd_microstep: 2130.81 | bwd_inner_microstep: 1756.91 | bwd_allreduce_microstep: 373.85 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11928
total_samples=20372, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:21,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.97
[2025-08-03 05:49:21,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.04 | bwd_microstep: 1790.92 | bwd_inner_microstep: 1583.20 | bwd_allreduce_microstep: 207.66 | step_microstep: 119.28
[2025-08-03 05:49:21,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.44 | bwd: 7638.66 | bwd_inner: 6690.55 | bwd_allreduce: 947.86 | step: 119.90
{'loss': 0.7366, 'learning_rate': 5.202342623131731e-06, 'epoch': 0.67}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12173
total_samples=20375, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:24,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.79 | bwd_microstep: 1844.56 | bwd_inner_microstep: 1723.97 | bwd_allreduce_microstep: 120.52 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13621
total_samples=20379, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:26,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.91 | bwd_microstep: 1832.98 | bwd_inner_microstep: 1748.43 | bwd_allreduce_microstep: 84.48 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13623
total_samples=20383, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:29,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.50 | bwd_microstep: 2077.56 | bwd_inner_microstep: 1759.71 | bwd_allreduce_microstep: 317.79 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11641
total_samples=20386, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:32,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 05:49:32,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.13 | bwd_microstep: 1861.07 | bwd_inner_microstep: 1529.39 | bwd_allreduce_microstep: 331.61 | step_microstep: 106.84
[2025-08-03 05:49:32,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.26 | bwd: 7616.22 | bwd_inner: 6761.50 | bwd_allreduce: 854.48 | step: 107.30
{'loss': 0.7311, 'learning_rate': 5.18814055509195e-06, 'epoch': 0.67}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13874
total_samples=20390, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:35,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.98 | bwd_microstep: 1833.24 | bwd_inner_microstep: 1735.67 | bwd_allreduce_microstep: 97.50 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515
total_samples=20394, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:37,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.77 | bwd_microstep: 2002.88 | bwd_inner_microstep: 1827.46 | bwd_allreduce_microstep: 175.35 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13745
total_samples=20399, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:40,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.68 | bwd_microstep: 1797.72 | bwd_inner_microstep: 1727.10 | bwd_allreduce_microstep: 70.56 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11604
total_samples=20402, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:43,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 05:49:43,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.58 | bwd_microstep: 2177.66 | bwd_inner_microstep: 1884.25 | bwd_allreduce_microstep: 293.34 | step_microstep: 126.65
[2025-08-03 05:49:43,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.94 | bwd: 7811.55 | bwd_inner: 7174.47 | bwd_allreduce: 636.83 | step: 127.14
{'loss': 0.7416, 'learning_rate': 5.173951105592605e-06, 'epoch': 0.67}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13688
total_samples=20406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:46,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.70 | bwd_microstep: 1734.93 | bwd_inner_microstep: 1680.92 | bwd_allreduce_microstep: 53.94 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13896
total_samples=20410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:48,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.45 | bwd_microstep: 1951.25 | bwd_inner_microstep: 1856.14 | bwd_allreduce_microstep: 95.04 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13354
total_samples=20414, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:51,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.94 | bwd_microstep: 1741.22 | bwd_inner_microstep: 1683.48 | bwd_allreduce_microstep: 57.67 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11810
total_samples=20417, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:54,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 05:49:54,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.92 | bwd_microstep: 1948.59 | bwd_inner_microstep: 1561.06 | bwd_allreduce_microstep: 387.47 | step_microstep: 109.39
[2025-08-03 05:49:54,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.93 | bwd: 7376.05 | bwd_inner: 6781.60 | bwd_allreduce: 594.21 | step: 109.80
{'loss': 0.7423, 'learning_rate': 5.1597743118438725e-06, 'epoch': 0.67}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13126
total_samples=20421, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:56,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.39 | bwd_microstep: 1963.95 | bwd_inner_microstep: 1862.64 | bwd_allreduce_microstep: 101.24 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14220
total_samples=20425, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:49:59,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.03 | bwd_microstep: 1713.85 | bwd_inner_microstep: 1697.03 | bwd_allreduce_microstep: 16.75 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13629
total_samples=20429, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:01,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.03 | bwd_microstep: 1743.57 | bwd_inner_microstep: 1685.36 | bwd_allreduce_microstep: 58.14 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11676
total_samples=20432, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:04,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:50:04,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.19 | bwd_microstep: 1930.60 | bwd_inner_microstep: 1745.54 | bwd_allreduce_microstep: 184.99 | step_microstep: 139.59
[2025-08-03 05:50:04,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.57 | bwd: 7352.02 | bwd_inner: 6990.56 | bwd_allreduce: 361.21 | step: 140.12
{'loss': 0.7402, 'learning_rate': 5.145610211022738e-06, 'epoch': 0.67}
████▋   | 1338/2000 [4:06:25<1:58:56, 10.78s/it]
 67%|██████▋   | 1339/2000 [4:06:36<1:59:05, 10.81s/it]
 67%|██████▋   | 1340/2000 [4:06:47<1:59:08, 10.83s/it]
 67%|██████▋   | 1341/2000 [4:06:58<1:59:45, 10.90s/it]
 67%|██████▋   | 1342/2000 [4:07:08<1:58:33, 10.81s/it]
 67%|██████▋   | 1343/2000 [4:07:19<1:57:39, 10.75s/it]
 67%|██████▋   | 1343/2000 [4:07:19<
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12900
total_samples=20436, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:07,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.92 | bwd_microstep: 1816.74 | bwd_inner_microstep: 1629.94 | bwd_allreduce_microstep: 186.73 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12921
total_samples=20440, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:09,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.86 | bwd_microstep: 1691.49 | bwd_inner_microstep: 1633.36 | bwd_allreduce_microstep: 58.06 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13119
total_samples=20444, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:12,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.37 | bwd_microstep: 2097.61 | bwd_inner_microstep: 1681.92 | bwd_allreduce_microstep: 415.62 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13507
total_samples=20448, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:16,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.78
[2025-08-03 05:50:16,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.56 | bwd_microstep: 2549.43 | bwd_inner_microstep: 2391.22 | bwd_allreduce_microstep: 158.14 | step_microstep: 110.42
[2025-08-03 05:50:16,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.64 | bwd: 8155.33 | bwd_inner: 7336.44 | bwd_allreduce: 818.63 | step: 111.06
{'loss': 0.7257, 'learning_rate': 5.131458840272905e-06, 'epoch': 0.67}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11747
total_samples=20451, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:18,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.83 | bwd_microstep: 1781.07 | bwd_inner_microstep: 1545.98 | bwd_allreduce_microstep: 235.02 | step_microstep: 0.17
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13190
total_samples=20455, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:21,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.41 | bwd_microstep: 1933.23 | bwd_inner_microstep: 1695.56 | bwd_allreduce_microstep: 237.60 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13913
total_samples=20459, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:23,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.66 | bwd_microstep: 1709.89 | bwd_inner_microstep: 1677.16 | bwd_allreduce_microstep: 32.65 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11730
total_samples=20462, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:27,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 05:50:27,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 746.65 | bwd_microstep: 1995.52 | bwd_inner_microstep: 1760.21 | bwd_allreduce_microstep: 235.24 | step_microstep: 570.42
[2025-08-03 05:50:27,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.48 | bwd: 7419.75 | bwd_inner: 6678.90 | bwd_allreduce: 740.60 | step: 570.97
{'loss': 0.7278, 'learning_rate': 5.117320236704697e-06, 'epoch': 0.67}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13491
total_samples=20466, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:29,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.98 | bwd_microstep: 1850.14 | bwd_inner_microstep: 1715.99 | bwd_allreduce_microstep: 134.07 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15245
total_samples=20470, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:32,391] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.94 | bwd_microstep: 1773.76 | bwd_inner_microstep: 1760.79 | bwd_allreduce_microstep: 12.90 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13048
total_samples=20474, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:35,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.23 | bwd_microstep: 1946.48 | bwd_inner_microstep: 1892.73 | bwd_allreduce_microstep: 53.69 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14234
total_samples=20478, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:37,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.87
[2025-08-03 05:50:37,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 751.50 | bwd_microstep: 1809.51 | bwd_inner_microstep: 1740.58 | bwd_allreduce_microstep: 68.85 | step_microstep: 154.59
[2025-08-03 05:50:37,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2875.58 | bwd: 7379.95 | bwd_inner: 7110.09 | bwd_allreduce: 269.61 | step: 155.23
{'loss': 0.7421, 'learning_rate': 5.103194437394952e-06, 'epoch': 0.67}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 16360
total_samples=20482, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:40,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.51 | bwd_microstep: 2033.43 | bwd_inner_microstep: 1786.53 | bwd_allreduce_microstep: 246.84 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13845
total_samples=20487, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:43,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.53 | bwd_microstep: 2206.70 | bwd_inner_microstep: 1912.41 | bwd_allreduce_microstep: 294.20 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15025
total_samples=20492, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:46,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.48 | bwd_microstep: 1755.00 | bwd_inner_microstep: 1746.98 | bwd_allreduce_microstep: 7.96 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11775
total_samples=20495, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:49,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.05
[2025-08-03 05:50:49,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.77 | bwd_microstep: 1795.07 | bwd_inner_microstep: 1549.41 | bwd_allreduce_microstep: 245.59 | step_microstep: 449.13
[2025-08-03 05:50:49,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.23 | bwd: 7790.28 | bwd_inner: 6995.32 | bwd_allreduce: 794.68 | step: 449.62
{'loss': 0.7383, 'learning_rate': 5.089081479386928e-06, 'epoch': 0.67}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11697
total_samples=20498, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:51,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.07 | bwd_microstep: 1728.61 | bwd_inner_microstep: 1530.95 | bwd_allreduce_microstep: 197.60 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13529
total_samples=20502, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:54,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.57 | bwd_microstep: 2008.62 | bwd_inner_microstep: 1878.73 | bwd_allreduce_microstep: 129.81 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13160
total_samples=20506, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:50:57,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.30 | bwd_microstep: 1855.61 | bwd_inner_microstep: 1695.60 | bwd_allreduce_microstep: 159.94 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11580
total_samples=20509, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:00,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 05:51:00,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.47 | bwd_microstep: 1982.30 | bwd_inner_microstep: 1769.77 | bwd_allreduce_microstep: 212.46 | step_microstep: 110.50
[2025-08-03 05:51:00,090] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.32 | bwd: 7575.19 | bwd_inner: 6875.04 | bwd_allreduce: 699.90 | step: 111.01
{'loss': 0.7364, 'learning_rate': 5.074981399690219e-06, 'epoch': 0.67}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11935
total_samples=20512, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:02,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.97 | bwd_microstep: 1808.75 | bwd_inner_microstep: 1580.59 | bwd_allreduce_microstep: 228.08 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14736
total_samples=20517, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:05,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.62 | bwd_microstep: 1773.74 | bwd_inner_microstep: 1716.61 | bwd_allreduce_microstep: 57.05 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13472
total_samples=20521, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:08,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.41 | bwd_microstep: 1989.53 | bwd_inner_microstep: 1861.04 | bwd_allreduce_microstep: 128.41 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14234
total_samples=20525, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:10,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 05:51:10,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.08 | bwd_microstep: 1750.31 | bwd_inner_microstep: 1706.12 | bwd_allreduce_microstep: 44.13 | step_microstep: 127.81
[2025-08-03 05:51:10,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.99 | bwd: 7322.37 | bwd_inner: 6864.35 | bwd_allreduce: 457.74 | step: 128.61
67%|██████▋   | 1344/2000 [4:07:30<1:59:30, 10.93s/it]
67%|██████▋   | 1345/2000 [4:07:42<1:59:54, 10.98s/it]
67%|██████▋   | 1346/2000 [4:07:52<1:58:52, 10.91s/it]
67%|██████▋   | 1347/2000 [4:08:04<2:00:21, 11.06s/it]
67%|██████▋   | 1348/2000 [4:08:14<1:59:12, 10.97s/it]
{'loss': 0.7368, 'learning_rate': 5.060894235280637e-06, 'epoch': 0.67}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12313
total_samples=20528, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:13,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.21 | bwd_microstep: 1939.17 | bwd_inner_microstep: 1729.67 | bwd_allreduce_microstep: 209.42 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11962
total_samples=20531, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:16,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.30 | bwd_microstep: 1831.96 | bwd_inner_microstep: 1594.31 | bwd_allreduce_microstep: 237.56 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13099
total_samples=20535, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:18,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.89 | bwd_microstep: 1861.53 | bwd_inner_microstep: 1706.49 | bwd_allreduce_microstep: 154.98 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12040
total_samples=20539, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:21,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 05:51:21,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.19 | bwd_microstep: 1828.67 | bwd_inner_microstep: 1612.85 | bwd_allreduce_microstep: 215.75 | step_microstep: 115.49
[2025-08-03 05:51:21,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2903.51 | bwd: 7461.38 | bwd_inner: 6643.31 | bwd_allreduce: 817.81 | step: 115.92
{'loss': 0.7435, 'learning_rate': 5.046820023100129e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11880
total_samples=20542, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:24,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.24 | bwd_microstep: 1828.27 | bwd_inner_microstep: 1584.57 | bwd_allreduce_microstep: 243.62 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12077
total_samples=20546, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:27,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.36 | bwd_microstep: 2144.57 | bwd_inner_microstep: 1916.65 | bwd_allreduce_microstep: 227.83 | step_microstep: 0.30
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12970
total_samples=20550, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:29,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.86 | bwd_microstep: 1947.60 | bwd_inner_microstep: 1664.98 | bwd_allreduce_microstep: 282.55 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11569
total_samples=20553, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:32,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.66
[2025-08-03 05:51:32,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.36 | bwd_microstep: 1887.44 | bwd_inner_microstep: 1640.46 | bwd_allreduce_microstep: 246.91 | step_microstep: 133.66
[2025-08-03 05:51:32,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.74 | bwd: 7807.93 | bwd_inner: 6806.67 | bwd_allreduce: 1001.02 | step: 134.32
{'loss': 0.74, 'learning_rate': 5.03275880005667e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11778
total_samples=20557, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:35,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.89 | bwd_microstep: 1811.45 | bwd_inner_microstep: 1582.60 | bwd_allreduce_microstep: 228.78 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=20560, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:37,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.51 | bwd_microstep: 1936.42 | bwd_inner_microstep: 1726.39 | bwd_allreduce_microstep: 209.96 | step_microstep: 0.24
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13499
total_samples=20564, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:40,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.83 | bwd_microstep: 1823.89 | bwd_inner_microstep: 1679.77 | bwd_allreduce_microstep: 144.05 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11618
total_samples=20567, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:43,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.01
[2025-08-03 05:51:43,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.17 | bwd_microstep: 1744.13 | bwd_inner_microstep: 1542.61 | bwd_allreduce_microstep: 201.45 | step_microstep: 137.60
[2025-08-03 05:51:43,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.32 | bwd: 7315.93 | bwd_inner: 6531.37 | bwd_allreduce: 784.32 | step: 138.08
{'loss': 0.7421, 'learning_rate': 5.018710603024187e-06, 'epoch': 0.68}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12622
total_samples=20571, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:47,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.76 | bwd_microstep: 3038.67 | bwd_inner_microstep: 2803.06 | bwd_allreduce_microstep: 235.55 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12368
total_samples=20576, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:49,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.68 | bwd_microstep: 2072.21 | bwd_inner_microstep: 1864.86 | bwd_allreduce_microstep: 207.28 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14873
total_samples=20580, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:52,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.76 | bwd_microstep: 1813.88 | bwd_inner_microstep: 1722.72 | bwd_allreduce_microstep: 91.10 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13359
total_samples=20584, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:55,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 05:51:55,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.38 | bwd_microstep: 2042.90 | bwd_inner_microstep: 1739.74 | bwd_allreduce_microstep: 303.10 | step_microstep: 124.41
[2025-08-03 05:51:55,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2887.52 | bwd: 8967.72 | bwd_inner: 8130.38 | bwd_allreduce: 837.10 | step: 125.00
{'loss': 0.7472, 'learning_rate': 5.004675468842436e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11799
total_samples=20587, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:51:58,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.18 | bwd_microstep: 1846.32 | bwd_inner_microstep: 1552.11 | bwd_allreduce_microstep: 294.14 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12156
total_samples=20590, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:00,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.28 | bwd_microstep: 1842.33 | bwd_inner_microstep: 1589.45 | bwd_allreduce_microstep: 252.81 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11901
total_samples=20593, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:03,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.82 | bwd_microstep: 1739.27 | bwd_inner_microstep: 1562.79 | bwd_allreduce_microstep: 176.41 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13954
total_samples=20597, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:05,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 05:52:05,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.27 | bwd_microstep: 1714.20 | bwd_inner_microstep: 1681.46 | bwd_allreduce_microstep: 32.66 | step_microstep: 131.94
[2025-08-03 05:52:05,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2880.48 | bwd: 7142.17 | bwd_inner: 6385.79 | bwd_allreduce: 756.12 | step: 132.48
67%|██████▋   | 1349/2000 [4:08:25<1:57:45, 10.85s/it]
68%|██████▊   | 1350/2000 [4:08:36<1:57:23, 10.84s/it]
68%|██████▊   | 1351/2000 [4:08:47<1:58:00, 10.91s/it]
68%|██████▊   | 1352/2000 [4:08:58<1:56:50, 10.82s/it]
68%|██████▊   | 1353/2000 [4:09:10<2:01:24, 11.26s/it]
{'loss': 0.7455, 'learning_rate': 4.990653434316915e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11958
total_samples=20600, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:08,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.24 | bwd_microstep: 2218.63 | bwd_inner_microstep: 1994.29 | bwd_allreduce_microstep: 224.28 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11815
total_samples=20603, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:11,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.93 | bwd_microstep: 1778.78 | bwd_inner_microstep: 1571.44 | bwd_allreduce_microstep: 207.27 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11796
total_samples=20606, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:13,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.26 | bwd_microstep: 1706.71 | bwd_inner_microstep: 1538.08 | bwd_allreduce_microstep: 168.56 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13378
total_samples=20610, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:16,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 05:52:16,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.92 | bwd_microstep: 1765.86 | bwd_inner_microstep: 1691.37 | bwd_allreduce_microstep: 74.42 | step_microstep: 134.39
[2025-08-03 05:52:16,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.29 | bwd: 7470.03 | bwd_inner: 6795.19 | bwd_allreduce: 674.59 | step: 134.91
{'loss': 0.7442, 'learning_rate': 4.976644536218783e-06, 'epoch': 0.68}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13394
total_samples=20614, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:19,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.97 | bwd_microstep: 1941.50 | bwd_inner_microstep: 1716.39 | bwd_allreduce_microstep: 225.04 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12606
total_samples=20617, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:21,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.47 | bwd_microstep: 1720.08 | bwd_inner_microstep: 1573.87 | bwd_allreduce_microstep: 146.14 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14476
total_samples=20621, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:24,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.69 | bwd_microstep: 1753.73 | bwd_inner_microstep: 1734.41 | bwd_allreduce_microstep: 19.25 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13812
total_samples=20625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:27,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 05:52:27,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.49 | bwd_microstep: 2010.90 | bwd_inner_microstep: 1857.62 | bwd_allreduce_microstep: 153.22 | step_microstep: 125.49
[2025-08-03 05:52:27,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.56 | bwd: 7426.26 | bwd_inner: 6882.29 | bwd_allreduce: 543.73 | step: 126.14
{'loss': 0.7542, 'learning_rate': 4.9626488112847384e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11774
total_samples=20628, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:30,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.08 | bwd_microstep: 2032.02 | bwd_inner_microstep: 1808.31 | bwd_allreduce_microstep: 223.64 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13391
total_samples=20632, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:32,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.07 | bwd_microstep: 1839.20 | bwd_inner_microstep: 1719.88 | bwd_allreduce_microstep: 119.25 | step_microstep: 0.26
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12361
total_samples=20636, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:35,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.44 | bwd_microstep: 1851.04 | bwd_inner_microstep: 1617.20 | bwd_allreduce_microstep: 233.78 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13580
total_samples=20640, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:38,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 05:52:38,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.89 | bwd_microstep: 2054.92 | bwd_inner_microstep: 2048.28 | bwd_allreduce_microstep: 6.58 | step_microstep: 114.74
[2025-08-03 05:52:38,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.40 | bwd: 7777.24 | bwd_inner: 7193.66 | bwd_allreduce: 583.32 | step: 115.37
{'loss': 0.754, 'learning_rate': 4.948666296216938e-06, 'epoch': 0.68}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13334
total_samples=20644, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:40,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.95 | bwd_microstep: 1743.18 | bwd_inner_microstep: 1663.94 | bwd_allreduce_microstep: 79.17 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13500
total_samples=20648, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:43,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.96 | bwd_microstep: 1809.08 | bwd_inner_microstep: 1723.16 | bwd_allreduce_microstep: 85.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13424
total_samples=20652, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:46,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.62 | bwd_microstep: 1927.76 | bwd_inner_microstep: 1752.61 | bwd_allreduce_microstep: 175.08 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11679
total_samples=20655, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:48,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 05:52:48,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.86 | bwd_microstep: 1880.73 | bwd_inner_microstep: 1721.55 | bwd_allreduce_microstep: 159.12 | step_microstep: 135.75
[2025-08-03 05:52:48,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.31 | bwd: 7360.80 | bwd_inner: 6861.26 | bwd_allreduce: 499.31 | step: 136.24
{'loss': 0.7428, 'learning_rate': 4.934697027682894e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11779
total_samples=20658, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:51,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.38 | bwd_microstep: 2036.13 | bwd_inner_microstep: 1820.08 | bwd_allreduce_microstep: 215.98 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11751
total_samples=20661, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:54,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.04 | bwd_microstep: 2031.32 | bwd_inner_microstep: 1699.20 | bwd_allreduce_microstep: 332.06 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11823
total_samples=20664, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:52:57,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.93 | bwd_microstep: 1944.15 | bwd_inner_microstep: 1802.72 | bwd_allreduce_microstep: 141.37 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12151
total_samples=20667, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:00,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 05:53:00,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.24 | bwd_microstep: 2247.99 | bwd_inner_microstep: 1896.69 | bwd_allreduce_microstep: 351.24 | step_microstep: 112.45
[2025-08-03 05:53:00,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.52 | bwd: 8259.63 | bwd_inner: 7218.68 | bwd_allreduce: 1040.73 | step: 112.82
68%|██████▊   | 1354/2000 [4:09:20<1:58:35, 11.01s/it]
68%|██████▊   | 1355/2000 [4:09:31<1:57:24, 10.92s/it]
68%|██████▊   | 1356/2000 [4:09:42<1:56:20, 10.84s/it]
68%|██████▊   | 1357/2000 [4:09:53<1:56:44, 10.89s/it]
68%|██████▊   | 1358/2000 [4:10:03<1:55:47, 10.82s/it]
68%|██████▊   | 1359/2000 [4:10:15<1:57:50, 11.03s/it]
{'loss': 0.7456, 'learning_rate': 4.9207410423153925e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11766
total_samples=20671, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:03,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.06 | bwd_microstep: 2007.62 | bwd_inner_microstep: 1835.96 | bwd_allreduce_microstep: 171.61 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13562
total_samples=20675, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:05,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.48 | bwd_microstep: 1820.76 | bwd_inner_microstep: 1731.37 | bwd_allreduce_microstep: 89.32 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11973
total_samples=20678, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:08,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.79 | bwd_microstep: 2151.34 | bwd_inner_microstep: 1604.63 | bwd_allreduce_microstep: 546.65 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13349
total_samples=20682, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:11,530] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.32
[2025-08-03 05:53:11,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.92 | bwd_microstep: 1841.90 | bwd_inner_microstep: 1720.36 | bwd_allreduce_microstep: 121.47 | step_microstep: 140.45
[2025-08-03 05:53:11,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2861.17 | bwd: 7821.68 | bwd_inner: 6892.32 | bwd_allreduce: 929.13 | step: 140.94
{'loss': 0.7441, 'learning_rate': 4.9067983767123736e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11628
total_samples=20685, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:14,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.06 | bwd_microstep: 1716.93 | bwd_inner_microstep: 1509.98 | bwd_allreduce_microstep: 206.88 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13295
total_samples=20689, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:16,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.08 | bwd_microstep: 1799.38 | bwd_inner_microstep: 1696.90 | bwd_allreduce_microstep: 102.42 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13139
total_samples=20693, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:19,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.35 | bwd_microstep: 1848.84 | bwd_inner_microstep: 1715.23 | bwd_allreduce_microstep: 133.55 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12169
total_samples=20696, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:21,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 05:53:21,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.50 | bwd_microstep: 1805.62 | bwd_inner_microstep: 1570.27 | bwd_allreduce_microstep: 235.29 | step_microstep: 130.17
[2025-08-03 05:53:21,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.92 | bwd: 7170.83 | bwd_inner: 6492.37 | bwd_allreduce: 678.22 | step: 130.65
{'loss': 0.7399, 'learning_rate': 4.8928690674368495e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11900
total_samples=20699, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:24,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.91 | bwd_microstep: 1978.27 | bwd_inner_microstep: 1762.25 | bwd_allreduce_microstep: 215.95 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13292
total_samples=20703, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:27,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.41 | bwd_microstep: 1841.76 | bwd_inner_microstep: 1707.08 | bwd_allreduce_microstep: 134.61 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13363
total_samples=20707, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:30,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.06 | bwd_microstep: 1908.44 | bwd_inner_microstep: 1692.77 | bwd_allreduce_microstep: 215.60 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12830
total_samples=20711, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:32,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.83
[2025-08-03 05:53:32,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.19 | bwd_microstep: 1698.27 | bwd_inner_microstep: 1630.99 | bwd_allreduce_microstep: 67.21 | step_microstep: 133.44
[2025-08-03 05:53:32,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.51 | bwd: 7426.79 | bwd_inner: 6793.08 | bwd_allreduce: 633.46 | step: 133.96
{'loss': 0.7246, 'learning_rate': 4.878953151016816e-06, 'epoch': 0.68}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13230
total_samples=20715, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:35,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.89 | bwd_microstep: 1811.21 | bwd_inner_microstep: 1670.72 | bwd_allreduce_microstep: 140.42 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12088
total_samples=20718, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:38,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.25 | bwd_microstep: 2098.20 | bwd_inner_microstep: 1857.18 | bwd_allreduce_microstep: 240.95 | step_microstep: 0.84
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13669
total_samples=20722, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:40,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.65 | bwd_microstep: 1710.39 | bwd_inner_microstep: 1666.40 | bwd_allreduce_microstep: 43.93 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11818
total_samples=20725, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:43,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 05:53:43,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.63 | bwd_microstep: 1864.33 | bwd_inner_microstep: 1626.42 | bwd_allreduce_microstep: 237.85 | step_microstep: 107.31
[2025-08-03 05:53:43,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.36 | bwd: 7484.19 | bwd_inner: 6820.72 | bwd_allreduce: 663.23 | step: 108.40
{'loss': 0.7373, 'learning_rate': 4.8650506639451385e-06, 'epoch': 0.68}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11859
total_samples=20728, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:46,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.49 | bwd_microstep: 1958.83 | bwd_inner_microstep: 1536.08 | bwd_allreduce_microstep: 422.68 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11725
total_samples=20731, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:48,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.94 | bwd_microstep: 1751.45 | bwd_inner_microstep: 1536.95 | bwd_allreduce_microstep: 214.43 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13550
total_samples=20736, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:51,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.51 | bwd_microstep: 2015.53 | bwd_inner_microstep: 2009.34 | bwd_allreduce_microstep: 6.12 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13912
total_samples=20740, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:54,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 05:53:54,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.95 | bwd_microstep: 2083.38 | bwd_inner_microstep: 1913.22 | bwd_allreduce_microstep: 170.09 | step_microstep: 120.81
[2025-08-03 05:53:54,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.82 | bwd: 7809.23 | bwd_inner: 6995.59 | bwd_allreduce: 813.40 | step: 121.17
{'loss': 0.7378, 'learning_rate': 4.851161642679466e-06, 'epoch': 0.68}
 68%|██████▊   | 1359/2000 [4:10:15<1:57:50, 11.03s/it]
 68%|██████▊   | 1360/2000 [4:10:26<1:57:53, 11.05s/it]
 68%|██████▊   | 1361/2000 [4:10:36<1:55:44, 10.87s/it]
 68%|██████▊   | 1362/2000 [4:10:47<1:54:53, 10.81s/it]
 68%|██████▊   | 1363/2000 [4:10:58<1:54:24, 10.78s/it]
 68%|██████▊   | 1364/2000 [4:11:09<1:55:03, 10.86s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11699
total_samples=20743, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:53:57,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.59 | bwd_microstep: 2134.79 | bwd_inner_microstep: 1984.34 | bwd_allreduce_microstep: 150.40 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12117
total_samples=20746, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:00,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.31 | bwd_microstep: 2062.78 | bwd_inner_microstep: 1847.48 | bwd_allreduce_microstep: 215.21 | step_microstep: 0.82
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13257
total_samples=20750, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:02,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.81 | bwd_microstep: 1913.07 | bwd_inner_microstep: 1827.20 | bwd_allreduce_microstep: 85.81 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13628
total_samples=20754, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:05,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 05:54:05,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.98 | bwd_microstep: 1744.50 | bwd_inner_microstep: 1687.96 | bwd_allreduce_microstep: 56.47 | step_microstep: 135.10
[2025-08-03 05:54:05,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.61 | bwd: 7855.20 | bwd_inner: 7346.98 | bwd_allreduce: 507.95 | step: 136.14
{'loss': 0.7442, 'learning_rate': 4.837286123642141e-06, 'epoch': 0.68}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13529
total_samples=20758, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:08,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.63 | bwd_microstep: 2261.16 | bwd_inner_microstep: 1871.60 | bwd_allreduce_microstep: 389.49 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12009
total_samples=20761, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:11,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.88 | bwd_microstep: 1822.76 | bwd_inner_microstep: 1600.38 | bwd_allreduce_microstep: 222.32 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12780
total_samples=20765, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:13,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.38 | bwd_microstep: 1722.02 | bwd_inner_microstep: 1607.93 | bwd_allreduce_microstep: 114.02 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13801
total_samples=20769, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:16,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 05:54:16,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.62 | bwd_microstep: 1829.70 | bwd_inner_microstep: 1701.67 | bwd_allreduce_microstep: 127.97 | step_microstep: 113.83
[2025-08-03 05:54:16,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.43 | bwd: 7635.68 | bwd_inner: 6781.57 | bwd_allreduce: 853.88 | step: 114.41
{'loss': 0.7352, 'learning_rate': 4.823424143220097e-06, 'epoch': 0.68}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13288
total_samples=20773, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:18,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.62 | bwd_microstep: 1735.19 | bwd_inner_microstep: 1667.88 | bwd_allreduce_microstep: 67.25 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12280
total_samples=20776, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:21,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.61 | bwd_microstep: 1885.32 | bwd_inner_microstep: 1750.07 | bwd_allreduce_microstep: 135.18 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11795
total_samples=20779, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:24,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.77 | bwd_microstep: 1833.01 | bwd_inner_microstep: 1574.78 | bwd_allreduce_microstep: 258.16 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14584
total_samples=20783, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:27,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 05:54:27,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.43 | bwd_microstep: 2012.96 | bwd_inner_microstep: 1895.52 | bwd_allreduce_microstep: 117.38 | step_microstep: 132.69
[2025-08-03 05:54:27,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.37 | bwd: 7466.53 | bwd_inner: 6888.25 | bwd_allreduce: 578.05 | step: 133.29
{'loss': 0.7401, 'learning_rate': 4.809575737764759e-06, 'epoch': 0.68}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13792
total_samples=20788, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:29,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.45 | bwd_microstep: 1851.82 | bwd_inner_microstep: 1807.09 | bwd_allreduce_microstep: 44.65 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11731
total_samples=20791, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:32,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.93 | bwd_microstep: 1840.65 | bwd_inner_microstep: 1599.97 | bwd_allreduce_microstep: 240.61 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13738
total_samples=20795, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:34,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.69 | bwd_microstep: 1791.77 | bwd_inner_microstep: 1716.19 | bwd_allreduce_microstep: 75.51 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12498
total_samples=20799, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:37,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 05:54:37,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.06 | bwd_microstep: 1955.97 | bwd_inner_microstep: 1860.70 | bwd_allreduce_microstep: 95.20 | step_microstep: 114.80
[2025-08-03 05:54:37,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.06 | bwd: 7440.25 | bwd_inner: 6983.95 | bwd_allreduce: 456.06 | step: 115.28
{'loss': 0.7222, 'learning_rate': 4.795740943591955e-06, 'epoch': 0.68}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14579
total_samples=20803, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:40,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.68 | bwd_microstep: 1755.94 | bwd_inner_microstep: 1719.57 | bwd_allreduce_microstep: 36.30 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13351
total_samples=20807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:42,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.35 | bwd_microstep: 1809.16 | bwd_inner_microstep: 1712.77 | bwd_allreduce_microstep: 96.31 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13850
total_samples=20811, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:45,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.22 | bwd_microstep: 1853.73 | bwd_inner_microstep: 1738.77 | bwd_allreduce_microstep: 114.89 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11771
total_samples=20814, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:48,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 05:54:48,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.22 | bwd_microstep: 1928.71 | bwd_inner_microstep: 1606.88 | bwd_allreduce_microstep: 321.76 | step_microstep: 139.79
[2025-08-03 05:54:48,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2874.41 | bwd: 7347.60 | bwd_inner: 6777.99 | bwd_allreduce: 569.37 | step: 140.32
{'loss': 0.7469, 'learning_rate': 4.781919796981818e-06, 'epoch': 0.68}
 68%|██████▊   | 1365/2000 [4:11:20<1:55:38, 10.93s/it]
 68%|██████▊   | 1366/2000 [4:11:31<1:55:19, 10.91s/it]
 68%|██████▊   | 1367/2000 [4:11:41<1:54:33, 10.86s/it]
 68%|██████▊   | 1368/2000 [4:11:52<1:53:44, 10.80s/it]
 68%|██████▊   | 1369/2000 [4:12:03<1:53:11, 10.76s/it]
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13736
total_samples=20820, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:51,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.22 | bwd_microstep: 1843.35 | bwd_inner_microstep: 1776.94 | bwd_allreduce_microstep: 66.31 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13435
total_samples=20824, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:53,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.28 | bwd_microstep: 1793.94 | bwd_inner_microstep: 1705.11 | bwd_allreduce_microstep: 88.76 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13987
total_samples=20828, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:56,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.05 | bwd_microstep: 1807.09 | bwd_inner_microstep: 1744.77 | bwd_allreduce_microstep: 62.25 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13626
total_samples=20832, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:54:59,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 05:54:59,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.51 | bwd_microstep: 2265.05 | bwd_inner_microstep: 2138.55 | bwd_allreduce_microstep: 126.42 | step_microstep: 116.13
[2025-08-03 05:54:59,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.00 | bwd: 7709.49 | bwd_inner: 7365.37 | bwd_allreduce: 343.85 | step: 116.75
{'loss': 0.7325, 'learning_rate': 4.7681123341787e-06, 'epoch': 0.69}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12007
total_samples=20836, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:02,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.83 | bwd_microstep: 1865.58 | bwd_inner_microstep: 1603.10 | bwd_allreduce_microstep: 262.41 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13133
total_samples=20840, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:04,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.02 | bwd_microstep: 1897.94 | bwd_inner_microstep: 1662.33 | bwd_allreduce_microstep: 235.53 | step_microstep: 0.34
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14300
total_samples=20845, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:07,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.91 | bwd_microstep: 2214.86 | bwd_inner_microstep: 2206.94 | bwd_allreduce_microstep: 7.84 | step_microstep: 0.85
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11750
total_samples=20848, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:10,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 05:55:10,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.04 | bwd_microstep: 2000.89 | bwd_inner_microstep: 1756.97 | bwd_allreduce_microstep: 243.86 | step_microstep: 138.56
[2025-08-03 05:55:10,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.73 | bwd: 7979.32 | bwd_inner: 7229.33 | bwd_allreduce: 749.74 | step: 139.87
{'loss': 0.7538, 'learning_rate': 4.754318591391057e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13657
total_samples=20852, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:13,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.66 | bwd_microstep: 2063.09 | bwd_inner_microstep: 1986.08 | bwd_allreduce_microstep: 76.94 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12533
total_samples=20855, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:16,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.65 | bwd_microstep: 1996.20 | bwd_inner_microstep: 1590.16 | bwd_allreduce_microstep: 405.98 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12054
total_samples=20858, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:18,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.71 | bwd_microstep: 1771.92 | bwd_inner_microstep: 1569.61 | bwd_allreduce_microstep: 202.22 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14806
total_samples=20863, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:22,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 05:55:22,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.33 | bwd_microstep: 1836.36 | bwd_inner_microstep: 1767.23 | bwd_allreduce_microstep: 69.06 | step_microstep: 685.74
[2025-08-03 05:55:22,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.28 | bwd: 7667.63 | bwd_inner: 6913.07 | bwd_allreduce: 754.31 | step: 686.25
{'loss': 0.7387, 'learning_rate': 4.740538604791371e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13386
total_samples=20867, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:24,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.25 | bwd_microstep: 1854.89 | bwd_inner_microstep: 1721.33 | bwd_allreduce_microstep: 133.49 | step_microstep: 0.32
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12941
total_samples=20872, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:27,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.95 | bwd_microstep: 1817.76 | bwd_inner_microstep: 1632.81 | bwd_allreduce_microstep: 184.89 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11651
total_samples=20875, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:29,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.59 | bwd_microstep: 1781.81 | bwd_inner_microstep: 1535.01 | bwd_allreduce_microstep: 246.74 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12381
total_samples=20878, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:33,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.06
[2025-08-03 05:55:33,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.45 | bwd_microstep: 1794.68 | bwd_inner_microstep: 1607.35 | bwd_allreduce_microstep: 187.26 | step_microstep: 458.79
[2025-08-03 05:55:33,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2901.17 | bwd: 7249.20 | bwd_inner: 6496.49 | bwd_allreduce: 752.46 | step: 459.38
{'loss': 0.7413, 'learning_rate': 4.726772410516055e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13617
total_samples=20882, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:35,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.49 | bwd_microstep: 2029.20 | bwd_inner_microstep: 1916.21 | bwd_allreduce_microstep: 112.93 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13631
total_samples=20886, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:38,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.32 | bwd_microstep: 2055.54 | bwd_inner_microstep: 1907.84 | bwd_allreduce_microstep: 147.64 | step_microstep: 0.85
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14173
total_samples=20890, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:41,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.80 | bwd_microstep: 1816.61 | bwd_inner_microstep: 1757.13 | bwd_allreduce_microstep: 59.39 | step_microstep: 0.32
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11657
total_samples=20893, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:44,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 05:55:44,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.81 | bwd_microstep: 2016.14 | bwd_inner_microstep: 1787.18 | bwd_allreduce_microstep: 228.89 | step_microstep: 113.59
[2025-08-03 05:55:44,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.35 | bwd: 7917.55 | bwd_inner: 7368.36 | bwd_allreduce: 548.93 | step: 114.88
{'loss': 0.7415, 'learning_rate': 4.713020044665348e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13135
total_samples=20897, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:46,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.85 | bwd_microstep: 1804.23 | bwd_inner_microstep: 1697.39 | bwd_allreduce_microstep: 106.78 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13308
total_samples=20901, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:49,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.22 | bwd_microstep: 1892.93 | bwd_inner_microstep: 1844.04 | bwd_allreduce_microstep: 48.82 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13605
total_samples=20905, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:52,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.58 | bwd_microstep: 2436.01 | bwd_inner_microstep: 2197.10 | bwd_allreduce_microstep: 238.84 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13504
total_samples=20909, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:55,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 05:55:55,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.11 | bwd_microstep: 2095.62 | bwd_inner_microstep: 1942.10 | bwd_allreduce_microstep: 153.45 | step_microstep: 109.45
[2025-08-03 05:55:55,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.70 | bwd: 8228.83 | bwd_inner: 7680.63 | bwd_allreduce: 547.96 | step: 109.96
 68%|██████▊   | 1370/2000 [4:12:14<1:53:35, 10.82s/it]
 69%|██████▊   | 1371/2000 [4:12:25<1:54:48, 10.95s/it]
 69%|██████▊   | 1372/2000 [4:12:36<1:56:20, 11.12s/it]
 69%|██████▊   | 1373/2000 [4:12:47<1:55:27, 11.05s/it]
 69%|██████▊   | 1374/2000 [4:12:59<1:55:39, 11.09s/it]
{'loss': 0.7378, 'learning_rate': 4.699281543303222e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13264
total_samples=20913, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:55:58,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.79 | bwd_microstep: 1707.08 | bwd_inner_microstep: 1661.74 | bwd_allreduce_microstep: 45.27 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12376
total_samples=20917, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:00,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.83 | bwd_microstep: 1848.61 | bwd_inner_microstep: 1584.10 | bwd_allreduce_microstep: 264.44 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13148
total_samples=20921, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:03,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.44 | bwd_microstep: 1756.01 | bwd_inner_microstep: 1666.47 | bwd_allreduce_microstep: 89.48 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13809
total_samples=20925, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:05,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 05:56:05,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.46 | bwd_microstep: 1748.39 | bwd_inner_microstep: 1691.24 | bwd_allreduce_microstep: 57.09 | step_microstep: 130.67
[2025-08-03 05:56:05,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.45 | bwd: 7060.13 | bwd_inner: 6603.54 | bwd_allreduce: 456.36 | step: 131.01
{'loss': 0.7315, 'learning_rate': 4.685556942457296e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13494
total_samples=20929, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:08,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.09 | bwd_microstep: 1885.25 | bwd_inner_microstep: 1693.65 | bwd_allreduce_microstep: 191.52 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13387
total_samples=20933, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:11,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.94 | bwd_microstep: 2073.33 | bwd_inner_microstep: 1918.29 | bwd_allreduce_microstep: 154.98 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13952
total_samples=20937, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:13,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.65 | bwd_microstep: 1784.51 | bwd_inner_microstep: 1724.04 | bwd_allreduce_microstep: 60.40 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14413
total_samples=20941, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:16,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 05:56:16,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.55 | bwd_microstep: 1883.10 | bwd_inner_microstep: 1745.96 | bwd_allreduce_microstep: 137.08 | step_microstep: 111.39
[2025-08-03 05:56:16,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.17 | bwd: 7626.24 | bwd_inner: 7081.94 | bwd_allreduce: 544.06 | step: 111.84
{'loss': 0.7342, 'learning_rate': 4.67184627811874e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13822
total_samples=20946, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:19,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.99 | bwd_microstep: 1801.17 | bwd_inner_microstep: 1710.88 | bwd_allreduce_microstep: 90.24 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13507
total_samples=20950, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:21,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.69 | bwd_microstep: 1824.97 | bwd_inner_microstep: 1793.79 | bwd_allreduce_microstep: 31.11 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14046
total_samples=20955, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:24,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.97 | bwd_microstep: 1734.69 | bwd_inner_microstep: 1697.75 | bwd_allreduce_microstep: 36.88 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13390
total_samples=20959, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:26,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 05:56:26,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.61 | bwd_microstep: 1740.75 | bwd_inner_microstep: 1696.42 | bwd_allreduce_microstep: 44.26 | step_microstep: 107.98
[2025-08-03 05:56:26,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2731.18 | bwd: 7101.63 | bwd_inner: 6898.83 | bwd_allreduce: 202.57 | step: 108.48
{'loss': 0.7354, 'learning_rate': 4.65814958624217e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13162
total_samples=20963, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:29,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.13 | bwd_microstep: 1939.77 | bwd_inner_microstep: 1879.62 | bwd_allreduce_microstep: 60.09 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15943
total_samples=20969, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:32,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.05 | bwd_microstep: 1799.60 | bwd_inner_microstep: 1793.53 | bwd_allreduce_microstep: 5.99 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14023
total_samples=20974, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:35,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.12 | bwd_microstep: 2090.83 | bwd_inner_microstep: 2049.44 | bwd_allreduce_microstep: 41.32 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14087
total_samples=20978, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:37,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 05:56:37,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.62 | bwd_microstep: 1941.48 | bwd_inner_microstep: 1845.46 | bwd_allreduce_microstep: 95.96 | step_microstep: 135.98
[2025-08-03 05:56:37,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.85 | bwd: 7771.73 | bwd_inner: 7568.03 | bwd_allreduce: 203.44 | step: 136.57
{'loss': 0.7495, 'learning_rate': 4.6444669027455615e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13087
total_samples=20982, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:41,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.63 | bwd_microstep: 2371.96 | bwd_inner_microstep: 1902.91 | bwd_allreduce_microstep: 468.95 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14140
total_samples=20987, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:43,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.53 | bwd_microstep: 1821.29 | bwd_inner_microstep: 1725.20 | bwd_allreduce_microstep: 96.02 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14620
total_samples=20991, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:46,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.47 | bwd_microstep: 1914.73 | bwd_inner_microstep: 1759.36 | bwd_allreduce_microstep: 155.31 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13431
total_samples=20995, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:49,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.84
[2025-08-03 05:56:49,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.58 | bwd_microstep: 2032.20 | bwd_inner_microstep: 1909.35 | bwd_allreduce_microstep: 122.77 | step_microstep: 143.55
[2025-08-03 05:56:49,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.13 | bwd: 8140.23 | bwd_inner: 7296.84 | bwd_allreduce: 843.12 | step: 144.03
 69%|██████▉   | 1375/2000 [4:13:10<1:56:27, 11.18s/it]
 69%|██████▉   | 1376/2000 [4:13:20<1:53:27, 10.91s/it]
 69%|██████▉   | 1377/2000 [4:13:31<1:53:02, 10.89s/it]
 69%|██████▉   | 1378/2000 [4:13:41<1:50:54, 10.70s/it]
 69%|██████▉   | 1379/2000 [4:13:52<1:51:37, 10.78s/it]
{'loss': 0.7286, 'learning_rate': 4.630798263510162e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14631
total_samples=20999, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:52,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.26 | bwd_microstep: 2034.01 | bwd_inner_microstep: 1763.98 | bwd_allreduce_microstep: 269.96 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13224
total_samples=21004, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:54,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.84 | bwd_microstep: 1839.60 | bwd_inner_microstep: 1707.86 | bwd_allreduce_microstep: 131.67 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15106
total_samples=21009, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:56:57,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.33 | bwd_microstep: 1760.91 | bwd_inner_microstep: 1749.10 | bwd_allreduce_microstep: 11.75 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13341
total_samples=21013, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:00,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 05:57:00,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.81 | bwd_microstep: 1783.65 | bwd_inner_microstep: 1698.54 | bwd_allreduce_microstep: 85.05 | step_microstep: 158.86
[2025-08-03 05:57:00,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.17 | bwd: 7418.24 | bwd_inner: 6919.46 | bwd_allreduce: 498.52 | step: 159.40
{'loss': 0.7473, 'learning_rate': 4.617143704380382e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13439
total_samples=21017, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:02,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.36 | bwd_microstep: 1856.00 | bwd_inner_microstep: 1764.95 | bwd_allreduce_microstep: 90.98 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14061
total_samples=21021, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:05,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.35 | bwd_microstep: 1903.60 | bwd_inner_microstep: 1698.56 | bwd_allreduce_microstep: 204.96 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13202
total_samples=21025, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:08,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.98 | bwd_microstep: 2079.20 | bwd_inner_microstep: 1923.71 | bwd_allreduce_microstep: 155.42 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13592
total_samples=21029, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:10,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 05:57:10,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.94 | bwd_microstep: 1730.38 | bwd_inner_microstep: 1686.03 | bwd_allreduce_microstep: 44.28 | step_microstep: 108.93
[2025-08-03 05:57:10,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2747.56 | bwd: 7569.24 | bwd_inner: 7073.24 | bwd_allreduce: 495.73 | step: 109.84
{'loss': 0.7317, 'learning_rate': 4.60350326116371e-06, 'epoch': 0.69}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14226
total_samples=21033, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:13,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.99 | bwd_microstep: 1736.06 | bwd_inner_microstep: 1687.95 | bwd_allreduce_microstep: 48.04 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14366
total_samples=21037, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:15,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.78 | bwd_microstep: 1747.35 | bwd_inner_microstep: 1724.70 | bwd_allreduce_microstep: 22.58 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13730
total_samples=21041, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:18,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.39 | bwd_microstep: 1929.50 | bwd_inner_microstep: 1752.13 | bwd_allreduce_microstep: 177.30 | step_microstep: 0.30
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14284
total_samples=21046, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:21,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 05:57:21,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.51 | bwd_microstep: 1972.47 | bwd_inner_microstep: 1852.12 | bwd_allreduce_microstep: 120.29 | step_microstep: 111.38
[2025-08-03 05:57:21,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.60 | bwd: 7385.44 | bwd_inner: 7016.89 | bwd_allreduce: 368.30 | step: 111.95
{'loss': 0.7444, 'learning_rate': 4.589876969630616e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15208
total_samples=21050, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:24,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.67 | bwd_microstep: 1831.06 | bwd_inner_microstep: 1762.66 | bwd_allreduce_microstep: 68.32 | step_microstep: 0.28
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12697
total_samples=21054, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:26,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.79 | bwd_microstep: 1745.10 | bwd_inner_microstep: 1646.52 | bwd_allreduce_microstep: 98.52 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13779
total_samples=21059, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:29,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.58 | bwd_microstep: 1734.09 | bwd_inner_microstep: 1679.68 | bwd_allreduce_microstep: 54.34 | step_microstep: 0.24
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14629
total_samples=21063, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:31,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.83
[2025-08-03 05:57:31,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.63 | bwd_microstep: 1760.84 | bwd_inner_microstep: 1693.17 | bwd_allreduce_microstep: 67.59 | step_microstep: 145.31
[2025-08-03 05:57:31,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.59 | bwd: 7071.16 | bwd_inner: 6782.04 | bwd_allreduce: 288.86 | step: 146.09
{'loss': 0.7392, 'learning_rate': 4.576264865514467e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13307
total_samples=21067, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:34,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.41 | bwd_microstep: 2011.32 | bwd_inner_microstep: 2004.97 | bwd_allreduce_microstep: 6.27 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13412
total_samples=21071, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:37,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.37 | bwd_microstep: 1825.43 | bwd_inner_microstep: 1668.00 | bwd_allreduce_microstep: 157.36 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13505
total_samples=21075, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:39,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.11 | bwd_microstep: 1966.73 | bwd_inner_microstep: 1843.85 | bwd_allreduce_microstep: 122.81 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12451
total_samples=21078, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:42,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 05:57:42,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.40 | bwd_microstep: 1934.53 | bwd_inner_microstep: 1752.13 | bwd_allreduce_microstep: 182.32 | step_microstep: 129.19
[2025-08-03 05:57:42,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.22 | bwd: 7738.06 | bwd_inner: 7268.96 | bwd_allreduce: 468.85 | step: 129.66
 69%|██████▉   | 1380/2000 [4:14:04<1:53:28, 10.98s/it]
 69%|██████▉   | 1381/2000 [4:14:14<1:52:24, 10.90s/it]
 69%|██████▉   | 1382/2000 [4:14:25<1:51:46, 10.85s/it]
 69%|██████▉   | 1383/2000 [4:14:36<1:50:55, 10.79s/it]
 69%|██████▉   | 1384/2000 [4:14:46<1:49:19, 10.65s/it]
 69%|██████▉   | 1385/2000 [4:14:57<1:50:10, 10.75s/it]
{'loss': 0.7378, 'learning_rate': 4.562666984511416e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13186
total_samples=21082, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:45,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.81 | bwd_microstep: 1789.50 | bwd_inner_microstep: 1685.36 | bwd_allreduce_microstep: 104.07 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12002
total_samples=21085, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:48,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.24 | bwd_microstep: 2094.35 | bwd_inner_microstep: 1861.76 | bwd_allreduce_microstep: 232.52 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13246
total_samples=21090, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:50,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.65 | bwd_microstep: 1889.52 | bwd_inner_microstep: 1814.71 | bwd_allreduce_microstep: 74.75 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13668
total_samples=21094, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:53,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 05:57:53,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.71 | bwd_microstep: 1830.45 | bwd_inner_microstep: 1712.88 | bwd_allreduce_microstep: 117.48 | step_microstep: 141.53
[2025-08-03 05:57:53,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.33 | bwd: 7603.87 | bwd_inner: 7074.71 | bwd_allreduce: 528.90 | step: 142.12
{'loss': 0.7532, 'learning_rate': 4.549083362280318e-06, 'epoch': 0.69}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13118
total_samples=21098, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:56,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.41 | bwd_microstep: 1947.64 | bwd_inner_microstep: 1733.04 | bwd_allreduce_microstep: 214.53 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12005
total_samples=21101, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:57:58,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.96 | bwd_microstep: 1804.87 | bwd_inner_microstep: 1571.56 | bwd_allreduce_microstep: 233.25 | step_microstep: 0.23
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13587
total_samples=21105, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:01,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.34 | bwd_microstep: 2220.63 | bwd_inner_microstep: 1919.22 | bwd_allreduce_microstep: 301.33 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12449
total_samples=21108, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:04,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 05:58:04,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.83 | bwd_microstep: 1765.12 | bwd_inner_microstep: 1583.85 | bwd_allreduce_microstep: 181.20 | step_microstep: 445.93
[2025-08-03 05:58:04,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.46 | bwd: 7738.31 | bwd_inner: 6807.67 | bwd_allreduce: 930.39 | step: 446.54
{'loss': 0.7384, 'learning_rate': 4.535514034442644e-06, 'epoch': 0.69}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13496
total_samples=21112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:07,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.42 | bwd_microstep: 2069.90 | bwd_inner_microstep: 1907.93 | bwd_allreduce_microstep: 161.91 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13669
total_samples=21116, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:10,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.69 | bwd_microstep: 1774.31 | bwd_inner_microstep: 1694.72 | bwd_allreduce_microstep: 79.52 | step_microstep: 0.17
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12389
total_samples=21120, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:13,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.42 | bwd_microstep: 2007.71 | bwd_inner_microstep: 1588.08 | bwd_allreduce_microstep: 419.56 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13457
total_samples=21123, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:15,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 20.40
[2025-08-03 05:58:15,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.34 | bwd_microstep: 1746.54 | bwd_inner_microstep: 1659.88 | bwd_allreduce_microstep: 86.55 | step_microstep: 159.73
[2025-08-03 05:58:15,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.80 | bwd: 7598.53 | bwd_inner: 6850.60 | bwd_allreduce: 747.64 | step: 160.24
{'loss': 0.7379, 'learning_rate': 4.521959036582372e-06, 'epoch': 0.69}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13710
total_samples=21127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:18,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.81 | bwd_microstep: 1734.75 | bwd_inner_microstep: 1660.71 | bwd_allreduce_microstep: 73.97 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14499
total_samples=21131, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:21,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.41 | bwd_microstep: 2154.68 | bwd_inner_microstep: 2015.25 | bwd_allreduce_microstep: 139.37 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11760
total_samples=21134, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:23,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.30 | bwd_microstep: 2029.75 | bwd_inner_microstep: 1817.46 | bwd_allreduce_microstep: 212.22 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13663
total_samples=21138, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:26,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 05:58:26,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.76 | bwd_microstep: 1751.20 | bwd_inner_microstep: 1701.52 | bwd_allreduce_microstep: 49.61 | step_microstep: 136.20
[2025-08-03 05:58:26,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.20 | bwd: 7670.44 | bwd_inner: 7194.95 | bwd_allreduce: 475.23 | step: 136.85
{'loss': 0.7474, 'learning_rate': 4.508418404245903e-06, 'epoch': 0.69}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13425
total_samples=21142, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:29,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.55 | bwd_microstep: 1737.55 | bwd_inner_microstep: 1651.78 | bwd_allreduce_microstep: 85.70 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14345
total_samples=21146, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:31,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.24 | bwd_microstep: 2030.11 | bwd_inner_microstep: 1916.37 | bwd_allreduce_microstep: 113.68 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11744
total_samples=21149, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:34,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.53 | bwd_microstep: 1950.53 | bwd_inner_microstep: 1533.86 | bwd_allreduce_microstep: 416.55 | step_microstep: 0.47
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14081
total_samples=21153, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:37,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 05:58:37,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.37 | bwd_microstep: 2104.80 | bwd_inner_microstep: 1768.37 | bwd_allreduce_microstep: 336.36 | step_microstep: 112.27
[2025-08-03 05:58:37,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.61 | bwd: 7823.07 | bwd_inner: 6870.38 | bwd_allreduce: 952.39 | step: 113.04
{'loss': 0.7552, 'learning_rate': 4.494892172941965e-06, 'epoch': 0.69}
 69%|██████▉   | 1386/2000 [4:15:08<1:50:19, 10.78s/it]
 69%|██████▉   | 1387/2000 [4:15:19<1:51:38, 10.93s/it]
 69%|██████▉   | 1388/2000 [4:15:30<1:51:07, 10.89s/it]
 69%|██████▉   | 1389/2000 [4:15:41<1:50:53, 10.89s/it]
 70%|██████▉   | 1390/2000 [4:15:52<1:51:14, 10.94s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13444
total_samples=21157, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:40,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 670.95 | bwd_microstep: 1741.42 | bwd_inner_microstep: 1675.66 | bwd_allreduce_microstep: 65.69 | step_microstep: 0.41
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12830
total_samples=21161, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:42,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.43 | bwd_microstep: 1833.12 | bwd_inner_microstep: 1682.40 | bwd_allreduce_microstep: 150.66 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12338
total_samples=21164, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:45,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.44 | bwd_microstep: 1762.50 | bwd_inner_microstep: 1571.80 | bwd_allreduce_microstep: 190.62 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11526
total_samples=21167, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:48,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54
[2025-08-03 05:58:48,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.25 | bwd_microstep: 1766.05 | bwd_inner_microstep: 1540.06 | bwd_allreduce_microstep: 225.92 | step_microstep: 469.30
[2025-08-03 05:58:48,335] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.00 | bwd: 7103.13 | bwd_inner: 6469.92 | bwd_allreduce: 632.98 | step: 470.03
{'loss': 0.7356, 'learning_rate': 4.481380378141528e-06, 'epoch': 0.7}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14188
total_samples=21171, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:51,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.95 | bwd_microstep: 2112.72 | bwd_inner_microstep: 1910.82 | bwd_allreduce_microstep: 201.84 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12708
total_samples=21175, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:53,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.71 | bwd_microstep: 1880.53 | bwd_inner_microstep: 1772.79 | bwd_allreduce_microstep: 107.67 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13558
total_samples=21179, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:56,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.41 | bwd_microstep: 1791.73 | bwd_inner_microstep: 1710.53 | bwd_allreduce_microstep: 81.14 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13497
total_samples=21183, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:58:59,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 05:58:59,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.49 | bwd_microstep: 1809.83 | bwd_inner_microstep: 1716.21 | bwd_allreduce_microstep: 93.55 | step_microstep: 121.16
[2025-08-03 05:58:59,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.49 | bwd: 7594.86 | bwd_inner: 7110.34 | bwd_allreduce: 484.28 | step: 121.60
{'loss': 0.7482, 'learning_rate': 4.467883055277696e-06, 'epoch': 0.7}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508
total_samples=21187, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:01,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.80 | bwd_microstep: 1851.97 | bwd_inner_microstep: 1718.92 | bwd_allreduce_microstep: 132.97 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13609
total_samples=21191, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:04,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.15 | bwd_microstep: 2218.93 | bwd_inner_microstep: 1999.36 | bwd_allreduce_microstep: 219.50 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12365
total_samples=21194, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:07,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.94 | bwd_microstep: 2017.61 | bwd_inner_microstep: 1794.24 | bwd_allreduce_microstep: 223.31 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13851
total_samples=21198, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:10,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 05:59:10,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.23 | bwd_microstep: 1811.75 | bwd_inner_microstep: 1736.63 | bwd_allreduce_microstep: 75.05 | step_microstep: 108.68
[2025-08-03 05:59:10,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.05 | bwd: 7900.32 | bwd_inner: 7249.14 | bwd_allreduce: 650.92 | step: 109.08
{'loss': 0.7348, 'learning_rate': 4.454400239745619e-06, 'epoch': 0.7}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13459
total_samples=21202, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:13,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.79 | bwd_microstep: 1904.61 | bwd_inner_microstep: 1854.38 | bwd_allreduce_microstep: 50.16 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13482
total_samples=21206, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:15,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.89 | bwd_microstep: 1825.01 | bwd_inner_microstep: 1734.78 | bwd_allreduce_microstep: 90.17 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770
total_samples=21209, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:18,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.56 | bwd_microstep: 1755.22 | bwd_inner_microstep: 1544.32 | bwd_allreduce_microstep: 210.84 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770
total_samples=21212, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:21,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.69
[2025-08-03 05:59:21,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.43 | bwd_microstep: 2028.58 | bwd_inner_microstep: 1817.76 | bwd_allreduce_microstep: 210.72 | step_microstep: 121.32
[2025-08-03 05:59:21,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.59 | bwd: 7513.50 | bwd_inner: 6951.24 | bwd_allreduce: 561.98 | step: 121.81
{'loss': 0.7489, 'learning_rate': 4.440931966902419e-06, 'epoch': 0.7}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13324
total_samples=21216, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:23,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.60 | bwd_microstep: 1976.16 | bwd_inner_microstep: 1846.96 | bwd_allreduce_microstep: 129.12 | step_microstep: 0.73
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13698
total_samples=21220, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:26,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.06 | bwd_microstep: 1871.01 | bwd_inner_microstep: 1824.84 | bwd_allreduce_microstep: 46.10 | step_microstep: 0.17
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13005
total_samples=21224, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:29,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.61 | bwd_microstep: 1882.20 | bwd_inner_microstep: 1618.35 | bwd_allreduce_microstep: 263.77 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12094
total_samples=21227, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:32,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 05:59:32,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.03 | bwd_microstep: 2020.82 | bwd_inner_microstep: 1806.40 | bwd_allreduce_microstep: 214.35 | step_microstep: 116.32
[2025-08-03 05:59:32,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.21 | bwd: 7750.24 | bwd_inner: 7096.54 | bwd_allreduce: 653.42 | step: 117.50
{'loss': 0.7477, 'learning_rate': 4.427478272067066e-06, 'epoch': 0.7}
██████▉   | 1390/2000 [4:15:52<1:51:14, 10.94s/it]
 70%|██████▉   | 1391/2000 [4:16:03<1:50:14, 10.86s/it]
 70%|██████▉   | 1392/2000 [4:16:14<1:50:01, 10.86s/it]
 70%|██████▉   | 1393/2000 [4:16:25<1:50:45, 10.95s/it]
 70%|██████▉   | 1394/2000 [4:16:35<1:49:57, 10.89s/it]
 70%|██████▉   | 1395/2000 [4:16:46<1:49:53, 10.90s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14074
total_samples=21231, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:34,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.86 | bwd_microstep: 2061.85 | bwd_inner_microstep: 1884.71 | bwd_allreduce_microstep: 177.08 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13161
total_samples=21236, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:37,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.40 | bwd_microstep: 1902.45 | bwd_inner_microstep: 1788.25 | bwd_allreduce_microstep: 114.13 | step_microstep: 0.16
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12399
total_samples=21240, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:40,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.55 | bwd_microstep: 2093.86 | bwd_inner_microstep: 1864.17 | bwd_allreduce_microstep: 229.62 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11683
total_samples=21243, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:43,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 05:59:43,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.11 | bwd_microstep: 2110.74 | bwd_inner_microstep: 1634.49 | bwd_allreduce_microstep: 476.12 | step_microstep: 135.93
[2025-08-03 05:59:43,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.86 | bwd: 8168.97 | bwd_inner: 7171.64 | bwd_allreduce: 997.03 | step: 136.32
{'loss': 0.7469, 'learning_rate': 4.414039190520308e-06, 'epoch': 0.7}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13354
total_samples=21247, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:46,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.44 | bwd_microstep: 1810.35 | bwd_inner_microstep: 1698.43 | bwd_allreduce_microstep: 111.86 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=21250, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:48,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.23 | bwd_microstep: 1932.08 | bwd_inner_microstep: 1739.68 | bwd_allreduce_microstep: 192.34 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13615
total_samples=21254, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:51,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.24 | bwd_microstep: 1824.84 | bwd_inner_microstep: 1733.31 | bwd_allreduce_microstep: 91.47 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11800
total_samples=21257, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:54,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 05:59:54,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.03 | bwd_microstep: 1849.09 | bwd_inner_microstep: 1602.43 | bwd_allreduce_microstep: 246.60 | step_microstep: 108.78
[2025-08-03 05:59:54,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2857.87 | bwd: 7416.42 | bwd_inner: 6773.84 | bwd_allreduce: 642.35 | step: 109.24
{'loss': 0.7331, 'learning_rate': 4.400614757504565e-06, 'epoch': 0.7}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13135
total_samples=21261, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:56,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.56 | bwd_microstep: 1954.97 | bwd_inner_microstep: 1858.00 | bwd_allreduce_microstep: 96.91 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11600
total_samples=21264, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 05:59:59,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.57 | bwd_microstep: 2027.73 | bwd_inner_microstep: 1798.21 | bwd_allreduce_microstep: 229.45 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13456
total_samples=21268, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:02,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.15 | bwd_microstep: 1944.24 | bwd_inner_microstep: 1696.32 | bwd_allreduce_microstep: 247.85 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12927
total_samples=21272, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:05,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 06:00:05,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.27 | bwd_microstep: 2050.41 | bwd_inner_microstep: 1959.00 | bwd_allreduce_microstep: 91.35 | step_microstep: 128.69
[2025-08-03 06:00:05,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.48 | bwd: 7977.40 | bwd_inner: 7311.53 | bwd_allreduce: 665.64 | step: 129.18
{'loss': 0.7376, 'learning_rate': 4.3872050082238535e-06, 'epoch': 0.7}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13441
total_samples=21277, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:07,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.09 | bwd_microstep: 1822.27 | bwd_inner_microstep: 1766.88 | bwd_allreduce_microstep: 55.32 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11806
total_samples=21280, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:10,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.92 | bwd_microstep: 1786.78 | bwd_inner_microstep: 1559.94 | bwd_allreduce_microstep: 226.78 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12522
total_samples=21283, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:12,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.14 | bwd_microstep: 1792.15 | bwd_inner_microstep: 1598.48 | bwd_allreduce_microstep: 193.61 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13876
total_samples=21287, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:16,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 06:00:16,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.88 | bwd_microstep: 2160.60 | bwd_inner_microstep: 1988.09 | bwd_allreduce_microstep: 172.45 | step_microstep: 158.98
[2025-08-03 06:00:16,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.96 | bwd: 7561.86 | bwd_inner: 6913.38 | bwd_allreduce: 648.23 | step: 159.48
{'loss': 0.7375, 'learning_rate': 4.373809977843676e-06, 'epoch': 0.7}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12198
total_samples=21290, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:18,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.31 | bwd_microstep: 1810.97 | bwd_inner_microstep: 1582.83 | bwd_allreduce_microstep: 228.08 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11693
total_samples=21293, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:21,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.52 | bwd_microstep: 2041.08 | bwd_inner_microstep: 1719.55 | bwd_allreduce_microstep: 321.47 | step_microstep: 0.13
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12562
total_samples=21297, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:24,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.03 | bwd_microstep: 1827.83 | bwd_inner_microstep: 1611.56 | bwd_allreduce_microstep: 216.21 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11802
total_samples=21300, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:27,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 06:00:27,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.62 | bwd_microstep: 2158.25 | bwd_inner_microstep: 1782.42 | bwd_allreduce_microstep: 375.72 | step_microstep: 110.86
[2025-08-03 06:00:27,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.40 | bwd: 7838.17 | bwd_inner: 6696.36 | bwd_allreduce: 1141.55 | step: 111.22
{'loss': 0.7426, 'learning_rate': 4.360429701490935e-06, 'epoch': 0.7}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11701
total_samples=21303, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:29,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.69 | bwd_microstep: 2057.03 | bwd_inner_microstep: 1814.18 | bwd_allreduce_microstep: 242.78 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13305
total_samples=21308, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:32,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.07 | bwd_microstep: 2210.65 | bwd_inner_microstep: 2078.21 | bwd_allreduce_microstep: 132.39 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12569
total_samples=21312, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:35,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.42 | bwd_microstep: 1719.30 | bwd_inner_microstep: 1596.35 | bwd_allreduce_microstep: 122.86 | step_microstep: 0.40
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13010
total_samples=21315, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:38,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 06:00:38,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.21 | bwd_microstep: 1801.35 | bwd_inner_microstep: 1620.07 | bwd_allreduce_microstep: 181.22 | step_microstep: 130.97
[2025-08-03 06:00:38,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.30 | bwd: 7788.38 | bwd_inner: 7108.79 | bwd_allreduce: 679.35 | step: 131.62
 70%|██████▉   | 1396/2000 [4:16:58<1:51:17, 11.06s/it]
 70%|██████▉   | 1397/2000 [4:17:08<1:49:55, 10.94s/it]
 70%|██████▉   | 1398/2000 [4:17:20<1:50:27, 11.01s/it]
 70%|██████▉   | 1399/2000 [4:17:30<1:49:41, 10.95s/it]
 70%|███████   | 1400/2000 [4:17:41<1:49:42, 10.97s/it]
{'loss': 0.7382, 'learning_rate': 4.34706421425385e-06, 'epoch': 0.7}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14800
total_samples=21319, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:40,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.15 | bwd_microstep: 1987.16 | bwd_inner_microstep: 1882.50 | bwd_allreduce_microstep: 104.59 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13311
total_samples=21323, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:43,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.25 | bwd_microstep: 1967.54 | bwd_inner_microstep: 1673.20 | bwd_allreduce_microstep: 294.29 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12343
total_samples=21327, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:46,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.90 | bwd_microstep: 1982.58 | bwd_inner_microstep: 1795.65 | bwd_allreduce_microstep: 186.87 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12135
total_samples=21330, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:49,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 06:00:49,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.61 | bwd_microstep: 2022.40 | bwd_inner_microstep: 1819.66 | bwd_allreduce_microstep: 202.67 | step_microstep: 111.52
[2025-08-03 06:00:49,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.83 | bwd: 7959.74 | bwd_inner: 7171.01 | bwd_allreduce: 788.49 | step: 111.85
{'loss': 0.7467, 'learning_rate': 4.3337135511818514e-06, 'epoch': 0.7}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11951
total_samples=21333, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:52,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.30 | bwd_microstep: 2038.56 | bwd_inner_microstep: 1798.04 | bwd_allreduce_microstep: 240.45 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13705
total_samples=21337, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:54,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.46 | bwd_microstep: 1926.46 | bwd_inner_microstep: 1848.58 | bwd_allreduce_microstep: 77.82 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11814
total_samples=21340, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:00:57,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.14 | bwd_microstep: 1748.90 | bwd_inner_microstep: 1542.52 | bwd_allreduce_microstep: 206.31 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13529
total_samples=21344, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:00,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.16
[2025-08-03 06:01:00,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 927.18 | bwd_microstep: 1746.54 | bwd_inner_microstep: 1683.04 | bwd_allreduce_microstep: 63.42 | step_microstep: 124.97
[2025-08-03 06:01:00,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3047.99 | bwd: 7460.51 | bwd_inner: 6872.18 | bwd_allreduce: 588.09 | step: 125.32
{'loss': 0.7352, 'learning_rate': 4.320377747285497e-06, 'epoch': 0.7}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12372
total_samples=21347, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:03,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.33 | bwd_microstep: 1934.55 | bwd_inner_microstep: 1756.02 | bwd_allreduce_microstep: 178.46 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14079
total_samples=21351, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:05,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.43 | bwd_microstep: 1719.17 | bwd_inner_microstep: 1688.17 | bwd_allreduce_microstep: 30.93 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13145
total_samples=21355, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:08,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.01 | bwd_microstep: 1824.10 | bwd_inner_microstep: 1697.68 | bwd_allreduce_microstep: 126.35 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13657
total_samples=21359, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:11,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 06:01:11,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.51 | bwd_microstep: 2248.81 | bwd_inner_microstep: 2181.49 | bwd_allreduce_microstep: 67.25 | step_microstep: 109.22
[2025-08-03 06:01:11,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.20 | bwd: 7726.67 | bwd_inner: 7323.36 | bwd_allreduce: 403.07 | step: 109.58
{'loss': 0.7313, 'learning_rate': 4.307056837536373e-06, 'epoch': 0.7}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13506
total_samples=21363, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:13,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.12 | bwd_microstep: 1852.44 | bwd_inner_microstep: 1685.89 | bwd_allreduce_microstep: 166.48 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12069
total_samples=21366, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:16,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.30 | bwd_microstep: 1736.11 | bwd_inner_microstep: 1550.10 | bwd_allreduce_microstep: 185.94 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13549
total_samples=21370, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:19,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.30 | bwd_microstep: 1801.24 | bwd_inner_microstep: 1723.00 | bwd_allreduce_microstep: 78.16 | step_microstep: 0.34
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13093
total_samples=21374, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:21,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 06:01:21,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.73 | bwd_microstep: 1976.19 | bwd_inner_microstep: 1969.80 | bwd_allreduce_microstep: 6.32 | step_microstep: 111.98
[2025-08-03 06:01:21,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.38 | bwd: 7366.02 | bwd_inner: 6928.79 | bwd_allreduce: 436.99 | step: 112.53
{'loss': 0.7346, 'learning_rate': 4.2937508568670194e-06, 'epoch': 0.7}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12149
total_samples=21377, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:24,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.27 | bwd_microstep: 1838.50 | bwd_inner_microstep: 1706.41 | bwd_allreduce_microstep: 132.03 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13289
total_samples=21381, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:27,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.72 | bwd_microstep: 2033.20 | bwd_inner_microstep: 1716.17 | bwd_allreduce_microstep: 316.97 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13707
total_samples=21385, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:29,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.45 | bwd_microstep: 1769.63 | bwd_inner_microstep: 1697.03 | bwd_allreduce_microstep: 72.53 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12534
total_samples=21389, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:32,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.68
[2025-08-03 06:01:32,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.68 | bwd_microstep: 1733.17 | bwd_inner_microstep: 1592.38 | bwd_allreduce_microstep: 140.71 | step_microstep: 137.95
[2025-08-03 06:01:32,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.04 | bwd: 7374.56 | bwd_inner: 6711.99 | bwd_allreduce: 662.33 | step: 138.30
 70%|███████   | 1401/2000 [4:17:53<1:49:45, 10.99s/it]
 70%|███████   | 1402/2000 [4:18:04<1:50:13, 11.06s/it]
 70%|███████   | 1403/2000 [4:18:15<1:49:46, 11.03s/it]
 70%|███████   | 1404/2000 [4:18:26<1:49:21, 11.01s/it]
 70%|███████   | 1405/2000 [4:18:36<1:47:57, 10.89s/it]
{'loss': 0.7338, 'learning_rate': 4.280459840170818e-06, 'epoch': 0.7}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11725
total_samples=21392, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:35,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.11 | bwd_microstep: 1878.34 | bwd_inner_microstep: 1527.51 | bwd_allreduce_microstep: 350.76 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12211
total_samples=21395, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:37,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.28 | bwd_microstep: 1903.96 | bwd_inner_microstep: 1759.66 | bwd_allreduce_microstep: 144.23 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13623
total_samples=21399, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:40,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.48 | bwd_microstep: 1803.64 | bwd_inner_microstep: 1715.75 | bwd_allreduce_microstep: 87.82 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11853
total_samples=21402, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:43,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.32
[2025-08-03 06:01:43,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.57 | bwd_microstep: 1839.79 | bwd_inner_microstep: 1556.16 | bwd_allreduce_microstep: 283.54 | step_microstep: 151.97
[2025-08-03 06:01:43,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.36 | bwd: 7425.79 | bwd_inner: 6559.08 | bwd_allreduce: 866.45 | step: 152.44
{'loss': 0.738, 'learning_rate': 4.267183822301903e-06, 'epoch': 0.7}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14001
total_samples=21406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:45,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.29 | bwd_microstep: 1770.42 | bwd_inner_microstep: 1679.04 | bwd_allreduce_microstep: 91.31 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13403
total_samples=21410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:48,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.82 | bwd_microstep: 1761.34 | bwd_inner_microstep: 1687.09 | bwd_allreduce_microstep: 74.18 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12567
total_samples=21414, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:50,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.87 | bwd_microstep: 1818.48 | bwd_inner_microstep: 1600.25 | bwd_allreduce_microstep: 218.16 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11903
total_samples=21417, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:53,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.67
[2025-08-03 06:01:53,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.58 | bwd_microstep: 1862.12 | bwd_inner_microstep: 1606.79 | bwd_allreduce_microstep: 255.27 | step_microstep: 134.69
[2025-08-03 06:01:53,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.48 | bwd: 7212.41 | bwd_inner: 6573.17 | bwd_allreduce: 639.00 | step: 135.19
{'loss': 0.7331, 'learning_rate': 4.2539228380750955e-06, 'epoch': 0.7}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13256
total_samples=21421, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:56,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.96 | bwd_microstep: 1997.43 | bwd_inner_microstep: 1870.02 | bwd_allreduce_microstep: 127.35 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11404
total_samples=21424, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:01:59,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.89 | bwd_microstep: 1976.23 | bwd_inner_microstep: 1781.25 | bwd_allreduce_microstep: 194.91 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11844
total_samples=21427, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:01,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.72 | bwd_microstep: 1791.34 | bwd_inner_microstep: 1547.81 | bwd_allreduce_microstep: 243.46 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809
total_samples=21430, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:04,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 06:02:04,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.04 | bwd_microstep: 1898.98 | bwd_inner_microstep: 1557.94 | bwd_allreduce_microstep: 340.98 | step_microstep: 154.16
[2025-08-03 06:02:04,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.54 | bwd: 7664.06 | bwd_inner: 6757.02 | bwd_allreduce: 906.79 | step: 154.66
{'loss': 0.7394, 'learning_rate': 4.240676922265774e-06, 'epoch': 0.7}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15488
total_samples=21434, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:07,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.52 | bwd_microstep: 2153.19 | bwd_inner_microstep: 2084.74 | bwd_allreduce_microstep: 68.38 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13726
total_samples=21438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:10,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.00 | bwd_microstep: 1863.52 | bwd_inner_microstep: 1706.53 | bwd_allreduce_microstep: 156.92 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11832
total_samples=21441, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:12,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.36 | bwd_microstep: 1799.00 | bwd_inner_microstep: 1557.78 | bwd_allreduce_microstep: 241.16 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11622
total_samples=21444, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:15,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 06:02:15,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.07 | bwd_microstep: 2064.88 | bwd_inner_microstep: 1759.65 | bwd_allreduce_microstep: 305.16 | step_microstep: 109.35
[2025-08-03 06:02:15,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.88 | bwd: 7880.65 | bwd_inner: 7108.69 | bwd_allreduce: 771.70 | step: 110.02
{'loss': 0.7457, 'learning_rate': 4.2274461096098085e-06, 'epoch': 0.7}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14149
total_samples=21448, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:18,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.74 | bwd_microstep: 1777.72 | bwd_inner_microstep: 1731.15 | bwd_allreduce_microstep: 46.51 | step_microstep: 0.18
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13049
total_samples=21452, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:20,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.67 | bwd_microstep: 1763.81 | bwd_inner_microstep: 1661.84 | bwd_allreduce_microstep: 101.91 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11737
total_samples=21455, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:23,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.31 | bwd_microstep: 1925.29 | bwd_inner_microstep: 1732.68 | bwd_allreduce_microstep: 192.56 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14151
total_samples=21459, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:26,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 06:02:26,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.98 | bwd_microstep: 1746.55 | bwd_inner_microstep: 1725.64 | bwd_allreduce_microstep: 20.84 | step_microstep: 131.44
[2025-08-03 06:02:26,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.62 | bwd: 7213.43 | bwd_inner: 6851.29 | bwd_allreduce: 361.90 | step: 131.85
 70%|███████   | 1406/2000 [4:18:47<1:47:01, 10.81s/it]
 70%|███████   | 1407/2000 [4:18:58<1:46:25, 10.77s/it]
 70%|███████   | 1408/2000 [4:19:08<1:45:27, 10.69s/it]
 70%|███████   | 1409/2000 [4:19:19<1:45:56, 10.76s/it]
 70%|███████   | 1410/2000 [4:19:30<1:46:42, 10.85s/it]
 71%|███████   | 1411/2000 [4:19:40<1:45:19, 10.73s/it]
{'loss': 0.7268, 'learning_rate': 4.21423043480346e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13888
total_samples=21463, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:28,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.59 | bwd_microstep: 1739.30 | bwd_inner_microstep: 1691.15 | bwd_allreduce_microstep: 48.07 | step_microstep: 0.87
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13599
total_samples=21467, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:31,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.55 | bwd_microstep: 2069.61 | bwd_inner_microstep: 1963.00 | bwd_allreduce_microstep: 106.54 | step_microstep: 0.23
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12204
total_samples=21471, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:34,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.98 | bwd_microstep: 1965.47 | bwd_inner_microstep: 1797.91 | bwd_allreduce_microstep: 167.47 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13655
total_samples=21475, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:37,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.24
[2025-08-03 06:02:37,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.81 | bwd_microstep: 2045.99 | bwd_inner_microstep: 1929.74 | bwd_allreduce_microstep: 116.18 | step_microstep: 119.03
[2025-08-03 06:02:37,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.85 | bwd: 7820.44 | bwd_inner: 7381.78 | bwd_allreduce: 438.35 | step: 120.44
{'loss': 0.7528, 'learning_rate': 4.201029932503303e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14515
total_samples=21479, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:39,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.36 | bwd_microstep: 1938.28 | bwd_inner_microstep: 1786.65 | bwd_allreduce_microstep: 151.56 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14435
total_samples=21483, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:42,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.92 | bwd_microstep: 1782.96 | bwd_inner_microstep: 1728.46 | bwd_allreduce_microstep: 54.44 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11827
total_samples=21486, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:45,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.16 | bwd_microstep: 1820.03 | bwd_inner_microstep: 1569.57 | bwd_allreduce_microstep: 250.38 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13854
total_samples=21490, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:47,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 06:02:47,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.37 | bwd_microstep: 1771.26 | bwd_inner_microstep: 1721.01 | bwd_allreduce_microstep: 50.18 | step_microstep: 139.23
[2025-08-03 06:02:47,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2746.71 | bwd: 7312.57 | bwd_inner: 6805.68 | bwd_allreduce: 506.65 | step: 139.88
{'loss': 0.749, 'learning_rate': 4.18784463732611e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515
total_samples=21494, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:50,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.41 | bwd_microstep: 1744.85 | bwd_inner_microstep: 1690.63 | bwd_allreduce_microstep: 54.16 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11761
total_samples=21497, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:52,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.51 | bwd_microstep: 1911.47 | bwd_inner_microstep: 1744.80 | bwd_allreduce_microstep: 166.60 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13926
total_samples=21501, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:55,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.58 | bwd_microstep: 1760.31 | bwd_inner_microstep: 1713.69 | bwd_allreduce_microstep: 46.56 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13594
total_samples=21505, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:02:58,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.93
[2025-08-03 06:02:58,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.28 | bwd_microstep: 1802.50 | bwd_inner_microstep: 1724.77 | bwd_allreduce_microstep: 77.67 | step_microstep: 139.02
[2025-08-03 06:02:58,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.71 | bwd: 7219.18 | bwd_inner: 6873.89 | bwd_allreduce: 345.06 | step: 139.49
{'loss': 0.7317, 'learning_rate': 4.17467458384878e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13194
total_samples=21509, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:00,921] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.06 | bwd_microstep: 2026.11 | bwd_inner_microstep: 1899.03 | bwd_allreduce_microstep: 127.01 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11777
total_samples=21512, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:03,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.22 | bwd_microstep: 1925.00 | bwd_inner_microstep: 1554.00 | bwd_allreduce_microstep: 370.93 | step_microstep: 0.30
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13656
total_samples=21516, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:06,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.52 | bwd_microstep: 2039.64 | bwd_inner_microstep: 1887.21 | bwd_allreduce_microstep: 152.36 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13701
total_samples=21520, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:09,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 06:03:09,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.83 | bwd_microstep: 1940.10 | bwd_inner_microstep: 1802.32 | bwd_allreduce_microstep: 137.71 | step_microstep: 114.52
[2025-08-03 06:03:09,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.56 | bwd: 7930.90 | bwd_inner: 7142.57 | bwd_allreduce: 788.10 | step: 115.05
{'loss': 0.7379, 'learning_rate': 4.1615198066082475e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13356
total_samples=21524, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:11,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.59 | bwd_microstep: 1829.78 | bwd_inner_microstep: 1696.08 | bwd_allreduce_microstep: 133.63 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12634
total_samples=21528, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:14,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.41 | bwd_microstep: 2028.66 | bwd_inner_microstep: 1797.21 | bwd_allreduce_microstep: 231.38 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11716
total_samples=21531, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:17,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.43 | bwd_microstep: 1714.78 | bwd_inner_microstep: 1523.56 | bwd_allreduce_microstep: 191.14 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13653
total_samples=21535, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:19,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 06:03:19,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.29 | bwd_microstep: 1832.45 | bwd_inner_microstep: 1731.17 | bwd_allreduce_microstep: 101.13 | step_microstep: 140.52
[2025-08-03 06:03:19,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2863.65 | bwd: 7405.72 | bwd_inner: 6748.01 | bwd_allreduce: 657.37 | step: 141.23
{'loss': 0.7412, 'learning_rate': 4.14838034010138e-06, 'epoch': 0.71}
 71%|███████   | 1411/2000 [4:19:41<1:45:19, 10.73s/it]
 71%|███████   | 1412/2000 [4:19:52<1:46:01, 10.82s/it]
 71%|███████   | 1413/2000 [4:20:02<1:44:56, 10.73s/it]
 71%|███████   | 1414/2000 [4:20:12<1:43:55, 10.64s/it]
 71%|███████   | 1415/2000 [4:20:24<1:45:13, 10.79s/it]
 71%|███████   | 1416/2000 [4:20:34<1:44:50, 10.77s/it]
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13327
total_samples=21540, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:22,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.18 | bwd_microstep: 1800.77 | bwd_inner_microstep: 1675.66 | bwd_allreduce_microstep: 125.03 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11860
total_samples=21543, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:25,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.49 | bwd_microstep: 1841.90 | bwd_inner_microstep: 1714.57 | bwd_allreduce_microstep: 127.26 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13288
total_samples=21547, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:28,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 985.98 | bwd_microstep: 1778.48 | bwd_inner_microstep: 1691.60 | bwd_allreduce_microstep: 86.81 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14303
total_samples=21551, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:30,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 06:03:30,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.94 | bwd_microstep: 1775.01 | bwd_inner_microstep: 1724.34 | bwd_allreduce_microstep: 50.60 | step_microstep: 136.40
[2025-08-03 06:03:30,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3112.52 | bwd: 7196.21 | bwd_inner: 6806.17 | bwd_allreduce: 389.79 | step: 137.02
{'loss': 0.7317, 'learning_rate': 4.135256218784896e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13256
total_samples=21555, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:33,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.66 | bwd_microstep: 1725.59 | bwd_inner_microstep: 1682.76 | bwd_allreduce_microstep: 42.76 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13672
total_samples=21559, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:35,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.36 | bwd_microstep: 1964.59 | bwd_inner_microstep: 1922.30 | bwd_allreduce_microstep: 42.22 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12721
total_samples=21562, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:38,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.48 | bwd_microstep: 1980.27 | bwd_inner_microstep: 1630.70 | bwd_allreduce_microstep: 349.51 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13813
total_samples=21566, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:41,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.95
[2025-08-03 06:03:41,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.52 | bwd_microstep: 1737.11 | bwd_inner_microstep: 1687.05 | bwd_allreduce_microstep: 49.97 | step_microstep: 117.49
[2025-08-03 06:03:41,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.94 | bwd: 7407.60 | bwd_inner: 6922.80 | bwd_allreduce: 484.55 | step: 117.97
{'loss': 0.7438, 'learning_rate': 4.12214747707527e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13330
total_samples=21570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:44,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.70 | bwd_microstep: 1966.96 | bwd_inner_microstep: 1847.73 | bwd_allreduce_microstep: 119.17 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13387
total_samples=21574, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:46,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.83 | bwd_microstep: 2037.52 | bwd_inner_microstep: 2030.90 | bwd_allreduce_microstep: 6.56 | step_microstep: 0.28
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12532
total_samples=21578, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:49,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.75 | bwd_microstep: 1744.50 | bwd_inner_microstep: 1584.43 | bwd_allreduce_microstep: 160.00 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13617
total_samples=21582, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:52,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 06:03:52,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.94 | bwd_microstep: 1977.41 | bwd_inner_microstep: 1896.60 | bwd_allreduce_microstep: 80.75 | step_microstep: 129.35
[2025-08-03 06:03:52,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.15 | bwd: 7726.45 | bwd_inner: 7359.66 | bwd_allreduce: 366.55 | step: 129.85
{'loss': 0.7321, 'learning_rate': 4.1090541493486555e-06, 'epoch': 0.71}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14232
total_samples=21588, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:54,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.04 | bwd_microstep: 1753.37 | bwd_inner_microstep: 1703.10 | bwd_allreduce_microstep: 50.19 | step_microstep: 0.74
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13930
total_samples=21593, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:03:57,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.57 | bwd_microstep: 1827.99 | bwd_inner_microstep: 1703.26 | bwd_allreduce_microstep: 124.66 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11833
total_samples=21596, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:00,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.76 | bwd_microstep: 1844.81 | bwd_inner_microstep: 1586.93 | bwd_allreduce_microstep: 257.81 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14488
total_samples=21600, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:02,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.47
[2025-08-03 06:04:02,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.51 | bwd_microstep: 1744.64 | bwd_inner_microstep: 1684.37 | bwd_allreduce_microstep: 60.20 | step_microstep: 112.49
[2025-08-03 06:04:02,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.81 | bwd: 7170.87 | bwd_inner: 6677.67 | bwd_allreduce: 492.95 | step: 113.48
{'loss': 0.725, 'learning_rate': 4.095976269940777e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13363
total_samples=21604, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:05,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.50 | bwd_microstep: 1753.82 | bwd_inner_microstep: 1675.05 | bwd_allreduce_microstep: 78.69 | step_microstep: 0.72
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12572
total_samples=21607, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:07,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.09 | bwd_microstep: 1817.68 | bwd_inner_microstep: 1583.45 | bwd_allreduce_microstep: 234.16 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11788
total_samples=21610, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:10,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.03 | bwd_microstep: 1911.57 | bwd_inner_microstep: 1557.03 | bwd_allreduce_microstep: 354.46 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13480
total_samples=21614, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:13,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 06:04:13,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.83 | bwd_microstep: 2016.47 | bwd_inner_microstep: 1887.09 | bwd_allreduce_microstep: 129.32 | step_microstep: 130.11
[2025-08-03 06:04:13,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.37 | bwd: 7499.58 | bwd_inner: 6702.62 | bwd_allreduce: 796.71 | step: 131.09
{'loss': 0.7392, 'learning_rate': 4.082913873146842e-06, 'epoch': 0.71}
 71%|███████   | 1417/2000 [4:20:45<1:44:30, 10.76s/it]
 71%|███████   | 1418/2000 [4:20:56<1:44:05, 10.73s/it]
 71%|███████   | 1419/2000 [4:21:07<1:44:34, 10.80s/it]
 71%|███████   | 1420/2000 [4:21:17<1:43:12, 10.68s/it]
 71%|███████   | 1421/2000 [4:21:28<1:43:21, 10.71s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13283
total_samples=21618, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:16,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.22 | bwd_microstep: 1794.50 | bwd_inner_microstep: 1677.71 | bwd_allreduce_microstep: 116.72 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13416
total_samples=21622, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:18,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.42 | bwd_microstep: 1959.77 | bwd_inner_microstep: 1855.48 | bwd_allreduce_microstep: 104.22 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11892
total_samples=21625, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:21,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.04 | bwd_microstep: 2035.95 | bwd_inner_microstep: 1813.07 | bwd_allreduce_microstep: 222.82 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13307
total_samples=21629, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:24,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 06:04:24,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.81 | bwd_microstep: 1727.30 | bwd_inner_microstep: 1665.24 | bwd_allreduce_microstep: 62.00 | step_microstep: 432.62
[2025-08-03 06:04:24,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.42 | bwd: 7517.57 | bwd_inner: 7011.50 | bwd_allreduce: 505.84 | step: 433.10
{'loss': 0.7428, 'learning_rate': 4.069866993221473e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14287
total_samples=21633, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:27,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.17 | bwd_microstep: 2024.60 | bwd_inner_microstep: 1742.16 | bwd_allreduce_microstep: 282.38 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11846
total_samples=21636, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:29,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.92 | bwd_microstep: 1791.50 | bwd_inner_microstep: 1561.66 | bwd_allreduce_microstep: 229.77 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11799
total_samples=21639, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:32,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.16 | bwd_microstep: 1728.46 | bwd_inner_microstep: 1537.62 | bwd_allreduce_microstep: 190.77 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515
total_samples=21643, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:35,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 06:04:35,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.80 | bwd_microstep: 2069.84 | bwd_inner_microstep: 1905.76 | bwd_allreduce_microstep: 164.01 | step_microstep: 118.76
[2025-08-03 06:04:35,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.97 | bwd: 7614.44 | bwd_inner: 6747.20 | bwd_allreduce: 867.01 | step: 119.22
{'loss': 0.7245, 'learning_rate': 4.056835664378585e-06, 'epoch': 0.71}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11639
total_samples=21646, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:38,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.17 | bwd_microstep: 2014.59 | bwd_inner_microstep: 1772.48 | bwd_allreduce_microstep: 242.05 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11805
total_samples=21649, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:40,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.78 | bwd_microstep: 1796.24 | bwd_inner_microstep: 1561.73 | bwd_allreduce_microstep: 234.45 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12540
total_samples=21653, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:43,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.12 | bwd_microstep: 1767.86 | bwd_inner_microstep: 1585.19 | bwd_allreduce_microstep: 182.60 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13220
total_samples=21658, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:46,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 06:04:46,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.58 | bwd_microstep: 1973.84 | bwd_inner_microstep: 1853.32 | bwd_allreduce_microstep: 120.46 | step_microstep: 144.67
[2025-08-03 06:04:46,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.57 | bwd: 7552.57 | bwd_inner: 6772.71 | bwd_allreduce: 779.63 | step: 145.01
{'loss': 0.7439, 'learning_rate': 4.043819920791322e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13809
total_samples=21662, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:48,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.65 | bwd_microstep: 1813.29 | bwd_inner_microstep: 1723.96 | bwd_allreduce_microstep: 89.25 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12704
total_samples=21665, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:51,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.76 | bwd_microstep: 2116.24 | bwd_inner_microstep: 1779.04 | bwd_allreduce_microstep: 337.13 | step_microstep: 0.23
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12823
total_samples=21669, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:54,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.69 | bwd_microstep: 1761.96 | bwd_inner_microstep: 1608.69 | bwd_allreduce_microstep: 153.20 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11648
total_samples=21672, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:56,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70
[2025-08-03 06:04:56,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.91 | bwd_microstep: 1792.13 | bwd_inner_microstep: 1674.77 | bwd_allreduce_microstep: 117.30 | step_microstep: 135.94
[2025-08-03 06:04:56,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.95 | bwd: 7483.69 | bwd_inner: 6786.45 | bwd_allreduce: 696.98 | step: 136.59
{'loss': 0.7379, 'learning_rate': 4.03081979659195e-06, 'epoch': 0.71}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13781
total_samples=21676, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:04:59,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.96 | bwd_microstep: 1795.05 | bwd_inner_microstep: 1698.02 | bwd_allreduce_microstep: 96.96 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11617
total_samples=21679, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:02,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.66 | bwd_microstep: 1928.98 | bwd_inner_microstep: 1724.79 | bwd_allreduce_microstep: 204.12 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13304
total_samples=21683, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:04,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.65 | bwd_microstep: 1730.76 | bwd_inner_microstep: 1676.63 | bwd_allreduce_microstep: 54.07 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13833
total_samples=21687, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:07,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 06:05:07,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.77 | bwd_microstep: 1832.80 | bwd_inner_microstep: 1737.07 | bwd_allreduce_microstep: 95.66 | step_microstep: 131.26
[2025-08-03 06:05:07,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2879.97 | bwd: 7287.65 | bwd_inner: 6836.51 | bwd_allreduce: 450.89 | step: 131.68
{'loss': 0.7378, 'learning_rate': 4.017835325871781e-06, 'epoch': 0.71}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12580
total_samples=21691, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:10,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.10 | bwd_microstep: 1807.92 | bwd_inner_microstep: 1621.98 | bwd_allreduce_microstep: 185.87 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12299
total_samples=21694, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:12,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.39 | bwd_microstep: 1804.12 | bwd_inner_microstep: 1579.63 | bwd_allreduce_microstep: 224.43 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11947
total_samples=21697, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:15,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.05 | bwd_microstep: 1944.44 | bwd_inner_microstep: 1561.92 | bwd_allreduce_microstep: 382.46 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14565
total_samples=21701, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:18,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 06:05:18,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.69 | bwd_microstep: 1790.59 | bwd_inner_microstep: 1728.88 | bwd_allreduce_microstep: 61.63 | step_microstep: 132.56
[2025-08-03 06:05:18,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.15 | bwd: 7347.13 | bwd_inner: 6492.41 | bwd_allreduce: 854.48 | step: 133.02
 71%|███████   | 1422/2000 [4:21:39<1:44:14, 10.82s/it]
 71%|███████   | 1423/2000 [4:21:50<1:44:05, 10.82s/it]
 71%|███████   | 1424/2000 [4:22:01<1:43:45, 10.81s/it]
 71%|███████▏  | 1425/2000 [4:22:11<1:43:23, 10.79s/it]
 71%|███████▏  | 1426/2000 [4:22:22<1:42:40, 10.73s/it]
{'loss': 0.7348, 'learning_rate': 4.004866542681079e-06, 'epoch': 0.71}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12475
total_samples=21704, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:21,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.68 | bwd_microstep: 2262.74 | bwd_inner_microstep: 1850.34 | bwd_allreduce_microstep: 412.29 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13758
total_samples=21709, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:23,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.05 | bwd_microstep: 1761.37 | bwd_inner_microstep: 1692.39 | bwd_allreduce_microstep: 68.92 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12429
total_samples=21712, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:26,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.45 | bwd_microstep: 2021.07 | bwd_inner_microstep: 1771.87 | bwd_allreduce_microstep: 249.13 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11878
total_samples=21715, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:29,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 06:05:29,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.26 | bwd_microstep: 1740.52 | bwd_inner_microstep: 1540.23 | bwd_allreduce_microstep: 200.20 | step_microstep: 129.75
[2025-08-03 06:05:29,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.37 | bwd: 7785.75 | bwd_inner: 6854.85 | bwd_allreduce: 930.61 | step: 130.25
{'loss': 0.739, 'learning_rate': 3.991913481028965e-06, 'epoch': 0.71}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11997
total_samples=21718, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:32,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.59 | bwd_microstep: 2116.84 | bwd_inner_microstep: 1870.93 | bwd_allreduce_microstep: 245.84 | step_microstep: 0.28
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12356
total_samples=21722, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:34,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.96 | bwd_microstep: 1981.98 | bwd_inner_microstep: 1773.66 | bwd_allreduce_microstep: 208.26 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12145
total_samples=21725, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:37,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.45 | bwd_microstep: 1972.32 | bwd_inner_microstep: 1751.86 | bwd_allreduce_microstep: 220.40 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11660
total_samples=21728, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:40,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 06:05:40,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.44 | bwd_microstep: 1882.69 | bwd_inner_microstep: 1621.63 | bwd_allreduce_microstep: 260.99 | step_microstep: 151.18
[2025-08-03 06:05:40,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.38 | bwd: 7953.89 | bwd_inner: 7018.07 | bwd_allreduce: 935.57 | step: 151.68
{'loss': 0.7332, 'learning_rate': 3.978976174883329e-06, 'epoch': 0.71}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11963
total_samples=21731, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:42,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.32 | bwd_microstep: 1792.71 | bwd_inner_microstep: 1565.97 | bwd_allreduce_microstep: 226.67 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11589
total_samples=21734, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:45,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.38 | bwd_microstep: 1759.33 | bwd_inner_microstep: 1532.23 | bwd_allreduce_microstep: 227.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446
total_samples=21738, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:48,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.03 | bwd_microstep: 2123.90 | bwd_inner_microstep: 1895.32 | bwd_allreduce_microstep: 228.52 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14227
total_samples=21742, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:51,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.01
[2025-08-03 06:05:51,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.28 | bwd_microstep: 1743.69 | bwd_inner_microstep: 1714.18 | bwd_allreduce_microstep: 29.45 | step_microstep: 127.37
[2025-08-03 06:05:51,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.94 | bwd: 7419.67 | bwd_inner: 6707.69 | bwd_allreduce: 711.74 | step: 127.82
{'loss': 0.7286, 'learning_rate': 3.966054658170754e-06, 'epoch': 0.71}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15081
total_samples=21746, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:54,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.59 | bwd_microstep: 2229.97 | bwd_inner_microstep: 2067.16 | bwd_allreduce_microstep: 162.75 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11566
total_samples=21749, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:56,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.44 | bwd_microstep: 1948.10 | bwd_inner_microstep: 1719.68 | bwd_allreduce_microstep: 228.35 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13320
total_samples=21753, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:05:59,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.29 | bwd_microstep: 1761.21 | bwd_inner_microstep: 1683.42 | bwd_allreduce_microstep: 77.73 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12177
total_samples=21757, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:02,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 06:06:02,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.45 | bwd_microstep: 1732.21 | bwd_inner_microstep: 1571.92 | bwd_allreduce_microstep: 160.22 | step_microstep: 136.59
[2025-08-03 06:06:02,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.69 | bwd: 7671.53 | bwd_inner: 7042.18 | bwd_allreduce: 629.12 | step: 137.05
{'loss': 0.7304, 'learning_rate': 3.953148964776408e-06, 'epoch': 0.72}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13237
total_samples=21760, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:04,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.61 | bwd_microstep: 2038.30 | bwd_inner_microstep: 1783.41 | bwd_allreduce_microstep: 254.83 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12013
total_samples=21763, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:07,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.57 | bwd_microstep: 1826.28 | bwd_inner_microstep: 1587.59 | bwd_allreduce_microstep: 238.62 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11840
total_samples=21766, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:10,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.60 | bwd_microstep: 1829.66 | bwd_inner_microstep: 1585.21 | bwd_allreduce_microstep: 244.38 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12300
total_samples=21769, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:12,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 06:06:12,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.46 | bwd_microstep: 2007.76 | bwd_inner_microstep: 2001.61 | bwd_allreduce_microstep: 6.09 | step_microstep: 135.36
[2025-08-03 06:06:12,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.16 | bwd: 7702.05 | bwd_inner: 6957.82 | bwd_allreduce: 743.99 | step: 135.83
 71%|███████▏  | 1427/2000 [4:22:33<1:42:02, 10.69s/it]
 71%|███████▏  | 1428/2000 [4:22:44<1:42:53, 10.79s/it]
 71%|███████▏  | 1429/2000 [4:22:55<1:43:55, 10.92s/it]
 72%|███████▏  | 1430/2000 [4:23:05<1:43:05, 10.85s/it]
 72%|███████▏  | 1431/2000 [4:23:16<1:43:11, 10.88s/it]
{'loss': 0.7392, 'learning_rate': 3.940259128543967e-06, 'epoch': 0.72}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12597
total_samples=21772, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:15,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.38 | bwd_microstep: 1865.10 | bwd_inner_microstep: 1707.15 | bwd_allreduce_microstep: 157.89 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13018
total_samples=21776, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:18,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.04 | bwd_microstep: 1914.76 | bwd_inner_microstep: 1689.61 | bwd_allreduce_microstep: 225.09 | step_microstep: 0.09
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13202
total_samples=21780, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:20,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.20 | bwd_microstep: 1825.80 | bwd_inner_microstep: 1669.64 | bwd_allreduce_microstep: 156.10 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12656
total_samples=21783, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:23,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.67
[2025-08-03 06:06:23,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.98 | bwd_microstep: 1792.53 | bwd_inner_microstep: 1588.41 | bwd_allreduce_microstep: 204.03 | step_microstep: 159.91
[2025-08-03 06:06:23,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.52 | bwd: 7398.24 | bwd_inner: 6654.81 | bwd_allreduce: 743.19 | step: 160.22
{'loss': 0.7344, 'learning_rate': 3.927385183275522e-06, 'epoch': 0.72}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11949
total_samples=21786, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:26,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.15 | bwd_microstep: 1749.74 | bwd_inner_microstep: 1544.94 | bwd_allreduce_microstep: 204.72 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11787
total_samples=21789, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:28,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.34 | bwd_microstep: 1761.32 | bwd_inner_microstep: 1553.83 | bwd_allreduce_microstep: 207.42 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12656
total_samples=21793, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:31,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.88 | bwd_microstep: 1884.43 | bwd_inner_microstep: 1635.11 | bwd_allreduce_microstep: 249.26 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11998
total_samples=21796, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:34,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.97
[2025-08-03 06:06:34,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.78 | bwd_microstep: 1831.86 | bwd_inner_microstep: 1551.08 | bwd_allreduce_microstep: 280.71 | step_microstep: 127.13
[2025-08-03 06:06:34,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.08 | bwd: 7227.42 | bwd_inner: 6284.96 | bwd_allreduce: 942.19 | step: 127.62
{'loss': 0.742, 'learning_rate': 3.914527162731498e-06, 'epoch': 0.72}
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 12616
total_samples=21799, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:37,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.77 | bwd_microstep: 2119.05 | bwd_inner_microstep: 1888.14 | bwd_allreduce_microstep: 230.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13345
total_samples=21803, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:39,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.66 | bwd_microstep: 1958.84 | bwd_inner_microstep: 1829.47 | bwd_allreduce_microstep: 129.31 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11936
total_samples=21806, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:42,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.69 | bwd_microstep: 1995.25 | bwd_inner_microstep: 1843.01 | bwd_allreduce_microstep: 152.17 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508
total_samples=21810, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:45,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 06:06:45,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.07 | bwd_microstep: 1813.62 | bwd_inner_microstep: 1701.20 | bwd_allreduce_microstep: 112.36 | step_microstep: 131.55
[2025-08-03 06:06:45,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.12 | bwd: 7886.81 | bwd_inner: 7261.81 | bwd_allreduce: 624.77 | step: 132.01
{'loss': 0.7303, 'learning_rate': 3.901685100630554e-06, 'epoch': 0.72}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11684
total_samples=21813, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:48,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.92 | bwd_microstep: 1994.52 | bwd_inner_microstep: 1780.20 | bwd_allreduce_microstep: 214.25 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13701
total_samples=21817, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:50,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.63 | bwd_microstep: 1746.72 | bwd_inner_microstep: 1697.08 | bwd_allreduce_microstep: 49.55 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13973
total_samples=21821, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:53,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.09 | bwd_microstep: 1780.49 | bwd_inner_microstep: 1721.56 | bwd_allreduce_microstep: 58.87 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11851
total_samples=21824, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:55,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 06:06:55,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.16 | bwd_microstep: 1997.74 | bwd_inner_microstep: 1784.29 | bwd_allreduce_microstep: 213.39 | step_microstep: 109.08
[2025-08-03 06:06:55,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.73 | bwd: 7519.53 | bwd_inner: 6983.10 | bwd_allreduce: 536.13 | step: 109.58
{'loss': 0.7328, 'learning_rate': 3.888859030649498e-06, 'epoch': 0.72}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12008
total_samples=21827, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:06:58,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.81 | bwd_microstep: 1969.17 | bwd_inner_microstep: 1602.05 | bwd_allreduce_microstep: 367.05 | step_microstep: 0.37
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13774
total_samples=21831, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:01,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.92 | bwd_microstep: 1785.46 | bwd_inner_microstep: 1710.37 | bwd_allreduce_microstep: 75.03 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11662
total_samples=21834, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:04,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.21 | bwd_microstep: 2134.74 | bwd_inner_microstep: 1861.90 | bwd_allreduce_microstep: 272.77 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13304
total_samples=21838, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:06,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 06:07:06,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.43 | bwd_microstep: 1757.28 | bwd_inner_microstep: 1685.11 | bwd_allreduce_microstep: 72.10 | step_microstep: 110.33
[2025-08-03 06:07:06,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.30 | bwd: 7646.71 | bwd_inner: 6859.42 | bwd_allreduce: 787.03 | step: 110.95
 72%|███████▏  | 1432/2000 [4:23:27<1:43:07, 10.89s/it]
 72%|███████▏  | 1433/2000 [4:23:38<1:42:17, 10.82s/it]
 72%|███████▏  | 1434/2000 [4:23:48<1:41:10, 10.72s/it]
 72%|███████▏  | 1435/2000 [4:24:00<1:42:04, 10.84s/it]
 72%|███████▏  | 1436/2000 [4:24:10<1:41:45, 10.83s/it]
{'loss': 0.7347, 'learning_rate': 3.876048986423207e-06, 'epoch': 0.72}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12267
total_samples=21841, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:09,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.44 | bwd_microstep: 1916.82 | bwd_inner_microstep: 1773.43 | bwd_allreduce_microstep: 143.33 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13829
total_samples=21845, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:12,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.73 | bwd_microstep: 1841.68 | bwd_inner_microstep: 1736.82 | bwd_allreduce_microstep: 104.79 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11882
total_samples=21849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:14,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.89 | bwd_microstep: 1775.70 | bwd_inner_microstep: 1554.17 | bwd_allreduce_microstep: 221.46 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12770
total_samples=21853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:17,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.30
[2025-08-03 06:07:17,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.08 | bwd_microstep: 1782.98 | bwd_inner_microstep: 1648.20 | bwd_allreduce_microstep: 134.72 | step_microstep: 114.89
[2025-08-03 06:07:17,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2857.05 | bwd: 7317.23 | bwd_inner: 6712.61 | bwd_allreduce: 604.38 | step: 115.37
{'loss': 0.7357, 'learning_rate': 3.863255001544526e-06, 'epoch': 0.72}
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11329
total_samples=21856, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:20,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.80 | bwd_microstep: 1801.72 | bwd_inner_microstep: 1564.84 | bwd_allreduce_microstep: 236.81 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16353
total_samples=21860, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:23,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.48 | bwd_microstep: 2208.38 | bwd_inner_microstep: 2112.22 | bwd_allreduce_microstep: 96.11 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12007
total_samples=21863, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:25,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.71 | bwd_microstep: 2028.21 | bwd_inner_microstep: 1799.62 | bwd_allreduce_microstep: 228.51 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12619
total_samples=21866, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:28,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.79
[2025-08-03 06:07:28,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.40 | bwd_microstep: 1811.19 | bwd_inner_microstep: 1607.70 | bwd_allreduce_microstep: 203.43 | step_microstep: 137.59
[2025-08-03 06:07:28,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.32 | bwd: 7849.58 | bwd_inner: 7084.38 | bwd_allreduce: 764.94 | step: 138.16
{'loss': 0.7317, 'learning_rate': 3.8504771095641905e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13416
total_samples=21870, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:31,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.76 | bwd_microstep: 2020.97 | bwd_inner_microstep: 1872.95 | bwd_allreduce_microstep: 147.95 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13583
total_samples=21875, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:33,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.63 | bwd_microstep: 1907.20 | bwd_inner_microstep: 1739.71 | bwd_allreduce_microstep: 167.42 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11964
total_samples=21879, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:36,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.56 | bwd_microstep: 1991.02 | bwd_inner_microstep: 1760.82 | bwd_allreduce_microstep: 230.14 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14403
total_samples=21883, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:39,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.57
[2025-08-03 06:07:39,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.00 | bwd_microstep: 1729.90 | bwd_inner_microstep: 1703.57 | bwd_allreduce_microstep: 26.24 | step_microstep: 144.12
[2025-08-03 06:07:39,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.88 | bwd: 7649.15 | bwd_inner: 7077.05 | bwd_allreduce: 571.84 | step: 144.63
{'loss': 0.7354, 'learning_rate': 3.837715343990727e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13501
total_samples=21887, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:42,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.80 | bwd_microstep: 2035.56 | bwd_inner_microstep: 1899.82 | bwd_allreduce_microstep: 135.67 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12637
total_samples=21891, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:44,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.39 | bwd_microstep: 1795.86 | bwd_inner_microstep: 1627.86 | bwd_allreduce_microstep: 167.91 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12249
total_samples=21894, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:47,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.05 | bwd_microstep: 1802.23 | bwd_inner_microstep: 1581.40 | bwd_allreduce_microstep: 220.76 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13377
total_samples=21898, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:50,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 06:07:50,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.73 | bwd_microstep: 1986.75 | bwd_inner_microstep: 1819.91 | bwd_allreduce_microstep: 166.76 | step_microstep: 114.82
[2025-08-03 06:07:50,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.90 | bwd: 7620.46 | bwd_inner: 6928.99 | bwd_allreduce: 691.20 | step: 115.48
{'loss': 0.7321, 'learning_rate': 3.824969738290386e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13959
total_samples=21904, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:52,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.49 | bwd_microstep: 1998.53 | bwd_inner_microstep: 1766.67 | bwd_allreduce_microstep: 231.78 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13173
total_samples=21908, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:55,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.02 | bwd_microstep: 2009.03 | bwd_inner_microstep: 1863.21 | bwd_allreduce_microstep: 145.75 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14020
total_samples=21912, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:07:58,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.08 | bwd_microstep: 2214.98 | bwd_inner_microstep: 1971.21 | bwd_allreduce_microstep: 243.70 | step_microstep: 0.75
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12598
total_samples=21915, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:01,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.83
[2025-08-03 06:08:01,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.92 | bwd_microstep: 2016.01 | bwd_inner_microstep: 1652.95 | bwd_allreduce_microstep: 363.00 | step_microstep: 112.16
[2025-08-03 06:08:01,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.43 | bwd: 8238.61 | bwd_inner: 7254.04 | bwd_allreduce: 984.32 | step: 113.28
 72%|███████▏  | 1437/2000 [4:24:21<1:41:40, 10.84s/it]
 72%|███████▏  | 1438/2000 [4:24:32<1:40:46, 10.76s/it]
 72%|███████▏  | 1439/2000 [4:24:43<1:41:27, 10.85s/it]
 72%|███████▏  | 1440/2000 [4:24:54<1:41:20, 10.86s/it]
 72%|███████▏  | 1441/2000 [4:25:05<1:41:05, 10.85s/it]
{'loss': 0.7434, 'learning_rate': 3.81224032588703e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14706
total_samples=21921, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:04,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.08 | bwd_microstep: 1804.78 | bwd_inner_microstep: 1763.90 | bwd_allreduce_microstep: 40.81 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11746
total_samples=21924, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:06,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.30 | bwd_microstep: 1804.02 | bwd_inner_microstep: 1561.21 | bwd_allreduce_microstep: 242.74 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12033
total_samples=21928, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:09,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.10 | bwd_microstep: 2094.70 | bwd_inner_microstep: 1960.69 | bwd_allreduce_microstep: 133.94 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12947
total_samples=21932, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:12,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 06:08:12,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.13 | bwd_microstep: 1753.55 | bwd_inner_microstep: 1664.60 | bwd_allreduce_microstep: 88.88 | step_microstep: 138.01
[2025-08-03 06:08:12,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.54 | bwd: 7457.09 | bwd_inner: 6950.39 | bwd_allreduce: 506.45 | step: 138.65
{'loss': 0.7327, 'learning_rate': 3.7995271401620548e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13235
total_samples=21936, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:14,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.43 | bwd_microstep: 1723.75 | bwd_inner_microstep: 1665.30 | bwd_allreduce_microstep: 58.38 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13170
total_samples=21940, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:17,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.78 | bwd_microstep: 1868.72 | bwd_inner_microstep: 1717.46 | bwd_allreduce_microstep: 151.20 | step_microstep: 0.15
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13306
total_samples=21944, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:20,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.54 | bwd_microstep: 1923.50 | bwd_inner_microstep: 1678.16 | bwd_allreduce_microstep: 245.28 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12040
total_samples=21947, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:23,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.29
[2025-08-03 06:08:23,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.31 | bwd_microstep: 2199.84 | bwd_inner_microstep: 1852.39 | bwd_allreduce_microstep: 347.38 | step_microstep: 148.38
[2025-08-03 06:08:23,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.98 | bwd: 7715.88 | bwd_inner: 6913.30 | bwd_allreduce: 802.34 | step: 148.93
{'loss': 0.7316, 'learning_rate': 3.7868302144543146e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13970
total_samples=21951, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:25,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.55 | bwd_microstep: 1766.90 | bwd_inner_microstep: 1728.25 | bwd_allreduce_microstep: 38.57 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11603
total_samples=21954, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:28,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.81 | bwd_microstep: 2187.15 | bwd_inner_microstep: 1935.80 | bwd_allreduce_microstep: 251.27 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11996
total_samples=21957, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:31,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.95 | bwd_microstep: 1760.36 | bwd_inner_microstep: 1552.90 | bwd_allreduce_microstep: 207.40 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12486
total_samples=21960, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:34,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 06:08:34,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.60 | bwd_microstep: 1999.00 | bwd_inner_microstep: 1921.32 | bwd_allreduce_microstep: 77.62 | step_microstep: 108.50
[2025-08-03 06:08:34,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.85 | bwd: 7713.46 | bwd_inner: 7138.26 | bwd_allreduce: 574.94 | step: 109.02
{'loss': 0.7283, 'learning_rate': 3.7741495820600128e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15381
total_samples=21964, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:36,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.58 | bwd_microstep: 1952.89 | bwd_inner_microstep: 1935.82 | bwd_allreduce_microstep: 16.98 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13223
total_samples=21968, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:39,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.62 | bwd_microstep: 1761.90 | bwd_inner_microstep: 1637.51 | bwd_allreduce_microstep: 124.32 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12361
total_samples=21971, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:42,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.09 | bwd_microstep: 2061.41 | bwd_inner_microstep: 1588.91 | bwd_allreduce_microstep: 472.39 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12803
total_samples=21974, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:45,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 06:08:45,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.00 | bwd_microstep: 1966.22 | bwd_inner_microstep: 1645.66 | bwd_allreduce_microstep: 320.47 | step_microstep: 142.53
[2025-08-03 06:08:45,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.21 | bwd: 7742.46 | bwd_inner: 6807.92 | bwd_allreduce: 934.26 | step: 143.05
{'loss': 0.74, 'learning_rate': 3.7614852762326303e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13457
total_samples=21978, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:47,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.21 | bwd_microstep: 1708.21 | bwd_inner_microstep: 1659.11 | bwd_allreduce_microstep: 49.03 | step_microstep: 0.33
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12085
total_samples=21982, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:50,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.23 | bwd_microstep: 1829.42 | bwd_inner_microstep: 1599.48 | bwd_allreduce_microstep: 229.87 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13545
total_samples=21986, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:53,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.86 | bwd_microstep: 2077.48 | bwd_inner_microstep: 1720.79 | bwd_allreduce_microstep: 356.61 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12244
total_samples=21989, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:55,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.91
[2025-08-03 06:08:55,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.70 | bwd_microstep: 1845.42 | bwd_inner_microstep: 1632.75 | bwd_allreduce_microstep: 212.59 | step_microstep: 141.17
[2025-08-03 06:08:55,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.92 | bwd: 7460.60 | bwd_inner: 6612.13 | bwd_allreduce: 848.19 | step: 141.87
72%|███████▏  | 1442/2000 [4:25:16<1:42:35, 11.03s/it]
72%|███████▏  | 1443/2000 [4:25:27<1:41:32, 10.94s/it]
72%|███████▏  | 1444/2000 [4:25:38<1:41:27, 10.95s/it]
72%|███████▏  | 1445/2000 [4:25:49<1:41:07, 10.93s/it]
72%|███████▏  | 1446/2000 [4:26:00<1:41:03, 10.94s/it]
72%|███████▏  | 1447/2000 [4:26:10<1:40:20, 10.89s/it]
{'loss': 0.7406, 'learning_rate': 3.7488373301828296e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16160
total_samples=21993, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:08:58,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.18 | bwd_microstep: 1819.36 | bwd_inner_microstep: 1811.34 | bwd_allreduce_microstep: 7.94 | step_microstep: 0.74
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12138
total_samples=21996, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:01,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.98 | bwd_microstep: 1798.62 | bwd_inner_microstep: 1588.81 | bwd_allreduce_microstep: 209.75 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11522
total_samples=21999, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:04,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1019.48 | bwd_microstep: 1746.33 | bwd_inner_microstep: 1519.86 | bwd_allreduce_microstep: 226.41 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14408
total_samples=22003, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:06,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.36
[2025-08-03 06:09:06,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.30 | bwd_microstep: 1786.65 | bwd_inner_microstep: 1726.26 | bwd_allreduce_microstep: 60.32 | step_microstep: 138.94
[2025-08-03 06:09:06,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3141.88 | bwd: 7151.03 | bwd_inner: 6646.26 | bwd_allreduce: 504.52 | step: 140.07
{'loss': 0.7527, 'learning_rate': 3.736205777078381e-06, 'epoch': 0.72}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12629
total_samples=22007, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:09,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.60 | bwd_microstep: 1792.71 | bwd_inner_microstep: 1621.88 | bwd_allreduce_microstep: 170.76 | step_microstep: 0.26
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12722
total_samples=22011, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:11,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.41 | bwd_microstep: 1744.86 | bwd_inner_microstep: 1609.88 | bwd_allreduce_microstep: 134.91 | step_microstep: 0.33
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11589
total_samples=22014, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:14,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.95 | bwd_microstep: 1765.60 | bwd_inner_microstep: 1531.75 | bwd_allreduce_microstep: 233.78 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11808
total_samples=22017, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:17,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 06:09:17,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.51 | bwd_microstep: 1817.24 | bwd_inner_microstep: 1571.65 | bwd_allreduce_microstep: 245.51 | step_microstep: 111.47
[2025-08-03 06:09:17,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.40 | bwd: 7120.48 | bwd_inner: 6335.15 | bwd_allreduce: 785.07 | step: 112.18
{'loss': 0.741, 'learning_rate': 3.7235906500440576e-06, 'epoch': 0.72}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11750
total_samples=22020, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:19,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.87 | bwd_microstep: 2000.48 | bwd_inner_microstep: 1778.01 | bwd_allreduce_microstep: 222.41 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11716
total_samples=22023, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:22,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.44 | bwd_microstep: 2049.20 | bwd_inner_microstep: 1808.07 | bwd_allreduce_microstep: 241.06 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13970
total_samples=22027, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:25,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.27 | bwd_microstep: 1798.97 | bwd_inner_microstep: 1733.61 | bwd_allreduce_microstep: 65.28 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13561
total_samples=22031, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:27,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70
[2025-08-03 06:09:27,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.79 | bwd_microstep: 1834.62 | bwd_inner_microstep: 1727.38 | bwd_allreduce_microstep: 107.18 | step_microstep: 112.23
[2025-08-03 06:09:27,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.29 | bwd: 7683.32 | bwd_inner: 7047.07 | bwd_allreduce: 636.00 | step: 112.81
{'loss': 0.7305, 'learning_rate': 3.7109919821615546e-06, 'epoch': 0.72}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13416
total_samples=22035, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:30,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.52 | bwd_microstep: 1716.73 | bwd_inner_microstep: 1667.81 | bwd_allreduce_microstep: 48.85 | step_microstep: 0.77
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14546
total_samples=22039, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:33,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.52 | bwd_microstep: 1810.48 | bwd_inner_microstep: 1737.36 | bwd_allreduce_microstep: 73.06 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13490
total_samples=22043, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:35,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.60 | bwd_microstep: 1891.11 | bwd_inner_microstep: 1707.13 | bwd_allreduce_microstep: 183.90 | step_microstep: 0.13
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12742
total_samples=22047, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:38,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.94
[2025-08-03 06:09:38,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.29 | bwd_microstep: 2065.80 | bwd_inner_microstep: 1866.08 | bwd_allreduce_microstep: 199.66 | step_microstep: 123.83
[2025-08-03 06:09:38,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.85 | bwd: 7484.17 | bwd_inner: 6978.39 | bwd_allreduce: 505.54 | step: 124.83
{'loss': 0.7284, 'learning_rate': 3.6984098064694174e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13753
total_samples=22051, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:41,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.99 | bwd_microstep: 2023.12 | bwd_inner_microstep: 1906.70 | bwd_allreduce_microstep: 116.36 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12045
total_samples=22054, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:44,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.04 | bwd_microstep: 2494.62 | bwd_inner_microstep: 2487.97 | bwd_allreduce_microstep: 6.55 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13139
total_samples=22058, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:47,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.97 | bwd_microstep: 2013.45 | bwd_inner_microstep: 1894.16 | bwd_allreduce_microstep: 119.22 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13168
total_samples=22062, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:50,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77
[2025-08-03 06:09:50,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.18 | bwd_microstep: 1821.74 | bwd_inner_microstep: 1698.37 | bwd_allreduce_microstep: 123.29 | step_microstep: 115.09
[2025-08-03 06:09:50,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.11 | bwd: 8352.98 | bwd_inner: 7987.22 | bwd_allreduce: 365.49 | step: 115.56
72%|███████▏  | 1448/2000 [4:26:21<1:39:45, 10.84s/it]
72%|███████▏  | 1449/2000 [4:26:31<1:38:18, 10.70s/it]
72%|███████▎  | 1450/2000 [4:26:42<1:38:38, 10.76s/it]
73%|███████▎  | 1451/2000 [4:26:53<1:38:22, 10.75s/it]
73%|███████▎  | 1452/2000 [4:27:05<1:40:24, 10.99s/it]
{'loss': 0.7383, 'learning_rate': 3.685844155962931e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14456
total_samples=22067, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:52,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.08 | bwd_microstep: 1745.99 | bwd_inner_microstep: 1719.41 | bwd_allreduce_microstep: 26.52 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11660
total_samples=22070, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:55,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.56 | bwd_microstep: 1839.23 | bwd_inner_microstep: 1704.02 | bwd_allreduce_microstep: 135.14 | step_microstep: 0.13
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12826
total_samples=22074, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:09:58,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.45 | bwd_microstep: 2027.99 | bwd_inner_microstep: 1819.30 | bwd_allreduce_microstep: 208.62 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13409
total_samples=22078, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:00,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.90
[2025-08-03 06:10:00,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.66 | bwd_microstep: 1772.36 | bwd_inner_microstep: 1679.84 | bwd_allreduce_microstep: 92.44 | step_microstep: 131.11
[2025-08-03 06:10:00,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.67 | bwd: 7385.62 | bwd_inner: 6922.55 | bwd_allreduce: 462.81 | step: 131.61
{'loss': 0.735, 'learning_rate': 3.673295063594049e-06, 'epoch': 0.73}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15109
total_samples=22082, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:03,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.76 | bwd_microstep: 1778.09 | bwd_inner_microstep: 1771.62 | bwd_allreduce_microstep: 6.40 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11834
total_samples=22085, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:06,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.41 | bwd_microstep: 1814.97 | bwd_inner_microstep: 1586.94 | bwd_allreduce_microstep: 227.95 | step_microstep: 0.18
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12947
total_samples=22089, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:08,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.30 | bwd_microstep: 1770.72 | bwd_inner_microstep: 1673.51 | bwd_allreduce_microstep: 97.14 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13663
total_samples=22093, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:11,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 06:10:11,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.31 | bwd_microstep: 1816.76 | bwd_inner_microstep: 1737.00 | bwd_allreduce_microstep: 79.68 | step_microstep: 150.80
[2025-08-03 06:10:11,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.71 | bwd: 7180.59 | bwd_inner: 6769.07 | bwd_allreduce: 411.26 | step: 151.46
{'loss': 0.7313, 'learning_rate': 3.6607625622713005e-06, 'epoch': 0.73}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13093
total_samples=22097, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:14,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.73 | bwd_microstep: 2040.14 | bwd_inner_microstep: 1879.01 | bwd_allreduce_microstep: 161.08 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11717
total_samples=22100, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:16,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.63 | bwd_microstep: 1808.98 | bwd_inner_microstep: 1565.98 | bwd_allreduce_microstep: 242.94 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13134
total_samples=22104, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:19,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.69 | bwd_microstep: 1897.02 | bwd_inner_microstep: 1838.53 | bwd_allreduce_microstep: 58.42 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13355
total_samples=22108, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:22,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 06:10:22,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.25 | bwd_microstep: 2115.93 | bwd_inner_microstep: 2055.50 | bwd_allreduce_microstep: 60.36 | step_microstep: 109.78
[2025-08-03 06:10:22,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.25 | bwd: 7862.13 | bwd_inner: 7339.01 | bwd_allreduce: 522.88 | step: 110.40
{'loss': 0.7346, 'learning_rate': 3.6482466848597164e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13357
total_samples=22112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:25,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.79 | bwd_microstep: 1751.51 | bwd_inner_microstep: 1694.41 | bwd_allreduce_microstep: 57.03 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12130
total_samples=22115, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:27,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.33 | bwd_microstep: 1912.78 | bwd_inner_microstep: 1581.60 | bwd_allreduce_microstep: 331.12 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12359
total_samples=22119, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:30,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.88 | bwd_microstep: 1717.17 | bwd_inner_microstep: 1553.36 | bwd_allreduce_microstep: 163.73 | step_microstep: 0.34
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13381
total_samples=22123, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:33,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 06:10:33,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.27 | bwd_microstep: 1875.29 | bwd_inner_microstep: 1829.22 | bwd_allreduce_microstep: 46.00 | step_microstep: 459.46
[2025-08-03 06:10:33,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.18 | bwd: 7256.81 | bwd_inner: 6658.60 | bwd_allreduce: 597.96 | step: 460.23
{'loss': 0.7243, 'learning_rate': 3.63574746418072e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13313
total_samples=22127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:35,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.22 | bwd_microstep: 1719.42 | bwd_inner_microstep: 1659.01 | bwd_allreduce_microstep: 60.33 | step_microstep: 0.32
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11860
total_samples=22130, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:38,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.39 | bwd_microstep: 1777.68 | bwd_inner_microstep: 1555.28 | bwd_allreduce_microstep: 222.32 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13337
total_samples=22134, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:40,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.99 | bwd_microstep: 1775.51 | bwd_inner_microstep: 1697.78 | bwd_allreduce_microstep: 77.65 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12718
total_samples=22138, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:43,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 06:10:43,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.05 | bwd_microstep: 1792.76 | bwd_inner_microstep: 1655.72 | bwd_allreduce_microstep: 136.97 | step_microstep: 137.75
[2025-08-03 06:10:43,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.58 | bwd: 7065.42 | bwd_inner: 6567.78 | bwd_allreduce: 497.36 | step: 138.30
73%|███████▎  | 1453/2000 [4:27:15<1:39:18, 10.89s/it]
73%|███████▎  | 1454/2000 [4:27:26<1:37:55, 10.76s/it]
73%|███████▎  | 1455/2000 [4:27:37<1:38:40, 10.86s/it]
73%|███████▎  | 1456/2000 [4:27:48<1:38:18, 10.84s/it]
73%|███████▎  | 1457/2000 [4:27:58<1:36:42, 10.69s/it]
{'loss': 0.7353, 'learning_rate': 3.6232649330120608e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13221
total_samples=22142, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:46,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.19 | bwd_microstep: 1790.29 | bwd_inner_microstep: 1684.91 | bwd_allreduce_microstep: 105.31 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13193
total_samples=22146, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:49,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.94 | bwd_microstep: 2216.52 | bwd_inner_microstep: 2090.62 | bwd_allreduce_microstep: 125.84 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12676
total_samples=22150, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:51,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.38 | bwd_microstep: 2037.17 | bwd_inner_microstep: 1842.28 | bwd_allreduce_microstep: 194.83 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13483
total_samples=22154, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:54,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.90
[2025-08-03 06:10:54,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.75 | bwd_microstep: 1989.01 | bwd_inner_microstep: 1858.94 | bwd_allreduce_microstep: 130.01 | step_microstep: 139.99
[2025-08-03 06:10:54,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.15 | bwd: 8033.04 | bwd_inner: 7476.75 | bwd_allreduce: 556.07 | step: 140.44
{'loss': 0.744, 'learning_rate': 3.610799124087725e-06, 'epoch': 0.73}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12749
total_samples=22158, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:10:57,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.03 | bwd_microstep: 1899.19 | bwd_inner_microstep: 1797.73 | bwd_allreduce_microstep: 101.38 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13257
total_samples=22162, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:00,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.13 | bwd_microstep: 2049.97 | bwd_inner_microstep: 2043.39 | bwd_allreduce_microstep: 6.48 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11711
total_samples=22165, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:03,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.34 | bwd_microstep: 1840.02 | bwd_inner_microstep: 1597.77 | bwd_allreduce_microstep: 242.18 | step_microstep: 0.27
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13130
total_samples=22170, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:05,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 06:11:05,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.76 | bwd_microstep: 1829.57 | bwd_inner_microstep: 1754.98 | bwd_allreduce_microstep: 74.51 | step_microstep: 147.07
[2025-08-03 06:11:05,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2845.19 | bwd: 7618.78 | bwd_inner: 7193.89 | bwd_allreduce: 424.62 | step: 147.58
{'loss': 0.7344, 'learning_rate': 3.5983500700978425e-06, 'epoch': 0.73}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12121
total_samples=22173, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:08,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.92 | bwd_microstep: 2093.67 | bwd_inner_microstep: 1871.68 | bwd_allreduce_microstep: 221.90 | step_microstep: 0.91
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13022
total_samples=22177, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:11,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.00 | bwd_microstep: 1794.22 | bwd_inner_microstep: 1691.67 | bwd_allreduce_microstep: 102.48 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12647
total_samples=22181, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:13,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.65 | bwd_microstep: 1766.26 | bwd_inner_microstep: 1620.09 | bwd_allreduce_microstep: 146.08 | step_microstep: 0.16
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12755
total_samples=22185, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:16,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 06:11:16,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.35 | bwd_microstep: 2104.68 | bwd_inner_microstep: 1999.37 | bwd_allreduce_microstep: 105.24 | step_microstep: 133.35
[2025-08-03 06:11:16,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.84 | bwd: 7758.92 | bwd_inner: 7182.81 | bwd_allreduce: 575.81 | step: 134.55
{'loss': 0.7402, 'learning_rate': 3.585917803688603e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14769
total_samples=22189, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:19,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.06 | bwd_microstep: 1804.41 | bwd_inner_microstep: 1758.70 | bwd_allreduce_microstep: 45.64 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13654
total_samples=22193, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:22,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.60 | bwd_microstep: 1974.82 | bwd_inner_microstep: 1878.88 | bwd_allreduce_microstep: 95.89 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12774
total_samples=22197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:24,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.81 | bwd_microstep: 2038.30 | bwd_inner_microstep: 1891.42 | bwd_allreduce_microstep: 146.78 | step_microstep: 0.38
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 15225
total_samples=22201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:27,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 06:11:27,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.07 | bwd_microstep: 1746.87 | bwd_inner_microstep: 1672.38 | bwd_allreduce_microstep: 74.43 | step_microstep: 136.07
[2025-08-03 06:11:27,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.47 | bwd: 7564.46 | bwd_inner: 7201.37 | bwd_allreduce: 362.84 | step: 136.81
{'loss': 0.7401, 'learning_rate': 3.5735023574621765e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13167
total_samples=22205, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:30,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.73 | bwd_microstep: 1930.84 | bwd_inner_microstep: 1859.78 | bwd_allreduce_microstep: 70.99 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14115
total_samples=22209, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:32,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.42 | bwd_microstep: 1757.63 | bwd_inner_microstep: 1712.89 | bwd_allreduce_microstep: 44.67 | step_microstep: 0.76
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13360
total_samples=22213, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:35,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.80 | bwd_microstep: 1872.43 | bwd_inner_microstep: 1818.87 | bwd_allreduce_microstep: 53.49 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11620
total_samples=22216, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:38,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 06:11:38,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.26 | bwd_microstep: 1778.95 | bwd_inner_microstep: 1550.55 | bwd_allreduce_microstep: 228.33 | step_microstep: 118.44
[2025-08-03 06:11:38,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.14 | bwd: 7339.91 | bwd_inner: 6942.09 | bwd_allreduce: 397.57 | step: 119.43
{'loss': 0.7373, 'learning_rate': 3.5611037639766267e-06, 'epoch': 0.73}
73%|███████▎  | 1458/2000 [4:28:09<1:38:10, 10.87s/it]
73%|███████▎  | 1459/2000 [4:28:20<1:38:05, 10.88s/it]
73%|███████▎  | 1460/2000 [4:28:31<1:38:13, 10.91s/it]
73%|███████▎  | 1461/2000 [4:28:42<1:37:44, 10.88s/it]
73%|███████▎  | 1462/2000 [4:28:53<1:36:47, 10.79s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13119
total_samples=22220, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:41,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.26 | bwd_microstep: 2096.61 | bwd_inner_microstep: 1693.08 | bwd_allreduce_microstep: 403.47 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13230
total_samples=22224, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:43,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.20 | bwd_microstep: 1879.78 | bwd_inner_microstep: 1709.06 | bwd_allreduce_microstep: 170.65 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11751
total_samples=22227, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:46,277] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.28 | bwd_microstep: 1772.46 | bwd_inner_microstep: 1551.34 | bwd_allreduce_microstep: 221.04 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12416
total_samples=22230, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:49,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 06:11:49,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.34 | bwd_microstep: 1958.85 | bwd_inner_microstep: 1754.73 | bwd_allreduce_microstep: 204.06 | step_microstep: 123.34
[2025-08-03 06:11:49,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.00 | bwd: 7707.76 | bwd_inner: 6708.20 | bwd_allreduce: 999.31 | step: 123.83
{'loss': 0.7525, 'learning_rate': 3.548722055745818e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13832
total_samples=22234, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:51,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.63 | bwd_microstep: 1804.75 | bwd_inner_microstep: 1735.84 | bwd_allreduce_microstep: 68.84 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13227
total_samples=22238, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:54,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.20 | bwd_microstep: 1860.99 | bwd_inner_microstep: 1688.20 | bwd_allreduce_microstep: 172.72 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13967
total_samples=22242, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:56,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.76 | bwd_microstep: 1741.78 | bwd_inner_microstep: 1699.95 | bwd_allreduce_microstep: 41.76 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13516
total_samples=22246, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:11:59,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 06:11:59,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.90 | bwd_microstep: 2005.28 | bwd_inner_microstep: 1878.46 | bwd_allreduce_microstep: 126.75 | step_microstep: 148.78
[2025-08-03 06:11:59,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2867.42 | bwd: 7412.84 | bwd_inner: 7002.45 | bwd_allreduce: 410.16 | step: 149.30
{'loss': 0.7406, 'learning_rate': 3.536357265239333e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13331
total_samples=22250, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:02,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.21 | bwd_microstep: 2033.42 | bwd_inner_microstep: 1877.05 | bwd_allreduce_microstep: 156.28 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13299
total_samples=22254, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:05,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.16 | bwd_microstep: 1809.93 | bwd_inner_microstep: 1690.83 | bwd_allreduce_microstep: 119.03 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12872
total_samples=22258, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:08,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.01 | bwd_microstep: 2020.73 | bwd_inner_microstep: 1868.67 | bwd_allreduce_microstep: 152.00 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13675
total_samples=22262, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:10,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 06:12:10,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.01 | bwd_microstep: 1808.84 | bwd_inner_microstep: 1725.04 | bwd_allreduce_microstep: 83.74 | step_microstep: 134.89
[2025-08-03 06:12:10,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2885.31 | bwd: 7672.95 | bwd_inner: 7161.57 | bwd_allreduce: 511.14 | step: 135.42
{'loss': 0.7433, 'learning_rate': 3.5240094248824e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13751
total_samples=22267, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:13,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.94 | bwd_microstep: 2261.48 | bwd_inner_microstep: 2223.66 | bwd_allreduce_microstep: 37.76 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12592
total_samples=22272, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:16,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.76 | bwd_microstep: 1780.88 | bwd_inner_microstep: 1600.66 | bwd_allreduce_microstep: 180.15 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13230
total_samples=22276, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:18,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.81 | bwd_microstep: 1756.15 | bwd_inner_microstep: 1683.89 | bwd_allreduce_microstep: 72.20 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13798
total_samples=22280, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:21,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70
[2025-08-03 06:12:21,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.65 | bwd_microstep: 1844.38 | bwd_inner_microstep: 1740.52 | bwd_allreduce_microstep: 103.80 | step_microstep: 113.67
[2025-08-03 06:12:21,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.09 | bwd: 7642.95 | bwd_inner: 7248.72 | bwd_allreduce: 393.99 | step: 114.08
{'loss': 0.7513, 'learning_rate': 3.511678567055786e-06, 'epoch': 0.73}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15189
total_samples=22284, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:24,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.99 | bwd_microstep: 1874.64 | bwd_inner_microstep: 1821.93 | bwd_allreduce_microstep: 52.63 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13404
total_samples=22288, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:26,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.66 | bwd_microstep: 1842.37 | bwd_inner_microstep: 1742.49 | bwd_allreduce_microstep: 99.81 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14053
total_samples=22293, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:29,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.36 | bwd_microstep: 1854.53 | bwd_inner_microstep: 1743.56 | bwd_allreduce_microstep: 110.91 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15148
total_samples=22297, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:32,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 06:12:32,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.60 | bwd_microstep: 2027.73 | bwd_inner_microstep: 1935.80 | bwd_allreduce_microstep: 91.86 | step_microstep: 113.94
[2025-08-03 06:12:32,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.54 | bwd: 7599.32 | bwd_inner: 7243.78 | bwd_allreduce: 355.30 | step: 114.43
{'loss': 0.7343, 'learning_rate': 3.4993647240957307e-06, 'epoch': 0.73}
73%|███████▎  | 1463/2000 [4:29:03<1:36:57, 10.83s/it]
73%|███████▎  | 1464/2000 [4:29:14<1:36:26, 10.80s/it]
73%|███████▎  | 1465/2000 [4:29:25<1:36:46, 10.85s/it]
73%|███████▎  | 1466/2000 [4:29:36<1:36:34, 10.85s/it]
73%|███████▎  | 1467/2000 [4:29:47<1:36:26, 10.86s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13485
total_samples=22301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:35,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.56 | bwd_microstep: 1936.12 | bwd_inner_microstep: 1709.71 | bwd_allreduce_microstep: 226.34 | step_microstep: 0.30
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13142
total_samples=22305, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:37,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.40 | bwd_microstep: 1731.74 | bwd_inner_microstep: 1654.71 | bwd_allreduce_microstep: 76.97 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13691
total_samples=22310, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:40,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.43 | bwd_microstep: 2284.24 | bwd_inner_microstep: 1990.39 | bwd_allreduce_microstep: 293.79 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14324
total_samples=22314, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:43,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 06:12:43,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.01 | bwd_microstep: 2005.41 | bwd_inner_microstep: 1887.22 | bwd_allreduce_microstep: 118.12 | step_microstep: 134.76
[2025-08-03 06:12:43,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.33 | bwd: 7957.58 | bwd_inner: 7242.02 | bwd_allreduce: 715.31 | step: 135.28
{'loss': 0.7352, 'learning_rate': 3.487067928293848e-06, 'epoch': 0.73}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12712
total_samples=22318, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:46,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.77 | bwd_microstep: 1791.38 | bwd_inner_microstep: 1633.63 | bwd_allreduce_microstep: 157.68 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12189
total_samples=22321, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:49,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.45 | bwd_microstep: 2021.21 | bwd_inner_microstep: 1799.90 | bwd_allreduce_microstep: 221.23 | step_microstep: 0.30
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13143
total_samples=22325, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:51,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.91 | bwd_microstep: 1870.29 | bwd_inner_microstep: 1677.51 | bwd_allreduce_microstep: 192.70 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13432
total_samples=22329, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:54,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 06:12:54,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.83 | bwd_microstep: 2284.38 | bwd_inner_microstep: 1869.67 | bwd_allreduce_microstep: 414.66 | step_microstep: 110.11
[2025-08-03 06:12:54,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.88 | bwd: 7967.31 | bwd_inner: 6980.71 | bwd_allreduce: 986.36 | step: 110.75
{'loss': 0.7417, 'learning_rate': 3.4747882118970565e-06, 'epoch': 0.73}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11844
total_samples=22332, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:12:57,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.14 | bwd_microstep: 1927.07 | bwd_inner_microstep: 1587.03 | bwd_allreduce_microstep: 339.97 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11698
total_samples=22335, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:00,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.09 | bwd_microstep: 1791.43 | bwd_inner_microstep: 1560.96 | bwd_allreduce_microstep: 230.41 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13269
total_samples=22339, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:02,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.67 | bwd_microstep: 1830.63 | bwd_inner_microstep: 1642.90 | bwd_allreduce_microstep: 187.66 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14615
total_samples=22343, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:05,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 06:13:05,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.62 | bwd_microstep: 1796.32 | bwd_inner_microstep: 1755.21 | bwd_allreduce_microstep: 41.03 | step_microstep: 139.22
[2025-08-03 06:13:05,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.45 | bwd: 7345.51 | bwd_inner: 6546.11 | bwd_allreduce: 799.15 | step: 139.83
{'loss': 0.7397, 'learning_rate': 3.4625256071074776e-06, 'epoch': 0.73}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11871
total_samples=22346, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:08,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.32 | bwd_microstep: 1736.98 | bwd_inner_microstep: 1536.30 | bwd_allreduce_microstep: 200.61 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12675
total_samples=22350, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:10,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.47 | bwd_microstep: 1806.16 | bwd_inner_microstep: 1625.85 | bwd_allreduce_microstep: 180.25 | step_microstep: 0.13
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13641
total_samples=22354, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:13,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.99 | bwd_microstep: 2045.85 | bwd_inner_microstep: 1833.88 | bwd_allreduce_microstep: 211.91 | step_microstep: 0.31
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12879
total_samples=22358, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:16,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 06:13:16,568] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.75 | bwd_microstep: 1839.85 | bwd_inner_microstep: 1782.65 | bwd_allreduce_microstep: 57.13 | step_microstep: 464.90
[2025-08-03 06:13:16,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.45 | bwd: 7428.90 | bwd_inner: 6778.67 | bwd_allreduce: 649.98 | step: 465.47
{'loss': 0.729, 'learning_rate': 3.450280146082361e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11760
total_samples=22361, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:19,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.45 | bwd_microstep: 1857.81 | bwd_inner_microstep: 1602.63 | bwd_allreduce_microstep: 255.12 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13650
total_samples=22365, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:22,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.38 | bwd_microstep: 2072.83 | bwd_inner_microstep: 1981.14 | bwd_allreduce_microstep: 91.62 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15006
total_samples=22370, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:24,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.05 | bwd_microstep: 1851.11 | bwd_inner_microstep: 1786.11 | bwd_allreduce_microstep: 64.92 | step_microstep: 0.32
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13723
total_samples=22375, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:27,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 06:13:27,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.42 | bwd_microstep: 2073.10 | bwd_inner_microstep: 2066.71 | bwd_allreduce_microstep: 6.32 | step_microstep: 120.72
[2025-08-03 06:13:27,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2882.21 | bwd: 7854.91 | bwd_inner: 7436.58 | bwd_allreduce: 418.07 | step: 121.27
{'loss': 0.7393, 'learning_rate': 3.4380518609340076e-06, 'epoch': 0.74}
73%|███████▎  | 1468/2000 [4:29:58<1:37:08, 10.96s/it]
73%|███████▎  | 1469/2000 [4:30:09<1:37:36, 11.03s/it]
74%|███████▎  | 1470/2000 [4:30:20<1:36:18, 10.90s/it]
74%|███████▎  | 1471/2000 [4:30:31<1:36:29, 10.94s/it]
74%|███████▎  | 1472/2000 [4:30:42<1:36:52, 11.01s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11847
total_samples=22378, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:30,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.27 | bwd_microstep: 1750.01 | bwd_inner_microstep: 1533.58 | bwd_allreduce_microstep: 216.37 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13385
total_samples=22383, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:32,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.45 | bwd_microstep: 1703.62 | bwd_inner_microstep: 1629.82 | bwd_allreduce_microstep: 73.73 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11612
total_samples=22386, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:35,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.40 | bwd_microstep: 1879.39 | bwd_inner_microstep: 1633.77 | bwd_allreduce_microstep: 245.55 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13197
total_samples=22390, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:38,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 06:13:38,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.52 | bwd_microstep: 1997.69 | bwd_inner_microstep: 1867.62 | bwd_allreduce_microstep: 130.01 | step_microstep: 140.42
[2025-08-03 06:13:38,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.57 | bwd: 7330.75 | bwd_inner: 6664.78 | bwd_allreduce: 665.73 | step: 141.01
{'loss': 0.7263, 'learning_rate': 3.4258407837296635e-06, 'epoch': 0.74}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14116
total_samples=22394, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:41,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.70 | bwd_microstep: 1856.49 | bwd_inner_microstep: 1751.34 | bwd_allreduce_microstep: 105.08 | step_microstep: 0.28
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13309
total_samples=22398, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:43,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.31 | bwd_microstep: 1940.99 | bwd_inner_microstep: 1667.63 | bwd_allreduce_microstep: 273.30 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12554
total_samples=22402, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:46,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.39 | bwd_microstep: 1734.33 | bwd_inner_microstep: 1605.96 | bwd_allreduce_microstep: 128.30 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14383
total_samples=22407, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:49,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.84
[2025-08-03 06:13:49,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.79 | bwd_microstep: 1915.55 | bwd_inner_microstep: 1746.67 | bwd_allreduce_microstep: 168.81 | step_microstep: 407.18
[2025-08-03 06:13:49,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.11 | bwd: 7447.41 | bwd_inner: 6771.59 | bwd_allreduce: 675.57 | step: 407.83
{'loss': 0.7347, 'learning_rate': 3.413646946491458e-06, 'epoch': 0.74}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13346
total_samples=22412, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:51,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.14 | bwd_microstep: 1733.73 | bwd_inner_microstep: 1648.27 | bwd_allreduce_microstep: 85.39 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12280
total_samples=22415, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:54,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.39 | bwd_microstep: 1746.92 | bwd_inner_microstep: 1566.85 | bwd_allreduce_microstep: 179.99 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11742
total_samples=22418, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:13:57,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.74 | bwd_microstep: 2019.55 | bwd_inner_microstep: 1792.04 | bwd_allreduce_microstep: 227.45 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11967
total_samples=22421, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:00,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 06:14:00,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.67 | bwd_microstep: 2074.31 | bwd_inner_microstep: 1720.13 | bwd_allreduce_microstep: 354.12 | step_microstep: 107.57
[2025-08-03 06:14:00,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.86 | bwd: 7574.57 | bwd_inner: 6727.29 | bwd_allreduce: 847.03 | step: 108.10
{'loss': 0.7469, 'learning_rate': 3.4014703811963024e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11893
total_samples=22424, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:02,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.09 | bwd_microstep: 1736.43 | bwd_inner_microstep: 1549.57 | bwd_allreduce_microstep: 186.78 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11699
total_samples=22427, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:05,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.42 | bwd_microstep: 2020.71 | bwd_inner_microstep: 1759.11 | bwd_allreduce_microstep: 261.54 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14094
total_samples=22431, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:08,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.77 | bwd_microstep: 2536.40 | bwd_inner_microstep: 2254.22 | bwd_allreduce_microstep: 282.10 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13478
total_samples=22435, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:11,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 06:14:11,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.11 | bwd_microstep: 2021.25 | bwd_inner_microstep: 1985.03 | bwd_allreduce_microstep: 36.16 | step_microstep: 109.74
[2025-08-03 06:14:11,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2750.32 | bwd: 8314.85 | bwd_inner: 7547.92 | bwd_allreduce: 766.66 | step: 110.29
{'loss': 0.7248, 'learning_rate': 3.3893111197758276e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11922
total_samples=22438, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:14,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.38 | bwd_microstep: 2004.20 | bwd_inner_microstep: 1782.12 | bwd_allreduce_microstep: 222.02 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13096
total_samples=22442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:17,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.75 | bwd_microstep: 1856.59 | bwd_inner_microstep: 1707.59 | bwd_allreduce_microstep: 148.93 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11587
total_samples=22445, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:19,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.84 | bwd_microstep: 1836.40 | bwd_inner_microstep: 1596.27 | bwd_allreduce_microstep: 240.06 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13727
total_samples=22449, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:22,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.23
[2025-08-03 06:14:22,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.45 | bwd_microstep: 2031.65 | bwd_inner_microstep: 1892.69 | bwd_allreduce_microstep: 138.89 | step_microstep: 125.99
[2025-08-03 06:14:22,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.35 | bwd: 7728.88 | bwd_inner: 6978.66 | bwd_allreduce: 749.98 | step: 126.55
{'loss': 0.7369, 'learning_rate': 3.3771691941162755e-06, 'epoch': 0.74}
74%|███████▎  | 1473/2000 [4:30:53<1:35:36, 10.89s/it]
74%|███████▎  | 1474/2000 [4:31:04<1:35:43, 10.92s/it]
74%|███████▍  | 1475/2000 [4:31:14<1:35:14, 10.88s/it]
74%|███████▍  | 1476/2000 [4:31:26<1:36:35, 11.06s/it]
74%|███████▍  | 1477/2000 [4:31:37<1:36:16, 11.04s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12747
total_samples=22453, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:25,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.83 | bwd_microstep: 2166.76 | bwd_inner_microstep: 1877.91 | bwd_allreduce_microstep: 288.78 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13688
total_samples=22458, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:28,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.74 | bwd_microstep: 1753.49 | bwd_inner_microstep: 1682.30 | bwd_allreduce_microstep: 71.11 | step_microstep: 0.16
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12731
total_samples=22462, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:30,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.70 | bwd_microstep: 1756.29 | bwd_inner_microstep: 1623.44 | bwd_allreduce_microstep: 132.78 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13029
total_samples=22466, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:33,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.14
[2025-08-03 06:14:33,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.24 | bwd_microstep: 1751.86 | bwd_inner_microstep: 1684.43 | bwd_allreduce_microstep: 67.36 | step_microstep: 117.17
[2025-08-03 06:14:33,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.45 | bwd: 7428.45 | bwd_inner: 6868.09 | bwd_allreduce: 560.12 | step: 117.68
{'loss': 0.7366, 'learning_rate': 3.3650446360584276e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11880
total_samples=22469, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:36,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.62 | bwd_microstep: 2223.01 | bwd_inner_microstep: 1983.10 | bwd_allreduce_microstep: 239.81 | step_microstep: 0.36
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13243
total_samples=22473, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:38,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.47 | bwd_microstep: 1917.97 | bwd_inner_microstep: 1682.79 | bwd_allreduce_microstep: 235.10 | step_microstep: 0.34
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11757
total_samples=22476, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:42,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.15 | bwd_microstep: 2333.77 | bwd_inner_microstep: 2322.86 | bwd_allreduce_microstep: 10.86 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13696
total_samples=22480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:44,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 06:14:44,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.42 | bwd_microstep: 1722.72 | bwd_inner_microstep: 1662.52 | bwd_allreduce_microstep: 60.13 | step_microstep: 114.77
[2025-08-03 06:14:44,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2744.58 | bwd: 8197.53 | bwd_inner: 7651.27 | bwd_allreduce: 546.00 | step: 115.62
{'loss': 0.7376, 'learning_rate': 3.35293747739753e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12362
total_samples=22483, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:47,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.11 | bwd_microstep: 1789.35 | bwd_inner_microstep: 1576.53 | bwd_allreduce_microstep: 212.75 | step_microstep: 0.27
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12855
total_samples=22487, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:49,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.64 | bwd_microstep: 1791.05 | bwd_inner_microstep: 1619.36 | bwd_allreduce_microstep: 171.64 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12912
total_samples=22490, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:52,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.64 | bwd_microstep: 1801.61 | bwd_inner_microstep: 1621.25 | bwd_allreduce_microstep: 180.29 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13479
total_samples=22494, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:55,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 06:14:55,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.30 | bwd_microstep: 1760.56 | bwd_inner_microstep: 1693.81 | bwd_allreduce_microstep: 66.68 | step_microstep: 141.38
[2025-08-03 06:14:55,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2874.60 | bwd: 7142.62 | bwd_inner: 6510.95 | bwd_allreduce: 631.42 | step: 141.91
{'loss': 0.7348, 'learning_rate': 3.3408477498831917e-06, 'epoch': 0.74}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14379
total_samples=22498, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:14:57,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.52 | bwd_microstep: 1921.48 | bwd_inner_microstep: 1763.17 | bwd_allreduce_microstep: 158.24 | step_microstep: 0.26
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12622
total_samples=22502, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:00,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 663.30 | bwd_microstep: 1691.99 | bwd_inner_microstep: 1573.43 | bwd_allreduce_microstep: 118.49 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12280
total_samples=22505, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:02,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.94 | bwd_microstep: 1829.76 | bwd_inner_microstep: 1569.35 | bwd_allreduce_microstep: 260.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14378
total_samples=22510, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:05,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.08
[2025-08-03 06:15:05,637] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.16 | bwd_microstep: 1834.49 | bwd_inner_microstep: 1738.43 | bwd_allreduce_microstep: 95.98 | step_microstep: 131.14
[2025-08-03 06:15:05,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.84 | bwd: 7277.77 | bwd_inner: 6644.38 | bwd_allreduce: 633.14 | step: 131.62
{'loss': 0.7302, 'learning_rate': 3.3287754852193143e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12092
total_samples=22513, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:08,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.19 | bwd_microstep: 1702.59 | bwd_inner_microstep: 1544.01 | bwd_allreduce_microstep: 158.51 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11836
total_samples=22516, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:10,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.77 | bwd_microstep: 1987.25 | bwd_inner_microstep: 1766.73 | bwd_allreduce_microstep: 220.45 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13189
total_samples=22520, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:13,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 888.48 | bwd_microstep: 1759.06 | bwd_inner_microstep: 1678.99 | bwd_allreduce_microstep: 80.00 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14272
total_samples=22525, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:16,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 06:15:16,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.02 | bwd_microstep: 2273.69 | bwd_inner_microstep: 2183.86 | bwd_allreduce_microstep: 89.77 | step_microstep: 118.57
[2025-08-03 06:15:16,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3028.39 | bwd: 7722.65 | bwd_inner: 7173.59 | bwd_allreduce: 548.81 | step: 118.94
{'loss': 0.735, 'learning_rate': 3.3167207150640003e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11902
total_samples=22528, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:19,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 669.80 | bwd_microstep: 2045.67 | bwd_inner_microstep: 1838.89 | bwd_allreduce_microstep: 206.72 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11882
total_samples=22531, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:22,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.18 | bwd_microstep: 1742.67 | bwd_inner_microstep: 1551.43 | bwd_allreduce_microstep: 191.17 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12019
total_samples=22534, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:25,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.54 | bwd_microstep: 2243.12 | bwd_inner_microstep: 2017.48 | bwd_allreduce_microstep: 225.57 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13278
total_samples=22538, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:27,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.82
[2025-08-03 06:15:27,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.53 | bwd_microstep: 1777.82 | bwd_inner_microstep: 1680.89 | bwd_allreduce_microstep: 96.87 | step_microstep: 168.18
[2025-08-03 06:15:27,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.98 | bwd: 7809.33 | bwd_inner: 7088.68 | bwd_allreduce: 720.41 | step: 168.57
74%|███████▍  | 1478/2000 [4:31:48<1:35:00, 10.92s/it]
74%|███████▍  | 1479/2000 [4:31:59<1:35:58, 11.05s/it]
74%|███████▍  | 1480/2000 [4:32:09<1:34:17, 10.88s/it]
74%|███████▍  | 1481/2000 [4:32:20<1:33:17, 10.78s/it]
74%|███████▍  | 1482/2000 [4:32:31<1:34:04, 10.90s/it]
{'loss': 0.7408, 'learning_rate': 3.304683471029485e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11759
total_samples=22541, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:30,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.62 | bwd_microstep: 1752.71 | bwd_inner_microstep: 1536.48 | bwd_allreduce_microstep: 216.17 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12211
total_samples=22544, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:33,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.63 | bwd_microstep: 2001.26 | bwd_inner_microstep: 1746.75 | bwd_allreduce_microstep: 254.44 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12094
total_samples=22547, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:35,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.74 | bwd_microstep: 1816.41 | bwd_inner_microstep: 1589.80 | bwd_allreduce_microstep: 226.55 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12970
total_samples=22551, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:38,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58
[2025-08-03 06:15:38,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.86 | bwd_microstep: 1804.52 | bwd_inner_microstep: 1670.49 | bwd_allreduce_microstep: 133.96 | step_microstep: 133.98
[2025-08-03 06:15:38,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.78 | bwd: 7374.96 | bwd_inner: 6543.52 | bwd_allreduce: 831.20 | step: 134.57
{'loss': 0.7375, 'learning_rate': 3.2926637846820366e-06, 'epoch': 0.74}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13212
total_samples=22555, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:41,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.78 | bwd_microstep: 1758.27 | bwd_inner_microstep: 1637.98 | bwd_allreduce_microstep: 120.21 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12182
total_samples=22558, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:43,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.55 | bwd_microstep: 2039.34 | bwd_inner_microstep: 1817.08 | bwd_allreduce_microstep: 222.20 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12473
total_samples=22562, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:46,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.61 | bwd_microstep: 2344.77 | bwd_inner_microstep: 1968.24 | bwd_allreduce_microstep: 376.46 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16185
total_samples=22566, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:49,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.75
[2025-08-03 06:15:49,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.31 | bwd_microstep: 1919.22 | bwd_inner_microstep: 1816.01 | bwd_allreduce_microstep: 103.15 | step_microstep: 114.19
[2025-08-03 06:15:49,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.17 | bwd: 8061.65 | bwd_inner: 7239.28 | bwd_allreduce: 822.10 | step: 114.71
{'loss': 0.7359, 'learning_rate': 3.280661687541876e-06, 'epoch': 0.74}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13324
total_samples=22570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:52,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.89 | bwd_microstep: 2055.70 | bwd_inner_microstep: 1888.23 | bwd_allreduce_microstep: 167.40 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11844
total_samples=22573, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:55,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.82 | bwd_microstep: 1986.46 | bwd_inner_microstep: 1980.53 | bwd_allreduce_microstep: 5.87 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13644
total_samples=22577, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:15:57,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.40 | bwd_microstep: 1803.49 | bwd_inner_microstep: 1725.55 | bwd_allreduce_microstep: 77.88 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13108
total_samples=22581, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:00,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 06:16:00,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.61 | bwd_microstep: 1959.09 | bwd_inner_microstep: 1863.24 | bwd_allreduce_microstep: 95.79 | step_microstep: 116.64
[2025-08-03 06:16:00,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.64 | bwd: 7804.79 | bwd_inner: 7457.54 | bwd_allreduce: 347.01 | step: 117.13
{'loss': 0.7451, 'learning_rate': 3.268677211083109e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11540
total_samples=22584, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:03,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.42 | bwd_microstep: 1841.95 | bwd_inner_microstep: 1620.25 | bwd_allreduce_microstep: 221.63 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13837
total_samples=22588, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:06,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.05 | bwd_microstep: 1855.86 | bwd_inner_microstep: 1751.83 | bwd_allreduce_microstep: 103.95 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14153
total_samples=22592, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:08,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.05 | bwd_microstep: 1804.92 | bwd_inner_microstep: 1739.64 | bwd_allreduce_microstep: 65.21 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14611
total_samples=22596, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:11,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.71
[2025-08-03 06:16:11,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.49 | bwd_microstep: 2020.95 | bwd_inner_microstep: 1810.49 | bwd_allreduce_microstep: 210.39 | step_microstep: 124.57
[2025-08-03 06:16:11,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.94 | bwd: 7523.73 | bwd_inner: 6922.21 | bwd_allreduce: 601.27 | step: 124.92
{'loss': 0.7361, 'learning_rate': 3.256710386733629e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11610
total_samples=22599, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:14,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.66 | bwd_microstep: 2073.79 | bwd_inner_microstep: 1887.53 | bwd_allreduce_microstep: 186.19 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12638
total_samples=22603, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:17,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.67 | bwd_microstep: 2109.42 | bwd_inner_microstep: 1896.01 | bwd_allreduce_microstep: 213.35 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14238
total_samples=22608, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:19,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.79 | bwd_microstep: 1776.03 | bwd_inner_microstep: 1718.48 | bwd_allreduce_microstep: 57.43 | step_microstep: 0.45
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13232
total_samples=22612, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:22,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 06:16:22,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.56 | bwd_microstep: 1760.08 | bwd_inner_microstep: 1673.22 | bwd_allreduce_microstep: 86.78 | step_microstep: 138.85
[2025-08-03 06:16:22,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.61 | bwd: 7719.41 | bwd_inner: 7175.24 | bwd_allreduce: 543.86 | step: 139.57
74%|███████▍  | 1483/2000 [4:32:42<1:34:16, 10.94s/it]
74%|███████▍  | 1484/2000 [4:32:53<1:33:21, 10.86s/it]
74%|███████▍  | 1485/2000 [4:33:04<1:34:18, 10.99s/it]
74%|███████▍  | 1486/2000 [4:33:15<1:34:17, 11.01s/it]
74%|███████▍  | 1487/2000 [4:33:26<1:33:33, 10.94s/it]
{'loss': 0.7446, 'learning_rate': 3.2447612458750365e-06, 'epoch': 0.74}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11789
total_samples=22615, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:25,234] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.74 | bwd_microstep: 1825.35 | bwd_inner_microstep: 1591.69 | bwd_allreduce_microstep: 233.59 | step_microstep: 0.83
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12119
total_samples=22618, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:27,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.39 | bwd_microstep: 1777.52 | bwd_inner_microstep: 1566.22 | bwd_allreduce_microstep: 211.23 | step_microstep: 0.28
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13393
total_samples=22622, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:30,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.15 | bwd_microstep: 2081.20 | bwd_inner_microstep: 1904.42 | bwd_allreduce_microstep: 176.72 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13267
total_samples=22626, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:33,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 06:16:33,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.50 | bwd_microstep: 2034.59 | bwd_inner_microstep: 1902.81 | bwd_allreduce_microstep: 131.72 | step_microstep: 130.03
[2025-08-03 06:16:33,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.71 | bwd: 7718.71 | bwd_inner: 6965.13 | bwd_allreduce: 753.34 | step: 131.28
{'loss': 0.7305, 'learning_rate': 3.2328298198425556e-06, 'epoch': 0.74}
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11647
total_samples=22629, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:36,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.57 | bwd_microstep: 1996.82 | bwd_inner_microstep: 1559.41 | bwd_allreduce_microstep: 437.34 | step_microstep: 0.21
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12776
total_samples=22633, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:39,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.69 | bwd_microstep: 2020.08 | bwd_inner_microstep: 1851.01 | bwd_allreduce_microstep: 169.00 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12203
total_samples=22636, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:42,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.86 | bwd_microstep: 2132.98 | bwd_inner_microstep: 1978.63 | bwd_allreduce_microstep: 154.27 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12839
total_samples=22640, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:44,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 06:16:44,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.03 | bwd_microstep: 1738.94 | bwd_inner_microstep: 1634.43 | bwd_allreduce_microstep: 104.44 | step_microstep: 141.24
[2025-08-03 06:16:44,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.07 | bwd: 7888.87 | bwd_inner: 7023.49 | bwd_allreduce: 865.13 | step: 141.86
{'loss': 0.7273, 'learning_rate': 3.2209161399249677e-06, 'epoch': 0.74}
dynamic ViT batch size: 41, images per sample: 41.0, dynamic token length: 11602
total_samples=22643, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:47,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.90 | bwd_microstep: 1769.12 | bwd_inner_microstep: 1513.69 | bwd_allreduce_microstep: 255.36 | step_microstep: 0.29
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 16297
total_samples=22648, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:49,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.84 | bwd_microstep: 1776.30 | bwd_inner_microstep: 1770.19 | bwd_allreduce_microstep: 6.04 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13388
total_samples=22652, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:52,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.44 | bwd_microstep: 1794.44 | bwd_inner_microstep: 1657.14 | bwd_allreduce_microstep: 137.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13493
total_samples=22656, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:55,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.80
[2025-08-03 06:16:55,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.41 | bwd_microstep: 1811.02 | bwd_inner_microstep: 1716.08 | bwd_allreduce_microstep: 94.87 | step_microstep: 156.28
[2025-08-03 06:16:55,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2871.53 | bwd: 7150.93 | bwd_inner: 6657.08 | bwd_allreduce: 493.60 | step: 156.80
{'loss': 0.7397, 'learning_rate': 3.209020237364505e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13386
total_samples=22660, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:16:57,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.38 | bwd_microstep: 1794.26 | bwd_inner_microstep: 1713.44 | bwd_allreduce_microstep: 80.75 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12620
total_samples=22663, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:00,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.63 | bwd_microstep: 1859.94 | bwd_inner_microstep: 1582.53 | bwd_allreduce_microstep: 277.34 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13823
total_samples=22667, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:03,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.50 | bwd_microstep: 2159.14 | bwd_inner_microstep: 1849.74 | bwd_allreduce_microstep: 309.32 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13041
total_samples=22671, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:06,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 06:17:06,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.68 | bwd_microstep: 1861.36 | bwd_inner_microstep: 1682.74 | bwd_allreduce_microstep: 178.54 | step_microstep: 131.25
[2025-08-03 06:17:06,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.11 | bwd: 7674.75 | bwd_inner: 6828.45 | bwd_allreduce: 846.03 | step: 131.77
{'loss': 0.7424, 'learning_rate': 3.197142143356787e-06, 'epoch': 0.75}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13831
total_samples=22675, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:09,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.09 | bwd_microstep: 2158.86 | bwd_inner_microstep: 2098.28 | bwd_allreduce_microstep: 60.52 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12184
total_samples=22679, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:11,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.92 | bwd_microstep: 1882.60 | bwd_inner_microstep: 1545.39 | bwd_allreduce_microstep: 337.14 | step_microstep: 0.23
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13012
total_samples=22684, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:14,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.00 | bwd_microstep: 1863.69 | bwd_inner_microstep: 1650.89 | bwd_allreduce_microstep: 212.73 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13231
total_samples=22688, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:16,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 06:17:16,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.65 | bwd_microstep: 1735.80 | bwd_inner_microstep: 1631.91 | bwd_allreduce_microstep: 103.82 | step_microstep: 136.86
[2025-08-03 06:17:16,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.60 | bwd: 7641.01 | bwd_inner: 6926.48 | bwd_allreduce: 714.30 | step: 137.32
74%|███████▍  | 1488/2000 [4:33:37<1:33:28, 10.95s/it]
74%|███████▍  | 1489/2000 [4:33:48<1:33:26, 10.97s/it]
74%|███████▍  | 1490/2000 [4:33:59<1:33:30, 11.00s/it]
75%|███████▍  | 1491/2000 [4:34:10<1:31:58, 10.84s/it]
75%|███████▍  | 1492/2000 [4:34:20<1:31:59, 10.87s/it]
{'loss': 0.7298, 'learning_rate': 3.1852818890507255e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14964
total_samples=22692, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:19,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.66 | bwd_microstep: 1783.07 | bwd_inner_microstep: 1759.62 | bwd_allreduce_microstep: 23.38 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13036
total_samples=22696, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:22,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.95 | bwd_microstep: 1755.12 | bwd_inner_microstep: 1645.27 | bwd_allreduce_microstep: 109.79 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11947
total_samples=22699, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:25,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 951.23 | bwd_microstep: 1970.90 | bwd_inner_microstep: 1777.28 | bwd_allreduce_microstep: 193.55 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13417
total_samples=22703, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:28,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.63
[2025-08-03 06:17:28,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.19 | bwd_microstep: 2124.27 | bwd_inner_microstep: 1968.88 | bwd_allreduce_microstep: 155.33 | step_microstep: 137.27
[2025-08-03 06:17:28,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3040.96 | bwd: 7633.41 | bwd_inner: 7151.04 | bwd_allreduce: 482.12 | step: 137.72
{'loss': 0.7335, 'learning_rate': 3.1734395055484623e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13344
total_samples=22707, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:30,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.53 | bwd_microstep: 1769.19 | bwd_inner_microstep: 1696.16 | bwd_allreduce_microstep: 72.97 | step_microstep: 0.22
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12760
total_samples=22712, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:33,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.13 | bwd_microstep: 2018.91 | bwd_inner_microstep: 1811.13 | bwd_allreduce_microstep: 207.72 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11686
total_samples=22715, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:36,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.53 | bwd_microstep: 2068.86 | bwd_inner_microstep: 1835.15 | bwd_allreduce_microstep: 233.65 | step_microstep: 0.36
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14187
total_samples=22719, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:39,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.56
[2025-08-03 06:17:39,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.61 | bwd_microstep: 1747.26 | bwd_inner_microstep: 1716.34 | bwd_allreduce_microstep: 30.86 | step_microstep: 440.47
[2025-08-03 06:17:39,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2857.74 | bwd: 7604.28 | bwd_inner: 7058.77 | bwd_allreduce: 545.27 | step: 441.14
{'loss': 0.7283, 'learning_rate': 3.1616150239052647e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13612
total_samples=22723, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:41,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.77 | bwd_microstep: 1781.67 | bwd_inner_microstep: 1705.96 | bwd_allreduce_microstep: 75.63 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13581
total_samples=22727, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:44,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.37 | bwd_microstep: 2076.54 | bwd_inner_microstep: 1935.34 | bwd_allreduce_microstep: 141.15 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13196
total_samples=22731, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:47,882] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.77 | bwd_microstep: 2368.52 | bwd_inner_microstep: 1954.18 | bwd_allreduce_microstep: 414.25 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14066
total_samples=22735, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:50,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 06:17:50,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.02 | bwd_microstep: 1784.32 | bwd_inner_microstep: 1722.72 | bwd_allreduce_microstep: 61.53 | step_microstep: 154.13
[2025-08-03 06:17:50,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.85 | bwd: 8011.10 | bwd_inner: 7318.20 | bwd_allreduce: 692.65 | step: 154.73
{'loss': 0.7411, 'learning_rate': 3.1498084751294523e-06, 'epoch': 0.75}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12032
total_samples=22738, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:53,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.23 | bwd_microstep: 2016.00 | bwd_inner_microstep: 1695.65 | bwd_allreduce_microstep: 320.28 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13730
total_samples=22742, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:55,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.32 | bwd_microstep: 1833.54 | bwd_inner_microstep: 1739.91 | bwd_allreduce_microstep: 93.55 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12315
total_samples=22745, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:17:58,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.53 | bwd_microstep: 1852.16 | bwd_inner_microstep: 1624.30 | bwd_allreduce_microstep: 227.80 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13574
total_samples=22750, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:01,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 06:18:01,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.87 | bwd_microstep: 1854.59 | bwd_inner_microstep: 1707.96 | bwd_allreduce_microstep: 146.55 | step_microstep: 135.71
[2025-08-03 06:18:01,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.89 | bwd: 7556.35 | bwd_inner: 6767.83 | bwd_allreduce: 788.26 | step: 136.34
{'loss': 0.73, 'learning_rate': 3.1380198901823313e-06, 'epoch': 0.75}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11730
total_samples=22753, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:03,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.02 | bwd_microstep: 1700.05 | bwd_inner_microstep: 1528.36 | bwd_allreduce_microstep: 171.61 | step_microstep: 0.38
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13390
total_samples=22757, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:06,690] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.00 | bwd_microstep: 2031.31 | bwd_inner_microstep: 1894.55 | bwd_allreduce_microstep: 136.70 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13194
total_samples=22761, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:09,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.32 | bwd_microstep: 1822.79 | bwd_inner_microstep: 1706.85 | bwd_allreduce_microstep: 115.87 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13382
total_samples=22765, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:11,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51
[2025-08-03 06:18:11,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.77 | bwd_microstep: 1766.49 | bwd_inner_microstep: 1704.01 | bwd_allreduce_microstep: 62.41 | step_microstep: 135.73
[2025-08-03 06:18:11,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2749.03 | bwd: 7320.69 | bwd_inner: 6833.77 | bwd_allreduce: 486.67 | step: 136.48
75%|███████▍  | 1493/2000 [4:34:31<1:31:54, 10.88s/it]
75%|███████▍  | 1494/2000 [4:34:42<1:32:17, 10.94s/it]
75%|███████▍  | 1495/2000 [4:34:54<1:32:44, 11.02s/it]
75%|███████▍  | 1496/2000 [4:35:05<1:33:16, 11.10s/it]
75%|███████▍  | 1497/2000 [4:35:16<1:32:22, 11.02s/it]
{'loss': 0.7269, 'learning_rate': 3.126249299978086e-06, 'epoch': 0.75}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11748
total_samples=22768, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:14,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.69 | bwd_microstep: 1795.05 | bwd_inner_microstep: 1578.90 | bwd_allreduce_microstep: 216.08 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15552
total_samples=22772, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:17,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.62 | bwd_microstep: 1936.04 | bwd_inner_microstep: 1785.88 | bwd_allreduce_microstep: 150.08 | step_microstep: 0.36
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13392
total_samples=22776, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:20,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.41 | bwd_microstep: 2180.19 | bwd_inner_microstep: 2174.06 | bwd_allreduce_microstep: 6.06 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13088
total_samples=22781, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:22,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 06:18:22,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.47 | bwd_microstep: 1753.40 | bwd_inner_microstep: 1643.40 | bwd_allreduce_microstep: 109.93 | step_microstep: 109.73
[2025-08-03 06:18:22,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.12 | bwd: 7664.73 | bwd_inner: 7182.23 | bwd_allreduce: 482.24 | step: 110.31
{'loss': 0.727, 'learning_rate': 3.1144967353837196e-06, 'epoch': 0.75}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12154
total_samples=22784, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:25,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.84 | bwd_microstep: 1708.05 | bwd_inner_microstep: 1541.07 | bwd_allreduce_microstep: 166.91 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14747
total_samples=22790, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:27,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.46 | bwd_microstep: 1782.05 | bwd_inner_microstep: 1693.45 | bwd_allreduce_microstep: 88.53 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14298
total_samples=22795, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:30,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.61 | bwd_microstep: 1755.44 | bwd_inner_microstep: 1717.41 | bwd_allreduce_microstep: 37.96 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15640
total_samples=22799, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:33,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 06:18:33,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.17 | bwd_microstep: 1860.67 | bwd_inner_microstep: 1812.38 | bwd_allreduce_microstep: 48.22 | step_microstep: 110.99
[2025-08-03 06:18:33,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.02 | bwd: 7106.26 | bwd_inner: 6764.28 | bwd_allreduce: 341.70 | step: 111.69
{'loss': 0.7304, 'learning_rate': 3.1027622272189572e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14445
total_samples=22803, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:35,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.42 | bwd_microstep: 2015.08 | bwd_inner_microstep: 1920.33 | bwd_allreduce_microstep: 94.69 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13707
total_samples=22807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:38,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.89 | bwd_microstep: 1728.45 | bwd_inner_microstep: 1683.85 | bwd_allreduce_microstep: 44.53 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13661
total_samples=22811, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:41,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.80 | bwd_microstep: 2034.15 | bwd_inner_microstep: 1926.18 | bwd_allreduce_microstep: 107.91 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14241
total_samples=22815, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:43,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 06:18:43,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.48 | bwd_microstep: 1763.74 | bwd_inner_microstep: 1747.63 | bwd_allreduce_microstep: 16.04 | step_microstep: 121.03
[2025-08-03 06:18:43,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.51 | bwd: 7541.48 | bwd_inner: 7277.98 | bwd_allreduce: 263.25 | step: 121.51
{'loss': 0.7356, 'learning_rate': 3.0910458062561865e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13462
total_samples=22819, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:46,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.30 | bwd_microstep: 2009.13 | bwd_inner_microstep: 1888.99 | bwd_allreduce_microstep: 120.07 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12106
total_samples=22822, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:49,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.34 | bwd_microstep: 2179.71 | bwd_inner_microstep: 1944.18 | bwd_allreduce_microstep: 235.46 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12081
total_samples=22826, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:52,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.08 | bwd_microstep: 1824.21 | bwd_inner_microstep: 1587.23 | bwd_allreduce_microstep: 236.92 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15036
total_samples=22830, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:54,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 06:18:54,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.28 | bwd_microstep: 1746.86 | bwd_inner_microstep: 1732.10 | bwd_allreduce_microstep: 14.69 | step_microstep: 138.43
[2025-08-03 06:18:54,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.92 | bwd: 7759.99 | bwd_inner: 7152.50 | bwd_allreduce: 607.23 | step: 138.93
{'loss': 0.7243, 'learning_rate': 3.0793475032203513e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13687
total_samples=22834, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:18:57,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.99 | bwd_microstep: 1789.69 | bwd_inner_microstep: 1712.55 | bwd_allreduce_microstep: 77.08 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13262
total_samples=22838, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:00,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.51 | bwd_microstep: 2144.45 | bwd_inner_microstep: 1886.61 | bwd_allreduce_microstep: 257.77 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11915
total_samples=22841, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:02,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.12 | bwd_microstep: 1782.53 | bwd_inner_microstep: 1551.54 | bwd_allreduce_microstep: 230.92 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13296
total_samples=22845, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:05,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 06:19:05,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.79 | bwd_microstep: 1899.30 | bwd_inner_microstep: 1690.89 | bwd_allreduce_microstep: 208.35 | step_microstep: 112.65
[2025-08-03 06:19:05,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.34 | bwd: 7616.02 | bwd_inner: 6841.58 | bwd_allreduce: 774.20 | step: 113.19
75%|███████▍  | 1498/2000 [4:35:26<1:31:00, 10.88s/it]
75%|███████▍  | 1499/2000 [4:35:37<1:30:51, 10.88s/it]
75%|███████▌  | 1500/2000 [4:35:47<1:29:12, 10.70s/it]
75%|███████▌  | 1500/2000 [4:35:48<1:29:12, 10.70s/it]
75%|███████▌  | 1501/2000 [4:35:58<1:29:06, 10.71s/it]
75%|███████▌  | 1502/2000 [4:36:09<1:29:38, 10.80s/it]
75%|███████▌  | 1503/2000 [4:36:20<1:29:28, 10.80s/it]
{'loss': 0.7358, 'learning_rate': 3.0676673487888854e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13045
total_samples=22849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:08,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.16 | bwd_microstep: 1768.98 | bwd_inner_microstep: 1676.66 | bwd_allreduce_microstep: 92.26 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12141
total_samples=22852, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:10,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.09 | bwd_microstep: 1829.28 | bwd_inner_microstep: 1583.20 | bwd_allreduce_microstep: 246.00 | step_microstep: 0.31
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13791
total_samples=22857, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:13,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.74 | bwd_microstep: 1961.31 | bwd_inner_microstep: 1864.94 | bwd_allreduce_microstep: 96.31 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13025
total_samples=22861, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:16,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 06:19:16,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.79 | bwd_microstep: 1823.40 | bwd_inner_microstep: 1668.28 | bwd_allreduce_microstep: 155.05 | step_microstep: 138.90
[2025-08-03 06:19:16,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2766.71 | bwd: 7383.02 | bwd_inner: 6793.07 | bwd_allreduce: 589.70 | step: 139.43
{'loss': 0.7333, 'learning_rate': 3.0560053735916372e-06, 'epoch': 0.75}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12117
total_samples=22864, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:18,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.78 | bwd_microstep: 1813.02 | bwd_inner_microstep: 1577.15 | bwd_allreduce_microstep: 235.79 | step_microstep: 0.90
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12017
total_samples=22867, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:21,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.44 | bwd_microstep: 1807.89 | bwd_inner_microstep: 1586.29 | bwd_allreduce_microstep: 221.55 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13179
total_samples=22871, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:24,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.00 | bwd_microstep: 1810.10 | bwd_inner_microstep: 1702.93 | bwd_allreduce_microstep: 107.10 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11950
total_samples=22874, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:26,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 06:19:26,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.43 | bwd_microstep: 2004.28 | bwd_inner_microstep: 1637.79 | bwd_allreduce_microstep: 366.43 | step_microstep: 133.46
[2025-08-03 06:19:26,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.57 | bwd: 7435.35 | bwd_inner: 6504.16 | bwd_allreduce: 930.94 | step: 134.59
{'loss': 0.7364, 'learning_rate': 3.0443616082107753e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13356
total_samples=22878, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:29,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.73 | bwd_microstep: 1747.21 | bwd_inner_microstep: 1689.91 | bwd_allreduce_microstep: 57.21 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13176
total_samples=22882, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:32,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.88 | bwd_microstep: 2142.94 | bwd_inner_microstep: 1841.40 | bwd_allreduce_microstep: 301.47 | step_microstep: 0.84
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14693
total_samples=22886, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:35,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.83 | bwd_microstep: 2541.63 | bwd_inner_microstep: 2534.36 | bwd_allreduce_microstep: 7.21 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13754
total_samples=22890, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:38,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 20.91
[2025-08-03 06:19:38,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.55 | bwd_microstep: 1824.08 | bwd_inner_microstep: 1714.80 | bwd_allreduce_microstep: 109.21 | step_microstep: 137.74
[2025-08-03 06:19:38,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2868.92 | bwd: 8255.90 | bwd_inner: 7780.47 | bwd_allreduce: 475.17 | step: 138.94
{'loss': 0.7424, 'learning_rate': 3.032736083180716e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13951
total_samples=22894, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:41,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.84 | bwd_microstep: 1786.83 | bwd_inner_microstep: 1703.88 | bwd_allreduce_microstep: 82.87 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13650
total_samples=22898, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:43,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.11 | bwd_microstep: 1849.32 | bwd_inner_microstep: 1739.18 | bwd_allreduce_microstep: 110.08 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13692
total_samples=22902, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:46,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.24 | bwd_microstep: 1823.68 | bwd_inner_microstep: 1781.47 | bwd_allreduce_microstep: 42.15 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12749
total_samples=22906, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:49,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.22
[2025-08-03 06:19:49,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.90 | bwd_microstep: 1760.85 | bwd_inner_microstep: 1621.66 | bwd_allreduce_microstep: 139.12 | step_microstep: 139.21
[2025-08-03 06:19:49,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.02 | bwd: 7220.74 | bwd_inner: 6846.18 | bwd_allreduce: 374.30 | step: 139.57
{'loss': 0.7343, 'learning_rate': 3.0211288289880404e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13971
total_samples=22910, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:51,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.03 | bwd_microstep: 1793.63 | bwd_inner_microstep: 1735.69 | bwd_allreduce_microstep: 57.88 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15761
total_samples=22916, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:54,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.77 | bwd_microstep: 1787.86 | bwd_inner_microstep: 1769.95 | bwd_allreduce_microstep: 17.84 | step_microstep: 0.87
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14516
total_samples=22921, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:57,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.28 | bwd_microstep: 2154.83 | bwd_inner_microstep: 1996.60 | bwd_allreduce_microstep: 158.16 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11881
total_samples=22924, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:19:59,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 06:19:59,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.37 | bwd_microstep: 1772.64 | bwd_inner_microstep: 1548.61 | bwd_allreduce_microstep: 223.97 | step_microstep: 133.03
[2025-08-03 06:19:59,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.38 | bwd: 7509.01 | bwd_inner: 7050.84 | bwd_allreduce: 457.92 | step: 134.32
75%|███████▌  | 1503/2000 [4:36:20<1:29:28, 10.80s/it]
75%|███████▌  | 1504/2000 [4:36:31<1:28:47, 10.74s/it]
75%|███████▌  | 1505/2000 [4:36:41<1:28:28, 10.72s/it]
75%|███████▌  | 1506/2000 [4:36:53<1:30:24, 10.98s/it]
75%|███████▌  | 1507/2000 [4:37:03<1:29:04, 10.84s/it]
75%|███████▌  | 1508/2000 [4:37:14<1:28:47, 10.83s/it]
{'loss': 0.7439, 'learning_rate': 3.009539876071427e-06, 'epoch': 0.75}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12338
total_samples=22928, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:02,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.03 | bwd_microstep: 1856.84 | bwd_inner_microstep: 1606.41 | bwd_allreduce_microstep: 250.36 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13598
total_samples=22932, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:05,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.66 | bwd_microstep: 2118.42 | bwd_inner_microstep: 1955.62 | bwd_allreduce_microstep: 162.72 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13597
total_samples=22936, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:08,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.02 | bwd_microstep: 2202.83 | bwd_inner_microstep: 2042.16 | bwd_allreduce_microstep: 160.61 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14309
total_samples=22940, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:11,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.68
[2025-08-03 06:20:11,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.70 | bwd_microstep: 1794.66 | bwd_inner_microstep: 1715.18 | bwd_allreduce_microstep: 79.42 | step_microstep: 145.70
[2025-08-03 06:20:11,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.34 | bwd: 7972.82 | bwd_inner: 7319.36 | bwd_allreduce: 653.20 | step: 146.33
{'loss': 0.734, 'learning_rate': 2.997969254821548e-06, 'epoch': 0.75}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13238
total_samples=22944, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:13,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.45 | bwd_microstep: 2042.95 | bwd_inner_microstep: 1946.02 | bwd_allreduce_microstep: 96.87 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12036
total_samples=22947, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:16,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.52 | bwd_microstep: 1989.46 | bwd_inner_microstep: 1619.73 | bwd_allreduce_microstep: 369.63 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515
total_samples=22951, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:19,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.91 | bwd_microstep: 1736.40 | bwd_inner_microstep: 1677.13 | bwd_allreduce_microstep: 59.21 | step_microstep: 0.73
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13276
total_samples=22955, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:22,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 06:20:22,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.13 | bwd_microstep: 2084.31 | bwd_inner_microstep: 1909.12 | bwd_allreduce_microstep: 175.13 | step_microstep: 138.00
[2025-08-03 06:20:22,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.94 | bwd: 7853.18 | bwd_inner: 7151.99 | bwd_allreduce: 700.93 | step: 139.10
{'loss': 0.7345, 'learning_rate': 2.9864169955810085e-06, 'epoch': 0.76}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12114
total_samples=22958, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:24,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.52 | bwd_microstep: 1728.95 | bwd_inner_microstep: 1552.39 | bwd_allreduce_microstep: 176.50 | step_microstep: 0.14
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12674
total_samples=22962, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:27,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.31 | bwd_microstep: 1730.19 | bwd_inner_microstep: 1587.10 | bwd_allreduce_microstep: 143.02 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13610
total_samples=22966, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:29,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.27 | bwd_microstep: 1853.05 | bwd_inner_microstep: 1739.44 | bwd_allreduce_microstep: 113.52 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11742
total_samples=22969, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:32,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 06:20:32,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.25 | bwd_microstep: 1745.40 | bwd_inner_microstep: 1552.93 | bwd_allreduce_microstep: 192.40 | step_microstep: 117.54
[2025-08-03 06:20:32,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.29 | bwd: 7057.64 | bwd_inner: 6431.85 | bwd_allreduce: 625.53 | step: 118.08
{'loss': 0.7206, 'learning_rate': 2.974883128644266e-06, 'epoch': 0.76}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13357
total_samples=22973, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:35,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.21 | bwd_microstep: 1842.29 | bwd_inner_microstep: 1697.62 | bwd_allreduce_microstep: 144.60 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11789
total_samples=22976, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:37,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.45 | bwd_microstep: 1992.69 | bwd_inner_microstep: 1779.41 | bwd_allreduce_microstep: 213.22 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14912
total_samples=22980, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:40,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.17 | bwd_microstep: 1755.86 | bwd_inner_microstep: 1732.12 | bwd_allreduce_microstep: 23.68 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11980
total_samples=22983, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:43,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 06:20:43,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.96 | bwd_microstep: 2114.70 | bwd_inner_microstep: 1865.59 | bwd_allreduce_microstep: 249.05 | step_microstep: 110.88
[2025-08-03 06:20:43,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.72 | bwd: 7705.60 | bwd_inner: 7074.74 | bwd_allreduce: 630.61 | step: 111.39
{'loss': 0.7456, 'learning_rate': 2.9633676842575386e-06, 'epoch': 0.76}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11818
total_samples=22986, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:46,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.14 | bwd_microstep: 1983.31 | bwd_inner_microstep: 1785.89 | bwd_allreduce_microstep: 197.35 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13271
total_samples=22990, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:48,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.84 | bwd_microstep: 1967.14 | bwd_inner_microstep: 1950.58 | bwd_allreduce_microstep: 16.50 | step_microstep: 0.11
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12893
total_samples=22995, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:51,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.77 | bwd_microstep: 1766.91 | bwd_inner_microstep: 1625.13 | bwd_allreduce_microstep: 141.72 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11724
total_samples=22998, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:54,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.99
[2025-08-03 06:20:54,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.78 | bwd_microstep: 1789.36 | bwd_inner_microstep: 1574.18 | bwd_allreduce_microstep: 215.11 | step_microstep: 159.92
[2025-08-03 06:20:54,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.46 | bwd: 7506.77 | bwd_inner: 6935.79 | bwd_allreduce: 570.74 | step: 160.42
75%|███████▌  | 1509/2000 [4:37:25<1:29:36, 10.95s/it]
76%|███████▌  | 1510/2000 [4:37:37<1:29:54, 11.01s/it]
76%|███████▌  | 1511/2000 [4:37:47<1:27:50, 10.78s/it]
76%|███████▌  | 1512/2000 [4:37:58<1:28:07, 10.83s/it]
76%|███████▌  | 1513/2000 [4:38:09<1:27:48, 10.82s/it]
{'loss': 0.7428, 'learning_rate': 2.951870692618739e-06, 'epoch': 0.76}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13211
total_samples=23002, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:56,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.43 | bwd_microstep: 1783.14 | bwd_inner_microstep: 1687.16 | bwd_allreduce_microstep: 95.91 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14409
total_samples=23006, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:20:59,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.53 | bwd_microstep: 1962.10 | bwd_inner_microstep: 1763.78 | bwd_allreduce_microstep: 198.26 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12413
total_samples=23010, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:02,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.35 | bwd_microstep: 1769.75 | bwd_inner_microstep: 1593.96 | bwd_allreduce_microstep: 175.72 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14239
total_samples=23015, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:04,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 06:21:04,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.30 | bwd_microstep: 1814.64 | bwd_inner_microstep: 1763.52 | bwd_allreduce_microstep: 51.05 | step_microstep: 160.16
[2025-08-03 06:21:04,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.55 | bwd: 7329.68 | bwd_inner: 6808.41 | bwd_allreduce: 521.02 | step: 160.56
{'loss': 0.7297, 'learning_rate': 2.940392183877382e-06, 'epoch': 0.76}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13602
total_samples=23019, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:07,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.82 | bwd_microstep: 1734.23 | bwd_inner_microstep: 1677.84 | bwd_allreduce_microstep: 56.32 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13288
total_samples=23023, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:09,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.30 | bwd_microstep: 1842.46 | bwd_inner_microstep: 1733.17 | bwd_allreduce_microstep: 109.22 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14061
total_samples=23027, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:12,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.24 | bwd_microstep: 2186.28 | bwd_inner_microstep: 1721.99 | bwd_allreduce_microstep: 464.22 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12423
total_samples=23031, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:15,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 06:21:15,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.86 | bwd_microstep: 1833.12 | bwd_inner_microstep: 1720.44 | bwd_allreduce_microstep: 112.62 | step_microstep: 130.00
[2025-08-03 06:21:15,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.15 | bwd: 7596.14 | bwd_inner: 6853.44 | bwd_allreduce: 742.47 | step: 130.52
{'loss': 0.7448, 'learning_rate': 2.9289321881345257e-06, 'epoch': 0.76}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12699
total_samples=23034, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:18,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.33 | bwd_microstep: 1899.94 | bwd_inner_microstep: 1589.88 | bwd_allreduce_microstep: 309.99 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13744
total_samples=23038, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:21,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.65 | bwd_microstep: 2229.67 | bwd_inner_microstep: 1903.39 | bwd_allreduce_microstep: 326.22 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12009
total_samples=23041, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:24,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.54 | bwd_microstep: 1934.30 | bwd_inner_microstep: 1575.18 | bwd_allreduce_microstep: 359.06 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14052
total_samples=23045, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:26,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 06:21:26,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.64 | bwd_microstep: 1942.03 | bwd_inner_microstep: 1764.63 | bwd_allreduce_microstep: 177.32 | step_microstep: 144.75
[2025-08-03 06:21:27,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.10 | bwd: 8005.98 | bwd_inner: 6833.08 | bwd_allreduce: 1172.66 | step: 145.25
{'loss': 0.7406, 'learning_rate': 2.9174907354426696e-06, 'epoch': 0.76}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13667
total_samples=23049, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:29,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.04 | bwd_microstep: 2029.93 | bwd_inner_microstep: 1899.41 | bwd_allreduce_microstep: 130.45 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13675
total_samples=23053, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:32,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.21 | bwd_microstep: 2185.76 | bwd_inner_microstep: 1767.37 | bwd_allreduce_microstep: 418.32 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13080
total_samples=23057, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:35,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.38 | bwd_microstep: 1831.16 | bwd_inner_microstep: 1700.50 | bwd_allreduce_microstep: 130.59 | step_microstep: 1.06
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13065
total_samples=23061, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:38,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 06:21:38,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.01 | bwd_microstep: 1934.27 | bwd_inner_microstep: 1694.08 | bwd_allreduce_microstep: 240.12 | step_microstep: 110.50
[2025-08-03 06:21:38,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.57 | bwd: 7981.16 | bwd_inner: 7061.36 | bwd_allreduce: 919.56 | step: 111.95
{'loss': 0.7382, 'learning_rate': 2.9060678558056876e-06, 'epoch': 0.76}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13482
total_samples=23065, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:40,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.91 | bwd_microstep: 1897.64 | bwd_inner_microstep: 1763.95 | bwd_allreduce_microstep: 133.62 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13935
total_samples=23069, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:43,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.19 | bwd_microstep: 1755.09 | bwd_inner_microstep: 1708.79 | bwd_allreduce_microstep: 46.23 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14965
total_samples=23073, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:45,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.15 | bwd_microstep: 1805.23 | bwd_inner_microstep: 1780.97 | bwd_allreduce_microstep: 24.19 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13909
total_samples=23077, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:48,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 06:21:48,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.49 | bwd_microstep: 1919.36 | bwd_inner_microstep: 1850.13 | bwd_allreduce_microstep: 69.15 | step_microstep: 121.34
[2025-08-03 06:21:48,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.67 | bwd: 7377.36 | bwd_inner: 7103.83 | bwd_allreduce: 273.27 | step: 121.87
{'loss': 0.7349, 'learning_rate': 2.8946635791787546e-06, 'epoch': 0.76}
76%|███████▌  | 1514/2000 [4:38:19<1:27:10, 10.76s/it]
76%|███████▌  | 1515/2000 [4:38:30<1:27:12, 10.79s/it]
76%|███████▌  | 1516/2000 [4:38:41<1:28:14, 10.94s/it]
76%|███████▌  | 1517/2000 [4:38:53<1:28:42, 11.02s/it]
76%|███████▌  | 1518/2000 [4:39:03<1:27:28, 10.89s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12085
total_samples=23080, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:51,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.34 | bwd_microstep: 2123.68 | bwd_inner_microstep: 1663.01 | bwd_allreduce_microstep: 460.57 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13932
total_samples=23084, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:54,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.27 | bwd_microstep: 1767.68 | bwd_inner_microstep: 1704.96 | bwd_allreduce_microstep: 62.66 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12678
total_samples=23088, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:57,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.74 | bwd_microstep: 2180.00 | bwd_inner_microstep: 2023.46 | bwd_allreduce_microstep: 156.47 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13443
total_samples=23092, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:21:59,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 06:21:59,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.09 | bwd_microstep: 1751.78 | bwd_inner_microstep: 1683.32 | bwd_allreduce_microstep: 68.40 | step_microstep: 148.95
[2025-08-03 06:21:59,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.37 | bwd: 7823.20 | bwd_inner: 7074.76 | bwd_allreduce: 748.16 | step: 149.51
{'loss': 0.7395, 'learning_rate': 2.883277935468254e-06, 'epoch': 0.76}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11724
total_samples=23095, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:02,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1031.28 | bwd_microstep: 1781.79 | bwd_inner_microstep: 1558.23 | bwd_allreduce_microstep: 223.49 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16021
total_samples=23099, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:05,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.83 | bwd_microstep: 2040.66 | bwd_inner_microstep: 1948.52 | bwd_allreduce_microstep: 92.07 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13822
total_samples=23103, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:08,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.61 | bwd_microstep: 1843.39 | bwd_inner_microstep: 1729.00 | bwd_allreduce_microstep: 114.33 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13605
total_samples=23108, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:10,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 06:22:10,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.56 | bwd_microstep: 1761.79 | bwd_inner_microstep: 1664.59 | bwd_allreduce_microstep: 97.12 | step_microstep: 116.82
[2025-08-03 06:22:10,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3163.21 | bwd: 7427.69 | bwd_inner: 6900.34 | bwd_allreduce: 527.10 | step: 117.18
{'loss': 0.7271, 'learning_rate': 2.8719109545317102e-06, 'epoch': 0.76}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13238
total_samples=23112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:13,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.78 | bwd_microstep: 1832.71 | bwd_inner_microstep: 1703.30 | bwd_allreduce_microstep: 129.34 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13932
total_samples=23116, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:16,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.11 | bwd_microstep: 1775.34 | bwd_inner_microstep: 1727.24 | bwd_allreduce_microstep: 48.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14338
total_samples=23120, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:18,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.53 | bwd_microstep: 1920.05 | bwd_inner_microstep: 1776.49 | bwd_allreduce_microstep: 143.50 | step_microstep: 0.31
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11813
total_samples=23123, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:21,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.18
[2025-08-03 06:22:21,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.10 | bwd_microstep: 1721.44 | bwd_inner_microstep: 1552.57 | bwd_allreduce_microstep: 168.79 | step_microstep: 141.26
[2025-08-03 06:22:21,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.45 | bwd: 7249.59 | bwd_inner: 6759.60 | bwd_allreduce: 489.74 | step: 141.80
{'loss': 0.7278, 'learning_rate': 2.8605626661776995e-06, 'epoch': 0.76}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13176
total_samples=23127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:23,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.60 | bwd_microstep: 1812.67 | bwd_inner_microstep: 1688.16 | bwd_allreduce_microstep: 124.46 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15871
total_samples=23132, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:26,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.45 | bwd_microstep: 1775.19 | bwd_inner_microstep: 1768.47 | bwd_allreduce_microstep: 6.66 | step_microstep: 0.25
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15180
total_samples=23137, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:29,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.32 | bwd_microstep: 2097.72 | bwd_inner_microstep: 1925.90 | bwd_allreduce_microstep: 171.76 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13643
total_samples=23141, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:32,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 06:22:32,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.53 | bwd_microstep: 1810.94 | bwd_inner_microstep: 1728.54 | bwd_allreduce_microstep: 82.33 | step_microstep: 128.53
[2025-08-03 06:22:32,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.83 | bwd: 7496.57 | bwd_inner: 7111.06 | bwd_allreduce: 385.27 | step: 128.98
{'loss': 0.7378, 'learning_rate': 2.849233100165795e-06, 'epoch': 0.76}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13991
total_samples=23145, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:34,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.17 | bwd_microstep: 2039.57 | bwd_inner_microstep: 1854.13 | bwd_allreduce_microstep: 185.38 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13471
total_samples=23149, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:37,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 756.96 | bwd_microstep: 2086.22 | bwd_inner_microstep: 1911.87 | bwd_allreduce_microstep: 174.29 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11634
total_samples=23152, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:40,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.38 | bwd_microstep: 1811.11 | bwd_inner_microstep: 1523.17 | bwd_allreduce_microstep: 287.87 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13364
total_samples=23156, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:43,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.71
[2025-08-03 06:22:43,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.56 | bwd_microstep: 1706.20 | bwd_inner_microstep: 1643.87 | bwd_allreduce_microstep: 62.26 | step_microstep: 170.93
[2025-08-03 06:22:43,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2895.98 | bwd: 7643.17 | bwd_inner: 6933.03 | bwd_allreduce: 709.89 | step: 171.35
{'loss': 0.7342, 'learning_rate': 2.837922286206457e-06, 'epoch': 0.76}
███▌  | 1518/2000 [4:39:03<1:27:28, 10.89s/it]
 76%|███████▌  | 1519/2000 [4:39:14<1:27:39, 10.94s/it]
 76%|███████▌  | 1520/2000 [4:39:25<1:27:40, 10.96s/it]
 76%|███████▌  | 1521/2000 [4:39:36<1:26:27, 10.83s/it]
 76%|███████▌  | 1522/2000 [4:39:47<1:26:08, 10.81s/it]
 76%|███████▌  | 1523/2000 [4:39:57<1:26:20, 10.86s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12603
total_samples=23159, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:46,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.81 | bwd_microstep: 2081.54 | bwd_inner_microstep: 1855.66 | bwd_allreduce_microstep: 225.81 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13346
total_samples=23163, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:48,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.06 | bwd_microstep: 2075.73 | bwd_inner_microstep: 1750.84 | bwd_allreduce_microstep: 324.80 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13533
total_samples=23167, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:51,697] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.06 | bwd_microstep: 2044.08 | bwd_inner_microstep: 1907.18 | bwd_allreduce_microstep: 136.84 | step_microstep: 0.20
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13509
total_samples=23171, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:54,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 06:22:54,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.69 | bwd_microstep: 1806.54 | bwd_inner_microstep: 1658.50 | bwd_allreduce_microstep: 147.98 | step_microstep: 136.94
[2025-08-03 06:22:54,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2824.54 | bwd: 8007.96 | bwd_inner: 7172.18 | bwd_allreduce: 835.52 | step: 137.70
{'loss': 0.731, 'learning_rate': 2.8266302539609747e-06, 'epoch': 0.76}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13439
total_samples=23175, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:56,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.95 | bwd_microstep: 1738.45 | bwd_inner_microstep: 1651.08 | bwd_allreduce_microstep: 87.31 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13279
total_samples=23179, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:22:59,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.22 | bwd_microstep: 1906.80 | bwd_inner_microstep: 1857.25 | bwd_allreduce_microstep: 49.50 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13489
total_samples=23183, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:02,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.31 | bwd_microstep: 2031.94 | bwd_inner_microstep: 1889.13 | bwd_allreduce_microstep: 142.75 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14989
total_samples=23188, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:05,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 06:23:05,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.00 | bwd_microstep: 2310.62 | bwd_inner_microstep: 1867.67 | bwd_allreduce_microstep: 442.88 | step_microstep: 109.11
[2025-08-03 06:23:05,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.42 | bwd: 7987.87 | bwd_inner: 7265.12 | bwd_allreduce: 722.51 | step: 109.75
{'loss': 0.7332, 'learning_rate': 2.8153570330413925e-06, 'epoch': 0.76}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11660
total_samples=23191, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:08,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 912.77 | bwd_microstep: 2320.72 | bwd_inner_microstep: 1925.10 | bwd_allreduce_microstep: 395.56 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14706
total_samples=23195, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:11,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.55 | bwd_microstep: 1843.96 | bwd_inner_microstep: 1758.85 | bwd_allreduce_microstep: 85.05 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13443
total_samples=23199, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:14,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.44 | bwd_microstep: 1995.13 | bwd_inner_microstep: 1987.35 | bwd_allreduce_microstep: 7.71 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13927
total_samples=23203, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:17,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 06:23:17,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.04 | bwd_microstep: 1889.50 | bwd_inner_microstep: 1711.98 | bwd_allreduce_microstep: 177.45 | step_microstep: 134.00
[2025-08-03 06:23:17,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2974.73 | bwd: 8049.36 | bwd_inner: 7383.28 | bwd_allreduce: 665.85 | step: 134.63
{'loss': 0.7315, 'learning_rate': 2.8041026530104144e-06, 'epoch': 0.76}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12208
total_samples=23206, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:19,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.40 | bwd_microstep: 1720.42 | bwd_inner_microstep: 1552.41 | bwd_allreduce_microstep: 167.93 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13383
total_samples=23210, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:22,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.58 | bwd_microstep: 1718.46 | bwd_inner_microstep: 1672.28 | bwd_allreduce_microstep: 46.10 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11835
total_samples=23213, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:24,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.15 | bwd_microstep: 1740.78 | bwd_inner_microstep: 1544.06 | bwd_allreduce_microstep: 196.66 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12908
total_samples=23217, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:27,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 06:23:27,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.50 | bwd_microstep: 1823.94 | bwd_inner_microstep: 1770.11 | bwd_allreduce_microstep: 53.77 | step_microstep: 115.03
[2025-08-03 06:23:27,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.52 | bwd: 7003.65 | bwd_inner: 6538.86 | bwd_allreduce: 464.54 | step: 115.55
{'loss': 0.7198, 'learning_rate': 2.7928671433813392e-06, 'epoch': 0.76}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13200
total_samples=23221, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:30,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1037.69 | bwd_microstep: 1833.58 | bwd_inner_microstep: 1713.89 | bwd_allreduce_microstep: 119.63 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14222
total_samples=23226, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:32,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.24 | bwd_microstep: 1930.78 | bwd_inner_microstep: 1776.18 | bwd_allreduce_microstep: 154.54 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13865
total_samples=23230, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:35,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.71 | bwd_microstep: 1745.08 | bwd_inner_microstep: 1706.24 | bwd_allreduce_microstep: 38.78 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12024
total_samples=23233, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:38,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 06:23:38,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.63 | bwd_microstep: 1988.40 | bwd_inner_microstep: 1573.80 | bwd_allreduce_microstep: 414.54 | step_microstep: 110.84
[2025-08-03 06:23:38,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3174.20 | bwd: 7497.89 | bwd_inner: 6770.10 | bwd_allreduce: 727.57 | step: 111.46
{'loss': 0.7393, 'learning_rate': 2.78165053361798e-06, 'epoch': 0.76}
 76%|███████▌  | 1524/2000 [4:40:09<1:27:11, 10.99s/it]
 76%|███████▋  | 1525/2000 [4:40:20<1:27:26, 11.05s/it]
 76%|███████▋  | 1526/2000 [4:40:31<1:28:14, 11.17s/it]
 76%|███████▋  | 1527/2000 [4:40:42<1:25:50, 10.89s/it]
 76%|███████▋  | 1528/2000 [4:40:53<1:26:07, 10.95s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12123
total_samples=23236, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:40,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.28 | bwd_microstep: 1728.30 | bwd_inner_microstep: 1553.76 | bwd_allreduce_microstep: 174.47 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13173
total_samples=23240, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:43,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.14 | bwd_microstep: 1864.97 | bwd_inner_microstep: 1819.51 | bwd_allreduce_microstep: 45.40 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13321
total_samples=23244, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:46,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.44 | bwd_microstep: 1745.44 | bwd_inner_microstep: 1713.97 | bwd_allreduce_microstep: 31.40 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11621
total_samples=23247, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:48,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 06:23:48,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.90 | bwd_microstep: 1881.57 | bwd_inner_microstep: 1728.20 | bwd_allreduce_microstep: 153.30 | step_microstep: 133.59
[2025-08-03 06:23:48,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.69 | bwd: 7220.33 | bwd_inner: 6815.43 | bwd_allreduce: 404.65 | step: 134.12
{'loss': 0.7291, 'learning_rate': 2.770452853134593e-06, 'epoch': 0.76}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13483
total_samples=23251, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:51,544] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.98 | bwd_microstep: 1949.74 | bwd_inner_microstep: 1687.59 | bwd_allreduce_microstep: 262.09 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13738
total_samples=23255, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:54,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.71 | bwd_microstep: 2363.77 | bwd_inner_microstep: 1872.23 | bwd_allreduce_microstep: 491.46 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13353
total_samples=23259, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:23:57,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.01 | bwd_microstep: 2154.78 | bwd_inner_microstep: 1915.22 | bwd_allreduce_microstep: 239.51 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12159
total_samples=23262, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:00,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 06:24:00,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.88 | bwd_microstep: 1803.35 | bwd_inner_microstep: 1576.87 | bwd_allreduce_microstep: 226.41 | step_microstep: 146.17
[2025-08-03 06:24:00,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.50 | bwd: 8271.69 | bwd_inner: 7051.90 | bwd_allreduce: 1219.55 | step: 146.58
{'loss': 0.7351, 'learning_rate': 2.759274131295787e-06, 'epoch': 0.77}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11915
total_samples=23265, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:03,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.66 | bwd_microstep: 1922.40 | bwd_inner_microstep: 1744.85 | bwd_allreduce_microstep: 177.48 | step_microstep: 0.26
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12860
total_samples=23269, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:05,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.08 | bwd_microstep: 1791.37 | bwd_inner_microstep: 1658.52 | bwd_allreduce_microstep: 132.77 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14264
total_samples=23274, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:08,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.20 | bwd_microstep: 1840.57 | bwd_inner_microstep: 1756.95 | bwd_allreduce_microstep: 83.56 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14116
total_samples=23278, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:10,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 06:24:10,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.93 | bwd_microstep: 1883.86 | bwd_inner_microstep: 1760.27 | bwd_allreduce_microstep: 123.51 | step_microstep: 111.37
[2025-08-03 06:24:10,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.80 | bwd: 7438.26 | bwd_inner: 6920.58 | bwd_allreduce: 517.42 | step: 112.00
{'loss': 0.7346, 'learning_rate': 2.7481143974164548e-06, 'epoch': 0.77}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13127
total_samples=23282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:13,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.27 | bwd_microstep: 1744.23 | bwd_inner_microstep: 1679.42 | bwd_allreduce_microstep: 64.75 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13492
total_samples=23286, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:16,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.64 | bwd_microstep: 2022.46 | bwd_inner_microstep: 1879.18 | bwd_allreduce_microstep: 143.22 | step_microstep: 0.21
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14040
total_samples=23290, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:18,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.06 | bwd_microstep: 1929.65 | bwd_inner_microstep: 1833.09 | bwd_allreduce_microstep: 96.49 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11564
total_samples=23293, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:21,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 06:24:21,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.63 | bwd_microstep: 2072.14 | bwd_inner_microstep: 1685.65 | bwd_allreduce_microstep: 386.41 | step_microstep: 135.53
[2025-08-03 06:24:21,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.53 | bwd: 7768.53 | bwd_inner: 7077.33 | bwd_allreduce: 690.95 | step: 136.12
{'loss': 0.7411, 'learning_rate': 2.736973680761702e-06, 'epoch': 0.77}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12416
total_samples=23296, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:24,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.26 | bwd_microstep: 1957.44 | bwd_inner_microstep: 1674.24 | bwd_allreduce_microstep: 283.14 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13507
total_samples=23300, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:27,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.83 | bwd_microstep: 1866.92 | bwd_inner_microstep: 1690.21 | bwd_allreduce_microstep: 176.65 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12067
total_samples=23303, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:30,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.05 | bwd_microstep: 2409.39 | bwd_inner_microstep: 2393.03 | bwd_allreduce_microstep: 16.30 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11620
total_samples=23306, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:33,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54
[2025-08-03 06:24:33,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.50 | bwd_microstep: 1796.31 | bwd_inner_microstep: 1573.96 | bwd_allreduce_microstep: 222.28 | step_microstep: 120.54
[2025-08-03 06:24:33,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2855.57 | bwd: 8030.11 | bwd_inner: 7331.43 | bwd_allreduce: 698.44 | step: 121.10
{'loss': 0.7332, 'learning_rate': 2.7258520105467566e-06, 'epoch': 0.77}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13931
total_samples=23310, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:35,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.38 | bwd_microstep: 1748.33 | bwd_inner_microstep: 1702.65 | bwd_allreduce_microstep: 45.61 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13414
total_samples=23314, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:38,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.25 | bwd_microstep: 1805.86 | bwd_inner_microstep: 1712.73 | bwd_allreduce_microstep: 93.05 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14371
total_samples=23318, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:40,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.34 | bwd_microstep: 1820.53 | bwd_inner_microstep: 1782.52 | bwd_allreduce_microstep: 37.94 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13434
total_samples=23322, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:43,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 06:24:43,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.07 | bwd_microstep: 1735.87 | bwd_inner_microstep: 1687.72 | bwd_allreduce_microstep: 48.09 | step_microstep: 112.38
[2025-08-03 06:24:43,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.98 | bwd: 7110.65 | bwd_inner: 6885.62 | bwd_allreduce: 224.79 | step: 112.95
 76%|███████▋  | 1529/2000 [4:41:03<1:24:43, 10.79s/it]
 76%|███████▋  | 1530/2000 [4:41:15<1:26:12, 11.01s/it]
 77%|███████▋  | 1531/2000 [4:41:25<1:25:13, 10.90s/it]
 77%|███████▋  | 1532/2000 [4:41:36<1:25:11, 10.92s/it]
 77%|███████▋  | 1533/2000 [4:41:48<1:25:53, 11.04s/it]
{'loss': 0.7397, 'learning_rate': 2.714749415936904e-06, 'epoch': 0.77}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13893
total_samples=23326, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:46,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.28 | bwd_microstep: 1847.50 | bwd_inner_microstep: 1707.12 | bwd_allreduce_microstep: 140.31 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14429
total_samples=23331, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:48,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.54 | bwd_microstep: 1793.01 | bwd_inner_microstep: 1755.08 | bwd_allreduce_microstep: 37.87 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12064
total_samples=23334, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:51,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.96 | bwd_microstep: 1916.53 | bwd_inner_microstep: 1572.78 | bwd_allreduce_microstep: 343.69 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13140
total_samples=23338, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:54,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 06:24:54,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.75 | bwd_microstep: 1810.74 | bwd_inner_microstep: 1702.45 | bwd_allreduce_microstep: 108.21 | step_microstep: 111.15
[2025-08-03 06:24:54,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.47 | bwd: 7367.83 | bwd_inner: 6737.41 | bwd_allreduce: 630.16 | step: 111.49
{'loss': 0.7367, 'learning_rate': 2.7036659260473973e-06, 'epoch': 0.77}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12277
total_samples=23341, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:56,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.54 | bwd_microstep: 1899.41 | bwd_inner_microstep: 1572.82 | bwd_allreduce_microstep: 326.53 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13430
total_samples=23345, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:24:59,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.97 | bwd_microstep: 1966.72 | bwd_inner_microstep: 1841.82 | bwd_allreduce_microstep: 124.83 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11979
total_samples=23348, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:02,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.07 | bwd_microstep: 1820.25 | bwd_inner_microstep: 1557.75 | bwd_allreduce_microstep: 262.43 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13237
total_samples=23352, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:04,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 06:25:04,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.40 | bwd_microstep: 1781.88 | bwd_inner_microstep: 1691.81 | bwd_allreduce_microstep: 90.00 | step_microstep: 156.11
[2025-08-03 06:25:04,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.89 | bwd: 7468.30 | bwd_inner: 6664.19 | bwd_allreduce: 803.86 | step: 156.56
{'loss': 0.7398, 'learning_rate': 2.692601569943407e-06, 'epoch': 0.77}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16095
total_samples=23356, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:07,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.67 | bwd_microstep: 2077.72 | bwd_inner_microstep: 1950.20 | bwd_allreduce_microstep: 127.46 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13869
total_samples=23360, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:10,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.72 | bwd_microstep: 1789.84 | bwd_inner_microstep: 1715.62 | bwd_allreduce_microstep: 74.15 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11571
total_samples=23363, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:13,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.42 | bwd_microstep: 1912.85 | bwd_inner_microstep: 1540.07 | bwd_allreduce_microstep: 372.70 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15021
total_samples=23367, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:15,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 06:25:15,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.71 | bwd_microstep: 1960.31 | bwd_inner_microstep: 1890.66 | bwd_allreduce_microstep: 69.57 | step_microstep: 130.39
[2025-08-03 06:25:15,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.44 | bwd: 7740.77 | bwd_inner: 7096.54 | bwd_allreduce: 643.97 | step: 130.90
{'loss': 0.7438, 'learning_rate': 2.6815563766399122e-06, 'epoch': 0.77}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11674
total_samples=23370, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:18,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.97 | bwd_microstep: 2092.66 | bwd_inner_microstep: 1847.02 | bwd_allreduce_microstep: 245.58 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13105
total_samples=23374, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:21,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.48 | bwd_microstep: 1810.88 | bwd_inner_microstep: 1687.47 | bwd_allreduce_microstep: 123.34 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12637
total_samples=23379, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:24,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.86 | bwd_microstep: 1862.13 | bwd_inner_microstep: 1558.99 | bwd_allreduce_microstep: 303.07 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13613
total_samples=23384, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:26,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 06:25:26,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.39 | bwd_microstep: 2000.51 | bwd_inner_microstep: 1698.53 | bwd_allreduce_microstep: 301.91 | step_microstep: 110.10
[2025-08-03 06:25:26,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.62 | bwd: 7766.23 | bwd_inner: 6792.00 | bwd_allreduce: 973.98 | step: 110.71
{'loss': 0.7331, 'learning_rate': 2.670530375101641e-06, 'epoch': 0.77}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12557
total_samples=23388, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:29,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.34 | bwd_microstep: 1792.33 | bwd_inner_microstep: 1601.97 | bwd_allreduce_microstep: 190.28 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14076
total_samples=23394, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:32,261] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.83 | bwd_microstep: 2024.46 | bwd_inner_microstep: 1740.95 | bwd_allreduce_microstep: 283.44 | step_microstep: 0.95
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12482
total_samples=23399, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:35,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.08 | bwd_microstep: 2312.01 | bwd_inner_microstep: 2092.82 | bwd_allreduce_microstep: 219.13 | step_microstep: 0.30
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14142
total_samples=23404, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:37,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 06:25:38,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.03 | bwd_microstep: 1736.95 | bwd_inner_microstep: 1680.66 | bwd_allreduce_microstep: 56.20 | step_microstep: 161.62
[2025-08-03 06:25:38,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2763.21 | bwd: 7865.81 | bwd_inner: 7116.41 | bwd_allreduce: 749.14 | step: 163.02
77%|███████▋  | 1534/2000 [4:41:58<1:24:01, 10.82s/it]
77%|███████▋  | 1535/2000 [4:42:09<1:23:23, 10.76s/it]
77%|███████▋  | 1536/2000 [4:42:19<1:23:13, 10.76s/it]
77%|███████▋  | 1537/2000 [4:42:30<1:23:34, 10.83s/it]
77%|███████▋  | 1538/2000 [4:42:41<1:23:43, 10.87s/it]
{'loss': 0.7348, 'learning_rate': 2.6595235942430044e-06, 'epoch': 0.77}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13872
total_samples=23408, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:40,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.56 | bwd_microstep: 2008.34 | bwd_inner_microstep: 1850.05 | bwd_allreduce_microstep: 158.23 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13639
total_samples=23412, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:43,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.96 | bwd_microstep: 2085.44 | bwd_inner_microstep: 2079.38 | bwd_allreduce_microstep: 6.00 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11613
total_samples=23415, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:46,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.52 | bwd_microstep: 1780.72 | bwd_inner_microstep: 1549.06 | bwd_allreduce_microstep: 231.60 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14072
total_samples=23419, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:48,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 06:25:48,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.43 | bwd_microstep: 1754.90 | bwd_inner_microstep: 1708.21 | bwd_allreduce_microstep: 46.63 | step_microstep: 132.92
[2025-08-03 06:25:48,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.38 | bwd: 7629.45 | bwd_inner: 7186.69 | bwd_allreduce: 442.54 | step: 133.38
{'loss': 0.7479, 'learning_rate': 2.648536062927999e-06, 'epoch': 0.77}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14221
total_samples=23423, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:51,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.49 | bwd_microstep: 1931.50 | bwd_inner_microstep: 1900.54 | bwd_allreduce_microstep: 30.86 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12711
total_samples=23426, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:54,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.57 | bwd_microstep: 1813.68 | bwd_inner_microstep: 1638.36 | bwd_allreduce_microstep: 175.26 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11810
total_samples=23429, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:56,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.13 | bwd_microstep: 1684.61 | bwd_inner_microstep: 1529.67 | bwd_allreduce_microstep: 154.87 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 15545
total_samples=23433, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:25:59,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 06:25:59,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.12 | bwd_microstep: 1828.48 | bwd_inner_microstep: 1748.05 | bwd_allreduce_microstep: 80.36 | step_microstep: 134.68
[2025-08-03 06:25:59,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.24 | bwd: 7258.34 | bwd_inner: 6816.62 | bwd_allreduce: 441.45 | step: 135.23
{'loss': 0.7377, 'learning_rate': 2.637567809970143e-06, 'epoch': 0.77}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13121
total_samples=23437, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:01,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.54 | bwd_microstep: 1774.23 | bwd_inner_microstep: 1686.66 | bwd_allreduce_microstep: 87.51 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15023
total_samples=23441, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:04,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.16 | bwd_microstep: 1734.13 | bwd_inner_microstep: 1728.01 | bwd_allreduce_microstep: 6.05 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13640
total_samples=23445, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:07,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.64 | bwd_microstep: 1817.02 | bwd_inner_microstep: 1728.54 | bwd_allreduce_microstep: 88.41 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12042
total_samples=23448, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:09,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.85
[2025-08-03 06:26:09,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 747.52 | bwd_microstep: 1842.50 | bwd_inner_microstep: 1585.46 | bwd_allreduce_microstep: 256.97 | step_microstep: 156.24
[2025-08-03 06:26:09,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.79 | bwd: 7167.92 | bwd_inner: 6728.67 | bwd_allreduce: 439.02 | step: 156.73
{'loss': 0.7381, 'learning_rate': 2.6266188641324e-06, 'epoch': 0.77}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14216
total_samples=23452, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:12,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.56 | bwd_microstep: 2118.61 | bwd_inner_microstep: 1899.55 | bwd_allreduce_microstep: 219.00 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14064
total_samples=23456, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:15,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.83 | bwd_microstep: 1741.28 | bwd_inner_microstep: 1699.99 | bwd_allreduce_microstep: 41.22 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13431
total_samples=23460, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:17,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.40 | bwd_microstep: 1857.21 | bwd_inner_microstep: 1807.79 | bwd_allreduce_microstep: 49.35 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13290
total_samples=23464, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:20,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.76
[2025-08-03 06:26:20,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.65 | bwd_microstep: 1720.38 | bwd_inner_microstep: 1667.73 | bwd_allreduce_microstep: 52.59 | step_microstep: 149.21
[2025-08-03 06:26:20,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.37 | bwd: 7437.53 | bwd_inner: 7075.05 | bwd_allreduce: 362.24 | step: 149.65
{'loss': 0.7273, 'learning_rate': 2.6156892541271083e-06, 'epoch': 0.77}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13225
total_samples=23468, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:23,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.66 | bwd_microstep: 1854.75 | bwd_inner_microstep: 1698.95 | bwd_allreduce_microstep: 155.73 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14584
total_samples=23472, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:26,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.63 | bwd_microstep: 2082.45 | bwd_inner_microstep: 1772.23 | bwd_allreduce_microstep: 310.14 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15467
total_samples=23476, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:28,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 835.85 | bwd_microstep: 1867.94 | bwd_inner_microstep: 1816.51 | bwd_allreduce_microstep: 51.37 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13522
total_samples=23480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:31,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54
[2025-08-03 06:26:31,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.10 | bwd_microstep: 1757.20 | bwd_inner_microstep: 1687.09 | bwd_allreduce_microstep: 70.02 | step_microstep: 132.13
[2025-08-03 06:26:31,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2955.17 | bwd: 7562.38 | bwd_inner: 6974.78 | bwd_allreduce: 587.35 | step: 132.74
77%|███████▋  | 1539/2000 [4:42:52<1:24:03, 10.94s/it]
77%|███████▋  | 1540/2000 [4:43:03<1:23:45, 10.92s/it]
77%|███████▋  | 1541/2000 [4:43:14<1:22:34, 10.79s/it]
77%|███████▋  | 1542/2000 [4:43:24<1:21:37, 10.69s/it]
77%|███████▋  | 1543/2000 [4:43:35<1:21:24, 10.69s/it]
{'loss': 0.7404, 'learning_rate': 2.604779008615895e-06, 'epoch': 0.77}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13342
total_samples=23484, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:34,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.90 | bwd_microstep: 1866.93 | bwd_inner_microstep: 1718.76 | bwd_allreduce_microstep: 148.11 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13922
total_samples=23488, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:36,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.07 | bwd_microstep: 1758.55 | bwd_inner_microstep: 1719.26 | bwd_allreduce_microstep: 39.23 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13250
total_samples=23492, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:39,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.74 | bwd_microstep: 1764.47 | bwd_inner_microstep: 1706.06 | bwd_allreduce_microstep: 58.33 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13314
total_samples=23496, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:42,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 06:26:42,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.17 | bwd_microstep: 2038.02 | bwd_inner_microstep: 1869.86 | bwd_allreduce_microstep: 168.09 | step_microstep: 133.39
[2025-08-03 06:26:42,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.81 | bwd: 7428.02 | bwd_inner: 7013.94 | bwd_allreduce: 413.85 | step: 133.91
{'loss': 0.7281, 'learning_rate': 2.593888156209603e-06, 'epoch': 0.77}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14144
total_samples=23500, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:45,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.59 | bwd_microstep: 2168.12 | bwd_inner_microstep: 1883.05 | bwd_allreduce_microstep: 285.01 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13233
total_samples=23504, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:47,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.12 | bwd_microstep: 1723.52 | bwd_inner_microstep: 1669.41 | bwd_allreduce_microstep: 54.04 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14129
total_samples=23508, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:50,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.38 | bwd_microstep: 1782.09 | bwd_inner_microstep: 1735.88 | bwd_allreduce_microstep: 46.15 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13941
total_samples=23512, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:53,086] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 06:26:53,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.33 | bwd_microstep: 2013.00 | bwd_inner_microstep: 1924.32 | bwd_allreduce_microstep: 88.62 | step_microstep: 123.04
[2025-08-03 06:26:53,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.34 | bwd: 7686.78 | bwd_inner: 7212.65 | bwd_allreduce: 473.90 | step: 123.62
{'loss': 0.7402, 'learning_rate': 2.583016725468226e-06, 'epoch': 0.77}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13372
total_samples=23516, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:55,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.30 | bwd_microstep: 1936.04 | bwd_inner_microstep: 1844.53 | bwd_allreduce_microstep: 91.45 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15394
total_samples=23521, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:26:58,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.00 | bwd_microstep: 2200.39 | bwd_inner_microstep: 2194.31 | bwd_allreduce_microstep: 6.02 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14268
total_samples=23525, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:01,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.40 | bwd_microstep: 1737.66 | bwd_inner_microstep: 1702.84 | bwd_allreduce_microstep: 34.74 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13381
total_samples=23529, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:04,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.25
[2025-08-03 06:27:04,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.98 | bwd_microstep: 1803.13 | bwd_inner_microstep: 1700.74 | bwd_allreduce_microstep: 102.32 | step_microstep: 150.38
[2025-08-03 06:27:04,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.61 | bwd: 7677.28 | bwd_inner: 7442.41 | bwd_allreduce: 234.61 | step: 150.86
{'loss': 0.7401, 'learning_rate': 2.572164744900827e-06, 'epoch': 0.77}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12497
total_samples=23533, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:06,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.53 | bwd_microstep: 1853.31 | bwd_inner_microstep: 1624.68 | bwd_allreduce_microstep: 228.57 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12679
total_samples=23537, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:09,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.33 | bwd_microstep: 2003.86 | bwd_inner_microstep: 1861.85 | bwd_allreduce_microstep: 141.94 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13675
total_samples=23541, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:12,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.37 | bwd_microstep: 1769.32 | bwd_inner_microstep: 1713.23 | bwd_allreduce_microstep: 56.02 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13286
total_samples=23545, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:14,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.59
[2025-08-03 06:27:14,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.83 | bwd_microstep: 1784.76 | bwd_inner_microstep: 1698.70 | bwd_allreduce_microstep: 86.00 | step_microstep: 150.05
[2025-08-03 06:27:14,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.99 | bwd: 7411.30 | bwd_inner: 6898.46 | bwd_allreduce: 512.60 | step: 150.55
{'loss': 0.7342, 'learning_rate': 2.5613322429654573e-06, 'epoch': 0.77}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12081
total_samples=23548, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:17,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.63 | bwd_microstep: 1938.26 | bwd_inner_microstep: 1589.17 | bwd_allreduce_microstep: 349.02 | step_microstep: 0.15
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13183
total_samples=23552, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:20,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.10 | bwd_microstep: 1789.46 | bwd_inner_microstep: 1688.25 | bwd_allreduce_microstep: 101.14 | step_microstep: 0.14
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13955
total_samples=23557, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:22,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.17 | bwd_microstep: 1783.05 | bwd_inner_microstep: 1689.68 | bwd_allreduce_microstep: 93.30 | step_microstep: 0.30
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13347
total_samples=23561, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:25,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.59
[2025-08-03 06:27:25,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.13 | bwd_microstep: 1863.78 | bwd_inner_microstep: 1724.32 | bwd_allreduce_microstep: 139.39 | step_microstep: 156.93
[2025-08-03 06:27:25,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.94 | bwd: 7374.60 | bwd_inner: 6691.41 | bwd_allreduce: 682.94 | step: 157.54
77%|███████▋  | 1544/2000 [4:43:46<1:21:50, 10.77s/it]
77%|███████▋  | 1545/2000 [4:43:57<1:21:28, 10.74s/it]
77%|███████▋  | 1546/2000 [4:44:07<1:21:43, 10.80s/it]
77%|███████▋  | 1547/2000 [4:44:18<1:21:52, 10.84s/it]
77%|███████▋  | 1548/2000 [4:44:29<1:21:22, 10.80s/it]
{'loss': 0.7326, 'learning_rate': 2.5505192480690865e-06, 'epoch': 0.77}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13455
total_samples=23566, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:28,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.04 | bwd_microstep: 2016.82 | bwd_inner_microstep: 1873.18 | bwd_allreduce_microstep: 143.57 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15323
total_samples=23570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:30,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.67 | bwd_microstep: 1838.99 | bwd_inner_microstep: 1807.96 | bwd_allreduce_microstep: 30.97 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14518
total_samples=23574, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:33,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.22 | bwd_microstep: 2002.75 | bwd_inner_microstep: 1906.79 | bwd_allreduce_microstep: 95.90 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12586
total_samples=23578, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:36,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 06:27:36,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.90 | bwd_microstep: 2071.71 | bwd_inner_microstep: 1867.85 | bwd_allreduce_microstep: 203.79 | step_microstep: 117.42
[2025-08-03 06:27:36,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.75 | bwd: 7930.32 | bwd_inner: 7455.78 | bwd_allreduce: 474.30 | step: 117.95
{'loss': 0.7396, 'learning_rate': 2.5397257885675396e-06, 'epoch': 0.78}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13568
total_samples=23582, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:39,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.53 | bwd_microstep: 2200.96 | bwd_inner_microstep: 1962.22 | bwd_allreduce_microstep: 238.67 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13334
total_samples=23586, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:42,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.92 | bwd_microstep: 1839.14 | bwd_inner_microstep: 1703.33 | bwd_allreduce_microstep: 135.75 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13765
total_samples=23590, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:44,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.64 | bwd_microstep: 1797.40 | bwd_inner_microstep: 1700.30 | bwd_allreduce_microstep: 97.02 | step_microstep: 0.34
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12454
total_samples=23593, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:47,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 06:27:47,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.58 | bwd_microstep: 1901.64 | bwd_inner_microstep: 1621.65 | bwd_allreduce_microstep: 279.94 | step_microstep: 135.97
[2025-08-03 06:27:47,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.60 | bwd: 7739.20 | bwd_inner: 6987.49 | bwd_allreduce: 751.46 | step: 136.68
{'loss': 0.7464, 'learning_rate': 2.528951892765402e-06, 'epoch': 0.78}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13759
total_samples=23597, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:50,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.32 | bwd_microstep: 1829.82 | bwd_inner_microstep: 1732.29 | bwd_allreduce_microstep: 97.46 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14918
total_samples=23602, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:52,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.30 | bwd_microstep: 1875.91 | bwd_inner_microstep: 1856.74 | bwd_allreduce_microstep: 19.10 | step_microstep: 0.28
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12250
total_samples=23606, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:55,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.97 | bwd_microstep: 1808.60 | bwd_inner_microstep: 1590.40 | bwd_allreduce_microstep: 218.11 | step_microstep: 0.53
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11901
total_samples=23609, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:27:58,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 06:27:58,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.83 | bwd_microstep: 2105.87 | bwd_inner_microstep: 1865.86 | bwd_allreduce_microstep: 239.95 | step_microstep: 111.73
[2025-08-03 06:27:58,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.35 | bwd: 7620.28 | bwd_inner: 7045.29 | bwd_allreduce: 574.72 | step: 112.82
{'loss': 0.751, 'learning_rate': 2.5181975889159615e-06, 'epoch': 0.78}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13497
total_samples=23613, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:01,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.27 | bwd_microstep: 2161.59 | bwd_inner_microstep: 2035.76 | bwd_allreduce_microstep: 125.73 | step_microstep: 0.27
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12674
total_samples=23617, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:04,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.63 | bwd_microstep: 1982.93 | bwd_inner_microstep: 1704.74 | bwd_allreduce_microstep: 278.11 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13470
total_samples=23621, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:06,834] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.48 | bwd_microstep: 1860.71 | bwd_inner_microstep: 1736.78 | bwd_allreduce_microstep: 123.87 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14472
total_samples=23625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:09,531] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.81
[2025-08-03 06:28:09,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.77 | bwd_microstep: 1793.83 | bwd_inner_microstep: 1740.50 | bwd_allreduce_microstep: 53.26 | step_microstep: 111.23
[2025-08-03 06:28:09,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2882.08 | bwd: 7799.14 | bwd_inner: 7217.78 | bwd_allreduce: 581.07 | step: 112.03
{'loss': 0.7386, 'learning_rate': 2.507462905221122e-06, 'epoch': 0.78}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12949
total_samples=23629, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:12,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.20 | bwd_microstep: 2531.03 | bwd_inner_microstep: 2522.02 | bwd_allreduce_microstep: 8.94 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11713
total_samples=23632, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:15,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.33 | bwd_microstep: 1759.53 | bwd_inner_microstep: 1552.23 | bwd_allreduce_microstep: 207.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14049
total_samples=23637, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:17,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.97 | bwd_microstep: 1757.17 | bwd_inner_microstep: 1697.54 | bwd_allreduce_microstep: 59.56 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12896
total_samples=23641, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:20,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.66
[2025-08-03 06:28:20,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.76 | bwd_microstep: 1757.40 | bwd_inner_microstep: 1636.16 | bwd_allreduce_microstep: 121.17 | step_microstep: 138.03
[2025-08-03 06:28:20,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.18 | bwd: 7805.18 | bwd_inner: 7407.96 | bwd_allreduce: 396.99 | step: 138.53
77%|███████▋  | 1549/2000 [4:44:40<1:20:51, 10.76s/it]
78%|███████▊  | 1550/2000 [4:44:51<1:21:38, 10.88s/it]
78%|███████▊  | 1551/2000 [4:45:02<1:21:39, 10.91s/it]
78%|███████▊  | 1552/2000 [4:45:13<1:21:23, 10.90s/it]
78%|███████▊  | 1553/2000 [4:45:24<1:21:40, 10.96s/it]
78%|███████▊  | 1554/2000 [4:45:35<1:21:38, 10.98s/it]
{'loss': 0.7254, 'learning_rate': 2.496747869831345e-06, 'epoch': 0.78}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12908
total_samples=23645, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:23,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.69 | bwd_microstep: 1814.69 | bwd_inner_microstep: 1676.81 | bwd_allreduce_microstep: 137.81 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14025
total_samples=23649, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:25,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.29 | bwd_microstep: 1831.76 | bwd_inner_microstep: 1756.89 | bwd_allreduce_microstep: 74.80 | step_microstep: 0.14
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12368
total_samples=23653, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:28,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.66 | bwd_microstep: 1808.74 | bwd_inner_microstep: 1599.03 | bwd_allreduce_microstep: 209.64 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13433
total_samples=23657, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:31,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 06:28:31,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.50 | bwd_microstep: 1982.98 | bwd_inner_microstep: 1869.47 | bwd_allreduce_microstep: 113.45 | step_microstep: 113.02
[2025-08-03 06:28:31,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.06 | bwd: 7438.22 | bwd_inner: 6902.19 | bwd_allreduce: 535.78 | step: 113.54
{'loss': 0.7402, 'learning_rate': 2.48605251084556e-06, 'epoch': 0.78}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14213
total_samples=23661, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:34,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.00 | bwd_microstep: 2075.94 | bwd_inner_microstep: 1925.56 | bwd_allreduce_microstep: 150.30 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13132
total_samples=23665, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:36,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.03 | bwd_microstep: 1855.34 | bwd_inner_microstep: 1709.46 | bwd_allreduce_microstep: 145.81 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13183
total_samples=23669, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:39,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.29 | bwd_microstep: 1831.78 | bwd_inner_microstep: 1795.71 | bwd_allreduce_microstep: 36.01 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15417
total_samples=23674, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:42,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.92
[2025-08-03 06:28:42,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.49 | bwd_microstep: 2190.17 | bwd_inner_microstep: 1953.13 | bwd_allreduce_microstep: 236.94 | step_microstep: 122.79
[2025-08-03 06:28:42,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.74 | bwd: 7953.27 | bwd_inner: 7383.85 | bwd_allreduce: 569.15 | step: 123.32
{'loss': 0.7394, 'learning_rate': 2.475376856311097e-06, 'epoch': 0.78}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13422
total_samples=23678, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:45,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.82 | bwd_microstep: 1816.71 | bwd_inner_microstep: 1649.87 | bwd_allreduce_microstep: 166.77 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13593
total_samples=23682, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:47,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.27 | bwd_microstep: 1871.80 | bwd_inner_microstep: 1739.88 | bwd_allreduce_microstep: 131.84 | step_microstep: 0.20
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13521
total_samples=23686, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:50,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.39 | bwd_microstep: 1750.71 | bwd_inner_microstep: 1664.81 | bwd_allreduce_microstep: 85.83 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14581
total_samples=23691, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:53,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 06:28:53,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.77 | bwd_microstep: 2187.37 | bwd_inner_microstep: 1910.10 | bwd_allreduce_microstep: 277.20 | step_microstep: 111.54
[2025-08-03 06:28:53,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.18 | bwd: 7626.65 | bwd_inner: 6964.65 | bwd_allreduce: 661.73 | step: 111.98
{'loss': 0.7375, 'learning_rate': 2.464720934223619e-06, 'epoch': 0.78}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11708
total_samples=23694, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:55,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.57 | bwd_microstep: 1908.75 | bwd_inner_microstep: 1734.66 | bwd_allreduce_microstep: 174.03 | step_microstep: 0.31
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12865
total_samples=23698, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:28:58,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.22 | bwd_microstep: 1757.15 | bwd_inner_microstep: 1653.63 | bwd_allreduce_microstep: 103.45 | step_microstep: 0.28
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15331
total_samples=23703, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:01,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.75 | bwd_microstep: 1863.55 | bwd_inner_microstep: 1773.07 | bwd_allreduce_microstep: 90.42 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11778
total_samples=23706, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:03,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.55
[2025-08-03 06:29:03,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.99 | bwd_microstep: 1778.19 | bwd_inner_microstep: 1560.81 | bwd_allreduce_microstep: 217.31 | step_microstep: 131.00
[2025-08-03 06:29:03,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.47 | bwd: 7307.70 | bwd_inner: 6722.16 | bwd_allreduce: 585.29 | step: 131.71
{'loss': 0.7299, 'learning_rate': 2.4540847725270376e-06, 'epoch': 0.78}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407
total_samples=23710, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:06,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.72 | bwd_microstep: 1839.60 | bwd_inner_microstep: 1674.31 | bwd_allreduce_microstep: 165.23 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13719
total_samples=23714, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:08,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.28 | bwd_microstep: 1716.47 | bwd_inner_microstep: 1653.68 | bwd_allreduce_microstep: 62.71 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809
total_samples=23717, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:11,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.60 | bwd_microstep: 2093.10 | bwd_inner_microstep: 1860.61 | bwd_allreduce_microstep: 232.40 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13583
total_samples=23721, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:14,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 06:29:14,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.80 | bwd_microstep: 1778.02 | bwd_inner_microstep: 1688.45 | bwd_allreduce_microstep: 89.48 | step_microstep: 115.01
[2025-08-03 06:29:14,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.32 | bwd: 7427.25 | bwd_inner: 6877.05 | bwd_allreduce: 549.92 | step: 115.65
78%|███████▊  | 1555/2000 [4:45:46<1:20:45, 10.89s/it]
78%|███████▊  | 1556/2000 [4:45:57<1:21:19, 10.99s/it]
78%|███████▊  | 1557/2000 [4:46:08<1:20:48, 10.95s/it]
78%|███████▊  | 1558/2000 [4:46:18<1:19:42, 10.82s/it]
78%|███████▊  | 1559/2000 [4:46:29<1:19:04, 10.76s/it]
{'loss': 0.7193, 'learning_rate': 2.4434683991134476e-06, 'epoch': 0.78}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11875
total_samples=23724, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:17,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.79 | bwd_microstep: 2138.36 | bwd_inner_microstep: 1737.83 | bwd_allreduce_microstep: 400.46 | step_microstep: 0.86
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13377
total_samples=23729, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:20,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.79 | bwd_microstep: 1872.18 | bwd_inner_microstep: 1762.46 | bwd_allreduce_microstep: 109.65 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11946
total_samples=23732, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:22,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.44 | bwd_microstep: 1904.91 | bwd_inner_microstep: 1890.56 | bwd_allreduce_microstep: 14.27 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13702
total_samples=23736, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:25,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.90
[2025-08-03 06:29:25,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 872.96 | bwd_microstep: 1846.86 | bwd_inner_microstep: 1761.04 | bwd_allreduce_microstep: 85.76 | step_microstep: 148.45
[2025-08-03 06:29:25,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2971.92 | bwd: 7762.36 | bwd_inner: 7151.87 | bwd_allreduce: 610.22 | step: 149.58
{'loss': 0.738, 'learning_rate': 2.432871841823047e-06, 'epoch': 0.78}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13458
total_samples=23740, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:28,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.59 | bwd_microstep: 1752.46 | bwd_inner_microstep: 1668.22 | bwd_allreduce_microstep: 84.18 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13164
total_samples=23744, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:30,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.09 | bwd_microstep: 1818.03 | bwd_inner_microstep: 1693.63 | bwd_allreduce_microstep: 124.32 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13566
total_samples=23748, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:33,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.55 | bwd_microstep: 1969.70 | bwd_inner_microstep: 1886.15 | bwd_allreduce_microstep: 83.48 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13249
total_samples=23752, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:36,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.95
[2025-08-03 06:29:36,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.50 | bwd_microstep: 1907.45 | bwd_inner_microstep: 1717.97 | bwd_allreduce_microstep: 189.42 | step_microstep: 150.73
[2025-08-03 06:29:36,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.65 | bwd: 7447.70 | bwd_inner: 6965.96 | bwd_allreduce: 481.49 | step: 151.48
{'loss': 0.734, 'learning_rate': 2.4222951284440776e-06, 'epoch': 0.78}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11706
total_samples=23755, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:38,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.29 | bwd_microstep: 1803.95 | bwd_inner_microstep: 1561.28 | bwd_allreduce_microstep: 242.60 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12645
total_samples=23759, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:41,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.39 | bwd_microstep: 1840.12 | bwd_inner_microstep: 1646.91 | bwd_allreduce_microstep: 193.14 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14145
total_samples=23763, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:44,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.30 | bwd_microstep: 1762.72 | bwd_inner_microstep: 1719.75 | bwd_allreduce_microstep: 42.90 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13145
total_samples=23767, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:46,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 06:29:46,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.68 | bwd_microstep: 1741.04 | bwd_inner_microstep: 1686.03 | bwd_allreduce_microstep: 54.93 | step_microstep: 122.06
[2025-08-03 06:29:46,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.59 | bwd: 7147.89 | bwd_inner: 6613.97 | bwd_allreduce: 533.67 | step: 122.41
{'loss': 0.7473, 'learning_rate': 2.411738286712735e-06, 'epoch': 0.78}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13531
total_samples=23771, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:49,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.34 | bwd_microstep: 1775.34 | bwd_inner_microstep: 1697.70 | bwd_allreduce_microstep: 77.57 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12612
total_samples=23775, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:51,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.58 | bwd_microstep: 1881.85 | bwd_inner_microstep: 1641.29 | bwd_allreduce_microstep: 240.49 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13550
total_samples=23779, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:54,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.04 | bwd_microstep: 1862.36 | bwd_inner_microstep: 1690.25 | bwd_allreduce_microstep: 172.04 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13221
total_samples=23784, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:29:57,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 06:29:57,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.06 | bwd_microstep: 2031.37 | bwd_inner_microstep: 1900.39 | bwd_allreduce_microstep: 130.91 | step_microstep: 447.27
[2025-08-03 06:29:57,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.96 | bwd: 7550.96 | bwd_inner: 6929.63 | bwd_allreduce: 621.08 | step: 447.76
{'loss': 0.7356, 'learning_rate': 2.401201344313102e-06, 'epoch': 0.78}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11879
total_samples=23787, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:00,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.63 | bwd_microstep: 1808.66 | bwd_inner_microstep: 1591.04 | bwd_allreduce_microstep: 217.55 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12480
total_samples=23791, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:03,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.30 | bwd_microstep: 1806.89 | bwd_inner_microstep: 1582.34 | bwd_allreduce_microstep: 224.48 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13873
total_samples=23795, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:06,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.76 | bwd_microstep: 2245.22 | bwd_inner_microstep: 2037.68 | bwd_allreduce_microstep: 207.49 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13361
total_samples=23799, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:08,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 06:30:08,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.74 | bwd_microstep: 1751.10 | bwd_inner_microstep: 1686.26 | bwd_allreduce_microstep: 64.77 | step_microstep: 136.53
[2025-08-03 06:30:08,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.36 | bwd: 7611.93 | bwd_inner: 6897.31 | bwd_allreduce: 714.37 | step: 137.02
78%|███████▊  | 1560/2000 [4:46:40<1:19:48, 10.88s/it]
78%|███████▊  | 1561/2000 [4:46:51<1:19:16, 10.84s/it]
78%|███████▊  | 1562/2000 [4:47:01<1:18:06, 10.70s/it]
78%|███████▊  | 1563/2000 [4:47:12<1:18:48, 10.82s/it]
78%|███████▊  | 1564/2000 [4:47:23<1:18:45, 10.84s/it]
{'loss': 0.7358, 'learning_rate': 2.390684328877089e-06, 'epoch': 0.78}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11676
total_samples=23802, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:11,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.19 | bwd_microstep: 2057.30 | bwd_inner_microstep: 1882.32 | bwd_allreduce_microstep: 174.92 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12177
total_samples=23806, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:14,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.44 | bwd_microstep: 1791.25 | bwd_inner_microstep: 1570.13 | bwd_allreduce_microstep: 221.05 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13479
total_samples=23810, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:16,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.44 | bwd_microstep: 1968.10 | bwd_inner_microstep: 1727.07 | bwd_allreduce_microstep: 240.97 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12674
total_samples=23813, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:19,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 06:30:19,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.91 | bwd_microstep: 1868.34 | bwd_inner_microstep: 1602.78 | bwd_allreduce_microstep: 265.50 | step_microstep: 117.88
[2025-08-03 06:30:19,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.91 | bwd: 7685.04 | bwd_inner: 6782.27 | bwd_allreduce: 902.52 | step: 118.32
{'loss': 0.7372, 'learning_rate': 2.3801872679843384e-06, 'epoch': 0.78}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13281
total_samples=23817, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:22,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.73 | bwd_microstep: 1971.36 | bwd_inner_microstep: 1911.54 | bwd_allreduce_microstep: 59.76 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11715
total_samples=23820, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:25,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.20 | bwd_microstep: 1807.32 | bwd_inner_microstep: 1578.85 | bwd_allreduce_microstep: 228.40 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15529
total_samples=23824, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:27,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.88 | bwd_microstep: 1837.73 | bwd_inner_microstep: 1718.52 | bwd_allreduce_microstep: 119.14 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15257
total_samples=23828, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:30,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 06:30:30,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.85 | bwd_microstep: 1985.74 | bwd_inner_microstep: 1906.84 | bwd_allreduce_microstep: 78.83 | step_microstep: 134.41
[2025-08-03 06:30:30,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2872.58 | bwd: 7602.20 | bwd_inner: 7115.74 | bwd_allreduce: 486.21 | step: 135.03
{'loss': 0.7316, 'learning_rate': 2.36971018916217e-06, 'epoch': 0.78}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14438
total_samples=23832, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:33,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.28 | bwd_microstep: 2095.11 | bwd_inner_microstep: 2017.87 | bwd_allreduce_microstep: 77.18 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11764
total_samples=23835, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:36,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.33 | bwd_microstep: 2366.23 | bwd_inner_microstep: 2125.82 | bwd_allreduce_microstep: 240.32 | step_microstep: 0.19
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12747
total_samples=23839, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:39,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 869.12 | bwd_microstep: 1827.83 | bwd_inner_microstep: 1651.79 | bwd_allreduce_microstep: 175.94 | step_microstep: 0.37
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14025
total_samples=23843, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:42,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 06:30:42,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.55 | bwd_microstep: 1847.85 | bwd_inner_microstep: 1741.93 | bwd_allreduce_microstep: 105.85 | step_microstep: 119.85
[2025-08-03 06:30:42,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2985.21 | bwd: 8137.07 | bwd_inner: 7537.40 | bwd_allreduce: 599.40 | step: 120.51
{'loss': 0.741, 'learning_rate': 2.3592531198854974e-06, 'epoch': 0.78}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13490
total_samples=23847, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:44,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.19 | bwd_microstep: 1757.10 | bwd_inner_microstep: 1682.19 | bwd_allreduce_microstep: 74.85 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11589
total_samples=23850, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:47,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.58 | bwd_microstep: 1999.86 | bwd_inner_microstep: 1545.02 | bwd_allreduce_microstep: 454.76 | step_microstep: 0.76
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13337
total_samples=23854, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:50,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.51 | bwd_microstep: 1927.08 | bwd_inner_microstep: 1717.28 | bwd_allreduce_microstep: 209.72 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13013
total_samples=23858, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:52,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.82
[2025-08-03 06:30:52,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.82 | bwd_microstep: 1782.46 | bwd_inner_microstep: 1662.60 | bwd_allreduce_microstep: 119.78 | step_microstep: 112.52
[2025-08-03 06:30:52,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.03 | bwd: 7466.55 | bwd_inner: 6607.08 | bwd_allreduce: 859.21 | step: 113.55
{'loss': 0.744, 'learning_rate': 2.3488160875767717e-06, 'epoch': 0.78}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13726
total_samples=23863, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:55,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.61 | bwd_microstep: 1778.96 | bwd_inner_microstep: 1698.33 | bwd_allreduce_microstep: 80.56 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14251
total_samples=23867, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:30:57,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.20 | bwd_microstep: 1745.77 | bwd_inner_microstep: 1711.25 | bwd_allreduce_microstep: 34.46 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13868
total_samples=23871, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:00,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.46 | bwd_microstep: 1834.62 | bwd_inner_microstep: 1753.86 | bwd_allreduce_microstep: 80.69 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11737
total_samples=23874, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:03,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 06:31:03,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.08 | bwd_microstep: 1822.68 | bwd_inner_microstep: 1572.86 | bwd_allreduce_microstep: 249.77 | step_microstep: 128.27
[2025-08-03 06:31:03,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2845.28 | bwd: 7182.10 | bwd_inner: 6736.27 | bwd_allreduce: 445.56 | step: 128.95
{'loss': 0.7333, 'learning_rate': 2.3383991196058918e-06, 'epoch': 0.78}
78%|███████▊  | 1565/2000 [4:47:34<1:18:40, 10.85s/it]
78%|███████▊  | 1566/2000 [4:47:45<1:18:36, 10.87s/it]
78%|███████▊  | 1567/2000 [4:47:56<1:19:55, 11.08s/it]
78%|███████▊  | 1568/2000 [4:48:07<1:18:55, 10.96s/it]
78%|███████▊  | 1569/2000 [4:48:18<1:17:38, 10.81s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13310
total_samples=23878, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:06,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.30 | bwd_microstep: 2064.28 | bwd_inner_microstep: 2009.46 | bwd_allreduce_microstep: 54.76 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13117
total_samples=23882, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:08,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.96 | bwd_microstep: 1865.34 | bwd_inner_microstep: 1659.15 | bwd_allreduce_microstep: 206.11 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13336
total_samples=23886, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:11,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.38 | bwd_microstep: 1775.26 | bwd_inner_microstep: 1671.59 | bwd_allreduce_microstep: 103.60 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13763
total_samples=23890, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:14,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 06:31:14,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.67 | bwd_microstep: 2046.05 | bwd_inner_microstep: 1833.91 | bwd_allreduce_microstep: 212.07 | step_microstep: 110.78
[2025-08-03 06:31:14,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.25 | bwd: 7750.98 | bwd_inner: 7174.11 | bwd_allreduce: 576.62 | step: 111.12
{'loss': 0.7173, 'learning_rate': 2.328002243290138e-06, 'epoch': 0.79}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11689
total_samples=23893, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:16,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.27 | bwd_microstep: 1825.77 | bwd_inner_microstep: 1691.17 | bwd_allreduce_microstep: 134.53 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11865
total_samples=23896, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:19,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.46 | bwd_microstep: 1995.59 | bwd_inner_microstep: 1773.32 | bwd_allreduce_microstep: 222.20 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13037
total_samples=23900, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:22,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.08 | bwd_microstep: 2234.74 | bwd_inner_microstep: 2061.08 | bwd_allreduce_microstep: 173.58 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13541
total_samples=23904, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:25,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 06:31:25,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.00 | bwd_microstep: 1952.04 | bwd_inner_microstep: 1721.66 | bwd_allreduce_microstep: 230.31 | step_microstep: 139.15
[2025-08-03 06:31:25,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.74 | bwd: 8008.19 | bwd_inner: 7247.23 | bwd_allreduce: 760.70 | step: 139.67
{'loss': 0.7399, 'learning_rate': 2.317625485894113e-06, 'epoch': 0.79}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11649
total_samples=23907, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:28,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.60 | bwd_microstep: 1802.51 | bwd_inner_microstep: 1548.05 | bwd_allreduce_microstep: 254.39 | step_microstep: 0.70
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14361
total_samples=23911, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:30,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.07 | bwd_microstep: 1789.24 | bwd_inner_microstep: 1740.94 | bwd_allreduce_microstep: 48.22 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13551
total_samples=23915, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:33,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.03 | bwd_microstep: 1783.91 | bwd_inner_microstep: 1702.95 | bwd_allreduce_microstep: 80.88 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13803
total_samples=23919, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:35,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.32
[2025-08-03 06:31:35,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.19 | bwd_microstep: 1782.08 | bwd_inner_microstep: 1707.78 | bwd_allreduce_microstep: 74.22 | step_microstep: 137.22
[2025-08-03 06:31:35,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.82 | bwd: 7157.80 | bwd_inner: 6699.72 | bwd_allreduce: 457.82 | step: 138.37
{'loss': 0.7325, 'learning_rate': 2.307268874629649e-06, 'epoch': 0.79}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13524
total_samples=23923, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:39,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.47 | bwd_microstep: 2509.58 | bwd_inner_microstep: 2030.28 | bwd_allreduce_microstep: 479.20 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11815
total_samples=23926, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:41,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.70 | bwd_microstep: 1887.91 | bwd_inner_microstep: 1690.64 | bwd_allreduce_microstep: 197.18 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13879
total_samples=23931, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:44,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.09 | bwd_microstep: 1843.89 | bwd_inner_microstep: 1732.93 | bwd_allreduce_microstep: 110.90 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13243
total_samples=23935, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:47,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 06:31:47,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.93 | bwd_microstep: 2063.39 | bwd_inner_microstep: 2057.37 | bwd_allreduce_microstep: 5.96 | step_microstep: 109.49
[2025-08-03 06:31:47,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2837.12 | bwd: 8304.80 | bwd_inner: 7511.23 | bwd_allreduce: 793.30 | step: 109.83
{'loss': 0.7332, 'learning_rate': 2.296932436655752e-06, 'epoch': 0.79}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13578
total_samples=23939, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:49,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.97 | bwd_microstep: 1734.56 | bwd_inner_microstep: 1689.14 | bwd_allreduce_microstep: 45.34 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13492
total_samples=23943, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:52,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.75 | bwd_microstep: 1793.68 | bwd_inner_microstep: 1717.91 | bwd_allreduce_microstep: 75.71 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15624
total_samples=23948, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:55,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.21 | bwd_microstep: 1923.74 | bwd_inner_microstep: 1789.49 | bwd_allreduce_microstep: 134.18 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13308
total_samples=23953, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:31:57,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.64
[2025-08-03 06:31:57,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.68 | bwd_microstep: 1742.29 | bwd_inner_microstep: 1646.48 | bwd_allreduce_microstep: 95.73 | step_microstep: 152.18
[2025-08-03 06:31:57,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.55 | bwd: 7194.33 | bwd_inner: 6843.01 | bwd_allreduce: 351.06 | step: 152.68
{'loss': 0.7383, 'learning_rate': 2.2866161990785228e-06, 'epoch': 0.79}
 78%|███████▊  | 1570/2000 [4:48:29<1:17:49, 10.86s/it]
 79%|███████▊  | 1571/2000 [4:48:40<1:18:32, 10.98s/it]
 79%|███████▊  | 1572/2000 [4:48:50<1:17:10, 10.82s/it]
 79%|███████▊  | 1573/2000 [4:49:02<1:18:34, 11.04s/it]
 79%|███████▊  | 1574/2000 [4:49:12<1:17:03, 10.85s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13559
total_samples=23958, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:00,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.92 | bwd_microstep: 1795.59 | bwd_inner_microstep: 1705.60 | bwd_allreduce_microstep: 89.91 | step_microstep: 0.29
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13780
total_samples=23962, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:03,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.10 | bwd_microstep: 2156.06 | bwd_inner_microstep: 1830.75 | bwd_allreduce_microstep: 325.24 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16058
total_samples=23966, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:06,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.34 | bwd_microstep: 1969.90 | bwd_inner_microstep: 1836.32 | bwd_allreduce_microstep: 133.52 | step_microstep: 0.81
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407
total_samples=23970, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:08,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 06:32:08,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.15 | bwd_microstep: 1908.46 | bwd_inner_microstep: 1841.73 | bwd_allreduce_microstep: 66.66 | step_microstep: 111.33
[2025-08-03 06:32:08,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.42 | bwd: 7830.06 | bwd_inner: 7214.41 | bwd_allreduce: 615.40 | step: 112.54
{'loss': 0.7375, 'learning_rate': 2.2763201889510987e-06, 'epoch': 0.79}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13505
total_samples=23974, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:11,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.91 | bwd_microstep: 1875.33 | bwd_inner_microstep: 1722.73 | bwd_allreduce_microstep: 152.45 | step_microstep: 16.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15688
total_samples=23979, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:14,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.65 | bwd_microstep: 1813.26 | bwd_inner_microstep: 1783.26 | bwd_allreduce_microstep: 29.94 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11932
total_samples=23982, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:16,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.71 | bwd_microstep: 1744.84 | bwd_inner_microstep: 1559.10 | bwd_allreduce_microstep: 185.67 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14830
total_samples=23987, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:19,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.53
[2025-08-03 06:32:19,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.71 | bwd_microstep: 1887.12 | bwd_inner_microstep: 1809.94 | bwd_allreduce_microstep: 77.11 | step_microstep: 112.33
[2025-08-03 06:32:19,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.90 | bwd: 7320.64 | bwd_inner: 6875.02 | bwd_allreduce: 445.30 | step: 128.79
{'loss': 0.7374, 'learning_rate': 2.266044433273562e-06, 'epoch': 0.79}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13798
total_samples=23991, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:22,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.44 | bwd_microstep: 1826.98 | bwd_inner_microstep: 1696.49 | bwd_allreduce_microstep: 130.39 | step_microstep: 0.26
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13173
total_samples=23995, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:25,195] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.99 | bwd_microstep: 2284.90 | bwd_inner_microstep: 2098.03 | bwd_allreduce_microstep: 186.80 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13377
total_samples=23999, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:27,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.24 | bwd_microstep: 1907.03 | bwd_inner_microstep: 1842.59 | bwd_allreduce_microstep: 64.35 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11754
total_samples=24002, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:30,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 06:32:30,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.99 | bwd_microstep: 2197.90 | bwd_inner_microstep: 2001.38 | bwd_allreduce_microstep: 196.45 | step_microstep: 138.25
[2025-08-03 06:32:30,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2755.57 | bwd: 8216.87 | bwd_inner: 7638.49 | bwd_allreduce: 578.11 | step: 138.97
{'loss': 0.7372, 'learning_rate': 2.2557889589928815e-06, 'epoch': 0.79}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12997
total_samples=24006, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:33,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.12 | bwd_microstep: 1860.90 | bwd_inner_microstep: 1784.51 | bwd_allreduce_microstep: 76.31 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13180
total_samples=24010, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:36,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.40 | bwd_microstep: 1827.70 | bwd_inner_microstep: 1683.96 | bwd_allreduce_microstep: 143.67 | step_microstep: 0.72
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13020
total_samples=24015, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:38,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.87 | bwd_microstep: 1899.20 | bwd_inner_microstep: 1785.35 | bwd_allreduce_microstep: 113.79 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11834
total_samples=24018, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:41,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 06:32:41,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.36 | bwd_microstep: 1700.92 | bwd_inner_microstep: 1533.59 | bwd_allreduce_microstep: 167.25 | step_microstep: 123.91
[2025-08-03 06:32:41,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.67 | bwd: 7288.77 | bwd_inner: 6787.41 | bwd_allreduce: 501.11 | step: 124.99
{'loss': 0.7264, 'learning_rate': 2.245553793002849e-06, 'epoch': 0.79}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13263
total_samples=24022, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:44,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.21 | bwd_microstep: 1754.59 | bwd_inner_microstep: 1674.60 | bwd_allreduce_microstep: 79.91 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11972
total_samples=24025, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:46,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.70 | bwd_microstep: 1749.28 | bwd_inner_microstep: 1554.98 | bwd_allreduce_microstep: 194.24 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13418
total_samples=24029, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:49,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.98 | bwd_microstep: 2011.24 | bwd_inner_microstep: 1862.12 | bwd_allreduce_microstep: 149.04 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11605
total_samples=24032, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:52,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.24
[2025-08-03 06:32:52,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.99 | bwd_microstep: 1898.58 | bwd_inner_microstep: 1602.05 | bwd_allreduce_microstep: 296.46 | step_microstep: 138.40
[2025-08-03 06:32:52,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.80 | bwd: 7413.74 | bwd_inner: 6693.74 | bwd_allreduce: 719.74 | step: 138.94
{'loss': 0.736, 'learning_rate': 2.23533896214399e-06, 'epoch': 0.79}
 79%|███████▉  | 1575/2000 [4:49:23<1:17:22, 10.92s/it]
 79%|███████▉  | 1576/2000 [4:49:34<1:16:30, 10.83s/it]
 79%|███████▉  | 1577/2000 [4:49:45<1:17:34, 11.00s/it]
 79%|███████▉  | 1578/2000 [4:49:56<1:16:19, 10.85s/it]
 79%|███████▉  | 1579/2000 [4:50:07<1:15:52, 10.81s/it]
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13265
total_samples=24036, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:54,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.24 | bwd_microstep: 1796.39 | bwd_inner_microstep: 1679.88 | bwd_allreduce_microstep: 116.43 | step_microstep: 0.36
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=24039, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:32:57,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.90 | bwd_microstep: 1917.82 | bwd_inner_microstep: 1906.79 | bwd_allreduce_microstep: 10.97 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11566
total_samples=24042, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:00,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.40 | bwd_microstep: 2072.36 | bwd_inner_microstep: 1836.90 | bwd_allreduce_microstep: 235.40 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11590
total_samples=24045, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:03,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77
[2025-08-03 06:33:03,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 848.81 | bwd_microstep: 1782.23 | bwd_inner_microstep: 1558.98 | bwd_allreduce_microstep: 223.17 | step_microstep: 146.03
[2025-08-03 06:33:03,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2968.28 | bwd: 7568.86 | bwd_inner: 6982.56 | bwd_allreduce: 586.05 | step: 146.61
{'loss': 0.7245, 'learning_rate': 2.2251444932035094e-06, 'epoch': 0.79}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12023
total_samples=24048, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:05,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.89 | bwd_microstep: 1732.93 | bwd_inner_microstep: 1541.14 | bwd_allreduce_microstep: 191.70 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11605
total_samples=24051, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:08,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.16 | bwd_microstep: 2070.82 | bwd_inner_microstep: 1828.26 | bwd_allreduce_microstep: 242.50 | step_microstep: 0.23
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12573
total_samples=24056, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:10,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 665.04 | bwd_microstep: 1699.34 | bwd_inner_microstep: 1561.47 | bwd_allreduce_microstep: 137.81 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11850
total_samples=24059, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:14,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.58
[2025-08-03 06:33:14,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.83 | bwd_microstep: 2149.27 | bwd_inner_microstep: 1800.23 | bwd_allreduce_microstep: 348.97 | step_microstep: 108.60
[2025-08-03 06:33:14,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.84 | bwd: 7652.40 | bwd_inner: 6731.10 | bwd_allreduce: 921.06 | step: 109.23
{'loss': 0.7329, 'learning_rate': 2.2149704129152083e-06, 'epoch': 0.79}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13403
total_samples=24063, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:16,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.97 | bwd_microstep: 1841.25 | bwd_inner_microstep: 1712.17 | bwd_allreduce_microstep: 129.01 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11603
total_samples=24066, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:19,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.83 | bwd_microstep: 1965.99 | bwd_inner_microstep: 1548.16 | bwd_allreduce_microstep: 417.76 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11790
total_samples=24069, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:22,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.61 | bwd_microstep: 2040.20 | bwd_inner_microstep: 1803.83 | bwd_allreduce_microstep: 236.28 | step_microstep: 1.63
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13741
total_samples=24074, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:24,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 06:33:24,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.51 | bwd_microstep: 1776.92 | bwd_inner_microstep: 1692.86 | bwd_allreduce_microstep: 83.99 | step_microstep: 139.36
[2025-08-03 06:33:24,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.84 | bwd: 7624.40 | bwd_inner: 6757.02 | bwd_allreduce: 867.12 | step: 141.41
{'loss': 0.7336, 'learning_rate': 2.204816747959434e-06, 'epoch': 0.79}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13700
total_samples=24078, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:27,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.37 | bwd_microstep: 1773.31 | bwd_inner_microstep: 1690.94 | bwd_allreduce_microstep: 82.28 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12502
total_samples=24081, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:30,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.97 | bwd_microstep: 1852.72 | bwd_inner_microstep: 1628.41 | bwd_allreduce_microstep: 224.25 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15008
total_samples=24086, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:32,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.13 | bwd_microstep: 2029.97 | bwd_inner_microstep: 1906.33 | bwd_allreduce_microstep: 123.58 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11969
total_samples=24089, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:35,892] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 06:33:35,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.69 | bwd_microstep: 2074.02 | bwd_inner_microstep: 1702.72 | bwd_allreduce_microstep: 371.20 | step_microstep: 123.56
[2025-08-03 06:33:35,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.10 | bwd: 7730.09 | bwd_inner: 6928.39 | bwd_allreduce: 801.42 | step: 124.11
{'loss': 0.7385, 'learning_rate': 2.194683524962986e-06, 'epoch': 0.79}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11909
total_samples=24092, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:38,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.88 | bwd_microstep: 1787.56 | bwd_inner_microstep: 1552.08 | bwd_allreduce_microstep: 235.41 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12047
total_samples=24095, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:41,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.84 | bwd_microstep: 1879.23 | bwd_inner_microstep: 1592.06 | bwd_allreduce_microstep: 287.10 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13130
total_samples=24100, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:43,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.68 | bwd_microstep: 1868.73 | bwd_inner_microstep: 1790.22 | bwd_allreduce_microstep: 78.44 | step_microstep: 0.15
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13243
total_samples=24105, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:46,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 06:33:46,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.17 | bwd_microstep: 2152.19 | bwd_inner_microstep: 1795.63 | bwd_allreduce_microstep: 356.50 | step_microstep: 109.93
[2025-08-03 06:33:46,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.50 | bwd: 7687.77 | bwd_inner: 6729.99 | bwd_allreduce: 957.53 | step: 110.46
{'loss': 0.7413, 'learning_rate': 2.184570770499056e-06, 'epoch': 0.79}
 79%|███████▉  | 1580/2000 [4:50:18<1:16:01, 10.86s/it]
 79%|███████▉  | 1581/2000 [4:50:28<1:15:52, 10.87s/it]
 79%|███████▉  | 1582/2000 [4:50:39<1:15:42, 10.87s/it]
 79%|███████▉  | 1583/2000 [4:50:50<1:15:46, 10.90s/it]
 79%|███████▉  | 1584/2000 [4:51:01<1:15:37, 10.91s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11639
total_samples=24108, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:49,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.69 | bwd_microstep: 2081.21 | bwd_inner_microstep: 1856.50 | bwd_allreduce_microstep: 224.64 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13647
total_samples=24112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:52,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.59 | bwd_microstep: 1741.61 | bwd_inner_microstep: 1695.03 | bwd_allreduce_microstep: 46.49 | step_microstep: 0.84
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12470
total_samples=24116, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:54,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.86 | bwd_microstep: 1831.69 | bwd_inner_microstep: 1616.97 | bwd_allreduce_microstep: 214.63 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13616
total_samples=24120, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:33:57,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.81
[2025-08-03 06:33:57,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 751.29 | bwd_microstep: 1821.57 | bwd_inner_microstep: 1713.53 | bwd_allreduce_microstep: 107.96 | step_microstep: 143.84
[2025-08-03 06:33:57,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.37 | bwd: 7476.13 | bwd_inner: 6882.02 | bwd_allreduce: 593.81 | step: 145.16
{'loss': 0.7308, 'learning_rate': 2.1744785110871713e-06, 'epoch': 0.79}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11700
total_samples=24123, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:00,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.20 | bwd_microstep: 1868.83 | bwd_inner_microstep: 1531.96 | bwd_allreduce_microstep: 336.80 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13744
total_samples=24127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:02,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.87 | bwd_microstep: 1810.82 | bwd_inner_microstep: 1697.46 | bwd_allreduce_microstep: 113.29 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11907
total_samples=24130, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:05,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.79 | bwd_microstep: 1797.99 | bwd_inner_microstep: 1574.43 | bwd_allreduce_microstep: 223.49 | step_microstep: 0.26
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12940
total_samples=24134, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:08,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 06:34:08,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.95 | bwd_microstep: 1785.02 | bwd_inner_microstep: 1778.92 | bwd_allreduce_microstep: 6.03 | step_microstep: 151.46
[2025-08-03 06:34:08,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.75 | bwd: 7262.71 | bwd_inner: 6582.77 | bwd_allreduce: 679.69 | step: 151.96
{'loss': 0.7309, 'learning_rate': 2.1644067731931005e-06, 'epoch': 0.79}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11956
total_samples=24137, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:11,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.46 | bwd_microstep: 2151.39 | bwd_inner_microstep: 1791.81 | bwd_allreduce_microstep: 359.51 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11680
total_samples=24140, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:13,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.75 | bwd_microstep: 2175.63 | bwd_inner_microstep: 1913.38 | bwd_allreduce_microstep: 262.18 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13898
total_samples=24144, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:16,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.19 | bwd_microstep: 1803.85 | bwd_inner_microstep: 1726.14 | bwd_allreduce_microstep: 77.64 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13553
total_samples=24148, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:19,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.96
[2025-08-03 06:34:19,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.21 | bwd_microstep: 1936.86 | bwd_inner_microstep: 1718.48 | bwd_allreduce_microstep: 218.32 | step_microstep: 134.02
[2025-08-03 06:34:19,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.54 | bwd: 8067.78 | bwd_inner: 7149.81 | bwd_allreduce: 917.73 | step: 134.42
{'loss': 0.746, 'learning_rate': 2.1543555832288056e-06, 'epoch': 0.79}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13355
total_samples=24152, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:21,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.67 | bwd_microstep: 1765.26 | bwd_inner_microstep: 1680.80 | bwd_allreduce_microstep: 84.39 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13398
total_samples=24156, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:24,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.63 | bwd_microstep: 1906.88 | bwd_inner_microstep: 1847.16 | bwd_allreduce_microstep: 59.65 | step_microstep: 0.37
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13234
total_samples=24160, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:27,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.76 | bwd_microstep: 1828.43 | bwd_inner_microstep: 1720.76 | bwd_allreduce_microstep: 107.58 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13219
total_samples=24164, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:30,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 06:34:30,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.96 | bwd_microstep: 2175.91 | bwd_inner_microstep: 2016.42 | bwd_allreduce_microstep: 159.42 | step_microstep: 114.62
[2025-08-03 06:34:30,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.96 | bwd: 7676.54 | bwd_inner: 7265.14 | bwd_allreduce: 411.12 | step: 115.31
{'loss': 0.7293, 'learning_rate': 2.1443249675523536e-06, 'epoch': 0.79}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13027
total_samples=24167, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:33,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.54 | bwd_microstep: 2061.62 | bwd_inner_microstep: 1749.81 | bwd_allreduce_microstep: 311.74 | step_microstep: 0.26
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12595
total_samples=24171, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:35,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.25 | bwd_microstep: 1724.26 | bwd_inner_microstep: 1598.47 | bwd_allreduce_microstep: 125.72 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11828
total_samples=24174, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:38,355] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.13 | bwd_microstep: 1860.69 | bwd_inner_microstep: 1614.80 | bwd_allreduce_microstep: 245.82 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13255
total_samples=24178, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:41,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 06:34:41,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.14 | bwd_microstep: 1969.96 | bwd_inner_microstep: 1903.63 | bwd_allreduce_microstep: 66.27 | step_microstep: 135.65
[2025-08-03 06:34:41,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2884.00 | bwd: 7616.58 | bwd_inner: 6866.70 | bwd_allreduce: 749.63 | step: 136.30
{'loss': 0.7442, 'learning_rate': 2.134314952467873e-06, 'epoch': 0.79}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14451
total_samples=24182, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:43,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.42 | bwd_microstep: 1805.87 | bwd_inner_microstep: 1721.62 | bwd_allreduce_microstep: 84.17 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13423
total_samples=24186, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:46,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.23 | bwd_microstep: 2049.48 | bwd_inner_microstep: 1896.56 | bwd_allreduce_microstep: 152.85 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14328
total_samples=24190, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:49,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.26 | bwd_microstep: 1981.25 | bwd_inner_microstep: 1743.86 | bwd_allreduce_microstep: 237.32 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12590
total_samples=24193, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:52,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 06:34:52,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.84 | bwd_microstep: 2279.21 | bwd_inner_microstep: 2090.21 | bwd_allreduce_microstep: 188.94 | step_microstep: 129.92
[2025-08-03 06:34:52,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2888.68 | bwd: 8115.86 | bwd_inner: 7452.25 | bwd_allreduce: 663.37 | step: 130.44
 79%|███████▉  | 1585/2000 [4:51:12<1:15:09, 10.87s/it]
 79%|███████▉  | 1586/2000 [4:51:22<1:14:12, 10.75s/it]
 79%|███████▉  | 1587/2000 [4:51:34<1:15:15, 10.93s/it]
 79%|███████▉  | 1588/2000 [4:51:45<1:15:01, 10.93s/it]
 79%|███████▉  | 1589/2000 [4:51:56<1:14:51, 10.93s/it]
{'loss': 0.7464, 'learning_rate': 2.124325564225458e-06, 'epoch': 0.8}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11859
total_samples=24196, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:55,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.73 | bwd_microstep: 1938.02 | bwd_inner_microstep: 1539.61 | bwd_allreduce_microstep: 398.34 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13462
total_samples=24200, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:34:58,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.73 | bwd_microstep: 2187.36 | bwd_inner_microstep: 1960.07 | bwd_allreduce_microstep: 227.22 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11726
total_samples=24203, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:01,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.24 | bwd_microstep: 1968.16 | bwd_inner_microstep: 1616.72 | bwd_allreduce_microstep: 351.38 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11724
total_samples=24206, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:03,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 06:35:03,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.30 | bwd_microstep: 1855.80 | bwd_inner_microstep: 1531.09 | bwd_allreduce_microstep: 324.64 | step_microstep: 136.31
[2025-08-03 06:35:03,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.92 | bwd: 7949.39 | bwd_inner: 6647.49 | bwd_allreduce: 1301.66 | step: 136.66
{'loss': 0.7329, 'learning_rate': 2.1143568290211115e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13504
total_samples=24210, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:06,475] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.65 | bwd_microstep: 1799.28 | bwd_inner_microstep: 1701.25 | bwd_allreduce_microstep: 97.97 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11678
total_samples=24213, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:09,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.64 | bwd_microstep: 1792.48 | bwd_inner_microstep: 1556.55 | bwd_allreduce_microstep: 235.85 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11669
total_samples=24216, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:11,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.65 | bwd_microstep: 1824.81 | bwd_inner_microstep: 1583.16 | bwd_allreduce_microstep: 241.59 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11975
total_samples=24219, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:14,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 06:35:14,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.98 | bwd_microstep: 1845.58 | bwd_inner_microstep: 1589.78 | bwd_allreduce_microstep: 255.74 | step_microstep: 112.46
[2025-08-03 06:35:14,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.84 | bwd: 7262.20 | bwd_inner: 6430.73 | bwd_allreduce: 831.23 | step: 112.87
{'loss': 0.7375, 'learning_rate': 2.1044087729966856e-06, 'epoch': 0.8}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13520
total_samples=24223, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:16,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.49 | bwd_microstep: 1784.40 | bwd_inner_microstep: 1676.65 | bwd_allreduce_microstep: 107.67 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12031
total_samples=24226, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:19,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.22 | bwd_microstep: 1715.30 | bwd_inner_microstep: 1543.05 | bwd_allreduce_microstep: 172.19 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12498
total_samples=24229, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:22,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.04 | bwd_microstep: 2153.83 | bwd_inner_microstep: 1934.45 | bwd_allreduce_microstep: 219.31 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13231
total_samples=24233, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:25,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 06:35:25,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.72 | bwd_microstep: 2477.31 | bwd_inner_microstep: 2192.54 | bwd_allreduce_microstep: 284.70 | step_microstep: 129.66
[2025-08-03 06:35:25,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2855.41 | bwd: 8130.90 | bwd_inner: 7346.69 | bwd_allreduce: 783.95 | step: 130.19
{'loss': 0.7335, 'learning_rate': 2.0944814222397948e-06, 'epoch': 0.8}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12000
total_samples=24236, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:28,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.78 | bwd_microstep: 2251.02 | bwd_inner_microstep: 2031.67 | bwd_allreduce_microstep: 219.29 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13402
total_samples=24240, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:31,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.31 | bwd_microstep: 1829.40 | bwd_inner_microstep: 1715.49 | bwd_allreduce_microstep: 113.85 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12485
total_samples=24243, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:33,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.15 | bwd_microstep: 1710.81 | bwd_inner_microstep: 1568.66 | bwd_allreduce_microstep: 142.08 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13518
total_samples=24247, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:36,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.75
[2025-08-03 06:35:36,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.92 | bwd_microstep: 1761.89 | bwd_inner_microstep: 1694.50 | bwd_allreduce_microstep: 67.32 | step_microstep: 143.57
[2025-08-03 06:35:36,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.08 | bwd: 7553.17 | bwd_inner: 7010.31 | bwd_allreduce: 542.62 | step: 144.08
{'loss': 0.7443, 'learning_rate': 2.0845748027837585e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13870
total_samples=24251, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:39,185] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.71 | bwd_microstep: 1788.54 | bwd_inner_microstep: 1736.76 | bwd_allreduce_microstep: 51.71 | step_microstep: 0.13
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12094
total_samples=24255, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:41,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.54 | bwd_microstep: 1963.84 | bwd_inner_microstep: 1767.29 | bwd_allreduce_microstep: 196.47 | step_microstep: 0.86
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13222
total_samples=24259, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:44,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.86 | bwd_microstep: 1824.62 | bwd_inner_microstep: 1686.18 | bwd_allreduce_microstep: 138.37 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11677
total_samples=24262, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:47,192] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 06:35:47,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.49 | bwd_microstep: 1771.85 | bwd_inner_microstep: 1549.46 | bwd_allreduce_microstep: 222.32 | step_microstep: 135.34
[2025-08-03 06:35:47,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.53 | bwd: 7348.92 | bwd_inner: 6739.69 | bwd_allreduce: 608.95 | step: 136.46
 80%|███████▉  | 1590/2000 [4:52:07<1:15:41, 11.08s/it]
 80%|███████▉  | 1591/2000 [4:52:18<1:15:46, 11.12s/it]
 80%|███████▉  | 1592/2000 [4:52:29<1:14:20, 10.93s/it]
 80%|███████▉  | 1593/2000 [4:52:40<1:15:07, 11.07s/it]
 80%|███████▉  | 1594/2000 [4:52:51<1:14:21, 10.99s/it]
{'loss': 0.7338, 'learning_rate': 2.074688940607529e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13589
total_samples=24266, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:50,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.10 | bwd_microstep: 2180.65 | bwd_inner_microstep: 2001.40 | bwd_allreduce_microstep: 179.18 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13340
total_samples=24270, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:52,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.00 | bwd_microstep: 1807.37 | bwd_inner_microstep: 1703.79 | bwd_allreduce_microstep: 103.50 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16264
total_samples=24274, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:55,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.08 | bwd_microstep: 2010.10 | bwd_inner_microstep: 1836.38 | bwd_allreduce_microstep: 173.66 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11759
total_samples=24277, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:35:58,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.74
[2025-08-03 06:35:58,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.51 | bwd_microstep: 2126.51 | bwd_inner_microstep: 1885.95 | bwd_allreduce_microstep: 240.48 | step_microstep: 134.13
[2025-08-03 06:35:58,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2897.62 | bwd: 8124.67 | bwd_inner: 7427.53 | bwd_allreduce: 696.90 | step: 134.64
{'loss': 0.7385, 'learning_rate': 2.064823861635633e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13298
total_samples=24281, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:01,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.63 | bwd_microstep: 1747.87 | bwd_inner_microstep: 1671.62 | bwd_allreduce_microstep: 76.18 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12029
total_samples=24284, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:03,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.81 | bwd_microstep: 1865.34 | bwd_inner_microstep: 1736.95 | bwd_allreduce_microstep: 128.32 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12048
total_samples=24288, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:07,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.16 | bwd_microstep: 2478.43 | bwd_inner_microstep: 2193.78 | bwd_allreduce_microstep: 284.58 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13608
total_samples=24292, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:09,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.84
[2025-08-03 06:36:09,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.88 | bwd_microstep: 1722.12 | bwd_inner_microstep: 1674.38 | bwd_allreduce_microstep: 47.66 | step_microstep: 153.51
[2025-08-03 06:36:09,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.41 | bwd: 7813.82 | bwd_inner: 7276.72 | bwd_allreduce: 536.84 | step: 154.12
{'loss': 0.7302, 'learning_rate': 2.0549795917380867e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13417
total_samples=24296, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:12,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.05 | bwd_microstep: 1726.10 | bwd_inner_microstep: 1670.97 | bwd_allreduce_microstep: 55.05 | step_microstep: 0.35
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11731
total_samples=24299, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:15,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 750.28 | bwd_microstep: 2107.77 | bwd_inner_microstep: 1800.06 | bwd_allreduce_microstep: 307.66 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12076
total_samples=24302, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:17,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.40 | bwd_microstep: 1811.34 | bwd_inner_microstep: 1602.08 | bwd_allreduce_microstep: 209.19 | step_microstep: 0.28
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13201
total_samples=24306, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:20,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61
[2025-08-03 06:36:20,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.72 | bwd_microstep: 1865.15 | bwd_inner_microstep: 1692.54 | bwd_allreduce_microstep: 172.55 | step_microstep: 130.24
[2025-08-03 06:36:20,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2877.37 | bwd: 7510.41 | bwd_inner: 6765.64 | bwd_allreduce: 744.52 | step: 130.99
{'loss': 0.7327, 'learning_rate': 2.0451561567303378e-06, 'epoch': 0.8}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13541
total_samples=24310, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:23,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.48 | bwd_microstep: 1736.93 | bwd_inner_microstep: 1667.93 | bwd_allreduce_microstep: 68.93 | step_microstep: 0.14
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13641
total_samples=24314, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:25,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.15 | bwd_microstep: 1807.51 | bwd_inner_microstep: 1694.71 | bwd_allreduce_microstep: 112.74 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11626
total_samples=24317, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:28,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.44 | bwd_microstep: 1897.17 | bwd_inner_microstep: 1707.54 | bwd_allreduce_microstep: 189.53 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11581
total_samples=24320, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:31,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.30
[2025-08-03 06:36:31,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.49 | bwd_microstep: 1785.97 | bwd_inner_microstep: 1548.23 | bwd_allreduce_microstep: 237.66 | step_microstep: 163.05
[2025-08-03 06:36:31,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.50 | bwd: 7227.63 | bwd_inner: 6618.42 | bwd_allreduce: 608.95 | step: 163.61
{'loss': 0.7316, 'learning_rate': 2.0353535823732053e-06, 'epoch': 0.8}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15510
total_samples=24324, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:33,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.04 | bwd_microstep: 2033.45 | bwd_inner_microstep: 1907.85 | bwd_allreduce_microstep: 125.52 | step_microstep: 0.36
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13705
total_samples=24328, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:36,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.19 | bwd_microstep: 1960.84 | bwd_inner_microstep: 1888.83 | bwd_allreduce_microstep: 71.95 | step_microstep: 0.74
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12444
total_samples=24331, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:39,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.24 | bwd_microstep: 2333.52 | bwd_inner_microstep: 2327.36 | bwd_allreduce_microstep: 6.10 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13351
total_samples=24335, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:42,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 06:36:42,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.46 | bwd_microstep: 1956.50 | bwd_inner_microstep: 1837.37 | bwd_allreduce_microstep: 119.06 | step_microstep: 438.79
[2025-08-03 06:36:42,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.85 | bwd: 8284.39 | bwd_inner: 7961.41 | bwd_allreduce: 322.72 | step: 440.19
 80%|███████▉  | 1595/2000 [4:53:02<1:13:22, 10.87s/it]
 80%|███████▉  | 1596/2000 [4:53:13<1:14:22, 11.05s/it]
 80%|███████▉  | 1597/2000 [4:53:24<1:14:14, 11.05s/it]
 80%|███████▉  | 1598/2000 [4:53:35<1:13:37, 10.99s/it]
 80%|███████▉  | 1599/2000 [4:53:45<1:12:30, 10.85s/it]
 80%|████████  | 1600/2000 [4:53
{'loss': 0.7376, 'learning_rate': 2.025571894372794e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13805
total_samples=24339, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:45,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.19 | bwd_microstep: 2018.67 | bwd_inner_microstep: 1895.75 | bwd_allreduce_microstep: 122.85 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13209
total_samples=24343, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:48,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.06 | bwd_microstep: 1784.80 | bwd_inner_microstep: 1711.57 | bwd_allreduce_microstep: 73.16 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11579
total_samples=24346, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:51,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.70 | bwd_microstep: 2167.93 | bwd_inner_microstep: 1935.93 | bwd_allreduce_microstep: 231.93 | step_microstep: 0.25
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13002
total_samples=24350, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:54,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 06:36:54,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.61 | bwd_microstep: 1854.62 | bwd_inner_microstep: 1646.68 | bwd_allreduce_microstep: 207.87 | step_microstep: 110.46
[2025-08-03 06:36:54,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.47 | bwd: 7826.06 | bwd_inner: 7189.92 | bwd_allreduce: 635.88 | step: 110.93
{'loss': 0.7351, 'learning_rate': 2.0158111183804407e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14229
total_samples=24355, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:56,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.58 | bwd_microstep: 2108.96 | bwd_inner_microstep: 1826.56 | bwd_allreduce_microstep: 282.33 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13090
total_samples=24359, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:36:59,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.42 | bwd_microstep: 1956.94 | bwd_inner_microstep: 1682.21 | bwd_allreduce_microstep: 274.67 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12326
total_samples=24362, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:02,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.72 | bwd_microstep: 1756.34 | bwd_inner_microstep: 1568.51 | bwd_allreduce_microstep: 187.75 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12081
total_samples=24365, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:04,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 06:37:04,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.98 | bwd_microstep: 1768.36 | bwd_inner_microstep: 1574.32 | bwd_allreduce_microstep: 193.97 | step_microstep: 117.55
[2025-08-03 06:37:04,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.62 | bwd: 7590.65 | bwd_inner: 6651.60 | bwd_allreduce: 938.80 | step: 118.05
{'loss': 0.7383, 'learning_rate': 2.0060712799926407e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14224
total_samples=24369, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:07,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.56 | bwd_microstep: 1890.12 | bwd_inner_microstep: 1738.26 | bwd_allreduce_microstep: 151.79 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13292
total_samples=24373, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:10,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.33 | bwd_microstep: 1788.01 | bwd_inner_microstep: 1710.66 | bwd_allreduce_microstep: 77.28 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13255
total_samples=24377, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:12,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.49 | bwd_microstep: 1748.67 | bwd_inner_microstep: 1680.10 | bwd_allreduce_microstep: 68.49 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11833
total_samples=24380, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:15,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 06:37:15,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.50 | bwd_microstep: 2234.42 | bwd_inner_microstep: 2012.65 | bwd_allreduce_microstep: 221.66 | step_microstep: 114.13
[2025-08-03 06:37:15,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.80 | bwd: 7661.28 | bwd_inner: 7141.68 | bwd_allreduce: 519.33 | step: 114.68
{'loss': 0.7496, 'learning_rate': 1.9963524047509898e-06, 'epoch': 0.8}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14363
total_samples=24387, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:18,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.19 | bwd_microstep: 1774.03 | bwd_inner_microstep: 1729.92 | bwd_allreduce_microstep: 44.04 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13715
total_samples=24391, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:21,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.38 | bwd_microstep: 2134.94 | bwd_inner_microstep: 2094.16 | bwd_allreduce_microstep: 40.72 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14458
total_samples=24395, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:23,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.03 | bwd_microstep: 1968.60 | bwd_inner_microstep: 1878.89 | bwd_allreduce_microstep: 89.65 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12796
total_samples=24398, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:26,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.71
[2025-08-03 06:37:26,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.38 | bwd_microstep: 1739.94 | bwd_inner_microstep: 1592.54 | bwd_allreduce_microstep: 147.33 | step_microstep: 133.24
[2025-08-03 06:37:26,548] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2770.90 | bwd: 7617.56 | bwd_inner: 7295.50 | bwd_allreduce: 321.81 | step: 133.85
{'loss': 0.7379, 'learning_rate': 1.9866545181421016e-06, 'epoch': 0.8}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12941
total_samples=24402, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:29,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.48 | bwd_microstep: 1826.58 | bwd_inner_microstep: 1666.41 | bwd_allreduce_microstep: 160.11 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13065
total_samples=24406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:31,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.97 | bwd_microstep: 1965.67 | bwd_inner_microstep: 1718.88 | bwd_allreduce_microstep: 246.74 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11799
total_samples=24409, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:34,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.70 | bwd_microstep: 1835.10 | bwd_inner_microstep: 1597.98 | bwd_allreduce_microstep: 237.05 | step_microstep: 0.26
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12765
total_samples=24413, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:37,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 06:37:37,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.74 | bwd_microstep: 1752.41 | bwd_inner_microstep: 1595.88 | bwd_allreduce_microstep: 156.45 | step_microstep: 147.15
[2025-08-03 06:37:37,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.81 | bwd: 7379.83 | bwd_inner: 6579.14 | bwd_allreduce: 800.44 | step: 147.65
80%|████████  | 1600/2000 [4:53:57<1:14:16, 11.14s/it]
80%|████████  | 1601/2000 [4:54:08<1:14:00, 11.13s/it]
80%|████████  | 1602/2000 [4:54:19<1:13:14, 11.04s/it]
80%|████████  | 1603/2000 [4:54:30<1:12:45, 11.00s/it]
80%|████████  | 1604/2000 [4:54:41<1:12:13, 10.94s/it]
{'loss': 0.7387, 'learning_rate': 1.976977645597552e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13586
total_samples=24418, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:40,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.05 | bwd_microstep: 2023.50 | bwd_inner_microstep: 1779.10 | bwd_allreduce_microstep: 244.33 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13730
total_samples=24422, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:42,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.80 | bwd_microstep: 1843.07 | bwd_inner_microstep: 1752.19 | bwd_allreduce_microstep: 90.81 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13053
total_samples=24426, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:45,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 757.32 | bwd_microstep: 1806.00 | bwd_inner_microstep: 1681.99 | bwd_allreduce_microstep: 123.94 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 14388
total_samples=24430, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:47,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 06:37:47,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.26 | bwd_microstep: 1730.17 | bwd_inner_microstep: 1645.40 | bwd_allreduce_microstep: 84.69 | step_microstep: 142.14
[2025-08-03 06:37:47,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2868.37 | bwd: 7402.79 | bwd_inner: 6858.67 | bwd_allreduce: 543.86 | step: 142.61
{'loss': 0.733, 'learning_rate': 1.967321812493813e-06, 'epoch': 0.8}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14199
total_samples=24434, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:50,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.30 | bwd_microstep: 1954.38 | bwd_inner_microstep: 1799.09 | bwd_allreduce_microstep: 155.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13287
total_samples=24438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:53,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.83 | bwd_microstep: 2061.65 | bwd_inner_microstep: 2055.51 | bwd_allreduce_microstep: 6.08 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13282
total_samples=24442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:56,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.23 | bwd_microstep: 2100.25 | bwd_inner_microstep: 1930.18 | bwd_allreduce_microstep: 170.01 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12975
total_samples=24446, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:37:59,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.55
[2025-08-03 06:37:59,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.96 | bwd_microstep: 1832.17 | bwd_inner_microstep: 1668.26 | bwd_allreduce_microstep: 163.83 | step_microstep: 114.29
[2025-08-03 06:37:59,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2913.25 | bwd: 7948.51 | bwd_inner: 7453.03 | bwd_allreduce: 495.23 | step: 114.76
{'loss': 0.7321, 'learning_rate': 1.9576870441521834e-06, 'epoch': 0.8}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13423
total_samples=24451, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:01,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.23 | bwd_microstep: 1787.30 | bwd_inner_microstep: 1675.68 | bwd_allreduce_microstep: 111.55 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14979
total_samples=24455, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:04,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.82 | bwd_microstep: 1831.15 | bwd_inner_microstep: 1794.30 | bwd_allreduce_microstep: 36.78 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11476
total_samples=24458, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:06,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.92 | bwd_microstep: 1817.96 | bwd_inner_microstep: 1561.82 | bwd_allreduce_microstep: 256.07 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13910
total_samples=24462, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:10,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.84
[2025-08-03 06:38:10,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.82 | bwd_microstep: 2097.49 | bwd_inner_microstep: 1881.37 | bwd_allreduce_microstep: 216.04 | step_microstep: 471.71
[2025-08-03 06:38:10,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2850.71 | bwd: 7533.94 | bwd_inner: 6913.17 | bwd_allreduce: 620.51 | step: 472.30
{'loss': 0.7376, 'learning_rate': 1.9480733658387175e-06, 'epoch': 0.8}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15023
total_samples=24466, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:12,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.83 | bwd_microstep: 1800.79 | bwd_inner_microstep: 1775.88 | bwd_allreduce_microstep: 24.85 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13865
total_samples=24470, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:15,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.80 | bwd_microstep: 1736.83 | bwd_inner_microstep: 1694.67 | bwd_allreduce_microstep: 42.09 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13211
total_samples=24474, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:18,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.25 | bwd_microstep: 1736.97 | bwd_inner_microstep: 1670.59 | bwd_allreduce_microstep: 66.31 | step_microstep: 0.87
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13459
total_samples=24478, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:20,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 06:38:20,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.33 | bwd_microstep: 1968.96 | bwd_inner_microstep: 1834.32 | bwd_allreduce_microstep: 134.56 | step_microstep: 121.47
[2025-08-03 06:38:20,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.13 | bwd: 7243.60 | bwd_inner: 6975.46 | bwd_allreduce: 267.89 | step: 122.74
{'loss': 0.7348, 'learning_rate': 1.9384808027641666e-06, 'epoch': 0.8}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11784
total_samples=24481, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:23,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.59 | bwd_microstep: 2256.65 | bwd_inner_microstep: 1858.85 | bwd_allreduce_microstep: 397.73 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13313
total_samples=24485, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:26,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.64 | bwd_microstep: 1975.37 | bwd_inner_microstep: 1886.33 | bwd_allreduce_microstep: 88.96 | step_microstep: 0.29
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12213
total_samples=24489, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:29,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.99 | bwd_microstep: 1872.52 | bwd_inner_microstep: 1633.04 | bwd_allreduce_microstep: 239.41 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12127
total_samples=24492, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:32,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 06:38:32,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.73 | bwd_microstep: 2025.11 | bwd_inner_microstep: 1907.50 | bwd_allreduce_microstep: 117.55 | step_microstep: 111.86
[2025-08-03 06:38:32,280] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2878.89 | bwd: 8129.69 | bwd_inner: 7285.72 | bwd_allreduce: 843.74 | step: 112.39
80%|████████  | 1605/2000 [4:54:52<1:11:27, 10.85s/it]
80%|████████  | 1606/2000 [4:55:02<1:10:57, 10.81s/it]
80%|████████  | 1607/2000 [4:55:14<1:11:43, 10.95s/it]
80%|████████  | 1608/2000 [4:55:25<1:11:56, 11.01s/it]
80%|████████  | 1609/2000 [4:55:35<1:10:48, 10.87s/it]
80%|████████  | 1610/2000 [4:55:47<1:11:42, 11.03s/it]
{'loss': 0.7313, 'learning_rate': 1.9289093800839067e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13233
total_samples=24496, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:34,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.05 | bwd_microstep: 1861.88 | bwd_inner_microstep: 1667.49 | bwd_allreduce_microstep: 194.32 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13293
total_samples=24500, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:37,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.69 | bwd_microstep: 1747.86 | bwd_inner_microstep: 1682.94 | bwd_allreduce_microstep: 64.85 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11957
total_samples=24503, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:40,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.38 | bwd_microstep: 2148.68 | bwd_inner_microstep: 1929.77 | bwd_allreduce_microstep: 218.85 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12689
total_samples=24506, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:43,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 06:38:43,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.05 | bwd_microstep: 1807.88 | bwd_inner_microstep: 1602.58 | bwd_allreduce_microstep: 205.24 | step_microstep: 144.58
[2025-08-03 06:38:43,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.09 | bwd: 7566.36 | bwd_inner: 6882.79 | bwd_allreduce: 683.33 | step: 144.94
{'loss': 0.7242, 'learning_rate': 1.9193591228978815e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13106
total_samples=24510, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:45,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.33 | bwd_microstep: 2002.44 | bwd_inner_microstep: 1750.35 | bwd_allreduce_microstep: 252.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13435
total_samples=24514, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:48,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.19 | bwd_microstep: 2012.39 | bwd_inner_microstep: 1888.83 | bwd_allreduce_microstep: 123.49 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13133
total_samples=24518, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:51,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.37 | bwd_microstep: 1820.25 | bwd_inner_microstep: 1687.01 | bwd_allreduce_microstep: 133.17 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11820
total_samples=24521, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:54,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 06:38:54,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.21 | bwd_microstep: 1994.09 | bwd_inner_microstep: 1692.20 | bwd_allreduce_microstep: 301.83 | step_microstep: 124.10
[2025-08-03 06:38:54,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.03 | bwd: 7829.22 | bwd_inner: 7018.38 | bwd_allreduce: 810.60 | step: 124.47
{'loss': 0.7329, 'learning_rate': 1.9098300562505266e-06, 'epoch': 0.81}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12859
total_samples=24525, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:56,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.79 | bwd_microstep: 1715.03 | bwd_inner_microstep: 1618.61 | bwd_allreduce_microstep: 96.35 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13794
total_samples=24529, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:38:59,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.55 | bwd_microstep: 1934.28 | bwd_inner_microstep: 1912.45 | bwd_allreduce_microstep: 21.73 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14756
total_samples=24533, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:02,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.42 | bwd_microstep: 1986.72 | bwd_inner_microstep: 1756.68 | bwd_allreduce_microstep: 229.97 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12046
total_samples=24536, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:05,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 06:39:05,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.36 | bwd_microstep: 2053.48 | bwd_inner_microstep: 1776.04 | bwd_allreduce_microstep: 277.37 | step_microstep: 111.16
[2025-08-03 06:39:05,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.05 | bwd: 7689.59 | bwd_inner: 7063.77 | bwd_allreduce: 625.52 | step: 111.81
{'loss': 0.7304, 'learning_rate': 1.9003222051307046e-06, 'epoch': 0.81}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11854
total_samples=24539, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:07,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.51 | bwd_microstep: 1762.57 | bwd_inner_microstep: 1541.58 | bwd_allreduce_microstep: 220.92 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13666
total_samples=24543, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:10,268] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.46 | bwd_microstep: 1845.85 | bwd_inner_microstep: 1742.41 | bwd_allreduce_microstep: 103.37 | step_microstep: 0.35
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13647
total_samples=24547, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:13,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.59 | bwd_microstep: 2143.92 | bwd_inner_microstep: 2137.85 | bwd_allreduce_microstep: 6.02 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11791
total_samples=24550, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:15,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 06:39:15,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.22 | bwd_microstep: 1758.99 | bwd_inner_microstep: 1541.92 | bwd_allreduce_microstep: 217.01 | step_microstep: 130.19
[2025-08-03 06:39:15,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.71 | bwd: 7511.39 | bwd_inner: 6963.75 | bwd_allreduce: 547.39 | step: 130.79
{'loss': 0.742, 'learning_rate': 1.8908355944716516e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14955
total_samples=24554, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:18,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.21 | bwd_microstep: 1815.19 | bwd_inner_microstep: 1748.67 | bwd_allreduce_microstep: 66.44 | step_microstep: 0.84
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13303
total_samples=24558, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:21,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.73 | bwd_microstep: 1817.90 | bwd_inner_microstep: 1715.01 | bwd_allreduce_microstep: 102.83 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14043
total_samples=24563, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:23,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.81 | bwd_microstep: 1800.26 | bwd_inner_microstep: 1741.91 | bwd_allreduce_microstep: 58.29 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11698
total_samples=24566, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:26,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 06:39:26,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.05 | bwd_microstep: 2074.29 | bwd_inner_microstep: 1835.92 | bwd_allreduce_microstep: 238.30 | step_microstep: 434.94
[2025-08-03 06:39:26,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.73 | bwd: 7507.70 | bwd_inner: 7041.50 | bwd_allreduce: 465.95 | step: 436.03
81%|████████  | 1611/2000 [4:55:57<1:11:04, 10.96s/it]
81%|████████  | 1612/2000 [4:56:09<1:11:06, 11.00s/it]
81%|████████  | 1613/2000 [4:56:19<1:10:48, 10.98s/it]
81%|████████  | 1614/2000 [4:56:30<1:10:15, 10.92s/it]
81%|████████  | 1615/2000 [4:56:41<1:10:21, 10.96s/it]
{'loss': 0.7442, 'learning_rate': 1.8813702491508956e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13866
total_samples=24570, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:29,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.99 | bwd_microstep: 1816.05 | bwd_inner_microstep: 1729.77 | bwd_allreduce_microstep: 86.22 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13755
total_samples=24574, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:32,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.66 | bwd_microstep: 1808.91 | bwd_inner_microstep: 1736.82 | bwd_allreduce_microstep: 72.03 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14310
total_samples=24578, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:34,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.37 | bwd_microstep: 2035.21 | bwd_inner_microstep: 1820.36 | bwd_allreduce_microstep: 214.78 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11706
total_samples=24581, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:38,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.35
[2025-08-03 06:39:38,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.16 | bwd_microstep: 2254.03 | bwd_inner_microstep: 2013.27 | bwd_allreduce_microstep: 240.69 | step_microstep: 110.79
[2025-08-03 06:39:38,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.11 | bwd: 7914.26 | bwd_inner: 7300.23 | bwd_allreduce: 613.80 | step: 111.30
{'loss': 0.7342, 'learning_rate': 1.8719261939902023e-06, 'epoch': 0.81}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13295
total_samples=24585, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:40,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.57 | bwd_microstep: 1889.50 | bwd_inner_microstep: 1628.40 | bwd_allreduce_microstep: 261.03 | step_microstep: 0.98
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14574
total_samples=24590, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:43,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.85 | bwd_microstep: 1927.83 | bwd_inner_microstep: 1898.71 | bwd_allreduce_microstep: 29.04 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14442
total_samples=24594, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:46,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.82 | bwd_microstep: 1852.61 | bwd_inner_microstep: 1777.16 | bwd_allreduce_microstep: 75.38 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13473
total_samples=24598, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:48,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.78
[2025-08-03 06:39:48,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.91 | bwd_microstep: 1873.95 | bwd_inner_microstep: 1781.41 | bwd_allreduce_microstep: 92.47 | step_microstep: 109.23
[2025-08-03 06:39:48,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.07 | bwd: 7543.94 | bwd_inner: 7085.68 | bwd_allreduce: 458.00 | step: 110.50
{'loss': 0.7486, 'learning_rate': 1.862503453755502e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13230
total_samples=24602, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:51,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.12 | bwd_microstep: 1726.40 | bwd_inner_microstep: 1669.96 | bwd_allreduce_microstep: 56.37 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13351
total_samples=24606, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:53,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.20 | bwd_microstep: 1791.52 | bwd_inner_microstep: 1713.32 | bwd_allreduce_microstep: 78.14 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13432
total_samples=24610, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:56,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.90 | bwd_microstep: 1879.29 | bwd_inner_microstep: 1729.69 | bwd_allreduce_microstep: 149.53 | step_microstep: 0.36
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12017
total_samples=24613, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:39:59,384] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 06:39:59,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.12 | bwd_microstep: 1790.54 | bwd_inner_microstep: 1567.57 | bwd_allreduce_microstep: 222.90 | step_microstep: 138.68
[2025-08-03 06:39:59,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.28 | bwd: 7187.81 | bwd_inner: 6680.53 | bwd_allreduce: 507.03 | step: 139.42
{'loss': 0.7326, 'learning_rate': 1.8531020531568377e-06, 'epoch': 0.81}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12161
total_samples=24617, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:02,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.55 | bwd_microstep: 2010.45 | bwd_inner_microstep: 1838.51 | bwd_allreduce_microstep: 171.86 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13262
total_samples=24621, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:04,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.82 | bwd_microstep: 1782.99 | bwd_inner_microstep: 1693.37 | bwd_allreduce_microstep: 89.55 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13093
total_samples=24625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:07,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.13 | bwd_microstep: 2045.09 | bwd_inner_microstep: 1694.15 | bwd_allreduce_microstep: 350.85 | step_microstep: 0.33
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14225
total_samples=24629, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:10,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 06:40:10,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.14 | bwd_microstep: 1899.84 | bwd_inner_microstep: 1871.53 | bwd_allreduce_microstep: 28.24 | step_microstep: 137.54
[2025-08-03 06:40:10,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2849.57 | bwd: 7738.44 | bwd_inner: 7097.56 | bwd_allreduce: 640.60 | step: 138.41
{'loss': 0.7327, 'learning_rate': 1.8437220168482839e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14558
total_samples=24634, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:12,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.06 | bwd_microstep: 1751.44 | bwd_inner_microstep: 1716.94 | bwd_allreduce_microstep: 34.43 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13888
total_samples=24638, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:15,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.00 | bwd_microstep: 1922.20 | bwd_inner_microstep: 1838.18 | bwd_allreduce_microstep: 83.95 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13251
total_samples=24642, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:18,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.09 | bwd_microstep: 1828.75 | bwd_inner_microstep: 1791.57 | bwd_allreduce_microstep: 37.11 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11769
total_samples=24645, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:21,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.68
[2025-08-03 06:40:21,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.33 | bwd_microstep: 2102.69 | bwd_inner_microstep: 1973.09 | bwd_allreduce_microstep: 129.53 | step_microstep: 114.61
[2025-08-03 06:40:21,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.41 | bwd: 7605.13 | bwd_inner: 7319.78 | bwd_allreduce: 285.11 | step: 115.25
 81%|████████  | 1615/2000 [4:56:41<1:10:21, 10.96s/it]
 81%|████████  | 1616/2000 [4:56:52<1:10:32, 11.02s/it]
 81%|████████  | 1617/2000 [4:57:03<1:09:59, 10.96s/it]
 81%|████████  | 1618/2000 [4:57:14<1:08:50, 10.81s/it]
 81%|████████  | 1619/2000 [4:57:25<1:09:03, 10.88s/it]
 81%|████████  | 1620/2000 [4:57:36<1:08:49, 10.87s/it]
{'loss': 0.738, 'learning_rate': 1.8343633694278895e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13733
total_samples=24649, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:24,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.88 | bwd_microstep: 2045.36 | bwd_inner_microstep: 2020.53 | bwd_allreduce_microstep: 24.77 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14352
total_samples=24653, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:26,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.39 | bwd_microstep: 2011.86 | bwd_inner_microstep: 1791.09 | bwd_allreduce_microstep: 220.68 | step_microstep: 0.16
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12979
total_samples=24657, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:29,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.25 | bwd_microstep: 1750.58 | bwd_inner_microstep: 1660.87 | bwd_allreduce_microstep: 89.65 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13065
total_samples=24661, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:32,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 06:40:32,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.53 | bwd_microstep: 1764.19 | bwd_inner_microstep: 1640.87 | bwd_allreduce_microstep: 123.25 | step_microstep: 114.72
[2025-08-03 06:40:32,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.98 | bwd: 7572.05 | bwd_inner: 7113.36 | bwd_allreduce: 458.43 | step: 115.34
{'loss': 0.7407, 'learning_rate': 1.825026135437622e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13513
total_samples=24665, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:34,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.10 | bwd_microstep: 1702.87 | bwd_inner_microstep: 1653.90 | bwd_allreduce_microstep: 48.90 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13963
total_samples=24669, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:37,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.39 | bwd_microstep: 2054.88 | bwd_inner_microstep: 1761.00 | bwd_allreduce_microstep: 293.81 | step_microstep: 0.25
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13274
total_samples=24673, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:40,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.67 | bwd_microstep: 1932.43 | bwd_inner_microstep: 1701.45 | bwd_allreduce_microstep: 230.92 | step_microstep: 0.27
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12305
total_samples=24677, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:42,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 06:40:42,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.45 | bwd_microstep: 1793.13 | bwd_inner_microstep: 1592.12 | bwd_allreduce_microstep: 200.95 | step_microstep: 134.22
[2025-08-03 06:40:42,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.50 | bwd: 7483.38 | bwd_inner: 6708.47 | bwd_allreduce: 774.67 | step: 134.89
{'loss': 0.7201, 'learning_rate': 1.8157103393632869e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13882
total_samples=24681, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:45,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.71 | bwd_microstep: 1830.00 | bwd_inner_microstep: 1739.44 | bwd_allreduce_microstep: 90.49 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15247
total_samples=24686, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:48,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.98 | bwd_microstep: 2002.74 | bwd_inner_microstep: 1749.58 | bwd_allreduce_microstep: 253.09 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12860
total_samples=24690, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:50,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.73 | bwd_microstep: 1776.15 | bwd_inner_microstep: 1621.29 | bwd_allreduce_microstep: 154.78 | step_microstep: 0.31
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14457
total_samples=24694, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:53,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 06:40:53,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.63 | bwd_microstep: 1830.10 | bwd_inner_microstep: 1768.25 | bwd_allreduce_microstep: 61.78 | step_microstep: 135.99
[2025-08-03 06:40:53,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.98 | bwd: 7439.03 | bwd_inner: 6878.56 | bwd_allreduce: 560.23 | step: 136.54
{'loss': 0.7359, 'learning_rate': 1.8064160056344714e-06, 'epoch': 0.81}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14507
total_samples=24700, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:55,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.76 | bwd_microstep: 1755.23 | bwd_inner_microstep: 1705.62 | bwd_allreduce_microstep: 49.53 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13263
total_samples=24704, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:40:58,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.70 | bwd_microstep: 2156.93 | bwd_inner_microstep: 2052.20 | bwd_allreduce_microstep: 104.66 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11487
total_samples=24707, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:01,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.66 | bwd_microstep: 2031.22 | bwd_inner_microstep: 1883.27 | bwd_allreduce_microstep: 147.87 | step_microstep: 0.31
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11718
total_samples=24710, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:04,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 06:41:04,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.65 | bwd_microstep: 1796.29 | bwd_inner_microstep: 1586.30 | bwd_allreduce_microstep: 209.92 | step_microstep: 109.65
[2025-08-03 06:41:04,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.70 | bwd: 7739.73 | bwd_inner: 7227.38 | bwd_allreduce: 512.06 | step: 110.38
{'loss': 0.7419, 'learning_rate': 1.7971431586244814e-06, 'epoch': 0.81}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11698
total_samples=24713, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:07,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.80 | bwd_microstep: 1879.59 | bwd_inner_microstep: 1610.55 | bwd_allreduce_microstep: 268.98 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15405
total_samples=24717, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:09,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.32 | bwd_microstep: 1968.77 | bwd_inner_microstep: 1754.27 | bwd_allreduce_microstep: 214.43 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11919
total_samples=24720, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:12,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.90 | bwd_microstep: 1827.97 | bwd_inner_microstep: 1594.77 | bwd_allreduce_microstep: 233.13 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13288
total_samples=24725, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:15,326] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 06:41:15,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.61 | bwd_microstep: 1975.02 | bwd_inner_microstep: 1849.70 | bwd_allreduce_microstep: 125.26 | step_microstep: 135.07
[2025-08-03 06:41:15,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.56 | bwd: 7651.40 | bwd_inner: 6809.28 | bwd_allreduce: 841.88 | step: 135.58
{'loss': 0.7352, 'learning_rate': 1.7878918226502816e-06, 'epoch': 0.81}
 81%|████████  | 1621/2000 [4:57:46<1:08:28, 10.84s/it]
 81%|████████  | 1622/2000 [4:57:57<1:08:06, 10.81s/it]
 81%|████████  | 1623/2000 [4:58:08<1:07:39, 10.77s/it]
 81%|████████  | 1624/2000 [4:58:19<1:07:49, 10.82s/it]
 81%|████████▏ | 1625/2000 [4:58:30<1:07:50, 10.86s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13223
total_samples=24729, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:18,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.47 | bwd_microstep: 1909.79 | bwd_inner_microstep: 1866.02 | bwd_allreduce_microstep: 43.70 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14295
total_samples=24733, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:20,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.81 | bwd_microstep: 2139.29 | bwd_inner_microstep: 1766.21 | bwd_allreduce_microstep: 373.01 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13313
total_samples=24737, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:23,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.21 | bwd_microstep: 1888.81 | bwd_inner_microstep: 1787.41 | bwd_allreduce_microstep: 101.32 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11705
total_samples=24740, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:26,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 06:41:26,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.69 | bwd_microstep: 2071.92 | bwd_inner_microstep: 1872.81 | bwd_allreduce_microstep: 199.05 | step_microstep: 114.48
[2025-08-03 06:41:26,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.11 | bwd: 8009.86 | bwd_inner: 7292.45 | bwd_allreduce: 717.17 | step: 115.12
{'loss': 0.7309, 'learning_rate': 1.7786620219724205e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13496
total_samples=24744, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:29,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.89 | bwd_microstep: 1745.27 | bwd_inner_microstep: 1651.84 | bwd_allreduce_microstep: 93.37 | step_microstep: 0.28
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12811
total_samples=24748, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:31,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.00 | bwd_microstep: 1770.41 | bwd_inner_microstep: 1661.06 | bwd_allreduce_microstep: 109.28 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12247
total_samples=24751, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:34,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.50 | bwd_microstep: 1870.17 | bwd_inner_microstep: 1573.02 | bwd_allreduce_microstep: 297.08 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13942
total_samples=24755, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:37,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 06:41:37,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.46 | bwd_microstep: 2563.05 | bwd_inner_microstep: 2483.85 | bwd_allreduce_microstep: 79.13 | step_microstep: 124.53
[2025-08-03 06:41:37,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2744.78 | bwd: 7948.96 | bwd_inner: 7369.76 | bwd_allreduce: 578.94 | step: 125.08
{'loss': 0.7258, 'learning_rate': 1.7694537807949707e-06, 'epoch': 0.81}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13609
total_samples=24759, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:40,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.21 | bwd_microstep: 1775.86 | bwd_inner_microstep: 1720.54 | bwd_allreduce_microstep: 55.25 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12909
total_samples=24763, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:42,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.90 | bwd_microstep: 1873.25 | bwd_inner_microstep: 1678.21 | bwd_allreduce_microstep: 194.97 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11926
total_samples=24766, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:45,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 904.94 | bwd_microstep: 1842.95 | bwd_inner_microstep: 1594.97 | bwd_allreduce_microstep: 247.91 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12840
total_samples=24770, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:48,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.56
[2025-08-03 06:41:48,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.21 | bwd_microstep: 1798.37 | bwd_inner_microstep: 1620.96 | bwd_allreduce_microstep: 177.33 | step_microstep: 116.77
[2025-08-03 06:41:48,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3004.20 | bwd: 7290.48 | bwd_inner: 6614.68 | bwd_allreduce: 675.55 | step: 117.15
{'loss': 0.7447, 'learning_rate': 1.7602671232654755e-06, 'epoch': 0.81}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13854
total_samples=24776, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:50,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.42 | bwd_microstep: 1780.91 | bwd_inner_microstep: 1699.97 | bwd_allreduce_microstep: 80.88 | step_microstep: 0.34
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13082
total_samples=24780, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:53,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.64 | bwd_microstep: 2063.65 | bwd_inner_microstep: 1999.74 | bwd_allreduce_microstep: 63.85 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14182
total_samples=24784, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:56,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.52 | bwd_microstep: 1891.07 | bwd_inner_microstep: 1873.99 | bwd_allreduce_microstep: 17.01 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12235
total_samples=24787, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:41:59,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.59
[2025-08-03 06:41:59,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.00 | bwd_microstep: 1858.13 | bwd_inner_microstep: 1605.56 | bwd_allreduce_microstep: 252.50 | step_microstep: 129.87
[2025-08-03 06:41:59,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.50 | bwd: 7593.82 | bwd_inner: 7179.26 | bwd_allreduce: 414.31 | step: 130.45
{'loss': 0.7412, 'learning_rate': 1.751102073474873e-06, 'epoch': 0.81}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13261
total_samples=24791, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:01,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.16 | bwd_microstep: 1714.77 | bwd_inner_microstep: 1642.76 | bwd_allreduce_microstep: 71.94 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13579
total_samples=24795, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:04,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 905.90 | bwd_microstep: 1859.43 | bwd_inner_microstep: 1745.43 | bwd_allreduce_microstep: 113.94 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13250
total_samples=24799, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:07,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.76 | bwd_microstep: 2012.57 | bwd_inner_microstep: 1856.10 | bwd_allreduce_microstep: 156.40 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13053
total_samples=24803, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:09,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77
[2025-08-03 06:42:09,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.00 | bwd_microstep: 1776.19 | bwd_inner_microstep: 1689.05 | bwd_allreduce_microstep: 87.07 | step_microstep: 138.81
[2025-08-03 06:42:09,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2986.75 | bwd: 7363.01 | bwd_inner: 6933.33 | bwd_allreduce: 429.43 | step: 139.41
{'loss': 0.7407, 'learning_rate': 1.7419586554574364e-06, 'epoch': 0.81}
 81%|████████▏ | 1626/2000 [4:58:41<1:08:20, 10.96s/it]
 81%|████████▏ | 1627/2000 [4:58:52<1:08:28, 11.01s/it]
 81%|████████▏ | 1628/2000 [4:59:03<1:07:43, 10.92s/it]
 81%|████████▏ | 1629/2000 [4:59:14<1:07:20, 10.89s/it]
 82%|████████▏ | 1630/2000 [4:59:24<1:06:58, 10.86s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12960
total_samples=24807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:12,855] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.12 | bwd_microstep: 2078.92 | bwd_inner_microstep: 1856.41 | bwd_allreduce_microstep: 222.44 | step_microstep: 0.20
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13034
total_samples=24812, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:15,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.58 | bwd_microstep: 2114.66 | bwd_inner_microstep: 1932.34 | bwd_allreduce_microstep: 182.25 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13545
total_samples=24816, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:18,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.15 | bwd_microstep: 1741.40 | bwd_inner_microstep: 1687.96 | bwd_allreduce_microstep: 53.37 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12866
total_samples=24820, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:21,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.79
[2025-08-03 06:42:21,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.66 | bwd_microstep: 1966.68 | bwd_inner_microstep: 1799.84 | bwd_allreduce_microstep: 166.77 | step_microstep: 114.38
[2025-08-03 06:42:21,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.43 | bwd: 7901.75 | bwd_inner: 7276.54 | bwd_allreduce: 624.91 | step: 114.93
{'loss': 0.7376, 'learning_rate': 1.7328368931907114e-06, 'epoch': 0.82}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13037
total_samples=24824, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:23,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.37 | bwd_microstep: 1929.18 | bwd_inner_microstep: 1816.50 | bwd_allreduce_microstep: 112.62 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14470
total_samples=24829, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:26,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.07 | bwd_microstep: 1780.40 | bwd_inner_microstep: 1745.09 | bwd_allreduce_microstep: 35.24 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13667
total_samples=24833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:28,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.53 | bwd_microstep: 1772.29 | bwd_inner_microstep: 1708.71 | bwd_allreduce_microstep: 63.52 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11737
total_samples=24836, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:31,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.74
[2025-08-03 06:42:31,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.68 | bwd_microstep: 2012.67 | bwd_inner_microstep: 1806.15 | bwd_allreduce_microstep: 206.45 | step_microstep: 112.46
[2025-08-03 06:42:31,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2845.58 | bwd: 7494.61 | bwd_inner: 7076.43 | bwd_allreduce: 417.93 | step: 113.04
{'loss': 0.736, 'learning_rate': 1.723736810595461e-06, 'epoch': 0.82}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13335
total_samples=24840, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:34,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.43 | bwd_microstep: 1740.32 | bwd_inner_microstep: 1673.37 | bwd_allreduce_microstep: 66.87 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13686
total_samples=24845, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:37,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.43 | bwd_microstep: 1872.15 | bwd_inner_microstep: 1709.52 | bwd_allreduce_microstep: 162.54 | step_microstep: 0.31
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13636
total_samples=24849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:39,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.55 | bwd_microstep: 1783.59 | bwd_inner_microstep: 1710.84 | bwd_allreduce_microstep: 72.68 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13551
total_samples=24853, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:43,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 06:42:43,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.67 | bwd_microstep: 2252.92 | bwd_inner_microstep: 2135.34 | bwd_allreduce_microstep: 117.52 | step_microstep: 402.84
[2025-08-03 06:42:43,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.01 | bwd: 7649.03 | bwd_inner: 7229.08 | bwd_allreduce: 419.70 | step: 403.57
{'loss': 0.7355, 'learning_rate': 1.7146584315355886e-06, 'epoch': 0.82}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13083
total_samples=24857, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:45,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.24 | bwd_microstep: 1810.79 | bwd_inner_microstep: 1686.29 | bwd_allreduce_microstep: 124.43 | step_microstep: 0.84
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15963
total_samples=24861, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:48,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.65 | bwd_microstep: 1890.15 | bwd_inner_microstep: 1848.74 | bwd_allreduce_microstep: 41.35 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14355
total_samples=24865, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:51,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.63 | bwd_microstep: 2234.21 | bwd_inner_microstep: 2227.01 | bwd_allreduce_microstep: 7.13 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11676
total_samples=24868, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:53,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 06:42:53,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.73 | bwd_microstep: 1766.78 | bwd_inner_microstep: 1541.89 | bwd_allreduce_microstep: 224.81 | step_microstep: 141.15
[2025-08-03 06:42:53,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.17 | bwd: 7701.99 | bwd_inner: 7303.94 | bwd_allreduce: 397.81 | step: 142.36
{'loss': 0.7233, 'learning_rate': 1.7056017798180824e-06, 'epoch': 0.82}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13676
total_samples=24872, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:56,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.45 | bwd_microstep: 2176.06 | bwd_inner_microstep: 2042.17 | bwd_allreduce_microstep: 133.82 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13941
total_samples=24876, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:42:59,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.40 | bwd_microstep: 1796.88 | bwd_inner_microstep: 1689.11 | bwd_allreduce_microstep: 107.71 | step_microstep: 0.25
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13762
total_samples=24881, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:02,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.44 | bwd_microstep: 1917.33 | bwd_inner_microstep: 1817.24 | bwd_allreduce_microstep: 100.03 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13736
total_samples=24885, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:04,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.78
[2025-08-03 06:43:04,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.70 | bwd_microstep: 1749.31 | bwd_inner_microstep: 1690.86 | bwd_allreduce_microstep: 58.38 | step_microstep: 134.19
[2025-08-03 06:43:04,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2859.92 | bwd: 7639.63 | bwd_inner: 7239.38 | bwd_allreduce: 400.01 | step: 134.69
{'loss': 0.7369, 'learning_rate': 1.69656687919296e-06, 'epoch': 0.82}
82%|████████▏ | 1631/2000 [4:59:35<1:07:18, 10.94s/it]
82%|████████▏ | 1632/2000 [4:59:46<1:06:47, 10.89s/it]
82%|████████▏ | 1633/2000 [4:59:57<1:07:05, 10.97s/it]
82%|████████▏ | 1634/2000 [5:00:08<1:06:50, 10.96s/it]
82%|████████▏ | 1635/2000 [5:00:19<1:06:35, 10.95s/it]
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15097
total_samples=24890, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:07,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.57 | bwd_microstep: 2024.85 | bwd_inner_microstep: 1903.34 | bwd_allreduce_microstep: 121.43 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13271
total_samples=24894, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:10,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.00 | bwd_microstep: 1766.58 | bwd_inner_microstep: 1691.31 | bwd_allreduce_microstep: 75.19 | step_microstep: 0.19
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12589
total_samples=24898, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:12,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.51 | bwd_microstep: 1781.67 | bwd_inner_microstep: 1614.20 | bwd_allreduce_microstep: 167.39 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13742
total_samples=24902, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:16,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.22
[2025-08-03 06:43:16,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.83 | bwd_microstep: 2080.43 | bwd_inner_microstep: 2015.58 | bwd_allreduce_microstep: 64.77 | step_microstep: 433.73
[2025-08-03 06:43:16,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.83 | bwd: 7653.58 | bwd_inner: 7224.43 | bwd_allreduce: 428.88 | step: 434.36
{'loss': 0.7277, 'learning_rate': 1.687553753353195e-06, 'epoch': 0.82}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15903
total_samples=24906, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:18,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.29 | bwd_microstep: 1791.14 | bwd_inner_microstep: 1768.33 | bwd_allreduce_microstep: 22.74 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13071
total_samples=24910, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:21,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.71 | bwd_microstep: 1921.47 | bwd_inner_microstep: 1881.92 | bwd_allreduce_microstep: 39.47 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13761
total_samples=24914, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:24,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.25 | bwd_microstep: 1893.88 | bwd_inner_microstep: 1691.25 | bwd_allreduce_microstep: 202.57 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12768
total_samples=24918, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:26,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 06:43:26,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.28 | bwd_microstep: 1797.07 | bwd_inner_microstep: 1609.22 | bwd_allreduce_microstep: 187.79 | step_microstep: 139.23
[2025-08-03 06:43:26,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.45 | bwd: 7403.62 | bwd_inner: 6950.72 | bwd_allreduce: 452.65 | step: 139.66
{'loss': 0.7265, 'learning_rate': 1.6785624259346556e-06, 'epoch': 0.82}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12580
total_samples=24922, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:29,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.07 | bwd_microstep: 1782.75 | bwd_inner_microstep: 1612.19 | bwd_allreduce_microstep: 170.48 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14122
total_samples=24926, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:32,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.78 | bwd_microstep: 1930.87 | bwd_inner_microstep: 1872.23 | bwd_allreduce_microstep: 58.58 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11839
total_samples=24929, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:34,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.31 | bwd_microstep: 1790.95 | bwd_inner_microstep: 1561.92 | bwd_allreduce_microstep: 228.95 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13091
total_samples=24933, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:37,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.70
[2025-08-03 06:43:37,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.66 | bwd_microstep: 1778.13 | bwd_inner_microstep: 1684.49 | bwd_allreduce_microstep: 93.56 | step_microstep: 137.34
[2025-08-03 06:43:37,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2870.76 | bwd: 7282.76 | bwd_inner: 6730.84 | bwd_allreduce: 551.65 | step: 137.86
{'loss': 0.7484, 'learning_rate': 1.669592920516049e-06, 'epoch': 0.82}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13321
total_samples=24938, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:39,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.56 | bwd_microstep: 1757.22 | bwd_inner_microstep: 1681.84 | bwd_allreduce_microstep: 75.31 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13071
total_samples=24943, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:42,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.91 | bwd_microstep: 2017.55 | bwd_inner_microstep: 1843.29 | bwd_allreduce_microstep: 174.19 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12059
total_samples=24946, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:45,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.33 | bwd_microstep: 1832.57 | bwd_inner_microstep: 1599.52 | bwd_allreduce_microstep: 232.99 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13860
total_samples=24951, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:48,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 06:43:48,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.06 | bwd_microstep: 2024.76 | bwd_inner_microstep: 1888.60 | bwd_allreduce_microstep: 136.06 | step_microstep: 113.81
[2025-08-03 06:43:48,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.78 | bwd: 7632.15 | bwd_inner: 7013.25 | bwd_allreduce: 618.65 | step: 114.26
{'loss': 0.7358, 'learning_rate': 1.660645260618864e-06, 'epoch': 0.82}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13166
total_samples=24955, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:50,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.25 | bwd_microstep: 1831.65 | bwd_inner_microstep: 1700.83 | bwd_allreduce_microstep: 130.75 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13029
total_samples=24959, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:53,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.19 | bwd_microstep: 1713.08 | bwd_inner_microstep: 1605.76 | bwd_allreduce_microstep: 107.24 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11846
total_samples=24962, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:56,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.75 | bwd_microstep: 1958.66 | bwd_inner_microstep: 1599.77 | bwd_allreduce_microstep: 358.82 | step_microstep: 0.29
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13319
total_samples=24968, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:43:58,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.56
[2025-08-03 06:43:58,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.55 | bwd_microstep: 1897.28 | bwd_inner_microstep: 1647.32 | bwd_allreduce_microstep: 249.89 | step_microstep: 109.40
[2025-08-03 06:43:58,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.67 | bwd: 7400.72 | bwd_inner: 6553.68 | bwd_allreduce: 846.79 | step: 109.93
{'loss': 0.7369, 'learning_rate': 1.6517194697072903e-06, 'epoch': 0.82}
82%|████████▏ | 1636/2000 [5:00:31<1:06:59, 11.04s/it]
82%|████████▏ | 1637/2000 [5:00:41<1:06:03, 10.92s/it]
82%|████████▏ | 1638/2000 [5:00:52<1:05:16, 10.82s/it]
82%|████████▏ | 1639/2000 [5:01:03<1:05:11, 10.84s/it]
82%|████████▏ | 1640/2000 [5:01:13<1:04:35, 10.76s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11962
total_samples=24971, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:01,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.72 | bwd_microstep: 2024.06 | bwd_inner_microstep: 1930.05 | bwd_allreduce_microstep: 93.95 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11868
total_samples=24974, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:04,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.03 | bwd_microstep: 1789.60 | bwd_inner_microstep: 1556.34 | bwd_allreduce_microstep: 233.19 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13653
total_samples=24978, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:07,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.48 | bwd_microstep: 2026.99 | bwd_inner_microstep: 1888.96 | bwd_allreduce_microstep: 137.96 | step_microstep: 0.76
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14017
total_samples=24982, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:09,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.77
[2025-08-03 06:44:09,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.05 | bwd_microstep: 1828.69 | bwd_inner_microstep: 1733.77 | bwd_allreduce_microstep: 94.85 | step_microstep: 133.57
[2025-08-03 06:44:09,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.19 | bwd: 7669.39 | bwd_inner: 7109.10 | bwd_allreduce: 560.03 | step: 134.57
{'loss': 0.7386, 'learning_rate': 1.6428155711881722e-06, 'epoch': 0.82}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 16340
total_samples=24986, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:12,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.42 | bwd_microstep: 1962.81 | bwd_inner_microstep: 1894.54 | bwd_allreduce_microstep: 68.20 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12092
total_samples=24989, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:15,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.74 | bwd_microstep: 2014.02 | bwd_inner_microstep: 1793.39 | bwd_allreduce_microstep: 220.57 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13316
total_samples=24993, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:17,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.48 | bwd_microstep: 1772.86 | bwd_inner_microstep: 1705.57 | bwd_allreduce_microstep: 67.22 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12773
total_samples=24997, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:20,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 06:44:20,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.37 | bwd_microstep: 1852.62 | bwd_inner_microstep: 1669.87 | bwd_allreduce_microstep: 182.68 | step_microstep: 134.74
[2025-08-03 06:44:20,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.94 | bwd: 7602.37 | bwd_inner: 7063.37 | bwd_allreduce: 538.75 | step: 135.28
{'loss': 0.7241, 'learning_rate': 1.633933588410952e-06, 'epoch': 0.82}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12417
total_samples=25000, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:23,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.76 | bwd_microstep: 1863.18 | bwd_inner_microstep: 1603.68 | bwd_allreduce_microstep: 259.42 | step_microstep: 1.57
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11963
total_samples=25003, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:26,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.28 | bwd_microstep: 2111.56 | bwd_inner_microstep: 1774.69 | bwd_allreduce_microstep: 336.81 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13248
total_samples=25007, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:28,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.35 | bwd_microstep: 1781.73 | bwd_inner_microstep: 1689.71 | bwd_allreduce_microstep: 91.96 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12586
total_samples=25010, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:31,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.12
[2025-08-03 06:44:31,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.32 | bwd_microstep: 1741.89 | bwd_inner_microstep: 1576.86 | bwd_allreduce_microstep: 164.96 | step_microstep: 160.21
[2025-08-03 06:44:31,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.65 | bwd: 7498.41 | bwd_inner: 6644.94 | bwd_allreduce: 853.23 | step: 162.01
{'loss': 0.7319, 'learning_rate': 1.6250735446675914e-06, 'epoch': 0.82}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13151
total_samples=25014, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:33,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.79 | bwd_microstep: 1731.44 | bwd_inner_microstep: 1652.65 | bwd_allreduce_microstep: 78.72 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11637
total_samples=25017, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:36,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.81 | bwd_microstep: 1762.53 | bwd_inner_microstep: 1535.61 | bwd_allreduce_microstep: 226.85 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14260
total_samples=25021, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:39,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.06 | bwd_microstep: 2305.15 | bwd_inner_microstep: 2062.30 | bwd_allreduce_microstep: 242.78 | step_microstep: 0.26
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13824
total_samples=25025, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:42,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 06:44:42,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.57 | bwd_microstep: 1747.75 | bwd_inner_microstep: 1645.93 | bwd_allreduce_microstep: 101.75 | step_microstep: 136.34
[2025-08-03 06:44:42,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.16 | bwd: 7546.92 | bwd_inner: 6896.49 | bwd_allreduce: 650.18 | step: 136.83
{'loss': 0.7322, 'learning_rate': 1.6162354631925203e-06, 'epoch': 0.82}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11594
total_samples=25028, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:44,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.36 | bwd_microstep: 1843.91 | bwd_inner_microstep: 1533.19 | bwd_allreduce_microstep: 310.65 | step_microstep: 0.26
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12655
total_samples=25032, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:47,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.67 | bwd_microstep: 1815.19 | bwd_inner_microstep: 1623.71 | bwd_allreduce_microstep: 191.41 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11759
total_samples=25035, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:49,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.70 | bwd_microstep: 1723.29 | bwd_inner_microstep: 1530.57 | bwd_allreduce_microstep: 192.66 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14924
total_samples=25039, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:52,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 06:44:52,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.60 | bwd_microstep: 1791.53 | bwd_inner_microstep: 1739.58 | bwd_allreduce_microstep: 51.88 | step_microstep: 134.83
[2025-08-03 06:44:52,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.25 | bwd: 7173.97 | bwd_inner: 6427.04 | bwd_allreduce: 746.68 | step: 135.33
{'loss': 0.7329, 'learning_rate': 1.607419367162577e-06, 'epoch': 0.82}
82%|████████▏ | 1641/2000 [5:01:24<1:04:41, 10.81s/it]
82%|████████▏ | 1642/2000 [5:01:35<1:04:37, 10.83s/it]
82%|████████▏ | 1643/2000 [5:01:46<1:04:19, 10.81s/it]
82%|████████▏ | 1644/2000 [5:01:57<1:04:05, 10.80s/it]
82%|████████▏ | 1645/2000 [5:02:07<1:03:13, 10.69s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12507
total_samples=25042, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:55,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.20 | bwd_microstep: 1810.25 | bwd_inner_microstep: 1615.41 | bwd_allreduce_microstep: 194.77 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11563
total_samples=25045, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:44:57,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.62 | bwd_microstep: 1799.04 | bwd_inner_microstep: 1572.90 | bwd_allreduce_microstep: 226.06 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13134
total_samples=25048, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:00,403] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.90 | bwd_microstep: 1852.81 | bwd_inner_microstep: 1743.31 | bwd_allreduce_microstep: 109.43 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13195
total_samples=25052, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:03,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 06:45:03,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.07 | bwd_microstep: 2114.97 | bwd_inner_microstep: 1948.97 | bwd_allreduce_microstep: 165.94 | step_microstep: 110.07
[2025-08-03 06:45:03,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.71 | bwd: 7577.12 | bwd_inner: 6880.59 | bwd_allreduce: 696.28 | step: 110.53
{'loss': 0.7357, 'learning_rate': 1.5986252796969482e-06, 'epoch': 0.82}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11997
total_samples=25055, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:05,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.19 | bwd_microstep: 1754.53 | bwd_inner_microstep: 1555.66 | bwd_allreduce_microstep: 198.80 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13838
total_samples=25059, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:08,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.43 | bwd_microstep: 2255.41 | bwd_inner_microstep: 1954.14 | bwd_allreduce_microstep: 301.21 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11629
total_samples=25062, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:11,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.64 | bwd_microstep: 1940.99 | bwd_inner_microstep: 1748.55 | bwd_allreduce_microstep: 192.37 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12839
total_samples=25066, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:14,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 06:45:14,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.98 | bwd_microstep: 2043.00 | bwd_inner_microstep: 2037.01 | bwd_allreduce_microstep: 5.93 | step_microstep: 107.74
[2025-08-03 06:45:14,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.18 | bwd: 7993.97 | bwd_inner: 7295.36 | bwd_allreduce: 698.38 | step: 108.36
{'loss': 0.7357, 'learning_rate': 1.589853223857103e-06, 'epoch': 0.82}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14874
total_samples=25070, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:17,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.77 | bwd_microstep: 1971.81 | bwd_inner_microstep: 1807.30 | bwd_allreduce_microstep: 164.45 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13380
total_samples=25074, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:19,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.10 | bwd_microstep: 1792.69 | bwd_inner_microstep: 1704.47 | bwd_allreduce_microstep: 88.14 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13698
total_samples=25078, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:22,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.05 | bwd_microstep: 1903.85 | bwd_inner_microstep: 1879.85 | bwd_allreduce_microstep: 23.94 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11791
total_samples=25081, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:25,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.91
[2025-08-03 06:45:25,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.55 | bwd_microstep: 2012.45 | bwd_inner_microstep: 1565.47 | bwd_allreduce_microstep: 446.89 | step_microstep: 117.08
[2025-08-03 06:45:25,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.40 | bwd: 7680.86 | bwd_inner: 6957.09 | bwd_allreduce: 723.50 | step: 117.55
{'loss': 0.7428, 'learning_rate': 1.5811032226467304e-06, 'epoch': 0.82}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11669
total_samples=25084, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:28,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.90 | bwd_microstep: 1999.51 | bwd_inner_microstep: 1610.10 | bwd_allreduce_microstep: 389.35 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13791
total_samples=25088, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:30,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.00 | bwd_microstep: 1793.17 | bwd_inner_microstep: 1681.44 | bwd_allreduce_microstep: 111.67 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11882
total_samples=25091, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:33,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.68 | bwd_microstep: 1939.80 | bwd_inner_microstep: 1607.18 | bwd_allreduce_microstep: 332.56 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11911
total_samples=25094, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:36,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.19
[2025-08-03 06:45:36,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.04 | bwd_microstep: 1808.15 | bwd_inner_microstep: 1546.88 | bwd_allreduce_microstep: 261.21 | step_microstep: 404.56
[2025-08-03 06:45:36,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.55 | bwd: 7540.69 | bwd_inner: 6445.59 | bwd_allreduce: 1094.86 | step: 405.15
{'loss': 0.7468, 'learning_rate': 1.5723752990116948e-06, 'epoch': 0.82}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11737
total_samples=25097, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:39,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.92 | bwd_microstep: 2059.18 | bwd_inner_microstep: 1719.89 | bwd_allreduce_microstep: 339.22 | step_microstep: 0.18
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13513
total_samples=25101, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:42,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.39 | bwd_microstep: 1760.88 | bwd_inner_microstep: 1676.44 | bwd_allreduce_microstep: 84.37 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13441
total_samples=25106, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:44,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.09 | bwd_microstep: 1810.85 | bwd_inner_microstep: 1706.66 | bwd_allreduce_microstep: 104.12 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13060
total_samples=25110, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:47,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.71
[2025-08-03 06:45:47,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.25 | bwd_microstep: 1975.05 | bwd_inner_microstep: 1865.22 | bwd_allreduce_microstep: 109.76 | step_microstep: 155.74
[2025-08-03 06:45:47,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2872.56 | bwd: 7606.01 | bwd_inner: 6968.20 | bwd_allreduce: 637.56 | step: 156.29
{'loss': 0.7537, 'learning_rate': 1.5636694758399563e-06, 'epoch': 0.82}
82%|████████▏ | 1646/2000 [5:02:18<1:03:13, 10.72s/it]
82%|████████▏ | 1647/2000 [5:02:29<1:03:54, 10.86s/it]
82%|████████▏ | 1648/2000 [5:02:40<1:03:49, 10.88s/it]
82%|████████▏ | 1649/2000 [5:02:51<1:03:58, 10.94s/it]
82%|████████▎ | 1650/2000 [5:03:02<1:03:47, 10.94s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11996
total_samples=25113, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:50,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 751.56 | bwd_microstep: 2079.86 | bwd_inner_microstep: 1865.88 | bwd_allreduce_microstep: 213.91 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13473
total_samples=25117, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:53,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.84 | bwd_microstep: 1841.97 | bwd_inner_microstep: 1689.17 | bwd_allreduce_microstep: 152.69 | step_microstep: 0.22
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13038
total_samples=25121, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:55,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.72 | bwd_microstep: 1742.56 | bwd_inner_microstep: 1655.45 | bwd_allreduce_microstep: 87.04 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13193
total_samples=25125, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:45:58,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.66
[2025-08-03 06:45:58,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.86 | bwd_microstep: 2013.06 | bwd_inner_microstep: 1872.17 | bwd_allreduce_microstep: 140.82 | step_microstep: 114.99
[2025-08-03 06:45:58,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2865.92 | bwd: 7677.49 | bwd_inner: 7082.67 | bwd_allreduce: 594.55 | step: 115.46
{'loss': 0.7257, 'learning_rate': 1.5549857759615195e-06, 'epoch': 0.83}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12404
total_samples=25128, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:01,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.75 | bwd_microstep: 2342.25 | bwd_inner_microstep: 2044.61 | bwd_allreduce_microstep: 297.58 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13365
total_samples=25133, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:04,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.13 | bwd_microstep: 1788.94 | bwd_inner_microstep: 1672.15 | bwd_allreduce_microstep: 116.72 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11746
total_samples=25136, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:07,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.35 | bwd_microstep: 2053.62 | bwd_inner_microstep: 1829.71 | bwd_allreduce_microstep: 223.84 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12190
total_samples=25139, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:09,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 06:46:09,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.11 | bwd_microstep: 1726.99 | bwd_inner_microstep: 1556.18 | bwd_allreduce_microstep: 170.74 | step_microstep: 144.30
[2025-08-03 06:46:09,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.28 | bwd: 7911.85 | bwd_inner: 7102.65 | bwd_allreduce: 808.96 | step: 144.66
{'loss': 0.7335, 'learning_rate': 1.5463242221483742e-06, 'epoch': 0.83}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13271
total_samples=25143, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:12,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.21 | bwd_microstep: 1798.16 | bwd_inner_microstep: 1665.74 | bwd_allreduce_microstep: 132.36 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11869
total_samples=25146, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:14,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 662.91 | bwd_microstep: 1690.67 | bwd_inner_microstep: 1538.10 | bwd_allreduce_microstep: 152.49 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13523
total_samples=25150, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:17,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.03 | bwd_microstep: 1749.29 | bwd_inner_microstep: 1686.81 | bwd_allreduce_microstep: 62.41 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12700
total_samples=25154, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:19,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 38.50
[2025-08-03 06:46:19,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.38 | bwd_microstep: 1768.57 | bwd_inner_microstep: 1623.71 | bwd_allreduce_microstep: 144.78 | step_microstep: 143.89
[2025-08-03 06:46:19,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2747.45 | bwd: 7006.74 | bwd_inner: 6514.36 | bwd_allreduce: 492.13 | step: 144.25
{'loss': 0.7272, 'learning_rate': 1.5376848371144404e-06, 'epoch': 0.83}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12528
total_samples=25157, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:22,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 981.77 | bwd_microstep: 1979.47 | bwd_inner_microstep: 1598.76 | bwd_allreduce_microstep: 380.65 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14460
total_samples=25161, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:25,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.15 | bwd_microstep: 1715.09 | bwd_inner_microstep: 1697.62 | bwd_allreduce_microstep: 17.40 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13455
total_samples=25165, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:28,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.44 | bwd_microstep: 2869.41 | bwd_inner_microstep: 2579.13 | bwd_allreduce_microstep: 290.22 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12703
total_samples=25169, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:31,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 06:46:31,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.58 | bwd_microstep: 1863.90 | bwd_inner_microstep: 1604.45 | bwd_allreduce_microstep: 259.36 | step_microstep: 138.89
[2025-08-03 06:46:31,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3066.87 | bwd: 8427.93 | bwd_inner: 7479.95 | bwd_allreduce: 947.73 | step: 139.41
{'loss': 0.7357, 'learning_rate': 1.5290676435154949e-06, 'epoch': 0.83}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13306
total_samples=25173, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:34,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.59 | bwd_microstep: 1874.47 | bwd_inner_microstep: 1818.68 | bwd_allreduce_microstep: 55.73 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309
total_samples=25177, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:36,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.37 | bwd_microstep: 1780.42 | bwd_inner_microstep: 1690.49 | bwd_allreduce_microstep: 89.86 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13273
total_samples=25181, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:39,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.54 | bwd_microstep: 1930.12 | bwd_inner_microstep: 1804.51 | bwd_allreduce_microstep: 125.54 | step_microstep: 0.14
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12926
total_samples=25185, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:42,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 06:46:42,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 752.78 | bwd_microstep: 2020.90 | bwd_inner_microstep: 1864.18 | bwd_allreduce_microstep: 156.64 | step_microstep: 113.80
[2025-08-03 06:46:42,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2878.21 | bwd: 7605.97 | bwd_inner: 7177.86 | bwd_allreduce: 427.87 | step: 114.27
{'loss': 0.7423, 'learning_rate': 1.520472663949122e-06, 'epoch': 0.83}
83%|████████▎ | 1651/2000 [5:03:13<1:03:39, 10.94s/it]
83%|████████▎ | 1652/2000 [5:03:24<1:03:48, 11.00s/it]
83%|████████▎ | 1653/2000 [5:03:34<1:02:13, 10.76s/it]
83%|████████▎ | 1654/2000 [5:03:46<1:04:03, 11.11s/it]
83%|████████▎ | 1655/2000 [5:03:57<1:03:30, 11.04s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14351
total_samples=25189, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:45,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.41 | bwd_microstep: 2021.24 | bwd_inner_microstep: 1832.33 | bwd_allreduce_microstep: 188.83 | step_microstep: 0.26
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13396
total_samples=25193, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:47,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.54 | bwd_microstep: 1752.41 | bwd_inner_microstep: 1656.37 | bwd_allreduce_microstep: 95.97 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13464
total_samples=25197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:50,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.20 | bwd_microstep: 1726.79 | bwd_inner_microstep: 1666.64 | bwd_allreduce_microstep: 60.08 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11623
total_samples=25200, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:53,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 19.40
[2025-08-03 06:46:53,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.42 | bwd_microstep: 1785.64 | bwd_inner_microstep: 1530.64 | bwd_allreduce_microstep: 254.93 | step_microstep: 148.62
[2025-08-03 06:46:53,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.49 | bwd: 7286.15 | bwd_inner: 6685.98 | bwd_allreduce: 599.89 | step: 149.45
{'loss': 0.7195, 'learning_rate': 1.511899920954656e-06, 'epoch': 0.83}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11637
total_samples=25203, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:55,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.60 | bwd_microstep: 1818.80 | bwd_inner_microstep: 1567.66 | bwd_allreduce_microstep: 251.07 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13038
total_samples=25207, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:46:58,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.65 | bwd_microstep: 1801.70 | bwd_inner_microstep: 1734.03 | bwd_allreduce_microstep: 67.60 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13347
total_samples=25211, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:00,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.78 | bwd_microstep: 1761.13 | bwd_inner_microstep: 1686.21 | bwd_allreduce_microstep: 74.85 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12063
total_samples=25214, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:03,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 06:47:03,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.45 | bwd_microstep: 1814.06 | bwd_inner_microstep: 1573.51 | bwd_allreduce_microstep: 240.49 | step_microstep: 116.94
[2025-08-03 06:47:03,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.41 | bwd: 7195.74 | bwd_inner: 6561.41 | bwd_allreduce: 634.08 | step: 117.44
{'loss': 0.733, 'learning_rate': 1.5033494370131162e-06, 'epoch': 0.83}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=25217, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:06,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.20 | bwd_microstep: 1857.34 | bwd_inner_microstep: 1531.73 | bwd_allreduce_microstep: 325.55 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12352
total_samples=25220, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:08,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.57 | bwd_microstep: 1843.16 | bwd_inner_microstep: 1600.81 | bwd_allreduce_microstep: 242.28 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13781
total_samples=25224, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:11,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.36 | bwd_microstep: 1934.90 | bwd_inner_microstep: 1748.70 | bwd_allreduce_microstep: 186.11 | step_microstep: 0.84
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12617
total_samples=25227, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:14,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 06:47:14,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.64 | bwd_microstep: 1864.07 | bwd_inner_microstep: 1613.14 | bwd_allreduce_microstep: 250.85 | step_microstep: 152.75
[2025-08-03 06:47:14,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.72 | bwd: 7499.51 | bwd_inner: 6494.38 | bwd_allreduce: 1004.87 | step: 153.82
{'loss': 0.7331, 'learning_rate': 1.4948212345471492e-06, 'epoch': 0.83}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12143
total_samples=25230, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:17,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.19 | bwd_microstep: 1879.89 | bwd_inner_microstep: 1750.05 | bwd_allreduce_microstep: 129.77 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13790
total_samples=25234, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:19,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.60 | bwd_microstep: 1688.83 | bwd_inner_microstep: 1657.94 | bwd_allreduce_microstep: 30.83 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13526
total_samples=25238, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:22,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.70 | bwd_microstep: 1789.12 | bwd_inner_microstep: 1702.46 | bwd_allreduce_microstep: 86.59 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13961
total_samples=25243, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:24,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 06:47:24,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.52 | bwd_microstep: 1953.00 | bwd_inner_microstep: 1922.18 | bwd_allreduce_microstep: 30.75 | step_microstep: 119.37
[2025-08-03 06:47:24,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.95 | bwd: 7310.89 | bwd_inner: 7032.63 | bwd_allreduce: 278.03 | step: 119.87
{'loss': 0.7324, 'learning_rate': 1.4863153359209693e-06, 'epoch': 0.83}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11418
total_samples=25246, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:27,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.53 | bwd_microstep: 2132.65 | bwd_inner_microstep: 2001.17 | bwd_allreduce_microstep: 131.42 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13295
total_samples=25250, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:30,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.52 | bwd_microstep: 2089.19 | bwd_inner_microstep: 1966.67 | bwd_allreduce_microstep: 122.44 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12792
total_samples=25254, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:33,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.47 | bwd_microstep: 1765.80 | bwd_inner_microstep: 1658.58 | bwd_allreduce_microstep: 107.15 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13325
total_samples=25258, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:36,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 31.45
[2025-08-03 06:47:36,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.54 | bwd_microstep: 2005.24 | bwd_inner_microstep: 1872.98 | bwd_allreduce_microstep: 132.19 | step_microstep: 139.19
[2025-08-03 06:47:36,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.99 | bwd: 7992.93 | bwd_inner: 7499.39 | bwd_allreduce: 493.28 | step: 139.81
{'loss': 0.7441, 'learning_rate': 1.4778317634403082e-06, 'epoch': 0.83}
83%|████████▎ | 1656/2000 [5:04:07<1:02:24, 10.89s/it]
83%|████████▎ | 1657/2000 [5:04:18<1:01:29, 10.76s/it]
83%|████████▎ | 1658/2000 [5:04:29<1:01:17, 10.75s/it]
83%|████████▎ | 1659/2000 [5:04:39<1:00:48, 10.70s/it]
83%|████████▎ | 1660/2000 [5:04:51<1:01:31, 10.86s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11837
total_samples=25262, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:38,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.96 | bwd_microstep: 1961.38 | bwd_inner_microstep: 1566.09 | bwd_allreduce_microstep: 395.23 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13756
total_samples=25266, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:41,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.67 | bwd_microstep: 2079.45 | bwd_inner_microstep: 2073.51 | bwd_allreduce_microstep: 5.87 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 11899
total_samples=25270, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:44,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.33 | bwd_microstep: 2440.58 | bwd_inner_microstep: 2092.57 | bwd_allreduce_microstep: 347.95 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14378
total_samples=25274, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:47,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.79
[2025-08-03 06:47:47,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.40 | bwd_microstep: 1714.11 | bwd_inner_microstep: 1691.20 | bwd_allreduce_microstep: 22.85 | step_microstep: 141.37
[2025-08-03 06:47:47,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.29 | bwd: 8195.57 | bwd_inner: 7423.37 | bwd_allreduce: 771.98 | step: 141.81
{'loss': 0.7358, 'learning_rate': 1.469370539352345e-06, 'epoch': 0.83}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11798
total_samples=25277, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:50,321] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.67 | bwd_microstep: 1889.46 | bwd_inner_microstep: 1614.72 | bwd_allreduce_microstep: 274.67 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11905
total_samples=25280, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:52,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.00 | bwd_microstep: 1855.71 | bwd_inner_microstep: 1594.90 | bwd_allreduce_microstep: 260.74 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13811
total_samples=25284, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:55,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.16 | bwd_microstep: 1924.14 | bwd_inner_microstep: 1676.79 | bwd_allreduce_microstep: 247.28 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12287
total_samples=25287, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:47:58,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 06:47:58,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.86 | bwd_microstep: 1780.43 | bwd_inner_microstep: 1602.69 | bwd_allreduce_microstep: 177.67 | step_microstep: 132.31
[2025-08-03 06:47:58,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2859.62 | bwd: 7449.79 | bwd_inner: 6489.10 | bwd_allreduce: 960.44 | step: 132.80
{'loss': 0.7266, 'learning_rate': 1.460931685845649e-06, 'epoch': 0.83}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13734
total_samples=25291, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:00,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.52 | bwd_microstep: 1778.22 | bwd_inner_microstep: 1724.36 | bwd_allreduce_microstep: 53.80 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12112
total_samples=25294, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:03,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.83 | bwd_microstep: 2075.11 | bwd_inner_microstep: 1731.95 | bwd_allreduce_microstep: 343.11 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12006
total_samples=25297, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:06,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.59 | bwd_microstep: 1805.45 | bwd_inner_microstep: 1562.53 | bwd_allreduce_microstep: 242.85 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13705
total_samples=25301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:09,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.88
[2025-08-03 06:48:09,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.59 | bwd_microstep: 1966.37 | bwd_inner_microstep: 1803.31 | bwd_allreduce_microstep: 162.99 | step_microstep: 115.12
[2025-08-03 06:48:09,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.47 | bwd: 7625.21 | bwd_inner: 6822.15 | bwd_allreduce: 802.83 | step: 115.47
{'loss': 0.7295, 'learning_rate': 1.4525152250501362e-06, 'epoch': 0.83}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12020
total_samples=25304, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:11,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.18 | bwd_microstep: 1879.66 | bwd_inner_microstep: 1616.24 | bwd_allreduce_microstep: 263.35 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11945
total_samples=25307, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:14,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.88 | bwd_microstep: 1812.07 | bwd_inner_microstep: 1566.47 | bwd_allreduce_microstep: 245.54 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14851
total_samples=25311, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:17,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.86 | bwd_microstep: 1781.69 | bwd_inner_microstep: 1764.77 | bwd_allreduce_microstep: 16.86 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12261
total_samples=25314, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:19,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 06:48:19,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.86 | bwd_microstep: 1749.83 | bwd_inner_microstep: 1560.83 | bwd_allreduce_microstep: 188.94 | step_microstep: 128.37
[2025-08-03 06:48:19,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.70 | bwd: 7223.30 | bwd_inner: 6508.30 | bwd_allreduce: 714.76 | step: 128.85
{'loss': 0.7283, 'learning_rate': 1.4441211790369892e-06, 'epoch': 0.83}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13574
total_samples=25318, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:22,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.92 | bwd_microstep: 1763.61 | bwd_inner_microstep: 1685.01 | bwd_allreduce_microstep: 78.54 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12147
total_samples=25321, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:25,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.93 | bwd_microstep: 2118.47 | bwd_inner_microstep: 1878.15 | bwd_allreduce_microstep: 240.26 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12582
total_samples=25324, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:27,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.34 | bwd_microstep: 1987.03 | bwd_inner_microstep: 1794.35 | bwd_allreduce_microstep: 192.62 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11648
total_samples=25327, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:31,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 06:48:31,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.35 | bwd_microstep: 2220.68 | bwd_inner_microstep: 1843.99 | bwd_allreduce_microstep: 376.62 | step_microstep: 134.28
[2025-08-03 06:48:31,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2855.48 | bwd: 8089.85 | bwd_inner: 7201.49 | bwd_allreduce: 888.11 | step: 134.90
{'loss': 0.7207, 'learning_rate': 1.4357495698186186e-06, 'epoch': 0.83}
83%|████████▎ | 1661/2000 [5:05:02<1:02:21, 11.04s/it]
83%|████████▎ | 1662/2000 [5:05:13<1:01:42, 10.95s/it]
83%|████████▎ | 1663/2000 [5:05:24<1:01:20, 10.92s/it]
83%|████████▎ | 1664/2000 [5:05:34<1:00:24, 10.79s/it]
83%|████████▎ | 1665/2000 [5:05:45<1:01:09, 10.95s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11999
total_samples=25330, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:33,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.28 | bwd_microstep: 1794.04 | bwd_inner_microstep: 1589.14 | bwd_allreduce_microstep: 204.84 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14283
total_samples=25334, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:36,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.74 | bwd_microstep: 1816.41 | bwd_inner_microstep: 1774.52 | bwd_allreduce_microstep: 41.83 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16299
total_samples=25338, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:38,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.72 | bwd_microstep: 1851.10 | bwd_inner_microstep: 1844.58 | bwd_allreduce_microstep: 6.45 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12069
total_samples=25341, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:41,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 06:48:41,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.19 | bwd_microstep: 2204.26 | bwd_inner_microstep: 1841.77 | bwd_allreduce_microstep: 362.43 | step_microstep: 136.77
[2025-08-03 06:48:41,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.86 | bwd: 7665.87 | bwd_inner: 7050.00 | bwd_allreduce: 615.62 | step: 137.16
{'loss': 0.7286, 'learning_rate': 1.427400419348588e-06, 'epoch': 0.83}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13604
total_samples=25345, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:44,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.35 | bwd_microstep: 1938.48 | bwd_inner_microstep: 1702.78 | bwd_allreduce_microstep: 235.64 | step_microstep: 0.22
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13273
total_samples=25349, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:47,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.00 | bwd_microstep: 1795.55 | bwd_inner_microstep: 1662.37 | bwd_allreduce_microstep: 133.11 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13305
total_samples=25353, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:49,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.85 | bwd_microstep: 1758.71 | bwd_inner_microstep: 1682.69 | bwd_allreduce_microstep: 75.95 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11806
total_samples=25356, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:53,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 06:48:53,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 872.92 | bwd_microstep: 2622.32 | bwd_inner_microstep: 2467.65 | bwd_allreduce_microstep: 154.59 | step_microstep: 111.73
[2025-08-03 06:48:53,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2958.05 | bwd: 8115.10 | bwd_inner: 7515.49 | bwd_allreduce: 599.37 | step: 112.17
{'loss': 0.7236, 'learning_rate': 1.4190737495215746e-06, 'epoch': 0.83}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13081
total_samples=25360, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:56,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.27 | bwd_microstep: 2235.00 | bwd_inner_microstep: 2130.04 | bwd_allreduce_microstep: 104.89 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12009
total_samples=25363, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:48:59,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.05 | bwd_microstep: 1812.53 | bwd_inner_microstep: 1585.05 | bwd_allreduce_microstep: 227.42 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13233
total_samples=25367, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:01,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.67 | bwd_microstep: 1947.17 | bwd_inner_microstep: 1820.20 | bwd_allreduce_microstep: 126.91 | step_microstep: 0.31
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11880
total_samples=25370, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:04,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.94
[2025-08-03 06:49:04,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.76 | bwd_microstep: 2021.20 | bwd_inner_microstep: 1887.76 | bwd_allreduce_microstep: 133.37 | step_microstep: 134.57
[2025-08-03 06:49:04,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2885.66 | bwd: 8015.97 | bwd_inner: 7423.04 | bwd_allreduce: 592.67 | step: 135.32
{'loss': 0.7397, 'learning_rate': 1.4107695821733026e-06, 'epoch': 0.83}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13442
total_samples=25374, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:07,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.61 | bwd_microstep: 2040.65 | bwd_inner_microstep: 1718.14 | bwd_allreduce_microstep: 322.44 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12745
total_samples=25378, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:10,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.39 | bwd_microstep: 1944.34 | bwd_inner_microstep: 1660.18 | bwd_allreduce_microstep: 284.09 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14497
total_samples=25382, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:12,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.06 | bwd_microstep: 1812.48 | bwd_inner_microstep: 1760.67 | bwd_allreduce_microstep: 51.74 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12369
total_samples=25385, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:15,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 06:49:15,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.63 | bwd_microstep: 1789.47 | bwd_inner_microstep: 1563.71 | bwd_allreduce_microstep: 225.71 | step_microstep: 109.62
[2025-08-03 06:49:15,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.62 | bwd: 7587.01 | bwd_inner: 6702.70 | bwd_allreduce: 884.06 | step: 110.04
{'loss': 0.7321, 'learning_rate': 1.402487939080479e-06, 'epoch': 0.83}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14520
total_samples=25389, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:18,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.91 | bwd_microstep: 1734.71 | bwd_inner_microstep: 1704.82 | bwd_allreduce_microstep: 29.82 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13565
total_samples=25393, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:20,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.32 | bwd_microstep: 1789.84 | bwd_inner_microstep: 1710.46 | bwd_allreduce_microstep: 79.32 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14289
total_samples=25397, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:23,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.12 | bwd_microstep: 1792.66 | bwd_inner_microstep: 1755.97 | bwd_allreduce_microstep: 36.62 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11957
total_samples=25400, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:26,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.73
[2025-08-03 06:49:26,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.91 | bwd_microstep: 1845.69 | bwd_inner_microstep: 1561.08 | bwd_allreduce_microstep: 284.53 | step_microstep: 140.13
[2025-08-03 06:49:26,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.19 | bwd: 7162.96 | bwd_inner: 6732.33 | bwd_allreduce: 430.37 | step: 140.67
{'loss': 0.736, 'learning_rate': 1.3942288419607476e-06, 'epoch': 0.83}
 83%|████████▎ | 1666/2000 [5:05:56<1:00:55, 10.95s/it]
 83%|████████▎ | 1667/2000 [5:06:08<1:01:39, 11.11s/it]
 83%|████████▎ | 1668/2000 [5:06:19<1:01:49, 11.17s/it]
 83%|████████▎ | 1669/2000 [5:06:30<1:01:02, 11.07s/it]
 84%|████████▎ | 1670/2000 [5:06:40<59:48, 10.88s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13989
total_samples=25404, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:28,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.11 | bwd_microstep: 1820.42 | bwd_inner_microstep: 1741.36 | bwd_allreduce_microstep: 78.98 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13262
total_samples=25408, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:31,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.80 | bwd_microstep: 1704.69 | bwd_inner_microstep: 1677.35 | bwd_allreduce_microstep: 27.27 | step_microstep: 0.76
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13705
total_samples=25413, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:33,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.82 | bwd_microstep: 1892.96 | bwd_inner_microstep: 1733.40 | bwd_allreduce_microstep: 159.48 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11862
total_samples=25416, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:36,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 06:49:36,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.54 | bwd_microstep: 1888.38 | bwd_inner_microstep: 1559.12 | bwd_allreduce_microstep: 329.19 | step_microstep: 116.80
[2025-08-03 06:49:36,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.21 | bwd: 7306.51 | bwd_inner: 6711.23 | bwd_allreduce: 595.02 | step: 117.89
{'loss': 0.7299, 'learning_rate': 1.3859923124726283e-06, 'epoch': 0.84}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12778
total_samples=25420, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:39,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.92 | bwd_microstep: 1829.32 | bwd_inner_microstep: 1649.47 | bwd_allreduce_microstep: 179.78 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11906
total_samples=25423, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:41,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.13 | bwd_microstep: 1808.45 | bwd_inner_microstep: 1557.41 | bwd_allreduce_microstep: 250.97 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14363
total_samples=25427, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:44,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.10 | bwd_microstep: 1861.06 | bwd_inner_microstep: 1820.82 | bwd_allreduce_microstep: 40.17 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12797
total_samples=25431, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:47,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 06:49:47,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.38 | bwd_microstep: 1779.60 | bwd_inner_microstep: 1643.97 | bwd_allreduce_microstep: 135.56 | step_microstep: 113.30
[2025-08-03 06:49:47,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2849.47 | bwd: 7278.50 | bwd_inner: 6671.66 | bwd_allreduce: 606.58 | step: 113.94
{'loss': 0.7383, 'learning_rate': 1.3777783722154603e-06, 'epoch': 0.84}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11751
total_samples=25434, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:49,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.01 | bwd_microstep: 1906.70 | bwd_inner_microstep: 1605.70 | bwd_allreduce_microstep: 300.93 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13920
total_samples=25438, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:52,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.80 | bwd_microstep: 1972.02 | bwd_inner_microstep: 1747.73 | bwd_allreduce_microstep: 224.22 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12106
total_samples=25442, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:55,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.18 | bwd_microstep: 2358.01 | bwd_inner_microstep: 2037.28 | bwd_allreduce_microstep: 320.67 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11606
total_samples=25445, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:49:58,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 06:49:58,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.55 | bwd_microstep: 2026.47 | bwd_inner_microstep: 1803.94 | bwd_allreduce_microstep: 222.46 | step_microstep: 109.78
[2025-08-03 06:49:58,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.46 | bwd: 8263.25 | bwd_inner: 7194.64 | bwd_allreduce: 1068.37 | step: 110.45
{'loss': 0.7267, 'learning_rate': 1.369587042729341e-06, 'epoch': 0.84}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13909
total_samples=25449, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:01,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.38 | bwd_microstep: 1785.03 | bwd_inner_microstep: 1717.53 | bwd_allreduce_microstep: 67.44 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11906
total_samples=25452, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:04,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.19 | bwd_microstep: 2457.24 | bwd_inner_microstep: 2448.66 | bwd_allreduce_microstep: 8.52 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13750
total_samples=25456, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:07,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.43 | bwd_microstep: 1958.91 | bwd_inner_microstep: 1720.94 | bwd_allreduce_microstep: 237.91 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13629
total_samples=25460, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:09,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 06:50:09,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.54 | bwd_microstep: 1787.80 | bwd_inner_microstep: 1680.00 | bwd_allreduce_microstep: 107.72 | step_microstep: 139.22
[2025-08-03 06:50:09,819] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.48 | bwd: 7989.03 | bwd_inner: 7567.12 | bwd_allreduce: 421.67 | step: 139.59
{'loss': 0.7315, 'learning_rate': 1.3614183454950824e-06, 'epoch': 0.84}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13626
total_samples=25464, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:12,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.58 | bwd_microstep: 1727.87 | bwd_inner_microstep: 1677.05 | bwd_allreduce_microstep: 50.75 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13889
total_samples=25468, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:15,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.84 | bwd_microstep: 2019.65 | bwd_inner_microstep: 1870.54 | bwd_allreduce_microstep: 149.03 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13225
total_samples=25472, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:17,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.86 | bwd_microstep: 2066.53 | bwd_inner_microstep: 1914.81 | bwd_allreduce_microstep: 151.65 | step_microstep: 0.78
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13445
total_samples=25476, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:20,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.85
[2025-08-03 06:50:20,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.31 | bwd_microstep: 1948.17 | bwd_inner_microstep: 1865.73 | bwd_allreduce_microstep: 82.37 | step_microstep: 115.48
[2025-08-03 06:50:20,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.51 | bwd: 7762.27 | bwd_inner: 7328.13 | bwd_allreduce: 433.88 | step: 116.68
{'loss': 0.7324, 'learning_rate': 1.3532723019341376e-06, 'epoch': 0.84}
 84%|████████▎ | 1671/2000 [5:06:51<59:06, 10.78s/it]
 84%|████████▎ | 1672/2000 [5:07:01<58:31, 10.71s/it]
 84%|████████▎ | 1673/2000 [5:07:13<59:36, 10.94s/it]
 84%|████████▎ | 1674/2000 [5:07:24<59:55, 11.03s/it]
 84%|████████▍ | 1675/2000 [5:07:35<59:43, 11.03s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13776
total_samples=25480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:23,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.77 | bwd_microstep: 1808.59 | bwd_inner_microstep: 1721.71 | bwd_allreduce_microstep: 86.82 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14503
total_samples=25484, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:26,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.45 | bwd_microstep: 1827.16 | bwd_inner_microstep: 1783.32 | bwd_allreduce_microstep: 43.77 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14255
total_samples=25488, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:28,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.37 | bwd_microstep: 1842.52 | bwd_inner_microstep: 1773.41 | bwd_allreduce_microstep: 69.05 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11690
total_samples=25491, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:31,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.07
[2025-08-03 06:50:31,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.43 | bwd_microstep: 1776.16 | bwd_inner_microstep: 1550.97 | bwd_allreduce_microstep: 225.11 | step_microstep: 129.18
[2025-08-03 06:50:31,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2859.95 | bwd: 7254.50 | bwd_inner: 6829.40 | bwd_allreduce: 424.84 | step: 129.68
{'loss': 0.7302, 'learning_rate': 1.3451489334085555e-06, 'epoch': 0.84}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13265
total_samples=25495, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:34,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.12 | bwd_microstep: 2135.24 | bwd_inner_microstep: 1875.35 | bwd_allreduce_microstep: 259.82 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13500
total_samples=25499, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:37,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.48 | bwd_microstep: 2024.41 | bwd_inner_microstep: 1711.05 | bwd_allreduce_microstep: 313.29 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14514
total_samples=25503, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:39,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.76 | bwd_microstep: 1824.27 | bwd_inner_microstep: 1754.69 | bwd_allreduce_microstep: 69.50 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12020
total_samples=25506, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:42,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 06:50:42,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.52 | bwd_microstep: 1823.04 | bwd_inner_microstep: 1607.57 | bwd_allreduce_microstep: 215.40 | step_microstep: 136.42
[2025-08-03 06:50:42,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2865.80 | bwd: 7806.99 | bwd_inner: 6948.65 | bwd_allreduce: 858.10 | step: 136.95
{'loss': 0.7294, 'learning_rate': 1.3370482612209224e-06, 'epoch': 0.84}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13226
total_samples=25510, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:45,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.18 | bwd_microstep: 1829.51 | bwd_inner_microstep: 1705.22 | bwd_allreduce_microstep: 124.22 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13599
total_samples=25514, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:47,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.80 | bwd_microstep: 1943.93 | bwd_inner_microstep: 1840.01 | bwd_allreduce_microstep: 103.86 | step_microstep: 0.28
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12860
total_samples=25519, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:50,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.29 | bwd_microstep: 1829.62 | bwd_inner_microstep: 1645.38 | bwd_allreduce_microstep: 184.18 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13527
total_samples=25523, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:53,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 06:50:53,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.63 | bwd_microstep: 2034.03 | bwd_inner_microstep: 1939.64 | bwd_allreduce_microstep: 94.31 | step_microstep: 124.35
[2025-08-03 06:50:53,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.82 | bwd: 7637.15 | bwd_inner: 7130.25 | bwd_allreduce: 506.64 | step: 124.91
{'loss': 0.7352, 'learning_rate': 1.3289703066143112e-06, 'epoch': 0.84}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13405
total_samples=25528, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:56,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.48 | bwd_microstep: 2190.43 | bwd_inner_microstep: 1898.83 | bwd_allreduce_microstep: 291.52 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515
total_samples=25532, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:50:58,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.32 | bwd_microstep: 1826.83 | bwd_inner_microstep: 1752.44 | bwd_allreduce_microstep: 74.31 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11642
total_samples=25535, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:01,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.79 | bwd_microstep: 1738.14 | bwd_inner_microstep: 1533.05 | bwd_allreduce_microstep: 205.03 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13227
total_samples=25539, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:04,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 06:51:04,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.05 | bwd_microstep: 1998.09 | bwd_inner_microstep: 1856.65 | bwd_allreduce_microstep: 141.37 | step_microstep: 119.62
[2025-08-03 06:51:04,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.57 | bwd: 7753.56 | bwd_inner: 7040.97 | bwd_allreduce: 712.32 | step: 120.14
{'loss': 0.736, 'learning_rate': 1.3209150907722124e-06, 'epoch': 0.84}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11704
total_samples=25542, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:06,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.95 | bwd_microstep: 1747.27 | bwd_inner_microstep: 1529.44 | bwd_allreduce_microstep: 217.77 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=25546, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:09,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.02 | bwd_microstep: 1798.27 | bwd_inner_microstep: 1690.55 | bwd_allreduce_microstep: 107.66 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11718
total_samples=25549, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:12,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.24 | bwd_microstep: 1804.38 | bwd_inner_microstep: 1553.74 | bwd_allreduce_microstep: 250.57 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13177
total_samples=25553, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:14,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.68
[2025-08-03 06:51:14,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.12 | bwd_microstep: 1996.67 | bwd_inner_microstep: 1988.60 | bwd_allreduce_microstep: 7.99 | step_microstep: 146.70
[2025-08-03 06:51:14,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.26 | bwd: 7346.65 | bwd_inner: 6762.33 | bwd_allreduce: 584.08 | step: 147.22
{'loss': 0.7347, 'learning_rate': 1.3128826348184886e-06, 'epoch': 0.84}
 84%|████████▍ | 1676/2000 [5:07:46<58:44, 10.88s/it]
 84%|████████▍ | 1677/2000 [5:07:57<58:54, 10.94s/it]
 84%|████████▍ | 1678/2000 [5:08:08<58:39, 10.93s/it]
 84%|████████▍ | 1679/2000 [5:08:19<58:35, 10.95s/it]
 84%|████████▍ | 1680/2000 [5:08:29<57:49, 10.84s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12130
total_samples=25556, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:17,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.17 | bwd_microstep: 1762.35 | bwd_inner_microstep: 1558.14 | bwd_allreduce_microstep: 204.14 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14043
total_samples=25560, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:19,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.31 | bwd_microstep: 1716.38 | bwd_inner_microstep: 1693.09 | bwd_allreduce_microstep: 23.22 | step_microstep: 0.75
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11966
total_samples=25563, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:22,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.15 | bwd_microstep: 1735.89 | bwd_inner_microstep: 1559.92 | bwd_allreduce_microstep: 175.90 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13321
total_samples=25567, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:25,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 06:51:25,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.81 | bwd_microstep: 2202.96 | bwd_inner_microstep: 1846.40 | bwd_allreduce_microstep: 356.48 | step_microstep: 132.41
[2025-08-03 06:51:25,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.37 | bwd: 7417.65 | bwd_inner: 6657.54 | bwd_allreduce: 759.83 | step: 133.43
{'loss': 0.7265, 'learning_rate': 1.3048729598173248e-06, 'epoch': 0.84}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13258
total_samples=25571, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:28,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 930.66 | bwd_microstep: 1752.21 | bwd_inner_microstep: 1671.82 | bwd_allreduce_microstep: 80.32 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13343
total_samples=25575, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:31,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.29 | bwd_microstep: 2067.37 | bwd_inner_microstep: 1911.42 | bwd_allreduce_microstep: 155.88 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12993
total_samples=25579, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:34,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.41 | bwd_microstep: 2019.36 | bwd_inner_microstep: 1871.68 | bwd_allreduce_microstep: 147.60 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13424
total_samples=25583, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:36,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.19
[2025-08-03 06:51:36,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.08 | bwd_microstep: 1756.24 | bwd_inner_microstep: 1699.80 | bwd_allreduce_microstep: 56.36 | step_microstep: 154.43
[2025-08-03 06:51:36,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3075.38 | bwd: 7595.23 | bwd_inner: 7154.72 | bwd_allreduce: 440.25 | step: 154.98
{'loss': 0.75, 'learning_rate': 1.296886086773157e-06, 'epoch': 0.84}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885
total_samples=25586, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:39,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.70 | bwd_microstep: 1939.98 | bwd_inner_microstep: 1746.23 | bwd_allreduce_microstep: 193.69 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13148
total_samples=25589, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:41,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.00 | bwd_microstep: 1727.22 | bwd_inner_microstep: 1598.31 | bwd_allreduce_microstep: 128.84 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13481
total_samples=25593, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:44,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.04 | bwd_microstep: 1709.43 | bwd_inner_microstep: 1664.25 | bwd_allreduce_microstep: 45.10 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13443
total_samples=25597, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:47,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 06:51:47,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.87 | bwd_microstep: 2033.18 | bwd_inner_microstep: 1887.67 | bwd_allreduce_microstep: 145.45 | step_microstep: 114.27
[2025-08-03 06:51:47,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.54 | bwd: 7409.87 | bwd_inner: 6896.46 | bwd_allreduce: 513.16 | step: 114.83
{'loss': 0.7273, 'learning_rate': 1.2889220366306276e-06, 'epoch': 0.84}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11821
total_samples=25600, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:49,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.35 | bwd_microstep: 1817.77 | bwd_inner_microstep: 1579.31 | bwd_allreduce_microstep: 238.40 | step_microstep: 0.75
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13066
total_samples=25604, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:52,415] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.61 | bwd_microstep: 1713.23 | bwd_inner_microstep: 1660.82 | bwd_allreduce_microstep: 52.34 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11984
total_samples=25607, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:55,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.68 | bwd_microstep: 1816.52 | bwd_inner_microstep: 1590.51 | bwd_allreduce_microstep: 225.93 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13728
total_samples=25611, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:51:57,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.83
[2025-08-03 06:51:57,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.69 | bwd_microstep: 1739.70 | bwd_inner_microstep: 1700.80 | bwd_allreduce_microstep: 38.82 | step_microstep: 153.61
[2025-08-03 06:51:57,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.25 | bwd: 7087.26 | bwd_inner: 6531.45 | bwd_allreduce: 555.58 | step: 154.76
{'loss': 0.7319, 'learning_rate': 1.2809808302745298e-06, 'epoch': 0.84}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12269
total_samples=25614, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:00,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.80 | bwd_microstep: 1912.55 | bwd_inner_microstep: 1559.71 | bwd_allreduce_microstep: 352.76 | step_microstep: 0.76
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11699
total_samples=25617, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:03,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.21 | bwd_microstep: 2084.37 | bwd_inner_microstep: 1847.49 | bwd_allreduce_microstep: 236.82 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11710
total_samples=25620, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:06,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.39 | bwd_microstep: 2041.82 | bwd_inner_microstep: 1821.55 | bwd_allreduce_microstep: 220.20 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14617
total_samples=25625, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:08,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 06:52:08,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.90 | bwd_microstep: 1964.39 | bwd_inner_microstep: 1783.78 | bwd_allreduce_microstep: 180.54 | step_microstep: 121.66
[2025-08-03 06:52:08,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.23 | bwd: 8003.18 | bwd_inner: 7012.52 | bwd_allreduce: 990.41 | step: 122.65
{'loss': 0.7138, 'learning_rate': 1.2730624885297537e-06, 'epoch': 0.84}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11885
total_samples=25628, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:11,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.82 | bwd_microstep: 1792.46 | bwd_inner_microstep: 1563.82 | bwd_allreduce_microstep: 228.57 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13671
total_samples=25632, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:14,351] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.22 | bwd_microstep: 2041.60 | bwd_inner_microstep: 1884.32 | bwd_allreduce_microstep: 157.22 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13501
total_samples=25636, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:17,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.97 | bwd_microstep: 1947.74 | bwd_inner_microstep: 1754.13 | bwd_allreduce_microstep: 193.53 | step_microstep: 0.89
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13360
total_samples=25640, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:19,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.55
[2025-08-03 06:52:19,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.90 | bwd_microstep: 1749.10 | bwd_inner_microstep: 1675.01 | bwd_allreduce_microstep: 74.02 | step_microstep: 122.99
[2025-08-03 06:52:19,719] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.84 | bwd: 7530.96 | bwd_inner: 6877.27 | bwd_allreduce: 653.42 | step: 124.12
.84s/it]
 84%|████████▍ | 1681/2000 [5:08:40<57:17, 10.78s/it]
 84%|████████▍ | 1682/2000 [5:08:51<57:40, 10.88s/it]
 84%|████████▍ | 1683/2000 [5:09:02<57:05, 10.80s/it]
 84%|████████▍ | 1684/2000 [5:09:12<56:11, 10.67s/it]
 84%|████████▍ | 1685/2000 [5:09:23<56:54, 10.84s/it]
{'loss': 0.7344, 'learning_rate': 1.2651670321612264e-06, 'epoch': 0.84}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11812
total_samples=25643, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:22,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.29 | bwd_microstep: 1899.89 | bwd_inner_microstep: 1632.88 | bwd_allreduce_microstep: 266.94 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13393
total_samples=25647, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:25,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.48 | bwd_microstep: 2214.51 | bwd_inner_microstep: 2208.32 | bwd_allreduce_microstep: 6.12 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13042
total_samples=25651, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:27,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.34 | bwd_microstep: 1760.05 | bwd_inner_microstep: 1664.60 | bwd_allreduce_microstep: 95.37 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12844
total_samples=25655, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:30,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.86
[2025-08-03 06:52:30,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.79 | bwd_microstep: 1752.92 | bwd_inner_microstep: 1649.86 | bwd_allreduce_microstep: 102.99 | step_microstep: 135.32
[2025-08-03 06:52:30,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2824.82 | bwd: 7627.42 | bwd_inner: 7155.67 | bwd_allreduce: 471.51 | step: 135.80
{'loss': 0.7235, 'learning_rate': 1.2572944818738587e-06, 'epoch': 0.84}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13894
total_samples=25659, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:33,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.49 | bwd_microstep: 1815.34 | bwd_inner_microstep: 1706.80 | bwd_allreduce_microstep: 108.48 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13426
total_samples=25664, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:35,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.91 | bwd_microstep: 1713.48 | bwd_inner_microstep: 1640.67 | bwd_allreduce_microstep: 72.75 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12111
total_samples=25667, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:38,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.69 | bwd_microstep: 2143.25 | bwd_inner_microstep: 1962.72 | bwd_allreduce_microstep: 180.47 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13463
total_samples=25671, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:41,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 06:52:41,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.49 | bwd_microstep: 1758.12 | bwd_inner_microstep: 1681.87 | bwd_allreduce_microstep: 76.19 | step_microstep: 113.26
[2025-08-03 06:52:41,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.50 | bwd: 7430.25 | bwd_inner: 6992.05 | bwd_allreduce: 437.96 | step: 113.76
{'loss': 0.732, 'learning_rate': 1.249444858312502e-06, 'epoch': 0.84}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11640
total_samples=25674, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:44,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.70 | bwd_microstep: 2130.79 | bwd_inner_microstep: 1990.35 | bwd_allreduce_microstep: 140.35 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13094
total_samples=25678, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:46,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.25 | bwd_microstep: 2026.70 | bwd_inner_microstep: 1886.53 | bwd_allreduce_microstep: 140.10 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13711
total_samples=25682, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:49,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.37 | bwd_microstep: 1868.41 | bwd_inner_microstep: 1732.71 | bwd_allreduce_microstep: 135.64 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13214
total_samples=25686, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:52,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69
[2025-08-03 06:52:52,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.88 | bwd_microstep: 1815.60 | bwd_inner_microstep: 1708.39 | bwd_allreduce_microstep: 107.14 | step_microstep: 154.26
[2025-08-03 06:52:52,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2857.12 | bwd: 7841.56 | bwd_inner: 7317.96 | bwd_allreduce: 523.31 | step: 154.82
{'loss': 0.7412, 'learning_rate': 1.2416181820618745e-06, 'epoch': 0.84}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14038
total_samples=25690, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:55,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.24 | bwd_microstep: 1812.00 | bwd_inner_microstep: 1737.42 | bwd_allreduce_microstep: 74.51 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13595
total_samples=25694, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:52:57,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.78 | bwd_microstep: 1828.74 | bwd_inner_microstep: 1730.27 | bwd_allreduce_microstep: 98.41 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13382
total_samples=25698, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:00,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.19 | bwd_microstep: 2239.42 | bwd_inner_microstep: 2109.26 | bwd_allreduce_microstep: 130.09 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13223
total_samples=25702, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:03,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 06:53:03,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.43 | bwd_microstep: 1859.22 | bwd_inner_microstep: 1780.20 | bwd_allreduce_microstep: 78.95 | step_microstep: 127.30
[2025-08-03 06:53:03,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2857.56 | bwd: 7739.42 | bwd_inner: 7357.14 | bwd_allreduce: 382.04 | step: 127.81
{'loss': 0.7357, 'learning_rate': 1.233814473646524e-06, 'epoch': 0.84}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14286
total_samples=25706, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:05,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.66 | bwd_microstep: 1782.91 | bwd_inner_microstep: 1728.47 | bwd_allreduce_microstep: 54.38 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13837
total_samples=25710, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:08,543] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.25 | bwd_microstep: 1782.87 | bwd_inner_microstep: 1724.63 | bwd_allreduce_microstep: 58.17 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12369
total_samples=25713, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:11,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.28 | bwd_microstep: 1749.74 | bwd_inner_microstep: 1595.51 | bwd_allreduce_microstep: 154.17 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11880
total_samples=25716, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:13,833] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 06:53:13,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.01 | bwd_microstep: 1882.52 | bwd_inner_microstep: 1763.51 | bwd_allreduce_microstep: 118.96 | step_microstep: 118.10
[2025-08-03 06:53:13,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.14 | bwd: 7198.09 | bwd_inner: 6812.12 | bwd_allreduce: 385.74 | step: 118.61
 84%|████████▍ | 1686/2000 [5:09:34<56:39, 10.83s/it]
 84%|████████▍ | 1687/2000 [5:09:45<56:32, 10.84s/it]
 84%|████████▍ | 1688/2000 [5:09:56<56:02, 10.78s/it]
 84%|████████▍ | 1689/2000 [5:10:07<56:26, 10.89s/it]
 84%|████████▍ | 1690/2000 [5:10:18<56:30, 10.94s/it]
{'loss': 0.7402, 'learning_rate': 1.226033753530763e-06, 'epoch': 0.85}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12032
total_samples=25719, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:16,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.78 | bwd_microstep: 2029.77 | bwd_inner_microstep: 1838.51 | bwd_allreduce_microstep: 191.20 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13546
total_samples=25723, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:19,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.47 | bwd_microstep: 1827.18 | bwd_inner_microstep: 1716.34 | bwd_allreduce_microstep: 110.77 | step_microstep: 0.26
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12472
total_samples=25727, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:21,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.02 | bwd_microstep: 1815.29 | bwd_inner_microstep: 1612.76 | bwd_allreduce_microstep: 202.45 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13719
total_samples=25731, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:24,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 06:53:24,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.15 | bwd_microstep: 1915.39 | bwd_inner_microstep: 1735.84 | bwd_allreduce_microstep: 179.49 | step_microstep: 128.95
[2025-08-03 06:53:24,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.35 | bwd: 7587.68 | bwd_inner: 6903.44 | bwd_allreduce: 684.00 | step: 129.61
{'loss': 0.7327, 'learning_rate': 1.218276042118629e-06, 'epoch': 0.85}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13239
total_samples=25735, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:27,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.70 | bwd_microstep: 1717.27 | bwd_inner_microstep: 1661.32 | bwd_allreduce_microstep: 55.88 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13678
total_samples=25739, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:29,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.91 | bwd_microstep: 1777.11 | bwd_inner_microstep: 1706.50 | bwd_allreduce_microstep: 70.55 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13576
total_samples=25743, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:32,528] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.90 | bwd_microstep: 2005.73 | bwd_inner_microstep: 1908.87 | bwd_allreduce_microstep: 96.80 | step_microstep: 0.24
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12698
total_samples=25747, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:36,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 06:53:36,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.67 | bwd_microstep: 3028.24 | bwd_inner_microstep: 2791.02 | bwd_allreduce_microstep: 237.13 | step_microstep: 115.68
[2025-08-03 06:53:36,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.11 | bwd: 8528.42 | bwd_inner: 8067.71 | bwd_allreduce: 460.46 | step: 116.21
{'loss': 0.7203, 'learning_rate': 1.2105413597538107e-06, 'epoch': 0.85}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12193
total_samples=25750, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:38,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.75 | bwd_microstep: 1735.30 | bwd_inner_microstep: 1558.42 | bwd_allreduce_microstep: 176.81 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12958
total_samples=25754, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:41,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.33 | bwd_microstep: 1835.84 | bwd_inner_microstep: 1779.02 | bwd_allreduce_microstep: 56.76 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13530
total_samples=25758, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:44,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.21 | bwd_microstep: 1856.92 | bwd_inner_microstep: 1724.21 | bwd_allreduce_microstep: 132.65 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15476
total_samples=25762, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:46,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 06:53:46,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.29 | bwd_microstep: 1846.84 | bwd_inner_microstep: 1798.98 | bwd_allreduce_microstep: 47.79 | step_microstep: 110.30
[2025-08-03 06:53:46,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2880.50 | bwd: 7274.95 | bwd_inner: 6860.63 | bwd_allreduce: 414.09 | step: 110.75
{'loss': 0.731, 'learning_rate': 1.202829726719611e-06, 'epoch': 0.85}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12336
total_samples=25765, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:49,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.71 | bwd_microstep: 2081.76 | bwd_inner_microstep: 1850.88 | bwd_allreduce_microstep: 230.81 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13093
total_samples=25769, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:52,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.49 | bwd_microstep: 1818.24 | bwd_inner_microstep: 1701.79 | bwd_allreduce_microstep: 116.38 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11962
total_samples=25772, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:55,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.40 | bwd_microstep: 1841.65 | bwd_inner_microstep: 1562.32 | bwd_allreduce_microstep: 279.26 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14717
total_samples=25776, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:53:57,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 06:53:57,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.47 | bwd_microstep: 1738.39 | bwd_inner_microstep: 1698.86 | bwd_allreduce_microstep: 39.47 | step_microstep: 112.65
[2025-08-03 06:53:57,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2882.01 | bwd: 7480.08 | bwd_inner: 6813.85 | bwd_allreduce: 666.00 | step: 113.13
{'loss': 0.7336, 'learning_rate': 1.195141163238892e-06, 'epoch': 0.85}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14858
total_samples=25781, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:00,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.50 | bwd_microstep: 2049.91 | bwd_inner_microstep: 1862.49 | bwd_allreduce_microstep: 187.35 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13821
total_samples=25786, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:03,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.72 | bwd_microstep: 1926.44 | bwd_inner_microstep: 1861.86 | bwd_allreduce_microstep: 64.51 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14817
total_samples=25790, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:06,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.23 | bwd_microstep: 1957.46 | bwd_inner_microstep: 1779.70 | bwd_allreduce_microstep: 177.70 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13540
total_samples=25794, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:08,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 06:54:08,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.31 | bwd_microstep: 1864.41 | bwd_inner_microstep: 1731.77 | bwd_allreduce_microstep: 132.58 | step_microstep: 142.95
[2025-08-03 06:54:08,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2848.69 | bwd: 7798.27 | bwd_inner: 7235.83 | bwd_allreduce: 562.21 | step: 143.41
 85%|████████▍ | 1691/2000 [5:10:28<55:31, 10.78s/it]
 85%|████████▍ | 1692/2000 [5:10:39<55:25, 10.80s/it]
 85%|████████▍ | 1693/2000 [5:10:51<56:44, 11.09s/it]
 85%|████████▍ | 1694/2000 [5:11:01<55:43, 10.93s/it]
 85%|████████▍ | 1695/2000 [5:11:12<55:17, 10.88s/it]
 85%|████████▍ | 1696/2000 [5:11:23<55:26, 10.94s/it]
{'loss': 0.7398, 'learning_rate': 1.1874756894740137e-06, 'epoch': 0.85}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13263
total_samples=25798, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:11,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.93 | bwd_microstep: 1791.94 | bwd_inner_microstep: 1696.07 | bwd_allreduce_microstep: 95.81 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13637
total_samples=25803, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:14,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.88 | bwd_microstep: 1874.09 | bwd_inner_microstep: 1764.34 | bwd_allreduce_microstep: 109.68 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13283
total_samples=25807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:17,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.22 | bwd_microstep: 2152.14 | bwd_inner_microstep: 2052.27 | bwd_allreduce_microstep: 99.81 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13326
total_samples=25811, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:20,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 42.12
[2025-08-03 06:54:20,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.94 | bwd_microstep: 2023.69 | bwd_inner_microstep: 1862.97 | bwd_allreduce_microstep: 160.65 | step_microstep: 158.79
[2025-08-03 06:54:20,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2880.89 | bwd: 7841.92 | bwd_inner: 7375.65 | bwd_allreduce: 466.02 | step: 159.24
{'loss': 0.7392, 'learning_rate': 1.1798333255267857e-06, 'epoch': 0.85}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13564
total_samples=25815, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:22,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.77 | bwd_microstep: 1870.70 | bwd_inner_microstep: 1697.63 | bwd_allreduce_microstep: 172.99 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13698
total_samples=25819, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:25,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.10 | bwd_microstep: 1773.12 | bwd_inner_microstep: 1689.33 | bwd_allreduce_microstep: 83.72 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11656
total_samples=25822, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:27,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.40 | bwd_microstep: 1870.53 | bwd_inner_microstep: 1612.05 | bwd_allreduce_microstep: 258.42 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11620
total_samples=25825, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:30,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 06:54:30,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.73 | bwd_microstep: 1831.76 | bwd_inner_microstep: 1585.10 | bwd_allreduce_microstep: 246.60 | step_microstep: 152.38
[2025-08-03 06:54:30,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.93 | bwd: 7346.15 | bwd_inner: 6584.11 | bwd_allreduce: 761.80 | step: 152.88
{'loss': 0.7326, 'learning_rate': 1.1722140914384162e-06, 'epoch': 0.85}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13856
total_samples=25829, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:33,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.20 | bwd_microstep: 1884.24 | bwd_inner_microstep: 1831.66 | bwd_allreduce_microstep: 52.51 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13593
total_samples=25833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:35,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.75 | bwd_microstep: 1792.64 | bwd_inner_microstep: 1712.93 | bwd_allreduce_microstep: 79.64 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13735
total_samples=25837, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:38,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.96 | bwd_microstep: 1960.91 | bwd_inner_microstep: 1735.52 | bwd_allreduce_microstep: 225.33 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13949
total_samples=25841, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:41,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 06:54:41,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.04 | bwd_microstep: 1820.10 | bwd_inner_microstep: 1709.63 | bwd_allreduce_microstep: 110.41 | step_microstep: 121.02
[2025-08-03 06:54:41,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.88 | bwd: 7457.95 | bwd_inner: 6989.74 | bwd_allreduce: 467.96 | step: 121.50
{'loss': 0.737, 'learning_rate': 1.1646180071894608e-06, 'epoch': 0.85}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13437
total_samples=25845, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:44,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.53 | bwd_microstep: 1965.40 | bwd_inner_microstep: 1722.86 | bwd_allreduce_microstep: 242.48 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13374
total_samples=25849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:47,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.57 | bwd_microstep: 2191.07 | bwd_inner_microstep: 2184.62 | bwd_allreduce_microstep: 6.38 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11403
total_samples=25852, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:50,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1739.58 | bwd_microstep: 1943.20 | bwd_inner_microstep: 1832.53 | bwd_allreduce_microstep: 110.61 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11769
total_samples=25855, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:53,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 06:54:53,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.86 | bwd_microstep: 2026.68 | bwd_inner_microstep: 1802.61 | bwd_allreduce_microstep: 224.01 | step_microstep: 128.71
[2025-08-03 06:54:53,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3894.47 | bwd: 8126.40 | bwd_inner: 7542.61 | bwd_allreduce: 583.55 | step: 129.19
{'loss': 0.7396, 'learning_rate': 1.1570450926997657e-06, 'epoch': 0.85}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13585
total_samples=25859, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:56,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.73 | bwd_microstep: 2044.78 | bwd_inner_microstep: 1721.43 | bwd_allreduce_microstep: 323.29 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13493
total_samples=25863, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:54:59,235] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.47 | bwd_microstep: 1855.77 | bwd_inner_microstep: 1713.35 | bwd_allreduce_microstep: 142.35 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13160
total_samples=25867, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:02,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.92 | bwd_microstep: 2088.65 | bwd_inner_microstep: 1813.70 | bwd_allreduce_microstep: 274.88 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12214
total_samples=25870, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:04,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.83
[2025-08-03 06:55:04,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.15 | bwd_microstep: 1800.41 | bwd_inner_microstep: 1568.39 | bwd_allreduce_microstep: 231.94 | step_microstep: 123.12
[2025-08-03 06:55:04,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.19 | bwd: 7789.67 | bwd_inner: 6816.88 | bwd_allreduce: 972.54 | step: 123.60
85%|████████▍ | 1696/2000 [5:11:23<55:26, 10.94s/it]
85%|████████▍ | 1697/2000 [5:11:34<55:36, 11.01s/it]
85%|████████▍ | 1698/2000 [5:11:45<54:51, 10.90s/it]
85%|████████▍ | 1699/2000 [5:11:56<54:21, 10.84s/it]
85%|████████▌ | 1700/2000 [5:12:08<56:34, 11.32s/it]
85%|████████▌ | 1701/2000 [5:12:19<55:56, 11.23s/it]
{'loss': 0.7231, 'learning_rate': 1.1494953678284105e-06, 'epoch': 0.85}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14509
total_samples=25874, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:07,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.47 | bwd_microstep: 2004.67 | bwd_inner_microstep: 1846.09 | bwd_allreduce_microstep: 158.52 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14329
total_samples=25878, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:10,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.36 | bwd_microstep: 2079.07 | bwd_inner_microstep: 2033.58 | bwd_allreduce_microstep: 45.42 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11722
total_samples=25881, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:13,051] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.45 | bwd_microstep: 1810.29 | bwd_inner_microstep: 1569.53 | bwd_allreduce_microstep: 240.70 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11971
total_samples=25884, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:16,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 06:55:16,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.43 | bwd_microstep: 1859.45 | bwd_inner_microstep: 1612.27 | bwd_allreduce_microstep: 247.11 | step_microstep: 445.27
[2025-08-03 06:55:16,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.63 | bwd: 7753.54 | bwd_inner: 7061.46 | bwd_allreduce: 691.83 | step: 445.65
{'loss': 0.7282, 'learning_rate': 1.1419688523736761e-06, 'epoch': 0.85}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12968
total_samples=25888, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:18,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.44 | bwd_microstep: 1985.25 | bwd_inner_microstep: 1842.19 | bwd_allreduce_microstep: 142.98 | step_microstep: 0.34
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13599
total_samples=25892, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:21,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 663.60 | bwd_microstep: 2124.03 | bwd_inner_microstep: 2006.32 | bwd_allreduce_microstep: 117.63 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11774
total_samples=25895, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:24,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.96 | bwd_microstep: 1764.34 | bwd_inner_microstep: 1539.46 | bwd_allreduce_microstep: 224.82 | step_microstep: 0.31
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13333
total_samples=25899, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:27,147] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 06:55:27,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.18 | bwd_microstep: 1814.26 | bwd_inner_microstep: 1634.26 | bwd_allreduce_microstep: 179.94 | step_microstep: 130.64
[2025-08-03 06:55:27,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.12 | bwd: 7687.94 | bwd_inner: 7022.22 | bwd_allreduce: 665.45 | step: 131.56
{'loss': 0.7449, 'learning_rate': 1.1344655660729676e-06, 'epoch': 0.85}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13904
total_samples=25904, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:29,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.31 | bwd_microstep: 1977.70 | bwd_inner_microstep: 1865.08 | bwd_allreduce_microstep: 112.55 | step_microstep: 0.73
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13103
total_samples=25908, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:32,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.06 | bwd_microstep: 1899.59 | bwd_inner_microstep: 1815.30 | bwd_allreduce_microstep: 84.23 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13945
total_samples=25912, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:35,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.44 | bwd_microstep: 1754.31 | bwd_inner_microstep: 1712.43 | bwd_allreduce_microstep: 41.81 | step_microstep: 0.20
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12245
total_samples=25916, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:38,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.72
[2025-08-03 06:55:38,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.14 | bwd_microstep: 2078.48 | bwd_inner_microstep: 1884.13 | bwd_allreduce_microstep: 194.29 | step_microstep: 155.06
[2025-08-03 06:55:38,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.89 | bwd: 7710.13 | bwd_inner: 7276.93 | bwd_allreduce: 432.97 | step: 156.10
{'loss': 0.7395, 'learning_rate': 1.1269855286027798e-06, 'epoch': 0.85}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13468
total_samples=25920, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:40,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.40 | bwd_microstep: 1815.37 | bwd_inner_microstep: 1700.55 | bwd_allreduce_microstep: 114.75 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14058
total_samples=25924, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:43,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.13 | bwd_microstep: 2055.29 | bwd_inner_microstep: 1917.75 | bwd_allreduce_microstep: 137.46 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12130
total_samples=25927, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:46,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.35 | bwd_microstep: 1812.24 | bwd_inner_microstep: 1586.43 | bwd_allreduce_microstep: 225.74 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11812
total_samples=25930, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:48,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 06:55:48,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.65 | bwd_microstep: 1750.17 | bwd_inner_microstep: 1542.59 | bwd_allreduce_microstep: 207.50 | step_microstep: 110.52
[2025-08-03 06:55:48,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.47 | bwd: 7433.11 | bwd_inner: 6747.32 | bwd_allreduce: 685.54 | step: 111.00
{'loss': 0.7268, 'learning_rate': 1.1195287595786352e-06, 'epoch': 0.85}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13467
total_samples=25934, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:51,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.43 | bwd_microstep: 1902.08 | bwd_inner_microstep: 1694.44 | bwd_allreduce_microstep: 207.58 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13873
total_samples=25938, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:54,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.95 | bwd_microstep: 2081.88 | bwd_inner_microstep: 1935.52 | bwd_allreduce_microstep: 146.28 | step_microstep: 0.18
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12786
total_samples=25942, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:55:56,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.71 | bwd_microstep: 1734.97 | bwd_inner_microstep: 1618.64 | bwd_allreduce_microstep: 116.27 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13190
total_samples=25946, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:00,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 06:56:00,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1017.58 | bwd_microstep: 1975.81 | bwd_inner_microstep: 1867.02 | bwd_allreduce_microstep: 108.73 | step_microstep: 135.66
[2025-08-03 06:56:00,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3119.60 | bwd: 7694.80 | bwd_inner: 7115.61 | bwd_allreduce: 578.94 | step: 136.20
{'loss': 0.7473, 'learning_rate': 1.1120952785550477e-06, 'epoch': 0.85}
85%|████████▌ | 1702/2000 [5:12:31<55:57, 11.27s/it]
85%|████████▌ | 1703/2000 [5:12:42<55:22, 11.19s/it]
85%|████████▌ | 1704/2000 [5:12:52<54:52, 11.12s/it]
85%|████████▌ | 1705/2000 [5:13:03<54:00, 10.99s/it]
85%|████████▌ | 1706/2000 [5:13:14<54:13, 11.07s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14686
total_samples=25950, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:03,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.60 | bwd_microstep: 2859.32 | bwd_inner_microstep: 2851.01 | bwd_allreduce_microstep: 8.25 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14984
total_samples=25955, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:06,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.73 | bwd_microstep: 1906.39 | bwd_inner_microstep: 1760.15 | bwd_allreduce_microstep: 146.17 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11790
total_samples=25958, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:08,949] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.40 | bwd_microstep: 1816.41 | bwd_inner_microstep: 1588.98 | bwd_allreduce_microstep: 227.36 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13495
total_samples=25962, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:12,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 06:56:12,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.21 | bwd_microstep: 2465.69 | bwd_inner_microstep: 2150.20 | bwd_allreduce_microstep: 315.42 | step_microstep: 115.36
[2025-08-03 06:56:12,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.86 | bwd: 9047.85 | bwd_inner: 8350.34 | bwd_allreduce: 697.27 | step: 115.87
{'loss': 0.7332, 'learning_rate': 1.1046851050254504e-06, 'epoch': 0.85}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13579
total_samples=25966, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:15,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.63 | bwd_microstep: 2021.43 | bwd_inner_microstep: 1876.81 | bwd_allreduce_microstep: 144.56 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13301
total_samples=25970, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:18,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 750.63 | bwd_microstep: 2236.36 | bwd_inner_microstep: 2010.14 | bwd_allreduce_microstep: 226.16 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12018
total_samples=25973, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:20,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.50 | bwd_microstep: 1839.78 | bwd_inner_microstep: 1715.26 | bwd_allreduce_microstep: 124.46 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14659
total_samples=25977, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:23,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 06:56:23,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.88 | bwd_microstep: 2119.43 | bwd_inner_microstep: 1928.04 | bwd_allreduce_microstep: 191.33 | step_microstep: 132.09
[2025-08-03 06:56:23,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2915.58 | bwd: 8217.06 | bwd_inner: 7530.25 | bwd_allreduce: 686.57 | step: 132.56
{'loss': 0.7303, 'learning_rate': 1.0972982584221592e-06, 'epoch': 0.85}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11750
total_samples=25980, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:26,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.18 | bwd_microstep: 1875.59 | bwd_inner_microstep: 1747.60 | bwd_allreduce_microstep: 127.92 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13617
total_samples=25984, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:29,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.17 | bwd_microstep: 1819.68 | bwd_inner_microstep: 1680.87 | bwd_allreduce_microstep: 138.73 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12267
total_samples=25987, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:31,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.97 | bwd_microstep: 2077.04 | bwd_inner_microstep: 1724.48 | bwd_allreduce_microstep: 352.50 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13388
total_samples=25991, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:34,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 06:56:34,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.21 | bwd_microstep: 1982.29 | bwd_inner_microstep: 1830.31 | bwd_allreduce_microstep: 151.92 | step_microstep: 129.19
[2025-08-03 06:56:34,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.46 | bwd: 7754.66 | bwd_inner: 6983.26 | bwd_allreduce: 771.16 | step: 129.73
{'loss': 0.7271, 'learning_rate': 1.0899347581163222e-06, 'epoch': 0.85}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13734
total_samples=25995, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:37,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.75 | bwd_microstep: 2045.39 | bwd_inner_microstep: 1860.28 | bwd_allreduce_microstep: 185.03 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13303
total_samples=25999, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:40,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.39 | bwd_microstep: 1737.73 | bwd_inner_microstep: 1659.76 | bwd_allreduce_microstep: 77.91 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13770
total_samples=26003, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:42,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.80 | bwd_microstep: 1819.38 | bwd_inner_microstep: 1714.51 | bwd_allreduce_microstep: 104.80 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11690
total_samples=26006, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:45,440] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 06:56:45,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.70 | bwd_microstep: 1788.53 | bwd_inner_microstep: 1546.95 | bwd_allreduce_microstep: 241.52 | step_microstep: 111.56
[2025-08-03 06:56:45,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.57 | bwd: 7391.08 | bwd_inner: 6781.50 | bwd_allreduce: 609.33 | step: 112.06
{'loss': 0.7326, 'learning_rate': 1.0825946234178575e-06, 'epoch': 0.85}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13266
total_samples=26010, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:48,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.77 | bwd_microstep: 1818.77 | bwd_inner_microstep: 1710.69 | bwd_allreduce_microstep: 108.02 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11884
total_samples=26013, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:50,877] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.31 | bwd_microstep: 2047.60 | bwd_inner_microstep: 1821.01 | bwd_allreduce_microstep: 226.52 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11911
total_samples=26016, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:53,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.28 | bwd_microstep: 2158.59 | bwd_inner_microstep: 1943.57 | bwd_allreduce_microstep: 214.95 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11717
total_samples=26019, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:56,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 06:56:56,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.92 | bwd_microstep: 1789.53 | bwd_inner_microstep: 1541.69 | bwd_allreduce_microstep: 247.78 | step_microstep: 110.32
[2025-08-03 06:56:56,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.22 | bwd: 7814.55 | bwd_inner: 7016.96 | bwd_allreduce: 797.35 | step: 110.90
{'loss': 0.7226, 'learning_rate': 1.0752778735754121e-06, 'epoch': 0.86}
85%|████████▌ | 1707/2000 [5:13:27<55:47, 11.42s/it]
85%|████████▌ | 1708/2000 [5:13:38<55:46, 11.46s/it]
85%|████████▌ | 1709/2000 [5:13:49<54:53, 11.32s/it]
86%|████████▌ | 1710/2000 [5:14:00<53:40, 11.10s/it]
86%|████████▌ | 1711/2000 [5:14:11<53:26, 11.09s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13410
total_samples=26023, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:56:59,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.32 | bwd_microstep: 1798.99 | bwd_inner_microstep: 1700.36 | bwd_allreduce_microstep: 98.56 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11727
total_samples=26026, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:01,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.04 | bwd_microstep: 1718.77 | bwd_inner_microstep: 1529.46 | bwd_allreduce_microstep: 189.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13498
total_samples=26030, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:04,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.86 | bwd_microstep: 1826.82 | bwd_inner_microstep: 1728.35 | bwd_allreduce_microstep: 98.40 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12554
total_samples=26034, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:06,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 06:57:06,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.24 | bwd_microstep: 1806.58 | bwd_inner_microstep: 1618.37 | bwd_allreduce_microstep: 188.15 | step_microstep: 118.23
[2025-08-03 06:57:06,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.39 | bwd: 7151.20 | bwd_inner: 6576.54 | bwd_allreduce: 574.41 | step: 118.70
{'loss': 0.725, 'learning_rate': 1.067984527776309e-06, 'epoch': 0.86}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13174
total_samples=26039, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:09,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.72 | bwd_microstep: 1807.32 | bwd_inner_microstep: 1650.47 | bwd_allreduce_microstep: 156.78 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13524
total_samples=26043, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:12,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.89 | bwd_microstep: 1742.69 | bwd_inner_microstep: 1686.51 | bwd_allreduce_microstep: 56.04 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13283
total_samples=26047, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:14,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.20 | bwd_microstep: 1801.14 | bwd_inner_microstep: 1708.36 | bwd_allreduce_microstep: 92.71 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11646
total_samples=26050, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:19,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 06:57:19,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1861.81 | bwd_microstep: 2295.51 | bwd_inner_microstep: 2248.83 | bwd_allreduce_microstep: 46.61 | step_microstep: 108.79
[2025-08-03 06:57:19,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4037.54 | bwd: 7646.72 | bwd_inner: 7294.18 | bwd_allreduce: 352.27 | step: 109.33
{'loss': 0.7296, 'learning_rate': 1.0607146051465011e-06, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11933
total_samples=26053, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:21,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.16 | bwd_microstep: 2051.12 | bwd_inner_microstep: 1825.16 | bwd_allreduce_microstep: 225.90 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13666
total_samples=26057, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:24,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.20 | bwd_microstep: 2084.60 | bwd_inner_microstep: 1963.81 | bwd_allreduce_microstep: 120.73 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14092
total_samples=26062, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:27,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.78 | bwd_microstep: 1848.98 | bwd_inner_microstep: 1740.11 | bwd_allreduce_microstep: 108.81 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12687
total_samples=26066, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:30,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 06:57:30,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.75 | bwd_microstep: 1975.10 | bwd_inner_microstep: 1798.41 | bwd_allreduce_microstep: 176.62 | step_microstep: 110.38
[2025-08-03 06:57:30,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.83 | bwd: 7959.84 | bwd_inner: 7327.47 | bwd_allreduce: 632.13 | step: 110.81
{'loss': 0.7395, 'learning_rate': 1.0534681247505107e-06, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11714
total_samples=26069, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:32,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.26 | bwd_microstep: 1763.63 | bwd_inner_microstep: 1554.93 | bwd_allreduce_microstep: 208.63 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15022
total_samples=26073, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:35,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.16 | bwd_microstep: 2024.92 | bwd_inner_microstep: 1934.61 | bwd_allreduce_microstep: 90.25 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 15933
total_samples=26078, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:38,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.06 | bwd_microstep: 1812.33 | bwd_inner_microstep: 1751.89 | bwd_allreduce_microstep: 60.37 | step_microstep: 0.25
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12526
total_samples=26082, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:41,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 06:57:41,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.49 | bwd_microstep: 2183.65 | bwd_inner_microstep: 1984.65 | bwd_allreduce_microstep: 198.93 | step_microstep: 107.41
[2025-08-03 06:57:41,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2787.89 | bwd: 7784.58 | bwd_inner: 7226.08 | bwd_allreduce: 558.26 | step: 107.91
{'loss': 0.7404, 'learning_rate': 1.0462451055913847e-06, 'epoch': 0.86}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13810
total_samples=26086, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:43,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.54 | bwd_microstep: 1981.51 | bwd_inner_microstep: 1865.48 | bwd_allreduce_microstep: 115.96 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13685
total_samples=26090, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:46,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.93 | bwd_microstep: 1772.03 | bwd_inner_microstep: 1703.86 | bwd_allreduce_microstep: 68.11 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11947
total_samples=26093, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:49,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.66 | bwd_microstep: 1852.42 | bwd_inner_microstep: 1583.20 | bwd_allreduce_microstep: 269.16 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13514
total_samples=26097, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:51,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 06:57:51,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.27 | bwd_microstep: 1884.25 | bwd_inner_microstep: 1824.75 | bwd_allreduce_microstep: 59.42 | step_microstep: 107.47
[2025-08-03 06:57:51,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.34 | bwd: 7490.26 | bwd_inner: 6977.27 | bwd_allreduce: 512.74 | step: 107.92
{'loss': 0.7371, 'learning_rate': 1.0390455666106547e-06, 'epoch': 0.86}
1/2000 [5:14:11<53:26, 11.09s/it]
86%|████████▌ | 1712/2000 [5:14:21<52:16, 10.89s/it]
86%|████████▌ | 1713/2000 [5:14:33<53:48, 11.25s/it]
86%|████████▌ | 1714/2000 [5:14:45<53:28, 11.22s/it]
86%|████████▌ | 1715/2000 [5:14:56<52:58, 11.15s/it]
86%|████████▌ | 1716/2000 [5:15:06<52:12, 11.03s/it]
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12300
total_samples=26101, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:54,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 746.92 | bwd_microstep: 1855.40 | bwd_inner_microstep: 1612.58 | bwd_allreduce_microstep: 242.76 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12587
total_samples=26105, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:57,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.65 | bwd_microstep: 1700.50 | bwd_inner_microstep: 1569.46 | bwd_allreduce_microstep: 130.97 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11614
total_samples=26108, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:57:59,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.72 | bwd_microstep: 1789.45 | bwd_inner_microstep: 1562.30 | bwd_allreduce_microstep: 227.08 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12203
total_samples=26111, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:02,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61
[2025-08-03 06:58:02,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.13 | bwd_microstep: 1821.25 | bwd_inner_microstep: 1596.70 | bwd_allreduce_microstep: 224.48 | step_microstep: 153.04
[2025-08-03 06:58:02,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2888.35 | bwd: 7166.66 | bwd_inner: 6341.04 | bwd_allreduce: 825.37 | step: 153.53
{'loss': 0.7304, 'learning_rate': 1.0318695266882696e-06, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11912
total_samples=26114, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:05,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.15 | bwd_microstep: 1991.51 | bwd_inner_microstep: 1782.56 | bwd_allreduce_microstep: 208.88 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13358
total_samples=26118, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:08,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.19 | bwd_microstep: 2598.29 | bwd_inner_microstep: 1710.03 | bwd_allreduce_microstep: 888.21 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12663
total_samples=26122, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:11,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.13 | bwd_microstep: 1809.10 | bwd_inner_microstep: 1634.90 | bwd_allreduce_microstep: 174.13 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 16063
total_samples=26127, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:13,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 06:58:13,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.67 | bwd_microstep: 1787.90 | bwd_inner_microstep: 1755.82 | bwd_allreduce_microstep: 32.01 | step_microstep: 147.78
[2025-08-03 06:58:13,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.07 | bwd: 8186.85 | bwd_inner: 6883.30 | bwd_allreduce: 1303.30 | step: 148.12
{'loss': 0.7349, 'learning_rate': 1.024717004642557e-06, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11940
total_samples=26130, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:16,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.13 | bwd_microstep: 1794.70 | bwd_inner_microstep: 1595.78 | bwd_allreduce_microstep: 198.85 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13085
total_samples=26134, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:19,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.94 | bwd_microstep: 1824.38 | bwd_inner_microstep: 1697.38 | bwd_allreduce_microstep: 126.92 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13515
total_samples=26138, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:21,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.59 | bwd_microstep: 1862.77 | bwd_inner_microstep: 1728.91 | bwd_allreduce_microstep: 133.79 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13827
total_samples=26142, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:24,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 06:58:24,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.84 | bwd_microstep: 1766.94 | bwd_inner_microstep: 1710.37 | bwd_allreduce_microstep: 56.50 | step_microstep: 109.96
[2025-08-03 06:58:24,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.43 | bwd: 7248.85 | bwd_inner: 6732.44 | bwd_allreduce: 516.14 | step: 110.78
{'loss': 0.7413, 'learning_rate': 1.0175880192301713e-06, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11824
total_samples=26145, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:27,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.59 | bwd_microstep: 1840.89 | bwd_inner_microstep: 1551.71 | bwd_allreduce_microstep: 289.11 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13294
total_samples=26149, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:29,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.41 | bwd_microstep: 1990.37 | bwd_inner_microstep: 1685.28 | bwd_allreduce_microstep: 305.03 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11716
total_samples=26152, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:32,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.81 | bwd_microstep: 1813.90 | bwd_inner_microstep: 1567.53 | bwd_allreduce_microstep: 246.31 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13761
total_samples=26157, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:35,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.79
[2025-08-03 06:58:35,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.79 | bwd_microstep: 2215.71 | bwd_inner_microstep: 2209.24 | bwd_allreduce_microstep: 6.40 | step_microstep: 137.26
[2025-08-03 06:58:35,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.54 | bwd: 7860.92 | bwd_inner: 7013.76 | bwd_allreduce: 846.93 | step: 137.71
{'loss': 0.7372, 'learning_rate': 1.010482589146048e-06, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11839
total_samples=26160, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:38,327] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.69 | bwd_microstep: 2013.08 | bwd_inner_microstep: 1793.81 | bwd_allreduce_microstep: 219.21 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13109
total_samples=26164, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:41,229] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.05 | bwd_microstep: 2118.12 | bwd_inner_microstep: 1981.50 | bwd_allreduce_microstep: 136.55 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11829
total_samples=26167, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:43,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.34 | bwd_microstep: 1982.30 | bwd_inner_microstep: 1546.88 | bwd_allreduce_microstep: 435.35 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11653
total_samples=26170, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:46,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 06:58:46,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.74 | bwd_microstep: 1909.57 | bwd_inner_microstep: 1552.11 | bwd_allreduce_microstep: 357.40 | step_microstep: 129.16
[2025-08-03 06:58:46,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.75 | bwd: 8023.12 | bwd_inner: 6874.30 | bwd_allreduce: 1148.58 | step: 129.67
{'loss': 0.7351, 'learning_rate': 1.0034007330233487e-06, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11731
total_samples=26173, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:49,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.24 | bwd_microstep: 1774.24 | bwd_inner_microstep: 1554.38 | bwd_allreduce_microstep: 219.80 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13564
total_samples=26177, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:52,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.65 | bwd_microstep: 1877.04 | bwd_inner_microstep: 1829.09 | bwd_allreduce_microstep: 47.88 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11718
total_samples=26180, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:54,681] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.29 | bwd_microstep: 1881.81 | bwd_inner_microstep: 1734.36 | bwd_allreduce_microstep: 147.38 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11995
total_samples=26183, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:58:57,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 06:58:57,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.11 | bwd_microstep: 2144.10 | bwd_inner_microstep: 1869.68 | bwd_allreduce_microstep: 274.35 | step_microstep: 113.26
[2025-08-03 06:58:57,728] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.22 | bwd: 7677.25 | bwd_inner: 6987.50 | bwd_allreduce: 689.48 | step: 113.76
86%|████████▌ | 1717/2000 [5:15:17<51:15, 10.87s/it]
86%|████████▌ | 1718/2000 [5:15:28<51:56, 11.05s/it]
86%|████████▌ | 1719/2000 [5:15:39<51:00, 10.89s/it]
86%|████████▌ | 1720/2000 [5:15:50<51:07, 10.95s/it]
86%|████████▌ | 1721/2000 [5:16:01<51:24, 11.06s/it]
{'loss': 0.736, 'learning_rate': 9.963424694334122e-07, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12237
total_samples=26186, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:00,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.65 | bwd_microstep: 1839.65 | bwd_inner_microstep: 1601.80 | bwd_allreduce_microstep: 237.79 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13082
total_samples=26190, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:03,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.82 | bwd_microstep: 1952.50 | bwd_inner_microstep: 1697.97 | bwd_allreduce_microstep: 254.46 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12297
total_samples=26193, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:05,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.42 | bwd_microstep: 1880.32 | bwd_inner_microstep: 1586.50 | bwd_allreduce_microstep: 293.76 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12041
total_samples=26196, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:08,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 06:59:08,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.11 | bwd_microstep: 1763.74 | bwd_inner_microstep: 1563.24 | bwd_allreduce_microstep: 200.44 | step_microstep: 140.30
[2025-08-03 06:59:08,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.95 | bwd: 7436.26 | bwd_inner: 6449.51 | bwd_allreduce: 986.53 | step: 140.81
{'loss': 0.7366, 'learning_rate': 9.893078168857173e-07, 'epoch': 0.86}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13828
total_samples=26200, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:10,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.13 | bwd_microstep: 1729.29 | bwd_inner_microstep: 1686.85 | bwd_allreduce_microstep: 42.37 | step_microstep: 0.14
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13229
total_samples=26204, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:13,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.56 | bwd_microstep: 1833.25 | bwd_inner_microstep: 1689.10 | bwd_allreduce_microstep: 144.09 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12587
total_samples=26207, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:16,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.65 | bwd_microstep: 1832.54 | bwd_inner_microstep: 1577.90 | bwd_allreduce_microstep: 254.57 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13728
total_samples=26211, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:19,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 06:59:19,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.13 | bwd_microstep: 2479.21 | bwd_inner_microstep: 2318.96 | bwd_allreduce_microstep: 160.19 | step_microstep: 110.07
[2025-08-03 06:59:19,516] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.39 | bwd: 7874.35 | bwd_inner: 7272.81 | bwd_allreduce: 601.29 | step: 110.60
{'loss': 0.7299, 'learning_rate': 9.822967938278172e-07, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11604
total_samples=26214, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:22,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.67 | bwd_microstep: 1866.89 | bwd_inner_microstep: 1611.92 | bwd_allreduce_microstep: 254.90 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12724
total_samples=26218, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:24,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.15 | bwd_microstep: 1745.21 | bwd_inner_microstep: 1624.44 | bwd_allreduce_microstep: 120.71 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13300
total_samples=26222, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:27,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.53 | bwd_microstep: 1888.55 | bwd_inner_microstep: 1688.01 | bwd_allreduce_microstep: 200.48 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13352
total_samples=26226, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:29,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 06:59:29,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.28 | bwd_microstep: 1764.22 | bwd_inner_microstep: 1684.57 | bwd_allreduce_microstep: 79.59 | step_microstep: 133.21
[2025-08-03 06:59:29,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2776.55 | bwd: 7264.94 | bwd_inner: 6608.93 | bwd_allreduce: 655.75 | step: 133.71
{'loss': 0.7249, 'learning_rate': 9.753094186453028e-07, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12273
total_samples=26229, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:32,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.23 | bwd_microstep: 1752.34 | bwd_inner_microstep: 1562.12 | bwd_allreduce_microstep: 190.15 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11617
total_samples=26232, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:35,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1231.43 | bwd_microstep: 1922.75 | bwd_inner_microstep: 1813.91 | bwd_allreduce_microstep: 108.78 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14220
total_samples=26236, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:38,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.18 | bwd_microstep: 1833.17 | bwd_inner_microstep: 1758.81 | bwd_allreduce_microstep: 74.30 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13720
total_samples=26240, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:41,339] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 06:59:41,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.17 | bwd_microstep: 2062.42 | bwd_inner_microstep: 1924.82 | bwd_allreduce_microstep: 137.53 | step_microstep: 113.03
[2025-08-03 06:59:41,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3398.95 | bwd: 7570.72 | bwd_inner: 7059.65 | bwd_allreduce: 510.83 | step: 113.38
{'loss': 0.7327, 'learning_rate': 9.683457096617487e-07, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11737
total_samples=26243, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:44,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.32 | bwd_microstep: 1867.65 | bwd_inner_microstep: 1731.23 | bwd_allreduce_microstep: 136.35 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11706
total_samples=26246, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:46,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.16 | bwd_microstep: 1798.83 | bwd_inner_microstep: 1572.38 | bwd_allreduce_microstep: 226.38 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13501
total_samples=26250, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:49,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.76 | bwd_microstep: 1796.17 | bwd_inner_microstep: 1768.89 | bwd_allreduce_microstep: 27.20 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13861
total_samples=26254, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:51,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 06:59:51,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.16 | bwd_microstep: 1926.34 | bwd_inner_microstep: 1836.72 | bwd_allreduce_microstep: 89.56 | step_microstep: 110.35
[2025-08-03 06:59:51,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2874.32 | bwd: 7389.02 | bwd_inner: 6909.21 | bwd_allreduce: 479.57 | step: 110.79
86%|████████▌ | 1722/2000 [5:16:12<51:04, 11.02s/it]
86%|████████▌ | 1723/2000 [5:16:23<50:27, 10.93s/it]
86%|████████▌ | 1724/2000 [5:16:34<50:28, 10.97s/it]
86%|████████▋ | 1725/2000 [5:16:44<49:36, 10.82s/it]
86%|████████▋ | 1726/2000 [5:16:56<50:08, 10.98s/it]
86%|████████▋ | 1727/2000 [5:17:06
{'loss': 0.7337, 'learning_rate': 9.614056851386743e-07, 'epoch': 0.86}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13947
total_samples=26258, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:54,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.27 | bwd_microstep: 2096.94 | bwd_inner_microstep: 1893.50 | bwd_allreduce_microstep: 203.35 | step_microstep: 0.35
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12508
total_samples=26261, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 06:59:57,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.74 | bwd_microstep: 1961.79 | bwd_inner_microstep: 1610.90 | bwd_allreduce_microstep: 350.83 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13230
total_samples=26265, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:00,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.57 | bwd_microstep: 1947.07 | bwd_inner_microstep: 1720.84 | bwd_allreduce_microstep: 226.16 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13463
total_samples=26269, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:03,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 07:00:03,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.02 | bwd_microstep: 1865.49 | bwd_inner_microstep: 1729.37 | bwd_allreduce_microstep: 136.07 | step_microstep: 110.99
[2025-08-03 07:00:03,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.54 | bwd: 7871.34 | bwd_inner: 6954.61 | bwd_allreduce: 916.49 | step: 111.56
{'loss': 0.7451, 'learning_rate': 9.544893632754816e-07, 'epoch': 0.86}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13195
total_samples=26274, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:05,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.81 | bwd_microstep: 1891.11 | bwd_inner_microstep: 1680.27 | bwd_allreduce_microstep: 210.77 | step_microstep: 0.19
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11734
total_samples=26277, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:08,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.68 | bwd_microstep: 2181.57 | bwd_inner_microstep: 1917.85 | bwd_allreduce_microstep: 263.65 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14025
total_samples=26281, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:11,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.16 | bwd_microstep: 2038.24 | bwd_inner_microstep: 1890.06 | bwd_allreduce_microstep: 148.12 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13552
total_samples=26285, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:14,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.71
[2025-08-03 07:00:14,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.03 | bwd_microstep: 1903.00 | bwd_inner_microstep: 1722.91 | bwd_allreduce_microstep: 180.03 | step_microstep: 152.92
[2025-08-03 07:00:14,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.60 | bwd: 8013.97 | bwd_inner: 7211.08 | bwd_allreduce: 802.65 | step: 153.35
{'loss': 0.7279, 'learning_rate': 9.475967622094207e-07, 'epoch': 0.86}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14643
total_samples=26289, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:16,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.81 | bwd_microstep: 1787.28 | bwd_inner_microstep: 1732.82 | bwd_allreduce_microstep: 54.38 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13418
total_samples=26293, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:19,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.87 | bwd_microstep: 1795.73 | bwd_inner_microstep: 1717.78 | bwd_allreduce_microstep: 77.88 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13257
total_samples=26297, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:22,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.08 | bwd_microstep: 1931.12 | bwd_inner_microstep: 1854.43 | bwd_allreduce_microstep: 76.62 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13177
total_samples=26301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:24,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 07:00:24,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.17 | bwd_microstep: 1715.71 | bwd_inner_microstep: 1660.10 | bwd_allreduce_microstep: 55.55 | step_microstep: 120.32
[2025-08-03 07:00:24,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.86 | bwd: 7229.89 | bwd_inner: 6965.13 | bwd_allreduce: 264.51 | step: 120.69
{'loss': 0.7406, 'learning_rate': 9.407279000155311e-07, 'epoch': 0.86}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11613
total_samples=26304, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:27,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.74 | bwd_microstep: 1790.52 | bwd_inner_microstep: 1540.45 | bwd_allreduce_microstep: 250.01 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13380
total_samples=26308, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:30,276] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.45 | bwd_microstep: 1997.03 | bwd_inner_microstep: 1874.05 | bwd_allreduce_microstep: 122.91 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13916
total_samples=26312, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:33,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.05 | bwd_microstep: 2301.18 | bwd_inner_microstep: 1822.68 | bwd_allreduce_microstep: 478.44 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13763
total_samples=26316, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:36,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 07:00:36,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.50 | bwd_microstep: 1819.96 | bwd_inner_microstep: 1734.66 | bwd_allreduce_microstep: 85.24 | step_microstep: 126.57
[2025-08-03 07:00:36,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2861.68 | bwd: 7908.74 | bwd_inner: 6971.84 | bwd_allreduce: 936.67 | step: 127.08
{'loss': 0.732, 'learning_rate': 9.338827947066076e-07, 'epoch': 0.87}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12824
total_samples=26320, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:39,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.28 | bwd_microstep: 2181.30 | bwd_inner_microstep: 2014.95 | bwd_allreduce_microstep: 166.28 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14041
total_samples=26324, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:41,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.34 | bwd_microstep: 1830.71 | bwd_inner_microstep: 1750.98 | bwd_allreduce_microstep: 79.67 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14142
total_samples=26328, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:44,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.84 | bwd_microstep: 1855.43 | bwd_inner_microstep: 1750.67 | bwd_allreduce_microstep: 104.68 | step_microstep: 0.20
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13457
total_samples=26332, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:46,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 07:00:46,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.79 | bwd_microstep: 1738.78 | bwd_inner_microstep: 1633.02 | bwd_allreduce_microstep: 105.69 | step_microstep: 109.93
[2025-08-03 07:00:46,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2898.18 | bwd: 7606.28 | bwd_inner: 7149.61 | bwd_allreduce: 456.41 | step: 110.40
86%|████████▋ | 1727/2000 [5:17:06<49:30, 10.88s/it]
86%|████████▋ | 1728/2000 [5:17:17<49:38, 10.95s/it]
86%|████████▋ | 1729/2000 [5:17:29<49:54, 11.05s/it]
86%|████████▋ | 1730/2000 [5:17:39<48:57, 10.88s/it]
87%|████████▋ | 1731/2000 [5:17:50<49:11, 10.97s/it]
87%|████████▋ | 1732/2000 [5:18:01<48:56, 10.96s/it]
{'loss': 0.7251, 'learning_rate': 9.270614642331377e-07, 'epoch': 0.87}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13237
total_samples=26336, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:49,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.93 | bwd_microstep: 1755.40 | bwd_inner_microstep: 1681.78 | bwd_allreduce_microstep: 73.55 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13777
total_samples=26340, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:52,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.00 | bwd_microstep: 2055.25 | bwd_inner_microstep: 1931.68 | bwd_allreduce_microstep: 123.51 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13297
total_samples=26344, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:54,990] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.85 | bwd_microstep: 1832.06 | bwd_inner_microstep: 1717.87 | bwd_allreduce_microstep: 114.13 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11988
total_samples=26348, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:00:57,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 07:00:57,717] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.69 | bwd_microstep: 1810.27 | bwd_inner_microstep: 1581.48 | bwd_allreduce_microstep: 228.70 | step_microstep: 130.40
[2025-08-03 07:00:57,718] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.39 | bwd: 7453.04 | bwd_inner: 6912.80 | bwd_allreduce: 539.98 | step: 131.01
{'loss': 0.7263, 'learning_rate': 9.202639264832669e-07, 'epoch': 0.87}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11755
total_samples=26351, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:00,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.79 | bwd_microstep: 1922.67 | bwd_inner_microstep: 1734.78 | bwd_allreduce_microstep: 187.80 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13522
total_samples=26355, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:03,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.73 | bwd_microstep: 1876.86 | bwd_inner_microstep: 1749.61 | bwd_allreduce_microstep: 127.17 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13269
total_samples=26359, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:05,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.19 | bwd_microstep: 2096.13 | bwd_inner_microstep: 2089.89 | bwd_allreduce_microstep: 6.18 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12784
total_samples=26363, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:08,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 07:01:08,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.76 | bwd_microstep: 1877.95 | bwd_inner_microstep: 1617.56 | bwd_allreduce_microstep: 260.33 | step_microstep: 122.30
[2025-08-03 07:01:08,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.40 | bwd: 7773.67 | bwd_inner: 7191.84 | bwd_allreduce: 581.56 | step: 122.97
{'loss': 0.7353, 'learning_rate': 9.134901992827427e-07, 'epoch': 0.87}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13145
total_samples=26367, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:11,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.07 | bwd_microstep: 2099.80 | bwd_inner_microstep: 1916.13 | bwd_allreduce_microstep: 183.60 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14042
total_samples=26371, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:14,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.89 | bwd_microstep: 1880.53 | bwd_inner_microstep: 1832.45 | bwd_allreduce_microstep: 48.01 | step_microstep: 0.77
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12963
total_samples=26375, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:17,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.79 | bwd_microstep: 2082.29 | bwd_inner_microstep: 1929.93 | bwd_allreduce_microstep: 152.29 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11887
total_samples=26378, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:19,910] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 07:01:19,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.61 | bwd_microstep: 1849.49 | bwd_inner_microstep: 1600.32 | bwd_allreduce_microstep: 249.10 | step_microstep: 136.74
[2025-08-03 07:01:19,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.28 | bwd: 7912.16 | bwd_inner: 7278.82 | bwd_allreduce: 633.08 | step: 138.05
{'loss': 0.7262, 'learning_rate': 9.067403003948783e-07, 'epoch': 0.87}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14165
total_samples=26383, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:22,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.65 | bwd_microstep: 2195.00 | bwd_inner_microstep: 1954.68 | bwd_allreduce_microstep: 240.26 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13781
total_samples=26387, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:26,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.12 | bwd_microstep: 2379.77 | bwd_inner_microstep: 2231.12 | bwd_allreduce_microstep: 148.59 | step_microstep: 0.75
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13064
total_samples=26391, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:28,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.75 | bwd_microstep: 1826.45 | bwd_inner_microstep: 1707.60 | bwd_allreduce_microstep: 118.79 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11591
total_samples=26394, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:31,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.05
[2025-08-03 07:01:31,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.49 | bwd_microstep: 1830.76 | bwd_inner_microstep: 1584.81 | bwd_allreduce_microstep: 245.87 | step_microstep: 145.81
[2025-08-03 07:01:31,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.94 | bwd: 8232.05 | bwd_inner: 7478.20 | bwd_allreduce: 753.60 | step: 146.79
{'loss': 0.7326, 'learning_rate': 9.000142475204965e-07, 'epoch': 0.87}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13375
total_samples=26398, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:34,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.16 | bwd_microstep: 1803.93 | bwd_inner_microstep: 1712.52 | bwd_allreduce_microstep: 91.34 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14633
total_samples=26402, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:36,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 741.46 | bwd_microstep: 1817.08 | bwd_inner_microstep: 1754.42 | bwd_allreduce_microstep: 62.60 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13991
total_samples=26406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:39,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.03 | bwd_microstep: 1770.13 | bwd_inner_microstep: 1720.22 | bwd_allreduce_microstep: 49.83 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12014
total_samples=26409, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:42,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 07:01:42,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.30 | bwd_microstep: 1924.61 | bwd_inner_microstep: 1774.14 | bwd_allreduce_microstep: 150.40 | step_microstep: 133.42
[2025-08-03 07:01:42,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.88 | bwd: 7315.81 | bwd_inner: 6961.30 | bwd_allreduce: 354.25 | step: 134.11
87%|████████▋ | 1733/2000 [5:18:12<48:28, 10.89s/it]
87%|████████▋ | 1734/2000 [5:18:23<48:30, 10.94s/it]
87%|████████▋ | 1735/2000 [5:18:34<48:35, 11.00s/it]
87%|████████▋ | 1736/2000 [5:18:46<49:05, 11.16s/it]
87%|████████▋ | 1737/2000 [5:18:56<48:08, 10.98s/it]
{'loss': 0.7169, 'learning_rate': 8.933120582978827e-07, 'epoch': 0.87}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13497
total_samples=26413, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:44,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.45 | bwd_microstep: 1928.59 | bwd_inner_microstep: 1720.94 | bwd_allreduce_microstep: 207.57 | step_microstep: 0.19
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13611
total_samples=26418, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:47,631] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.78 | bwd_microstep: 2102.61 | bwd_inner_microstep: 1931.27 | bwd_allreduce_microstep: 171.28 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13597
total_samples=26422, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:50,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.59 | bwd_microstep: 1948.20 | bwd_inner_microstep: 1735.42 | bwd_allreduce_microstep: 212.72 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13509
total_samples=26426, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:53,360] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.43
[2025-08-03 07:01:53,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1015.61 | bwd_microstep: 1770.87 | bwd_inner_microstep: 1690.36 | bwd_allreduce_microstep: 80.43 | step_microstep: 113.71
[2025-08-03 07:01:53,362] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3134.36 | bwd: 7750.32 | bwd_inner: 7077.99 | bwd_allreduce: 672.09 | step: 114.38
{'loss': 0.7331, 'learning_rate': 8.866337503027523e-07, 'epoch': 0.87}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13619
total_samples=26430, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:55,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.60 | bwd_microstep: 1765.39 | bwd_inner_microstep: 1676.88 | bwd_allreduce_microstep: 88.44 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15034
total_samples=26435, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:01:58,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.33 | bwd_microstep: 1989.64 | bwd_inner_microstep: 1797.47 | bwd_allreduce_microstep: 192.10 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15801
total_samples=26440, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:01,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.11 | bwd_microstep: 1779.34 | bwd_inner_microstep: 1730.85 | bwd_allreduce_microstep: 48.42 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11825
total_samples=26443, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:04,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 07:02:04,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 768.38 | bwd_microstep: 1849.78 | bwd_inner_microstep: 1600.12 | bwd_allreduce_microstep: 249.57 | step_microstep: 113.25
[2025-08-03 07:02:04,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2876.36 | bwd: 7384.18 | bwd_inner: 6805.33 | bwd_allreduce: 578.61 | step: 113.58
{'loss': 0.733, 'learning_rate': 8.799793410481871e-07, 'epoch': 0.87}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13933
total_samples=26447, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:06,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.26 | bwd_microstep: 1836.09 | bwd_inner_microstep: 1744.71 | bwd_allreduce_microstep: 91.31 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 16119
total_samples=26451, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:09,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.93 | bwd_microstep: 2057.58 | bwd_inner_microstep: 1951.80 | bwd_allreduce_microstep: 105.71 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14162
total_samples=26455, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:12,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.32 | bwd_microstep: 2063.31 | bwd_inner_microstep: 1970.61 | bwd_allreduce_microstep: 92.64 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12057
total_samples=26458, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:15,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 07:02:15,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.07 | bwd_microstep: 1944.83 | bwd_inner_microstep: 1600.61 | bwd_allreduce_microstep: 344.13 | step_microstep: 115.59
[2025-08-03 07:02:15,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2892.51 | bwd: 7901.86 | bwd_inner: 7267.73 | bwd_allreduce: 633.88 | step: 116.07
{'loss': 0.7489, 'learning_rate': 8.733488479845997e-07, 'epoch': 0.87}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13765
total_samples=26462, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:17,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.97 | bwd_microstep: 1713.74 | bwd_inner_microstep: 1637.46 | bwd_allreduce_microstep: 76.21 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11981
total_samples=26465, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:20,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.50 | bwd_microstep: 2178.87 | bwd_inner_microstep: 2172.57 | bwd_allreduce_microstep: 6.24 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13703
total_samples=26469, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:23,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.16 | bwd_microstep: 1912.53 | bwd_inner_microstep: 1707.11 | bwd_allreduce_microstep: 205.34 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13000
total_samples=26472, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:26,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 07:02:26,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.55 | bwd_microstep: 1921.79 | bwd_inner_microstep: 1749.65 | bwd_allreduce_microstep: 172.08 | step_microstep: 127.81
[2025-08-03 07:02:26,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.11 | bwd: 7726.98 | bwd_inner: 7266.79 | bwd_allreduce: 459.95 | step: 128.37
{'loss': 0.7261, 'learning_rate': 8.667422884996823e-07, 'epoch': 0.87}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13433
total_samples=26476, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:29,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.98 | bwd_microstep: 2384.49 | bwd_inner_microstep: 2078.10 | bwd_allreduce_microstep: 306.31 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15207
total_samples=26480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:31,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.10 | bwd_microstep: 1837.42 | bwd_inner_microstep: 1782.00 | bwd_allreduce_microstep: 55.32 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13006
total_samples=26484, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:34,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.03 | bwd_microstep: 1871.80 | bwd_inner_microstep: 1674.05 | bwd_allreduce_microstep: 197.68 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11861
total_samples=26487, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:37,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69
[2025-08-03 07:02:37,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.06 | bwd_microstep: 1830.82 | bwd_inner_microstep: 1605.10 | bwd_allreduce_microstep: 225.64 | step_microstep: 145.96
[2025-08-03 07:02:37,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.09 | bwd: 7924.60 | bwd_inner: 7139.25 | bwd_allreduce: 785.05 | step: 146.61
{'loss': 0.7306, 'learning_rate': 8.60159679918372e-07, 'epoch': 0.87}
87%|████████▋ | 1738/2000 [5:19:08<48:26, 11.09s/it]
87%|████████▋ | 1739/2000 [5:19:18<47:42, 10.97s/it]
87%|████████▋ | 1740/2000 [5:19:30<47:50, 11.04s/it]
87%|████████▋ | 1741/2000 [5:19:41<47:30, 11.01s/it]
87%|████████▋ | 1742/2000 [5:19:52<47:36, 11.07s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13635
total_samples=26491, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:40,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.88 | bwd_microstep: 2260.64 | bwd_inner_microstep: 1996.72 | bwd_allreduce_microstep: 263.85 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13883
total_samples=26495, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:43,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.56 | bwd_microstep: 1811.26 | bwd_inner_microstep: 1766.64 | bwd_allreduce_microstep: 44.55 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13545
total_samples=26499, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:45,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.35 | bwd_microstep: 1751.85 | bwd_inner_microstep: 1685.74 | bwd_allreduce_microstep: 66.03 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12020
total_samples=26502, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:48,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.35
[2025-08-03 07:02:48,188] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.71 | bwd_microstep: 1744.43 | bwd_inner_microstep: 1546.36 | bwd_allreduce_microstep: 198.01 | step_microstep: 120.89
[2025-08-03 07:02:48,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.45 | bwd: 7568.23 | bwd_inner: 6995.46 | bwd_allreduce: 572.52 | step: 121.40
{'loss': 0.7364, 'learning_rate': 8.536010395027905e-07, 'epoch': 0.87}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13248
total_samples=26506, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:50,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.77 | bwd_microstep: 1818.78 | bwd_inner_microstep: 1686.91 | bwd_allreduce_microstep: 131.79 | step_microstep: 0.18
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12287
total_samples=26510, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:53,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.69 | bwd_microstep: 2179.12 | bwd_inner_microstep: 1961.41 | bwd_allreduce_microstep: 217.64 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14218
total_samples=26514, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:56,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.61 | bwd_microstep: 2036.12 | bwd_inner_microstep: 1957.42 | bwd_allreduce_microstep: 78.63 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14908
total_samples=26518, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:02:59,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.76
[2025-08-03 07:02:59,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.33 | bwd_microstep: 1804.80 | bwd_inner_microstep: 1764.32 | bwd_allreduce_microstep: 40.40 | step_microstep: 130.98
[2025-08-03 07:02:59,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.33 | bwd: 7838.86 | bwd_inner: 7370.05 | bwd_allreduce: 468.56 | step: 131.49
{'loss': 0.7441, 'learning_rate': 8.470663844522053e-07, 'epoch': 0.87}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14189
total_samples=26522, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:02,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.82 | bwd_microstep: 1961.06 | bwd_inner_microstep: 1843.25 | bwd_allreduce_microstep: 117.73 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13684
total_samples=26526, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:04,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.69 | bwd_microstep: 2032.55 | bwd_inner_microstep: 1875.64 | bwd_allreduce_microstep: 156.85 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14375
total_samples=26530, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:07,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.66 | bwd_microstep: 1706.73 | bwd_inner_microstep: 1694.07 | bwd_allreduce_microstep: 12.59 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13653
total_samples=26534, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:10,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.91
[2025-08-03 07:03:10,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.13 | bwd_microstep: 1912.52 | bwd_inner_microstep: 1703.33 | bwd_allreduce_microstep: 209.13 | step_microstep: 155.20
[2025-08-03 07:03:10,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.21 | bwd: 7612.91 | bwd_inner: 7116.28 | bwd_allreduce: 496.38 | step: 155.84
{'loss': 0.7398, 'learning_rate': 8.405557319029911e-07, 'epoch': 0.87}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13666
total_samples=26538, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:12,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.17 | bwd_microstep: 1995.44 | bwd_inner_microstep: 1883.46 | bwd_allreduce_microstep: 111.91 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12738
total_samples=26542, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:15,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.69 | bwd_microstep: 1884.57 | bwd_inner_microstep: 1677.29 | bwd_allreduce_microstep: 207.21 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13840
total_samples=26546, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:18,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.39 | bwd_microstep: 1715.34 | bwd_inner_microstep: 1644.00 | bwd_allreduce_microstep: 71.28 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11926
total_samples=26549, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:20,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 07:03:20,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.83 | bwd_microstep: 1790.61 | bwd_inner_microstep: 1555.75 | bwd_allreduce_microstep: 234.77 | step_microstep: 124.80
[2025-08-03 07:03:20,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.00 | bwd: 7386.02 | bwd_inner: 6760.48 | bwd_allreduce: 625.27 | step: 125.27
{'loss': 0.7272, 'learning_rate': 8.340690989285727e-07, 'epoch': 0.87}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13144
total_samples=26553, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:23,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.54 | bwd_microstep: 1737.12 | bwd_inner_microstep: 1648.86 | bwd_allreduce_microstep: 88.19 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14576
total_samples=26557, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:26,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.18 | bwd_microstep: 2000.03 | bwd_inner_microstep: 1836.49 | bwd_allreduce_microstep: 163.47 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12089
total_samples=26560, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:28,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.04 | bwd_microstep: 1792.82 | bwd_inner_microstep: 1560.46 | bwd_allreduce_microstep: 232.29 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12444
total_samples=26564, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:31,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.48
[2025-08-03 07:03:31,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.35 | bwd_microstep: 1722.26 | bwd_inner_microstep: 1554.36 | bwd_allreduce_microstep: 167.83 | step_microstep: 152.56
[2025-08-03 07:03:31,289] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2769.03 | bwd: 7252.29 | bwd_inner: 6600.18 | bwd_allreduce: 651.86 | step: 153.05
{'loss': 0.7359, 'learning_rate': 8.276065025393909e-07, 'epoch': 0.87}
 87%|████████▋ | 1743/2000 [5:20:03<47:03, 10.99s/it]
 87%|████████▋ | 1744/2000 [5:20:14<47:01, 11.02s/it]
 87%|████████▋ | 1745/2000 [5:20:24<46:37, 10.97s/it]
 87%|████████▋ | 1746/2000 [5:20:35<46:02, 10.87s/it]
 87%|████████▋ | 1747/2000 [5:20:46<45:22, 10.76s/it]
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14491
total_samples=26568, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:34,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.04 | bwd_microstep: 2233.71 | bwd_inner_microstep: 2227.59 | bwd_allreduce_microstep: 6.05 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12846
total_samples=26572, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:36,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.50 | bwd_microstep: 1824.52 | bwd_inner_microstep: 1666.52 | bwd_allreduce_microstep: 157.92 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11556
total_samples=26575, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:39,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.28 | bwd_microstep: 2169.86 | bwd_inner_microstep: 1858.47 | bwd_allreduce_microstep: 311.32 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12504
total_samples=26578, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:42,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.86
[2025-08-03 07:03:42,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.42 | bwd_microstep: 1950.22 | bwd_inner_microstep: 1600.09 | bwd_allreduce_microstep: 350.07 | step_microstep: 116.38
[2025-08-03 07:03:42,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.18 | bwd: 8178.36 | bwd_inner: 7352.67 | bwd_allreduce: 825.44 | step: 116.90
{'loss': 0.7291, 'learning_rate': 8.211679596828481e-07, 'epoch': 0.87}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14502
total_samples=26582, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:45,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.92 | bwd_microstep: 1910.68 | bwd_inner_microstep: 1832.54 | bwd_allreduce_microstep: 78.07 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13900
total_samples=26586, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:48,069] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.80 | bwd_microstep: 1880.20 | bwd_inner_microstep: 1724.72 | bwd_allreduce_microstep: 155.42 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11817
total_samples=26589, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:50,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.86 | bwd_microstep: 1878.73 | bwd_inner_microstep: 1571.89 | bwd_allreduce_microstep: 306.75 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12131
total_samples=26592, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:53,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 07:03:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.94 | bwd_microstep: 1852.36 | bwd_inner_microstep: 1596.08 | bwd_allreduce_microstep: 256.21 | step_microstep: 134.50
[2025-08-03 07:03:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.45 | bwd: 7522.02 | bwd_inner: 6725.23 | bwd_allreduce: 796.54 | step: 135.00
{'loss': 0.7292, 'learning_rate': 8.147534872432761e-07, 'epoch': 0.87}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11924
total_samples=26595, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:56,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.12 | bwd_microstep: 2030.53 | bwd_inner_microstep: 1807.00 | bwd_allreduce_microstep: 223.45 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14775
total_samples=26599, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:03:59,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.98 | bwd_microstep: 1981.10 | bwd_inner_microstep: 1919.84 | bwd_allreduce_microstep: 61.19 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14379
total_samples=26603, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:01,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.24 | bwd_microstep: 1759.09 | bwd_inner_microstep: 1730.07 | bwd_allreduce_microstep: 28.96 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11663
total_samples=26606, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:04,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 07:04:04,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.52 | bwd_microstep: 1816.42 | bwd_inner_microstep: 1572.68 | bwd_allreduce_microstep: 243.67 | step_microstep: 110.75
[2025-08-03 07:04:04,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.77 | bwd: 7587.18 | bwd_inner: 7029.59 | bwd_allreduce: 557.36 | step: 111.27
{'loss': 0.725, 'learning_rate': 8.083631020418792e-07, 'epoch': 0.88}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12142
total_samples=26609, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:06,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.72 | bwd_microstep: 1703.91 | bwd_inner_microstep: 1544.66 | bwd_allreduce_microstep: 159.18 | step_microstep: 0.24
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12371
total_samples=26613, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:09,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.85 | bwd_microstep: 1750.81 | bwd_inner_microstep: 1588.92 | bwd_allreduce_microstep: 161.82 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13243
total_samples=26617, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:11,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.58 | bwd_microstep: 1867.50 | bwd_inner_microstep: 1811.66 | bwd_allreduce_microstep: 55.77 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11799
total_samples=26621, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:14,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 07:04:14,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.43 | bwd_microstep: 1915.74 | bwd_inner_microstep: 1663.16 | bwd_allreduce_microstep: 252.51 | step_microstep: 131.69
[2025-08-03 07:04:14,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.53 | bwd: 7238.00 | bwd_inner: 6608.38 | bwd_allreduce: 629.37 | step: 132.15
{'loss': 0.7202, 'learning_rate': 8.019968208366958e-07, 'epoch': 0.88}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14462
total_samples=26625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:17,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.40 | bwd_microstep: 1905.17 | bwd_inner_microstep: 1741.18 | bwd_allreduce_microstep: 163.91 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13528
total_samples=26629, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:20,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.95 | bwd_microstep: 1802.28 | bwd_inner_microstep: 1709.80 | bwd_allreduce_microstep: 92.41 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14435
total_samples=26633, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:22,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.95 | bwd_microstep: 1794.52 | bwd_inner_microstep: 1764.89 | bwd_allreduce_microstep: 29.56 | step_microstep: 0.86
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13412
total_samples=26637, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:25,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 07:04:25,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.39 | bwd_microstep: 1828.81 | bwd_inner_microstep: 1720.38 | bwd_allreduce_microstep: 108.33 | step_microstep: 136.59
[2025-08-03 07:04:25,332] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2824.62 | bwd: 7330.82 | bwd_inner: 6936.24 | bwd_allreduce: 394.30 | step: 137.73
{'loss': 0.7364, 'learning_rate': 7.956546603225601e-07, 'epoch': 0.88}
 87%|████████▋ | 1748/2000 [5:20:57<46:04, 10.97s/it]
 87%|████████▋ | 1749/2000 [5:21:08<45:35, 10.90s/it]
 88%|████████▊ | 1750/2000 [5:21:19<45:16, 10.87s/it]
 88%|████████▊ | 1751/2000 [5:21:29<44:36, 10.75s/it]
 88%|████████▊ | 1752/2000 [5:21:40<44:13, 10.70s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13231
total_samples=26642, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:27,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.72 | bwd_microstep: 1735.94 | bwd_inner_microstep: 1664.07 | bwd_allreduce_microstep: 71.82 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13602
total_samples=26646, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:30,371] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.70 | bwd_microstep: 1746.60 | bwd_inner_microstep: 1689.11 | bwd_allreduce_microstep: 57.41 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13379
total_samples=26650, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:32,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.99 | bwd_microstep: 1739.78 | bwd_inner_microstep: 1687.24 | bwd_allreduce_microstep: 52.46 | step_microstep: 0.73
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13585
total_samples=26654, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:35,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 07:04:35,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.37 | bwd_microstep: 1989.90 | bwd_inner_microstep: 1846.21 | bwd_allreduce_microstep: 143.62 | step_microstep: 111.61
[2025-08-03 07:04:35,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.70 | bwd: 7212.29 | bwd_inner: 6886.63 | bwd_allreduce: 325.39 | step: 112.62
{'loss': 0.7316, 'learning_rate': 7.893366371310463e-07, 'epoch': 0.88}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13119
total_samples=26658, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:38,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.62 | bwd_microstep: 2287.10 | bwd_inner_microstep: 2173.17 | bwd_allreduce_microstep: 113.86 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13270
total_samples=26662, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:41,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.46 | bwd_microstep: 2087.82 | bwd_inner_microstep: 2080.92 | bwd_allreduce_microstep: 6.79 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13873
total_samples=26666, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:44,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.50 | bwd_microstep: 1788.75 | bwd_inner_microstep: 1716.92 | bwd_allreduce_microstep: 71.77 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13501
total_samples=26670, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:46,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 07:04:46,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.68 | bwd_microstep: 1801.66 | bwd_inner_microstep: 1703.56 | bwd_allreduce_microstep: 98.04 | step_microstep: 122.42
[2025-08-03 07:04:46,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.18 | bwd: 7965.39 | bwd_inner: 7674.58 | bwd_allreduce: 290.53 | step: 122.99
{'loss': 0.7265, 'learning_rate': 7.830427678304353e-07, 'epoch': 0.88}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12641
total_samples=26674, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:49,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.99 | bwd_microstep: 1933.16 | bwd_inner_microstep: 1769.56 | bwd_allreduce_microstep: 163.52 | step_microstep: 0.23
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12986
total_samples=26678, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:52,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.47 | bwd_microstep: 1911.17 | bwd_inner_microstep: 1761.26 | bwd_allreduce_microstep: 149.84 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13399
total_samples=26682, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:55,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.25 | bwd_microstep: 1860.17 | bwd_inner_microstep: 1699.91 | bwd_allreduce_microstep: 160.19 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14341
total_samples=26686, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:04:57,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 07:04:57,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.86 | bwd_microstep: 1937.09 | bwd_inner_microstep: 1888.74 | bwd_allreduce_microstep: 48.28 | step_microstep: 117.29
[2025-08-03 07:04:57,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.49 | bwd: 7641.63 | bwd_inner: 7119.48 | bwd_allreduce: 521.91 | step: 117.76
{'loss': 0.7304, 'learning_rate': 7.767730689256614e-07, 'epoch': 0.88}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12484
total_samples=26689, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:00,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.94 | bwd_microstep: 1842.36 | bwd_inner_microstep: 1606.09 | bwd_allreduce_microstep: 236.20 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11741
total_samples=26692, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:03,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.81 | bwd_microstep: 2203.95 | bwd_inner_microstep: 1992.01 | bwd_allreduce_microstep: 211.88 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14877
total_samples=26697, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:06,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.24 | bwd_microstep: 1839.54 | bwd_inner_microstep: 1792.80 | bwd_allreduce_microstep: 46.68 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13544
total_samples=26701, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:09,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.00
[2025-08-03 07:05:09,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 749.05 | bwd_microstep: 1991.40 | bwd_inner_microstep: 1870.94 | bwd_allreduce_microstep: 120.39 | step_microstep: 144.41
[2025-08-03 07:05:09,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2878.97 | bwd: 7877.31 | bwd_inner: 7261.83 | bwd_allreduce: 615.23 | step: 144.93
{'loss': 0.7319, 'learning_rate': 7.705275568582848e-07, 'epoch': 0.88}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11712
total_samples=26704, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:11,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.46 | bwd_microstep: 1759.52 | bwd_inner_microstep: 1547.16 | bwd_allreduce_microstep: 212.30 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13385
total_samples=26708, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:14,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.76 | bwd_microstep: 1821.55 | bwd_inner_microstep: 1715.44 | bwd_allreduce_microstep: 106.05 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14182
total_samples=26712, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:16,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.30 | bwd_microstep: 1829.19 | bwd_inner_microstep: 1772.58 | bwd_allreduce_microstep: 56.54 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14797
total_samples=26716, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:19,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.87
[2025-08-03 07:05:19,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.46 | bwd_microstep: 1753.86 | bwd_inner_microstep: 1735.73 | bwd_allreduce_microstep: 18.07 | step_microstep: 124.94
[2025-08-03 07:05:19,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2855.91 | bwd: 7164.17 | bwd_inner: 6770.90 | bwd_allreduce: 393.03 | step: 125.30
{'loss': 0.7303, 'learning_rate': 7.643062480064301e-07, 'epoch': 0.88}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11879
total_samples=26719, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:22,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.11 | bwd_microstep: 1782.04 | bwd_inner_microstep: 1558.73 | bwd_allreduce_microstep: 223.24 | step_microstep: 0.33
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13869
total_samples=26723, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:24,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.84 | bwd_microstep: 2076.80 | bwd_inner_microstep: 1926.26 | bwd_allreduce_microstep: 150.48 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13260
total_samples=26727, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:27,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.29 | bwd_microstep: 1894.45 | bwd_inner_microstep: 1712.36 | bwd_allreduce_microstep: 182.01 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14225
total_samples=26733, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:30,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 07:05:30,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.23 | bwd_microstep: 1917.84 | bwd_inner_microstep: 1884.61 | bwd_allreduce_microstep: 33.16 | step_microstep: 132.41
[2025-08-03 07:05:30,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.40 | bwd: 7671.19 | bwd_inner: 7081.96 | bwd_allreduce: 588.99 | step: 133.11
 88%|████████▊ | 1753/2000 [5:21:50<43:43, 10.62s/it]
 88%|████████▊ | 1754/2000 [5:22:01<44:14, 10.79s/it]
 88%|████████▊ | 1755/2000 [5:22:12<44:10, 10.82s/it]
 88%|████████▊ | 1756/2000 [5:22:23<44:26, 10.93s/it]
 88%|████████▊ | 1757/2000 [5:22:34<43:40, 10.78s/it]
{'loss': 0.7286, 'learning_rate': 7.581091586847522e-07, 'epoch': 0.88}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11952
total_samples=26736, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:32,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.57 | bwd_microstep: 1755.22 | bwd_inner_microstep: 1551.47 | bwd_allreduce_microstep: 203.68 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13606
total_samples=26741, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:35,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.83 | bwd_microstep: 2009.78 | bwd_inner_microstep: 1880.79 | bwd_allreduce_microstep: 128.92 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11929
total_samples=26744, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:38,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.01 | bwd_microstep: 2230.56 | bwd_inner_microstep: 1586.97 | bwd_allreduce_microstep: 643.53 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13356
total_samples=26748, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:41,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.54
[2025-08-03 07:05:41,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.25 | bwd_microstep: 1709.91 | bwd_inner_microstep: 1661.35 | bwd_allreduce_microstep: 48.50 | step_microstep: 125.97
[2025-08-03 07:05:41,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.59 | bwd: 7705.52 | bwd_inner: 6680.57 | bwd_allreduce: 1024.71 | step: 126.61
{'loss': 0.731, 'learning_rate': 7.519363051443996e-07, 'epoch': 0.88}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13319
total_samples=26752, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:43,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.76 | bwd_microstep: 1741.97 | bwd_inner_microstep: 1668.14 | bwd_allreduce_microstep: 73.76 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11821
total_samples=26755, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:46,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.08 | bwd_microstep: 1877.63 | bwd_inner_microstep: 1721.46 | bwd_allreduce_microstep: 156.11 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14975
total_samples=26759, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:49,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.18 | bwd_microstep: 1981.45 | bwd_inner_microstep: 1904.70 | bwd_allreduce_microstep: 76.68 | step_microstep: 0.73
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13309
total_samples=26763, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:51,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 07:05:51,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.06 | bwd_microstep: 1810.16 | bwd_inner_microstep: 1717.91 | bwd_allreduce_microstep: 92.19 | step_microstep: 121.77
[2025-08-03 07:05:51,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.03 | bwd: 7411.26 | bwd_inner: 7012.21 | bwd_allreduce: 398.81 | step: 122.79
{'loss': 0.7321, 'learning_rate': 7.457877035729588e-07, 'epoch': 0.88}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13377
total_samples=26767, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:54,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.14 | bwd_microstep: 1860.96 | bwd_inner_microstep: 1818.75 | bwd_allreduce_microstep: 42.14 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12050
total_samples=26770, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:05:57,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.80 | bwd_microstep: 1780.47 | bwd_inner_microstep: 1556.28 | bwd_allreduce_microstep: 224.11 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13301
total_samples=26774, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:00,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.41 | bwd_microstep: 2052.26 | bwd_inner_microstep: 1792.28 | bwd_allreduce_microstep: 259.91 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13972
total_samples=26778, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:02,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 07:06:02,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.29 | bwd_microstep: 1843.33 | bwd_inner_microstep: 1687.63 | bwd_allreduce_microstep: 155.64 | step_microstep: 116.89
[2025-08-03 07:06:02,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.58 | bwd: 7537.06 | bwd_inner: 6854.94 | bwd_allreduce: 681.88 | step: 117.38
{'loss': 0.727, 'learning_rate': 7.3966337009442e-07, 'epoch': 0.88}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11711
total_samples=26781, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:05,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.86 | bwd_microstep: 2099.81 | bwd_inner_microstep: 1736.84 | bwd_allreduce_microstep: 362.91 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11721
total_samples=26784, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:08,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.82 | bwd_microstep: 1704.20 | bwd_inner_microstep: 1529.33 | bwd_allreduce_microstep: 174.79 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13486
total_samples=26788, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:10,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.24 | bwd_microstep: 1821.71 | bwd_inner_microstep: 1715.08 | bwd_allreduce_microstep: 106.57 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12089
total_samples=26791, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:13,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 07:06:13,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.14 | bwd_microstep: 1984.44 | bwd_inner_microstep: 1767.66 | bwd_allreduce_microstep: 216.72 | step_microstep: 114.37
[2025-08-03 07:06:13,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.99 | bwd: 7610.21 | bwd_inner: 6748.90 | bwd_allreduce: 861.06 | step: 114.99
{'loss': 0.7245, 'learning_rate': 7.335633207691362e-07, 'epoch': 0.88}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13701
total_samples=26795, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:16,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.64 | bwd_microstep: 2018.51 | bwd_inner_microstep: 1894.17 | bwd_allreduce_microstep: 124.27 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12264
total_samples=26798, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:18,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.68 | bwd_microstep: 1801.94 | bwd_inner_microstep: 1583.41 | bwd_allreduce_microstep: 218.46 | step_microstep: 6.07
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11970
total_samples=26801, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:21,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 674.86 | bwd_microstep: 1953.62 | bwd_inner_microstep: 1755.51 | bwd_allreduce_microstep: 198.05 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12220
total_samples=26804, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:24,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 07:06:24,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.63 | bwd_microstep: 2012.97 | bwd_inner_microstep: 1819.69 | bwd_allreduce_microstep: 193.21 | step_microstep: 122.00
[2025-08-03 07:06:24,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2743.73 | bwd: 7787.10 | bwd_inner: 7052.78 | bwd_allreduce: 734.06 | step: 128.32
88%|████████▊ | 1758/2000 [5:22:45<43:40, 10.83s/it]
88%|████████▊ | 1759/2000 [5:22:56<43:35, 10.85s/it]
88%|████████▊ | 1760/2000 [5:23:06<43:11, 10.80s/it]
88%|████████▊ | 1761/2000 [5:23:17<42:59, 10.79s/it]
88%|████████▊ | 1762/2000 [5:23:28<42:50, 10.80s/it]
{'loss': 0.7404, 'learning_rate': 7.274875715937746e-07, 'epoch': 0.88}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12111
total_samples=26809, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:27,139] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.21 | bwd_microstep: 1803.26 | bwd_inner_microstep: 1561.89 | bwd_allreduce_microstep: 241.30 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15875
total_samples=26813, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:29,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.48 | bwd_microstep: 1921.33 | bwd_inner_microstep: 1838.10 | bwd_allreduce_microstep: 83.16 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12793
total_samples=26817, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:32,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.95 | bwd_microstep: 1812.60 | bwd_inner_microstep: 1667.42 | bwd_allreduce_microstep: 145.11 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11897
total_samples=26820, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:35,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 07:06:35,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.21 | bwd_microstep: 2082.57 | bwd_inner_microstep: 1841.36 | bwd_allreduce_microstep: 241.14 | step_microstep: 134.91
[2025-08-03 07:06:35,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2822.78 | bwd: 7619.80 | bwd_inner: 6908.77 | bwd_allreduce: 710.78 | step: 135.41
{'loss': 0.73, 'learning_rate': 7.21436138501278e-07, 'epoch': 0.88}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12160
total_samples=26823, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:38,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.36 | bwd_microstep: 1792.44 | bwd_inner_microstep: 1563.82 | bwd_allreduce_microstep: 228.54 | step_microstep: 0.14
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12775
total_samples=26827, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:40,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.47 | bwd_microstep: 1800.25 | bwd_inner_microstep: 1641.19 | bwd_allreduce_microstep: 159.00 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13404
total_samples=26831, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:43,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.20 | bwd_microstep: 1790.17 | bwd_inner_microstep: 1697.79 | bwd_allreduce_microstep: 92.31 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13390
total_samples=26835, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:46,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 07:06:46,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.66 | bwd_microstep: 2184.46 | bwd_inner_microstep: 2056.04 | bwd_allreduce_microstep: 128.35 | step_microstep: 113.57
[2025-08-03 07:06:46,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.60 | bwd: 7567.38 | bwd_inner: 6958.85 | bwd_allreduce: 608.28 | step: 114.06
{'loss': 0.7361, 'learning_rate': 7.154090373608236e-07, 'epoch': 0.88}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14520
total_samples=26839, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:48,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.99 | bwd_microstep: 1757.51 | bwd_inner_microstep: 1704.34 | bwd_allreduce_microstep: 53.11 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13362
total_samples=26843, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:51,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.59 | bwd_microstep: 1705.37 | bwd_inner_microstep: 1663.41 | bwd_allreduce_microstep: 41.88 | step_microstep: 0.13
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14184
total_samples=26847, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:53,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.10 | bwd_microstep: 1789.51 | bwd_inner_microstep: 1661.79 | bwd_allreduce_microstep: 127.65 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13849
total_samples=26851, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:56,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.81
[2025-08-03 07:06:56,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.68 | bwd_microstep: 1880.99 | bwd_inner_microstep: 1704.83 | bwd_allreduce_microstep: 176.10 | step_microstep: 135.06
[2025-08-03 07:06:56,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.29 | bwd: 7133.44 | bwd_inner: 6734.37 | bwd_allreduce: 398.83 | step: 135.41
{'loss': 0.7327, 'learning_rate': 7.094062839777838e-07, 'epoch': 0.88}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13545
total_samples=26855, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:06:59,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.09 | bwd_microstep: 1787.06 | bwd_inner_microstep: 1723.93 | bwd_allreduce_microstep: 63.07 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13571
total_samples=26859, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:01,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.47 | bwd_microstep: 1732.90 | bwd_inner_microstep: 1687.50 | bwd_allreduce_microstep: 45.34 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11729
total_samples=26862, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:04,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.34 | bwd_microstep: 2207.64 | bwd_inner_microstep: 2201.40 | bwd_allreduce_microstep: 6.15 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11712
total_samples=26865, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:07,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 07:07:07,318] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.24 | bwd_microstep: 1792.89 | bwd_inner_microstep: 1551.66 | bwd_allreduce_microstep: 241.16 | step_microstep: 120.00
[2025-08-03 07:07:07,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.07 | bwd: 7520.54 | bwd_inner: 7164.50 | bwd_allreduce: 355.78 | step: 120.47
{'loss': 0.7267, 'learning_rate': 7.03427894093679e-07, 'epoch': 0.88}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12600
total_samples=26869, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:10,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.17 | bwd_microstep: 2048.01 | bwd_inner_microstep: 1857.22 | bwd_allreduce_microstep: 190.74 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11698
total_samples=26872, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:12,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.28 | bwd_microstep: 1799.81 | bwd_inner_microstep: 1564.96 | bwd_allreduce_microstep: 234.78 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11946
total_samples=26875, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:15,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.99 | bwd_microstep: 1939.92 | bwd_inner_microstep: 1567.32 | bwd_allreduce_microstep: 372.54 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 14327
total_samples=26878, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:18,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 07:07:18,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.03 | bwd_microstep: 1826.03 | bwd_inner_microstep: 1697.14 | bwd_allreduce_microstep: 128.83 | step_microstep: 130.56
[2025-08-03 07:07:18,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.39 | bwd: 7613.83 | bwd_inner: 6686.64 | bwd_allreduce: 926.97 | step: 130.88
88%|████████▊ | 1763/2000 [5:23:39<42:51, 10.85s/it]
88%|████████▊ | 1764/2000 [5:23:50<42:41, 10.85s/it]
88%|████████▊ | 1765/2000 [5:24:01<42:27, 10.84s/it]
88%|████████▊ | 1766/2000 [5:24:11<41:43, 10.70s/it]
88%|████████▊ | 1767/2000 [5:24:22<41:34, 10.71s/it]
88%|████████▊ | 1768/2000 [5:24:33<41:37, 10.77s/it]
{'loss': 0.7385, 'learning_rate': 6.974738833861383e-07, 'epoch': 0.88}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13541
total_samples=26882, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:20,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.34 | bwd_microstep: 1820.51 | bwd_inner_microstep: 1729.50 | bwd_allreduce_microstep: 90.95 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13465
total_samples=26886, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:23,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.24 | bwd_microstep: 1939.64 | bwd_inner_microstep: 1852.42 | bwd_allreduce_microstep: 87.15 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13984
total_samples=26890, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:26,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.61 | bwd_microstep: 1713.45 | bwd_inner_microstep: 1660.70 | bwd_allreduce_microstep: 52.69 | step_microstep: 0.25
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12519
total_samples=26894, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:28,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.25
[2025-08-03 07:07:28,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.43 | bwd_microstep: 2040.05 | bwd_inner_microstep: 1813.84 | bwd_allreduce_microstep: 226.15 | step_microstep: 111.26
[2025-08-03 07:07:28,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.55 | bwd: 7513.71 | bwd_inner: 7056.46 | bwd_allreduce: 457.01 | step: 111.86
{'loss': 0.7426, 'learning_rate': 6.915442674688633e-07, 'epoch': 0.88}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12837
total_samples=26898, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:31,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.52 | bwd_microstep: 1725.27 | bwd_inner_microstep: 1617.12 | bwd_allreduce_microstep: 108.08 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13367
total_samples=26902, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:34,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.12 | bwd_microstep: 1949.43 | bwd_inner_microstep: 1862.80 | bwd_allreduce_microstep: 86.57 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11693
total_samples=26905, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:36,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.32 | bwd_microstep: 1792.96 | bwd_inner_microstep: 1548.80 | bwd_allreduce_microstep: 244.09 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13539
total_samples=26909, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:39,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.79
[2025-08-03 07:07:39,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.27 | bwd_microstep: 2131.18 | bwd_inner_microstep: 1974.73 | bwd_allreduce_microstep: 156.37 | step_microstep: 134.15
[2025-08-03 07:07:39,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.15 | bwd: 7598.90 | bwd_inner: 7003.44 | bwd_allreduce: 595.20 | step: 134.55
{'loss': 0.7335, 'learning_rate': 6.856390618915775e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13761
total_samples=26914, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:42,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.09 | bwd_microstep: 1763.84 | bwd_inner_microstep: 1682.52 | bwd_allreduce_microstep: 81.25 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12034
total_samples=26917, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:45,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.17 | bwd_microstep: 2005.65 | bwd_inner_microstep: 1773.49 | bwd_allreduce_microstep: 232.10 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13353
total_samples=26921, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:47,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.32 | bwd_microstep: 1856.11 | bwd_inner_microstep: 1720.54 | bwd_allreduce_microstep: 135.52 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11637
total_samples=26924, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:50,588] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.86
[2025-08-03 07:07:50,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.24 | bwd_microstep: 1811.28 | bwd_inner_microstep: 1573.94 | bwd_allreduce_microstep: 237.27 | step_microstep: 138.04
[2025-08-03 07:07:50,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2845.76 | bwd: 7436.94 | bwd_inner: 6750.48 | bwd_allreduce: 686.21 | step: 138.40
{'loss': 0.7237, 'learning_rate': 6.797582821399973e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14263
total_samples=26928, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:53,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.66 | bwd_microstep: 1756.05 | bwd_inner_microstep: 1719.60 | bwd_allreduce_microstep: 36.39 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13233
total_samples=26932, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:55,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.48 | bwd_microstep: 1914.31 | bwd_inner_microstep: 1669.95 | bwd_allreduce_microstep: 244.30 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13121
total_samples=26936, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:07:58,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.23 | bwd_microstep: 1888.30 | bwd_inner_microstep: 1782.14 | bwd_allreduce_microstep: 106.08 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14548
total_samples=26940, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:01,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 07:08:01,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.97 | bwd_microstep: 1995.14 | bwd_inner_microstep: 1756.39 | bwd_allreduce_microstep: 238.65 | step_microstep: 123.14
[2025-08-03 07:08:01,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.27 | bwd: 7553.88 | bwd_inner: 6928.06 | bwd_allreduce: 625.51 | step: 123.52
{'loss': 0.7255, 'learning_rate': 6.739019436357774e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13343
total_samples=26944, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:03,956] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.30 | bwd_microstep: 1791.66 | bwd_inner_microstep: 1698.68 | bwd_allreduce_microstep: 92.92 | step_microstep: 0.71
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13147
total_samples=26948, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:06,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.29 | bwd_microstep: 2021.89 | bwd_inner_microstep: 1883.06 | bwd_allreduce_microstep: 138.77 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13471
total_samples=26952, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:09,310] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.13 | bwd_microstep: 1777.95 | bwd_inner_microstep: 1704.34 | bwd_allreduce_microstep: 73.54 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13110
total_samples=26956, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:12,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 07:08:12,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.23 | bwd_microstep: 1927.26 | bwd_inner_microstep: 1689.77 | bwd_allreduce_microstep: 237.42 | step_microstep: 114.13
[2025-08-03 07:08:12,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.88 | bwd: 7518.81 | bwd_inner: 6975.84 | bwd_allreduce: 542.73 | step: 115.42
88%|████████▊ | 1769/2000 [5:24:43<41:26, 10.77s/it]
88%|████████▊ | 1770/2000 [5:24:54<41:24, 10.80s/it]
89%|████████▊ | 1771/2000 [5:25:05<41:07, 10.78s/it]
89%|████████▊ | 1772/2000 [5:25:16<40:57, 10.78s/it]
89%|████████▊ | 1773/2000 [5:25:26<40:44, 10.77s/it]
{'loss': 0.7357, 'learning_rate': 6.680700617364877e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13503
total_samples=26960, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:14,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.16 | bwd_microstep: 1748.06 | bwd_inner_microstep: 1679.94 | bwd_allreduce_microstep: 68.05 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13443
total_samples=26964, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:17,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.72 | bwd_microstep: 2081.36 | bwd_inner_microstep: 1787.10 | bwd_allreduce_microstep: 294.19 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13271
total_samples=26968, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:20,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.32 | bwd_microstep: 1763.70 | bwd_inner_microstep: 1698.56 | bwd_allreduce_microstep: 65.07 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13550
total_samples=26972, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:23,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 07:08:23,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.37 | bwd_microstep: 2152.14 | bwd_inner_microstep: 2145.95 | bwd_allreduce_microstep: 6.13 | step_microstep: 107.87
[2025-08-03 07:08:23,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.48 | bwd: 7745.32 | bwd_inner: 7311.55 | bwd_allreduce: 433.53 | step: 108.55
{'loss': 0.7362, 'learning_rate': 6.622626517355557e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14739
total_samples=26977, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:25,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.93 | bwd_microstep: 1733.43 | bwd_inner_microstep: 1706.65 | bwd_allreduce_microstep: 26.71 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12873
total_samples=26981, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:28,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.36 | bwd_microstep: 2055.40 | bwd_inner_microstep: 1994.14 | bwd_allreduce_microstep: 61.19 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15491
total_samples=26985, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:31,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.32 | bwd_microstep: 2023.36 | bwd_inner_microstep: 1949.27 | bwd_allreduce_microstep: 74.04 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13433
total_samples=26989, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:33,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 07:08:33,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.78 | bwd_microstep: 1707.06 | bwd_inner_microstep: 1663.39 | bwd_allreduce_microstep: 43.60 | step_microstep: 112.68
[2025-08-03 07:08:33,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.32 | bwd: 7519.31 | bwd_inner: 7313.44 | bwd_allreduce: 205.63 | step: 113.16
{'loss': 0.7322, 'learning_rate': 6.564797288622371e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13900
total_samples=26993, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:36,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.50 | bwd_microstep: 1816.78 | bwd_inner_microstep: 1746.33 | bwd_allreduce_microstep: 70.39 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12148
total_samples=26996, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:38,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.06 | bwd_microstep: 1757.28 | bwd_inner_microstep: 1563.97 | bwd_allreduce_microstep: 193.25 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14679
total_samples=27001, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:41,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.35 | bwd_microstep: 1831.44 | bwd_inner_microstep: 1774.48 | bwd_allreduce_microstep: 56.89 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13488
total_samples=27005, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:44,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 07:08:44,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.54 | bwd_microstep: 1928.11 | bwd_inner_microstep: 1717.55 | bwd_allreduce_microstep: 210.49 | step_microstep: 398.81
[2025-08-03 07:08:44,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.39 | bwd: 7333.65 | bwd_inner: 6802.32 | bwd_allreduce: 531.09 | step: 399.17
{'loss': 0.7331, 'learning_rate': 6.507213082815745e-07, 'epoch': 0.89}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13401
total_samples=27009, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:47,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.26 | bwd_microstep: 1836.32 | bwd_inner_microstep: 1657.04 | bwd_allreduce_microstep: 179.21 | step_microstep: 0.17
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11674
total_samples=27012, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:50,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.79 | bwd_microstep: 2017.32 | bwd_inner_microstep: 1799.75 | bwd_allreduce_microstep: 217.51 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13180
total_samples=27016, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:53,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.62 | bwd_microstep: 2175.73 | bwd_inner_microstep: 1840.33 | bwd_allreduce_microstep: 335.33 | step_microstep: 0.72
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13976
total_samples=27020, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:55,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 07:08:55,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.98 | bwd_microstep: 1821.06 | bwd_inner_microstep: 1747.47 | bwd_allreduce_microstep: 73.53 | step_microstep: 134.41
[2025-08-03 07:08:55,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.58 | bwd: 7850.49 | bwd_inner: 7044.58 | bwd_allreduce: 805.66 | step: 135.56
{'loss': 0.7408, 'learning_rate': 6.449874050943549e-07, 'epoch': 0.89}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12222
total_samples=27023, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:08:58,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.09 | bwd_microstep: 2145.66 | bwd_inner_microstep: 1807.32 | bwd_allreduce_microstep: 338.28 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12989
total_samples=27027, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:01,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.74 | bwd_microstep: 1774.05 | bwd_inner_microstep: 1664.04 | bwd_allreduce_microstep: 109.94 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12682
total_samples=27031, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:03,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.07 | bwd_microstep: 1792.46 | bwd_inner_microstep: 1655.86 | bwd_allreduce_microstep: 136.53 | step_microstep: 0.76
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11864
total_samples=27034, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:06,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 07:09:06,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.17 | bwd_microstep: 1809.54 | bwd_inner_microstep: 1578.98 | bwd_allreduce_microstep: 230.49 | step_microstep: 116.40
[2025-08-03 07:09:06,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.00 | bwd: 7521.76 | bwd_inner: 6706.20 | bwd_allreduce: 815.32 | step: 117.41
{'loss': 0.7287, 'learning_rate': 6.392780343370686e-07, 'epoch': 0.89}
89%|████████▊ | 1774/2000 [5:25:37<40:47, 10.83s/it]
89%|████████▉ | 1775/2000 [5:25:48<40:28, 10.79s/it]
89%|████████▉ | 1776/2000 [5:25:59<40:22, 10.82s/it]
89%|████████▉ | 1777/2000 [5:26:10<40:29, 10.89s/it]
89%|████████▉ | 1778/2000 [5:26:21<40:10, 10.86s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11766
total_samples=27037, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:08,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.53 | bwd_microstep: 1694.52 | bwd_inner_microstep: 1523.28 | bwd_allreduce_microstep: 171.17 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13470
total_samples=27041, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:11,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.06 | bwd_microstep: 2013.03 | bwd_inner_microstep: 1883.82 | bwd_allreduce_microstep: 129.14 | step_microstep: 0.28
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13832
total_samples=27046, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:14,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.13 | bwd_microstep: 1899.80 | bwd_inner_microstep: 1744.11 | bwd_allreduce_microstep: 155.62 | step_microstep: 0.31
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13168
total_samples=27050, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:17,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 07:09:17,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.03 | bwd_microstep: 1810.93 | bwd_inner_microstep: 1690.60 | bwd_allreduce_microstep: 120.25 | step_microstep: 115.04
[2025-08-03 07:09:17,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.69 | bwd: 7418.33 | bwd_inner: 6841.81 | bwd_allreduce: 576.25 | step: 115.91
{'loss': 0.721, 'learning_rate': 6.335932109818754e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13833
total_samples=27054, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:19,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.16 | bwd_microstep: 1737.79 | bwd_inner_microstep: 1700.49 | bwd_allreduce_microstep: 37.23 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13326
total_samples=27058, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:22,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.84 | bwd_microstep: 1887.76 | bwd_inner_microstep: 1725.98 | bwd_allreduce_microstep: 161.71 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13008
total_samples=27062, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:24,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.72 | bwd_microstep: 1790.80 | bwd_inner_microstep: 1675.13 | bwd_allreduce_microstep: 115.60 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13090
total_samples=27066, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:27,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 07:09:27,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.53 | bwd_microstep: 2017.37 | bwd_inner_microstep: 1872.98 | bwd_allreduce_microstep: 144.33 | step_microstep: 109.83
[2025-08-03 07:09:27,847] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.18 | bwd: 7433.78 | bwd_inner: 6974.58 | bwd_allreduce: 458.96 | step: 110.31
{'loss': 0.7477, 'learning_rate': 6.279329499365649e-07, 'epoch': 0.89}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12437
total_samples=27070, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:30,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.87 | bwd_microstep: 1807.81 | bwd_inner_microstep: 1583.18 | bwd_allreduce_microstep: 224.56 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14216
total_samples=27074, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:33,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.96 | bwd_microstep: 2072.97 | bwd_inner_microstep: 1954.09 | bwd_allreduce_microstep: 118.82 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13245
total_samples=27078, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:35,976] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.40 | bwd_microstep: 1873.39 | bwd_inner_microstep: 1736.10 | bwd_allreduce_microstep: 137.22 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11524
total_samples=27081, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:38,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 07:09:38,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 848.98 | bwd_microstep: 1880.99 | bwd_inner_microstep: 1846.95 | bwd_allreduce_microstep: 33.97 | step_microstep: 110.55
[2025-08-03 07:09:38,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3009.14 | bwd: 7635.22 | bwd_inner: 7120.32 | bwd_allreduce: 514.66 | step: 111.05
{'loss': 0.7305, 'learning_rate': 6.222972660445082e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332
total_samples=27085, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:41,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.73 | bwd_microstep: 2061.91 | bwd_inner_microstep: 1916.04 | bwd_allreduce_microstep: 145.81 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13554
total_samples=27089, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:44,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.81 | bwd_microstep: 1821.01 | bwd_inner_microstep: 1674.90 | bwd_allreduce_microstep: 146.04 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14282
total_samples=27093, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:47,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.63 | bwd_microstep: 2026.41 | bwd_inner_microstep: 1839.46 | bwd_allreduce_microstep: 186.89 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13542
total_samples=27097, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:49,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 07:09:49,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.56 | bwd_microstep: 1815.10 | bwd_inner_microstep: 1691.73 | bwd_allreduce_microstep: 123.31 | step_microstep: 120.05
[2025-08-03 07:09:49,843] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.67 | bwd: 7724.48 | bwd_inner: 7122.12 | bwd_allreduce: 602.12 | step: 120.52
{'loss': 0.7444, 'learning_rate': 6.166861740846297e-07, 'epoch': 0.89}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11733
total_samples=27100, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:52,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.59 | bwd_microstep: 1779.67 | bwd_inner_microstep: 1573.63 | bwd_allreduce_microstep: 205.97 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13156
total_samples=27104, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:54,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.52 | bwd_microstep: 1735.50 | bwd_inner_microstep: 1634.02 | bwd_allreduce_microstep: 101.40 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13075
total_samples=27108, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:09:57,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.82 | bwd_microstep: 1707.42 | bwd_inner_microstep: 1657.34 | bwd_allreduce_microstep: 50.01 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12433
total_samples=27112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:00,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 07:10:00,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.27 | bwd_microstep: 1874.54 | bwd_inner_microstep: 1576.46 | bwd_allreduce_microstep: 298.03 | step_microstep: 140.63
[2025-08-03 07:10:00,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.13 | bwd: 7097.19 | bwd_inner: 6441.45 | bwd_allreduce: 655.50 | step: 141.05
{'loss': 0.7246, 'learning_rate': 6.11099688771366e-07, 'epoch': 0.89}
 89%|████████▉ | 1779/2000 [5:26:32<39:45, 10.80s/it]
 89%|████████▉ | 1780/2000 [5:26:42<39:27, 10.76s/it]
 89%|████████▉ | 1781/2000 [5:26:53<39:35, 10.84s/it]
 89%|████████▉ | 1782/2000 [5:27:04<39:31, 10.88s/it]
 89%|████████▉ | 1783/2000 [5:27:15<38:45, 10.72s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12160
total_samples=27115, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:02,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.40 | bwd_microstep: 2000.81 | bwd_inner_microstep: 1891.84 | bwd_allreduce_microstep: 108.90 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13668
total_samples=27119, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:05,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.56 | bwd_microstep: 1991.15 | bwd_inner_microstep: 1703.95 | bwd_allreduce_microstep: 287.13 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12845
total_samples=27123, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:08,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.68 | bwd_microstep: 2016.55 | bwd_inner_microstep: 1857.59 | bwd_allreduce_microstep: 158.88 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13655
total_samples=27128, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:11,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.62
[2025-08-03 07:10:11,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.77 | bwd_microstep: 1753.41 | bwd_inner_microstep: 1683.93 | bwd_allreduce_microstep: 69.41 | step_microstep: 120.83
[2025-08-03 07:10:11,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2743.33 | bwd: 7761.96 | bwd_inner: 7137.32 | bwd_allreduce: 624.40 | step: 121.43
{'loss': 0.7378, 'learning_rate': 6.055378247546217e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14811
total_samples=27132, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:13,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.35 | bwd_microstep: 1785.01 | bwd_inner_microstep: 1747.13 | bwd_allreduce_microstep: 37.82 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13251
total_samples=27136, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:16,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.99 | bwd_microstep: 1884.30 | bwd_inner_microstep: 1777.99 | bwd_allreduce_microstep: 106.25 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13717
total_samples=27140, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:18,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.85 | bwd_microstep: 1727.92 | bwd_inner_microstep: 1672.70 | bwd_allreduce_microstep: 55.15 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12059
total_samples=27143, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:21,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.16
[2025-08-03 07:10:21,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.33 | bwd_microstep: 1789.04 | bwd_inner_microstep: 1568.91 | bwd_allreduce_microstep: 220.06 | step_microstep: 122.67
[2025-08-03 07:10:21,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.45 | bwd: 7186.33 | bwd_inner: 6766.72 | bwd_allreduce: 419.37 | step: 123.04
{'loss': 0.7437, 'learning_rate': 6.000005966197387e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14080
total_samples=27147, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:24,217] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.42 | bwd_microstep: 1874.07 | bwd_inner_microstep: 1750.60 | bwd_allreduce_microstep: 123.40 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11874
total_samples=27150, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:26,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.63 | bwd_microstep: 1889.81 | bwd_inner_microstep: 1756.92 | bwd_allreduce_microstep: 132.82 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11924
total_samples=27153, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:29,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.43 | bwd_microstep: 2093.53 | bwd_inner_microstep: 1857.96 | bwd_allreduce_microstep: 235.50 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11719
total_samples=27156, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:32,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 07:10:32,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.81 | bwd_microstep: 2003.52 | bwd_inner_microstep: 1823.67 | bwd_allreduce_microstep: 179.79 | step_microstep: 108.38
[2025-08-03 07:10:32,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.21 | bwd: 7860.98 | bwd_inner: 7189.15 | bwd_allreduce: 671.58 | step: 108.88
{'loss': 0.7328, 'learning_rate': 5.94488018887448e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13433
total_samples=27160, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:35,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.27 | bwd_microstep: 1731.00 | bwd_inner_microstep: 1667.82 | bwd_allreduce_microstep: 63.10 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13218
total_samples=27164, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:37,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.26 | bwd_microstep: 1822.83 | bwd_inner_microstep: 1705.01 | bwd_allreduce_microstep: 117.74 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12942
total_samples=27168, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:40,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.40 | bwd_microstep: 1964.93 | bwd_inner_microstep: 1803.62 | bwd_allreduce_microstep: 161.25 | step_microstep: 0.70
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11809
total_samples=27171, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:43,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.66
[2025-08-03 07:10:43,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.41 | bwd_microstep: 1835.08 | bwd_inner_microstep: 1592.06 | bwd_allreduce_microstep: 242.95 | step_microstep: 117.05
[2025-08-03 07:10:43,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.27 | bwd: 7353.90 | bwd_inner: 6768.50 | bwd_allreduce: 585.14 | step: 118.21
{'loss': 0.7466, 'learning_rate': 5.890001060138484e-07, 'epoch': 0.89}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11757
total_samples=27174, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:46,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.86 | bwd_microstep: 1980.22 | bwd_inner_microstep: 1528.74 | bwd_allreduce_microstep: 451.41 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13213
total_samples=27178, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:48,626] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.84 | bwd_microstep: 1822.07 | bwd_inner_microstep: 1704.46 | bwd_allreduce_microstep: 117.54 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12909
total_samples=27182, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:51,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.86 | bwd_microstep: 2239.64 | bwd_inner_microstep: 2055.70 | bwd_allreduce_microstep: 183.86 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13841
total_samples=27186, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:54,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 07:10:54,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.53 | bwd_microstep: 1726.63 | bwd_inner_microstep: 1715.09 | bwd_allreduce_microstep: 11.48 | step_microstep: 131.60
[2025-08-03 07:10:54,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.03 | bwd: 7768.60 | bwd_inner: 7003.98 | bwd_allreduce: 764.37 | step: 132.18
{'loss': 0.729, 'learning_rate': 5.835368723903456e-07, 'epoch': 0.89}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13863
total_samples=27190, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:57,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.18 | bwd_microstep: 2093.36 | bwd_inner_microstep: 1905.88 | bwd_allreduce_microstep: 187.40 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11801
total_samples=27193, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:10:59,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.19 | bwd_microstep: 2065.36 | bwd_inner_microstep: 1799.24 | bwd_allreduce_microstep: 266.07 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13214
total_samples=27197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:02,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.61 | bwd_microstep: 2061.37 | bwd_inner_microstep: 1918.98 | bwd_allreduce_microstep: 142.33 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13640
total_samples=27201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:05,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 07:11:05,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.42 | bwd_microstep: 1957.20 | bwd_inner_microstep: 1877.05 | bwd_allreduce_microstep: 80.06 | step_microstep: 119.80
[2025-08-03 07:11:05,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.34 | bwd: 8177.34 | bwd_inner: 7501.14 | bwd_allreduce: 675.95 | step: 120.31
 89%|████████▉ | 1784/2000 [5:27:25<38:48, 10.78s/it]
 89%|████████▉ | 1785/2000 [5:27:36<38:15, 10.68s/it]
 89%|████████▉ | 1786/2000 [5:27:47<38:33, 10.81s/it]
 89%|████████▉ | 1787/2000 [5:27:58<38:08, 10.75s/it]
 89%|████████▉ | 1788/2000 [5:28:09<38:15, 10.83s/it]
{'loss': 0.7322, 'learning_rate': 5.780983323436374e-07, 'epoch': 0.89}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12102
total_samples=27204, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:08,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.26 | bwd_microstep: 2212.61 | bwd_inner_microstep: 1986.96 | bwd_allreduce_microstep: 225.59 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14538
total_samples=27208, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:11,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.99 | bwd_microstep: 1993.30 | bwd_inner_microstep: 1842.25 | bwd_allreduce_microstep: 150.97 | step_microstep: 0.18
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13480
total_samples=27212, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:14,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.10 | bwd_microstep: 2010.22 | bwd_inner_microstep: 1853.13 | bwd_allreduce_microstep: 157.00 | step_microstep: 0.32
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13337
total_samples=27216, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:17,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 07:11:17,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.80 | bwd_microstep: 1886.57 | bwd_inner_microstep: 1725.97 | bwd_allreduce_microstep: 160.54 | step_microstep: 111.09
[2025-08-03 07:11:17,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.07 | bwd: 8102.77 | bwd_inner: 7408.31 | bwd_allreduce: 694.19 | step: 111.72
{'loss': 0.7438, 'learning_rate': 5.726845001356573e-07, 'epoch': 0.9}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11667
total_samples=27219, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:19,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.23 | bwd_microstep: 1802.74 | bwd_inner_microstep: 1566.63 | bwd_allreduce_microstep: 236.04 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12264
total_samples=27222, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:22,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.59 | bwd_microstep: 1985.41 | bwd_inner_microstep: 1757.47 | bwd_allreduce_microstep: 227.88 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13842
total_samples=27226, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:25,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.70 | bwd_microstep: 1936.33 | bwd_inner_microstep: 1885.96 | bwd_allreduce_microstep: 50.30 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13647
total_samples=27230, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:27,791] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.89
[2025-08-03 07:11:27,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.48 | bwd_microstep: 1820.14 | bwd_inner_microstep: 1741.45 | bwd_allreduce_microstep: 78.61 | step_microstep: 154.05
[2025-08-03 07:11:27,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.92 | bwd: 7544.68 | bwd_inner: 6951.51 | bwd_allreduce: 592.93 | step: 154.51
{'loss': 0.7324, 'learning_rate': 5.672953899635524e-07, 'epoch': 0.9}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13048
total_samples=27234, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:30,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.81 | bwd_microstep: 2055.09 | bwd_inner_microstep: 1975.21 | bwd_allreduce_microstep: 79.81 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12601
total_samples=27238, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:33,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.06 | bwd_microstep: 1772.93 | bwd_inner_microstep: 1617.40 | bwd_allreduce_microstep: 155.46 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13662
total_samples=27242, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:35,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.48 | bwd_microstep: 1865.56 | bwd_inner_microstep: 1726.34 | bwd_allreduce_microstep: 139.15 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12317
total_samples=27245, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:38,735] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 07:11:38,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.89 | bwd_microstep: 2018.87 | bwd_inner_microstep: 1826.06 | bwd_allreduce_microstep: 192.73 | step_microstep: 128.40
[2025-08-03 07:11:38,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.17 | bwd: 7712.49 | bwd_inner: 7145.01 | bwd_allreduce: 567.23 | step: 128.74
{'loss': 0.7309, 'learning_rate': 5.619310159596358e-07, 'epoch': 0.9}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16241
total_samples=27250, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:41,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.59 | bwd_microstep: 2312.72 | bwd_inner_microstep: 2168.88 | bwd_allreduce_microstep: 143.77 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13836
total_samples=27254, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:44,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.32 | bwd_microstep: 1750.43 | bwd_inner_microstep: 1706.13 | bwd_allreduce_microstep: 44.23 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11862
total_samples=27257, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:47,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.60 | bwd_microstep: 1806.03 | bwd_inner_microstep: 1573.81 | bwd_allreduce_microstep: 232.15 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13858
total_samples=27261, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:49,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 07:11:49,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.92 | bwd_microstep: 2011.50 | bwd_inner_microstep: 1868.39 | bwd_allreduce_microstep: 143.05 | step_microstep: 136.45
[2025-08-03 07:11:49,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2878.36 | bwd: 7880.74 | bwd_inner: 7317.21 | bwd_allreduce: 563.29 | step: 136.93
{'loss': 0.7319, 'learning_rate': 5.565913921913513e-07, 'epoch': 0.9}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11713
total_samples=27264, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:52,514] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.21 | bwd_microstep: 1785.66 | bwd_inner_microstep: 1540.12 | bwd_allreduce_microstep: 245.48 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12087
total_samples=27267, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:55,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.96 | bwd_microstep: 1768.44 | bwd_inner_microstep: 1569.60 | bwd_allreduce_microstep: 198.77 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12559
total_samples=27270, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:11:57,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.98 | bwd_microstep: 1849.91 | bwd_inner_microstep: 1626.24 | bwd_allreduce_microstep: 223.60 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12891
total_samples=27274, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:00,545] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 07:12:00,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.79 | bwd_microstep: 1957.66 | bwd_inner_microstep: 1667.03 | bwd_allreduce_microstep: 290.56 | step_microstep: 110.76
[2025-08-03 07:12:00,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.86 | bwd: 7361.73 | bwd_inner: 6403.00 | bwd_allreduce: 958.50 | step: 111.23
 89%|████████▉ | 1789/2000 [5:28:20<38:39, 10.99s/it]
 90%|████████▉ | 1790/2000 [5:28:31<38:50, 11.10s/it]
 90%|████████▉ | 1791/2000 [5:28:42<38:20, 11.01s/it]
 90%|████████▉ | 1792/2000 [5:28:53<38:05, 10.99s/it]
 90%|████████▉ | 1793/2000 [5:29:04<38:07, 11.05s/it]
{'loss': 0.7417, 'learning_rate': 5.51276532661238e-07, 'epoch': 0.9}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12643
total_samples=27278, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:03,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.59 | bwd_microstep: 1963.39 | bwd_inner_microstep: 1957.42 | bwd_allreduce_microstep: 5.89 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13379
total_samples=27282, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:06,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.70 | bwd_microstep: 2100.41 | bwd_inner_microstep: 1790.55 | bwd_allreduce_microstep: 309.80 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11712
total_samples=27285, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:09,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 895.14 | bwd_microstep: 2005.37 | bwd_inner_microstep: 1787.27 | bwd_allreduce_microstep: 218.02 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12955
total_samples=27289, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:12,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 07:12:12,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.16 | bwd_microstep: 2098.39 | bwd_inner_microstep: 1904.65 | bwd_allreduce_microstep: 193.68 | step_microstep: 129.28
[2025-08-03 07:12:12,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3016.48 | bwd: 8167.60 | bwd_inner: 7439.89 | bwd_allreduce: 727.47 | step: 129.75
{'loss': 0.7164, 'learning_rate': 5.459864513068991e-07, 'epoch': 0.9}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12532
total_samples=27293, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:15,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.75 | bwd_microstep: 2039.41 | bwd_inner_microstep: 1841.92 | bwd_allreduce_microstep: 197.43 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14246
total_samples=27297, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:17,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.67 | bwd_microstep: 1836.33 | bwd_inner_microstep: 1767.05 | bwd_allreduce_microstep: 69.22 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13485
total_samples=27301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:20,378] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.66 | bwd_microstep: 1941.12 | bwd_inner_microstep: 1875.84 | bwd_allreduce_microstep: 65.22 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13274
total_samples=27305, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:23,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.54
[2025-08-03 07:12:23,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.98 | bwd_microstep: 1786.33 | bwd_inner_microstep: 1704.08 | bwd_allreduce_microstep: 82.18 | step_microstep: 151.95
[2025-08-03 07:12:23,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2874.99 | bwd: 7603.25 | bwd_inner: 7188.88 | bwd_allreduce: 414.13 | step: 152.42
{'loss': 0.7312, 'learning_rate': 5.407211620009545e-07, 'epoch': 0.9}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12720
total_samples=27309, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:25,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.36 | bwd_microstep: 1884.06 | bwd_inner_microstep: 1743.32 | bwd_allreduce_microstep: 140.67 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14868
total_samples=27313, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:28,383] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.69 | bwd_microstep: 1804.48 | bwd_inner_microstep: 1740.78 | bwd_allreduce_microstep: 63.64 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14343
total_samples=27317, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:30,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.30 | bwd_microstep: 1837.47 | bwd_inner_microstep: 1813.35 | bwd_allreduce_microstep: 24.05 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13811
total_samples=27321, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:33,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 07:12:33,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.05 | bwd_microstep: 1901.83 | bwd_inner_microstep: 1704.18 | bwd_allreduce_microstep: 197.57 | step_microstep: 120.33
[2025-08-03 07:12:33,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.33 | bwd: 7427.90 | bwd_inner: 7001.63 | bwd_allreduce: 426.01 | step: 120.85
{'loss': 0.7278, 'learning_rate': 5.354806785510113e-07, 'epoch': 0.9}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13456
total_samples=27325, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:36,354] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.33 | bwd_microstep: 1775.84 | bwd_inner_microstep: 1648.80 | bwd_allreduce_microstep: 126.98 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 16172
total_samples=27329, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:39,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.83 | bwd_microstep: 1872.65 | bwd_inner_microstep: 1851.61 | bwd_allreduce_microstep: 20.95 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13325
total_samples=27333, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:41,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.65 | bwd_microstep: 2067.18 | bwd_inner_microstep: 1914.33 | bwd_allreduce_microstep: 152.79 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11991
total_samples=27336, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:44,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.63
[2025-08-03 07:12:44,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.99 | bwd_microstep: 1829.60 | bwd_inner_microstep: 1576.35 | bwd_allreduce_microstep: 253.18 | step_microstep: 137.01
[2025-08-03 07:12:44,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2821.74 | bwd: 7545.33 | bwd_inner: 6991.08 | bwd_allreduce: 553.99 | step: 137.67
{'loss': 0.7309, 'learning_rate': 5.30265014699628e-07, 'epoch': 0.9}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12313
total_samples=27339, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:47,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.54 | bwd_microstep: 2183.91 | bwd_inner_microstep: 1872.53 | bwd_allreduce_microstep: 311.31 | step_microstep: 0.36
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13812
total_samples=27343, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:50,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.82 | bwd_microstep: 1785.06 | bwd_inner_microstep: 1694.70 | bwd_allreduce_microstep: 90.25 | step_microstep: 0.82
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13678
total_samples=27347, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:52,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.49 | bwd_microstep: 1763.35 | bwd_inner_microstep: 1693.35 | bwd_allreduce_microstep: 69.94 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13712
total_samples=27351, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:55,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 07:12:55,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.38 | bwd_microstep: 1889.10 | bwd_inner_microstep: 1711.48 | bwd_allreduce_microstep: 177.56 | step_microstep: 133.86
[2025-08-03 07:12:55,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2788.17 | bwd: 7621.51 | bwd_inner: 6972.06 | bwd_allreduce: 649.15 | step: 135.31
90%|████████▉ | 1794/2000 [5:29:15<37:29, 10.92s/it]
90%|████████▉ | 1795/2000 [5:29:27<38:01, 11.13s/it]
90%|████████▉ | 1796/2000 [5:29:37<37:38, 11.07s/it]
90%|████████▉ | 1797/2000 [5:29:48<37:04, 10.96s/it]
90%|████████▉ | 1798/2000 [5:29:59<36:44, 10.91s/it]
{'loss': 0.726, 'learning_rate': 5.250741841242735e-07, 'epoch': 0.9}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11914
total_samples=27354, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:12:58,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.74 | bwd_microstep: 1769.35 | bwd_inner_microstep: 1544.76 | bwd_allreduce_microstep: 224.52 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13292
total_samples=27358, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:00,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.09 | bwd_microstep: 1751.59 | bwd_inner_microstep: 1668.16 | bwd_allreduce_microstep: 83.37 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13398
total_samples=27362, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:03,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.27 | bwd_microstep: 1776.15 | bwd_inner_microstep: 1701.08 | bwd_allreduce_microstep: 75.00 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13355
total_samples=27366, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:05,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.10
[2025-08-03 07:13:05,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.68 | bwd_microstep: 1913.88 | bwd_inner_microstep: 1677.46 | bwd_allreduce_microstep: 236.35 | step_microstep: 141.30
[2025-08-03 07:13:05,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.70 | bwd: 7211.02 | bwd_inner: 6591.45 | bwd_allreduce: 619.32 | step: 141.62
{'loss': 0.7339, 'learning_rate': 5.199082004372958e-07, 'epoch': 0.9}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11787
total_samples=27369, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:08,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.62 | bwd_microstep: 1744.76 | bwd_inner_microstep: 1541.96 | bwd_allreduce_microstep: 202.73 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13238
total_samples=27373, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:11,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.96 | bwd_microstep: 1920.17 | bwd_inner_microstep: 1672.69 | bwd_allreduce_microstep: 247.42 | step_microstep: 0.22
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13827
total_samples=27377, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:13,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.79 | bwd_microstep: 1785.15 | bwd_inner_microstep: 1696.60 | bwd_allreduce_microstep: 88.49 | step_microstep: 0.14
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12288
total_samples=27381, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:16,647] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.52
[2025-08-03 07:13:16,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.22 | bwd_microstep: 2050.41 | bwd_inner_microstep: 1829.18 | bwd_allreduce_microstep: 221.15 | step_microstep: 110.30
[2025-08-03 07:13:16,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2806.50 | bwd: 7500.53 | bwd_inner: 6740.42 | bwd_allreduce: 759.87 | step: 110.75
{'loss': 0.7374, 'learning_rate': 5.147670771858848e-07, 'epoch': 0.9}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11612
total_samples=27384, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:19,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.01 | bwd_microstep: 2367.79 | bwd_inner_microstep: 2192.46 | bwd_allreduce_microstep: 175.26 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12847
total_samples=27388, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:22,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1019.74 | bwd_microstep: 1796.19 | bwd_inner_microstep: 1644.34 | bwd_allreduce_microstep: 151.78 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12707
total_samples=27392, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:25,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.52 | bwd_microstep: 1774.64 | bwd_inner_microstep: 1620.89 | bwd_allreduce_microstep: 153.68 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14253
total_samples=27396, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:27,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 07:13:27,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.18 | bwd_microstep: 1750.89 | bwd_inner_microstep: 1710.98 | bwd_allreduce_microstep: 39.85 | step_microstep: 144.64
[2025-08-03 07:13:27,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3096.38 | bwd: 7689.56 | bwd_inner: 7168.66 | bwd_allreduce: 520.65 | step: 145.14
{'loss': 0.7263, 'learning_rate': 5.096508278520385e-07, 'epoch': 0.9}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11714
total_samples=27399, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:30,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.31 | bwd_microstep: 2200.93 | bwd_inner_microstep: 1887.61 | bwd_allreduce_microstep: 313.25 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13272
total_samples=27403, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:33,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.79 | bwd_microstep: 1812.37 | bwd_inner_microstep: 1698.90 | bwd_allreduce_microstep: 113.38 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11639
total_samples=27406, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:36,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.00 | bwd_microstep: 2046.62 | bwd_inner_microstep: 1837.75 | bwd_allreduce_microstep: 208.81 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13772
total_samples=27410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:38,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.92
[2025-08-03 07:13:38,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.16 | bwd_microstep: 1733.55 | bwd_inner_microstep: 1673.28 | bwd_allreduce_microstep: 60.20 | step_microstep: 155.11
[2025-08-03 07:13:38,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2845.19 | bwd: 7793.52 | bwd_inner: 7097.54 | bwd_allreduce: 695.73 | step: 155.48
{'loss': 0.7256, 'learning_rate': 5.045594658525232e-07, 'epoch': 0.9}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13547
total_samples=27414, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:41,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.04 | bwd_microstep: 1827.16 | bwd_inner_microstep: 1716.26 | bwd_allreduce_microstep: 110.83 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12325
total_samples=27418, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:44,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.73 | bwd_microstep: 1960.93 | bwd_inner_microstep: 1804.83 | bwd_allreduce_microstep: 156.03 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11829
total_samples=27421, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:46,854] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.77 | bwd_microstep: 1774.31 | bwd_inner_microstep: 1546.42 | bwd_allreduce_microstep: 227.82 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12593
total_samples=27425, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:49,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 07:13:49,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.66 | bwd_microstep: 1756.53 | bwd_inner_microstep: 1639.97 | bwd_allreduce_microstep: 116.48 | step_microstep: 122.92
[2025-08-03 07:13:49,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.12 | bwd: 7318.97 | bwd_inner: 6707.48 | bwd_allreduce: 611.24 | step: 123.26
90%|████████▉ | 1799/2000 [5:30:10<36:29, 10.89s/it]
90%|█████████ | 1800/2000 [5:30:20<35:54, 10.77s/it]
90%|█████████ | 1801/2000 [5:30:31<35:40, 10.75s/it]
90%|█████████ | 1802/2000 [5:30:42<35:57, 10.89s/it]
90%|█████████ | 1803/2000 [5:30:53<35:57, 10.95s/it]
90%|█████████ | 1804/2000 [5:31:04<35:22, 10.83s/it]
{'loss': 0.7283, 'learning_rate': 4.994930045388414e-07, 'epoch': 0.9}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12119
total_samples=27429, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:52,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.30 | bwd_microstep: 1805.85 | bwd_inner_microstep: 1583.04 | bwd_allreduce_microstep: 222.74 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12252
total_samples=27432, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:54,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.81 | bwd_microstep: 1954.15 | bwd_inner_microstep: 1640.66 | bwd_allreduce_microstep: 313.41 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12216
total_samples=27435, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:13:57,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.30 | bwd_microstep: 1764.49 | bwd_inner_microstep: 1568.17 | bwd_allreduce_microstep: 196.24 | step_microstep: 0.28
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12271
total_samples=27439, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:00,203] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 07:14:00,204] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.88 | bwd_microstep: 1933.42 | bwd_inner_microstep: 1597.71 | bwd_allreduce_microstep: 335.65 | step_microstep: 110.86
[2025-08-03 07:14:00,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.21 | bwd: 7457.96 | bwd_inner: 6389.56 | bwd_allreduce: 1068.13 | step: 111.37
{'loss': 0.7462, 'learning_rate': 4.944514571971981e-07, 'epoch': 0.9}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13082
total_samples=27443, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:02,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.00 | bwd_microstep: 1946.91 | bwd_inner_microstep: 1851.78 | bwd_allreduce_microstep: 95.06 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12066
total_samples=27446, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:05,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.61 | bwd_microstep: 1737.95 | bwd_inner_microstep: 1554.32 | bwd_allreduce_microstep: 183.55 | step_microstep: 1.05
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11750
total_samples=27449, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:08,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.23 | bwd_microstep: 1879.65 | bwd_inner_microstep: 1580.35 | bwd_allreduce_microstep: 299.21 | step_microstep: 0.32
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12959
total_samples=27453, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:10,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 07:14:10,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.91 | bwd_microstep: 1737.17 | bwd_inner_microstep: 1632.91 | bwd_allreduce_microstep: 104.18 | step_microstep: 119.09
[2025-08-03 07:14:10,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2802.68 | bwd: 7301.75 | bwd_inner: 6619.37 | bwd_allreduce: 682.09 | step: 120.55
{'loss': 0.7384, 'learning_rate': 4.894348370484648e-07, 'epoch': 0.9}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14748
total_samples=27457, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:13,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.91 | bwd_microstep: 1812.35 | bwd_inner_microstep: 1758.12 | bwd_allreduce_microstep: 54.15 | step_microstep: 0.83
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12426
total_samples=27461, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:15,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.73 | bwd_microstep: 1819.54 | bwd_inner_microstep: 1634.83 | bwd_allreduce_microstep: 184.57 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14751
total_samples=27466, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:18,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.20 | bwd_microstep: 1788.92 | bwd_inner_microstep: 1744.40 | bwd_allreduce_microstep: 44.45 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11812
total_samples=27469, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:21,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.02
[2025-08-03 07:14:21,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.48 | bwd_microstep: 1889.81 | bwd_inner_microstep: 1760.45 | bwd_allreduce_microstep: 129.29 | step_microstep: 115.04
[2025-08-03 07:14:21,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.24 | bwd: 7310.68 | bwd_inner: 6897.79 | bwd_allreduce: 412.60 | step: 116.34
{'loss': 0.742, 'learning_rate': 4.844431572481412e-07, 'epoch': 0.9}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11900
total_samples=27472, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:23,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 677.08 | bwd_microstep: 1749.32 | bwd_inner_microstep: 1542.60 | bwd_allreduce_microstep: 206.65 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13743
total_samples=27476, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:26,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.67 | bwd_microstep: 1770.93 | bwd_inner_microstep: 1724.98 | bwd_allreduce_microstep: 45.89 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13507
total_samples=27480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:29,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.66 | bwd_microstep: 2130.05 | bwd_inner_microstep: 2123.38 | bwd_allreduce_microstep: 6.62 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11792
total_samples=27483, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:32,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.13
[2025-08-03 07:14:32,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.93 | bwd_microstep: 2019.73 | bwd_inner_microstep: 1720.92 | bwd_allreduce_microstep: 298.74 | step_microstep: 138.61
[2025-08-03 07:14:32,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.26 | bwd: 7670.09 | bwd_inner: 7111.86 | bwd_allreduce: 557.97 | step: 139.08
{'loss': 0.7362, 'learning_rate': 4.794764308863242e-07, 'epoch': 0.9}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14087
total_samples=27487, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:34,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.76 | bwd_microstep: 1840.14 | bwd_inner_microstep: 1744.20 | bwd_allreduce_microstep: 95.87 | step_microstep: 0.28
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13948
total_samples=27491, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:37,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.84 | bwd_microstep: 2010.03 | bwd_inner_microstep: 1881.35 | bwd_allreduce_microstep: 128.61 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12133
total_samples=27494, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:40,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.09 | bwd_microstep: 1811.29 | bwd_inner_microstep: 1628.01 | bwd_allreduce_microstep: 183.21 | step_microstep: 0.25
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 14469
total_samples=27498, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:43,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.75
[2025-08-03 07:14:43,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.47 | bwd_microstep: 1910.46 | bwd_inner_microstep: 1702.20 | bwd_allreduce_microstep: 208.20 | step_microstep: 116.15
[2025-08-03 07:14:43,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.08 | bwd: 7571.97 | bwd_inner: 6955.75 | bwd_allreduce: 615.97 | step: 116.82
90%|█████████ | 1805/2000 [5:31:15<35:04, 10.79s/it]
90%|█████████ | 1806/2000 [5:31:25<34:38, 10.71s/it]
90%|█████████ | 1807/2000 [5:31:36<34:20, 10.68s/it]
90%|█████████ | 1808/2000 [5:31:47<34:23, 10.75s/it]
90%|█████████ | 1809/2000 [5:31:57<34:15, 10.76s/it]
{'loss': 0.739, 'learning_rate': 4.745346709876786e-07, 'epoch': 0.9}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12918
total_samples=27502, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:45,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.77 | bwd_microstep: 1962.56 | bwd_inner_microstep: 1634.73 | bwd_allreduce_microstep: 327.76 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 12979
total_samples=27506, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:48,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.69 | bwd_microstep: 1940.26 | bwd_inner_microstep: 1865.51 | bwd_allreduce_microstep: 74.69 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13198
total_samples=27510, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:51,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.11 | bwd_microstep: 1839.99 | bwd_inner_microstep: 1699.09 | bwd_allreduce_microstep: 140.83 | step_microstep: 0.14
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12534
total_samples=27514, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:53,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 07:14:53,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.30 | bwd_microstep: 1736.25 | bwd_inner_microstep: 1581.14 | bwd_allreduce_microstep: 155.04 | step_microstep: 112.15
[2025-08-03 07:14:53,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.79 | bwd: 7479.12 | bwd_inner: 6780.46 | bwd_allreduce: 698.40 | step: 112.68
{'loss': 0.7261, 'learning_rate': 4.696178905113913e-07, 'epoch': 0.91}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12372
total_samples=27518, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:56,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.95 | bwd_microstep: 1793.51 | bwd_inner_microstep: 1617.44 | bwd_allreduce_microstep: 176.00 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11696
total_samples=27521, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:14:59,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.39 | bwd_microstep: 2082.74 | bwd_inner_microstep: 1837.15 | bwd_allreduce_microstep: 245.51 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13275
total_samples=27525, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:01,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.09 | bwd_microstep: 1813.90 | bwd_inner_microstep: 1705.55 | bwd_allreduce_microstep: 108.28 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11940
total_samples=27528, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:04,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 07:15:04,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.69 | bwd_microstep: 1817.89 | bwd_inner_microstep: 1570.56 | bwd_allreduce_microstep: 247.26 | step_microstep: 109.66
[2025-08-03 07:15:04,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.05 | bwd: 7508.08 | bwd_inner: 6730.69 | bwd_allreduce: 777.13 | step: 110.45
{'loss': 0.7238, 'learning_rate': 4.6472610235114513e-07, 'epoch': 0.91}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11890
total_samples=27531, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:07,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 746.33 | bwd_microstep: 1825.08 | bwd_inner_microstep: 1570.31 | bwd_allreduce_microstep: 254.70 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13285
total_samples=27535, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:09,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.10 | bwd_microstep: 2025.11 | bwd_inner_microstep: 1901.64 | bwd_allreduce_microstep: 123.40 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13881
total_samples=27539, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:12,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.19 | bwd_microstep: 1885.68 | bwd_inner_microstep: 1765.39 | bwd_allreduce_microstep: 120.22 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12015
total_samples=27542, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:15,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 07:15:15,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.48 | bwd_microstep: 2040.19 | bwd_inner_microstep: 1712.33 | bwd_allreduce_microstep: 327.78 | step_microstep: 114.51
[2025-08-03 07:15:15,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.03 | bwd: 7776.12 | bwd_inner: 6949.67 | bwd_allreduce: 826.20 | step: 115.03
{'loss': 0.7315, 'learning_rate': 4.5985931933508757e-07, 'epoch': 0.91}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11796
total_samples=27545, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:19,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1943.00 | bwd_microstep: 2256.39 | bwd_inner_microstep: 2146.01 | bwd_allreduce_microstep: 110.32 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11731
total_samples=27548, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:22,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.06 | bwd_microstep: 2162.17 | bwd_inner_microstep: 1938.79 | bwd_allreduce_microstep: 223.31 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11679
total_samples=27551, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:25,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.56 | bwd_microstep: 1917.65 | bwd_inner_microstep: 1551.83 | bwd_allreduce_microstep: 365.71 | step_microstep: 1.84
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11760
total_samples=27554, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:28,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.91
[2025-08-03 07:15:28,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.94 | bwd_microstep: 1785.44 | bwd_inner_microstep: 1550.20 | bwd_allreduce_microstep: 235.17 | step_microstep: 147.37
[2025-08-03 07:15:28,105] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4004.48 | bwd: 8121.70 | bwd_inner: 7186.83 | bwd_allreduce: 934.60 | step: 149.58
{'loss': 0.734, 'learning_rate': 4.550175542257862e-07, 'epoch': 0.91}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11778
total_samples=27557, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:30,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.99 | bwd_microstep: 1713.37 | bwd_inner_microstep: 1531.06 | bwd_allreduce_microstep: 182.22 | step_microstep: 0.15
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13295
total_samples=27561, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:33,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.57 | bwd_microstep: 1820.60 | bwd_inner_microstep: 1692.40 | bwd_allreduce_microstep: 128.12 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13750
total_samples=27565, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:35,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 675.81 | bwd_microstep: 1782.60 | bwd_inner_microstep: 1716.83 | bwd_allreduce_microstep: 65.69 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13756
total_samples=27569, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:38,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 07:15:38,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.90 | bwd_microstep: 1949.25 | bwd_inner_microstep: 1871.45 | bwd_allreduce_microstep: 77.74 | step_microstep: 137.21
[2025-08-03 07:15:38,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2778.20 | bwd: 7265.87 | bwd_inner: 6811.73 | bwd_allreduce: 453.88 | step: 137.61
{'loss': 0.7385, 'learning_rate': 4.502008197202068e-07, 'epoch': 0.91}
 90%|█████████ | 1809/2000 [5:31:57<34:15, 10.76s/it]
 90%|█████████ | 1810/2000 [5:32:08<34:01, 10.74s/it]
 91%|█████████ | 1811/2000 [5:32:19<33:50, 10.74s/it]
 91%|█████████ | 1812/2000 [5:32:30<33:55, 10.83s/it]
 91%|█████████ | 1813/2000 [5:32:42<35:24, 11.36s/it]
 91%|█████████ | 1814/2000 [5:32:53<34:24, 11.10s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11974
total_samples=27572, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:41,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.25 | bwd_microstep: 1930.19 | bwd_inner_microstep: 1787.66 | bwd_allreduce_microstep: 142.46 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11883
total_samples=27575, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:44,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.82 | bwd_microstep: 2309.49 | bwd_inner_microstep: 2303.48 | bwd_allreduce_microstep: 5.95 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11724
total_samples=27578, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:47,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.43 | bwd_microstep: 1958.97 | bwd_inner_microstep: 1738.58 | bwd_allreduce_microstep: 220.32 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11789
total_samples=27581, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:49,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 07:15:49,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.29 | bwd_microstep: 1726.59 | bwd_inner_microstep: 1535.26 | bwd_allreduce_microstep: 191.27 | step_microstep: 113.21
[2025-08-03 07:15:49,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.73 | bwd: 7925.30 | bwd_inner: 7364.98 | bwd_allreduce: 560.07 | step: 113.67
{'loss': 0.7338, 'learning_rate': 4.454091284496731e-07, 'epoch': 0.91}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11742
total_samples=27584, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:52,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.66 | bwd_microstep: 2013.08 | bwd_inner_microstep: 1804.14 | bwd_allreduce_microstep: 208.87 | step_microstep: 0.28
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11847
total_samples=27587, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:55,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.41 | bwd_microstep: 1764.57 | bwd_inner_microstep: 1553.51 | bwd_allreduce_microstep: 210.99 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12208
total_samples=27590, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:15:58,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 986.69 | bwd_microstep: 1989.88 | bwd_inner_microstep: 1770.26 | bwd_allreduce_microstep: 219.56 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13460
total_samples=27594, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:01,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.75
[2025-08-03 07:16:01,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.34 | bwd_microstep: 1969.38 | bwd_inner_microstep: 1835.41 | bwd_allreduce_microstep: 133.90 | step_microstep: 120.95
[2025-08-03 07:16:01,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3109.03 | bwd: 7736.96 | bwd_inner: 6963.31 | bwd_allreduce: 773.40 | step: 121.47
{'loss': 0.7333, 'learning_rate': 4.406424929798403e-07, 'epoch': 0.91}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12979
total_samples=27597, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:03,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.63 | bwd_microstep: 1869.56 | bwd_inner_microstep: 1625.05 | bwd_allreduce_microstep: 244.45 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12012
total_samples=27600, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:06,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.70 | bwd_microstep: 1973.71 | bwd_inner_microstep: 1748.30 | bwd_allreduce_microstep: 225.33 | step_microstep: 0.82
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13145
total_samples=27604, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:09,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.67 | bwd_microstep: 1970.46 | bwd_inner_microstep: 1837.12 | bwd_allreduce_microstep: 133.25 | step_microstep: 0.20
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13150
total_samples=27608, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:12,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 07:16:12,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.10 | bwd_microstep: 2078.86 | bwd_inner_microstep: 1746.67 | bwd_allreduce_microstep: 332.13 | step_microstep: 112.46
[2025-08-03 07:16:12,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2834.04 | bwd: 7892.67 | bwd_inner: 6957.14 | bwd_allreduce: 935.26 | step: 113.75
{'loss': 0.7376, 'learning_rate': 4.3590092581065055e-07, 'epoch': 0.91}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13780
total_samples=27613, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:14,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.00 | bwd_microstep: 1813.92 | bwd_inner_microstep: 1750.43 | bwd_allreduce_microstep: 63.43 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11710
total_samples=27616, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:17,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.23 | bwd_microstep: 1777.87 | bwd_inner_microstep: 1547.15 | bwd_allreduce_microstep: 230.64 | step_microstep: 0.35
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13344
total_samples=27620, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:19,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.40 | bwd_microstep: 1810.89 | bwd_inner_microstep: 1709.94 | bwd_allreduce_microstep: 100.89 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13674
total_samples=27624, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:22,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 07:16:22,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.66 | bwd_microstep: 1942.44 | bwd_inner_microstep: 1726.24 | bwd_allreduce_microstep: 216.12 | step_microstep: 129.27
[2025-08-03 07:16:22,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.23 | bwd: 7345.18 | bwd_inner: 6733.75 | bwd_allreduce: 611.18 | step: 129.88
{'loss': 0.7264, 'learning_rate': 4.3118443937631094e-07, 'epoch': 0.91}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11807
total_samples=27627, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:25,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.44 | bwd_microstep: 1801.79 | bwd_inner_microstep: 1565.57 | bwd_allreduce_microstep: 236.15 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13063
total_samples=27631, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:28,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.38 | bwd_microstep: 1884.54 | bwd_inner_microstep: 1691.87 | bwd_allreduce_microstep: 192.60 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13694
total_samples=27635, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:30,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.12 | bwd_microstep: 2170.02 | bwd_inner_microstep: 1902.49 | bwd_allreduce_microstep: 267.47 | step_microstep: 0.30
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13124
total_samples=27639, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:33,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.27
[2025-08-03 07:16:33,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.95 | bwd_microstep: 1824.78 | bwd_inner_microstep: 1676.93 | bwd_allreduce_microstep: 147.78 | step_microstep: 144.46
[2025-08-03 07:16:33,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2777.81 | bwd: 7681.17 | bwd_inner: 6836.86 | bwd_allreduce: 844.07 | step: 145.00
{'loss': 0.7293, 'learning_rate': 4.26493046045261e-07, 'epoch': 0.91}
 91%|█████████ | 1815/2000 [5:33:04<34:16, 11.12s/it]
 91%|█████████ | 1816/2000 [5:33:15<34:13, 11.16s/it]
 91%|█████████ | 1817/2000 [5:33:26<34:00, 11.15s/it]
 91%|█████████ | 1818/2000 [5:33:37<33:19, 10.99s/it]
 91%|█████████ | 1819/2000 [5:33:48<33:04, 10.97s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14368
total_samples=27643, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:36,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.40 | bwd_microstep: 2101.51 | bwd_inner_microstep: 1912.11 | bwd_allreduce_microstep: 189.33 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14044
total_samples=27647, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:39,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.84 | bwd_microstep: 1852.17 | bwd_inner_microstep: 1761.17 | bwd_allreduce_microstep: 90.93 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13679
total_samples=27651, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:41,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.12 | bwd_microstep: 1791.10 | bwd_inner_microstep: 1726.52 | bwd_allreduce_microstep: 64.50 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13371
total_samples=27655, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:44,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.63
[2025-08-03 07:16:44,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.26 | bwd_microstep: 1752.39 | bwd_inner_microstep: 1684.22 | bwd_allreduce_microstep: 68.10 | step_microstep: 113.73
[2025-08-03 07:16:44,374] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.55 | bwd: 7497.24 | bwd_inner: 7084.02 | bwd_allreduce: 412.94 | step: 114.35
{'loss': 0.7416, 'learning_rate': 4.218267581201296e-07, 'epoch': 0.91}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13160
total_samples=27659, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:46,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.01 | bwd_microstep: 1748.09 | bwd_inner_microstep: 1695.00 | bwd_allreduce_microstep: 53.01 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13741
total_samples=27663, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:49,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.49 | bwd_microstep: 1973.17 | bwd_inner_microstep: 1916.50 | bwd_allreduce_microstep: 56.59 | step_microstep: 0.31
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13317
total_samples=27668, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:52,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.50 | bwd_microstep: 1771.96 | bwd_inner_microstep: 1703.53 | bwd_allreduce_microstep: 68.35 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11599
total_samples=27671, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:55,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 07:16:55,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.25 | bwd_microstep: 1965.99 | bwd_inner_microstep: 1767.33 | bwd_allreduce_microstep: 198.60 | step_microstep: 137.95
[2025-08-03 07:16:55,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.18 | bwd: 7459.27 | bwd_inner: 7082.36 | bwd_allreduce: 376.64 | step: 138.60
{'loss': 0.7256, 'learning_rate': 4.17185587837714e-07, 'epoch': 0.91}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13289
total_samples=27675, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:16:57,959] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.88 | bwd_microstep: 2116.74 | bwd_inner_microstep: 2110.08 | bwd_allreduce_microstep: 6.60 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13395
total_samples=27679, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:00,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.15 | bwd_microstep: 2002.85 | bwd_inner_microstep: 1693.75 | bwd_allreduce_microstep: 309.04 | step_microstep: 0.19
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12476
total_samples=27683, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:03,515] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.16 | bwd_microstep: 2028.30 | bwd_inner_microstep: 1867.54 | bwd_allreduce_microstep: 160.70 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11963
total_samples=27686, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:06,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.40
[2025-08-03 07:17:06,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.18 | bwd_microstep: 2040.33 | bwd_inner_microstep: 1810.87 | bwd_allreduce_microstep: 229.39 | step_microstep: 123.59
[2025-08-03 07:17:06,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.31 | bwd: 8188.28 | bwd_inner: 7482.22 | bwd_allreduce: 705.81 | step: 124.12
{'loss': 0.7153, 'learning_rate': 4.125695473689406e-07, 'epoch': 0.91}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13557
total_samples=27690, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:09,402] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.04 | bwd_microstep: 2190.38 | bwd_inner_microstep: 2076.77 | bwd_allreduce_microstep: 113.55 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13208
total_samples=27694, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:12,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.62 | bwd_microstep: 2107.14 | bwd_inner_microstep: 1870.53 | bwd_allreduce_microstep: 236.50 | step_microstep: 0.48
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12495
total_samples=27698, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:15,175] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.97 | bwd_microstep: 2099.82 | bwd_inner_microstep: 1632.94 | bwd_allreduce_microstep: 466.78 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13970
total_samples=27702, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:17,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 07:17:17,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.13 | bwd_microstep: 1914.74 | bwd_inner_microstep: 1754.41 | bwd_allreduce_microstep: 160.24 | step_microstep: 111.83
[2025-08-03 07:17:17,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.70 | bwd: 8312.16 | bwd_inner: 7334.67 | bwd_allreduce: 977.17 | step: 112.67
{'loss': 0.7379, 'learning_rate': 4.0797864881883977e-07, 'epoch': 0.91}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13480
total_samples=27706, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:20,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.51 | bwd_microstep: 1805.36 | bwd_inner_microstep: 1698.66 | bwd_allreduce_microstep: 106.63 | step_microstep: 0.24
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12470
total_samples=27710, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:23,367] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.01 | bwd_microstep: 1987.02 | bwd_inner_microstep: 1884.77 | bwd_allreduce_microstep: 102.18 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13868
total_samples=27714, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:26,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.67 | bwd_microstep: 1845.92 | bwd_inner_microstep: 1742.73 | bwd_allreduce_microstep: 103.13 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12888
total_samples=27718, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:28,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.79
[2025-08-03 07:17:28,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.68 | bwd_microstep: 1901.28 | bwd_inner_microstep: 1765.47 | bwd_allreduce_microstep: 135.74 | step_microstep: 121.59
[2025-08-03 07:17:28,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2883.81 | bwd: 7539.65 | bwd_inner: 7091.60 | bwd_allreduce: 447.77 | step: 122.17
{'loss': 0.7323, 'learning_rate': 4.034129042265067e-07, 'epoch': 0.91}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13300
total_samples=27722, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:31,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.68 | bwd_microstep: 1848.91 | bwd_inner_microstep: 1810.64 | bwd_allreduce_microstep: 38.21 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13815
total_samples=27726, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:34,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.87 | bwd_microstep: 1814.35 | bwd_inner_microstep: 1728.33 | bwd_allreduce_microstep: 85.95 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12168
total_samples=27729, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:36,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.09 | bwd_microstep: 1834.21 | bwd_inner_microstep: 1607.37 | bwd_allreduce_microstep: 226.77 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14810
total_samples=27734, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:39,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.20
[2025-08-03 07:17:39,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.79 | bwd_microstep: 2021.28 | bwd_inner_microstep: 1864.75 | bwd_allreduce_microstep: 156.46 | step_microstep: 112.41
[2025-08-03 07:17:39,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2872.37 | bwd: 7518.81 | bwd_inner: 7011.09 | bwd_allreduce: 507.47 | step: 112.87
 91%|█████████ | 1820/2000 [5:33:59<32:40, 10.89s/it]
 91%|█████████ | 1821/2000 [5:34:09<32:17, 10.83s/it]
 91%|█████████ | 1822/2000 [5:34:21<32:36, 10.99s/it]
 91%|█████████ | 1823/2000 [5:34:32<32:55, 11.16s/it]
 91%|█████████ | 1824/2000 [5:34:43<32:27, 11.07s/it]
{'loss': 0.7534, 'learning_rate': 3.988723255650728e-07, 'epoch': 0.91}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14502
total_samples=27738, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:42,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.90 | bwd_microstep: 1706.22 | bwd_inner_microstep: 1675.96 | bwd_allreduce_microstep: 30.19 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=27741, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:44,896] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.99 | bwd_microstep: 2006.53 | bwd_inner_microstep: 1772.89 | bwd_allreduce_microstep: 233.56 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12127
total_samples=27744, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:47,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 754.93 | bwd_microstep: 1941.46 | bwd_inner_microstep: 1614.66 | bwd_allreduce_microstep: 326.73 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13777
total_samples=27749, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:50,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 07:17:50,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.20 | bwd_microstep: 2021.80 | bwd_inner_microstep: 1968.31 | bwd_allreduce_microstep: 53.40 | step_microstep: 120.29
[2025-08-03 07:17:50,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.94 | bwd: 7676.04 | bwd_inner: 7031.80 | bwd_allreduce: 643.97 | step: 120.66
{'loss': 0.7276, 'learning_rate': 3.943569247416801e-07, 'epoch': 0.91}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13255
total_samples=27753, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:53,079] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 668.93 | bwd_microstep: 1736.48 | bwd_inner_microstep: 1671.32 | bwd_allreduce_microstep: 65.06 | step_microstep: 0.91
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13188
total_samples=27757, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:55,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.19 | bwd_microstep: 1990.47 | bwd_inner_microstep: 1692.75 | bwd_allreduce_microstep: 297.65 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13202
total_samples=27761, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:17:58,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.09 | bwd_microstep: 1801.57 | bwd_inner_microstep: 1698.56 | bwd_allreduce_microstep: 102.93 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13637
total_samples=27765, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:01,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.85
[2025-08-03 07:18:01,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.75 | bwd_microstep: 1830.27 | bwd_inner_microstep: 1725.96 | bwd_allreduce_microstep: 104.22 | step_microstep: 141.42
[2025-08-03 07:18:01,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.89 | bwd: 7358.87 | bwd_inner: 6788.59 | bwd_allreduce: 569.97 | step: 142.60
{'loss': 0.7218, 'learning_rate': 3.8986671359743767e-07, 'epoch': 0.91}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13226
total_samples=27769, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:04,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.35 | bwd_microstep: 2022.26 | bwd_inner_microstep: 1878.48 | bwd_allreduce_microstep: 143.71 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15860
total_samples=27773, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:06,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.22 | bwd_microstep: 1870.20 | bwd_inner_microstep: 1795.27 | bwd_allreduce_microstep: 74.86 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13502
total_samples=27777, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:09,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 743.47 | bwd_microstep: 1851.87 | bwd_inner_microstep: 1722.02 | bwd_allreduce_microstep: 129.79 | step_microstep: 0.86
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13583
total_samples=27781, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:12,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 07:18:12,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.27 | bwd_microstep: 2212.57 | bwd_inner_microstep: 2206.49 | bwd_allreduce_microstep: 6.02 | step_microstep: 114.99
[2025-08-03 07:18:12,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.24 | bwd: 7956.95 | bwd_inner: 7602.24 | bwd_allreduce: 354.46 | step: 116.23
{'loss': 0.7308, 'learning_rate': 3.8540170390740097e-07, 'epoch': 0.91}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13601
total_samples=27785, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:15,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.39 | bwd_microstep: 1863.07 | bwd_inner_microstep: 1735.34 | bwd_allreduce_microstep: 127.67 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11827
total_samples=27788, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:17,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.08 | bwd_microstep: 1739.43 | bwd_inner_microstep: 1538.69 | bwd_allreduce_microstep: 200.68 | step_microstep: 0.19
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12633
total_samples=27792, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:20,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.95 | bwd_microstep: 2009.63 | bwd_inner_microstep: 1846.23 | bwd_allreduce_microstep: 163.34 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13098
total_samples=27796, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:23,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 07:18:23,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.12 | bwd_microstep: 2213.89 | bwd_inner_microstep: 1898.48 | bwd_allreduce_microstep: 315.35 | step_microstep: 112.71
[2025-08-03 07:18:23,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.47 | bwd: 7826.07 | bwd_inner: 7018.72 | bwd_allreduce: 807.11 | step: 113.26
{'loss': 0.7287, 'learning_rate': 3.8096190738053815e-07, 'epoch': 0.91}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13434
total_samples=27800, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:26,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.74 | bwd_microstep: 1789.11 | bwd_inner_microstep: 1701.40 | bwd_allreduce_microstep: 87.64 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11772
total_samples=27803, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:28,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.91 | bwd_microstep: 2076.39 | bwd_inner_microstep: 1641.10 | bwd_allreduce_microstep: 435.19 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12841
total_samples=27807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:31,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.04 | bwd_microstep: 1912.15 | bwd_inner_microstep: 1809.54 | bwd_allreduce_microstep: 102.55 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12767
total_samples=27811, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:34,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 07:18:34,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 657.05 | bwd_microstep: 2119.33 | bwd_inner_microstep: 1977.81 | bwd_allreduce_microstep: 141.46 | step_microstep: 124.45
[2025-08-03 07:18:34,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.68 | bwd: 7897.03 | bwd_inner: 7129.87 | bwd_allreduce: 766.90 | step: 124.92
91%|█████████▏| 1825/2000 [5:34:54<32:02, 10.99s/it]
91%|█████████▏| 1826/2000 [5:35:05<31:49, 10.97s/it]
91%|█████████▏| 1827/2000 [5:35:16<31:20, 10.87s/it]
91%|█████████▏| 1828/2000 [5:35:27<31:28, 10.98s/it]
91%|█████████▏| 1829/2000 [5:35:38<31:23, 11.01s/it]
{'loss': 0.7441, 'learning_rate': 3.7654733565969826e-07, 'epoch': 0.92}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13891
total_samples=27815, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:37,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.80 | bwd_microstep: 1893.35 | bwd_inner_microstep: 1839.15 | bwd_allreduce_microstep: 54.12 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11951
total_samples=27819, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:39,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.65 | bwd_microstep: 1716.07 | bwd_inner_microstep: 1533.81 | bwd_allreduce_microstep: 182.20 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11775
total_samples=27822, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:42,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.35 | bwd_microstep: 1730.07 | bwd_inner_microstep: 1535.86 | bwd_allreduce_microstep: 194.14 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14131
total_samples=27826, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:44,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 07:18:44,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.79 | bwd_microstep: 1742.84 | bwd_inner_microstep: 1718.27 | bwd_allreduce_microstep: 24.51 | step_microstep: 133.34
[2025-08-03 07:18:44,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.52 | bwd: 7082.39 | bwd_inner: 6627.08 | bwd_allreduce: 455.04 | step: 133.81
{'loss': 0.7242, 'learning_rate': 3.721580003215808e-07, 'epoch': 0.92}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12729
total_samples=27830, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:47,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.60 | bwd_microstep: 1829.99 | bwd_inner_microstep: 1662.14 | bwd_allreduce_microstep: 167.78 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11632
total_samples=27833, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:50,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.34 | bwd_microstep: 1820.23 | bwd_inner_microstep: 1553.09 | bwd_allreduce_microstep: 267.08 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11822
total_samples=27836, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:52,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.85 | bwd_microstep: 1972.01 | bwd_inner_microstep: 1797.75 | bwd_allreduce_microstep: 174.19 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14429
total_samples=27841, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:55,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 07:18:55,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.88 | bwd_microstep: 2023.63 | bwd_inner_microstep: 1762.09 | bwd_allreduce_microstep: 261.47 | step_microstep: 110.94
[2025-08-03 07:18:55,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2829.59 | bwd: 7645.92 | bwd_inner: 6775.07 | bwd_allreduce: 870.60 | step: 111.58
{'loss': 0.7383, 'learning_rate': 3.67793912876705e-07, 'epoch': 0.92}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13432
total_samples=27845, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:18:58,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.76 | bwd_microstep: 1736.36 | bwd_inner_microstep: 1642.39 | bwd_allreduce_microstep: 93.91 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11623
total_samples=27849, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:01,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 973.42 | bwd_microstep: 1918.25 | bwd_inner_microstep: 1771.12 | bwd_allreduce_microstep: 147.06 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11738
total_samples=27852, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:03,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.38 | bwd_microstep: 1856.99 | bwd_inner_microstep: 1600.09 | bwd_allreduce_microstep: 256.84 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13559
total_samples=27856, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:06,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 07:19:06,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.96 | bwd_microstep: 1817.03 | bwd_inner_microstep: 1730.88 | bwd_allreduce_microstep: 86.08 | step_microstep: 137.86
[2025-08-03 07:19:06,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3083.45 | bwd: 7328.67 | bwd_inner: 6744.47 | bwd_allreduce: 583.96 | step: 138.30
{'loss': 0.7189, 'learning_rate': 3.6345508476938296e-07, 'epoch': 0.92}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14305
total_samples=27861, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:09,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.61 | bwd_microstep: 1754.02 | bwd_inner_microstep: 1716.49 | bwd_allreduce_microstep: 37.46 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11836
total_samples=27864, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:12,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.33 | bwd_microstep: 2089.39 | bwd_inner_microstep: 1877.65 | bwd_allreduce_microstep: 211.67 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13288
total_samples=27868, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:15,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.81 | bwd_microstep: 2240.73 | bwd_inner_microstep: 2023.19 | bwd_allreduce_microstep: 217.49 | step_microstep: 0.22
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13621
total_samples=27872, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:17,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 07:19:17,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.85 | bwd_microstep: 1970.15 | bwd_inner_microstep: 1809.69 | bwd_allreduce_microstep: 160.40 | step_microstep: 130.58
[2025-08-03 07:19:17,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.52 | bwd: 8054.34 | bwd_inner: 7427.01 | bwd_allreduce: 627.11 | step: 131.06
{'loss': 0.7275, 'learning_rate': 3.591415273776855e-07, 'epoch': 0.92}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13370
total_samples=27876, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:20,816] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.98 | bwd_microstep: 2066.33 | bwd_inner_microstep: 1908.10 | bwd_allreduce_microstep: 158.15 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12014
total_samples=27879, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:23,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.55 | bwd_microstep: 1718.78 | bwd_inner_microstep: 1542.40 | bwd_allreduce_microstep: 176.32 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13335
total_samples=27883, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:26,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.56 | bwd_microstep: 2015.14 | bwd_inner_microstep: 2007.36 | bwd_allreduce_microstep: 7.71 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12090
total_samples=27886, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:28,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 07:19:28,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.41 | bwd_microstep: 1794.80 | bwd_inner_microstep: 1560.67 | bwd_allreduce_microstep: 234.07 | step_microstep: 125.46
[2025-08-03 07:19:28,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.44 | bwd: 7595.11 | bwd_inner: 7018.54 | bwd_allreduce: 576.33 | step: 125.85
92%|█████████▏| 1830/2000 [5:35:49<31:16, 11.04s/it]
92%|█████████▏| 1831/2000 [5:35:59<30:29, 10.82s/it]
92%|█████████▏| 1832/2000 [5:36:10<30:20, 10.83s/it]
92%|█████████▏| 1833/2000 [5:36:21<30:09, 10.84s/it]
92%|█████████▏| 1834/2000 [5:36:32<30:21, 10.98s/it]
{'loss': 0.726, 'learning_rate': 3.548532520134129e-07, 'epoch': 0.92}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13861
total_samples=27890, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:31,334] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.18 | bwd_microstep: 1739.56 | bwd_inner_microstep: 1702.07 | bwd_allreduce_microstep: 37.43 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11876
total_samples=27893, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:33,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.58 | bwd_microstep: 1797.86 | bwd_inner_microstep: 1563.52 | bwd_allreduce_microstep: 234.28 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14035
total_samples=27898, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:36,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.92 | bwd_microstep: 2024.95 | bwd_inner_microstep: 1928.21 | bwd_allreduce_microstep: 96.68 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11736
total_samples=27901, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:39,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 07:19:39,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.25 | bwd_microstep: 1924.55 | bwd_inner_microstep: 1531.23 | bwd_allreduce_microstep: 393.24 | step_microstep: 111.74
[2025-08-03 07:19:39,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.86 | bwd: 7486.97 | bwd_inner: 6725.03 | bwd_allreduce: 761.72 | step: 112.20
{'loss': 0.7317, 'learning_rate': 3.5059026992206645e-07, 'epoch': 0.92}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13488
total_samples=27906, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:42,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.66 | bwd_microstep: 1998.71 | bwd_inner_microstep: 1866.35 | bwd_allreduce_microstep: 132.29 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13051
total_samples=27909, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:45,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.54 | bwd_microstep: 1996.00 | bwd_inner_microstep: 1812.70 | bwd_allreduce_microstep: 183.22 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13425
total_samples=27913, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:47,622] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.12 | bwd_microstep: 1793.71 | bwd_inner_microstep: 1677.93 | bwd_allreduce_microstep: 115.72 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13856
total_samples=27918, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:50,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51
[2025-08-03 07:19:50,304] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.79 | bwd_microstep: 1772.64 | bwd_inner_microstep: 1716.21 | bwd_allreduce_microstep: 56.36 | step_microstep: 132.63
[2025-08-03 07:19:50,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.04 | bwd: 7561.11 | bwd_inner: 7073.19 | bwd_allreduce: 487.67 | step: 133.09
{'loss': 0.7406, 'learning_rate': 3.4635259228282256e-07, 'epoch': 0.92}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13448
total_samples=27922, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:52,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.95 | bwd_microstep: 1790.04 | bwd_inner_microstep: 1688.87 | bwd_allreduce_microstep: 101.10 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15238
total_samples=27926, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:55,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.19 | bwd_microstep: 2022.74 | bwd_inner_microstep: 1939.32 | bwd_allreduce_microstep: 83.35 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13267
total_samples=27930, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:19:58,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.64 | bwd_microstep: 1820.13 | bwd_inner_microstep: 1703.26 | bwd_allreduce_microstep: 116.80 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13868
total_samples=27935, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:01,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 07:20:01,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.06 | bwd_microstep: 2011.21 | bwd_inner_microstep: 1737.79 | bwd_allreduce_microstep: 273.35 | step_microstep: 128.83
[2025-08-03 07:20:01,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.77 | bwd: 7644.16 | bwd_inner: 7069.24 | bwd_allreduce: 574.68 | step: 129.18
{'loss': 0.7321, 'learning_rate': 3.421402302084953e-07, 'epoch': 0.92}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13647
total_samples=27940, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:04,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.96 | bwd_microstep: 2000.81 | bwd_inner_microstep: 1856.30 | bwd_allreduce_microstep: 144.45 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12836
total_samples=27944, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:06,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.79 | bwd_microstep: 1769.34 | bwd_inner_microstep: 1640.15 | bwd_allreduce_microstep: 129.12 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11798
total_samples=27947, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:09,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.29 | bwd_microstep: 1758.39 | bwd_inner_microstep: 1560.98 | bwd_allreduce_microstep: 197.34 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13562
total_samples=27951, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:12,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 07:20:12,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.16 | bwd_microstep: 2016.43 | bwd_inner_microstep: 1922.00 | bwd_allreduce_microstep: 94.35 | step_microstep: 406.29
[2025-08-03 07:20:12,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2805.13 | bwd: 7545.02 | bwd_inner: 6979.42 | bwd_allreduce: 565.35 | step: 406.61
{'loss': 0.7351, 'learning_rate': 3.379531947455128e-07, 'epoch': 0.92}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13331
total_samples=27955, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:14,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.49 | bwd_microstep: 1705.16 | bwd_inner_microstep: 1672.45 | bwd_allreduce_microstep: 32.64 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13590
total_samples=27959, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:17,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.23 | bwd_microstep: 1887.49 | bwd_inner_microstep: 1711.50 | bwd_allreduce_microstep: 175.91 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12339
total_samples=27962, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:20,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.74 | bwd_microstep: 1778.33 | bwd_inner_microstep: 1591.37 | bwd_allreduce_microstep: 186.89 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13624
total_samples=27966, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:22,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 07:20:22,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.42 | bwd_microstep: 2100.20 | bwd_inner_microstep: 1883.38 | bwd_allreduce_microstep: 216.75 | step_microstep: 111.25
[2025-08-03 07:20:22,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.80 | bwd: 7471.24 | bwd_inner: 6858.70 | bwd_allreduce: 612.28 | step: 111.93
92%|█████████▏| 1835/2000 [5:36:43<30:05, 10.94s/it]
92%|█████████▏| 1836/2000 [5:36:54<29:41, 10.86s/it]
92%|█████████▏| 1837/2000 [5:37:05<29:28, 10.85s/it]
92%|█████████▏| 1838/2000 [5:37:16<29:22, 10.88s/it]
92%|█████████▏| 1839/2000 [5:37:27<29:20, 10.93s/it]
{'loss': 0.7362, 'learning_rate': 3.3379149687388866e-07, 'epoch': 0.92}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11599
total_samples=27969, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:25,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.15 | bwd_microstep: 2067.78 | bwd_inner_microstep: 1835.47 | bwd_allreduce_microstep: 232.24 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13705
total_samples=27973, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:28,715] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.19 | bwd_microstep: 2068.88 | bwd_inner_microstep: 1937.99 | bwd_allreduce_microstep: 130.82 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11672
total_samples=27976, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:31,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.06 | bwd_microstep: 1809.76 | bwd_inner_microstep: 1629.85 | bwd_allreduce_microstep: 179.83 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13221
total_samples=27980, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:34,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 07:20:34,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.54 | bwd_microstep: 2008.43 | bwd_inner_microstep: 1912.02 | bwd_allreduce_microstep: 96.35 | step_microstep: 426.61
[2025-08-03 07:20:34,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2877.86 | bwd: 7954.91 | bwd_inner: 7315.34 | bwd_allreduce: 639.32 | step: 427.14
{'loss': 0.7318, 'learning_rate': 3.2965514750718964e-07, 'epoch': 0.92}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14012
total_samples=27984, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:37,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.82 | bwd_microstep: 1954.34 | bwd_inner_microstep: 1948.09 | bwd_allreduce_microstep: 6.18 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14816
total_samples=27991, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:39,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.94 | bwd_microstep: 1782.55 | bwd_inner_microstep: 1747.43 | bwd_allreduce_microstep: 35.06 | step_microstep: 0.71
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11764
total_samples=27995, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:42,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.55 | bwd_microstep: 1836.46 | bwd_inner_microstep: 1522.47 | bwd_allreduce_microstep: 313.93 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14503
total_samples=28000, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:45,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 07:20:45,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.88 | bwd_microstep: 1924.88 | bwd_inner_microstep: 1753.48 | bwd_allreduce_microstep: 171.32 | step_microstep: 117.38
[2025-08-03 07:20:45,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.12 | bwd: 7498.28 | bwd_inner: 6971.47 | bwd_allreduce: 526.56 | step: 118.31
{'loss': 0.7412, 'learning_rate': 3.255441574925089e-07, 'epoch': 0.92}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11690
total_samples=28003, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:48,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.03 | bwd_microstep: 2067.42 | bwd_inner_microstep: 1874.48 | bwd_allreduce_microstep: 192.85 | step_microstep: 0.42
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13696
total_samples=28007, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:51,062] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.61 | bwd_microstep: 2172.87 | bwd_inner_microstep: 1919.08 | bwd_allreduce_microstep: 253.71 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12312
total_samples=28010, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:53,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.81 | bwd_microstep: 1924.35 | bwd_inner_microstep: 1598.30 | bwd_allreduce_microstep: 325.98 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15305
total_samples=28014, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:56,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.78
[2025-08-03 07:20:56,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.96 | bwd_microstep: 1842.35 | bwd_inner_microstep: 1783.79 | bwd_allreduce_microstep: 58.49 | step_microstep: 116.37
[2025-08-03 07:20:56,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.33 | bwd: 8007.05 | bwd_inner: 7175.66 | bwd_allreduce: 831.12 | step: 117.18
{'loss': 0.7367, 'learning_rate': 3.2145853761043844e-07, 'epoch': 0.92}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11998
total_samples=28017, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:20:59,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.31 | bwd_microstep: 1779.67 | bwd_inner_microstep: 1559.29 | bwd_allreduce_microstep: 220.32 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13153
total_samples=28021, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:01,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.18 | bwd_microstep: 1765.52 | bwd_inner_microstep: 1659.04 | bwd_allreduce_microstep: 106.41 | step_microstep: 0.76
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12007
total_samples=28024, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:04,323] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.24 | bwd_microstep: 1906.61 | bwd_inner_microstep: 1767.95 | bwd_allreduce_microstep: 138.58 | step_microstep: 0.34
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14620
total_samples=28028, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:06,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 07:21:06,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.55 | bwd_microstep: 1759.66 | bwd_inner_microstep: 1682.63 | bwd_allreduce_microstep: 76.96 | step_microstep: 113.77
[2025-08-03 07:21:06,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.21 | bwd: 7211.51 | bwd_inner: 6668.91 | bwd_allreduce: 542.34 | step: 115.13
{'loss': 0.7335, 'learning_rate': 3.1739829857504235e-07, 'epoch': 0.92}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12890
total_samples=28032, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:09,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.63 | bwd_microstep: 1713.82 | bwd_inner_microstep: 1609.77 | bwd_allreduce_microstep: 103.98 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13373
total_samples=28036, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:12,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.42 | bwd_microstep: 1940.18 | bwd_inner_microstep: 1719.11 | bwd_allreduce_microstep: 221.01 | step_microstep: 0.15
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14391
total_samples=28040, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:14,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.95 | bwd_microstep: 1981.26 | bwd_inner_microstep: 1802.45 | bwd_allreduce_microstep: 178.74 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14394
total_samples=28044, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:17,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.61
[2025-08-03 07:21:17,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.62 | bwd_microstep: 1991.54 | bwd_inner_microstep: 1841.83 | bwd_allreduce_microstep: 149.64 | step_microstep: 111.58
[2025-08-03 07:21:17,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.56 | bwd: 7626.85 | bwd_inner: 6973.15 | bwd_allreduce: 653.45 | step: 111.96
92%|█████████▏| 1840/2000 [5:37:37<28:57, 10.86s/it]
92%|█████████▏| 1841/2000 [5:37:49<29:20, 11.07s/it]
92%|█████████▏| 1842/2000 [5:38:00<28:51, 10.96s/it]
92%|█████████▏| 1843/2000 [5:38:11<28:53, 11.04s/it]
92%|█████████▏| 1844/2000 [5:38:21<28:15, 10.87s/it]
{'loss': 0.7195, 'learning_rate': 3.133634510338235e-07, 'epoch': 0.92}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12367
total_samples=28047, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:20,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.31 | bwd_microstep: 1810.75 | bwd_inner_microstep: 1597.38 | bwd_allreduce_microstep: 213.30 | step_microstep: 0.10
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13210
total_samples=28051, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:23,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.74 | bwd_microstep: 2007.22 | bwd_inner_microstep: 1668.48 | bwd_allreduce_microstep: 338.67 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11816
total_samples=28054, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:25,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.52 | bwd_microstep: 1816.19 | bwd_inner_microstep: 1573.27 | bwd_allreduce_microstep: 242.84 | step_microstep: 0.24
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14310
total_samples=28058, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:28,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.44
[2025-08-03 07:21:28,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.37 | bwd_microstep: 1749.54 | bwd_inner_microstep: 1703.47 | bwd_allreduce_microstep: 45.99 | step_microstep: 132.11
[2025-08-03 07:21:28,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.86 | bwd: 7383.75 | bwd_inner: 6542.60 | bwd_allreduce: 840.90 | step: 132.57
{'loss': 0.7349, 'learning_rate': 3.093540055676958e-07, 'epoch': 0.92}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13452
total_samples=28062, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:31,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.29 | bwd_microstep: 1819.89 | bwd_inner_microstep: 1712.07 | bwd_allreduce_microstep: 107.76 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14248
total_samples=28066, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:33,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.91 | bwd_microstep: 1988.41 | bwd_inner_microstep: 1875.69 | bwd_allreduce_microstep: 112.65 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12548
total_samples=28071, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:36,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.95 | bwd_microstep: 1799.02 | bwd_inner_microstep: 1610.36 | bwd_allreduce_microstep: 188.59 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12982
total_samples=28076, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:39,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.34
[2025-08-03 07:21:39,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.25 | bwd_microstep: 1789.62 | bwd_inner_microstep: 1650.78 | bwd_allreduce_microstep: 138.76 | step_microstep: 119.73
[2025-08-03 07:21:39,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.33 | bwd: 7396.99 | bwd_inner: 6848.90 | bwd_allreduce: 547.84 | step: 120.07
{'loss': 0.7368, 'learning_rate': 3.053699726909676e-07, 'epoch': 0.92}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14042
total_samples=28080, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:41,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.80 | bwd_microstep: 1798.04 | bwd_inner_microstep: 1731.60 | bwd_allreduce_microstep: 66.37 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13977
total_samples=28084, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:44,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.37 | bwd_microstep: 2157.46 | bwd_inner_microstep: 1937.09 | bwd_allreduce_microstep: 220.30 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11850
total_samples=28088, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:47,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.08 | bwd_microstep: 1974.44 | bwd_inner_microstep: 1811.66 | bwd_allreduce_microstep: 162.72 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13452
total_samples=28092, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:50,044] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 07:21:50,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.58 | bwd_microstep: 1835.91 | bwd_inner_microstep: 1829.84 | bwd_allreduce_microstep: 6.01 | step_microstep: 126.74
[2025-08-03 07:21:50,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.76 | bwd: 7765.89 | bwd_inner: 7310.18 | bwd_allreduce: 455.48 | step: 127.23
{'loss': 0.7247, 'learning_rate': 3.0141136285129825e-07, 'epoch': 0.92}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13476
total_samples=28096, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:52,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.15 | bwd_microstep: 1760.80 | bwd_inner_microstep: 1691.31 | bwd_allreduce_microstep: 69.41 | step_microstep: 0.17
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 15011
total_samples=28100, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:55,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.38 | bwd_microstep: 1750.64 | bwd_inner_microstep: 1693.73 | bwd_allreduce_microstep: 56.84 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11962
total_samples=28104, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:21:57,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.45 | bwd_microstep: 1883.38 | bwd_inner_microstep: 1613.46 | bwd_allreduce_microstep: 269.85 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12122
total_samples=28107, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:00,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 07:22:00,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.40 | bwd_microstep: 1947.21 | bwd_inner_microstep: 1576.86 | bwd_allreduce_microstep: 370.28 | step_microstep: 119.32
[2025-08-03 07:22:00,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2767.31 | bwd: 7342.06 | bwd_inner: 6575.36 | bwd_allreduce: 766.48 | step: 119.83
{'loss': 0.7289, 'learning_rate': 2.974781864296783e-07, 'epoch': 0.92}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13394
total_samples=28111, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:03,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.63 | bwd_microstep: 1755.60 | bwd_inner_microstep: 1693.08 | bwd_allreduce_microstep: 62.45 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11700
total_samples=28114, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:06,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.04 | bwd_microstep: 2140.45 | bwd_inner_microstep: 1758.79 | bwd_allreduce_microstep: 381.60 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13499
total_samples=28118, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:08,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.61 | bwd_microstep: 1725.21 | bwd_inner_microstep: 1673.00 | bwd_allreduce_microstep: 52.15 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13288
total_samples=28122, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:11,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.06
[2025-08-03 07:22:11,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.81 | bwd_microstep: 1979.59 | bwd_inner_microstep: 1867.87 | bwd_allreduce_microstep: 111.65 | step_microstep: 113.16
[2025-08-03 07:22:11,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2841.99 | bwd: 7600.90 | bwd_inner: 6992.74 | bwd_allreduce: 607.93 | step: 113.49
92%|█████████▏| 1845/2000 [5:38:32<28:02, 10.85s/it]
92%|█████████▏| 1846/2000 [5:38:43<27:41, 10.79s/it]
92%|█████████▏| 1847/2000 [5:38:53<27:22, 10.73s/it]
92%|█████████▏| 1848/2000 [5:39:04<27:24, 10.82s/it]
92%|█████████▏| 1849/2000 [5:39:15<27:00, 10.73s/it]
{'loss': 0.732, 'learning_rate': 2.935704537404083e-07, 'epoch': 0.93}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14320
total_samples=28127, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:13,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.02 | bwd_microstep: 1737.79 | bwd_inner_microstep: 1680.87 | bwd_allreduce_microstep: 56.86 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13457
total_samples=28131, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:16,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.29 | bwd_microstep: 2143.86 | bwd_inner_microstep: 1930.68 | bwd_allreduce_microstep: 213.11 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11913
total_samples=28134, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:19,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.33 | bwd_microstep: 1721.40 | bwd_inner_microstep: 1535.20 | bwd_allreduce_microstep: 186.13 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11888
total_samples=28137, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:22,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 07:22:22,075] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.63 | bwd_microstep: 1865.42 | bwd_inner_microstep: 1546.49 | bwd_allreduce_microstep: 318.85 | step_microstep: 118.95
[2025-08-03 07:22:22,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.20 | bwd: 7468.52 | bwd_inner: 6693.25 | bwd_allreduce: 775.02 | step: 119.49
{'loss': 0.7369, 'learning_rate': 2.8968817503105984e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13368
total_samples=28141, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:24,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.81 | bwd_microstep: 1869.51 | bwd_inner_microstep: 1691.18 | bwd_allreduce_microstep: 178.26 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11692
total_samples=28144, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:27,376] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.33 | bwd_microstep: 1868.40 | bwd_inner_microstep: 1729.76 | bwd_allreduce_microstep: 138.57 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11804
total_samples=28148, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:29,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.45 | bwd_microstep: 1744.69 | bwd_inner_microstep: 1537.03 | bwd_allreduce_microstep: 207.60 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12329
total_samples=28151, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:32,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 07:22:32,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.15 | bwd_microstep: 2000.12 | bwd_inner_microstep: 1871.52 | bwd_allreduce_microstep: 128.53 | step_microstep: 110.59
[2025-08-03 07:22:32,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2793.67 | bwd: 7482.77 | bwd_inner: 6829.49 | bwd_allreduce: 653.03 | step: 110.93
{'loss': 0.723, 'learning_rate': 2.8583136048245697e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13741
total_samples=28155, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:35,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.34 | bwd_microstep: 2011.12 | bwd_inner_microstep: 1882.14 | bwd_allreduce_microstep: 128.92 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13217
total_samples=28159, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:38,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.12 | bwd_microstep: 1824.50 | bwd_inner_microstep: 1698.20 | bwd_allreduce_microstep: 126.21 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14447
total_samples=28163, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:40,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.42 | bwd_microstep: 1931.46 | bwd_inner_microstep: 1800.91 | bwd_allreduce_microstep: 130.49 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11607
total_samples=28166, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:43,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.51
[2025-08-03 07:22:43,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.44 | bwd_microstep: 1747.04 | bwd_inner_microstep: 1527.73 | bwd_allreduce_microstep: 219.24 | step_microstep: 130.35
[2025-08-03 07:22:43,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2817.25 | bwd: 7514.19 | bwd_inner: 6908.98 | bwd_allreduce: 604.96 | step: 130.93
{'loss': 0.7375, 'learning_rate': 2.820000202086459e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15127
total_samples=28170, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:46,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.57 | bwd_microstep: 1951.37 | bwd_inner_microstep: 1926.35 | bwd_allreduce_microstep: 24.95 | step_microstep: 0.33
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11958
total_samples=28173, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:49,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.91 | bwd_microstep: 1987.37 | bwd_inner_microstep: 1796.09 | bwd_allreduce_microstep: 191.21 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13727
total_samples=28178, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:51,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.25 | bwd_microstep: 1830.74 | bwd_inner_microstep: 1654.77 | bwd_allreduce_microstep: 175.90 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13308
total_samples=28182, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:54,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.14
[2025-08-03 07:22:54,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.54 | bwd_microstep: 1722.29 | bwd_inner_microstep: 1665.79 | bwd_allreduce_microstep: 56.42 | step_microstep: 158.26
[2025-08-03 07:22:54,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.20 | bwd: 7491.83 | bwd_inner: 7043.00 | bwd_allreduce: 448.56 | step: 158.83
{'loss': 0.7402, 'learning_rate': 2.781941642568686e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14289
total_samples=28186, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:57,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.21 | bwd_microstep: 2038.62 | bwd_inner_microstep: 1881.03 | bwd_allreduce_microstep: 157.52 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11833
total_samples=28189, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:22:59,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.39 | bwd_microstep: 1919.39 | bwd_inner_microstep: 1720.38 | bwd_allreduce_microstep: 198.94 | step_microstep: 0.35
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13807
total_samples=28193, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:02,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.93 | bwd_microstep: 1822.88 | bwd_inner_microstep: 1721.01 | bwd_allreduce_microstep: 101.80 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13280
total_samples=28197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:05,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 07:23:05,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.43 | bwd_microstep: 2133.11 | bwd_inner_microstep: 2107.30 | bwd_allreduce_microstep: 25.74 | step_microstep: 129.18
[2025-08-03 07:23:05,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.89 | bwd: 7914.07 | bwd_inner: 7429.72 | bwd_allreduce: 484.07 | step: 129.77
92%|█████████▎| 1850/2000 [5:39:26<26:54, 10.76s/it]
93%|█████████▎| 1851/2000 [5:39:36<26:39, 10.73s/it]
93%|█████████▎| 1852/2000 [5:39:47<26:26, 10.72s/it]
93%|█████████▎| 1853/2000 [5:39:58<26:17, 10.73s/it]
93%|█████████▎| 1854/2000 [5:40:09<26:09, 10.75s/it]
{'loss': 0.7323, 'learning_rate': 2.744138026075405e-07, 'epoch': 0.93}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14321
total_samples=28201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:08,287] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.34 | bwd_microstep: 2040.45 | bwd_inner_microstep: 1896.40 | bwd_allreduce_microstep: 143.98 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11970
total_samples=28204, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:10,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.95 | bwd_microstep: 1720.25 | bwd_inner_microstep: 1549.58 | bwd_allreduce_microstep: 170.60 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12226
total_samples=28207, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:13,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.76 | bwd_microstep: 1796.43 | bwd_inner_microstep: 1568.67 | bwd_allreduce_microstep: 227.69 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13640
total_samples=28211, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:16,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 07:23:16,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.90 | bwd_microstep: 1788.82 | bwd_inner_microstep: 1710.03 | bwd_allreduce_microstep: 78.73 | step_microstep: 114.56
[2025-08-03 07:23:16,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2779.86 | bwd: 7346.00 | bwd_inner: 6724.68 | bwd_allreduce: 621.08 | step: 115.30
{'loss': 0.735, 'learning_rate': 2.706589451742181e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13878
total_samples=28215, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:19,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.11 | bwd_microstep: 2273.41 | bwd_inner_microstep: 2058.13 | bwd_allreduce_microstep: 215.21 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12117
total_samples=28218, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:21,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.92 | bwd_microstep: 2095.09 | bwd_inner_microstep: 1878.63 | bwd_allreduce_microstep: 216.39 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13158
total_samples=28222, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:24,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.74 | bwd_microstep: 2068.96 | bwd_inner_microstep: 1706.96 | bwd_allreduce_microstep: 361.94 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13169
total_samples=28226, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:27,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.56
[2025-08-03 07:23:27,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.28 | bwd_microstep: 1726.31 | bwd_inner_microstep: 1645.34 | bwd_allreduce_microstep: 80.89 | step_microstep: 147.60
[2025-08-03 07:23:27,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2768.98 | bwd: 8163.81 | bwd_inner: 7289.05 | bwd_allreduce: 874.52 | step: 148.09
{'loss': 0.7348, 'learning_rate': 2.669296018035772e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13270
total_samples=28230, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:30,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.69 | bwd_microstep: 1804.44 | bwd_inner_microstep: 1674.20 | bwd_allreduce_microstep: 130.18 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11882
total_samples=28233, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:32,562] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.23 | bwd_microstep: 1786.55 | bwd_inner_microstep: 1578.21 | bwd_allreduce_microstep: 208.26 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13641
total_samples=28237, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:35,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.96 | bwd_microstep: 1815.11 | bwd_inner_microstep: 1722.74 | bwd_allreduce_microstep: 92.31 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12036
total_samples=28240, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:38,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.88
[2025-08-03 07:23:38,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 664.83 | bwd_microstep: 2205.27 | bwd_inner_microstep: 1926.95 | bwd_allreduce_microstep: 278.26 | step_microstep: 109.40
[2025-08-03 07:23:38,207] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.63 | bwd: 7611.42 | bwd_inner: 6902.10 | bwd_allreduce: 709.09 | step: 109.88
{'loss': 0.7254, 'learning_rate': 2.632257822753881e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13227
total_samples=28244, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:40,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.71 | bwd_microstep: 1714.07 | bwd_inner_microstep: 1651.36 | bwd_allreduce_microstep: 62.64 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13920
total_samples=28247, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:43,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.43 | bwd_microstep: 1770.88 | bwd_inner_microstep: 1661.31 | bwd_allreduce_microstep: 109.51 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13370
total_samples=28251, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:46,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.68 | bwd_microstep: 2668.72 | bwd_inner_microstep: 2525.59 | bwd_allreduce_microstep: 143.07 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11993
total_samples=28254, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:49,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 07:23:49,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.98 | bwd_microstep: 1746.64 | bwd_inner_microstep: 1548.83 | bwd_allreduce_microstep: 197.74 | step_microstep: 115.49
[2025-08-03 07:23:49,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2758.74 | bwd: 7900.37 | bwd_inner: 7387.09 | bwd_allreduce: 513.03 | step: 115.98
{'loss': 0.7232, 'learning_rate': 2.5954749630248355e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13678
total_samples=28258, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:51,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.54 | bwd_microstep: 1879.48 | bwd_inner_microstep: 1728.73 | bwd_allreduce_microstep: 150.68 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12655
total_samples=28261, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:54,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.14 | bwd_microstep: 1799.87 | bwd_inner_microstep: 1599.85 | bwd_allreduce_microstep: 199.95 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11724
total_samples=28264, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:23:57,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.12 | bwd_microstep: 2036.93 | bwd_inner_microstep: 1840.37 | bwd_allreduce_microstep: 196.50 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13090
total_samples=28268, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:00,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.14
[2025-08-03 07:24:00,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.45 | bwd_microstep: 1751.34 | bwd_inner_microstep: 1679.14 | bwd_allreduce_microstep: 72.13 | step_microstep: 139.19
[2025-08-03 07:24:00,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2865.17 | bwd: 7467.68 | bwd_inner: 6848.09 | bwd_allreduce: 619.34 | step: 139.56
 93%|█████████▎| 1855/2000 [5:40:20<26:16, 10.87s/it]
 93%|█████████▎| 1856/2000 [5:40:30<25:52, 10.78s/it]
 93%|█████████▎| 1857/2000 [5:40:42<26:06, 10.96s/it]
 93%|█████████▎| 1858/2000 [5:40:53<25:49, 10.91s/it]
 93%|█████████▎| 1859/2000 [5:41:04<25:45, 10.96s/it]
{'loss': 0.7377, 'learning_rate': 2.5589475353073987e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13731
total_samples=28272, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:02,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.12 | bwd_microstep: 1839.01 | bwd_inner_microstep: 1719.53 | bwd_allreduce_microstep: 119.41 | step_microstep: 0.16
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12541
total_samples=28276, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:05,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.78 | bwd_microstep: 1901.53 | bwd_inner_microstep: 1782.12 | bwd_allreduce_microstep: 119.34 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13204
total_samples=28280, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:08,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.38 | bwd_microstep: 2044.45 | bwd_inner_microstep: 1903.94 | bwd_allreduce_microstep: 140.45 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13588
total_samples=28284, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:10,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 07:24:10,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.82 | bwd_microstep: 1767.16 | bwd_inner_microstep: 1689.41 | bwd_allreduce_microstep: 77.68 | step_microstep: 135.70
[2025-08-03 07:24:10,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.03 | bwd: 7552.21 | bwd_inner: 7095.00 | bwd_allreduce: 456.96 | step: 136.27
{'loss': 0.7342, 'learning_rate': 2.5226756353904925e-07, 'epoch': 0.93}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13521
total_samples=28289, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:13,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.32 | bwd_microstep: 2133.45 | bwd_inner_microstep: 2000.37 | bwd_allreduce_microstep: 133.01 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13935
total_samples=28293, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:16,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.97 | bwd_microstep: 1832.17 | bwd_inner_microstep: 1826.15 | bwd_allreduce_microstep: 5.96 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13348
total_samples=28297, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:18,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.43 | bwd_microstep: 1849.20 | bwd_inner_microstep: 1712.34 | bwd_allreduce_microstep: 136.81 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13225
total_samples=28301, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:21,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 07:24:21,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.17 | bwd_microstep: 2069.46 | bwd_inner_microstep: 1947.08 | bwd_allreduce_microstep: 122.31 | step_microstep: 150.43
[2025-08-03 07:24:21,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.82 | bwd: 7884.33 | bwd_inner: 7485.93 | bwd_allreduce: 398.17 | step: 150.73
{'loss': 0.736, 'learning_rate': 2.486659358392951e-07, 'epoch': 0.93}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13341
total_samples=28305, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:24,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.50 | bwd_microstep: 1968.20 | bwd_inner_microstep: 1703.21 | bwd_allreduce_microstep: 264.94 | step_microstep: 0.25
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12633
total_samples=28309, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:27,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.29 | bwd_microstep: 1734.07 | bwd_inner_microstep: 1586.19 | bwd_allreduce_microstep: 147.82 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11718
total_samples=28312, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:29,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.65 | bwd_microstep: 1829.33 | bwd_inner_microstep: 1580.37 | bwd_allreduce_microstep: 248.89 | step_microstep: 0.28
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13047
total_samples=28316, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:32,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 07:24:32,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.12 | bwd_microstep: 2048.14 | bwd_inner_microstep: 1883.26 | bwd_allreduce_microstep: 164.81 | step_microstep: 135.38
[2025-08-03 07:24:32,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.50 | bwd: 7579.80 | bwd_inner: 6753.02 | bwd_allreduce: 826.54 | step: 136.18
{'loss': 0.725, 'learning_rate': 2.450898798763268e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13425
total_samples=28320, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:35,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.35 | bwd_microstep: 1826.53 | bwd_inner_microstep: 1701.24 | bwd_allreduce_microstep: 125.23 | step_microstep: 0.23
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12518
total_samples=28324, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:38,308] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.44 | bwd_microstep: 2136.13 | bwd_inner_microstep: 1983.26 | bwd_allreduce_microstep: 152.81 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11790
total_samples=28327, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:41,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.31 | bwd_microstep: 2317.80 | bwd_inner_microstep: 2016.31 | bwd_allreduce_microstep: 301.43 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13275
total_samples=28331, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:44,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 07:24:44,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.67 | bwd_microstep: 1831.13 | bwd_inner_microstep: 1703.56 | bwd_allreduce_microstep: 127.50 | step_microstep: 123.12
[2025-08-03 07:24:44,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.69 | bwd: 8111.65 | bwd_inner: 7404.36 | bwd_allreduce: 707.05 | step: 123.59
{'loss': 0.7388, 'learning_rate': 2.4153940502793185e-07, 'epoch': 0.93}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14393
total_samples=28335, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:46,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.89 | bwd_microstep: 1814.94 | bwd_inner_microstep: 1703.35 | bwd_allreduce_microstep: 111.53 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13524
total_samples=28339, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:49,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.04 | bwd_microstep: 1849.12 | bwd_inner_microstep: 1715.12 | bwd_allreduce_microstep: 133.93 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12150
total_samples=28342, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:52,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.44 | bwd_microstep: 1988.58 | bwd_inner_microstep: 1798.54 | bwd_allreduce_microstep: 189.98 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12985
total_samples=28346, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:55,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 07:24:55,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.35 | bwd_microstep: 2061.08 | bwd_inner_microstep: 1850.92 | bwd_allreduce_microstep: 210.09 | step_microstep: 108.26
[2025-08-03 07:24:55,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2846.64 | bwd: 7713.78 | bwd_inner: 7067.93 | bwd_allreduce: 645.61 | step: 108.75
 93%|█████████▎| 1860/2000 [5:41:14<25:27, 10.91s/it]
 93%|█████████▎| 1861/2000 [5:41:25<25:10, 10.86s/it]
 93%|█████████▎| 1862/2000 [5:41:36<25:09, 10.94s/it]
 93%|█████████▎| 1863/2000 [5:41:47<24:55, 10.91s/it]
 93%|█████████▎| 1864/2000 [5:41:59<25:02, 11.04s/it]
{'loss': 0.7237, 'learning_rate': 2.380145206048201e-07, 'epoch': 0.93}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12566
total_samples=28350, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:24:57,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.81 | bwd_microstep: 2007.49 | bwd_inner_microstep: 1636.70 | bwd_allreduce_microstep: 370.72 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13220
total_samples=28354, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:00,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 895.59 | bwd_microstep: 1985.69 | bwd_inner_microstep: 1835.32 | bwd_allreduce_microstep: 150.29 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12352
total_samples=28357, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:03,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.11 | bwd_microstep: 1772.11 | bwd_inner_microstep: 1592.44 | bwd_allreduce_microstep: 179.60 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11758
total_samples=28360, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:06,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.47
[2025-08-03 07:25:06,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.21 | bwd_microstep: 1806.61 | bwd_inner_microstep: 1573.93 | bwd_allreduce_microstep: 232.61 | step_microstep: 149.90
[2025-08-03 07:25:06,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3021.64 | bwd: 7571.94 | bwd_inner: 6638.40 | bwd_allreduce: 933.30 | step: 150.51
{'loss': 0.7342, 'learning_rate': 2.3451523585058756e-07, 'epoch': 0.93}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13611
total_samples=28364, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:08,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.66 | bwd_microstep: 1789.01 | bwd_inner_microstep: 1706.02 | bwd_allreduce_microstep: 82.94 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14259
total_samples=28368, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:11,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.41 | bwd_microstep: 1757.30 | bwd_inner_microstep: 1676.74 | bwd_allreduce_microstep: 80.49 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13215
total_samples=28372, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:13,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.38 | bwd_microstep: 1853.82 | bwd_inner_microstep: 1697.65 | bwd_allreduce_microstep: 156.09 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12212
total_samples=28375, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:16,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.55
[2025-08-03 07:25:16,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.47 | bwd_microstep: 2112.00 | bwd_inner_microstep: 1635.41 | bwd_allreduce_microstep: 476.49 | step_microstep: 108.54
[2025-08-03 07:25:16,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.84 | bwd: 7512.20 | bwd_inner: 6715.84 | bwd_allreduce: 796.07 | step: 109.20
{'loss': 0.7332, 'learning_rate': 2.3104155994170042e-07, 'epoch': 0.93}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13890
total_samples=28380, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:19,377] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 671.34 | bwd_microstep: 1730.69 | bwd_inner_microstep: 1678.13 | bwd_allreduce_microstep: 52.50 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16305
total_samples=28385, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:22,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.05 | bwd_microstep: 1981.02 | bwd_inner_microstep: 1836.32 | bwd_allreduce_microstep: 144.64 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13402
total_samples=28389, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:24,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.22 | bwd_microstep: 1816.62 | bwd_inner_microstep: 1718.86 | bwd_allreduce_microstep: 97.70 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11969
total_samples=28392, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:27,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.80
[2025-08-03 07:25:27,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.58 | bwd_microstep: 1948.79 | bwd_inner_microstep: 1563.71 | bwd_allreduce_microstep: 384.97 | step_microstep: 157.05
[2025-08-03 07:25:27,630] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2789.12 | bwd: 7477.17 | bwd_inner: 6797.00 | bwd_allreduce: 679.89 | step: 157.57
{'loss': 0.7275, 'learning_rate': 2.2759350198746978e-07, 'epoch': 0.93}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11795
total_samples=28395, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:30,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.31 | bwd_microstep: 1689.62 | bwd_inner_microstep: 1525.58 | bwd_allreduce_microstep: 163.96 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14509
total_samples=28399, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:32,823] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.61 | bwd_microstep: 1961.39 | bwd_inner_microstep: 1955.02 | bwd_allreduce_microstep: 6.31 | step_microstep: 0.81
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12218
total_samples=28402, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:35,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.43 | bwd_microstep: 1803.54 | bwd_inner_microstep: 1578.22 | bwd_allreduce_microstep: 225.26 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14434
total_samples=28406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:38,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.28
[2025-08-03 07:25:38,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 672.71 | bwd_microstep: 1722.24 | bwd_inner_microstep: 1672.32 | bwd_allreduce_microstep: 49.85 | step_microstep: 134.34
[2025-08-03 07:25:38,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2770.01 | bwd: 7176.84 | bwd_inner: 6731.13 | bwd_allreduce: 445.47 | step: 135.50
{'loss': 0.7296, 'learning_rate': 2.24171071030026e-07, 'epoch': 0.93}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12155
total_samples=28410, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:40,783] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.27 | bwd_microstep: 1986.34 | bwd_inner_microstep: 1781.75 | bwd_allreduce_microstep: 204.53 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11890
total_samples=28413, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:43,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.51 | bwd_microstep: 1810.11 | bwd_inner_microstep: 1567.80 | bwd_allreduce_microstep: 242.25 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11901
total_samples=28416, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:46,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.83 | bwd_microstep: 1915.03 | bwd_inner_microstep: 1741.68 | bwd_allreduce_microstep: 173.27 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12899
total_samples=28419, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:48,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.37
[2025-08-03 07:25:48,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.44 | bwd_microstep: 1804.18 | bwd_inner_microstep: 1623.44 | bwd_allreduce_microstep: 180.67 | step_microstep: 134.09
[2025-08-03 07:25:48,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2818.97 | bwd: 7515.71 | bwd_inner: 6714.67 | bwd_allreduce: 800.80 | step: 134.58
 93%|█████████▎| 1865/2000 [5:42:09<24:47, 11.02s/it]
 93%|█████████▎| 1866/2000 [5:42:21<24:37, 11.02s/it]
 93%|█████████▎| 1867/2000 [5:42:31<24:14, 10.94s/it]
 93%|█████████▎| 1868/2000 [5:42:42<23:56, 10.88s/it]
 93%|█████████▎| 1869/2000 [5:42:52<23:26, 10.73s/it]
{'loss': 0.724, 'learning_rate': 2.2077427604429435e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11679
total_samples=28422, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:51,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.82 | bwd_microstep: 2052.11 | bwd_inner_microstep: 1747.93 | bwd_allreduce_microstep: 304.12 | step_microstep: 1.91
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11847
total_samples=28425, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:54,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.62 | bwd_microstep: 1944.82 | bwd_inner_microstep: 1604.13 | bwd_allreduce_microstep: 340.64 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12361
total_samples=28428, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:56,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.13 | bwd_microstep: 1853.11 | bwd_inner_microstep: 1729.62 | bwd_allreduce_microstep: 123.43 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12051
total_samples=28431, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:25:59,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.52
[2025-08-03 07:25:59,700] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.50 | bwd_microstep: 1793.52 | bwd_inner_microstep: 1567.41 | bwd_allreduce_microstep: 226.04 | step_microstep: 139.03
[2025-08-03 07:25:59,701] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.00 | bwd: 7643.62 | bwd_inner: 6649.08 | bwd_allreduce: 994.30 | step: 139.58
{'loss': 0.7293, 'learning_rate': 2.1740312593797274e-07, 'epoch': 0.94}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15289
total_samples=28435, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:02,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.26 | bwd_microstep: 1806.87 | bwd_inner_microstep: 1794.90 | bwd_allreduce_microstep: 11.91 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13154
total_samples=28440, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:04,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.86 | bwd_microstep: 1772.40 | bwd_inner_microstep: 1671.09 | bwd_allreduce_microstep: 101.24 | step_microstep: 0.29
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13452
total_samples=28444, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:07,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.53 | bwd_microstep: 1945.76 | bwd_inner_microstep: 1692.34 | bwd_allreduce_microstep: 253.31 | step_microstep: 0.97
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11660
total_samples=28447, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:10,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 07:26:10,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.56 | bwd_microstep: 1783.54 | bwd_inner_microstep: 1558.24 | bwd_allreduce_microstep: 225.24 | step_microstep: 126.13
[2025-08-03 07:26:10,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2833.13 | bwd: 7308.64 | bwd_inner: 6716.57 | bwd_allreduce: 591.80 | step: 127.51
{'loss': 0.747, 'learning_rate': 2.1405762955151178e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11607
total_samples=28450, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:12,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.61 | bwd_microstep: 1875.73 | bwd_inner_microstep: 1719.19 | bwd_allreduce_microstep: 156.48 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11619
total_samples=28453, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:15,558] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.19 | bwd_microstep: 1827.24 | bwd_inner_microstep: 1589.92 | bwd_allreduce_microstep: 237.26 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13694
total_samples=28457, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:18,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.39 | bwd_microstep: 1743.73 | bwd_inner_microstep: 1687.66 | bwd_allreduce_microstep: 55.99 | step_microstep: 0.35
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11909
total_samples=28460, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:20,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 07:26:20,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.61 | bwd_microstep: 1764.15 | bwd_inner_microstep: 1567.15 | bwd_allreduce_microstep: 196.93 | step_microstep: 149.23
[2025-08-03 07:26:20,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.73 | bwd: 7210.90 | bwd_inner: 6563.91 | bwd_allreduce: 646.74 | step: 149.83
{'loss': 0.7214, 'learning_rate': 2.1073779565808471e-07, 'epoch': 0.94}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13486
total_samples=28464, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:23,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.78 | bwd_microstep: 1908.98 | bwd_inner_microstep: 1692.64 | bwd_allreduce_microstep: 216.27 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11674
total_samples=28467, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:26,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.28 | bwd_microstep: 1784.64 | bwd_inner_microstep: 1555.38 | bwd_allreduce_microstep: 229.18 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13811
total_samples=28471, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:29,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.20 | bwd_microstep: 2251.16 | bwd_inner_microstep: 1916.16 | bwd_allreduce_microstep: 334.94 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13613
total_samples=28476, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:31,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.01
[2025-08-03 07:26:31,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.49 | bwd_microstep: 1716.32 | bwd_inner_microstep: 1640.06 | bwd_allreduce_microstep: 76.19 | step_microstep: 171.10
[2025-08-03 07:26:31,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.67 | bwd: 7661.16 | bwd_inner: 6804.24 | bwd_allreduce: 856.66 | step: 171.61
{'loss': 0.7394, 'learning_rate': 2.0744363296356872e-07, 'epoch': 0.94}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12834
total_samples=28480, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:34,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.56 | bwd_microstep: 1992.48 | bwd_inner_microstep: 1865.79 | bwd_allreduce_microstep: 126.63 | step_microstep: 0.12
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13745
total_samples=28484, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:37,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.43 | bwd_microstep: 1785.68 | bwd_inner_microstep: 1695.15 | bwd_allreduce_microstep: 90.46 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13296
total_samples=28488, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:39,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.63 | bwd_microstep: 1910.28 | bwd_inner_microstep: 1730.64 | bwd_allreduce_microstep: 179.58 | step_microstep: 0.18
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11601
total_samples=28491, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:42,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.64
[2025-08-03 07:26:42,472] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.00 | bwd_microstep: 1828.82 | bwd_inner_microstep: 1593.37 | bwd_allreduce_microstep: 235.38 | step_microstep: 127.76
[2025-08-03 07:26:42,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2849.55 | bwd: 7517.33 | bwd_inner: 6884.95 | bwd_allreduce: 632.13 | step: 128.32
 94%|█████████▎| 1870/2000 [5:43:03<23:16, 10.74s/it]
 94%|█████████▎| 1871/2000 [5:43:14<23:12, 10.79s/it]
 94%|█████████▎| 1872/2000 [5:43:25<22:53, 10.73s/it]
 94%|█████████▎| 1873/2000 [5:43:35<22:33, 10.66s/it]
 94%|█████████▎| 1874/2000 [5:43:46<22:32, 10.74s/it]
 94%|█████████▍| 1875/2000 [5:43:57<22:24, 10.75s/it]
{'loss': 0.7284, 'learning_rate': 2.0417515010652032e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11959
total_samples=28494, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:45,055] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.57 | bwd_microstep: 1794.19 | bwd_inner_microstep: 1561.72 | bwd_allreduce_microstep: 232.40 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13512
total_samples=28498, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:47,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1017.71 | bwd_microstep: 1793.78 | bwd_inner_microstep: 1675.70 | bwd_allreduce_microstep: 118.01 | step_microstep: 0.27
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13537
total_samples=28502, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:50,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.20 | bwd_microstep: 1956.18 | bwd_inner_microstep: 1822.69 | bwd_allreduce_microstep: 133.41 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11869
total_samples=28505, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:53,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 07:26:53,397] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.11 | bwd_microstep: 1802.89 | bwd_inner_microstep: 1575.41 | bwd_allreduce_microstep: 227.41 | step_microstep: 138.90
[2025-08-03 07:26:53,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3146.52 | bwd: 7347.09 | bwd_inner: 6635.52 | bwd_allreduce: 711.32 | step: 139.53
{'loss': 0.7328, 'learning_rate': 2.009323556581566e-07, 'epoch': 0.94}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 16384
total_samples=28509, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:56,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.66 | bwd_microstep: 1822.08 | bwd_inner_microstep: 1806.68 | bwd_allreduce_microstep: 15.34 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13316
total_samples=28513, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:26:58,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.89 | bwd_microstep: 2128.92 | bwd_inner_microstep: 1830.12 | bwd_allreduce_microstep: 298.74 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13728
total_samples=28517, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:01,729] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.25 | bwd_microstep: 2037.89 | bwd_inner_microstep: 1920.58 | bwd_allreduce_microstep: 117.21 | step_microstep: 0.33
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13929
total_samples=28522, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:04,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 07:27:04,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.89 | bwd_microstep: 2056.66 | bwd_inner_microstep: 1906.41 | bwd_allreduce_microstep: 150.17 | step_microstep: 112.13
[2025-08-03 07:27:04,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2775.63 | bwd: 8045.63 | bwd_inner: 7463.79 | bwd_allreduce: 581.57 | step: 112.98
{'loss': 0.7157, 'learning_rate': 1.977152581223274e-07, 'epoch': 0.94}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13431
total_samples=28526, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:07,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.49 | bwd_microstep: 2143.65 | bwd_inner_microstep: 1919.01 | bwd_allreduce_microstep: 224.54 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11609
total_samples=28529, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:10,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.99 | bwd_microstep: 2048.66 | bwd_inner_microstep: 1820.87 | bwd_allreduce_microstep: 227.73 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12856
total_samples=28533, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:13,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.96 | bwd_microstep: 2048.87 | bwd_inner_microstep: 1846.31 | bwd_allreduce_microstep: 202.49 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13047
total_samples=28537, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:16,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.46
[2025-08-03 07:27:16,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.69 | bwd_microstep: 1994.56 | bwd_inner_microstep: 1940.31 | bwd_allreduce_microstep: 54.19 | step_microstep: 110.62
[2025-08-03 07:27:16,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.04 | bwd: 8235.81 | bwd_inner: 7526.50 | bwd_allreduce: 709.05 | step: 111.24
{'loss': 0.7301, 'learning_rate': 1.9452386593549534e-07, 'epoch': 0.94}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13257
total_samples=28541, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:18,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.62 | bwd_microstep: 1822.98 | bwd_inner_microstep: 1696.21 | bwd_allreduce_microstep: 126.71 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13883
total_samples=28545, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:21,606] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.14 | bwd_microstep: 2098.43 | bwd_inner_microstep: 1840.93 | bwd_allreduce_microstep: 257.42 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13136
total_samples=28549, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:24,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.81 | bwd_microstep: 1958.04 | bwd_inner_microstep: 1863.78 | bwd_allreduce_microstep: 94.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13454
total_samples=28553, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:27,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 07:27:27,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.42 | bwd_microstep: 1981.30 | bwd_inner_microstep: 1709.85 | bwd_allreduce_microstep: 271.38 | step_microstep: 110.21
[2025-08-03 07:27:27,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.91 | bwd: 7860.80 | bwd_inner: 7110.77 | bwd_allreduce: 749.78 | step: 110.70
{'loss': 0.7291, 'learning_rate': 1.9135818746671587e-07, 'epoch': 0.94}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13599
total_samples=28557, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:29,940] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.84 | bwd_microstep: 1969.06 | bwd_inner_microstep: 1822.95 | bwd_allreduce_microstep: 146.03 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13203
total_samples=28561, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:32,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.94 | bwd_microstep: 1826.65 | bwd_inner_microstep: 1687.95 | bwd_allreduce_microstep: 138.63 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13180
total_samples=28565, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:35,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.19 | bwd_microstep: 1785.17 | bwd_inner_microstep: 1708.08 | bwd_allreduce_microstep: 77.00 | step_microstep: 1.02
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13486
total_samples=28569, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:38,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 07:27:38,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 742.47 | bwd_microstep: 2084.81 | bwd_inner_microstep: 1929.19 | bwd_allreduce_microstep: 155.56 | step_microstep: 110.51
[2025-08-03 07:27:38,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2876.38 | bwd: 7665.75 | bwd_inner: 7148.17 | bwd_allreduce: 517.31 | step: 111.93
 94%|█████████▍| 1876/2000 [5:44:08<22:19, 10.80s/it]
 94%|█████████▍| 1877/2000 [5:44:19<22:26, 10.94s/it]
 94%|█████████▍| 1878/2000 [5:44:30<22:33, 11.09s/it]
 94%|█████████▍| 1879/2000 [5:44:42<22:21, 11.08s/it]
 94%|█████████▍| 1880/2000 [5:44:52<22:05, 11.04s/it]
{'loss': 0.7281, 'learning_rate': 1.8821823101760949e-07, 'epoch': 0.94}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13767
total_samples=28573, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:40,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.85 | bwd_microstep: 1751.71 | bwd_inner_microstep: 1701.42 | bwd_allreduce_microstep: 50.23 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12641
total_samples=28577, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:43,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.02 | bwd_microstep: 1995.75 | bwd_inner_microstep: 1631.99 | bwd_allreduce_microstep: 363.70 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14340
total_samples=28582, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:46,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.63 | bwd_microstep: 1804.02 | bwd_inner_microstep: 1759.76 | bwd_allreduce_microstep: 44.20 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13802
total_samples=28587, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:48,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 07:27:48,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.53 | bwd_microstep: 1817.57 | bwd_inner_microstep: 1711.90 | bwd_allreduce_microstep: 105.60 | step_microstep: 150.72
[2025-08-03 07:27:48,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2785.96 | bwd: 7369.11 | bwd_inner: 6805.06 | bwd_allreduce: 563.80 | step: 151.07
{'loss': 0.7221, 'learning_rate': 1.8510400482234848e-07, 'epoch': 0.94}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12863
total_samples=28591, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:51,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.49 | bwd_microstep: 2068.07 | bwd_inner_microstep: 1930.34 | bwd_allreduce_microstep: 137.67 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16122
total_samples=28595, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:54,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.70 | bwd_microstep: 1800.36 | bwd_inner_microstep: 1793.69 | bwd_allreduce_microstep: 6.60 | step_microstep: 0.28
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12571
total_samples=28599, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:56,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.21 | bwd_microstep: 1776.52 | bwd_inner_microstep: 1628.20 | bwd_allreduce_microstep: 148.26 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13390
total_samples=28603, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:27:59,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.41
[2025-08-03 07:27:59,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.59 | bwd_microstep: 1820.21 | bwd_inner_microstep: 1760.73 | bwd_allreduce_microstep: 59.42 | step_microstep: 152.40
[2025-08-03 07:27:59,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.91 | bwd: 7465.22 | bwd_inner: 7112.94 | bwd_allreduce: 352.04 | step: 152.91
{'loss': 0.7277, 'learning_rate': 1.8201551704762453e-07, 'epoch': 0.94}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13330
total_samples=28607, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:02,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.14 | bwd_microstep: 1784.41 | bwd_inner_microstep: 1648.97 | bwd_allreduce_microstep: 135.37 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13628
total_samples=28611, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:05,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.01 | bwd_microstep: 2323.07 | bwd_inner_microstep: 2186.79 | bwd_allreduce_microstep: 136.22 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13576
total_samples=28615, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:07,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.42 | bwd_microstep: 1785.48 | bwd_inner_microstep: 1708.45 | bwd_allreduce_microstep: 76.96 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12224
total_samples=28618, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:10,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 07:28:10,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.26 | bwd_microstep: 1839.70 | bwd_inner_microstep: 1584.04 | bwd_allreduce_microstep: 255.59 | step_microstep: 131.72
[2025-08-03 07:28:10,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2798.77 | bwd: 7732.71 | bwd_inner: 7128.23 | bwd_allreduce: 604.23 | step: 132.13
{'loss': 0.7388, 'learning_rate': 1.7895277579264015e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11714
total_samples=28621, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:13,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.86 | bwd_microstep: 1973.70 | bwd_inner_microstep: 1757.63 | bwd_allreduce_microstep: 216.00 | step_microstep: 0.77
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12926
total_samples=28625, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:15,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.99 | bwd_microstep: 1784.97 | bwd_inner_microstep: 1670.17 | bwd_allreduce_microstep: 114.73 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12016
total_samples=28628, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:18,292] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.08 | bwd_microstep: 1726.39 | bwd_inner_microstep: 1545.83 | bwd_allreduce_microstep: 180.48 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13380
total_samples=28632, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:21,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 07:28:21,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.35 | bwd_microstep: 2191.41 | bwd_inner_microstep: 1956.35 | bwd_allreduce_microstep: 234.99 | step_microstep: 109.47
[2025-08-03 07:28:21,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2807.20 | bwd: 7676.54 | bwd_inner: 6929.98 | bwd_allreduce: 746.30 | step: 110.59
{'loss': 0.7316, 'learning_rate': 1.7591578908907724e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11780
total_samples=28635, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:24,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.51 | bwd_microstep: 2098.44 | bwd_inner_microstep: 1895.16 | bwd_allreduce_microstep: 203.21 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13145
total_samples=28639, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:27,148] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.95 | bwd_microstep: 2090.92 | bwd_inner_microstep: 2039.56 | bwd_allreduce_microstep: 51.29 | step_microstep: 0.14
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14438
total_samples=28643, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:30,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.15 | bwd_microstep: 2534.22 | bwd_inner_microstep: 1672.35 | bwd_allreduce_microstep: 861.81 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11703
total_samples=28646, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:33,414] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.74
[2025-08-03 07:28:33,416] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.69 | bwd_microstep: 2077.00 | bwd_inner_microstep: 1846.91 | bwd_allreduce_microstep: 230.02 | step_microstep: 116.27
[2025-08-03 07:28:33,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2838.24 | bwd: 8800.63 | bwd_inner: 7453.98 | bwd_allreduce: 1346.40 | step: 116.79
 94%|█████████▍| 1881/2000 [5:45:03<21:38, 10.91s/it]
 94%|█████████▍| 1882/2000 [5:45:14<21:21, 10.86s/it]
 94%|█████████▍| 1883/2000 [5:45:25<21:15, 10.90s/it]
 94%|█████████▍| 1884/2000 [5:45:36<21:05, 10.91s/it]
 94%|█████████▍| 1885/2000 [5:45:48<21:33, 11.25s/it]
{'loss': 0.732, 'learning_rate': 1.7290456490107522e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11933
total_samples=28650, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:36,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.00 | bwd_microstep: 2070.85 | bwd_inner_microstep: 1910.55 | bwd_allreduce_microstep: 160.24 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13115
total_samples=28654, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:38,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.85 | bwd_microstep: 1859.57 | bwd_inner_microstep: 1729.43 | bwd_allreduce_microstep: 130.06 | step_microstep: 0.72
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11779
total_samples=28657, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:41,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.01 | bwd_microstep: 2166.69 | bwd_inner_microstep: 1747.41 | bwd_allreduce_microstep: 419.20 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13770
total_samples=28661, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:44,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.76
[2025-08-03 07:28:44,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.85 | bwd_microstep: 1850.81 | bwd_inner_microstep: 1741.06 | bwd_allreduce_microstep: 109.68 | step_microstep: 114.65
[2025-08-03 07:28:44,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2831.63 | bwd: 7947.97 | bwd_inner: 7128.44 | bwd_allreduce: 819.26 | step: 115.89
{'loss': 0.734, 'learning_rate': 1.699191111252241e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11868
total_samples=28664, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:47,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 679.71 | bwd_microstep: 1700.63 | bwd_inner_microstep: 1532.11 | bwd_allreduce_microstep: 168.45 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13295
total_samples=28668, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:49,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.74 | bwd_microstep: 1731.87 | bwd_inner_microstep: 1668.64 | bwd_allreduce_microstep: 63.16 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13827
total_samples=28671, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:52,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.60 | bwd_microstep: 2057.23 | bwd_inner_microstep: 1824.22 | bwd_allreduce_microstep: 232.95 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13466
total_samples=28675, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:55,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 07:28:55,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.09 | bwd_microstep: 1858.31 | bwd_inner_microstep: 1713.12 | bwd_allreduce_microstep: 145.12 | step_microstep: 134.26
[2025-08-03 07:28:55,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.06 | bwd: 7348.10 | bwd_inner: 6738.10 | bwd_allreduce: 609.76 | step: 134.88
{'loss': 0.7336, 'learning_rate': 1.6695943559052463e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11759
total_samples=28678, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:28:57,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.67 | bwd_microstep: 1953.12 | bwd_inner_microstep: 1539.42 | bwd_allreduce_microstep: 413.63 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13623
total_samples=28682, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:00,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.00 | bwd_microstep: 1877.45 | bwd_inner_microstep: 1743.06 | bwd_allreduce_microstep: 134.32 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11615
total_samples=28685, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:03,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.81 | bwd_microstep: 1967.31 | bwd_inner_microstep: 1769.38 | bwd_allreduce_microstep: 197.86 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13165
total_samples=28689, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:06,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.51
[2025-08-03 07:29:06,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.02 | bwd_microstep: 1842.98 | bwd_inner_microstep: 1713.18 | bwd_allreduce_microstep: 129.73 | step_microstep: 114.29
[2025-08-03 07:29:06,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.42 | bwd: 7640.90 | bwd_inner: 6765.05 | bwd_allreduce: 875.62 | step: 114.91
{'loss': 0.7277, 'learning_rate': 1.6402554605838173e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11620
total_samples=28692, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:08,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.39 | bwd_microstep: 1980.34 | bwd_inner_microstep: 1775.30 | bwd_allreduce_microstep: 204.98 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14194
total_samples=28697, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:11,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.13 | bwd_microstep: 1851.03 | bwd_inner_microstep: 1701.91 | bwd_allreduce_microstep: 149.06 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12393
total_samples=28701, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:14,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.53 | bwd_microstep: 1788.68 | bwd_inner_microstep: 1782.62 | bwd_allreduce_microstep: 6.00 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13942
total_samples=28705, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:16,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.17
[2025-08-03 07:29:16,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.93 | bwd_microstep: 1773.21 | bwd_inner_microstep: 1724.33 | bwd_allreduce_microstep: 48.82 | step_microstep: 111.25
[2025-08-03 07:29:16,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.91 | bwd: 7393.32 | bwd_inner: 6984.15 | bwd_allreduce: 408.93 | step: 111.71
{'loss': 0.7341, 'learning_rate': 1.6111745022257873e-07, 'epoch': 0.94}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12889
total_samples=28709, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:19,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.28 | bwd_microstep: 1785.41 | bwd_inner_microstep: 1607.08 | bwd_allreduce_microstep: 178.27 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13112
total_samples=28713, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:21,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.04 | bwd_microstep: 1787.44 | bwd_inner_microstep: 1644.57 | bwd_allreduce_microstep: 142.79 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12618
total_samples=28716, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:24,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.66 | bwd_microstep: 1724.91 | bwd_inner_microstep: 1570.19 | bwd_allreduce_microstep: 154.66 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13558
total_samples=28720, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:27,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.69
[2025-08-03 07:29:27,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.39 | bwd_microstep: 1801.97 | bwd_inner_microstep: 1708.87 | bwd_allreduce_microstep: 93.03 | step_microstep: 158.57
[2025-08-03 07:29:27,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.30 | bwd: 7099.79 | bwd_inner: 6530.70 | bwd_allreduce: 568.84 | step: 159.08
 94%|█████████▍| 1886/2000 [5:45:59<21:20, 11.24s/it]
 94%|█████████▍| 1887/2000 [5:46:10<20:47, 11.04s/it]
 94%|█████████▍| 1888/2000 [5:46:20<20:32, 11.01s/it]
 94%|█████████▍| 1889/2000 [5:46:31<20:08, 10.88s/it]
 94%|█████████▍| 1890/2000 [5:46:41<19:40, 10.73s/it]
{'loss': 0.7319, 'learning_rate': 1.5823515570925763e-07, 'epoch': 0.94}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13121
total_samples=28724, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:29,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.02 | bwd_microstep: 1978.24 | bwd_inner_microstep: 1871.43 | bwd_allreduce_microstep: 106.74 | step_microstep: 0.29
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12589
total_samples=28728, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:32,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.31 | bwd_microstep: 2151.87 | bwd_inner_microstep: 1966.94 | bwd_allreduce_microstep: 184.87 | step_microstep: 0.33
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12215
total_samples=28731, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:35,633] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.28 | bwd_microstep: 2020.61 | bwd_inner_microstep: 1900.11 | bwd_allreduce_microstep: 120.43 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13297
total_samples=28735, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:38,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.86
[2025-08-03 07:29:38,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.01 | bwd_microstep: 1778.05 | bwd_inner_microstep: 1691.49 | bwd_allreduce_microstep: 86.51 | step_microstep: 144.49
[2025-08-03 07:29:38,343] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2830.55 | bwd: 7928.82 | bwd_inner: 7429.96 | bwd_allreduce: 498.63 | step: 145.24
{'loss': 0.733, 'learning_rate': 1.5537867007690111e-07, 'epoch': 0.95}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 16226
total_samples=28739, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:41,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.57 | bwd_microstep: 2125.51 | bwd_inner_microstep: 2022.48 | bwd_allreduce_microstep: 102.96 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11895
total_samples=28742, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:43,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.21 | bwd_microstep: 1852.83 | bwd_inner_microstep: 1611.96 | bwd_allreduce_microstep: 240.78 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12467
total_samples=28745, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:46,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.61 | bwd_microstep: 2178.90 | bwd_inner_microstep: 1802.38 | bwd_allreduce_microstep: 376.45 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14915
total_samples=28752, num_samples=7, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:49,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.44
[2025-08-03 07:29:49,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.48 | bwd_microstep: 1758.71 | bwd_inner_microstep: 1745.94 | bwd_allreduce_microstep: 12.71 | step_microstep: 125.62
[2025-08-03 07:29:49,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2849.78 | bwd: 7916.00 | bwd_inner: 7182.75 | bwd_allreduce: 732.99 | step: 126.43
{'loss': 0.7364, 'learning_rate': 1.5254800081630828e-07, 'epoch': 0.95}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11600
total_samples=28755, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:52,149] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.31 | bwd_microstep: 1807.34 | bwd_inner_microstep: 1577.65 | bwd_allreduce_microstep: 229.63 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11890
total_samples=28758, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:54,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.51 | bwd_microstep: 1744.41 | bwd_inner_microstep: 1547.29 | bwd_allreduce_microstep: 197.05 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11731
total_samples=28761, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:29:57,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.64 | bwd_microstep: 2016.14 | bwd_inner_microstep: 1530.75 | bwd_allreduce_microstep: 485.32 | step_microstep: 0.29
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14502
total_samples=28766, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:00,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.81
[2025-08-03 07:30:00,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.66 | bwd_microstep: 1727.80 | bwd_inner_microstep: 1686.86 | bwd_allreduce_microstep: 40.86 | step_microstep: 134.23
[2025-08-03 07:30:00,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2786.05 | bwd: 7295.74 | bwd_inner: 6342.54 | bwd_allreduce: 952.95 | step: 134.76
{'loss': 0.7346, 'learning_rate': 1.4974315535058016e-07, 'epoch': 0.95}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11988
total_samples=28769, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:02,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.75 | bwd_microstep: 1851.77 | bwd_inner_microstep: 1562.57 | bwd_allreduce_microstep: 289.13 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13143
total_samples=28773, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:05,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.10 | bwd_microstep: 1984.85 | bwd_inner_microstep: 1883.62 | bwd_allreduce_microstep: 101.14 | step_microstep: 0.31
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12801
total_samples=28776, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:07,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.81 | bwd_microstep: 1748.39 | bwd_inner_microstep: 1593.42 | bwd_allreduce_microstep: 154.90 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15234
total_samples=28780, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:10,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 07:30:10,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.10 | bwd_microstep: 1760.53 | bwd_inner_microstep: 1741.59 | bwd_allreduce_microstep: 18.86 | step_microstep: 151.02
[2025-08-03 07:30:10,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2780.69 | bwd: 7345.59 | bwd_inner: 6781.19 | bwd_allreduce: 564.13 | step: 151.70
{'loss': 0.7283, 'learning_rate': 1.469641410350964e-07, 'epoch': 0.95}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13978
total_samples=28784, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:13,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.63 | bwd_microstep: 1756.41 | bwd_inner_microstep: 1711.12 | bwd_allreduce_microstep: 45.22 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13446
total_samples=28788, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:15,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.84 | bwd_microstep: 1990.76 | bwd_inner_microstep: 1699.86 | bwd_allreduce_microstep: 290.83 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11743
total_samples=28791, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:18,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.27 | bwd_microstep: 1816.89 | bwd_inner_microstep: 1600.01 | bwd_allreduce_microstep: 216.82 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11586
total_samples=28794, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:21,465] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 07:30:21,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.92 | bwd_microstep: 1988.92 | bwd_inner_microstep: 1783.45 | bwd_allreduce_microstep: 205.41 | step_microstep: 146.12
[2025-08-03 07:30:21,466] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.59 | bwd: 7553.04 | bwd_inner: 6794.44 | bwd_allreduce: 758.35 | step: 146.61
{'loss': 0.7378, 'learning_rate': 1.4421096515749855e-07, 'epoch': 0.95}
 95%|█████████▍| 1891/2000 [5:46:53<19:45, 10.88s/it]
 95%|█████████▍| 1892/2000 [5:47:04<19:45, 10.98s/it]
 95%|█████████▍| 1893/2000 [5:47:14<19:19, 10.84s/it]
 95%|█████████▍| 1894/2000 [5:47:25<19:01, 10.77s/it]
 95%|█████████▍| 1895/2000 [5:47:36<18:51, 10.78s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13508
total_samples=28798, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:24,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.19 | bwd_microstep: 1827.97 | bwd_inner_microstep: 1695.57 | bwd_allreduce_microstep: 132.33 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12875
total_samples=28802, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:26,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.08 | bwd_microstep: 1819.83 | bwd_inner_microstep: 1702.58 | bwd_allreduce_microstep: 117.19 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11838
total_samples=28805, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:29,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.08 | bwd_microstep: 1829.20 | bwd_inner_microstep: 1592.49 | bwd_allreduce_microstep: 236.64 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11855
total_samples=28808, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:32,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 07:30:32,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.68 | bwd_microstep: 1914.77 | bwd_inner_microstep: 1586.97 | bwd_allreduce_microstep: 327.73 | step_microstep: 108.52
[2025-08-03 07:30:32,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.96 | bwd: 7391.81 | bwd_inner: 6577.62 | bwd_allreduce: 813.96 | step: 108.94
{'loss': 0.7118, 'learning_rate': 1.4148363493766803e-07, 'epoch': 0.95}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13447
total_samples=28812, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:34,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.71 | bwd_microstep: 1973.32 | bwd_inner_microstep: 1675.15 | bwd_allreduce_microstep: 298.11 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13603
total_samples=28816, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:37,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.76 | bwd_microstep: 1748.81 | bwd_inner_microstep: 1687.94 | bwd_allreduce_microstep: 60.79 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14359
total_samples=28820, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:39,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.17 | bwd_microstep: 1744.36 | bwd_inner_microstep: 1712.99 | bwd_allreduce_microstep: 31.30 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11606
total_samples=28824, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:42,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 07:30:42,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.58 | bwd_microstep: 1775.40 | bwd_inner_microstep: 1537.64 | bwd_allreduce_microstep: 237.70 | step_microstep: 109.30
[2025-08-03 07:30:42,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.16 | bwd: 7241.94 | bwd_inner: 6613.72 | bwd_allreduce: 627.98 | step: 109.66
{'loss': 0.7274, 'learning_rate': 1.3878215752771264e-07, 'epoch': 0.95}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12836
total_samples=28828, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:45,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.39 | bwd_microstep: 2143.54 | bwd_inner_microstep: 1923.74 | bwd_allreduce_microstep: 219.74 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13305
total_samples=28832, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:48,406] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.61 | bwd_microstep: 2126.50 | bwd_inner_microstep: 1704.95 | bwd_allreduce_microstep: 421.50 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13864
total_samples=28836, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:50,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.39 | bwd_microstep: 1802.76 | bwd_inner_microstep: 1724.67 | bwd_allreduce_microstep: 78.02 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12173
total_samples=28839, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:54,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.29
[2025-08-03 07:30:54,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.09 | bwd_microstep: 2647.37 | bwd_inner_microstep: 2211.15 | bwd_allreduce_microstep: 436.16 | step_microstep: 133.39
[2025-08-03 07:30:54,534] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.40 | bwd: 8720.22 | bwd_inner: 7564.50 | bwd_allreduce: 1155.49 | step: 133.89
{'loss': 0.7374, 'learning_rate': 1.361065400119399e-07, 'epoch': 0.95}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14068
total_samples=28843, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:57,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.61 | bwd_microstep: 1803.49 | bwd_inner_microstep: 1716.34 | bwd_allreduce_microstep: 87.09 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13337
total_samples=28847, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:30:59,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.87 | bwd_microstep: 1955.65 | bwd_inner_microstep: 1892.69 | bwd_allreduce_microstep: 62.89 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13485
total_samples=28851, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:02,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.12 | bwd_microstep: 1745.11 | bwd_inner_microstep: 1680.60 | bwd_allreduce_microstep: 64.44 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13029
total_samples=28855, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:05,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 18.26
[2025-08-03 07:31:05,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.38 | bwd_microstep: 1882.72 | bwd_inner_microstep: 1665.26 | bwd_allreduce_microstep: 217.39 | step_microstep: 388.83
[2025-08-03 07:31:05,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.90 | bwd: 7387.02 | bwd_inner: 6954.89 | bwd_allreduce: 431.89 | step: 389.29
{'loss': 0.7401, 'learning_rate': 1.3345678940684615e-07, 'epoch': 0.95}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14658
total_samples=28859, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:08,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.60 | bwd_microstep: 2096.62 | bwd_inner_microstep: 1860.66 | bwd_allreduce_microstep: 235.90 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11992
total_samples=28862, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:11,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.80 | bwd_microstep: 2036.99 | bwd_inner_microstep: 1812.70 | bwd_allreduce_microstep: 224.23 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11979
total_samples=28865, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:14,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.85 | bwd_microstep: 2075.86 | bwd_inner_microstep: 1731.18 | bwd_allreduce_microstep: 344.62 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 11753
total_samples=28869, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:16,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.79
[2025-08-03 07:31:16,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.22 | bwd_microstep: 2038.14 | bwd_inner_microstep: 1603.32 | bwd_allreduce_microstep: 434.76 | step_microstep: 119.69
[2025-08-03 07:31:16,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2882.40 | bwd: 8247.65 | bwd_inner: 7007.84 | bwd_allreduce: 1239.58 | step: 120.02
{'loss': 0.7174, 'learning_rate': 1.30832912661093e-07, 'epoch': 0.95}
 95%|█████████▍| 1896/2000 [5:47:46<18:37, 10.74s/it]
 95%|█████████▍| 1897/2000 [5:47:57<18:16, 10.64s/it]
 95%|█████████▍| 1898/2000 [5:48:09<18:46, 11.05s/it]
 95%|█████████▍| 1899/2000 [5:48:20<18:31, 11.01s/it]
 95%|█████████▌| 1900/2000 [5:48:31<18:36, 11.16s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12087
total_samples=28872, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:19,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.23 | bwd_microstep: 1749.78 | bwd_inner_microstep: 1555.65 | bwd_allreduce_microstep: 194.06 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12918
total_samples=28875, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:21,950] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.24 | bwd_microstep: 1699.93 | bwd_inner_microstep: 1575.65 | bwd_allreduce_microstep: 124.22 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12313
total_samples=28878, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:24,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 752.78 | bwd_microstep: 1873.27 | bwd_inner_microstep: 1743.26 | bwd_allreduce_microstep: 129.94 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12421
total_samples=28882, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:27,535] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 07:31:27,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.39 | bwd_microstep: 2008.92 | bwd_inner_microstep: 1689.92 | bwd_allreduce_microstep: 318.94 | step_microstep: 107.26
[2025-08-03 07:31:27,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.57 | bwd: 7331.95 | bwd_inner: 6564.48 | bwd_allreduce: 767.24 | step: 107.74
{'loss': 0.7257, 'learning_rate': 1.2823491665549193e-07, 'epoch': 0.95}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11689
total_samples=28885, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:30,056] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.29 | bwd_microstep: 1756.41 | bwd_inner_microstep: 1555.99 | bwd_allreduce_microstep: 200.34 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14937
total_samples=28889, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:32,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.33 | bwd_microstep: 2042.45 | bwd_inner_microstep: 1954.28 | bwd_allreduce_microstep: 88.11 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13307
total_samples=28893, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:35,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.34 | bwd_microstep: 1860.74 | bwd_inner_microstep: 1700.51 | bwd_allreduce_microstep: 160.17 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11642
total_samples=28896, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:38,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 07:31:38,274] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.71 | bwd_microstep: 1876.10 | bwd_inner_microstep: 1869.68 | bwd_allreduce_microstep: 6.35 | step_microstep: 135.97
[2025-08-03 07:31:38,275] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.60 | bwd: 7535.74 | bwd_inner: 7080.45 | bwd_allreduce: 455.06 | step: 136.42
{'loss': 0.7242, 'learning_rate': 1.2566280820298427e-07, 'epoch': 0.95}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13872
total_samples=28900, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:40,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.91 | bwd_microstep: 1768.60 | bwd_inner_microstep: 1710.33 | bwd_allreduce_microstep: 58.20 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14707
total_samples=28904, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:43,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.19 | bwd_microstep: 1817.55 | bwd_inner_microstep: 1744.39 | bwd_allreduce_microstep: 73.09 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12014
total_samples=28907, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:46,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.06 | bwd_microstep: 1796.30 | bwd_inner_microstep: 1572.02 | bwd_allreduce_microstep: 224.21 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12014
total_samples=28910, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:48,740] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 07:31:48,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.45 | bwd_microstep: 1868.98 | bwd_inner_microstep: 1733.28 | bwd_allreduce_microstep: 135.63 | step_microstep: 112.08
[2025-08-03 07:31:48,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.54 | bwd: 7251.47 | bwd_inner: 6760.02 | bwd_allreduce: 491.22 | step: 112.44
{'loss': 0.7339, 'learning_rate': 1.231165940486234e-07, 'epoch': 0.95}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13103
total_samples=28914, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:51,375] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.30 | bwd_microstep: 1826.30 | bwd_inner_microstep: 1717.66 | bwd_allreduce_microstep: 108.57 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13296
total_samples=28918, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:53,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.89 | bwd_microstep: 1780.90 | bwd_inner_microstep: 1691.15 | bwd_allreduce_microstep: 89.69 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13697
total_samples=28922, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:56,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.09 | bwd_microstep: 2014.41 | bwd_inner_microstep: 1742.06 | bwd_allreduce_microstep: 272.29 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11739
total_samples=28925, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:31:59,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.03
[2025-08-03 07:31:59,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.12 | bwd_microstep: 1955.56 | bwd_inner_microstep: 1532.82 | bwd_allreduce_microstep: 422.67 | step_microstep: 131.25
[2025-08-03 07:31:59,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2853.34 | bwd: 7577.23 | bwd_inner: 6683.70 | bwd_allreduce: 893.30 | step: 131.69
{'loss': 0.7313, 'learning_rate': 1.2059628086956044e-07, 'epoch': 0.95}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11646
total_samples=28928, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:02,361] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.24 | bwd_microstep: 1956.69 | bwd_inner_microstep: 1738.73 | bwd_allreduce_microstep: 217.89 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13957
total_samples=28933, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:05,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.24 | bwd_microstep: 2192.59 | bwd_inner_microstep: 2010.32 | bwd_allreduce_microstep: 182.20 | step_microstep: 0.22
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12624
total_samples=28937, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:07,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.13 | bwd_microstep: 1867.26 | bwd_inner_microstep: 1780.22 | bwd_allreduce_microstep: 86.98 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11673
total_samples=28940, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:11,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.89
[2025-08-03 07:32:11,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.98 | bwd_microstep: 2183.52 | bwd_inner_microstep: 1713.41 | bwd_allreduce_microstep: 470.00 | step_microstep: 113.57
[2025-08-03 07:32:11,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.52 | bwd: 8200.11 | bwd_inner: 7242.70 | bwd_allreduce: 957.13 | step: 114.02
{'loss': 0.727, 'learning_rate': 1.1810187527502182e-07, 'epoch': 0.95}
 95%|█████████▌| 1901/2000 [5:48:42<18:07, 10.98s/it]
 95%|█████████▌| 1902/2000 [5:48:53<17:49, 10.91s/it]
 95%|█████████▌| 1903/2000 [5:49:03<17:25, 10.78s/it]
 95%|█████████▌| 1904/2000 [5:49:14<17:16, 10.80s/it]
 95%|█████████▌| 1905/2000 [5:49:25<17:24, 10.99s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13750
total_samples=28944, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:13,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.91 | bwd_microstep: 1733.50 | bwd_inner_microstep: 1695.37 | bwd_allreduce_microstep: 38.06 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12459
total_samples=28947, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:16,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.24 | bwd_microstep: 1818.56 | bwd_inner_microstep: 1615.97 | bwd_allreduce_microstep: 202.53 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11768
total_samples=28950, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:18,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.04 | bwd_microstep: 1980.01 | bwd_inner_microstep: 1763.94 | bwd_allreduce_microstep: 216.00 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13739
total_samples=28954, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:21,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 07:32:21,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.63 | bwd_microstep: 1932.13 | bwd_inner_microstep: 1861.80 | bwd_allreduce_microstep: 70.27 | step_microstep: 136.98
[2025-08-03 07:32:21,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.75 | bwd: 7464.26 | bwd_inner: 6937.07 | bwd_allreduce: 526.93 | step: 137.44
{'loss': 0.7365, 'learning_rate': 1.1563338380629618e-07, 'epoch': 0.95}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11779
total_samples=28957, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:24,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.97 | bwd_microstep: 1783.04 | bwd_inner_microstep: 1541.90 | bwd_allreduce_microstep: 241.08 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12075
total_samples=28960, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:27,145] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.70 | bwd_microstep: 1993.57 | bwd_inner_microstep: 1768.21 | bwd_allreduce_microstep: 225.29 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13284
total_samples=28964, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:29,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.62 | bwd_microstep: 1916.03 | bwd_inner_microstep: 1709.15 | bwd_allreduce_microstep: 206.81 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12072
total_samples=28967, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:32,467] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.25
[2025-08-03 07:32:32,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.46 | bwd_microstep: 1749.67 | bwd_inner_microstep: 1564.87 | bwd_allreduce_microstep: 184.74 | step_microstep: 110.01
[2025-08-03 07:32:32,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.68 | bwd: 7442.36 | bwd_inner: 6584.13 | bwd_allreduce: 858.00 | step: 110.45
{'loss': 0.7306, 'learning_rate': 1.1319081293671541e-07, 'epoch': 0.95}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13340
total_samples=28971, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:34,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.76 | bwd_microstep: 1719.20 | bwd_inner_microstep: 1661.57 | bwd_allreduce_microstep: 57.55 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14583
total_samples=28975, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:37,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.27 | bwd_microstep: 1775.56 | bwd_inner_microstep: 1754.98 | bwd_allreduce_microstep: 20.51 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14470
total_samples=28979, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:40,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.78 | bwd_microstep: 2123.76 | bwd_inner_microstep: 2052.20 | bwd_allreduce_microstep: 71.49 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13638
total_samples=28983, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:43,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.97
[2025-08-03 07:32:43,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.03 | bwd_microstep: 1799.47 | bwd_inner_microstep: 1716.95 | bwd_allreduce_microstep: 82.46 | step_microstep: 135.36
[2025-08-03 07:32:43,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.75 | bwd: 7418.04 | bwd_inner: 7185.70 | bwd_allreduce: 232.10 | step: 135.84
{'loss': 0.7269, 'learning_rate': 1.1077416907163573e-07, 'epoch': 0.95}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13333
total_samples=28987, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:45,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.68 | bwd_microstep: 1805.54 | bwd_inner_microstep: 1703.65 | bwd_allreduce_microstep: 101.84 | step_microstep: 0.08
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12164
total_samples=28991, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:48,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.71 | bwd_microstep: 1760.20 | bwd_inner_microstep: 1617.09 | bwd_allreduce_microstep: 143.05 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13007
total_samples=28995, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:50,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.96 | bwd_microstep: 1774.80 | bwd_inner_microstep: 1679.54 | bwd_allreduce_microstep: 95.21 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13798
total_samples=29000, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:53,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.74
[2025-08-03 07:32:53,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.27 | bwd_microstep: 1711.12 | bwd_inner_microstep: 1641.69 | bwd_allreduce_microstep: 69.35 | step_microstep: 142.01
[2025-08-03 07:32:53,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.54 | bwd: 7051.71 | bwd_inner: 6641.95 | bwd_allreduce: 409.52 | step: 142.32
{'loss': 0.7233, 'learning_rate': 1.0838345854842447e-07, 'epoch': 0.95}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12473
total_samples=29004, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:55,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.26 | bwd_microstep: 1778.64 | bwd_inner_microstep: 1596.12 | bwd_allreduce_microstep: 182.46 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13394
total_samples=29008, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:32:58,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.50 | bwd_microstep: 1983.49 | bwd_inner_microstep: 1867.89 | bwd_allreduce_microstep: 115.53 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13625
total_samples=29012, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:01,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 725.74 | bwd_microstep: 1958.89 | bwd_inner_microstep: 1729.13 | bwd_allreduce_microstep: 229.71 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11701
total_samples=29015, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:04,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.93
[2025-08-03 07:33:04,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 729.60 | bwd_microstep: 2102.54 | bwd_inner_microstep: 1614.96 | bwd_allreduce_microstep: 487.47 | step_microstep: 133.80
[2025-08-03 07:33:04,529] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2903.03 | bwd: 7823.61 | bwd_inner: 6808.11 | bwd_allreduce: 1015.23 | step: 134.13
{'loss': 0.7296, 'learning_rate': 1.0601868763643997e-07, 'epoch': 0.95}
95%|█████████▌| 1906/2000 [5:49:36<17:05, 10.91s/it]
95%|█████████▌| 1907/2000 [5:49:47<16:49, 10.85s/it]
95%|█████████▌| 1908/2000 [5:49:57<16:32, 10.78s/it]
95%|█████████▌| 1909/2000 [5:50:08<16:08, 10.64s/it]
96%|█████████▌| 1910/2000 [5:50:19<16:10, 10.79s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12172
total_samples=29018, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:07,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.71 | bwd_microstep: 1968.06 | bwd_inner_microstep: 1781.12 | bwd_allreduce_microstep: 186.87 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11810
total_samples=29021, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:10,065] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.78 | bwd_microstep: 2011.53 | bwd_inner_microstep: 1550.03 | bwd_allreduce_microstep: 461.44 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12044
total_samples=29024, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:12,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 680.64 | bwd_microstep: 1750.46 | bwd_inner_microstep: 1562.30 | bwd_allreduce_microstep: 188.09 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11503
total_samples=29027, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:15,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.74
[2025-08-03 07:33:15,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.11 | bwd_microstep: 1890.65 | bwd_inner_microstep: 1537.84 | bwd_allreduce_microstep: 352.74 | step_microstep: 111.83
[2025-08-03 07:33:15,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.16 | bwd: 7620.76 | bwd_inner: 6431.29 | bwd_allreduce: 1189.23 | step: 112.18
{'loss': 0.7263, 'learning_rate': 1.0367986253701945e-07, 'epoch': 0.96}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13373
total_samples=29031, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:18,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.78 | bwd_microstep: 2136.69 | bwd_inner_microstep: 1975.65 | bwd_allreduce_microstep: 160.98 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12103
total_samples=29034, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:20,774] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.38 | bwd_microstep: 1771.40 | bwd_inner_microstep: 1575.14 | bwd_allreduce_microstep: 196.19 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11737
total_samples=29037, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:23,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.65 | bwd_microstep: 2087.20 | bwd_inner_microstep: 1550.80 | bwd_allreduce_microstep: 536.33 | step_microstep: 0.12
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13633
total_samples=29041, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:26,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 07:33:26,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.90 | bwd_microstep: 1899.62 | bwd_inner_microstep: 1778.11 | bwd_allreduce_microstep: 121.45 | step_microstep: 143.50
[2025-08-03 07:33:26,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.63 | bwd: 7894.96 | bwd_inner: 6879.70 | bwd_allreduce: 1015.03 | step: 143.86
{'loss': 0.7254, 'learning_rate': 1.0136698938346012e-07, 'epoch': 0.96}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13972
total_samples=29046, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:29,317] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.51 | bwd_microstep: 2069.17 | bwd_inner_microstep: 1954.27 | bwd_allreduce_microstep: 114.84 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11492
total_samples=29049, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:32,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.58 | bwd_microstep: 2014.14 | bwd_inner_microstep: 1792.47 | bwd_allreduce_microstep: 221.61 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11939
total_samples=29052, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:34,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.47 | bwd_microstep: 2058.04 | bwd_inner_microstep: 1816.25 | bwd_allreduce_microstep: 241.72 | step_microstep: 0.26
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11835
total_samples=29056, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:37,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.18
[2025-08-03 07:33:37,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.18 | bwd_microstep: 1771.41 | bwd_inner_microstep: 1537.93 | bwd_allreduce_microstep: 233.41 | step_microstep: 125.23
[2025-08-03 07:33:37,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2809.67 | bwd: 7912.82 | bwd_inner: 7100.92 | bwd_allreduce: 811.67 | step: 125.83
{'loss': 0.7313, 'learning_rate': 9.90800742410003e-08, 'epoch': 0.96}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11847
total_samples=29059, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:40,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.23 | bwd_microstep: 1766.15 | bwd_inner_microstep: 1561.85 | bwd_allreduce_microstep: 204.24 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12158
total_samples=29062, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:42,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.14 | bwd_microstep: 1967.00 | bwd_inner_microstep: 1569.77 | bwd_allreduce_microstep: 397.16 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13689
total_samples=29066, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:45,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.40 | bwd_microstep: 1740.80 | bwd_inner_microstep: 1678.78 | bwd_allreduce_microstep: 61.96 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11959
total_samples=29069, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:48,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 07:33:48,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.82 | bwd_microstep: 1773.14 | bwd_inner_microstep: 1568.22 | bwd_allreduce_microstep: 204.84 | step_microstep: 110.28
[2025-08-03 07:33:48,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.53 | bwd: 7247.15 | bwd_inner: 6378.61 | bwd_allreduce: 868.29 | step: 110.75
{'loss': 0.7397, 'learning_rate': 9.68191231068083e-08, 'epoch': 0.96}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13385
total_samples=29073, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:50,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.76 | bwd_microstep: 1888.30 | bwd_inner_microstep: 1842.11 | bwd_allreduce_microstep: 46.12 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14339
total_samples=29077, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:53,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.92 | bwd_microstep: 1911.19 | bwd_inner_microstep: 1896.28 | bwd_allreduce_microstep: 14.86 | step_microstep: 0.10
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13351
total_samples=29081, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:56,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.06 | bwd_microstep: 1844.29 | bwd_inner_microstep: 1699.43 | bwd_allreduce_microstep: 144.79 | step_microstep: 0.26
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13718
total_samples=29086, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:33:58,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 07:33:58,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.19 | bwd_microstep: 1935.71 | bwd_inner_microstep: 1612.07 | bwd_allreduce_microstep: 323.56 | step_microstep: 113.33
[2025-08-03 07:33:58,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.87 | bwd: 7579.54 | bwd_inner: 7049.88 | bwd_allreduce: 529.42 | step: 113.91
{'loss': 0.7313, 'learning_rate': 9.45841419099669e-08, 'epoch': 0.96}
96%|█████████▌| 1911/2000 [5:50:30<16:00, 10.79s/it]
96%|█████████▌| 1912/2000 [5:50:41<15:58, 10.89s/it]
96%|█████████▌| 1913/2000 [5:50:52<15:53, 10.97s/it]
96%|█████████▌| 1914/2000 [5:51:02<15:30, 10.82s/it]
96%|█████████▌| 1915/2000 [5:51:13<15:18, 10.81s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13304
total_samples=29090, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:02,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 757.10 | bwd_microstep: 2667.08 | bwd_inner_microstep: 2502.12 | bwd_allreduce_microstep: 164.91 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11912
total_samples=29094, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:05,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.06 | bwd_microstep: 2005.20 | bwd_inner_microstep: 1788.53 | bwd_allreduce_microstep: 216.62 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12853
total_samples=29098, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:08,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 984.83 | bwd_microstep: 1832.47 | bwd_inner_microstep: 1643.46 | bwd_allreduce_microstep: 188.94 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13182
total_samples=29103, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:10,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.74
[2025-08-03 07:34:10,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.87 | bwd_microstep: 1999.57 | bwd_inner_microstep: 1868.60 | bwd_allreduce_microstep: 130.90 | step_microstep: 112.55
[2025-08-03 07:34:10,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 3164.78 | bwd: 8504.38 | bwd_inner: 7802.70 | bwd_allreduce: 701.44 | step: 113.00
{'loss': 0.7259, 'learning_rate': 9.237513651145224e-08, 'epoch': 0.96}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14636
total_samples=29107, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:13,561] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.97 | bwd_microstep: 1845.39 | bwd_inner_microstep: 1780.12 | bwd_allreduce_microstep: 65.20 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13000
total_samples=29111, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:16,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.97 | bwd_microstep: 1842.19 | bwd_inner_microstep: 1574.12 | bwd_allreduce_microstep: 268.01 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14250
total_samples=29115, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:18,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.36 | bwd_microstep: 1866.02 | bwd_inner_microstep: 1764.12 | bwd_allreduce_microstep: 101.83 | step_microstep: 0.12
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12586
total_samples=29119, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:21,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 07:34:21,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.37 | bwd_microstep: 1779.92 | bwd_inner_microstep: 1634.91 | bwd_allreduce_microstep: 144.93 | step_microstep: 112.03
[2025-08-03 07:34:21,483] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.62 | bwd: 7333.56 | bwd_inner: 6753.26 | bwd_allreduce: 580.06 | step: 112.41
{'loss': 0.7299, 'learning_rate': 9.019211270412275e-08, 'epoch': 0.96}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13729
total_samples=29124, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:24,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.29 | bwd_microstep: 1724.24 | bwd_inner_microstep: 1672.93 | bwd_allreduce_microstep: 51.24 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13745
total_samples=29127, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:26,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 750.61 | bwd_microstep: 1768.53 | bwd_inner_microstep: 1637.84 | bwd_allreduce_microstep: 130.61 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12051
total_samples=29130, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:29,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.99 | bwd_microstep: 1839.24 | bwd_inner_microstep: 1566.05 | bwd_allreduce_microstep: 273.12 | step_microstep: 0.85
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12403
total_samples=29134, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:32,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.57
[2025-08-03 07:34:32,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.11 | bwd_microstep: 1882.35 | bwd_inner_microstep: 1571.24 | bwd_allreduce_microstep: 311.04 | step_microstep: 132.24
[2025-08-03 07:34:32,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2893.95 | bwd: 7214.43 | bwd_inner: 6448.04 | bwd_allreduce: 766.10 | step: 133.37
{'loss': 0.7325, 'learning_rate': 8.80350762127058e-08, 'epoch': 0.96}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15012
total_samples=29139, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:34,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.64 | bwd_microstep: 1768.97 | bwd_inner_microstep: 1744.48 | bwd_allreduce_microstep: 24.43 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13211
total_samples=29143, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:37,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.29 | bwd_microstep: 2170.37 | bwd_inner_microstep: 1919.85 | bwd_allreduce_microstep: 250.45 | step_microstep: 0.66
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13252
total_samples=29146, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:40,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.02 | bwd_microstep: 1748.68 | bwd_inner_microstep: 1612.15 | bwd_allreduce_microstep: 136.44 | step_microstep: 0.34
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13858
total_samples=29150, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:43,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 07:34:43,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.66 | bwd_microstep: 2080.64 | bwd_inner_microstep: 1953.76 | bwd_allreduce_microstep: 126.80 | step_microstep: 114.17
[2025-08-03 07:34:43,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2811.54 | bwd: 7768.72 | bwd_inner: 7230.25 | bwd_allreduce: 538.21 | step: 115.32
{'loss': 0.7458, 'learning_rate': 8.590403269377656e-08, 'epoch': 0.96}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13166
total_samples=29154, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:45,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.75 | bwd_microstep: 2030.39 | bwd_inner_microstep: 1861.24 | bwd_allreduce_microstep: 169.07 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13532
total_samples=29158, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:48,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 722.12 | bwd_microstep: 2340.90 | bwd_inner_microstep: 2133.70 | bwd_allreduce_microstep: 207.13 | step_microstep: 0.26
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12563
total_samples=29162, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:51,662] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.36 | bwd_microstep: 1916.19 | bwd_inner_microstep: 1775.26 | bwd_allreduce_microstep: 140.86 | step_microstep: 0.26
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13844
total_samples=29166, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:54,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 07:34:54,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.55 | bwd_microstep: 1877.75 | bwd_inner_microstep: 1724.65 | bwd_allreduce_microstep: 153.04 | step_microstep: 137.42
[2025-08-03 07:34:54,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2845.71 | bwd: 8165.28 | bwd_inner: 7494.85 | bwd_allreduce: 670.18 | step: 138.20
{'loss': 0.7351, 'learning_rate': 8.379898773574924e-08, 'epoch': 0.96}
96%|█████████▌| 1916/2000 [5:51:25<15:39, 11.19s/it]
96%|█████████▌| 1917/2000 [5:51:36<15:12, 11.00s/it]
96%|█████████▌| 1918/2000 [5:51:46<14:50, 10.86s/it]
96%|█████████▌| 1919/2000 [5:51:57<14:43, 10.91s/it]
96%|█████████▌| 1920/2000 [5:52:09<14:45, 11.07s/it]
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11753
total_samples=29169, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:57,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.30 | bwd_microstep: 1770.59 | bwd_inner_microstep: 1538.57 | bwd_allreduce_microstep: 231.90 | step_microstep: 0.38
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13193
total_samples=29173, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:34:59,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.11 | bwd_microstep: 1705.32 | bwd_inner_microstep: 1655.45 | bwd_allreduce_microstep: 49.82 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11722
total_samples=29176, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:01,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.59 | bwd_microstep: 1720.22 | bwd_inner_microstep: 1539.60 | bwd_allreduce_microstep: 180.54 | step_microstep: 0.28
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12407
total_samples=29180, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:04,621] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 07:35:04,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.43 | bwd_microstep: 1738.90 | bwd_inner_microstep: 1610.02 | bwd_allreduce_microstep: 128.81 | step_microstep: 131.68
[2025-08-03 07:35:04,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2764.38 | bwd: 6935.10 | bwd_inner: 6343.64 | bwd_allreduce: 591.19 | step: 132.47
{'loss': 0.7216, 'learning_rate': 8.171994685885698e-08, 'epoch': 0.96}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11968
total_samples=29183, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:07,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.47 | bwd_microstep: 2229.20 | bwd_inner_microstep: 1831.60 | bwd_allreduce_microstep: 397.51 | step_microstep: 0.35
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13339
total_samples=29187, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:10,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.15 | bwd_microstep: 1853.00 | bwd_inner_microstep: 1720.04 | bwd_allreduce_microstep: 132.90 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12080
total_samples=29190, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:12,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.91 | bwd_microstep: 1872.36 | bwd_inner_microstep: 1736.94 | bwd_allreduce_microstep: 135.35 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11709
total_samples=29193, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:15,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.66
[2025-08-03 07:35:15,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.93 | bwd_microstep: 1771.80 | bwd_inner_microstep: 1547.58 | bwd_allreduce_microstep: 224.14 | step_microstep: 115.18
[2025-08-03 07:35:15,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.38 | bwd: 7726.42 | bwd_inner: 6836.16 | bwd_allreduce: 890.00 | step: 115.90
{'loss': 0.7215, 'learning_rate': 7.966691551514527e-08, 'epoch': 0.96}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13901
total_samples=29197, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:18,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.41 | bwd_microstep: 1735.54 | bwd_inner_microstep: 1725.91 | bwd_allreduce_microstep: 9.57 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12945
total_samples=29201, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:20,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.04 | bwd_microstep: 1744.24 | bwd_inner_microstep: 1652.70 | bwd_allreduce_microstep: 91.48 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11649
total_samples=29204, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:23,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.04 | bwd_microstep: 2008.92 | bwd_inner_microstep: 1789.66 | bwd_allreduce_microstep: 219.19 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13825
total_samples=29208, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:26,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.32
[2025-08-03 07:35:26,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.74 | bwd_microstep: 2089.10 | bwd_inner_microstep: 1927.41 | bwd_allreduce_microstep: 161.63 | step_microstep: 128.93
[2025-08-03 07:35:26,350] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2783.14 | bwd: 7577.85 | bwd_inner: 7095.68 | bwd_allreduce: 481.94 | step: 129.27
{'loss': 0.7206, 'learning_rate': 7.763989908844749e-08, 'epoch': 0.96}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13194
total_samples=29212, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:28,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.83 | bwd_microstep: 1780.68 | bwd_inner_microstep: 1661.80 | bwd_allreduce_microstep: 118.80 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13385
total_samples=29216, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:31,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.88 | bwd_microstep: 1964.14 | bwd_inner_microstep: 1838.30 | bwd_allreduce_microstep: 125.78 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13261
total_samples=29220, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:34,186] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.71 | bwd_microstep: 1776.00 | bwd_inner_microstep: 1734.89 | bwd_allreduce_microstep: 41.02 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13068
total_samples=29223, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:36,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 07:35:36,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.82 | bwd_microstep: 1777.81 | bwd_inner_microstep: 1613.90 | bwd_allreduce_microstep: 163.85 | step_microstep: 142.47
[2025-08-03 07:35:36,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.17 | bwd: 7298.68 | bwd_inner: 6848.88 | bwd_allreduce: 449.55 | step: 143.01
{'loss': 0.7197, 'learning_rate': 7.563890289437825e-08, 'epoch': 0.96}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13378
total_samples=29227, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:39,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.50 | bwd_microstep: 1714.10 | bwd_inner_microstep: 1651.58 | bwd_allreduce_microstep: 62.44 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13613
total_samples=29231, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:42,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.96 | bwd_microstep: 1920.71 | bwd_inner_microstep: 1703.19 | bwd_allreduce_microstep: 217.43 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13670
total_samples=29235, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:44,929] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.57 | bwd_microstep: 2055.39 | bwd_inner_microstep: 1896.61 | bwd_allreduce_microstep: 158.71 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13506
total_samples=29240, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:47,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 07:35:47,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.55 | bwd_microstep: 1903.45 | bwd_inner_microstep: 1825.63 | bwd_allreduce_microstep: 77.75 | step_microstep: 114.14
[2025-08-03 07:35:47,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2808.51 | bwd: 7593.70 | bwd_inner: 7077.02 | bwd_allreduce: 516.43 | step: 114.67
{'loss': 0.7291, 'learning_rate': 7.366393218031564e-08, 'epoch': 0.96}
 96%|█████████▌| 1921/2000 [5:52:19<14:12, 10.79s/it]
 96%|█████████▌| 1922/2000 [5:52:30<14:05, 10.84s/it]
 96%|█████████▌| 1923/2000 [5:52:41<13:53, 10.82s/it]
 96%|█████████▌| 1924/2000 [5:52:51<13:36, 10.74s/it]
 96%|█████████▋| 1925/2000 [5:53:02<13:26, 10.76s/it]
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12509
total_samples=29244, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:50,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.33 | bwd_microstep: 1749.27 | bwd_inner_microstep: 1616.63 | bwd_allreduce_microstep: 132.58 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13270
total_samples=29248, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:53,076] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.50 | bwd_microstep: 2069.00 | bwd_inner_microstep: 1933.14 | bwd_allreduce_microstep: 135.81 | step_microstep: 0.09
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13423
total_samples=29252, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:55,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.47 | bwd_microstep: 1999.52 | bwd_inner_microstep: 1945.89 | bwd_allreduce_microstep: 53.56 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11606
total_samples=29255, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:35:58,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.11
[2025-08-03 07:35:58,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.38 | bwd_microstep: 1959.89 | bwd_inner_microstep: 1783.82 | bwd_allreduce_microstep: 176.01 | step_microstep: 112.60
[2025-08-03 07:35:58,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.60 | bwd: 7777.73 | bwd_inner: 7279.47 | bwd_allreduce: 498.03 | step: 112.91
{'loss': 0.7309, 'learning_rate': 7.171499212539124e-08, 'epoch': 0.96}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13631
total_samples=29259, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:01,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.39 | bwd_microstep: 2161.35 | bwd_inner_microstep: 2066.40 | bwd_allreduce_microstep: 94.89 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13156
total_samples=29263, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:04,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.47 | bwd_microstep: 2019.66 | bwd_inner_microstep: 1874.26 | bwd_allreduce_microstep: 145.34 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11929
total_samples=29266, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:07,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.98 | bwd_microstep: 1992.11 | bwd_inner_microstep: 1682.43 | bwd_allreduce_microstep: 309.60 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13531
total_samples=29270, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:09,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 07:36:09,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.50 | bwd_microstep: 1719.68 | bwd_inner_microstep: 1665.45 | bwd_allreduce_microstep: 54.16 | step_microstep: 148.95
[2025-08-03 07:36:09,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2828.26 | bwd: 7892.86 | bwd_inner: 7288.53 | bwd_allreduce: 604.08 | step: 149.58
{'loss': 0.725, 'learning_rate': 6.979208784047454e-08, 'epoch': 0.96}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15723
total_samples=29274, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:12,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.74 | bwd_microstep: 1765.72 | bwd_inner_microstep: 1750.15 | bwd_allreduce_microstep: 15.51 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14041
total_samples=29278, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:15,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.60 | bwd_microstep: 1847.84 | bwd_inner_microstep: 1760.50 | bwd_allreduce_microstep: 87.27 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11736
total_samples=29281, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:17,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.46 | bwd_microstep: 1865.99 | bwd_inner_microstep: 1532.96 | bwd_allreduce_microstep: 332.95 | step_microstep: 0.30
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11951
total_samples=29284, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:20,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.75
[2025-08-03 07:36:20,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.54 | bwd_microstep: 1718.97 | bwd_inner_microstep: 1536.94 | bwd_allreduce_microstep: 181.96 | step_microstep: 132.75
[2025-08-03 07:36:20,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.28 | bwd: 7198.57 | bwd_inner: 6580.55 | bwd_allreduce: 617.78 | step: 133.41
{'loss': 0.7247, 'learning_rate': 6.78952243681541e-08, 'epoch': 0.96}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11815
total_samples=29287, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:22,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.28 | bwd_microstep: 1839.69 | bwd_inner_microstep: 1706.46 | bwd_allreduce_microstep: 133.17 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13763
total_samples=29291, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:25,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.96 | bwd_microstep: 2039.65 | bwd_inner_microstep: 1725.12 | bwd_allreduce_microstep: 314.46 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13252
total_samples=29295, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:28,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.09 | bwd_microstep: 1789.42 | bwd_inner_microstep: 1694.43 | bwd_allreduce_microstep: 94.92 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11852
total_samples=29298, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:31,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.99
[2025-08-03 07:36:31,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.77 | bwd_microstep: 1801.26 | bwd_inner_microstep: 1725.25 | bwd_allreduce_microstep: 75.95 | step_microstep: 148.72
[2025-08-03 07:36:31,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.03 | bwd: 7470.09 | bwd_inner: 6851.26 | bwd_allreduce: 618.58 | step: 149.21
{'loss': 0.742, 'learning_rate': 6.602440668273758e-08, 'epoch': 0.96}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11612
total_samples=29301, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:33,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.05 | bwd_microstep: 1735.01 | bwd_inner_microstep: 1521.07 | bwd_allreduce_microstep: 213.86 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13777
total_samples=29306, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:36,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.94 | bwd_microstep: 1924.96 | bwd_inner_microstep: 1816.05 | bwd_allreduce_microstep: 108.85 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13549
total_samples=29310, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:38,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.92 | bwd_microstep: 1891.89 | bwd_inner_microstep: 1822.54 | bwd_allreduce_microstep: 69.28 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11877
total_samples=29313, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:41,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 07:36:41,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.91 | bwd_microstep: 2038.92 | bwd_inner_microstep: 1585.76 | bwd_allreduce_microstep: 453.05 | step_microstep: 130.32
[2025-08-03 07:36:41,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2843.74 | bwd: 7590.82 | bwd_inner: 6745.42 | bwd_allreduce: 845.14 | step: 130.82
{'loss': 0.7275, 'learning_rate': 6.417963969022389e-08, 'epoch': 0.96}
 96%|█████████▋| 1926/2000 [5:53:13<13:21, 10.82s/it]
 96%|█████████▋| 1927/2000 [5:53:24<13:17, 10.93s/it]
 96%|█████████▋| 1928/2000 [5:53:35<12:56, 10.78s/it]
 96%|█████████▋| 1929/2000 [5:53:45<12:44, 10.77s/it]
 96%|█████████▋| 1930/2000 [5:53:56<12:35, 10.80s/it]
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14124
total_samples=29317, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:44,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.99 | bwd_microstep: 1804.00 | bwd_inner_microstep: 1739.12 | bwd_allreduce_microstep: 64.82 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13382
total_samples=29321, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:47,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.88 | bwd_microstep: 1983.80 | bwd_inner_microstep: 1902.13 | bwd_allreduce_microstep: 81.60 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 14084
total_samples=29325, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:49,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.40 | bwd_microstep: 1879.54 | bwd_inner_microstep: 1696.92 | bwd_allreduce_microstep: 182.56 | step_microstep: 0.11
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13102
total_samples=29329, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:53,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.94
[2025-08-03 07:36:53,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.73 | bwd_microstep: 2377.10 | bwd_inner_microstep: 2166.25 | bwd_allreduce_microstep: 210.78 | step_microstep: 114.76
[2025-08-03 07:36:53,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2897.94 | bwd: 8044.49 | bwd_inner: 7504.42 | bwd_allreduce: 539.84 | step: 115.09
{'loss': 0.7527, 'learning_rate': 6.236092822829887e-08, 'epoch': 0.97}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12609
total_samples=29333, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:56,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.24 | bwd_microstep: 2099.50 | bwd_inner_microstep: 1858.92 | bwd_allreduce_microstep: 240.52 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13520
total_samples=29337, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:36:58,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.15 | bwd_microstep: 1850.94 | bwd_inner_microstep: 1710.36 | bwd_allreduce_microstep: 140.50 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12467
total_samples=29340, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:01,303] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.15 | bwd_microstep: 1766.32 | bwd_inner_microstep: 1580.73 | bwd_allreduce_microstep: 185.52 | step_microstep: 0.22
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11897
total_samples=29343, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:04,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 07:37:04,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.11 | bwd_microstep: 1939.59 | bwd_inner_microstep: 1813.42 | bwd_allreduce_microstep: 126.07 | step_microstep: 130.57
[2025-08-03 07:37:04,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.59 | bwd: 7656.37 | bwd_inner: 6963.42 | bwd_allreduce: 692.69 | step: 131.14
{'loss': 0.7233, 'learning_rate': 6.056827706632185e-08, 'epoch': 0.97}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13137
total_samples=29347, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:06,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.59 | bwd_microstep: 1725.49 | bwd_inner_microstep: 1647.95 | bwd_allreduce_microstep: 77.47 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14034
total_samples=29351, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:09,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 744.78 | bwd_microstep: 2077.60 | bwd_inner_microstep: 1935.16 | bwd_allreduce_microstep: 142.38 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13875
total_samples=29355, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:12,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.05 | bwd_microstep: 1839.67 | bwd_inner_microstep: 1797.90 | bwd_allreduce_microstep: 41.71 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14046
total_samples=29359, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:15,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.22
[2025-08-03 07:37:15,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.58 | bwd_microstep: 2190.41 | bwd_inner_microstep: 2041.76 | bwd_allreduce_microstep: 148.58 | step_microstep: 109.53
[2025-08-03 07:37:15,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2827.93 | bwd: 7833.22 | bwd_inner: 7422.77 | bwd_allreduce: 410.21 | step: 109.87
{'loss': 0.7383, 'learning_rate': 5.880169090531351e-08, 'epoch': 0.97}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13337
total_samples=29363, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:18,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.82 | bwd_microstep: 2098.36 | bwd_inner_microstep: 1903.49 | bwd_allreduce_microstep: 194.81 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14417
total_samples=29368, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:21,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.92 | bwd_microstep: 2412.09 | bwd_inner_microstep: 2124.86 | bwd_allreduce_microstep: 287.17 | step_microstep: 0.13
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12893
total_samples=29372, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:23,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.37 | bwd_microstep: 1841.05 | bwd_inner_microstep: 1678.75 | bwd_allreduce_microstep: 162.24 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11636
total_samples=29375, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:26,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.10
[2025-08-03 07:37:26,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.08 | bwd_microstep: 1787.36 | bwd_inner_microstep: 1554.17 | bwd_allreduce_microstep: 233.13 | step_microstep: 128.47
[2025-08-03 07:37:26,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2835.13 | bwd: 8138.92 | bwd_inner: 7261.25 | bwd_allreduce: 877.43 | step: 128.94
{'loss': 0.7273, 'learning_rate': 5.7061174377937015e-08, 'epoch': 0.97}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14209
total_samples=29379, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:29,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.02 | bwd_microstep: 1770.89 | bwd_inner_microstep: 1726.26 | bwd_allreduce_microstep: 44.57 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13771
total_samples=29383, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:31,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.18 | bwd_microstep: 1757.72 | bwd_inner_microstep: 1688.41 | bwd_allreduce_microstep: 69.24 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13335
total_samples=29387, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:34,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.18 | bwd_microstep: 1717.34 | bwd_inner_microstep: 1652.52 | bwd_allreduce_microstep: 64.76 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11831
total_samples=29390, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:37,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 07:37:37,004] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.49 | bwd_microstep: 1974.85 | bwd_inner_microstep: 1797.43 | bwd_allreduce_microstep: 177.37 | step_microstep: 132.18
[2025-08-03 07:37:37,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2765.80 | bwd: 7220.86 | bwd_inner: 6864.60 | bwd_allreduce: 356.03 | step: 132.66
{'loss': 0.7281, 'learning_rate': 5.534673204849572e-08, 'epoch': 0.97}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13530
total_samples=29394, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:39,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.03 | bwd_microstep: 2052.45 | bwd_inner_microstep: 1900.61 | bwd_allreduce_microstep: 151.77 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13153
total_samples=29398, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:42,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.43 | bwd_microstep: 1716.98 | bwd_inner_microstep: 1664.92 | bwd_allreduce_microstep: 51.99 | step_microstep: 0.14
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12423
total_samples=29402, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:45,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.43 | bwd_microstep: 1941.64 | bwd_inner_microstep: 1932.85 | bwd_allreduce_microstep: 8.72 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13434
total_samples=29406, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:47,663] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.96
[2025-08-03 07:37:47,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.57 | bwd_microstep: 1781.47 | bwd_inner_microstep: 1702.75 | bwd_allreduce_microstep: 78.65 | step_microstep: 115.52
[2025-08-03 07:37:47,664] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2771.39 | bwd: 7492.58 | bwd_inner: 7201.13 | bwd_allreduce: 291.21 | step: 116.11
 97%|█████████▋| 1931/2000 [5:54:08<12:36, 10.96s/it]
 97%|█████████▋| 1932/2000 [5:54:18<12:24, 10.94s/it]
 97%|█████████▋| 1933/2000 [5:54:30<12:15, 10.98s/it]
 97%|█████████▋| 1934/2000 [5:54:41<12:12, 11.10s/it]
 97%|█████████▋| 1935/2000 [5:54:51<11:48, 10.90s/it]
{'loss': 0.7225, 'learning_rate': 5.365836841291439e-08, 'epoch': 0.97}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14636
total_samples=29411, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:50,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.39 | bwd_microstep: 1895.99 | bwd_inner_microstep: 1870.77 | bwd_allreduce_microstep: 25.15 | step_microstep: 0.13
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 12892
total_samples=29415, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:52,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.98 | bwd_microstep: 1841.35 | bwd_inner_microstep: 1671.32 | bwd_allreduce_microstep: 169.97 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14160
total_samples=29419, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:55,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.11 | bwd_microstep: 1775.96 | bwd_inner_microstep: 1724.22 | bwd_allreduce_microstep: 51.69 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11834
total_samples=29422, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:37:58,221] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.83
[2025-08-03 07:37:58,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.79 | bwd_microstep: 1800.49 | bwd_inner_microstep: 1574.59 | bwd_allreduce_microstep: 225.83 | step_microstep: 131.75
[2025-08-03 07:37:58,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2819.20 | bwd: 7313.84 | bwd_inner: 6840.90 | bwd_allreduce: 472.71 | step: 132.10
{'loss': 0.7347, 'learning_rate': 5.199608789873134e-08, 'epoch': 0.97}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13261
total_samples=29426, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:00,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.13 | bwd_microstep: 1805.04 | bwd_inner_microstep: 1699.72 | bwd_allreduce_microstep: 105.25 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13656
total_samples=29430, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:03,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.66 | bwd_microstep: 2082.67 | bwd_inner_microstep: 1936.37 | bwd_allreduce_microstep: 146.24 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13644
total_samples=29434, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:06,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.88 | bwd_microstep: 2085.93 | bwd_inner_microstep: 1919.78 | bwd_allreduce_microstep: 166.08 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11866
total_samples=29437, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:09,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.04
[2025-08-03 07:38:09,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.34 | bwd_microstep: 2009.39 | bwd_inner_microstep: 1806.74 | bwd_allreduce_microstep: 202.58 | step_microstep: 107.58
[2025-08-03 07:38:09,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2814.93 | bwd: 7983.07 | bwd_inner: 7362.62 | bwd_allreduce: 620.22 | step: 107.92
{'loss': 0.7375, 'learning_rate': 5.035989486508075e-08, 'epoch': 0.97}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12283
total_samples=29441, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:12,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.66 | bwd_microstep: 2101.65 | bwd_inner_microstep: 1815.76 | bwd_allreduce_microstep: 285.83 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12935
total_samples=29445, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:14,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.30 | bwd_microstep: 1763.52 | bwd_inner_microstep: 1663.27 | bwd_allreduce_microstep: 100.19 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13334
total_samples=29449, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:17,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.93 | bwd_microstep: 1866.38 | bwd_inner_microstep: 1720.59 | bwd_allreduce_microstep: 145.72 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13777
total_samples=29453, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:20,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.50
[2025-08-03 07:38:20,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.36 | bwd_microstep: 1882.53 | bwd_inner_microstep: 1698.92 | bwd_allreduce_microstep: 183.53 | step_microstep: 139.49
[2025-08-03 07:38:20,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.19 | bwd: 7614.13 | bwd_inner: 6898.53 | bwd_allreduce: 715.35 | step: 139.82
{'loss': 0.7372, 'learning_rate': 4.874979360268928e-08, 'epoch': 0.97}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11917
total_samples=29456, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:23,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.23 | bwd_microstep: 2057.73 | bwd_inner_microstep: 1557.14 | bwd_allreduce_microstep: 500.53 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13452
total_samples=29460, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:25,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.98 | bwd_microstep: 1995.04 | bwd_inner_microstep: 1866.60 | bwd_allreduce_microstep: 128.38 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13345
total_samples=29464, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:28,480] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 664.44 | bwd_microstep: 1793.91 | bwd_inner_microstep: 1701.39 | bwd_allreduce_microstep: 92.46 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14041
total_samples=29468, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:31,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 07:38:31,424] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.57 | bwd_microstep: 1758.00 | bwd_inner_microstep: 1706.46 | bwd_allreduce_microstep: 51.46 | step_microstep: 425.80
[2025-08-03 07:38:31,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2772.16 | bwd: 7604.74 | bwd_inner: 6831.60 | bwd_allreduce: 772.89 | step: 426.29
{'loss': 0.727, 'learning_rate': 4.716578833386054e-08, 'epoch': 0.97}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12260
total_samples=29471, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:34,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.47 | bwd_microstep: 1809.84 | bwd_inner_microstep: 1566.70 | bwd_allreduce_microstep: 243.06 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13114
total_samples=29475, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:36,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.38 | bwd_microstep: 1772.42 | bwd_inner_microstep: 1689.17 | bwd_allreduce_microstep: 83.20 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13808
total_samples=29479, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:39,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.27 | bwd_microstep: 1838.87 | bwd_inner_microstep: 1712.34 | bwd_allreduce_microstep: 126.47 | step_microstep: 0.22
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12647
total_samples=29483, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:42,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.84
[2025-08-03 07:38:42,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.95 | bwd_microstep: 2045.00 | bwd_inner_microstep: 1896.01 | bwd_allreduce_microstep: 148.90 | step_microstep: 121.15
[2025-08-03 07:38:42,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.01 | bwd: 7466.20 | bwd_inner: 6864.22 | bwd_allreduce: 601.73 | step: 121.66
97%|█████████▋| 1936/2000 [5:55:02<11:32, 10.82s/it]
97%|█████████▋| 1937/2000 [5:55:13<11:16, 10.74s/it]
97%|█████████▋| 1938/2000 [5:55:24<11:14, 10.88s/it]
97%|█████████▋| 1939/2000 [5:55:35<11:03, 10.88s/it]
97%|█████████▋| 1940/2000 [5:55:46<10:57, 10.95s/it]
{'loss': 0.7346, 'learning_rate': 4.56078832124629e-08, 'epoch': 0.97}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11636
total_samples=29486, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:44,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.56 | bwd_microstep: 1843.21 | bwd_inner_microstep: 1597.91 | bwd_allreduce_microstep: 245.23 | step_microstep: 0.23
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13434
total_samples=29490, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:47,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.89 | bwd_microstep: 2267.16 | bwd_inner_microstep: 1675.62 | bwd_allreduce_microstep: 591.45 | step_microstep: 0.30
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13741
total_samples=29495, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:50,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.54 | bwd_microstep: 1773.71 | bwd_inner_microstep: 1691.00 | bwd_allreduce_microstep: 82.65 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13319
total_samples=29499, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:53,555] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 17.21
[2025-08-03 07:38:53,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.61 | bwd_microstep: 2230.15 | bwd_inner_microstep: 2069.19 | bwd_allreduce_microstep: 160.89 | step_microstep: 109.44
[2025-08-03 07:38:53,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.53 | bwd: 8114.27 | bwd_inner: 7033.72 | bwd_allreduce: 1080.30 | step: 110.08
{'loss': 0.7348, 'learning_rate': 4.4076082323920576e-08, 'epoch': 0.97}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13858
total_samples=29503, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:56,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.89 | bwd_microstep: 1848.43 | bwd_inner_microstep: 1692.20 | bwd_allreduce_microstep: 156.16 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 15327
total_samples=29507, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:38:58,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.93 | bwd_microstep: 1797.30 | bwd_inner_microstep: 1786.19 | bwd_allreduce_microstep: 11.05 | step_microstep: 0.42
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12837
total_samples=29511, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:01,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.35 | bwd_microstep: 1719.99 | bwd_inner_microstep: 1622.27 | bwd_allreduce_microstep: 97.65 | step_microstep: 0.17
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15193
total_samples=29515, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:04,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.49
[2025-08-03 07:39:04,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.69 | bwd_microstep: 1856.65 | bwd_inner_microstep: 1786.69 | bwd_allreduce_microstep: 69.91 | step_microstep: 164.10
[2025-08-03 07:39:04,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2781.78 | bwd: 7222.43 | bwd_inner: 6887.34 | bwd_allreduce: 334.85 | step: 164.81
{'loss': 0.7263, 'learning_rate': 4.257038968520366e-08, 'epoch': 0.97}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13148
total_samples=29519, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:06,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.01 | bwd_microstep: 1720.25 | bwd_inner_microstep: 1659.24 | bwd_allreduce_microstep: 60.94 | step_microstep: 0.25
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12800
total_samples=29523, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:09,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.27 | bwd_microstep: 1825.90 | bwd_inner_microstep: 1667.30 | bwd_allreduce_microstep: 158.54 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11956
total_samples=29526, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:11,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.78 | bwd_microstep: 1814.66 | bwd_inner_microstep: 1599.13 | bwd_allreduce_microstep: 215.47 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14964
total_samples=29532, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:14,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.33
[2025-08-03 07:39:14,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.97 | bwd_microstep: 1806.27 | bwd_inner_microstep: 1759.36 | bwd_allreduce_microstep: 46.84 | step_microstep: 137.55
[2025-08-03 07:39:14,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.96 | bwd: 7167.13 | bwd_inner: 6685.02 | bwd_allreduce: 481.87 | step: 138.16
{'loss': 0.7389, 'learning_rate': 4.109080924481479e-08, 'epoch': 0.97}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12425
total_samples=29535, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:17,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.42 | bwd_microstep: 2154.68 | bwd_inner_microstep: 1630.41 | bwd_allreduce_microstep: 524.21 | step_microstep: 0.10
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12281
total_samples=29539, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:20,231] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.44 | bwd_microstep: 2020.08 | bwd_inner_microstep: 1797.17 | bwd_allreduce_microstep: 222.84 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13244
total_samples=29543, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:22,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 673.80 | bwd_microstep: 1939.05 | bwd_inner_microstep: 1672.73 | bwd_allreduce_microstep: 266.25 | step_microstep: 0.29
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13320
total_samples=29548, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:25,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.08
[2025-08-03 07:39:25,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.24 | bwd_microstep: 1966.93 | bwd_inner_microstep: 1774.99 | bwd_allreduce_microstep: 191.88 | step_microstep: 146.27
[2025-08-03 07:39:25,802] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2773.84 | bwd: 8080.78 | bwd_inner: 6875.29 | bwd_allreduce: 1205.24 | step: 146.93
{'loss': 0.74, 'learning_rate': 3.963734488278248e-08, 'epoch': 0.97}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13319
total_samples=29552, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:28,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.11 | bwd_microstep: 2017.93 | bwd_inner_microstep: 1888.49 | bwd_allreduce_microstep: 129.38 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12073
total_samples=29555, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:31,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.21 | bwd_microstep: 1722.21 | bwd_inner_microstep: 1554.37 | bwd_allreduce_microstep: 167.77 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14763
total_samples=29559, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:33,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.67 | bwd_microstep: 2064.74 | bwd_inner_microstep: 1939.20 | bwd_allreduce_microstep: 125.46 | step_microstep: 0.19
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13781
total_samples=29563, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:36,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.00
[2025-08-03 07:39:36,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.81 | bwd_microstep: 1859.90 | bwd_inner_microstep: 1825.59 | bwd_allreduce_microstep: 34.26 | step_microstep: 110.22
[2025-08-03 07:39:36,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.73 | bwd: 7664.85 | bwd_inner: 7207.64 | bwd_allreduce: 456.95 | step: 110.78
97%|█████████▋| 1941/2000 [5:55:57<10:42, 10.89s/it]
97%|█████████▋| 1942/2000 [5:56:08<10:40, 11.04s/it]
97%|█████████▋| 1943/2000 [5:56:18<10:19, 10.87s/it]
97%|█████████▋| 1944/2000 [5:56:29<10:01, 10.75s/it]
97%|█████████▋| 1945/2000 [5:56:40<10:00, 10.92s/it]
{'loss': 0.7321, 'learning_rate': 3.82100004106456e-08, 'epoch': 0.97}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11959
total_samples=29566, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:39,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.79 | bwd_microstep: 2188.76 | bwd_inner_microstep: 1808.82 | bwd_allreduce_microstep: 379.87 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13060
total_samples=29569, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:42,278] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.02 | bwd_microstep: 1871.16 | bwd_inner_microstep: 1622.25 | bwd_allreduce_microstep: 248.84 | step_microstep: 0.16
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15897
total_samples=29573, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:44,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.58 | bwd_microstep: 1806.41 | bwd_inner_microstep: 1800.37 | bwd_allreduce_microstep: 5.98 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13763
total_samples=29577, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:47,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.07
[2025-08-03 07:39:47,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.91 | bwd_microstep: 2056.25 | bwd_inner_microstep: 1880.18 | bwd_allreduce_microstep: 176.01 | step_microstep: 129.13
[2025-08-03 07:39:47,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.23 | bwd: 7922.64 | bwd_inner: 7111.62 | bwd_allreduce: 810.79 | step: 129.62
{'loss': 0.728, 'learning_rate': 3.680877957145112e-08, 'epoch': 0.97}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13570
total_samples=29581, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:50,445] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.05 | bwd_microstep: 1799.56 | bwd_inner_microstep: 1675.49 | bwd_allreduce_microstep: 124.00 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11686
total_samples=29584, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:53,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.32 | bwd_microstep: 1760.13 | bwd_inner_microstep: 1537.77 | bwd_allreduce_microstep: 222.30 | step_microstep: 0.10
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14106
total_samples=29589, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:55,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.69 | bwd_microstep: 1756.07 | bwd_inner_microstep: 1713.95 | bwd_allreduce_microstep: 42.05 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13035
total_samples=29593, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:39:58,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 07:39:58,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.75 | bwd_microstep: 1816.02 | bwd_inner_microstep: 1645.98 | bwd_allreduce_microstep: 169.96 | step_microstep: 135.19
[2025-08-03 07:39:58,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2812.73 | bwd: 7131.83 | bwd_inner: 6573.19 | bwd_allreduce: 558.38 | step: 135.67
{'loss': 0.7322, 'learning_rate': 3.543368603973529e-08, 'epoch': 0.97}
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14093
total_samples=29598, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:00,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.06 | bwd_microstep: 1949.57 | bwd_inner_microstep: 1710.40 | bwd_allreduce_microstep: 239.11 | step_microstep: 0.21
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13362
total_samples=29602, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:03,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.41 | bwd_microstep: 1750.74 | bwd_inner_microstep: 1680.81 | bwd_allreduce_microstep: 69.86 | step_microstep: 0.25
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13407
total_samples=29606, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:06,087] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.29 | bwd_microstep: 1813.58 | bwd_inner_microstep: 1744.97 | bwd_allreduce_microstep: 68.54 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13830
total_samples=29610, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:09,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.01
[2025-08-03 07:40:09,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.73 | bwd_microstep: 2014.07 | bwd_inner_microstep: 1936.42 | bwd_allreduce_microstep: 77.58 | step_microstep: 460.75
[2025-08-03 07:40:09,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2791.41 | bwd: 7528.00 | bwd_inner: 7072.60 | bwd_allreduce: 455.16 | step: 461.32
{'loss': 0.7323, 'learning_rate': 3.408472342152136e-08, 'epoch': 0.97}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13428
total_samples=29614, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:11,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.59 | bwd_microstep: 1747.23 | bwd_inner_microstep: 1741.15 | bwd_allreduce_microstep: 6.02 | step_microstep: 0.09
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11633
total_samples=29617, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:14,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.34 | bwd_microstep: 1840.33 | bwd_inner_microstep: 1606.40 | bwd_allreduce_microstep: 233.84 | step_microstep: 0.16
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13695
total_samples=29622, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:17,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.82 | bwd_microstep: 2015.10 | bwd_inner_microstep: 1762.48 | bwd_allreduce_microstep: 252.56 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15132
total_samples=29627, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:19,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 07:40:19,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.19 | bwd_microstep: 1733.38 | bwd_inner_microstep: 1726.84 | bwd_allreduce_microstep: 6.48 | step_microstep: 130.64
[2025-08-03 07:40:19,886] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2792.86 | bwd: 7336.11 | bwd_inner: 6836.86 | bwd_allreduce: 498.98 | step: 131.14
{'loss': 0.7342, 'learning_rate': 3.2761895254306285e-08, 'epoch': 0.97}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11904
total_samples=29630, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:22,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.80 | bwd_microstep: 1834.15 | bwd_inner_microstep: 1644.33 | bwd_allreduce_microstep: 189.75 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13413
total_samples=29634, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:25,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.41 | bwd_microstep: 1918.65 | bwd_inner_microstep: 1732.76 | bwd_allreduce_microstep: 185.82 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15438
total_samples=29638, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:27,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.68 | bwd_microstep: 1967.45 | bwd_inner_microstep: 1885.85 | bwd_allreduce_microstep: 81.52 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13752
total_samples=29642, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:30,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.91
[2025-08-03 07:40:30,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.93 | bwd_microstep: 1766.40 | bwd_inner_microstep: 1726.19 | bwd_allreduce_microstep: 40.14 | step_microstep: 111.37
[2025-08-03 07:40:30,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.75 | bwd: 7486.70 | bwd_inner: 6989.13 | bwd_allreduce: 497.32 | step: 111.86
97%|█████████▋| 1946/2000 [5:56:51<09:48, 10.90s/it]
97%|█████████▋| 1947/2000 [5:57:02<09:42, 10.99s/it]
97%|█████████▋| 1948/2000 [5:57:13<09:21, 10.81s/it]
97%|█████████▋| 1949/2000 [5:57:24<09:15, 10.89s/it]
98%|█████████▊| 1950/2000 [5:57:34<08:59, 10.79s/it]
{'loss': 0.7327, 'learning_rate': 3.1465205007052965e-08, 'epoch': 0.98}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13576
total_samples=29646, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:33,913] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.52 | bwd_microstep: 2549.42 | bwd_inner_microstep: 1957.91 | bwd_allreduce_microstep: 591.44 | step_microstep: 0.23
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14206
total_samples=29651, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:36,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.46 | bwd_microstep: 1710.89 | bwd_inner_microstep: 1682.98 | bwd_allreduce_microstep: 27.85 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15710
total_samples=29655, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:38,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.41 | bwd_microstep: 1788.03 | bwd_inner_microstep: 1781.87 | bwd_allreduce_microstep: 6.10 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11616
total_samples=29658, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:42,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 07:40:42,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.50 | bwd_microstep: 2124.83 | bwd_inner_microstep: 1816.91 | bwd_allreduce_microstep: 307.83 | step_microstep: 141.63
[2025-08-03 07:40:42,005] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2790.81 | bwd: 8173.23 | bwd_inner: 7239.67 | bwd_allreduce: 933.30 | step: 142.11
{'loss': 0.7115, 'learning_rate': 3.019465608018024e-08, 'epoch': 0.98}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12232
total_samples=29661, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:44,525] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.17 | bwd_microstep: 1749.74 | bwd_inner_microstep: 1562.41 | bwd_allreduce_microstep: 187.26 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13369
total_samples=29665, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:47,184] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.10 | bwd_microstep: 1860.10 | bwd_inner_microstep: 1689.00 | bwd_allreduce_microstep: 171.03 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13476
total_samples=29669, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:49,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.65 | bwd_microstep: 1955.12 | bwd_inner_microstep: 1885.76 | bwd_allreduce_microstep: 69.28 | step_microstep: 0.43
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13917
total_samples=29673, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:52,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.81
[2025-08-03 07:40:52,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.79 | bwd_microstep: 2001.98 | bwd_inner_microstep: 1890.69 | bwd_allreduce_microstep: 111.23 | step_microstep: 141.07
[2025-08-03 07:40:52,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2820.63 | bwd: 7567.00 | bwd_inner: 7027.84 | bwd_allreduce: 538.90 | step: 141.91
{'loss': 0.7255, 'learning_rate': 2.8950251805553997e-08, 'epoch': 0.98}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11781
total_samples=29676, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:55,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.50 | bwd_microstep: 2097.93 | bwd_inner_microstep: 1851.34 | bwd_allreduce_microstep: 246.48 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13256
total_samples=29679, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:40:58,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.43 | bwd_microstep: 1797.02 | bwd_inner_microstep: 1635.75 | bwd_allreduce_microstep: 161.19 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468
total_samples=29683, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:00,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.34 | bwd_microstep: 1723.78 | bwd_inner_microstep: 1672.38 | bwd_allreduce_microstep: 51.34 | step_microstep: 0.27
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12888
total_samples=29687, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:03,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 42.52
[2025-08-03 07:41:03,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.49 | bwd_microstep: 1835.57 | bwd_inner_microstep: 1633.55 | bwd_allreduce_microstep: 201.95 | step_microstep: 149.42
[2025-08-03 07:41:03,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.69 | bwd: 7454.37 | bwd_inner: 6793.02 | bwd_allreduce: 661.07 | step: 150.17
{'loss': 0.7132, 'learning_rate': 2.773199544648164e-08, 'epoch': 0.98}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11965
total_samples=29690, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:06,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.45 | bwd_microstep: 1854.69 | bwd_inner_microstep: 1549.26 | bwd_allreduce_microstep: 305.36 | step_microstep: 0.25
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13092
total_samples=29694, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:08,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.22 | bwd_microstep: 1744.64 | bwd_inner_microstep: 1641.39 | bwd_allreduce_microstep: 103.17 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13974
total_samples=29698, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:11,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 712.62 | bwd_microstep: 1793.41 | bwd_inner_microstep: 1726.99 | bwd_allreduce_microstep: 66.35 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12566
total_samples=29702, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:14,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 07:41:14,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.81 | bwd_microstep: 2351.57 | bwd_inner_microstep: 2154.48 | bwd_allreduce_microstep: 197.01 | step_microstep: 129.65
[2025-08-03 07:41:14,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.03 | bwd: 7744.37 | bwd_inner: 7072.12 | bwd_allreduce: 671.97 | step: 130.32
{'loss': 0.7292, 'learning_rate': 2.6539890197695428e-08, 'epoch': 0.98}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14132
total_samples=29706, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:17,158] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.06 | bwd_microstep: 1851.75 | bwd_inner_microstep: 1742.42 | bwd_allreduce_microstep: 109.27 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13332
total_samples=29710, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:19,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.78 | bwd_microstep: 2021.38 | bwd_inner_microstep: 1920.34 | bwd_allreduce_microstep: 100.98 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12751
total_samples=29714, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:22,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.70 | bwd_microstep: 1849.70 | bwd_inner_microstep: 1679.92 | bwd_allreduce_microstep: 169.71 | step_microstep: 0.14
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14474
total_samples=29719, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:25,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.85
[2025-08-03 07:41:25,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.91 | bwd_microstep: 2441.12 | bwd_inner_microstep: 2432.80 | bwd_allreduce_microstep: 8.26 | step_microstep: 110.67
[2025-08-03 07:41:25,927] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.38 | bwd: 8164.00 | bwd_inner: 7775.48 | bwd_allreduce: 388.29 | step: 111.17
98%|█████████▊| 1951/2000 [5:57:45<08:47, 10.76s/it]
98%|█████████▊| 1952/2000 [5:57:56<08:46, 10.96s/it]
98%|█████████▊| 1953/2000 [5:58:07<08:33, 10.92s/it]
98%|█████████▊| 1954/2000 [5:58:18<08:19, 10.85s/it]
98%|█████████▊| 1955/2000 [5:58:29<08:10, 10.89s/it]
{'loss': 0.7312, 'learning_rate': 2.537393918535358e-08, 'epoch': 0.98}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13378
total_samples=29723, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:28,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 732.46 | bwd_microstep: 2169.32 | bwd_inner_microstep: 2067.39 | bwd_allreduce_microstep: 101.84 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13786
total_samples=29727, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:31,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.50 | bwd_microstep: 2012.37 | bwd_inner_microstep: 1751.91 | bwd_allreduce_microstep: 260.37 | step_microstep: 0.31
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13286
total_samples=29731, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:34,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.86 | bwd_microstep: 1777.59 | bwd_inner_microstep: 1698.51 | bwd_allreduce_microstep: 79.01 | step_microstep: 0.85
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13048
total_samples=29735, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:36,992] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 07:41:36,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.13 | bwd_microstep: 1803.24 | bwd_inner_microstep: 1665.07 | bwd_allreduce_microstep: 138.10 | step_microstep: 124.13
[2025-08-03 07:41:36,995] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2898.87 | bwd: 7762.58 | bwd_inner: 7182.88 | bwd_allreduce: 579.42 | step: 125.61
{'loss': 0.7374, 'learning_rate': 2.423414546702807e-08, 'epoch': 0.98}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14313
total_samples=29739, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:39,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 676.01 | bwd_microstep: 1848.91 | bwd_inner_microstep: 1804.19 | bwd_allreduce_microstep: 44.64 | step_microstep: 0.27
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13274
total_samples=29743, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:42,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.03 | bwd_microstep: 1851.41 | bwd_inner_microstep: 1728.20 | bwd_allreduce_microstep: 123.14 | step_microstep: 0.18
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14470
total_samples=29749, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:44,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.73 | bwd_microstep: 1709.44 | bwd_inner_microstep: 1677.17 | bwd_allreduce_microstep: 32.21 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13546
total_samples=29754, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:47,944] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.39
[2025-08-03 07:41:47,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.72 | bwd_microstep: 2319.14 | bwd_inner_microstep: 2001.25 | bwd_allreduce_microstep: 317.82 | step_microstep: 129.60
[2025-08-03 07:41:47,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2759.41 | bwd: 7728.96 | bwd_inner: 7210.81 | bwd_allreduce: 517.90 | step: 130.16
{'loss': 0.7376, 'learning_rate': 2.312051203169352e-08, 'epoch': 0.98}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12770
total_samples=29758, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:50,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.77 | bwd_microstep: 1749.91 | bwd_inner_microstep: 1652.34 | bwd_allreduce_microstep: 97.49 | step_microstep: 0.34
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12243
total_samples=29761, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:53,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.82 | bwd_microstep: 1841.19 | bwd_inner_microstep: 1601.43 | bwd_allreduce_microstep: 239.70 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13579
total_samples=29765, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:56,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 889.19 | bwd_microstep: 1941.26 | bwd_inner_microstep: 1732.53 | bwd_allreduce_microstep: 208.67 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 15120
total_samples=29769, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:41:58,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 07:41:58,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.45 | bwd_microstep: 1773.86 | bwd_inner_microstep: 1729.51 | bwd_allreduce_microstep: 44.28 | step_microstep: 132.60
[2025-08-03 07:41:58,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2983.17 | bwd: 7306.28 | bwd_inner: 6715.82 | bwd_allreduce: 590.22 | step: 133.18
{'loss': 0.7365, 'learning_rate': 2.2033041799723877e-08, 'epoch': 0.98}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13617
total_samples=29773, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:01,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.83 | bwd_microstep: 2059.55 | bwd_inner_microstep: 1926.30 | bwd_allreduce_microstep: 133.18 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13769
total_samples=29777, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:04,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.84 | bwd_microstep: 2001.99 | bwd_inner_microstep: 1715.45 | bwd_allreduce_microstep: 286.48 | step_microstep: 0.12
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13694
total_samples=29781, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:06,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.94 | bwd_microstep: 1904.37 | bwd_inner_microstep: 1643.33 | bwd_allreduce_microstep: 260.98 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11865
total_samples=29784, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:09,687] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.42
[2025-08-03 07:42:09,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.29 | bwd_microstep: 1807.05 | bwd_inner_microstep: 1560.66 | bwd_allreduce_microstep: 246.32 | step_microstep: 136.98
[2025-08-03 07:42:09,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2803.83 | bwd: 7773.01 | bwd_inner: 6845.72 | bwd_allreduce: 927.05 | step: 137.48
{'loss': 0.7368, 'learning_rate': 2.0971737622883515e-08, 'epoch': 0.98}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11598
total_samples=29787, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:12,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.60 | bwd_microstep: 1901.86 | bwd_inner_microstep: 1757.86 | bwd_allreduce_microstep: 143.93 | step_microstep: 0.28
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12831
total_samples=29791, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:14,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.28 | bwd_microstep: 1812.39 | bwd_inner_microstep: 1631.31 | bwd_allreduce_microstep: 181.03 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13679
total_samples=29795, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:17,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.79 | bwd_microstep: 1775.96 | bwd_inner_microstep: 1695.37 | bwd_allreduce_microstep: 80.52 | step_microstep: 0.13
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12854
total_samples=29799, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:20,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.09
[2025-08-03 07:42:20,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.07 | bwd_microstep: 2038.33 | bwd_inner_microstep: 1867.65 | bwd_allreduce_microstep: 170.60 | step_microstep: 111.32
[2025-08-03 07:42:20,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2813.68 | bwd: 7528.59 | bwd_inner: 6952.20 | bwd_allreduce: 576.15 | step: 111.85
98%|█████████▊| 1956/2000 [5:58:40<08:06, 11.05s/it]
98%|█████████▊| 1957/2000 [5:58:51<07:55, 11.06s/it]
98%|█████████▊| 1958/2000 [5:59:02<07:42, 11.02s/it]
98%|█████████▊| 1959/2000 [5:59:13<07:28, 10.94s/it]
98%|█████████▊| 1960/2000 [5:59:24<07:18, 10.96s/it]
{'loss': 0.7411, 'learning_rate': 1.9936602284318375e-08, 'epoch': 0.98}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13570
total_samples=29803, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:22,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 687.43 | bwd_microstep: 1743.92 | bwd_inner_microstep: 1692.10 | bwd_allreduce_microstep: 51.75 | step_microstep: 0.12
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14612
total_samples=29807, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:25,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.26 | bwd_microstep: 1806.84 | bwd_inner_microstep: 1759.26 | bwd_allreduce_microstep: 47.51 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14248
total_samples=29811, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:28,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.21 | bwd_microstep: 1861.82 | bwd_inner_microstep: 1830.83 | bwd_allreduce_microstep: 30.92 | step_microstep: 0.20
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14858
total_samples=29815, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:31,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 07:42:31,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.86 | bwd_microstep: 1792.11 | bwd_inner_microstep: 1716.29 | bwd_allreduce_microstep: 75.75 | step_microstep: 440.71
[2025-08-03 07:42:31,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2816.68 | bwd: 7204.73 | bwd_inner: 6998.48 | bwd_allreduce: 206.01 | step: 441.16
{'loss': 0.7326, 'learning_rate': 1.8927638498551502e-08, 'epoch': 0.98}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13155
total_samples=29819, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:33,817] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 738.59 | bwd_microstep: 1756.69 | bwd_inner_microstep: 1667.01 | bwd_allreduce_microstep: 89.61 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11455
total_samples=29822, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:36,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.25 | bwd_microstep: 1832.81 | bwd_inner_microstep: 1523.42 | bwd_allreduce_microstep: 309.32 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11774
total_samples=29825, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:39,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.74 | bwd_microstep: 1997.93 | bwd_inner_microstep: 1606.64 | bwd_allreduce_microstep: 391.22 | step_microstep: 0.30
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13585
total_samples=29829, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:42,113] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.31
[2025-08-03 07:42:42,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.30 | bwd_microstep: 2023.06 | bwd_inner_microstep: 2016.91 | bwd_allreduce_microstep: 6.09 | step_microstep: 122.96
[2025-08-03 07:42:42,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2851.82 | bwd: 7610.55 | bwd_inner: 6813.98 | bwd_allreduce: 796.32 | step: 123.50
{'loss': 0.7304, 'learning_rate': 1.7944848911470857e-08, 'epoch': 0.98}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13152
total_samples=29833, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:44,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.88 | bwd_microstep: 1961.22 | bwd_inner_microstep: 1670.51 | bwd_allreduce_microstep: 290.64 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468
total_samples=29837, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:47,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.65 | bwd_microstep: 1752.51 | bwd_inner_microstep: 1686.64 | bwd_allreduce_microstep: 65.81 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11966
total_samples=29840, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:50,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 785.83 | bwd_microstep: 1860.39 | bwd_inner_microstep: 1600.13 | bwd_allreduce_microstep: 260.20 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12386
total_samples=29843, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:52,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.94
[2025-08-03 07:42:52,784] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.64 | bwd_microstep: 1757.70 | bwd_inner_microstep: 1577.50 | bwd_allreduce_microstep: 180.14 | step_microstep: 134.92
[2025-08-03 07:42:52,785] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2908.92 | bwd: 7331.88 | bwd_inner: 6534.77 | bwd_allreduce: 796.87 | step: 135.53
{'loss': 0.7367, 'learning_rate': 1.698823610032929e-08, 'epoch': 0.98}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13478
total_samples=29847, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:55,401] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.15 | bwd_microstep: 1815.71 | bwd_inner_microstep: 1724.43 | bwd_allreduce_microstep: 91.21 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13386
total_samples=29851, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:42:58,297] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.34 | bwd_microstep: 2115.56 | bwd_inner_microstep: 1876.83 | bwd_allreduce_microstep: 238.65 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11618
total_samples=29854, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:01,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.05 | bwd_microstep: 1969.58 | bwd_inner_microstep: 1552.57 | bwd_allreduce_microstep: 416.94 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14370
total_samples=29859, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:03,809] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.31
[2025-08-03 07:43:03,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.82 | bwd_microstep: 1852.17 | bwd_inner_microstep: 1738.01 | bwd_allreduce_microstep: 114.07 | step_microstep: 151.20
[2025-08-03 07:43:03,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2815.29 | bwd: 7753.08 | bwd_inner: 6891.84 | bwd_allreduce: 860.97 | step: 151.87
{'loss': 0.7366, 'learning_rate': 1.605780257373124e-08, 'epoch': 0.98}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12782
total_samples=29862, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:06,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.23 | bwd_microstep: 2080.97 | bwd_inner_microstep: 1872.85 | bwd_allreduce_microstep: 208.03 | step_microstep: 0.28
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 13077
total_samples=29867, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:09,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.65 | bwd_microstep: 2016.38 | bwd_inner_microstep: 1865.85 | bwd_allreduce_microstep: 150.45 | step_microstep: 0.35
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13599
total_samples=29871, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:12,116] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.15 | bwd_microstep: 1877.98 | bwd_inner_microstep: 1634.41 | bwd_allreduce_microstep: 243.51 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13408
total_samples=29875, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:14,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 07:43:14,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.50 | bwd_microstep: 1804.05 | bwd_inner_microstep: 1706.58 | bwd_allreduce_microstep: 97.41 | step_microstep: 159.67
[2025-08-03 07:43:14,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.47 | bwd: 7779.44 | bwd_inner: 7079.70 | bwd_allreduce: 699.48 | step: 160.41
98%|█████████▊| 1961/2000 [5:59:35<07:05, 10.90s/it]
98%|█████████▊| 1962/2000 [5:59:46<06:52, 10.87s/it]
98%|█████████▊| 1963/2000 [5:59:56<06:42, 10.87s/it]
98%|█████████▊| 1964/2000 [6:00:07<06:29, 10.81s/it]
98%|█████████▊| 1965/2000 [6:00:18<06:20, 10.87s/it]
{'loss': 0.7372, 'learning_rate': 1.5153550771630498e-08, 'epoch': 0.98}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13465
total_samples=29879, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:17,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.18 | bwd_microstep: 1930.62 | bwd_inner_microstep: 1666.99 | bwd_allreduce_microstep: 263.56 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13490
total_samples=29883, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:20,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.59 | bwd_microstep: 2217.10 | bwd_inner_microstep: 2114.27 | bwd_allreduce_microstep: 102.76 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11539
total_samples=29886, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:23,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.92 | bwd_microstep: 2137.31 | bwd_inner_microstep: 1782.26 | bwd_allreduce_microstep: 354.97 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11498
total_samples=29889, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:26,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 07:43:26,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.49 | bwd_microstep: 1824.18 | bwd_inner_microstep: 1591.28 | bwd_allreduce_microstep: 232.84 | step_microstep: 119.71
[2025-08-03 07:43:26,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2860.11 | bwd: 8109.27 | bwd_inner: 7154.79 | bwd_allreduce: 954.22 | step: 120.33
{'loss': 0.7324, 'learning_rate': 1.4275483065321338e-08, 'epoch': 0.98}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11638
total_samples=29892, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:29,202] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.30 | bwd_microstep: 2149.17 | bwd_inner_microstep: 1933.78 | bwd_allreduce_microstep: 215.31 | step_microstep: 0.29
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13697
total_samples=29896, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:31,730] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.03 | bwd_microstep: 1751.78 | bwd_inner_microstep: 1662.55 | bwd_allreduce_microstep: 89.15 | step_microstep: 0.87
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14877
total_samples=29900, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:34,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.61 | bwd_microstep: 1778.17 | bwd_inner_microstep: 1737.31 | bwd_allreduce_microstep: 40.79 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11770
total_samples=29903, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:36,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.92
[2025-08-03 07:43:36,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.52 | bwd_microstep: 1740.25 | bwd_inner_microstep: 1542.56 | bwd_allreduce_microstep: 197.61 | step_microstep: 115.73
[2025-08-03 07:43:36,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2795.38 | bwd: 7419.44 | bwd_inner: 6876.21 | bwd_allreduce: 542.95 | step: 117.00
{'loss': 0.7288, 'learning_rate': 1.3423601757436289e-08, 'epoch': 0.98}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13901
total_samples=29907, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:39,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.90 | bwd_microstep: 1817.57 | bwd_inner_microstep: 1715.92 | bwd_allreduce_microstep: 101.57 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13917
total_samples=29911, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:42,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.37 | bwd_microstep: 2115.89 | bwd_inner_microstep: 1738.37 | bwd_allreduce_microstep: 377.44 | step_microstep: 0.20
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14016
total_samples=29915, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:44,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.78 | bwd_microstep: 1775.13 | bwd_inner_microstep: 1716.55 | bwd_allreduce_microstep: 58.51 | step_microstep: 0.23
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13358
total_samples=29919, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:47,922] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.86
[2025-08-03 07:43:47,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.11 | bwd_microstep: 2038.09 | bwd_inner_microstep: 1909.51 | bwd_allreduce_microstep: 128.51 | step_microstep: 131.12
[2025-08-03 07:43:47,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2844.09 | bwd: 7746.73 | bwd_inner: 7080.35 | bwd_allreduce: 666.13 | step: 131.68
{'loss': 0.7339, 'learning_rate': 1.2597909081931702e-08, 'epoch': 0.98}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11871
total_samples=29922, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:50,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.91 | bwd_microstep: 2000.39 | bwd_inner_microstep: 1782.65 | bwd_allreduce_microstep: 217.67 | step_microstep: 0.24
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13370
total_samples=29926, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:53,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 719.64 | bwd_microstep: 1831.98 | bwd_inner_microstep: 1716.04 | bwd_allreduce_microstep: 115.87 | step_microstep: 0.26
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13185
total_samples=29930, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:56,404] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.62 | bwd_microstep: 2305.68 | bwd_inner_microstep: 2107.97 | bwd_allreduce_microstep: 197.63 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13435
total_samples=29934, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:43:59,140] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.21
[2025-08-03 07:43:59,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.92 | bwd_microstep: 1863.90 | bwd_inner_microstep: 1833.46 | bwd_allreduce_microstep: 30.37 | step_microstep: 118.79
[2025-08-03 07:43:59,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.02 | bwd: 8002.01 | bwd_inner: 7440.12 | bwd_allreduce: 561.63 | step: 119.43
{'loss': 0.7267, 'learning_rate': 1.179840720409331e-08, 'epoch': 0.98}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12097
total_samples=29937, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:01,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.03 | bwd_microstep: 1839.66 | bwd_inner_microstep: 1621.43 | bwd_allreduce_microstep: 218.16 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13468
total_samples=29941, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:04,407] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.46 | bwd_microstep: 1836.09 | bwd_inner_microstep: 1713.95 | bwd_allreduce_microstep: 122.08 | step_microstep: 0.21
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11597
total_samples=29944, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:07,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.28 | bwd_microstep: 2001.46 | bwd_inner_microstep: 1788.86 | bwd_allreduce_microstep: 212.52 | step_microstep: 0.19
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13178
total_samples=29948, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:09,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.02
[2025-08-03 07:44:09,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.67 | bwd_microstep: 1779.95 | bwd_inner_microstep: 1632.87 | bwd_allreduce_microstep: 147.01 | step_microstep: 137.74
[2025-08-03 07:44:09,874] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2854.37 | bwd: 7457.22 | bwd_inner: 6757.10 | bwd_allreduce: 699.87 | step: 138.27
98%|█████████▊| 1966/2000 [6:00:29<06:11, 10.93s/it]
98%|█████████▊| 1967/2000 [6:00:41<06:05, 11.07s/it]
98%|█████████▊| 1968/2000 [6:00:51<05:50, 10.94s/it]
98%|█████████▊| 1969/2000 [6:01:02<05:39, 10.97s/it]
98%|█████████▊| 1970/2000 [6:01:13<05:31, 11.04s/it]
98%|█████████▊| 1970/2000 [6:01:14<05:31, 11.04s/it]
99%|█████████▊| 1971/2000 [6:01:24<05:17, 10
{'loss': 0.739, 'learning_rate': 1.102509822051845e-08, 'epoch': 0.99}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13535
total_samples=29952, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:12,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.25 | bwd_microstep: 1843.20 | bwd_inner_microstep: 1722.42 | bwd_allreduce_microstep: 120.70 | step_microstep: 0.18
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13381
total_samples=29956, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:15,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.32 | bwd_microstep: 1775.66 | bwd_inner_microstep: 1698.66 | bwd_allreduce_microstep: 76.93 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12118
total_samples=29959, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:17,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.68 | bwd_microstep: 1877.80 | bwd_inner_microstep: 1766.20 | bwd_allreduce_microstep: 111.53 | step_microstep: 0.23
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11690
total_samples=29962, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:20,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 07:44:20,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.77 | bwd_microstep: 1746.44 | bwd_inner_microstep: 1543.40 | bwd_allreduce_microstep: 202.96 | step_microstep: 114.41
[2025-08-03 07:44:20,370] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2839.94 | bwd: 7243.15 | bwd_inner: 6730.68 | bwd_allreduce: 512.21 | step: 115.05
{'loss': 0.7271, 'learning_rate': 1.0277984159122734e-08, 'epoch': 0.99}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13456
total_samples=29966, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:23,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.49 | bwd_microstep: 1851.57 | bwd_inner_microstep: 1718.62 | bwd_allreduce_microstep: 132.88 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14011
total_samples=29970, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:26,100] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 740.16 | bwd_microstep: 2261.68 | bwd_inner_microstep: 2100.35 | bwd_allreduce_microstep: 161.26 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11941
total_samples=29973, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:28,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.49 | bwd_microstep: 1791.93 | bwd_inner_microstep: 1561.03 | bwd_allreduce_microstep: 230.83 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14385
total_samples=29977, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:31,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 07:44:31,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.59 | bwd_microstep: 1787.32 | bwd_inner_microstep: 1739.54 | bwd_allreduce_microstep: 47.71 | step_microstep: 137.36
[2025-08-03 07:44:31,393] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.66 | bwd: 7692.55 | bwd_inner: 7119.54 | bwd_allreduce: 572.77 | step: 137.90
{'loss': 0.7469, 'learning_rate': 9.557066979123398e-09, 'epoch': 0.99}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11723
total_samples=29981, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:33,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.13 | bwd_microstep: 1736.74 | bwd_inner_microstep: 1518.27 | bwd_allreduce_microstep: 218.39 | step_microstep: 0.30
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13376
total_samples=29985, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:36,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.69 | bwd_microstep: 1894.80 | bwd_inner_microstep: 1819.64 | bwd_allreduce_microstep: 75.08 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15176
total_samples=29989, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:39,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.33 | bwd_microstep: 1797.58 | bwd_inner_microstep: 1775.94 | bwd_allreduce_microstep: 21.56 | step_microstep: 0.24
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14886
total_samples=29994, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:42,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 07:44:42,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.15 | bwd_microstep: 1969.16 | bwd_inner_microstep: 1722.22 | bwd_allreduce_microstep: 246.88 | step_microstep: 113.47
[2025-08-03 07:44:42,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2800.21 | bwd: 7398.33 | bwd_inner: 6836.07 | bwd_allreduce: 561.99 | step: 114.20
{'loss': 0.7346, 'learning_rate': 8.862348571043733e-09, 'epoch': 0.99}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11907
total_samples=29997, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:44,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.35 | bwd_microstep: 2175.86 | bwd_inner_microstep: 1949.91 | bwd_allreduce_microstep: 225.88 | step_microstep: 0.24
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12646
total_samples=30001, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:47,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.33 | bwd_microstep: 1748.09 | bwd_inner_microstep: 1608.40 | bwd_allreduce_microstep: 139.62 | step_microstep: 0.20
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11810
total_samples=30004, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:50,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 682.35 | bwd_microstep: 1888.87 | bwd_inner_microstep: 1565.48 | bwd_allreduce_microstep: 323.33 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13397
total_samples=30008, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:52,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.60
[2025-08-03 07:44:52,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.43 | bwd_microstep: 1804.62 | bwd_inner_microstep: 1731.00 | bwd_allreduce_microstep: 73.53 | step_microstep: 131.38
[2025-08-03 07:44:52,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.37 | bwd: 7617.50 | bwd_inner: 6854.78 | bwd_allreduce: 762.46 | step: 132.08
{'loss': 0.7391, 'learning_rate': 8.193830756699773e-09, 'epoch': 0.99}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13266
total_samples=30012, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:55,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 693.52 | bwd_microstep: 1798.08 | bwd_inner_microstep: 1695.68 | bwd_allreduce_microstep: 102.34 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11996
total_samples=30015, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:44:58,230] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 720.47 | bwd_microstep: 1980.34 | bwd_inner_microstep: 1786.51 | bwd_allreduce_microstep: 193.76 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13676
total_samples=30019, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:01,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.77 | bwd_microstep: 2070.53 | bwd_inner_microstep: 1931.71 | bwd_allreduce_microstep: 138.75 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13328
total_samples=30023, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:04,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 07:45:04,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 718.04 | bwd_microstep: 2017.98 | bwd_inner_microstep: 1770.50 | bwd_allreduce_microstep: 247.42 | step_microstep: 110.90
[2025-08-03 07:45:04,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2832.74 | bwd: 7867.00 | bwd_inner: 7184.40 | bwd_allreduce: 682.36 | step: 111.27
 99%|█████████▊| 1971/2000 [6:01:24<05:17, 10.95s/it]
 99%|█████████▊| 1972/2000 [6:01:35<05:02, 10.81s/it]
 99%|█████████▊| 1973/2000 [6:01:46<04:53, 10.88s/it]
 99%|█████████▊| 1974/2000 [6:01:56<04:40, 10.80s/it]
 99%|█████████▉| 1975/2000 [6:02:07<04:30, 10.82s/it]
 99%|█████████▉| 1976/2000 [6:02:18<04:21, 10.91s/it]
{'loss': 0.7296, 'learning_rate': 7.551515289203615e-09, 'epoch': 0.99}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14495
total_samples=30028, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:07,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.27 | bwd_microstep: 2252.33 | bwd_inner_microstep: 2245.93 | bwd_allreduce_microstep: 6.30 | step_microstep: 0.79
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12134
total_samples=30031, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:09,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.71 | bwd_microstep: 1772.08 | bwd_inner_microstep: 1560.59 | bwd_allreduce_microstep: 211.42 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13893
total_samples=30035, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:12,209] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 700.72 | bwd_microstep: 1857.79 | bwd_inner_microstep: 1812.34 | bwd_allreduce_microstep: 45.38 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13447
total_samples=30039, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:14,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.63
[2025-08-03 07:45:14,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.98 | bwd_microstep: 1881.03 | bwd_inner_microstep: 1806.16 | bwd_allreduce_microstep: 74.80 | step_microstep: 130.37
[2025-08-03 07:45:14,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2784.60 | bwd: 7763.27 | bwd_inner: 7425.04 | bwd_allreduce: 337.95 | step: 131.39
{'loss': 0.7303, 'learning_rate': 6.935403852950107e-09, 'epoch': 0.99}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12256
total_samples=30042, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:17,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 763.50 | bwd_microstep: 1833.68 | bwd_inner_microstep: 1574.39 | bwd_allreduce_microstep: 259.22 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12449
total_samples=30046, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:20,560] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.13 | bwd_microstep: 2119.31 | bwd_inner_microstep: 1903.32 | bwd_allreduce_microstep: 215.93 | step_microstep: 0.10
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13587
total_samples=30050, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:23,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.38 | bwd_microstep: 1700.22 | bwd_inner_microstep: 1650.21 | bwd_allreduce_microstep: 49.93 | step_microstep: 0.35
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12430
total_samples=30055, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:25,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.15
[2025-08-03 07:45:25,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.66 | bwd_microstep: 1845.45 | bwd_inner_microstep: 1637.16 | bwd_allreduce_microstep: 208.23 | step_microstep: 116.72
[2025-08-03 07:45:25,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2856.59 | bwd: 7498.71 | bwd_inner: 6765.07 | bwd_allreduce: 733.38 | step: 117.29
{'loss': 0.7339, 'learning_rate': 6.345498063622391e-09, 'epoch': 0.99}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14137
total_samples=30059, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:28,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.21 | bwd_microstep: 1878.48 | bwd_inner_microstep: 1732.57 | bwd_allreduce_microstep: 145.85 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13953
total_samples=30063, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:30,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.25 | bwd_microstep: 1777.73 | bwd_inner_microstep: 1720.75 | bwd_allreduce_microstep: 56.92 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13238
total_samples=30067, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:33,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.21 | bwd_microstep: 2009.90 | bwd_inner_microstep: 1893.37 | bwd_allreduce_microstep: 116.47 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12426
total_samples=30070, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:36,951] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.13
[2025-08-03 07:45:36,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.83 | bwd_microstep: 2312.00 | bwd_inner_microstep: 1874.55 | bwd_allreduce_microstep: 437.37 | step_microstep: 112.83
[2025-08-03 07:45:36,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2810.43 | bwd: 7978.17 | bwd_inner: 7221.24 | bwd_allreduce: 756.68 | step: 113.18
{'loss': 0.7226, 'learning_rate': 5.781799468177473e-09, 'epoch': 0.99}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12274
total_samples=30073, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:39,712] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.58 | bwd_microstep: 1969.45 | bwd_inner_microstep: 1772.79 | bwd_allreduce_microstep: 196.60 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13392
total_samples=30077, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:42,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.60 | bwd_microstep: 1806.12 | bwd_inner_microstep: 1679.17 | bwd_allreduce_microstep: 126.87 | step_microstep: 0.19
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 14032
total_samples=30082, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:44,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 737.92 | bwd_microstep: 1791.77 | bwd_inner_microstep: 1710.57 | bwd_allreduce_microstep: 81.12 | step_microstep: 0.77
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13866
total_samples=30086, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:47,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.45
[2025-08-03 07:45:47,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.03 | bwd_microstep: 1974.92 | bwd_inner_microstep: 1873.76 | bwd_allreduce_microstep: 101.08 | step_microstep: 150.53
[2025-08-03 07:45:47,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2847.05 | bwd: 7542.32 | bwd_inner: 7036.30 | bwd_allreduce: 505.76 | step: 151.61
{'loss': 0.7229, 'learning_rate': 5.2443095448506674e-09, 'epoch': 0.99}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11763
total_samples=30089, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:50,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.70 | bwd_microstep: 2008.48 | bwd_inner_microstep: 1796.50 | bwd_allreduce_microstep: 211.91 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14973
total_samples=30094, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:53,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.97 | bwd_microstep: 1736.19 | bwd_inner_microstep: 1727.31 | bwd_allreduce_microstep: 8.82 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11899
total_samples=30097, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:55,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.24 | bwd_microstep: 1748.15 | bwd_inner_microstep: 1568.94 | bwd_allreduce_microstep: 179.12 | step_microstep: 0.15
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12144
total_samples=30101, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:45:58,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.36
[2025-08-03 07:45:58,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.42 | bwd_microstep: 1915.60 | bwd_inner_microstep: 1708.99 | bwd_allreduce_microstep: 206.55 | step_microstep: 111.05
[2025-08-03 07:45:58,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2799.26 | bwd: 7408.47 | bwd_inner: 6801.74 | bwd_allreduce: 606.47 | step: 111.45
 99%|█████████▉| 1977/2000 [6:02:29<04:11, 10.94s/it]
 99%|█████████▉| 1978/2000 [6:02:40<03:59, 10.88s/it]
 99%|█████████▉| 1979/2000 [6:02:51<03:50, 10.98s/it]
 99%|█████████▉| 1980/2000 [6:03:02<03:38, 10.93s/it]
 99%|█████████▉| 1981/2000 [6:03:13<03:25, 10.84s/it]
{'loss': 0.7218, 'learning_rate': 4.733029703146708e-09, 'epoch': 0.99}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12267
total_samples=30104, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:00,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.27 | bwd_microstep: 1711.18 | bwd_inner_microstep: 1560.37 | bwd_allreduce_microstep: 150.74 | step_microstep: 0.13
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 15684
total_samples=30108, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:03,589] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.98 | bwd_microstep: 1923.05 | bwd_inner_microstep: 1800.36 | bwd_allreduce_microstep: 122.63 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13770
total_samples=30112, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:06,299] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.63 | bwd_microstep: 1923.06 | bwd_inner_microstep: 1728.92 | bwd_allreduce_microstep: 194.06 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11708
total_samples=30115, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:08,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.55
[2025-08-03 07:46:08,984] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 701.16 | bwd_microstep: 1773.01 | bwd_inner_microstep: 1547.49 | bwd_allreduce_microstep: 225.45 | step_microstep: 138.54
[2025-08-03 07:46:08,985] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2804.97 | bwd: 7330.35 | bwd_inner: 6637.13 | bwd_allreduce: 692.97 | step: 138.92
{'loss': 0.7243, 'learning_rate': 4.247961283835311e-09, 'epoch': 0.99}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13207
total_samples=30119, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:11,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 734.13 | bwd_microstep: 2015.81 | bwd_inner_microstep: 1888.27 | bwd_allreduce_microstep: 127.47 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13520
total_samples=30123, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:14,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.53 | bwd_microstep: 1728.31 | bwd_inner_microstep: 1698.81 | bwd_allreduce_microstep: 29.42 | step_microstep: 0.48
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13504
total_samples=30127, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:16,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 735.96 | bwd_microstep: 1805.17 | bwd_inner_microstep: 1708.08 | bwd_allreduce_microstep: 97.03 | step_microstep: 0.13
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12150
total_samples=30131, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:19,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.32
[2025-08-03 07:46:19,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.26 | bwd_microstep: 1805.63 | bwd_inner_microstep: 1571.05 | bwd_allreduce_microstep: 234.52 | step_microstep: 128.75
[2025-08-03 07:46:19,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2858.79 | bwd: 7354.98 | bwd_inner: 6866.19 | bwd_allreduce: 488.53 | step: 129.58
{'loss': 0.7326, 'learning_rate': 3.789105558954509e-09, 'epoch': 0.99}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12759
total_samples=30135, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:22,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.35 | bwd_microstep: 2238.06 | bwd_inner_microstep: 2231.71 | bwd_allreduce_microstep: 6.28 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11502
total_samples=30138, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:25,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.91 | bwd_microstep: 1767.67 | bwd_inner_microstep: 1529.73 | bwd_allreduce_microstep: 237.87 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13655
total_samples=30142, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:27,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.53 | bwd_microstep: 1778.68 | bwd_inner_microstep: 1701.61 | bwd_allreduce_microstep: 77.00 | step_microstep: 0.27
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 15646
total_samples=30146, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:30,486] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.43
[2025-08-03 07:46:30,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 681.90 | bwd_microstep: 1821.82 | bwd_inner_microstep: 1759.94 | bwd_allreduce_microstep: 61.82 | step_microstep: 133.94
[2025-08-03 07:46:30,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2801.61 | bwd: 7606.29 | bwd_inner: 7222.99 | bwd_allreduce: 383.05 | step: 134.63
{'loss': 0.7395, 'learning_rate': 3.3564637317984318e-09, 'epoch': 0.99}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13708
total_samples=30150, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:33,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 731.70 | bwd_microstep: 2054.73 | bwd_inner_microstep: 1976.45 | bwd_allreduce_microstep: 78.21 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13415
total_samples=30154, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:35,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.18 | bwd_microstep: 1806.32 | bwd_inner_microstep: 1726.53 | bwd_allreduce_microstep: 79.72 | step_microstep: 0.29
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14033
total_samples=30158, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:38,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.06 | bwd_microstep: 1775.41 | bwd_inner_microstep: 1727.42 | bwd_allreduce_microstep: 47.91 | step_microstep: 0.84
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11816
total_samples=30161, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:41,082] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.97
[2025-08-03 07:46:41,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.61 | bwd_microstep: 1711.70 | bwd_inner_microstep: 1534.91 | bwd_allreduce_microstep: 176.72 | step_microstep: 137.26
[2025-08-03 07:46:41,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2825.49 | bwd: 7348.22 | bwd_inner: 6965.31 | bwd_allreduce: 382.66 | step: 138.53
{'loss': 0.7318, 'learning_rate': 2.9500369369195313e-09, 'epoch': 0.99}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11810
total_samples=30164, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:43,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.44 | bwd_microstep: 2087.70 | bwd_inner_microstep: 1854.30 | bwd_allreduce_microstep: 233.32 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13830
total_samples=30168, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:46,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.65 | bwd_microstep: 1823.80 | bwd_inner_microstep: 1743.37 | bwd_allreduce_microstep: 80.36 | step_microstep: 0.22
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13278
total_samples=30172, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:49,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.73 | bwd_microstep: 2246.36 | bwd_inner_microstep: 1876.45 | bwd_allreduce_microstep: 369.84 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13518
total_samples=30176, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:52,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.26
[2025-08-03 07:46:52,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 694.90 | bwd_microstep: 2260.69 | bwd_inner_microstep: 2115.28 | bwd_allreduce_microstep: 145.35 | step_microstep: 115.62
[2025-08-03 07:46:52,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2842.64 | bwd: 8418.60 | bwd_inner: 7589.39 | bwd_allreduce: 828.96 | step: 116.14
 99%|█████████▉| 1982/2000 [6:03:23<03:13, 10.76s/it]
 99%|█████████▉| 1983/2000 [6:03:34<03:02, 10.73s/it]
 99%|█████████▉| 1984/2000 [6:03:45<02:52, 10.77s/it]
 99%|█████████▉| 1985/2000 [6:03:55<02:40, 10.72s/it]
 99%|█████████▉| 1986/2000 [6:04:07<02:34, 11.01s/it]
{'loss': 0.7293, 'learning_rate': 2.5698262401263607e-09, 'epoch': 0.99}
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 13154
total_samples=30180, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:55,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 717.89 | bwd_microstep: 2011.44 | bwd_inner_microstep: 1823.44 | bwd_allreduce_microstep: 187.93 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13993
total_samples=30184, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:46:58,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 711.82 | bwd_microstep: 1730.92 | bwd_inner_microstep: 1696.01 | bwd_allreduce_microstep: 34.84 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13366
total_samples=30188, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:01,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 726.51 | bwd_microstep: 2187.06 | bwd_inner_microstep: 2060.11 | bwd_allreduce_microstep: 126.88 | step_microstep: 0.19
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 12488
total_samples=30192, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:03,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 07:47:03,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 696.83 | bwd_microstep: 1972.72 | bwd_inner_microstep: 1781.74 | bwd_allreduce_microstep: 190.91 | step_microstep: 126.59
[2025-08-03 07:47:03,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2852.98 | bwd: 7902.20 | bwd_inner: 7361.30 | bwd_allreduce: 540.65 | step: 127.16
{'loss': 0.7326, 'learning_rate': 2.215832638474691e-09, 'epoch': 0.99}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14056
total_samples=30196, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:06,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 709.68 | bwd_microstep: 1816.25 | bwd_inner_microstep: 1740.39 | bwd_allreduce_microstep: 75.78 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13524
total_samples=30200, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:09,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.57 | bwd_microstep: 2023.58 | bwd_inner_microstep: 1889.77 | bwd_allreduce_microstep: 133.74 | step_microstep: 0.24
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 15730
total_samples=30204, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:12,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 691.82 | bwd_microstep: 1921.07 | bwd_inner_microstep: 1750.69 | bwd_allreduce_microstep: 170.30 | step_microstep: 0.38
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13419
total_samples=30208, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:14,986] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.48
[2025-08-03 07:47:14,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 736.33 | bwd_microstep: 2060.86 | bwd_inner_microstep: 1924.84 | bwd_allreduce_microstep: 135.95 | step_microstep: 110.21
[2025-08-03 07:47:14,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2840.32 | bwd: 7821.81 | bwd_inner: 7305.69 | bwd_allreduce: 515.87 | step: 110.94
{'loss': 0.7386, 'learning_rate': 1.888057060274173e-09, 'epoch': 0.99}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13432
total_samples=30212, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:17,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 727.04 | bwd_microstep: 1829.31 | bwd_inner_microstep: 1697.18 | bwd_allreduce_microstep: 132.06 | step_microstep: 0.11
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 14456
total_samples=30218, num_samples=6, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:20,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.17 | bwd_microstep: 1768.19 | bwd_inner_microstep: 1684.50 | bwd_allreduce_microstep: 83.61 | step_microstep: 0.15
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12728
total_samples=30221, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:22,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.55 | bwd_microstep: 1984.86 | bwd_inner_microstep: 1767.47 | bwd_allreduce_microstep: 217.32 | step_microstep: 0.29
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12151
total_samples=30224, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:25,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.12
[2025-08-03 07:47:25,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 690.79 | bwd_microstep: 1918.81 | bwd_inner_microstep: 1575.23 | bwd_allreduce_microstep: 343.51 | step_microstep: 109.13
[2025-08-03 07:47:25,706] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2823.49 | bwd: 7501.22 | bwd_inner: 6724.37 | bwd_allreduce: 776.60 | step: 109.68
{'loss': 0.7261, 'learning_rate': 1.5865003650761268e-09, 'epoch': 0.99}
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 13189
total_samples=30228, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:28,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 698.21 | bwd_microstep: 2000.26 | bwd_inner_microstep: 1827.90 | bwd_allreduce_microstep: 172.28 | step_microstep: 0.26
dynamic ViT batch size: 43, images per sample: 43.0, dynamic token length: 12876
total_samples=30232, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:31,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.12 | bwd_microstep: 1750.79 | bwd_inner_microstep: 1595.01 | bwd_allreduce_microstep: 155.71 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13217
total_samples=30236, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:33,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.40 | bwd_microstep: 2104.95 | bwd_inner_microstep: 1986.29 | bwd_allreduce_microstep: 118.60 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11894
total_samples=30239, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:37,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.38
[2025-08-03 07:47:37,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.81 | bwd_microstep: 2281.40 | bwd_inner_microstep: 2273.01 | bwd_allreduce_microstep: 8.32 | step_microstep: 135.85
[2025-08-03 07:47:37,074] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2797.46 | bwd: 8137.45 | bwd_inner: 7682.21 | bwd_allreduce: 454.99 | step: 136.35
{'loss': 0.7342, 'learning_rate': 1.3111633436779792e-09, 'epoch': 0.99}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11945
total_samples=30243, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:39,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 733.50 | bwd_microstep: 1880.91 | bwd_inner_microstep: 1687.46 | bwd_allreduce_microstep: 193.38 | step_microstep: 0.28
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 16007
total_samples=30247, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:42,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 713.96 | bwd_microstep: 2011.62 | bwd_inner_microstep: 1870.92 | bwd_allreduce_microstep: 140.63 | step_microstep: 0.86
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13774
total_samples=30251, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:45,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.55 | bwd_microstep: 1794.52 | bwd_inner_microstep: 1712.84 | bwd_allreduce_microstep: 81.62 | step_microstep: 0.11
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12160
total_samples=30254, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:48,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 16.57
[2025-08-03 07:47:48,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 739.45 | bwd_microstep: 2004.83 | bwd_inner_microstep: 1658.06 | bwd_allreduce_microstep: 346.70 | step_microstep: 113.75
[2025-08-03 07:47:48,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2894.39 | bwd: 7691.94 | bwd_inner: 6929.28 | bwd_allreduce: 762.41 | step: 115.01
 99%|█████████▉| 1986/2000 [6:04:07<02:34, 11.01s/it]
 99%|█████████▉| 1987/2000 [6:04:18<02:23, 11.05s/it]
 99%|█████████▉| 1988/2000 [6:04:29<02:12, 11.05s/it]
 99%|█████████▉| 1989/2000 [6:04:40<02:00, 10.95s/it]
100%|█████████▉| 1990/2000 [6:04:51<01:50, 11.08s/it]
100%|█████████▉| 1991/2000 [6:05:02<01:39, 11.05s/it]
{'loss': 0.7216, 'learning_rate': 1.062046718121046e-09, 'epoch': 1.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13596
total_samples=30258, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:50,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.11 | bwd_microstep: 1901.51 | bwd_inner_microstep: 1859.12 | bwd_allreduce_microstep: 42.32 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12060
total_samples=30261, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:53,497] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 715.83 | bwd_microstep: 1986.59 | bwd_inner_microstep: 1759.10 | bwd_allreduce_microstep: 227.42 | step_microstep: 0.31
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13893
total_samples=30265, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:55,987] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 689.03 | bwd_microstep: 1725.58 | bwd_inner_microstep: 1693.92 | bwd_allreduce_microstep: 31.59 | step_microstep: 0.11
dynamic ViT batch size: 46, images per sample: 46.0, dynamic token length: 16156
total_samples=30269, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:47:58,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.53
[2025-08-03 07:47:58,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.54 | bwd_microstep: 1820.72 | bwd_inner_microstep: 1797.25 | bwd_allreduce_microstep: 23.37 | step_microstep: 115.03
[2025-08-03 07:47:58,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2796.44 | bwd: 7434.45 | bwd_inner: 7109.39 | bwd_allreduce: 324.80 | step: 115.69
{'loss': 0.7375, 'learning_rate': 8.391511416816489e-10, 'epoch': 1.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13823
total_samples=30273, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:01,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 692.44 | bwd_microstep: 2207.81 | bwd_inner_microstep: 2142.00 | bwd_allreduce_microstep: 65.74 | step_microstep: 0.12
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11916
total_samples=30276, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:04,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 699.84 | bwd_microstep: 2051.92 | bwd_inner_microstep: 1791.62 | bwd_allreduce_microstep: 260.23 | step_microstep: 0.17
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13962
total_samples=30280, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:07,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 678.98 | bwd_microstep: 1974.56 | bwd_inner_microstep: 1763.14 | bwd_allreduce_microstep: 211.35 | step_microstep: 0.26
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13226
total_samples=30284, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:10,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 15.02
[2025-08-03 07:48:10,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.97 | bwd_microstep: 1884.10 | bwd_inner_microstep: 1834.51 | bwd_allreduce_microstep: 49.52 | step_microstep: 144.47
[2025-08-03 07:48:10,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2774.16 | bwd: 8118.48 | bwd_inner: 7531.27 | bwd_allreduce: 586.93 | step: 145.00
{'loss': 0.7358, 'learning_rate': 6.424771988788881e-10, 'epoch': 1.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13513
total_samples=30288, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:13,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.09 | bwd_microstep: 2222.39 | bwd_inner_microstep: 1985.63 | bwd_allreduce_microstep: 236.67 | step_microstep: 0.81
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11649
total_samples=30291, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:15,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 702.41 | bwd_microstep: 2026.79 | bwd_inner_microstep: 1669.37 | bwd_allreduce_microstep: 357.36 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13271
total_samples=30295, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:18,524] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 724.67 | bwd_microstep: 1858.39 | bwd_inner_microstep: 1720.41 | bwd_allreduce_microstep: 137.91 | step_microstep: 0.15
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13434
total_samples=30299, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:21,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.24
[2025-08-03 07:48:21,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 730.46 | bwd_microstep: 1819.93 | bwd_inner_microstep: 1747.52 | bwd_allreduce_microstep: 72.34 | step_microstep: 115.13
[2025-08-03 07:48:21,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2865.57 | bwd: 7927.56 | bwd_inner: 7122.92 | bwd_allreduce: 804.36 | step: 116.19
{'loss': 0.7311, 'learning_rate': 4.720254054679796e-10, 'epoch': 1.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13660
total_samples=30303, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:24,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 714.64 | bwd_microstep: 1919.30 | bwd_inner_microstep: 1726.18 | bwd_allreduce_microstep: 193.06 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13306
total_samples=30307, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:26,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 745.04 | bwd_microstep: 2030.86 | bwd_inner_microstep: 1871.36 | bwd_allreduce_microstep: 159.43 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 14790
total_samples=30311, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:29,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 683.29 | bwd_microstep: 1839.61 | bwd_inner_microstep: 1822.89 | bwd_allreduce_microstep: 16.65 | step_microstep: 0.10
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11962
total_samples=30314, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:32,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.30
[2025-08-03 07:48:32,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 721.44 | bwd_microstep: 1827.94 | bwd_inner_microstep: 1585.89 | bwd_allreduce_microstep: 241.98 | step_microstep: 124.17
[2025-08-03 07:48:32,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2864.33 | bwd: 7617.77 | bwd_inner: 7006.32 | bwd_allreduce: 611.21 | step: 124.67
{'loss': 0.7318, 'learning_rate': 3.277962084369257e-10, 'epoch': 1.0}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11949
total_samples=30317, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:35,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.55 | bwd_microstep: 2184.38 | bwd_inner_microstep: 1938.19 | bwd_allreduce_microstep: 246.12 | step_microstep: 0.30
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13065
total_samples=30321, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:37,714] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 697.22 | bwd_microstep: 1788.21 | bwd_inner_microstep: 1689.27 | bwd_allreduce_microstep: 98.88 | step_microstep: 0.14
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13354
total_samples=30325, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:40,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.72 | bwd_microstep: 1791.21 | bwd_inner_microstep: 1692.86 | bwd_allreduce_microstep: 98.28 | step_microstep: 0.11
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13122
total_samples=30329, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:43,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.65
[2025-08-03 07:48:43,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 728.72 | bwd_microstep: 1816.28 | bwd_inner_microstep: 1702.06 | bwd_allreduce_microstep: 114.15 | step_microstep: 129.10
[2025-08-03 07:48:43,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2826.15 | bwd: 7580.13 | bwd_inner: 7022.38 | bwd_allreduce: 557.51 | step: 129.64
100%|█████████▉| 1992/2000 [6:05:13<01:27, 10.93s/it]
100%|█████████▉| 1993/2000 [6:05:24<01:17, 11.05s/it]
100%|█████████▉| 1994/2000 [6:05:36<01:06, 11.10s/it]
100%|█████████▉| 1995/2000 [6:05:47<00:55, 11.05s/it]
100%|█████████▉| 1996/2000 [6:05:57<00:43, 10.99s/it]
{'loss': 0.7242, 'learning_rate': 2.0978998601206558e-10, 'epoch': 1.0}
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13527
total_samples=30334, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:45,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 705.89 | bwd_microstep: 2139.74 | bwd_inner_microstep: 1989.77 | bwd_allreduce_microstep: 149.91 | step_microstep: 0.80
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 12173
total_samples=30337, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:48,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 695.47 | bwd_microstep: 1758.66 | bwd_inner_microstep: 1562.53 | bwd_allreduce_microstep: 196.06 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14010
total_samples=30341, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:51,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 710.92 | bwd_microstep: 1779.17 | bwd_inner_microstep: 1715.70 | bwd_allreduce_microstep: 63.41 | step_microstep: 0.14
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11753
total_samples=30344, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:54,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 13.98
[2025-08-03 07:48:54,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 723.80 | bwd_microstep: 2186.29 | bwd_inner_microstep: 2180.46 | bwd_allreduce_microstep: 5.77 | step_microstep: 112.21
[2025-08-03 07:48:54,133] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2836.02 | bwd: 7863.91 | bwd_inner: 7448.45 | bwd_allreduce: 415.22 | step: 113.26
{'loss': 0.7275, 'learning_rate': 1.1800704765030367e-10, 'epoch': 1.0}
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11778
total_samples=30347, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:56,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 894.61 | bwd_microstep: 1723.23 | bwd_inner_microstep: 1540.41 | bwd_allreduce_microstep: 182.74 | step_microstep: 0.16
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14970
total_samples=30352, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:48:59,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 684.07 | bwd_microstep: 1775.37 | bwd_inner_microstep: 1713.13 | bwd_allreduce_microstep: 62.18 | step_microstep: 0.21
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 12698
total_samples=30356, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:02,228] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 707.65 | bwd_microstep: 2081.45 | bwd_inner_microstep: 1980.92 | bwd_allreduce_microstep: 100.47 | step_microstep: 0.25
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11575
total_samples=30359, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:04,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.59
[2025-08-03 07:49:04,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 708.07 | bwd_microstep: 1721.73 | bwd_inner_microstep: 1525.88 | bwd_allreduce_microstep: 195.79 | step_microstep: 111.07
[2025-08-03 07:49:04,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2994.32 | bwd: 7301.84 | bwd_inner: 6760.34 | bwd_allreduce: 541.26 | step: 111.71
{'loss': 0.7242, 'learning_rate': 5.244763404133046e-11, 'epoch': 1.0}
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 13590
total_samples=30363, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:07,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.86 | bwd_microstep: 1726.97 | bwd_inner_microstep: 1673.57 | bwd_allreduce_microstep: 53.32 | step_microstep: 0.16
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 13168
total_samples=30366, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:09,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 716.22 | bwd_microstep: 1811.50 | bwd_inner_microstep: 1634.22 | bwd_allreduce_microstep: 177.22 | step_microstep: 0.11
dynamic ViT batch size: 47, images per sample: 47.0, dynamic token length: 14903
total_samples=30371, num_samples=5, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:12,811] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 703.35 | bwd_microstep: 2113.48 | bwd_inner_microstep: 2046.31 | bwd_allreduce_microstep: 67.09 | step_microstep: 0.24
dynamic ViT batch size: 42, images per sample: 42.0, dynamic token length: 11722
total_samples=30374, num_samples=3, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:15,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.92
[2025-08-03 07:49:15,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 688.44 | bwd_microstep: 1940.51 | bwd_inner_microstep: 1576.58 | bwd_allreduce_microstep: 363.85 | step_microstep: 135.52
[2025-08-03 07:49:15,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2794.81 | bwd: 7592.52 | bwd_inner: 6930.67 | bwd_allreduce: 661.59 | step: 136.02
{'loss': 0.737, 'learning_rate': 1.311191710651194e-11, 'epoch': 1.0}
dynamic ViT batch size: 44, images per sample: 44.0, dynamic token length: 13782
total_samples=30378, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:18,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 706.64 | bwd_microstep: 1763.54 | bwd_inner_microstep: 1646.08 | bwd_allreduce_microstep: 117.39 | step_microstep: 0.12
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13209
total_samples=30382, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:21,083] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 704.37 | bwd_microstep: 2104.32 | bwd_inner_microstep: 2050.44 | bwd_allreduce_microstep: 53.83 | step_microstep: 0.26
dynamic ViT batch size: 45, images per sample: 45.0, dynamic token length: 12898
total_samples=30386, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:23,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 686.02 | bwd_microstep: 1992.44 | bwd_inner_microstep: 1843.65 | bwd_allreduce_microstep: 148.71 | step_microstep: 0.28
dynamic ViT batch size: 48, images per sample: 48.0, dynamic token length: 13073
total_samples=30390, num_samples=4, num_padding_tokens=0, num_padding_images=0
[2025-08-03 07:49:26,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 14.16
[2025-08-03 07:49:26,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 685.72 | bwd_microstep: 2082.86 | bwd_inner_microstep: 1864.53 | bwd_allreduce_microstep: 218.26 | step_microstep: 109.10
[2025-08-03 07:49:26,790] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 2782.67 | bwd: 7943.22 | bwd_inner: 7404.70 | bwd_allreduce: 538.27 | step: 109.76
{'loss': 0.7305, 'learning_rate': 0.0, 'epoch': 1.0}
100%|█████████▉| 1997/2000 [6:06:08<00:33, 11.02s/it]
100%|█████████▉| 1997/2000 [6:06:09<00:33, 11.02s/it]
100%|█████████▉| 1998/2000 [6:06:19<00:21, 10.92s/it]
100%|█████████▉| 1999/2000 [6:06:30<00:10, 10.89s/it]
100%|██████████| 2000/2000 [6:06:41<00:00, 10.97s/it]
[INFO|trainer.py:2936] 2025-08-03 07:49:29,746 >> Saving model checkpoint to work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000
[INFO|configuration_utils.py:473] 2025-08-03 07:49:29,761 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/config.json
[INFO|configuration_utils.py:594] 2025-08-03 07:49:29,765 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/generation_config.json
[INFO|modeling_utils.py:2493] 2025-08-03 07:49:34,265 >> Model weights saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/model.safetensors
[INFO|tokenization_utils_base.py:2433] 2025-08-03 07:49:34,271 >> tokenizer config file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2025-08-03 07:49:34,277 >> Special tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2025-08-03 07:49:34,279 >> added tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/added_tokens.json
[2025-08-03 07:49:34,911] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step2000 is about to be saved!
[2025-08-03 07:49:34,972] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_16_mp_rank_00_model_states.pt...
[2025-08-03 07:49:34,932] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-08-03 07:49:34,932] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-08-03 07:49:34,950] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_24_mp_rank_00_model_states.pt...
[2025-08-03 07:49:34,954] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_8_mp_rank_00_model_states.pt...
[2025-08-03 07:49:35,022] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_24_mp_rank_00_model_states.pt.
[2025-08-03 07:49:35,053] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_8_mp_rank_00_model_states.pt.
[2025-08-03 07:49:35,044] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-08-03 07:49:35,112] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/zero_pp_rank_16_mp_rank_00_model_states.pt.
[2025-08-03 07:49:35,108] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt...
[2025-08-03 07:49:35,130] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt...
[2025-08-03 07:49:35,090] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-08-03 07:49:35,107] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt...
[2025-08-03 07:49:36,654] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt.
[2025-08-03 07:49:36,654] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt
[2025-08-03 07:49:36,690] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt.
[2025-08-03 07:49:36,691] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt
[2025-08-03 07:49:36,718] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt.
[2025-08-03 07:49:36,718] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt
[2025-08-03 07:49:36,731] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-08-03 07:49:36,752] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tmp-checkpoint-2000/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-08-03 07:49:36,782] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2000 is ready now!
[2025-08-03 07:49:36,799] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2000 is ready now!
[2025-08-03 07:49:36,802] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2000 is ready now!
[2025-08-03 07:49:36,824] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2000 is ready now!
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 422, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_4d_kernel", __class__._4d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_4d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_2d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 422, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_4d_kernel", __class__._4d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_4d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 74, in load
    loaded_dict = pickle.load(handle)
FileNotFoundError: [Errno 2] No such file or directory
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_2d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 74, in load
    loaded_dict = pickle.load(handle)
FileNotFoundError: [Errno 2] No such file or directory
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_2d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_2d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_2d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 422, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_4d_kernel", __class__._4d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_4d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_2d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 422, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_4d_kernel", __class__._4d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_4d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 422, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_4d_kernel", __class__._4d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_4d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_2d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 74, in load
    loaded_dict = pickle.load(handle)
FileNotFoundError: [Errno 2] No such file or directory
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 422, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_4d_kernel", __class__._4d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_4d_kernel.pickle'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 421, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_2d_kernel", __class__._2d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_2d_kernel.pickle'
[INFO|trainer.py:1962] 2025-08-03 07:49:40,263 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:1962] 2025-08-03 07:49:40,744 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:1962] 2025-08-03 07:49:41,108 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


[INFO|trainer.py:1962] 2025-08-03 07:49:41,645 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 22016.0455, 'train_samples_per_second': 11.628, 'train_steps_per_second': 0.091, 'train_loss': 0.7865399915277957, 'epoch': 1.0}
100%|██████████| 2000/2000 [6:07:00<00:00, 10.97s/it]
100%|██████████| 2000/2000 [6:07:00<00:00, 11.01s/it]
[INFO|trainer.py:2936] 2025-08-03 07:49:47,696 >> Saving model checkpoint to work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802
[INFO|configuration_utils.py:473] 2025-08-03 07:49:47,703 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/config.json
[INFO|configuration_utils.py:594] 2025-08-03 07:49:47,707 >> Configuration saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/generation_config.json
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 444, in matmul_ext_update_autotune_table
    fp16_matmul._update_autotune_table()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 422, in _update_autotune_table
    TritonMatmul._update_autotune_table(__class__.__name__ + "_4d_kernel", __class__._4d_kernel)
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 145, in _update_autotune_table
    autotune_table = cache_manager.load()
  File "/mnt/petrelfs/yangganlin/anaconda3/envs/monointernvl/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/triton/matmul_ext.py", line 73, in load
    with open(self.file_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/petrelfs/yangganlin/triton_ygl/Fp16Matmul_4d_kernel.pickle'
[INFO|modeling_utils.py:2493] 2025-08-03 07:49:54,598 >> Model weights saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/model.safetensors
[INFO|tokenization_utils_base.py:2433] 2025-08-03 07:49:54,603 >> tokenizer config file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2025-08-03 07:49:54,607 >> Special tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2025-08-03 07:49:54,609 >> added tokens file saved in work_dirs/internvl_chat_v3/internvl3_2b_dynamic_res_2nd_finetune_full/20250802/added_tokens.json
***** train metrics *****
  epoch                    =        1.0
  train_loss               =     0.7865
  train_runtime            = 6:06:56.04
  train_samples            =         -1
  train_samples_per_second =     11.628
  train_steps_per_second   =      0.091