Hi, thanks for the great work. My GPUs: `2080ti * 10`.

The command I run:

```bash
AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 mpirun -np 8 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir run_test \
  --local_train_batch_size 8 --local_eval_batch_size 8 --log_dir exp/tiktok_pretrain \
  --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \
  --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \
  --train_yaml /data/mfyan/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml /data/mfyan/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \
  --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \
  --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0
```

The first error reported was:

```
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:12475 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:12475 (errno: 98 - Address already in use).
```

I changed the port number in `utils/dist.py` to a different value, but the same type of error was still raised, so I changed it to `random.randint(10000, 20000)`, and the processes started. However, I then found that all 8 processes were running on **GPU 0** only, which resulted in `RuntimeError: CUDA error: out of memory`.
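For reference, here is a minimal sketch of the kind of setup that would avoid both symptoms, assuming an OpenMPI launcher and PyTorch's NCCL backend. The helper names (`find_free_port`, `setup_distributed`) are hypothetical and not the repository's actual `utils/dist.py`:

```python
# Sketch only (assumptions: mpirun/OpenMPI launcher, PyTorch NCCL backend).
# (1) "Address already in use": let the OS hand out an unused port instead of a
#     hard-coded or randomly guessed one.
# (2) all processes on GPU 0: bind each process to the GPU matching its local
#     rank before initializing the process group.
import os
import socket

import torch
import torch.distributed as dist


def find_free_port() -> int:
    """Bind to port 0 so the kernel returns an unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def setup_distributed() -> int:
    # OpenMPI exports these variables for every process started by `mpirun -np N`.
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

    # Without this, every rank allocates on cuda:0, which reproduces the
    # "CUDA error: out of memory" seen when 8 processes share one card.
    torch.cuda.set_device(local_rank)

    # All ranks must agree on MASTER_ADDR/MASTER_PORT; find_free_port() is only
    # safe when a single process picks the value for everyone (see note below).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return local_rank
```

Note that if the port is chosen with `find_free_port()`, it has to be chosen once and exported as `MASTER_PORT` in the shell before launching `mpirun`, so that all ranks inherit the same value; picking a different random port inside each rank would break the rendezvous.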