python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracke...
🚨 Error message
python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[rank0]:[W903 15:36:02.496083400 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
🤖 AI solution
UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
WARNING: destroy_process_group() was not called before program exit, which can leak resources.
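Root cause analysis
- Shared memory leak: resource_tracker reports shared_memory objects that were not cleaned up before shutdown.
- Process group not destroyed: PyTorch's destroy_process_group() was not called before the program exited, so the NCCL resources were not released properly.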
Solution
Make sure torch.distributed.destroy_process_group() is called before the program exits.
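import torch
import torch.distributed as dist

# ... your distributed training code ...

if __name__ == "__main__":
    # Initialize the distributed environment
    dist.init_process_group("nccl")
    try:
        # ... your distributed training logic ...
        pass
    finally:
        # Destroy the process group before the program exits,
        # even if the training logic raises an exception
        dist.destroy_process_group()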
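Preventive measures
- Pair every call to dist.init_process_group() with a call to dist.destroy_process_group() at the end of the program, so that the lifetime of the distributed resources is managed explicitly.
- When using the multiprocessing module, make sure the shared memory that child processes create is cleaned up when they finish. If you create shared memory manually, always call shm.close() and shm.unlink() once it is no longer needed.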
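
The shm.close() / shm.unlink() advice above can be sketched as follows. This is a minimal illustration using only the standard library's multiprocessing.shared_memory module; the helper name write_and_read is hypothetical and not part of the original code. It assumes a single process that both creates and destroys the block.

```python
from multiprocessing import shared_memory


def write_and_read(payload: bytes) -> bytes:
    """Round-trip a non-empty payload through a shared memory block, then clean it up.

    Hypothetical helper for illustration only.
    """
    # Create a new block large enough for the payload (size must be > 0).
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # Write into the block, then copy the data back out as bytes.
        shm.buf[:len(payload)] = payload
        return bytes(shm.buf[:len(payload)])
    finally:
        # close() releases this process's handle to the block;
        # unlink() asks the OS to destroy the block itself.
        shm.close()
        shm.unlink()


if __name__ == "__main__":
    print(write_and_read(b"hello"))
```

When several processes share one block, each process that attaches should call close() on its own handle, while unlink() should be called only once, typically by the process that created the block.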