HSA_STATUS_ERROR_OUT_OF_RESOURCES on AMD Instinct MI250X when doing allocations in a loop #844

@Alexander-Barth

Description

Questionnaire

  1. Does ROCm work for you outside of Julia, e.g. C/C++/Python?

yes

  2. Post output of rocminfo.
output of `rocminfo`
ROCk module version 6.3.6 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:  
  3. Post output of AMDGPU.versioninfo() if possible.
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name             │ Version   │ Path                                                                                                     │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│     +     │ LLD              │ -         │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/llvm/bin/ld.lld                                             │
│     +     │ Device Libraries │ -         │ /tmp/julia-depot-FlowMatching-barthale/artifacts/b46ab46ef568406312e5f500efb677511199c2f9/amdgcn/bitcode │
│     +     │ HIP              │ 6.2.41134 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libamdhip64.so                                              │
│     +     │ rocBLAS          │ 4.2.1     │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocblas.so                                               │
│     +     │ rocSOLVER        │ 3.26.0    │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsolver.so                                             │
│     +     │ rocSPARSE        │ 3.2.0     │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsparse.so                                             │
│     +     │ rocRAND          │ 2.10.5    │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocrand.so                                               │
│     +     │ rocFFT           │ 1.0.29    │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocfft.so                                                │
│     +     │ MIOpen           │ 3.2.0     │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libMIOpen.so                                                │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │                Name │               GCN arch │ Wavefront │     Memory │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│  1 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘

Reproducing the bug

  1. Describe what's not working.

I am trying to train a neural network and run inference with it using Lux (1.22.1), AMDGPU (2.1.1) and Julia (1.12.0). I get one of two errors: either HSA_STATUS_ERROR_OUT_OF_RESOURCES, or "Failed to successfully execute function and free resources for it. Reporting current memory usage: HIP pool used...".
Ref: https://discourse.julialang.org/t/gpu-memory-issue-on-amdgpu/133560

  2. Provide MWE to reproduce it (if possible).

Here is a reproducer for the error HSA_STATUS_ERROR_OUT_OF_RESOURCES where I allocate arrays of random size:

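using AMDGPU

# Allocate N random-sized 4-D Float32 arrays on the GPU (each dimension drawn
# from 1:250) and accumulate their sums. Each array is a temporary that is not
# referenced again after its iteration.
function mytest(N)
  total = 0f0
  for i = 1:N
    total += sum(AMDGPU.ones(Float32, ntuple(_ -> rand(1:250), 4)))
  end
  return total
end

mytest(10_000)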
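To put the reproducer in perspective, here is a rough, CPU-only sketch of how large the individual allocations can get. It assumes the four dimensions are drawn independently and uniformly from 1:250, as in the loop above; `array_bytes` and the variable names are only illustrative helpers, not part of AMDGPU.

```julia
# Size in bytes of a Float32 array with the given dimensions.
array_bytes(dims) = prod(dims) * sizeof(Float32)

GiB = 2^30
MiB = 2^20

# Extremes: every dimension at its maximum (250) or minimum (1).
largest  = array_bytes((250, 250, 250, 250))
smallest = array_bytes((1, 1, 1, 1))

# Expected size: the dimensions are independent, so E[prod] = E[dim]^4,
# with E[dim] = (1 + 250) / 2 = 125.5 for a uniform draw from 1:250.
expected = 125.5^4 * sizeof(Float32)

println("largest single array:  ", round(largest / GiB; digits=2), " GiB")
println("smallest single array: ", smallest, " bytes")
println("expected array size:   ", round(expected / MiB; digits=2), " MiB")
```

So a single iteration can request anywhere from 4 bytes to roughly 14.5 GiB, with a mean a little under 1 GiB, which means the loop puts heavy pressure on the allocator even though only one array is logically alive at a time.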

The error message is:

:0:rocdevice.cpp            :2982: 3734245763676 us: [pid:38494 tid:0x1498153ff700] Callback: 
Queue 0x149815000000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime 
failed to allocate the necessary resources. This error may also occur when the core runtime library needs to 
spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 0 MB

I do not have any local settings for the run above. With the following settings in LocalPreferences.toml:

[AMDGPU]
hard_memory_limit = "80 %"
eager_gc = true

I get this error instead:

ERROR: Failed to successfully execute function and free resources for it.
Reporting current memory usage:
- HIP pool used: 393.358 MiB.
- HIP pool reserved: 393.358 MiB.
- Hard memory limit: 51.188 GiB.

Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:44
  [2] alloc_or_retry!(f::AMDGPU.Runtime.Mem.var"#5#6"{HIPStream, Int64, Base.RefValue{Ptr{Nothing}}}, isfailed::typeof(isnothing); stream::HIPStream)
    @ AMDGPU.Runtime.Mem /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:34
  [3] alloc_or_retry!
    @ /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:1 [inlined]
  [4] AMDGPU.Runtime.Mem.HIPBuffer(bytesize::Int64; stream::HIPStream)
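As a quick sanity check on the numbers in this message, the sketch below compares the reported hard limit with 80 % of the device memory shown by AMDGPU.versioninfo(), and computes how much of that limit the HIP pool was using when the allocation failed. It assumes the limit is simply the configured fraction of the displayed device memory; that is my reading of the output, not something confirmed from the AMDGPU source.

```julia
# Figures copied from the report above.
device_memory_gib  = 63.984   # "Memory" column of AMDGPU.versioninfo()
limit_fraction     = 0.80     # hard_memory_limit = "80 %"
reported_limit_gib = 51.188   # "Hard memory limit" in the error message
pool_used_mib      = 393.358  # "HIP pool used" in the error message

# Assumption: limit = fraction * device memory.
computed_limit_gib = limit_fraction * device_memory_gib

# Share of the hard limit occupied by the pool at the time of failure.
pool_share = pool_used_mib / (reported_limit_gib * 1024)

println("computed limit: ", round(computed_limit_gib; digits=3), " GiB")
println("reported limit: ", reported_limit_gib, " GiB")
println("pool share of limit: ", round(100 * pool_share; digits=2), " %")
```

The computed and reported limits agree to within display rounding, and the pool holds well under 1 % of the limit, so the failure does not look like the pool itself being full.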

Maybe the latter error is the same as ROCm/hip#3422 (comment).

The issue also persists with the current version of AMDGPU:

(examples) pkg> st AMDGPU
Status `/pfs/lustrep4/users/barthale/.julia/dev/FlowMatching/examples/Project.toml`
  [21141c5a] AMDGPU v2.1.2

Labels: bug (Something isn't working)
