HSA_STATUS_ERROR_OUT_OF_RESOURCES on AMD Instinct MI250X when doing allocations in a loop #844

## Questionnaire

1. Does ROCm work for you outside of Julia, e.g. C/C++/Python?

yes

2. Post output of `rocminfo`.

<details>

<summary>output of `rocminfo`</summary>

```
ROCk module version 6.3.6 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD EPYC 7A53 64-Core Processor    
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD EPYC 7A53 64-Core Processor    
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:  
```
</details>

3. Post output of `AMDGPU.versioninfo()` if possible.
```
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name             │ Version   │ Path                                                                                                     │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│     +     │ LLD              │ -         │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/llvm/bin/ld.lld                                             │
│     +     │ Device Libraries │ -         │ /tmp/julia-depot-FlowMatching-barthale/artifacts/b46ab46ef568406312e5f500efb677511199c2f9/amdgcn/bitcode │
│     +     │ HIP              │ 6.2.41134 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libamdhip64.so                                              │
│     +     │ rocBLAS          │ 4.2.1     │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocblas.so                                               │
│     +     │ rocSOLVER        │ 3.26.0    │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsolver.so                                             │
│     +     │ rocSPARSE        │ 3.2.0     │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsparse.so                                             │
│     +     │ rocRAND          │ 2.10.5    │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocrand.so                                               │
│     +     │ rocFFT           │ 1.0.29    │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocfft.so                                                │
│     +     │ MIOpen           │ 3.2.0     │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libMIOpen.so                                                │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │                Name │               GCN arch │ Wavefront │     Memory │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│  1 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘
```

## Reproducing the bug

1. Describe what's not working.

I am trying to train a neural network and run inference with it using Lux (1.22.1), AMDGPU (2.1.1), and Julia (1.12.0), but I get either the error `HSA_STATUS_ERROR_OUT_OF_RESOURCES` or `Failed to successfully execute function and free resources for it. Reporting current memory usage: HIP pool used...`
Ref: https://discourse.julialang.org/t/gpu-memory-issue-on-amdgpu/133560

2. Provide MWE to reproduce it (if possible).

Here is a reproducer for the error `HSA_STATUS_ERROR_OUT_OF_RESOURCES` where I allocate arrays of random size:

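```julia
using AMDGPU

function mytest(N)
  total = 0f0
  for i = 1:N
    # Allocate a 4-D Float32 array with each dimension drawn from 1:250,
    # reduce it on the GPU, and drop the reference right away.
    total += sum(AMDGPU.ones(Float32, ntuple(_ -> rand(1:250), 4)))
  end
  return total
end

mytest(10_000)
```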

The error message is:
```
:0:rocdevice.cpp            :2982: 3734245763676 us: [pid:38494 tid:0x1498153ff700] Callback: 
Queue 0x149815000000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime 
failed to allocate the necessary resources. This error may also occur when the core runtime library needs to 
spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 0 MB
```
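
For scale, here is a rough estimate of how much memory the loop asks for. This is a sketch in plain Julia that needs no GPU; the variable names are mine and are not part of AMDGPU.jl, and it assumes the four dimensions are independent and uniform on `1:250`. A single array is at most about 14.6 GiB and about 0.92 GiB on average, so one allocation fits in the 63.984 GiB reported for the device, but the 10,000 iterations together request far more than that and therefore depend on buffers being released between iterations.

```julia
# Rough size estimate for the arrays requested by `mytest` (no GPU needed).
dim_range  = 1:250              # each dimension is drawn with rand(1:250)
n_dims     = 4
elem_bytes = sizeof(Float32)    # 4 bytes per element
n_iter     = 10_000

# Largest possible single array: every dimension equal to 250.
max_bytes = last(dim_range)^n_dims * elem_bytes

# Expected size of one array: the dimensions are independent, so the expected
# element count is the mean dimension raised to the number of dimensions.
mean_dim   = (first(dim_range) + last(dim_range)) / 2
mean_bytes = mean_dim^n_dims * elem_bytes

# Expected total requested over the whole loop.
total_bytes = n_iter * mean_bytes

gib(x) = x / 2^30
println("largest single array: ", round(gib(max_bytes), digits = 2), " GiB")
println("average single array: ", round(gib(mean_bytes), digits = 2), " GiB")
println("total over the loop:  ", round(gib(total_bytes), digits = 0), " GiB")
```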

I have no local preferences set. With the following settings in `LocalPreferences.toml`:

```
[AMDGPU]
hard_memory_limit = "80 %"
eager_gc = true
```

I get the following error instead:

```
ERROR: Failed to successfully execute function and free resources for it.
Reporting current memory usage:
- HIP pool used: 393.358 MiB.
- HIP pool reserved: 393.358 MiB.
- Hard memory limit: 51.188 GiB.

Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:44
  [2] alloc_or_retry!(f::AMDGPU.Runtime.Mem.var"#5#6"{HIPStream, Int64, Base.RefValue{Ptr{Nothing}}}, isfailed::typeof(isnothing); stream::HIPStream)
    @ AMDGPU.Runtime.Mem /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:34
  [3] alloc_or_retry!
    @ /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:1 [inlined]
  [4] AMDGPU.Runtime.Mem.HIPBuffer(bytesize::Int64; stream::HIPStream)
```

The latter error may be the same as https://github.com/ROCm/hip/issues/3422#issuecomment-2408574367.

The issue also persists with the current version of AMDGPU:

```
(examples) pkg> st AMDGPU
Status `/pfs/lustrep4/users/barthale/.julia/dev/FlowMatching/examples/Project.toml`
  [21141c5a] AMDGPU v2.1.2
```
