Description
Questionnaire
- Does ROCm work for you outside of Julia, e.g. C/C++/Python?
yes
- Post output of `rocminfo`.
ROCk module version 6.3.6 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD EPYC 7A53 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7A53 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size:
- Post output of `AMDGPU.versioninfo()` if possible.
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name │ Version │ Path │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ + │ LLD │ - │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/llvm/bin/ld.lld │
│ + │ Device Libraries │ - │ /tmp/julia-depot-FlowMatching-barthale/artifacts/b46ab46ef568406312e5f500efb677511199c2f9/amdgcn/bitcode │
│ + │ HIP │ 6.2.41134 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libamdhip64.so │
│ + │ rocBLAS │ 4.2.1 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocblas.so │
│ + │ rocSOLVER │ 3.26.0 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsolver.so │
│ + │ rocSPARSE │ 3.2.0 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocsparse.so │
│ + │ rocRAND │ 2.10.5 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocrand.so │
│ + │ rocFFT │ 1.0.29 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/librocfft.so │
│ + │ MIOpen │ 3.2.0 │ /appl/lumi/SW/LUMI-24.03/G/EB/rocm/6.2.2/lib/libMIOpen.so │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │ Name │ GCN arch │ Wavefront │ Memory │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│ 1 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │ 64 │ 63.984 GiB │ 64.000 KiB │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘
Reproducing the bug
- Describe what's not working.
I am trying to train a neural network and run inference with it using Lux (1.22.1), AMDGPU (2.1.1) and Julia (1.12.0). I get either the error `HSA_STATUS_ERROR_OUT_OF_RESOURCES` or `Failed to successfully execute function and free resources for it. Reporting current memory usage: HIP pool used...`
Ref: https://discourse.julialang.org/t/gpu-memory-issue-on-amdgpu/133560
- Provide MWE to reproduce it (if possible).
Here is a reproducer for the error HSA_STATUS_ERROR_OUT_OF_RESOURCES where I allocate arrays of random size:
using AMDGPU
function mytest(N)
    total = 0f0;
    for i = 1:N
        total += sum(AMDGPU.ones(Float32,ntuple(i -> rand(1:250),4)))
    end
    return total
end
mytest(10_000)
The error message is:
:0:rocdevice.cpp :2982: 3734245763676 us: [pid:38494 tid:0x1498153ff700] Callback:
Queue 0x149815000000 Aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime
failed to allocate the necessary resources. This error may also occur when the core runtime library needs to
spawn threads or create internal OS-specific events. Code: 0x1008 Available Free mem : 0 MB
The error above occurs without any local preference settings. With the following settings in `LocalPreferences.toml`:
[AMDGPU]
hard_memory_limit = "80 %"
eager_gc = true
I have the error:
ERROR: Failed to successfully execute function and free resources for it.
Reporting current memory usage:
- HIP pool used: 393.358 MiB.
- HIP pool reserved: 393.358 MiB.
- Hard memory limit: 51.188 GiB.
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:44
[2] alloc_or_retry!(f::AMDGPU.Runtime.Mem.var"#5#6"{HIPStream, Int64, Base.RefValue{Ptr{Nothing}}}, isfailed::typeof(isnothing); stream::HIPStream)
@ AMDGPU.Runtime.Mem /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:34
[3] alloc_or_retry!
@ /tmp/julia-depot-FlowMatching-barthale/packages/AMDGPU/np0dr/src/runtime/memory/utils.jl:1 [inlined]
[4] AMDGPU.Runtime.Mem.HIPBuffer(bytesize::Int64; stream::HIPStream)
Maybe the latter error is the same as ROCm/hip#3422 (comment).
The issue also persists with the current version of AMDGPU:
(examples) pkg> st AMDGPU
Status `/pfs/lustrep4/users/barthale/.julia/dev/FlowMatching/examples/Project.toml`
[21141c5a] AMDGPU v2.1.2
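One way to narrow this down might be to free each temporary array explicitly, so that the result no longer depends on when Julia's garbage collector runs. The following is only a sketch based on the reproducer above: `mytest_free` is a hypothetical name, it assumes `AMDGPU.unsafe_free!` releases the array's buffer right away, and it has not been run on the system described here, so it may or may not avoid the error.

```julia
using AMDGPU

# Variant of the reproducer that releases every temporary array explicitly
# instead of leaving it to the garbage collector.
# `mytest_free` is a hypothetical helper name used for illustration only.
function mytest_free(N)
    total = 0f0
    for _ in 1:N
        dims = ntuple(_ -> rand(1:250), 4)  # random 4-D size, as in the reproducer
        x = AMDGPU.ones(Float32, dims)
        total += sum(x)                     # the reduction result is a host scalar
        AMDGPU.unsafe_free!(x)              # x must not be used after this call
    end
    return total
end
```

Calling `mytest_free(10_000)` mirrors the original test. If this variant runs to completion while `mytest(10_000)` fails, that would suggest the problem is related to delayed freeing rather than to a single allocation being too large.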