Description
Environment
- libzmq version: 4.3.5 (statically linked: libzmq-mt-s-4_3_5.lib / libzmq-mt-sgd-4_3_5.lib)
- Platform: Windows (x86, 32-bit)
- Build type: Release
- Compiler & version: MSVC v143
libzmq is statically linked into our main executable (no separate libzmq*.dll module). The crash comes from the static lib code inside the EXE.
Summary
We are seeing intermittent process aborts inside libzmq with the assertion:
zmq_assert(dummy == 0);
in signaler_t::recv_failable().
The assertion fires while creating/connecting a new ZMQ_REQ socket (used for a short‑lived JSON‑RPC call) in our application. It is not tied to process shutdown; it happens during normal runtime under load.
Call stack (simplified)
Typical stack at the abort (symbolized):
```text
JsonRpcRequestClient::call
  -> JsonRpcIoFactory::createInstance
  -> IoServiceWorkerGroup::createSocket
  -> ZmqReqSocket::ZmqReqSocket
  -> Detail::ZmqSocket::connectImpl
  -> azmq / zmq internals ...
  -> zmq::signaler_t::recv_failable()
     zmq_assert(dummy == 0);
```
So the crash always surfaces while creating / connecting a fresh ZMQ_REQ socket, but we don’t know what earlier state causes dummy != 0.
How we use ZMQ (relevant parts)
We have a small wrapper over azmq + libzmq:
- A global IoServiceWorkerGroup with a boost::asio::io_service and N worker threads (a rough skeleton is sketched after this list).
- A ZmqSocket template that owns an azmq::socket and implements:
  - read(), using async_receive() (captures shared_from_this() to keep the socket alive).
  - write(), posting a handler on a strand that does a synchronous send().
  - runAfter(), which only creates an asio::steady_timer (no direct ZMQ calls).
  - close(), which calls shutdown(receive) + cancel() on the azmq::socket.
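For context, the worker group is shaped roughly like this (a simplified sketch, not the exact code; member names and details are approximate):

```cpp
#include <boost/asio.hpp>
#include <cstddef>
#include <thread>
#include <vector>

// Rough skeleton of the worker group described above (illustrative only).
// A single shared boost::asio::io_service is pumped by N worker threads;
// sockets created through the group run all of their handlers here.
class IoServiceWorkerGroup {
public:
    explicit IoServiceWorkerGroup(std::size_t numThreads)
        : m_work(m_io)
    {
        for (std::size_t i = 0; i < numThreads; ++i)
            m_threads.emplace_back([this] { m_io.run(); });
    }

    ~IoServiceWorkerGroup()
    {
        m_io.stop();
        for (auto& t : m_threads)
            t.join();
    }

    boost::asio::io_service& io() { return m_io; }

    // createSocket(endpoint) constructs a Detail::ZmqSocket<ZMQ_REQ> bound to
    // m_io and calls connectImpl(endpoint); omitted here because it depends on
    // the wrapper types shown below.

private:
    boost::asio::io_service m_io;
    boost::asio::io_service::work m_work;   // keeps run() from returning while idle
    std::vector<std::thread> m_threads;
};
```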
For REQ sockets specifically, each JSON-RPC call uses a new REQ socket instance:
- We call a factory that does g_workerGroupReq.createSocket(endpoint);
- This ends up constructing Detail::ZmqSocket<ZMQ_REQ> and calling connectImpl(endpoint).
- The call pattern is (see the sketch after this list):
  1. io->runAfter(10s, timeout_callback_for_write)
  2. io->write(request, completion_for_write)
  3. Our code blocks in p->get_future().get() until the write completes or times out.
  4. io->runAfter(20s, timeout_callback_for_read)
  5. io->read(read_completion)
  6. Again, we block in p->get_future().get() until the read or timeout completes.
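Put together, one call looks roughly like this (a simplified sketch of the flow above; the promise wiring, the callback signatures, the endpoint handling, and the guards against completing a promise twice are approximations, not our exact code):

```cpp
#include <chrono>
#include <future>
#include <memory>
#include <stdexcept>

using namespace std::chrono_literals;

// Illustrative sketch of one JSON-RPC call over a fresh REQ socket.
// "Json", "g_workerGroupReq", "endpoint" and the io-> methods are the ones
// described above; the callback signatures here are assumptions.
Json JsonRpcRequestClient::call(const Json& request)
{
    // New REQ socket per call; connectImpl(endpoint) runs inside createSocket().
    auto io = g_workerGroupReq.createSocket(endpoint);

    // Write phase: 10 s timeout armed first, then the send is posted; the
    // calling thread blocks until either completes. (The real code guards
    // against the promise being completed twice.)
    {
        auto p = std::make_shared<std::promise<void>>();
        io->runAfter(10s, [p] {
            p->set_exception(std::make_exception_ptr(std::runtime_error("write timeout")));
        });
        io->write(request, [p](std::exception_ptr e) {
            if (e) p->set_exception(e); else p->set_value();
        });
        p->get_future().get();
    }

    // Read phase: 20 s timeout, async_receive started via read(), and we block
    // again until the reply or the timeout arrives.
    auto p = std::make_shared<std::promise<Json>>();
    io->runAfter(20s, [p] {
        p->set_exception(std::make_exception_ptr(std::runtime_error("read timeout")));
    });
    io->read([p](const Json& reply) { p->set_value(reply); });
    return p->get_future().get();
}
```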
The read() implementation:
```cpp
void read(std::function func) override {
    std::lock_guard guard{ m_socketGuard };
    m_socket->async_receive(
        [self = shared_from_this(), func = std::move(func), this]
        (auto& ec, auto& msg, auto size) {
            receiveHandler(func, ec, msg, size);
        });
}
```
So as long as async_receive is pending, there is a shared_ptr keeping the ZmqSocket alive.
The write() implementation:
```cpp
void write(const Json& data, std::function func) override {
    std::weak_ptr self = shared_from_this();
    m_strand.post([data, func = std::move(func), self, this]() {
        try {
            if (auto s = self.lock()) {
                std::lock_guard guard{ m_socketGuard };
                m_socket->send(boost::asio::buffer(data.dump()));
                func(nullptr);
            }
        } catch (...) {
            func(std::current_exception());
        }
    });
}
```
If the socket has been destroyed before the posted handler runs, self.lock() fails and we do not call send() at all.
Lifetime considerations (what we checked)
We tried to audit our code for scenarios where a socket could be destroyed while ZMQ still has commands in its mailbox:
- The REQ socket is owned by a std::shared_ptr named io inside JsonRpcRequestClient::call().
- io remains alive until after:
  - the write phase (send or 10s timeout) completes, and
  - the read phase (receive or 20s timeout) completes.
- ZmqSocket::~ZmqSocket() calls close(), which does shutdown(receive) and cancel(), but this only happens when the last shared_ptr goes away, i.e. after both phases have completed (a sketch of this teardown path is at the end of this section).
- Because read()'s handler captures shared_from_this(), the socket can't be destroyed while async_receive is still pending.
- For SUB sockets (notifications), we also use the same read() pattern with shared_from_this() and long-lived owners.
So on this code path we have not found a place where we destroy a ZMQ socket object while a connect/send/recv operation is still pending for it. As far as we can tell, this path follows the documented rules (each socket is used from only one thread at a time, etc.).
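For completeness, the teardown path looks roughly like this (a simplified sketch, not the exact code; the exact azmq shutdown call used in close() is only indicated in the comment):

```cpp
// Illustrative sketch of the teardown described above. The destructor runs
// only when the last shared_ptr to the ZmqSocket goes away, i.e. after the
// write phase, the read phase, and any still-pending async_receive handler
// (which holds shared_from_this()) have all finished.
~ZmqSocket()
{
    close();
}

void close()
{
    std::lock_guard guard{ m_socketGuard };
    // Real close(): shut down the receive side of the azmq::socket
    // ("shutdown(receive)") and then cancel outstanding asio operations.
    m_socket->cancel();
}
```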
What we observe
- The abort is intermittent and seems correlated with higher load (multiple drives, offline sync, etc.), but is not tied to shutdown.
- The call stack always shows the failure during creation/connection of a new REQ socket for a simple JSON-RPC call (get_skipped_paths).
- There is no libzmq.dll in the process; everything comes from the statically linked libzmq-mt-s-4_3_5 library.
Unfortunately we don’t yet have a small standalone reproducer; the problem shows up after running our full application with OfflineSync under real workloads.
Questions
- Under what conditions, from libzmq's point of view, can signaler_t::recv_failable() read a non-zero dummy byte and hit zmq_assert(dummy == 0);?
- Are there any known issues in 4.3.5 on Windows that could cause this assertion in signaler_t when using statically linked libs?
- Is there additional instrumentation or debug logging we can enable inside libzmq to better understand:
  - what FD signaler_t::r is using at the moment of the assert?
  - whether that FD might have been re-used or closed/reopened by the OS?
- Are there any recommended patterns (or anti-patterns) regarding creating many short-lived REQ sockets (one per RPC call) on a shared context that could explain this?
We’re happy to:
- Build and run with an instrumented version of libzmq if you can suggest extra checks or debug prints.
- Attempt to extract a smaller reproducer, though so far the issue only shows under our full application load.