We’ve been working on Federated Learning (FL) for autonomous agents and hit a hard bottleneck: standard FL (sending gradients back and forth) is bandwidth-heavy and requires massive edge compute.
We wrote this paper to propose an architectural shift we call Fluid Federated Learning (FFL).
The core engineering contributions are:
Prism Protocol: We implemented a "Software-Defined Memory" architecture. It uses io_uring to stream sparse, random projections of model weights directly from NVMe storage to the GPU.
This allows us to process "Virtual Batches" of terabyte-scale models on commodity hardware by exploiting the Johnson-Lindenstrauss lemma (Holographic Slicing).
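To make the "Holographic Slicing" idea concrete, here is a minimal, hypothetical sketch of the underlying Johnson-Lindenstrauss step: an Achlioptas-style sparse random projection applied to a flattened weight shard, processed in column blocks so the projection matrix is never materialised in full (the same property that lets a shard be projected as it streams in from storage). The function name, block size, and the block-wise regeneration from a seed are my assumptions, not the paper's Prism Protocol; the actual pipeline drives this with io_uring rather than an in-memory tensor.

```python
import torch

def sparse_jl_project(shard: torch.Tensor, k: int, seed: int, block: int = 8192) -> torch.Tensor:
    """Project a flattened d-dim weight shard down to k << d dims, block by block.

    Achlioptas-style sparse JL projection: effective entries are +sqrt(3/k),
    -sqrt(3/k), or 0 with probabilities 1/6, 1/6, 2/3, which preserves pairwise
    distances in expectation (the JL guarantee). The (k, d) projection matrix is
    never built in full; each column block is regenerated from the seed.
    """
    flat = shard.flatten()
    d = flat.numel()
    out = torch.zeros(k, dtype=flat.dtype)
    gen = torch.Generator().manual_seed(seed)   # seed makes the projection reproducible across devices
    scale = (3.0 / k) ** 0.5
    for start in range(0, d, block):
        cols = flat[start:start + block]
        draw = torch.randint(0, 6, (k, cols.numel()), generator=gen)   # 0 -> +1, 1 -> -1, 2..5 -> 0
        blockmat = (draw == 0).to(flat.dtype) - (draw == 1).to(flat.dtype)
        out += scale * (blockmat @ cols)
    return out

# Usage: compress a 1M-parameter shard into a 1024-dim sketch before it leaves the device.
shard = torch.randn(1_000_000)
sketch = sparse_jl_project(shard, k=1024, seed=42)
```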
Federated State-Space Duality (F-SSD): Instead of averaging gradients (which is bandwidth-heavy and can leak information about local training data), we exploit the duality between Transformers and SSMs (like Mamba) to federate the recurrent states themselves.
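For readers unfamiliar with the idea, here is a hypothetical sketch of what federating recurrent states could look like, rather than the paper's actual F-SSD rule: each client runs its local sequence through a linear state-space recurrence and ships only the final hidden state, and the server does a sample-count-weighted average of those states. The function names, the toy recurrence, and the FedAvg-style weighting are all my assumptions; the point is only that the per-round payload is a small state vector instead of a full gradient.

```python
import torch

def local_ssm_state(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Run a linear SSM recurrence h_t = A h_{t-1} + B x_t over a local sequence
    x of shape (T, d_in) and return the final hidden state h_T."""
    h = torch.zeros(A.shape[0])
    for x_t in x:                      # sequential scan for clarity; real SSMs use a parallel scan
        h = A @ h + B @ x_t
    return h

def federate_states(states: list[torch.Tensor], counts: list[int]) -> torch.Tensor:
    """Server-side aggregation: sample-count-weighted average of client states.
    This is FedAvg applied to states instead of gradients -- a guess at the
    aggregation rule, shown only to illustrate the payload size (d_state floats)."""
    total = sum(counts)
    return sum(n / total * s for s, n in zip(states, counts))

# Usage: two clients with different amounts of local data.
d_state, d_in = 16, 8
A, B = 0.9 * torch.eye(d_state), 0.1 * torch.randn(d_state, d_in)
client_states = [local_ssm_state(torch.randn(T, d_in), A, B) for T in (128, 512)]
global_state = federate_states(client_states, counts=[128, 512])
```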
The Result: We can run massive foundation models on edge devices with limited VRAM by treating the SSD as a "slow" memory tier without destroying optimization fidelity.
Has anyone here experimented with io_uring for model serving? We found the async I/O overhead negligible compared to the memory gains, but we're wondering if there are better ways to handle the sparse projections.
Happy to answer questions on the implementation.