Modern interconnects often have programmable processors in the network interface that can be used to offload communication processing from the host CPU. In this paper, we explore different schemes to support collective operations at the network interface and propose a new collective protocol. With barrier as an initial case study, we demonstrate that much of the communication processing can be greatly simplified with this collective protocol. Accordingly, we have designed and implemented efficient and scalable NIC-based barrier operations over two high-performance interconnects, Quadrics and Myrinet.
Our evaluation shows that, over a Quadrics cluster of 8 nodes with the Elan3 network, the NIC-based barrier operation achieves a barrier latency of only 5.60µs. This is an improvement by a factor of 2.48 over the Elanlib tree-based barrier operation. Over a Myrinet cluster of 8 nodes with LANai-XP NIC cards, a barrier latency of 14.20µs is achieved. This is an improvement by a factor of 2.64 over the host-based barrier algorithm. Furthermore, an analytical model developed for the proposed scheme indicates that a NIC-based barrier operation on a 1024-node cluster can be performed with only 22.13µs latency over Quadrics and 38.94µs latency over Myrinet. These results indicate the potential for developing high-performance communication subsystems for next-generation clusters.
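As a rough reading aid, the reported figures can be related to one another with simple arithmetic. The sketch below assumes that each "factor of improvement" is the ratio of the baseline barrier latency to the NIC-based barrier latency; the function and variable names are hypothetical, and the baseline latencies it derives (roughly 13.9µs and 37.5µs) are implied values under that assumption, not measurements quoted from the text.

```python
# Assumption: improvement factor = baseline latency / NIC-based latency.
# All names below are illustrative and do not come from the paper.

def implied_baseline_latency_us(nic_latency_us: float, improvement_factor: float) -> float:
    """Baseline latency (in microseconds) implied by a reported improvement factor."""
    return nic_latency_us * improvement_factor


def latency_growth(small_cluster_us: float, large_cluster_us: float) -> float:
    """Ratio of the modeled large-cluster latency to the measured small-cluster latency."""
    return large_cluster_us / small_cluster_us


# Reported 8-node NIC-based latencies and improvement factors.
quadrics_baseline_us = implied_baseline_latency_us(5.60, 2.48)
myrinet_baseline_us = implied_baseline_latency_us(14.20, 2.64)

# Modeled 1024-node latencies relative to the measured 8-node latencies.
quadrics_growth = latency_growth(5.60, 22.13)
myrinet_growth = latency_growth(14.20, 38.94)

print(f"Implied Quadrics baseline: {quadrics_baseline_us:.2f} us")
print(f"Implied Myrinet baseline:  {myrinet_baseline_us:.2f} us")
print(f"Quadrics latency growth, 8 -> 1024 nodes: {quadrics_growth:.2f}x")
print(f"Myrinet latency growth, 8 -> 1024 nodes:  {myrinet_growth:.2f}x")
```

Under this reading, a 128-fold increase in node count corresponds to a modeled latency increase of less than a factor of four on either interconnect, which is the sense in which the barrier is described as scalable.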