Durham helps Rockport Networks take on Nvidia

By Nick Flaherty

Researchers at Durham University in the UK are helping US startup Rockport Networks take on the giant of high performance computing networking, Nvidia.

“What we do is perform huge simulations of the universe, starting from the big bang and watching the evolution of the universe to the present day,” said Alastair Basden, head of the COSMA HPC Service in the Department of Physics at Durham University. “This requires months of compute on tens of thousands of cores with massive memory and interconnect.”

With applications running in parallel across multiple nodes, communication can be a significant bottleneck for artificial intelligence, machine learning and high performance computing workloads. As compute and storage increase in performance, the network increasingly becomes the limiting factor.
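As a rough illustration of where that time goes, here is a minimal sketch using mpi4py (an assumed tool for illustration, not drawn from the COSMA codebase) that times a local compute phase against a global reduction across ranks:

```python
# Minimal sketch (not from the COSMA codebase): timing how much of an MPI job
# is spent computing versus communicating, using mpi4py.
from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.random.rand(1_000_000)

t0 = time.perf_counter()
local = float(np.sum(data * data))            # local "compute" phase
t1 = time.perf_counter()
total = comm.allreduce(local, op=MPI.SUM)     # global "communication" phase
t2 = time.perf_counter()

if rank == 0:
    print(f"compute {t1 - t0:.6f}s, communication {t2 - t1:.6f}s, result {total:.3e}")
```

Run with, for example, `mpirun -n 4 python timing_sketch.py`; as the node count grows, the communication share of the runtime typically grows with it.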

Durham has three HPC systems: the Durham Intelligent NIC Environment (DINE), COSMA7 and COSMA8. It has been using Infiniband from Nvidia’s Mellanox division, along with BlueField network processors, in DINE and COSMA, but has also tested the Rockport technology in a proof of concept while the company has been in stealth mode.

COSMA7 has 452 nodes with 12,000 cores and Infiniband networking to 250TBytes of memory. COSMA8 extends that to 360TBytes of RAM and 50,000 cores.

This week the COSMA7 system is being split, with 224 nodes on Infiniband and 224 nodes on Rockport. “This will allow direct comparisons on properly sized scientific applications and a look at real-world congestion with all sorts of work from different users, with the file system connected to both fabrics,” said Basden.

“What we see is that the DINE network is prone to congestion and the performance slows. With the Rockport fabric this does not happen, performance remains constant as congestion increases. We introduced artificial congestion and the time to completion of the code remained constant,” he said.

A traditional spine-and-leaf network dilutes the available bandwidth by 65%, which degrades workload performance by 20 to 30%, so applications take longer to run. It is not just a network issue, it is a network performance and utilisation issue. Every time more switches are added there are more opportunities for congestion and for buffers to fill up, increasing the latency.
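As a back-of-the-envelope illustration of that arithmetic, the snippet below models how a 65% bandwidth dilution stretches job runtime; the communication fraction is an assumed figure chosen so the result lands in the quoted 20 to 30% range, not a measured value.

```python
# Back-of-the-envelope sketch with illustrative numbers: how spine-and-leaf
# oversubscription dilutes per-node bandwidth and stretches job runtime.
# The 65% dilution and 20-30% slowdown figures come from the article; the
# simple model below is an assumption, not Rockport's methodology.

node_link_gbps = 100        # nominal bandwidth of each node's uplink
dilution = 0.65             # fraction of bandwidth lost to oversubscription and congestion
effective_gbps = node_link_gbps * (1 - dilution)

comm_fraction = 0.13        # assumed share of runtime spent communicating
# Communication phases stretch in proportion to the bandwidth loss:
slowdown = (1 - comm_fraction) + comm_fraction * (node_link_gbps / effective_gbps)

print(f"effective bandwidth: {effective_gbps:.0f} Gbit/s")
print(f"estimated job slowdown: {(slowdown - 1) * 100:.0f}%")   # around 24% here
```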

Instead, the Rockport ‘switchless’ architecture uses per-packet adaptive routing in each node with a passive optical interconnect.

“The challenge with using Ethernet has been latency, so we change the way you move the traffic,” said Matt Williams, CTO of Rockport. “Instead of nodes connected to layers of switches, the nodes integrate the switch,” he said. “Each node is connected to 12 neighbour nodes, and we appear to be Ethernet to the end point for code consistency. We have opened up the data plane.”

The NC1225 network card includes all the distributed switching functionality in an FPGA, which runs rNOS, the Rockport Network Operating System. It has twelve 25Gbit/s channels carried in a single cable to a passive interconnect unit (not a switch) called the SHFL (or shuffle). This creates a mesh topology wired in a very intelligent way, says Williams. The 24 fibres in the cable form 12 fibre pairs, with field-replaceable optics. Six fibre pairs go to other nodes while the other six go to SHFLs in other racks. “This gives the same network distance in the rack and across the data centre, and the latency per hop is consistent,” he said. “We build a six-dimensional torus that scales to 1536 nodes, and in future products the scaling will go up.”
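To make the topology concrete, the sketch below computes the neighbour list for a plain six-dimensional torus in which every node links to its +1 and -1 neighbour in each dimension, giving 12 links per node. The actual SHFL wiring and the exact dimension sizes behind the 1536-node limit are not public, so the 4x4x4x4x4x4 shape here is only an assumption for illustration.

```python
# Sketch (assumed geometry, not Rockport's actual SHFL wiring): neighbour lists
# for a 6-dimensional torus where every node has 12 direct links, one in the
# +1 and -1 direction of each of the six dimensions.
from itertools import product

DIMS = (4, 4, 4, 4, 4, 4)   # assumed shape for illustration (4**6 = 4096 nodes)

def neighbours(coord):
    """Return the 12 torus neighbours of a node (wrap-around in each dimension)."""
    result = []
    for d, size in enumerate(DIMS):
        for step in (-1, +1):
            n = list(coord)
            n[d] = (n[d] + step) % size
            result.append(tuple(n))
    return result

nodes = list(product(*(range(s) for s in DIMS)))
print(len(nodes), "nodes,", len(neighbours(nodes[0])), "links per node")
```

Because the degree stays fixed at 12, adding nodes adds links and switching capacity in proportion, which is the linear scaling described below.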

This scales linearly, as adding another node also adds more switching. Boosting the scaling, rather than the raw link speed, is the important factor.

“25G is as fast as you can go without forward error correction (FEC), which adds latency. We like NRZ optics without PAM4 as that will allow us to get to high bandwidth,” said Williams.

“We provide dedicated hardware in the cards running rNOS with self-discovering nodes, and we pre-calculate 12 different ways to get to a destination, always monitoring the performance. When there is an issue, whether a failure or congestion, the cards can reallocate the link.”
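A minimal sketch of that idea, using assumed data structures rather than the rNOS implementation, would keep a scored set of pre-computed paths, feed in monitoring results, and pick the least-congested healthy path for each packet:

```python
# Sketch (assumed data structures, not the rNOS implementation): keep 12
# pre-computed paths to a destination, track monitored congestion and failures,
# and steer each packet over the best remaining path.
class Destination:
    def __init__(self, paths):
        self.congestion = {p: 0.0 for p in paths}   # path -> monitored congestion (0 = idle)
        self.failed = set()

    def report(self, path, congestion=None, failed=False):
        """Feed back monitoring results for one path."""
        if failed:
            self.failed.add(path)
        elif congestion is not None:
            self.congestion[path] = congestion

    def pick(self):
        """Choose the least-congested healthy path for the next packet."""
        healthy = {p: c for p, c in self.congestion.items() if p not in self.failed}
        return min(healthy, key=healthy.get)

dest = Destination([f"path-{i}" for i in range(12)])
dest.report("path-0", congestion=0.9)   # heavily loaded link
dest.report("path-3", failed=True)      # link down
print(dest.pick())                      # selects one of the idle paths, e.g. path-1
```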

It breaks the Ethernet packets into smaller units called FLITs of between 64 and 160 bytes, with strict priority queuing in the switch implemented in the Xilinx/AMD FPGA for a maximum blocking time of 25ns. The smaller packets help maintain the low latency, which compares to a latency of 15,000ns on Infiniband switches.
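A simple sketch of that segmentation step, with an assumed framing rather than Rockport's actual wire format:

```python
# Sketch (assumed framing, not Rockport's wire format): chopping an Ethernet
# frame into FLITs of between 64 and 160 bytes.
FLIT_MIN, FLIT_MAX = 64, 160

def to_flits(frame: bytes, flit_size: int = FLIT_MAX):
    """Split a frame into flits; pad the tail flit up to the 64-byte minimum."""
    assert FLIT_MIN <= flit_size <= FLIT_MAX
    flits = [frame[i:i + flit_size] for i in range(0, len(frame), flit_size)]
    if flits and len(flits[-1]) < FLIT_MIN:
        flits[-1] = flits[-1].ljust(FLIT_MIN, b"\x00")
    return flits

frame = bytes(1500)                      # a full-size Ethernet payload
flits = to_flits(frame)
print(len(flits), "flits, last two sizes:", [len(f) for f in flits[-2:]])
```

Because no flit is larger than 160 bytes, a latency-sensitive packet is never held up behind a full-size frame, which helps bound how long it can be blocked behind bulk traffic.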

The company is now designing dedicated hardware. “We have ASIC programmes under way – we see performance enhancements…”

The autonomous network manager (ANM) is another key element, says Williams. It provides per-job analysis going back seven days, along with a view of the optical power of each connection and the CRC error rate, to analyse performance in detail.
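A hypothetical sketch of how such telemetry might be consumed, with an invented record layout since the ANM's actual schema is not public:

```python
# Hypothetical sketch (invented record layout, not the ANM schema): flag links
# whose optical power or CRC error rate suggests a degrading connection.
from dataclasses import dataclass

@dataclass
class LinkStats:
    link: str
    rx_power_dbm: float     # received optical power
    crc_errors: int         # CRC errors seen in the sampling window
    packets: int            # packets seen in the same window

def suspect(stats, min_power_dbm=-9.0, max_crc_rate=1e-6):
    """Thresholds are illustrative assumptions, not Rockport defaults."""
    crc_rate = stats.crc_errors / max(stats.packets, 1)
    return stats.rx_power_dbm < min_power_dbm or crc_rate > max_crc_rate

links = [
    LinkStats("node07:port3", rx_power_dbm=-4.2, crc_errors=0, packets=10_000_000),
    LinkStats("node12:port9", rx_power_dbm=-10.5, crc_errors=84, packets=9_500_000),
]
for s in links:
    print(s.link, "suspect" if suspect(s) else "ok")
```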

www.durham.ac.uk; www.rockportnetworks.com
