Alibaba says its new low-level software has reduced network outages, lowered load balancing costs, and improved SmartNIC performance by shifting workloads to underused infrastructure. As reported by The Register, the company outlined its results in three research papers it plans to present at the SIGCOMM conference next week.
One of the papers introduces a system called ZooRoute, designed to keep cloud networks running when failures occur. Alibaba’s researchers describe it as “a fast failure recovery service that ensures global bypass in large-scale cloud networks in seconds.”
Network failures are a fact of life for cloud operators, so how quickly providers can respond makes a difference. Current approaches such as fast rerouting or traffic engineering take seconds to minutes to react, the company says. For end users, that can still mean interruptions or lost sessions. Because of this, some tenants have developed their own backup methods, often by paying for redundant resources or changing the way their applications interact with networks. Both options add cost and complexity.
ZooRoute attempts to solve this by constantly probing the network for alternate paths. If a link goes down, the system already knows which path is available and can redirect traffic immediately. The paper notes that Alibaba Cloud has used ZooRoute in production for 18 months, and during that time it has reduced overall outage time by more than 92%.
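The idea of knowing a backup path before a failure happens can be illustrated with a small Python sketch. It is not Alibaba's implementation: the `BypassTable` class, its method names, and the node names are hypothetical, and a real system would probe continuously over the live network instead of being fed results by hand.

```python
class BypassTable:
    """Hypothetical table of probed paths, kept per destination."""

    def __init__(self):
        # destination -> {path (tuple of hops): did the last probe succeed?}
        self._paths = {}

    def record_probe(self, dest, path, ok):
        """Store the latest probe result for one candidate path."""
        self._paths.setdefault(dest, {})[tuple(path)] = ok

    def pick_path(self, dest, failed_link=None):
        """Return the first healthy path that avoids the failed link, if any."""
        for path, ok in self._paths.get(dest, {}).items():
            if not ok:
                continue
            if failed_link is not None and uses_link(path, failed_link):
                continue
            return path
        return None


def uses_link(path, link):
    """True if the path crosses the given link in either direction."""
    a, b = link
    hops = list(zip(path, path[1:]))
    return (a, b) in hops or (b, a) in hops


table = BypassTable()
table.record_probe("dc-b", ["dc-a", "core-1", "dc-b"], True)
table.record_probe("dc-b", ["dc-a", "core-2", "dc-b"], True)

primary = table.pick_path("dc-b")
bypass = table.pick_path("dc-b", failed_link=("core-1", "dc-b"))
```

Because the probe results are already in the table, choosing the bypass is a lookup, not a fresh path computation, which is the property the article attributes to ZooRoute.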
Smoother load balancing with Hermes
Another research effort focuses on Hermes, a system that addresses inefficiencies in layer 7 load balancers. These devices are central to modern cloud networks, distributing millions of requests to available servers and workers. Traditional methods rely on Linux mechanisms such as epoll to hand connections from the kernel to user-space workers. While reliable, this can create bottlenecks and leave some workers overloaded while others sit idle.
In Alibaba Cloud’s networks, Hermes introduces a new scheduling layer based on eBPF, a Linux technology that lets sandboxed programs run inside the kernel. By filtering requests before they reach workers, Hermes can prioritise which traffic gets handled first and spread it more evenly. In testing, this approach reduced CPU use imbalances by about 90% and lowered uneven connection counts by more than 99%.
For operators, the results are tangible. Worker “hangs” – where processes get stuck and need intervention – fell by nearly 100%. At the same time, the cost of running layer 7 load balancing infrastructure dropped by almost 19%. The improvements point to more stable performance for tenants and lower operating costs for providers.
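The balancing goal behind Hermes, steering each new connection towards a worker that is not already busy, can be sketched in a few lines of Python. This is an illustration only: Hermes makes its decisions inside the kernel with eBPF, while the hypothetical `LeastLoadedDispatcher` below runs in user space and uses open connection counts as a stand-in for whatever load signals the real system tracks.

```python
class LeastLoadedDispatcher:
    """Hypothetical dispatcher: send each new connection to the idlest worker."""

    def __init__(self, worker_ids):
        # worker id -> number of currently open connections
        self.open_connections = {w: 0 for w in worker_ids}

    def dispatch(self):
        """Pick the worker with the fewest open connections (ties by id)."""
        worker = min(
            self.open_connections,
            key=lambda w: (self.open_connections[w], w),
        )
        self.open_connections[worker] += 1
        return worker

    def close(self, worker):
        """Record that one connection on this worker has finished."""
        if self.open_connections[worker] > 0:
            self.open_connections[worker] -= 1

    def imbalance(self):
        """Gap between the busiest and the idlest worker."""
        counts = self.open_connections.values()
        return max(counts) - min(counts)


dispatcher = LeastLoadedDispatcher(["w0", "w1", "w2"])
assigned = [dispatcher.dispatch() for _ in range(6)]
gap_after_six = dispatcher.imbalance()

dispatcher.close("w1")            # w1 now has spare capacity
next_worker = dispatcher.dispatch()
```

With wake-up-based handoff, whichever worker happens to respond first takes the connection; here the choice depends on current load, so the gap between the busiest and idlest worker stays small.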
Smarter SmartNICs with Nezha
The third paper introduces Nezha, a distributed system for balancing workloads in SmartNICs. These network cards, which carry their own processors, are used widely in large cloud environments. They take on networking and storage functions, freeing up host processor cycles.
In Alibaba Cloud’s operations, some SmartNICs had become overloaded while others were underused. Nezha addresses the issue by monitoring use and moving tasks from busy SmartNICs to ones with spare capacity.
The researchers write that deploying Nezha costs only a fraction of what adding new hardware would. They also report that Nezha has improved performance by removing bottlenecks from virtual switches running on SmartNICs and pushing them into the virtual machine kernel stack, where they are easier to manage.
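The monitor-and-move approach attributed to Nezha can be sketched as a simple greedy planner in Python. Everything here is an assumption made for illustration: the `plan_moves` function, the 80% high-water mark, the per-task load figures, and the NIC and task names do not come from the paper.

```python
def plan_moves(nic_tasks, high_water=80):
    """Plan task moves from overloaded NICs to the least-loaded NIC.

    nic_tasks maps a NIC name to {task name: load in percent}.
    Returns a list of (task, source NIC, destination NIC) tuples.
    """
    load = {nic: sum(tasks.values()) for nic, tasks in nic_tasks.items()}
    moves = []
    # Visit the busiest NICs first.
    for src in sorted(load, key=load.get, reverse=True):
        # Try the largest tasks first so fewer moves are needed.
        by_size = sorted(nic_tasks[src].items(), key=lambda kv: kv[1], reverse=True)
        for task, cost in by_size:
            if load[src] <= high_water:
                break  # this NIC is no longer overloaded
            dst = min(load, key=load.get)
            if dst == src or load[dst] + cost > high_water:
                continue  # moving this task would overload the target
            load[src] -= cost
            load[dst] += cost
            moves.append((task, src, dst))
    return moves


cluster = {
    "nic-a": {"flow-1": 50, "flow-2": 40},  # 90% busy, above the threshold
    "nic-b": {"flow-3": 20},
    "nic-c": {"flow-4": 10},
}
moves = plan_moves(cluster)
```

In this example the planner moves `flow-1` from `nic-a` to `nic-c`, bringing `nic-a` down to 40% without pushing any other card over the threshold. No new hardware is involved, which matches the cost argument the researchers make.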
What Alibaba’s cloud research means for providers
Taken together, the three systems demonstrate how large providers like Alibaba are trying to squeeze more efficiency and dependability out of existing infrastructure. Outages and bottlenecks directly affect customer confidence and drive unnecessary hardware spending.
The company’s research highlights the growing importance of software-based techniques for managing complicated cloud networks.