What is AI Inferencing at the Edge and Why is it Important for Business?
AI inference at the edge refers to running trained machine learning (ML) models on servers closer to end users than traditional cloud AI inference does. Edge inference accelerates the response time of ML models, enabling real-time AI applications in industries such as gaming, healthcare, and retail.
What is AI inference at the edge?
Before we look specifically at AI inference at the edge, it’s worth understanding what AI inference is in general. In the AI/ML development lifecycle, inference is where a trained ML model performs tasks on new, previously unseen data, such as making predictions or generating content. AI inference happens when end users interact directly with an ML model embedded in an application. For example, when a user enters a prompt into ChatGPT and gets a response back, the moment that ChatGPT “thinks” is the moment that inference happens, and the output is the result of that inference.
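To make the training-versus-inference split concrete, here is a minimal sketch using a toy threshold classifier. All numbers and function names are illustrative, not taken from any real model: training happens once on labeled data, while inference runs on every new, unseen input.

```python
# Minimal sketch of the training-vs-inference split, using a toy
# one-dimensional classifier (illustrative, not a real ML model).

def train(samples):
    """'Training': fit a decision threshold from labeled (x, label) pairs."""
    positives = [x for x, label in samples if label == 1]
    negatives = [x for x, label in samples if label == 0]
    # Midpoint between the class means serves as the decision boundary.
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2

def infer(threshold, x):
    """'Inference': apply the already-trained model to new, unseen input."""
    return 1 if x >= threshold else 0

threshold = train([(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)])  # happens once, offline
print(infer(threshold, 7.5))  # happens on every user request -> 1
```

The point of the split is that `train` is expensive and runs rarely, while `infer` is cheap and runs constantly; edge inference is about moving only the `infer` step close to users.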
AI inference at the edge is a subset of AI inference where an ML model runs on a server close to end users; for example, in the same region or even the same city. This proximity reduces latency to milliseconds for faster model response, which is beneficial for real-time applications such as image recognition, fraud detection, or gaming map generation.
Head of AI Product at Gcore.
How AI inference at the edge compares to edge AI
AI inference at the edge is a subset of edge AI. Edge AI involves processing data and running ML models closer to the data source rather than in the cloud. Edge AI encompasses everything related to edge AI computing, from edge servers (the metro edge) to IoT devices and telecom base stations (the far edge). Edge AI also involves training at the edge, not just inference. In this article, we’ll focus on AI inference on edge servers.
How inference at the edge compares to inference in the cloud
With cloud AI inference, you run an ML model on a remote cloud server; user data is sent to and processed in the cloud. In this case, an end user can interact with the model from a different region, country, or even continent. As a result, the latency of cloud inference ranges from hundreds of milliseconds to seconds. This type of AI inference is suitable for applications that do not require local data processing or low latency, such as ChatGPT, DALL-E, and other popular GenAI tools. Edge inference differs in two related ways:
- Inference takes place closer to the end user
- Latency is lower
How AI Inference Works at the Edge
AI inference at the edge relies on an IT infrastructure with two key architectural components: a low-latency network and servers powered by AI chips. If you need scalable AI inference that can handle peak loads, you also need a container orchestration service like Kubernetes, which runs on edge servers and ensures that your ML models can scale up and down quickly and automatically. Today, only a few vendors have the infrastructure to deliver global AI inference at the edge that meets these requirements.
Low latency network: A provider offering AI inference at the edge must have a distributed network of edge points of presence (PoPs) where servers are located. The more edge PoPs, the faster the network roundtrip time, meaning ML model responses happen faster for end users. A provider should have dozens or even hundreds of PoPs globally and should offer smart routing, which routes a user request to the nearest edge server to efficiently and effectively utilize the globally distributed network.
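The smart-routing idea above can be sketched as picking the PoP nearest to the user. The PoP names, coordinates, and nearest-by-distance policy below are all hypothetical simplifications; a real edge network would also weigh measured round-trip times and PoP load.

```python
import math

# Hypothetical PoP catalog: name -> (latitude, longitude).
POPS = {
    "frankfurt": (50.11, 8.68),
    "singapore": (1.35, 103.82),
    "ashburn": (39.04, -77.49),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_pop(user_location):
    """Route a request to the PoP geographically closest to the user."""
    return min(POPS, key=lambda name: haversine_km(user_location, POPS[name]))

print(nearest_pop((48.85, 2.35)))  # a user in Paris -> "frankfurt"
```

Adding more PoPs to the catalog directly shrinks the worst-case distance (and so the round-trip time) for any given user, which is why PoP count matters.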
Servers with AI accelerators: To reduce computation time, you should run your ML model on a server or VM powered by an AI accelerator, such as an NVIDIA GPU. Some GPUs are designed specifically for AI inference. For example, NVIDIA states that the L40S delivers up to 5x higher inference performance than the A100, a GPU that, like the H100, is designed primarily for training large ML models but is also widely used for inference. Inference-oriented GPUs like the L40S can therefore be a strong, cost-effective choice for running AI inference at the edge.
Container Orchestration: Deploying ML models in containers makes models scalable and portable. A provider can manage an underlying container orchestration tool on your behalf. In that setup, an ML engineer who wants to integrate a model into an application would simply upload a container image with an ML model and get a ready-to-use ML model endpoint. When load spikes, containers with your ML model are automatically scaled up and then scaled back down when load decreases.
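The autoscaling behavior described above can be sketched as a proportional scaling rule, similar in spirit to the one Kubernetes' Horizontal Pod Autoscaler uses. The function name, load units, and replica limits here are illustrative assumptions, not a provider's actual API.

```python
import math

def desired_replicas(current, current_load, target_load, min_r=1, max_r=10):
    """Proportional horizontal-scaling rule (HPA-style sketch):
    scale the replica count by the ratio of observed to target load,
    clamped to [min_r, max_r]. All names and limits are illustrative."""
    if current_load == 0:
        return min_r  # no traffic: scale down to the floor
    desired = math.ceil(current * current_load / target_load)
    return max(min_r, min(max_r, desired))

# Load spikes to 3x the per-replica target: scale 2 -> 6 containers.
print(desired_replicas(current=2, current_load=180, target_load=60))
# Load drops off: scale 6 -> 2 containers and stop paying for the rest.
print(desired_replicas(current=6, current_load=20, target_load=60))
```

This is why containerized deployment pairs naturally with per-use billing: replicas (and cost) track demand rather than peak capacity.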
Key Benefits of AI Inference at the Edge
AI inference at the edge offers three key benefits across industries and use cases: low latency, security and sovereignty, and cost efficiency.
Low latency
The lower the network latency, the faster your model will respond. If a provider's average network latency is under 50ms, this is suitable for most applications that require near-instantaneous responses. In comparison, cloud latency often runs to a few hundred milliseconds or more, depending on your location relative to the cloud server. That's a noticeable difference for an end user, with cloud latency potentially causing frustration as end users wait for their AI responses.
Keep in mind that a low latency network only takes into account data travel time. A network latency of 50ms doesn’t mean that users will get an AI output in 50ms; you need to add in the time it takes for the ML model to perform inference. That ML model processing time depends on the model being used and can account for the majority of the processing time for end users. That’s all the more reason to make sure you’re using a low latency network so that your users get the best possible response time while ML model developers continue to improve the speed of model inference.
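The latency budget described above is simple arithmetic: total response time is the network round trip plus the model's own inference time. The millisecond figures below are illustrative, not benchmarks.

```python
def response_time_ms(network_rtt_ms, model_time_ms):
    """Rough end-to-end latency budget: network round-trip time plus
    the ML model's own inference time (illustrative numbers only)."""
    return network_rtt_ms + model_time_ms

# A 50 ms edge network does not mean a 50 ms answer:
print(response_time_ms(network_rtt_ms=50, model_time_ms=400))   # 450 ms total
# The same model served from a distant cloud region:
print(response_time_ms(network_rtt_ms=300, model_time_ms=400))  # 700 ms total
```

Note that the model term often dominates, but the network term is the one the infrastructure provider fully controls, which is why shrinking it at the edge still pays off as models themselves get faster.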
Security and sovereignty
Keeping data at the edge, meaning local to the user, simplifies compliance with local laws and regulations, such as the GDPR and its equivalents in other countries. An edge inference provider should configure its inference infrastructure to comply with local laws to ensure you and your users are properly protected.
Edge inference also increases the confidentiality and privacy of your end-users’ data because it is processed locally instead of being sent to remote cloud servers. This reduces the attack surface and minimizes the risk of data exposure in transit.
Cost efficiency
Typically, a provider will only charge for the compute power used by the ML model. This, along with carefully configured autoscaling and model execution schedules, can significantly reduce inference costs.

Who should be using AI inference at the edge?
Here are some common scenarios where edge inference would be the optimal choice:
- Low latency is crucial for your application and users. A wide range of real-time applications, from facial recognition to trading analysis, require low latency. Edge inference provides the option for the lowest latency inference.
- Your user base is spread across multiple geographic locations. In this case, you need to provide the same user experience, meaning the same low latency, to all your users, regardless of their location. This requires a globally distributed edge network.
- You don’t want to deal with infrastructure maintenance. If supporting cloud and AI infrastructure is not part of your core business, it may be worth delegating these processes to an experienced, knowledgeable partner. You can then focus your resources on developing your application.
- You want to keep your data local, for example within the country where it is generated. In this case, you need to perform AI inference as close to your end users as possible. A globally distributed edge network can meet this need, while the cloud probably doesn’t offer the level of distribution you need.
Which industries benefit from AI inference at the edge?
AI inference at the edge is beneficial for any industry where AI/ML is being used, but especially those developing real-time applications. In the technology sector, this would include generative AI applications, chatbots and virtual assistants, data augmentation, and AI tools for software engineers. In gaming, it would be AI content and map generation, real-time player analytics, and real-time AI bot customization and conversation. For the retail market, typical applications would be smart grocery with self-checkout and merchandising, virtual try-on and content generation, and predictions and recommendations.
In manufacturing, the benefits are real-time defect detection in production pipelines, VR/XR applications, and rapid response feedback, while in the media and entertainment industry it would be content analytics, real-time translation, and automated transcription. Another sector developing real-time applications is automotive, specifically rapid response for autonomous vehicles, vehicle personalization, advanced driver assistance, and real-time traffic updates.
Conclusion
For organizations looking to deploy real-time applications, AI inference at the edge is a critical part of their infrastructure. It significantly reduces latency and delivers ultra-fast response times. For end users, this means a more seamless, engaging experience, whether they’re playing online games, using chatbots, or shopping online with a virtual try-on service. Enhanced data security means businesses can deliver superior AI services while protecting user data. AI inference at the edge is a critical enabler for AI/ML production deployment at scale, driving AI/ML innovation and efficiency across industries.
This article was produced as part of TechRadarPro's Expert Insights channel, where we showcase the best and brightest minds in the technology sector today. The views expressed here are those of the author and do not necessarily represent those of TechRadarPro or Future plc.