Scalable Platforms with EKS and Karpenter

Beatriz De Miguel Pérez · Last updated on February 6, 2025 · 4 min read


When building scalable platforms on AWS, Kubernetes on EKS provides a powerful foundation for deploying and managing containerized applications. However, achieving cost-effective scalability requires fine-tuning the infrastructure, which is where Karpenter comes into play. Karpenter is a Kubernetes-native node provisioner that dynamically launches and terminates EC2 instances based on the needs of your workloads. This article outlines how to leverage Karpenter for scaling, identifies common challenges, and provides recommendations for optimal configuration.

Karpenter with EKS: A Powerful Combination

Amazon Elastic Kubernetes Service (EKS) simplifies Kubernetes management, while Karpenter automates the provisioning and scaling of EC2 instances in response to workload demands. By using Karpenter, you can:

  • Automatically adjust the size and quantity of your EC2 nodes.

  • Optimize instance selection for specific workloads.

  • Reduce infrastructure costs by scaling down unused resources. 

  • Leverage consolidation features to automatically rebalance workloads onto fewer nodes during low utilization periods.

  • Handle spot instance interruptions gracefully by automatically draining and replacing affected nodes before termination.

However, there are challenges to keep in mind when working with Karpenter, and configuring it properly for your applications is key to maximizing its benefits.

Key Challenges and Recommendations

Right-Size Karpenter Resources

Incorrect configuration can lead to unnecessary overprovisioning, which increases costs, or to underprovisioning, which can cause application failures.

Properly configuring NodePools is critical to ensuring efficient resource utilization and workload stability.

Recommendation:

  • Configure Karpenter to spread nodes across availability zones (AZs) for better resilience and distribution.

  • Select node sizes carefully so that your pods fit efficiently within the nodes. Avoid oversized nodes that lead to underutilization or excessive costs.

  • For workloads on Spot instances, limit the list of instance types to avoid excessively large types, as they are often less available. Use balanced instance types suitable for your workloads.

  • Provision Karpenter’s own pods with adequate resources, especially memory, as they are critical for scaling. Use Fargate nodes for Karpenter pods to ensure reliability and separation from application workloads. A minimal NodePool sketch follows this list.
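
To make this concrete, here is a minimal NodePool sketch applying these recommendations. It assumes Karpenter’s v1 API (karpenter.sh/v1) and an existing EC2NodeClass named default; the pool name, zones, and instance types are illustrative placeholders to adapt to your environment.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose            # hypothetical pool name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumes this EC2NodeClass already exists
      requirements:
        # Spread nodes across AZs for resilience (adjust to your region)
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
        # Prefer Spot, with on-demand as a fallback
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Keep the instance list to balanced, widely available sizes
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge", "m6i.large", "m6i.xlarge"]
  disruption:
    # Rebalance workloads onto fewer nodes during low utilization
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```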

Set Application-Level Resource Requests and Limits

Application deployments should be optimized for CPU and memory usage to prevent issues like throttling or OOM kills while maximizing node utilization.

CPU Management:

  • Avoid setting strict CPU limits: CPU is a compressible resource, so exceeding a limit won’t kill your pod, but throttling can noticeably slow it down during contention.

  • Adjust CPU requests based on observed usage to avoid throttling and ensure pods get the resources they need. Monitor for slowdowns, and analyze CPU metrics in detail if issues arise.

Memory Management:

Memory is more critical: exceeding available memory leads to OOM kills. Unlike CPU, memory cannot be throttled or reclaimed once allocated, so it’s crucial to set accurate memory requests and limits.
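
As a minimal sketch of these settings in a Deployment (names and values are hypothetical): the CPU request reflects observed usage with no CPU limit, while the memory limit equals the request so the pod fails fast and predictably instead of destabilizing the node.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                     # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: app
          image: my-registry/my-api:1.0.0   # placeholder image
          resources:
            requests:
              cpu: "500m"          # based on observed usage; no CPU limit set
              memory: "512Mi"
            limits:
              memory: "512Mi"      # limit = request for predictable behavior
```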

Horizontal Pod Autoscaling (HPA)

Horizontal Pod Autoscaling (HPA) automatically scales the number of pods based on resource metrics such as CPU and memory, or on custom metrics. However, the scaling formula can be difficult to interpret: the controller computes desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue) on each sync.

Recommendation:

  • Aim for at least 80% CPU utilization to avoid underusing resources.

  • Don’t rely solely on CPU metrics—if scaling isn’t keeping up, consider using custom metrics like queue length or request rates. Tools like KEDA can help integrate relevant metrics for more effective scaling.

  • Pay attention to scaling velocity so that it matches workload demand, avoiding both delays and excessive scaling; the behavior block in the sketch below is where this is tuned.
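
Here is an HPA sketch for the hypothetical my-api Deployment above: the target matches the ~80% CPU utilization recommendation, and the behavior block (autoscaling/v2) tunes scaling velocity.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api                     # hypothetical; targets the Deployment above
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # scale out above ~80% of the CPU request
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react quickly to demand spikes
      policies:
        - type: Percent
          value: 100                    # at most double the replicas...
          periodSeconds: 60             # ...per minute
    scaleDown:
      stabilizationWindowSeconds: 300   # scale in cautiously to avoid flapping
```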

Isolate Apps with Taints and Tolerations

Applications might have specific resource or security requirements, which could conflict with other workloads if not properly isolated. Taints and tolerations in Kubernetes help enforce isolation.

Recommendation:

  • Use taints and tolerations to ensure that specific workloads are scheduled only on nodes that meet their requirements (e.g., specific instance types or GPU availability).

  • Taints on nodes restrict which workloads can be scheduled on them, while tolerations on pods allow specific workloads to bypass those taints, as the sketch below shows.
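
A short sketch of this pattern with a hypothetical GPU pool: the taint in the Karpenter NodePool template keeps general workloads off GPU nodes, and only pods that tolerate it (and select the pool) are scheduled there.

```yaml
# Taint applied by Karpenter to every node it launches for this pool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu                        # hypothetical pool name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: workload-type       # illustrative taint key
          value: gpu
          effect: NoSchedule
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge"]
---
# Pod that tolerates the taint and explicitly targets the pool
apiVersion: v1
kind: Pod
metadata:
  name: training-job               # hypothetical workload
spec:
  nodeSelector:
    karpenter.sh/nodepool: gpu     # label Karpenter sets on its nodes
  tolerations:
    - key: workload-type
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: trainer
      image: my-registry/trainer:1.0.0   # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: "8Gi"
```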

DNS Matters

In a distributed environment like Kubernetes, DNS queries are frequent. Using more efficient DNS notation can help reduce unnecessary DNS requests, improving performance.

Recommendation:

  • Use fully qualified service names with a trailing dot, e.g. my-service.my-namespace.svc.cluster.local. (the "5-dot notation", counting the final dot). Kubernetes sets ndots:5 by default, so shorter names are first tried against every domain in the resolver's search list; the fully qualified form is resolved in a single query, minimizing lookup times and reducing the load on CoreDNS.

  • Scale CoreDNS properly. CoreDNS handles DNS queries within Kubernetes, and if it isn't scaled appropriately it can become a bottleneck that impacts your workloads. Monitor CoreDNS performance and scale the deployment if necessary: increasing the number of replicas or adjusting resource limits can help prevent DNS resolution delays. A client-side sketch follows.
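
A small client-side sketch with hypothetical names: the URL uses the fully qualified five-dot form so it resolves in a single query, and the optional dnsConfig lowers ndots for this pod (a further tweak beyond the five-dot notation itself) so short external names also skip most of the search list.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-demo                   # hypothetical pod
spec:
  containers:
    - name: app
      image: my-registry/my-api:1.0.0   # placeholder image
      env:
        # Trailing dot = absolute name: resolved in one query,
        # bypassing the resolver's search-list expansion entirely.
        - name: BACKEND_URL
          value: "http://my-service.my-namespace.svc.cluster.local.:8080"
  dnsConfig:
    options:
      - name: ndots
        value: "2"                 # optional: names with two or more dots
                                   # skip the cluster search domains
```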

AWS Quota Review

AWS has various service quotas and limits that, if reached, can disrupt your operations. Understanding your AWS quotas for services like EC2, EFS, and Spot Instances is crucial for maintaining a stable and cost-efficient infrastructure.

Recommendation:

Review your AWS quotas periodically to ensure you’re not hitting any limits. Examples include:

  • EFS throughput

  • Spot Instance quotas

  • On-demand instance quotas

  • QPS (queries per second) for pulling container images

By carefully managing your quotas and understanding their limits, you can avoid unexpected interruptions in your platform.

Conclusion

Scaling Kubernetes on AWS with EKS and Karpenter is an excellent strategy for achieving cost-effective, dynamic infrastructure scaling. However, optimal results require careful configuration and a deep understanding of potential challenges. By right-sizing resources, isolating workloads, optimizing DNS, and selecting the appropriate instance types, you can ensure that your applications run smoothly while keeping costs manageable. Additionally, effective monitoring and resource management practices such as setting proper resource limits, reviewing quotas, and leveraging HPA can provide a more resilient and scalable platform.

By following the recommendations outlined in this article, you can successfully build a robust and scalable platform using EKS and Karpenter.

If you’d like to hear more tips and stories from our team of innovators and developers, check out our whole range of Tech Team Stories.

