Roman Pietrzak aka Yosh
EAI_AGAIN error on k8s - Investigating node.js failing to resolve DNS on kubernetes
Last update: 2020-03-27


Sorry, unfinished

This article was never finished and is never going to be, but it still contains some useful knowledge, so I am leaving it as is.

Intro

While working on Mojaloop's performance, we found that our node.js instances (we have 100+ of them) on our AWS performance deployment have intermittent problems with DNS resolution.
This article is a simple case study with a workaround-style solution at the end.
It took a while to understand what was going on, and "the internet" pointed to a lot of similar stories, but without any clear answers or solutions.

The problem: node.js fails to resolve DNS

We're running a performance test: tens of workers process hundreds of transactions per second. Our workers are node.js instances. They talk to Kafka (producing and consuming messages) and make HTTP calls (e.g. sending notifications).

We noticed a fairly rare EAI_AGAIN error (the getaddrinfo() code for a temporary failure in name resolution) reported by the HTTP client in node.js. The error is highly intermittent: it happens randomly on 1-2 of the workers every few seconds.

Chart explained:
  • Each color represents one instance
  • Each spike is exactly one occurrence
  • The chart is made in Grafana, fed with data from Prometheus. The node.js instances report this data using the Prometheus node library.






TODO: Mess


We hit random getaddrinfo() timings


DNS attempts increase:
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
ndots explained: https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html
/etc/resolv.conf options.attempts: http://man7.org/linux/man-pages/man5/resolv.conf.5.html


linux alpine: https://gitlab.alpinelinux.org/alpine/aports/issues/10063
musl libc: https://wiki.musl-libc.org/functional-differences-from-glibc.html#Name_Resolver_.2F_DNS
example similar report: https://www.openwall.com/lists/musl/2017/09/28/1


apt-get install net-tools


# cat /etc/resolv.conf
nameserver 10.43.0.10
search backend.svc.cluster.local svc.cluster.local cluster.local eu-west-2.compute.internal
options ndots:5
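
To see why `ndots:5` multiplies DNS traffic, here is a minimal sketch of how a resolver expands a name using the `search` list above. It assumes the glibc-style rule described in resolv.conf(5): a name with fewer dots than `ndots` is tried with each search domain first and as-is last. The helper name `dns_candidates` is hypothetical, and musl's resolver may differ in details.

```shell
#!/bin/sh
# Sketch: list the names a glibc-style resolver would query, in order.
# dns_candidates NAME NDOTS SEARCH_DOMAIN...   (hypothetical helper)
dns_candidates() {
    name=$1; ndots=$2; shift 2

    # A trailing dot marks the name as absolute: no search list is applied.
    case $name in
        *.) printf '%s\n' "$name"; return ;;
    esac

    # Count the dots in the name.
    only_dots=$(printf '%s' "$name" | tr -cd '.')
    dots=${#only_dots}

    # Enough dots: the name is tried as-is before the search list.
    if [ "$dots" -ge "$ndots" ]; then printf '%s.\n' "$name"; fi

    for domain in "$@"; do printf '%s.%s.\n' "$name" "$domain"; done

    # Too few dots: the name is tried as-is only after the search list.
    if [ "$dots" -lt "$ndots" ]; then printf '%s.\n' "$name"; fi
}

dns_candidates facebook.com 5 \
    backend.svc.cluster.local svc.cluster.local cluster.local \
    eu-west-2.compute.internal
```

Under these assumptions, `facebook.com` has only one dot, so four search-domain lookups come before the real name, and each of them is one more chance to hit a slow or dropped query.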


kubectl -n backend get po
kubectl -n backend exec -it back-ml-api-adapter-handler-notification-6f878c54df-zw5rg -- sh


while true; do date; dig facebook.com; sleep 1; done | tee resolve.log
while true; do date; dig facebook.com; sleep 0.1; done
while true; do date; time dig facebook.com; sleep 0.1; done | grep status
while true; do date; time dig facebook.com; sleep 0.1; done | grep elapsed
while true; do date; time dig facebook.com; sleep 0.002; done


0.00user 0.00system 0:05.01elapsed 0%CPU (0avgtext+0avgdata 10716maxresident)k
0inputs+0outputs (0major+646minor)pagefaults 0swaps
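
The 5.01 s elapsed time above is what a single slow lookup looks like in the output of `time`. As a rough sketch, such lookups can be filtered from a captured log. It assumes the GNU `time` format shown above (`M:SS.ss` followed by `elapsed`; runs longer than an hour use a different format that this sketch ignores). The helper name `slow_lookups` is hypothetical, and the first sample line below is invented for illustration.

```shell
#!/bin/sh
# Sketch: print elapsed seconds for every timed run at or above a threshold.
# slow_lookups THRESHOLD_SECONDS   (hypothetical helper; reads the log on stdin)
slow_lookups() {
    awk -v limit="$1" '
        {
            for (i = 1; i <= NF; i++) {
                # Match fields such as "0:05.01elapsed" (minutes:seconds).
                if ($i ~ /^[0-9]+:[0-9]+\.[0-9]+elapsed$/) {
                    t = $i
                    sub(/elapsed$/, "", t)
                    split(t, parts, ":")
                    secs = parts[1] * 60 + parts[2]
                    if (secs >= limit) printf "%.2f\n", secs
                }
            }
        }'
}

# Two sample lines; only the second is at or above 1 second.
printf '%s\n' \
    '0.00user 0.00system 0:00.02elapsed 0%CPU (0avgtext+0avgdata 10716maxresident)k' \
    '0.00user 0.00system 0:05.01elapsed 0%CPU (0avgtext+0avgdata 10716maxresident)k' \
    | slow_lookups 1
```

GNU `time` writes its report to stderr, so a log meant for this filter would need to be captured with `2>&1`.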

