Sorry, unfinished
This article was never finished as is never going to be finished, but - it still contains some useful knowledge, so I leave it AS IS.Intro
When working on Mojaloop's performance, we found that our node.js instances (which we have 100+) on our AWS performance deployment have intermittent problems with DNS resolution.This article is a simple case study with some workaround-ish solution at the end.
It took a while to understand what's going on and "the internet" pointed to lot of similar stories - but without any clear answers or solutions.
The problem: node.js fails to resolve DNS
We're running a performance test: tens of workers process transactions in hundreds/sec. Our workers are node.js instances. They talk to kafka (producing and consuming messages) and to http (e.g. sending notifications).We've noticed quite rare occurence of EAI_AGAIN error reported by http client on node.js. The error is highly intermittent - happens randomly one 1-2 of the workers every few seconds:

Chart explained:
- Each color represents one instance
- Each spike is exactly one occurence
- Chart is made in grafana, feeding data from prometheus. Node.js instances report this data using prometheus node library.

TODO: Mess
We hit random getaddrinfo() timings
DNS attempts increase:
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
ndots explained: https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html
/etc/resolv.conf options.attempts: http://man7.org/linux/man-pages/man5/resolv.conf.5.html
linux alpine: https://gitlab.alpinelinux.org/alpine/aports/issues/10063
musl libc: https://wiki.musl-libc.org/functional-differences-from-glibc.html#Name_Resolver_.2F_DNS
example similar report: https://www.openwall.com/lists/musl/2017/09/28/1
apt-get install net-tools
# cat /etc/resolv.conf
nameserver 10.43.0.10
search backend.svc.cluster.local svc.cluster.local cluster.local eu-west-2.compute.internal
options ndots:5
kubectl -n backend get po
kubectl -n backend exec -it back-ml-api-adapter-handler-notification-6f878c54df-zw5rg -- sh
while true; do date; dig facebook.com; sleep 1; done | tee resolve.log
while true; do date; dig facebook.com; sleep 0.1; done
while true; do date; time dig facebook.com; sleep 0.1; done | grep status
while true; do date; time dig facebook.com; sleep 0.1; done | grep elapsed
while true; do date; time dig facebook.com; sleep 0.002; done
0.00user 0.00system 0:05.01elapsed 0%CPU (0avgtext+0avgdata 10716maxresident)k
0inputs+0outputs (0major+646minor)pagefaults 0swaps