Our company recently migrated its WordPress website from a single server to a load-balanced cluster, and problems have been cropping up nonstop ever since.
We're currently facing a particularly bizarre situation: our monitoring system shows all servers running normally with all health checks passing, yet some users keep reporting that they cannot access the website and instead receive an "Origin DNS Error" message. Strangely, these users are spread across different regions and networks, and the failures occur without any discernible pattern.
Our operations team has been investigating for nearly seven hours now. We've cross-checked the DNS records multiple times and reviewed the server logs, but we still can't pinpoint the cause. What's even more frustrating is that everything works perfectly when we test from the office, yet customers keep reporting errors.
Has anyone encountered a similar situation? In a load-balanced environment, why might monitoring show normal operation while real user requests fail? Where should troubleshooting typically begin? We've already tried flushing DNS caches and restarting the servers, but the error still occurs intermittently.
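Since the error is intermittent, one check that might help narrow it down is resolving the origin hostname repeatedly and counting failures or changing answers. The following is only a minimal sketch using Python's standard library: `origin.example.com` is a placeholder rather than our real hostname, the function names are made up for illustration, and the local resolver's cache may hide short-lived failures.

```python
import socket
import time
from collections import Counter


def resolve_once(hostname):
    """Return a sorted tuple of IPs from the local resolver, or None if the lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return None
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return tuple(sorted({info[4][0] for info in infos}))


def summarize(results):
    """Count failed lookups and how often each distinct answer set appeared."""
    failures = sum(1 for r in results if r is None)
    answers = Counter(r for r in results if r is not None)
    return {
        "attempts": len(results),
        "failures": failures,
        "distinct_answers": dict(answers),
    }


def probe(hostname, attempts=20, delay=1.0):
    """Resolve the hostname repeatedly, pausing between lookups, and summarize the results."""
    results = []
    for _ in range(attempts):
        results.append(resolve_once(hostname))
        time.sleep(delay)
    return summarize(results)


if __name__ == "__main__":
    # Placeholder hostname (assumption); substitute the real origin record.
    print(probe("origin.example.com", attempts=5, delay=0.5))
```

Running this from several networks, not just the office, and comparing the summaries could show whether some resolvers fail or return different addresses than others.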
Since this is the company's official website, the issue has already affected our business operations, and we're quite concerned. We'd welcome guidance from anyone with experience in this area; any troubleshooting suggestions would be greatly appreciated. Thank you all in advance!


