nginx返回502错误问题排查

nginx反向代理java服务抛“104: Connection reset by peer”错误

Posted by Mr.Zhou on November 13, 2018

背景

我司近期正在大力推进容器项目,业务部署到QA环境后,发现个不大不小的问题,访问服务,网关nginx有概率返回502错误。

问题

测试反馈偶尔会有,出现频率不高,但复现概率较高。

这边查了下日志,确实查到了不少502的响应,匹配error日志,对应时间段伴随出现较多104: Connection reset by peer异常错误。

不贴截图了,直接上nginx日志,包括访问日志及错误日志:

nginx访问日志

{"timestamp":"2018-11-11T00:48:04+08:00","version":1,"c-httpprotocol":"HTTP/1.1","s-ip":"<rs_srv_ip>:<rs_srv_port>","c-ip":"<client_ip>","s-upstrem":"<upstream_name>","c-bytes":349,"s-bytes":335,"s-response_time":0.001,"c-cost-time":0.001,"c-host":"qarest.yeshj.com","host":"<gw_srv_port>","c-uri-stem":"/502.html","c-request-args":"-","c-useragent":"-","c-method":"GET","c-referer":"-","c-requestid":"-","s-status":"502"}

nginx错误日志

2018/11/11 00:48:04 [error] 3280#0: *427542100 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <ssl_gw_srv_port>, server: qarest.yeshj.com, request: "POST /api/passport/account/query/ HTTP/1.1", upstream: "http://<rs_srv_ip>:<rs_srv_port>/api/passport/account/query/", host: "qarest.yeshj.com"

异常分析

HTTP status 中502的定义

The server was acting as a gateway or proxy and received an invalid response from the upstream server.

网上对于nginx 502 Bad Gateway错误,普遍的解释是nginx作为网关/反向代理设备时,后端真实服务器出现故障,网关/反向代理服务器从后端真实服务器接收到“无效”响应所导致。

从nginx错误日志中,发现大量104: Connection reset by peer异常错误,也即上游服务reset了连接,所以直接导致了502错误。

网上对于104: Connection reset by peer解释是:upstream发送了RST,将连接重置。

有人总结出重置的原因是:“nginx的buffer太小,timeout太小”,nginx http模块添加一下参数配置可修复。

首先查看当前nginx.conf是否有相应配置,如果有,值是多少,考虑加倍配置:

# cd /home/soft/openresty/nginx/conf/
# grep buffer * -irn
# grep timeout * -irn

参考配置参数如下:

client_header_buffer_size 64k;
large_client_header_buffers 4 64k;
client_body_buffer_size 20m;
fastcgi_buffer_size 128k;
fastcgi_buffers 4 128k;
fastcgi_busy_buffers_size 256k;
gzip_buffers 16 8k;
proxy_buffer_size 64k;
proxy_buffers 4 128k;
proxy_busy_buffers_size 256k;

keepalive_timeout 240;
fastcgi_connect_timeout 600;
fastcgi_send_timeout 600;
fastcgi_read_timeout 600;

proxy_connect_timeout 600s;
proxy_send_timeout 1200;
proxy_read_timeout 1200;

这边暂未进行测试验证,而是采用另一种方法缓解问题,具体如下。

缓解

同事发现另一个方法也可缓解该问题:将nginx http upstream模块中的keepalive设置注释掉(原来配置为300),也即取消nginx的长连接配置,这样每次客户端请求连接均不保留,每次请求都需要新建连接。

官方同时提到,keepalive值不宜设置过大,否则有可能会影响到与上游服务器新连接的建立:

The connections parameter should be set to a number small enough to let upstream servers process new incoming connections as well.

修改后缓解的原因,我们推测是nginx长连接有并发瓶颈,如果设置太大的话,会影响到与上流服务器建连,导致:104: Connection reset by peer,最终导致:nginx 502 Bad Gateway

后记

网上同样有朋友通过修改linux内核keepalive参数来提高http并发,同样也可以参考优化下,这边暂未进行测试验证。