背景
我司近期正在大力推进容器项目,业务部署到QA环境后,发现个不大不小的问题,访问服务,网关nginx有概率返回502错误。
问题
测试反馈偶尔会有,出现频率不高,但复现概率较高。
这边查了下日志,确实查到了不少502的响应,匹配error日志,对应时间段伴随出现较多104: Connection reset by peer
异常错误。
不贴截图了,直接上nginx日志,包括访问日志及错误日志:
nginx访问日志
{"timestamp":"2018-11-11T00:48:04+08:00","version":1,"c-httpprotocol":"HTTP/1.1","s-ip":"<rs_srv_ip>:<rs_srv_port>","c-ip":"<client_ip>","s-upstrem":"<upstream_name>","c-bytes":349,"s-bytes":335,"s-response_time":0.001,"c-cost-time":0.001,"c-host":"qarest.yeshj.com","host":"<gw_srv_port>","c-uri-stem":"/502.html","c-request-args":"-","c-useragent":"-","c-method":"GET","c-referer":"-","c-requestid":"-","s-status":"502"}
nginx错误日志
2018/11/11 00:48:04 [error] 3280#0: *427542100 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <ssl_gw_srv_port>, server: qarest.yeshj.com, request: "POST /api/passport/account/query/ HTTP/1.1", upstream: "http://<rs_srv_ip>:<rs_srv_port>/api/passport/account/query/", host: "qarest.yeshj.com"
异常分析
HTTP status 中502的定义:
The server was acting as a gateway or proxy and received an invalid response from the upstream server.
网上对于nginx 502 Bad Gateway
错误,普遍的解释是nginx作为网关/反向代理设备时,后端真实服务器出现故障,网关/反向代理服务器从后端真实服务器接收到“无效”响应所导致。
从nginx错误日志中,发现大量104: Connection reset by peer
异常错误,也即上游服务reset
了连接,所以直接导致了502错误。
网上对于104: Connection reset by peer
的解释是:upstream发送了RST,将连接重置。
有人总结出重置的原因是:“nginx的buffer太小,timeout太小”,nginx http模块添加一下参数配置可修复。
首先查看当前nginx.conf是否有相应配置,如果有,值是多少,考虑加倍配置:
# cd /home/soft/openresty/nginx/conf/
# grep buffer * -irn
# grep timeout * -irn
参考配置参数如下:
client_header_buffer_size 64k;
large_client_header_buffers 4 64k;
client_body_buffer_size 20m;
fastcgi_buffer_size 128k;
fastcgi_buffers 4 128k;
fastcgi_busy_buffers_size 256k;
gzip_buffers 16 8k;
proxy_buffer_size 64k;
proxy_buffers 4 128k;
proxy_busy_buffers_size 256k;
keepalive_timeout 240;
fastcgi_connect_timeout 600;
fastcgi_send_timeout 600;
fastcgi_read_timeout 600;
proxy_connect_timeout 600s;
proxy_send_timeout 1200;
proxy_read_timeout 1200;
这边暂未进行测试验证,而是采用另一种方法缓解问题,具体如下。
缓解
同事发现另一个方法也可缓解该问题:将nginx http upstream模块中的keepalive
设置注释掉(原来配置为300),也即取消nginx的长连接配置,这样每次客户端请求连接均不保留,每次请求都需要新建连接。
官方同时提到,keepalive值不宜设置过大,否则有可能会影响到与上游服务器新连接的建立:
The connections parameter should be set to a number small enough to let upstream servers process new incoming connections as well.
修改后缓解的原因,我们推测是nginx长连接有并发瓶颈,如果设置太大的话,会影响到与上流服务器建连,导致:104: Connection reset by peer
,最终导致:nginx 502 Bad Gateway
。
后记
网上同样有朋友通过修改linux内核keepalive参数来提高http并发,同样也可以参考优化下,这边暂未进行测试验证。