1. Problem background

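While reverse-proxying an API, nginx kept writing entries to its error log even though the requests themselves succeeded. The error log showed that connections to the upstream's IPv6 addresses were failing, yet the proxied request ultimately went through. This post therefore looks at nginx's load-balancing and failure-retry mechanisms.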

The nginx configuration is as follows:

server {
    listen       9000;
    server_name  localhost;
    #proxy_next_upstream off;
    location /ncs/telecom/phone-verify {
        proxy_pass https://open.e.189.cn/auth/verifyinfo.do;
        proxy_read_timeout 300;
        proxy_connect_timeout 300;
        proxy_redirect off;
    
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host 'open.e.189.cn';
        proxy_set_header X-Real-IP $remote_addr;
    }
}

The nginx error log contains the following:

[error] 3587705#0: *3471 connect() to [240e:698:100::2]:443 failed (101: Network is unreachable) while connecting to upstream, client: 127.0.0.1, server: localhost, request: "GET /ncs/telecom/phone-verify HTTP/1.1", upstream: "https://[240e:698:100::2]:443/auth/verifyinfo.do", host: "localhost:9000"
[error] 3587705#0: *3471 connect() to [240e:698:100::4]:443 failed (101: Network is unreachable) while connecting to upstream, client: 127.0.0.1, server: localhost, request: "GET /ncs/telecom/phone-verify HTTP/1.1", upstream: "https://[240e:698:100::4]:443/auth/verifyinfo.do", host: "localhost:9000"
[error] 3587705#0: *3471 connect() to [240e:698:100::5]:443 failed (101: Network is unreachable) while connecting to upstream, client: 127.0.0.1, server: localhost, request: "GET /ncs/telecom/phone-verify HTTP/1.1", upstream: "https://[240e:698:100::5]:443/auth/verifyinfo.do", host: "localhost:9000"
[error] 3587705#0: *3471 connect() to [240e:698:100::3]:443 failed (101: Network is unreachable) while connecting to upstream, client: 127.0.0.1, server: localhost, request: "GET /ncs/telecom/phone-verify HTTP/1.1", upstream: "https://[240e:698:100::3]:443/auth/verifyinfo.do", host: "localhost:9000"
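
To see at a glance which upstream addresses are failing, log lines like these can be summarized with a short script. The following is a minimal sketch in Python; the names CONNECT_FAILED and failed_peers are hypothetical, and the pattern only covers the connect() failure format shown above, not nginx error logs in general.

```python
import re

# Matches only the "connect() to <peer> failed (<errno>: <reason>)" form
# seen in the log excerpt; other error formats are ignored in this sketch.
CONNECT_FAILED = re.compile(
    r"connect\(\) to (?P<peer>\S+) failed \((?P<errno>\d+): (?P<reason>[^)]+)\)"
)


def failed_peers(log_lines):
    """Return (peer, errno, reason) for every connect() failure line."""
    results = []
    for line in log_lines:
        match = CONNECT_FAILED.search(line)
        if match:
            results.append(
                (match["peer"], int(match["errno"]), match["reason"])
            )
    return results


# Two shortened lines from the excerpt above, used as sample input.
sample = [
    "[error] 3587705#0: *3471 connect() to [240e:698:100::2]:443 failed "
    "(101: Network is unreachable) while connecting to upstream",
    "[error] 3587705#0: *3471 connect() to [240e:698:100::4]:443 failed "
    "(101: Network is unreachable) while connecting to upstream",
]

for peer, errno_value, reason in failed_peers(sample):
    print(peer, errno_value, reason)
```

Every failing peer in the excerpt is an IPv6 address with errno 101, which points at missing IPv6 connectivity on the proxy host rather than at a problem with the upstream service.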

2. proxy_pass

Syntax: proxy_pass URL;
Default: —
Context: location, if in location, limit_except

Sets the protocol and address of a proxied server and an optional URI to which a location should be mapped. As a protocol, “http” or “https” can be specified. The address can be specified as a domain name or IP address, and an optional port:

proxy_pass http://localhost:8000/uri/;

or as a UNIX-domain socket path specified after the word “unix” and enclosed in colons:

proxy_pass http://unix:/tmp/backend.socket:/uri/;

If a domain name resolves to several addresses, all of them will be used in a round-robin fashion. In addition, an address can be specified as a server group.

Parameter value can contain variables. In this case, if an address is specified as a domain name, the name is searched among the described server groups, and, if not found, is determined using a resolver.

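According to the proxy_pass documentation, the proxied address can be a domain name or an IP address. If it is a domain name that resolves to several addresses, nginx uses those addresses in a round-robin fashion.
The address can also name a server group. In effect, when the proxied address is a domain name, the addresses it resolves to are treated as an implicit upstream server group and are served round-robin by default.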

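The proxy_pass parameter may contain variables. In that case, if the address is given as a domain name, nginx first searches for that name among the defined server groups and, if it is not found there, determines the address using a resolver.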

3. resolver

Syntax: resolver address … [valid=time] [ipv6=on|off] [status_zone=zone];
Default: —
Context: http, server, location

Configures name servers used to resolve names of upstream servers into addresses, for example:

resolver 127.0.0.1 [::1]:5353;

The address can be specified as a domain name or IP address, with an optional port (1.3.1, 1.2.2). If port is not specified, the port 53 is used. Name servers are queried in a round-robin fashion.

Before version 1.1.7, only a single name server could be configured. Specifying name servers using IPv6 addresses is supported starting from versions 1.3.1 and 1.2.2.

By default, nginx will look up both IPv4 and IPv6 addresses while resolving. If looking up of IPv6 addresses is not desired, the ipv6=off parameter can be specified.

Resolving of names into IPv6 addresses is supported starting from version 1.5.8.

By default, nginx caches answers using the TTL value of a response. An optional valid parameter allows overriding it:

resolver 127.0.0.1 [::1]:5353 valid=30s;

From the resolver documentation we know that nginx can be configured with a resolver for domain names, and that IPv6 lookups can be turned off with ipv6=off.

The updated nginx proxy configuration is therefore:

server {
    listen       9000;
    server_name  localhost;
    resolver 114.114.114.114 valid=60s ipv6=off;
    
    location /ncs/telecom/phone-verify {
        set $telecom open.e.189.cn;
        proxy_pass https://$telecom/auth/verifyinfo.do;
        proxy_read_timeout 300;
        proxy_connect_timeout 300;
        proxy_redirect off;

        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Host 'open.e.189.cn';
        proxy_set_header X-Real-IP $remote_addr;
    }
}

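With this configuration, nginx resolves the domain name in the proxied address through the configured resolver, treats each answer as valid for 60s before resolving again, and skips IPv6 lookups. Note that the domain name in the proxy_pass parameter must be supplied through a variable: as the proxy_pass documentation quoted earlier states, the resolver is consulted only when the parameter contains variables. A literal domain name is resolved once when the configuration is loaded, and the resolver directive does not apply to it.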

The domain open.e.189.cn resolves to the following IP addresses:

[root@10-13-149-183 conf.d]# nslookup open.e.189.cn
Server:         10.13.255.1
Address:        10.13.255.1#53

Non-authoritative answer:
open.e.189.cn   canonical name = open.e-189.21cn.com.
Name:   open.e-189.21cn.com
Address: 42.123.76.75
Name:   open.e-189.21cn.com
Address: 42.123.76.52
Name:   open.e-189.21cn.com
Address: 42.123.76.87
Name:   open.e-189.21cn.com
Address: 240e:698:100::3
Name:   open.e-189.21cn.com
Address: 240e:698:100::2
Name:   open.e-189.21cn.com
Address: 240e:698:100::5
Name:   open.e-189.21cn.com
Address: 240e:698:100::4
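
To make the effect of ipv6=off concrete, the addresses above can be split by address family. This is a minimal sketch using Python's standard ipaddress module; split_by_family is a hypothetical helper name, and the list is copied from the nslookup output rather than resolved live.

```python
import ipaddress


def split_by_family(addresses):
    """Split textual IP addresses into (ipv4_list, ipv6_list)."""
    v4, v6 = [], []
    for text in addresses:
        ip = ipaddress.ip_address(text)
        (v4 if ip.version == 4 else v6).append(text)
    return v4, v6


# Addresses taken from the nslookup output above.
resolved = [
    "42.123.76.75",
    "42.123.76.52",
    "42.123.76.87",
    "240e:698:100::3",
    "240e:698:100::2",
    "240e:698:100::5",
    "240e:698:100::4",
]

ipv4_only, ipv6_only = split_by_family(resolved)
print("kept with ipv6=off:", ipv4_only)
print("no longer looked up:", ipv6_only)
```

The three IPv4 addresses are the only candidates left for the proxy once IPv6 lookups are disabled, which is why the unreachable-network errors stop.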

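After changing the configuration and restarting nginx, the error log no longer shows these entries, so the IPv6 connection errors from the proxy are resolved.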

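As seen earlier, the requests still succeeded even while nginx was logging failed connections to IPv6 addresses. That is the work of nginx's failure-retry mechanism, which the following sections examine.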

4. upstream

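The ngx_http_upstream_module module defines groups of servers that can be referenced by directives such as proxy_pass.
As noted in the proxy_pass section, when the proxied address is a domain name, the addresses it resolves to are treated as an upstream server group.
An example upstream configuration: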

resolver 10.0.0.1;

upstream dynamic {
    zone upstream_dynamic 64k;

    server backend1.example.com      weight=5;
    server backend2.example.com:8080 fail_timeout=5s slow_start=30s;
    server 192.0.2.1                 max_fails=3;
    server backend3.example.com      resolve;
    server backend4.example.com      service=http resolve;

    server backup1.example.com:8080  backup;
    server backup2.example.com:8080  backup;
}

server {
    location / {
        proxy_pass http://dynamic;
        health_check;
    }
}

Syntax: server address [parameters];
Default: —
Context: upstream

max_fails=number
sets the number of unsuccessful attempts to communicate with the server that should happen in the duration set by the fail_timeout parameter to consider the server unavailable for a duration also set by the fail_timeout parameter. By default, the number of unsuccessful attempts is set to 1. The zero value disables the accounting of attempts. What is considered an unsuccessful attempt is defined by the proxy_next_upstream, fastcgi_next_upstream, uwsgi_next_upstream, scgi_next_upstream, memcached_next_upstream, and grpc_next_upstream directives.
fail_timeout=time
sets the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable;
and the period of time the server will be considered unavailable.
By default, the parameter is set to 10 seconds.

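By default nginx uses round-robin to distribute requests across the servers in a server group, and the max_fails and fail_timeout parameters control whether a server in the upstream is considered available.
Together they mean: if a server accumulates max_fails failed attempts within fail_timeout, it is considered unavailable for the following fail_timeout period.
max_fails defaults to 1; setting it to 0 disables failure accounting, so the server is never marked unavailable. fail_timeout defaults to 10 seconds, so by default a server that fails once within 10s is treated as unavailable for the next 10s.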
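
The interaction of the two parameters can be illustrated with a small model. The sketch below is a simplified approximation in Python, not nginx's actual implementation: the Peer class and its method names are hypothetical, time is passed in explicitly so the behavior is easy to follow, and the failure window is modeled by resetting the counter once more than fail_timeout has passed since the previous failure.

```python
class Peer:
    """Simplified model of max_fails / fail_timeout for one server."""

    def __init__(self, name, max_fails=1, fail_timeout=10.0):
        self.name = name
        self.max_fails = max_fails        # documented default: 1
        self.fail_timeout = fail_timeout  # documented default: 10 seconds
        self.fails = 0
        self.last_fail = None             # time of the most recent failure

    def record_failure(self, now):
        # Assumption of this sketch: a failure older than fail_timeout
        # no longer counts toward max_fails.
        if self.last_fail is not None and now - self.last_fail > self.fail_timeout:
            self.fails = 0
        self.fails += 1
        self.last_fail = now

    def record_success(self):
        self.fails = 0

    def available(self, now):
        if self.max_fails == 0:
            # A zero value disables the accounting of attempts.
            return True
        if self.fails < self.max_fails:
            return True
        # Unavailable until fail_timeout has passed since the last failure.
        return now - self.last_fail > self.fail_timeout


peer = Peer("[240e:698:100::2]:443")
peer.record_failure(now=0.0)
print(peer.available(now=5.0))   # still inside the 10s penalty period
print(peer.available(now=10.5))  # penalty period has elapsed
```

With the defaults, a single failed connection is enough to take the server out of rotation for ten seconds, after which it is tried again.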

When a request to one server in the group fails, nginx retries it on another server; see the proxy_next_upstream directive.

5. proxy_next_upstream

Syntax: proxy_next_upstream error | timeout | invalid_header | http_500 | http_502 | http_503 | http_504 | http_403 | http_404 | http_429 | non_idempotent | off …;
Default:
proxy_next_upstream error timeout;
Context: http, server, location

Specifies in which cases a request should be passed to the next server:

error
an error occurred while establishing a connection with the server, passing a request to it, or reading the response header;
timeout
a timeout has occurred while establishing a connection with the server, passing a request to it, or reading the response header;

One should bear in mind that passing a request to the next server is only possible if nothing has been sent to a client yet. That is, if an error or timeout occurs in the middle of the transferring of a response, fixing this is impossible.

The directive also defines what is considered an unsuccessful attempt of communication with a server. The cases of error, timeout and invalid_header are always considered unsuccessful attempts, even if they are not specified in the directive. The cases of http_500, http_502, http_503, http_504, and http_429 are considered unsuccessful attempts only if they are specified in the directive. The cases of http_403 and http_404 are never considered unsuccessful attempts.

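According to this description, when nginx hits a condition such as error or timeout, it passes the request to the next server. The IPv6 connection failures seen earlier fall under the error case, so nginx retried them on the next address.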
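
The retry behavior that produced the log at the top of this post can be sketched as a loop over candidate addresses. This is an illustrative Python model under stated assumptions, not nginx code: proxy_request and fake_connect are hypothetical names, the peer order is copied from the error log (the order nginx actually picks is not documented here), and max_tries loosely mirrors the proxy_next_upstream_tries directive, where 0 means no limit.

```python
def proxy_request(peers, connect, max_tries=0):
    """Try peers in order; on a connection error move on to the next one.

    Returns (response, errors), where errors lists the failed attempts.
    """
    errors = []
    for tries, peer in enumerate(peers, start=1):
        try:
            return connect(peer), errors
        except OSError as exc:
            # Corresponds to the "error" case of proxy_next_upstream.
            errors.append((peer, exc.errno, exc.strerror))
        if max_tries and tries >= max_tries:
            break
    raise ConnectionError(f"all upstream attempts failed: {errors}")


def fake_connect(peer):
    """Stand-in for a real connection: IPv6 is unreachable on this host."""
    if ":" in peer:
        raise OSError(101, "Network is unreachable")
    return f"200 OK from {peer}"


# IPv6 peers in the order they appear in the error log, then the IPv4 ones.
peers = [
    "240e:698:100::2",
    "240e:698:100::4",
    "240e:698:100::5",
    "240e:698:100::3",
    "42.123.76.75",
    "42.123.76.52",
    "42.123.76.87",
]

response, errors = proxy_request(peers, fake_connect)
print(response)
for failed in errors:
    print("logged error:", failed)
```

In this model the client sees a single successful response, while each failed attempt leaves one error entry behind, which matches the combination of a working proxy and a noisy error log described at the start.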

Summary

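To summarize: when nginx reverse-proxies to a domain name, the addresses the name resolves to are treated as an upstream server group and are accessed round-robin. The resolver directive sets the name servers used for resolution and can also turn off IPv6 lookups.
When nginx sends requests to the servers in a group, it uses the max_fails and fail_timeout parameters to decide whether each server is available, and after a failed request it decides, based on the type of error, whether to retry on the next server.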