How can I avoid Akka Cluster nodes being quarantined?

Posted by 小编 on 2018-08-25

The Akka Remoting (Artery) documentation describes node quarantine as follows:

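The part highlighted in the red box says that Akka Remote quarantines a node directly when the Failure Detector triggers, whereas Akka Cluster does not quarantine a node when the Failure Detector triggers.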

The Akka Cluster documentation describes node quarantine as follows:

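Taken together, the two passages say that Akka Cluster does not quarantine a node when the Failure Detector triggers. In other words, a phi value above the threshold does not by itself cause a node to be quarantined. Instead, a node is quarantined when it fails to respond to a large number of system messages, and that quarantine cannot be undone.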

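This raises three questions:

  1. Is the role of the Failure Detector diminished in Akka Cluster? Would it make no difference if it were absent?
  2. When the target node fails to respond to a "large number" of system messages, how large is that number exactly? Can it be customized?
  3. How should Akka Cluster be configured so that nodes are never, or only very rarely, quarantined, no matter how poor the network is?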

I look forward to an answer from the experts!

Replies
  • 2018-08-25

    Answer from Akka Tech Lead Patrik Nordwall:

    Is the role of the Failure Detector diminished in Akka Cluster? Would it make no difference if it were absent?

    The Failure Detector is there to detect network problems and crashed nodes. If the heartbeat messages (request-reply) can’t get through (lost or delayed), it will mark the affected nodes as Unreachable. When heartbeats can get through again, the nodes will be marked as Reachable again. This doesn’t mean that nodes are removed from the cluster membership.

    To decide when an Unreachable node should be removed from the cluster, it has to be Downed. That is done by a downing provider, such as Lightbend’s Split Brain Resolver, or manually with a cluster management tool.
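
    As a rough illustration of the downing step, a downing provider is enabled through configuration. The sketch below assumes a recent open-source Akka release (2.6.6 or later), in which the Split Brain Resolver ships with Akka Cluster. The provider class name is different for the older commercial Lightbend module, and the strategy and timing values shown are illustrative choices, not recommendations taken from the answer.

    ```hocon
    akka.cluster {
      # Assumption: open-source Split Brain Resolver (Akka 2.6.6+).
      # The commercial Lightbend module used a different class name.
      downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"

      split-brain-resolver {
        # Illustrative choice: the side holding the majority of nodes survives.
        active-strategy = keep-majority
        # How long membership must be stable before a downing decision is made.
        stable-after = 20s
      }
    }
    ```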

    Some cluster tools, such as Cluster aware routers, use the reachability information to avoid routing messages to unreachable nodes.
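
    For readers on a poor network, the sensitivity of the Failure Detector itself can also be tuned, which affects how readily nodes are marked Unreachable (not whether they are quarantined). A minimal sketch, assuming the standard akka.cluster.failure-detector settings; the values are illustrative and should be checked against the reference.conf of the Akka version in use.

    ```hocon
    akka.cluster.failure-detector {
      # Phi threshold. A higher value tolerates more heartbeat jitter
      # but detects real failures more slowly. Illustrative value.
      threshold = 12.0

      # How long heartbeats may be missing (for example during GC pauses or
      # network hiccups) before a node is considered suspect. Illustrative value.
      acceptable-heartbeat-pause = 10 s

      # Interval between heartbeat messages.
      heartbeat-interval = 1 s
    }
    ```

    Raising these values trades faster failure detection for fewer false Unreachable markings.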

    When the target node fails to respond to a "large number" of system messages, how large is that number exactly? Can it be customized?

    That should be rare, but for example if many actors stop at the same time and there are watchers of these actors on other nodes there may be a storm of Terminated messages sent more quickly than they can be delivered and thereby filling up buffers.

    The default size of the system message buffer is 20000, and it can be increased with the configuration property akka.remote.artery.advanced.system-message-buffer-size. There is no drawback to increasing it, apart from possible memory consumption. The buffer is an ArrayDeque, so it grows as needed but doesn’t shrink.

    There is also another queue for outgoing control (system) messages and the max size of that is configured with akka.remote.artery.advanced.outbound-control-queue-size. The default is 3072. This is a LinkedBlockingQueue so it’s also ok to increase. I think we should increase the default of this, by the way.
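
    Put together, the two properties named above could be raised in application.conf roughly as follows. This is only a sketch: the property names come from the answer, while the specific numbers are illustrative assumptions and should be sized against the available memory and the expected burst of system messages.

    ```hocon
    akka.remote.artery.advanced {
      # Default per the answer: 20000. Illustrative larger value.
      system-message-buffer-size = 40000

      # Default per the answer: 3072. Illustrative larger value.
      outbound-control-queue-size = 20000
    }
    ```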

    How should Akka Cluster be configured so that nodes are never, or only very rarely, quarantined, no matter how poor the network is?

    Answer to previous question covers this as well.
