[IEC-47] [IEC][SEBA] CI tests fail with voltha not healthy Created: 22/May/20  Updated: 27/May/20  Resolved: 27/May/20

Status: Done
Project: Integrated Edge Cloud
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Catalin Iova Assignee: Catalin Iova
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

There are two kinds of failures. Both cases appear to share the same root cause, since there is no response from voltha:

https://jenkins.akraino.org/view/iec/job/iec-type2-install-seba_on_arm-fuel-baremetal-ubuntu1804-daily-master/20/console
https://jenkins.akraino.org/view/iec/job/iec-type2-install-seba_on_arm-fuel-baremetal-ubuntu1804-daily-master/21/console



 Comments   
Comment by Catalin Iova [ 27/May/20 ]

The workaround took effect on https://jenkins.akraino.org/view/iec/job/iec-type2-install-seba_on_arm-fuel-baremetal-ubuntu1804-daily-master/32/console and voltha recovered successfully.

Comment by Catalin Iova [ 22/May/20 ]

First workaround applied: https://github.com/iecedge/automation-tools/commit/d63aedab1c7d70c70908228a0abbacc9898b5ca4

Comment by Catalin Iova [ 22/May/20 ]

It seems the voltha coordinator gets stuck in a state where it only renews its session and checks its leader membership. The worker cannot obtain a stable voltha leader and be assigned the core store keys, so it never finishes its initialization. This happens after very long delays in responses from etcd, and after the etcd error "etcdserver: request timed out".
The etcd error, signaled a few times, occurs because the etcd leader loses leadership of the etcd cluster during cluster initialization and a new leader is elected. This etcd leader election can take anywhere from 2-3 seconds up to tens of seconds, and it can happen after the voltha core has already created its session/lease against the previous etcd leader.
The voltha coordinator, leader and worker should implement a recovery mechanism for this, but it does not appear to be fully implemented.
The same issue is reported in https://jira.opencord.org/browse/VOL-339 and its sub-tasks for the case where voltha is used with consul instead of etcd; the handling is similar.
Implementing a fix inside the voltha core is difficult and carries significant risk, so a workaround is used instead.
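
For illustration only, and not voltha's actual coordinator code: the sketch below, assuming the python-etcd3 client and hypothetical endpoint/key names, shows the kind of session/lease keepalive the coordinator performs against etcd, and the recovery step (re-creating the session after a failed keepalive) that this ticket describes as missing or incomplete.

    import time
    import grpc
    import etcd3
    from etcd3.exceptions import Etcd3Exception

    # Hypothetical endpoint and key names, used only for this illustration.
    client = etcd3.client(host='etcd-cluster-client', port=2379)

    def create_session():
        # The coordinator keeps its membership key alive through an etcd lease.
        lease = client.lease(ttl=10)
        client.put('/service/voltha/members/vcore-0', 'alive', lease=lease)
        return lease

    lease = create_session()
    while True:
        try:
            # Periodic keepalive; this is the call that stalls or times out while
            # the etcd cluster re-elects its leader ("etcdserver: request timed out").
            lease.refresh()
        except (Etcd3Exception, grpc.RpcError):
            # Recovery step: re-create the session instead of forever renewing a
            # lease that may have expired on the previous etcd leader.
            lease = create_session()
        time.sleep(3)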

Possible workarounds:

1. After voltha and the etcd-cluster are installed and ready, check that voltha is healthy. If there is no response, or voltha is still not healthy after two minutes, delete the vcore-0 pod. This way the voltha core is restarted after the etcd-cluster is ready (see the sketch after this list).

2. The etcd-cluster shall be installed before, and independently of, the voltha pods, and the voltha installation shall wait for the etcd-cluster to be ready. This also requires updating the voltha helm-chart, which is why it is not the preferred workaround (a possible install ordering is sketched after this list).
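
A minimal sketch of workaround 1, assuming a "voltha" namespace and using the vcore-0 pod's readiness as the health signal (the check in the applied automation-tools commit may query voltha differently): poll for up to two minutes and delete vcore-0 if voltha never becomes healthy.

    import subprocess
    import time

    NAMESPACE = "voltha"   # assumed namespace
    POD = "vcore-0"
    TIMEOUT = 120          # two minutes, as in the workaround description
    INTERVAL = 10

    def vcore_ready():
        # Read the pod's container readiness; any error counts as "no response".
        result = subprocess.run(
            ["kubectl", "get", "pod", POD, "-n", NAMESPACE,
             "-o", "jsonpath={.status.containerStatuses[0].ready}"],
            capture_output=True, text=True)
        return result.returncode == 0 and result.stdout.strip() == "true"

    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        if vcore_ready():
            print("voltha is healthy")
            break
        time.sleep(INTERVAL)
    else:
        # Still unhealthy after two minutes: restart the core so it
        # re-initializes against the now-ready etcd-cluster.
        subprocess.run(["kubectl", "delete", "pod", POD, "-n", NAMESPACE], check=True)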
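
For workaround 2, a rough sketch of the intended ordering only; the chart names, label selector and Helm 2 style flags are placeholders and would need to match the actual SEBA/voltha helm-charts.

    import subprocess

    def run(*cmd):
        # Fail fast if any step does not complete.
        subprocess.run(cmd, check=True)

    # 1. Install the etcd cluster first, independently of the voltha charts
    #    (chart name is a placeholder).
    run("helm", "install", "--name", "etcd-cluster", "some-repo/etcd-cluster")

    # 2. Block until the etcd pods are Ready, so leader election has settled
    #    ("app=etcd" is an assumed label).
    run("kubectl", "wait", "--for=condition=Ready", "pod",
        "-l", "app=etcd", "--timeout=300s")

    # 3. Only then install voltha, so the cores never create their etcd
    #    session against a leader that is about to change.
    run("helm", "install", "--name", "voltha", "voltha/voltha")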
