[VAL-91] LTP test case failure: "SSH session not active" Created: 28/Nov/19  Updated: 05/Dec/19  Resolved: 05/Dec/19

Status: Done
Project: Validation
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Juha Kosonen Assignee: Juha Kosonen
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

 

==============================================================================
Ltp :: Validation, robustness and stability of Linux 
==============================================================================
RunLTP all tests :: Wait ~5hrs to complete 2536 tests | FAIL |
'INFO: creating /opt/ltp/results directory
INFO: no command files were provided. Executing following runtest scenario files:
syscalls fs fs_perms_simple fsx dio io mm ipc sched math nptl pty containers fs_bind controllers filecaps cap_bounds fcntl-locktests connectors power_management_tests hugetlb commands hyperthreading can cpuhotplug net.ipv6_lib input cve crypto kernel_misc uevent
Checking for required user/group ids
'nobody' user id and group found.
'bin' user id and group found.
'daemon' user id and group found.
Users group found.
Sys group found.
Required users/groups exist.
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.
/etc/centos-release
 [ Message content over the limit has been removed. ]
...g_usage_in_bytes_test 2 TINFO: Process is still here after warm up: 1312173
memcg_usage_in_bytes_test 2 TFAIL: memory.memsw.usage_in_bytes is 4202496, 4194304 expected
<<<execution_status>>>
initiation_status="ok"
duration=2 termination_type=exited termination_id=1 corefile=no
cutime=4 cstime=15
<<<test_end>>>
<<<test_start>>>
tag=memcg_stress stime=1574848339
cmdline="memcg_stress_test.sh"
contacts=""
analysis=exit
<<<test_output>>>
memcg_stress_test 1 TINFO: timeout per run is 0h 35m 0s
memcg_stress_test 1 TINFO: Calculated available memory 178387 MB
memcg_stress_test 1 TINFO: Testing 150 cgroups, using 1188 MB, interval 5
memcg_stress_test 1 TINFO: Starting cgroups
memcg_stress_test 1 TINFO: Testing cgroups for 900s' does not contain 'INFO: ltp-pan reported all tests PASS'
Also teardown failed:
Several failures occurred:
1) SSHException: SSH session not active
2) There was no directory matching '/opt/ltp/output'.
3) SSHException: SSH session not active
4) SSHException: SSH session not active
5) There was no directory matching '/opt/ltp/results'.
6) SSHException: SSH session not active
------------------------------------------------------------------------------
Ltp :: Validation, robustness and stability of Linux | FAIL |
Suite teardown failed:
Several failures occurred:
1) SSHException: SSH session not active
2) SSHException: SSH session not active
1 critical test, 0 passed, 1 failed
1 test total, 0 passed, 1 failed
==============================================================================
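
For reference, "SSH session not active" is the message paramiko raises when the transport underneath an SSHClient has died, which is what the teardown keeps hitting above. A defensive teardown could check the transport and reconnect before issuing cleanup commands. A minimal sketch in Python, assuming a paramiko client; the ensure_active helper is hypothetical, not part of the validation code or the merged patch:

import paramiko

def ensure_active(client, host, user):
    # "SSH session not active" means the transport under the client has
    # died; check it and reconnect before running cleanup commands.
    # Assumes key-based auth via the SSH agent or default key files.
    transport = client.get_transport()
    if transport is None or not transport.is_active():
        client.close()
        client.connect(host, username=user, timeout=10)
    return client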

 



 Comments   
Comment by Juha Kosonen [ 05/Dec/19 ]

Merged: https://gerrit.akraino.org/r/c/validation/+/2075


Comment by Juha Kosonen [ 28/Nov/19 ]

A patch submitted for review: https://gerrit.akraino.org/r/c/validation/+/2075

Comment by Juha Kosonen [ 28/Nov/19 ]

At some phase during the test run, even though ping works, SSH connectivity hangs for long periods of time:

 

[cloudadmin@controller-1 ~]$ ping -c2 controller-2
PING controller-2 (192.168.17.4) 56(84) bytes of data.
64 bytes from controller-2 (192.168.17.4): icmp_seq=1 ttl=64 time=0.068 ms
64 bytes from controller-2 (192.168.17.4): icmp_seq=2 ttl=64 time=0.155 ms
--- controller-2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.068/0.111/0.155/0.044 ms
[cloudadmin@controller-1 ~]$ time ssh controller-2
^C
real 1m14.575s
user 0m0.003s
sys 0m0.004s
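
Since a plain ssh call can block for over a minute here, a connectivity probe with explicit timeouts fails fast instead of hanging. A minimal sketch in Python, assuming paramiko >= 2.3; the host and user values are placeholders from the transcript above:

import paramiko

def ssh_alive(host, user, timeout=5):
    """Return True if an SSH session comes up within `timeout` seconds."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        # Bound the TCP connect, banner exchange and auth separately so a
        # wedged sshd cannot stall the caller indefinitely.
        client.connect(host, username=user, timeout=timeout,
                       banner_timeout=timeout, auth_timeout=timeout)
        return True
    except Exception:
        return False
    finally:
        client.close()

print(ssh_alive("controller-2", "cloudadmin"))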

The node in question is detected as non-functional at the k8s level too:

[cloudadmin@controller-1 ~]$ kubectl get no
NAME STATUS ROLES AGE VERSION
192.168.17.1 Ready master 7d23h v1.16.2
192.168.17.2 Ready worker 7d23h v1.16.2
192.168.17.3 Ready worker 7d23h v1.16.2
192.168.17.4 NotReady master 7d23h v1.16.2
192.168.17.5 Ready master 7d23h v1.16.2

Finally, the node restarts:

Events:
 Type Reason Age From Message
 ---- ------ ---- ---- -------
 Normal NodeNotReady 66m kubelet, 192.168.17.4 Node 192.168.17.4 status is now: NodeNotReady
 Warning SystemOOM 66m kubelet, 192.168.17.4 System OOM encountered, victim process: nginx, pid: 1227263
 Warning SystemOOM 66m kubelet, 192.168.17.4 System OOM encountered, victim process: nginx, pid: 1227264
 Warning SystemOOM 66m kubelet, 192.168.17.4 System OOM encountered, victim process: nginx, pid: 1227210
 Warning SystemOOM 66m kubelet, 192.168.17.4 System OOM encountered, victim process: sh, pid: 1227067
 Warning SystemOOM 66m kubelet, 192.168.17.4 System OOM encountered, victim process: memcached, pid: 1227086
 Warning SystemOOM 66m kubelet, 192.168.17.4 System OOM encountered, victim process: rsync, pid: 1227069
 Warning SystemOOM 66m kubelet, 192.168.17.4 System OOM encountered, victim process: supervisord, pid: 1226460
 Warning SystemOOM 66m kubelet, 192.168.17.4 System OOM encountered, victim process: crond, pid: 1227660
 Warning SystemOOM 66m kubelet, 192.168.17.4 System OOM encountered, victim process: memcg_process_s, pid: 1299961
 Warning SystemOOM 66m (x3 over 66m) kubelet, 192.168.17.4 (combined from similar events): System OOM encountered, victim process: memcg_process_s, pid: 1299953
 Normal NodeHasSufficientMemory 66m (x2 over 66m) kubelet, 192.168.17.4 Node 192.168.17.4 status is now: NodeHasSufficientMemory
 Normal NodeHasNoDiskPressure 66m (x2 over 66m) kubelet, 192.168.17.4 Node 192.168.17.4 status is now: NodeHasNoDiskPressure
 Normal NodeHasSufficientPID 66m (x2 over 66m) kubelet, 192.168.17.4 Node 192.168.17.4 status is now: NodeHasSufficientPID
 Normal Starting 66m kubelet, 192.168.17.4 Starting kubelet.
 Normal NodeAllocatableEnforced 66m kubelet, 192.168.17.4 Updated Node Allocatable limit across pods
 Normal NodeReady 66m kubelet, 192.168.17.4 Node 192.168.17.4 status is now: NodeReady
 Normal Starting 31m kube-proxy, 192.168.17.4 Starting kube-proxy.
 Normal Starting 11m kubelet, 192.168.17.4 Starting kubelet.
 Normal NodeHasSufficientMemory 11m (x8 over 11m) kubelet, 192.168.17.4 Node 192.168.17.4 status is now: NodeHasSufficientMemory
 Normal NodeHasNoDiskPressure 11m (x8 over 11m) kubelet, 192.168.17.4 Node 192.168.17.4 status is now: NodeHasNoDiskPressure
 Normal NodeHasSufficientPID 11m (x7 over 11m) kubelet, 192.168.17.4 Node 192.168.17.4 status is now: NodeHasSufficientPID
 Normal NodeAllocatableEnforced 11m kubelet, 192.168.17.4 Updated Node Allocatable limit across pods

There's plenty of memory in the node:

 

[cloudadmin@controller-2 ~]$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff 2G online no 0
0x0000000100000000-0x0000002b7fffffff 170G online yes 2-86
0x0000002b80000000-0x0000002f7fffffff 16G online no 87-94
0x0000002f80000000-0x000000307fffffff 4G online yes 95-96
Memory block size: 2G
Total online memory: 192G
Total offline memory: 0B
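
Worth noting, going by the memcg stress log lines above: the test targets 150 cgroups at 1188 MB each, which is essentially all of the 178387 MB it calculated as available, so the SystemOOM kills are consistent with the test deliberately filling memory despite the 192 GB total. The arithmetic:

cgroups = 150            # "Testing 150 cgroups" from the log above
mb_per_cgroup = 1188     # "using 1188 MB" (per cgroup)
available_mb = 178387    # "Calculated available memory 178387 MB"

used_mb = cgroups * mb_per_cgroup
print(used_mb, available_mb - used_mb)   # 178200 MB used, only 187 MB headroom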

 

At this stage, I suggest removing the particular test case that runs all LTP test suites. It can be added back later if considered reasonable, and after it has been verified as fully functional.

Comment by Juha Kosonen [ 28/Nov/19 ]

Yes, let's wait for a while.

Comment by Cristina Pauna [ 28/Nov/19 ]

Do you want to wait for this to be fixed before I make a new tag?

The new images won't be built until next week anyhow because of Thanksgiving in the US.

Comment by Juha Kosonen [ 28/Nov/19 ]

cristinapauna, FYI
