[IEC-26] [IEC][SEBA][PONSim] cord-tester setup_venv fails: pynacl pwhash_scrypt out of memory Created: 25/Oct/19 Updated: 30/Oct/19 Resolved: 30/Oct/19 |
|
| Status: | Done |
| Project: | Integrated Edge Cloud |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Medium |
| Reporter: | Ciprian Barbu | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
This problem happens on aarch64 when trying to prepare the environment for the cord-tester testing framework. Because the pynacl Python package ships no prebuilt wheel for aarch64, setup_venv.sh builds it from source and in the process also runs the test suite (i.e. calls make check). One of the 72 tests (pwhash_scrypt) fails after timing out, with what appears to be an out-of-memory error. The same problem was reported last year here:

One of the suggested solutions is to increase the memlock ulimit, which for some reason is very low by default on aarch64. I tried setting this ulimit in the container we use for testing (see the sketch following this description), and it works when running on the Jenkins slave (bare metal). However, the test still crashes when running the framework from the same cord-tester image on the master node of an aarch64 virtual pod.
|
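For reference, a minimal sketch (not cord-tester code; everything below is illustrative) of how the locked-memory limit can be inspected and raised from C. Raising the hard limit inside a container requires CAP_SYS_RESOURCE, which is why the fix went into the container configuration (e.g. docker run --ulimit memlock=-1:-1) rather than the test itself:

```c
/* Hedged sketch: inspect RLIMIT_MEMLOCK (the "max locked memory" ulimit
 * that libsodium's mlock-based allocations are subject to) and raise the
 * soft limit as far as an unprivileged process is allowed, i.e. up to
 * the hard limit. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);

    /* An unprivileged process may only raise the soft limit up to the
     * hard limit; raising the hard limit needs CAP_SYS_RESOURCE. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        printf("setrlimit failed: %s\n", strerror(errno));
        return 1;
    }
    printf("soft limit raised to hard limit\n");
    return 0;
}
```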
| Comments |
| Comment by Ciprian Barbu [ 29/Oct/19 ] |
|
My bug was closed on the libsodium GitHub repo, but I replied to this thread on pynacl: |
| Comment by Ciprian Barbu [ 29/Oct/19 ] |
|
I have opened a new issue on the GitHub repository: |
| Comment by Ciprian Barbu [ 29/Oct/19 ] |
|
So it looks like there are two problems here.

Most of the time (if not always) the build of pynacl will fail during the tests because the ulimit for max locked memory is very low. This is also described in the bug report mentioned in the description.

The second problem was seen when running the build on the IEC master node. There the test would suddenly start taking up more and more memory, exhausting all of the available RAM and then going on to consume the swap space. I eventually started debugging this test case and found a negative test in pwhash_scrypt [1] which, by the looks of it, specifies an invalid encoded string that is presumably meant to cause a failure further down the line, when memory is allocated [2] for the key derivation. I noticed that the computed memory requirement goes into the range of terabytes, which is really not supposed to work (see the arithmetic sketch below).

Searching for references on mmap not failing when large amounts of memory are requested, I ended up on this thread [3]. One of the answers there talks about the memory overcommit policy, which can be set via the sysctl vm.overcommit_memory. This parameter controls how much overcommit checking the kernel performs during syscalls like mmap and related calls. On our IEC K8S master it was set to 1, which means no overcommit checking is performed. Additionally, the way libsodium implements the scrypt alloc_region function forces it to also populate the pages, which is useful under normal circumstances but in this case, combined with that overcommit setting, caused the OS to exhaust all memory trying to satisfy the mmap request (a small reproduction sketch follows below).

So the solution is to set vm.overcommit_memory to a value that enables overcommit checking; 0 and 2 both work fine. But in my opinion the libsodium implementation, or maybe the test case, is not very well thought out, since this is a valid way of configuring the OS to allow better control of memory in virtualized environments. One could also consider this a security threat.

[1] https://github.com/jedisct1/libsodium/blob/1.0.16/test/default/pwhash_scrypt.c#L269 |
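For a sense of scale, a minimal sketch of the scrypt memory requirement; it assumes the standard 128 * r * N bytes for scrypt's V array and uses illustrative parameters, not the exact values decoded from the test string:

```c
/* Hedged sketch: scrypt's working memory is roughly 128 * r * N bytes.
 * N is decoded from the encoded string as a power of two, so a few
 * corrupt characters are enough to push the request into terabytes.
 * The parameter values here are hypothetical. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    unsigned log2_n = 35;            /* hypothetical decoded log2(N) */
    uint64_t r = 8;                  /* common block-size factor     */
    uint64_t need = 128u * r * ((uint64_t) 1 << log2_n);

    /* 128 * 8 * 2^35 = 2^45 bytes = 32 TiB: far beyond physical RAM. */
    printf("scrypt V array: %llu bytes (~%llu GiB)\n",
           (unsigned long long) need,
           (unsigned long long) (need >> 30));
    return 0;
}
```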
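And a minimal reproduction sketch of the overcommit interaction, assuming Linux on a 64-bit machine; with vm.overcommit_memory=1 the kernel grants the mapping and MAP_POPULATE immediately starts faulting pages in (the RAM/swap exhaustion seen on the master node), while with 0 or 2 the mmap call fails up front:

```c
/* Hedged sketch: ask for an absurdly large, populated anonymous mapping,
 * similar in spirit to what the bad scrypt parameters make alloc_region
 * request. Behaviour depends on vm.overcommit_memory:
 *   1      -> the mapping is granted and MAP_POPULATE tries to back it
 *             with real pages, draining RAM and then swap;
 *   0 or 2 -> mmap fails immediately with ENOMEM. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = (size_t) 4 << 40;   /* 4 TiB, needs a 64-bit size_t */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED) {
        /* Expected with vm.overcommit_memory=0 or 2. */
        printf("mmap(4 TiB) failed: %s\n", strerror(errno));
        return 0;
    }
    printf("mmap(4 TiB) succeeded; pages were populated during the call\n");
    munmap(p, len);
    return 0;
}
```

The corresponding fix on the pod is e.g. sysctl -w vm.overcommit_memory=0 (or 2), which makes the kernel reject the request before any pages are touched.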