[IEC-22] [IEC][SEBA][PONSim] RG DHCP error Created: 30/Aug/19 Updated: 25/Nov/19 Resolved: 25/Nov/19 |
|
| Status: | Done |
| Project: | Integrated Edge Cloud |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Medium |
| Reporter: | Ciprian Barbu | Assignee: | Ciprian Barbu |
| Resolution: | Done | Votes: | 0 |
| Labels: | Release_2 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
After fixing |
| Comments |
| Comment by Ciprian Barbu [ 01/Nov/19 ] |
|
The first patch, the one backported from master to v1.13.5, has been verified to work and is now part of our ONOS aarch64 image. See this change [1]. The second patch, proposed by Charles Chan on upstream ONOS, has also been verified on one of our PODs in the ENEA lab and works correctly. I also backported this patch to the iecedge/onos repo and updated the charts accordingly. See this change [2]. |
| Comment by Ciprian Barbu [ 18/Oct/19 ] |
|
Update: While I was away on vacation, Robert built ONOS 1.13.9 and tried it, to no avail: this version of ONOS seems to be too new to work with SEBA 1.0. I then tried a 1.13.5 build with the added patch that fixes the concurrent access, as described by Charles Chan in the bug report. This worked fine and the issue is gone, so I will proceed to merge the necessary changes. |
| Comment by Ciprian Barbu [ 20/Sep/19 ] |
|
I've opened a new topic on the seba-dev mailing list and we got a suggestion to try a newer version of ONOS which has a potential fix for the problem: For building the new ONOS version, we will need to use the modified Dockerfile for aarch64: This will need to be ported to 1.13.9 and built on an aarch64 machine. |
| Comment by Ciprian Barbu [ 19/Sep/19 ] |
|
The problem might reside in Atomix. There are some related issues, but I can't tell right now whether the fix is included in our version: |
| Comment by Ciprian Barbu [ 19/Sep/19 ] |
|
The latest news is that there is a large Java traceback in the ONOS logs, coming from the Xconnect Manager, aka org.onosproject.onos-apps-segmentrouting-app. This seems to be the actual problem:
2019-09-13 15:10:25,097 | INFO | qtp300471503-38 | XconnectManager | 177 - org.onosproject.onos-apps-segmentrouting-app - 1.13.5 | Adding or updating xconnect. deviceId=of:0000000000000001, vlanId=222, ports=[1, 2] |
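To isolate the XconnectManager entries from the rest of the ONOS log, a small filter along these lines can help. This is a minimal sketch, assuming every log entry follows the six-field, pipe-separated layout of the line quoted above; the names `KarafLogEntry`, `parse_log_line` and `entries_from` are hypothetical helpers, not part of ONOS.

```python
from typing import Iterable, List, NamedTuple, Optional


class KarafLogEntry(NamedTuple):
    """One log entry, split into the six fields seen in the quoted line."""
    timestamp: str
    level: str
    thread: str
    logger: str
    bundle: str
    message: str


def parse_log_line(line: str) -> Optional[KarafLogEntry]:
    """Split a pipe-separated log line; return None for anything else.

    Stack-trace continuation lines carry no pipe-separated header,
    so they do not yield six fields and are reported as None.
    """
    parts = [part.strip() for part in line.split("|", 5)]
    if len(parts) != 6:
        return None
    return KarafLogEntry(*parts)


def entries_from(lines: Iterable[str], logger: str) -> List[KarafLogEntry]:
    """Keep only the entries emitted by the given logger name."""
    parsed = (parse_log_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry.logger == logger]


# The log line quoted in this comment, without the trailing table delimiter.
SAMPLE = (
    "2019-09-13 15:10:25,097 | INFO | qtp300471503-38 | XconnectManager | "
    "177 - org.onosproject.onos-apps-segmentrouting-app - 1.13.5 | "
    "Adding or updating xconnect. deviceId=of:0000000000000001, "
    "vlanId=222, ports=[1, 2]"
)

for entry in entries_from([SAMPLE], "XconnectManager"):
    print(entry.timestamp, entry.level, entry.message)
```

The traceback lines that follow an entry are deliberately skipped here; collecting them would require grouping each unparsed line with the preceding entry.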
| Comment by Ciprian Barbu [ 17/Sep/19 ] |
|
We have made little progress with the investigation, but so far it looks like the problem could be somewhere in ONOS, more precisely in one of the ONOS apps. Several apps are involved: the AAA manager, the l2dhcprelay and even the olt-app. In the AAA manager, we tracked the sequence of messages from the Radius server: on a successful authentication the manager authorizes access for the client [1], which in turn sends a notification of type APPROVED [2]. Looking at the ONOS logs, the next message comes from the olt-app, which programs the necessary VLANs for the subscriber [3]. This calls the internal function provisionVlans, which seems to do what we need, in that it programs flows for packets tagged with s-tag and c-tag, among other things. At this point we are not sure whether programming the flows fails somewhere, as we have not determined which component is responsible for that. There is also a possibility that we have a different version of the olt-app, or of something else, since these are built by Opencord in Jenkins [4]. The apps are published on an org.opencord maven repository here [5], so we need to track how the apps have been built for the aarch64 version. [1]https://gerrit.opencord.org/gitweb?p=aaa.git;a=blob;f=src/main/java/org/opencord/aaa/AaaManager.java;h=b6b417d85b3ff626d63bad59befc177c081ada1b;hb=ec6670ea8dc4df1d76dd438b6d24f4221fa4f2f8#l322 |
| Comment by Ciprian Barbu [ 13/Sep/19 ] |
|
On a further search for the missing groups of rules, I ended up here: I'm not yet sure how this translates into the actual app. I think ONOS comes precompiled in the container, with no actual Java source files present. |
| Comment by Ciprian Barbu [ 13/Sep/19 ] |
|
After a lot of time debugging, mostly by comparing against a working setup, I have identified one issue in particular while inspecting the logs in the VOLTHA CLI, the ONOS CLI and the mininet OvS. Basically everything happens the same way in both scenarios up until the authentication step. Prior to this, after the ponsim-pod tosca loader loads the models for the ONU and RG, the devices and virtual devices look the same on both setups. On the ONOS side, the device associated with the openvswitch AGG Switch has 47 flows. After authentication is performed, a major difference appears. On the working setup, 6 more flows are added, whereas on the non-working setup only 4 more are added. One of the two missing flows is in table 50 and deals with tagged packets, including the DHCP response. For the other one, I'm not yet clear what its purpose is. Here are the two missing flows as shown by ONOS: Additionally, in the OvS flow dump I identified a few missing groups of flows. These don't seem to have a counterpart in ONOS, at least not that I know of. Here are the missing group rules: These rules should allow the packets coming from the BNG to go to the RG, skipping ONOS. However, ONOS still needs them in order to update the status of the ONU, which does not seem to happen because of the missing flow rule with vlan 222.
The question now is where this difference comes from, which component is responsible for programming the missing flows, and what is broken. There is a slight chance that ONOS is to blame, but I am inclined to think it's the OLT instead. However, the problem might be elsewhere, since the DHCP and EAPOL packets seem to be fine; it's the s-tag/c-tag related flows that are missing. I will need to investigate further. I've asked for help on the #seba channel on Slack, but since people were away at ONF Connect this week I haven't gotten much help yet. I've attached a complete log file with the differences between the two setups. |
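The comparison between the working and non-working setups can be partly automated. Below is a minimal sketch, assuming both flow tables were saved as text in the `ovs-ofctl dump-flows` style, where each flow line contains `actions=` and carries counters such as `duration` and `n_packets`. The helper names and the sample flows are hypothetical and are not the flows from this issue.

```python
import re
from typing import List, Set

# Fields that differ between any two runs and say nothing about what a
# flow matches or does, so they are removed before comparing.
VOLATILE_FIELDS = re.compile(
    r"\b(?:cookie|duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*"
)


def normalize_flow(line: str) -> str:
    """Drop counters and timers so identical rules compare as equal."""
    return VOLATILE_FIELDS.sub("", line).strip()


def load_flows(dump: str) -> Set[str]:
    """Collect the normalized flow lines of one dump, skipping headers."""
    return {
        normalize_flow(line)
        for line in dump.splitlines()
        if "actions=" in line
    }


def missing_flows(working_dump: str, broken_dump: str) -> List[str]:
    """Flows present on the working setup but absent on the broken one."""
    return sorted(load_flows(working_dump) - load_flows(broken_dump))


# Made-up dumps for illustration only.
WORKING = """\
 cookie=0x1, duration=10.5s, table=0, n_packets=5, n_bytes=300, priority=10,in_port=1 actions=output:2
 cookie=0x2, duration=10.1s, table=10, n_packets=0, n_bytes=0, priority=20,dl_vlan=100 actions=output:3
"""
BROKEN = """\
 cookie=0x9, duration=99.0s, table=0, n_packets=7, n_bytes=420, priority=10,in_port=1 actions=output:2
"""

for flow in missing_flows(WORKING, BROKEN):
    print(flow)
```

Treating the dumps as sets ignores ordering, which is usually what is wanted here, since the order in which flows are listed is not guaranteed to match between two setups.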
| Comment by Ciprian Barbu [ 30/Aug/19 ] |
|
I've been using the information in the Opencord guide for troubleshooting DHCP not working: The flows seem to be correct, but the ONOS logs showed the DHCP requests coming from the RG and not the DHCP replies from the BNG (in the mininet POD). So I tried to figure out whether the packets really reach mininet and whether there is a response. I tried to obtain the flows programmed into openvswitch, but I seem to get gibberish in the output. I need to investigate further; there is a chance that the controller registered with OvS is not behaving correctly. |