[IEC-22] [IEC][SEBA][PONSim] RG DHCP error Created: 30/Aug/19  Updated: 25/Nov/19  Resolved: 25/Nov/19

Status: Done
Project: Integrated Edge Cloud
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Ciprian Barbu Assignee: Ciprian Barbu
Resolution: Done Votes: 0
Labels: Release_2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File flows.txt    
Issue Links:
Blocks
blocks IEC-14 Create Jenkins job for running SEBA a... Done

 Description   

After fixing IEC-16, the next step is for the Residential Gateway (rg) pod simulator in PONSim to make a DHCP request to the BNG. However, this does not currently work on aarch64 deployments. I have also encountered situations on x86 where it doesn't work, but those appear to be different issues.



 Comments   
Comment by Ciprian Barbu [ 01/Nov/19 ]

The first patch, the one backported from master to v1.13.5, has been verified to work and it is now part of our ONOS aarch64 image. See this change [1].

The second patch, which was proposed by Charles Chan on upstream ONOS, has also been verified on one of our PODs in the ENEA lab, and it works OK. I backported this patch to the iecedge/onos repo as well and updated the charts accordingly. See this change [2].

[1] https://gerrit.akraino.org/r/#/c/iec/+/1808/

[2] https://gerrit.akraino.org/r/c/iec/+/1897

Comment by Ciprian Barbu [ 18/Oct/19 ]

Update:

While I was away on vacation, Robert built ONOS 1.13.9 and tried it, to no avail; this version of ONOS seems to be too new to work with SEBA 1.0.

Then I tried a build with 1.13.5 plus the patch that fixes the concurrent access issue, as described by Charles Chan in the bug report. This worked fine and the issue is gone, so I will proceed with merging the necessary changes.

Comment by Ciprian Barbu [ 20/Sep/19 ]

I've opened a new topic on the seba-dev mailing list and we got a suggestion to try a newer version of ONOS which has a potential fix for the problem:
https://groups.google.com/a/opennetworking.org/forum/#!topic/seba-dev/FoEBPnsFC30

For building the new ONOS version, we will need to use the modified Dockerfile for aarch64:
https://github.com/cachengo/cachengo-onos/commit/6af2d66935f7b450965ef07040da4c2721637938

This will need to be ported to 1.13.9 and built on an aarch64 machine.

Comment by Ciprian Barbu [ 19/Sep/19 ]

The problem might reside in Atomix; there are some related issues, but I can't tell right now whether the fix is included in our version:

https://github.com/atomix/atomix/issues/978

Comment by Ciprian Barbu [ 19/Sep/19 ]

The latest news is that there is a large Java traceback in the ONOS logs, coming from the XconnectManager (part of org.onosproject.onos-apps-segmentrouting-app). This seems to be the actual problem:

 

2019-09-13 15:10:25,097 | INFO | qtp300471503-38 | XconnectManager | 177 - org.onosproject.onos-apps-segmentrouting-app - 1.13.5 | Adding or updating xconnect. deviceId=of:0000000000000001, vlanId=222, ports=[1, 2]
2019-09-13 15:10:40,118 | ERROR | nt-partition-1-0 | ThreadPoolContext | 90 - io.atomix - 2.0.23 | An uncaught exception occurred
java.lang.IllegalStateException: org.onosproject.store.service.ConsistentMapException$Timeout: onos-sr-xconnect-next
at org.onosproject.store.primitives.impl.MeteredAsyncConsistentMap$InternalMeteredMapEventListener.event(MeteredAsyncConsistentMap.java:310)
at org.onosproject.store.primitives.impl.CachingAsyncConsistentMap.lambda$null$1(CachingAsyncConsistentMap.java:93)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:399)[95:com.google.guava:22.0.0]
at org.onosproject.store.primitives.impl.CachingAsyncConsistentMap.lambda$null$2(CachingAsyncConsistentMap.java:93)
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)[:1.8.0_201]
at org.onosproject.store.primitives.impl.CachingAsyncConsistentMap.lambda$new$3(CachingAsyncConsistentMap.java:93)
at org.onosproject.store.primitives.impl.TranscodingAsyncConsistentMap$InternalBackingMapEventListener.event(TranscodingAsyncConsistentMap.java:366)
at org.onosproject.store.primitives.impl.TranscodingAsyncConsistentMap$InternalBackingMapEventListener.event(TranscodingAsyncConsistentMap.java:366)
at org.onosproject.store.primitives.resources.impl.AtomixConsistentMap.lambda$null$1(AtomixConsistentMap.java:128)[133:org.onosproject.onos-core-primitives:1.13.5]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:399)[95:com.google.guava:22.0.0]
at org.onosproject.store.primitives.resources.impl.AtomixConsistentMap.lambda$null$2(AtomixConsistentMap.java:128)[133:org.onosproject.onos-core-primitives:1.13.5]
at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)[:1.8.0_201]
at org.onosproject.store.primitives.resources.impl.AtomixConsistentMap.lambda$handleEvent$3(AtomixConsistentMap.java:128)[133:org.onosproject.onos-core-primitives:1.13.5]
at java.util.ArrayList.forEach(ArrayList.java:1257)[:1.8.0_201]
at org.onosproject.store.primitives.resources.impl.AtomixConsistentMap.handleEvent(AtomixConsistentMap.java:127)[133:org.onosproject.onos-core-primitives:1.13.5]
at io.atomix.protocols.raft.proxy.impl.DelegatingRaftProxy.lambda$addEventListener$4(DelegatingRaftProxy.java:122)
at io.atomix.protocols.raft.proxy.impl.BlockingAwareRaftProxyClient.lambda$null$2(BlockingAwareRaftProxyClient.java:67)
at io.atomix.utils.concurrent.ThreadPoolContext.lambda$new$0(ThreadPoolContext.java:81)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)[:1.8.0_201]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)[:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)[:1.8.0_201]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)[:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[:1.8.0_201]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_201]
Caused by: org.onosproject.store.service.ConsistentMapException$Timeout: onos-sr-xconnect-next
at org.onosproject.store.primitives.DefaultConsistentMap.complete(DefaultConsistentMap.java:258)
at org.onosproject.store.primitives.DefaultConsistentMap.containsKey(DefaultConsistentMap.java:77)
at org.onosproject.segmentrouting.xconnect.impl.XconnectManager.populateNext(XconnectManager.java:485)
at org.onosproject.segmentrouting.xconnect.impl.XconnectManager.populateXConnect(XconnectManager.java:455)
at org.onosproject.segmentrouting.xconnect.impl.XconnectManager.access$300(XconnectManager.java:98)
at org.onosproject.segmentrouting.xconnect.impl.XconnectManager$XconnectMapListener.event(XconnectManager.java:297)
at org.onosproject.store.primitives.impl.MeteredAsyncConsistentMap$InternalMeteredMapEventListener.event(MeteredAsyncConsistentMap.java:306)
... 24 more

Comment by Ciprian Barbu [ 17/Sep/19 ]

We have made little progress with the investigation, but so far it looks like the problem is somewhere in ONOS, more precisely in one of the ONOS apps.

Several apps are involved: the aaa manager, the l2dhcprelay, and even the olt-app.

For the AAA Manager, we tracked the sequence of messages from the RADIUS server: on a successful authentication the manager authorizes access for the client [1], which in turn sends a notification of type APPROVED [2].

Then, looking at the ONOS logs, the next message comes from the olt-app, which programs the necessary VLANs for the subscriber [3]. This calls the internal function provisionVlans, which appears to do what we need, in that it programs flows for packets tagged with the s-tag and c-tag, among other things.

At this point we are not sure whether the action of programming the flows fails somewhere, as we have not determined which component is responsible for that. But there is also a possibility that we have a different version of the olt-app, or something else, since these apps are built by OpenCord in Jenkins [4]. The apps are published in an org.opencord Maven repository [5], so we need to track how the apps were built for the aarch64 version.

[1]https://gerrit.opencord.org/gitweb?p=aaa.git;a=blob;f=src/main/java/org/opencord/aaa/AaaManager.java;h=b6b417d85b3ff626d63bad59befc177c081ada1b;hb=ec6670ea8dc4df1d76dd438b6d24f4221fa4f2f8#l322
[2]https://gerrit.opencord.org/gitweb?p=aaa.git;a=blob;f=src/main/java/org/opencord/aaa/StateMachine.java;h=d888c9883b16bdaf658e1a6a8e612c2cf1083ad8;hb=ec6670ea8dc4df1d76dd438b6d24f4221fa4f2f8#l417
[3]https://gerrit.opencord.org/gitweb?p=olt.git;a=blob;f=app/src/main/java/org/opencord/olt/impl/Olt.java;h=e2fdc5585542b490770ba08dc63ad26983632ea2;hb=1f864fc3d95d0d57a61ac3196540536473ce64ef#l260
[4]https://jenkins.opencord.org/view/ONOS%20Apps/job/cord-onos-app-publishing/
[5]https://mvnrepository.com/artifact/org.opencord

Comment by Ciprian Barbu [ 13/Sep/19 ]

On a further search for missing groups of rules, I ended up here:
https://github.com/opennetworkinglab/onos/blob/master/drivers/default/src/main/java/org/onosproject/driver/pipeline/ofdpa/OvsOfdpaPipeline.java

I'm not yet sure how this translates into the actual app; I think ONOS comes precompiled in the container, with no actual Java source files present.

Comment by Ciprian Barbu [ 13/Sep/19 ]

After a lot of time spent debugging, mostly by comparing against a working setup, I have identified one issue in particular while inspecting the logs in the VOLTHA CLI, the ONOS CLI, and the Mininet OvS.

Basically, everything happens the same way in both scenarios up to the authentication step. Prior to this, after the ponsim-pod TOSCA loader loads the models for the ONU and RG, the devices and virtual devices look the same on both setups. On the ONOS side, the device associated with the Open vSwitch AGG switch has 47 flows.

After authentication is performed, a major difference appears. On the working setup, 6 more flows are added, whereas on the non-working setup only 4 more are added. One of the two missing flows is in table 50 and deals with tagged packets, including the DHCP response. The purpose of the other one is not yet clear to me. Here are the two missing flows as shown by ONOS:
   ADDED, bytes=5022, packets=42, table=50, priority=1000, selector=[VLAN_VID:222], treatment=[deferred=[GROUP:0x40de0000], transition=TABLE:60]
   ADDED, bytes=5022, packets=42, table=60, priority=60000, selector=[], treatment=[immediate=[NOACTION]]

Additionally, in the OvS flow dump I identified a few missing groups of flows. These don't seem to have a counterpart in ONOS, at least not that I know of. Here are the missing group rules:
group_id=14548993,type=indirect,bucket=actions=output:1
group_id=14548994,type=indirect,bucket=actions=output:2
group_id=1088290816,type=all,bucket=actions=group:14548993,bucket=actions=group:14548994

These rules should allow packets coming from the BNG to reach the RG while bypassing ONOS. However, ONOS still needs them in order to update the status of the ONU, which does not seem to happen because of the missing flow rule for VLAN 222.
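As a side note, the working-vs-broken comparison can be scripted. On a live pod the dumps would come from `ovs-ofctl -O OpenFlow13 dump-groups s1` (the bridge name s1 is taken from the mininet topology); the sketch below uses the group rules quoted above as stand-in sample data and diffs the group IDs:

```shell
# Sketch: diff OvS group tables between a working and a broken pod.
# On a live pod the dumps would come from:
#   ovs-ofctl -O OpenFlow13 dump-groups s1
# Here the group rules quoted above serve as stand-in sample data.
working='group_id=14548993,type=indirect,bucket=actions=output:1
group_id=14548994,type=indirect,bucket=actions=output:2
group_id=1088290816,type=all,bucket=actions=group:14548993,bucket=actions=group:14548994'
broken=''   # the broken pod has none of these groups

# Extract and sort the group IDs, then report what the broken pod lacks.
printf '%s\n' "$working" | grep -o 'group_id=[0-9]*' | sort > /tmp/groups.working
printf '%s\n' "$broken"  | grep -o 'group_id=[0-9]*' | sort > /tmp/groups.broken
missing=$(comm -23 /tmp/groups.working /tmp/groups.broken)
printf 'missing groups:\n%s\n' "$missing"
```

The same approach works for flow dumps too, after stripping volatile fields such as byte and packet counters.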

 

The question now is where this difference comes from, which component is responsible for programming the missing flows, and what is broken. There is a slight chance that ONOS is to blame, but I am inclined to think it's the OLT instead.
This is because in the voltha namespace there is a vcore-0 pod which runs the core of VOLTHA, and by the looks of it this has a definition of the ponsim_olt PON type:
https://github.com/iecedge/voltha/blob/voltha-1.6/voltha/adapters/ponsim_olt/ponsim_olt.py#L807

However, the problem might be elsewhere: the DHCP and EAPOL packets seem to be fine; it's the s-tag/c-tag related flows that are missing. I will need to investigate further.

I've asked for help on the #seba channel on Slack, but since people were away at ONF Connect this week, I haven't gotten much help yet.

I've attached a complete log file with differences between the two setups.

flows.txt

Comment by Ciprian Barbu [ 30/Aug/19 ]

I've been using the information on the Opencord guide for troubleshooting DHCP not working:
https://guide.opencord.org/cord-6.1/profiles/seba/troubleshoot/no-dhcp.html

The flows seem to be correct, but the ONOS logs showed the DHCP requests coming from the RG and no DHCP replies from the BNG (in the mininet POD).

So I tried to figure out whether the packets really reach the mininet POD and whether there is a response.
In the mininet POD there is a virtual topology (created by mininet, of course) which simulates the different ports involved. There is a bridge s1 which contains a virtual port s1-eth1 and the outgoing physical interface eth1. I can see the DHCP requests and replies on the s1-eth1 port (with tcpdump), but not on the physical interface eth1. So it looks to me like the simulated BNG works, at least in that the dnsmasq process sends the reply, but the reply is not forwarded by openvswitch onto the physical interface. On a working x86 pod, by comparison, the replies do show up on eth1, so indeed something is wrong at the OvS level.
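To make the per-interface comparison more systematic, one could count DHCP packets on each port. On the live pod the input would come from tcpdump; the sketch below tallies saved capture excerpts instead, and the sample lines are hypothetical stand-ins for real tcpdump output:

```shell
# Sketch: count DHCP (BOOTP) packets per interface from tcpdump output.
# On a live pod the input would come from e.g.:
#   tcpdump -ni s1-eth1 'udp port 67 or udp port 68'
#   tcpdump -ni eth1    'udp port 67 or udp port 68'
# The two excerpts below are hypothetical stand-ins illustrating the
# broken case: traffic visible on s1-eth1, nothing on eth1.
s1_eth1_capture='IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request
IP 10.0.0.1.67 > 255.255.255.255.68: BOOTP/DHCP, Reply'
eth1_capture=''

count_dhcp() { printf '%s' "$1" | grep -c 'BOOTP/DHCP'; }

echo "s1-eth1: $(count_dhcp "$s1_eth1_capture") DHCP packets"
echo "eth1:    $(count_dhcp "$eth1_capture") DHCP packets"
```

Splitting the counts further by "Request" vs "Reply" would show exactly where the replies are being dropped.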

I tried to obtain the flows programmed into openvswitch, but I seem to get some gibberish in the output. I need to investigate further; there is a chance that the controller registered with ovs is not behaving correctly.
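One possible cause of unreadable output, worth ruling out (a general ovs-ofctl behaviour, not something confirmed on this setup): ovs-ofctl speaks OpenFlow 1.0 by default, and dumping a bridge whose pipeline uses OpenFlow 1.3 tables and groups without an explicit -O flag can produce garbled results. A sketch, assuming the bridge is named s1 as in the mininet topology:

```shell
# Sketch: try explicit OpenFlow protocol versions when dump-flows
# returns gibberish. Bridge name "s1" is an assumption taken from
# the mininet topology described above.
msg=""
if command -v ovs-ofctl >/dev/null 2>&1; then
  for proto in OpenFlow13 OpenFlow14 OpenFlow10; do
    if ovs-ofctl -O "$proto" dump-flows s1 >"/tmp/flows.$proto" 2>/dev/null; then
      msg="dumped flows with $proto -> /tmp/flows.$proto"
      break
    fi
  done
  : "${msg:=could not dump flows from s1 with any protocol version}"
else
  msg="ovs-ofctl not installed; run this on the mininet POD"
fi
echo "$msg"
```

If OpenFlow13 succeeds where the default fails, the gibberish was a protocol mismatch rather than a misbehaving controller.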

Generated at Sat Feb 10 06:02:12 UTC 2024 using Jira 9.4.5#940005-sha1:e3094934eac4fd8653cf39da58f39364fb9cc7c1.