Channel: VMware Communities: Message List

Re: Spontaneous Locale-ID change on the host


Hello, Raymundo.

 

I ran the test again and the problem persists.

 

The host in question, the one playing the remote side of the stretched cluster, is esx-22.

First, I enforced the locale ID on it one more time to make sure it was correct.

 

vmkernel.log

2016-06-06T05:10:06.664Z cpu0:35166)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 42243c55-196b-520f-d733-e99899274998 to 420299c3-a98d-5ef5-6704-8c77475a4c37

netcpa.log

 

 

2016-06-06T05:10:06.664Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in Function Callback Type 3

2016-06-06T05:10:06.664Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in App Id host-43 Type VNET
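When chasing these events I correlate the two logs by timestamp, to the second. A minimal sketch of that technique, using the sample lines pasted above written to hypothetical `/tmp/*.sample` files (on the host you would grep `/var/log/vmkernel.log` and `/var/log/netcpa.log` directly):

```shell
#!/bin/sh
# Illustrative sample files; contents copied from the log excerpts above.
cat > /tmp/vmkernel.sample <<'EOF'
2016-06-06T05:10:06.664Z cpu0:35166)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 42243c55-196b-520f-d733-e99899274998 to 420299c3-a98d-5ef5-6704-8c77475a4c37
EOF

cat > /tmp/netcpa.sample <<'EOF'
2016-06-06T05:10:06.664Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in Function Callback Type 3
2016-06-06T05:10:06.664Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in App Id host-43 Type VNET
EOF

# Take the timestamp (truncated to the second) of each locale-ID change
# in vmkernel.log, then show what netcpa logged at the same moment.
ts=$(grep 'Changing localeId' /tmp/vmkernel.sample | cut -c1-19)
grep "$ts" /tmp/netcpa.sample
```

This is how I matched the `vdrb` change above to the netcpa callback messages that arrived in the same second.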

 

After that, I rebooted esx-22 to reset all services running on it, and during boot-up the locale ID changed again!

I changed it back and continued with the test.

 

Stretched cluster in question:

esx-12:

  • vm1 - VNI6000
  • vm2 - VNI6003
  • vm3 - VNI6005
  • vm4 - VNI6006

esx-22:

  • vm5 - VNI6000
  • vm6 - VNI6003
  • vm7 - VNI6005
  • vm8 - VNI6006

All those VNIs are behind UDLR.

I also ran pings from my laptop to all these VMs to monitor connectivity in real time.
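The monitoring loop is nothing fancy; roughly this, where `vm1`..`vm8` stand in for the lab VMs' addresses (a sketch, not my exact script, and the hostnames are placeholders):

```shell
#!/bin/sh
# Print a timestamped up/down line for one target, using a single
# 1-second-timeout ping probe.
probe() {
    if ping -c 1 -W 1 "$1" >/dev/null 2>&1; then
        echo "$(date -u +%FT%TZ) $1 up"
    else
        echo "$(date -u +%FT%TZ) $1 down"
    fi
}

# In the lab it runs in a loop:
#   while true; do for vm in vm1 vm2 vm3 vm4 vm5 vm6 vm7 vm8; do probe "$vm"; done; sleep 1; done
# Loopback demo so the snippet produces output anywhere:
probe 127.0.0.1
```

Watching the timestamps on the `down` lines is how I knew vm8's outage lasted about 10 seconds.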

 

The management cluster is separate, two-node as well, and all management and control VMs run on the "DC1" part of it.

 

So, I powered off the "DC1" part of my lab. Obviously, vm1-vm4 went down, but connectivity to vm8 was lost too; it resumed 10 seconds later, yet this was still unexpected behavior.

 

An even weirder thing happened later: as soon as vm1-vm4 booted up on esx-22, pings to all VMs on VNI6000 vanished. I connected to the consoles of the VMs in question and found out that they were able to ping each other and VMs on other VNIs connected to the UDLR. They could also ping the default gateway, but VNI6000 traffic refused to leave the UDLR.

This was extremely frustrating. All firewalls were either disabled (ESG) or configured to allow all traffic (DFW), so firewalling was not the cause. Also, I have never experienced anything like this with a "simple" DLR.

 

Anyway, by that time the NSX Controllers had booted, esx-22 regained connectivity with them, and the VNI6000 VMs became reachable again.

For a short period everything was working fine; then the NSX Management Service came up and BOOM, the locale ID changed again:

 

vmkernel.log

2016-06-06T05:19:55.444Z cpu0:35166)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998

 

netcpa.log

2016-06-06T05:19:55.508Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in Function Callback Type 3

2016-06-06T05:19:55.508Z info netcpa[FF9BBB70] [Originator@6876 sub=Default] Got Message in App Id host-43 Type VNET

2016-06-06T05:19:55.508Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Received vdr instance message numVdrId 2 ...

2016-06-06T05:19:55.508Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Updated vdr instance vdr name = default+edge-5, vdr id = 5000, auth token = becbcd28-fd1a-432c-b4dc-de12ea067510, universal = false, localEgress = false

2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] Updated vdr instance vdr name = default+edge-47748c53-6f4b-47cc-a55e-858691cbad47, vdr id = 6000, auth token = 3e209177-983c-4ea8-8bc2-af2fe9b707f2, universal = true, localEgress = true

2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] No flap edge CP link for vdr id 5000

2016-06-06T05:19:55.509Z info netcpa[FF97AB70] [Originator@6876 sub=Default] No flap edge CP link for vdr id 6000
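To see the locale-ID ping-pong at a glance, the from/to UUIDs can be pulled out of the `vdrb` messages with a one-line `sed`. A sketch, again against a hypothetical `/tmp/localeid.sample` file holding the two lines quoted above (on the host, grep them out of `/var/log/vmkernel.log`):

```shell
#!/bin/sh
# Sample input: the two "Changing localeId" lines from the excerpts above.
cat > /tmp/localeid.sample <<'EOF'
2016-06-06T05:10:06.664Z cpu0:35166)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 42243c55-196b-520f-d733-e99899274998 to 420299c3-a98d-5ef5-6704-8c77475a4c37
2016-06-06T05:19:55.444Z cpu0:35166)vdrb: VdrCpProcessLocaleIdMsg:2329: CP:INFO: VDR Control Plane : Changing localeId from 420299c3-a98d-5ef5-6704-8c77475a4c37 to 42243c55-196b-520f-d733-e99899274998
EOF

# Reduce each event to "timestamp from-UUID -> to-UUID".
sed -n 's/^\([0-9T:.-]*Z\).*Changing localeId from \(.*\) to \(.*\)$/\1 \2 -> \3/p' /tmp/localeid.sample
```

The condensed output makes it obvious the host flips back to the exact locale ID it held before, each time a management-plane component reconnects.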

 

So either local egress is broken somehow, or my environment is totally broken. But my environment is a clean install of vSphere 6.0 + NSX 6.2, later updated to vSphere 6.0u2 + NSX 6.2.2. It is a lab, so it barely sees any use and I'm the only one touching it.

 

Anyway, I was extremely disturbed by the VNI6000 behavior. I guess I'll run another test without local egress and watch the UDLR in that situation.

 

Am I mistaken, or do you work for VMware? If you do, I can provide all the logs you want, for you to go through and/or show to someone inside the shop.

