Check your 6 (DNS)

I’ve been meaning to write this post for a number of years, and it’s probably been 2 or 3 years since I last had an incident of this nature, but it’s time to share the knowledge.  An often overlooked area of concern for both Exchange and Skype/Lync is the state of the DNS environment.  DNS is one of those foundation services that, if it’s not working correctly, can give you all kinds of grief.

Many super gurus out there, when deploying an on-premises Skype/Lync environment, like to do everything with PowerShell, including the Schema changes and the Forest and Domain preps.  Me, I find there is something comforting about a nice GUI, with progress bars and check boxes.  I have the PS commands as well though, just in case things aren’t working quite right.

In one such environment, the Deployment wizard just could not get the Schema to go.  However, after the 4th or 5th attempt, it worked.  Rather than trying to figure out the why, I went on with preparing the Forest, this time using:

Enable-CsAdForest -GroupDomain company.local -Verbose

Boom, failed again.  A quick scour through the XML and I find this:

Cannot find one suitable domain controller under the domain company.local.

OK, I check, there are plenty of Domain Controllers available, hmmm, 6 GCs for that site alone… for 200 people…  I retried the same command again and again; on the 5th try, it works.  OK, that ain’t right.

A quick chat with the customer revealed that the reason for so many GCs was that the logon process for users was often soooo slow, 5-8 mins, while at other times it was fine.  But the additional DCs didn’t really seem to help.  Well, that’s not good.

One AD health check later (this kind of command is helpful: dcdiag /e /c /v /f:c:\temp\dcdiag1.txt), plus checking replication status and Sites and Services of course, nothing really stood out, other than LSASS being unusually busy for a 200 user environment with 6 GCs.  Time to start with the basics…  How does a client pick a DC for logon?  Like many things, DNS is key.

I start poking around DNS, making sure resolution is working for all the servers listed in AD Sites and Services.  Yup, all good.  Then I start looking in the _msdcs folders for the domain…  Drilling down through DC, _Sites, Default-First-Site-Name, _tcp.  OK, we’ve got entries… a WHOLE lot of entries.  Checking the rest (Domains, GC, PDC), and with the exception of PDC, there were too many GCs/DCs listed.  In fact there were 13 Global Catalog servers listed, of which only 6 still existed.  I sense resolution ahead.

Turned out no one was decommissioning Domain Controllers properly in the environment.  Basically they would shut down a retiring server, leave it off for a few days, and then delete it from Sites and Services and from the Domain.  Problem was, the DNS entries were never cleaned up (see: AD Decom steps).  Because DCPromo was not used, none of the GC/DC related DNS entries were removed.

After an hour or two of combing through every level of the DNS structure, verifying whether each listed server still existed and/or still existed as a DC (IP recycling), and clearing out the stale entries, LSASS utilisation on the GCs dropped from 20%+ down to 1%.  I reran the Schema prep without issue, same for the Forest and Domain preps, no more problem.
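
For anyone who would rather not eyeball every folder, the comparison itself can be roughed out in a few lines of Python.  This is only a sketch under assumptions, not the procedure used above: it assumes you have saved the text output of a query along the lines of nslookup -type=SRV _gc._tcp.company.local, that each answer carries a line of the form "svr hostname = ...", and that you supply your own list of DCs that still exist.  The function names and the host names (dc01, olddc07) are made up for illustration.

```python
import re

# Matches the target line of an SRV answer as printed by Windows nslookup,
# e.g. "          svr hostname   = dc01.company.local" (output format assumed).
SRV_TARGET = re.compile(r"svr hostname\s*=\s*(\S+)", re.IGNORECASE)


def srv_targets(nslookup_text):
    """Return the set of lower-cased host names found in SRV answers."""
    return {m.group(1).rstrip(".").lower() for m in SRV_TARGET.finditer(nslookup_text)}


def stale_targets(nslookup_text, live_dcs):
    """Return SRV targets that are not in the list of DCs known to exist."""
    live = {name.rstrip(".").lower() for name in live_dcs}
    return sorted(srv_targets(nslookup_text) - live)


if __name__ == "__main__":
    # Hypothetical sample: one live GC and one that was never cleaned up.
    sample = """
    _gc._tcp.company.local  SRV service location:
              svr hostname   = dc01.company.local
    _gc._tcp.company.local  SRV service location:
              svr hostname   = olddc07.company.local
    """
    for host in stale_targets(sample, ["DC01.company.local"]):
        print("stale SRV target:", host)
```

Anything it flags still needs a human check before deletion; a name that is merely missing from your "live" list is not proof the server is gone.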

In summary, whether logging on to the domain or running forest or domain preps, your PC/Server gets back a list of GCs from DNS.  Sometimes the first entry is a working one; never in my case.  I ran into the same issue with another client, recognized the problem right away, was able to specify an operating GC, and advised them to clean up their DNS entries ASAP.

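Enable-CsAdForest -GlobalCatalog dc01.company.local -GroupDomain company.local -GroupDomainController dc01.company.local -GlobalSettingsDomainController dc01.company.local -Verbose

Enable-CsAdDomain -Domain company.local -GlobalCatalog dc01.company.local -Verbose

(dc01.company.local is a placeholder; substitute the FQDN of a DC/GC you have confirmed is up.)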

If you absolutely have to resort to the above commands to get the environment prepped, take it as a sign that something isn’t right, and you’ll want to have a look at the AD DNS.


Automatically Delete Device Update Logs

Set-CsDeviceUpdateConfiguration, the forgotten command of many a Lync/Skype project.  If you don’t want the LyncShare\x-WebServices-x\DeviceUpdateLogs\Server\Audit\imageUpdates folder filling up beyond what is deemed necessary, then use the following:

Set-CsDeviceUpdateConfiguration -LogCleanUpTimeOfDay 17:05

You can also modify the retention time with -LogCleanUpInterval, e.g. 5.00:00:00 for a 5 day retention or 365.00:00:00 for 1 year.  Default is 10 days, but unless the time of day is set, the process will never kick in.

Try not to set it just 5 mins ahead, or you might be waiting 24 hours and 5 minutes for it to actually start deleting; a good Microsoft Minute (15 mins) should do.  Or just schedule it, check the next day that it’s working, and cross it off your monthly/quarterly maintenance check list.
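
The interval value is written as days.hours:minutes:seconds, which is easy to mistype.  As a small illustration (a sketch only; the helper name is made up and it handles whole, non-negative seconds only), this is how a duration maps onto that string:

```python
from datetime import timedelta


def dotnet_timespan(interval):
    """Format a timedelta as d.hh:mm:ss, the form shown for -LogCleanUpInterval."""
    total = int(interval.total_seconds())
    days, rem = divmod(total, 86400)      # 86400 seconds in a day
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}.{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    print(dotnet_timespan(timedelta(days=5)))     # 5 day retention
    print(dotnet_timespan(timedelta(days=365)))   # 1 year retention
```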


Challenging Process

Many people have heard the analogy with the monkeys, the ladder and the bananas; if not, here’s a refresher (image: MonkeyAnalogy).

I’m not going to discuss the merits of whether the experiment ever happened or not; I always thought it to be a valid parable of what happens in many organizations.  Certainly if you have been in IT long enough, you’ve probably hit upon a process that no one knows the “why” of.  Challenging in this case is a verb, not an adjective.

Case in point:  I was working a large multi-Ministry Exchange consolidation project using Quest, and we finished migrating all the mailboxes from Exchange 2003 to 2007, BUT, there was a report still being run based on mechanisms that were no longer available in the new Exchange environment.  For 9+ months we were not allowed to shut down the retiring Exchange server as a developer could not be found to re-engineer the report.

Server maintenance gets costly, plus the box was a little flaky.  Time to go GenX on this and find out more about this report, so I went on site.  First stop, talk to the person who receives the report; let’s just say it’s Bobo.

“Bobo, what data or information are you needing out of this report you receive every week?”

“Oh, I don’t look at it, I forward to my team lead Bubbles”

Ok, off to visit Bubbles.  “Bubbles, what data or information are you looking for in this report you get forwarded from Bobo?”

“Oh, I don’t look at it, I forward it to Koko”

Ok… off to visit Koko.  “Koko, what data or information are you looking for in this report you get forwarded from Bubbles?”

“Oh, I have a rule that autoforwards that subject line to George, I don’t actually look at it”

I’m sensing a pattern here… George sends me to Clyde, from Clyde to Kong.

“Kong, this report with 6 fwd’s on it, what do you do with it?”

Kong: “I delete the darned thing, I wish people would stop sending me this useless report”

The Exchange 2003 environment was decommissioned shortly thereafter.

Seems like a funny story, but it’s actually sad and costly.  I believe the Ministry was getting charged by an outside vendor ~$1500 a month for the server, support, storage and backup.

Lesson summary: well, there are a few lessons here.

  1. Whether it’s a family tradition or a corporate process, without knowing the meaning or the why, it is often rendered useless.
  2. Not every problem requires a solution.  Sometimes the problem is the problem itself.
  3. As a Consultant/Contractor, clients are paying us good money for our experience and value.  Sometimes that value is having “outside eyes” looking at issues that have blinded internal users.
  4. Sometimes there is a forgotten “why”, I’m not advocating blindly trampling processes like a bull.  Be the 800 lb gorilla willing to talk about the elephant in the corner.

NI1 or NI2 or DMS100

I’ll be honest, my IT background and bread’n’butter for 15 years was primarily Exchange.  OCS/Lync presented an interesting opportunity to expand my Exchange horizons with the integration of UM and ultimately into Unified Communications.  I do not have a background with telcos, though I have set up a couple of RightFax servers (just dated myself there, no idea who owns them now, still Captaris?!?) with Brooktrout boards using PRI lines.  Needless to say, I’m not an expert when it comes to ISDN.  The first company I worked for was a law firm running an ISDN 128k data line for internet, and one of the first things I discovered for the firm was a LAN connection to the internet: 10 Mb, it was lightning.  18 years later and ISDN is still around, though mostly for voice I would guess.  (I bet you just figured out why it’s UCC Ramblings, didn’t you?)

Anywho, a recent telco experience led me to do a bunch of research as to why we were having so many problems with our PRI line connection with an AudioCodes gateway.  When we asked the telco for the protocol type, clock master, line code… they insisted it was NI1.  Things did basically work (smarter people than me will probably have picked up on what I said there) but there were warnings/errors in the SIP traces.  You could make and receive calls as usual, BUT something was missing: the Display name, blank, always blank.

After performing a PSTN trace for the AudioCodes engineer to analyse, there was lots of data missing, and things really didn’t look right.  The request came back: try DMS100 or NI2.  But telco says it’s NI1, times 5, very insistent.  What the heck, I had set up amazing routing in S4B to have outbound calls go out 2 different gateways, so no worries.  Lo and behold, changing from NI1 protocol to DMS100 was the silver bullet.  Errors gone, warnings gone, Display name appears, woo hoo.  New PSTN trace showed the Display property being used, and not a Facilities method like in NI2.  Very definitely a T1 DMS100 ISDN line.

Now I’m perplexed, I’m GenX, I need to know the why.  A few days of reading later, I know more about ISDN than I ever cared to.  I much prefer SIP trunking, but if you don’t know where you came from, how will you know where you’re going…

ISDN has been around since the ’70s; DMS100 was developed in 1979.  Resilient little protocol built in a time of efficiency where every bit mattered.  Blah blah blah – here’s a link: ISDN.  NI1 was developed to standardise BRI lines (Basic Rate Interface), hmmm, basic, I think I mentioned that word before.  NI2 was developed for PRI lines (Primary Rate Interface).

I don’t know the history, or the reasoning as to why a certain telco insists on selling their PRI lines as either NI1 or NI2.  I’ve only seen NI2 out west to be honest, and the first “NI1” trunk that I’ve seen happens to be out east.  I’ll blame the sales guy who first started creating the database and ordering system.  Or maybe they’ve forgotten too; I’ll have to post the monkeys-in-the-cage analogy sometime.

Summary:  In Canada anyway, if a telco says to use NI1, there is a very high probability that it’s actually DMS100.  Or if you just can’t get an answer, try DMS100, and if it’s not working out, then go NI2.  If you’re still out of luck, then NFAS is possibly in play and you definitely need to talk to the translations engineer to find out more.

Lync/Skype Emergency Calling

For Lync and Skype Services, I typically was normalizing ^([2-9]11)$ to +$1.  So in the Voice Routes I specify a service filter \+?[2-9]11, and on the “Trunk Configuration | Called number translation rules” I have one for ^\+?([2-9]11)$ to $1 so the plus gets removed before routing, IF there is a plus.  But you do something long enough, you sometimes forget why you do the things you do…

An incident with a client reminded me of the why: Emergency Calling.  As you can see from the screenshot below, Emergency Calling will bypass all normalization rules, so if the Voice Policies/Routes are only routing calls based on ^\+\d+ then only calls normalized with a + will route, as was the case with this particular client.  VVX phones have a built-in function for tagging calls with a Session Priority of Emergency (which can be seen in the CDR).  When that is invoked, normalization is bypassed and the 911 does not get normalized to +911, thus the call fails because there is no route.  211, 311, 411 all work because they aren’t Emergency Calls, so they go through the normalization process.


After creating a new Route and PSTN Usage, inserting the appropriate filters, and updating the Outbound Calling Translation rules, we were good to go again.

New PSTN Route – Pattern: ^\+?[2-9]11 and assign the Appropriate Trunk

Called Number Translation Rule – Services: ^\+?([2-9]11)$ –> $1

In this case, I was able to complete testing of 911 service by using manipulation rules in the AudioCodes gateway, first testing that 811 would redirect to my cell phone, then testing with 911 in the manipulation.  Calls routed, CDRs looked good again, I could no longer reproduce the issue, everything is good again.
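
To see why 911 fell through while 411 did not, the patterns from this post can be exercised outside of Lync/Skype.  The snippet below is only a sketch in Python: the function and variable names are made up, the "emergency" flag simply models the normalization bypass described above, and Python writes the replacement as \1 where the Lync/Skype rules use $1.

```python
import re

NORMALIZE = (re.compile(r"^([2-9]11)$"), r"+\1")      # dial plan rule: 911 -> +911
OLD_ROUTE = re.compile(r"^\+\d+")                      # client's route: needs a leading +
NEW_ROUTE = re.compile(r"^\+?[2-9]11")                 # new route: the + is optional
TRUNK_RULE = (re.compile(r"^\+?([2-9]11)$"), r"\1")   # trunk rule: strip the + if present


def normalize(dialed, emergency=False):
    """Apply the normalization rule unless the call is flagged as emergency."""
    if emergency:
        return dialed  # models the bypass: the number is left exactly as dialed
    pattern, replacement = NORMALIZE
    return pattern.sub(replacement, dialed)


def to_trunk(number):
    """Apply the called number translation rule before handing off to the trunk."""
    pattern, replacement = TRUNK_RULE
    return pattern.sub(replacement, number)


if __name__ == "__main__":
    for dialed, emergency in [("411", False), ("911", True)]:
        number = normalize(dialed, emergency)
        print(dialed, "->", number,
              "| old route matches:", bool(OLD_ROUTE.match(number)),
              "| new route matches:", bool(NEW_ROUTE.match(number)),
              "| sent to trunk as:", to_trunk(number))
```

The un-normalized 911 only matches the route where the plus is optional, which is the whole point of the new Route above.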
