Response Groups and Disconnected calls

I’ve hit this a couple of times now, and it burns my brain each time because I already know the answer.  Maybe blogging about it will keep it front of mind.

Problem Scenario:  User changes their password.  User is a member of the Response Group.  Response group calls to User drop as soon as they’re picked up by User.

80 minutes later, after reviewing logs, traces, and settings, removing and re-adding the user, and checking moon phases, the light goes back on: Client Certificate.

Resolution:  Sign the user out, remove the user certificate via the Control Panel, and sign the user back in.  Poof, magic, RGS calls work again.

I’m sure that with more time to ponder this week I could work out why this only affects Response Group calls.  The user can make and receive direct calls, but incoming calls via RGS drop for a user who has recently changed their password.

Feel free to leave a note below if you have some sort of divine reasoning as to why password changes affect client certs in such a way that only RGS calls are affected.

 

Response Groups and Polycom VVX’s

A recent baffling issue with a client involved Response Groups in a Skype for Business environment.  Response Group agents using VVX phones (both 310s and 600s) were experiencing an issue with calls coming from other VVX phones within the company.  As with any troubleshooting case, a clear scope and the ability to reproduce the issue are key.  Naturally I had neither, just anecdotal statements that some calls to Main Reception were failing…  They only get about 200+ calls a day, no biggie.  Please give me a date/time of an incident, and tell me whether it is reproducible.

A few days later it was discovered that it was internal calls to Main Reception that were having the issue.  Great, I can reproduce it, even externally with a Skype call.  It’s that lovely Diagnostic ID 32 “Call terminated on mid-call media failure where both endpoints are internal” or Diagnostic ID 33 “Call terminated on a mid-call media failure where one endpoint is internal and the other is remote”.  I hate these two, but in my past experience these two codes usually have something to do with codecs.  Sometimes you get lucky and someone misspelled the internal Edge interface DNS entry or certificate entry, but we’ve been in production much too long for it to be that and only be reported now.  No, this was going to be a codec problem.

When reproducing the issue myself, I also noticed something very odd when calling from my VVX 600: sometimes it looked like it was going into a conference call.  Another odd symptom, which we can see in the Monitoring Reports, is that some of the calls with the RGS group show Video in the Modalities list, but not always.  Main Reception had one agent with a VVX 600 and another agent with a VVX 310; occasionally Video showed up, but only for the 600 agent.  So, yeah, something codec related.

I apologize at this point, because I didn’t have the time, or the client’s patience, to do a proper SIP trace and log collection.  I went straight for the throat and modified the codec list on the RGS agents’ phones, and on my own.  I modified the codec priorities on the phones to look like below; the 310s don’t have the video codecs, so no need to worry there.

[Image: VVX 600 codec priority list after the change]

I moved G.711Mu and G.722 to the top of the list.  If you’re outside of North America, I believe you would move G.711A to the top of your list.  G.722.1 (24 kbps) I think is unnecessary; I thought maybe it was the codec for Microsoft Siren, but I can’t tell.  G.711A I left in just because.  I completely removed the video codecs as they don’t work with Lync 2013 or Skype environments anyway, plus the RGS/video conference thing…  I don’t see a purpose for the bottom 4 audio codecs and will be testing with them removed for a while.  I’ll update the blog if I get any further confirmation about which codecs should be in and which definitely don’t need to be there for Lync/Skype environments.  For the next while I’m personally testing with just G.722, G.711Mu, and G.711A; I don’t think there are any other codecs on the VVX phones that are compatible with a Lync 2013 or Skype for Business environment.

I experienced a similar codec issue with a Telus SIP trunk in a Lync 2013 environment using an AudioCodes M1k.  In that case it was Telus cell phones calling in through the Telus SIP trunk, and AMR (Adaptive Multi-Rate speech codec) was getting negotiated, but there was no media actually flowing, or it would be one way.  The problem there was fixed by limiting the codecs on the Telus SIP trunk side of the AudioCodes to only G.711Mu (PCMU) or G.711A (PCMA).

Now I’m going to go backcheck a few clients who have VVX phones and look for Diagnostic ID 32 and 33 in their environments.  The rub on this is that 32 and 33 aren’t Errors, just Warnings, so you have to look for them yourself.  To find them, go into your Monitoring Reports server and click on Top Failures Report.  Change the Category from “Unexpected Failure” to “Both expected and unexpected failure”.  In the Diagnostic IDs field, enter “32,33”, like below.  Change the date range as well if you like.

[Image: Top Failures Report filtered on Diagnostic IDs 32 and 33]

Dang, 70 in the last month… And affects both RGS and Exchange UM calls…  But none since the VVX Provisioning policy took effect.  Speaking of VVX Provisioning, here is the line I added to modify all the VVX phones via the common CFG file:

<WEB video.codecPref.H261="0" video.codecPref.H263="0" video.codecPref.H2631998="0" video.codecPref.H264="0" voice.codecPref.G711_A="4" voice.codecPref.G711_Mu="2" voice.codecPref.G722="1" voice.codecPref.G7221.24kbps="3" voice.codecPref.G7221_C.48kbps="6" voice.codecPref.Siren14.48kbps="7" voice.codecPref.Siren22.64kbps="5" />
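To sanity check a line like that before pushing it to every phone, here is a minimal Python sketch that reads the attributes and lists the codecs in the order the phone should prefer them.  It assumes a value of 0 disables a codec and that lower non-zero numbers win; the function name codec_order is mine, not anything from Polycom.

```python
import re

# The provisioning attributes from the CFG line, written with straight quotes.
CFG_LINE = (
    '<WEB video.codecPref.H261="0" video.codecPref.H263="0" '
    'video.codecPref.H2631998="0" video.codecPref.H264="0" '
    'voice.codecPref.G711_A="4" voice.codecPref.G711_Mu="2" '
    'voice.codecPref.G722="1" voice.codecPref.G7221.24kbps="3" '
    'voice.codecPref.G7221_C.48kbps="6" voice.codecPref.Siren14.48kbps="7" '
    'voice.codecPref.Siren22.64kbps="5" />'
)


def codec_order(cfg_line, family="voice"):
    """Return the enabled codecs of one family, most preferred first.

    Assumption: 0 means disabled, and a lower non-zero value means
    a higher priority.
    """
    # Each match is (family, codec name, priority), e.g. ("voice", "G722", "1").
    pairs = re.findall(r'(\w+)\.codecPref\.([\w.]+)="(\d+)"', cfg_line)
    enabled = [
        (int(priority), name)
        for fam, name, priority in pairs
        if fam == family and int(priority) > 0
    ]
    return [name for _, name in sorted(enabled)]


if __name__ == "__main__":
    print("voice:", codec_order(CFG_LINE, "voice"))
    print("video:", codec_order(CFG_LINE, "video"))
```

With the line above, G722 and G711_Mu should come out first and the video list should come out empty, which matches what I set by hand on the phones.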

If you haven’t set the prov.polling settings, the change probably won’t happen until you reboot the phone.  I think I smell another blog post, basic VVX Provisioning server configuration settings, IF you’re not using Event Zero UC Commander for Provisioning…

Additional Notes:  I should also have noted that we also 1) Rebuilt the Response Group while trying to resolve this.  In fact, while making minor changes it corrupted just like in the old days, and we had no choice but to delete it, wait for replication, and recreate it; the issue persisted. 2) Installed CU-259; the issue still persisted.

Check your 6 (DNS)

I’ve been meaning to write this post for a number of years, and it’s probably been 2 or 3 years since I last had an incident of this nature, but it’s time to share the knowledge.  An often overlooked area of concern, for both Exchange and Skype/Lync, is the state of the DNS environment.  DNS is one of those foundation services that, if it’s not working correctly, can give you all kinds of grief.

Many super gurus out there like to do everything with PowerShell when deploying an on-premises Skype/Lync environment, including the Schema changes and the Forest and Domain preps.  Me, I find something comforting about a nice GUI, with progress bars and check boxes.  I have the PS commands as well though, just in case things aren’t working quite right.

In one such environment, the Deployment Wizard just could not get the Schema to go.  However, after the 4th or 5th attempt, it worked.  Rather than trying to figure out the why, I went on with preparing the Forest, this time using:  Enable-CsAdForest -GroupDomain company.local -Verbose  Boom, failed again.  A quick scour through the XML and I find this:  “Cannot find one suitable domain controller under the domain company.local.”  OK, I check, and there are plenty of Domain Controllers available, hmmm, 6 GCs for that site alone… for 200 people…  I retried the same command again and again, and the 5th time it works.  OK, that ain’t right.

Quick chat with the customer revealed that the reason for so many GC’s is that the log on process for users was soooo slow often times, 5-8 mins, but other times fine.  But the additional DC’s didn’t really seem to help.  Well, that’s not good.

One AD health check later (this kind of command is helpful: dcdiag /e /c /v /f:c:\temp\dcdiag1.txt), plus checking replication status and Sites and Services of course, nothing really stood out, other than LSASS being unusually busy for a 200 user environment with 6 GCs.  Time to start with the basics…  How does a client pick a DC for logon…  Like many things, DNS is key.

I start poking around DNS, making sure resolution is working for all the servers listed in AD Sites and Services. Yup, all good.  Then I start looking in the _msdcs folders for the domain…  Drilling down through DC, _Sites, Default-First-Site-Name, _tcp.  Ok we’ve got entries… a WHOLE lot of entries.  Checking the rest; Domains, GC, PDC, and with the exception of PDC, there were too many GC’s/DC’s listed.  In fact there were 13 Global Catalog servers listed, of which only 6 still existed.  I sense resolution ahead.

It turned out no one was decommissioning Domain Controllers properly in the environment.  Basically they would shut down a retiring server, leave it off for a few days, and then delete it from Sites and Services and from the Domain.  The problem was, the DNS entries were never cleaned up.  Because DCPromo was not used, none of the GC/DC related DNS entries were removed.

After an hour or two of combing through every level of the DNS structure, verifying whether each server still existed, and/or still existed as a DC (IP recycling), and removing the dead entries, LSASS utilisation on the GCs dropped from 20%+ down to 1%.  I reran the Schema prep without issue, same for the Forest and Domain preps, no more problems.
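The comparison I did by eye can be sketched in a few lines of Python: take the host names the SRV records point at, take the list of DCs that really exist, and see what’s left over.  The host names below are made up for illustration, and the function names are mine.  In real life you would collect the SRV targets yourself, for example with something like nslookup -type=SRV _gc._tcp.company.local.

```python
def normalize(host):
    """Lower-case a host name and drop a trailing dot (DNS names are case-insensitive)."""
    return host.strip().rstrip(".").lower()


def find_stale_targets(srv_targets, live_dcs):
    """Return SRV targets that do not correspond to any live DC, sorted by name."""
    live = {normalize(h) for h in live_dcs}
    return sorted({normalize(t) for t in srv_targets} - live)


# Hypothetical example data: what DNS still advertises vs. what actually exists.
srv_targets = [
    "ad01.company.local.",
    "AD02.company.local.",
    "olddc07.company.local.",
    "olddc09.company.local.",
]
live_dcs = ["ad01.company.local", "ad02.company.local"]

if __name__ == "__main__":
    for host in find_stale_targets(srv_targets, live_dcs):
        print("stale SRV target:", host)
```

Anything this prints is a candidate for cleanup, but verify each one by hand first, since a recycled IP or a renamed server can make a live box look dead.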

In summary, whether logging on to the domain or running forest or domain preps, your PC/server gets back a list of GCs from DNS.  Sometimes the first entry is a working one, though never in my case.  When I ran into the same issue with another client, I recognized the problem right away, was able to specify an operating GC, and advised them to clean up their DNS entries ASAP.

Enable-CsAdForest -GlobalCatalog ad01.company.local -GroupDomain company.local -GroupDomainController ad01.company.local -GlobalSettingsDomainController ad01.company.local -Verbose

Enable-CsAdDomain -Domain company.local -GlobalCatalog ad01.company.local -Verbose

If you absolutely have to resort to the above commands to get the environment prepped, take it as a sign that something isn’t right, and you’ll want to have a look at the AD DNS.

 

Automatically Delete Device Update Logs

Set-CsDeviceUpdateConfiguration, the forgotten command of many a Lync/Skype project.  If you don’t want the LyncShare\x-WebServices-x\DeviceUpdateLogs\Server\Audit\imageUpdates folder filling up beyond what is deemed necessary, then use the following:

Set-CsDeviceUpdateConfiguration -LogCleanUpTimeOfDay 17:05

You can also modify the retention time with the -LogCleanUpInterval 5.00:00:00 for a 5 day retention or 365.00:00:00 for 1 year.  Default is 10 days, but unless the time of day is set, the process will never kick in.

Try not to set it 5 minutes ahead, or you might be waiting 24 hours and 5 minutes for it to actually start deleting.  A good Microsoft Minute (15 minutes) should do.  Or just schedule it, check the next day that it’s working, and cross it off your monthly/quarterly maintenance checklist.

Technet article:  https://technet.microsoft.com/en-us/library/gg398320.aspx 

 

 

Challenging Process

Many people have heard the analogy of the monkeys, the ladder, and the bananas.  If not, here’s a refresher:  [Image: monkey analogy]

I’m not going to discuss whether the experiment ever happened or not; I have always thought it a valid parable of what happens in many organizations.  Certainly if you have been in IT long enough, you’ve probably hit upon a process that no one knows the “why” of.  Challenging in this case is a verb, not an adjective.

Case in point:  I was working a large multi-Ministry Exchange consolidation project using Quest, and we finished migrating all the mailboxes from Exchange 2003 to 2007, BUT, there was a report still being run based on mechanisms that were no longer available in the new Exchange environment.  For 9+ months we were not allowed to shut down the retiring Exchange server as a developer could not be found to re-engineer the report.

Server maintenance gets costly, plus the box was a little flaky.  Time to go GenX on this and find out more about this report, so I went on site.  First stop, talk to the person who receives the report; let’s just say it’s Bobo.

“Bobo, what data or information are you needing out of this report you receive every week?”

“Oh, I don’t look at it, I forward to my team lead Bubbles”

Ok, off to visit Bubbles.  “Bubbles, what data or information are you looking for in this report you get forwarded from Bobo”

“oh, I don’t look at it, I forward it to Koko”

Ok… off to visit Koko.  “Koko, what data or information are you looking for in this report you get forwarded from Bubbles?”

“Oh, I have a rule that autoforwards that subject line to George, I don’t actually look at it”

I’m sensing a pattern here… George sends me to Clyde, from Clyde to Kong.

“Kong, this report with 6 fwd’s on it, what do you do with it.”

Kong: “I delete the darned thing, I wish people would stop sending me this useless report”

The Exchange 2003 environment was decommissioned shortly thereafter.

Seems like a funny story, it’s actually sad and costly.  I believe the Ministry was getting charged by an outside vendor ~$1500 a month for the server, support, storage and backup.

Lesson summary, well there are a few lessons here.

  1. Whether it’s a family tradition or a corporate process, without knowing the meaning or the why, it is often rendered useless.
  2. Not every problem requires a solution.  Sometimes the problem is the problem itself.
  3. As a Consultant/Contractor, clients are paying us good money for our experience and value.  Sometimes that value is having “outside eyes” looking at issues that have blinded internal users.
  4. Sometimes there is a forgotten “why”, I’m not advocating blindly trampling processes like a bull.  Be the 800 lb gorilla willing to talk about the elephant in the corner.

NI1 or NI2 or DMS100

I’ll be honest, my IT background and bread’n’butter for 15 years was primarily Exchange.  OCS/Lync presented an interesting opportunity to expand my Exchange horizons with the integration of UM and ultimately into Unified Communications.  I do not have a background with telcos, though I have set up a couple of RightFax servers (just dated myself there, no idea who owns them now, still Captaris?!?) with Brooktrout boards using PRI lines.  Needless to say, I’m not an expert when it comes to ISDN.  The first company I worked for was a law firm running an ISDN 128k data line for internet, and one of the first things I discovered for the firm was a LAN connection to the internet, 10 Mb, it was lightning.  18 years later and ISDN is still around, though mostly for voice I would guess.  (I bet you just figured out why it’s UCC Ramblings, didn’t you?)

Anywho, a recent telco experience led me to do a bunch of research as to why we were having so many problems with our PRI line connection on an AudioCodes gateway.  When I asked the telco for the protocol type, clock master, line code… they insisted it was NI1.  Things did basically work (smarter people than me will probably have picked up on what I said there), but there were warnings/errors in the SIP traces.  You could make and receive calls as usual, BUT something was missing: the Display name was blank, always blank.

After performing a PSTN trace for the AudioCodes engineer to analyse, there was lots of data missing, and things really didn’t look right.  The request came back: try DMS100 or NI2.  But telco says it’s NI1 times 5, very insistent.  What the heck, I had set up amazing routing in S4B to have outbound calls go out 2 different gateways, no worries.  Lo and behold, changing from the NI1 protocol to DMS100 was the silver bullet.  Errors gone, warnings gone, Display name appears, woo hoo.  A new PSTN trace showed the Display property being used, and not a Facilities method like in NI2.  Very definitely a T1 DMS100 ISDN line.

Now I’m perplexed.  I’m GenX, I need to know the why.  A few days of reading later, I know more about ISDN than I ever cared to.  I much prefer SIP trunking, but if you don’t know where you came from, how will you know where you’re going…

ISDN has been around since the ’70s, DMS100 was developed in 1979.  Resilient little protocol built in a time of efficiency where every bit mattered, blah blah blah, here’s a link: ISDN  NI1 was developed to standardise BRI lines (Basic Rate Interface), hmmm, basic, I think I mentioned that word before.  NI2 was developed for PRI lines (Primary Rate Interface).

I don’t know the history, or the reasoning as to why a certain telco insists on selling their PRI lines as either NI1 or NI2.  I’ve only seen NI2 out west to be honest, and the first “NI1” trunk that I’ve seen happens to be out east.  I’ll blame the sales guy who first started creating the database and ordering system.  Or maybe they’ve forgotten too; I’ll have to post the monkeys-in-the-cage analogy sometime.

Summary:  In Canada anyway, if a telco says to use NI1, there is a very high probability that it’s actually DMS100.  Or if you just can’t get an answer, try DMS100, and if it’s not working out, then go NI2.  If you’re still out of luck, NFAS is possibly in play and you definitely need to talk to the translations engineer to find out more.

Lync/Skype Emergency Calling

For Lync and Skype service numbers, I typically normalize ^([2-9]11)$ to +$1.  So in the Voice Routes I specify a service filter of \+?[2-9]11.  And on the “Trunk Configuration | Called number translation rules” I have one for ^\+?([2-9]11)$ to $1, so the plus gets removed before routing, IF there is a plus.  But when you do something long enough, you sometimes forget why you do the things you do…

An incident with a client reminded me of the why: Emergency Calling.  As you can see from the screenshot below, Emergency Calling bypasses all normalization rules, so if the Voice Policies/Routes are only routing calls based on ^\+\d+, then only calls normalized with a + will route, as was the case with this particular client.  VVX phones have a built-in function for tagging calls with a Session Priority of Emergency (which can be seen in the CDR).  When that is invoked, normalization is bypassed and the 911 does not get normalized to +911, so the call fails because there is no matching route.  211, 311, and 411 all work because they aren’t emergency calls, so they go through the normalization process.

[Image: emergency call routing screenshot]

After creating a new Route and PSTN Usage with the appropriate filters, and updating the outbound called number translation rules, we were good to go again.

New PSTN Route – Pattern: ^\+?[2-9]11 and assign the Appropriate Trunk

Called Number Translation Rule – Services: ^\+?([2-9]11)$ –> $1
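If you want to convince yourself of what those patterns do, here is a small Python sketch that runs the same regular expressions against a few dialed strings.  It only mimics the pattern matching, not how Lync/Skype actually evaluates routes, and the helper names are mine.  Note that the .NET style $1 in the translation rule is written \1 in Python.

```python
import re

PLUS_ONLY_ROUTE = r"^\+\d+"      # a route that only matches normalized (+) numbers
SERVICE_ROUTE = r"^\+?[2-9]11"   # the new route pattern: plus sign is optional
TRANSLATION = (r"^\+?([2-9]11)$", r"\1")  # strip the plus, if there is one


def route_matches(pattern, dialed):
    """Return True when the route pattern matches the dialed string."""
    return re.search(pattern, dialed) is not None


def translate(dialed):
    """Apply the called number translation rule; non-matching numbers pass through unchanged."""
    pattern, replacement = TRANSLATION
    return re.sub(pattern, replacement, dialed)


if __name__ == "__main__":
    for dialed in ("911", "+911", "+411"):
        print(
            dialed,
            "| plus-only route:", route_matches(PLUS_ONLY_ROUTE, dialed),
            "| service route:", route_matches(SERVICE_ROUTE, dialed),
            "| sent to trunk as:", translate(dialed),
        )
```

The un-normalized 911 from an emergency tagged call never matches the plus-only route, which is exactly the failure described above, while the service route catches it with or without the plus.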

In this case, I was able to complete testing of 911 service by using manipulation rules in the AudioCodes gateway, first testing that 811 would redirect to my cell phone, then testing with 911 in the manipulation.  Calls routed, CDR’s looked good again, no longer could reproduce the issue, everything is good again.

Screen shot courtesy of:  https://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/EXL318

 

In the beginning…

This would be my first blog posting in 6 years.  I’m not much of a writer, punctuation will be entertaining to say the least, thank God for spellcheck, and I will literally ramble on.  I am still familiarizing myself with this WordPress interface; I kind of miss FrontPage and WYSIWYG, but orientation takes time.  As a long time IT Administrator, blinders go up: where this site may be easy for a normal person, someone immersed in IT looks for the hard way to do things.  This site will change, evolve and hopefully get better.