443 in Skype for Business Land

A cautionary reminder when firewall rules are being set up for Skype for Business: 443, or 443/TCP, or 443/TCP/SIP, does NOT mean HTTPS.  Honestly, I don't think I've met a firewall yet that supports Microsoft's 443/SIP, so my rule request is very specifically 443/TCP.  Unless the required rule, like the one for the Reverse Proxy, actually states HTTPS, 443/TCP the rule ye be.

MS Firewall Rule Requirements

The above link goes to Microsoft's firewall port requirements, and for the Edge they are now specifying 443/SIP or 443/TCP for the Edge Access role (updated July 11, 2016).  This is interesting because I think there has been a change in behavior in the Edge services, specifically around 443.  Scenario below:

Client running CU-235 (yes, I use the last 3 digits of the CU, as some "CUs" are security updates, so I find this the most logical way of identifying the CU being referenced).  Anyway, CU-235 was applied to both the Frontend and Edge servers properly.  Internally, few if any issues were noticed; externally, there were problems with taking a 2-way PC-to-PC call and adding a 3rd person to the call.  Naturally the call is elevated into a Conference call and bridges through the Edge.  Setup of this conference takes 4-10+ seconds longer than usual, BUT only the person who brought the 3rd party into the call, and the 3rd party themselves, are in the call.  The 2nd person who was in the call is Paused, and is eventually booted, but can click Rejoin and finally enter the Conference Call.  Very annoying.

CU-259 comes along, and there's hope that this may be resolved… Nope.  In fact, things got horribly, horribly worse, and the same goes for CU-272.  Meetings fail, all external 2-way to 3-way call escalations fail, and much more functionality is broken with Webconf and AV.  Both times, the system was rolled back to a tolerable level.

Please bear in mind, I was requesting a review of the firewall rules the whole time; a power outage borked the firewall, couldn't access it, needed a reboot, needed a change window… etc.  First glance at the rules, and thar she sat: HTTPS.  IF you are reviewing firewall rules with a client/FWadmin and see HTTPS, you can pretty much assume this is an APPLICATION LAYER rule, and it will restrict the verbs and actions of the protocol being used.  On some firewalls, if you enter 443/TCP the rule will actually switch to HTTPS (I've seen Juniper do this), and it requires the FWAdmin to make changes to create a new, specific rule for 443/TCP.  Different firewalls will also exhibit various different failures for the Edge.  Simply put, IF in testing or in production you have any kind of weirdness with External Conferencing or Web Conferencing functionality, Step 1) Make sure the firewall isn't using an HTTPS rule.  443/TCP, gitterdone.
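If you want a quick sanity check from your side, a minimal sketch with Test-NetConnection is below, assuming a hypothetical Edge Access FQDN of sip.contoso.com.  Be aware this only proves the TCP handshake completes; an application layer HTTPS rule will usually still pass this test while quietly mangling SIP, which is part of why this problem is so hard to spot.

# Basic reachability check only; a passing result does NOT prove the rule is a true 443/TCP rule
Test-NetConnection -ComputerName sip.contoso.com -Port 443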

Oh look, a bird:  It also doesn't hurt to make sure that SIP ALG, SIP Inspect, or any other APPLICATION LAYER filtering for SIP is disabled.  Microsoft SIP is encrypted, but sometimes weirdness still happens.

Summary:  You may have been getting away with HTTPS rules on your Edge external rules, but with CU-235 and higher, especially higher, there is a good chance you're going to encounter issues.  Oh yes, MS Support was contacted before I came on board, and they were baffled by the behavior and failures; the traces they ran couldn't explain the problem.  So, SIP trace and Wireshark wise, it's not easy to identify this problem.

Thou shall not let Internal Users connect to External Edge Interface…

I've been involved with UC for a while, long before it was called UC, and over time we've all developed cardinal rules when it comes to deployments.  One that I, and I know several others, have adhered to: "Thou shall not let Internal Users connect to the External Edge Interface".  Right!?!

Times are a-changing, and rules are often made to be broken; add the one above to the list, or at least a bending of it.  Enter extended Skype Online/Hybrid coexistence.  While in a Hybrid configuration, Online users and Onprem users are one big happy environment, right?  Wrong!!  The reality is, they are two separate environments with a Shared SIP Namespace, with some bits of Replication from Onprem to Online thrown in.  (Online doesn't replicate to OnPrem, see other postings.)

Usually this is all good, UNTIL an Online user who normally works from home decides to come into the Office one day.  They sign in no problem; they hit up SIP, the Internal environment sees they're an Online user, redirects them up, and right as rain.  Time to join the Onprem meeting hosted by their in-office Manager.  Audio/Video works, but noooo presentation, and an error message comes up when trying to share content to Present:  "Your DNS configuration is preventing you from presenting content", or possibly other variants.

Skype Online users, when signed in, are for connectivity purposes External Federated Users, and actually need to connect to the Web Conferencing interface on the External Edge.   If you've been following the aforementioned cardinal rule, there is likely no internal name resolution for the WebConf FQDN, and/or possibly firewall rules blocking internal connections to the External interface.

Don't believe me?  It's in TechNet as a requirement for Hybrid: https://technet.microsoft.com/en-us/library/jj205403.aspx


Another odd scenario, and I hope this is rare: one large International Corporation, separate forests, separate Domains, but replicating their split-brain internal DNS zones which house the internal SIP/Skype DNS entries.  Corporate Site A can't resolve webconf.corporateB.com, because they have B's internal split-brain version of the public zone replicated/resolved, instead of the Public DNS Zone.

Seems like the new rule is: add the External Web Services and Webconf FQDNs to your Internal split-brain DNS zones (a quick sketch below).
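For illustration, a minimal sketch of both the check and the fix from a Windows DNS server, assuming the DnsServer PowerShell module and hypothetical names/addresses (webconf.contoso.com, 203.0.113.10); substitute your own zone, FQDNs, and real external IPs.

# Does the WebConf FQDN resolve internally at all?
Resolve-DnsName -Name webconf.contoso.com -Type A

# If not, add it to the internal split-brain zone, pointing at the public/external address
Add-DnsServerResourceRecordA -ZoneName "contoso.com" -Name "webconf" -IPv4Address "203.0.113.10"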

Good times.

Additional note:  In the Wireshark captures, the Skype Online AV traffic also appeared to be going through the Edge AV NIC.  The same machine signed into an Onprem account connected directly with the Frontend.

Skype for Business Hybrid pt.1

I'm sure there will be many more parts to this, as O365 is ever "evolving", a euphemism for "it's bloody different every time I go in there…"

While recently trying to connect an OnPrem Skype environment to a company's Online counterpart, aka setting up Skype4B Hybrid, I ran into this lovely error:

Get-CsWebTicket : Failed to connect live id servers.  Make sure proxy is enabled or machine has network connection to live id servers


After verifying and re-verifying numerous items (connectivity; accessing the portal to verify the password; reviewing every PowerShell command; DNS entries; confirming I could move CsUsers via PowerShell; and moon phases), it was time to contact support.  A week later, plus 2-4 engineers (who can keep track), I got the knowledgeable one.

Run these 3 commands, from an As Administrator CMD prompt, not PowerShell:

ICACLS %windir%\System32\config\systemprofile\AppData\Local /grant *S-1-5-20:(OI)(CI)(RA)

ICACLS %windir%\System32\config\systemprofile\AppData\Local\Microsoft\MSOIdentityCRL /grant *S-1-5-20:(OI)(CI)(IO)(F)

%windir%\system32\inetsrv\appcmd recycle apppool /apppool.name:LyncIntManagement

Good to go after that: no more problems signing in, and I was able to complete the "Set up Hybrid with Skype Online" wizard, plus move users up and down without the use of Skype PowerShell.
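For the record, when I did have to fall back to PowerShell during troubleshooting, the move up to Online looks roughly like the sketch below; the user, domain, and admin pool prefix (admin1a) are placeholders, so match them to your own tenant.

# Move an OnPrem user up to Skype for Business Online (hypothetical user and admin pool)
$cred = Get-Credential   # an Office 365 admin account
Move-CsUser -Identity "bobo@contoso.com" -Target "sipfed.online.lync.com" -Credential $cred -HostedMigrationOverrideUrl "https://admin1a.online.lync.com/HostedMigration/hostedmigrationservice.svc"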

This is potentially an issue with CU-259; I'm not sure when it began or when it will be fixed, but the above commands appear to re-apply a missing/broken ACL.  For what it's worth, S-1-5-20 is the well-known SID for the NETWORK SERVICE account, so the first two commands grant that account access to the system profile's AppData\Local and MSOIdentityCRL folders, and the third recycles the LyncIntManagement application pool so the change takes effect.

Special shout out to Arran for his article on Online-to-Onprem setups.  His section on getting the already-Online users enabled in the newly created Onprem system saved my bacon:  https://blog.kloud.com.au/2015/08/26/skype-for-business-online-to-on-premises-migration/

Additional note for Skype Hybrid setups: when creating the new CsHostingProvider, check your Skype Online Admin Panel.  IF the URL is admin1a.online.lync.com, your Autodiscover URL on your hosting provider will likely change to https://webdir1a.online.lync.com/Autodiscover/AutodiscoverService.svc/root after you've completed the "Set up Hybrid with Skype Online" wizard.  At least that's my experience every time.  Doesn't matter much, just being petty I'm sure, but next time I'll be trying the command below instead, assuming it's admin1a again.

New-CsHostingProvider -Identity SkypeforBusinessOnline -ProxyFqdn "sipfed.online.lync.com" -Enabled $true -EnabledSharedAddressSpace $true -HostsOCSUsers $true -VerificationLevel UseSourceVerification -IsLocal $false -AutodiscoverUrl https://webdir1a.online.lync.com/Autodiscover/AutodiscoverService.svc/root
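And once the wizard has run, a quick way to see what the hosting provider actually ended up with (Get-CsHostingProvider is the matching Get cmdlet; the Identity assumes the name used above):

# Confirm the ProxyFqdn and AutodiscoverUrl that are now in place
Get-CsHostingProvider -Identity SkypeforBusinessOnline | Select-Object ProxyFqdn, AutodiscoverUrl, EnabledSharedAddressSpace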


Response Groups and Disconnected calls

I've hit this a couple of times now, and it burns my brain each time because I know the answer.  Maybe blogging about it will keep it in the forefront more.

Problem Scenario:  User changes their password.  User is a member of the Response Group.  Response group calls to User drop as soon as they’re picked up by User.

80 mins later, after reviewing logs, traces, settings, removing/re-adding, and moon phases, the light goes back on: Client Certificate.

Resolution:  Sign out the user, remove the User Certificate via the Control Panel, sign the user back in.  Poof, magic, RGS calls work again.
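If you'd rather do the certificate part server-side, something along these lines should also work; a sketch only, with a made-up user, but Revoke-CsClientCertificate is the cmdlet that invalidates the certificates the pool issued to that user.

# Revoke the client certificate(s) issued to the affected user, then have them sign back in
Revoke-CsClientCertificate -Identity "sip:bobo@contoso.com"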

I'm sure if I had more time to ponder this week I could work out the reasoning as to why this only affects Response Group calls; the user can make and receive direct calls just fine, but incoming calls via RGS drop for a user that has recently changed their password.

Feel free to leave a note below if you have some sort of divine reasoning for why password changes affect client certs in such a way as to only affect RGS calls.


Response Groups and Polycom VVX’s

A recent baffling issue with a client involving Response Groups in a Skype for Business environment.  Response Group Agents using VVX phones (both 310s and 600s) were experiencing an issue with calls coming from other VVX phones within the company.  As with any troubleshooting case, a clear scope of the issue and being able to reproduce it are so very key.  Naturally I had neither, just anecdotal statements that some calls are failing to Main Reception…  They only get about 200+ calls a day, no biggie.  Please give me a Date/Time of an incident, and tell me whether it's reproducible.

A few days later it was discovered that it was internal calls to Main Reception that were having the issue.  Great, I can reproduce it, even externally with a Skype call.  It's that lovely Diagnostic ID 32, "Call terminated on mid-call media failure where both endpoints are internal", or Diagnostic ID 33, "Call terminated on a mid-call media failure where one endpoint is internal and the other is remote".  I hate these two, but in my past experience these two codes have something to do with codecs.  Sometimes you get lucky and someone misspelled the internal Edge interface DNS entry or Certificate entry, but we'd been in production much too long for it to be that and only be reported now.  No, this was going to be a codec problem.

When reproducing the issue myself, I also noticed something very odd when calling from my VVX 600: sometimes it looked like it was going into a Conference Call.  Another odd symptom, which we can see in the Monitoring Reports, is that some of the calls with the RGS group show Video in the Modalities list, but not always.  Main Reception had both an Agent with a VVX600 and another agent with a VVX310, and occasionally Video showed up, but only for the 600 agent.  So, yeah, something codec related.

I apologize at this point because I didn't have the time, or the client's patience, to do a proper SIP trace and log collection.  I went straight for the throat and modified the codec list on the RGS Agents' phones, and my own.  I modified the Codec Priorities on the phones to look like below; the 310s don't have the Video codec, so not to worry there.


I moved G.711Mu and G.722 to the top of the list.  If you're outside of North America, I believe you would move G.711A to the top of your list instead.  G.722.1 (24 kbps) I think is unnecessary; I thought maybe it was the codec for Microsoft SIREN, but I can't tell.  G.711A I left in just because.  I completely removed the video codecs as they don't work with Lync 2013 or Skype environments anyway, plus the RGS/Video Conference thing…  I don't see a purpose for the bottom 4 audio codecs and will be testing with them removed for a while.  I'll update the blog if I get any further confirmation about which codecs should be in and which definitely don't need to be there for Lync/Skype environments.  I'm personally testing for the next while with just G.722, G.711Mu, and G.711A; I don't think there are any other codecs on the VVX phones that are compatible with a Lync 2013 or Skype for Business environment.

I experienced a similar codec issue with a Telus SIP Trunk in a Lync 2013 environment using an AudioCodes M1K.  In that case it was Telus cell phones calling, coming through the Telus SIP trunk, and AMR (Adaptive Multi-Rate speech codec) was getting negotiated, but there was no media actually flowing, or it would be one-way.  The problem there was fixed by limiting the codecs on the AudioCodes, on the Telus SIP Trunk side, to only G.711Mu (PCMU) or G.711A (PCMA).

Now I'm going to go back and check a few clients who have VVX phones and look for Diagnostic IDs 32 and 33 in their environments.  The rub on this is that 32 and 33 aren't Errors, just Warnings, so you have to look for them yourself.  To find them, go into your Monitoring Reports server and click on the Top Failures Report.  Change the Category from "Unexpected failure" to "Both expected and unexpected failure".  In the Diagnostic IDs field, enter "32,33", like below.  Change the date range as well if you like.


Dang, 70 in the last month… And it affects both RGS and Exchange UM calls…  But none since the VVX Provisioning policy took effect.  Speaking of VVX Provisioning, here is the line I added to modify all the VVX phones via the common CFG file:

<WEB video.codecPref.H261="0" video.codecPref.H263="0" video.codecPref.H2631998="0" video.codecPref.H264="0" voice.codecPref.G711_A="4" voice.codecPref.G711_Mu="2" voice.codecPref.G722="1" voice.codecPref.G7221.24kbps="3" voice.codecPref.G7221_C.48kbps="6" voice.codecPref.Siren14.48kbps="7" voice.codecPref.Siren22.64kbps="5" />

If you haven't set the prov.polling settings, the change probably won't happen until you reboot the phone.  I think I smell another blog post: basic VVX Provisioning server configuration settings, IF you're not using Event Zero UC Commander for Provisioning…

Additional Notes:  I should also note that 1) we rebuilt the Response Group while trying to resolve this.  In fact, making minor changes corrupted it just like in the old days, and we had no choice but to delete it, wait for replication, and recreate it; the issue persisted.  2) We installed CU-259; the issue still persisted.

Check your 6 (DNS)

I've been meaning to write this post for a number of years, and it's probably been 2 or 3 years since I last had an incident of this nature, but it's time to share the knowledge.  An often overlooked area of concern for both Exchange and Skype/Lync is the state of the DNS environment.  DNS is one of those foundation services that, if it's not working correctly, can give you all kinds of grief.

Many super gurus out there, when deploying an On Premises Skype/Lync environment, like to do everything with PowerShell, including the Schema changes and the Forest and Domain preps.  Me, I find there is something comforting about a nice GUI with progress bars and check boxes.  I have the PS commands as well though, just in case things aren't working quite right.

In one such environment, the Deployment Wizard just could not get the Schema to go.  However, after the 4th or 5th attempt, it worked.  Rather than trying to figure out the why, I went on with preparing the Forest, this time using: Enable-CsAdForest -GroupDomain company.local -Verbose.  Boom, failed again.  A quick scour through the XML and I find this:  Cannot find one suitable domain controller under the domain company.local.  OK, I check, and there are plenty of Domain Controllers available; hmmm, 6 GCs for that site alone… for 200 people…  I retried the same command again and again; the 5th time, it works.  OK, that ain't right.

A quick chat with the customer revealed that the reason for so many GCs is that the logon process for users was soooo slow, oftentimes 5-8 mins, but other times fine.  The additional DCs didn't really seem to help, though.  Well, that's not good.

An AD Health check later (this kind of command is helpful: dcdiag /e /c /v /f:c:\temp\dcdiag1.txt), checking replication status, and Sites and Services of course, nothing really stood out.  Other than LSASS being unusually busy for a 200 user environment with 6 GCs, it was time to start with the basics…  How does a client pick a DC for logon…  Like many things, DNS is key.

I start poking around DNS, making sure resolution is working for all the servers listed in AD Sites and Services.  Yup, all good.  Then I start looking in the _msdcs folders for the domain… drilling down through dc, _sites, Default-First-Site-Name, _tcp.  OK, we've got entries… a WHOLE lot of entries.  Checking the rest (Domains, GC, PDC), with the exception of PDC, there were too many GCs/DCs listed.  In fact there were 13 Global Catalog servers listed, of which only 6 still existed.  I sense resolution ahead.
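If you'd rather not click through the DNS console, roughly the same check can be done from PowerShell; the domain name matches the company.local placeholder used later in this post, and the last line assumes the ActiveDirectory RSAT module is installed.

# List the DC and GC SRV records that clients actually get back from DNS
Resolve-DnsName -Type SRV -Name _ldap._tcp.dc._msdcs.company.local
Resolve-DnsName -Type SRV -Name _ldap._tcp.gc._msdcs.company.local

# Compare against the DCs/GCs that really exist
Get-ADDomainController -Filter * | Select-Object Name, IsGlobalCatalog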

It turned out no one was decommissioning Domain Controllers properly in the environment.  Basically they would shut down a retiring server, leave it off for a few days, and then delete it from Sites and Services and from the Domain.  Problem was, the DNS entries were never cleaned up (see Microsoft's AD decommissioning steps).  Because DCPromo was not used, none of the GC/DC related DNS entries were removed.

After an hour or two of combing through every level of the DNS structure, verifying whether each server still existed and/or still existed as a DC (IP recycling), and removing the stale entries, LSASS utilisation on the GCs dropped from 20%+ down to 1%.  I reran the Schema prep without issue, same for the Forest and Domain preps; no more problem.

In summary, whether logging on to the domain or running forest or domain preps, your PC/Server gets back a list of GCs from DNS.  Sometimes the first entry is a working one (never in my case), but I ran into the same issue with another client, recognized the problem right away, was able to specify an operating GC with the commands below, and advised them to clean up their DNS entries ASAP.

Enable-CsAdForest -GlobalCatalog ad01.company.local -GroupDomain company.local -GroupDomainController ad01.company.local -GlobalSettingsDomainController ad01.company.local -Verbose

Enable-CsAdDomain -Domain company.local -GlobalCatalog ad01.company.local -Verbose

If you absolutely have to resort to the above commands to get the environment prepped, take it as a sign that something isn’t right, and you’ll want to have a look at the AD DNS.


Automatically Delete Device Update Logs

Set-CsDeviceUpdateConfiguration, the forgotten command of many a Lync/Skype project.  If you don’t want the LyncShare\x-WebServices-x\DeviceUpdateLogs\Server\Audit\imageUpdates folder filling up beyond what is deemed necessary, then use the following:

Set-CsDeviceUpdateConfiguration -LogCleanUpTimeOfDay 17:05

You can also modify the retention time with -LogCleanUpInterval: 5.00:00:00 for a 5 day retention, or 365.00:00:00 for 1 year.  The default is 10 days, but unless the time of day is set, the cleanup process will never kick in.
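For example, a quick sketch combining the two settings, plus a check afterwards; adjust the interval to your own retention needs.

# Clean up at 5:05 PM, keeping 5 days of device update logs
Set-CsDeviceUpdateConfiguration -LogCleanUpTimeOfDay 17:05 -LogCleanUpInterval 5.00:00:00

# Verify what is currently configured
Get-CsDeviceUpdateConfiguration | Select-Object Identity, LogCleanUpInterval, LogCleanUpTimeOfDay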

Try not to set it just 5 mins ahead of the current time, or you might be waiting 24 hours and 5 minutes for it to actually start deleting; a good Microsoft Minute (15 mins) should do.  Or just schedule it, check the next day that it's working, and cross it off your monthly/quarterly maintenance checklist.

Technet article:  https://technet.microsoft.com/en-us/library/gg398320.aspx 



Challenging Process

Many people have heard the monkeys-ladder-bananas analogy; if not, here's a refresher: MonkeyAnalogy

I'm not going to discuss the merits of whether the experiment ever happened or not; I've always thought it a valid parable of what happens in many organizations.  Certainly if you have been in IT long enough, you've probably hit upon a process that no one knows the "why" of.  Challenging in this case is a verb, not an adjective.

Case in point:  I was working on a large multi-Ministry Exchange consolidation project using Quest, and we had finished migrating all the mailboxes from Exchange 2003 to 2007, BUT there was a report still being run based on mechanisms that were no longer available in the new Exchange environment.  For 9+ months we were not allowed to shut down the retiring Exchange server because a developer could not be found to re-engineer the report.

Server maintenance gets costly, plus the box was a little flaky, so it was time to go GenX on this and find out more about this report, and I went on site to discover more.  First stop, talk to the person who receives the report; let's just say it's Bobo.

“Bobo, what data or information are you needing out of this report you receive every week?”

"Oh, I don't look at it, I forward it to my team lead Bubbles"

Ok, off to visit Bubbles.  "Bubbles, what data or information are you looking for in this report you get forwarded from Bobo?"

"Oh, I don't look at it, I forward it to Koko"

Ok… off to visit Koko.  “Koko, what data or information are you looking for in this report you get forwarded from Bubbles?”

“Oh, I have a rule that autoforwards that subject line to George, I don’t actually look at it”

I’m sensing a pattern here… George sends me to Clyde, from Clyde to Kong.

"Kong, this report with 6 fwd's on it, what do you do with it?"

Kong: “I delete the darned thing, I wish people would stop sending me this useless report”

The Exchange 2003 environment was decommissioned shortly thereafter.

It seems like a funny story, but it's actually sad and costly.  I believe the Ministry was getting charged by an outside vendor ~$1500 a month for the server, support, storage and backup.

Lesson summary: well, there are a few lessons here.

  1. Whether it's a family tradition or a corporate process, without knowing the meaning or the why, it is often rendered useless.
  2. Not every problem requires a solution.  Sometimes the problem is the problem itself.
  3. As a Consultant/Contractor, clients are paying us good money for our experience and value.  Sometimes that value is having “outside eyes” looking at issues that have blinded internal users.
  4. Sometimes there is a forgotten "why"; I'm not advocating blindly trampling processes like a bull.  Be the 800 lb gorilla willing to talk about the elephant in the corner.

NI1 or NI2 or DMS100

I'll be honest, my IT background and bread'n'butter for 15 years was primarily Exchange.  OCS/Lync presented an interesting opportunity to expand my Exchange horizons with the integration of UM, and ultimately into Unified Communications.  I do not have a background with telcos, though I have set up a couple of RightFax servers (just dated myself there; no idea who owns them now, still Captaris?!?) with Brooktrout boards using PRI lines.  Needless to say, I'm not an expert when it comes to ISDN.  The first company I worked for was a law firm running an ISDN 128k data line for internet, and one of the first things I sorted out for the firm was a LAN connection to the internet, 10 Mb, it was lightning.  18 years later and ISDN is still around, though mostly for voice I would guess.  (I bet you just figured out why it's UCC Ramblings, didn't you?)

Anywho, a recent telco experience led me to do a bunch of research as to why we were having so many problems with our PRI line connection on an AudioCodes gateway.  When asking the telco for the protocol type, clock master, line code… they insisted it was NI1.  Things did basically work (smarter people than me will probably have picked up on what I said there), but there were warnings/errors in the SIP traces.  You could make and receive calls as usual, BUT something was missing: the Display name.  Blank, always blank.

After performing a PSTN trace for the AudioCodes engineer to analyse, there was lots of data missing, and things really didn't look right.  The request came back: try DMS100 or NI2.  But the telco says it's NI1, times 5, very insistent.  What the heck, I set up amazing routing in S4B to have outbound calls go out 2 different gateways, no worries.  Lo and behold, changing from the NI1 protocol to DMS100 was the silver bullet.  Errors gone, warnings gone, Display name appears, woo hoo.  The new PSTN trace showed the Display property being used, and not a Facilities method like in NI2.  Very definitely a T1 DMS100 ISDN line.

Now I'm perplexed; I'm GenX, I need to know the why.  A few days of reading later, I know more about ISDN than I ever cared to.  I much prefer SIP trunking, but if you don't know where you came from, how will you know where you're going…

ISDN has been around since the '70s; DMS100 was developed in 1979.  A resilient little protocol built in a time of efficiency where every bit mattered. Blah blah blah, here's a link: ISDN.  NI1 was developed to standardise BRI lines (Basic Rate Interface); hmmm, basic, I think I mentioned that word before.  NI2 was developed for PRI lines (Primary Rate Interface).

I don't know the history, or the reasoning behind why a certain telco insists on selling their PRI lines as either NI1 or NI2.  I've only seen NI2 out west to be honest, and the first "NI1" trunk that I've seen happens to be out east.  I'll blame the sales guy who first started creating the database and ordering system. Or maybe they've forgotten too; I'll have to post the monkeys-in-the-cage analogy sometime.

Summary:  In Canada anyway, if a telco says to use NI1, there is a very high probability that it's actually DMS100.  Or if you just can't get an answer, try DMS100, and if it's not working out, then go NI2.  If you're still out of luck, then NFAS is possibly in play and you definitely need to talk to the translations engineer to find out more.

Lync/Skype Emergency Calling

For Lync and Skype Services, I typically normalize ^([2-9]11)$ to +$1 (sketched in PowerShell below).  So in the Voice Routes I specify a service filter of \+?[2-9]11, and in the "Trunk Configuration | Called number translation rules" I have one for ^\+?([2-9]11)$ to $1, so the plus gets removed before routing, IF there is a plus.  But when you do something long enough, you sometimes forget why you do the things you do…
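In Management Shell terms, that typical normalization rule looks roughly like the sketch below; the dial plan ("Global") and rule name are placeholders, so point it at whichever dial plan you actually use.

# Normalize 211/311/411/…/911 to +N11 in the dial plan
New-CsVoiceNormalizationRule -Parent "Global" -Name "ServiceCalls" -Pattern '^([2-9]11)$' -Translation '+$1'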

An incident with a client reminded me of the why: Emergency Calling.  As you can see from the screenshot below, Emergency Calling bypasses all normalization rules, so if the Voice Policies/Routes are only routing calls based on ^\+\d+ then only calls normalized with a + will route, as was the case with this particular client.  VVX phones have a built-in function for tagging calls with a Session Priority of Emergency (which can be seen in the CDR); when that is invoked, normalization is bypassed, and the 911 does not get normalized to +911, so the call fails because there was no route.  211, 311, 411 all work because they aren't Emergency Calls, so they go through the normalization process.


After creating a new Route and PSTN Usage, inserting the appropriate filters, and updating the Called Number Translation rules, we were good to go again.

New PSTN Route – Pattern: ^\+?[2-9]11 and assign the Appropriate Trunk

Called Number Translation Rule – Services: ^\+?([2-9]11)$ –> $1
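For reference, roughly the same route and trunk translation rule from the Management Shell; a sketch only, since the route name, PSTN usage, and gateway identity are placeholders, and it assumes the PSTN usage already exists in your policies.

# Route that matches the x11 service calls, with or without a leading +
New-CsVoiceRoute -Identity "ServiceCalls" -NumberPattern '^\+?[2-9]11' -PstnUsages @{add="ServiceCalls"} -PstnGatewayList @{add="PstnGateway:gw01.contoso.com"}

# Trunk translation rule that strips the + (if present) before the call hits the gateway
New-CsOutboundTranslationRule -Parent "PstnGateway:gw01.contoso.com" -Name "ServiceCalls" -Pattern '^\+?([2-9]11)$' -Translation '$1'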

In this case, I was able to complete testing of the 911 service by using manipulation rules in the AudioCodes gateway: first testing that 811 would redirect to my cell phone, then testing with 911 in the manipulation.  Calls routed, the CDRs looked good again, I could no longer reproduce the issue, and everything was good again.

Screen shot courtesy of:  https://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/EXL318