Tuesday, June 24

Components Required for Enterprise Voice in Lync

  • Front End Server VoIP Components 
  • Mediation Server Component 
  • PSTN Connectivity Components 
  • Perimeter Network VoIP Components 
a) VoIP components located on Front End Servers are as follows:
  • Translation Service
  • Inbound Routing component
  • Outbound Routing component
  • Exchange UM Routing component
  • Intercluster Routing component
  • Mediation Server Component
 1) Translation Service:

The Translation Service is the server component that is responsible for translating a dialed number into the E.164 format or another format, according to the normalization rules that are defined by the administrator. The Translation Service can translate to formats other than E.164 if your organization uses a private numbering system or uses a gateway or PBX that does not support E.164.
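To illustrate the idea only (this is not Lync's actual implementation), the following Python sketch applies ordered regular-expression rules to translate a dialed string into E.164. The rules, prefixes, and number ranges are hypothetical examples of the kind an administrator would define in a dial plan.

import re

# Illustrative only: the patterns and the +1425555 DID prefix are hypothetical
# examples of administrator-defined normalization rules.
NORMALIZATION_RULES = [
    (re.compile(r"^(\d{4})$"),      r"+1425555\1"),   # 4-digit extension -> full DID
    (re.compile(r"^9?1?(\d{10})$"), r"+1\1"),         # US 10-digit, optional 9/1 prefix
    (re.compile(r"^9?011(\d+)$"),   r"+\1"),          # international dialed with 011
]

def normalize(dialed: str) -> str:
    """Return the first matching rule's translation, or the input digits unchanged."""
    digits = re.sub(r"[^\d+]", "", dialed)            # strip spaces, dashes, parentheses
    for pattern, template in NORMALIZATION_RULES:
        if pattern.match(digits):
            return pattern.sub(template, digits)
    return digits

print(normalize("425-555-1234"))   # -> +14255551234
print(normalize("7100"))           # -> +14255557100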

 2) Inbound Routing component:

The Inbound Routing component handles incoming calls largely according to preferences that are specified by users on their Enterprise Voice clients. It also facilitates delegate ringing and simultaneous ringing, if configured by the user. For example, users specify whether unanswered calls are forwarded or simply logged for notification. If call forwarding is enabled, users can specify whether unanswered calls should be forwarded to another number or to an Exchange UM server that has been configured to provide call answering. The Inbound Routing component is installed by default on all Standard Edition servers and Front End Servers.

3) Outbound Routing component:

The Outbound Routing component routes calls to PBX or PSTN destinations. It applies call authorization rules, as defined by the user’s voice policy, to callers and determines the optimal PSTN gateway for routing each call. The Outbound Routing component is installed by default on all Standard Edition servers and Front End Servers.
The routing logic that is used by the Outbound Routing component is in large measure configured by network or telephony administrators according to the requirements of their organizations.

4) Exchange UM Routing component:

The Exchange UM routing component handles routing between Lync Server and servers running Exchange Unified Messaging (UM), to integrate Lync Server with Unified Messaging features.
The Exchange UM routing component also handles rerouting of voice mail over the PSTN if Exchange UM servers are unavailable. If you have Enterprise Voice users at branch sites that do not have a resilient WAN link to a central site, the Survivable Branch Appliance that you deploy at the branch site provides voice mail survivability for branch users during a WAN outage. When the WAN link is unavailable, the Survivable Branch Appliance does the following:
  • reroutes unanswered calls over the PSTN to the Exchange Unified Messaging server in the central site
  • provides the ability for a user to retrieve voice mail messages over the PSTN
  • queues missed call notifications, and then uploads them to the Exchange UM server when the WAN link is restored.
To enable voice mail rerouting, we recommend that your Exchange administrator configure Exchange UM Auto Attendant (AA) to accept messages only.

5) Intercluster Routing component:

The Intercluster routing component is responsible for routing calls to the callee’s primary Registrar pool. If that is unavailable, the component routes the call to the callee’s backup Registrar pool. If the callee’s primary and backup Registrar pools are unreachable over the IP network, the Intercluster routing component reroutes the call over the PSTN to the user’s telephone number.
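As a rough illustration of that fallback order only (the pool objects and reachability flag below are hypothetical stand-ins, not Lync Server APIs):

from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for illustration; these are not Lync Server APIs.
@dataclass
class RegistrarPool:
    name: str
    reachable: bool

@dataclass
class Callee:
    sip_uri: str
    telephone_number: str
    primary_pool: Optional[RegistrarPool]
    backup_pool: Optional[RegistrarPool]

def route_call(callee: Callee) -> str:
    """Mirror the fallback order: primary pool, then backup pool, then PSTN reroute."""
    for pool in (callee.primary_pool, callee.backup_pool):
        if pool is not None and pool.reachable:
            return f"deliver {callee.sip_uri} via {pool.name}"
    return f"reroute over PSTN to {callee.telephone_number}"

user = Callee("sip:bob@contoso.com", "+14255550100",
              RegistrarPool("pool-a", reachable=False),
              RegistrarPool("pool-b", reachable=True))
print(route_call(user))   # -> deliver sip:bob@contoso.com via pool-b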

6) Other Front End Server Components:

Other components residing on the Front End Server or Director that provide essential support for VoIP, but are not themselves VoIP components, include the following:
  • User Services. Performs reverse number lookup on the destination phone number of each incoming call and matches that number to the SIP URI of the destination user. Using this information, the Inbound Routing component distributes the call to that user’s registered SIP endpoints (a minimal sketch of this lookup follows this list). User Services is a core component on all Front End Servers and Directors.
  • User Replicator. Extracts user phone numbers from Active Directory Domain Services and writes them to tables in the RTC database, where they are available to User Services and Address Book Server. User Replicator is a core component on all Front End Servers.
  • Address Book Server. Provides global address list information from Active Directory Domain Services to Lync Server clients. It also retrieves user and contact information from the RTC database, writes the information to the Address Book files, and then stores the files on a shared folder where they are downloaded by Lync clients. The Address Book Server writes the information to the RTCAb database, which is used by the Address Book Web Query service to respond to user search queries from Microsoft Lync 2010 Mobile. It optionally normalizes enterprise user phone numbers that are written to the RTC database for the purpose of provisioning user contacts in Lync. The Address Book service is installed by default on all Front End Servers. The Address Book Web Query service is installed by default with the Web services on each Front End Server.
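The reverse number lookup performed by User Services can be pictured as a table lookup from a normalized number to a SIP URI and then to that user's registered endpoints. A minimal Python sketch, with made-up data:

# Minimal sketch of reverse number lookup: map a normalized destination number
# to a SIP URI so the call can be fanned out to that user's registered
# endpoints. The table contents are hypothetical.
directory = {
    "+14255550100": "sip:bob@contoso.com",
    "+14255550101": "sip:alice@contoso.com",
}

registered_endpoints = {
    "sip:bob@contoso.com": ["bob-laptop", "bob-desk-phone"],
}

def reverse_number_lookup(destination_number: str):
    sip_uri = directory.get(destination_number)
    if sip_uri is None:
        return None, []        # no enterprise user owns this number
    return sip_uri, registered_endpoints.get(sip_uri, [])

print(reverse_number_lookup("+14255550100"))
# -> ('sip:bob@contoso.com', ['bob-laptop', 'bob-desk-phone'])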

b) Mediation Server Component 

The Mediation Server translates signaling and, in some configurations, media between your internal Lync Server 2013 Enterprise Voice infrastructure and a public switched telephone network (PSTN) gateway or a Session Initiation Protocol (SIP) trunk. On the Lync Server 2013 side, the Mediation Server listens on a single mutual TLS (MTLS) transport address. On the gateway side, the Mediation Server listens on all listening ports associated with trunks defined in the Topology document. All qualified gateways must support TLS but can enable TCP as well; TCP is supported for gateways that do not support TLS.
If you also have an existing private branch exchange (PBX) in your environment, the Mediation Server handles calls between Enterprise Voice users and the PBX. If your PBX is an IP-PBX, you can create a direct SIP connection between the PBX and the Mediation Server. If your PBX is a Time Division Multiplex (TDM) PBX, you must also deploy a PSTN gateway between the Mediation Server and the PBX.
The Mediation Server is collocated with the Front End Server by default. The Mediation Server can also be deployed in a stand-alone pool for performance reasons, or if you deploy SIP trunking, in which case the stand-alone pool is strongly recommended.
If you deploy Direct SIP connections to a qualified PSTN gateway that supports media bypass and DNS load balancing, a stand-alone Mediation Server pool is not necessary, because qualified gateways can perform DNS load balancing to a pool of Mediation Servers and can receive traffic from any Mediation Server in a pool.
We also recommend that you collocate the Mediation Server on a Front End pool when you have deployed IP-PBXs or connect to an Internet Telephony Service Provider’s Session Border Controller (SBC), as long as any of the following conditions are met:
  • The IP-PBX or SBC is configured to receive traffic from any Mediation Server in the pool and can route traffic uniformly to all Mediation Servers in the pool.
  • The IP-PBX does not support media bypass, but the Front End pool that is hosting the Mediation Server can handle voice transcoding for calls to which media bypass does not apply.
You can use the Microsoft Lync Server 2013, Planning Tool to evaluate whether the Front End pool where you want to collocate the Mediation Server can handle the load. If your environment cannot meet these requirements, then you must deploy a stand-alone Mediation Server pool.
The main functions of the Mediation Server are as follows:
  • Encrypting and decrypting SRTP on the Lync Server side
  • Translating SIP over TCP (for gateways that do not support TLS) to SIP over mutual TLS
  • Translating media streams between Lync Server and the gateway peer of the Mediation Server
  • Connecting clients that are outside the network to internal ICE components, which enable media traversal of NAT and firewalls
  • Acting as an intermediary for call flows that a gateway does not support, such as calls from remote workers on an Enterprise Voice client
  • In deployments that include SIP trunking, working with the SIP trunking service provider to provide PSTN support, which eliminates the need for a PSTN gateway
The following figure shows the signaling and media protocols that are used by the Mediation Server when communicating with a basic PSTN gateway and the Enterprise Voice infrastructure.

Mediation Server Protocols diagram

I) M:N Trunk


Lync Server 2013 supports greater flexibility in the definition of a trunk for call routing purposes than previous releases. A trunk is a logical association between a Mediation Server and a listening port number with a gateway and a listening port number. This implies several things: a Mediation Server can have multiple trunks to the same gateway; a Mediation Server can have multiple trunks to different gateways; conversely, a gateway can have multiple trunks to different Mediation Servers.
A root trunk must still be created when a gateway is added to the Lync topology by using Topology Builder. The number of gateways that a given Mediation Server can handle depends on the processing capacity of the server during peak busy hours. If you deploy a Mediation Server on hardware that exceeds the minimum hardware requirements for Lync Server 2013, as described in Supported Hardware in the Supportability documentation, then a stand-alone Mediation Server can handle approximately 1000 active non-bypass calls. When deployed on hardware meeting these specifications, the Mediation Server is expected to perform transcoding, but still route calls for multiple gateways even if the gateways do not support media bypass.
When defining a call route, you specify the trunks associated with that route, but you do not specify which Mediation Servers are associated with that route. Instead, you use Topology Builder to associate trunks with Mediation Servers. In other words, routing determines which trunk to use for a call, and, subsequently, the Mediation Server associated with that trunk is sent the signaling for that call.
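The trunk relationship described above can be sketched as a small data structure: each trunk pairs a Mediation Server and port with a gateway and port, and a route references trunks rather than servers. The server names and ports below are hypothetical:

from dataclasses import dataclass

# Hypothetical representation of the M:N trunk relationship described above:
# a trunk pairs a Mediation Server + port with a gateway + port, and a route
# references trunks (never Mediation Servers directly).
@dataclass(frozen=True)
class Trunk:
    mediation_server: str
    ms_port: int
    gateway: str
    gw_port: int

trunks = [
    Trunk("medsvr1.contoso.com", 5067, "gw1.contoso.com", 5067),
    Trunk("medsvr1.contoso.com", 5068, "gw2.contoso.com", 5067),  # same server, second gateway
    Trunk("medsvr2.contoso.com", 5067, "gw1.contoso.com", 5068),  # same gateway, second server
]

route = {"pattern": r"^\+1425", "trunks": [trunks[0], trunks[2]]}

# Routing picks a trunk; the Mediation Server for the call follows from the trunk.
chosen = route["trunks"][0]
print(f"signaling goes to {chosen.mediation_server}:{chosen.ms_port}")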
The Mediation Server can be deployed as a pool; this pool can be collocated with a Front End pool, or it can be deployed as a stand-alone pool. When a Mediation Server is collocated with a Front End pool, the pool size can be at most 12 (the limit of the Registrar pool size). Taken together, these new capabilities increase the reliability and deployment flexibility for Mediation Servers, but they require associated capabilities in the following peer entities:
  • PSTN gateway. A Lync Server 2013 qualified gateway must implement DNS load balancing, which enables a qualified public switched telephone network (PSTN) gateway to act as a load balancer for one pool of Mediation Servers, and thereby to load-balance calls across the pool.
  • Session Border Controller. For a SIP trunk, the peer entity is a Session Border Controller (SBC) at an Internet telephony service provider. In the direction from the Mediation Server pool to the SBC, the SBC can receive connections from any Mediation Server in the pool. In the direction from the SBC to the pool, traffic can be sent to any Mediation Server in the pool. One method of achieving this is through DNS load balancing, if supported by the service provider and SBC. An alternative is to give the service provider the IP addresses of all Mediation Servers in the pool, and the service provider will provision these in their SBC as a separate SIP trunk for each Mediation Server. The service provider will then handle the load balancing for its own servers. Not all service providers or SBCs may support these capabilities. Furthermore, the service provider may charge extra for this capability. Typically, each SIP trunk to the SBC incurs a monthly fee.
  • IP-PBX. In the direction from the Mediation Server pool to the IP-PBX SIP termination, the IP-PBX can receive connections from any Mediation Server in the pool. In the direction from the IP-PBX to the pool, traffic can be sent to any Mediation Server in the pool. Because most IP-PBXs do not support DNS load balancing, we recommend that individual direct SIP connections be defined from the IP-PBX to each Mediation Server in the pool. The IP-PBX will then handle its own load balancing by distributing traffic over the trunk group. The assumption is that the trunk group has a consistent set of routing rules at the IP-PBX. Whether a particular IP-PBX supports this trunk group concept and how it intersects with the IP-PBX’s own redundancy and clustering architecture needs to be determined before you can decide whether a Mediation Server cluster can interact correctly with an IP-PBX.
A Mediation Server pool must have a uniform view of the peer gateway with which it interacts. This means that all members of the pool access the same definition of the peer gateway from the configuration store and are equally likely to interact with it for outgoing calls. Therefore, there is no way to segment the pool so that some Mediation Servers communicate with only certain gateway peers for outgoing calls. If such segmentation is necessary, a separate pool of Mediation Servers must be used. This would be the case, for example, if the associated capabilities in PSTN gateways, SIP trunks, or IP-PBXs to interact with a pool as detailed earlier in this topic are not present.
A particular PSTN gateway, IP-PBX, or SIP trunk peer can route to multiple Mediation Servers or trunks. The number of gateways that a particular pool of Mediation Servers can control depends on the number of calls that use media bypass. If a large number of calls use media bypass, a Mediation Server in the pool can handle many more calls, because only signaling layer processing is necessary.
 


II) Call Admission Control and Mediation Server

Call admission control (CAC), first introduced in Lync Server 2010, manages real-time session establishment, based on available bandwidth, to help prevent poor Quality of Experience (QoE) for users on congested networks. To support this capability, the Mediation Server, which provides signaling and media translation between the Enterprise Voice infrastructure and a gateway or SIP trunking provider, is responsible for bandwidth management for its two interactions on the Lync Server side and on the gateway side. In call admission control, the terminating entity for a call handles the bandwidth reservation. The gateway peers (PSTN gateway, IP-PBX, SBC) that the Mediation Server interacts with on the gateway side do not support Lync Server 2013 call admission control. Thus, the Mediation Server has to handle bandwidth interactions on behalf of its gateway peer. Whenever possible, the Mediation Server will reserve bandwidth in advance. If that is not possible (for example, if the locality of the ultimate media endpoint on the gateway side is unknown for an outgoing call to the gateway peer), bandwidth is reserved when the call is placed. This behavior can result in oversubscription of bandwidth, but it is the only way to prevent false rings.
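The admission decision itself can be pictured as a simple reservation against a per-link limit. The sketch below is illustrative only; the link limit and per-call bandwidth figures are hypothetical, and real CAC policies are configured per network site and link by the administrator.

# Simplified sketch of a call admission decision on a constrained link.
# The per-link limit and per-call bandwidth figures are hypothetical.
LINK_LIMIT_KBPS = 1000          # bandwidth allowed for audio on this WAN link
reserved_kbps = 0               # bandwidth already reserved for active calls

def try_reserve(call_kbps: int) -> bool:
    """Admit the call only if the reservation fits under the link limit."""
    global reserved_kbps
    if reserved_kbps + call_kbps > LINK_LIMIT_KBPS:
        return False            # admission fails; call is rejected or rerouted
    reserved_kbps += call_kbps
    return True

for call in range(12):
    print(call, try_reserve(100))   # the last two attempts are refused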
Media bypass and bandwidth reservation are mutually exclusive. If media bypass is employed for a call, call admission control is not performed for that call. The assumption here is that there are no links with constrained bandwidth involved in the call. If call admission control is used for a particular call that involves the Mediation Server, that call cannot employ media bypass.



III) Enhanced 9-1-1 (E9-1-1) and Mediation Server

The Mediation Server has extended capabilities so that it can correctly interact with Enhanced 9-1-1 (E9-1-1) service providers. No special configuration is needed on the Mediation Server; the SIP extensions required for E9-1-1 interaction are, by default, included in the Mediation Server’s SIP protocol for its interactions with a gateway peer (PSTN gateway, IP-PBX, or the SBC of an Internet Telephony Service Provider, including E9-1-1 Service Providers).
Whether the SIP trunk to an E9-1-1 Service Provider can be terminated on an existing Mediation Server pool or will require stand-alone Mediation Servers will depend on whether the E9-1-1 SBC can interact with a pool of Mediation Servers.



IV) Media Bypass and Mediation Server

Media bypass is a Lync Server capability that enables an administrator to configure call routing to flow directly between the user endpoint and the public switched telephone network (PSTN) gateway without traversing the Mediation Server. Media bypass improves call quality by reducing latency, unnecessary translation, the possibility of packet loss, and the number of potential points of failure. Where a remote site without a Mediation Server is connected to a central site by one or more WAN links with constrained bandwidth, media bypass lowers the bandwidth requirement by enabling media from a client at a remote site to flow directly to its local gateway without first having to flow across the WAN link to a Mediation Server at the central site and back. This reduction in media processing also complements the Mediation Server’s ability to control multiple gateways.
Media bypass and call admission control (CAC) are mutually exclusive. If media bypass is employed for a call, CAC is not performed for that call. The assumption is that there are no links with constrained bandwidth involved in the call.



V) Components and Topologies for Mediation Server

Dependencies


The Mediation Server has the following dependencies:
  • Registrar. Required. The Registrar is the next hop for signaling in the Mediation Server interactions with the Lync Server 2013 network. Note that Mediation Server can be collocated on a Front End Server along with the Registrar, in addition to being installed in a stand-alone pool consisting only of Mediation Servers. The Registrar is collocated with a Mediation Server and PSTN gateway on a Survivable Branch Appliance.
  • Monitoring Server. Optional but highly recommended. The Monitoring Server allows the Mediation Server to record quality metrics associated with its media sessions.
  • Edge Server. Required for external user support. The Edge Server allows the Mediation Server to interact with users who are located behind a NAT or firewall.

VI) Topologies

The Lync Server 2013, Mediation Server is by default collocated with an instance of the Registrar on a Standard Edition server, a Front End pool, or Survivable Branch Appliance. All Mediation Servers in a Front End pool must be configured identically.
Where performance is an issue, it may be preferable to deploy one or more Mediation Servers in a dedicated stand-alone pool. Or, if you are deploying SIP trunking, we recommend that you deploy a stand-alone Mediation Server pool.
If you deploy Direct SIP connections to a qualified PSTN gateway that supports media bypass and DNS load balancing, a stand-alone Mediation Server pool is not necessary, because qualified gateways can perform DNS load balancing to a pool of Mediation Servers and can receive traffic from any Mediation Server in a pool.
We also recommend that you collocate the Mediation Server on a Front End pool when you have deployed IP-PBXs or connect to an Internet Telephony Service Provider’s Session Border Controller (SBC), as long as any of the following conditions are met:
  • The IP-PBX or SBC is configured to receive traffic from any Mediation Server in the pool and can route traffic uniformly to all Mediation Servers in the pool.
  • The IP-PBX does not support media bypass, but the Front End pool that is hosting the Mediation Server can handle voice transcoding for calls to which media bypass does not apply.
You can use the Microsoft Lync Server 2013, Planning Tool to evaluate whether the Front End pool where you want to collocate the Mediation Server can handle the load. If your environment cannot meet these requirements, then you must deploy a stand-alone Mediation Server pool.
The following figure shows a simple topology consisting of two sites connected by a WAN link. The Mediation Server is collocated with the Registrar on a Front End pool at Site 1. The Mediation Server at Site 1 controls both the PSTN gateway at Site 1 and the gateway at Site 2. In this topology, media bypass is enabled globally to use site and region information, and the trunks to each PSTN gateway (GW1 and GW2) have bypass enabled.
Voice Topology with Mediation Server WAN Gateway
The next figure shows a simple topology where the Mediation Server is collocated with the Registrar on a Front End pool at Site 1 and has a Direct SIP connection to the IP-PBX at Site 1. In this figure, the Mediation Server also controls a PSTN gateway at Site 2. Assume that Lync users exist at both Sites 1 and 2. Also assume that the IP-PBX has an associated media processor that must be traversed by all media originating from Lync endpoints before being sent to media endpoints controlled by the IP-PBX. In this topology, media bypass is enabled globally to use site and region information, and the trunks to the PBX and PSTN gateway have media bypass enabled.
Voice Topology Mediation Server WAN PBX
The last figure in this topic shows a topology where the Mediation Server is connected to the SBC of an Internet Telephony Service Provider.

VII) Deployment Guidelines for Mediation Server

1) Standalone Mediation Server

Mediation Server is by default collocated on the Standard Edition server or Front End Server in a Front End pool at central sites. The number of public switched telephone network (PSTN) calls that can be handled and the number of machines required in the pool will depend on the following:
  • The number of gateway peers that the Mediation Server pool controls
  • The high-volume traffic periods through those gateways
  • The percentage of calls whose media bypasses the Mediation Server
When planning, be sure to take into account the media processing requirements for PSTN calls and A/V conferences that are not configured for media bypass, as well as the processing needed to handle signaling interactions for the number of busy-hour calls that need to be supported. If there is not enough CPU, then you must deploy a stand-alone pool of Mediation Servers; and PSTN gateways, IP-PBXs, and SBCs will need to be split into subsets that are controlled by the collocated Mediation Servers in one pool and the stand-alone Mediation Servers in one or more stand-alone pools.
If you deployed PSTN gateways, IP-PBXs, or Session Border Controllers (SBCs) that do not support the following capabilities required to interact with a pool of Mediation Servers, they will need to be associated with a stand-alone pool consisting of a single Mediation Server:
  • Perform network layer Domain Name System (DNS) load balancing across Mediation Servers in a pool (or otherwise route traffic uniformly to all Mediation Servers in a pool)
  • Accept traffic from any Mediation Server in a pool
You can use the Microsoft Lync Server 2013, Planning Tool to evaluate whether collocating the Mediation Server with your Front End pool can handle the load. If your environment cannot meet these requirements, then you must deploy a stand-alone Mediation Server pool.

2) Central Site & Branch Site Considerations

Mediation Servers at the central site can be used to route calls for IP-PBXs or PSTN gateways at branch sites. If you deploy SIP trunks, however, you must deploy a Mediation Server at the site where each trunk terminates. Having a Mediation Server at the central site route calls for an IP-PBX or PSTN gateway at a branch site does not require the use of media bypass. However, if you can enable media bypass, doing so will reduce media path latency and, consequently, result in improved media quality because the media path is no longer required to follow the signaling path. Media bypass will also decrease the processing load on the pool.
If branch site resiliency is required, a Survivable Branch Appliance or combination of a Front End Server, a Mediation Server, and a gateway must be deployed at the branch site. (The assumption with branch site resiliency is that presence and conferencing are not resilient at the site.)

c) PSTN Connectivity Components

An enterprise-grade VoIP solution must provide for calls to and from the public switched telephone network (PSTN) without any decline in Quality of Service (QoS). In addition, users should not be aware of the underlying technology when they place and receive calls. From the user's perspective, a call between the Enterprise Voice infrastructure and the PSTN should seem like just another SIP session.
For PSTN connections, you can either deploy a SIP trunk or a PSTN gateway (with a PBX, also known as a Direct SIP link, or without a PBX).

1) SIP Trunking:

As an alternative to using PSTN gateways, you can connect your Enterprise Voice solution to the PSTN by using SIP trunking. SIP trunking enables the following scenarios:
  • An enterprise user inside or outside the corporate firewall can make a local or long-distance call specified by an E.164-compliant number that is terminated on the PSTN as a service of the corresponding service provider.
  • Any PSTN subscriber can contact an enterprise user inside or outside the corporate firewall by dialing a Direct Inward Dialing (DID) number associated with that enterprise user.
The use of this deployment solution requires a SIP trunking service provider. 

2) PSTN Gateways:

PSTN gateways are third-party devices that translate signaling and media between the Enterprise Voice infrastructure and a PSTN or a PBX. PSTN gateways work with the Mediation Server to present a PSTN or PBX call to an Enterprise Voice client. The Mediation Server also presents calls from Enterprise Voice clients to the PSTN gateway for routing to the PSTN or PBX. For a list of partners who work with Microsoft to provide devices that work with Lync Server, see the Microsoft Unified Communications Partners website at http://go.microsoft.com/fwlink/p/?linkId=202836

3) Private Branch Exchanges (PBXs):

If you have an existing voice infrastructure that uses a private branch exchange (PBX), you can use your PBX with Lync Server Enterprise Voice.
The supported Enterprise Voice-PBX integration scenarios are as follows:
  • IP-PBX that supports media bypass, with a Mediation Server.
  • IP-PBX that requires a stand-alone PSTN gateway.
  • Time division multiplexing (TDM) PBX, with a stand-alone PSTN gateway
d) Perimeter Network VoIP Components

Outside callers who use unified communications clients for individual or conference calls rely on Edge Server for voice communication with coworkers.
On an Edge Server, the Access Edge service provides SIP signaling for calls from Lync users who are outside your organization’s firewall. The A/V Edge service enables media traversal of NAT and firewalls. A caller who uses a unified communications (UC) client from outside the corporate firewall relies on the A/V Edge service for both individual and conference calls.
The A/V Authentication service is collocated with, and provides authentication services for, the A/V Edge service. Outside users who attempt to connect to the A/V Edge service require an authentication token that is provided by the A/V Authentication Service before their calls can go through.



Wednesday, June 4

Monitoring and Maintaining CUCM Appliance Hardware

Hardware Platform Monitoring and Management

The Appliance supports a variety of interfaces to enable monitoring in the following eight focus areas:
1) CPU status/utilization
2) Memory status/utilization
3) System components temperatures
4) Fan status
5) Power Supply status
6) RAID & disk status
7) Network status (incl. NIC)
8) Operational status, including instrumentation of system/kernel status and data dumps following major system issues, indicating nature/type of the operational problem and degree of severity.
This section focuses on hardware-layer monitoring for areas 3 through 8; areas 1 and 2 are covered in the section on CUCM application-layer and services-layer monitoring.
CPU Process at host.jpg
For postmortem analysis, RIS Data Collector PerfMonLog tracks %CPU usage per process as well as at the system level.
RTMT monitors CPU usage. When CPU usage is above a threshold, RTMT generates CPUPegging/CallProcessNodeCPUPegging alerts. From RTMT Alert Central, you can also see the current status.
There are two kinds of RTMT alerts: the first set is pre-configured (also called pre-canned), and the second set is user defined. You can customize both of them. The main difference is that you cannot delete pre-configured alerts, whereas you can add and delete user-defined alerts. However, you can disable both pre-configured and user-defined alerts. To view the pre-configured alerts, from the RTMT client application select the RTMT -> Tools -> Alert -> Alert Central menu option. The pre-configured alerts are enabled by default. In most cases you do not have to change the default threshold settings configured for the pre-configured alerts, but you have the option to change them to meet your requirements. The notification can be an e-mail or a pager. To set up e-mail notification, specify the SMTP server name and port number; you can do this in the RTMT client application by selecting the Alert/Threshold -> Enable E-Mail Server menu option.
CPU2.jpg

In addition to CPUPegging/CallProcessNodeCPUPegging, high CPU usage potentially causes other alerts to occur, such as:
  • CodeYellow
  • CodeRed
  • CoreDumpFileFound
  • CriticalServiceDown
  • LowCallManagerHeartbeatRate
  • LowTFTPServerHeartbeatRate
  • LowAttendantConsoleHeartbeatRate
% IOwait Monitoring

High %iowait indicates high disk I/O activity. A few things need to be considered:
  • High IOwait due to heavy memory swapping. Check %CPU Time for the Swap Partition to see if there is a high level of memory swapping activity. One potential cause of high memory swapping is a memory leak.
  • High IOwait due to DB activity. The database accesses the Active Partition. If %CPU Time for the Active Partition is high, then most likely there are a lot of DB activities.
  • High IOwait due to the Common (or Log) Partition, where trace and log files are stored. You can check the following:
    1. Check Trace Log Center to see if there is any trace collection activity going on. If call processing is impacted (i.e., CodeYellow), consider adjusting the trace collection schedule. If the zip option is used, turn it off.
    2. Trace setting – at the Detailed level, CUCM generates a lot of trace. If %iowait is high and/or CUCM is in the CodeYellow state, and the CUCM service trace setting is at Detailed, change the trace setting to "Error" to reduce trace writing.
You can use RTMT to identify processes that are responsible for high %iowait:
  • If %iowait is high enough to cause a CPUPegging alert, check the alert message for processes waiting for disk IO.
  • Go to the RTMT Process page and sort by Status. Check for processes in Uninterruptible Disk Sleep state.
  • Download the RIS Data Collector PerfMonLog file to examine process status over a longer period of time.
Below is an example of the RTMT Process page, sorted by Status. You can check for processes in Uninterruptible Disk Sleep state. In the case below, it is the sFTP process:
RTMT1.jpg
You can also use CLI to isolate which process causes high IOwait:
Syntax:
     utils fior
     utils fior status
     utils fior enable
     utils fior disable
     utils fior start
     utils fior stop
     utils fior list
     utils fior top
For example:
admin:utils fior list
2007-05-31 Counters Reset
 Time      Process   PID    State   Bytes Read   Bytes Written
 --------  --------  -----  ------  -----------  -------------
 17:02:45  rpmq      31206  Done    14173728     0
 17:04:51  java      31147  Done    310724       3582
 17:04:56  snmpget   31365  Done    989543       0
 17:10:22  top       12516  Done    7983360      0
 17:21:17  java      31485  Done    313202       2209
 17:44:34  java      1194   Done    192483       0
 17:44:51  java      1231   Done    192291       0
 17:45:09  cdpd      6145   Done    0            2430100
 17:45:25  java      1319   Done    192291       0
 17:45:31  java      1330   Done    192291       0
 17:45:38  java      1346   Done    192291       0
 17:45:41  rpmq      1381   Done    14172704     0
 17:45:44  java      1478   Done    192291       0
 17:46:05  rpmq      1540   Done    14172704     0
 17:46:55  cat       1612   Done    2560         165400
 17:46:56  troff     1615   Done    244103       0
 18:41:52  rpmq      4541   Done    14172704     0
 18:42:09  rpmq      4688   Done    14172704     0
CLI fior output sorted by top disk users:
admin:utils fior top
Top processes for interval starting 2007-05-31 15:27:23
Sort by Bytes Written
 Process    PID    Bytes Read   Read Rate   Bytes Written   Write Rate
 ---------  -----  -----------  ----------  --------------  ----------
 Linuxzip   19556  61019083     15254771    12325229        3081307
 Linuxzip   19553  58343109     11668622    9860680         1972136
 Linuxzip   19544  55679597     11135919    7390382         1478076
 installdb  28786  3764719      83660       6847693         152171
 Linuxzip   20150  18963498     6321166     6672927         2224309
 Linuxzip   20148  53597311     17865770    5943560         1981187
 Linuxzip   19968  9643296      4821648     5438963         2719482
 Linuxzip   19965  53107868     10621574    5222659         1044532
 Linuxzip   19542  53014605     13253651    4922147         1230537
 mv         5048   3458525      3458525     3454941         3454941
utils diagnose list: This command will list all available diagnostic tests. For example:

admin:utils diagnose list
Available diagnostics modules
disk_space - Check available disk space as well as any unusual disk usage
service_manager - Check if service manager is running
tomcat - Check if Tomcat is deadlocked or not running

utils diagnose test: This command will execute each diagnostic test, but will not attempt to repair anything. Example:

admin:utils diagnose test
Starting diagnostic test(s)
===============
test - disk_space : Passed
test - service_manager : Passed
test - tomcat : Passed
Diagnostics Completed

utils diagnose module <moduleName>: This command will execute a single diagnostic test and attempt to fix the problem if possible. You can also use the command "utils diagnose fix" to run all of the diagnostic tests at once. Example:

admin:utils diagnose module tomcat
Starting diagnostic test(s)
===============
test - tomcat : Passed
Diagnostics Completed

utils diagnose fix: This command will execute all diagnostic tests, and if possible, attempt to repair the system. Example:

admin:utils diagnose fix
Starting diagnostic test(s)
===============
test - disk_space : Passed
test - service_manager : Passed
test - tomcat : Passed
Diagnostics Completed

utils create report hardware (no parameters are required): Creates a system report containing disk array, remote console, diagnostic, and environmental data. Example:

admin:utils create report hardware
         ***   W A R N I N G   ***
This process can take several minutes as the disk array, remote console, system diagnostics and environmental systems are probed for their current values.
Continue? Press y or Y to continue, any other key to cancel request.
Continuing with System Report request...
Collecting Disk Array Data...SmartArray Equipped server detected...Done
Collecting Remote Console Data...Done
Collecting Model Specific System Diagnostic Information...Done
Collecting Environmental Data...Done
Collecting Remote Console System Log Data...Done
Creating single compressed system report...Done
System report written to SystemReport-20070730020505.tgz
To retrieve diagnostics use CLI command:
file get activelog platform/log/SystemReport-20070730020505.tgz
utils iostat [interval] [iterations] [filename]
  • interval (optional, seconds): interval between two iostat readings – mandatory if iterations is being used
  • iterations (optional): the number of iostat iterations to be performed – mandatory if interval is being used
  • filename (optional): redirect the output to a file
Help: utils iostat: This command will provide the iostat output for the given number of iterations and interval. Example:

admin:utils iostat
Executing command... Please be patient
Tue Oct 9 12:47:09 IST 2007
Linux 2.4.21-47.ELsmp (csevdir60)  10/09/2007
Time: 12:47:09 PM
avg-cpu:  %user  %nice  %sys  %iowait  %idle
           3.61   0.02  3.40     0.51  92.47
Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda        3.10   19.78  0.34  7.49   27.52  218.37   13.76  109.19     31.39      0.05    5.78   0.73   0.57
sda1       0.38    4.91  0.14  0.64    4.21   44.40    2.10   22.20     62.10      0.02   26.63   1.62   0.13
sda2       0.00    0.00  0.00  0.00    0.00    0.00    0.00    0.00     10.88      0.00    2.20   2.20   0.00
sda3       0.00    0.00  0.00  0.00    0.00    0.00    0.00    0.00      5.28      0.00    1.88   1.88   0.00
sda4       0.00    0.00  0.00  0.00    0.00    0.00    0.00    0.00      1.83      0.00    1.67   1.67   0.00
sda5       0.00    0.08  0.01  0.01    0.04    0.73    0.02    0.37     64.43      0.00  283.91  69.81   0.08
sda6       2.71   14.79  0.20  6.84   23.26  173.24   11.63   86.62     27.92      0.02    2.98   0.61   0.43

The following table lists some equivalent perfmon counters between CUCM 4.x and CUCM 5.x and later:
CUCM 4.x Perfmon counters              CUCM 5.x appliance Perfmon counters
Process \ % Privileged Time            Process \ STime
Process \ % Processor Time             Process \ % CPU Time
Processor \ % UserTime                 Processor \ User Percentage
Processor \ % Privileged Time          Processor \ System Percentage
Processor \ % Idle Time                Processor \ Nice Percentage
Processor \ % Processor Time           Processor \ % CPU Time
Memory Monitoring

Virtual memory consists of physical memory (RAM) and swap memory (disk). The RTMT “CPU & Memory” page has system-level memory usage information as follows:
  • Total: total amount of physical memory
  • Free: amount of free memory
  • Shared: amount of shared memory used
  • Buffers: amount of memory used for buffering purposes
  • Cached: amount of cached memory
  • Used: calculated as Total – Free – Buffers – Cached + Shared
  • Total Swap: total amount of swap space
  • Used Swap: the amount of swap space in use on the system
  • Free Swap: the amount of free swap space available on the system
You can also query memory information through APIs.
Through SOAP, you can query the following perfmon counters:
  • Under the Memory object: % Mem Used, % VM Used, Total Kbytes, Total Swap Kbytes, Total VM Kbytes, Used Kbytes, Used Swap Kbytes, Used VM KBytes
  • Under the Process object: VmSize, VmData, VmRSS, % Memory Usage
Through SNMP, you can query the following from the Host Resource MIB: hrStorageSize, hrStorageUsed, hrStorageAllocationUnits, hrStorageDescr, hrStorageType, hrMemorySize.
You can also download some historical information by using RTMT Trace Log Central:
  • Cisco AMC Service PerfMonLog – enabled by default. Deprecated in CUCM 6.0 because Cisco RIS Data Collector PerfMonLog is introduced.
  • Cisco RIS Data Collector PerfMonLog – disabled by default in CUCM 5.x; enabled by default in CUCM 6.0.
Note: Perfmon Virtual Memory refers to Total (Physical + Swap) memory, whereas Host Resource MIB Virtual Memory refers to Swap memory only.
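As a hedged example of the SNMP path, the following Python sketch shells out to net-snmp's snmpwalk to read the Host Resources MIB storage objects listed above. The hostname and community string are placeholders, and SNMP must already be configured on the CUCM node (Serviceability > SNMP) for this to return data.

import subprocess

# Shell out to net-snmp's snmpwalk; host and community string are placeholders.
def walk(host: str, community: str, oid: str) -> str:
    result = subprocess.run(
        ["snmpwalk", "-v", "2c", "-c", community, host, oid],
        capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    for oid in ("HOST-RESOURCES-MIB::hrStorageDescr",
                "HOST-RESOURCES-MIB::hrStorageSize",
                "HOST-RESOURCES-MIB::hrStorageUsed"):
        print(walk("cucm-node.example.com", "public", oid))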
The RTMT “Process” pre-can screen displays process-level memory usage (VmSize, VmRSS, and VmData) information:
  • VmSize is the total virtual memory used by the process
  • VmRSS is the Resident Set currently in physical memory used by the process, including Code, Data and Stack
  • VmData is the virtual memory usage of heap by the process
  • Page Fault Count represents the number of major page faults that a process encountered that required the data to be loaded into physical memory
You can go to the RTMT “Process” pre-can screen and sort VmSize by clicking on the VmSize tab. Then you can identify which process consumes more memory.
RTMT2.jpg
Hints on Memory Leaks

From the RTMT Process page, if a process’ VmSize is continuously increasing, that process is leaking memory. When a process leaks memory, the system administrator should report it to Cisco with the proper trace files. RIS Data Collector PerfMonLog is a good one to collect, as it contains historical information on memory usage. The system administrator can then schedule restarting the service during off hours to reclaim the memory.
Alert Central

Alert Central (RTMT -> Tools -> Alert -> Alert Central) has all the Cisco predefined alerts and provides the current status of each alertable condition. One column to pay attention to is “In Safe Range”. If it is marked No, then the condition has still not been corrected. For instance, if “In Safe Range” is “No” for CallProcessingNodeCPUPegging, it means the CPU usage on that node is still above the threshold. At the bottom is the history information; you can take a look to see alerts generated previously. Quite often, by the time you realize that a service has crashed, the corresponding trace files have been overwritten, and it would be hard for Cisco TAC to work on the issue without trace files. In this case, it is useful to know that the CoreDumpFileFound, CodeYellow, and CriticalServiceDown alerts have an Enable Trace Download option. To enable it, open Set Alert Properties; the last page has the option to enable trace download. This can be used to make sure a trace file corresponding to a crash is created.
Caution: Enabling TCT Download may affect services on the server; configuring a high number of downloads will adversely impact the quality of services on the server.
Alerts can also send out alarms (syslog messages) if you configure the Alarm Configuration page for the Cisco AMC Service. In addition to the AlertHistory shown in RTMT Alert Central, there is the AMC Alert Log, which you can download with Trace & Log Central to get up to 7 days (by default) of alert history.
From Alert Central, you can see the current status:
RTMT5.jpg
The following table compares the names of perfmon counters on virtual memory between CUCM 4.x and CUCM 5.x:
CUCM 4.x Perfmon counters              CUCM 5.x appliance Perfmon counters
Process \ Private Bytes                Process \ VmRSS
Process \ Virtual Bytes                Process \ VmSize
Partition (Disk) Usage Monitoring

There are four partitions on the CUCM hard drive:
  • Common partition, also referred to as the Log partition, where trace/log files are stored
  • Active partition, which contains the files (binaries, libraries and config files) of the active OS and CUCM version
  • Inactive partition, which contains the files for the alternative CUCM version (e.g., an older version that was upgraded from, or a newer version recently upgraded to that the server has not yet been toggled to run)
  • Swap partition, which is used for swap space
You can also get partition information through APIs:
  • Through SOAP APIs, you can query the following perfmon counters under the Partition object: Total Mbytes, Used Mbytes, Queue Length, Write Bytes Per Sec, Read Bytes Per Sec
  • Through SNMP, you can query the following from the Host Resource MIB: hrStorageSize, hrStorageUsed, hrStorageAllocationUnits, hrStorageDescr, hrStorageType
You can also download historical information by using RTMT Trace and Log Central:
  • Cisco AMC Service PerfMonLog – enabled by default. Deprecated in CUCM 6.0, because Cisco RIS Data Collector PerfMonLog is introduced.
  • Cisco RIS Data Collector PerfMonLog – disabled by default in CUCM 5.x; enabled by default in CUCM 6.0.
You can use RTMT to monitor disk usage:
RTMT6.jpg

Partition name mapping:
Perfmon instance names (as shown in RTMT and SOAP)    Names shown in Host Resource hrStorage Description
Active                                                /
Inactive                                              /partB
Common                                                /common
Boot                                                  /grub
Swap                                                  Virtual Memory
SharedMemory                                          /dev/shm
LogPartitionLowWaterMarkExceeded alert occurs when the percentage of used disk space in the log partition has exceeded the configured low water mark. This alert should be considered an early warning for an administrator to clean up disk space. You can use RTMT Trace/Log Central to collect trace/log files and then delete them from the server. In addition to manually cleaning up the trace/log files, the system administrator should also adjust the number of trace files to be kept, to avoid hitting the low water mark again.
LogPartitionHighWaterMarkExceeded alert occurs when the percentage of used disk space in the log partition has exceeded the configured high water mark. When this alert is generated, the Log Partition Monitoring (LPM) utility starts to delete files in the Log Partition until it is down to the low water mark, to avoid running out of disk space. Since LPM may delete some files that you want to keep, you need to act upon receiving the LogPartitionLowWaterMarkExceeded alert.
LowActivePartitionAvailableDiskSpace alert occurs when the percentage of available disk space of the Active Partition is lower than the configured value. Use the default threshold that Cisco recommends; at the default threshold, this alert should never be generated. If this alert occurs, a system administrator can adjust the threshold as a temporary workaround, but Cisco TAC should look into it. One place to look is /tmp using remote access; there have been cases where large files are left there by third-party software.
LowInactivePartitionAvailableDiskSpace alert occurs when the percentage of available disk space of the Inactive Partition is lower than the configured value. Use the default threshold that Cisco recommends; at the default threshold, this alert should never be generated. If this alert occurs, a system administrator can adjust the threshold as a temporary workaround, but Cisco TAC should look into it.
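The water-mark logic itself amounts to comparing the log partition's used percentage against two configurable thresholds. A minimal sketch, with placeholder thresholds and path:

import shutil

# Illustrative check of the low/high water mark logic described above; the
# thresholds and the path are placeholders for the configurable alert values.
LOW_WATER_MARK = 90    # early warning: clean up trace/log files
HIGH_WATER_MARK = 95   # LPM starts deleting files until usage drops to the low mark

def check_log_partition(path: str = "/common") -> str:
    usage = shutil.disk_usage(path)
    used_pct = 100 * usage.used / usage.total
    if used_pct >= HIGH_WATER_MARK:
        return f"HighWaterMarkExceeded ({used_pct:.1f}% used)"
    if used_pct >= LOW_WATER_MARK:
        return f"LowWaterMarkExceeded ({used_pct:.1f}% used)"
    return f"OK ({used_pct:.1f}% used)"

print(check_log_partition("/"))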
The following table is a comparison of partition-related perfmon counters between CUCM 4.x and CUCM 5.x:
CUCM 4.x Perfmon counters                   CUCM 5.x appliance Perfmon counters
Logical Disk \ % Disk Time                  Partition \ % CPU Time
Logical Disk \ Disk Read Bytes/sec          Partition \ Read Kbytes Per Sec
Logical Disk \ Disk Write Bytes/sec         Partition \ Write Kbytes Per Sec
Logical Disk \ Current Disk Queue Length    Partition \ Queue Length
Logical Disk \ Free Megabytes               Partition \ Used Mbytes, Total Mbytes
Logical Disk \ % Free Space                 Partition \ % Used
Database Replication among CUCM nodes
You can use the RTMT Database Summary to monitor your database activities (i.e., CallManager -> Service -> Database Summary):

The following CLI commands can be used to monitor and manage intra-cluster connections:
     utils dbreplication status
     utils dbreplication repair all/nodename
     utils dbreplication reset all/nodename
     utils dbreplication stop
     utils dbreplication dropadmindb
     utils dbreplication setrepltimeout
     show tech dbstateinfo
     show tech dbinuse
     show tech notify
     run sql <query>

Cisco Unified Communications Manager Monitoring

“ccm” is the process name for the Cisco Unified Communications Manager service. The following table is a general guideline for ccm service CPU usage:
ccm CPU usage (“Process(ccm)\% CPU Time”):
MCS-7835 Server          MCS-7845 Server
< 44% – good             < 22% – good
44–52% – warning         22–36% – warning
> 60% – bad              > 30% – bad
You may ask: “Why does the MCS-7845 server have more processors, but a lower threshold for CPU usage?”
Here is why: the ccm process is a multithreaded application, but the main router thread does the bulk of call processing. A single thread can run on only one processor at any given time, even when there are multiple processors available. That means the ccm main router thread can run out of CPU resource even when there are idle processors. With hyper-threading on, the MCS-7845 server has 4 virtual processors, so on a server where the main router thread is running at full blast to do call processing, it is possible that the three other processors are near idle. In this situation UC Manager can get into the Code Yellow state even when total CPU usage is 25-30%. (Similarly, on a 7835 server with two virtual processors, UC Manager could get into the Code Yellow state at around 50-60% CPU usage.)
NOTE 1: The Code Yellow state is when the ccm service is so overloaded that it cannot process incoming calls anymore. In this case, ccm initiates call throttling.
NOTE 2: This doesn't mean you will see one processor's CPU usage at 100% and the rest at 0% in RTMT. Since the main thread can run on processor A for 1/10th of a second and on processor B for the next 2/10ths of a second, etc., the CPU usage shown in RTMT will look more balanced. By default RTMT shows average CPU usage over a 30-second duration.
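A quick back-of-the-envelope calculation shows how this looks in the counters; the load figures below are assumed values for illustration only:

# Why total CPU can look low while the ccm main router thread is saturated:
# one thread can occupy at most one (virtual) processor at a time.
virtual_processors = 4          # e.g., MCS-7845 with hyper-threading
router_thread_busy = 1.0        # main router thread pegged at 100% of one processor
other_load = 0.05               # assumed small load on each remaining processor

total_cpu = (router_thread_busy + other_load * (virtual_processors - 1)) / virtual_processors
print(f"total CPU reported: {total_cpu:.0%}")   # ~29%, yet call processing is saturated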
You can also use APIs to query perfmon counters.
Through SOAP APIs, you can query: perfmon counters, device information, DB access, and CDR access.
Through SNMP, you can query the CISCO-CCM-MIB: ccmPhoneTable, ccmGatewayTable, etc.
You can also download historical information by using RTMT Trace/Log Central:
  • Cisco AMC Service PerfMonLog – enabled by default. Deprecated in CUCM 6.0 because Cisco RIS Data Collector PerfMonLog is introduced.
  • Cisco RIS Data Collector PerfMonLog – disabled by default in CUCM 5.x; enabled by default in CUCM 6.0.
Code Yellow

The CodeYellow alert is generated when the ccm service goes into the Code Yellow state, which means the ccm service is overloaded. You can configure the CodeYellow alert so that once it occurs, the trace files are downloaded for troubleshooting purposes.
The AverageExpectedDelay counter represents the current average expected delay for handling any incoming message. If the value is above the value specified in the "Code Yellow Entry Latency" service parameter, the CodeYellow alarm is generated. This counter is one of the key indicators of call processing performance issues.
Sometimes you might see CodeYellow while total CPU usage is only 25%. This is because CUCM needs one full processor for call processing; when no processor resource is available, CodeYellow may occur even though total CPU usage is only around 25-30% on a four-virtual-processor server. Similarly, on a two-processor server, CodeYellow is possible at around 50% total CPU usage.
Other perfmon counters that should be monitored are:
  • Cisco CallManager\CallsActive, CallsAttempted, EncryptedCallsActive, AuthenticatedCallsActive, VideoCallsActive
  • Cisco CallManager\RegisteredHardwarePhones, RegisteredMGCPGateway
  • Cisco CallManager\T1ChannelsActive, FXOPortsActive, MTPResourceActive, MOHMulticastResourceActive
  • Cisco Locations\BandwidthAvailable
  • Cisco CallManager System Performance\AverageExpectedDelay
Alerts that should be monitored include: CodeYellow, DBReplicationFailure, LowCallManagerHeartbeat, ExcessiveVoiceQualityReports, MaliciousCallTrace, CDRFileDeliveryFailure/CDRAgentSendFileFailed, CriticalServiceDown, and CoreDumpFileFound.
The following is a screen shot of the RTMT performance page:
Counter1.jpg

Note: In general, CUCM 4.x Communications Manager perfmon counters have been preserved, using the same names and representing the same values. The CISCO-CCM-MIB is also backward compatible.
RIS Data Collector PerfMonLog

In CUCM 5.x, the RIS Data Collector PerfMonLog file is not enabled by default. To enable RIS Data Collector PerfMonLog, go to the CUCM admin page, go to the Service Parameter page, select the Cisco RIS Data Collector service, and set Enable Logging to True, as follows:
RISdata1.jpg
It is recommended to enable RIS Data Collector PerfMonLog, which is very useful for troubleshooting since it tracks CPU, memory, disk, network, etc. If you enable RIS Data Collector PerfMonLog, you can disable AMC PerfMonLog.
Note: RIS Data Collector PerfMonLog is introduced in CUCM 6.0 to replace AMC PerfMonLog. RIS Data Collector PerfMonLog provides a little more information than AMC PerfMonLog. For detailed information, please see CUCM Serviceability User Guide.
It is recommended to turn on RIS Data Collector PerfMonLog as soon as CUCM is up and running (in CUCM 6.0 and later, it is turned on by default). When RIS Data Collector PerfMonLog is turned on, the impact on CPU is so small (around 1%) that it can be ignored.

To review RIS Data Collector PerfMonLog data, use RTMT Trace & Log Central to download the Cisco RIS Data Collector PerfMonLog files for the time period that you are interested in. Open the log file using the Windows Perfmon Viewer (or the RTMT Perfmon viewer), then add performance counters of interest, such as:
  • CPU usage -> Processor or Process % CPU
  • Memory usage -> Memory %VM Used
  • Disk usage -> Partition % Used
  • Call Processing -> Cisco CallManager CallsActive
The following is a screen shot of the Windows Perfmon Viewer:
Perfmonviewer1.jpg
Service Status Monitoring
The RTMT Critical Services page provides the current status of all critical services, as follows:
Servicestat1.jpg
The CriticalServiceDown alert is generated when any of the critical services is down.
Note 1: The RTMT backend service checks the status (by default) every 30 seconds, so if a service goes down and comes back up within that period, the CriticalServiceDown alert may not be generated.
Note 2: The CriticalServiceDown alert monitors only those services listed in the RTMT Critical Services page.
If you suspect (or want to double-check) whether a service got restarted without generating core files, a few ways to check are:
  • The RTMT Critical Services page shows the elapsed time.
  • Check the RIS Troubleshooting perfmon log files and see whether the PID for the service (process) has changed.
The following CLI commands can be used to check the logs of Service Manager:
     file get activelog platform/servm_startup.log
     file get activelog platform/log/servm*.log
The following CLI commands can be used to duplicate certain RTMT functions:
     utils service
     show perf
     show risdb

The CoreDumpFileFound alert is generated when the RTMT backend service detects a new core dump file. Both the CriticalServiceDown and CoreDumpFileFound alerts can be configured to download the corresponding trace files for troubleshooting purposes. This helps to preserve trace files at the time of a crash.
Syslog Messages Monitoring

Syslog messages can be viewed using the RTMT syslog viewer, as shown in the following screen shot:
Syslog1.jpg
Sending syslog traps to a remote server (CISCO-SYSLOG-MIB)

If you want to send syslog messages as syslog traps, here are the steps:
1. Set up the trap (notification) destination from the Unified CM Serviceability SNMP page – http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/service/5_1_3/ccmsrva/sasnmpv1.html
2. Enable trap generation in CISCO-SYSLOG-MIB
3. Set the appropriate SysLog level in CISCO-SYSLOG-MIB
If you feel you are missing syslog traps for some Unified Communications Manager service alarms, check the RTMT syslog viewer to see if the alarms are shown there. If not, adjust the alarm configuration setting to send alarms to the local syslog. For information on alarm configuration, refer to the Alarm Configuration section of the Cisco Unified CallManager Serviceability Administration Guide here – http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/service/5_1_3/ccmsrva/saalarm.html. Also check that the SysLog level in CISCO-SYSLOG-MIB is set at the appropriate level.
Syslog generated due to hardware failures has an event severity of 4 or higher and contains one of the following patterns:
  • cma*[???]:*
  • cma*[????]:*
  • cma*[?????]:*
  • hp*[???]:*
  • hp*[????]:*
  • hp*[?????]:*
Therefore, you can do a manual search for the patterns above to find hardware failure events in syslog.
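One way to automate that search is to translate the patterns above into a regular expression; the translation below (process names beginning with cma or hp, followed by a bracketed 3-to-5 digit PID and a colon) is one interpretation of those patterns, and the sample lines are made up for illustration.

import re

# Interpretation of the glob-style patterns above as a regex; sample lines are fabricated.
HW_FAILURE = re.compile(r"\b(?:cma|hp)\S*\[\d{3,5}\]:")

sample_syslog = [
    "Jun  4 10:02:11 cucm1 cmaidad[2471]: Physical drive 2 failed",
    "Jun  4 10:05:40 cucm1 kernel: eth0 link up",
]

for line in sample_syslog:
    if HW_FAILURE.search(line):
        print("possible hardware failure:", line)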
RTMT Alerts as Syslog Messages and Traps

RTMT alerts can be logged as syslog messages and sent to a remote syslog and syslog trap server. To send them to local and remote syslog, configure the AMC alarm configuration page of the CUCM Serviceability web page. For CUCM 5.1 and later releases, go to the Serviceability web page and, under Alarm Configuration, check the AMC service parameter “Alarm Enabled”. Then go to the Serviceability web page, under Tools -> Control Center – Network Services, and restart the AMC service.
Syslog2.jpg
Phone registration status needs to be monitored for sudden changes. If the registration status changes slightly and readjusts quickly over a short time frame, it could be indicative of a phone move, add, or change. A sudden, smaller drop in the registered phone counter can be indicative of a localized outage, for instance an access switch or a WAN circuit outage or malfunction. A significant drop in the registered phone level needs immediate attention by the administrator. This counter especially needs to be monitored before and after upgrades to ensure the system is restored completely.
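A simple way to watch for such drops is to compare successive samples of the registered-phone counter; the sample values and the 10% threshold below are hypothetical and should be tuned per cluster.

# Sketch of detecting a sudden drop in RegisteredHardwarePhones between samples.
DROP_THRESHOLD = 0.10   # assumed threshold; tune for your cluster

def registration_drop(previous: int, current: int) -> bool:
    if previous == 0:
        return False
    return (previous - current) / previous >= DROP_THRESHOLD

samples = [5000, 4995, 4990, 4300]   # last sample: roughly a 14% drop
for prev, cur in zip(samples, samples[1:]):
    if registration_drop(prev, cur):
        print(f"investigate: registered phones fell from {prev} to {cur}")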
RTMT Reports
RTMT has a number of pre-can screens for information such as Summary, Call Activity, Device Status, Server Status, Service Status, and Alert Status. The RTMT “Summary” pre-can screen shows a summary view of CUCM system health. It shows CPU, Memory, Registered Phones, CallsInProgress, and Active Gateway ports & channels. This should be one of the first things you check each day to make sure CPU and memory usage are within the normal range for your cluster and all phones are registered properly.
Phone Summary and Device Summary pre-can screens provide more detailed information about phone and gateway status. If there are a number of devices that fail to register, then you can use the Admin Find/List page or RTMT device search to get further information regarding the problem devices. Critical Services pre-can screen displays the current running/activation status of key services. You can access all the pre-can screens by simply clicking the corresponding icons on the left.
Serviceability Reports Archive

The Cisco Serviceability Reporter service generates daily reports in the Cisco Unified CallManager Serviceability web page. Each report provides a summary that comprises different charts displaying the statistics for that particular report. Reporter generates reports once a day on the basis of logged information, such as:
  • Device Statistics Report
  • Server Statistics Report
  • Service Statistics Report
  • Call Activities Report
  • Alert Summary Report
  • Performance Protection Report
For detailed information about each report, please see the following URL: http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/service/5_0_2/ccmsrvs/sssrvrep.html#wp1033420
