Channel: Kevin Holman's System Center Blog

How grooming and auto-resolution work in the OpsMgr 2007 Operational database


How Grooming and Auto-Resolution work in the OpsMgr 2007 Operations DB

 

 

Warning – don’t read this if you are bored easily. 

 

 

Here is a simplified view of how alerts are groomed…..

 

Grooming of the ops DB is called once per day at 12:00am…. by the rule:  “Partitioning and Grooming”.  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted to the “Root Management Server” and is part of the System Center Internal Library.

 

It calls the “p_PartitioningAndGrooming” stored procedure, which calls p_Grooming, which calls p_GroomNonPartitionedObjects (Alerts are not partitioned), which inspects the PartitionAndGroomingSettings table… and executes each stored procedure.  The Alerts stored procedure in that table is referenced as p_AlertGrooming, which has the following SQL statement:

 

    SELECT AlertId INTO #AlertsToGroom

    FROM dbo.Alert

    WHERE TimeResolved IS NOT NULL

    AND TimeResolved < @GroomingThresholdUTC

    AND ResolutionState = 255

 

So…. the criteria for what is groomed is pretty simple:  In a resolution state of “Closed” (255) and older than the 7-day default setting (or your custom setting, reflected in the PartitionAndGroomingSettings table referenced above).
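
To see how many alerts would currently qualify under that criteria, you can run a quick count against the OperationsManager database.  This is only a sketch – it assumes the default 7 day retention, so adjust the DATEADD value to match your own grooming setting:

    -- Count closed alerts old enough to be groomed (assumes the default 7 day retention)
    SELECT COUNT(*) AS ClosedAlertsEligibleForGrooming
    FROM dbo.Alert
    WHERE TimeResolved IS NOT NULL
    AND TimeResolved < DATEADD(dd, -7, GETUTCDATE())
    AND ResolutionState = 255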

 

We won’t groom any alerts that are in New (0), or any custom resolution states (custom ID #).  Those will have to be set to “Closed” (255)…. either by auto-resolution of a monitor returning to healthy, direct user interaction, our built-in auto-resolution mechanism, or your own custom script.
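
If you want to see whether alerts are piling up in New or custom resolution states (and will therefore never be groomed), a simple count by state against the same table works.  Just a sketch:

    -- How many alerts exist in each resolution state (0 = New, 255 = Closed)
    SELECT ResolutionState, COUNT(*) AS AlertCount
    FROM dbo.Alert
    GROUP BY ResolutionState
    ORDER BY AlertCount DESC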

 

Ok – that covers grooming.

 

However – I can see that brings up the question – how does auto-resolution work?

 

 

 

 

The auto-resolve settings in the console (under Administration, Settings, Alerts) specifically state “alerts in the new resolution state”.  I don’t think that is completely correct:

 

Auto-resolution is driven by the rule “Alert Auto Resolve Execute All”, which runs p_AlertAutoResolveExecuteAll once per day at 4:00am.  This calls p_AlertAutoResolve twice…. once with a parameter of “0” and once with a “1”.

 

Here is the SQL statement:

 

IF (@AutoResolveType = 0)

    BEGIN

        SELECT @AlertResolvePeriodInDays = [SettingValue]

        FROM dbo.[GlobalSettings]

        WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_HealthyAlertAutoResolvePeriod()

 

        SET @AutoResolveThreshold = DATEADD(dd, -@AlertResolvePeriodInDays, getutcdate())

        SET @RootMonitorId = dbo.fn_ManagedTypeId_SystemHealthEntityState()

   

        -- We will resolve all alerts that have green state and are un-resolved

        -- and haven't been modified for N number of days.

        INSERT INTO @AlertsToBeResolved

        SELECT A.[AlertId]

        FROM dbo.[Alert] A

        JOIN dbo.[State] S

            ON A.[BaseManagedEntityId] = S.[BaseManagedEntityId] AND S.[MonitorId] = @RootMonitorId

        WHERE A.[LastModified] < @AutoResolveThreshold

        AND A.[ResolutionState] <> 255

        AND S.[HealthState] = 1

 

<snip>

 

    ELSE IF (@AutoResolveType = 1)

    BEGIN

        SELECT @AlertResolvePeriodInDays = [SettingValue]

        FROM dbo.[GlobalSettings]

        WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_AlertAutoResolvePeriod()

 

        SET @AutoResolveThreshold = DATEADD(dd, -@AlertResolvePeriodInDays, getutcdate())

 

        -- We will resolve all alerts that are un-resolved

        -- and haven't been modified for N number of days.

        INSERT INTO @AlertsToBeResolved

        SELECT A.[AlertId]

        FROM dbo.[Alert] A

        WHERE A.[LastModified] < @AutoResolveThreshold

        AND ResolutionState <> 255

 

 

So we are basically checking that ResolutionState <> 255….. not specifically “New” (0), as the wording in the interface would lead you to believe.  There are simply two types of auto-resolution:  Resolve all alerts where the object has returned to a healthy state in “N” days….. and Resolve all alerts no matter what, as long as they haven’t been modified in “N” days.
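
If you want to preview what that second pass would close in your environment, something like the following sketch mirrors the WHERE clause above – the 30 in the DATEADD is just a placeholder for your own auto-resolve period:

    -- Alerts the "resolve all" pass would close, assuming a 30 day auto-resolve period
    SELECT COUNT(*) AS AlertsToBeAutoResolved
    FROM dbo.Alert
    WHERE LastModified < DATEADD(dd, -30, GETUTCDATE())
    AND ResolutionState <> 255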


Failed tasks aren't groomed from the Operational Database


This issue appears to be present up to the SP1 RC version, build 6.0.6246.0.

 

In the Task Status console view - I noticed an old failed task from 2 months ago..... however, my task grooming is set to 7 days.

 

To view the grooming process:

http://blogs.technet.com/kevinholman/archive/2007/12/13/how-grooming-and-auto-resolution-work-in-the-opsmgr-2007-operational-database.aspx

Basically – select * from PartitionAndGroomingSettings will show you all grooming going on.

Tasks are kept in the jobstatus table.

Select * from jobstatus will show all tasks.

p_jobstatusgrooming is called to groom this table.

Here is the text of that SP:

--------------------------------

USE [OperationsManager]

GO

/****** Object:  StoredProcedure [dbo].[p_JobStatusGrooming]    Script Date: 02/05/2008 10:49:32 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

ALTER PROCEDURE [dbo].[p_JobStatusGrooming]

AS

BEGIN

SET NOCOUNT ON

DECLARE @Err int

DECLARE @Ret int

DECLARE @RowCount int

DECLARE @SaveTranCount int

DECLARE @GroomingThresholdLocal datetime

DECLARE @GroomingThresholdUTC datetime

DECLARE @TimeGroomingRan datetime

DECLARE @MaxTimeGroomed datetime

SET @SaveTranCount = @@TRANCOUNT

SET @TimeGroomingRan = getutcdate()

SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())

FROM dbo.PartitionAndGroomingSettings

WHERE ObjectName = 'JobStatus'

EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT

IF (@@ERROR <> 0)

BEGIN

GOTO Error_Exit

END

-- Selecting the max time to be groomed to update the table

SELECT @MaxTimeGroomed = MAX(LastModified)

FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC  

IF @MaxTimeGroomed IS NULL

GOTO Success_Exit

BEGIN TRAN

-- Change the Statement below to reflect the new item

-- that needs to be groomed

DELETE FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC

SET @Err = @@ERROR

IF (@Err <> 0)

BEGIN

GOTO Error_Exit

END

UPDATE dbo.PartitionAndGroomingSettings

SET GroomingRunTime = @TimeGroomingRan,

        DataGroomedMaxTime = @MaxTimeGroomed

WHERE ObjectName = 'JobStatus'

SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

IF (@Err <> 0 OR @RowCount <> 1)

BEGIN

GOTO Error_Exit

END

COMMIT TRAN

Success_Exit:

RETURN 0

Error_Exit:

-- If there was an error and there is a transaction

-- pending, rollback.

IF (@@TRANCOUNT > @SaveTranCount)

ROLLBACK TRAN

RETURN 1

END

------------------------------------

 

 

Here is the problem in the SP:

 

DELETE FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC

 

 

We only delete (groom) tasks that have a timestamp in TimeFinished.  If a failed task doesn’t finish – this field will be NULL and never gets groomed.
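
To see how many of these stuck task records you have accumulated, you can count the rows that fail the DELETE criteria above.  Just a sketch:

-- JobStatus rows that will never be groomed because TimeFinished is NULL
SELECT COUNT(*) AS UngroomedTaskRows
FROM dbo.JobStatus
WHERE TimeFinished IS NULL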

Print Server management pack fills the Operational DB with TONS of perf data


This is something I have noticed in MOM 2005, and seems to be the same in the conversion MP for OpsMgr 2007.  (Version 6.0.5000.0 of the Microsoft.Windows.Server.Printserver (Converted) MP).  When you import this MP, it will fill the Operational and reporting databases with performance data about print jobs and queues, if you have a large number of print servers/queues in your environment.

If reporting on this perf data is not critical to your environment, you should disable these rules:

[image]

Grooming process in the Operations Database


This is a continuation of my other post, on general alert grooming:

How grooming and auto-resolution work in the OpsMgr 2007 Operational database

 

Grooming of the OpsDB is called once per day at 12:00am…. by the rule:  “Partitioning and Grooming”.  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted to the “Root Management Server” and is part of the System Center Internal Library.

[image]

 

It calls the “p_PartitioningAndGrooming” stored procedure.  This SP calls two other SP's:  p_Partitioning, and then p_Grooming.

p_Partitioning inspects the table PartitionAndGroomingSettings, and then calls the SP p_PartitionObject for each object in the PartitionAndGroomingSettings table where "IsPartitioned = 1"   (note - we partition event and perf into 61 daily tables - just like MOM 2005)

The PartitionAndGroomingSettings table:

[image]
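
If the screenshot is hard to read, you can pull the same settings with a query.  This is just a sketch – the columns are the ones referenced by the grooming sprocs in this post:

SELECT ObjectName, DaysToKeep, IsPartitioned, GroomingRunTime, DataGroomedMaxTime
FROM PartitionAndGroomingSettings
ORDER BY ObjectName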

 

The p_PartitionObject SP first identifies the next partition in the sequence, truncates it to make sure it is empty, and then updates the PartitionTables table in the database, to update the IsCurrent field to the next numeric table for events and perf.  Then it calls the p_PartitionAlterInsertView sproc, to make new data start writing to the current event and perf table.

To review which tables you are writing to - execute the following query:   select * from partitiontables where IsCurrent = '1'

A select * from partitiontables will show you all 61 event and perf tables, and when they were used.  You should see a PartitionStartTime updated every day - around midnight (time is stored in UTC in the database).  If partitioning is failing to run, then we won't see this date changing every day.
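
A quick spot check is to look only at the most recent partition switch.  If partitioning is healthy, this should return a time from around the most recent midnight (UTC).  Just a sketch:

-- Should return a time near the most recent midnight (UTC) if partitioning ran
SELECT MAX(PartitionStartTime) AS LastPartitionSwitch
FROM PartitionTables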

 

Ok - that's the first step of the p_PartitioningAndGrooming sproc - Partitioning.  Now - if that is all successful, we will start grooming!

The p_Grooming sproc is called after partitioning is successful.  One of the first things it does - is to update the InternalJobHistory table.  In this table - we keep a record of all partitioning and grooming jobs.  It is a good spot check to see what's going on with grooming.  To have a peek at this table - execute a select * from InternalJobHistory order by InternalJobHistoryId

[image]

 

The p_Grooming sproc then calls p_GroomPartitionedObjects.  This sproc will first examine the PartitionAndGroomingSettings table and compare the DaysToKeep column against the current date, to figure out how many partitions to groom.  It will then inspect the partitions to ensure they have data, and then truncate each partition by calling p_PartitionTruncate.  The p_GroomPartitionedObjects sproc will then update the PartitionAndGroomingSettings table with the current time, under the GroomingRunTime column.

Next - the p_Grooming sproc continues, by calling p_GroomNonPartitionedObjects.  p_GroomNonPartitionedObjects is a short, but complex sproc - in that it calls all the individual sprocs listed in the PartitionAndGroomingSettings table where IsPartitioned = 0.  (see my other post at the link above to follow the logic of one of these non-partitioned sprocs)

Next - the p_Grooming sproc continues, by updating the InternalJobHistory table, to give the job a status of success (StatusCode of 1 = success, 2 = failed, 0 appears to mean never completed).
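
Based on those StatusCode values, a filtered version of the InternalJobHistory query is a handy way to look for recent partitioning and grooming failures.  Just a sketch:

-- Most recent partitioning/grooming jobs that did not report success
SELECT *
FROM InternalJobHistory
WHERE StatusCode <> 1
ORDER BY InternalJobHistoryId DESC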

 

If you ever have a problem with grooming - or need to get your OpsDB database size under control - simply reduce the data retention days, in the console, under Administration, Settings, Database Grooming.  To start with - I recommend setting all of these to just 2 days, from the default of 7.  This keeps your OpsDB under control until you have time to tune all the noise from the MP's you import.  So just reduce this number, then open up query analyzer, and execute p_PartitioningAndGrooming.  When it is done, check the job status by executing select * from InternalJobHistory order by InternalJobHistoryId.  The last groom job should be present, and successful.  The OpsDB size should be smaller, with more free space.  And to validate, you can always run my large table query, found at:   Useful Operations Manager 2007 SQL queries

What SQL maintenance should I perform on my OpsMgr databases?


This question comes up a lot.  The answer is really - not what maintenance you should be performing... but what maintenance you should be *excluding*.... or when.  Here is why:

Most SQL DBA's will set up some pretty basic default maintenance on all SQL DB's they support.  This often includes, but is not limited to:

DBCC CHECKDB  (to look for DB errors and report on them)

UPDATE STATISTICS  (to boost query performance)

DBCC DBREINDEX  (to rebuild the table indexes to boost performance)

BACKUP

SQL DBA's might schedule these to run via the SQL Agent to execute nightly, weekly, or some combination of the above depending on DB size and requirements.

On the other side of the coin.... in some companies, the MOM/OpsMgr team installs and owns the SQL server.... and they don't do ANY default maintenance to SQL.  Because of this - a focus in OpsMgr was to have the Ops DB and Data Warehouse DB be fully self-maintaining.... providing a good level of SQL performance whether or not any default maintenance was being done.

Operational Database:

Reindexing is already taking place against the OperationsManager database for some of the tables.  This is built into the product.  What we need to ensure - is that any default DBA maintenance tasks are not redundant nor conflicting with our built-in maintenance, and our built-in schedules:

There is a rule in OpsMgr that is targeted at the Root Management Server:

[image]

The rule executes the "p_OptimizeIndexes" stored procedure, every day at 2:30AM:

[image]

[image]

This rule cannot be changed or modified.  Therefore - we need to ensure there is no other SQL maintenance (including backups) running at 2:30AM, or performance will be impacted.

If you want to view the built-in UPDATE STATISTICS and DBCC DBREINDEX jobs history - just run the following queries:

select *
from DomainTable dt
inner join DomainTableIndexOptimizationHistory dti
on dt.domaintablerowID = dti.domaintableindexrowID
ORDER BY optimizationdurationseconds DESC

select *
from DomainTable dt
inner join DomainTableStatisticsUpdateHistory dti
on dt.domaintablerowID = dti.domaintablerowID
ORDER BY UpdateDurationSeconds DESC

Take note of the update/optimization duration seconds column.  This will show you how long your maintenance is typically running.  In a healthy environment these should not take very long.

 

If you want to view the fragmentation levels of the current tables in the database, run:

DBCC SHOWCONTIG WITH FAST

Here is some sample output:

----------------------------------------------------------------------------------------------

DBCC SHOWCONTIG scanning 'Alert' table...
Table: 'Alert' (1771153355); index ID: 1, database ID: 5
TABLE level scan performed.
- Pages Scanned................................: 936
- Extent Switches..............................: 427
- Scan Density [Best Count:Actual Count].......: 27.34% [117:428]
- Logical Scan Fragmentation ..................: 60.90%

----------------------------------------------------------------------------------------------

In general - we would like the "Scan density" to be high (Above 80%), and the "Logical Scan Fragmentation" to be low (below 30%).  What you might find... is that *some* of the tables are more fragmented than others, because our built-in maintenance does not reindex all tables.  Especially tables like the raw perf, event, and localizedtext tables.
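
On SQL 2005 and later, you can get the same per-index fragmentation picture from the sys.dm_db_index_physical_stats DMV instead of DBCC SHOWCONTIG.  This is only a sketch – run it while connected to the OperationsManager database, and adjust the page_count filter to taste:

USE OperationsManager
go
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       i.name AS IndexName,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID('OperationsManager'), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i
    ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE ips.page_count > 100   -- ignore tiny tables
ORDER BY ips.avg_fragmentation_in_percent DESC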

That said - there is nothing wrong with running a DBA's default maintenance against the Operational database..... reindexing these tables in the database might also help console performance.  We just don't want to run any DBA maintenance during the same time that we run our own internal maintenance, so try not to conflict with this schedule.  Care should also be taken in any default DBA maintenance, that it does not run too long, or impact normal operations of OpsMgr.  Maintenance jobs should be monitored, and should not conflict with the backup schedules either.

Here is a reindex job you can schedule with SQL agent.... for the OpsDB:

USE OperationsManager
go
SET ANSI_NULLS ON
SET ANSI_PADDING ON
SET ANSI_WARNINGS ON
SET ARITHABORT ON
SET CONCAT_NULL_YIELDS_NULL ON
SET QUOTED_IDENTIFIER ON
SET NUMERIC_ROUNDABORT OFF
EXEC SP_MSForEachTable "Print 'Reindexing '+'?' DBCC DBREINDEX ('?')"

 

Data Warehouse Database:

The data warehouse DB is also fully self-maintaining.  This is handled by the rule "Standard Data Warehouse Data Set maintenance rule", which is targeted to the "Standard Data Set" object type.  The stored procedure it runs is called on the data warehouse every 60 seconds.  It performs many, many tasks, of which index optimization is but one.

[image]

This SP calls the StandardDatasetOptimize stored procedure, which handles any index operations.

To examine the index and statistics history - run the following query for the Alert, Event, Perf, and State tables:

 

select basetablename, optimizationstartdatetime, optimizationdurationseconds,
      beforeavgfragmentationinpercent, afteravgfragmentationinpercent,
      optimizationmethod, onlinerebuildlastperformeddatetime
from StandardDatasetOptimizationHistory sdoh
inner join StandardDatasetAggregationStorageIndex sdasi
on sdoh.StandardDatasetAggregationStorageIndexRowId = sdasi.StandardDatasetAggregationStorageIndexRowId
inner join StandardDatasetAggregationStorage sdas
on sdasi.StandardDatasetAggregationStorageRowId = sdas.StandardDatasetAggregationStorageRowId
ORDER BY optimizationdurationseconds DESC

 

Then examine the default domain tables optimization history.... run the same two queries as listed above for the OperationsDB.

In the data warehouse - we can see that all the necessary tables are being updated and reindexed as needed.  When a table is 10% fragmented - we reorganize.  When it is 30% or more, we rebuild the index.

Therefore - there is no need for a DBA to execute any UPDATE STATISTICS or DBCC DBREINDEX maintenance against this database.  Furthermore, since we run our maintenance every 60 seconds, and only execute maintenance when necessary, there is no "set window" where we will run our maintenance jobs.  This means that if a DBA team also sets up an UPDATE STATISTICS or DBCC DBREINDEX job - it can conflict with our jobs and execute concurrently.  This should not be done.

 

For the above reasons, I would recommend against any maintenance jobs on the Data Warehouse DB, beyond a CHECKDB (only if DBA's mandate it) and a good backup schedule. 

 

For the OpsDB: any standard maintenance is fine, as long as it does not conflict with the built-in maintenance, or impact production by taking too long, or having an impact on I/O.

 

Lastly - I'd like to discuss the recovery model of the SQL database.  We default to "simple" for all our DB's.  This should be left alone.... unless you have *very* specific reasons to change this.  Some SQL teams automatically assume all databases should be set to "full" recovery model.  This requires that they back up the transaction logs on a very regular basis, but gives the added advantage of restoring up to the time of the last t-log backup.  For OpsMgr, this is of very little value, as the data changing on an hourly basis is of little value compared to the complexity added by moving from simple to full.  Also, changing to full will mean that your transaction logs will only checkpoint once a t-log backup is performed.  What I have seen is that many companies aren't prepared for the amount of data written to these databases.... and their standard transaction log backups (often hourly) are not frequent enough to keep them from filling.  The only valid reason to change to FULL, in my opinion, is when you are using an advanced replication strategy, like log shipping, which requires full recovery model.  When in doubt - keep it simple.  :-)

 

 

P.S....  The Operations Database needs 50% free space at all times.  This is for growth, and for re-index operations to be successful.  This is a general supportability recommendation, but the OpsDB will alert when this falls below 40%. 
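
A quick way to check where you stand is sp_spaceused, run while connected to the OperationsManager database – the "unallocated space" value is the free space inside the data file.  A minimal sketch:

USE OperationsManager
go
EXEC sp_spaceused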

For the Data warehouse.... we do not require the same 50% free space.  This would be a tremendous requirement if we had a multiple-terabyte database!

Think of the data warehouse as having 2 stages... a "growth" stage (while it is adding data and not yet grooming much, because it hasn't hit the default 400 days retention) and a "maturity" stage, where agent count is steady, MP's are not changing, and grooming is happening because we are at 400 days retention.  During "growth" we need to watch and maintain free space, and monitor for available disk space.  In "maturity" we only need enough free space to handle our index operations.  When you start talking 1 terabyte of data.... that means 500GB of free space, which is expensive.  If you cannot allocate it.... then just allow auto-grow and monitor the database.... but always plan for it from a volume size perspective.

For transaction log sizing - we don't have any hard rules.  A good rule of thumb for the OpsDB is ~20% to 50% of the database size.... this all depends on your environment.  For the Data warehouse, it depends on how large the warehouse is - but you will probably find steady state to require somewhere around 10% to 20% of the warehouse size.  Any time we are doing any additional grooming of an alert/event/perf storm.... or changing grooming from 400 days to 300 days - this will require a LOT more transaction log space - so keep that in mind as your databases grow.

Boosting OpsMgr performance - by reducing the OpsDB data retention


Here is a little tip I often advise my customers on.....

The default data retention in OpsMgr is 7 days for most data types:

 

[image]

 

These are default settings which work well for a large cross section of different agent counts.  In MOM 2005 - we defaulted to 4 days.  Many customers, especially with large agent counts, would have to reduce that in MOM 2005 down to 2 days to keep a manageable Onepoint DB size.

 

That being said - to boost UI performance, and reduce OpsDB database size - consider reducing these values down to your real business requirements.  For a new, out of the box management group - I advise my customers to set these to 2 days.  This will keep less noise in your database as you deploy, and tune, agents and management packs.  This keeps a smaller DB, and a more responsive UI, in large agent count environments.

Essentially - set each value to "2" except for Performance Signature, which we will change to 1.  Performance Signature is unique.... the setting here isn't actually "Days" of retention.  It is "business cycles".  This is for self-tuning thresholds ONLY.  This data is used for calculating business-cycle-based self-tuning thresholds.  There is NO REASON for this ever to be larger than the default of "2" business cycles.... and large agent count environments can see a performance benefit by bumping this down to only keeping "1" business cycle.

 

[image]

 

Then - once your Management group is fully deployed, and you have tuned your alert, performance, event, and state data.... IF you have a business requirement to keep this data for longer - bump it up.

Keep in mind - this will NOT cause you to groom out Alerts that are open - only closed alerts, and it will still keep your closed alerts around for a couple of days.

These settings have no impact on the data that is being written to the data warehouse - so any alert, event, or perf data needed will always be there.
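
If you want to confirm the new values took effect, the DaysToKeep column in the PartitionAndGroomingSettings table (described in my grooming posts above) should reflect what you set in the console.  Just a sketch:

SELECT ObjectName, DaysToKeep
FROM PartitionAndGroomingSettings
ORDER BY ObjectName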

Tuning tip – turning off some over-collection of events


We often think of tuning OpsMgr by way of tuning “Alert Noise”…. by disabling rules that generate alerts that we don't care about, or modifying thresholds on monitors to make the alert more actionable for our specific environment.

However – one area of OpsMgr that often goes overlooked, is event overcollection.  This has a cost… because these events are collected and create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially, DataWarehouse size bloat.  I have worked with customers who had a data warehouse that was over one third event data….. and they had ZERO requirement for this nor did they want it.  They were paying for disk storage, and backup expense, plus added time and resources on the framework, all for data they cared nothing about.

MOST of these events are enabled out of the box, and are default OpsMgr collection rules from the “System Center Core Monitoring” MP.  These events are items like “config requested”, “config delivered”, “new config active”.  They might be interesting, but there is no advanced analysis included to use these to detect a problem.  In small environments, they are not usually a big deal.  But in large agent count environments, these events can account for a LOT of data, and provide little value unless you are doing something advanced in analyzing them.  I have yet to see a customer who did that.

 

At a high level – here is how I like to review these events:

  1. Review the Most Common Events query that your OpsDB has.
  2. Create a “My Workspace” view for each event that has a HIGH event count.
  3. Examine the event details for value to YOU.
  4. View the rule that collected the event.
    1. Does the rule also alert or do anything special, or does it simply collect the event?
    2. Do you think the event is required for any special reporting you do?
  5. Create an Override, in an Override MP for the rule source management pack, to disable the rule.
  6. Continue to the next event in the query output, and evaluate it.

 

So, what I like to do – is to run the “Most Common Events” query against the OpsDB, and examine the top events, and consider disabling these event collection rules:

Most common events by event number and event publishername:

SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, Publishername as EventSource
FROM EventAllView eav with (nolock)
GROUP BY Number, Publishername
ORDER BY TotalEvents DESC

The trick is – to run this query periodically – and to examine the most common events for YOUR environment.  The easiest way to view these events – to determine their value – is to create a new Events view in My Workspace, for each event – and then look at the event data, and the rule that collected it:  (I will use a common event 21024 as an example:)

 

[image]

 

[image]

 

What we can see – is that this is a very typical event, and there is likely no real value for collecting and storing this event in the OpsDB or Warehouse.
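
If you prefer to spot-check the raw rows in SQL before (or instead of) building a My Workspace view, a query along these lines works – just a sketch, using event 21024 from the example above:

SELECT TOP 10 *
FROM EventAllView eav with (nolock)
WHERE Number = 21024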

Next – I will examine the rule.  I will look at the Data Source section, and the Response section.  The purpose here is to get a good idea of where this collection rule is looking, what events it is collecting, and if there is also an alert in the response section.  If there is an alert in the response section – I assume this is important, and will generally leave these rules enabled.

If the rule simply collected the event (no alerting), is not used in any reports that I know about (rare condition) and I have determined the event provides little to no value to me, I disable it.  You will find you can disable most of the top consumers in the database.

 

Here is why I consider it totally cool to disable these uninteresting event collection rules:

  • If they are really important – there will be a different alert-generating rule to fire an alert
  • They fill the databases, agent queues, agent load, and network traffic with unimportant information.
  • While troubleshooting a real issue – we would examine the agent event log – we wouldn’t search through the database for collected events.
  • Reporting on events is really slow – because we cannot aggregate them, so views and reports don't work well with events.
  • If we find we do need one later – simply remove the override.

 

Here is an example of this one:

[image]

 

So – I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.

 

Here are some very common event ID’s that I will generally end up disabling their corresponding event collection rules:

 

1206
1210
1215
1216
10102
10401
10403
10409
10457
10720
11771
21024
21025
21402
21403
21404
21405
29102
29103

 

I don't recommend everyone disable all of these rules… I recommend you periodically view your top 10 or 20 events… and then review them for value.  Just knocking out the top 10 events will often free up 90% of the space they were consuming.

The above events are the ones I run into in most of my customers… and I generally turn these off, as we get no value from them.  You might find you have some other events as your top consumers.  I recommend you review them in the same manner as above – methodically.  Then revisit this every month or two to see if anything changed.

I’d also love to hear if you have other events that you see as top consumers that aren't on my list above… SOME events are created from script (conversion MP’s) and unfortunately you cannot do much about those, because you would have to disable the script to fix them.  I’d be happy to give feedback on those, or add any new ones to my list.

Understanding and modifying Data Warehouse retention and grooming


You will likely find that the default retention in the OpsMgr data warehouse will need to be adjusted for your environment.  I often find customers are reluctant to adjust these – because they don't know what they want to keep.  So – they assume the defaults are good – and they just keep EVERYTHING. 

This is a bad idea. 

A data warehouse will often be one of the largest databases supported by a company.  Large databases cost money.  They cost money to support.  They are more difficult to maintain.  They cost more to back up in time, tape capacity, network impact, etc.  They take longer to restore in the case of a disaster.  The larger they get, the more they cost in hardware (disk space) to support.  The larger they get, the longer reports can take to complete.

For these reasons – you should give STRONG consideration to reducing your warehouse retention to your reporting REQUIREMENTS.  If you don't have any – MAKE SOME!

Originally – when the product released – you had to directly edit SQL tables to adjust this.  Then – a command line tool was released to adjust these values – making the process easier and safer.  This post is just going to be a walk through of this process to better understand using this tool – and what each dataset actually means.

Here is the link to the command line tool: 

http://blogs.technet.com/momteam/archive/2008/05/14/data-warehouse-data-retention-policy-dwdatarp-exe.aspx

 

Different data types are kept in the Data Warehouse in unique “Datasets”.  Each dataset represents a different data type (events, alerts, performance, etc..) and the aggregation type (raw, hourly, daily)

Not every customer will have exactly the same data sets.  This is because some management packs will add their own dataset – if that MP has something very unique that it will collect – that does not fit into the default “buckets” that already exist.

 

So – first – we need to understand the different datasets available – and what they mean.  All the datasets for an environment are kept in the “Dataset” table in the Warehouse database.

select * from dataset
order by DataSetDefaultName

This will show us the available datasets.  Common datasets are:

Alert data set
Client Monitoring data set
Event data set
Microsoft.Windows.Client.Vista.Dataset.ClientPerf
Microsoft.Windows.Client.Vista.Dataset.DiskFailure
Microsoft.Windows.Client.Vista.Dataset.Memory
Microsoft.Windows.Client.Vista.Dataset.ShellPerf
Performance data set
State data set

Alert, Event, Performance, and State are the most common ones we look at.

 

However – in the warehouse – we also keep different aggregations of some of the datasets – where it makes sense.  The most common datasets that we will aggregate are Performance data, State data, and Client Monitoring data (AEM).  The reason we have raw, hourly, and daily aggregations – is to be able to keep data for longer periods of time – but still have very good performance on running reports.

In MOM 2005 – we used to stick ALL the raw performance data into a single table in the Warehouse.  After a year of data was reached – this meant the perf table would grow to a HUGE size – and running multiple queries against this table would be impossible to complete with acceptable performance.  It also meant grooming this table would take forever, and would be prone to timeouts and failures.

In OpsMgr – now we aggregate this data into hourly and daily aggregations.  These aggregations allow us to “summarize” the performance, or state data, into MUCH smaller table sizes.  This means we can keep data for a MUCH longer period of time than ever before.  We also optimized this by splitting these into multiple tables.  When a table reaches a pre-determined size, or number of records – we will start a new table for inserting.  This allows grooming to be incredibly efficient – because now we can simply drop the old tables when all of the data in a table is older than the grooming retention setting.

 

Ok – that’s the background on aggregations.  To see this information – we will need to look at the StandardDatasetAggregation table.

select * from StandardDatasetAggregation

That table contains all the datasets, and their aggregation settings.  To help make more sense of this -  I will join the dataset and the StandardDatasetAggregation tables in a single query – to only show you what you need to look at:

SELECT DataSetDefaultName,
AggregationTypeId,
MaxDataAgeDays
FROM StandardDatasetAggregation sda
INNER JOIN dataset ds on ds.datasetid = sda.datasetid
ORDER BY DataSetDefaultName

This query will give us the common dataset name, the aggregation type, and the current maximum retention setting.

For the AggregationTypeId:

0 = Raw

20 = Hourly

30 = Daily

Here is my output:

DataSetDefaultName AggregationTypeId MaxDataAgeDays
Alert data set 0 400
Client Monitoring data set 0 30
Client Monitoring data set 30 400
Event data set 0 100
Microsoft.Windows.Client.Vista.Dataset.ClientPerf 0 7
Microsoft.Windows.Client.Vista.Dataset.ClientPerf 30 91
Microsoft.Windows.Client.Vista.Dataset.DiskFailure 0 7
Microsoft.Windows.Client.Vista.Dataset.DiskFailure 30 182
Microsoft.Windows.Client.Vista.Dataset.Memory 0 7
Microsoft.Windows.Client.Vista.Dataset.Memory 30 91
Microsoft.Windows.Client.Vista.Dataset.ShellPerf 0 7
Microsoft.Windows.Client.Vista.Dataset.ShellPerf 30 91
Performance data set 0 10
Performance data set 20 400
Performance data set 30 400
State data set 0 180
State data set 20 400
State data set 30 400
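
If it helps to read that output, here is a variation of the same query that decodes the AggregationTypeId column into its name – just a sketch using the values listed above:

SELECT ds.DataSetDefaultName,
       CASE sda.AggregationTypeId
            WHEN 0 THEN 'Raw'
            WHEN 20 THEN 'Hourly'
            WHEN 30 THEN 'Daily'
       END AS AggregationType,
       sda.MaxDataAgeDays
FROM StandardDatasetAggregation sda
INNER JOIN dataset ds ON ds.datasetid = sda.datasetid
ORDER BY ds.DataSetDefaultName, sda.AggregationTypeId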

 

You will probably notice – that we only keep 10 days of RAW Performance by default.  Generally – you don't want to mess with this.  This is simply to keep a short amount of raw data – to build our hourly and daily aggregations from.  All built-in performance reports in SCOM run from Hourly or Daily aggregations by default.

 

Now we are cooking!

Fortunately – there is a command line tool published that will help make changes to these retention periods, and provide more information about how much data we have currently.  This tool is called DWDATARP.EXE.  It is available for download at the link above.

This gives us a nice way to view the current settings.  Download this to your tools machine, your RMS, or directly on your warehouse machine.  Run it from a command line.

Run just the tool with no parameters to get help:    

C:\>dwdatarp.exe

To get our current settings – run the tool with ONLY the –s (server\instance) and –d (database) parameters.  This will output the current settings.  However – it does not format well to the screen – so output it to a TXT file and open it:

C:\>dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW > c:\dwoutput.txt

Here is my output (I removed some of the vista/client garbage for brevity)

 

Dataset name Aggregation name Max Age Current Size, Kb
Alert data set Raw data 400 18,560 ( 1%)
Client Monitoring data set Raw data 30 0 ( 0%)
Client Monitoring data set Daily aggregations 400 16 ( 0%)
Configuration dataset Raw data 400 153,016 ( 4%)
Event data set Raw data 100 1,348,168 ( 37%)
Performance data set Raw data 10 467,552 ( 13%)
Performance data set Hourly aggregations 400 1,265,160 ( 35%)
Performance data set Daily aggregations 400 61,176 ( 2%)
State data set Raw data 180 13,024 ( 0%)
State data set Hourly aggregations 400 305,120 ( 8%)
State data set Daily aggregations 400 20,112 ( 1%)

 

Right off the bat – I can see how little data that daily performance actually consumes.  I can see how much data that only 10 days of RAW perf data consume.  I also see a surprising amount of event data consuming space in the database.  Typically – you will see that perf hourly will consume the most space in a warehouse.

 

So – with this information in hand – I can do two things….

  • I can know what is using up most of the space in my warehouse.
  • I can know the Dataset name, and Aggregation name… to input to the command line tool to adjust it!

 

Now – on to the retention adjustments.

 

First thing – I will need to gather my Reporting service level agreement from management.  This is my requirement for how long I need to keep data for reports.  I also need to know “what kind” of reports they want to be able to run for this period.

From this discussion with management – we determined:

  • We require detailed performance reports for 90 days (hourly aggregations)
  • We require less detailed performance reports (daily aggregations) for 1 year for trending and capacity planning.
  • We want to keep a record of all ALERTS for 6 months.
  • We don't use any event reports, so we can reduce this retention from 100 days to 30 days.
  • We don't use AEM (Client Monitoring Dataset) so we will leave this unchanged.
  • We don't report on state changes much (if any) so we will set all of these to 90 days.

Now I will use the DWDATARP.EXE tool – to adjust these values based on my company reporting SLA:

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Daily aggregations" -m 365

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Alert data set" -a "Raw data" -m 180

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Event data set" -a "Raw Data" -m 30

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Raw data" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Daily aggregations" -m 90

 

Now my table reflects my reporting SLA – and my actual space needed in the warehouse will be much reduced in the long term:

 

Dataset name Aggregation name Max Age Current Size, Kb
Alert data set Raw data 180 18,560 ( 1%)
Client Monitoring data set Raw data 30 0 ( 0%)
Client Monitoring data set Daily aggregations 400 16 ( 0%)
Configuration dataset Raw data 400 152,944 ( 4%)
Event data set Raw data 30 1,348,552 ( 37%)
Performance data set Raw data 10 468,960 ( 13%)
Performance data set Hourly aggregations 90 1,265,992 ( 35%)
Performance data set Daily aggregations 365 61,176 ( 2%)
State data set Raw data 90 13,024 ( 0%)
State data set Hourly aggregations 90 305,120 ( 8%)
State data set Daily aggregations 90 20,112 ( 1%)

 

Here are some general rules of thumb (might be different if your environment is unique)

  • Only keep the maximum retention of data in the warehouse per your reporting requirements.
  • Do not modify the performance RAW dataset.
  • Most performance reports are run against Perf Hourly data for detail performance throughout the day.  For reports that span long periods of time (weeks/months) you should generally use Daily aggregation.
  • Daily aggregations should generally be kept for the same retention as hourly – or longer.
  • Hourly datasets use up much more space than daily aggregations.
  • Most people don't use events in reports – and these can often be groomed much sooner than the default of 100 days.
  • Most people don't do a lot of state reporting beyond 30 days, and these can be groomed much sooner as well if desired.
  • Don't modify a setting if you don't use it.  There is no need.
  • The Configuration dataset generally should not be modified.  This keeps data about objects to report on, in the warehouse.  It should be set to at LEAST the longest of any perf, alert, event, or state datasets that you use for reporting.

How grooming and auto-resolution work in the OpsMgr 2007 Operational database

$
0
0

How Grooming and Auto-Resolution works in the OpsMgr 2007 Operations DB

 

 

Warning – don’t read this if you are bored easily. 

 

 

In a simplified view to groom alerts…..

 

Grooming of the ops DB is called once per day at 12:00am…. by the rule:  “Partitioning and Grooming  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted to the “Root Management Server” and is part of the System Center Internal Library.

 

It calls the “p_PartitioningAndGrooming” stored procedure, which calls p_Grooming, which calls p_GroomNonPartitionedObjects (Alerts are not partitioned) which inspects the PartitionAndGroomingSettings table… and executes each stored procedure.  The Alerts stored procedure in that table is referenced as p_AlertGrooming which has the following sql statement:

 

    SELECT AlertId INTO #AlertsToGroom

    FROM dbo.Alert

    WHERE TimeResolved ISNOTNULL

    AND TimeResolved < @GroomingThresholdUTC

    AND ResolutionState = 255

 

So…. the criteria for what is groomed is pretty simple:  In a resolution state of “Closed” (255) and older than the 7 day default setting (or your custom setting referenced in the table above)

 

We won’t groom any alerts that are in New (0), or any custom resolution-states (custom ID #).  Those will have to be set to “Closed” (255)…. either by autoresolution of a monitor returning to healthy, direct user interaction, our built in autoresolution mechanism, or your own custom script.

 

Ok – that covers grooming.

 

However – I can see that brings up the question – how does auto-resolution work?

 

 

 

 

That specifically states “alerts in the new resolution state”.  I don’t think that is completely correct:

 

That is called upon by the rule “Alert Auto Resolve Execute All” which runs p_AlertAutoResolveExecuteAll once per day at 4:00am.  This calls p_AlertAutoResolve twice…. once with a variable of “0” and once with a “1”.

 

Here is the sql statement:

 

IF(@AutoResolveType = 0)

    BEGIN

        SELECT @AlertResolvePeriodInDays = [SettingValue]

        FROM dbo.[GlobalSettings]

        WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_HealthyAlertAutoResolvePeriod()

 

        SET @AutoResolveThreshold =DATEADD(dd,-@AlertResolvePeriodInDays,getutcdate())

        SET @RootMonitorId = dbo.fn_ManagedTypeId_SystemHealthEntityState()

   

        -- We will resolve all alerts that have green state and are un-resolved

        -- and haven't been modified for N number of days.

        INSERTINTO @AlertsToBeResolved

        SELECT A.[AlertId]

        FROM dbo.[Alert] A

        JOIN dbo.[State] S

            ON A.[BaseManagedEntityId] = S.[BaseManagedEntityId] AND S.[MonitorId] = @RootMonitorId

        WHERE A.[LastModified] < @AutoResolveThreshold

        AND A.[ResolutionState] <> 255

        AND S.[HealthState] = 1

 

<snip>

 

    ELSEIF(@AutoResolveType = 1)

    BEGIN

        SELECT @AlertResolvePeriodInDays = [SettingValue]

        FROM dbo.[GlobalSettings]

        WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_AlertAutoResolvePeriod()

 

        SET @AutoResolveThreshold =DATEADD(dd,-@AlertResolvePeriodInDays,getutcdate())

 

        -- We will resolve all alerts that are un-resolved

        -- and haven't been modified for N number of days.

        INSERTINTO @AlertsToBeResolved

        SELECT A.[AlertId]

        FROM dbo.[Alert] A

        WHERE A.[LastModified] < @AutoResolveThreshold

        AND ResolutionState <> 255

 

 

So we are basically checking that Resolution state <> 255….. not specifically “New” (0) as we would lead you to believe by the wording in the interface.  There are simply two types of auto-resolution:  Resolve all alerts where the object has returned to a healthy state in “N” days….. and Resolve all alerts no matter what, as long as they haven’t been modified in “N” days.

Failed tasks aren't groomed from the Operational Database

$
0
0

This appears to be present up to RC-SP1 version, build 6.0.6246.0

 

In the Task Status console view - I noticed an old failed task from 2 months ago..... however, my task grooming is set to 7 days.

 

To view the grooming process:

http://blogs.technet.com/kevinholman/archive/2007/12/13/how-grooming-and-auto-resolution-work-in-the-opsmgr-2007-operational-database.aspx

Basically – select * from PartitionAndGroomingSettings will show you all grooming going on.

Tasks are kept in the jobstatus table.

Select * from jobstatus will show all tasks.

p_jobstatusgrooming is called to groom this table.

Here is the text of that SP:

--------------------------------

USE [OperationsManager]

GO

/****** Object:  StoredProcedure [dbo].[p_JobStatusGrooming]    Script Date: 02/05/2008 10:49:32 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

ALTER PROCEDURE [dbo].[p_JobStatusGrooming]

AS

BEGIN

SET NOCOUNT ON

DECLARE @Err int

DECLARE @Ret int

DECLARE @RowCount int

DECLARE @SaveTranCount int

DECLARE @GroomingThresholdLocal datetime

DECLARE @GroomingThresholdUTC datetime

DECLARE @TimeGroomingRan datetime

DECLARE @MaxTimeGroomed datetime

SET @SaveTranCount = @@TRANCOUNT

SET @TimeGroomingRan = getutcdate()

SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())

FROM dbo.PartitionAndGroomingSettings

WHERE ObjectName = 'JobStatus'

EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT

IF (@@ERROR <> 0)

BEGIN

GOTO Error_Exit

END

-- Selecting the max time to be groomed to update the table

SELECT @MaxTimeGroomed = MAX(LastModified)

FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC  

IF @MaxTimeGroomed IS NULL

GOTO Success_Exit

BEGIN TRAN

-- Change the Statement below to reflect the new item

-- that needs to be groomed

DELETE FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC

SET @Err = @@ERROR

IF (@Err <> 0)

BEGIN

GOTO Error_Exit

END

UPDATE dbo.PartitionAndGroomingSettings

SET GroomingRunTime = @TimeGroomingRan,

        DataGroomedMaxTime = @MaxTimeGroomed

WHERE ObjectName = 'JobStatus'

SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

IF (@Err <> 0 OR @RowCount <> 1)

BEGIN

GOTO Error_Exit

END

COMMIT TRAN

Success_Exit:

RETURN 0

Error_Exit:

-- If there was an error and there is a transaction

-- pending, rollback.

IF (@@TRANCOUNT > @SaveTranCount)

ROLLBACK TRAN

RETURN 1

END

------------------------------------

 

 

Here is the problem in the SP:

 

DELETE FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC

 

 

We only delete (groom) tasks that have a timestamp in TimeFinished.  If a failed task doesn’t finish – this field will be NULL and never gets groomed.

Print Server management pack fills the Operational DB with TONS of perf data

$
0
0

This is something I have noticed in MOM 2005, and seems to be the same in the conversion MP for OpsMgr 2007.  (Version 6.0.5000.0 of the Microsoft.Windows.Server.Printserver (Converted) MP).  When you import this MP, it will fill the Operational and reporting databases with performance data about print jobs and queues, if you have a large number of print servers/queues in your environment.

If reporting on this perf data is not critical to your environment, you should disable these rules:

clip_image002

Grooming process in the Operations Database

$
0
0

This is a continuation of my other post, on general alert grooming:

How grooming and auto-resolution work in the OpsMgr 2007 Operational database

 

Grooming of the OpsDB is called once per day at 12:00am…. by the rule:  “Partitioning and Grooming” You can search for this rule in the Authoring space of the console, under Rules. It is targeted to the “Root Management Server” and is part of the System Center Internal Library.

image

 

It calls the “p_PartitioningAndGrooming” stored procedure.  This SP calls two other SP's:  p_Partitioning and then p_Grooming

p_Partitioning inspects the table PartitionAndGroomingSettings, and then calls the SP p_PartitionObject for each object in the PartitionAndGroomingSettings table where "IsPartitioned = 1"   (note - we partition event and perf into 61 daily tables - just like MOM 2005)

The PartitionAndGroomingSettings table:

image

 

The p_PartitionObject SP first identifies the next partition in the sequence, truncates it to make sure it is empty, and then updates the PartitionTables table in the database, to update the IsCurrent field to the next numeric table for events and perf.  Then it calls the p_PartitionAlterInsertView sproc, to make new data start writing to the current event and perf table.

To review which tables you are writing to - execute the following query:   select * from partitiontables where IsCurrent = '1'

A select * from partitiontables will show you all 61 event and perf tables, and when they were used.  You should see a PartitionStartTime updated every day - around midnight (time is stored in UTC in the database).  If partitioning is failing to run, then we wont see this date changing every day.  

 

Ok - that's the first step of the p_PartitioningAndGrooming sproc - Partitioning.  Now - if that is all successful, we will start grooming!

The p_Grooming is called after partitioning is successful.  One of the first things it does - is to update the InternalJobHistory table.  In this able - we keep a record of all partitioning and grooming jobs.  It is a good spot check to see what's going on with grooming.  To have a peek at this table - execute a select * from InternalJobHistory order by InternalJobHistoryId

image

 

The p_Grooming sproc then calls p_GroomPartitionedObjects  This sproc will first examine the PartitionAndGroomingSettings and compare the days to keep column, against the current date, to figure out how many partitions to groom.  It will then inspect the partitions to ensure they have data, and then truncate the partition, by calling p_PartitionTruncate.  The p_GroomPartitionedObjects sproc will then update the PartitionAndGroomingSettings table with the current time, under the GroomingRunTime column. 

Next - the p_Grooming sproc continues, by calling p_GroomNonPartitionedObjects.  p_GroomNonPartitionedObjects is a short, but complex sproc - in that is calls all the individual sprocs listed in the PartitionAndGroomingSettings table where IsPartitioned = 0.  (see my other post at the link above to follow the logic of one of these non-partitioned sprocs)

Next - the p_Grooming sproc continues, by updating the InternalJobHistory table, to give it a status of success (StatusCode of 1 = success, 2= failed, 0 appears to be never completed?)

 

If you ever have a problem with grooming - or need to get your OpsDB database size under control - simply reduce the data retention days, in the console, under Administration, Settings, Database Grooming.  To start with - I recommend setting all these to just 2 days, fromt he default of 7.  This keeps your OpsDB under control until you have time to tune all the noise fromt he MP's you import.  So just reduce this number, then open up query analyzer, and execute p_PartitioningAndGrooming  When it is done, check the job status by executing select * from InternalJobHistory order by InternalJobHistoryId   The last groom job should be present, and successful.  The OpsDB size should be smaller, with more free space.  And to validate, you can always run my large table query, found at:   Useful Operations Manager 2007 SQL queries

What SQL maintenance should I perform on my OpsMgr databases?

$
0
0

This question comes up a lot.  The answer is really - not what maintenance you should be performing... but what maintenance you should be *excluding*.... or when.  Here is why:

Most SQL DBA's will set up some pretty basic default maintenance on all SQL DB's they support.  This often includes, but is not limited to:

DBCC CHECKDB  (to look for DB errors and report on them)

UPDATE STATISTICS  (to boost query performance)

DBCC DBREINDEX  (to rebuild the table indexes to boost performance)

BACKUP

SQL DBA's might schedule these to run via the SQL Agent to execute nightly, weekly, or some combination of the above depending on DB size and requirements.

On the other side of the coin.... in some companies, the MOM/OpsMgr team installs and owns the SQL server.... and they dont do ANY default maintenance to SQL.  Because of this - a focus in OpsMgr was to have the Ops DB and Datawarehouse DB to be fully self-maintaining.... providing a good level of SQL performance whether or not any default maintenance was being done.

Operational Database:

Reindexing is already taking place against the OperationsManager database for some of the tables.  This is built into the product.  What we need to ensure - is that any default DBA maintenance tasks are not redundant nor conflicting with our built-in maintenance, and our built-in schedules:

There is a rule in OpsMgr that is targeted at the Root Management Server:

image

The rule executes the "p_OptimizeIndexes" stored procedure, every day at 2:30AM:

image

image

This rule cannot be changed or modified.  Therefore - we need to ensure there is not other SQL maintenance (including backups) running at 2:30AM, or performance will be impacted.

If you want to view the built-in UPDATE STATISTICS and DBCC DBREINDEX jobs history - just run the following queries:

select *
from DomainTable dt
inner join DomainTableIndexOptimizationHistory dti
on dt.domaintablerowID = dti.domaintableindexrowID
ORDER BY optimizationdurationseconds DESC

select *
from DomainTable dt
inner join DomainTableStatisticsUpdateHistory dti
on dt.domaintablerowID = dti.domaintablerowID
ORDER BY UpdateDurationSeconds DESC

Take note of the update/optimization duration seconds column.  This will show you how long your maintenance is typically running.  In a healthy environment these should not take very long.

 

If you want to view the fragmentation levels of the current tables in the database, run:

DBCC SHOWCONTIG WITH FAST

Here is some sample output:

----------------------------------------------------------------------------------------------

DBCC SHOWCONTIG scanning 'Alert' table...
Table: 'Alert' (1771153355); index ID: 1, database ID: 5
TABLE level scan performed.
- Pages Scanned................................: 936
- Extent Switches..............................: 427
- Scan Density [Best Count:Actual Count].......: 27.34% [117:428]
- Logical Scan Fragmentation ..................: 60.90%

----------------------------------------------------------------------------------------------

In general - we would like the "Scan density" to be high (Above 80%), and the "Logical Scan Fragmentation" to be low (below 30%).  What you might find... is that *some* of the tables are more fragmented than others, because our built-in maintenance does not reindex all tables.  Especially tables like the raw perf, event, and localizedtext tables.

That said - there is nothing wrong with running a DBA's default maintenance against the Operational database..... reindexing these tables in the database might also help console performance.  We just dont want to run any DBA maintenance during the same time that we run our own internal maintenance, so try not to conflict with this schedule.  Care should also be taken in any default DBA maintenance, that it does not run too long, or impact normal operations of OpsMgr.  Maintenance jobs should be monitored, and should not conflict with the backup schedules as well.

Here is a reindex job you can schedule with SQL agent.... for the OpsDB:

USE OperationsManager
go
SET ANSI_NULLS ON
SET ANSI_PADDING ON
SET ANSI_WARNINGS ON
SET ARITHABORT ON
SET CONCAT_NULL_YIELDS_NULL ON
SET QUOTED_IDENTIFIER ON
SET NUMERIC_ROUNDABORT OFF
EXEC SP_MSForEachTable "Print 'Reindexing '+'?' DBCC DBREINDEX ('?')"

 

Data Warehouse Database:

The data warehouse DB is also fully self maintaining.  This is called out by a rule "Standard Data Warehouse Data Set maintenance rule" which is targeted to the "Standard Data Set" object type.  This stored procedure is called on the data warehouse every 60 seconds.  It performs many, many tasks, of which Index optimization is but one.

image

This SP calls the StandardDatasetOptimize stored procedure, which handles any index operations.

To examine the index and statistics history - run the following query for the Alert, Event, Perf, and State tables:

 

select basetablename, optimizationstartdatetime, optimizationdurationseconds,
      beforeavgfragmentationinpercent, afteravgfragmentationinpercent,
      optimizationmethod, onlinerebuildlastperformeddatetime
from StandardDatasetOptimizationHistory sdoh
inner join StandardDatasetAggregationStorageIndex sdasi
on sdoh.StandardDatasetAggregationStorageIndexRowId = sdasi.StandardDatasetAggregationStorageIndexRowId
inner join StandardDatasetAggregationStorage sdas
on sdasi.StandardDatasetAggregationStorageRowId = sdas.StandardDatasetAggregationStorageRowId
ORDER BY optimizationdurationseconds DESC

 

Then examine the default domain tables optimization history.... run the same two queries as listed above for the OperationsDB.

In the data warehouse - we can see that all the necessary tables are being updated and reindexed as needed.  When a table is 10% fragmented - we reorganize.  When it is 30% or more, we rebuild the index.

Therefore - there is no need for a DBA to execute any UPDATE STATISTICS or DBCC DBREINDEX maintenance against this database.  Furthermore, since we run our maintenance every 60 seconds, and only execute maintenance when necessary, there is no "set window" where we will run our maintenance jobs.  This means that if a DBA team also sets up an UPDATE STATISTICS or DBCC DBREINDEX job - it can conflict with our jobs and execute concurrently.  This should not be done. 
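If you are curious which method the built-in maintenance has actually been choosing, here is a small sketch that summarizes the same optimization history table used in the query above (no new columns assumed beyond those already shown):

-- Summary of reorganize vs. rebuild decisions made by the built-in DW maintenance
select optimizationmethod,
       COUNT(*) as OptimizationCount,
       AVG(beforeavgfragmentationinpercent) as AvgFragBefore,
       AVG(afteravgfragmentationinpercent) as AvgFragAfter
from StandardDatasetOptimizationHistory
group by optimizationmethod
order by OptimizationCount DESC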

 

For the above reasons, I would recommend against any maintenance jobs on the Data Warehouse DB, beyond a CHECKDB (only if DBA's mandate it) and a good backup schedule. 

 

For the OpsDB: any standard maintenance is fine, as long as it does not conflict with the built-in maintenance, or impact production by taking too long, or having an impact on I/O.

 

Lastly - I'd like to discuss the recovery model of the SQL database.  We default to "simple" for all our DB's.  This should be left alone.... unless you have *very* specific reasons to change it.  Some SQL teams automatically assume all databases should be set to the "full" recovery model.  This requires that they back up the transaction logs on a very regular basis, but gives the added advantage of restoring up to the time of the last t-log backup.  For OpsMgr, this is of very little value, as the data changing on an hourly basis is of little value compared to the complexity added by moving from simple to full.  Also, changing to full means that your transaction logs will only truncate once a t-log backup is performed.  What I have seen is that many companies aren't prepared for the amount of data written to these databases.... and their standard transaction log backups (often hourly) are not frequent enough to keep the logs from filling.  The only valid reason to change to FULL, in my opinion, is when you are using an advanced replication strategy, like log shipping, which requires the full recovery model.  When in doubt - keep it simple.  :-)

 

 

P.S....  The Operations Database needs 50% free space at all times.  This is for growth, and for re-index operations to be successful.  This is a general supportability recommendation, but the OpsDB will alert when this falls below 40%. 

For the Data warehouse.... we do not require the same 50% free space.  This would be a tremendous requirement if we had a multiple-terabyte database!

Think of the data warehouse as having 2 stages... a "growth" stage (while it is adding data and not yet grooming much, because it hasn't hit the default 400 days of retention) and a "maturity" stage, where agent count is steady, MP's are not changing, and grooming is happening because we are at 400 days of retention.  During "growth" we need to watch and maintain free space, and monitor for available disk space.  In "maturity" we only need enough free space to handle our index operations.  When you start talking about 1 terabyte of data.... keeping 50% free would mean 500GB of free space, which is expensive.  If you cannot allocate it.... then just allow auto-grow and monitor the database.... but always plan for it from a volume size perspective.

For transaction log sizing - we don't have any hard rules.  A good rule of thumb for the OpsDB is ~20% to 50% of the database size.... this all depends on your environment.  For the Data warehouse, it depends on how large the warehouse is - but you will probably find steady state to require somewhere around 10% to 20% of the warehouse size.  Any time we are doing additional grooming after an alert/event/perf storm.... or changing grooming from 400 days to 300 days - this will require a LOT more transaction log space - so keep that in mind as your databases grow.
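To keep an eye on the free space and transaction log numbers above, here is a quick sketch using standard SQL Server commands (sp_spaceused and DBCC SQLPERF) - the 50%/40% and log sizing figures are the guidance above, not something these commands enforce:

USE OperationsManager
go
-- Database size and unallocated (free) space for the current database
EXEC sp_spaceused
go
-- Percent of transaction log space in use, for every database on the instance
DBCC SQLPERF(LOGSPACE)
go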

Boosting OpsMgr performance - by reducing the OpsDB data retention


Here is a little tip I often advise my customers on.....

The default data retention in OpsMgr is 7 days for most data types:

 

image

 

These are default settings which work well for a large cross section of different agent counts.  In MOM 2005 - we defaulted to 4 days.  Many customers, especially with large agent counts, would have to reduce that in MOM 2005 down to 2 days to keep a manageable Onepoint DB size.

 

That being said - to boost UI performance, and reduce OpsDB database size - consider reducing these values down to your real business requirements.  For a new, out of the box management group - I advise my customers to set these to 2 days.  This will keep less noise in your database as you deploy, and tune, agents and management packs.  This keeps a smaller DB, and a more responsive UI, in large agent count environments.

Essentially - set each value to "2" except for Performance Signature, which we will change to 1.  Performance Signature is unique.... the setting here isn't actually "Days" of retention.  It is "business cycles".  This data is used ONLY for calculating business-cycle-based self-tuning thresholds.  There is NO REASON for this ever to be larger than the default of "2" business cycles.... and large agent count environments can see a performance benefit by bumping this down to only keeping "1" business cycle.

 

image

 

Then - once your Management group is fully deployed, and you have tuned your alert, performance, event, and state data.... IF you have a business requirement to keep this data for longer - bump it up.

Keep in mind - this will NOT cause you to groom out Alerts that are open - only closed alerts, and still will keep your closed alerts around for a couple days.

These settings have no impact on the data that is being written to the data warehouse - so any alert, event, or perf data needed will always be there.
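If you want to confirm what retention the OpsDB is actually using after changing these console settings, the values are stored in the PartitionAndGroomingSettings table in the OperationsManager database - the same table the grooming stored procedures read (covered in the grooming deep dive post below):

-- Current OpsDB retention (days to keep) and last grooming run time for each data type
select * from PartitionAndGroomingSettings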

Tuning tip – turning off some over-collection of events


We often think of tuning OpsMgr by way of tuning “Alert Noise”…. by disabling rules that generate alerts that we don't care about, or modifying thresholds on monitors to make the alert more actionable for our specific environment.

However – one area of OpsMgr that often goes overlooked, is event overcollection.  This has a cost… because these events are collected and create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially, DataWarehouse size bloat.  I have worked with customers who had a data warehouse that was over one third event data….. and they had ZERO requirement for this nor did they want it.  They were paying for disk storage, and backup expense, plus added time and resources on the framework, all for data they cared nothing about.

MOST of these events are enabled out of the box, and are default OpsMgr event collection rules from the “System Center Core Monitoring” MP.  These events are items like “config requested”, “config delivered”, and “new config active”.  They might be interesting, but there is no advanced analysis included to use these to detect a problem.  In small environments, they are not usually a big deal.  But in large agent count environments, these events can account for a LOT of data, and provide little value unless you are doing something advanced in analyzing them.  I have yet to see a customer who did that.

 

At a high level – here is how I like to review these events:

  1. Review the Most Common Events query that your OpsDB has.
  2. Create a “My Workspace” view for each event that has a HIGH event count.
  3. Examine the event details for value to YOU.
  4. View the rule that collected the event.
    1. Does the rule also alert or do anything special, or does it simply collect the event?
    2. Do you think the event is required for any special reporting you do?
  5. Create an Override, in an Override MP for the rule source management pack, to disable the rule.
  6. Continue to the next event in the query output, and evaluate it.

 

So, what I like to do – is to run the “Most Common Events” query against the OpsDB, and examine the top events, and consider disabling these event collection rules:

Most common events by event number and event publishername:

SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, Publishername as EventSource
FROM EventAllView eav with (nolock)
GROUP BY Number, Publishername
ORDER BY TotalEvents DESC
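A companion query I find useful during the same review breaks the totals down by the computer that logged the event, so you can tell whether one noisy system or the whole environment is generating the volume.  This assumes the LoggingComputer column in EventAllView, which is what I see in my environments:

SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, LoggingComputer as ComputerName
FROM EventAllView eav with (nolock)
GROUP BY Number, LoggingComputer
ORDER BY TotalEvents DESC

Either query works for the periodic review described next.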

The trick is – to run this query periodically – and to examine the most common events for YOUR environment.  The easiest way to view these events – to determine their value – is to create a new Events view in My Workspace, for each event – and then look at the event data, and the rule that collected it:  (I will use a common event 21024 as an example:)

 

image

 

image

 

What we can see – is that this is a very typical event, and there is likely no real value for collecting and storing this event in the OpsDB or Warehouse.

Next – I will examine the rule.  I will look at the Data Source section, and the Response section.  The purpose here is to get a good idea of where this collection rule is looking, what events it is collecting, and if there is also an alert in the response section.  If there is an alert in the response section – I assume this is important, and will generally leave these rules enabled.

If the rule simply collected the event (no alerting), is not used in any reports that I know about (rare condition) and I have determined the event provides little to no value to me, I disable it.  You will find you can disable most of the top consumers in the database.

 

Here is why I consider it totally cool to disable these uninteresting event collection rules:

  • If they are really important – there will be a different alert-generating rule to fire an alert
  • They fill the databases, agent queues, agent load, and network traffic with unimportant information.
  • While troubleshooting a real issue – we would examine the agent event log – we wouldn’t search through the database for collected events.
  • Reporting on events is really slow – because we cannot aggregate them, so views and reports don't work well with events.
  • If we find we do need one later – simply remove the override.

 

Here is an example of this one:

image

 

So – I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.

 

Here are some very common event ID’s that I will generally end up disabling their corresponding event collection rules:

 

1206
1210
1215
1216
10102
10401
10403
10409
10457
10720
11771
21024
21025
21402
21403
21404
21405
29102
29103

 

I don't recommend everyone disable all of these rules… I recommend you periodically view your top 10 or 20 events… and then review them for value.  Just knocking out the top 10 events will often free up 90% of the space they were consuming.

The above events are the ones I run into in most of my customers… and I generally turn these off, as we get no value from them.  You might find you have some other events as your top consumers.  I recommend you review them in the same manner as above – methodically.  Then revisit this every month or two to see if anything changed.

I’d also love to hear if you have other events that you see as top consumers that aren't on my list above… SOME events are created from script (conversion MP’s) and unfortunately you cannot do much about those, because you would have to disable the script to fix them.  I’d be happy to give feedback on those, or add any new ones to my list.


Understanding and modifying Data Warehouse retention and grooming


You will likely find that the default retention in the OpsMgr data warehouse will need to be adjusted for your environment.  I often find customers are reluctant to adjust these – because they don't know what they want to keep.  So – they assume the defaults are good – and they just keep EVERYTHING. 

This is a bad idea. 

A data warehouse will often be one of the largest databases supported by a company.  Large databases cost money.  They cost money to support.  They are more difficult to maintain.  They cost more to backup in time, tape capacity, network impact, etc.  They take longer to restore in the case of a disaster.  The larger they get, the more they cost in hardware (disk space) to support them.  And the larger they get, the longer reports can take to complete.

For these reasons – you should give STRONG consideration to reducing your warehouse retention to your reporting REQUIREMENTS.  If you don't have any – MAKE SOME!

Originally – when the product released – you had to directly edit SQL tables to adjust this.  Then – a command line tool was released to adjust these values – making the process easier and safer.  This post is just going to be a walk through of this process to better understand using this tool – and what each dataset actually means.

Here is the link to the command line tool: 

http://blogs.technet.com/momteam/archive/2008/05/14/data-warehouse-data-retention-policy-dwdatarp-exe.aspx

 

Different data types are kept in the Data Warehouse in unique “Datasets”.  Each dataset represents a different data type (events, alerts, performance, etc..) and the aggregation type (raw, hourly, daily)

Not every customer will have exactly the same data sets.  This is because some management packs will add their own dataset – if that MP has something very unique that it will collect – that does not fit into the default “buckets” that already exist.

 

So – first – we need to understand the different datasets available – and what they mean.  All the datasets for an environment are kept in the “Dataset” table in the Warehouse database.

select * from dataset
order by DataSetDefaultName

This will show us the available datasets.  Common datasets are:

Alert data set
Client Monitoring data set
Event data set
Microsoft.Windows.Client.Vista.Dataset.ClientPerf
Microsoft.Windows.Client.Vista.Dataset.DiskFailure
Microsoft.Windows.Client.Vista.Dataset.Memory
Microsoft.Windows.Client.Vista.Dataset.ShellPerf
Performance data set
State data set

Alert, Event, Performance, and State are the most common ones we look at.

 

However – in the warehouse – we also keep different aggregations of some of the datasets – where it makes sense.  The most common datasets that we will aggregate are Performance data, State data, and Client Monitoring data (AEM).  The reason we have raw, hourly, and daily aggregations – is to be able to keep data for longer periods of time – but still have very good performance on running reports.

In MOM 2005 – we used to stick ALL the raw performance data into a single table in the Warehouse.  After a year of data was reached – this meant the perf table would grow to a HUGE size – and running multiple queries against this table would be impossible to complete with acceptable performance.  It also meant grooming this table would take forever, and would be prone to timeouts and failures.

In OpsMgr – now we aggregate this data into hourly and daily aggregations.  These aggregations allow us to “summarize” the performance, or state data, into MUCH smaller table sizes.  This means we can keep data for a MUCH longer period of time than ever before.  We also optimized this by splitting these into multiple tables.  When a table reaches a pre-determined size, or number of records – we will start a new table for inserting.  This allows grooming to be incredibly efficient – because now we can simply drop the old tables when all of the data in a table is older than the grooming retention setting.

 

Ok – that’s the background on aggregations.  To see this information – we will need to look at the StandardDatasetAggregation table.

select * from StandardDatasetAggregation

That table contains all the datasets, and their aggregation settings.  To help make more sense of this -  I will join the dataset and the StandardDatasetAggregation tables in a single query – to only show you what you need to look at:

SELECT DataSetDefaultName,
AggregationTypeId,
MaxDataAgeDays
FROM StandardDatasetAggregation sda
INNER JOIN dataset ds on ds.datasetid = sda.datasetid
ORDER BY DataSetDefaultName

This query will give us the common dataset name, the aggregation type, and the current maximum retention setting.

For the AggregationTypeId:

0 = Raw

20 = Hourly

30 = Daily

Here is my output:

DataSetDefaultName                                    AggregationTypeId   MaxDataAgeDays
Alert data set                                        0                   400
Client Monitoring data set                            0                   30
Client Monitoring data set                            30                  400
Event data set                                        0                   100
Microsoft.Windows.Client.Vista.Dataset.ClientPerf     0                   7
Microsoft.Windows.Client.Vista.Dataset.ClientPerf     30                  91
Microsoft.Windows.Client.Vista.Dataset.DiskFailure    0                   7
Microsoft.Windows.Client.Vista.Dataset.DiskFailure    30                  182
Microsoft.Windows.Client.Vista.Dataset.Memory         0                   7
Microsoft.Windows.Client.Vista.Dataset.Memory         30                  91
Microsoft.Windows.Client.Vista.Dataset.ShellPerf      0                   7
Microsoft.Windows.Client.Vista.Dataset.ShellPerf      30                  91
Performance data set                                  0                   10
Performance data set                                  20                  400
Performance data set                                  30                  400
State data set                                        0                   180
State data set                                        20                  400
State data set                                        30                  400

 

You will probably notice – that we only keep 10 days of RAW Performance by default.  Generally – you don't want to mess with this.  This is simply to keep a short amount of raw data – to build our hourly and daily aggregations from.  All built in performance reports in SCOM run from Hourly, or Daily aggregations by default.
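If you would rather see the aggregation type spelled out instead of remembering the ID values, here is a small variant of the query above using a CASE expression (same tables and columns as before, nothing new assumed):

SELECT DataSetDefaultName,
CASE AggregationTypeId
  WHEN 0 THEN 'Raw'
  WHEN 20 THEN 'Hourly'
  WHEN 30 THEN 'Daily'
  ELSE 'Other'
END AS AggregationType,
MaxDataAgeDays
FROM StandardDatasetAggregation sda
INNER JOIN dataset ds on ds.datasetid = sda.datasetid
ORDER BY DataSetDefaultName, AggregationTypeId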

 

Now we are cooking!

Fortunately – there is a command line tool published that will help make changes to these retention periods, and provide more information about how much data we have currently.  This tool is called DWDATARP.EXE.  It is available for download HERE.

This gives us a nice way to view the current settings.  Download this to your tools machine, your RMS, or directly on your warehouse machine.  Run it from a command line.

Run just the tool with no parameters to get help:    

C:\>dwdatarp.exe

To get our current settings – run the tool with ONLY the –s (server\instance) and –d (database) parameters.  This will output the current settings.  However – it does not format well to the screen – so output it to a TXT file and open it:

C:\>dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW > c:\dwoutput.txt

Here is my output (I removed some of the vista/client garbage for brevity)

 

Dataset name                  Aggregation name      Max Age   Current Size, Kb
Alert data set                Raw data              400       18,560 ( 1%)
Client Monitoring data set    Raw data              30        0 ( 0%)
Client Monitoring data set    Daily aggregations    400       16 ( 0%)
Configuration dataset         Raw data              400       153,016 ( 4%)
Event data set                Raw data              100       1,348,168 ( 37%)
Performance data set          Raw data              10        467,552 ( 13%)
Performance data set          Hourly aggregations   400       1,265,160 ( 35%)
Performance data set          Daily aggregations    400       61,176 ( 2%)
State data set                Raw data              180       13,024 ( 0%)
State data set                Hourly aggregations   400       305,120 ( 8%)
State data set                Daily aggregations    400       20,112 ( 1%)

 

Right off the bat – I can see how little data daily performance actually consumes.  I can see how much data only 10 days of RAW perf data consumes.  I also see a surprising amount of event data consuming space in the database.  Typically – you will see that perf hourly will consume the most space in a warehouse.

 

So – with this information in hand – I can do two things….

  • I can know what is using up most of the space in my warehouse.
  • I can know the Dataset name, and Aggregation name… to input to the command line tool to adjust it!

 

Now – on to the retention adjustments.

 

First thing – I will need to gather my Reporting service level agreement from management.  This is my requirement for how long I need to keep data for reports.  I also need to know “what kind” of reports they want to be able to run for this period.

From this discussion with management – we determined:

  • We require detailed performance reports for 90 days (hourly aggregations)
  • We require less detailed performance reports (daily aggregations) for 1 year for trending and capacity planning.
  • We want to keep a record of all ALERTS for 6 months.
  • We don't use any event reports, so we can reduce this retention from 100 days to 30 days.
  • We don't use AEM (Client Monitoring Dataset) so we will leave this unchanged.
  • We don't report on state changes much (if any) so we will set all of these to 90 days.

Now I will use the DWDATARP.EXE tool – to adjust these values based on my company reporting SLA:

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Daily aggregations" -m 365

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Alert data set" -a "Raw data" -m 180

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Event data set" -a "Raw Data" -m 30

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Raw data" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Daily aggregations" -m 90

 

Now my table reflects my reporting SLA – and my actual space needed in the warehouse will be much reduced in the long term:

 

Dataset name                  Aggregation name      Max Age   Current Size, Kb
Alert data set                Raw data              180       18,560 ( 1%)
Client Monitoring data set    Raw data              30        0 ( 0%)
Client Monitoring data set    Daily aggregations    400       16 ( 0%)
Configuration dataset         Raw data              400       152,944 ( 4%)
Event data set                Raw data              30        1,348,552 ( 37%)
Performance data set          Raw data              10        468,960 ( 13%)
Performance data set          Hourly aggregations   90        1,265,992 ( 35%)
Performance data set          Daily aggregations    365       61,176 ( 2%)
State data set                Raw data              90        13,024 ( 0%)
State data set                Hourly aggregations   90        305,120 ( 8%)
State data set                Daily aggregations    90        20,112 ( 1%)

 

Here are some general rules of thumb (might be different if your environment is unique)

  • Only keep the maximum retention of data in the warehouse per your reporting requirements.
  • Do not modify the performance RAW dataset.
  • Most performance reports are run against Perf Hourly data for detailed performance throughout the day.  For reports that span long periods of time (weeks/months) you should generally use Daily aggregation.
  • Daily aggregations should generally be kept for the same retention as hourly – or longer.
  • Hourly datasets use up much more space than daily aggregations.
  • Most people don't use events in reports – and these can often be groomed much sooner than the default of 100 days.
  • Most people don't do a lot of state reporting beyond 30 days, and these can be groomed much sooner as well if desired.
  • Don't modify a setting if you don't use it.  There is no need.
  • The Configuration dataset generally should not be modified.  This keeps data about objects to report on, in the warehouse.  It should be set to at LEAST the longest of any perf, alert, event, or state datasets that you use for reporting.

OpsMgr 2012 – Grooming deep dive in the OperationsManager database


Grooming of the OpsDB in OpsMgr 2012 is very similar to OpsMgr 2007.  Grooming is called once per day at 12:00am, by the rule “Partitioning and Grooming”.  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted to the “All Management Servers Resource Pool” and is part of the System Center Internal Library.

image

It calls the “p_PartitioningAndGrooming” stored procedure.  This SP calls two other SP's:  p_Partitioning and then p_Grooming

p_Partitioning inspects the table PartitionAndGroomingSettings, and then calls the SP p_PartitionObject for each object in the PartitionAndGroomingSettings table where "IsPartitioned = 1"   (note - we partition event and perf into 61 daily tables - just like MOM 2005/SCOM 2007)

The PartitionAndGroomingSettings table:

image

The p_PartitionObject SP first identifies the next partition in the sequence, truncates it to make sure it is empty, and then updates the PartitionTables table in the database, to update the IsCurrent field to the next numeric table for events and perf.  It also sets the current time as the partition end time in the previous “is current” row, and sets the current time in the partition start time of the new “is current” row.  Then it calls the p_PartitionAlterInsertView sproc, to make new data start writing to the “new” current event and perf table.

To review which tables you are writing to - execute the following query:   select * from partitiontables where IsCurrent = '1'

A select * from partitiontables will show you all 61 event and perf tables, and when they were used.  You should see a PartitionStartTime updated every day - around midnight (time is stored in UTC in the database).  If partitioning is failing to run, then we won't see this date changing every day.  
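A minimal spot check for this: order the partition table by start time and verify the newest rows were switched in around the most recent midnight UTC:

-- Most recently switched partitions at the top; PartitionStartTime should advance daily
select * from PartitionTables
order by PartitionStartTime DESC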

Ok - that's the first step of the p_PartitioningAndGrooming sproc - Partitioning.  Now - if that is all successful, we will start grooming!

The p_Grooming is called after partitioning is successful.  One of the first things it does - is to update the InternalJobHistory table.  In this table - we keep a record of all partitioning and grooming jobs.  It is a good spot check to see what's going on with grooming.  To have a peek at this table - execute a select * from InternalJobHistory order by InternalJobHistoryId DESC

image

The p_Grooming sproc then calls p_GroomPartitionedObjects 

p_GroomPartitionedObjects  will first examine the PartitionAndGroomingSettings and compare the “days to keep” column value, against the current date, to figure out how many partitions to keep vs groom.  It will then inspect the partitions (tables) to ensure they have data, and then truncate the partition, by calling p_PartitionTruncate.  A truncate command is just a VERY fast and efficient way to delete all data from a table without issuing a highly transactional DELETE command.  The p_GroomPartitionedObjects sproc will then update the PartitionAndGroomingSettings table with the current time, under the GroomingRunTime column, to reflect when grooming last ran. 

Next - the p_Grooming sproc continues, by calling p_GroomNonPartitionedObjects. 

p_GroomNonPartitionedObjects is a short, but complex sproc - in that it calls all the individual sprocs listed in the PartitionAndGroomingSettings table where IsPartitioned = 0.  The following stored procedures are present in my database for non-partitioned data:

  • p_AlertGrooming
  • p_StateChangeEventGrooming
  • p_MaintenanceModeHistoryGrooming
  • p_AvailabilityHistoryGrooming
  • p_JobStatusGrooming
  • p_MonitoringJobStatusGrooming
  • p_PerformanceSignatureGrooming
  • p_PendingSdkDataSourceGrooming
  • p_InternalJobHistoryGrooming
  • p_EntityChangeLogGroom
  • p_UserSettingsStoreGrooming
  • p_TriggerEntityChangeLogStagedGrooming

Now, for the above sprocs, each one could potentially return a success or failure.  They will also likely call additional sprocs, for specific tasks.  You can see, the rabbit hole is deep.  This is just an example of the complexity involved in self-maintenance and grooming.  If you are experiencing a grooming failure of any kind, and the error messages involve any of the above stored procedures when you execute p_PartitioningAndGrooming manually, you should open a support case with Microsoft for troubleshooting and resolution.  The theory is that each of the above procedures grooms a specific non-partitioned dataset.  Under NORMAL circumstances, each should be able to complete in a reasonable time frame.  The challenge becomes evident when something goes wrong, like alert storms, state change event storms from monitors flip-flopping, lots of performance signature data from using self-tuning threshold monitors, or huge amounts of pending SDK datasource data from large Exchange 2010 environments or other MP’s that might leverage this.  Grooming non-partitioned data is slow, highly resource intensive, and transactional.  These are specific DELETE statements run directly against tables, often combined with creating temp tables in TempDB.  Having a good pre-sized, high performance TempDB can help, as will ensuring you have plenty of transaction log space for the database, and having a disk subsystem that offers as many IOPS as possible.  http://technet.microsoft.com/en-us/library/ms175527(v=SQL.105).aspx

Next - the p_Grooming sproc continues, by updating the InternalJobHistory table, to give it a status of success (StatusCode of 1 = success, 2= failed, 0 appears to be never completed?)
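Using those status codes, a quick way to surface only the partitioning and grooming jobs that did not report success:

-- Jobs that failed (2) or apparently never completed (0), most recent first
select * from InternalJobHistory
where StatusCode <> 1
order by InternalJobHistoryId DESC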

If you ever have a problem with grooming - or need to get your OpsDB database size under control - simply reduce the data retention days in the console, under Administration, Settings, Database Grooming.  To start with - I recommend setting all of these to just 2 days, from the default of 7.  This keeps your OpsDB under control until you have time to tune all the noise from the MP's you import.  So reduce this number, then open up a query window and execute:  EXEC p_PartitioningAndGrooming.  When it is done, check the job status by executing:  select * from InternalJobHistory order by InternalJobHistoryId DESC.  The last groom job should be present, and successful.  The OpsDB size should be smaller, with more free space.  And to validate, you can always run my large table query, found at:   Useful Operations Manager 2007 SQL queries

System Center Universe is coming – January 19th!


 

REGISTER NOW HERE:  http://www.systemcenteruniverse.com/

image

 

Read Cameron Fuller’s blog post on this here:  http://blogs.catapultsystems.com/cfuller/archive/2015/12/17/scuniverse-returns-to-dallas-tx-and-the-world-on-january-19th-2016/

 

 

SCU is an awesome day of sessions covering Microsoft System Center, Windows Server, and Azure technologies from top speakers including Microsoft experts and MVP’s in the field.

There are two tracks depending on your interests – Cloud and Datacenter Management, and Enterprise Client Management.

The sponsors for 2016 include:

  • Catapult Systems
  • Microsoft
  • Veeam
  • Adaptiva
  • Secunia
  • Heat Software
  • MPx Alliance
  • Squared Up
  • Cireson

If you cannot attend in person – you can still attend via simulcast!  If you want to attend virtually, there are user group based simulcast locations around the world. Registration is available at: http://www.systemcenteruniverse.com/venue.htm

Simulcast event locations include:

  • Austin, TX
  • Denver, CO
  • Houston, TX
  • Omaha, NE
  • Phoenix, AZ
  • San Antonio, TX
  • Seattle, WA
  • Tampa, FL
  • Amsterdam
  • Germany
  • Vienna
  • And of course our event location in Dallas, TX!

If you want to attend, the in-person event it is available in Dallas Texas and registration is available at: https://www.eventbrite.com/e/scu-2016-live-tickets-7970023555

UR8 for SCOM 2012 R2 – Step by Step


 

image

 

NOTE:  I get this question every time we release an update rollup:   ALL SCOM Update Rollups are CUMULATIVE.  This means you do not need to apply them in order, you can always just apply the latest update.  If you have deployed SCOM 2012 R2 and never applied an update rollup – you can go straight to the latest one available.  If you applied an older one (such as UR3) you can always go straight to the latest one!

 

 

KB Article for OpsMgr:  https://support.microsoft.com/en-us/kb/3096382

KB Article for all System Center components:  https://support.microsoft.com/en-us/kb/3096378

Download catalog site:  http://catalog.update.microsoft.com/v7/site/Search.aspx?q=3096382

 

Key fixes:

  • Slow load of alert view when it is opened by an operator
    Sometimes when operators change between alert views, the views take up to two minutes to load. After this update rollup is installed, the reported performance issue is resolved. The Alert View load for the Operator role is now almost the same as that for the Admin role user.
  • SCOMpercentageCPUTimeCounter.vbs causes enterprise wide performance issue
    Health Service encountered slow performance every five to six (5-6) minutes in a cyclical manner. This update rollup resolves this issue.
  • System Center Operations Manager Event ID 33333 Message: The statement has been terminated.
    This change filters out "statement has been terminated" warnings that SQL Server throws. These warning messages cannot be acted on. Therefore, they are removed.
  • System Center 2012 R2 Operations Manager: Report event 21404 occurs with error '0x80070057' after Update Rollup 3 or Update Rollup 4 is applied.
    In Update Rollup 3, a design change was made in the agent code that regressed and caused SCOM agent to report error ‘0x80070057’ and MonitoringHost.exe to stop responding/crash in some scenarios. This update rollup rolls back that UR3 change.
  • SDK service crashes because of Callback exceptions from event handlers being NULL
    In a connected management group environment in certain race condition scenarios, the SDK of the local management group crashes if there are issues during the connection to the different management groups. After this update rollup is installed, the SDK of the local management group should no longer crash.
  • Run As Account(s) Expiring Soon — Alert does not raise early enough
    The 14-day warning for the RunAs account expiration was not visible in the SCOM console. Customers received only an Error event in the console three days before the account expiration. After this update rollup is installed, customers will receive a warning in their SCOM console 14 days before the RunAs account expiration, and receive an Error event three (3) days before the RunAs account expiration.
  • Network Device Certification
    As part of Network device certification, we have certified the following additional devices in Operations Manager to make extended monitoring available for them:
    • Cisco ASA5515
    • Cisco ASA5525
    • Cisco ASA5545
    • Cisco IPS 4345
    • Cisco Nexus 3172PQ
    • Cisco ASA5515-IPS
    • Cisco ASA5545-IPS
    • F5 Networks BIG-IP 2000
    • Dell S4048
    • Dell S3048
    • Cisco ASA5515sc
    • Cisco ASA5545sc
  • French translation of APM abbreviation is misleading
    The French translation of “System Center Management APM service” is misleading. APM abbreviation is translated incorrectly in the French version of Microsoft System Center 2012 R2 Operations Manager. APM means “Application Performance Monitoring” but is translated as “Advanced Power Management." This fix corrects the translation.
  • p_HealthServiceRouteForTaskByManagedEntityId does not account for deleted resource pool members in System Center 2012 R2 Operations Manager
    If customers use Resource Pools and take some servers out of the pool, discovery tasks start failing in some scenarios. After this update rollup is installed, these issues are resolved.
  • Exception in the 'Managed Computer' view when you select Properties of a managed server in Operations Manager Console
    In the Operations Manager Server “Managed Computer” view on the Administrator tab, clicking the “Properties” button of a management server causes an error. After this update rollup is installed, a dialog box that contains a “Heart Beat” tab is displayed.
  • Duplicate entries for devices when network discovery runs
    When customers run discovery tasks to discover network devices, duplicate network devices that have alternative MAC addresses are discovered in some scenarios. After this update rollup is installed, customers will not receive any duplicate devices discovered in their environments.
  • Preferred Partner Program in Administration Pane
    This update lets customers view certified System Center Operations Manager partner solutions directly from the console. Customers can obtain an overview of the partner solutions and visit the partner websites to download and install the solutions.
There are no updates for Linux, and there are no updated MP’s for Linux in this update.

 

Let's get started.

From reading the KB article – the order of operations is:

  1. Install the update rollup package on the following server infrastructure:
    • Management servers
    • Gateway servers
    • Web console server role computers
    • Operations console role computers
  2. Apply SQL scripts.
  3. Manually import the management packs.
  4. Update Agents

Now, NORMALLY we would need to add another step – if we are using Xplat monitoring, we would need to update the Linux/Unix MP’s and agents.   However, in UR8 for SCOM 2012 R2, there are no updates for Linux.

 

 

 

1.  Management Servers

image

Since there is no RMS anymore, it doesn’t matter which management server I start with.  There is no need to begin with whichever server holds the RMSe role.  I simply make sure I only patch one management server at a time to allow for agent failover without overloading any single management server.

I can apply this update manually via the MSP files, or I can use Windows Update.  I have 3 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

The first thing I do when I download the updates from the catalog, is copy the cab files for my language to a single location:

image

Then extract the contents:

image

Once I have the MSP files, I am ready to start applying the update to each server by role.

***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.

My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:

image

This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update usually does not provide any feedback that it had success or failure. 

I got a prompt to restart:

image

I choose yes and allow the server to restart to complete the update.

 

You can check the application log for the MsiInstaller events to show completion:

Log Name:      Application
Source:        MsiInstaller
Event ID:      1036
Level:         Information
Computer:      SCOM01.opsmgr.net
Description:
Windows Installer installed an update. Product Name: System Center Operations Manager 2012 Server. Product Version: 7.1.10226.0. Product Language: 1033. Manufacturer: Microsoft Corporation. Update Name: System Center 2012 R2 Operations Manager UR8 Update Patch. Installation success or error status: 0.

You can also spot check a couple DLL files for the file version attribute. 

image

Next up – run the Web Console update:

image

This runs much faster.   A quick file spot check:

image

Lastly – install the console update (make sure your console is closed):

image

A quick file spot check:

image

 

 

Secondary Management Servers:

image

I now move on to my secondary management servers, applying the server update, then the console update. 

On this next management server, I will use the example of Windows Update as opposed to manually installing the MSP files.  I check online, and make sure that I have configured Windows Update to give me updates for additional products: 

Apparently when I tried this – the catalog was broken – because none of the system center stuff was showing up in Windows Updates.

So….. because of this – I elect to do manual updates like I did above.

I apply these updates, and reboot each management server, until all management servers are updated.

 

 

 

Updating Gateways:

image

I can use Windows Update or manual installation.

image

The update launches a UI and quickly finishes.

Then I will spot check the DLL’s:

image

I can also spot-check the \AgentManagement folder, and make sure my agent update files are dropped here correctly:

image

 

 

 

2. Apply the SQL Scripts

In the path on your management servers, where you installed/extracted the update, there are two SQL script files: 

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

image

First – let’s run the script to update the OperationsManager database.  Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file.  Make sure it is pointing to your OperationsManager database, then execute the script.

You should run this script with each UR, even if you ran this on a previous UR.  The script body can change so as a best practice always re-run this.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.  I have had customers state this takes from a few minutes to as long as an hour. In MOST cases – you will need to shut down the SDK, Config, and Monitoring Agent (healthservice) on ALL your management servers in order for this to be able to run with success.

You will see the following (or similar) output:

image47

or

image

IF YOU GET AN ERROR – STOP!  Do not continue.  Try re-running the script several times until it completes without errors.  In a production environment, you almost certainly have to shut down the services (sdk, config, and healthservice) on your management servers, to break their connection to the databases, to get a successful run.
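Before re-running it, it can help to confirm nothing is still connected to the OperationsManager database.  sp_who2 is a standard SQL Server procedure – just scan the DBName column for OperationsManager connections coming from your management servers:

-- List current sessions; look for anything still connected to the OperationsManager database
EXEC sp_who2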

Technical tidbit:   Even if you previously ran this script in UR1, UR2, UR3, UR4, UR5, UR6, or UR7, you should run this again for UR8, as the script body can change with updated UR’s.

image

Next, we have a script to run against the warehouse DB.  Do not skip this step under any circumstances.    From:

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

Open a SQL management studio query window, connect it to your OperationsManagerDW database, and then open the script file UR_Datawarehouse.sql.  Make sure it is pointing to your OperationsManagerDW database, then execute the script.

If you see a warning about line endings, choose Yes to continue.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.

You will see the following (or similar) output:

image

 

 

 

3. Manually import the management packs

image

There are 26 management packs in this update!

The path for these is on your management server, after you have installed the “Server” update:

\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Management Packs for Update Rollups

However, the majority of them are Advisor/OMS, and language specific.  Only import the ones you need, and that are correct for your language.  I will remove all the Advisor MP’s for other languages, and I am left with the following:

image

The TFS MP bundles are only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  I’d skip this MP import unless you already have these MP’s present in your environment.

The Advisor MP’s are only needed if you are using Microsoft Operations Management Suite cloud service, (Previously known as Advisor, and Operation Insights).

However, the Image and Visualization libraries deal with Dashboard updates, and these always need to be updated.

I import all of these shown without issue.

 

 

4.  Update Agents

image43_thumb

Agents should be placed into pending actions by this update for any agent that was not manually installed (remotely manageable = yes).  Mine worked great.  On the management servers where I used Windows Update to patch them, their agents did not show up in this list.  Only agents whose management server I patched manually showed up in this list.  FYI.

image

If your agents are not placed into pending management – this is generally caused by not running the update from an elevated command prompt, or having manually installed agents which will not be placed into pending.

In this case – agents reporting to a management server that was updated using Windows Update did NOT get placed into pending.  Only the agents reporting to the management server for which I manually executed the patch showed up.

You can approve these – which will result in a success message once complete:

image

Soon you should start to see PatchList getting filled in from the Agents By Version view under Operations Manager monitoring folder in the console:

image
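If you prefer to check from SQL instead of the console, here is a query I use against the OperationsManager database – this assumes the MT_HealthService table with AgentName and PatchList columns, which is what I see in my 2012 R2 management groups:

-- Show which update rollup each agent reports in its patch list
select AgentName, PatchList
from MT_HealthService
order by PatchList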

 

 

 

5.  Update Unix/Linux MPs and Agents

image

There are no updates for Linux in UR8.  Please see the instructions for UR7 if you are not updating from UR7 directly:

http://blogs.technet.com/b/kevinholman/archive/2015/08/17/ur7-for-scom-2012-r2-step-by-step.aspx

 

 

6.  Update the remaining deployed consoles

image

This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, SCVMM server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the matching update version.

 

 

 

Review:

Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

image

Known issues:

See the existing list of known issues documented in the KB article.

1.  Many people are reporting that the SQL script is failing to complete when executed.  You should attempt to run this multiple times until it completes without error.  You might need to stop the Exchange correlation engine, stop all the SCOM services on the management servers, and/or bounce the SQL server services in order to get a successful completion in a busy management group.  The errors reported appear as below:

——————————————————
(1 row(s) affected)
(1 row(s) affected)
Msg 1205, Level 13, State 56, Line 1
Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Msg 3727, Level 16, State 0, Line 1
Could not drop constraint. See previous errors.
——————————————————–

Writing a service recovery script – Cluster service example


 

I had a customer request the ability to monitor the cluster service on clusters, and ONLY alert when a recovery attempt failed.

This is a fairly standard request for service monitoring when we use recoveries – we generally don’t want an alert to be generated from the Service Monitor, because that will be immediate upon service down detection.  We want the service monitor to detect the service down, then run a recovery, and then if the recovery fails to restore service, generate an alert.

Here is an example of that.

The cluster service monitor is unique, in that it already has a built in recovery.  However, it is too simple for our needs, as it only runs NET START.

image

 

So the first thing we will need to do, is create an override disabling this built in recovery:

image

 

Next – override the “Cluster service status” monitor to not generate alerts:

image

 

Now we can add our own script base recovery to the monitor:

image

 

image

 

And paste in a script which I will provide below.  Here is the script:

'==========================================================================
'
' COMMENT: This is a recovery script to recover the Cluster Service
'
'==========================================================================
Option Explicit
SetLocale("en-us")

Dim StartTime,EndTime,sTime
'Capture script start time
StartTime = Now 'Time that the script starts so that we can see how long it has been watching to see if the service stops again.
Dim strTime
strTime = Time

Dim oAPI
Set oAPI = CreateObject("MOM.ScriptAPI")
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3750,0,"Service Recovery script is starting")

Dim strComputer, strService, strStartMode, strState, objCount, strClusterService
'The script will always be run on the machine that generated the monitor error
strComputer = "."
strClusterService = "ClusSvc"

'Record the current state of each service before recovery in an event
Dim strClusterServicestate
ServiceState(strClusterService)
strClusterServicestate = strState
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3751,0,"Current service state before recovery is: " & strClusterService & " : " & strClusterServicestate)

'Stop script if all services are running
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3752,2,"All services were found to be already running, recovery should not run, ending script")
  Wscript.Quit
End If

'Check to see if a specific event has been logged previously that means this recovery script should NOT run if event is present
'This section optional and not commonly used
Dim dtmStartDate, iCount, colEvents, objWMIService, objEvent
' Const CONVERT_TO_LOCAL_TIME = True
' Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
' dtmStartDate.SetVarDate dateadd("n", -60, now) 'CONVERT_TO_LOCAL_TIME
' iCount = 0
' Set objWMIService = GetObject("winmgmts:" _
'   & "{impersonationLevel=impersonate,(Security)}!\\" _
'   & strComputer & "\root\cimv2")
' Set colEvents = objWMIService.ExecQuery _
'   ("Select * from Win32_NTLogEvent Where Logfile = 'Application' and " _
'   & "TimeWritten > '" & dtmStartDate & "' and EventCode = 100")
' For Each objEvent In colEvents
'   iCount = iCount+1
' Next
' If iCount => 1 Then
'   EndTime = Now
'   sTime = DateDiff("s", StartTime, EndTime)
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3761,2,"script found event which blocks execution of this recovery. Recovery will not run. Script ending after " & sTime & " seconds")
'   WScript.Quit
' ElseIf iCount < 1 Then
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3762,0,"script did not find any blocking events. Script will continue")
' End If

'At least one service is stopped to cause this recovery, stopping all three services so we can start them in order
'You would only use this section if you had multiple services and they needed to be started in a specific order
' Call oAPI.LogScriptEvent("ServiceRecovery.vbs",3753,0,"At least one service was found not running. Recovery will run. Attempting to stop all services now")
' ServiceStop(strService1)
' ServiceStop(strService2)
' ServiceStop(strService3)

'Check to make sure all services are actually in stopped state
'Optional: Wait 15 seconds for slow services to stop
' Wscript.Sleep 15000
ServiceState(strClusterService)
strClusterServicestate = strState

'Stop script if all services are not stopped
If (strClusterServicestate <> "Stopped") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3754,2,"Recovery script found service is not in stopped state. Manual intervention is required, ending script. Current service state is: " & strClusterService & " : " & strClusterServicestate)
  Wscript.Quit
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3755,0,"Recovery script verified all services in stopped state. Continuing.")
End If

'Start services in order.
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3756,0,"Attempting to start all services")
Dim errReturn

'Restart Services and watch to see if the command executed without error
ServiceStart(strClusterService)
Wscript.sleep 5000

'Check service state to ensure all services started
ServiceState(strClusterService)
strClusterServicestate = strState

'Log success or fail of recovery
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3757,0,"All services were successfully started and then found to be running")
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3758,2,"Recovery script failed to start all services. Manual intervention is required. Current service state is: " & strClusterService & " : " & strClusterServicestate)
End If

'Check to see if this recovery script has been run three times in the last 60 minutes for loop detection
Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
dtmStartDate.SetVarDate dateadd("n", -60, now) 'CONVERT_TO_LOCAL_TIME
iCount = 0
Set objWMIService = GetObject("winmgmts:" _
  & "{impersonationLevel=impersonate,(Security)}!\\" _
  & strComputer & "\root\cimv2")
Set colEvents = objWMIService.ExecQuery _
  ("Select * from Win32_NTLogEvent Where Logfile = 'Operations Manager' and " _
  & "TimeWritten > '" & dtmStartDate & "' and EventCode = 3750")
For Each objEvent In colEvents
  iCount = iCount+1
Next
If iCount => 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3759,2,"script restarted " & strClusterService & " service 3 or more times in the last hour, script ending after " & sTime & " seconds")
  WScript.Quit
ElseIf iCount < 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3760,0,"script restarted " & strClusterService & " service less than 3 times in the last hour, script ending after " & sTime & " seconds")
End If
Wscript.Quit

'==================================================================================
' Subroutine: ServiceState
' Purpose: Gets the service state and startmode from WMI
'==================================================================================
Sub ServiceState(strService)
  Dim objWMIService, colRunningServices, objService
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colRunningServices = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name = '"& strService & "'")
  For Each objService in colRunningServices
    strState = objService.State
    strStartMode = objService.StartMode
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStart
' Purpose: Starts a service
'==================================================================================
Sub ServiceStart(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StartService()
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStop
' Purpose: Stops a service
'==================================================================================
Sub ServiceStop(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StopService()
  Next
End Sub

 

Here it is inserted into the UI.  I provide a 3 minute timeout for this one:

 

image

 

Here is how it will look once added:

image

 

Now – we need to generate an alert when the script detects that it failed to start the service:

image

 

Provide a name and we will target the same class as the service monitor:

image

 

For the expression – the ID comes from the event generated by the recovery script, and the string search makes sure we are only alerting on a Cluster service recovery, if we reuse the script for other services we need to be able to distinguish from them:

image

 

 

Lets test!

If we just simply stop the Cluster Service – the recovery kicks in and see evidence in the state changes, and event log:

 

image

 

I like REALLY verbose logging in the scripts I write…. more is MUCH better than less, especially when troubleshooting, and recoveries should not be running often enough to clog up the logs.

image

image

image

image

 

image

image

 

 

If the recovery fails to start the service – the script detects this – drops a very specific event, and then an alert is generated for the service being down and manual intervention required:

 

image

 

image

 

 

There we have it – we only get alerts if the service is not recoverable.  This makes SCOM more actionable.  If we want a record of this for reporting – we can collect the events for recovery starting, and then report on those events.
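For example, if you create simple event collection rules for the script's event IDs (3750 for recovery starting, 3757 for success, 3758 for failure), a query like this against the OperationsManager database gives a quick record of recovery activity.  This assumes the LoggingComputer column in EventAllView, as used in the event tuning post above:

-- Count of collected recovery script events by event number and computer
SELECT Number as EventID, LoggingComputer, COUNT(*) AS Total
FROM EventAllView eav with (nolock)
WHERE Number in (3750, 3757, 3758)
GROUP BY Number, LoggingComputer
ORDER BY Total DESC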

You can download this example MP at:

https://gallery.technet.microsoft.com/Cluster-Service-Recovery-270ca2cd
