Channel: Kevin Holman's System Center Blog

How grooming and auto-resolution work in the OpsMgr 2007 Operational database


How Grooming and Auto-Resolution work in the OpsMgr 2007 Operations DB

 

 

Warning – don’t read this if you are bored easily. 

 

 

Here is a simplified view of how alerts are groomed…..

 

Grooming of the ops DB is called once per day at 12:00am…. by the rule:  “Partitioning and Grooming”.  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted to the “Root Management Server” and is part of the System Center Internal Library.

 

It calls the “p_PartitioningAndGrooming” stored procedure, which calls p_Grooming, which calls p_GroomNonPartitionedObjects (Alerts are not partitioned), which inspects the PartitionAndGroomingSettings table… and executes each stored procedure.  The Alerts stored procedure in that table is referenced as p_AlertGrooming, which has the following SQL statement:

 

    SELECT AlertId INTO #AlertsToGroom

    FROM dbo.Alert

    WHERE TimeResolved IS NOT NULL

    AND TimeResolved < @GroomingThresholdUTC

    AND ResolutionState = 255

 

So…. the criteria for what is groomed is pretty simple:  In a resolution state of “Closed” (255) and older than the 7-day default setting (or your custom setting, reflected in the PartitionAndGroomingSettings table referenced above).
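
To see how many alerts would currently qualify under that criteria, you can run a quick count against the OperationsManager database.  This is only a sketch – it assumes the default 7 day retention, so adjust the DATEADD value to match your own grooming setting:

    -- Count closed alerts old enough to be groomed (assumes the default 7 day retention)
    SELECT COUNT(*) AS ClosedAlertsEligibleForGrooming
    FROM dbo.Alert
    WHERE TimeResolved IS NOT NULL
    AND TimeResolved < DATEADD(dd, -7, GETUTCDATE())
    AND ResolutionState = 255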

 

We won’t groom any alerts that are in New (0), or any custom resolution states (custom ID #).  Those will have to be set to “Closed” (255)…. either by auto-resolution of a monitor returning to healthy, direct user interaction, our built-in auto-resolution mechanism, or your own custom script.
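
If you want to see whether alerts are piling up in New or custom resolution states (and will therefore never be groomed), a simple count by state against the same table works.  Just a sketch:

    -- How many alerts exist in each resolution state (0 = New, 255 = Closed)
    SELECT ResolutionState, COUNT(*) AS AlertCount
    FROM dbo.Alert
    GROUP BY ResolutionState
    ORDER BY AlertCount DESC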

 

Ok – that covers grooming.

 

However – I can see that brings up the question – how does auto-resolution work?

 

 

 

 

The auto-resolve settings in the console (under Administration, Settings, Alerts) specifically state “alerts in the new resolution state”.  I don’t think that is completely correct:

 

Auto-resolution is driven by the rule “Alert Auto Resolve Execute All”, which runs p_AlertAutoResolveExecuteAll once per day at 4:00am.  This calls p_AlertAutoResolve twice…. once with a parameter of “0” and once with a “1”.

 

Here is the SQL statement:

 

IF (@AutoResolveType = 0)

    BEGIN

        SELECT @AlertResolvePeriodInDays = [SettingValue]

        FROM dbo.[GlobalSettings]

        WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_HealthyAlertAutoResolvePeriod()

 

        SET @AutoResolveThreshold = DATEADD(dd, -@AlertResolvePeriodInDays, getutcdate())

        SET @RootMonitorId = dbo.fn_ManagedTypeId_SystemHealthEntityState()

   

        -- We will resolve all alerts that have green state and are un-resolved

        -- and haven't been modified for N number of days.

        INSERT INTO @AlertsToBeResolved

        SELECT A.[AlertId]

        FROM dbo.[Alert] A

        JOIN dbo.[State] S

            ON A.[BaseManagedEntityId] = S.[BaseManagedEntityId] AND S.[MonitorId] = @RootMonitorId

        WHERE A.[LastModified] < @AutoResolveThreshold

        AND A.[ResolutionState] <> 255

        AND S.[HealthState] = 1

 

<snip>

 

    ELSE IF (@AutoResolveType = 1)

    BEGIN

        SELECT @AlertResolvePeriodInDays = [SettingValue]

        FROM dbo.[GlobalSettings]

        WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_AlertAutoResolvePeriod()

 

        SET @AutoResolveThreshold = DATEADD(dd, -@AlertResolvePeriodInDays, getutcdate())

 

        -- We will resolve all alerts that are un-resolved

        -- and haven't been modified for N number of days.

        INSERT INTO @AlertsToBeResolved

        SELECT A.[AlertId]

        FROM dbo.[Alert] A

        WHERE A.[LastModified] < @AutoResolveThreshold

        AND ResolutionState <> 255

 

 

So we are basically checking that ResolutionState <> 255….. not specifically “New” (0), as the wording in the interface would lead you to believe.  There are simply two types of auto-resolution:  Resolve all alerts where the object has returned to a healthy state in “N” days….. and Resolve all alerts no matter what, as long as they haven’t been modified in “N” days.
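
If you want to preview what that second pass would close in your environment, something like the following sketch mirrors the WHERE clause above – the 30 in the DATEADD is just a placeholder for your own auto-resolve period:

    -- Alerts the "resolve all" pass would close, assuming a 30 day auto-resolve period
    SELECT COUNT(*) AS AlertsToBeAutoResolved
    FROM dbo.Alert
    WHERE LastModified < DATEADD(dd, -30, GETUTCDATE())
    AND ResolutionState <> 255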


Failed tasks aren't groomed from the Operational Database


This issue appears to be present up to the SP1 RC version, build 6.0.6246.0.

 

In the Task Status console view - I noticed an old failed task from 2 months ago..... however, my task grooming is set to 7 days.

 

To view the grooming process:

http://blogs.technet.com/kevinholman/archive/2007/12/13/how-grooming-and-auto-resolution-work-in-the-opsmgr-2007-operational-database.aspx

Basically – select * from PartitionAndGroomingSettings will show you all grooming going on.

Tasks are kept in the jobstatus table.

Select * from jobstatus will show all tasks.

p_jobstatusgrooming is called to groom this table.

Here is the text of that SP:

--------------------------------

USE [OperationsManager]

GO

/****** Object:  StoredProcedure [dbo].[p_JobStatusGrooming]    Script Date: 02/05/2008 10:49:32 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

ALTER PROCEDURE [dbo].[p_JobStatusGrooming]

AS

BEGIN

SET NOCOUNT ON

DECLARE @Err int

DECLARE @Ret int

DECLARE @RowCount int

DECLARE @SaveTranCount int

DECLARE @GroomingThresholdLocal datetime

DECLARE @GroomingThresholdUTC datetime

DECLARE @TimeGroomingRan datetime

DECLARE @MaxTimeGroomed datetime

SET @SaveTranCount = @@TRANCOUNT

SET @TimeGroomingRan = getutcdate()

SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())

FROM dbo.PartitionAndGroomingSettings

WHERE ObjectName = 'JobStatus'

EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT

IF (@@ERROR <> 0)

BEGIN

GOTO Error_Exit

END

-- Selecting the max time to be groomed to update the table

SELECT @MaxTimeGroomed = MAX(LastModified)

FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC  

IF @MaxTimeGroomed IS NULL

GOTO Success_Exit

BEGIN TRAN

-- Change the Statement below to reflect the new item

-- that needs to be groomed

DELETE FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC

SET @Err = @@ERROR

IF (@Err <> 0)

BEGIN

GOTO Error_Exit

END

UPDATE dbo.PartitionAndGroomingSettings

SET GroomingRunTime = @TimeGroomingRan,

        DataGroomedMaxTime = @MaxTimeGroomed

WHERE ObjectName = 'JobStatus'

SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

IF (@Err <> 0 OR @RowCount <> 1)

BEGIN

GOTO Error_Exit

END

COMMIT TRAN

Success_Exit:

RETURN 0

Error_Exit:

-- If there was an error and there is a transaction

-- pending, rollback.

IF (@@TRANCOUNT > @SaveTranCount)

ROLLBACK TRAN

RETURN 1

END

------------------------------------

 

 

Here is the problem in the SP:

 

DELETE FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC

 

 

We only delete (groom) tasks that have a timestamp in TimeFinished.  If a failed task doesn’t finish – this field will be NULL and never gets groomed.
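
To see how many of these stuck task records you have accumulated, you can count the rows that fail the DELETE criteria above.  Just a sketch:

-- JobStatus rows that will never be groomed because TimeFinished is NULL
SELECT COUNT(*) AS UngroomedTaskRows
FROM dbo.JobStatus
WHERE TimeFinished IS NULL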

Print Server management pack fills the Operational DB with TONS of perf data


This is something I have noticed in MOM 2005, and seems to be the same in the conversion MP for OpsMgr 2007.  (Version 6.0.5000.0 of the Microsoft.Windows.Server.Printserver (Converted) MP).  When you import this MP, it will fill the Operational and reporting databases with performance data about print jobs and queues, if you have a large number of print servers/queues in your environment.

If reporting on this perf data is not critical to your environment, you should disable these rules:

[image]

Grooming process in the Operations Database


This is a continuation of my other post, on general alert grooming:

How grooming and auto-resolution work in the OpsMgr 2007 Operational database

 

Grooming of the OpsDB is called once per day at 12:00am…. by the rule:  “Partitioning and Grooming”.  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted to the “Root Management Server” and is part of the System Center Internal Library.

[image]

 

It calls the “p_PartitioningAndGrooming” stored procedure.  This SP calls two other SP's:  p_Partitioning, and then p_Grooming.

p_Partitioning inspects the table PartitionAndGroomingSettings, and then calls the SP p_PartitionObject for each object in the PartitionAndGroomingSettings table where "IsPartitioned = 1"   (note - we partition event and perf into 61 daily tables - just like MOM 2005)

The PartitionAndGroomingSettings table:

[image]
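
If the screenshot is hard to read, you can pull the same settings with a query.  This is just a sketch – the columns are the ones referenced by the grooming sprocs in this post:

SELECT ObjectName, DaysToKeep, IsPartitioned, GroomingRunTime, DataGroomedMaxTime
FROM PartitionAndGroomingSettings
ORDER BY ObjectName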

 

The p_PartitionObject SP first identifies the next partition in the sequence, truncates it to make sure it is empty, and then updates the PartitionTables table in the database, to update the IsCurrent field to the next numeric table for events and perf.  Then it calls the p_PartitionAlterInsertView sproc, to make new data start writing to the current event and perf table.

To review which tables you are writing to - execute the following query:   select * from partitiontables where IsCurrent = '1'

A select * from partitiontables will show you all 61 event and perf tables, and when they were used.  You should see a PartitionStartTime updated every day - around midnight (time is stored in UTC in the database).  If partitioning is failing to run, then we won't see this date changing every day.
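
A quick spot check is to look only at the most recent partition switch.  If partitioning is healthy, this should return a time from around the most recent midnight (UTC).  Just a sketch:

-- Should return a time near the most recent midnight (UTC) if partitioning ran
SELECT MAX(PartitionStartTime) AS LastPartitionSwitch
FROM PartitionTables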

 

Ok - that's the first step of the p_PartitioningAndGrooming sproc - Partitioning.  Now - if that is all successful, we will start grooming!

The p_Grooming sproc is called after partitioning is successful.  One of the first things it does - is to update the InternalJobHistory table.  In this table - we keep a record of all partitioning and grooming jobs.  It is a good spot check to see what's going on with grooming.  To have a peek at this table - execute a select * from InternalJobHistory order by InternalJobHistoryId

[image]

 

The p_Grooming sproc then calls p_GroomPartitionedObjects.  This sproc will first examine the PartitionAndGroomingSettings table and compare the DaysToKeep column against the current date, to figure out how many partitions to groom.  It will then inspect the partitions to ensure they have data, and then truncate each partition by calling p_PartitionTruncate.  The p_GroomPartitionedObjects sproc will then update the PartitionAndGroomingSettings table with the current time, under the GroomingRunTime column.

Next - the p_Grooming sproc continues, by calling p_GroomNonPartitionedObjects.  p_GroomNonPartitionedObjects is a short, but complex sproc - in that it calls all the individual sprocs listed in the PartitionAndGroomingSettings table where IsPartitioned = 0.  (see my other post at the link above to follow the logic of one of these non-partitioned sprocs)

Next - the p_Grooming sproc continues, by updating the InternalJobHistory table, to give the job a status of success (StatusCode of 1 = success, 2 = failed, 0 appears to mean never completed).
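
Based on those StatusCode values, a filtered version of the InternalJobHistory query is a handy way to look for recent partitioning and grooming failures.  Just a sketch:

-- Most recent partitioning/grooming jobs that did not report success
SELECT *
FROM InternalJobHistory
WHERE StatusCode <> 1
ORDER BY InternalJobHistoryId DESC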

 

If you ever have a problem with grooming - or need to get your OpsDB database size under control - simply reduce the data retention days, in the console, under Administration, Settings, Database Grooming.  To start with - I recommend setting all of these to just 2 days, from the default of 7.  This keeps your OpsDB under control until you have time to tune all the noise from the MP's you import.  So just reduce this number, then open up query analyzer, and execute p_PartitioningAndGrooming.  When it is done, check the job status by executing select * from InternalJobHistory order by InternalJobHistoryId.  The last groom job should be present, and successful.  The OpsDB size should be smaller, with more free space.  And to validate, you can always run my large table query, found at:   Useful Operations Manager 2007 SQL queries

What SQL maintenance should I perform on my OpsMgr databases?


This question comes up a lot.  The answer is really - not what maintenance you should be performing... but what maintenance you should be *excluding*.... or when.  Here is why:

Most SQL DBA's will set up some pretty basic default maintenance on all SQL DB's they support.  This often includes, but is not limited to:

DBCC CHECKDB  (to look for DB errors and report on them)

UPDATE STATISTICS  (to boost query performance)

DBCC DBREINDEX  (to rebuild the table indexes to boost performance)

BACKUP

SQL DBA's might schedule these to run via the SQL Agent to execute nightly, weekly, or some combination of the above depending on DB size and requirements.

On the other side of the coin.... in some companies, the MOM/OpsMgr team installs and owns the SQL server.... and they don't do ANY default maintenance to SQL.  Because of this - a focus in OpsMgr was to have the Ops DB and Data Warehouse DB be fully self-maintaining.... providing a good level of SQL performance whether or not any default maintenance was being done.

Operational Database:

Reindexing is already taking place against the OperationsManager database for some of the tables.  This is built into the product.  What we need to ensure - is that any default DBA maintenance tasks are not redundant nor conflicting with our built-in maintenance, and our built-in schedules:

There is a rule in OpsMgr that is targeted at the Root Management Server:

[image]

The rule executes the "p_OptimizeIndexes" stored procedure, every day at 2:30AM:

[image]

[image]

This rule cannot be changed or modified.  Therefore - we need to ensure there is no other SQL maintenance (including backups) running at 2:30AM, or performance will be impacted.

If you want to view the built-in UPDATE STATISTICS and DBCC DBREINDEX jobs history - just run the following queries:

select *
from DomainTable dt
inner join DomainTableIndexOptimizationHistory dti
on dt.domaintablerowID = dti.domaintableindexrowID
ORDER BY optimizationdurationseconds DESC

select *
from DomainTable dt
inner join DomainTableStatisticsUpdateHistory dti
on dt.domaintablerowID = dti.domaintablerowID
ORDER BY UpdateDurationSeconds DESC

Take note of the update/optimization duration seconds column.  This will show you how long your maintenance is typically running.  In a healthy environment these should not take very long.

 

If you want to view the fragmentation levels of the current tables in the database, run:

DBCC SHOWCONTIG WITH FAST

Here is some sample output:

----------------------------------------------------------------------------------------------

DBCC SHOWCONTIG scanning 'Alert' table...
Table: 'Alert' (1771153355); index ID: 1, database ID: 5
TABLE level scan performed.
- Pages Scanned................................: 936
- Extent Switches..............................: 427
- Scan Density [Best Count:Actual Count].......: 27.34% [117:428]
- Logical Scan Fragmentation ..................: 60.90%

----------------------------------------------------------------------------------------------

In general - we would like the "Scan density" to be high (Above 80%), and the "Logical Scan Fragmentation" to be low (below 30%).  What you might find... is that *some* of the tables are more fragmented than others, because our built-in maintenance does not reindex all tables.  Especially tables like the raw perf, event, and localizedtext tables.
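
On SQL 2005 and later, you can get the same per-index fragmentation picture from the sys.dm_db_index_physical_stats DMV instead of DBCC SHOWCONTIG.  This is only a sketch – run it while connected to the OperationsManager database, and adjust the page_count filter to taste:

USE OperationsManager
go
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       i.name AS IndexName,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID('OperationsManager'), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i
    ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE ips.page_count > 100   -- ignore tiny tables
ORDER BY ips.avg_fragmentation_in_percent DESC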

That said - there is nothing wrong with running a DBA's default maintenance against the Operational database..... reindexing these tables in the database might also help console performance.  We just don't want to run any DBA maintenance during the same time that we run our own internal maintenance, so try not to conflict with this schedule.  Care should also be taken in any default DBA maintenance, that it does not run too long, or impact normal operations of OpsMgr.  Maintenance jobs should be monitored, and should not conflict with the backup schedules either.

Here is a reindex job you can schedule with SQL agent.... for the OpsDB:

USE OperationsManager
go
SET ANSI_NULLS ON
SET ANSI_PADDING ON
SET ANSI_WARNINGS ON
SET ARITHABORT ON
SET CONCAT_NULL_YIELDS_NULL ON
SET QUOTED_IDENTIFIER ON
SET NUMERIC_ROUNDABORT OFF
EXEC SP_MSForEachTable "Print 'Reindexing '+'?' DBCC DBREINDEX ('?')"

 

Data Warehouse Database:

The data warehouse DB is also fully self-maintaining.  This is handled by the rule "Standard Data Warehouse Data Set maintenance rule", which is targeted to the "Standard Data Set" object type.  The stored procedure it runs is called on the data warehouse every 60 seconds.  It performs many, many tasks, of which index optimization is but one.

[image]

This SP calls the StandardDatasetOptimize stored procedure, which handles any index operations.

To examine the index and statistics history - run the following query for the Alert, Event, Perf, and State tables:

 

select basetablename, optimizationstartdatetime, optimizationdurationseconds,
      beforeavgfragmentationinpercent, afteravgfragmentationinpercent,
      optimizationmethod, onlinerebuildlastperformeddatetime
from StandardDatasetOptimizationHistory sdoh
inner join StandardDatasetAggregationStorageIndex sdasi
on sdoh.StandardDatasetAggregationStorageIndexRowId = sdasi.StandardDatasetAggregationStorageIndexRowId
inner join StandardDatasetAggregationStorage sdas
on sdasi.StandardDatasetAggregationStorageRowId = sdas.StandardDatasetAggregationStorageRowId
ORDER BY optimizationdurationseconds DESC

 

Then examine the default domain tables optimization history.... run the same two queries as listed above for the OperationsDB.

In the data warehouse - we can see that all the necessary tables are being updated and reindexed as needed.  When a table is 10% fragmented - we reorganize.  When it is 30% or more, we rebuild the index.

Therefore - there is no need for a DBA to execute any UPDATE STATISTICS or DBCC DBREINDEX maintenance against this database.  Furthermore, since we run our maintenance every 60 seconds, and only execute maintenance when necessary, there is no "set window" where we will run our maintenance jobs.  This means that if a DBA team also sets up an UPDATE STATISTICS or DBCC DBREINDEX job - it can conflict with our jobs and execute concurrently.  This should not be done.

 

For the above reasons, I would recommend against any maintenance jobs on the Data Warehouse DB, beyond a CHECKDB (only if DBA's mandate it) and a good backup schedule. 

 

For the OpsDB: any standard maintenance is fine, as long as it does not conflict with the built-in maintenance, or impact production by taking too long, or having an impact on I/O.

 

Lastly - I'd like to discuss the recovery model of the SQL database.  We default to "simple" for all our DB's.  This should be left alone.... unless you have *very* specific reasons to change this.  Some SQL teams automatically assume all databases should be set to "full" recovery model.  This requires that they back up the transaction logs on a very regular basis, but gives the added advantage of restoring up to the time of the last t-log backup.  For OpsMgr, this is of very little value, as the data changing on an hourly basis is of little value compared to the complexity added by moving from simple to full.  Also, changing to full will mean that your transaction logs will only checkpoint once a t-log backup is performed.  What I have seen is that many companies aren't prepared for the amount of data written to these databases.... and their standard transaction log backups (often hourly) are not frequent enough to keep them from filling.  The only valid reason to change to FULL, in my opinion, is when you are using an advanced replication strategy, like log shipping, which requires full recovery model.  When in doubt - keep it simple.  :-)

 

 

P.S....  The Operations Database needs 50% free space at all times.  This is for growth, and for re-index operations to be successful.  This is a general supportability recommendation, but the OpsDB will alert when this falls below 40%. 
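
A quick way to check where you stand is sp_spaceused, run while connected to the OperationsManager database – the "unallocated space" value is the free space inside the data file.  A minimal sketch:

USE OperationsManager
go
EXEC sp_spaceused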

For the Data warehouse.... we do not require the same 50% free space.  This would be a tremendous requirement if we had a multiple-terabyte database!

Think of the data warehouse as having 2 stages... a "growth" stage (while it is adding data and not yet grooming much, because it hasn't hit the default 400 days retention) and a "maturity" stage, where agent count is steady, MP's are not changing, and grooming is happening because we are at 400 days retention.  During "growth" we need to watch and maintain free space, and monitor for available disk space.  In "maturity" we only need enough free space to handle our index operations.  When you start talking 1 terabyte of data.... that means 500GB of free space, which is expensive.  If you cannot allocate it.... then just allow auto-grow and monitor the database.... but always plan for it from a volume size perspective.

For transaction log sizing - we don't have any hard rules.  A good rule of thumb for the OpsDB is ~20% to 50% of the database size.... this all depends on your environment.  For the Data warehouse, it depends on how large the warehouse is - but you will probably find steady state to require somewhere around 10% to 20% of the warehouse size.  Any time we are doing any additional grooming of an alert/event/perf storm.... or changing grooming from 400 days to 300 days - this will require a LOT more transaction log space - so keep that in mind as your databases grow.

Boosting OpsMgr performance - by reducing the OpsDB data retention


Here is a little tip I often advise my customers on.....

The default data retention in OpsMgr is 7 days for most data types:

 

[image]

 

These are default settings which work well for a large cross section of different agent counts.  In MOM 2005 - we defaulted to 4 days.  Many customers, especially with large agent counts, would have to reduce that in MOM 2005 down to 2 days to keep a manageable Onepoint DB size.

 

That being said - to boost UI performance, and reduce OpsDB database size - consider reducing these values down to your real business requirements.  For a new, out of the box management group - I advise my customers to set these to 2 days.  This will keep less noise in your database as you deploy, and tune, agents and management packs.  This keeps a smaller DB, and a more responsive UI, in large agent count environments.

Essentially - set each value to "2" except for Performance Signature, which we will change to 1.  Performance Signature is unique.... the setting here isn't actually "Days" of retention.  It is "business cycles".  This is for self-tuning thresholds ONLY.  This data is used for calculating business-cycle-based self-tuning thresholds.  There is NO REASON for this ever to be larger than the default of "2" business cycles.... and large agent count environments can see a performance benefit by bumping this down to only keeping "1" business cycle.

 

[image]

 

Then - once your Management group is fully deployed, and you have tuned your alert, performance, event, and state data.... IF you have a business requirement to keep this data for longer - bump it up.

Keep in mind - this will NOT cause you to groom out Alerts that are open - only closed alerts, and it will still keep your closed alerts around for a couple of days.

These settings have no impact on the data that is being written to the data warehouse - so any alert, event, or perf data needed will always be there.
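
If you want to confirm the new values took effect, the DaysToKeep column in the PartitionAndGroomingSettings table (described in my grooming posts above) should reflect what you set in the console.  Just a sketch:

SELECT ObjectName, DaysToKeep
FROM PartitionAndGroomingSettings
ORDER BY ObjectName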

Tuning tip – turning off some over-collection of events


We often think of tuning OpsMgr by way of tuning “Alert Noise”…. by disabling rules that generate alerts that we don't care about, or modifying thresholds on monitors to make the alert more actionable for our specific environment.

However – one area of OpsMgr that often goes overlooked, is event overcollection.  This has a cost… because these events are collected and create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially, DataWarehouse size bloat.  I have worked with customers who had a data warehouse that was over one third event data….. and they had ZERO requirement for this nor did they want it.  They were paying for disk storage, and backup expense, plus added time and resources on the framework, all for data they cared nothing about.

MOST of these events are enabled out of the box, and are default OpsMgr collection rules from the “System Center Core Monitoring” MP.  These events are items like “config requested”, “config delivered”, “new config active”.  They might be interesting, but there is no advanced analysis included to use these to detect a problem.  In small environments, they are not usually a big deal.  But in large agent count environments, these events can account for a LOT of data, and provide little value unless you are doing something advanced in analyzing them.  I have yet to see a customer who did that.

 

At a high level – here is how I like to review these events:

  1. Review the Most Common Events query that your OpsDB has.
  2. Create a “My Workspace” view for each event that has a HIGH event count.
  3. Examine the event details for value to YOU.
  4. View the rule that collected the event.
    1. Does the rule also alert or do anything special, or does it simply collect the event?
    2. Do you think the event is required for any special reporting you do?
  5. Create an Override, in an Override MP for the rule source management pack, to disable the rule.
  6. Continue to the next event in the query output, and evaluate it.

 

So, what I like to do – is to run the “Most Common Events” query against the OpsDB, and examine the top events, and consider disabling these event collection rules:

Most common events by event number and event publishername:

SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, Publishername as EventSource
FROM EventAllView eav with (nolock)
GROUP BY Number, Publishername
ORDER BY TotalEvents DESC

The trick is – to run this query periodically – and to examine the most common events for YOUR environment.  The easiest way to view these events – to determine their value – is to create a new Events view in My Workspace, for each event – and then look at the event data, and the rule that collected it:  (I will use a common event 21024 as an example:)

 

[image]

 

[image]

 

What we can see – is that this is a very typical event, and there is likely no real value for collecting and storing this event in the OpsDB or Warehouse.
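
If you prefer to spot-check the raw rows in SQL before (or instead of) building a My Workspace view, a query along these lines works – just a sketch, using event 21024 from the example above:

SELECT TOP 10 *
FROM EventAllView eav with (nolock)
WHERE Number = 21024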

Next – I will examine the rule.  I will look at the Data Source section, and the Response section.  The purpose here is to get a good idea of where this collection rule is looking, what events it is collecting, and if there is also an alert in the response section.  If there is an alert in the response section – I assume this is important, and will generally leave these rules enabled.

If the rule simply collected the event (no alerting), is not used in any reports that I know about (rare condition) and I have determined the event provides little to no value to me, I disable it.  You will find you can disable most of the top consumers in the database.

 

Here is why I consider it totally cool to disable these uninteresting event collection rules:

  • If they are really important – there will be a different alert-generating rule to fire an alert
  • They fill the databases, agent queues, agent load, and network traffic with unimportant information.
  • While troubleshooting a real issue – we would examine the agent event log – we wouldn’t search through the database for collected events.
  • Reporting on events is really slow – because we cannot aggregate them, so views and reports don't work well with events.
  • If we find we do need one later – simply remove the override.

 

Here is an example of this one:

[image]

 

So – I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.

 

Here are some very common event ID’s that I will generally end up disabling their corresponding event collection rules:

 

1206
1210
1215
1216
10102
10401
10403
10409
10457
10720
11771
21024
21025
21402
21403
21404
21405
29102
29103

 

I don't recommend everyone disable all of these rules… I recommend you periodically view your top 10 or 20 events… and then review them for value.  Just knocking out the top 10 events will often free up 90% of the space they were consuming.

The above events are the ones I run into in most of my customers… and I generally turn these off, as we get no value from them.  You might find you have some other events as your top consumers.  I recommend you review them in the same manner as above – methodically.  Then revisit this every month or two to see if anything changed.

I’d also love to hear if you have other events that you see as top consumers that aren't on my list above… SOME events are created from script (conversion MP’s) and unfortunately you cannot do much about those, because you would have to disable the script to fix them.  I’d be happy to give feedback on those, or add any new ones to my list.

Understanding and modifying Data Warehouse retention and grooming


You will likely find that the default retention in the OpsMgr data warehouse will need to be adjusted for your environment.  I often find customers are reluctant to adjust these – because they don't know what they want to keep.  So – they assume the defaults are good – and they just keep EVERYTHING. 

This is a bad idea. 

A data warehouse will often be one of the largest databases supported by a company.  Large databases cost money.  They cost money to support.  They are more difficult to maintain.  They cost more to back up in time, tape capacity, network impact, etc.  They take longer to restore in the case of a disaster.  The larger they get, the more they cost in hardware (disk space) to support.  The larger they get, the longer reports can take to complete.

For these reasons – you should give STRONG consideration to reducing your warehouse retention to your reporting REQUIREMENTS.  If you don't have any – MAKE SOME!

Originally – when the product released – you had to directly edit SQL tables to adjust this.  Then – a command line tool was released to adjust these values – making the process easier and safer.  This post is just going to be a walk through of this process to better understand using this tool – and what each dataset actually means.

Here is the link to the command line tool: 

http://blogs.technet.com/momteam/archive/2008/05/14/data-warehouse-data-retention-policy-dwdatarp-exe.aspx

 

Different data types are kept in the Data Warehouse in unique “Datasets”.  Each dataset represents a different data type (events, alerts, performance, etc..) and the aggregation type (raw, hourly, daily)

Not every customer will have exactly the same data sets.  This is because some management packs will add their own dataset – if that MP has something very unique that it will collect – that does not fit into the default “buckets” that already exist.

 

So – first – we need to understand the different datasets available – and what they mean.  All the datasets for an environment are kept in the “Dataset” table in the Warehouse database.

select * from dataset
order by DataSetDefaultName

This will show us the available datasets.  Common datasets are:

Alert data set
Client Monitoring data set
Event data set
Microsoft.Windows.Client.Vista.Dataset.ClientPerf
Microsoft.Windows.Client.Vista.Dataset.DiskFailure
Microsoft.Windows.Client.Vista.Dataset.Memory
Microsoft.Windows.Client.Vista.Dataset.ShellPerf
Performance data set
State data set

Alert, Event, Performance, and State are the most common ones we look at.

 

However – in the warehouse – we also keep different aggregations of some of the datasets – where it makes sense.  The most common datasets that we will aggregate are Performance data, State data, and Client Monitoring data (AEM).  The reason we have raw, hourly, and daily aggregations – is to be able to keep data for longer periods of time – but still have very good performance on running reports.

In MOM 2005 – we used to stick ALL the raw performance data into a single table in the Warehouse.  After a year of data was reached – this meant the perf table would grow to a HUGE size – and running multiple queries against this table would be impossible to complete with acceptable performance.  It also meant grooming this table would take forever, and would be prone to timeouts and failures.

In OpsMgr – now we aggregate this data into hourly and daily aggregations.  These aggregations allow us to “summarize” the performance, or state data, into MUCH smaller table sizes.  This means we can keep data for a MUCH longer period of time than ever before.  We also optimized this by splitting these into multiple tables.  When a table reaches a pre-determined size, or number of records – we will start a new table for inserting.  This allows grooming to be incredibly efficient – because now we can simply drop the old tables when all of the data in a table is older than the grooming retention setting.

 

Ok – that’s the background on aggregations.  To see this information – we will need to look at the StandardDatasetAggregation table.

select * from StandardDatasetAggregation

That table contains all the datasets, and their aggregation settings.  To help make more sense of this -  I will join the dataset and the StandardDatasetAggregation tables in a single query – to only show you what you need to look at:

SELECT DataSetDefaultName,
AggregationTypeId,
MaxDataAgeDays
FROM StandardDatasetAggregation sda
INNER JOIN dataset ds on ds.datasetid = sda.datasetid
ORDER BY DataSetDefaultName

This query will give us the common dataset name, the aggregation type, and the current maximum retention setting.

For the AggregationTypeId:

0 = Raw

20 = Hourly

30 = Daily

Here is my output:

DataSetDefaultName AggregationTypeId MaxDataAgeDays
Alert data set 0 400
Client Monitoring data set 0 30
Client Monitoring data set 30 400
Event data set 0 100
Microsoft.Windows.Client.Vista.Dataset.ClientPerf 0 7
Microsoft.Windows.Client.Vista.Dataset.ClientPerf 30 91
Microsoft.Windows.Client.Vista.Dataset.DiskFailure 0 7
Microsoft.Windows.Client.Vista.Dataset.DiskFailure 30 182
Microsoft.Windows.Client.Vista.Dataset.Memory 0 7
Microsoft.Windows.Client.Vista.Dataset.Memory 30 91
Microsoft.Windows.Client.Vista.Dataset.ShellPerf 0 7
Microsoft.Windows.Client.Vista.Dataset.ShellPerf 30 91
Performance data set 0 10
Performance data set 20 400
Performance data set 30 400
State data set 0 180
State data set 20 400
State data set 30 400
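
If it helps to read that output, here is a variation of the same query that decodes the AggregationTypeId column into its name – just a sketch using the values listed above:

SELECT ds.DataSetDefaultName,
       CASE sda.AggregationTypeId
            WHEN 0 THEN 'Raw'
            WHEN 20 THEN 'Hourly'
            WHEN 30 THEN 'Daily'
       END AS AggregationType,
       sda.MaxDataAgeDays
FROM StandardDatasetAggregation sda
INNER JOIN dataset ds ON ds.datasetid = sda.datasetid
ORDER BY ds.DataSetDefaultName, sda.AggregationTypeId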

 

You will probably notice – that we only keep 10 days of RAW Performance by default.  Generally – you don't want to mess with this.  This is simply to keep a short amount of raw data – to build our hourly and daily aggregations from.  All built-in performance reports in SCOM run from Hourly or Daily aggregations by default.

 

Now we are cooking!

Fortunately – there is a command line tool published that will help make changes to these retention periods, and provide more information about how much data we have currently.  This tool is called DWDATARP.EXE.  It is available for download at the link above.

This gives us a nice way to view the current settings.  Download this to your tools machine, your RMS, or directly on your warehouse machine.  Run it from a command line.

Run just the tool with no parameters to get help:    

C:\>dwdatarp.exe

To get our current settings – run the tool with ONLY the –s (server\instance) and –d (database) parameters.  This will output the current settings.  However – it does not format well to the screen – so output it to a TXT file and open it:

C:\>dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW > c:\dwoutput.txt

Here is my output (I removed some of the vista/client garbage for brevity)

 

Dataset name Aggregation name Max Age Current Size, Kb
Alert data set Raw data 400 18,560 ( 1%)
Client Monitoring data set Raw data 30 0 ( 0%)
Client Monitoring data set Daily aggregations 400 16 ( 0%)
Configuration dataset Raw data 400 153,016 ( 4%)
Event data set Raw data 100 1,348,168 ( 37%)
Performance data set Raw data 10 467,552 ( 13%)
Performance data set Hourly aggregations 400 1,265,160 ( 35%)
Performance data set Daily aggregations 400 61,176 ( 2%)
State data set Raw data 180 13,024 ( 0%)
State data set Hourly aggregations 400 305,120 ( 8%)
State data set Daily aggregations 400 20,112 ( 1%)

 

Right off the bat – I can see how little data that daily performance actually consumes.  I can see how much data that only 10 days of RAW perf data consume.  I also see a surprising amount of event data consuming space in the database.  Typically – you will see that perf hourly will consume the most space in a warehouse.

 

So – with this information in hand – I can do two things….

  • I can know what is using up most of the space in my warehouse.
  • I can know the Dataset name, and Aggregation name… to input to the command line tool to adjust it!

 

Now – on to the retention adjustments.

 

First thing – I will need to gather my Reporting service level agreement from management.  This is my requirement for how long I need to keep data for reports.  I also need to know “what kind” of reports they want to be able to run for this period.

From this discussion with management – we determined:

  • We require detailed performance reports for 90 days (hourly aggregations)
  • We require less detailed performance reports (daily aggregations) for 1 year for trending and capacity planning.
  • We want to keep a record of all ALERTS for 6 months.
  • We don't use any event reports, so we can reduce this retention from 100 days to 30 days.
  • We don't use AEM (Client Monitoring Dataset) so we will leave this unchanged.
  • We don't report on state changes much (if any) so we will set all of these to 90 days.

Now I will use the DWDATARP.EXE tool – to adjust these values based on my company reporting SLA:

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Daily aggregations" -m 365

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Alert data set" -a "Raw data" -m 180

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Event data set" -a "Raw Data" -m 30

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Raw data" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Daily aggregations" -m 90

 

Now my table reflects my reporting SLA – and my actual space needed in the warehouse will be much reduced in the long term:

 

Dataset name Aggregation name Max Age Current Size, Kb
Alert data set Raw data 180 18,560 ( 1%)
Client Monitoring data set Raw data 30 0 ( 0%)
Client Monitoring data set Daily aggregations 400 16 ( 0%)
Configuration dataset Raw data 400 152,944 ( 4%)
Event data set Raw data 30 1,348,552 ( 37%)
Performance data set Raw data 10 468,960 ( 13%)
Performance data set Hourly aggregations 90 1,265,992 ( 35%)
Performance data set Daily aggregations 365 61,176 ( 2%)
State data set Raw data 90 13,024 ( 0%)
State data set Hourly aggregations 90 305,120 ( 8%)
State data set Daily aggregations 90 20,112 ( 1%)

 

Here are some general rules of thumb (might be different if your environment is unique)

  • Only keep the maximum retention of data in the warehouse per your reporting requirements.
  • Do not modify the performance RAW dataset.
  • Most performance reports are run against Perf Hourly data for detail performance throughout the day.  For reports that span long periods of time (weeks/months) you should generally use Daily aggregation.
  • Daily aggregations should generally be kept for the same retention as hourly – or longer.
  • Hourly datasets use up much more space than daily aggregations.
  • Most people don't use events in reports – and these can often be groomed much sooner than the default of 100 days.
  • Most people don't do a lot of state reporting beyond 30 days, and these can be groomed much sooner as well if desired.
  • Don't modify a setting if you don't use it.  There is no need.
  • The Configuration dataset generally should not be modified.  This keeps data about objects to report on, in the warehouse.  It should be set to at LEAST the longest of any perf, alert, event, or state datasets that you use for reporting.

How grooming and auto-resolution work in the OpsMgr 2007 Operational database

$
0
0

How Grooming and Auto-Resolution works in the OpsMgr 2007 Operations DB

 

 

Warning – don’t read this if you are bored easily. 

 

 

In a simplified view to groom alerts…..

 

Grooming of the ops DB is called once per day at 12:00am…. by the rule:  “Partitioning and Grooming  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted to the “Root Management Server” and is part of the System Center Internal Library.

 

It calls the “p_PartitioningAndGrooming” stored procedure, which calls p_Grooming, which calls p_GroomNonPartitionedObjects (Alerts are not partitioned) which inspects the PartitionAndGroomingSettings table… and executes each stored procedure.  The Alerts stored procedure in that table is referenced as p_AlertGrooming which has the following sql statement:

 

    SELECT AlertId INTO #AlertsToGroom

    FROM dbo.Alert

    WHERE TimeResolved ISNOTNULL

    AND TimeResolved < @GroomingThresholdUTC

    AND ResolutionState = 255

 

So…. the criteria for what is groomed is pretty simple:  In a resolution state of “Closed” (255) and older than the 7 day default setting (or your custom setting referenced in the table above)

 

We won’t groom any alerts that are in New (0), or any custom resolution-states (custom ID #).  Those will have to be set to “Closed” (255)…. either by autoresolution of a monitor returning to healthy, direct user interaction, our built in autoresolution mechanism, or your own custom script.

 

Ok – that covers grooming.

 

However – I can see that brings up the question – how does auto-resolution work?

 

 

 

 

That specifically states “alerts in the new resolution state”.  I don’t think that is completely correct:

 

That is called upon by the rule “Alert Auto Resolve Execute All” which runs p_AlertAutoResolveExecuteAll once per day at 4:00am.  This calls p_AlertAutoResolve twice…. once with a variable of “0” and once with a “1”.

 

Here is the sql statement:

 

IF(@AutoResolveType = 0)

    BEGIN

        SELECT @AlertResolvePeriodInDays = [SettingValue]

        FROM dbo.[GlobalSettings]

        WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_HealthyAlertAutoResolvePeriod()

 

        SET @AutoResolveThreshold =DATEADD(dd,-@AlertResolvePeriodInDays,getutcdate())

        SET @RootMonitorId = dbo.fn_ManagedTypeId_SystemHealthEntityState()

   

        -- We will resolve all alerts that have green state and are un-resolved

        -- and haven't been modified for N number of days.

        INSERTINTO @AlertsToBeResolved

        SELECT A.[AlertId]

        FROM dbo.[Alert] A

        JOIN dbo.[State] S

            ON A.[BaseManagedEntityId] = S.[BaseManagedEntityId] AND S.[MonitorId] = @RootMonitorId

        WHERE A.[LastModified] < @AutoResolveThreshold

        AND A.[ResolutionState] <> 255

        AND S.[HealthState] = 1

 

<snip>

 

    ELSEIF(@AutoResolveType = 1)

    BEGIN

        SELECT @AlertResolvePeriodInDays = [SettingValue]

        FROM dbo.[GlobalSettings]

        WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_AlertAutoResolvePeriod()

 

        SET @AutoResolveThreshold =DATEADD(dd,-@AlertResolvePeriodInDays,getutcdate())

 

        -- We will resolve all alerts that are un-resolved

        -- and haven't been modified for N number of days.

        INSERTINTO @AlertsToBeResolved

        SELECT A.[AlertId]

        FROM dbo.[Alert] A

        WHERE A.[LastModified] < @AutoResolveThreshold

        AND ResolutionState <> 255

 

 

So we are basically checking that Resolution state <> 255….. not specifically “New” (0) as we would lead you to believe by the wording in the interface.  There are simply two types of auto-resolution:  Resolve all alerts where the object has returned to a healthy state in “N” days….. and Resolve all alerts no matter what, as long as they haven’t been modified in “N” days.

Failed tasks aren't groomed from the Operational Database

$
0
0

This appears to be present up to RC-SP1 version, build 6.0.6246.0

 

In the Task Status console view - I noticed an old failed task from 2 months ago..... however, my task grooming is set to 7 days.

 

To view the grooming process:

http://blogs.technet.com/kevinholman/archive/2007/12/13/how-grooming-and-auto-resolution-work-in-the-opsmgr-2007-operational-database.aspx

Basically – select * from PartitionAndGroomingSettings will show you all grooming going on.

Tasks are kept in the jobstatus table.

Select * from jobstatus will show all tasks.

p_jobstatusgrooming is called to groom this table.

Here is the text of that SP:

--------------------------------

USE [OperationsManager]

GO

/****** Object:  StoredProcedure [dbo].[p_JobStatusGrooming]    Script Date: 02/05/2008 10:49:32 ******/

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

ALTER PROCEDURE [dbo].[p_JobStatusGrooming]

AS

BEGIN

SET NOCOUNT ON

DECLARE @Err int

DECLARE @Ret int

DECLARE @RowCount int

DECLARE @SaveTranCount int

DECLARE @GroomingThresholdLocal datetime

DECLARE @GroomingThresholdUTC datetime

DECLARE @TimeGroomingRan datetime

DECLARE @MaxTimeGroomed datetime

SET @SaveTranCount = @@TRANCOUNT

SET @TimeGroomingRan = getutcdate()

SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())

FROM dbo.PartitionAndGroomingSettings

WHERE ObjectName = 'JobStatus'

EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT

IF (@@ERROR <> 0)

BEGIN

GOTO Error_Exit

END

-- Selecting the max time to be groomed to update the table

SELECT @MaxTimeGroomed = MAX(LastModified)

FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC  

IF @MaxTimeGroomed IS NULL

GOTO Success_Exit

BEGIN TRAN

-- Change the Statement below to reflect the new item

-- that needs to be groomed

DELETE FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC

SET @Err = @@ERROR

IF (@Err <> 0)

BEGIN

GOTO Error_Exit

END

UPDATE dbo.PartitionAndGroomingSettings

SET GroomingRunTime = @TimeGroomingRan,

        DataGroomedMaxTime = @MaxTimeGroomed

WHERE ObjectName = 'JobStatus'

SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

IF (@Err <> 0 OR @RowCount <> 1)

BEGIN

GOTO Error_Exit

END

COMMIT TRAN

Success_Exit:

RETURN 0

Error_Exit:

-- If there was an error and there is a transaction

-- pending, rollback.

IF (@@TRANCOUNT > @SaveTranCount)

ROLLBACK TRAN

RETURN 1

END

------------------------------------

 

 

Here is the problem in the SP:

 

DELETE FROM dbo.JobStatus

WHERE TimeFinished IS NOT NULL

AND LastModified < @GroomingThresholdUTC

 

 

We only delete (groom) tasks that have a timestamp in TimeFinished.  If a failed task doesn’t finish – this field will be NULL and never gets groomed.

Print Server management pack fills the Operational DB with TONS of perf data

$
0
0

This is something I have noticed in MOM 2005, and seems to be the same in the conversion MP for OpsMgr 2007.  (Version 6.0.5000.0 of the Microsoft.Windows.Server.Printserver (Converted) MP).  When you import this MP, it will fill the Operational and reporting databases with performance data about print jobs and queues, if you have a large number of print servers/queues in your environment.

If reporting on this perf data is not critical to your environment, you should disable these rules:

clip_image002

Grooming process in the Operations Database

$
0
0

This is a continuation of my other post, on general alert grooming:

How grooming and auto-resolution work in the OpsMgr 2007 Operational database

 

Grooming of the OpsDB is called once per day at 12:00am…. by the rule:  “Partitioning and Grooming” You can search for this rule in the Authoring space of the console, under Rules. It is targeted to the “Root Management Server” and is part of the System Center Internal Library.

image

 

It calls the “p_PartitioningAndGrooming” stored procedure.  This SP calls two other SP's:  p_Partitioning and then p_Grooming

p_Partitioning inspects the table PartitionAndGroomingSettings, and then calls the SP p_PartitionObject for each object in the PartitionAndGroomingSettings table where "IsPartitioned = 1"   (note - we partition event and perf into 61 daily tables - just like MOM 2005)

The PartitionAndGroomingSettings table:

image

 

The p_PartitionObject SP first identifies the next partition in the sequence, truncates it to make sure it is empty, and then updates the PartitionTables table in the database, to update the IsCurrent field to the next numeric table for events and perf.  Then it calls the p_PartitionAlterInsertView sproc, to make new data start writing to the current event and perf table.

To review which tables you are writing to - execute the following query:   select * from partitiontables where IsCurrent = '1'

A select * from partitiontables will show you all 61 event and perf tables, and when they were used.  You should see a PartitionStartTime updated every day - around midnight (time is stored in UTC in the database).  If partitioning is failing to run, then we wont see this date changing every day.  

 

Ok - that's the first step of the p_PartitioningAndGrooming sproc - Partitioning.  Now - if that is all successful, we will start grooming!

The p_Grooming is called after partitioning is successful.  One of the first things it does - is to update the InternalJobHistory table.  In this able - we keep a record of all partitioning and grooming jobs.  It is a good spot check to see what's going on with grooming.  To have a peek at this table - execute a select * from InternalJobHistory order by InternalJobHistoryId

image

 

The p_Grooming sproc then calls p_GroomPartitionedObjects  This sproc will first examine the PartitionAndGroomingSettings and compare the days to keep column, against the current date, to figure out how many partitions to groom.  It will then inspect the partitions to ensure they have data, and then truncate the partition, by calling p_PartitionTruncate.  The p_GroomPartitionedObjects sproc will then update the PartitionAndGroomingSettings table with the current time, under the GroomingRunTime column. 

Next - the p_Grooming sproc continues, by calling p_GroomNonPartitionedObjects.  p_GroomNonPartitionedObjects is a short, but complex sproc - in that is calls all the individual sprocs listed in the PartitionAndGroomingSettings table where IsPartitioned = 0.  (see my other post at the link above to follow the logic of one of these non-partitioned sprocs)

Next - the p_Grooming sproc continues, by updating the InternalJobHistory table, to give it a status of success (StatusCode of 1 = success, 2= failed, 0 appears to be never completed?)

 

If you ever have a problem with grooming - or need to get your OpsDB database size under control - simply reduce the data retention days, in the console, under Administration, Settings, Database Grooming.  To start with - I recommend setting all these to just 2 days, fromt he default of 7.  This keeps your OpsDB under control until you have time to tune all the noise fromt he MP's you import.  So just reduce this number, then open up query analyzer, and execute p_PartitioningAndGrooming  When it is done, check the job status by executing select * from InternalJobHistory order by InternalJobHistoryId   The last groom job should be present, and successful.  The OpsDB size should be smaller, with more free space.  And to validate, you can always run my large table query, found at:   Useful Operations Manager 2007 SQL queries

What SQL maintenance should I perform on my OpsMgr databases?

$
0
0

This question comes up a lot.  The answer is really - not what maintenance you should be performing... but what maintenance you should be *excluding*.... or when.  Here is why:

Most SQL DBA's will set up some pretty basic default maintenance on all SQL DB's they support.  This often includes, but is not limited to:

DBCC CHECKDB  (to look for DB errors and report on them)

UPDATE STATISTICS  (to boost query performance)

DBCC DBREINDEX  (to rebuild the table indexes to boost performance)

BACKUP

SQL DBA's might schedule these to run via the SQL Agent to execute nightly, weekly, or some combination of the above depending on DB size and requirements.

On the other side of the coin.... in some companies, the MOM/OpsMgr team installs and owns the SQL server.... and they dont do ANY default maintenance to SQL.  Because of this - a focus in OpsMgr was to have the Ops DB and Datawarehouse DB to be fully self-maintaining.... providing a good level of SQL performance whether or not any default maintenance was being done.

Operational Database:

Reindexing is already taking place against the OperationsManager database for some of the tables.  This is built into the product.  What we need to ensure - is that any default DBA maintenance tasks are not redundant nor conflicting with our built-in maintenance, and our built-in schedules:

There is a rule in OpsMgr that is targeted at the Root Management Server:

image

The rule executes the "p_OptimizeIndexes" stored procedure, every day at 2:30AM:

image

image

This rule cannot be changed or modified.  Therefore - we need to ensure there is not other SQL maintenance (including backups) running at 2:30AM, or performance will be impacted.

If you want to view the built-in UPDATE STATISTICS and DBCC DBREINDEX jobs history - just run the following queries:

select *
from DomainTable dt
inner join DomainTableIndexOptimizationHistory dti
on dt.domaintablerowID = dti.domaintableindexrowID
ORDER BY optimizationdurationseconds DESC

select *
from DomainTable dt
inner join DomainTableStatisticsUpdateHistory dti
on dt.domaintablerowID = dti.domaintablerowID
ORDER BY UpdateDurationSeconds DESC

Take note of the update/optimization duration seconds column.  This will show you how long your maintenance is typically running.  In a healthy environment these should not take very long.

 

If you want to view the fragmentation levels of the current tables in the database, run:

DBCC SHOWCONTIG WITH FAST

Here is some sample output:

----------------------------------------------------------------------------------------------

DBCC SHOWCONTIG scanning 'Alert' table...
Table: 'Alert' (1771153355); index ID: 1, database ID: 5
TABLE level scan performed.
- Pages Scanned................................: 936
- Extent Switches..............................: 427
- Scan Density [Best Count:Actual Count].......: 27.34% [117:428]
- Logical Scan Fragmentation ..................: 60.90%

----------------------------------------------------------------------------------------------

In general - we would like the "Scan density" to be high (Above 80%), and the "Logical Scan Fragmentation" to be low (below 30%).  What you might find... is that *some* of the tables are more fragmented than others, because our built-in maintenance does not reindex all tables.  Especially tables like the raw perf, event, and localizedtext tables.

That said - there is nothing wrong with running a DBA's default maintenance against the Operational database..... reindexing these tables in the database might also help console performance.  We just dont want to run any DBA maintenance during the same time that we run our own internal maintenance, so try not to conflict with this schedule.  Care should also be taken in any default DBA maintenance, that it does not run too long, or impact normal operations of OpsMgr.  Maintenance jobs should be monitored, and should not conflict with the backup schedules as well.

Here is a reindex job you can schedule with SQL agent.... for the OpsDB:

USE OperationsManager
go
SET ANSI_NULLS ON
SET ANSI_PADDING ON
SET ANSI_WARNINGS ON
SET ARITHABORT ON
SET CONCAT_NULL_YIELDS_NULL ON
SET QUOTED_IDENTIFIER ON
SET NUMERIC_ROUNDABORT OFF
EXEC SP_MSForEachTable "Print 'Reindexing '+'?' DBCC DBREINDEX ('?')"

 

Data Warehouse Database:

The data warehouse DB is also fully self maintaining.  This is called out by a rule "Standard Data Warehouse Data Set maintenance rule" which is targeted to the "Standard Data Set" object type.  This stored procedure is called on the data warehouse every 60 seconds.  It performs many, many tasks, of which Index optimization is but one.

image

This SP calls the StandardDatasetOptimize stored procedure, which handles any index operations.

To examine the index and statistics history - run the following query for the Alert, Event, Perf, and State tables:

 

select basetablename, optimizationstartdatetime, optimizationdurationseconds,
      beforeavgfragmentationinpercent, afteravgfragmentationinpercent,
      optimizationmethod, onlinerebuildlastperformeddatetime
from StandardDatasetOptimizationHistory sdoh
inner join StandardDatasetAggregationStorageIndex sdasi
on sdoh.StandardDatasetAggregationStorageIndexRowId = sdasi.StandardDatasetAggregationStorageIndexRowId
inner join StandardDatasetAggregationStorage sdas
on sdasi.StandardDatasetAggregationStorageRowId = sdas.StandardDatasetAggregationStorageRowId
ORDER BY optimizationdurationseconds DESC

 

Then examine the default domain tables optimization history.... run the same two queries as listed above for the OperationsDB.

In the data warehouse - we can see that all the necessary tables are being updated and reindexed as needed.  When a table is 10% fragmented - we reorganize.  When it is 30% or more, we rebuild the index.

Therefore - there is no need for a DBA to execute any UPDATE STATISTICS or DBCC DBREINDEX maintenance against this database.  Furthermore, since we run our maintenance every 60 seconds, and only execute maintenance when necessary, there is no "set window" where we will run our maintenance jobs.  This means that if a DBA team also sets up an UPDATE STATISTICS or DBCC DBREINDEX job - it can conflict with our jobs and execute concurrently.  This should not be done. 
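If you are curious which method the built-in maintenance has actually been choosing, here is a small sketch that summarizes the same optimization history table used in the query above (no new columns assumed beyond those already shown):

-- Summary of reorganize vs. rebuild decisions made by the built-in DW maintenance
select optimizationmethod,
       COUNT(*) as OptimizationCount,
       AVG(beforeavgfragmentationinpercent) as AvgFragBefore,
       AVG(afteravgfragmentationinpercent) as AvgFragAfter
from StandardDatasetOptimizationHistory
group by optimizationmethod
order by OptimizationCount DESC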

 

For the above reasons, I would recommend against any maintenance jobs on the Data Warehouse DB, beyond a CHECKDB (only if DBA's mandate it) and a good backup schedule. 

 

For the OpsDB: any standard maintenance is fine, as long as it does not conflict with the built-in maintenance, or impact production by taking too long, or having an impact on I/O.

 

Lastly - I'd like to discuss the recovery model of the SQL database.  We default to "simple" for all our DB's.  This should be left alone.... unless you have *very* specific reasons to change it.  Some SQL teams automatically assume all databases should be set to the "full" recovery model.  This requires that they back up the transaction logs on a very regular basis, but gives the added advantage of restoring up to the time of the last t-log backup.  For OpsMgr, this is of very little value, as the data changing on an hourly basis is of little value compared to the complexity added by moving from simple to full.  Also, changing to full means that your transaction logs will only truncate once a t-log backup is performed.  What I have seen is that many companies aren't prepared for the amount of data written to these databases.... and their standard transaction log backups (often hourly) are not frequent enough to keep the logs from filling.  The only valid reason to change to FULL, in my opinion, is when you are using an advanced replication strategy, like log shipping, which requires the full recovery model.  When in doubt - keep it simple.  :-)

 

 

P.S....  The Operations Database needs 50% free space at all times.  This is for growth, and for re-index operations to be successful.  This is a general supportability recommendation, but the OpsDB will alert when this falls below 40%. 

For the Data warehouse.... we do not require the same 50% free space.  This would be a tremendous requirement if we had a multiple-terabyte database!

Think of the data warehouse as having 2 stages... a "growth" stage (while it is adding data and not yet grooming much, because it hasn't hit the default 400 days of retention) and a "maturity" stage, where agent count is steady, MP's are not changing, and grooming is happening because we are at 400 days of retention.  During "growth" we need to watch and maintain free space, and monitor for available disk space.  In "maturity" we only need enough free space to handle our index operations.  When you start talking about 1 terabyte of data.... keeping 50% free would mean 500GB of free space, which is expensive.  If you cannot allocate it.... then just allow auto-grow and monitor the database.... but always plan for it from a volume size perspective.

For transaction log sizing - we don't have any hard rules.  A good rule of thumb for the OpsDB is ~20% to 50% of the database size.... this all depends on your environment.  For the Data warehouse, it depends on how large the warehouse is - but you will probably find steady state to require somewhere around 10% to 20% of the warehouse size.  Any time we are doing additional grooming after an alert/event/perf storm.... or changing grooming from 400 days to 300 days - this will require a LOT more transaction log space - so keep that in mind as your databases grow.
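To keep an eye on the free space and transaction log numbers above, here is a quick sketch using standard SQL Server commands (sp_spaceused and DBCC SQLPERF) - the 50%/40% and log sizing figures are the guidance above, not something these commands enforce:

USE OperationsManager
go
-- Database size and unallocated (free) space for the current database
EXEC sp_spaceused
go
-- Percent of transaction log space in use, for every database on the instance
DBCC SQLPERF(LOGSPACE)
go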

Boosting OpsMgr performance - by reducing the OpsDB data retention


Here is a little tip I often advise my customers on.....

The default data retention in OpsMgr is 7 days for most data types:

 

image

 

These are default settings which work well for a large cross section of different agent counts.  In MOM 2005 - we defaulted to 4 days.  Many customers, especially with large agent counts, would have to reduce that in MOM 2005 down to 2 days to keep a manageable Onepoint DB size.

 

That being said - to boost UI performance, and reduce OpsDB database size - consider reducing these values down to your real business requirements.  For a new, out of the box management group - I advise my customers to set these to 2 days.  This will keep less noise in your database as you deploy, and tune, agents and management packs.  This keeps a smaller DB, and a more responsive UI, in large agent count environments.

Essentially - set each value to "2" except for Performance Signature, which we will change to 1.  Performance Signature is unique.... the setting here isn't actually "Days" of retention.  It is "business cycles".  This data is used ONLY for calculating business-cycle-based self-tuning thresholds.  There is NO REASON for this ever to be larger than the default of "2" business cycles.... and large agent count environments can see a performance benefit by bumping this down to only keeping "1" business cycle.

 

image

 

Then - once your Management group is fully deployed, and you have tuned your alert, performance, event, and state data.... IF you have a business requirement to keep this data for longer - bump it up.

Keep in mind - this will NOT cause you to groom out Alerts that are open - only closed alerts, and still will keep your closed alerts around for a couple days.

These settings have no impact on the data that is being written to the data warehouse - so any alert, event, or perf data needed will always be there.
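If you want to confirm what retention the OpsDB is actually using after changing these console settings, the values are stored in the PartitionAndGroomingSettings table in the OperationsManager database - the same table the grooming stored procedures read (covered in the grooming deep dive post below):

-- Current OpsDB retention (days to keep) and last grooming run time for each data type
select * from PartitionAndGroomingSettings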

Tuning tip – turning off some over-collection of events


We often think of tuning OpsMgr by way of tuning “Alert Noise”…. by disabling rules that generate alerts that we don't care about, or modifying thresholds on monitors to make the alert more actionable for our specific environment.

However – one area of OpsMgr that often goes overlooked, is event overcollection.  This has a cost… because these events are collected and create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially, DataWarehouse size bloat.  I have worked with customers who had a data warehouse that was over one third event data….. and they had ZERO requirement for this nor did they want it.  They were paying for disk storage, and backup expense, plus added time and resources on the framework, all for data they cared nothing about.

MOST of these events are enabled out of the box, and are default OpsMgr event collection rules from the “System Center Core Monitoring” MP.  These events are items like “config requested”, “config delivered”, and “new config active”.  They might be interesting, but there is no advanced analysis included to use these to detect a problem.  In small environments, they are not usually a big deal.  But in large agent count environments, these events can account for a LOT of data, and provide little value unless you are doing something advanced in analyzing them.  I have yet to see a customer who did that.

 

At a high level – here is how I like to review these events:

  1. Review the Most Common Events query that your OpsDB has.
  2. Create a “My Workspace” view for each event that has a HIGH event count.
  3. Examine the event details for value to YOU.
  4. View the rule that collected the event.
    1. Does the rule also alert or do anything special, or does it simply collect the event?
    2. Do you think the event is required for any special reporting you do?
  5. Create an Override, in an Override MP for the rule source management pack, to disable the rule.
  6. Continue to the next event in the query output, and evaluate it.

 

So, what I like to do – is to run the “Most Common Events” query against the OpsDB, and examine the top events, and consider disabling these event collection rules:

Most common events by event number and event publishername:

SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, Publishername as EventSource
FROM EventAllView eav with (nolock)
GROUP BY Number, Publishername
ORDER BY TotalEvents DESC
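A companion query I find useful during the same review breaks the totals down by the computer that logged the event, so you can tell whether one noisy system or the whole environment is generating the volume.  This assumes the LoggingComputer column in EventAllView, which is what I see in my environments:

SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, LoggingComputer as ComputerName
FROM EventAllView eav with (nolock)
GROUP BY Number, LoggingComputer
ORDER BY TotalEvents DESC

Either query works for the periodic review described next.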

The trick is – to run this query periodically – and to examine the most common events for YOUR environment.  The easiest way to view these events – to determine their value – is to create a new Events view in My Workspace, for each event – and then look at the event data, and the rule that collected it:  (I will use a common event 21024 as an example:)

 

image

 

image

 

What we can see – is that this is a very typical event, and there is likely no real value for collecting and storing this event in the OpsDB or Warehouse.

Next – I will examine the rule.  I will look at the Data Source section, and the Response section.  The purpose here is to get a good idea of where this collection rule is looking, what events it is collecting, and if there is also an alert in the response section.  If there is an alert in the response section – I assume this is important, and will generally leave these rules enabled.

If the rule simply collected the event (no alerting), is not used in any reports that I know about (rare condition) and I have determined the event provides little to no value to me, I disable it.  You will find you can disable most of the top consumers in the database.

 

Here is why I consider it totally cool to disable these uninteresting event collection rules:

  • If they are really important – there will be a different alert-generating rule to fire an alert
  • They fill the databases, agent queues, agent load, and network traffic with unimportant information.
  • While troubleshooting a real issue – we would examine the agent event log – we wouldn’t search through the database for collected events.
  • Reporting on events is really slow – because we cannot aggregate them, so views and reports don't work well with events.
  • If we find we do need one later – simply remove the override.

 

Here is an example of this one:

image

 

So – I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.

 

Here are some very common event ID’s that I will generally end up disabling their corresponding event collection rules:

 

1206
1210
1215
1216
10102
10401
10403
10409
10457
10720
11771
21024
21025
21402
21403
21404
21405
29102
29103

 

I don't recommend everyone disable all of these rules… I recommend you periodically view your top 10 or 20 events… and then review them for value.  Just knocking out the top 10 events will often free up 90% of the space they were consuming.

The above events are the ones I run into in most of my customers… and I generally turn these off, as we get no value from them.  You might find you have some other events as your top consumers.  I recommend you review them in the same manner as above – methodically.  Then revisit this every month or two to see if anything changed.

I’d also love to hear if you have other events that you see as top consumers that aren't on my list above… SOME events are created from script (conversion MP’s) and unfortunately you cannot do much about those, because you would have to disable the script to fix them.  I’d be happy to give feedback on those, or add any new ones to my list.


Understanding and modifying Data Warehouse retention and grooming


You will likely find that the default retention in the OpsMgr data warehouse will need to be adjusted for your environment.  I often find customers are reluctant to adjust these – because they don't know what they want to keep.  So – they assume the defaults are good – and they just keep EVERYTHING. 

This is a bad idea. 

A data warehouse will often be one of the largest databases supported by a company.  Large databases cost money.  They cost money to support.  They are more difficult to maintain.  They cost more to backup in time, tape capacity, network impact, etc.  They take longer to restore in the case of a disaster.  The larger they get, the more they cost in hardware (disk space) to support them.  And the larger they get, the longer reports can take to complete.

For these reasons – you should give STRONG consideration to reducing your warehouse retention to your reporting REQUIREMENTS.  If you don't have any – MAKE SOME!

Originally – when the product released – you had to directly edit SQL tables to adjust this.  Then – a command line tool was released to adjust these values – making the process easier and safer.  This post is just going to be a walk through of this process to better understand using this tool – and what each dataset actually means.

Here is the link to the command line tool: 

http://blogs.technet.com/momteam/archive/2008/05/14/data-warehouse-data-retention-policy-dwdatarp-exe.aspx

 

Different data types are kept in the Data Warehouse in unique “Datasets”.  Each dataset represents a different data type (events, alerts, performance, etc..) and the aggregation type (raw, hourly, daily)

Not every customer will have exactly the same data sets.  This is because some management packs will add their own dataset – if that MP has something very unique that it will collect – that does not fit into the default “buckets” that already exist.

 

So – first – we need to understand the different datasets available – and what they mean.  All the datasets for an environment are kept in the “Dataset” table in the Warehouse database.

select * from dataset
order by DataSetDefaultName

This will show us the available datasets.  Common datasets are:

Alert data set
Client Monitoring data set
Event data set
Microsoft.Windows.Client.Vista.Dataset.ClientPerf
Microsoft.Windows.Client.Vista.Dataset.DiskFailure
Microsoft.Windows.Client.Vista.Dataset.Memory
Microsoft.Windows.Client.Vista.Dataset.ShellPerf
Performance data set
State data set

Alert, Event, Performance, and State are the most common ones we look at.

 

However – in the warehouse – we also keep different aggregations of some of the datasets – where it makes sense.  The most common datasets that we will aggregate are Performance data, State data, and Client Monitoring data (AEM).  The reason we have raw, hourly, and daily aggregations – is to be able to keep data for longer periods of time – but still have very good performance on running reports.

In MOM 2005 – we used to stick ALL the raw performance data into a single table in the Warehouse.  After a year of data was reached – this meant the perf table would grow to a HUGE size – and running multiple queries against this table would be impossible to complete with acceptable performance.  It also meant grooming this table would take forever, and would be prone to timeouts and failures.

In OpsMgr – now we aggregate this data into hourly and daily aggregations.  These aggregations allow us to “summarize” the performance, or state data, into MUCH smaller table sizes.  This means we can keep data for a MUCH longer period of time than ever before.  We also optimized this by splitting these into multiple tables.  When a table reaches a pre-determined size, or number of records – we will start a new table for inserting.  This allows grooming to be incredibly efficient – because now we can simply drop the old tables when all of the data in a table is older than the grooming retention setting.

 

Ok – that’s the background on aggregations.  To see this information – we will need to look at the StandardDatasetAggregation table.

select * from StandardDatasetAggregation

That table contains all the datasets, and their aggregation settings.  To help make more sense of this -  I will join the dataset and the StandardDatasetAggregation tables in a single query – to only show you what you need to look at:

SELECT DataSetDefaultName,
AggregationTypeId,
MaxDataAgeDays
FROM StandardDatasetAggregation sda
INNER JOIN dataset ds on ds.datasetid = sda.datasetid
ORDER BY DataSetDefaultName

This query will give us the common dataset name, the aggregation type, and the current maximum retention setting.

For the AggregationTypeId:

0 = Raw

20 = Hourly

30 = Daily

Here is my output:

DataSetDefaultName                                    AggregationTypeId   MaxDataAgeDays
Alert data set                                        0                   400
Client Monitoring data set                            0                   30
Client Monitoring data set                            30                  400
Event data set                                        0                   100
Microsoft.Windows.Client.Vista.Dataset.ClientPerf     0                   7
Microsoft.Windows.Client.Vista.Dataset.ClientPerf     30                  91
Microsoft.Windows.Client.Vista.Dataset.DiskFailure    0                   7
Microsoft.Windows.Client.Vista.Dataset.DiskFailure    30                  182
Microsoft.Windows.Client.Vista.Dataset.Memory         0                   7
Microsoft.Windows.Client.Vista.Dataset.Memory         30                  91
Microsoft.Windows.Client.Vista.Dataset.ShellPerf      0                   7
Microsoft.Windows.Client.Vista.Dataset.ShellPerf      30                  91
Performance data set                                  0                   10
Performance data set                                  20                  400
Performance data set                                  30                  400
State data set                                        0                   180
State data set                                        20                  400
State data set                                        30                  400

 

You will probably notice – that we only keep 10 days of RAW Performance by default.  Generally – you don't want to mess with this.  This is simply to keep a short amount of raw data – to build our hourly and daily aggregations from.  All built in performance reports in SCOM run from Hourly, or Daily aggregations by default.
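If you would rather see the aggregation type spelled out instead of remembering the ID values, here is a small variant of the query above using a CASE expression (same tables and columns as before, nothing new assumed):

SELECT DataSetDefaultName,
CASE AggregationTypeId
  WHEN 0 THEN 'Raw'
  WHEN 20 THEN 'Hourly'
  WHEN 30 THEN 'Daily'
  ELSE 'Other'
END AS AggregationType,
MaxDataAgeDays
FROM StandardDatasetAggregation sda
INNER JOIN dataset ds on ds.datasetid = sda.datasetid
ORDER BY DataSetDefaultName, AggregationTypeId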

 

Now we are cooking!

Fortunately – there is a command line tool published that will help make changes to these retention periods, and provide more information about how much data we have currently.  This tool is called DWDATARP.EXE.  It is available for download HERE.

This gives us a nice way to view the current settings.  Download this to your tools machine, your RMS, or directly on your warehouse machine.  Run it from a command line.

Run just the tool with no parameters to get help:    

C:\>dwdatarp.exe

To get our current settings – run the tool with ONLY the –s (server\instance) and –d (database) parameters.  This will output the current settings.  However – it does not format well to the screen – so output it to a TXT file and open it:

C:\>dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW > c:\dwoutput.txt

Here is my output (I removed some of the vista/client garbage for brevity)

 

Dataset name                  Aggregation name      Max Age   Current Size, Kb
Alert data set                Raw data              400       18,560 ( 1%)
Client Monitoring data set    Raw data              30        0 ( 0%)
Client Monitoring data set    Daily aggregations    400       16 ( 0%)
Configuration dataset         Raw data              400       153,016 ( 4%)
Event data set                Raw data              100       1,348,168 ( 37%)
Performance data set          Raw data              10        467,552 ( 13%)
Performance data set          Hourly aggregations   400       1,265,160 ( 35%)
Performance data set          Daily aggregations    400       61,176 ( 2%)
State data set                Raw data              180       13,024 ( 0%)
State data set                Hourly aggregations   400       305,120 ( 8%)
State data set                Daily aggregations    400       20,112 ( 1%)

 

Right off the bat – I can see how little data daily performance actually consumes.  I can see how much data only 10 days of RAW perf data consumes.  I also see a surprising amount of event data consuming space in the database.  Typically – you will see that perf hourly will consume the most space in a warehouse.

 

So – with this information in hand – I can do two things….

  • I can know what is using up most of the space in my warehouse.
  • I can know the Dataset name, and Aggregation name… to input to the command line tool to adjust it!

 

Now – on to the retention adjustments.

 

First thing – I will need to gather my Reporting service level agreement from management.  This is my requirement for how long I need to keep data for reports.  I also need to know “what kind” of reports they want to be able to run for this period.

From this discussion with management – we determined:

  • We require detailed performance reports for 90 days (hourly aggregations)
  • We require less detailed performance reports (daily aggregations) for 1 year for trending and capacity planning.
  • We want to keep a record of all ALERTS for 6 months.
  • We don't use any event reports, so we can reduce this retention from 100 days to 30 days.
  • We don't use AEM (Client Monitoring Dataset) so we will leave this unchanged.
  • We don't report on state changes much (if any) so we will set all of these to 90 days.

Now I will use the DWDATARP.EXE tool – to adjust these values based on my company reporting SLA:

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Daily aggregations" -m 365

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Alert data set" -a "Raw data" -m 180

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Event data set" -a "Raw Data" -m 30

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Raw data" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Daily aggregations" -m 90

 

Now my table reflects my reporting SLA – and my actual space needed in the warehouse will be much reduced in the long term:

 

Dataset name                  Aggregation name      Max Age   Current Size, Kb
Alert data set                Raw data              180       18,560 ( 1%)
Client Monitoring data set    Raw data              30        0 ( 0%)
Client Monitoring data set    Daily aggregations    400       16 ( 0%)
Configuration dataset         Raw data              400       152,944 ( 4%)
Event data set                Raw data              30        1,348,552 ( 37%)
Performance data set          Raw data              10        468,960 ( 13%)
Performance data set          Hourly aggregations   90        1,265,992 ( 35%)
Performance data set          Daily aggregations    365       61,176 ( 2%)
State data set                Raw data              90        13,024 ( 0%)
State data set                Hourly aggregations   90        305,120 ( 8%)
State data set                Daily aggregations    90        20,112 ( 1%)

 

Here are some general rules of thumb (might be different if your environment is unique)

  • Only keep the maximum retention of data in the warehouse per your reporting requirements.
  • Do not modify the performance RAW dataset.
  • Most performance reports are run against Perf Hourly data for detailed performance throughout the day.  For reports that span long periods of time (weeks/months) you should generally use Daily aggregation.
  • Daily aggregations should generally be kept for the same retention as hourly – or longer.
  • Hourly datasets use up much more space than daily aggregations.
  • Most people don't use events in reports – and these can often be groomed much sooner than the default of 100 days.
  • Most people don't do a lot of state reporting beyond 30 days, and these can be groomed much sooner as well if desired.
  • Don't modify a setting if you don't use it.  There is no need.
  • The Configuration dataset generally should not be modified.  This keeps data about objects to report on, in the warehouse.  It should be set to at LEAST the longest of any perf, alert, event, or state datasets that you use for reporting.

OpsMgr 2012 – Grooming deep dive in the OperationsManager database


Grooming of the OpsDB in OpsMgr 2012 is very similar to OpsMgr 2007.  Grooming is called once per day at 12:00am, by the rule “Partitioning and Grooming”.  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted to the “All Management Servers Resource Pool” and is part of the System Center Internal Library.

image

It calls the “p_PartitioningAndGrooming” stored procedure.  This SP calls two other SP's:  p_Partitioning and then p_Grooming

p_Partitioning inspects the table PartitionAndGroomingSettings, and then calls the SP p_PartitionObject for each object in the PartitionAndGroomingSettings table where "IsPartitioned = 1"   (note - we partition event and perf into 61 daily tables - just like MOM 2005/SCOM 2007)

The PartitionAndGroomingSettings table:

image

The p_PartitionObject SP first identifies the next partition in the sequence, truncates it to make sure it is empty, and then updates the PartitionTables table in the database, to update the IsCurrent field to the next numeric table for events and perf.  It also sets the current time as the partition end time in the previous “is current” row, and sets the current time in the partition start time of the new “is current” row.  Then it calls the p_PartitionAlterInsertView sproc, to make new data start writing to the “new” current event and perf table.

To review which tables you are writing to - execute the following query:   select * from partitiontables where IsCurrent = '1'

A select * from partitiontables will show you all 61 event and perf tables, and when they were used.  You should see a PartitionStartTime updated every day - around midnight (time is stored in UTC in the database).  If partitioning is failing to run, then we won't see this date changing every day.  
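A minimal spot check for this: order the partition table by start time and verify the newest rows were switched in around the most recent midnight UTC:

-- Most recently switched partitions at the top; PartitionStartTime should advance daily
select * from PartitionTables
order by PartitionStartTime DESC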

Ok - that's the first step of the p_PartitioningAndGrooming sproc - Partitioning.  Now - if that is all successful, we will start grooming!

The p_Grooming is called after partitioning is successful.  One of the first things it does - is to update the InternalJobHistory table.  In this table - we keep a record of all partitioning and grooming jobs.  It is a good spot check to see what's going on with grooming.  To have a peek at this table - execute a select * from InternalJobHistory order by InternalJobHistoryId DESC

image

The p_Grooming sproc then calls p_GroomPartitionedObjects 

p_GroomPartitionedObjects  will first examine the PartitionAndGroomingSettings and compare the “days to keep” column value, against the current date, to figure out how many partitions to keep vs groom.  It will then inspect the partitions (tables) to ensure they have data, and then truncate the partition, by calling p_PartitionTruncate.  A truncate command is just a VERY fast and efficient way to delete all data from a table without issuing a highly transactional DELETE command.  The p_GroomPartitionedObjects sproc will then update the PartitionAndGroomingSettings table with the current time, under the GroomingRunTime column, to reflect when grooming last ran. 

Next - the p_Grooming sproc continues, by calling p_GroomNonPartitionedObjects. 

p_GroomNonPartitionedObjects is a short, but complex sproc - in that it calls all the individual sprocs listed in the PartitionAndGroomingSettings table where IsPartitioned = 0.  The following stored procedures are present in my database for non-partitioned data:

  • p_AlertGrooming
  • p_StateChangeEventGrooming
  • p_MaintenanceModeHistoryGrooming
  • p_AvailabilityHistoryGrooming
  • p_JobStatusGrooming
  • p_MonitoringJobStatusGrooming
  • p_PerformanceSignatureGrooming
  • p_PendingSdkDataSourceGrooming
  • p_InternalJobHistoryGrooming
  • p_EntityChangeLogGroom
  • p_UserSettingsStoreGrooming
  • p_TriggerEntityChangeLogStagedGrooming

Now, for the above sprocs, each one could potentially return a success or failure.  They will also likely call additional sprocs, for specific tasks.  You can see, the rabbit hole is deep.  This is just an example of the complexity involved in self-maintenance and grooming.  If you are experiencing a grooming failure of any kind, and the error messages involve any of the above stored procedures when you execute p_PartitioningAndGrooming manually, you should open a support case with Microsoft for troubleshooting and resolution.  The theory is that each of the above procedures grooms a specific non-partitioned dataset.  Under NORMAL circumstances, each should be able to complete in a reasonable time frame.  The challenge becomes evident when something goes wrong, like alert storms, state change event storms from monitors flip-flopping, lots of performance signature data from using self-tuning threshold monitors, or huge amounts of pending SDK datasource data from large Exchange 2010 environments or other MP’s that might leverage this.  Grooming non-partitioned data is slow, highly resource intensive, and transactional.  These are specific DELETE statements run directly against tables, often combined with creating temp tables in TempDB.  Having a good pre-sized, high performance TempDB can help, as will ensuring you have plenty of transaction log space for the database, and having a disk subsystem that offers as many IOPS as possible.  http://technet.microsoft.com/en-us/library/ms175527(v=SQL.105).aspx

Next - the p_Grooming sproc continues, by updating the InternalJobHistory table, to give it a status of success (StatusCode of 1 = success, 2= failed, 0 appears to be never completed?)
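Using those status codes, a quick way to surface only the partitioning and grooming jobs that did not report success:

-- Jobs that failed (2) or apparently never completed (0), most recent first
select * from InternalJobHistory
where StatusCode <> 1
order by InternalJobHistoryId DESC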

If you ever have a problem with grooming - or need to get your OpsDB database size under control - simply reduce the data retention days in the console, under Administration, Settings, Database Grooming.  To start with - I recommend setting all of these to just 2 days, from the default of 7.  This keeps your OpsDB under control until you have time to tune all the noise from the MP's you import.  So reduce this number, then open up a query window and execute:  EXEC p_PartitioningAndGrooming.  When it is done, check the job status by executing:  select * from InternalJobHistory order by InternalJobHistoryId DESC.  The last groom job should be present, and successful.  The OpsDB size should be smaller, with more free space.  And to validate, you can always run my large table query, found at:   Useful Operations Manager 2007 SQL queries

System Center Universe is coming – January 19th!


 

REGISTER NOW HERE:  http://www.systemcenteruniverse.com/

image

 

Read Cameron Fuller’s blog post on this here:  http://blogs.catapultsystems.com/cfuller/archive/2015/12/17/scuniverse-returns-to-dallas-tx-and-the-world-on-january-19th-2016/

 

 

SCU is an awesome day of sessions covering Microsoft System Center, Windows Server, and Azure technologies from top speakers including Microsoft experts and MVP’s in the field.

There are two tracks depending on your interests – Cloud and Datacenter Management, and Enterprise Client Management.

The sponsors for 2016 include:

  • Catapult Systems
  • Microsoft
  • Veeam
  • Adaptiva
  • Secunia
  • Heat Software
  • MPx Alliance
  • Squared Up
  • Cireson

If you cannot attend in person – you can still attend via simulcast!  If you want to attend virtually, there are user group based simulcast locations around the world. Registration is available at: http://www.systemcenteruniverse.com/venue.htm

Simulcast event locations include:

  • Austin, TX
  • Denver, CO
  • Houston, TX
  • Omaha, NE
  • Phoenix, AZ
  • San Antonio, TX
  • Seattle, WA
  • Tampa, FL
  • Amsterdam
  • Germany
  • Vienna
  • And of course our event location in Dallas, TX!

If you want to attend, the in-person event it is available in Dallas Texas and registration is available at: https://www.eventbrite.com/e/scu-2016-live-tickets-7970023555

UR8 for SCOM 2012 R2 – Step by Step


 

image

 

NOTE:  I get this question every time we release an update rollup:   ALL SCOM Update Rollups are CUMULATIVE.  This means you do not need to apply them in order, you can always just apply the latest update.  If you have deployed SCOM 2012 R2 and never applied an update rollup – you can go straight to the latest one available.  If you applied an older one (such as UR3) you can always go straight to the latest one!

 

 

KB Article for OpsMgr:  https://support.microsoft.com/en-us/kb/3096382

KB Article for all System Center components:  https://support.microsoft.com/en-us/kb/3096378

Download catalog site:  http://catalog.update.microsoft.com/v7/site/Search.aspx?q=3096382

 

Key fixes:

  • Slow load of alert view when it is opened by an operator
    Sometimes when operators change between alert views, the views take up to two minutes to load. After this update rollup is installed, the reported performance issue is resolved. The Alert View load for the Operator role is now almost the same as that for the Admin role user.
  • SCOMpercentageCPUTimeCounter.vbs causes enterprise wide performance issue
    Health Service encountered slow performance every five to six (5-6) minutes in a cyclical manner. This update rollup resolves this issue.
  • System Center Operations Manager Event ID 33333 Message: The statement has been terminated.
    This change filters out "statement has been terminated" warnings that SQL Server throws. These warning messages cannot be acted on. Therefore, they are removed.
  • System Center 2012 R2 Operations Manager: Report event 21404 occurs with error '0x80070057' after Update Rollup 3 or Update Rollup 4 is applied.
    In Update Rollup 3, a design change was made in the agent code that regressed and caused SCOM agent to report error ‘0x80070057’ and MonitoringHost.exe to stop responding/crash in some scenarios. This update rollup rolls back that UR3 change.
  • SDK service crashes because of Callback exceptions from event handlers being NULL
    In a connected management group environment in certain race condition scenarios, the SDK of the local management group crashes if there are issues during the connection to the different management groups. After this update rollup is installed, the SDK of the local management group should no longer crash.
  • Run As Account(s) Expiring Soon — Alert does not raise early enough
    The 14-day warning for the RunAs account expiration was not visible in the SCOM console. Customers received only an Error event in the console three days before the account expiration. After this update rollup is installed, customers will receive a warning in their SCOM console 14 days before the RunAs account expiration, and receive an Error event three (3) days before the RunAs account expiration.
  • Network Device Certification
    As part of Network device certification, we have certified the following additional devices in Operations Manager to make extended monitoring available for them:
    • Cisco ASA5515
    • Cisco ASA5525
    • Cisco ASA5545
    • Cisco IPS 4345
    • Cisco Nexus 3172PQ
    • Cisco ASA5515-IPS
    • Cisco ASA5545-IPS
    • F5 Networks BIG-IP 2000
    • Dell S4048
    • Dell S3048
    • Cisco ASA5515sc
    • Cisco ASA5545sc
  • French translation of APM abbreviation is misleading
    The French translation of “System Center Management APM service” is misleading. APM abbreviation is translated incorrectly in the French version of Microsoft System Center 2012 R2 Operations Manager. APM means “Application Performance Monitoring” but is translated as “Advanced Power Management." This fix corrects the translation.
  • p_HealthServiceRouteForTaskByManagedEntityId does not account for deleted resource pool members in System Center 2012 R2 Operations Manager
    If customers use Resource Pools and take some servers out of the pool, discovery tasks start failing in some scenarios. After this update rollup is installed, these issues are resolved.
  • Exception in the 'Managed Computer' view when you select Properties of a managed server in Operations Manager Console
    In the Operations Manager Server “Managed Computer” view on the Administrator tab, clicking the “Properties” button of a management server causes an error. After this update rollup is installed, a dialog box that contains a “Heart Beat” tab is displayed.
  • Duplicate entries for devices when network discovery runs
    When customers run discovery tasks to discover network devices, duplicate network devices that have alternative MAC addresses are discovered in some scenarios. After this update rollup is installed, customers will not receive any duplicate devices discovered in their environments.
  • Preferred Partner Program in Administration Pane
    This update lets customers view certified System Center Operations Manager partner solutions directly from the console. Customers can obtain an overview of the partner solutions and visit the partner websites to download and install the solutions.
There are no updates for Linux, and there are no updated MP’s for Linux in this update.

 

Let's get started.

From reading the KB article – the order of operations is:

  1. Install the update rollup package on the following server infrastructure:
    • Management servers
    • Gateway servers
    • Web console server role computers
    • Operations console role computers
  2. Apply SQL scripts.
  3. Manually import the management packs.
  4. Update Agents

Now, NORMALLY we would need to add another step – if we are using Xplat monitoring, we would need to update the Linux/Unix MP’s and agents.   However, in UR8 for SCOM 2012 R2, there are no updates for Linux.

 

 

 

1.  Management Servers

image

Since there is no RMS anymore, it doesn’t matter which management server I start with.  There is no need to begin with whichever server holds the RMSe role.  I simply make sure I only patch one management server at a time to allow for agent failover without overloading any single management server.

I can apply this update manually via the MSP files, or I can use Windows Update.  I have 3 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

The first thing I do when I download the updates from the catalog, is copy the cab files for my language to a single location:

image

Then extract the contents:

image

Once I have the MSP files, I am ready to start applying the update to each server by role.

***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.

My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:

image

This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update usually does not provide any feedback that it had success or failure. 

I got a prompt to restart:

image

I choose yes and allow the server to restart to complete the update.

 

You can check the application log for the MsiInstaller events to show completion:

Log Name:      Application
Source:        MsiInstaller
Event ID:      1036
Level:         Information
Computer:      SCOM01.opsmgr.net
Description:
Windows Installer installed an update. Product Name: System Center Operations Manager 2012 Server. Product Version: 7.1.10226.0. Product Language: 1033. Manufacturer: Microsoft Corporation. Update Name: System Center 2012 R2 Operations Manager UR8 Update Patch. Installation success or error status: 0.

You can also spot check a couple DLL files for the file version attribute. 

image

Next up – run the Web Console update:

image

This runs much faster.   A quick file spot check:

image

Lastly – install the console update (make sure your console is closed):

image

A quick file spot check:

image

 

 

Secondary Management Servers:

image

I now move on to my secondary management servers, applying the server update, then the console update. 

On this next management server, I will use the example of Windows Update as opposed to manually installing the MSP files.  I check online, and make sure that I have configured Windows Update to give me updates for additional products: 

Apparently when I tried this – the catalog was broken – because none of the system center stuff was showing up in Windows Updates.

So….. because of this – I elect to do manual updates like I did above.

I apply these updates, and reboot each management server, until all management servers are updated.

 

 

 

Updating Gateways:

image

I can use Windows Update or manual installation.

image

The update launches a UI and quickly finishes.

Then I will spot check the DLL’s:

image

I can also spot-check the \AgentManagement folder, and make sure my agent update files are dropped here correctly:

image

 

 

 

2. Apply the SQL Scripts

In the path on your management servers, where you installed/extracted the update, there are two SQL script files: 

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

image

First – let’s run the script to update the OperationsManager database.  Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file.  Make sure it is pointing to your OperationsManager database, then execute the script.

You should run this script with each UR, even if you ran this on a previous UR.  The script body can change so as a best practice always re-run this.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.  I have had customers state this takes from a few minutes to as long as an hour. In MOST cases – you will need to shut down the SDK, Config, and Monitoring Agent (healthservice) on ALL your management servers in order for this to be able to run with success.

You will see the following (or similar) output:

image47

or

image

IF YOU GET AN ERROR – STOP!  Do not continue.  Try re-running the script several times until it completes without errors.  In a production environment, you almost certainly have to shut down the services (sdk, config, and healthservice) on your management servers, to break their connection to the databases, to get a successful run.
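Before re-running it, it can help to confirm nothing is still connected to the OperationsManager database.  sp_who2 is a standard SQL Server procedure – just scan the DBName column for OperationsManager connections coming from your management servers:

-- List current sessions; look for anything still connected to the OperationsManager database
EXEC sp_who2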

Technical tidbit:   Even if you previously ran this script in UR1, UR2, UR3, UR4, UR5, UR6, or UR7, you should run this again for UR8, as the script body can change with updated UR’s.

image

Next, we have a script to run against the warehouse DB.  Do not skip this step under any circumstances.    From:

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on whether you have an upgraded environment or a clean install)

Open a SQL management studio query window, connect it to your OperationsManagerDW database, and then open the script file UR_Datawarehouse.sql.  Make sure it is pointing to your OperationsManagerDW database, then execute the script.

If you see a warning about line endings, choose Yes to continue.

image

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.

You will see the following (or similar) output:

image

 

 

 

3. Manually import the management packs

image

There are 26 management packs in this update!

The path for these is on your management server, after you have installed the “Server” update:

\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Management Packs for Update Rollups

However, the majority of them are Advisor/OMS, and language specific.  Only import the ones you need, and that are correct for your language.  I will remove all the Advisor MP’s for other languages, and I am left with the following:

image

The TFS MP bundles are only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  I’d skip this MP import unless you already have these MP’s present in your environment.

The Advisor MP’s are only needed if you are using Microsoft Operations Management Suite cloud service, (Previously known as Advisor, and Operation Insights).

However, the Image and Visualization libraries deal with Dashboard updates, and these always need to be updated.

I import all of these shown without issue.

 

 

4.  Update Agents

image43_thumb

Agents should be placed into pending actions by this update for any agent that was not manually installed (remotely manageable = yes).  Mine worked great.  On the management servers where I used Windows Update to patch them, their agents did not show up in this list.  Only agents whose management server I patched manually showed up in this list.  FYI.

image

If your agents are not placed into pending management – this is generally caused by not running the update from an elevated command prompt, or having manually installed agents which will not be placed into pending.

In this case – agents reporting to a management server that was updated using Windows Update did NOT get placed into pending.  Only the agents reporting to the management server for which I manually executed the patch showed up.

You can approve these – which will result in a success message once complete:

image

Soon you should start to see PatchList getting filled in from the Agents By Version view under Operations Manager monitoring folder in the console:

image
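If you prefer to check from SQL instead of the console, here is a query I use against the OperationsManager database – this assumes the MT_HealthService table with AgentName and PatchList columns, which is what I see in my 2012 R2 management groups:

-- Show which update rollup each agent reports in its patch list
select AgentName, PatchList
from MT_HealthService
order by PatchList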

 

 

 

5.  Update Unix/Linux MPs and Agents

image

There are no updates for Linux in UR8.  Please see the instructions for UR7 if you are not updating from UR7 directly:

http://blogs.technet.com/b/kevinholman/archive/2015/08/17/ur7-for-scom-2012-r2-step-by-step.aspx

 

 

6.  Update the remaining deployed consoles

image

This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, SCVMM server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the matching update version.

 

 

 

Review:

Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

image

Known issues:

See the existing list of known issues documented in the KB article.

1.  Many people are reporting that the SQL script is failing to complete when executed.  You should attempt to run this multiple times until it completes without error.  You might need to stop the Exchange correlation engine, stop all the SCOM services on the management servers, and/or bounce the SQL server services in order to get a successful completion in a busy management group.  The errors reported appear as below:

——————————————————
(1 row(s) affected)
(1 row(s) affected)
Msg 1205, Level 13, State 56, Line 1
Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Msg 3727, Level 16, State 0, Line 1
Could not drop constraint. See previous errors.
——————————————————–

Writing a service recovery script – Cluster service example


 

I had a customer request the ability to monitor the cluster service on clusters, and ONLY alert when a recovery attempt failed.

This is a fairly standard request for service monitoring when we use recoveries – we generally don’t want an alert to be generated from the Service Monitor, because that will be immediate upon service down detection.  We want the service monitor to detect the service down, then run a recovery, and then if the recovery fails to restore service, generate an alert.

Here is an example of that.

The cluster service monitor is unique, in that it already has a built in recovery.  However, it is too simple for our needs, as it only runs NET START.

image

 

So the first thing we will need to do, is create an override disabling this built in recovery:

image

 

Next – override the “Cluster service status” monitor to not generate alerts:

image

 

Now we can add our own script base recovery to the monitor:

image

 

image

 

And paste in a script which I will provide below.  Here is the script:

'==========================================================================
'
' COMMENT: This is a recovery script to recover the Cluster Service
'
'==========================================================================
Option Explicit
SetLocale("en-us")

Dim StartTime,EndTime,sTime
'Capture script start time
StartTime = Now 'Time that the script starts so that we can see how long it has been watching to see if the service stops again.
Dim strTime
strTime = Time

Dim oAPI
Set oAPI = CreateObject("MOM.ScriptAPI")
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3750,0,"Service Recovery script is starting")

Dim strComputer, strService, strStartMode, strState, objCount, strClusterService
'The script will always be run on the machine that generated the monitor error
strComputer = "."
strClusterService = "ClusSvc"

'Record the current state of each service before recovery in an event
Dim strClusterServicestate
ServiceState(strClusterService)
strClusterServicestate = strState
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3751,0,"Current service state before recovery is: " & strClusterService & " : " & strClusterServicestate)

'Stop script if all services are running
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3752,2,"All services were found to be already running, recovery should not run, ending script")
  Wscript.Quit
End If

'Check to see if a specific event has been logged previously that means this recovery script should NOT run if event is present
'This section optional and not commonly used
Dim dtmStartDate, iCount, colEvents, objWMIService, objEvent
' Const CONVERT_TO_LOCAL_TIME = True
' Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
' dtmStartDate.SetVarDate dateadd("n", -60, now) 'CONVERT_TO_LOCAL_TIME
' iCount = 0
' Set objWMIService = GetObject("winmgmts:" _
'   & "{impersonationLevel=impersonate,(Security)}!\\" _
'   & strComputer & "\root\cimv2")
' Set colEvents = objWMIService.ExecQuery _
'   ("Select * from Win32_NTLogEvent Where Logfile = 'Application' and " _
'   & "TimeWritten > '" & dtmStartDate & "' and EventCode = 100")
' For Each objEvent In colEvents
'   iCount = iCount+1
' Next
' If iCount => 1 Then
'   EndTime = Now
'   sTime = DateDiff("s", StartTime, EndTime)
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3761,2,"script found event which blocks execution of this recovery. Recovery will not run. Script ending after " & sTime & " seconds")
'   WScript.Quit
' ElseIf iCount < 1 Then
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3762,0,"script did not find any blocking events. Script will continue")
' End If

'At least one service is stopped to cause this recovery, stopping all three services so we can start them in order
'You would only use this section if you had multiple services and they needed to be started in a specific order
' Call oAPI.LogScriptEvent("ServiceRecovery.vbs",3753,0,"At least one service was found not running. Recovery will run. Attempting to stop all services now")
' ServiceStop(strService1)
' ServiceStop(strService2)
' ServiceStop(strService3)

'Check to make sure all services are actually in stopped state
'Optional: Wait 15 seconds for slow services to stop
' Wscript.Sleep 15000
ServiceState(strClusterService)
strClusterServicestate = strState

'Stop script if all services are not stopped
If (strClusterServicestate <> "Stopped") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3754,2,"Recovery script found service is not in stopped state. Manual intervention is required, ending script. Current service state is: " & strClusterService & " : " & strClusterServicestate)
  Wscript.Quit
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3755,0,"Recovery script verified all services in stopped state. Continuing.")
End If

'Start services in order.
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3756,0,"Attempting to start all services")
Dim errReturn

'Restart Services and watch to see if the command executed without error
ServiceStart(strClusterService)
Wscript.sleep 5000

'Check service state to ensure all services started
ServiceState(strClusterService)
strClusterServicestate = strState

'Log success or fail of recovery
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3757,0,"All services were successfully started and then found to be running")
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3758,2,"Recovery script failed to start all services. Manual intervention is required. Current service state is: " & strClusterService & " : " & strClusterServicestate)
End If

'Check to see if this recovery script has been run three times in the last 60 minutes for loop detection
Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
dtmStartDate.SetVarDate dateadd("n", -60, now) 'CONVERT_TO_LOCAL_TIME
iCount = 0
Set objWMIService = GetObject("winmgmts:" _
  & "{impersonationLevel=impersonate,(Security)}!\\" _
  & strComputer & "\root\cimv2")
Set colEvents = objWMIService.ExecQuery _
  ("Select * from Win32_NTLogEvent Where Logfile = 'Operations Manager' and " _
  & "TimeWritten > '" & dtmStartDate & "' and EventCode = 3750")
For Each objEvent In colEvents
  iCount = iCount+1
Next
If iCount => 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3759,2,"script restarted " & strClusterService & " service 3 or more times in the last hour, script ending after " & sTime & " seconds")
  WScript.Quit
ElseIf iCount < 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3760,0,"script restarted " & strClusterService & " service less than 3 times in the last hour, script ending after " & sTime & " seconds")
End If
Wscript.Quit

'==================================================================================
' Subroutine: ServiceState
' Purpose: Gets the service state and startmode from WMI
'==================================================================================
Sub ServiceState(strService)
  Dim objWMIService, colRunningServices, objService
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colRunningServices = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name = '"& strService & "'")
  For Each objService in colRunningServices
    strState = objService.State
    strStartMode = objService.StartMode
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStart
' Purpose: Starts a service
'==================================================================================
Sub ServiceStart(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StartService()
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStop
' Purpose: Stops a service
'==================================================================================
Sub ServiceStop(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StopService()
  Next
End Sub

 

Here it is inserted into the UI.  I provide a 3 minute timeout for this one:

 

image

 

Here is how it will look once added:

image

 

Now – we need to generate an alert when the script detects that it failed to start the service:

image

 

Provide a name and we will target the same class as the service monitor:

image

 

For the expression – the ID comes from the event generated by the recovery script, and the string search makes sure we are only alerting on a Cluster service recovery, if we reuse the script for other services we need to be able to distinguish from them:

image

 

 

Lets test!

If we just simply stop the Cluster Service – the recovery kicks in and see evidence in the state changes, and event log:

 

image

 

I like REALLY verbose logging in the scripts I write…. more is MUCH better than less, especially when troubleshooting, and recoveries should not be running often enough to clog up the logs.

image

image

image

image

 

image

image

 

 

If the recovery fails to start the service – the script detects this – drops a very specific event, and then an alert is generated for the service being down and manual intervention required:

 

image

 

image

 

 

There we have it – we only get alerts if the service is not recoverable.  This makes SCOM more actionable.  If we want a record of this for reporting – we can collect the events for recovery starting, and then report on those events.
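For example, if you create simple event collection rules for the script's event IDs (3750 for recovery starting, 3757 for success, 3758 for failure), a query like this against the OperationsManager database gives a quick record of recovery activity.  This assumes the LoggingComputer column in EventAllView, as used in the event tuning post above:

-- Count of collected recovery script events by event number and computer
SELECT Number as EventID, LoggingComputer, COUNT(*) AS Total
FROM EventAllView eav with (nolock)
WHERE Number in (3750, 3757, 3758)
GROUP BY Number, LoggingComputer
ORDER BY Total DESC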

You can download this example MP at:

https://gallery.technet.microsoft.com/Cluster-Service-Recovery-270ca2cd
