Oracle Exadata Database Machine provides a powerful, all-inclusive solution for highly available databases and extremely fast I/O access to data. It combines Oracle’s Grid Infrastructure and Real Application Clusters database software with Exadata Storage Server technology in a pre-configured package.
Exadata is a platform suited to systems ranging from data warehouses running large, scan-intensive operations to online transaction processing systems that need high concurrency.
The Exadata Storage Servers are based on the 64-bit Intel-based Sun Fire Servers and they are shipped preloaded with Oracle Enterprise Linux x86_64 operating systems, the Exadata Storage Server software and InfiniBand protocol drivers.
While Exadata consistently provides amazing performance, like anything Oracle, it is important that DBAs are able to monitor the Exadata Storage Servers for both potential performance issues and errors.
Many aspects of the Exadata Storage Servers can be monitored including current active requests, hardware sensors, disk I/O errors, network errors, free space and metrics that are being managed. In this article, we’ll be focusing on using and monitoring metrics using the CELLCLI command line tool.
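To set up and configure the cells for alerts and notifications, you should be logged into the Exadata Storage Server(s) using the celladmin account. The cellmonitor account is limited to viewing objects, so it is suitable for the LIST commands shown below but not for the ALTER and CREATE commands.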
Overview of Metrics Monitoring
First, let’s take a look at how the Exadata storage server monitoring works. The primary process that manages Exadata storage servers is CELLSRV. It will periodically record important metrics on components like the CPUs, cell disks, grid disks, flash cache and IORM (IO Resource Management). These metrics are initially stored in memory. The MS (Management Server) retrieves these metrics from CELLSRV and keeps a subset of the values in memory and once an hour writes a history to an internal disk repository. The retention period for these metrics and alert information defaults to seven days and can be controlled by a specific setting on the storage server called metricHistoryDays. It is changed using an ALTER CELL command in CELLCLI on each storage server.
Viewing Metric Information
Metrics are at the center of the monitoring solution, and each metric has the following significant attributes:
- name
- metricObjectName – the specific object being measured, such as a specific cell disk
- objectType – the type of object being measured, one of:
  - IORM_CONSUMER_GROUP
  - IORM_DATABASE
  - IORM_CATEGORY
  - CELL
  - CELLDISK
  - CELL_FILESYSTEM
  - GRIDDISK
  - HOST_INTERCONNECT
  - FLASHCACHE
- unit
  - number
  - percentage
  - F (Fahrenheit)
  - C (Celsius)
- metricValue
- metricType
  - cumulative (since it was created)
  - instantaneous (at the time the metric was collected)
  - rate (change over time)
  - transition (collected when the metric value changed)
Several naming conventions are worth knowing, because they help us understand what we are looking at (or for) when managing Exadata Storage Server metrics.
Metric names are prefixed as follows:
CL_ (cell)
CD_ (cell disk)
GD_ (grid disk)
FC_ (flash cache)
DB_ (database)
CG_ (consumer group)
CT_ (category)
N_ (interconnect network)
I/O-related metrics are further identified by codes that indicate the operation being measured:
IO_BY (number of MB)
IO_TM (latency)
IO_WT (wait time)
They might also include a code to indicate reads (_R) or writes (_W), followed by an indicator of large (_LG, > 128K) or small (_SM, <= 128K) I/O, and a code for requests (_RQ) or seconds (_SEC). While this may all sound complicated, after working with the names for a while they do start to make sense.
For example:
GD_IO_BY_R_SM_SEC is the number of MB of small block I/O reads per second on a grid disk.
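The naming conventions above can be turned into a small lookup. The following Python sketch is illustrative only: the tables and the function name decode_metric_name are assumptions made for this example, not part of any Oracle tool, and they cover only the codes listed in this article.

```python
# Illustrative only: decodes a metric name using the conventions listed in
# this article. The tables and the function name are assumptions, not an
# Oracle API, and they cover only the codes mentioned here.

OBJECT_PREFIXES = {
    "CL": "cell",
    "CD": "cell disk",
    "GD": "grid disk",
    "FC": "flash cache",
    "DB": "database",
    "CG": "consumer group",
    "CT": "category",
    "N": "interconnect network",
}
IO_MEASURES = {"BY": "number of MB", "TM": "latency", "WT": "wait time"}
DIRECTIONS = {"R": "reads", "W": "writes"}
SIZES = {"LG": "large", "SM": "small"}
SUFFIXES = {"RQ": "requests", "SEC": "per second"}


def decode_metric_name(name):
    """Split a metric name on underscores and label the parts we recognize."""
    parts = name.upper().split("_")
    decoded = {"object": OBJECT_PREFIXES.get(parts[0], "unknown")}
    # Only names of the form <prefix>_IO_<code>_... are decoded further.
    if len(parts) >= 3 and parts[1] == "IO" and parts[2] in IO_MEASURES:
        decoded["measure"] = IO_MEASURES[parts[2]]
        for token in parts[3:]:
            if token in DIRECTIONS:
                decoded["direction"] = DIRECTIONS[token]
            elif token in SIZES:
                decoded["size"] = SIZES[token]
            elif token in SUFFIXES:
                decoded["suffix"] = SUFFIXES[token]
    return decoded


print(decode_metric_name("GD_IO_BY_R_SM_SEC"))
```

Names that do not follow the IO_ pattern are only labeled with their object type; LIST METRICDEFINITION remains the authoritative description of any metric.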
To see the specific details about any of the metrics, use the LIST METRICDEFINITION command. For example, to see the detailed information for all cell disk metrics, enter the following at the CellCLI> prompt:
LIST METRICDEFINITION WHERE objectType='CELLDISK' DETAIL
To view the history of any given metric, use the LIST METRICHISTORY command in CELLCLI. To see the current value of a metric, use LIST METRICCURRENT. The following command shows the history of flash cache metrics collected after a specific date and time:
LIST METRICHISTORY WHERE name like 'FC_.*' and collectionTime > '2013-01-31T13:15:30-08:00'
Or, to see the current value of metrics for all grid disks:
LIST METRICCURRENT WHERE objectType='GRIDDISK'
Working with Metrics Alerts
As administrators, not only can we view the metrics and the metric history, we are also able to define alert thresholds (both warning and critical) on many of these metrics along with I/O error counts, memory utilization and IORM metrics. Additionally, once an alert has been generated, actions taken to evaluate and resolve the alert can be tracked through the CELLCLI.
Alerts generated by the Exadata Storage Servers have the following attributes:
- alertSource
  - BMC (baseboard management controller)
  - Metric
  - ADR (Automatic Diagnostic Repository)
- severity
  - critical
  - warning
  - info
  - clear
- alertType
  - stateful
  - stateless
- metricObjectName
- examinedBy
- metricName
- name
- description
- alertAction (recommended action to perform)
- alertMessage (brief information)
- failedMail (intended recipient of a failed notification)
- failedSNMP (intended SNMP subscriber of a failed notification)
- beginTime
- endTime
- notificationState
  - 0 (never tried)
  - 1 (sent successfully)
  - 2 (retrying – up to 5 times)
  - 3 (five failed retries)
To learn more about the alert definitions, use the LIST ALERTDEFINITION command in CELLCLI and indicate which attributes you would like to see:
LIST ALERTDEFINITION ATTRIBUTES name, metricName, description
To see warning level alerts that have been generated, and not yet examined by an administrator:
LIST ALERTHISTORY where examinedBy = '' and severity = 'warning' DETAIL
To mark an alert as examined (where nnnn is the alert ID):
ALTER ALERTHISTORY nnnn examinedBy="Karen"
To create thresholds on metrics, use the CREATE THRESHOLD command and indicate the name of the metric, the warning and critical levels, the comparison operator, the number of occurrences and the observation count. The observation attribute indicates the number of measurements that the metric values are averaged over.
For example, to create a threshold on waits for small I/O requests for an IORM category called online that gives you a warning above 2500 milliseconds and a critical above 4000 milliseconds, you would enter something like:
CREATE THRESHOLD ct_io_wt_sm_rq.online warning=2500, critical=4000, comparison='>', occurrences=2, observation=5
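To make the interaction between observation and occurrences more concrete, here is a minimal Python model of one plausible reading of those attributes: each new measurement is averaged with the most recent ones (up to observation values), and an alert level is reported only after the average has crossed that level for occurrences consecutive measurements. The class ThresholdModel and its methods are hypothetical names for this sketch, and the logic is an assumption about the behavior, not Oracle's implementation.

```python
# A simplified model, NOT the storage server's actual algorithm.
# Assumptions: values are averaged over the last `observation` samples, and a
# level is reported after `occurrences` consecutive violations of that level.
import operator
from collections import deque

COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


class ThresholdModel:
    def __init__(self, warning, critical, comparison=">",
                 occurrences=1, observation=1):
        self.levels = {"critical": critical, "warning": warning}
        self.compare = COMPARATORS[comparison]
        self.occurrences = occurrences
        self.window = deque(maxlen=observation)  # most recent samples only
        self.streaks = {"critical": 0, "warning": 0}

    def add_sample(self, value):
        """Record one measurement and return 'critical', 'warning' or 'clear'."""
        self.window.append(value)
        average = sum(self.window) / len(self.window)
        for level, limit in self.levels.items():
            if self.compare(average, limit):
                self.streaks[level] += 1
            else:
                self.streaks[level] = 0
        # Report the most severe level whose streak is long enough.
        for level in ("critical", "warning"):
            if self.streaks[level] >= self.occurrences:
                return level
        return "clear"


# Same numbers as the CREATE THRESHOLD example above.
model = ThresholdModel(warning=2500, critical=4000, comparison=">",
                       occurrences=2, observation=5)
for wait_ms in (3000, 3000, 9000, 9000):
    print(wait_ms, model.add_sample(wait_ms))
```

Under this model a single spike does not raise an alert on its own: the average has to stay past the limit for two measurements in a row, which is the kind of damping these two attributes are meant to provide.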
About Alert Email Notifications
In order to actually have the Exadata Storage Servers send notifications via email (or alternately SNMP) each of the servers has to be configured with the appropriate settings. This is done using the ALTER CELL command in CELLCLI.
ALTER CELL smtpServer='mailserver.somewhere.com', -
smtpFromAddr='[email protected]', -
smtpPwd='email_password', -
smtpToAddr='[email protected]', -
notificationPolicy='critical,warning,clear', -
notificationMethod='mail'
There is also a verification command that can be run to test that the storage server can actually reach the mail server.
ALTER CELL VALIDATE MAIL
Watching for Undelivered Alerts
Once the alerts and notifications have been set up, it is still important to periodically check the storage servers just to make sure any alerts that have been generated have actually been delivered (via email and/or to Grid or Cloud Control).
LIST ALERTHISTORY where notificationState != 1 and examinedBy=''
If there are undelivered alerts, double-check the cell configuration, agent status and network connectivity.
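The same filter can be expressed in a script once the alert history has been collected into records. The following Python sketch assumes the alerts are already available as dictionaries keyed by the attribute names listed earlier; the sample records and the function name undelivered_alerts are hypothetical, and how the records are obtained from the cell is left out.

```python
# Illustrative only: the record layout and sample data are assumptions.
# The state codes come from the notificationState list in this article.
NOTIFICATION_STATES = {
    0: "never tried",
    1: "sent successfully",
    2: "retrying (up to 5 times)",
    3: "five failed retries",
}


def undelivered_alerts(alerts):
    """Mirror the filter: notificationState != 1 and examinedBy = ''."""
    return [
        alert for alert in alerts
        if alert.get("notificationState") != 1 and not alert.get("examinedBy")
    ]


# Hypothetical sample records for demonstration.
sample = [
    {"name": "1_1", "notificationState": 1, "examinedBy": ""},
    {"name": "2_1", "notificationState": 3, "examinedBy": ""},
    {"name": "3_1", "notificationState": 0, "examinedBy": "Karen"},
]

for alert in undelivered_alerts(sample):
    state = NOTIFICATION_STATES[alert["notificationState"]]
    print(alert["name"], "-", state)
```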
Conclusion
The Oracle Exadata Database Machine is easily one of Oracle's fastest growing product lines, with proven performance and availability. While there is a learning curve to fully manage and monitor these systems, it’s easy to see that the Exadata Storage Server software provides a set of very powerful options that allow us to configure, manage and monitor the performance and status of the storage servers in an Oracle Exadata Database Machine.