Monitoring guide for service providers (SPs)
The ARGO Monitoring service provides a flexible and scalable framework for monitoring the status, availability and reliability of a wide range of services provided by infrastructures of medium to high complexity. ARGO generates reports using customer-defined profiles (e.g. for SLA management or operations). During report generation, ARGO takes into account custom factors such as the importance of a specific service endpoint and scheduled or unscheduled downtimes.
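To illustrate how downtimes can affect a report, the sketch below computes availability and reliability for one reporting period. The formulas follow a common convention (availability counts all known time, reliability also excludes scheduled downtime) and are an assumption here; the `PeriodTotals` class, its field names and the sample numbers are hypothetical, and the ARGO documentation remains the authority on how reports are actually computed.

```python
from dataclasses import dataclass


@dataclass
class PeriodTotals:
    """Hours spent in each state over a reporting period (hypothetical shape)."""
    total: float               # length of the reporting period
    up: float                  # service was confirmed working
    unknown: float             # no monitoring result was available
    scheduled_downtime: float  # downtime declared in advance


def availability(p: PeriodTotals) -> float:
    """Share of the known period during which the service was up."""
    known = p.total - p.unknown
    return p.up / known if known > 0 else 0.0


def reliability(p: PeriodTotals) -> float:
    """Like availability, but scheduled downtime is not counted against the service."""
    accountable = p.total - p.unknown - p.scheduled_downtime
    return p.up / accountable if accountable > 0 else 0.0


# A 30-day month (720 h) with 24 h of scheduled and 12 h of unscheduled downtime.
month = PeriodTotals(total=720, up=684, unknown=0, scheduled_downtime=24)
print(f"availability = {availability(month):.3f}")
print(f"reliability  = {reliability(month):.3f}")
```

With these sample numbers the script prints an availability of 0.950 and a reliability of 0.983: the declared maintenance window lowers availability but not reliability.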
The ARGO Monitoring service for NI4OS consists of a production and a development infrastructure. The production infrastructure is deployed redundantly and is used for generating reports and raising alarms for production-grade on-boarded services. The development infrastructure is used for testing and integrating new services and probes. The web UI is available at:
- Production: https://argo.ni4os.eu
- Development: https://argo-devel.ni4os.eu
Topology
The topology tool used in NI4OS is GOCDB, a central registry for e-Infrastructure topology that holds general information about the sites participating in the project. GOCDB allows service endpoints to be described in detail with custom attributes, tags and additional sub-endpoints. Services are assigned to resource centres, which are grouped into operations centres. Besides service endpoints, GOCDB supports defining contact points and declaring downtimes for individual service endpoints or resource centres.
The monitoring service relies on the topology database to provide the following information:
- the monitored service(s)
- the service types they are running (e.g. wiki)
- the service endpoints of the service (e.g. endpoint)
- the way they are organized (e.g. in groups of sites, in groups of services)
- the service actors (owners, admins, contact points).
When adding a service endpoint, the following fields are mandatory for the monitoring service:
- Production Level (Is this service in production?):
  - Y = monitored on both the production and development infrastructure
  - N = monitored only on the development infrastructure
- Monitored (Is this service monitored?):
  - must be set to Y
- Notifications (Do you wish to receive notifications about this service?):
  - set to Y if you wish to receive alerts.
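The rules above can be sketched as a small check that reports where an endpoint would be monitored. This is only an illustration: the dictionary keys (`production`, `monitored`, `notifications`) are hypothetical names for the three GOCDB fields, not the real GOCDB schema, and treating an endpoint with Monitored set to anything other than Y as unmonitored is an assumption.

```python
MANDATORY_FIELDS = ("production", "monitored", "notifications")  # hypothetical key names


def monitoring_targets(endpoint: dict) -> list[str]:
    """Return the infrastructures that would monitor this endpoint."""
    missing = [key for key in MANDATORY_FIELDS if key not in endpoint]
    if missing:
        raise ValueError(f"missing mandatory field(s): {', '.join(missing)}")
    if endpoint["monitored"] != "Y":
        return []  # assumption: an endpoint without Monitored = Y is not checked at all
    if endpoint["production"] == "Y":
        return ["production", "development"]
    return ["development"]


def wants_alerts(endpoint: dict) -> bool:
    """True when the service owner asked to receive notifications."""
    return endpoint.get("notifications") == "Y"


wiki = {"production": "N", "monitored": "Y", "notifications": "Y"}
print(monitoring_targets(wiki), wants_alerts(wiki))
```

For the sample `wiki` endpoint, which is not yet in production, only the development infrastructure is returned.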
Extra GOCDB attributes
ARGO can use extra GOCDB attributes to monitor service endpoints properly. The table below will list the attributes that must be defined for each service type. At present, no extra GOCDB attributes are required.
Metrics
A metric is a simple chunk of code that checks specific functionality of a given service. For example:
- Portal-WebCheck: checks whether the service responds over HTTP
- CertValidity: checks the validity of a certificate
To start monitoring your service, you will need some metrics. Monitoring begins with basic checks such as the web check and certificate validity.
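As an illustration of what a metric such as CertValidity might do, here is a minimal sketch using only the Python standard library. It is not the probe ARGO actually runs: the function names and the 30-day and 7-day thresholds are hypothetical. The numeric states are the standard Nagios plugin exit codes.

```python
import socket
import ssl

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the notAfter field of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (seconds since the epoch) and the notAfter date."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def cert_state(days_left: float, warn_days: float = 30, crit_days: float = 7) -> int:
    """Map remaining validity to a state; the thresholds are illustrative."""
    if days_left <= crit_days:
        return CRITICAL
    if days_left <= warn_days:
        return WARNING
    return OK


# Offline demonstration with a fixed date string instead of a live connection.
not_after = "Jan  5 09:34:43 2018 GMT"
now = ssl.cert_time_to_seconds(not_after) - 10 * 86400  # pretend it is 10 days earlier
print(cert_state(days_until_expiry(not_after, now)))
```

In a real check, `fetch_not_after` would supply the date and `time.time()` the current time; the demonstration avoids the network so that it can be run anywhere.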
Service probe
Apart from the basic checks each service should have a list of specific metrics from the user perspective. Monitoring services from the user point of view means that all the services have to be monitored in the same way regardless of who the service providers are and where they are located.
The owners of the service are the ones who know exactly how the service works. The service development team, with the support of the monitoring team, is responsible for implementing a probe that checks the service while mimicking actual end-user behaviour, without requiring special privileges or special configuration.
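A probe of this kind is typically a small command-line program that follows the Nagios plugin convention: it prints a one-line status message and signals the result through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below shows one possible shape for an unauthenticated web check. The option names and the mapping of HTTP status codes to states are assumptions made for illustration; the probe development guidelines are the authority on what a real probe must provide.

```python
import argparse
import urllib.error
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def state_for_status(http_status: int) -> int:
    """Illustrative mapping: 2xx/3xx is OK, 4xx is WARNING, anything else CRITICAL."""
    if 200 <= http_status < 400:
        return OK
    if 400 <= http_status < 500:
        return WARNING
    return CRITICAL


def check(url: str, timeout: float) -> tuple[int, str]:
    """Fetch the URL as an anonymous user would and return (state, message)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return state_for_status(response.status), f"HTTP {response.status} from {url}"
    except urllib.error.HTTPError as err:  # server answered with an error status
        return state_for_status(err.code), f"HTTP {err.code} from {url}"
    except ValueError as err:  # malformed URL, so the check could not run
        return UNKNOWN, f"invalid URL {url!r}: {err}"
    except OSError as err:  # DNS failure, refused connection, timeout
        return CRITICAL, f"cannot reach {url}: {err}"


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Example web check probe")
    parser.add_argument("-u", "--url", required=True, help="URL to check")
    parser.add_argument("-t", "--timeout", type=float, default=10.0, help="seconds")
    args = parser.parse_args(argv)
    state, message = check(args.url, args.timeout)
    print(f"{LABELS[state]} - {message}")
    return state
```

When packaged as a script, the return value of `main()` would be passed to `sys.exit()` so that the monitoring engine receives the state as the exit code.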
Before you start implementing your own probe, please check the library to see whether an appropriate probe is already being used to monitor services:
- Probes: https://poem.ni4os.eu/ui/public_probes
- Metrics (currently available): https://poem.ni4os.eu/ui/public_metrics
If you cannot find a probe for your service then you should follow the development process described in the next chapter.
Probe Development Process
- Discuss (what to check): Discussion with the representatives or developers of each service in order to agree on a set of monitored metrics.
- Develop (how to check): Development and testing of the probe(s). The development lifecycle includes coding the probe, documentation, testing and packaging.
- Monitor (let's start monitoring): Deployment of the service probe repeats the following steps: a) the service owners provide guidelines and the monitoring team makes the necessary configuration; b) the probe is tested and verified; c) once it passes the tests, the reports include your service metrics. Monitoring then starts and you can get status and A/R (availability and reliability) reports for your service.
- Nagios Exchange: https://exchange.nagios.org/
- The probe development guidelines: http://argoeu.github.io/monitoring-probes/v1/guidelines_for_monitoring_probes/
Checklist
Integrating a new service into the ARGO Monitoring service involves the two main steps below.
| Step | Question | YES | NO |
|---|---|---|---|
| 1 | Does the topology database (GOCDB) already have a service type for my service? | SP adds the new service endpoint to the topology database. | SP follows the procedure to add a new service type to the topology database, then adds the new service endpoint and goes to step 2. |
| 2 | Does POEM contain metrics and probes that can be used to monitor my service? | ARGO admin adds the mapping between the new service type and the metrics in the relevant profiles. | SP follows the probe development process. |
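The two checklist steps above amount to a short decision procedure, sketched here as a function that returns the resulting to-do list. The function name and the wording of the actions are illustrative only.

```python
def onboarding_actions(service_type_exists: bool, probes_exist: bool) -> list[str]:
    """Return who has to do what, in order, to get a service monitored."""
    actions = []
    # Step 1: the service type must exist before an endpoint can be registered.
    if not service_type_exists:
        actions.append("SP: follow the procedure to add a new service type to the topology database")
    actions.append("SP: add the new service endpoint to the topology database")
    # Step 2: reuse existing metrics and probes when POEM already has suitable ones.
    if probes_exist:
        actions.append("ARGO admin: map the new service type to metrics in the relevant profiles")
    else:
        actions.append("SP: follow the probe development process")
    return actions


for action in onboarding_actions(service_type_exists=False, probes_exist=True):
    print(action)
```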