vRealize Operations reporting on host profiles

Introduction

After I published my post about importing advanced host properties into vRealize Operations, a colleague told me that it was very nice, but the real kicker would be to somehow import the host profile compliance status into vRealize Operations. Again, true to his motto “How hard can it be?”, the Viking set out to explore this challenge.

And, yet again, it turned out to be not hard at all. A few lines of Python code did the trick.

What’s a host profile?

Maybe a quick recap of host profiles could be useful. Host profiles were introduced several years ago in vSphere 4.0 and roughly serve three purposes:

  1. Work in conjunction with Auto Deploy to allow “stateless” hosts. These hosts don’t need any configuration at the host level, but read their configuration from a host profile when they boot and apply it on the fly.
  2. Report compliance against a host profile, i.e. list all deviations of host settings from the host profile.
  3. Proactively apply a host profile to an existing host to force it into compliance.

Under the hood, a host profile is just a long list of rules describing what the desired settings of a host should be. Host profiles are typically generated from a reference host. You simply select a host you consider correctly configured and vCenter will create all the rules to check and enforce compliance based on that host.

Host profiles and vRealize Operations

I often hear how nice it would be if vRealize Operations could check compliance with host profiles. So often that I decided to give it a try. As you may know, adding new functionality to vRealize Operations is really simple with e.g. Python. Building the host profile functionality required about 100 lines of code.

Here’s how my script works (a rough code sketch follows the list):

  1. List all the host profiles known to a vCenter.
  2. Trigger a host profile compliance check. This will check the profile against all the hosts it’s attached to.
  3. Scan through the result and send notification events to vRealize Operations for each violation found.
  4. Define a symptom that triggers based on the event and a compliance alert based on the symptom.
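
Below is a rough sketch of those four steps using pyVmomi. It is not the actual script (see the GitHub repository for the real code), and the send_event helper is just a placeholder for whatever mechanism you use to post the notification event to vRealize Operations.

from pyVim.connect import SmartConnectNoSSL
from pyVim.task import WaitForTask

si = SmartConnectNoSSL(host="vcenter.example.com", user="user", pwd="secret")
pm = si.content.hostProfileManager

for profile in pm.profile:                  # 1. All host profiles known to vCenter
    if not profile.entity:
        continue                            # Skip profiles not attached to any host
    task = profile.CheckProfileCompliance_Task(entity=profile.entity)  # 2. Trigger the check
    WaitForTask(task)
    for result in task.info.result:         # 3. Scan through the results
        for failure in result.failure or []:
            # 4. Forward each violation to vRealize Operations as a notification event
            send_event(result.entity.name, failure.message.message)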

Running the script

For complete information about installing and running the script, refer to the Github site: https://github.com/prydin/vrops-import-hostprops

However, running the script is simple. Create a config file as described on Github and run this command:

python import-host-profiles.py --config path/to/config/file

If you want the check to run periodically, you should put the script in your crontab on Linux or in a scheduled job on Windows. Checking compliance of a large number of hosts takes a while, so you should only run the script once or a couple of times per day.
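
For example, a crontab entry that runs the check twice a day might look like this (the install path is just for illustration):

0 6,18 * * * /usr/bin/python /opt/vrops-import-hostprops/import-host-profiles.py --config /path/to/config/file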

The result

Once you have imported the alert definition and captured some host profile violations, you should see them under the Analysis->Compliance sub tab. It should look something like below. Of course, you can trigger email notifications or webhook notifications based on the compliance alert.

[Screenshot: compliance alerts under the Analysis -> Compliance sub tab]

Useful links

Code, installation instructions and manual:

https://github.com/prydin/vrops-import-hostprops

 


Geek corner: Finding Patterns in Wavefront Time Series Data using Python and SciPy

Introduction

Today we’re going to geek out even more than usual. Follow me down memory lane! My first job was to write code that processed data from testing integrated circuits. That was an eternity ago, but what I learned about signal processing turned out to be applicable to pretty much any kind of time series data.

So I thought I should play around with some of the time series data I have stored in Wavefront.

Of course, it deserves to be said that Wavefront can do almost any math natively in its web user interface. So I really had to think hard to come up with something that would give me an excuse to play around with the Wavefront API and some Python code.

Here’s the example I came up with: find out which objects exhibit the highest degree of periodic self-similarity on some metric. For example, tell me which VMs have a CPU utilization that has the strongest tendency to repeat itself over some period (e.g. daily, hourly, every 23.5 minutes, etc.).

A possible application for this may be around capacity planning. By understanding which workloads have a tendency to vary strongly and predictably over some period of time, we may be able to place workloads in such a way that their peak demands don’t coincide.

The first few paragraphs of this post should be interesting to anyone wanting to do advanced data processing of Wavefront data using Python. The final paragraphs about the math are mostly me geeking out for the fun of it. It can be safely ignored if it doesn’t interest you.

Problem Statement

The purpose of the tool we’re building in this post is to answer the question: “Which objects have the strongest periodic variation (seasonality) on some metric?”.

For example, which VMs have a CPU load that tends to have a strong periodic variation? And furthermore, what is the period of that variation?

Usage

It should be pointed out that this tool would need some work before it’s useful in production. It is intended as an example of what can be done.

To obtain the code, clone the repository from here or download this single Python file.

To run the code, you have to set the following two environment variables:

  • WF_URL – The URL to your Wavefront instance, e.g. https://try.wavefront.com
  • WF_TOKEN – Your API token. Refer to the Wavefront documentation for how to obtain it.
Then run:

python fft.py name.of.metric

You can add the --plot flag at the end of the command to get a visual representation of the top periodic time series.
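
For example, to analyze the same metric and also show the plots:

python fft.py name.of.metric --plot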

The output will be a list of object names, period length in minutes and peak value.

Example:

[Screenshot: example CSV output from fft.py]

Let’s pick an example and examine it!

nsx-mgr-west,31.45390070921986,0.17699407037085668

The first column is just the name of the object. The next column says that the pattern seems to repeat every 31.5 minutes. The third column is the relative spectral peak energy. The best way of interpreting that is to say that about 18% of the total “energy” of the signal can be attributed to a component that repeats itself every 31.5 minutes.

Let’s have a look at the original time series in Wavefront and see if we were right!

[Screenshot: the original time series in Wavefront]

Yes indeed! It sure looks like this signal is repeating itself about every 30 minutes.

A Note on Accuracy

Why did we get a result of 31.5 minutes? Isn’t it a little strange that we didn’t get an even 30 minutes? Yes, it is, and there’s a reason for it: the algorithm introduces false accuracy. In our example we’re processing 5-minute samples, so the resolution will never get better than 5 minutes. Therefore, you should round the results to the nearest 5 minutes; the 31.45 minutes above really means 30 minutes.

Data Source

The data for this example comes from one of our demo vRealize Operations instances and was exported to Wavefront using the vRealize Operations Export tool that can be found here.

Design

The design is very simple: We use Python to pull down time series data for some set of objects. Then we’re using the SciPy and NumPy libraries to perform the mathematical analysis.

SciPy

SciPy (which is based on NumPy) is an extensive library for scientific mathematics. It offers a wide range of statistical functions and signal processing tools, making it very useful for advanced processing of time series data.

High Level Summary of the Algorithm

There are many ways of finding self similarity in time series data. The most common ones are probably spectral analysis and autocorrelation. Here is a quick comparison of the two:

  • Spectral Analysis: Transform the time series to the Frequency Domain. This rearranges the data into a set of frequencies and amplitudes. By finding the frequency with the highest amplitude, we can determine the most prevailing periodicity in the time series data.
  • Autocorrelation: Suppose a time series repeats itself every 1 hour. If you correlate the time series with a time-shifted version of the same series, you should get a very good correlation when the time shift is 1 hour. Autocorrelation works by performing correlations with increasing time shifts and picking the time shift that gave the best correlation. That should correspond to the length of the periodicity of the data.

Autocorrelation seems to be the most popular algorithm. However, for very noisy data, it can be hard to find the best correlation among all the noise. In our tests, we found that spectral analysis (or spectral power density analysis, to be exact) gave the best results.

To find the highest peak in the frequency spectrum, we use the “power spectrum”, which simply means that we square the result of the Fourier transform. This exaggerates any peaks and makes it easier to find the most prevalent frequencies. It also creates an interesting relationship to the autocorrelation function. If you’re interested, we’ll geek out over that towards the end of this article.

Once we’ve done that for all our objects, we sort them based on the highest peak and print them. To make the scoring fair, we calculate the total “energy” (sum of “power” over time) and divide the peak amplitudes by that. This way, we measure the percentage that a peak contributes to the total “energy” of the signal.

Code Highlights

Pulling the data

Before we do anything, we need to pull the data from Wavefront. We do this using the Wavefront REST API.

import time
import requests

# Align the samples to a fixed granularity and interpolate gaps before querying
query = 'align({0}m, interpolate(ts("{1}")))'.format(granularity, metric)
start = time.time() - 86400 * 5 # Look at 5 days of data
result = requests.get('{0}/api/v2/chart/api?q={1}&s={2}&g=m'.format(base_url, query, start),
                      headers={"Authorization": "Bearer " + api_key, "Accept": "application/json"})

Next, we rearrange the data so that for each object we have just a list of samples that we assume are spaced 5 minutes apart.

import json

candidates = []
data = json.loads(result.content)
for object in data["timeseries"]:
    samples = []
    for point in object["data"]:
        samples.append(point[1])  # point[0] is the timestamp, point[1] is the value

Normalizing the data

Now we can start doing the math. First we normalize the data so that it has an average of 0 and values roughly in the range -1 to 1. It’s important to do this to remove any constant bias (sometimes referred to as the “DC component”) from the signal.

import numpy as np

samples = np.array(samples)    # Work with a NumPy array from here on
top = np.max(samples)
bottom = np.min(samples)
mid = np.average(samples)
normSamples = (samples - mid)  # Remove the constant bias ("DC component")
normSamples /= top - bottom    # Scale the values to roughly [-1, 1]

Turn into a power spectrum using the fft function from SciPy

Now that we have the data in a form we like, we use a Fast Fourier Transform (FFT) to go from the time domain to the frequency domain. The output of an FFT is a series of complex numbers, so we take the absolute value (magnitude) to turn them into real numbers. Finally, we square every value. This exaggerates any frequency spikes and also has a cool relationship to autocorrelation that we’ll discuss later.

from scipy.fftpack import fft  # SciPy's FFT implementation

spectrum = np.abs(fft(normSamples))  # Magnitude of the complex FFT output
spectrum *= spectrum                 # Square it to get the power spectrum

At this point, we should have a spectrum that, for a good match, looks something like this. This corresponds to a signal repeating at a 60 minute interval.

[Screenshot: power spectrum with a peak corresponding to a 60-minute period]

The algorithm we use has very low accuracy at low frequencies (long cycles), so we drop the first few samples. It’s also important to sample enough of the signal: if we want to detect behaviors that repeat over, say, 24 hours, we should sample at least 5 days of data. This way, we can discard the lowest frequencies and still find what we’re looking for.

offset = 5                    # Number of low-frequency bins to discard
spectrum = spectrum[offset:]

Now we can find the peak by simply looking for the highest value in the array. Notice that we’re only looking in the lower half of the array. This is because the FFT of a real-valued signal is symmetric (an effect related to aliasing around the Nyquist frequency): the latter n/2 samples are just a mirror image of the first n/2.

Finding the peak and normalizing

n = len(normSamples)                       # Total number of samples in the series
maxInd = np.argmax(spectrum[:int(n/2)+1])  # Only the first half of the spectrum is meaningful

Next we scale the peak we found. In order to compare this peak with peaks found for other objects, we need to somehow normalize it. A straightforward way of doing this is to estimate how much of the total energy this peak contributed. Energy is the integral of power over time, but since we’re operating in a discrete-time world, we can calculate it as a simple sum over the spectrum. The result will be the relative contribution this frequency makes to the total energy.

energy = np.sum(spectrum)
top = spectrum[maxInd] / energy

The final step of the math is to convert the raw frequency (as expressed by an index in the spectrum array) to a period length.

lag = 5 * n / (maxInd + offset)  # Convert the bin index back to a period in minutes (5-minute samples)
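
As a sanity check: 5 days of 5-minute samples gives n = 1440, so a peak at, say, spectrum index 224 (with offset 5 already removed) would translate to 5 * 1440 / 229 ≈ 31.4 minutes, which is the period we saw for nsx-mgr-west above.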

Filtering the best matches

We consider a peak a “good” match when its local spectral energy is at least 10% of the total energy. This is an arbitrary threshold, but it seems to work quite well.

if top > 0.1:
    entry = [ top, lag, object["host"] ]
    candidates.append(entry)

Sorting and outputting

Finally, once we’ve gone through every object, we sort the list based on relative peak strength and output as CSV.

best_matches = sorted(candidates, key=lambda match: match[0], reverse=True)
for m in best_matches:
    print("{0},{1},{2}".format(m[2], m[1], m[0]))

Extra Geek Credits: Autocorrelation vs. Power Spectrum

If you’ve made it here, I applaud you. You should have a pretty good understanding of the code by now.  It’s about to get a lot geekier…

Autocorrelation

Mathematically, autocorrelation is a very simple operation. For a real-valued signal, you simply sum up every sample of the signal multiplied by a sample from the same signal, but with a time shift. It can be written like this:

a(τ) = Σ x(t)·x(t+τ)

Let’s say a signal is repeating itself every 60 minutes. If we set the lag, τ, to 60 minutes, we should get a better correlation than for any other value of τ, since the signal should be very similar between now and 60 minutes ago.

So one way to find periodic behavior is to repeatedly calculate the autocorrelation and sweep the value of τ from the shortest to the longest period you’re interested in. When you find the highest value of your autocorrelation, you have also found the period of the signal.
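
Here is a small NumPy sketch (illustrative only, not part of the tool) of that brute-force sweep, assuming samples is a mean-centered array of evenly spaced values:

import numpy as np

def autocorrelation_sweep(samples, max_lag):
    scores = []
    for tau in range(1, max_lag):
        # Correlate the series with a copy of itself shifted by tau samples
        scores.append(np.sum(samples[:-tau] * samples[tau:]))
    return np.array(scores)

best_lag = np.argmax(autocorrelation_sweep(samples, 288)) + 1  # 288 = one day of 5-minute samples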

Well, at least in theory.

What ends up happening is that a lot of times, it isn’t obvious where the highest peak is and it could be affected by noise, causing a lot of random errors. Consider this autocorrelation plot, for example.

[Screenshot: a noisy autocorrelation plot with no obvious peak]

One thing we notice, though, is that autocorrelations of signals that have meaningful periodic behaviors are themselves periodic. Here’s a good example:

[Screenshot: autocorrelation of a strongly periodic signal, itself clearly periodic]

So maybe looking for the period of the autocorrelation would be more fruitful than hunting for a peak? And finding the period implies finding a frequency, which sounds almost like a Fourier Transform would come in handy, right? Yes it does. And there’s a beautiful relationship between the autocorrelation and the Fourier Transform that will take us there.

The final piece of the puzzle

If you’ve read this far, I’m really impressed. But here’s the big reveal:

It can be proven that autocorrelation and FFT are related as follows:

F_R(f) = FFT[X(t)]
S(f) = F_R(f)·F_R*(f)
R(τ) = IFFT[S(f)]

Where R(τ) is the autocorrelation function.

In other words: The autocorrelation is the Inverse FFT of the FFT of the signal, multiplied by the complex conjugate of the FFT of the signal.

But a complex number multiplied by its own conjugate is equal to the square of its absolute value. So S(f) is really the same as |FFT[X(t)]|². And since S(f) is the step right before the final Inverse FFT, we can conclude that it must be the FFT of the autocorrelation. So we can comfortably state:

The power spectrum of a signal is the exact same thing as the FFT of its autocorrelation!
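
If you want to convince yourself numerically, here is a small NumPy sketch (not part of the tool) that checks the identity on random data:

import numpy as np

x = np.random.rand(256) - 0.5                 # A random signal with roughly zero mean
spectrum = np.fft.fft(x)
power = spectrum * np.conj(spectrum)          # S(f) = F_R(f)·F_R*(f) = |FFT[x]|^2
acf_via_fft = np.fft.ifft(power).real         # R(τ) = IFFT[S(f)]

# Direct (circular) autocorrelation for comparison
acf_direct = np.array([np.sum(x * np.roll(x, -t)) for t in range(len(x))])

print(np.allclose(acf_via_fft, acf_direct))   # True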

Why it’s cool

Not only is the power spectrum a nice way of exaggerating peaks to make them easier to find; its peak can also be interpreted as a measure of the prevailing period of the autocorrelation function, which, in turn, gives a good indication of the prevailing period of the data we’re analyzing.

Caveats

As I said in the beginning of this post, this code isn’t production quality. There are a few things that should be addressed before it’s useful to a larger audience.

  • All data is loaded into memory. That obviously won’t scale. Instead, we should use a streaming JSON parser and maybe split the query up into smaller chunks.
  • The calculation of the local spectral energy is very naive. It’s assuming that all the energy is concentrated in a single spike of unity length. It should be adjusted to account for spikes that are spread out across the spectrum.
  • The accuracy, especially at low frequencies, is rather poor. One possible improvement would be to use FFT to find the candidates and autocorrelation to fine tune the result.

Conclusion

The idea behind this post was to discuss how Python and a math library like SciPy can be used to analyze Wavefront data. Feel free to comment and to fork the code and make improvements!

 

A Simple Template for vR Ops Python Agents

Background

Since I wrote the post about vraidmon, I’ve gotten some requests for a more generic framework. Although what I’m describing here isn’t so much a framework as a simple template, some may find it useful. You can find it here.

What does it do?

In a nutshell, this template helps you with three things:

  1. Parsing a config file
  2. Simplifying metric and properties posting
  3. Simplifying management of parent-child relationships.

The config file

The template takes the name of a configuration file as its single argument. The file contains a list of simple name/value pairs. Here is an example:

vropshost=vrops-01.rydin.nu
vropsuser=admin
vropspassword=secret
adapterSpecific1=bla
adapterSpecific2=blabla

The vropshost, vropsuser and vropspassword are mandatory and specify the host address, username and password for connecting to vR Ops. In addition to this, adapter specific configuration items can be specified.

Writing the adapter

[Screenshot: the template code with the “YOUR CODE GOES HERE” marker]

Using this template is very easy. Just locate the “YOUR CODE GOES HERE” marker and start coding! The only requirement is that you have to set the adapterKind and adapterName variables.

Setting the adapter type and name

The adapterKind is the “adapter kind key” for the adapter you’re building. It basically tells the system what kind of adapter you are. You can really put anything in here, but it should be descriptive of the adapter type, such as “SAPAdapter” or “CISCORouterAdapter”.

The “adapterName” is the name of this particular instance of the adapter. There could very well be multiple adapters of the same type, so a unique name is needed to tell them apart.

Posting data

Once we’ve got the adapter naming out of the way, we can start posting data. This is done through the “sendData” function. It takes 5 parameters:

  1. The vR Ops connection we’re using
  2. The resource type we’re posting for. This should uniquely specify what kind of object the data is for. E.g. “CISCORouterPort” if we’re posting data about a port.
  3. Resource name. This uniquely identifies the object we’re posting data for, e.g. “port-4711”.
  4. A name-value pair list of metrics we’re posting for the resource, e.g. { “throughput”: 1234, “errors”: 55 }
  5. A name-value pair list of properties we’re posting e.g. { “state”: “functional”, “vendor”: “VMware” }

The sendData function returns the resource we posted data for. Note that if the resource didn’t already exist, it will be created for you.
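
For example, a call following the parameter list above could look something like this (the resource kind, name, metrics and properties are made up for illustration):

port = sendData(vrops,                                        # 1. The vR Ops connection
                "CISCORouterPort",                            # 2. Resource type
                "port-4711",                                  # 3. Resource name
                {"throughput": 1234, "errors": 55},           # 4. Metrics
                {"state": "functional", "vendor": "VMware"})  # 5. Properties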

Building relationships

Sometimes you want your monitored resources to be connected to some parent object. For example, if you are monitoring tables in a database, you may want them to be children of a database object. This way, they will show up correctly in dependency graphs in vR Ops and your users will find it easier to navigate to them.

To form a parent-child relationship, use the addChild function. It takes the following parameters:

  1. The vR Ops connection to use
  2. The adapter kind of the parent object
  3. The adapter name of the parent object
  4. The resource kind of the parent object
  5. The resource name of the parent object
  6. The child resource to add
  7. (Optional) A flag determining whether parent objects should be automatically created if they don’t exist. By default, they are not automatically created and adding a child results in a no-op.

Here’s an example of an agent using this template that monitors Linux processes. As you can see, we’ve used the “addChild” function to add the Linux process objects to a Linux host.

[Screenshot: example agent code that monitors Linux processes]
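
Roughly, the “YOUR CODE GOES HERE” part of such an agent could look like the sketch below. The kind names, parent names and the list_processes helper are made up for illustration:

adapterKind = "LinuxProcessAdapter"   # Required: the adapter kind key
adapterName = "LinuxProcessAgent-01"  # Required: the name of this adapter instance

for proc in list_processes():         # Hypothetical helper returning process info
    resource = sendData(vrops, "LinuxProcess", proc["name"],
                        {"cpuUsage": proc["cpu"], "residentMemory": proc["mem"]},
                        {"owner": proc["user"]})
    # Attach each process object to its parent Linux host object,
    # creating the parent automatically if it doesn't exist yet
    addChild(vrops, "LinuxHostAdapter", "LinuxHostAdapter-01",
             "LinuxHost", "linux-host-01", resource, True)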

Downloadable content

Github repository: https://github.com/prydin/vrops-py-template

vraidmon – A vRealize Operations Adapter Written in Python

Background

Two events coincided today: I realized that I’ve been running without redundancy on a Linux software RAID, and I had some time to spare for a change. So I decided to write a simple vR Ops agent to monitor a software RAID, and, since I had some time on my hands, I wanted to write it in Python instead of my go-to language Java. This is my very first Python program, so bear with me.

Learning Python proved to be very easy. It’s a very intuitive language. I’ve got some issues with loosely typed languages, but for a quick hack I’m willing to put that aside. What turned out to be a bit more difficult was to find any good examples of agents for vR Ops written in Python. The blog vRandom had a getting started guide that allowed me to understand how to install and get the thing running, but how to code the actual agent was harder to find. But after some trial and error I got the hang of it and I decided to share my experience here.

A quick look at the code

Initializing

The client library is called “nagini”, so before we can do anything, we need to import it:

import nagini

Next, we need to obtain a connection to vR Ops. This is done like this:

vrops = nagini.Nagini(host=vropshost, user_pass=(vropsuser, vropspassword))

It’s worth noticing that this doesn’t actually make the connection. It just sets up the connection parameters. So if you do something wrong, like mistyping the password, this code will gladly execute but you’ll get an error later on when you’re trying to use the connection.

Sending the data

Here is where things get quite interesting and the Python client libraries show their strength. While all the REST methods are exposed to the programmer, the libraries also add some nice convenience methods.

Normally, when you post metrics for a resource, you need to do it in three steps:

  1. Look up the resource to obtain its ID.
  2. If the resource doesn’t exist, create it.
  3. Send the metrics using the resource ID you just obtained by looking it up or creating it.

The vR Ops Python client can do this in a single call! All we have to do is to put together a somewhat elaborate data structure and send it off. Let’s have a look!

[Screenshot: code building the stats and properties structures and posting them]

Wow, that’s a lot of code! But don’t despair! Let’s break it down. In this example, I’m sending both stats and properties. If you recall, stats are numeric values, such as memory utilization and remaining disk space, whereas properties are strings and typically more static in nature. Let’s examine the stats structure.

The REST API states that stats should be sent as a single member called “stat-content”. That member contains an array of sub-structures, each representing a sample. Each sample has a “statKey” holding the name of the metric, a “timestamp” member and a “data” member holding the actual data value. As you can see above, I’m sending three values: “totalMembers”, “healthyMembers” and “percentHealthy”.

Now that we have our fancy structure, we can send it off using the find_create_resource_push_data method. It does exactly what the name implies: it tries to find the resource, creates it if it doesn’t exist, and then pushes the data. To make it work, we need to send the name of the resource, the name of the resource kind, the adapter kind, and the stats and/or properties. And that’s it! We’ve now built an adapter with Python using a single call to the vR Ops API!
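
Pieced together from the description above, the structure and the call could look roughly like the sketch below. The resource and kind names are made up, the REST payload actually carries arrays of timestamps and data values per statKey, and the exact keyword arguments of find_create_resource_push_data should be checked against the client library docs:

timestamp = long(time.time() * 1000)  # More on this below

stats = {"stat-content": [
    {"statKey": "totalMembers",   "timestamps": [timestamp], "data": [total]},
    {"statKey": "healthyMembers", "timestamps": [timestamp], "data": [healthy]},
    {"statKey": "percentHealthy", "timestamps": [timestamp], "data": [100.0 * healthy / total]}
]}

vrops.find_create_resource_push_data("md0",                          # Resource name (illustrative)
                                     stats=stats,
                                     resourceKindKey="SoftwareRAID", # Illustrative resource kind
                                     adapterKindKey="vraidmon")      # Illustrative adapter kind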

What about the timestamp?

But wait, there’s one more thing we need to know. The vR Ops API expects us to send a timestamp as a Java-style epoch-milliseconds long integer. How do we obtain one of those in Python?

Easy! We just use the time.time() method (you have to import “time” first). This gives us the epoch seconds (as opposed to milliseconds) as a float, so we have to multiply it by 1000 and convert it to a long. It will look something like this:

timestamp = long(time.time() * 1000)

Installing and running

Prerequisites

First of all, you need to install python (obviously). Most likely it’s already installed, but if it isn’t, just go to the package manager on your system and install it. Next, you need the client libraries for vR Ops. They are located on your vR Ops server and can be downloaded and installed like this:

mkdir tmp
cd tmp
wget --no-check-certificate https://<your vrops server>/suite-api/docs/bindings/python/vcops-python.zip
unzip vcops-python.zip
python setup.py install

Installing the agent

Download the agent from here. Unzip it in the directory where you want to run it, e.g. /opt/vraidmon

mkdir /opt/vraidmon
unzip <path to my zip>

Next, we need to edit the configuration file vraidmon.cfg. You need to provide hostname and login details for your vR Ops. The name of the raid status file should remain /proc/mdstat unless you have some very exotic version of Linux. Here is an example:

vropshost=myvrops.mydomain.com
vropsuser=admin
vropspassword=topsecret!
mdstat=/proc/mdstat

Now you can run the agent using this command:

python vraidmon.py vraidmon.cfg

You’ll probably get some warning messages about certificates not being trusted, but we can safely ignore those for now.

Making it automatic

Running the agent manually just sends a single sample. We need some way of making this run periodically and the simplest way of doing that is through a cron job. I simply added this to my crontab (ignore any line breaks and paste it in as a single line):

*/5 * * * * /usr/bin/python /opt/vraidmon/vraidmon.py /opt/vraidmon/vraidmon.cfg 2> /dev/null

Adding alerts

In order for this adapter to be useful, I added a simple alert that fires when the percentage of healthy members drops below 100%. This is what a simulated failure looks like:

[Screenshot: alert triggered by a simulated RAID failure]

Downloadable material

Python source code and sample config file.

Alert and symptom definition.