Agent Health

Like many aspects of the business that COVID-19 shook up, endpoint agent health became a significant area of concern. Healthy agents ensure you have visibility into device activity, so how do you validate that agents are healthy? At Code42 we started attacking this problem when the majority of our workforce was still in the office, and it became apparent that this wasn’t going to be an easy task. First, there are a number of important questions that need to be answered before you can start to build a solution, which I will address later on. Next, you’ll need to do your research and gather requirements from your stakeholders. Then you can start to develop your workflow and logic to meet your expected end state. After you’ve come up with a framework to get the results you’re hoping to achieve, it’s time to build. Lastly, if you’re anything like me, it’s time to loop back and make sure you’re providing high-fidelity data that meets the original requirements.

You might be wondering how we got into this mess in the first place. Well, we have a Mobile Device Management (MDM) solution in place, polling loads of information from our endpoints. Our first challenge was that we are mostly a Mac shop with a growing Windows population, so our Mac-only MDM solution doesn’t cover all of our endpoints. I’m still working on building our Windows-specific script. Then we looked at the data our MDM solution was polling and asked:

  • Is the data timely?
  • Is the data accurate?
  • Is the data subject to manual intervention to be correct?

I think you’re starting to see the problems we determined we needed to address. Surely there’s a tool out there that can solve this problem for us, right?! Much to our surprise, the answer is “no”. So here we are. Hopefully, if you’ve come this far, I can provide some useful information to help you tackle this problem in your organization.

Before starting any project, it’s very important to gather requirements from your stakeholders. For me, these were our Endpoint Manager and our SecOps team. A few useful things to think about when coming up with requirements:

  • How many endpoints does your organization have deployed?
  • What is your mix of endpoint operating systems?
  • How many different agents are installed on those endpoints?
  • What is considered “Healthy” for those endpoints?

For the most part those were all pretty straightforward for us; we’re a relatively small fish and don’t have loads of bloat or agent sprawl. So I’ve got the endpoints I’m interested in and I’ve got a list of the agents we want to track, but what do you consider “Healthy”? I find this one is best answered by the data, and it’s not going to be a blanket answer.

Now we are on to the easy part, building your logic! Not. The biggest gripe I came across in this whole process is the lack of a globally unique identifier across agents! Without going to a very dark place, why can’t agents use the unique identifier the manufacturer provided? How about Serial Number! You’ll need to get creative with your logic to handle the different identifiers each API allows you to query. I ended up getting a full data dump from each of our tools to start working out my keys across the different tools and how I would store those key-value pairs in my script; more to come in the coding section. Our end goal was to report on, for each agent:

  • Unique identifier
  • Agent Version
  • Last Check-In
  • Errors or Alerts for an endpoint

Seems straightforward, but remember I said you may need to get creative.
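To make the cross-tool key idea concrete, here is a minimal sketch of one way to line up records from different agents under a single serial number. It is not taken from the linked script; the `AgentRecord` fields, the `group_by_serial` helper and the shape of the lookup table are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AgentRecord:
    """One agent's view of one endpoint (hypothetical field names)."""
    agent: str               # which tool reported this record
    tool_id: str             # that tool's own identifier for the device
    version: str             # agent version
    last_check_in: datetime  # last time the agent phoned home
    alerts: int = 0          # open errors or alerts for the endpoint


def group_by_serial(records, id_to_serial):
    """Group agent records under a shared serial number.

    id_to_serial maps (agent, tool_id) -> serial number. It is assumed to
    have been built beforehand, for example from each tool's data dump.
    Records with no known serial are returned separately for review.
    """
    report = {}
    unmatched = []
    for rec in records:
        serial = id_to_serial.get((rec.agent, rec.tool_id))
        if serial is None:
            unmatched.append(rec)
            continue
        # Tools may disagree on case and whitespace, so normalize the key.
        key = serial.strip().upper()
        report.setdefault(key, {})[rec.agent] = rec
    return report, unmatched
```

The unmatched list is worth keeping around: records that can’t be tied back to a serial number are often exactly the endpoints that need attention.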

We immediately ran into issues with the planned logic, probably a failure on my part in assuming how and what I’d be able to query. Certain tools don’t allow you to query by values that are recorded for a given device, only by the tool-specific unique identifier. My solution:

Dear Tool, 
Kindly give me every unique identifier you have.
Love,
Mike

I iterated over that list and plucked the info I needed for each device. Did it work? You bet, but it wasn’t the best way to go about it. Most agents have commands that can be run locally and can provide some very useful information, such as unique identifiers! Huzzah. For this specific tool we put our MDM to work and had it record the unique identifier in a custom field, which I now use. My biggest piece of advice: make sure your source of truth for deployed endpoints is correct! This will save you time, because you can simply filter out endpoints you don’t care about.
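The “give me everything, then pluck” approach can be sketched roughly as follows. The two callables stand in for whatever a given tool’s API client offers; `list_ids`, `get_device` and the `serial_number` field are hypothetical names, not a real vendor API.

```python
def collect_devices(list_ids, get_device, inventory_serials):
    """Fetch every device a tool knows about and keep only the ones we track.

    list_ids:          callable returning every tool-specific identifier
    get_device:        callable returning a dict of details for one identifier
    inventory_serials: serial numbers from the source of truth (e.g. the MDM)
    """
    wanted = {s.strip().upper() for s in inventory_serials}
    kept = {}
    for device_id in list_ids():
        device = get_device(device_id)  # one request per device: slow but simple
        serial = (device.get("serial_number") or "").strip().upper()
        if serial in wanted:
            kept[serial] = device
    return kept
```

The cost is one lookup per device, which is why having the MDM record the tool’s identifier up front is the nicer option when it’s available.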

While getting the data is important, the fidelity of that data is what will let you sleep at night. Check your data! Choose an appropriate sample size for your organization and validate the data you provide against what the SaaS solutions say. Depending on what your tool sets do, you may run into scenarios where endpoints have multiple entries or stagnant entries. Circle back with your stakeholders and make sure what you’re providing is meaningful and actionable. Lastly, come up with a cadence that makes sense not only for running your script but also for getting together with your stakeholders to see if adjustments are necessary.
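As a rough illustration of handling multiple and stagnant entries, the sketch below keeps the most recent entry per serial number and flags anything that hasn’t checked in recently. The dictionary keys and the seven-day threshold are assumptions; the right cutoff is something to agree on with your stakeholders.

```python
from datetime import datetime, timedelta, timezone


def latest_entries(entries):
    """Collapse duplicate entries, keeping the newest check-in per serial."""
    latest = {}
    for entry in entries:
        current = latest.get(entry["serial"])
        if current is None or entry["last_check_in"] > current["last_check_in"]:
            latest[entry["serial"]] = entry
    return latest


def stale_serials(latest, now, max_age=timedelta(days=7)):
    """Return serials whose newest check-in is older than max_age.

    The 7-day default is an arbitrary placeholder, not a recommendation.
    """
    return sorted(
        serial
        for serial, entry in latest.items()
        if now - entry["last_check_in"] > max_age
    )
```

A quick usage example: call `latest_entries` on one tool’s export, then pass the result to `stale_serials` with `datetime.now(timezone.utc)` to get the list of endpoints to follow up on.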

So that’s it. Easy, right? The solution we created makes it really easy to find endpoints that need attention and to validate, in one location, the health of all of our endpoints across the organization. I hope I’ve provided something useful, insightful or thought-provoking. Until next time, keep after the pursuit of better.

Associated Python Script:
https://github.com/code42/redblue42/blob/master/endpoint_agent_health.py