Something that could be very cool would be AI-driven monitoring analysis and troubleshooting pre-work to support operations staff. Figuring out what exactly is going wrong in a complex, distributed system often requires both the system and the operator to ingest a lot of information before anyone can even make an educated guess at what's going on. This is something an AI could really help with: ingest all of the information over some time period, and output a number of guesses or possible problems, each with a probability and possible root causes assigned to it.
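To make that concrete, here is a minimal sketch of what such an analysis output could look like, assuming the black box hands the operator a ranked list of hypotheses rather than a single verdict. The names (`Hypothesis`, `IncidentAnalysis`, and all their fields) are hypothetical, invented purely for illustration; nothing like this exists in any particular tool.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """One possible explanation for the observed outage."""
    description: str                 # e.g. "postgres cluster failing leader election"
    confidence: float                # 0.0 - 1.0, as estimated by the model
    suspected_root_causes: list[str] = field(default_factory=list)
    supporting_indicators: list[str] = field(default_factory=list)  # metric/log references
    matching_runbooks: list[str] = field(default_factory=list)      # links into a runbook repo

@dataclass
class IncidentAnalysis:
    """What the operator gets at 3am: guesses with probabilities, not certainties."""
    affected_systems: list[str]
    hypotheses: list[Hypothesis]

    def ranked(self) -> list[Hypothesis]:
        # Most likely explanation first, so the operator can triage top-down.
        return sorted(self.hypotheses, key=lambda h: h.confidence, reverse=True)
```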
Think of an operator logging into the system at 3am with a C&C EVA voice calling and greeting them: "Hello Operator. We have 24 customer-facing systems offline. Preliminary analysis indicates that these are caused by 3 internal services and 2 postgres clusters being offline. With a confidence of 83%, the internal services are also offline due to the postgres clusters. I will now send you what I have found out about the postgres clusters, with a probability and monitoring indicators for each scenario. It looks like the clusters are failing to elect a leader after several network-caused timeline switches, with a confidence of 72%. Highly matching runbooks exist for 4 of the 7 highest-probability scenarios."
Well, maybe I'd prefer that in text on a website, but no need to be that serious right now.
It is possible to get monitoring of this quality even with existing tools and maybe some internal extensions of them, sure. But that's a barrel you can pour effort into with no bottom at all, and it only gets harder the more complex and the more dynamic your environment is. It'd be great to dump all of that into a black box, even if that black box just establishes a global timeline of events.
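Even that "just a global timeline" piece has a simple core: normalize every source's events to UTC and merge the per-source streams into one ordered sequence. Here is a minimal sketch of that idea; `Event` and `global_timeline` are made-up names for illustration, and it assumes each source adapter already emits its events sorted by time.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator

@dataclass(order=True)
class Event:
    timestamp: datetime   # assumed already normalized to UTC by the source adapter
    source: str           # e.g. "pg-cluster-2", "loadbalancer", "k8s-events"
    message: str

def global_timeline(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge per-source event streams (each sorted by time) into one global timeline."""
    return heapq.merge(*streams)

# Usage: one pre-sorted stream per monitoring source, merged into a single view.
timeline = global_timeline(
    [Event(datetime(2024, 1, 1, 3, 0, 12, tzinfo=timezone.utc), "pg-cluster-2", "timeline switch")],
    [Event(datetime(2024, 1, 1, 3, 0, 9, tzinfo=timezone.utc), "network", "packet loss on uplink")],
)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.message)
```

The hard part in practice isn't the merge, it's getting every source onto comparable, trustworthy timestamps in the first place, which is exactly the kind of grind you'd love to hand off to the black box.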