DevOps Command Center

Why is this important?

Modern teams communicate via real-time messaging throughout the day, but not all messages are created equal. Unlike social messaging, workplace messaging is often process-driven rather than completely ad hoc, and these processes can be critical to the success of the organization. We’ve learned from our customers that they often have little visibility into these processes, let alone the ability to facilitate and optimize them. As a result, they rely on tribal knowledge and piecemeal solutions, which don’t scale well.

That is why we want to help teams collaborate to solve problems that are time-sensitive, recurring, and situational.

What are the use cases?

We believe incident response is one of the best use cases for real-time messaging. To start, we are focused on these three areas:

  • Build and deploy pipeline

  • Information security

  • Site reliability

For example, a business-critical web application may be reported as unresponsive, and because its underlying architecture spans multiple teams, those teams now need to stand it back up before SLAs are breached. Or a user suspects their data may have been compromised, and the security team needs to investigate every possibility and recover quickly to minimize damage.

Who is this for?

We are currently focused on organizations that have most if not all of the following characteristics:

  • Primarily uses Mattermost (regardless of edition), at least within part of the organization

  • Operates and maintains systems that have reliability and availability requirements

  • Aims to scale procedures and protocols in order to keep up with growth or meet mandates

  • Coordinates many (between 10 and 50) contributors and stakeholders during incident response

Here are some characteristics that signal an organization may not be a great fit right now:

  • Builds end-to-end custom solutions

  • Has little need for context transfer between contributors and stakeholders

    • E.g., small teams, rare external collaboration, strong tribal knowledge

  • Does not have an incident playbook or post-mortem procedure

What are other similar solutions?

Here is an overview of comparable solutions that we’ve found so far:

Standalone incident management

  • Pros: Comprehensive feature set (on-call schedule, escalation path, phone calls)

  • Cons: System info is siloed from conversations, slows down resolution and post-mortem

Chat-centric incident response

  • Pros: Available right where teams are collaborating in real-time

  • Cons: One more vendor to maintain (procurement, downtime, security, deployment etc.)

Security-specific orchestration

  • Pros: Powerful and customizable

  • Cons: Heavyweight; doesn’t help non-security teams

Ticketing workflow

  • Pros: Already implemented, familiar to teams

  • Cons: Struggles with the speed and noise of real-time collaboration

How is this different?

Mattermost has a couple of superpowers that allow us to solve the problem a bit differently.

Teams are already familiar with their messaging tool, and Mattermost can provide this functionality out of the box. This way, the solution is already deployed and works with all of your existing integrations.

We also found that chat is where real-time collaboration is anchored, especially in the use cases that we are targeting. It is the most complete record of process execution, from sharing information and holding discussions to making decisions and taking actions. As a result, it also has the most leverage to get teams to follow protocols.

Lastly, Mattermost is open source and can be self-hosted for better security and reliability, which matters given the sensitive nature of the data that is generated.

What is this going to do?

Our roadmap is organized around three components that are common across domains:

  1. Reactive: Incident response

    1. Centralize communication to reduce information fragmentation

    2. Improve context-transfer to minimize confusion

    3. Prompt contributors with next steps to speed up resolution

    4. Inform stakeholders of active incidents to raise awareness

  2. Retroactive: Post-mortem and reporting

    1. Visualize incident timeline to recognize bottlenecks in the process

    2. Replay investigations, discussions, and actions taken to extract lessons

    3. Chart trends in aggregate to identify broader changes that are needed

    4. Export channel transcripts to save externally or analyze further

  3. Proactive: Playbook and process design

    1. Template checklist prompts to give teams structure

    2. Auto-send messages with resources to ramp up new contributors

    3. Prepare multiple playbooks to save time looking up documentation

    4. Trigger actions in other tools to reduce repeated work
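
To make the three components a little more concrete, here is a minimal sketch of how a playbook, an active incident, and its timeline might relate to each other. It is an illustration only, not a description of how Mattermost implements or will implement any of this: the names `Playbook`, `Incident`, `ChecklistItem`, `next_step`, `complete`, and `timeline` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class ChecklistItem:
    """One checklist prompt inside a running incident (hypothetical model)."""
    title: str
    done_at: Optional[datetime] = None  # set when a contributor checks the item off


@dataclass
class Incident:
    """Reactive: a live incident that prompts contributors with next steps."""
    title: str
    checklist: List[ChecklistItem] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        # The first unchecked item is the prompt shown to contributors.
        for item in self.checklist:
            if item.done_at is None:
                return item.title
        return None

    def complete(self, step: str, when: Optional[datetime] = None) -> None:
        # Record when a step was finished so it can be replayed later.
        for item in self.checklist:
            if item.title == step and item.done_at is None:
                item.done_at = when or datetime.now(timezone.utc)
                return
        raise ValueError(f"no open step named {step!r}")

    def timeline(self) -> List[Tuple[datetime, str]]:
        # Retroactive: completed steps in chronological order for a post-mortem.
        done = [(i.done_at, i.title) for i in self.checklist if i.done_at is not None]
        return sorted(done)


@dataclass
class Playbook:
    """Proactive: a reusable template of checklist prompts."""
    name: str
    steps: List[str]

    def start_incident(self, title: str) -> Incident:
        # Each incident gets its own copy of the template's steps.
        return Incident(title=title, checklist=[ChecklistItem(s) for s in self.steps])
```

In this sketch, a team would define a `Playbook` once, call `start_incident` when something breaks, use `next_step` to prompt contributors, and read `timeline` afterwards to see where time was spent.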

How to get more info?