


Automatic Troubleshooting & ITSM System using EventBridge and Lambda
Introduction :
Folks, In IT Operations, it's a very generic task to monitor server metrices like utilization of cpu/memory and disk or filesystems, but in case any of the metrics gets triggered to be critical, then dedicated persons need to perform some basic troubleshooting by logging into server and find out the initial cause of utilization which person has to perform multiple times if he gets multiple same alert that creates boredom and not productive at all. So as a workaround, there can be a system developed which will react once alarm gets triggered and act on those instances by executing few basic troubleshooting commands. Just to summarize the problem statement and expectation -
Problem Statement :
Develop a system which will fulfill below expectations -
- Each EC2 instances should be monitored by CloudWatch.
- Once alarm gets triggered, something has to be there which will login to that affected EC2 instance and perform some basic troubleshooting commands.
- Then, create a JIRA issue to document that incident and add the output of commands in comment section.
- Then, send an automatic email with providing all alarm details and JIRA issue details.
Architecture Diagram :
Prerequisites :
- EC2 Instances
- CloudWatch Alarms
- EventBridge Rule
- Lambda Function
- JIRA Account
- Simple Notification Service
Implementation Steps :
A. CloudWatch Agent Installation and Configuration Setup :
Open Systems Manager console and click on "Documents"
Search for "AWS-ConfigureAWSPackage" document and execute by providing required details.
Package Name = AmazonCloudwatchAgent
Post installation, CloudWatch agent needs to be configured as per configuration file . For this, execute AmazonCloudWatch-ManageAgent document. Also, make sure JSON CloudWatch config file is stored in SSM Parameter.
Once you see that metrices are reporting to CloudWatch console, then create an alarm for CPU and Memory utilizations etc.B. Setup EventBridge Rule :
To track the alarm state changes, here, we have customized pattern a little to track alarm state changes from OK to ALARM only, not reverse one. Then, add this rule to a lambda function as a trigger.
{ "source": ["aws.cloudwatch"], "detail-type": ["CloudWatch Alarm State Change"], "detail": { "state": { "value": ["ALARM"] }, "previousState": { "value": ["OK"] } } }
- C. Create a Lambda Function for Sending Email and Logging an Incident in JIRA : This lambda function is created for multiple activities which is triggered by EventBridge rule and as a destination SNS topic is added by using AWS SDK(Boto3). Once EventBridge rule is triggered then sends JSON event content to lambda by which function captures multiple details to process in different way. Here, as of now we have worked on two type of alarms - i. CPU Utilization and ii. Memory Utilization. Once any of these two alarms are triggered and alarm state is changed from OK to ALARM, then EventBridge gets triggered which also triggered Lambda function to perform those tasks mentioned in the form code.
Lambda Prerequisites :
We need below modules to import for make the codes work -
- >> os
- >> sys
- >> json
- >> boto3
- >> time
- >> requests
Note: From above modules, except 'requests' module rest all are downloaded within a lambda underlying infrastructure by default. Importing 'requests' module directly will not be supported in Lambda. Hence, first, install request module in a folder in your local machine(laptop) by executing below command -
pip3 install requests -t <directory path> --no-user
_After that, this will be downloaded in the folder from where you are executing above command or where you want to store the module source codes, here I hope lambda code is being prepared in your local machine. If yes, then create a zip file of that entire lambda source codes with module. After that, upload the zip file to lambda function.
So, here we are performing below two scenarios -
1. CPU Utilization - If CPU utilization alarm gets triggered, then lambda function need to fetch the instance and login to that instance and perform top 5 high consuming processes. Then, it will create a JIRA issue and add the process details in the comment section. Simultaneously, it will send an email with alarm details and jira issue details with process output.
2. Memory Utilization - Same approach as above
Now, let me reframe the task details which lambda is supposed to perform -
- Login to Instance
- Perform Basic Troubleshooting Steps.
- Create a JIRA Issue
- Send Email to Recipient with all Details
Scenario 1: When alarm state has been changed from OK to ALARM
First Set (Define the cpu and memory function) :
################# Importing Required Modules ################ ############################################################ import json import boto3 import time import os import sys sys.path.append('./python') ## This will add requests module along with all dependencies into this script import requests from requests.auth import HTTPBasicAuth ################## Calling AWS Services ################### ########################################################### ssm = boto3.client('ssm') sns_client = boto3.client('sns') ec2 = boto3.client('ec2') ################## Defining Blank Variable ################ ########################################################### cpu_process_op = '' mem_process_op = '' issueid = '' issuekey = '' issuelink = '' ################# Function for CPU Utilization ################ ############################################################### def cpu_utilization(instanceid, metric_name, previous_state, current_state): global cpu_process_op if previous_state == 'OK' and current_state == 'ALARM': command = 'ps -eo user,pid,ppid,cmd,%mem,%cpu --sort=-%cpu | head -5' print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}') # Start a session print(f'Starting session to {instanceid}') response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]}) command_id = response['Command']['CommandId'] print(f'Command ID: {command_id}') # Retrieve the command output time.sleep(4) output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid) print('Please find below output -\n', output['StandardOutputContent']) cpu_process_op = output['StandardOutputContent'] else: print('None') ################# Function for Memory Utilization ################ ############################################################### def mem_utilization(instanceid, metric_name, previous_state, current_state): global mem_process_op if previous_state == 'OK' and current_state == 'ALARM': command = 'ps -eo user,pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -5' print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}') # Start a session print(f'Starting session to {instanceid}') response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]}) command_id = response['Command']['CommandId'] print(f'Command ID: {command_id}') # Retrieve the command output time.sleep(4) output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid) print('Please find below output -\n', output['StandardOutputContent']) mem_process_op = output['StandardOutputContent'] else: print('None')
Second Set (Create JIRA Issue) :
################## Create JIRA Issue ################ ##################################################### def create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val): ## Create Issue ## url ='https://<your-user-name>.atlassian.net//rest/api/2/issue' username = os.environ['username'] api_token = os.environ['token'] project = 'AnirbanSpace' issue_type = 'Incident' assignee = os.environ['username'] summ_metric = '%CPU Utilization' if 'CPU' in metric_name else '%Memory Utilization' if 'mem' in metric_name else '%Filesystem Utilization' if metric_name == 'disk_used_percent' else None metric_val = metric_val summary = f'Client | {account} | {instanceid} | {summ_metric} | Metric Value: {metric_val}' description = f'Client: Company\nAccount: {account}\nRegion: {region}\nInstanceID = {instanceid}\nTimestamp = {timestamp}\nCurrent State: {current_state}\nPrevious State = {previous_state}\nMetric Value = {metric_val}' issue_data = { "fields": { "project": { "key": "SCRUM" }, "summary": summary, "description": description, "issuetype": { "name": issue_type }, "assignee": { "name": assignee } } } data = json.dumps(issue_data) headers = { "Accept": "application/json", "Content-Type": "application/json" } auth = HTTPBasicAuth(username, api_token) response = requests.post(url, headers=headers, auth=auth, data=data) global issueid global issuekey global issuelink issueid = response.json().get('id') issuekey = response.json().get('key') issuelink = response.json().get('self') ################ Add Comment To Above Created JIRA Issue ################### output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None comment_api_url = f"{url}/{issuekey}/comment" add_comment = requests.post(comment_api_url, headers=headers, auth=auth, data=json.dumps({"body": output})) ## Check the response if response.status_code == 201: print("Issue created successfully. Issue key:", response.json().get('key')) else: print(f"Failed to create issue. Status code: {response.status_code}, Response: {response.text}")
Third Set (Send an Email) :
################## Send An Email ################ ################################################# def send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink): ### Define a dictionary of custom input ### metric_list = {'mem_used_percent': 'Memory', 'disk_used_percent': 'Disk', 'CPUUtilization': 'CPU'} ### Conditions ### if previous_state == 'OK' and current_state == 'ALARM' and metric_name in list(metric_list.keys()): metric_msg = metric_list[metric_name] output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None print('This is output', output) email_body = f"Hi Team, \n\nPlease be informed that {metric_msg} utilization is high for the instanceid {instanceid}. Please find below more information \n\nAlarm Details:\nMetricName = {metric_name}, \nAccount = {account}, \nTimestamp = {timestamp}, \nRegion = {region}, \nInstanceID = {instanceid}, \nCurrentState = {current_state}, \nReason = {current_reason}, \nMetricValue = {metric_val}, \nThreshold = 80.00 \n\nProcessOutput: \n{output}\nIncident Deatils:\nIssueID = {issueid}, \nIssueKey = {issuekey}, \nLink = {issuelink}\n\nRegards,\nAnirban Das,\nGlobal Cloud Operations Team" res = sns_client.publish( TopicArn = os.environ['snsarn'], Subject = f'High {metric_msg} Utilization Alert : {instanceid}', Message = str(email_body) ) print('Mail has been sent') if res else print('Email not sent') else: email_body = str(0)
Fourth Set (Calling Lambda Handler Function) :
################## Lambda Handler Function ################ ########################################################### def lambda_handler(event, context): instanceid = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId'] metric_name = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['name'] account = event['account'] timestamp = event['time'] region = event['region'] current_state = event['detail']['state']['value'] current_reason = event['detail']['state']['reason'] previous_state = event['detail']['previousState']['value'] previous_reason = event['detail']['previousState']['reason'] metric_val = json.loads(event['detail']['state']['reasonData'])['evaluatedDatapoints'][0]['value'] ##### function calling ##### if metric_name == 'CPUUtilization': cpu_utilization(instanceid, metric_name, previous_state, current_state) create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val) send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink) elif metric_name == 'mem_used_percent': mem_utilization(instanceid, metric_name, previous_state, current_state) create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val) send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink) else: None
Alarm Email Screenshot :
Note: In ideal scenario, threshold is 80%, but for testing I changed it to 10%. Please see the Reason.
Alarm JIRA Issue :
Scenario 2: When alarm state has been changed from OK to Insufficient data
In this scenario, if any server cpu or memory utilization metrics data are not captured, then alarm state gets changed from OK to INSUFFICIENT_DATA. This state can be achieved in two ways - a.) If server is in stopped state b.) If CloudWatch agent is not running or went in dead state.
So, as per below script, you'll be able to see that when cpu or memory utilization alarm status gets insufficient data, then lambda will first check if instance is in running status or not. If instance is in running state, then it will login and check CloudWatch agent status. Post that, it will create a JIRA issue and post the agent status in comment section of JIRA issue. After that, it will send an email with alarm details and agent status.
Full Code :
################# Importing Required Modules ################ ############################################################ import json import boto3 import time import os import sys sys.path.append('./python') ## This will add requests module along with all dependencies into this script import requests from requests.auth import HTTPBasicAuth ################## Calling AWS Services ################### ########################################################### ssm = boto3.client('ssm') sns_client = boto3.client('sns') ec2 = boto3.client('ec2') ################## Defining Blank Variable ################ ########################################################### cpu_process_op = '' mem_process_op = '' issueid = '' issuekey = '' issuelink = '' ################# Function for CPU Utilization ################ ############################################################### def cpu_utilization(instanceid, metric_name, previous_state, current_state): global cpu_process_op if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA': ec2_status = ec2.describe_instance_status(InstanceIds=[instanceid,])['InstanceStatuses'][0]['InstanceState']['Name'] if ec2_status == 'running': command = 'systemctl status amazon-cloudwatch-agent;sleep 3;systemctl restart amazon-cloudwatch-agent' print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}') # Start a session print(f'Starting session to {instanceid}') response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]}) command_id = response['Command']['CommandId'] print(f'Command ID: {command_id}') # Retrieve the command output time.sleep(4) output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid) print('Please find below output -\n', output['StandardOutputContent']) cpu_process_op = output['StandardOutputContent'] else: cpu_process_op = f'Instance current status is {ec2_status}. Not able to reach out!!' print(f'Instance current status is {ec2_status}. Not able to reach out!!') else: print('None') ################# Function for Memory Utilization ################ ############################################################### def mem_utilization(instanceid, metric_name, previous_state, current_state): global mem_process_op if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA': ec2_status = ec2.describe_instance_status(InstanceIds=[instanceid,])['InstanceStatuses'][0]['InstanceState']['Name'] if ec2_status == 'running': command = 'systemctl status amazon-cloudwatch-agent' print(f'Impacted Instance ID is : {instanceid}, Metric Name: {metric_name}') # Start a session print(f'Starting session to {instanceid}') response = ssm.send_command(InstanceIds = [instanceid], DocumentName="AWS-RunShellScript", Parameters={'commands': [command]}) command_id = response['Command']['CommandId'] print(f'Command ID: {command_id}') # Retrieve the command output time.sleep(4) output = ssm.get_command_invocation(CommandId=command_id, InstanceId=instanceid) print('Please find below output -\n', output['StandardOutputContent']) mem_process_op = output['StandardOutputContent'] print(mem_process_op) else: mem_process_op = f'Instance current status is {ec2_status}. Not able to reach out!!' print(f'Instance current status is {ec2_status}. Not able to reach out!!') else: print('None') ################## Create JIRA Issue ################ ##################################################### def create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val): ## Create Issue ## url ='https://<your-user-name>.atlassian.net//rest/api/2/issue' username = os.environ['username'] api_token = os.environ['token'] project = 'AnirbanSpace' issue_type = 'Incident' assignee = os.environ['username'] summ_metric = '%CPU Utilization' if 'CPU' in metric_name else '%Memory Utilization' if 'mem' in metric_name else '%Filesystem Utilization' if metric_name == 'disk_used_percent' else None metric_val = metric_val summary = f'Client | {account} | {instanceid} | {summ_metric} | Metric Value: {metric_val}' description = f'Client: Company\nAccount: {account}\nRegion: {region}\nInstanceID = {instanceid}\nTimestamp = {timestamp}\nCurrent State: {current_state}\nPrevious State = {previous_state}\nMetric Value = {metric_val}' issue_data = { "fields": { "project": { "key": "SCRUM" }, "summary": summary, "description": description, "issuetype": { "name": issue_type }, "assignee": { "name": assignee } } } data = json.dumps(issue_data) headers = { "Accept": "application/json", "Content-Type": "application/json" } auth = HTTPBasicAuth(username, api_token) response = requests.post(url, headers=headers, auth=auth, data=data) global issueid global issuekey global issuelink issueid = response.json().get('id') issuekey = response.json().get('key') issuelink = response.json().get('self') ################ Add Comment To Above Created JIRA Issue ################### output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None comment_api_url = f"{url}/{issuekey}/comment" add_comment = requests.post(comment_api_url, headers=headers, auth=auth, data=json.dumps({"body": output})) ## Check the response if response.status_code == 201: print("Issue created successfully. Issue key:", response.json().get('key')) else: print(f"Failed to create issue. Status code: {response.status_code}, Response: {response.text}") ################## Send An Email ################ ################################################# def send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink): ### Define a dictionary of custom input ### metric_list = {'mem_used_percent': 'Memory', 'disk_used_percent': 'Disk', 'CPUUtilization': 'CPU'} ### Conditions ### if previous_state == 'OK' and current_state == 'INSUFFICIENT_DATA' and metric_name in list(metric_list.keys()): metric_msg = metric_list[metric_name] output = cpu_process_op if metric_name == 'CPUUtilization' else mem_process_op if metric_name == 'mem_used_percent' else None email_body = f"Hi Team, \n\nPlease be informed that {metric_msg} utilization alarm state has been changed to {current_state} for the instanceid {instanceid}. Please find below more information \n\nAlarm Details:\nMetricName = {metric_name}, \n Account = {account}, \nTimestamp = {timestamp}, \nRegion = {region}, \nInstanceID = {instanceid}, \nCurrentState = {current_state}, \nReason = {current_reason}, \nMetricValue = {metric_val}, \nThreshold = 80.00 \n\nProcessOutput = \n{output}\nIncident Deatils:\nIssueID = {issueid}, \nIssueKey = {issuekey}, \nLink = {issuelink}\n\nRegards,\nAnirban Das,\nGlobal Cloud Operations Team" res = sns_client.publish( TopicArn = os.environ['snsarn'], Subject = f'Insufficient {metric_msg} Utilization Alarm : {instanceid}', Message = str(email_body) ) print('Mail has been sent') if res else print('Email not sent') else: email_body = str(0) ################## Lambda Handler Function ################ ########################################################### def lambda_handler(event, context): instanceid = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['dimensions']['InstanceId'] metric_name = event['detail']['configuration']['metrics'][0]['metricStat']['metric']['name'] account = event['account'] timestamp = event['time'] region = event['region'] current_state = event['detail']['state']['value'] current_reason = event['detail']['state']['reason'] previous_state = event['detail']['previousState']['value'] previous_reason = event['detail']['previousState']['reason'] metric_val = 'NA' ##### function calling ##### if metric_name == 'CPUUtilization': cpu_utilization(instanceid, metric_name, previous_state, current_state) create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val) send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink) elif metric_name == 'mem_used_percent': mem_utilization(instanceid, metric_name, previous_state, current_state) create_issues(instanceid, metric_name, account, timestamp, region, current_state, previous_state, cpu_process_op, mem_process_op, metric_val) send_email(instanceid, metric_name, account, region, timestamp, current_state, current_reason, previous_state, previous_reason, cpu_process_op, mem_process_op, metric_val, issueid, issuekey, issuelink) else: None
Insufficient Data Email Screenshot :
Insufficient data JIRA Issue :
Conclusion :
In this article, we have tested scenarios on both cpu and memory utilization, but there can be lots of metrics on which we can configure auto-incident and auto-email functionality which will reduce significant efforts in terms of monitoring and creating incidents and all. This solution has given a initial approach how we can proceed further, but for sure there can be other possibilities to achieve this goal. I believe you all will understand the way we tried to make this relatable. Please like and comment if you love this article or have any other suggestions, so that we can populate in coming articles. ??
Thanks!!
Anirban Das
The above is the detailed content of Automatic Troubleshooting & ITSM System using EventBridge and Lambda. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undress AI Tool
Undress images for free

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Yes,aPythonclasscanhavemultipleconstructorsthroughalternativetechniques.1.Usedefaultargumentsinthe__init__methodtoallowflexibleinitializationwithvaryingnumbersofparameters.2.Defineclassmethodsasalternativeconstructorsforclearerandscalableobjectcreati

In Python, using a for loop with the range() function is a common way to control the number of loops. 1. Use when you know the number of loops or need to access elements by index; 2. Range(stop) from 0 to stop-1, range(start,stop) from start to stop-1, range(start,stop) adds step size; 3. Note that range does not contain the end value, and returns iterable objects instead of lists in Python 3; 4. You can convert to a list through list(range()), and use negative step size in reverse order.

To get started with quantum machine learning (QML), the preferred tool is Python, and libraries such as PennyLane, Qiskit, TensorFlowQuantum or PyTorchQuantum need to be installed; then familiarize yourself with the process by running examples, such as using PennyLane to build a quantum neural network; then implement the model according to the steps of data set preparation, data encoding, building parametric quantum circuits, classic optimizer training, etc.; in actual combat, you should avoid pursuing complex models from the beginning, paying attention to hardware limitations, adopting hybrid model structures, and continuously referring to the latest documents and official documents to follow up on development.

The key to using Python to call WebAPI to obtain data is to master the basic processes and common tools. 1. Using requests to initiate HTTP requests is the most direct way. Use the get method to obtain the response and use json() to parse the data; 2. For APIs that need authentication, you can add tokens or keys through headers; 3. You need to check the response status code, it is recommended to use response.raise_for_status() to automatically handle exceptions; 4. Facing the paging interface, you can request different pages in turn and add delays to avoid frequency limitations; 5. When processing the returned JSON data, you need to extract information according to the structure, and complex data can be converted to Data

Python's onelineifelse is a ternary operator, written as xifconditionelsey, which is used to simplify simple conditional judgment. It can be used for variable assignment, such as status="adult"ifage>=18else"minor"; it can also be used to directly return results in functions, such as defget_status(age):return"adult"ifage>=18else"minor"; although nested use is supported, such as result="A"i

This article has selected several top Python "finished" project websites and high-level "blockbuster" learning resource portals for you. Whether you are looking for development inspiration, observing and learning master-level source code, or systematically improving your practical capabilities, these platforms are not to be missed and can help you grow into a Python master quickly.

The key to writing Python's ifelse statements is to understand the logical structure and details. 1. The infrastructure is to execute a piece of code if conditions are established, otherwise the else part is executed, else is optional; 2. Multi-condition judgment is implemented with elif, and it is executed sequentially and stopped once it is met; 3. Nested if is used for further subdivision judgment, it is recommended not to exceed two layers; 4. A ternary expression can be used to replace simple ifelse in a simple scenario. Only by paying attention to indentation, conditional order and logical integrity can we write clear and stable judgment codes.

Using a for loop to read files line by line is an efficient way to process large files. 1. The basic usage is to open the file through withopen() and automatically manage the closing. Combined with forlineinfile to traverse each line. line.strip() can remove line breaks and spaces; 2. If you need to record the line number, you can use enumerate(file, start=1) to let the line number start from 1; 3. When processing non-ASCII files, you should specify encoding parameters such as utf-8 to avoid encoding errors. These methods are concise and practical, and are suitable for most text processing scenarios.
