Applied Data Science on data breaches + Bonus-Python Tutorial-php.cn

Hello!

Today I decided to embed two domains: data science and cybersecurity.

Follow along and you'll see what I'm writing about.
Applied Data Science on data breaches + Bonus

What did I do?

I performed an analysis over the number of attacks based on the organization type.
I downloaded the dataset from Kaggle.
Then, I started working on the data using Jupyter Lab and Python.

The notebook is for exercises purposes, for testing and observing- or playing with- data.

Applied Data Science on data breaches + Bonus

As usual, the first and foremost I imported the data. Then, I loaded and cleaned the dataset.

Cleaning the data is a step that could be done more times, because EDA (Exploratory Data Analysis) is an iterative and non-sequential process. Therefore, later on I continued with this process, in order to uncover meaningful insights.

Few words about statistics

I chose a simple random sampling ofn=40to find out which organization is more prone to cyberattacks, based on the number of attacks. Simple random sampling means that every member of the population has an equal chance of being selected.

The hypothesis

Null Hypothesis (H0): There is no significant difference in the number of cyberattacks experienced by different types of organizations.
Alternative Hypothesis (H1): The number of cyberattacks differs significantly across different types of organizations.

According to the maximum number of attacks, it was concluded thathealthcareindustry is more prone, with 6 attacks. On the opposite,bankinghad the lowest number of attacks, i.e 1.

In the end, I performed a Shapiro- Wilk test, to check for the distribution normality of the dataset. The Null Hypothesis was rejected, so the data did not look normally distributed. I applied Kruskal- Wallis test, from which I failed to reject the Null Hypothesis- meaning that there is no significant difference between groups. In simpler terms, it means that there was not enough evidence to confidently say that one organization type is more prone to cyberattacks than the other.

Limitations and future considerations

No confidence level, margin of error and confidence interval were set. The sample size was small, therefore it is harder to detect statistically significant differences. In the future, the selection of a sample will respect these steps and a larger sample will be considered.

You can findthe entirework on my GitHub page. ?

BONUS ?

As I specified, this article has abonus. The combination of data science and cybersecurity goes on: I created a write-up for TryHackMe room Attacktive Directory!
One could say, at the first glance, that these topics are unrelated. Well, it's actually a demonstration ofhowa breach could take place! ? Because data breaches appearsomehowand forsome reason.

Curious? Well, check my write-up from my GitHub page.

What are your thoughts?

The above is the detailed content of Applied Data Science on data breaches + Bonus. For more information, please follow other related articles on the PHP Chinese website!