Cece Housh 901015203 05/01/2023 CS Principles Final Project
Khan Academy Data Analysis Portion

Data Tools

This module was a great introduction to the tools and technologies used in data analysis. Everybody benefits from knowing how to use spreadsheets, SQL, and common helper functions (such as Count and Average), which are excellent starting points for grasping this topic. It showed the significance of several subtopics, such as data types, data structures, and data cleaning methods. Surprisingly, even though I am very far along in my degree, I still didn't know half of this material. Since I hadn't seen this information before (at least not in this much detail), I was genuinely intrigued as I kept reading. I am confident in my ability to spot patterns in data and statistics, so I did fine on the exercises, but I had never been formally taught these skills. Because of that, I feel this assignment would have made me more confident in my Software Development class (even if it wasn't needed for the course material, it would have helped me through the mock interviews). I didn't personally struggle with the exercises, since I've already had classes that discussed CSV files, made me focus on detecting patterns in data, and asked questions similar to the statistics ones given here.

Big Data

This module gave a good understanding of the particular difficulties and opportunities that big datasets present. It was great for learning how to use big data tools and technology, as well as how to examine massive datasets and gain insights using machine learning methods. Other crucial topics included data privacy and security (such as how medical data is maintained, data centers, and personally identifiable information and the best ways to secure it) and the storage and processing time that large datasets require, alongside the difficulties those datasets can present.
The Big Data module offered helpful guidance for using big data in real contexts and gave me a better understanding of the methods and procedures needed to work efficiently with complex datasets. The exercises for this one were fairly easy to understand, mostly, I believe, because they were real-life scenarios that we can picture ourselves facing in our fields of study. Again, this module also felt practical for the Software Development class, especially since everything gets bigger and more complex as we move further up in the CS courses.

Machine Learning Algorithms

This module covered the ethical and societal repercussions of data analysis and machine learning, which is a crucially important part of the curriculum. I appreciated the explanation of machine learning at the beginning, since many people don't consider the fact that it is an algorithm that improves itself through experience. People typically just think of it as something a programmer is doing behind the scenes. Later, the module demonstrated precisely how bias can be built into machine learning algorithms and how these biases can harm people and society. Algorithmic fairness, data ethics, and the societal duty of data scientists were all covered. This module specifically helped me gain a comprehensive grasp of the difficult moral dilemmas raised by data analysis and machine learning. It reminded me of the book Hello World, where Hannah Fry talks about bias in algorithms and historical bias in data. The exercises were similar to those in the other two modules, as they tied into previous courses I've taken in which we took data or statistics and came up with a logical explanation for them.

Unit Tests

I personally never liked Khan Academy, and since high school I didn't believe it would be helpful, but after this project I have changed my mind.
I found these modules really interesting, though mainly because they covered subjects I'm already familiar with. However, I believe Khan Academy would have been a significant (positive) factor in my success in my degree if I had used it earlier. I was able to get 100% on each of the exercises, as well as on the unit test. The main goal of taking the unit test is to make sure that we, as students, understand the concepts and see how they all tie into each other. It demanded a lot of critical thinking, since the questions are real-world problems that you have to think through, but it was especially great because it explained why your answer (or each of the other options) was right or wrong.

------------------------------------------------------------

Step 1: Chosen data source: Kaggle
Chosen dataset topic: Mental Health & Suicide Death
Step 2: Write one or more questions or hypotheses that can be addressed using the data set you have selected.
Questions:
1. From 1990 to 2019, which country had the highest mental disorder prevalence, and in what year? Which had the lowest?
2. From 1990 to 2019, did the self-harm death percentage increase or decrease across these countries? (This could be made more specific to a year or country.) What is an example of a significant increase or decrease?
3. Which year from 1990 to 2019 had the highest self-harm-related death percentage, and in which country? (This could be made more specific by asking about a certain country's year.) What about the lowest self-harm-related death percentage?
4. In 2003, which country had the highest self-harm-related death percentage? What were the top 5?
Step 3: Use computer-based tools to analyze the data in the context of addressing your questions/hypotheses.
Tool used: Excel
Addressing Questions:
1. Highest: Iran in 2015, with a 19.35% mental disorder prevalence. Lowest: Vietnam in 2011, with 9.46% (Vietnam also held the lowest percentage from 1995 to 2019).
2. Yes, the self-harm-related death percentage fluctuated over the years in every country. Specifically, Andorra had one of the most significant decreases, from 2.4% in 1990 to 1.3% in 2019.
3. Greenland in 1990 had the highest self-harm-related death percentage, at 11.8%. Greenland also had the highest self-harm-related death percentage throughout the entire 1990 to 2019 span (though not necessarily in year order). On the other hand, Rwanda had the lowest self-harm-related death percentage, at 0.1% in 1994.
4. In 2003, the country with the highest self-harm-related death percentage was Palestine. Top 5:
   Palestine at 19.02%
   Australia at 18.96%
   New Zealand at 18.9%
   Portugal at 18.89%
   Iran at 18.85%
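The "top 5 in a given year" lookup from question 4 was done in Excel, but it could also be scripted. The following is only a minimal sketch: the column names (Entity, Year, Percent) are assumptions rather than the dataset's real headers, and the rows are just the figures quoted in this report, plus one 1990 row to show the year filter working.

```python
import heapq

# Hypothetical column names; values are the figures quoted in this report.
rows = [
    {"Entity": "Portugal", "Year": 2003, "Percent": 18.89},
    {"Entity": "Palestine", "Year": 2003, "Percent": 19.02},
    {"Entity": "Iran", "Year": 2003, "Percent": 18.85},
    {"Entity": "Australia", "Year": 2003, "Percent": 18.96},
    {"Entity": "New Zealand", "Year": 2003, "Percent": 18.9},
    {"Entity": "Greenland", "Year": 1990, "Percent": 11.8},  # different year, filtered out
]


def top_n(rows, year, n=5):
    """Return the n rows with the largest Percent for the given year."""
    in_year = [r for r in rows if r["Year"] == year]
    return heapq.nlargest(n, in_year, key=lambda r: r["Percent"])


top5 = top_n(rows, 2003)
for rank, r in enumerate(top5, start=1):
    print(f"{rank}. {r['Entity']} at {r['Percent']}%")
```

This does the same thing as filtering the Year column to 2003 in Excel and sorting the percentage column from highest to lowest.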
The process by which these problems were determined: Mental health and suicide are very serious topics, and the situation doesn't seem to be getting much better. Looking through this data, it is important to try to see the patterns. Checking whether there is an increase or decrease within certain years can help narrow down what may be causing it. If we took this further, we could identify the range of years in which death rates were at their highest and then look up any significant events from that time frame, or other statistics that could have had an impact (for example, if unemployment rates had risen at the same time as suicide rates, we could investigate a possible connection). I decided to look for the years in which these rates were highest and the countries in which they were highest. Looking at the locations with higher rates could show more precisely which areas we should focus on when it comes to making a change. Overall, I simply tried to think of problems that I was genuinely interested in and that I would personally focus on if I had collected the data myself.
The process by which the data was obtained: I started by going through Kaggle to find an interesting dataset. I found a few, but many of those in my topic range didn't have enough data. I finally came across Mental Health & Suicide Death and downloaded the dataset's files. I then opened Excel, went to the Data tab, pressed "Get Data," and imported the files there, since they were originally in CSV format and importing them this way made the data more organized. I go into more detail in the next paragraph, where I explain how I used the tools to answer my questions.
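For comparison, the CSV import step could also be done outside Excel. This is a minimal sketch using Python's built-in csv module. The header names (Entity, Year, Prevalence) are assumptions, since the real file's column names are not listed in this report, and the two sample rows are just the figures quoted in the answers above. An in-memory string stands in for the downloaded file.

```python
import csv
import io

# Stand-in for the downloaded file; header names are hypothetical.
SAMPLE_CSV = """Entity,Year,Prevalence
Iran,2015,19.35
Vietnam,2011,9.46
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with a typed Year and Prevalence."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({
            "Entity": record["Entity"],
            "Year": int(record["Year"]),                # text -> whole number
            "Prevalence": float(record["Prevalence"]),  # text -> decimal
        })
    return rows


rows = load_rows(SAMPLE_CSV)
print(rows[0])
```

Converting Year and Prevalence to numbers matters because CSV stores everything as text, and text would sort alphabetically instead of numerically.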
The process of the analysis conducted: For most of this information, I worked in Excel and adjusted the filters based on what I was looking for. For example, I wanted to know which country had the highest mental disorder prevalence and when. I had two spreadsheets, one for Mental Disorder Prevalence and one for Self-Harm-Related Deaths. I went to the Mental Disorder Prevalence spreadsheet and sorted the Prevalence column in descending order, so the percentages ran from highest to lowest. I then looked at the highest percentage and determined which country and year it belonged to. Other questions, such as question 2, were a little more difficult. I had to make a graph from the data, compare the lines for each country, and determine which one had the steepest slope (when I was looking for the largest increase). For this specific question I was just looking for one of the most significant changes in Self-Harm-Related Deaths, and since the question didn't specify which, I picked a country that I personally didn't know much about but whose change surprised me (and it was a decrease, so it's a little on the positive side).
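The two techniques described above (sorting in descending order, and comparing a country's first and last year instead of reading slopes off a graph) can be sketched in a few lines of Python. This is only an illustration, not the method actually used: the column names are assumptions, and the rows contain only figures quoted in this report.

```python
# Hypothetical column names; values are figures quoted in this report.
prevalence = [
    {"Entity": "Vietnam", "Year": 2011, "Prevalence": 9.46},
    {"Entity": "Iran", "Year": 2015, "Prevalence": 19.35},
]
self_harm = [
    {"Entity": "Andorra", "Year": 1990, "Percent": 2.4},
    {"Entity": "Andorra", "Year": 2019, "Percent": 1.3},
]


def sort_descending(rows, column):
    """Same idea as Excel's sort from largest to smallest on one column."""
    return sorted(rows, key=lambda r: r[column], reverse=True)


def change_between(rows, column, start_year, end_year):
    """Return {country: end value - start value} for countries with both years."""
    start = {r["Entity"]: r[column] for r in rows if r["Year"] == start_year}
    end = {r["Entity"]: r[column] for r in rows if r["Year"] == end_year}
    return {c: round(end[c] - start[c], 2) for c in start if c in end}


ranked = sort_descending(prevalence, "Prevalence")
highest, lowest = ranked[0], ranked[-1]
changes = change_between(self_harm, "Percent", 1990, 2019)

print("Highest:", highest["Entity"], highest["Year"])
print("Lowest:", lowest["Entity"], lowest["Year"])
print("Change 1990 to 2019:", changes)
```

A negative change means the percentage went down over the period. Sorting the resulting changes would rank every country by its increase or decrease without needing to compare lines on a chart by eye.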
The results of the analyses: These results show that Greenland had the highest self-harm-related death percentage throughout the 1990 to 2019 span, while Rwanda had the lowest self-harm-related death percentage (but only in 1994). For mental disorder prevalence, Iran had the highest percentage, in 2015, while Vietnam held the lowest from 1995 to 2019.