October 29, 2023

DBSCAN 

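DBSCAN is a density-based clustering algorithm. For each point, it draws a radius (eps) around the point and counts how many points fall inside it. A point with enough neighbors (at least min_samples) is a core point, and clusters are built by chaining together core points that lie within each other's radius, along with the points they reach. Points that no core point reaches are labeled as noise.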
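To make the radius-and-count idea concrete, here is a minimal pure-Python sketch of DBSCAN. It is not the code used for the analysis in this post. The parameter names `eps` and `min_samples` are borrowed from scikit-learn's `DBSCAN`, and the toy points are made up for illustration.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point, so keep expanding
        cluster += 1
    return labels


# Toy data (made up): two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

On real data, features such as age should be scaled and categorical features encoded before distances are computed, since `eps` is a single radius applied to every dimension.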

I used the features age, threat level, mental illness and race to find clusters in the data with DBSCAN. The cluster means showed that the first cluster consisted of White victims with an average age of around 40, while another cluster consisted of Hispanic victims with an average age of around 39.

October 27, 2023

The new city-level dataset with the share of each race needed to be cleaned. The city names carried designations such as “city”, “CDP” and “town”. I removed these and joined the city and state columns into one column, as in the original dataset. This combined column acts as the key when merging the two datasets with a left join.
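The cleaning step can be sketched as below. This is an illustration rather than the exact code used here. The helper name `make_key` is hypothetical, and the suffix list covers only the three designations mentioned above, so the real file may need more.

```python
import re

# Suffixes named in the text; the real file may contain others (assumption).
SUFFIX = re.compile(r"\s+(city|CDP|town)$", flags=re.IGNORECASE)


def make_key(city, state):
    """Strip one trailing designation and build a 'city, state' join key."""
    name = SUFFIX.sub("", city.strip())
    return f"{name}, {state.strip()}"


print(make_key("Boston city", "MA"))       # Boston, MA
print(make_key("Kansas City city", "MO"))  # Kansas City, MO
print(make_key("Bethesda CDP", "MD"))      # Bethesda, MD
```

Only the last word is stripped, so a name like "Kansas City city" keeps its real "City". The function should be applied only to the census-style names, because a name from the shooting data such as "Iowa City" would otherwise lose its last word.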

I used a left join because I want to keep every city and every record in the original dataset, while I am fine with dropping the race share information for cities that do not appear in the shooting data. The result is a dataset with all the shooting-related information along with the race shares for each city. I will use this combined dataset for the analysis and the modelling.
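The behaviour of a left join can be shown with a small standard-library sketch. The key name `city_state` and all values are made up for illustration. With pandas, the same result would come from `DataFrame.merge(..., how="left")`.

```python
def left_join(shootings, race_share, key="city_state"):
    """Keep every shooting row; attach share columns when the key matches."""
    lookup = {row[key]: row for row in race_share}
    share_columns = [c for c in race_share[0] if c != key] if race_share else []
    merged = []
    for row in shootings:
        match = lookup.get(row[key])
        # Unmatched cities get None for every share column.
        extra = {c: (match[c] if match else None) for c in share_columns}
        merged.append({**row, **extra})
    return merged


# Toy rows (made-up values, not taken from either dataset).
shootings = [
    {"id": 1, "city_state": "Boston, MA"},
    {"id": 2, "city_state": "Nowhere, ZZ"},
]
race_share = [
    {"city_state": "Boston, MA", "Share_white": 0.5, "Share_Black": 0.2},
    {"city_state": "Unused, TX", "Share_white": 0.7, "Share_Black": 0.1},
]
merged = left_join(shootings, race_share)
```

Every shooting row survives the merge. The city that appears only in the race share table is dropped, and the shooting with no matching city gets empty share values.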

 

October 25, 2023

The dataset has information about the race of the victim in each shooting as well as the location where the shooting occurred. I found a new dataset on the internet (source) which has each US city’s share of each race, based on census data. This dataset has the following columns – City, State, Share_white, Share_Black, Share_Hispanic, Share_Native, Share_Asian and Share_other. All the Share variables are floats between 0 and 1 giving the fraction of the city’s population that belongs to that race.

This data can be used in the analysis because it describes the racial demographics of each city. A machine learning model could use the demographic makeup of each city when predicting the race of a person who was shot.

October 23, 2023

Specific trends in the data 

From the given data, the raw number of White people killed is higher than that of any other race. But this does not give the full picture, because there are far more White people in the US than people of any other race. Dividing the number of fatal police shootings for each race by the total population of that race gives a better measure of which race is killed at a higher rate.

The data about total populations was taken from the official census website. There are over 220 million White residents in the US but only about 62 million Black residents. This disparity needs to be accounted for in the data analysis. Relative to population, the rate of Black deaths is found to be more than twice that of any other race. The next highest is Native Americans, followed by Hispanics. This gives a much clearer picture of the distribution of fatal police shootings by race.
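The normalization itself is a single division per group. The sketch below uses toy numbers that are not the real counts or census figures, and the function name `rate_per_million` is hypothetical.

```python
def rate_per_million(deaths, population):
    """Deaths per one million residents for each group."""
    return {g: deaths[g] * 1_000_000 / population[g] for g in deaths}


# Toy numbers (made up): group_a has more deaths but a far larger population.
deaths = {"group_a": 300, "group_b": 150}
population = {"group_a": 2_000_000, "group_b": 500_000}

rates = rate_per_million(deaths, population)
print(rates)  # {'group_a': 150.0, 'group_b': 300.0}
```

In this toy case, group_a has twice the raw count but half the per-capita rate, which is the kind of reversal that raw counts hide.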

 

October 20, 2023

Latitude and longitude missing data: 

There are 840 missing values in the latitude and longitude data. These cannot simply be imputed with the average of all latitudes and longitudes, because that would place every one of them at a single point somewhere in the central US.

To deal with these missing values, I performed the following steps. First, I combined the city and state columns into one column in the format – “city, state”. This avoids ambiguity when cities in different states share the same name.

Then, I grouped the data by this combined city column. Missing values in each group (meaning each individual city) were imputed with the mean latitude and longitude of that group. This is effectively an average location of the shootings within that city, and it is a much better approximation because the city information is used to fill in the location. However, cities whose only shooting record lacks coordinates have no group mean to draw from, so those values remain missing. The number of missing values dropped from 840 to 300 after this process.
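The group-wise imputation can be sketched with the standard library as follows. The column names `city_state`, `latitude` and `longitude` and the coordinates are assumptions for illustration. A row counts as missing if either coordinate is `None`.

```python
from collections import defaultdict
from statistics import mean


def fill_coordinates(rows):
    """Fill missing (None) latitude/longitude with the mean for that city."""
    known = defaultdict(list)
    for r in rows:
        if r["latitude"] is not None and r["longitude"] is not None:
            known[r["city_state"]].append((r["latitude"], r["longitude"]))

    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        coords = known.get(r["city_state"])
        if (r["latitude"] is None or r["longitude"] is None) and coords:
            r["latitude"] = mean(lat for lat, _ in coords)
            r["longitude"] = mean(lon for _, lon in coords)
        filled.append(r)
    return filled


# Toy rows (made-up coordinates).
rows = [
    {"city_state": "Springfield, IL", "latitude": 39.0, "longitude": -89.0},
    {"city_state": "Springfield, IL", "latitude": 40.0, "longitude": -90.0},
    {"city_state": "Springfield, IL", "latitude": None, "longitude": None},
    {"city_state": "Springfield, MA", "latitude": None, "longitude": None},
]
filled = fill_coordinates(rows)
```

The third row receives the mean of the two known Springfield, IL locations. The Springfield, MA row stays empty because that city has no known coordinates, which is the situation that leaves the remaining missing values.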

October 18, 2023

Today we discussed the analysis of the difference in mean age between White and Black victims. After pooling the age data together and randomly resampling two groups from the pool, the differences in means were very small compared with the difference observed in the data. The p-value from this procedure was of the order of 10^-78, which indicates that the observed difference in means is statistically significant.
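One common way to run a pool-and-resample comparison like this is a permutation test. The sketch below may not match the exact procedure discussed. The ages are made up, and the function name `permutation_test` is hypothetical.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling of the pooled values
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy ages (made up), chosen so that the two groups do not overlap.
group_1 = [20, 22, 25, 27, 30, 31, 33, 35]
group_2 = [38, 40, 42, 45, 47, 50, 52, 55]
p_value = permutation_test(group_1, group_2, n_permutations=2000)
```

A resampling estimate like this cannot go below 1 / (n_permutations + 1), so extremely small p-values are usually obtained from an analytic test (for example a t-test) rather than from resampling alone.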

Further, I ran a similar experiment comparing the fleeing and not-fleeing groups. A similar trend was observed, although it seems less meaningful, since a fleeing suspect is arguably more likely to be shot in the first place.

Further, I plan to work on the data regarding the location of the shootings and the distance to the police stations involved.

October 16, 2023

Data cleaning 

The data has many missing values and incorrect formatting for some of the features. The ‘gender’ column has 31 missing values. Since the overwhelming majority of the instances are male, I imputed Male for the missing values of gender.

The column ‘flee’ has 966 missing values. Since most of the dataset is ‘not fleeing’, I imputed that value in place of the missing values. Similarly, the missing values in the column ‘armed’ were imputed with the most common value.
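Imputing with the most common value (the mode) can be sketched as below. The helper name `impute_mode` and the sample values are illustrative only, and missing entries are assumed to be represented as `None`.

```python
from collections import Counter


def impute_mode(values):
    """Replace None entries with the most common non-missing value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn the mode from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Toy column (made up).
flee = ["not fleeing", "car", None, "not fleeing", "foot", None]
flee_filled = impute_mode(flee)
```

This keeps every row but makes the dominant category even more dominant, which is worth remembering when the imputed columns are later used as model features.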

Since the date information is given, I performed a temporal analysis to show the number of shootings per year. There is a steady uptick in the number of fatal shootings since 2017.
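A per-year count can be sketched as below. It assumes the dates are ISO strings in the form `YYYY-MM-DD`, and the dates shown are made up.

```python
from collections import Counter
from datetime import date


def shootings_per_year(dates):
    """Count records per calendar year from ISO 'YYYY-MM-DD' strings."""
    counts = Counter(date.fromisoformat(d).year for d in dates)
    return dict(sorted(counts.items()))


# Toy dates (made up).
dates = ["2017-01-05", "2017-11-30", "2018-03-14", "2019-07-01", "2019-07-02"]
per_year = shootings_per_year(dates)
print(per_year)  # {2017: 2, 2018: 1, 2019: 2}
```

If the last year in the data is incomplete, its count is not directly comparable with the full years before it.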

 

 

October 13, 2023

I did some exploratory analysis on the Washington Post fatal police shootings data. There are many missing values in “armed”, “age”, “race” and other seemingly important features. These gaps need to be dealt with, either by finding the data elsewhere on the internet or by imputing the missing values using the other features and the feature statistics.

My initial plan for this project is to use the available features to build a classification model (for example, logistic regression) that predicts the race of the shooting victim from the remaining features. The age distributions of the victims vary quite a bit by race. The age distribution of Black American shooting victims peaks at a much younger age than that of other races, which suggests that Black victims tend to be shot at a younger age than victims of other races.

Some additional data like the area’s demographics might be useful for my analysis plan.

 

October 11, 2023

Understanding the Dataset 

The dataset provided for this project was obtained from the Washington Post GitHub repository. It contains the date, age, gender and race of the victim, the location where the shooting occurred, and some additional features.

There are a total of 8002 instances in this dataset, with many missing values. The predominant race in this dataset is White with 3300 deaths, and the second highest is Black with 1766 deaths. The feature named “flee” records whether the victim was fleeing on foot, fleeing in a vehicle, or not fleeing. Surprisingly, the majority of the data is in the ‘not fleeing’ category.

After cleaning the data there will be a better opportunity to analyze it as right now the data has thousands of missing values in various columns.  

October 2, 2023

Today, I discussed the particulars of the report to be written for project 1 with my group mates. As a group, we are looking at 5 different factors that affect Diabetic% in addition to Obesity and Physical Inactivity. We are also considering the disparity between urban and rural areas with regard to these factors. We have decided on the 5 issues that we can talk about based on the analysis we have completed thus far.