December 8, 2023
Today we continued work on the report for this project.
December 6, 2023
With the majority of my analysis being time series forecasting of the data, our group has now shifted its focus towards combining our insights to write the report for this project.
We discussed the questions we wanted to answer with our analysis at the start of the project; now we will revisit them and see to what extent we are able to answer them.
December 4, 2023
Vision Zero
From the Analyze Boston website, I found another dataset pertaining to accidents and road safety called Vision Zero. The aim of this initiative is to completely eradicate fatal car crashes in Boston by 2030.
According to this dataset, there has been a steady decrease in the number of car crash incidents in Boston since 2015. There is a steep drop in 2020 due to the pandemic-related lockdown, but the frequency jumps back toward pre-Covid averages soon after.
The RMSE of the exponential smoothing model is 49.399.
Seasonal Decomposition shows a reduction in accidents since 2020. Between 2015 and 2020, the accident rate is shown to be more or less constant.
December 1, 2023
I applied exponential smoothing, seasonal decomposition and SARIMAX models to the different tiers and to individual crime types.
In the case of murders, there is a general downward trend, but it is not very consistent. Tier 2 (assault, arson) and Tier 3 (burglary, robbery, etc.) crimes show a steady downward trend since 2015. This could be due to actual policing improvements or to reduced reporting.
I applied a similar analysis to the accident data to gauge safety standards on the roads of Boston. The general trend shows a steady increase in the number of accidents occurring in Boston. The seasonal component peaks during the middle of the year and shows a low frequency of accidents during the winter months. This was the opposite of my expectation that the more dangerous road conditions in winter would increase the number of accidents.
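As a rough illustration of the SARIMAX fitting mentioned above, here is a minimal sketch in Python, assuming a monthly count Series named monthly_counts for one tier; the non-seasonal and seasonal orders shown are illustrative rather than the tuned values.

from statsmodels.tsa.statespace.sarimax import SARIMAX

# fit a seasonal ARIMA model to one tier's monthly counts
model = SARIMAX(monthly_counts,
                order=(1, 1, 1),               # illustrative AR/differencing/MA orders
                seasonal_order=(1, 1, 1, 12))  # yearly seasonality for monthly data
result = model.fit(disp=False)
forecast = result.get_forecast(steps=12).predicted_mean   # forecast the next 12 months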
November 29, 2023
After seeing the performance of the models, I performed exponential smoothing as well as seasonal decomposition on individual tiers of my data, working on Tier 2 and Tier 3 crimes.
The general trend for Tier 2 crimes (assault and battery) has been a slow decrease since 2015, with a large drop in 2020-2021. This drop could initially be attributed to Covid-19, but since the numbers did not jump back up after the lockdown ended, I suspect it reflects a drop in the number of reports rather than a true decrease in the crime rate.
Tier 3 crimes (burglary and robbery) have shown a steady decrease since 2015. Burglary showed very consistent seasonality and a strong downward trend over the years.
November 27, 2023
Seasonal decomposition
Since the data has seasonality, I performed a seasonal decomposition to see the extent of the seasonality and the underlying trend in the data. To accomplish this, I used the STL class from statsmodels.tsa.seasonal.
This process divided the data into a seasonal component, a trend component and a residual component, with the complete series being the sum of the three. The predictions from this model were better than those from ARIMA; the RMSE was 310.34. However, I believe the model was hindered by outliers in June 2015 and November 2023: the number of crimes recorded in these two months was lower than the usual monthly average because the entire month was not included in the dataset. To deal with this, in my further analysis I will restrict the data to July 2015 through October 2023 so that only complete months are included.
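A minimal sketch of this decomposition, assuming the monthly count Series monthly_counts built from the incident data, with the trimming to complete months described above:

from statsmodels.tsa.seasonal import STL

monthly = monthly_counts.loc["2015-07":"2023-10"]   # keep only complete months
result = STL(monthly, period=12).fit()              # yearly seasonality
trend, seasonal, resid = result.trend, result.seasonal, result.resid
reconstructed = trend + seasonal + resid            # components sum back to the series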
November 24, 2023
Exponential Smoothing
Given the failure of the ARIMA model to capture the seasonality and trend in the data, I studied other models that could fit the data better. One of the models I found was the exponential smoothing state space (ETS) model. This model is best suited to data that has seasonality or a pervasive underlying trend.
Since the data at hand has both of these features, I applied the ETS model and found better results. I first split the data 80-20 into training and testing sets, meaning the data from 2015 up to mid-2022 was used for training and the model would try to predict the trend from mid-2022 to mid-2023.
This allowed me to understand whether the model performs well or not and to decide how to change the model before trying to forecast into the future.
I used the evaluation metrics MSE and MAPE to compare the performance of each model.
The MSE for the ARIMA predictions was 89911.182, which is very large. By comparison, the MSE for exponential smoothing on the accident data was 1148.91, a large improvement in prediction quality.
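A minimal sketch of the 80-20 split and exponential smoothing fit, assuming a monthly Series monthly_counts; the additive trend and seasonal settings are illustrative choices rather than the exact configuration used.

from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

split = int(len(monthly_counts) * 0.8)
train, test = monthly_counts.iloc[:split], monthly_counts.iloc[split:]

fit = ExponentialSmoothing(train, trend="add", seasonal="add",
                           seasonal_periods=12).fit()
pred = fit.forecast(len(test))

print("MSE:", mean_squared_error(test, pred))
print("MAPE:", mean_absolute_percentage_error(test, pred))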
November 22, 2023
ARIMA
To delve deeper into time series modelling, I considered the monthly counts of each crime, i.e. the number of incidents of each type reported in each month from July 1, 2015, through October 31, 2023. However, for drug-related crimes, the dataset does not have data after 2019.
Using these monthly crime counts, I created a pandas Series whose index contains the month timestamps and whose values contain the monthly counts. This serves as my primary data for all temporal analysis and forecasting.
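A minimal sketch of how such a Series can be built, assuming the incident reports are in a DataFrame named crimes with a date column called OCCURRED_ON_DATE (the actual column name may differ):

import pandas as pd

crimes["OCCURRED_ON_DATE"] = pd.to_datetime(crimes["OCCURRED_ON_DATE"])
monthly_counts = (crimes.set_index("OCCURRED_ON_DATE")
                        .resample("M")    # one bucket per calendar month
                        .size())          # number of incidents in each month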
The first model I intended to apply to the data was ARIMA – Autoregressive Integrated Moving Average.
This model did not perform well in forecasting, as it does not handle seasonality and non-linear dependencies in the data well. The predictions from the ARIMA model were not representative of the true values. I tried changing the model parameters p, d and q, but the predictions from all combinations were subpar.
The ADF (Augmented Dickey-Fuller) statistic for the series was -0.20585, which suggests that the data is non-stationary.
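A minimal sketch of the stationarity check and ARIMA fit, using the monthly_counts Series above; the (p, d, q) order shown is just one of the combinations that could be tried.

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(monthly_counts)
print("ADF statistic:", adf_stat, "p-value:", p_value)

fit = ARIMA(monthly_counts, order=(2, 1, 2)).fit()
forecast = fit.forecast(steps=12)   # forecast the next 12 months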
November 20, 2023
After the tier list was created, I ventured into each tier of crimes individually to conduct a deep analysis.
Most crimes show little to no difference with regard to the day of the week on which they were committed. However, murder incidents showed a noticeably higher frequency on weekends. As expected, accidents peaked during the evening hours, with early to mid-morning showing a low frequency. Since 2015, the number of accidents has steadily increased, while the number of reported incidents of most other crimes has slightly decreased or remained unchanged. The number of reported burglary incidents has decreased steadily since 2015.
In the accident data, we can see a sharp drop in incidents in 2020, which coincides with the Covid-19 Pandemic Lockdown. But once the lockdown was lifted, the accident occurrence rate picked up again.
November 17, 2023
Crime Incident Reports data
The total number of instances in the data since 2015 was over 720k. While the dataset was large, it was not yet clean enough to use.
The crime data had a lot of incident reports with ambiguous descriptions; the most common offense description was “investigate person”. To clean the data, I started by manually going through the various offense codes on the website and creating a tier list of crimes to use for further analysis. The tier list I created is as follows:
Tier 1 – Murder, Manslaughter, Rape
Tier 2 – Arson, Aggravated assault and Battery
Tier 3 – Non-violent crimes like Larceny, Burglary, Robbery and Breaking & Entering
Tier 4 – Vandalism and Vehicular accidents
Alongside these, I created a separate tier for drug-related crimes.
After this step, I had a dataset of 216k incidents belonging to one of these five tiers.
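A minimal sketch of how the tier assignment could be expressed in code, assuming a crimes DataFrame with an OFFENSE_DESCRIPTION column; the keyword lists below are illustrative stand-ins for the manually curated offense-code groupings, not the exact ones used.

tier_keywords = {
    "Tier 1": ["MURDER", "MANSLAUGHTER", "RAPE"],
    "Tier 2": ["ARSON", "AGGRAVATED ASSAULT", "BATTERY"],
    "Tier 3": ["LARCENY", "BURGLARY", "ROBBERY", "BREAKING AND ENTERING"],
    "Tier 4": ["VANDALISM", "MOTOR VEHICLE ACCIDENT"],
    "Drugs":  ["DRUG"],
}

def assign_tier(description):
    desc = str(description).upper()
    for tier, keywords in tier_keywords.items():
        if any(keyword in desc for keyword in keywords):
            return tier
    return None   # ambiguous descriptions such as "INVESTIGATE PERSON" are dropped

crimes["tier"] = crimes["OFFENSE_DESCRIPTION"].apply(assign_tier)
crimes = crimes.dropna(subset=["tier"])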
November 15, 2023
For Project 3, since we have the freedom to choose the dataset from the Analyze Boston website, we have chosen to work with the CRIME INCIDENT REPORTS dataset. This is a comprehensive database of all the reported crimes that occurred in the districts of Boston. This dataset contains reports on all kinds of incidents, ranging from small misdemeanors to serious felonies like manslaughter.
The dataset contains data about the date, time and location that the incident occurred as well as an offense code and description. I intend to utilize this dataset to conduct an analysis as to how the safety standards in the city of Boston have changed over the years.
Report on Fatal Police Shootings Data
November 10, 2023
Today we continued writing the report and made small changes to the code to make it presentable.
November 8, 2023
Today my group focused on writing the report. We completed the Appendix portion of the report.
November 6, 2023
K Means Clustering
In class we discussed the different clusters formed in the state of California when the shooting coordinates are plotted on a map. I did a similar clustering analysis for the next two most prominent states, Texas and Florida, as well as for all the states on the East Coast of the USA.
A similar trend is seen in both of these states: large clusters form around major cities such as Houston and San Antonio in Texas, and Jacksonville and Miami in Florida.
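A minimal sketch of the clustering for one state, assuming the shootings DataFrame df has state, latitude and longitude columns; the choice of five clusters is illustrative.

from sklearn.cluster import KMeans

tx = df[df["state"] == "TX"].dropna(subset=["latitude", "longitude"])
coords = tx[["latitude", "longitude"]]

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(coords)
tx = tx.assign(cluster=kmeans.labels_)
print(kmeans.cluster_centers_)   # cluster centers, expected to sit near major cities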
November 3, 2023
Logistic Regression
I created dummy variables for the categorical data in the dataset in order to perform logistic regression. The data was again split into training and test sets with a 70-30 split. The logistic regression model was trained on the training set; the model report is shown below.
The precision when predicting White is 0.67 and when predicting Black is 0.66. This is to be expected, since there are far more instances for White and Black than for the other races. The recall for White is 0.78 and for Black is 0.61.
The F1 score for White was 0.72 and for Black was 0.61. The image below shows the full classification report.
The total accuracy score for the logistic regression model was 0.648.
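A minimal sketch of the dummy encoding, 70-30 split and logistic regression described above, assuming a cleaned DataFrame df with the target column race; the exact feature set and preprocessing may differ from what was actually used.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = pd.get_dummies(df.drop(columns=["race"]), drop_first=True)
y = df["race"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))   # precision, recall, F1 per race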
November 1, 2023
Random Forest Classifier
I performed categorical encoding on the dataset to convert all string values into numerical values, since most machine learning models only work with numerical input.
I split the dataset into training and test sets using the sklearn train_test_split function with a 70-30 split, then ran the data through a random forest classifier. The features used were: age, manner of death, gender, signs of mental illness, threat level, body camera, latitude and longitude, and the share of each race in that city.
The accuracy score on the test set when predicting race from the other variables was 0.618. Grid search cross-validation was performed to find the best parameters for this model; however, the change in accuracy was minimal.
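A minimal sketch of the random forest classifier and the grid search, assuming the encoded features are in X and the race labels in y; the parameter grid is illustrative.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
                    cv=5)
grid.fit(X_train, y_train)
print("best parameters:", grid.best_params_)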
October 29, 2023
DBSCAN
DBSCAN is a clustering algorithm that draws a radius (eps) around each point and counts how many neighbors fall within it. Points with enough neighbors become core points, clusters are built by chaining together core points and their neighbors, and points in sparse regions are labeled as noise.
I used the features age, threat level, mental illness and race to find clusters in the data using DBSCAN. The cluster means showed one cluster of White victims with an average age of around 40 and another cluster of Hispanic victims with an average age of around 39.
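A minimal sketch of the DBSCAN clustering, assuming the selected features have already been numerically encoded in the DataFrame df (the column names here are hypothetical); eps and min_samples are illustrative, and the features are standardized so the radius is comparable across them.

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

features = df[["age", "threat_level", "signs_of_mental_illness", "race"]].dropna()
scaled = StandardScaler().fit_transform(features)

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(scaled)
clusters = features.assign(cluster=labels)
print(clusters[clusters["cluster"] != -1].groupby("cluster").mean())   # per-cluster means, noise excluded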
October 27, 2023
The new city-wise data with the share of each race needed to be cleaned. The dataset had suffixes on the city names like “city”, “CDP” and “town”. These suffixes were removed, and the city and state columns were combined in the same way as in the original dataset. This combined column acts as the primary key when merging the two datasets with a left join.
A left join is used because I want to preserve every city and instance in the original dataset, while I am fine with dropping the race-share information for cities that do not appear in the shooting data. The result is a dataset with all the shooting-related information along with the race share for each city, which I will use for the analysis and modelling.
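A minimal sketch of the cleaning and left join, assuming the shootings DataFrame shootings and a census-based DataFrame race_share with City and State columns as described above.

# strip suffixes like " city", " CDP" and " town" from the census city names
race_share["City"] = (race_share["City"]
                      .str.replace(r"\s+(city|CDP|town)$", "", regex=True)
                      .str.strip())

# build the same "city, state" key on both sides
race_share["city_state"] = race_share["City"] + ", " + race_share["State"]
shootings["city_state"] = shootings["city"] + ", " + shootings["state"]

# a left join keeps every shooting and drops race-share rows for cities with no shootings
combined = shootings.merge(race_share.drop(columns=["City", "State"]),
                           on="city_state", how="left")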
October 25, 2023
The dataset has information about the race of the victim in each shooting as well as the location where the shooting occurred. I found a new dataset on the internet (source) with information about each US city’s share of each race, based on census data. This dataset has the following columns: City, State, Share_white, Share_Black, Share_Hispanic, Share_Native, Share_Asian and Share_other. All the Share variables are floats between 0 and 1, signifying the proportion of the city’s population that belongs to the given race.
This data is useful for the analysis as it provides information about the racial demographics of each city; a machine learning model can draw on these demographics when predicting the race of a shooting victim.
October 23, 2023
Specific trends in the data
In the raw data, the number of white victims is the highest, but this does not give the full picture: there are far more white people in the US than people of any other race. If we divide the number of fatal police shootings of each race by the total population of that race, we get a better measure of which race is killed more often as a proportion of its population.
The total population data was procured from the official census website: there are over 220 million white residents in the US, compared with only about 62 million Black residents. This disparity needs to be accounted for in the analysis. After normalizing, the proportion of Black deaths is more than twice that of any other race; the next highest is Native Americans, followed by Hispanics. This paints a much clearer picture of the distribution of fatal police shootings by race.
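A minimal sketch of the per-capita comparison described above. The race codes are assumed to follow the dataset's single-letter convention, and the population figures are the rough census totals quoted above, used only for illustration.

shootings_per_race = shootings["race"].value_counts()

population = {
    "W": 220_000_000,   # approximate white population from the census website
    "B": 62_000_000,    # approximate Black population
    # ... remaining races filled in the same way from census figures
}

per_million = {race: shootings_per_race.get(race, 0) / pop * 1_000_000
               for race, pop in population.items()}
print(per_million)      # fatal shootings per one million residents of each race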
October 20, 2023
Latitude and longitude missing data:
There are missing values in the latitude and longitude data, 840 to be exact. These values cannot simply be imputed with the average of all latitudes and longitudes, because that would just be some location near the center of the US.
To deal with these missing values, I performed the following steps. First, I combined the city and state columns into one column in the format “city, state”. This avoids ambiguity when cities in different states share the same name.
Then I grouped the data by city and imputed any missing values within each group (i.e. each individual city) with the mean latitude and longitude of that group. This is effectively an average location within the city where the shooting occurred, which is a much better approximation because the city information is used to fill in the location. However, in some cities, such as those with just a single shooting, every recorded location is missing, so the group mean is also undefined; the total number of missing values after this process therefore only dropped from 840 to 300.
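A minimal sketch of the city-level imputation, assuming the shootings DataFrame with the combined key stored in a city_state column.

for col in ["latitude", "longitude"]:
    # fill each missing coordinate with the mean coordinate of its own city
    city_means = shootings.groupby("city_state")[col].transform("mean")
    shootings[col] = shootings[col].fillna(city_means)

# cities whose only recorded shooting has no coordinates remain missing
print(shootings[["latitude", "longitude"]].isna().sum())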
October 18, 2023
Today we discussed the analysis of the difference in mean age between White and Black victims. After pooling the age data and repeatedly resampling at random, the simulated differences in means were very small compared to the observed difference, and the p-value from this procedure was on the order of 10^-78. This indicates that the difference in means observed in the data is statistically significant.
I then conducted a similar experiment with the data on fleeing versus not fleeing. A similar trend was observed, although it seems less meaningful, since a fleeing suspect would naturally face a higher chance of being shot.
Further, I plan to work on the data regarding the location of the shootings and the distance to the police stations involved.
October 16, 2023
Data cleaning
The data has many missing values and incorrect formatting for some of the features. The ‘gender’ column has 31 missing values; since the overwhelming majority of instances are male, I imputed ‘Male’ for the missing gender values.
The column ‘flee’ has 966 missing values. Since most of the dataset is ‘not fleeing’, I imputed that value in place of the missing ones. Similarly, the missing values in the ‘armed’ column were imputed with the most common value.
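A minimal sketch of the mode-based imputation, assuming the data is in a DataFrame named shootings (a hypothetical name).

for col in ["gender", "flee", "armed"]:
    most_common = shootings[col].mode()[0]              # most frequent value in the column
    shootings[col] = shootings[col].fillna(most_common)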
Since the date information is given, I performed a temporal analysis to show the number of shootings per year. There is a steady uptick in the number of fatal shootings since 2017.
October 13, 2023
I did some exploratory analysis on the Washington Post fatal police shootings data. There are many missing values in features like “armed”, “age”, “race” and other seemingly important features. This needs to be dealt with either by finding data elsewhere on the internet or by imputing the missing values using the other features and the feature statistics.
My initial plan for this project is to build a classification model (for example, logistic regression) that predicts the race of the shooting victim from the remaining features. The age distributions of the victims vary quite a bit: the age distribution of Black American shooting victims peaks at a much younger age than that of other races, indicating that Black Americans are shot at younger ages than other groups.
Some additional data like the area’s demographics might be useful for my analysis plan.
October 11, 2023
Understanding the Dataset
The dataset provided for this project was obtained from the Washington Post GitHub repository. The dataset contains information about the date, age, gender, race, and location of each shooting, along with some additional features.
There were a total of 8002 instances in this dataset, with many missing values. The predominant race in the dataset is White with 3300 deaths, followed by Black with 1766. The feature named “flee” records whether the victim was fleeing on foot, fleeing in a vehicle, or not fleeing. Surprisingly, the majority of the data falls in the ‘not fleeing’ category.
After cleaning, the data will be much easier to analyze; right now it has thousands of missing values across various columns.
Report on CDC Diabetes Data
October 2, 2023
Today I discussed the particulars of the report to be written for Project 1 with my group mates. As a group, we are discussing five different factors that affect the diabetic percentage, in addition to obesity and physical inactivity, and we are considering the disparity between urban and rural areas with regard to these factors. We have decided on the five issues to cover based on the analysis we have completed thus far.
September 27, 2023
I implemented linear regression models with polynomial features of degrees 1, 2, 3 and 4, using just %Obesity and just %Inactivity to predict %Diabetic.
(Plots: %Obesity vs %Diabetic, %Inactivity vs %Diabetic, and %Below Poverty vs %Diabetic.)
I also performed a multiple linear regression using all three features. The fitted coefficients were [0.1048, 0.1509, 1.6913].
I performed 5-fold cross-validation on the three variables, and the r2 scores for the five folds were [0.53857178, 0.57042713, 0.55656731, 0.61902526, 0.59719045].
The r2 score after 10-fold cross-validation was 0.613141634.
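A minimal sketch of the regressions and cross-validation described above, assuming a merged DataFrame df with hypothetical column names '%obese', '%inactive', '%below_poverty' and '%diabetic'.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = df[["%obese", "%inactive", "%below_poverty"]]
y = df["%diabetic"]

# single-feature polynomial fits (degrees 1-4), e.g. %obese vs %diabetic
for degree in range(1, 5):
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(df[["%obese"]], y)

# multiple linear regression on all three features
model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)

# 5-fold and 10-fold cross-validated r2 scores
print("5-fold r2:", cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2"))
print("10-fold r2:", cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2").mean())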
September 25, 2023
Today’s class covered a resampling method: cross-validation.
Cross-validation is a technique for estimating and improving model accuracy by dividing the data into folds, training a model on all but one of the folds, and validating it on the remaining fold. This method is useful when there is not enough data to set aside separate training and testing sets, so resampling at random and using different subsets of the data gives a more reliable picture of the model’s accuracy. The simplest form is to divide the data into two equal groups, training on one (the training set) and testing on the other (the validation set). However, this is not as effective as k-fold cross-validation, where the data is split into k folds and each fold takes a turn as the validation set.
I intend to use this technique when modeling the CDC diabetes data. There are 2918 instances in the data, so a 10-fold cross-validation with around 290 instances in each fold can be implemented.
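A minimal sketch of such a 10-fold split, assuming the merged county-level DataFrame is named df; each fold holds roughly 290 of the 2918 rows.

from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
    # train on nine folds, validate on the held-out fold
    print(f"fold {fold}: train={len(train_idx)}, validate={len(val_idx)}")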
September 20, 2023
Today’s class covered an analysis of crab molting data and explained the t-test, a procedure for assessing the statistical significance of a difference in the mean values of two distributions. The t-test is unreliable for non-normal distributions, so Monte Carlo random sampling is used instead to see whether the difference in means is in fact significant.
I applied these concepts to the diabetes dataset. I started by combining the three given datasets with the additional data I procured from the CDC website, namely the number of physicians per 100,000 population, the health insurance percentage, and the below-poverty percentage, using a pandas merge to combine them into one data frame. I then plotted the distributions of the obesity data and the physical inactivity data (2918 data points each).
There is a difference in the means of the two distributions, so I applied the techniques taught in class. The results are as follows:
Monte Carlo random sampling (1 million resamples): the maximum observed difference in means was 0.8661069225496902, and the distribution of the resampled means is plotted below.
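A minimal sketch of the pooled random-sampling procedure, assuming two Series obesity and inactivity with 2918 values each, resampled one million times as described above.

import numpy as np

observed_diff = abs(obesity.mean() - inactivity.mean())

pooled = np.concatenate([obesity.to_numpy(), inactivity.to_numpy()])
n = len(obesity)
rng = np.random.default_rng(0)

diffs = np.empty(1_000_000)
for i in range(len(diffs)):
    rng.shuffle(pooled)                                   # randomly relabel the pooled data
    diffs[i] = abs(pooled[:n].mean() - pooled[n:].mean())

p_value = (diffs >= observed_diff).mean()                 # share of resamples at least as extreme
print("max resampled difference:", diffs.max(), "p-value:", p_value)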
September 18, 2023
I found additional data on the CDC website containing the full dataset for inactivity, obesity and other features. There was also data on the number of physicians per 100,000 residents of each county, as well as the percentage of residents in each county without health insurance. I computed the correlation between the obesity rate and the number of physicians per 100,000 residents of each county; the correlation was -0.035 (-3.5%), a slight negative correlation.
There is also a slight negative correlation between the obesity rate and the total number of physicians per county. This result is counterintuitive, although the correlation is too weak to draw conclusions from.
September 13, 2023
I delved further into the concepts taught in class including p-value, kurtosis, skewness and heteroscedasticity.
With regards to the analysis of the Diabetes dataset, I am considering obtaining past years’ data of the same factors and analyzing how the percentage of diabetics has changed over the years. This could also provide insight into how it might change in the future alongside information about how one factor affects another.
There is also another data set that records new incidences of diabetes diagnosis. This, paired with the past years’ diabetic % data will show how many new patients get diagnosed year by year.
September 11, 2023
I performed an exploratory analysis of the given data, plotting the distributions of the %diabetic, %obese and %inactive variables using the seaborn library in Python. The distribution of %diabetic showed a slight right skew, while the other two distributions were heavily skewed to the left.
I found additional data on the CDC website pertaining to food access and the socioeconomic status of various counties, as well as data tagging counties as rural or urban. My group has decided to pursue this line of analysis, trying to ascertain whether socioeconomic status and other factors like food access and transportation show discernible differences between urban and rural areas.
We also found another measure called “Food environment index” which takes into account socioeconomic status, household composition (such as population, age etc.) and infrastructure information in each county. This measure could be used in the analysis as it gives a comprehensive look at multiple features which are otherwise hard to combine.
I am considering approaching this problem with a linear regression model, trying to fill in the missing values of %obesity and %inactivity using the %diabetic data as well as the additional data mentioned above. However, I am skeptical about direct correlations between the various features, since the data consists of population percentages rather than individual-level records.