November 29, 2023

After seeing the performance of the models, I performed exponential smoothing as well as seasonal decomposition on individual tiers of my data, starting with Tier 3 and Tier 2 crimes.

The general trend for Tier 2 crimes (assault and battery) has been a slow decrease since 2015, with a large drop in 2020-2021. This drop could initially be attributed to Covid-19, but since the numbers did not rebound after the lockdown ended, I suspect it reflects a drop in the number of reports rather than a true decrease in the crime rate.

Tier 3 crimes (burglary and robbery) have shown a steady decrease since 2015. Burglary in particular showed very consistent seasonality and a strong downward trend over the years.

 

November 27, 2023

Seasonal decomposition 

Since the data has seasonality, I performed a seasonal decomposition to see the extent of the seasonality and the underlying trend in the data. To accomplish this, I used the STL class from statsmodels.tsa.seasonal.

This process divided the data into a seasonal component, a trend component, and a residual component; the complete dataset is the sum of these three components. The predictions from this model were better than those of ARIMA, with an RMSE of 310.34. However, I believe the model was hindered by outliers in June 2015 and November 2023: the number of crimes recorded in these two months was lower than the usual monthly average because the entire month was not included in the dataset. To deal with this, in my further analysis I will restrict the data to July 2015 through October 2023 so that only complete months are included.

November 24, 2023

Exponential Smoothing 

Given the failure of the ARIMA model to capture the seasonality and trend in the data, I studied different models that might fit the data better. One of the models I found was the Exponential Smoothing State Space (ETS) model. This model is best suited to data with seasonality or a pervasive general trend.

Since the data at hand had both these features, I applied the ETS model and found better results. Initially, I split the data 80-20 into training and testing sets. This meant the data from 2015 up to mid-2022 was used for training, and the model would try to predict the trend from mid-2022 to mid-2023.
This allowed me to understand whether the model performs well or not and to decide how to change the model before trying to forecast into the future. 

I used the evaluation metrics MSE and MAPE to compare the performance of each model. The MSE for the ARIMA predictions on the accidents data was 89911.182, which is a very large value. By comparison, the MSE for exponential smoothing was 1148.91, a large improvement in prediction results.

 

November 22, 2023

ARIMA  

To delve deeper into time series modelling, I considered the monthly counts of each crime, i.e. the number of incidents of each type reported in each month from July 1, 2015, through October 31, 2023. However, for drug-related crimes, the dataset has no data after 2019.

Using these monthly crime counts, I created a pandas Series whose index contains the timestamps and whose values contain the monthly counts. This serves as my primary data for all temporal analysis and forecasting.
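The aggregation from raw incident rows to a monthly count Series can be done with a resample. The column name `OCCURRED_ON_DATE` matches the Analyze Boston dataset; the toy rows below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "OCCURRED_ON_DATE": ["2015-07-03 10:00", "2015-07-15 22:30",
                         "2015-08-01 01:15", "2015-08-20 13:45",
                         "2015-08-29 18:00"],
})
df["OCCURRED_ON_DATE"] = pd.to_datetime(df["OCCURRED_ON_DATE"])

# one row per incident -> monthly count series indexed by timestamp
monthly_counts = (df.set_index("OCCURRED_ON_DATE")
                    .resample("MS")
                    .size())
print(monthly_counts.tolist())  # [2, 3]
```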

The first model I intended to apply to the data was ARIMA – Autoregressive Integrated Moving Average.
This model did not perform well in forecasting, as it does not handle seasonality or non-linear dependencies in the data well. The predictions from the ARIMA model were not representative of the true values. I tried varying the model parameters p, d, and q, but the predictions from all combinations were subpar.

The ADF statistic for the series was -0.20585, which suggests that the data is non-stationary.

November 20, 2023

After the tier list was created, I examined each tier of crimes individually to conduct a deeper analysis.

Most crimes show little to no difference with regard to the day on which they were committed. However, murder-related incidents occurred with significantly higher frequency on weekends. As expected, accidents peaked during the evening hours, with early to mid-morning showing a low frequency. Since 2015, the number of accidents has steadily increased, while reported incidents of most other crimes have slightly decreased or remained unchanged. Reported burglary incidents in particular have steadily decreased since 2015.

In the accident data, we see a sharp drop in incidents in 2020, coinciding with the Covid-19 pandemic lockdown. Once the lockdown was lifted, the accident occurrence rate picked up again.
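The day-of-week and hour-of-day breakdowns behind these observations can be computed directly from the timestamp column; the three toy rows here are illustrative, and `OCCURRED_ON_DATE` is assumed to be the timestamp field:

```python
import pandas as pd

df = pd.DataFrame({"OCCURRED_ON_DATE": pd.to_datetime(
    ["2023-01-07 19:00", "2023-01-08 20:30", "2023-01-09 08:00"])})

# frequency of incidents by day of week and by hour of day
by_day = df["OCCURRED_ON_DATE"].dt.day_name().value_counts()
by_hour = df["OCCURRED_ON_DATE"].dt.hour.value_counts().sort_index()
print(by_day.to_dict())
```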

November 17, 2023

Crime Incident reports data 

The total number of instances in the data since 2015 was over 720k. While this dataset was certainly large, it was not yet clean enough to use.

The crime data had many incident reports with ambiguous descriptions; the most common offense description was "investigate person". To clean the data, I started by manually going through the offense codes listed on the website and created a tier list of crimes to use for further analysis. The tier list is as follows:

Tier 1 – Murder, Manslaughter, Rape
Tier 2 – Arson, Aggravated Assault and Battery
Tier 3 – Non-violent crimes like Larceny, Burglary, Robbery and Breaking & Entering
Tier 4 – Vandalism and Vehicular accidents 

Alongside these I created a separate tier for drug related crimes. 

After this step, I was left with a dataset of 216k incidents, each belonging to one of these five tiers.
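The tier assignment can be sketched as a keyword lookup over the offense descriptions. The keyword lists below are illustrative, not the full mapping I built from the offense codes, and `OFFENSE_DESCRIPTION` is assumed to be the relevant column:

```python
import pandas as pd

TIER_KEYWORDS = {
    1: ["MURDER", "MANSLAUGHTER", "RAPE"],
    2: ["ARSON", "ASSAULT", "BATTERY"],
    3: ["LARCENY", "BURGLARY", "ROBBERY", "B&E"],
    4: ["VANDALISM", "ACCIDENT"],
    5: ["DRUGS"],
}

def assign_tier(description: str):
    """Return the first tier whose keywords match, or None for ambiguous reports."""
    for tier, words in TIER_KEYWORDS.items():
        if any(w in description.upper() for w in words):
            return tier
    return None  # e.g. "INVESTIGATE PERSON" gets dropped

df = pd.DataFrame({"OFFENSE_DESCRIPTION": [
    "LARCENY SHOPLIFTING", "INVESTIGATE PERSON", "ASSAULT - AGGRAVATED"]})
df["TIER"] = df["OFFENSE_DESCRIPTION"].map(assign_tier)
df = df.dropna(subset=["TIER"])  # keep only the tiered incidents
print(df["TIER"].tolist())  # [3.0, 2.0]
```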

November 15, 2023

For Project 3, since we have the freedom to choose a dataset from the Analyze Boston website, we have chosen to work with the CRIME INCIDENT REPORTS dataset. This is a comprehensive database of all the reported crimes that occurred in the districts of Boston, covering all kinds of incidents ranging from small misdemeanors to serious felonies like manslaughter.

The dataset contains the date, time, and location of each incident as well as an offense code and description. I intend to use this dataset to analyze how safety standards in the city of Boston have changed over the years.

November 6, 2023

K Means Clustering

In class, we discussed the different clusters that form in the state of California when the coordinates are plotted on a map. I performed a similar clustering analysis for the next two most prominent states, Texas and Florida, as well as for all the states on the East Coast of the USA.

A similar trend is seen in both states: large clusters form around big cities such as Houston and San Antonio in Texas, and Jacksonville and Miami in Florida.
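The clustering step can be sketched with scikit-learn's KMeans on (latitude, longitude) pairs. The points below are synthetic stand-ins scattered around two city centers, not the real incident coordinates:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# two synthetic "cities": tight clouds of (lat, lon) points
city_a = rng.normal([29.76, -95.37], 0.05, size=(50, 2))  # around Houston
city_b = rng.normal([29.42, -98.49], 0.05, size=(50, 2))  # around San Antonio
coords = np.vstack([city_a, city_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
print(np.bincount(kmeans.labels_))  # cluster sizes
```

Plotting `coords` colored by `kmeans.labels_` on a map is what reveals the city-centered clusters.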

November 3, 2023

Logistic Regression 

I created dummies for the categorical variables in the dataset in order to perform logistic regression. The data was again split into test and train sets with a 30-70 split. The logistic regression model was trained on the training set. The model report is shown below.

The precision when predicting White is 0.67 and when predicting Black is 0.66. This is to be expected, since there are far more instances for White and Black than for the other races. The recall score for White is 0.78 and for Black is 0.61.

The F1 score for White was 0.72 and for Black was 0.61. The image below shows the full classification report.

The total accuracy score for the logistic regression model was 0.648. 
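The dummy-encoding, 30-70 split, and report generation can be sketched as below. The DataFrame, column names, and target are illustrative stand-ins, not the real shootings dataset:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# toy stand-in for the encoded dataset
df = pd.DataFrame({
    "gender": ["M", "F"] * 50,
    "age": list(range(20, 70)) * 2,
    "race": (["White"] * 25 + ["Black"] * 25) * 2,
})
X = pd.get_dummies(df[["gender", "age"]], drop_first=True)  # dummies for categoricals
y = df["race"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
acc = accuracy_score(y_test, model.predict(X_test))
print("accuracy:", acc)
```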

November 1, 2023

Random Forest Classifier 

I performed categorical encoding on the dataset to convert all string values into numerical values, since most machine learning models work only with numerical inputs.

I split the dataset into train and test sets with a 70-30 split using the sklearn train_test_split function. The data was then run through a random forest classifier. The features used were: age, manner of death, gender, signs of mental illness, threat level, body camera, latitude and longitude, and the share of each race in the city.

The accuracy score on the test set when predicting race from the other variables was 0.618. Grid search cross-validation was performed to find the best parameters for this specific model; however, the change in accuracy was minimal.
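The random-forest fit plus grid search can be sketched as below, using a synthetic stand-in for the encoded dataset; the feature matrix, target, and parameter grid are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))              # encoded numeric features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# grid search cross-validation over a small illustrative parameter grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```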