September 27, 2023

I fit polynomial regression models (degrees 1, 2, 3, and 4) to predict %Diabetes, first using %Obesity alone and then using %Inactivity alone.

%Obesity vs. %Diabetic

%Inactivity vs. %Diabetic

% Below Poverty vs. %Diabetic
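The single-feature fits of degrees 1 to 4 described above could be sketched roughly as follows. This is a minimal illustration using NumPy's `polyfit` on synthetic stand-in data, not the CDC values; the variable names, the data-generating line, and the `poly_r2` helper are all assumptions made for the example.

```python
import numpy as np

# Synthetic stand-in data (NOT the CDC values): x plays the role of %Obesity,
# y plays the role of %Diabetes.
rng = np.random.default_rng(0)
x = rng.uniform(10, 25, size=300)
y = 2.0 + 0.3 * x + rng.normal(0, 0.5, size=300)


def poly_r2(x, y, degree):
    """Fit a polynomial of the given degree by least squares; return training R^2."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# One fit per degree, as in the entry above.
r2_by_degree = {d: poly_r2(x, y, d) for d in (1, 2, 3, 4)}
for d, r2 in r2_by_degree.items():
    print(f"degree {d}: training R^2 = {r2:.4f}")
```

Training R^2 cannot decrease as the degree grows, because each higher-degree model contains the lower-degree ones, so a comparison on held-out data is the fairer way to choose a degree.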

I also performed a multiple linear regression using all three features. The fitted coefficients were: [0.1048, 0.1509, 1.6913].
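A multiple linear regression on three features can be sketched with ordinary least squares as below. The data is synthetic and the "true" coefficients are arbitrary values chosen for the example; they are not the coefficients reported above, and the column order is hypothetical.

```python
import numpy as np

# Synthetic stand-in data: three hypothetical feature columns (percentages).
rng = np.random.default_rng(1)
n = 500
X = rng.uniform(5, 30, size=(n, 3))
true_coefs = np.array([0.2, 0.1, 0.05])  # arbitrary values for the example
true_intercept = 1.5
y = X @ true_coefs + true_intercept + rng.normal(0, 0.3, size=n)

# Append a column of ones so the least-squares solution includes an intercept.
A = np.column_stack([X, np.ones(n)])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
coefs, intercept = beta[:3], beta[3]

print("coefficients:", np.round(coefs, 4))
print("intercept:", round(float(intercept), 4))
```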

I performed 5-fold cross-validation on the three-variable model. The R² scores for the 5 folds were: [0.53857178, 0.57042713, 0.55656731, 0.61902526, 0.59719045]

The R² score after 10-fold cross-validation was 0.613141634
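The k-fold scoring above can be illustrated with a small NumPy-only sketch. The `kfold_r2` helper and the synthetic data are assumptions made for the example; scikit-learn's `cross_val_score` is the more usual tool for this, and nothing here is meant to reproduce the scores reported above.

```python
import numpy as np


def kfold_r2(X, y, k, seed=0):
    """Return the held-out R^2 of a linear model for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the k-1 training folds (a column of ones adds the intercept).
        A_train = np.column_stack([X[train], np.ones(len(train))])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        # Score on the single held-out fold.
        pred = np.column_stack([X[test], np.ones(len(test))]) @ beta
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)


# Synthetic stand-in data with three hypothetical features.
rng = np.random.default_rng(5)
X = rng.uniform(5, 30, size=(600, 3))
y = X @ np.array([0.2, 0.1, 0.05]) + 1.5 + rng.normal(0, 0.3, size=600)

scores_5 = kfold_r2(X, y, k=5)
scores_10 = kfold_r2(X, y, k=10)
print("5-fold R^2:", np.round(scores_5, 4))
print("10-fold mean R^2:", round(float(scores_10.mean()), 4))
```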

 

September 25, 2023

Today’s class covered a resampling method: cross-validation.

Cross-validation is a technique for estimating how well a model generalizes to unseen data. The data is divided into folds, a model is built on all but one of the folds, and that model is validated on the remaining fold; this is repeated so that each fold serves once as the validation set. The method is useful when there is not enough data to set aside separate sets for training and testing. Resampling at random and using different subsets of the data gives a more reliable estimate of the model’s accuracy than a single split does. The simplest variant, the validation set approach, divides the data into two equal groups, trains on one (the training set), and tests on the other (the validation set). However, this is less effective than k-fold cross-validation, where the data is split into k folds and each fold is held out in turn.

I intend to use this technique when modeling the CDC diabetes data. There are 2918 instances in the data, so a 10-fold cross-validation would put roughly 290 instances in each fold.
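The fold sizes for this plan can be checked with a short calculation. The sketch below uses NumPy's `array_split`, which is one reasonable way to divide 2918 indices into 10 nearly equal folds; a different splitter could distribute the remainder differently.

```python
import numpy as np

n_instances, k = 2918, 10

# Split the row indices into k nearly equal folds and count each one.
fold_sizes = [len(fold) for fold in np.array_split(np.arange(n_instances), k)]
print(fold_sizes)  # eight folds of 292 and two folds of 291
```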

September 20, 2023

Today’s class covered an analysis of crab molting data and explained the t-test, a procedure for assessing whether the difference between the mean values of two distributions is statistically significant. The t-test is unreliable for non-normal distributions, so Monte Carlo random sampling is used instead to check whether the difference in means is in fact significant.

I applied these concepts to the diabetes dataset. I started by combining the three original data sets with the additional data I obtained from the CDC website, namely: number of physicians per 100,000 population, health insurance percentage, and below-poverty percentage. I used a pandas merge to combine the data sets into one data frame. I then plotted the distributions of the obesity data and the physical inactivity data (2918 data points each).
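Combining several per-county tables with a pandas merge might look like the sketch below. The `FIPS` key column, the other column names, and all of the values are hypothetical; the actual CDC files may use a different county identifier.

```python
import pandas as pd

# Tiny made-up tables keyed on a hypothetical county identifier column.
obesity = pd.DataFrame({"FIPS": [1001, 1003, 1005], "pct_obese": [18.1, 17.4, 19.0]})
inactivity = pd.DataFrame({"FIPS": [1001, 1003, 1005], "pct_inactive": [15.2, 14.8, 16.9]})
diabetes = pd.DataFrame({"FIPS": [1001, 1005], "pct_diabetic": [8.3, 9.7]})

# Chain inner merges so that only counties present in every table are kept.
merged = (
    obesity
    .merge(inactivity, on="FIPS", how="inner")
    .merge(diabetes, on="FIPS", how="inner")
)
print(merged)
```

An inner join drops counties that are missing from any table; `how="outer"` would keep them and fill the gaps with NaN instead.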

There is a difference between the means of the two distributions, so I applied the techniques taught in class. The results were as follows:

Monte Carlo random sampling (1 million iterations): the maximum observed difference in means was 0.8661069225496902. The distribution of the means across samples is plotted below.
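One common way to run this kind of Monte Carlo check is a permutation test: pool both samples, shuffle, split again, and record the difference in means each time. The sketch below assumes that procedure and uses synthetic data with far fewer iterations than the 1 million above; the `permutation_test` helper is hypothetical and may differ from the method used in class.

```python
import numpy as np


def permutation_test(a, b, n_iter=5000, seed=0):
    """Shuffle the pooled data n_iter times; return observed diff, null diffs, p-value."""
    rng = np.random.default_rng(seed)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        perm = rng.permutation(pooled)
        diffs[i] = perm[: len(a)].mean() - perm[len(a):].mean()
    # Two-sided p-value: share of shuffles at least as extreme as the observed diff.
    p_value = np.mean(np.abs(diffs) >= abs(observed))
    return observed, diffs, p_value


# Synthetic stand-ins for two percentage distributions with different means.
rng = np.random.default_rng(6)
sample_a = rng.normal(18.0, 1.0, size=300)
sample_b = rng.normal(16.0, 1.0, size=300)

observed, diffs, p_value = permutation_test(sample_a, sample_b)
print("observed difference:", round(float(observed), 4))
print("largest shuffled difference:", round(float(np.abs(diffs).max()), 4))
print("p-value:", p_value)
```

If the largest difference produced by shuffling stays well below the observed difference, the observed gap is unlikely to be due to chance.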

September 18, 2023

I found additional data on the CDC website containing the full dataset for inactivity, obesity, and other features. It also includes the number of physicians per 100,000 residents of each county and the percentage of residents of each county who do not have health insurance. I computed the correlation between the obesity rate and the number of physicians per 100,000 residents of each county. The correlation was -0.035, a very weak negative correlation.

There is also a slight negative correlation between the obesity rate and the total number of physicians per county. This result is counterintuitive, although the correlation is too weak to support any conclusion.
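A Pearson correlation like the one above can be computed with `np.corrcoef`. The sketch uses two independent synthetic columns, so the coefficient should land near zero; the variable names and values are made up and do not reproduce the -0.035 figure.

```python
import numpy as np

# Two independent synthetic columns standing in for the county-level data.
rng = np.random.default_rng(2)
pct_obese = rng.normal(18.0, 1.0, size=1000)
physicians_per_100k = rng.normal(60.0, 20.0, size=1000)

# corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(pct_obese, physicians_per_100k)[0, 1]
print("Pearson r:", round(float(r), 4))
```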

September 13, 2023

I delved further into the concepts taught in class including p-value, kurtosis, skewness and heteroscedasticity.

With regard to the analysis of the diabetes dataset, I am considering obtaining past years’ data for the same factors and analyzing how the percentage of diabetics has changed over the years. This could also provide insight into how it might change in the future, along with information about how one factor affects another.

There is also another data set that records new diabetes diagnoses. Paired with the past years’ %diabetic data, it would show how many new patients are diagnosed year by year.

September 11, 2023

I performed an exploratory analysis of the given data. I plotted the distributions of the %diabetic, %obese, and %inactive data using the seaborn library in Python. The %diabetic distribution showed a slight right skew, while the other two distributions were heavily skewed to the left.
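The direction of skew seen in a plot can be confirmed numerically with the sample skewness, which is positive for a right skew and negative for a left skew. The sketch below uses synthetic data, and `sample_skewness` is a helper written for this example (the third standardized moment), not a library function.

```python
import numpy as np


def sample_skewness(values):
    """Third standardized moment: > 0 means right-skewed, < 0 means left-skewed."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5


# Synthetic examples: a right-skewed sample and its mirror image.
rng = np.random.default_rng(3)
right_skewed = rng.gamma(shape=2.0, scale=1.0, size=5000)
left_skewed = 30.0 - rng.gamma(shape=2.0, scale=1.0, size=5000)

skew_right = sample_skewness(right_skewed)
skew_left = sample_skewness(left_skewed)
print("right-skewed sample:", round(float(skew_right), 3))
print("left-skewed sample:", round(float(skew_left), 3))
```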

I found additional data on the CDC website pertaining to food access and the socioeconomic status of various counties, as well as data that tags counties as rural or urban. My group has decided to pursue this line of analysis, in which we try to ascertain whether socioeconomic status and other factors such as food access and transportation have discernible characteristics in urban areas compared with rural areas.

We also found another measure called “Food environment index” which takes into account socioeconomic status, household composition (such as population, age etc.) and infrastructure information in each county. This measure could be used in the analysis as it gives a comprehensive look at multiple features which are otherwise hard to combine.

I am considering approaching this problem with a linear regression model. I will try to estimate the missing values of %obesity and %inactivity from the %diabetic data and the additional data mentioned above. However, I am skeptical that the features are directly correlated, since the data describes percentages of county populations rather than individuals.
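Filling in missing values with a regression could be sketched as follows: fit a line on the rows where the target is known, then predict it for the rows where it is missing. The data is synthetic, the variable names are hypothetical, and only a single predictor is used to keep the example short.

```python
import numpy as np

# Synthetic stand-in data: %obese depends linearly on %diabetic plus noise.
rng = np.random.default_rng(4)
pct_diabetic = rng.uniform(5, 15, size=200)
pct_obese = 10.0 + 0.8 * pct_diabetic + rng.normal(0, 0.5, size=200)

# Pretend the first 40 counties have no obesity figure.
pct_obese_obs = pct_obese.copy()
pct_obese_obs[:40] = np.nan

# Fit a straight line on the rows where the target is known...
known = ~np.isnan(pct_obese_obs)
slope, intercept = np.polyfit(pct_diabetic[known], pct_obese_obs[known], 1)

# ...and use it to fill in only the missing rows.
filled = pct_obese_obs.copy()
filled[~known] = slope * pct_diabetic[~known] + intercept
print("filled", int((~known).sum()), "missing values")
```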