September 11, 2023

I performed an exploratory analysis of the given data. Using the seaborn library in Python, I plotted the distributions of the %diabetic, %obese and %inactive data. The %diabetic distribution showed a slight right skew, while the other two distributions were heavily skewed to the left.
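As a rough sketch of how the skew seen in a plot can be checked numerically, the snippet below computes a moment-based skewness and wraps a seaborn histogram in a helper. The function names and the sample numbers are made up for illustration; they are not the CDC values or the exact code I ran.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5.

    Positive means a longer right tail, negative a longer left tail.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def plot_distribution(values, label):
    """Histogram with a KDE overlay; requires seaborn and matplotlib."""
    import seaborn as sns
    import matplotlib.pyplot as plt

    ax = sns.histplot(values, kde=True)
    ax.set_xlabel(label)
    plt.show()


# Hypothetical percentages, for illustration only (not the real data).
example_pct = [6.1, 6.8, 7.0, 7.4, 7.9, 8.3, 9.5, 12.2]
print(sample_skewness(example_pct))  # positive for this sample (right skew)
```

Calling plot_distribution(example_pct, "%diabetic") would draw the histogram; the skewness value is only a numeric cross-check on what the plot shows.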

I found additional data on the CDC website pertaining to food access and the socioeconomic status of various counties, as well as data that tags each county as rural or urban. My group has decided to pursue this line of analysis, in which we try to ascertain whether socioeconomic status and other factors such as food access and transportation show discernible differences between urban and rural areas.

We also found another measure called the “Food environment index”, which takes into account socioeconomic status, household composition (such as population and age) and infrastructure information in each county. This measure could be used in the analysis because it gives a comprehensive look at multiple features that are otherwise hard to combine.

I am considering approaching this problem with a linear regression model. I will try to estimate the missing values of %obesity and %inactivity using the %diabetic data together with the additional data mentioned above. However, I am skeptical about how directly the various features are correlated, because the data describe percentages of each county's population rather than individuals, so a relationship seen across counties need not hold for individual people.
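A minimal sketch of the imputation idea, assuming a single predictor (%diabetic) and ordinary least squares written out by hand. The helper names fit_line and impute_missing and the numbers are hypothetical and deliberately lie on an exact line; the real model would likely use several predictors, for example through scikit-learn's LinearRegression.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_missing(x, y):
    """Fill None entries of y with predictions from a line fit on known pairs."""
    known = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    slope, intercept = fit_line([k[0] for k in known], [k[1] for k in known])
    return [yi if yi is not None else slope * xi + intercept
            for xi, yi in zip(x, y)]


# Hypothetical county values, for illustration only (not the real data).
pct_diabetic = [6.0, 7.0, 8.0, 9.0, 10.0]
pct_obese = [13.0, 15.0, None, 19.0, None]  # None marks a missing value

filled = impute_missing(pct_diabetic, pct_obese)
print(filled)
```

Only the counties with a known %obese value are used for fitting; the remaining counties receive the fitted prediction, which inherits whatever bias the county-level relationship has.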
