Today’s class covered an analysis of crab molting data and explained the t-test, a procedure used to assess whether the difference between the means of two samples is statistically significant. The t-test assumes the data are roughly normally distributed, so its result can be unreliable for non-normal distributions. In that case, Monte Carlo random sampling is used to check whether the difference in means is in fact significant.
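As a rough illustration of both ideas, here is a minimal Python sketch using only the standard library. It assumes the Monte Carlo step is a permutation-style resampling (pool the two samples, re-split them at random, and record the difference in means), which may differ in detail from the procedure shown in class. The function names `welch_t` and `permutation_diffs` are my own, the t statistic shown is the Welch (unequal variance) variant, and the samples are synthetic, not the crab data.

```python
import random
import statistics
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_a = statistics.variance(a)  # sample variance, n - 1 in the denominator
    var_b = statistics.variance(b)
    return (mean(a) - mean(b)) / sqrt(var_a / len(a) + var_b / len(b))


def permutation_diffs(a, b, n_iter=2_000, seed=0):
    """Differences in means after randomly re-splitting the pooled data."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    n_a = len(a)
    diffs = []
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diffs.append(mean(pooled[:n_a]) - mean(pooled[n_a:]))
    return diffs


# Synthetic, made-up samples (assumption: NOT the class data).
rng = random.Random(1)
sample_a = [rng.gauss(10.0, 2.0) for _ in range(200)]
sample_b = [rng.gauss(11.0, 2.0) for _ in range(200)]

observed = mean(sample_a) - mean(sample_b)
diffs = permutation_diffs(sample_a, sample_b)

# Fraction of random re-splits at least as extreme as the observed difference.
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)

print("Welch t statistic:", welch_t(sample_a, sample_b))
print("observed difference in means:", observed)
print("largest resampled difference:", max(abs(d) for d in diffs))
print("estimated p-value:", p_value)
```

If the observed difference is larger than every resampled difference, none of the random re-splits reproduced it by chance, which is the Monte Carlo argument for significance.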
I applied these concepts to the diabetes dataset. I started by combining the three data sets with the additional data I obtained from the CDC website, namely the number of physicians per 100,000 population, the health insurance percentage, and the percentage below poverty. I used a pandas merge to combine the data sets into one data frame, then plotted the distributions of the Obesity data and the Physical Inactivity data (2918 data points each).
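The merge step might look something like the sketch below. The key column `FIPS`, the value column names, and all the numbers are hypothetical placeholders, not the actual CDC columns or data, and the inner join is an assumption, since the join type is not something I recorded.

```python
from functools import reduce

import pandas as pd

# Tiny made-up tables; column names and values are hypothetical.
diabetes = pd.DataFrame({"FIPS": [1001, 1003, 1005], "diabetes_pct": [9.1, 8.4, 11.2]})
obesity = pd.DataFrame({"FIPS": [1001, 1003, 1007], "obesity_pct": [33.0, 29.5, 35.1]})
inactivity = pd.DataFrame({"FIPS": [1001, 1005, 1007], "inactivity_pct": [26.0, 30.2, 28.8]})

frames = [diabetes, obesity, inactivity]

# Merge the frames pairwise on the shared key; an inner join keeps only
# the rows whose key appears in every table.
merged = reduce(lambda left, right: left.merge(right, on="FIPS", how="inner"), frames)

print(merged)
```

With `how="inner"`, only rows present in every table survive. Passing `how="outer"` instead would keep all rows and fill the missing values with NaN.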
The means of the two distributions differ, so I applied the techniques taught in class to check whether the difference is significant. The results are as follows:
Monte Carlo random sampling (1 million iterations): maximum observed difference in means = 0.8661069225496902. The distribution of the various means is plotted below.