pride-data-analysis/analysis/analysis1.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Nat Scott"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Research question/interests\n",
"\n",
"**Is there a correlation between political alignment & living in neighbourhoods with large quantities of LGBT people?** The obvious answer to this question is \"yes, they are going to mostly be democrats\" but anyone who's ever been around queer people will know that this question is quite a bit more nuanced than that, and this nuance is what we hope to capture in investigating this question.\n",
"\n",
"- The gaybourhoods data set does not include data on residents political alignments, however, there is a wealth of electoral data available freely online that we intend on incorporating into this project. The primary difficulty then will be developing a geographic \"compatibility layer\" between the data sets so that the data can be understood in the same context. To build this, we intend on working with the OpenStreetMap API to create an additional column representing observations position space in a more neutral way, such as their coordinates.\n",
"- Alternatively, we've also considered working with an additional data set that links US zip codes to their longitude and lattitude positions. As such, incorporating this data would be as easy as merging the two tables.\n",
"\n",
"\n",
"**Is there a correlation between geographical stratums & being LGBT?** This question is more abstract, and will serve as a preliminary exploration of the data in hopes of establishing two key details along the way that will shape the rest of the project: how do we quantify queerness, and how do we best represent it visually?\n",
"\n",
"- Once again, representing this data visually will require determining the coordinates associated with each observation.\n",
"- The gaybourhoods data set defines a \"gaybourhood index\" which effectively measures how friendly a given neighbourhood is to queer people. Since this index is entirely subjective, we will need to closely evaluate it's usefulness for our project and investigate different ways to quantify \"queer-friendliness\"\n",
"- In addition to the last point, since, of course, no matter what choice of observations we make, the measurement will still be subjective, answering this research question will come more so in the form of comparing and contrasting different measurements to see what they tell us.\n",
"- Obviously, visualizing this among many aspects of the other research questions would involve projecting the data onto a map of the United States, so visualizing this research question would motivate many of the visualizations for other components of this project"
]
},
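{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of the OpenStreetMap geocoding idea mentioned above, assuming the `geopy` package is available. It illustrates the approach only; the wrangling below relies on pre-compiled coordinate tables rather than live API calls."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: geocode a county name with OpenStreetMap's Nominatim service via geopy.\n",
"# The wrangling below uses pre-compiled coordinate tables instead of live API calls.\n",
"from geopy.geocoders import Nominatim\n",
"\n",
"geolocator = Nominatim(user_agent=\"pride-data-analysis\")  # user-agent string is a placeholder\n",
"\n",
"def geocode_place(name):\n",
"    \"\"\"Return (lat, long) for a place name such as 'Hancock County, OH', or None if not found.\"\"\"\n",
"    location = geolocator.geocode(f\"{name}, USA\")\n",
"    if location is None:\n",
"        return None\n",
"    return location.latitude, location.longitude\n",
"\n",
"# Example (requires network access):\n",
"# geocode_place(\"Hancock County, OH\")"
]
},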
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Wrangling"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>name</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Hancock OH</td>\n",
" <td>41.000471</td>\n",
" <td>-83.666033</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Stafford VA</td>\n",
" <td>38.413261</td>\n",
" <td>-77.451334</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Webster NE</td>\n",
" <td>40.180646</td>\n",
" <td>-98.498590</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Dimmit TX</td>\n",
" <td>28.423587</td>\n",
" <td>-99.765871</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Cedar IA</td>\n",
" <td>41.772360</td>\n",
" <td>-91.132610</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" name lat long\n",
"0 Hancock OH 41.000471 -83.666033\n",
"1 Stafford VA 38.413261 -77.451334\n",
"2 Webster NE 40.180646 -98.498590\n",
"3 Dimmit TX 28.423587 -99.765871\n",
"4 Cedar IA 41.772360 -91.132610"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## counties - Relating US counties to their long/lat position on the Earth\n",
"counties = pd.read_csv(\"../data/raw/us-county-boundaries.csv\", sep=\";\")\n",
"\n",
"counties = counties.rename({\n",
" \"NAME\": \"name\",\n",
" \"INTPTLAT\": \"lat\",\n",
" \"INTPTLON\": \"long\",\n",
"}, axis=\"columns\")\n",
"\n",
"# Combine the county name with the state code\n",
"def combine_name_state(row):\n",
" row[\"name\"] = f\"{row['name']} {row['STUSAB']}\"\n",
" return row\n",
"\n",
"counties = counties.apply(combine_name_state, axis=\"columns\")\n",
"\n",
"# We don't need this column anymore\n",
"counties = counties.drop([\"STUSAB\"], axis=\"columns\")\n",
"\n",
"counties.to_csv(\"../data/processed/us-county-boundaries.csv\")\n",
"counties.head()"
]
},
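{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick sanity check (a sketch, not part of the original wrangling): `name` is used as a merge key in the next cell, so it is worth confirming that it identifies counties uniquely and that no coordinates are missing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: sanity-check the merge key and coordinates before joining on `name` below\n",
"print(\"duplicate names:\", counties[\"name\"].duplicated().sum())\n",
"print(\"missing coordinates:\", counties[[\"lat\", \"long\"]].isna().sum().sum())"
]
},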
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>county</th>\n",
" <th>party</th>\n",
" <th>votes</th>\n",
" <th>total</th>\n",
" <th>percent</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Autauga AL</td>\n",
" <td>Democrat</td>\n",
" <td>6363</td>\n",
" <td>23932</td>\n",
" <td>0.265878</td>\n",
" <td>32.532237</td>\n",
" <td>-86.646439</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Autauga AL</td>\n",
" <td>Republican</td>\n",
" <td>17379</td>\n",
" <td>23932</td>\n",
" <td>0.726183</td>\n",
" <td>32.532237</td>\n",
" <td>-86.646439</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Autauga AL</td>\n",
" <td>Other</td>\n",
" <td>190</td>\n",
" <td>23932</td>\n",
" <td>0.007939</td>\n",
" <td>32.532237</td>\n",
" <td>-86.646439</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Baldwin AL</td>\n",
" <td>Democrat</td>\n",
" <td>18424</td>\n",
" <td>85338</td>\n",
" <td>0.215894</td>\n",
" <td>30.659218</td>\n",
" <td>-87.746067</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Baldwin AL</td>\n",
" <td>Republican</td>\n",
" <td>66016</td>\n",
" <td>85338</td>\n",
" <td>0.773583</td>\n",
" <td>30.659218</td>\n",
" <td>-87.746067</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" county party votes total percent lat long\n",
"0 Autauga AL Democrat 6363 23932 0.265878 32.532237 -86.646439\n",
"1 Autauga AL Republican 17379 23932 0.726183 32.532237 -86.646439\n",
"2 Autauga AL Other 190 23932 0.007939 32.532237 -86.646439\n",
"3 Baldwin AL Democrat 18424 85338 0.215894 30.659218 -87.746067\n",
"4 Baldwin AL Republican 66016 85338 0.773583 30.659218 -87.746067"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## pol - Election results from the 2012 American presidential election\n",
"pol = pd.read_csv(\"../data/raw/countypres_2000-2020.csv\")\n",
"\n",
"# We only want 2012--the latest election before the gb data was collected\n",
"\n",
"pol = pol[pol[\"year\"] == 2012].reset_index()\n",
"\n",
"# Get rid of undesireable columns\n",
"pol = pol.drop([\n",
" \"year\", \"state\", \"county_fips\", \"office\",\n",
" \"candidate\", \"version\", \"mode\", \"index\",\n",
"], axis=\"columns\")\n",
"\n",
"# Change the column names to make them a little more friendly\n",
"pol.rename({\n",
" \"county_name\": \"county\",\n",
" \"state_po\": \"state\",\n",
" \"candidatevotes\": \"votes\",\n",
" \"totalvotes\": \"total\"\n",
"}, axis=\"columns\", inplace=True)\n",
"\n",
"# Make cells lowercase\n",
"pol[\"county\"] = pol[\"county\"].apply(lambda x: x.capitalize())\n",
"pol[\"party\"] = pol[\"party\"].apply(lambda x: x.capitalize())\n",
"\n",
"# Combine the county name with the state code\n",
"def combine_name_state(row):\n",
" row[\"county\"] = f\"{row['county']} {row['state']}\"\n",
" return row\n",
"\n",
"pol = pol.apply(combine_name_state, axis=\"columns\")\n",
"\n",
"# Add a percent column which will be useful when graphing\n",
"pol[\"percent\"] = pol[\"votes\"] / pol[\"total\"]\n",
"\n",
"# Attach long/lat data to each row\n",
"pol = pol.merge(counties, left_on=\"county\", right_on=\"name\")\n",
"\n",
"# Now we can get rid of the state columns\n",
"pol = pol.drop([\"state\", \"name\"], axis=\"columns\")\n",
"\n",
"pol.to_csv(\"../data/processed/election-2012.csv\", index=False)\n",
"pol.head()"
]
},
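{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of how the `percent` column might be used for graphing (a sketch, not one of the original cells), the distribution of county-level Democratic vote share can be plotted with seaborn:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: distribution of county-level Democratic vote share in 2012,\n",
"# using the `percent` column computed above\n",
"dem = pol[pol[\"party\"] == \"Democrat\"]\n",
"sns.histplot(data=dem, x=\"percent\", bins=30)"
]
},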
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Tax_Mjoint</th>\n",
" <th>TaxRate_SS</th>\n",
" <th>TaxRate_FF</th>\n",
" <th>TaxRate_MM</th>\n",
" <th>Cns_RateSS</th>\n",
" <th>Cns_RateFF</th>\n",
" <th>Cns_RateMM</th>\n",
" <th>CountBars</th>\n",
" <th>FF_Index</th>\n",
" <th>MM_Index</th>\n",
" <th>SS_Index</th>\n",
" <th>TOTINDEX</th>\n",
" <th>lat</th>\n",
" <th>long</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2120</td>\n",
" <td>203.301887</td>\n",
" <td>28.773585</td>\n",
" <td>174.528302</td>\n",
" <td>77.125329</td>\n",
" <td>6.931719</td>\n",
" <td>70.193610</td>\n",
" <td>15</td>\n",
" <td>6.724415</td>\n",
" <td>48.288254</td>\n",
" <td>55.012669</td>\n",
" <td>67.077054</td>\n",
" <td>34.093828</td>\n",
" <td>-118.381697</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5080</td>\n",
" <td>205.511811</td>\n",
" <td>33.464567</td>\n",
" <td>172.047244</td>\n",
" <td>88.478367</td>\n",
" <td>15.617404</td>\n",
" <td>72.860963</td>\n",
" <td>17</td>\n",
" <td>9.834048</td>\n",
" <td>48.578469</td>\n",
" <td>58.412517</td>\n",
" <td>61.866815</td>\n",
" <td>37.758057</td>\n",
" <td>-122.435410</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5790</td>\n",
" <td>107.772021</td>\n",
" <td>16.753022</td>\n",
" <td>91.018998</td>\n",
" <td>46.771050</td>\n",
" <td>5.745582</td>\n",
" <td>41.025469</td>\n",
" <td>5</td>\n",
" <td>4.370779</td>\n",
" <td>26.360413</td>\n",
" <td>30.731192</td>\n",
" <td>37.908747</td>\n",
" <td>40.742039</td>\n",
" <td>-74.000620</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3510</td>\n",
" <td>80.056980</td>\n",
" <td>21.082621</td>\n",
" <td>58.974359</td>\n",
" <td>31.619291</td>\n",
" <td>9.315448</td>\n",
" <td>22.303843</td>\n",
" <td>10</td>\n",
" <td>6.055939</td>\n",
" <td>15.939869</td>\n",
" <td>21.995808</td>\n",
" <td>37.530067</td>\n",
" <td>40.734012</td>\n",
" <td>-74.006746</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2660</td>\n",
" <td>91.353383</td>\n",
" <td>12.781955</td>\n",
" <td>78.571429</td>\n",
" <td>21.763042</td>\n",
" <td>3.142678</td>\n",
" <td>18.620365</td>\n",
" <td>9</td>\n",
" <td>3.004058</td>\n",
" <td>18.280165</td>\n",
" <td>21.284224</td>\n",
" <td>35.843573</td>\n",
" <td>37.773134</td>\n",
" <td>-122.411167</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Tax_Mjoint TaxRate_SS TaxRate_FF TaxRate_MM Cns_RateSS Cns_RateFF \\\n",
"0 2120 203.301887 28.773585 174.528302 77.125329 6.931719 \n",
"1 5080 205.511811 33.464567 172.047244 88.478367 15.617404 \n",
"2 5790 107.772021 16.753022 91.018998 46.771050 5.745582 \n",
"3 3510 80.056980 21.082621 58.974359 31.619291 9.315448 \n",
"4 2660 91.353383 12.781955 78.571429 21.763042 3.142678 \n",
"\n",
" Cns_RateMM CountBars FF_Index MM_Index SS_Index TOTINDEX \\\n",
"0 70.193610 15 6.724415 48.288254 55.012669 67.077054 \n",
"1 72.860963 17 9.834048 48.578469 58.412517 61.866815 \n",
"2 41.025469 5 4.370779 26.360413 30.731192 37.908747 \n",
"3 22.303843 10 6.055939 15.939869 21.995808 37.530067 \n",
"4 18.620365 9 3.004058 18.280165 21.284224 35.843573 \n",
"\n",
" lat long \n",
"0 34.093828 -118.381697 \n",
"1 37.758057 -122.435410 \n",
"2 40.742039 -74.000620 \n",
"3 40.734012 -74.006746 \n",
"4 37.773134 -122.411167 "
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## gb - the gaybourhoods dataset\n",
"gb = pd.read_csv(\"../data/raw/gaybourhoods.csv\")\n",
"cords = pd.read_csv(\"../data/raw/zip_lat_long.csv\")\n",
"\n",
"# Let's add long/lat columns to gb\n",
"gb = gb.merge(cords, left_on=\"GEOID10\", right_on=\"ZIP\")\n",
"\n",
"# Get rid of unneeded columns\n",
"gb = gb.drop([\n",
" \"Mjoint_MF\", \"Mjoint_SS\", \"Mjoint_FF\", \"Mjoint_MM\",\n",
" \"Cns_TotHH\", \"Cns_UPSS\", \"Cns_UPFF\", \"Cns_UPMM\",\n",
" \"ParadeFlag\", \"FF_Tax\", \"FF_Cns\", \"MM_Tax\", \"MM_Cns\",\n",
" \"SS_Index_Weight\", \"Parade_Weight\", \"Bars_Weight\",\n",
" \"GEOID10\", \"ZIP\",\n",
"], axis=\"columns\")\n",
"\n",
"# There's a lot of info baked into some of these columns. Especially the composite indexes.\n",
"# We'll leave their names as is for easy reference even if they're a little ugly.\n",
"gb = gb.rename({\n",
" \"LAT\": \"lat\",\n",
" \"LNG\": \"long\",\n",
"}, axis=\"columns\")\n",
"\n",
"gb.to_csv(\"../data/processed/gaybourhoods-nat.csv\")\n",
"gb.head()"
]
}
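,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The research questions above call for a geographic \"compatibility layer\" between the election data (county centroids) and the gaybourhoods data (ZIP centroids). The cell below is a minimal sketch of one way to build it, assuming `scipy` is available: match each ZIP-level row in `gb` to its nearest county centroid in `pol` with a k-d tree. It illustrates the planned approach rather than a finished join."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the planned \"compatibility layer\": attach each ZIP-level row in gb\n",
"# to the nearest county centroid from the election data using a k-d tree.\n",
"# Assumes scipy is installed; treats lat/long as planar coordinates, which is a\n",
"# rough approximation but adequate for nearest-centroid matching.\n",
"from scipy.spatial import cKDTree\n",
"\n",
"county_points = pol[[\"county\", \"lat\", \"long\"]].drop_duplicates(\"county\").reset_index(drop=True)\n",
"tree = cKDTree(county_points[[\"lat\", \"long\"]].astype(float).to_numpy())\n",
"\n",
"_, nearest = tree.query(gb[[\"lat\", \"long\"]].astype(float).to_numpy())\n",
"gb_with_county = gb.assign(county=county_points.loc[nearest, \"county\"].to_numpy())\n",
"\n",
"gb_with_county.head()"
]
}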
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}