Finish final report

2023-04-13 12:45:53 -07:00 · aa0571b4a8
parent 10c04db995
commit aa0571b4a8
8 changed files with 241 additions and 159 deletions
--- a/analysis/analysis2.ipynb
+++ b/analysis/analysis2.ipynb
@ -301,7 +301,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 32,
+   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
@ -404,7 +404,7 @@
       "max     99999.000000  9.566490e+06       6.00000      1.557123e+07"
      ]
     },
-     "execution_count": 32,
+     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -418,7 +418,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 31,
+   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
@ -503,7 +503,7 @@
       "max    98686.000000  24560.000000"
      ]
     },
-     "execution_count": 31,
+     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -518,7 +518,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 30,
+   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
@ -552,85 +552,85 @@
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
-       "      <td>0.0</td>\n",
-       "      <td>0.0</td>\n",
-       "      <td>0.0</td>\n",
-       "      <td>0.0</td>\n",
-       "      <td>0.0</td>\n",
+       "      <td>2184.000000</td>\n",
+       "      <td>2184.000000</td>\n",
+       "      <td>2184.000000</td>\n",
+       "      <td>2184.000000</td>\n",
+       "      <td>2184.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
+       "      <td>48935.203297</td>\n",
+       "      <td>26691.730769</td>\n",
+       "      <td>4373.997253</td>\n",
+       "      <td>596.719322</td>\n",
+       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
+       "      <td>35451.335807</td>\n",
+       "      <td>17960.713867</td>\n",
+       "      <td>3054.620840</td>\n",
+       "      <td>615.174358</td>\n",
+       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
+       "      <td>1730.000000</td>\n",
+       "      <td>160.000000</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>0.000000</td>\n",
+       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
+       "      <td>11360.750000</td>\n",
+       "      <td>13337.500000</td>\n",
+       "      <td>2110.000000</td>\n",
+       "      <td>217.000000</td>\n",
+       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
+       "      <td>60023.500000</td>\n",
+       "      <td>24070.000000</td>\n",
+       "      <td>3900.000000</td>\n",
+       "      <td>434.000000</td>\n",
+       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
+       "      <td>80227.250000</td>\n",
+       "      <td>35640.000000</td>\n",
+       "      <td>5902.500000</td>\n",
+       "      <td>777.250000</td>\n",
+       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
-       "      <td>NaN</td>\n",
+       "      <td>98686.000000</td>\n",
+       "      <td>114420.000000</td>\n",
+       "      <td>24560.000000</td>\n",
+       "      <td>9166.000000</td>\n",
+       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
-       "       zip  population  gay tax rate  overall tax paid  income\n",
-       "count  0.0         0.0           0.0               0.0     0.0\n",
-       "mean   NaN         NaN           NaN               NaN     NaN\n",
-       "std    NaN         NaN           NaN               NaN     NaN\n",
-       "min    NaN         NaN           NaN               NaN     NaN\n",
-       "25%    NaN         NaN           NaN               NaN     NaN\n",
-       "50%    NaN         NaN           NaN               NaN     NaN\n",
-       "75%    NaN         NaN           NaN               NaN     NaN\n",
-       "max    NaN         NaN           NaN               NaN     NaN"
+       "                zip     population  gay tax rate  overall tax paid  income\n",
+       "count   2184.000000    2184.000000   2184.000000       2184.000000  2184.0\n",
+       "mean   48935.203297   26691.730769   4373.997253        596.719322     1.0\n",
+       "std    35451.335807   17960.713867   3054.620840        615.174358     0.0\n",
+       "min     1730.000000     160.000000      0.000000          0.000000     1.0\n",
+       "25%    11360.750000   13337.500000   2110.000000        217.000000     1.0\n",
+       "50%    60023.500000   24070.000000   3900.000000        434.000000     1.0\n",
+       "75%    80227.250000   35640.000000   5902.500000        777.250000     1.0\n",
+       "max    98686.000000  114420.000000  24560.000000       9166.000000     1.0"
      ]
     },
-     "execution_count": 30,
+     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -654,7 +654,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 29,
+   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
@ -805,7 +805,7 @@
       "[2184 rows x 5 columns]"
      ]
     },
-     "execution_count": 29,
+     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -816,7 +816,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 28,
+   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
@ -956,7 +956,7 @@
       "max      47.916786   -70.758184  "
      ]
     },
-     "execution_count": 28,
+     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -992,7 +992,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
@ -1016,8 +1016,11 @@
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
-       "      <th>pride parade index</th>\n",
-       "      <th>gay bars index</th>\n",
+       "      <th>zip</th>\n",
+       "      <th>population</th>\n",
+       "      <th>gay tax rate</th>\n",
+       "      <th>overall tax paid</th>\n",
+       "      <th>income</th>\n",
       "      <th>lat</th>\n",
       "      <th>long</th>\n",
       "    </tr>\n",
@ -1025,38 +1028,53 @@
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
+       "      <td>1730</td>\n",
+       "      <td>13570.0</td>\n",
+       "      <td>3260</td>\n",
+       "      <td>150.0</td>\n",
       "      <td>1</td>\n",
-       "      <td>15</td>\n",
-       "      <td>34.093828</td>\n",
-       "      <td>-118.381697</td>\n",
+       "      <td>42.499295</td>\n",
+       "      <td>-71.281889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
-       "      <td>0</td>\n",
-       "      <td>17</td>\n",
-       "      <td>37.758057</td>\n",
-       "      <td>-122.435410</td>\n",
+       "      <td>1731</td>\n",
+       "      <td>2450.0</td>\n",
+       "      <td>550</td>\n",
+       "      <td>0.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>42.456748</td>\n",
+       "      <td>-71.279484</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
+       "      <td>1742</td>\n",
+       "      <td>17170.0</td>\n",
+       "      <td>4220</td>\n",
+       "      <td>297.0</td>\n",
       "      <td>1</td>\n",
-       "      <td>5</td>\n",
-       "      <td>40.742039</td>\n",
-       "      <td>-74.000620</td>\n",
+       "      <td>42.462911</td>\n",
+       "      <td>-71.364496</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
+       "      <td>1760</td>\n",
+       "      <td>34350.0</td>\n",
+       "      <td>7880</td>\n",
+       "      <td>468.0</td>\n",
       "      <td>1</td>\n",
-       "      <td>10</td>\n",
-       "      <td>40.734012</td>\n",
-       "      <td>-74.006746</td>\n",
+       "      <td>42.284822</td>\n",
+       "      <td>-71.348811</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
+       "      <td>1770</td>\n",
+       "      <td>4310.0</td>\n",
+       "      <td>1060</td>\n",
+       "      <td>46.0</td>\n",
       "      <td>1</td>\n",
-       "      <td>9</td>\n",
-       "      <td>37.773134</td>\n",
-       "      <td>-122.411167</td>\n",
+       "      <td>42.231947</td>\n",
+       "      <td>-71.372963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
@ -1064,65 +1082,96 @@
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
-       "      <th>2323</th>\n",
-       "      <td>0</td>\n",
-       "      <td>0</td>\n",
-       "      <td>45.528666</td>\n",
-       "      <td>-122.678981</td>\n",
+       "      <th>2179</th>\n",
+       "      <td>98682</td>\n",
+       "      <td>57010.0</td>\n",
+       "      <td>11080</td>\n",
+       "      <td>703.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>45.673209</td>\n",
+       "      <td>-122.481745</td>\n",
       "    </tr>\n",
       "    <tr>\n",
-       "      <th>2324</th>\n",
-       "      <td>0</td>\n",
-       "      <td>0</td>\n",
-       "      <td>47.606211</td>\n",
-       "      <td>-122.333792</td>\n",
+       "      <th>2180</th>\n",
+       "      <td>98683</td>\n",
+       "      <td>30700.0</td>\n",
+       "      <td>6470</td>\n",
+       "      <td>358.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>45.603287</td>\n",
+       "      <td>-122.510170</td>\n",
       "    </tr>\n",
       "    <tr>\n",
-       "      <th>2325</th>\n",
-       "      <td>0</td>\n",
-       "      <td>0</td>\n",
-       "      <td>47.449678</td>\n",
-       "      <td>-122.307657</td>\n",
+       "      <th>2181</th>\n",
+       "      <td>98684</td>\n",
+       "      <td>27630.0</td>\n",
+       "      <td>5390</td>\n",
+       "      <td>371.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>45.630556</td>\n",
+       "      <td>-122.514839</td>\n",
       "    </tr>\n",
       "    <tr>\n",
-       "      <th>2326</th>\n",
-       "      <td>0</td>\n",
-       "      <td>0</td>\n",
-       "      <td>47.604569</td>\n",
-       "      <td>-122.335359</td>\n",
+       "      <th>2182</th>\n",
+       "      <td>98685</td>\n",
+       "      <td>27540.0</td>\n",
+       "      <td>6490</td>\n",
+       "      <td>298.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>45.715211</td>\n",
+       "      <td>-122.693165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
-       "      <th>2327</th>\n",
-       "      <td>0</td>\n",
-       "      <td>0</td>\n",
-       "      <td>47.649339</td>\n",
-       "      <td>-122.310294</td>\n",
+       "      <th>2183</th>\n",
+       "      <td>98686</td>\n",
+       "      <td>17800.0</td>\n",
+       "      <td>4120</td>\n",
+       "      <td>215.0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>45.723392</td>\n",
+       "      <td>-122.624397</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
-       "<p>2328 rows × 4 columns</p>\n",
+       "<p>2184 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
-       "      pride parade index  gay bars index        lat        long\n",
-       "0                      1              15  34.093828 -118.381697\n",
-       "1                      0              17  37.758057 -122.435410\n",
-       "2                      1               5  40.742039  -74.000620\n",
-       "3                      1              10  40.734012  -74.006746\n",
-       "4                      1               9  37.773134 -122.411167\n",
-       "...                  ...             ...        ...         ...\n",
-       "2323                   0               0  45.528666 -122.678981\n",
-       "2324                   0               0  47.606211 -122.333792\n",
-       "2325                   0               0  47.449678 -122.307657\n",
-       "2326                   0               0  47.604569 -122.335359\n",
-       "2327                   0               0  47.649339 -122.310294\n",
+       "        zip  population  gay tax rate  overall tax paid  income        lat  \\\n",
+       "0      1730     13570.0          3260             150.0       1  42.499295   \n",
+       "1      1731      2450.0           550               0.0       1  42.456748   \n",
+       "2      1742     17170.0          4220             297.0       1  42.462911   \n",
+       "3      1760     34350.0          7880             468.0       1  42.284822   \n",
+       "4      1770      4310.0          1060              46.0       1  42.231947   \n",
+       "...     ...         ...           ...               ...     ...        ...   \n",
+       "2179  98682     57010.0         11080             703.0       1  45.673209   \n",
+       "2180  98683     30700.0          6470             358.0       1  45.603287   \n",
+       "2181  98684     27630.0          5390             371.0       1  45.630556   \n",
+       "2182  98685     27540.0          6490             298.0       1  45.715211   \n",
+       "2183  98686     17800.0          4120             215.0       1  45.723392   \n",
       "\n",
-       "[2328 rows x 4 columns]"
+       "            long  \n",
+       "0     -71.281889  \n",
+       "1     -71.279484  \n",
+       "2     -71.364496  \n",
+       "3     -71.348811  \n",
+       "4     -71.372963  \n",
+       "...          ...  \n",
+       "2179 -122.481745  \n",
+       "2180 -122.510170  \n",
+       "2181 -122.514839  \n",
+       "2182 -122.693165  \n",
+       "2183 -122.624397  \n",
+       "\n",
+       "[2184 rows x 7 columns]"
      ]
     },
-     "execution_count": 27,
+     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -1133,7 +1182,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
@ -1164,7 +1213,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
@ -1196,7 +1245,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
@ -1227,7 +1276,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
@ -1330,7 +1379,7 @@
       "max    24560.000000       9166.000000    47.916786   -70.758184"
      ]
     },
-     "execution_count": 11,
+     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -1343,7 +1392,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 34,
+   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
@ -1365,7 +1414,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 35,
+   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
@ -1407,7 +1456,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 36,
+   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
@ -1612,7 +1661,7 @@
       "[5 rows x 29 columns]"
      ]
     },
-     "execution_count": 36,
+     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -1628,7 +1677,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 37,
+   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
@ -1731,7 +1780,7 @@
       "max              1.000000       17.000000    47.916786   -70.758184"
      ]
     },
-     "execution_count": 37,
+     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -1793,7 +1842,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
@ -1802,7 +1851,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 39,
+   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
@ -1832,7 +1881,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
@ -1864,7 +1913,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
@ -1891,7 +1940,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
@ -1994,7 +2043,7 @@
       "max              1.000000       17.000000    47.916786   -70.758184"
      ]
     },
-     "execution_count": 21,
+     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -2007,16 +2056,16 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "<seaborn.axisgrid.PairGrid at 0x7fa91e10fca0>"
+       "<seaborn.axisgrid.PairGrid at 0x7f6445ecacb0>"
      ]
     },
-     "execution_count": 22,
+     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    },
@ -2039,16 +2088,16 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 23,
+   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "<seaborn.axisgrid.PairGrid at 0x7fa91e12c940>"
+       "<seaborn.axisgrid.PairGrid at 0x7f644683a2c0>"
      ]
     },
-     "execution_count": 23,
+     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    },
@ -2080,7 +2129,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 55,
+   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
@ -2105,16 +2154,16 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 46,
+   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "<seaborn.axisgrid.FacetGrid at 0x7fa913934d90>"
+       "<seaborn.axisgrid.FacetGrid at 0x7f64466a7ca0>"
      ]
     },
-     "execution_count": 46,
+     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    },
@ -2144,16 +2193,16 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 47,
+   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
-       "<seaborn.axisgrid.FacetGrid at 0x7fa91353eda0>"
+       "<seaborn.axisgrid.FacetGrid at 0x7f6445a04ac0>"
      ]
     },
-     "execution_count": 47,
+     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    },
--- a/analysis/images/graphs/bars-parades-decomposition.png
+++ b/analysis/images/graphs/bars-parades-decomposition.png
--- a/analysis/images/graphs/bars-parades1.png
+++ b/analysis/images/graphs/bars-parades1.png
--- a/analysis/images/graphs/bars-parades2.png
+++ b/analysis/images/graphs/bars-parades2.png
--- a/analysis/images/graphs/queer-tax-decomposition.png
+++ b/analysis/images/graphs/queer-tax-decomposition.png
--- a/analysis/images/graphs/queer-tax-rate.png
+++ b/analysis/images/graphs/queer-tax-rate.png
--- a/analysis/images/graphs/typical-tax-rate.png
+++ b/analysis/images/graphs/typical-tax-rate.png
--- a/final_report_group44.md
+++ b/final_report_group44.md
@ -1,6 +1,15 @@
 ## Introduction

 ## Exploratory Data Analysis
+A substantial portion of our exploratory data analysis involved determining how best to represent our data on a two-dimensional plane. The two approaches we settled on were density (later topographical) maps and scatter plots with respect to the geographical coordinates of each observation:
+
+![Scatter plot of observations from the Gaybourhoods dataset](analysis/images/graphs/1-naive-scatter1.png)
+![Hexbin plot illustrating the density of counties across the US](analysis/images/graphs/4-plot-naive-hexbin.png)
+
+In the case of scatter plots, a third dimension of the data could usually be represented by colouring the observations according to some additional statistic. For example, one of the graphs we created in our exploratory data analysis illustrated all of the observations in Boston, coloured by how many gay/lesbian individuals resided in each neighbourhood:
+![Scatter plot of Gaybourhoods observations in Boston coloured by queer concentration](analysis/images/graphs/6-plot-boston-scatter.png)
+
+This approach proved effective, particularly when the points were overlaid on maps that position the data visually in space, and so we used it throughout our analyses.

 ## Do queer communities concentrate in space?
 The objective of this research question is to determine if queer communities are geographically concentrated. More specifically, we wanted to determine if a community with a high population of gay and lesbian residents is likely to be surrounded by communities with a similarly-sized population of gay and lesbian residents. This can be broken down more quantitatively by asking the following: for a neighbourhood measurably queer to some degree, how queer are the adjacent neighbourhoods on average?
@ -8,11 +17,11 @@ The objective of this research question is to determine if queer communities are
 ## Quantitatively measuring queerness
 At several points during this analysis, we will refer to a given neighbourhood's "queerness" as though it's a single, continuous, quantitative variable. We do this for convenience and to work more effectively within the constraints of the data we have available, although it's worth admitting and discussing what this means and its limitations. It should be obvious that how we quantitatively measure the queerness of a space is subjective, and the decisions we make in this analysis can be problematic.

-To begin, we must acknowledge the role statisticians have played presently and historically in systematically eliminating minorities. One local example of this is the way that Canada's Indian Act works to incrementally strip Indigenous people of their legal recognition of being Indigenous through the malitiously-named process of "enfranchisement"[^1], in a process many now refer to as "statistical genocide." Discretely categorizing people enables oppression and marginalizes deviation. That second issue is particularly pertinent in the case of the queer community, which is predicated on "bending rules," so to speak. For that reason, we are hesitant to use the phrase "queer community," as it implicitly makes the assumption that the constituents of the so-called "queer community" have a universal experience, which is untrue.
+To begin, we must acknowledge the role statisticians have played presently and historically in systematically eliminating minorities. One local example of this is the way that Canada's Indian Act works to incrementally strip Indigenous people of their legal recognition of being Indigenous through the maliciously-named process of "enfranchisement"[^1], in a process many now refer to as "statistical genocide." Discretely categorizing people enables oppression and marginalizes deviation. That second issue is particularly pertinent in the case of the queer community, which is predicated on "bending rules," so to speak. For that reason, we are hesitant to use the phrase "queer community," as it implicitly makes the assumption that the constituents of the so-called "queer community" have a universal experience, which is untrue.

 Jan Diehm admits the following in "Men are from Chelsea, Women are from Park Slope"[^2] [^3].

-> Currently, there’s no comprehensive way to quantitatively measure gayborhoods, or even where LGBTQ Americans live. Most of the existing data sticks to a narrow view (i.e. traditional marriage, the male/female gender binary) of the queer spectrum and “rainbow-washes” any intersectionality of race, ethnicity, class, gender, and sexuality. This project aims to paint a slightly more complete picture, combining several metrics to create a gayborhood index, but even then it admittedly underweights and undercounts areas with non-binary and minority populations. Still, this is some of the most complete data that we have.
+> Currently, there’s no comprehensive way to quantitatively measure gayborhoods, or even where LGBTQ Americans live. Most of the existing data sticks to a narrow view (i.e. traditional marriage, the male/female gender binary) of the queer spectrum and “rainbow-washes” any intersectionality of race, ethnicity, class, gender, and sexuality. This project aims to paint a slightly more complete picture, combining several metrics to create a gayborhood index, but even then it admittedly underrepresents and under-counts areas with non-binary and minority populations. Still, this is some of the most complete data that we have.

 This dataset fails to represent queerness outside the context of monogamous partnerships between cisgender people (or at least, those who have been statistically represented as such). For this reason, we seek to be very upfront that we are only exploring so-called "same-sex" partnerships.

@ -22,23 +31,23 @@ The individuals who worked on the article attempted to mitigate some of these is

 The above figure shows topographical graphs of gaybourhoods in New York City shaded darker by two metrics of queerness: "TOTINDEX," the composite index, and a second metric representing only the number of gay and lesbian residents. While the graphs are visually distinct, the difference is relatively minor. Nonetheless, we proceed using the latter as a key symbolic decision.

-To facilitate the discussion of queerness in space in the first two research questions, we introduce an additional index that discretely classifies neighbourhoods into 7 categories labeled `0` through `6`, with zero indicating a region has the fewest relative gay/lesbian residents and 6 indicating that the region has relatively the most gay/lesbian residents. The choice to divide the dataframe into seven categories was arbitrary, although inspired by Alfred Kinsey's research into the fluidity of human sexuality[^4]. Similarly to the Kinsey scale, the relationship will be linear.
+To facilitate the discussion of queerness in space in the first two research questions, we introduce an additional index that discretely classifies neighbourhoods into 7 categories labelled `0` through `6`, with `0` indicating that a region has relatively the fewest gay/lesbian residents and `6` indicating that it has relatively the most. The choice to divide the data frame into seven categories was arbitrary, although inspired by Alfred Kinsey's research into the fluidity of human sexuality[^4]. As with the Kinsey scale, the categories are spaced linearly.

-Besides the Kinsey index of a given observation, we are also interested in the kinsey index of observations adjacent to a given neighbourhood. This, we refer to as the observation's "neighbourhood kinsey index," or NKI, where our usage of the word "neighbourhood" is borrowed from graph theory, in referring to the set of all vertices connected by an edge to a given vertex. This measurement is calculated algorithmically by sampling a small set of observations geographically near each neighbourhood. A full implementation of this algorithm can be found [here](./analysis/code/project_functions1.py).
+Besides the Kinsey index of a given observation, we are also interested in the Kinsey index of the observations adjacent to a given neighbourhood. We refer to this as the observation's "neighbourhood Kinsey index," or NKI, where our usage of the word "neighbourhood" is borrowed from graph theory, referring to the set of all vertices connected by an edge to a given vertex. This measurement is calculated algorithmically by sampling a small set of observations geographically near each neighbourhood. A full implementation of this algorithm can be found [here](./analysis/code/project_functions1.py).

 ### Quantitatively representing queer concentration

 ![Two graphs, the first a bar graph and the second a scatter plot](analysis/images/graphs/13-neighbourhood-kinsey-comparison.png)

+The first graph illustrates the mean neighbourhood Kinsey index of all observations for each Kinsey index, and as such, the height of each bar represents how queer the neighbourhoods adjacent to a given neighbourhood are on average. Notably, the neighbourhoods adjacent to a relatively queer neighbourhood are generally not more queer than that neighbourhood itself. This is not particularly surprising when we consider the fact that queer people form a minority of the general population. However, across the United States, the more queer a given neighbourhood is, the more queer its adjacent neighbourhoods tend to be. This provides some evidence that queer communities tend to concentrate in space.
+The first graph illustrates the mean neighbourhood Kinsey index of all observations for each Kinsey index, and as such, the height of each graph represents how queer adjacent neighbourhoods of a given neighbourhood will be on average. Notably, in general, the neighbourhoods adjacent to a given relatively queer neighbourhood are not on average more queer than the given neighbourhood. This is not particularly surprising when we consider the fact that queer people form a minority of the general population. However, on average, the more queer a given neighbourhood is, the more queer its adjacent neighbourhoods will be on average across the United States. This provides some evidence that queer communities tend to concentrate in space.

-The second graph compares the mean neighbourhood kinsey index of each observation to its same-sex index, revealing that the same trend is present, although there is a substantial amount of variation. Similar to the first graph, we see that this trend becomes less representative for neighbourhoods with a higher kinsey index. This makes sense when we consider that observations forming the geographical peak will necessarily be surrounded by neighbourhoods of a lower kinsey index.
+The second graph compares the mean neighbourhood Kinsey index of each observation to its same-sex index, revealing that the same trend is present, although there is a substantial amount of variation. Similar to the first graph, we see that this trend becomes less representative for neighbourhoods with a higher Kinsey index. This makes sense when we consider that observations forming the geographical peak will necessarily be surrounded by neighbourhoods of a lower Kinsey index.

 ### Topographically illustrating queer concentration

-![15 topographical graphs illustrating queer concentration in 15 American cities](analysis/images/graphs/13-queer-concentration-nationally.png)
+![15 topographical graphs illustrating queer concentration in 15 American cities](analysis/images/graphs/12-queer-concentration-nationally.png)

-The previous 15 graphs topographically represent the concentration of queer communities in 15 cities across the United States. Regions shaded darker contain more queer residents per neighbourhood. In all 15 cities studied, we see a relatively sharp "peak" in gay residents in one area. Further, neighbourhoods tend to get less queer radially outwards of this peak. Another interesting observation is that with the exception of Chicago and Miami, all of the queerest communities in each city tend to be clustered around the geographical city centre. This is in line with conventional wisdom that the inner-city tends to be inhabited primarily by poor people and other marginalized groups, while the more privileged groups tend to live outside the city, commuting in for work. The exceptionality of Chicago and Miami could be due to unique city planning.
+The previous 15 graphs topographically represent the concentration of queer communities in 15 cities across the United States. Regions shaded darker contain more queer residents per neighbourhood. In all 15 cities studied, we see a relatively sharp "peak" in gay residents in one area. Further, neighbourhoods tend to get less queer radially outwards from this peak. Another interesting observation is that, with the exception of Chicago and Miami, the queerest communities in each city tend to be clustered around the geographical city centre. This is in line with conventional wisdom that the inner city tends to be inhabited primarily by poor people and other marginalized groups, while the more privileged groups tend to live outside the city, commuting in for work. The exceptionality of Chicago and Miami could be due to unique city planning.

 Although the overarching trend remains, there are some inherent limitations to using topographical graphs to illustrate this data. These limitations are explored further in our [complete analysis](analysis/analysis1.ipynb).

@ -56,17 +65,41 @@ Using the tools discussed in the previous section, it is immediately apparent th

 To take a closer look at the city level, we can use the same approach as last time to visualize the two phenomena topographically:

-![Overlaping topographical maps for queer and democrat density in 15 cities in the US](analysis/images/graphs/14-queer-vs-democrat-density.png)
+![Overlapping topographical maps for queer and democrat density in 15 cities in the US](analysis/images/graphs/14-queer-vs-democrat-density.png)

 When we illustrate the density of queerness and democrat votership, we see that in seven of the cities, the peaks completely overlap. In the vast majority of the cities studied, the peaks mostly overlap. Only in the case of Miami do the peaks seem to not overlap. This exception is likely due to the fact that here, our usage of this type of graph is misleading, because most of the region covered by the gaybourhoods dataset is contained within a single county.

 Through both our numerical and spatial research, the results consistently show that neighbourhoods with a higher number of queer residents tend to vote more democrat.

+## Do queer communities pay more in taxes?
+![Scatter plot of queer communities by tax rate](analysis/images/graphs/queer-tax-rate.png)
+![Scatter plot of communities by tax rate](analysis/images/graphs/typical-tax-rate.png)
+
+The previous two graphs depict the neighbourhoods in the Gaybourhoods data set coloured by how much their residents pay in taxes. It's visually clear from the first graph that queer neighbourhoods tend to pay more in taxes. This phenomenon can be explored more quantitatively in the following diagram:
+
+![Pair plot comparing queer to overall tax rates](analysis/images/graphs/queer-tax-decomposition.png)
+
+As we can infer from the slope of the trend line in this graph, queer communities pay significantly more in taxes than other neighbourhoods. One explanation for this is that queer people may, through one mechanism or another, end up correlating strongly with demographics who pay more taxes. Do note that the analysis is severely limited by sampling bias, as only hyper-urban geographical strata were surveyed in the construction of this data set.
+
+From a previous analysis, we know that queer communities tend to concentrate in the geographical centre of most of the cities surveyed. So, we can draw the related conclusion that people who live in the middle of large urban centres tend to pay more in taxes, which in turn provides some basis for why these queer communities tend to have a higher tax rate.
+
+## Is there a correlation between the number of gay bars in a given neighbourhood and pride parade activity?
+Our leading hypothesis for this research question is that if a neighbourhood has more gay bars, then it will be more likely to be traversed by a pride parade at some point during the year. We can illustrate the relationship between these two variables like so:
+
+![Bar graph illustrating the relationship between gay bars and pride parades](analysis/images/graphs/bars-parades1.png)
+![Bar graph illustrating the relationship between gay bars and pride parades](analysis/images/graphs/bars-parades2.png)
+
+Alternatively, we can approach the data from a more quantitative perspective and find that:
+
+![Bar graph illustrating the relationship between gay bars and pride parades](analysis/images/graphs/bars-parades-decomposition.png)
+
+The data seems to suggest that our hypothesis should be rejected. Against our expectations, it appears that more gay bars are located in regions pride parades don't pass through. One possible explanation is that pride parade organizers tend to focus their effort on bringing their parades into communities that have a lower presence of queer residents overall. It's also possible that this correlation is insufficiently representative because it appears mostly in regions with more than 10 bars. A third and final explanation is, once again, sampling bias. Although it isn't as extreme as was the case with the previous research question, it is nonetheless substantial enough to merit consideration.
+
 ## Conclusion

-Over the last semester, we have analyzed data from numerous sources to find answers to four geographic questions about the queer community. Firstly, we wanted to understand whether or not queer communities tend to concentrate in space, and found that neighbourhoods with a higher density of gay and lesbian residents tend to be close to other neighbourhoods with a higher density, clustering in city centers, such that there's typically a geographical peak in queerness around the middle of each city. We used similar methods of analysis to study the political alignment of residents of queer neighbourhoods and found that across the country, neighbourhoods with more queer people tend to vote more democrat. Thirdly, we asked if there was a meaningful difference in the amount of money queer people pay in taxes versus non-queer people, and found that in general, suburban queer people tend to pay higher taxes. Finally [RQ4], and learned [RQ4 CONCLUSION].
+Over the last semester, we have analyzed data from numerous sources to find answers to four geographic questions about the queer community. Firstly, we wanted to understand whether or not queer communities tend to concentrate in space, and found that neighbourhoods with a higher density of gay and lesbian residents tend to be close to other neighbourhoods with a higher density, clustering in city centres, such that there's typically a geographical peak in queerness around the middle of each city. We used similar methods of analysis to study the political alignment of residents of queer neighbourhoods and found that across the country, neighbourhoods with more queer people tend to vote more democrat. Thirdly, we asked if there was a meaningful difference in the amount of money queer people pay in taxes versus non-queer people, and found that in general, suburban queer people tend to pay higher taxes. Finally, we sought to determine if there is a correlation between the number of gay bars in a region and whether or not a pride parade passes through, and learned that a high number of gay bars in a given neighbourhood doesn't imply that a pride parade is more likely to visit it.

-In a world increasingly dominanted by data-driven decision-making, minority communities, being already underrepresented, are particularly at risk of being further marginalized. While there are numerious risks associated with collecting and publishing data on these groups, it is equally important to ensure queer people are present and included. Answering questions about social issues regarding the queer community is greatly complicated by the fact that we are systematically excluded from the discussion, and considerably more effort is necessary to eliminate the systematic bias disabling queer representation.
+In a world increasingly dominated by data-driven decision-making, minority communities, being already underrepresented, are particularly at risk of being further marginalized. While there are numerous risks associated with collecting and publishing data on these groups, it is equally important to ensure queer people are present and included. Answering questions about social issues regarding the queer community is greatly complicated by the fact that we are systematically excluded from the discussion, and considerably more effort is necessary to eliminate the systematic bias disabling queer representation.

 ## Footnotes