pride-data-analysis/analysis/analysis2.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sami Almuallim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Research question/interests\n",
    "\n",
    "**How are the different metrics of pride represented in this data set correlated?** Answering this question will provide a foundation upon which we can work to answer the more complicated questions that follow.\n",
    "\n",
    "- This will probably be the simplest research question, requiring only the data contained in our original data set. To explore this topic, we will use different visualization methods discussed in class to develop a better understanding of the data.\n",
    "\n",
    "**Is there a positive or a negative correlation between taxes paid and the pride of a given queer neighbourhood?** Taxes are influenced by a variety of socio-economic factors and we hope that in analyzing both tax data and our quantification of queerness on a geographic level, we'll be able to gleam insight into the question of how queerness and class are interrelated.\n",
    "\n",
    "- Similar again to the first research question posed, we'll need to find another data set containing geographically located tax data, which should be easy to acquire from the US government (for example, [in our cursory research, we found this data set from the IRS](https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2018-zip-code-data-soi)).\n",
    "- This would bring the number of data sets used in this project up to three, which might present some challenges in terms of the amount of data wrangling necessary to bring it all together.\n",
    "- To measure this, we would rank the neighbourhoods presented in the gaybourhoods data set by pride (an open question which we will explore in a separate research question)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>GEOID10</th>\n",
       "      <th>Tax_Mjoint</th>\n",
       "      <th>Mjoint_MF</th>\n",
       "      <th>Mjoint_SS</th>\n",
       "      <th>Mjoint_FF</th>\n",
       "      <th>Mjoint_MM</th>\n",
       "      <th>TaxRate_SS</th>\n",
       "      <th>TaxRate_FF</th>\n",
       "      <th>TaxRate_MM</th>\n",
       "      <th>Cns_TotHH</th>\n",
       "      <th>...</th>\n",
       "      <th>FF_Cns</th>\n",
       "      <th>FF_Index</th>\n",
       "      <th>MM_Tax</th>\n",
       "      <th>MM_Cns</th>\n",
       "      <th>MM_Index</th>\n",
       "      <th>SS_Index</th>\n",
       "      <th>SS_Index_Weight</th>\n",
       "      <th>Parade_Weight</th>\n",
       "      <th>Bars_Weight</th>\n",
       "      <th>TOTINDEX</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>90069</td>\n",
       "      <td>2120</td>\n",
       "      <td>1689</td>\n",
       "      <td>431</td>\n",
       "      <td>61</td>\n",
       "      <td>370</td>\n",
       "      <td>203.301887</td>\n",
       "      <td>28.773585</td>\n",
       "      <td>174.528302</td>\n",
       "      <td>12551</td>\n",
       "      <td>...</td>\n",
       "      <td>1.847099</td>\n",
       "      <td>6.724415</td>\n",
       "      <td>29.583721</td>\n",
       "      <td>18.704533</td>\n",
       "      <td>48.288254</td>\n",
       "      <td>55.012669</td>\n",
       "      <td>39.429995</td>\n",
       "      <td>10</td>\n",
       "      <td>17.647059</td>\n",
       "      <td>67.077054</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>94114</td>\n",
       "      <td>5080</td>\n",
       "      <td>4036</td>\n",
       "      <td>1044</td>\n",
       "      <td>170</td>\n",
       "      <td>874</td>\n",
       "      <td>205.511811</td>\n",
       "      <td>33.464567</td>\n",
       "      <td>172.047244</td>\n",
       "      <td>16456</td>\n",
       "      <td>...</td>\n",
       "      <td>4.161579</td>\n",
       "      <td>9.834048</td>\n",
       "      <td>29.163165</td>\n",
       "      <td>19.415304</td>\n",
       "      <td>48.578469</td>\n",
       "      <td>58.412517</td>\n",
       "      <td>41.866815</td>\n",
       "      <td>0</td>\n",
       "      <td>20.000000</td>\n",
       "      <td>61.866815</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10011</td>\n",
       "      <td>5790</td>\n",
       "      <td>5166</td>\n",
       "      <td>624</td>\n",
       "      <td>97</td>\n",
       "      <td>527</td>\n",
       "      <td>107.772021</td>\n",
       "      <td>16.753022</td>\n",
       "      <td>91.018998</td>\n",
       "      <td>29762</td>\n",
       "      <td>...</td>\n",
       "      <td>1.531029</td>\n",
       "      <td>4.370779</td>\n",
       "      <td>15.428332</td>\n",
       "      <td>10.932081</td>\n",
       "      <td>26.360413</td>\n",
       "      <td>30.731192</td>\n",
       "      <td>22.026394</td>\n",
       "      <td>10</td>\n",
       "      <td>5.882353</td>\n",
       "      <td>37.908747</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10014</td>\n",
       "      <td>3510</td>\n",
       "      <td>3229</td>\n",
       "      <td>281</td>\n",
       "      <td>74</td>\n",
       "      <td>207</td>\n",
       "      <td>80.056980</td>\n",
       "      <td>21.082621</td>\n",
       "      <td>58.974359</td>\n",
       "      <td>18786</td>\n",
       "      <td>...</td>\n",
       "      <td>2.482293</td>\n",
       "      <td>6.055939</td>\n",
       "      <td>9.996551</td>\n",
       "      <td>5.943318</td>\n",
       "      <td>15.939869</td>\n",
       "      <td>21.995808</td>\n",
       "      <td>15.765361</td>\n",
       "      <td>10</td>\n",
       "      <td>11.764706</td>\n",
       "      <td>37.530067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>94103</td>\n",
       "      <td>2660</td>\n",
       "      <td>2417</td>\n",
       "      <td>243</td>\n",
       "      <td>34</td>\n",
       "      <td>209</td>\n",
       "      <td>91.353383</td>\n",
       "      <td>12.781955</td>\n",
       "      <td>78.571429</td>\n",
       "      <td>12728</td>\n",
       "      <td>...</td>\n",
       "      <td>0.837431</td>\n",
       "      <td>3.004058</td>\n",
       "      <td>13.318386</td>\n",
       "      <td>4.961779</td>\n",
       "      <td>18.280165</td>\n",
       "      <td>21.284224</td>\n",
       "      <td>15.255337</td>\n",
       "      <td>10</td>\n",
       "      <td>10.588235</td>\n",
       "      <td>35.843573</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 29 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   GEOID10  Tax_Mjoint  Mjoint_MF  Mjoint_SS  Mjoint_FF  Mjoint_MM  \\\n",
       "0    90069        2120       1689        431         61        370   \n",
       "1    94114        5080       4036       1044        170        874   \n",
       "2    10011        5790       5166        624         97        527   \n",
       "3    10014        3510       3229        281         74        207   \n",
       "4    94103        2660       2417        243         34        209   \n",
       "\n",
       "   TaxRate_SS  TaxRate_FF  TaxRate_MM  Cns_TotHH  ...    FF_Cns  FF_Index  \\\n",
       "0  203.301887   28.773585  174.528302      12551  ...  1.847099  6.724415   \n",
       "1  205.511811   33.464567  172.047244      16456  ...  4.161579  9.834048   \n",
       "2  107.772021   16.753022   91.018998      29762  ...  1.531029  4.370779   \n",
       "3   80.056980   21.082621   58.974359      18786  ...  2.482293  6.055939   \n",
       "4   91.353383   12.781955   78.571429      12728  ...  0.837431  3.004058   \n",
       "\n",
       "      MM_Tax     MM_Cns   MM_Index   SS_Index  SS_Index_Weight  Parade_Weight  \\\n",
       "0  29.583721  18.704533  48.288254  55.012669        39.429995             10   \n",
       "1  29.163165  19.415304  48.578469  58.412517        41.866815              0   \n",
       "2  15.428332  10.932081  26.360413  30.731192        22.026394             10   \n",
       "3   9.996551   5.943318  15.939869  21.995808        15.765361             10   \n",
       "4  13.318386   4.961779  18.280165  21.284224        15.255337             10   \n",
       "\n",
       "   Bars_Weight   TOTINDEX  \n",
       "0    17.647059  67.077054  \n",
       "1    20.000000  61.866815  \n",
       "2     5.882353  37.908747  \n",
       "3    11.764706  37.530067  \n",
       "4    10.588235  35.843573  \n",
       "\n",
       "[5 rows x 29 columns]"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import numpy as np\n",
    "\n",
    "gaybourhoods = pd.read_csv(\"../data/raw/gaybourhoods.csv\")\n",
    "gaybourhoods.head(5)"
   ]
  },
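  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of a first pass at the correlation question, assuming the five rows printed by `gaybourhoods.head(5)` above are enough to illustrate the idea. The `sample` frame is copied by hand from that output and is only a stand-in; running the same calls on the full `gaybourhoods` frame would give the actual correlations. The resulting matrix could also be passed to `sns.heatmap` for a visual summary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Sample rows copied by hand from the gaybourhoods.head(5) output above.\n",
    "# Swap in the full `gaybourhoods` frame to run this on the real data.\n",
    "sample = pd.DataFrame({\n",
    "    'SS_Index_Weight': [39.429995, 41.866815, 22.026394, 15.765361, 15.255337],\n",
    "    'Parade_Weight': [10, 0, 10, 10, 10],\n",
    "    'Bars_Weight': [17.647059, 20.000000, 5.882353, 11.764706, 10.588235],\n",
    "    'TOTINDEX': [67.077054, 61.866815, 37.908747, 37.530067, 35.843573],\n",
    "})\n",
    "\n",
    "# Pairwise Pearson correlations between the pride metrics.\n",
    "corr = sample.corr()\n",
    "print(corr.round(2))\n",
    "\n",
    "# Check whether TOTINDEX matches the sum of the three weights (up to rounding)\n",
    "# in these rows; if so, it is correlated with them by construction.\n",
    "components = sample[['SS_Index_Weight', 'Parade_Weight', 'Bars_Weight']].sum(axis=1)\n",
    "print((components - sample['TOTINDEX']).abs().max() < 1e-5)"
   ]
  },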
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data wrangling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "ename": "FileNotFoundError",
     "evalue": "[Errno 2] No such file or directory: '../data/raw/irs_2015.csv'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mFileNotFoundError\u001b[0m                         Traceback (most recent call last)",
      "Cell \u001b[1;32mIn[44], line 5\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[39m# NOTE: This cell will not work unless this file is in the repository. The source\u001b[39;00m\n\u001b[0;32m      2\u001b[0m \u001b[39m# can be found linked in the references section of the readme, however, it is too\u001b[39;00m\n\u001b[0;32m      3\u001b[0m \u001b[39m# big for GitHub to handle.\u001b[39;00m\n\u001b[1;32m----> 5\u001b[0m irs \u001b[39m=\u001b[39m pd\u001b[39m.\u001b[39;49mread_csv(\u001b[39m\"\u001b[39;49m\u001b[39m../data/raw/irs_2015.csv\u001b[39;49m\u001b[39m\"\u001b[39;49m)\n\u001b[0;32m      7\u001b[0m \u001b[39m# Naively splitting the IRS data set in two. More formal data wrangling will\u001b[39;00m\n\u001b[0;32m      8\u001b[0m \u001b[39m# come later\u001b[39;00m\n\u001b[0;32m      9\u001b[0m irs1 \u001b[39m=\u001b[39m irs\u001b[39m.\u001b[39mhead(\u001b[39mint\u001b[39m(irs\u001b[39m.\u001b[39mshape[\u001b[39m0\u001b[39m] \u001b[39m/\u001b[39m \u001b[39m2\u001b[39m))\n",
      "File \u001b[1;32mc:\\Users\\samia\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\util\\_decorators.py:211\u001b[0m, in \u001b[0;36mdeprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m    209\u001b[0m     \u001b[39melse\u001b[39;00m:\n\u001b[0;32m    210\u001b[0m         kwargs[new_arg_name] \u001b[39m=\u001b[39m new_arg_value\n\u001b[1;32m--> 211\u001b[0m \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n",
      "File \u001b[1;32mc:\\Users\\samia\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\util\\_decorators.py:331\u001b[0m, in \u001b[0;36mdeprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m    325\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mlen\u001b[39m(args) \u001b[39m>\u001b[39m num_allow_args:\n\u001b[0;32m    326\u001b[0m     warnings\u001b[39m.\u001b[39mwarn(\n\u001b[0;32m    327\u001b[0m         msg\u001b[39m.\u001b[39mformat(arguments\u001b[39m=\u001b[39m_format_argument_list(allow_args)),\n\u001b[0;32m    328\u001b[0m         \u001b[39mFutureWarning\u001b[39;00m,\n\u001b[0;32m    329\u001b[0m         stacklevel\u001b[39m=\u001b[39mfind_stack_level(),\n\u001b[0;32m    330\u001b[0m     )\n\u001b[1;32m--> 331\u001b[0m \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n",
      "File \u001b[1;32mc:\\Users\\samia\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:950\u001b[0m, in \u001b[0;36mread_csv\u001b[1;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)\u001b[0m\n\u001b[0;32m    935\u001b[0m kwds_defaults \u001b[39m=\u001b[39m _refine_defaults_read(\n\u001b[0;32m    936\u001b[0m     dialect,\n\u001b[0;32m    937\u001b[0m     delimiter,\n\u001b[1;32m   (...)\u001b[0m\n\u001b[0;32m    946\u001b[0m     defaults\u001b[39m=\u001b[39m{\u001b[39m\"\u001b[39m\u001b[39mdelimiter\u001b[39m\u001b[39m\"\u001b[39m: \u001b[39m\"\u001b[39m\u001b[39m,\u001b[39m\u001b[39m\"\u001b[39m},\n\u001b[0;32m    947\u001b[0m )\n\u001b[0;32m    948\u001b[0m kwds\u001b[39m.\u001b[39mupdate(kwds_defaults)\n\u001b[1;32m--> 950\u001b[0m \u001b[39mreturn\u001b[39;00m _read(filepath_or_buffer, kwds)\n",
      "File \u001b[1;32mc:\\Users\\samia\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:605\u001b[0m, in \u001b[0;36m_read\u001b[1;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[0;32m    602\u001b[0m _validate_names(kwds\u001b[39m.\u001b[39mget(\u001b[39m\"\u001b[39m\u001b[39mnames\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mNone\u001b[39;00m))\n\u001b[0;32m    604\u001b[0m \u001b[39m# Create the parser.\u001b[39;00m\n\u001b[1;32m--> 605\u001b[0m parser \u001b[39m=\u001b[39m TextFileReader(filepath_or_buffer, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwds)\n\u001b[0;32m    607\u001b[0m \u001b[39mif\u001b[39;00m chunksize \u001b[39mor\u001b[39;00m iterator:\n\u001b[0;32m    608\u001b[0m     \u001b[39mreturn\u001b[39;00m parser\n",
      "File \u001b[1;32mc:\\Users\\samia\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1442\u001b[0m, in \u001b[0;36mTextFileReader.__init__\u001b[1;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[0;32m   1439\u001b[0m     \u001b[39mself\u001b[39m\u001b[39m.\u001b[39moptions[\u001b[39m\"\u001b[39m\u001b[39mhas_index_names\u001b[39m\u001b[39m\"\u001b[39m] \u001b[39m=\u001b[39m kwds[\u001b[39m\"\u001b[39m\u001b[39mhas_index_names\u001b[39m\u001b[39m\"\u001b[39m]\n\u001b[0;32m   1441\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mhandles: IOHandles \u001b[39m|\u001b[39m \u001b[39mNone\u001b[39;00m \u001b[39m=\u001b[39m \u001b[39mNone\u001b[39;00m\n\u001b[1;32m-> 1442\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_engine \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_make_engine(f, \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mengine)\n",
      "File \u001b[1;32mc:\\Users\\samia\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\io\\parsers\\readers.py:1735\u001b[0m, in \u001b[0;36mTextFileReader._make_engine\u001b[1;34m(self, f, engine)\u001b[0m\n\u001b[0;32m   1733\u001b[0m     \u001b[39mif\u001b[39;00m \u001b[39m\"\u001b[39m\u001b[39mb\u001b[39m\u001b[39m\"\u001b[39m \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m mode:\n\u001b[0;32m   1734\u001b[0m         mode \u001b[39m+\u001b[39m\u001b[39m=\u001b[39m \u001b[39m\"\u001b[39m\u001b[39mb\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m-> 1735\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mhandles \u001b[39m=\u001b[39m get_handle(\n\u001b[0;32m   1736\u001b[0m     f,\n\u001b[0;32m   1737\u001b[0m     mode,\n\u001b[0;32m   1738\u001b[0m     encoding\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49moptions\u001b[39m.\u001b[39;49mget(\u001b[39m\"\u001b[39;49m\u001b[39mencoding\u001b[39;49m\u001b[39m\"\u001b[39;49m, \u001b[39mNone\u001b[39;49;00m),\n\u001b[0;32m   1739\u001b[0m     compression\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49moptions\u001b[39m.\u001b[39;49mget(\u001b[39m\"\u001b[39;49m\u001b[39mcompression\u001b[39;49m\u001b[39m\"\u001b[39;49m, \u001b[39mNone\u001b[39;49;00m),\n\u001b[0;32m   1740\u001b[0m     memory_map\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49moptions\u001b[39m.\u001b[39;49mget(\u001b[39m\"\u001b[39;49m\u001b[39mmemory_map\u001b[39;49m\u001b[39m\"\u001b[39;49m, \u001b[39mFalse\u001b[39;49;00m),\n\u001b[0;32m   1741\u001b[0m     is_text\u001b[39m=\u001b[39;49mis_text,\n\u001b[0;32m   1742\u001b[0m     errors\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49moptions\u001b[39m.\u001b[39;49mget(\u001b[39m\"\u001b[39;49m\u001b[39mencoding_errors\u001b[39;49m\u001b[39m\"\u001b[39;49m, \u001b[39m\"\u001b[39;49m\u001b[39mstrict\u001b[39;49m\u001b[39m\"\u001b[39;49m),\n\u001b[0;32m   
1743\u001b[0m     storage_options\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49moptions\u001b[39m.\u001b[39;49mget(\u001b[39m\"\u001b[39;49m\u001b[39mstorage_options\u001b[39;49m\u001b[39m\"\u001b[39;49m, \u001b[39mNone\u001b[39;49;00m),\n\u001b[0;32m   1744\u001b[0m )\n\u001b[0;32m   1745\u001b[0m \u001b[39massert\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mhandles \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m\n\u001b[0;32m   1746\u001b[0m f \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mhandles\u001b[39m.\u001b[39mhandle\n",
      "File \u001b[1;32mc:\\Users\\samia\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\pandas\\io\\common.py:856\u001b[0m, in \u001b[0;36mget_handle\u001b[1;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[0;32m    851\u001b[0m \u001b[39melif\u001b[39;00m \u001b[39misinstance\u001b[39m(handle, \u001b[39mstr\u001b[39m):\n\u001b[0;32m    852\u001b[0m     \u001b[39m# Check whether the filename is to be opened in binary mode.\u001b[39;00m\n\u001b[0;32m    853\u001b[0m     \u001b[39m# Binary mode does not support 'encoding' and 'newline'.\u001b[39;00m\n\u001b[0;32m    854\u001b[0m     \u001b[39mif\u001b[39;00m ioargs\u001b[39m.\u001b[39mencoding \u001b[39mand\u001b[39;00m \u001b[39m\"\u001b[39m\u001b[39mb\u001b[39m\u001b[39m\"\u001b[39m \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m ioargs\u001b[39m.\u001b[39mmode:\n\u001b[0;32m    855\u001b[0m         \u001b[39m# Encoding\u001b[39;00m\n\u001b[1;32m--> 856\u001b[0m         handle \u001b[39m=\u001b[39m \u001b[39mopen\u001b[39m(\n\u001b[0;32m    857\u001b[0m             handle,\n\u001b[0;32m    858\u001b[0m             ioargs\u001b[39m.\u001b[39mmode,\n\u001b[0;32m    859\u001b[0m             encoding\u001b[39m=\u001b[39mioargs\u001b[39m.\u001b[39mencoding,\n\u001b[0;32m    860\u001b[0m             errors\u001b[39m=\u001b[39merrors,\n\u001b[0;32m    861\u001b[0m             newline\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m    862\u001b[0m         )\n\u001b[0;32m    863\u001b[0m     \u001b[39melse\u001b[39;00m:\n\u001b[0;32m    864\u001b[0m         \u001b[39m# Binary mode\u001b[39;00m\n\u001b[0;32m    865\u001b[0m         handle \u001b[39m=\u001b[39m \u001b[39mopen\u001b[39m(handle, ioargs\u001b[39m.\u001b[39mmode)\n",
      "\u001b[1;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '../data/raw/irs_2015.csv'"
     ]
    }
   ],
   "source": [
    "# NOTE: This cell will not work unless this file is in the repository. The source\n",
    "# can be found linked in the references section of the readme, however, it is too\n",
    "# big for GitHub to handle.\n",
    "\n",
    "irs = pd.read_csv(\"../data/raw/irs_2015.csv\")\n",
    "\n",
    "# Naively splitting the IRS data set in two. More formal data wrangling will\n",
    "# come later\n",
    "irs1 = irs.head(int(irs.shape[0] / 2))\n",
    "irs2 = irs.tail(int(irs.shape[0] / 2))\n",
    "\n",
    "irs1.to_csv(\"../data/processed/irs_2015_1\", index=False)\n",
    "irs2.to_csv(\"../data/processed/irs_2015_2\", index=False)"
   ]
  },
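  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the split-and-rejoin round trip, using a small hypothetical frame in place of the IRS data (the values and the `part_1.csv` and `part_2.csv` file names are made up for illustration). Slicing with `iloc` at a single midpoint covers every row exactly once, so the rejoined frame should match the original even when the row count is odd."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for the IRS frame, with an odd number of rows.\n",
    "frame = pd.DataFrame({'zipcode': [1730, 1731, 1742, 1760, 1770], 'N2': [1, 2, 3, 4, 5]})\n",
    "\n",
    "half = len(frame) // 2\n",
    "parts = [frame.iloc[:half], frame.iloc[half:]]  # the two slices cover every row once\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    paths = []\n",
    "    for i, part in enumerate(parts, start=1):\n",
    "        path = Path(tmp) / f'part_{i}.csv'  # hypothetical file name\n",
    "        part.to_csv(path, index=False)\n",
    "        paths.append(path)\n",
    "    # ignore_index=True gives the rejoined frame a fresh 0..n-1 index.\n",
    "    rebuilt = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)\n",
    "\n",
    "# Compare the rejoined frame with the original.\n",
    "print(rebuilt.equals(frame))"
   ]
  },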
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now these two datasets can be joined and worked with\n",
    "irs = pd.concat([\n",
    "    pd.read_csv(\"../data/processed/irs_2015_1\"),\n",
    "    pd.read_csv(\"../data/processed/irs_2015_2\")\n",
    "])\n",
    "# irs.head()\n",
    "\n",
    "\n",
    "#selected data: ZIPCODE - this will be used in conjunction with the rest of the set\n",
    "            #   N2 - population of zip code\n",
    "            \n",
    "            #data of intrest\n",
    "                #     A11900\tTotal overpayments amount\n",
    "                #   AGI_STUB - metric for income\n",
    "\n",
    "# print(irs.loc[irs['zipcode']==90069])\n",
    "# df = {irs['zipcode'], irs['N2']}\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                 zip    population        income  overall tax paid\n",
      "count  166698.000000  1.666980e+05  166698.00000      1.666980e+05\n",
      "mean    48877.636432  3.432536e+03       3.50000      1.844871e+03\n",
      "std     27146.337114  6.676873e+04       1.70783      5.785610e+04\n",
      "min         0.000000  0.000000e+00       1.00000      0.000000e+00\n",
      "25%     27040.000000  1.400000e+02       2.00000      1.600000e+01\n",
      "50%     48879.000000  5.100000e+02       3.50000      1.440000e+02\n",
      "75%     70607.000000  2.000000e+03       5.00000      6.310000e+02\n",
      "max     99999.000000  9.566490e+06       6.00000      1.557123e+07\n",
      "          zip  population  income  overall tax paid\n",
      "0           0   1356760.0       1           48150.0\n",
      "1           0   1010990.0       2          107304.0\n",
      "2           0    583910.0       3          139598.0\n",
      "3           0    423990.0       4          128823.0\n",
      "4           0    589490.0       5          421004.0\n",
      "...       ...         ...     ...               ...\n",
      "166693  99999      6660.0       2             869.0\n",
      "166694  99999      5440.0       3            1273.0\n",
      "166695  99999      4780.0       4            1635.0\n",
      "166696  99999      6930.0       5            5576.0\n",
      "166697  99999      1890.0       6           14487.0\n",
      "\n",
      "[166698 rows x 4 columns]\n"
     ]
    }
   ],
   "source": [
    "#wrangle tax\n",
    "taxdf = pd.DataFrame(zip(irs['zipcode'], irs['N2'], irs['agi_stub'], irs['A11901']))\n",
    "taxdf.columns=('zip', 'population', 'income', 'overall tax paid')\n",
    "print(taxdf.describe())\n",
    "print(taxdf)\n",
    "# print(irs.columns)"
   ]
  },
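  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The summary above includes rows with `zip` values of 0 and 99999 and very large populations. Assuming these are aggregate or catch-all records rather than real ZIP codes (an assumption that should be checked against the IRS documentation), a sketch of filtering them out and restoring leading zeros could look like the following. The `taxdf_demo` frame and the `placeholder_codes` list are hypothetical stand-ins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical rows shaped like taxdf; 0 and 99999 are assumed placeholder codes.\n",
    "taxdf_demo = pd.DataFrame({\n",
    "    'zip': [0, 1730, 90069, 99999],\n",
    "    'population': [1356760.0, 13570.0, 12000.0, 6660.0],\n",
    "})\n",
    "\n",
    "placeholder_codes = [0, 99999]\n",
    "real_zips = taxdf_demo[~taxdf_demo['zip'].isin(placeholder_codes)].copy()\n",
    "\n",
    "# ZIP codes are identifiers, not numbers: store them as zero-padded strings.\n",
    "# Any frame merged on this key would need the same conversion first.\n",
    "real_zips['zip'] = real_zips['zip'].astype(str).str.zfill(5)\n",
    "print(real_zips['zip'].tolist())"
   ]
  },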
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                zip  gay tax rate\n",
      "count   2328.000000   2328.000000\n",
      "mean   48616.478522   4103.440722\n",
      "std    35481.240641   3140.699446\n",
      "min     1730.000000      0.000000\n",
      "25%    11362.750000   1767.500000\n",
      "50%    46351.000000   3635.000000\n",
      "75%    80234.250000   5745.000000\n",
      "max    98686.000000  24560.000000\n",
      "        zip  gay tax rate\n",
      "0     90069          2120\n",
      "1     94114          5080\n",
      "2     10011          5790\n",
      "3     10014          3510\n",
      "4     94103          2660\n",
      "...     ...           ...\n",
      "2323  97208             0\n",
      "2324  98154             0\n",
      "2325  98158             0\n",
      "2326  98174             0\n",
      "2327  98195             0\n",
      "\n",
      "[2328 rows x 2 columns]\n"
     ]
    }
   ],
   "source": [
    "#wrangle gay\n",
    "gaydf = pd.DataFrame(zip(gaybourhoods['GEOID10'], gaybourhoods['Tax_Mjoint']))\n",
    "gaydf.columns=(('zip', 'gay tax rate'))\n",
    "\n",
    "print(gaydf.describe())\n",
    "print(gaydf)\n",
    "\n",
    "# gaybourhoods.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                zip     population  gay tax rate  overall tax paid  income\n",
      "count   2184.000000    2184.000000   2184.000000       2184.000000  2184.0\n",
      "mean   48935.203297   26691.730769   4373.997253        596.719322     1.0\n",
      "std    35451.335807   17960.713867   3054.620840        615.174358     0.0\n",
      "min     1730.000000     160.000000      0.000000          0.000000     1.0\n",
      "25%    11360.750000   13337.500000   2110.000000        217.000000     1.0\n",
      "50%    60023.500000   24070.000000   3900.000000        434.000000     1.0\n",
      "75%    80227.250000   35640.000000   5902.500000        777.250000     1.0\n",
      "max    98686.000000  114420.000000  24560.000000       9166.000000     1.0\n",
      "------------------------------------------------------------------------\n",
      "         zip  population  gay tax rate  overall tax paid  income\n",
      "zip                                                             \n",
      "1730    1730     13570.0          3260             150.0       1\n",
      "1731    1731      2450.0           550               0.0       1\n",
      "1742    1742     17170.0          4220             297.0       1\n",
      "1760    1760     34350.0          7880             468.0       1\n",
      "1770    1770      4310.0          1060              46.0       1\n",
      "...      ...         ...           ...               ...     ...\n",
      "98682  98682     57010.0         11080             703.0       1\n",
      "98683  98683     30700.0          6470             358.0       1\n",
      "98684  98684     27630.0          5390             371.0       1\n",
      "98685  98685     27540.0          6490             298.0       1\n",
      "98686  98686     17800.0          4120             215.0       1\n",
      "\n",
      "[2184 rows x 5 columns]\n"
     ]
    }
   ],
   "source": [
    "#merge\n",
    "df = pd.merge(taxdf, gaydf)\n",
    "\n",
    "# print(df)\n",
    "\n",
    "df2 = df.groupby(df['zip']).aggregate({ 'zip':'first',\n",
    "                                        'population': 'sum',\n",
    "                                        'gay tax rate':'first',\n",
    "                                        'overall tax paid':'first',\n",
    "                                        'income':'first'\n",
    "                                                                })\n",
    "\n",
    "print(df2.describe())\n",
    "print(\"------------------------------------------------------------------------\")\n",
    "print(df2)"
   ]
  },
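  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible alternative to taking the `first` value per zip code, sketched here with hypothetical numbers: sum the tax data across income brackets before merging, so that each zip code contributes its total. The `tax_demo` and `gay_demo` frames are made-up stand-ins for `taxdf` and `gaydf`, and summing is only one reasonable choice of aggregation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical values; one row per zip code and income bracket, as in taxdf.\n",
    "tax_demo = pd.DataFrame({\n",
    "    'zip': [90069, 90069, 94114, 94114],\n",
    "    'income': [1, 2, 1, 2],\n",
    "    'population': [100.0, 200.0, 300.0, 400.0],\n",
    "    'overall tax paid': [10.0, 20.0, 30.0, 40.0],\n",
    "})\n",
    "gay_demo = pd.DataFrame({'zip': [90069, 94114], 'gay tax rate': [2120, 5080]})\n",
    "\n",
    "# Collapse the brackets first so each zip code appears exactly once.\n",
    "per_zip = tax_demo.groupby('zip', as_index=False)[['population', 'overall tax paid']].sum()\n",
    "\n",
    "# validate='one_to_one' raises an error if either side repeats a zip code.\n",
    "merged = per_zip.merge(gay_demo, on='zip', how='inner', validate='one_to_one')\n",
    "print(merged)"
   ]
  },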
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#compare taxes paid by queers to taxes paid by general"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.1"
  },
  "vscode": {
   "interpreter": {
    "hash": "b2baa059f790e7ad780c83135aaea020c73a7a7a6921010b599b8b664933698d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}