# Sami Almuallim

## Research question/interests

**How are the different metrics of pride represented in this data set correlated?** Answering this question will provide a foundation upon which we can work to answer the more complicated questions that follow.

- This will probably be the simplest research question, requiring only the data contained in our original data set. To explore this topic, we will use different visualization methods discussed in class to develop a better understanding of the data.
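
As a minimal sketch of a first pass at this question, pairwise Pearson correlations between a few of the index columns can be computed with pandas. The values are only the five rows visible in the `gaybourhoods.head(5)` preview, so the printed numbers are illustrative. The choice of columns is an assumption about which metrics of pride matter.

```python
import pandas as pd

# Five rows copied from the preview of gaybourhoods.csv; a real analysis
# would load the full file instead.
sample = pd.DataFrame(
    {
        "FF_Index": [6.724415, 9.834048, 4.370779, 6.055939, 3.004058],
        "MM_Index": [48.288254, 48.578469, 26.360413, 15.939869, 18.280165],
        "SS_Index": [55.012669, 58.412517, 30.731192, 21.995808, 21.284224],
        "Parade_Weight": [10, 0, 10, 10, 10],
        "Bars_Weight": [17.647059, 20.0, 5.882353, 11.764706, 10.588235],
        "TOTINDEX": [67.077054, 61.866815, 37.908747, 37.530067, 35.843573],
    }
)

# Pairwise Pearson correlation between every pair of metric columns.
pride_corr = sample.corr()
print(pride_corr.round(2))
```

The resulting matrix is symmetric with ones on the diagonal, so it could be passed directly to a heatmap as one of the visualizations.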

**Is there a positive or a negative correlation between taxes paid and the pride of a given queer neighbourhood?** Taxes are influenced by a variety of socio-economic factors, and we hope that by analyzing both tax data and our quantification of queerness at a geographic level, we'll be able to glean insight into how queerness and class are interrelated.

- To answer this, we'll need to find another data set containing geographically located tax data, which should be easy to acquire from the US government (for example, in our cursory research, we found [this data set from the IRS](https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2018-zip-code-data-soi)).
- This would bring the number of data sets used in this project up to three, which might present some challenges in terms of the amount of data wrangling necessary to bring it all together.
- To measure this, we would rank the neighbourhoods in the gaybourhoods data set by pride (how to quantify pride is an open question that we will explore in a separate research question).
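
One way to make the ranking concrete, sketched under stated assumptions: use `TOTINDEX` as a stand-in for pride (how pride is actually measured is still open) and compare pride ranks with the ranks of a tax measure. The `tax_paid` column and its values are hypothetical placeholders, not real IRS figures, and `rank_correlation` is an illustrative helper. The Pearson correlation of two rank columns is the Spearman rank correlation.

```python
import pandas as pd


def rank_correlation(df, pride_col, tax_col):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    pride_rank = df[pride_col].rank()
    tax_rank = df[tax_col].rank()
    return pride_rank.corr(tax_rank)


neighbourhoods = pd.DataFrame(
    {
        "GEOID10": [90069, 94114, 10011, 10014, 94103],
        # TOTINDEX values copied from the gaybourhoods preview.
        "TOTINDEX": [67.077054, 61.866815, 37.908747, 37.530067, 35.843573],
        # HYPOTHETICAL placeholder values, not real tax data.
        "tax_paid": [3.0, 5.0, 1.0, 4.0, 2.0],
    }
)

# Rank 1 is the neighbourhood with the highest pride score.
neighbourhoods["pride_rank"] = (
    neighbourhoods["TOTINDEX"].rank(ascending=False).astype(int)
)

rho = rank_correlation(neighbourhoods, "TOTINDEX", "tax_paid")
print(neighbourhoods[["GEOID10", "pride_rank"]])
print(f"Spearman rank correlation: {rho:.2f}")
```

A positive `rho` would suggest that neighbourhoods ranked higher on pride also tend to rank higher on the tax measure; a negative one would suggest the opposite.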

In [16]:
import pandas as pd

gaybourhoods = pd.read_csv("../data/raw/gaybourhoods.csv")
gaybourhoods.head(5)

| | GEOID10 | Tax_Mjoint | Mjoint_MF | Mjoint_SS | Mjoint_FF | Mjoint_MM | TaxRate_SS | TaxRate_FF | TaxRate_MM | Cns_TotHH | ... | FF_Cns | FF_Index | MM_Tax | MM_Cns | MM_Index | SS_Index | SS_Index_Weight | Parade_Weight | Bars_Weight | TOTINDEX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90069 | 2120 | 1689 | 431 | 61 | 370 | 203.301887 | 28.773585 | 174.528302 | 12551 | ... | 1.847099 | 6.724415 | 29.583721 | 18.704533 | 48.288254 | 55.012669 | 39.429995 | 10 | 17.647059 | 67.077054 |
| 1 | 94114 | 5080 | 4036 | 1044 | 170 | 874 | 205.511811 | 33.464567 | 172.047244 | 16456 | ... | 4.161579 | 9.834048 | 29.163165 | 19.415304 | 48.578469 | 58.412517 | 41.866815 | 0 | 20.0 | 61.866815 |
| 2 | 10011 | 5790 | 5166 | 624 | 97 | 527 | 107.772021 | 16.753022 | 91.018998 | 29762 | ... | 1.531029 | 4.370779 | 15.428332 | 10.932081 | 26.360413 | 30.731192 | 22.026394 | 10 | 5.882353 | 37.908747 |
| 3 | 10014 | 3510 | 3229 | 281 | 74 | 207 | 80.05698 | 21.082621 | 58.974359 | 18786 | ... | 2.482293 | 6.055939 | 9.996551 | 5.943318 | 15.939869 | 21.995808 | 15.765361 | 10 | 11.764706 | 37.530067 |
| 4 | 94103 | 2660 | 2417 | 243 | 34 | 209 | 91.353383 | 12.781955 | 78.571429 | 12728 | ... | 0.837431 | 3.004058 | 13.318386 | 4.961779 | 18.280165 | 21.284224 | 15.255337 | 10 | 10.588235 | 35.843573 |


## Data wrangling

In [19]:
# NOTE: This cell only works if irs_2015.csv is present in ../data/raw. The
# file is too big for GitHub to handle, so it is not in the repository; its
# source is linked in the references section of the readme.
irs = pd.read_csv("../data/raw/irs_2015.csv")

# Naively split the IRS data set in two. More formal data wrangling will
# come later.
half = len(irs) // 2
irs1 = irs.iloc[:half]
irs2 = irs.iloc[half:]  # slicing from `half` keeps the middle row when the row count is odd

irs1.to_csv("../data/processed/irs_2015_1", index=False)
irs2.to_csv("../data/processed/irs_2015_2", index=False)

In [20]:
# Now the two halves can be joined back together and worked with
irs = pd.concat(
    [
        pd.read_csv("../data/processed/irs_2015_1"),
        pd.read_csv("../data/processed/irs_2015_2"),
    ],
    ignore_index=True,  # rebuild one continuous index across both halves
)
irs.head()

| | STATEFIPS | STATE | zipcode | agi_stub | N1 | mars1 | MARS2 | MARS4 | PREP | N2 | ... | N10300 | A10300 | N85530 | A85530 | N85300 | A85300 | N11901 | A11901 | N11902 | A11902 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AL | 0 | 1 | 836320.0 | 481570.0 | 109790.0 | 233260.0 | 455560.0 | 1356760.0 | ... | 373410.0 | 328469.0 | 0.0 | 0.0 | 0.0 | 0.0 | 61920.0 | 48150.0 | 732670.0 | 1933120.0 |
| 1 | 1 | AL | 0 | 2 | 494830.0 | 206630.0 | 146250.0 | 129390.0 | 275920.0 | 1010990.0 | ... | 395880.0 | 965011.0 | 0.0 | 0.0 | 0.0 | 0.0 | 73720.0 | 107304.0 | 415410.0 | 1187403.0 |
| 2 | 1 | AL | 0 | 3 | 261250.0 | 80720.0 | 139280.0 | 36130.0 | 155100.0 | 583910.0 | ... | 251490.0 | 1333418.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64200.0 | 139598.0 | 193030.0 | 536699.0 |
| 3 | 1 | AL | 0 | 4 | 166690.0 | 28510.0 | 124650.0 | 10630.0 | 99950.0 | 423990.0 | ... | 165320.0 | 1414283.0 | 0.0 | 0.0 | 0.0 | 0.0 | 45460.0 | 128823.0 | 116440.0 | 377177.0 |
| 4 | 1 | AL | 0 | 5 | 212660.0 | 19520.0 | 184320.0 | 4830.0 | 126860.0 | 589490.0 | ... | 212000.0 | 3820152.0 | 420.0 | 168.0 | 60.0 | 31.0 | 83330.0 | 421004.0 | 121570.0 | 483682.0 |
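
The IRS preview has one row per `zipcode` and `agi_stub` combination, while the `GEOID10` values in gaybourhoods look like ZIP codes. A possible next wrangling step, sketched here under assumptions, is to sum the IRS rows for each ZIP code and join the result onto gaybourhoods. The helper name `tax_by_zip`, the choice of `A10300` as the tax measure, and the treatment of `zipcode == 0` as a state-level total row are all assumptions to check against the IRS documentation. The small frames use made-up placeholder numbers.

```python
import pandas as pd


def tax_by_zip(irs, tax_col="A10300"):
    """Sum `tax_col` over all `agi_stub` rows for each ZIP code."""
    # Assumption: zipcode 0 marks state-level totals rather than a real ZIP.
    zip_rows = irs[irs["zipcode"] != 0]
    return zip_rows.groupby("zipcode", as_index=False)[tax_col].sum()


# Placeholder frames with made-up numbers, for illustration only.
irs_demo = pd.DataFrame(
    {
        "zipcode": [0, 0, 90069, 90069, 94114],
        "agi_stub": [1, 2, 1, 2, 1],
        "A10300": [100.0, 200.0, 10.0, 20.0, 5.0],
    }
)
gay_demo = pd.DataFrame(
    {
        "GEOID10": [90069, 94114, 10011],
        "TOTINDEX": [67.077054, 61.866815, 37.908747],
    }
)

# A left join keeps every neighbourhood, even when no tax rows match.
merged = gay_demo.merge(
    tax_by_zip(irs_demo), left_on="GEOID10", right_on="zipcode", how="left"
)
print(merged)
```

With a left join, neighbourhoods that have no matching IRS rows stay in the result with `NaN` in the tax column, which makes gaps in the tax data easy to spot before any correlation is computed.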
