Pandas and (some) pandasql:
Beginning with the data, I had to clean the .csv files that had a large number of columns or unnecessary information. Mainly what I did was keep the rows where a given column held the value I needed; for example, the rows where df['Industry'] == 'Concert Venue'. After that I would export those files and manually add a df['Borough'] column to the data frames, based on each row's address.
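A minimal sketch of that filtering-and-export step, assuming a hypothetical venues.csv (the actual file and output names aren't given in the write-up):

```python
import pandas as pd

# Hypothetical input file; the real .csv names aren't specified here.
df = pd.read_csv('venues.csv')

# Keep only the rows whose 'Industry' column matches the value of interest.
venues = df[df['Industry'] == 'Concert Venue']

# Add an empty 'Borough' column to be filled in by hand from each address,
# then export the filtered frame.
venues = venues.assign(Borough='')
venues.to_csv('concert_venues.csv', index=False)
```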
For the data links that didn't have a downloadable .csv file, I copy-pasted the tables into the Atom editor and used its regex 'Find in Current Buffer' feature to replace certain characters with others. This let me quickly turn a table from a random website into a proper .csv file I could work with. After cleaning up the individual data frames, I combined them so I could count the values per borough (for one of the data frames I had to change 'New York' to 'Manhattan' in the df['Borough'] column).
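A sketch of the combine-and-count step, with placeholder file names standing in for the cleaned .csv files:

```python
import pandas as pd

# Placeholder names for the cleaned per-source files.
frames = [pd.read_csv(f) for f in ('concert_venues.csv', 'theaters.csv')]
combined = pd.concat(frames, ignore_index=True)

# One source labeled Manhattan rows as 'New York'; normalize before counting.
combined['Borough'] = combined['Borough'].replace('New York', 'Manhattan')

# Count how many venues fall in each borough.
per_borough = combined['Borough'].value_counts()
print(per_borough)
```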
With the INS files, I removed all rows where df['Year'] != '2019', since the dataset I had with the share of workers per borough was from 2019, then combined them as well.
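The year filter looks roughly like this, assuming the INS files store the year as a string (file name is a placeholder):

```python
import pandas as pd

# Placeholder name for one of the INS files.
ins = pd.read_csv('ins_data.csv')

# Drop every row that isn't from 2019, to match the 2019 worker-share data.
# The year is compared as a string because that's how it was stored here.
ins_2019 = ins[ins['Year'] == '2019']
```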
Numpy and some basic Python:
There were a few places where I needed the sum or average of a column. In some of them the values were stored as strings, so I converted them to ints and then used .sum() and .mean(); unrelated to numpy, I also used round(x) a couple of times. I then formatted the results so that some were displayed as dollar values, or at least as something without a ridiculously long string of digits. The values computed this way are shown in the folium map popups.
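A sketch of that conversion and formatting, with made-up numbers and a hypothetical 'Wage' column:

```python
import pandas as pd

# Hypothetical column and values; the real ones aren't given in the write-up.
df = pd.DataFrame({'Wage': ['52000', '61000', '48000']})

# The values arrived as strings, so cast them before aggregating.
df['Wage'] = df['Wage'].astype(int)

total = df['Wage'].sum()
average = round(df['Wage'].mean())

# Format as dollar values so the popups don't show long raw numbers.
print(f"${total:,}")    # $161,000
print(f"${average:,}")  # $53,667
```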
Matplotlib:
There are two pie charts in the output: one shows each borough's share of workers, and the other shows how many venues each borough has (with percentages as well).
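Both charts follow the same matplotlib pattern; here is a sketch with placeholder counts, where autopct prints each slice's percentage on the chart:

```python
import matplotlib.pyplot as plt

# Placeholder counts; the real numbers come from the combined dataframe.
boroughs = ['Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island']
venue_counts = [40, 25, 18, 10, 7]

plt.pie(venue_counts, labels=boroughs, autopct='%1.1f%%')
plt.title('Venues per borough')
plt.show()
```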
Folium:
I made a map that presents the information gathered from the datasets as popups.
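A minimal version of that folium setup, with placeholder popup text standing in for the computed statistics:

```python
import folium

# Center the map on New York City.
m = folium.Map(location=[40.7128, -74.0060], zoom_start=10)

# One marker per borough; the popup text here is a placeholder for the
# per-borough statistics computed earlier.
folium.Marker(
    location=[40.7831, -73.9712],  # Manhattan
    popup='Manhattan: 40 venues',
).add_to(m)

m.save('map.html')
```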
Comments: