A while back, a thread on Hacker News asked people to submit their employer and salary information. Being in the start-up scene and following Hacker News pretty closely, I was instantly interested in the data this survey would produce. I started cleaning it up that night and published the code to GitHub. In this post I’ll present the ETL code I wrote and analyze the results.
Extract and Transform
I decided to load the data into a Postgres database to make joins and categorization easier. Extraction was just a matter of loading the results CSV into a DataFrame and writing it out to SQL.
db = sqlalchemy.create_engine('postgresql://localhost:5432/hackernews')
dataframe = pandas.read_csv(sheet_url, header=0,
                            names=['ts', 'employer', 'location', 'job_title',
                                   'employer_experience_years', 'total_experience_years',
                                   'annual_base_pay', 'signing_bonus', 'annual_bonus',
                                   'stock_value_bonus', 'gender', 'comments'])
dataframe.to_sql('salary_staging', db, index_label='id')
dispose = db.dispose()
A similar process is used to load dimension tables with external data into the database for joins.
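As a rough sketch of that dimension-load pattern, the snippet below uses an inline CSV and an in-memory SQLite engine so it runs standalone; the real code points the engine at Postgres and reads external files (the `us_states` column names here are assumptions based on the queries later in the post):

```python
import io

import pandas
import sqlalchemy

# Inline sample standing in for an external file such as a US states CSV;
# the real ETL would pass a Postgres URL to create_engine instead.
us_states_csv = io.StringIO("name,abbreviation\nCalifornia,CA\nNew York,NY\n")

db = sqlalchemy.create_engine('sqlite://')
states = pandas.read_csv(us_states_csv)
states.to_sql('us_states', db, index_label='id')  # dimension table for joins

out = pandas.read_sql('SELECT * FROM us_states', db)
print(out)
db.dispose()
```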
The next step is to cleanse the messy data generated by free-text fields. Each step below includes sample queries; the full cleanse-and-transform process is available in transform.py
Three more dimension tables (locations, employers, and job titles) are created from the survey data.
INSERT INTO locations (location_name)
SELECT DISTINCT lower(trim(location))
FROM salary_staging
WHERE lower(trim(location)) NOT IN (SELECT location_name FROM locations)
  AND location IS NOT NULL;
Additional data, like state and coordinates, are added to the locations table from external sources.
UPDATE locations
SET location_state = abbreviation,
    location_country = 'US'
FROM us_states
WHERE (locations.location_name LIKE '%' || LOWER(us_states.name) || '%'
    OR locations.location_name LIKE '%, ' || LOWER(us_states.abbreviation) || '%'
    OR locations.location_name LIKE '% ' || LOWER(us_states.abbreviation))
  AND locations.location_state IS NULL
  AND locations.location_country IS NULL;

UPDATE locations
SET location_latitude = "Lat",
    location_longitude = "Long"
FROM zip_codes
WHERE location_city = lower("City")
  AND location_state = "State";
Categorize job titles
Add a job category and rank to job titles.
UPDATE job_titles
SET job_title_category = 'Web'
WHERE (job_title LIKE 'full_stack%'
    OR job_title LIKE '%fullstack%'
    OR job_title LIKE '%front_end%'
    OR job_title LIKE '%frontend%'
    OR job_title LIKE '%web %')
  AND job_title_category IS NULL;

UPDATE job_titles
SET job_title_rank = 'Junior'
WHERE (job_title LIKE 'junior%' OR job_title LIKE 'jr_%')
  AND job_title_rank IS NULL;
Finally, insert the salaries into a normalized table.
INSERT INTO salaries (
    salary_id, employer_key, location_key, job_title_key, submitted_at,
    total_experience_years, employer_experience_years, annual_base_pay,
    signing_bonus, annual_bonus, stock_value_bonus, gender, comments
)
SELECT
    stg.id AS salary_id,
    em.employer_key AS employer_key,
    lo.location_key AS location_key,
    jt.job_title_key AS job_title_key,
    stg.ts::TIMESTAMP AS submitted_at,
    stg.total_experience_years::DECIMAL(6,2),
    stg.employer_experience_years::DECIMAL(6,2),
    stg.annual_base_pay::DECIMAL(12,2),
    stg.signing_bonus::DECIMAL(12,2),
    stg.annual_bonus::DECIMAL(12,2),
    stg.stock_value_bonus,
    stg.gender::gender_type,
    stg.comments
FROM salary_staging stg
LEFT JOIN employers em ON lower(trim(stg.employer)) = em.employer_name
LEFT JOIN locations lo ON lower(trim(stg.location)) = lo.location_name
LEFT JOIN job_titles jt ON lower(trim(stg.job_title)) = jt.job_title
WHERE stg.id NOT IN (SELECT salary_id FROM salaries);
Analysis
I’ll be examining the results of some of the queries in basic_analysis.sql.
First, let’s see if we can identify careers with the highest starting salaries. I’ll group by job category and calculate the median salary per year working and the median salary per year at the current employer. A much higher median salary per year at the current employer than per year working overall indicates a higher median starting salary.
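The grouping can be sketched in pandas with toy numbers (the column names mirror the normalized salaries table; the actual query lives in basic_analysis.sql):

```python
import pandas

# Toy rows standing in for the normalized salaries table (values made up).
salaries = pandas.DataFrame({
    'job_title_category': ['Software', 'Software', 'Management', 'Management'],
    'annual_base_pay':    [100000.0, 140000.0, 150000.0, 180000.0],
    'total_experience_years':    [5.0, 10.0, 5.0, 10.0],
    'employer_experience_years': [2.0, 4.0, 1.0, 2.0],
})

# Median pay per year of total experience vs. per year at the current employer;
# a large gap in the second column hints at a high starting salary.
per_year = salaries.assign(
    pay_per_year_working=salaries['annual_base_pay'] / salaries['total_experience_years'],
    pay_per_year_employer=salaries['annual_base_pay'] / salaries['employer_experience_years'],
).groupby('job_title_category')[['pay_per_year_working', 'pay_per_year_employer']].median()

print(per_year)
```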
Not surprisingly, management appears to have a great median starting salary. Apart from that, data and operations jobs also have better than average starting salaries.
Next, let’s see how salaries trend as employees gain experience. I’d expect every employment category to trend upward as time goes on, but maybe there are fields with larger increases.
As expected, all of the employment categories trend upward, some more than others. Software has the highest entry-level salary and ends up in the upper part of the 8-10+ range, while Web development seems to stay fairly stagnant in the high five figures. One concerning thing, as a data engineer, is the dive Data jobs take after 8 years of experience. I believe (hope) this is just a lack of good data on experienced data employees rather than an actual trend. I’ll let you know in a few years ;)
Lastly, I want to see where the Hacker News community works. I’ll group by city and count the results.
Somewhat surprisingly, the overwhelmingly largest result is New York City. I knew NYC had a large start-up community, but I definitely wasn’t expecting it to outrank the Bay Area. Looking at the other high-ranking results, several Silicon Valley cities show up. Just out of curiosity, I want to see whether the number of jobs within a few miles of San Francisco is larger than in NYC.
-- distance from SF: 37.762240, -122.446640
SELECT COUNT(*)
FROM salaries_v
WHERE location_city IS NOT NULL
  AND ABS(37.762240 - location_latitude) < 1.5
  AND ABS(-122.446640 - location_longitude) < 1.5;
That makes more sense. There were 80 responses within approximately 20 miles of the SF city center. The Silicon Valley area comprises many smaller cities rather than one large city like New York, so it makes sense for SV jobs to report diverse city names.
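The same bounding-box filter can be sketched in pandas; note it is a crude ±1.5° box, not a true great-circle distance, and the rows here are toy coordinates rather than the real salaries_v view:

```python
import pandas

# Toy location rows; the real data comes from the salaries_v view.
rows = pandas.DataFrame({
    'location_city': ['san francisco', 'oakland', 'new york'],
    'location_latitude':  [37.7749, 37.8044, 40.7128],
    'location_longitude': [-122.4194, -122.2712, -74.0060],
})

SF_LAT, SF_LON = 37.762240, -122.446640
near_sf = rows[
    rows['location_city'].notna()
    & ((SF_LAT - rows['location_latitude']).abs() < 1.5)
    & ((SF_LON - rows['location_longitude']).abs() < 1.5)
]
print(len(near_sf))  # 2: SF and Oakland fall inside the box, NYC does not
```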
This post took the survey results at face-value and attempted to draw some conclusions. Overall, I think the data presented here is accurate and valuable, but there are quite a few outlier responses that might have skewed some results. In a future post, I plan to match the data in this survey with larger datasets and some external data sources to verify its integrity and expand on the insights here.