Learning R and Shiny using the public perm labor data has given me a chance to peak into the real world challenges of data curation and wrangling.
Often there is a need to determine, how a certain measure ranked periodically (yearly or monthly etc.). In perm data, the employee’s country of origin is tracked. This gives a chance for some interesting analysis.
For e.g., Which foreign country accounts for the most perm labor certifications year over year?
Calculating the frequency of the country of citizenship to the number of applicants per year can be done using the frequency table function.
The raw data is of the format:
Company_Name | Job_Title | Salary | Country_of_Citizenship | Year |
---|---|---|---|---|
…. | …. | …. | INDIA | 2014 |
…. | …. | …. | CHINA | 2014 |
…. | …. | …. | SINGAPORE | 2014 |
…. | …. | …. | ZIMBABWE | 2013 |
…. | …. | …. | INDIA | 2013 |
…. | …. | …. | CHINA | 2013 |
…. | …. | …. | SUDAN | 2012 |
…. | …. | …. | …. | …. |
By using the table function, we can determine the frequency of the country per year.
By converting the table to a data frame and renaming the columns, I arrived at the following data frame.
Country | Year | Count |
---|---|---|
INDIA | 2014 | 35099 |
CHINA | 2014 | 5098 |
SINGAPORE | 2014 | 321 |
ZIMBABWE | 2013 | 25 |
INDIA | 2013 | 28765 |
CHINA | 2013 | 6756 |
…. | …. | …. |
I used the following functions to determine the rank of the countries at a certain year. i.e., the country that had the highest count for the year 2014 will be ranked as 1 for 2014.
- rank(x) – determines of the rank of the vector x
The ties.method specifies the logic to use when there are ties in the ranking, in this case, the first appearance will take the lower rank.
- ave(x, …, FUN = mean) - Subsets of x[] are averaged, where each subset consist of those observations with the same factor levels
For my use case, I did the following:
The above function, applies the “rank” function grouped by Year, on the Count vector.
Finally, I used the transform function to include a new column “Year.rank” to the data frame to obtain the ranks of the country for a particular year.
The –x is used to give the lowest rank to the highest count.
The final data frame with ranks looks like this:
Country | Year | Count | Year.rank |
---|---|---|---|
INDIA | 2014 | 35099 | 1 |
CHINA | 2014 | 5098 | 2 |
SINGAPORE | 2014 | 321 | 98 |
ZIMBABWE | 2013 | 25 | 150 |
INDIA | 2013 | 28765 | 1 |
CHINA | 2013 | 6756 | 2 |
…. | …. | …. | …. |
Using the above data frame, a subsetted top 10 or 20 is used for a line plot. I used ggplot to arrive at the following graph:
The final plot looks like this:
Fork the code at Github