Does COVID-19 impact the job market in the Netherlands and, if so, how does it affect your chances of finding a job in your industry? In this article we'll look at job offers in different industries in the Netherlands and compare the numbers to COVID-19 statistics obtained from a global dataset.
pip install -q -r requirements.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
For our analysis, we need statistical data on COVID-19 over time, preferably with the daily number of new COVID cases. Our World in Data provides a dataset that fits this purpose. We downloaded their CSV dataset on March 1st, 2021.
First, we'll import the global COVID-19 dataset, which is maintained by Our World in Data. It is updated daily and includes data on confirmed cases, deaths, hospitalizations, testing, and vaccinations, as well as other variables of potential interest.
raw_cov_df = pd.read_csv('owid-covid-data.csv')
raw_cov_df.head()
cov_df_nl = raw_cov_df[raw_cov_df['location'] == 'Netherlands']
Now that the dataset is imported, it's time to discover its contents. We'll take a look at the columns and values that are inside.
cov_df_nl.info()
cov_df_nl.describe()
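To speed up our process, we'll get rid of the columns we don't need. The columns we're interested in are the date, the total and new cases and deaths (both absolute and per million), and the number of hospitalized patients (absolute and per million).
We'll create our new dataframe by making a subset: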
# .copy() gives us an independent dataframe, so adding columns later doesn't trigger a SettingWithCopyWarning
cov_df = cov_df_nl[['date', 'total_cases', 'new_cases', 'total_deaths',
                    'new_deaths', 'total_cases_per_million', 'new_cases_per_million',
                    'total_deaths_per_million', 'new_deaths_per_million',
                    'hosp_patients', 'hosp_patients_per_million']].copy()
cov_df.columns
We want to see the number of new COVID cases per month, as well as the new deaths per month and the average number of hospitalized patients per day for every month.
In order to achieve this, we must first group the dataset into months. We'll add a column containing the month to which the day in the 'date' column belongs, so that we can group the observations by month.
# Adding 'datetime' column that contains the YYYY-MM for each day
cov_df['datetime'] = pd.to_datetime(cov_df['date']).dt.strftime('%Y-%m')
# New dataset with monthly data contains: sum of new_cases, sum of new_deaths, average of hosp_patients per day
cov_df_month = cov_df.groupby('datetime').agg({'new_cases': 'sum',
                                               'new_deaths': 'sum',
                                               'hosp_patients': 'mean'}).reset_index()
cov_df_month
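To make the monthly grouping concrete, here is a minimal sketch of the same idea on a tiny, made-up dataframe. The numbers and the `daily` and `monthly` names are assumptions for illustration only and are not taken from the Our World in Data file.

```python
import pandas as pd

# Hypothetical daily figures, made up for illustration only
daily = pd.DataFrame({
    'date': ['2020-03-30', '2020-03-31', '2020-04-01', '2020-04-02'],
    'new_cases': [10, 20, 30, 40],
    'hosp_patients': [100.0, 200.0, 300.0, 500.0],
})

# Derive a 'YYYY-MM' key from the daily date
daily['datetime'] = pd.to_datetime(daily['date']).dt.strftime('%Y-%m')

# Sum the counts and average the hospital occupancy within each month
monthly = (daily.groupby('datetime')
                .agg({'new_cases': 'sum', 'hosp_patients': 'mean'})
                .reset_index())
print(monthly)
```

Counts such as new cases are summed because they accumulate over the month, while hospital occupancy is averaged because it is a daily snapshot.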
sns.barplot(x='datetime', y='new_cases', data=cov_df_month, color='b', ci=False)
plt.xticks(rotation=90)
plt.title('New COVID cases per month in the Netherlands')
plt.xlabel('Date')
plt.ylabel('New COVID cases')
plt.show()
For information on the number of job offers in the Netherlands, we'll use the 'Vacature-indicator' dataset from the CBS Dataportal, retrieved on 1 March 2021. CBS stands for Centraal Bureau voor de Statistiek ('Central Statistics Office'), a Dutch governmental institution that gathers all kinds of statistical information about the Netherlands, mostly on social and economic topics. We've selected the years 2010-2021 and all four of the job industries provided in the dataset.
Now it's time to import the vacancy dataset from CBS. First, we'll import the whole dataset and inspect it.
As we can see, the dataset contains information on several job industries. It shows the job-offer indicator per industry per month, starting in January 2010.
raw_vac_df = pd.read_csv('84287NED_2021-03-02T12_52_35.csv', sep=";")
raw_vac_df
raw_vac_df.info()
We don't need all the columns provided in the dataset. Some columns contain the exact same information as others, just in a different shape. For example, the 'Perioden_title' column contains the year and month specified in 'Perioden_code', just written in words.
We'll drop the columns we don't need:
vac_df = raw_vac_df.drop(columns=['BedrijfstakkenBranchesSBI2008_code', 'Perioden_title', 'Unit', 'Measure'])
vac_df.head()
We'll also have to reformat the dates in this dataset so that they have the same format as the dates in the COVID-19 dataframe.
vac_df['datetime'] = vac_df['Perioden_code'].replace('MM', '-', regex=True)
vac_df = vac_df.drop(columns=['Perioden_code'])
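As a quick sanity check on the date conversion, the sketch below applies the same replacement to a few made-up period codes. It assumes the codes follow a year, 'MM', two-digit month pattern (such as '2020MM01'), which is what the replacement above implies; the `codes`, `months` and `parsed` names are hypothetical.

```python
import pandas as pd

# Hypothetical period codes in the assumed year + 'MM' + month style
codes = pd.Series(['2019MM12', '2020MM01', '2020MM02'])

# Literal (non-regex) replacement of the 'MM' separator
months = codes.str.replace('MM', '-', regex=False)

# Parsing with an explicit format raises an error if any value is malformed
parsed = pd.to_datetime(months, format='%Y-%m')
print(months.tolist())
```

Parsing the result with an explicit format is an inexpensive way to confirm that every value really matches the 'YYYY-MM' keys used in the COVID dataframe before merging.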
Now it's time to put all our cleaned up data together. We'll merge the dataset that contains the monthly COVID statistics with the dataset that contains the job-offer indicators. We'll use a right join, as the dates for the job-offer indicators reach much further back than the COVID dataset, but we still want to keep these rows.
cov_vac = cov_df_month.merge(vac_df, on='datetime', how='right')
# Renaming the columns so they're easier to use
cov_vac = cov_vac.rename(columns={'BedrijfstakkenBranchesSBI2008_title': 'job_industry',
                                  'Value': 'joboffer_indicator'})
cov_vac.head()
The NaN values in the columns originating from cov_df_month are due to the fact that there were no COVID cases registered before February 2020. We'll fill these values with zero.
cov_vac = cov_vac.fillna(0)
# Checking if there's any null-values
cov_vac.info()
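The effect of the right join and the zero-filling can be illustrated on a small example. This is a sketch on made-up numbers: the `covid`, `jobs`, `merged`, `missing` and `filled` names are hypothetical, and `indicator=True` is an optional pandas merge argument that was not used in the analysis above.

```python
import pandas as pd

# Hypothetical monthly COVID totals (left) and job-offer indicators (right)
covid = pd.DataFrame({'datetime': ['2020-03', '2020-04'],
                      'new_cases': [12000, 27000]})
jobs = pd.DataFrame({'datetime': ['2020-02', '2020-03', '2020-04'],
                     'joboffer_indicator': [0.10, -0.25, -0.40]})

# how='right' keeps every job-offer row; indicator=True records where each row came from
merged = covid.merge(jobs, on='datetime', how='right', indicator=True)

# Rows marked 'right_only' had no COVID data and hold NaN in new_cases
missing = merged[merged['_merge'] == 'right_only']
print(missing['datetime'].tolist())

# Drop the helper column, then replace the remaining NaN values with zero
filled = merged.drop(columns='_merge').fillna(0)
print(filled)
```

Inspecting the 'right_only' rows before filling them is a way to confirm that the NaN values only occur in months that precede the COVID data.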
# Let's see what the df looks like before we start analyzing it
def show_df_rows(Rows):
    return cov_vac.iloc[Rows[0]:Rows[1]]

widgets.interact(show_df_rows, Rows=widgets.IntRangeSlider(value = [0,100],
                                                           min = 0,
                                                           max = len(cov_vac)));
Now that we've merged the two dataframes containing the columns we're interested in, it's time to see if there's any correlation between them. Specifically, we want to see if any of the selected COVID-columns have a particularly high correlation with the job-offer indicator.
In order to get a more accurate result, we'll filter the dataset to only look at dates starting in March 2020: this is when COVID really became present in the Netherlands. Let's see if the monthly COVID statistics correlate with the amount of job offers.
cov_vac_2020 = cov_vac[cov_vac['datetime'].astype('datetime64[ns]') >= pd.to_datetime('2020-03-01')]
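# Correlate only the numeric columns (recent pandas versions raise an error on text columns)
cov_vac_2020.select_dtypes('number').corr()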
It seems that the strongest correlation is between the number of new COVID cases in a month and the job-offer indicator: 0.67. We'll explore this further by visualizing the relationship between these two values.
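For readers who want to see what such a coefficient measures, here is a minimal sketch on made-up values. It relies on the fact that pandas computes the Pearson correlation by default, and compares that result with the textbook formula; the data and the `r_pandas` and `r_manual` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly values, made up for illustration only
df = pd.DataFrame({'new_cases': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'joboffer_indicator': [2.0, 1.0, 4.0, 3.0, 5.0]})

# pandas: Pearson is the default method of Series.corr and DataFrame.corr
r_pandas = df['new_cases'].corr(df['joboffer_indicator'])

# Manual: sum of products of deviations, divided by the root of the product of squared deviations
x = df['new_cases'] - df['new_cases'].mean()
y = df['joboffer_indicator'] - df['joboffer_indicator'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(round(r_pandas, 3), round(r_manual, 3))
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.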
First, let's see the progression of the job-offer indicator over time for each of the 4 job industries.
jobplot = sns.lineplot(x='datetime',
                       y='joboffer_indicator',
                       hue='job_industry',
                       data=cov_vac)
# Adding title and labels
plt.xlabel("Date", size=12)
plt.ylabel("Job-offer indicator", size=12)
plt.title("Job-offer indicator per job industry (2010-2021)", size=20)
plt.xticks(rotation=90)
for ind, label in enumerate(jobplot.get_xticklabels()):
    if ind % 10 == 0:  # every 10th label is kept
        label.set_visible(True)
    else:
        label.set_visible(False)
# Making the graph bigger
fig = plt.gcf()
fig.set_size_inches(15, 8)
plt.show()
Interesting. In the graph, we find that job offers in all industries dropped sharply, starting around December 2019. We know that in the Netherlands, the first news of COVID appeared around November 2019.
Now, let's take a closer look at the exact numbers of COVID cases in the Netherlands. We'll visualize the progression of new corona cases per month, starting in November 2019.
# Turn the dates (currently strings) into datetime objects, so we can filter them
cov_vac['datetime'] = pd.to_datetime(cov_vac['datetime'], format='%Y-%m')
cov_vac.head()
# Checkboxes for zooming in/out
checkbox1 = widgets.Checkbox(value = False,
                             description = 'Zoom in',
                             disabled = False)
checkbox2 = widgets.Checkbox(value = False,
                             description = 'Zoom out',
                             disabled = False)

def change_data(val):
    # Zoomed in: only keep the rows after 1 November 2019
    if val:
        data = cov_vac[cov_vac['datetime'] > pd.to_datetime('2019-11-01')]
    else:
        data = cov_vac
    covplot = sns.lineplot(x='datetime',
                           y='new_cases',
                           data=data)
    # Adding title and labels
    plt.xlabel("Date", size=12)
    plt.ylabel("New corona cases", size=12)
    plt.title("New corona cases per month", size=20)
    plt.xticks(rotation=90)
    # Making the graph bigger
    fig = plt.gcf()
    fig.set_size_inches(15, 8)
    plt.show()

# Bind checkboxes to change_data function and disable one when the other is checked
out = widgets.interactive_output(change_data, {'val':checkbox1})
widgets.jslink((checkbox2, 'disabled'), (checkbox1, 'value'));
widgets.jslink((checkbox1, 'disabled'), (checkbox2, 'value'));
widgets.VBox([widgets.HBox([checkbox1, checkbox2]), out])
Now let's see if there's overlap in the graphs. We'll put both of them in the same figure, using the overlapping dates.
Specifically, we want to see if the amount of COVID cases impacts certain job fields more than others. We'll plot the job-offer indicators of each job field separately, together with the new COVID cases of the month.
def draw_plot(selection_y, selection_hue):
    # Setting the selected variables
    y = cov_vac[cov_vac["job_industry"] == selection_y]['joboffer_indicator']
    hue = selection_hue
    # Plotting the joboffer indicator per job field
    ax = sns.lineplot(x = 'datetime',
                      y = y,        # <--- Selected y
                      hue = hue,    # <--- Selected hue
                      data = cov_vac)
    if selection_hue:  # <--- Only show legend if there's multiple lines
        ax.legend(loc='upper left')
    ax.set_ylabel('Job-offer indicator')
    ax.set_xlabel('Year')
    # Plotting the new_cases line on a second y-axis
    ax2 = ax.twinx()
    cov_vac.plot(x = "datetime",
                 y = "new_cases",
                 ax = ax2,
                 color = "0")
    ax2.legend(loc = 'lower left')
    ax2.set_ylabel('New COVID cases per month')
    # Resizing the plot
    fig = plt.gcf()
    fig.set_size_inches(14, 6)
    # Add titles and show plot
    plt.title("COVID cases vs. job-offer indicator in the Netherlands, per month", size = 14)
    plt.show()

# Dropdown 1: if "All industries" is selected, disable the "Industry" dropdown
def toggle_dropdown2(val):
    dropdown2.disabled = val
    dropdown2.value = None

dropdown1 = widgets.Dropdown(
    options = [('All industries', True),
               ('Specific industry', False)],
    description = 'View: ')

# Dropdown 2: chooses a specific industry to plot (y)
def select_y(Industry):
    if not dropdown2.disabled and dropdown2.value is not None:  # Plots selected industry
        draw_plot(Industry, None)
    else:  # Plots all industries if disabled/empty
        draw_plot(cov_vac["job_industry"], 'job_industry')

dropdown2 = widgets.Dropdown(
    value = None,
    options = [('F Construction industry', 'F Bouwnijverheid'),
               ('C Manufacturing industry', 'C Industrie'),
               ('G-N Commercial services', 'G-N Commerciële dienstverlening'),
               ('A-U All economic activities', 'A-U Alle economische activiteiten')],
    description = 'Industry: ',
    disabled=True)

# Link the dropdowns to their corresponding functions
widgets.interact(toggle_dropdown2, val=dropdown1);
widgets.interact(select_y, Industry=dropdown2);
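The per-industry comparison plotted above could also be expressed as a number per industry. The sketch below shows one possible way to do that on made-up data shaped like cov_vac; the `toy` dataframe, its values and the `per_industry` name are assumptions for illustration, not results from the real datasets.

```python
import pandas as pd

# Hypothetical merged data shaped like cov_vac: one row per month per industry
toy = pd.DataFrame({
    'job_industry': ['F Bouwnijverheid'] * 4 + ['C Industrie'] * 4,
    'new_cases': [100, 200, 300, 400] * 2,
    'joboffer_indicator': [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0],
})

# Pearson correlation between cases and the indicator, computed per industry
per_industry = {
    industry: group['new_cases'].corr(group['joboffer_indicator'])
    for industry, group in toy.groupby('job_industry')
}
print(per_industry)
```

Applied to the real data, a table like this would make it easier to compare industries side by side than reading the differences off the plots.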
What conclusions can we draw from this research? A couple of things caught our attention:
That's it! In the future, we'd definitely like to dive deeper into the ways that COVID-19 affected the Dutch job market and find out what other factors played a role. An interesting project would be to predict our chances of finding a job in a certain industry, based on the predicted number of COVID-19 cases for the coming months.
We hope you enjoyed this article and would really appreciate your feedback. For now, stay safe and hopefully until next time!