By: Nathan Smith
Date: April 4, 2020
This project was completed as a component of the Part Time Data Science Course through BrainStation.

The purpose of this study is to assess the ability to use machine learning to estimate daily streamflow in British Columbia, Canada using streamflow and climate data in proximity to the stream of interest. This study looks at estimating the Stave River streamflow using data available within a 100 km radius. This should be considered an initial step in determining if an automated or semi-automated procedure could be applied to generate machine learning models for any streamflow location in British Columbia.

Note that the R scripts "GetData.R" and "Hydat_Search.R" (included in this repository) are used prior to this notebook to:

  • Download all hydrometric and climate data in the vicinity of the Stave River using the tidyhydat and weathercan packages.
  • Screen out stations that do not have a record length that overlaps with the Stave River record.
  • Compile the dataset into the file "Dataset.csv" for use in this notebook.

The above step was done in R because the tidyhydat and weathercan packages, which are maintained by the BC government, are built to access the Water Survey of Canada (WSC) and Environment and Climate Change Canada (ECCC) HYDAT databases.
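The station-screening step above can be illustrated in pandas as well. The sketch below is hypothetical (station IDs and column names are illustrative, not taken from the actual scripts): keep only stations whose period of record overlaps the target gauge's record.

```python
import pandas as pd

# Hypothetical station records: each row maps a station ID to its period of record.
records = pd.DataFrame({
    "STATION_NUMBER": ["08GA010", "08XX999", "08MF005"],
    "START": pd.to_datetime(["1960-01-01", "1950-01-01", "1980-01-01"]),
    "END": pd.to_datetime(["2017-12-31", "1975-12-31", "2016-12-31"]),
})

# Target period of record for the Stave River gauge (08MH147)
target_start = pd.Timestamp("1983-02-01")
target_end = pd.Timestamp("2016-12-31")

# Two intervals overlap when each starts before the other ends
overlaps = (records["START"] <= target_end) & (records["END"] >= target_start)
kept = records.loc[overlaps, "STATION_NUMBER"].tolist()
```

Here the middle station's record ends before the Stave River record begins, so it is screened out.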

Code for this workbook can either be hidden or shown using the button below.

In [61]:
from IPython.display import HTML
HTML('''<script>
code_shown = false;
function code_toggle() {
  if (code_shown){
    $('div.input').hide();
    $('#toggleButton').val('Show Code');
  } else {
    $('div.input').show();
    $('#toggleButton').val('Hide Code');
  }
  code_shown = !code_shown;
}
$( document ).ready(function(){
  $('div.input').hide();
});
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')

Load Required Packages and Dataset

The dataset consists of the following:

  • Date: Index corresponding to the calendar date
  • DayofYear: Day of year; included in the dataset because streamflow is driven by seasonal patterns, making day of year a potentially good indicator
  • 08MH147: Stave River streamflow ($m^3/s$) which is the outcome
  • Features starting with the prefix "08" correspond to WSC streamflow data ($m^3/s$)
  • Features with the suffix "_precip" correspond to precipitation from an ECCC station ($mm/day$)
  • Features with the suffixes "_mint" and "_maxt" correspond to minimum and maximum daily temperature from an ECCC station ($^\circ C$)
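The naming convention above can be used to group columns programmatically. A minimal sketch, with illustrative column names matching the convention described:

```python
# Classify columns by the naming convention: "08..." prefixes are WSC
# streamflow stations, "_precip"/"_mint"/"_maxt" suffixes are ECCC climate data.
columns = ["DayofYear", "08MH147", "08GA010", "766_mint", "766_maxt", "766_precip"]

outcome = "08MH147"  # Stave River streamflow
streamflow = [c for c in columns if c.startswith("08") and c != outcome]
precip = [c for c in columns if c.endswith("_precip")]
temperature = [c for c in columns if c.endswith("_mint") or c.endswith("_maxt")]
```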
In [2]:
# Packages
import numpy as np
import pandas as pd
import seaborn as sns
import folium
import plotly.graph_objects as go
from sklearn import linear_model, model_selection, metrics
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.feature_selection import RFECV
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('precision', 3)
In [3]:
# Import the dataset generated using the script "GetData.R"
file_path1 = "./Dataset.csv"
Data_1 = pd.read_csv(file_path1, index_col = 0, parse_dates = True)

Sample of the dataset:

In [4]:
Data_1.head()
DayofYear 08MH147 08GA010 08GA022 08GA030 08GA043 08GA047 08GA061 08GA072 08MF005 ... 348_precip 766_mint 766_maxt 766_precip 951_mint 951_maxt 951_precip 309_mint 309_maxt 309_precip
1983-02-01 32 NaN 10.70 83.6 6.87 12.50 1.050 0.113 NaN 885 ... 0.0 NaN NaN 0.0 -5.0 6.4 0.0 2.5 7.5 0.0
1983-02-02 33 NaN 9.59 70.0 5.49 11.60 0.862 0.094 NaN 891 ... 0.0 NaN NaN 0.0 -5.8 6.3 0.0 0.0 7.0 0.0
1983-02-03 34 6.59 8.45 65.0 4.53 11.00 0.727 0.085 NaN 886 ... 0.0 NaN NaN 0.0 -5.5 5.5 0.0 -0.5 5.0 0.0
1983-02-04 35 6.40 7.84 62.0 3.65 10.30 0.623 0.074 NaN 883 ... 0.0 NaN NaN 0.0 -10.0 3.7 0.0 -0.5 6.0 0.0
1983-02-05 36 5.93 7.25 61.0 3.19 9.79 0.589 0.069 NaN 847 ... 0.0 NaN NaN 0.8 -10.5 4.0 0.0 0.0 8.0 4.0

5 rows × 98 columns

In [5]:
# Look at the data types
Data_1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 12234 entries, 1983-02-01 to 2016-12-31
Columns: 98 entries, DayofYear to 309_precip
dtypes: float64(96), int64(2)
memory usage: 9.2 MB

Quick check of null values

In [6]:
Data_1.isnull().sum()
DayofYear       0
08MH147       170
08GA010         0
08GA022       366
08GA030       588
951_maxt      498
951_precip    472
309_mint      224
309_maxt      227
309_precip    215
Length: 98, dtype: int64

The dataset indicates that the data start 1983-02-01 and end 2016-12-31. There are 98 columns (including the outcome column) and all datatypes are numeric. There are also many null values in the dataset.

The following plot shows the locations of the stations used in the study. Note that the 100 km radius does not include any stations in the USA.

In [7]:
file_path2 = "./Station_Coords.csv"
Data_Location = pd.read_csv(file_path2)
Data_Location['TEXT'] = "<p style=\"color:red\">" + Data_Location['STATION_NUMBER'] + "</p>" + Data_Location['TYPE']

# Base map centred on the Stave River gauge
m = folium.Map(location=[49.55619, -122.32307], tiles='Stamen Terrain', zoom_start=8)

# Marker for the Stave River gauge
folium.Marker([49.55619, -122.32307], popup='Stave River', icon=folium.Icon(color='green', icon='star')).add_to(m)

# Markers for all surrounding stations
for i in range(0, len(Data_Location)):
    folium.Marker([Data_Location.iloc[i]['LATITUDE'], Data_Location.iloc[i]['LONGITUDE']],
                  popup=Data_Location.iloc[i]['TEXT']).add_to(m)
m

Data Cleaning

Step 1 - Create an Index of Missing 08MH147 Data

Create an index and mask of NA values instead of dropping the rows, since missing values in other features will be imputed using the mean of all available data within a given feature. Feature engineering will also use rolling means, which requires the dataset to be as complete as possible.

In [8]:
Index_NA_08MH147 = Data_1.index[Data_1['08MH147'].isnull()].tolist()
NA_Mask = Data_1.index.isin(Index_NA_08MH147)
In [9]:
# Check how many rows this corresponds to:
print("Loaded Dataset Number of Rows:", Data_1.shape[0])
print("Number of Missing 08MH147 values:", len(Index_NA_08MH147))
print("Percentage 08MH147 Missing:", round((len(Index_NA_08MH147) / Data_1.shape[0]*100),2), "%")
Loaded Dataset Number of Rows: 12234
Number of Missing 08MH147 values: 170
Percentage 08MH147 Missing: 1.39 %
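The point of keeping a mask rather than dropping rows can be shown on a toy frame (names and values here are illustrative, not from the actual dataset): rows with a missing outcome stay in the frame while features are imputed, and can be excluded later when fitting a model.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Data_1: the outcome column has two missing values
df = pd.DataFrame({"08MH147": [6.5, np.nan, 5.9, np.nan],
                   "08GA010": [10.7, 9.6, 8.5, 7.8]})

# Same pattern as the notebook: record where the outcome is missing...
na_mask = df.index.isin(df.index[df["08MH147"].isnull()])

# ...so those rows can be excluded from model fitting after the other
# features have been imputed, rather than dropped up front
fit_rows = df[~na_mask]
```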

Step 2 - Remove Features with High Percentages of Missing Data

In [10]:
# Create a dataframe to check the completeness of each feature:
NA_check = (Data_1.isnull().sum() / Data_1.shape[0]) * 100
In [11]:
# Create a Histogram Plot of Percent Missing
sns.distplot(NA_check, kde = False, bins = 100);
plt.xlabel('Percent of Missing Values', fontsize=13)
plt.ylabel('Count', fontsize=13)
plt.title('Feature Histogram of Percentage of Missing Values', fontsize=15)

Initially screen out features with greater than 10% missing data, as a large number of features with <10% missing data remain.

In [12]:
NA_Percent_Threshold = 10
Remove_NA = NA_check < NA_Percent_Threshold
Data_2 = Data_1.loc[:, Remove_NA]
print("Previous Number of columns:", Data_1.shape[1]) 
print("New Number of columns:", Data_2.shape[1])
print("Note that these include the outcome column")
Previous Number of columns: 98
New Number of columns: 63
Note that these include the outcome column
In [13]:
# If we want to drop columns by percent NA we can use the following (currently not used)
# Data_1.loc[:, [x for x in Data_1.columns if (Data_1.loc[:, x].isnull().sum() / len(Data_1.loc[:, x])*100 > 50)]]

Step 3: Infill NaN Values

Prior to dropping any rows or columns of data, infill missing values using two methods as follows:

  • Use the mean of the day of year
  • Use the mean of the month

These methods are used as streamflow is dependent on the time of year. Create two new datasets, one for each method.

In [14]:
# Generate copies of the datasets
Data_3a = Data_2.copy()
Data_3b = Data_2.copy()
In [15]:
# Impute values using the mean of the day of year
Data_3a.fillna(Data_3a.groupby('DayofYear').transform('mean'), inplace = True)
In [16]:
# Impute values using the mean of the month
Data_3b.fillna(Data_3b.groupby(Data_3b.index.month).transform('mean'), inplace = True)

Check for any remaining NaN values in these datasets:

In [17]:
Data_3a_na = Data_3a.isnull().sum().sum()
Data_3b_na = Data_3b.isnull().sum().sum()
print(Data_3a_na, Data_3b_na)
0 0

No null values left.

Step 4: Add New Features Corresponding with Previous Values

Streamflow can be strongly dependent on forcings prior to the given date due to lag effects. Therefore, create the following additional features for each current temperature, precipitation, or streamflow feature in the dataset:

  • Previous 1-day value
  • Previous 7-day mean
In [18]:
# Create the shifted by 1 day datasets
Data_3a_prev1 = Data_3a.copy().drop(columns=['DayofYear', '08MH147']).shift(periods=1)
Data_3a_prev1.columns = [str(col) + '_prev1' for col in Data_3a_prev1.columns]
Data_3b_prev1 = Data_3b.copy().drop(columns=['DayofYear', '08MH147']).shift(periods=1)
Data_3b_prev1.columns = [str(col) + '_prev1' for col in Data_3b_prev1.columns]
In [19]:
# Create the previous 7-day mean datasets
Data_3a_prev7 = Data_3a.copy().drop(columns=['DayofYear', '08MH147']).shift(periods=1).rolling(window=7, min_periods = 4).mean()
Data_3a_prev7.columns = [str(col) + '_prev7' for col in Data_3a_prev7.columns]
Data_3b_prev7 = Data_3b.copy().drop(columns=['DayofYear', '08MH147']).shift(periods=1).rolling(window=7, min_periods = 4).mean()
Data_3b_prev7.columns = [str(col) + '_prev7' for col in Data_3b_prev7.columns]
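The lagged frames created above share the original DatetimeIndex, so they can later be joined column-wise with the current-day features to build a single modelling frame. A minimal sketch with a toy one-column frame (names illustrative):

```python
import pandas as pd

# Toy stand-in for a current-day feature frame
idx = pd.date_range("1983-02-01", periods=3)
current = pd.DataFrame({"08GA010": [10.7, 9.6, 8.5]}, index=idx)

# Lag by one day and tag the column, mirroring the "_prev1" convention
prev1 = current.shift(1).add_suffix("_prev1")

# Column-wise concat aligns on the shared DatetimeIndex, producing one
# frame with both current and lagged features
combined = pd.concat([current, prev1], axis=1)
```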

Data Visualization

The following section explores visualizations of the dataset prior to modelling, using "Data_2", the dataframe created by dropping features missing >10% of values but prior to infilling missing values.

Correlation Plot

A correlation plot is provided below. Note that this is an interactive plot, which is useful for exploration given the large number of features.

In [20]:
Data_2_corr = Data_2.corr()
In [21]:
# Interactive correlation plot
fig = go.Figure(data=go.Heatmap(z = Data_2_corr, x=Data_2_corr.columns, y=Data_2_corr.index, colorscale = 'RdBu', zmid=0))
fig.update_layout(height=800, xaxis_nticks=63, yaxis_nticks=63, autosize=True,
                  title={'text': "Correlation Plot",
                         'xanchor': 'center',
                         'yanchor': 'top'})
fig.show("notebook")