By: Nathan Smith
Date: April 4, 2020
This project was completed as a component of the Part Time Data Science Course through BrainStation.
The purpose of this study is to assess whether machine learning can be used to estimate daily streamflow in British Columbia, Canada, using streamflow and climate data in proximity to the stream of interest. This study looks at estimating Stave River streamflow using data available within a 100 km radius. It should be considered an initial step in determining whether an automated or semi-automated procedure could be applied to generate machine learning models for any streamflow location in British Columbia.
Note that the R scripts "GetData.R" and "Hydat_Search.R" (included in this repository) are used prior to this notebook to identify the stations within the search radius and to retrieve the streamflow and climate data.
This step was done in R because the tidyhydat and weathercan packages, which are maintained by the BC government, have been built to access the Water Survey of Canada (WSC) HYDAT database and Environment and Climate Change Canada (ECCC) climate data.
Code for this workbook can either be hidden or shown using the button below.
from IPython.display import HTML
HTML('''<script>
function code_toggle() {
    if (code_shown){
        $('div.input').hide('500');
        $('#toggleButton').val('Show Code')
    } else {
        $('div.input').show('500');
        $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
}
$( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
});
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>''')
The dataset consists of daily streamflow values for the WSC stations and daily climate (temperature and precipitation) values for the ECCC stations within the 100 km radius, along with a day-of-year column.
# Packages
import numpy as np
import pandas as pd
import seaborn as sns
import folium
import plotly.graph_objects as go
from sklearn import linear_model, model_selection, metrics
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.feature_selection import RFECV
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.precision', 3)
# Import the dataset generated using the script "GetData.R"
file_path1 = "./Dataset.csv"
Data_1 = pd.read_csv(file_path1, index_col = 0, parse_dates = True)
Sample of the dataset:
Data_1.head()
# Look at the data types
Data_1.info(verbose=False)
Quick check of null values
Data_1.isnull().sum()
The data start on 1983-02-01 and end on 2016-12-31. There are 98 columns (including the outcome column) and all datatypes are numeric. There are also many null values in the dataset.
The following plot shows the locations of the stations used in the study. Note that the 100 km radius does not include any stations in the USA.
file_path2 = "./Station_Coords.csv"
Data_Location = pd.read_csv(file_path2)
Data_Location['TEXT'] = "<p style=\"color:red\">" + Data_Location['STATION_NUMBER'] + "</p>" + Data_Location['TYPE']
m = folium.Map(location=[49.55619, -122.32307], tiles='Stamen Terrain', zoom_start = 8)
folium.Circle(
radius=100000,
location=[49.55619, -122.32307],
color='crimson',
fill=False,
).add_to(m)
folium.Marker([49.55619, -122.32307], popup='Stave River', icon=folium.Icon(color='green', icon='star')).add_to(m)
for i in range(0, len(Data_Location)):
    folium.Marker([Data_Location.iloc[i]['LATITUDE'], Data_Location.iloc[i]['LONGITUDE']],
                  popup=Data_Location.iloc[i]['TEXT']).add_to(m)
m
Create an index and mask of the NA values in the outcome (08MH147) instead of dropping the rows, as missing values in the other features will be imputed using the means of all available data within a given feature. Feature engineering will also use rolling means, which requires the dataset to be as complete as possible.
Index_NA_08MH147 = Data_1.index[Data_1['08MH147'].isnull()].tolist()
NA_Mask = Data_1.index.isin(Index_NA_08MH147)
# Check how many rows this corresponds to:
print("Loaded Dataset Number of Rows:", Data_1.shape[0])
print("Number of Missing 08MH147 values:", len(Index_NA_08MH147))
print("Percentage 08MH147 Missing:", round((len(Index_NA_08MH147) / Data_1.shape[0]*100),2), "%")
# Create a dataframe to check the completeness of each feature:
NA_check = (Data_1.isnull().sum() / Data_1.shape[0]) * 100
# Create a Histogram Plot of Percent Missing
plt.figure(figsize=(10,6))
sns.histplot(NA_check, bins=100);
plt.xlabel('Percent of Missing Values', fontsize=13)
plt.ylabel('Count', fontsize=13)
plt.yticks(fontsize=9)
plt.xticks(fontsize=9)
plt.title('Feature Histogram of Percentage of Missing Values', fontsize=15)
plt.show()
Initially screen out the features with greater than 10% missing data, as a large number of features remain with less than 10% missing data.
NA_Percent_Threshold = 10
Remove_NA = NA_check < NA_Percent_Threshold
Data_2 = Data_1.loc[:, Remove_NA]
print("Previous Number of columns:", Data_1.shape[1])
print("New Number of columns:", Data_2.shape[1])
print("Note that these include the outcome column")
# To drop columns by percent NA directly, the following could be used (currently not used):
# Data_1.loc[:, [x for x in Data_1.columns if (Data_1.loc[:, x].isnull().sum() / len(Data_1.loc[:, x])*100 > 50)]]
Prior to dropping any rows or columns of data, infill missing values using two methods:
1. the mean of the same day of year across all years, and
2. the mean of the same calendar month across all years.
These methods are used because streamflow is dependent on the time of year. Create two new datasets, one for each method.
# Generate copies of the datasets
Data_3a = Data_2.copy()
Data_3b = Data_2.copy()
# Impute values using the mean of the day of year
Data_3a.fillna(Data_3a.groupby('DayofYear').transform('mean'), inplace = True)
# Impute values using the mean of the month
Data_3b.fillna(Data_3b.groupby(Data_3b.index.month).transform('mean'), inplace = True)
Check for any remaining NaN values in these datasets:
Data_3a_na = Data_3a.isnull().sum().sum()
Data_3b_na = Data_3b.isnull().sum().sum()
print(Data_3a_na, Data_3b_na)
No null values left.
Streamflow can be very dependent on forcings prior to the given date due to lag effects. Therefore, create the following additional features for each temperature, precipitation, or streamflow feature in the dataset:
1. the value from the previous day (prev1), and
2. the mean of the previous 7 days (prev7).
# Create the shifted by 1 day datasets
Data_3a_prev1 = Data_3a.copy().drop(columns=['DayofYear', '08MH147']).shift(periods=1)
Data_3a_prev1.columns = [str(col) + '_prev1' for col in Data_3a_prev1.columns]
Data_3b_prev1 = Data_3b.copy().drop(columns=['DayofYear', '08MH147']).shift(periods=1)
Data_3b_prev1.columns = [str(col) + '_prev1' for col in Data_3b_prev1.columns]
# Create the previous 7-day mean datasets
Data_3a_prev7 = Data_3a.copy().drop(columns=['DayofYear', '08MH147']).shift(periods=1).rolling(window=7, min_periods = 4).mean()
Data_3a_prev7.columns = [str(col) + '_prev7' for col in Data_3a_prev7.columns]
Data_3b_prev7 = Data_3b.copy().drop(columns=['DayofYear', '08MH147']).shift(periods=1).rolling(window=7, min_periods = 4).mean()
Data_3b_prev7.columns = [str(col) + '_prev7' for col in Data_3b_prev7.columns]
The following section explores visualizations of the dataset prior to modelling, using "Data_2", the dataframe created by dropping the features missing more than 10% of values but prior to infilling missing values.
A correlation plot is provided below. Note that this is an interactive plot, which is useful for exploration given the large number of features.
Data_2_corr = Data_2.corr()
# Interactive correlation plot
fig = go.Figure(data=go.Heatmap(z = Data_2_corr, x=Data_2_corr.columns, y=Data_2_corr.index, colorscale = 'RdBu', zmid=0))
fig.update_layout(height = 800, xaxis_nticks=63, yaxis_nticks=63, autosize = True, title={
'font':{'size':20},
'text': "Correlation Plot",
'y':0.92,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show(renderer="notebook")
The correlation plot indicates patterns between the features. This is not unexpected, as the stations all lie within 100 km of one another and therefore experience similar regional weather, and the streamflow stations respond to the same precipitation and temperature drivers.
While this indicates that feature reduction is possible, features are currently not combined or reduced, for the sake of transparency about which specific features are most important to the model.
The correlation plot also shows that the outcome 08MH147 is more correlated with the streamflow features than with the precipitation or temperature features. Therefore, streamflow features alone could likely be used for this study. However, other streamflow locations in British Columbia may have a less dense streamflow station network, and analysis using precipitation and temperature may be required there.
Also look at a pairplot of the 20 features most correlated with 08MH147.
Rank_corr = abs(Data_2_corr["08MH147"]).sort_values(ascending = False)
Rank_corr_20 = np.array(Rank_corr[0:20].index)
Data_2_Rank_20 = Data_2[Rank_corr_20]
Rank_corr_20
Most of the features are streamflow (denoted with the prefix 08) but the last three features are minimum temperature features.
A pairplot is provided below showing the relationships between these top 20 features. While the plot is busy and it is difficult to see any individual figure in detail, it gives a general view of the pairwise relationships among the top features.
sns.pairplot(Data_2_Rank_20);
A timeseries visualization of these top 20 correlated variables is provided below. Prior to plotting, each feature is normalized to its minimum and maximum values for comparison.
normalized_Data_2_Rank_20 = (Data_2_Rank_20-Data_2_Rank_20.min())/(Data_2_Rank_20.max()-Data_2_Rank_20.min())
# Initialize the figure
plt.style.use('seaborn-darkgrid')
plt.subplots(figsize=(20, 12))
# create a color palette
cm = plt.get_cmap('viridis')
# multiple line plots in grid format
num = 0
for column in normalized_Data_2_Rank_20:
    num += 1
    # Find the right spot on the plot
    plt.subplot(4, 5, num)
    # plot all data on each subplot but in grey
    for v in normalized_Data_2_Rank_20:
        plt.plot(normalized_Data_2_Rank_20.index, normalized_Data_2_Rank_20[v], marker='', color='grey', linewidth=0.4, alpha=0.3)
    # Plot the single lineplot
    plt.plot(normalized_Data_2_Rank_20.index, normalized_Data_2_Rank_20[column], marker='', color=cm(num*10), linewidth=0.6, alpha=0.9)
    # Add title
    plt.title(column, loc='left', fontsize=12, fontweight=0, color=cm(num))
    plt.xticks(fontsize=5)
    plt.yticks(fontsize=8)
# general title
plt.suptitle("Top 20 Correlated Variables with 08MH147", fontsize=20, fontweight=0, color='black', style='italic', y=0.95);
# Axis title (currently not used)
#plt.text(0.5, 0.02, 'Time', ha='center', va='center')
#plt.text(0.06, 0.5, 'Note', ha='center', va='center', rotation='vertical')
The figure allows each of the top correlated features to be compared visually against the others (shown in grey) over time.
Fit models to the datasets without the addition of the previous 1-day and 7-day feature dataframes that were created; this corresponds to only using Data_3a and Data_3b. Data_3a is the dataset infilled using day-of-year means, while Data_3b is infilled using monthly means.
First create the model input datasets by removing the rows with missing values of the result: 08MH147.
Model_Data_3a = Data_3a[~NA_Mask]
Model_Data_3b = Data_3b[~NA_Mask]
print("Dataframes rows and columns: ", Model_Data_3a.shape, Model_Data_3b.shape)
The datasets include 12064 rows of data and 63 columns, of which one is the outcome 08MH147.
Check for any missing values in these datasets.
Model_Data_3a_na = Model_Data_3a.isnull().sum().sum()
Model_Data_3b_na = Model_Data_3b.isnull().sum().sum()
print(Model_Data_3a_na, Model_Data_3b_na)
No missing values, so partition the dataset into the outcome and features, split into training and testing sets based on a 75% / 25% split.
np.random.seed(999)
Xa_tr, Xa_te, Ya_tr, Ya_te = model_selection.train_test_split(Model_Data_3a.drop(columns = '08MH147'), \
Model_Data_3a['08MH147'])
Xb_tr, Xb_te, Yb_tr, Yb_te = model_selection.train_test_split(Model_Data_3b.drop(columns = '08MH147'), \
Model_Data_3b['08MH147'])
Now fit a linear regression model, a random forest model, and a gradient boosting model to the datasets.
The following metrics will be used to assess model fit: root mean squared error (RMSE), mean absolute error (MAE), and R2.
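For reference, these are the standard definitions (with $y_i$ the observed values, $\hat{y}_i$ the modelled values, $\bar{y}$ the observed mean, and $n$ the number of samples), matching the scikit-learn implementations used below:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$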
A function is created to apply these metrics.
# Create a function to apply the three fit metrics to a set of observed and modelled values
def performance(name, observed, modelled):
    output_RMSE = (metrics.mean_squared_error(observed, modelled))**(0.5)
    output_MAE = metrics.mean_absolute_error(observed, modelled)
    output_r2 = metrics.r2_score(observed, modelled)
    combine = [output_RMSE, output_MAE, output_r2]
    df = pd.DataFrame({name: combine}, index=['RMSE', 'MAE', 'R2'])
    return df
A model function is created to apply either a linear, random forest, or gradient boosting model.
def modelling(train_x, train_y, test_x, test_y, model_type, name):
    if model_type == "lm":
        model = linear_model.LinearRegression()
    elif model_type == "rf":
        model = RandomForestRegressor()
    elif model_type == "gb":
        model = GradientBoostingRegressor()
    model.fit(train_x, train_y)
    train_score = performance((name + " train"), train_y, model.predict(train_x))
    test_score = performance((name + " test"), test_y, model.predict(test_x))
    return train_score, test_score, model
Fit a linear model and then conduct recursive feature elimination with cross-validation (RFECV) to automatically select the best number of features.
lm_3a_tr_score, lm_3a_te_score, lm_3a = modelling(Xa_tr, Ya_tr, Xa_te, Ya_te, "lm", '3a')
lm_3b_tr_score, lm_3b_te_score, lm_3b = modelling(Xb_tr, Yb_tr, Xb_te, Yb_te, "lm", '3b')
lm_3_fit = pd.concat([lm_3a_tr_score, lm_3a_te_score, lm_3b_tr_score, lm_3b_te_score], axis= 1)
lm_3_fit
The scores of these initial models are slightly better for the 3b infilling method dataset, and the test scores are not far below the training scores.
Now look at feature reduction using RFECV.
lm_3a_temp = linear_model.LinearRegression()
lm_3b_temp = linear_model.LinearRegression()
lm_3a_rfecv = RFECV(estimator=lm_3a_temp)
lm_3b_rfecv = RFECV(estimator=lm_3b_temp)
lm_3a_rfecv.fit(Xa_tr, Ya_tr)
lm_3b_rfecv.fit(Xb_tr, Yb_tr)
print("Optimal number of features lm_3a_rfecv: %d (unreduced = 62)" % lm_3a_rfecv.n_features_)
print("Optimal number of features lm_3b_rfecv: %d (unreduced = 62)" % lm_3b_rfecv.n_features_)
Create updated input features (both for the training and test set) selected on the reduction above and rerun the linear models and assess performance.
Xa_tr_lmred = Xa_tr.loc[:,lm_3a_rfecv.support_]
Xa_te_lmred = Xa_te.loc[:,lm_3a_rfecv.support_]
Xb_tr_lmred = Xb_tr.loc[:,lm_3b_rfecv.support_]
Xb_te_lmred = Xb_te.loc[:,lm_3b_rfecv.support_]
lm_3a_tr_red_score, lm_3a_te_red_score, lm_3a_red = modelling(Xa_tr_lmred, Ya_tr, Xa_te_lmred, Ya_te, "lm", '3a_red')
lm_3b_tr_red_score, lm_3b_te_red_score, lm_3b_red = modelling(Xb_tr_lmred, Yb_tr, Xb_te_lmred, Yb_te, "lm", '3b_red')
lm_3_red_fit = pd.concat([lm_3a_tr_red_score, lm_3a_te_red_score, lm_3b_tr_red_score, lm_3b_te_red_score], axis= 1)
pd.concat([lm_3_fit, lm_3_red_fit], axis = 1)
There is only a minor performance loss with the reduced feature set; therefore, the reduced set is recommended.
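As a quick check (illustrative, not part of the original workflow), the features RFECV retained can be listed from each fitted selector's boolean support mask:
# List the features kept by RFECV for each infilling method
print("3a kept:", Xa_tr.columns[lm_3a_rfecv.support_].tolist())
print("3b kept:", Xb_tr.columns[lm_3b_rfecv.support_].tolist())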
Fit a random forest model using the default settings for simplicity, as this exercise is intended to provide a high-level identification of considerations for more in-depth analysis.
Performing RFECV on the random forest model would be computationally expensive due to the large number of features, so instead use the feature_importances_ attribute for feature reduction.
rf_3a_tr_score, rf_3a_te_score, rf_3a = modelling(Xa_tr, Ya_tr, Xa_te, Ya_te, "rf", '3a')
rf_3b_tr_score, rf_3b_te_score, rf_3b = modelling(Xb_tr, Yb_tr, Xb_te, Yb_te, "rf", '3b')
rf_3_fit = pd.concat([rf_3a_tr_score, rf_3a_te_score, rf_3b_tr_score, rf_3b_te_score], axis= 1)
rf_3_fit
The random forest model provides a much better training fit than the linear model, but the fit drops more sharply on the test set, to a level similar to the linear models.
Feature reduction
rf_3a_imp = rf_3a.feature_importances_
rf_3b_imp = rf_3b.feature_importances_
# Select the top 10 features
rf_3a_indices_top10 = np.argsort(rf_3a_imp)[::-1][0:10]
rf_3b_indices_top10 = np.argsort(rf_3b_imp)[::-1][0:10]
# Print the feature ranking
print("Feature ranking:")
for f in range(10):
    print("%d. rf_3a: feature %s (%f) / rf_3b: feature %s (%f)" % (f + 1,
          Xa_tr.columns[rf_3a_indices_top10[f]],
          rf_3a_imp[rf_3a_indices_top10[f]],
          Xb_tr.columns[rf_3b_indices_top10[f]],
          rf_3b_imp[rf_3b_indices_top10[f]]))
The feature importance ranking above shows that 08MH141 and 08GA022 have much higher importance than the other features. While it might be possible to drop all features except these two, such an arbitrary selection would not work well with an automated or semi-automated approach. Therefore, simply drop any feature with less than 1% importance.
Also note that it is not surprising that 08MH141 and 08GA022 have much higher importance, as they are the features most correlated with 08MH147.
Create the reduced feature training and test datasets.
Xa_tr_rfred = Xa_tr.loc[:,np.where(rf_3a_imp > 0.01, True, False)]
Xa_te_rfred = Xa_te.loc[:,np.where(rf_3a_imp > 0.01, True, False)]
Xb_tr_rfred = Xb_tr.loc[:,np.where(rf_3b_imp > 0.01, True, False)]
Xb_te_rfred = Xb_te.loc[:,np.where(rf_3b_imp > 0.01, True, False)]
print("Xa_rfred reduced to: %d (unreduced = 62)" % Xa_tr_rfred.shape[1])
print("Xb_rfred reduced to: %d (unreduced = 62)" % Xb_tr_rfred.shape[1])
rf_3a_tr_red_score, rf_3a_te_red_score, rf_3a_red = modelling(Xa_tr_rfred, Ya_tr, Xa_te_rfred, Ya_te, "rf", '3a_red')
rf_3b_tr_red_score, rf_3b_te_red_score, rf_3b_red = modelling(Xb_tr_rfred, Yb_tr, Xb_te_rfred, Yb_te, "rf", '3b_red')
rf_3_red_fit = pd.concat([rf_3a_tr_red_score, rf_3a_te_red_score, rf_3b_tr_red_score, rf_3b_te_red_score], axis= 1)
pd.concat([rf_3_fit, rf_3_red_fit], axis = 1)
There is only minor performance loss between the reduced and unreduced feature models. Notably, with just 4 features the random forest model produces a fit only slightly lower than the 40+ feature linear models.
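For transparency, the features that survived the 1% importance threshold can be listed directly from the reduced training sets (again an illustrative check, not part of the original workflow):
# Show which features survived the 1% importance threshold
print("Xa_rfred kept:", Xa_tr_rfred.columns.tolist())
print("Xb_rfred kept:", Xb_tr_rfred.columns.tolist())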
Fit a gradient boosting model, again using the default settings for simplicity.
As for the random forest model, performing RFECV would be computationally expensive due to the large number of features, so use the feature_importances_ attribute for feature reduction.
gb_3a_tr_score, gb_3a_te_score, gb_3a = modelling(Xa_tr, Ya_tr, Xa_te, Ya_te, "gb", '3a')
gb_3b_tr_score, gb_3b_te_score, gb_3b = modelling(Xb_tr, Yb_tr, Xb_te, Yb_te, "gb", '3b')
gb_3_fit = pd.concat([gb_3a_tr_score, gb_3a_te_score, gb_3b_tr_score, gb_3b_te_score], axis= 1)
gb_3_fit
Similar to the random forest model, the gradient boosting model provides a much better training fit than the linear model, but the fit drops more sharply on the test set, to a level similar to the linear models.
Feature reduction
gb_3a_imp = gb_3a.feature_importances_
gb_3b_imp = gb_3b.feature_importances_
# Select the top 10 features
gb_3a_indices_top10 = np.argsort(gb_3a_imp)[::-1][0:10]
gb_3b_indices_top10 = np.argsort(gb_3b_imp)[::-1][0:10]
# Print the feature ranking
print("Feature ranking:")
for f in range(10):
    print("%d. gb_3a: feature %s (%f) / gb_3b: feature %s (%f)" % (f + 1,
          Xa_tr.columns[gb_3a_indices_top10[f]],
          gb_3a_imp[gb_3a_indices_top10[f]],
          Xb_tr.columns[gb_3b_indices_top10[f]],
          gb_3b_imp[gb_3b_indices_top10[f]]))
The feature ranking shows the same result as the random forest model, with 08MH141 and 08GA022 having much higher importance than the other features. As with the random forest reduction, drop any feature with less than 1% importance.
Xa_tr_gbred = Xa_tr.loc[:,np.where(gb_3a_imp > 0.01, True, False)]
Xa_te_gbred = Xa_te.loc[:,np.where(gb_3a_imp > 0.01, True, False)]
Xb_tr_gbred = Xb_tr.loc[:,np.where(gb_3b_imp > 0.01, True, False)]
Xb_te_gbred = Xb_te.loc[:,np.where(gb_3b_imp > 0.01, True, False)]
print("Xa_gbred reduced to: %d (unreduced = 62)" % Xa_tr_gbred.shape[1])
print("Xb_gbred reduced to: %d (unreduced = 62)" % Xb_tr_gbred.shape[1])
gb_3a_tr_red_score, gb_3a_te_red_score, gb_3a_red = modelling(Xa_tr_gbred, Ya_tr, Xa_te_gbred, Ya_te, "gb", '3a_red')
gb_3b_tr_red_score, gb_3b_te_red_score, gb_3b_red = modelling(Xb_tr_gbred, Yb_tr, Xb_te_gbred, Yb_te, "gb", '3b_red')
gb_3_red_fit = pd.concat([gb_3a_tr_red_score, gb_3a_te_red_score, gb_3b_tr_red_score, gb_3b_te_red_score], axis= 1)
pd.concat([gb_3_fit, gb_3_red_fit], axis = 1)
The following table presents the fit metrics from all of the model runs, using the reduced feature models:
Step1_fit = pd.concat([lm_3_red_fit, rf_3_red_fit, gb_3_red_fit], keys=['lm', 'rf', 'gb'])
Step1_fit
The comparison of results indicates that the 3b (monthly mean) infilling method performs slightly better, and that all three model types achieve broadly similar test performance.
Fit models to the datasets with the addition of the previous 1-day and 7-day dataframes that were created; this corresponds to adding the prev1 and prev7 datasets to Data_3b. Only the Data_3b dataset is used, based on the conclusion of the section above.
Remove the rows with missing values of the result: 08MH147.
Model_Data_3b_prev = pd.concat([Data_3b, Data_3b_prev1, Data_3b_prev7], axis = 1)[~NA_Mask]
print("Dataset rows and columns: ", Model_Data_3b_prev.shape)
The dataset includes 12064 rows of data (same as in Model Fitting - Step 1) and 185 columns: the original 63 plus 61 prev1 and 61 prev7 columns (the previous-value dataframes exclude DayofYear and 08MH147).
Check for any missing values in this dataset.
Model_Data_3b_prev_na = Model_Data_3b_prev.isnull().sum().sum()
print(Model_Data_3b_prev_na)
These NA values are due to creating the previous 1-day and 7-day features; as they represent a minor amount of data, just drop the rows.
Model_Data_3b_prev.dropna(inplace = True)
Now partition the dataset into the outcome and features, split into training and testing sets based on a 75% / 25% split.
np.random.seed(998)
Xb_prev_tr, Xb_prev_te, Yb_prev_tr, Yb_prev_te = model_selection.train_test_split(Model_Data_3b_prev.drop(columns = '08MH147'), \
Model_Data_3b_prev['08MH147'])
Fitting the models follows the same process as above.
# Fit unreduced feature linear models
lm_3b_prev_tr_score, lm_3b_prev_te_score, lm_3b_prev = \
modelling(Xb_prev_tr, Yb_prev_tr, Xb_prev_te, Yb_prev_te, "lm", '3b_prev')
lm_3_prev_fit = pd.concat([lm_3b_prev_tr_score, lm_3b_prev_te_score], axis= 1)
# Use RFECV to reduce features
lm_3b_prev_temp = linear_model.LinearRegression()
lm_3b_prev_rfecv = RFECV(estimator=lm_3b_prev_temp)
lm_3b_prev_rfecv.fit(Xb_prev_tr, Yb_prev_tr)
Xb_prev_tr_lmred = Xb_prev_tr.loc[:,lm_3b_prev_rfecv.support_]
Xb_prev_te_lmred = Xb_prev_te.loc[:,lm_3b_prev_rfecv.support_]
# Fit reduced feature linear models
lm_3b_prev_tr_red_score, lm_3b_prev_te_red_score, lm_3b_prev_red = \
modelling(Xb_prev_tr_lmred, Yb_prev_tr, Xb_prev_te_lmred, Yb_prev_te, "lm", '3b_prev_red')
lm_3_prev_red_fit = pd.concat([lm_3b_prev_tr_red_score, lm_3b_prev_te_red_score], axis= 1)
lm_3_prev_red_fit1 = pd.concat([lm_3b_tr_red_score, lm_3b_te_red_score, lm_3b_prev_tr_red_score, \
lm_3b_prev_te_red_score], axis= 1)
pd.concat([lm_3_prev_fit, lm_3_prev_red_fit], axis = 1)
# Fit unreduced feature random forest models
rf_3b_prev_tr_score, rf_3b_prev_te_score, rf_3b_prev = \
modelling(Xb_prev_tr, Yb_prev_tr, Xb_prev_te, Yb_prev_te, "rf", '3b_prev')
rf_3_prev_fit = pd.concat([rf_3b_prev_tr_score, rf_3b_prev_te_score], axis= 1)
# Use feature importance to reduce features
rf_3b_prev_imp = rf_3b_prev.feature_importances_
Xb_prev_tr_rfred = Xb_prev_tr.loc[:,np.where(rf_3b_prev_imp > 0.01, True, False)]
Xb_prev_te_rfred = Xb_prev_te.loc[:,np.where(rf_3b_prev_imp > 0.01, True, False)]
# Fit reduced feature random forest models
rf_3b_prev_tr_red_score, rf_3b_prev_te_red_score, rf_3b_prev_red = \
modelling(Xb_prev_tr_rfred, Yb_prev_tr, Xb_prev_te_rfred, Yb_prev_te, "rf", '3b_prev_red')
rf_3_prev_red_fit = pd.concat([rf_3b_prev_tr_red_score, rf_3b_prev_te_red_score], axis= 1)
rf_3_prev_red_fit1 = pd.concat([rf_3b_tr_red_score, rf_3b_te_red_score, rf_3b_prev_tr_red_score, \
rf_3b_prev_te_red_score], axis= 1)
pd.concat([rf_3_prev_fit, rf_3_prev_red_fit], axis = 1)
# Fit unreduced feature gradient boosting models
gb_3b_prev_tr_score, gb_3b_prev_te_score, gb_3b_prev = \
modelling(Xb_prev_tr, Yb_prev_tr, Xb_prev_te, Yb_prev_te, "gb", '3b_prev')
gb_3_prev_fit = pd.concat([gb_3b_prev_tr_score, gb_3b_prev_te_score], axis= 1)
# Use feature importance to reduce features
gb_3b_prev_imp = gb_3b_prev.feature_importances_
Xb_prev_tr_gbred = Xb_prev_tr.loc[:,np.where(gb_3b_prev_imp > 0.01, True, False)]
Xb_prev_te_gbred = Xb_prev_te.loc[:,np.where(gb_3b_prev_imp > 0.01, True, False)]
# Fit reduced feature gradient boosting models
gb_3b_prev_tr_red_score, gb_3b_prev_te_red_score, gb_3b_prev_red = \
modelling(Xb_prev_tr_gbred, Yb_prev_tr, Xb_prev_te_gbred, Yb_prev_te, "gb", '3b_prev_red')
gb_3_prev_red_fit = pd.concat([gb_3b_prev_tr_red_score, gb_3b_prev_te_red_score], axis= 1)
gb_3_prev_red_fit1 = pd.concat([gb_3b_tr_red_score, gb_3b_te_red_score, gb_3b_prev_tr_red_score, \
gb_3b_prev_te_red_score], axis= 1)
pd.concat([gb_3_prev_fit, gb_3_prev_red_fit], axis = 1)
The following table presents the fit metrics from all of the model runs using the datasets that include the previous values, with the reduced feature models. The table also shows the corresponding Step 1 models that did not include the previous values.
Step2_fit = pd.concat([lm_3_prev_red_fit1, rf_3_prev_red_fit1, gb_3_prev_red_fit1], keys=['lm', 'rf', 'gb'])
Step2_fit
The comparison of results indicates that adding the previous 1-day and 7-day features improves the fit for all three model types.
Based on the Step 1 and Step 2 modelling results presented above, the three model types produce similar overall fit, with the Step 2 (previous-value) models performing best. Given this similarity, the assessment below looks at fit within 0.1-increment quantiles to see if any of the models perform better from that perspective. As the Step 2 modelling produced the best fit, only those models are compared in this assessment, and only on the test set results.
# Create a function to apply the three fit tests for the overall dataset as well as each quantile in the dataset
def performance_quantile(observed, modelled):
    labels = ['All_Data', 'Q0_0.1', 'Q0.1_0.2', 'Q0.2_0.3', 'Q0.3_0.4', 'Q0.4_0.5', 'Q0.5_0.6', 'Q0.6_0.7',
              'Q0.7_0.8', 'Q0.8_0.9', 'Q0.9_1.0']
    output_RMSE = []
    output_MAE = []
    output_r2 = []
    output_RMSE.append((metrics.mean_squared_error(observed, modelled))**(0.5))
    output_MAE.append(metrics.mean_absolute_error(observed, modelled))
    output_r2.append(metrics.r2_score(observed, modelled))
    quantiles = np.arange(0.1, 1.1, 0.1)
    for idx, val in enumerate(quantiles):
        lower = observed.quantile(val-0.1)
        upper = observed.quantile(val)
        quantile_mask = np.where((observed < upper) & (observed >= lower), True, False)
        observed_quantile = observed[quantile_mask]
        modelled_quantile = modelled[quantile_mask]
        output_RMSE.append((metrics.mean_squared_error(observed_quantile, modelled_quantile))**(0.5))
        output_MAE.append(metrics.mean_absolute_error(observed_quantile, modelled_quantile))
        output_r2.append(metrics.r2_score(observed_quantile, modelled_quantile))
    df = pd.DataFrame(columns=labels, index=['RMSE', 'MAE', 'R2'])
    df.iloc[0] = output_RMSE
    df.iloc[1] = output_MAE
    df.iloc[2] = output_r2
    return df
print("linear model")
performance_quantile(Yb_prev_te, lm_3b_prev_red.predict(Xb_prev_te_lmred))
print("random forest model")
performance_quantile(Yb_prev_te, rf_3b_prev_red.predict(Xb_prev_te_rfred))
print("gradient boosting model")
performance_quantile(Yb_prev_te, gb_3b_prev_red.predict(Xb_prev_te_gbred))
The quantile performance assessment is interesting, as it shows that the R2 fit is generally poor within any given quantile. This indicates that while the models are good at capturing the overall relationship from the minimum to the maximum range of streamflow, there is less correlation within each quantile. This is expected behaviour: within a narrow quantile the variance of the observed values (the denominator of the R2 calculation) is small, so even modest errors produce a low R2. The table also indicates that the random forest model generally performs better in terms of RMSE and MAE within each quantile.
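A minimal synthetic illustration of this restricted-range effect (the numbers below are illustrative and not project data):
# Demonstrate that a model with modest errors scores a high R2 over a wide range
# but a low R2 within a single narrow quantile band
rng = np.random.default_rng(0)
y_demo = pd.Series(np.sort(rng.uniform(0, 100, 1000)))   # synthetic "observed" values
yhat_demo = y_demo + rng.normal(0, 5, 1000)              # synthetic "modelled" values with modest error
print("Full range R2:", metrics.r2_score(y_demo, yhat_demo))
band = (y_demo >= y_demo.quantile(0.4)) & (y_demo < y_demo.quantile(0.5))
print("Single quantile R2:", metrics.r2_score(y_demo[band], yhat_demo[band]))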
The following section provides plots of the quantiles for a visual assessment of fit.
Create a dataframe of observed and modelled results for the plots.
Results_df = pd.DataFrame(Yb_prev_te).rename(columns={"08MH147": "Observed"})
Results_df['Modelled_lm'] = lm_3b_prev_red.predict(Xb_prev_te_lmred)
Results_df['Modelled_rf'] = rf_3b_prev_red.predict(Xb_prev_te_rfred)
Results_df['Modelled_gb'] = gb_3b_prev_red.predict(Xb_prev_te_gbred)
Results_df.sort_index(inplace = True)
Results_df.head()
# Create modelled vs observed quantile plots
plt.style.use('seaborn-darkgrid')
plt.subplots(figsize=(20, 12))
observed_plot = Results_df['Observed']
modelled1_plot = Results_df['Modelled_lm']
modelled2_plot = Results_df['Modelled_rf']
modelled3_plot = Results_df['Modelled_gb']
quantiles = np.arange(0.1,1.1,0.1)
for idx, val in enumerate(quantiles):
    lower = observed_plot.quantile(val-0.1)
    upper = observed_plot.quantile(val)
    quantile_mask = np.where((observed_plot < upper) & (observed_plot >= lower), True, False)
    observed_quantile = observed_plot[quantile_mask]
    modelled1_quantile = modelled1_plot[quantile_mask]
    modelled2_quantile = modelled2_plot[quantile_mask]
    modelled3_quantile = modelled3_plot[quantile_mask]
    range_val = np.arange(lower, upper, (upper-lower)/1000)
    # Find the right spot on the plot
    plt.subplot(2, 5, idx+1)
    # Plotting
    plt.scatter(observed_quantile, modelled1_quantile, alpha=0.3, label='lm')
    plt.scatter(observed_quantile, modelled2_quantile, alpha=0.3, label='rf')
    plt.scatter(observed_quantile, modelled3_quantile, alpha=0.3, label='gb')
    plt.plot(range_val, range_val, color='red', marker='', linewidth=1.0)
    # Add title and formatting
    plt.title("Streamflow = " + str(round(lower, 1)) + " to " + str(round(upper, 1)), loc='left', fontsize=12, fontweight=0)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=10)
    plt.xlim(left=lower, right=upper)
    plt.ylim(bottom=lower-upper, top=upper*2)
    plt.legend(loc=2, frameon=True)
# general title
plt.suptitle("0.1 Interval Modelled Quantiles of Observed vs Modelled Fit (m^3/s)", fontsize=20, fontweight=0, color='black', style='italic', y=0.95);
The plot above of observed vs modelled data indicates that while there is a large amount of scatter within each quantile, the random forest (rf) and gradient boosting (gb) models fit the 1:1 line more tightly than the linear model. The first quantile plot in the series also shows that the linear model produces negative streamflow estimates whereas the rf and gb models do not.
# Create modelled and observed quantile histogram plots
plt.style.use('seaborn-darkgrid')
plt.subplots(figsize=(20, 12))
for idx, val in enumerate(quantiles):
    lower = observed_plot.quantile(val-0.1)
    upper = observed_plot.quantile(val)
    quantile_mask = np.where((observed_plot < upper) & (observed_plot >= lower), True, False)
    observed_quantile_h = observed_plot[quantile_mask]
    modelled1_quantile_h = modelled1_plot[quantile_mask]
    modelled2_quantile_h = modelled2_plot[quantile_mask]
    modelled3_quantile_h = modelled3_plot[quantile_mask]
    range_val_h = np.arange(lower, upper, (upper-lower)/1000)
    # Find the right spot on the plot
    plt.subplot(2, 5, idx+1)
    # Plot the histogram
    plt.hist(modelled1_quantile_h, bins=range_val_h, density=True, histtype='step', cumulative=True,
             label='Modelled: lm')
    plt.hist(modelled2_quantile_h, bins=range_val_h, density=True, histtype='step', cumulative=True,
             label='Modelled: rf')
    plt.hist(modelled3_quantile_h, bins=range_val_h, density=True, histtype='step', cumulative=True,
             label='Modelled: gb')
    plt.hist(observed_quantile_h, bins=range_val_h, density=True, histtype='step', cumulative=True,
             label='Observed')
    # Add title and formatting
    plt.title("Streamflow = " + str(round(lower, 1)) + " to " + str(round(upper, 1)), loc='left', fontsize=12, fontweight=0)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=10)
    plt.legend(loc=2, frameon=True)
# general title
plt.suptitle("0.1 Interval Modelled Quantiles of Observed and Modelled Cumulative Histograms (m^3/s)", fontsize=20, fontweight=0, color='black', style='italic', y=0.95);
The quantile histogram plots indicate that the random forest (rf) model has the best fit to the observed data. The linear (lm) and gradient boosting (gb) models perform particularly poorly in the first quantile plot of the series.
# Create interactive time series plot
fig = go.Figure()
fig.add_trace(
go.Scatter(x=list(Results_df.index), y=list(Results_df.Observed), name="Observed", line_color="Red"))
fig.add_trace(
go.Scatter(x=list(Results_df.index), y=list(Results_df.Modelled_lm), name="Modelled: lm", line_color="Blue"))
fig.add_trace(
go.Scatter(x=list(Results_df.index), y=list(Results_df.Modelled_rf), name="Modelled: rf", line_color="Orange"))
fig.add_trace(
go.Scatter(x=list(Results_df.index), y=list(Results_df.Modelled_gb), name="Modelled: gb", line_color="Green"))
# Set title
fig.update_layout(title={
'font':{'size':20},
'text': "Time Series of Observed and Modelled Streamflow Data (m^3/s)",
'y':0.92,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'}
)
# style all the traces
fig.update_traces(
#hoverinfo="name+x+text",
line={"width": 0.5},
mode="lines"
)
#fig.update_layout(
# xaxis=dict(
# rangeslider=dict(
# visible=True
# ),
# type="date"
# )
#)
fig.show(renderer="notebook")
Zooming into lower flows on the plot highlights that the random forest model outperforms the linear and gradient boosting models. The plot also indicates that the observed peak flows are generally higher than the modelled peak flows.
This initial study provides a good first step towards utilizing machine learning models for streamflow in British Columbia, but additional study should be considered, including assessing: