I had roughly two hours to kill and wanted to predict how much a house would sell for in the UK. Luckily, there is a dataset of land registrations in the UK, so I downloaded it from here.
The dataset is huge, and I knew my ultrabook would practically die trying to process it :). So, to begin with, I split the dataset into several chunks to avoid memory problems. I will not describe the data in detail here, as everything is explained at the link above.
From each chunk I kept the 'Date', 'Property', 'Lease', 'Location' and 'Price' columns. 'Property', 'Lease' and 'Location' served as predictors, 'Price' was the response, and 'Date' was used only to define the train and test splits. The predictors are categorical, so I encoded them as binary (0/1) dummy variables.
The data was downsampled to avoid memory problems by keeping every 1000th sample. The train and test splits were created by year: sales before 2015 form the training set, and sales from 2015 onwards form the test set.
After that, I used two machine learning methods to predict prices: 1) linear regression (a robust linear model with categorical inputs) and 2) Gaussian process regression (with the ardsquaredexponential kernel function).
The coefficient of determination (R-squared) and root mean square error (RMSE) were used to evaluate prediction performance. The linear regression model gave an R-squared of 0.05 and an RMSE of 1178943.98; the Gaussian process regression model gave an R-squared of 0.05 and an RMSE of 1178378.37.
No cross-validation was performed, so the results should be read as a baseline. The total time spent was 2 hrs 40 min. Other than the memory issues, I ran into no other problems.
Data Preprocessing
Split the dataset file 'pp-complete.csv', downloaded from the UK Land Registry website, into several chunks to avoid memory problems. The command below splits the CSV file into several CSV files of 250000 rows each (-l) with numeric suffixes (-d).
command = 'split -d -l 250000 pp-complete.csv pp_complete_part';
system(command); % run the split from MATLAB (requires a shell that provides split)
List the chunk files produced by the split and count them
files_processed_list = dir([pwd, '/*part*']);
total_no_files = length(files_processed_list(not([files_processed_list.isdir])));
Read each chunk and store the selected columns in a variable T.
T = [];
for i = 1:total_no_files
    filename = sprintf('pp_complete_part%02d', i-1);
    temp_file = readtable(filename, 'Delimiter', ',', 'ReadVariableNames', false);
    % Columns kept, in order: Date, Property, Lease, Location, Price
    T = [T; [temp_file.Var3 temp_file.Var5 temp_file.Var7 temp_file.Var12 temp_file.Var2]];
    disp(i) % progress indicator
end
Create a table representing T
table_T = cell2table(T, 'VariableNames', {'Date','Property','Lease','Location','Price'});
The variables are categorical, so create dummy variables to represent each category as binary (0/1) values
dummy_property = dummyvar(grp2idx(table_T.Property));
dummy_lease = dummyvar(grp2idx(table_T.Lease));
location_London = double(strcmp('LONDON', table_T.Location)); % 1 if London, 0 otherwise
date = cellfun(@(x) x(1:4), table_T.Date, 'UniformOutput', false); % first four characters = year
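To make the encoding concrete, here is a minimal sketch on a made-up five-row column. The letter codes are illustrative and not taken from the dataset, and the snippet assumes the Statistics and Machine Learning Toolbox, which provides grp2idx and dummyvar.

```matlab
% Toy example: made-up property codes, not real rows from the dataset
property = {'D'; 'S'; 'T'; 'D'; 'F'};

% grp2idx maps each distinct label to an integer group index
[idx, labels] = grp2idx(property);

% dummyvar expands the indices into one 0/1 column per category
dummy = dummyvar(idx);

disp(labels')   % category represented by each column
disp(dummy)     % one row per sample, exactly one 1 per row
```

Each row of dummy contains a single 1, in the column of that sample's category.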
Create Datasets as well as the prediction labels
Dataset_X = [dummy_property dummy_lease location_London]; % Dataset
Dataset_Y = table_T.Price; % Prediction labels
Downsample the data by selecting every 1000th point
downsampled_X = downsample(Dataset_X,1000);
downsampled_Y = str2double(downsample(Dataset_Y,1000));
downsampled_date = downsample(date,1000);
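downsample comes from the Signal Processing Toolbox. For readers without it, the same rows can be selected with plain indexing, since downsample(X, n) keeps every n-th row starting from the first. The sketch below uses toy data and a step of 3 instead of 1000; the variable names are illustrative.

```matlab
% Toy data standing in for Dataset_X (numeric) and date (cell array of years)
X = (1:10)' * [1 10];   % 10-by-2 numeric matrix
d = arrayfun(@(k) sprintf('%d', 2000 + k), (1:10)', 'UniformOutput', false);

step = 3;               % the post uses 1000

% Keep rows 1, 1+step, 1+2*step, ...
X_small = X(1:step:end, :);
d_small = d(1:step:end);

disp(X_small)
disp(d_small')
```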
Find the samples whose year is before 2015; this index defines the train and test splits
date_ = str2double(downsampled_date);
index = (date_ < 2015);
Create test and train splits
train_downsampled_X = downsampled_X(index,:);
test_downsampled_X = downsampled_X(~index,:);
train_downsampled_Y = downsampled_Y(index,:);
test_downsampled_Y = downsampled_Y(~index,:);
Check that London appears as a location in both the train and the test set
Figure 1: Gaps in the plot show the presence of other places and colour shows the presence of London
(a) Train set (b) Test set
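The same check can be done numerically. Below is a minimal sketch on toy matrices, assuming the London indicator is the last column of the feature matrix, as it is in Dataset_X; the toy values are made up.

```matlab
% Toy matrices standing in for train_downsampled_X / test_downsampled_X;
% the last column is assumed to be the London indicator
train_X = [1 0 1; 0 1 0; 1 0 0; 0 1 1];
test_X  = [1 0 0; 0 1 1];

% Count the rows flagged as London in each split
n_london_train = nnz(train_X(:, end));
n_london_test  = nnz(test_X(:, end));

fprintf('London rows: %d of %d (train), %d of %d (test)\n', ...
    n_london_train, size(train_X, 1), n_london_test, size(test_X, 1));

% Both splits should contain at least one London sale
assert(n_london_train > 0 && n_london_test > 0);
```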
Create train test datasets as well as their corresponding response variables.
X = train_downsampled_X; % train dataset
Y = train_downsampled_Y; % train labels
X_t = test_downsampled_X; % test dataset
Y_t = test_downsampled_Y; % test labels
Model Fitting
Model Fitting using Linear Regression
LinearMdl = fitlm(X,Y,'linear','RobustOpts','on','CategoricalVars',1:9);
Predictions on the test set
yfit = predict(LinearMdl,X_t);
Evaluate the performance of the Model
[r2, rmse] = rsquare(Y_t,yfit)
R-squared using the linear regression model = 0.05 and RMSE = 1178943.98.
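rsquare is not part of core MATLAB as far as I know, so it has to be available as a separate file on the path. For readers without it, here is a minimal sketch of the two metrics under their usual definitions, R-squared = 1 - SS_res/SS_tot and RMSE = sqrt(mean((y - yhat).^2)). The conventions of the rsquare function used in this post may differ in details, and the toy vectors below are made up.

```matlab
y    = [3; 5; 7; 9];          % observed values (toy data)
yhat = [2.5; 5.5; 6.5; 9.5];  % predicted values (toy data)

ss_res = sum((y - yhat).^2);        % residual sum of squares
ss_tot = sum((y - mean(y)).^2);     % total sum of squares around the mean

r2   = 1 - ss_res / ss_tot;         % coefficient of determination
rmse = sqrt(mean((y - yhat).^2));   % root mean square error

fprintf('R-squared = %.2f, RMSE = %.2f\n', r2, rmse);
% prints: R-squared = 0.95, RMSE = 0.50
```

Note that RMSE is in the same units as the price, so its magnitude depends on the price scale, while R-squared is unitless.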
Figure 2: Plot showing predictions using linear and Gaussian Process Regression
(a) Linear Regression (b) Gaussian Process Regression
Model Fitting using Gaussian Process Regression
gprMdl = fitrgp(X,Y,'KernelFunction','ardsquaredexponential',...
    'FitMethod','sr','PredictMethod','fic','Standardize',1);
Predictions on the test set
yfit_GPR = predict(gprMdl,X_t);
Evaluate the performance of the Model
[r2_GPR, rmse_GPR] = rsquare(Y_t,yfit_GPR)
R-squared using the Gaussian process regression model = 0.05 and RMSE = 1178378.37.
This is a preliminary analysis and the results are not great. Nevertheless, more thorough hyperparameter tuning and testing with a wider range of ML models might improve the results.
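Since no cross-validation was performed here, one possible next step is a k-fold estimate of the error. The sketch below is not part of the analysis above: it runs 5-fold cross-validation with fitlm on synthetic data, and the data, the fold count and the variable names are all assumptions chosen for illustration. It needs the Statistics and Machine Learning Toolbox (cvpartition, fitlm).

```matlab
rng(0);                                  % reproducible toy data
n = 200;
X = double(rand(n, 3) > 0.5);            % three binary (dummy-like) predictors
Y = 100 + 50*X(:,1) - 20*X(:,2) + 10*randn(n, 1);

k = 5;                                   % number of folds (illustrative choice)
cv = cvpartition(n, 'KFold', k);
rmse_fold = zeros(k, 1);

for f = 1:k
    tr = training(cv, f);                % logical index of training rows
    te = test(cv, f);                    % logical index of held-out rows
    mdl = fitlm(X(tr, :), Y(tr));        % fit on the training folds only
    err = Y(te) - predict(mdl, X(te, :));
    rmse_fold(f) = sqrt(mean(err.^2));   % RMSE on the held-out fold
end

fprintf('Mean RMSE over %d folds: %.2f\n', k, mean(rmse_fold));
```

The spread of rmse_fold across folds gives a rough idea of how stable the error estimate is, which a single train and test split cannot show.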