Data analysis is the process of understanding, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support data-driven decisions.
In this post, we will perform an analysis of the Boston Housing dataset, which ships with the scikit-learn package. The following are the steps to perform the analysis on the housing dataset.
1. Loading the necessary libraries: The first step involves importing the libraries required to perform data analysis. Some of these libraries include: pandas, scikit-learn, NumPy, Requests, and BeautifulSoup.
Importing the visualization libraries:
We import the visualization libraries: matplotlib and seaborn.
Matplotlib is a 2D plotting library, whereas Seaborn is a statistical data visualization library. Seaborn is built on top of matplotlib.
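The imports for this step might look like the following. This is a minimal sketch; the backend line is only needed when running outside a notebook:

```python
# A minimal sketch of the imports this analysis assumes.
import numpy as np
import pandas as pd

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so plots also render headless
import matplotlib.pyplot as plt
import seaborn as sns
```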
2. Loading the Dataset:
Next, we load the housing dataset. The Boston Housing dataset can be accessed through the sklearn.datasets module, and the load_boston function is used to load it.
After that, we check what kind of object was returned; the built-in type() function tells us the data structure of the dataset.
The output of the second code cell tells us that it is a scikit-learn Bunch object. Next, we import the Bunch class from scikit-learn to inspect it.
3. Analyse the Dataset:
After importing the dataset, we need to understand its features. We print the field names of the dataset using the keys() method.
To print the dataset description, run the next cell.
The above line of code prints all the attribute names along with their descriptions.
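The loading and inspection steps can be sketched as follows. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this sketch builds a tiny Bunch with the same layout as the real one; the numbers in it are purely illustrative:

```python
import numpy as np
from sklearn.utils import Bunch

# load_boston() was removed in scikit-learn 1.2, so this stub Bunch mimics
# its layout (data, target, feature_names, DESCR) with made-up values.
# On older scikit-learn versions the real call would be:
#   from sklearn.datasets import load_boston
#   boston = load_boston()
boston = Bunch(
    data=np.array([[6.575, 4.98], [6.421, 9.14], [7.185, 4.03]]),
    target=np.array([24.0, 21.6, 34.7]),
    feature_names=np.array(["RM", "LSTAT"]),
    DESCR="Boston house prices dataset (illustrative stub)",
)

print(type(boston))      # a scikit-learn Bunch object
print(boston.keys())     # the field names: data, target, feature_names, DESCR
print(boston["DESCR"])   # the dataset description
```

A Bunch behaves like a dictionary whose keys are also accessible as attributes, which is why both boston["data"] and boston.data work.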
4. Creation of DataFrame:
We create a DataFrame using the Pandas module. A DataFrame is a structure for storing data in the form of rows and columns.
To understand the DataFrame function, we print its docstring as follows:
The docstring helps us understand the input parameters of a function. To convert the data into a DataFrame, we pass boston['data'] as the data and boston['feature_names'] as the column headers.
Now let us look at the Boston data, feature names and the shape of the data.
boston['data'] shows that the data is stored as a 2D array, and boston['data'].shape returns its shape: the array has 506 entries and 13 features, which become 506 rows and 13 columns when converted into a DataFrame.
Now that we understand the shape and contents of the Boston data, the main job is to convert it into a DataFrame.
To see how the data looks in the DataFrame, we print the first five rows using the head() function.
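The conversion can be sketched like this, with a small illustrative array standing in for the full (506, 13) boston['data']:

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for boston['data'] and boston['feature_names'];
# in the real dataset the array has shape (506, 13).
data = np.array([[6.575, 4.98], [6.421, 9.14], [7.185, 4.03],
                 [6.998, 2.94], [7.147, 5.33], [6.430, 5.21]])
feature_names = np.array(["RM", "LSTAT"])

print(data.shape)  # (6, 2) here; (506, 13) for the full dataset

# Convert the 2D array into a DataFrame, using the feature names as headers
df = pd.DataFrame(data=data, columns=feature_names)
print(df.head())   # the first five rows
```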
5. Understanding the Target Variable:
In machine learning, the variable that we are most interested in, i.e. the variable that we want to predict given the features, is known as the target variable. In this case, the target variable is MEDV, the median house value in thousands of dollars.
To find the number of entries in the target variable, check its shape attribute.
This shows that there are 506 entries in the target array. Since it is the same length as the DataFrame, we can add it as a column. To distinguish the target from the rest of the features, we copy it into the DataFrame as its first column.
To ensure that the dataframe looks the way we want, print the head and tail of the dataframe.
Each row is labelled with an index value, shown in bold on the left-hand side of the table. These are integers starting from 0 and incrementing by one for each row.
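The post does not show its exact code for placing the target first; one way to do it is DataFrame.insert, sketched here on a small illustrative frame:

```python
import numpy as np
import pandas as pd

# Small illustrative frame standing in for the 506-row Boston DataFrame
df = pd.DataFrame({"RM": [6.575, 6.421, 7.185],
                   "LSTAT": [4.98, 9.14, 4.03]})
target = np.array([24.0, 21.6, 34.7])  # stand-in for boston['target']

print(target.shape)  # same length as the DataFrame, so it can become a column

# Insert the target as the first column so it stands apart from the features
df.insert(0, "MEDV", target)
print(df.head())
print(df.tail())
```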
6. Cleaning the Data:
Next, we need to know the data type of each feature in the DataFrame; the dtypes attribute prints them. Before performing any operation on the features, it is necessary to know their data types.
This shows that all the fields are of float type and are therefore most likely continuous variables, including our target, which means that predicting the target variable is a regression problem.
Data cleaning is the most important part of data analysis. For data cleaning, we need to search for any missing data, which Pandas normally represents as NaN values. df.isnull().sum() returns the number of null values in each column.
We notice that there are no null values in any of the features, so this part of the data cleaning is done and we can move on to the next step.
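Both checks can be sketched as follows, on a small illustrative frame standing in for the Boston data:

```python
import pandas as pd

# Illustrative frame; the real Boston frame has 14 float64 columns
df = pd.DataFrame({"MEDV": [24.0, 21.6, 34.7],
                   "RM": [6.575, 6.421, 7.185],
                   "LSTAT": [4.98, 9.14, 4.03]})

print(df.dtypes)          # data type of each column (all float64 here)
print(df.isnull().sum())  # count of missing (NaN) values per column
```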
There are a total of 13 features, or columns. We need to remove the columns that are of no interest in predicting the target variable, so we drop some columns from the DataFrame.
Now the data looks as shown in the above screenshot.
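The post does not list which columns it drops, so the column name dropped in this sketch is purely illustrative:

```python
import pandas as pd

# Illustrative frame with a few Boston-style columns (values are made up)
df = pd.DataFrame({"MEDV": [24.0, 21.6, 34.7],
                   "ZN": [18.0, 0.0, 0.0],
                   "RM": [6.575, 6.421, 7.185],
                   "LSTAT": [4.98, 9.14, 4.03]})

# drop() returns a new frame unless inplace=True; "ZN" here is only an example
df = df.drop(columns=["ZN"])
print(df.columns.tolist())
```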
7. Data Exploration:
In the next step, we need to understand the data. We use the describe() method to print the summary statistics of the DataFrame.
Among all the columns, we focus on a few. We save their names in a variable and print the newly edited DataFrame with only the required columns.
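A sketch of both operations; the column selection here is an illustrative choice, since the post does not name the columns it keeps:

```python
import pandas as pd

df = pd.DataFrame({"MEDV": [24.0, 21.6, 34.7, 33.4],
                   "RM": [6.575, 6.421, 7.185, 6.998],
                   "LSTAT": [4.98, 9.14, 4.03, 2.94],
                   "TAX": [296.0, 242.0, 242.0, 222.0]})

print(df.describe().T)  # summary statistics, transposed for readability

# Keep only the columns we want to focus on (an illustrative selection)
cols = ["MEDV", "RM", "LSTAT"]
print(df[cols].head())
```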
To learn the pairwise correlation among the features, we print the correlation coefficients using the corr() method.
The above table shows the correlation between each pair of features. Large positive scores indicate a strong positive correlation, and the diagonal values are all 1 because each feature is perfectly correlated with itself.
Now we visualize the correlation table as a heatmap, for better understanding and presentation.
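The correlation matrix and its heatmap can be sketched like this, on illustrative data (the annot and cmap choices are assumptions, not the post's exact styling):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; must be set before importing seaborn
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({"MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
                   "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
                   "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33]})

corr = df.corr()  # pairwise Pearson correlation; the diagonal is all 1.0
print(corr)

# Visualize the correlation matrix as an annotated heatmap
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
```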
In the final step of the data exploration, we use the seaborn pairplot function to visualize the data using the pairplot graph.
This shows that:
- RM and MEDV are shaped like normally distributed variables.
- AGE is skewed to the left and LSTAT is skewed to the right.
- TAX has a large amount of its distribution concentrated around the value 700.
8. Predictive Analytics:
In order to predict a value, we first need to create a model and train it. Since this is a regression problem, we will first create a Linear Regression model.
We will use RM and LSTAT as features to compare with our target variable since they are highly correlated. We will draw scatter plots with linear models by running the following code.
The line of best fit is calculated by minimizing the ordinary least squares error function, which the Seaborn regplot function does automatically. The shaded area around the line represents the 95% confidence interval.
To visualize the residuals in each case, Seaborn can also be used to create residual plots.
The points in the residual plot represent the difference between the sample value (y) and the predicted value (y'). Residuals greater than zero are points that are underestimated by the model, and residuals less than zero are points that are overestimated by the model.
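Both the regression plots and the residual plots can be sketched together; the 2x2 grid layout and the illustrative data are assumptions, since the post's exact cells are not shown:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; must be set before importing seaborn
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({"MEDV": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7],
                   "RM": [6.575, 6.421, 7.185, 6.998, 7.147, 6.430],
                   "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33, 5.21]})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Scatter plots with the least-squares fit and its 95% confidence band
sns.regplot(x="RM", y="MEDV", data=df, ax=axes[0, 0])
sns.regplot(x="LSTAT", y="MEDV", data=df, ax=axes[0, 1])

# Residual plots: vertical distance of each sample from the fitted line
sns.residplot(x="RM", y="MEDV", data=df, ax=axes[1, 0])
sns.residplot(x="LSTAT", y="MEDV", data=df, ax=axes[1, 1])
plt.tight_layout()
```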
Next, we create a linear regression model inside a helper function and use this function to find the mean squared error in each case.
Then we call the get_mse function for both RM and LSTAT to check the mean squared error in each case.
Comparing the mean squared error in each case, we notice that the error is slightly lower for LSTAT, so we can better predict the target value when provided with LSTAT values.
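The post's exact helper is not shown; a get_mse function consistent with its description might look like this. The data is illustrative, so the printed errors will not match those of the full dataset:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative stand-in for the Boston DataFrame (values are made up)
df = pd.DataFrame({"MEDV": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7],
                   "RM": [6.575, 6.421, 7.185, 6.998, 7.147, 6.430],
                   "LSTAT": [4.98, 9.14, 4.03, 2.94, 5.33, 5.21]})

def get_mse(df, feature, target="MEDV"):
    """Fit a linear regression of the target on one feature and print its MSE."""
    X = df[[feature]].values  # 2D array of shape (n_samples, 1)
    y = df[target].values
    model = LinearRegression().fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    print(f"{feature}: MSE = {mse:.2f}")
    return mse

get_mse(df, "RM")
get_mse(df, "LSTAT")
```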
Using the steps above, we can perform data analysis and cleaning, and create a model to perform predictive analysis.