Scatter plots are used to display the relationship between two continuous variables. In a scatter plot, each observation in a data set is represented by a point. Often, a scatter plot will also have a line showing the predicted values based on some statistical model.
Making a Basic Scatter Plot
Step 1: To make a simple scatter plot, we use geom_point() and map one variable to x and one variable to y. The code is similar to the code shown in the earlier sections; the only difference is that we add geom_point() right after the ggplot() call.
Step 2: You can also use a different shape for the points by setting shape in geom_point(). Here I have used shape 21, which is a hollow circle.
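The two steps above can be sketched as follows. This is a minimal example that assumes the heightweight data set from the gcookbook package, with ageYear on x and heightIn on y; the object names p_basic and p_hollow are only illustrative.

```r
library(ggplot2)
library(gcookbook)  # assumed source of the heightweight data set

# Step 1: basic scatter plot, one point per observation
p_basic <- ggplot(heightweight, aes(x=ageYear, y=heightIn)) +
  geom_point()

# Step 2: the same plot drawn with shape 21 instead of the default shape
p_hollow <- ggplot(heightweight, aes(x=ageYear, y=heightIn)) +
  geom_point(shape=21)

print(p_basic)
print(p_hollow)
```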
Grouping Data Points by a Variable Using Shape or Color
Step 1: We can also group the points by a variable using color or shape. We do this by mapping the variable to an aesthetic inside aes() in the ggplot() call. Here the sex variable is mapped to colour, so the points are divided into groups by color. In the plots window on the right-hand side, the legend shows that female (f) is pink and male (m) is cyan.
Step 2: As mentioned earlier, you can also group the points by shape. The code below illustrates how to do that.
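A minimal sketch of grouping by color, assuming the same heightweight data set from gcookbook; p_colour is an illustrative name.

```r
library(ggplot2)
library(gcookbook)  # assumed source of the heightweight data set

# Map sex to colour so that each group gets its own color and a legend entry
p_colour <- ggplot(heightweight, aes(x=ageYear, y=heightIn, colour=sex)) +
  geom_point()

print(p_colour)
```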
ggplot(heightweight, aes(x=ageYear, y=heightIn, shape=sex)) + geom_point()
In the plots window on the right-hand side, the legend shows that female (f) is a solid circle and male (m) is a solid triangle.
Step 3: Setting different colors for a grouping variable. A variable can be mapped to both shape and color at the same time.
The left-hand image shows the scatter plot with the default color palette, grouped by shape and color.
The right-hand image shows the scatter plot with the customised color palette, grouped by shape and color.
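The two variants described above might look like this. The particular shapes (1 and 2) and the "Set1" palette are assumptions chosen for illustration, since the original values are not given in the text.

```r
library(ggplot2)
library(gcookbook)  # assumed source of the heightweight data set

# Default shapes and colors, grouped by sex
p_default <- ggplot(heightweight,
                    aes(x=ageYear, y=heightIn, shape=sex, colour=sex)) +
  geom_point()

# Customised shapes and color palette (values are illustrative)
p_custom <- p_default +
  scale_shape_manual(values=c(1, 2)) +
  scale_colour_brewer(palette="Set1")

print(p_default)
print(p_custom)
```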
Using Different Point Shapes
Step 1: If you want to set the shape of all the points, specify the shape inside geom_point().
Step 2: To use different shapes for the different levels of a grouping variable, map the variable to shape and set the shapes with scale_shape_manual().
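Both steps can be sketched as below, again assuming the heightweight data set. The shape numbers 3, 1 and 4 and the point size are illustrative choices, not values taken from the text.

```r
library(ggplot2)
library(gcookbook)  # assumed source of the heightweight data set

# Step 1: one fixed shape for all points
p_all <- ggplot(heightweight, aes(x=ageYear, y=heightIn)) +
  geom_point(shape=3)

# Step 2: shape mapped to sex, with the shapes chosen manually
p_manual <- ggplot(heightweight, aes(x=ageYear, y=heightIn, shape=sex)) +
  geom_point(size=3) +
  scale_shape_manual(values=c(1, 4))

print(p_all)
print(p_manual)
```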
Step 3: It’s possible to have the shape represent one variable and the fill (empty or solid) represent another variable. This is done a little indirectly, by choosing shapes that have both color and fill, and a color palette that includes NA and another color (the NA will result in a hollow shape). For example, we’ll take the heightweight data set and add another column that indicates whether the child weighed 100 pounds or more.
In the image above, the weights are binned into two categories (groups), and the category labels are supplied as a vector that is then used when plotting the graph.
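A sketch of this approach, under a few assumptions: the new column is called weightGroup, the labels are "< 100" and ">= 100", and shapes 21 and 24 are used because they have both an outline and a fill. The na.value=NA argument is added here so that the NA fill stays empty rather than being replaced by the scale's default missing-value color.

```r
library(ggplot2)
library(gcookbook)  # assumed source of the heightweight data set

# Copy the data and add a column marking children of 100 pounds or more
hw <- heightweight
hw$weightGroup <- cut(hw$weightLb, breaks=c(-Inf, 100, Inf),
                      labels=c("< 100", ">= 100"), right=FALSE)

# Shape shows sex; fill (hollow or black) shows the weight group
p_fill <- ggplot(hw, aes(x=ageYear, y=heightIn, shape=sex, fill=weightGroup)) +
  geom_point(size=2.5) +
  scale_shape_manual(values=c(21, 24)) +
  scale_fill_manual(values=c(NA, "black"), na.value=NA,
                    guide=guide_legend(override.aes=list(shape=21)))

print(p_fill)
```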
Mapping a Continuous Variable to Color or Size
Step 1: When there are more than two variables to show in a scatter plot, the additional variables can be mapped to other aesthetics, such as color and size.
In this example we have taken the heightweight data set, from which we have chosen sex, ageYear and weightLb.
Step 2: In the image below, no color is specified inside ggplot(); only size is mapped, so the legend shows weightLb.
Step 3: When it comes to color, there are two aesthetics that can be used: fill and colour. In the image below, the points are filled using scale_fill_gradient(), with the gradient running from low to high values, low being black and high being white. The legend is set with guide_legend(), which results in a discrete legend instead of a color bar.
Step 4: Mapping a continuous variable to one aesthetic does not prevent us from mapping a categorical variable to other aesthetics, as the next example shows. Because there is a lot of overplotting, we set alpha to 0.5 to make the points semitransparent so that overlapping points remain visible.
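The steps above can be sketched as follows. The choice of axes (ageYear and heightIn, or weightLb and heightIn), the filled shape 21, scale_size_area() and the "Set1" palette are assumptions made for illustration.

```r
library(ggplot2)
library(gcookbook)  # assumed source of the heightweight data set

# Step 2: continuous variable mapped to size only
p_size <- ggplot(heightweight, aes(x=ageYear, y=heightIn, size=weightLb)) +
  geom_point()

# Step 3: continuous variable mapped to fill, black (low) to white (high),
# with a discrete legend instead of a color bar
p_fillgrad <- ggplot(heightweight, aes(x=weightLb, y=heightIn, fill=ageYear)) +
  geom_point(shape=21, size=2.5) +
  scale_fill_gradient(low="black", high="white", guide=guide_legend())

# Step 4: continuous variable on size, categorical variable on colour,
# semitransparent points to reduce overplotting
p_both <- ggplot(heightweight,
                 aes(x=ageYear, y=heightIn, size=weightLb, colour=sex)) +
  geom_point(alpha=.5) +
  scale_size_area() +
  scale_colour_brewer(palette="Set1")

print(p_size)
print(p_fillgrad)
print(p_both)
```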
Dealing with Overplotting
Step 1: A scatter plot with many points, say more than 10,000, becomes heavily overplotted. In our case the plot contains about 54,000 points, as we can see in the image below. The data comes from the diamonds data set, with carat mapped to x and price mapped to y.
Step 2: In the images below, the alpha value is reduced, and then reduced again, to make the points more transparent. Because the graph is so heavily overplotted, there is only a slight difference between the two transparency levels.
Step 3: Another solution to overplotting is to bin the points into rectangles and map the density of the points to the fill color of the rectangles.
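A sketch of steps 1 and 2 using the diamonds data set that ships with ggplot2. The alpha values .1 and .01 are assumptions; the text only says that alpha is reduced twice.

```r
library(ggplot2)  # the diamonds data set is included in ggplot2

sp <- ggplot(diamonds, aes(x=carat, y=price))

# Step 1: every point fully opaque, heavily overplotted
p_opaque <- sp + geom_point()

# Step 2: progressively more transparent points (illustrative alpha values)
p_alpha10 <- sp + geom_point(alpha=.1)
p_alpha01 <- sp + geom_point(alpha=.01)

print(p_opaque)
print(p_alpha10)
print(p_alpha01)
```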
With the binned visualization, the vertical bands are barely visible. The density of points in the lower-left corner is much greater, which tells us that the vast majority of diamonds are small and inexpensive. By default, stat_bin_2d() divides the space into 30 groups in the x and y directions, for a total of 900 bins. In the second version, we increase the number of bins with bins=50. In the second image, scale_fill_gradient() is set with low and high as light blue and red respectively, and the limits are set to the range 0 to 6000.
Step 4: Another solution to overplotting is to use the hexbin package, which is similar to rectangular binning except that the data is binned into hexagons instead of rectangles. For both of these methods, if you manually specify the range, and there is a bin that falls outside that range because it has too many or too few points, that bin will show up as grey rather than the color at the high or low end of the range, as seen in the graph on the right.
Step 5: Overplotting may also occur when the data is discrete on both axes. Please refer to the code and the image given below:
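The two binned versions described above can be sketched like this, using the same carat and price mapping on the diamonds data set; the object names are illustrative.

```r
library(ggplot2)  # the diamonds data set is included in ggplot2

sp <- ggplot(diamonds, aes(x=carat, y=price))

# Default binning: 30 bins in each direction
p_bin30 <- sp + stat_bin_2d()

# More bins, with a light blue to red gradient limited to 0-6000
p_bin50 <- sp + stat_bin_2d(bins=50) +
  scale_fill_gradient(low="lightblue", high="red", limits=c(0, 6000))

print(p_bin30)
print(p_bin50)
```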
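A minimal sketch of hexagonal binning, assuming the hexbin package is installed. The limits of 0 to 8000 and 0 to 5000 are illustrative assumptions; with a narrower range, any bin whose count falls outside it is drawn in grey.

```r
library(ggplot2)  # the diamonds data set is included in ggplot2
library(hexbin)   # required by geom_hex()

sp <- ggplot(diamonds, aes(x=carat, y=price))

# Hexagonal bins with a wide fill range (illustrative limits)
p_hex_wide <- sp + geom_hex() +
  scale_fill_gradient(low="lightblue", high="red", limits=c(0, 8000))

# A narrower range: bins with counts outside it are shown in grey
p_hex_narrow <- sp + geom_hex() +
  scale_fill_gradient(low="lightblue", high="red", limits=c(0, 5000))

print(p_hex_wide)
print(p_hex_narrow)
```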
In this case we can jitter the data points using position_jitter(), which spreads the values randomly around their original positions so that the graph is easier to read. The graph on the left-hand side shows the data points with the default jitter, and the second image shows the result when the jitter width and height are specified explicitly in position_jitter().
Step 6: You can also use a box plot if you have one discrete and one continuous axis.
With the ChickWeight data, the x-axis is conceptually discrete, but since it is stored numerically, ggplot() doesn’t know how to group the data for each box. So we have to tell it how to group the data.
So in geom_boxplot() we add an aesthetic mapping that groups the data by Time.
If we do not specify the group, there will be a single box, as shown in the right-side image.
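A sketch of jittering, assuming the built-in ChickWeight data set with Time on x and weight on y. The width and height values are assumptions; here only the x direction is jittered, so the weight values themselves are left unchanged.

```r
library(ggplot2)

sp1 <- ggplot(ChickWeight, aes(x=Time, y=weight))

# No jitter: points stack up at each Time value
p_raw <- sp1 + geom_point()

# Default jitter in both directions
p_jitter <- sp1 + geom_point(position="jitter")

# Explicit jitter amounts (illustrative): spread along x only
p_jitter_x <- sp1 + geom_point(position=position_jitter(width=.5, height=0))

print(p_raw)
print(p_jitter)
print(p_jitter_x)
```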
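The grouped and ungrouped box plots can be sketched as follows, using the built-in ChickWeight data set. The ungrouped version is expected to produce a warning about the continuous x aesthetic.

```r
library(ggplot2)

sp1 <- ggplot(ChickWeight, aes(x=Time, y=weight))

# One box per Time value: tell ggplot how to group the data
p_grouped <- sp1 + geom_boxplot(aes(group=Time))

# Without the group aesthetic, all the data ends up in a single box
p_single <- sp1 + geom_boxplot()

print(p_grouped)
print(p_single)
```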
Adding Fitted Regression Model Lines
This example shows how to fit a regression line to a scatter plot.
Step 1: To add a linear regression line to a scatter plot, add stat_smooth() and pass method=lm as a parameter.
library(ggplot2)
library(gcookbook) # For the data set
# The base plot
sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn))
sp + geom_point() + stat_smooth(method=lm)
Setting method=lm instructs stat_smooth() to fit the data with the lm() (linear model) function. In the code above, we first save the base plot object in sp and then add different components to it.
Step 2: By default, stat_smooth() also adds a 95% confidence region for the regression fit. The confidence interval can be changed by setting level, or it can be disabled with se=FALSE.
# 99% confidence region
sp + geom_point() + stat_smooth(method=lm, level=0.99)
# No confidence region
sp + geom_point() + stat_smooth(method=lm, se=FALSE)
Step 3: The linear regression line is not the only way of fitting a model to the data; in fact, it’s not even the default. If you add stat_smooth() without specifying the method, it will use a loess (locally weighted polynomial) curve for data sets with fewer than 1,000 observations, as is the case here.
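A minimal sketch of the loess fit, requested explicitly with method="loess" so that the choice does not depend on the default. It assumes the heightweight data set from gcookbook.

```r
library(ggplot2)
library(gcookbook)  # assumed source of the heightweight data set

sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn))

# Loess curve with its confidence region
p_loess <- sp + geom_point() + stat_smooth(method="loess")

print(p_loess)
```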
Step 4: If your scatter plot has points grouped by a factor, using colour or shape, one fit line will be drawn for each group. First we’ll make the base plot object sps, then we’ll add the loess lines to it. We’ll also make the points less prominent by making them semitransparent, using alpha=.4.
sps <- ggplot(heightweight, aes(x=ageYear, y=heightIn, colour=sex)) +
    geom_point(alpha=.4)
sps + geom_smooth()
Adding Marginal Rugs to a Scatter Plot
To add marginal rugs to a scatter plot, use geom_rug(). We use the faithful data set for this example.
Here we have used geom_point() for the scatter points and geom_rug() for the marginal rugs.
ggplot(faithful, aes(x=eruptions, y=waiting)) + geom_point() + geom_rug()
In the second example we have set the position of geom_rug() to jitter and the size to .2, so the rug lines in the margins are jittered and thinner, as seen in the image below.
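The second example might look like this. It uses size=.2 as described in the text; note that recent ggplot2 versions prefer the linewidth argument for line thickness and may warn about size.

```r
library(ggplot2)

# Jittered, thinner rug lines on the built-in faithful data set
p_rug <- ggplot(faithful, aes(x=eruptions, y=waiting)) +
  geom_point() +
  geom_rug(position="jitter", size=.2)  # newer ggplot2: linewidth=.2

print(p_rug)
```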
Labelling Points in a Scatter Plot
Step 1: We can use geom_text() to label the points. For example, we can map the label aesthetic to a name column and set the size of the text.
In the image above I have used annotate() and passed the label names, and therefore the names USA and Canada are shown next to their points.
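Both labelling approaches can be sketched as below. The data set is not named in the text, so this example assumes the countries data set from gcookbook, filtered to 2009 and to countries with health expenditure above 2000; the annotation coordinates are approximate values picked by eye for illustration.

```r
library(ggplot2)
library(gcookbook)  # assumed source of the countries data set

# Illustrative subset: one year, higher-spending countries only
rich <- subset(countries, Year==2009 & healthexp>2000)

sp_c <- ggplot(rich, aes(x=healthexp, y=infmortality)) +
  geom_point()

# Label every point from the Name column
p_text <- sp_c + geom_text(aes(label=Name), size=4)

# Label only selected points by hand (approximate coordinates)
p_annot <- sp_c +
  annotate("text", x=4350, y=5.4, label="Canada") +
  annotate("text", x=7400, y=6.8, label="USA")

print(p_text)
print(p_annot)
```

geom_text() labels every point automatically, while annotate() suits cases where only a few points need a name.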