Summarized Data Distributions

Data distributions come up constantly in statistics. The plots in this section are graphical methods of organizing and displaying how the values of a variable are distributed.

Making a Basic Histogram:

Histograms display data in ranges, with each bar representing a range of numeric values. The height of the bar tells you the frequency of values that fall within that range.

Step 1: Install and load the ggplot2 and gcookbook packages. ggplot2 provides the plotting functions, and gcookbook provides some of the data sets used below.
To create a histogram, use geom_histogram() and map a continuous variable to x. All geom_histogram() requires is one column from a data frame or a single vector of values.
Type in the command ggplot(faithful, aes(x=waiting)) + geom_histogram(), which uses the faithful data set.
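As a minimal sketch of the step above, the following assumes ggplot2 is already installed (for example with install.packages("ggplot2")). The object names p, p2 and w are only illustrative. The second plot wraps a bare vector in a data frame, which is one simple way to plot a single vector.

```r
# Assumes ggplot2 is installed; faithful ships with R (datasets package).
library(ggplot2)

# Histogram of one column of a data frame
p <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram()   # ggplot2 reports that it is using the default of 30 bins

# Histogram of a single vector: wrap it in a data frame first
w <- faithful$waiting
p2 <- ggplot(data.frame(w = w), aes(x = w)) +
  geom_histogram()

print(p)    # draw the first plot
print(p2)   # draw the second plot
```

Both plots should look the same, since they are built from the same values.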


Step 2:
By default, the data is grouped into 30 bins. This may be too fine or too coarse for your data. You can change the size of the bins by using binwidth, or you can divide the range of the data into a specific number of bins.
The default colors, a dark fill without an outline, can make it difficult to see which bar corresponds to which value, so you can also change the colors.

To set the width of each bin to 5, type in ggplot(faithful, aes(x=waiting)) + geom_histogram(binwidth=5, fill="white", colour="black")


To divide the x range into 15 bins, first compute the bin width with binsize <- diff(range(faithful$waiting))/15. Here range() returns the minimum and maximum of the data, and diff() calculates the differences between consecutive values of a vector, so the result is the width of the data range divided by 15.
Then enter the command ggplot(faithful, aes(x=waiting)) + geom_histogram(binwidth=binsize, fill="white", colour="black")
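The bin-width arithmetic can be checked on its own in base R. This is a small sketch: the names rng, span and breaks are illustrative, and the break points only show what 15 equal-width bins over the data range look like. geom_histogram() places its bins by its own rules, so its bar edges need not match these exactly.

```r
# Base R only; faithful ships with R in the datasets package.
rng <- range(faithful$waiting)   # c(minimum, maximum)
span <- diff(rng)                # maximum minus minimum
binsize <- span / 15             # width that gives 15 equal bins

# 16 equally spaced edges delimit 15 bins (illustration only)
breaks <- seq(rng[1], rng[2], length.out = 16)

print(binsize)
print(breaks)
```

Recent versions of ggplot2 also accept a bins argument, as in geom_histogram(bins=15), which asks for a number of bins directly instead of a width.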


Making Multiple Histograms from Grouped Data:

Step 1: Load the MASS package with library(MASS) so that its data sets, including birthwt, are available. Use smoke as the faceting variable.
Type in the command ggplot(birthwt, aes(x=bwt)) + geom_histogram(fill="white", colour="black") + facet_grid(smoke ~ .), where facet_grid() splits the data by one or two variables.


Step 2: One problem with the faceted graph is that the facet labels are just 0 and 1, and there’s no label indicating that those values are for smoke.
To change the labels, you need to change the names of the factor levels. First take a look at the factor levels, then assign new factor level names, in the same order.

Make a copy of the data, birthwt1 <- birthwt
Convert smoke to a factor, birthwt1$smoke <- factor(birthwt1$smoke)
Know the levels of the factor, levels(birthwt1$smoke)
Load the plyr package for the revalue() function, library(plyr). revalue() replaces specified values with new values in a factor or character vector.
Assign new names to the labels, birthwt1$smoke <- revalue(birthwt1$smoke, c("0"="No Smoke", "1"="Smoke"))
When you plot it again, it shows the new labels, ggplot(birthwt1, aes(x=bwt)) + geom_histogram(fill="white", colour="black") + facet_grid(smoke ~ .)
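If plyr is not available, the same relabeling can be sketched in base R by supplying labels when the factor is created. This is an alternative to the revalue() approach described above, not a replacement for it. It assumes the MASS package, which provides birthwt.

```r
# Assumes MASS is installed (it ships with standard R distributions).
library(MASS)

birthwt1 <- birthwt   # work on a copy

# Convert smoke to a factor and name the levels in one step.
# levels gives the original codes, labels gives their new names, in the same order.
birthwt1$smoke <- factor(birthwt1$smoke,
                         levels = c(0, 1),
                         labels = c("No Smoke", "Smoke"))

print(levels(birthwt1$smoke))

# Cross-tabulate old codes against new labels to confirm the mapping
print(table(original = birthwt$smoke, relabeled = birthwt1$smoke))
```

The cross-tabulation should have non-zero counts only where 0 meets "No Smoke" and 1 meets "Smoke".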


To allow the y scales to be resized independently, use scales="free". With this layout, only the y scales become free. The x scales will still be fixed because the histograms are aligned with respect to that axis.
Enter ggplot(birthwt1, aes(x=bwt)) + geom_histogram(fill="white", colour="black") + facet_grid(smoke ~ ., scales="free")


Step 3: Another approach is to map the grouping variable to fill. The grouping variable must be a factor or character vector.
Map smoke to fill, make the bars NOT stacked, and make them semitransparent. Enter ggplot(birthwt1, aes(x=bwt, fill=smoke)) + geom_histogram(position="identity", alpha=0.4)


Making a Density Curve:

Step 1: Here you need to use the geom_density() function and map a continuous variable to x. Enter the command ggplot(faithful, aes(x=waiting)) + geom_density() where faithful is the data set.


Step 2: If you don’t like the lines along the side and bottom, you can use geom_line(stat="density").
Type in ggplot(faithful, aes(x=waiting)) + geom_line(stat="density") + expand_limits(y=0), where expand_limits() increases the y range to include the value 0.


Step 3: A kernel density curve is an estimate of the population distribution, based on the sample data.
The amount of smoothing depends on the kernel bandwidth: the larger the bandwidth, the more smoothing there is.
The bandwidth can be set with the adjust parameter, which has a default value of 1.
Enter the command ggplot(faithful, aes(x=waiting)) + geom_line(stat="density", adjust=.25, colour="red") + geom_line(stat="density") + geom_line(stat="density", adjust=2, colour="blue") to see what happens with a smaller and a larger value of adjust.
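To see what adjust does numerically, here is a small base R sketch using stats::density(), which takes an adjust argument with the same meaning (a multiplier on the default bandwidth). It is an assumption here that ggplot2's density stat behaves the same way internally, and the object names are illustrative.

```r
# Base R only; density() is in the stats package.
d_narrow  <- density(faithful$waiting, adjust = 0.25)  # less smoothing
d_default <- density(faithful$waiting)                 # adjust = 1
d_wide    <- density(faithful$waiting, adjust = 2)     # more smoothing

# $bw is the bandwidth actually used, after applying adjust
bandwidths <- c(narrow  = d_narrow$bw,
                default = d_default$bw,
                wide    = d_wide$bw)
print(bandwidths)
```

The three bandwidths should be in the ratio 0.25 : 1 : 2, which is why the red curve in the plot is wigglier and the blue curve is smoother than the default.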


Step 4: In the previous example, the x range is automatically set so that it contains the data, but this results in the edge of the curve getting clipped. To show more of the curve, set the x limits.
Enter ggplot(faithful, aes(x=waiting)) + geom_density(fill="blue", alpha=.2) + xlim(35, 105) in the console.


Step 5: To compare the theoretical and observed distributions, you can overlay the density curve with the histogram. Since the y values for the density curve are small, it would be barely visible if you overlaid it on a histogram without any transformation. To solve this problem, you can scale down the histogram to match the density curve with the mapping y=..density.. . Here we’ll add geom_histogram() first, and then layer geom_density() on top.

Type in the command ggplot(faithful, aes(x=waiting, y=..density..)) + geom_histogram(fill="cornsilk", colour="grey60", size=.2) + geom_density() + xlim(35, 105) in the console.
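Newer releases of ggplot2 deprecate the ..density.. notation in favor of after_stat(density), and use linewidth rather than size for line thickness. The sketch below shows the same overlay written that way. It assumes ggplot2 3.4.0 or later, and on older versions the form in the text is the one to use.

```r
# Assumes ggplot2 >= 3.4.0 (for after_stat() and the linewidth argument).
library(ggplot2)

p <- ggplot(faithful, aes(x = waiting, y = after_stat(density))) +
  # Histogram drawn on the density scale, so bar areas sum to 1
  geom_histogram(fill = "cornsilk", colour = "grey60", linewidth = 0.2) +
  # Kernel density curve layered on top, on the same y scale
  geom_density() +
  # Widen the x range so the tails of the curve are not clipped
  xlim(35, 105)

print(p)
```

Because both layers share the y = after_stat(density) mapping, the histogram and the curve are on the same scale and can be compared directly.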


Making Multiple Density Curves from Grouped Data:

Step 1: You have already loaded the MASS library, made a copy of the birthwt data set, and converted the numeric variable smoke to a factor.
To map smoke to colour, enter ggplot(birthwt1, aes(x=bwt, colour=smoke)) + geom_density()


Step 2: Now map smoke to fill and make the fill semitransparent by setting alpha. Type in ggplot(birthwt1, aes(x=bwt, fill=smoke)) + geom_density(alpha=.3)


Making a Frequency Polygon:

Step 1: Use geom_freqpoly() to create a frequency polygon. Enter the command ggplot(faithful, aes(x=waiting)) + geom_freqpoly() in the console.


Step 2: As with a histogram, you can control the bin width for the frequency polygon. Type in ggplot(faithful, aes(x=waiting)) + geom_freqpoly(binwidth=4)


Step 3: Instead of setting the width of each bin directly, you can also divide the x range into a particular number of bins.
binsize <- diff(range(faithful$waiting))/15
ggplot(faithful, aes(x=waiting)) + geom_freqpoly(binwidth=binsize)


Making a Basic Box Plot:

Step 1: For making a box plot, use the geom_boxplot() function, mapping a continuous variable to y and a discrete variable to x.
Use the command ggplot(birthwt, aes(x=factor(race), y=bwt)) + geom_boxplot(), where factor() converts the numeric variable race to a discrete one.


Step 2: To change the width of the boxes, pass the width argument to geom_boxplot().
Type in, ggplot(birthwt, aes(x=factor(race), y=bwt)) + geom_boxplot(width=.5)


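Step 3: If there are many outliers and there is overplotting, you can change the size and shape of the outlier points with outlier.size and outlier.shape. The default size is 2 and the default shape is 16 in the ggplot2 version this tutorial was written for; the defaults may differ in your installed version, so check ?geom_boxplot.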
Enter the command ggplot(birthwt, aes(x=factor(race), y=bwt)) + geom_boxplot(outlier.size=4, outlier.shape=21)


Adding Notches to a Box Plot:

You add notches to a box plot to assess whether the medians are different. Use geom_boxplot() and set notch=TRUE.
Type in, ggplot(birthwt, aes(x=factor(race), y=bwt)) + geom_boxplot(notch=TRUE)
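For a rough numeric view of what the notches show, base R's boxplot.stats() returns notch limits in its conf component, computed as the median plus or minus 1.58 * IQR / sqrt(n). The sketch below applies it per race group. It assumes MASS for the birthwt data, and ggplot2's notches may differ slightly because it can compute the quartiles differently.

```r
# Assumes MASS is installed; boxplot.stats() is in base R (grDevices).
library(MASS)

# Lower and upper notch limits for each race group
notch_limits <- tapply(birthwt$bwt, birthwt$race,
                       function(x) boxplot.stats(x)$conf)

# Group medians, for reference
medians <- tapply(birthwt$bwt, birthwt$race, median)

print(medians)
print(notch_limits)
```

If the notch intervals of two groups do not overlap, that is commonly read as evidence that their medians differ.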


Adding Means to a Box Plot:

To add markers for the mean to a box plot, use stat_summary(). In ggplot2 3.3.0 and later, the fun.y argument is deprecated in favor of fun.
ggplot(birthwt, aes(x=factor(race), y=bwt)) + geom_boxplot() + stat_summary(fun.y="mean", geom="point", shape=23, size=3, fill="white")
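The following sketch writes the same plot with the newer fun argument and computes the group means directly, so the marker positions can be checked against plain numbers. It assumes ggplot2 3.3.0 or later and the MASS package, and the name group_means is illustrative.

```r
# Assumes ggplot2 >= 3.3.0 (for the fun argument) and MASS (for birthwt).
library(ggplot2)
library(MASS)

p <- ggplot(birthwt, aes(x = factor(race), y = bwt)) +
  geom_boxplot() +
  # One diamond per group, placed at that group's mean of bwt
  stat_summary(fun = mean, geom = "point",
               shape = 23, size = 3, fill = "white")

# The same means computed directly, for comparison with the markers
group_means <- tapply(birthwt$bwt, birthwt$race, mean)

print(group_means)
print(p)
```

The diamonds in the plot should sit at the values printed in group_means.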


Making a Violin Plot:

Violin plots are used to compare density estimates of different groups. Use the geom_violin() function.
ggplot(heightweight, aes(x=sex, y=heightIn)) + geom_violin()


Making a Dot Plot:

A dot plot shows each data point as an individual dot. Use the geom_dotplot() function.
Use a subset of the countries data set, countries2009 <- subset(countries, Year==2009 & healthexp>2000)
Then type in the command, ggplot(countries2009, aes(x=infmortality)) + geom_dotplot()


Making Multiple Dot Plots for Grouped Data:

Step 1: To compare multiple groups, it’s possible to stack the dots along the y-axis, and group them along the x-axis, by setting binaxis="y".
Type in ggplot(heightweight, aes(x=sex, y=heightIn)) + geom_dotplot(binaxis="y", binwidth=.5, stackdir="center"), where heightweight is the data set used.


Step 2: Dot plots are sometimes overlaid on box plots. In these cases, it may be helpful to make the dots hollow and have the box plots not show outliers, since the outlier points will be shown as part of the dot plot.
Type in ggplot(heightweight, aes(x=sex, y=heightIn)) + geom_boxplot(outlier.colour=NA, width=.4) + geom_dotplot(binaxis="y", binwidth=.5, stackdir="center", fill=NA)


Step 3: It’s also possible to show the dot plots next to the box plots. This requires a bit of a hack: treat the x variable as numeric, then subtract or add a small quantity to shift the box plots and dot plots left and right.
ggplot(heightweight, aes(x=sex, y=heightIn)) + geom_boxplot(aes(x=as.numeric(sex) + .2, group=sex), width=.25) + geom_dotplot(aes(x=as.numeric(sex) - .2, group=sex), binaxis="y", binwidth=.5, stackdir="center") + scale_x_continuous(breaks=1:nlevels(heightweight$sex), labels=levels(heightweight$sex))

The scale_x_continuous() function is used to show x tick labels as text corresponding to the factor levels, since the x-axis is treated as numeric.


Making a Density Plot of Two-Dimensional Data:

Step 1: To plot the density of two-dimensional (2D) data, use the stat_density2d() function. This makes a 2D kernel density estimate from the data. First plot the density contour along with the data points.
Enter the command, ggplot(faithful, aes(x=eruptions, y=waiting)) + geom_point() + stat_density2d()


Step 2: The two-dimensional kernel density estimate is analogous to the one-dimensional density estimate generated by stat_density(). Map the density estimate to the fill color, or to the transparency of the tiles.
ggplot(faithful, aes(x=eruptions, y=waiting)) + stat_density2d(aes(fill=..density..), geom="raster", contour=FALSE)


Step 3: It is also possible to use tiles instead of raster.
ggplot(faithful, aes(x=eruptions, y=waiting)) + geom_point() + stat_density2d(aes(alpha=..density..), geom="tile", contour=FALSE)
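To look at the numbers behind a 2D density plot, the estimate can be computed directly with kde2d() from MASS, which the ggplot2 documentation names as the estimator behind stat_density2d(). This is a sketch: the grid size of 100 and the name dens are illustrative choices, and the grid will not necessarily match the one ggplot2 uses.

```r
# Assumes MASS is installed; faithful ships with R.
library(MASS)

# 2D kernel density estimate on a 100 x 100 grid
dens <- kde2d(faithful$eruptions, faithful$waiting, n = 100)

# dens$x and dens$y are the grid coordinates; dens$z holds the density values
str(dens)

# Base R view of the same surface, with the data points on top
image(dens, xlab = "eruptions", ylab = "waiting")
points(faithful$eruptions, faithful$waiting, pch = 16, cex = 0.4)
```

The z matrix is what the raster and tile plots above encode as fill or transparency.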

