Visualizing Data

The Histogram

Graphing Data

The Histogram

• Cross Tabulation

• Drawing a Histogram

• Recognizing and Using a Histogram

• The Density Scale

• Types of Variables

• Controlling for a Variable

• Selective Breeding

Cross Tabulation

• Crosstabs are heavily used in survey research, business intelligence, engineering, and scientific research.

• Crosstabs provide a basic picture of the interrelation between two variables and can help find interactions between them.

• Most general-purpose statistical software programs are able to produce simple crosstabs.
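
As a rough illustration of the points above, a crosstab can be sketched in a few lines of Python by counting how often each pair of values occurs together. The function name `crosstab` and the six survey records are hypothetical and are not data from this chapter.

```python
from collections import Counter

def crosstab(rows, cols):
    """Count how often each (row value, column value) pair occurs."""
    counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    # Counter returns 0 for pairs that never occur, so every cell is filled.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

# Hypothetical survey records: gender and handedness of six respondents.
gender = ["male", "female", "male", "female", "male", "female"]
handedness = ["right", "right", "left", "right", "right", "left"]

table = crosstab(gender, handedness)
for row_label, row in table.items():
    print(row_label, row)
```

Each cell holds the number of records that share one row value and one column value, which is the basic picture of interrelation a crosstab is meant to give.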

Drawing a Histogram

• There is no “best” number of bars, and different bar sizes may reveal different features of the data.

• A convenient starting point for the first interval is a lower value carried out to one more decimal place than the value with the most decimal places.

• To calculate the width of the intervals, subtract the starting point from the ending value and divide by the number of bars.
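
The width calculation above can be sketched as follows. This is a minimal example, assuming a hypothetical list of heights recorded as whole numbers, so the starting point is carried out to one more decimal place (59.5). The function name `histogram_counts` and the choice of five bars are illustrative only.

```python
def histogram_counts(data, start, end, n_bars):
    """Return the interval width and the count of values in each interval."""
    # Width = (ending value - starting point) / number of bars.
    width = (end - start) / n_bars
    counts = [0] * n_bars
    for x in data:
        # Values outside [start, end] are ignored in this sketch.
        if start <= x <= end:
            # A value equal to the ending value goes in the last interval.
            index = min(int((x - start) / width), n_bars - 1)
            counts[index] += 1
    return width, counts

# Hypothetical heights in inches.
heights = [60, 61, 63, 64, 64, 66, 67, 67, 68, 70, 72, 74]
width, counts = histogram_counts(heights, start=59.5, end=74.5, n_bars=5)
print(width, counts)
```

Because the starting point has one more decimal place than the data, no value falls exactly on an interval boundary. Trying a different number of bars is a quick way to see how bar size changes the features the histogram reveals.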

Recognizing and Using a Histogram

• First introduced by Karl Pearson, a histogram is an estimate of the probability distribution of a continuous variable.

• If the distribution of <em>X</em> is continuous, then <em>X</em> is called a continuous random variable and, therefore, has a continuous probability distribution.

• An advantage of a histogram is that it can readily display large data sets (a rule of thumb is to use a histogram when the data set consists of 100 values or more).

The Density Scale

• The unobservable density function is thought of as the density according to which a large population is distributed. The data are usually thought of as a random sample from that population.

• A probability density function, or density of a continuous random variable, is a function that describes the relative likelihood for a random variable to take on a given value.

• Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel.
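
The following sketch shows two ways of estimating a density from a sample. The first puts histogram bars on the density scale, so that the bar areas sum to 1. The second is a simple kernel density estimate. The Gaussian kernel, the bandwidth of 1.5, the function names, and the data are all assumptions made for illustration.

```python
import math

def density_heights(counts, width):
    """Bar heights on the density scale: relative frequency divided by width."""
    n = sum(counts)
    return [c / (n * width) for c in counts]

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(data)

    def estimate(x):
        total = 0.0
        for xi in data:
            u = (x - xi) / bandwidth
            # Standard normal kernel centered on each data point.
            total += math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        return total / (n * bandwidth)

    return estimate

# Hypothetical interval counts and interval width.
counts = [2, 3, 4, 1, 2]
width = 3.0
heights = density_heights(counts, width)
total_area = sum(h * width for h in heights)  # bar areas sum to 1

# Hypothetical sample; the bandwidth is an arbitrary choice.
sample = [60, 61, 63, 64, 64, 66, 67, 67, 68, 70, 72, 74]
kde = gaussian_kde(sample, bandwidth=1.5)
print(total_area, kde(66.0))
```

The histogram estimate is a step function, while the kernel estimate is smooth because the Gaussian kernel itself is smooth.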

Types of Variables

• Numeric (quantitative) variables have values that describe a measurable quantity as a number, like “how many” or “how much”.

• A continuous variable is an observation that can take any value within a certain range of real numbers.

• A discrete variable is an observation that can take a value based on a count from a set of distinct whole values.

• Categorical variables have values that describe a “quality” or “characteristic” of a data unit, like “what type” or “which category”.

• An ordinal variable is an observation that can take a value that can be logically ordered or ranked.

• A nominal variable is an observation that can take a value that is not able to be organized in a logical sequence.

Controlling for a Variable

• Variables refer to measurable attributes, as these typically vary over time or between individuals.

• Temperature is an example of a continuous variable, while the number of legs of an animal is an example of a discrete variable.

• In causal models, a distinction is made between “independent variables” and “dependent variables,” the latter being expected to vary in value in response to changes in the former.

• While independent variables can refer to quantities and qualities that are under experimental control, they can also include extraneous factors that influence results in a confusing or undesired manner.

• The essence of controlling is to ensure that comparisons between the control group and the experimental group are only made for groups or subgroups for which the variable to be controlled has the same statistical distribution.
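
One way to picture this kind of controlling is to compare the two groups only within subgroups that share the same value of the controlled variable. The sketch below assumes hypothetical records of the form (subgroup, group, outcome). The function name `compare_within_strata` and the numbers are illustrative, not results from any experiment.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def compare_within_strata(records):
    """Difference in mean outcome (experimental minus control) per subgroup."""
    strata = defaultdict(lambda: {"control": [], "experimental": []})
    for stratum, group, outcome in records:
        strata[stratum][group].append(outcome)
    differences = {}
    for stratum, groups in strata.items():
        # Only compare subgroups that contain both kinds of observation.
        if groups["control"] and groups["experimental"]:
            differences[stratum] = mean(groups["experimental"]) - mean(groups["control"])
    return differences

# Hypothetical records: the controlled variable is the brand (A or B).
records = [
    ("A", "control", 10), ("A", "control", 12),
    ("A", "experimental", 15), ("A", "experimental", 17),
    ("B", "control", 20), ("B", "control", 22),
    ("B", "experimental", 24), ("B", "experimental", 26),
]
differences = compare_within_strata(records)
print(differences)
```

Comparing within each brand keeps a difference between brands from being mistaken for an effect of the treatment.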

Selective Breeding

• Unwittingly, humans have carried out evolution experiments for as long as they have been domesticating plants and animals.

• More recently, evolutionary biologists have realized that the key to successful experimentation lies in extensive parallel replication of evolving lineages as well as a larger number of generations of selection.

• Because of the large number of generations required for adaptation to occur, evolution experiments are typically carried out with microorganisms such as bacteria, yeast, or viruses.

Graphing Data

• Statistical Graphics

• Stem-and-Leaf Displays

• Reading Points on a Graph

• Plotting Points on a Graph

• Slope and Intercept

• Plotting Lines

• The Equation of a Line

Statistical Graphics

• Graphical statistical methods explore the content of a data set.

• Graphical statistical methods are used to find structure in data.

• Graphical statistical methods check assumptions in statistical models.

• Graphical statistical methods communicate the results of an analysis.

Stem-and-Leaf Displays

• Stem-and-leaf displays are useful for displaying the relative density and shape of the data, giving the reader a quick overview of distribution.

• They retain (most of) the raw numerical data, often with perfect integrity. They are also useful for highlighting outliers and finding the mode.

• With very small data sets, a stem-and-leaf display can be of little use, as a reasonable number of data points is required to establish definitive distribution properties.

• With very large data sets, a stem-and-leaf display will become very cluttered, since each data point must be represented numerically.
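
A stem-and-leaf display can be sketched for non-negative whole numbers by using the tens part as the stem and the units digit as the leaf. The function name `stem_and_leaf` and the miles-per-gallon values below are hypothetical and are not the EPA data referred to in this chapter.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines with the tens part as stem and units digit as leaf."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Hypothetical miles-per-gallon values.
mpg = [18, 21, 22, 22, 25, 27, 31, 33, 34, 47]
lines = stem_and_leaf(mpg)
for line in lines:
    print(line)
```

Every original value can be read back from its stem and leaf, and the length of each row shows the shape of the distribution.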

Reading Points on a Graph

• The interconnected objects are represented by mathematical abstractions called vertices.

• The links that connect some pairs of vertices are called edges.

• Vertices are also called nodes or points, and edges are also called lines or arcs.

Plotting Points on a Graph

• Graphs are a visual representation of the relationship between variables, very useful because they allow us to quickly derive an understanding which would not come from lists of values.

• Quantitative techniques are the set of statistical procedures that yield numeric or tabular output.

• Examples include hypothesis testing, analysis of variance, point estimates and confidence intervals, and least squares regression.

• There are also many statistical tools generally referred to as graphical techniques, which include: scatter plots, histograms, probability plots, residual plots, box plots, and block plots.

Slope and Intercept

• The slope or gradient of a line describes its steepness, incline, or grade — with a higher slope value indicating a steeper incline.

• The slope of a line in the plane containing the x and y axes is generally represented by the letter m, and is defined as the change in the y coordinate divided by the corresponding change in the x coordinate, between two distinct points on the line.

• Using the common convention that the horizontal axis represents a variable x and the vertical axis represents a variable y, a y-intercept is a point where the graph of a function or relation intersects with the y-axis of the coordinate system.

• Analogously, an x-intercept is a point where the graph of a function or relation intersects with the x-axis.
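
The definitions above translate directly into a short calculation from two distinct points on a line. This is a minimal sketch. The function names and the points (1, 3) and (3, 7) are chosen only for illustration.

```python
def slope(p1, p2):
    """Change in y divided by the corresponding change in x."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("slope is undefined for a vertical line")
    return (y2 - y1) / (x2 - x1)

def y_intercept(p1, p2):
    """The y value where the line through p1 and p2 crosses the y-axis (x = 0)."""
    m = slope(p1, p2)
    x1, y1 = p1
    return y1 - m * x1

def x_intercept(p1, p2):
    """The x value where the line through p1 and p2 crosses the x-axis (y = 0)."""
    m = slope(p1, p2)
    if m == 0:
        raise ValueError("a horizontal line has no single x-intercept")
    return -y_intercept(p1, p2) / m

# Two hypothetical points on a line.
a, b = (1, 3), (3, 7)
print(slope(a, b), y_intercept(a, b), x_intercept(a, b))
```

For these two points the line rises 4 units over a run of 2, so it is steeper than a line with a slope of 1.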

Plotting Lines

• A line chart is often used to visualize a trend in data over intervals of time – a time series – thus the line is often drawn chronologically.

• A line chart is typically drawn bordered by two perpendicular lines, called axes. The horizontal axis is called the x-axis and the vertical axis is called the y-axis.

• Typically the y-axis represents the dependent variable and the x-axis (sometimes called the abscissa) represents the independent variable.

• In statistics, charts often include an overlaid mathematical function depicting the best-fit trend of the scattered data.

The Equation of a Line

• Simple linear regression fits a straight line through a set of points that makes the vertical distances between the points of the data set and the fitted line as small as possible.

• A common form of a linear equation is y = mx + b, where m and b designate constants.

• Linear regression can be used to fit a predictive model to an observed data set of y and X values.
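
A simple linear regression with one explanatory variable can be sketched with the standard least squares formulas for m and b in y = mx + b. The function name `fit_line` and the elapsed-time and speed values are hypothetical and are not the data shown in this chapter's figures.

```python
def fit_line(xs, ys):
    """Least squares estimates of the slope m and intercept b in y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical measurements: elapsed time and measured speed.
times = [0, 1, 2, 3, 4]
speeds = [1.0, 3.1, 4.9, 7.2, 8.8]

m, b = fit_line(times, speeds)
predicted = m * 5 + b  # predicted speed at time 5 from the fitted line
print(m, b, predicted)
```

The fitted line minimizes the sum of the squared vertical distances between the data points and the line, and can then be used to predict y for a new x value.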

Appendix

Key terms

• bell curve In mathematics, the bell-shaped curve that is typical of the normal distribution.

• box plot A graphical summary of a numerical data sample through five statistics: median, lower quartile, upper quartile, and some indication of more extreme upper and lower values.

• breeding the process through which propagation, growth, or development occurs

• continuous variable a variable that has a continuous distribution function, such as temperature

• control a separate group or subject in an experiment, against which the results are compared, where the primary variable is low or nonexistent

• correlation One of the several measures of the linear statistical relationship between two random variables, indicating both the strength and direction of the relationship.

• cross tabulation a presentation of data in a tabular form to aid in identifying a relationship between variables

• density the probability that an event will occur, as a function of some observed variable

• discrete variable a variable that takes values from a finite or countable set, such as the number of legs of an animal

• evolution a gradual directional change, especially one leading to a more advanced or complex form; growth; development

• frequency number of times an event occurred in an experiment (absolute frequency)

• gradient of a function y = f(x) or the graph of such a function, the rate of change of y with respect to x, that is, the amount by which y changes for a certain (often unit) change in x

• graph A diagram displaying data; in particular one showing the relationship between two or more quantities, measurements or indicative numbers that may or may not have a specific mathematical formula relating them to each other.

• histogram a representation of tabulated frequencies, shown as adjacent rectangles, erected over discrete intervals (bins), with an area equal to the frequency of the observations in the interval

• intercept the coordinate of the point at which a curve intersects an axis

• interquartile range The difference between the first and third quartiles; a robust measure of sample dispersion.

• line a path through two or more points (compare ‘segment’); a continuous mark, including as made by a pen; any path, curved or straight

• linear regression an approach to modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X.

• outlier a value in a statistical sample which does not fit a pattern that describes most other data points; specifically, a value that lies more than 1.5 IQR beyond the upper or lower quartile

• plot a graph or diagram drawn by hand or produced by a mechanical or electronic device

• probability distribution A function of a discrete random variable yielding the probability that the variable will have a given value.

• quartile any of the three points that divide an ordered distribution into four parts, each containing a quarter of the population

• scatter plot A type of display using Cartesian coordinates to display values for two variables for a set of data.

• slope the ratio of the vertical and horizontal distances between two points on a line; zero if the line is horizontal, undefined if it is vertical.

• stemplot a means of displaying data used especially in exploratory data analysis; another name for stem-and-leaf display

• stochastic random; randomly determined

• variable a quantity that may assume any one of a set of values

Scatterplot

Scatter plot with a fitted regression line.

Contingency Table

Contingency table created to display the numbers of individuals who are male and right-handed, male and left-handed, female and right-handed, and female and left-handed.

Stem-and-Leaf Display

This is an example of a stem-and-leaf display for EPA data on miles per gallon of gasoline.

Types of Variables

Variables can be numeric or categorical. Numeric variables are further broken down into continuous and discrete variables, and categorical variables into nominal and ordinal variables.

Linear regression

An example of a simple linear regression analysis

Histogram Example

This histogram depicts the relative frequency of heights for 100 semiprofessional soccer players.

Intercept

Graph (x) with a y-intercept at (0,1).

Boxplot Versus Probability Density Function

This image shows a boxplot and probability density function of a normal distribution.

The Histogram

This is an example of a histogram, depicting graphically the distribution of heights for 31 Black Cherry trees.

An example of a scatter plot

A scatter plot helps identify the type of relationship (if any) between two variables.

Crosstab of Cola Preference by Age and Gender

A crosstab is a combination of various tables showing summary statistics.

Controlling for Variables

Controlling is very important in experimentation to ensure reliable results. For example, in an experiment to see which type of vinegar displays the greatest reaction to baking soda, the brand of baking soda should be controlled.

Slope

The slope of a line in the plane is defined as the rise over the run, m = Δy/Δx.

Selective Breeding

This Chihuahua mix and Great Dane show the wide range of dog breed sizes created using artificial selection, or selective breeding.

Dallinger Incubator

Drawing of the incubator used by Dallinger in his evolution experiments.

Relative Frequency

The relative frequency of an event refers to the absolute frequency normalized by the total number of events.

Histogram Versus Kernel Density Estimation

Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The 6 individual kernels are the red dashed curves, and the kernel density estimate is the blue curve. The data points are the rug plot on the horizontal axis.

Line chart

A graph of speed versus time

Data Table

A data table showing elapsed time and measured speed.

The function of a line

Three lines: the red and blue lines have the same slope, while the red and green lines have the same y-intercept.

