Survey of Math

Survey of Math Chapter 5: Producing Data

Definitions

The population in a statistical study is the entire group of individuals we want information about.

Since in practice we cannot gather information from the entire population (this would be too time-consuming), we choose a sample group from which we collect information. The sample is a subset of the population. The information we collect is called data, and it allows us to draw conclusions about the population. The process of collecting data from a sample is called sampling.

The hope is that the sample group accurately represents the properties of the population. It is important that the sample is chosen with this in mind.

It is easiest to choose good samples from populations that are very homogeneous (the same throughout), and more difficult to choose good samples from populations whose individuals differ from one another. In a diverse population it is important to ensure that the diversity is represented in the sample.

Example

A company makes electrical circuits for garage door openers, using both human and computer labour. They want to test the circuits they make for durability to ensure they will function properly over a long period of time. However, testing every circuit would slow production, so they cannot test them all. They choose to test 5% of the circuits they produce. Since they produce 1000 circuits in an eight-hour shift, they decide to test the first 50 circuits each shift produces for longevity.

The obvious problem with this is that the sample is only drawn at the beginning of a shift, when the people working on the circuits are most rested. Human performance will fluctuate over the course of a shift as workers tire, and the sample does not capture that variation.

The above example shows a sample that is called a convenience sample, in that the sample was found in a manner that is easy (convenient) for the person gathering the sample. The convenience here was choosing the first 50 circuits the shift produced instead of circuits produced throughout the shift. This sampling technique might favour an outcome of the circuits in the sample passing the test for durability.

Convenience samples often produce data which is not representative of the population.
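
To see how the circuit example could go wrong, here is a minimal Python sketch. It rests on an invented assumption that is not part of the example: that the chance a circuit fails the durability test rises steadily from 1% at the start of a shift to 5% at the end. The function name and both percentages are hypothetical and chosen only for illustration.

```python
def failure_probability(position, shift_size=1000, start=0.01, end=0.05):
    """Hypothetical chance that the circuit made at this position fails.

    Assumed (not from the example): the chance rises in a straight line
    from `start` for the first circuit to `end` for the last one.
    """
    return start + (end - start) * position / (shift_size - 1)


def mean_failure(positions):
    """Average failure chance over the given production positions."""
    positions = list(positions)
    return sum(failure_probability(p) for p in positions) / len(positions)


whole_shift = range(1000)   # every circuit produced in the shift
first_fifty = range(50)     # the convenience sample: first 50 circuits

convenience_rate = mean_failure(first_fifty)
population_rate = mean_failure(whole_shift)
print(f"first 50 circuits: {convenience_rate:.4f}")
print(f"whole shift:       {population_rate:.4f}")
```

Under this assumed model the first 50 circuits look considerably more reliable than the shift as a whole, which is the kind of favoured outcome the example warns about.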

Another example of a sample which is usually not representative of the population is a voluntary response sample, in which people choose themselves by responding to a general appeal. Such samples tend to attract people who have strong feelings about the question being asked.

Example

Visit the King County (Seattle) Survey on Homeless.

Because this is an internet poll, the pollsters have no control over who responds to the poll. Even worse, they have no idea who is responding to the poll.

The 278 Do More responses may be from one individual who feels strongly that homelessness is an issue that could be alleviated by housing subsidies for qualifying people, and who believes this so strongly that they voted 278 times.

Maybe the 278 Do More responses come from workers in the construction industry who want to see more construction projects in King County, regardless of the homeless issue.

You could think of similar situations for the Do Less responses.

Because this is an internet poll, there is no guarantee that any of the respondents even live in King County.

Thankfully, they call their poll unscientific. What does the information mean, if it is unscientific? Why bother reporting it at all? Can we reasonably draw any conclusions about what the population of King County feels about the issue of developing housing alternatives for the homeless based on this poll?

If you surf around the King County site you will find information that shows what the poll is used for, what it is not used for, and what is done to prevent some of the problems discussed above. For a simple online poll, it is good that this information is included!

Common examples of voluntary response samples are radio/TV call-in opinion polls and internet polls.

A statistical study has a biased design if it systematically favours certain outcomes. Both the convenience sample and voluntary response sample typically exhibit bias.

To try to reduce bias in selecting a sample, one can use the idea of a random number.

A simple random sample of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.

The simple random sample is a better sampling technique since it does not favour one part of the population over another.
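
As an illustration of the definition, here is a minimal Python sketch that draws a simple random sample with the standard library's random.sample, which chooses without replacement so that every set of n individuals is equally likely. The helper name simple_random_sample and the population of 20 numbered individuals are assumptions made up for this sketch.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Return n distinct individuals chosen without replacement.

    The optional seed only makes the draw repeatable for demonstration.
    """
    rng = random.Random(seed)
    return rng.sample(list(population), n)


# Hypothetical population: individuals labelled 1 through 20.
population = range(1, 21)
sample = simple_random_sample(population, 5, seed=1)
print(sample)
```

Running it again with a different seed (or no seed) gives a different set of five individuals, and no part of the population is favoured over another.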

A Table of Random Digits

Here are 135 random digits. Each entry in our list is equally likely to be any one of the 10 digits 0 through 9.

2, 9, 7, 1, 7, 8, 9, 7, 5, 9, 1, 5, 3, 2, 3, 0, 2, 3, 3, 1, 9, 6, 0, 9, 1, 7, 0, 6, 1, 0, 0, 5, 4, 6, 5, 9, 7, 4, 7, 4, 7, 5, 1, 1, 4, 4, 7, 9, 3, 8, 5, 4, 7, 6, 6, 7, 6, 1, 5, 7, 9, 7, 2, 9, 1, 3, 1, 5, 7, 6, 6, 0, 3, 9, 1, 9, 3, 1, 1, 0, 2, 6, 5, 4, 0, 0, 7, 7, 4, 0, 8, 3, 0, 8, 4, 3, 0, 4, 8, 5, 3, 3, 9, 4, 6, 1, 9, 3, 9, 1, 1, 4, 4, 3, 9, 8, 8, 6, 4, 9, 7, 5, 3, 5, 7, 0, 7, 4, 8, 6, 7, 7, 7, 7, 0

These are actually pseudo-random numbers (see text page 175) since I generated them using a computer algorithm, but that is not important for our purposes.
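
A list like this can be produced in a few lines of Python. This is only a sketch of one way to do it, not the algorithm actually used for the list above; the function name random_digits and the seed value are arbitrary choices for illustration, so the digits it prints will differ from ours.

```python
import random


def random_digits(count, seed=None):
    """Return `count` pseudo-random digits, each equally likely to be 0-9."""
    rng = random.Random(seed)
    return [rng.randrange(10) for _ in range(count)]


digits = random_digits(135, seed=2024)  # the seed is an arbitrary choice
print(", ".join(str(d) for d in digits))
```

Because a computer algorithm with a fixed seed always yields the same digits, they are pseudo-random rather than truly random.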

A table of random digits generated from this list would be:

29717897591532302331960917061005465974747511447938547667615...

All we have done is remove the commas.

A table of random digits has two properties:

Each entry in the table is equally likely to be any one of the 10 digits 0 through 9.
The entries are independent of each other.

The table of random digits is usually written in a different form that is easier to read. We group the digits in groups of five, and separate them by a space. The rows are numbered to make the table easy to refer to, but the row numbers and spaces haven't really changed the table of random digits we had above.

Row	Random Digits
1	29717	89759	15323	02331
2	96091	70610	05465	97474
3	75114	47938	54766	76157
4	97291	31576	60391	93110
5	26540	07740	83084	30485
6	33946	19391	14439	88649
7	75357	07486	77770	...

This is how the table on page 174 of the text was produced. Note that they probably have a different set of random digits making up their random digit table, and they start their table on row 101.
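
The grouping into blocks of five digits with numbered rows can be sketched in Python as follows. The function name format_digit_table and its parameters are hypothetical; the digits are the first 40 from our list.

```python
def format_digit_table(digits, group_size=5, groups_per_row=4, first_row=1):
    """Split a string of digits into numbered rows of fixed-size groups."""
    groups = [digits[i:i + group_size]
              for i in range(0, len(digits), group_size)]
    rows = []
    for start in range(0, len(groups), groups_per_row):
        row_number = first_row + start // groups_per_row
        row_groups = groups[start:start + groups_per_row]
        rows.append(f"{row_number}\t" + "\t".join(row_groups))
    return rows


# First 40 digits of the list above (rows 1 and 2 of our table).
digit_string = "29717897591532302331" "96091706100546597474"
table_rows = format_digit_table(digit_string)
for line in table_rows:
    print(line)
```

Passing first_row=101 would number the rows the way the text's table does, without changing the digits themselves.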

Using random sampling techniques can help reduce bias in choosing a sample.

Example

You are considering moving into a condominium, but want to ensure the social atmosphere is to your liking.

You decide to talk to four randomly chosen tenants in the condominium, to get a feeling for the place.

Use the table of random digits given above to decide which set of people to sample, given that there are 27 apartments in the complex, numbered A1 to A15 and B1 to B12.

Solution: To answer this question, we need to order the apartments before we begin. Since the randomization will come from the table of random digits, we can just order the apartments alphabetically and sequentially:

A1 (01), A2 (02), A3 (03), A4 (04), A5 (05), A6 (06), A7 (07), A8 (08), A9 (09), A10 (10), A11 (11), A12 (12), A13 (13), A14 (14), A15 (15), B1 (16), B2 (17), B3 (18), B4 (19), B5 (20), B6 (21), B7 (22), B8 (23), B9 (24), B10 (25), B11 (26), B12 (27).

Now, turn to the random digit table. Our labels are two digits long, so we read the digits in pairs as two-digit labels, starting from any point in the table we like. We ignore labels which are not in our population (anything outside 01-27) and labels which are repeated, and we continue reading until we have four selections from our population.

Let's choose to start at the beginning of row 5.

Row	Random Digits
5	26 54 00 77 40 83 08 43 04 85
6	33 94 61 93 91 14 43 98 86 49

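Reading across rows 5 and 6, the first four usable labels are 26, 08, 04 and 14; we skip 00 and every label above 27. The tenants we should interview are therefore in apartments A4 (04), A8 (08), A14 (14), B11 (26).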
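
The reading procedure used in this solution can be checked with a short Python sketch. The function name select_labels and its parameters are made up for illustration; the digits are rows 5 and 6 of our table.

```python
def select_labels(digit_rows, population_size, sample_size, label_width=2):
    """Read fixed-width labels from rows of random digits.

    Labels outside 1..population_size and repeated labels are skipped.
    Reading stops once sample_size labels have been found; fewer are
    returned if the digits run out first.
    """
    digits = "".join(row.replace(" ", "") for row in digit_rows)
    chosen = []
    for i in range(0, len(digits) - label_width + 1, label_width):
        label = int(digits[i:i + label_width])
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
            if len(chosen) == sample_size:
                break
    return chosen


# Apartments in label order: A1 is 01, ..., A15 is 15, B1 is 16, ..., B12 is 27.
apartments = [f"A{i}" for i in range(1, 16)] + [f"B{i}" for i in range(1, 13)]
rows_5_and_6 = ["26540 07740 83084 30485", "33946 19391 14439 88649"]

labels = select_labels(rows_5_and_6, len(apartments), 4)
chosen_apartments = [apartments[label - 1] for label in labels]
print(chosen_apartments)  # ['B11', 'A8', 'A4', 'A14']
```

The apartments are printed in the order they were drawn from the table; sorting them gives the same four apartments listed in the solution.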