The sampler function is part of the whSample package of utilities for sampling. It is a menu-driven R application that reads an Excel spreadsheet or delimited text file and generates a new Excel workbook with a copy of the original data, a Simple Random or Stratified Random Sample, and a report of what it did.
The sampler function calls ssize to get its sample size estimate. Sampler uses the same sample size defaults as ssize and they can be overridden when launching sampler from the command line.
sampler takes four additional arguments:
The full range of command-line options is:
The defaults for these additional arguments are backups=5, irisData=F, seed=NULL and keepOrg=F. The default seed will tell sampler to use the current system time in milliseconds. Any number can be used as a seed. Whichever one is used will be listed in the Report output tab. The keep-original option (keepOrg) defaults to FALSE, but could be set to keepOrg=T for smaller populations that wouldn’t exceed Excel’s row limit is 1,048,576 rows.
To override any of these defaults, enter name=value as an argument when calling sampler, as in:
More on changing the number of backups in a minute.
Sampler includes an .xlsx and a .csv version of Anderson’s Iris dataset that characterizes the flower’s variations. The Examples section will get you started with sampler. Or keep reading for a detailed walk-through.
The easiest way to run sampler is by calling it from the R console:
A file chooser will appear to select a source file. Sampler will remember where the source file came from and will put the resulting output file in the same folder with the name source file name_Sample.xlsx.
Sampler will read Excel or delimited text files.
If the source file is in an Excel workbook, sampler will display a list of worksheets in that file for you to choose from.
A “Number of Backups” menu will offer either zero, five, or ten backups (per stratum). The default is five.
There may be times when you want a specific number of backups that isn’t listed. In this case, you need to start sampler with the number of backups desired, as in:
Important note: At this point, if you specified a number of backups on the command line, hit Cancel and sampler will use the number you specified. Clicking Okay will accept whatever was highlighted in the menu. If you didn’t specify a number on the command line and hit Cancel, sampler will use the default of five.
For a Simple Random Sample, the startup example shown above will add the next 25 items from the randomized population to a separate table at the end of the output Excel workbook.
Stratified samples will include 25 backups for each stratum. This could be a problem if a given stratum doesn’t have 25 more items to include. In this case, sampler will scale back the number of backups to the number of items available for that stratum. The Report tab will list the number of backups actually provided for each stratum.
Sampler can create a Simple Random Sample, a Stratified Random Sample, or a variation of the stratified sample where each stratum is listed with its backups in a separate Excel worksheet. A “Sample Type” menu provides the options.
The “Stratify on” menu lists all the column headers in the selected worksheet. Sampler will group all the samples by this name and split out the results accordingly.
Sampler creates a new Excel workbook in three parts:
a copy of the original (source) data if previously requested,
an Excel spreadsheet with the requested sample, and
a new tab called Report with key reference information:
path and name of the source file
size (in rows) of the source file
sample type (Simple Random Sample, Stratified Random Sample, or Tabbed Stratified Sample)
sample size
desired confidence level
desired margin of error
anticipated rate of occurrence
stratification key
number of strata
number of backups per stratum
random number seed used, for documentation and reproducibility
date-time stamp of when the sample was generated
stratification information (stratified samples only):
To install from CRAN, use:
All necessary dependencies also will be installed as necessary.
You can install* the latest development version of whSample from the R console with:
* This requires the devtools
package to be installed
first.
sampler depends on several external packages to run properly. Ensure these packages are installed:
Running sampler(irisData=T) from the R console’s > prompt will launch sampler with the other defaults and will open a file chooser with iris.xlsx and iris.csv options (check the file type drop down for different suffixes). sampler will read either format (or delimited text) and generate an Excel file with the sampling parameters you choose. A series of menus will take you step by step through the process and will open the generated spreadsheet with an Excel-compatible app if one is on your computer.
Sampler will pop up a directory chooser to let you decide where to put the output file. By default, it will offer the same folder as the source file. If you want to see the output Excel file without first saving it, hit Cancel. The file will open but won’t automatically be saved. You can save it from Excel if you want.
If you used the example file and don’t have an Excel-compatible app installed, you can find the output folder by entering this command at the > prompt:
sampler(p=0.6, seed=12345) Will override those specific defaults. You can override the other defaults in the same way.
If you skipped the middle part of this vignette, now’s a good time to catch up.