--- title: "Creating samples with sampler" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Creating samples with sampler} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r eval=FALSE} library(whSample) ``` The ***sampler*** function is part of the ***whSample*** package of utilities for sampling. It is a menu-driven R application that reads an Excel spreadsheet or delimited text file and generates a new Excel workbook with a copy of the original data, a Simple Random or Stratified Random Sample, and a report of what it did. The *sampler* function calls [***ssize***](Using_ssize.html) to get its sample size estimate. *Sampler* uses the same sample size defaults as *ssize* and they can be overridden when launching *sampler* from the command line. ***sampler*** takes four additional arguments: - **irisData** opens the file chooser to a folder with example files of Anderston's Iris dataset of flower characteristics. - **backups** provides a buffer for use if necessary to replace samples found to be invalid for some reason, - **seed** is used to seed the internal random number generator, and - **keepOrg** determines if a copy of the population is included in the output. The full range of command-line options is: ```{r eval=FALSE} sampler <- function(backups=5, irisData=F, ci=0.95, me=0.07, p=0.50, seed=NULL, keepOrg=F) ``` The defaults for these additional arguments are *backups=5*, *irisData=F*, *seed=NULL* and *keepOrg=F*. The default seed will tell *sampler* to use the current system time in milliseconds. Any number can be used as a seed. Whichever one is used will be listed in the *Report* output tab. The keep-original option (*keepOrg*) defaults to FALSE, but could be set to *keepOrg=T* for smaller populations that wouldn't exceed Excel's row limit is 1,048,576 rows. To override any of these defaults, enter *name=value* as an argument when calling *sampler*, as in: ```{r eval=F} sampler(p=0.06) ``` More on changing the number of backups in a minute. ## Jump right in ***Sampler*** includes an .xlsx and a .csv version of Anderson's Iris dataset that characterizes the flower's variations. The [Examples](#jump_in) section will get you started with *sampler*. Or keep reading for a detailed walk-through. ## Running ***sampler*** {#running} The easiest way to run *sampler* is by calling it from the R console: ```{r eval=FALSE} sampler() ``` ### Choose the source data file A file chooser will appear to select a source file. *Sampler* will remember where the source file came from and will put the resulting output file in the same folder with the name ***source file name_Sample.xlsx***. *Sampler* will read Excel or delimited text files. ### Choose the data worksheet If the source file is in an Excel workbook, *sampler* will display a list of worksheets in that file for you to choose from. ### Choose the number of backups A "Number of Backups" menu will offer either zero, five, or ten backups (per stratum). The default is five. There may be times when you want a specific number of backups that isn't listed. In this case, you need to **start** ***sampler*** **with the number of backups desired**, as in: ```{r eval=FALSE} sampler(backups=25) ``` **Important note:** At this point, if you specified a number of backups on the command line, **hit** ***Cancel*** and *sampler* will use the number you specified. Clicking Okay will accept whatever was highlighted in the menu. If you didn't specify a number on the command line and hit Cancel, *sampler* will use the default of five. For a Simple Random Sample, the startup example shown above will add the next 25 items from the randomized population to a separate table at the end of the output Excel workbook. Stratified samples will include 25 backups **for each stratum**. This could be a problem if a given stratum doesn't have 25 more items to include. In this case, *sampler* will scale back the number of backups to the number of items available for that stratum. The ***Report tab*** will list the number of backups actually provided for each stratum. ### Choose the sampling type *Sampler* can create a Simple Random Sample, a Stratified Random Sample, or a variation of the stratified sample where each stratum is listed with its backups in a separate Excel worksheet. A "Sample Type" menu provides the options. ### Choose what to stratify on The "Stratify on" menu lists all the column headers in the selected worksheet. *Sampler* will group all the samples by this name and split out the results accordingly. ## Output *Sampler* creates a new Excel workbook in three parts: - a copy of the original (source) data if previously requested, - an Excel spreadsheet with the requested sample, and - a new tab called *Report* with key reference information: - path and name of the source file - size (in rows) of the source file - sample type (Simple Random Sample, Stratified Random Sample, or Tabbed Stratified Sample) - sample size - desired confidence level - desired margin of error - anticipated rate of occurrence - stratification key - number of strata - number of backups per stratum - random number seed used, for documentation and reproducibility - date-time stamp of when the sample was generated - stratification information (stratified samples only): - name - size of the population - proportion of the population for each stratum - the number of primary samples - the actual number of backup samples (may not be the same as that requested) ## Installation To install from CRAN, use: ```{r eval=FALSE} install.packages("whSample") ``` All necessary dependencies also will be installed as necessary. You can install\* the latest development version of whSample from the R console with: ``` {r eval=FALSE} devtools::install_github("km4ivi/whSample") ``` \* This requires the `devtools` package to be installed first. ### Other necessary packages *sampler* depends on several external packages to run properly. Ensure these packages are installed: - tidyverse (or individually: magrittr, dplyr, purrr) - openxlsx - data.table - tools - tcltk - bit64 ## Examples {#jump_in} Running *sampler(irisData=T)* from the R console's **>** prompt will launch *sampler* with the other defaults and will open a file chooser with *iris.xlsx* and *iris.csv* options (check the file type drop down for different suffixes). *sampler* will read either format (or delimited text) and generate an Excel file with the sampling parameters you choose. A series of menus will take you step by step through the process and will open the generated spreadsheet with an Excel-compatible app if one is on your computer. *Sampler* will pop up a directory chooser to let you decide where to put the output file. By default, it will offer the same folder as the source file. If you want to see the output Excel file without first saving it, hit *Cancel*. The file will open but won't automatically be saved. You can save it from Excel if you want. If you used the example file and don't have an Excel-compatible app installed, you can find the output folder by entering this command at the **>** prompt: ```{r eval=FALSE} system.file("extdata", package="whSample") ``` *sampler(p=0.6, seed=12345)* Will override those specific defaults. You can override the other defaults in the same way. If you skipped the middle part of [this vignette](#running), now's a good time to catch up.