DeeDive (short for “Data Dive”) is an automatic data exploration tool designed for generating hypotheses and for discovering patterns you might have missed during previous exploratory analyses. To use it, enter your email address and upload your data file, with variable names in the first row and IDs in the first column. Accepted formats include .csv, .sav (SPSS), .xls and .xlsx (Excel), .dta (Stata), .sas7bdat (SAS), and several others. Data can be any mix of types (strings, numbers, dates, etc.), and missing values are handled in different ways depending on the analysis. Longitudinal data must be in wide format. DeeDive will email you a .pdf of results, usually within 15 minutes, although very large data sets can take several days. That is about all you need to know to use DeeDive, but see below for some pointers.
Possible inputs:
.csv
.sav (SPSS)
.xls (Excel)
.xlsx (Excel)
.dta (Stata)
.sas7bdat (SAS)
Three example data sets are available for download to show the format DeeDive expects. Feel free to upload them (with your email) to see an example output.
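If your longitudinal data are currently in long format (one row per ID per time point), they need to be reshaped to wide format before upload. Below is a minimal Python sketch of one way to do that with only the standard library. The function name `long_to_wide`, the column names (`ID`, `wave`, `score`), and the `variable_timepoint` naming scheme are illustrative assumptions, not something DeeDive requires.

```python
from collections import defaultdict


def long_to_wide(rows, id_key, time_key, value_keys):
    """Pivot long-format rows (one row per ID per time point) into a
    wide table (one row per ID, one column per variable per time point)."""
    wide = defaultdict(dict)
    columns = []  # keeps the first-seen order of the new column names
    for row in rows:
        for key in value_keys:
            col = f"{key}_{row[time_key]}"  # hypothetical naming scheme
            if col not in columns:
                columns.append(col)
            wide[row[id_key]][col] = row[key]

    # Variable names go in the first row and IDs in the first column.
    table = [[id_key] + columns]
    for ident, values in wide.items():
        # ID/time combinations that were never observed become empty cells.
        table.append([ident] + [values.get(col, "") for col in columns])
    return table


# Made-up example data: two participants, up to two waves each.
long_rows = [
    {"ID": "p1", "wave": "1", "score": "10"},
    {"ID": "p1", "wave": "2", "score": "12"},
    {"ID": "p2", "wave": "1", "score": "7"},
]
wide_table = long_to_wide(long_rows, "ID", "wave", ["score"])
for line in wide_table:
    print(",".join(line))
```

The resulting list of rows can be written out with `csv.writer` and uploaded as a .csv file.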
Further Points
DeeDive is meant to help you explore your data. If you’re using DeeDive for scientific purposes (and you’re not a statistician yourself), you can take the output to your statistician, who will recognize from it what types of analyses were done. Your statistician can then run the analyses properly on your data and make publication-quality visuals (unlike DeeDive’s). In addition, just glancing through the visuals a couple of times can help generate ideas, even for other data sets you might have.
- Minimum Data Required: DeeDive needs at least 2 variables (columns) and can handle as many as ~9,000. However, note that because DeeDive is meant for medium- to large-scale data exploration, it will append columns of random data to your data set if there are fewer than 10 columns. The minimum N (rows) is 3, but at least 10 is recommended. Note that DeeDive automatically removes any variables with >50% missing data.
- Dates: DeeDive can handle most date formats out there (mm-dd-YY, mm-dd-YYYY, dd/mm/YYYY, etc.), but if you want to be 100% sure your dates are read properly, put them in “YYYY-mm-dd” or “YYYY/mm/dd” format.
- Variable Naming: For visualization purposes, it’s best to use only necessary parts of variable names. For example, if you had variables called “Cognitive_Test_Score_Memory”, “Cognitive_Test_Score_Attention”, and “Cognitive_Test_Score_Verbal”, and there were no other variables with “Cognitive_Test_Score” in the name, it would be best to drop that from the variable names, leaving “Memory”, “Attention”, and “Verbal”.
- Variables Chosen for Analysis: DeeDive does not seek the strongest effects in your data; rather, it tries to identify a diverse set of variables that appear to come from different measurement targets. For example, if your data include a few questions about depression, a few questions about net worth, and a few biometric measures (e.g. height, weight), DeeDive will likely detect that and use at least one variable (the strongest indicator) from each measurement target (depression, net worth, and biometrics) in its analyses. Relatedly, because DeeDive is meant for exploration, uploading variations of your data set is encouraged. For example:
  - A version of your data set with clearly useless columns removed.
  - A version with transformations appended (e.g. all continuous variables squared).
- Failures: DeeDive is young and under development, so it fails occasionally, and there is currently no notification of failure (coming soon). Before assuming a failure, note that processing time depends mostly on how many of your variables are binary (e.g. yes/no, true/false, male/female, correct/incorrect, etc.) or ordinal (e.g. 0/1/2 for “small business”/“mid-size business”/“large business”). With more than ~200 binary/ordinal variables, results could take >3 hours; with more than ~500, it could be 48+ hours. If you have an exceptionally large data set or are running into problems with a data set you know should work, please contact me at reductionist@gmail.com or mooremetrics@gmail.com to discuss running the algorithm on a local MooreMetrics machine.
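Several of the points above (minimum size, padding below 10 columns, removal of variables with >50% missing data, and the effect of binary/ordinal variables on processing time) can be checked locally before you upload. The following is a rough Python sketch, not DeeDive's own logic. The function name `precheck`, the set of tokens treated as missing, the decision to exclude the ID column from the column count, and the "2 to 5 distinct values" rule used as a stand-in for binary/ordinal are all assumptions made for illustration.

```python
import csv
import io

# Assumption: these tokens count as missing; DeeDive's own rules may differ.
MISSING = {"", "NA", "NaN", "null"}


def precheck(csv_text, max_levels=5):
    """Summarize a wide-format CSV against the limits described above."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]
    names = header[1:]  # first column holds IDs (assumed not to count)
    n_rows = len(data)
    mostly_missing, low_cardinality = [], []
    for j, name in enumerate(names, start=1):
        values = [r[j].strip() if j < len(r) else "" for r in data]
        present = [v for v in values if v not in MISSING]
        if n_rows and (n_rows - len(present)) / n_rows > 0.5:
            mostly_missing.append(name)  # would be removed automatically
        elif 2 <= len(set(present)) <= max_levels:
            low_cardinality.append(name)  # crude binary/ordinal proxy

    n_low = len(low_cardinality)
    if n_low > 500:
        runtime = "could be 48+ hours"
    elif n_low > 200:
        runtime = "could take >3 hours"
    else:
        runtime = "no long delay expected from binary/ordinal variables"
    return {
        "n_variables": len(names),
        "n_rows": n_rows,
        "too_few_variables": len(names) < 2,
        "will_be_padded": len(names) < 10,
        "too_few_rows": n_rows < 3,
        "below_recommended_rows": n_rows < 10,
        "mostly_missing": mostly_missing,
        "low_cardinality": low_cardinality,
        "runtime_note": runtime,
    }


# Made-up example data with one mostly empty column.
sample = (
    "ID,sex,age,income\n"
    "a,M,34,\n"
    "b,F,29,\n"
    "c,F,41,52000\n"
    "d,M,38,\n"
    "e,F,52,\n"
    "f,M,47,61000\n"
)
report = precheck(sample)
for key, value in report.items():
    print(f"{key}: {value}")
```

The distinct-value rule will misclassify some columns (for example, a continuous variable in a very small sample), so treat the count as a rough guide to expected waiting time rather than a prediction.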