Interactive Survey Discovery Guide
Overview
This guide demonstrates the interactive workflow for discovering survey IDs and understanding survey structure before creating a new exercise in the Data Quality App.
Why Interactive Discovery?
Before you can create an exercise, you need to know:
- Survey ID - The unique identifier for your data source
- Column names - What fields are available in the survey
- Filter options - Which columns make good filters (enumerators, locations, etc.)
- Data types - Are dates stored as strings? Are numbers formatted correctly?
- Data quality - What’s the structure? Any missing data patterns?
Previously: Developers had to manually inspect the API, guess column names, and iterate through errors.
Now: Interactive discovery script automates the exploration and provides instant insights!
The Interactive Workflow
Visual Overview
┌─────────────────────────────────────────────────────────────┐
│ INTERACTIVE DISCOVERY │
└─────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 1. Run Discovery Script │
│ source("tools/discover_surveys.r") │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 2. Search for Your Survey │
│ search_surveys("welcome meals") │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 3. Inspect Survey Structure │
│ inspect_survey(5400) │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 4. Explore Data Interactively │
│ View(current_survey_data) │
│ names(current_survey_data) │
│ table(current_survey_data$column) │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 5. Document Your Findings │
│ • Survey ID │
│ • Filter columns │
│ • Key metric columns │
│ • Date columns │
│ • ID columns │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ 6. Create Exercise File │
│ Use findings in exercise definition │
└───────────────────────────────────────┘
Common Scenarios
Scenario A: You have an xlsFormId but need the survey ID
Situation: You know the form ID (e.g., 2168) but need the corresponding survey ID.
Solution:
library(reticulate)
library(dplyr)
# Setup client
use_python(".venv/bin/python")
data_bridges_knots <- import("data_bridges_knots")
client <- data_bridges_knots$DataBridgesShapes("data_bridges_api_config.yaml")
# Get all surveys
surveys <- client$get_household_surveys_list(page = 1L)
# Find by xlsFormId
surveys |>
filter(xlsFormId == 2168) |>
select(surveyID, xlsFormId, xlsFormName, surveyStatus)Example output:
surveyID xlsFormId xlsFormName surveyStatus
5404 2168 BCM - Welcome meals - 20251021 Active
Result: xlsFormId 2168 → surveyID 5404 ✓
Why this matters: Survey IDs can change when forms are republished, but xlsFormId is more stable. Always verify the current surveyID.
Scenario B: You know the survey name but not the ID
Situation: You know it’s called “BCM - Welcome Meals” but don’t have the ID.
Solution: Use the discovery script’s search function (see Step 2 below).
Scenario C: You have nothing - need to explore available surveys
Situation: Starting from scratch, need to see what surveys exist.
Solution: Use the full discovery workflow (see steps below).
Step-by-Step Walkthrough
Step 1: Run the Discovery Script
Open R/RStudio and run:
source("discover_surveys.r")What happens:
- ✓ Connects to DataBridges API client
- ✓ Loads helper functions:
search_surveys()andinspect_survey() - ✓ Attempts to list available surveys (if API supports it)
- ✓ Provides instructions for next steps
Expected output:
========================================
SETTING UP DATABRIDGES CLIENT
========================================
✓ Client connected successfully!
========================================
STEP 1: DISCOVER AVAILABLE SURVEYS
========================================
Fetching list of household surveys...
Found 15 surveys
[Table of surveys displayed]
========================================
READY FOR DISCOVERY!
========================================
Use the functions above to find your survey ID,
then update your exercise file accordingly.
Note: If listing surveys fails (API doesn’t support it), you’ll see alternative methods.
Step 2: Search for Your Survey
Use the search_surveys() function to find surveys by keyword:
# Search by keyword
search_surveys("welcome")
search_surveys("BCM")
search_surveys("meals")Example interaction:
> search_surveys("welcome")
Found 2 survey(s) matching 'welcome':
surveyId surveyTitle
5400 BCM - Welcome Meals
5401 OSM - Welcome Meals DistributionWhat to do:
- ✓ Note the Survey ID (e.g.,
5400) - ✓ Verify this is the survey you want
- ✓ Try different search terms if needed
Tips:
- Search is case-insensitive
- Partial matches work (e.g., “wel” finds “welcome”)
- Try organization names: “BCM”, “OSM”, “WFP”
- Try survey types: “helpdesk”, “monitoring”, “distribution”
Step 3: Inspect Survey Structure
Once you have a survey ID, inspect its detailed structure:
inspect_survey(5400)This comprehensive function shows:
A. Data Overview
✓ Successfully loaded survey 5400
========================================
SURVEY DATA OVERVIEW
========================================
Rows: 1250
Columns: 45
B. Column Names (First 30)
========================================
COLUMN NAMES (First 30)
========================================
1. _id
2. _submission_time
3. _duration
4. date
5. Enumerator_Name
6. Beneficiary_ID
7. meal_quality_good
8. meal_quantity_sufficient
9. meal_temperature_appropriate
10. has_complaint
11. distribution_location
12. meal_type
...
C. Potential Filter Columns
========================================
POTENTIAL FILTER COLUMNS
========================================
These columns might be good for filtering:
• Enumerator_Name ( 8 unique values )
• distribution_location ( 12 unique values )
• site_name ( 5 unique values )
How it identifies filter columns:
- Contains keywords like “enum”, “enumerator”, “name”, “location”, “site”
- Has between 2-50 unique values (not too few, not too many)
- This helps you choose a good filter for the UI
D. Sample Data
========================================
SAMPLE DATA (First 5 rows, first 10 cols)
========================================
[First 5 rows of data displayed with actual values]
What to look for:
- Date formats: Are they consistent?
- Text values: Any unusual formatting?
- Missing data: Many NAs?
- Numeric ranges: Do values make sense?
Step 4: Explore Data Interactively
The inspect_survey() function saves the data as current_survey_data. Now you can explore!
Basic Exploration
# Open in RStudio viewer (great for visual inspection)
View(current_survey_data)
# See all column names
names(current_survey_data)
# Check data structure
str(current_survey_data)
# Get dimensions
dim(current_survey_data)
nrow(current_survey_data)
ncol(current_survey_data)Explore Specific Columns
# Check unique values in a column
unique(current_survey_data$Enumerator_Name)
# Count records by category
table(current_survey_data$distribution_location)
# Cross-tabulation
table(current_survey_data$Enumerator_Name,
current_survey_data$distribution_location)
# Check for missing values
sum(is.na(current_survey_data$Beneficiary_ID))
colSums(is.na(current_survey_data))Analyze Key Metrics
# Summary statistics
summary(current_survey_data$meal_quality_good)
# Calculate percentages for Yes/No columns
table(current_survey_data$has_complaint)
prop.table(table(current_survey_data$has_complaint)) * 100
# Check date range
range(current_survey_data$date, na.rm = TRUE)
# Check time distributions
hist(as.numeric(current_survey_data$`_duration`),
main = "Survey Duration Distribution",
xlab = "Duration (seconds)")Validate Data Quality
# Check for duplicates
sum(duplicated(current_survey_data$Beneficiary_ID))
# Missing data by row
rowSums(is.na(current_survey_data))
# Find rows with lots of missing data
which(rowSums(is.na(current_survey_data)) > 10)
# Check date consistency
summary(as.Date(current_survey_data$date))Step 5: Document Your Findings
Create a checklist with all the information you discovered:
Findings Template
## Survey Discovery Notes - [Survey Name]
**Discovery Date**: 2025-10-23
**Discovered By**: [Your Name]
### Basic Information
- **Survey ID**: 5400
- **Survey Name**: BCM - Welcome Meals
- **Record Count**: 1,250 records
- **Column Count**: 45 columns
- **Date Range**: 2025-01-01 to 2025-10-20
### Filter Configuration
- **Recommended Filter Column**: `Enumerator_Name`
- **Filter Label**: "Enumerator"
- **Unique Filter Values**: 8
- **Alternative Filters**:
- `distribution_location` (12 values)
- `site_name` (5 values)
### Key Columns for Validations
- **ID Column**: `Beneficiary_ID` (for uniqueness checks)
- **Phone Column**: `Interviewee_Phone_Number` (if exists)
- **Date Column**: `date` (needs conversion to Date class)
- **Duration Column**: `_duration` (in seconds)
- **Submission Column**: `_submission_time` (timestamp)
### Key Columns for Visualizations
**Meal Quality Metrics**:
- `meal_quality_good` (Yes/No)
- `meal_quantity_sufficient` (Yes/No)
- `meal_temperature_appropriate` (Yes/No)
**Service Quality**:
- `served_on_time` (Yes/No)
- `staff_courteous` (Yes/No)
- `has_complaint` (Yes/No)
**Distribution Details**:
- `distribution_location` (categorical)
- `meal_type` (categorical)
**Demographics**:
- `beneficiary_gender` (categorical)
- `beneficiary_age_group` (categorical)
### Data Quality Notes
- Date format: YYYY-MM-DD (needs conversion)
- Duration range: 180-3600 seconds (3 min - 1 hour)
- Missing data: ~5% average across columns
- Duplicates: 0 duplicate Beneficiary_IDs found
### Validation Rule Recommendations
1. ✓ Duration check (300-3600 seconds / 5 min - 1 hour)
2. ✓ Same-day submission
3. ✓ Late submission (within 24 hours)
4. ✓ Beneficiary ID uniqueness per day
5. ✓ Complaint rate check (max 10%)
6. ✓ Missing data check (max 30% per row)
### Next Steps
- [ ] Create exercise file: `app/exercises/bcm_welcome_meals.r`
- [ ] Implement data preparation
- [ ] Add validation rules
- [ ] Create custom dashboard
- [ ] Test in appStep 6: Create the Exercise File
Now you have everything you need! Follow the main guide using your documented findings:
# See full instructions in:
# - docs/ADDING_NEW_EXERCISES.md (comprehensive)
# - QUICK_START_NEW_EXERCISE.md (fast reference)Use your findings to fill in:
- Survey ID
- Filter column and label
- Validation rule parameters
- Visualization column names
Real-World Example 1: Survey Name Known
Scenario
You need to add a new “BCM - Welcome Meals” survey to the app.
Interactive Session
# 1. Start discovery
> source("tools/discover_surveys.r")
✓ Client connected successfully!
# 2. Search for survey
> search_surveys("welcome")
Found 2 survey(s) matching 'welcome':
surveyId surveyTitle
5404 BCM - Welcome Meals
# 3. Inspect structure
> inspect_survey(5404)
✓ Successfully loaded survey 5404
Rows: 1250
Columns: 45
POTENTIAL FILTER COLUMNS:
• Enumerator_Name ( 8 unique values )
• distribution_location ( 12 unique values )
# 4. Explore data
> View(current_survey_data)
[Opens in viewer]
> names(current_survey_data)
[Shows all 45 column names]
> table(current_survey_data$Enumerator_Name)
Ahmad 6
Fatima 8
Hassan 10
[...]
> table(current_survey_data$has_complaint)
No Yes
1187 63
> prop.table(table(current_survey_data$has_complaint)) * 100
No Yes
94.96 5.04
# ✓ Complaint rate is ~5% (well under 10% threshold)
> summary(as.numeric(current_survey_data$`_duration`))
Min. 1st Qu. Median Mean 3rd Qu. Max.
180.0 420.0 540.0 612.3 720.0 3200.0
# ✓ Duration mostly 7-12 minutes, max ~50 minutes
# 5. Document findings
# [Created notes file with all information]
# 6. Create exercise
# [Created app/exercises/bcm_welcome_meals.r using findings]Result: Complete exercise created in ~20 minutes with confidence that all column names and thresholds are correct!
Real-World Example 2: Troubleshooting Wrong Survey ID
Scenario
Created exercise with survey ID 5400, but app throws error when loading data. You have xlsFormId 2168.
Problem
> shiny::runApp("app")
# Select "BCM - Welcome Meals"
ERROR fetching from BCM - Welcome Meals (Survey 5400):
TypeError: object of type 'NoneType' has no len()Troubleshooting Session
# 1. Setup client
> library(reticulate)
> library(dplyr)
> use_python(".venv/bin/python")
> data_bridges_knots <- import("data_bridges_knots")
> client <- data_bridges_knots$DataBridgesShapes("data_bridges_api_config.yaml")
# 2. Get all surveys
> surveys <- client$get_household_surveys_list(page = 1L)
> cat("Found", nrow(surveys), "surveys\n")
Found 20 surveys
# 3. Look at the data structure
> surveys |>
select(surveyID, xlsFormId, xlsFormName) |>
head(10)
surveyID xlsFormId xlsFormName
5409 2175 Informal price monitoring in refugee camps - 20251022
5408 2172 OSM - Helpdesk - 20251022
5407 2169 BCM - Post offices validation - 20251021
5406 2170 OSM - Post offices validation - 20251021
5405 2169 BCM - Post offices validation - 20251021
5404 2168 BCM - Welcome meals - 20251021
5403 2167 BCM - Helpdesk validation - 20251021
5402 2166 BCM - Helpdesks - 20251021
5398 2160 mvamRMnEnew
5397 2159 OSM - Post Office Validation - OLD
# 4. Find our survey by xlsFormId
> surveys |>
filter(xlsFormId == 2168) |>
select(surveyID, xlsFormId, xlsFormName, surveyStatus)
surveyID xlsFormId xlsFormName surveyStatus
5404 2168 BCM - Welcome meals - 20251021 Active
# 🎯 FOUND IT! Survey ID is 5404, not 5400!
# 5. Update exercise file
# Updated app/exercises/bcm_welcome_meals.r:
# - Line 2: Survey ID 5404 (comment)
# - Line 7: id = "5404"
# - Line 22: survey_id = 5404L
# - Line 26: source_name = "BCM - Welcome Meals (Survey 5404)"
# 6. Test the fix
> shiny::runApp("app")
# Select "BCM - Welcome Meals"
✓ Data loads successfully!
✓ 1250 records loaded
✓ All validations running
✓ Dashboard displaying
# SUCCESS! 🎉What We Learned
- Survey ID ≠ xlsFormId: They’re different identifiers
- Survey IDs can change: When forms are republished, new survey IDs are created
- xlsFormId is more stable: It stays consistent across versions
- Always verify from API: Don’t guess - check the actual surveys list
- The surveys dataframe is essential: It has all the mappings you need
Time Saved
- Without troubleshooting guide: Could take hours of trial and error
- With this approach: 5 minutes to identify and fix the issue
Result: Working exercise with correct survey ID!
Troubleshooting Interactive Discovery
Issue: Script won’t connect to API
Symptoms:
Error: Python module 'data_bridges_knots' not found
Solutions:
# Check Python environment
which python
python --version
# Verify virtual environment
source .venv/bin/activate
pip list | grep data-bridges
# Reinstall if needed
pip install data-bridges-knotsIssue: Survey list not available
Symptoms:
⚠ Could not fetch survey list automatically.
Error: method not available
Solutions:
- This is expected if API doesn’t support listing
- Use
inspect_survey(ID)with a known ID - Check DataBridges dashboard for survey list
- Ask your data team
Issue: Permission denied for survey
Symptoms:
Error loading survey: permission denied
Solutions:
- Check
data_bridges_api_config.yamlcredentials - Verify SCOPES include
vamdatabridges_household-fulldata_get - Contact API administrator for access to this survey
- Try
access_type = "base"instead of “full”
Issue: Column names don’t match expectations
Symptoms:
- Expected column
meal_qualitybut not found - Column names have unexpected prefixes/suffixes
Solutions:
# Search column names
grep("quality", names(current_survey_data), value = TRUE, ignore.case = TRUE)
# Find similar names
grep("meal", names(current_survey_data), value = TRUE, ignore.case = TRUE)
# Check exact spelling
names(current_survey_data)[grepl("^meal", names(current_survey_data))]Best Practices
1. Always Run Discovery First
Don’t guess column names or structure. Discovery takes 5 minutes and saves hours of debugging.
2. Document Your Findings
Create a findings document for each survey. Future developers (including yourself!) will thank you.
3. Explore Beyond the Basics
Don’t just look at column names:
- Check data distributions
- Look for outliers
- Validate assumptions about data quality
- Test your validation logic with real data
4. Test with Real Data
Use current_survey_data to test validation logic before adding to exercise:
# Test a validation idea
test_data <- current_survey_data
# Check if case IDs are unique per day
test_data |>
group_by(date) |>
summarise(
total = n(),
unique_ids = n_distinct(Beneficiary_ID),
has_dupes = total != unique_ids
) |>
filter(has_dupes)
# If this shows duplicates, case_id_uniqueness_rule is needed!5. Keep the Script Updated
If you find the discovery script missing features, enhance it! It’s a living tool.
Advanced Discovery Techniques
Custom Exploration Scripts
Save common exploration patterns:
# Create: explore_survey.r
explore_columns <- function(data, pattern) {
matching <- grep(pattern, names(data), value = TRUE, ignore.case = TRUE)
cat("Found", length(matching), "columns matching '", pattern, "':\n")
for (col in matching) {
cat("\n", col, ":\n")
if (is.numeric(data[[col]])) {
print(summary(data[[col]]))
} else {
print(head(table(data[[col]], useNA = "always"), 10))
}
}
}
# Usage
explore_columns(current_survey_data, "meal")
explore_columns(current_survey_data, "quality")Data Quality Report
# Generate comprehensive data quality report
data_quality_report <- function(data) {
cat("========================================\n")
cat("DATA QUALITY REPORT\n")
cat("========================================\n\n")
# Completeness
cat("COMPLETENESS:\n")
completeness <- colMeans(!is.na(data)) * 100
low_completeness <- names(completeness[completeness < 80])
cat("Columns with <80% completeness:\n")
print(completeness[low_completeness])
# Duplicates
cat("\n\nDUPLICATES:\n")
id_cols <- grep("id|ID", names(data), value = TRUE)
for (col in id_cols) {
dupe_count <- sum(duplicated(data[[col]]))
if (dupe_count > 0) {
cat(col, ":", dupe_count, "duplicates\n")
}
}
# Date ranges
cat("\n\nDATE RANGES:\n")
date_cols <- grep("date|Date|time", names(data), value = TRUE)
for (col in date_cols) {
cat(col, ":\n")
print(range(data[[col]], na.rm = TRUE))
}
}
# Usage
data_quality_report(current_survey_data)Summary
The interactive discovery workflow provides:
- ✓ Fast survey ID discovery - Search and find surveys in seconds
- ✓ Automated structure inspection - See columns, types, and samples automatically
- ✓ Interactive exploration - Dive deep with R commands
- ✓ Data quality insights - Understand your data before coding
- ✓ Validation testing - Test logic with real data first
- ✓ Documentation foundation - Gather all info needed for exercise creation
Time saved: What used to take hours of trial-and-error now takes ~5-10 minutes of exploration!
Questions?
- Read the discovery script:
tools/discover_surveys.rhas inline comments - Try with known survey:
inspect_survey(5387)(BCM Helpdesk) - Check the full guide:
docs/ADDING_NEW_EXERCISES.md - Look at examples:
app/exercises/bcm_helpdesk_validation.r
Happy discovering! 🔍