I design systems that transform complex data, documents, and policy contexts into actionable decisions — bridging technical rigor with real-world impact.
"My work focuses on turning fragmented data, documents, and qualitative inputs into structured, usable intelligence."
I'm a data scientist and AI systems builder who operates at the intersection of technical architecture and applied research. My work spans decision support systems, knowledge pipelines, and economic analysis — with a consistent focus on making complex information actionable.
I bring together machine learning, NLP, and systems design with deep domain knowledge in economic and policy research. I build things that are meant to be used — not just analyzed.
AI-powered pipelines that synthesize multi-source intelligence into clear, structured outputs for high-stakes decisions.
OCR, LLM extraction, and retrieval-augmented architectures that unlock value trapped in unstructured documents.
Predictive models, NLP classifiers, and statistical frameworks tuned for operational and policy contexts.
Rigorous quantitative analysis connecting data-driven methods with economic theory and policy implications.
Develop and maintain production-grade analytics pipelines and Python/SQL ETL workflows that unify structured records with narrative text sources. Build anomaly detection and data validation frameworks for multi-source integrity, apply ML/NLP to transform qualitative inputs into decision-ready metrics, and support econometric broadband impact modeling used in policy research.
Developed supervised and unsupervised machine learning models for client research engagements, engineered ETL pipelines to prepare large datasets for modeling, and implemented evaluation practices to strengthen reproducibility. Produced exploratory analyses, technical documentation, and validation tooling to support reliable model training and decision support outputs.
Built predictive models to analyze membership behavior and improve retention-oriented campaign strategy. Automated recurring data pipelines, integrated structured and unstructured CRM data for fuller lifecycle visibility, and delivered interactive dashboards that enabled data-informed planning across business teams.
A business optimization and opportunity management system that blends computer vision document understanding, LLM extraction, and opportunity ranking to turn dense funding materials into decision-ready intelligence.
A retrieval-augmented generation system that aggregates federal funding opportunities, scores them against organizational criteria, and surfaces ranked, contextualized recommendations for decision-makers.
A relational system combining NLP classification with structured data storage to organize and surface insights from thousands of qualitative research inputs.
Quantitative economic modeling of broadband infrastructure investment, CMC funding, and BEAD program design — connecting data methods with policy-relevant conclusions.
A deep look at the systems, pipelines, and analyses I've built — organized by domain and designed for exploration.
An end-to-end AI-powered platform that aggregates federal funding opportunities from multiple sources, processes them through an LLM-based evaluation layer, and presents ranked, contextualized recommendations to decision-makers.
Organizations pursuing federal grants face a fragmented, high-volume landscape of opportunities. Without systematic intelligence, relevant opportunities are missed and resources are misallocated to poor-fit applications.
Built a multi-stage pipeline: (1) automated ingestion from Grants.gov and agency APIs, (2) document parsing and semantic chunking, (3) vector embedding and retrieval layer, (4) LLM scoring against organizational profile criteria, (5) ranked output interface with reasoning traces.
RAG architecture with FAISS vector store, OpenAI embeddings, structured LLM prompting for multi-criteria evaluation, Python pipeline with Airflow orchestration, React frontend for opportunity review.
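To make the retrieval layer concrete, here is a minimal sketch of the embed-and-search step, assuming OpenAI embeddings over a flat FAISS index. The model name, sample chunks, and parameters are illustrative rather than the production configuration.

```python
# Minimal sketch of the embed-and-retrieve step. Assumes an OpenAI API key
# in the environment; the model name and sample chunks are illustrative.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of document chunks with an OpenAI embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Flat inner-product index; normalizing vectors makes IP equal cosine similarity.
chunks = [
    "Notice of funding opportunity: rural broadband deployment grants.",
    "Notice of funding opportunity: digital literacy programs for libraries.",
]
vectors = embed(chunks)
faiss.normalize_L2(vectors)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Retrieve the chunks most relevant to an organizational-profile query.
query = embed(["broadband infrastructure funding for rural cooperatives"])
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```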
The ranking layer is driven by proprietary company IP: a weighted scoring framework and rule set built from internal win/loss history, sector priorities, capability maturity signals, and client-specific constraints. This private rubric compounds over time as team judgments and outcomes feed back into the system, turning preexisting IP into a scalable business development advantage for future opportunity cycles.
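The rubric itself is private, so the following is purely a hypothetical illustration of how a weighted multi-criteria score can rank opportunities; every criterion and weight here is invented.

```python
# Illustrative only: the production rubric is proprietary. A weighted
# multi-criteria score over per-criterion ratings might look like this.
CRITERIA_WEIGHTS = {            # hypothetical weights, not the real rubric
    "mission_fit": 0.35,
    "capability_match": 0.30,
    "past_win_similarity": 0.20,
    "effort_to_apply": -0.15,   # penalty: heavier applications score lower
}

def opportunity_score(ratings: dict[str, float]) -> float:
    """Combine 0-1 criterion ratings into a single ranking score."""
    return sum(CRITERIA_WEIGHTS[c] * ratings.get(c, 0.0) for c in CRITERIA_WEIGHTS)

print(opportunity_score({"mission_fit": 0.9, "capability_match": 0.7,
                         "past_win_similarity": 0.5, "effort_to_apply": 0.8}))
```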
Reduced opportunity review time by ~70%. Enabled a small team to systematically evaluate 200+ opportunities monthly, surfacing 12–15 high-fit grants per cycle that would have otherwise been overlooked. Direct influence on $4M+ in successful funding applications.
Active and in production use. Undergoing expansion to incorporate eligibility pre-screening and collaborative scoring workflows.
Prompt structure for multi-criteria scoring requires iterative calibration. Retrieval quality is the primary bottleneck — chunking strategy matters more than model choice for domain-specific documents.
A modular document processing system that ingests PDFs (including scanned documents), applies OCR, and uses structured LLM prompting to extract key fields, dates, requirements, and narrative summaries into a relational database.
Federal documents, funding opportunity announcements, and policy briefs are dense, inconsistently formatted, and typically locked in PDF form. Analysts spend hours manually extracting basic information before any real analysis can begin.
Pipeline stages: document ingestion → format detection → OCR (Tesseract/AWS Textract) → text cleaning → LLM extraction with schema-defined output → structured storage → API access layer for downstream systems.
AWS Textract for scanned documents, custom post-processing for layout reconstruction, JSON schema-constrained LLM extraction, PostgreSQL storage with full-text search, FastAPI layer for integration.
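A condensed sketch of the schema-constrained extraction step, assuming OCR text is already in hand. The field names, model choice, and prompt are illustrative placeholders; the production schema is considerably richer.

```python
# Sketch of schema-constrained LLM extraction over OCR output. Field names,
# model, and prompt are illustrative; validation rejects malformed output.
import json
from openai import OpenAI
from pydantic import BaseModel

class ExtractedDoc(BaseModel):
    title: str
    issuing_agency: str
    deadline: str | None = None      # ISO date string when present
    eligibility: list[str] = []
    summary: str

client = OpenAI()

def extract_fields(ocr_text: str) -> ExtractedDoc:
    """Ask the model for JSON matching the schema, then validate it."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # force parseable JSON output
        messages=[
            {"role": "system",
             "content": "Extract grant document fields as JSON with keys: "
                        "title, issuing_agency, deadline, eligibility, summary."},
            {"role": "user", "content": ocr_text[:12000]},
        ],
    )
    # Pydantic validation catches missing or mistyped fields before storage.
    return ExtractedDoc.model_validate(json.loads(resp.choices[0].message.content))
```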
Reduced manual document processing from 3–4 hours per document to under 10 minutes. Enabled analysts to work from structured data rather than raw PDFs, improving consistency and allowing comparative analysis across document sets.
Production deployment processing 50–80 documents per week. Integrated as upstream feed for the Opportunity Intelligence system.
Schema-constrained outputs are essential for reliability. OCR quality varies dramatically with scan quality — pre-processing steps (deskew, denoise) have outsized impact on downstream extraction accuracy.
A structured database system backed by NLP pipelines that ingests qualitative research data (interview notes, survey responses, stakeholder comments) and classifies, tags, and stores them for systematic analysis and retrieval.
Qualitative research data is inherently unstructured and resists systematic analysis. Research teams were manually coding thousands of inputs — a slow, inconsistent, and non-scalable process.
Designed a codebook-informed classification schema, trained a multi-label text classifier on annotated training data, built an ingestion pipeline, and created a relational schema linking themes, sources, and metadata for flexible querying.
Fine-tuned BERT-based classifier for thematic coding, spaCy for entity extraction, PostgreSQL with JSONB for flexible schema, Metabase for analyst-facing dashboards and query interfaces.
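For readers who want the mechanics, a minimal sketch of multi-label inference with a BERT-style model follows. The labels, threshold, and base checkpoint are placeholders; the production classifier was fine-tuned on annotated codebook data rather than used off the shelf.

```python
# Minimal sketch of multi-label thematic coding with a BERT-style model.
# Labels, threshold, and checkpoint are illustrative; the production model
# was fine-tuned on the project's annotated codebook data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["affordability", "infrastructure", "digital_skills", "adoption"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid outputs, BCE loss
)

def code_input(text: str, threshold: float = 0.5) -> list[str]:
    """Return every theme whose sigmoid probability clears the threshold."""
    batch = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**batch).logits)[0]
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

print(code_input("Monthly service costs are still out of reach for many households."))
```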
Transformed qualitative data from anecdote to evidence. Research team could query across 3,000+ inputs by theme, geography, and stakeholder type — enabling pattern identification that shaped policy recommendations and program design decisions.
Used in production for two major research cycles. Framework generalized and applied to NTIA listening session analysis.
Codebook design is the most important upfront investment. Ambiguous categories create downstream classification noise that's hard to correct at scale.
A data pipeline and interactive dashboard suite providing program managers with real-time operational insights, performance tracking, and resource utilization analytics across a multi-site program.
Program leadership was operating with lagging, siloed data — making resource allocation decisions based on weekly reports rather than current operational reality. Bottlenecks went undetected until they became crises.
Built an ETL pipeline aggregating data from five operational systems, designed a dimensional data model, created role-differentiated dashboard views (executive summary, operational detail, regional comparison), and built alerting for threshold breaches.
Python ETL with dbt for transformation, PostgreSQL data warehouse, Tableau for dashboards, custom alert logic for anomaly detection, role-based access controls.
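The alerting layer is deliberately simple. A stripped-down sketch of the threshold-breach check, with invented table, site, and metric names, looks roughly like this:

```python
# Sketch of threshold-breach alerting over a post-dbt metrics snapshot.
# Metric names, thresholds, and sites are invented for illustration.
import pandas as pd

THRESHOLDS = {"intake_backlog": 150, "utilization_pct": 0.95}

def check_breaches(latest: pd.DataFrame) -> list[str]:
    """Return one human-readable alert per site/metric over its threshold."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        breached = latest[latest[metric] > limit]
        for _, row in breached.iterrows():
            alerts.append(f"{row['site']}: {metric}={row[metric]} exceeds {limit}")
    return alerts

snapshot = pd.DataFrame({
    "site": ["north", "south"],
    "intake_backlog": [120, 180],
    "utilization_pct": [0.97, 0.88],
})
for alert in check_breaches(snapshot):
    print(alert)
```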
Program managers shifted from reactive to proactive resource management. Average response time to operational bottlenecks decreased from 5 days to under 24 hours. Informed reallocation of $1.2M in program resources across sites.
Deployed and in active use by 30+ program staff. Expanded to include predictive utilization forecasting.
Dashboard adoption depends more on UX and stakeholder involvement in design than on analytical sophistication. Build with end-users, not for them.
A machine learning model trained on historical challenge program data to forecast submission likelihood, volume estimates, and predicted quality tier — enabling program teams to target outreach and scale evaluation resources appropriately.
Challenge programs consistently faced unpredictable submission volumes — either overwhelming review capacity or missing participation targets. Planning was based on intuition rather than data.
Engineered features from historical program data, registration signals, outreach metrics, and program design characteristics. Built ensemble model with calibrated probability outputs and confidence intervals for planning use.
Gradient boosted trees (XGBoost), SHAP for interpretability, scikit-learn pipelines, calibration with Platt scaling, prediction intervals via conformal prediction.
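A compact sketch of the split-conformal interval construction around an XGBoost point forecast, on synthetic stand-in data; the feature set and magnitudes are invented.

```python
# Split-conformal prediction intervals around an XGBoost forecast, as used
# for submission-volume planning. All data here is synthetic stand-in.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                    # stand-in program features
y = X[:, 0] * 40 + 120 + rng.normal(0, 15, 500)  # stand-in submission counts

X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)
model = XGBRegressor(n_estimators=200, max_depth=3).fit(X_train, y_train)

# Split conformal: the adjusted (1 - alpha) quantile of calibration residuals
# gives an interval half-width with roughly the target coverage on new points.
alpha = 0.1
residuals = np.abs(y_cal - model.predict(X_cal))
n = len(residuals)
q = np.quantile(residuals, np.ceil((1 - alpha) * (n + 1)) / n)

x_new = rng.normal(size=(1, 8))
point = model.predict(x_new)[0]
print(f"forecast: {point:.0f} submissions, 90% interval: [{point - q:.0f}, {point + q:.0f}]")
```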
Submission volume predictions within 15% of actuals. Enabled program team to scale reviewer capacity proactively and target outreach to underrepresented segments, improving both operational efficiency and program equity outcomes.
Used for two challenge program cycles. SHAP-based explanations adopted by program team as a communication tool for leadership reporting.
Interpretability is not optional for operational models. Stakeholders need to understand why a forecast says what it says — black-box predictions erode trust and limit adoption.
A comprehensive economic impact study measuring the short- and long-run returns to federal broadband and community media center investments, using difference-in-differences and instrumental variable methods to establish causal estimates.
Federal broadband and CMC programs lacked rigorous, causal evidence of economic return. Policy debates were driven by advocacy rather than evidence, undermining funding allocation decisions.
Assembled a longitudinal dataset from FCC, Census, BLS, and program-specific records. Applied difference-in-differences with an event study design to estimate treatment effects on employment, income, and business formation. Used distance-to-infrastructure as an instrumental variable.
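For the econometrically inclined, here is a toy version of the two-way fixed effects difference-in-differences specification on synthetic panel data; the variable names mirror the design, but every magnitude is invented.

```python
# Toy two-way fixed effects DiD on synthetic tract-by-year panel data.
# Tract and year dummies absorb level differences; the treated:post
# coefficient is the average treatment effect estimate.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_tracts, n_years = 200, 8
df = pd.DataFrame({
    "tract": np.repeat(np.arange(n_tracts), n_years),
    "year": np.tile(np.arange(2015, 2015 + n_years), n_tracts),
})
df["treated"] = (df["tract"] < 100).astype(int)   # broadband build-out tracts
df["post"] = (df["year"] >= 2019).astype(int)     # post-expansion period
df["employment"] = (
    10 + 0.05 * df["treated"] * df["post"]        # true effect of interest
    + rng.normal(0, 0.2, len(df))
)

model = smf.ols("employment ~ treated:post + C(tract) + C(year)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["tract"]}  # cluster SEs by tract
)
print(model.params["treated:post"], model.bse["treated:post"])
```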
Broadband expansion associated with 4–7% employment growth in treated census tracts over 5-year windows. CMC funding showed concentrated impacts on youth employment and small business formation within 3-mile radii of funded centers.
Analysis cited in federal program reauthorization discussions. Findings contributed to revised funding allocation formulas that redirected approximately $80M toward higher-impact geographic targets.
Published as program evaluation report. Methods framework adapted for ongoing BEAD program monitoring design.
Causal identification requires patient data assembly. The quality of the counterfactual is what determines whether policy audiences trust the analysis.
Applied the qualitative research database framework to analyze thousands of public inputs submitted to NTIA listening sessions, identifying dominant themes, geographic patterns, and stakeholder-type differences in broadband policy priorities.
NTIA received thousands of written and verbal inputs across listening sessions. Manual review was too slow and too subjective to produce systematic intelligence for policy drafters.
Adapted thematic classification pipeline to NTIA-specific codebook. Processed ~4,500 inputs, identified top themes by stakeholder type and geography, and built summary visualizations for policy staff consumption.
Multi-label NLP classification, topic modeling (BERTopic), geographic aggregation to CBSA level, stakeholder type clustering, Tableau dashboard for policy staff.
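A minimal sketch of the BERTopic pass follows, with a toy corpus standing in for the roughly 4,500 real submissions; the repetition exists only so the dimensionality-reduction and clustering steps have enough points to run.

```python
# Sketch of the topic-modeling pass over listening-session inputs. The toy
# corpus is repeated so UMAP/HDBSCAN have enough points; the real pipeline
# fit on ~4,500 unique submissions.
from bertopic import BERTopic

base = [
    "We need affordable monthly plans for low-income households.",
    "Fiber build-out has skipped our county entirely.",
    "Seniors need digital skills training at the library.",
    "Permitting delays are stalling middle-mile construction.",
]
docs = base * 40

topic_model = BERTopic(min_topic_size=5)
topics, _ = topic_model.fit_transform(docs)

# Theme table (topic id, size, top keywords) feeds the policy-staff summary.
print(topic_model.get_topic_info())
```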
Provided policy team with first systematic, defensible read of public input within days of session close — rather than weeks. Theme priorities surfaced by analysis directly informed BEAD implementation guidance drafting.
Analysis delivered and used in policy drafting process. Methodology documented for replication in future comment periods.
Speed of delivery matters for policy relevance. An analysis that arrives after the decision window closes has zero impact, regardless of quality.
A policy analysis examining ambiguities in BEAD program design and how variation in state-level implementation choices could produce divergent coverage and equity outcomes — even with identical federal funding levels.
BEAD's flexible state-driven design creates wide implementation latitude. Stakeholders lacked a clear framework for understanding how different state choices interact with federal requirements — and what's at stake for underserved communities.
Mapped key program design decision points, modeled coverage scenario outcomes under different implementation choices using FCC broadband data and Census geography, and produced comparison analysis across state implementation plans.
Scenario modeling with FCC Fabric data, spatial analysis in Python (geopandas), comparative implementation plan review, structured policy framework development.
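A small sketch of the scenario comparison logic, with invented locations standing in for FCC Fabric data and two hypothetical served-speed thresholds:

```python
# Coverage scenario comparison with invented broadband serviceable locations
# in place of FCC Fabric data; thresholds mirror common "served" definitions.
import geopandas as gpd
from shapely.geometry import Point

locations = gpd.GeoDataFrame(
    {
        "down_mbps": [10, 80, 250, 20, 120],   # best available speed per location
        "geometry": [Point(-104.9, 39.7), Point(-104.8, 39.8),
                     Point(-104.7, 39.6), Point(-105.0, 39.9),
                     Point(-104.6, 39.5)],
    },
    crs="EPSG:4326",
)

# Two hypothetical state implementation choices: where to draw "served".
for label, threshold in {"strict (100 Mbps)": 100, "lenient (25 Mbps)": 25}.items():
    unserved = locations[locations["down_mbps"] < threshold]
    print(f"{label}: {len(unserved)} of {len(locations)} locations unserved")
```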
Used by advocacy organizations and state broadband offices to sharpen their review of state implementation plans. Framework cited in public comments to state broadband offices in three states.
Published analysis. Ongoing work tracking state implementation decisions against predicted coverage outcomes.
Policy analysis has the most impact when it gives decision-makers a clear framework — not just conclusions. Structure and accessibility matter as much as analytical rigor.
Open to collaborations in AI systems, data science, and applied research.
Open to contract, advisory, and collaborative research engagements in AI systems and policy analysis.
RAG architectures, decision pipelines, document intelligence
NLP, predictive modeling, analytical systems for civic/gov contexts
Broadband, digital equity, federal program analysis and evaluation
Presented a framework for designing RAG-based decision support systems in resource-constrained civic organizations, with case studies from federal grant opportunity intelligence.
Panel discussion on analytical infrastructure and data strategies for state broadband offices navigating BEAD implementation, including coverage verification and equity tracking.
Hands-on workshop on applying NLP classification methods to qualitative policy research data — including codebook design, model training, and analyst-facing output design.
Presented methodology and findings from NLP-powered analysis of NTIA listening session inputs, demonstrating how systematic qualitative analysis can inform federal policy drafting.
Guest lecture on bridging academic data science training with the practical realities of building AI systems for government and civic sector clients — including data quality, stakeholder trust, and interpretability requirements.