Data Scientist & AI Systems Builder

Victor
Vassallo

Turning complex information into decisions.

I design systems that transform complex data, documents, and policy contexts into actionable decisions — bridging technical rigor with real-world impact.

Pipelines in Production
NLP + Econometrics
Policy-Grade Decision Support

Systems-oriented.
Analytically precise.

"My work focuses on turning fragmented data, documents, and qualitative inputs into structured, usable intelligence."

I'm a data scientist and AI systems builder who operates at the intersection of technical architecture and applied research. My work spans decision support systems, knowledge pipelines, and economic analysis — with a consistent focus on making complex information actionable.

I bring together machine learning, NLP, and systems design with deep domain knowledge in economic and policy research. I build things that are meant to be used — not just analyzed.

Data Systems
Modeling + NLP
Decision Impact

What I Build

Decision Support Systems

AI-powered pipelines that synthesize multi-source intelligence into clear, structured outputs for high-stakes decisions.

Document & Knowledge Systems

OCR, LLM extraction, and retrieval-augmented architectures that unlock value locked in unstructured documents.

Analytical Models

Predictive models, NLP classifiers, and statistical frameworks tuned for operational and policy contexts.

Economic & Policy Analysis

Rigorous quantitative analysis connecting data-driven methods with economic theory and policy implications.

Selected Experience

Kaptivate LLC
May 2024 — Present
Data Scientist

Develop and maintain production-grade analytics pipelines and Python/SQL ETL workflows that unify structured records with narrative text sources. Build anomaly detection and data validation frameworks for multi-source integrity, apply ML/NLP to transform qualitative inputs into decision-ready metrics, and support econometric broadband impact modeling used in policy research.

Role Scope
  • Design and maintain production data pipelines across federal and operational data sources.
  • Integrate structured records, document corpora, and narrative inputs for decision-ready analytics.
  • Support analytics used by both technical and non-technical teams.
AI & Modeling Work
  • Built NLP and semantic modeling systems to classify qualitative research inputs at scale.
  • Transformed unstructured text into structured, queryable knowledge for downstream analysis.
  • Implemented schema-constrained extraction pipelines, anomaly detection, and threshold-based alerts.
Operational Impact
  • Created dashboards and analytical products actively used by 30+ program staff.
  • Architected retrieval-augmented opportunity intelligence systems.
  • Reduced contract-opportunity review time by 70% and supported over $4M in successful federal applications.
Core Stack
  • Python, SQL, ETL orchestration, data validation workflows, monitoring frameworks.
  • LLM workflows, RAG pipelines, text classification, clustering, semantic retrieval methods.
  • BI dashboards and stakeholder-facing decision support deliverables.
Data Society LLC
May 2023 — August 2023
Data Science / Machine Learning Intern

Developed supervised and unsupervised machine learning models for client research engagements, engineered ETL pipelines to prepare large datasets for modeling, and implemented evaluation practices to strengthen reproducibility. Produced exploratory analyses, technical documentation, and validation tooling to support reliable model training and decision support outputs.

Role Scope
  • Developed supervised and unsupervised machine learning workflows for client-facing research projects.
  • Contributed to modeling, experimentation, and interpretation across multiple engagement types.
  • Prioritized reproducible outputs and clearly documented methods.
Data Engineering Work
  • Engineered Python- and SQL-based ETL processes for large dataset preparation.
  • Organized transformed datasets for modeling, reporting, and downstream analytics.
  • Improved consistency and quality of input data available to modelers and analysts.
NLP & Validation
  • Applied NLP and automated text analysis to extract insights from unstructured inputs.
  • Evaluated model behavior and validated datasets before delivery.
  • Produced technical documentation that improved stakeholder clarity and analytical transparency.
Professional Growth
  • Strengthened practical model evaluation and quality assurance habits.
  • Learned to translate technical analysis into clear stakeholder-ready narratives.
  • Built the foundation for later production-focused AI and decision system work.
AARP
May 2022 — August 2022
Membership Lifecycle Management Intern

Built predictive models to analyze membership behavior and improve retention-oriented campaign strategy. Automated recurring data pipelines, integrated structured and unstructured CRM data for fuller lifecycle visibility, and delivered interactive dashboards that enabled data-informed planning across business teams.

Role Scope
  • Built predictive and descriptive models to understand membership behavior patterns.
  • Supported retention strategy and campaign targeting with data-backed recommendations.
  • Provided analytics outputs used by multiple business teams.
Data & Automation
  • Automated recurring data preparation and reporting workflows to reduce manual effort.
  • Improved consistency, timeliness, and accessibility of operational insights.
  • Created repeatable data products that teams could rely on for planning cycles.
Integrated Analytics
  • Unified CRM, behavioral, and reporting data into shared lifecycle analysis views.
  • Increased visibility into segmentation opportunities and performance drivers.
  • Delivered dashboard and presentation-ready analysis for cross-functional stakeholders.
Business Impact
  • Enabled more targeted retention planning and campaign optimization.
  • Made complex membership metrics easier to act on for non-technical teams.
  • Improved confidence in data-informed decision-making across business units.

Key Projects

Opportunity Intelligence & RAG System

A retrieval-augmented generation system that aggregates federal funding opportunities, scores them against organizational criteria, and surfaces ranked, contextualized recommendations for decision-makers.

RAG LLM Ranking AI
Qualitative Research Database (NLP)

A relational system combining NLP classification with structured data storage to organize and surface insights from thousands of qualitative research inputs.

NLP Classification Database
Economic Impact & Policy Analysis

Quantitative economic modeling of broadband infrastructure investment, community media center (CMC) funding, and BEAD program design — connecting data methods with policy-relevant conclusions.

Econometrics Policy Modeling

Areas of active focus & exploration

AI-driven decision systems for civic and government contexts
Opportunity intelligence pipelines with multi-source data fusion
Applied economic analysis of federal broadband & infrastructure policy
Retrieval-augmented systems for document-heavy research workflows

Systems &
Projects

A deep look at the systems, pipelines, and analyses I've built — organized by domain and designed for exploration.

01 Decision Systems
Opportunity Intelligence & RAG Decision System
A retrieval-augmented generation system that scores and surfaces ranked federal funding opportunities against organizational criteria.
RAG LLM AI
Overview

An end-to-end AI-powered platform that aggregates federal funding opportunities from multiple sources, processes them through an LLM-based evaluation layer, and presents ranked, contextualized recommendations to decision-makers.

Problem

Organizations pursuing federal grants face a fragmented, high-volume landscape of opportunities. Without systematic intelligence, relevant opportunities are missed and resources are misallocated to poor-fit applications.

Approach & Architecture

Built a multi-stage pipeline: (1) automated ingestion from Grants.gov and agency APIs, (2) document parsing and semantic chunking, (3) vector embedding and retrieval layer, (4) LLM scoring against organizational profile criteria, (5) ranked output interface with reasoning traces.

Methods

RAG architecture with FAISS vector store, OpenAI embeddings, structured LLM prompting for multi-criteria evaluation, Python pipeline with Airflow orchestration, React frontend for opportunity review.
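The retrieval-and-ranking core of stages (3) and (4) can be sketched in a few lines. This is a minimal stand-in, not the production FAISS/OpenAI implementation: it uses plain cosine similarity over toy vectors, and the field names (`embedding`, `title`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_opportunities(profile_vec, opportunities, top_k=3):
    """Return the opportunities whose embeddings best match the
    organizational-profile embedding, highest similarity first."""
    scored = [(cosine(profile_vec, opp["embedding"]), opp["title"])
              for opp in opportunities]
    return sorted(scored, reverse=True)[:top_k]
```

In the full pipeline, the similarity-ranked candidates would then go to the LLM scoring layer for multi-criteria evaluation against the organizational profile.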

Custom Scoring (Powered by Company IP)

The ranking layer is driven by proprietary company IP: a weighted scoring framework and rule set built from internal win/loss history, sector priorities, capability maturity signals, and client-specific constraints. This private rubric compounds over time as team judgments and outcomes feed back into the system, turning preexisting IP into a scalable business development advantage for future opportunity cycles.

Simplified Architecture Flow
(1) Inputs: Procurement boards plus Grants.gov and agency feeds are scraped and normalized on a schedule.
(2) Data Prep: Opportunities are cleaned, deduplicated, and enriched with capability and team context.
(3) IP-Powered Scoring: Company IP applies weighted fit scoring and owner assignment based on strategic priorities.
(4) RAG Deep Analysis: RAG produces opportunity summaries, rationale, and document-grounded insights for review. Leads can ask custom questions for deeper analysis, and that feedback is captured to improve future scoring.
(5) Distribution: Ranked summaries are sent by email and written to the CRM for team action.
(6) Go / No-Go: Go moves the opportunity to the deal tracker and initiates document upload; No-Go marks it as a pass or requests additional analysis.
Decision Impact

Reduced opportunity review time by ~70%. Enabled a small team to systematically evaluate 200+ opportunities monthly, surfacing 12–15 high-fit grants per cycle that would have otherwise been overlooked. Direct influence on $4M+ in successful funding applications.

Outcome / Status

Active and in production use. Undergoing expansion to incorporate eligibility pre-screening and collaborative scoring workflows.

Key Learnings

Prompt structure for multi-criteria scoring requires iterative calibration. Retrieval quality is the primary bottleneck — chunking strategy matters more than model choice for domain-specific documents.

02 Document & Knowledge Systems
Document Intelligence Pipeline (OCR + LLM)
Automated pipeline that transforms dense federal documents and policy briefs into structured, queryable data using OCR and LLM extraction.
OCR LLM NLP
Overview

A modular document processing system that ingests PDFs (including scanned documents), applies OCR, and uses structured LLM prompting to extract key fields, dates, requirements, and narrative summaries into a relational database.

Problem

Federal documents, funding opportunity announcements, and policy briefs are dense, inconsistently formatted, and typically locked in PDF form. Analysts spend hours manually extracting basic information before any real analysis can begin.

Approach & Architecture

Pipeline stages: document ingestion → format detection → OCR (Tesseract/AWS Textract) → text cleaning → LLM extraction with schema-defined output → structured storage → API access layer for downstream systems.

Methods

AWS Textract for scanned documents, custom post-processing for layout reconstruction, JSON schema-constrained LLM extraction, PostgreSQL storage with full-text search, FastAPI layer for integration.
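The schema-constrained step can be illustrated with a small output validator. The field names below (`title`, `deadline`, `award_ceiling`) are hypothetical examples; the real pipeline enforces a fuller JSON schema.

```python
import json

# Hypothetical schema for fields extracted from each document.
SCHEMA = {
    "title": str,
    "deadline": str,
    "award_ceiling": (int, float),
}

def validate_extraction(raw_llm_output: str) -> dict:
    """Parse an LLM response and enforce the expected schema,
    rejecting outputs with missing or mistyped fields."""
    record = json.loads(raw_llm_output)
    for field, expected in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise ValueError(f"bad type for {field}")
    return record
```

Rejected outputs can be retried or routed to human review rather than silently polluting the database, which is what makes the downstream data trustworthy.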

Decision Impact

Reduced manual document processing from 3–4 hours per document to under 10 minutes. Enabled analysts to work from structured data rather than raw PDFs, improving consistency and allowing comparative analysis across document sets.

Outcome / Status

Production deployment processing 50–80 documents per week. Integrated as upstream feed for the Opportunity Intelligence system.

Key Learnings

Schema-constrained outputs are essential for reliability. OCR quality varies dramatically with scan quality — pre-processing steps (deskew, denoise) have outsized impact on downstream extraction accuracy.

Qualitative Research Database (NLP + Relational System)
A system combining NLP classification with structured storage to organize and surface insights from thousands of qualitative research inputs.
NLP Classification Database
Overview

A structured database system backed by NLP pipelines that ingests qualitative research data (interview notes, survey responses, stakeholder comments) and classifies, tags, and stores them for systematic analysis and retrieval.

Problem

Qualitative research data is inherently unstructured and resists systematic analysis. Research teams were manually coding thousands of inputs — a slow, inconsistent, and non-scalable process.

Approach & Architecture

Designed a codebook-informed classification schema, trained a multi-label text classifier on annotated training data, built ingestion pipeline, and created a relational schema linking themes, sources, and metadata for flexible querying.

Methods

Fine-tuned BERT-based classifier for thematic coding, spaCy for entity extraction, PostgreSQL with JSONB for flexible schema, Metabase for analyst-facing dashboards and query interfaces.
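The shape of the system (multi-label tagging feeding a relational store) can be sketched as follows. The real classifier is a fine-tuned BERT model and the store is PostgreSQL; this stand-in swaps in a trivial keyword rule and SQLite purely to illustrate the data flow, and the theme names are invented.

```python
import sqlite3

# Toy stand-in for the fine-tuned BERT classifier: one keyword rule
# per theme. The real system predicts multiple labels per input.
THEME_KEYWORDS = {
    "affordability": ["cost", "price", "afford"],
    "access": ["coverage", "rural", "connect"],
}

def classify(text):
    """Return every theme whose keywords appear in the text."""
    text = text.lower()
    return [theme for theme, kws in THEME_KEYWORDS.items()
            if any(k in text for k in kws)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inputs (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE tags (input_id INTEGER, theme TEXT);
""")

def ingest(text):
    """Store an input and one tag row per predicted theme."""
    cur = conn.execute("INSERT INTO inputs (text) VALUES (?)", (text,))
    for theme in classify(text):
        conn.execute("INSERT INTO tags VALUES (?, ?)", (cur.lastrowid, theme))

ingest("Rural residents cannot afford the monthly cost.")
themes = [r[0] for r in conn.execute(
    "SELECT DISTINCT theme FROM tags ORDER BY theme")]
```

Because each input can carry several tags, analysts can query across themes, sources, and metadata in any combination, which is what turns coded qualitative data into evidence.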

Decision Impact

Transformed qualitative data from anecdote to evidence. Research team could query across 3,000+ inputs by theme, geography, and stakeholder type — enabling pattern identification that shaped policy recommendations and program design decisions.

Outcome / Status

Used in production for two major research cycles. Framework generalized and applied to NTIA listening session analysis.

Key Learnings

Codebook design is the most important upfront investment. Ambiguous categories create downstream classification noise that's hard to correct at scale.

03 Analytics & Operations
Operational Analytics & Resource Allocation Dashboard
A real-time dashboard system enabling program managers to monitor operational metrics and optimize resource allocation decisions across regions.
Analytics Dashboard Viz
Overview

A data pipeline and interactive dashboard suite providing program managers with real-time operational insights, performance tracking, and resource utilization analytics across a multi-site program.

Problem

Program leadership was operating with lagging, siloed data — making resource allocation decisions based on weekly reports rather than current operational reality. Bottlenecks went undetected until they became crises.

Approach & Architecture

Built ETL pipeline aggregating data from 5 operational systems, designed a dimensional data model, created role-differentiated dashboard views (executive summary, operational detail, regional comparison), and built alerting for threshold breaches.

Methods

Python ETL with dbt for transformation, PostgreSQL data warehouse, Tableau for dashboards, custom alert logic for anomaly detection, role-based access controls.
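The threshold-breach alerting reduces to a trailing z-score check over recent history. The window size and z-limit below are illustrative defaults, not the deployed configuration.

```python
from statistics import mean, stdev

def check_thresholds(series, window=7, z_limit=2.0):
    """Flag the latest value if it deviates from the trailing window
    by more than z_limit standard deviations (simple z-score alert)."""
    history, latest = series[-window - 1:-1], series[-1]
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_limit
```

A check like this runs per metric per site after each pipeline refresh, and breaches feed the alerting layer rather than waiting for someone to notice a dashboard.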

Decision Impact

Program managers shifted from reactive to proactive resource management. Average response time to operational bottlenecks decreased from 5 days to under 24 hours. Informed reallocation of $1.2M in program resources across sites.

Outcome / Status

Deployed and in active use by 30+ program staff. Expanded to include predictive utilization forecasting.

Key Learnings

Dashboard adoption depends more on UX and stakeholder involvement in design than on analytical sophistication. Build with end-users, not for them.

Challenge Submission Prediction Model
A predictive model forecasting submission volume and quality for competitive challenge programs, enabling proactive outreach and resource planning.
ML Prediction Modeling
Overview

A machine learning model trained on historical challenge program data to forecast submission likelihood, volume estimates, and predicted quality tier — enabling program teams to target outreach and scale evaluation resources appropriately.

Problem

Challenge programs consistently faced unpredictable submission volumes — either overwhelming review capacity or missing participation targets. Planning was based on intuition rather than data.

Approach

Engineered features from historical program data, registration signals, outreach metrics, and program design characteristics. Built ensemble model with calibrated probability outputs and confidence intervals for planning use.

Methods

Gradient boosted trees (XGBoost), SHAP for interpretability, scikit-learn pipelines, calibration with Platt scaling, prediction intervals via conformal prediction.
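The split-conformal intervals mentioned above can be computed from calibration residuals alone; this sketch is model-agnostic and independent of the XGBoost pipeline itself.

```python
import math

def conformal_interval(calib_actuals, calib_preds, new_pred, alpha=0.1):
    """Split-conformal prediction interval: the (1 - alpha) quantile of
    absolute calibration residuals becomes the interval half-width."""
    residuals = sorted(abs(a - p) for a, p in zip(calib_actuals, calib_preds))
    n = len(residuals)
    # Conservative finite-sample index: ceil((n + 1)(1 - alpha)) - 1.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = residuals[k]
    return new_pred - q, new_pred + q
```

For planning use, the interval width matters as much as the point forecast: reviewer capacity is scaled to the upper bound, not the midpoint.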

Decision Impact

Submission volume predictions landed within 15% of actuals. Enabled program team to scale reviewer capacity proactively and target outreach to underrepresented segments, improving both operational efficiency and program equity outcomes.

Outcome / Status

Used for two challenge program cycles. SHAP-based explanations adopted by program team as a communication tool for leadership reporting.

Key Learnings

Interpretability is not optional for operational models. Stakeholders need to understand why a forecast says what it says — black-box predictions erode trust and limit adoption.

04 Economic & Applied Analysis
Economic Impact of Broadband & CMC Funding
Quantitative analysis measuring the economic returns to broadband infrastructure investment and community-based funding programs in underserved markets.
Econometrics Policy Impact
Overview

A comprehensive economic impact study measuring the short- and long-run returns to federal broadband and community media center investments, using difference-in-differences and instrumental variable methods to establish causal estimates.

Problem

Federal broadband and CMC programs lacked rigorous, causal evidence of economic return. Policy debates were driven by advocacy rather than evidence, undermining funding allocation decisions.

Approach & Methods

Assembled longitudinal dataset from FCC, Census, BLS, and program-specific records. Applied difference-in-differences with event study design to estimate treatment effects on employment, income, and business formation. Used distance-to-infrastructure as instrumental variable.
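The event-study variant of this difference-in-differences design typically takes a two-way fixed-effects form like the following (symbols are generic, not the study's actual variable names):

```latex
y_{it} = \alpha_i + \lambda_t
       + \sum_{k \neq -1} \beta_k \, \mathbf{1}[t - T_i = k]
       + X_{it}'\gamma + \varepsilon_{it}
```

Here $\alpha_i$ and $\lambda_t$ are tract and period fixed effects, $T_i$ is the treatment (infrastructure arrival) date for tract $i$, and the $\beta_k$ trace out dynamic effects relative to the period just before treatment, which also makes pre-trend violations visible.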

Key Findings

Broadband expansion was associated with 4–7% employment growth in treated census tracts over 5-year windows. CMC funding showed concentrated impacts on youth employment and small business formation within 3-mile radii of funded centers.

Decision Impact

Analysis cited in federal program reauthorization discussions. Findings contributed to revised funding allocation formulas that redirected approximately $80M toward higher-impact geographic targets.

Outcome / Status

Published as program evaluation report. Methods framework adapted for ongoing BEAD program monitoring design.

Key Learnings

Causal identification requires patient data assembly. The quality of the counterfactual is what determines whether policy audiences trust the analysis.

NTIA Listening Session Analysis
Systematic NLP-powered analysis of public listening session inputs submitted to NTIA, transforming qualitative public comment into structured policy intelligence.
NLP Policy Analysis
Overview

Applied the qualitative research database framework to analyze thousands of public inputs submitted to NTIA listening sessions, identifying dominant themes, geographic patterns, and stakeholder-type differences in broadband policy priorities.

Problem

NTIA received thousands of written and verbal inputs across listening sessions. Manual review was too slow and too subjective to produce systematic intelligence for policy drafters.

Approach

Adapted thematic classification pipeline to NTIA-specific codebook. Processed ~4,500 inputs, identified top themes by stakeholder type and geography, and built summary visualizations for policy staff consumption.

Methods

Multi-label NLP classification, topic modeling (BERTopic), geographic aggregation to CBSA level, stakeholder type clustering, Tableau dashboard for policy staff.

Decision Impact

Provided policy team with first systematic, defensible read of public input within days of session close — rather than weeks. Theme priorities surfaced by analysis directly informed BEAD implementation guidance drafting.

Outcome / Status

Analysis delivered and used in policy drafting process. Methodology documented for replication in future comment periods.

Key Learnings

Speed of delivery matters for policy relevance. An analysis that arrives after the decision window closes has zero impact, regardless of quality.

Blurring BEAD
Analysis examining how BEAD program design, eligibility definitions, and state implementation choices interact — and their implications for coverage outcomes and equity.
Policy Broadband Equity
Overview

A policy analysis examining ambiguities in BEAD program design and how variation in state-level implementation choices could produce divergent coverage and equity outcomes — even with identical federal funding levels.

Problem

BEAD's flexible state-driven design creates wide implementation latitude. Stakeholders lacked a clear framework for understanding how different state choices interact with federal requirements — and what's at stake for underserved communities.

Approach

Mapped key program design decision points, modeled coverage scenario outcomes under different implementation choices using FCC broadband data and Census geography, and produced comparison analysis across state implementation plans.

Methods

Scenario modeling with FCC Fabric data, spatial analysis in Python (geopandas), comparative implementation plan review, structured policy framework development.
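The core scenario-modeling logic is recomputing eligible populations under different speed thresholds. The blocks and numbers below are purely illustrative, not FCC Fabric data, and the thresholds are hypothetical eligibility definitions.

```python
# Toy census blocks with current download speeds (Mbps).
blocks = [
    {"id": "b1", "pop": 1200, "mbps": 8},
    {"id": "b2", "pop": 800,  "mbps": 22},
    {"id": "b3", "pop": 400,  "mbps": 90},
]

def unserved_population(blocks, threshold_mbps):
    """Population living in blocks below the eligibility speed threshold."""
    return sum(b["pop"] for b in blocks if b["mbps"] < threshold_mbps)

# Two hypothetical state eligibility definitions produce different
# target populations from identical underlying data.
strict = unserved_population(blocks, 25)
loose = unserved_population(blocks, 100)
```

Run across real block-level data with spatial joins (geopandas in the actual analysis), this is how identical federal funding levels can imply very different coverage and equity outcomes state by state.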

Decision Impact

Used by advocacy organizations and state broadband offices to sharpen their review of state implementation plans. Framework cited in public comments to state broadband offices in three states.

Outcome / Status

Published analysis. Ongoing work tracking state implementation decisions against predicted coverage outcomes.

Key Learnings

Policy analysis has the most impact when it gives decision-makers a clear framework — not just conclusions. Structure and accessibility matter as much as analytical rigor.

Contact &
Speaking

Open to collaborations in AI systems, data science, and applied research.

Email
victor@vassallo.io

Best for project inquiries, collaborations, and speaking requests.

LinkedIn
linkedin.com/in/victorvassallo

Professional background, publications, and updates.

Resume
Download PDF →

Full work history, technical skills, and education.

Current Availability
Selectively Available

Open to contract, advisory, and collaborative research engagements in AI systems and policy analysis.

AI Systems Design

RAG architectures, decision pipelines, document intelligence

Applied Data Science

NLP, predictive modeling, analytical systems for civic/gov contexts

Policy & Economic Research

Broadband, digital equity, federal program analysis and evaluation

Speaking & Presentations

Conference Talk
Digital Equity Summit
Building AI Decision Systems for Civic Contexts

Presented a framework for designing RAG-based decision support systems in resource-constrained civic organizations, with case studies from federal grant opportunity intelligence.

Panel
State Broadband Leadership Network
Data-Driven BEAD Implementation: What States Need

Panel discussion on analytical infrastructure and data strategies for state broadband offices navigating BEAD implementation, including coverage verification and equity tracking.

Workshop
Applied Research Collaborative
From Qualitative to Structured: NLP for Policy Research

Hands-on workshop on applying NLP classification methods to qualitative policy research data — including codebook design, model training, and analyst-facing output design.

Presentation
NTIA Broadband Forum
Listening at Scale: Systematic Analysis of Public Input

Presented methodology and findings from NLP-powered analysis of NTIA listening session inputs, demonstrating how systematic qualitative analysis can inform federal policy drafting.

Invited Talk
University Data Science Program
Applied AI in Policy and Civic Tech: A Practitioner's Perspective

Guest lecture on bridging academic data science training with the practical realities of building AI systems for government and civic sector clients — including data quality, stakeholder trust, and interpretability requirements.

Available for speaking on AI systems, data science,
and broadband & digital equity policy.
Send Speaking Inquiry →