IP Anchor Section 2: Data sourcing & QD approval

The QD approval gate (Section 2b) exists because data is not free and pipelines have limits. Understand the infrastructure before writing your data section.

What this week covers

Every strategy runs on data. The data must exist, be accessible, be affordable, and be ingestible into our systems. This week walks through the AlgoGators data pipeline: what tools ingest data, where it lives, how it flows to strategies, and what the QD approval gate (Section 2b) is checking for.

The data flow at AlgoGators

Raw market data
↓
Databento (ingest)
↓
TimescaleDB (storage)
↓
data-ngin (transform)
↓
AlgoSystem (backtest/live)
↓
Strategy execution ← QR writes signal spec here ← QD implements from IP

Understanding this flow is critical. You will describe your data requirements in Section 2a. The QD team will use those requirements to build the pipeline. But you need to understand the constraints at each step.

Databento

What it is: A market data provider. Tick data, OHLCV (open, high, low, close, volume), order book snapshots across equities, futures, options, FX.

Coverage: US equities, US futures (CME, CBOT, COMEX), cryptocurrencies, FX futures. International coverage limited.

Cost: Databento charges per data series per month. Novel data sources require approval before you spend time on them.

Frequency: Tick-level to daily OHLCV. You request the granularity you need.

For QR: Know what you're asking for before Section 2b. If you want daily closing prices for the last 10 years, that's cheap and trivial. If you want 1-minute bars for 500 stocks for 5 years, that's expensive and requires approval.
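To get a feel for the gap between those two requests before you reach Section 2b, a rough row count is usually enough. The sketch below assumes 252 trading days per year and 390 one-minute bars per regular US equity session. Both are approximations, and `estimate_rows` is an illustrative helper, not part of any AlgoGators tool. It says nothing about actual Databento pricing.

```python
TRADING_DAYS_PER_YEAR = 252   # assumption: typical US trading calendar
MINUTES_PER_SESSION = 390     # assumption: 9:30 to 16:00 ET regular session


def estimate_rows(symbols: int, bars_per_day: int, years: int) -> int:
    """Approximate number of bars in a data request (hypothetical helper)."""
    return symbols * bars_per_day * TRADING_DAYS_PER_YEAR * years


# Daily closes for one instrument over 10 years.
daily_one_symbol = estimate_rows(symbols=1, bars_per_day=1, years=10)

# 1-minute bars for 500 stocks over 5 years.
minute_500_symbols = estimate_rows(
    symbols=500, bars_per_day=MINUTES_PER_SESSION, years=5
)

print(f"daily closes, 1 symbol, 10y:   {daily_one_symbol:,} rows")
print(f"1-minute bars, 500 stocks, 5y: {minute_500_symbols:,} rows")
```

Under these assumptions the second request is roughly 100,000 times larger than the first, which is why it needs a conversation with QD.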

TimescaleDB

What it is: A PostgreSQL extension optimized for time-series data. Time-indexed tables, automatic partitioning, continuous aggregates (pre-computed, incrementally refreshed summaries over time buckets).

How it works: Raw tick data from Databento is ingested into TimescaleDB. Millions of rows a day. Partitioned by time so queries on recent data are fast.

Continuous aggregates: You can define a continuous aggregate that computes, say, daily OHLCV from tick data. Once it is defined with a refresh policy, TimescaleDB keeps it up to date as new ticks arrive, so no separate aggregation job is needed.
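To make concrete what a daily OHLCV aggregate contains, here is the same reduction sketched in pandas. This is not how TimescaleDB implements it, and the column names `price` and `size` are assumptions about the tick table, not the actual AlgoGators schema.

```python
import pandas as pd


def ticks_to_daily_ohlcv(ticks: pd.DataFrame) -> pd.DataFrame:
    """Collapse ticks (DatetimeIndex, 'price' and 'size' columns) into daily bars."""
    bars = ticks["price"].resample("1D").ohlc()          # open, high, low, close
    bars["volume"] = ticks["size"].resample("1D").sum()  # total traded size per day
    return bars.dropna(subset=["open"])                  # drop days with no trades


# Tiny synthetic tick sample, for illustration only.
ticks = pd.DataFrame(
    {"price": [100.0, 101.5, 99.0, 100.5, 102.0], "size": [10, 5, 20, 10, 15]},
    index=pd.to_datetime([
        "2024-01-02 09:30:00", "2024-01-02 12:00:00", "2024-01-02 15:59:00",
        "2024-01-03 09:30:00", "2024-01-03 15:59:00",
    ]),
)
daily = ticks_to_daily_ohlcv(ticks)
print(daily)
```

The database version does the same bucketing once, on the server, so every strategy reads the same bars.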

For QR: You don't write SQL. You define your data requirements in Section 2a ("daily OHLCV for corn futures"), and QD writes the queries. But knowing the data lives in TimescaleDB helps you understand data availability and freshness.

data-ngin (data engineering pipeline)

What it is: Internal cleaning and feature engineering layer. Handles corporate actions (stock splits, dividends), outlier detection, missing value handling, and feature construction.

Corporate actions: When a stock splits 2-for-1, raw prices from before and after the split are not comparable. data-ngin adjusts for this using point-in-time adjustments (historical prices adjusted forward, not backward, to avoid look-ahead bias).
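One way to read "adjusted forward" is that prices before a split stay as recorded and prices from the ex-date onward are rescaled by the cumulative split ratio, so an adjusted value never changes when a later split occurs. The sketch below illustrates that reading for splits only (dividends are ignored). The function and variable names are hypothetical, and this is not data-ngin's actual code.

```python
import pandas as pd


def forward_adjust_for_splits(close: pd.Series, split_ratio: pd.Series) -> pd.Series:
    """Rescale post-split prices into pre-split share units.

    split_ratio holds e.g. 2.0 on the ex-date of a 2-for-1 split;
    dates without a split are treated as ratio 1.0.
    """
    cumulative = split_ratio.reindex(close.index).fillna(1.0).cumprod()
    return close * cumulative


dates = pd.date_range("2024-03-04", periods=4, freq="B")
raw_close = pd.Series([100.0, 102.0, 51.0, 52.0], index=dates)
splits = pd.Series([2.0], index=[dates[2]])  # 2-for-1 split on the third day

adjusted = forward_adjust_for_splits(raw_close, splits)

# The raw series shows a spurious -50% return on the split date;
# the adjusted series shows the true 0% move.
print(raw_close.pct_change().round(4).tolist())
print(adjusted.pct_change().round(4).tolist())
```

Because earlier values are never rewritten, a backtest run today and one run after a future split see the same history.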

Outliers: Exchange halts, data errors, gaps. data-ngin flags and handles these.
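As a rough illustration of what "flags" can mean here, the sketch below marks missing bars and implausibly large one-bar moves. The 20% threshold and the function name are assumptions chosen for the example. data-ngin's real rules are defined by QD and may differ.

```python
import numpy as np
import pandas as pd


def flag_bad_bars(close: pd.Series, max_abs_return: float = 0.20) -> pd.DataFrame:
    """Flag missing values and large one-bar jumps (hypothetical rule set)."""
    # fill_method=None: do not forward-fill gaps before computing returns.
    ret = close.pct_change(fill_method=None)
    return pd.DataFrame({
        "close": close,
        "is_missing": close.isna(),
        "is_jump": ret.abs() > max_abs_return,
    })


close = pd.Series(
    [100.0, 101.0, np.nan, 102.0, 150.0, 151.0],
    index=pd.date_range("2024-01-02", periods=6, freq="B"),
)
flags = flag_bad_bars(close)
print(flags)
```

Flagging rather than silently deleting keeps the decision visible: a halt, a bad print, and a genuine limit move all need different handling.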

Features: When you describe your signal as "5-day rolling z-score," data-ngin is what computes it. You specify the formula in your IP; QD implements it in data-ngin.

For QR: In Section 2a, you describe the features you need with formulas. In Section 3a, you use those features in your signal. data-ngin makes them available.

The QD Approval Gate (Section 2b)

This is a critical checkpoint. You cannot move forward without QD sign-off.

Why it exists: Too many QR analysts have written strategies on data that doesn't exist. They spent weeks developing on a dataset that was never subscribed to, or that costs $500k/month, or that isn't available at the frequency needed, or that has only 3 years of history (too short for backtesting).

What Section 2b must contain:

What happens: You submit your IP with Section 2 (both 2a and 2b). QT reviews strategy fit. QD reviews data feasibility and signs off in 2b. If 2b is not signed, the IP is rejected, full stop. This is not punitive — it's a forcing function to catch infeasible research early.
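The four failure modes listed under "Why it exists" (not subscribed, too expensive, wrong frequency, too little history) can be written down as a simple pre-submission self-check. The sketch below is only an illustration: the field names, the frequency ranking, and the budget and history thresholds are all hypothetical, and it does not replace QD's review.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical granularity ranking: a lower number means finer data.
FREQ_RANK = {"tick": 0, "1min": 1, "1h": 2, "1d": 3}


@dataclass
class DataRequest:
    name: str
    subscribed: bool          # is the dataset already available to the team?
    monthly_cost_usd: float
    finest_available: str     # finest granularity the source offers
    required: str             # granularity the signal needs
    years_of_history: float


def feasibility_issues(req: DataRequest, budget_usd: float, min_years: float) -> list[str]:
    """Return the issues to raise with QD; an empty list means none were found."""
    issues = []
    if not req.subscribed:
        issues.append("dataset is not currently subscribed")
    if req.monthly_cost_usd > budget_usd:
        issues.append("monthly cost exceeds budget")
    if FREQ_RANK[req.finest_available] > FREQ_RANK[req.required]:
        issues.append("required frequency is finer than what is available")
    if req.years_of_history < min_years:
        issues.append("history is too short for backtesting")
    return issues


bad = DataRequest("alt-data feed", False, 500_000, "1d", "1min", 3)
good = DataRequest("corn futures daily", True, 0.0, "tick", "1d", 15)

print(feasibility_issues(bad, budget_usd=10_000, min_years=10))
print(feasibility_issues(good, budget_usd=10_000, min_years=10))
```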

Python data access patterns

Pulling OHLCV from TimescaleDB

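import pandas as pd
import psycopg2

# Connection details are placeholders; use the credentials QD gives you.
conn = psycopg2.connect(
    host="timescaledb.internal",
    dbname="algogators",
    user="qr_user",
    password="***"
)

# Pass the date range as query parameters rather than formatting it into the SQL.
query = """
    SELECT timestamp, open, high, low, close, volume
    FROM corn_futures
    WHERE timestamp BETWEEN %(start)s AND %(end)s
    ORDER BY timestamp
"""

# Load the bars into a DataFrame indexed by timestamp.
df = pd.read_sql(
    query,
    conn,
    params={"start": "2020-01-01", "end": "2024-01-01"},
    parse_dates=['timestamp'],
    index_col='timestamp',
)
conn.close()  # release the connection once the data is loaded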

Feature construction: rolling z-score

def rolling_zscore(series, window=20):
    """Z-score normalization using rolling window."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    return (series - mean) / std

# Normalized signal for entry/exit
df['signal'] = rolling_zscore(df['close'], window=60)
entry_threshold = 2.0  # Entry when signal < -2
df['signal_entry'] = df['signal'] < -entry_threshold

Handling corporate actions (splits/dividends)

# data-ngin handles this automatically
# You specify the adjustment method in Section 2a:
# "Prices adjusted for splits/dividends using the forward-adjusted
#  (point-in-time) method. Adjustment factors applied at ex-date."
# QD implements it; you use the adjusted prices in your signal.

NasaPowerCouncil data pipeline (example)

Raw data: NASA POWER API. Daily solar radiation (ALLSKY_SFC_SW_DWN), temperature (T2M), precipitation (PRECTOTCORR) at lat/lon grid.
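A minimal sketch of the request and parsing step follows. It assumes the public POWER daily point endpoint and its usual JSON layout (`properties` → `parameter` → variable → date string), the `AG` community setting, and `-999` as the fill value for missing data. Check all of these against the current NASA POWER documentation before relying on them. The helper names are illustrative, and the network call itself is left out so the example runs offline.

```python
from urllib.parse import urlencode

import pandas as pd

# Assumed endpoint for daily point queries; verify against the POWER docs.
POWER_DAILY_POINT_URL = "https://power.larc.nasa.gov/api/temporal/daily/point"


def build_power_url(lat: float, lon: float, start: str, end: str) -> str:
    """Build a daily point query URL; start and end are YYYYMMDD strings."""
    params = {
        "parameters": "ALLSKY_SFC_SW_DWN,T2M,PRECTOTCORR",
        "community": "AG",
        "latitude": lat,
        "longitude": lon,
        "start": start,
        "end": end,
        "format": "JSON",
    }
    return f"{POWER_DAILY_POINT_URL}?{urlencode(params)}"


def parse_power_payload(payload: dict) -> pd.DataFrame:
    """Turn the assumed JSON layout into a date-indexed frame, one column per variable."""
    df = pd.DataFrame(payload["properties"]["parameter"])
    df.index = pd.to_datetime(df.index, format="%Y%m%d")
    return df.replace(-999.0, float("nan")).sort_index()  # -999 assumed to mean missing


url = build_power_url(lat=41.6, lon=-93.6, start="20240601", end="20240602")

# Hand-written payload in the assumed shape, standing in for a real response.
sample = {"properties": {"parameter": {
    "T2M": {"20240601": 21.5, "20240602": -999.0},
    "PRECTOTCORR": {"20240601": 0.0, "20240602": 3.2},
}}}
weather = parse_power_payload(sample)
print(url)
print(weather)
```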

Geographic mapping: Corn Belt counties (IL, IA, MN, MO, etc.). Map each county to grid point lat/lon centroids.

Transformation: Compute growing-degree-day (GDD) deviation from 20-year seasonal average. Daily rolling sums, seasonal adjustment.

Feature: GDD_deviation_30d = 30-day rolling sum of (daily GDD - 20-year seasonal average for that calendar window).
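The feature definition above can be sketched as follows. Two simplifications are assumptions of this example, not part of the source: daily GDD is computed from mean temperature (T2M) as max(T2M − base, 0), and the base is 10 °C. The seasonal average is built from a separate baseline series so that the feature never uses temperatures from its own evaluation period.

```python
import pandas as pd


def gdd_deviation_30d(
    t2m: pd.Series,
    baseline_t2m: pd.Series,
    base_c: float = 10.0,   # assumed base temperature in Celsius
    window: int = 30,
) -> pd.Series:
    """30-day rolling sum of (daily GDD - baseline average GDD for that day of year)."""
    gdd = (t2m - base_c).clip(lower=0.0)
    baseline_gdd = (baseline_t2m - base_c).clip(lower=0.0)

    # Average GDD per calendar day over the baseline years.
    climatology = baseline_gdd.groupby(baseline_gdd.index.dayofyear).mean()
    expected = pd.Series(
        climatology.reindex(gdd.index.dayofyear).to_numpy(), index=gdd.index
    )
    return (gdd - expected).rolling(window).sum()


# Synthetic data: a flat 20 C baseline for 2000-2019, then a 22 C spell in 2020.
baseline = pd.Series(20.0, index=pd.date_range("2000-01-01", "2019-12-31", freq="D"))
current = pd.Series(22.0, index=pd.date_range("2020-05-01", periods=60, freq="D"))

feature = gdd_deviation_30d(current, baseline)
print(feature.dropna().head())
```

With these inputs each day runs 2 GDD above the baseline, so the first complete 30-day window sums to 60.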

Section 2b sign-off: "Data sourced from NASA POWER REST API (public, no licensing cost). Daily resolution, available back to 1981. Ingestible via Python requests → CSV → TimescaleDB. QD: Confirmed."

Common mistakes

Five data pipeline failures

  • Skipping Section 2b and assuming data availability. You need written confirmation from QD. Email counts; attachment or reply-all to your IP draft counts. Verbal is not enough.
  • Requesting data at the wrong frequency. If your signal needs intraday data, a daily request won't work. Be precise about the timestamp too: "daily close at 4:00 PM ET," not "daily data."
  • Not accounting for data latency. Some releases lag: COT data (published Friday for positions as of Tuesday), USDA forecasts (monthly, mid-month), earnings (announced after the close). If your backtest lets the signal use a value before it was actually published, you have look-ahead bias.
  • Writing Section 2a so vaguely that QD can't parse what you need. "Corn price data" is not enough. Need: "Continuous corn futures (ZC, nearest-to-expiry rolled 5 days before contract expiry), daily close, 2015–2024."
  • Assuming alternative data is cheap. Satellite data, web scraping, proprietary sources cost money. Get pricing in Section 2b before committing 10 weeks to research that costs $50k/month to run.
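The data latency mistake above is easiest to avoid by indexing each value by when it was published rather than by the date it describes. The sketch below uses the COT case (Tuesday positions, Friday release) with a fixed three-day lag. The lag, the function name, and the sample values are illustrative assumptions, and real release calendars shift around holidays.

```python
import pandas as pd


def as_known_on(
    values: pd.Series, release_lag: pd.Timedelta, trade_dates: pd.DatetimeIndex
) -> pd.Series:
    """Return, for each trade date, the latest value that had been published by then."""
    released = values.copy()
    released.index = released.index + release_lag       # reference date -> release date
    return released.reindex(trade_dates, method="ffill")  # carry the last release forward


# Values indexed by the Tuesday they describe (made-up numbers).
cot = pd.Series([10.0, 20.0], index=pd.to_datetime(["2024-01-02", "2024-01-09"]))
trade_dates = pd.bdate_range("2024-01-02", "2024-01-12")

known = as_known_on(cot, pd.Timedelta(days=3), trade_dates)
print(known)
```

Before the first Friday release the series is empty (NaN), which is exactly what a live strategy would have seen.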