# Final Domain Collection Research

## Summary of Findings

### Available Methods in jao-py

The `JaoPublicationToolPandasClient` class provides three domain query methods:

1. **`query_final_domain(mtu, presolved, cne, co, use_mirror)`** (Line 233)
   - Final Computation - final flow-based (FB) parameters after long-term nominations (LTN) are applied
   - Published: 10:30 D-1
   - Most complete dataset (recommended for Phase 2)

2. **`query_prefinal_domain(mtu, presolved, cne, co, use_mirror)`** (Line 248)
   - Pre-Final (EarlyPub) - Pre-final FB parameters before LTN
   - Published: 08:00 D-1
   - Earlier publication time, but before LTN application

3. **`query_initial_domain(mtu, presolved, cne, co)`** (Line 264)
   - Initial Computation (Virgin Domain) - Initial flow-based parameters
   - Published: Early in D-1
   - Before any adjustments

### Method Parameters

```python
def query_final_domain(
    mtu: pd.Timestamp,              # Market Time Unit (1 hour, timezone-aware)
    presolved: bool | None = None,  # Filter: True=binding, False=non-binding, None=ALL
    cne: str | None = None,         # CNEC name keyword filter (NOT EIC-based!)
    co: str | None = None,          # Contingency keyword filter
    use_mirror: bool = False,       # Use mirror.flowbased.eu for faster bulk download
) -> pd.DataFrame
```
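
For orientation, a minimal single-hour query might look like the sketch below. The default constructor and the `from jao import ...` path are assumptions; adjust to your installed jao-py.

```python
import pandas as pd
from jao import JaoPublicationToolPandasClient  # import path assumed

client = JaoPublicationToolPandasClient()  # default constructor assumed

# One MTU = one hour; the timestamp must be timezone-aware
mtu = pd.Timestamp("2025-09-23 12:00", tz="Europe/Amsterdam")

df = client.query_final_domain(mtu=mtu, presolved=None)  # None = ALL CNECs
print(df.shape)
print(df.columns.tolist()[:10])
```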

### Key Findings

1. **DENSE Data Acquisition**:
   - Set `presolved=None` to get ALL CNECs (binding + non-binding)
   - This provides the DENSE format needed for Phase 2 feature engineering

2. **Filtering Limitations**:
   - ❌ NO EIC-based filtering on server side
   - ✅ Only keyword-based filters (cne, co) available
   - **Solution**: Download all CNECs, filter locally by EIC codes

3. **Query Granularity**:
   - Method queries **1 hour at a time** (mtu = Market Time Unit)
   - For 24 months: Need 17,520 API calls (1 per hour)
   - Alternative: Use `use_mirror=True` for whole-day downloads

4. **Mirror Option** (Recommended for bulk collection):
   - URL: `https://mirror.flowbased.eu/dacc/final_domain/YYYY-MM-DD`
   - Returns full day (24 hours) as CSV in ZIP file
   - Much faster than hourly API calls
   - Set `use_mirror=True` OR set env var `JAO_USE_MIRROR=1` (both shown in the sketch after this list)

5. **Data Structure** (from `parse_final_domain()`):
   - Returns pandas DataFrame with columns:
     - **Identifiers**: `mtu` (timestamp), `tso`, `cnec_name`, `cnec_eic`, `direction`
     - **Contingency**: `contingency_*` fields (nested structure flattened)
     - **Presolved field**: Indicates if CNEC is binding (True) or redundant (False)
     - **RAM breakdown**: `ram`, `fmax`, `imax`, `frm`, `fuaf`, `amr`, `lta_margin`, etc.
     - **PTDFs**: `ptdf_AT`, `ptdf_BE`, ..., `ptdf_SK` (12 Core zones)
   - Timestamps converted to Europe/Amsterdam timezone
   - snake_case column names (except PTDFs)
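
Both ways of enabling the mirror from finding 4, sketched under the same import assumption as above:

```python
import os

import pandas as pd
from jao import JaoPublicationToolPandasClient  # import path assumed

client = JaoPublicationToolPandasClient()
midnight = pd.Timestamp("2025-09-23 00:00", tz="Europe/Amsterdam")

# Per call: one request returns the whole day via mirror.flowbased.eu
df_day = client.query_final_domain(mtu=midnight, presolved=None, use_mirror=True)

# Or process-wide, before issuing queries (exact lookup semantics not verified)
os.environ["JAO_USE_MIRROR"] = "1"
```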

### Recommended Implementation for Phase 2

**Option A: Mirror-based (FASTEST)**:
```python
from pathlib import Path

import pandas as pd
import polars as pl

# JAOClient: project wrapper around JaoPublicationToolPandasClient
# (lives in collect_jao.py; assumed to forward query_final_domain()).


def collect_final_domain_sample(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],  # 200 EIC codes from Phase 1
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data for specific CNECs using the mirror."""
    client = JAOClient()

    all_data = []
    # tz-aware daily range; midnight MTUs are safe across DST changes
    for day in pd.date_range(start_date, end_date, tz="Europe/Amsterdam"):
        # With use_mirror=True a single query returns the full day (24 hours)
        df_day = client.query_final_domain(
            mtu=day,
            presolved=None,   # ALL CNECs (DENSE!)
            use_mirror=True,  # fast bulk download
        )

        # Filter to target CNECs only
        df_filtered = df_day[df_day["cnec_eic"].isin(target_cnec_eics)]
        all_data.append(df_filtered)

    # Combine and save
    df_full = pd.concat(all_data, ignore_index=True)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)

    return pl_df
```
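
A possible driver for the one-week validation run planned under Implementation Strategy below; the EIC values and the output filename are placeholders:

```python
from pathlib import Path

# Placeholders: substitute the 200 EIC codes identified in Phase 1
target_eics = ["EIC-PLACEHOLDER-1", "EIC-PLACEHOLDER-2"]

df_week = collect_final_domain_sample(
    start_date="2025-09-23",
    end_date="2025-09-30",
    target_cnec_eics=target_eics,
    output_path=Path("data/processed/final_domain_test_week.parquet"),
)
print(len(df_week))  # expect ~200 CNECs x 192 hours = 38,400 rows
```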

**Option B: Hourly API calls (SLOWER, but more granular)**:
```python
from pathlib import Path

import pandas as pd
import polars as pl

# Exception import path assumed; adjust if your jao-py version differs.
from jao.exceptions import NoMatchingDataError


def collect_final_domain_hourly(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data hour by hour."""
    client = JAOClient()  # project wrapper, as in Option A

    all_data = []
    # tz-aware hourly range avoids DST localization errors ('H' is deprecated)
    for mtu in pd.date_range(start_date, end_date, freq="h", tz="Europe/Amsterdam"):
        try:
            df_hour = client.query_final_domain(
                mtu=mtu,
                presolved=None,  # ALL CNECs
            )
        except NoMatchingDataError:
            continue  # hour may have no data
        df_filtered = df_hour[df_hour["cnec_eic"].isin(target_cnec_eics)]
        all_data.append(df_filtered)

    df_full = pd.concat(all_data, ignore_index=True)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)

    return pl_df
```

### Data Volume Estimates

**Full Download (all ~20K CNECs)**:
- 20,000 CNECs × 17,520 hours = 350M records
- ~27 columns × 8 bytes/value = ~75 GB uncompressed
- Parquet compression: ~10-20 GB

**Filtered (200 target CNECs)**:
- 200 CNECs × 17,520 hours = 3.5M records
- ~27 columns × 8 bytes/value = ~750 MB uncompressed
- Parquet compression: ~100-150 MB
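
A quick back-of-envelope check of these figures:

```python
HOURS_24_MONTHS = 730 * 24            # = 17,520 MTUs
for n_cnecs in (20_000, 200):
    records = n_cnecs * HOURS_24_MONTHS
    raw_gb = records * 27 * 8 / 1e9   # ~27 columns x 8 bytes/value
    print(f"{n_cnecs:>6} CNECs: {records:>11,} records, ~{raw_gb:.2f} GB raw")
# 20,000 CNECs: 350,400,000 records, ~75.69 GB raw
#    200 CNECs:   3,504,000 records, ~0.76 GB raw
```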

### Implementation Strategy

1. **Phase 1 complete**: Identify top 200 CNECs from SPARSE data
2. **Extract EIC codes**: Save to `data/processed/critical_cnecs_eic_codes.csv`
3. **Test on 1 week**: Validate DENSE collection with mirror
   ```python
   # Test: 2025-09-23 to 2025-09-30 (8 days)
   # Expected: 200 CNECs × 192 hours = 38,400 records
   ```
4. **Collect 24 months**: Using mirror for speed
5. **Validate DENSE structure**:
   ```python
   # Polars syntax; assumes exactly one row per (cnec_eic, mtu) pair
   unique_cnecs = df['cnec_eic'].n_unique()
   unique_hours = df['mtu'].n_unique()
   expected = unique_cnecs * unique_hours
   actual = len(df)
   assert actual == expected, f"Not DENSE! {actual} != {expected}"
   ```

### Advantages of Mirror Method

- ✅ Faster: 1 request/day vs 24 requests/day
- ✅ Rate limit friendly: 730 requests vs 17,520 requests
- ✅ More reliable: Less chance of timeout/connection errors
- ✅ Complete days: Guarantees all 24 hours present

### Next Steps

1. Add `collect_final_domain_dense()` method to `collect_jao.py` (stub sketch after this list)
2. Test on 1-week sample with target EIC codes
3. Validate DENSE structure and data quality
4. Run 24-month collection after Phase 1 complete
5. Use DENSE data for Tier 1 & Tier 2 feature engineering
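
A hypothetical stub for step 1, delegating to the Option A function above; the CSV column name `cnec_eic` and the default output path are assumptions:

```python
from pathlib import Path

import polars as pl


# Sketch for collect_jao.py; wraps the mirror-based Option A collector
def collect_final_domain_dense(
    start_date: str,
    end_date: str,
    eic_csv: Path = Path("data/processed/critical_cnecs_eic_codes.csv"),
    output_path: Path = Path("data/processed/final_domain_dense.parquet"),
) -> pl.DataFrame:
    """DENSE final-domain collection for the Phase 1 CNEC set (mirror-based)."""
    target_eics = pl.read_csv(eic_csv)["cnec_eic"].to_list()  # column name assumed
    return collect_final_domain_sample(start_date, end_date, target_eics, output_path)
```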

---

- **Research completed**: 2025-11-05
- **jao-py version**: 0.6.2
- **Source**: `C:\Users\evgue\projects\fbmc_chronos2\.venv\Lib\site-packages\jao\jao.py`