Last updated: 2026-02-11

Scope: Poland (PL) only

Active sources: 1 type (`off_api`), 1,025 entries

Related: See `RESEARCH_WORKFLOW.md` for the full step-by-step data collection process, and `SCORING_METHODOLOGY.md` for how collected data is scored.

When collecting nutrition and product data for a Polish product, use sources in this strict order:
| Priority | Source | Type | Confidence | Notes |
|---|---|---|---|---|
| 1 | Physical product label (PL market) | Primary | verified | Gold standard; EU Reg. 1169/2011 mandates this |
| 2 | Manufacturer's official website (PL) | Primary | verified | Must match PL market variant, not US/UK version |
| 3 | Polish governmental nutrition database | Reference | verified | IŻŻ / NCEZ; cross-validation for generic categories |
| 4 | Open Food Facts (PL barcode) | Secondary | verified | Only if entry has been community-verified |
| 5 | Polish retailer website | Secondary | estimated | Biedronka.pl, Lidl.pl product pages |
| 6 | Scientific literature / EFSA opinions | Reference | verified | For methodology, thresholds, and category benchmarks |
| 7 | Category-typical averages | Tertiary | estimated | Used only when no label data is available |

- Priority 1 always wins. If you have the physical label, override all other sources.
- Never mix country variants. Lay's Classic in Poland has different salt/fat content than Lay's Classic in the UK. Always confirm the product is the Polish SKU.
- Governmental databases (Priority 3) provide reference ranges, not product-specific values. Use them to cross-validate, not to override label data.
- Scientific literature (Priority 6) informs methodology and thresholds, not individual product values.
- When using Priority 7 (category averages), clearly mark the score confidence as `estimated` and add a SQL comment explaining the estimation.
- Every product should be traceable to ≥ 2 sources wherever possible (e.g., OFF + manufacturer website, or OFF + governmental reference range).
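
The priority rules above can be sketched as a small selection helper. This is an illustrative sketch rather than project code: `SourceCandidate`, `pick_primary`, and the source keys are hypothetical names, and the numbers simply mirror the priority table.

```python
from dataclasses import dataclass

# Priority numbers mirror the source table (1 = physical label).
SOURCE_PRIORITY = {
    "label": 1,
    "manufacturer_site": 2,
    "gov_database": 3,
    "open_food_facts": 4,
    "retailer_site": 5,
    "scientific_literature": 6,
    "category_average": 7,
}

# Priorities 3 and 6 supply reference ranges and methodology,
# not product-specific values, so they never become the primary source.
REFERENCE_ONLY = {"gov_database", "scientific_literature"}


@dataclass(frozen=True)
class SourceCandidate:
    kind: str     # a key of SOURCE_PRIORITY (hypothetical naming)
    country: str  # market the data was published for, e.g. "PL"


def pick_primary(candidates):
    """Return the highest-priority Polish, product-specific source, or None."""
    usable = [
        c for c in candidates
        if c.country == "PL" and c.kind not in REFERENCE_ONLY
    ]
    return min(usable, key=lambda c: SOURCE_PRIORITY[c.kind], default=None)
```

A UK label is ignored even though labels rank first, because country variants are never mixed; the governmental database remains available for cross-validation but is not selected as the product's source.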
Under Regulation (EU) No 1169/2011, all pre-packaged food sold in Poland must display (per 100g or 100ml):
| Field | Required | Our column |
|---|---|---|
| Energy (kJ/kcal) | Yes | calories |
| Fat (g) | Yes | total_fat_g |
| — of which saturates (g) | Yes | saturated_fat_g |
| Carbohydrate (g) | Yes | carbs_g |
| — of which sugars (g) | Yes | sugars_g |
| Protein (g) | Yes | protein_g |
| Salt (g) | Yes | salt_g |
Voluntary but recorded when available:
| Field | Required | Our column |
|---|---|---|
| Fibre (g) | No | fibre_g |
| Trans fat (g) | No | trans_fat_g |
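
As a rough illustration of how the two tables above might be used during data entry, the sketch below maps label fields to column names and reports which mandatory columns are still missing. The helper name `missing_mandatory` and the dict-based row are assumptions made for the example; only the column names come from the tables.

```python
# Mandatory per-100g declarations mapped to our column names.
EU7_COLUMNS = {
    "Energy (kJ/kcal)": "calories",
    "Fat (g)": "total_fat_g",
    "of which saturates (g)": "saturated_fat_g",
    "Carbohydrate (g)": "carbs_g",
    "of which sugars (g)": "sugars_g",
    "Protein (g)": "protein_g",
    "Salt (g)": "salt_g",
}

# Voluntary declarations, recorded when the label provides them.
VOLUNTARY_COLUMNS = {
    "Fibre (g)": "fibre_g",
    "Trans fat (g)": "trans_fat_g",
}


def missing_mandatory(row):
    """Return the mandatory columns that are absent or None in a product row."""
    # A value of 0 is a legitimate label value, so only None counts as missing.
    return [col for col in EU7_COLUMNS.values() if row.get(col) is None]
```

For a row that has every mandatory field except salt, `missing_mandatory(row)` returns `["salt_g"]`.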
Polish labels are in Polish. When recording data:
- The `ingredients_raw` column stores ingredient lists in standardized English (cleaned ASCII, deduplicated, comma-separated). This was normalized via migrations `001200` and `001600`.
- Product names should be recorded as they appear on the Polish label, using Polish diacritics (ą, ć, ę, ł, ń, ó, ś, ź, ż).
- Brand names may remain in their international form (e.g., "Lay's", "Pringles").
These databases provide reference values for cross-validation and category benchmarks. They do not replace product-specific label data but are essential for detecting errors in community-sourced data.
- Institution: Instytut Żywności i Żywienia (IŻŻ) / Narodowe Centrum Edukacji Żywieniowej (NCEZ)
- URL: https://ncez.pzh.gov.pl/abc-zywienia/tabele-wartosci-odzywczej/
- Data type: Generic food composition (e.g., "potato chips, salted" — not brand-specific)
- Use for:
  - Cross-validating that a product's nutrition values fall within expected ranges for its category
  - Deriving category-typical averages when no label data is available (Priority 7)
  - Validating scoring thresholds against Polish dietary reference values
- Limitations: Not brand-specific; updated infrequently; may not cover ultra-processed categories
- Confidence: `verified` when used for cross-validation; `estimated` when used as the primary data source

- URL: https://www.efsa.europa.eu/
- Key resources:
  - Dietary Reference Values: basis for our scoring thresholds
  - Comprehensive European Food Consumption Database: EU-wide food consumption data
  - Scientific opinions on food additives, contaminants, and novel foods
- Use for:
  - Justifying scoring weight rationale and threshold ceilings (already cited in `SCORING_METHODOLOGY.md`)
  - Cross-checking additive safety assessments (e.g., E171 titanium dioxide withdrawal)
  - Reference values when Polish-specific data is unavailable
- Confidence: `verified` for reference values and additive assessments

- Key resources:
  - Salt reduction: <5g/day target
  - Sugars intake: <10% energy from free sugars
  - Trans fat elimination: REPLACE initiative
- Use for: Threshold justification in scoring methodology (already referenced)
- Confidence: `verified`; these are the gold standard for population-level dietary targets

Manufacturer websites are Priority 2 sources. They often publish full per-100g nutrition tables for their Polish SKUs. When using manufacturer data:
- Confirm the website serves the Polish market (`.pl` domain or PL language selector)
- Verify the product page matches the current formulation (check pack design photo)
- Screenshot or archive the page for traceability
- Record the URL and access date in the `products` table (`source_url`, `source_ean` columns)

| Manufacturer | PL Website | Categories covered | Notes |
|---|---|---|---|
| PepsiCo Polska | https://www.pepsico.pl | Chips (Lay's, Doritos, Cheetos), Drinks (Pepsi, 7UP, Lipton) | Full nutrition tables on product pages |
| Lorenz Snack-World | https://www.lorenz-snacks.pl | Chips (Crunchips, NicNac's) | Polish-specific product pages |
| Intersnack (Funny Frisch) | https://www.intersnack.pl | Chips (Chio) | Limited PL web presence |
| Maspex | https://www.maspex.com | Drinks (Tymbark, Kubuś), Cereals (Lubella), Instant | Group site with brand sub-pages |
| Mondelēz International | https://www.mondelezinternational.com | Sweets (Milka, Oreo, Prince Polo, Alpen Gold) | Use PL product finder |
| Nestlé Polska | https://www.nestle.pl | Cereals (Nestlé, Cheerios), Dairy, Sweets (KitKat) | Full PL product catalogue |
| Danone Polska | https://www.danone.pl | Dairy (Danio, Activia, Actimel), Baby (Bebiko) | Nutrition tabs on product pages |
| Ferrero | https://www.ferrero.pl | Sweets (Kinder, Nutella, Ferrero Rocher) | PL-specific pages |
| Mars Polska | https://www.mars.com/poland-pl | Sweets (Snickers, M&M's, Twix) | Use PL country selector |
| Sokołów | https://www.sokolow.pl | Meat (wędliny, kabanosy) | Full nutrition per product |
| Morliny | https://www.morliny.pl | Meat (parówki, kiełbasy) | Detailed product pages |
| Tarczyński | https://www.tarczynski.pl | Meat (kabanosy) | Product-level nutrition |
| Pudliszki | https://www.pudliszki.pl | Sauces (ketchup, passata) | Full nutrition tables |
| Łowicz | https://www.lowicz.com.pl | Sauces (dżemy, ketchup) | Product pages with nutrition |
| Develey | https://www.develey.pl | Sauces (musztarda, ketchup) | PL product range |
| Mlekpol | https://www.mlekpol.com.pl | Dairy (Łaciate) | Full nutrition info |
| Mlekovita | https://www.mlekovita.com.pl | Dairy | Product-level data |
| Żywiec Zdrój / Danone Waters | https://www.zywiec-zdroj.pl | Drinks (water) | Mineral composition |
| Coca-Cola HBC Polska | https://www.cocacolaep.com/pl | Drinks (Coca-Cola, Fanta, Sprite) | PL product pages |
| Red Bull Polska | https://www.redbull.com/pl-pl | Drinks (energy) | Nutrition on product page |
| Kompania Piwowarska | https://www.kp.pl | Alcohol (Tyskie, Żubr, Lech) | Limited nutrition data |
| Step | Action |
|---|---|
| 1 | Navigate to the manufacturer's PL website |
| 2 | Find the specific product page (match pack size + variant) |
| 3 | Confirm nutrition table is per 100g (not per serving) |
| 4 | Extract all available fields (EU-7 + voluntary) |
| 5 | Cross-validate against OFF and/or label if available |
| 6 | Record URL + access date in products table (source_url, source_ean) |
| 7 | Set source_type = 'off_api' on the products row |
- URL: https://world.openfoodfacts.org/
- API v2: `GET https://world.openfoodfacts.org/api/v2/product/{EAN}.json`
- Polish search: `GET https://world.openfoodfacts.org/cgi/search.pl?search_terms={query}&countries_tags=en:poland&json=1`
- Filter by: Country = Poland (`countries_tags` must include `en:poland`), or search by EAN barcode
- Trust level: Verify that the entry's nutrition table image matches a Polish label
- Useful for: Nutri-Score (pre-computed), NOVA group, barcode, ingredient lists, additive count
- Caution: Community-contributed data can be outdated or from wrong country variant
- Verification criteria: `completeness` ≥ 0.5, modified within 3 years, Polish label image present

Full API field mapping: See `RESEARCH_WORKFLOW.md` §3.4 for detailed field-to-column mapping.
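
The verification criteria above could be pre-screened automatically before a human checks the label image. The sketch below is a minimal illustration under stated assumptions: `completeness` and `countries_tags` are named in this document, while `last_modified_t` (Unix timestamp) and `image_nutrition_url` are assumed field names in the product JSON. It does not call the network, and an image being present does not prove it shows a Polish label, so the manual check still applies.

```python
from datetime import datetime, timedelta, timezone

OFF_PRODUCT_URL = "https://world.openfoodfacts.org/api/v2/product/{ean}.json"


def off_product_url(ean):
    """Build the API v2 product URL for a barcode."""
    return OFF_PRODUCT_URL.format(ean=ean)


def passes_off_prescreen(product, now=None):
    """Apply the documented criteria to an already-fetched product dict."""
    now = now or datetime.now(timezone.utc)

    completeness = float(product.get("completeness") or 0.0)

    # Assumed field: last modification time as a Unix timestamp.
    modified = datetime.fromtimestamp(
        int(product.get("last_modified_t") or 0), tz=timezone.utc
    )
    recent = now - modified <= timedelta(days=3 * 365)

    polish = "en:poland" in (product.get("countries_tags") or [])

    # Assumed field: URL of the nutrition-table photo, if one was uploaded.
    has_nutrition_image = bool(product.get("image_nutrition_url"))

    return completeness >= 0.5 and recent and polish and has_nutrition_image
```

Entries that fail the pre-screen can still be used, but with confidence downgraded to `estimated` as described in the cross-validation rules.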
| Retailer | Website | Category | Notes |
|---|---|---|---|
| Biedronka | https://www.biedronka.pl | Discount | Largest chain; has private labels |
| Lidl | https://www.lidl.pl | Discount | Good product pages with nutrition |
| Żabka | https://www.zabka.pl | Convenience | Limited online product info |
| Auchan | https://www.auchan.pl | Hypermarket | Detailed product pages |
| Carrefour | https://www.carrefour.pl | Hypermarket | Nutrition info sometimes available |
Rules for retailer data:
- Retailer websites may lag behind label changes.
- If the website shows different values than the label, the label wins.
- Private-label products (e.g., "Top Chips" from Biedronka) may not appear on other retailer sites.
- Always verify the nutrition table is per 100g, not per serving.
When using any non-label source, cross-validate against at least one other source:
| Check | Threshold | Action on failure |
|---|---|---|
| OFF vs label: any field differs > 10% | ±10% | Use label value, note discrepancy in SQL comment |
| OFF entry has no Polish label image | — | Downgrade confidence to estimated |
| Retailer vs label: different values | ±10% | Use label, flag in comment |
| Multiple sources agree within 5% | ±5% | confidence = 'verified' |
| Energy cross-check fails (±15%) | ±15% | Flag data entry error, investigate |
Full validation rules: See `RESEARCH_WORKFLOW.md` §4 for range sanity checks, cross-field rules, and trace value handling.
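
The tolerance checks in the table above can be sketched as follows. The function names are hypothetical, and the energy estimate is an assumption: it uses the general conversion factors of 4 kcal/g for carbohydrate and protein and 9 kcal/g for fat, ignores fibre, polyols, and alcohol, and treats `calories` as kcal per 100g. The project's actual rules are the ones in `RESEARCH_WORKFLOW.md`.

```python
def relative_diff(value, reference):
    """Relative deviation of value from reference (0.0 when they are equal)."""
    if value == reference:
        return 0.0
    if reference == 0:
        return float("inf")
    return abs(value - reference) / abs(reference)


def fields_over_tolerance(candidate, reference, tolerance=0.10):
    """Names of shared fields where candidate deviates by more than tolerance."""
    shared = candidate.keys() & reference.keys()
    return sorted(
        f for f in shared
        if relative_diff(candidate[f], reference[f]) > tolerance
    )


def energy_check_ok(row, tolerance=0.15):
    """Compare declared kcal with an estimate from macronutrients (assumed factors)."""
    estimated = 4 * row["carbs_g"] + 4 * row["protein_g"] + 9 * row["total_fat_g"]
    return relative_diff(row["calories"], estimated) <= tolerance
```

For example, an OFF entry with `salt_g = 1.5` against a label value of `1.2` deviates by 25%, so `fields_over_tolerance` reports `["salt_g"]` and the label value is used.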
Scientific publications are used to justify methodology, not to provide product-specific nutrition data. All papers cited in SCORING_METHODOLOGY.md should also be listed here for traceability.
| Reference | Citation | Used for |
|---|---|---|
| NOVA classification | Monteiro CA et al. (2019). Ultra-processed foods: what they are and how to identify them. Public Health Nutrition, 22(5), 936–941. doi:10.1017/S1368980018003762 | processing_risk and nova_classification basis |
| Nutri-Score algorithm | Santé publique France (2024). Nutri-Score algorithm update. | nutri_score_label computation when not on label |
| Nutri-Score validation | Julia C, Hercberg S (2017). Development of a new front-of-pack nutrition label in France. Eur J Public Health, 27(suppl_3). | Scientific basis for Nutri-Score adoption |
| Reference | Citation | Used for |
|---|---|---|
| WHO salt guidelines | WHO (2023). Salt reduction. Fact sheet. | salt_g ceiling (3.0g/100g) in scoring |
| WHO sugar guidelines | WHO (2015). Guideline: Sugars intake for adults and children. | sugars_g ceiling (27g/100g) in scoring |
| WHO trans fat | WHO (2023). REPLACE trans fat: An action package. | trans_fat_g ceiling (2g/100g) and weight rationale |
| EFSA saturated fat DRV | EFSA NDA Panel (2010). Scientific Opinion on DRVs for fats. EFSA Journal, 8(3):1461. | saturated_fat_g ceiling (10g/100g) |
| EFSA energy DRV | EFSA NDA Panel (2013). Scientific Opinion on DRVs for energy. EFSA Journal, 11(1):3005. | calories ceiling (600 kcal/100g) |
| Reference | Citation | Used for |
|---|---|---|
| UPF & cardiovascular | Srour B et al. (2019). Ultra-processed food intake and risk of cardiovascular disease. BMJ, 365:l1451. doi:10.1136/bmj.l1451 | Weight rationale for processing-related factors |
| UPF narrative review | Elizabeth L et al. (2020). Ultra-Processed Foods and Health Outcomes: A Narrative Review. Nutrients, 12(7):1955. | General methodology justification |
| Additives & UPF | Martínez Steele E et al. (2020). The share of ultra-processed foods and the quality of the diet. Public Health Nutrition, 23(3), 476–485. | additives_count weight rationale |
| Reference | Citation | Used for |
|---|---|---|
| Palm oil contaminants | EFSA CONTAM Panel (2016). Risks for human health related to the presence of 3- and 2-MCPD in food. EFSA Journal, 14(5):4426. | controversies = 'palm oil' flag |
| Titanium dioxide (E171) | EFSA FAF Panel (2021). Safety assessment of titanium dioxide (E171). EFSA Journal, 19(5):6585. | controversies flag for E171 |
| Acrylamide in food | EU Commission Regulation 2017/2158. Establishing mitigation measures for acrylamide in food. | prep_method scoring (fried > baked) |
| Trans fat regulation | EU Commission Regulation 2019/649. Maximum 2g industrial trans fat per 100g fat. | trans_fat_g ceiling validation |
When a scoring decision or threshold is informed by a scientific reference, cite it in a SQL comment:
```sql
-- Threshold: 3.0g salt/100g = 100 (sub-score ceiling)
-- Basis: WHO (2023) recommends <5g/day; 3g/100g is 60% of that daily limit in 100g
-- Ref: https://www.who.int/news-room/fact-sheets/detail/salt-reduction
```

Poland has a distinctive retail structure relevant to product coverage:
| Store type | Key players | Product access |
|---|---|---|
| Discount | Biedronka, Lidl, Netto | Largest volume; many private labels |
| Convenience | Żabka, Orlen Stop Cafe | Unique product lines; smaller pack sizes |
| Hypermarket | Auchan, Carrefour, E.Leclerc | Broadest brand selection |
| Cash & carry | Makro, Selgros | Bulk/HoReCa sizes; different nutrition formats |
Polish retailers have extensive private-label ranges that must be tracked separately:
| Retailer | Private label examples |
|---|---|
| Biedronka | Top Chips, Marinero, Dada |
| Lidl | Snack Day, Pilos, Pikok |
| Żabka | Żabka-branded sandwiches and snacks |
Private-label products record the retailer name in the `brand` field alongside the label name (e.g., `brand = 'Top Chips (Biedronka)'`).
As of 2026, Nutri-Score is voluntary in Poland. Many products do not display it on the label. When Nutri-Score is unavailable:
- Check Open Food Facts for a computed Nutri-Score.
- If not available, compute from nutrition facts using the 2024 algorithm.
- If nutrition data is insufficient to compute, set `nutri_score_label = 'UNKNOWN'`.
- Alcohol and similar categories use `nutri_score_label = 'NOT-APPLICABLE'`.

Every scored product carries a confidence tag:
| Level | Criteria |
|---|---|
| `verified` | data_completeness ≥ 90% (nutrition data from label or verified source) |
| `estimated` | data_completeness 70–89% or single source needing verification |
| `low` | data_completeness < 70%; score is approximate |

Note: `computed` is not a valid confidence level in the database. The CHECK constraint only allows `verified`, `estimated`, `low`.

```
Physical label available?
  └─ YES → All EU-7 fields present + data_completeness ≥ 90%?
       └─ YES → confidence = 'verified'
       └─ NO  → confidence = 'estimated'
  └─ NO  → Open Food Facts (verified entry)?
       └─ YES → PL label image + completeness ≥ 0.5?
            └─ YES → confidence = 'verified'
            └─ NO  → confidence = 'estimated'
       └─ NO  → Category averages used?
            └─ YES → confidence = 'estimated'
            └─ NO  → data_completeness < 70%?
                 └─ YES → confidence = 'low'
                 └─ NO  → confidence = 'estimated'
```

`data_completeness_pct` formula: See `RESEARCH_WORKFLOW.md` §6.3 for the weighted computation. Confidence criteria table: See `RESEARCH_WORKFLOW.md` §6.4.
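
The decision tree above can be read as a chain of early returns. The function below is a sketch of that logic only; the function name and parameter names are hypothetical, and the completeness percentage is assumed to be computed elsewhere.

```python
def assign_confidence(
    has_label,
    eu7_complete,
    completeness_pct,
    off_verified_entry=False,
    off_pl_image=False,
    off_completeness=0.0,
    used_category_average=False,
):
    """Walk the documented decision tree and return a confidence level."""
    if has_label:
        if eu7_complete and completeness_pct >= 90:
            return "verified"
        return "estimated"

    if off_verified_entry:
        if off_pl_image and off_completeness >= 0.5:
            return "verified"
        return "estimated"

    if used_category_average:
        return "estimated"

    return "low" if completeness_pct < 70 else "estimated"
```

Every branch returns one of the three values allowed by the CHECK constraint (`verified`, `estimated`, `low`).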
| Data type | Language rule |
|---|---|
| Product name | As printed on label (Polish market version) |
| Brand name | International form (e.g., "Lay's" not "Lays") |
| Ingredient list | Standardized English (cleaned via pipeline) |
| Category name | English in database (e.g., 'Chips', 'Cereals') |
| Store name | Original Polish name (e.g., 'Żabka', 'Biedronka') |
| EU regulation refs | English citation with EU regulation number |
| Column names | English, snake_case |
The following sources are excluded and must never be used:
| Source | Reason |
|---|---|
| US FDA / USDA nutrition databases | Different labeling standards; values do not match EU labels |
| UK-variant product pages | Different formulations (sugar, salt often differ from PL) |
| ChatGPT / AI-generated nutrition | Unverifiable; violates reproducibility requirement |
| Social media / blog posts | No traceability; unreliable |
| Pre-2020 label data | Formulations change; only current labels are valid |
| Products not sold in Poland | Out of scope; even if the brand exists globally |
Source provenance is tracked directly on the products table via dedicated columns:
| Column | Purpose |
|---|---|
| `source_type` | Currently `'off_api'` only |
| `source_url` | URL to the specific product page (e.g., OFF product page) |
| `source_ean` | EAN used to look up this product |

Rule: When adding a new product, set source_type = 'off_api', source_url, and source_ean on the product row. All products currently use Open Food Facts as the single source.
EAN-13 barcodes are the standard product identifier in Polish retail. They are critical for:
- Matching products across data sources (label ↔ Open Food Facts ↔ retailer website)
- Deduplicating products that appear under different names in different stores
- Verifying that Open Food Facts data matches the correct Polish SKU
The products table has an ean TEXT column (added in migration 20260208000100). A unique conditional index prevents barcode collisions.
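
To illustrate what a unique conditional index does, the sketch below reproduces the idea in an in-memory SQLite database. SQLite is only a stand-in chosen so the example is self-contained; the production database engine, the index name `products_ean_uniq`, and the simplified table definition are assumptions.

```python
import sqlite3


def demo_conditional_index():
    """Show that NULL barcodes may repeat while real barcodes must be unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products ("
        "product_id INTEGER PRIMARY KEY, product_name TEXT, ean TEXT)"
    )
    # Uniqueness is enforced only for rows that actually have a barcode.
    conn.execute(
        "CREATE UNIQUE INDEX products_ean_uniq ON products (ean) "
        "WHERE ean IS NOT NULL"
    )
    conn.executemany(
        "INSERT INTO products (product_name, ean) VALUES (?, ?)",
        [("Deli item A", None), ("Deli item B", None), ("Chips", "5901234123457")],
    )
    try:
        conn.execute(
            "INSERT INTO products (product_name, ean) VALUES (?, ?)",
            ("Chips copy", "5901234123457"),
        )
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True

    rows_without_ean = conn.execute(
        "SELECT COUNT(*) FROM products WHERE ean IS NULL"
    ).fetchone()[0]
    conn.close()
    return rows_without_ean, duplicate_rejected
```

Both rows without a barcode are accepted, while the second insert of the same EAN is rejected.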
Coverage: 997/1,025 active products (97.3%) have validated EAN-8 or EAN-13 barcodes.
Missing EANs (2):
- Kajzerka Kebab (product_id 43): custom Żabka product, no universal barcode
- Kotlet Drobiowy (product_id 804): custom Żabka product, no universal barcode

- Stored as text (not numeric) — EAN codes have leading zeros.
- Both EAN-8 and EAN-13 formats are supported.
- `ean` is nullable; private-label and deli products may not have universal EANs.
- The unique index is conditional (`WHERE ean IS NOT NULL`) to allow multiple rows without barcodes.
- One barcode = one product. If a product reformulates under the same EAN, update the existing row (do not create a new row).
- Multi-pack EANs (e.g., 6-pack of chips) are different products from single-pack EANs.
- EAN checksums are validated by `validate_eans.py` (called by `RUN_QA.ps1`).
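
For reference, the GS1 modulo-10 check that EAN-8 and EAN-13 codes must pass can be sketched as below. This is an independent illustration of the checksum, not the contents of `validate_eans.py`; the function name `ean_is_valid` is hypothetical.

```python
def ean_is_valid(ean):
    """Check an EAN-8 or EAN-13 string against its GS1 modulo-10 check digit."""
    # Barcodes are handled as text so leading zeros survive.
    if not (ean.isascii() and ean.isdigit()) or len(ean) not in (8, 13):
        return False

    digits = [int(ch) for ch in ean]
    body, check = digits[:-1], digits[-1]

    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(
        d * (3 if i % 2 == 0 else 1)
        for i, d in enumerate(reversed(body))
    )
    return (10 - total % 10) % 10 == check
```

Counting weights from the right-hand side lets the same loop handle both lengths, since EAN-8 and EAN-13 differ in which weight the leftmost digit receives.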
https://world.openfoodfacts.org/product/<EAN>
Always verify that the returned product page shows a Polish label image before trusting the data.
For scaling beyond the OFF API pipeline, a CSV bulk import tool ingests products from spreadsheet sources (retailer exports, research datasets, manual curation batches).
```powershell
$env:PYTHONIOENCODING="utf-8"

# Validate and generate SQL from a CSV file
.\.venv\Scripts\python.exe pipeline/csv_import.py --file data/products.csv

# Dry run: validate only, generate no SQL
.\.venv\Scripts\python.exe pipeline/csv_import.py --file data/products.csv --dry-run

# Custom output directory
.\.venv\Scripts\python.exe pipeline/csv_import.py --file data/products.csv --output-dir db/pipelines
```

Use `pipeline/templates/product_import_template.csv` as the starting template. Required columns: `ean`, `brand`, `product_name`, `category`, `country`. All 21 columns are documented in the template header.

- EAN: Must pass GS1 modulo-10 checksum (EAN-8 or EAN-13)
- Category: Must match one of the 28 defined categories in `pipeline/categories.py`
- Country: Must be `PL` or `DE`
- Nutrition: Values capped at `ABSOLUTE_CAPS` from `pipeline/validator.py`; cross-field checks enforce `sugars ≤ carbs` and `sat_fat ≤ total_fat`
- Formula injection: Cells starting with `=`, `+`, `-`, `@`, `\t`, or `\r` are rejected (negative numbers in nutrition columns are allowed)
- Duplicates: Detected by `(country, brand, product_name)` and by EAN; first occurrence wins
- Hard cap: 10,000 rows per file

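
Three of the validation rules above (formula injection, cross-field checks, duplicate detection) are sketched below to make their behaviour concrete. This is not the code in `pipeline/csv_import.py`; the function names and the set of nutrition columns are assumptions, and rows are assumed to be dicts with already-parsed numeric nutrition values.

```python
FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")

# Assumed set of numeric columns where a leading minus sign can be a real number.
NUTRITION_COLUMNS = {
    "calories", "total_fat_g", "saturated_fat_g", "carbs_g",
    "sugars_g", "protein_g", "salt_g", "fibre_g", "trans_fat_g",
}


def is_formula_injection(column, cell):
    """Flag cells a spreadsheet could interpret as a formula."""
    if not cell.startswith(FORMULA_PREFIXES):
        return False
    if column in NUTRITION_COLUMNS and cell.startswith("-"):
        try:
            float(cell)
            return False  # plain negative number, not a formula
        except ValueError:
            return True
    return True


def cross_field_errors(row):
    """Return violations of the sugars <= carbs and sat_fat <= total_fat rules."""
    errors = []
    if row["sugars_g"] > row["carbs_g"]:
        errors.append("sugars_g > carbs_g")
    if row["saturated_fat_g"] > row["total_fat_g"]:
        errors.append("saturated_fat_g > total_fat_g")
    return errors


def drop_duplicates(rows):
    """Keep the first row per (country, brand, product_name) and per EAN."""
    seen_keys, seen_eans, kept = set(), set(), []
    for row in rows:
        key = (row["country"], row["brand"], row["product_name"])
        if key in seen_keys or row["ean"] in seen_eans:
            continue
        seen_keys.add(key)
        seen_eans.add(row["ean"])
        kept.append(row)
    return kept
```

Negative nutrition values pass the injection check here only so that a later range check can report them with a more specific message; that ordering is an assumption about the pipeline, not something this document states.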
The tool groups valid rows by (category, country) and calls generate_pipeline() for each group, producing the standard 4-file pipeline SQL (01_insert_products, 03_add_nutrition, 04_scoring, 05_source_provenance). Files are written to db/pipelines/<slug>/. Source type is set to csv_import.
- Labels change. Manufacturers reformulate products (e.g., sugar reduction initiatives). Re-verify data at least annually.
- Seasonal products (e.g., holiday-edition chips) should be flagged and re-checked for availability.
- Discontinued products should be flagged `is_deprecated = true, deprecated_reason = 'Discontinued'` and never deleted.
- Price data is explicitly out of scope. This is a nutrition/quality database, not a price tracker.