# Jobs Data - Enrichment

This document describes how JobFront's enrichment pipeline processes raw job postings and what every field on the enriched job object means, including all valid option values, definitions, and edge-case behavior.

***

### How the Pipeline Works

The enrichment system takes raw job posting HTML plus any metadata already collected by the scraper and produces a set of structured, validated fields.

**High-level flow:**

```
Raw job post HTML
       │
       ▼
  Build prompts  ←  scraped metadata + extracted JSON-LD
       │
       ▼
  LLM extraction  (structured sections)
       │
       ▼
  Parse & validate  (enum checks, JSON parsing, tech corpus lookup)
       │
       ▼
  Post-processing  (compensation cross-check, location geocoding,
                    placeholder detection, derived flags)
       │
       ▼
  Enriched job object saved to database
```

**What the LLM extraction process structures:**

| Field group                         |
| ----------------------------------- |
| Job level                           |
| Job commitment                      |
| Primary location overview           |
| Full locations list                 |
| Concise metro locations             |
| Location type                       |
| Compensation overview string        |
| Compensation minimum                |
| Compensation maximum                |
| Compensation currency               |
| Compensation period                 |
| Visa sponsorship                    |
| Job categories                      |
| Technologies (up to 20)             |
| Post language                       |
| Benefits                            |
| Responsibilities                    |
| Requirements                        |
| Problems to solve                   |
| One-sentence task description       |
| O\*NET occupation code              |
| Street address (used for geocoding) |

All outputs are validated against allowed value lists or resolved against the technology corpus before being stored. Values the LLM cannot determine are returned as `NONE` and stored as empty strings.

***

### Field Reference

#### Job Title

**Field:** `job_title` **Type:** String

The job title as stated in the posting. Extracted directly from the post or from structured metadata already collected by the scraper.

***

#### Job Level

**Field:** `job_level` **Type:** Enum (string) **Default when unknown:** empty string

Represents the seniority level of the role. The value must exactly match one of the options below; any other value is discarded.

| Value         | When it applies                                                                                                            |
| ------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `internship`  | Internship programs. Also sets `is_internship = true`.                                                                     |
| `entry_level` | Entry-level or new-grad roles. Typically 0–2 years of experience expected, no seniority modifier in the title.             |
| `junior`      | Roles explicitly labeled "Junior" or similar. Typically 1–3 years experience, still in a learning-focused phase.           |
| `mid_level`   | Mid-level roles, often simply titled "Software Engineer" or "Designer" without a seniority qualifier. Typically 3–5 years. |
| `senior`      | Roles explicitly labeled "Senior". Typically 5–8 years experience; expected to mentor others.                              |
| `expert`      | Staff, Principal, Distinguished, or equivalent titles. 8+ years; sets technical direction across teams.                    |
| `executive`   | VP, SVP, C-level, Head of, or equivalent leadership titles.                                                                |

> **Note:** If the posting does not provide enough signal to confidently assign a level, this field is left empty.

***

#### Job Commitment

**Field:** `job_commitment` **Type:** Enum (string) **Default when unknown:** `full_time`

The employment type or work arrangement.

| Value        | When it applies                                                                                |
| ------------ | ---------------------------------------------------------------------------------------------- |
| `full_time`  | Standard full-time employment. This is the default when no explicit commitment type is stated. |
| `part_time`  | Part-time employment (fewer than full-time hours).                                             |
| `internship` | An internship program. Also sets `is_internship = true`.                                       |
| `contract`   | Contract, freelance, or independent contractor engagements.                                    |
| `temporary`  | Fixed-term or temporary employment.                                                            |
| `volunteer`  | Unpaid volunteer roles.                                                                        |
| `other`      | Any commitment type that does not fit the above options.                                       |

***

#### Location Fields

Location data is parsed from multiple sources on the page/source. Each location is subsequently validated and geocoded.

***

**`job_locations_parsed_list`**

**Type:** Array of objects

The full geocoded result for every location dictionary. Each object contains city, state, country, zip, latitude, longitude, and the street address used for the lookup. Deduplicated by `(city, state, country, street, zip, cbsa)`.

***

#### Job Location Type

**Field:** `job_location_type` **Type:** Enum (string) **Default when unknown:** `on_site`

Describes the work arrangement — where the employee is expected to physically work.

| Value               | When it applies                                                                                                               |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `on_site`           | The employee must work from the employer's physical location. This is the default when no remote or hybrid signal is present. |
| `remote`            | Fully remote with no geographic restriction. The employee can work from anywhere.                                             |
| `remote_restricted` | Remote work is offered but restricted by geography — for example, "Remote, US only" or "Remote within California".            |
| `hybrid`            | A mix of in-office days and remote days.                                                                                      |
| `other`             | An arrangement that does not fit the above options.                                                                           |

> **Derived flag:** `is_remote` is `true` when `job_location_type` is `remote` or `remote_restricted`.

***

#### Compensation Fields

Compensation is one of the most complex areas because job postings express it in many formats. The system extracts, validates, and in some cases corrects compensation data.

***

**`job_compensation_min` and `job_compensation_max`**

**Type:** Decimal (stored as string)

The exact numeric minimum and maximum compensation values **as written in the job posting**, with no unit conversion. For example, if the post says `$31.64/hour`, the stored value is `31.64`, not an annualized figure.

**Edge cases:**

| Situation                                           | Behavior                                                                                                                                                                                                                              |
| --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Single value** (e.g. "$120,000")                  | Both `min` and `max` are set to the same value (`120000`). This is valid.                                                                                                                                                             |
| **Min exceeds max** (e.g. min=200,000, max=100,000) | Both fields are cleared. This indicates a data entry error in the original posting.                                                                                                                                                   |
| **Hourly-to-annual misinterpretation**              | If the LLM interprets an hourly rate (e.g. `$31.64`) as an annual salary (e.g. `$31,640`), the system detects the \~100x discrepancy against the source text and corrects both the values and the period. See Period inference below. |
| **ATS/JSON-LD placeholder ranges**                  | Some applicant tracking systems emit nonsensical default ranges (e.g. min=1, max=1,000,000). These are detected and cleared. See Placeholder detection below.                                                                         |

***

**`job_compensation_currency`**

**Type:** String (ISO 4217 code)

A 3-letter currency code. Only values from the supported currency list (50 currencies) are stored. The full list includes major fiat currencies and five cryptocurrencies.

If the currency cannot be determined or is not in the supported list, this field is left empty.

***

**`job_compensation_period`**

**Type:** Enum (string)

How frequently the compensation amount is paid.

| Value    | Meaning         | Typical amount range                                 |
| -------- | --------------- | ---------------------------------------------------- |
| `year`   | Annual salary   | $10,000+ (e.g. $120,000/year)                        |
| `month`  | Monthly pay     | $1,000–$9,999 (e.g. $5,000/month)                    |
| `biweek` | Bi-weekly pay   | $1,000–$9,999 (e.g. $2,500 every two weeks)          |
| `week`   | Weekly pay      | $200–$999 (e.g. $800/week)                           |
| `day`    | Daily rate      | $200–$999 (e.g. $500/day)                            |
| `hour`   | Hourly rate     | $7–$199 (e.g. $31.64/hour)                           |
| `minute` | Per-minute rate | Under $7 (rare; applies to some gig/on-demand roles) |

**Period inference:** When the posting does not explicitly state the pay period, the system infers it from the magnitude of the compensation amount using the thresholds in the table above.

> **Important note on values with cents:** Amounts like `$31.64`, `$50.13`, or `$32.76` are almost always hourly rates, not annual salaries. The system explicitly detects and corrects cases where an LLM incorrectly treats an hourly figure as an annual one.

**Period inference does not produce `week` or `biweek`** — those values only appear when the posting explicitly states a weekly or bi-weekly pay cadence.

***

**Placeholder Detection**

Some applicant tracking systems and job boards populate compensation fields with nonsensical default ranges. The system detects and clears these automatically. A range is considered a placeholder if any of the following are true:

* Period is `year`, max ≥ 1,000,000, and min is one of: 0, 1, 10, 100, 1,000, 10,000
* Max ≥ 1,000,000 and min ≤ 5,000 (regardless of period)
* Period is `year` and max/min ratio ≥ 1,000
* Max/min ratio ≥ 100,000 (regardless of period)
* Max is exactly 999,999 or 1,000,000 and min ≤ 10,000

When a placeholder is detected, all compensation fields (`job_compensation`, `job_compensation_min`, `job_compensation_max`, `job_compensation_currency`, `job_compensation_period`) are cleared.

***

#### Visa Sponsorship

**Field:** `is_offers_visa_sponsorship` **Type:** Boolean **Default:** `false`

Set to `true` only when the job posting explicitly states that visa sponsorship is available. Never inferred — if the post is silent on the topic, this remains `false`.

***

#### Technologies

**Field:** `job_technologies_list` **Type:** Array of strings (up to 20 canonical technology names)

Technologies, tools, languages, and platforms required or mentioned in the job post. The enrichment pipeline uses a two-stage approach:

1. **LLM extraction:** The LLM extracts up to 5 technologies as part of its structured output.
2. **NLP enrichment:** After LLM extraction, a hybrid NLP pipeline expands the list to up to 20 technologies using NLP noun extraction and trie-based matching against a corpus of 5,800+ known technologies.

All technology names are **resolved to their canonical form**. For example, `"Postgres"`, `"postgresql"`, and `"pg"` are all stored as `"PostgreSQL"`. This ensures consistent filtering and deduplication.

If a term the LLM or NLP pipeline produces is not in the technology corpus, it is discarded.

***

#### Job Post Language

**Field:** `job_post_language` **Type:** String (ISO 639 language code) **Default:** `en`

The language of the original job posting. Stored as an ISO 639-1 or 639-3 code (e.g. `en`, `fr`, `de`, `es`, `pt`). Defaults to `en` (English) when the language cannot be determined. Maximum 5 characters.

***

#### List Fields

Each of these fields is an array of short strings, capped at 5 items. They are extracted directly from the job posting content.

| Field                       | Description                                                                                           |
| --------------------------- | ----------------------------------------------------------------------------------------------------- |
| `job_benefits_list`         | Benefits and perks offered by the employer (e.g. "Health insurance", "401(k) match", "Unlimited PTO") |
| `job_responsibilities_list` | Key responsibilities the role entails                                                                 |
| `job_requirements_list`     | Required or strongly preferred qualifications and skills                                              |
| `job_problems_list`         | Problems or challenges the new hire is expected to work on or solve                                   |

All four fields default to empty arrays when no relevant content is found.

***

#### Job Description

**Field:** `job_description` **Type:** String (≤ 15 words)

A single, concise sentence describing the most compelling task or project the new hire will work on. Written from the LLM's interpretation of the post — not a copy of the full job description.

**Fallback:** If the LLM cannot produce a meaningful one-task description (returns `NONE` or empty), the field falls back to the job description already stored on the job object from the scraper.

***

#### ONET Code

**Field:** `job_post_onet6_code` **Type:** String **Format:** `XX-XXXX.XX` (e.g. `15-1252.00`)

The O\*NET occupation code for the role. O\*NET is the US Department of Labor's occupational classification system. This code can be used to look up standardized occupation descriptions, required skills, typical wages, and labor market data at [onetonline.org](https://www.onetonline.org/).

**Validation and fallback:** The code is validated against the `XX-XXXX.XX` format. If the LLM returns a malformed or missing code, a separate fallback LLM call is made using just the job title and post content to obtain a valid code.

***

#### Derived Boolean Flags

These fields are computed from other enriched fields — they are never directly extracted from the post.

| Field                        | Type    | Logic                                                                           |
| ---------------------------- | ------- | ------------------------------------------------------------------------------- |
| `is_remote`                  | Boolean | `true` when `job_location_type` is `remote` or `remote_restricted`              |
| `is_internship`              | Boolean | `true` when `job_level` is `internship` **or** `job_commitment` is `internship` |
| `is_offers_visa_sponsorship` | Boolean | `true` when the post explicitly mentions visa sponsorship availability          |

***

### Empty / Null Field Behavior

Fields are left empty (empty string `""` or empty array `[]`) rather than null when data is unavailable or invalid. The specific behavior per field type:

| Situation                               | Result                                                     |
| --------------------------------------- | ---------------------------------------------------------- |
| LLM returns `NONE`                      | Field stored as empty string or empty array                |
| Enum value not in the allowed list      | Field stored as empty (or uses the stated default)         |
| Compensation min > max                  | Both min and max cleared; period and currency also cleared |
| Placeholder compensation detected       | All compensation fields cleared                            |
| Location country not recognized         | That location entry excluded from parsed lists             |
| Geocoded coordinates are `0, 0`         | Latitude and longitude cleared                             |
| No concise metro match for any location | `job_locations_concise_list` stored as `[]`                |
| Technology not in corpus                | Technology entry discarded                                 |
| JSON array fields fail to parse         | Field stored as empty array                                |

Fields with explicit defaults (noted in their sections above):

* `job_commitment` → `full_time`
* `job_location_type` → `on_site`
* `job_post_language` → `en`
* `is_offers_visa_sponsorship` → `false`
* `is_remote` → `false`
* `is_internship` → `false`

***

### Supported Currencies

The following 50 currency codes are supported. Values outside this list are discarded.

#### Fiat Currencies

| Region                            | Currencies                                                           |
| --------------------------------- | -------------------------------------------------------------------- |
| **Americas**                      | USD, CAD, AUD, NZD, BRL, MXN, CLP, COP, PEN                          |
| **Europe**                        | EUR, GBP, CHF, SEK, NOK, DKK, PLN, CZK, HUF, RON, HRK, BGN, RUB      |
| **Middle East & Africa**          | AED, ILS, ZAR, KES, NGN, MAD                                         |
| **Eastern Europe & Central Asia** | TRY, UAH, GEL                                                        |
| **Asia Pacific**                  | JPY, CNY, HKD, INR, SGD, KRW, IDR, MYR, PHP, THB, VND, BDT, PKR, LKR |

#### Cryptocurrencies

| Code  | Currency     |
| ----- | ------------ |
| `BTC` | Bitcoin      |
| `ETH` | Ethereum     |
| `LTC` | Litecoin     |
| `XRP` | XRP (Ripple) |
| `XMR` | Monero       |


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.jobfront.io/jobs-data-platform/jobs-data-enrichment.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
