
# Jobs Data - Intro

This document describes how JobFront's enrichment pipeline processes raw job postings and what every field on the enriched job object means, including all valid option values, definitions, and edge-case behavior.

***

### JobFront Data

Our data enrichment system takes raw job posting HTML plus any metadata already collected by the scraper and produces a set of structured, validated fields.

**Jobs Data Fields**

| Field group                         |
| ----------------------------------- |
| Job level                           |
| Job commitment                      |
| Primary location overview           |
| Full locations list                 |
| Concise metro locations             |
| Location type                       |
| Compensation overview string        |
| Compensation minimum                |
| Compensation maximum                |
| Compensation currency               |
| Compensation period                 |
| Visa sponsorship                    |
| Job categories                      |
| Technologies                        |
| Post language                       |
| Benefits                            |
| Responsibilities                    |
| Requirements                        |
| Problems to solve                   |
| One-sentence task description       |
| O\*NET occupation code              |
| Street address (used for geocoding) |

All outputs are validated against allowed value lists or resolved against the technology corpus before being stored. Values the LLM cannot determine are returned as `NONE` and stored as empty strings (or empty arrays for list fields).

***

### Field Reference

#### Job Title

**Field:** `job_title` **Type:** String

The job title as stated in the posting. Extracted directly from the post or from structured metadata already collected by the scraper.

***

#### Job Level

**Field:** `job_level` **Type:** Enum (string) **Default when unknown:** empty string

Represents the seniority level of the role. The value must exactly match one of the options below; any other value is discarded.

| Value         | When it applies                                                                                                            |
| ------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `internship`  | Internship programs. Also sets `is_internship = true`.                                                                     |
| `entry_level` | Entry-level or new-grad roles. Typically 0–2 years of experience expected, no seniority modifier in the title.             |
| `junior`      | Roles explicitly labeled "Junior" or similar. Typically 1–3 years experience, still in a learning-focused phase.           |
| `mid_level`   | Mid-level roles, often simply titled "Software Engineer" or "Designer" without a seniority qualifier. Typically 3–5 years. |
| `senior`      | Roles explicitly labeled "Senior". Typically 5–8 years experience; expected to mentor others.                              |
| `expert`      | Staff, Principal, Distinguished, or equivalent titles. 8+ years; sets technical direction across teams.                    |
| `executive`   | VP, SVP, C-level, Head of, or equivalent leadership titles.                                                                |

> **Note:** If the posting does not provide enough signal to confidently assign a level, this field is left empty.
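
The exact-match rule above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `validate_job_level` is hypothetical, and whether any case normalization happens before this check is not stated here, so the sketch compares the raw value as-is.

```python
from typing import Optional

# Allowed values copied from the Job Level table above.
JOB_LEVELS = {
    "internship", "entry_level", "junior", "mid_level",
    "senior", "expert", "executive",
}


def validate_job_level(raw: Optional[str]) -> str:
    """Return `raw` only if it exactly matches an allowed value.

    Anything else (including the LLM's `NONE` marker) becomes the
    documented default for this field: an empty string.
    """
    if raw in JOB_LEVELS:
        return raw
    return ""
```

For example, `validate_job_level("senior")` keeps the value, while `"Senior"` or `"NONE"` both come back as an empty string.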

***

#### Job Commitment

**Field:** `job_commitment` **Type:** Enum (string) **Default when unknown:** `full_time`

The employment type or work arrangement.

| Value        | When it applies                                                                                |
| ------------ | ---------------------------------------------------------------------------------------------- |
| `full_time`  | Standard full-time employment. This is the default when no explicit commitment type is stated. |
| `part_time`  | Part-time employment (fewer than full-time hours).                                             |
| `internship` | An internship program. Also sets `is_internship = true`.                                       |
| `contract`   | Contract, freelance, or independent contractor engagements.                                    |
| `temporary`  | Fixed-term or temporary employment.                                                            |
| `volunteer`  | Unpaid volunteer roles.                                                                        |
| `other`      | Any commitment type that does not fit the above options.                                       |

***

#### Location Fields

Location data is parsed from multiple sources on the page. Each location is subsequently validated and geocoded.

***

**`job_locations_parsed_list`**

**Type:** Array of objects

The full geocoded result for every parsed location. Each object contains city, state, country, zip, latitude, longitude, and the street address used for the lookup. Entries are deduplicated by `(city, state, country, street, zip, cbsa)`.
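
A minimal sketch of deduplicating on that tuple is shown below. The dictionary key names (`city`, `state`, `country`, `street`, `zip`, `cbsa`) are assumed to mirror the tuple above; the real object keys may differ, and `dedupe_locations` is a hypothetical helper.

```python
# Assumed key names, taken from the dedup tuple described above.
DEDUP_KEYS = ("city", "state", "country", "street", "zip", "cbsa")


def dedupe_locations(locations):
    """Keep the first occurrence of each location, preserving order."""
    seen = set()
    unique = []
    for loc in locations:
        # Missing keys are treated as empty strings for comparison.
        key = tuple(loc.get(k, "") for k in DEDUP_KEYS)
        if key not in seen:
            seen.add(key)
            unique.append(loc)
    return unique
```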

***

#### Job Location Type

**Field:** `job_location_type` **Type:** Enum (string) **Default when unknown:** `on_site`

Describes the work arrangement — where the employee is expected to physically work.

| Value               | When it applies                                                                                                               |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `on_site`           | The employee must work from the employer's physical location. This is the default when no remote or hybrid signal is present. |
| `remote`            | Fully remote with no geographic restriction. The employee can work from anywhere.                                             |
| `remote_restricted` | Remote work is offered but restricted by geography — for example, "Remote, US only" or "Remote within California".            |
| `hybrid`            | A mix of in-office days and remote days.                                                                                      |
| `other`             | An arrangement that does not fit the above options.                                                                           |

> **Derived flag:** `is_remote` is `true` when `job_location_type` is `remote` or `remote_restricted`.

***

#### Compensation Fields

Compensation is one of the most complex areas because job postings express it in many formats. The system extracts, validates, and in some cases corrects compensation data.

***

**`job_compensation_min` and `job_compensation_max`**

**Type:** Decimal (stored as string)

The exact numeric minimum and maximum compensation values **as written in the job posting**, with no unit conversion. For example, if the post says `$31.64/hour`, the stored value is `31.64`, not an annualized figure.

**Edge cases:**

| Situation                                           | Behavior                                                                                                                                                                                                                              |
| --------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Single value** (e.g. "$120,000")                  | Both `min` and `max` are set to the same value (`120000`). This is valid.                                                                                                                                                             |
| **Min exceeds max** (e.g. min=200,000, max=100,000) | Both fields are cleared, along with the period and currency. This indicates a data entry error in the original posting.                                                                                                              |
| **Hourly-to-annual misinterpretation**              | If the LLM interprets an hourly rate (e.g. `$31.64`) as an annual salary (e.g. `$31,640`), the system detects the order-of-magnitude discrepancy against the source text (1,000x in this example) and corrects both the values and the period. See Period inference below. |
| **ATS/JSON-LD placeholder ranges**                  | Some applicant tracking systems emit nonsensical default ranges (e.g. min=1, max=1,000,000). These are detected and cleared. See Placeholder detection below.                                                                         |
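
The first two rows of the table can be sketched as follows. This is an illustration under assumptions, not the production logic: `normalize_range` is a hypothetical name, inputs are assumed to be plain numeric strings, and a missing side is treated as the single-value case. The hourly correction and placeholder detection are not covered.

```python
from decimal import Decimal, InvalidOperation


def _parse(raw):
    """Parse a numeric string; return None for blanks, `NONE`, or junk."""
    try:
        value = Decimal(raw)
    except (InvalidOperation, TypeError, ValueError):
        return None
    return value if value.is_finite() else None


def normalize_range(min_raw, max_raw):
    """Return (min, max) as strings, or ("", "") when the range is unusable."""
    low, high = _parse(min_raw), _parse(max_raw)
    if low is None and high is None:
        return ("", "")
    # Single value: mirror it into both fields.
    if low is None:
        low = high
    if high is None:
        high = low
    # Min exceeds max: treat as a data entry error and clear both.
    if low > high:
        return ("", "")
    return (str(low), str(high))
```

Values are kept as `Decimal` and returned as strings so that `31.64` round-trips without floating-point drift.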

***

**`job_compensation_currency`**

**Type:** String (3-letter code; ISO 4217 for fiat currencies)

A 3-letter currency code. Only values from the supported currency list (50 currencies) are stored. The list comprises 45 fiat currencies and five cryptocurrencies.

If the currency cannot be determined or is not in the supported list, this field is left empty.

***

**`job_compensation_period`**

**Type:** Enum (string)

How frequently the compensation amount is paid.

| Value    | Meaning         | Typical amount range                                 |
| -------- | --------------- | ---------------------------------------------------- |
| `year`   | Annual salary   | $10,000+ (e.g. $120,000/year)                        |
| `month`  | Monthly pay     | $1,000–$9,999 (e.g. $5,000/month)                    |
| `biweek` | Bi-weekly pay   | $1,000–$9,999 (e.g. $2,500 every two weeks)          |
| `week`   | Weekly pay      | $200–$999 (e.g. $800/week)                           |
| `day`    | Daily rate      | $200–$999 (e.g. $500/day)                            |
| `hour`   | Hourly rate     | $7–$199 (e.g. $31.64/hour)                           |
| `minute` | Per-minute rate | Under $7 (rare; applies to some gig/on-demand roles) |

**Period inference:** When the posting does not explicitly state the pay period, the system infers it from the magnitude of the compensation amount using the thresholds in the table above.

> **Important note on values with cents:** Amounts like `$31.64`, `$50.13`, or `$32.76` are almost always hourly rates, not annual salaries. The system explicitly detects and corrects cases where an LLM incorrectly treats an hourly figure as an annual one.

**Period inference does not produce `week` or `biweek`** — those values only appear when the posting explicitly states a weekly or bi-weekly pay cadence.
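
As a rough sketch, magnitude-based inference could look like the function below. The thresholds come from the table above; treating each boundary as a half-open cutoff (for example, anything from 1,000 up to but not including 10,000 is `month`) is an assumption, and `infer_period` is a hypothetical name.

```python
from decimal import Decimal


def infer_period(amount: Decimal) -> str:
    """Guess the pay period from the size of the amount alone.

    Never returns `week` or `biweek`; those require an explicit
    statement in the posting.
    """
    if amount >= 10_000:
        return "year"
    if amount >= 1_000:
        return "month"
    if amount >= 200:
        return "day"
    if amount >= 7:
        return "hour"
    return "minute"
```

With these cutoffs, `Decimal("31.64")` maps to `hour` and `Decimal("120000")` maps to `year`.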

***

#### Visa Sponsorship

**Field:** `is_offers_visa_sponsorship` **Type:** Boolean **Default:** `false`

Set to `true` only when the job posting explicitly states that visa sponsorship is available. Never inferred — if the post is silent on the topic, this remains `false`.

***

#### Technologies

**Field:** `job_technologies_list` **Type:** Array of strings (up to 20 canonical technology names)

Technologies, tools, languages, and platforms required or mentioned in the job post. The enrichment pipeline uses a two-stage approach:

1. **LLM extraction:** The LLM extracts up to 5 technologies as part of its structured output.
2. **NLP enrichment:** After LLM extraction, a hybrid NLP pipeline expands the list to up to 20 technologies using NLP noun extraction and trie-based matching against a corpus of 5,800+ known technologies.

All technology names are **resolved to their canonical form**. For example, `"Postgres"`, `"postgresql"`, and `"pg"` are all stored as `"PostgreSQL"`. This ensures consistent filtering and deduplication.

If a term the LLM or NLP pipeline produces is not in the technology corpus, it is discarded.
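
The resolve-or-discard behavior can be illustrated with a plain dictionary lookup. This is a simplification: the real pipeline uses trie-based matching against the full corpus, and the tiny `CANONICAL` map here is hypothetical apart from the PostgreSQL aliases quoted above.

```python
# Toy alias map. Only the PostgreSQL aliases come from this page;
# the other entries are illustrative.
CANONICAL = {
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
    "pg": "PostgreSQL",
    "python": "Python",
    "k8s": "Kubernetes",
    "kubernetes": "Kubernetes",
}


def resolve_technologies(terms, limit=20):
    """Map raw terms to canonical names, dropping unknowns and duplicates."""
    resolved = []
    for term in terms:
        canonical = CANONICAL.get(term.strip().lower())
        if canonical is None:
            continue  # not in the corpus: discard
        if canonical not in resolved:
            resolved.append(canonical)
        if len(resolved) == limit:
            break
    return resolved
```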

***

#### Job Post Language

**Field:** `job_post_language` **Type:** String (ISO 639 language code) **Default:** `en`

The language of the original job posting. Stored as an ISO 639-1 or 639-3 code (e.g. `en`, `fr`, `de`, `es`, `pt`). Defaults to `en` (English) when the language cannot be determined. Maximum 5 characters.

***

#### List Fields

Each of these fields is an array of short strings, capped at 5 items. They are extracted directly from the job posting content.

| Field                       | Description                                                                                           |
| --------------------------- | ----------------------------------------------------------------------------------------------------- |
| `job_benefits_list`         | Benefits and perks offered by the employer (e.g. "Health insurance", "401(k) match", "Unlimited PTO") |
| `job_responsibilities_list` | Key responsibilities the role entails                                                                 |
| `job_requirements_list`     | Required or strongly preferred qualifications and skills                                              |
| `job_problems_list`         | Problems or challenges the new hire is expected to work on or solve                                   |

All four fields default to empty arrays when no relevant content is found.
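
A minimal sketch of that defaulting behavior follows, assuming the raw LLM output for a list field arrives as a JSON-encoded array string. That input format and the name `parse_list_field` are assumptions for illustration.

```python
import json


def parse_list_field(raw, max_items=5):
    """Parse a JSON array of strings, falling back to [] on any problem."""
    if raw is None or raw.strip() in ("", "NONE"):
        return []
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable output is stored as an empty array
    if not isinstance(value, list):
        return []
    # Keep only strings and enforce the 5-item cap.
    return [item for item in value if isinstance(item, str)][:max_items]
```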

***

#### Job Description

**Field:** `job_description` **Type:** String (≤ 15 words)

A single, concise sentence describing the most compelling task or project the new hire will work on. Written from the LLM's interpretation of the post — not a copy of the full job description.

**Fallback:** If the LLM cannot produce a meaningful one-task description (returns `NONE` or empty), the field falls back to the job description already stored on the job object from the scraper.

***

#### O\*NET Code

**Field:** `job_post_onet6_code` **Type:** String **Format:** `XX-XXXX.XX` (e.g. `15-1252.00`)

The O\*NET occupation code for the role. O\*NET is the US Department of Labor's occupational classification system. This code can be used to look up standardized occupation descriptions, required skills, typical wages, and labor market data at [onetonline.org](https://www.onetonline.org/).

**Validation and fallback:** The code is validated against the `XX-XXXX.XX` format. If the LLM returns a malformed or missing code, a separate fallback LLM call is made using just the job title and post content to obtain a valid code.
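
The format check can be expressed as a regular expression. The sketch below only verifies the `XX-XXXX.XX` shape; it does not confirm that the code exists in the O\*NET taxonomy, and `is_valid_onet_code` is a hypothetical helper name.

```python
import re

# Two digits, hyphen, four digits, dot, two digits (e.g. 15-1252.00).
ONET_PATTERN = re.compile(r"\d{2}-\d{4}\.\d{2}")


def is_valid_onet_code(code) -> bool:
    """Return True when `code` has the XX-XXXX.XX shape."""
    return isinstance(code, str) and ONET_PATTERN.fullmatch(code) is not None
```

A caller would trigger the fallback lookup whenever this returns `False`.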

***

#### Derived Boolean Flags

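`is_remote` and `is_internship` are computed from other enriched fields and are never directly extracted from the post. `is_offers_visa_sponsorship` is set only from an explicit statement in the post (see Visa Sponsorship above).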

| Field                        | Type    | Logic                                                                           |
| ---------------------------- | ------- | ------------------------------------------------------------------------------- |
| `is_remote`                  | Boolean | `true` when `job_location_type` is `remote` or `remote_restricted`              |
| `is_internship`              | Boolean | `true` when `job_level` is `internship` **or** `job_commitment` is `internship` |
| `is_offers_visa_sponsorship` | Boolean | `true` when the post explicitly mentions visa sponsorship availability          |
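
The two computed flags follow directly from the table. A minimal sketch, assuming the job is available as a plain dictionary keyed by the field names on this page (`derive_flags` is a hypothetical name):

```python
def derive_flags(job: dict) -> dict:
    """Compute is_remote and is_internship from already-enriched fields."""
    return {
        "is_remote": job.get("job_location_type") in ("remote", "remote_restricted"),
        "is_internship": (
            job.get("job_level") == "internship"
            or job.get("job_commitment") == "internship"
        ),
    }
```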

***

### Empty / Null Field Behavior

Fields are left empty (empty string `""` or empty array `[]`) rather than null when data is unavailable or invalid. The specific behavior per field type:

| Situation                               | Result                                                     |
| --------------------------------------- | ---------------------------------------------------------- |
| LLM returns `NONE`                      | Field stored as empty string or empty array                |
| Enum value not in the allowed list      | Field stored as empty (or uses the stated default)         |
| Compensation min > max                  | Both min and max cleared; period and currency also cleared |
| Placeholder compensation detected       | All compensation fields cleared                            |
| Location country not recognized         | That location entry excluded from parsed lists             |
| Geocoded coordinates are `0, 0`         | Latitude and longitude cleared                             |
| No concise metro match for any location | `job_locations_concise_list` stored as `[]`                |
| Technology not in corpus                | Technology entry discarded                                 |
| JSON array fields fail to parse         | Field stored as empty array                                |

Fields with explicit defaults (noted in their sections above):

* `job_commitment` → `full_time`
* `job_location_type` → `on_site`
* `job_post_language` → `en`
* `is_offers_visa_sponsorship` → `false`
* `is_remote` → `false`
* `is_internship` → `false`

***

### Supported Currencies

The following 50 currency codes are supported. Values outside this list are discarded.

#### Fiat Currencies

| Region                            | Currencies                                                           |
| --------------------------------- | -------------------------------------------------------------------- |
| **Americas**                      | USD, CAD, BRL, MXN, CLP, COP, PEN                                    |
| **Europe**                        | EUR, GBP, CHF, SEK, NOK, DKK, PLN, CZK, HUF, RON, HRK, BGN, RUB      |
| **Middle East & Africa**          | AED, ILS, ZAR, KES, NGN, MAD                                         |
| **Eastern Europe & Central Asia** | TRY, UAH, GEL                                                        |
| **Asia Pacific**                  | JPY, CNY, HKD, INR, SGD, KRW, IDR, MYR, PHP, THB, VND, BDT, PKR, LKR, AUD, NZD |

#### Cryptocurrencies

| Code  | Currency     |
| ----- | ------------ |
| `BTC` | Bitcoin      |
| `ETH` | Ethereum     |
| `LTC` | Litecoin     |
| `XRP` | XRP (Ripple) |
| `XMR` | Monero       |

### Content moderation

Every enriched job includes a `moderation` object that screens the posting for inappropriate or unsafe content. The vast majority of jobs come back clean (`"flagged": false`) — the object simply lets you filter or review the rare postings that trip a check.

We scan the job's text (title, description, responsibilities, requirements, benefits, location, and the raw posting) and tag any of the following:

| Tag          | Meaning                                 |
| ------------ | --------------------------------------- |
| `illicit`    | Illegal or otherwise nefarious activity |
| `sexual`     | Sexual content                          |
| `hate`       | Hate speech targeting protected groups  |
| `harassment` | Harassing or threatening language       |
| `violence`   | Violent or graphic content              |
| `self_harm`  | Self-harm content                       |
| `profanity`  | Strong profanity / explicit terms       |

`field_flags` tells you exactly which field tripped each check.

**Example — flagged job**

{% code overflow="wrap" %}

```
"moderation": {
  "flagged": true,
  "categories": ["profanity", "sexual"],
  "scores": { "sexual": 0.91 },
  "fields_flagged": ["job_title", "job_post_text"],
  "field_flags": {
    "job_title":     { "categories": ["sexual"], "profanity_terms": [] },
    "job_post_text": { "categories": [], "profanity_terms": ["bullsh*t"] }
  },
  "checked_at": 1716700000
}
```

{% endcode %}

**Example — clean job (typical)**

{% code overflow="wrap" %}

```
"moderation": { "flagged": false, "categories": [], "fields_flagged": [], "field_flags": {}, "checked_at": 1716700000 }
```

{% endcode %}

**Fields**

* `flagged` (bool) — `true` if any check tripped.
* `categories` (string\[]) — union of tags across all fields.
* `scores` (object) — model confidence per tag (`0`–`1`); present only when flagged.
* `fields_flagged` (string\[]) — which fields tripped a check.
* `field_flags` (object) — per-field detail: matched `categories` and `profanity_terms`.
* `checked_at` (unix timestamp) — when moderation last ran.
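
On the consumer side, a small helper can answer "which fields tripped a given tag?". The sketch below assumes the job has been loaded into a dictionary, and it infers from the flagged example above that a profanity hit shows up in `profanity_terms` rather than in the per-field `categories`. That reading, and the name `flagged_fields_for`, are assumptions.

```python
def flagged_fields_for(job: dict, tag: str) -> list:
    """Return the names of fields that tripped `tag`, or [] for clean jobs."""
    moderation = job.get("moderation") or {}
    if not moderation.get("flagged"):
        return []
    hits = []
    for field, detail in moderation.get("field_flags", {}).items():
        if tag == "profanity":
            # Assumed: profanity is reported via matched terms.
            tripped = bool(detail.get("profanity_terms"))
        else:
            tripped = tag in detail.get("categories", [])
        if tripped:
            hits.append(field)
    return hits
```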

### Contacts

JobFront extracts recruiter, hiring-manager, and inbox contact details from each job posting — names, titles, emails, phone numbers, social profiles, and the role each contact plays in hiring.

Contacts are returned on the job object as **`job_contacts`**, an array of contact objects ordered best-match first (up to 10 per job). The field is only present when contact details can be detected in the posting; jobs with none omit it. Automated, non-applicant mailboxes (`noreply@`, `postmaster@`, `abuse@`, and similar) are filtered out.

#### Shape

{% code overflow="wrap" %}

```
"job_contacts": [
  {
    "name": "Jane Doe",
    "title": "Senior Technical Recruiter",
    "role": "recruiter",
    "description": "Primary recruiter for this role",
    "emails": ["jane.doe@example.com"],
    "email_domains": ["example.com"],
    "phones": ["+1-415-555-1234"],
    "social_urls": ["https://www.linkedin.com/in/jane-doe"],
    "social_networks": ["linkedin"],
    "extracted_at": 1748312400
  }
]
```

{% endcode %}

#### Fields

Each object in the `job_contacts` array contains:

* **name** — Full name of the contact. Blank if not detected.
* **title** — The contact's job title (e.g. "Senior Recruiter"). Blank if not detected.
* **role** — The contact's role in the hiring process. One of: `recruiter`, `hiring_manager`, `team_lead`, `hr`, `generic_careers`, `accommodations`, `eeo`, `compliance`, `support`, `other`. Blank if not detected.
* **description** — Short free-text note on who the contact is or what they handle.
* **emails** — Email addresses for the contact (up to 5).
* **email\_domains** — The domain of each email (e.g. `example.com`), de-duplicated (up to 5). Convenient for filtering by company domain.
* **phones** — Phone numbers exactly as written in the posting (up to 5). Not normalized to a standard format.
* **social\_urls** — Social profile URLs for the contact (up to 5).
* **social\_networks** — The network each social profile belongs to, de-duplicated. Any of: `linkedin`, `x`, `github`, `facebook`, `instagram`, `bluesky`, `threads`, `mastodon`.
* **extracted\_at** — Unix timestamp (seconds) when the contact details were extracted.

#### Contact roles

| Role              | Meaning                                                                                                            |
| ----------------- | ------------------------------------------------------------------------------------------------------------------ |
| `recruiter`       | Talent-acquisition / recruiter / sourcer for this role.                                                            |
| `hiring_manager`  | The person the candidate would report to / who owns the requisition.                                               |
| `team_lead`       | Team lead or senior member of the hiring team (not the hiring manager).                                            |
| `hr`              | HR business partner, HR coordinator, or People Ops.                                                                |
| `generic_careers` | Role-based recruiting inbox (`careers@`, `jobs@`, `recruiting@`, `talent@`, `hiring@`) not tied to a named person. |
| `accommodations`  | ADA / accessibility / disability accommodations request contact.                                                   |
| `eeo`             | Equal-employment-opportunity / affirmative-action contact.                                                         |
| `compliance`      | Legal / privacy / data-protection inbox (`privacy@`, `dpo@`, `legal@`).                                            |
| `support`         | General customer, candidate, or IT support line.                                                                   |
| `other`           | Doesn't fit any of the above.                                                                                      |

#### Limits

* Up to **10** contacts per job, best-match first.
* Up to **5** emails, phones, and social profiles per contact.
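
Because contacts are ordered best-match first and the field is omitted when nothing is found, picking a contact for outreach can be a short loop. A minimal sketch, assuming the job is a plain dictionary; `first_email_for_role` is a hypothetical helper.

```python
from typing import Optional


def first_email_for_role(job: dict, role: str) -> Optional[str]:
    """Return the first email of the best-ranked contact with `role`."""
    # `job_contacts` may be absent entirely, so default to an empty list.
    for contact in job.get("job_contacts", []):
        if contact.get("role") == role and contact.get("emails"):
            return contact["emails"][0]
    return None
```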

