Jobs Data - Enrichment

This document describes how JobFront's enrichment pipeline processes raw job postings and what every field on the enriched job object means, including all valid option values, definitions, and edge-case behavior.


How the Pipeline Works

The enrichment system takes raw job posting HTML plus any metadata already collected by the scraper and produces a set of structured, validated fields.

High-level flow:

Raw job post HTML
        ↓
  Build prompts  ←  scraped metadata + extracted JSON-LD
        ↓
  LLM extraction  (structured sections)
        ↓
  Parse & validate  (enum checks, JSON parsing, tech corpus lookup)
        ↓
  Post-processing  (compensation cross-check, location geocoding,
                    placeholder detection, derived flags)
        ↓
  Enriched job object saved to database

The LLM extraction step structures the following field groups:

  • Job level
  • Job commitment
  • Primary location overview
  • Full locations list
  • Concise metro locations
  • Location type
  • Compensation overview string
  • Compensation minimum
  • Compensation maximum
  • Compensation currency
  • Compensation period
  • Visa sponsorship
  • Job categories
  • Technologies (up to 20)
  • Post language
  • Benefits
  • Responsibilities
  • Requirements
  • Problems to solve
  • One-sentence task description
  • O*NET occupation code
  • Street address (used for geocoding)

All outputs are validated against allowed value lists or resolved against the technology corpus before being stored. Values the LLM cannot determine are returned as NONE and stored as empty strings (or empty arrays for list fields).


Field Reference

Job Title

Field: job_title Type: String

The job title as stated in the posting. Extracted directly from the post or from structured metadata already collected by the scraper.


Job Level

Field: job_level Type: Enum (string) Default when unknown: empty string

Represents the seniority level of the role. The value must exactly match one of the options below; any other value is discarded.

| Value | When it applies |
| --- | --- |
| internship | Internship programs. Also sets is_internship = true. |
| entry_level | Entry-level or new-grad roles. Typically 0–2 years of experience expected, no seniority modifier in the title. |
| junior | Roles explicitly labeled "Junior" or similar. Typically 1–3 years of experience, still in a learning-focused phase. |
| mid_level | Mid-level roles, often simply titled "Software Engineer" or "Designer" without a seniority qualifier. Typically 3–5 years of experience. |
| senior | Roles explicitly labeled "Senior". Typically 5–8 years of experience; expected to mentor others. |
| expert | Staff, Principal, Distinguished, or equivalent titles. 8+ years of experience; sets technical direction across teams. |
| executive | VP, SVP, C-level, Head of, or equivalent leadership titles. |

Note: If the posting does not provide enough signal to confidently assign a level, this field is left empty.
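
The exact-match rule can be illustrated with a minimal sketch. The function name validate_job_level is hypothetical and not part of the pipeline's documented API; the allowed values are copied from the table above.

```python
from typing import Optional

# Allowed values copied from the job_level table above.
JOB_LEVELS = frozenset({
    "internship", "entry_level", "junior", "mid_level",
    "senior", "expert", "executive",
})


def validate_job_level(raw: Optional[str]) -> str:
    """Hypothetical helper: keep the value only on an exact match.

    Anything else (including the NONE sentinel, a differently cased
    string, or a missing value) is discarded and stored as "".
    """
    return raw if raw in JOB_LEVELS else ""
```

With this sketch, "senior" is kept while "Senior" and "NONE" both become an empty string, since the documented rule is an exact match.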


Job Commitment

Field: job_commitment Type: Enum (string) Default when unknown: full_time

The employment type or work arrangement.

| Value | When it applies |
| --- | --- |
| full_time | Standard full-time employment. This is the default when no explicit commitment type is stated. |
| part_time | Part-time employment (fewer than full-time hours). |
| internship | An internship program. Also sets is_internship = true. |
| contract | Contract, freelance, or independent contractor engagements. |
| temporary | Fixed-term or temporary employment. |
| volunteer | Unpaid volunteer roles. |
| other | Any commitment type that does not fit the above options. |


Location Fields

Location data is parsed from multiple sources within the posting. Each location is then validated and geocoded.


job_locations_parsed_list

Type: Array of objects

The full geocoded result for every location dictionary. Each object contains city, state, country, zip, latitude, longitude, and the street address used for the lookup. Deduplicated by (city, state, country, street, zip, cbsa).
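
Deduplication by that tuple could look roughly like the sketch below. The dictionary key names are assumptions based on the tuple listed above (the actual object keys are not documented here), and dedupe_locations is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Assumed key names, taken from the dedup tuple in the text above.
DEDUP_KEYS = ("city", "state", "country", "street", "zip", "cbsa")


def dedupe_locations(locations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each (city, state, country, street, zip, cbsa)."""
    seen = set()
    unique = []
    for loc in locations:
        # Missing keys are treated as empty strings (an assumption).
        key = tuple(loc.get(k, "") for k in DEDUP_KEYS)
        if key not in seen:
            seen.add(key)
            unique.append(loc)
    return unique
```

Order is preserved, so the first geocoded result for a given tuple is the one that survives.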


Job Location Type

Field: job_location_type Type: Enum (string) Default when unknown: on_site

Describes the work arrangement — where the employee is expected to physically work.

| Value | When it applies |
| --- | --- |
| on_site | The employee must work from the employer's physical location. This is the default when no remote or hybrid signal is present. |
| remote | Fully remote with no geographic restriction. The employee can work from anywhere. |
| remote_restricted | Remote work is offered but restricted by geography, for example "Remote, US only" or "Remote within California". |
| hybrid | A mix of in-office days and remote days. |
| other | An arrangement that does not fit the above options. |

Derived flag: is_remote is true when job_location_type is remote or remote_restricted.


Compensation Fields

Compensation is one of the most complex areas because job postings express it in many formats. The system extracts, validates, and in some cases corrects compensation data.


job_compensation_min and job_compensation_max

Type: Decimal (stored as string)

The exact numeric minimum and maximum compensation values as written in the job posting, with no unit conversion. For example, if the post says $31.64/hour, the stored value is 31.64, not an annualized figure.

Edge cases:

| Situation | Behavior |
| --- | --- |
| Single value (e.g. "$120,000") | Both min and max are set to the same value (120000). This is valid. |
| Min exceeds max (e.g. min=200,000, max=100,000) | Both fields are cleared, along with period and currency. This indicates a data entry error in the original posting. |
| Hourly-to-annual misinterpretation | If the LLM interprets an hourly rate (e.g. $31.64) as an annual salary (e.g. $31,640), the system detects the ~1,000x discrepancy against the source text and corrects both the values and the period. See Period inference below. |
| ATS/JSON-LD placeholder ranges | Some applicant tracking systems emit nonsensical default ranges (e.g. min=1, max=1,000,000). These are detected and cleared. See Placeholder detection below. |
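
The first two edge cases can be sketched as follows. This is an illustration only: normalize_range is a hypothetical name, and the sketch assumes the inputs are already plain numeric strings without currency symbols or thousands separators.

```python
from decimal import Decimal
from typing import Optional, Tuple


def normalize_range(min_raw: Optional[str], max_raw: Optional[str]) -> Tuple[str, str]:
    """Apply the single-value and min-exceeds-max rules to string amounts."""
    # Single value: mirror it into the missing side.
    if min_raw and not max_raw:
        max_raw = min_raw
    elif max_raw and not min_raw:
        min_raw = max_raw
    if not min_raw or not max_raw:
        return "", ""
    # Min exceeds max: treat as a data entry error and clear both.
    if Decimal(min_raw) > Decimal(max_raw):
        return "", ""
    return min_raw, max_raw
```

Values are compared as Decimal rather than float so that amounts such as 31.64 are not affected by binary rounding, which matches the field's documented "Decimal (stored as string)" type.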


job_compensation_currency

Type: String (ISO 4217 code)

A 3-letter currency code. Only values from the supported currency list (50 codes: 45 fiat currencies and 5 cryptocurrencies; see Supported Currencies) are stored.

If the currency cannot be determined or is not in the supported list, this field is left empty.


job_compensation_period

Type: Enum (string)

How frequently the compensation amount is paid.

| Value | Meaning | Typical amount range |
| --- | --- | --- |
| year | Annual salary | $10,000+ (e.g. $120,000/year) |
| month | Monthly pay | $1,000–$9,999 (e.g. $5,000/month) |
| biweek | Bi-weekly pay | $1,000–$9,999 (e.g. $2,500 every two weeks) |
| week | Weekly pay | $200–$999 (e.g. $800/week) |
| day | Daily rate | $200–$999 (e.g. $500/day) |
| hour | Hourly rate | $7–$199 (e.g. $31.64/hour) |
| minute | Per-minute rate | Under $7 (rare; applies to some gig/on-demand roles) |

Period inference: When the posting does not explicitly state the pay period, the system infers it from the magnitude of the compensation amount using the thresholds in the table above.

Important note on values with cents: Amounts like $31.64, $50.13, or $32.76 are almost always hourly rates, not annual salaries. The system explicitly detects and corrects cases where an LLM incorrectly treats an hourly figure as an annual one.

Period inference does not produce week or biweek — those values only appear when the posting explicitly states a weekly or bi-weekly pay cadence.
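
A minimal sketch of magnitude-based inference, under two assumptions that the text does not spell out: a single representative amount is passed in, and each range is treated as a lower bound so that fractional amounts between ranges (e.g. 199.50) still get a period. The name infer_period is hypothetical.

```python
from decimal import Decimal

# (lower bound, period), checked from largest to smallest.
# week and biweek are deliberately absent: inference never produces them.
_PERIOD_FLOORS = (
    (Decimal("10000"), "year"),
    (Decimal("1000"), "month"),
    (Decimal("200"), "day"),
    (Decimal("7"), "hour"),
)


def infer_period(amount: Decimal) -> str:
    """Guess the pay period from the size of the amount alone."""
    for floor, period in _PERIOD_FLOORS:
        if amount >= floor:
            return period
    return "minute"
```

For example, infer_period(Decimal("31.64")) falls in the hourly band, which is why a value with cents should not be stored with a yearly period.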


Placeholder Detection

Some applicant tracking systems and job boards populate compensation fields with nonsensical default ranges. The system detects and clears these automatically. A range is considered a placeholder if any of the following are true:

  • Period is year, max ≥ 1,000,000, and min is one of: 0, 1, 10, 100, 1,000, 10,000

  • Max ≥ 1,000,000 and min ≤ 5,000 (regardless of period)

  • Period is year and max/min ratio ≥ 1,000

  • Max/min ratio ≥ 100,000 (regardless of period)

  • Max is exactly 999,999 or 1,000,000 and min ≤ 10,000

When a placeholder is detected, all compensation fields (job_compensation, job_compensation_min, job_compensation_max, job_compensation_currency, job_compensation_period) are cleared.
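
The five rules above can be expressed as a single predicate. This is a sketch, not the pipeline's actual code: is_placeholder_range is a hypothetical name, and skipping the ratio rules when min is zero or negative is an assumption made here to avoid division by zero.

```python
from decimal import Decimal
from typing import Optional

_MILLION = Decimal(1_000_000)
_ROUND_MINS = frozenset(Decimal(n) for n in (0, 1, 10, 100, 1_000, 10_000))


def is_placeholder_range(min_v: Decimal, max_v: Decimal, period: str) -> bool:
    """Return True if (min_v, max_v) matches any documented placeholder rule."""
    # Assumption: the ratio is undefined (rules 3 and 4 skipped) when min <= 0.
    ratio: Optional[Decimal] = max_v / min_v if min_v > 0 else None
    return any([
        # 1. Yearly, max >= 1,000,000, and min is a round default value.
        period == "year" and max_v >= _MILLION and min_v in _ROUND_MINS,
        # 2. Max >= 1,000,000 and min <= 5,000, regardless of period.
        max_v >= _MILLION and min_v <= 5_000,
        # 3. Yearly and max/min ratio >= 1,000.
        period == "year" and ratio is not None and ratio >= 1_000,
        # 4. Max/min ratio >= 100,000, regardless of period.
        ratio is not None and ratio >= 100_000,
        # 5. Max is exactly 999,999 or 1,000,000 and min <= 10,000.
        max_v in (Decimal(999_999), _MILLION) and min_v <= 10_000,
    ])
```

A caller would clear all five compensation fields when this returns True. A typical range such as 80,000 to 120,000 per year matches none of the rules.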


Visa Sponsorship

Field: is_offers_visa_sponsorship Type: Boolean Default: false

Set to true only when the job posting explicitly states that visa sponsorship is available. Never inferred — if the post is silent on the topic, this remains false.


Technologies

Field: job_technologies_list Type: Array of strings (up to 20 canonical technology names)

Technologies, tools, languages, and platforms required or mentioned in the job post. The enrichment pipeline uses a two-stage approach:

  1. LLM extraction: The LLM extracts up to 5 technologies as part of its structured output.

  2. NLP enrichment: After LLM extraction, a hybrid NLP pipeline expands the list to up to 20 technologies using NLP noun extraction and trie-based matching against a corpus of 5,800+ known technologies.

All technology names are resolved to their canonical form. For example, "Postgres", "postgresql", and "pg" are all stored as "PostgreSQL". This ensures consistent filtering and deduplication.

If a term the LLM or NLP pipeline produces is not in the technology corpus, it is discarded.
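
Canonical resolution and discarding can be sketched with a plain dictionary standing in for the corpus. The alias table below is a tiny illustrative stand-in (the real corpus has 5,800+ entries and uses trie-based matching), case-insensitive lookup is an assumption, and canonicalize_technologies is a hypothetical name.

```python
from typing import Dict, Iterable, List

# Tiny stand-in for the technology corpus: alias (lowercase) -> canonical name.
ALIASES: Dict[str, str] = {
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
    "pg": "PostgreSQL",
    "python": "Python",
}


def canonicalize_technologies(terms: Iterable[str], limit: int = 20) -> List[str]:
    """Resolve terms to canonical names, dropping unknowns and duplicates."""
    result: List[str] = []
    for term in terms:
        canonical = ALIASES.get(term.strip().lower())
        if canonical is None:
            continue  # not in the corpus: discarded
        if canonical not in result:
            result.append(canonical)
        if len(result) == limit:
            break
    return result
```

Because every alias maps to one canonical string, "Postgres", "pg", and "postgresql" collapse into a single "PostgreSQL" entry.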


Job Post Language

Field: job_post_language Type: String (ISO 639 language code) Default: en

The language of the original job posting. Stored as an ISO 639-1 or 639-3 code (e.g. en, fr, de, es, pt). Defaults to en (English) when the language cannot be determined. Maximum 5 characters.


List Fields

Each of these fields is an array of short strings, capped at 5 items. They are extracted directly from the job posting content.

| Field | Description |
| --- | --- |
| job_benefits_list | Benefits and perks offered by the employer (e.g. "Health insurance", "401(k) match", "Unlimited PTO") |
| job_responsibilities_list | Key responsibilities the role entails |
| job_requirements_list | Required or strongly preferred qualifications and skills |
| job_problems_list | Problems or challenges the new hire is expected to work on or solve |

All four fields default to empty arrays when no relevant content is found.
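
As an illustration of the cap and the empty-array default, the sketch below assumes the LLM returns each list as a JSON array string; parse_list_field is a hypothetical helper name, and dropping blank or non-string items is an assumption.

```python
import json
from typing import List, Optional


def parse_list_field(raw: Optional[str], max_items: int = 5) -> List[str]:
    """Parse a JSON array of strings, capped at max_items; [] on any failure."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Covers None, the NONE sentinel, and malformed JSON.
        return []
    if not isinstance(parsed, list):
        return []
    items = [s.strip() for s in parsed if isinstance(s, str) and s.strip()]
    return items[:max_items]
```

A response with six items is truncated to the first five, and a response of NONE yields an empty array.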


Job Description

Field: job_description Type: String (≤ 15 words)

A single, concise sentence describing the most compelling task or project the new hire will work on. Written from the LLM's interpretation of the post — not a copy of the full job description.

Fallback: If the LLM cannot produce a meaningful one-task description (returns NONE or empty), the field falls back to the job description already stored on the job object from the scraper.


O*NET Code

Field: job_post_onet6_code Type: String Format: XX-XXXX.XX (e.g. 15-1252.00)

The O*NET occupation code for the role. O*NET is the US Department of Labor's occupational classification system. This code can be used to look up standardized occupation descriptions, required skills, typical wages, and labor market data at onetonline.org.

Validation and fallback: The code is validated against the XX-XXXX.XX format. If the LLM returns a malformed or missing code, a separate fallback LLM call is made using just the job title and post content to obtain a valid code.
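
The format check and fallback can be sketched as below. The fallback argument is a hypothetical callable standing in for the second LLM call, and returning an empty string when the fallback also fails is an assumption; the text does not say what happens in that case.

```python
import re
from typing import Callable, Optional

# XX-XXXX.XX, e.g. 15-1252.00
ONET_PATTERN = re.compile(r"\d{2}-\d{4}\.\d{2}")


def resolve_onet_code(primary: Optional[str],
                      fallback: Callable[[], Optional[str]]) -> str:
    """Return the primary code if well-formed, else try the fallback once."""
    if primary and ONET_PATTERN.fullmatch(primary):
        return primary
    code = fallback()  # stand-in for the separate fallback LLM call
    if code and ONET_PATTERN.fullmatch(code):
        return code
    return ""  # assumption: leave the field empty if both attempts fail
```

Passing the fallback as a callable means the second call is only made when the first code is missing or malformed.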


Derived Boolean Flags

These fields are computed from other enriched fields — they are never directly extracted from the post.

| Field | Type | Logic |
| --- | --- | --- |
| is_remote | Boolean | true when job_location_type is remote or remote_restricted |
| is_internship | Boolean | true when job_level is internship or job_commitment is internship |
| is_offers_visa_sponsorship | Boolean | true when the post explicitly mentions visa sponsorship availability |
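
The two flags that depend only on other enriched fields can be computed as in this sketch. It assumes the enriched job is available as a plain dictionary keyed by the field names used in this document; derive_flags is a hypothetical name.

```python
from typing import Any, Dict


def derive_flags(job: Dict[str, Any]) -> Dict[str, bool]:
    """Compute is_remote and is_internship from already-validated fields."""
    return {
        "is_remote": job.get("job_location_type") in ("remote", "remote_restricted"),
        "is_internship": (
            job.get("job_level") == "internship"
            or job.get("job_commitment") == "internship"
        ),
    }
```

Both flags come out false when the source fields are empty, which matches the documented defaults.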


Empty / Null Field Behavior

Fields are left empty (empty string "" or empty array []) rather than null when data is unavailable or invalid. The specific behavior per field type:

| Situation | Result |
| --- | --- |
| LLM returns NONE | Field stored as empty string or empty array |
| Enum value not in the allowed list | Field stored as empty (or uses the stated default) |
| Compensation min > max | Both min and max cleared; period and currency also cleared |
| Placeholder compensation detected | All compensation fields cleared |
| Location country not recognized | That location entry excluded from parsed lists |
| Geocoded coordinates are 0, 0 | Latitude and longitude cleared |
| No concise metro match for any location | job_locations_concise_list stored as [] |
| Technology not in corpus | Technology entry discarded |
| JSON array fields fail to parse | Field stored as empty array |

Fields with explicit defaults (noted in their sections above):

  • job_commitment → full_time

  • job_location_type → on_site

  • job_post_language → en

  • is_offers_visa_sponsorship → false

  • is_remote → false

  • is_internship → false


Supported Currencies

The following 50 currency codes are supported. Values outside this list are discarded.

Fiat Currencies

| Region | Currencies |
| --- | --- |
| Americas | USD, CAD, AUD, NZD, BRL, MXN, CLP, COP, PEN |
| Europe | EUR, GBP, CHF, SEK, NOK, DKK, PLN, CZK, HUF, RON, HRK, BGN, RUB |
| Middle East & Africa | AED, ILS, ZAR, KES, NGN, MAD |
| Eastern Europe & Central Asia | TRY, UAH, GEL |
| Asia Pacific | JPY, CNY, HKD, INR, SGD, KRW, IDR, MYR, PHP, THB, VND, BDT, PKR, LKR |

Cryptocurrencies

| Code | Currency |
| --- | --- |
| BTC | Bitcoin |
| ETH | Ethereum |
| LTC | Litecoin |
| XRP | XRP (Ripple) |
| XMR | Monero |
