Jobs Data - Enrichment
This document describes how JobFront's enrichment pipeline processes raw job postings and what every field on the enriched job object means, including all valid option values, definitions, and edge-case behavior.
How the Pipeline Works
The enrichment system takes raw job posting HTML plus any metadata already collected by the scraper and produces a set of structured, validated fields.
High-level flow:
Raw job post HTML
│
▼
Build prompts ← scraped metadata + extracted JSON-LD
│
▼
LLM extraction (structured sections)
│
▼
Parse & validate (enum checks, JSON parsing, tech corpus lookup)
│
▼
Post-processing (compensation cross-check, location geocoding,
placeholder detection, derived flags)
│
▼
Enriched job object saved to databaseWhat the LLM extraction process structures:
Job level
Job commitment
Primary location overview
Full locations list
Concise metro locations
Location type
Compensation overview string
Compensation minimum
Compensation maximum
Compensation currency
Compensation period
Visa sponsorship
Job categories
Technologies (up to 20)
Post language
Benefits
Responsibilities
Requirements
Problems to solve
One-sentence task description
O*NET occupation code
Street address (used for geocoding)
All outputs are validated against allowed value lists or resolved against the technology corpus before being stored. Values the LLM cannot determine are returned as NONE and stored as empty strings.
Field Reference
Job Title
Field: job_title Type: String
The job title as stated in the posting. Extracted directly from the post or from structured metadata already collected by the scraper.
Job Level
Field: job_level Type: Enum (string) Default when unknown: empty string
Represents the seniority level of the role. The value must exactly match one of the options below; any other value is discarded.
internship
Internship programs. Also sets is_internship = true.
entry_level
Entry-level or new-grad roles. Typically 0–2 years of experience expected, no seniority modifier in the title.
junior
Roles explicitly labeled "Junior" or similar. Typically 1–3 years experience, still in a learning-focused phase.
mid_level
Mid-level roles, often simply titled "Software Engineer" or "Designer" without a seniority qualifier. Typically 3–5 years.
senior
Roles explicitly labeled "Senior". Typically 5–8 years experience; expected to mentor others.
expert
Staff, Principal, Distinguished, or equivalent titles. 8+ years; sets technical direction across teams.
executive
VP, SVP, C-level, Head of, or equivalent leadership titles.
Note: If the posting does not provide enough signal to confidently assign a level, this field is left empty.
Job Commitment
Field: job_commitment Type: Enum (string) Default when unknown: full_time
The employment type or work arrangement.
full_time
Standard full-time employment. This is the default when no explicit commitment type is stated.
part_time
Part-time employment (fewer than full-time hours).
internship
An internship program. Also sets is_internship = true.
contract
Contract, freelance, or independent contractor engagements.
temporary
Fixed-term or temporary employment.
volunteer
Unpaid volunteer roles.
other
Any commitment type that does not fit the above options.
Location Fields
Location data is parsed from multiple sources on the page/source. Each location is subsequently validated and geocoded.
job_locations_parsed_list
Type: Array of objects
The full geocoded result for every location dictionary. Each object contains city, state, country, zip, latitude, longitude, and the street address used for the lookup. Deduplicated by (city, state, country, street, zip, cbsa).
Job Location Type
Field: job_location_type Type: Enum (string) Default when unknown: on_site
Describes the work arrangement — where the employee is expected to physically work.
on_site
The employee must work from the employer's physical location. This is the default when no remote or hybrid signal is present.
remote
Fully remote with no geographic restriction. The employee can work from anywhere.
remote_restricted
Remote work is offered but restricted by geography — for example, "Remote, US only" or "Remote within California".
hybrid
A mix of in-office days and remote days.
other
An arrangement that does not fit the above options.
Derived flag:
is_remoteistruewhenjob_location_typeisremoteorremote_restricted.
Compensation Fields
Compensation is one of the most complex areas because job postings express it in many formats. The system extracts, validates, and in some cases corrects compensation data.
job_compensation_min and job_compensation_max
Type: Decimal (stored as string)
The exact numeric minimum and maximum compensation values as written in the job posting, with no unit conversion. For example, if the post says $31.64/hour, the stored value is 31.64, not an annualized figure.
Edge cases:
Single value (e.g. "$120,000")
Both min and max are set to the same value (120000). This is valid.
Min exceeds max (e.g. min=200,000, max=100,000)
Both fields are cleared. This indicates a data entry error in the original posting.
Hourly-to-annual misinterpretation
If the LLM interprets an hourly rate (e.g. $31.64) as an annual salary (e.g. $31,640), the system detects the ~100x discrepancy against the source text and corrects both the values and the period. See Period inference below.
ATS/JSON-LD placeholder ranges
Some applicant tracking systems emit nonsensical default ranges (e.g. min=1, max=1,000,000). These are detected and cleared. See Placeholder detection below.
job_compensation_currency
Type: String (ISO 4217 code)
A 3-letter currency code. Only values from the supported currency list (50 currencies) are stored. The full list includes major fiat currencies and five cryptocurrencies.
If the currency cannot be determined or is not in the supported list, this field is left empty.
job_compensation_period
Type: Enum (string)
How frequently the compensation amount is paid.
year
Annual salary
$10,000+ (e.g. $120,000/year)
month
Monthly pay
$1,000–$9,999 (e.g. $5,000/month)
biweek
Bi-weekly pay
$1,000–$9,999 (e.g. $2,500 every two weeks)
week
Weekly pay
$200–$999 (e.g. $800/week)
day
Daily rate
$200–$999 (e.g. $500/day)
hour
Hourly rate
$7–$199 (e.g. $31.64/hour)
minute
Per-minute rate
Under $7 (rare; applies to some gig/on-demand roles)
Period inference: When the posting does not explicitly state the pay period, the system infers it from the magnitude of the compensation amount using the thresholds in the table above.
Important note on values with cents: Amounts like
$31.64,$50.13, or$32.76are almost always hourly rates, not annual salaries. The system explicitly detects and corrects cases where an LLM incorrectly treats an hourly figure as an annual one.
Period inference does not produce week or biweek — those values only appear when the posting explicitly states a weekly or bi-weekly pay cadence.
Placeholder Detection
Some applicant tracking systems and job boards populate compensation fields with nonsensical default ranges. The system detects and clears these automatically. A range is considered a placeholder if any of the following are true:
Period is
year, max ≥ 1,000,000, and min is one of: 0, 1, 10, 100, 1,000, 10,000Max ≥ 1,000,000 and min ≤ 5,000 (regardless of period)
Period is
yearand max/min ratio ≥ 1,000Max/min ratio ≥ 100,000 (regardless of period)
Max is exactly 999,999 or 1,000,000 and min ≤ 10,000
When a placeholder is detected, all compensation fields (job_compensation, job_compensation_min, job_compensation_max, job_compensation_currency, job_compensation_period) are cleared.
Visa Sponsorship
Field: is_offers_visa_sponsorship Type: Boolean Default: false
Set to true only when the job posting explicitly states that visa sponsorship is available. Never inferred — if the post is silent on the topic, this remains false.
Technologies
Field: job_technologies_list Type: Array of strings (up to 20 canonical technology names)
Technologies, tools, languages, and platforms required or mentioned in the job post. The enrichment pipeline uses a two-stage approach:
LLM extraction: The LLM extracts up to 5 technologies as part of its structured output.
NLP enrichment: After LLM extraction, a hybrid NLP pipeline expands the list to up to 20 technologies using NLP noun extraction and trie-based matching against a corpus of 5,800+ known technologies.
All technology names are resolved to their canonical form. For example, "Postgres", "postgresql", and "pg" are all stored as "PostgreSQL". This ensures consistent filtering and deduplication.
If a term the LLM or NLP pipeline produces is not in the technology corpus, it is discarded.
Job Post Language
Field: job_post_language Type: String (ISO 639 language code) Default: en
The language of the original job posting. Stored as an ISO 639-1 or 639-3 code (e.g. en, fr, de, es, pt). Defaults to en (English) when the language cannot be determined. Maximum 5 characters.
List Fields
Each of these fields is an array of short strings, capped at 5 items. They are extracted directly from the job posting content.
job_benefits_list
Benefits and perks offered by the employer (e.g. "Health insurance", "401(k) match", "Unlimited PTO")
job_responsibilities_list
Key responsibilities the role entails
job_requirements_list
Required or strongly preferred qualifications and skills
job_problems_list
Problems or challenges the new hire is expected to work on or solve
All four fields default to empty arrays when no relevant content is found.
Job Description
Field: job_description Type: String (≤ 15 words)
A single, concise sentence describing the most compelling task or project the new hire will work on. Written from the LLM's interpretation of the post — not a copy of the full job description.
Fallback: If the LLM cannot produce a meaningful one-task description (returns NONE or empty), the field falls back to the job description already stored on the job object from the scraper.
ONET Code
Field: job_post_onet6_code Type: String Format: XX-XXXX.XX (e.g. 15-1252.00)
The O*NET occupation code for the role. O*NET is the US Department of Labor's occupational classification system. This code can be used to look up standardized occupation descriptions, required skills, typical wages, and labor market data at onetonline.org.
Validation and fallback: The code is validated against the XX-XXXX.XX format. If the LLM returns a malformed or missing code, a separate fallback LLM call is made using just the job title and post content to obtain a valid code.
Derived Boolean Flags
These fields are computed from other enriched fields — they are never directly extracted from the post.
is_remote
Boolean
true when job_location_type is remote or remote_restricted
is_internship
Boolean
true when job_level is internship or job_commitment is internship
is_offers_visa_sponsorship
Boolean
true when the post explicitly mentions visa sponsorship availability
Empty / Null Field Behavior
Fields are left empty (empty string "" or empty array []) rather than null when data is unavailable or invalid. The specific behavior per field type:
LLM returns NONE
Field stored as empty string or empty array
Enum value not in the allowed list
Field stored as empty (or uses the stated default)
Compensation min > max
Both min and max cleared; period and currency also cleared
Placeholder compensation detected
All compensation fields cleared
Location country not recognized
That location entry excluded from parsed lists
Geocoded coordinates are 0, 0
Latitude and longitude cleared
No concise metro match for any location
job_locations_concise_list stored as []
Technology not in corpus
Technology entry discarded
JSON array fields fail to parse
Field stored as empty array
Fields with explicit defaults (noted in their sections above):
job_commitment→full_timejob_location_type→on_sitejob_post_language→enis_offers_visa_sponsorship→falseis_remote→falseis_internship→false
Supported Currencies
The following 50 currency codes are supported. Values outside this list are discarded.
Fiat Currencies
Americas
USD, CAD, AUD, NZD, BRL, MXN, CLP, COP, PEN
Europe
EUR, GBP, CHF, SEK, NOK, DKK, PLN, CZK, HUF, RON, HRK, BGN, RUB
Middle East & Africa
AED, ILS, ZAR, KES, NGN, MAD
Eastern Europe & Central Asia
TRY, UAH, GEL
Asia Pacific
JPY, CNY, HKD, INR, SGD, KRW, IDR, MYR, PHP, THB, VND, BDT, PKR, LKR
Cryptocurrencies
BTC
Bitcoin
ETH
Ethereum
LTC
Litecoin
XRP
XRP (Ripple)
XMR
Monero
Last updated