Data Exports
This document describes the JobFront collections export system — how job data is packaged and delivered to customers, what fields are included in each format, how often exports refresh, and how to configure custom field mappings.
Overview
The JobFront collections export system continuously publishes job data from your configured collections to your chosen delivery destination. Each collection maps to a curated set of employer sources; the export system packages all active jobs from those sources into structured files and pushes them to you on a scheduled cadence.
Three output formats are supported:
JSON: REST API consumers, data pipelines, modern integrations
XML: ATS integrations, legacy job board ingestion, feed-based systems
Parquet: Data warehouses, analytics platforms, bulk ML/data science workloads
Export Cadence & Frequency
Default refresh interval: Every 6 hours
Configurable per collection: Yes
Collections are evaluated in a continuous scheduling loop. Each collection has its own independent timer. When a collection's configured interval has elapsed since its last successful export, a new export run begins immediately.
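The per-collection timer check described above can be sketched roughly as follows. This is a minimal illustration, not the scheduler itself: the function name `is_export_due` and the rule that a never-exported collection is due immediately are assumptions.

```python
import time
from typing import Optional

DEFAULT_INTERVAL_SECONDS = 6 * 60 * 60  # documented default: every 6 hours


def is_export_due(last_success_at: Optional[float],
                  interval_seconds: int = DEFAULT_INTERVAL_SECONDS,
                  now: Optional[float] = None) -> bool:
    """True when the interval has elapsed since the last successful export."""
    if now is None:
        now = time.time()
    if last_success_at is None:
        # Assumption: a collection with no successful export yet is due right away.
        return True
    return now - last_success_at >= interval_seconds
```

Because the check uses the last successful export, a failed run would not reset the timer under this reading.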
Delivery Methods
S3 (Jobfront-hosted)
Files are written to a designated S3 bucket and folder path. You access them via:
Pre-signed URLs — time-limited download links (default 1-hour TTL) provided by the Jobfront API. No AWS credentials required on your end.
IAM credentials — a read-only AWS IAM user scoped to your prefix. Use standard AWS CLI or SDK tooling to sync files directly.
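For the pre-signed URL option, a download needs nothing beyond an HTTP client. The sketch below uses only the Python standard library; `download_export` is a hypothetical helper name, and the URL is assumed to have been obtained from the Jobfront API beforehand.

```python
import shutil
import urllib.request
from pathlib import Path


def download_export(presigned_url: str, dest: Path) -> Path:
    """Stream an export file from a pre-signed URL to a local path."""
    # No AWS credentials are involved: the signature is part of the URL itself.
    with urllib.request.urlopen(presigned_url, timeout=30) as resp, dest.open("wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

Since the links are time-limited, the download should start before the TTL expires; an expired link would need to be requested again.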
SFTP (push to your server)
Files are pushed directly to your SFTP server after each successful export. Configuration requires:
Hostname
Username / authentication credentials
Base directory path
Optional connection timeout (default: 30 seconds)
File Organization
Each collection can be configured in one of two file organization modes:
aggregate_sources: All employers in the collection are merged into a single file per export run. Typical use: feed to a single job board or ATS.
single_source: One separate file is produced per employer/source. Typical use: per-employer intake pipelines, employer-specific routing.
In single_source mode the filename includes the source identifier so each employer's file can be processed independently.
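The difference between the two modes can be illustrated with a small grouping sketch. The filename patterns here are hypothetical (the actual naming scheme is not specified in this document), as is the helper name `plan_files`.

```python
from collections import defaultdict
from typing import Dict, List


def plan_files(jobs: List[dict], mode: str, collection_id: str) -> Dict[str, List[dict]]:
    """Map (hypothetical) output filenames to the jobs each file would contain."""
    if mode == "aggregate_sources":
        # One file for the whole collection.
        return {f"{collection_id}.json": list(jobs)}
    if mode == "single_source":
        # One file per employer; the source identifier is part of the filename.
        files: Dict[str, List[dict]] = defaultdict(list)
        for job in jobs:
            files[f"{collection_id}_{job['source']['id']}.json"].append(job)
        return dict(files)
    raise ValueError(f"unknown file organization mode: {mode}")
```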
Export Formats
JSON & XML Exports
Four named format variants are available. All variants expose the same fields — the choice controls the output structure and element naming conventions:
json_jobfront (JSON): Standard Jobfront JSON schema
xml_jobfront (XML): Standard Jobfront XML schema
The fields included in each export are controlled by the data tier setting (min, default, or max) described in the next section.
Parquet Exports
Parquet files use Apache Parquet format with Snappy compression and a row group size of 50,000 rows. The schema is fixed (see Parquet Schema below). Parquet exports always include the equivalent of the default tier fields plus additional timestamp and lifecycle metadata.
Each export cycle produces:
Data Tiers (min / default / max)
JSON and XML exports support three data tiers. Higher tiers are supersets — each tier includes all fields from the tiers below it.
min tier
The core required fields present in every export.
id (string): Unique Jobfront job identifier
title (string): Job title
description (string): AI-generated summary of the job post
post (string, HTML): Full job post content
url_job (string): Direct application / posting URL
job_status (string): active or inactive
created (string): Date first seen, formatted YYYY-MM-DD
created_at (integer): Unix timestamp when the job was first seen
salary.min (float | null): Minimum salary. Null if not available. If only one bound is listed in the source, both min and max are set to the same value.
salary.max (float | null): Maximum salary. Null if not available.
salary.currency (string): ISO currency code, e.g. USD
salary.period (string): annual, hourly, monthly, or weekly
onet_code (string): O*NET-SOC occupation code (8 digits), e.g. 15-1252.00
remote (boolean): Whether the position is fully remote
locations[].text (string): Full location text string
locations[].street (string): Street address
locations[].city (string): City / locality
locations[].state (string): State / region
locations[].country (string): Country
locations[].zip (string): ZIP / postal code
source.id (string): Internal employer/source identifier
source.name (string): Employer / company name
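The single-bound rule noted for salary.min and salary.max can be written out as a short sketch. The function name is hypothetical; the behavior follows the field descriptions above.

```python
from typing import Optional, Tuple


def normalize_salary_bounds(low: Optional[float],
                            high: Optional[float]) -> Tuple[Optional[float], Optional[float]]:
    """Mirror a lone salary bound so min and max carry the same value."""
    if low is None and high is None:
        return None, None  # salary not available
    if low is None:
        return high, high
    if high is None:
        return low, low
    return low, high
```

A consumer can therefore treat `salary.min == salary.max` as "a single figure was listed" rather than as a zero-width range.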
default tier
Includes all min fields, plus:
work_model (string): on_site, remote, or hybrid
commitment (string): Employment type: full_time, part_time, contract, internship, etc.
level (string): Seniority level: junior, mid, senior, executive, etc.
responsibilities (array of string): Responsibilities extracted from the job post
post_language (string): BCP-47 language code of the job post, e.g. en
jobboard_format (string): Source job board type/format
offers_visa_sponsorship (boolean): Whether the post mentions visa sponsorship
source.domain (string): Employer website domain
source.url_logo (string): URL to the employer's logo image
max tier
Includes all default fields, plus:
benefits (array of string): Benefits mentioned in the post (health, PTO, 401k, etc.)
requirements (array of string): Requirements and qualifications listed
problems (array of string): Problem statements extracted from the post
categories (array of string): Job category classifications
technologies (array of string): Technologies and skills inferred from the post
locations[].lat (float): Latitude (geo-coordinate)
locations[].lon (float): Longitude (geo-coordinate)
verified_inactive_at (integer | null): Unix timestamp when the job was confirmed inactive. Null if still active.
scraped_extracted.* (object): Raw fields extracted directly by the scraper
brands (array of string): Brand names extracted from the post
addresses (array of string): Full address strings extracted from the post
source.description (string): Employer description
source.industries (array of string): Industry tags for the employer
source.tags (array of string): Freeform tags for the employer
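The superset relationship between tiers can be pictured as cumulative field sets. The sketch below is abbreviated on purpose: it lists only a few top-level fields per tier rather than the full tables, and `fields_for_tier` and `project` are hypothetical helper names.

```python
TIER_ORDER = ["min", "default", "max"]

# Abbreviated: a handful of the fields each tier introduces, not the full lists.
TIER_ADDS = {
    "min": {"id", "title", "url_job", "job_status"},
    "default": {"work_model", "commitment", "level"},
    "max": {"benefits", "requirements", "technologies"},
}


def fields_for_tier(tier: str) -> set:
    """Union of the fields introduced by this tier and every tier below it."""
    allowed = set()
    for name in TIER_ORDER[: TIER_ORDER.index(tier) + 1]:
        allowed |= TIER_ADDS[name]
    return allowed


def project(job: dict, tier: str) -> dict:
    """Keep only the fields that belong to the requested tier."""
    allowed = fields_for_tier(tier)
    return {key: value for key, value in job.items() if key in allowed}
```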
Data Quality Filters
Every job included in an export — regardless of format — must pass export-time quality checks.
Deduplication is applied within each export run:
By job_id: a job cannot appear twice in the same file
By compound key (job_board_source_id, source_id): prevents the same logical posting from appearing under multiple job board IDs
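A rough sketch of the two deduplication checks follows. Two points are assumptions rather than documented behavior: the first occurrence of a duplicate wins, and the compound-key check is skipped when either part of the key is missing.

```python
from typing import List


def dedupe_jobs(jobs: List[dict]) -> List[dict]:
    """Drop repeats by job_id and by (job_board_source_id, source_id)."""
    seen_ids = set()
    seen_postings = set()
    kept = []
    for job in jobs:
        job_id = job["job_id"]
        board_id, source_id = job.get("job_board_source_id"), job.get("source_id")
        posting = (board_id, source_id)
        # Assumption: an incomplete compound key is not used for matching.
        has_posting_key = board_id is not None and source_id is not None
        if job_id in seen_ids or (has_posting_key and posting in seen_postings):
            continue
        seen_ids.add(job_id)
        if has_posting_key:
            seen_postings.add(posting)
        kept.append(job)
    return kept
```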
Custom Field Mapping System
By default, exports use the JobFront field names and structure documented above. The optional custom mapping system lets you define a JSON spec that transforms every job record into a completely different shape before it is written to the output file — renaming fields, restructuring nested objects, applying value maps, computing derived strings, and more.
A custom mapping is stored per organization. The active mapping is the most recently published spec for your organization_id.
How Mappings Work
When a mapping spec is configured, each job record passes through the mapping engine after enrichment but before serialization. The engine walks the fields object in your spec and builds a new output record from scratch. Fields not referenced in your spec are omitted from the output.
The mapping engine operates on the full enriched job object (the max-tier record), so all fields documented in the min, default, and max tiers are available as source paths.
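The "build a new record from scratch" behavior can be sketched with a toy engine that understands only two spec keys. This is an illustration of the idea, not the actual engine: `apply_mapping` is a hypothetical name and source paths are limited to top-level fields here.

```python
def apply_mapping(spec: dict, job: dict) -> dict:
    """Build a fresh output record from the spec's fields object."""
    output = {}
    for out_name, field_spec in spec["fields"].items():
        if "const" in field_spec:
            output[out_name] = field_spec["const"]
        elif "from" in field_spec:
            # Top-level lookup only; nested paths are left out of this sketch.
            output[out_name] = job.get(field_spec["from"])
    return output


spec = {"fields": {"JobTitle": {"from": "title"}, "Feed": {"const": "jobfront"}}}
mapped = apply_mapping(spec, {"title": "Engineer", "level": "senior"})
```

Here `level` does not appear in `mapped`, because the spec never references it.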
Field Spec Reference
Each key in fields is the output field name; the value is a spec object describing how to produce it.
from — map a source field directly
Supports dot-path notation and array indexing, e.g. locations[0].city.
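One way to resolve such a path is sketched below. The helper `resolve_path` is hypothetical, and returning None for a missing key or out-of-range index is an assumption about how absent values are treated.

```python
import re
from typing import Any

# Matches either a key segment (e.g. "locations") or an index segment (e.g. "[0]").
_PATH_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_path(record: Any, path: str) -> Any:
    """Follow a dot-path with optional [n] indexes; None if anything is missing."""
    current = record
    for key, index in _PATH_TOKEN.findall(path):
        if key:
            if not isinstance(current, dict) or key not in current:
                return None
            current = current[key]
        else:
            position = int(index)
            if not isinstance(current, list) or position >= len(current):
                return None
            current = current[position]
    return current
```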
fromAny — first non-empty wins
Tries each path in order and uses the first one that is present and non-empty.
const — fixed value
default — fallback value
Used with from or fromAny when the source field is missing or empty.
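How fromAny and default interact can be sketched as follows. The definition of "empty" used here (None, empty string, empty list, empty object) is an assumption, as is the helper name `resolve_field`; lookups are top-level only to keep the example short.

```python
from typing import Any


def is_empty(value: Any) -> bool:
    """Assumed notion of empty; 0 and False count as real values."""
    return value is None or value == "" or value == [] or value == {}


def resolve_field(field_spec: dict, job: dict) -> Any:
    """Return the first non-empty candidate, else the spec's default."""
    paths = field_spec.get("fromAny") or [field_spec["from"]]
    for path in paths:
        value = job.get(path)
        if not is_empty(value):
            return value
    return field_spec.get("default")
```

With the spec `{"fromAny": ["apply_url", "url_job"], "default": "N/A"}`, a record that has only `url_job` populated would yield the `url_job` value, and a record with neither would yield `"N/A"`. The `apply_url` name is made up for this example.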
template — compose a string from multiple fields
Uses {field_name} placeholders resolved from the source record. If all referenced fields are empty the output is an empty string.
Use [[ ... ]] for conditional segments — the segment is omitted entirely if all its fields are empty:
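The placeholder and conditional-segment rules can be sketched like this. It is a simplified reading of the behavior described above: placeholders are looked up as top-level field names only, and `render_template` is a hypothetical name.

```python
import re

_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")
# A conditional [[ ... ]] segment, or a bare {placeholder} outside any segment.
_TOKEN = re.compile(r"\[\[(.*?)\]\]|\{([^{}]+)\}")


def _value(record: dict, name: str) -> str:
    found = record.get(name.strip())
    return "" if found is None else str(found)


def _all_empty(text: str, record: dict) -> bool:
    names = _PLACEHOLDER.findall(text)
    return bool(names) and all(_value(record, n) == "" for n in names)


def render_template(template: str, record: dict) -> str:
    """Fill placeholders; drop [[...]] segments whose fields are all empty."""
    if _all_empty(template, record):
        return ""

    def replace(match: re.Match) -> str:
        segment = match.group(1)
        if segment is None:
            return _value(record, match.group(2))
        if _all_empty(segment, record):
            return ""
        return _PLACEHOLDER.sub(lambda m: _value(record, m.group(1)), segment)

    return _TOKEN.sub(replace, template)
```

With the template `{city}[[, {state}]]`, a record lacking `state` renders as the city alone, without a trailing comma.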
map — output a nested object
listMap — transform a list of objects
Each item in the source list is run through the listMap sub-spec.
listMapPrimitive — transform a list of primitives to a list of objects
Use fromSelf: true to reference the primitive item itself.
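Both list forms can be sketched with one helper, since the only difference is whether a sub-spec entry reads a key from the item or takes the item itself. The sub-spec shape and the name `list_map` are assumptions made for illustration.

```python
from typing import Any, List, Optional


def map_item(sub_spec: dict, item: Any) -> dict:
    """Apply a sub-spec to one list item (object or primitive)."""
    output = {}
    for out_name, field_spec in sub_spec.items():
        if field_spec.get("fromSelf"):
            output[out_name] = item  # listMapPrimitive: the item is the value
        elif "const" in field_spec:
            output[out_name] = field_spec["const"]
        else:
            output[out_name] = item.get(field_spec["from"])  # listMap: read a key
    return output


def list_map(sub_spec: dict, items: Optional[List[Any]]) -> List[dict]:
    """Run every item through the sub-spec; a missing list becomes []."""
    return [map_item(sub_spec, item) for item in (items or [])]


cities = list_map({"town": {"from": "city"}}, [{"city": "Austin", "state": "TX"}])
skills = list_map({"name": {"fromSelf": True}}, ["Python", "SQL"])
```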
buildList — conditionally assemble a list
Each entry is evaluated in order; entries whose when condition fails are skipped.
wrapAsList and splitBy — scalar to list
splitBy splits a delimited string into an array before wrapping.
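A sketch of the scalar-to-list behavior is shown below. Trimming whitespace around each piece and dropping empty pieces are assumptions, and `to_list` is a hypothetical name.

```python
from typing import Any, List, Optional


def to_list(value: Any, split_by: Optional[str] = None) -> List[Any]:
    """Wrap a scalar in a list, optionally splitting a delimited string first."""
    if value is None or value == "":
        return []
    if isinstance(value, list):
        return value  # already a list, nothing to wrap
    if split_by and isinstance(value, str):
        # Assumption: pieces are trimmed and blanks are discarded.
        return [piece.strip() for piece in value.split(split_by) if piece.strip()]
    return [value]
```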
joinWith — list to string
wrapWith — add prefix / suffix
number and boolean — type coercion
nullIfEmpty and nullIfZero — output null instead of blank / zero
regexReplace — pattern-based string rewriting
A list of rules applied in order. Each rule specifies pattern, replacement, and optional ignoreCase.
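Applying the rules in order can be sketched with Python's `re` module. Note that pattern and replacement syntax here follow Python's regular expressions, which may differ in detail from the syntax the mapping engine accepts.

```python
import re
from typing import List


def apply_regex_rules(value: str, rules: List[dict]) -> str:
    """Apply each {pattern, replacement, ignoreCase?} rule to the running value."""
    for rule in rules:
        flags = re.IGNORECASE if rule.get("ignoreCase") else 0
        value = re.sub(rule["pattern"], rule["replacement"], value, flags=flags)
    return value


rules = [
    {"pattern": r"\s+", "replacement": " "},                          # collapse whitespace
    {"pattern": r"sr\.", "replacement": "Senior", "ignoreCase": True},
]
cleaned = apply_regex_rules("SR.   Engineer", rules)
```

Order matters: each rule sees the output of the rule before it.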
setUrlParams — inject or replace URL query parameters
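Injecting or replacing query parameters can be sketched with `urllib.parse`. The helper name is hypothetical, and one simplification is assumed: if the original URL repeats a parameter, only its last value is kept.

```python
from typing import Dict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_url_params(url: str, params: Dict[str, str]) -> str:
    """Add new query parameters and overwrite existing ones with the same name."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))


tracked = set_url_params("https://example.com/jobs/1?ref=a&utm_source=old",
                         {"utm_source": "partner"})
```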
valueMap — remap specific values
Any value not present in the map is passed through unchanged.
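The pass-through rule amounts to a dictionary lookup with the input as its own fallback, as in this small sketch (hypothetical helper name, scalar values assumed):

```python
from typing import Any, Dict


def apply_value_map(value: Any, value_map: Dict[Any, Any]) -> Any:
    """Remap known values; anything not in the map is returned unchanged."""
    return value_map.get(value, value)


commitment_labels = {"full_time": "Full-Time", "part_time": "Part-Time"}
```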
Transform Reference
The transform key applies a single named transformation to the value after it is resolved.
lower: Converts string to lowercase
upper: Converts string to uppercase
title: Title-cases the string
trim: Strips leading and trailing whitespace
state_abbr: Normalizes US state name to 2-letter abbreviation (e.g. California → CA)
country_code: Normalizes common US country variants to USA
numeric_rightmost: Extracts the rightmost numeric component from a hyphen/underscore-delimited string
numeric_leftmost: Extracts the leftmost numeric component from a hyphen/underscore-delimited string
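The transforms above might behave roughly like the following sketch. Several details are assumptions: the state and country lookup tables are deliberately tiny, `str.title` only approximates title-casing, and returning None when no numeric component exists is a guess.

```python
import re
from typing import Any, Callable, Dict, List, Optional

# Deliberately incomplete lookup tables, for illustration only.
_STATE_ABBR = {"california": "CA", "new york": "NY", "texas": "TX"}
_US_VARIANTS = {"us", "u.s.", "usa", "united states", "united states of america"}


def _numeric_parts(value: str) -> List[str]:
    """Components of a hyphen/underscore-delimited string that are all digits."""
    return [part for part in re.split(r"[-_]", value) if part.isdigit()]


def _pick(value: str, index: int) -> Optional[str]:
    parts = _numeric_parts(value)
    return parts[index] if parts else None


TRANSFORMS: Dict[str, Callable[[str], Any]] = {
    "lower": str.lower,
    "upper": str.upper,
    "title": str.title,
    "trim": str.strip,
    "state_abbr": lambda v: _STATE_ABBR.get(v.strip().lower(), v),
    "country_code": lambda v: "USA" if v.strip().lower() in _US_VARIANTS else v,
    "numeric_rightmost": lambda v: _pick(v, -1),
    "numeric_leftmost": lambda v: _pick(v, 0),
}


def apply_transform(name: str, value: str) -> Any:
    """Apply one named transform to an already-resolved string value."""
    return TRANSFORMS[name](value)
```

For an identifier such as `req-1234-us_5678`, the numeric components are `1234` and `5678`, so the leftmost and rightmost transforms pick different values.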
Date Format Reference
The dateFormat key converts timestamp values. An optional timezone key (IANA tz name, e.g. America/New_York) applies timezone-aware conversion.
unix->iso: input Unix timestamp (integer); example output 2024-06-15T09:30:00
unix->date: input Unix timestamp (integer); example output 2024-06-15
unix->rfc: input Unix timestamp (integer); example output Sat, 15 Jun 2024 09:30:00 EDT
iso->date: input ISO datetime string; example output 2024-06-15
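The four conversions can be approximated with the standard library as sketched below. Falling back to UTC when no timezone is given is an assumption, and `convert_date` is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Optional, Union
from zoneinfo import ZoneInfo


def convert_date(value: Union[int, str], date_format: str,
                 tz_name: Optional[str] = None) -> str:
    """Convert a timestamp according to one of the dateFormat names."""
    if date_format == "iso->date":
        return datetime.fromisoformat(str(value)).date().isoformat()
    # Assumption: UTC is used when no timezone key is present.
    tz = ZoneInfo(tz_name) if tz_name else timezone.utc
    moment = datetime.fromtimestamp(int(value), tz)
    if date_format == "unix->iso":
        return moment.strftime("%Y-%m-%dT%H:%M:%S")
    if date_format == "unix->date":
        return moment.strftime("%Y-%m-%d")
    if date_format == "unix->rfc":
        return moment.strftime("%a, %d %b %Y %H:%M:%S %Z")
    raise ValueError(f"unsupported dateFormat: {date_format}")
```

The timestamp 1718458200 is 13:30:00 UTC on 15 June 2024. With `tz_name="America/New_York"` that instant corresponds to 09:30 local time (EDT), which lines up with the example outputs listed above, provided IANA time zone data is installed on the system.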
Conditional Logic
The cases key evaluates a list of conditions and uses the first match.
Supported when operators:
exists ({ "exists": "field.path" }): True if the field is present and non-null
equals ({ "equals": ["field.path", value] }): True if the field equals the given value
contains ({ "contains": ["field.path", "substring"] }): True if the string field contains the substring (case-insensitive)
notContains ({ "notContains": ["field.path", "substring"] }): True if the string field does not contain the substring
equalsField ({ "equalsField": ["field.a", "field.b"] }): True if two fields have equal values
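A first-match evaluator for these operators might look like the sketch below. Several points are assumptions: each case carries its value under `const`, a case without `when` acts as a fallback, notContains mirrors the case-insensitivity of contains, and field paths are top-level names only.

```python
from typing import Any, List


def evaluate_when(condition: dict, record: dict) -> bool:
    """Evaluate a single-operator condition against a flat record."""
    operator, argument = next(iter(condition.items()))
    if operator == "exists":
        return record.get(argument) is not None
    field, other = argument
    value = record.get(field)
    if operator == "equals":
        return value == other
    if operator == "equalsField":
        return value == record.get(other)
    if operator in ("contains", "notContains"):
        found = isinstance(value, str) and other.lower() in value.lower()
        return found if operator == "contains" else not found
    raise ValueError(f"unsupported operator: {operator}")


def first_match(cases: List[dict], record: dict) -> Any:
    """Return the value of the first case whose condition passes."""
    for case in cases:
        if "when" not in case or evaluate_when(case["when"], record):
            return case.get("const")
    return None


cases = [
    {"when": {"equals": ["work_model", "remote"]}, "const": "Remote"},
    {"when": {"contains": ["title", "intern"]}, "const": "Internship"},
    {"const": "Standard"},
]
```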
Example Mapping Spec
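As a purely illustrative example, a spec combining several of the keys described in this section might look like the JSON below, embedded in Python here so that it can be parsed and checked. The output field names and the exact shape of each entry are assumptions, not a published schema.

```python
import json

# Hypothetical spec: key shapes and output names are illustrative assumptions.
EXAMPLE_SPEC = json.loads("""
{
  "fields": {
    "JobId":    {"from": "id"},
    "Title":    {"from": "title", "transform": "trim"},
    "Company":  {"fromAny": ["source.name", "source.domain"], "default": "Unknown"},
    "City":     {"from": "locations[0].city"},
    "State":    {"from": "locations[0].state", "transform": "state_abbr"},
    "PostedOn": {"from": "created_at", "dateFormat": "unix->date",
                 "timezone": "America/New_York"},
    "JobType":  {"from": "commitment",
                 "valueMap": {"full_time": "Full-Time", "part_time": "Part-Time"}},
    "Feed":     {"const": "jobfront"}
  }
}
""")
```

Every source path used here (`id`, `title`, `source.name`, `locations[0].city`, `created_at`, `commitment`, and so on) comes from the tier tables earlier in this document.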
Export Run Metadata
Each export run is recorded with the following fields, accessible via the Jobfront API or metrics endpoints:
run_id: Unique identifier for this export run
organization_id: Your organization identifier
collection_id: Identifier of the collection exported
export_type: Delivery method: s3_jobfront or sftp_customer
export_data_format: Format: json_jobfront, json_aspen, xml_jobfront, xml_aspen
export_data_type: Data tier: min, default, or max
export_sources_file_fan_type: File organization: aggregate_sources or single_source
count_total_jobs: Total jobs included in this export
count_total_sources: Total employer sources included in this export
export_time_start_at: Unix timestamp when the export run began
export_time_end_at: Unix timestamp when the export run completed
export_time_duration: Duration in seconds
export_url_file: URL to the exported file
export_url_metrics: URL to the per-run metrics JSON
track_metrics_dictionary: Summary metrics: total jobs/sources, most recent job added/updated/removed timestamps
track_qa_dictionary: Field fill-rate statistics (the percentage of jobs that have each field populated)
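A fill-rate figure of the kind reported in track_qa_dictionary can be computed as in this sketch. What counts as "populated" (anything other than None, an empty string, or an empty list) is an assumption, and `fill_rates` is a hypothetical name.

```python
from typing import Dict, List


def fill_rates(jobs: List[dict], fields: List[str]) -> Dict[str, float]:
    """Percentage of jobs in which each field holds a non-empty value."""
    total = len(jobs)
    rates = {}
    for field in fields:
        populated = sum(1 for job in jobs if job.get(field) not in (None, "", []))
        rates[field] = 100.0 * populated / total if total else 0.0
    return rates
```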
Location Filtering
Exports can be filtered by geography. Location filters are configured per collection and applied before export:
country: Limit to jobs in a specific country (default: United States)
state: Limit to jobs in a specific US state
city: Limit to jobs in a specific city
remote: Include or exclude remote jobs
allow_remote_without_location: When true, remote jobs with no location data are included regardless of other location filters
Remote jobs (is_remote = true) pass through country/state/city filters by default — they are not excluded by geography unless remote: false is explicitly set.