Data Exports

This document describes the JobFront collections export system — how job data is packaged and delivered to customers, what fields are included in each format, how often exports refresh, and how to configure custom field mappings.


Overview

The JobFront collections export system continuously publishes job data from your configured collections to your chosen delivery destination. Each collection maps to a curated set of employer sources; the export system packages all active jobs from those sources into structured files and pushes them to you on a scheduled cadence.

Three output formats are supported:

Format    Best for
JSON      REST API consumers, data pipelines, modern integrations
XML       ATS integrations, legacy job board ingestion, feed-based systems
Parquet   Data warehouses, analytics platforms, bulk ML/data science workloads


Export Cadence & Frequency

Setting                      Value
Default refresh interval     Every 6 hours
Configurable per collection  Yes

Collections are evaluated in a continuous scheduling loop. Each collection has its own independent timer. When a collection's configured interval has elapsed since its last successful export, a new export run begins immediately.


Delivery Methods

S3 (Jobfront-hosted)

Files are written to a designated S3 bucket and folder path. You access them via:

  • Pre-signed URLs — time-limited download links (default 1-hour TTL) provided by the Jobfront API. No AWS credentials required on your end.

  • IAM credentials — a read-only AWS IAM user scoped to your prefix. Use standard AWS CLI or SDK tooling to sync files directly.

SFTP (push to your server)

Files are pushed directly to your SFTP server after each successful export. Configuration requires:

  • Hostname

  • Username / authentication credentials

  • Base directory path

  • Optional connection timeout (default: 30 seconds)


File Organization

Each collection can be configured in one of two file organization modes:

Mode               Description                                                                    Typical use case
aggregate_sources  All employers in the collection are merged into a single file per export run   Feed to a single job board or ATS
single_source      One separate file is produced per employer/source                              Per-employer intake pipelines, employer-specific routing

In single_source mode the filename includes the source identifier so each employer's file can be processed independently.


Export Formats

JSON & XML Exports

Four named format variants are available. All variants expose the same fields; the identifier controls the output structure and element naming conventions:

Format identifier  Type  Description
json_jobfront      JSON  Standard Jobfront JSON schema
json_aspen         JSON  Aspen-variant JSON schema (alternate structure and element naming)
xml_jobfront       XML   Standard Jobfront XML schema
xml_aspen          XML   Aspen-variant XML schema (alternate structure and element naming)
The fields included in each export are controlled by the data tier setting (min, default, or max) described in the next section.

Parquet Exports

Parquet files use the Apache Parquet format with Snappy compression and a row group size of 50,000 rows. The schema is fixed (see Parquet Schema below). Parquet exports always include the equivalent of the default tier fields plus additional timestamp and lifecycle metadata.


Data Tiers (min / default / max)

JSON and XML exports support three data tiers. Higher tiers are supersets — each tier includes all fields from the tiers below it.

min tier

The core required fields present in every export.

Field                Type           Description
id                   string         Unique Jobfront job identifier
title                string         Job title
description          string         AI-generated summary of the job post
post                 string (HTML)  Full job post content
url_job              string         Direct application / posting URL
job_status           string         active or inactive
created              string         Date first seen, formatted YYYY-MM-DD
created_at           integer        Unix timestamp when the job was first seen
salary.min           float | null   Minimum salary; null if not available. If the source lists only one bound, min and max are set to the same value.
salary.max           float | null   Maximum salary; null if not available
salary.currency      string         ISO currency code, e.g. USD
salary.period        string         annual, hourly, monthly, or weekly
onet_code            string         O*NET-SOC occupation code, e.g. 15-1252.00
remote               boolean        Whether the position is fully remote
locations[].text     string         Full location text string
locations[].street   string         Street address
locations[].city     string         City / locality
locations[].state    string         State / region
locations[].country  string         Country
locations[].zip      string         ZIP / postal code
source.id            string         Internal employer/source identifier
source.name          string         Employer / company name

default tier

Includes all min fields, plus:

Field                    Type             Description
work_model               string           on_site, remote, or hybrid
commitment               string           Employment type: full_time, part_time, contract, internship, etc.
level                    string           Seniority level: junior, mid, senior, executive, etc.
responsibilities         array of string  Responsibilities extracted from the job post
post_language            string           BCP-47 language code of the job post, e.g. en
jobboard_format          string           Source job board type/format
offers_visa_sponsorship  boolean          Whether the post mentions visa sponsorship
source.domain            string           Employer website domain
source.url_logo          string           URL to the employer's logo image

max tier

Includes all default fields, plus:

Field                 Type             Description
benefits              array of string  Benefits mentioned in the post (health, PTO, 401k, etc.)
requirements          array of string  Requirements and qualifications listed
problems              array of string  Problem statements extracted from the post
categories            array of string  Job category classifications
technologies          array of string  Technologies and skills inferred from the post
locations[].lat       float            Latitude (geo-coordinate)
locations[].lon       float            Longitude (geo-coordinate)
verified_inactive_at  integer | null   Unix timestamp when the job was confirmed inactive; null if still active
scraped_extracted.*   object           Raw fields extracted directly by the scraper
brands                array of string  Brand names extracted from the post
addresses             array of string  Full address strings extracted from the post
source.description    string           Employer description
source.industries     array of string  Industry tags for the employer
source.tags           array of string  Freeform tags for the employer


Data Quality Filters

Every job included in an export — regardless of format — must pass export-time quality checks.

Deduplication is applied within each export run:

  • By job_id — a job cannot appear twice in the same file

  • By compound key (job_board_source_id, source_id) — prevents the same logical posting from appearing under multiple job board IDs
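
The two dedup passes can be sketched as follows. The record shape and key names are taken from the bullets above; the real pipeline's internals may differ.

```python
# Minimal sketch of within-run deduplication: first by job_id, then by the
# compound (job_board_source_id, source_id) key. Assumes both keys are present.
def dedupe(jobs: list[dict]) -> list[dict]:
    seen_ids: set = set()
    seen_pairs: set = set()
    out: list[dict] = []
    for job in jobs:
        pair = (job.get("job_board_source_id"), job.get("source_id"))
        if job["job_id"] in seen_ids or pair in seen_pairs:
            continue  # duplicate within this export run: drop it
        seen_ids.add(job["job_id"])
        seen_pairs.add(pair)
        out.append(job)
    return out
```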


Custom Field Mapping System

By default, exports use the JobFront field names and structure documented above. The optional custom mapping system lets you define a JSON spec that transforms every job record into a completely different shape before it is written to the output file — renaming fields, restructuring nested objects, applying value maps, computing derived strings, and more.

A custom mapping is stored per organization. The active mapping is the most recently published spec for your organization_id.

How Mappings Work

When a mapping spec is configured, each job record passes through the mapping engine after enrichment but before serialization. The engine walks the fields object in your spec and builds a new output record from scratch. Fields not referenced in your spec are omitted from the output.

The mapping engine operates on the full enriched job record, so all fields documented in the min, default, and max tiers are available as source paths.

Field Spec Reference

Each key in fields is the output field name; the value is a spec object describing how to produce it.


from — map a source field directly

Supports dot-path notation and array indexing, e.g. locations[0].city.
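
For example, a spec that copies two source fields into renamed output fields (the output field names are illustrative):

```json
{
  "fields": {
    "job_title": { "from": "title" },
    "city":      { "from": "locations[0].city" }
  }
}
```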


fromAny — first non-empty wins

Tries each path in order and uses the first one that is present and non-empty.
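
An illustrative entry (shown without the surrounding fields wrapper) that prefers the employer name and falls back to the domain:

```json
{ "company": { "fromAny": ["source.name", "source.domain"] } }
```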


const — fixed value
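
A const entry emits the same literal value for every record; for example, a hypothetical feed tag:

```json
{ "feed_source": { "const": "jobfront" } }
```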


default — fallback value

Used with from or fromAny when the source field is missing or empty.
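
For example, falling back to USD when the source omits a currency (output name is illustrative):

```json
{ "currency": { "from": "salary.currency", "default": "USD" } }
```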


template — compose a string from multiple fields

Uses {field_name} placeholders resolved from the source record. If all referenced fields are empty the output is an empty string.

Use [[ ... ]] for conditional segments — the segment is omitted entirely if all its fields are empty:
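
An illustrative template; whether placeholders accept dot-paths and array indexing like from does is an assumption:

```json
{
  "headline": {
    "template": "{title} at {source.name}[[ in {locations[0].city}]]"
  }
}
```

Here the " in …" segment is dropped entirely for jobs with no city.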


map — output a nested object
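
For example, a nested company object built from two source paths (exact syntax is illustrative):

```json
{
  "company": {
    "map": {
      "name":   { "from": "source.name" },
      "domain": { "from": "source.domain" }
    }
  }
}
```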


listMap — transform a list of objects

Each item in the source list is run through the listMap sub-spec.
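
An illustrative sketch; how the source list is referenced alongside listMap is an assumption:

```json
{
  "locations": {
    "from": "locations",
    "listMap": {
      "city":  { "from": "city" },
      "state": { "from": "state", "transform": "state_abbr" }
    }
  }
}
```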


listMapPrimitive — transform a list of primitives to a list of objects

Use fromSelf: true to reference the primitive item itself.
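
For example, turning the technologies string list into a list of objects (output shape is illustrative):

```json
{
  "skills": {
    "from": "technologies",
    "listMapPrimitive": {
      "name": { "fromSelf": true }
    }
  }
}
```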


buildList — conditionally assemble a list

Each entry is evaluated in order; entries whose when condition fails are skipped.
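
An illustrative sketch; combining when with a const value per entry is an assumption:

```json
{
  "tags": {
    "buildList": [
      { "when": { "equals": ["remote", true] }, "const": "remote" },
      { "when": { "exists": "salary.min" },     "const": "has_salary" }
    ]
  }
}
```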


wrapAsList and splitBy — scalar to list

splitBy splits a delimited string into an array before wrapping.
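
Illustrative sketches; the exact key shapes are assumptions, and categories_csv is a hypothetical raw scraper field:

```json
{
  "title_list":    { "from": "title", "wrapAsList": true },
  "category_list": { "from": "scraped_extracted.categories_csv", "splitBy": ",", "wrapAsList": true }
}
```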


joinWith — list to string
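
For example, flattening the technologies array to a comma-separated string (exact syntax is an assumption):

```json
{ "tech_list": { "from": "technologies", "joinWith": ", " } }
```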


wrapWith — add prefix / suffix
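
An illustrative sketch; the prefix/suffix key names are assumptions:

```json
{ "reference": { "from": "id", "wrapWith": { "prefix": "jf-", "suffix": "" } } }
```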


number and boolean — type coercion
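
Illustrative coercion entries; the flag-style syntax is an assumption:

```json
{
  "min_pay":   { "from": "salary.min", "number": true },
  "is_remote": { "from": "remote", "boolean": true }
}
```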


nullIfEmpty and nullIfZero — output null instead of blank / zero
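
Illustrative entries; the flag-style syntax is an assumption:

```json
{
  "street":  { "from": "locations[0].street", "nullIfEmpty": true },
  "min_pay": { "from": "salary.min", "nullIfZero": true }
}
```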


regexReplace — pattern-based string rewriting

A list of rules applied in order. Each rule specifies pattern, replacement, and optional ignoreCase.
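
For example, collapsing whitespace and stripping a noise word from titles (rule shape follows the description above; output name is illustrative):

```json
{
  "clean_title": {
    "from": "title",
    "regexReplace": [
      { "pattern": "\\s+", "replacement": " " },
      { "pattern": "urgent!*", "replacement": "", "ignoreCase": true }
    ]
  }
}
```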


setUrlParams — inject or replace URL query parameters
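
An illustrative sketch; the parameter-map shape is an assumption:

```json
{
  "tracked_url": {
    "from": "url_job",
    "setUrlParams": { "utm_source": "jobfront", "utm_medium": "feed" }
  }
}
```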


valueMap — remap specific values

Any value not present in the map is passed through unchanged.
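
For example, remapping commitment values to a partner's enum (exact syntax is illustrative):

```json
{
  "employment_type": {
    "from": "commitment",
    "valueMap": { "full_time": "FULL_TIME", "part_time": "PART_TIME" }
  }
}
```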


Transform Reference

The transform key applies a single named transformation to the value after it is resolved.

Value              Effect
lower              Converts string to lowercase
upper              Converts string to uppercase
title              Title-cases the string
trim               Strips leading and trailing whitespace
state_abbr         Normalizes a US state name to its 2-letter abbreviation (e.g. California → CA)
country_code       Normalizes common US country variants to USA
numeric_rightmost  Extracts the rightmost numeric component from a hyphen/underscore-delimited string
numeric_leftmost   Extracts the leftmost numeric component from a hyphen/underscore-delimited string
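
For example, combining a source path with a transform (output name is illustrative):

```json
{ "state": { "from": "locations[0].state", "transform": "state_abbr" } }
```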


Date Format Reference

The dateFormat key converts timestamp values. An optional timezone key (IANA tz name, e.g. America/New_York) applies timezone-aware conversion.

Value       Input                     Output example
unix->iso   Unix timestamp (integer)  2024-06-15T09:30:00
unix->date  Unix timestamp (integer)  2024-06-15
unix->rfc   Unix timestamp (integer)  Sat, 15 Jun 2024 09:30:00 EDT
iso->date   ISO datetime string       2024-06-15
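
For example, emitting the first-seen timestamp in two shapes (output names are illustrative):

```json
{
  "posted_date": { "from": "created_at", "dateFormat": "unix->date" },
  "posted_rfc":  { "from": "created_at", "dateFormat": "unix->rfc", "timezone": "America/New_York" }
}
```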


Conditional Logic

The cases key evaluates a list of conditions and uses the first match.

Supported when operators:

Operator     Syntax                                          Description
exists       { "exists": "field.path" }                      True if the field is present and non-null
equals       { "equals": ["field.path", value] }             True if the field equals the given value
contains     { "contains": ["field.path", "substring"] }     True if the string field contains the substring (case-insensitive)
notContains  { "notContains": ["field.path", "substring"] }  True if the string field does not contain the substring
equalsField  { "equalsField": ["field.a", "field.b"] }       True if two fields have equal values
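
An illustrative cases entry; pairing each case's when with a value spec, and the trailing default, are assumptions:

```json
{
  "workplace": {
    "cases": [
      { "when": { "equals": ["remote", true] },        "const": "Remote" },
      { "when": { "exists": "locations[0].city" },     "from": "locations[0].city" }
    ],
    "default": "Not specified"
  }
}
```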


Example Mapping Spec
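
The sketch below combines several of the constructs above into one spec. It is illustrative only: output field names are invented, and key shapes beyond those documented above are assumptions.

```json
{
  "fields": {
    "id":        { "from": "id" },
    "title":     { "from": "title", "transform": "trim" },
    "company":   { "fromAny": ["source.name", "source.domain"] },
    "city":      { "from": "locations[0].city", "nullIfEmpty": true },
    "state":     { "from": "locations[0].state", "transform": "state_abbr" },
    "is_remote": { "from": "remote", "boolean": true },
    "posted":    { "from": "created_at", "dateFormat": "unix->date" },
    "employment_type": {
      "from": "commitment",
      "valueMap": { "full_time": "FULL_TIME", "part_time": "PART_TIME" },
      "default": "OTHER"
    }
  }
}
```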


Export Run Metadata

Each export run is recorded with the following fields, accessible via the Jobfront API or metrics endpoints:

Field                         Description
run_id                        Unique identifier for this export run
organization_id               Your organization identifier
collection_id                 Identifier of the collection exported
export_type                   Delivery method: s3_jobfront or sftp_customer
export_data_format            Format: json_jobfront, json_aspen, xml_jobfront, or xml_aspen
export_data_type              Data tier: min, default, or max
export_sources_file_fan_type  File organization: aggregate_sources or single_source
count_total_jobs              Total jobs included in this export
count_total_sources           Total employer sources included in this export
export_time_start_at          Unix timestamp when the export run began
export_time_end_at            Unix timestamp when the export run completed
export_time_duration          Duration in seconds
export_url_file               URL to the exported file
export_url_metrics            URL to the per-run metrics JSON
track_metrics_dictionary      Summary metrics: total jobs/sources, most recent job added/updated/removed timestamps
track_qa_dictionary           Field fill-rate statistics: what percentage of jobs have each field populated


Location Filtering

Exports can be filtered by geography. Location filters are configured per collection and applied before export:

Filter                         Description
country                        Limit to jobs in a specific country (default: United States)
state                          Limit to jobs in a specific US state
city                           Limit to jobs in a specific city
remote                         Include or exclude remote jobs
allow_remote_without_location  When true, remote jobs with no location data are included regardless of other location filters

Remote jobs (remote = true) pass through country/state/city filters by default; they are not excluded by geography unless remote: false is explicitly set.
