# Data Exports

This document describes the JobFront collections export system — how job data is packaged and delivered to customers, what fields are included in each format, how often exports refresh, and how to configure custom field mappings.

***

### Overview

The JobFront collections export system continuously publishes job data from your configured collections to your chosen delivery destination. Each collection maps to a curated set of employer sources; the export system packages all active jobs from those sources into structured files and pushes them to you on a scheduled cadence.

Three output formats are supported:

| Format      | Best for                                                             |
| ----------- | -------------------------------------------------------------------- |
| **JSON**    | REST API consumers, data pipelines, modern integrations              |
| **XML**     | ATS integrations, legacy job board ingestion, feed-based systems     |
| **Parquet** | Data warehouses, analytics platforms, bulk ML/data science workloads |

***

### Export Cadence & Frequency

| Setting                     | Value         |
| --------------------------- | ------------- |
| Default refresh interval    | Every 6 hours |
| Configurable per collection | Yes           |

Collections are evaluated in a continuous scheduling loop. Each collection has its own independent timer. When a collection's configured interval has elapsed since its last successful export, a new export run begins immediately.
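
The per-collection timer check can be pictured with a minimal sketch like the one below. It is not the scheduler's actual implementation: the `CollectionSchedule` class, its field names, and the rule that a never-exported collection is due immediately are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_INTERVAL_SECONDS = 6 * 60 * 60  # documented default: every 6 hours


@dataclass
class CollectionSchedule:
    """Hypothetical per-collection timer state (names are illustrative)."""
    collection_id: str
    last_success_at: Optional[int]  # Unix timestamp of the last successful export
    interval_seconds: int = DEFAULT_INTERVAL_SECONDS


def is_due(schedule: CollectionSchedule, now: int) -> bool:
    """Return True once the configured interval has elapsed since the last success."""
    if schedule.last_success_at is None:
        return True  # assumption: a collection that never exported is due at once
    return now - schedule.last_success_at >= schedule.interval_seconds


def due_collections(schedules: List[CollectionSchedule], now: int) -> List[str]:
    """One pass of the loop: each collection is checked against its own timer."""
    return [s.collection_id for s in schedules if is_due(s, now)]
```

Because each collection carries its own `last_success_at`, a slow or failed export for one collection does not delay the others in this model.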

***

### Delivery Methods

#### S3 (Jobfront-hosted)

Files are written to a designated S3 bucket and folder path. You access them via:

* **Pre-signed URLs** — time-limited download links (default 1-hour TTL) provided by the Jobfront API. No AWS credentials required on your end.
* **IAM credentials** — a read-only AWS IAM user scoped to your prefix. Use standard AWS CLI or SDK tooling to sync files directly.

```bash
# AWS CLI example — download all files for your organization
aws s3 sync s3://<bucket>/<your-prefix>/ ./local-data/ \
  --exclude "*" --include "*.json" --include "*.xml" --include "*.parquet"
```

```python
# Python (s3fs + pyarrow): read a Parquet file directly from S3 without saving a local copy
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(key='YOUR_KEY', secret='YOUR_SECRET')
# Parquet exports are written as jobs-active-{N}.parquet (see Parquet Exports)
df = pq.read_table('s3://<bucket>/<your-prefix>/jobs-active-0.parquet', filesystem=fs).to_pandas()
```

#### SFTP (push to your server)

Files are pushed directly to your SFTP server after each successful export. Configuration requires:

* Hostname
* Username / authentication credentials
* Base directory path
* Optional connection timeout (default: 30 seconds)

***

### File Organization

Each collection can be configured in one of two file organization modes:

| Mode                    | Description                                                                  | Typical use case                                         |
| ----------------------- | ---------------------------------------------------------------------------- | -------------------------------------------------------- |
| **`aggregate_sources`** | All employers in the collection are merged into a single file per export run | Feed to a single job board or ATS                        |
| **`single_source`**     | One separate file is produced per employer/source                            | Per-employer intake pipelines, employer-specific routing |

In `single_source` mode the filename includes the source identifier so each employer's file can be processed independently.
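
As a rough illustration of the two modes, the sketch below groups job records into output files. The filename patterns (`jobs.json`, `jobs-<source id>.json`) are hypothetical, since the exact naming scheme is not specified here; only the grouping behavior follows the table above.

```python
from collections import defaultdict


def plan_files(jobs, mode, extension="json"):
    """Group job records into output files for one export run.

    Filenames are placeholders for illustration, not the real naming scheme.
    """
    if mode == "aggregate_sources":
        # All employers are merged into a single file
        return {f"jobs.{extension}": list(jobs)}
    if mode == "single_source":
        # One file per employer/source, keyed by the source identifier
        files = defaultdict(list)
        for job in jobs:
            files[f"jobs-{job['source']['id']}.{extension}"].append(job)
        return dict(files)
    raise ValueError(f"unknown file organization mode: {mode}")
```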

***

### Export Formats

#### JSON & XML Exports

Four named format variants are available: `json_jobfront`, `json_aspen`, `xml_jobfront`, and `xml_aspen`. All variants expose the same fields; the choice controls the output structure and element naming conventions. The two standard variants are:

| Format identifier | Type | Description                   |
| ----------------- | ---- | ----------------------------- |
| `json_jobfront`   | JSON | Standard Jobfront JSON schema |
| `xml_jobfront`    | XML  | Standard Jobfront XML schema  |

The fields included in each export are controlled by the **data tier** setting (`min`, `default`, or `max`), described under Data Tiers below.

#### Parquet Exports

Parquet files use [Apache Parquet](https://parquet.apache.org/) format with **Snappy compression** and a row group size of 50,000 rows. The schema is fixed (see Parquet Schema below). Parquet exports always include the equivalent of the `default` tier fields plus additional timestamp and lifecycle metadata.

Each export cycle produces:

```
jobs-active-{N}.parquet                       — job data (N = 0 through 39)
metrics/jobs-active-latest-metrics-{N}.json   — row counts, source counts, timing
metrics/jobs-active-latest-job_ids-{N}.json   — list of job IDs in this file
metrics/jobs-active-latest-source_ids-{N}.json — list of source IDs in this file
```
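
For consumers that want to enumerate the files of a cycle, a small helper along these lines may be convenient. It only expands the naming pattern listed above; the `prefix` argument and the assumption that every shard from 0 through 39 is present in each cycle are illustrative, so check for missing keys before reading.

```python
def parquet_cycle_keys(prefix, shard_count=40):
    """Expand the documented naming pattern for N = 0 .. shard_count - 1."""
    prefix = prefix.rstrip("/")
    keys = []
    for n in range(shard_count):
        keys.append({
            "data": f"{prefix}/jobs-active-{n}.parquet",
            "metrics": f"{prefix}/metrics/jobs-active-latest-metrics-{n}.json",
            "job_ids": f"{prefix}/metrics/jobs-active-latest-job_ids-{n}.json",
            "source_ids": f"{prefix}/metrics/jobs-active-latest-source_ids-{n}.json",
        })
    return keys
```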

***

### Data Tiers (min / default / max)

JSON and XML exports support three data tiers. Higher tiers are supersets — each tier includes all fields from the tiers below it.

#### min tier

The core required fields present in every export.

| Field                 | Type          | Description                                                                                                                   |
| --------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `id`                  | string        | Unique Jobfront job identifier                                                                                                |
| `title`               | string        | Job title                                                                                                                     |
| `description`         | string        | AI-generated summary of the job post                                                                                          |
| `post`                | string (HTML) | Full job post content                                                                                                         |
| `url_job`             | string        | Direct application / posting URL                                                                                              |
| `job_status`          | string        | `active` or `inactive`                                                                                                        |
| `created`             | string        | Date first seen, formatted `YYYY-MM-DD`                                                                                       |
| `created_at`          | integer       | Unix timestamp when job was first seen                                                                                        |
| `salary.min`          | float \| null | Minimum salary. Null if not available. If only one bound is listed in the source, both min and max are set to the same value. |
| `salary.max`          | float \| null | Maximum salary. Null if not available.                                                                                        |
| `salary.currency`     | string        | ISO currency code, e.g. `USD`                                                                                                 |
| `salary.period`       | string        | `annual`, `hourly`, `monthly`, or `weekly`                                                                                    |
| `onet_code`           | string        | O\*NET-SOC occupation code (8 digits), e.g. `15-1252.00`                                                                      |
| `remote`              | boolean       | Whether the position is fully remote                                                                                          |
| `locations[].text`    | string        | Full location text string                                                                                                     |
| `locations[].street`  | string        | Street address                                                                                                                |
| `locations[].city`    | string        | City / locality                                                                                                               |
| `locations[].state`   | string        | State / region                                                                                                                |
| `locations[].country` | string        | Country                                                                                                                       |
| `locations[].zip`     | string        | ZIP / postal code                                                                                                             |
| `source.id`           | string        | Internal employer/source identifier                                                                                           |
| `source.name`         | string        | Employer / company name                                                                                                       |

#### default tier

Includes all `min` fields, plus:

| Field                     | Type            | Description                                                               |
| ------------------------- | --------------- | ------------------------------------------------------------------------- |
| `work_model`              | string          | `on_site`, `remote`, or `hybrid`                                          |
| `commitment`              | string          | Employment type: `full_time`, `part_time`, `contract`, `internship`, etc. |
| `level`                   | string          | Seniority level: `junior`, `mid`, `senior`, `executive`, etc.             |
| `responsibilities`        | array of string | Responsibilities extracted from the job post                              |
| `post_language`           | string          | BCP-47 language code of the job post, e.g. `en`                           |
| `jobboard_format`         | string          | Source job board type/format                                              |
| `offers_visa_sponsorship` | boolean         | Whether the post mentions visa sponsorship                                |
| `source.domain`           | string          | Employer website domain                                                   |
| `source.url_logo`         | string          | URL to the employer's logo image                                          |

#### max tier

Includes all `default` fields, plus:

| Field                  | Type            | Description                                                               |
| ---------------------- | --------------- | ------------------------------------------------------------------------- |
| `benefits`             | array of string | Benefits mentioned in the post (health, PTO, 401k, etc.)                  |
| `requirements`         | array of string | Requirements and qualifications listed                                    |
| `problems`             | array of string | Problem statements extracted from the post                                |
| `categories`           | array of string | Job category classifications                                              |
| `technologies`         | array of string | Technologies and skills inferred from the post                            |
| `locations[].lat`      | float           | Latitude (geo-coordinate)                                                 |
| `locations[].lon`      | float           | Longitude (geo-coordinate)                                                |
| `verified_inactive_at` | integer \| null | Unix timestamp when the job was confirmed inactive. Null if still active. |
| `scraped_extracted.*`  | object          | Raw fields extracted directly by the scraper                              |
| `brands`               | array of string | Brand names extracted from the post                                       |
| `addresses`            | array of string | Full address strings extracted from the post                              |
| `source.description`   | string          | Employer description                                                      |
| `source.industries`    | array of string | Industry tags for the employer                                            |
| `source.tags`          | array of string | Freeform tags for the employer                                            |
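
The superset relationship between tiers can be modeled as cumulative sets, which makes it easy to work out the lowest tier that covers the fields an integration needs. The sketch below is illustrative only: the field sets are abbreviated to a handful of the documented fields, and the helper names are hypothetical.

```python
TIER_ORDER = ["min", "default", "max"]

# Abbreviated: only a few of the documented fields that each tier adds
TIER_ADDED_FIELDS = {
    "min": {"id", "title", "description", "post", "url_job", "job_status"},
    "default": {"work_model", "commitment", "level", "post_language"},
    "max": {"benefits", "requirements", "technologies", "categories"},
}


def fields_for_tier(tier):
    """Union of the fields added by this tier and every tier below it."""
    idx = TIER_ORDER.index(tier)  # raises ValueError for an unknown tier
    fields = set()
    for name in TIER_ORDER[: idx + 1]:
        fields |= TIER_ADDED_FIELDS[name]
    return fields


def minimum_tier_for(required_fields):
    """Lowest tier whose cumulative field set covers the required fields."""
    for tier in TIER_ORDER:
        if set(required_fields) <= fields_for_tier(tier):
            return tier
    return None  # at least one field is not in any tier
```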

***

### Data Quality Filters

Every job included in an export — regardless of format — must pass export-time quality checks.

**Deduplication** is applied within each export run:

* By `job_id` — a job cannot appear twice in the same file
* By compound key `(job_board_source_id, source_id)` — prevents the same logical posting from appearing under multiple job board IDs
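
A minimal sketch of the two deduplication rules follows. It assumes each record is a dict carrying `job_id`, `job_board_source_id`, and `source_id`, that the first occurrence wins, and that the compound-key check is skipped when either part of the key is missing; none of these details are confirmed by this page.

```python
def dedupe_jobs(jobs):
    """Drop repeats by job_id and by (job_board_source_id, source_id)."""
    seen_ids = set()
    seen_keys = set()
    kept = []
    for job in jobs:
        job_id = job["job_id"]
        key = (job.get("job_board_source_id"), job.get("source_id"))
        if job_id in seen_ids:
            continue  # same job already in this file
        has_key = None not in key  # assumption: incomplete keys are not compared
        if has_key and key in seen_keys:
            continue  # same logical posting under another job ID
        seen_ids.add(job_id)
        if has_key:
            seen_keys.add(key)
        kept.append(job)
    return kept
```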

***

### Custom Field Mapping System

By default, exports use the JobFront field names and structure documented above. The optional custom mapping system lets you define a JSON spec that transforms every job record into a completely different shape before it is written to the output file — renaming fields, restructuring nested objects, applying value maps, computing derived strings, and more.

A custom mapping is stored per organization. The active mapping is the most recently published spec for your `organization_id`.

#### How Mappings Work

When a mapping spec is configured, each job record passes through the mapping engine after enrichment but before serialization. The engine walks the `fields` object in your spec and builds a new output record from scratch. Fields not referenced in your spec are omitted from the output.

The mapping engine operates on the **`max`-tier** job record (the full enriched job object), so all fields documented in the `min`, `default`, and `max` tiers are available as source paths.

#### Field Spec Reference

Each key in `fields` is the output field name; the value is a spec object describing how to produce it.

***

**`from` — map a source field directly**

```json
{ "title": { "from": "title" } }
{ "apply_url": { "from": "url_job" } }
{ "city": { "from": "locations[0].city" } }
```

Supports dot-path notation and array indexing, e.g. `locations[0].city`.
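
To make the path syntax concrete, here is one way such a path could be resolved against a job record. This is a sketch of the notation only, not the mapping engine's code; `resolve_path` is a hypothetical helper, and returning `None` for a missing segment is an assumption.

```python
import re

# A path token is either a key name or a bracketed list index such as [0]
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_path(record, path):
    """Walk a path like 'locations[0].city'; return None if any step is missing."""
    current = record
    for name, index in _TOKEN.findall(path):
        if name:
            if not isinstance(current, dict) or name not in current:
                return None
            current = current[name]
        else:
            position = int(index)
            if not isinstance(current, list) or position >= len(current):
                return None
            current = current[position]
    return current
```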

***

**`fromAny` — first non-empty wins**

```json
{
  "compensation_min": {
    "fromAny": ["salary.min", "scraped_extracted.compensation_min"],
    "default": null
  }
}
```

Tries each path in order and uses the first one that is present and non-empty.
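
The fallback behavior can be sketched as follows. What counts as "empty" is an assumption here (`None`, empty string, empty list, empty object; `0` and `false` are treated as real values), and the dotted-path lookup is simplified to nested dict keys.

```python
def get_path(record, path):
    """Simplified lookup: dotted keys only, None when a key is missing."""
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def is_empty(value):
    # Assumption: 0 and False are real values, not "empty"
    return value is None or value == "" or value == [] or value == {}


def from_any(record, paths, default=None):
    """Return the first non-empty value among the paths, else the default."""
    for path in paths:
        value = get_path(record, path)
        if not is_empty(value):
            return value
    return default
```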

***

**`const` — fixed value**

```json
{ "country": { "const": "USA" } }
{ "feed_version": { "const": "2" } }
```

***

**`default` — fallback value**

Used with `from` or `fromAny` when the source field is missing or empty.

```json
{ "level": { "from": "level", "default": "mid" } }
```

***

**`template` — compose a string from multiple fields**

```json
{ "location_text": { "template": "{city}, {state}, {country}" } }
{ "salary_range": { "template": "{salary_min}–{salary_max} {salary_currency}" } }
```

Uses `{field_name}` placeholders resolved from the source record. If all referenced fields are empty the output is an empty string.

Use `[[ ... ]]` for conditional segments — the segment is omitted entirely if all its fields are empty:

```json
{ "display_title": { "template": "{title}[[ — {level}]]" } }
```
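
The placeholder and conditional-segment rules can be approximated with a short renderer like the one below. It is a sketch under assumptions: placeholders are looked up as flat keys of a dict, only `None` and the empty string count as empty, and segments are not nested.

```python
import re

_PLACEHOLDER = re.compile(r"\{(\w+)\}")
_SEGMENT = re.compile(r"\[\[(.*?)\]\]")


def _is_blank(value):
    return value is None or value == ""


def _all_blank(text, record):
    """True when text references at least one field and all of them are empty."""
    names = _PLACEHOLDER.findall(text)
    return bool(names) and all(_is_blank(record.get(n)) for n in names)


def _substitute(text, record):
    def repl(match):
        value = record.get(match.group(1))
        return "" if _is_blank(value) else str(value)
    return _PLACEHOLDER.sub(repl, text)


def render_template(template, record):
    if _all_blank(template, record):
        return ""  # every referenced field is empty
    pieces, pos = [], 0
    for seg in _SEGMENT.finditer(template):
        pieces.append(_substitute(template[pos:seg.start()], record))
        if not _all_blank(seg.group(1), record):
            pieces.append(_substitute(seg.group(1), record))  # keep the segment
        pos = seg.end()
    pieces.append(_substitute(template[pos:], record))
    return "".join(pieces)
```

With the template `{title}[[ - {level}]]`, a record that has no `level` renders as just the title, with the separator dropped along with the segment.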

***

**`map` — output a nested object**

```json
{
  "location": {
    "map": {
      "city":    { "from": "locations[0].city" },
      "state":   { "from": "locations[0].state", "transform": "state_abbr" },
      "country": { "from": "locations[0].country", "transform": "country_code" }
    }
  }
}
```

***

**`listMap` — transform a list of objects**

```json
{
  "offices": {
    "from": "locations",
    "listMap": {
      "city_name": { "from": "city", "transform": "lower" }
    }
  }
}
```

Each item in the source list is run through the `listMap` sub-spec.

***

**`listMapPrimitive` — transform a list of primitives to a list of objects**

```json
{
  "skill_objects": {
    "from": "technologies",
    "listMapPrimitive": {
      "name": { "fromSelf": true, "transform": "lower" }
    }
  }
}
```

Use `fromSelf: true` to reference the primitive item itself.

***

**`buildList` — conditionally assemble a list**

```json
{
  "tags": {
    "buildList": [
      { "value": "remote",   "when": { "equals": ["remote", true] } },
      { "from":  "level",    "when": { "exists": "level" } },
      { "from":  "commitment" }
    ]
  }
}
```

Each entry is evaluated in order; entries whose `when` condition fails are skipped.
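
The evaluation order can be sketched as below. Only the `equals` and `exists` conditions from the example are handled, lookups are limited to top-level keys, and skipping entries whose resolved value is missing is an assumption rather than documented behavior.

```python
def _when(record, condition):
    """Evaluate a minimal subset of `when` conditions on top-level fields."""
    if condition is None:
        return True  # entries without `when` are always considered
    if "exists" in condition:
        return record.get(condition["exists"]) is not None
    if "equals" in condition:
        field, expected = condition["equals"]
        return record.get(field) == expected
    raise ValueError(f"unsupported condition: {condition}")


def build_list(record, entries):
    """Assemble a list from entries, in order, skipping failed conditions."""
    result = []
    for entry in entries:
        if not _when(record, entry.get("when")):
            continue
        value = entry["value"] if "value" in entry else record.get(entry["from"])
        if value is not None:  # assumption: missing values are not appended
            result.append(value)
    return result
```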

***

**`wrapAsList` and `splitBy` — scalar to list**

```json
{ "category_list": { "from": "categories[0]", "wrapAsList": true } }
{ "tag_list": { "from": "source.tags", "wrapAsList": true, "splitBy": "," } }
```

`splitBy` splits a delimited string into an array before wrapping.

***

**`joinWith` — list to string**

```json
{ "skills_csv": { "from": "technologies", "joinWith": ", " } }
```

***

**`wrapWith` — add prefix / suffix**

```json
{ "job_ref": { "from": "id", "wrapWith": { "prefix": "JOB-", "suffix": "" } } }
```

***

**`number` and `boolean` — type coercion**

```json
{ "salary_min_num": { "from": "salary.min", "number": true } }
{ "is_remote_bool": { "from": "remote", "boolean": true } }
```

***

**`nullIfEmpty` and `nullIfZero` — output null instead of blank / zero**

```json
{ "salary_min": { "from": "salary.min", "nullIfZero": true } }
{ "department":  { "from": "department",  "nullIfEmpty": true } }
```

***

**`regexReplace` — pattern-based string rewriting**

A list of rules applied in order. Each rule specifies `pattern`, `replacement`, and optional `ignoreCase`.

```json
{
  "clean_title": {
    "from": "title",
    "regexReplace": [
      { "pattern": "\\(remote\\)",     "replacement": "", "ignoreCase": true },
      { "pattern": "\\s{2,}",          "replacement": " " }
    ]
  }
}
```
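
To preview what a rule list will do to a title before publishing a spec, the rules can be replayed locally. The sketch below uses Python's `re` module, so it assumes the mapping engine's regex dialect behaves the same way for the patterns involved, which is not confirmed here.

```python
import re


def regex_replace(value, rules):
    """Apply each rule in order; later rules see the output of earlier ones."""
    for rule in rules:
        flags = re.IGNORECASE if rule.get("ignoreCase") else 0
        value = re.sub(rule["pattern"], rule["replacement"], value, flags=flags)
    return value
```

Order matters: in the example above, removing `(remote)` first can leave a run of spaces, which the second rule then collapses to one.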

***

**`setUrlParams` — inject or replace URL query parameters**

```json
{
  "tracking_url": {
    "from": "url_job",
    "setUrlParams": { "utm_source": "jobfront", "utm_medium": "feed" }
  }
}
```
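
The inject-or-replace behavior can be reproduced with the standard library, as in the sketch below. It assumes an existing parameter with the same name is replaced rather than duplicated, and that replaced parameters move to the end of the query string; the actual ordering used by the export is not documented here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_url_params(url, params):
    """Return url with the given query parameters added or overwritten."""
    parts = urlsplit(url)
    # Keep existing parameters except the ones being overwritten
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in params
    ]
    query.extend(params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))
```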

***

**`valueMap` — remap specific values**

```json
{
  "employment_type": {
    "from": "commitment",
    "valueMap": {
      "full_time":  "FT",
      "part_time":  "PT",
      "contract":   "CT",
      "internship": "IN"
    }
  }
}
```

Any value not present in the map is passed through unchanged.

***

#### Transform Reference

The `transform` key applies a single named transformation to the value after it is resolved.

| Value               | Effect                                                                             |
| ------------------- | ---------------------------------------------------------------------------------- |
| `lower`             | Converts string to lowercase                                                       |
| `upper`             | Converts string to uppercase                                                       |
| `title`             | Title-cases the string                                                             |
| `trim`              | Strips leading and trailing whitespace                                             |
| `state_abbr`        | Normalizes US state name to 2-letter abbreviation (e.g. `California` → `CA`)       |
| `country_code`      | Normalizes common US country variants to `USA`                                     |
| `numeric_rightmost` | Extracts the rightmost numeric component from a hyphen/underscore-delimited string |
| `numeric_leftmost`  | Extracts the leftmost numeric component from a hyphen/underscore-delimited string  |
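
A few of these transforms are easy to emulate for local testing. The sketch below covers the case, trim, and numeric transforms only. It assumes the numeric transforms return the matching component as a string (leading zeros kept) and `None` when no component is purely numeric; `state_abbr` and `country_code` are omitted because they depend on lookup tables not shown on this page.

```python
import re


def _numeric_parts(value):
    """Purely numeric components of a hyphen/underscore-delimited string."""
    return [part for part in re.split(r"[-_]", value) if part.isdigit()]


TRANSFORMS = {
    "lower": str.lower,
    "upper": str.upper,
    "title": str.title,
    "trim": str.strip,
    "numeric_rightmost": lambda v: (_numeric_parts(v) or [None])[-1],
    "numeric_leftmost": lambda v: (_numeric_parts(v) or [None])[0],
}


def apply_transform(value, name):
    if not isinstance(value, str):
        return value  # assumption: non-string values pass through unchanged
    return TRANSFORMS[name](value)
```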

***

#### Date Format Reference

The `dateFormat` key converts timestamp values. An optional `timezone` key (IANA tz name, e.g. `America/New_York`) applies timezone-aware conversion.

| Value        | Input                    | Output example                  |
| ------------ | ------------------------ | ------------------------------- |
| `unix->iso`  | Unix timestamp (integer) | `2024-06-15T09:30:00`           |
| `unix->date` | Unix timestamp (integer) | `2024-06-15`                    |
| `unix->rfc`  | Unix timestamp (integer) | `Sat, 15 Jun 2024 09:30:00 EDT` |
| `iso->date`  | ISO datetime string      | `2024-06-15`                    |

```json
{ "posted_date": { "from": "created_at", "dateFormat": "unix->date" } }
{
  "posted_datetime_et": {
    "from": "created_at",
    "dateFormat": "unix->iso",
    "timezone": "America/New_York"
  }
}
```
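
The conversions in the table can be emulated with Python's standard library for local verification. This is a sketch, not the engine's code: it assumes UTC when no `timezone` is given and that ISO output carries no UTC offset, matching the output examples above.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

_UNIX_FORMATS = {
    "unix->iso": "%Y-%m-%dT%H:%M:%S",
    "unix->date": "%Y-%m-%d",
    "unix->rfc": "%a, %d %b %Y %H:%M:%S %Z",
}


def format_date(value, date_format, tz_name=None):
    """Convert a timestamp according to a dateFormat value."""
    # Assumption: UTC is used when no timezone is configured
    tz = ZoneInfo(tz_name) if tz_name else timezone.utc
    if date_format in _UNIX_FORMATS:
        moment = datetime.fromtimestamp(int(value), tz)
        return moment.strftime(_UNIX_FORMATS[date_format])
    if date_format == "iso->date":
        return datetime.fromisoformat(value).date().isoformat()
    raise ValueError(f"unsupported dateFormat: {date_format}")
```

For example, `format_date(1718458200, "unix->iso")` yields `2024-06-15T13:30:00` in UTC; passing `tz_name="America/New_York"` shifts the same instant to 09:30:00 local time, provided the system has time zone data installed.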

***

#### Conditional Logic

The `cases` key evaluates a list of conditions and uses the first match.

```json
{
  "seniority": {
    "cases": [
      { "when": { "equals":   ["level", "senior"] }, "const": "Senior" },
      { "when": { "equals":   ["level", "junior"] }, "const": "Junior" },
      { "when": { "exists":   "level"              }, "from":  "level"  }
    ],
    "default": "Mid"
  }
}
```

**Supported `when` operators:**

| Operator      | Syntax                                           | Description                                                        |
| ------------- | ------------------------------------------------ | ------------------------------------------------------------------ |
| `exists`      | `{ "exists": "field.path" }`                     | True if the field is present and non-null                          |
| `equals`      | `{ "equals": ["field.path", value] }`            | True if the field equals the given value                           |
| `contains`    | `{ "contains": ["field.path", "substring"] }`    | True if the string field contains the substring (case-insensitive) |
| `notContains` | `{ "notContains": ["field.path", "substring"] }` | True if the string field does not contain the substring            |
| `equalsField` | `{ "equalsField": ["field.a", "field.b"] }`      | True if two fields have equal values                               |
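
The operators and the first-match rule of `cases` can be sketched together as follows. Several details are assumptions: paths are limited to dotted keys, `notContains` is treated as case-insensitive like `contains`, and a missing or non-string field makes `contains` false and `notContains` true.

```python
def get_path(record, path):
    """Simplified lookup: dotted keys only, None when a key is missing."""
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def evaluate_when(record, condition):
    (operator, argument), = condition.items()  # exactly one operator per condition
    if operator == "exists":
        return get_path(record, argument) is not None
    if operator == "equals":
        field, expected = argument
        return get_path(record, field) == expected
    if operator in ("contains", "notContains"):
        field, needle = argument
        value = get_path(record, field)
        found = isinstance(value, str) and needle.lower() in value.lower()
        return found if operator == "contains" else not found
    if operator == "equalsField":
        left, right = argument
        return get_path(record, left) == get_path(record, right)
    raise ValueError(f"unsupported operator: {operator}")


def resolve_cases(record, spec):
    """Use the first case whose condition holds, else the spec's default."""
    for case in spec["cases"]:
        if evaluate_when(record, case["when"]):
            return case["const"] if "const" in case else get_path(record, case["from"])
    return spec.get("default")
```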

***

#### Example Mapping Spec

```json
{
  "fields": {
    "JobId":          { "from": "id" },
    "JobTitle":       { "from": "title" },
    "JobDescription": { "from": "description" },
    "FullPost":       { "from": "post" },
    "ApplyUrl":       { "from": "url_job" },
    "PostedDate":     { "from": "created_at", "dateFormat": "unix->date" },
    "CompanyName":    { "from": "source.name" },
    "CompanyDomain":  { "from": "source.domain" },
    "IsRemote":       { "from": "remote", "boolean": true },
    "WorkModel":      { "from": "work_model" },
    "EmploymentType": {
      "from": "commitment",
      "valueMap": {
        "full_time":  "Full-Time",
        "part_time":  "Part-Time",
        "contract":   "Contract",
        "internship": "Internship"
      }
    },
    "SeniorityLevel": {
      "from": "level",
      "transform": "title",
      "default": "Not specified"
    },
    "SalaryMin":      { "from": "salary.min",      "number": true, "nullIfZero": true },
    "SalaryMax":      { "from": "salary.max",      "number": true, "nullIfZero": true },
    "SalaryCurrency": { "from": "salary.currency", "transform": "upper" },
    "SalaryPeriod":   { "from": "salary.period" },
    "OnetCode":       { "from": "onet_code" },
    "City":           { "from": "locations[0].city" },
    "State":          { "from": "locations[0].state", "transform": "state_abbr" },
    "Country":        { "from": "locations[0].country", "transform": "country_code" },
    "LocationText":   { "template": "{city}, {state}" },
    "Skills":         {
      "from": "technologies",
      "listMapPrimitive": { "name": { "fromSelf": true } }
    },
    "TrackingUrl": {
      "from": "url_job",
      "setUrlParams": { "utm_source": "jobfront", "utm_medium": "feed" }
    }
  }
}
```

***

### Export Run Metadata

Each export run is recorded with the following fields, accessible via the Jobfront API or metrics endpoints:

| Field                          | Description                                                                           |
| ------------------------------ | ------------------------------------------------------------------------------------- |
| `run_id`                       | Unique identifier for this export run                                                 |
| `organization_id`              | Your organization identifier                                                          |
| `collection_id`                | Identifier of the collection exported                                                 |
| `export_type`                  | Delivery method: `s3_jobfront` or `sftp_customer`                                     |
| `export_data_format`           | Format: `json_jobfront`, `json_aspen`, `xml_jobfront`, `xml_aspen`                    |
| `export_data_type`             | Data tier: `min`, `default`, or `max`                                                 |
| `export_sources_file_fan_type` | File organization: `aggregate_sources` or `single_source`                             |
| `count_total_jobs`             | Total jobs included in this export                                                    |
| `count_total_sources`          | Total employer sources included in this export                                        |
| `export_time_start_at`         | Unix timestamp — when the export run began                                            |
| `export_time_end_at`           | Unix timestamp — when the export run completed                                        |
| `export_time_duration`         | Duration in seconds                                                                   |
| `export_url_file`              | URL to the exported file                                                              |
| `export_url_metrics`           | URL to the per-run metrics JSON                                                       |
| `track_metrics_dictionary`     | Summary metrics: total jobs/sources, most recent job added/updated/removed timestamps |
| `track_qa_dictionary`          | Field fill-rate statistics — what percentage of jobs have each field populated        |
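
To cross-check fill-rate figures against a downloaded file, a comparable statistic can be computed locally. The sketch below is an approximation: the exact layout of `track_qa_dictionary` and its definition of "populated" are not documented here, so a field is counted as populated when it is present and not `None`, an empty string, an empty list, or an empty object.

```python
def fill_rates(jobs, fields):
    """Percentage of jobs with each top-level field populated, to one decimal."""
    total = len(jobs)
    rates = {}
    for field in fields:
        filled = sum(1 for job in jobs if job.get(field) not in (None, "", [], {}))
        rates[field] = round(100.0 * filled / total, 1) if total else 0.0
    return rates
```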

***

### Location Filtering

Exports can be filtered by geography. Location filters are configured per collection and applied before export:

| Filter                          | Description                                                                                      |
| ------------------------------- | ------------------------------------------------------------------------------------------------ |
| `country`                       | Limit to jobs in a specific country (default: United States)                                     |
| `state`                         | Limit to jobs in a specific US state                                                             |
| `city`                          | Limit to jobs in a specific city                                                                 |
| `remote`                        | Include or exclude remote jobs                                                                   |
| `allow_remote_without_location` | When `true`, remote jobs with no location data are included regardless of other location filters |

Remote jobs (`remote` is `true`) pass through country/state/city filters by default: they are not excluded by geography unless `remote: false` is explicitly set.


