# Jobs Data - HTML Processing

Job post HTML is formatted differently across employer careers pages and applicant tracking systems (ATSs). Historically, this has resulted in messy and inconsistent job posting HTML.

Enter JobFront. JobFront runs a series of data cleaning and normalization processes that transform messy, inconsistent job postings into beautifully formatted, standardized HTML content. We also extract, structure, and enrich every job post with additional data to make our dataset easier to use.

Technically, we process all jobs through multiple large language models as part of our cleaning processes, in addition to more traditional ML-based data processing. As a result, we have some flexibility to accommodate additional processing requests. Please reach out to us if there are more fields or formats you would like to see.

***

## Job Description Data

Job postings in JobFront are stored in multiple forms, from verbatim scraped HTML through progressively cleaned and AI-enriched representations. This document describes each form, how the data is cleaned at scrape time, and the options available to customers when exporting that data.

***

### Description Data Types

There are four distinct representations of a job's description content.

#### 1. Raw Scraped HTML

The verbatim HTML captured directly from the employer's job posting page. This field contains raw HTML including site chrome, navigation, scripts, and inline styles, exactly as the scraper received it. This field is the source-of-truth copy.

**Use when:** You need the original, unmodified source content.

**Example**

```html

Senior Software Engineer

San Francisco, CA • Full-time

We’re looking for a Senior Software Engineer to join our growing platform team. You will own the design and delivery of critical backend services.

What You’ll Do

Design and implement scalable backend services
Mentor junior engineers and run code reviews
Partner with product to define technical roadmap

What We’re Looking For

5+ years of experience with Python or Go
Strong knowledge of distributed systems
Experience with AWS or GCP

```

***

#### 2. AI-Cleaned and Standardized Job Description HTML

The job description HTML after it has been run through the `clean_html_with_ai()` pipeline. This is the canonical description field for display and delivery.

The cleaning pipeline (see Scrape-time Cleaning below) strips all non-content elements and reformats the content using a restricted set of safe tags:

```

```

**Use when:** Displaying the job description to end users or feeding it to downstream systems.

**Example**

```html

Senior Software Engineer

We're looking for a Senior Software Engineer to join our growing platform team. You will own the design and delivery of critical backend services.

What You'll Do

Design and implement scalable backend services

Mentor junior engineers and run code reviews

Partner with product to define technical roadmap

What We're Looking For

5+ years of experience with Python or Go

Strong knowledge of distributed systems

Experience with AWS or GCP

```

**What changed:**

* Navigation, footer, apply button, and tracking script removed
* All inline styles, class names, IDs, and aria attributes stripped
* HTML entities decoded (`&rsquo;` → `'`)
* `` and `` tags removed (text preserved)
* `` converted to ``
* Content wrapped in a single ``

***

#### 3. One-Line Summary

A short, single-sentence description of the role, generated by AI at processing time. Designed for use in search result snippets, preview cards, notifications, and anywhere a concise label for the role is needed.

**Use when:** You need a brief, human-readable summary: search indexes, list views, notification copy.

**Example**

> Senior backend engineering role at Acme Corp working on a new Growth team.

***

#### 4. S3 Full-Page HTML: Complete Archived Page

The complete rendered HTML of the entire job posting page is archived to AWS S3 at scrape time:

```
s3://
```

**Use when:** Debugging a scrape, inspecting the original page layout, or re-extracting structured data from the source.

***

### Scrape-time Cleaning (how the #2 cleaning above works)

When JobFront cleans HTML, it applies the following steps:

1. **Encoding repair**: Character encoding issues are corrected.

| Before                | After               |
| --------------------- | ------------------- |
| `Lattice‚Äôs culture` | `Lattice's culture` |
| `We’re hiring`        | `We're hiring`      |

2. **Hidden element removal**: Elements with `display:none`, `visibility:hidden`, or common hiding classes (`d-none`, `sr-only`, etc.) are dropped.

```html
Screen reader text
Internal tracking placeholder

Visible job content.

Visible job content.
```

3. **Dangerous/irrelevant tag stripping**: Dangerous or irrelevant tags are removed.

```html
Join our team of engineers.

Join our team of engineers.
```

4. **Nav/chrome removal**: ``, ``, ``, ``, and elements whose class or id contains navigation hints are removed, unless they contain main content.

```html
Home › Jobs

We're building the future of logistics software.

© 2024 Acme Corp

We're building the future of logistics software.

```

5. **Attribute stripping**: All attributes are removed except `href` on `<a>` tags.

```html
Join our backend team.

Join our backend team.
```

6. **Empty element pruning**: Empty ``, ``, ``, and similar tags are removed.

```html
We offer competitive compensation.

Health insurance

401(k) matching

We offer competitive compensation.

Health insurance

401(k) matching

```

7. **LLM formatting pass**: The cleaned HTML is sent to an LLM. The model reformats the content into the restricted safe-tag set, normalizes heading levels, and removes residual noise. This is the step that converts `` to ``, title-cases headers, unwraps ``/``/`` tags, and produces the final ``-wrapped output.

```html
SENIOR SOFTWARE ENGINEER

Join our team. We're growing fast.

Requirements

5+ years Python

Strong communication skills

Senior Software Engineer

Join our team. We're growing fast.

Requirements

5+ years Python

Strong communication skills

```

***

### Export-time Cleaning

When delivering job data via XML, SFTP, or JSON export, an additional `clean_html()` pass can be applied to the description field at transmission time. This is separate from the scrape-time cleaning and is applied on the fly to whatever HTML value is being exported.

Export configs support the following options:

| Option                               | Default | Effect |
| ------------------------------------ | ------- | ------ |
| `clean_html`                         | `False` | Run the full structural HTML cleaning pipeline before export. Strips comments, dangerous tags, nav chrome, and unknown wrapper elements. Normalizes whitespace. |
| `clean_html_remove_links`            | `False` | Remove all `<a>` tags, preserving their visible text. |
| `clean_html_remove_contact_info`     | `False` | Redact emails, phone numbers, LinkedIn profile URLs, and social handles in visible text. Remove paragraphs that consist primarily of contact cues (e.g. "Contact us at \[redacted email]"). Also remove `mailto:` and `tel:` links structurally. |
| `clean_html_remove_prompt_injection` | `False` | Remove blocks that appear to be hidden LLM instruction injections (e.g. invisible text, zero-width characters used to smuggle instructions). |

These options are composable: any combination can be enabled. However, `clean_html` must be `True` for the other three options to have any effect.

#### What `clean_html` does structurally

When `clean_html` is enabled, the following transformations are applied to the HTML before it leaves the system:

1. **Plain-text normalization**: If the input contains no HTML tags, it is converted to structured HTML: blank lines are treated as paragraph breaks, line breaks become `<br>`, and content is wrapped in `<p>` tags.

```
Senior Software Engineer at Acme Corp.

We're looking for a backend engineer.
Requirements:
5+ years Python
Strong communication skills
Senior Software Engineer at Acme Corp.

We're looking for a backend engineer.
Requirements:
5+ years Python
Strong communication skills
```

2. **Comment removal**: HTML comments are stripped.

```html
We offer competitive salaries.

We offer competitive salaries .
```

3. **Dangerous tag removal**: `