Jobs Data - HTML Processing

Job post HTML is formatted differently across employer careers pages and applicant tracking systems (ATSs). Historically this has resulted in messy and inconsistent job posting HTML.

Enter JobFront.

JobFront runs a series of data cleaning and normalization processes that transform messy, inconsistent job postings into beautifully formatted, standardized HTML content. We also extract, structure, and enrich every job post with additional data to make it easier to use our dataset.

Technically, we process all jobs through multiple large language models as part of our cleaning processes, in addition to more traditional ML-based data processing. As a result, we have some flexibility to accommodate additional processing requests. Please reach out to us if there are more fields or formats you would like to see.


Job Description Data

Job postings in JobFront are stored in multiple forms — from verbatim scraped HTML through progressively cleaned and AI-enriched representations. This document describes each form, how the data is cleaned at scrape time, and the options available to customers when exporting that data.


Description Data Types

There are four distinct representations of a job's description content.

1. Raw Scraped HTML

The verbatim HTML captured directly from the employer's job posting page. This field contains raw HTML including site chrome, navigation, scripts, and inline styles — exactly as the scraper received it.

This field is the source-of-truth copy.

Use when: You need the original, unmodified source content.

Example

<!-- raw HTML exactly as scraped -->
<div id="job-container" class="jd-wrapper col-md-8 offset-md-2" style="padding:20px; font-family:Arial,sans-serif;">
  <nav class="site-nav primary-navigation">
    <a href="/">Home</a><a href="/jobs">Jobs</a> › Senior Software Engineer
  </nav>
  <script>window.dataLayer=window.dataLayer||[];dataLayer.push({'job_id':'83921'});</script>
  <!-- Job Description Start -->
  <h1 class="job-title" style="color:#333; font-size:28px;">Senior Software Engineer</h1>
  <div class="job-meta" style="color:#666; margin-bottom:15px;">
    <span class="location">San Francisco, CA</span>
    <span class="type">Full-time</span>
  </div>
  <div class="job-body" id="jd-content" aria-label="job description">
    <p style="margin:10px 0; line-height:1.6; color:#444;">
      We&#8217;re looking for a <strong>Senior Software Engineer</strong> to join our growing platform team.
      You will <u>own</u> the design and delivery of <em>critical</em> backend services.
    </p>
    <h2 style="font-size:20px; color:#222; margin-top:20px;">What You&#8217;ll Do</h2>
    <ul class="jd-list" style="padding-left:20px;">
      <li>Design and implement scalable backend services</li>
      <li>Mentor junior engineers and run code reviews</li>
      <li>Partner with product to define technical roadmap</li>
    </ul>
    <h2 style="font-size:20px;">What We&#8217;re Looking For</h2>
    <ul>
      <li>5+ years of experience with Python or Go</li>
      <li>Strong knowledge of distributed systems</li>
      <li>Experience with AWS or GCP</li>
    </ul>
  </div>
</div>

2. AI-Cleaned and Standardized Job Description HTML

The job description HTML after it has been run through the clean_html_with_ai() pipeline. This is the canonical description field for display and delivery.

The cleaning pipeline (see Scrape-time Cleaning below) strips all non-content elements and re-formats the content using a restricted set of safe tags.

Use when: Displaying the job description to end users or feeding it to downstream systems.

Example

What changed:

  • Navigation, footer, apply button, and tracking script removed

  • All inline styles, class names, IDs, and aria attributes stripped

  • HTML entities decoded (&#8217; becomes ')

  • <u> and <em> tags removed (text preserved)

  • <h1> converted to <h2>

  • Content wrapped in a single <div>


3. One-Line Summary

A short, single-sentence description of the role, generated by AI at processing time. Designed for use in search result snippets, preview cards, notifications, and anywhere a concise label for the role is needed.

Use when: You need a brief, human-readable summary — search indexes, list views, notification copy.

Example

Senior backend engineering role at Acme Corp working on a new Growth team.


4. S3 Full-Page HTML — Complete Archived Page

The complete rendered HTML of the entire job posting page is archived to AWS S3 at scrape time.

Use when: Debugging a scrape, inspecting the original page layout, or re-extracting structured data from the source.


Scrape-time Cleaning (how the cleaning in #2 above works)

JobFront cleans HTML in the following steps:

  1. Encoding repair: Character encoding issues are corrected.

    Before | After
    Lattice’s culture | Lattice's culture
    We&#8217;re hiring | We're hiring

  2. Hidden element removal: Elements with display:none, visibility:hidden, or common hiding classes (d-none, sr-only, etc.) are dropped.

  3. Dangerous/irrelevant tag stripping: <script>, <style>, <iframe>, <img>, <audio>, <video>, <form>, <button>, and similar tags are removed entirely.

  4. Nav/chrome removal: <header>, <footer>, <nav>, <aside>, and elements whose class or id contains navigation hints are removed, unless they contain main content.

  5. Attribute stripping: All attributes are removed except href on <a> tags.

  6. Empty element pruning: Empty <p>, <div>, <li>, and similar tags are removed.

  7. LLM formatting pass: The cleaned HTML is sent to an LLM. The model re-formats the content into the restricted safe-tag set, normalizes heading levels, and removes residual noise. This is the step that converts <h1> to <h2>, title-cases headers, unwraps <span>/<u>/<em> tags, and produces the final <div>-wrapped output.
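
The deterministic parts of steps 1, 3, 5, and 6 can be sketched with the Python standard library. This is an illustration under stated assumptions, not JobFront's implementation: the names SketchCleaner and sketch_clean are hypothetical, the tag sets are limited to the tags named above, and hidden-element removal, nav/chrome removal, and the LLM pass are left out.

```python
import html
import re
from html.parser import HTMLParser

# Assumed tag sets, taken from the tags named in step 3; the real lists may differ.
DROP_WITH_CONTENT = {"script", "style", "iframe", "audio", "video", "form", "button"}
DROP_VOID = {"img"}
EMPTY_ELEMENT = re.compile(r"<(p|div|li)>\s*</\1>")


class SketchCleaner(HTMLParser):
    """Hypothetical cleaner: drops unwanted tags and every attribute except href on <a>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#8217; before handle_data runs.
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_VOID:
            return
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        href = dict(attrs).get("href") if tag == "a" else None
        self.out.append(f'<a href="{html.escape(href)}">' if href else f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept without attributes.
        if tag not in DROP_VOID and tag not in DROP_WITH_CONTENT and not self.skip_depth:
            self.out.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag in DROP_VOID:
            return
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data, quote=False))


def sketch_clean(raw: str) -> str:
    parser = SketchCleaner()
    parser.feed(raw)
    parser.close()
    cleaned = "".join(parser.out)
    # Prune empty elements repeatedly, since removing one can empty its parent.
    while True:
        pruned = EMPTY_ELEMENT.sub("", cleaned)
        if pruned == cleaned:
            return cleaned
        cleaned = pruned
```

HTML comments disappear as a side effect, because the parser's default comment handler emits nothing. A production pipeline would more likely work on a parsed tree than on a token stream, which makes steps 2 and 4 much easier to express.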


Export-time Cleaning

When delivering job data via XML, SFTP, or JSON export, an additional clean_html() pass can be applied to the description field at transmission time. This is separate from the scrape-time cleaning and is applied on-the-fly to whatever HTML value is being exported.

Export configs support the following options:

Option | Default | Effect
clean_html | False | Run the full structural HTML cleaning pipeline before export. Strips comments, dangerous tags, nav chrome, and unknown wrapper elements. Normalizes whitespace.
clean_html_remove_links | False | Remove all <a> tags, preserving their visible text.
clean_html_remove_contact_info | False | Redact emails, phone numbers, LinkedIn profile URLs, and social handles in visible text. Removes paragraphs that consist primarily of contact cues (e.g. "Contact us at [redacted email]"). Also removes mailto: and tel: links structurally.
clean_html_remove_prompt_injection | False | Remove blocks that appear to be hidden LLM instruction injections (e.g. invisible text, zero-width characters used to smuggle instructions).

These options are composable: the three clean_html_remove_* options can be enabled in any combination, but clean_html must be True for any of them to have an effect.
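
The dependency on clean_html is easy to miss in a config, so a small check can help. The sketch below assumes an export config is modeled as a Python dataclass; the option names and defaults come from the table above, while the class ExportCleaningOptions and the method inert_options are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ExportCleaningOptions:
    """Hypothetical container for the four export-time cleaning options."""

    clean_html: bool = False
    clean_html_remove_links: bool = False
    clean_html_remove_contact_info: bool = False
    clean_html_remove_prompt_injection: bool = False

    def inert_options(self) -> List[str]:
        """Return enabled sub-options that do nothing because clean_html is False."""
        if self.clean_html:
            return []
        return [
            f.name
            for f in fields(self)
            if f.name != "clean_html" and getattr(self, f.name)
        ]
```

For example, ExportCleaningOptions(clean_html_remove_links=True).inert_options() reports clean_html_remove_links, which signals that the config needs clean_html=True as well.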

What clean_html does structurally

When clean_html is enabled, the following transformations are applied to the HTML before it leaves the system:

  1. Plain-text normalization: If the input contains no HTML tags, it is converted to structured HTML. Blank lines separate paragraphs, each paragraph is wrapped in <p> tags, and line breaks become <br/>.

  2. Comment removal: HTML comments are stripped.

  3. Dangerous tag removal: <script>, <style>, <noscript>, <iframe>, <svg>, <canvas>, <form>, <input>, <button>, <link>, and <meta> are removed.

  4. Nav/chrome removal: <header>, <footer>, <nav>, <aside>, and elements whose class or id suggests navigation are removed unless they contain main content.

  5. Attribute stripping and tag unwrapping: Known safe tags are kept with all attributes removed except href on <a>. All other tags are unwrapped (text preserved, wrapper removed).

  6. Loose text node wrapping: Top-level text nodes are wrapped in <p> tags.

  7. Whitespace normalization: Inter-tag whitespace is collapsed; multiple consecutive spaces are reduced to one.
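
Steps 1 and 7 are simple enough to sketch directly. The function names below are hypothetical, and the sketch assumes that "collapsed" inter-tag whitespace means removed outright; the actual behavior may instead keep a single space.

```python
import html
import re

ANY_TAG = re.compile(r"<[a-zA-Z/][^>]*>")


def plain_text_to_html(text: str) -> str:
    """Step 1 sketch: wrap tag-free input in <p> blocks with <br/> line breaks."""
    if ANY_TAG.search(text):
        return text  # already contains HTML, so leave it for the later steps
    parts = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        lines = [
            html.escape(line.strip(), quote=False)
            for line in paragraph.splitlines()
            if line.strip()
        ]
        if lines:
            parts.append("<p>" + "<br/>".join(lines) + "</p>")
    return "".join(parts)


def normalize_whitespace(markup: str) -> str:
    """Step 7 sketch: drop whitespace between tags, then squeeze runs of spaces."""
    # Assumption: whitespace between tags is removed entirely. This would join
    # words across adjacent inline tags, so a real pass may be more selective.
    markup = re.sub(r">\s+<", "><", markup)
    return re.sub(r" {2,}", " ", markup)
```

Given the input "Line one\nLine two\n\nSecond para", plain_text_to_html returns two <p> blocks, the first containing a <br/> between its two lines.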


clean_html_remove_links

Strips all <a> tags while preserving their visible text. Useful when delivering content to platforms where external links are not allowed or could be misleading.
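
As a rough illustration of unwrapping links, the sketch below removes opening and closing <a> tags with a regular expression and leaves the text between them untouched. The function name remove_links is hypothetical, and a regex is an assumption made for brevity; a tree-based unwrap is more robust against unusual attribute values.

```python
import re

# Matches <a>, <a ...>, and </a>, but not other tags that start with "a"
# such as <abbr> or <aside>.
A_TAG = re.compile(r"</?a(?:\s[^>]*)?>", re.IGNORECASE)


def remove_links(markup: str) -> str:
    """Drop <a> wrappers while keeping their visible text."""
    return A_TAG.sub("", markup)
```

Applied to '<p>Apply <a href="https://example.com/apply">on our site</a> today.</p>', this leaves '<p>Apply on our site today.</p>'.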


clean_html_remove_contact_info

Redacts personally identifiable contact information from visible text and removes structural contact links (mailto:, tel:). Also removes paragraphs or list items that consist primarily of contact-cue language after redaction.

Note: The mailto: paragraph was removed entirely because after redaction it consisted only of a contact cue. The closing sentence was preserved because it contains no contact information.
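
The redaction behavior can be approximated as follows. This is a simplified sketch, not the production logic: the patterns, the cue-word list, the "[redacted phone]" placeholder, and the function name redact_contact_info are all assumptions (only "[redacted email]" appears in the option table), and LinkedIn URLs and social handles are not covered. It also assumes <p> tags carry no attributes, as they would after clean_html.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
CONTACT_LINK = re.compile(
    r'<a\s[^>]*href="(?:mailto|tel):[^"]*"[^>]*>(.*?)</a>', re.I | re.S
)
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.S)
PLACEHOLDER = re.compile(r"\[redacted \w+\]")
CUE_WORDS = re.compile(r"\b(?:contact|email|e-mail|call|phone|reach|us|at|or|on)\b", re.I)


def redact_contact_info(markup: str) -> str:
    # 1. Unwrap mailto:/tel: links, keeping their visible text for redaction.
    markup = CONTACT_LINK.sub(r"\1", markup)
    # 2. Replace emails and phone numbers in the text with placeholders.
    markup = EMAIL.sub("[redacted email]", markup)
    markup = PHONE.sub("[redacted phone]", markup)

    # 3. Drop paragraphs that hold only cue words and placeholders.
    def keep_or_drop(match):
        body = match.group(1)
        if not PLACEHOLDER.search(body):
            return match.group(0)
        residue = CUE_WORDS.sub("", PLACEHOLDER.sub("", body))
        return match.group(0) if re.search(r"\w", residue) else ""

    return PARAGRAPH.sub(keep_or_drop, markup)
```

With this sketch, a paragraph such as "Contact us at" followed by a mailto: link is dropped entirely, while a following paragraph with no contact details passes through unchanged.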


clean_html_remove_prompt_injection

Removes blocks that appear designed to inject hidden instructions into AI/LLM systems consuming the job description. These typically appear as invisible elements, zero-width characters, or off-screen text.
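
A minimal sketch of this kind of filter is shown below. It is an assumption-laden illustration: the function name strip_hidden_instructions and the list of hiding styles are hypothetical, the input is assumed to still carry style attributes, and zero-width characters are simply deleted rather than triggering removal of the surrounding block. It does not handle nested elements of the same tag.

```python
import re

# Zero-width space, non-joiner, joiner, word joiner, and byte-order mark.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

# Elements whose inline style hides them or moves them far off-screen.
HIDDEN_BLOCK = re.compile(
    r'<(\w+)\b[^>]*style="[^"]*'
    r"(?:display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0|left\s*:\s*-\d{3,}px)"
    r'[^"]*"[^>]*>.*?</\1>',
    re.I | re.S,
)


def strip_hidden_instructions(markup: str) -> str:
    """Remove visually hidden blocks, then delete zero-width characters."""
    markup = HIDDEN_BLOCK.sub("", markup)
    return ZERO_WIDTH.sub("", markup)
```

Pattern matching like this only catches the presentation tricks it knows about, so it is best treated as one layer of defense for consumers that pass descriptions to an LLM.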
