Data Processing & Cleaning

Job posts are formatted differently across the many sources we scrape and process.

We run a series of data cleaning and normalization processes that transform messy, inconsistent job postings into beautifully formatted, standardized HTML content. We also extract, structure, and enrich every job post with additional data to make it easier to use our dataset.

Technically, we process all jobs through multiple large language models as part of our cleaning processes, in addition to more traditional data processing. As a result, we have some flexibility to accommodate additional processing requests. Please reach out to us if there are more fields or formats you would like to see.


Job Post HTML

Here is our process for standardizing job post HTML.

What We Clean and Standardize

  1. Whitespace and Formatting:

    • Collapses multiple spaces into single spaces.

    • Removes unnecessary spaces around punctuation.

    • Cleans up empty tags or redundant formatting elements.

  2. Inline Tags:

    • Safely removes unnecessary inline tags (such as <i>, <span>, <u>, <em>) while keeping the text they contain.

    • Preserves essential formatting tags (<strong>), ensuring text emphasis remains clear.

  3. Links and Buttons:

    • Converts hyperlinks (<a> tags) to plain text, removing external navigation to keep readers focused and on site.

    • Completely removes all buttons, which typically clutter job postings with unnecessary interactions.

  4. Hidden and Irrelevant Elements:

    • Eliminates elements marked as hidden or invisible, ensuring only meaningful content remains visible.

    • Removes elements like scripts, videos, audio, iframes, and images to maintain clean, job-focused descriptions.

  5. HTML Comments and Attributes:

    • Strips out all HTML comments, inline styles, and unnecessary attributes (like classes, IDs, and event handlers), reducing clutter and potential formatting inconsistencies.

  6. Special Characters and Entities:

    • Attempts to replace HTML special entities (e.g., &nbsp;, &amp;) with standard ASCII characters to ensure clarity and compatibility across platforms. (Note: If data is delivered via XML, we also encode content when assembling XML feeds, so HTML special entities may still appear in your feed. Contact us if you need clarification.)

  7. Lists and Formatting:

    • Ensures numbered and bulleted lists are cleanly formatted for easy readability.

  8. Headers:

    1. We convert <h1> tags into <h2> (removing all <h1> tags from the HTML output).

    2. We also attempt to Title Case all headers (removing ALL CAPS where possible, but preserving acronyms where we can).
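A few of the text-level steps above (entity replacement, whitespace collapsing, header demotion, and title casing) can be sketched with the Python standard library. This is only an illustration, not our production pipeline: the function names and the KNOWN_ACRONYMS set are hypothetical, and the title casing here is deliberately simple.

```python
import html
import re

# Hypothetical acronym list for illustration; real acronym handling is more involved.
KNOWN_ACRONYMS = {"HR", "IT", "USA", "RN", "CEO"}


def normalize_text(text: str) -> str:
    """Decode HTML entities, collapse whitespace, and tidy spaces before punctuation."""
    text = html.unescape(text)                    # "&amp;" -> "&", "&nbsp;" -> "\xa0"
    text = text.replace("\xa0", " ")              # non-breaking space -> regular space
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)  # drop spaces before punctuation
    return text.strip()


def title_case_header(header: str) -> str:
    """Capitalize each word, keeping known acronyms in upper case."""
    words = []
    for word in header.split():
        if word.upper() in KNOWN_ACRONYMS:
            words.append(word.upper())
        else:
            words.append(word.capitalize())
    return " ".join(words)


def demote_h1(fragment: str) -> str:
    """Rewrite <h1> opening and closing tags as <h2>, dropping any attributes."""
    return re.sub(r"<(/?)h1\b[^>]*>", r"<\1h2>", fragment, flags=re.IGNORECASE)


print(normalize_text("Salary&nbsp;&amp;  benefits , explained ."))
print(title_case_header("ABOUT THE HR TEAM"))
print(demote_h1('<h1 class="title">Nurse</h1>'))
```

The regular expressions only cover the simple cases named in the list; handling nested markup reliably would call for a real HTML parser.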

HTML Tags Preserved

Our processing preserves only the following tags

All other HTML tags are removed (for example <span>, <a>, <u>, <h1>, <h4>, <h5>, etc.).

We also wrap the final job posting in a <div> to create a consistent formatting container.
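As a rough illustration of allowlist-based tag filtering, here is a minimal sketch using Python's built-in html.parser. The ALLOWED_TAGS set is an assumption pieced together from tags mentioned elsewhere on this page (the list tags in particular are a guess), and the class and function names are hypothetical. It is not the actual implementation.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist: <strong>, <h2>, <h3>, and <p> are named on this page; list tags are a guess.
ALLOWED_TAGS = {"p", "h2", "h3", "strong", "ul", "ol", "li"}
# Elements dropped together with everything inside them.
DROPPED_ELEMENTS = {"script", "style", "button", "iframe", "video", "audio"}
# Tags that are renamed rather than removed.
RENAMED_TAGS = {"h1": "h2"}


class AllowlistCleaner(HTMLParser):
    """Keeps allowlisted tags (without attributes) and the text of everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_ELEMENTS:
            self.skip_depth += 1
            return
        tag = RENAMED_TAGS.get(tag, tag)
        if self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are intentionally discarded

    def handle_endtag(self, tag):
        if tag in DROPPED_ELEMENTS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        tag = RENAMED_TAGS.get(tag, tag)
        if self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape so the output stays valid HTML (a choice made for this sketch).
            self.parts.append(escape(data, quote=False))


def clean_html(raw: str) -> str:
    """Filter raw HTML through the allowlist and wrap the result in a <div>."""
    cleaner = AllowlistCleaner()
    cleaner.feed(raw)
    cleaner.close()
    return "<div>" + "".join(cleaner.parts) + "</div>"


raw = '<h1 class="t">Nurse</h1><p style="x">Apply <a href="#">here</a></p><button>Go</button>'
print(clean_html(raw))
```

HTML comments are dropped because the parser's default comment handler does nothing, and tags outside the allowlist (such as <a> or <span>) lose their markup but keep their text.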

A Note On HTML Formatting

We have prioritized making our job post content consistent and 'boring'.

But we have also run some tests to re-generate job post HTML as if it is an advertisement, using more exciting language that appears more like marketing content for the company. On our own job boards, we have seen that this can increase engagement rates. Please let us know if you'd like to experiment with different content formats.

Detailed HTML Structure

The structure of each job post should contain roughly the following:

A few notes:

  • We remove all <h1> headers. We recommend adding your own <h1> tag to the web page itself, which gives you more freedom than having us embed the important <h1> tag within the job post data.

  • We do keep <h2> and <h3> tags in the job post. Some customers prefer to remove all header tags, and you can easily replace them with <strong> headers if you would like.

  • We do not allow <h4> or <h5> tags in our cleaned data.

  • We also remove all italics and underlines.

  • Each job always has an <h2> header at the top. It is usually the job title, but sometimes it is something more generic or the name of the organization. This <h2> header is generated with AI as a quick summary of the job post. You can strip it out if you'd like (it is always the first <h2> tag).

  • Each job post is wrapped in a <div>, but we use <p> paragraphs throughout to create vertical line separation.
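If you want to apply the two optional adjustments mentioned in these notes on your side (dropping the leading <h2> and swapping headers for <strong>), a small sketch might look like the following. It assumes the cleaned HTML contains attribute-free <h2> and <h3> tags, as described above; the function names are hypothetical.

```python
import re


def strip_leading_h2(job_html: str) -> str:
    """Remove the first <h2>...</h2> element (the AI-generated summary header)."""
    return re.sub(r"<h2>.*?</h2>\s*", "", job_html, count=1, flags=re.DOTALL)


def headers_to_strong(job_html: str) -> str:
    """Replace each <h2>/<h3> header with a <p><strong>...</strong></p> paragraph."""
    return re.sub(
        r"<h([23])>(.*?)</h\1>",
        r"<p><strong>\2</strong></p>",
        job_html,
        flags=re.DOTALL,
    )


sample = (
    "<div><h2>Flight Instructor</h2><p>Intro</p>"
    "<h3>Requirements</h3><p>CFI certificate</p></div>"
)
print(headers_to_strong(strip_leading_h2(sample)))
```

Regular expressions are workable here only because the cleaned output has no attributes on header tags; for arbitrary HTML, an HTML parser is the safer choice.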

Examples

Here are a few examples of job posts before and after our processing:

Carnegie Mellon University - Before

Carnegie Mellon University - After

Flight Instructor - Before

Flight Instructor - After


Structured, Extracted Data

Locations

In addition to extracting the raw location from the page, we run a series of more complex processing and enrichment steps to provide the following structured information:

  • City, State, Country

  • Zip code

  • CBSA

Job Titles + ONET6

In addition to job titles, we extract ONET6 codes using large language models.

Compensation

We extract the minimum and maximum compensation, payment frequency, and currency using large language models.

You can see the actual delivered data structure for these extracted fields either in the API tab, or in the XML tab in our documentation.
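To give a feel for how a consumer might model the extracted compensation fields, here is a minimal sketch. Every field name below is a hypothetical placeholder, not the delivered schema; please refer to the API and XML tabs for the real structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Compensation:
    # All field names are hypothetical placeholders, not the delivered schema.
    min_amount: Optional[float]
    max_amount: Optional[float]
    frequency: Optional[str]  # e.g. "hourly" or "yearly" (assumed values)
    currency: Optional[str]   # e.g. "USD"


def describe(comp: Compensation) -> str:
    """Render a compensation record as a short display string."""
    if comp.min_amount is None and comp.max_amount is None:
        return "Compensation not listed"
    # Fall back to whichever bound is present when only one was extracted.
    low = comp.min_amount if comp.min_amount is not None else comp.max_amount
    high = comp.max_amount if comp.max_amount is not None else comp.min_amount
    amount = f"{low:,.0f}" if low == high else f"{low:,.0f}-{high:,.0f}"
    suffix = f" {comp.currency}" if comp.currency else ""
    per = f" ({comp.frequency})" if comp.frequency else ""
    return amount + suffix + per


print(describe(Compensation(50000, 65000, "yearly", "USD")))
```

Treating every field as optional reflects the fact that many job posts omit some or all compensation details.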
