Data Processing & Cleaning
Job posts are formatted differently across the many sources we scrape and process.
We run a series of data cleaning and normalization processes that transform messy, inconsistent job postings into beautifully formatted, standardized HTML content. We also extract, structure, and enrich every job post with additional data to make it easier to use our dataset.
Technically, we process all jobs through multiple large language models as part of our cleaning pipeline, in addition to more traditional data processing. As a result, we have some flexibility if you have additional processing requests. Please reach out to us if there are additional fields or formats you would like to see.
Job Post HTML
Here is our process for standardizing job post HTML.
What We Clean and Standardize
Whitespace and Formatting:
Collapses multiple spaces into single spaces.
Removes unnecessary spaces around punctuation.
Cleans up empty tags or redundant formatting elements.
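For illustration, whitespace normalization like this can be sketched with a few regular expressions. This is a simplified sketch, not our production pipeline:

```python
import re

def normalize_whitespace(html_text: str) -> str:
    """Collapse whitespace runs and tidy spacing around punctuation."""
    # Collapse runs of spaces/tabs into a single space.
    text = re.sub(r"[ \t]+", " ", html_text)
    # Remove stray spaces that appear just before punctuation.
    text = re.sub(r" +([,.;:!?])", r"\1", text)
    # Drop empty paragraph tags left behind by earlier cleanup steps.
    text = re.sub(r"<p>\s*</p>", "", text)
    return text.strip()
```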
Inline Tags:
Safely removes unnecessary tags (such as <i>, <span>, <u>, <em>) and replaces them with their plain text content.
Preserves essential formatting tags (such as <strong>), ensuring text emphasis remains clear.
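A rough sketch of this kind of tag unwrapping (a naive regex approach for illustration only; real HTML cleaning needs a proper parser):

```python
import re

# Tags whose wrappers we drop while keeping the enclosed text.
UNWRAP_TAGS = ("i", "span", "u", "em")

def unwrap_inline_tags(html_text: str) -> str:
    for tag in UNWRAP_TAGS:
        # Remove opening tags (with any attributes) and matching closing
        # tags, leaving the inner text in place. <strong> is untouched.
        html_text = re.sub(rf"<{tag}(\s[^>]*)?>", "", html_text)
        html_text = re.sub(rf"</{tag}>", "", html_text)
    return html_text
```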
Links and Buttons:
Converts hyperlinks (<a> tags) to plain text, removing external navigation to keep readers focused and on site.
Completely removes all buttons, which typically clutter job postings with unnecessary interactions.
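This step might look roughly like the following (a simplified sketch, not the exact production logic):

```python
import re

def strip_links_and_buttons(html_text: str) -> str:
    # Replace <a ...>text</a> with just the link text.
    html_text = re.sub(r"<a\b[^>]*>(.*?)</a>", r"\1", html_text, flags=re.S)
    # Remove <button ...>...</button> elements entirely, label included.
    html_text = re.sub(r"<button\b[^>]*>.*?</button>", "", html_text, flags=re.S)
    return html_text
```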
Hidden and Irrelevant Elements:
Eliminates elements marked as hidden or invisible, ensuring only meaningful content remains visible.
Removes elements like scripts, videos, audio, iframes, and images to maintain clean, job-focused descriptions.
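As an illustrative sketch of removing these elements (regexes simplified for readability; the element list mirrors the one above):

```python
import re

# Elements that never belong in a clean job description.
REMOVE_WHOLE = ("script", "video", "audio", "iframe")

def drop_irrelevant_elements(html_text: str) -> str:
    for tag in REMOVE_WHOLE:
        # Delete the element and everything inside it.
        html_text = re.sub(rf"<{tag}\b[^>]*>.*?</{tag}>", "", html_text, flags=re.S)
    # Void media tags such as <img ...> have no closing tag.
    html_text = re.sub(r"<img\b[^>]*/?>", "", html_text)
    # Elements explicitly hidden inline, e.g. style="display:none".
    html_text = re.sub(
        r"<(\w+)\b[^>]*display\s*:\s*none[^>]*>.*?</\1>", "", html_text, flags=re.S
    )
    return html_text
```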
HTML Comments and Attributes:
Strips out all HTML comments, inline styles, unnecessary attributes (like classes, IDs, and event handlers), reducing clutter and potential formatting inconsistencies.
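A minimal sketch of comment and attribute stripping (illustrative only; it keeps each bare tag name and discards everything else inside the tag):

```python
import re

def strip_comments_and_attributes(html_text: str) -> str:
    # Remove HTML comments.
    html_text = re.sub(r"<!--.*?-->", "", html_text, flags=re.S)
    # Drop every attribute (class, id, style, on* handlers, ...)
    # by keeping only the bare tag name.
    html_text = re.sub(r"<(\w+)\s[^>]*>", r"<\1>", html_text)
    return html_text
```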
Special Characters and Entities:
Attempts to replace HTML special entities (e.g., &nbsp;, &amp;) with standard ASCII characters to ensure clarity and compatibility across platforms. (Note: if data is delivered via XML, we also encode content when assembling XML feeds, so you may notice that HTML special entities still appear in your feed. Connect with us for more clarification!)
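In Python, this kind of entity decoding can be sketched with the standard library's `html.unescape` (a simplified illustration of the idea):

```python
import html

def decode_entities(text: str) -> str:
    # html.unescape turns entities like &amp; and &nbsp; back into characters.
    decoded = html.unescape(text)
    # &nbsp; decodes to U+00A0; fold it to a plain ASCII space.
    return decoded.replace("\u00a0", " ")
```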
Lists and Formatting:
Ensures numbered and bulleted lists are cleanly formatted for easy readability.
Headers
We convert <h1> tags into <h2> (removing all <h1> tags from the HTML output)
We also attempt to Title Case all headers (removing ALL CAPS where possible while preserving acronyms where we can)
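These two header steps can be sketched as follows (a simplified illustration; the acronym allow-list here is hypothetical, and the real title-casing is more nuanced about small words like "of" and "and"):

```python
import re

ACRONYMS = {"HR", "IT", "US", "RN"}  # hypothetical example allow-list

def title_case_header(text: str) -> str:
    """Convert an ALL-CAPS header to Title Case, keeping known acronyms."""
    words = []
    for word in text.split():
        words.append(word if word in ACRONYMS else word.capitalize())
    return " ".join(words)

def demote_h1(html_text: str) -> str:
    # <h1> becomes <h2> so the web page itself can own the single <h1>.
    return re.sub(r"<(/?)h1>", r"<\1h2>", html_text)
```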
HTML Tags Preserved
Our processing preserves only the following tags
All other HTML tags are removed (removed: <span>, <a>, <u>, <h1>, <h4>, <h5>, etc.)
We also wrap the final job posting in a <div> to create a consistent formatting container.
A Note On HTML Formatting
We have prioritized making our job post content consistent and 'boring'.
But we have also run some tests to re-generate job post HTML as if it is an advertisement, using more exciting language that appears more like marketing content for the company. On our own job boards, we have seen that this can increase engagement rates. Please let us know if you'd like to experiment with different content formats.
Detailed HTML Structure
The structure of each job post should contain roughly the following:
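As a rough, hypothetical illustration (the section names are invented and actual posts vary):

```html
<div>
  <h2>AI-generated summary header (usually the job title)</h2>
  <p>Opening paragraph describing the role...</p>
  <h2>Responsibilities</h2>
  <ul>
    <li>First responsibility</li>
    <li>Second responsibility</li>
  </ul>
  <h3>Benefits</h3>
  <p>We offer <strong>competitive pay</strong> and more.</p>
</div>
```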
A few notes:
We remove all <h1> headers. We recommend adding your own <h1> tag to the web page itself; that gives you more freedom than having the important <h1> tag embedded within the job post data.
We do keep <h2> and <h3> tags in the job post. Some customers prefer to remove all heading tags; you can easily replace them with <strong> headers if you would like.
We do not allow <h4> or <h5> tags in our cleaned data
We also remove all italics and underlines.
Each job will always have an <h2> header at the top; usually it is the job title, but sometimes it is something more generic or the name of the organization. This <h2> header is generated with AI as a quick summary of the job post. You can strip it out if you'd like (it is always the first <h2> tag).
Each job post is wrapped in a <div>, but we use <p> paragraphs throughout to create vertical line separation.
Examples
Here are a few before/after examples of our processing:
Carnegie Mellon University - Before
Carnegie Mellon University - After
Flight Instructor - Before
Flight Instructor - After
Structured, Extracted Data
Locations
In addition to extracting the raw location from the page, we run a series of more complex processing and enrichment steps to provide the following structured information:
City, State, Country
Zip code
CBSA
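As a purely hypothetical illustration of what such a structured location record can look like (the field names here are invented; see the API and XML tabs for the actual delivered schema):

```python
# Hypothetical illustration only -- field names are examples,
# not the exact delivered schema.
location = {
    "raw": "Pittsburgh, PA 15213",
    "city": "Pittsburgh",
    "state": "PA",
    "country": "US",
    "zip_code": "15213",
    "cbsa": "38300",  # example CBSA code
}
```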
Job Titles + ONET6
We extract normalized job titles, as well as ONET6 occupation codes, using large language models.
Compensation
We extract min, max, payment frequency, and currency using large language models
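A hypothetical illustration of a compensation record with those four fields (field names invented; the delivered schema is in the API and XML tabs):

```python
# Hypothetical illustration only -- the delivered schema may differ.
compensation = {
    "min": 85000,
    "max": 110000,
    "frequency": "yearly",
    "currency": "USD",
}
```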
You can see the actual delivered data structure for these extracted fields either in the API tab, or in the XML tab in our documentation.