Jobs Data - HTML Processing
Job Description Data
Description Data Types
1. Raw Scraped HTML
<!-- raw HTML exactly as scraped -->
<div id="job-container" class="jd-wrapper col-md-8 offset-md-2" style="padding:20px; font-family:Arial,sans-serif;">
<nav class="site-nav primary-navigation">
<a href="/">Home</a> › <a href="/jobs">Jobs</a> › Senior Software Engineer
</nav>
<script>window.dataLayer=window.dataLayer||[];dataLayer.push({'job_id':'83921'});</script>
<!-- Job Description Start -->
<h1 class="job-title" style="color:#333; font-size:28px;">Senior Software Engineer</h1>
<div class="job-meta" style="color:#666; margin-bottom:15px;">
<span class="location">San Francisco, CA</span> •
<span class="type">Full-time</span>
</div>
<div class="job-body" id="jd-content" aria-label="job description">
<p style="margin:10px 0; line-height:1.6; color:#444;">
We’re looking for a <strong>Senior Software Engineer</strong> to join our growing platform team.
You will <u>own</u> the design and delivery of <em>critical</em> backend services.
</p>
<h2 style="font-size:20px; color:#222; margin-top:20px;">What You’ll Do</h2>
<ul class="jd-list" style="padding-left:20px;">
<li>Design and implement scalable backend services</li>
<li>Mentor junior engineers and run code reviews</li>
<li>Partner with product to define technical roadmap</li>
</ul>
<h2 style="font-size:20px;">What We’re Looking For</h2>
<ul>
<li>5+ years of experience with Python or Go</li>
<li>Strong knowledge of distributed systems</li>
<li>Experience with AWS or GCP</li>
</ul>
</div>
</div>
2. AI-Cleaned and Standardized Job Description HTML
3. One-Line Summary
4. S3 Full-Page HTML — Complete Archived Page
Scrape-time Cleaning (how the cleaning in #2 above works)
- Before / After
Export-time Cleaning
Option | Default | Effect
What clean_html does structurally
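The exact rules behind clean_html are not listed on this page. As a rough illustration only, the sketch below assumes that structural cleaning means three things: removing navigation and script blocks together with their contents, dropping HTML comments, and stripping presentational attributes such as style, class and id. The names StructuralCleaner, clean_structure, DROP_SUBTREE and KEEP_ATTRS are hypothetical and are not part of the product; only the Python standard library is used.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical rule sets for illustration; the real clean_html rules may differ.
DROP_SUBTREE = {"script", "style", "nav"}  # removed together with their contents
KEEP_ATTRS = {"href"}                      # every other attribute is discarded
VOID_TAGS = {"br", "hr", "img"}            # tags that never get a closing tag


class StructuralCleaner(HTMLParser):
    """Re-emits HTML while skipping unwanted subtrees, comments and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROP_SUBTREE:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_SUBTREE:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    # handle_comment is not overridden, so HTML comments are silently dropped.


def clean_structure(raw_html: str) -> str:
    """Return raw_html with the assumed structural noise removed."""
    parser = StructuralCleaner()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    raw = (
        '<div class="jd-wrapper" style="padding:20px;">'
        '<nav><a href="/">Home</a></nav>'
        "<script>dataLayer.push({});</script>"
        "<!-- Job Description Start -->"
        '<h1 class="job-title">Senior Software Engineer</h1>'
        "</div>"
    )
    print(clean_structure(raw))
```

Applied to a fragment like the raw sample in section 1, this keeps the headings, paragraphs and lists and discards the breadcrumb navigation, the tracking script and the inline styling. A streaming parser is used here instead of regular expressions because nested tags and script contents are handled without special cases.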
clean_html_remove_links
clean_html_remove_contact_info
clean_html_remove_prompt_injection