The Challenge
The client needed to extract content from their own websites (10,000+ pages) and re-process 3,000+ images for upload to a third-party platform via API. Manual extraction was impossible at this scale, and existing tools couldn't handle JavaScript-rendered content or the specific cleaning requirements.
- 10,000+ pages with dynamic JavaScript content
- 3,000+ images needed resizing and re-uploading via API
- Cookie consent banners blocking content extraction
- Lazy-loaded images and content missed by simple scrapers
- Need to preserve specific content while removing headers, footers, ads
- Rate limiting required to avoid overwhelming servers
The Solution
Built an intelligent web crawler using Playwright for full JavaScript rendering, with automatic cookie banner dismissal, content stability detection for lazy-loaded elements, configurable content cleaning, and API integration for automated image processing and re-upload.
Processing Pipeline
Core Features
Intelligent Page Analysis
Pre-crawl analysis examines page structure and recommends optimal removal settings automatically.
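A pre-crawl recommendation step can be sketched as a pure function over element counts gathered in a first pass (for example, via Playwright locator counts). The function name, inputs, and thresholds below are illustrative, not the production logic:

```python
def recommend_removals(tag_counts: dict[str, int]) -> dict[str, bool]:
    """Given counts of structural elements found in a pre-crawl pass,
    recommend which removal flags to enable for the real crawl.
    A minimal sketch: recommend removing a region only if the page has it."""
    return {
        "remove_headers": tag_counts.get("header", 0) > 0,
        "remove_footers": tag_counts.get("footer", 0) > 0,
        "remove_nav": tag_counts.get("nav", 0) > 0,
    }
```

The recommendations can then seed the per-site configuration, with a human reviewing before a full run.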
Cookie Banner Auto-Dismiss
50+ selector patterns for OneTrust, CookieYes, Osano, and generic consent dialogs.
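The pattern is simple: try each known consent-button selector until one click lands. A minimal sketch of that loop, decoupled from the browser so the click itself can be any callable (e.g. a thin wrapper around Playwright's `page.click` with a short timeout); the selectors shown are a small illustrative subset of the 50+ patterns:

```python
# Illustrative consent-button selectors; real deployments tune these per site.
CONSENT_SELECTORS = [
    "#onetrust-accept-btn-handler",         # OneTrust accept button
    ".cky-btn-accept",                      # CookieYes accept button
    ".osano-cm-accept-all",                 # Osano accept-all button
    "button[aria-label='Accept cookies']",  # generic ARIA-labelled button
]

def dismiss_cookie_banner(try_click, selectors=CONSENT_SELECTORS):
    """Try each selector until one click succeeds.

    `try_click` is any callable taking a selector and returning True on a
    successful click. Returns the selector that worked, or None if no
    banner was found.
    """
    for sel in selectors:
        if try_click(sel):
            return sel
    return None
```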
Content Stability Detection
Waits for lazy-loaded content to stabilize before extraction. Catches late-loading elements.
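Stability detection boils down to polling a cheap page metric (say, rendered HTML length from `page.evaluate`) until it stops changing. A hedged sketch of that wait loop, with the clock and sleep injectable so the behaviour is testable; parameter defaults are illustrative:

```python
import time

def wait_for_stability(sample, interval=0.5, stable_checks=2, timeout=10.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `sample()` until it returns the same value `stable_checks`
    times in a row, or until `timeout` seconds elapse.

    Returns True if the content stabilized, False on timeout.
    """
    deadline = clock() + timeout
    last, streak = None, 0
    while clock() < deadline:
        value = sample()
        if value == last:
            streak += 1
            if streak >= stable_checks:
                return True
        else:
            # Content changed: reset the streak and keep waiting.
            last, streak = value, 0
        sleep(interval)
    return False
```

Requiring several identical consecutive samples, rather than one, is what catches elements that load in late bursts.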
Background Image Extraction
Extracts images from CSS background-image, srcset, meta tags, and inline styles.
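Once the raw style and attribute strings are in hand (for example via Playwright's `page.eval_on_selector_all`), the extraction itself is string parsing. A minimal sketch covering two of the listed sources, CSS `url(...)` values and `srcset` attributes:

```python
import re

# Matches url(...) with optional quotes, e.g. url('https://x/a.png').
_URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""", re.IGNORECASE)

def urls_from_style(style: str) -> list[str]:
    """Extract url(...) targets from a CSS snippet such as an inline style,
    skipping embedded data: URIs which need no download."""
    return [m for m in _URL_RE.findall(style) if not m.startswith("data:")]

def urls_from_srcset(srcset: str) -> list[str]:
    """Parse a srcset attribute: comma-separated 'url descriptor' pairs."""
    return [part.strip().split()[0] for part in srcset.split(",") if part.strip()]
```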
Token Bucket Rate Limiting
Thread-safe rate limiting prevents server overload. Configurable requests per second.
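The token bucket is a standard pattern: tokens refill continuously at the configured rate, each request consumes one, and bursts are capped by the bucket capacity. A self-contained sketch of the mechanism (parameter names are illustrative, not the production API):

```python
import threading
import time

class TokenBucket:
    """Thread-safe token bucket: at most `rate` requests per second on
    average, with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, tokens: float = 1.0) -> None:
        """Block until `tokens` are available, then consume them."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill in proportion to elapsed time, capped at capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= tokens:
                    self.tokens -= tokens
                    return
                # Not enough yet: compute the shortfall, sleep outside the lock.
                wait = (tokens - self.tokens) / self.rate
            time.sleep(wait)
```

Sleeping outside the lock matters: crawler worker threads waiting for tokens must not block the threads that already have them.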
HTML & CSV Reports
Detailed crawl reports with configuration summary, results table, and ZIP download.
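The CSV side of the reporting reduces to serializing per-page result rows. A small sketch with Python's standard `csv` module; the column names here are illustrative, not the report's actual schema:

```python
import csv
import io

def write_crawl_report(rows: list[dict]) -> str:
    """Render per-page crawl results as CSV text.

    Each row is a dict with the illustrative columns below; in practice
    the string would be written to a file and bundled into the ZIP.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "status", "images_found", "bytes"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```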
Configurable Content Removal
| Option | Description |
|---|---|
| `remove_headers` | Remove header elements and masthead sections |
| `remove_footers` | Remove footer elements and bottom sections |
| `remove_nav` | Remove navigation menus and breadcrumbs |
| `remove_ads` | Remove ad banners, sponsors, and promotional elements |
| `remove_images` | Remove `img`, `svg`, `picture`, and `figure` elements |
| `remove_styles` | Remove stylesheets, `style` tags, and meta elements |
| `remove_newsletters` | Remove newsletter signup forms and subscription CTAs |
| `extract_background_images` | Convert CSS background images to downloadable links |
| `fast_mode` | Reduced timeouts for faster crawling (may miss lazy content) |
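Put together, a per-site configuration combining these options might look like the following. This is an illustrative fragment; the keys mirror the table above, but how the crawler actually consumes them is an assumption:

```python
# Hypothetical per-site crawl configuration using the options above.
config = {
    "remove_headers": True,
    "remove_footers": True,
    "remove_nav": True,
    "remove_ads": True,
    "remove_images": False,            # keep images for the re-upload pipeline
    "remove_styles": True,
    "remove_newsletters": True,
    "extract_background_images": True,
    "fast_mode": False,                # full waits so lazy content is captured
}
```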
Technology Stack
Results
The system successfully crawled over 10,000 pages with full JavaScript rendering, extracted and processed 3,000+ images, and automatically uploaded them to the client's third-party platform via API integration — a task that would have taken months manually.
Need large-scale data extraction?
We build custom crawlers that handle JavaScript, rate limiting, and API integration.
Get in touch