Case Study

Enterprise Web Crawler

Intelligent web scraping platform with Playwright automation, content cleaning, and API integration. Built to crawl 10,000+ pages and process 3,000+ images for automated re-upload to third-party services.

Pages Crawled: 10,000+
Images Processed: 3,000+
Engine: Playwright
Status: Production

The Challenge

The client needed to extract content from their own websites (10,000+ pages) and re-process 3,000+ images for upload to a third-party platform via its API. Manual extraction was impractical at this scale, and off-the-shelf tools couldn't handle JavaScript-rendered content or the project's specific cleaning requirements.

  • 10,000+ pages with dynamic JavaScript content
  • 3,000+ images needed resizing and re-uploading via API
  • Cookie consent banners blocking content extraction
  • Lazy-loaded images and content missed by simple scrapers
  • Specific content had to be preserved while headers, footers, and ads were removed
  • Rate limiting required to avoid overwhelming servers

The Solution

Built an intelligent web crawler using Playwright for full JavaScript rendering, with automatic cookie banner dismissal, content stability detection for lazy-loaded elements, configurable content cleaning, and API integration for automated image processing and re-upload.
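
As a minimal sketch of the rendering step (assuming Playwright's sync Python API; the production crawler's structure may differ):

```python
from playwright.sync_api import sync_playwright

def render_page(url: str) -> str:
    """Fetch a fully rendered page with headless Chromium (illustrative sketch)."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # "networkidle" waits for network activity to settle, so most
        # JavaScript-driven content is in place before extraction
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```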

Processing Pipeline

1. Page Analysis: pre-crawl analysis recommends optimal settings based on page structure
2. Playwright Render: full JavaScript execution with a headless Chromium browser
3. Cookie Dismissal: auto-dismiss consent banners using 50+ selector patterns
4. Content Stability: wait for lazy-loaded content to stabilize before extraction
5. Content Cleaning: BeautifulSoup removes headers, footers, ads, and navigation
6. Image Extraction: extract images, including CSS background images and srcset sources
7. API Upload: resize and upload images to the third-party platform via API (see the sketch after this list)
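
As referenced in step 7, a hedged sketch of the resize-and-upload step. The endpoint URL, bearer-token auth, and size limit below are illustrative placeholders, since the actual third-party API isn't named here:

```python
import io

import requests
from PIL import Image

def resize_and_upload(image_bytes: bytes, upload_url: str, api_key: str,
                      max_width: int = 1920) -> dict:
    """Resize an image to max_width (keeping aspect ratio) and POST it to an API."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    if img.width > max_width:
        ratio = max_width / img.width
        img = img.resize((max_width, int(img.height * ratio)), Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)
    buf.seek(0)
    # upload_url and the bearer-token scheme are placeholders,
    # not the client's actual third-party API
    response = requests.post(
        upload_url,
        headers={"Authorization": f"Bearer {api_key}"},
        files={"file": ("image.jpg", buf, "image/jpeg")},
    )
    response.raise_for_status()
    return response.json()
```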

Core Features

Intelligent Page Analysis

Pre-crawl analysis examines page structure and recommends optimal removal settings automatically.
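
A simplified sketch of what such a pre-crawl heuristic could look like; the rules below are illustrative assumptions, not the shipped analyzer:

```python
from bs4 import BeautifulSoup

def recommend_settings(html: str) -> dict:
    """Suggest removal flags from a quick structural scan (illustrative heuristics)."""
    soup = BeautifulSoup(html, "lxml")
    return {
        "remove_headers": soup.find("header") is not None,
        "remove_footers": soup.find("footer") is not None,
        "remove_nav": soup.find("nav") is not None,
        # substring class matches are a rough proxy for ad containers
        "remove_ads": soup.select_one("[class*='ad-'], [class*='sponsor']") is not None,
        "remove_newsletters": soup.select_one("form[action*='subscribe']") is not None,
    }
```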

Cookie Banner Auto-Dismiss

50+ selector patterns for OneTrust, CookieYes, Osano, and generic consent dialogs.
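
A minimal sketch of the dismissal loop. The selectors shown are a small subset standing in for the full 50+ pattern list:

```python
# Illustrative subset of consent-button selectors (not the full pattern list)
CONSENT_SELECTORS = [
    "#onetrust-accept-btn-handler",   # OneTrust
    ".cky-btn-accept",                # CookieYes
    ".osano-cm-accept-all",           # Osano
    "button:has-text('Accept all')",  # generic fallback
]

def dismiss_cookie_banner(page, timeout_ms: int = 1500) -> bool:
    """Click the first consent button that matches a known pattern."""
    for selector in CONSENT_SELECTORS:
        try:
            page.click(selector, timeout=timeout_ms)
            return True
        except Exception:
            continue  # pattern not present on this page; try the next
    return False
```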

Content Stability Detection

Waits for lazy-loaded content to stabilize before extraction. Catches late-loading elements.
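
One plausible implementation of stability detection, assuming Playwright's sync API: hash the rendered DOM repeatedly until it stops changing.

```python
import hashlib
import time

def wait_for_stable_content(page, interval: float = 0.5, checks: int = 3,
                            max_wait: float = 15.0) -> bool:
    """Poll the DOM until its hash is unchanged for `checks` consecutive polls."""
    deadline = time.time() + max_wait
    last_hash, stable = None, 0
    while time.time() < deadline:
        current = hashlib.md5(page.content().encode()).hexdigest()
        stable = stable + 1 if current == last_hash else 0
        if stable >= checks:
            return True
        last_hash = current
        time.sleep(interval)
    return False  # timed out; lazy content may still be loading
```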

Background Image Extraction

Extracts images from CSS background-image, srcset, meta tags, and inline styles.
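
A sketch of that extraction pass with BeautifulSoup; the attribute coverage follows the description above, but the exact production logic may differ:

```python
import re

from bs4 import BeautifulSoup

# Pull the URL out of a CSS background-image declaration
BG_URL = re.compile(r"url\(['\"]?(.*?)['\"]?\)")

def extract_image_urls(html: str) -> set:
    """Collect image URLs from img/src, srcset, inline styles, and meta tags."""
    soup = BeautifulSoup(html, "lxml")
    urls = set()
    for img in soup.find_all("img"):
        if img.get("src"):
            urls.add(img["src"])
        for candidate in img.get("srcset", "").split(","):
            if candidate.strip():
                urls.add(candidate.strip().split()[0])  # drop width descriptor
    for tag in soup.find_all(style=True):
        urls.update(BG_URL.findall(tag["style"]))
    for meta in soup.find_all("meta", property="og:image"):
        if meta.get("content"):
            urls.add(meta["content"])
    return urls
```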

Token Bucket Rate Limiting

Thread-safe rate limiting prevents server overload. Configurable requests per second.
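
A self-contained sketch of a thread-safe token bucket of the kind described:

```python
import threading
import time

class TokenBucket:
    """Thread-safe token bucket: `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float = None):
        self.rate = rate
        self.capacity = capacity or rate
        self.tokens = self.capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                # refill tokens in proportion to elapsed time, capped at capacity
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)
```

Each crawler thread calls acquire() before issuing a request, so bursts stay within capacity and the sustained rate matches the configured requests per second.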

HTML & CSV Reports

Detailed crawl reports with configuration summary, results table, and ZIP download.

Configurable Content Removal

Option                      Description
remove_headers              Remove header elements and masthead sections
remove_footers              Remove footer elements and bottom sections
remove_nav                  Remove navigation menus and breadcrumbs
remove_ads                  Remove ad banners, sponsors, and promotional elements
remove_images               Remove img, svg, picture, and figure elements
remove_styles               Remove stylesheets, style tags, and meta elements
remove_newsletters          Remove newsletter signup forms and subscription CTAs
extract_background_images   Convert CSS background images to downloadable links
fast_mode                   Reduced timeouts for faster crawling (may miss lazy content)
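
A hedged sketch of how these flags could drive the BeautifulSoup cleaning pass; the selector lists are illustrative placeholders rather than the real configuration:

```python
from bs4 import BeautifulSoup

# Hypothetical mapping from config flags to CSS selectors; the real
# selector lists are more extensive
REMOVAL_SELECTORS = {
    "remove_headers": ["header", ".masthead"],
    "remove_footers": ["footer", ".site-footer"],
    "remove_nav": ["nav", ".breadcrumb"],
    "remove_ads": [".ad", ".sponsor", "[class*='promo']"],
    "remove_images": ["img", "svg", "picture", "figure"],
    "remove_styles": ["style", "link[rel='stylesheet']"],
    "remove_newsletters": ["form.newsletter", ".subscribe-cta"],
}

def clean_html(html: str, options: dict) -> str:
    """Strip every element matched by an enabled removal flag."""
    soup = BeautifulSoup(html, "lxml")
    for flag, selectors in REMOVAL_SELECTORS.items():
        if options.get(flag):
            for selector in selectors:
                for element in soup.select(selector):
                    element.decompose()
    return str(soup)
```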

Technology Stack

Python · Flask · Playwright · BeautifulSoup · lxml · Chromium · Threading · REST API

Results

10,000+ pages crawled
3,000+ images processed & uploaded
100% automated via API

The system successfully crawled over 10,000 pages with full JavaScript rendering, extracted and processed 3,000+ images, and automatically uploaded them to the client's third-party platform via API integration — a task that would have taken months manually.

Need large-scale data extraction?

We build custom crawlers that handle JavaScript, rate limiting, and API integration.

Get in touch