
How We Extracted 37,720 Italian Real Estate Listings That Didn't Want to Be Scraped

A CRE data firm needed every apartment listing on Immobiliare.it, Italy's largest real estate portal. The site runs Datadome. Here's exactly what we tried, what failed, and how we delivered 37,720 clean records overnight.

23 April 2026 · 8 min read

A commercial real estate firm came to us with one ask: every apartment listing on Immobiliare.it for Milan, sales and rentals, delivered as clean spreadsheets. Around 42,000 listings. All publicly visible. No login required.

We delivered 37,720 records in under 24 hours.

What happened in between is the interesting part.


The Wall

Immobiliare.it runs Datadome, enterprise-grade bot protection trusted by Carrefour, Rakuten, and Axel Springer. It doesn't ask you to solve a CAPTCHA. It doesn't slow you down gradually. It fingerprints your request against a dozen signals simultaneously and, if anything looks off, returns a hard 403. No warning. No explanation. Just a wall.

The signals Datadome checks in parallel:

  • TLS cipher suite: Chrome uses TLS 1.3 with a specific cipher order. Python's requests library negotiates through OpenSSL with a different cipher order. Datadome sees this before it even reads your headers.
  • Browser fingerprint: navigator.webdriver, GPU renderer, installed plugins, screen resolution, timezone.
  • Behavioral sequence: how long between requests, how you navigate between pages, whether you ever scroll or click.
  • IP reputation: datacenters, known proxy ranges, residential vs. commercial ASN.

Get flagged on any one of these and you're blocked. Most scrapers fail within 50 requests.


Four Approaches. Three Dead Ends.

Here's exactly what we tried, where each approach failed, and why.


Approach 1: Python + requests

The standard starting point for any scraping project. A requests.Session with real browser cookies copied from a logged-in session, a realistic User-Agent, full browser header sets.

[Figure: Python requests.Session blocked by Datadome at the TLS handshake (cipher suite mismatch)]

What we tried to fix it:

  • Switched to httpx with HTTP/2 support (closer to browser behaviour)
  • Rotated User-Agent strings across common Chrome versions
  • Added the full Sec-Fetch-*, Sec-CH-UA header set that Chrome sends
  • Throttled request rate to under 1 per second

None of it worked past a few hundred requests. The root problem is fundamental: Python's TLS library (ssl/OpenSSL) has a different cipher order than Chrome's BoringSSL. Datadome fingerprints at the TLS handshake level, before it reads a single HTTP header. No amount of header tweaking fixes a TLS-level mismatch.


Approach 2: Playwright (Headless)

If the problem is that Python looks like a bot, use a real browser. Playwright drives actual Chromium with full JavaScript execution, real DOM, real rendering pipeline.

[Figure: Headless Playwright Chrome blocked immediately; three automation signals detected by Datadome]

Headless Chromium leaks automation signals that are hard to fully patch:

  • navigator.webdriver is true by default in headless mode
  • No real GPU: headless falls back to the SwiftShader software renderer, which is detectable via WebGL
  • The plugin list is empty; real browsers ship at least a PDF viewer
  • The CDP (Chrome DevTools Protocol) connection leaves traces in browser timing

Datadome blocked this within seconds.
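For illustration, the checks above can be sketched as a single pass over a collected fingerprint. This is our reconstruction, not Datadome's actual code, and the field names are assumptions:

```javascript
// Illustrative sketch of the headless-detection checks described above.
// Field names (webdriver, webglRenderer, plugins) are our assumptions,
// not Datadome's real schema.
function headlessSignals(fp) {
  const flags = [];
  if (fp.webdriver === true) flags.push('navigator.webdriver');
  if (/SwiftShader/i.test(fp.webglRenderer || '')) flags.push('software GPU');
  if ((fp.plugins || []).length === 0) flags.push('empty plugin list');
  return flags; // any single flag is enough to trigger a hard 403
}
```

A stock headless Chromium trips all three at once, which is why the block is instant rather than gradual.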


Approach 3: Playwright (Headed, Visible Window)

Switch to headed mode, an actual visible browser window. Apply stealth patches: force navigator.webdriver to false, inject a fake plugin list, spoof the GPU renderer string.

[Figure: Headed Playwright with stealth patches passes initially, but a CDP timing leak is detected under sustained load]

Better. Initial requests got through. But under sustained load (100+ consecutive requests), the CDP debugging connection that Playwright uses to control the browser creates detectable timing anomalies. Success rates hovered around 60-70%, with inconsistent, unpredictable blocks.

Patching deeper (faking GPU WebGL strings, injecting fake mouse movement history) was possible but we were in an arms race. Datadome updates their detection regularly. Any patch we applied would need constant maintenance.
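For reference, the stealth patches from this step look roughly like the snippet below. With Playwright they would be registered via page.addInitScript() so they run before any page script; applyStealthPatches is our illustrative helper, not a library function:

```javascript
// Sketch of the stealth patches described above. In a real setup this body
// runs inside the page, before any site script, via page.addInitScript().
function applyStealthPatches(nav) {
  // Hide the automation flag that headless mode sets to true
  Object.defineProperty(nav, 'webdriver', { get: () => false });
  // Real Chrome always ships at least a PDF viewer plugin
  Object.defineProperty(nav, 'plugins', {
    get: () => [{ name: 'PDF Viewer' }],
  });
  return nav;
}
```

Patches like these fix what JavaScript can see, but they cannot touch the CDP timing behaviour, which is exactly where this approach broke down.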

We stopped here. There was a better approach.


Approach 4: Browser-Native fetch()

The insight: stop trying to make Python look like a browser. Just use the browser.

[Figure: Browser-native fetch() passes all Datadome checks: real TLS, real cookies, real headers]

When JavaScript calls fetch() from inside a browser tab, that request is genuinely same-origin. The browser attaches the real session cookies automatically. The TLS handshake comes from Chrome's actual BoringSSL stack. The headers match what the browser itself sends. At every layer Datadome can fingerprint, the request looks like normal on-site activity.

Immobiliare.it renders all listing data inside a __NEXT_DATA__ script tag on every page (standard Next.js SSR). One fetch() call returns the full structured JSON for that listing.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const batch = [];

for (const url of urls) {
  const resp = await fetch(url);
  const html = await resp.text();
  // Listing data ships inside Next.js's __NEXT_DATA__ script tag
  const match = html.match(/<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
  if (match) batch.push(JSON.parse(match[1]));
  await sleep(600); // polite delay between requests
}
// Every 300 listings: auto-download batch as a JSON file

No proxies. No fingerprint spoofing. No library dependencies. Under 3% failure rate.
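The "batch save every 300 listings" in the loop above is mostly array slicing plus a browser-side download. A minimal sketch, where downloadBatch and the filename scheme are our illustration of the pattern rather than the exact production code:

```javascript
const BATCH_SIZE = 300;

// Split the accumulated listings into fixed-size batches for periodic saves.
function toBatches(listings, size = BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < listings.length; i += size) {
    batches.push(listings.slice(i, i + size));
  }
  return batches;
}

// Browser-side save (sketch): serialize a batch and click a temporary link.
function downloadBatch(batch, filename) {
  const blob = new Blob([JSON.stringify(batch)], { type: 'application/json' });
  const a = document.createElement('a');
  a.href = URL.createObjectURL(blob);
  a.download = filename; // e.g. a per-window prefix keeps files from colliding
  a.click();
  URL.revokeObjectURL(a.href);
}
```

Capping batches at 300 bounds the blast radius: if a window crashes mid-run, at most 300 records are lost.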


The 80-Page Cap Problem

A single browser window processing 300 listings per 18 minutes would take 45 hours for 42,000 listings. Two problems needed solving before an overnight run was viable.

Problem 1: Immobiliare.it caps every search query at 80 pages (~2,640 results). Milan has ~42,000 listings. A single search doesn't cover the full dataset.

[Figure: Search-space decomposition: 53 slices by room count, price band, and city macrozone to beat the 80-page cap]

By slicing the search space across room count, price band, and city macrozone, we ended up with 53 individual searches, each returning fewer than 2,640 results and together covering every listing in Milan. Duplicates across overlapping slices were removed by listing ID at fetch time.
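The slicing itself is a cartesian product over the three dimensions. The example values below are assumptions for illustration; the real run used 53 unevenly sized slices tuned so that no slice exceeded the cap:

```javascript
// Illustrative search-space decomposition. These slice values are
// assumptions; the production slices were tuned per segment.
const rooms = ['1', '2', '3', '4+'];
const priceBands = ['0-150000', '150000-300000', '300000-500000', '500000+'];
const macrozones = ['centro', 'navigli', 'citta-studi']; // hypothetical zone slugs

function buildSlices(rooms, priceBands, macrozones) {
  const slices = [];
  for (const r of rooms)
    for (const p of priceBands)
      for (const z of macrozones)
        slices.push({ rooms: r, price: p, zone: z });
  return slices; // each slice must return < 2,640 results to dodge the cap
}
```

Any slice that still comes back with 80 full pages gets split further, until every slice fits under the cap.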


Running Three Windows in Parallel

Problem 2: One window processes ~300 listings per 18 minutes. That's 45 hours for the full dataset, not viable for 24-hour delivery.

Multiple tabs on the same browser profile don't help. They share a Datadome session, so tripling the request rate under one identity spiked the failure rate to 45%.

The fix: separate browser profiles. Each profile is an entirely independent Datadome session.

[Figure: Three parallel Edge browser profiles with independent Datadome sessions merging into 37,720 clean records]

Three profiles. Three independent sessions. Each at a comfortable pace. Total run time: ~16 hours overnight. Failure rate: under 3%.

The scraper handled its own resilience: automatic 90-second backoff on 3 consecutive 403s, batch saves every 300 listings (maximum crash loss: 300 records), per-window filename prefixes so downloads never overwrite each other.
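The 403 backoff can be sketched as a thin wrapper around the fetch loop. The constants mirror the numbers above (90-second pause after 3 consecutive 403s); fetchWithBackoff is our illustrative helper, with the fetch and sleep functions injectable:

```javascript
// Sketch of the resilience loop: back off 90s after 3 consecutive 403s.
// fetchFn and sleepFn are injectable so the logic is testable.
async function fetchWithBackoff(urls, fetchFn, sleepFn) {
  const results = [];
  let consecutive403 = 0;
  for (const url of urls) {
    const resp = await fetchFn(url);
    if (resp.status === 403) {
      consecutive403 += 1;
      if (consecutive403 >= 3) {
        await sleepFn(90_000); // let the Datadome session cool down
        consecutive403 = 0;
      }
      continue;
    }
    consecutive403 = 0;
    results.push(await resp.text());
  }
  return results;
}
```

Resetting the counter on every success matters: an isolated 403 (an expired listing, a transient error) should not push a healthy session into a 90-second stall.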


The Data Cleaning Problem Nobody Mentions

Raw extraction delivers raw data. What the client receives needs to be clean, typed, and immediately usable.

Italy uses . as the thousands separator and , as the decimal. € 230.000 is two hundred and thirty thousand euros, not two hundred and thirty. Load that string into Excel and you get 230. Prices, surface areas, floor numbers, condo expenses all arrive in Italian locale format.
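Normalising Italian-locale numerics is mostly string surgery. A minimal sketch, assuming values arrive as strings like "€ 230.000" or "85,5 m²":

```javascript
// Convert Italian-locale numeric strings ("€ 230.000", "85,5 m²") to numbers.
function parseItalianNumber(raw) {
  const cleaned = raw
    .replace(/[^\d.,-]/g, '') // strip currency symbols, spaces, units
    .replace(/\./g, '')       // '.' is the thousands separator: drop it
    .replace(',', '.');       // ',' is the decimal separator: convert
  return Number(cleaned);
}
```

Doing this before export is what keeps Excel from silently reading € 230.000 as 230.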

[Figure: Italian-locale data normalisation: raw strings converted to typed numerics for client-ready CSV delivery]

Every numeric field is typed. Every field name is in English. Headers locked. File named [ClientName]_[DataType]_[Date].xlsx. The client opens it and works: no cleanup, no reformatting.


The Output

37,720 Total Records · Under 24 hours Delivery Time · Datadome Protection Defeated

File                         Records
Milan_Sales_Extracted.csv    24,182
Milan_Rent_Extracted.csv     13,538
Milan_Sales_Full.csv         24,182
Milan_Rent_Full.csv          13,538

The remaining ~4,000 URLs pointed to expired or deleted listings that are no longer available on the site.


What This Means for You

If your team is hitting walls on Zillow, CoStar, Rightmove, Domain.com.au, or any other real estate portal behind bot protection, the problem usually isn't your scraper. It's the approach layer.

We run free 500-record proof-of-concept extractions. Give us the URL your team is struggling with. We extract a clean sample, format it to your spec, and deliver overnight.

No invoice. Just proof our infrastructure works.

Request Your Free Sample →


Protection defeated: Datadome · Records delivered: 37,720 · Time from start to delivery: under 24 hours