
How We Extracted 37,720 Italian Real Estate Listings That Didn't Want to Be Scraped

A CRE data firm needed every apartment listing on Immobiliare.it, Italy's largest real estate portal. The site runs Datadome. Here's exactly what we tried, what failed, and how we delivered 37,720 clean records overnight.

23 April 2026 · 8 min read

A commercial real estate firm came to us with one ask: every apartment listing on Immobiliare.it for Milan, sales and rentals, delivered as clean spreadsheets. Around 42,000 listings. All publicly visible. No login required.

We delivered 37,720 records in under 24 hours.

What happened in between is the interesting part.


The Wall

Immobiliare.it runs Datadome, enterprise-grade bot protection trusted by Carrefour, Rakuten, and Axel Springer. It doesn't ask you to solve a CAPTCHA. It doesn't slow you down gradually. It fingerprints your request against a dozen signals simultaneously and, if anything looks off, returns a hard 403. No warning. No explanation. Just a wall.

The signals Datadome checks in parallel:

  • TLS cipher suite: Chrome uses TLS 1.3 with a specific cipher order. Python's requests library negotiates through OpenSSL with a different cipher order. Datadome sees this before it even reads your headers.
  • Browser fingerprint: navigator.webdriver, GPU renderer, installed plugins, screen resolution, timezone.
  • Behavioral sequence: how long between requests, how you navigate between pages, whether you ever scroll or click.
  • IP reputation: datacenters, known proxy ranges, residential vs. commercial ASN.

Get flagged on any one of these and you're blocked. Most scrapers fail within 50 requests.


Four Approaches. Three Dead Ends.

Here's exactly what we tried, where each approach failed, and why.


Approach 1: Python + requests

The standard starting point for any scraping project. A requests.Session with real browser cookies copied from a logged-in session, a realistic User-Agent, full browser header sets.

[Figure: Python requests.Session blocked by Datadome at the TLS handshake (cipher suite mismatch)]

What we tried to fix it:

  • Switched to httpx with HTTP/2 support (closer to browser behaviour)
  • Rotated User-Agent strings across common Chrome versions
  • Added the full Sec-Fetch-*, Sec-CH-UA header set that Chrome sends
  • Throttled request rate to under 1 per second

None of it worked past a few hundred requests. The root problem is fundamental: Python's TLS library (ssl/OpenSSL) has a different cipher order than Chrome's BoringSSL. Datadome fingerprints at the TLS handshake level, before it reads a single HTTP header. No amount of header tweaking fixes a TLS-level mismatch.


Approach 2: Playwright (Headless)

If the problem is that Python looks like a bot, use a real browser. Playwright drives actual Chromium with full JavaScript execution, real DOM, real rendering pipeline.

[Figure: Headless Playwright Chrome blocked immediately; three automation signals detected by Datadome]

Headless Chromium leaks automation signals that are hard to fully patch:

  • navigator.webdriver is true by default in headless mode
  • No real GPU: headless falls back to the SwiftShader software renderer, which is detectable via WebGL
  • The plugin list is empty; real browsers ship at least a PDF viewer
  • The CDP (Chrome DevTools Protocol) connection leaves traces in browser timing

Datadome blocked this within seconds.
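For illustration, the checks above can be sketched as a single pass over a collected fingerprint. This is our reconstruction, not Datadome's actual code, and the field names are assumptions:

```javascript
// Illustrative sketch of the headless-detection checks described above.
// Field names (webdriver, webglRenderer, plugins) are our assumptions,
// not Datadome's real schema.
function headlessSignals(fp) {
  const flags = [];
  if (fp.webdriver === true) flags.push('navigator.webdriver');
  if (/SwiftShader/i.test(fp.webglRenderer || '')) flags.push('software GPU');
  if ((fp.plugins || []).length === 0) flags.push('empty plugin list');
  return flags; // any single flag is enough to trigger a hard 403
}
```

A stock headless Chromium trips all three at once, which is why the block is instant rather than gradual.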


Approach 3: Playwright (Headed, Visible Window)

Switch to headed mode, an actual visible browser window. Apply stealth patches: force navigator.webdriver to false, inject a fake plugin list, spoof the GPU renderer string.

[Figure: Headed Playwright with stealth patches passes initially, but a CDP timing leak is detected under sustained load]

Better. Initial requests got through. But under sustained load (100+ consecutive requests), the CDP debugging connection that Playwright uses to control the browser creates detectable timing anomalies. Success rates hovered around 60-70%, with inconsistent, unpredictable blocks.

Patching deeper (faking GPU WebGL strings, injecting fake mouse movement history) was possible but we were in an arms race. Datadome updates their detection regularly. Any patch we applied would need constant maintenance.
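For reference, the stealth patches from this step look roughly like the snippet below. With Playwright they would be registered via page.addInitScript() so they run before any page script; applyStealthPatches is our illustrative helper, not a library function:

```javascript
// Sketch of the stealth patches described above. In a real setup this body
// runs inside the page, before any site script, via page.addInitScript().
function applyStealthPatches(nav) {
  // Hide the automation flag that headless mode sets to true
  Object.defineProperty(nav, 'webdriver', { get: () => false });
  // Real Chrome always ships at least a PDF viewer plugin
  Object.defineProperty(nav, 'plugins', {
    get: () => [{ name: 'PDF Viewer' }],
  });
  return nav;
}
```

Patches like these fix what JavaScript can see, but they cannot touch the CDP timing behaviour, which is exactly where this approach broke down.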

We stopped here. There was a better approach.


Approach 4: Browser-Native fetch()

The insight: stop trying to make Python look like a browser. Just use the browser.

[Figure: Browser-native fetch() passes all Datadome checks: real TLS, real cookies, real headers]

When JavaScript calls fetch() from inside a browser tab, that request is genuinely same-origin. The browser attaches the real session cookies automatically. The TLS handshake comes from Chrome's actual BoringSSL stack. The headers match what the browser itself sends. At every layer Datadome can fingerprint, the request looks like normal on-site activity.

Immobiliare.it renders all listing data inside a __NEXT_DATA__ script tag on every page (standard Next.js SSR). One fetch() call returns the full structured JSON for that listing.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const batch = [];

for (const url of urls) {
  const resp = await fetch(url);
  const html = await resp.text();
  // Listing data ships inside Next.js's __NEXT_DATA__ script tag
  const match = html.match(/<script id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
  if (match) batch.push(JSON.parse(match[1]));
  await sleep(600); // polite delay between requests
}
// Every 300 listings: auto-download batch as a JSON file

No proxies. No fingerprint spoofing. No library dependencies. Under 3% failure rate.
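The "batch save every 300 listings" in the loop above is mostly array slicing plus a browser-side download. A minimal sketch, where downloadBatch and the filename scheme are our illustration of the pattern rather than the exact production code:

```javascript
const BATCH_SIZE = 300;

// Split the accumulated listings into fixed-size batches for periodic saves.
function toBatches(listings, size = BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < listings.length; i += size) {
    batches.push(listings.slice(i, i + size));
  }
  return batches;
}

// Browser-side save (sketch): serialize a batch and click a temporary link.
function downloadBatch(batch, filename) {
  const blob = new Blob([JSON.stringify(batch)], { type: 'application/json' });
  const a = document.createElement('a');
  a.href = URL.createObjectURL(blob);
  a.download = filename; // e.g. a per-window prefix keeps files from colliding
  a.click();
  URL.revokeObjectURL(a.href);
}
```

Capping batches at 300 bounds the blast radius: if a window crashes mid-run, at most 300 records are lost.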


The 80-Page Cap Problem

A single browser window processing 300 listings per 18 minutes would take 45 hours for 42,000 listings. Two problems needed solving before an overnight run was viable.

Problem 1: Immobiliare.it caps every search query at 80 pages (~2,640 results). Milan has ~42,000 listings. A single search doesn't cover the full dataset.

[Figure: Search-space decomposition: 53 slices by room count, price band, and city macrozone to beat the 80-page cap]

By slicing the search space across room count, price band, and city macrozone, we ended up with 53 individual searches, each returning fewer than 2,640 results and together covering every listing in Milan. Duplicates across overlapping slices were removed by listing ID at fetch time.
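The slicing itself is a cartesian product over the three dimensions. The example values below are assumptions for illustration; the real run used 53 unevenly sized slices tuned so that no slice exceeded the cap:

```javascript
// Illustrative search-space decomposition. These slice values are
// assumptions; the production slices were tuned per segment.
const rooms = ['1', '2', '3', '4+'];
const priceBands = ['0-150000', '150000-300000', '300000-500000', '500000+'];
const macrozones = ['centro', 'navigli', 'citta-studi']; // hypothetical zone slugs

function buildSlices(rooms, priceBands, macrozones) {
  const slices = [];
  for (const r of rooms)
    for (const p of priceBands)
      for (const z of macrozones)
        slices.push({ rooms: r, price: p, zone: z });
  return slices; // each slice must return < 2,640 results to dodge the cap
}
```

Any slice that still comes back with 80 full pages gets split further, until every slice fits under the cap.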


Running Three Windows in Parallel

Problem 2: One window processes ~300 listings per 18 minutes. That's 45 hours for the full dataset, not viable for 24-hour delivery.

Multiple tabs on the same browser profile don't help. They share a Datadome session, so tripling the request rate under one identity spiked the failure rate to 45%.

The fix: separate browser profiles. Each profile is an entirely independent Datadome session.

[Figure: Three parallel Edge browser profiles with independent Datadome sessions merging into 37,720 clean records]

Three profiles. Three independent sessions. Each at a comfortable pace. Total run time: ~16 hours overnight. Failure rate: under 3%.

The scraper handled its own resilience: automatic 90-second backoff on 3 consecutive 403s, batch saves every 300 listings (maximum crash loss: 300 records), per-window filename prefixes so downloads never overwrite each other.
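The 403 backoff can be sketched as a thin wrapper around the fetch loop. The constants mirror the numbers above (90-second pause after 3 consecutive 403s); fetchWithBackoff is our illustrative helper, with the fetch and sleep functions injectable:

```javascript
// Sketch of the resilience loop: back off 90s after 3 consecutive 403s.
// fetchFn and sleepFn are injectable so the logic is testable.
async function fetchWithBackoff(urls, fetchFn, sleepFn) {
  const results = [];
  let consecutive403 = 0;
  for (const url of urls) {
    const resp = await fetchFn(url);
    if (resp.status === 403) {
      consecutive403 += 1;
      if (consecutive403 >= 3) {
        await sleepFn(90_000); // let the Datadome session cool down
        consecutive403 = 0;
      }
      continue;
    }
    consecutive403 = 0;
    results.push(await resp.text());
  }
  return results;
}
```

Resetting the counter on every success matters: an isolated 403 (an expired listing, a transient error) should not push a healthy session into a 90-second stall.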


The Data Cleaning Problem Nobody Mentions

Raw extraction delivers raw data. What the client receives needs to be clean, typed, and immediately usable.

Italy uses . as the thousands separator and , as the decimal. € 230.000 is two hundred and thirty thousand euros, not two hundred and thirty. Load that string into Excel and you get 230. Prices, surface areas, floor numbers, condo expenses all arrive in Italian locale format.
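Normalising Italian-locale numerics is mostly string surgery. A minimal sketch, assuming values arrive as strings like "€ 230.000" or "85,5 m²":

```javascript
// Convert Italian-locale numeric strings ("€ 230.000", "85,5 m²") to numbers.
function parseItalianNumber(raw) {
  const cleaned = raw
    .replace(/[^\d.,-]/g, '') // strip currency symbols, spaces, units
    .replace(/\./g, '')       // '.' is the thousands separator: drop it
    .replace(',', '.');       // ',' is the decimal separator: convert
  return Number(cleaned);
}
```

Doing this before export is what keeps Excel from silently reading € 230.000 as 230.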

[Figure: Italian-locale data normalisation: raw strings converted to typed numerics for client-ready CSV delivery]

Every numeric field is typed. Every field name is in English. Headers locked. File named [ClientName]_[DataType]_[Date].xlsx. The client opens it and works: no cleanup, no reformatting.


The Output

37,720 Total Records · Under 24 hours Delivery Time · Datadome Protection Defeated

File                         Records
Milan_Sales_Extracted.csv    24,182
Milan_Rent_Extracted.csv     13,538
Milan_Sales_Full.csv         24,182
Milan_Rent_Full.csv          13,538

The remaining ~4,000 URLs pointed to expired or deleted listings that are no longer available on the site.


What This Means for You

If your team is hitting walls on Zillow, CoStar, Rightmove, Domain.com.au, or any other real estate portal behind bot protection, the problem usually isn't your scraper. It's the approach layer.

We run free 500-record proof-of-concept extractions. Give us the URL your team is struggling with. We extract a clean sample, format it to your spec, and deliver overnight.

No invoice. Just proof our infrastructure works.

Request Your Free Sample →


Protection defeated: Datadome · Records delivered: 37,720 · Time from start to delivery: under 24 hours