In-Depth Look into Popular Proxy APIs for Web Scraping

This investigation examines five leading proxy APIs aimed at bypassing bot protection systems and CAPTCHAs: Bright Data's Web Unlocker, Oxylabs' Web Unblocker, Smartproxy's Site Unblocker, Zyte API, and Crawlbase's Crawling API. We compare their features, pricing models, and, most importantly, success rates against challenging targets protected by anti-bot services like DataDome, Shape, and Akamai.

How Do Proxy APIs Work?

Proxy APIs provide hosted proxy infrastructure that handles IP rotation, headless browser management, and CAPTCHA solving automatically. To use one, you simply plug in the API credentials in place of a regular proxy and send requests as usual.
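
In practice, integration looks like an ordinary proxied request. Here is a minimal Python sketch assuming a hypothetical gateway hostname and credentials; the actual host, port, and TLS handling vary by vendor:

    import requests

    # Hypothetical gateway; each vendor documents its own host and port.
    proxies = {
        "http": "http://USERNAME:PASSWORD@unblock.example-provider.com:60000",
        "https": "http://USERNAME:PASSWORD@unblock.example-provider.com:60000",
    }

    # Many proxy APIs re-encrypt traffic with their own certificate,
    # so vendors often instruct you to disable TLS verification.
    response = requests.get(
        "https://www.example.com/product/123",
        proxies=proxies,
        verify=False,
        timeout=60,
    )
    print(response.status_code, len(response.content))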

The proxy API intercepts all traffic to the target. When it encounters a block or challenge, the system retries with different parameters until the request succeeds. This abstraction spares you the tedious configuration work that robust web scraping otherwise requires.
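
To appreciate what is being abstracted away, here is a rough sketch of the rotate-and-retry loop you would otherwise maintain yourself; the proxy pool and the block check are illustrative placeholders:

    import itertools
    import requests

    # A hand-rolled rotation pool - the proxy API runs an equivalent,
    # far more sophisticated loop on its own servers.
    proxy_pool = itertools.cycle([
        "http://user:pass@proxy1.example.com:8000",
        "http://user:pass@proxy2.example.com:8000",
    ])

    def fetch_with_retries(url, attempts=5):
        for _ in range(attempts):
            proxy = next(proxy_pool)
            try:
                r = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
                if r.ok and "captcha" not in r.text.lower():
                    return r
            except requests.RequestException:
                pass  # rotate to the next proxy and try again
        raise RuntimeError("all attempts were blocked")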

Some vendors market proxy APIs as "web unblockers" since their main purpose is bypassing bot mitigation defenses. But you shouldn't treat them only as CAPTCHA solvers – CAPTCHAs are just one of the obstacles modern websites put up.

Testing Methodology

We evaluated the proxy APIs by sending requests to seven popular e-commerce and social media websites protected by various bot mitigation systems. For each target, we made around 1,800 requests over 15 minutes using one proxy API at a time.

To verify success, we checked the response code, size, and page title. Each vendor knew the targets in advance, so they could prepare their infrastructure accordingly.
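
In code, the check looked roughly like this; the size threshold and expected title are illustrative values, not the exact ones we used:

    import re

    def looks_successful(response, min_bytes=20_000, expected_title="amazon"):
        # 1. Status code: blocks often arrive as 403/429 responses.
        if response.status_code != 200:
            return False
        # 2. Size: some anti-bot systems return empty or stub pages.
        if len(response.content) < min_bytes:
            return False
        # 3. Title: challenge and CAPTCHA pages carry their own <title>.
        match = re.search(r"<title>(.*?)</title>", response.text, re.S | re.I)
        return bool(match) and expected_title in match.group(1).lower()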

Here are the websites we tested:

  • Amazon – In-house anti-bot system, returns empty responses
  • Google – reCAPTCHA
  • Social media website – Requires login when blocked
  • Kohl's – Akamai Bot Manager
  • Nordstrom – F5 Shape Defense
  • Petco – DataDome
  • Walmart – ThreatMetrix, Akamai, PerimeterX, FingerprintJS

Our test computer was located in Germany and accessed the US versions of each website.

Unblocking Performance Results

Overall, the proxy APIs achieved high success rates across all targets – never dropping below 80%. Nordstrom, protected by Shape, posed the biggest difficulty. The social media website also often redirected to login when blocked.

Out of 1,800 requests per website, here are the average success rates:

Target          Avg. Success Rate
Kohl's          99.01%
Amazon          98.65%
Walmart         96.38%
Google          96.00%
Petco           95.48%
Social Media    92.19%
Nordstrom       80.16%

And the average response times:

Target          Avg. Response Time
Kohl's          28.97 s
Amazon          5.73 s
Walmart         12.68 s
Google          5.41 s
Petco           11.20 s
Social Media    19.74 s
Nordstrom       33.10 s

So which proxy API performed the best overall?

  • Zyte had the highest success rate at 97.82% and was very quick. But it did require optimizer tweaks for one target.
  • Oxylabs and Smartproxy focused on maximizing success, though at the cost of speed due to headless browsers.
  • Crawlbase was relatively fast but also recorded more failures across the board.
  • Bright Data rendered pages remarkably fast but sometimes got blocked on certain sites.

Features and Limitations

Proxy APIs aim to function as a drop-in replacement for proxies. As such, they all support basic functionality like location targeting and sessions.
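
Both are typically controlled through the proxy username or dedicated request parameters. The flag syntax below ("cc-us", "session-abc123") follows a common convention but is purely illustrative; every vendor defines its own:

    # Sticky session + US targeting encoded in the proxy username.
    # Flag names and their order are vendor-specific placeholders.
    session_id = "abc123"
    username = f"USERNAME-cc-us-session-{session_id}"
    proxy_url = f"http://{username}:PASSWORD@unblock.example-provider.com:60000"
    proxies = {"http": proxy_url, "https": proxy_url}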

However, you inevitably sacrifice some control compared to running your own proxies or browsers. Let's examine the extent of configurability for request handling and interacting with dynamic page content.

Feature                    Oxylabs                          Bright Data                Smartproxy                       Crawlbase
Localization               Countries, cities, coordinates   Countries, cities, ASNs    Countries, cities, coordinates   26 countries
Sessions                   Yes                              Yes                        Yes                              Yes
JavaScript rendering       Yes                              Yes (automatic only)       Yes                              Yes
Custom headers & cookies   Yes                              Yes                        Yes                              Yes
POST requests              Yes                              Yes                        Yes                              Yes
Page interactions          No                               No                         No                               Wait for load, scroll

Bright Data has the most limited control – it handles JavaScript rendering under the hood without any configurability.

Oxylabs and Smartproxy allow more request customization. But they still don't permit interacting with page contents after load.

Crawlbase provides the most flexibility via waiting and scrolling. However, this functionality requires a separate API package.
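
With a token-based crawling API, rendering and basic interactions are usually toggled through query parameters. The endpoint and parameter names below are illustrative rather than any vendor's exact API:

    import requests

    params = {
        "token": "YOUR_API_TOKEN",               # hypothetical credential
        "url": "https://www.example.com/deals",
        "javascript": "true",                    # render in a headless browser
        "page_wait": "3000",                     # wait 3 s after load
        "scroll": "true",                        # scroll to load lazy content
    }
    response = requests.get("https://api.example-crawler.com/", params=params)
    print(response.status_code)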

In summary, proxy APIs come with inherent limitations for sites requiring heavy post-load interaction. You can render JavaScript, and in some cases wait or scroll, but not properly simulate clicks or complex interactions. This may suffice for many use cases, but won't replace a full browser automation suite.

Pricing Model Comparison

While marketed as proxies, proxy APIs employ varied pricing models:

Pricing Metric        Providers Using It
Bandwidth             Oxylabs, Smartproxy
Successful requests   Bright Data, Crawlbase, Zyte

Charging per successful request is generally more economical for web scraping than paying for bandwidth. Some vendors get creative with their pricing:

  • Bright Data – more expensive premium domains
  • Crawlbase – 2x price for JavaScript rendering
  • Zyte – dynamic rate based on page difficulty

Oxylabs and Smartproxy don't implement multipliers. But at $1+ per GB, you'd end up paying more when scraping heavier sites – regardless of success.
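
A quick back-of-the-envelope calculation shows why. Assume made-up but plausible rates of $1.60 per GB versus $1.50 per 1,000 successful requests, and fairly heavy 2.5 MB pages:

    requests_count = 100_000
    avg_page_mb = 2.5  # heavy, JavaScript-rendered pages

    bandwidth_cost = requests_count * avg_page_mb / 1024 * 1.60  # per-GB model
    request_cost = requests_count / 1_000 * 1.50                 # per-request model

    print(f"Bandwidth-based: ${bandwidth_cost:,.2f}")  # ~ $390
    print(f"Request-based:   ${request_cost:,.2f}")    # ~ $150

With lightweight pages of around 200 KB, the same bandwidth model drops to roughly $31 and comes out ahead – which is exactly the exception noted below.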

Overall, request-based pricing makes the most sense for web scraping. Exceptions could be APIs or small pages where bandwidth costs stay negligible.

Key Takeaways

  • Proxy APIs excel at bypassing anti-bot systems, achieving 80–99% success rates. But they can't perfectly emulate browser actions.
  • They reduce web scraping complexity by managing proxies, browsers, and CAPTCHAs internally. Yet this limits customizability compared to running your own infrastructure.
  • Paying per request is generally the better model for web scraping. But dynamic or traffic-based pricing can work for light pages.
  • Shape Defense posed the biggest challenge, followed by a social media platform that requires login when blocked. Other anti-bot systems didn't deter the tools significantly.

Author

Sergey Fadeev – web scraping expert with 5 years of experience in the field.
