In-Depth Look into Popular Proxy APIs for Web Scraping

This investigation examines five leading proxy APIs aimed at bypassing bot protection systems and CAPTCHAs: Bright Data's Web Unlocker, Oxylabs' Web Unblocker, Smartproxy's Site Unblocker, Zyte API, and Crawlbase's Crawling API. We compare their features, pricing models, and, most importantly, success rates against challenging targets protected by anti-bot services like DataDome, Shape, and Akamai.

How Do Proxy APIs Work?

Proxy APIs provide hosted proxy infrastructure that handles IP rotation, headless browser management, and CAPTCHA solving automatically. To use one, you simply plug in the API credentials in place of a regular proxy and send requests as usual.
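
In practice, integration looks like an ordinary proxied request. Here is a minimal Python sketch assuming a hypothetical gateway hostname and credentials; the actual host, port, and TLS handling vary by vendor:

    import requests

    # Hypothetical gateway; each vendor documents its own host and port.
    proxies = {
        "http": "http://USERNAME:PASSWORD@unblock.example-provider.com:60000",
        "https": "http://USERNAME:PASSWORD@unblock.example-provider.com:60000",
    }

    # Many proxy APIs re-encrypt traffic with their own certificate,
    # so vendors often instruct you to disable TLS verification.
    response = requests.get(
        "https://www.example.com/product/123",
        proxies=proxies,
        verify=False,
        timeout=60,
    )
    print(response.status_code, len(response.content))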

The proxy API intercepts all traffic to the target. When it encounters a block or challenge, the system retries with different parameters until the request succeeds. This abstraction spares you the tedious configuration work that robust web scraping otherwise requires.
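
To appreciate what is being abstracted away, here is a rough sketch of the rotate-and-retry loop you would otherwise maintain yourself; the proxy pool and the block check are illustrative placeholders:

    import itertools
    import requests

    # A hand-rolled rotation pool - the proxy API runs an equivalent,
    # far more sophisticated loop on its own servers.
    proxy_pool = itertools.cycle([
        "http://user:pass@proxy1.example.com:8000",
        "http://user:pass@proxy2.example.com:8000",
    ])

    def fetch_with_retries(url, attempts=5):
        for _ in range(attempts):
            proxy = next(proxy_pool)
            try:
                r = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
                if r.ok and "captcha" not in r.text.lower():
                    return r
            except requests.RequestException:
                pass  # rotate to the next proxy and try again
        raise RuntimeError("all attempts were blocked")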

Some vendors market proxy APIs as "web unblockers" since their main purpose is bypassing bot mitigation defenses. But you shouldn't treat them only as CAPTCHA solvers – CAPTCHAs are just one of the obstacles modern websites put up.

Testing Methodology

We evaluated the proxy APIs by sending requests to seven popular e-commerce and social media websites protected by various bot mitigation systems. For each target, we made around 1,800 requests over 15 minutes using one proxy API at a time.

To verify success, we checked the response code, size, and page title. Each vendor knew the targets in advance, so they could prepare their infrastructure accordingly.
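
In code, the check looked roughly like this; the size threshold and expected title are illustrative values, not the exact ones we used:

    import re

    def looks_successful(response, min_bytes=20_000, expected_title="amazon"):
        # 1. Status code: blocks often arrive as 403/429 responses.
        if response.status_code != 200:
            return False
        # 2. Size: some anti-bot systems return empty or stub pages.
        if len(response.content) < min_bytes:
            return False
        # 3. Title: challenge and CAPTCHA pages carry their own <title>.
        match = re.search(r"<title>(.*?)</title>", response.text, re.S | re.I)
        return bool(match) and expected_title in match.group(1).lower()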

Here are the websites we tested:

  • Amazon – In-house anti-bot system, returns empty responses
  • Google – reCAPTCHA
  • Social media website – Requires login when blocked
  • Kohl's – Akamai Bot Manager
  • Nordstrom – F5 Shape Defense
  • Petco – DataDome
  • Walmart – ThreatMetrix, Akamai, PerimeterX, FingerprintJS

Our test computer was located in Germany and accessed the US versions of each website.

Unblocking Performance Results

Overall, the proxy APIs achieved high success rates across all targets – never dropping below 80%. Nordstrom, protected by Shape, posed the biggest difficulty. The social media website also often redirected to login when blocked.

Out of 1,800 requests per website, here are the average success rates:

Target          Avg. Success Rate
Kohl's          99.01%
Amazon          98.65%
Walmart         96.38%
Google          96.00%
Petco           95.48%
Social Media    92.19%
Nordstrom       80.16%

And the average response times:

Target          Avg. Response Time
Kohl's          28.97 s
Amazon          5.73 s
Walmart         12.68 s
Google          5.41 s
Petco           11.20 s
Social Media    19.74 s
Nordstrom       33.10 s

So which proxy API performed the best overall?

  • Zyte had the highest success rate at 97.82% and was very quick. But it did require optimizer tweaks for one target.
  • Oxylabs and Smartproxy focused on maximizing success, though at the cost of speed due to headless browsers.
  • Crawlbase was relatively fast but also recorded more failures across the board.
  • Bright Data rendered pages remarkably fast but sometimes got blocked on certain sites.

Features and Limitations

Proxy APIs aim to function as a drop-in replacement for proxies. As such, they all support basic functionality like location targeting and sessions.
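
Both are typically controlled through the proxy username or dedicated request parameters. The flag syntax below ("cc-us", "session-abc123") follows a common convention but is purely illustrative; every vendor defines its own:

    # Sticky session + US targeting encoded in the proxy username.
    # Flag names and their order are vendor-specific placeholders.
    session_id = "abc123"
    username = f"USERNAME-cc-us-session-{session_id}"
    proxy_url = f"http://{username}:PASSWORD@unblock.example-provider.com:60000"
    proxies = {"http": proxy_url, "https": proxy_url}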

However, you inevitably sacrifice some control compared to running your own proxies or browsers. Let's examine the extent of configurability for request handling and interacting with dynamic page content.

Feature                    Oxylabs                          Bright Data                Smartproxy                       Crawlbase
Localization               Countries, cities, coordinates   Countries, cities, ASNs    Countries, cities, coordinates   26 countries
Sessions                   Yes                              Yes                        Yes                              Yes
JavaScript rendering       Yes                              Yes (automatic only)       Yes                              Yes
Custom headers & cookies   Yes                              Yes                        Yes                              Yes
POST requests              Yes                              Yes                        Yes                              Yes
Page interactions          No                               No                         No                               Wait for load, scroll

Bright Data has the most limited control – it handles JavaScript rendering under the hood without any configurability.

Oxylabs and Smartproxy allow more request customization. But they still don't permit interacting with page contents after load.

Crawlbase provides the most flexibility via waiting and scrolling. However, this functionality requires a separate API package.
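
With a token-based crawling API, rendering and basic interactions are usually toggled through query parameters. The endpoint and parameter names below are illustrative rather than any vendor's exact API:

    import requests

    params = {
        "token": "YOUR_API_TOKEN",               # hypothetical credential
        "url": "https://www.example.com/deals",
        "javascript": "true",                    # render in a headless browser
        "page_wait": "3000",                     # wait 3 s after load
        "scroll": "true",                        # scroll to load lazy content
    }
    response = requests.get("https://api.example-crawler.com/", params=params)
    print(response.status_code)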

In summary, proxy APIs come with inherent limitations for sites requiring heavy post-load interaction. You can render JavaScript, and in some cases wait or scroll, but not properly simulate clicks or complex interactions. This may suffice for many use cases, but won't replace a full browser automation suite.

Pricing Model Comparison

While marketed as proxies, proxy APIs employ varied pricing models:

Pricing Metric        Providers Using It
Bandwidth             Oxylabs, Smartproxy
Successful requests   Bright Data, Crawlbase, Zyte

Charging per successful request is generally more economical for web scraping than paying for bandwidth. Some vendors get creative with their pricing:

  • Bright Data – more expensive premium domains
  • Crawlbase – 2x price for JavaScript rendering
  • Zyte – dynamic rate based on page difficulty

Oxylabs and Smartproxy don't implement multipliers. But at $1+ per GB, you'd end up paying more when scraping heavier sites – regardless of success.
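
A quick back-of-the-envelope calculation shows why. Assume made-up but plausible rates of $1.60 per GB versus $1.50 per 1,000 successful requests, and fairly heavy 2.5 MB pages:

    requests_count = 100_000
    avg_page_mb = 2.5  # heavy, JavaScript-rendered pages

    bandwidth_cost = requests_count * avg_page_mb / 1024 * 1.60  # per-GB model
    request_cost = requests_count / 1_000 * 1.50                 # per-request model

    print(f"Bandwidth-based: ${bandwidth_cost:,.2f}")  # ~ $390
    print(f"Request-based:   ${request_cost:,.2f}")    # ~ $150

With lightweight pages of around 200 KB, the same bandwidth model drops to roughly $31 and comes out ahead – which is exactly the exception noted below.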

Overall, request-based pricing makes the most sense for web scraping. Exceptions could be APIs or small pages where bandwidth costs stay negligible.

Key Takeaways

  • Proxy APIs excel at bypassing anti-bot systems, achieving 80–99% success rates. But they can't perfectly emulate browser actions.
  • They reduce web scraping complexity by managing proxies, browsers, and CAPTCHAs internally. Yet this limits customizability compared to running your own infrastructure.
  • Paying per request is generally the better model for web scraping. But dynamic or traffic-based pricing can work for light pages.
  • Shape Defense posed the biggest challenge, followed by a social media platform that requires login when blocked. Other anti-bot systems didn't deter the tools significantly.

Author

Sergey Fadeev – web scraping expert with 5 years of experience in the field.
