Web Scraping: An Invisible Security Threat


One of our clients recently mentioned that they were concerned about their data being scraped from the web. Web scraping does not appear in the OWASP Top Ten, but it is catalogued in the OWASP Automated Threats handbook as OAT-011 Scraping.

An example of web scraping

LinkedIn, the widely recognised professional networking platform, fell victim to extensive web scraping attacks. The culprits used automated software to systematically mine the public profiles of LinkedIn users, extracting data from millions of profiles and exposing those users to potential spam campaigns and phishing attacks. LinkedIn took 100 individuals to court but lost: the court did not consider scraping publicly available data a crime.

Fast forward to 2021, and LinkedIn found itself in the spotlight again when data from over 700 million profiles was found for sale on the darknet. The incident underscored the expanding threat of web scraping attacks and the need for robust security measures to protect user data.

Unveiling the Threat: What are Web Scraping Attacks?

Web scraping is a method used to extract data from websites. It’s commonly used in data analysis or machine learning, where large amounts of data must be collected and processed. However, web scraping becomes a potent weapon when utilised maliciously, as in the LinkedIn incidents.

In a web scraping attack, this technique is performed on a large scale. Attackers typically use bots to extract vast quantities of data from targeted websites without permission. The harvested data can range from user information to prices, product descriptions, or other proprietary content, and it can then be put to malicious use: creating counterfeit websites, undercutting competitors’ prices, or selling the data to third parties.

Meticulously planned attacks

Web scraping attacks are often meticulously planned and executed in stages to bypass existing security measures such as Web Application Firewalls (WAFs), Intrusion Detection Systems (IDS), and Intrusion Prevention Systems (IPS). Scraping attacks manage to exploit weaknesses in these systems mainly because they cannot analyse traffic historically, lack deep-learning capabilities, and cannot detect automated behaviour hidden in syntactically valid HTTP requests.

Modern-day attackers employ exploit kits comprising a combination of tools such as proxy IPs, multiple User Agents (UAs), and programmatic/sequential requests to intrude into web applications, mobile apps, and APIs. These attacks can severely compromise website security and disrupt business continuity.
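To illustrate why rule-based defences struggle, here is a minimal sketch of the rotation pattern described above. The user-agent strings, proxy addresses, and URLs are hypothetical placeholders; a real attack would draw from pools of thousands of residential proxies. No request that shares a fingerprint with its neighbour ever stands out to a per-IP or per-UA rule.

```python
import itertools

# Hypothetical pools -- real attacks use far larger, constantly refreshed sets.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080"]

def build_requests(urls):
    """Pair each URL with a rotating User-Agent and proxy, so no two
    consecutive requests share the same apparent fingerprint."""
    ua_cycle = itertools.cycle(USER_AGENTS)
    proxy_cycle = itertools.cycle(PROXIES)
    return [
        {
            "url": url,
            "headers": {"User-Agent": next(ua_cycle)},
            "proxy": next(proxy_cycle),
        }
        for url in urls
    ]

reqs = build_requests([f"https://shop.example/product/{i}" for i in range(6)])
```

Each individual request in `reqs` is a perfectly valid HTTP request; only the aggregate pattern across many requests reveals the automation.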

Understanding the Danger: Why are Web Scraping Attacks a Concern?

Web scraping attacks pose substantial threats to businesses and their online presence. These threats manifest in various forms:

1. Loss of Competitive Advantage

Like the LinkedIn scenario, scraping can lead to the loss of unique business data, which competitors can use to gain an unfair advantage.

2. Reduced Performance and Increased Costs

Web scraping bots can consume significant server resources, leading to slower site performance and increased hosting costs. In extreme cases, it can even result in a Denial of Service (DoS) situation.

3. Privacy Violations

If user information is scraped and sold, it can lead to severe privacy violations, potentially leading to legal complications and damage to the company’s reputation.

4. Intellectual Property Theft 

Web scraping can result in the theft of proprietary content or intellectual property, which can then be republished without consent.

What are some signs of web scraping taking place?

In one instance, a popular e-commerce platform was barraged with scraping attacks that generated hundreds of thousands of hits on its category and product pages over a fortnight. The attackers deployed a custom-built scraper engine and used an exploit kit with diverse combinations of hardware and software to circumvent web defence mechanisms. Here’s what the attackers did:

1. Fake Account Creation

The perpetrators targeted the sign-up page using various attack vectors. They created multiple fake User IDs (UIDs) to register bots as genuine users on the site. Using these fake accounts in combination with different device IDs, cookies, and UAs, they were able to pose as authentic users and generate perfectly valid HTTP requests to bypass traditional rule-based security measures.
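One defensive counterpart to this step is to correlate new sign-ups by device fingerprint rather than by UID or cookie, since those are exactly the attributes the attackers vary. The sketch below is a simplified assumption of what such a check could look like; the log format and threshold are invented for illustration.

```python
from collections import Counter

# Hypothetical sign-up log: (user_id, device_fingerprint, user_agent).
# UIDs and UAs differ, but the underlying device fingerprint repeats.
signups = [
    ("uid1", "fp-A", "UA-1"),
    ("uid2", "fp-A", "UA-2"),
    ("uid3", "fp-A", "UA-3"),
    ("uid4", "fp-B", "UA-1"),
]

def suspicious_fingerprints(signups, threshold=3):
    """Flag device fingerprints behind an unusually high number of
    new accounts -- a common fake-account-creation signal."""
    counts = Counter(fp for _, fp, _ in signups)
    return {fp for fp, n in counts.items() if n >= threshold}

flagged = suspicious_fingerprints(signups)  # fp-A registered three accounts
```

In practice the fingerprint itself must be robust (canvas, TLS, or hardware signals), since naive fingerprints are as easy to spoof as cookies.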

2. Scraping of Category Pages

The attackers logged into the website with the fake UIDs and made hundreds of thousands of hits on category pages, scraping the content from the category results.

3. Price and Product Information

After scraping the category pages, the attackers executed hundreds of thousands of hits on specific product pages, storing targeted product prices and product details in their database. The perpetrators maintained a real-time database of the e-commerce portal’s entire product catalogue. They regularly tracked price changes to keep their database updated with the most recent pricing information.

The legal status of web scraping remains ambiguous, adding another layer of complexity to this issue. While LinkedIn sought legal injunctions to block the scrapers, the court ruled against them, stating that the data being scraped was publicly accessible and, therefore, not protected under the existing laws. This case highlighted the legal vacuum in which web scraping operates, making it even more critical for businesses to take technical measures to protect themselves from scraping attacks. There are also valid reasons for web scraping, such as when search engines check a site to provide relevant search results.   

Though web scraping attacks are potent and potentially devastating, businesses can adopt several measures to safeguard themselves:

1. Monitor Web Traffic

Regularly monitoring web traffic can help identify unusual patterns or spikes in traffic, which may indicate a scraping attack.

Identifying Highly Active, Non-Purchasing Accounts: E-commerce portals should monitor accounts that are highly active but have not made any purchases over an extended period. Such accounts may be operated by bots that imitate real users to scrape product details and pricing information.

Monitoring Unusual Traffic on Selected Product Pages: E-commerce businesses should keep an eye on unusual spikes in page views of certain products, which can often be periodic. A sudden surge in engagement on select product pages could indicate non-human activity on the site.

Competitor Monitoring for Price Tracking: Many e-commerce firms deploy bots or hire professionals to scrape product details and pricing information from their competitors’ sites. Businesses should regularly monitor competitors for signs that their own prices and product catalogue are being matched.

Identifying Automated Activity in Legitimate User Behaviour: Sophisticated bots can simulate mouse movements, perform random clicks, and navigate pages in a human-like manner. Preventing such attacks requires deep behavioural models, device/browser fingerprinting, and closed-loop feedback systems. Purpose-built bot mitigation solutions can identify such sophisticated automated activities and help you act against them. In contrast, traditional solutions such as WAFs are limited to tracking spoofed cookies, user agents, and IP reputation.
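As a minimal sketch of the first signal above, an e-commerce site could periodically scan per-account activity for heavy browsing with no purchases. The account statistics and threshold here are invented for illustration; flagged accounts warrant closer inspection, not automatic bans.

```python
# Hypothetical per-account statistics over a 30-day window.
accounts = {
    "alice":    {"page_views": 40,    "purchases": 2},
    "bot-like": {"page_views": 12000, "purchases": 0},
    "bob":      {"page_views": 300,   "purchases": 1},
}

def flag_non_purchasing(accounts, view_threshold=1000):
    """Return accounts with heavy browsing but zero purchases --
    candidates for review as possible scraper accounts."""
    return [
        name
        for name, stats in accounts.items()
        if stats["page_views"] >= view_threshold and stats["purchases"] == 0
    ]

flagged = flag_non_purchasing(accounts)
```

Real deployments would combine this with the other signals (traffic spikes per product page, behavioural fingerprinting) before taking action, since some legitimate users also browse heavily without buying.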

2. CAPTCHA

CAPTCHAs can effectively distinguish between human users and bots, making it harder for scraping bots to access your site’s data.

3. Rate Limiting

Implementing rate limiting restricts the number of requests a user (or bot) can make within a specific time frame. Such measures can slow down or halt a scraping attack.

4. Web Application Firewalls (WAFs)

A WAF can help to detect and block suspicious activity, including potential scraping attacks.

Conclusion

The threat of web scraping attacks is a concern for any company providing valuable information online. While scraping appears outside the OWASP Top Ten, it should not be underestimated, especially given its potential to compromise a company’s competitive advantage, privacy, and intellectual property.

LinkedIn’s failed legal battles show the legal ambiguity surrounding web scraping. Therefore, companies cannot rely solely on legal means to combat this issue. The onus is on organisations to implement robust technical measures and strategies to detect and mitigate scraping attacks, ranging from monitoring web traffic and identifying suspicious user behaviour to implementing CAPTCHA, rate limiting, and Web Application Firewalls.

However, it’s essential to remember that these measures should be part of a comprehensive security strategy. While they can help protect against scraping attacks, they might not be sufficient on their own. Businesses should be proactive and adopt an integrated approach, combining these techniques with other best practices to fortify their online presence.

Ultimately, while web scraping can pose significant threats, with the right tools and strategies, businesses can navigate this complex landscape, safeguarding their data and maintaining their competitive edge.

At Gislen Software, we can help you build web applications with built-in security or add web scraping protection to existing sites. Contact us to discuss software development!
