How to detect and block Visualforce page scraping?


We are considering developing a new version of our website (where our customers can purchase and manage certain types of content) on the Force.com platform. Looking into the impact of governor limits, we analyzed some visitor and page-request data from our current website. One of the conclusions was that there is a high likelihood that some of our customers are web scraping our website for information.

If we were to build our website on the Force.com platform using Sites, we would be subject to the Sites limitations (40 GB of bandwidth per rolling 24 hours, 60 hours of page request processing time per rolling 24 hours, and 1 million unauthorized page requests). Even though these limits are fairly high, robots scraping our pages will definitely increase the rate at which we reach them (and some governor limits, like callouts).

How can we analyze our web traffic to identify which customers are doing this (authenticated users on the Customer Portal), and how can we block it?


Attribution to: Samuel De Rycke

Possible Suggestion/Solution #1

Use a custom controller for each of the Visualforce pages and have it track usage, either directly against the logged-in contacts or in a new custom object.

I'd suggest a TrackingController that extends the StandardController, and then have each page-specific controller extend the TrackingController.
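
A minimal sketch of that idea, simplified to a plain virtual base class rather than wiring in the StandardController; the Page_View__c object and its fields are assumptions, not part of the original answer. Because DML is not allowed in a Visualforce controller constructor, the page would invoke the tracking through its action attribute, e.g. <apex:page controller="CatalogController" action="{!trackView}">:

    // Base class: records one row per page view for the logged-in user.
    public virtual class TrackingController {
        public PageReference trackView() {
            insert new Page_View__c(
                User__c = UserInfo.getUserId(),
                Page__c = ApexPages.currentPage().getUrl(),
                Viewed_At__c = System.now()
            );
            return null; // stay on the requested page
        }
    }

    // Each page-specific controller simply extends the tracking base class.
    public class CatalogController extends TrackingController {
        // page-specific properties and methods go here
    }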


Attribution to: David Gillen

Possible Suggestion/Solution #2

Another way to think about the underlying issue:

Is the concern just limits, or do you have another reason to prevent your customers from scraping your data? If it's just limits, could you provide the data customers want in another fashion? For example, a CSV (or even XML) file with just the data a customer wants would represent far less bandwidth than a full HTML page render, and potentially fewer page requests if customers currently have to scrape multiple pages to get all the data they need. Is the information at all cacheable? Could you precompute a download file, hourly or daily, and host it on AWS or Heroku, for example?
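
As a rough illustration (not from the original answer), an Apex controller could assemble the CSV once per request for a hypothetical Product__c object; a Visualforce page with contentType="text/csv#products.csv" and cache="true" could then render {!csv} so customers download one small file instead of scraping HTML pages:

    // Builds the CSV body; the object, fields and row limit are placeholders
    // for whatever data customers actually want to download.
    public with sharing class ProductCsvController {
        public String csv { get; private set; }

        public ProductCsvController() {
            List<String> rows = new List<String>{ 'Name,Price' };
            for (Product__c p : [SELECT Name, Price__c FROM Product__c LIMIT 1000]) {
                rows.add(p.Name + ',' + String.valueOf(p.Price__c));
            }
            csv = String.join(rows, '\n');
        }
    }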


Attribution to: Jason Clark

Possible Suggestion/Solution #3

Look in particular at the User-Agent HTTP header:

String userAgent = System.currentPageReference().getHeaders().get('User-Agent');

Some scrapers might set this to emulate a real browser, but you may get some insight here.
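
A rough illustration of what such a check might look like (the pattern list is an assumption; scrapers that spoof a browser User-Agent will slip through):

    // Flags requests whose User-Agent looks like a script or HTTP library
    // rather than a browser.
    public static Boolean looksLikeBot(String userAgent) {
        if (String.isBlank(userAgent)) {
            return true; // real browsers always send a User-Agent
        }
        Pattern p = Pattern.compile('(?i)(bot|crawler|spider|curl|wget|python|java|httpclient)');
        return p.matcher(userAgent).find();
    }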


Attribution to: metadaddy

Possible Suggestion/Solution #4

Most screen scrapers do not support JavaScript, so having the page load content via JavaScript may be an option. However, if you go this route, I'd avoid the rerender functionality built into Visualforce, as each rerender counts as an additional page view against your quota as far as Sites is concerned.

You could also build on this by setting a cookie saying "this client is a bot" on every request from a client that does not have a cookie. Then, on all "this client is a bot" or uncookied requests, load an intermediate page where JavaScript changes this to a "this client is not a bot" cookie and redirects the client to your data.

All but the most persistent and well-coded bots should be put off by this approach.
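
A rough sketch of the server-side half of that idea (the cookie name, values, and interstitial page are made up for illustration):

    // Called as the page's action method. Uncookied or "bot"-cookied clients
    // are sent to an interstitial page whose JavaScript rewrites the cookie
    // and redirects back; everyone else gets the real content.
    public PageReference checkForBot() {
        Cookie status = ApexPages.currentPage().getCookies().get('client_status');
        if (status == null || status.getValue() == 'bot') {
            ApexPages.currentPage().setCookies(new Cookie[] {
                new Cookie('client_status', 'bot', null, 3600, false)
            });
            return Page.BotCheck; // hypothetical interstitial Visualforce page
        }
        return null; // cookie says this client is not a bot: render the page
    }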


Attribution to: ca_peterson

Possible Suggestion/Solution #5

If these users are authenticated, as you say (Customer Portal users), then most of those limits shouldn't apply.


Attribution to: Ryan Guest

Possible Suggestion/Solution #6

Determination

  • Scrapers tend to only follow links. Create an image tag that points to a Visualforce page with an ID in the parameter, and track hits to that Visualforce page in a custom object. If the corresponding image tag request is repeatedly missing, you've either got a person using Lynx or a scraper. (Note this doubles your bandwidth.)
  • Rate checking: scrapers can be built to act like browsers, but they tend to follow standard execution patterns, i.e. polling at standard intervals. By tracking every time your Visualforce page is read, you can determine patterns and associate probable scraping with an IP (see the logging sketch below).

    String ip = ApexPages.currentPage().getHeaders().get('X-Salesforce-SIP');
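
A possible sketch of that tracking (the Page_Hit__c object and its field names are assumptions); call it from the page's action method, since DML is not allowed in a Visualforce controller constructor:

    // Logs every read of the page with the caller's IP and a timestamp so
    // polling patterns can be analyzed (and rate limits enforced) later.
    public static void logHit() {
        String ip = ApexPages.currentPage().getHeaders().get('X-Salesforce-SIP');
        insert new Page_Hit__c(
            IP_Address__c = ip,
            User__c = UserInfo.getUserId(),
            Hit_Time__c = System.now()
        );
    }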
    

Action

  • Rotate your output template. Scrapers tend to be set up using anchor points and XPaths to find the data, so by restructuring your output every now and again you break their scripts. This could be done automatically by generating the HTML in Apex.
  • Another method is to render content as images.
  • Rate limit per IP (see the sketch after this list).
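
A hedged sketch of such a per-IP check, building on the hypothetical Page_Hit__c log above (the threshold and the one-minute window are arbitrary assumptions):

    // Returns true when an IP has exceeded an arbitrary request budget in
    // the last minute; the caller can then serve an error page instead.
    public static Boolean overRateLimit(String ip) {
        Datetime oneMinuteAgo = System.now().addMinutes(-1);
        Integer recentHits = [
            SELECT COUNT()
            FROM Page_Hit__c
            WHERE IP_Address__c = :ip
            AND Hit_Time__c >= :oneMinuteAgo
        ];
        return recentHits > 30; // e.g. more than 30 requests per minute
    }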

Attribution to: ebt

Possible Suggestion/Solution #7

Doing this in Apex isn't going to address the unauthorized-requests-per-month limit. It will also incur processing costs, although if you do it right they could be minimal.

If your robot traffic is so high that you're in danger of crossing 1 million hits per month, you should look at commercial CDNs (unless your app absolutely must be dynamic, but presumably it doesn't have to be if you're getting routinely scraped). They reduce the load on your site, and they also offer services that could explicitly block your scrapers.


Attribution to: jkraybill
This content is remixed from Stack Overflow or Stack Exchange. Please visit https://salesforce.stackexchange.com/questions/140
