Using node.js and jquery to scrape websites
I have been playing with Node.js for the last few days and am totally head over heels. Madly in love! It's awesome how much you can build with how little. I have ranted about Node.js earlier and did some comparisons too. It's fast, really fast. And it's plain old JavaScript, the language we have been using for many years now. I thought I would build a real-world application with it to see how well it holds up. Earlier I thought of building something on top of Riak, but that felt like running too fast. Instead I picked something simpler that deals only with Node.js. First, though, it makes sense to brush up on some JavaScript fundamentals.
Javascript objects
Yes, JavaScript is an object-oriented language. But it's different from traditional classical OO languages like Java and Ruby.
- One obvious difference is in syntax.
- The other major one is that other languages have methods, while JavaScript has first-class functions.
First-class functions. What does that mean? It means functions are expressions: they can be assigned to a variable and passed around easily. Does that sound like a closure in Ruby? It does indeed. Though it's a little more than that; I will come back to this some other time. For now, let's find out how we can create objects and use them. I will show you two ways to do it.
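A tiny sketch of that idea: a function stored in a variable and handed to another function, like any other value.

```javascript
// A function stored in a variable, just like a number or a string
var square = function (n) { return n * n; };

// A function that accepts another function as an argument
function applyTwice(fn, value) {
  return fn(fn(value));
}

console.log(applyTwice(square, 3)); // 81
```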
The Classical way
Here is a constructor function for object Shape. It accepts two parameters and saves them into respective instance variables.
function Shape(width, height) {
    this.width = width;               // instance variable width
    this.height = height;             // instance variable height
    this.getArea = function() {       // function to calculate area; notice the assignment
        return this.width * this.height;
    };
}

var rectangle = new Shape(2, 5);      // instantiate a new Shape object
console.log(rectangle.getArea());     // calculate the area: 10
JavaScript uses prototype chains to add new functions or variables to an object on the fly. You can read more about this here: http://www.packtpub.com/article/using-prototype-property-in-javascript
I will add a new function to calculate the perimeter of my Shape object.
Shape.prototype.getPerimiter = function() {
    return 2 * (this.width + this.height);
};

console.log(rectangle.getPerimiter());
What happened here? Did you notice that even though 'rectangle' was already defined, it could access the newly added function to calculate the perimeter? Wasn't that awesome? JavaScript is intelligent, dude. If you ask for something, it looks at the current object first, and if it's not found there, it goes up the object's prototype chain to look for what you asked. And since we added the new function to the prototype, it's found right away. There is a lot of interesting stuff going on here; you should read about it. I would suggest buying Manning's JavaScript Ninja if you are really serious about it.
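Here is a small sketch of that lookup order: an object's own property wins, and only when nothing is found locally does JavaScript walk up the prototype chain.

```javascript
function Shape(width, height) {
  this.width = width;
  this.height = height;
}

Shape.prototype.getArea = function () {
  return this.width * this.height;
};

var rect = new Shape(2, 5);
console.log(rect.getArea());                 // 10 -- found on the prototype
console.log(rect.hasOwnProperty('getArea')); // false -- it lives on Shape.prototype

// An own property shadows the prototype's version
rect.getArea = function () { return -1; };
console.log(rect.getArea());                 // -1 -- own property wins
```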
Now, let’s try to extend Shape. I will create a new constructor function for Square.
function Square(side) {
    this.width = side;
    this.height = side;
}

Square.prototype = new Shape();
var sq = new Square(4);
console.log(sq.getArea());
I created a new Square constructor and overrode its prototype with an instance of Shape, so Square picks up all of Shape's functionality and behavior. Easy… huh?
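One small thing worth knowing about this pattern: overwriting Square.prototype also overwrites its constructor reference, so it is common to set it back by hand. The instanceof operator still walks the chain correctly either way. A quick sketch:

```javascript
function Shape(width, height) {
  this.width = width;
  this.height = height;
  this.getArea = function () { return this.width * this.height; };
}

function Square(side) {
  this.width = side;
  this.height = side;
}

Square.prototype = new Shape();        // inherit from Shape
Square.prototype.constructor = Square; // restore the constructor reference

var sq = new Square(4);
console.log(sq.getArea());         // 16
console.log(sq instanceof Square); // true
console.log(sq instanceof Shape);  // true -- instanceof walks the prototype chain
```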
The Prototypal way
Let’s do the same thing without using constructors now. Just plain prototypes!
var Shape = {
    getArea: function() {
        return this.width * this.height;
    },
    getPerimiter: function() {
        return 2 * (this.width + this.height);
    }
};

var rec = Object.create(Shape);
rec.width = 2;
rec.height = 5;
console.log(rec.getArea());
Now that you have the Shape object, you can easily add new functions to its prototype chain, or even use it as the prototype of another object. However, I find this approach a little clumsy, so I would rather stick to the classical way. Take your pick. To each his own!
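In the prototypal style, too, anything you add to the Shape object later becomes visible on objects already created from it, because they all delegate to the same prototype:

```javascript
var Shape = {
  getArea: function () { return this.width * this.height; }
};

var rec = Object.create(Shape);
rec.width = 2;
rec.height = 5;

// Add a function to Shape *after* rec was created...
Shape.describe = function () {
  return this.width + 'x' + this.height + ' = ' + this.getArea();
};

// ...and rec picks it up through delegation
console.log(rec.describe()); // "2x5 = 10"
```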
Node.js Modules
Node uses the CommonJS module system: a simple module-loading system where files and modules are in one-to-one correspondence. Here is the API: http://nodejs.org/api.html. The example above can be ported to the Node.js module ecosystem as explained below.
First, create Shape.js
function Shape(width, height) {
    this.width = width;               // instance variable width
    this.height = height;             // instance variable height
    this.getArea = function() {       // function to calculate area; notice the assignment
        return this.width * this.height;
    };
}

// Export this module (note: module.exports, not exports.module)
module.exports = Shape;
And now, use this
var Shape = require('./Shape');

var rectangle = new Shape(2, 5);
console.log(rectangle.getArea());
Node.js loads and runs each module in a sandbox which staves off any possible name collision. That’s the benefit you get apart from having a properly structured code base.
Writing a screen scraping application
I will write a simple application to capture details from various websites. The beautiful thing is Javascript has been handling DOM objects for years. In fact Javascript was created to handle DOM objects. No wonder that it’s more mature than any other html parsing library. Also, given that there are many elegant frameworks like Prototype, Mootools, JQuery etc. available to use, scraping websites with Node.js should be easy and fun. Let’s do it. Let’s write an application to collect data from various book selling websites.
Create a basic searcher.js module. It provides the fundamental skeleton for writing website-specific tools.
// External modules
var request = require('ahr'),    // Abstract-HTTP-request https://github.com/coolaj86/abstract-http-request
    sys     = require('sys'),    // System
    events  = require('events'), // EventEmitter
    jsdom   = require('jsdom');  // jsdom https://github.com/tmpvar/jsdom

var jQueryPath = 'http://code.jquery.com/jquery-1.4.2.min.js';
var headers = {'content-type': 'application/json', 'accept': 'application/json'};

// Export searcher
module.exports = Searcher;

function Searcher(param) {
    if (param.headers) {
        this.headers = param.headers;
    } else {
        this.headers = headers;
    }
    this.merchantName = param.merchantName;
    this.merchantUrl  = param.merchantUrl;
    this.id           = param.merchantUrl;
}

// Inherit from EventEmitter
Searcher.prototype = new process.EventEmitter;

Searcher.prototype.search = function(query, collector) {
    var self = this;
    var url  = self.getSearchUrl(query);
    console.log('Connecting to... ' + url);
    request({uri: url, method: 'GET', headers: self.headers, timeout: 10000}, function(err, response, html) {
        if (err) {
            self.onError({error: err, searcher: self});
            self.onComplete({searcher: self});
        } else {
            console.log('Fetched content from... ' + url);
            // Create a DOM window from the HTML data
            var window = jsdom.jsdom(html).createWindow();
            // Load jQuery into the DOM window, then call the parser
            jsdom.jQueryify(window, jQueryPath, function() {
                self.parseHTML(window);
                self.onComplete({searcher: self});
            });
        }
    });
}

// Implemented in the inheriting class
Searcher.prototype.getSearchUrl = function(query) {
    throw "getSearchUrl() is unimplemented!";
}

// Implemented in the inheriting class
Searcher.prototype.parseHTML = function(window) {
    throw "parseHTML() is unimplemented!";
}

// Emits an 'item' event when an item is found
Searcher.prototype.onItem = function(item) {
    this.emit('item', item);
}

// Emits a 'complete' event when the searcher is done
Searcher.prototype.onComplete = function(searcher) {
    this.emit('complete', searcher);
}

// Emits an 'error' event
Searcher.prototype.onError = function(error) {
    this.emit('error', error);
}

Searcher.prototype.toString = function() {
    return this.merchantName + "(" + this.merchantUrl + ")";
}
Now, here is the code to scrape Rediff Books. I will name it searcher-rediff.js.
var Searcher = require('./searcher');

var searcher = new Searcher({
    merchantName: 'Rediff Books',
    merchantUrl: 'http://books.rediff.com'
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
    return this.merchantUrl + "/book/" + query;
}

searcher.parseHTML = function(window) {
    var self = this;
    window.$('div[id="prod_detail"]').each(function() {
        var item   = window.$(this);
        var title  = item.find('#prod_detail2').find('font[id="book-titl"]').text();
        var link   = item.find('#prod_detail2').find('a').attr('href');
        var author = item.find('#prod_detail2').find('font[id="book-auth"]').text();
        var price  = item.find('#prod_detail2').find('font[id="book-pric"]').text();
        self.onItem({ title: title, link: link, author: author, price: price });
    });
}
Run it now.
var searcher = require('./searcher-rediff');

searcher.on('item', function(item) {
    console.log('Item found >> ' + item);
});

searcher.on('complete', function(searcher) {
    console.log('searcher done!');
});

searcher.search("Salman");
What did I do?
- First, I wrote a skeleton Searcher class. It makes the HTTP request, builds a jQuery-ready DOM window from the response, and emits events as items are found.
- Second, I wrote another class that extends Searcher and knows how to talk to Rediff. This class implements:
  - getSearchUrl, which returns the appropriate search URL to connect to, and
  - parseHTML, which scrapes data from the DOM's window object. This is very interesting: you can use all your jQuery knowledge to pick elements and parse data from inside them, just like in the old days when you added styles or data to random elements.
Now, if I want to search, say, Flipkart along with Rediff, I just need to write a Flipkart-specific implementation, say searcher-flipkart.js.
var Searcher = require('./searcher');

var searcher = new Searcher({
    merchantName: 'Flipkart',
    merchantUrl: 'http://www.flipkart.com'
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
    return this.merchantUrl + "/search-book" + '?query=' + query;
}

searcher.parseHTML = function(window) {
    var self = this;
    window.$('.search_result_item').each(function() {
        var item  = window.$(this);
        var title = item.find('.search_result_title').text().trim().replace(/\n/g, "");
        var link  = self.merchantUrl + item.find('.search_result_title').find("a").attr('href');
        var price = item.find('.search_results_list_price').text().trim().replace(/\n/g, "");
        self.onItem({ title: title, link: link, price: price });
    });
}
I have also written a Runner class to execute multiple searchers in parallel and collect the results into an array. You can find the entire source code here: https://github.com/anismiles/jsdom-based-screen-scraper Chill!
What's next? I am going to write about Node.js pretty feverishly. You'd better stay posted. How about a blog engine on Riak?
I have a question: is it possible to scrape sites with JavaScript in the page? What I am asking is, for example:
This is the page source.
document.write(“Test”);
And this is the result after the JavaScript is processed by the parser.
document.writeln(“Test”);
Test
Thank you,
Ventura
Jorge Ventura
December 8, 2010 at 9:19 am
Ventura,
Yeah, it's possible, but you would need a way to execute the on-page JavaScript in a sandbox. Node.js can easily help you do that.
Animesh
January 2, 2011 at 2:35 pm
You can scrape sites with JS on the page using jsdom's jsdom.env() function.
Check it out under the heading Easy Mode on the jsdom GitHub page: https://github.com/tmpvar/jsdom
aaron
March 15, 2011 at 3:13 pm
I am sorry, I was trying to post HTML code but it doesn't work here.
Ventura
Jorge Ventura
December 8, 2010 at 9:22 am
Great example. Thanks for sharing.
I did run into an issue when trying to run your example for searcher-rediff.js. When it tries to create the window via jsdom, it throws a stack trace that starts with this error:
TypeError: Cannot read property ‘protocol’ of undefined
Does this mean jsdom can no longer parse the HTML correctly because the page has changed?
Rob
January 2, 2011 at 11:51 am
Rob,
Did you check the HTML content? Is it getting fetched properly? BTW, against which URL does this error occur?
-Animesh
Animesh
January 2, 2011 at 2:36 pm
Yes, the HTML content is coming back properly. This is the URL I am fetching: http://books.rediff.com/book/Salman. The error is thrown at this line in searcher.js:
var window = jsdom.jsdom(html).createWindow();
I believe it has to do with the inline javascript call in the HTML that looks like this:
s.src = (document.location.protocol == "https:" ? "https://sb" : "http://b") + ".scorecardresearch.com/beacon.js";
Rob
January 2, 2011 at 11:29 pm
I see. I will look into this and get back to you. However, did it work for other URLs?
Animesh
January 3, 2011 at 10:01 am
Hi Animesh, I am wondering if Node.js will work for this situation: http://stackoverflow.com/questions/5054818/php-page-protection-for-cron-task-only
Is it compatible with current MySQL?
Do we have to learn this as a whole new language, or can we easily reuse certain things from PHP etc.?
wonderful
February 20, 2011 at 3:01 pm
Yep, Node has MySQL libraries to work with. For your case, I think a good way might be to run the Node stuff in a process and bridge it with PHP via the shell. Node is just JavaScript; there is hardly anything new. You can very easily pick it up. You can start from here: https://anismiles.wordpress.com/2010/11/11/wtf-is-node-js-and-what%E2%80%99s-the-fuss-all-about/
-Animesh
Animesh
February 21, 2011 at 10:23 am
Great writeup. I’m trying to run the searcher-server code, and I keep getting:
TypeError: Object # has no method ‘on’
at Object. (/Users/avishai/Downloads/anismiles-jsdom-based-screen-scraper-f0c79d3/searcher-server.js:9:10)
at param (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:146:21)
at param (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:157:15)
at pass (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:162:10)
at Object.router [as handle] (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:168:6)
at next (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/index.js:218:15)
at Server.handle (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/index.js:231:3)
at Server.emit (events.js:45:17)
at HTTPParser.onIncoming (http.js:1078:12)
at HTTPParser.onHeadersComplete (http.js:87:31)
Do you know why this might be?
Avishai
March 4, 2011 at 1:51 am
Avishai, what version of Node are you using? I see that you have Connect version 0.5.10, which I think should be fine.
-Animesh
Animesh
March 4, 2011 at 6:39 pm
The fundamental reason behind this bug should be something to do with EventEmitter. Let me explain:
1. searcher.js inherits from EventEmitter
(Line-26) Searcher.prototype = new process.EventEmitter;
2. searcher-rediff.js, searcher-flipkart.js and searcher-landmarkonthenet.js extend from searcher.js, so they also inherit from EventEmitter.
3. The 'on' method is actually defined in EventEmitter.
So, I think, for some reason, searcher.js is not able to inherit from EventEmitter, and hence the 'on' method is missing.
Animesh
March 4, 2011 at 6:42 pm
Is there a good way to do this on websites that require you to log in first before running a search?
Avishai
April 4, 2011 at 7:02 pm
I think you can easily log in to a site by sending a POST request.
Animesh
April 4, 2011 at 7:12 pm
Your “Javascript objects” helped a lot. Thanks.
Sanjeev Kumar Dangi (@skdangi)
November 13, 2011 at 8:52 pm
“Square.prototype = new Shape(); ”
Here the Shape constructor is called without any arguments, but its definition has two parameters, width and height. I checked it; it works. Does JavaScript also create default no-argument constructors itself?
Sanjeev Kumar Dangi (@skdangi)
November 29, 2011 at 7:50 pm
No. Think of JS not as a logical/democratic world; it's more like anarchy. 🙂 Internally, JS accepts parameters as key-value pairs, and when you don't pass anything, the key-value pairs just stay blank. It's not an error. And if you look for those parameters, you will see 'undefined'. One more difference between 'undefined' and 'null'… eh?
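A tiny illustration of what happens to the missing parameters:

```javascript
function Shape(width, height) {
  this.width = width;
  this.height = height;
}

var s = new Shape();            // no arguments -- perfectly legal in JS
console.log(s.width);           // undefined: the parameter was never supplied
console.log(s.width === null);  // false: undefined and null are distinct values
console.log(typeof s.width);    // "undefined"
```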
Chill!
Animesh
December 30, 2011 at 10:23 am
Hi, this post is really interesting, and while I'm trying to get the picture, I don't understand how the function searcher.getSearchUrl = function(query) { return this.merchantUrl + "/book/" + query; } in searcher-rediff.js gets called. Thanks a lot.
Yaver
December 29, 2011 at 5:40 pm
searcher.js ==> Searcher.prototype.search ==> line 30
Animesh
December 30, 2011 at 10:21 am
Hi Animesh,
Sorry for being naive, but would this need to run on the server side? The reason I ask is that I need to scrape a website and show the results in a mobile application using PhoneGap, and I was wondering whether this script could run on the client side or would need to be deployed on the server side. Also, could you please give an example of how to use POST for a website that requires login (I have the username and password)?
Thanks
Tarun
Tarun
April 1, 2013 at 2:00 am
Sure, you can run this on the client side. However, you will need to modify it a bit.
Animesh
April 1, 2013 at 10:09 am
To greatly simplify and speed up your code, try promise-parser:
http://www.npmjs.org/package/promise-parser
JD
June 12, 2014 at 3:20 am
you should check out promise-parser
http://www.npmjs.org/package/promise-parser
http://github.com/rc0x03/node-promise-parser
Features
Fast: uses libxml C bindings
Lightweight: no dependencies like jQuery, cheerio, or jsdom
Clean: promise based interface- no more nested callbacks
Flexible: supports both CSS and XPath selectors
JD
June 12, 2014 at 3:23 am
The request times out. This is the code I am using:

app.get('/flipkart', function(req, res) {
    var searcher = require('./searcher-flipkart');
    searcher.on('item', function(item) {
        console.log('Item found >> ' + item);
    });
    searcher.on('complete', function(searcher) {
        console.log('searcher done!');
    });
    searcher.search("Salman");
});
Sum
July 22, 2014 at 4:17 pm