Friday, May 23, 2008

Executive Decision

After toying with C# today, I've decided that it is way too process-intensive to write the application on a runtime environment like .NET or Java. What I need is a simple language that can download a page, rip through the text like a bandit, write the necessary fields to the database, and move on. I can organize the data when the search engine extracts it.

I can't commit to anything yet, but my spidey-sense is telling me that the crawler will be written in Perl with LWP. I suppose I could look at Ruby, too, but I already have my Camel book and have worked with LWP before. I haven't tied Perl to an RDBMS, but I have done it with PHP and it should be similar. Perl can also do some limited recursion from what I understand, and if it can't, I may be able to use a database back-end to store the stack of URLs.
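Here is roughly what I'm picturing, nothing more than a sketch. The database, the queue and pages tables, and their columns are names I'm making up for now, and the MySQL connection details are placeholders until I actually pick a back-end:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use DBI;

    # Placeholder connection details -- I haven't settled on a database yet.
    my $dbh = DBI->connect('dbi:mysql:database=crawler', 'user', 'password',
                           { RaiseError => 1 });

    my $ua = LWP::UserAgent->new;

    # Hypothetical queue table: pull one unvisited URL at a time
    # instead of recursing through links in memory.
    my ($url) = $dbh->selectrow_array(
        'SELECT url FROM queue WHERE visited = 0 LIMIT 1');

    while (defined $url) {
        my $response = $ua->get($url);

        if ($response->is_success) {
            my $content = $response->decoded_content;

            # Rip out the title as a stand-in for "the necessary fields".
            my ($title) = $content =~ m{<title>(.*?)</title>}is;

            $dbh->do('INSERT INTO pages (url, title, content) VALUES (?, ?, ?)',
                     undef, $url, $title, $content);
        }

        # Mark the URL done and grab the next one off the stack.
        $dbh->do('UPDATE queue SET visited = 1 WHERE url = ?', undef, $url);

        ($url) = $dbh->selectrow_array(
            'SELECT url FROM queue WHERE visited = 0 LIMIT 1');
    }

Parsing new links out of $content and pushing them into the queue table would replace the recursion I was worried about.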

I was ready to buy books at O'Reilly today (I chickened out of spending the money) and found a book on writing spiders. From the preview I surmised my crawler/spider must be registered. That means I have to go mainstream, doesn't it?
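If "registered" mostly means announcing who the bot is and playing nicely with robots.txt, it looks like LWP already has LWP::RobotUA for that. A rough sketch, with a made-up bot name and contact address:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::RobotUA;

    # Made-up bot name and contact address -- this just identifies the
    # crawler and lets LWP honor robots.txt for me.
    my $ua = LWP::RobotUA->new('benbot/0.1', 'me@example.com');
    $ua->delay(1);    # wait 1 minute between requests to the same host

    my $response = $ua->get('http://benrehberg.com/');
    print $response->status_line, "\n";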

And now, after some more reading, I have discovered that this crawler can be used to build an index for special purposes. I could build my own search engine for this site, for example, and get much better results than I do searching the Google index for benrehberg.com. I have searched for things I know I wrote about and never found them with Google. Building my own search engine and maintaining my own index of the site could prove useful if I keep writing about programming.
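The search side doesn't have to be fancy to start. Even a dumb substring query against the hypothetical pages table from the crawler sketch above would probably beat what I'm getting out of Google for this site:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Same placeholder database and pages table as in the crawler sketch.
    my $dbh = DBI->connect('dbi:mysql:database=crawler', 'user', 'password',
                           { RaiseError => 1 });

    my $term = shift @ARGV or die "usage: search.pl <term>\n";

    # Crude LIKE search; a real index would tokenize and rank the pages.
    my $sth = $dbh->prepare(
        'SELECT url, title FROM pages WHERE content LIKE ?');
    $sth->execute('%' . $term . '%');

    while (my ($url, $title) = $sth->fetchrow_array) {
        print "$title\n  $url\n";
    }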

Update: I have created a new label "Web Crawler" for all posts related to this project.
