Ben Rehberg's Weblog: Google

Showing posts with label Google.

Friday, October 01, 2010

Random Thinking

I've just come to the conclusion, at 12:48 this morning, that I don't write here much anymore because I really don't care enough to do so. I am not that passionate about anything anymore to the point that I feel I should write about it.

Apparently I'm passionate enough about not being passionate, though. I've just told you in a couple of sentences that I don't care about anything enough to write a public discussion. And look - you're already growing bored of this post. The fact is that I have had this blog here at benrehberg.com for over six years now and I have only posted 520 times. I've nearly tweeted that much in 18 months. And speaking of Twitter, I think I'm getting off of that train. Facebook too. Down with friends who only know me again through a social experiment and marketing shithole. And fuck Mark Zuckerberg.

And lately, fuck Google too, and their sleazy one-night-stand Verizon. I'm beginning to dislike those companies simply because they profit too much from the personal interactions of individuals. It's a sickness that wears one out from the outside in. First it was search results, which were innocent enough. It has come all the way to "push" advertising, where Google will know that since I like pizza and I am near a pizza restaurant, my phone will buzz to tell me the specials there (near future).

No thanks. I'm quitting Facebook, and I am seriously considering not continuing with Google and Android. I do not live where that plethora of information is usable, and I am becoming increasingly afraid that we will become too dependent on this availability of data and personalization. Like GPS has done for travelers - we no longer have maps or ask for directions.

I realize that I am rambling. It's late and I have been drinking to counteract the early-afternoon coffee that punishes me when I close my eyes tonight.

Friday, August 06, 2010

"Too Many Stakeholders are Being Left Out of Discussions Over the Future of the Internet"

I think it's time we started creating our own networks with IPv6 and sticking it to the corporations. An ad-hoc network would work if everyone knew what was going on. One tie-in to a public backbone and I can light up a community without touching the mainstream corporations.

Friday, May 23, 2008

Executive Decision

After toying with C# today, I've decided that it is way too process-intensive to write the application on a runtime environment like .NET or Java. What I need is a simple language that can download a page, rip through text like a bandit, write the necessary fields to the database, and move on. I can organize the data when the search engine extracts that data.
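The download, rip, store loop described above can be sketched in a few lines. This is only an illustration in Python, not the language or design the project settled on; the table names (`pages`, `links`), the `PageParser` class, and the choice of SQLite are all assumptions made up for the example.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the page title and every href found in <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def open_db(path=":memory:"):
    """Create the two hypothetical tables used by this sketch."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS links (source TEXT, target TEXT)")
    return db


def fetch(url):
    """Download a page and return its text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def store_page(db, url, html):
    """Parse one page, write its fields to the database, return its links."""
    parser = PageParser(url)
    parser.feed(html)
    db.execute("INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
               (url, parser.title.strip()))
    db.executemany("INSERT INTO links (source, target) VALUES (?, ?)",
                   [(url, link) for link in parser.links])
    db.commit()
    return parser.links
```

Calling `store_page(db, url, fetch(url))` handles one page and moves on; the returned links are what a crawler would visit next.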

I can't commit to anything yet, but my spidey-sense is telling me that the crawler will be written in Perl with LWP. I suppose I could look at Ruby, too, but I already have my Camel book and have worked with LWP before. I haven't tied Perl to an RDBMS, but I have done it with PHP and it must be similar. Perl can also do some limited recursion from what I understand, and if it can't I may be able to use a database back-end to save the stacks of URLs.
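Saving the stack of URLs in a database instead of relying on recursion might look something like the following. It is a rough sketch in Python with SQLite rather than Perl, and the `frontier` table and the `push`/`pop` helpers are hypothetical names invented for the example.

```python
import sqlite3


def open_frontier(path=":memory:"):
    """Create a table that acts as a persistent stack of URLs."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS frontier ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT UNIQUE NOT NULL,"
        " visited INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def push(db, url):
    """Add a URL to the stack; URLs already seen are silently skipped."""
    db.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
    db.commit()


def pop(db):
    """Return the most recently added unvisited URL, or None if none remain."""
    row = db.execute(
        "SELECT id, url FROM frontier WHERE visited = 0 ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Mark the URL as visited instead of deleting it, so it is never re-queued.
    db.execute("UPDATE frontier SET visited = 1 WHERE id = ?", (row[0],))
    db.commit()
    return row[1]
```

Because the stack lives in a file rather than in the call stack, the crawler can be stopped and restarted without losing its place, and the depth of the crawl is not limited by the language's recursion.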

I was ready to buy books at O'Reilly today (I chickened out of spending the money) and found a book on writing spiders. From the preview I surmised my crawler/spider must be registered. That means I have to go mainstream, doesn't it?

And now after some more reading, I have discovered that this crawler can be used to build an index for special purposes. I can build my own search engine for this site, for example, and get much better results than I can searching the Google index for benrehberg.com. I have searched for things I know I wrote about, but never found them with Google. Building my own search engine and maintaining my own index of the site can prove useful if I keep writing about programming.

Update: I have created a new label "Web Crawler" for all posts related to this project.

How to Write a Search Engine

It seems a bit strange using the world's best search engine to find out how to build your own. Google is my first resource in this project, though Google itself provides nothing but the idea. There is a paper at Stanford by Larry and Sergey, and that basically is the starting point. That is Google's only contribution so far aside from the many searches I will perform.

There are three main parts to the search engine: the crawler, which tirelessly captures data from the web, the database to hold everything, and the actual search engine - the queries that put the data together in a meaningful format for you.
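To make the three parts concrete, here is a toy version in Python. It is a sketch under heavy assumptions: a plain dictionary of already-downloaded pages stands in for the crawler, an in-memory inverted index stands in for the database, and `search` is the query step. None of the names come from the project itself.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


# Stand-in for crawler output: URL -> page text (made-up example data).
pages = {
    "http://example.com/perl": "Writing a web crawler in Perl",
    "http://example.com/vista": "Making Vista faster by turning off indexing",
    "http://example.com/search": "How to write a search engine and crawler",
}
index = build_index(pages)
```

With this data, `search(index, "crawler")` returns the two crawler pages, and adding a second word to the query narrows the result to pages that contain both.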

I could write a search engine that actually crawls the web looking for my search criteria, but that is very VERY inefficient. Google (and many others) have solved this inefficiency by effectively downloading the Web (that's right - as much of it as they can) to their computers so they can search it much faster and have it available in one place. They've done a whole lot more to increase efficiency and effectiveness of searches, but downloading the web was the first thing they did. It turns out they needed a lot of computers.

I'm going to start with two. I have three desktops that no one wants to buy, and I am really tired of looking at them. I will probably need more if I get this index working soon, but there will be software considerations to make too. You can't fit the web on one computer, no matter how big. I will learn a lot.

I have always had an interest in distributed systems and cluster computing, so this will be fun. I have a lot to learn about distributed databases and algorithm analysis. But all that is later - I haven't even really finished thinking out the preliminaries yet. So one development/crawling machine, and one database machine. After I figure out how to crawl the web, I will begin work on performing searches. If this project holds my interest long enough, I might publish statistics at 49times.com, so keep looking. I will be posting here if I come up with anything worth publishing. I'm going to try to journal my progress and decisions without publishing code, but I realize that I very well could lose interest in this. If I get started, I will likely enjoy it and keep going, but no one can say. If you have some confidence that I will continue, you can subscribe to this blog and get the updates. Beware, though, that you'll get everything else I write too.

Tuesday, April 01, 2008

Never be Late Again

With Gmail's Custom Time, just make up an event in the past and say it happened. It's that easy!

You may even figure out a way to win last week's lottery using the Custom Time API! I'm going to create an app for Android so you can even keep a little slice of your own time in your pocket (coming the second half of 2008). But when that happens, I'll have had it since 2005.

You guys are way behind!

Friday, February 22, 2008

Friends with Vista

After nearly a year, I finally decided to figure out what I could do to make my Vista laptop a bit faster. The memory is maxed out at 2 Gigabytes and it has a dual-core AMD CPU. It had always been very very slow in completing trivial tasks, like opening a browser or the control panel. Copying and moving files took way too long, and I just never approached my problem with logic.

A few weeks ago I was talking with a friend about my experience with Vista so far, and mentioned to him that I didn't think it was a problem with Vista, but a hardware issue with my Gateway laptop. "It runs very hot," I told him. "The hard drive activity never stops. I just don't think the machine was designed well enough to support such a heavy OS." I'd never seen Vista so slow on any other computer, so why the hell is it pokey on mine? And what in the dickens is going on with my hard drive?

Then it hit me. Constant hard drive activity is an indicator of (1) a virus or crapware, or (2) an indexing service. Google Desktop search was deployed with the computer when I bought it, as part of Gateway's image, along with all the other garbage like BigFix, AOL, and the Office 2007 90-day trial.

Having been a student of Vista before and during its release, I remembered something about Google and Microsoft having fits about desktop search. It seems that Vista includes its own indexing service to speed up searching, and Google was having a hissy over users not being able to choose a desktop search engine. The Windows Indexing Service is on by default, and I don't think any manufacturers have changed that in their production images. And it just so happens that Gateway included Google Desktop in every computer they released with Vista, and therein lies my problem: two indexing services, constantly running on my poor little 5400 RPM notebook hard drive.

After some thought, I decided I'm a fairly organized fellow and don't have the need very often to search for a document. Most of what I access anyway is on the network, and those locations aren't indexed by default anyway. So away went Google Desktop. Though I love Google, I have no need for that program on my mobile station.

And for that matter, I canceled the Windows Indexing service. No need to pick sides, you know?

Then for a final pick-me-up, I had Vista optimize the graphics for performance, which took away all the eye-candy and effectively made my desktop look like Windows 2000. I'm fine with that.

Oh, and one more thing: I shut off the UAC. Those pain-in-the-ass messages one gets when he tries to install a program, "Windows needs your permission to continue," are gone. I can now run a command window without specifying to run it as Administrator. I can change IP settings with fewer mouse clicks. A little bubble message when I log on warning me that User Account Control is turned off is the only annoyance I have now, and I'm sure that with a simple registry edit I can get rid of that too. Maybe I'll post it later.

I must say this little bottom-end laptop is pretty damn speedy these days. NetBeans opens in under 60 seconds. Outlook opens in under 5, and boot times are at their lowest since I got it. This doesn't change anything about the inevitable change to a Mac when I can afford one, but it certainly makes me more comfortable in delaying it.

Tuesday, December 11, 2007

This is a test post.

This is a test post. I am sending this text from my phone via e-mail to a special address at blogger.

Did it work?

Update (from a computer): I am limited to a 160-character message from that phone. Not too fantastic for blogging. But it did work, and I have a new way of blogging to the world from an underground imprisonment (if that were to ever happen). Good to know.

Tuesday, October 23, 2007

This is the Coolest Thing About the Internet

It's not that the latest technology can bring us all together, let us communicate with each other instantly around the world, or provide opportunity where there was once none. It's that in 2007 I can let my daughter learn a little bit in the same method I did.

It's like a friggin' time capsule.

Monday, January 29, 2007

Thought We Were Out of Beta

While trying to publish an edit to the previous post, I got this error:

bX-y5mpya

Additional information

 blogID: 6667391

 uri: /post-edit.do

 host: www2.blogger.com

 postID: 4728350865740187879

Just thought you should know. Not sure why they gave me that information - they didn't point me directly to a form to report a problem. But I trust Google.

Someone's pager probably went off as that error number was being generated, and that thought makes me happy because it's 3:30 AM in Mountain View, and I'm not there.

Wednesday, January 03, 2007

Why Do I Bother?

Google Answer to Filling Jobs Is an Algorithm (New York Times)

Google is trying something new with hiring folks these days, and I began to read this article with enthusiasm, thinking a dream just might come true and I could find myself on the payroll next year with my favorite corporation. Alas, I still need to finish school even to be the janitor.

Mathematics is absolutely required for engineering at Google, and it's a subject I have long put away after going nine rounds with Calculus earlier in the decade. I feel comfortable with math but it takes a good bit of time and an inordinate amount of self-discipline to operate at that higher level.

Google has created a survey that its candidates will begin taking this month, and it predicts it might double its workforce of 10,000 this year, which means about 200 hires per week. They had to find some quantitative method, didn't they? I probably would, too. With my current level of education and experience I still don't qualify even for systems administration. So...

If given the chance I would propose to Google that they buy me. That's right, me. They can put me through the rest of school and I will work for them to pay it off. If Google gives every employee so much to come and stay, why can't that be education? They're building an army and they seem to have the money to accommodate me (not to mention a little pull with the folks at Stanford). The only question is how do I ask this? Who do I e-mail, or where do I show up to yell "I want to finish school so I can work for Google!"

If I could ask the right person and they can answer me objectively as to why it's not a good idea, I'll quit. But until then, while the United States Army is still offering to pay off student loans to any flunkie who went to college, it's a good idea for Google to front me some education. Hell, I'd even settle for Berkeley. Everyone says they're starving for people, but I just don't see it yet. The Department of Family and Children Services will pay for a graduate degree for a social worker who goes to school while employed, and some school districts will pay for education so long as the future educator signs a contract to teach in that district for a period of time. I believe I have a case.