Web Scraper Revived

Previously I wrote about my web scraper project. Well, now I am coming back to it. I have a couple thousand eBooks on my Kindle. It is tough figuring out which book to read next each time I finish one. I manually go through a lot of books, looking up their pages on Amazon. Wouldn't it be nice if I had a program that could help me decide?

First order of business was to get a list of the books. At first I thought I might be able to hack the AZW file format and extract details from the books themselves. Nah. Too hard. Instead I remembered I had an email from Amazon for just about every book I bought. The emails were in Microsoft Outlook 2010. Turns out I could just save the emails as a text file.

Now my first job is parsing this file. I was hoping to get the links to the books on the Amazon site. Nope. Those are not stored in the text file output by Outlook. That is okay. I can get the book name. Next I need to figure out how to take that name and programmatically search the Amazon web site for details on the book.
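A rough sketch of what pulling the book names out of the saved text file could look like. The exact layout of the Outlook export is not shown here, so the "Title:" label is purely an assumption for illustration, and extract_titles is a made-up helper name. Swap the pattern for whatever actually precedes the book name in the real file.

```python
import re

# ASSUMPTION: each book name sits on its own line after a "Title:" label.
# The real Outlook export may look different; adjust the pattern to match.
TITLE_PATTERN = re.compile(r"^[ \t]*Title:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_titles(text):
    """Return book titles in order of first appearance, skipping repeats."""
    seen = set()
    titles = []
    for match in TITLE_PATTERN.finditer(text):
        title = match.group(1).strip()
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


if __name__ == "__main__":
    sample = "Subject: Your order\nTitle: Dune\nTitle: Neuromancer\nTitle: Dune\n"
    for name in extract_titles(sample):
        print(name)
```

Skipping repeats matters because the same book can show up in more than one email, such as an order confirmation and a receipt.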

I plan to store my data in a MySQL database I have running on my machine. Right now I have extracted all the book names to a text file. The next step is to stick them in a database table instead. Maybe I will also record when I ordered the book. Then I will want to find and scrape the Amazon page for the book. Good things to grab from Amazon would be book price, date published, author, average customer rating, number of customer reviews, etc. All these goodies could be stored in my database.
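Here is one possible shape for that table, sketched with Python's built-in sqlite3 module as a stand-in so the example runs without a server. This is not MySQL: the real thing would need a MySQL connector and slightly different SQL (for example INSERT IGNORE instead of INSERT OR IGNORE). The table name, column names, and the helpers open_db and add_titles are all assumptions made up for this sketch.

```python
import sqlite3

# ASSUMPTION: hypothetical schema covering the fields mentioned above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS books (
    id           INTEGER PRIMARY KEY,
    title        TEXT NOT NULL UNIQUE,
    ordered_on   TEXT,     -- date the book was ordered
    author       TEXT,
    price        REAL,
    published_on TEXT,
    avg_rating   REAL,     -- average customer rating
    review_count INTEGER   -- number of customer reviews
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the books table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_titles(conn, titles):
    """Insert titles, ignoring ones already stored; return the row count."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO books (title) VALUES (?)",
            [(t,) for t in titles],
        )
    return conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]


if __name__ == "__main__":
    db = open_db()
    print(add_titles(db, ["Dune", "Neuromancer", "Dune"]))
```

The scraped columns start out empty and can be filled in later with UPDATE statements once the Amazon details are available.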

Eventually I want to write a function or program that could score the books, predicting which books I should read next. Ahh this is going to be a very fun project indeed.
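Just to make the idea concrete, here is a toy version of a scoring function. The formula is an assumption picked for illustration, not a settled design: it multiplies the average rating by the log of the review count, so a highly rated book with only a handful of reviews does not outrank a well-reviewed favorite. The names score_book and rank_books are hypothetical.

```python
import math


def score_book(avg_rating, review_count):
    """ASSUMED formula: average rating weighted by log10 of (1 + reviews)."""
    return avg_rating * math.log10(1 + review_count)


def rank_books(books):
    """Sort (title, avg_rating, review_count) tuples from best to worst."""
    return sorted(books, key=lambda b: score_book(b[1], b[2]), reverse=True)


if __name__ == "__main__":
    shelf = [("A", 4.8, 3), ("B", 4.2, 2500), ("C", 3.0, 100)]
    for title, rating, reviews in rank_books(shelf):
        print(title, round(score_book(rating, reviews), 2))
```

Price, publication date, and author could be folded in later as extra terms once there is real data to tune against.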