PERSONAL PROJECTS

Stock Market

I'm very interested in the stock market and the search for non-random activity within its data. For the past 6 years I've been mining millions of rows of stock market data looking for anomalies. In July 2006, a partner and I put an automated trading system into production. After the close of each day's trading, a C# console application running on a dedicated database server downloads the daily data via FTP and inserts it into a MySQL database. After cleaning the data, adjusting for splits, and calculating various well-known and proprietary indicators, the program emails us the best buy or sell signal to execute at the following market open.
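For the curious, here's a minimal sketch of what such a nightly job might look like in C#. The FTP endpoint, credentials, and email addresses are placeholders, and the MySQL and indicator work is elided; this illustrates the shape of the pipeline, not the production code.

    using System.IO;
    using System.Net;
    using System.Net.Mail;

    class NightlyJob
    {
        static void Main()
        {
            // 1. Download the end-of-day data file via FTP.
            //    (Hypothetical vendor URL and credentials.)
            FtpWebRequest request = (FtpWebRequest)WebRequest.Create(
                "ftp://example-data-vendor.com/eod/today.csv");
            request.Method = WebRequestMethods.Ftp.DownloadFile;
            request.Credentials = new NetworkCredential("user", "password");
            using (FtpWebResponse response = (FtpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            using (StreamWriter writer = new StreamWriter("today.csv"))
            {
                writer.Write(reader.ReadToEnd());
            }

            // 2. Insert into MySQL, clean, adjust for splits, and compute
            //    indicators. (Omitted; in practice this would use the MySQL
            //    Connector/NET driver and a series of update passes.)
            string signal = "BUY XYZ at tomorrow's open";  // placeholder output

            // 3. Email the day's best buy or sell signal.
            MailMessage mail = new MailMessage("system@example.com",
                "us@example.com", "Trading signal", signal);
            new SmtpClient("smtp.example.com").Send(mail);
        }
    }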

In its first 8 months of trading, our system outperformed the S&P 500 Index by 14.5 percentage points (a 27.1% return after commissions versus 12.6% for buy-and-hold), even while exclusively trading large-cap stocks contained in that index. The automated system runs on a Windows XP Pro box with an AMD 64-bit dual-core processor (Athlon 64 X2 4200+), 2 gigabytes of dual-channel DDR2 SDRAM, and a 150 GB SATA 10,000 RPM Western Digital Raptor hard drive. Just for fun, it connects to my beta database server, laptop, and work computer over a gigabit LAN.

Update, summer 2007: I had to take the system offline and use the money elsewhere. It continues to outperform the market, though not as impressively.

Netflix Prize

On October 2, 2006, the online DVD rental leader Netflix issued a challenge called the "Netflix Prize". The goal: more accurately predict a user's rating for a given movie based on that user's past ratings of other movies. Netflix offered a $1,000,000 prize to anyone who could beat their existing system by 10%, along with several smaller Progress Prizes for incremental progress toward that goal. They made public an anonymized training set of 100 million past customer ratings (each consisting of a customer ID, movie ID, and integer 1-5 star rating), along with a qualifying test set of 2.8 million customer ID / movie ID pairs with the actual ratings withheld. The task is to predict those 2.8 million ratings as accurately as possible based on the larger training set. Netflix's existing system predicts a rating to within 0.9514 stars, as measured by root mean squared error (RMSE). The $1,000,000 prize goes to whoever improves on that by 10%, to an RMSE of 0.8563 stars.
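For reference, RMSE is just the square root of the average squared prediction error, so the contest's scoring metric fits in a few lines of C#:

    using System;

    class Scoring
    {
        // RMSE: the square root of the mean of the squared differences
        // between predicted and actual ratings. Lower is better.
        static double Rmse(double[] predicted, double[] actual)
        {
            double sumSquaredError = 0.0;
            for (int i = 0; i < predicted.Length; i++)
            {
                double error = predicted[i] - actual[i];
                sumSquaredError += error * error;
            }
            return Math.Sqrt(sumSquaredError / predicted.Length);
        }
    }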

I decided to write a C# console application and try my hand at the contest, but the size of the training set made even simple calculations difficult. The downloaded text file of 100 million records was over 1 GB unzipped. Inserting all of them into a MySQL database and adding indices took literally days, and calculating linear regressions between users and movies would have been all but impossible. I ended up converting all the data and saving it in binary format, with each record consisting of twelve bytes: four bytes each for the user ID, movie ID, and rating. I saved the data twice: once ordered by user ID then movie ID, and once vice versa. This, combined with a binary "index" file recording the number of ratings for each user and movie, meant that I could easily retrieve the entire set of ratings for a given movie or user, or retrieve the single record for a user / movie combination. Better yet, either file (but not both) could be loaded into memory on my workstation (consuming roughly 900 MB of the 1.5 GB available), after which I could jump to a known memory offset to retrieve a record.
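Here's a minimal sketch of that fixed-width layout in C#. The file and method names are invented for illustration; the key point is that a constant 12-byte record size turns lookup into simple offset arithmetic, with no database index needed.

    using System.IO;

    // 12-byte fixed-width record layout:
    // [ userId : 4 bytes ][ movieId : 4 bytes ][ rating : 4 bytes ]
    class RatingsFile
    {
        const int RecordSize = 12;

        // Append one rating to the binary file.
        static void WriteRating(BinaryWriter w, int userId, int movieId, int rating)
        {
            w.Write(userId);
            w.Write(movieId);
            w.Write(rating);
        }

        // Because every record is exactly 12 bytes, the Nth record lives
        // at byte offset N * 12, so retrieval is a seek plus three reads.
        static void ReadRating(FileStream fs, long recordIndex,
                               out int userId, out int movieId, out int rating)
        {
            fs.Seek(recordIndex * RecordSize, SeekOrigin.Begin);
            BinaryReader r = new BinaryReader(fs);
            userId = r.ReadInt32();
            movieId = r.ReadInt32();
            rating = r.ReadInt32();
        }
    }

With one copy sorted by user and one by movie, the same offset trick serves both user-major and movie-major scans.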

With the data now accessible, I computed every possible movie-to-movie correlation. There were approximately 18,000 distinct movie IDs in the training set, giving me roughly 162,000,000 correlation records, each consisting of movie1 ID, movie2 ID, intersection size, "m" and "b" in the equation y = mx + b, Pearson's correlation coefficient, and a t-statistic. Thus, if someone had rated movie1 but not movie2, I could predict their movie2 rating from the correlation between the two movies. I would've loved to calculate user-to-user correlations as well, but the sheer size of the user base (480,000 distinct users) would've meant calculating and storing over 115 billion records. I did calculate several million user-to-user correlations for the power users (those with the highest number of ratings), since I figured a correlation for a user with only a few ratings was mostly noise.
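A per-pair calculation along these lines might look like the following C# sketch, computed over the users who rated both movies (x holds their movie1 ratings, y their movie2 ratings). The names are illustrative:

    using System;

    class MoviePair
    {
        // Least-squares fit y = m*x + b plus Pearson's r over the
        // intersection of users who rated both movies.
        static void Fit(double[] x, double[] y,
                        out double m, out double b, out double r)
        {
            int n = x.Length;
            double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
            for (int i = 0; i < n; i++)
            {
                sumX += x[i];
                sumY += y[i];
                sumXY += x[i] * y[i];
                sumX2 += x[i] * x[i];
                sumY2 += y[i] * y[i];
            }
            double cov  = n * sumXY - sumX * sumY;
            double varX = n * sumX2 - sumX * sumX;
            double varY = n * sumY2 - sumY * sumY;
            m = cov / varX;                    // slope
            b = (sumY - m * sumX) / n;         // intercept
            r = cov / Math.Sqrt(varX * varY);  // Pearson's correlation

            // The t-statistic follows as r * Math.Sqrt((n - 2) / (1 - r * r)).
            // Prediction: a user who gave movie1 a rating x0 is predicted
            // to rate movie2 at m * x0 + b.
        }
    }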

After massaging the data, I was able to get an RMSE of 0.9606: slightly worse than Netflix's existing prediction system, and far behind the leaders. Still, it was very cool to learn how to read and write binary files, load data directly from a specified memory offset, and generally get "closer" to the workings of the computer instead of abstracting them away by simply reading from and writing to a database.

Semiprime Factorization

The RSA Factoring Challenge is an effort "to learn about the actual difficulty of factoring large numbers of the type used in RSA keys". Factoring large semiprimes (numbers that are the product of exactly two primes) is not feasible with today's technology, because few known shortcuts exist. RSA Laboratories, the developer of the popular RSA encryption scheme, offers prizes to whoever can factor large semiprimes into their component primes. Currently, a 212-digit number is the next to be cracked, with a prize of $30,000; the largest semiprime, a 617-digit number, offers a $200,000 prize. Large semiprimes are used extensively in current cryptographic schemes because multiplying two large primes to get their product is very easy, but factoring that product without knowing either component is extremely time-consuming, even for the world's fastest computers. Still, I thought I'd try...

I wrote a C# console application to kick this project around a little, and realized immediately that C#, like most languages, doesn't offer a native integer type anywhere near large enough to handle numbers of this magnitude. There are many open-source modules available, though, such as the BigInteger.cs class that I used, which stores the number's digits internally and implements the arithmetic operations itself.

My research was based mainly on the fact that a semiprime N with prime factors A and B can be written as N = (C + D) * (C - D), where C = (A + B)/2 and D = (A - B)/2. For example, 55 = 11 * 5, so 55 can be written as (8 + 3) * (8 - 3). Equivalently, C² - D² = N: here, 8² - 3² = 64 - 9 = 55. Interesting. My goal was not to search for A or B directly for a given large semiprime N, but to find the much smaller number D. I found a few "shortcuts" that let me skip numbers, but nothing that would overcome the sheer vastness of the search space. The main thing I took away from this project was that I'd need a PhD in math to even begin to know where to start on this problem, but it was eye-opening.
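This search for C and D is essentially Fermat's factorization method: start at C = ceil(sqrt(N)) and test whether C² - N is a perfect square; if it is, that square is D², and the factors are A = C + D and B = C - D. Here's a minimal sketch in C#; note it uses the modern System.Numerics.BigInteger rather than the BigInteger.cs class mentioned above, and factors only a toy semiprime.

    using System;
    using System.Numerics;

    class Fermat
    {
        // Integer square root via Newton's method.
        static BigInteger IntSqrt(BigInteger n)
        {
            if (n < 2) return n;
            BigInteger x = n, y = (x + 1) / 2;
            while (y < x) { x = y; y = (x + n / x) / 2; }
            return x;
        }

        static void Main()
        {
            BigInteger n = 55;                 // toy semiprime: 11 * 5
            BigInteger c = IntSqrt(n);
            if (c * c < n) c += 1;             // start at ceil(sqrt(N))
            while (true)
            {
                BigInteger d2 = c * c - n;     // candidate D^2
                BigInteger d = IntSqrt(d2);
                if (d * d == d2)               // found D: factors follow
                {
                    Console.WriteLine("{0} = {1} * {2}", n, c + d, c - d);
                    break;
                }
                c += 1;                        // a real search would apply
            }                                  // the skipping shortcuts here
        }
    }

For 55 this finds C = 8 and D = 3 on the first iteration; for an RSA-sized semiprime, the number of candidate values of C is what makes the search space so vast.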

GreedyMe.com

GreedyMe is my free wishlist website; see my CV entry for details.