Introduction
The WWW is rich with sources of useful data. Some are available directly; others require registration and a subsequent login. I want to discuss how independent investors, without access to Bloomberg, Reuters and other expensive data sources, can streamline their workflow by automating data access and processing. You’re busy: you want to analyse your data, not spend all your time collecting and processing it.
I’m going to start by writing about retrieving the data once you have established regular and reliable sources, then discuss processing and presentation. Those stages will all be done in Mathematica, sometimes with a little help from wget during the retrieval stage. The final stage is automation. I run Mac OS X, so that stage will describe how to tie everything together in one automated flow on that platform. The objective is to run all of this on a timer so that you wake up each morning (assuming the data of interest is daily) with a chart, or several charts, of your data, presented in a design you prefer, waiting in your email inbox.
Data format
The formats you are most likely to acquire are HTML pages, Excel or CSV files, and zipped files (typically zipped Excel or CSV files). One way to process web data in Excel or other formats is to download it and then import it into Mathematica with the Import function. Some additional documentation can be found here.
Since this article is all about automation, we want to import the data directly into Mathematica from the web source(s). Note that Excel and zipped files can also be imported directly:
Import["http://www.datawarehouse.gov/data.html", "Data"]
Import["http://www.datawarehouse.gov/data.xls", "Data"]
Import["http://www.datawarehouse.gov/data.zip", "*"]
This will give you data that can be processed in the next stage of the workflow. I am assuming that this download is something you do regularly, so you will already know how the HTML page is structured or which Excel worksheet contains the data you’re interested in. This is not generally a problem, because data tends to be updated and displayed in consistent formats and structures.
Logging in
The Mathematica function Import works fine if the data is accessible directly. But what if you need to log in? In this situation wget is your friend.
Suppose I have to log in and my username and password are wildebeest and letmein. Running this next line in the Terminal will store the cookie generated from the login information:
wget --post-data='name=wildebeest&pass=letmein' --save-cookies=cookies.txt --keep-session-cookies 'http://www.datawarehouse.gov/login'
After that step is completed you would run something similar to this next line, again in the Terminal:
wget --load-cookies=cookies.txt -O ~/Desktop/data.dat --user-agent="" -x -i ~/Desktop/urls.txt
You’ll find more details about what this is doing in the wget documentation, but briefly, the key points to note are the references to data.dat and urls.txt. I haven’t mentioned a URL anywhere in the line above. That is because there may be occasions when there are multiple pages you want to visit on a site. In those situations it is easier to list the URLs in a text file, one per line, and pass that file to wget with the -i option. The data.dat file is the file that I want the page (containing the data) saved as.
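To make the two steps concrete, here is a minimal sketch of how they might be combined in one script. It is an illustration rather than a finished tool: the working directory, the second URL in the list, the run helper and the DRY_RUN switch are all my own assumptions, and the form field names (name, pass) are simply the ones used in the login example above. By default it only prints the wget commands so you can check them before touching the network.

```shell
#!/bin/sh
# Sketch only: paths, the second URL, run() and DRY_RUN are illustrative assumptions.
set -eu

WORKDIR="${WORKDIR:-$PWD}"       # hypothetical working directory
URL_LIST="$WORKDIR/urls.txt"     # one URL per line, read by wget -i
COOKIES="$WORKDIR/cookies.txt"   # written by the login step
OUTPUT="$WORKDIR/data.dat"       # everything fetched in step 2 lands here
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands, 0 = run them

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# List every page to visit on the site.
# The second URL is made up for illustration.
cat > "$URL_LIST" <<'EOF'
http://www.datawarehouse.gov/data.html
http://www.datawarehouse.gov/archive.html
EOF

# Step 1: log in and keep the session cookie.
run wget --post-data='name=wildebeest&pass=letmein' \
    --save-cookies="$COOKIES" --keep-session-cookies \
    'http://www.datawarehouse.gov/login'

# Step 2: reuse the cookie to fetch every URL in the list.
run wget --load-cookies="$COOKIES" -O "$OUTPUT" --user-agent="" -i "$URL_LIST"
```

If you saved this as, say, fetch.sh (a hypothetical name), then running DRY_RUN=0 sh fetch.sh would perform the real requests. One thing to keep in mind: when -O is combined with several URLs, wget concatenates all the downloaded documents into that single file, so this arrangement suits pages you intend to process together.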
Further Q&A about wget can be found here. Next time: processing the data.
