Ben Timby's Open Source Guide

I am an avid Open Source Software user and contributor.  My goal is to share some of the Automated FTP Download, File Data Storage techniques used at SmartFile to solve problems for our customers. Also watch this space for the announcement of future open source software releases.


PASV FTP on the Windows Command Line.

Tuesday, October 12, 2010 by Ben Timby
In a previous post, I explained why PASV mode is the preferred method to connect to an FTP server. That is all well and good, but how do you actually USE PASV mode?

The FTP client that ships with Windows does not support PASV mode. Google will tell you that it does, by sending the raw PASV command. However, this is not true: you cannot perform PASV connections using the Windows FTP client. It can only make active connections. So if you are behind a firewall that does not allow this, how do you do command line FTP?

Command line FTP is useful for scheduled operations often performed from a script. On Windows, this might be a batch file executed by the Windows Task Scheduler. Well, if you wish to automate a file transfer in this way, you will almost certainly have to download a third party FTP client that supports PASV data channels.

One FTP client that is available for this task is ncFTP. For downloading or uploading a file inside a script, you can use the ncftpget and ncftpput programs respectively.

For example, to download a file from an FTP server using PASV mode, you can use ncftpget with the following options.

ncftpget -F -u <username> -p <password> <site name>.smartfile.com C:\path\to\local /path/to/remote

What is PASV FTP?

Tuesday, October 12, 2010 by Ben Timby
A lot of our customers are confused by PASV FTP. Our system supports both Passive and Active transfers, but what is the difference?

First let's review how the FTP protocol functions. When you first connect to an FTP server, you are creating a COMMAND channel. This channel is used to send commands and responses between your FTP client and the server. Whenever you initiate a file transfer, that transfer is done using another DATA channel. This data channel is dedicated to the transfer of a specific file, and will close when that transfer completes.

Passive and Active are two methods of opening this second DATA channel. The default (Active) method means that the FTP client will listen for the server to connect to it. This is backwards from the majority of Internet protocols, where the client connects to the server.

Passive mode on the other hand means that the client will connect to the server in order to open the DATA channel. This is frequently how Internet protocols work, where the server accepts a connection from the client, not vice-versa.
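
To make the passive case concrete, here is a minimal sketch of what a client does with the server's reply to the PASV command. The reply carries six numbers: the first four are the IP address and the last two encode the port the client should connect to. The function name and the sample address are illustrative assumptions, not part of any particular FTP client.

```shell
# Decode the address in a PASV reply such as
#   227 Entering Passive Mode (192,168,1,10,19,137)
# The last two numbers encode the port as p1 * 256 + p2.
parse_pasv_reply() {
    # Keep only the comma-separated numbers inside the parentheses.
    nums=$(printf '%s\n' "$1" | sed -n 's/.*(\([0-9,]*\)).*/\1/p')
    [ -n "$nums" ] || return 1

    # Split the numbers on commas into $1..$6.
    old_ifs=$IFS
    IFS=,
    set -- $nums
    IFS=$old_ifs
    [ $# -eq 6 ] || return 1

    printf '%s.%s.%s.%s:%s\n' "$1" "$2" "$3" "$4" $(( $5 * 256 + $6 ))
}

# prints 192.168.1.10:5001
parse_pasv_reply '227 Entering Passive Mode (192,168,1,10,19,137)'
```

The client then opens an ordinary outbound connection to that address, which is why a client-side firewall rarely objects.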

Both modes are useful. However, in a lot of circumstances, PASV is the only option that will work. The reason is that many clients are behind a firewall that is not configured to allow the server to connect to the client. On the other hand, 99% of the time a client is able to open a connection to a server, thus passive will work while active will not.

Another situation where PASV mode is required is when SSL is being used on the COMMAND channel. Even a properly configured firewall at the client side will be confused when SSL is in use. This is because the firewall works by inspecting the FTP COMMAND channel and noticing when an active connection is being requested. It then allows this connection when it is attempted. When SSL is being used, the COMMAND channel is not readable by the firewall, and thus it cannot inspect the traffic. It has no way of knowing when an active connection will be attempted, and thus it denies it. In this case, PASV will still work, as the client is making an outbound connection to the server, which the firewall will allow.

So what does this mean to you? It means that you should generally use the PASV mode to connect to an FTP server. This method will almost always work, while active mode will often fail.

HTML5 and What it Means for You.

Monday, August 30, 2010 by Ben Timby
There is a lot of buzz about HTML5, especially in developer circles. The reason is that HTML5 is expected to bring a wealth of possibilities for web application developers. But what does this mean for your average Joe? First of all, let's review what HTML5 is.

HTML5 is the latest revision of the HyperText Markup Language specification. HTML is the standard that powers the web. In its most basic form, HTML is the markup that is interpreted by your browser of choice and displayed to you as a web page. HTML allows developers to embed a vast array of functionality from multimedia to scripting languages that provide dynamic interaction. A new version of HTML will push your browser's capabilities beyond what they are currently. But why should you care?

A lot of the features included in the HTML5 spec are things that can be accomplished today using clever tricks or 3rd party extensions. By standardizing these features, web developers will be able to build applications that work on a wider array of platforms. Also, HTML5 will obsolete many practices that up until its release are stop-gap solutions at best. A couple of examples are needed, so first let's look at file uploading.

Currently, to provide a truly rich user experience for uploading files to a web server, a developer must make use of Adobe Flash or Java. Both of these methods are really hacks, as the browser natively allows files to be uploaded, albeit with some major limitations. First of all, browsers only allow the user to select a single file at a time. They don't support drag & drop for file selection or any means of tracking the progress of the upload. All of these limitations must be overcome using Flash or Java today. However, HTML5 introduces solutions to all of these problems directly in the browser.

Another common scenario in which HTML5 will shine is application caching. Today 3rd party browser extensions such as Google Gears are needed to cache complex data inside of a browser. Such caching is required to provide so-called "offline" operation of an application. Offline operation allows a user to use a web application when disconnected from the web. The user can perform their tasks as the browser saves data locally. When reconnected, these changes are synchronized back to the server without the user's involvement. This is a really powerful feature that will make web applications orders of magnitude more useful.

In short, HTML5 will empower web developers to empower users. HTML5 is just one more step in the journey upon which humanity has embarked. So far nobody knows the final destination, but as always, getting there is half the fun.

Choosing a Web Application Framework.

Friday, August 13, 2010 by Ben Timby
Today, the web programming landscape is littered with frameworks. I remember a time when you simply selected your programming language and used it to create a web application. Now in addition to the language (some would say platform), you must choose a web framework. Now, don't get me wrong, you can proceed without a framework, but you are generally much better off with than without one.

So, what exactly is a web framework? Well, at the very least, a web framework will handle the low level HTTP plumbing necessary for writing a web application: things like decoding input, handling HTTP status codes and headers, etc. At its most feature rich, it provides data access abstraction, form handling/validation, and more.

One important feature of a web framework is its security mechanisms. Most modern frameworks have built-in SQL injection protection, anti-XSS features and even CSRF prevention. Some of the more well thought out web frameworks render these attacks obsolete without any intervention on the part of the developer. Further, they generally include a suite of authentication and authorization tools that remove the need to find or build them.

The trade-off as always is that the more tasks your framework handles for you, the less control you have over these tasks. The framework that handles the common tasks while still allowing the developer to intervene when necessary is a truly beautiful thing. It is in this regard that the Django web framework really shines. Other available web frameworks even for platforms other than Python shine to a greater or lesser degree. However, I feel that Django coupled with Python really make for a great web application platform.

I have always heard that Python comes with batteries included, and that is definitely true. No matter what problem you are solving, Python will have a library or other resource to help you solve it. The entire language seems to aspire to being as streamlined as possible. There are no helmets or handrails in Python. If you want to write poor code in Python, there is nothing preventing you from doing so. The problem with some other languages is that they seem to try to prevent programmers from writing bad code. This is impossible for a language to do; attempting this feat just complicates things for good developers. A lot of these protections are optional in Python or are left out altogether; others are handled by convention rather than coercion. A good example of this is constants: Python does not have constants, but convention dictates that upper-case variables be treated as such. The language need not concern itself with coercing developers to do the right thing.

SmartFile uses Django and Python. These technologies get out of our way and allow us to write useful software. However, when we have a particularly hard or obscure problem to solve, we also have absolute control.

Developers that have used ASP, ASP.NET, JSP, PHP, Perl-CGI to develop web applications must see these technologies as stone tools now. We have entered the bronze age of web development and I for one look forward to the iron age.

Securing a User API - Authentication.

Wednesday, August 4, 2010 by Ben Timby
We are working on a new API here at SmartFile. The API will act as another method for users to interact with our system. We want to treat it as just another supported protocol.

That said, our initial versions of the API performed authentication utilizing the username and password of the user. The same one used to access the system over various other protocols. I committed code yesterday that changes all of that. It introduces a new form on our API that generates an API key and password for the user. This basically gives the user two username/password pairs for their account.

One of our other developers asked me to justify why this was necessary and I had a little trouble at first. The only reason I could come up with was "This is how it is usually done." But that is never a good reason to do something.

I did a little research to try and uncover some of the reasons why the other major APIs out there use an API key instead of a user's authentication credentials. I could not find any published justification. So as a contribution to other developers in this situation, I will provide below some concrete reasons why this method is superior.

1. Decoupling.
By decoupling the API credentials from the user's credentials, either can be changed independently. There are numerous reasons to change credentials frequently; however, the more difficult this process is, the less often it is undertaken. If the same credentials are used in applications consuming the API and also by the user for various other reasons, this complicates matters. By using a separate API key, the user can continue to change their password frequently without impacting applications consuming the SmartFile API.

2. Forced Complexity.
In my last post I mentioned that passwords in our system are controlled by our users. This gives our users ultimate control of the level of security they wish to apply to their data. Security is always a trade-off against usability. However, an API is a fairly narrow use-case: the credentials will be configured somewhere and don't need to be remembered by anyone. In this case, using strong credentials in the form of a complex API key and password has no bearing on usability.

3. Limited Disclosure.
An API key might be disclosed to 3rd parties. Other developers on a team, server administrators or colleagues of another sort may all have occasion to catch a glimpse of the API key. Like it or not, a lot of users use the same password for many systems. If the API credentials are a user's chosen credentials, then exposing them poses risk.
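
As a rough illustration of the "forced complexity" point, the sketch below generates a key/password pair from the kernel's random number source. This is not SmartFile's implementation; the function names, field names and lengths are assumptions chosen for the example.

```shell
# Emit $1 random bytes from /dev/urandom as lowercase hex.
random_hex() {
    head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Hypothetical key/password pair; the lengths are illustrative only.
generate_api_credentials() {
    printf 'api_key=%s\n' "$(random_hex 16)"
    printf 'api_password=%s\n' "$(random_hex 32)"
}

generate_api_credentials
```

Because nobody has to remember or type these values, they can be far longer and more random than a password a user would choose.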

Above are what I think are the most compelling reasons to choose an API key instead of user selected credentials when developing an API. Of course every coin has two sides and this issue is no different.

1. Complexity.
The use of an API key introduces a level of complexity to the API registration process. It is one more step for a user to perform.

I hope that the above information helps other API developers make an informed decision when choosing their API authentication method. Doing what everybody else does is fine as long as you understand the risks and rewards of doing so.

Please use the comments to inform me of additional facets of this issue, as well as to correct any glaring errors in logic I may have espoused.

How Not to Compete in The Online Storage Business.

Thursday, July 29, 2010 by Ben Timby
I guess when you compete with other companies, sometimes the easiest way to do so is to spread false information about your competitors. In the industry, this is known as FUD: Fear, Uncertainty and Doubt.

One of our potential customers was kind enough to forward us some FUD they received from a competitor. It was good for a laugh, so I will share it with you below. The only problem I see with this tactic is that it just may work if someone is inclined to blindly accept somebody else's statements as fact. But hey, that is how the world works.

> I saw your note about SmartFile. I think they have some great features,
> and the price is great.


Thanks! Well, at least things start out civil enough.

> I'd only caution you, if you're going to use this for law firms, that their
> security is almost non-existent and you can only beef it up a bit with their
> optional fob/key.


Hmm, we're starting to move into the FUD realm. We happen to have many law firms using our product. We put a lot of effort into ensuring our customers' data is safe in our system.

And it is totally true that you can increase the security of our system greatly by using our optional SmartKey. Two-factor authentication of this form is not without its flaws, but overall provides a boon for security.

> They only use assignable passwords and their system allows you to create
> passwords with as little as 3-bit protection. They also store passwords in
> an accessible database, and they can be retrieved by an admin. Not best
> practices.


OK, yes, our system allows you to assign your own passwords. I think that is pretty standard. This allows you to choose your own level of security.

Here I can wax philosophical and point out that the last time I checked, a pistol aimed at a foot would indeed fire. Also, I am pretty sure they have not yet finished installing the handrails on the north face of Everest. My point is that we provide the means of securing your data, but if you want to use the password '123' who are we to stop you?

However, I am a bit confused as to what 3-bit protection is. Our bytes are 8 bits like everybody else's.

And that last statement about storing passwords in an accessible database is patently false. We like our passwords hashed with a bit of salt thank you very much!
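
For readers unfamiliar with the idea, here is a minimal sketch of salted hashing. It is not our actual scheme: the function names and the "salt$digest" record format are made up for illustration, and a single round of SHA-256 is used only to keep the example short. A production system would use a deliberately slow algorithm such as bcrypt or PBKDF2.

```shell
# Hash password $1 with salt $2 (a random salt is generated if $2 is omitted).
# Prints a record of the form "salt$digest".
hash_password() {
    salt=${2:-$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')}
    digest=$(printf '%s%s' "$salt" "$1" | sha256sum | cut -d' ' -f1)
    printf '%s$%s\n' "$salt" "$digest"
}

# Succeeds if candidate password $1 matches stored record $2.
verify_password() {
    stored_salt=${2%%\$*}    # everything before the first "$"
    [ "$(hash_password "$1" "$stored_salt")" = "$2" ]
}
```

The stored record never contains the password itself, so there is nothing for an admin to "retrieve", and the random salt means two users with the same password get different records.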

> Their API does not protect the password on pass-through well.

Huh? As far as I know, we don't yet offer an API to the public. Look for our new API and API documentation in our next release! I can assure you that it will protect your password as completely as possible.

> There's no proxy file system to protect file access; the application is
> running on the same server(s) as the files are stored on. You're pretty
> much looking at your files when you view them in their interface.


I will choose to take this as a compliment, be it never so false. I can only laugh at the idea that our customer data is stored directly on our webservers. That would be a fairly difficult architecture to maintain. Our customer data is stored on a dedicated storage server farm and accessed through a custom data access layer. This ensures that you see your data and only your data. However, we apparently did our job well if "You're pretty much looking at your files when you view them in our interface."

> The only other negative I know about them is there are file/mime types
> that won't upload to their sites. I'm not sure why if they're using octet
> streaming to transfer files (they should all be binary to them, even
> portable dir files) but they could be reading files in line-by-line, vs.
> byte-by-byte. But that's only a guess and it's highly unlikely. Non-adobe
> PDFs would corrupt about 20% of the time using a line-by-line read in as
> would XML files. it could also be the java applets and Flash components
> required to be installed on the local CPU in order to upload. I believe
> there are mime type limitations with client-side uploads. Many of their
> features are client-side and dependent on config/app integrity on the
> local CPU.


I guess I don't really understand this statement. Our file uploads are handled using either regular HTTP multipart-form POST or a Flash component. The Flash component uses the ActionScript FileStream class for performing the upload. If there are problems with these standard methods of performing uploads, I guess we should contact the IETF and Adobe.

On the other hand, if anybody's files are being corrupted in our system, please let us know and we will get right on to fixing that issue. As of today, nobody has reported any such problem to us.

In conclusion, I have to thank this unknown contributor for giving me something to blog about. Further, it validates all of our hard work when others throw sticks and stones at us.

Backing up Linux machines into SmartFile.

Saturday, July 3, 2010 by Ben Timby
SmartFile provides a backup client for Windows. However, if you have Linux servers, it is just as important to back them up. Since SmartFile provides FTP access to your space, this task can be easily accomplished with some tools you likely already have installed.

This article will detail the steps to perform a simple, safe, encrypted backup directly to the SmartFile servers. At the end of the article a script will be provided that you can simply install onto your system to perform nightly backups.

Required Tools

Tar is historically for creating Tape ARchives. Thus backing up to a tape would usually involve using tar. Tar has many features including compression and is great for performing backups of Linux systems. Not only can it write to a tape, but also to a file on disk. Further, it can write the archive to stdout, so it can be fed into another program.

OpenSSL is an open source library and command line application that is capable of performing myriad encryption tasks. It is basically the swiss army knife of encryption for Linux systems. For our purposes, we will use it to encrypt our backup file before sending it to the FTP server. By default openssl will read input from stdin and output to stdout. This is perfect for our purposes.

cURL is a network client that is URL driven. It allows uploading or downloading to or from FTP or HTTP servers. For us, the main feature that cURL provides is that you can stream data directly to a file on an FTP server. Let me explain. While most FTP clients will allow you to upload a file from your file system to an FTP server, this requires that the file you wish to send to the FTP server already exists on your disk. What is wanted for our backup is a way to “stream” the backup file directly to the FTP server without touching the local disk. cURL provides this with the -T option. If - is passed to -T as the file name, then the file data is read from stdin.

Now that we are familiar with the tools, let’s take a look at how we will use them all together. Linux allows multiple commands to be chained together by piping the output (stdout) of one command on to the input (stdin) of another command. The | or pipe character is used for this purpose. Thus at a high level, we will be doing the following.

tar | openssl | curl

Tar will create the backup of our system, openssl will then encrypt that backup and curl will transfer it to the FTP server, all without creating any temporary files that would otherwise need to be cleaned up later.

All that remains is to determine what parameters each of the above commands needs to be given to get the behavior we want.

Tar – Parameters.

To create an archive, you use the c option. To compress the archive using Bzip2, you use the j option. Since we want to back up the entire system, our tar command thus far is.

tar cj /

By omitting any option to save the archive to disk, tar will by default output it to stdout. This allows us to pass the archive data to the next program in our chain without saving it to disk.

There are certain directories within your Linux system that should not be backed up. Some examples are:
  • /proc – The proc file system is provided by the Linux kernel and contains information about running programs.
  • /sys – The sys file system is provided by the kernel and contains information about hardware.
  • /dev – The dev file system consists of device nodes, which represent Linux device drivers.

Backing up the above directories would be folly, as they are provided by the kernel, and some of them (/dev/zero) are actually infinite in size. So, the second set of parameters we will pass to tar will exclude these file systems.

tar cj --exclude=/proc --exclude=/dev --exclude=/sys /

You may also wish to exclude /mnt, as generally you will have other file systems mounted there. These may be remote file systems that are already being backed up via other means. Of course, /mnt may contain file systems that you wish to back up. Your system configuration will dictate your choice here.
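
Before pointing tar at a whole system, it can be reassuring to try the exclusions on a small tree. The sketch below wraps the idea in a function that streams the archive to stdout. The function name and the TAR_COMPRESS variable are my own assumptions; the variable defaults to j (bzip2, as used in this article) and can be set to z for gzip.

```shell
# Stream a compressed archive of the tree rooted at $1 to stdout,
# leaving out its proc, dev and sys subdirectories.
# Paths are stored relative to $1, so the exclude patterns start with "./".
stream_archive() {
    tar "c${TAR_COMPRESS:-j}f" - -C "$1" \
        --exclude=./proc --exclude=./dev --exclude=./sys .
}

# List what would be backed up without writing an archive to disk:
# stream_archive / | tar tjf - | head
```

Nothing touches the disk here: the archive exists only in the pipe between the two tar processes.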

OpenSSL – Parameters.

We want openssl to perform encryption, thus we pass it the enc option. Also, I have opted to use the aes-256 algorithm in cbc mode, so we must pass that as well. Finally, openssl requires a key to perform our encryption. This key will be derived from a passphrase. The derivation procedure will use a salt value, so we also provide that option. We will store the passphrase in a file, so that openssl can retrieve it from that file.

openssl enc -aes-256-cbc -salt -pass file:/etc/backup-key

And we can create the key by doing the following.

echo 'This is my backup key!' > /etc/backup-key
chmod 400 /etc/backup-key

Of course, you are well-advised to use something other than the example key above.
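
One way to get a stronger passphrase is to generate it randomly. The following is a small sketch, with a function name of my own choosing, that writes a random single-line passphrase to a file only its owner can read.

```shell
# Write a random passphrase to the file named in $1.
make_backup_key() {
    (
        umask 077    # never group- or world-readable, even briefly
        head -c 32 /dev/urandom | base64 > "$1"
    ) && chmod 400 "$1"
}

# Example (as root):
# make_backup_key /etc/backup-key
```

Keep a copy of this file somewhere other than the machine being backed up. Without the passphrase, the encrypted backup cannot be restored.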

cURL – Parameters.

Now, the final step in our backup procedure is to actually transfer the file to SmartFile. We will do this using cURL and the FTP protocol. cURL is driven by URLs, so we must provide one.

curl ftp://www.smartfile.com/backup/

This tells curl to connect to www.smartfile.com and move into the backup directory. However, if the backup directory does not exist, curl will fail. Thus we will ask curl to create it for us if it does not exist.

curl --ftp-create-dirs ftp://www.smartfile.com/backup/

Now, as I alluded to before, we want curl to upload the data that it receives from its stdin. This is achieved by using the -T option like so.

curl --ftp-create-dirs -T - ftp://www.smartfile.com/backup/

If we want to use SSL, there are a couple of other options to provide. I suggest skipping SSL if you are already encrypting the backup file. However, if you want to use SSL, you would use the following parameters.

curl --ftp-create-dirs --ftp-ssl --ftp-ssl-reqd --insecure -T - ftp://www.smartfile.com/backup/

We are almost done. The final bit of information that curl needs is a username and password. We could have provided it as part of the URL, but that would expose our credentials to anyone snooping on the machine while the backup is running. It is safer to place the credentials into a file and instruct curl to retrieve them from the file. cURL is capable of doing this using a .netrc file. You can create the .netrc file like so.

echo machine www.smartfile.com login <username> password <password> > ~/.netrc
chmod 400 ~/.netrc

Of course, replace <username> and <password> with your username and password respectively. Now we instruct cURL to use our new .netrc file.

curl --ftp-create-dirs --ftp-ssl --ftp-ssl-reqd --insecure --netrc -T - ftp://www.smartfile.com/backup/
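
To show how the three pieces might fit together, here is a minimal sketch of the whole chain. It is not the script SmartFile distributes. The function names are assumptions, and so is the remote file name, whose pattern I have borrowed from the restore example later in this article. As far as I know, curl needs a file name at the end of the URL when the data comes from stdin, since there is no local file name for it to reuse. SSL is left out here because the stream is already encrypted.

```shell
# Name for today's archive, e.g. full-2010-06-03.tar.bz2
backup_name() {
    printf 'full-%s.tar.bz2\n' "$(date +%Y-%m-%d)"
}

# $1 = passphrase file, $2 = FTP directory URL ending in "/"
smartfile_backup() {
    # tar writes the archive to stdout, openssl encrypts it in flight,
    # and curl uploads whatever arrives on its stdin.
    tar cjf - --exclude=/proc --exclude=/dev --exclude=/sys / |
        openssl enc -aes-256-cbc -salt -pass "file:$1" |
        curl --ftp-create-dirs --netrc -T - "$2$(backup_name)"
}

# Example (run as root so tar can read every file):
# smartfile_backup /etc/backup-key ftp://www.smartfile.com/backup/
```

The function's exit status is that of curl, the last command in the pipeline, so a failure in tar or openssl may go unnoticed unless the output is also checked.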

Putting it all together.

Now that you understand the basic building blocks of our backup to FTP solution, please allow me to provide you with a working script. This script was written and tested on CentOS 5.4. Some of the utilities used are out-of-date. For example, the version of curl available from the CentOS repositories uses some deprecated options, so on other distributions you may need to make modifications to these options. You will need to edit the configuration section of the script if you want to customize the behavior.

To install and use this backup script follow the steps below.

  1. Download the script to the following location and ensure it is executable.

wget http://www.smartfile.com/downloads/smartfile-backup.sh -O /usr/local/bin/smartfile-backup.sh
chmod +x /usr/local/bin/smartfile-backup.sh

  2. Customize the configuration section.
  3. Create your key and .netrc files as directed above.
  4. Finally, schedule it to run with cron. The example below will run at midnight every night.

crontab -e
0 0 * * * /usr/local/bin/smartfile-backup.sh

You can also run the script manually to ensure it works properly.

/bin/bash -x /usr/local/bin/smartfile-backup.sh

Restoring from a backup.

To restore the backup, or to retrieve files from the backup you can follow the steps below.
  1. Download the backup file.
  2. Decrypt the backup file.
  3. Use tar to extract what you need.

Download the backup file.

You can either use the SmartFile web interface or FTP to retrieve the file.

Decrypt the backup file.

You can use OpenSSL to decrypt the file. The following command line would do the trick.

openssl enc -d -aes-256-cbc -salt -pass pass:'This is my backup key!' -in full-2010-06-03.tar.bz2 -out full-2010-06-03.tar.bz2.dec

Use tar to extract what you need.


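You can either extract the entire archive or a portion of it. Below are commands to perform either task. For more information, read the tar man page.

To extract the entire archive:

mkdir -p /tmp/restore
tar xjf full-2010-06-03.tar.bz2.dec -C /tmp/restore

To extract a single file, name it without the leading slash, as tar strips the leading / from member names when it creates the archive:

mkdir -p /tmp/restore
tar xjf full-2010-06-03.tar.bz2.dec -C /tmp/restore path/to/file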
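
The decrypt and extract steps can also be chained, which avoids both the intermediate .dec file and putting the passphrase on the command line. This is a sketch under my own assumptions: the function name is made up, and it reads the passphrase from a key file as the backup did. If the backup was encrypted with an OpenSSL older than 1.1.0 and is being decrypted with a newer one, adding -md md5 may be necessary, because the default digest used to derive the key changed.

```shell
# $1 = encrypted backup, $2 = passphrase file, $3 = destination directory,
# any further arguments = archive members to extract (default: everything).
restore_backup() {
    enc_file=$1
    key_file=$2
    dest=$3
    shift 3
    mkdir -p "$dest" || return 1
    openssl enc -d -aes-256-cbc -pass "file:$key_file" -in "$enc_file" |
        tar xjf - -C "$dest" "$@"
}

# Example: restore only etc/hosts into /tmp/restore
# restore_backup full-2010-06-03.tar.bz2 /etc/backup-key /tmp/restore etc/hosts
```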

** Note **
You may receive the following warning during extraction:

bzip2: (stdin): trailing garbage after EOF ignored

This seems harmless. You can get rid of it by either writing the archive to disk before transfer or using gzip instead of bzip2. The archive still decompresses fine, but tar is apparently outputting some additional garbage when using bzip2 and outputting to stdout. I personally still use bzip2 and stdout, as the advantages (greater compression ratio, no temp disk space required) outweigh the disadvantages.

Giving back.

Wednesday, October 28, 2009 by Ben Timby
We use a lot of open source software at SmartFile. Therefore, when we have the chance, we like to give back to the community.

We recently implemented more detailed graphing and monitoring using Cacti and Nagios. If you are unfamiliar with these open source software packages, I would urge you to check them out as they are incredibly useful.

Today, we are happy to provide a plugin for Cacti to monitor proftpd, the FTP server we use at SmartFile. The plugin allows Cacti to graph the number of users connected to the FTP server as well as the number of active uploads/downloads.

To see our posting in the Cacti forums which provides this plugin as well as instructions, please see the link below.

http://forums.cacti.net/viewtopic.php?t=34722

We will use this blog to release other contributions from time to time.

Linking for fun and profit.

Wednesday, August 26, 2009 by Ben Timby
Sometimes the most useful and exciting features in software are also the simplest. While this may seem counter-intuitive, I can back this up with an example from the most recent smartfile.com software update. We included a number of useful features:

  • A new tree-view for browsing directories.
  • Ability to extract and compress archives.
  • Support for international file names.
  • Time limited accounts (expiration).

But one in particular (at least in my opinion) really makes the software much more flexible. From the title you can probably guess that this feature has to do with linking. You would be correct: the feature I am writing about today allows our users to create an ‘external link’ to a file within their FTP space which can be used to directly download a file without logging in. Such links were originally conceived to allow users to email files to colleagues quickly without having to set up a new account for them. First let’s take a look at how the feature works.

The user first locates a file to create a link for.

[Screenshot: choosing a file to link]

Then they choose the expiration and usage count options that suit their needs.

[Screenshot: creating the link]

A whole host of possibilities is enabled by this simple feature. Image hosting for eBay or similar can be accomplished by setting the usage limit to unlimited and expiration to never. The link can then be used in the src attribute of an HTML img tag. File hosting can be done by placing the link onto a website to make files available for download. Links can be sent via email, instant messages or services such as Twitter or Facebook. We are currently working on an API that would allow links to be created programmatically from a shopping cart or other external application. This would enable digital distribution of pay content or software.