
How-To Geek

Build a Download Scheduler with Little Programming Skill

We all love to download stuff from the internet, and there are heaps of great download manager tools that we can use to schedule our downloads. It might just be easier to use a download manager, but there is no harm in exploring the tools that already come with Ubuntu and making full use of them.

In this article we will show you a built-in Ubuntu tool, wget, that we can use to download stuff from the internet. On top of that, we will show you how to schedule those downloads using cron.

Download Using Wget

Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive command line tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.

Open your terminal and let’s explore how we can use wget to download stuff from the net. The basic syntax for downloading with wget is:

[code]
wget [option]… [URL]…
[/code]

This command will download the wget manual to your local drive:

[code]
wget http://www.gnu.org/software/wget/manual/wget.pdf
[/code]

Linux Cron

Ubuntu comes with a cron daemon used for scheduling tasks to be executed at a certain time. Crontab allows you to specify the actions and the times at which they should be executed. Here is how you would normally schedule a task using the command line tool.

Open a terminal window and enter crontab -e.
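
For example, an entry like the one below (reusing the wget.pdf download from earlier) tells cron to fetch the manual at 2:00 AM every day:

[code]
# minute hour day-of-month month weekday command
0 2 * * * wget http://www.gnu.org/software/wget/manual/wget.pdf
[/code]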

Each field in a crontab entry is separated by a space, with the final field (the command) allowed to contain spaces of its own. A cron entry consists of minute (0-59), hour (0-23, 0 = midnight), day of month (1-31), month (1-12), weekday (0-6, 0 = Sunday), and the command to run. The entry above downloads wget.pdf at 2 AM: the first field (0) and the second field (2) mean 2:00, the third to fifth fields (*) mean every day, month, and weekday, and the last field is the wget command that downloads wget.pdf from the specified URL.

That covers the basics of wget and how cron works. Let’s take a look at a real-life example of how to schedule a download.

Scheduling Download

We are going to download Firefox 3.6 at 2 AM. Since our ISP only gives us a limited amount of data, we need to stop the download at 8 AM. Here is how to set it up.

Only two crontab entries are needed. The first sets up a task that will download Firefox at 2 AM:

[code]
0 2 * * * wget -c "http://download.mozilla.org/?product=firefox-3.6.6&os=win&lang=en-GB"
[/code]

The -c option tells wget to resume the existing download if it has not yet been completed. Note that the URL is quoted so that the shell does not interpret the & characters in it.
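
The same option works outside of cron too: if a download gets interrupted, running the command again with -c will pick up where it left off instead of starting over. For example, using the manual URL from earlier:

[code]
wget -c http://www.gnu.org/software/wget/manual/wget.pdf
[/code]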

The second entry will stop wget at 8 AM. killall is a Unix command that kills processes by name.

[code]
0 8 * * * killall wget
[/code]

The killall wget command tells Ubuntu to stop wget from downloading the file at 8 AM.
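
Putting the two entries together, the scheduled-download portion of the crontab looks like this:

[code]
0 2 * * * wget -c "http://download.mozilla.org/?product=firefox-3.6.6&os=win&lang=en-GB"
0 8 * * * killall wget
[/code]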

Other Useful wget Commands

1. Specifying the directory and file name for a download

[code]
wget --output-document="/home/zainul/Downloads/wget manual.pdf" http://www.gnu.org/software/wget/manual/wget.pdf
[/code]

The --output-document option lets you specify the directory and the name of the file that you download. Because the file name contains a space, the path is quoted.
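
-O is the short form of --output-document, so the same command can also be written as:

[code]
wget -O "/home/zainul/Downloads/wget manual.pdf" http://www.gnu.org/software/wget/manual/wget.pdf
[/code]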

2. Downloading a website

wget is also capable of downloading an entire website.

[code]
wget -m http://www.google.com/profiles/zainul.franciscus
[/code]

The above command will download my entire Google profile web page. The ‘-m’ option tells wget to download a ‘mirror’ image of the specified URL.

Another important option tells wget how many levels of links it should follow when it downloads a website.

[code]
wget -r -l1 http://www.google.com/profiles/zainul.franciscus
[/code]

The above wget command uses two options. The first option, ‘-r’, tells wget to download the specified website recursively. The second option, ‘-l1’, tells wget to only get the first level of links from the specified website. We can go deeper with ‘-l2’, ‘-l3’, and so on.
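
For example, to follow links two levels deep from the same page:

[code]
wget -r -l2 http://www.google.com/profiles/zainul.franciscus
[/code]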

3. Ignoring robots.txt

Webmasters maintain a text file called robots.txt, which lists the URLs that a web crawler such as wget should not crawl. We can tell wget to ignore robots.txt with the ‘-erobots=off’ option. The following command tells wget to download the first page of my Google profile and ignore robots.txt:

[code]
wget -erobots=off http://www.google.com/profiles/zainul.franciscus
[/code]

Another useful option is -U, which sets the user agent string so that wget identifies itself to the server as a browser. Take note that masking one application as another may violate the terms of service of a web service provider.

[code]
wget -erobots=off -U Mozilla http://www.google.com/profiles/zainul.franciscus
[/code]

Conclusion

Wget is a very old school yet hackable GNU software package that we can use to download files. Wget is a non-interactive command line tool, which means we can let it run in the background on our computer without having to start any other application. Check out the wget man page

[code]
$ man wget
[/code]

to understand other options that we can use with wget.

Links

Wget Manual
How to Combine Two Downloaded Files When wget Fails Halfway Through
Linux QuickTip: Downloading and Un-tarring in One Step

Zainul spends his time trying to make technology more productive, whether it’s Microsoft Office applications, or learning to use web applications to save time.

  • Published 08/12/10

Comments (4)

  1. Martin Sjåstad

    Where would the file be downloaded to if you run these commands?

    I’ve been wondering about that for a while now, would be awesome if someone knew :)

    Much Love,
    Martin

  2. zfranciscus

    @Martin we can specify where we want to put our download with the --output-document option.

    Example:

    wget --output-document=[download location] "http://download.mozilla.org/?product=firefox-3.6.6&os=win&lang=en-GB"

    Note: Replace the [download location] with the full path to a directory.

    Cheers

  3. ndoel

    how to download from websites that require authorization account, such as rapidshare premium account, ziddu etc. ???

  4. Timothy Gott

    Just tried to modify your instructions a little to work the way I work. There are several ISOs that I try download from time to time so I created a text file to keep them “wget-stuff” and put that in the scheduler like so “wget -c -i wget-stuff” and set it to run every two hours. Why two hours you might ask?; well I’m glad you did :-) … Well, it’s because I occasionally stop all of the wget processes manually when my internet connection is getting slow. And then I might forget to start them up again.

    However I found that (if I don’t kill the wget processes) wget does something very peculiar. It likes to add extra data to a file if it is set to start on a file that another wget instance is already working on. For example, if you have a 600meg file that wget is working on, and then you start another instance of wget targeted at that same file in that same directory, you might end up with an 800meg file or more depending on when that additional instance of wget was set to run and how long it ran. Obviously the finished file turns out to be trash.

    Ok well I finally figured out a solution. I’m now using aria2c instead of wget. It doesn’t have the hangups that I’ve encountered in wget and it works in all the same ways that matter and new ways that really improve things; like being able to download from two resources and bring it all back together. Well, I’m too tired to think of a better way to say that but you’ll see what I mean if you try it (or read the man page).

    Hope this helps someone else out. This stuff took me hours to sort through :-)
