The Worlds of RSS, XML, HTML, and Linux Meet

Add LinuxWorld.com news headlines to your site, instantly!

I'm a rabid Linux fan. I write books about it, I have servers running it, and I even have various flavors of Linux as dual-boot defaults on my PCs. But keeping up with Linux news can be a bit of effort, particularly if I want to have that up-to-date news on a Web page, rather than in an RSS Aggregator.

Fortunately, it's a matter of ten minutes of shell script programming to remedy this. In this article, I'll show you step-by-step exactly how to create a cron job that'll automatically create an HTML file that contains the latest headlines from LinuxWorld.com. Just don't tell their Webmaster! :-)

Getting to the Right Page

Like many sites, LinuxWorld.com has "XML" buttons on its various category pages, so it takes only a few seconds to identify that http://www.linuxworld.com/topic_content/c_news.rss is the URL of the RSS feed for LinuxWorld.com's news.

Now, to tap into that XML feed (RSS files are written in XML), I'll use the fast, simple curl program, which makes it very easy to get files from Web servers, FTP servers, and much more. It's well worth knowing if you want to script anything Internet-related, and you probably already have it on your Linux box. For step one, here's a simple shell script I'll call get-linuxworld-news.sh:


#!/bin/sh

# Get the latest Linux news from LinuxWorld.com

url="http://www.linuxworld.com/topic_content/c_news.rss"

/usr/bin/curl --silent "$url"

That's it. When I run this script, feeding the output to head so as not to be overwhelmed, here's what I see:

$ sh get-linuxworld-news.sh | head
<?xml version="1.0" encoding="ISO-8859-1"?>
<rdf:RDF  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns="http://purl.org/rss/1.0/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
><channel rdf:about="http://www.linuxworld.com/rss/default.rss">
    <title>LinuxWorld: News</title>
    <description>The latest articles from News @ LinuxWorld</description>
    <link>http://www.linuxworld.com/</link>
    <image rdf:resource="http://www.linuxworld.com/images/aa_logo.gif"/>

 


Lots of weird XML stuff, but a little more examination shows that the key XML fields we want are <title>, <link>, and <description>, so we'll slip in a grep call to look for just those:

 

$ sh get-linuxworld-news.sh | grep -E '(<title|<link|<desc)' | head -6
    <title>LinuxWorld: News</title>
    <description>The latest articles from News @ LinuxWorld</description>
    <link>http://www.linuxworld.com/</link>
    <title>Flash To be Ported to Linux?</title>
    <link>http://www.linuxworld.com/story/43917.htm</link>
    <description>Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</description>

 

 

Lots better. The problem now, though, is that we don't really need the top three lines of the output, which describe the feed itself rather than a story. A quick call to sed solves this. You might not know it, but in addition to changing old to new, sed can also extract blocks of the input stream based on specific parameters. To see lines 4 through the end, for example, use sed -n '4,$p' as shown:

$ sh get-linuxworld-news.sh | grep -E '(<title|<link|<desc)' | sed -n '4,$p' | head -3
    <title>Flash To be Ported to Linux?</title>
    <link>http://www.linuxworld.com/story/43917.htm</link>
    <description>Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</description>


Looks like we're getting somewhere, finally.
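
If you'd like to experiment with the grep and sed steps without hitting the network, here's a minimal sketch that runs them against an inline sample. The sample_feed contents and the extract_items helper name are made up for illustration; only the grep and sed commands come from the steps above.

```shell
#!/bin/sh
# Offline sketch: run the grep and sed steps against a made-up feed fragment.
# sample_feed and its contents are hypothetical stand-ins for the curl output.

sample_feed() {
  cat << 'EOF'
<channel rdf:about="http://www.example.com/rss/default.rss">
    <title>Example: News</title>
    <description>The latest articles from Example</description>
    <link>http://www.example.com/</link>
    <image rdf:resource="http://www.example.com/logo.gif"/>
</channel>
<item rdf:about="http://www.example.com/story/1.htm">
    <title>First headline</title>
    <link>http://www.example.com/story/1.htm</link>
    <description>Summary of the first story.</description>
</item>
EOF
}

# Keep only title, link, and description lines, then drop the
# three channel-level lines at the top.
extract_items() {
  grep -E '(<title|<link|<desc)' | sed -n '4,$p'
}

sample_feed | extract_items
```

Note that the '4,$p' offset assumes the feed always opens with exactly three matching channel lines. If the feed layout changes, that number would need adjusting.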


Changing the Order of Lines

The next step is to flip the first and second lines of each three-line sequence so that the link appears before the title. This sounds fairly daunting, but it turns out to be a perfect job for awk, a simple interpreted programming language that's been part of Unix since the late 1970s. You could certainly use Perl for this too, though if you were going to crack open Perl, you'd probably just write the entire script in it. But that wouldn't be anywhere near as interesting as a nice handy shell script, would it?

So here's another version of the script, but with the necessary awk syntax tucked in so we can change the order of lines in the output stream:

#!/bin/sh

# Get the latest Linux news from LinuxWorld.com

url="http://www.linuxworld.com/topic_content/c_news.rss"

# Temp file for the awk program; removed automatically when the script exits
temp="/tmp/$(basename "$0").$$" ; trap "/bin/rm -f $temp" 0

# Each item arrives as title, link, description: emit link, title, description
cat << "EOF" > "$temp"
{ if (NR % 3 == 1) {
    title=$0
  } else if (NR % 3 == 2) {
    link=$0
  } else {
    print link; print title ; print $0
  }
}
EOF

/usr/bin/curl --silent "$url" | \
  grep -E '(<title|<link|<desc)' | \
  sed -n '4,$p' | \
  awk -f "$temp"

 

 

This is really close to the final format, believe it or not. Here's the output so you can see for yourself:


$ sh get-linuxworld-news.sh | head -3
    <link>http://www.linuxworld.com/story/43917.htm</link>
    <title>Flash To be Ported to Linux?</title>
    <description>Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</description>
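
The reordering logic can also be tried on its own. The sketch below is a variation rather than the script's method: it passes the awk program inline instead of writing it to a temp file, and the reorder_items name and the three sample lines are made up for illustration.

```shell
#!/bin/sh
# Sketch: swap the first two lines of every three-line group using an
# inline awk program instead of a temp file. reorder_items is a
# hypothetical helper name.

reorder_items() {
  awk '
    NR % 3 == 1 { title = $0; next }   # 1st line of a group: the title
    NR % 3 == 2 { link  = $0; next }   # 2nd line of a group: the link
                { print link; print title; print $0 }  # 3rd: emit reordered
  '
}

printf '%s\n' '<title>T</title>' '<link>L</link>' '<description>D</description>' |
  reorder_items
```

Inlining the program avoids the temp file and the trap, at the cost of having to keep single quotes out of the awk code.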

 

All that's left is to turn the XML tags into HTML tags, which can be done with sed in a much more traditional and typical application of the utility:

/usr/bin/curl --silent "$url" | \
  grep -E '(<title|<link|<desc)' | \
  sed -n '4,$p' | \
  awk -f $temp | \
  sed -e 's/<link>/<li><a href="/' -e 's/<\/link>/">/' \
      -e 's/<title>//' -e 's/<\/title>/<\/a><br>/' \
      -e 's/<description>//' -e 's/<\/description>/<\/li>/'

 


The result of this updated script is almost exactly what I'd like:

$ sh get-linuxworld-news.sh | head -3
    <li><a href="http://www.linuxworld.com/story/43917.htm">
    Flash To be Ported to Linux?</a><br>
    Hard on the heels on the Linux version of Flash Player, the free player for displaying Flash content, and Linux flavors of several server applications, Macromedia is contemplating a more comprehensive move toward Linux, said its chief software architect this week - starting, he thinks, with Flash MX.</li>

 


The only problem here is that we need to add a <ul> to the top and a </ul> to the bottom, which is easily done with two additional echo statements. Put it all together and here's the final script:

#!/bin/sh

# Get the latest Linux news from LinuxWorld.com

url="http://www.linuxworld.com/topic_content/c_news.rss"

# Temp file for the awk program; removed automatically when the script exits
temp="/tmp/$(basename "$0").$$" ; trap "/bin/rm -f $temp" 0

# Each item arrives as title, link, description: emit link, title, description
cat << "EOF" > "$temp"
{ if (NR % 3 == 1) {
    title=$0
  } else if (NR % 3 == 2) {
    link=$0
  } else {
    print link; print title ; print $0
  }
}
EOF

echo "<ul>"             # assuming you want a bullet list

# Fetch the feed, keep the item fields, reorder them, then swap XML tags for HTML
/usr/bin/curl --silent "$url" | \
  grep -E '(<title|<link|<desc)' | \
  sed -n '4,$p' | \
  awk -f "$temp" | \
  sed -e 's/<link>/<li><a href="/' -e 's/<\/link>/">/' \
      -e 's/<title>//' -e 's/<\/title>/<\/a><br>/' \
      -e 's/<description>//' -e 's/<\/description>/<\/li>/'

echo "</ul>"

exit 0
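
For testing away from the live feed, the same steps can be wrapped as a filter that reads feed XML on standard input. This is only a sketch under a few assumptions: rss_to_html is a hypothetical name, the sample input is made up, the awk program is passed inline rather than through a temp file, and sed uses | as its delimiter so the slashes in closing tags need no escaping.

```shell
#!/bin/sh
# Sketch: the whole pipeline as one filter that reads feed XML on stdin.
# rss_to_html is a hypothetical name; the steps mirror the script above.
# The second sed uses | as its delimiter, so </link> needs no backslashes.

rss_to_html() {
  echo "<ul>"
  grep -E '(<title|<link|<desc)' |
    sed -n '4,$p' |
    awk 'NR % 3 == 1 { title = $0; next }
         NR % 3 == 2 { link = $0; next }
         { print link; print title; print $0 }' |
    sed -e 's|<link>|<li><a href="|' -e 's|</link>|">|' \
        -e 's|<title>||' -e 's|</title>|</a><br>|' \
        -e 's|<description>||' -e 's|</description>|</li>|'
  echo "</ul>"
}

# Made-up input: three channel lines followed by one item.
printf '%s\n' \
  '<title>Feed</title>' \
  '<description>About the feed</description>' \
  '<link>http://www.example.com/</link>' \
  '<title>First headline</title>' \
  '<link>http://www.example.com/1.htm</link>' \
  '<description>Summary.</description>' |
  rss_to_html
```

With the live feed, the call would look like: /usr/bin/curl --silent "$url" | rss_to_html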

Adding the Headlines to Your Web Page

Creating a Web page fragment with this script is straightforward:

$ sh get-linuxworld-news.sh > headlines.html

 

To include that fragment in a Web page, use server-side includes (SSI), which would look something like this:

<!--#include virtual="headlines.html"-->

Every time that page is served up to a visitor, they'll see the contents of the headlines.html file.

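How to keep them up-to-date? Put the get-linuxworld-news.sh invocation into a cron job. The entry below, for example, rebuilds the HTML output file twice a day, at 6:09 a.m. and 6:09 p.m. In a real crontab you'd typically give full paths to both the script and the output file, since cron runs with a minimal environment.

9 6,18 * * *            get-linuxworld-news.sh > headlines.html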
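
One wrinkle with redirecting straight into headlines.html is that the file is emptied before curl has fetched anything, so a failed download leaves visitors with an empty list. Here's a hedged sketch of one way around that: write to a scratch file first and swap it into place only if it contains at least one <li> entry. The update_fragment name and its arguments are hypothetical, not part of the original script.

```shell
#!/bin/sh
# Sketch: rebuild an HTML fragment without leaving an empty or
# half-written file in place. update_fragment is a hypothetical helper:
#   update_fragment TARGET COMMAND [ARGS...]

update_fragment() {
  target=$1; shift
  tmp="$target.$$"
  # Run the generator into a scratch file next to the target, and keep
  # the result only if it produced at least one list item.
  if "$@" > "$tmp" && grep -q '<li>' "$tmp"; then
    mv "$tmp" "$target"   # rename in the same directory replaces the old file in one step
  else
    rm -f "$tmp"          # keep the previous headlines on failure
    return 1
  fi
}
```

A cron job could then run something like update_fragment headlines.html sh get-linuxworld-news.sh, again with full paths substituted for both file names.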


 

That's it. Not too bad, was it?

It's also worth noting that this use of shell scripts to parse and format XML has more applications than just a bullet list of headlines from this site. For example, go to http://www.casino-bookstore.com/ and have a close look at the "Latest Gambling News" box: it's using an almost identical script to keep track of the gambling news XML feed from about.com.

Another example? Go to http://www.healthy-bookstore.com/ and look at the medicinenet news feed. Again, it's using curl and sed to turn the XML data into HTML data.

More Stories By Dave Taylor

Dave Taylor, a contributing editor to Linux.SYS-CON.com, has been involved with the Linux and Unix community since 1980 and has written a number of best-selling Unix books. Currently, he writes, teaches, and works as a management consultant to tech startups, along with his new venture, Ask Dave Taylor!, www.askdavetaylor.com and his personal blog is www.blog.Linux.SYS-CON.com. To contact Dave, please go to /www.intuitive.com/contact.shtml.



Most Recent Comments
Otis_INF 03/16/04 03:57:36 PM EST

Don't get me wrong, I like feeds in RSS formats and use them a lot. However, RSS has a problem: bandwidth.

If a site exposes an RSS feed, and 50,000 people subscribe to that feed and refresh it every 10 minutes, you get 300,000 requests for that feed per hour. You can do the math yourself on how much bandwidth that consumes if the feed is larger than a couple of bytes.

If you crank out an e-mail with the headlines each day to these 50,000 subscribers, you save bandwidth in most cases.

What should be done is that the RSS client first asks the rss feed server if the feed has changed past a given date/time. If not, no fetch is done. Correct me if this is already the case, but I fear it isn't (most rss feeds are dynamically produced, (perhaps with cached contents) so a simple HTTP poll won't do.)

anandrajan 03/16/04 03:55:34 PM EST

It's nothing about the technology, and everything about the human side of things.

RSS lets me keep track of ten times as many news sites as I would be able to by visiting each of them individually.

From a website's perspective, it makes it much more likely that your visitors won't drop you due to lack of time.

georgeH 03/16/04 03:54:49 PM EST

You must be reading the wrong stories about RSS. It doesn't basically serve up headlines, it basically serves up a diff of the web since you last looked at it. That's probably the best way of describing just how powerful it really is.

cmacb 03/16/04 03:53:57 PM EST

I actually don't get what's so revolutionary about RSS. I continually see references to it as an example of "PUSH" technology. To me that means the server initiates the transfer of data to the client.

I've never seen an example of RSS working this way. At best, I hit a web page, which has some RSS scripting which then goes and hits dozens of other pages with RSS feeds. This could all be done on the client, and in fact, I may not only be grabbing Slashdot headlines by visiting another server, but I may also be grabbing them at the same time by opening up Evolution, or any of dozens of other programs. I can't remember the last time I looked at Slashdot headlines using Evolution, but it's right there on my summary page just the same.

It basically serves up headlines. It's pretty useless without conventional HTML/CSS behind it.

My concern is that once it REALLY takes off there are going to be millions of people running RSS harvesting programs 24 hours a day. That means servers having to respond to all these behind the scenes inquiries for data that is almost NEVER going to be looked at.

This sounds like something that could be done a lot more efficiently by the likes of Google. They scan everything anyway, no reason they can't summarize much of it too (and they are starting to do this).

And I still don't see how RSS will end Spam. Most legitimate advertisers have stopped using Spam already. The con artists who still Spam know that there is an endless supply of suckers. The only thing that will end e-mail Spam will be to either end e-mail, or create laws that will make e-mail useless.

stonebeat 03/16/04 03:52:13 PM EST

1) RDF Site Syndication; or
2) Really Simple Syndication????

Which one is correct?

It depends which version you are talking about. RSS 1.0 is RDF, RSS 2.0 is Simple.

Basically, the format was developed by Netscape, simplified for a quick release, abandoned by Netscape, UserLand/Dave Winer released their own version (Simple), and everyone else released another version (RDF).

RSS 2.0 is not a successor to RSS 1.0; Dave Winer merely leapfrogged them in versioning to try and co-opt the format. Tricks like that caused a massive chunk of the RSS developers to abandon the format and create something much more technically sound, Atom.

RSS 1.0 is much more closely aligned with the original aims of RSS; RSS 2.0 more closely resembles the simplified format that was released in a hurry to get to market.

My advice is to publish RSS 1.0 and RSS 2.0 feeds, and as soon as Atom gets to 1.0 and the majority of readers support that, switch to that and drop RSS. RSS is too prone to game-playing by Dave Winer and bitchiness by the whole community. Switching to Atom won't rid you of this entirely, Dave has recently been stating that as far as he is concerned, Atom is a "type of" RSS.

giveuptheghost 03/16/04 03:49:40 PM EST

If you use the Mozilla browser, NewsMonster is a great RSS add-on. It's cross-platform, and the basic version is free and open source. (There is a Pro version with a bunch more features for a fee.) It installs as a second sidebar in Mozilla, and you can read feeds like you read e-mail in most e-mail clients. It also installs with about twenty popular feeds to get you started. It has a few bugs, but it's my favorite one overall.

necrogram 03/16/04 03:43:44 PM EST

Any recommendations for a good RSS reader for Win32?

stuwee 03/16/04 03:41:54 PM EST

A real-time form of HTML would be a completely new concept altogether. Although conceptually a good idea, it means developing a new client/server architecture. The good thing about RSS is that it works over existing technology - the same way that people are excited about broadband over power lines - the technology is already in place.

numbthumb 03/16/04 03:41:18 PM EST

Why not introduce a new option into http, like modifications-since (similar to if-modified-since)? The server would return a "not modified" state if nothing was changed, and a diff (content-type=text/diff-script?) if there have been changes. For xhtml, this could even be done on a tag-by-tag basis, rather than line-by-line. Servers not supporting this option would just return the full page, or one could use if-modified-since as a fallback. Using the "Refresh" meta-tag, automatic updating every 60 seconds or such would be easy.

A3thling 03/16/04 03:40:29 PM EST

It's an issue of scalability. A decent webserver can handle a million hits an hour without much difficulty, but if it has to maintain a million open socket connections (which it would if it was a site that people liked to keep open, like /.), then you would quickly run into resource problems.

pxtl 03/16/04 03:39:58 PM EST

I'm curious about RSS - rather than breaking into a new technology, why not extend the existing platform? Why not set up a real-time form of html? Just have the user log-in to the webpage, and then the server sends diff information to the user whenever there's a change. Thus, there's no hitting the "refresh" button over and over again in your browser, and no wasting time downloading the full page over and over again, only the relevant diff info. People use webpages as chat systems all the time, why not make it work right and handle refreshing server-side?