Scraping Files with Fancy Scripting Tricks
Parsing and extracting specific data from an information file

Last month we finally wrapped up the long journey toward creating a useful shell script with the hi-low game. Alright, "useful" might be a bit of a stretch, but if you've read through all the columns leading up to this point, you should have a good understanding of the basics of creating and debugging a shell script, a skill that will prove invaluable as you travel further down the Linux and Unix path.

In this column I'm going to present a short shell script that does something darn useful for those of you who secretly are also running Mac OS X, but even if you're not, it's going to be an interesting script to learn.

Parsing XML "plist" Files
A common task you'll want to accomplish in shell scripts is parsing and extracting specific data from an information file. Sometimes these files are well formed, like the /etc/passwd file, but other times their format is more complex or obscure. Mac OS X (which is built upon a FreeBSD core) has lots of these obfuscated data files created in XML format.

A typical file, the bookmarks file for Apple's Safari browser, stores an individual bookmark this way:

<dict>
    <key>URIDictionary</key>
    <dict>
        <key></key>
        <string>http://www1.dailycamera.com/</string>
        <key>lastVisitedDate</key>
        <string>134970860.4</string>
        <key>title</key>
        <string>Camera</string>
    </dict>
    <key>URLString</key>
    <string>http://www1.dailycamera.com/</string>
    <key>WebBookmarkType</key>
    <string>WebBookmarkTypeLeaf</string>
    <key>WebBookmarkUUID</key>
    <string>CB24B-861-1D8-AE1-000A97EC4</string>
</dict>

Don't panic. The only thing you need to notice here is that the URL appears on the line immediately after the URLString key, and that the name of the bookmark entry appears on the line immediately after the title key.

Extracting Lines from a File with Grep
You probably already know that you can use the grep command to extract lines that match a specific pattern. However, if you're running Linux, you have a more powerful version of grep - GNU grep - which lets you extract a specified number of lines before or after each matching line too. Perfect!

The first step in writing this script is to use grep to extract the lines that match the two fieldnames specified, plus the one line immediately following each match. This is done with the -A1 flag:

bm="$HOME/Library/Safari/Bookmarks.plist"

grep -A1 -E '(>URLString<|>title<)' "$bm"

(I've assigned the full pathname to the variable "bm" for convenience.) Notice that I'm also using a simple regular expression to match lines that have the pattern >URLString< or >title<. The -E flag tells grep to treat the pattern as an extended regular expression, which is what lets the | mean "or" without a backslash in front of it.
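To try this first step without touching a real bookmarks file, here's a minimal sketch that runs the same grep against a small sample file. The sample file and its temporary location are assumptions made purely for illustration; the contents mimic the bookmark entry shown earlier.

```shell
#!/bin/sh
# Hypothetical stand-in for $HOME/Library/Safari/Bookmarks.plist:
# one bookmark entry, trimmed to the lines that matter here.
bm=$(mktemp)
cat > "$bm" <<'EOF'
<dict>
<key>URIDictionary</key>
<dict>
<key>title</key>
<string>Camera</string>
</dict>
<key>URLString</key>
<string>http://www1.dailycamera.com/</string>
</dict>
EOF

# Print every line matching either fieldname, plus the one line after it.
first_pass=$(grep -A1 -E '(>URLString<|>title<)' "$bm")
printf '%s\n' "$first_pass"

rm -f "$bm"
```

The result contains each key line followed by its value line. With GNU grep you may also see a "--" line between groups of matches that aren't adjacent in the file.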

We're getting there. The problem now is that we have both the lines that contain the information we want and the lines that match the fieldnames. Another job for grep, this time inverting the test to show only the lines that don't match the specified pattern:

grep -A1 -E '(>URLString<|>title<)' "$bm" |
grep -v -E '(>URLString<|>title<)'
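One wrinkle worth knowing about: when -A is in use, GNU grep prints a "--" line between groups of matches that aren't adjacent, and the inverted grep lets those lines through. The sketch below is one way to handle that, assuming a made-up two-bookmark sample file; the third grep is an addition for illustration, not part of the pipeline above.

```shell
#!/bin/sh
# Hypothetical sample data: two bookmark entries in the format shown earlier.
bm=$(mktemp)
cat > "$bm" <<'EOF'
<dict>
<key>title</key>
<string>Camera</string>
</dict>
<key>URLString</key>
<string>http://www1.dailycamera.com/</string>
<key>WebBookmarkType</key>
<string>WebBookmarkTypeLeaf</string>
<dict>
<key>title</key>
<string>NYT</string>
</dict>
<key>URLString</key>
<string>http://www.nytimes.com/</string>
EOF

# Step 1: matching lines plus one line of trailing context.
# Step 2: drop the fieldname lines themselves.
# Step 3: drop the "--" group separators that grep adds with -A.
values=$(grep -A1 -E '(>URLString<|>title<)' "$bm" |
  grep -v -E '(>URLString<|>title<)' |
  grep -v -e '^--$')
printf '%s\n' "$values"

rm -f "$bm"
```

What's left is strictly alternating name and URL lines, which is the shape the rest of the script relies on.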

Almost done, actually. Here's an example of how the output looks now (GNU grep also prints a "--" separator line between non-adjacent groups of matches, omitted here):

<string>Camera</string>
<string>http://www1.dailycamera.com/</string>
<string>CD</string>
<string>http://www.coloradodaily.com/</string>
<string>Gnews</string>
<string>http://news.google.com/</string>
<string>NYT</string>
<string>http://www.nytimes.com/</string>
<string>WSJ</string>
<string>http://online.wsj.com/home/us</string>

All that's left is to clean up the format a bit.

Chopping Lines with the Cut Command
One command that I use quite frequently in shell script programming, though many Linux folk have never heard of it, is cut. Specify a delimiter and what field or fields you'd like, and it'll extract just those fields from the input stream. For example, the /etc/passwd file has a number of different data fields separated by colons. Extracting just the third field takes a very simple command: cut -d: -f3 /etc/passwd
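Here's a quick sketch of cut at work on a single line in /etc/passwd format. The account shown is made up for illustration, so the example behaves the same on any system.

```shell
#!/bin/sh
# A made-up line in /etc/passwd format (seven colon-separated fields).
entry='taylor:x:501:20:Dave Taylor:/home/taylor:/bin/bash'

# -d: sets the delimiter to a colon; -f3 selects the third field (the user ID).
uid=$(printf '%s\n' "$entry" | cut -d: -f3)
echo "$uid"

# A comma-separated field list pulls several fields at once:
# here the login name (field 1) and the login shell (field 7).
name_shell=$(printf '%s\n' "$entry" | cut -d: -f1,7)
echo "$name_shell"
```

When more than one field is selected, cut joins them with the same delimiter it split on.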

Got it? Now, let's look at how we can use cut to strip off everything up to and including the first ">" and everything from the second "<" onward in each matching line.

cut -d\> -f2 | cut -d\< -f1

Not the most elegant or graceful solution, but definitely quick and dirty, with an emphasis on quick. The first command keeps only what lies between the first and second ">" symbols, then the second shows only what's on the line prior to the first remaining "<".

For the first line above, <string>Camera</string>, the first cut would produce Camera</string (the trailing ">" is the delimiter, so it's dropped) and the second would produce Camera, exactly as we hoped.
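To watch the two cuts do their work one at a time, here's a small sketch that feeds them sample lines directly. The variable names are just for illustration.

```shell
#!/bin/sh
line='<string>Camera</string>'

# Splitting on ">", field 2 is whatever sits between the first and second ">".
step1=$(printf '%s\n' "$line" | cut -d\> -f2)
echo "$step1"

# Splitting on "<", field 1 is whatever comes before the first "<".
step2=$(printf '%s\n' "$step1" | cut -d\< -f1)
echo "$step2"

# The same pair of cuts applied to a stream of lines, one value per line.
cleaned=$(printf '%s\n' \
  '<string>Camera</string>' \
  '<string>http://www1.dailycamera.com/</string>' |
  cut -d\> -f2 | cut -d\< -f1)
printf '%s\n' "$cleaned"
```

Because cut works on fields rather than character positions, any leading whitespace before the opening tag lands in field 1 and is discarded along with it.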

Are we done? Not quite, because while it's useful to be able to produce an output of bookmark name, URL, bookmark name, URL, it'd be much nicer to produce HTML output that can then be viewed in any Web browser. Doing this, however, is a bit trickier and involves learning how you can hook a structured block of scripting code into a pipeline.

And that, I'm afraid, will have to wait until next month. See you then!

About Dave Taylor
Dave Taylor, a contributing editor to Linux.SYS-CON.com, has been involved with the Linux and Unix community since 1980 and has written a number of best-selling Unix books. Currently, he writes, teaches, and works as a management consultant to tech startups, alongside his new venture, Ask Dave Taylor! (www.askdavetaylor.com). His personal blog is www.blog.Linux.SYS-CON.com. To contact Dave, please go to www.intuitive.com/contact.shtml.
