Darwinian Web
Adam Green's thoughts on the evolution of the Internet

Posts tagged as: xml

Moving on to Python

Posted on Friday, April 14, 2006 at 10:20 AM (permalink)

I hesitate to say that I've decided to move to Python from PHP. I got plenty of emails from people telling me that PHP would surely solve my XML programming problems, but it doesn't seem to be able to do it for me. There are lots of great tools written in PHP, so it must be a personal issue. I'll know when I've found the right language, because I'll be able to do XML programming without repeatedly getting random errors. I plan on working my way through "Hacking RSS and Atom" by Les Orchard, and "Beginning RSS and Atom Programming" by Danny Ayers and Andrew Watt. Both books use Python for their examples, so that should give me the background I need to see how the pros do it. After Python comes XSLT, which I also get lots of emails about. The funny thing is that I can get everything I need done with each language I try, except for XML parsing. Reading and writing files from the Web, parsing for hyperlinks, etc. are no problem. This blog is generated with my own FoxPro code and it seems to be working fine. I may be rusty when it comes to serious coding, but I still think there is something missing in the area of XML that isn't my fault.

There has to be a better way to do XML programming

Posted on Tuesday, April 11, 2006 at 8:26 AM (permalink)

Two weeks ago I decided to do some programming for analysis of link patterns among bloggers. I had just given up on Ruby out of frustration over its poor XML libraries, so I decided to try out PHP. I haven't used PHP since the late Nineties, but it is a simple enough language, so browsing a few books showed me what I needed to know. I was able to write the code to parse out the links from Tech Memeorandum and then autodiscover the RSS feeds on these pages without much trouble at all. I'm not a great coder, but I can pick things up fast, and can generally force my way through most programming issues. Then I ran into the XML libraries in PHP and came to a dead halt again. I need to read through the RSS feeds of each blog I find on Tech Memeorandum to find the links to other blogs, and that means parsing the XML of these feeds. I've been beating on this problem on and off for the past week and a half, and am about to give up again. Giving up on a programming problem is not something I do lightly. The whole point of being a programmer is never letting the machine beat you. I also have enough confidence to think that if I'm having so many problems, lots of other people are dealing with the same thing.
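For readers who haven't seen feed autodiscovery, here is a rough sketch of the idea in Python rather than the PHP described above. It is not the code from this project. It assumes pages advertise their feeds through link tags with rel="alternate" and an RSS or Atom MIME type, and the names FeedLinkFinder and discover_feeds are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that conventionally mark a feed in a <link> tag.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. <link disabled>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        if "alternate" in rel and attr.get("type", "").lower() in FEED_TYPES:
            href = attr.get("href")
            if href:
                # Resolve relative hrefs against the address of the page.
                self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return the feed URLs found in one HTML page (hypothetical helper)."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

The HTML parser in the standard library is forgiving about sloppy markup, which is why this half of the job tends to be painless. The strict XML parsing of the feeds themselves is where the trouble starts.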

What I've decided to do in response is work with one of my favorite programmers from my Andover.net days to try and build a better language solution for XML processing, with an emphasis on RSS and OPML. A few weeks ago John Casey emailed me after I posted my frustration with Ruby, and asked why I don't just write my own language for this type of work. We've been talking about this ever since, and now I'm ready to go ahead. I'm not capable of writing my own XML parser, at least not one that isn't a horrible hack, but I do know a lot about language design, especially about making programming languages easy to use. John, however, is a great coder, and if he thinks he can write a clean, fast parser, I believe him.

The idea at first will be to create a library of functions that are real smart about RSS and OPML. We're not sure what language this will be working with, but since the library will be written in C, it should be possible to add it to all of the standard Web languages, like Perl, Python, PHP, etc. I'm interested in having the library handle all the standard tasks you would need when working with RSS and OPML, so it should be possible to read multiple feeds and combine them in interesting ways in just a few lines of code. Once this library is built, we can see about possibly extending it into more of a mini-language.
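To make "read multiple feeds and combine them in just a few lines" concrete, here is a hedged sketch of what a caller might want, written with Python's standard library. It is not the planned C library or its API. The function names read_items and combine_feeds are invented, and the sketch assumes well-formed RSS 2.0 with RFC 822 style pubDate values that all carry a timezone.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def read_items(rss_text):
    """Return (date, title, link) tuples from one RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        if pub is None:
            continue  # Assumption: undated items are simply skipped.
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((parsedate_to_datetime(pub), title, link))
    return items


def combine_feeds(rss_texts):
    """Merge the items of several feeds into one list, newest first."""
    merged = []
    for text in rss_texts:
        merged.extend(read_items(text))
    merged.sort(key=lambda entry: entry[0], reverse=True)
    return merged
```

With helpers like these, the calling code really is one line: combine_feeds([feed_a, feed_b]). The hard part, which this sketch skips entirely, is coping with feeds that are not well formed.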

The working title for this library/language is OPML Script, but that name may change as its functionality expands to more general XML tasks. This will be released under an Open Source license of some type, so it will be available for no charge. John and I will share the ownership of the copyright, although there doesn't seem to be any likelihood of ever making money from it. I've said in the past that I didn't want to get directly involved with any startups for at least a year, but this is something that I need for my own work, so I don't have any choice. If I want something that will let me program in an easy manner, I'm going to have to help build it. We don't have any delivery schedule yet, but we hope to have something we can demonstrate by OPML Camp on May 20th.

A CTO's guide to Web 2.0

Posted on Saturday, April 1, 2006 at 11:20 AM (permalink)

A couple of days ago I had breakfast with a former Chief Technology Officer of a REALLY big telco. He had attended the RSS Alley Geek Dinner the night before, and I could tell that even though he was one generation ahead of me, we had a similar take on software and computer technology. He was in Boston to have meetings with various people as a way of learning more about Web 2.0, so I volunteered to get together with him the next day to share my definition from a fellow CTO's perspective. I won't give his real name, because I didn't ask his permission, and this post isn't really about him. It is more about what any CTO needs to consider when trying to run a software development effort in the current Internet environment. For the purpose of this essay, I'll call him Jack.

The funny thing is that Jack's previous company had about 4,000 times more employees and sales than my company, yet we had exactly the same concerns about the new philosophy of development and business surrounding Web products. The insane thing is that Jack's company was valued at only 100 times that of my company when we got acquired, but that was the craziness of February, 2000.

I talked to Jack about four broad areas of change that any CTO needed to think about, but they all came down to one basic issue, a lack of control. It isn't that CTOs have to be control freaks, although they should be. It is a CTO's job to think ahead to what can go wrong, and try to make sure those blocks don't interfere with whatever technology tasks the company needs to accomplish. In a way, a CTO is like the lawyer for a company's technology, always looking for pitfalls well before they are reached. Web 2.0 forces a company to adopt the one thing any good CTO should loathe, dependencies. You have to allow your company to be dependent on other people's code, their voices, their data, and their personal motivations that can't necessarily be overridden by money. Let me go through each of these dependencies:

  • Open Source. While much of Web 1.0 was built using Linux, Apache, Sendmail, and languages such as Perl and PHP, the philosophy of Open Source didn't become pervasive until the turn of the century. There are now Open Source components throughout a typical Web 2.0 application. For example, collective voting has applications in many areas beyond the traditional uses in sites like Digg.com or Reddit.com, and is now available through the Pligg software, which is Open Source. Other common Open Source components are found in blogging tools and wikis. Companies also have to consider the desire of their programmers to release their work for the company as Open Source. While this has obvious implications for intellectual property, it also creates a labor force of more productive programmers, because they can bring portions of their code with them when they change jobs.

    Jack was understandably concerned about quality control when using code that isn't delivered and supported by a commercial vendor, but the benefits of a larger and more open community of users can deliver a more robust solution than one used by a few hundred or even thousands of commercial customers. Building with Open Source code also means faster development cycles, so instead of working for years and trying to deliver a perfectly specified and tested system, a more incremental approach based on existing components allows you to work towards a solution in an evolutionary fashion. The reality is that a project that takes several years to reach "perfection" has so much invested in it that it may be impossible to stop and rebuild when problems are discovered, so they are just built over with ever increasing layers of patches. In the long run, a CTO using Open Source code does have to reject the traditional Not Invented Here syndrome, and accept a greater dependence on other people's code. The trade-off in shorter development cycles is worth it in my opinion.
  • Blogs. Web 2.0 also brings about a shift in the way a company's technology efforts are communicated to the outside world. Instead of thinking in terms of versions that are announced at long intervals through a traditional PR campaign, the use of corporate blogs helps customers stay much closer to the development process. This also means a cluster of independent bloggers interested in an area of technology can form around the companies working in this space. These tech bloggers have replaced the traditional trade press. It means that a CTO is dependent on voices that are not as tightly controlled as in the past, but these bloggers can also act as an important buffer when problems arise by explaining to the wider circle of users that the company is indeed working on solutions.
  • XML. The most common form of XML currently in use is RSS, but OPML is on the rise, and Atom and RDF-based standards, such as RSS 1.0, are also gaining ground. In the long run, some form of global database resembling the Semantic Web will materialize. The key to all of this use of XML is the availability of a company's data outside the corporate database. While much is made of the emergence of APIs, it is the XML data that is available from these APIs that will cause the real changes in technological architectures. Just as Web 1.0 was built on loosely joined websites connected through HTTP and HTML, Web 2.0 will be built on loosely joined data structures based on data produced by many sources. So instead of a CTO building an application on a tightly controlled proprietary database schema, it will be necessary to plan for dependencies on data over which there is no control.

    As a long-time database guy, Jack found that disturbing. I share his concern, but what must be understood is that users will demand this type of cross application sharing of data, because it is their data that is being combined from multiple sources. Sure, there is a greater possibility of failure, and this must be handled by a CTO to allow for soft failures, instead of hard crashes. The one great fallacy that the XML proponents adhere to is the perfectibility of XML data. Their motivation in building a Semantic Web is the goal of a Web that isn't filled with invalid data. I don't think that will ever happen, so a CTO should plan for badly formed XML, as is already the case in the RSS world.
  • Fear of excessive valuation. The traditional way to motivate developers, especially in a start-up situation, has been to offer them stock options. While that is still useful, the arithmetic has changed, because programmers who went through the Dotbomb have a deep fear of hype. A business journalist who was a former Dotcom employee recently told me that she still suffered from post traumatic stress disorder that prevented her from considering a start-up job. In the Web 1.0 period, there was an expectation of an IPO that would yield valuations in the hundreds of millions of dollars. If a Web 2.0 company gets acquired for $10 - $20 million, that may be great for the founders, but it doesn't do much for a coder with a few thousand options. It is not just that the value of software companies has dropped. There is now deep suspicion of any claims of higher valuations in the future. Without the promise of getting rich, it is harder to persuade developers to put in the 18-20 hour days that helped build Web 1.0. This means that the CTO is more dependent on an employee's personal motivations, such as being able to build code that can earn them greater fame in the Open Source world.
Notice that I haven't mentioned any of the popular themes of Web 2.0, such as social bookmarking and tagging. These have their place, but I'm skeptical that there really will be a mass market for meta-meta-bookmarking sites. I don't think that the real contribution of Web 2.0 will be these specific areas of functionality. I do believe, however, that the tools and techniques I have described here will be used to build the next generation of products and sites, and that these will be what are used by the generation of users who are entering college now, and will be entering the workforce 4 to 5 years from now.
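The XML point above, planning for badly formed feeds with soft failures instead of hard crashes, can be sketched in a few lines of Python. This is one possible approach under stated assumptions, not a recommendation of a particular library: the function name extract_links is made up, and the regular expression fallback is a deliberately crude salvage step that only understands plain link elements.

```python
import re
import xml.etree.ElementTree as ET

# Crude fallback pattern: only matches plain <link>url</link> elements.
LINK_PATTERN = re.compile(r"<link>\s*([^<\s]+)\s*</link>")


def extract_links(feed_text):
    """Return (links, well_formed) without raising on broken XML."""
    try:
        root = ET.fromstring(feed_text)
    except ET.ParseError:
        # Soft failure: salvage what a pattern match can find and
        # tell the caller that the data is suspect.
        return LINK_PATTERN.findall(feed_text), False
    links = [el.text.strip() for el in root.iter("link")
             if el.text and el.text.strip()]
    return links, True
```

The second return value matters as much as the first. The caller gets usable data either way, but can log or flag the feeds that had to be salvaged instead of silently trusting them.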

API programming is this week's priority

Posted on Monday, February 13, 2006 at 2:00 PM (permalink)

I let myself get a little distracted with reading lists and blogosphere politics over the last week, but now I have to get to some serious coding to prepare for Mashup Camp next week. That means blogging will be light over here. You can follow my progress on my mashup blog, and I'll post the source for anything I write in Ruby on my Ruby blog. My focus will be on using my Tech Memeorandum XML and OPML files as sources for calling various APIs.

Tech Memeorandum mashup project

Posted on Friday, February 3, 2006 at 12:55 PM (permalink)

I just started what I hope will be an interesting mashup project. I'm going to pull the links to blogs from Tech Memeorandum's home page and mash them up with a bunch of APIs. You can follow the project on my mashup blog. Right now I have a simple parser written that creates an XML file based on the TM homepage. That will be the starting point for this project. In other TM news, if you are as big a fan of the site as I am, you will enjoy this interview with its author, Gabe Rivera.

New subscription icon

Posted on Saturday, December 24, 2005 at 12:45 PM (permalink)

I've adopted the new subscription icon available from Feed Icons for my RSS feed. Getting rid of RSS and XML icons would be a big advance. Other than Dave's obsessive need to preserve his legacy, I can't see who benefits from pushing the internal format's name in people's faces. "Sell the solution, not the technology" should be the guiding principle. When people change their pages to use this new icon they should also drop the 4 or 5 links for the different RSS and Atom formats. Why is that necessary? I've yet to try an aggregator that doesn't support the common variants.

Tags: atom rss winer xml

My Web 2.0 stack

Posted on Wednesday, December 7, 2005 at 2:40 PM (permalink)

I'm not sure when "stack" came to mean a list of languages/technical standards used to build an app, but it is a useful description. It helps convey the logical architecture within a multi-layered development environment. The best example of a useful stack is LAMP (Linux, Apache, MySQL, Perl or Python or PHP), which summed up what most of us used to build Web 1.0. I've spent the last few months reading and skimming as many new technology books as possible, and I've narrowed down the list of things I need to become proficient in to understand how Web 2.0 works. What I still need is a catchy acronym. Here's the list:

  • XHTML. This is basically HTML with some really prissy rules, like case sensitivity, and needing to close all tags. There are said to be tools that will make this conversion for you, but I haven't tried any.
  • CSS. Once you understand the basic rules, CSS is a fun way to design a site, especially if you start with a pre-written stylesheet, so you can just change things like colors and spacing.
  • XML. While XML itself can be understood in minutes, the many, many ancillary standards and protocols make it tough to find a real-world entry point. I've found RSS programming to be a good starting place.
  • Ruby. I've been programming with Ruby for a month, and I'm getting to like it more and more. I think it may have the same level of ease and productivity that made the dBASE language so popular in its time.
  • SQL. Yes, it's still here, and it's still the same, which is the problem. The issue will be fitting the object-oriented data structures of XML into the tables of SQL. The consultants will be paying their mortgages on this one for years.
  • Javascript. I could say Ajax instead to assure a higher rating on the Web 2.0 Validator, but Ajax really means Javascript that maintains contact with a server without reloading a page.
Frankly, it's not as much as I expected when I started researching Web 2.0 this summer. The good part is that it all fits together easily, and none of the parts are particularly challenging. That's when I am most productive. By the time a language gets as richly and complexly supported as Java, for example, I get bored and confused and move on.

Time to do some reading

Posted on Tuesday, November 22, 2005 at 2:17 PM (permalink)

I've gotten way ahead of what I really know about. Before I start building an API based on XML and compatible with RSS and Atom, I better spend some time reading about all of these protocols. Besides, it's a rainy November afternoon in Boston.

Tagging is now working

Posted on Tuesday, November 22, 2005 at 2:09 PM (permalink)

I finished the coding for tags on this and the Ruby site. I even have a simple tag cloud in the navbar. These tags are still only entered by me, but I'll have user tags eventually. I keep coming back to Joshua Schachter's comment that tags are about memory more than categorization. I'm trying to lose that rigid relational database kind of thinking. Once I have a full Ruby based version of this site I'll be able to tie into other tag based sites. For now these pages are still static html that is recreated and uploaded to this server every time I make a new post. I'll watch the stats and see if anyone actually uses the tag pages.
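For anyone curious how a simple tag cloud works, here is a minimal sketch in Python. It is not the code that generates this site. It assumes posts are held as a mapping from a post id to a list of tags, and the function name tag_cloud and the font size range are made up for illustration.

```python
from collections import Counter


def tag_cloud(posts, min_size=10, max_size=24):
    """Map each tag to a font size scaled by how often it is used."""
    counts = Counter(tag for tags in posts.values() for tag in tags)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    spread = high - low
    cloud = {}
    for tag, count in sorted(counts.items()):
        if spread == 0:
            size = min_size  # Every tag is equally common.
        else:
            # Linear scale between the least and most used tag.
            size = min_size + (count - low) * (max_size - min_size) // spread
        cloud[tag] = size
    return cloud
```

Because the pages are static, something like this only has to run when a post is published, and the resulting sizes can be written straight into the navbar markup.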

The other shoe drops

Posted on Monday, November 21, 2005 at 8:44 PM (permalink)

Scoble has reported that Microsoft is releasing the Office XML spec. Exactly when and how this will be supported in Office products isn't clear, but the direction is great. It is always best to cannibalize yourself instead of letting others cannibalize you.

Seeing a website as an RSS feed

Posted on Sunday, November 20, 2005 at 8:33 PM (permalink)

I've been thinking about rebuilding the architecture and some of the design of this site to adapt to tags and XML. I'm starting to see the site as a large feed reader for my own content. The intriguing part is that if I rebuild this site to work directly off of my RSS feed then it will work on anyone's feed. The site becomes simply a database app for a standard type of data. I've always thought of websites as the result of database programs, but the more I grok RSS as a delivery and storage mechanism the more opportunities I see for working with it as the core architectural structure rather than an export or import protocol. Hopefully these ideas will become more clear as I build the next iteration of this site.

Googlebase Criticisms

Posted on Sunday, November 20, 2005 at 8:05 PM (permalink)

Sam Ruby is doing a thorough review of the Googlebase data formats and he isn't happy about their feeds:

None of the complex types are valid RDF/XML, and therefore can't be used in RSS 1.0 --also personals and news are incomplete. None of the guids in the RSS 2.0 feeds are valid permalinks. ... People who propose extensions should try to validate them first.

The urge to scale

Posted on Saturday, November 19, 2005 at 8:26 AM (permalink)

I guess being a dot-com CTO is in my blood. I like to think through various architectures for managing groups of websites. You need to lock down a model for scaling early or you face big problems if you ever need to handle large amounts of traffic. The real key is a logical architecture for domain names. For example, if I thought I was going to serve a lot of podcasts, I would create something like data.darwinianweb.com or podcasts.darwinianweb.com. That would allow me to move that part of my content where it could be best and most cheaply served.

Right now I have darwinianweb.com to handle this main blog where I plan on covering general issues on the changing form of the Internet. I also have ruby.darwinianweb.com, which is a blog that allows me to go into as much depth as I want about learning the Ruby programming language.

I don't want to have too many subdomains. Categorization can be handled more easily and on a larger scale with tags, which I am working on adding. At the same time, a separate domain creates more of a distinct place or channel of thought for the user. People automatically switch contexts when they change to a new site, just like a new TV channel.

I plan on having only a few more content subdomains, such as ajax.darwinianweb.com, and xml.darwinianweb.com. Programming languages or standards like XML are so broad and have so many supplementary tools and resources that they work better in their own site or subsite.

I'll also be creating separate domains for exchanging data with other servers. I don't know what will happen with my API experiment, or if that will become a target for abuse, so I'll also create api.darwinianweb.com to serve API calls. It isn't a matter of large amounts of traffic. I want to be able to shut down the API server easily. Of course, that brings up the issue of dependency on critical servers in a distributed environment called for by Web 2.0.

One solution, which also comes easily in an XML/RSS based communication model, is to cache the most recent messages as text files, so the most recent result of an API call can be reused instead of calling the API again.
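Here is a rough sketch of that caching idea in Python, under a few assumptions: the API result is text (such as an RSS document), each call has its own cache file, and a stale copy is better than nothing when the API server is down. The function name cached_call, its parameters, and the 15 minute default are all invented for illustration.

```python
import os
import time


def cached_call(cache_path, fetch, max_age_seconds=900):
    """Return cached text when it is fresh, else call fetch() and store it."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path, encoding="utf-8") as handle:
                return handle.read()
    except OSError:
        pass  # No cache file yet, so fall through to the real call.
    try:
        text = fetch()
    except Exception:
        # The API is unreachable: reuse the stale copy if there is one.
        if os.path.exists(cache_path):
            with open(cache_path, encoding="utf-8") as handle:
                return handle.read()
        raise
    with open(cache_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text
```

The fetch argument is any function that returns the response text, so the same wrapper works whatever HTTP code sits behind it. Falling back to the stale file is what breaks the dependency chain: if the API server is shut down, pages that rely on it keep rendering with the last known result.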

These issues will be played out on a much larger scale throughout the web. Chains of API dependencies will play interesting roles in the future.

Architecture for tags

Posted on Thursday, November 17, 2005 at 9:31 AM (permalink)

I've been thinking about adding tags to this site, which stimulated some thinking about site architecture. I wrote my own blogging code to manage this site, so I can have maximum flexibility in areas like this. I've decided to walk the walk by building out this site with Web 2.0 architectures. That means I'm going to create my own API that returns XML as either RSS or OPML, and then have other parts of the site deliver page content based on this API. I'll then use that functionality to build a tag viewing interface similar to Delicious for my own posts here.

It sounds like overkill, but look at it this way. The content of this site is in a MySQL database on a server, which may not always be on the same physical machine as the site's Apache web server. As long as I have to adopt a client-server architecture, I can just as easily go around the outside through API calls over HTTP. It may be slower than making database calls directly to MySQL, but it will be a relative issue. If the performance slows down, I can just speed up the hardware or get someone to optimize the code. It is a totally scalable architecture. Of course, I won't try and deliver the entire site this way. The vast majority of the content is generated as a static html file. Just the controlling bits, and results of searches have to pass through the API/XML processing.
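To give a feel for an API that returns the same posts as either RSS or OPML, here is a hedged sketch in Python rather than the Ruby planned for this site. It only covers the serialization step, not the HTTP or MySQL parts. The function name posts_as_xml, the dict layout for a post, and the example.com defaults are assumptions, and the OPML branch assumes outline elements of type "link" with a url attribute.

```python
import xml.etree.ElementTree as ET


def posts_as_xml(posts, fmt="rss", site_title="Example blog",
                 site_link="http://example.com/"):
    """Serialize post dicts (title, link) as an RSS 2.0 or OPML document."""
    if fmt == "rss":
        root = ET.Element("rss", version="2.0")
        channel = ET.SubElement(root, "channel")
        # RSS 2.0 channels need a title, link, and description.
        ET.SubElement(channel, "title").text = site_title
        ET.SubElement(channel, "link").text = site_link
        ET.SubElement(channel, "description").text = site_title
        for post in posts:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = post["title"]
            ET.SubElement(item, "link").text = post["link"]
    elif fmt == "opml":
        root = ET.Element("opml", version="2.0")
        head = ET.SubElement(root, "head")
        ET.SubElement(head, "title").text = site_title
        body = ET.SubElement(root, "body")
        for post in posts:
            # One outline node per post, carried entirely in attributes.
            ET.SubElement(body, "outline", text=post["title"],
                          type="link", url=post["link"])
    else:
        raise ValueError("unknown format: " + fmt)
    return ET.tostring(root, encoding="unicode")
```

A tag page could then be built by selecting the posts for one tag, asking for fmt="rss", and rendering the result, which is the same path an outside client of the API would take.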

I'll write about the coding details on the Ruby site and post here when I have something you can try out.

My first Amazon API program

Posted on Sunday, November 13, 2005 at 8:01 PM (permalink)

I now have a very simple program that queries the Amazon API for books on Ruby programming and displays the results as a list of titles linked to their product pages. The most interesting part wasn't the coding, which is pretty simple, but the incredible depth of Amazon's API. They make it possible to build a complete e-commerce site built on their engine. I now understand why Jeff Bezos was quoted on a financial program as saying that Amazon may eventually become an e-commerce systems provider instead of a retailer. You can learn a lot about a company's plans by studying the functionality they surface in their API. Even if I don't end up building a real product with anyone's API, I will get a better understanding of their strategy.

Book Note: Ajax in Action

Posted on Wednesday, November 2, 2005 at 7:20 PM (permalink)

Ajax is so bleeding edge that I am reading a book on it from the future. I thought the rule in publishing was that you could use the next year for the copyright if it was printed in late November, but this book arrived November 2 with a copyright of 2006. After yesterday's Microsoft Live demonstration, it is clear that all forces are converging on client side programming with Javascript and DHTML. There have been many blog posts accusing Microsoft of being a follower rather than a leader in this area, but the irony is that Microsoft created the XMLHttpRequest functionality that is the heart of Ajax. Adam Bosworth has an interesting post on this history.

The idea of simulating a desktop app in a browser using DHTML and Javascript goes back further than the XMLHttpRequest. I designed just such a product called GifWorks in 1998. The goal was to create Photoshop in a web browser. It wasn't completely client-side though. The interface runs in the browser, but the image processing is done on a server.

I'm not sure what I will do first with Ajax, but the most likely candidate is some type of mashup with Google maps.