Darwinian Web
Adam Green's thoughts on the evolution of the Internet

Posts tagged as: php

Moving on to Python

Posted on Friday, April 14, 2006 at 10:20 AM (permalink)

I hesitate to say that I've decided to move to Python from PHP. I got plenty of emails from people telling me that PHP would surely solve my XML programming problems, but it doesn't seem to be able to do that for me. There are lots of great tools written in PHP, so it must be a personal issue. I'll know when I've found the right language, because I'll be able to do XML programming without repeatedly getting random errors. I plan on working my way through "Hacking RSS and Atom" by Les Orchard and "Beginning RSS and Atom Programming" by Danny Ayers and Andrew Watt. Both books use Python for their examples, so that should give me the background I need to see how the pros do it. After Python comes XSLT, which I also get lots of emails about. The funny thing is that I can get everything I need done with each language I try, except for XML parsing. Reading and writing files from the Web, parsing for hyperlinks, etc. are no problem. This blog is generated with my own FoxPro code and it seems to be working fine. I may be rusty when it comes to serious coding, but I still think there is something missing in the area of XML that isn't my fault.

There has to be a better way to do XML programming

Posted on Tuesday, April 11, 2006 at 8:26 AM (permalink)

Two weeks ago I decided to do some programming for analysis of link patterns among bloggers. I had just given up on Ruby out of frustration over its poor XML libraries, so I decided to try out PHP. I haven't used PHP since the late Nineties, but it is a simple enough language, so browsing a few books showed me what I needed to know. I was able to write the code to parse out the links from Tech Memeorandum and then autodiscover the RSS feeds on these pages without much trouble at all. I'm not a great coder, but I can pick things up fast, and can generally force my way through most programming issues. Then I ran into the XML libraries in PHP and came to a dead halt again. I need to read through the RSS feeds of each blog I find on Tech Memeorandum to find the links to other blogs, and that means parsing the XML of these feeds. I've been beating on this problem on and off for the past week and a half, and am about to give up again. Giving up on a programming problem is not something I do lightly. The whole point of being a programmer is never letting the machine beat you. I also have enough confidence to think that if I'm having so many problems, lots of other people are dealing with the same thing.

What I've decided to do in response is work with one of my favorite programmers from my Andover.net days to try and build a better language solution for XML processing, with an emphasis on RSS and OPML. A few weeks ago John Casey emailed me after I posted my frustration with Ruby, and asked why I don't just write my own language for this type of work. We've been talking about this ever since, and now I'm ready to go ahead. I'm not capable of writing my own XML parser, at least not one that isn't a horrible hack, but I do know a lot about language design, especially about making programming languages easy to use. John, however, is a great coder, and if he thinks he can write a clean, fast parser, I believe him.

The idea at first will be to create a library of functions that are real smart about RSS and OPML. We're not sure what language this will be working with, but since the library will be written in C, it should be possible to add it to all of the standard Web languages, like Perl, Python, PHP, etc. I'm interested in having the library handle all the standard tasks you would need when working with RSS and OPML, so it should be possible to read multiple feeds and combine them in interesting ways in just a few lines of code. Once this library is built, we can see about possibly extending it into more of a mini-language.
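To make "read multiple feeds and combine them in just a few lines" concrete, here is a minimal Python sketch using only the standard library. It is not the planned library and says nothing about its design. The function names `read_items` and `merge_feeds` are hypothetical, and the sketch assumes plain RSS 2.0 feeds whose items all carry a `pubDate` with an explicit time zone.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def read_items(rss_text):
    """Return (date, title, link) tuples for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        if pub_date is None:
            continue  # assumption: undated items are simply skipped
        items.append((
            parsedate_to_datetime(pub_date),
            item.findtext("title", ""),
            item.findtext("link", ""),
        ))
    return items


def merge_feeds(feed_texts):
    """Combine the items of several feeds into one list, newest first."""
    merged = []
    for rss_text in feed_texts:
        merged.extend(read_items(rss_text))
    return sorted(merged, key=lambda entry: entry[0], reverse=True)
```

Calling `merge_feeds` with the text of two or more feeds gives back a single list of items in reverse chronological order, which is the kind of task a feed-aware library would hide behind one call.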

The working title for this library/language is OPML Script, but that name may change as its functionality expands to more general XML tasks. This will be released under an Open Source license of some type, so it will be available for no charge. John and I will share the ownership of the copyright, although there doesn't seem to be any likelihood of ever making money from it. I've said in the past that I didn't want to get directly involved with any startups for at least a year, but this is something that I need for my own work, so I don't have any choice. If I want something that will let me program in an easy manner, I'm going to have to help build it. We don't have any delivery schedule yet, but we hope to have something we can demonstrate by OPML Camp on May 20th.

Starting work on blog link analysis

Posted on Saturday, March 25, 2006 at 9:27 AM (permalink)

A few weeks ago I proposed an analysis of linking patterns between bloggers to measure the frequency of links based on rank. I ran into a few snags, such as getting fed up with some limitations in Ruby and hitting a limit on the number of API calls Technorati would allow per day. I've now finished reading up on a few Web languages, and I've decided to give PHP a try. I plan on working my way through Perl and Python also over the next few months. I've done some work with each of them, but that was before I got interested in XML. Technorati has also generously agreed to boost my daily allotment, so I should have no problem getting the blog rank data I need.

My basic goal is to determine whether bloggers tend to link mostly to others with a similar rank (Crosslinking), or to those with higher (Uplinking) or lower (Downlinking) rank. To do this I will first extract the set of bloggers listed at one time on Tech Memeorandum and use them as my sample. Let's call them the target list of bloggers, or target bloggers for short. I can use the Technorati API to determine the rank of each target blogger, and then split them into 3 groups, with a rank of less than 1,000 signifying membership in the A-list, between 1,000 and 10,000 being the M-list, and the remainder being the Z-list. This analysis alone will be interesting, and I may start compiling longer term statistics on these results. I'll certainly publish this intermediate result here.
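The three-way split can be sketched in a few lines of Python. The thresholds come from the paragraph above; treating a rank of exactly 1,000 or 10,000 as M-list, and treating an unranked blog (`None`) as Z-list, are assumptions. The names `rank_tier` and `group_by_tier` are hypothetical.

```python
def rank_tier(rank):
    """Map a Technorati-style rank (1 is best) to 'A', 'M', or 'Z'."""
    if rank is None:
        return "Z"  # assumption: unranked blogs fall into the remainder
    if rank < 1000:
        return "A"
    if rank <= 10000:  # assumption: both boundaries count as M-list
        return "M"
    return "Z"


def group_by_tier(ranks):
    """Split a {blog_url: rank} mapping into three lists keyed by tier."""
    groups = {"A": [], "M": [], "Z": []}
    for blog_url, rank in ranks.items():
        groups[rank_tier(rank)].append(blog_url)
    return groups
```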

The tricky part is determining the rank of the blogs that the target bloggers link to. I can autodiscover the target bloggers' RSS feeds and extract the links from their posts, but Technorati doesn't give a rank for just a post URL. It needs to know the URL of a blog's home page. So what I have to do is autodiscover the RSS feeds of blogs the target bloggers link to, and then look in those RSS feeds to find the home page URL, which can then be used to determine the rank on Technorati. At that point I can count the number of uplinks, downlinks, and crosslinks made by each target blogger. This data can then be analysed.
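The two lookups described here, finding a feed from a post page and finding the home page from that feed, might look roughly like this in Python with only the standard library. This is a sketch under assumptions, not the code used for the analysis: it assumes the page advertises its feed with a `<link rel="alternate">` tag, that the feed is RSS 2.0 with a channel-level `<link>`, and the names `FeedLinkFinder`, `autodiscover_feeds`, and `home_page_from_rss` are hypothetical. Fetching the pages themselves is left out.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point at feeds."""

    def __init__(self):
        super().__init__()
        self.feed_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime_type = (attr.get("type") or "").lower()
        if "alternate" in rel and mime_type in FEED_TYPES and attr.get("href"):
            self.feed_hrefs.append(attr["href"])


def autodiscover_feeds(page_url, html_text):
    """Return absolute feed URLs advertised in the HTML of page_url."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return [urljoin(page_url, href) for href in finder.feed_hrefs]


def home_page_from_rss(rss_text):
    """Return the channel-level <link> of an RSS 2.0 feed, or None if absent."""
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        return None
    return channel.findtext("link")
```

The home page URL returned by `home_page_from_rss` is what would then be passed to the rank lookup, since a post URL alone is not enough.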

One systematic error is that A-listers can't uplink, and Z-listers can't downlink. I may just keep the crosslinking results and use them to see how often A-listers, M-listers and Z-listers crosslink as a percentage of their total links.
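Counting crosslinks as a share of each group's total links could be sketched as follows in Python. The tier labels A, M, and Z follow the plan above; the function names and the input shape, a list of (source tier, target tier) pairs, are assumptions made for illustration.

```python
from collections import Counter

# Lower number means higher rank group.
TIER_ORDER = {"A": 0, "M": 1, "Z": 2}


def link_direction(source_tier, target_tier):
    """Classify one link by comparing the tiers of the linking and linked blogs."""
    source, target = TIER_ORDER[source_tier], TIER_ORDER[target_tier]
    if target < source:
        return "uplink"
    if target > source:
        return "downlink"
    return "crosslink"


def crosslink_share(links):
    """Return {tier: percent of that tier's links that stay within the tier}."""
    totals = Counter()
    crosslinks = Counter()
    for source_tier, target_tier in links:
        totals[source_tier] += 1
        if link_direction(source_tier, target_tier) == "crosslink":
            crosslinks[source_tier] += 1
    return {tier: 100.0 * crosslinks[tier] / totals[tier] for tier in totals}
```

Because the percentage is taken over each tier's own total, the A-list and Z-list figures remain comparable even though one of the other two directions is impossible for them.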

If you had a problem following this plan, here is the basic path I need to follow: Parse Tech Memeorandum -> Target Blog URLs -> Autodiscover Target Blogs' RSS Feeds -> Parse Target blog posts -> Links from target blogs to other blog posts -> Autodiscover RSS Feeds of the linked blog posts -> URLs of blogs linked to by target blogs -> Rank of blogs linked to by target blogs.

You can follow the coding on my programming blog where I'll post all of my source code and links to my intermediate data sets. I could store all the data in a MySQL database, but I want to make it publicly accessible, so I'll store it in XML files on the website. I'll report on any useful results here as well as the code blog.

Which programming language will get you a job?

Posted on Sunday, January 15, 2006 at 9:57 PM (permalink)

The Indeed.com job site now allows you to graph the percentage of online want ads that contain specific key words. There are many uses for this, but I thought it would be interesting to compare the relative demand for specific programming skills. It looks like they rank as follows: .Net, Java, Perl, PHP, Python, Ajax, and Ruby. Despite the recent hype, Ajax and Ruby are barely visible compared to the better known languages.

But that isn't the whole story. All of the leading languages show fairly steady levels of demand. Ajax and Ruby, on the other hand, show strong growth over the last year. Want ads for Ruby doubled, and Ajax increased six-fold.

These last two graphs also suggest a potential growth market in Ruby and Ajax training. (Via Steve Rubel)