Language choice

I have a long-term project for a part-time gig, and I’m trying to figure out the best way to implement it. Here’s the deal.

The company I’m doing this for does advertising of various sorts. There’s a section of their site where advertisers can see, among other things, when certain campaigns are running. These are all just basic HTML pages. The higher-ups have expressed an interest in something that looks more interesting and is better organized.

Each site on which advertising happens has a page, and that page has a section devoted to each advertiser.

What I need to figure out is a way to do a one-time parsing of the current HTML files (taking into account the fact that the HTML isn’t always 100% correct) and then store the extracted data in a form that is usable by another application (e.g. an XML file or a database). The parsing is further complicated by the fact that the files are spread across many different directories on the server.
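For illustration, here is a minimal sketch of what such a one-time pass might look like in Python, using only the standard library. It is not the existing parser, and it rests on assumptions the description above doesn't confirm: that each advertiser section starts with an <h2> heading, and that a SQLite file is an acceptable target. The names SectionParser, parse_sections, import_tree and the sections table are hypothetical.

```python
import os
import sqlite3
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collects [heading, body] pairs.

    Assumption (hypothetical): every advertiser section begins with an <h2>.
    html.parser does not raise on unclosed or stray tags, which is why it
    is used here for not-quite-valid HTML.
    """

    def __init__(self):
        super().__init__()
        self.sections = []        # list of [heading_text, body_text]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.sections.append(["", ""])
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return                # text before the first heading is ignored
        index = 0 if self._in_heading else 1
        self.sections[-1][index] += data + " "


def parse_sections(html_text):
    """Return a list of (heading, body) tuples with whitespace collapsed."""
    parser = SectionParser()
    parser.feed(html_text)
    parser.close()
    return [(" ".join(h.split()), " ".join(b.split())) for h, b in parser.sections]


def import_tree(root_dir, db_path):
    """Walk every directory under root_dir and store each section as one row."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sections (path TEXT, advertiser TEXT, body TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps going if a file has bad bytes in it
            with open(path, encoding="utf-8", errors="replace") as f:
                rows = [(path, h, b) for h, b in parse_sections(f.read())]
            conn.executemany("INSERT INTO sections VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()
```

Something like import_tree("/path/to/site", "ads.db") would then be run once. If a heading is missing its closing </h2>, this sketch will swallow the following text into the heading, so the real pages would need checking against that assumption first.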

I wrote something in C to parse the pages into usable XML, but it was pretty clunky (and didn’t always work correctly), and I’m wondering if there’s a better way to implement it. I’m also thinking XML is probably not the most efficient format for this, since I’d like to eventually display the information in Flash.