Tag Archives: data

The Funky Data Model (Again, Again!)

Just as I was thinking MongoDB was the bee’s knees, this pops up…

pickled-object-database – Google Code

… which, despite being a clever fudge in which objects are pickled (aren’t we all) into a SQLite database, does have the advantage of being a SQLite database. Which means that you can run it on a shared server and use existing tools to browse the data you create.
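The gist, as I understand it, is something like this minimal sketch, using nothing but pickle and sqlite3 from the standard library (the function names are mine, not the project’s actual API):

import pickle
import sqlite3

# A sketch of the idea: pickle objects into ordinary SQLite rows.
# These names are my invention, not the pickled-object-database API.
conn = sqlite3.connect('objects.db')
conn.execute('CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY, blob BLOB)')

def store(key, obj):
    conn.execute('INSERT OR REPLACE INTO objects VALUES (?, ?)',
                 (key, sqlite3.Binary(pickle.dumps(obj))))
    conn.commit()

def fetch(key):
    row = conn.execute('SELECT blob FROM objects WHERE key = ?', (key,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None

store('recipe:1', {'title': 'Eggs on Toast'})
print(fetch('recipe:1')['title'])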

This definitely wins the have-your-cake-and-eat-it category for storing data this week.

The Holy Grail of the Funky Data Model

I’ve blogged about the “Funky Data Model” before…

When I was working with some very clever Oracle database dudes I came up with what I thought was a great idea: rather than having database tables for users, pages, blog posts and so on, we could have just two tables called “things” and “types of things”.

The Oracle guys rolled their eyes and said everyone goes through this stage of thinking sometime and they even had a name for it… “the funky data model”. They told me that The Funky Data Model, although cool, never, ever works. They were probably right.
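For the record, this is roughly the shape I had in mind. A hedged sketch in SQLite (the table and column names are my guesses, not a design anyone should ship):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE types_of_things (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE things (id INTEGER PRIMARY KEY,
                         type_id INTEGER REFERENCES types_of_things(id),
                         attributes TEXT);  -- everything else crammed in here
''')
conn.execute("INSERT INTO types_of_things (name) VALUES ('blog post')")
conn.execute("""INSERT INTO things (type_id, attributes)
                VALUES (1, '{"title": "Hello World"}')""")

Every query ends up joining through types_of_things and picking apart that attributes blob, which is presumably why the eyes rolled.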

But there’s something that keeps pulling me back to The Funky Data Model, simply because working with relational databases is a bit of a pain. Relational databases really are wonderful things: you put your data in, and if you’ve put it in properly, you can get your data out… quickly.

It would seem that saving data would be an easy thing for a computer to do, wouldn’t it? It’s not. Well, actually, saving data is an easy thing to do, but if you ever want to get it back before next Tuesday then you are probably going to find that a relational database or two will be in the mix somewhere.

The problem with relational databases is that they are pretty “fixed”. For example, if you want to add a “recipe” that has a title, then all recipes have to have a title, and a recipe can’t have two titles. You pretty much have to decide what you are going to put in your database before you put it in. Life (and data) often isn’t like that… it’s messier.

So last week, I thought I’d explore what are called schema-less databases for a day and see how far I got. These databases are less “fixed” than relational ones, closer to my idea of what the funky data model is all about, and probably don’t work, but I wanted to see if I could take one small step towards the funky data model for the hell of it.

My “plan” was to stick a load of data in and then see how easy (and quick) it was to get out using Python (my programming language of choice).

I gave dbxml a wide berth simply because the means of getting your data out, called XQuery, seems to require brains I don’t have.

In the morning I tried working with CouchDB. I used MacPorts to install all the dependencies and started adding data. Things were going swimmingly until I discovered that you get your data out by writing snippets of JavaScript. My JavaScript isn’t great, but then JavaScript itself isn’t great. The whole JavaScript-iness of it put me off.

In the afternoon, slightly disappointed with CouchDB, Andy suggested I give the ZODB a whirl. The ZODB is what underpins both Zope and Plone. It’s an object database with a long track record and, as you know, it’s very Python-friendly. After working with a database for a while, I struggled to get the “server part”, called ZEO, running on Mac OS X. And because I base lots of technological decisions on gut feelings and tea leaves, and because I give up quickly, I gave up on the ZODB quickly and thought the “there has to be a better way” thought beloved of many a funky data model hunter.

In the evening, thinking that relational databases weren’t so bad after all, I gave a last-ditch trial run to MongoDB. Installation was easy enough and the documentation is very pretty (one of the most important things for geek stuff).

MongoDB is a database that stores its data in dictionaries. A dictionary is complex enough to store anything; for example, a recipe might look like this…

recipe = {'title': 'Eggs on Toast',
          'ingredients': ['eggs', 'butter', 'toast'],
          'yumminess': 7.5}

… but interestingly, because Mongo is “schema-less”, I could add extra data to my recipe, an image or “preparation instructions”, without requiring all my other recipes to have an image.
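So one recipe can happily carry a field that the others don’t, with no schema change anywhere:

another = {'title': 'Beans on Toast',
           'ingredients': ['beans', 'toast'],
           'preparation': 'Warm the beans. Toast the toast.'}  # an extra field, no schema change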

Even more interestingly, you can create “collections” simply by adding something to a collection. This too is unlike the relational model, where you have to make the container BEFORE you put things in it, and you have to decide exactly WHAT A THING IS before you can put it into your container. With MongoDB you can simply type…

db.recipes.save({'title': 'Bacon Sandwich'})

… and not only have you added a recipe, you’ve created the collection called “recipes”. This is staggeringly, “fall off the chair” close to what I think of as a Funky Data Model.

Now, I know what you’re thinking. You are thinking “Any fool can put data INTO a database, it’s getting it OUT that counts”. And you, as ever, are totally right. How do you do it?

I was pleasantly surprised to find that you get data out of MongoDB by creating a query that is itself a dictionary. So, using my example above, to write a query that gets my bacon sandwich out of the database, I simply…

# find_one lives on the collection, not on the cursor returned by find()
bacon_sandwich = db.recipes.find_one({'title': 'Bacon Sandwich'})
print(bacon_sandwich['title'])

… There! How funky, easy and simple was that? So, I then set about tweaking my engagement engine crawlers to fill up a database to see how it performed in the “getting data out” tests. Within minutes I’d added my database to Django and, although I’d only added a few thousand records, it seemed to be able to get them out very quickly and easily indeed. I’d expected my crawlers, which run multiple threads, to blow MongoDB up, but it seemed to cope fine.

Of course, one of the wonderful things about relational databases is that you can write pretty complex queries that bring your data back quickly (normally). I can see that MongoDB is probably going to struggle when I start looking for recipes that include “bacon” and have a yumminess factor of 8 or above (which is most of them, as it happens).
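For what it’s worth, even that query can still be written as a dictionary, using Mongo’s $gte operator; whether it stays quick once the collection gets big is the real question…

# matches recipes whose ingredients list contains 'bacon'
# and whose yumminess is 8 or above
tasty = db.recipes.find({'ingredients': 'bacon',
                         'yumminess': {'$gte': 8}})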

My big problem now is that, when looked in the eye by the Funky Data Model, I realise that, like a blank piece of paper, a totally fluid database is a very daunting thing. Would you create a collection called “recipes” or one called “sandwiches”? Would you create a collection called “ingredients” or, as in the real world, put the ingredients in the sandwich itself? Or both?

So… here I am, really impressed with MongoDB and realising that my brain is still vaguely stuck in relational mode. It seems that with MongoDB I might have to spend a lot more time thinking about how to get my data out, time that I wouldn’t need to spend with a relational database.

I’m planning to give MongoDB a more thorough test in the next few weeks. I’m both excited by and scared of its capabilities; it really might be the funky data model that has eluded me for so long, and I may not have the abilities to deal with it.

I’m also planning to be a bit more careful about what I wish for.

Data Liquidity

Yesterday was a funny day, so funny (as in not even slightly) I need to get it off my chest.

I have a geek problem. I am writing crawlers that go off and collect information about web sites that are somehow linked. A simple example is that, having identified a few hundred competitors, I then set out to try and gather information about them. I call this collection a cloud, not because it’s big but because it’s fluffy and random and difficult to manipulate, like clouds.

My problem is this. Once I’ve gathered a lot of data, I then find some new data that I want to add in. Adding in new data is easy if you don’t want to mess with your existing data, but a spectacular pain if you have to start changing your database (and it contains 2 or 3 GB of stuff).

For a while I was happy with a tool called Django Evolution that lets you change your data model without exporting all your data and bringing it back in again… but it breaks. Django Evolution really doesn’t like it if you want to change a field from TextField(“23.45”) into FloatField(23.45). And it really doesn’t like it if you want to make big changes; evolution, in this case, only works in very, very small steps.
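So you end up doing the type change by hand, in baby steps: add the new float field as its own tiny evolution, copy the values across, then drop the old text field. A sketch (the app, model and field names are all made up):

# Hypothetical model: a 'score_text' TextField that should have been a FloatField.
# Step 1 (a small evolution that works): add a nullable 'score_float' FloatField.
# Step 2: copy the values across by hand...
from myapp.models import Competitor  # made-up app and model

for competitor in Competitor.objects.all():
    try:
        competitor.score_float = float(competitor.score_text)
    except (TypeError, ValueError):
        competitor.score_float = None  # bad values stay visible as NULL
    competitor.save()

# Step 3 (another small evolution): drop 'score_text'.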

My other problem is this. At times I want to add in an arbitrary lump of data, for example a spreadsheet. I don’t want to have to manually create a table for it (with over a hundred columns) either; I just want to add it and see if it adds any value to my cloud. You’d think there’d be a (Python) tool for importing a spreadsheet into MySQL, wouldn’t you? MySQL’s LOAD DATA INFILE needs you to define all the columns first. At this point I can vaguely remember that on Windows you could set things up so any file/database/whatever was a data source, which right now sounds like a good idea.
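The sort of thing I mean is a throwaway loader that makes every column TEXT, so you don’t have to decide what anything is up front. A hedged sketch, assuming the MySQLdb driver and a CSV export of the spreadsheet (and doing no sanitising of column names at all):

import csv
import MySQLdb  # assumes the MySQLdb driver is installed

def slurp_csv(path, table):
    conn = MySQLdb.connect(user='me', passwd='secret', db='cloud')
    cur = conn.cursor()
    with open(path) as f:
        reader = csv.reader(f)
        headers = next(reader)
        # Every column is TEXT: decide what the data *is* later, not now.
        cur.execute('CREATE TABLE `%s` (%s)' % (
            table, ', '.join('`%s` TEXT' % h for h in headers)))
        slots = ', '.join(['%s'] * len(headers))
        for row in reader:
            cur.execute('INSERT INTO `%s` VALUES (%s)' % (table, slots), row)
    conn.commit()

slurp_csv('competitors.csv', 'competitors')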

Then I found Picalo, a tool for Data Analysis and Fraud Detection. Yes, Fraud Detection! It hasn’t spotted me yet though. It’s a lovely tool (written in Python) that “in theory” lets you wire together MySQL databases and spreadsheets. Then you can create tables based on queries, or on Python code so complex I really don’t understand it at all. It’s like proper maths and statistics… as in scary. The only problem with it is that it doesn’t want to play with my databases… so tantalisingly close to a solution, but not quite.

Picalo may still have legs though; I’ve not written it off just yet. It reminds me of a tool I used to use (I forget its name) for statistical analysis of web site log files. A practice which back then taught me, because the tool crashed so often, that I’d simply “gathered too much data” to be able to do anything useful with it. Mm?! Maybe I should pay attention to that memory.

My real problem is this. Relational databases are a pain. They aren’t really suited to what I’m trying to achieve. At best they’re slow to work with. At worst, I spend more time making the database work than making the data work… if you know what I mean.

In desperation I tried adding the ZODB (Zope’s object database) to Django. Working with the ZODB brought back fond memories of when Zope was simple enough for me to use. I love the idea that you just make a Python class persistent and it “just gets stored”. I’m very tempted. Thinking about it, I should maybe use MySQL for the main part of my data and add an object layer of meaning on top of it… which is sort of what I did with Spinalot back in the day. Surely there has to be a better way. One of the things I “like” about MySQL is that, when used with Django, it’s very easy to create interfaces to explore my data, interfaces that can be adapted and tweaked on-the-fly.
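That “just gets stored” idea really is about as simple as it sounds. A minimal sketch, assuming a local FileStorage file rather than a ZEO server (the Site class is made up):

import persistent
import transaction
from ZODB import FileStorage, DB

class Site(persistent.Persistent):  # inherit from Persistent and you're done
    def __init__(self, url):
        self.url = url

storage = FileStorage.FileStorage('Data.fs')  # a local file, no ZEO server needed
db = DB(storage)
root = db.open().root()

root['a_site'] = Site('http://example.com')
transaction.commit()  # ...and it "just gets stored"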

I guess that Doug and Andy would tell me to have a go with DBXML like they have… but here my problems are these… firstly, it looks difficult, and secondly, until I get all my data in, I don’t know if I’ll be able to get it in, or get it out again.

There. A load of problems with no real answers… just what you wanted eh? Whinge over… for now.

I guess my ultimate problem, and maybe the answer to all my problems, is this… I need geeky collaborators who are much better at all this than me. So if all of the above sounds understandable and easily fixable, do get in touch.

If Data Is Money You’re Probably Burning It!


It really does seem to me that data is where all the Web2.0 action is.

Because there has been so much talk about mashups, it’s easy to be tired of hearing about them. We are all a bit Web2.0-weary already. Once you’ve seen one Google Map it’s hard to get excited about the next.

I’ve been refining my data-mining tools over the weekend, adding in some semantic possibilities (trying to understand what the data is about) and visualisation (trying to communicate what large amounts of data actually tell you). These two technologies, from my perspective, have really matured over the last 5 years, to the point where idiots like me can use them.

Being data-driven is a key element for any online business these days. Even for the smaller business, there’s real value hiding in its data. Or to put it another way: there’s money in data, money in data about data, money in data about data about data, and most companies are squandering their opportunity, missing the point.

The Four Types Of Data You Are Probably Burning

Of course, your company isn’t burning the data in a destructive way, but it is lying there, unwashed and unwanted. As a creative exercise, all you have to do is pick two of the types of data from the list below and “breed” them. I guarantee you will have a fantastic idea for your company.

1. Data That’s Out There (Wild, Free-Range Data)

Out there on the internet, data is just sitting there. You can take that data and do something with it; Google did exactly that, and they haven’t done so badly. Of course you might have to find two or three interesting sources of data and be creative with them, but it IS possible.
And lurking in all the data that’s out there are things that people want to find and can’t. Whether you think Google is a good search engine or not, if you examine what your customers really want, the chances are you can help them achieve it.

What are people saying about your company? Reputation-based search engines are starting to appear but do they work for your company?

The online landscape is rich with wild and free data; all you have to do is first find it, and then do something interesting with it. “Interesting” doesn’t even have to be difficult to be interesting.

2. Your Internal Data (Captive Data)

Most companies are sitting on databases of products, people, places, messages and information flow. If you can find a way to blend your captive data with the wild stuff then you have a good chance of producing a healthy hybrid.

A trivial example might be to use data-mined data alongside customer lists. I can already imagine a “discovered” Flickr feed for a CRM system. I said the example was trivial; I meant scary.

Another tactic is to experiment with letting your captive data out into the wild with RSS feeds and seeing what happens.

3. Your Customers’ Data (Very Sensitive Data)

Calm down. I don’t mean customers’ data per se, I mean data about data. I still find it fascinating that I would be loath to use a bookshop other than Amazon because they’ve not only got my address, but they’ve got the addresses of people I send stuff to. I can’t imagine how anyone could ever replace Last.fm because my listening chart has taken years to build up.

Your sites’ log files, in fact all the stuff you know about your customers, is data that becomes more valuable than the original transaction. It’s the data, my usage of the system (perhaps aggregated with other customers’), that keeps me coming back.

4. Your Competitors’ Data (Almost Secret Data)

I say competitor, but by that I mean any other organisation that is letting its data run free. There are opportunities to mix a customer’s profile data with some of your data.

The first thing that comes to mind is a terrible example… What if a call centre played me songs I like rather than ones I don’t? I said it was terrible, but it’s doable right now. A company has my phone number, which is tied to my email address, which is enough to find my Last.fm account and fish out a Nine Inch Nails track. If I’m phoning to complain, though, it would probably be better for them to look for songs tagged “relaxing” on Delicious and send me one of those whilst my call is being valued.

My point is that many companies are throwing away their crown jewels, their usage data. Huge opportunities exist for companies willing to blur their data boundaries: finding genuinely useful data where there once was none, combining external free-range data with internal data, and letting internal data let its hair down a bit.

This data revolution isn’t just about Google Maps.