I’m now starting workshops on some topics with this small group exercise:
You’re the CEO of a startup that’s going to provide house price information that’s more accurate than anything else out there. It’s going to use data analysis, which you’ve heard a lot about recently; it seems that every exciting startup is based on some kind of number-crunching. You’re the CEO, so you don’t need or want to know about the techy details. You just need the idea and then to recruit clever people to do it. And the idea is this: your software looks at Google StreetView, and because it’s previously scanned millions of Google StreetView photos and matched them to prices of house sales, it knows its stuff. It could potentially detect changes in a ‘hood before the estate agents do (hell, even before the residents do!): when it gets hip, when it goes down the pan. It could learn from looking at the houses, the gardens, the cars, the shops, the state of the road, the people walking by, their clothes and what they have with them. It’s going to change everything and you’re going to make lots of money. Of course, you haven’t actually made this software yet, but that’s just a matter of time.
Now, you need to pin down your analysis more in order to recruit the right people. This is a problem a lot of managers face: in order not to know about the tech stuff, they need to know about the tech stuff, just enough.
And the more we think about it, the more problems we find. Pinning down exactly what the computer must look for is not simple. One attitude says that we should run some massive neural network on the images and let a thousand flowers bloom, but there is no guarantee that will be up and running in time for Launch Day, and it sounds like it will require very expensive data people and a lot of hardware. What can you do in the meantime? Maybe you could get started by hand-coding images from a smaller set of locations into categories and characteristics.
Do you go back in time on StreetView or try to cover as many locations as possible? How do you guard against the echo chamber of algorithmic bias where it just reproduces human prejudices? Is it OK for the software to start setting house prices based on anything? What about the skin tone of pedestrians?
After expanding the questions and undermining any firmly-held beliefs in the right way forward, I bring the discussion back to face up to uncertainty in the actual objective. Is this about getting previous sales prices right, and then the future will follow? Is it about finding properties that buck the trend and are a bargain? Is it about finding changes in house price? These all call for different datasets, and different targets. Who is the audience, and how will they judge its success? How might this be used in ways we don’t anticipate or desire?
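To make "different datasets, and different targets" concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `sales` records, the neighbourhood names and the three function names are my assumptions, not part of any real product or library. It takes one small table of sales and builds the target variable three ways, one for each version of the objective.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sale records: (property_id, neighbourhood, year, price)
sales = [
    ("a", "north", 2015, 200_000),
    ("a", "north", 2020, 260_000),
    ("b", "north", 2020, 180_000),
    ("c", "north", 2020, 300_000),
    ("d", "south", 2020, 150_000),
    ("e", "south", 2020, 170_000),
]


def level_targets(sales):
    """Objective 1: get sale prices right. The target is the price itself."""
    return [price for _, _, _, price in sales]


def bargain_targets(sales):
    """Objective 2: find properties that buck the trend.

    The target is the price minus the median price for the same
    neighbourhood and year (one simple choice of 'trend' among many).
    """
    groups = defaultdict(list)
    for _, hood, year, price in sales:
        groups[(hood, year)].append(price)
    medians = {key: median(prices) for key, prices in groups.items()}
    return [price - medians[(hood, year)] for _, hood, year, price in sales]


def change_targets(sales):
    """Objective 3: find changes in price.

    The target is the relative change between consecutive sales of the
    same property, so properties sold only once contribute nothing.
    """
    by_property = defaultdict(list)
    for pid, _, year, price in sales:
        by_property[pid].append((year, price))
    changes = {}
    for pid, history in by_property.items():
        history.sort()  # order each property's sales by year
        for (y0, p0), (y1, p1) in zip(history, history[1:]):
            changes[(pid, y0, y1)] = (p1 - p0) / p0
    return changes


if __name__ == "__main__":
    print(level_targets(sales))
    print(bargain_targets(sales))
    print(change_targets(sales))
```

The same six sales give six training rows for the first two objectives but only one for the third, because only property "a" sold twice. That is the sort of consequence the group tends to find once they are made to choose an objective.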
From this starting point, we can go on and talk about the realities of statistics, machine learning, building and critiquing models, data ethics and so forth.
I really enjoy dreaming up exercises like this, and I think that small group discussions enliven workshops.
(ahem, don’t forget you can hire me to do a workshop for your team)
Image: chocolate-box, flood-risk des res (c) Google